Wed, 25 Sep 2024 13:52:42 +0100 |
Henry S. Thompson |
works, although output not checked
|
Wed, 25 Sep 2024 13:51:15 +0100 |
Henry S. Thompson |
maybe triggers jdb on tests with -DdebugTest=true on command line
|
Wed, 25 Sep 2024 09:49:12 +0100 |
Henry S. Thompson |
orig, more or less
|
Tue, 24 Sep 2024 17:08:05 +0100 |
Henry S. Thompson |
working, with issues:
|
Tue, 24 Sep 2024 12:34:51 +0100 |
Henry S. Thompson |
compiles with content, but fails with EOF -- need blank lines?
|
Mon, 23 Sep 2024 19:18:36 +0100 |
Henry S. Thompson |
runs, but no cdx yet, because no value.content I presume
|
Mon, 23 Sep 2024 16:35:22 +0100 |
Henry S. Thompson |
add lastmod to cdx lines,
|
Thu, 15 Feb 2024 22:31:43 +0000 |
Henry S. Thompson |
csing-related tweaks
|
Wed, 06 Dec 2023 13:38:58 +0000 |
Henry S. Thompson |
too many overdue updates to break down
|
Fri, 08 Sep 2023 21:44:48 +0100 |
Henry S. Thompson |
use csing, and _runme_c.sh to get it initialised
|
Fri, 08 Sep 2023 21:42:55 +0100 |
Henry S. Thompson |
MANPATH (?)
|
Fri, 08 Sep 2023 21:42:12 +0100 |
Henry S. Thompson |
tab completion fix
|
Fri, 21 Jul 2023 11:38:20 +0100 |
Henry S. Thompson |
add support for multiple calls to srun with a counter
|
Wed, 05 Jul 2023 15:08:59 +0100 |
Henry S. Thompson |
add private work bin dir to PATH
|
Wed, 05 Jul 2023 15:07:51 +0100 |
Henry S. Thompson |
tweak UI: copy/paste and title bar
|
Wed, 05 Jul 2023 15:02:53 +0100 |
Henry S. Thompson |
ec184 now, run w. unbuffered output
|
Wed, 05 Jul 2023 14:52:00 +0100 |
Henry S. Thompson |
moved to work tree
|
Wed, 05 Jul 2023 14:50:00 +0100 |
Henry S. Thompson |
working, about to move to work tree
|
Mon, 03 Jul 2023 18:16:14 +0100 |
Henry S. Thompson |
working on implementing types and parts:
|
Tue, 10 Jan 2023 17:48:26 +0000 |
Henry S. Thompson |
change account back
|
Thu, 28 Jul 2022 17:25:09 +0100 |
Henry S. Thompson |
x
|
Thu, 28 Jul 2022 17:24:29 +0100 |
Henry S. Thompson |
generalised sbatch front-end to cdx2tsv.py
|
Thu, 28 Jul 2022 15:33:21 +0100 |
Henry S. Thompson |
x
|
Wed, 20 Jul 2022 19:48:11 +0100 |
Henry S. Thompson |
add $W
|
Wed, 20 Jul 2022 19:47:21 +0100 |
Henry S. Thompson |
new-style log notice
|
Wed, 20 Jul 2022 19:46:51 +0100 |
Henry S. Thompson |
x
|
Mon, 18 Jul 2022 19:16:20 +0100 |
Henry S. Thompson |
new style batch jobs, see cirrus_work repo for _xxx.sh
|
Mon, 18 Jul 2022 19:15:20 +0100 |
Henry S. Thompson |
old style
|
Mon, 18 Jul 2022 18:40:12 +0100 |
Henry S. Thompson |
symlink to dir doesn't work
|
Mon, 18 Jul 2022 18:30:56 +0100 |
Henry S. Thompson |
work-path bin dir
|
Mon, 18 Jul 2022 18:16:27 +0100 |
Henry S. Thompson |
previous approach to lang/field extraction
|
Mon, 18 Jul 2022 18:11:46 +0100 |
Henry S. Thompson |
moved to shared/bin
|
Mon, 18 Jul 2022 17:59:43 +0100 |
Henry S. Thompson |
x
|
Mon, 18 Jul 2022 17:39:35 +0100 |
Henry S. Thompson |
x
|
Wed, 06 Jul 2022 18:07:34 +0100 |
Henry S. Thompson |
demo of slurm usage using cdx2tsv.py
|
Wed, 06 Jul 2022 18:00:53 +0100 |
Henry S. Thompson |
do whole line
|
Mon, 04 Jul 2022 18:14:41 +0100 |
Henry S. Thompson |
no more gentoo,
|
Mon, 04 Jul 2022 18:12:26 +0100 |
Henry S. Thompson |
allow use of global stash
|
Fri, 01 Jul 2022 17:50:06 +0200 |
Henry Thompson |
for 2022 exercise
|
Wed, 17 Nov 2021 18:26:33 +0000 |
Henry S. Thompson |
instead of csv
|
Mon, 01 Nov 2021 21:23:13 +0000 |
Henry S. Thompson |
add -c switch to btot
|
Thu, 28 Oct 2021 12:11:08 +0000 |
Henry S. Thompson |
use sqlite3 just to tabulate
|
Tue, 26 Oct 2021 14:07:34 +0000 |
Henry S. Thompson |
fixed
|
Tue, 26 Oct 2021 14:05:35 +0000 |
Henry S. Thompson |
working, with compound driver files
|
Mon, 25 Oct 2021 15:07:03 +0000 |
Henry S. Thompson |
better comments
|
Mon, 25 Oct 2021 15:05:46 +0000 |
Henry S. Thompson |
do the work for cdx2sql
|
Mon, 25 Oct 2021 15:05:25 +0000 |
Henry S. Thompson |
change test to use Master
|
Fri, 22 Oct 2021 12:36:15 +0000 |
Henry S. Thompson |
works for 0--9
|
Thu, 21 Oct 2021 19:18:47 +0000 |
Henry S. Thompson |
replace too-complex invocation of cdx2tsv
|
Wed, 20 Oct 2021 17:14:18 +0000 |
Henry S. Thompson |
basic, works
|
Wed, 20 Oct 2021 15:47:55 +0000 |
Henry S. Thompson |
too clever by half, keys won't work in parallel for e.g. media types
|
Tue, 19 Oct 2021 12:57:50 +0000 |
Henry S. Thompson |
working, w. pickle
|
Tue, 19 Oct 2021 12:56:14 +0000 |
Henry S. Thompson |
mail-lib
|
Tue, 19 Oct 2021 12:55:30 +0000 |
Henry S. Thompson |
move to ec164.guest
|
Fri, 23 Jul 2021 22:19:15 +0000 |
Henry S. Thompson |
fixed bug(s) wrt large payload files
|
Fri, 23 Jul 2021 16:23:46 +0000 |
Henry S. Thompson |
just barely working
|
Wed, 21 Jul 2021 20:05:42 +0000 |
Henry S. Thompson |
add cl arg --fpath replacing FPAT, which is now default value
|
Wed, 21 Jul 2021 20:04:11 +0000 |
Henry S. Thompson |
more paths
|
Wed, 14 Jul 2021 16:50:30 +0000 |
Henry S. Thompson |
add usage/help info
|
Wed, 14 Jul 2021 16:49:54 +0000 |
Henry S. Thompson |
add usage/help info
|
Wed, 14 Jul 2021 16:49:35 +0000 |
Henry S. Thompson |
parameterise the temp file and move it to /dev/shm
|
Wed, 14 Jul 2021 15:30:29 +0000 |
Henry S. Thompson |
sic
|
Fri, 09 Jul 2021 14:20:51 +0000 |
Henry S. Thompson |
use printf safely
|
Fri, 09 Jul 2021 13:46:10 +0000 |
Henry S. Thompson |
handle multiple L-M lines :-(
|
Fri, 09 Jul 2021 13:45:43 +0000 |
Henry S. Thompson |
improve error handling
|
Fri, 09 Jul 2021 13:45:04 +0000 |
Henry S. Thompson |
more focussed, better SLURM_... vars
|
Tue, 29 Jun 2021 08:00:40 +0000 |
Henry S. Thompson |
bits and pieces
|
Tue, 29 Jun 2021 07:53:47 +0000 |
Henry S. Thompson |
better btot
|
Mon, 28 Jun 2021 21:50:30 +0000 |
Henry S. Thompson |
extract Last Modified via cdx
|
Mon, 28 Jun 2021 17:16:34 +0000 |
Henry S. Thompson |
fix path to qpdf
|
Mon, 28 Jun 2021 17:16:15 +0000 |
Henry S. Thompson |
silently skip robotstxt
|
Mon, 28 Jun 2021 17:15:19 +0000 |
Henry S. Thompson |
workaround histcontrol
|
Mon, 28 Jun 2021 15:40:10 +0000 |
Henry S. Thompson |
support field edit
|
Mon, 28 Jun 2021 14:01:41 +0000 |
Henry S. Thompson |
for use in processing CC index files
|
Wed, 16 Jun 2021 16:12:46 +0000 |
Henry S. Thompson |
implement --cmd
|
Wed, 16 Jun 2021 16:12:16 +0000 |
Henry S. Thompson |
qpdf needs LD_LIB_PATH
|
Tue, 15 Jun 2021 18:04:34 +0000 |
Henry S. Thompson |
refactor final processing loop,
|
Tue, 15 Jun 2021 16:58:31 +0000 |
Henry S. Thompson |
frame size
|
Tue, 15 Jun 2021 16:58:03 +0000 |
Henry S. Thompson |
include sh-script
|
Mon, 26 Apr 2021 17:18:29 +0000 |
Henry S. Thompson |
all parts working, idempotency achieved
|
Mon, 26 Apr 2021 17:17:58 +0000 |
Henry S. Thompson |
debugging
|
Mon, 26 Apr 2021 17:17:38 +0000 |
Henry S. Thompson |
(none)
|
Mon, 26 Apr 2021 15:28:23 +0000 |
Henry S. Thompson |
warc and headers parts working
|
Thu, 22 Apr 2021 21:31:03 +0000 |
Henry S. Thompson |
back to IGzipFile
|
Thu, 22 Apr 2021 21:10:02 +0000 |
Henry S. Thompson |
approved Popen version using .communicate
|
Thu, 22 Apr 2021 19:06:55 +0000 |
Henry S. Thompson |
using Popen to run igzip (also not great)
|
Tue, 20 Apr 2021 19:11:57 +0000 |
Henry S. Thompson |
added support for copying to/using /dev/shm or /tmp
|
Tue, 20 Apr 2021 12:26:09 +0000 |
Henry S. Thompson |
working with -x and rich directory structure
|
Tue, 20 Apr 2021 11:12:35 +0000 |
Henry S. Thompson |
convert to rich directory structure per 2019-35
|
Mon, 19 Apr 2021 18:09:51 +0000 |
Henry S. Thompson |
-x barely working
|
Mon, 19 Apr 2021 18:09:25 +0000 |
Henry S. Thompson |
never should have added
|
Mon, 19 Apr 2021 13:08:16 +0000 |
Henry S. Thompson |
better dd error handling
|
Mon, 19 Apr 2021 13:07:58 +0000 |
Henry S. Thompson |
(none)
|
Sun, 18 Apr 2021 17:03:45 +0000 |
Henry S. Thompson |
bare minimum working
|
Fri, 16 Apr 2021 18:28:00 +0000 |
Henry S. Thompson |
triple args checked, filename opened
|
Fri, 16 Apr 2021 13:15:23 +0000 |
Henry S. Thompson |
help format hacking done
|
Fri, 16 Apr 2021 12:55:05 +0000 |
Henry S. Thompson |
basic help format hacking works
|
Fri, 16 Apr 2021 09:01:16 +0000 |
Henry S. Thompson |
(none)
|
Fri, 16 Apr 2021 09:00:17 +0000 |
Henry S. Thompson |
(none)
|
Thu, 15 Apr 2021 19:22:27 +0000 |
Henry S. Thompson |
just struggling with argparse
|
Thu, 15 Apr 2021 10:59:25 +0000 |
Henry S. Thompson |
support a command to receive each result,
|
Wed, 14 Apr 2021 20:15:32 +0000 |
Henry S. Thompson |
accepts index lines, less line-at-a-time
|
Wed, 14 Apr 2021 10:08:41 +0000 |
Henry S. Thompson |
working with one input
|
Wed, 14 Apr 2021 08:57:43 +0000 |
Henry S. Thompson |
-w and -h working
|
Tue, 13 Apr 2021 17:52:31 +0000 |
Henry S. Thompson |
working on flags
|
Tue, 13 Apr 2021 17:02:09 +0000 |
Henry S. Thompson |
new
|
Tue, 16 Mar 2021 16:20:02 +0000 |
Henry S. Thompson |
working with locking and copying
|
Mon, 15 Mar 2021 14:26:42 +0000 |
Henry S. Thompson |
working for -t 2 -c 2
|
Mon, 15 Mar 2021 14:20:00 +0000 |
Henry S. Thompson |
minor
|
Sun, 14 Mar 2021 21:28:02 +0000 |
Henry S. Thompson |
prepare for real parallel distribution
|
Sun, 14 Mar 2021 21:25:01 +0000 |
Henry S. Thompson |
environment improvements
|
Wed, 03 Mar 2021 19:33:56 +0000 |
Henry S. Thompson |
trying to move to slurm
|
Sat, 09 May 2020 16:16:28 +0100 |
Henry S. Thompson |
improved F handling/logging
|
Fri, 08 May 2020 19:52:36 +0100 |
Henry S. Thompson |
keep separate antecedents separate, buggy?
|
Thu, 07 May 2020 18:47:24 +0100 |
Henry S. Thompson |
track redirects, need to use full crawldiagnostics.warc.gz for "location:" and "Uri:"
|
Thu, 07 May 2020 11:33:24 +0100 |
Henry S. Thompson |
refactor, change summary print (problem?)
|
Wed, 06 May 2020 18:28:52 +0100 |
Henry S. Thompson |
bare framework working
|
Wed, 06 May 2020 14:25:44 +0100 |
Henry S. Thompson |
starting on tool to assemble as complete info as we have wrt a seed URI
|
Wed, 06 May 2020 14:24:42 +0100 |
Henry S. Thompson |
use local .m2/repository for Hadoop 3.4.0
|
Wed, 06 May 2020 14:23:33 +0100 |
Henry S. Thompson |
works for big files with Hadoop 3.4.0
|
Wed, 06 May 2020 14:22:48 +0100 |
Henry S. Thompson |
x
|
Tue, 28 Apr 2020 19:02:34 +0100 |
Henry S. Thompson |
log truncations
|
Tue, 28 Apr 2020 19:02:14 +0100 |
Henry S. Thompson |
impose some limits
|
Tue, 28 Apr 2020 19:01:41 +0100 |
Henry S. Thompson |
x
|
Fri, 24 Apr 2020 20:12:44 +0100 |
Henry S. Thompson |
x
|
Fri, 24 Apr 2020 20:12:29 +0100 |
Henry S. Thompson |
mostly from Sebastian
|
Fri, 24 Apr 2020 20:03:29 +0100 |
Henry S. Thompson |
misc
|
Fri, 24 Apr 2020 20:01:35 +0100 |
Henry S. Thompson |
misc
|
Fri, 24 Apr 2020 20:01:25 +0100 |
Henry S. Thompson |
fix from Sebastian
|
Fri, 24 Apr 2020 19:57:16 +0100 |
Henry S. Thompson |
misc
|
Fri, 24 Apr 2020 19:55:11 +0100 |
Henry S. Thompson |
misc
|
Fri, 24 Apr 2020 15:20:33 +0100 |
Henry S. Thompson |
several efficiency (hopefully) tweaks
|
Thu, 23 Apr 2020 17:26:55 +0100 |
Henry S. Thompson |
x
|
Thu, 23 Apr 2020 17:25:25 +0100 |
Henry S. Thompson |
switch for use on login server, invoke by hand with 0/1 as only cmd line arg
|
Wed, 22 Apr 2020 18:42:40 +0100 |
Henry S. Thompson |
java stuff
|
Wed, 22 Apr 2020 18:42:23 +0100 |
Henry S. Thompson |
try nutch fetch for big pdfs
|
Wed, 15 Apr 2020 18:44:18 +0100 |
Henry S. Thompson |
final most general version
|
Tue, 14 Apr 2020 17:52:34 +0100 |
Henry S. Thompson |
too big for /dev/shm, split in half
|
Tue, 14 Apr 2020 16:10:22 +0100 |
Henry S. Thompson |
one-off to convert big extracts.tar into lots of smaller ones
|
Mon, 13 Apr 2020 17:29:31 +0100 |
Henry S. Thompson |
as used successfully for 3rd run
|
Mon, 13 Apr 2020 15:24:32 +0100 |
Henry S. Thompson |
ready to try another pass with robust diff checking
|
Mon, 13 Apr 2020 14:12:12 +0100 |
Henry S. Thompson |
working towards more robust diff checking
|
Sat, 11 Apr 2020 13:41:46 +0100 |
Henry S. Thompson |
a few tweaks after 2nd parallel run
|
Fri, 10 Apr 2020 18:45:30 +0100 |
Henry S. Thompson |
another few log fixes
|
Fri, 10 Apr 2020 18:42:08 +0100 |
Henry S. Thompson |
as running, modulo 1 log output wrong
|
Fri, 10 Apr 2020 18:22:48 +0100 |
Henry S. Thompson |
log more, work around more glitches
|
Fri, 10 Apr 2020 18:22:24 +0100 |
Henry S. Thompson |
x
|
Wed, 08 Apr 2020 14:11:04 +0100 |
Henry S. Thompson |
start trying to work around failures
|
Wed, 08 Apr 2020 11:27:33 +0100 |
Henry S. Thompson |
parallelised version of reExtract.sh
|
Tue, 07 Apr 2020 18:00:29 +0100 |
Henry S. Thompson |
complete change of array var construction, used it for log file names too, tar update enabled, so maybe complete but w/o any parallel
|
Sat, 04 Apr 2020 15:31:58 +0100 |
Henry S. Thompson |
added computation of required additions to tar file, but not actually added
|
Fri, 03 Apr 2020 19:04:06 +0100 |
Henry S. Thompson |
refactored, not tested
|
Fri, 03 Apr 2020 17:35:17 +0100 |
Henry S. Thompson |
done through re-extraction, fixing tars still to come
|
Thu, 02 Apr 2020 19:21:21 +0100 |
Henry S. Thompson |
sketching more
|
Thu, 02 Apr 2020 19:14:23 +0100 |
Henry S. Thompson |
towards re-running extraction in part
|
Thu, 02 Apr 2020 19:13:40 +0100 |
Henry S. Thompson |
up the time limit
|
Thu, 02 Apr 2020 19:13:14 +0100 |
Henry S. Thompson |
clean up after ourselves
|
Thu, 26 Mar 2020 15:29:12 +0000 |
Henry S. Thompson |
fixed scope problem in tar step
|
Thu, 26 Mar 2020 12:24:30 +0000 |
Henry S. Thompson |
sync up filenames and log names,
|
Thu, 26 Mar 2020 12:23:33 +0000 |
Henry S. Thompson |
pass through extract args
|
Tue, 24 Mar 2020 17:53:35 +0000 |
Henry S. Thompson |
towards sub-division of resulting tar files
|
Tue, 24 Mar 2020 17:52:52 +0000 |
Henry S. Thompson |
not relevant
|
Thu, 19 Mar 2020 13:13:35 +0000 |
Henry S. Thompson |
x
|
Thu, 19 Mar 2020 13:13:02 +0000 |
Henry S. Thompson |
better quoting
|
Wed, 18 Mar 2020 21:54:47 +0000 |
Henry S. Thompson |
try to fix multi-line lossage
|
Wed, 18 Mar 2020 21:52:06 +0000 |
Henry S. Thompson |
fix missing use of $t
|
Wed, 18 Mar 2020 15:23:53 +0000 |
Henry S. Thompson |
first cut at doing extraction here
|
Wed, 18 Mar 2020 13:42:47 +0000 |
Henry S. Thompson |
finally hacked something that works
|
Wed, 18 Mar 2020 11:08:47 +0000 |
Henry S. Thompson |
(none)
|
Wed, 18 Mar 2020 11:08:23 +0000 |
Henry S. Thompson |
(none)
|
Wed, 18 Mar 2020 11:06:35 +0000 |
Henry S. Thompson |
x
|
Wed, 18 Mar 2020 10:57:56 +0000 |
Henry S. Thompson |
more job scripts
|
Wed, 18 Mar 2020 10:57:21 +0000 |
Henry S. Thompson |
more job scripts
|
Wed, 18 Mar 2020 10:56:23 +0000 |
Henry S. Thompson |
local setup
|
Mon, 16 Mar 2020 15:57:23 +0000 |
Henry S. Thompson |
copied from valhalla/bin
|
Thu, 27 Feb 2020 17:18:02 +0000 |
Henry S. Thompson |
fix a mis-folded link file
|
Thu, 27 Feb 2020 13:24:19 +0000 |
Henry S. Thompson |
sic
|
Wed, 26 Feb 2020 21:50:25 +0000 |
Henry S. Thompson |
use awk to do a join between links and 1132dates
|
Wed, 26 Feb 2020 16:02:22 +0000 |
Henry S. Thompson |
works after minor tweaks
|
Wed, 26 Feb 2020 15:47:20 +0000 |
Henry S. Thompson |
modelled on plinks
|
Wed, 26 Feb 2020 12:40:14 +0000 |
Henry S. Thompson |
fixes to pdfx to timeout, use regex
|
Tue, 25 Feb 2020 22:14:07 +0000 |
Henry S. Thompson |
add args for start tar and number of tars
|
Tue, 25 Feb 2020 18:33:22 +0000 |
Henry S. Thompson |
give up on mpiexec_mpt
|
Tue, 25 Feb 2020 14:56:36 +0000 |
Henry S. Thompson |
bigger run, longer limit
|
Tue, 25 Feb 2020 10:34:41 +0000 |
Henry S. Thompson |
logging tweaks, preparing for timeout on problem pdfs
|
Mon, 24 Feb 2020 12:16:10 +0000 |
Henry S. Thompson |
longer run, terser logging
|
Mon, 24 Feb 2020 00:44:53 +0000 |
Henry S. Thompson |
more logging
|
Sun, 23 Feb 2020 16:48:34 +0000 |
Henry S. Thompson |
refactor to address tarred-up pdfs
|
Wed, 19 Feb 2020 10:41:59 +0000 |
Henry S. Thompson |
merge
|
Wed, 19 Feb 2020 10:39:05 +0000 |
Henry S. Thompson |
try harder not to write empty links files
|
Tue, 18 Feb 2020 22:41:08 +0000 |
Henry Thompson |
only create links file if there are some
|
Tue, 18 Feb 2020 22:19:51 +0000 |
Henry Thompson |
typos
|
Tue, 18 Feb 2020 21:33:35 +0000 |
Henry Thompson |
switch to file loop inside python, assume file index integer in pipe as well as filename, check /dev/shm/stopJob
|
Tue, 18 Feb 2020 13:15:05 +0000 |
Henry S. Thompson |
bolting the barn door...
|