log

age author description
5 months ago Henry S. Thompson add target test-core which (dangerously) avoids (we hope pointless) recompilation of all the plugins default tip
5 months ago Henry S. Thompson move DummyContext out
5 months ago Henry S. Thompson works, although output not checked
5 months ago Henry S. Thompson maybe triggers jdb on tests with -DdebugTest=true on command line
5 months ago Henry S. Thompson orig, more or less
5 months ago Henry S. Thompson working, with issues:
5 months ago Henry S. Thompson compiles with content, but fails with EOF -- need blank lines?
5 months ago Henry S. Thompson runs, but no cdx yet, because no value.content I presume
5 months ago Henry S. Thompson add lastmod to cdx lines,
12 months ago Henry S. Thompson csing-related tweaks
15 months ago Henry S. Thompson too many overdue updates to break down
18 months ago Henry S. Thompson use csing, and _runme_c.sh to get it initialised
18 months ago Henry S. Thompson MANPATH (?)
18 months ago Henry S. Thompson tab completion fix
19 months ago Henry S. Thompson add support for multiple calls to srun with a counter
20 months ago Henry S. Thompson add private work bin dir to PATH
20 months ago Henry S. Thompson tweak UI: copy/paste and title bar
20 months ago Henry S. Thompson ec184 now, run w. unbuffered output
20 months ago Henry S. Thompson moved to work tree
20 months ago Henry S. Thompson working, about to move to work tree
20 months ago Henry S. Thompson working on implementing types and parts:
2023-01-10 Henry S. Thompson change account back
2022-07-28 Henry S. Thompson x
2022-07-28 Henry S. Thompson generalised sbatch front-end to cdx2tsv.py
2022-07-28 Henry S. Thompson x
2022-07-20 Henry S. Thompson add $W
2022-07-20 Henry S. Thompson new-style log notice
2022-07-20 Henry S. Thompson x
2022-07-18 Henry S. Thompson new style batch jobs, see cirrus_work repo for _xxx.sh
2022-07-18 Henry S. Thompson old style
2022-07-18 Henry S. Thompson symlink to dir doesn't work
2022-07-18 Henry S. Thompson work-path bin dir
2022-07-18 Henry S. Thompson previous approach to lang/field extraction
2022-07-18 Henry S. Thompson moved to shared/bin
2022-07-18 Henry S. Thompson x
2022-07-18 Henry S. Thompson x
2022-07-06 Henry S. Thompson demo of slurm usage using cdx2tsv.py
2022-07-06 Henry S. Thompson do whole line
2022-07-04 Henry S. Thompson no more gentoo,
2022-07-04 Henry S. Thompson allow use of global stash
2022-07-01 Henry Thompson for 2022 exercise
2021-11-17 Henry S. Thompson instead of csv
2021-11-01 Henry S. Thompson add -c switch to btot
2021-10-28 Henry S. Thompson use sqlite3 just to tabulate
2021-10-26 Henry S. Thompson fixed
2021-10-26 Henry S. Thompson working, with compound driver files
2021-10-25 Henry S. Thompson better comments
2021-10-25 Henry S. Thompson do the work for cdx2sql
2021-10-25 Henry S. Thompson change test to use Master
2021-10-22 Henry S. Thompson works for 0--9
2021-10-21 Henry S. Thompson replace too-complex invocation of cdx2tsv
2021-10-20 Henry S. Thompson basic, works
2021-10-20 Henry S. Thompson too clever by half, keys won't work in parallel for e.g. media types
2021-10-19 Henry S. Thompson working, w. pickle
2021-10-19 Henry S. Thompson mail-lib
2021-10-19 Henry S. Thompson move to ec164.guest
2021-07-23 Henry S. Thompson fixed bug(s) wrt large payload files
2021-07-23 Henry S. Thompson just barely working
2021-07-21 Henry S. Thompson add cl arg --fpath replacing FPAT, which is now default value
2021-07-21 Henry S. Thompson more paths
2021-07-14 Henry S. Thompson add usage/help info
2021-07-14 Henry S. Thompson add usage/help info
2021-07-14 Henry S. Thompson parameterise the temp file and move it to /dev/shm
2021-07-14 Henry S. Thompson sic
2021-07-09 Henry S. Thompson use printf safely
2021-07-09 Henry S. Thompson handle multiple L-M lines :-(
2021-07-09 Henry S. Thompson improve error handling
2021-07-09 Henry S. Thompson more focussed, better SLURM_... vars
2021-06-29 Henry S. Thompson bits and pieces
2021-06-29 Henry S. Thompson better btot
2021-06-28 Henry S. Thompson extract Last Modified via cdx
2021-06-28 Henry S. Thompson fix path to qpdf
2021-06-28 Henry S. Thompson silently skip robotstxt
2021-06-28 Henry S. Thompson workaround histcontrol
2021-06-28 Henry S. Thompson support field edit
2021-06-28 Henry S. Thompson for use in processing CC index files
2021-06-16 Henry S. Thompson implement --cmd
2021-06-16 Henry S. Thompson qpdf needs LD_LIB_PATH
2021-06-15 Henry S. Thompson refactor final processing loop,
2021-06-15 Henry S. Thompson frame size
2021-06-15 Henry S. Thompson include sh-script
2021-04-26 Henry S. Thompson all parts working, idempotency achieved
2021-04-26 Henry S. Thompson debugging
2021-04-26 Henry S. Thompson (none)
2021-04-26 Henry S. Thompson warc and headers parts working
2021-04-22 Henry S. Thompson back to IGzipFile
2021-04-22 Henry S. Thompson approved Popen version using .communicate
2021-04-22 Henry S. Thompson using Popen to run igzip (also not great)
2021-04-20 Henry S. Thompson added support for copying to/using /dev/shm or /tmp
2021-04-20 Henry S. Thompson working with -x and rich directory structure
2021-04-20 Henry S. Thompson convert to rich directory structure per 2019-35
2021-04-19 Henry S. Thompson -x barely working
2021-04-19 Henry S. Thompson never should have added
2021-04-19 Henry S. Thompson better dd error handling
2021-04-19 Henry S. Thompson (none)
2021-04-18 Henry S. Thompson bare minimum working
2021-04-16 Henry S. Thompson triple args checked, filename opened
2021-04-16 Henry S. Thompson help format hacking done
2021-04-16 Henry S. Thompson basic help format hacking works
2021-04-16 Henry S. Thompson (none)
2021-04-16 Henry S. Thompson (none)
2021-04-15 Henry S. Thompson just struggling with argparse
2021-04-15 Henry S. Thompson support a command to receive each result,
2021-04-14 Henry S. Thompson accepts index lines, less line-at-a-time
2021-04-14 Henry S. Thompson working with one input
2021-04-14 Henry S. Thompson -w and -h working
2021-04-13 Henry S. Thompson working on flags
2021-04-13 Henry S. Thompson new
2021-03-16 Henry S. Thompson working with locking and copying
2021-03-15 Henry S. Thompson working for -t 2 -c 2
2021-03-15 Henry S. Thompson minor
2021-03-14 Henry S. Thompson prepare for real parallel distribution
2021-03-14 Henry S. Thompson environment improvements
2021-03-03 Henry S. Thompson trying to move to slurm
2020-05-09 Henry S. Thompson improved F handling/logging
2020-05-08 Henry S. Thompson keep separate antecedents separate, buggy?
2020-05-07 Henry S. Thompson track redirects, need to use full crawldiagnostics.warc.gz for "location:" and "Uri:"
2020-05-07 Henry S. Thompson refactor, change summary print (problem?)
2020-05-06 Henry S. Thompson bare framework working
2020-05-06 Henry S. Thompson starting on tool to assemble as complete as we have info wrt a seed URI
2020-05-06 Henry S. Thompson use local .m2/repository for Hadoop 3.4.0
2020-05-06 Henry S. Thompson works for big files with Hadoop 3.4.0
2020-05-06 Henry S. Thompson x
2020-04-28 Henry S. Thompson log truncations
2020-04-28 Henry S. Thompson impose some limits
2020-04-28 Henry S. Thompson x
2020-04-24 Henry S. Thompson x
2020-04-24 Henry S. Thompson mostly from Sebastian
2020-04-24 Henry S. Thompson misc
2020-04-24 Henry S. Thompson misc
2020-04-24 Henry S. Thompson fix from Sebastian
2020-04-24 Henry S. Thompson misc
2020-04-24 Henry S. Thompson misc
2020-04-24 Henry S. Thompson several efficiency (hopefully) tweaks
2020-04-23 Henry S. Thompson x
2020-04-23 Henry S. Thompson switch for use on login server, invoke by hand with 0/1 as only cmd line arg
2020-04-22 Henry S. Thompson java stuff
2020-04-22 Henry S. Thompson try nutch fetch for big pdfs
2020-04-15 Henry S. Thompson final most general version
2020-04-14 Henry S. Thompson too big for /dev/shm, split in half
2020-04-14 Henry S. Thompson one-off to convert big extracts.tar into lots of smaller ones
2020-04-13 Henry S. Thompson as used successfully for 3rd run
2020-04-13 Henry S. Thompson ready to try another pass with robust diff checking
2020-04-13 Henry S. Thompson working towards more robust diff checking
2020-04-11 Henry S. Thompson a few tweaks after 2nd parallel run
2020-04-10 Henry S. Thompson another few log fixes
2020-04-10 Henry S. Thompson as running, modulo 1 log output wrong
2020-04-10 Henry S. Thompson log more, work around more glitches
2020-04-10 Henry S. Thompson x
2020-04-08 Henry S. Thompson start try to work around failures
2020-04-08 Henry S. Thompson parallelised version of reExtract.sh
2020-04-07 Henry S. Thompson complete change of array var construction, used it for log file names too, tar update enabled, so maybe complete but w/o any parallel
2020-04-04 Henry S. Thompson added computation of required additions to tar file, but not actually added
2020-04-03 Henry S. Thompson refactored, not tested
2020-04-03 Henry S. Thompson done through re-extraction, fixing tars still to come
2020-04-02 Henry S. Thompson sketching more
2020-04-02 Henry S. Thompson towards re-running extraction in part
2020-04-02 Henry S. Thompson up the time limit
2020-04-02 Henry S. Thompson clean up after ourselves
2020-03-26 Henry S. Thompson fixed scope pblm in tar step
2020-03-26 Henry S. Thompson sync up filenames and log names,
2020-03-26 Henry S. Thompson pass through extract args
2020-03-24 Henry S. Thompson towards sub-division of resulting tar files
2020-03-24 Henry S. Thompson not relevant
2020-03-19 Henry S. Thompson x
2020-03-19 Henry S. Thompson better quoting
2020-03-18 Henry S. Thompson try to fix multi-line lossage
2020-03-18 Henry S. Thompson fix missing use of $t
2020-03-18 Henry S. Thompson first cut at doing extraction here
2020-03-18 Henry S. Thompson finally hacked something that works
2020-03-18 Henry S. Thompson (none)
2020-03-18 Henry S. Thompson (none)
2020-03-18 Henry S. Thompson x
2020-03-18 Henry S. Thompson more job scripts
2020-03-18 Henry S. Thompson more job scripts
2020-03-18 Henry S. Thompson local setup
2020-03-16 Henry S. Thompson copied from valhalla/bin
2020-02-27 Henry S. Thompson fix a mis-folded link file
2020-02-27 Henry S. Thompson sic
2020-02-26 Henry S. Thompson use awk to do a join between links and 1132dates
2020-02-26 Henry S. Thompson works after minor tweaks
2020-02-26 Henry S. Thompson modelled on plinks
2020-02-26 Henry S. Thompson fixes to pdfx to timeout, use regex
2020-02-25 Henry S. Thompson add args for start tar and number of tars
2020-02-25 Henry S. Thompson give up on mpiexec_mpt
2020-02-25 Henry S. Thompson bigger run, longer limit
2020-02-25 Henry S. Thompson logging tweaks, preparing for timeout on problem pdfs
2020-02-24 Henry S. Thompson longer run, terser logging
2020-02-24 Henry S. Thompson more logging
2020-02-23 Henry S. Thompson refactor to address tarred-up pdfs
2020-02-19 Henry S. Thompson merge
2020-02-19 Henry S. Thompson try harder not to write empty links files
2020-02-18 Henry Thompson only create links file if there are some
2020-02-18 Henry Thompson typos
2020-02-18 Henry Thompson switch to file loop inside python, assume file index integer in pipe as well as filename, check /dev/shm/stopJob
2020-02-18 Henry S. Thompson bolting the barn door...