2021-10-19 |
Henry S. Thompson |
mail-lib
|
2021-10-19 |
Henry S. Thompson |
move to ec164.guest
|
2021-07-23 |
Henry S. Thompson |
fixed bug(s) wrt large payload files
|
2021-07-23 |
Henry S. Thompson |
just barely working
|
2021-07-21 |
Henry S. Thompson |
add cl arg --fpath replacing FPAT, which is now default value
|
2021-07-21 |
Henry S. Thompson |
more paths
|
2021-07-14 |
Henry S. Thompson |
add usage/help info
|
2021-07-14 |
Henry S. Thompson |
add usage/help info
|
2021-07-14 |
Henry S. Thompson |
parameterise the temp file and move it to /dev/shm
|
2021-07-14 |
Henry S. Thompson |
sic
|
2021-07-09 |
Henry S. Thompson |
use printf safely
|
2021-07-09 |
Henry S. Thompson |
handle multiple L-M lines :-(
|
2021-07-09 |
Henry S. Thompson |
improve error handling
|
2021-07-09 |
Henry S. Thompson |
more focussed, better SLURM_... vars
|
2021-06-29 |
Henry S. Thompson |
bits and pieces
|
2021-06-29 |
Henry S. Thompson |
better btot
|
2021-06-28 |
Henry S. Thompson |
extract Last Modified via cdx
|
2021-06-28 |
Henry S. Thompson |
fix path to qpdf
|
2021-06-28 |
Henry S. Thompson |
silently skip robotstxt
|
2021-06-28 |
Henry S. Thompson |
workaround histcontrol
|
2021-06-28 |
Henry S. Thompson |
support field edit
|
2021-06-28 |
Henry S. Thompson |
for use in processing CC index files
|
2021-06-16 |
Henry S. Thompson |
implement --cmd
|
2021-06-16 |
Henry S. Thompson |
qpdf needs LD_LIB_PATH
|
2021-06-15 |
Henry S. Thompson |
refactor final processing loop,
|
2021-06-15 |
Henry S. Thompson |
frame size
|
2021-06-15 |
Henry S. Thompson |
include sh-script
|
2021-04-26 |
Henry S. Thompson |
all parts working, idempotency achieved
|
2021-04-26 |
Henry S. Thompson |
debugging
|
2021-04-26 |
Henry S. Thompson |
(none)
|
2021-04-26 |
Henry S. Thompson |
warc and headers parts working
|
2021-04-22 |
Henry S. Thompson |
back to IGzipFile
|
2021-04-22 |
Henry S. Thompson |
approved Popen version using .communicate
|
2021-04-22 |
Henry S. Thompson |
using Popen to run igzip (also not great)
|
2021-04-20 |
Henry S. Thompson |
added support for copying to/using /dev/shm or /tmp
|
2021-04-20 |
Henry S. Thompson |
working with -x and rich directory structure
|
2021-04-20 |
Henry S. Thompson |
convert to rich directory structure per 2019-35
|
2021-04-19 |
Henry S. Thompson |
-x barely working
|
2021-04-19 |
Henry S. Thompson |
never should have added
|
2021-04-19 |
Henry S. Thompson |
better dd error handling
|
2021-04-19 |
Henry S. Thompson |
(none)
|
2021-04-18 |
Henry S. Thompson |
bare minimum working
|
2021-04-16 |
Henry S. Thompson |
triple args checked, filename opened
|
2021-04-16 |
Henry S. Thompson |
help format hacking done
|
2021-04-16 |
Henry S. Thompson |
basic help format hacking works
|
2021-04-16 |
Henry S. Thompson |
(none)
|
2021-04-16 |
Henry S. Thompson |
(none)
|
2021-04-15 |
Henry S. Thompson |
just struggling with argparse
|
2021-04-15 |
Henry S. Thompson |
support a command to receive each result,
|
2021-04-14 |
Henry S. Thompson |
accepts index lines, less line-at-a-time
|
2021-04-14 |
Henry S. Thompson |
working with one input
|
2021-04-14 |
Henry S. Thompson |
-w and -h working
|
2021-04-13 |
Henry S. Thompson |
working on flags
|
2021-04-13 |
Henry S. Thompson |
new
|
2021-03-16 |
Henry S. Thompson |
working with locking and copying
|
2021-03-15 |
Henry S. Thompson |
working for -t 2 -c 2
|
2021-03-15 |
Henry S. Thompson |
minor
|
2021-03-14 |
Henry S. Thompson |
prepare for real parallel distribution
|
2021-03-14 |
Henry S. Thompson |
environment improvements
|
2021-03-03 |
Henry S. Thompson |
trying to move to slurm
|
2020-05-09 |
Henry S. Thompson |
improved F handling/logging
|
2020-05-08 |
Henry S. Thompson |
keep separate antecedents separate, buggy?
|
2020-05-07 |
Henry S. Thompson |
track redirects, need to use full crawldiagnostics.warc.gz for "location:" and "Uri:"
|
2020-05-07 |
Henry S. Thompson |
refactor, change summary print (problem?)
|
2020-05-06 |
Henry S. Thompson |
bare framework working
|
2020-05-06 |
Henry S. Thompson |
starting on tool to assemble as complete as we have info wrt a seed URI
|
2020-05-06 |
Henry S. Thompson |
use local .m2/repository for Hadoop 3.4.0
|
2020-05-06 |
Henry S. Thompson |
works for big files with Hadoop 3.4.0
|
2020-05-06 |
Henry S. Thompson |
x
|
2020-04-28 |
Henry S. Thompson |
log truncations
|
2020-04-28 |
Henry S. Thompson |
impose some limits
|
2020-04-28 |
Henry S. Thompson |
x
|
2020-04-24 |
Henry S. Thompson |
x
|
2020-04-24 |
Henry S. Thompson |
mostly from Sebastian
|
2020-04-24 |
Henry S. Thompson |
misc
|
2020-04-24 |
Henry S. Thompson |
misc
|
2020-04-24 |
Henry S. Thompson |
fix from Sebastian
|
2020-04-24 |
Henry S. Thompson |
misc
|
2020-04-24 |
Henry S. Thompson |
misc
|
2020-04-24 |
Henry S. Thompson |
several efficiency (hopefully) tweaks
|
2020-04-23 |
Henry S. Thompson |
x
|
2020-04-23 |
Henry S. Thompson |
switch for use on login server, invoke by hand with 0/1 as only cmd line arg
|
2020-04-22 |
Henry S. Thompson |
java stuff
|
2020-04-22 |
Henry S. Thompson |
try nutch fetch for big pdfs
|
2020-04-15 |
Henry S. Thompson |
final most general version
|
2020-04-14 |
Henry S. Thompson |
too big for /dev/shm, split in half
|
2020-04-14 |
Henry S. Thompson |
one-off to convert big extracts.tar into lots of smaller ones
|
2020-04-13 |
Henry S. Thompson |
as used successfully for 3rd run
|
2020-04-13 |
Henry S. Thompson |
ready to try another pass with robust diff checking
|
2020-04-13 |
Henry S. Thompson |
working towards more robust diff checking
|
2020-04-11 |
Henry S. Thompson |
a few tweaks after 2nd parallel run
|
2020-04-10 |
Henry S. Thompson |
another few log fixes
|
2020-04-10 |
Henry S. Thompson |
as running, modulo 1 log output wrong
|
2020-04-10 |
Henry S. Thompson |
log more, work around more glitches
|
2020-04-10 |
Henry S. Thompson |
x
|
2020-04-08 |
Henry S. Thompson |
start try to work around failures
|
2020-04-08 |
Henry S. Thompson |
parallelised version of reExtract.sh
|
2020-04-07 |
Henry S. Thompson |
complete change of array var construction, used it for log file names too, tar update enabled, so maybe complete but w/o any parallel
|
2020-04-04 |
Henry S. Thompson |
added computation of required additions to tar file, but not actually added
|
2020-04-03 |
Henry S. Thompson |
refactored, not tested
|
2020-04-03 |
Henry S. Thompson |
done through re-extraction, fixing tars still to come
|
2020-04-02 |
Henry S. Thompson |
sketching more
|
2020-04-02 |
Henry S. Thompson |
towards re-running extraction in part
|
2020-04-02 |
Henry S. Thompson |
up the time limit
|
2020-04-02 |
Henry S. Thompson |
clean up after ourselves
|
2020-03-26 |
Henry S. Thompson |
fixed scope problem in tar step
|
2020-03-26 |
Henry S. Thompson |
sync up filenames and log names,
|
2020-03-26 |
Henry S. Thompson |
pass through extract args
|
2020-03-24 |
Henry S. Thompson |
towards sub-division of resulting tar files
|
2020-03-24 |
Henry S. Thompson |
not relevant
|
2020-03-19 |
Henry S. Thompson |
x
|
2020-03-19 |
Henry S. Thompson |
better quoting
|
2020-03-18 |
Henry S. Thompson |
try to fix multi-line lossage
|
2020-03-18 |
Henry S. Thompson |
fix missing use of $t
|
2020-03-18 |
Henry S. Thompson |
first cut at doing extraction here
|
2020-03-18 |
Henry S. Thompson |
finally hacked something that works
|
2020-03-18 |
Henry S. Thompson |
(none)
|
2020-03-18 |
Henry S. Thompson |
(none)
|
2020-03-18 |
Henry S. Thompson |
x
|
2020-03-18 |
Henry S. Thompson |
more job scripts
|
2020-03-18 |
Henry S. Thompson |
more job scripts
|
2020-03-18 |
Henry S. Thompson |
local setup
|
2020-03-16 |
Henry S. Thompson |
copied from valhalla/bin
|
2020-02-27 |
Henry S. Thompson |
fix a mis-folded link file
|
2020-02-27 |
Henry S. Thompson |
sic
|
2020-02-26 |
Henry S. Thompson |
use awk to do a join between links and 1132dates
|
2020-02-26 |
Henry S. Thompson |
works after minor tweaks
|
2020-02-26 |
Henry S. Thompson |
modelled on plinks
|
2020-02-26 |
Henry S. Thompson |
fixes to pdfx to timeout, use regex
|
2020-02-25 |
Henry S. Thompson |
add args for start tar and number of tars
|
2020-02-25 |
Henry S. Thompson |
give up on mpiexec_mpt
|
2020-02-25 |
Henry S. Thompson |
bigger run, longer limit
|
2020-02-25 |
Henry S. Thompson |
logging tweaks, preparing for timeout on problem pdfs
|
2020-02-24 |
Henry S. Thompson |
longer run, terser logging
|
2020-02-24 |
Henry S. Thompson |
more logging
|
2020-02-23 |
Henry S. Thompson |
refactor to address tarred-up pdfs
|
2020-02-19 |
Henry S. Thompson |
merge
|
2020-02-19 |
Henry S. Thompson |
try harder not to write empty links files
|
2020-02-18 |
Henry Thompson |
only create links file if there are some
|
2020-02-18 |
Henry Thompson |
typos
|
2020-02-18 |
Henry Thompson |
switch to file loop inside python, assume file index integer in pipe as well as filename, check /dev/shm/stopJob
|
2020-02-18 |
Henry S. Thompson |
bolting the barn door...
|