2021-04-26 |
Henry S. Thompson |
warc and headers parts working
|
2021-04-22 |
Henry S. Thompson |
back to IGzipFile
|
2021-04-22 |
Henry S. Thompson |
approved Popen version using .communicate
|
2021-04-22 |
Henry S. Thompson |
using Popen to run igzip (also not great)
|
2021-04-20 |
Henry S. Thompson |
added support for copying to/using /dev/shm or /tmp
|
2021-04-20 |
Henry S. Thompson |
working with -x and rich directory structure
|
2021-04-20 |
Henry S. Thompson |
convert to rich directory structure per 2019-35
|
2021-04-19 |
Henry S. Thompson |
-x barely working
|
2021-04-19 |
Henry S. Thompson |
never should have added
|
2021-04-19 |
Henry S. Thompson |
better dd error handling
|
2021-04-19 |
Henry S. Thompson |
(none)
|
2021-04-18 |
Henry S. Thompson |
bare minimum working
|
2021-04-16 |
Henry S. Thompson |
triple args checked, filename opened
|
2021-04-16 |
Henry S. Thompson |
help format hacking done
|
2021-04-16 |
Henry S. Thompson |
basic help format hacking works
|
2021-04-16 |
Henry S. Thompson |
(none)
|
2021-04-16 |
Henry S. Thompson |
(none)
|
2021-04-15 |
Henry S. Thompson |
just struggling with argparse
|
2021-04-15 |
Henry S. Thompson |
support a command to receive each result,
|
2021-04-14 |
Henry S. Thompson |
accepts index lines, less line-at-a-time
|
2021-04-14 |
Henry S. Thompson |
working with one input
|
2021-04-14 |
Henry S. Thompson |
-w and -h working
|
2021-04-13 |
Henry S. Thompson |
working on flags
|
2021-04-13 |
Henry S. Thompson |
new
|
2021-03-16 |
Henry S. Thompson |
working with locking and copying
|
2021-03-15 |
Henry S. Thompson |
working for -t 2 -c 2
|
2021-03-15 |
Henry S. Thompson |
minor
|
2021-03-14 |
Henry S. Thompson |
prepare for real parallel distribution
|
2021-03-14 |
Henry S. Thompson |
environment improvements
|
2021-03-03 |
Henry S. Thompson |
trying to move to slurm
|
2020-05-09 |
Henry S. Thompson |
improved F handling/logging
|
2020-05-08 |
Henry S. Thompson |
keep separate antecedents separate, buggy?
|
2020-05-07 |
Henry S. Thompson |
track redirects, need to use full crawldiagnostics.warc.gz for "location:" and "Uri:"
|
2020-05-07 |
Henry S. Thompson |
refactor, change summary print (problem?)
|
2020-05-06 |
Henry S. Thompson |
bare framework working
|
2020-05-06 |
Henry S. Thompson |
starting on tool to assemble as complete as we have info wrt a seed URI
|
2020-05-06 |
Henry S. Thompson |
use local .m2/repository for Hadoop 3.4.0
|
2020-05-06 |
Henry S. Thompson |
works for big files with Hadoop 3.4.0
|
2020-05-06 |
Henry S. Thompson |
x
|
2020-04-28 |
Henry S. Thompson |
log truncations
|
2020-04-28 |
Henry S. Thompson |
impose some limits
|
2020-04-28 |
Henry S. Thompson |
x
|
2020-04-24 |
Henry S. Thompson |
x
|
2020-04-24 |
Henry S. Thompson |
mostly from Sebastian
|
2020-04-24 |
Henry S. Thompson |
misc
|
2020-04-24 |
Henry S. Thompson |
misc
|
2020-04-24 |
Henry S. Thompson |
fix from Sebastian
|
2020-04-24 |
Henry S. Thompson |
misc
|
2020-04-24 |
Henry S. Thompson |
misc
|
2020-04-24 |
Henry S. Thompson |
several efficiency (hopefully) tweaks
|
2020-04-23 |
Henry S. Thompson |
x
|
2020-04-23 |
Henry S. Thompson |
switch for use on login server, invoke by hand with 0/1 as only cmd line arg
|
2020-04-22 |
Henry S. Thompson |
java stuff
|
2020-04-22 |
Henry S. Thompson |
try nutch fetch for big pdfs
|
2020-04-15 |
Henry S. Thompson |
final most general version
|
2020-04-14 |
Henry S. Thompson |
too big for /dev/shm, split in half
|
2020-04-14 |
Henry S. Thompson |
one-off to convert big extracts.tar into lots of smaller ones
|
2020-04-13 |
Henry S. Thompson |
as used successfully for 3rd run
|
2020-04-13 |
Henry S. Thompson |
ready to try another pass with robust diff checking
|
2020-04-13 |
Henry S. Thompson |
working towards more robust diff checking
|
2020-04-11 |
Henry S. Thompson |
a few tweaks after 2nd parallel run
|
2020-04-10 |
Henry S. Thompson |
another few log fixes
|
2020-04-10 |
Henry S. Thompson |
as running, modulo 1 log output wrong
|
2020-04-10 |
Henry S. Thompson |
log more, work around more glitches
|