2021-03-14 |
Henry S. Thompson |
environment improvements
|
2021-03-03 |
Henry S. Thompson |
trying to move to slurm
|
2020-05-09 |
Henry S. Thompson |
improved F handling/logging
|
2020-05-08 |
Henry S. Thompson |
keep separate antecedents separate, buggy?
|
2020-05-07 |
Henry S. Thompson |
track redirects, need to use full crawldiagnostics.warc.gz for "location:" and "Uri:"
|
2020-05-07 |
Henry S. Thompson |
refactor, change summary print (problem?)
|
2020-05-06 |
Henry S. Thompson |
bare framework working
|
2020-05-06 |
Henry S. Thompson |
starting on tool to assemble as complete info as we have wrt a seed URI
|
2020-05-06 |
Henry S. Thompson |
use local .m2/repository for Hadoop 3.4.0
|
2020-05-06 |
Henry S. Thompson |
works for big files with Hadoop 3.4.0
|
2020-05-06 |
Henry S. Thompson |
x
|
2020-04-28 |
Henry S. Thompson |
log truncations
|
2020-04-28 |
Henry S. Thompson |
impose some limits
|
2020-04-28 |
Henry S. Thompson |
x
|
2020-04-24 |
Henry S. Thompson |
x
|
2020-04-24 |
Henry S. Thompson |
mostly from Sebastian
|
2020-04-24 |
Henry S. Thompson |
misc
|
2020-04-24 |
Henry S. Thompson |
misc
|
2020-04-24 |
Henry S. Thompson |
fix from Sebastian
|
2020-04-24 |
Henry S. Thompson |
misc
|
2020-04-24 |
Henry S. Thompson |
misc
|
2020-04-24 |
Henry S. Thompson |
several efficiency (hopefully) tweaks
|
2020-04-23 |
Henry S. Thompson |
x
|
2020-04-23 |
Henry S. Thompson |
switch for use on login server, invoke by hand with 0/1 as only cmd line arg
|
2020-04-22 |
Henry S. Thompson |
java stuff
|
2020-04-22 |
Henry S. Thompson |
try nutch fetch for big pdfs
|
2020-04-15 |
Henry S. Thompson |
final most general version
|
2020-04-14 |
Henry S. Thompson |
too big for /dev/shm, split in half
|
2020-04-14 |
Henry S. Thompson |
one-off to convert big extracts.tar into lots of smaller ones
|
2020-04-13 |
Henry S. Thompson |
as used successfully for 3rd run
|
2020-04-13 |
Henry S. Thompson |
ready to try another pass with robust diff checking
|
2020-04-13 |
Henry S. Thompson |
working towards more robust diff checking
|
2020-04-11 |
Henry S. Thompson |
a few tweaks after 2nd parallel run
|
2020-04-10 |
Henry S. Thompson |
another few log fixes
|
2020-04-10 |
Henry S. Thompson |
as running, modulo 1 log output wrong
|
2020-04-10 |
Henry S. Thompson |
log more, work around more glitches
|
2020-04-10 |
Henry S. Thompson |
x
|
2020-04-08 |
Henry S. Thompson |
start trying to work around failures
|
2020-04-08 |
Henry S. Thompson |
parallelised version of reExtract.sh
|
2020-04-07 |
Henry S. Thompson |
complete change of array var construction, used it for log file names too, tar update enabled, so maybe complete but w/o any parallel
|
2020-04-04 |
Henry S. Thompson |
added computation of required additions to tar file, but not actually added
|
2020-04-03 |
Henry S. Thompson |
refactored, not tested
|
2020-04-03 |
Henry S. Thompson |
done through re-extraction, fixing tars still to come
|
2020-04-02 |
Henry S. Thompson |
sketching more
|
2020-04-02 |
Henry S. Thompson |
towards re-running extraction in part
|
2020-04-02 |
Henry S. Thompson |
up the time limit
|
2020-04-02 |
Henry S. Thompson |
clean up after ourselves
|
2020-03-26 |
Henry S. Thompson |
fixed scope problem in tar step
|
2020-03-26 |
Henry S. Thompson |
sync up filenames and log names
|
2020-03-26 |
Henry S. Thompson |
pass through extract args
|
2020-03-24 |
Henry S. Thompson |
towards sub-division of resulting tar files
|
2020-03-24 |
Henry S. Thompson |
not relevant
|
2020-03-19 |
Henry S. Thompson |
x
|
2020-03-19 |
Henry S. Thompson |
better quoting
|
2020-03-18 |
Henry S. Thompson |
try to fix multi-line lossage
|
2020-03-18 |
Henry S. Thompson |
fix missing use of $t
|