log

age author description
Mon, 28 Jun 2021 14:01:41 +0000 Henry S. Thompson for use in processing CC index files
Wed, 16 Jun 2021 16:12:46 +0000 Henry S. Thompson implement --cmd
Wed, 16 Jun 2021 16:12:16 +0000 Henry S. Thompson qpdf needs LD_LIB_PATH
Tue, 15 Jun 2021 18:04:34 +0000 Henry S. Thompson refactor final processing loop,
Tue, 15 Jun 2021 16:58:31 +0000 Henry S. Thompson frame size
Tue, 15 Jun 2021 16:58:03 +0000 Henry S. Thompson include sh-script
Mon, 26 Apr 2021 17:18:29 +0000 Henry S. Thompson all parts working, idempotency achieved
Mon, 26 Apr 2021 17:17:58 +0000 Henry S. Thompson debugging
Mon, 26 Apr 2021 17:17:38 +0000 Henry S. Thompson (none)
Mon, 26 Apr 2021 15:28:23 +0000 Henry S. Thompson warc and headers parts working
Thu, 22 Apr 2021 21:31:03 +0000 Henry S. Thompson back to IGzipFile
Thu, 22 Apr 2021 21:10:02 +0000 Henry S. Thompson approved Popen version using .communicate
Thu, 22 Apr 2021 19:06:55 +0000 Henry S. Thompson using Popen to run igzip (also not great)
Tue, 20 Apr 2021 19:11:57 +0000 Henry S. Thompson added support for copying to/using /dev/shm or /tmp
Tue, 20 Apr 2021 12:26:09 +0000 Henry S. Thompson working with -x and rich directory structure
Tue, 20 Apr 2021 11:12:35 +0000 Henry S. Thompson convert to rich directory structure per 2019-35
Mon, 19 Apr 2021 18:09:51 +0000 Henry S. Thompson -x barely working
Mon, 19 Apr 2021 18:09:25 +0000 Henry S. Thompson never should have added
Mon, 19 Apr 2021 13:08:16 +0000 Henry S. Thompson better dd error handling
Mon, 19 Apr 2021 13:07:58 +0000 Henry S. Thompson (none)
Sun, 18 Apr 2021 17:03:45 +0000 Henry S. Thompson bare minimum working
Fri, 16 Apr 2021 18:28:00 +0000 Henry S. Thompson triple args checked, filename opened
Fri, 16 Apr 2021 13:15:23 +0000 Henry S. Thompson help format hacking done
Fri, 16 Apr 2021 12:55:05 +0000 Henry S. Thompson basic help format hacking works
Fri, 16 Apr 2021 09:01:16 +0000 Henry S. Thompson (none)
Fri, 16 Apr 2021 09:00:17 +0000 Henry S. Thompson (none)
Thu, 15 Apr 2021 19:22:27 +0000 Henry S. Thompson just struggling with argparse
Thu, 15 Apr 2021 10:59:25 +0000 Henry S. Thompson support a command to receive each result,
Wed, 14 Apr 2021 20:15:32 +0000 Henry S. Thompson accepts index lines, less line-at-a-time
Wed, 14 Apr 2021 10:08:41 +0000 Henry S. Thompson working with one input
Wed, 14 Apr 2021 08:57:43 +0000 Henry S. Thompson -w and -h working
Tue, 13 Apr 2021 17:52:31 +0000 Henry S. Thompson working on flags
Tue, 13 Apr 2021 17:02:09 +0000 Henry S. Thompson new
Tue, 16 Mar 2021 16:20:02 +0000 Henry S. Thompson working with locking and copying
Mon, 15 Mar 2021 14:26:42 +0000 Henry S. Thompson working for -t 2 -c 2
Mon, 15 Mar 2021 14:20:00 +0000 Henry S. Thompson minor
Sun, 14 Mar 2021 21:28:02 +0000 Henry S. Thompson prepare for real parallel distribution
Sun, 14 Mar 2021 21:25:01 +0000 Henry S. Thompson environment improvements
Wed, 03 Mar 2021 19:33:56 +0000 Henry S. Thompson trying to move to slurm
Sat, 09 May 2020 16:16:28 +0100 Henry S. Thompson improved F handling/logging
Fri, 08 May 2020 19:52:36 +0100 Henry S. Thompson keep separate antecedents separate, buggy?
Thu, 07 May 2020 18:47:24 +0100 Henry S. Thompson track redirects, need to use full crawldiagnostics.warc.gz for "location:" and "Uri:"
Thu, 07 May 2020 11:33:24 +0100 Henry S. Thompson refactor, change summary print (problem?)
Wed, 06 May 2020 18:28:52 +0100 Henry S. Thompson bare framework working
Wed, 06 May 2020 14:25:44 +0100 Henry S. Thompson starting on tool to assemble as complete as we have info wrt a seed URI
Wed, 06 May 2020 14:24:42 +0100 Henry S. Thompson use local .m2/repository for Hadoop 3.4.0
Wed, 06 May 2020 14:23:33 +0100 Henry S. Thompson works for big files with Hadoop 3.4.0
Wed, 06 May 2020 14:22:48 +0100 Henry S. Thompson x
Tue, 28 Apr 2020 19:02:34 +0100 Henry S. Thompson log truncations
Tue, 28 Apr 2020 19:02:14 +0100 Henry S. Thompson impose some limits
Tue, 28 Apr 2020 19:01:41 +0100 Henry S. Thompson x
Fri, 24 Apr 2020 20:12:44 +0100 Henry S. Thompson x
Fri, 24 Apr 2020 20:12:29 +0100 Henry S. Thompson mostly from Sebastian
Fri, 24 Apr 2020 20:03:29 +0100 Henry S. Thompson misc
Fri, 24 Apr 2020 20:01:35 +0100 Henry S. Thompson misc
Fri, 24 Apr 2020 20:01:25 +0100 Henry S. Thompson fix from Sebastian
Fri, 24 Apr 2020 19:57:16 +0100 Henry S. Thompson misc
Fri, 24 Apr 2020 19:55:11 +0100 Henry S. Thompson misc
Fri, 24 Apr 2020 15:20:33 +0100 Henry S. Thompson several efficiency (hopefully) tweaks
Thu, 23 Apr 2020 17:26:55 +0100 Henry S. Thompson x
Thu, 23 Apr 2020 17:25:25 +0100 Henry S. Thompson switch for use on login server, invoke by hand with 0/1 as only cmd line arg
Wed, 22 Apr 2020 18:42:40 +0100 Henry S. Thompson java stuff
Wed, 22 Apr 2020 18:42:23 +0100 Henry S. Thompson try nutch fetch for big pdfs
Wed, 15 Apr 2020 18:44:18 +0100 Henry S. Thompson final most general version
Tue, 14 Apr 2020 17:52:34 +0100 Henry S. Thompson too big for /dev/shm, split in half
Tue, 14 Apr 2020 16:10:22 +0100 Henry S. Thompson one-off to convert big extracts.tar into lots of smaller ones
Mon, 13 Apr 2020 17:29:31 +0100 Henry S. Thompson as used successfully for 3rd run
Mon, 13 Apr 2020 15:24:32 +0100 Henry S. Thompson ready to try another pass with robust diff checking
Mon, 13 Apr 2020 14:12:12 +0100 Henry S. Thompson working towards more robust diff checking
Sat, 11 Apr 2020 13:41:46 +0100 Henry S. Thompson a few tweaks after 2nd parallel run
Fri, 10 Apr 2020 18:45:30 +0100 Henry S. Thompson another few log fixes
Fri, 10 Apr 2020 18:42:08 +0100 Henry S. Thompson as running, modulo 1 log output wrong
Fri, 10 Apr 2020 18:22:48 +0100 Henry S. Thompson log more, work around more glitches
Fri, 10 Apr 2020 18:22:24 +0100 Henry S. Thompson x
Wed, 08 Apr 2020 14:11:04 +0100 Henry S. Thompson start trying to work around failures
Wed, 08 Apr 2020 11:27:33 +0100 Henry S. Thompson parallelised version of reExtract.sh
Tue, 07 Apr 2020 18:00:29 +0100 Henry S. Thompson complete change of array var construction, used it for log file names too, tar update enabled, so maybe complete but w/o any parallel
Sat, 04 Apr 2020 15:31:58 +0100 Henry S. Thompson added computation of required additions to tar file, but not actually added
Fri, 03 Apr 2020 19:04:06 +0100 Henry S. Thompson refactored, not tested
Fri, 03 Apr 2020 17:35:17 +0100 Henry S. Thompson done through re-extraction, fixing tars still to come
Thu, 02 Apr 2020 19:21:21 +0100 Henry S. Thompson sketching more
Thu, 02 Apr 2020 19:14:23 +0100 Henry S. Thompson towards re-running extraction in part
Thu, 02 Apr 2020 19:13:40 +0100 Henry S. Thompson up the time limit
Thu, 02 Apr 2020 19:13:14 +0100 Henry S. Thompson clean up after ourselves
Thu, 26 Mar 2020 15:29:12 +0000 Henry S. Thompson fixed scope problem in tar step
Thu, 26 Mar 2020 12:24:30 +0000 Henry S. Thompson sync up filenames and log names,
Thu, 26 Mar 2020 12:23:33 +0000 Henry S. Thompson pass through extract args
Tue, 24 Mar 2020 17:53:35 +0000 Henry S. Thompson towards sub-division of resulting tar files
Tue, 24 Mar 2020 17:52:52 +0000 Henry S. Thompson not relevant
Thu, 19 Mar 2020 13:13:35 +0000 Henry S. Thompson x
Thu, 19 Mar 2020 13:13:02 +0000 Henry S. Thompson better quoting
Wed, 18 Mar 2020 21:54:47 +0000 Henry S. Thompson try to fix multi-line lossage
Wed, 18 Mar 2020 21:52:06 +0000 Henry S. Thompson fix missing use of $t
Wed, 18 Mar 2020 15:23:53 +0000 Henry S. Thompson first cut at doing extraction here
Wed, 18 Mar 2020 13:42:47 +0000 Henry S. Thompson finally hacked something that works
Wed, 18 Mar 2020 11:08:47 +0000 Henry S. Thompson (none)
Wed, 18 Mar 2020 11:08:23 +0000 Henry S. Thompson (none)
Wed, 18 Mar 2020 11:06:35 +0000 Henry S. Thompson x
Wed, 18 Mar 2020 10:57:56 +0000 Henry S. Thompson more job scripts
Wed, 18 Mar 2020 10:57:21 +0000 Henry S. Thompson more job scripts
Wed, 18 Mar 2020 10:56:23 +0000 Henry S. Thompson local setup
Mon, 16 Mar 2020 15:57:23 +0000 Henry S. Thompson copied from valhalla/bin
Thu, 27 Feb 2020 17:18:02 +0000 Henry S. Thompson fix a mis-folded link file
Thu, 27 Feb 2020 13:24:19 +0000 Henry S. Thompson sic
Wed, 26 Feb 2020 21:50:25 +0000 Henry S. Thompson use awk to do a join between links and 1132dates
Wed, 26 Feb 2020 16:02:22 +0000 Henry S. Thompson works after minor tweaks
Wed, 26 Feb 2020 15:47:20 +0000 Henry S. Thompson modelled on plinks
Wed, 26 Feb 2020 12:40:14 +0000 Henry S. Thompson fixes to pdfx to timeout, use regex
Tue, 25 Feb 2020 22:14:07 +0000 Henry S. Thompson add args for start tar and number of tars
Tue, 25 Feb 2020 18:33:22 +0000 Henry S. Thompson give up on mpiexec_mpt
Tue, 25 Feb 2020 14:56:36 +0000 Henry S. Thompson bigger run, longer limit
Tue, 25 Feb 2020 10:34:41 +0000 Henry S. Thompson logging tweaks, preparing for timeout on problem pdfs
Mon, 24 Feb 2020 12:16:10 +0000 Henry S. Thompson longer run, terser logging
Mon, 24 Feb 2020 00:44:53 +0000 Henry S. Thompson more logging
Sun, 23 Feb 2020 16:48:34 +0000 Henry S. Thompson refactor to address tarred-up pdfs
Wed, 19 Feb 2020 10:41:59 +0000 Henry S. Thompson merge
Wed, 19 Feb 2020 10:39:05 +0000 Henry S. Thompson try harder not to write empty links files
Tue, 18 Feb 2020 22:41:08 +0000 Henry Thompson only create links file if there are some
Tue, 18 Feb 2020 22:19:51 +0000 Henry Thompson typos
Tue, 18 Feb 2020 21:33:35 +0000 Henry Thompson switch to file loop inside python, assume file index integer in pipe as well as filename, check /dev/shm/stopJob