diff data/CC-MAIN-2019-35/samplepdfs/00README @ 132:128b18459f9e

sic
author Henry S. Thompson <ht@inf.ed.ac.uk>
date Wed, 14 Jul 2021 15:30:29 +0000
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/data/CC-MAIN-2019-35/samplepdfs/00README	Wed Jul 14 15:30:29 2021 +0000
@@ -0,0 +1,18 @@
+Subsets of application/pdf index lines for segment 50
+
+1128 is just the segment 1566027323067.50 lines from cdx-00000 that
+contain application/pdf
+
+724 contains the first 824 lines of 1128, with all the crawldiagnostics lines removed.
+
+errs.txt and errs_100.txt contain the error log from running
+  ix.py -x -b -c '/lustre/home/dc007/hst/bin/qpdf_check -' -r /lustre/home/dc007/hst/data
+on 724 and the first 100 lines of 724 respectively.
+
+65 contains the subset of the first 100 lines in 724 that are valid
+PDFs (i.e. removing the 35 that are listed in errs_100.txt)
+
+509 contains the subset of the lines in 724 that are valid PDFs (i.e. removing the 215 that are listed in errs.txt)
+
+All the files necessary to use 65 are in the cache, but please _don't_
+use the cache for anything else in this directory, as I'm out of disk space.
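
The first two subsetting steps described in the README could be reproduced roughly as follows. This is a minimal sketch, not the commands actually used: the function names are hypothetical, and plain substring matching is an assumption taken from the README's wording ("lines ... that contain application/pdf"). The error-removal step that yields 65 and 509 is left out because the layout of errs.txt is not shown here.

```python
from itertools import islice

SEGMENT = "1566027323067.50"  # segment id quoted in the README


def pdf_lines_for_segment(lines, segment=SEGMENT, mime="application/pdf"):
    """Keep index lines that mention both the segment id and the MIME type.

    Assumption: a substring test is enough to pick out the relevant lines.
    """
    return [ln for ln in lines if segment in ln and mime in ln]


def drop_crawldiagnostics(lines, limit=824):
    """Take the first `limit` lines, then drop any mentioning crawldiagnostics."""
    return [ln for ln in islice(lines, limit) if "crawldiagnostics" not in ln]
```

Applied to the lines of cdx-00000, pdf_lines_for_segment would give the content described for 1128, and drop_crawldiagnostics applied to that result would give the content described for 724, assuming the substring tests match what was done originally.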