annotate data/CC-MAIN-2019-35/samplepdfs/00README @ 185:acae526510e2

too many overdue updates to break down
author Henry S. Thompson <ht@inf.ed.ac.uk>
date Wed, 06 Dec 2023 13:38:58 +0000
parents 128b18459f9e
children
Subsets of application/pdf index lines for segment 50

1128 is just the segment 1566027323067.50 lines from cdx-00000 that
contain application/pdf

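A minimal sketch of how a subset like 1128 could be selected, not
necessarily how it was actually produced. It assumes each index line
carries the segment id inside its WARC path as "segments/<id>/"; the
function name is hypothetical.

```python
from typing import Iterable, List

SEGMENT = "1566027323067.50"  # segment id quoted above
MIME = "application/pdf"


def pdf_lines_for_segment(lines: Iterable[str],
                          segment: str = SEGMENT,
                          mime: str = MIME) -> List[str]:
    """Keep index lines that mention both the segment and the MIME type."""
    # Assumption: the segment appears in the line as "segments/<id>/".
    marker = f"segments/{segment}/"
    # Plain substring tests, matching "lines ... that contain application/pdf".
    return [ln for ln in lines if marker in ln and mime in ln]
```

Applied to the lines of cdx-00000 this would yield the candidate
lines; a stricter version would parse the JSON part of each line
rather than rely on substring matches.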
724 contains the first 824 lines of 1128, with all the crawldiagnostics lines removed.

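The step from 1128 to 724 can be sketched as below. This is an
illustration only: it assumes a crawldiagnostics line can be
recognised by the substring "crawldiagnostics", and the function name
is hypothetical.

```python
from itertools import islice
from typing import Iterable, List


def first_n_without_diagnostics(lines: Iterable[str],
                                n: int = 824,
                                marker: str = "crawldiagnostics") -> List[str]:
    """Take the first n lines, then drop those containing the marker."""
    # islice stops after n lines, so the rest of the input is never read.
    return [ln for ln in islice(lines, n) if marker not in ln]
```

The file name suggests that 724 lines remain, i.e. that 100 of the
first 824 lines were crawldiagnostics lines.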
errs.txt and errs_100.txt contain the error log from running
  ix.py -x -b -c '/lustre/home/dc007/hst/bin/qpdf_check -' -r /lustre/home/dc007/hst/data
on 724 and the first 100 lines of 724 respectively.

65 contains the subset of the first 100 lines in 724 that are valid
PDFs (i.e. removing the 35 that are listed in errs_100.txt).

509 contains the subset of the lines in 724 that are valid PDFs
(i.e. removing the 215 that are listed in errs.txt).

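Both 65 and 509 are "input minus failures" subsets. A rough sketch of
that subtraction follows. The layout of errs.txt and errs_100.txt is
not described here, so the sketch takes an already extracted set of
failed keys and a caller-supplied key function; both are
hypothetical, as is the function name.

```python
from typing import Callable, Iterable, List, Set


def drop_failed(lines: Iterable[str],
                failed: Set[str],
                key: Callable[[str], str]) -> List[str]:
    """Return the lines whose key does not appear in the failed set."""
    # 'key' maps an index line to whatever identifier the error log uses
    # (an assumption: the log format is not specified in this README).
    return [ln for ln in lines if key(ln) not in failed]


# Sanity check on the counts stated above: subset size = input - failures.
assert 100 - 35 == 65
assert 724 - 215 == 509
```

For instance, if the error log identified entries by the first
whitespace-separated field of the index line, key could be
lambda ln: ln.split(" ", 1)[0].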
All the files necessary to use 65 are in the cache, but please _don't_
use the cache for anything else in this directory, as I'm out of disk space.