comparison data/CC-MAIN-2019-35/samplepdfs/00README @ 132:128b18459f9e
sic
author | Henry S. Thompson <ht@inf.ed.ac.uk> |
---|---|
date | Wed, 14 Jul 2021 15:30:29 +0000 |
131:bf943a2f0f37 | 132:128b18459f9e

Subsets of application/pdf index lines for segment 50

1128 is just the segment 1566027323067.50 lines from cdx-00000 that
contain application/pdf

724 contains the first 824 lines of 1128, with all the crawldiagnostics lines removed.

errs.txt and errs_100.txt contain the error log from running
  ix.py -x -b -c '/lustre/home/dc007/hst/bin/qpdf_check -' -r /lustre/home/dc007/hst/data
on 724 and the first 100 lines of 724 respectively.

65 contains the subset of the first 100 lines in 724 that are valid
pdfs (i.e. removing the 35 that are listed in errs_100.txt)

509 contains the subset of the lines in 724 that are valid pdfs (i.e. removing the 215 that are listed in errs.txt)

All the files necessary to use 65 are in the cache, but please _don't_
use the cache for anything else in this directory, as I'm out of disk space.
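
For orientation, the way the subset files above relate to one another can be sketched roughly as follows. This is only an illustration under assumptions, not the commands actually used: the function names are hypothetical, the toy lines are made-up strings rather than real index lines, and the set of failing lines stands in for whatever errs.txt and errs_100.txt actually record (their format is not described here).

```python
from itertools import islice
from typing import Iterable, Iterator, List, Set


def pdf_lines(lines: Iterable[str]) -> Iterator[str]:
    """Keep only lines that mention application/pdf (cf. 1128)."""
    return (line for line in lines if "application/pdf" in line)


def drop_crawldiagnostics(lines: Iterable[str], limit: int) -> List[str]:
    """Take the first `limit` lines, then drop any that mention
    crawldiagnostics (cf. 724, built from the first 824 lines of 1128)."""
    return [line for line in islice(lines, limit)
            if "crawldiagnostics" not in line]


def drop_failed(lines: Iterable[str], failed: Set[str]) -> List[str]:
    """Remove lines known to have failed the PDF check (cf. 65 and 509).
    ASSUMPTION: `failed` holds the failing index lines verbatim; mapping
    real error-log entries back to index lines is not shown."""
    return [line for line in lines if line not in failed]


if __name__ == "__main__":
    # Made-up toy lines, not real cdx entries.
    toy = [
        "a application/pdf segments/x/warc/1",
        "b text/html segments/x/warc/2",
        "c application/pdf segments/x/crawldiagnostics/3",
        "d application/pdf segments/x/warc/4",
    ]
    pdfs = list(pdf_lines(toy))
    kept = drop_crawldiagnostics(pdfs, limit=3)
    valid = drop_failed(kept, failed={toy[3]})
    print(len(pdfs), len(kept), len(valid))
```

The file names follow the same arithmetic: 100 - 35 = 65 and 724 - 215 = 509.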