view data/CC-MAIN-2019-35/samplepdfs/00README @ 185:acae526510e2
too many overdue updates to break down
author | Henry S. Thompson <ht@inf.ed.ac.uk> |
---|---|
date | Wed, 06 Dec 2023 13:38:58 +0000 |
parents | 128b18459f9e |
children |
Subsets of application/pdf index lines for segment 50

1128 is just the segment 1566027323067.50 lines from cdx-00000 that
contain application/pdf.

724 contains the first 824 lines of 1128, with all the
crawldiagnostics lines removed.

errs.txt and errs_100.txt contain the error log from running
   ix.py -x -b -c '/lustre/home/dc007/hst/bin/qpdf_check -' -r /lustre/home/dc007/hst/data
on 724 and the first 100 lines of 724 respectively.

65 contains the subset of the first 100 lines in 724 that are valid
pdfs (i.e. removing the 35 that are listed in errs_100.txt).

509 contains the subset of the lines in 724 that are valid pdfs
(i.e. removing the 215 that are listed in errs.txt).

All the files necessary to use 65 are in the cache, but please _don't_
use the cache for anything else in this directory, as I'm out of disk
space.
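
The subsetting steps described above can be sketched in Python. This is only an illustration, not the procedure actually used to build these files. The function names are hypothetical. It assumes that each index line is plain text containing the segment ID and the MIME type, that diagnostics records contain the string crawldiagnostics, and that the invalid lines named in an error log have already been collected into a set (how errs.txt identifies them is not specified here).

```python
from itertools import islice


def pdf_lines(lines, segment="1566027323067.50", mime="application/pdf"):
    """Keep index lines that mention both the segment and the MIME type."""
    return [line for line in lines if segment in line and mime in line]


def drop_diagnostics(lines, limit=824, marker="crawldiagnostics"):
    """Take the first `limit` lines, then drop any that contain `marker`."""
    return [line for line in islice(lines, limit) if marker not in line]


def drop_invalid(lines, bad):
    """Remove lines that appear in `bad`, a set of lines flagged as invalid.

    Assumption: the caller has already mapped error-log entries back to
    index lines; the log format itself is not modelled here.
    """
    return [line for line in lines if line not in bad]
```

With the counts given in this README, the last step takes the 724 lines down to 509 (724 - 215), and the first 100 lines down to 65 (100 - 35).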