view data/CC-MAIN-2019-35/samplepdfs/00README @ 185:acae526510e2

too many overdue updates to break down
author Henry S. Thompson <ht@inf.ed.ac.uk>
date Wed, 06 Dec 2023 13:38:58 +0000
parents 128b18459f9e

Subsets of application/pdf index lines for segment 50

1128 contains just the lines from cdx-00000 for segment 1566027323067.50
that contain application/pdf
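
The selection described above could be sketched roughly as follows. This
is only an illustration, not the command actually used: it assumes plain
substring matching on each index line, and both the function name
select_pdf_lines and the sample lines are made up.

```python
def select_pdf_lines(lines, segment="1566027323067.50", mime="application/pdf"):
    """Keep index lines that mention both the segment ID and the MIME type.

    Assumption: a plain substring test is enough to identify both.
    """
    return [line for line in lines if segment in line and mime in line]


# Made-up index lines, for illustration only.
sample = [
    'com,example)/a.pdf 20190820000000 {"mime": "application/pdf", '
    '"filename": "segments/1566027323067.50/warc/a.warc.gz"}',
    'com,example)/b.html 20190820000000 {"mime": "text/html", '
    '"filename": "segments/1566027323067.50/warc/b.warc.gz"}',
    'com,example)/c.pdf 20190820000000 {"mime": "application/pdf", '
    '"filename": "segments/1566027313747.38/warc/c.warc.gz"}',
]
pdf_lines = select_pdf_lines(sample)
```

Only the first sample line survives: the second has the wrong MIME type
and the third belongs to a different segment.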

724 contains the first 824 lines of 1128, with all the crawldiagnostics lines removed.
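
A minimal sketch of that step, assuming that a crawldiagnostics line can
be recognised by the substring "crawldiagnostics" anywhere in the line
(the function name head_without_diagnostics and the demo lines are
hypothetical):

```python
from itertools import islice


def head_without_diagnostics(lines, n=824, marker="crawldiagnostics"):
    """Take the first n lines, then drop any that mention the marker."""
    return [line for line in islice(lines, n) if marker not in line]


# Made-up lines, for illustration only.
demo = ["a warc/x", "b crawldiagnostics/y", "c warc/z", "d warc/w"]
kept = head_without_diagnostics(demo, n=3)
```

Note the order: the head is taken first and the filter applied second,
so the result has fewer than n lines whenever any are dropped.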

errs.txt and errs_100.txt contain the error log from running
  ix.py -x -b -c '/lustre/home/dc007/hst/bin/qpdf_check -' -r /lustre/home/dc007/hst/data
on 724 and the first 100 lines of 724 respectively.

65 contains the subset of the first 100 lines of 724 that are valid
PDFs (i.e. the first 100 lines minus the 35 listed in errs_100.txt)

509 contains the subset of the lines of 724 that are valid PDFs
(i.e. all 724 lines minus the 215 listed in errs.txt)
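
The subsetting and its arithmetic can be checked with a sketch like the
one below. The format of errs.txt is not described here, so this assumes
the failing entries have already been mapped to 0-based line positions;
drop_failed, the placeholder lines and the failing positions are all
hypothetical.

```python
def drop_failed(lines, failed_positions):
    """Return the lines whose 0-based position is not in failed_positions."""
    failed = set(failed_positions)
    return [line for i, line in enumerate(lines) if i not in failed]


# Placeholder data standing in for the real index lines.
lines_724 = [f"line-{i}" for i in range(724)]

# Hypothetical failing positions: 215 of them overall, 35 in the first 100.
valid_all = drop_failed(lines_724, range(0, 430, 2))
valid_100 = drop_failed(lines_724[:100], range(35))

print(len(valid_all), len(valid_100))
```

With these counts the sketch yields 509 and 65 lines respectively,
matching the file names (724 - 215 = 509 and 100 - 35 = 65).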

All the files necessary to use 65 are in the cache, but please _don't_
use the cache for anything else in this directory, as I'm out of disk space.