view data/CC-MAIN-2019-35/00README @ 187:9805323d9969

add lastmod to cdx lines, start writing test case
author Henry S. Thompson <ht@inf.ed.ac.uk>
date Mon, 23 Sep 2024 16:35:22 +0100
parents 128b18459f9e

*15.../*

A small number of warc and crawldiagnostic files, sufficient to run
tests using ix.py on the first 90 lines from cdx/cdx-00000.gz, using
the data directory as the cache.  So, e.g.

>: uz cdx/cdx-00000.gz | head -90 | ix.py -x -h -r /lustre/home/dc007/hst/data | wc -l

will run about 3 times faster than 

>: uz cdx/cdx-00000.gz | head -90 | ix.py -x -h | wc -l

I've pretty much used up my disk quota for the local (.../data) disk,
so please don't use -r .../data except for the first 90 lines of
cdx-00000 and the entries in sample_pdfs/65_pdfs.txt.  If you are
working intensively on 724_pdfs.txt or 509_pdfs.txt or any other subset
of the index files, it's a good idea to use your _own_ disk space for
the cache.  That cache can include soft links to the files in the
existing cache.
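
One way to set up such a cache is sketched below.  This is only an
illustration, not part of ix.py: link_cache is a hypothetical helper,
$HOME/cc_cache is a made-up destination, and it assumes that ix.py
writes newly fetched files into whatever directory -r names.  It
mirrors the directory tree of the existing cache and soft-links each
file into it, so nothing is copied.

```shell
# Hypothetical helper (not part of ix.py): mirror an existing cache
# into a directory of your own, using soft links instead of copies.
# Pass SRC as an absolute path so that the links resolve from anywhere.
link_cache() {
    src=$1    # existing cache
    dest=$2   # your own cache directory
    mkdir -p "$dest" || return 1
    # Recreate the directory tree under dest.
    (cd "$src" && find . -type d) | while IFS= read -r d; do
        mkdir -p "$dest/${d#./}"
    done
    # Link each regular file, leaving alone anything already in dest.
    (cd "$src" && find . -type f) | while IFS= read -r f; do
        f=${f#./}
        [ -e "$dest/$f" ] || [ -L "$dest/$f" ] || ln -s "$src/$f" "$dest/$f"
    done
}

# Possible usage (destination path is an assumption):
#   link_cache /lustre/home/dc007/hst/data "$HOME/cc_cache"
#   uz cdx/cdx-00000.gz | head -90 | ix.py -x -h -r "$HOME/cc_cache" | wc -l
```

Files that are already present in your own directory are left alone,
so the helper can be rerun after the existing cache grows.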

*cdx/warc/*

All the index files

*sample_pdfs/*

Small subsets of index files for PDFs in segment 50