diff data/CC-MAIN-2019-35/00README @ 132:128b18459f9e
sic
author | Henry S. Thompson <ht@inf.ed.ac.uk> |
---|---|
date | Wed, 14 Jul 2021 15:30:29 +0000 |
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/data/CC-MAIN-2019-35/00README Wed Jul 14 15:30:29 2021 +0000
@@ -0,0 +1,26 @@
+*15.../*
+
+A small number of warc and crawldiagnostic files, sufficient to run
+tests using ix.py on the first 90 lines from cdx/cdx-00000.gz, using
+the data directory as the cache. So, e.g.
+
+>: uz cdx/cdx-00000.gz | head -90 | ix.py -x -h -r /lustre/home/dc007/hst/data | wc -l
+
+will run about 3 times faster than
+
+>: uz cdx/cdx-00000.gz | head -90 | ix.py -x -h | wc -l
+
+I've pretty much used up my disk quota for the local (.../data) disk,
+so please don't use -r .../data except for the first 90 in cdx-00000
+and the ones in sample_pdfs/65_pdfs.txt. Obviously if you are working
+intensively on 724_pdfs.txt or 509_pdfs.txt or any other subset of the
+index files, using your _own_ disk space for the cache is a good idea,
+including soft links to the files in the existing cache.
+
+*cdx/warc/*
+
+All the index files
+
+*samplepdfs/*
+
+Small subsets of index files for pdfs in segment 50
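
The README's advice to keep a private cache that soft-links into the existing one could be set up roughly as follows. This is only a sketch under assumptions: the function name `link_cache` is hypothetical, and it assumes the cache passed to `ix.py -r` is an ordinary directory tree of files, which the README implies but does not spell out.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repository): mirror the directory
# tree of an existing cache into a private cache directory, creating a
# soft link for each file instead of copying it.
# usage: link_cache EXISTING_CACHE MY_CACHE
link_cache() {
    # Resolve both paths to absolute form so the links work from anywhere.
    src=$(cd "$1" && pwd) || return 1
    mkdir -p "$2" || return 1
    dst=$(cd "$2" && pwd) || return 1

    # Recreate the directory structure first.
    (cd "$src" && find . -type d) | while IFS= read -r d; do
        mkdir -p "$dst/${d#./}"
    done

    # Then link every regular file, leaving existing entries untouched.
    (cd "$src" && find . -type f) | while IFS= read -r f; do
        f=${f#./}
        [ -e "$dst/$f" ] || [ -L "$dst/$f" ] || ln -s "$src/$f" "$dst/$f"
    done
}
```

Called as `link_cache /lustre/home/dc007/hst/data "$HOME/my_cache"` (the second path is a placeholder), this would leave a tree of links that occupies almost no quota; new files would then be written to the private directory, assuming ix.py stores fresh downloads under whatever path it is given with `-r`.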