diff data/CC-MAIN-2019-35/00README @ 132:128b18459f9e

sic
author Henry S. Thompson <ht@inf.ed.ac.uk>
date Wed, 14 Jul 2021 15:30:29 +0000
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/data/CC-MAIN-2019-35/00README	Wed Jul 14 15:30:29 2021 +0000
@@ -0,0 +1,26 @@
+*15.../*
+
+A small number of warc and crawldiagnostic files, sufficient to run
+tests using ix.py on the first 90 lines from cdx/cdx-00000.gz, using
+the data directory as the cache.  So, e.g.
+
+>: uz cdx/cdx-00000.gz | head -90 | ix.py -x -h -r /lustre/home/dc007/hst/data | wc -l
+
+will run about 3 times faster than
+
+>: uz cdx/cdx-00000.gz | head -90 | ix.py -x -h | wc -l
+
+I've pretty much used up my disk quota for the local (.../data) disk,
+so please don't use -r .../data except for the first 90 lines of
+cdx/cdx-00000.gz and the entries in sample_pdfs/65_pdfs.txt.  If you are
+working intensively on 724_pdfs.txt or 509_pdfs.txt or any other subset
+of the index files, using your _own_ disk space for the cache is a good
+idea, with soft links to the files in the existing cache.
+
+*cdx/warc/*
+
+All the index files
+
+*sample_pdfs/*
+
+Small subsets of index files for pdfs in segment 50
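
The README's suggestion of a private cache that soft-links to files in the existing cache could be set up roughly as follows. This is only a sketch: the function name `link_cache` and both directory arguments are hypothetical, and it assumes ix.py follows symbolic links when reading its cache, which the README implies but does not state.

```shell
#!/bin/sh
# Sketch only: link_cache and its two arguments are hypothetical,
# not names or paths taken from the README.
# $1 = existing (shared) cache; must be an ABSOLUTE path so the links resolve
# $2 = your own cache directory, on your own disk space
link_cache() {
    shared=$1
    mine=$2
    mkdir -p "$mine" || return 1
    # Recreate the shared cache's directory layout under your own cache.
    (cd "$shared" && find . -type d) | while IFS= read -r d; do
        mkdir -p "$mine/$d"
    done
    # Link each cached file instead of copying it; skip names already present
    # (including dangling links) so re-running is harmless.
    (cd "$shared" && find . -type f) | while IFS= read -r f; do
        [ -e "$mine/$f" ] || [ -L "$mine/$f" ] || ln -s "$shared/$f" "$mine/$f"
    done
}
```

After something like `link_cache /absolute/path/to/shared/data "$HOME/my_cache"` (both paths are placeholders), your own directory would be passed to -r in place of the shared one, so any newly fetched files land in your quota rather than the shared disk.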