view data/CC-MAIN-2019-35/00README @ 148:f0bee28995f1
do the work for cdx2sql
author | Henry S. Thompson <ht@inf.ed.ac.uk> |
---|---|
date | Mon, 25 Oct 2021 15:05:46 +0000 |
parents | 128b18459f9e |
children |
*15.../*
  A small number of warc and crawldiagnostic files, sufficient to run
  tests using ix.py on the first 90 lines from cdx/cdx-00000.gz, using
  the data directory as the cache. So, e.g.

    >: uz cdx/cdx-00000.gz | head -90 | ix.py -x -h -r /lustre/home/dc007/hst/data | wc -l

  will run about 3 times faster than

    >: uz cdx/cdx-00000.gz | head -90 | ix.py -x -h | wc -l

  I've pretty much used up my disk quota for the local (.../data) disk,
  so please don't use -r .../data except for the first 90 in cdx-00000
  and the ones in sample_pdfs/65_pdfs.txt.

  Obviously if you are working intensively on 724_pdfs.txt or
  509_pdfs.txt or any other subset of the index files, using your _own_
  disk space for the cache is a good idea, including soft links to the
  files in the existing cache.

*cdx/warc/*
  All the index files

*samplepdfs/*
  Small subsets of index files for pdfs in segment 50
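
One possible way to set up your own cache with soft links to the existing one is sketched below. This is only an illustration, not part of ix.py: the function name link_cache and the destination $HOME/cc_cache are made up for the example, and it assumes that a cache is just a directory tree whose layout can be mirroried as-is under a new root. It links every file it finds under the source directory, so point it at whichever subdirectory you actually need.

```shell
#!/bin/sh
# link_cache SRC DST  (hypothetical helper, not part of ix.py)
# Mirror the directory tree of SRC under DST and create a soft link in DST
# for every regular file in SRC. Existing entries in DST are left untouched.
link_cache() {
    src=$(cd "$1" && pwd) || return 1   # absolute path, so links resolve from anywhere
    mkdir -p "$2" || return 1
    dst=$(cd "$2" && pwd) || return 1

    # Recreate the directories as real directories, so that new files
    # can later be added to DST without touching SRC.
    (cd "$src" && find . -type d) | while IFS= read -r d; do
        d=${d#./}
        mkdir -p "$dst/$d"
    done

    # Link each file unless something already exists at that path.
    (cd "$src" && find . -type f) | while IFS= read -r f; do
        f=${f#./}
        [ -e "$dst/$f" ] || [ -L "$dst/$f" ] || ln -s "$src/$f" "$dst/$f"
    done
}

# Example use (destination path is an assumption, pick your own):
#   link_cache /lustre/home/dc007/hst/data "$HOME/cc_cache"
#   uz cdx/cdx-00000.gz | head -90 | ix.py -x -h -r "$HOME/cc_cache" | wc -l
```

Because the directories in the new cache are real and only the files are links, anything added later lands on your own disk rather than in the shared .../data area.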