comparison data/CC-MAIN-2019-35/00README @ 132:128b18459f9e
sic
author | Henry S. Thompson <ht@inf.ed.ac.uk> |
---|---|
date | Wed, 14 Jul 2021 15:30:29 +0000 |
131:bf943a2f0f37 | 132:128b18459f9e |
---|---|
*15.../*

A small number of warc and crawldiagnostic files, sufficient to run
tests using ix.py on the first 90 lines from cdx/cdx-00000.gz, using
the data directory as the cache. So, e.g.

>: uz cdx/cdx-00000.gz | head -90 | ix.py -x -h -r /lustre/home/dc007/hst/data | wc -l

will run about 3 times faster than

>: uz cdx/cdx-00000.gz | head -90 | ix.py -x -h | wc -l
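To check the speed-up on your own account, a small timing helper can be wrapped around each pipeline. This is only a sketch: `elapsed_seconds` is a hypothetical name, not part of ix.py or uz, and it measures wall-clock time at one-second resolution.

```sh
# Hypothetical helper: run a command, discard its stdout, and print
# the wall-clock time it took in whole seconds.
elapsed_seconds() {
  start=$(date +%s)
  "$@" >/dev/null
  end=$(date +%s)
  echo $((end - start))
}

# Usage (pipelines have to be passed through sh -c):
#   elapsed_seconds sh -c 'uz cdx/cdx-00000.gz | head -90 | ix.py -x -h -r /lustre/home/dc007/hst/data | wc -l'
#   elapsed_seconds sh -c 'uz cdx/cdx-00000.gz | head -90 | ix.py -x -h | wc -l'
```

Running each variant a few times and comparing the printed numbers should give a rough idea of the ratio; file-system caching will make the first run of either variant slower than later ones.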
I've pretty much used up my disk quota for the local (.../data) disk,
so please don't use -r .../data except for the first 90 lines of
cdx-00000 and the entries listed in sample_pdfs/65_pdfs.txt. If you are
working intensively on 724_pdfs.txt or 509_pdfs.txt or any other subset
of the index files, it is a good idea to use your _own_ disk space for
the cache, with soft links to the files in the existing cache.

*cdx/warc/*

All the index files

*samplepdfs/*

Small subsets of index files for pdfs in segment 50