*15.../*

A small number of warc and crawldiagnostic files, sufficient to run
tests using ix.py on the first 90 lines from cdx/cdx-00000.gz, using
the data directory as the cache. So, e.g.

>: uz cdx/cdx-00000.gz | head -90 | ix.py -x -h -r /lustre/home/dc007/hst/data | wc -l

will run about 3 times faster than

>: uz cdx/cdx-00000.gz | head -90 | ix.py -x -h | wc -l

I've pretty much used up my disk quota for the local (.../data) disk,
so please don't use -r .../data except for the first 90 lines of
cdx-00000 and the ones in sample_pdfs/65_pdfs.txt. Obviously, if you
are working intensively on 724_pdfs.txt or 509_pdfs.txt or any other
subset of the index files, it's a good idea to use your _own_ disk
space for the cache, with soft links to the files in the existing cache.
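
One way to set up such a cache is sketched below. It is only a sketch:
link_cache is a made-up helper name, and it assumes the existing cache
is an ordinary directory tree of regular files, which may not match how
ix.py actually lays things out. It recreates the directories under your
own space and soft-links each existing file, so nothing is copied.

```shell
# Hypothetical helper (not part of ix.py): mirror an existing cache tree
# into your own directory using soft links instead of copies.
# Usage: link_cache EXISTING_CACHE MY_CACHE
link_cache() {
  # Resolve the source to an absolute path so the links work from anywhere.
  src=$(cd "$1" && pwd) || return 1
  dst=$2
  mkdir -p "$dst" || return 1
  # Recreate the directory layout of the existing cache.
  (cd "$src" && find . -type d) | while IFS= read -r d; do
    mkdir -p "$dst/$d"
  done
  # Link each regular file, leaving alone anything already present.
  (cd "$src" && find . -type f) | while IFS= read -r f; do
    [ -e "$dst/$f" ] || ln -s "$src/${f#./}" "$dst/$f"
  done
}
```

After something like link_cache /lustre/home/dc007/hst/data followed by
a directory of your own, you would give that directory to -r in place
of .../data, so that new cache files land in your quota.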

*cdx/warc/*

All the index files

*samplepdfs/*

Small subsets of index files for pdfs in segment 50