annotate data/CC-MAIN-2019-35/00README @ 148:f0bee28995f1

do the work for cdx2sql
author Henry S. Thompson <ht@inf.ed.ac.uk>
date Mon, 25 Oct 2021 15:05:46 +0000
parents 128b18459f9e
children
*15.../*

A small number of warc and crawldiagnostic files, sufficient to run
tests using ix.py on the first 90 lines from cdx/cdx-00000.gz, using
the data directory as the cache. So, e.g.

>: uz cdx/cdx-00000.gz | head -90 | ix.py -x -h -r /lustre/home/dc007/hst/data | wc -l

will run about 3 times faster than

>: uz cdx/cdx-00000.gz | head -90 | ix.py -x -h | wc -l
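To check the speed-up for yourself, something like the following would do. It is only a sketch: elapsed is a made-up helper name, not part of ix.py, it reports whole seconds only, and it assumes uz and ix.py are executables on your PATH (a shell alias would not be visible to sh -c).

```shell
# elapsed CMD: run CMD with sh -c, discard its output, print the whole
# seconds it took. Runs in a subshell so no variables leak out.
# (Hypothetical helper, for illustration only.)
elapsed() (
    start=$(date +%s)
    sh -c "$1" >/dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
)

# Intended use, from the directory that contains cdx/ (commands copied
# from the two examples above):
#   elapsed 'uz cdx/cdx-00000.gz | head -90 | ix.py -x -h -r /lustre/home/dc007/hst/data | wc -l'
#   elapsed 'uz cdx/cdx-00000.gz | head -90 | ix.py -x -h | wc -l'
```

Because the pipeline's output is thrown away, use this for timing only, not for checking the line counts.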

I've pretty much used up my disk quota for the local (.../data) disk,
so please don't use -r .../data except for the first 90 in cdx-00000
and the ones in sample_pdfs/65_pdfs.txt. Obviously if you are working
intensively on 724_pdfs.txt or 509_pdfs.txt or any other subset of the
index files, using your _own_ disk space for the cache is a good idea,
including soft links to the files in the existing cache.
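One way to set up such a cache is sketched below. It assumes, without having checked, that ix.py adds newly fetched files underneath whatever directory -r names, so the new cache gets real directories of its own and a soft link for each file already in the existing cache. link_cache is a made-up name, and both arguments need to be absolute paths.

```shell
# link_cache OLD NEW: recreate OLD's directory tree under NEW and put a
# soft link in NEW for every regular file found in OLD.
# OLD and NEW must be absolute paths. Hypothetical helper, not part of ix.py.
link_cache() (
    old=$1
    new=$2
    mkdir -p "$new" || exit 1
    cd "$old" || exit 1
    find . -type f | while IFS= read -r f; do
        f=${f#./}                          # drop find's leading ./
        mkdir -p "$new/$(dirname "$f")"    # real directory in your own space
        # leave anything already present in NEW untouched
        [ -e "$new/$f" ] || ln -s "$old/$f" "$new/$f"
    done
)

# Intended use (my_cache is a placeholder for a directory in your own space):
#   link_cache /lustre/home/dc007/hst/data "$HOME/my_cache"
#   uz cdx/cdx-00000.gz | head -90 | ix.py -x -h -r "$HOME/my_cache" | wc -l
```

The links take almost no quota, and anything fetched afterwards lands in your space rather than in the shared one.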

*cdx/warc/*

All the index files

*samplepdfs/*

Small subsets of index files for pdfs in segment 50