changeset 132:128b18459f9e
sic

author   | Henry S. Thompson <ht@inf.ed.ac.uk>
date     | Wed, 14 Jul 2021 15:30:29 +0000
parents  | bf943a2f0f37
children | 660dc255542a
files    | bin/00README data/00README data/CC-MAIN-2019-35/00README data/CC-MAIN-2019-35/samplepdfs/00README
diffstat | 4 files changed, 93 insertions(+), 0 deletions(-)
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/bin/00README	Wed Jul 14 15:30:29 2021 +0000
@@ -0,0 +1,39 @@
+Various tools and bash function sources.
+
+All the tools will give useful output if run with a --help argument
+
+functions.sh    Source this in your .bashrc to get useful functions
+                including ux, lss and btot
+
+cdx2tsv.py      Extract fields and subparts from fields of a CDX-format
+                index file
+
+clm.sh          Intended for use as a sub-command to ix.py: Given an
+    HTML response header, appends to a given file the Last-Modified value
+    if there is one, otherwise a blank line.
+
+ix.py           Efficiently extract some or all of response data contents of
+    Common Crawl WARC-format files
+
+qpdf            Wrapper for locally compiled version.
+
+                Qpdf as supplied only works with a named file, but this
+    wrapper supports streamed input.
+    _If_ it's invoked as
+        qpdf [args...] -
+    it takes input from stdin, saves it as /dev/shm/$USER/xxx.pdf
+    and runs
+        qpdf args... /dev/shm/$USER/xxx.pdf
+
+                Qpdf is the best available PDF validator
+    as far as I know.  See
+        http://qpdf.sourceforge.net/files/qpdf-manual.html
+    for documentation.
+
+qpdf_check      Runs qpdf with all the arguments needed to
+    make it run as a validator: no corrections are applied,
+    no warnings are output,
+    fails iff there are any errors in the input file.
+
+    Uses the above qpdf wrapper, so supports input either
+    from stdin or a named file
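The stdin handling described above for the qpdf wrapper is easy to picture as a small shell script. What follows is only a sketch of that behaviour, not the actual script in bin/ (which this changeset does not include); the name QPDF_REAL, the path of the locally compiled binary, and the clean-up of the temporary file are all assumptions:

    #!/bin/bash
    # Sketch of the streamed-input behaviour described above; not the real bin/qpdf.
    # QPDF_REAL is an assumed name/path for the locally compiled qpdf binary.
    QPDF_REAL=/usr/local/bin/qpdf
    if [ "${@: -1}" = "-" ]; then
        # Last argument is "-": stash stdin under /dev/shm and run qpdf on that file
        mkdir -p "/dev/shm/$USER"
        tmp="/dev/shm/$USER/xxx.pdf"
        cat > "$tmp"
        "$QPDF_REAL" "${@:1:$#-1}" "$tmp"
        status=$?
        rm -f "$tmp"        # whether the real wrapper cleans up is not stated
        exit $status
    else
        exec "$QPDF_REAL" "$@"
    fi

With something like this in place, a pipeline ending in e.g. qpdf --check - can validate streamed data, which is how qpdf_check is able to accept input from stdin as well as from a named file.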
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/data/00README	Wed Jul 14 15:30:29 2021 +0000
@@ -0,0 +1,10 @@
+*CC-MAIN-2019-35/*
+
+Around 100 sample WARC files, all the index files, index file for some
+pdfs
+
+*bin/*
+
+Release version of various tools
+
+See 00README files in subdirectories for more information.
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/data/CC-MAIN-2019-35/00README	Wed Jul 14 15:30:29 2021 +0000
@@ -0,0 +1,26 @@
+*15.../*
+
+A small number of warc and crawldiagnostic files, sufficient to run
+tests using ix.py on the first 90 lines from cdx/cdx-00000.gz, using
+the data directory as the cache.  So, e.g.
+
+>: uz cdx/cdx-00000.gz | head -90 | ix.py -x -h -r /lustre/home/dc007/hst/data | wc -l
+
+will run about 3 times faster than
+
+>: uz cdx/cdx-00000.gz | head -90 | ix.py -x -h | wc -l
+
+I've pretty much used up my disk quota for the local (.../data) disk,
+so please don't use -r .../data except for the first 90 in cdx-00000
+and the ones in sample_pdfs/65_pdfs.txt.  Obviously if you are working
+intensively on 724_pdfs.txt or 509_pdfs.txt or any other subset of the
+index files, using your _own_ disk space for the cache is a good idea,
+including soft links to the files in the existing cache.
+
+*cdx/warc/*
+
+All the index files
+
+*samplepdfs/*
+
+Small subsets of index files for pdfs in segment 50
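One way to follow the advice above about using your own disk space for the cache, seeded with soft links to what is already cached here, is sketched below; the target directory name, the use of GNU cp -rs, and the assumption that ix.py -r treats the directory as an ordinary cache root are illustrative, not part of this changeset:

    # Build a private cache directory whose entries are soft links to the shared one
    # (paths are examples only; newly fetched files will then land in your own quota)
    mkdir -p ~/cc_cache
    cp -rs /lustre/home/dc007/hst/data/. ~/cc_cache/

    # then point ix.py at your own cache instead of the shared one
    uz cdx/cdx-00000.gz | head -90 | ix.py -x -h -r ~/cc_cache | wc -l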
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/data/CC-MAIN-2019-35/samplepdfs/00README	Wed Jul 14 15:30:29 2021 +0000
@@ -0,0 +1,18 @@
+Subsets of application/pdf index lines for segment 50
+
+1128 is just the segment 1566027323067.50 lines from cdx-00000 that
+contain application/pdf
+
+724 contains the first 824 lines of 1128, with all the crawldiagnostics lines removed.
+
+errs.txt and errs_100.txt contain the error log from running
+    ix.py -x -b -c '/lustre/home/dc007/hst/bin/qpdf_check -' -r /lustre/home/dc007/hst/data
+on 724 and the first 100 lines of 724 respectively.
+
+65 contains the subset of the first 100 lines in 724 that are valid
+pdfs (i.e. removing the 35 that are listed in errs_100.txt)
+
+509 contains the subset of the lines in 724 that are valid pdfs (i.e. removing the 215 that are listed in errs.txt)
+
+All the files necessary to use 65 are in the cache, but please _don't_
+use the cache for anything else in this directory, as I'm out of disk space.
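For orientation, this is roughly how errs_100.txt could be produced from the first 100 lines of 724; the file name used here (724_pdfs.txt rather than 724) and the assumption that ix.py writes its error log to stderr are guesses, so treat it as a sketch rather than a record of the actual command:

    # Re-run the validation pass on the first 100 index lines and capture the error log
    head -100 724_pdfs.txt \
      | ix.py -x -b -c '/lustre/home/dc007/hst/bin/qpdf_check -' \
              -r /lustre/home/dc007/hst/data \
        2> errs_100.txt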