changeset 132:128b18459f9e

sic
author Henry S. Thompson <ht@inf.ed.ac.uk>
date Wed, 14 Jul 2021 15:30:29 +0000
parents bf943a2f0f37
children 660dc255542a
files bin/00README data/00README data/CC-MAIN-2019-35/00README data/CC-MAIN-2019-35/samplepdfs/00README
diffstat 4 files changed, 93 insertions(+), 0 deletions(-)
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/bin/00README	Wed Jul 14 15:30:29 2021 +0000
@@ -0,0 +1,39 @@
+Various tools and bash function sources.
+
+All the tools will give useful output if run with a --help argument.
+
+functions.sh  Source this in your .bashrc to get useful functions
+	      including ux, lss and btot
+
+cdx2tsv.py    Extract fields and subparts from fields of a CDX-format
+	      index file
+
+clm.sh	      Intended for use as a sub-command to ix.py:  Given an
+	      HTTP response header, appends to a given file the Last-Modified value
+	      if there is one, otherwise a blank line.
+
+ix.py	      Efficiently extract some or all of the response data from
+	      Common Crawl WARC-format files
+
+qpdf	      Wrapper for locally compiled version.
+
+              Qpdf as supplied only works with a named file, but this
+	      wrapper supports streamed input.
+	      _If_ it's invoked as
+                  qpdf [args...] -
+              it takes input from stdin, saves it as /dev/shm/$USER/xxx.pdf
+	      and runs
+                  qpdf args... /dev/shm/$USER/xxx.pdf
+
+	      Qpdf is the best available PDF validator
+	      as far as I know.  See
+	      http://qpdf.sourceforge.net/files/qpdf-manual.html 
+	      for documentation.
+
+qpdf_check    Runs qpdf with all the arguments needed to
+	      make it run as a validator: no corrections are applied,
+	      no warnings are output,
+	      fails iff there are any errors in the input file.
+
+              Uses the above qpdf wrapper, so supports input either
+	      from stdin or a named file
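The stdin handling described for the qpdf wrapper can be sketched roughly as follows. This is a Python illustration only, not the wrapper itself: the function name run_with_spooled_stdin and its parameters are hypothetical, and the fallback to the system temp directory is an assumption for machines without /dev/shm. Only the trailing "-" convention, the /dev/shm/$USER location and the xxx.pdf name come from the README above.

```python
import os
import subprocess
import sys
import tempfile


def run_with_spooled_stdin(cmd, args, stdin=None, spool_dir=None):
    """Run cmd with args; if the last argument is '-', save stdin to a file first.

    Hypothetical helper: mirrors the behaviour the README describes for the
    qpdf wrapper, but works for any command that needs a named input file.
    """
    if not args or args[-1] != "-":
        # No trailing '-': pass the arguments through unchanged.
        return subprocess.run([cmd, *args]).returncode

    source = stdin if stdin is not None else sys.stdin.buffer
    if spool_dir is None:
        # The README uses /dev/shm/$USER; the temp-dir fallback is an assumption.
        if os.path.isdir("/dev/shm"):
            spool_dir = os.path.join("/dev/shm", os.environ.get("USER", "unknown"))
        else:
            spool_dir = tempfile.gettempdir()
    os.makedirs(spool_dir, exist_ok=True)

    # Fixed file name, as in the README.
    path = os.path.join(spool_dir, "xxx.pdf")
    with open(path, "wb") as out:
        out.write(source.read())

    # Re-run the command with the saved file in place of '-'.
    return subprocess.run([cmd, *args[:-1], path]).returncode
```

Because the file name is fixed, two invocations by the same user running at once would overwrite each other's input in this sketch.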
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/data/00README	Wed Jul 14 15:30:29 2021 +0000
@@ -0,0 +1,10 @@
+*CC-MAIN-2019-35/*
+
+Around 100 sample WARC files, all the index files, and index files for some
+PDFs
+
+*bin/*
+
+Release versions of various tools
+
+See 00README files in subdirectories for more information.
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/data/CC-MAIN-2019-35/00README	Wed Jul 14 15:30:29 2021 +0000
@@ -0,0 +1,26 @@
+*15.../*
+
+A small number of warc and crawldiagnostic files, sufficient to run
+tests using ix.py on the first 90 lines from cdx/cdx-00000.gz, using
+the data directory as the cache.  So, e.g.
+
+>: uz cdx/cdx-00000.gz | head -90 | ix.py -x -h -r /lustre/home/dc007/hst/data | wc -l
+
+will run about 3 times faster than 
+
+>: uz cdx/cdx-00000.gz | head -90 | ix.py -x -h | wc -l
+
+I've pretty much used up my disk quota for the local (.../data) disk,
+so please don't use -r .../data except for the first 90 in cdx-00000
+and the ones in samplepdfs/65_pdfs.txt.  Obviously if you are working
+intensively on 724_pdfs.txt or 509_pdfs.txt or any other subset of the
+index files, using your _own_ disk space for the cache is a good idea,
+including soft links to the files in the existing cache.
+
+*cdx/warc/*
+
+All the index files
+
+*samplepdfs/*
+
+Small subsets of index files for pdfs in segment 50
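One way to set up a private cache that reuses the existing one through soft links, as suggested above, might look like the sketch below. The function name link_cache is hypothetical, and the README does not describe the directory layout inside the cache, so this simply mirrors whatever tree it finds.

```python
import os


def link_cache(existing_cache, own_cache):
    """Mirror existing_cache under own_cache with soft links.

    Returns the number of links created. Files already present in
    own_cache (real files or links) are left untouched.
    """
    made = 0
    for root, _dirs, files in os.walk(existing_cache):
        # Recreate the same relative directory inside the private cache.
        rel = os.path.relpath(root, existing_cache)
        target_dir = os.path.normpath(os.path.join(own_cache, rel))
        os.makedirs(target_dir, exist_ok=True)
        for name in files:
            link = os.path.join(target_dir, name)
            if not os.path.lexists(link):
                # Link to the absolute path so the link works from anywhere.
                os.symlink(os.path.abspath(os.path.join(root, name)), link)
                made += 1
    return made
```

The resulting directory could then presumably be given to -r in place of the shared data directory, so that newly fetched files land in your own space.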
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/data/CC-MAIN-2019-35/samplepdfs/00README	Wed Jul 14 15:30:29 2021 +0000
@@ -0,0 +1,18 @@
+Subsets of application/pdf index lines for segment 50
+
+1128 is just the segment 1566027323067.50 lines from cdx-00000 that
+contain application/pdf
+
+724 contains the first 824 lines of 1128, with all the crawldiagnostics lines removed.
+
+errs.txt and errs_100.txt contain the error log from running
+  ix.py -x -b -c '/lustre/home/dc007/hst/bin/qpdf_check -' -r /lustre/home/dc007/hst/data
+on 724 and the first 100 lines of 724 respectively.
+
+65 contains the subset of the first 100 lines in 724 that are valid
+pdfs (i.e. removing the 35 that are listed in errs_100.txt)
+
+509 contains the subset of the lines in 724 that are valid pdfs (i.e. removing the 215 that are listed in errs.txt)
+
+All the files necessary to use 65 are in the cache, but please _don't_
+use the cache for anything else in this directory, as I'm out of disk space.
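The way 1128 and 724 are derived above amounts to plain substring filtering of index lines, which could be reproduced along these lines. The helper names containing and without are hypothetical, and nothing here assumes any CDX field layout beyond "the line contains this string".

```python
def containing(lines, *needles):
    """Keep the lines that contain every one of the given substrings."""
    return [ln for ln in lines if all(n in ln for n in needles)]


def without(lines, needle, first=None):
    """Take the first `first` lines (all if None), then drop those containing needle."""
    head = lines if first is None else lines[:first]
    return [ln for ln in head if needle not in ln]
```

Under those assumptions, 1128 would correspond to containing(lines, "1566027323067.50", "application/pdf") applied to the lines of cdx-00000, and 724 to without(result, "crawldiagnostics", first=824).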