lurid3/notes.txt @ 45:737c61f98cbf
author: Henry S. Thompson <ht@inf.ed.ac.uk>
date: Thu, 26 Sep 2024 17:47:58 +0100

See old_notes.txt for all older notes on Common Crawl data processing,
starting from Azure via Turing and then LURID and LURID2.

Installed /beegfs/common_crawl/CC-MAIN-2024-33/cdx
  >: cd results/CC-MAIN-2024-33/cdx/
  >: cut -f 2 counts.tsv | btot
  2,793,986,828 

State of play wrt data -- see status.xlsx

[in trying to tabulate the date ranges of the crawls, I found that the
WARC timestamp is sometimes bogus:

  >: fgrep ' 2009' CC-MAIN-2018-34/cdx/cluster.idx
  net,tyredeyes)/robots.txt 20090201191318	cdx-00230.gz	160573468	198277	920675

  >: zgrep '^net,tyredeyes)/robots.txt' CC-MAIN-2018-34/cdx/warc/cdx-00230.gz
  net,tyredeyes)/robots.txt 20090201191318 {"url": "http://tyredeyes.net/robots.txt", "mime": "text/html", "mime-detected": "text/html", "status": "301", "digest": "QH732FYSV7UM34JYWVYMB7EZGR2CYM6B", "length": "582", "offset": "1224614", "filename": "crawl-data/CC-MAIN-2018-34/segments/1534221215075.58/robotstxt/CC-MAIN-20180819090604-20180819110604-00558.warc.gz"}
  net,tyredeyes)/robots.txt 20090201191319 {"url": "http://www.tyredeyes.net/robots.txt", "mime": "text/plain", "mime-detected": "text/plain", "status": "200", "digest": "PSX5IZU4B4SIXGNDKXCVFH75Q27VHUTJ", "length": "549", "offset": "2069841", "filename": "crawl-data/CC-MAIN-2018-34/segments/1534221215075.58/robotstxt/CC-MAIN-20180819090604-20180819110604-00485.warc.gz"}

This happens in 2019-35 as well :-(

  >: fgrep ' 20181023' CC-MAIN-2019-35/cdx/cluster.idx
  com,gyshbsh)/robots.txt 20181023022000	cdx-00078.gz	356340085	162332	315406
  >: zgrep ' 20181023' CC-MAIN-2019-35/cdx/warc/cdx-00078.gz
  com,gyshbsh)/robots.txt 20181023022000 {"url": "http://gyshbsh.com/robots.txt", "mime": "text/html", "mime-detected": "text/html", "status": "301", "digest": "3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ", "length": "529", "offset": "638892", "filename": "crawl-data/CC-MAIN-2019-35/segments/1566027315618.73/robotstxt/CC-MAIN-20190820200701-20190820222701-00120.warc.gz"}
  ...
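A quick way to hunt for these, assuming the usual cluster.idx layout
(SURT key and 14-digit timestamp as the first space-separated pair, then
tab-separated columns): flag entries whose timestamp doesn't start with
the crawl year.  A hypothetical sketch, not a tool we actually have:

  import sys

  def bogus_timestamps(cluster_idx_path, expected_year):
      """Yield cluster.idx lines whose capture timestamp falls outside the
      expected crawl year (e.g. the 2009... entries in CC-MAIN-2018-34)."""
      with open(cluster_idx_path) as f:
          for line in f:
              key = line.split('\t', 1)[0]        # "surt-key timestamp"
              ts = key.rsplit(' ', 1)[-1]
              if not ts.startswith(str(expected_year)):
                  yield line.rstrip('\n')

  if __name__ == '__main__':
      for line in bogus_timestamps(sys.argv[1], int(sys.argv[2])):
          print(line)

(A crawl that straddled a year boundary would need a looser test, but none
of the ones tabulated below do.)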

Tabulate all the date ranges for the WARC files we have

  >: for d in {2017-30,2019-35,2020-34,2021-25,2023-40,2023-50}; do printf "%s\t" $d; (ls CC-MAIN-$d/*.{?,??}/orig/warc | fgrep .gz | cut -f 3,4 -d - | sort -u |tee /dev/fd/3 | head -1 ) 3> >( tail -1 ) | tr '\n' '\t'; echo; done | cut -f 1,2,4 -d -  | sed 's/-20/ 20/;s/.$//' | tr ' ' '\t' > dates.tsv
  >: for d in {2018-30,2018-34}; do printf "%s\t" $d; (ls CC-MAIN-$d/{*.?,*.??} | fgrep warc.gz | cut -f 3,4 -d - | sort -u |tee /dev/fd/3 | { sleep 10 ; head -1 ; } ) 3> >( tail -1 ) | tr '\n' '\t'; echo; done >> dates.tsv
  >: for d in 2019-18; do printf "%s\t" $d; (ls CC-MAIN-$d/{*.?,*.??} | fgrep warc.gz | cut -f 3,4 -d - | sort -u | head -1); done
2019-18	20190418101243-20190418122248
  >: for d in 2019-18; do printf "%s\t" $d; (ls CC-MAIN-$d/{*.?,*.??} | fgrep warc.gz | cut -f 3,4 -d - | sort -u | tail -1); done
2019-18	20190426153423-20190426175423
  >: echo 2019-18       20190418101243-20190418122248   20190426153423-20190426175423 >> dates.tsv 
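The same tabulation in one place, as a hypothetical Python sketch: the
14-digit start and end stamps come straight out of the
CC-MAIN-<start>-<end>-NNNNN.warc.gz filenames, and the directory layout is
assumed to match what the shell loops above walk over.

  import re, sys
  from pathlib import Path

  STAMPS = re.compile(r'CC-MAIN-(\d{14})-(\d{14})-\d+\.warc\.gz$')

  def crawl_date_range(crawl_dir):
      """Earliest start and latest end timestamp over all WARC filenames."""
      starts, ends = [], []
      for p in Path(crawl_dir).rglob('*.warc.gz'):
          m = STAMPS.search(p.name)
          if m:
              starts.append(m.group(1))
              ends.append(m.group(2))
      return min(starts), max(ends)

  if __name__ == '__main__':
      for d in sys.argv[1:]:
          lo, hi = crawl_date_range(d)
          print(d, lo, hi, sep='\t')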
  >: pwd
  /beegfs/common_crawl/CC-MAIN-2016-30/cdx/warc
  >: echo {000..299} | tr ' ' '\n' | parallel -j 10 'uz cdx-00{}.gz | cut -f 2 -d " " | sort -u > /tmp/hst/{}'
  >: sort -mu /tmp/hst/??? > /tmp/hst/all
  >: wc -l /tmp/hst/all
  679686 /tmp/hst/all
  >: head -1 /tmp/hst/all
  20160723090435
  >: tail -1 /tmp/hst/all
  20160731110639
  >: cd ../../..
  >: echo 2016-30       20160723090435  20160731110639 >> dates.tsv 
tweaked and sorted in xemacs:
  2016-30	20160723090435	20160731110639
  2017-30	20170720121902	20170729132938
  2018-30	20180715183800	20180723184955
  2018-34	20180814062251	20180822085454
  2019-18	20190418101243	20190426175423
  2019-35	20190817102624	20190826111356
  2020-34	20200803083123	20200815214756
  2021-25	20210612103920	20210625145905
  2023-40	20230921073711	20231005042006
  2023-50	20231128083443	20231212000408

Added to status.xlsx in shortened form, with number of days:
  2016-30	8
  2017-30	9
  2018-30	8
  2018-34	8
  2019-18	8
  2019-35	9
  2020-34	12
  2021-25	13
  2023-40	15
  2023-50	15

Fill a gap by downloading 2022-33

  >: for s in 0; do ~/bin/getcc_multi.aws CC-MAIN-2022-33 $s 5; done > /tmp/hst/get_22-33_0.log &
  130 minutes...
  >: for s in 1; do ~/bin/getcc_multi.aws CC-MAIN-2022-33 $s 10; done > /tmp/hst/get_22-33_1.log &
  59 minutes

Another day to get to a quarter?
  >: for s in {2..23}; do ~/bin/getcc_multi.aws CC-MAIN-2022-33 $s 10; done > /tmp/hst/get_22-33_2-23.log &


And finally 2015-35
Fetched in just 2 chunks, 0-9 and 10-99, e.g.
  >: for s in {10..99}; do ~/bin/getcc_multi.aws CC-MAIN-2015-35 $s 10; done > /tmp/hst/get_15-35_10-99.log &

Much smaller.
Compare 2023-40, with 900 files per segment:
  >: lss */orig/warc/*-0023?.* | cut -f 5 -d ' ' | stats
  n	=	1000
  min	=	1.14775e+09
  max	=	1.26702e+09
  sum	=	1.20192e+12
  mean	=	1.20192e+09
  sd	=	2.26049e+07

with 2015-35, which has 353 files per segment:
  >: lss */orig/warc/*-0023?-* | cut -f 5 -d ' ' | stats
  n	=	1000
  min	=	1.66471e+08
  max	=	9.6322e+08
  sum	=	9.19222e+11
  mean	=	9.19222e+08
  sd	=	8.20542e+07
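For reference, roughly what the lss | stats pipeline is computing, as a
hypothetical sketch (pass the glob pattern in quotes; sizes come from the
filesystem):

  import glob, os, statistics, sys

  def size_stats(pattern):
      """n/min/max/sum/mean/sd over the sizes of files matching a glob."""
      sizes = [os.path.getsize(p) for p in glob.glob(pattern)]
      return [('n', len(sizes)), ('min', min(sizes)), ('max', max(sizes)),
              ('sum', sum(sizes)), ('mean', statistics.mean(sizes)),
              ('sd', statistics.stdev(sizes))]

  if __name__ == '__main__':
      for name, value in size_stats(sys.argv[1]):
          print(name, value, sep='\t')

e.g. size_stats.py '*/orig/warc/*-0023?-*'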

The min files all come from segment 1440644060633.7, whose files are
_all_ small:
  >: uz *00123-*.gz | wc -l
  12,759,931
Compare to 1440644060103.8
  >: zcat *00123-*.gz | wc -l
  75,806,738
Mystery

Also faster to download.
Compare 2022-33:
 >: fgrep -h BST /tmp/hst/get_22-33_{2-23,24-49,50-99}.log |  cut -f 1-7 -d ' ' | while read s; do if read e; then echo $((($(date --date="$e" +%s) - $(date --date="$s" +%s)) / 60)); fi; done   | stats n min max  mean    sd
                              98 19 256  75.1   25.2
with 2015-35:
  >: fgrep -h BST /tmp/hst/get_15-35_{0-9,10-99}.log |  cut -f 1-7 -d ' ' | while read s; do if read e; then echo $((($(date --date="$e" +%s) - $(date --date="$s" +%s)) / 60)); fi; done   | stats n min max mean sd
		       100 15  40 32.6 2.9
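The shell pipeline above pairs up consecutive date-stamped lines (start,
end) per segment and converts each pair to minutes.  A hypothetical Python
restatement; the timestamp parsing is handed to python-dateutil (an
assumption) because the lines are whatever `date` printed:

  import sys
  from dateutil import parser          # assumption: python-dateutil available

  def durations_minutes(lines):
      """Elapsed minutes for consecutive (start, end) timestamp pairs."""
      stamps = [parser.parse(' '.join(l.split()[:7]), fuzzy=True)
                for l in lines if 'BST' in l]
      return [(e - s).total_seconds() / 60
              for s, e in zip(stamps[::2], stamps[1::2])]

  if __name__ == '__main__':
      for minutes in durations_minutes(sys.stdin):
          print(round(minutes))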

  >: echo {000..299} | tr ' ' '\n' | parallel -j 10 'uz cdx-00{}.gz | cut -f 2 -d " " | sort -u > /tmp/hst/2015_{}' &
  >: sort --parallel=10 -mu /tmp/hst/2015_??? > /tmp/hst/2015_all
  >: head -1 /tmp/hst/2015_all
  20150827191534
  >: tail -1 /tmp/hst/2015_all
  20150905180914
  >: wc -l /tmp/hst/2015_all
  698128 /tmp/hst/2015_all

What about wet files -- do they include text from pdfs?  What about
truncated pdfs?

  >: time for s in 0; do ~/bin/getcc_wet_multi.aws CC-MAIN-2019-35 $s 10; done > /tmp/hst/get_wet_19-35_0.log &
  real    26m3.049s
  user    0m1.225s
  sys     0m1.310s

In the segment 0 cdx file (!) we find 3747 probable truncations, i.e.
detected PDFs whose record length is in the 1,0xx,xxx range, right around
the 1 MiB truncation limit:
  >: zgrep -a '"mime-detected": "application/pdf", ' cdx.gz > /tmp/hst/2019-35_seg0_pdf.idx
  >: wc -l /tmp/hst/2019-35_seg0_pdf.idx
  42345 /tmp/hst/2019-35_seg0_pdf.idx
  >: egrep -a '"length": "10....."' /tmp/hst/2019-35_seg0_pdf.idx > /tmp/hst/2019-35_seg0_long_pdf.idx &
  >: wc -l < /tmp/hst/2019-35_seg0_long_pdf.idx
  3747
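The same filter, done by parsing the JSON part of each cdx line rather than
pattern-matching it (a hypothetical sketch; "length" is the record length
from the cdx, and 1,0xx,xxx is taken as "close enough to 1 MiB to be
suspicious"):

  import gzip, json, sys

  def probable_truncated_pdfs(cdx_gz_path):
      """Detected-PDF cdx records whose length is in the 1,0xx,xxx range."""
      with gzip.open(cdx_gz_path, 'rt', errors='replace') as f:
          for line in f:
              surt, ts, blob = line.split(' ', 2)
              rec = json.loads(blob)
              if (rec.get('mime-detected') == 'application/pdf'
                      and 1_000_000 <= int(rec.get('length', 0)) < 1_100_000):
                  yield rec

  if __name__ == '__main__':
      for rec in probable_truncated_pdfs(sys.argv[1]):
          print(rec['length'], rec['url'], sep='\t')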
Of the 42,345 PDFs, 70 are in file 0:
  >: egrep -a '.-00000\.' /tmp/hst/2019-35_seg0_pdf.idx > /tmp/hst/2019-35_seg0_file0_pdf.idx
  >: wc -l /tmp/hst/2019-35_seg0_file0_pdf.idx
  70 /tmp/hst/2019-35_seg0_file0_pdf.idx

In segment 0 file 0 we find 70 application/pdf Content-Type headers:
  >: ix.py -h -w  -x </tmp/hst/2019-35_seg0_file0_pdf.idx |egrep '^(WARC-Target-URI:|Content-Length:) '|cut -f 2 -d ' ' |tr -d '\r'|while read l1; do read uri; read l2; printf '%s\t%s\t%s\n' $l1 $l2 "$uri"; done > ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv
  >: wc -l < ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv
  70
  >: head -3 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv


Of which 14 are truncated:
  >: fgrep -c 1048576 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv
  14

E.g.
  >: fgrep 1048576 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv | head -3
  1049051 1048576 https://en.heks.ch/sites/default/files/documents/2017-09/HEKS_EPER_Mission_Statement_2016_e.pdf
  1049469 1048576 https://bmcmicrobiol.biomedcentral.com/track/pdf/10.1186/s12866-017-0951-4
  1048824 1048576 https://citydocs.fcgov.com/?action=cao-cases&cmd=convert&docid=3332339

Are any of the pdfs in the corresponding wet file?

Yes, 2:
  >: cut -f 3 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv | fgrep -af - <(uz 1566027313501.0/orig/wet/*-00000.warc.wet.gz)
  WARC-Target-URI: http://bdds.deux-sevres.com/recherche/simple/Editeur/2/Belfond/vignette?format=pdf
  WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D00

Is it in fact corresponding?
  >: diff -bw <(uz 1566027313501.0/orig/warc/*-00000.warc.gz | egrep -a '^WARC-Target-URI: ' | uniq | head -1000) <(uz 1566027313501.0/orig/wet/*-00000.warc.wet.gz | egrep -a '^WARC-Target-URI: ' | head -1000)|egrep -c '^<'
  19

So, yes, mostly: about 2% (19 of the first 1000) are missing

Just checking the search:
  >: cut -f 3 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv | fgrep -af - <(uz 1566027313501.0/orig/warc/*-00000.warc.gz) | wc -l
  210
  Correct: 3 records (request, response and metadata) per capture, so 70 x 3 = 210

So, what pdfs make it into the WET:
  >: cut -f 3 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv | fgrep -af - <(uz 1566027313501.0/orig/wet/*-00000.warc.wet.gz) > ~/results/CC-MAIN-2019-35/s0_file0_pdf.txt
  >: wc -l < ~/results/CC-MAIN-2019-35/s0_file0_pdf.txt
  2
 >: cut -f 2 -d ' ' ~/results/CC-MAIN-2019-35/s0_file0_pdf.txt | tr -d '\r' | fgrep -f -   ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv
  11588   10913   http://bdds.deux-sevres.com/recherche/simple/Editeur/2/Belfond/vignette?format=pdf
  1048979 1048576 https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005 

Here's the short one:
WARC/1.0
WARC-Type: response
WARC-Date: 2019-08-17T22:40:17Z
WARC-Record-ID: <urn:uuid:ea98167b-c42a-4233-b57e-994aa627e38a>
Content-Length: 11588
Content-Type: application/http; msgtype=response
WARC-Warcinfo-ID: <urn:uuid:f689f8d0-24f3-4824-9a38-4f3fee422a4e>
WARC-Concurrent-To: <urn:uuid:2d51c956-0012-4d78-affc-8f57fe9d2e15>
WARC-IP-Address: 92.175.114.24
WARC-Target-URI: http://bdds.deux-sevres.com/recherche/simple/Editeur/2/Belfond/vignette?format=pdf
WARC-Payload-Digest: sha1:7VVIUDQ4Q6XKNOAURYU4VTMRSZNPHDQA
WARC-Block-Digest: sha1:OSTWXLV772XNHS22T4UBSCSJAAXM2J6T
WARC-Identified-Payload-Type: application/pdf

HTTP/1.1 200 OK
Cache-Control: must-revalidate, post-check=0, pre-check=0,no-cache
Pragma: public,no-cache
Content-Type: application/pdf",text/html; charset=utf-8
X-Crawler-Content-Encoding: gzip
Expires: 0
Server:
X-Powered-By:
Set-Cookie: 166d74d734106ba68b20ea303011f622=301619e3fe31ecb98c8473f0ff5f35a2; path=/
Content-Disposition: attachment; filename="Mdiathque dpartementale des Deux-Svres - Rsultats de la recherche Belfond.pdf"
Content-Transfer-Encoding: binary
P3P: CP="NOI ADM DEV PSAi COM NAV OUR OTRo STP IND DEM"
X-Content-Encoded-By:
X-Powered-By:
Date: Sat, 17 Aug 2019 22:40:16 GMT
X-Crawler-Content-Length: 5448
Content-Length: 10913

        %PDF-1.7
%<E2><E3><CF><D3>
7 0 obj
<< /Type /Page /Parent 1 0 R /LastModified (D:20190818004016+02'00') /Resources 2
 0 R /MediaBox [0.000000 0.000000 595.276000 841.890000] /CropBox [0.000000 0.000
000 595.276000 841.890000] /BleedBox [0.000000 0.000000 595.276000 841.890000] /T
rimBox [0.000000 0.000000 595.276000 841.890000] /ArtBox [0.000000 0.000000 595.2
76000 841.890000] /Contents 8 0 R /Rotate 0 /Group << /Type /Group /S /Transparen
cy /CS /DeviceRGB >> /PZ 1 >>
endobj
8 0 obj

  >: uz 1566027313501.0/orig/warc/*-00000.warc.gz|tail -n +1823434 | tail -n +24 | head -c 20000 > ~/results/CC-MAIN-2019-35/mediatheque.pdf
  >: ps2ascii mediatheque.pdf
                             Médiathèque départementale des Deux-Sèvres - Résultats de la recherche Belfond

                             Médiathèque départementale des Deux-Sèvres - Résultats de
                             la recherche Belfond
                                                               A charge de revanche
                             Titre :
                             Auteur : Grippando, James (1958-....)
  ...
  etc., three pages, no errors
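An alternative to the line-offset hack above: pull the record straight out
of the .warc.gz using the offset and length the cdx gives us, since each
record is its own gzip member.  A hypothetical sketch (not the real ix.py):

  import gzip, sys

  def warc_record(warc_path, offset, length):
      """Decompress the single gzip member at (offset, length)."""
      with open(warc_path, 'rb') as f:
          f.seek(offset)
          member = f.read(length)
      return gzip.decompress(member)

  if __name__ == '__main__':
      path, offset, length = sys.argv[1], int(sys.argv[2]), int(sys.argv[3])
      sys.stdout.buffer.write(warc_record(path, offset, length))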

  >: uz 1566027313501.0/orig/warc/*-00000.warc.gz|fgrep -an  https://museum.wrap.gov.tw/GetFile4.ashx
  38896837:WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005
  38896858:WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005
  38904590:WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005
    >: uz 1566027313501.0/orig/warc/*-00000.warc.gz|tail -n +38896858 | egrep -an '^%%EOF'
  27:%%EOF
  1114658:%%EOF
  1313299:%%EOF

Hunh?

  >: uz 1566027313501.0/orig/warc/*-00000.warc.gz|tail -n +38896858 | egrep -an '^(%%EOF|WARC)' | head -30
  1:WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005
  2:WARC-Payload-Digest: sha1:SZ53DQQHENC7DDN7GQ5IS7VMEPAXAMBE
  3:WARC-Block-Digest: sha1:QTKJA6A7445Z7264K2YAFBUUM2OYH2T2
  4:WARC-Truncated: length
  5:WARC-Identified-Payload-Type: application/pdf
  27:%%EOF
  7725:WARC/1.0
  7726:WARC-Type: metadata
  7727:WARC-Date: 2019-08-17T22:59:14Z
  7728:WARC-Record-ID: <urn:uuid:77df2747-e567-45d3-8646-3069ae9a9f25>
  7731:WARC-Warcinfo-ID: <urn:uuid:f689f8d0-24f3-4824-9a38-4f3fee422a4e>
  7732:WARC-Concurrent-To: <urn:uuid:eceb4adc-d81e-4497-82fe-eea61ce171f4>
  7733:WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005
  7739:WARC/1.0

OK, so indeed truncated after 7700 lines or so...
  >: uz 1566027313501.0/orig/warc/*-00000.warc.gz|tail -n +38896858 | tail -n +21 | head -c 1048576 > ~/results/CC-MAIN-2019-35/museum.pdf
  >: ps2ascii ~/results/CC-MAIN-2019-35/museum.pdf
   **** Error:  An error occurred while reading an XREF table.
   **** The file has been damaged.
Look in big_pdf?

====Modify the original CC indexer to write new indices including lastmod====
Looks like WarcRecordWriter.write, in
src/nutch-cc/src/java/org/commoncrawl/util/WarcRecordWriter, is what
needs to be edited to include the Last-Modified date

To rebuild nutch-cc, particularly to recompile jar files after editing
anything:

  >: cd $HHOME/src/nutch-cc
  >: ant

Fixed deprecation bug in WarcCdxWriter.java

Modified src/java/org/commoncrawl/util/WarcCdxWriter.java
to include lastmod

Can run just one test, which should allow testing this:

  >: ant test-core -Dtestcase='TestWarcRecordWriter'

The logic is tricky, and there's no easy way in.

Basically, tools/WarcExport.java launches a Hadoop job based on a
hadoop-runnable WarcExport instance.  Hadoop will in due course call
ExportReducer.reduce, which creates an instance of WarcCapture
"for each page capture" and calls ExportMapper.context.write with that
instance (via some configuration magic with the Hadoop job Context).
That in turn uses (more magic) WarcOutputFormat.getRecordWriter, which
(finally!) calls write(the capture) on a previously created
WarcRecordWriter instance.

So to fake a test case, I need to build
 1) a WarcRecordWriter instance
 2) a WarcCapture instance
and then invoke 1.write(2)

Got that working, although still can't figure out where in the normal
flow the metadata entry for Response.CONTENT_TYPE gets set.

Now, add a test that takes a stream of WARC Response extracts and
rewrites their index entries

  >: head -8804 <(uz /beegfs/common_crawl/CC-MAIN-2019-35/cdx/warc/cdx-00150.gz)|tail -10|  ix.py -h -w -x  > /tmp/hst/headers.txt
  >: cp /tmp/hst/headers.txt src/test/org/commoncrawl/util/
  >: shuf /tmp/hst/headers.txt > src/test/org/commoncrawl/util/headers_mixed.txt

Won't quite work :-(
How do we reconstruct the WARC filename, offset and length from the
original index?

Well, we can find individual .warc.gz records!
Thanks to https://stackoverflow.com/a/37042747/2595465

  >: ~/lib/python/unpackz.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz > /tmp/hst/recs.txt
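What unpackz.py is doing, per that answer, is walking the file one gzip
member at a time and reporting each member's offset and compressed length.
A hypothetical sketch of the idea (not the real unpackz.py):

  import sys, zlib

  def gzip_members(path):
      """Yield (offset, compressed_length) for each gzip member in the file."""
      with open(path, 'rb') as f:
          data = memoryview(f.read())   # whole WARC in memory; fine for ~1 GB
      offset = 0
      while offset < len(data):
          d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)   # gzip wrapper
          d.decompress(data[offset:])                         # run to end of member
          consumed = len(data) - offset - len(d.unused_data)
          yield offset, consumed
          offset += consumed

  if __name__ == '__main__':
      for off, length in gzip_members(sys.argv[1]):
          print(off, length)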

Nearly working, got 1/3rd of the way through a single WARC and then failed:

  >: n=0 && ~/lib/python/unpackz.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz 2>/tmp/hst/tt.txt|while read o l; do echo $((n+=1)); echo $o $l >> /tmp/hst/r3a; ix.py $l $o   CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz| wc -l; done
  ...
  20
  10215
  CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz
  Process fail: Compressed file ended before the end-of-stream marker was reached, input:
   length=10762, offset=60784640, file=/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz

  >: head -10217 /tmp/hst/r3a | tail -4
  60784173 467
  60784640 10762
  60795402 463
  60795865 460
  >: ix.py 467 60784173   CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz|fgrep Target
  WARC-Target-URI: http://drycarerestoration.co/corner-furniture-piece/unique-corner-decoration-pieces-or-corner-furniture-pieces-corner-corner-furniture-piece-corner-furniture-pieces-bedroom/

  >: zcat /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/cdx/warc/cdx.gz 
  ...
  co,drycarerestoration)/corner-furniture-piece/unique-corner-decoration-pieces-or-corner-furniture-pieces-corner-corner-furniture-piece-corner-furniture-pieces-bedroom 20190819020224 {"url": "http://drycarerestoration.co/corner-furniture-piece/unique-corner-decoration-pieces-or-corner-furniture-pieces-corner-corner-furniture-piece-corner-furniture-pieces-bedroom/", "mime": "text/html", "mime-detected": "text/html", "status": "200", "digest": "DTKGJL45XQDXUS7PTXPYR6POMPLG46RZ", "length": "2570", "offset": "60784640", "filename": "crawl-data/CC-MAIN-2019-35/segments/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz", "charset": "UTF-8", "languages": "eng"}
  >: ix.py 2570 60784640   CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz|less
  >: echo $((10762 - 2570))
  8192

Ah, the error I was dreading :-(  I _think_ this happens when an
individual record ends exactly on an 8K boundary.

Yes:

  >: echo $((60784640 % 8192))
  0
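
If that's right, any record whose offset is a multiple of 8192 should trip
the same failure.  A quick hypothetical check over the offset/length pairs
unpackz.py reported (e.g. /tmp/hst/r3a above):

  import sys

  # Offsets of records that start exactly on an 8 KiB boundary, i.e. whose
  # predecessor ended exactly on one -- the suspected failure condition.
  offsets = [int(line.split()[0]) for line in sys.stdin if line.strip()]
  aligned = [o for o in offsets if o % 8192 == 0]
  print(f'{len(aligned)} of {len(offsets)} offsets are 8 KiB-aligned')
  for o in aligned[:10]:
      print(o)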