lurid3/notes.txt @ 43:6ae6a21ccfb9
more downloads, exploring pdfs in wet
author Henry S. Thompson <ht@inf.ed.ac.uk>
date Thu, 05 Sep 2024 17:59:02 +0100

See old_notes.txt for all older notes on Common Crawl data processing,
starting from Azure via Turing and then LURID and LURID2.

Installed /beegfs/common_crawl/CC-MAIN-2024-33/cdx
  >: cd results/CC-MAIN-2024-33/cdx/
  >: cut -f 2 counts.tsv | btot
  2,793,986,828 
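[btot is a local helper; from its use here it evidently sums a column
and prints the total with thousands separators.  A rough stand-in with
standard tools (a sketch; numfmt is GNU coreutils and grouping is
locale-dependent):

  >: cut -f 2 counts.tsv | awk '{s+=$1} END{print s}' | numfmt --grouping
]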

State of play wrt data -- see status.xlsx

[in trying to tabulate the date ranges of the crawls, I found that the
WARC timestamp is sometimes bogus:

  >: fgrep ' 2009' CC-MAIN-2018-34/cdx/cluster.idx
  net,tyredeyes)/robots.txt 20090201191318	cdx-00230.gz	160573468	198277	920675

  >: zgrep '^net,tyredeyes)/robots.txt' CC-MAIN-2018-34/cdx/warc/cdx-00230.gz
  net,tyredeyes)/robots.txt 20090201191318 {"url": "http://tyredeyes.net/robots.txt", "mime": "text/html", "mime-detected": "text/html", "status": "301", "digest": "QH732FYSV7UM34JYWVYMB7EZGR2CYM6B", "length": "582", "offset": "1224614", "filename": "crawl-data/CC-MAIN-2018-34/segments/1534221215075.58/robotstxt/CC-MAIN-20180819090604-20180819110604-00558.warc.gz"}
  net,tyredeyes)/robots.txt 20090201191319 {"url": "http://www.tyredeyes.net/robots.txt", "mime": "text/plain", "mime-detected": "text/plain", "status": "200", "digest": "PSX5IZU4B4SIXGNDKXCVFH75Q27VHUTJ", "length": "549", "offset": "2069841", "filename": "crawl-data/CC-MAIN-2018-34/segments/1534221215075.58/robotstxt/CC-MAIN-20180819090604-20180819110604-00485.warc.gz"}

This happens in 2019-35 as well :-(

  >: fgrep ' 20181023' CC-MAIN-2019-35/cdx/cluster.idx
  com,gyshbsh)/robots.txt 20181023022000	cdx-00078.gz	356340085	162332	315406
  >: zgrep ' 20181023' CC-MAIN-2019-35/cdx/warc/cdx-00078.gz
  com,gyshbsh)/robots.txt 20181023022000 {"url": "http://gyshbsh.com/robots.txt", "mime": "text/html", "mime-detected": "text/html", "status": "301", "digest": "3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ", "length": "529", "offset": "638892", "filename": "crawl-data/CC-MAIN-2019-35/segments/1566027315618.73/robotstxt/CC-MAIN-20190820200701-20190820222701-00120.warc.gz"}
  ...
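The same check can be run wholesale: pull every fetch timestamp out of
a crawl's cdx files and list any falling outside the nominal crawl year
(a sketch, assuming all legitimate 2019-35 timestamps begin with 2019):

  >: zcat CC-MAIN-2019-35/cdx/warc/cdx-*.gz | cut -f 2 -d ' ' | egrep -v '^2019' | sort -u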

Tabulate all the date ranges for the WARC files we have

  >: for d in {2017-30,2019-35,2020-34,2021-25,2023-40,2023-50}; do printf "%s\t" $d; (ls CC-MAIN-$d/*.{?,??}/orig/warc | fgrep .gz | cut -f 3,4 -d - | sort -u |tee /dev/fd/3 | head -1 ) 3> >( tail -1 ) | tr '\n' '\t'; echo; done | cut -f 1,2,4 -d -  | sed 's/-20/ 20/;s/.$//' | tr ' ' '\t' > dates.tsv
  >: for d in {2018-30,2018-34}; do printf "%s\t" $d; (ls CC-MAIN-$d/{*.?,*.??} | fgrep warc.gz | cut -f 3,4 -d - | sort -u |tee /dev/fd/3 | { sleep 10 ; head -1 ; } ) 3> >( tail -1 ) | tr '\n' '\t'; echo; done >> dates.tsv
  >: for d in 2019-18; do printf "%s\t" $d; (ls CC-MAIN-$d/{*.?,*.??} | fgrep warc.gz | cut -f 3,4 -d - | sort -u | head -1); done
2019-18	20190418101243-20190418122248
  >: for d in 2019-18; do printf "%s\t" $d; (ls CC-MAIN-$d/{*.?,*.??} | fgrep warc.gz | cut -f 3,4 -d - | sort -u | tail -1); done
2019-18	20190426153423-20190426175423
  >: echo 2019-18       20190418101243-20190418122248   20190426153423-20190426175423 >> dates.tsv 
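[The (... | tee /dev/fd/3 | head -1) 3> >( tail -1 ) construction above
prints the first and last lines of the sorted stream, i.e. the earliest
and latest WARC filename timestamps.  A simpler equivalent for future
reference (a sketch; sed -n '1p;$p' prints just the first and last
lines):

  >: ls CC-MAIN-2019-18/{*.?,*.??} | fgrep warc.gz | cut -f 3,4 -d - | sort -u | sed -n '1p;$p'
]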
  >: pwd
  /beegfs/common_crawl/CC-MAIN-2016-30/cdx/warc
  >: echo {000..299} | tr ' ' '\n' | parallel -j 10 'uz cdx-00{}.gz | cut -f 2 -d " " | sort -u > /tmp/hst/{}'
  >: sort -mu /tmp/hst/??? > /tmp/hst/all
  >: wc -l /tmp/hst/all
  679686 /tmp/hst/all
  >: head -1 /tmp/hst/all
  20160723090435
  >: tail -1 /tmp/hst/all
  20160731110639
  >: cd ../../..
  >: echo 2016-30       20160723090435  20160731110639 >> dates.tsv 
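[The echo writes spaces rather than tabs, hence the tweaking below;
printf with explicit tabs would avoid that (a sketch):

  >: printf '2016-30\t%s\t%s\n' $(head -1 /tmp/hst/all) $(tail -1 /tmp/hst/all) >> dates.tsv
]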
tweaked and sorted in xemacs:
  2016-30	20160723090435	20160731110639
  2017-30	20170720121902	20170729132938
  2018-30	20180715183800	20180723184955
  2018-34	20180814062251	20180822085454
  2019-18	20190418101243	20190426175423
  2019-35	20190817102624	20190826111356
  2020-34	20200803083123	20200815214756
  2021-25	20210612103920	20210625145905
  2023-40	20230921073711	20231005042006
  2023-50	20231128083443	20231212000408

Added to status.xlsx in shortened form, with number of days:
  2016-30	8
  2017-30	9
  2018-30	8
  2018-34	8
  2019-18	8
  2019-35	9
  2020-34	12
  2021-25	13
  2023-40	15
  2023-50	15
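[The day counts can be recomputed from dates.tsv rather than by hand (a
gawk sketch, assuming the tweaked three-column layout above; may differ
by one from the hand tally where that counted inclusive of both
endpoints):

  >: gawk -F'\t' '{t1=mktime(substr($2,1,4)" "substr($2,5,2)" "substr($2,7,2)" 0 0 0"); t2=mktime(substr($3,1,4)" "substr($3,5,2)" "substr($3,7,2)" 0 0 0"); printf "%s\t%d\n",$1,(t2-t1)/86400}' dates.tsv
]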

Fill a gap by downloading 2022-33

  >: for s in 0; do ~/bin/getcc_multi.aws CC-MAIN-2022-33 $s 5; done > /tmp/hst/get_22-33_0.log &
  130 minutes...
  >: for s in 1; do ~/bin/getcc_multi.aws CC-MAIN-2022-33 $s 10; done > /tmp/hst/get_22-33_1.log &
  59 minutes

Another day to get to a quarter?
  >: for s in {2..23}; do ~/bin/getcc_multi.aws CC-MAIN-2022-33 $s 10; done > /tmp/hst/get_22-33_2-23.log &


And finally 2015-35
Fetched in just 2 chunks, 0-9 and 10-99, e.g.
  >: for s in {10..99}; do ~/bin/getcc_multi.aws CC-MAIN-2015-35 $s 10; done > /tmp/hst/get_15-35_10-99.log &

Much smaller.
Compare 2023-40, with 900 files per segment:
  >: lss */orig/warc/*-0023?.* | cut -f 5 -d ' ' | stats
  n	=	1000
  min	=	1.14775e+09
  max	=	1.26702e+09
  sum	=	1.20192e+12
  mean	=	1.20192e+09
  sd	=	2.26049e+07

with 2015-35, which has 353 files per segment:
  >: lss */orig/warc/*-0023?-* | cut -f 5 -d ' ' | stats
  n	=	1000
  min	=	1.66471e+08
  max	=	9.6322e+08
  sum	=	9.19222e+11
  mean	=	9.19222e+08
  sd	=	8.20542e+07
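[stats is a local helper; judging by its output it computes count, min,
max, mean and standard deviation of a column of numbers.  An awk
stand-in (a sketch; population sd, not Bessel-corrected):

  >: lss */orig/warc/*-0023?-* | cut -f 5 -d ' ' | awk '{s+=$1; ss+=$1*$1; if(NR==1||$1<mn)mn=$1; if($1>mx)mx=$1} END{m=s/NR; printf "n=%d min=%g max=%g mean=%g sd=%g\n", NR, mn, mx, m, sqrt(ss/NR-m*m)}'
]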

The min files all come from segment 1440644060633.7, whose files are
_all_ small:
  >: uz *00123-*.gz | wc -l
  12,759,931
Compare to 1440644060103.8
  >: zcat *00123-*.gz | wc -l
  75,806,738
Mystery

Also faster.
Compare 2022-33 (each log alternates chunk-start and chunk-end
timestamp lines; the read-pair loop turns each pair into elapsed
minutes):
 >: fgrep -h BST /tmp/hst/get_22-33_{2-23,24-49,50-99}.log |  cut -f 1-7 -d ' ' | while read s; do if read e; then echo $((($(date --date="$e" +%s) - $(date --date="$s" +%s)) / 60)); fi; done   | stats n min max  mean    sd
                              98 19 256  75.1   25.2
with 2015-35:
  >: fgrep -h BST /tmp/hst/get_15-35_{0-9,10-99}.log |  cut -f 1-7 -d ' ' | while read s; do if read e; then echo $((($(date --date="$e" +%s) - $(date --date="$s" +%s)) / 60)); fi; done   | stats n min max mean sd
		       100 15  40 32.6 2.9

  >: echo {000..299} | tr ' ' '\n' | parallel -j 10 'uz cdx-00{}.gz | cut -f 2 -d " " | sort -u > /tmp/hst/2015_{}' &
  >: sort --parallel=10 -mu /tmp/hst/2015_??? > /tmp/hst/2015_all
  >: head -1 /tmp/hst/2015_all
  20150827191534
  >: tail -1 /tmp/hst/2015_all
  20150905180914
  >: wc -l /tmp/hst/2015_all
  698128 /tmp/hst/2015_all

What about wet files -- do they include text from pdfs?  What about
truncated pdfs?

  >: time for s in 0; do ~/bin/getcc_wet_multi.aws CC-MAIN-2019-35 $s 10; done > /tmp/hst/get_wet_19-35_0.log &
  real    26m3.049s
  user    0m1.225s
  sys     0m1.310s

In the segment 0 cdx file (!) we find 3747 probable truncations:
  >: zgrep -a '"mime-detected": "application/pdf", ' cdx.gz > /tmp/hst/2019-35_seg0_pdf.idx
  >: wc -l /tmp/hst/2019-35_seg0_pdf.idx
  42345 /tmp/hst/2019-35_seg0_pdf.idx
  >: egrep -a '"length": "10....."' /tmp/hst/2019-35_seg0_pdf.idx > /tmp/hst/2019-35_seg0_long_pdf.idx &
  >: wc -l < /tmp/hst/2019-35_seg0_long_pdf.idx
  3747
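[The length regex matches any 7-digit value from 1,000,000 to
1,099,999; since the cdx length field is the stored (compressed) record
size and PDFs barely compress, that brackets records whose payload is
near the 1,048,576-byte (1 MiB) truncation cap.  The same selection can
be made explicitly on the JSON (a jq sketch; the JSON object starts at
the first brace on each line):

  >: zcat cdx.gz | sed 's/^[^{]*//' | jq -r 'select(."mime-detected" == "application/pdf" and (.length|tonumber) >= 1000000) | .url' | wc -l
]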
Of the 42,345 PDF entries, 70 are in file 0:
  >: egrep -a '.-00000\.' /tmp/hst/2019-35_seg0_pdf.idx > /tmp/hst/2019-35_seg0_file0_pdf.idx
  >: wc -l /tmp/hst/2019-35_seg0_file0_pdf.idx
  70 /tmp/hst/2019-35_seg0_file0_pdf.idx

In segment 0 file 0 we find 70 application/pdf Content-Type headers:
  >: ix.py -h -w  -x </tmp/hst/2019-35_seg0_file0_pdf.idx |egrep '^(WARC-Target-URI:|Content-Length:) '|cut -f 2 -d ' ' |tr -d '\r'|while read l1; do read uri; read l2; printf '%s\t%s\t%s\n' $l1 $l2 "$uri"; done > ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv
  >: wc -l < ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv
  70
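[The -h -w -x run evidently dumps each record's WARC and HTTP headers;
within a record the stream order is Content-Length (WARC record), then
WARC-Target-URI, then Content-Length (HTTP payload) -- cf. the record
dump below -- hence the three reads: l1 = record length, uri,
l2 = payload length.]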
  >: head -3 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv


Of which 14 are truncated (HTTP Content-Length exactly 1048576 = 2^20
bytes, the crawler's 1 MiB payload cap):
  >: fgrep -c 1048576 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv
  14

E.g.
  >: fgrep 1048576 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv | head -3
  1049051 1048576 https://en.heks.ch/sites/default/files/documents/2017-09/HEKS_EPER_Mission_Statement_2016_e.pdf
  1049469 1048576 https://bmcmicrobiol.biomedcentral.com/track/pdf/10.1186/s12866-017-0951-4
  1048824 1048576 https://citydocs.fcgov.com/?action=cao-cases&cmd=convert&docid=3332339
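If these are real truncations, the same 14 records should also carry
WARC-Truncated: length headers (as the museum record below does); one
way to check, reusing ix.py as above (a sketch):

  >: ix.py -h -w -x </tmp/hst/2019-35_seg0_file0_pdf.idx | fgrep -c 'WARC-Truncated: length'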

Are any of the pdfs in the corresponding wet file?

Yes, 2:
  >: cut -f 3 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv | fgrep -af - <(uz 1566027313501.0/orig/wet/*-00000.warc.wet.gz)
  WARC-Target-URI: http://bdds.deux-sevres.com/recherche/simple/Editeur/2/Belfond/vignette?format=pdf
  WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005

Is it in fact corresponding?
  >: diff -bw <(uz 1566027313501.0/orig/warc/*-00000.warc.gz | egrep -a '^WARC-Target-URI: ' | uniq | head -1000) <(uz 1566027313501.0/orig/wet/*-00000.warc.wet.gz | egrep -a '^WARC-Target-URI: ' | head -1000)|egrep -c '^<'
  19

So, yes, mostly: ~2% (19 of the first 1000 URIs) are missing.

Just checking the search:
  >: cut -f 3 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv | fgrep -af - <(uz 1566027313501.0/orig/warc/*-00000.warc.gz) | wc -l
  210
Correct: each capture has three records (request, response, metadata),
each with a WARC-Target-URI header, so 3 x 70 = 210.

So, what pdfs make it into the WET:
  >: cut -f 3 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv | fgrep -af - <(uz 1566027313501.0/orig/wet/*-00000.warc.wet.gz) > ~/results/CC-MAIN-2019-35/s0_file0_pdf.txt
  >: wc -l < ~/results/CC-MAIN-2019-35/s0_file0_pdf.txt
  2
 >: cut -f 2 -d ' ' ~/results/CC-MAIN-2019-35/s0_file0_pdf.txt | tr -d '\r' | fgrep -f -   ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv
  11588   10913   http://bdds.deux-sevres.com/recherche/simple/Editeur/2/Belfond/vignette?format=pdf
  1048979 1048576 https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005 

Here's the short one:
WARC/1.0
WARC-Type: response
WARC-Date: 2019-08-17T22:40:17Z
WARC-Record-ID: <urn:uuid:ea98167b-c42a-4233-b57e-994aa627e38a>
Content-Length: 11588
Content-Type: application/http; msgtype=response
WARC-Warcinfo-ID: <urn:uuid:f689f8d0-24f3-4824-9a38-4f3fee422a4e>
WARC-Concurrent-To: <urn:uuid:2d51c956-0012-4d78-affc-8f57fe9d2e15>
WARC-IP-Address: 92.175.114.24
WARC-Target-URI: http://bdds.deux-sevres.com/recherche/simple/Editeur/2/Belfond/vignette?format=pdf
WARC-Payload-Digest: sha1:7VVIUDQ4Q6XKNOAURYU4VTMRSZNPHDQA
WARC-Block-Digest: sha1:OSTWXLV772XNHS22T4UBSCSJAAXM2J6T
WARC-Identified-Payload-Type: application/pdf

HTTP/1.1 200 OK
Cache-Control: must-revalidate, post-check=0, pre-check=0,no-cache
Pragma: public,no-cache
Content-Type: application/pdf",text/html; charset=utf-8
X-Crawler-Content-Encoding: gzip
Expires: 0
Server:
X-Powered-By:
Set-Cookie: 166d74d734106ba68b20ea303011f622=301619e3fe31ecb98c8473f0ff5f35a2; path=/
Content-Disposition: attachment; filename="Médiathèque départementale des Deux-Sèvres - Résultats de la recherche Belfond.pdf"
Content-Transfer-Encoding: binary
P3P: CP="NOI ADM DEV PSAi COM NAV OUR OTRo STP IND DEM"
X-Content-Encoded-By:
X-Powered-By:
Date: Sat, 17 Aug 2019 22:40:16 GMT
X-Crawler-Content-Length: 5448
Content-Length: 10913

        %PDF-1.7
%<E2><E3><CF><D3>
7 0 obj
<< /Type /Page /Parent 1 0 R /LastModified (D:20190818004016+02'00') /Resources 2 0 R /MediaBox [0.000000 0.000000 595.276000 841.890000] /CropBox [0.000000 0.000000 595.276000 841.890000] /BleedBox [0.000000 0.000000 595.276000 841.890000] /TrimBox [0.000000 0.000000 595.276000 841.890000] /ArtBox [0.000000 0.000000 595.276000 841.890000] /Contents 8 0 R /Rotate 0 /Group << /Type /Group /S /Transparency /CS /DeviceRGB >> /PZ 1 >>
endobj
8 0 obj

  >: uz 1566027313501.0/orig/warc/*-00000.warc.gz|tail -n +1823434 | tail -n +24 | head -c 20000 > ~/results/CC-MAIN-2019-35/mediatheque.pdf
  >: ps2ascii mediatheque.pdf
                             Médiathèque départementale des Deux-Sèvres - Résultats de la recherche Belfond

                             Médiathèque départementale des Deux-Sèvres - Résultats de
                             la recherche Belfond
                                                               A charge de revanche
                             Titre :
                             Auteur : Grippando, James (1958-....)
  ...
  etc., three pages, no errors

  >: uz 1566027313501.0/orig/warc/*-00000.warc.gz|fgrep -an  https://museum.wrap.gov.tw/GetFile4.ashx
  38896837:WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005
  38896858:WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005
  38904590:WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005
  >: uz 1566027313501.0/orig/warc/*-00000.warc.gz|tail -n +38896858 | egrep -an '^%%EOF'
  27:%%EOF
  1114658:%%EOF
  1313299:%%EOF

Hunh?

  >: uz 1566027313501.0/orig/warc/*-00000.warc.gz|tail -n +38896858 | egrep -an '^(%%EOF|WARC)' | head -30
  1:WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005
  2:WARC-Payload-Digest: sha1:SZ53DQQHENC7DDN7GQ5IS7VMEPAXAMBE
  3:WARC-Block-Digest: sha1:QTKJA6A7445Z7264K2YAFBUUM2OYH2T2
  4:WARC-Truncated: length
  5:WARC-Identified-Payload-Type: application/pdf
  27:%%EOF
  7725:WARC/1.0
  7726:WARC-Type: metadata
  7727:WARC-Date: 2019-08-17T22:59:14Z
  7728:WARC-Record-ID: <urn:uuid:77df2747-e567-45d3-8646-3069ae9a9f25>
  7731:WARC-Warcinfo-ID: <urn:uuid:f689f8d0-24f3-4824-9a38-4f3fee422a4e>
  7732:WARC-Concurrent-To: <urn:uuid:eceb4adc-d81e-4497-82fe-eea61ce171f4>
  7733:WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005
  7739:WARC/1.0

OK, so indeed truncated after 7700 lines or so...
  >: uz 1566027313501.0/orig/warc/*-00000.warc.gz|tail -n +38896858 | tail -n +21 | head -c 1048576 > ~/results/CC-MAIN-2019-35/museum.pdf
  >: ps2ascii ~/results/CC-MAIN-2019-35/museum.pdf
   **** Error:  An error occurred while reading an XREF table.
   **** The file has been damaged.
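[Rather than counting lines in the uncompressed stream, the cdx offset
and length fields delimit each (independently gzipped) record, so the
payload can be cut out exactly (a sketch; $OFFSET and $LENGTH are
hypothetical shell variables holding the values from the record's cdx
entry, the two seds drop the WARC and HTTP header blocks, and GNU sed's
\r escape is assumed):

  >: tail -c +$((OFFSET+1)) 1566027313501.0/orig/warc/*-00000.warc.gz | head -c $LENGTH | zcat | sed '1,/^\r*$/d' | sed '1,/^\r*$/d' > ~/results/CC-MAIN-2019-35/museum.pdf
]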
Look in big_pdf?