cc/work: lurid3/notes.txt comparison

comparison lurid3/notes.txt @ 43:6ae6a21ccfb9

more downloads, exploring pdfs in wet

author	Henry S. Thompson <ht@inf.ed.ac.uk>
date	Thu, 05 Sep 2024 17:59:02 +0100
parents	0c472ae05f71
children	7209df5fa5b4

-:0c472ae05f71
+:6ae6a21ccfb9
 mean	=	1.20192e+09
 sd	=	2.26049e+07
 with 2015-35, with 353 files per segment
 >: lss */orig/warc/*-0023?-* | cut -f 5 -d ' ' | stats
-n	=	930
+n	=	1000
-min	=	1.66471e+08 [bug?]
+min	=	1.66471e+08
 max	=	9.6322e+08
-sum	=	8.54009e+11
+sum	=	9.19222e+11
-mean	=	9.1829e+08
+mean	=	9.19222e+08
-sd	=	8.48938e+07
+sd	=	8.20542e+07
 The min files all come from segment 1440644060633.7, whose files are
 _all_ small:
 >: uz *00123-*.gz | wc -l
 12,759,931
 >: zcat *00123-*.gz | wc -l
 75,806,738
 Mystery
 Also faster
-Compare 2023-40:
+Compare 2022-33:
 >: fgrep -h BST /tmp/hst/get_22-33_{2-23,24-49,50-99}.log |  cut -f 1-7 -d ' ' | while read s; do if read e; then echo $((($(date --date="$e" +%s) - $(date --date="$s" +%s)) / 60)); fi; done   | stats n min max  mean    sd
 98 19 256  75.1   25.2
 with 2015-35:
 >: fgrep -h BST /tmp/hst/get_15-35_{0-9,10-99}.log |  cut -f 1-7 -d ' ' | while read s; do if read e; then echo $((($(date --date="$e" +%s) - $(date --date="$s" +%s)) / 60)); fi; done   | stats n min max mean sd
-		        95 15  40 32.4 2.90
+		       100 15  40 32.6 2.9
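The two timing pipelines above pair consecutive timestamp lines (start, then end) and print the difference in whole minutes. A minimal sketch of that loop as a reusable function, assuming GNU date and input that alternates start/end lines; the name elapsed_minutes is hypothetical, not something in ~/bin:

```shell
# Hypothetical helper: read alternating start/end timestamp lines on stdin
# and print the whole minutes elapsed for each pair (integer division).
# Assumes GNU date (--date); an unpaired trailing line is ignored.
elapsed_minutes() {
  while read -r s; do
    if read -r e; then
      echo $(( ( $(date --date="$e" +%s) - $(date --date="$s" +%s) ) / 60 ))
    fi
  done
}
```

Usage would then be along the lines of `fgrep -h BST some.log | cut -f 1-7 -d ' ' | elapsed_minutes | stats n min max mean sd`, where some.log stands in for the real log files.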
+>: echo {000..299} | tr ' ' '\n' | parallel -j 10 'uz cdx-00{}.gz | cut -f 2 -d " " | sort -u > /tmp/hst/2015_{}' &
+>: sort --parallel=10 -mu /tmp/hst/2015_??? > /tmp/hst/2015_all
+>: head -1 /tmp/hst/2015_all
+20150827191534
+>: tail -1 /tmp/hst/2015_all
+20150905180914
+>: wc -l /tmp/hst/2015_all
+698128 /tmp/hst/2015_all
+What about wet files -- do they include text from pdfs?  What about
+truncated pdfs?
+>: time for s in 0; do ~/bin/getcc_wet_multi.aws CC-MAIN-2019-35 $s 10; done > /tmp/hst/get_wet_19-35_0.log &
+real    26m3.049s
+user    0m1.225s
+sys     0m1.310s
+In the segment 0 cdx file (!) we find 3747 probable truncations:
+>: zgrep -a '"mime-detected": "application/pdf", ' cdx.gz > /tmp/hst/2019-35_seg0_pdf.idx
+>: wc -l /tmp/hst/2019-35_seg0_pdf.idx
+42345 /tmp/hst/2019-35_seg0_pdf.idx
+>: egrep -a '"length": "10....."' /tmp/hst/2019-35_seg0_pdf.idx > /tmp/hst/2019-35_seg0_long_pdf.idx &
+>: wc -l < /tmp/hst/2019-35_seg0_long_pdf.idx
+3747
+Of the 42345 pdf entries, 70 are in file 0:
+>: egrep -a '.-00000\.' /tmp/hst/2019-35_seg0_pdf.idx > /tmp/hst/2019-35_seg0_file0_pdf.idx
+>: wc -l /tmp/hst/2019-35_seg0_file0_pdf.idx
+70 /tmp/hst/2019-35_seg0_file0_pdf.idx
+In segment 0 file 0 we find 70 application/pdf Content-Type headers:
+>: ix.py -h -w  -x </tmp/hst/2019-35_seg0_file0_pdf.idx |egrep '^(WARC-Target-URI:|Content-Length:) '|cut -f 2 -d ' ' |tr -d '\r'|while read l1; do read uri; read l2; printf '%s\t%s\t%s\n' $l1 $l2 "$uri"; done > ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv
+>: wc -l < ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv
+70
+>: head -3 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv
+Of which 14 are truncated:
+>: fgrep -c 1048576 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv
+14
+E.g.
+>: fgrep 1048576 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv | head -3
+1049051 1048576 https://en.heks.ch/sites/default/files/documents/2017-09/HEKS_EPER_Mission_Statement_2016_e.pdf
+1049469 1048576 https://bmcmicrobiol.biomedcentral.com/track/pdf/10.1186/s12866-017-0951-4
+1048824 1048576 https://citydocs.fcgov.com/?action=cao-cases&cmd=convert&docid=3332339
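Note that fgrep 1048576 matches that string anywhere on the line, including inside a URI or the first column. A hedged sketch that tests only the second column for exactly 1 MiB (1048576 = 2^20), assuming the tab-separated layout written by the printf above; truncated_rows is a hypothetical name:

```shell
# Hypothetical helper: print rows of a tab-separated file (or stdin)
# whose 2nd column is exactly 1048576, i.e. the 1 MiB truncation limit.
truncated_rows() {
  awk -F'\t' '$2 == 1048576' "$@"
}
```

For example, `truncated_rows ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv | wc -l` should agree with the fgrep -c count unless a URI happens to contain the digits.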
+Are any of the pdfs in the corresponding wet file?
+Yes, 2:
+>: cut -f 3 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv | fgrep -af - <(uz 1566027313501.0/orig/wet/*-00000.warc.wet.gz)
+WARC-Target-URI: http://bdds.deux-sevres.com/recherche/simple/Editeur/2/Belfond/vignette?format=pdf
+WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D00
+Is it in fact corresponding?
+>: diff -bw <(uz 1566027313501.0/orig/warc/*-00000.warc.gz | egrep -a '^WARC-Target-URI: ' | uniq | head -1000) <(uz 1566027313501.0/orig/wet/*-00000.warc.wet.gz | egrep -a '^WARC-Target-URI: ' | head -1000)|egrep -c '^<'
+19
+So, yes, mostly.  About 2% (19 of the first 1000) are missing
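The diff above compares the two URI lists in order. An order-independent alternative, sketched here with comm on sorted unique lists, counts the lines in the first file that never appear in the second. This swaps in a different technique from the diff used in these notes, and missing_count is a hypothetical name:

```shell
# Hypothetical helper: count lines present in file $1 but absent from file $2,
# ignoring order and duplicates. Runs in a subshell so variables do not leak.
missing_count() (
  a=$(mktemp) b=$(mktemp)
  sort -u "$1" > "$a"
  sort -u "$2" > "$b"
  comm -23 "$a" "$b" | wc -l   # column 1 only: lines unique to the first file
  rm -f "$a" "$b"
)
```

The inputs would be the two WARC-Target-URI lists saved to files first, e.g. the egrep output from the warc file and from the wet file.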
+Just checking the search:
+>: cut -f 3 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv | fgrep -af - <(uz 1566027313501.0/orig/warc/*-00000.warc.gz) | wc -l
+210
+Correct: 3 matching lines per URI (210 = 3 x 70)
+So, what pdfs make it into the WET:
+>: cut -f 3 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv | fgrep -af - <(uz 1566027313501.0/orig/wet/*-00000.warc.wet.gz) > ~/results/CC-MAIN-2019-35/s0_file0_pdf.txt
+>: wc -l < ~/results/CC-MAIN-2019-35/s0_file0_pdf.txt
+2
+>: cut -f 2 -d ' ' ~/results/CC-MAIN-2019-35/s0_file0_pdf.txt | tr -d '\r' | fgrep -f -   ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv
+11588   10913   http://bdds.deux-sevres.com/recherche/simple/Editeur/2/Belfond/vignette?format=pdf
+1048979 1048576 https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005
+Here's the short one:
+WARC/1.0
+WARC-Type: response
+WARC-Date: 2019-08-17T22:40:17Z
+WARC-Record-ID: <urn:uuid:ea98167b-c42a-4233-b57e-994aa627e38a>
+Content-Length: 11588
+Content-Type: application/http; msgtype=response
+WARC-Warcinfo-ID: <urn:uuid:f689f8d0-24f3-4824-9a38-4f3fee422a4e>
+WARC-Concurrent-To: <urn:uuid:2d51c956-0012-4d78-affc-8f57fe9d2e15>
+WARC-IP-Address: 92.175.114.24
+WARC-Target-URI: http://bdds.deux-sevres.com/recherche/simple/Editeur/2/Belfond/vignette?format=pdf
+WARC-Payload-Digest: sha1:7VVIUDQ4Q6XKNOAURYU4VTMRSZNPHDQA
+WARC-Block-Digest: sha1:OSTWXLV772XNHS22T4UBSCSJAAXM2J6T
+WARC-Identified-Payload-Type: application/pdf
+HTTP/1.1 200 OK
+Cache-Control: must-revalidate, post-check=0, pre-check=0,no-cache
+Pragma: public,no-cache
+Content-Type: application/pdf",text/html; charset=utf-8
+X-Crawler-Content-Encoding: gzip
+Expires: 0
+Server:
+X-Powered-By:
+Set-Cookie: 166d74d734106ba68b20ea303011f622=301619e3fe31ecb98c8473f0ff5f35a2; path=/
+Content-Disposition: attachment; filename="Mdiathque dpartementale des Deux-Svres - Rsultats de la recherche Belfond.pdf"
+Content-Transfer-Encoding: binary
+P3P: CP="NOI ADM DEV PSAi COM NAV OUR OTRo STP IND DEM"
+X-Content-Encoded-By:
+X-Powered-By:
+Date: Sat, 17 Aug 2019 22:40:16 GMT
+X-Crawler-Content-Length: 5448
+Content-Length: 10913
+%PDF-1.7
+%<E2><E3><CF><D3>
+7 0 obj
+<< /Type /Page /Parent 1 0 R /LastModified (D:20190818004016+02'00') /Resources 2
+0 R /MediaBox [0.000000 0.000000 595.276000 841.890000] /CropBox [0.000000 0.000
+000 595.276000 841.890000] /BleedBox [0.000000 0.000000 595.276000 841.890000] /T
+rimBox [0.000000 0.000000 595.276000 841.890000] /ArtBox [0.000000 0.000000 595.2
+76000 841.890000] /Contents 8 0 R /Rotate 0 /Group << /Type /Group /S /Transparen
+cy /CS /DeviceRGB >> /PZ 1 >>
+endobj
+8 0 obj
+>: uz 1566027313501.0/orig/warc/*-00000.warc.gz|tail -n +1823434 | tail -n +24 | head -c 20000 > ~/results/CC-MAIN-2019-35/mediatheque.pdf
+>: ps2ascii mediatheque.pdf
+Médiathèque départementale des Deux-Sèvres - Résultats de la recherche Belfond
+Médiathèque départementale des Deux-Sèvres - Résultats de
+la recherche Belfond
+A charge de revanche
+Titre :
+Auteur : Grippando, James (1958-....)
+...
+etc., three pages, no errors
+>: uz 1566027313501.0/orig/warc/*-00000.warc.gz|fgrep -an  https://museum.wrap.gov.tw/GetFile4.ashx
+38896837:WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005
+38896858:WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005
+38904590:WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005
+>: uz 1566027313501.0/orig/warc/*-00000.warc.gz|tail -n +38896858 | egrep -an '^%%EOF'
+27:%%EOF
+1114658:%%EOF
+1313299:%%EOF
+Hunh?
+>: uz 1566027313501.0/orig/warc/*-00000.warc.gz|tail -n +38896858 | egrep -an '^(%%EOF|WARC)' | head -30
+1:WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005
+2:WARC-Payload-Digest: sha1:SZ53DQQHENC7DDN7GQ5IS7VMEPAXAMBE
+3:WARC-Block-Digest: sha1:QTKJA6A7445Z7264K2YAFBUUM2OYH2T2
+4:WARC-Truncated: length
+5:WARC-Identified-Payload-Type: application/pdf
+27:%%EOF
+7725:WARC/1.0
+7726:WARC-Type: metadata
+7727:WARC-Date: 2019-08-17T22:59:14Z
+7728:WARC-Record-ID: <urn:uuid:77df2747-e567-45d3-8646-3069ae9a9f25>
+7731:WARC-Warcinfo-ID: <urn:uuid:f689f8d0-24f3-4824-9a38-4f3fee422a4e>
+7732:WARC-Concurrent-To: <urn:uuid:eceb4adc-d81e-4497-82fe-eea61ce171f4>
+7733:WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005
+7739:WARC/1.0
+OK, so indeed truncated after 7700 lines or so...
+>: uz 1566027313501.0/orig/warc/*-00000.warc.gz|tail -n +38896858 | tail -n +21 | head -c 1048576 > ~/results/CC-MAIN-2019-35/museum.pdf
+>: ps2ascii ~/results/CC-MAIN-2019-35/museum.pdf
+**** Error:  An error occurred while reading an XREF table.
+**** The file has been damaged.
+Look in big_pdf?
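The XREF error fits a file cut off before its final trailer. A quick sketch for checking whether the last 1024 bytes of a file contain a %%EOF marker; has_pdf_eof is a hypothetical name, and the 1024-byte window is a common reader convention, not anything taken from these notes:

```shell
# Hypothetical helper: succeed if the last 1024 bytes of file $1 contain
# the PDF end-of-file marker. An earlier %%EOF (as at line 27 above) is
# deliberately not counted, since only the tail of the file is inspected.
has_pdf_eof() {
  tail -c 1024 "$1" | grep -aq '%%EOF'
}
```

E.g. `has_pdf_eof ~/results/CC-MAIN-2019-35/museum.pdf || echo truncated`.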
