changeset 43:6ae6a21ccfb9

more downloads, exploring pdfs in wet
author Henry S. Thompson <ht@inf.ed.ac.uk>
date Thu, 05 Sep 2024 17:59:02 +0100
parents 0c472ae05f71
children 7209df5fa5b4
files lurid3/notes.txt
diffstat 1 files changed, 170 insertions(+), 7 deletions(-)
--- a/lurid3/notes.txt	Mon Sep 02 15:02:01 2024 +0100
+++ b/lurid3/notes.txt	Thu Sep 05 17:59:02 2024 +0100
@@ -98,12 +98,12 @@
 
 with 2015-35, with 353 files per segment
   >: lss */orig/warc/*-0023?-* | cut -f 5 -d ' ' | stats
-  n	=	930
-  min	=	1.66471e+08 [bug?]
+  n	=	1000
+  min	=	1.66471e+08
   max	=	9.6322e+08
-  sum	=	8.54009e+11
-  mean	=	9.1829e+08
-  sd	=	8.48938e+07
+  sum	=	9.19222e+11
+  mean	=	9.19222e+08
+  sd	=	8.20542e+07
 
 The min files all come from segment 1440644060633.7, whose files are
 _all_ small:
@@ -115,9 +115,172 @@
 Mystery
 
 Also faster
-Compare 2023-40:
+Compare 2022-33:
  >: fgrep -h BST /tmp/hst/get_22-33_{2-23,24-49,50-99}.log |  cut -f 1-7 -d ' ' | while read s; do if read e; then echo $((($(date --date="$e" +%s) - $(date --date="$s" +%s)) / 60)); fi; done   | stats n min max  mean    sd
                               98 19 256  75.1   25.2
 with 2015-35:
   >: fgrep -h BST /tmp/hst/get_15-35_{0-9,10-99}.log |  cut -f 1-7 -d ' ' | while read s; do if read e; then echo $((($(date --date="$e" +%s) - $(date --date="$s" +%s)) / 60)); fi; done   | stats n min max mean sd
-		        95 15  40 32.4 2.90
+		       100 15  40 32.6 2.9
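+
+The pairing trick above (read the timestamped log lines two at a time, start then end) could
+be wrapped up for reuse.  A sketch, untested, assuming GNU date and the same log layout:
+
+  # print elapsed whole minutes for each consecutive (start, end) timestamp pair on stdin
+  elapsed_minutes () {
+    while read -r s && read -r e; do
+      echo $(( ( $(date --date="$e" +%s) - $(date --date="$s" +%s) ) / 60 ))
+    done
+  }
+  # e.g.  fgrep -h BST /tmp/hst/get_15-35_{0-9,10-99}.log | cut -f 1-7 -d ' ' | elapsed_minutes | stats n min max mean sd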
+
+Date range (and number of distinct fetch timestamps) in the 2015-35 cdx files:
+  >: echo {000..299} | tr ' ' '\n' | parallel -j 10 'uz cdx-00{}.gz | cut -f 2 -d " " | sort -u > /tmp/hst/2015_{}' &
+  >: sort --parallel=10 -mu /tmp/hst/2015_??? > /tmp/hst/2015_all
+  >: head -1 /tmp/hst/2015_all
+  20150827191534
+  >: tail -1 /tmp/hst/2015_all
+  20150905180914
+  >: wc -l /tmp/hst/2015_all
+  698128 /tmp/hst/2015_all
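+
+Those 14-digit cdx timestamps are YYYYMMDDhhmmss (UTC), so the crawl ran from 27 August to
+5 September 2015.  A little converter in case it's wanted again (sketch, assumes GNU date):
+
+  # cdx timestamp -> readable UTC date, e.g.  ts2date 20150827191534
+  ts2date () { date -u -d "${1:0:8} ${1:8:2}:${1:10:2}:${1:12:2}" '+%F %T %Z'; }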
+
+What about wet files -- do they include text from pdfs?  What about
+truncated pdfs?
+
+  >: time for s in 0; do ~/bin/getcc_wet_multi.aws CC-MAIN-2019-35 $s 10; done > /tmp/hst/get_wet_19-35_0.log &
+  real    26m3.049s
+  user    0m1.225s
+  sys     0m1.310s
+
+In the segment 0 cdx file (!) we find 3747 probable truncations:
+  >: zgrep -a '"mime-detected": "application/pdf", ' cdx.gz > /tmp/hst/2019-35_seg0_pdf.idx
+  >: wc -l /tmp/hst/2019-35_seg0_pdf.idx
+  42345 /tmp/hst/2019-35_seg0_pdf.idx
+  >: egrep -a '"length": "10....."' /tmp/hst/2019-35_seg0_pdf.idx > /tmp/hst/2019-35_seg0_long_pdf.idx &
+  >: wc -l < /tmp/hst/2019-35_seg0_long_pdf.idx
+  3747
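+
+The "10....." regexp is a proxy for the 1MiB payload cap: the cdx "length" field is the size
+of the gzipped record, so anything in the 1000000-1099999 byte range is a likely truncation.
+A jq version of the same filter, untested, assuming the usual cdx layout of urlkey,
+timestamp, JSON:
+
+  uz cdx.gz | cut -d ' ' -f 3- \
+    | jq -c 'select(."mime-detected" == "application/pdf"
+                    and (.length | tonumber) >= 1000000 and (.length | tonumber) < 1100000)' \
+    | wc -l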
+Of the 42345 pdf captures, 70 are in file 0:
+  >: egrep -a '.-00000\.' /tmp/hst/2019-35_seg0_pdf.idx > /tmp/hst/2019-35_seg0_file0_pdf.idx
+  >: wc -l /tmp/hst/2019-35_seg0_file0_pdf.idx
+  70 /tmp/hst/2019-35_seg0_file0_pdf.idx
+
+Pulling those 70 records from segment 0 file 0 with ix.py and tabulating WARC record
+Content-Length, HTTP Content-Length and target URI:
+  >: ix.py -h -w  -x </tmp/hst/2019-35_seg0_file0_pdf.idx |egrep '^(WARC-Target-URI:|Content-Length:) '|cut -f 2 -d ' ' |tr -d '\r'|while read l1; do read uri; read l2; printf '%s\t%s\t%s\n' $l1 $l2 "$uri"; done > ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv
+  >: wc -l < ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv
+  70
+  >: head -3 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv
+
+
+Of which 14 are truncated (HTTP Content-Length of exactly 1048576, the 1MiB cap):
+  >: fgrep -c 1048576 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv
+  14
+
+E.g.
+  >: fgrep 1048576 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv | head -3
+  1049051 1048576 https://en.heks.ch/sites/default/files/documents/2017-09/HEKS_EPER_Mission_Statement_2016_e.pdf
+  1049469 1048576 https://bmcmicrobiol.biomedcentral.com/track/pdf/10.1186/s12866-017-0951-4
+  1048824 1048576 https://citydocs.fcgov.com/?action=cao-cases&cmd=convert&docid=3332339
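+
+A cross-check that doesn't rely on the length heuristic (untested): the crawler flags these
+records itself, so counting WARC-Truncated headers in the warc gives the total number of
+truncated captures of any media type, which should be >= the pdf count above:
+
+  uz 1566027313501.0/orig/warc/*-00000.warc.gz | fgrep -ac 'WARC-Truncated: length'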
+
+Are any of the pdfs in the corresponding wet file?
+
+Yes, 2:
+  >: cut -f 3 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv | fgrep -af - <(uz 1566027313501.0/orig/wet/*-00000.warc.wet.gz)
+  WARC-Target-URI: http://bdds.deux-sevres.com/recherche/simple/Editeur/2/Belfond/vignette?format=pdf
+WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005
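+
+To eyeball what text actually got extracted for one of these, a grep-with-context peek on
+the wet file should do (untested; the GetFile4 prefix avoids worrying about the exact query
+string):
+
+  uz 1566027313501.0/orig/wet/*-00000.warc.wet.gz \
+    | fgrep -a -A 30 'WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx'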
+
+Is it in fact corresponding?
+  >: diff -bw <(uz 1566027313501.0/orig/warc/*-00000.warc.gz | egrep -a '^WARC-Target-URI: ' | uniq | head -1000) <(uz 1566027313501.0/orig/wet/*-00000.warc.wet.gz | egrep -a '^WARC-Target-URI: ' | head -1000)|egrep -c '^<'
+  19
+
+So, yes, mostly: about 2% (19 of the first 1000 URIs) are missing from the wet.
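+
+head -1000 only samples the start of each file; a whole-file version of the same check would
+be comm on the sorted unique URI lists (sketch, untested) -- bearing in mind that captures
+with no extractable text legitimately have no wet record, so this counts an upper bound:
+
+  # URIs present in the warc but absent from the wet
+  comm -23 <(uz 1566027313501.0/orig/warc/*-00000.warc.gz | egrep -a '^WARC-Target-URI: ' | sort -u) \
+           <(uz 1566027313501.0/orig/wet/*-00000.warc.wet.gz | egrep -a '^WARC-Target-URI: ' | sort -u) \
+    | wc -l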
+
+Just checking the search:
+  >: cut -f 3 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv | fgrep -af - <(uz 1566027313501.0/orig/warc/*-00000.warc.gz) | wc -l
+  210
+Correct -- 3 WARC records (request, response, metadata) per capture.
+
+So, what pdfs make it into the WET:
+  >: cut -f 3 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv | fgrep -af - <(uz 1566027313501.0/orig/wet/*-00000.warc.wet.gz) > ~/results/CC-MAIN-2019-35/s0_file0_pdf.txt
+  >: wc -l < ~/results/CC-MAIN-2019-35/s0_file0_pdf.txt
+  2
+ >: cut -f 2 -d ' ' ~/results/CC-MAIN-2019-35/s0_file0_pdf.txt | tr -d '\r' | fgrep -f -   ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv
+  11588   10913   http://bdds.deux-sevres.com/recherche/simple/Editeur/2/Belfond/vignette?format=pdf
+  1048979 1048576 https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005 
+
+So the bdds capture is complete (HTTP payload of 10913 bytes, well under the cap), while the
+museum one is truncated at 1048576.  Here's the short one:
+WARC/1.0
+WARC-Type: response
+WARC-Date: 2019-08-17T22:40:17Z
+WARC-Record-ID: <urn:uuid:ea98167b-c42a-4233-b57e-994aa627e38a>
+Content-Length: 11588
+Content-Type: application/http; msgtype=response
+WARC-Warcinfo-ID: <urn:uuid:f689f8d0-24f3-4824-9a38-4f3fee422a4e>
+WARC-Concurrent-To: <urn:uuid:2d51c956-0012-4d78-affc-8f57fe9d2e15>
+WARC-IP-Address: 92.175.114.24
+WARC-Target-URI: http://bdds.deux-sevres.com/recherche/simple/Editeur/2/Belfond/vignette?format=pdf
+WARC-Payload-Digest: sha1:7VVIUDQ4Q6XKNOAURYU4VTMRSZNPHDQA
+WARC-Block-Digest: sha1:OSTWXLV772XNHS22T4UBSCSJAAXM2J6T
+WARC-Identified-Payload-Type: application/pdf
+
+HTTP/1.1 200 OK
+Cache-Control: must-revalidate, post-check=0, pre-check=0,no-cache
+Pragma: public,no-cache
+Content-Type: application/pdf",text/html; charset=utf-8
+X-Crawler-Content-Encoding: gzip
+Expires: 0
+Server:
+X-Powered-By:
+Set-Cookie: 166d74d734106ba68b20ea303011f622=301619e3fe31ecb98c8473f0ff5f35a2; path=/
+Content-Disposition: attachment; filename="Médiathèque départementale des Deux-Sèvres - Résultats de la recherche Belfond.pdf"
+Content-Transfer-Encoding: binary
+P3P: CP="NOI ADM DEV PSAi COM NAV OUR OTRo STP IND DEM"
+X-Content-Encoded-By:
+X-Powered-By:
+Date: Sat, 17 Aug 2019 22:40:16 GMT
+X-Crawler-Content-Length: 5448
+Content-Length: 10913
+
+        %PDF-1.7
+%<E2><E3><CF><D3>
+7 0 obj
+<< /Type /Page /Parent 1 0 R /LastModified (D:20190818004016+02'00') /Resources 2
+ 0 R /MediaBox [0.000000 0.000000 595.276000 841.890000] /CropBox [0.000000 0.000
+000 595.276000 841.890000] /BleedBox [0.000000 0.000000 595.276000 841.890000] /T
+rimBox [0.000000 0.000000 595.276000 841.890000] /ArtBox [0.000000 0.000000 595.2
+76000 841.890000] /Contents 8 0 R /Rotate 0 /Group << /Type /Group /S /Transparen
+cy /CS /DeviceRGB >> /PZ 1 >>
+endobj
+8 0 obj
+
+  >: uz 1566027313501.0/orig/warc/*-00000.warc.gz|tail -n +1823434 | tail -n +24 | head -c 20000 > ~/results/CC-MAIN-2019-35/mediatheque.pdf
+  >: ps2ascii mediatheque.pdf
+                             Médiathèque départementale des Deux-Sèvres - Résultats de la recherche Belfond
+
+                             Médiathèque départementale des Deux-Sèvres - Résultats de
+                             la recherche Belfond
+                                                               A charge de revanche
+                             Titre :
+                             Auteur : Grippando, James (1958-....)
+  ...
+  etc., three pages, no errors
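+
+Counting lines with tail works, but the cdx "offset"/"length" fields point straight at the
+record, and each record is its own gzip member, so a byte slice decompresses cleanly.  A
+sketch, untested:
+
+  # usage: warcslice OFFSET LENGTH FILE   (OFFSET/LENGTH from the capture's cdx entry)
+  warcslice () { tail -c +$(( $1 + 1 )) "$3" | head -c "$2" | zcat; }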
+
+  >: uz 1566027313501.0/orig/warc/*-00000.warc.gz|fgrep -an  https://museum.wrap.gov.tw/GetFile4.ashx
+  38896837:WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005
+  38896858:WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005
+  38904590:WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005
+  >: uz 1566027313501.0/orig/warc/*-00000.warc.gz|tail -n +38896858 | egrep -an '^%%EOF'
+  27:%%EOF
+  1114658:%%EOF
+  1313299:%%EOF
+
+Hunh?
+
+  >: uz 1566027313501.0/orig/warc/*-00000.warc.gz|tail -n +38896858 | egrep -an '^(%%EOF|WARC)' | head -30
+  1:WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005
+  2:WARC-Payload-Digest: sha1:SZ53DQQHENC7DDN7GQ5IS7VMEPAXAMBE
+  3:WARC-Block-Digest: sha1:QTKJA6A7445Z7264K2YAFBUUM2OYH2T2
+  4:WARC-Truncated: length
+  5:WARC-Identified-Payload-Type: application/pdf
+  27:%%EOF
+  7725:WARC/1.0
+  7726:WARC-Type: metadata
+  7727:WARC-Date: 2019-08-17T22:59:14Z
+  7728:WARC-Record-ID: <urn:uuid:77df2747-e567-45d3-8646-3069ae9a9f25>
+  7731:WARC-Warcinfo-ID: <urn:uuid:f689f8d0-24f3-4824-9a38-4f3fee422a4e>
+  7732:WARC-Concurrent-To: <urn:uuid:eceb4adc-d81e-4497-82fe-eea61ce171f4>
+  7733:WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005
+  7739:WARC/1.0
+
+OK, so indeed truncated after 7700 lines or so...
+  >: uz 1566027313501.0/orig/warc/*-00000.warc.gz|tail -n +38896858 | tail -n +21 | head -c 1048576 > ~/results/CC-MAIN-2019-35/museum.pdf
+  >: ps2ascii ~/results/CC-MAIN-2019-35/museum.pdf
+   **** Error:  An error occurred while reading an XREF table.
+   **** The file has been damaged.
+Look in big_pdf?
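+
+Or first try a more forgiving extractor on the truncated copy: poppler's pdftotext will
+attempt to reconstruct a damaged xref table, and might still give the text that made it into
+the first 1MiB (untested):
+
+  pdftotext ~/results/CC-MAIN-2019-35/museum.pdf - | head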