comparison lurid3/notes.txt @ 43:6ae6a21ccfb9
more downloads,
exploring pdfs in wet
author | Henry S. Thompson <ht@inf.ed.ac.uk> |
---|---|
date | Thu, 05 Sep 2024 17:59:02 +0100 |
parents | 0c472ae05f71 |
children | 7209df5fa5b4 |
comparison
42:0c472ae05f71 | 43:6ae6a21ccfb9 |
---|---|
96 mean = 1.20192e+09 | 96 mean = 1.20192e+09 |
97 sd = 2.26049e+07 | 97 sd = 2.26049e+07 |
98 | 98 |
99 with 2015-35, with 353 files per segment | 99 with 2015-35, with 353 files per segment |
100 >: lss */orig/warc/*-0023?-* | cut -f 5 -d ' ' | stats | 100 >: lss */orig/warc/*-0023?-* | cut -f 5 -d ' ' | stats |
101 n = 930 | 101 n = 1000 |
102 min = 1.66471e+08 [bug?] | 102 min = 1.66471e+08 |
103 max = 9.6322e+08 | 103 max = 9.6322e+08 |
104 sum = 8.54009e+11 | 104 sum = 9.19222e+11 |
105 mean = 9.1829e+08 | 105 mean = 9.19222e+08 |
106 sd = 8.48938e+07 | 106 sd = 8.20542e+07 |
107 | 107 |
108 The min files all come from segment 1440644060633.7, whose files are | 108 The min files all come from segment 1440644060633.7, whose files are |
109 _all_ small: | 109 _all_ small: |
110 >: uz *00123-*.gz | wc -l | 110 >: uz *00123-*.gz | wc -l |
111 12,759,931 | 111 12,759,931 |
113 >: zcat *00123-*.gz | wc -l | 113 >: zcat *00123-*.gz | wc -l |
114 75,806,738 | 114 75,806,738 |
115 Mystery | 115 Mystery |
116 | 116 |
117 Also faster | 117 Also faster |
118 Compare 2023-40: | 118 Compare 2022-33: |
119 >: fgrep -h BST /tmp/hst/get_22-33_{2-23,24-49,50-99}.log | cut -f 1-7 -d ' ' | while read s; do if read e; then echo $((($(date --date="$e" +%s) - $(date --date="$s" +%s)) / 60)); fi; done | stats n min max mean sd | 119 >: fgrep -h BST /tmp/hst/get_22-33_{2-23,24-49,50-99}.log | cut -f 1-7 -d ' ' | while read s; do if read e; then echo $((($(date --date="$e" +%s) - $(date --date="$s" +%s)) / 60)); fi; done | stats n min max mean sd |
120 98 19 256 75.1 25.2 | 120 98 19 256 75.1 25.2 |
121 with 2015-35: | 121 with 2015-35: |
122 >: fgrep -h BST /tmp/hst/get_15-35_{0-9,10-99}.log | cut -f 1-7 -d ' ' | while read s; do if read e; then echo $((($(date --date="$e" +%s) - $(date --date="$s" +%s)) / 60)); fi; done | stats n min max mean sd | 122 >: fgrep -h BST /tmp/hst/get_15-35_{0-9,10-99}.log | cut -f 1-7 -d ' ' | while read s; do if read e; then echo $((($(date --date="$e" +%s) - $(date --date="$s" +%s)) / 60)); fi; done | stats n min max mean sd |
123 95 15 40 32.4 2.90 | 123 100 15 40 32.6 2.9 |
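The `stats` command in the pipelines above looks like a local tool, so its exact behaviour is not known here. As a rough stand-in, the following sketch computes the same five columns (n, min, max, mean, sd) with plain awk. The function name `summarize`, the output format, and the choice of the sample standard deviation (dividing by n - 1) are all assumptions, not taken from the notes.

```shell
# Hypothetical stand-in for `stats n min max mean sd`:
# reads one number per line on stdin, prints "n min max mean sd".
# Assumption: sd is the sample standard deviation (divides by n - 1).
summarize() {
  awk '
    NR == 1 { min = $1 + 0; max = $1 + 0 }
    {
      x = $1 + 0            # force numeric interpretation
      n++; sum += x; sumsq += x * x
      if (x < min) min = x
      if (x > max) max = x
    }
    END {
      if (n == 0) exit 1    # no input: nothing to report
      mean = sum / n
      var = (n > 1) ? (sumsq - n * mean * mean) / (n - 1) : 0
      if (var < 0) var = 0  # guard against rounding noise
      printf "%d %g %g %.3g %.3g\n", n, min, max, mean, sqrt(var)
    }'
}

# Usage, on the minutes printed by the while-read loop above:
#   ... | summarize
```

The sum-of-squares form keeps everything in one pass, which is good enough for a few hundred durations; for very large or very tightly clustered values a two-pass or Welford-style update would be numerically safer.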
124 | |
125 >: echo {000..299} | tr ' ' '\n' | parallel -j 10 'uz cdx-00{}.gz | cut -f 2 -d " " | sort -u > /tmp/hst/2015_{}' & | |
126 >: sort --parallel=10 -mu /tmp/hst/2015_??? > /tmp/hst/2015_all | |
127 >: head -1 /tmp/hst/2015_all | |
128 20150827191534 | |
129 >: tail -1 /tmp/hst/2015_all | |
130 20150905180914 | |
131 >: wc -l /tmp/hst/2015_all | |
132 698128 /tmp/hst/2015_all | |
133 | |
134 What about wet files -- do they include text from pdfs? What about | |
135 truncated pdfs? | |
136 | |
137 >: time for s in 0; do ~/bin/getcc_wet_multi.aws CC-MAIN-2019-35 $s 10; done > /tmp/hst/get_wet_19-35_0.log & | |
138 real 26m3.049s | |
139 user 0m1.225s | |
140 sys 0m1.310s | |
141 | |
142 In the segment 0 cdx file (!) we find 3747 probable truncations: | |
143 >: zgrep -a '"mime-detected": "application/pdf", ' cdx.gz > /tmp/hst/2019-35_seg0_pdf.idx | |
144 >: wc -l /tmp/hst/2019-35_seg0_pdf.idx | |
145 42345 /tmp/hst/2019-35_seg0_pdf.idx | |
146 >: egrep -a '"length": "10....."' /tmp/hst/2019-35_seg0_pdf.idx > /tmp/hst/2019-35_seg0_long_pdf.idx & | |
147 >: wc -l < /tmp/hst/2019-35_seg0_long_pdf.idx | |
148 3747 | |
149 Of the 42345 pdfs (all lengths, not just the long ones), 70 are in file 0: | |
150 >: egrep -a '.-00000\.' /tmp/hst/2019-35_seg0_pdf.idx > /tmp/hst/2019-35_seg0_file0_pdf.idx | |
151 >: wc -l /tmp/hst/2019-35_seg0_file0_pdf.idx | |
152 70 /tmp/hst/2019-35_seg0_file0_pdf.idx | |
153 | |
154 In segment 0 file 0 we find 70 application/pdf Content-Type headers: | |
155 >: ix.py -h -w -x </tmp/hst/2019-35_seg0_file0_pdf.idx |egrep '^(WARC-Target-URI:|Content-Length:) '|cut -f 2 -d ' ' |tr -d '\r'|while read l1; do read uri; read l2; printf '%s\t%s\t%s\n' $l1 $l2 "$uri"; done > ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv | |
156 >: wc -l < ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv | |
157 70 | |
158 >: head -3 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv | |
159 | |
160 | |
161 Of which 14 are truncated: | |
162 >: fgrep -c 1048576 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv | |
163 14 | |
164 | |
165 E.g. | |
166 >: fgrep 1048576 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv | head -3 | |
167 1049051 1048576 https://en.heks.ch/sites/default/files/documents/2017-09/HEKS_EPER_Mission_Statement_2016_e.pdf | |
168 1049469 1048576 https://bmcmicrobiol.biomedcentral.com/track/pdf/10.1186/s12866-017-0951-4 | |
169 1048824 1048576 https://citydocs.fcgov.com/?action=cao-cases&cmd=convert&docid=3332339 | |
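The `fgrep 1048576` test matches that number anywhere on the line, so it would also pick up a row whose URI or first column happened to contain it. A stricter variant, sketched below, compares only the second tab-separated column against the 1 MiB (2^20 = 1048576) limit. The function name `truncated_rows` is hypothetical, and the sketch assumes the column layout seen in the sample rows above (record length, payload length, URI).

```shell
# Print only rows whose second tab-separated field is exactly 1048576.
# Assumption: columns are record length, payload length, URI, as in
# the seg0_file0_lengths.tsv sample rows.
truncated_rows() {
  awk -F '\t' -v limit=1048576 '$2 == limit' "$@"
}

# Usage:
#   truncated_rows ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv | head -3
```

Because the comparison is numeric and field-specific, a URI such as `.../1048576.pdf` on a short document would not be counted.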
170 | |
171 Are any of the pdfs in the corresponding wet file? | |
172 | |
173 Yes, 2: | |
174 >: cut -f 3 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv | fgrep -af - <(uz 1566027313501.0/orig/wet/*-00000.warc.wet.gz) | |
175 WARC-Target-URI: http://bdds.deux-sevres.com/recherche/simple/Editeur/2/Belfond/vignette?format=pdf | |
176 WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D00 | |
177 | |
178 Is it in fact corresponding? | |
179 >: diff -bw <(uz 1566027313501.0/orig/warc/*-00000.warc.gz | egrep -a '^WARC-Target-URI: ' | uniq | head -1000) <(uz 1566027313501.0/orig/wet/*-00000.warc.wet.gz | egrep -a '^WARC-Target-URI: ' | head -1000)|egrep -c '^<' | |
180 19 | |
181 | |
182 So, yes, mostly. About 2% (19 of the first 1000) are missing | |
183 | |
184 Just checking the search: | |
185 >: cut -f 3 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv | fgrep -af - <(uz 1566027313501.0/orig/warc/*-00000.warc.gz) | wc -l | |
186 210 | |
187 Correct (210 = 3 x 70: each URI shows up in three records, cf. the three hits for the museum URI below) | |
188 | |
189 So, what pdfs make it into the WET: | |
190 >: cut -f 3 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv | fgrep -af - <(uz 1566027313501.0/orig/wet/*-00000.warc.wet.gz) > ~/results/CC-MAIN-2019-35/s0_file0_pdf.txt | |
191 >: wc -l < ~/results/CC-MAIN-2019-35/s0_file0_pdf.txt | |
192 2 | |
193 >: cut -f 2 -d ' ' ~/results/CC-MAIN-2019-35/s0_file0_pdf.txt | tr -d '\r' | fgrep -f - ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv | |
194 11588 10913 http://bdds.deux-sevres.com/recherche/simple/Editeur/2/Belfond/vignette?format=pdf | |
195 1048979 1048576 https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005 | |
196 | |
197 Here's the short one: | |
198 WARC/1.0 | |
199 WARC-Type: response | |
200 WARC-Date: 2019-08-17T22:40:17Z | |
201 WARC-Record-ID: <urn:uuid:ea98167b-c42a-4233-b57e-994aa627e38a> | |
202 Content-Length: 11588 | |
203 Content-Type: application/http; msgtype=response | |
204 WARC-Warcinfo-ID: <urn:uuid:f689f8d0-24f3-4824-9a38-4f3fee422a4e> | |
205 WARC-Concurrent-To: <urn:uuid:2d51c956-0012-4d78-affc-8f57fe9d2e15> | |
206 WARC-IP-Address: 92.175.114.24 | |
207 WARC-Target-URI: http://bdds.deux-sevres.com/recherche/simple/Editeur/2/Belfond/vignette?format=pdf | |
208 WARC-Payload-Digest: sha1:7VVIUDQ4Q6XKNOAURYU4VTMRSZNPHDQA | |
209 WARC-Block-Digest: sha1:OSTWXLV772XNHS22T4UBSCSJAAXM2J6T | |
210 WARC-Identified-Payload-Type: application/pdf | |
211 | |
212 HTTP/1.1 200 OK | |
213 Cache-Control: must-revalidate, post-check=0, pre-check=0,no-cache | |
214 Pragma: public,no-cache | |
215 Content-Type: application/pdf",text/html; charset=utf-8 | |
216 X-Crawler-Content-Encoding: gzip | |
217 Expires: 0 | |
218 Server: | |
219 X-Powered-By: | |
220 Set-Cookie: 166d74d734106ba68b20ea303011f622=301619e3fe31ecb98c8473f0ff5f35a2; path=/ | |
221 Content-Disposition: attachment; filename="Mdiathque dpartementale des Deux-Svres - Rsultats de la recherche Belfond.pdf" | |
222 Content-Transfer-Encoding: binary | |
223 P3P: CP="NOI ADM DEV PSAi COM NAV OUR OTRo STP IND DEM" | |
224 X-Content-Encoded-By: | |
225 X-Powered-By: | |
226 Date: Sat, 17 Aug 2019 22:40:16 GMT | |
227 X-Crawler-Content-Length: 5448 | |
228 Content-Length: 10913 | |
229 | |
230 %PDF-1.7 | |
231 %<E2><E3><CF><D3> | |
232 7 0 obj | |
233 << /Type /Page /Parent 1 0 R /LastModified (D:20190818004016+02'00') /Resources 2 | |
234 0 R /MediaBox [0.000000 0.000000 595.276000 841.890000] /CropBox [0.000000 0.000 | |
235 000 595.276000 841.890000] /BleedBox [0.000000 0.000000 595.276000 841.890000] /T | |
236 rimBox [0.000000 0.000000 595.276000 841.890000] /ArtBox [0.000000 0.000000 595.2 | |
237 76000 841.890000] /Contents 8 0 R /Rotate 0 /Group << /Type /Group /S /Transparen | |
238 cy /CS /DeviceRGB >> /PZ 1 >> | |
239 endobj | |
240 8 0 obj | |
241 | |
242 >: uz 1566027313501.0/orig/warc/*-00000.warc.gz|tail -n +1823434 | tail -n +24 | head -c 20000 > ~/results/CC-MAIN-2019-35/mediatheque.pdf | |
243 >: ps2ascii mediatheque.pdf | |
244 Médiathèque départementale des Deux-Sèvres - Résultats de la recherche Belfond | |
245 | |
246 Médiathèque départementale des Deux-Sèvres - Résultats de | |
247 la recherche Belfond | |
248 A charge de revanche | |
249 Titre : | |
250 Auteur : Grippando, James (1958-....) | |
251 ... | |
252 etc., three pages, no errors | |
253 | |
254 >: uz 1566027313501.0/orig/warc/*-00000.warc.gz|fgrep -an https://museum.wrap.gov.tw/GetFile4.ashx | |
255 38896837:WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005 | |
256 38896858:WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005 | |
257 38904590:WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005 | |
258 >: uz 1566027313501.0/orig/warc/*-00000.warc.gz|tail -n +38896858 | egrep -an '^%%EOF' | |
259 27:%%EOF | |
260 1114658:%%EOF | |
261 1313299:%%EOF | |
262 | |
263 Hunh? | |
264 | |
265 >: uz 1566027313501.0/orig/warc/*-00000.warc.gz|tail -n +38896858 | egrep -an '^(%%EOF|WARC)' | head -30 | |
266 1:WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005 | |
267 2:WARC-Payload-Digest: sha1:SZ53DQQHENC7DDN7GQ5IS7VMEPAXAMBE | |
268 3:WARC-Block-Digest: sha1:QTKJA6A7445Z7264K2YAFBUUM2OYH2T2 | |
269 4:WARC-Truncated: length | |
270 5:WARC-Identified-Payload-Type: application/pdf | |
271 27:%%EOF | |
272 7725:WARC/1.0 | |
273 7726:WARC-Type: metadata | |
274 7727:WARC-Date: 2019-08-17T22:59:14Z | |
275 7728:WARC-Record-ID: <urn:uuid:77df2747-e567-45d3-8646-3069ae9a9f25> | |
276 7731:WARC-Warcinfo-ID: <urn:uuid:f689f8d0-24f3-4824-9a38-4f3fee422a4e> | |
277 7732:WARC-Concurrent-To: <urn:uuid:eceb4adc-d81e-4497-82fe-eea61ce171f4> | |
278 7733:WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005 | |
279 7739:WARC/1.0 | |
280 | |
281 OK, so indeed truncated after 7700 lines or so... | |
282 >: uz 1566027313501.0/orig/warc/*-00000.warc.gz|tail -n +38896858 | tail -n +21 | head -c 1048576 > ~/results/CC-MAIN-2019-35/museum.pdf | |
283 >: ps2ascii ~/results/CC-MAIN-2019-35/museum.pdf | |
284 **** Error: An error occurred while reading an XREF table. | |
285 **** The file has been damaged. | |
286 Look in big_pdf? |
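A cheap pre-check before running `ps2ascii` on an extracted payload is to see whether the file still ends with a PDF trailer. The sketch below looks for `%%EOF` in the last 1024 bytes only, since (as the museum example shows) an early `%%EOF` can be present even when the tail of the file has been cut off. The function name `has_pdf_trailer` and the 1024-byte window are assumptions made for illustration, and passing the check does not prove the PDF is intact.

```shell
# Succeeds if the last 1024 bytes of the file contain a %%EOF marker.
# Heuristic only: a missing marker suggests truncation, a present one
# does not guarantee a readable PDF.
has_pdf_trailer() {
  tail -c 1024 "$1" | grep -a -q '%%EOF'
}

# Usage:
#   has_pdf_trailer ~/results/CC-MAIN-2019-35/museum.pdf || echo "probably truncated"
```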