comparison lurid3/notes.txt @ 45:737c61f98cbf
foo
author | Henry S. Thompson <ht@inf.ed.ac.uk> |
---|---|
date | Thu, 26 Sep 2024 17:47:58 +0100 |
parents | 7209df5fa5b4 |
children | 49672e9b4c1c |
44:7209df5fa5b4 | 45:737c61f98cbf |
---|---|
302 to include lastmod | 302 to include lastmod |
303 | 303 |
304 Can run just one test, which should allow testing this: | 304 Can run just one test, which should allow testing this: |
305 | 305 |
306 >: ant test-core -Dtestcase='TestWarcRecordWriter' | 306 >: ant test-core -Dtestcase='TestWarcRecordWriter' |
307 | |
308 Logic is tricky, and there's no easy way in | |
309 | |
310 Basically, tools/WarcExport.java launches a hadoop job based on a | |
311 hadoop-runnable WarcExport instance. Hadoop will in due course call | |
312 ExportReducer.reduce, which will create an instance of WarcCapture | |
313 "for each page capture", and call ExportMapper.context.write with that instance (via | |
314 some configuration magic with the hadoop job Context). That in turn | |
315 uses (more magic) WarcOutputFormat.getRecordWriter, which | |
316 (finally!) calls a previously created WarcRecordWriter | |
317 instance.write(the capture). | |
318 | |
319 So to fake a test case, I need to build | |
320 1) a WarcRecordWriter instance | |
321 2) a WarcCapture instance | |
322 and then invoke 1.write(2) | |
323 | |
324 Got that working, although still can't figure out where in the normal | |
325 flow the metadata entry for Response.CONTENT_TYPE gets set. | |
326 | |
327 Now, add a test that takes a stream of WARC Response extracts and | |
328 rewrites their index entries | |
329 | |
330 >: head -8804 <(uz /beegfs/common_crawl/CC-MAIN-2019-35/cdx/warc/cdx-00150.gz)|tail -10| ix.py -h -w -x > /tmp/hst/headers.txt | |
331 >: cp /tmp/hst/headers.txt src/test/org/commoncrawl/util/ | |
332 >: shuf /tmp/hst/headers.txt > src/test/org/commoncrawl/util/headers_mixed.txt | |
333 | |
334 Won't quite work :-( | |
335 How do we reconstruct the Warc filename, offset and length from the | |
336 original index? | |
337 | |
338 Well, we can find the .warc.gz records! | |
339 Thanks to https://stackoverflow.com/a/37042747/2595465 | |
340 | |
341 >: ~/lib/python/unpackz.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz > /tmp/hst/recs.txt | |
342 | |
343 Nearly working, got 1/3rd of the way through a single WARC and then failed: | |
344 | |
345 >: n=0 && ~/lib/python/unpackz.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz 2>/tmp/hst/tt.txt|while read o l; do echo $((n+=1)); echo $o $l >> /tmp/hst/r3a; ix.py $l $o CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz| wc -l; done | |
346 ... | |
347 20 | |
348 10215 | |
349 CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz | |
350 Process fail: Compressed file ended before the end-of-stream marker was reached, input: | |
351 length=10762, offset=60784640, file=/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz | |
352 | |
353 >: head -10217 /tmp/hst/r3a | tail -4 | |
354 60784173 467 | |
355 60784640 10762 | |
356 60795402 463 | |
357 60795865 460 | |
358 >: ix.py 467 60784173 CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz|fgrep Target | |
359 WARC-Target-URI: http://drycarerestoration.co/corner-furniture-piece/unique-corner-decoration-pieces-or-corner-furniture-pieces-corner-corner-furniture-piece-corner-furniture-pieces-bedroom/ | |
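The ix.py calls above take a length, an offset and a WARC path. As a rough illustration of that kind of lookup (this is not the actual ix.py; read_record is a hypothetical name, and the sketch assumes each WARC record is stored as its own gzip member), one record can be pulled out like this:

```python
import gzip
import io


def read_record(stream, offset, length):
    """Return the decompressed bytes of the gzip member at (offset, length).

    `stream` is any seekable binary file object, e.g. open(path, "rb").
    gzip.decompress raises EOFError if the slice stops short of the
    member's end-of-stream marker.
    """
    stream.seek(offset)
    member = stream.read(length)
    return gzip.decompress(member)


if __name__ == "__main__":
    # Two toy "records", concatenated the way a .warc.gz concatenates members.
    first = gzip.compress(b"WARC/1.0 first")
    second = gzip.compress(b"WARC/1.0 second")
    archive = io.BytesIO(first + second)
    print(read_record(archive, len(first), len(second)))
```

A length that runs past the real end of the member and stops part-way through a later one would leave gzip with a truncated final member, which is one plausible way to get a "Compressed file ended before the end-of-stream marker was reached" failure.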
360 | |
361 >: zcat /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/cdx/warc/cdx.gz | |
362 ... | |
363 co,drycarerestoration)/corner-furniture-piece/unique-corner-decoration-pieces-or-corner-furniture-pieces-corner-corner-furniture-piece-corner-furniture-pieces-bedroom 20190819020224 {"url": "http://drycarerestoration.co/corner-furniture-piece/unique-corner-decoration-pieces-or-corner-furniture-pieces-corner-corner-furniture-piece-corner-furniture-pieces-bedroom/", "mime": "text/html", "mime-detected": "text/html", "status": "200", "digest": "DTKGJL45XQDXUS7PTXPYR6POMPLG46RZ", "length": "2570", "offset": "60784640", "filename": "crawl-data/CC-MAIN-2019-35/segments/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz", "charset": "UTF-8", "languages": "eng"} | |
364 >: ix.py 2570 60784640 CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz|less | |
365 >: echo $((10762 - 2570)) | |
366 8192 | |
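The comparison above can be sketched in Python. This assumes the index line layout shown in the cdx.gz output (a SURT key, a timestamp, then a JSON object); parse_cdx_line is a hypothetical helper, and the example line is shortened with a placeholder URL, keeping only the offset and length values from the notes:

```python
import json


def parse_cdx_line(line):
    """Split one index line into (surt_key, timestamp, json_fields)."""
    surt, timestamp, payload = line.rstrip("\n").split(" ", 2)
    return surt, timestamp, json.loads(payload)


# Shortened stand-in for the real index line; the URL is a placeholder.
line = ('co,example)/page 20190819020224 '
        '{"url": "http://example.co/page", "length": "2570", '
        '"offset": "60784640"}')

_, _, entry = parse_cdx_line(line)
scanned_length = 10762                      # length reported by the scan
excess = scanned_length - int(entry["length"])
print(excess, excess % 8192 == 0)
```

Note that the index stores length and offset as strings, so they need converting before any arithmetic.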
367 | |
368 Ah, the error I was dreading :-( I _think_ this happens when an | |
369 individual record ends exactly on an 8K boundary. | |
370 | |
371 Yes: | |
372 | |
373 >: echo $((60784640 % 8192)) | |
374 0 | |
375 |
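For reference, a minimal sketch of a chunked member scanner that copes with a member ending exactly at the end of a read chunk. This is not the actual unpackz.py, whose code is not shown in these notes; gzip_members and chunk_size are names chosen for illustration. The point is to test the decompressor's eof flag rather than relying on unused_data being non-empty, since unused_data is empty when the member and the chunk end together.

```python
import gzip
import io
import zlib


def gzip_members(stream, chunk_size=8192):
    """Yield (offset, length) for each gzip member in a binary stream."""
    offset = 0      # start of the current member in the compressed file
    length = 0      # compressed bytes consumed by the current member
    buf = b""       # compressed bytes not yet given to a decompressor
    d = None        # decompressor for the current member, if any
    while True:
        if not buf:
            buf = stream.read(chunk_size)
            if not buf:
                break
        if d is None:
            d = zlib.decompressobj(31)   # 31: expect gzip header and trailer
            length = 0
        d.decompress(buf)                # decompressed output is discarded
        if d.eof:
            # Member finished inside (or exactly at the end of) this chunk.
            length += len(buf) - len(d.unused_data)
            yield offset, length
            offset += length
            buf = d.unused_data          # may be empty: boundary case
            d = None
        else:
            length += len(buf)
            buf = b""
    if d is not None:
        raise EOFError("stream ended inside a gzip member")


if __name__ == "__main__":
    a = gzip.compress(b"first record")
    b = gzip.compress(b"second record " * 100)
    # chunk_size=len(a) forces the first member to end on a chunk boundary.
    for off, ln in gzip_members(io.BytesIO(a + b), chunk_size=len(a)):
        print(off, ln)
```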