comparison lurid3/notes.txt @ 45:737c61f98cbf

foo
author Henry S. Thompson <ht@inf.ed.ac.uk>
date Thu, 26 Sep 2024 17:47:58 +0100
parents 7209df5fa5b4
children 49672e9b4c1c
44:7209df5fa5b4 45:737c61f98cbf
302 to include lastmod
303
304 Can run just one test, which should allow testing this:
305
306 >: ant test-core -Dtestcase='TestWarcRecordWriter'
307
308 Logic is tricky, and there's no easy way in
309
310 Basically, tools/WarcExport.java launches a hadoop job based on a
311 hadoop-runnable WarcExport instance. Hadoop will in due course call
312 ExportReducer.reduce, which will create an instance of WarcCapture
313 "for each page capture", and call ExportMapper.context.write with that instance (via
314 some configuration magic with the hadoop job Context). That in turn
315 uses (more magic) WarcOutputFormat.getRecordWriter, which
316 (finally!) calls a previously created WarcRecordWriter
317 instance.write(the capture).
318
319 So to fake a test case, I need to build
320 1) a WarcRecordWriter instance
321 2) a WarcCapture instance
322 and then invoke 1.write(2)
323
324 Got that working, although still can't figure out where in the normal
325 flow the metadata entry for Response.CONTENT_TYPE gets set.
326
327 Now, add a test that takes a stream of WARC Response extracts and
328 rewrites their index entries
329
330 >: head -8804 <(uz /beegfs/common_crawl/CC-MAIN-2019-35/cdx/warc/cdx-00150.gz)|tail -10| ix.py -h -w -x > /tmp/hst/headers.txt
331 >: cp /tmp/hst/headers.txt src/test/org/commoncrawl/util/
332 >: shuf /tmp/hst/headers.txt > src/test/org/commoncrawl/util/headers_mixed.txt
333
334 Won't quite work :-(
335 How do we reconstruct the WARC filename, offset and length from the
336 original index?
337
338 Well, we can find the records in a .warc.gz!
339 Thanks to https://stackoverflow.com/a/37042747/2595465
340
341 >: ~/lib/python/unpackz.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz > /tmp/hst/recs.txt
342
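For reference, a minimal sketch of the general technique (walking the concatenated gzip members of a .warc.gz and reporting where each one starts and how long it is). This is not the contents of unpackz.py, which are not shown here; the function name gzip_members and the chunk_size parameter are made up for illustration. It relies only on the standard zlib module and tests the decompressor's eof flag rather than waiting for leftover bytes, so a member that ends exactly at the end of a read chunk is still detected.

```python
import zlib

def gzip_members(stream, chunk_size=8192):
    """Yield (offset, length) of each gzip member in a concatenated stream."""
    offset = 0      # where the current member starts in the compressed file
    fed = 0         # compressed bytes handed to the decompressor for this member
    pending = b""   # bytes already read that belong to the next member
    dec = zlib.decompressobj(zlib.MAX_WBITS | 16)  # 16 => expect gzip framing
    while True:
        chunk = pending or stream.read(chunk_size)
        pending = b""
        if not chunk:
            break
        dec.decompress(chunk)   # output is discarded, only the framing matters
        fed += len(chunk)
        if dec.eof:
            # eof is set even when unused_data is empty, i.e. when the member
            # ends exactly at the end of the chunk
            length = fed - len(dec.unused_data)
            yield offset, length
            pending = dec.unused_data
            offset += length
            fed = 0
            dec = zlib.decompressobj(zlib.MAX_WBITS | 16)
    if fed:
        raise EOFError("stream ended inside a gzip member at offset %d" % offset)
```

Opening the .warc.gz in binary mode and iterating over gzip_members(f) would give one "offset length" pair per record, the same shape as the lines in /tmp/hst/recs.txt.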
343 Nearly working, got 1/3rd of the way through a single WARC and then failed:
344
345 >: n=0 && ~/lib/python/unpackz.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz 2>/tmp/hst/tt.txt|while read o l; do echo $((n+=1)); echo $o $l >> /tmp/hst/r3a; ix.py $l $o CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz| wc -l; done
346 ...
347 20
348 10215
349 CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz
350 Process fail: Compressed file ended before the end-of-stream marker was reached, input:
351 length=10762, offset=60784640, file=/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz
352
353 >: head -10217 /tmp/hst/r3a | tail -4
354 60784173 467
355 60784640 10762
356 60795402 463
357 60795865 460
358 >: ix.py 467 60784173 CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz|fgrep Target
359 WARC-Target-URI: http://drycarerestoration.co/corner-furniture-piece/unique-corner-decoration-pieces-or-corner-furniture-pieces-corner-corner-furniture-piece-corner-furniture-pieces-bedroom/
360
361 >: zcat /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/cdx/warc/cdx.gz
362 ...
363 co,drycarerestoration)/corner-furniture-piece/unique-corner-decoration-pieces-or-corner-furniture-pieces-corner-corner-furniture-piece-corner-furniture-pieces-bedroom 20190819020224 {"url": "http://drycarerestoration.co/corner-furniture-piece/unique-corner-decoration-pieces-or-corner-furniture-pieces-corner-corner-furniture-piece-corner-furniture-pieces-bedroom/", "mime": "text/html", "mime-detected": "text/html", "status": "200", "digest": "DTKGJL45XQDXUS7PTXPYR6POMPLG46RZ", "length": "2570", "offset": "60784640", "filename": "crawl-data/CC-MAIN-2019-35/segments/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz", "charset": "UTF-8", "languages": "eng"}
364 >: ix.py 2570 60784640 CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz|less
365 >: echo $((10762 - 2570))
366 8192
367
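The index line above already carries filename, offset and length as JSON fields, so going from an index entry to the record itself can be sketched as below. This is an illustration only, not ix.py: the function names are hypothetical, and it assumes the line layout seen above (key, timestamp, then a JSON object, separated by single spaces) and that the path passed to read_record has already been mapped to wherever the WARC file lives locally.

```python
import gzip
import json

def parse_cdx_line(line):
    """Split an index line into (key, timestamp, dict of JSON fields)."""
    key, timestamp, payload = line.rstrip("\n").split(" ", 2)
    return key, timestamp, json.loads(payload)

def record_location(line):
    """Return (filename, offset, length); the index stores numbers as strings."""
    fields = parse_cdx_line(line)[2]
    return fields["filename"], int(fields["offset"]), int(fields["length"])

def read_record(path, offset, length):
    """Read one gzip member from a local .warc.gz and return it decompressed."""
    with open(path, "rb") as f:
        f.seek(offset)
        return gzip.decompress(f.read(length))
```

For the entry above, record_location would give offset 60784640 and length 2570, which is the length to compare against the 10762 found by scanning the file directly.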
368 Ah, the error I was dreading :-( I _think_ this happens when an
369 individual record ends exactly on an 8K boundary.
370
371 Yes:
372
373 >: echo $((60784640 % 8192))
374 0
375
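The same arithmetic as a small Python check over the four offset/length pairs from /tmp/hst/r3a shown above. The helper names are made up for illustration. It confirms two things that can be read off the numbers: the pairs are contiguous (each record starts where the previous one ends, including the over-long one), and only the problem record starts on an 8K boundary.

```python
CHUNK = 8192  # 8K, the boundary size suspected above

def check_tiling(pairs):
    """Return True if each record starts where the previous one ends."""
    return all(o1 + l1 == o2 for (o1, l1), (o2, _) in zip(pairs, pairs[1:]))

def starts_on_boundary(pairs, chunk=CHUNK):
    """Return the (offset, length) pairs whose offset is a multiple of chunk."""
    return [(o, l) for o, l in pairs if o % chunk == 0]

# The four lines printed by: head -10217 /tmp/hst/r3a | tail -4
pairs = [(60784173, 467), (60784640, 10762), (60795402, 463), (60795865, 460)]

print(check_tiling(pairs))        # True
print(starts_on_boundary(pairs))  # [(60784640, 10762)]
print(10762 - 2570 == CHUNK)      # True: the excess is exactly one chunk
```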