changeset 45:737c61f98cbf
foo
author | Henry S. Thompson <ht@inf.ed.ac.uk> |
---|---|
date | Thu, 26 Sep 2024 17:47:58 +0100 |
parents | 7209df5fa5b4 |
children | 49672e9b4c1c |
files | lurid3/notes.txt |
diffstat | 1 files changed, 69 insertions(+), 0 deletions(-) |
--- a/lurid3/notes.txt Sun Sep 22 23:13:56 2024 +0100
+++ b/lurid3/notes.txt Thu Sep 26 17:47:58 2024 +0100
@@ -304,3 +304,72 @@
 Can run just one test, which should allow testing this:
 
  >: ant test-core -Dtestcase='TestWarcRecordWriter'
+
+Logic is tricky, and there's no easy way in
+
+Basically, tools/WarcExport.java launches a hadoop job based on a
+hadoop-runnable WarcExport instance. Hadoop will in due course call
+ExportReducer.reduce, which will create an instance of WarcCapture
+"for each page capture", and call ExportMapper.context.write with that instance (via
+some configuration magic with the hadoop job Context). That in turn
+uses (more magic) WarcOutputFormat.getRecordWriter, which
+(finally!) calls a previously created WarcRecordWriter
+instance.write(the capture).
+
+So to fake a test case, I need to build
+ 1) a WarcRecordWriter instance
+ 2) a WarcCapture instance
+and then invoke 1.write(2)
+
+Got that working, although still can't figure out where in the normal
+flow the metadata entry for Response.CONTENT_TYPE gets set.
+
+Now, add a test that takes a stream of WARC Response extracts and
+rewrites their index entries
+
+ >: head -8804 <(uz /beegfs/common_crawl/CC-MAIN-2019-35/cdx/warc/cdx-00150.gz)|tail -10| ix.py -h -w -x > /tmp/hst/headers.txt
+ >: cp /tmp/hst/headers.txt src/test/org/commoncrawl/util/
+ >: shuf /tmp/hst/headers.txt > src/test/org/commoncrawl/util/headers_mixed.txt
+
+Won't quite work :-(
+How do we reconstruct the Warc filename, offset and length from the
+original index?
+
+Well, we can find a .warc.gz records!
+Thanks to https://stackoverflow.com/a/37042747/2595465
+
+ >: ~/lib/python/unpackz.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz > /tmp/hst/recs.txt
+
+Nearly working, got 1/3rd of the way through a single WARC and then failed:
+
+ >: n=0 && ~/lib/python/unpackz.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz 2>/tmp/hst/tt.txt|while read o l; do echo $((n+=1)); echo $o $l >> /tmp/hst/r3a; ix.py $l $o CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz| wc -l; done
+ ...
+ 20
+ 10215
+ CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz
+ Process fail: Compressed file ended before the end-of-stream marker was reached, input:
+ length=10762, offset=60784640, file=/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz
+
+ >: head -10217 /tmp/hst/r3a | tail -4
+ 60784173 467
+ 60784640 10762
+ 60795402 463
+ 60795865 460
+ >: ix.py 467 60784173 CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz|fgrep Target
+ WARC-Target-URI: http://drycarerestoration.co/corner-furniture-piece/unique-corner-decoration-pieces-or-corner-furniture-pieces-corner-corner-furniture-piece-corner-furniture-pieces-bedroom/
+
+ >: zcat /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/cdx/warc/cdx.gz
+ ...
+ co,drycarerestoration)/corner-furniture-piece/unique-corner-decoration-pieces-or-corner-furniture-pieces-corner-corner-furniture-piece-corner-furniture-pieces-bedroom 20190819020224 {"url": "http://drycarerestoration.co/corner-furniture-piece/unique-corner-decoration-pieces-or-corner-furniture-pieces-corner-corner-furniture-piece-corner-furniture-pieces-bedroom/", "mime": "text/html", "mime-detected": "text/html", "status": "200", "digest": "DTKGJL45XQDXUS7PTXPYR6POMPLG46RZ", "length": "2570", "offset": "60784640", "filename": "crawl-data/CC-MAIN-2019-35/segments/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz", "charset": "UTF-8", "languages": "eng"}
+ >: ix.py 2570 60784640 CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz|less
+ >: echo $((10762 - 2570))
+ 8192
+
+Ah, the error I was dreading :-( I _think_ this happens when an
+individual record ends exactly on an 8K boundary.
+
+Yes:
+
+ >: echo $((60784640 % 8192))
+ 0
+
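
The arithmetic at the end of the notes (10762 - 2570 = 8192, with the record starting at an offset that is a multiple of 8192) suggests a reported length that is one read buffer too long. The following is a minimal sketch of how per-member offsets and lengths can be recovered from a multi-member gzip file with Python's zlib, written so that a member ending exactly at the end of a read buffer is handled. It is not the actual unpackz.py, whose source does not appear in these notes; the function name gzip_members and the bufsize parameter are assumptions made for illustration.

```python
import sys
import zlib


def gzip_members(stream, bufsize=8192):
    """Yield (offset, length) for each gzip member in a binary stream.

    Hypothetical helper, not taken from unpackz.py.
    """
    offset = 0      # file offset at which the current member starts
    consumed = 0    # compressed bytes already fed to the current member
    pending = b""   # bytes read from the file but not yet decompressed
    decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)  # | 16: gzip framing
    while True:
        chunk = pending or stream.read(bufsize)
        pending = b""
        if not chunk:
            break
        decomp.decompress(chunk)  # decompressed output is discarded here
        if decomp.eof:
            # Only the bytes before unused_data belong to this member.
            # If the member ended exactly at the end of the chunk,
            # unused_data is empty and the whole chunk is counted once.
            length = consumed + len(chunk) - len(decomp.unused_data)
            yield offset, length
            offset += length
            consumed = 0
            pending = decomp.unused_data
            decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)
        else:
            consumed += len(chunk)
    if consumed:
        raise ValueError("input ended in the middle of a gzip member")


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], "rb") as f:
        for member_offset, member_length in gzip_members(f):
            print(member_offset, member_length)
```

Each printed pair has the same shape as the `o l` values read by the shell loop in the notes, assuming unpackz.py reports offset first and length second, as the `while read o l` usage suggests.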