changeset 45:737c61f98cbf

foo
author Henry S. Thompson <ht@inf.ed.ac.uk>
date Thu, 26 Sep 2024 17:47:58 +0100
parents 7209df5fa5b4
children 49672e9b4c1c
files lurid3/notes.txt
diffstat 1 files changed, 69 insertions(+), 0 deletions(-)
--- a/lurid3/notes.txt	Sun Sep 22 23:13:56 2024 +0100
+++ b/lurid3/notes.txt	Thu Sep 26 17:47:58 2024 +0100
@@ -304,3 +304,72 @@
 Can run just one test, which should allow testing this:
 
   >: ant test-core -Dtestcase='TestWarcRecordWriter'
+
+Logic is tricky, and there's no easy way in.
+
+Basically, tools/WarcExport.java launches a hadoop job based on a
+hadoop-runnable WarcExport instance.  Hadoop will in due course call
+ExportReducer.reduce, which will create an instance of WarcCapture
+"for each page capture", and call ExportMapper.context.write with that instance (via
+some configuration magic with the hadoop job Context).  That in turn
+uses (more magic) WarcOutputFormat.getRecordWriter, which
+(finally!) calls a previously created WarcRecordWriter
+instance.write(the capture).
+
+So to fake a test case, I need to build
+ 1) a WarcRecordWriter instance
+ 2) a WarcCapture instance
+and then invoke 1.write(2)
+
+Got that working, although still can't figure out where in the normal
+flow the metadata entry for Response.CONTENT_TYPE gets set.
+
+Now, add a test that takes a stream of WARC Response extracts and
+rewrites their index entries
+
+  >: head -8804 <(uz /beegfs/common_crawl/CC-MAIN-2019-35/cdx/warc/cdx-00150.gz)|tail -10|  ix.py -h -w -x  > /tmp/hst/headers.txt
+  >: cp /tmp/hst/headers.txt src/test/org/commoncrawl/util/
+  >: shuf /tmp/hst/headers.txt > src/test/org/commoncrawl/util/headers_mixed.txt
+
+Won't quite work :-(
+How do we reconstruct the Warc filename, offset and length from the
+original index?
+
+Well, we can find the records in a .warc.gz!
+Thanks to https://stackoverflow.com/a/37042747/2595465
+
+  >: ~/lib/python/unpackz.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz > /tmp/hst/recs.txt
+
+Nearly working, got 1/3rd of the way through a single WARC and then failed:
+
+  >: n=0 && ~/lib/python/unpackz.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz 2>/tmp/hst/tt.txt|while read o l; do echo $((n+=1)); echo $o $l >> /tmp/hst/r3a; ix.py $l $o   CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz| wc -l; done
+  ...
+  20
+  10215
+  CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz
+  Process fail: Compressed file ended before the end-of-stream marker was reached, input:
+   length=10762, offset=60784640, file=/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz
+
+  >: head -10217 /tmp/hst/r3a | tail -4
+  60784173 467
+  60784640 10762
+  60795402 463
+  60795865 460
+  >: ix.py 467 60784173   CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz|fgrep Target
+  WARC-Target-URI: http://drycarerestoration.co/corner-furniture-piece/unique-corner-decoration-pieces-or-corner-furniture-pieces-corner-corner-furniture-piece-corner-furniture-pieces-bedroom/
+
+  >: zcat /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/cdx/warc/cdx.gz 
+  ...
+  co,drycarerestoration)/corner-furniture-piece/unique-corner-decoration-pieces-or-corner-furniture-pieces-corner-corner-furniture-piece-corner-furniture-pieces-bedroom 20190819020224 {"url": "http://drycarerestoration.co/corner-furniture-piece/unique-corner-decoration-pieces-or-corner-furniture-pieces-corner-corner-furniture-piece-corner-furniture-pieces-bedroom/", "mime": "text/html", "mime-detected": "text/html", "status": "200", "digest": "DTKGJL45XQDXUS7PTXPYR6POMPLG46RZ", "length": "2570", "offset": "60784640", "filename": "crawl-data/CC-MAIN-2019-35/segments/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz", "charset": "UTF-8", "languages": "eng"}
+  >: ix.py 2570 60784640   CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz|less
+  >: echo $((10762 - 2570))
+  8192
+
+Ah, the error I was dreading :-(  I _think_ this happens when an
+individual record ends exactly on an 8K boundary.
+
+Yes:
+
+  >: echo $((60784640 % 8192))
+  0
+
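A minimal sketch of the record-boundary scan, to make the failure above
concrete.  Assumptions, none confirmed by these notes: that unpackz.py
follows the zlib.decompressobj / unused_data approach from the Stack
Overflow answer, and that it reads the file in 8192-byte chunks.  The name
member_spans is hypothetical.  In the run above the preceding record ends
at 60784173 + 467 = 60784640 = 7420 * 8192, so the chunk in which it
finishes has nothing left over; a scan that waits for non-empty
unused_data to detect the end of a record would then overshoot, which is
consistent with the reported length being 8192 too large.

```python
import zlib


def member_spans(stream, bufsize=8192):
    """Yield (offset, length) for each gzip member of a multi-member stream."""
    offset = 0      # where the current member starts in the file
    consumed = 0    # compressed bytes handed to the current decompressor
    pending = b""   # bytes read past the end of the previous member
    d = zlib.decompressobj(wbits=31)  # 31 = expect a gzip header and trailer
    while True:
        chunk = pending or stream.read(bufsize)
        pending = b""
        if not chunk:
            break
        d.decompress(chunk)  # output is discarded, only the positions matter
        consumed += len(chunk)
        if d.eof:
            # Test d.eof rather than "unused_data is non-empty": when a
            # member ends exactly at the end of a chunk, unused_data is
            # empty even though the member is complete.
            length = consumed - len(d.unused_data)
            yield offset, length
            offset += length
            pending = d.unused_data
            consumed = 0
            d = zlib.decompressobj(wbits=31)
    if consumed:
        raise EOFError("truncated gzip member at offset %d" % offset)
```

Opening a .warc.gz in binary mode and printing each (offset, length) pair
from member_spans(f) should give the same two columns as the "while read o
l" loop above consumes, provided every record is its own gzip member.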