cc/work: lurid3/notes.txt comparison

thinking about merging

author	Henry S. Thompson <ht@inf.ed.ac.uk>
date	Fri, 04 Oct 2024 15:24:00 +0100
parents	fbdaede4155a
children	deeac8a0a682


-:fbdaede4155a
+:f688c437180b
 which turns out to be a case of two Last-Modified headers in the same
 response record's HTTP headers.  RFCs 2616 and 7230 rule it
 out but neither specifies a recovery, so first-wins is as good as
 anything, and indeed 6797 specifies that.
+Start looking at how we do the merge of cdx_extras.py with existing index
+Try it with the existing _per segment_ index we have for 2019-35
+Assuming we have to key on segment plus offset, as reconstructing the
+proper index key is such a pain / buggy / is going to change with the year.
+Stay with segment 49
+>: uz cdx.gz |wc -l
+29,870,307
+>: time uz cdx.gz|egrep -ao ' "length": "[0-9]*", "offset": "[0-9]*"' |wc
+29,870,307 119,481,228 1,241,098,122
+= 4 * 29,870,307
+So no bogons, not _too_ surprising :-)
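The same sanity check could be done in Python. This is only a sketch: it reuses the egrep pattern above, and `count_bogons` is a hypothetical helper name, not something in cdx_extras.py. It assumes every good index line carries exactly one adjacent length/offset pair.

```python
import re

# Same pattern as the egrep above (assumption: "length" and "offset"
# are adjacent in the JSON part of every well-formed index line).
LEN_OFF = re.compile(rb' "length": "[0-9]*", "offset": "[0-9]*"')

def count_bogons(lines):
    """Return (total lines, lines lacking exactly one length/offset pair)."""
    total = bad = 0
    for line in lines:
        total += 1
        if len(LEN_OFF.findall(line)) != 1:
            bad += 1
    return total, bad

# Possible use on the real file, streaming so the 2.0G never sits in memory:
#   import gzip
#   with gzip.open("cdx.gz", "rb") as f:
#       print(count_bogons(f))
```

A result of (29870307, 0) would match the counts above.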
+Bad news is it's a _big_ file:
+>: ls -lh cdx.gz
+-rw-r--r-- 1 hst dc007 2.0G Mar 18  2021 cdx.gz
+So not viable to paste offset as a key and then sort on command line,
+or to load it into Python and do the work there...
+Do it per warc file and then merge?
+>: time uz cdx.gz |fgrep -a warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz | sort -n -t\" -k28,28 > /tmp/hst/558.warc.cdx
+real  0m23.494s
+user  0m14.541s
+sys   0m9.158s
+>: wc -l /tmp/hst/558.warc.cdx
+53432 /tmp/hst/558.warc.cdx
+So, 600 of those at ~23.5s each is about 3.9 hours, plus approx. the
+same again for extracting, so about 7.8 hours.  That pbly _is_ doable
+in Python, not more than 10 hours total, assuming internal sort and
+external merge is not too expensive...
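A minimal sketch of what the per-WARC-file step might look like in Python: sort one file's index lines by offset in memory, then merge-join them against the extras for the same file. Everything here is an assumption for illustration: the names `sort_by_offset` and `join_on_offset` are hypothetical, and the extras are taken to arrive as (offset, text) pairs already sorted by offset, which may not be what cdx_extras.py actually emits.

```python
import re

# Assumption: each index line has an "offset": "NNN" entry, as in the
# egrep pattern above.
OFFSET = re.compile(r'"offset": "([0-9]+)"')

def sort_by_offset(cdx_lines):
    """Internal sort: return [(offset, line)] for one WARC file, by offset."""
    recs = []
    for line in cdx_lines:
        m = OFFSET.search(line)
        if m:  # lines without an offset are skipped
            recs.append((int(m.group(1)), line.rstrip("\n")))
    recs.sort(key=lambda r: r[0])
    return recs

def join_on_offset(index_recs, extras):
    """Merge-join two offset-sorted streams.

    Yields (offset, index line, extra), with extra None when the extras
    stream has nothing at that offset.
    """
    extras = iter(extras)
    cur = next(extras, None)
    for offset, line in index_recs:
        # Skip extras that fall before the current index offset
        while cur is not None and cur[0] < offset:
            cur = next(extras, None)
        if cur is not None and cur[0] == offset:
            yield offset, line, cur[1]
        else:
            yield offset, line, None
```

With about 53,000 lines per WARC file the internal sort is small, and the join is a single linear pass over both streams.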
