changeset 48:f688c437180b

thinking about merging
author Henry S. Thompson <ht@inf.ed.ac.uk>
date Fri, 04 Oct 2024 15:24:00 +0100
parents fbdaede4155a
children deeac8a0a682
files lurid3/notes.txt
diffstat 1 files changed, 40 insertions(+), 0 deletions(-)
--- a/lurid3/notes.txt	Thu Oct 03 18:16:05 2024 +0100
+++ b/lurid3/notes.txt	Fri Oct 04 15:24:00 2024 +0100
@@ -755,3 +755,43 @@
 out but neither specifies a recovery, so first-wins is as good as
 anything, and indeed 6797 specifies that.
 
+Start looking at how we do the merge of cdx_extras.py output with the
+existing index
+
+Try it with the existing _per segment_ index we have for 2019-35
+
+Assuming we have to key on segment plus offset, since reconstructing
+the proper index key is painful, bug-prone and will change with the
+year.
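+
+A rough sketch of that keying in python (a hypothetical helper, not
+committed code; the "offset" field name comes from the cdx JSON
+checked just below):
+
+  import re
+
+  OFFSET_RE = re.compile(r'"offset": "([0-9]+)"')
+
+  def merge_key(segment, cdx_line):
+      # Hypothetical: key on (segment, offset-in-warc) rather than
+      # reconstructing the real index key.
+      m = OFFSET_RE.search(cdx_line)
+      if m is None:
+          raise ValueError('cdx line has no offset field')
+      return (segment, int(m.group(1)))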
+
+Stay with segment 49
+
+  >: uz cdx.gz |wc -l
+ 29,870,307
+
+  >: time uz cdx.gz|egrep -ao ' "length": "[0-9]*", "offset": "[0-9]*"' |wc
+  29,870,307 119,481,228 1,241,098,122
+             = 4 * 29,870,307
+
+So no bogons -- the match count equals the line count, i.e. one
+length/offset pair per line.  Not _too_ surprising :-)
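+
+The same check in python, in case we want it next to the merge code
+eventually -- a sketch, assuming the cdx.gz path and the line format
+above:
+
+  import gzip, re
+
+  PAIR_RE = re.compile(r' "length": "[0-9]*", "offset": "[0-9]*"')
+
+  lines = pairs = 0
+  with gzip.open('cdx.gz', 'rt', errors='replace') as f:
+      for line in f:
+          lines += 1
+          pairs += len(PAIR_RE.findall(line))
+  # mirrors the egrep -ao | wc check: expect pairs == lines
+  print(lines, pairs)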
+
+Bad news is it's a _big_ file:
+
+  >: ls -lh cdx.gz
+  -rw-r--r-- 1 hst dc007 2.0G Mar 18  2021 cdx.gz
+
+So it's not viable to paste the offset on as a key and sort on the
+command line, nor to load the whole thing into python and do the work
+there...
+
+Do it per warc file and then merge?
+
+  >: time uz cdx.gz |fgrep -a warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz | sort -n -t\" -k28,28 > /tmp/hst/558.warc.cdx
+
+  real  0m23.494s
+  user  0m14.541s
+  sys   0m9.158s
+
+  >: wc -l /tmp/hst/558.warc.cdx
+  53432 /tmp/hst/558.warc.cdx
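+
+If we go this way, one python pass over cdx.gz could split it into the
+~600 per-warc files instead of 600 fgrep scans of the whole 2GB file
+(a sketch; the filename pattern and the output naming are assumptions):
+
+  import gzip, re
+
+  WARC_RE = re.compile(r'warc/(CC-MAIN-[0-9]+-[0-9]+-[0-9]+)\.warc\.gz')
+
+  handles = {}   # ~600 output files open at once, within the usual ulimit
+  with gzip.open('cdx.gz', 'rt', errors='replace') as f:
+      for line in f:
+          m = WARC_RE.search(line)
+          if m is None:
+              continue              # shouldn't happen: no bogons above
+          name = m.group(1)[-5:]    # e.g. '00558'
+          out = handles.get(name)
+          if out is None:
+              out = handles[name] = open('/tmp/hst/%s.warc.cdx' % name, 'w')
+          out.write(line)
+  for out in handles.values():
+      out.close()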
+
+So, 600 of those at ~24s each is about 4 hours, plus approx. the same
+again for extracting, so that probably _is_ doable in python -- not
+more than 10 hours total, assuming internal sort and external merge
+are not too expensive...
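+
+For the record, the internal-sort-plus-external-merge shape in python
+might look like the following -- a sketch of the mechanics using
+heapq.merge, keyed on offset as above, not the final keying:
+
+  import heapq, re
+  from pathlib import Path
+
+  OFFSET_RE = re.compile(r'"offset": "([0-9]+)"')
+
+  def offset_key(line):
+      m = OFFSET_RE.search(line)
+      return int(m.group(1)) if m else -1
+
+  rundir = Path('/tmp/hst')
+
+  # Internal sort: each per-warc file (~53K lines) is sorted in memory
+  # and rewritten as a sorted run on disk.
+  for path in rundir.glob('*.warc.cdx'):
+      with path.open() as f:
+          run = sorted(f, key=offset_key)
+      path.write_text(''.join(run))
+
+  # External merge: heapq.merge streams the ~600 sorted runs, holding
+  # only one line per run in memory at a time.
+  runs = [path.open() for path in sorted(rundir.glob('*.warc.cdx'))]
+  with (rundir / 'segment.cdx').open('w') as out:
+      for line in heapq.merge(*runs, key=offset_key):
+          out.write(line)
+  for f in runs:
+      f.close()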