changeset:   48:f688c437180b
author:      Henry S. Thompson <ht@inf.ed.ac.uk>
date:        Fri, 04 Oct 2024 15:24:00 +0100
parents:     fbdaede4155a
children:    deeac8a0a682
files:       lurid3/notes.txt
diffstat:    1 files changed, 40 insertions(+), 0 deletions(-)
summary:     thinking about merging
--- a/lurid3/notes.txt	Thu Oct 03 18:16:05 2024 +0100
+++ b/lurid3/notes.txt	Fri Oct 04 15:24:00 2024 +0100
@@ -755,3 +755,43 @@
 out but neither specifies a recovery, so first-wins is as good as
 anything, and indeed 6797 specifies that.
 
+Start looking at how we do the merge of cdx_extras.py with existing index
+
+Try it with the existing _per segment_ index we have for 2019-35
+
+Assuming we have to key on segment plus offset, as reconstructing the
+proper index key is such a pain / buggy / is going to change with the year.
+
+Stay with segment 49
+
+  >: uz cdx.gz |wc -l
+  29,870,307
+
+  >: time uz cdx.gz|egrep -ao ' "length": "[0-9]*", "offset": "[0-9]*"' |wc
+  29,870,307 119,481,228 1,241,098,122
+  = 4 * 29,870,307, i.e. exactly one length/offset match per line
+
+So no bogons, not _too_ surprising :-)
+
+Bad news is it's a _big_ file:
+
+  >: ls -lh cdx.gz
+  -rw-r--r-- 1 hst dc007 2.0G Mar 18  2021 cdx.gz
+
+So not viable to paste offset on as a key and then sort on the command
+line, or to load it into python and do the work there...
+
+Do it per warc file and then merge?
+
+  >: time uz cdx.gz |fgrep -a warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz | sort -n -t\" -k28,28 > /tmp/hst/558.warc.cdx
+
+  real    0m23.494s
+  user    0m14.541s
+  sys     0m9.158s
+
+  >: wc -l /tmp/hst/558.warc.cdx
+  53432 /tmp/hst/558.warc.cdx
+
+So, 600 of those, plus approx. the same again for extracting; that
+probably _is_ doable in python, not more than 10 hours total, assuming
+internal sort and external merge is not too expensive...
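
A minimal sketch of that plan, to gauge the shape of the code; this is
not cdx_extras.py or anything that exists yet, and the regexes, field
layout and file naming are all guesses from the transcript above.
Instead of 600 fgrep passes over cdx.gz it makes a single pass,
partitioning lines into one file per warc, then sorts each ~53K-line
partition in memory on the numeric offset (intended to match the
sort -n -t\" -k28,28 above), and finally does the external merge with
heapq.merge, keyed on (warc, offset):

  #!/usr/bin/env python3
  # Sketch only: partition cdx.gz by warc file, sort each partition on
  # offset, then k-way merge.  Regexes and layout are assumptions.
  import gzip, heapq, os, re, sys

  # One length/offset pair per line, per the egrep|wc check above
  OFFSET_RE = re.compile(rb'"offset": "([0-9]+)"')
  # Hypothetical: pull the warc filename out of each CDX line
  WARC_RE = re.compile(rb'warc/([^/"]+)\.warc\.gz')

  def partition(cdx_path, outdir):
      """One pass over cdx.gz, appending each line to its warc's file."""
      handles = {}
      with gzip.open(cdx_path, 'rb') as f:
          for line in f:
              m = WARC_RE.search(line)
              if m is None:       # shouldn't happen if the index is clean
                  continue
              warc = m.group(1).decode()
              fh = handles.get(warc)
              if fh is None:
                  fh = handles[warc] = open(
                      os.path.join(outdir, warc + '.cdx'), 'wb')
              fh.write(line)
      for fh in handles.values():
          fh.close()
      return sorted(handles)      # the ~600 warc names

  def sort_partition(path):
      """Internal sort of one ~53K-line partition, numeric on offset."""
      with open(path, 'rb') as f:
          lines = f.readlines()
      lines.sort(key=lambda l: int(OFFSET_RE.search(l).group(1)))
      with open(path, 'wb') as f:
          f.writelines(lines)

  def merge_all(outdir, warcs, dest):
      """External merge of the sorted runs, keyed on (warc, offset)."""
      def keyed(warc):
          with open(os.path.join(outdir, warc + '.cdx'), 'rb') as f:
              for line in f:
                  yield (warc, int(OFFSET_RE.search(line).group(1))), line
      with open(dest, 'wb') as out:
          for _key, line in heapq.merge(*(keyed(w) for w in warcs)):
              out.write(line)

  if __name__ == '__main__':
      # e.g. cdx.gz /tmp/hst/parts merged.cdx (paths hypothetical)
      cdx_path, outdir, dest = sys.argv[1:4]
      warcs = partition(cdx_path, outdir)
      for w in warcs:
          sort_partition(os.path.join(outdir, w + '.cdx'))
      merge_all(outdir, warcs, dest)

The single-pass partition keeps ~600 file handles open at once, which
should fit under a typical 1024-descriptor limit; if that bites, the
fallback is the 600 separate fgrep+sort passes timed above, at the cost
of rescanning cdx.gz each time.  Either way the merge step only ever
holds one line per run in memory, so it's the sorting and I/O, not
memory, that the 10-hour guess has to cover.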