comparison lurid3/notes.txt @ 48:f688c437180b

thinking about merging
author Henry S. Thompson <ht@inf.ed.ac.uk>
date Fri, 04 Oct 2024 15:24:00 +0100
parents fbdaede4155a
children deeac8a0a682
47:fbdaede4155a 48:f688c437180b
753 which turns out to be a case of two Last-Modified headers in the
754 same response record's HTTP headers. RFCs 2616 and 7230 rule it
755 out but neither specifies a recovery, so first-wins is as good as
756 anything, and indeed RFC 6797 specifies that.
757
758 Start looking at how we do the merge of cdx_extras.py with existing index
759
760 Try it with the existing _per segment_ index we have for 2019-35
761
762 Assuming we have to key on segment plus offset, as reconstructing the
763 proper index key is such a pain / buggy / is going to change with the year.
764
765 Stay with segment 49
766
767 >: uz cdx.gz |wc -l
768 29,870,307
769
770 >: time uz cdx.gz|egrep -ao ' "length": "[0-9]*", "offset": "[0-9]*"' |wc
771 29,870,307 119,481,228 1,241,098,122
772 = 4 * 29,870,307
773
774 So no bogons, not _too_ surprising :-)
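
The same bogon check can be sketched in Python, assuming every CDX line carries its "length" and "offset" fields adjacent and in that order, as the egrep pattern above does. count_offset_lines and count_in_gzip are hypothetical helper names, not part of cdx_extras.py. Unlike egrep -o, this counts matching lines rather than individual matches.

```python
import gzip
import re

# Same pattern as the egrep above: adjacent "length" and "offset" fields.
LEN_OFF = re.compile(rb' "length": "[0-9]*", "offset": "[0-9]*"')

def count_offset_lines(lines):
    """Return (total, matched) over an iterable of byte-string lines."""
    total = matched = 0
    for line in lines:
        total += 1
        if LEN_OFF.search(line):
            matched += 1
    return total, matched

def count_in_gzip(path):
    """Stream a gzipped index file line by line, never holding it all in memory."""
    with gzip.open(path, 'rb') as f:
        return count_offset_lines(f)
```

If the two counts agree, every line has a usable offset, which is what the wc figures above show for segment 49.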
775
776 Bad news is it's a _big_ file:
777
778 >: ls -lh cdx.gz
779 -rw-r--r-- 1 hst dc007 2.0G Mar 18 2021 cdx.gz
780
781 So not viable to paste offset as a key and then sort on the command line,
782 or to load it into Python and do the work there...
783
784 Do it per warc file and then merge?
785
786 >: time uz cdx.gz |fgrep -a warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz | sort -n -t\" -k28,28 > /tmp/hst/558.warc.cdx
787
788 real 0m23.494s
789 user 0m14.541s
790 sys 0m9.158s
791
792 >: wc -l /tmp/hst/558.warc.cdx
793 53432 /tmp/hst/558.warc.cdx
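
A rough Python equivalent of the fgrep + sort step above, for one WARC file at a time. This is only a sketch: lines_for_warc is a hypothetical name, and it assumes every selected line has an "offset" field (which the no-bogons count suggests).

```python
import re

# Captures the numeric offset value, i.e. what sort -t\" -k28,28 keys on.
OFFSET = re.compile(rb'"offset": "([0-9]+)"')

def lines_for_warc(lines, warc_name):
    """Select the byte-string lines mentioning warc_name, sorted numerically by offset."""
    picked = [line for line in lines if warc_name in line]
    # Assumption: every picked line has an offset; a line without one
    # would raise AttributeError here.
    picked.sort(key=lambda line: int(OFFSET.search(line).group(1)))
    return picked
```

At around 53,000 lines per WARC file the in-memory sort should be cheap; the cost is in streaming the whole 2.0G index once per file, as the timing above shows.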
794
795 So, 600 of those (roughly 4 hours at 23.5s each), plus approx. the same again
796 for extracting: that probably _is_ doable in Python, not more than 10 hours
797 total, assuming internal sort and external merge are not too expensive...
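
One way the per-WARC-file merge might look, as a sketch only. It assumes both inputs have already been reduced to (offset, payload) pairs sorted by offset, and that offsets are unique within one WARC file. merge_by_offset and the payload shapes are hypothetical; the actual output format of cdx_extras.py is not assumed here.

```python
def merge_by_offset(cdx_rows, extra_rows):
    """Merge-join two iterables of (offset, payload), both sorted by offset.

    Yields (offset, cdx_payload, extra_payload_or_None) for every CDX row.
    """
    extras = iter(extra_rows)
    cur = next(extras, None)
    for off, payload in cdx_rows:
        # Skip any extras whose offset has no corresponding CDX row.
        while cur is not None and cur[0] < off:
            cur = next(extras, None)
        if cur is not None and cur[0] == off:
            yield off, payload, cur[1]
            cur = next(extras, None)
        else:
            # No extras for this record: pass the CDX row through unchanged.
            yield off, payload, None
```

Because both sides are consumed as streams, memory use stays flat regardless of file size; only the per-file sort needs the rows in memory.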