comparison lurid3/notes.txt @ 48:f688c437180b
thinking about merging
author | Henry S. Thompson <ht@inf.ed.ac.uk> |
---|---|
date | Fri, 04 Oct 2024 15:24:00 +0100 |
parents | fbdaede4155a |
children | deeac8a0a682 |
47:fbdaede4155a | 48:f688c437180b |
---|---|
753 which turns out to be a case of two Last-Modified headers in the same | 753 which turns out to be a case of two Last-Modified headers in the same |
754 response record's HTTP headers. RFCs 2616 and 7230 rule it | 754 response record's HTTP headers. RFCs 2616 and 7230 rule it |
755 out but neither specifies a recovery, so first-wins is as good as | 755 out but neither specifies a recovery, so first-wins is as good as |
756 anything, and indeed 6797 specifies that. | 756 anything, and indeed 6797 specifies that. |
757 | 757 |
758 Start looking at how we do the merge of cdx_extras.py with existing index | |
759 | |
760 Try it with the existing _per segment_ index we have for 2019-35 | |
761 | |
762 Assuming we have to key on segment plus offset, as reconstructing the | |
763 proper index key is such a pain / buggy / is going to change with the year. | |
764 | |
765 Stay with segment 49 | |
766 | |
767 >: uz cdx.gz |wc -l | |
768 29,870,307 | |
769 | |
770 >: time uz cdx.gz|egrep -ao ' "length": "[0-9]*", "offset": "[0-9]*"' |wc | |
771 29,870,307 119,481,228 1,241,098,122 | |
772 = 4 * 29,870,307 | |
773 | |
774 So no bogons, not _too_ surprising :-) | |
775 | |
776 Bad news is it's a _big_ file: | |
777 | |
778 >: ls -lh cdx.gz | |
779 -rw-r--r-- 1 hst dc007 2.0G Mar 18 2021 cdx.gz | |
780 | |
781 So not viable to paste offset as a key and then sort on command line, | |
782 or to load it in to python and do the work there... | |
783 | |
784 Do it per warc file and then merge? | |
785 | |
786 >: time uz cdx.gz |fgrep -a warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz | sort -n -t\" -k28,28 > /tmp/hst/558.warc.cdx | |
787 | |
788 real 0m23.494s | |
789 user 0m14.541s | |
790 sys 0m9.158s | |
791 | |
792 >: wc -l /tmp/hst/558.warc.cdx | |
793 53432 /tmp/hst/558.warc.cdx | |
794 | |
795 So, 600 of those, plus approx. same again for extracting, that pbly | |
796 _is_ doable in python, not more than 10 hours total, assuming internal | |
797 sort and external merge is not too expensive... |
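
A minimal Python sketch of the per-WARC-file step considered in the notes above: pick out the index lines for one WARC file, sort them numerically by offset (mirroring the `fgrep | sort -n` pipeline), then merge-join them against extra values keyed on the same offset. Several things here are assumptions, not taken from the notes: that each index line has the form `surt timestamp {json}`, that the JSON carries a `"filename"` property alongside the `"offset"` seen in the egrep pattern, and that the output of cdx_extras.py can be reduced to `(offset, value)` pairs sorted by offset. The function names are hypothetical.

```python
import json


def cdx_key(line):
    """Return (filename, offset) for one index line.

    Assumption: the line is 'surt timestamp {json}' and the JSON object
    has "filename" and "offset" properties (offset stored as a string).
    """
    _surt, _timestamp, payload = line.rstrip("\n").split(" ", 2)
    props = json.loads(payload)
    return props["filename"], int(props["offset"])


def rows_for_warc(lines, warc_name):
    """Collect (offset, line) pairs for one WARC file, sorted by offset."""
    rows = []
    for line in lines:
        if warc_name not in line:  # cheap substring prefilter, like fgrep
            continue
        filename, offset = cdx_key(line)
        if filename.endswith(warc_name):
            rows.append((offset, line))
    rows.sort(key=lambda row: row[0])  # numeric sort on offset
    return rows


def merge_join(index_rows, extra_rows):
    """Yield (offset, line, extra) for every index row.

    Both inputs must already be sorted by offset. extra is None when no
    extra row has the same offset. The (offset, value) shape of
    extra_rows is hypothetical.
    """
    extras = iter(extra_rows)
    current = next(extras, None)
    for offset, line in index_rows:
        # Skip extra rows that fall before this index offset.
        while current is not None and current[0] < offset:
            current = next(extras, None)
        if current is not None and current[0] == offset:
            yield offset, line, current[1]
        else:
            yield offset, line, None
```

Only one WARC file's rows (about 53 thousand in the example above) are held in memory at a time, so the 2.0G index never has to be loaded whole. The filter still scans every line once per WARC file, as the shell pipeline does.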