comparison lurid3/notes.txt @ 49:deeac8a0a682

tentative plan for merging
author Henry S. Thompson <ht@inf.ed.ac.uk>
date Fri, 04 Oct 2024 21:41:53 +0100
parents f688c437180b
children 5556c04c7597
sys	0m9.158s

>: wc -l /tmp/hst/558.warc.cdx
53432 /tmp/hst/558.warc.cdx
794 794
>: echo $((600 * 53432))
32059200
797
So, 600 of those, plus approx. same again for extracting, that pbly
_is_ doable in python, not more than 10 hours total, assuming internal
sort and external merge is not too expensive...
801
For each segment, suppose we pull out 60 groups of 10 target files
>: time uz cdx.gz |egrep -a warc/CC-MAIN-2019[^-]*-2019[^-]*-0000..warc.gz > /tmp/hst/0000.warc.cdx

real	0m42.129s
user	0m35.147s
sys	0m9.140s
>: wc -l /tmp/hst/0000.warc.cdx
533150

Key it with offset and sort:

>: time egrep -ao ' "length": "[0-9]*", "offset": "[0-9]*"' /tmp/hst/0000.warc.cdx | cut -f 5 -d ' ' | tr -d \" > /tmp/hst/0000_offsets

real	0m5.578s
user	0m5.593s
sys	0m0.265s

>: time paste /tmp/hst/0000_offsets /tmp/hst/0000.warc.cdx |sort -nk1,1 | cut -f 2 > /tmp/hst/0000_sorted.warc.cdx

real	0m4.185s
user	0m2.001s
sys	0m1.334s
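The key-and-sort step could equally be sketched in Python (a sketch
only; `sorted` with a numeric key replaces the paste/sort/cut
round-trip, and the `sort_by_offset` name is just for illustration):

```python
import re

# The offset field as it appears in the cdx JSON blob
OFFSET = re.compile(r'"offset": "([0-9]+)"')

def sort_by_offset(lines):
    # Sort cdx lines numerically on their embedded offset
    return sorted(lines, key=lambda l: int(OFFSET.search(l).group(1)))
```

An in-memory sort of ~533k lines should be comfortably fast, but the
shell pipeline above avoids holding the file in Python at all.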
824
>: time seq 0 9 | parallel -j 10 "~/lib/python/cc/cdx_extras.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-*-*-0000'{}'.warc.gz > /tmp/hst/lm_0000'{}'.tsv"

real	0m24.610s
user	2m54.146s
sys	0m10.226s

>: head /tmp/hst/lm_00000.tsv
9398	16432	Mon, 19 Aug 2019 02:44:15 GMT
20796	26748	Tue, 16 Jul 2019 04:39:09 GMT
4648	340633	Fri, 07 Dec 2018 09:05:59 GMT
3465	357109	Sun, 18 Aug 2019 11:48:23 GMT
7450	914189	Mon, 19 Aug 2019 02:50:08 GMT
...
sing<3956>: fgrep '"length": "9398", "offset": "16432"' /tmp/hst/0000_sorted.warc.cdx
com,roommeme,0401a)/index.phtml?channel=&op=&p=140&put=show&r2= 20190819024416 {"url": "http://0401a.roommeme.com/index.phtml?PUT=SHOW&R2=&OP=&P=140&CHANNEL=", "mime": "text/html", "mime-detected": "text/html", "status": "200", "digest": "5DNDVX5HQBOOBHISSCOI4UBVMUL63L36", "length": "9398", "offset": "16432", "filename": "crawl-data/CC-MAIN-2019-35/segments/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00000.warc.gz", "charset": "Big5", "languages": "zho"}

bingo
842
So, the python code is pretty straightforward: open the 10 individual
lm-*.tsv outputs into an array, initialise a 10-elt array with the
first line of each and another with its offset, record the
fileno(s) of the lowest offset, then iterate

  read cdx lines and write unchanged until offset = lowest
  merge line from fileno and output
  remove fileno from list of matches
  read and store a new line for fileno [handle EOF]
  if list of matches is empty, redo setting of lowest

Resort the result by actual key
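A minimal sketch of that merge loop, under some assumptions: the lm
streams are tab-separated (length, offset, Last-Modified) as above,
the spliced-in "last-modified" JSON field name is invented here, and
ties between files sharing an offset would in practice also need the
length (or filename) checked before picking a fileno:

```python
import re

OFFSET = re.compile(r'"offset": "([0-9]+)"')

def read_rec(f):
    # Next (offset, length, last-modified) from an lm-*.tsv stream, or None at EOF
    line = f.readline()
    if not line:
        return None
    length, offset, date = line.rstrip('\n').split('\t', 2)
    return (int(offset), length, date)

def merge(cdx_lines, lm_files):
    # cdx_lines: cdx lines already sorted numerically by offset
    # lm_files: the 10 open lm-*.tsv streams
    cur = [read_rec(f) for f in lm_files]

    def lowest():
        # (lowest pending offset, filenos currently at that offset)
        live = [(c[0], i) for i, c in enumerate(cur) if c is not None]
        if not live:
            return None, []
        low = min(o for o, _ in live)
        return low, [i for o, i in live if o == low]

    low, matches = lowest()
    for line in cdx_lines:
        off = int(OFFSET.search(line).group(1))
        if matches and off == low:
            i = matches.pop(0)                # merge line from fileno
            date = cur[i][2]
            # splice the Last-Modified value into the JSON blob
            line = line.rstrip('\n')[:-1] + ', "last-modified": "%s"}\n' % date
            cur[i] = read_rec(lm_files[i])    # read a new line [EOF -> None]
            if not matches:                   # list empty: redo lowest
                low, matches = lowest()
        yield line                            # unchanged lines pass through
```

The generator writes every cdx line through, touching only those whose
offset matches the current lowest pending lm record; the final resort
by actual key is then a separate external-sort pass.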