comparison lurid3/notes.txt @ 49:deeac8a0a682
tentative plan for merging

author:   Henry S. Thompson <ht@inf.ed.ac.uk>
date:     Fri, 04 Oct 2024 21:41:53 +0100
parents:  f688c437180b
children: 5556c04c7597
sys 0m9.158s

>: wc -l /tmp/hst/558.warc.cdx
53432 /tmp/hst/558.warc.cdx

>: echo $((600 * 53432))
32,059,200

So, 600 of those, plus approximately the same again for extracting: that
probably _is_ doable in python, not more than 10 hours total, assuming
internal sort and external merge are not too expensive...

For each segment, suppose we pull out 60 groups of 10 target files
>: time uz cdx.gz |egrep -a warc/CC-MAIN-2019[^-]*-2019[^-]*-0000..warc.gz > /tmp/hst/0000.warc.cdx

real 0m42.129s
user 0m35.147s
sys 0m9.140s
>: wc -l /tmp/hst/0000.warc.cdx
533150

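Just for comparison, the same group selection could be done directly in
python; a minimal sketch only (the cdx.gz path and the output name are
stand-ins for the real per-segment files, and the pattern is copied from
the egrep above):

import gzip, re

# match cdx entries whose filename field points at warc files 00000-00009
GROUP = re.compile(rb'warc/CC-MAIN-2019[^-]*-2019[^-]*-0000.\.warc\.gz')

with gzip.open('cdx.gz', 'rb') as inf, open('/tmp/hst/0000.warc.cdx', 'wb') as outf:
    for line in inf:
        if GROUP.search(line):
            outf.write(line)
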
Key it with offset and sort:

>: time egrep -ao ' "length": "[0-9]*", "offset": "[0-9]*"' /tmp/hst/0000.warc.cdx | cut -f 5 -d ' ' | tr -d \" > /tmp/hst/0000_offsets

real 0m5.578s
user 0m5.593s
sys 0m0.265s

>: time paste /tmp/hst/0000_offsets /tmp/hst/0000.warc.cdx |sort -nk1,1 | cut -f 2 > /tmp/hst/0000_sorted.warc.cdx

real 0m4.185s
user 0m2.001s
sys 0m1.334s

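The same keying-and-sorting could also be done in memory once this moves
into python; a sketch only, assuming the whole group's cdx (~533K lines)
fits comfortably in memory, which it clearly does:

import re

OFFSET = re.compile(r'"offset": "([0-9]+)"')

with open('/tmp/hst/0000.warc.cdx') as f:
    lines = f.readlines()

# numeric sort on the offset field, as sort -nk1,1 does on the pasted key
lines.sort(key=lambda l: int(OFFSET.search(l).group(1)))

with open('/tmp/hst/0000_sorted.warc.cdx', 'w') as f:
    f.writelines(lines)
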
>: time seq 0 9 | parallel -j 10 "~/lib/python/cc/cdx_extras.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-*-*-0000'{}'.warc.gz > /tmp/hst/lm_0000'{}'.tsv"

real 0m24.610s
user 2m54.146s
sys 0m10.226s

>: head /tmp/hst/lm_00000.tsv
9398 16432 Mon, 19 Aug 2019 02:44:15 GMT
20796 26748 Tue, 16 Jul 2019 04:39:09 GMT
4648 340633 Fri, 07 Dec 2018 09:05:59 GMT
3465 357109 Sun, 18 Aug 2019 11:48:23 GMT
7450 914189 Mon, 19 Aug 2019 02:50:08 GMT
...
sing<3956>: fgrep '"length": "9398", "offset": "16432"' /tmp/hst/0000_sorted.warc.cdx
com,roommeme,0401a)/index.phtml?channel=&op=&p=140&put=show&r2= 20190819024416 {"url": "http://0401a.roommeme.com/index.phtml?PUT=SHOW&R2=&OP=&P=140&CHANNEL=", "mime": "text/html", "mime-detected": "text/html", "status": "200", "digest": "5DNDVX5HQBOOBHISSCOI4UBVMUL63L36", "length": "9398", "offset": "16432", "filename": "crawl-data/CC-MAIN-2019-35/segments/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00000.warc.gz", "charset": "Big5", "languages": "zho"}

bingo

So, the python code is pretty straightforward: open the 10 individual
lm_*.tsv outputs into an array, initialise a 10-elt array with the
first line of each and another with its offset, record the
fileno(s) of the lowest offset, then iterate (a python sketch follows
the plan):

  read cdx lines and write unchanged until offset = lowest
  merge line from fileno and output
  remove fileno from list of matches
  read and store a new line for fileno [handle EOF]
  if list of matches is empty, redo setting of lowest

Resort the result by actual key
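
A rough sketch of that loop, just to pin the plan down. Assumptions not
established above: the lm_0000?.tsv files are (length, offset,
Last-Modified) and already in offset order, every offset they mention
does occur in the sorted cdx, "merge" is shown as simply appending the
lm fields to the end of the cdx line, and the 0000_merged name is made up:

import re

OFFSET = re.compile(r'"offset": "([0-9]+)"')

lms = [open('/tmp/hst/lm_0000%d.tsv' % i) for i in range(10)]
heads = [f.readline() for f in lms]                # current pending line per file
offsets = [int(h.split()[1]) if h else None for h in heads]

def lowest():
    """Lowest pending offset and the fileno(s) holding it."""
    live = [o for o in offsets if o is not None]
    if not live:
        return None, []
    low = min(live)
    return low, [i for i, o in enumerate(offsets) if o == low]

low, matches = lowest()

with open('/tmp/hst/0000_sorted.warc.cdx') as cdx, \
     open('/tmp/hst/0000_merged.warc.cdx', 'w') as out:
    for line in cdx:
        off = int(OFFSET.search(line).group(1))
        if low is not None and off == low and matches:
            i = matches.pop(0)                     # merge line from fileno and output
            out.write(line.rstrip('\n') + '\t' + heads[i].rstrip('\n') + '\n')
            heads[i] = lms[i].readline()           # read and store a new line [EOF -> '']
            offsets[i] = int(heads[i].split()[1]) if heads[i] else None
            if not matches:                        # list of matches empty: redo lowest
                low, matches = lowest()
        else:
            out.write(line)                        # write unchanged

for f in lms:
    f.close()

The output is still in offset order, so the final step above, re-sorting
by the actual cdx key, still applies.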