comparison lurid3/notes.txt @ 49:deeac8a0a682

tentative plan for merging
author Henry S. Thompson <ht@inf.ed.ac.uk>
date Fri, 04 Oct 2024 21:41:53 +0100
parents f688c437180b
children 5556c04c7597
sys	0m9.158s

>: wc -l /tmp/hst/558.warc.cdx
53432 /tmp/hst/558.warc.cdx
794 794
>: echo $((600 * 53432))
32059200
797
So, 600 of those, plus approx. same again for extracting, that pbly
_is_ doable in python, not more than 10 hours total, assuming internal
sort and external merge is not too expensive...
801
For each segment, suppose we pull out 60 groups of 10 target files
>: time uz cdx.gz |egrep -a warc/CC-MAIN-2019[^-]*-2019[^-]*-0000..warc.gz > /tmp/hst/0000.warc.cdx

real	0m42.129s
user	0m35.147s
sys	0m9.140s
>: wc -l /tmp/hst/0000.warc.cdx
533150

Key it with offset and sort:

>: time egrep -ao ' "length": "[0-9]*", "offset": "[0-9]*"' /tmp/hst/0000.warc.cdx | cut -f 5 -d ' ' | tr -d \" > /tmp/hst/0000_offsets

real	0m5.578s
user	0m5.593s
sys	0m0.265s

>: time paste /tmp/hst/0000_offsets /tmp/hst/0000.warc.cdx |sort -nk1,1 | cut -f 2 > /tmp/hst/0000_sorted.warc.cdx

real	0m4.185s
user	0m2.001s
sys	0m1.334s
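The key-and-sort step could equally be sketched in Python (a sketch
only; `sorted` with a numeric key replaces the paste/sort/cut
round-trip, and the `sort_by_offset` name is just for illustration):

```python
import re

# The offset field as it appears in the cdx JSON blob
OFFSET = re.compile(r'"offset": "([0-9]+)"')

def sort_by_offset(lines):
    # Sort cdx lines numerically on their embedded offset
    return sorted(lines, key=lambda l: int(OFFSET.search(l).group(1)))
```

An in-memory sort of ~533k lines should be comfortably fast, but the
shell pipeline above avoids holding the file in Python at all.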
824
>: time seq 0 9 | parallel -j 10 "~/lib/python/cc/cdx_extras.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-*-*-0000'{}'.warc.gz > /tmp/hst/lm_0000'{}'.tsv"

real	0m24.610s
user	2m54.146s
sys	0m10.226s

>: head /tmp/hst/lm_00000.tsv
9398	16432	Mon, 19 Aug 2019 02:44:15 GMT
20796	26748	Tue, 16 Jul 2019 04:39:09 GMT
4648	340633	Fri, 07 Dec 2018 09:05:59 GMT
3465	357109	Sun, 18 Aug 2019 11:48:23 GMT
7450	914189	Mon, 19 Aug 2019 02:50:08 GMT
...
sing<3956>: fgrep '"length": "9398", "offset": "16432"' /tmp/hst/0000_sorted.warc.cdx
com,roommeme,0401a)/index.phtml?channel=&op=&p=140&put=show&r2= 20190819024416 {"url": "http://0401a.roommeme.com/index.phtml?PUT=SHOW&R2=&OP=&P=140&CHANNEL=", "mime": "text/html", "mime-detected": "text/html", "status": "200", "digest": "5DNDVX5HQBOOBHISSCOI4UBVMUL63L36", "length": "9398", "offset": "16432", "filename": "crawl-data/CC-MAIN-2019-35/segments/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00000.warc.gz", "charset": "Big5", "languages": "zho"}

bingo
842
So, the python code is pretty straightforward: open the 10 individual
lm-*.tsv outputs into an array, initialise a 10-elt array with the
first line of each and another with its offset, record the
fileno(s) of the lowest offset, then iterate

  read cdx lines and write unchanged until offset = lowest
  merge line from fileno and output
  remove fileno from list of matches
  read and store a new line for fileno [handle EOF]
  if list of matches is empty, redo setting of lowest

Resort the result by actual key
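A minimal sketch of that merge loop, under some assumptions: the lm
streams are tab-separated (length, offset, Last-Modified) as above,
the spliced-in "last-modified" JSON field name is invented here, and
ties between files sharing an offset would in practice also need the
length (or filename) checked before picking a fileno:

```python
import re

OFFSET = re.compile(r'"offset": "([0-9]+)"')

def read_rec(f):
    # Next (offset, length, last-modified) from an lm-*.tsv stream, or None at EOF
    line = f.readline()
    if not line:
        return None
    length, offset, date = line.rstrip('\n').split('\t', 2)
    return (int(offset), length, date)

def merge(cdx_lines, lm_files):
    # cdx_lines: cdx lines already sorted numerically by offset
    # lm_files: the 10 open lm-*.tsv streams
    cur = [read_rec(f) for f in lm_files]

    def lowest():
        # (lowest pending offset, filenos currently at that offset)
        live = [(c[0], i) for i, c in enumerate(cur) if c is not None]
        if not live:
            return None, []
        low = min(o for o, _ in live)
        return low, [i for o, i in live if o == low]

    low, matches = lowest()
    for line in cdx_lines:
        off = int(OFFSET.search(line).group(1))
        if matches and off == low:
            i = matches.pop(0)                # merge line from fileno
            date = cur[i][2]
            # splice the Last-Modified value into the JSON blob
            line = line.rstrip('\n')[:-1] + ', "last-modified": "%s"}\n' % date
            cur[i] = read_rec(lm_files[i])    # read a new line [EOF -> None]
            if not matches:                   # list empty: redo lowest
                low, matches = lowest()
        yield line                            # unchanged lines pass through
```

The generator writes every cdx line through, touching only those whose
offset matches the current lowest pending lm record; the final resort
by actual key is then a separate external-sort pass.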