changeset 49:deeac8a0a682
tentative plan for merging
author    Henry S. Thompson <ht@inf.ed.ac.uk>
date      Fri, 04 Oct 2024 21:41:53 +0100
parents   f688c437180b
children  5556c04c7597
files     lurid3/notes.txt
diffstat  1 files changed, 57 insertions(+), 0 deletions(-)
--- a/lurid3/notes.txt	Fri Oct 04 15:24:00 2024 +0100
+++ b/lurid3/notes.txt	Fri Oct 04 21:41:53 2024 +0100
@@ -792,6 +792,63 @@
  >: wc -l /tmp/hst/558.warc.cdx
  53432 /tmp/hst/558.warc.cdx
+ >: echo $((600 * 53432))
+ 32,059,200
+ So, 600 of those, plus approx. the same again for extracting; that probably _is_ doable in python, not more than 10 hours total, assuming internal sort and external merge is not too expensive...
+
+For each segment, suppose we pull out 60 groups of 10 target files
+ >: time uz cdx.gz |egrep -a warc/CC-MAIN-2019[^-]*-2019[^-]*-0000..warc.gz > /tmp/hst/0000.warc.cdx
+
+ real 0m42.129s
+ user 0m35.147s
+ sys 0m9.140s
+ >: wc -l /tmp/hst/0000.warc.cdx
+ 533150
+
+Key it with offset and sort:
+
+ >: time egrep -ao ' "length": "[0-9]*", "offset": "[0-9]*"' /tmp/hst/0000.warc.cdx | cut -f 5 -d ' ' | tr -d \" > /tmp/hst/0000_offsets
+
+ real 0m5.578s
+ user 0m5.593s
+ sys 0m0.265s
+
+ >: time paste /tmp/hst/0000_offsets /tmp/hst/0000.warc.cdx |sort -nk1,1 | cut -f 2 > /tmp/hst/0000_sorted.warc.cdx
+
+ real 0m4.185s
+ user 0m2.001s
+ sys 0m1.334s
+
+ >: time seq 0 9 | parallel -j 10 "~/lib/python/cc/cdx_extras.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-*-*-0000'{}'.warc.gz > /tmp/hst/lm_0000'{}'.tsv"
+
+ real 0m24.610s
+ user 2m54.146s
+ sys 0m10.226s
+
+ >: head /tmp/hst/lm_00000.tsv
+ 9398 16432 Mon, 19 Aug 2019 02:44:15 GMT
+ 20796 26748 Tue, 16 Jul 2019 04:39:09 GMT
+ 4648 340633 Fri, 07 Dec 2018 09:05:59 GMT
+ 3465 357109 Sun, 18 Aug 2019 11:48:23 GMT
+ 7450 914189 Mon, 19 Aug 2019 02:50:08 GMT
+ ...
+ sing<3956>: fgrep '"length": "9398", "offset": "16432"' /tmp/hst/0000_sorted.warc.cdx
+ com,roommeme,0401a)/index.phtml?channel=&op=&p=140&put=show&r2= 20190819024416 {"url": "http://0401a.roommeme.com/index.phtml?PUT=SHOW&R2=&OP=&P=140&CHANNEL=", "mime": "text/html", "mime-detected": "text/html", "status": "200", "digest": "5DNDVX5HQBOOBHISSCOI4UBVMUL63L36", "length": "9398", "offset": "16432", "filename": "crawl-data/CC-MAIN-2019-35/segments/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00000.warc.gz", "charset": "Big5", "languages": "zho"}
+
+bingo
+
+So, the python code is pretty straightforward: open the 10 individual
+lm_*.tsv outputs into an array, initialise a 10-elt array with the
+first line of each and another with its offset, record the
+fileno(s) of the lowest offset, then iterate:
+
+  read cdx lines and write unchanged until offset = lowest
+    merge line from fileno and output
+    remove fileno from list of matches
+    read and store a new line for fileno [handle EOF]
+    if list of matches is empty, redo setting of lowest
+
+Resort the result by actual key
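
A minimal sketch of that merge loop, assuming the lm_*.tsv files are
tab-separated (length, offset, date) as in the head output above, that
offset alone identifies a record within a group of 10, and that
merging means splicing a "last-modified" property into the CDX JSON
(the property name is an assumption); heapq.merge stands in for the
hand-rolled lowest-offset bookkeeping and EOF handling the plan
describes:

  #!/usr/bin/env python3
  import heapq, re, sys

  OFFSET_RE = re.compile(r'"offset": "([0-9]+)"')

  def lm_records(path):
      # Assumed format: length <TAB> offset <TAB> date, sorted by offset
      with open(path) as f:
          for line in f:
              length, offset, date = line.rstrip('\n').split('\t')
              yield int(offset), length, date

  def merge(cdx_lines, lm_paths, out):
      # k-way merge of the 10 lm streams, ascending by offset
      pending = heapq.merge(*(lm_records(p) for p in lm_paths))
      nxt = next(pending, None)
      for line in cdx_lines:
          # write cdx lines unchanged until we hit the lowest lm offset
          m = OFFSET_RE.search(line)
          if nxt is not None and m and int(m.group(1)) == nxt[0]:
              # splice the date into the JSON blob, before the closing brace
              line = '%s, "last-modified": "%s"}\n' % (line.rstrip()[:-1], nxt[2])
              nxt = next(pending, None)
          out.write(line)

  if __name__ == '__main__':
      paths = ['/tmp/hst/lm_0000%d.tsv' % i for i in range(10)]
      with open('/tmp/hst/0000_sorted.warc.cdx') as cdx:
          merge(cdx, paths, sys.stdout)

If offsets can collide across the 10 warc files in a group, matching
on length as well (as in the fgrep check above) would disambiguate.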