changeset 49:deeac8a0a682

tentative plan for merging
author Henry S. Thompson <ht@inf.ed.ac.uk>
date Fri, 04 Oct 2024 21:41:53 +0100
parents f688c437180b
children 5556c04c7597
files lurid3/notes.txt
diffstat 1 files changed, 57 insertions(+), 0 deletions(-)
--- a/lurid3/notes.txt	Fri Oct 04 15:24:00 2024 +0100
+++ b/lurid3/notes.txt	Fri Oct 04 21:41:53 2024 +0100
@@ -792,6 +792,63 @@
   >: wc -l /tmp/hst/558.warc.cdx
   53432 /tmp/hst/558.warc.cdx
 
+  >: echo $((600 * 53432))
+  32059200
+
 So, 600 of those, plus approx. same again for extracting, that pbly
 _is_ doable in python, not more than 10 hours total, assuming internal
 sort and external merge is not too expensive...
+
+For each segment, suppose we pull out 60 groups of 10 target files
+  >: time uz cdx.gz |egrep -a warc/CC-MAIN-2019[^-]*-2019[^-]*-0000..warc.gz > /tmp/hst/0000.warc.cdx
+
+  real  0m42.129s
+  user  0m35.147s
+  sys   0m9.140s
+  >: wc -l /tmp/hst/0000.warc.cdx
+  533150
+
+Key it with offset and sort:
+
+  >: time egrep -ao ' "length": "[0-9]*", "offset": "[0-9]*"' /tmp/hst/0000.warc.cdx | cut -f 5 -d ' ' | tr -d \"  > /tmp/hst/0000_offsets
+
+  real  0m5.578s
+  user  0m5.593s
+  sys   0m0.265s
+
+  >: time paste /tmp/hst/0000_offsets /tmp/hst/0000.warc.cdx |sort -nk1,1 | cut -f 2 > /tmp/hst/0000_sorted.warc.cdx
+
+  real  0m4.185s
+  user  0m2.001s
+  sys   0m1.334s
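+
+A Python equivalent of that decorate-sort would be simple enough.  A
+minimal sketch, assuming only the cdx line layout visible below (SURT
+key, timestamp, then a JSON blob with an "offset" field):
+
+  import json, sys
+
+  def offset_of(line):
+      # cdx line = SURT key, timestamp, JSON blob; key on the offset
+      return int(json.loads(line.split(' ', 2)[2])["offset"])
+
+  lines = sys.stdin.readlines()          # 0000.warc.cdx on stdin
+  lines.sort(key=offset_of)
+  sys.stdout.writelines(lines)
+
+That avoids the intermediate offsets file, at the cost of holding the
+whole group in memory (ca. 500K lines here, per the wc above).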
+
+  >: time seq 0 9 | parallel -j 10 "~/lib/python/cc/cdx_extras.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-*-*-0000'{}'.warc.gz > /tmp/hst/lm_0000'{}'.tsv"
+
+  real  0m24.610s
+  user  2m54.146s
+  sys   0m10.226s
+
+  >: head /tmp/hst/lm_00000.tsv
+  9398  16432     Mon, 19 Aug 2019 02:44:15 GMT
+  20796 26748     Tue, 16 Jul 2019 04:39:09 GMT
+  4648  340633    Fri, 07 Dec 2018 09:05:59 GMT
+  3465  357109    Sun, 18 Aug 2019 11:48:23 GMT
+  7450  914189    Mon, 19 Aug 2019 02:50:08 GMT
+  ...
+  sing<3956>: fgrep '"length": "9398", "offset": "16432"' /tmp/hst/0000_sorted.warc.cdx
+  com,roommeme,0401a)/index.phtml?channel=&op=&p=140&put=show&r2= 20190819024416 {"url": "http://0401a.roommeme.com/index.phtml?PUT=SHOW&R2=&OP=&P=140&CHANNEL=", "mime": "text/html", "mime-detected": "text/html", "status": "200", "digest": "5DNDVX5HQBOOBHISSCOI4UBVMUL63L36", "length": "9398", "offset": "16432", "filename": "crawl-data/CC-MAIN-2019-35/segments/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00000.warc.gz", "charset": "Big5", "languages": "zho"}
+
+bingo
+
+So, the python code is pretty straightforward (sketched below): open
+the 10 individual lm_*.tsv outputs into an array, initialise a
+10-element array with the first line of each and another with its
+offset, record the fileno(s) of the lowest offset, then iterate:
+
+  read cdx lines and write unchanged until offset = lowest
+  merge line from fileno and output
+  remove fileno from list of matches
+  read and store a new line for fileno [handle EOF]
+  if list of matches is empty, redo setting of lowest
+
+Re-sort the result by the actual cdx (SURT) key
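+
+A minimal sketch of that loop in Python, assuming (the notes don't
+pin this down) that the lm_*.tsv columns are tab-separated
+length/offset/date and that merging means appending the date to the
+matching cdx line:
+
+  import json, sys
+
+  lms = [open('/tmp/hst/lm_0000%d.tsv' % i) for i in range(10)]
+
+  def read_lm(i):
+      # next (offset, date) pair from lm file i, or None at EOF
+      line = lms[i].readline()
+      if not line:
+          return None
+      _, offset, date = line.rstrip('\n').split('\t', 2)
+      return int(offset), date
+
+  heads = [read_lm(i) for i in range(10)]
+
+  def reset_lowest():
+      # lowest offset among the live heads, and the fileno(s) holding it
+      live = [h[0] for h in heads if h is not None]
+      if not live:
+          return None, []
+      lo = min(live)
+      return lo, [i for i, h in enumerate(heads)
+                  if h is not None and h[0] == lo]
+
+  lowest, matches = reset_lowest()
+
+  for line in sys.stdin:                  # 0000_sorted.warc.cdx on stdin
+      blob = json.loads(line.split(' ', 2)[2])
+      if lowest is not None and int(blob["offset"]) == lowest:
+          i = matches.pop(0)              # one fileno at this offset
+          sys.stdout.write(line.rstrip('\n') + '\t' + heads[i][1] + '\n')
+          heads[i] = read_lm(i)           # EOF handled by read_lm
+          if not matches:
+              lowest, matches = reset_lowest()
+      else:
+          sys.stdout.write(line)          # pass through unchanged
+
+Note that offsets are per warc file, so ties at the same offset across
+filenos are possible; matching on length as well, as the fgrep check
+above does, would be safer in practice.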