changeset 73:1283a574260d

working on cross-check of warc2cdb
author Henry Thompson <ht@markup.co.uk>
date Sun, 09 Mar 2025 19:59:50 +0000
parents 7901ce4a39e3
children ff6ef190f901
files lurid3/notes.txt
diffstat 1 files changed, 99 insertions(+), 0 deletions(-)
--- a/lurid3/notes.txt	Thu Mar 06 01:45:24 2025 +0000
+++ b/lurid3/notes.txt	Sun Mar 09 19:59:50 2025 +0000
@@ -1630,6 +1630,105 @@
   >: uz idx/cdx-00101.gz | cmp - /tmp/hst/cdb-00101
   >: echo $?
   0
+
+OK, now to fill up 2023-40 so we can do a real trial.
+
+Cythonised lmh.py and warc.py, but still running at about 14 seconds per file.
+Don't actually need these yet, as 2023-40 has a full set of lmh
+outputs, but not many ks.tsv:
+  >: ls */ks.tsv
+  0/ks.tsv   15/ks.tsv  4/ks.tsv   68/ks.tsv
+  12/ks.tsv  36/ks.tsv  56/ks.tsv  best_two_by_nl1/ks.tsv
+
+Note that to populate cdbs, we don't need most of the logic in
+sort_date.py, because none of the sorting and merging is needed!
+
+Hmm.  In fact, just go straight to cdb_in.txt
+
+  >: time python3 -c 'import sys,warc2cdb; sys.exit(warc2cdb.main(*sys.argv[1:]))'  2023-40 15 0 /tmp/hst 2>/tmp/hst/15/errs
+  /tmp/hst/15/lmh.cdb_in
+
+  real    223m28.517s
+  user    168m34.751s
+  sys     12m6.020s
+
+This compares to 273 minutes for 806 files in the cleanup of segment
+95, see end of old_notes.  Best full segment was 96, i.e. 279 minutes
+for 900 files.  And those figures are for lmh.py only.
+
+Are results the same?
+Old:
+  >: wc -l *.tsv
+     8247 ks_errs.tsv
+  5324616 ks.tsv
+New:
+  >: wc -l *
+     9147 errs
+  5324617 lmh.cdb_in
+
+I _think_ that's good -- there's one line per input file added to the
+error output now, and one blank line at the end of the cdb_in file.
+Old:
+  >: cut -f 2- ks_errs.tsv |sus
+     5420 cannot unpack non-iterable NoneType object
+     2670 list index out of range
+      113 year 641471 is out of range
+       20 year 641474 is out of range
+        9 hour must be in 0..23
+        9 year 642435 is out of range
+        3 year 642437 is out of range
+        1 year 4262575 is out of range
+        1 year 641769 is out of range
+        1 year 642434 is out of range
+New:
+  >: cut -f 2- <(fgrep -v beegfs /tmp/hst/15/errs) | cut -f 1-5 -d ' ' |sus
+     8090 Invalid date value or format
+      113 year 641471 is out of
+       20 year 641474 is out of
+        9 hour must be in 0..23
+        9 year 642435 is out of
+        3 year 642437 is out of
+        1 year 4262575 is out of
+        1 year 641769 is out of
+        1 year 642434 is out of
+
+But I failed to fix the case where the URI in the original WARC is not
+%-encoded:
+  >: diff -bw <(cut -f2- ks.tsv| sed 's/ //;s/\.0$//'|sort -k1,1) <(sed 's/^\+[0-9,]*://;s/->\([0-9]*\)$/ \1/' /tmp/hst/15/lmh.cdb_in_nopercent | sort -k1,1)|head
+  4360c4361
+  < 20230922034234https://www.hansaton.com/zh-hk/\u6D88\u8CBB\u8005.html  1695354153
+  ---
+  > 20230922034234https://www.hansaton.com/zh-hk/消費者.html 1695354153
+  6574c6575
+  < 20230922034248https://blog.arabtherapy.com/\u0639\u0644\u0627\u062C-\u0627\u0644\u062E\u062C\u0644-\u0627\u0644\u0627\u062C\u062A\u0645\u0627\u0639\u064A/     1695354168
+  ---
+  > 20230922034248https://blog.arabtherapy.com/علاج-الخجل-الاجتماعي/ 1695354168
+
+Could try fixing that at lookup time
+
+Fixed in warc2cdb, seems OK for timing:
+  /tmp/hst/15/lmh.cdb_in
+
+  real    224m32.815s
+  user    169m25.216s
+  sys     12m7.867s
+
+Lengths are right, difference with ks looks better, try the real
+thing:
+
+  >: cd /beegfs/common_crawl/CC-MAIN-2023-40/cdx/warc
+  >: ls | parallel -j 10 "uz '{}' |cut -f 2,4,18 -d ' '|fgrep .15/warc/CC-MAIN | cut -f 1,2 -d ' ' | sort > '/tmp/hst/15/orig_{#}.cdx'" &
+  >: sort -m *.cdx | sponge all.cdx
+
+Nope:
+  >: diff <(sed 's/->[0-9]*$/",/' test.cdb_in)  all.cdx |less
+  >: fgrep -n dryelf.com test.cdb_in all.cdx
+  test.cdb_in:6710:20230922034248 "http://www.dryelf.com/files/DryCookieInstall.exe",
+  test.cdb_in:1600193:20230922043436 "http://www.dryelf.com/files/DryLottoInstall.exe",
+  all.cdx:12190045:20230922044250 "http://www.dryelf.com/downloads.asp?Sort=0&Cat=3",
+  all.cdx:25401353:20230922053834 "http://www.dryelf.com/downloads.asp?Sort=2&Cat=0",
+
+Stupid -- mime-detected has a space in it...
 ================
 
 Try it with the existing _per segment_ index we have for 2019-35