changeset 73:1283a574260d
working on cross-check of warc2cdb
author | Henry Thompson <ht@markup.co.uk> |
---|---|
date | Sun, 09 Mar 2025 19:59:50 +0000 |
parents | 7901ce4a39e3 |
children | ff6ef190f901 |
files | lurid3/notes.txt |
diffstat | 1 files changed, 99 insertions(+), 0 deletions(-) |
--- a/lurid3/notes.txt Thu Mar 06 01:45:24 2025 +0000
+++ b/lurid3/notes.txt Sun Mar 09 19:59:50 2025 +0000
@@ -1630,6 +1630,105 @@
   >: uz idx/cdx-00101.gz | cmp - /tmp/hst/cdb-00101
   >: echo $?
   0
+
+OK, now to fill up 2023-40 so we can do a real trial.
+
+Cythonised lmh.py and warc.py, still running about 14 seconds per file
+Don't actually need these yet, as 2023-40 has a full set of lmh
+outputs, but not many ks.tsv:
+  >: ls */ks.tsv
+  0/ks.tsv   15/ks.tsv  4/ks.tsv   68/ks.tsv
+  12/ks.tsv  36/ks.tsv  56/ks.tsv  best_two_by_nl1/ks.tsv
+
+Note, that to populate cdbs, we don't need most of the logic in
+sort_date.py, because all the sorting and merging isn't needed!
+
+Hmm. In fact, just go straight to cdb_in.txt
+
+  >: time python3 -c 'import sys,warc2cdb; sys.exit(warc2cdb.main(*sys.argv[1:]))' 2023-40 15 0 /tmp/hst 2>/tmp/hst/15/errs
+  /tmp/hst/15/lmh.cdb_in
+
+  real    223m28.517s
+  user    168m34.751s
+  sys     12m6.020s
+
+This compares to 273 minutes for 806 files in the cleanup of segment
+95, see end of old_notes. Best full segment was 96, i.e. 279 for
+900. And those figures are for lmh.py only
+
+Are results the same?
+Old:
+  >: wc -l *.tsv
+      8247 ks_errs.tsv
+   5324616 ks.tsv
+New:
+  >: wc -l *
+      9147 errs
+   5324617 lmh.cdb_in
+
+I _think_ that's good -- there's one line per input file added to the
+error output now, and one blank line at the end of the cdb_in file.
+Old:
+  >: cut -f 2- ks_errs.tsv |sus
+     5420 cannot unpack non-iterable NoneType object
+     2670 list index out of range
+      113 year 641471 is out of range
+       20 year 641474 is out of range
+        9 hour must be in 0..23
+        9 year 642435 is out of range
+        3 year 642437 is out of range
+        1 year 4262575 is out of range
+        1 year 641769 is out of range
+        1 year 642434 is out of range
+New:
+  >: cut -f 2- <(fgrep -v beegfs /tmp/hst/15/errs) | cut -f 1-5 -d ' ' |sus
+     8090 Invalid date value or format
+      113 year 641471 is out of
+       20 year 641474 is out of
+        9 hour must be in 0..23
+        9 year 642435 is out of
+        3 year 642437 is out of
+        1 year 4262575 is out of
+        1 year 641769 is out of
+        1 year 642434 is out of
+
+But, failed to fix the case where the URI in the original WARC is not
+%-encoded:
+  >: diff -bw <(cut -f2- ks.tsv| sed 's/ //;s/\.0$//'|sort -k1,1) <(sed 's/^\+[0-9,]*://;s/->\([0-9]*\)$/ \1/' /tmp/hst/15/lmh.cdb_in_nopercent | sort -k1,1)|head
+  4360c4361
+  < 20230922034234https://www.hansaton.com/zh-hk/\u6D88\u8CBB\u8005.html 1695354153
+  ---
+  > 20230922034234https://www.hansaton.com/zh-hk/消費者.html 1695354153
+  6574c6575
+  < 20230922034248https://blog.arabtherapy.com/\u0639\u0644\u0627\u062C-\u0627\u0644\u062E\u062C\u0644-\u0627\u0644\u0627\u062C\u062A\u0645\u0627\u0639\u064A/ 1695354168
+  ---
+  > 20230922034248https://blog.arabtherapy.com/علاج-الخجل-الاجتماعي/ 1695354168
+
+Could try fixing that at lookup time
+
+Fixed in warc2cdb, seems OK for timing:
+  /tmp/hst/15/lmh.cdb_in
+
+  real    224m32.815s
+  user    169m25.216s
+  sys     12m7.867s
+
+Lengths are right, difference with ks looks better, try the real
+thing:
+
+  >: cd /beegfs/common_crawl/CC-MAIN-2023-40/cdx/warc
+  >: ls | parallel -j 10 "uz '{}' |cut -f 2,4,18 -d ' '|fgrep .15/warc/CC-MAIN | cut -f 1,2 -d ' ' | sort > '/tmp/hst/15/orig_{#}.cdx'" &
+  >: sort -m *.cdx | sponge all.cdx
+
+Nope:
+  >: diff <(sed 's/->[0-9]*$/",/' test.cdb_in) all.cdx |less
+  >: fgrep -n dryelf.com test.cdb_in all.cdx
+  test.cdb_in:6710:20230922034248 "http://www.dryelf.com/files/DryCookieInstall.exe",
+  test.cdb_in:1600193:20230922043436 "http://www.dryelf.com/files/DryLottoInstall.exe",
+  all.cdx:12190045:20230922044250 "http://www.dryelf.com/downloads.asp?Sort=0&Cat=3",
+  all.cdx:25401353:20230922053834 "http://www.dryelf.com/downloads.asp?Sort=2&Cat=0",
+
+Stupid -- mime-detected has a space in it...
 ================
 Try it with the existing _per segment_ index we have for 2019-35
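Reviewer sketch, not part of the changeset: the mismatch in the last added note comes from counting space-separated fields with cut -d ' ', which shifts as soon as a value such as mime-detected contains a space. One possible way around it is to split off only the first two fields and parse the rest as JSON. This is a minimal sketch that assumes each index line has the layout '<urlkey> <timestamp> <JSON object>' with "url" and "filename" keys; the function names and the sample line are made up for illustration and do not come from warc2cdb.

```python
import json


def cdx_fields(line):
    """Split one index line into (timestamp, url, filename).

    Assumes the layout '<urlkey> <timestamp> <JSON object>'.
    """
    # Only the first two spaces are separators; the JSON part may contain more.
    _urlkey, timestamp, payload = line.rstrip("\n").split(" ", 2)
    record = json.loads(payload)
    return timestamp, record["url"], record["filename"]


def segment_pairs(lines, marker):
    """Yield 'timestamp url' strings for lines whose filename contains marker."""
    for line in lines:
        timestamp, url, filename = cdx_fields(line)
        if marker in filename:
            yield f"{timestamp} {url}"


# Made-up sample line: the mime-detected value contains a space.
SAMPLE = (
    'com,example)/a.exe 20230922034248 '
    '{"url": "http://example.com/a.exe", "mime": "application/octet-stream", '
    '"mime-detected": "application/x-example; variant=1", "status": "200", '
    '"filename": "crawl-data/seg.15/warc/CC-MAIN-example.warc.gz"}'
)

if __name__ == "__main__":
    for pair in segment_pairs([SAMPLE], ".15/warc/CC-MAIN"):
        print(pair)
```

The marker argument plays the role of the fgrep .15/warc/CC-MAIN filter in the pipeline, except that it is applied to the filename value only, however many spaces appear in the other fields.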