cc/work: lurid3/notes.txt comparison

comparison lurid3/notes.txt @ 73:1283a574260d

working on cross-check of warc2cdb

author	Henry Thompson <ht@markup.co.uk>
date	Sun, 09 Mar 2025 19:59:50 +0000
parents	7901ce4a39e3
children	ff6ef190f901


-:7901ce4a39e3
+:1283a574260d
 >: uz idx/cdx-00101.gz | wc
 14681147 314748509 7090034893
 >: uz idx/cdx-00101.gz | cmp - /tmp/hst/cdb-00101
 >: echo $?
 0
+OK, now to fill up 2023-40 so we can do a real trial.
+Cythonised lmh.py and warc.py, still running about 14 seconds per file
+Don't actually need these yet, as 2023-40 has a full set of lmh
+outputs, but not many ks.tsv:
+>: ls */ks.tsv
+0/ks.tsv   15/ks.tsv  4/ks.tsv   68/ks.tsv
+12/ks.tsv  36/ks.tsv  56/ks.tsv  best_two_by_nl1/ks.tsv
+Note that to populate cdbs we don't need most of the logic in
+sort_date.py, because none of the sorting and merging is needed!
+Hmm.  In fact, just go straight to cdb_in.txt
+>: time python3 -c 'import sys,warc2cdb; sys.exit(warc2cdb.main(*sys.argv[1:]))'  2023-40 15 0 /tmp/hst 2>/tmp/hst/15/errs
+/tmp/hst/15/lmh.cdb_in
+real    223m28.517s
+user    168m34.751s
+sys     12m6.020s
+This compares to 273 minutes for 806 files in the cleanup of segment
+95, see end of old_notes.  Best full segment was 96, i.e. 279 minutes
+for 900 files.  And those figures are for lmh.py only.
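The comparison above is easier to read as seconds per file. A rough sketch of the arithmetic follows; the minute figures are copied from the notes, but the file count for segment 15 is an assumption (the notes never state it, although the 900 extra lines in the new errs file further down suggest 900).

```python
# Per-file timing comparison. All timings are copied from the notes;
# N_FILES_SEG15 is an assumption, not a figure stated in the notes.

def seconds_per_file(minutes: float, seconds: float, n_files: int) -> float:
    """Average wall-clock seconds spent per input file."""
    return (minutes * 60 + seconds) / n_files

N_FILES_SEG15 = 900  # assumed number of WARC files in segment 15

new_run = seconds_per_file(223, 28.517, N_FILES_SEG15)  # warc2cdb, segment 15
old_seg95 = seconds_per_file(273, 0, 806)               # lmh.py only, segment 95
old_seg96 = seconds_per_file(279, 0, 900)               # lmh.py only, segment 96

for label, value in [
    ("new, segment 15", new_run),
    ("old, segment 95", old_seg95),
    ("old, segment 96", old_seg96),
]:
    print(f"{label}: {value:.1f} s/file")
```

If the 900-file assumption holds, the new run comes out a few seconds per file faster than either old figure, even though the old figures cover lmh.py alone.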
+Are results the same?
+Old:
+>: wc -l *.tsv
+8247 ks_errs.tsv
+5324616 ks.tsv
+New:
+>: wc -l *
+9147 errs
+5324617 lmh.cdb_in
+I _think_ that's good -- there's one line per input file added to the
+error output now, and one blank line at the end of the cdb_in file.
+Old:
+>: cut -f 2- ks_errs.tsv |sus
+5420 cannot unpack non-iterable NoneType object
+2670 list index out of range
+ 113 year 641471 is out of range
+  20 year 641474 is out of range
+   9 hour must be in 0..23
+   9 year 642435 is out of range
+   3 year 642437 is out of range
+   1 year 4262575 is out of range
+   1 year 641769 is out of range
+   1 year 642434 is out of range
+New:
+>: cut -f 2- <(fgrep -v beegfs /tmp/hst/15/errs) | cut -f 1-5 -d ' ' |sus
+8090 Invalid date value or format
+ 113 year 641471 is out of
+  20 year 641474 is out of
+   9 hour must be in 0..23
+   9 year 642435 is out of
+   3 year 642437 is out of
+   1 year 4262575 is out of
+   1 year 641769 is out of
+   1 year 642434 is out of
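The old and new error tallies can be reconciled with a little arithmetic. The sketch below uses only counts copied from the listings above. Two readings are assumptions the notes do not state outright: that the new "Invalid date value or format" message replaces the two old messages, and that the lines removed by `fgrep -v beegfs` are the new one-per-input-file lines.

```python
# Reconcile old and new error counts; all numbers are copied from the notes.
old_counts = [5420, 2670, 113, 20, 9, 9, 3, 1, 1, 1]  # ks_errs.tsv, by message
new_counts = [8090, 113, 20, 9, 9, 3, 1, 1, 1]        # errs, per-file lines filtered out

old_total = sum(old_counts)          # should equal wc -l ks_errs.tsv
new_total = sum(new_counts)          # errors left after fgrep -v beegfs
per_file_lines = 9147 - new_total    # lines in errs that are not errors

# The two old messages together match the single new message exactly.
merged_old = 5420 + 2670

print(old_total, new_total, per_file_lines, merged_old)
```

On that reading the same 8247 errors are reported in both runs, and the remaining 900 lines in errs are the per-file entries.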
+But, failed to fix the case where the URI in the original WARC is not
+%-encoded:
+>: diff -bw <(cut -f2- ks.tsv| sed 's/ //;s/\.0$//'|sort -k1,1) <(sed 's/^\+[0-9,]*://;s/->\([0-9]*\)$/ \1/' /tmp/hst/15/lmh.cdb_in_nopercent | sort -k1,1)|head
+4360c4361
+< 20230922034234https://www.hansaton.com/zh-hk/\u6D88\u8CBB\u8005.html  1695354153
+---
+> 20230922034234https://www.hansaton.com/zh-hk/消費者.html 1695354153
+6574c6575
+< 20230922034248https://blog.arabtherapy.com/\u0639\u0644\u0627\u062C-\u0627\u0644\u062E\u062C\u0644-\u0627\u0644\u0627\u062C\u062A\u0645\u0627\u0639\u064A/     1695354168
+---
+> 20230922034248https://blog.arabtherapy.com/علاج-الخجل-الاجتماعي/ 1695354168
+Could try fixing that at lookup time
+Fixed in warc2cdb, seems OK for timing:
+/tmp/hst/15/lmh.cdb_in
+real    224m32.815s
+user    169m25.216s
+sys     12m7.867s
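The notes don't show how warc2cdb was changed to handle URIs that are not %-encoded in the WARC. One possible approach, sketched here with the standard library only, is to percent-encode the UTF-8 bytes of any non-ASCII characters and leave existing escapes alone. The function name and the choice of safe characters are illustrative assumptions, not the actual fix.

```python
from urllib.parse import quote

# Characters left untouched: the RFC 3986 reserved set plus '%', so that
# escapes already present in the URI are not encoded a second time.
# (This choice of safe set is an assumption made for this sketch.)
SAFE = ":/?#[]@!$&'()*+,;=%"

def percent_encode_uri(uri: str) -> str:
    """Percent-encode non-ASCII (and other unsafe) characters as UTF-8."""
    return quote(uri, safe=SAFE)

raw = "https://www.hansaton.com/zh-hk/消費者.html"  # URI from the diff above
encoded = percent_encode_uri(raw)
print(encoded)
```

Because '%' is in the safe set, applying the function to an already-encoded URI leaves it unchanged, so it could be applied uniformly at build time or at lookup time. Whether the index expects upper-case hex digits, as `quote` produces, would need checking against the real CDX data.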
+Lengths are right, difference with ks looks better, try the real
+thing:
+>: cd /beegfs/common_crawl/CC-MAIN-2023-40/cdx/warc
+>: ls | parallel -j 10 "uz '{}' |cut -f 2,4,18 -d ' '|fgrep .15/warc/CC-MAIN | cut -f 1,2 -d ' ' | sort > '/tmp/hst/15/orig_{#}.cdx'" &
+>: sort -m *.cdx | sponge all.cdx
+Nope:
+>: diff <(sed 's/->[0-9]*$/",/' test.cdb_in)  all.cdx |less
+>: fgrep -n dryelf.com test.cdb_in all.cdx
+test.cdb_in:6710:20230922034248 "http://www.dryelf.com/files/DryCookieInstall.exe",
+test.cdb_in:1600193:20230922043436 "http://www.dryelf.com/files/DryLottoInstall.exe",
+all.cdx:12190045:20230922044250 "http://www.dryelf.com/downloads.asp?Sort=0&Cat=3",
+all.cdx:25401353:20230922053834 "http://www.dryelf.com/downloads.asp?Sort=2&Cat=0",
+Stupid -- mime-detected has a space in it...
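Picking fields with `cut -d ' '` depends on no JSON value containing a space. When mime-detected does contain one, the filename presumably no longer sits in field 18, and the record is lost at the fgrep step. A sketch of a sturdier extraction parses the JSON part of each line instead. It assumes the usual Common Crawl CDX layout (urlkey, timestamp, then a JSON object with `url` and `filename` keys); the sample line is made up for illustration and is not taken from the real index.

```python
import json

def cdx_fields(line: str) -> tuple[str, str, str]:
    """Return (timestamp, url, filename) from one CDX index line."""
    # Only the first two spaces are separators; the rest belongs to the JSON.
    _urlkey, timestamp, payload = line.rstrip("\n").split(" ", 2)
    record = json.loads(payload)
    return timestamp, record["url"], record["filename"]

# Made-up sample line with a space inside the mime-detected value.
sample = (
    'com,example)/a.exe 20230922034248 '
    '{"url": "http://www.example.com/a.exe", '
    '"mime": "application/octet-stream", '
    '"mime-detected": "application/x-dosexec; format=pe", '
    '"status": "200", '
    '"filename": "segments/0000000000000.15/warc/CC-MAIN-example.warc.gz"}'
)

timestamp, url, filename = cdx_fields(sample)
if ".15/warc/CC-MAIN" in filename:  # same segment test as the fgrep above
    print(timestamp, url)
```

This is slower than cut and fgrep, so it may only be worth using for the cross-check rather than in the main pipeline.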
 ================
 Try it with the existing _per segment_ index we have for 2019-35
 Assuming we have to key on segment / file and offset, as reconstructing the
