cc/work: lurid3/notes.txt comparison

using python dict test

author	Henry S. Thompson <ht@inf.ed.ac.uk>
date	Thu, 02 Jan 2025 15:01:48 +0000
parents	d9ba3ce783ff
children	e6bab0972142

-:d9ba3ce783ff
+:3be7b53d726e
 52369734
 52369734
 52369734
 [69.63967163302004, 69.09140252694488, 66.49750975705683]
 That's tolerable.
->: ~/lib/python/cc/lmh/test.py -r 3 -f /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv
+>: ~/lib/python/cc/lmh/test_hash.py -r 3 -f /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv
 52369734
 52369734
 52369734
 [64.51177835091949, 71.6610240675509, 67.74966451153159]
 [0.0034751780331134796, 0.0034532323479652405, 0.0033454522490501404]
 The last line is the timing for 100000 lookups.
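A minimal sketch of the kind of measurement above, assuming the test loads a tab-separated file into a plain dict and then times a batch of lookups. The names `load_table` and `time_lookups` are hypothetical, and the first-field-is-key layout is an assumption; the real `test_hash.py` is not shown in these notes. Timings are returned in seconds, which is presumably also the unit of the lists above.

```python
import random
import timeit


def load_table(lines):
    """Build a dict from tab-separated lines (assumed layout: key, TAB, rest)."""
    table = {}
    for line in lines:
        key, _, value = line.rstrip("\n").partition("\t")
        table[key] = value
    return table


def time_lookups(table, keys, repeat=3):
    """Return `repeat` wall-clock timings, each for one pass over `keys`."""
    def probe():
        for k in keys:
            table.get(k)
    return timeit.repeat(probe, number=1, repeat=repeat)


if __name__ == "__main__":
    # Small synthetic stand-in for the real .tsv file
    rows = [f"key{i}\t{i}\n" for i in range(200_000)]
    table = load_table(rows)
    probes = random.choices(list(table), k=100_000)
    print(len(table))
    print(time_lookups(table, probes))
```

With a real file, `load_table(open(path))` would take the place of the synthetic rows.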
+So, try a test:
+>: time ~/lib/python/cc/lmh/test_hash.py -r 1 -p results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.pickle -f /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv
+52369734
+[70.98342595621943]
+[0.0037928372621536255]
+real  1m51.456s
+user  1m32.901s
+sys   0m17.937s
+>: ls -lh results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.pickle
+-rw-r--r-- 1 hst dc007 5.5G Jan  2 12:19 results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.pickle
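For reference, a hedged sketch of what the `-p` option might do, assuming it simply dumps the loaded dict with the standard `pickle` module. The helper names are hypothetical and the actual behaviour of `test_hash.py` is not shown here.

```python
import pickle


def save_table(table, path):
    """Write the dict to `path` using the most compact pickle protocol available."""
    with open(path, "wb") as f:
        pickle.dump(table, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_table_pickle(path):
    """Read a dict back from a pickle file written by save_table."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

Unpickling avoids re-parsing the .tsv, but the whole dict still has to be rebuilt in memory, so a multi-gigabyte pickle will not load instantly.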
+cdx_out.write(b' ')
+cdx_out.write(b' ')
+>: time ~/lib/python/cc/lmh/test_lookup1.py
+52369734
+1076046 130318
+real  1m52.668s
+user  1m40.751s
+sys   0m9.610s
+Not bad.  1.5 minutes per file, plus 10 x 20 secs or so for the
+unpickles =~ 453 minutes =~ 7.5 hours, so call it 8 hours.
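The estimate can be checked with a few lines of arithmetic. The file count is not stated in these notes; 300 is an assumption, chosen because it is the value that reproduces the 453-minute figure.

```python
def estimate_minutes(n_files, minutes_per_file, n_unpickles, secs_per_unpickle):
    """Total wall-clock minutes: per-file lookup time plus one-off unpickle cost."""
    return n_files * minutes_per_file + n_unpickles * secs_per_unpickle / 60


# n_files=300 is an assumption inferred from the 453-minute total
total = estimate_minutes(300, 1.5, 10, 20)
print(round(total), round(total / 60, 2))  # 453 7.56
```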
+Try pre-filter with the Bloom filter.
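A rough sketch of the pre-filter idea, assuming string keys: test each key against a Bloom filter first and only go to the big dict when the filter says "maybe present". The `BloomFilter` class below is a from-scratch illustration, not the existing filter these notes refer to, and `lookup` is a hypothetical helper.

```python
import hashlib
import math


class BloomFilter:
    """Illustrative Bloom filter: no false negatives, tunable false-positive rate."""

    def __init__(self, capacity, error_rate=0.01):
        # Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m/n) ln 2 hashes
        self.n_bits = max(8, math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.n_hashes = max(1, round(self.n_bits / capacity * math.log(2)))
        self.bits = bytearray((self.n_bits + 7) // 8)

    def _positions(self, key):
        # Double hashing: derive all k bit positions from one 128-bit digest
        digest = hashlib.blake2b(key.encode("utf-8"), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big") | 1
        return ((h1 + i * h2) % self.n_bits for i in range(self.n_hashes))

    def add(self, key):
        for p in self._positions(key):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, key):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(key))


def lookup(key, bloom, table):
    """Skip the dict entirely when the filter rules the key out."""
    if key not in bloom:
        return None
    return table.get(key)
```

Whether this pays off depends on how many probe keys are absent: the filter costs a hash per key, so it only helps if it lets most lookups avoid loading or touching the large table.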
 ================
 Try it with the existing _per segment_ index we have for 2019-35
