changeset 60:3be7b53d726e
using python dict test
author | Henry S. Thompson <ht@inf.ed.ac.uk> |
---|---|
date | Thu, 02 Jan 2025 15:01:48 +0000 |
parents | d9ba3ce783ff |
children | e6bab0972142 |
files | lurid3/notes.txt |
diffstat | 1 files changed, 27 insertions(+), 1 deletions(-) |
--- a/lurid3/notes.txt	Wed Jan 01 23:03:07 2025 +0000
+++ b/lurid3/notes.txt	Thu Jan 02 15:01:48 2025 +0000
@@ -883,13 +883,39 @@
  52369734
  [69.63967163302004, 69.09140252694488, 66.49750975705683]
 
 That's tolerable.
 
-  >: ~/lib/python/cc/lmh/test.py -r 3 -f /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv
+  >: ~/lib/python/cc/lmh/test_hash.py -r 3 -f /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv
  52369734
  52369734
  52369734
  [64.51177835091949, 71.6610240675509, 67.74966451153159]
  [0.0034751780331134796, 0.0034532323479652405, 0.0033454522490501404]
 Last line is 100000 lookups.
+
+So, try a test:
+  >: time ~/lib/python/cc/lmh/test_hash.py -r 1 -p results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.pickle -f /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv
+  52369734
+  [70.98342595621943]
+  [0.0037928372621536255]
+
+  real 1m51.456s
+  user 1m32.901s
+  sys 0m17.937s
+  >: ls -lh results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.pickle
+  -rw-r--r-- 1 hst dc007 5.5G Jan 2 12:19 results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.pickle
+  cdx_out.write(b' ')
+  cdx_out.write(b' ')
+  >: time ~/lib/python/cc/lmh/test_lookup1.py
+  52369734
+  1076046 130318
+
+  real 1m52.668s
+  user 1m40.751s
+  sys 0m9.610s
+
+Not bad. 1.5 minutes per file, plus 10 x 20 secs or so for the
+unpickles =~ 453 minutes == 8 hours.
+
+Try pre-filter with the Bloom filter.
 ================
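
For readers without access to test_hash.py, the measurements recorded in the notes (time to build the dict, time for a batch of lookups, cost of a pickle round trip) can be approximated with a sketch like the one below. This is not the author's script. The function names are hypothetical, and the two-column tab-separated layout (key, then value) is an assumption about ks_0-9.tsv that the notes do not confirm.

```python
import pickle
import tempfile
import time
from pathlib import Path


def load_tsv(path):
    """Build a dict from a tab-separated file: first field is the key,
    the rest of the line is the value (assumed layout)."""
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            key, _, value = line.rstrip("\n").partition("\t")
            table[key] = value
    return table


def time_lookups(table, keys):
    """Return (elapsed seconds, number of hits) for membership tests on keys."""
    start = time.perf_counter()
    hits = sum(1 for k in keys if k in table)
    return time.perf_counter() - start, hits


def pickle_round_trip(table, path):
    """Pickle the dict to path, load it back, and return
    (reloaded dict, seconds spent unpickling)."""
    with open(path, "wb") as f:
        pickle.dump(table, f, protocol=pickle.HIGHEST_PROTOCOL)
    start = time.perf_counter()
    with open(path, "rb") as f:
        reloaded = pickle.load(f)
    return reloaded, time.perf_counter() - start


if __name__ == "__main__":
    # Small synthetic input; the real file in the notes has ~52 million rows.
    with tempfile.TemporaryDirectory() as tmp:
        tsv = Path(tmp) / "sample.tsv"
        tsv.write_text(
            "".join(f"key{i}\tvalue{i}\n" for i in range(100_000)),
            encoding="utf-8",
        )
        start = time.perf_counter()
        table = load_tsv(tsv)
        print(len(table), "entries built in", time.perf_counter() - start, "s")
        print("lookups:", time_lookups(table, [f"key{i}" for i in range(0, 200_000, 2)]))
        _, seconds = pickle_round_trip(table, Path(tmp) / "sample.pickle")
        print("unpickle:", seconds, "s")
```

Timing the unpickle separately from the build is what lets the two costs be compared, as the notes do when weighing roughly 70 seconds to build from the .tsv against the estimate of about 20 seconds per unpickle.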
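
The last line of the notes proposes pre-filtering with a Bloom filter. The notes refer to an existing filter that is not shown, so the following is only a minimal, self-contained sketch of the general idea: test each key against a compact bit array first, and consult the large dict only when the filter says the key might be present. The class, the `lookup` helper, and the 1% error rate are all illustrative assumptions, not the author's code.

```python
import hashlib
import math


class BloomFilter:
    """Minimal Bloom filter: false positives possible, false negatives not."""

    def __init__(self, capacity, error_rate=0.01):
        # Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        self.num_bits = max(8, math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, key):
        # Double hashing: derive k bit positions from two 64-bit halves of one digest.
        digest = hashlib.blake2b(key.encode("utf-8"), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big") | 1  # force odd so the stride is never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, key):
        return all(self.bits[pos >> 3] & (1 << (pos & 7)) for pos in self._positions(key))


def lookup(key, bloom, table):
    """Return table[key] or None, skipping the dict when the filter rules the key out."""
    if key not in bloom:
        return None  # definitely absent
    return table.get(key)  # present, or a false positive that the dict resolves


if __name__ == "__main__":
    table = {f"key{i}": i for i in range(10_000)}
    bloom = BloomFilter(capacity=len(table))
    for k in table:
        bloom.add(k)
    print(lookup("key42", bloom, table), lookup("missing", bloom, table))
```

With these sizing formulas a 1% error rate costs about 9.6 bits per key, so if 52369734 is the number of keys, the filter would be on the order of 60 MB, far smaller than the 5.5G pickle. Whether that helps depends on how many queried keys are absent, which the notes do not state.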