changeset 61:e6bab0972142
tried pre-filtering with bloom, not much benefit if any
Built the other 9 dictionaries and pickled them
author | Henry S. Thompson <ht@inf.ed.ac.uk> |
---|---|
date | Thu, 02 Jan 2025 18:55:11 +0000 |
parents | 3be7b53d726e |
children | bc0bdb649c08 |
files | lurid3/notes.txt |
diffstat | 1 files changed, 33 insertions(+), 0 deletions(-) |
--- a/lurid3/notes.txt Thu Jan 02 15:01:48 2025 +0000
+++ b/lurid3/notes.txt Thu Jan 02 18:55:11 2025 +0000
@@ -916,6 +916,39 @@
 unpickles =~ 453 minutes == 8 hours.
 Try pre-filter with the Bloom filter.
+
+Rebuild with byte values:
+ >>> from pybloomfilter import BloomFilter
+ >>> g=BloomFilter(523863383,0.05,'/tmp/hst/uris_20.bloom')
+ >>> def bff(f,fn):
+ ...   with open(fn,'rb') as uf:
+ ...     for l in uf:
+ ...       f.add(l.split(b'\t')[2])
+ ...
+ >>> timeit.timeit("bff(g,'/work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv')",number=1,globals=globals())
+ 80.17189159989357
+
+ >: time ~/lib/python/cc/lmh/test_lookup2.py
+ 52369734
+ 1076046 entries, 130318 given lastmod, 78 false positives
+
+ real 1m49.567s
+ user 1m38.818s
+ sys 0m8.668s
+
+Not worth it :-(.
+
+Build the rest:
+ >: time ~/lib/python/cc/lmh/test_hash.py -r 1 -p results/CC-MAIN-2019-35/warc_lmhx/ks_10-19.pickle -f /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_10-19.tsv
+52489306
+[54.296645537018776]
+
+real 1m32.529s
+user 1m16.956s
+sys 0m14.267s
+ >: seq 2 9 | parallel -j 8 'time ~/lib/python/cc/lmh/test_hash.py -r 1 -p results/CC-MAIN-2019-35/warc_lmhx/ks_{}0-{}9.pickle -f /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_{}0-{}9.tsv'
+
+Slightly slower when running in parallel, but done
 ================
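The pre-filter idea tried in this changeset can be sketched roughly as follows. This is an illustration only, not the code in test_lookup2.py, which is not shown here. TinyBloom is a hypothetical stdlib-only stand-in for pybloomfilter.BloomFilter, sized with the textbook formulas, which pybloomfilter may not follow exactly. The sample rows, the example.org URIs and the lookup() helper are all assumptions; only the "third tab-separated field of a bytes line" convention comes from bff() in the notes.

```python
import hashlib
import math


class TinyBloom:
    """Hypothetical stdlib-only Bloom filter (stand-in for pybloomfilter.BloomFilter)."""

    def __init__(self, capacity, error_rate):
        # Textbook sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        self.nbits = max(8, math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.nhashes = max(1, round(self.nbits / capacity * math.log(2)))
        self.bits = bytearray((self.nbits + 7) // 8)

    def _positions(self, key):
        # Double hashing: derive k bit positions from two halves of one digest.
        digest = hashlib.blake2b(key, digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big") | 1
        return [(h1 + i * h2) % self.nbits for i in range(self.nhashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, key):
        return all(self.bits[pos >> 3] & (1 << (pos & 7)) for pos in self._positions(key))


def fill_from_tsv(bloom, lines):
    """Add the third tab-separated field of each bytes line, as bff() does."""
    for line in lines:
        bloom.add(line.split(b"\t")[2])


def lookup(key, bloom, table):
    """Consult the (expensive) table only if the filter says the key may be present."""
    if key not in bloom:
        return None  # definite miss: a Bloom filter has no false negatives
    return table.get(key)  # still None when the filter gave a false positive


# Made-up sample data; the real TSV column layout is an assumption.
rows = [
    b"x\ty\thttp://example.org/1\t2019\n",
    b"x\ty\thttp://example.org/2\t2018\n",
]
bloom = TinyBloom(capacity=1000, error_rate=0.05)
fill_from_tsv(bloom, rows)
table = {b"http://example.org/1": b"2019", b"http://example.org/2": b"2018"}

print(lookup(b"http://example.org/1", bloom, table))
print(lookup(b"http://example.org/missing", bloom, table))
```

A filter like this only saves time when most lookups are misses and the table lookup it guards is costly, which may be why the timings recorded above showed little benefit.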