changeset 61:e6bab0972142

tried pre-filtering with bloom, not much benefit if any.
Built the other 9 dictionaries and pickled them.
author Henry S. Thompson <ht@inf.ed.ac.uk>
date Thu, 02 Jan 2025 18:55:11 +0000
parents 3be7b53d726e
children bc0bdb649c08
files lurid3/notes.txt
diffstat 1 files changed, 33 insertions(+), 0 deletions(-)
--- a/lurid3/notes.txt	Thu Jan 02 15:01:48 2025 +0000
+++ b/lurid3/notes.txt	Thu Jan 02 18:55:11 2025 +0000
@@ -916,6 +916,39 @@
 unpickles =~ 453 minutes == 8 hours.
 
 Try pre-filter with the Bloom filter.
+
+Rebuild with byte values:
+  >>> from pybloomfilter import BloomFilter
+  >>> g=BloomFilter(523863383,0.05,'/tmp/hst/uris_20.bloom')
+  >>> def bff(f,fn):
+  ...  with open(fn,'rb') as uf:
+  ...   for l in uf:
+  ...    f.add(l.split(b'\t')[2])
+  ...
+  >>> timeit.timeit("bff(g,'/work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv')",number=1,globals=globals())
+  80.17189159989357
+
+  >: time ~/lib/python/cc/lmh/test_lookup2.py
+  52369734
+  1076046 entries, 130318 given lastmod, 78 false positives
+
+  real  1m49.567s
+  user  1m38.818s
+  sys   0m8.668s
+
+Not worth it :-(.
+
+Build the rest:
+  >: time ~/lib/python/cc/lmh/test_hash.py -r 1 -p results/CC-MAIN-2019-35/warc_lmhx/ks_10-19.pickle -f /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_10-19.tsv
+52489306
+[54.296645537018776]
+
+real	1m32.529s
+user	1m16.956s
+sys	0m14.267s
+ >: seq 2 9 | parallel -j 8 'time ~/lib/python/cc/lmh/test_hash.py -r 1 -p results/CC-MAIN-2019-35/warc_lmhx/ks_{}0-{}9.pickle -f /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_{}0-{}9.tsv'
+
+Slightly slower when running in parallel, but done
 ================
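
Sketch of the pre-filter idea recorded in the diff above: test each URI
against a Bloom filter first, and probe the pickled dictionary only when the
filter answers "maybe". This is not the author's test_lookup2.py, which is not
shown in this changeset. It is a minimal, self-contained illustration that
assumes bytes keys like those added via l.split(b'\t')[2]. TinyBloom and
lookup are hypothetical names, and TinyBloom is a pure-Python stand-in for
pybloomfilter.BloomFilter so that the example runs without extra packages.

```python
import hashlib
import math


class TinyBloom:
    """Minimal Bloom filter over bytes keys (illustration only)."""

    def __init__(self, capacity, error_rate):
        # Textbook sizing: m = -n ln p / (ln 2)^2 bits, k = (m/n) ln 2 hashes.
        self.m = max(8, math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.k = max(1, round(self.m / capacity * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, key):
        # Double hashing: derive k bit positions from one SHA-256 digest.
        digest = hashlib.sha256(key).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key):
        for p in self._positions(key):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, key):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(key))


def lookup(keys, table, bloom=None):
    """Count dictionary hits; with a filter, also count its false positives."""
    hits = false_pos = 0
    for key in keys:
        if bloom is not None and key not in bloom:
            continue  # definitely absent: skip the dictionary probe
        if key in table:
            hits += 1
        elif bloom is not None:
            false_pos += 1  # filter said "maybe", dictionary said "no"
    return hits, false_pos


if __name__ == "__main__":
    # Made-up keys and values, purely for demonstration.
    table = {b"http://example.org/%d" % i: i for i in range(1000)}
    bloom = TinyBloom(capacity=1000, error_rate=0.05)
    for uri in table:
        bloom.add(uri)
    queries = [b"http://example.org/%d" % i for i in range(0, 3000, 3)]
    print(lookup(queries, table))         # plain dictionary lookup
    print(lookup(queries, table, bloom))  # Bloom pre-filter, then dictionary
```

A Bloom filter never gives a false negative, so both calls report the same
number of hits. A Python dict lookup is already a single hash probe, while the
filter costs k hash computations per key before the probe it is meant to save.
That is one plausible reason the notes found little or no benefit.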