comparison lurid3/notes.txt @ 61:e6bab0972142

Tried pre-filtering with a Bloom filter: not much benefit, if any. Built the other 9 dictionaries and pickled them.
author Henry S. Thompson <ht@inf.ed.ac.uk>
date Thu, 02 Jan 2025 18:55:11 +0000
parents 3be7b53d726e
children bc0bdb649c08
60:3be7b53d726e 61:e6bab0972142
914
915 Not bad. 1.5 minutes per file, plus 10 x 20 secs or so for the
916 unpickles =~ 453 minutes == 8 hours.
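
The estimate above can be checked with a few lines of arithmetic. The notes do not say how many files are involved; the value 300 below is an assumption, chosen only because it reproduces the 453-minute figure, and the variable names are illustrative.

```python
# Assumption: 300 input files. This number is NOT stated in the notes;
# it is inferred because it reproduces the quoted 453 minutes.
N_FILES = 300
MINUTES_PER_FILE = 1.5       # from the notes
N_UNPICKLES = 10             # from the notes
SECONDS_PER_UNPICKLE = 20    # from the notes ("20 secs or so")

total_minutes = N_FILES * MINUTES_PER_FILE + N_UNPICKLES * SECONDS_PER_UNPICKLE / 60
total_hours = total_minutes / 60
print(round(total_minutes), round(total_hours, 1))
```

Under that assumption the total is about 453 minutes, a little over seven and a half hours, which the notes round up to 8.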
917
918 Try pre-filter with the Bloom filter.
919
920 Rebuild with byte values:
921 >>> from pybloomfilter import BloomFilter
922 >>> g=BloomFilter(523863383,0.05,'/tmp/hst/uris_20.bloom')
923 >>> def bff(f,fn):
924 ...   with open(fn,'rb') as uf:
925 ...     for l in uf:
926 ...       f.add(l.split(b'\t')[2])
927 ...
928 >>> timeit.timeit("bff(g,'/work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv')",number=1,globals=globals())
929 80.17189159989357
930
931 >: time ~/lib/python/cc/lmh/test_lookup2.py
932 52369734
933 1076046 entries, 130318 given lastmod, 78 false positives
934
935 real 1m49.567s
936 user 1m38.818s
937 sys 0m8.668s
938
939 Not worth it :-(.
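
For reference, the pre-filter idea tried above can be sketched as follows. This is not the contents of test_lookup2.py, which the notes do not show. TinyBloom is a hypothetical pure-Python stand-in so the sketch runs without pybloomfilter, and the URIs are made-up toy data. The point is only the shape of the lookup: ask the filter first, and consult the dictionary only on a possible hit.

```python
import hashlib
import math

class TinyBloom:
    """Pure-Python Bloom filter (hypothetical stand-in, not pybloomfilter)."""
    def __init__(self, capacity, error_rate):
        # Standard sizing formulas for the bit count m and hash count k.
        self.m = max(8, math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.k = max(1, round(self.m / capacity * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, key):
        # Double hashing: derive k bit positions from one SHA-256 digest.
        d = hashlib.sha256(key).digest()
        h1 = int.from_bytes(d[:8], 'big')
        h2 = int.from_bytes(d[8:16], 'big') | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key):
        for p in self._positions(key):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, key):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(key))

def lookup(uri, bloom, table):
    """Return table[uri] or None, consulting the Bloom filter first."""
    if uri not in bloom:
        return None          # definite miss: the dictionary is never touched
    return table.get(uri)    # possible hit: may still be a false positive

# Toy data, made up for illustration.
table = {b'http://example.org/%d' % i: i for i in range(1000)}
bloom = TinyBloom(capacity=1000, error_rate=0.05)
for uri in table:
    bloom.add(uri)

probes = [b'http://example.org/%d' % i for i in range(500, 1500)]
hits = sum(lookup(u, bloom, table) is not None for u in probes)
false_positives = sum(u in bloom and u not in table for u in probes)
print(hits, false_positives)
```

A Bloom filter never reports a stored key as absent, so the hit count is unaffected by the filter. It only changes how many misses reach the dictionary.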
940
941 Build the rest:
942 >: time ~/lib/python/cc/lmh/test_hash.py -r 1 -p results/CC-MAIN-2019-35/warc_lmhx/ks_10-19.pickle -f /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_10-19.tsv
943 52489306
944 [54.296645537018776]
945
946 real 1m32.529s
947 user 1m16.956s
948 sys 0m14.267s
949 >: seq 2 9 | parallel -j 8 'time ~/lib/python/cc/lmh/test_hash.py -r 1 -p results/CC-MAIN-2019-35/warc_lmhx/ks_{}0-{}9.pickle -f /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_{}0-{}9.tsv'
950
951 Slightly slower when running in parallel, but done
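
A minimal sketch of what a build-and-pickle step like the one above might look like. test_hash.py itself is not shown in these notes, so everything here is an assumption: the function names are hypothetical, the key is taken from the third tab-separated field as in bff() above, and the value layout is a guess. The flags in the transcript suggest a repeat count, a pickle path and an input file, and the bracketed number in the output looks like a list of timings, so the sketch returns the entry count and a timeit.repeat list.

```python
import pickle
import timeit

def build_dict(tsv_path):
    """Map the third tab-separated field of each line to the remaining fields."""
    table = {}
    with open(tsv_path, 'rb') as f:
        for line in f:
            fields = line.rstrip(b'\n').split(b'\t')
            # Assumption: field 2 is the key, as in bff(); the value layout is a guess.
            table[fields[2]] = fields[:2] + fields[3:]
    return table

def build_and_pickle(tsv_path, pickle_path, repeat=1):
    """Build the dictionary, pickle it, and return (entry count, list of timings)."""
    result = {}

    def once():
        result['table'] = build_dict(tsv_path)
        with open(pickle_path, 'wb') as out:
            pickle.dump(result['table'], out, protocol=pickle.HIGHEST_PROTOCOL)

    # One timing per repetition, each covering both the build and the dump.
    times = timeit.repeat(once, repeat=repeat, number=1)
    return len(result['table']), times
```

Called as build_and_pickle('ks_10-19.tsv', 'ks_10-19.pickle'), this would return a pair comparable to the two output lines in the transcript, though the real script may measure something different.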
952 ================
953
954
955 Try it with the existing _per segment_ index we have for 2019-35
956