comparison lurid3/notes.txt @ 61:e6bab0972142
tried pre-filtering with bloom, not much benefit if any
Built the other 9 dictionaries and pickled them
author | Henry S. Thompson <ht@inf.ed.ac.uk> |
---|---|
date | Thu, 02 Jan 2025 18:55:11 +0000 |
parents | 3be7b53d726e |
children | bc0bdb649c08 |
60:3be7b53d726e | 61:e6bab0972142 |
---|---|
914 | 914 |
915 Not bad. 1.5 minutes per file, plus 10 x 20 secs or so for the | 915 Not bad. 1.5 minutes per file, plus 10 x 20 secs or so for the |
916 unpickles =~ 453 minutes == 8 hours. | 916 unpickles =~ 453 minutes == 8 hours. |
917 | 917 |
918 Try pre-filter with the Bloom filter. | 918 Try pre-filter with the Bloom filter. |
| | 919 |
| | 920 Rebuild with byte values: |
| | 921 >>> from pybloomfilter import BloomFilter |
| | 922 >>> g=BloomFilter(523863383,0.05,'/tmp/hst/uris_20.bloom') |
| | 923 >>> def bff(f,fn): |
| | 924 ...     with open(fn,'rb') as uf: |
| | 925 ...         for l in uf: |
| | 926 ...             f.add(l.split(b'\t')[2]) |
| | 927 ... |
| | 928 >>> timeit.timeit("bff(g,'/work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv')",number=1,globals=globals()) |
| | 929 80.17189159989357 |
| | 930 |
| | 931 >: time ~/lib/python/cc/lmh/test_lookup2.py |
| | 932 52369734 |
| | 933 1076046 entries, 130318 given lastmod, 78 false positives |
| | 934 |
| | 935 real 1m49.567s |
| | 936 user 1m38.818s |
| | 937 sys 0m8.668s |
| | 938 |
| | 939 Not worth it :-(. |
| | 940 |
| | 941 Build the rest: |
| | 942 >: time ~/lib/python/cc/lmh/test_hash.py -r 1 -p results/CC-MAIN-2019-35/warc_lmhx/ks_10-19.pickle -f /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_10-19.tsv |
| | 943 52489306 |
| | 944 [54.296645537018776] |
| | 945 |
| | 946 real 1m32.529s |
| | 947 user 1m16.956s |
| | 948 sys 0m14.267s |
| | 949 >: seq 2 9 | parallel -j 8 'time ~/lib/python/cc/lmh/test_hash.py -r 1 -p results/CC-MAIN-2019-35/warc_lmhx/ks_{}0-{}9.pickle -f /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_{}0-{}9.tsv' |
| | 950 |
| | 951 Slightly slower when running in parallel, but done |
919 ================ | 952 ================ |
920 | 953 |
921 | 954 |
922 Try it with the existing _per segment_ index we have for 2019-35 | 955 Try it with the existing _per segment_ index we have for 2019-35 |
923 | 956 |
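
A minimal sketch of the Bloom pre-filter idea tried in the inserted lines above (r61 lines 918-939), for readers who want to see the shape of the experiment. The contents of test_lookup2.py are not shown in the notes, so everything here is an assumption: `TinyBloom` is a small pure-Python stand-in for `pybloomfilter.BloomFilter` (same `capacity, error_rate` argument order), and `lookup_with_prefilter` is a hypothetical helper that consults the filter before the dictionary and counts false positives. It also assumes no stored value is `None`.

```python
import hashlib
import math


class TinyBloom:
    """Pure-Python Bloom filter; a hypothetical stand-in for pybloomfilter."""

    def __init__(self, capacity, error_rate):
        # Standard sizing: m bits and k hash functions for the target error rate.
        self.m = max(8, int(-capacity * math.log(error_rate) / (math.log(2) ** 2)))
        self.k = max(1, round(self.m / capacity * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, key):
        # Double hashing: derive k bit positions from two halves of one digest.
        d = hashlib.blake2b(key, digest_size=16).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key):
        for p in self._positions(key):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, key):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(key))


def lookup_with_prefilter(queries, bloom, table):
    """Look up byte-string keys in `table`, skipping keys the filter rules out.

    Returns (hits, false_positives): a false positive is a key the filter
    accepted but the dictionary does not contain.
    """
    hits = {}
    false_positives = 0
    for q in queries:
        if q not in bloom:
            continue  # definitely absent, so the dictionary lookup is skipped
        value = table.get(q)
        if value is None:
            false_positives += 1
        else:
            hits[q] = value
    return hits, false_positives
```

The pre-filter only pays off when the membership test is much cheaper than the lookup it avoids. With an in-memory dict the lookup is already a single hash probe, which is consistent with the "Not worth it" timing recorded above.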