lurid3/notes.txt @ 59:d9ba3ce783ff
python dict testing
author   Henry S. Thompson <ht@inf.ed.ac.uk>
date     Wed, 01 Jan 2025 23:03:07 +0000
parents  3012ca7fc6b7
children 3be7b53d726e
Be sure to f.close()
Use BloomFilter.open for an existing bloom file
Copying a file from /tmp to work/... still gives good (quick) lookup,
but _creating and filling_ a file on work/... takes ... I stopped
waiting after an hour or so.

How much bigger is .05 false positive?
Less than expected:
>: ls -l /tmp/hst
-rwxr-xr-x 1 hst dc007 408301988 Jan  1 16:52 uris_20.bloom
-rwxr-xr-x 1 hst dc007 313830100 Jan  1 15:04 uris.bloom
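The ~1.30x size ratio is what the standard sizing formula predicts, assuming the older uris.bloom was built at a 0.1 error rate (an assumption; its error rate isn't recorded here): optimal size is m = -n*ln(p)/(ln 2)^2, so halving the false-positive rate from 0.10 to 0.05 costs only ln(0.05)/ln(0.10) ≈ 1.30x the bits, not 2x. A quick check:

```python
import math

def bloom_bits(n, p):
    """Optimal Bloom filter size in bits: m = -n*ln(p)/(ln 2)^2."""
    return -n * math.log(p) / math.log(2) ** 2

n = 523_863_383  # capacity used in the transcript below
ratio = bloom_bits(n, 0.05) / bloom_bits(n, 0.10)  # n cancels; depends only on p
print(round(ratio, 3))                   # 1.301
print(round(408301988 / 313830100, 3))   # observed file-size ratio: 1.301
```

The observed file sizes match the predicted ratio almost exactly, which is consistent with the 0.1 baseline guess.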
And still same (?) fill time:
>>> g=BloomFilter(523863383,0.05,'/tmp/hst/uris_20.bloom')
>>> T=timeit.Timer("bff(g,'/work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv')",globals=globals())
>>> T.repeat(3,number=1)
[89.64385064691305, 90.9979057777673, 83.9632708914578]
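`bff` isn't shown in this excerpt; a plausible reading is that it adds the first tab-separated field of each line of the TSV to the filter (the real one presumably opens the path it is given). A minimal sketch under that assumption, using a set as a stand-in filter so it runs without pybloomfilter — the real `g` is a file-backed BloomFilter with the same `.add()` interface:

```python
import io, timeit

def bff(flt, f):
    """Hypothetical fill function: add the first TSV field of each line."""
    for line in f:
        flt.add(line.split('\t', 1)[0])

# Stand-in data and filter (any object with .add() will do).
data = "k1\tv1\nk2\tv2\nk1\tv3\n"
g = set()
bff(g, io.StringIO(data))
print(sorted(g))   # ['k1', 'k2']

# Same timing pattern as the transcript: 3 repeats of 1 fill each.
T = timeit.Timer("bff(set(), io.StringIO(data))", globals=globals())
times = T.repeat(3, number=1)
```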
Build a test harness wrt the python dict I'm going to need...
Can't immediately find a way to optimise a dict to have umpty millions
of entries...
>: cat /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv|~/lib/python/cc/lmh/test.py -n 1000000 -r 5
1000002
1000002
1000002
1000002
1000002
[1.229693355038762, 1.3374222852289677, 1.3509841952472925, 1.080365838482976, 1.1893387716263533]
Full as-it-were segment:
>: ~/lib/python/cc/lmh/test.py -n 6000000 -r 5 -f /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv
6000002
6000002
6000002
6000002
6000002
[7.250897390767932, 7.237801244482398, 7.239673590287566, 7.32976414449513, 7.23588689416647]
Full 10th of the data:
>: ~/lib/python/cc/lmh/test.py -r 3 -f /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv
52369734
52369734
52369734
[69.63967163302004, 69.09140252694488, 66.49750975705683]
That's tolerable.
>: ~/lib/python/cc/lmh/test.py -r 3 -f /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv
52369734
52369734
52369734
[64.51177835091949, 71.6610240675509, 67.74966451153159]
[0.0034751780331134796, 0.0034532323479652405, 0.0033454522490501404]
Last line is 100000 lookups.
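~0.0034 s for 100000 lookups works out to roughly 34 ns per lookup, i.e. probing the dict is essentially free next to building it. A sketch of that measurement (assuming the harness probes keys known to be present; sizes here are scaled down):

```python
import random, timeit

# Scaled-down dict; the real one holds ~52M entries.
d = {f"key{i}": i for i in range(1_000_000)}
probes = random.choices(list(d), k=100_000)   # 100000 keys to look up

def lookups():
    for k in probes:
        d[k]          # would raise KeyError on a miss

t = min(timeit.repeat(lookups, repeat=3, number=1))
print(f"{t / len(probes) * 1e9:.0f} ns/lookup")
```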
================


Try it with the existing _per segment_ index we have for 2019-35
