comparison lurid3/notes.txt @ 59:d9ba3ce783ff

python dict testing
author Henry S. Thompson <ht@inf.ed.ac.uk>
date Wed, 01 Jan 2025 23:03:07 +0000
parents 3012ca7fc6b7
children 3be7b53d726e
Be sure to f.close()
Use BloomFilter.open for an existing bloom file
Copying a file from /tmp to work/... still gives good (quick) lookup,
but _creating and filling_ a file on work/... takes ... I stopped
waiting after an hour or so.
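(For the record, a minimal sketch of the create-vs-open distinction, assuming
this is the pybloomfiltermmap3 package's pybloomfilter module, whose
constructor signature matches the calls below; the example key is made up:)

from pybloomfilter import BloomFilter

# Create + fill on fast local disk, then copy elsewhere for lookup
bf = BloomFilter(523863383, 0.05, '/tmp/hst/uris_20.bloom')  # capacity, error rate, file
bf.add('http://example.com/')
bf.close()    # flush the mmap -- the f.close() point above

# Attach to an already-built bloom file (no refill needed)
bf = BloomFilter.open('/tmp/hst/uris_20.bloom')
print('http://example.com/' in bf)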

How much bigger is .05 false positive?
Less than expected:
>: ls -l /tmp/hst
-rwxr-xr-x 1 hst dc007 408301988 Jan 1 16:52 uris_20.bloom
-rwxr-xr-x 1 hst dc007 313830100 Jan 1 15:04 uris.bloom
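(Sanity check: the standard sizing formula is m = -n ln p / (ln 2)^2 bits for
capacity n and false-positive rate p. Assuming uris.bloom was built with
p = 0.1, both sizes match the ls output above almost exactly:)

from math import log

n = 523863383
for p, name in ((0.1, 'uris.bloom'), (0.05, 'uris_20.bloom')):
    bits = -n * log(p) / log(2)**2
    print(name, round(bits / 8))    # ~313.8e6 and ~408.3e6 bytes

So .05 costs only about 30% more space than .1, not double.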
And still same (?) fill time:
>>> g=BloomFilter(523863383,0.05,'/tmp/hst/uris_20.bloom')
>>> T=timeit.Timer("bff(g,'/work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv')",globals=globals())
>>> T.repeat(3,number=1)
[89.64385064691305, 90.9979057777673, 83.9632708914578]
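(bff is not defined in these notes; presumably a bulk-fill helper along these
lines, where taking the first tab-separated field of each line as the key is
my assumption:)

def bff(bf, path):
    # Add one key per line of the .tsv to the mmap-backed bloom filter
    with open(path) as f:
        for line in f:
            bf.add(line.split('\t', 1)[0])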
Build a test harness wrt the python dict I'm going to need...
Can't immediately find a way to optimise a dict for holding umpty millions
of entries, i.e. no way to pre-size it (a sketch of the harness follows
the first run below)...
>: cat /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv|~/lib/python/cc/lmh/test.py -n 1000000 -r 5
1000002
1000002
1000002
1000002
1000002
[1.229693355038762, 1.3374222852289677, 1.3509841952472925, 1.080365838482976, 1.1893387716263533]
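(test.py itself isn't in these notes; a minimal reconstruction consistent
with its flags and output shape, where the tab-separated key/value line
format is my assumption:)

#!/usr/bin/env python3
# Hypothetical reconstruction of test.py: build a dict from the first -n
# key lines (-n 0 = all of them), -r times over, reading -f or stdin;
# print the dict size after each run, then the list of per-run build times.
import argparse, sys, time
from itertools import islice

def build(lines, n):
    d = {}
    for line in (islice(lines, n) if n else lines):
        k, _, v = line.rstrip('\n').partition('\t')
        d[k] = v
    return d

parser = argparse.ArgumentParser()
parser.add_argument('-n', type=int, default=0)   # lines per run (0 = all)
parser.add_argument('-r', type=int, default=1)   # repetitions
parser.add_argument('-f')                        # input file (default stdin)
args = parser.parse_args()

f = open(args.f) if args.f else sys.stdin
times = []
for _ in range(args.r):
    if args.f:
        f.seek(0)             # re-read the file for each repetition
    start = time.perf_counter()
    d = build(f, args.n)
    times.append(time.perf_counter() - start)
    print(len(d))
print(times)

At ~1.2s per million entries this extrapolates to about a minute for the
52M-line file, which roughly matches the full runs below.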
Full as-it-were segment:
>: ~/lib/python/cc/lmh/test.py -n 6000000 -r 5 -f /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv
6000002
6000002
6000002
6000002
6000002
[7.250897390767932, 7.237801244482398, 7.239673590287566, 7.32976414449513, 7.23588689416647]
Full 10th of the data:
>: ~/lib/python/cc/lmh/test.py -r 3 -f /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv
52369734
52369734
52369734
[69.63967163302004, 69.09140252694488, 66.49750975705683]
That's tolerable.
>: ~/lib/python/cc/lmh/test.py -r 3 -f /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv
52369734
52369734
52369734
[64.51177835091949, 71.6610240675509, 67.74966451153159]
[0.0034751780331134796, 0.0034532323479652405, 0.0033454522490501404]
Last line is the time for 100000 lookups per run, i.e. roughly 35ns per lookup.
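(The harness presumably grew a lookup timer along these lines; drawing the
probe keys from the dict itself is my assumption:)

import random, time

def time_lookups(d, k=100000):
    # Time k membership probes with keys known to be present
    probes = random.sample(list(d), k)   # list(d) is itself slow at 52M keys
    start = time.perf_counter()
    for key in probes:
        _ = key in d
    return time.perf_counter() - start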
================


Try it with the existing _per segment_ index we have for 2019-35
