changeset 59:d9ba3ce783ff
python dict testing
| author | Henry S. Thompson <ht@inf.ed.ac.uk> |
|---|---|
| date | Wed, 01 Jan 2025 23:03:07 +0000 |
| parents | 3012ca7fc6b7 |
| children | 3be7b53d726e |
| files | lurid3/notes.txt |
| diffstat | 1 files changed, 43 insertions(+), 0 deletions(-) |
--- a/lurid3/notes.txt	Wed Jan 01 15:11:09 2025 +0000
+++ b/lurid3/notes.txt	Wed Jan 01 23:03:07 2025 +0000
@@ -847,6 +847,49 @@
 Copying a file from /tmp to work/... still gives good (quick) lookup,
 but _creating and filling_ a file on work/... takes ... I stopped
 waiting after an hour or so.
+
+How much bigger is .05 false positive?
+Less than expected:
+ >: ls -l /tmp/hst
+ -rwxr-xr-x 1 hst dc007 408301988 Jan 1 16:52 uris_20.bloom
+ -rwxr-xr-x 1 hst dc007 313830100 Jan 1 15:04 uris.bloom
+And still same (?) fill time:
+ >>> g=BloomFilter(523863383,0.05,'/tmp/hst/uris_20.bloom')
+ >>> T=timeit.Timer("bff(g,'/work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv')",globals=globals())
+ >>> T.repeat(3,number=1)
+ [89.64385064691305, 90.9979057777673, 83.9632708914578]
+Build a test harness wrt the python dict I'm going to need...
+Can't immediately find a way to optimise a dict to have umpty millions
+ of entries...
+ >: cat /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv|~/lib/python/cc/lmh/test.py -n 1000000 -r 5
+ 1000002
+ 1000002
+ 1000002
+ 1000002
+ 1000002
+ [1.229693355038762, 1.3374222852289677, 1.3509841952472925, 1.080365838482976, 1.1893387716263533]
+Full as-it-were segment:
+ >: ~/lib/python/cc/lmh/test.py -n 6000000 -r 5 -f /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv
+ 6000002
+ 6000002
+ 6000002
+ 6000002
+ 6000002
+ [7.250897390767932, 7.237801244482398, 7.239673590287566, 7.32976414449513, 7.23588689416647]
+Full 10th of the data:
+ >: ~/lib/python/cc/lmh/test.py -r 3 -f /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv
+ 52369734
+ 52369734
+ 52369734
+ [69.63967163302004, 69.09140252694488, 66.49750975705683]
+That's tolerable.
+ >: ~/lib/python/cc/lmh/test.py -r 3 -f /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv
+ 52369734
+ 52369734
+ 52369734
+ [64.51177835091949, 71.6610240675509, 67.74966451153159]
+ [0.0034751780331134796, 0.0034532323479652405, 0.0033454522490501404]
+Last line is 100000 lookups.
 ================
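A possible reading of the "less than expected" size difference: with the usual optimal Bloom filter sizing, bits ≈ -n·ln(p)/(ln 2)², so file size scales with -ln(p). Assuming uris.bloom was created at a 0.1 error rate (an assumption, not stated in this section), the predicted sizes line up closely with the ls output above, and halving the error rate only costs ~30% more space:

  from math import log

  n = 523863383                    # capacity used in the transcript above

  def bloom_bytes(n, p):
      # Optimal Bloom filter size: m = -n*ln(p)/(ln 2)^2 bits, reported in bytes
      return -n * log(p) / (log(2) ** 2) / 8

  print(bloom_bytes(n, 0.05))      # ~408e6 bytes, cf. uris_20.bloom
  print(bloom_bytes(n, 0.1))       # ~314e6 bytes, cf. uris.bloom, if it used p=0.1
  print(log(0.05) / log(0.1))      # ~1.30: only about 30% bigger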
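For reference, a minimal sketch of the fill-and-time step in the transcript, assuming pybloomfiltermmap3's BloomFilter(capacity, error_rate, filename). The bff() defined here is a hypothetical stand-in for the fill helper used above (not shown in this changeset), taken to add the first tab-separated field of each line to the filter:

  # Sketch only: bff() is a hypothetical reconstruction of the fill helper.
  import timeit
  from pybloomfilter import BloomFilter

  def bff(bf, path):
      # Add the first tab-separated field of every line to the filter.
      with open(path) as f:
          for line in f:
              bf.add(line.split('\t', 1)[0])

  g = BloomFilter(523863383, 0.05, '/tmp/hst/uris_20.bloom')
  T = timeit.Timer("bff(g, '/work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv')",
                   globals=globals())
  print(T.repeat(3, number=1))     # three one-shot fill times, in seconds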
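Similarly, a hedged sketch of what a dict test harness along the lines of lmh/test.py might look like: fill a dict from the first field of each TSV line (capped at -n entries, 0 meaning all), repeat -r times printing the dict size each time, then print the fill timings and the timings for 100000 random lookups on the last fill. The option names and output shape are taken from the transcript; everything else (key extraction, the lookup pass) is an assumption, not the actual harness.

  #!/usr/bin/env python3
  # Hypothetical reconstruction, not the real ~/lib/python/cc/lmh/test.py.
  import argparse, random, sys, time

  def fill(lines, limit):
      # Build a dict keyed on the first tab-separated field of each line.
      d = {}
      for i, line in enumerate(lines):
          if limit and i >= limit:
              break
          d[line.split('\t', 1)[0]] = i
      return d

  def main():
      ap = argparse.ArgumentParser()
      ap.add_argument('-n', type=int, default=0, help='max entries (0 = all)')
      ap.add_argument('-r', type=int, default=1, help='repetitions')
      ap.add_argument('-f', help='input TSV (default: stdin)')
      args = ap.parse_args()
      lines = open(args.f).readlines() if args.f else sys.stdin.readlines()
      fills, d = [], {}
      for _ in range(args.r):
          t0 = time.perf_counter()
          d = fill(lines, args.n)
          fills.append(time.perf_counter() - t0)
          print(len(d))
      print(fills)
      # Time 100000 random lookups on the last fill, repeated -r times.
      keys = random.sample(list(d), min(100000, len(d)))
      lookups = []
      for _ in range(args.r):
          t0 = time.perf_counter()
          for k in keys:
              _ = d[k]
          lookups.append(time.perf_counter() - t0)
      print(lookups)

  main()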