changeset 59:d9ba3ce783ff

python dict testing
author Henry S. Thompson <ht@inf.ed.ac.uk>
date Wed, 01 Jan 2025 23:03:07 +0000
parents 3012ca7fc6b7
children 3be7b53d726e
files lurid3/notes.txt
diffstat 1 files changed, 43 insertions(+), 0 deletions(-)
--- a/lurid3/notes.txt	Wed Jan 01 15:11:09 2025 +0000
+++ b/lurid3/notes.txt	Wed Jan 01 23:03:07 2025 +0000
@@ -847,6 +847,49 @@
 Copying a file from /tmp to work/... still gives good (quick) lookup,
   but _creating and filling_ a file on work/... takes ... I stopped
 waiting after an hour or so.
+
+How much bigger is .05 false positive?
+Less than expected:
+  >: ls -l /tmp/hst
+  -rwxr-xr-x 1 hst dc007 408301988 Jan  1 16:52 uris_20.bloom
+  -rwxr-xr-x 1 hst dc007 313830100 Jan  1 15:04 uris.bloom
+And still same (?) fill time:
+  >>> g=BloomFilter(523863383,0.05,'/tmp/hst/uris_20.bloom')
+  >>> T=timeit.Timer("bff(g,'/work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv')",globals=globals())
+  >>> T.repeat(3,number=1)
+  [89.64385064691305, 90.9979057777673, 83.9632708914578]
+Build a test harness wrt the python dict I'm going to need...
+Can't immediately find a way to optimise a dict to have umpty millions
+ of entries...
+  >: cat /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv|~/lib/python/cc/lmh/test.py -n 1000000 -r 5
+  1000002
+  1000002
+  1000002
+  1000002
+  1000002
+  [1.229693355038762, 1.3374222852289677, 1.3509841952472925, 1.080365838482976, 1.1893387716263533]
+Full as-it-were segment:
+  >: ~/lib/python/cc/lmh/test.py -n 6000000 -r 5 -f /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv
+  6000002
+  6000002
+  6000002
+  6000002
+  6000002
+  [7.250897390767932, 7.237801244482398, 7.239673590287566, 7.32976414449513, 7.23588689416647]
+Full 10th of the data:
+  >: ~/lib/python/cc/lmh/test.py -r 3 -f /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv
+  52369734
+  52369734
+  52369734
+  [69.63967163302004, 69.09140252694488, 66.49750975705683]
+That's tolerable.
+  >: ~/lib/python/cc/lmh/test.py -r 3 -f /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv
+  52369734
+  52369734
+  52369734
+  [64.51177835091949, 71.6610240675509, 67.74966451153159]
+  [0.0034751780331134796, 0.0034532323479652405, 0.0033454522490501404]
+Last line is 100000 lookups.
 ================