comparison lurid3/notes.txt @ 59:d9ba3ce783ff

python dict testing
author Henry S. Thompson <ht@inf.ed.ac.uk>
date Wed, 01 Jan 2025 23:03:07 +0000
parents 3012ca7fc6b7
children 3be7b53d726e
Be sure to f.close()
Use BloomFilter.open for an existing bloom file
Copying a file from /tmp to work/... still gives good (quick) lookup,
but _creating and filling_ a file on work/... takes ... I stopped
waiting after an hour or so.
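(For the record, a minimal sketch of the create-vs-open distinction, assuming
this is the pybloomfiltermmap3 package's pybloomfilter module, whose
constructor signature matches the calls below; the example key is made up:)

from pybloomfilter import BloomFilter

# Create + fill on fast local disk, then copy elsewhere for lookup
bf = BloomFilter(523863383, 0.05, '/tmp/hst/uris_20.bloom')  # capacity, error rate, file
bf.add('http://example.com/')
bf.close()    # flush the mmap -- the f.close() point above

# Attach to an already-built bloom file (no refill needed)
bf = BloomFilter.open('/tmp/hst/uris_20.bloom')
print('http://example.com/' in bf)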

How much bigger is .05 false positive?
Less than expected:
>: ls -l /tmp/hst
-rwxr-xr-x 1 hst dc007 408301988 Jan 1 16:52 uris_20.bloom
-rwxr-xr-x 1 hst dc007 313830100 Jan 1 15:04 uris.bloom
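(Sanity check: the standard sizing formula is m = -n ln p / (ln 2)^2 bits for
capacity n and false-positive rate p. Assuming uris.bloom was built with
p = 0.1, both sizes match the ls output above almost exactly:)

from math import log

n = 523863383
for p, name in ((0.1, 'uris.bloom'), (0.05, 'uris_20.bloom')):
    bits = -n * log(p) / log(2)**2
    print(name, round(bits / 8))    # ~313.8e6 and ~408.3e6 bytes

So .05 costs only about 30% more space than .1, not double.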
And still same (?) fill time:
>>> g=BloomFilter(523863383,0.05,'/tmp/hst/uris_20.bloom')
>>> T=timeit.Timer("bff(g,'/work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv')",globals=globals())
>>> T.repeat(3,number=1)
[89.64385064691305, 90.9979057777673, 83.9632708914578]
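(bff is not defined in these notes; presumably a bulk-fill helper along these
lines, where taking the first tab-separated field of each line as the key is
my assumption:)

def bff(bf, path):
    # Add one key per line of the .tsv to the mmap-backed bloom filter
    with open(path) as f:
        for line in f:
            bf.add(line.split('\t', 1)[0])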
Build a test harness wrt the python dict I'm going to need...
Can't immediately find a way to optimise a dict for holding umpty millions
of entries, i.e. no way to pre-size it (a sketch of the harness follows
the first run below)...
>: cat /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv|~/lib/python/cc/lmh/test.py -n 1000000 -r 5
1000002
1000002
1000002
1000002
1000002
[1.229693355038762, 1.3374222852289677, 1.3509841952472925, 1.080365838482976, 1.1893387716263533]
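(test.py itself isn't in these notes; a minimal reconstruction consistent
with its flags and output shape, where the tab-separated key/value line
format is my assumption:)

#!/usr/bin/env python3
# Hypothetical reconstruction of test.py: build a dict from the first -n
# key lines (-n 0 = all of them), -r times over, reading -f or stdin;
# print the dict size after each run, then the list of per-run build times.
import argparse, sys, time
from itertools import islice

def build(lines, n):
    d = {}
    for line in (islice(lines, n) if n else lines):
        k, _, v = line.rstrip('\n').partition('\t')
        d[k] = v
    return d

parser = argparse.ArgumentParser()
parser.add_argument('-n', type=int, default=0)   # lines per run (0 = all)
parser.add_argument('-r', type=int, default=1)   # repetitions
parser.add_argument('-f')                        # input file (default stdin)
args = parser.parse_args()

f = open(args.f) if args.f else sys.stdin
times = []
for _ in range(args.r):
    if args.f:
        f.seek(0)             # re-read the file for each repetition
    start = time.perf_counter()
    d = build(f, args.n)
    times.append(time.perf_counter() - start)
    print(len(d))
print(times)

At ~1.2s per million entries this extrapolates to about a minute for the
52M-line file, which roughly matches the full runs below.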
Full as-it-were segment:
>: ~/lib/python/cc/lmh/test.py -n 6000000 -r 5 -f /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv
6000002
6000002
6000002
6000002
6000002
[7.250897390767932, 7.237801244482398, 7.239673590287566, 7.32976414449513, 7.23588689416647]
Full 10th of the data:
>: ~/lib/python/cc/lmh/test.py -r 3 -f /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv
52369734
52369734
52369734
[69.63967163302004, 69.09140252694488, 66.49750975705683]
That's tolerable.
>: ~/lib/python/cc/lmh/test.py -r 3 -f /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv
52369734
52369734
52369734
[64.51177835091949, 71.6610240675509, 67.74966451153159]
[0.0034751780331134796, 0.0034532323479652405, 0.0033454522490501404]
Last line is the time for 100000 lookups per run, i.e. roughly 35ns per lookup.
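(The harness presumably grew a lookup timer along these lines; drawing the
probe keys from the dict itself is my assumption:)

import random, time

def time_lookups(d, k=100000):
    # Time k membership probes with keys known to be present
    probes = random.sample(list(d), k)   # list(d) is itself slow at 52M keys
    start = time.perf_counter()
    for key in probes:
        _ = key in d
    return time.perf_counter() - start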
================


Try it with the existing _per segment_ index we have for 2019-35
