changeset 58:3012ca7fc6b7
pybloomfilter testing
author    Henry S. Thompson <ht@inf.ed.ac.uk>
date      Wed, 01 Jan 2025 15:11:09 +0000
parents   4b5117db4929
children  d9ba3ce783ff
files     lurid3/notes.txt
diffstat  1 files changed, 76 insertions(+), 8 deletions(-)
--- a/lurid3/notes.txt  Tue Dec 17 21:25:28 2024 +0000
+++ b/lurid3/notes.txt  Wed Jan 01 15:11:09 2025 +0000
@@ -677,7 +677,8 @@
 user    0m22.372s
 sys     0m5.400s
 
-So, not worth the risk, let's try python
+So, not worth the risk, let's try python: cdx_extras implements a
+callback for unpackz that outputs the LM header if it's there
 
 >: time ~/lib/python/cc/cdx_extras.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz|wc -l
 9238
@@ -760,18 +761,18 @@
 >: diff <(sed 's/^Last-Modified: //' /tmp/hst/lmo.tsv | tr -d '\r') <(cut -f 3 /tmp/hst/lm.tsv)
 853d852
-< Mon, 19 Aug 2019 01:46:49 GMT
+< Mon, 19 Aug 2019 01:46:49 GMT  [in XML comment at very end of xHTML]
 4058d4056
-< Tue, 03 Nov 2015 21:31:18 GMT<br />
+< Tue, 03 Nov 2015 21:31:18 GMT<br />  [in an HTML table]
 4405d4402
-< Mon, 19 Aug 2019 01:54:52 GMT
+< Mon, 19 Aug 2019 01:54:52 GMT  [double lm]
 5237,5238d5233
-< 3
+< 3  [bogus extension lines to preceding LM]
 < Asia/Amman
 7009d7003
-< Mon, 19 Aug 2019 02:34:20 GMT
+< Mon, 19 Aug 2019 02:34:20 GMT  [in XML comment at very end of xHTML]
 9198d9191
-< Mon, 19 Aug 2019 02:14:49 GMT
+< Mon, 19 Aug 2019 02:14:49 GMT  [in XML comment at very end of xHTML]
 
 All good.  The only implausible case is
  < Mon, 19 Aug 2019 01:54:52 GMT
@@ -782,9 +783,76 @@
 
 Start looking at how we do the merge of cdx_extras.py with existing
 index
 
+====2024-12-19====
+
+The above test shows 17.6% of entries have an LM value.
+For a 3 billion entry dataset, that means 530 million LM entries; call
+this n.
+
+Sizes?  For a 10% error rate, we need m bits = -n * ln(.1) / ln(2)^2
+
+  (- (/ (* n (log .1)) (* (log 2) (log 2)))) = 2,535,559,358 =~ 320MB
+
+That's too much :-)  Per segment, that becomes possible?
+  25,355,594 bits =~ 3.2MB
+
+But maybe it's _not_ too much.  One of the python implementations I
+saw uses mmap:
+
+  https://github.com/prashnts/pybloomfiltermmap3
+
+Build a Bloom filter with all the URIs whose entries have an LM value,
+  _and_ a python hashtable mapping from URI to LM and offset (is that
+  enough for deduping?)
+Rewrite one index file at a time:
+  Probe with each URI; if positive,
+    look up in the hashtable and use if found
+
+ >: wc -l ks*.tsv
+   52369734 ks_0-9.tsv
+   52489306 ks_10-19.tsv
+   52381115 ks_20-29.tsv
+   52438862 ks_30-39.tsv
+   52512044 ks_40-49.tsv
+   52476964 ks_50-59.tsv
+   52317116 ks_60-69.tsv
+   52200680 ks_70-79.tsv
+   52382426 ks_80-89.tsv
+   52295136 ks_90-99.tsv
+  523863383 total
+
+ >>> import timeit
+ >>> from pybloomfilter import BloomFilter
+ >>> f=BloomFilter(523863383,0.1,'/tmp/hst/uris.bloom')
+ >>> def bff(f,fn):
+ ...   with open(fn) as uf:
+ ...     while (l:=uf.readline()):
+ ...       f.add(l.split('\t')[2])
+ ...
+ >>> timeit.timeit("bff(f,'/dev/null')",number=1,globals=globals())
+ 0.00012309104204177856
+ >>> timeit.timeit("bff(f,'/work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv')",number=1,globals=globals())
+ 77.57737312093377
+ >>> 'http://71.43.189.10/dermorph' in f
+ False
+ >>> 'http://71.43.189.10/dermorph/' in f
+ True
+ >>> timeit.timeit("'http://71.43.189.10/dermorph/' in f",number=100000,globals=globals())
+ 0.02377822808921337
+ >>> timeit.timeit("'http://71.43.189.10/dermorph' in f",number=100000,globals=globals())
+ 0.019318239763379097
+
+_That's_ encouraging...
+Be sure to f.close()
+Use BloomFilter.open for an existing bloom file
+Copying a file from /tmp to work/... still gives good (quick) lookup,
+  but _creating and filling_ a file on work/... takes ... I stopped
+  waiting after an hour or so.
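A quick Python cross-check of the sizing arithmetic above (a sketch:
n and the 0.1 error rate are from the notes; megabytes are decimal,
matching the 320MB / 3.2MB figures):

  from math import log

  def bloom_bits(n, p):
      # Bits for a Bloom filter of n items at error rate p:
      # m = -n * ln(p) / ln(2)^2
      return -n * log(p) / log(2) ** 2

  n = 530_000_000                             # LM entries, whole dataset
  print(bloom_bits(n, 0.1) / 8 / 1e6)         # ~317 MB -- the "too much" case
  print(bloom_bits(n / 100, 0.1) / 8 / 1e6)   # ~3.2 MB per segment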
+================
+
 Try it with the existing _per segment_ index we have for 2019-35
-Assuming we have to key on segment plus offset, as reconstructing the
+Assuming we have to key on segment / file and offset, as reconstructing the
 proper index key is such a pain / buggy / is going to change with the
 year.
 Stay with segment 49
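A minimal sketch of the per-file rewrite loop described above, using
only the pybloomfiltermmap3 calls already noted (BloomFilter.open, the
`in` probe, f.close()); lm_map, the URI column position, and appending
LM as an extra field are illustrative assumptions, not the actual
index format:

  from pybloomfilter import BloomFilter

  def rewrite_index(index_in, index_out, bloom_path, lm_map):
      # lm_map: URI -> Last-Modified value, built while filling the
      # filter.  The Bloom probe cheaply rejects most URIs with no LM
      # entry; its ~10% false positives are caught by the dict lookup.
      f = BloomFilter.open(bloom_path)        # existing, mmap'd filter
      with open(index_in) as inf, open(index_out, 'w') as outf:
          for line in inf:
              fields = line.rstrip('\n').split('\t')
              uri = fields[2]                 # assumed URI column, as in ks*.tsv
              if uri in f:                    # probe; may be a false positive
                  lm = lm_map.get(uri)
                  if lm is not None:
                      fields.append(lm)       # assumed output: LM as extra field
              outf.write('\t'.join(fields) + '\n')
      f.close()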