changeset 58:3012ca7fc6b7

pybloomfilter testing
author Henry S. Thompson <ht@inf.ed.ac.uk>
date Wed, 01 Jan 2025 15:11:09 +0000
parents 4b5117db4929
children d9ba3ce783ff
files lurid3/notes.txt
diffstat 1 files changed, 76 insertions(+), 8 deletions(-) [+]
line wrap: on
line diff
--- a/lurid3/notes.txt	Tue Dec 17 21:25:28 2024 +0000
+++ b/lurid3/notes.txt	Wed Jan 01 15:11:09 2025 +0000
@@ -677,7 +677,8 @@
   user  0m22.372s
   sys   0m5.400s
 
-So, not worth the risk, let's try python
+So, not worth the risk, let's try python: cdx_extras implements a
+callback for unpackz that outputs the LM header if it's there
 
   >: time ~/lib/python/cc/cdx_extras.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz|wc -l
   9238
@@ -760,18 +761,18 @@
 
   >: diff <(sed 's/^Last-Modified: //' /tmp/hst/lmo.tsv | tr -d '\r') <(cut -f 3 /tmp/hst/lm.tsv)
   853d852
-  <       Mon, 19 Aug 2019 01:46:49 GMT
+  <       Mon, 19 Aug 2019 01:46:49 GMT [in XML comment at very end of xHTML]
   4058d4056
-  < Tue, 03 Nov 2015 21:31:18 GMT<br />
+  < Tue, 03 Nov 2015 21:31:18 GMT<br /> [in an HTML table]
   4405d4402
-  < Mon, 19 Aug 2019 01:54:52 GMT
+  < Mon, 19 Aug 2019 01:54:52 GMT       [double lm]
   5237,5238d5233
-  < 3
+  < 3                                   [bogus extension lines to preceding LM]
   < Asia/Amman
   7009d7003
-  <       Mon, 19 Aug 2019 02:34:20 GMT
+  <       Mon, 19 Aug 2019 02:34:20 GMT [in XML comment at very end of xHTML]
   9198d9191
-  <       Mon, 19 Aug 2019 02:14:49 GMT
+  <       Mon, 19 Aug 2019 02:14:49 GMT [in XML comment at very end of xHTML]
 
 All good.  The only implausible case is
   < Mon, 19 Aug 2019 01:54:52 GMT
@@ -782,9 +783,76 @@
 
 Start looking at how we do the merge of cdx_extras.py with existing index
 
+====2024-12-19====
+
+The above test shows that 17.6% of entries have an LM value.
+For a 3-billion-entry dataset, that means about 530 million LM
+entries; call this n.
+
+Sizes?  For a 10% error rate, we need m = -n * ln(0.1) / ln(2)^2 bits:
+
+ (- (/ (* n (log .1)) (* (log 2) (log 2)))) = 2,535,559,358 =~ 320MB
+
+That's too much :-)  Per segment, though, it becomes manageable:
+  25,355,594 bits =~ 3.2MB
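+
+The same arithmetic in python, as a quick sanity check (using the
+round 530 million figure from above and the 100 segments per crawl):
+
+  import math
+
+  n = 530_000_000        # entries with an LM value, per the estimate above
+  p = 0.1                # target false-positive rate
+  m = -n * math.log(p) / math.log(2) ** 2   # required bits
+  print(round(m))        # ~2.54e9 bits, i.e. ~320MB
+  print(round(m / 100))  # per segment: ~25e6 bits, i.e. ~3.2MB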
+
+But maybe it's _not_ too much.  One of the python implementations I
+saw uses mmap:
+
+  https://github.com/prashnts/pybloomfiltermmap3
+
+Build a Bloom filter with all the URIs whose entries have an LM value
+ _and_ a python hashtable mapping from URI to LM and offset (is that
+ enough for deduping?)
+Rewrite one index file at a time (sketched below):
+ Probe with each URI; if positive,
+   look up in the hashtable and use the stored value if found
+
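+A rough sketch of that rewrite pass (hypothetical: the URI column
+number and how the LM value gets merged into the output line will
+depend on the final index format):
+
+  URI_COL = 2    # assumed position of the URI field in an index line
+
+  def rewrite_index(in_fn, out_fn, bloom, lm_map):
+      # bloom: the BloomFilter of URIs with an LM value
+      # lm_map: the python hashtable mapping URI -> LM value
+      with open(in_fn) as inf, open(out_fn, 'w') as outf:
+          for line in inf:
+              fields = line.rstrip('\n').split('\t')
+              uri = fields[URI_COL]
+              if uri in bloom:            # cheap probe, may false-positive
+                  lm = lm_map.get(uri)    # confirm against the hashtable
+                  if lm is not None:
+                      fields.append(lm)
+              outf.write('\t'.join(fields) + '\n')
+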
+    >: wc -l ks*.tsv
+       52369734 ks_0-9.tsv
+       52489306 ks_10-19.tsv
+       52381115 ks_20-29.tsv
+       52438862 ks_30-39.tsv
+       52512044 ks_40-49.tsv
+       52476964 ks_50-59.tsv
+       52317116 ks_60-69.tsv
+       52200680 ks_70-79.tsv
+       52382426 ks_80-89.tsv
+       52295136 ks_90-99.tsv
+      523863383 total
+
+  >>> import timeit
+  >>> from pybloomfilter import BloomFilter
+  >>> f=BloomFilter(523863383,0.1,'/tmp/hst/uris.bloom')
+  >>> def bff(f,fn):
+  ...  with open(fn) as uf:
+  ...   while (l:=uf.readline()):
+  ...    f.add(l.split('\t')[2])
+  ...
+  >>> timeit.timeit("bff(f,'/dev/null')",number=1,globals=globals())
+  0.00012309104204177856
+  >>> timeit.timeit("bff(f,'/work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv')",number=1,globals=globals())
+  77.57737312093377
+  >>> 'http://71.43.189.10/dermorph' in f
+  False
+  >>> 'http://71.43.189.10/dermorph/' in f
+  True
+  >>> timeit.timeit("'http://71.43.189.10/dermorph/' in f",number=100000,globals=globals())
+  0.02377822808921337
+  >>> timeit.timeit("'http://71.43.189.10/dermorph' in f",number=100000,globals=globals())
+  0.019318239763379097
+
+_That's_ encouraging...
+Be sure to f.close()
+Use BloomFilter.open for an existing bloom file
+Copying a file from /tmp to work/... still gives good (quick) lookup,
+  but _creating and filling_ a file on work/... is far slower: I
+stopped waiting after an hour or so.
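+
+So the workable pattern looks like: build and fill the filter on
+/tmp, close it, copy the file to work/..., and reopen it from there.
+A sketch (the work/ path is a placeholder):
+
+  import shutil
+  from pybloomfilter import BloomFilter
+
+  bf = BloomFilter(523863383, 0.1, '/tmp/hst/uris.bloom')
+  # ... bf.add(uri) for every URI with an LM value, as above ...
+  bf.close()
+  shutil.copy('/tmp/hst/uris.bloom', '/work/.../uris.bloom')
+  bf = BloomFilter.open('/work/.../uris.bloom')
+  'http://71.43.189.10/dermorph/' in bf    # lookups stay quick
+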
+================
+
+
 Try it with the existing _per segment_ index we have for 2019-35
 
-Assuming we have to key on segment plus offset, as reconstructing the
+Assuming we have to key on segment / file and offset, as reconstructing the
 proper index key is such a pain / buggy / is going to change with the year.
 
 Stay with segment 49