cc/work: lurid3/notes.txt comparison

comparison lurid3/notes.txt @ 58:3012ca7fc6b7

pybloomfilter testing

author	Henry S. Thompson <ht@inf.ed.ac.uk>
date	Wed, 01 Jan 2025 15:11:09 +0000
parents	4b5117db4929
children	d9ba3ce783ff

comparison

equal deleted inserted replaced

-:4b5117db4929
+:3012ca7fc6b7
 real  0m21.483s
 user  0m22.372s
 sys   0m5.400s
-So, not worth the risk, let's try python
+So, not worth the risk, let's try python: cdx_extras implements a
+callback for unpackz that outputs the LM header if it's there
 >: time ~/lib/python/cc/cdx_extras.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz|wc -l
 9238
 real  0m25.426s
 >: wc -l /tmp/hst/lm.tsv
 9423 /tmp/hst/lm.tsv
 >: diff <(sed 's/^Last-Modified: //' /tmp/hst/lmo.tsv | tr -d '\r') <(cut -f 3 /tmp/hst/lm.tsv)
 853d852
-<       Mon, 19 Aug 2019 01:46:49 GMT
+<       Mon, 19 Aug 2019 01:46:49 GMT [in XML comment at very end of xHTML]
 4058d4056
-< Tue, 03 Nov 2015 21:31:18 GMT<br />
+< Tue, 03 Nov 2015 21:31:18 GMT<br /> [in an HTML table]
 4405d4402
-< Mon, 19 Aug 2019 01:54:52 GMT
+< Mon, 19 Aug 2019 01:54:52 GMT       [double lm]
 5237,5238d5233
-< 3
+< 3                                   [bogus extension lines to preceding LM]
 < Asia/Amman
 7009d7003
-<       Mon, 19 Aug 2019 02:34:20 GMT
+<       Mon, 19 Aug 2019 02:34:20 GMT [in XML comment at very end of xHTML]
 9198d9191
-<       Mon, 19 Aug 2019 02:14:49 GMT
+<       Mon, 19 Aug 2019 02:14:49 GMT [in XML comment at very end of xHTML]
 All good.  The only implausable case is
 < Mon, 19 Aug 2019 01:54:52 GMT
 which turns out to be a case of two Last-Modified headers in the same
 the same response record's HTTP headers.  RFCs 2616 and 7230 rule it
 out but neither specifies a recovery, so first-wins is as good as
 anything, and indeed 6797 specifies that.
 Start looking at how we do the merge of cdx_extras.py with existing index
+====2024-12-19====
+The above test shows 17.6% of entries have an LM value
+For a 3 billion entry dataset, than means 530 million LM entries, call
+this n.
+Sizes?  For a 10% error rate, we need m bits = -n * ln(.1) / ln(2)^2
+(- (/ (* n (log .1)) (* (log 2)(log 2))) = 2,535,559,358 =~ 320MB
+That's too much :-)  Per segment, that becomes possible?
+25,355,594 bits =~ 3.2MB
+But maybe it's _not_ too much.  One of the python implementations I
+saw uses mmap:
+https://github.com/prashnts/pybloomfiltermmap3
+Build a Bloom filter with all the URIs whose entries have LM value
+_and_ a python hashtable mapping from URI to LM and offset (is that
+enough for deduping?)
+Rewrite one index file at a time
+Probe with each URI, if positive
+look up in hashtable and use if found
+>: wc -l ks*.tsv
+52369734 ks_0-9.tsv
+52489306 ks_10-19.tsv
+52381115 ks_20-29.tsv
+52438862 ks_30-39.tsv
+52512044 ks_40-49.tsv
+52476964 ks_50-59.tsv
+52317116 ks_60-69.tsv
+52200680 ks_70-79.tsv
+52382426 ks_80-89.tsv
+52295136 ks_90-99.tsv
+523863383 total
+>>> from pybloomfilter import BloomFilter
+>>> f=BloomFilter(523863383,0.1,'/tmp/hst/uris.bloom')
+>>> def bff(f,fn):
+...  with open(fn) as uf:
+...   while (l:=uf.readline()):
+...    f.add(l.split('\t')[2])
+...
+>>> timeit.timeit("bff(f,'/dev/null')",number=1,globals=globals())
+0.00012309104204177856
+>>> timeit.timeit("bff(f,'/work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv')",number=1,globals=globals())
+77.57737312093377
+>>> 'http://71.43.189.10/dermorph' in f
+False
+>>> 'http://71.43.189.10/dermorph/' in f
+True
+>>> timeit.timeit("'http://71.43.189.10/dermorph/' in f",number=100000,globals=globals())
+0.02377822808921337
+>>> timeit.timeit("'http://71.43.189.10/dermorph' in f",number=100000,globals=globals())
+0.019318239763379097
+_That's_ encouraging...
+Be sure to f.close()
+Use BloomFilter.open for an existing bloom file
+Copying a file from /tmp to work/... still gives good (quick) lookup,
+but _creating and filling_ a file on work/... takes ... I stopped
+waiting after an hour or so.
+================
 Try it with the existing _per segment_ index we have for 2019-35
-Assuming we have to key on segment plus offset, as reconstructing the
+Assuming we have to key on segment / file and offset, as reconstructing the
 proper index key is such a pain / buggy / is going to change with the year.
 Stay with segment 49
 >: uz cdx.gz |wc -l

Mercurial > hg > cc > work

comparison lurid3/notes.txt @ 58:3012ca7fc6b7