comparison lurid3/notes.txt @ 58:3012ca7fc6b7

pybloomfilter testing
author Henry S. Thompson <ht@inf.ed.ac.uk>
date Wed, 01 Jan 2025 15:11:09 +0000
parents 4b5117db4929
children d9ba3ce783ff
57:4b5117db4929 58:3012ca7fc6b7
675 675
676 real 0m21.483s 676 real 0m21.483s
677 user 0m22.372s 677 user 0m22.372s
678 sys 0m5.400s 678 sys 0m5.400s
679 679
680 So, not worth the risk, let's try python 680 So, not worth the risk, let's try python: cdx_extras implements a
681 callback for unpackz that outputs the LM header if it's there
681 682
682 >: time ~/lib/python/cc/cdx_extras.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz|wc -l 683 >: time ~/lib/python/cc/cdx_extras.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz|wc -l
683 9238 684 9238
684 685
685 real 0m25.426s 686 real 0m25.426s
758 >: wc -l /tmp/hst/lm.tsv 759 >: wc -l /tmp/hst/lm.tsv
759 9423 /tmp/hst/lm.tsv 760 9423 /tmp/hst/lm.tsv
760 761
761 >: diff <(sed 's/^Last-Modified: //' /tmp/hst/lmo.tsv | tr -d '\r') <(cut -f 3 /tmp/hst/lm.tsv) 762 >: diff <(sed 's/^Last-Modified: //' /tmp/hst/lmo.tsv | tr -d '\r') <(cut -f 3 /tmp/hst/lm.tsv)
762 853d852 763 853d852
763 < Mon, 19 Aug 2019 01:46:49 GMT 764 < Mon, 19 Aug 2019 01:46:49 GMT [in XML comment at very end of xHTML]
764 4058d4056 765 4058d4056
765 < Tue, 03 Nov 2015 21:31:18 GMT<br /> 766 < Tue, 03 Nov 2015 21:31:18 GMT<br /> [in an HTML table]
766 4405d4402 767 4405d4402
767 < Mon, 19 Aug 2019 01:54:52 GMT 768 < Mon, 19 Aug 2019 01:54:52 GMT [double lm]
768 5237,5238d5233 769 5237,5238d5233
769 < 3 770 < 3 [bogus extension lines to preceding LM]
770 < Asia/Amman 771 < Asia/Amman
771 7009d7003 772 7009d7003
772 < Mon, 19 Aug 2019 02:34:20 GMT 773 < Mon, 19 Aug 2019 02:34:20 GMT [in XML comment at very end of xHTML]
773 9198d9191 774 9198d9191
774 < Mon, 19 Aug 2019 02:14:49 GMT 775 < Mon, 19 Aug 2019 02:14:49 GMT [in XML comment at very end of xHTML]
775 776
776 All good. The only implausible case is 777 All good. The only implausible case is
777 < Mon, 19 Aug 2019 01:54:52 GMT 778 < Mon, 19 Aug 2019 01:54:52 GMT
778 which turns out to be a case of two Last-Modified headers in the same 779 which turns out to be a case of two Last-Modified headers in the same
779 response record's HTTP headers. RFCs 2616 and 7230 rule it 780 response record's HTTP headers. RFCs 2616 and 7230 rule it
780 out but neither specifies a recovery, so first-wins is as good as 781 out but neither specifies a recovery, so first-wins is as good as
781 anything, and indeed RFC 6797 specifies that for Strict-Transport-Security. 782 anything, and indeed RFC 6797 specifies that for Strict-Transport-Security.
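
A minimal sketch of the first-wins rule, for illustration only: this is not the cdx_extras code, and the function name and the list-of-strings input are assumptions. It does not handle the folded continuation lines seen in the diff above.

```python
def first_last_modified(header_lines):
    """Return the value of the first Last-Modified header, or None.

    header_lines: iterable of str, one HTTP header line each (hypothetical input shape).
    """
    for line in header_lines:
        name, sep, value = line.partition(':')
        # Header names are case-insensitive; later duplicates are simply ignored.
        if sep and name.strip().lower() == 'last-modified':
            return value.strip()  # strip() also drops a trailing '\r'
    return None
```

Stopping at the first match is what makes a second Last-Modified header in the same record harmless.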
782 783
783 Start looking at how we do the merge of cdx_extras.py with existing index 784 Start looking at how we do the merge of cdx_extras.py with existing index
784 785
786 ====2024-12-19====
787
788 The above test shows 17.6% of entries have an LM value.
789 For a 3 billion entry dataset, that means 530 million LM entries, call
790 this n.
791
792 Sizes? For a 10% error rate, we need m bits = -n * ln(.1) / ln(2)^2
793
794 (- (/ (* n (log .1)) (* (log 2) (log 2)))) = 2,535,559,358 bits =~ 320MB
795
796 That's too much :-) Per segment, that becomes possible?
797 25,355,594 bits =~ 3.2MB
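
The same sum as a small Python sketch, in case it needs redoing for other values of n or other error rates. The function names are made up here, n is the rounded 530 million, and the hash-count formula k = (m/n) ln 2 is the standard companion to the sizing formula, not something taken from the notes above.

```python
import math

def bloom_bits(n, p):
    """Bits needed for n items at false-positive rate p: m = -n ln(p) / (ln 2)^2."""
    return math.ceil(-n * math.log(p) / (math.log(2) ** 2))

def bloom_hashes(n, m):
    """Optimal number of hash functions for n items in m bits: k = (m / n) ln 2."""
    return max(1, round(m / n * math.log(2)))

n = 530_000_000                       # rounded LM-entry estimate (assumption)
m = bloom_bits(n, 0.1)                # whole dataset
m_seg = bloom_bits(n // 100, 0.1)     # one of 100 segments
print(m, "bits =~", round(m / 8 / 1e6), "MB; per segment =~",
      round(m_seg / 8 / 1e6, 1), "MB; k =", bloom_hashes(n, m))
```

With the rounded n the totals come out close to, but not digit-for-digit the same as, the figures above.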
798
799 But maybe it's _not_ too much. One of the python implementations I
800 saw uses mmap:
801
802 https://github.com/prashnts/pybloomfiltermmap3
803
804 Build a Bloom filter with all the URIs whose entries have LM value
805 _and_ a python hashtable mapping from URI to LM and offset (is that
806 enough for deduping?)
807 Rewrite one index file at a time
808 Probe with each URI, if positive
809 look up in hashtable and use if found
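
A rough sketch of the plan above, under stated assumptions: the (uri, lm, offset) row shape and both function names are hypothetical, and `bloom` is anything that supports .add() and `in`, so a pybloomfilter BloomFilter or (for trying it out) a plain set.

```python
def build_lm_lookup(rows, bloom):
    """Fill the filter and return a dict mapping URI -> (lm, offset).

    rows: iterable of (uri, lm, offset) tuples (hypothetical shape).
    """
    table = {}
    for uri, lm, offset in rows:
        bloom.add(uri)
        # Keep the first entry seen for a repeated URI.
        table.setdefault(uri, (lm, offset))
    return table

def lookup_lm(uri, bloom, table):
    """Return (lm, offset) for uri, or None, probing the filter first."""
    if uri in bloom:            # positive may be a false positive
        return table.get(uri)   # the dict gives the definite answer
    return None
```

For example, with a set standing in for the filter:

```python
bloom = set()
table = build_lm_lookup([("http://example.org/", "Mon, 19 Aug 2019 01:46:49 GMT", 1234)], bloom)
hit = lookup_lm("http://example.org/", bloom, table)
miss = lookup_lm("http://example.org/other", bloom, table)
```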
810
811 >: wc -l ks*.tsv
812 52369734 ks_0-9.tsv
813 52489306 ks_10-19.tsv
814 52381115 ks_20-29.tsv
815 52438862 ks_30-39.tsv
816 52512044 ks_40-49.tsv
817 52476964 ks_50-59.tsv
818 52317116 ks_60-69.tsv
819 52200680 ks_70-79.tsv
820 52382426 ks_80-89.tsv
821 52295136 ks_90-99.tsv
822 523863383 total
823
824 >>> from pybloomfilter import BloomFilter
825 >>> f=BloomFilter(523863383,0.1,'/tmp/hst/uris.bloom')
826 >>> def bff(f,fn):
827 ...     with open(fn) as uf:
828 ...         while (l:=uf.readline()):
829 ...             f.add(l.split('\t')[2])
830 ...
831 >>> timeit.timeit("bff(f,'/dev/null')",number=1,globals=globals())
832 0.00012309104204177856
833 >>> timeit.timeit("bff(f,'/work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv')",number=1,globals=globals())
834 77.57737312093377
835 >>> 'http://71.43.189.10/dermorph' in f
836 False
837 >>> 'http://71.43.189.10/dermorph/' in f
838 True
839 >>> timeit.timeit("'http://71.43.189.10/dermorph/' in f",number=100000,globals=globals())
840 0.02377822808921337
841 >>> timeit.timeit("'http://71.43.189.10/dermorph' in f",number=100000,globals=globals())
842 0.019318239763379097
843
844 _That's_ encouraging...
845 Be sure to f.close()
846 Use BloomFilter.open for an existing bloom file
847 Copying a file from /tmp to work/... still gives good (quick) lookup,
848 but _creating and filling_ a file on work/... takes ... I stopped
849 waiting after an hour or so.
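
A sketch of the workflow those notes suggest: create and fill on local scratch, close, then copy to the shared filesystem and reopen there. The function name and path parameters are made up; the BloomFilter constructor, add, close and BloomFilter.open are the calls used in the session above. The import sits inside the function only so the definition loads where pybloomfilter is not installed.

```python
import shutil

def build_then_copy(uris, capacity, error_rate, scratch_path, final_path):
    """Fill a filter at scratch_path (e.g. under /tmp), copy it, reopen the copy."""
    from pybloomfilter import BloomFilter  # pybloomfiltermmap3
    bf = BloomFilter(capacity, error_rate, scratch_path)
    try:
        for uri in uris:
            bf.add(uri)
    finally:
        bf.close()                          # be sure to close before copying
    shutil.copy(scratch_path, final_path)   # e.g. onto work/...
    return BloomFilter.open(final_path)     # lookups on the copy
```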
850 ================
851
852
785 Try it with the existing _per segment_ index we have for 2019-35 853 Try it with the existing _per segment_ index we have for 2019-35
786 854
787 Assuming we have to key on segment plus offset, as reconstructing the 855 Assuming we have to key on segment / file and offset, as reconstructing the
788 proper index key is such a pain / buggy / is going to change with the year. 856 proper index key is such a pain / buggy / is going to change with the year.
789 857
790 Stay with segment 49 858 Stay with segment 49
791 859
792 >: uz cdx.gz |wc -l 860 >: uz cdx.gz |wc -l