comparison lurid3/notes.txt @ 58:3012ca7fc6b7
pybloomfilter testing
author | Henry S. Thompson <ht@inf.ed.ac.uk> |
---|---|
date | Wed, 01 Jan 2025 15:11:09 +0000 |
parents | 4b5117db4929 |
children | d9ba3ce783ff |
57:4b5117db4929 | 58:3012ca7fc6b7 |
---|---|
675 | 675 |
676 real 0m21.483s | 676 real 0m21.483s |
677 user 0m22.372s | 677 user 0m22.372s |
678 sys 0m5.400s | 678 sys 0m5.400s |
679 | 679 |
680 So, not worth the risk, let's try python | 680 So, not worth the risk, let's try python: cdx_extras implements a |
681 callback for unpackz that outputs the LM header if it's there | |
681 | 682 |
682 >: time ~/lib/python/cc/cdx_extras.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz|wc -l | 683 >: time ~/lib/python/cc/cdx_extras.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz|wc -l |
683 9238 | 684 9238 |
684 | 685 |
685 real 0m25.426s | 686 real 0m25.426s |
758 >: wc -l /tmp/hst/lm.tsv | 759 >: wc -l /tmp/hst/lm.tsv |
759 9423 /tmp/hst/lm.tsv | 760 9423 /tmp/hst/lm.tsv |
760 | 761 |
761 >: diff <(sed 's/^Last-Modified: //' /tmp/hst/lmo.tsv | tr -d '\r') <(cut -f 3 /tmp/hst/lm.tsv) | 762 >: diff <(sed 's/^Last-Modified: //' /tmp/hst/lmo.tsv | tr -d '\r') <(cut -f 3 /tmp/hst/lm.tsv) |
762 853d852 | 763 853d852 |
763 < Mon, 19 Aug 2019 01:46:49 GMT | 764 < Mon, 19 Aug 2019 01:46:49 GMT [in XML comment at very end of xHTML] |
764 4058d4056 | 765 4058d4056 |
765 < Tue, 03 Nov 2015 21:31:18 GMT<br /> | 766 < Tue, 03 Nov 2015 21:31:18 GMT<br /> [in an HTML table] |
766 4405d4402 | 767 4405d4402 |
767 < Mon, 19 Aug 2019 01:54:52 GMT | 768 < Mon, 19 Aug 2019 01:54:52 GMT [double lm] |
768 5237,5238d5233 | 769 5237,5238d5233 |
769 < 3 | 770 < 3 [bogus extension lines to preceding LM] |
770 < Asia/Amman | 771 < Asia/Amman |
771 7009d7003 | 772 7009d7003 |
772 < Mon, 19 Aug 2019 02:34:20 GMT | 773 < Mon, 19 Aug 2019 02:34:20 GMT [in XML comment at very end of xHTML] |
773 9198d9191 | 774 9198d9191 |
774 < Mon, 19 Aug 2019 02:14:49 GMT | 775 < Mon, 19 Aug 2019 02:14:49 GMT [in XML comment at very end of xHTML] |
775 | 776 |
776 All good. The only implausible case is | 777 All good. The only implausible case is |
777 < Mon, 19 Aug 2019 01:54:52 GMT | 778 < Mon, 19 Aug 2019 01:54:52 GMT |
778 which turns out to be a case of two Last-Modified headers in the same | 779 which turns out to be a case of two Last-Modified headers in the same |
779 response record's HTTP headers. RFCs 2616 and 7230 rule it | 780 response record's HTTP headers. RFCs 2616 and 7230 rule it |
780 out but neither specifies a recovery, so first-wins is as good as | 781 out but neither specifies a recovery, so first-wins is as good as |
781 anything, and indeed 6797 specifies that. | 782 anything, and indeed 6797 specifies that. |
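A minimal sketch of the first-wins rule for duplicate Last-Modified headers, assuming the HTTP header block is already available as a list of lines. The function name `first_last_modified` and the second date in the sample are hypothetical, and this is not taken from cdx_extras.py.

```python
def first_last_modified(header_lines):
    """Return the value of the first Last-Modified header, or None if absent."""
    for line in header_lines:
        # Split at the first colon only: the date value itself contains colons.
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "last-modified":
            return value.strip()  # strip() also removes a trailing '\r'
    return None


# Sample headers; the second Last-Modified value is made up for illustration.
headers = [
    "HTTP/1.1 200 OK\r",
    "Content-Type: text/html\r",
    "Last-Modified: Mon, 19 Aug 2019 01:54:52 GMT\r",
    "Last-Modified: Mon, 19 Aug 2019 01:54:53 GMT\r",
]
print(first_last_modified(headers))
```

Returning on the first match means any later duplicate is simply never inspected.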
782 | 783 |
783 Start looking at how we do the merge of cdx_extras.py with existing index | 784 Start looking at how we do the merge of cdx_extras.py with existing index |
784 | 785 |
786 ====2024-12-19==== | |
787 | |
788 The above test shows 17.6% of entries have an LM value | |
789 For a 3 billion entry dataset, that means 530 million LM entries, call | |
790 this n. | |
791 | |
792 Sizes? For a 10% false-positive rate, we need m bits = -n * ln(.1) / ln(2)^2 | |
793 | |
794 (- (/ (* n (log .1)) (* (log 2)(log 2)))) = 2,535,559,358 bits =~ 320MB | |
795 | |
796 That's too much :-) Per segment, that becomes possible? | |
797 25,355,594 bits =~ 3.2MB | |
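The sizing arithmetic above can be rechecked with a short Python sketch. It assumes the rounded figure n = 530 million, so the result differs slightly from the number in the note, and it divides by 100 for the per-segment case as the note does. The helper name `bloom_bits` is hypothetical.

```python
import math


def bloom_bits(n, p):
    """Bits needed for n entries at false-positive rate p: -n * ln(p) / ln(2)^2."""
    return math.ceil(-n * math.log(p) / (math.log(2) ** 2))


n = 530_000_000                       # rounded LM-entry count from the note
m = bloom_bits(n, 0.1)                # whole-dataset filter
m_seg = bloom_bits(n // 100, 0.1)     # one segment, i.e. n / 100 entries

print(f"whole dataset: {m:,} bits =~ {m / 8 / 1e6:.0f} MB")
print(f"per segment:   {m_seg:,} bits =~ {m_seg / 8 / 1e6:.1f} MB")
```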
798 | |
799 But maybe it's _not_ too much. One of the python implementations I | |
800 saw uses mmap: | |
801 | |
802 https://github.com/prashnts/pybloomfiltermmap3 | |
803 | |
804 Build a Bloom filter with all the URIs whose entries have LM value | |
805 _and_ a python hashtable mapping from URI to LM and offset (is that | |
806 enough for deduping?) | |
807 Rewrite one index file at a time | |
808 Probe with each URI, if positive | |
809 look up in hashtable and use if found | |
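The plan above can be sketched as follows. This is an illustration under stated assumptions: `TinyBloom` is a hypothetical pure-Python stand-in for the mmap-backed filter, so the example runs without pybloomfiltermmap3, and the sample URI, LM value and offset are made up.

```python
import hashlib
import math


class TinyBloom:
    """Hypothetical in-memory Bloom filter, for illustration only."""

    def __init__(self, capacity, error_rate):
        # Same sizing formula as in the note: m = -n * ln(p) / ln(2)^2
        self.m = max(8, math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.k = max(1, round(self.m / capacity * math.log(2)))  # number of hashes
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, key):
        # Double hashing: derive k bit positions from one SHA-256 digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


def build(lm_rows):
    """Build the filter and the URI -> (LM, offset) table from (uri, lm, offset) rows."""
    rows = list(lm_rows)
    bloom = TinyBloom(max(1, len(rows)), 0.1)
    table = {}
    for uri, lm, offset in rows:
        bloom.add(uri)
        table[uri] = (lm, offset)
    return bloom, table


def lookup(uri, bloom, table):
    """Probe the filter first; only on a positive consult the hashtable."""
    if uri not in bloom:
        return None            # definite miss: Bloom filters give no false negatives
    return table.get(uri)      # None here means the probe was a false positive


# Made-up sample row, not taken from the real index.
bloom, table = build([("http://example.org/a/", "Tue, 03 Nov 2015 21:31:18 GMT", 12345)])
print(lookup("http://example.org/a/", bloom, table))
print(lookup("http://example.org/missing", bloom, table))
```

The filter rejects most URIs that have no LM value without touching the table, and the table lookup resolves the remaining false positives.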
810 | |
811 >: wc -l ks*.tsv | |
812 52369734 ks_0-9.tsv | |
813 52489306 ks_10-19.tsv | |
814 52381115 ks_20-29.tsv | |
815 52438862 ks_30-39.tsv | |
816 52512044 ks_40-49.tsv | |
817 52476964 ks_50-59.tsv | |
818 52317116 ks_60-69.tsv | |
819 52200680 ks_70-79.tsv | |
820 52382426 ks_80-89.tsv | |
821 52295136 ks_90-99.tsv | |
822 523863383 total | |
823 | |
824 >>> from pybloomfilter import BloomFilter | |
825 >>> f=BloomFilter(523863383,0.1,'/tmp/hst/uris.bloom') | |
826 >>> def bff(f,fn): | |
827 ... with open(fn) as uf: | |
828 ... while (l:=uf.readline()): | |
829 ... f.add(l.split('\t')[2]) | |
830 ... | |
831 >>> timeit.timeit("bff(f,'/dev/null')",number=1,globals=globals()) | |
832 0.00012309104204177856 | |
833 >>> timeit.timeit("bff(f,'/work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv')",number=1,globals=globals()) | |
834 77.57737312093377 | |
835 >>> 'http://71.43.189.10/dermorph' in f | |
836 False | |
837 >>> 'http://71.43.189.10/dermorph/' in f | |
838 True | |
839 >>> timeit.timeit("'http://71.43.189.10/dermorph/' in f",number=100000,globals=globals()) | |
840 0.02377822808921337 | |
841 >>> timeit.timeit("'http://71.43.189.10/dermorph' in f",number=100000,globals=globals()) | |
842 0.019318239763379097 | |
843 | |
844 _That's_ encouraging... | |
845 Be sure to f.close() | |
846 Use BloomFilter.open for an existing bloom file | |
847 Copying a file from /tmp to work/... still gives good (quick) lookup, | |
848 but _creating and filling_ a file on work/... takes ... I stopped | |
849 waiting after an hour or so. | |
850 ================ | |
851 | |
852 | |
785 Try it with the existing _per segment_ index we have for 2019-35 | 853 Try it with the existing _per segment_ index we have for 2019-35 |
786 | 854 |
787 Assuming we have to key on segment plus offset, as reconstructing the | 855 Assuming we have to key on segment / file and offset, as reconstructing the |
788 proper index key is such a pain / buggy / is going to change with the year. | 856 proper index key is such a pain / buggy / is going to change with the year. |
789 | 857 |
790 Stay with segment 49 | 858 Stay with segment 49 |
791 | 859 |
792 >: uz cdx.gz |wc -l | 860 >: uz cdx.gz |wc -l |