comparison lurid3/notes.txt @ 60:3be7b53d726e
using python dict test
author | Henry S. Thompson <ht@inf.ed.ac.uk> |
---|---|
date | Thu, 02 Jan 2025 15:01:48 +0000 |
parents | d9ba3ce783ff |
children | e6bab0972142 |
59:d9ba3ce783ff | 60:3be7b53d726e |
---|---|
881 52369734 | 881 52369734 |
882 52369734 | 882 52369734 |
883 52369734 | 883 52369734 |
884 [69.63967163302004, 69.09140252694488, 66.49750975705683] | 884 [69.63967163302004, 69.09140252694488, 66.49750975705683] |
885 That's tolerable. | 885 That's tolerable. |
886 >: ~/lib/python/cc/lmh/test.py -r 3 -f /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv | 886 >: ~/lib/python/cc/lmh/test_hash.py -r 3 -f /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv |
887 52369734 | 887 52369734 |
888 52369734 | 888 52369734 |
889 52369734 | 889 52369734 |
890 [64.51177835091949, 71.6610240675509, 67.74966451153159] | 890 [64.51177835091949, 71.6610240675509, 67.74966451153159] |
891 [0.0034751780331134796, 0.0034532323479652405, 0.0033454522490501404] | 891 [0.0034751780331134796, 0.0034532323479652405, 0.0033454522490501404] |
892 Last line is 100000 lookups. | 892 Last line is 100000 lookups. |
893 | |
894 So, try a test: | |
895 >: time ~/lib/python/cc/lmh/test_hash.py -r 1 -p results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.pickle -f /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv | |
896 52369734 | |
897 [70.98342595621943] | |
898 [0.0037928372621536255] | |
899 | |
900 real 1m51.456s | |
901 user 1m32.901s | |
902 sys 0m17.937s | |
903 >: ls -lh results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.pickle | |
904 -rw-r--r-- 1 hst dc007 5.5G Jan 2 12:19 results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.pickle | |
905 cdx_out.write(b' ') | |
906 cdx_out.write(b' ') | |
907 >: time ~/lib/python/cc/lmh/test_lookup1.py | |
908 52369734 | |
909 1076046 130318 | |
910 | |
911 real 1m52.668s | |
912 user 1m40.751s | |
913 sys 0m9.610s | |
914 | |
915 Not bad. 1.5 minutes per file, plus 10 x 20 secs or so for the | |
916 unpickles =~ 453 minutes =~ 7.5 hours. | |
917 | |
918 Try pre-filter with the Bloom filter. | |
893 ================ | 919 ================ |
894 | 920 |
895 | 921 |
896 Try it with the existing _per segment_ index we have for 2019-35 | 922 Try it with the existing _per segment_ index we have for 2019-35 |
897 | 923 |
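
For readers without access to the scripts, the dict test recorded in the revision above (load a TSV into a Python dict, pickle it, then time a batch of 100000 lookups) can be roughly sketched as follows. This is not test_hash.py, whose source is not shown here. The function names, the key<TAB>value layout and the synthetic data are all assumptions made for illustration.

```python
import io
import pickle
import random
import timeit


def load_tsv(lines):
    """Build a dict from tab-separated lines (assumed layout: key<TAB>value)."""
    table = {}
    for line in lines:
        key, _, value = line.rstrip("\n").partition("\t")
        table[key] = value
    return table


def time_lookups(table, keys, repeat=3):
    """Return `repeat` wall-clock timings (seconds), each for one pass over keys."""
    def probe():
        for k in keys:
            table.get(k)
    return timeit.repeat(probe, number=1, repeat=repeat)


# Tiny synthetic stand-in for a real key/value TSV file (hypothetical data).
sample = [f"key{i}\tvalue{i}\n" for i in range(1000)]
table = load_tsv(sample)

# Round-trip through pickle in memory; a real run would write to a file.
buf = io.BytesIO()
pickle.dump(table, buf, protocol=pickle.HIGHEST_PROTOCOL)
buf.seek(0)
restored = pickle.load(buf)

# 100000 lookups over randomly chosen existing keys, timed three times.
random.seed(0)
probes = random.choices(list(restored), k=100_000)
timings = time_lookups(restored, probes)
print(len(restored), timings)
```

With a table of tens of millions of entries, the build and unpickle steps dominate the lookups, which is the pattern the timings above show: roughly 70 seconds to load versus a few milliseconds for 100000 lookups.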