comparison lurid3/notes.txt @ 60:3be7b53d726e

using python dict test
author Henry S. Thompson <ht@inf.ed.ac.uk>
date Thu, 02 Jan 2025 15:01:48 +0000
parents d9ba3ce783ff
children e6bab0972142
comparison
59:d9ba3ce783ff 60:3be7b53d726e
881 52369734 881 52369734
882 52369734 882 52369734
883 52369734 883 52369734
884 [69.63967163302004, 69.09140252694488, 66.49750975705683] 884 [69.63967163302004, 69.09140252694488, 66.49750975705683]
885 That's tolerable. 885 That's tolerable.
886 >: ~/lib/python/cc/lmh/test.py -r 3 -f /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv 886 >: ~/lib/python/cc/lmh/test_hash.py -r 3 -f /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv
887 52369734 887 52369734
888 52369734 888 52369734
889 52369734 889 52369734
890 [64.51177835091949, 71.6610240675509, 67.74966451153159] 890 [64.51177835091949, 71.6610240675509, 67.74966451153159]
891 [0.0034751780331134796, 0.0034532323479652405, 0.0033454522490501404] 891 [0.0034751780331134796, 0.0034532323479652405, 0.0033454522490501404]
892 Last line is 100000 lookups. 892 Last line is 100000 lookups.
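A minimal sketch of the kind of timing run shown above, not the actual test_hash.py (whose source is not in these notes). It assumes a two-column tab-separated input; build_dict, time_lookups and the synthetic keys are hypothetical stand-ins for the real ks_0-9.tsv data.

```python
import random
import timeit


def build_dict(lines):
    """Build a key -> value dict from tab-separated lines (layout is assumed)."""
    d = {}
    for line in lines:
        key, _, value = line.rstrip("\n").partition("\t")
        d[key] = value
    return d


def time_lookups(d, keys, repeat=3):
    """Return `repeat` wall-clock timings for looking up every key once."""
    def probe():
        for k in keys:
            d.get(k)
    return timeit.repeat(probe, number=1, repeat=repeat)


# Small synthetic stand-in for the real 52369734-entry table
lines = [f"key{i}\t{i}\n" for i in range(200_000)]
d = build_dict(lines)

# 100000 random probes, as in the last line of output above
rng = random.Random(0)
keys = [f"key{rng.randrange(200_000)}" for _ in range(100_000)]
timings = time_lookups(d, keys)

print(len(d))
print(timings)
```

timeit.repeat returns one elapsed time per repetition, which matches the shape of the bracketed lists in the output above.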
893
894 So, try a test:
895 >: time ~/lib/python/cc/lmh/test_hash.py -r 1 -p results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.pickle -f /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv
896 52369734
897 [70.98342595621943]
898 [0.0037928372621536255]
899
900 real 1m51.456s
901 user 1m32.901s
902 sys 0m17.937s
903 >: ls -lh results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.pickle
904 -rw-r--r-- 1 hst dc007 5.5G Jan 2 12:19 results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.pickle
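A rough sketch of the pickle round trip that the -p run above presumably performs; the real script is not shown in these notes, so dump_dict, load_dict and the file name ks_demo.pickle are illustrative assumptions. It uses a tiny dict in a temporary directory rather than the 5.5G file.

```python
import os
import pickle
import tempfile
import time


def dump_dict(d, path):
    """Write the dict with the most compact binary pickle protocol available."""
    with open(path, "wb") as f:
        pickle.dump(d, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_dict(path):
    """Load a pickled dict and report how long the unpickle took, in seconds."""
    start = time.perf_counter()
    with open(path, "rb") as f:
        d = pickle.load(f)
    return d, time.perf_counter() - start


original = {f"key{i}": i for i in range(100_000)}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "ks_demo.pickle")  # hypothetical file name
    dump_dict(original, path)
    size = os.path.getsize(path)
    restored, seconds = load_dict(path)

print(size, seconds)
```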
905 cdx_out.write(b' ')
906 cdx_out.write(b' ')
907 >: time ~/lib/python/cc/lmh/test_lookup1.py
908 52369734
909 1076046 130318
910
911 real 1m52.668s
912 user 1m40.751s
913 sys 0m9.610s
914
915 Not bad. 1.5 minutes per file, plus 10 x 20 secs or so for the
916 unpickles =~ 453 minutes =~ 7.5 hours.
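The arithmetic behind this estimate, as a small sketch. The number of files is not stated in the notes; 300 is an inference, chosen because it is the count that makes the total come out at 453 minutes. estimate_minutes is a hypothetical helper.

```python
def estimate_minutes(n_files, minutes_per_file=1.5,
                     n_unpickles=10, secs_per_unpickle=20):
    """Total run time in minutes: per-file lookup time plus unpickle overhead."""
    return n_files * minutes_per_file + n_unpickles * secs_per_unpickle / 60


# 300 files is an assumption, not a figure from the notes
total = estimate_minutes(300)
hours = total / 60
print(round(total), round(hours, 2))
```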
917
918 Try pre-filter with the Bloom filter.
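A sketch of what pre-filtering could look like. The notes refer to an existing Bloom filter whose implementation is not shown here, so the BloomFilter class below is a self-contained stand-in, and lookup is a hypothetical helper. The point is only that keys the filter rejects never reach the large dict.

```python
import hashlib
import math


class BloomFilter:
    """Minimal Bloom filter over str keys (stand-in, not the filter from the notes)."""

    def __init__(self, capacity, error_rate=0.01):
        # Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m/n) ln 2 hashes
        self.m = max(8, math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.k = max(1, round(self.m / capacity * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, key):
        # Double hashing: derive k bit positions from two 64-bit halves of one digest
        digest = hashlib.blake2b(key.encode("utf-8"), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, key):
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(key))


def lookup(key, bloom, table):
    """Consult the dict only when the filter says the key may be present."""
    if key not in bloom:
        return None  # definitely absent: a Bloom filter has no false negatives
    return table.get(key)  # may still be None on a false positive


table = {f"key{i}": i for i in range(10_000)}
bloom = BloomFilter(len(table))
for k in table:
    bloom.add(k)

false_positives = sum(f"other{i}" in bloom for i in range(10_000))
print(lookup("key42", bloom, table), lookup("missing", bloom, table), false_positives)
```

Whether this pays off depends on how many probe keys are absent: the filter saves a dict lookup only for misses, and each membership test costs a hash computation of its own.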
893 ================ 919 ================
894 920
895 921
896 Try it with the existing _per segment_ index we have for 2019-35 922 Try it with the existing _per segment_ index we have for 2019-35
897 923