changeset 60:3be7b53d726e

using python dict test
author Henry S. Thompson <ht@inf.ed.ac.uk>
date Thu, 02 Jan 2025 15:01:48 +0000
parents d9ba3ce783ff
children e6bab0972142
files lurid3/notes.txt
diffstat 1 files changed, 27 insertions(+), 1 deletions(-)
--- a/lurid3/notes.txt	Wed Jan 01 23:03:07 2025 +0000
+++ b/lurid3/notes.txt	Thu Jan 02 15:01:48 2025 +0000
@@ -883,13 +883,39 @@
   52369734
   [69.63967163302004, 69.09140252694488, 66.49750975705683]
 That's tolerable.
-  >: ~/lib/python/cc/lmh/test.py -r 3 -f /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv
+  >: ~/lib/python/cc/lmh/test_hash.py -r 3 -f /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv
   52369734
   52369734
   52369734
   [64.51177835091949, 71.6610240675509, 67.74966451153159]
   [0.0034751780331134796, 0.0034532323479652405, 0.0033454522490501404]
 Last line is 100000 lookups.
+
+So, try a test:
+  >: time ~/lib/python/cc/lmh/test_hash.py -r 1 -p results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.pickle -f /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv
+  52369734
+  [70.98342595621943]
+  [0.0037928372621536255]
+
+  real  1m51.456s
+  user  1m32.901s
+  sys   0m17.937s
+  >: ls -lh results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.pickle
+  -rw-r--r-- 1 hst dc007 5.5G Jan  2 12:19 results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.pickle
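
For readers without access to test_hash.py, here is a minimal sketch of the kind of measurement the runs above report: build a dict from a TSV file several times, pickle it, reload it, and time 100000 lookups. It is not the author's script. The two-column key/value layout, the function names and the tiny generated demo file are all assumptions made for illustration, and the first number printed is presumably the entry count.

```python
import pickle
import random
import tempfile
import timeit
from pathlib import Path


def load_table(tsv_path):
    """Read a TSV file into a dict, first column as key, remainder as value.

    ASSUMPTION: the real ks_*.tsv column layout is not shown in the notes.
    """
    table = {}
    with open(tsv_path, encoding="utf-8") as f:
        for line in f:
            key, value = line.rstrip("\n").split("\t", 1)
            table[key] = value
    return table


def time_build(tsv_path, repeats):
    """Return one wall-clock build time (seconds) per repeat."""
    return timeit.repeat(lambda: load_table(tsv_path), number=1, repeat=repeats)


def time_lookups(table, n_lookups, seed=0):
    """Time n_lookups membership tests on keys sampled from the table itself."""
    rng = random.Random(seed)
    keys = rng.choices(list(table), k=n_lookups)
    start = timeit.default_timer()
    hits = sum(1 for k in keys if k in table)
    return timeit.default_timer() - start, hits


# Demo on a small generated file (hypothetical data, not the real table).
with tempfile.TemporaryDirectory() as tmp:
    tsv = Path(tmp) / "demo.tsv"
    tsv.write_text("".join(f"key{i}\t{i}\n" for i in range(1000)), encoding="utf-8")

    build_times = time_build(tsv, repeats=3)
    table = load_table(tsv)

    # Pickle once, then reload, as a later run would do instead of re-parsing.
    pickle_path = Path(tmp) / "demo.pickle"
    with open(pickle_path, "wb") as f:
        pickle.dump(table, f, protocol=pickle.HIGHEST_PROTOCOL)
    with open(pickle_path, "rb") as f:
        reloaded = pickle.load(f)

    lookup_secs, hits = time_lookups(reloaded, n_lookups=100_000)

print(len(reloaded))
print(build_times)
print([lookup_secs])
```

On a table of this size the timings are meaningless; the point is only the shape of the measurement, with build time and lookup time reported separately.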
+          cdx_out.write(b' ')
+          cdx_out.write(b' ')
+  >: time ~/lib/python/cc/lmh/test_lookup1.py
+  52369734
+  1076046 130318
+
+  real  1m52.668s
+  user  1m40.751s
+  sys   0m9.610s
+
+Not bad.  1.5 minutes per file, plus 10 x 20 secs or so for the
+unpickles =~ 453 minutes =~ 7.5 hours.
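
The estimate above can be checked with a few lines of Python. This is a sketch only: the notes do not state the number of files, so the figure of 300 used here is an assumption, inferred from the 453-minute total at 1.5 minutes per file. The other numbers are the ones quoted.

```python
# Back-of-envelope check of the run-time estimate.
# ASSUMPTION: n_files = 300 is not given in the notes; it is what the
# quoted 453-minute total implies at 1.5 minutes per file.
n_files = 300
minutes_per_file = 1.5    # per-file figure quoted in the notes
n_unpickles = 10          # "10 x 20 secs or so"
secs_per_unpickle = 20

total_minutes = n_files * minutes_per_file + n_unpickles * secs_per_unpickle / 60
total_hours = total_minutes / 60
print(f"{total_minutes:.0f} minutes = {total_hours:.2f} hours")
```

Under that assumption the total comes to roughly 453 minutes, a little over seven and a half hours.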
+
+Try pre-filter with the Bloom filter.
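
As an illustration of the pre-filtering idea, the sketch below puts a small Bloom filter in front of a dict, so that most absent keys are rejected without touching the large table. It is not the filter referred to in the notes. The class, the bit and hash counts, and the `lookup` helper are all hypothetical, and the double-hashing scheme over a BLAKE2b digest is just one common way to derive the bit positions.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: false positives are possible, false negatives are not."""

    def __init__(self, n_bits, n_hashes):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, key):
        # Derive n_hashes bit positions from one 128-bit digest (double hashing).
        digest = hashlib.blake2b(key.encode("utf-8"), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big") | 1  # force odd so the stride is never zero
        return [(h1 + i * h2) % self.n_bits for i in range(self.n_hashes)]

    def add(self, key):
        for p in self._positions(key):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, key):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(key))


# Hypothetical stand-in for the big lookup table.
table = {f"key{i}": i for i in range(1000)}

bloom = BloomFilter(n_bits=16384, n_hashes=4)
for k in table:
    bloom.add(k)


def lookup(key):
    """Consult the cheap filter first; only probe the dict on a possible hit."""
    if key not in bloom:
        return None          # definitely absent
    return table.get(key)    # present, or a false positive that the dict resolves


print(lookup("key42"), lookup("no-such-key"))
```

The filter pays off only if a large share of the keys being looked up are absent and the filter is much cheaper to load than the pickled dict; whether that holds here is what the proposed test would show.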
 ================