changeset 62:bc0bdb649c08

tried cdb, slower by 2 OoM
author Henry S. Thompson <ht@inf.ed.ac.uk>
date Fri, 03 Jan 2025 13:35:14 +0000
parents e6bab0972142
children 663e55844c1d
files lurid3/notes.txt
diffstat 1 files changed, 79 insertions(+), 0 deletions(-)
--- a/lurid3/notes.txt	Thu Jan 02 18:55:11 2025 +0000
+++ b/lurid3/notes.txt	Fri Jan 03 13:35:14 2025 +0000
@@ -949,6 +949,85 @@
  >: seq 2 9 | parallel -j 8 'time ~/lib/python/cc/lmh/test_hash.py -r 1 -p results/CC-MAIN-2019-35/warc_lmhx/ks_{}0-{}9.pickle -f /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_{}0-{}9.tsv'
 
 Slightly slower when running in parallel, but done
+
+Try using a cdb file; it has to be one segment per file because of the
+4GB limit of the cdb C code
+
+  >: ~/lib/python/cc/lmh/ks2cdb.py -f <(head -5236974 results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv) -c results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb_in
+  5236974
+  11.358113696798682
+  >: time cdbmake results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb results/CC-MAIN-2019-35/warc_lmhx/ks_0.tmp < results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb_in
+
+  real  0m4.981s
+  user  0m2.865s
+  sys   0m1.886s
+  >: ls -lh results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb_in
+  -rw-r--r-- 1 hst dc007 574M Jan  3 11:34 results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb_in
+  >: ls -lh results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb
+  -rw-r--r-- 1 hst dc007 642M Jan  3 11:34 results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb
+  >: ls -lh results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.{tsv,pickle}
+  -rw-r--r-- 1 hst dc007 5.4G Jan  2 13:41 results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.pickle
+  -rw-r--r-- 1 hst dc007 8.8G Oct  3  2023 results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv
+  >: cdbdump <results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb | head
+  +52,10:20190817225913http://85.163.4.0:9999/faces/login.jsp->1566082752
+  +58,10:20190817232933https://1.1.1.1/?source=https://www.tuppu.fi->1560198539
+  +34,10:20190822010151http://163.30.193.1/->1256804738
+  +64,10:20190825182530http://45.70.224.1:20026/index.php?tipo=minhaconta->1565613324
+  +32,10:20190818210532http://50.63.77.1/->1366940302
+  +41,10:20190825182606http://119.145.170.10:2160/->1497240281
+  +43,10:20190825142846http://71.43.189.10/dermorph/->1564555978
+  +53,10:20190825143020http://71.43.189.10/dermorph/about.html->1564555936
+  +55,10:20190825142857http://71.43.189.10/dermorph/contact.html->1564748606
+  +54,10:20190825142842http://71.43.189.10/dermorph/guided.html->1563634912
+  sing<4795>: cdbget 20190825142846http://71.43.189.10/dermorph/ <results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb
+  1564555978
+  >: cdbstats <results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb
+  records    5226422
+  d0         3920023
+  d1          748992
+  d2          269375
+  d3          124011
+  d4           64304
+  d5           36623
+  d6           21765
+  d7           13737
+  d8            8712
+  d9            5866
+  >9           13014
+  >: cdbtest <results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb
+  found: 5226422
+  different record: 0
+  bad length: 0
+  not found: 0
+  untested: 10552
+
+All good, _but_
+  >>> r=cdblib.Reader.from_file_path('results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb')
+  >>> r.get(b'xyzzy')
+  >>> r.get(b'20190825142846http://71.43.189.10/dermorph/')
+  b'1564555978'
+  >>> from timeit import Timer
+  >>> t=Timer("r.get(b'20190825142846http://71.43.189.10/dermorph/')",
+  ...       globals=globals())
+  >>> t.timeit(100000)
+  0.30162674374878407
+  >>> t.timeit(100000)
+  0.30807616002857685
+  >>> t.timeit(100000)
+  0.30170613154768944
+
+So, 100 x slower than a dict :-(.  Just checking that...
+  >>> t = Timer("b'20190825142846http://71.43.189.10/dermorph/' in d",globals={'d':d})
+  >>> r=cdblib.Reader.from_file_path('results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb')
+  >>> r.get(b'20190825142846http://71.43.189.10/dermorph/')
+  b'1564555978'
+  >>> s = Timer("r.get(b'20190825142846http://71.43.189.10/dermorph/')",
+  ...       globals=globals())
+  >>> (t.repeat(5,100000),s.repeat(5,100000))
+  ([0.005662968382239342, 0.005780909210443497, 0.005478940904140472, 0.005713008344173431, 0.005547545850276947],
+   [0.30250774696469307, 0.303345350548625, 0.3002819549292326, 0.30161340720951557, 0.30262864381074905])
+
+So cdb is about 54 x slower than the dict (roughly 0.30s vs 0.0056s per 100000 lookups); forget cdb
 ================
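
For reference, the cdbdump output in the notes above shows the record format that cdbmake reads: +klen,dlen:key->data, one record per line, with an extra newline marking the end of input. The sketch below writes that format from (key, value) byte pairs. It is not the ks2cdb.py script used above, which is not shown in this changeset; the function names cdbmake_record and write_cdbmake_input are hypothetical, and how keys and values are extracted from the .tsv file is left to the caller.

```python
import io


def cdbmake_record(key: bytes, value: bytes) -> bytes:
    """Format one record the way cdbmake expects: +klen,dlen:key->data\\n."""
    header = b"+%d,%d:" % (len(key), len(value))
    return header + key + b"->" + value + b"\n"


def write_cdbmake_input(pairs, out) -> int:
    """Write (key, value) byte pairs to a binary stream as cdbmake input.

    Returns the number of records written. Hypothetical helper, not the
    ks2cdb.py from the notes.
    """
    count = 0
    for key, value in pairs:
        out.write(cdbmake_record(key, value))
        count += 1
    out.write(b"\n")  # cdbmake treats an extra newline as end of data
    return count


if __name__ == "__main__":
    # First record from the cdbdump listing in the notes
    buf = io.BytesIO()
    write_cdbmake_input(
        [(b"20190817225913http://85.163.4.0:9999/faces/login.jsp", b"1566082752")],
        buf,
    )
    print(buf.getvalue().decode("ascii"), end="")
```

Lengths are byte counts, so keys holding non-ASCII URLs need to be encoded before they are measured.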