changeset 66:0c814f07865a default tip

sequestration of cdb handle complete and working
author Henry S. Thompson <ht@inf.ed.ac.uk>
date Fri, 31 Jan 2025 13:26:07 +0000
parents ded30d0d097f
children
files lurid3/notes.txt
diffstat 1 files changed, 92 insertions(+), 0 deletions(-)
--- a/lurid3/notes.txt	Fri Jan 24 15:15:33 2025 +0000
+++ b/lurid3/notes.txt	Fri Jan 31 13:26:07 2025 +0000
@@ -1188,13 +1188,105 @@
   1564555978
   tested
   2.055266048759222
+Oops, that was ndb, and nndb doesn't work!
 
 Things to try next:
  1) Build a bigger .cdb w. as close to 4GB as possible
  2) Shift to a shared library for cdb-0.75
  3) Get rid of the single fixed Cdb struct instance and malloc it as
     required
+ 3a) Remove debugging output and recompile everything
  4) Build and test the real harness to process .cdx files using .cdb
+
+Try 50% more, i.e. approx. 1.5 segments
+  >: python3 -c "print(1.5 * 5236974)"
+  7855461.0
+  >: time ~/lib/python/cc/lmh/ks2cdb.py -f <(head -7855461 results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv) -c results/CC-MAIN-2019-35/warc_lmhx/ks_0.5.cdb_in
+  7855461
+  real    0m14.585s
+  user    0m13.909s
+  sys     0m3.308s
+  >: time cdbmake results/CC-MAIN-2019-35/warc_lmhx/ks_0.5.cdb results/CC-MAIN-2019-35/warc_lmhx/ks_0.5.tmp < results/CC-MAIN-2019-35/warc_lmhx/ks_0.5.cdb_in
+  real    0m6.075s
+  user    0m3.682s
+  sys     0m2.337s
+  >: ls -lh results/CC-MAIN-2019-35/warc_lmhx/ks_0.5.cdb
+  -rw-r--r-- 1 hst dc007 991M Jan 27 15:00 results/CC-MAIN-2019-35/warc_lmhx/ks_0.5.cdb
+  >: python3 -c 'import ndb' ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb 20190825142846http://71.43.189.10/dermorph/ 10000000
+  testing... /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb b'20190825142846http://71.43.189.10/dermorph/' x 10000000
+  mv 672917132 cdb 140324514468736
+  1 2488 10
+  1564555978
+  tested
+  2.016317328438163
+  >: python3 -c 'import ndb' ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0.5.cdb 20190825142846http://71.43.189.10/dermorph/ 10000000
+  testing... /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0.5.cdb b'20190825142846http://71.43.189.10/dermorph/' x 10000000
+  mv 1039033960 cdb 140207982297984
+  1 2488 10
+  1564555978
+  tested
+  2.0484518501907587
+
+So, that's OK, but still, would need 67 hash files == 265GB
+
+Wait, isn't it 4GB max???
+Build ks_0-9.60.cdb (as in 60% of 0-9, 31M lines):
+  >: time ~/lib/python/cc/lmh/ks2cdb.py -f <(head -$((4 * 7855461)) ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv) -c ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.60.cdb_in
+  31421844
+  real    0m58.676s
+  user    0m56.312s
+  sys     0m12.306s
+  >: cdbmake ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.60.cdb ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.60.tmp < ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.60.cdb_in
+  >: ls -lh ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.60.cdb
+  -rw-r--r-- 1 hst dc007 3.8G Jan 27 15:26 /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.60.cdb
+
+cdbget works:
+  >: cdb-0.75/cdbget  20190818122159https://m.europapress.es/navarra/noticia-gobierno-navarra-gamesa-acuerdan-necesidad-reindustrializar-planta-alsasua-20100525100333.html <~/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.60.cdb
+  1566130319
+
+But cdbtest and cdbstats die with that as input.  Looks like it's when int
+overflows to -2GB
+
+Careful review of cdb.c/h, changing int to uint32 wherever an offset
+into the mmap is stored, finally got odb working.  In due course should
+try to fix and fork all of cdb
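
A minimal sketch of the suspected failure, in Python rather than the C of
cdb-0.75. The cdb format stores positions as 32-bit little-endian unsigned
values, so any offset at or beyond 2GB comes out negative if it is read as
a signed 32-bit int. The offset 3000000000 is a made-up value chosen to fall
inside a 3.8G file, and the helper names are hypothetical; this does not
reproduce the actual cdb.c code.

```python
import struct

def read_as_int32(raw: bytes) -> int:
    """Interpret 4 little-endian bytes as a signed 32-bit int (the buggy reading)."""
    return struct.unpack("<i", raw)[0]

def read_as_uint32(raw: bytes) -> int:
    """Interpret 4 little-endian bytes as an unsigned 32-bit int (the intended reading)."""
    return struct.unpack("<I", raw)[0]

# Hypothetical record offset beyond the 2GB mark (assumption, not from the notes).
offset = 3_000_000_000
raw = struct.pack("<I", offset)

signed = read_as_int32(raw)
unsigned = read_as_uint32(raw)

print(signed)    # -1294967296
print(unsigned)  # 3000000000

# Largest position a 32-bit unsigned offset can address, i.e. the 4GB ceiling.
max_offset = 2**32 - 1
print(max_offset)  # 4294967295
```

Offsets below 2GB read the same either way, which would fit the smaller
.cdb files working before the int to uint32 change.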
+
+Still crashing with 0-9.60
+
+Ah, a further int problem -- edited seek-pos to use uint32, now
+
+  >: cdb-0.75/cdbtest <~/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.60.cdb
+  found: 31403644
+  different record: 0
+  bad length: 0
+  not found: 0
+  untested: 18200
+
+Finally got a version of nndb (which cimports db as well as cdb) to
+work by moving _all_ uses of _c_cdb into db.  Still comparable in
+speed to the mixed version in e.g. odb:
+
+  >: python3 -c 'import odb' ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb 20190825142846http://71.43.189.10/dermorph/ 10000000
+  testing...
+  mv 672917132 cdb 140317635072864
+  2488 10
+  1564555978
+  tested
+  2.1426388323307037
+
+  >: python3 -c 'import nndb' ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb 20190825142846http://71.43.189.10/dermorph/ 10000000
+  testing... /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb b'20190825142846http://71.43.189.10/dermorph/' x 10000000
+  1 2488 10
+  1564555978
+  tested
+  2.0193762965500355
+  2.702429808676243 10000000
+
+Interesting that just adding a counter to the test loop slows it down
+so much:
+  'cfind(probe)',
+vs
+  '(X:=X+1) if cfind(probe)==1 else None',
+   setup = 'global X'
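
A rough sketch of how the two timeit statements above can be compared side
by side. The cfind here is a hypothetical pure-Python stand-in for the
Cython one (a dict lookup returning 1 on a hit), and N is much smaller than
the 10000000 used above, so the absolute times will not match the ones
recorded in these notes. An explicit namespace dict is passed as globals so
that 'global X' in the setup refers to a counter that can be inspected
afterwards.

```python
import timeit

# Hypothetical stand-in for the Cython cfind: 1 if the key is present, else 0.
TABLE = {b"20190825142846http://71.43.189.10/dermorph/": b"1564555978"}

def cfind(key):
    return 1 if key in TABLE else 0

N = 200_000  # assumption: kept small so the sketch runs quickly

# Namespace used as the globals of the timed code; X is the hit counter.
ns = {
    "cfind": cfind,
    "probe": b"20190825142846http://71.43.189.10/dermorph/",
    "X": 0,
}

# Plain lookup, as in the first statement.
plain = timeit.timeit("cfind(probe)", globals=ns, number=N)

# Lookup plus a counter increment on each hit, as in the second statement.
counted = timeit.timeit(
    "(X:=X+1) if cfind(probe)==1 else None",
    setup="global X",
    globals=ns,
    number=N,
)

print("plain  ", plain)
print("counted", counted, ns["X"])
```

The counted statement does a comparison, a conditional and a global store
on every iteration in addition to the call, which is one plausible source
of the extra time.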
 ================
 
 Try it with the existing _per segment_ index we have for 2019-35