# HG changeset patch
# User Henry S. Thompson
# Date 1738329967 0
# Node ID 0c814f07865a9ab84eaee0411e694b6ecc080795
# Parent ded30d0d097fbfa33751b7eca62302beb7ef2c04
sequestration of cdb handle complete and working
diff -r ded30d0d097f -r 0c814f07865a lurid3/notes.txt
--- a/lurid3/notes.txt Fri Jan 24 15:15:33 2025 +0000
+++ b/lurid3/notes.txt Fri Jan 31 13:26:07 2025 +0000
@@ -1188,13 +1188,105 @@
 1564555978
 tested
 2.055266048759222
+Oops, that was ndb, and nndb doesn't work!
 Things to try next:
 1) Build a bigger .cdb w. as close to 4GB as possible
 2) Shift to a shared library for cdb-0.75
 3) Get rid of the single fixed Cdb struct instance and malloc it as required
+ 3a) Remove debugging output and recompile everything
 4) Build and test the real harness to process .cdx files using .cdb
+
+Try 50% more, e.g. approx. 1.5 segments
+ >: python3 -c "print(1.5 * 5236974)"
+ 7855461.0
+ >: time ~/lib/python/cc/lmh/ks2cdb.py -f <(head -7855461 results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv) -c results/CC-MAIN-2019-35/warc_lmhx/ks_0.5.cdb_in7855461
+ real 0m14.585s
+ user 0m13.909s
+ sys 0m3.308s
+ >: time cdbmake results/CC-MAIN-2019-35/warc_lmhx/ks_0.5.cdb results/CC-MAIN-2019-35/warc_lmhx/ks_0.5.tmp < results/CC-MAIN-2019-35/warc_lmhx/ks_0.5.cdb_in
+ real 0m6.075s
+ user 0m3.682s
+ sys 0m2.337s
+ >: ls -lh results/CC-MAIN-2019-35/warc_lmhx/ks_0.5.cdb
+ -rw-r--r-- 1 hst dc007 991M Jan 27 15:00 results/CC-MAIN-2019-35/warc_lmhx/ks_0.5.cdb
+ >: python3 -c 'import ndb' ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb 20190825142846http://71.43.189.10/dermorph/ 10000000
+ testing... /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb b'20190825142846http://71.43.189.10/dermorph/' x 10000000
+ mv 672917132 cdb 140324514468736
+ 1 2488 10
+ 1564555978
+ tested
+ 2.016317328438163
+ >: python3 -c 'import ndb' ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0.5.cdb 20190825142846http://71.43.189.10/dermorph/ 10000000
+ testing... /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0.5.cdb b'20190825142846http://71.43.189.10/dermorph/' x 10000000
+ mv 1039033960 cdb 140207982297984
+ 1 2488 10
+ 1564555978
+ tested
+ 2.0484518501907587
+
+So, that's OK, but still, would need 67 hash files == 265GB
+
+Wait, isn't it 4GB max???
+Build ks_0-9.60.cdb, (as in 60% of 0-9, 31M lines ):
+ >: time ~/lib/python/cc/lmh/ks2cdb.py -f <(head -$((4 * 7855461)) ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv) -c ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.60.cdb_in
+ 31421844
+ real 0m58.676s
+ user 0m56.312s
+ sys 0m12.306s
+ >: cdbmake ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.60.cdb ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.60.tmp < ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.60.cdb_in
+ >: ls -lh ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.60.cdb
+-rw-r--r-- 1 hst dc007 3.8G Jan 27 15:26 /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.60.cdb
+
+cdbget works:
+ >: cdb-0.75/cdbget 20190818122159https://m.europapress.es/navarra/noticia-gobierno-navarra-gamesa-acuerdan-necesidad-reindustrializar-planta-alsasua-20100525100333.html <~/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.60.cdb
+ 1566130319
+
+But cdbtest and cdbstats die with that as input. Looks like it's when int
+overflows to -2MB
+
+Careful review of cdb.c/h to change int to uint32 whenever an offset
+in the mmap is stored finally got odb working, in due course should
+try to fix and fork all of cdb
+
+Still crashing with 0-9.60
+
+Ah, a further int pblm -- editted seek-pos to use uint32, now
+
+ >: cdb-0.75/cdbtest <~/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.60.cdb
+ found: 31403644
+ different record: 0
+ bad length: 0
+ not found: 0
+ untested: 18200
+
+Finally got a version of nndb (which cimports db as well as cdb) to
+work by moving _all_ uses of _c_cdb into db. Still comparable in
+speed to the mixed version in e.g. odb:
+
+ >: python3 -c 'import odb' ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb 20190825142846http://71.43.189.10/dermorph/ 10000000
+ testing...
+ mv 672917132 cdb 140317635072864
+ 2488 10
+ 1564555978
+ tested
+ 2.1426388323307037
+
+ >: python3 -c 'import nndb' ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb 20190825142846http://71.43.189.10/dermorph/ 10000000
+ testing... /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb b'20190825142846http://71.43.189.10/dermorph/' x 10000000
+ 1 2488 10
+ 1564555978
+ tested
+ 2.0193762965500355
+ 2.702429808676243 10000000
+
+Interesting that just adding a counter to the test loop slows it down
+so much:
+ 'cfind(probe)',
+vs
+ '(X:=X+1) if cfind(probe)==1 else None',
+ setup = 'global X'
 ================
 Try it with the existing _per segment_ index we have for 2019-35