cc/work: lurid3/notes.txt comparison

comparison lurid3/notes.txt @ 66:0c814f07865a

sequestration of cdb handle complete and working

author	Henry S. Thompson <ht@inf.ed.ac.uk>
date	Fri, 31 Jan 2025 13:26:07 +0000
parents	ded30d0d097f
children	24ca6ab32e47

-:ded30d0d097f
+:0c814f07865a
 mv 672917132 cdb 140602889433984
 1 2488 10
 1564555978
 tested
 2.055266048759222
+Oops, that was ndb, and nndb doesn't work!
 Things to try next:
 1) Build a bigger .cdb w. as close to 4GB as possible
 2) Shift to a shared library for cdb-0.75
 3) Get rid of the single fixed Cdb struct instance and malloc it as
 required
+3a) Remove debugging output and recompile everything
 4) Build and test the real harness to process .cdx files using .cdb
+Try 50% more, e.g. approx. 1.5 segments
+>: python3 -c "print(1.5 * 5236974)"
+7855461.0
+>: time ~/lib/python/cc/lmh/ks2cdb.py -f <(head -7855461 results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv) -c results/CC-MAIN-2019-35/warc_lmhx/ks_0.5.cdb_in
+7855461
+real    0m14.585s
+user    0m13.909s
+sys     0m3.308s
+>: time cdbmake results/CC-MAIN-2019-35/warc_lmhx/ks_0.5.cdb results/CC-MAIN-2019-35/warc_lmhx/ks_0.5.tmp < results/CC-MAIN-2019-35/warc_lmhx/ks_0.5.cdb_in
+real    0m6.075s
+user    0m3.682s
+sys     0m2.337s
+>: ls -lh results/CC-MAIN-2019-35/warc_lmhx/ks_0.5.cdb
+-rw-r--r-- 1 hst dc007 991M Jan 27 15:00 results/CC-MAIN-2019-35/warc_lmhx/ks_0.5.cdb
+>: python3 -c 'import ndb' ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb 20190825142846http://71.43.189.10/dermorph/ 10000000
+testing... /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb b'20190825142846http://71.43.189.10/dermorph/' x 10000000
+mv 672917132 cdb 140324514468736
+1 2488 10
+1564555978
+tested
+2.016317328438163
+>: python3 -c 'import ndb' ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0.5.cdb 20190825142846http://71.43.189.10/dermorph/ 10000000
+testing... /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0.5.cdb b'20190825142846http://71.43.189.10/dermorph/' x 10000000
+mv 1039033960 cdb 140207982297984
+1 2488 10
+1564555978
+tested
+2.0484518501907587
+So, that's OK, but still, would need 67 hash files == 265GB
+Wait, isn't it 4GB max???
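The 4GB ceiling comes from cdb storing file positions as 32-bit values. A back-of-envelope sketch of how many records of this shape fit under it, using only the ls -lh size and line count above. Assumptions: "991M" means 991 MiB, file size scales roughly linearly with record count, and the function names are made up for illustration.

```python
MIB = 2 ** 20
CDB_LIMIT = 2 ** 32          # 32-bit file positions, so 4 GiB at most


def bytes_per_record(file_bytes, n_records):
    """Average on-disk cost of one record, hash-table overhead included."""
    return file_bytes / n_records


def max_records(file_bytes, n_records, limit=CDB_LIMIT):
    """Estimate how many similar records fit below the size limit."""
    return int(limit / bytes_per_record(file_bytes, n_records))


# Figures from the ks_0.5.cdb build above (size assumed to be 991 MiB).
KS_05_BYTES = 991 * MIB
KS_05_RECORDS = 7855461

if __name__ == "__main__":
    print("bytes/record:", round(bytes_per_record(KS_05_BYTES, KS_05_RECORDS), 2))
    print("records under 4GiB:", max_records(KS_05_BYTES, KS_05_RECORDS))
```

On these figures a record costs a little over 130 bytes, so a single .cdb tops out somewhere around 32M records.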
+Build ks_0-9.60.cdb (as in 60% of 0-9, 31M lines):
+>: time ~/lib/python/cc/lmh/ks2cdb.py -f <(head -$((4 * 7855461)) ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv) -c ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.60.cdb_in
+31421844
+real    0m58.676s
+user    0m56.312s
+sys     0m12.306s
+>: cdbmake ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.60.cdb ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.60.tmp < ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.60.cdb_in
+>: ls -lh ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.60.cdb
+-rw-r--r-- 1 hst dc007 3.8G Jan 27 15:26 /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.60.cdb
+cdbget works:
+>: cdb-0.75/cdbget  20190818122159https://m.europapress.es/navarra/noticia-gobierno-navarra-gamesa-acuerdan-necesidad-reindustrializar-planta-alsasua-20100525100333.html <~/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.60.cdb
+1566130319
+But cdbtest and cdbstats die with that as input.  Looks like it's when int
+overflows to -2GB
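A minimal sketch of the suspected failure: an offset past 2^31 held in a C int comes back negative, while the same bits read as uint32 are fine. This only mimics the C behaviour via ctypes and is not the cdb code itself; the 3.8e9 offset is an illustrative value for a position late in a 3.8G file, and the helper names are made up.

```python
import ctypes


def as_int32(offset):
    """Value seen when the low 32 bits of offset are read as a signed C int."""
    return ctypes.c_int32(offset).value


def as_uint32(offset):
    """Value seen when the low 32 bits of offset are read as an unsigned int."""
    return ctypes.c_uint32(offset).value


if __name__ == "__main__":
    # Illustrative offsets: just below 2^31, exactly 2^31, and well past it.
    for offset in (2 ** 31 - 1, 2 ** 31, 3_800_000_000):
        print(offset, "int:", as_int32(offset), "uint32:", as_uint32(offset))
```

Anything below 2^31 is the same either way, which would explain why the smaller files never tripped over it.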
+Careful review of cdb.c/h to change int to uint32 wherever an offset
+in the mmap is stored finally got odb working.  In due course should
+try to fix and fork all of cdb
+Still crashing with 0-9.60
+Ah, a further int problem -- edited seek-pos to use uint32, now
+>: cdb-0.75/cdbtest <~/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.60.cdb
+found: 31403644
+different record: 0
+bad length: 0
+not found: 0
+untested: 18200
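Quick sanity check on those counts: together they should account for every line fed to ks2cdb.py (31421844). The parser is an illustrative helper, not part of cdb, and assumes cdbtest prints "label: count" lines as shown above.

```python
def parse_counts(text):
    """Turn 'label: count' lines into a dict of integer counts."""
    counts = {}
    for line in text.splitlines():
        label, sep, value = line.rpartition(":")
        if sep and value.strip().isdigit():
            counts[label.strip()] = int(value)
    return counts


# Output copied from the cdbtest run on ks_0-9.60.cdb.
CDBTEST_OUTPUT = """\
found: 31403644
different record: 0
bad length: 0
not found: 0
untested: 18200
"""

EXPECTED_LINES = 31421844    # line count reported when building the input

counts = parse_counts(CDBTEST_OUTPUT)
total = sum(counts.values())

if __name__ == "__main__":
    print(total, "matches input" if total == EXPECTED_LINES else "MISMATCH")
```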
+Finally got a version of nndb (which cimports db as well as cdb) to
+work by moving _all_ uses of _c_cdb into db.  Still comparable in
+speed to the mixed version in e.g. odb:
+>: python3 -c 'import odb' ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb 20190825142846http://71.43.189.10/dermorph/ 10000000
+testing...
+mv 672917132 cdb 140317635072864
+2488 10
+1564555978
+tested
+2.1426388323307037
+>: python3 -c 'import nndb' ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb 20190825142846http://71.43.189.10/dermorph/ 10000000
+testing... /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb b'20190825142846http://71.43.189.10/dermorph/' x 10000000
+1 2488 10
+1564555978
+tested
+2.0193762965500355
+2.702429808676243 10000000
+Interesting that just adding a counter to the test loop slows it down
+so much:
+'cfind(probe)',
+vs
+'(X:=X+1) if cfind(probe)==1 else None',
+setup = 'global X'
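A self-contained sketch of that comparison with timeit. The cfind here is a pure-Python stand-in that always reports a hit (the real one lives in the compiled module), so absolute times will not match the figures above; only the shape of the two statements is the same. X has to exist in the namespace timeit runs in, hence the explicit globals dict.

```python
import timeit


def cfind(probe):
    """Hypothetical stand-in for the compiled lookup: always 'found'."""
    return 1


def compare(number=100_000):
    """Time the bare call against the call plus a global hit counter."""
    ns = {"cfind": cfind,
          "probe": b"20190825142846http://71.43.189.10/dermorph/",
          "X": 0}
    plain = timeit.timeit("cfind(probe)", globals=ns, number=number)
    counted = timeit.timeit("(X:=X+1) if cfind(probe)==1 else None",
                            setup="global X", globals=ns, number=number)
    # ns["X"] now holds the number of hits counted by the second statement.
    return plain, counted, ns["X"]


if __name__ == "__main__":
    plain, counted, hits = compare()
    print(plain, counted, hits)
```

The counted variant does a global load, an add, a global store and a comparison per iteration, all in interpreted bytecode, which is not small next to a lookup that costs about 0.2 microseconds.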
 ================
 Try it with the existing _per segment_ index we have for 2019-35
 Assuming we have to key on segment / file and offset, as reconstructing the
