diff lurid3/notes.txt @ 65:ded30d0d097f default tip
baby steps with cdb
author | Henry S. Thompson <ht@inf.ed.ac.uk> |
---|---|
date | Fri, 24 Jan 2025 15:15:33 +0000 |
parents | a70ceb9d1e82 |
children |
--- a/lurid3/notes.txt Sat Jan 18 21:33:00 2025 +0000
+++ b/lurid3/notes.txt Fri Jan 24 15:15:33 2025 +0000
@@ -1168,9 +1168,35 @@
   1564555978
   >: cdbget 20190825142846http://71.43.189.10/dermorph/ <~/results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb
   1564555978
+
+OK, definitely worth trying:
+  >: python3 -c 'import db' ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb 20190825142846http://71.43.189.10/dermorph/ 10000000
+  testing...
+  2488 10
+  1564555978
+  tested
+  1.800725992769003
+A long struggle to get the module structure better, finally sorted
+cdb.pxd exposes cdb.h from cdb-0.75 (slightly updated)
+db.{pyx, pxd} define an interface cython class to hold a cdb.Cdb instance
+nndb.pyx does a stress test of lookup in a cdb file from 1% of the
+timestamps
+  >: python3 -c 'import ndb' ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb 20190825142846http://71.43.189.10/dermorph/ 10000000
+  testing... /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb b'20190825142846http://71.43.189.10/dermorph/' x 10000000
+  mv 672917132 cdb 140602889433984
+  1 2488 10
+  1564555978
+  tested
+  2.055266048759222
+
+Things to try next:
+ 1) Build a bigger .cdb w. as close to 4GB as possible
+ 2) Shift to a shared library for cdb-0.75
+ 3) Get rid of the single fixed Cdb struct instance and malloc it as
+    required
+ 4) Build and test the real harness to process .cdx files using .cdb
 
 ================
-
 Try it with the existing _per segment_ index we have for 2019-35
 
 Assuming we have to key on segment / file and offset, as reconstructing the
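
For readers unfamiliar with what the stress test in the notes is exercising, here is a minimal pure-Python sketch of the classic cdb file layout, plus a repeated-lookup timer. It is not the Cython wrapper (db.pyx) the notes describe, and it does not use cdb-0.75. The names cdb_hash, cdb_build, cdb_get and stress are hypothetical, chosen for this illustration only. The only values taken from the notes are the key and the value returned for it. Every offset in the format is an unsigned 32-bit integer, which is presumably why the notes aim for a file "as close to 4GB as possible".

```python
import struct
import time

def cdb_hash(key: bytes) -> int:
    """cdb's hash: start at 5381, then h = (h * 33) XOR byte, kept to 32 bits."""
    h = 5381
    for c in key:
        h = (((h << 5) + h) ^ c) & 0xFFFFFFFF
    return h

def cdb_build(items) -> bytes:
    """Build an in-memory cdb image from (key, value) byte-string pairs."""
    records = bytearray()
    buckets = [[] for _ in range(256)]
    pos = 2048                      # records start right after the 256-entry header
    for key, val in items:
        h = cdb_hash(key)
        buckets[h & 255].append((h, pos))
        rec = struct.pack('<II', len(key), len(val)) + key + val
        records += rec
        pos += len(rec)
    header = bytearray()
    tables = bytearray()
    for bucket in buckets:
        nslots = 2 * len(bucket)    # tables are kept half empty
        header += struct.pack('<II', pos, nslots)
        slots = [(0, 0)] * nslots
        for h, rpos in bucket:
            i = (h >> 8) % nslots
            while slots[i][1] != 0: # linear probing; position 0 marks an empty slot
                i = (i + 1) % nslots
            slots[i] = (h, rpos)
        for h, rpos in slots:
            tables += struct.pack('<II', h, rpos)
        pos += 8 * nslots
    return bytes(header + records + tables)

def cdb_get(buf, key: bytes):
    """Return the first value stored under key, or None if it is absent."""
    h = cdb_hash(key)
    tpos, nslots = struct.unpack_from('<II', buf, 8 * (h & 255))
    if nslots == 0:
        return None
    start = (h >> 8) % nslots
    for k in range(nslots):
        slot_hash, rpos = struct.unpack_from('<II', buf, tpos + 8 * ((start + k) % nslots))
        if rpos == 0:               # hit an empty slot: key is not present
            return None
        if slot_hash == h:
            klen, dlen = struct.unpack_from('<II', buf, rpos)
            if buf[rpos + 8:rpos + 8 + klen] == key:
                return bytes(buf[rpos + 8 + klen:rpos + 8 + klen + dlen])
    return None

def stress(buf, key: bytes, n: int):
    """Look the same key up n times; return (last value, elapsed seconds)."""
    val = None
    t0 = time.perf_counter()
    for _ in range(n):
        val = cdb_get(buf, key)
    return val, time.perf_counter() - t0

if __name__ == '__main__':
    # Key and value as they appear in the notes; the file here is a toy one-record image.
    key = b'20190825142846http://71.43.189.10/dermorph/'
    image = cdb_build([(key, b'1564555978')])
    print(stress(image, key, 100000))
```

Timings from this sketch are not comparable to the figures in the notes, which come from compiled code against a real ks_0.cdb. It only shows the two-probe structure (header entry, then hash slot, then record) that each lookup performs.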