cc/work: lurid3/notes.txt comparison

comparison lurid3/notes.txt @ 65:ded30d0d097f default tip

baby steps with cdb

author	Henry S. Thompson <ht@inf.ed.ac.uk>
date	Fri, 24 Jan 2025 15:15:33 +0000
parents	a70ceb9d1e82
children

comparison


-:a70ceb9d1e82
+:ded30d0d097f
 1.9035462848842144
 >: dd ibs=1 skip=2488 count=10 if=~/results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb of=/dev/stdout
 1564555978
 >: cdbget 20190825142846http://71.43.189.10/dermorph/ <~/results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb
 1564555978
+OK, definitely worth trying:
+>: python3 -c 'import db' ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb 20190825142846http://71.43.189.10/dermorph/ 10000000
+testing...
+2488 10
+1564555978
+tested
+1.800725992769003
+A long struggle to get the module structure better, finally sorted
+cdb.pxd exposes cdb.h from cdb-0.75 (slightly updated)
+db.{pyx, pxd} define a Cython interface class that holds a cdb.Cdb instance
+ndb.pyx does a stress test of lookup in a cdb file from 1% of the
+timestamps
+>: python3 -c 'import ndb' ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb 20190825142846http://71.43.189.10/dermorph/ 10000000
+testing... /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb b'20190825142846http://71.43.189.10/dermorph/' x 10000000
+mv 672917132 cdb 140602889433984
+1 2488 10
+1564555978
+tested
+2.055266048759222
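For reference, a minimal pure-Python sketch of the lookup being timed above. It is not the Cython code in db.pyx / ndb.pyx, just an illustration of the published cdb file layout (2048-byte header of 256 table pointers, then records, then hash tables). The names cdb_hash, cdb_build and cdb_find are made up for this sketch. cdb_find returns the data position and length, i.e. the same kind of pair as the "2488 10" printed above, which is what the dd command reads directly.

```python
import struct


def cdb_hash(key: bytes) -> int:
    """cdb hash: start at 5381, then h = (h * 33) ^ byte, kept to 32 bits."""
    h = 5381
    for c in key:
        h = (((h << 5) + h) ^ c) & 0xFFFFFFFF
    return h


def cdb_build(items) -> bytes:
    """Serialise (key, value) byte pairs into an in-memory cdb image."""
    records = bytearray()
    tables = [[] for _ in range(256)]
    pos = 2048  # records start right after the fixed-size header
    for key, val in items:
        h = cdb_hash(key)
        tables[h & 255].append((h, pos))
        records += struct.pack("<II", len(key), len(val)) + key + val
        pos += 8 + len(key) + len(val)
    header = bytearray()
    slot_bytes = bytearray()
    for entries in tables:
        nslots = 2 * len(entries)  # each table is kept half empty
        header += struct.pack("<II", pos, nslots)
        slots = [(0, 0)] * nslots
        for h, rpos in entries:
            i = (h >> 8) % nslots
            while slots[i][1] != 0:  # linear probing on collision
                i = (i + 1) % nslots
            slots[i] = (h, rpos)
        for h, rpos in slots:
            slot_bytes += struct.pack("<II", h, rpos)
        pos += 8 * nslots
    return bytes(header + records + slot_bytes)


def cdb_find(buf: bytes, key: bytes):
    """Return (data_pos, data_len) for key, or None if it is absent."""
    h = cdb_hash(key)
    tpos, nslots = struct.unpack_from("<II", buf, 8 * (h & 255))
    if nslots == 0:
        return None
    start = (h >> 8) % nslots
    for n in range(nslots):
        spos = tpos + 8 * ((start + n) % nslots)
        slot_hash, rpos = struct.unpack_from("<II", buf, spos)
        if rpos == 0:  # empty slot ends the probe sequence
            return None
        if slot_hash == h:
            klen, dlen = struct.unpack_from("<II", buf, rpos)
            if buf[rpos + 8:rpos + 8 + klen] == key:
                return rpos + 8 + klen, dlen
    return None
```

Slicing buf[data_pos:data_pos + data_len] then plays the role of the dd ibs=1 skip=... count=... call.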
+Things to try next:
+1) Build a bigger .cdb, as close to 4GB as possible
+2) Shift to a shared library for cdb-0.75
+3) Get rid of the single fixed Cdb struct instance and malloc one as
+required
+4) Build and test the real harness to process .cdx files using .cdb
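A rough sizing sketch for item 1, assuming the standard cdb layout: 2048 bytes of header plus 24 bytes of overhead per record (8 bytes of lengths and two 8-byte hash slots), with 32-bit file offsets so the whole file has to stay within about 2**32 bytes. Fixed key and value lengths are a simplifying assumption (real keys vary with URL length), and the function names are made up for this sketch.

```python
HEADER_BYTES = 2048        # 256 table pointers of 8 bytes each
PER_RECORD_OVERHEAD = 24   # 8 bytes of lengths + two 8-byte hash slots
LIMIT = 2 ** 32            # offsets are 32-bit, taken here as the size cap


def cdb_file_size(n_records: int, klen: int, dlen: int) -> int:
    """Approximate size of a cdb holding n_records fixed-length records."""
    return HEADER_BYTES + n_records * (PER_RECORD_OVERHEAD + klen + dlen)


def max_records(klen: int, dlen: int, limit: int = LIMIT) -> int:
    """Largest record count whose cdb still fits within limit bytes."""
    return (limit - HEADER_BYTES) // (PER_RECORD_OVERHEAD + klen + dlen)
```

For example, max_records(43, 10) gives an estimate for keys the length of the test key used above with 10-byte values.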
 ================
 Try it with the existing _per segment_ index we have for 2019-35
 Assuming we have to key on segment / file and offset, as reconstructing the
 proper index key is painful, buggy, and going to change with the year.
