changeset 65:ded30d0d097f default tip

baby steps with cdb
author Henry S. Thompson <ht@inf.ed.ac.uk>
date Fri, 24 Jan 2025 15:15:33 +0000
parents a70ceb9d1e82
children
files lurid3/notes.txt
diffstat 1 files changed, 27 insertions(+), 1 deletions(-)
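
For orientation, the kind of lookup being timed in the notes below can be sketched in pure Python. This is only an illustration of the standard cdb file layout: a 2048-byte header of 256 (position, slot count) pairs, then the records, then the hash tables. It is not the Cython db/ndb code from this changeset. The names cdb_hash, cdb_build and cdb_get and the second key are hypothetical. The first key and value are the ones shown in the cdbget example.

```python
import struct
import timeit


def cdb_hash(key: bytes) -> int:
    """cdb hash: start at 5381, then h = ((h << 5) + h) ^ c, modulo 2**32."""
    h = 5381
    for c in key:
        h = (((h << 5) + h) ^ c) & 0xFFFFFFFF
    return h


def cdb_build(items) -> bytes:
    """Serialise (key, value) byte pairs into the cdb layout, in memory."""
    records = bytearray()
    buckets = [[] for _ in range(256)]
    pos = 2048  # records start right after the 256-entry header
    for key, val in items:
        records += struct.pack("<II", len(key), len(val)) + key + val
        h = cdb_hash(key)
        buckets[h & 255].append((h, pos))
        pos += 8 + len(key) + len(val)
    header = bytearray()
    tables = bytearray()
    for bucket in buckets:
        nslots = 2 * len(bucket)  # each table has twice as many slots as entries
        header += struct.pack("<II", pos, nslots)
        slots = [(0, 0)] * nslots
        for h, p in bucket:
            i = (h >> 8) % nslots
            while slots[i][1] != 0:  # linear probing on collision
                i = (i + 1) % nslots
            slots[i] = (h, p)
        for h, p in slots:
            tables += struct.pack("<II", h, p)
        pos += 8 * nslots
    return bytes(header + records + tables)


def cdb_get(data: bytes, key: bytes):
    """Return the first value stored under key, or None if it is absent."""
    h = cdb_hash(key)
    tpos, nslots = struct.unpack_from("<II", data, 8 * (h & 255))
    if nslots == 0:
        return None
    start = (h >> 8) % nslots
    for n in range(nslots):
        slot = tpos + 8 * ((start + n) % nslots)
        slot_hash, rpos = struct.unpack_from("<II", data, slot)
        if rpos == 0:  # empty slot: the key is not in the table
            return None
        if slot_hash == h:
            klen, dlen = struct.unpack_from("<II", data, rpos)
            kstart = rpos + 8
            if data[kstart:kstart + klen] == key:
                return data[kstart + klen:kstart + klen + dlen]
    return None


KEY = b"20190825142846http://71.43.189.10/dermorph/"
# The second entry is a made-up filler record, present only for illustration.
db = cdb_build([(KEY, b"1564555978"), (b"hypothetical-key", b"0")])
print(cdb_get(db, KEY).decode())

# A far smaller repeat count than the 10000000 used in the notes.
elapsed = timeit.timeit(lambda: cdb_get(db, KEY), number=10000)
print(elapsed)
```

All positions are 32-bit little-endian integers, which is why a .cdb file cannot grow beyond 4GB.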
--- a/lurid3/notes.txt	Sat Jan 18 21:33:00 2025 +0000
+++ b/lurid3/notes.txt	Fri Jan 24 15:15:33 2025 +0000
@@ -1168,9 +1168,35 @@
   1564555978
   >: cdbget 20190825142846http://71.43.189.10/dermorph/ <~/results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb
   1564555978
+
+OK, definitely worth trying:
+  >: python3 -c 'import db' ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb 20190825142846http://71.43.189.10/dermorph/ 10000000
+  testing...
+  2488 10
+  1564555978
+  tested
+  1.800725992769003
+After a long struggle to improve the module structure, finally sorted:
+cdb.pxd exposes cdb.h from cdb-0.75 (slightly updated)
+db.{pyx, pxd} define an interface Cython class to hold a cdb.Cdb instance
+ndb.pyx does a stress test of lookup in a cdb file from 1% of the
+timestamps
+  >: python3 -c 'import ndb' ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb 20190825142846http://71.43.189.10/dermorph/ 10000000
+  testing... /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb b'20190825142846http://71.43.189.10/dermorph/' x 10000000
+  mv 672917132 cdb 140602889433984
+  1 2488 10
+  1564555978
+  tested
+  2.055266048759222
+
+Things to try next:
+ 1) Build a bigger .cdb, as close to 4GB as possible
+ 2) Shift to a shared library for cdb-0.75
+ 3) Get rid of the single fixed Cdb struct instance and malloc it as
+    required
+ 4) Build and test the real harness to process .cdx files using .cdb
 ================
 
-
 Try it with the existing _per segment_ index we have for 2019-35
 
 Assuming we have to key on segment / file and offset, as reconstructing the