comparison lurid3/notes.txt @ 65:ded30d0d097f default tip

baby steps with cdb
author Henry S. Thompson <ht@inf.ed.ac.uk>
date Fri, 24 Jan 2025 15:15:33 +0000
parents a70ceb9d1e82
children
64:a70ceb9d1e82 65:ded30d0d097f
1166 1.9035462848842144 1166 1.9035462848842144
1167 >: dd ibs=1 skip=2488 count=10 if=~/results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb of=/dev/stdout 1167 >: dd ibs=1 skip=2488 count=10 if=~/results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb of=/dev/stdout
1168 1564555978 1168 1564555978
1169 >: cdbget 20190825142846http://71.43.189.10/dermorph/ <~/results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb 1169 >: cdbget 20190825142846http://71.43.189.10/dermorph/ <~/results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb
1170 1564555978 1170 1564555978
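The dd command above reads 10 bytes starting at byte offset 2488, and cdbget returns the same 10 bytes for the key. A minimal Python sketch of that raw read follows. The helper name read_span and the scratch file are assumptions for illustration; it does not touch ks_0.cdb.

```python
import os
import tempfile


def read_span(path, offset, length):
    """Return `length` bytes starting at byte `offset` of the file at `path`."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


# Scratch file standing in for the .cdb (hypothetical contents):
# 2488 filler bytes, then the 10-byte value, then some trailing bytes.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\0" * 2488 + b"1564555978" + b"trailing bytes")
demo_path = tmp.name

value = read_span(demo_path, 2488, 10)
print(value.decode("ascii"))
os.remove(demo_path)
```

With the real file, the same call would be read_span(path_to_ks_0_cdb, 2488, 10).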
1171
1172 OK, definitely worth trying:
1173 >: python3 -c 'import db' ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb 20190825142846http://71.43.189.10/dermorph/ 10000000
1174 testing...
1175 2488 10
1176 1564555978
1177 tested
1178 1.800725992769003
1179 After a long struggle, the module structure is finally sorted out:
1180 cdb.pxd exposes cdb.h from cdb-0.75 (slightly updated)
1181 db.{pyx, pxd} define an interface Cython class to hold a cdb.Cdb instance
1182 ndb.pyx does a stress test of lookup in a cdb file from 1% of the
1183 timestamps
1184 >: python3 -c 'import ndb' ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb 20190825142846http://71.43.189.10/dermorph/ 10000000
1185 testing... /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb b'20190825142846http://71.43.189.10/dermorph/' x 10000000
1186 mv 672917132 cdb 140602889433984
1187 1 2488 10
1188 1564555978
1189 tested
1190 2.055266048759222
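The "2488 10" line in the output above appears to be the position and length of the value found for the key. As a cross-check of what a lookup has to do, here is a pure-Python sketch that builds a tiny cdb image in memory and finds a key in it. It assumes the standard cdb layout (a 2048-byte header of 256 table pointers, then records, then hash tables, all little-endian 32-bit). It is not the Cython code in db.pyx, and the names cdb_hash, cdb_make and cdb_find are illustrative only.

```python
import struct


def cdb_hash(key: bytes) -> int:
    """The cdb hash: start at 5381, then h = (h * 33) ^ byte, kept to 32 bits."""
    h = 5381
    for c in key:
        h = (((h << 5) + h) ^ c) & 0xFFFFFFFF
    return h


def cdb_make(items):
    """Build a cdb image in memory from (key, value) byte pairs."""
    body = bytearray()
    buckets = [[] for _ in range(256)]
    pos = 2048  # records start right after the 256 * 8 byte header
    for key, value in items:
        body += struct.pack("<II", len(key), len(value)) + key + value
        h = cdb_hash(key)
        buckets[h & 255].append((h, pos))
        pos += 8 + len(key) + len(value)
    header = bytearray()
    tables = bytearray()
    for bucket in buckets:
        nslots = 2 * len(bucket)  # each table has twice as many slots as entries
        header += struct.pack("<II", pos, nslots)
        slots = [(0, 0)] * nslots
        for h, rec_pos in bucket:
            i = (h >> 8) % nslots
            while slots[i][1] != 0:  # linear probing to the next free slot
                i = (i + 1) % nslots
            slots[i] = (h, rec_pos)
        for slot in slots:
            tables += struct.pack("<II", *slot)
        pos += 8 * nslots
    return bytes(header + body + tables)


def cdb_find(image: bytes, key: bytes):
    """Return (data position, data length) for key, or None if absent."""
    h = cdb_hash(key)
    table_pos, nslots = struct.unpack_from("<II", image, 8 * (h & 255))
    if nslots == 0:
        return None
    start = (h >> 8) % nslots
    for step in range(nslots):
        slot = table_pos + 8 * ((start + step) % nslots)
        slot_hash, rec_pos = struct.unpack_from("<II", image, slot)
        if rec_pos == 0:  # empty slot ends the probe sequence
            return None
        if slot_hash == h:
            klen, dlen = struct.unpack_from("<II", image, rec_pos)
            if image[rec_pos + 8:rec_pos + 8 + klen] == key:
                return rec_pos + 8 + klen, dlen
    return None


# Toy database: the key and value are the ones from the transcript,
# the second record is made up.
key = b"20190825142846http://71.43.189.10/dermorph/"
image = cdb_make([(key, b"1564555978"), (b"other", b"x")])
dpos, dlen = cdb_find(image, key)
print(dpos, dlen)
print(image[dpos:dpos + dlen].decode("ascii"))
```

The position printed for the toy image will differ from 2488, since the real ks_0.cdb holds other records ahead of this one.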
1191
1192 Things to try next:
1193 1) Build a bigger .cdb, as close to 4GB as possible
1194 2) Shift to a shared library for cdb-0.75
1195 3) Get rid of the single fixed Cdb struct instance and malloc it as
1196 required
1197 4) Build and test the real harness to process .cdx files using .cdb
1171 ================ 1198 ================
1172
1173 1199
1174 Try it with the existing _per segment_ index we have for 2019-35 1200 Try it with the existing _per segment_ index we have for 2019-35
1175 1201
1176 Assuming we have to key on segment / file and offset, as reconstructing the 1202 Assuming we have to key on segment / file and offset, as reconstructing the
1177 proper index key is such a pain / buggy / is going to change with the year. 1203 proper index key is such a pain / buggy / is going to change with the year.