comparison lurid3/notes.txt @ 65:ded30d0d097f default tip
baby steps with cdb
author | Henry S. Thompson <ht@inf.ed.ac.uk> |
---|---|
date | Fri, 24 Jan 2025 15:15:33 +0000 |
parents | a70ceb9d1e82 |
children |
64:a70ceb9d1e82 | 65:ded30d0d097f |
---|---|
1166 1.9035462848842144 | 1166 1.9035462848842144 |
1167 >: dd ibs=1 skip=2488 count=10 if=~/results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb of=/dev/stdout | 1167 >: dd ibs=1 skip=2488 count=10 if=~/results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb of=/dev/stdout |
1168 1564555978 | 1168 1564555978 |
1169 >: cdbget 20190825142846http://71.43.189.10/dermorph/ <~/results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb | 1169 >: cdbget 20190825142846http://71.43.189.10/dermorph/ <~/results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb |
1170 1564555978 | 1170 1564555978 |
1171 | |
1172 OK, definitely worth trying: | |
1173 >: python3 -c 'import db' ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb 20190825142846http://71.43.189.10/dermorph/ 10000000 | |
1174 testing... | |
1175 2488 10 | |
1176 1564555978 | |
1177 tested | |
1178 1.800725992769003 | |
1179 A long struggle to get the module structure better, finally sorted | |
1180 cdb.pxd exposes cdb.h from cdb-0.75 (slightly updated) | |
1181 db.{pyx, pxd} define an interface cython class to hold a cdb.Cdb instance | |
1182 ndb.pyx does a stress test of lookup in a cdb file from 1% of the | |
1183 timestamps | |
1184 >: python3 -c 'import ndb' ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb 20190825142846http://71.43.189.10/dermorph/ 10000000 | |
1185 testing... /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb b'20190825142846http://71.43.189.10/dermorph/' x 10000000 | |
1186 mv 672917132 cdb 140602889433984 | |
1187 1 2488 10 | |
1188 1564555978 | |
1189 tested | |
1190 2.055266048759222 | |
1191 | |
1192 Things to try next: | |
1193 1) Build a bigger .cdb w. as close to 4GB as possible | |
1194 2) Shift to a shared library for cdb-0.75 | |
1195 3) Get rid of the single fixed Cdb struct instance and malloc it as | |
1196 required | |
1197 4) Build and test the real harness to process .cdx files using .cdb | |
1171 ================ | 1198 ================ |
1172 | |
1173 | 1199 |
1174 Try it with the existing _per segment_ index we have for 2019-35 | 1200 Try it with the existing _per segment_ index we have for 2019-35 |
1175 | 1201 |
1176 Assuming we have to key on segment / file and offset, as reconstructing the | 1202 Assuming we have to key on segment / file and offset, as reconstructing the |
1177 proper index key is such a pain / buggy / is going to change with the year. | 1203 proper index key is such a pain / buggy / is going to change with the year. |
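
For readers unfamiliar with the file format behind the dd and cdbget commands in the notes, here is a minimal pure-Python sketch of a cdb lookup. It assumes the standard cdb layout (a 2048-byte header of 256 little-endian (position, slot count) pairs, records stored as key length, data length, key, data, and open-addressed hash tables at the end). The names build_cdb and cdb_find are hypothetical helpers for illustration and are not part of cdb-0.75 or of the Cython modules described in the notes. The (position, length) pair returned by cdb_find is presumably what the test harness prints as "2488 10", and it is the same pair that dd uses as skip= and count=.

```python
import struct


def cdb_hash(key: bytes) -> int:
    """The cdb hash: start at 5381, then h = (h * 33) ^ byte, kept to 32 bits."""
    h = 5381
    for c in key:
        h = (((h << 5) + h) ^ c) & 0xFFFFFFFF
    return h


def build_cdb(items):
    """Return the bytes of a small cdb file for a list of (key, value) pairs.

    Illustrative only: everything is held in memory, and nothing checks the
    4GB ceiling imposed by the 32-bit offsets.
    """
    body = bytearray()
    tables = [[] for _ in range(256)]
    pos = 2048  # records start right after the fixed-size header
    for key, val in items:
        h = cdb_hash(key)
        tables[h & 255].append((h, pos))
        rec = struct.pack("<II", len(key), len(val)) + key + val
        body += rec
        pos += len(rec)
    header = bytearray()
    for entries in tables:
        nslots = 2 * len(entries)  # half-empty tables, so probing terminates
        header += struct.pack("<II", pos, nslots)
        slots = [(0, 0)] * nslots
        for h, rpos in entries:
            i = (h >> 8) % nslots
            while slots[i][1] != 0:  # linear probing on collision
                i = (i + 1) % nslots
            slots[i] = (h, rpos)
        for h, rpos in slots:
            body += struct.pack("<II", h, rpos)
        pos += 8 * nslots
    return bytes(header) + bytes(body)


def cdb_find(data: bytes, key: bytes):
    """Return (data position, data length) for key, or None if it is absent."""
    h = cdb_hash(key)
    tpos, nslots = struct.unpack_from("<II", data, 8 * (h & 255))
    if nslots == 0:
        return None
    slot = (h >> 8) % nslots
    for _ in range(nslots):
        hh, rpos = struct.unpack_from("<II", data, tpos + 8 * slot)
        if rpos == 0:  # empty slot: the key is not in the file
            return None
        if hh == h:
            klen, dlen = struct.unpack_from("<II", data, rpos)
            if data[rpos + 8:rpos + 8 + klen] == key:
                return rpos + 8 + klen, dlen
        slot = (slot + 1) % nslots
    return None


if __name__ == "__main__":
    key = b"20190825142846http://71.43.189.10/dermorph/"
    data = build_cdb([(key, b"1564555978"), (b"other-key", b"other-value")])
    pos, length = cdb_find(data, key)
    print(pos, length, data[pos:pos + length])
```

Because every position in the header and hash tables is a 32-bit unsigned integer, a single file cannot address more than 4GB, which is the limit item 1 of the to-do list is probing. For real files of that size, the in-memory bytes object here would need to be replaced by something like mmap.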