comparison lurid3/notes.txt @ 66:0c814f07865a

sequestration of cdb handle complete and working
author Henry S. Thompson <ht@inf.ed.ac.uk>
date Fri, 31 Jan 2025 13:26:07 +0000
parents ded30d0d097f
children 24ca6ab32e47
65:ded30d0d097f 66:0c814f07865a
1186 mv 672917132 cdb 140602889433984 1186 mv 672917132 cdb 140602889433984
1187 1 2488 10 1187 1 2488 10
1188 1564555978 1188 1564555978
1189 tested 1189 tested
1190 2.055266048759222 1190 2.055266048759222
1191 Oops, that was ndb, and nndb doesn't work!
1191 1192
1192 Things to try next: 1193 Things to try next:
1193 1) Build a bigger .cdb w. as close to 4GB as possible 1194 1) Build a bigger .cdb w. as close to 4GB as possible
1194 2) Shift to a shared library for cdb-0.75 1195 2) Shift to a shared library for cdb-0.75
1195 3) Get rid of the single fixed Cdb struct instance and malloc it as 1196 3) Get rid of the single fixed Cdb struct instance and malloc it as
1196 required 1197 required
1198 3a) Remove debugging output and recompile everything
1197 4) Build and test the real harness to process .cdx files using .cdb 1199 4) Build and test the real harness to process .cdx files using .cdb
1200
1201 Try 50% more, e.g. approx. 1.5 segments
1202 >: python3 -c "print(1.5 * 5236974)"
1203 7855461.0
1204 >: time ~/lib/python/cc/lmh/ks2cdb.py -f <(head -7855461 results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv) -c results/CC-MAIN-2019-35/warc_lmhx/ks_0.5.cdb_in7855461
1205 real 0m14.585s
1206 user 0m13.909s
1207 sys 0m3.308s
1208 >: time cdbmake results/CC-MAIN-2019-35/warc_lmhx/ks_0.5.cdb results/CC-MAIN-2019-35/warc_lmhx/ks_0.5.tmp < results/CC-MAIN-2019-35/warc_lmhx/ks_0.5.cdb_in
1209 real 0m6.075s
1210 user 0m3.682s
1211 sys 0m2.337s
1212 >: ls -lh results/CC-MAIN-2019-35/warc_lmhx/ks_0.5.cdb
1213 -rw-r--r-- 1 hst dc007 991M Jan 27 15:00 results/CC-MAIN-2019-35/warc_lmhx/ks_0.5.cdb
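For reference, a minimal sketch of the record format that cdbmake reads on stdin: each record is `+klen,dlen:key->data` followed by a newline, and one extra newline ends the input. This is not the actual ks2cdb.py, whose internals are not shown in these notes. The helper names `cdbmake_record` and `cdbmake_input` are hypothetical, and the sample key/value pair is taken from the lookups in these notes.

```python
def cdbmake_record(key: bytes, data: bytes) -> bytes:
    """Format one key/data pair as a cdbmake input record (lengths are byte counts)."""
    return b"+%d,%d:%s->%s\n" % (len(key), len(data), key, data)


def cdbmake_input(pairs) -> bytes:
    """Concatenate records and add the extra newline that terminates cdbmake input."""
    return b"".join(cdbmake_record(k, d) for k, d in pairs) + b"\n"


if __name__ == "__main__":
    # Key and value as seen in the lookups in these notes (timestamp+URL -> number)
    sample = [(b"20190825142846http://71.43.189.10/dermorph/", b"1564555978")]
    print(cdbmake_input(sample).decode())
```

Working on bytes rather than str keeps the lengths correct for non-ASCII URLs, since cdbmake counts bytes.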
1214 >: python3 -c 'import ndb' ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb 20190825142846http://71.43.189.10/dermorph/ 10000000
1215 testing... /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb b'20190825142846http://71.43.189.10/dermorph/' x 10000000
1216 mv 672917132 cdb 140324514468736
1217 1 2488 10
1218 1564555978
1219 tested
1220 2.016317328438163
1221 >: python3 -c 'import ndb' ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0.5.cdb 20190825142846http://71.43.189.10/dermorph/ 10000000
1222 testing... /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0.5.cdb b'20190825142846http://71.43.189.10/dermorph/' x 10000000
1223 mv 1039033960 cdb 140207982297984
1224 1 2488 10
1225 1564555978
1226 tested
1227 2.0484518501907587
1228
1229 So, that's OK, but still, would need 67 hash files == 265GB
1230
1231 Wait, isn't it 4GB max???
1232 Build ks_0-9.60.cdb (i.e. 60% of 0-9, 31M lines):
1233 >: time ~/lib/python/cc/lmh/ks2cdb.py -f <(head -$((4 * 7855461)) ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv) -c ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.60.cdb_in
1234 31421844
1235 real 0m58.676s
1236 user 0m56.312s
1237 sys 0m12.306s
1238 >: cdbmake ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.60.cdb ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.60.tmp < ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.60.cdb_in
1239 >: ls -lh ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.60.cdb
1240 -rw-r--r-- 1 hst dc007 3.8G Jan 27 15:26 /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.60.cdb
1241
1242 cdbget works:
1243 >: cdb-0.75/cdbget 20190818122159https://m.europapress.es/navarra/noticia-gobierno-navarra-gamesa-acuerdan-necesidad-reindustrializar-planta-alsasua-20100525100333.html <~/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.60.cdb
1244 1566130319
1245
1246 But cdbtest and cdbstats die with that as input. Looks like it happens when a
1247 signed int offset overflows and goes negative (to around -2GB)
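A small sketch of the suspected failure mode, assuming the offsets are held in 32-bit signed C ints: any offset at or beyond 2**31 bytes reads back as negative, while the same bit pattern is a perfectly good position when treated as unsigned 32-bit. The helper names are hypothetical, and this only illustrates the wraparound; it is not code from cdb-0.75.

```python
import ctypes


def as_signed32(offset: int) -> int:
    """Value of a 32-bit pattern when read as a signed C int."""
    return ctypes.c_int32(offset).value


def as_unsigned32(offset: int) -> int:
    """Value of the same 32-bit pattern when read as uint32."""
    return ctypes.c_uint32(offset).value


if __name__ == "__main__":
    two_gib = 2**31
    for off in (two_gib - 1, two_gib, 3 * 2**30):
        # Offsets below 2 GiB agree; at and above 2 GiB the signed view goes negative
        print(off, as_signed32(off), as_unsigned32(off))
```

A 3.8G file necessarily has record positions past 2**31, which would explain why the smaller .cdb files never hit this.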
1248
1249 Careful review of cdb.c/h, changing int to uint32 wherever an offset
1250 into the mmap is stored, finally got odb working. In due course should
1251 try to fix and fork all of cdb
1252
1253 Still crashing with 0-9.60
1254
1255 Ah, a further int problem -- edited seek-pos to use uint32, now
1256
1257 >: cdb-0.75/cdbtest <~/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.60.cdb
1258 found: 31403644
1259 different record: 0
1260 bad length: 0
1261 not found: 0
1262 untested: 18200
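Sanity check on those counts, as a sketch: the five cdbtest categories should add up to the 31421844 reported when the input was built (which is also 4 * 7855461, the argument to head). The variable names are just labels for the numbers above.

```python
# Numbers copied from the build step and the cdbtest output in these notes
lines_in = 31421844
cdbtest_counts = {
    "found": 31403644,
    "different record": 0,
    "bad length": 0,
    "not found": 0,
    "untested": 18200,
}

total = sum(cdbtest_counts.values())
print(total, total == lines_in, lines_in == 4 * 7855461)
```

So every record is accounted for, with none missing or mismatched.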
1263
1264 Finally got a version of nndb (which cimports db as well as cdb) to
1265 work by moving _all_ uses of _c_cdb into db. Still comparable in
1266 speed to the mixed version in e.g. odb:
1267
1268 >: python3 -c 'import odb' ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb 20190825142846http://71.43.189.10/dermorph/ 10000000
1269 testing...
1270 mv 672917132 cdb 140317635072864
1271 2488 10
1272 1564555978
1273 tested
1274 2.1426388323307037
1275
1276 >: python3 -c 'import nndb' ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb 20190825142846http://71.43.189.10/dermorph/ 10000000
1277 testing... /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb b'20190825142846http://71.43.189.10/dermorph/' x 10000000
1278 1 2488 10
1279 1564555978
1280 tested
1281 2.0193762965500355
1282 2.702429808676243 10000000
1283
1284 Interesting that just adding a counter to the test loop slows it down
1285 so much:
1286 'cfind(probe)',
1287 vs
1288 '(X:=X+1) if cfind(probe)==1 else None',
1289 setup = 'global X'
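A self-contained sketch of that comparison using timeit, with a hypothetical pure-Python `cfind` stub standing in for the real Cython lookup (so the absolute timings here say nothing about ndb/odb/nndb). The statement and setup strings are the ones quoted above; the namespace dict `ns` and the repeat count `n` are assumptions for illustration.

```python
import timeit


def cfind(probe):
    """Hypothetical stand-in for the real lookup; always reports a hit."""
    return 1


n = 100_000  # far fewer repeats than the 10000000 used in the notes
ns = {"cfind": cfind, "probe": b"20190825142846http://71.43.189.10/dermorph/", "X": 0}

# Bare call in the timed loop
bare = timeit.timeit("cfind(probe)", globals=ns, number=n)

# Same call plus a global counter; 'global X' makes the walrus assign to ns['X']
counted = timeit.timeit(
    "(X:=X+1) if cfind(probe)==1 else None",
    setup="global X",
    globals=ns,
    number=n,
)

print(bare, counted, ns["X"])
```

The counter version adds a comparison, a global load and a global store per iteration, all executed as interpreted bytecode, which can be a noticeable fraction of the loop when the call itself is a fast C function.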
1198 ================ 1290 ================
1199 1291
1200 Try it with the existing _per segment_ index we have for 2019-35 1292 Try it with the existing _per segment_ index we have for 2019-35
1201 1293
1202 Assuming we have to key on segment / file and offset, as reconstructing the 1294 Assuming we have to key on segment / file and offset, as reconstructing the