comparison lurid3/notes.txt @ 66:0c814f07865a
sequestration of cdb handle complete and working
author | Henry S. Thompson <ht@inf.ed.ac.uk> |
---|---|
date | Fri, 31 Jan 2025 13:26:07 +0000 |
parents | ded30d0d097f |
children | 24ca6ab32e47 |
65:ded30d0d097f | 66:0c814f07865a |
---|---|
1186 mv 672917132 cdb 140602889433984 | 1186 mv 672917132 cdb 140602889433984 |
1187 1 2488 10 | 1187 1 2488 10 |
1188 1564555978 | 1188 1564555978 |
1189 tested | 1189 tested |
1190 2.055266048759222 | 1190 2.055266048759222 |
| 1191 Oops, that was ndb, and nndb doesn't work! |
1191 | 1192 |
1192 Things to try next: | 1193 Things to try next: |
1193 1) Build a bigger .cdb w. as close to 4GB as possible | 1194 1) Build a bigger .cdb w. as close to 4GB as possible |
1194 2) Shift to a shared library for cdb-0.75 | 1195 2) Shift to a shared library for cdb-0.75 |
1195 3) Get rid of the single fixed Cdb struct instance and malloc it as | 1196 3) Get rid of the single fixed Cdb struct instance and malloc it as |
1196 required | 1197 required |
| 1198 3a) Remove debugging output and recompile everything |
1197 4) Build and test the real harness to process .cdx files using .cdb | 1199 4) Build and test the real harness to process .cdx files using .cdb |
| 1200 |
| 1201 Try 50% more, e.g. approx. 1.5 segments |
| 1202 >: python3 -c "print(1.5 * 5236974)" |
| 1203 7855461.0 |
| 1204 >: time ~/lib/python/cc/lmh/ks2cdb.py -f <(head -7855461 results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv) -c results/CC-MAIN-2019-35/warc_lmhx/ks_0.5.cdb_in7855461 |
| 1205 real 0m14.585s |
| 1206 user 0m13.909s |
| 1207 sys 0m3.308s |
| 1208 >: time cdbmake results/CC-MAIN-2019-35/warc_lmhx/ks_0.5.cdb results/CC-MAIN-2019-35/warc_lmhx/ks_0.5.tmp < results/CC-MAIN-2019-35/warc_lmhx/ks_0.5.cdb_in |
| 1209 real 0m6.075s |
| 1210 user 0m3.682s |
| 1211 sys 0m2.337s |
| 1212 >: ls -lh results/CC-MAIN-2019-35/warc_lmhx/ks_0.5.cdb |
| 1213 -rw-r--r-- 1 hst dc007 991M Jan 27 15:00 results/CC-MAIN-2019-35/warc_lmhx/ks_0.5.cdb |
| 1214 >: python3 -c 'import ndb' ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb 20190825142846http://71.43.189.10/dermorph/ 10000000 |
| 1215 testing... /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb b'20190825142846http://71.43.189.10/dermorph/' x 10000000 |
| 1216 mv 672917132 cdb 140324514468736 |
| 1217 1 2488 10 |
| 1218 1564555978 |
| 1219 tested |
| 1220 2.016317328438163 |
| 1221 >: python3 -c 'import ndb' ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0.5.cdb 20190825142846http://71.43.189.10/dermorph/ 10000000 |
| 1222 testing... /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0.5.cdb b'20190825142846http://71.43.189.10/dermorph/' x 10000000 |
| 1223 mv 1039033960 cdb 140207982297984 |
| 1224 1 2488 10 |
| 1225 1564555978 |
| 1226 tested |
| 1227 2.0484518501907587 |
| 1228 |
| 1229 So, that's OK, but still, would need 67 hash files == 265GB |
| 1230 |
| 1231 Wait, isn't it 4GB max??? |
| 1232 Build ks_0-9.60.cdb, (as in 60% of 0-9, 31M lines ): |
| 1233 >: time ~/lib/python/cc/lmh/ks2cdb.py -f <(head -$((4 * 7855461)) ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv) -c ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.60.cdb_in |
| 1234 31421844 |
| 1235 real 0m58.676s |
| 1236 user 0m56.312s |
| 1237 sys 0m12.306s |
| 1238 >: cdbmake ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.60.cdb ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.60.tmp < ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.60.cdb_in |
| 1239 >: ls -lh ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.60.cdb |
| 1240 -rw-r--r-- 1 hst dc007 3.8G Jan 27 15:26 /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.60.cdb |
| 1241 |
| 1242 cdbget works: |
| 1243 >: cdb-0.75/cdbget 20190818122159https://m.europapress.es/navarra/noticia-gobierno-navarra-gamesa-acuerdan-necesidad-reindustrializar-planta-alsasua-20100525100333.html <~/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.60.cdb |
| 1244 1566130319 |
| 1245 |
| 1246 But cdbtest and cdbstats die with that as input. Looks like it's when int |
| 1247 overflows to -2MB |
| 1248 |
| 1249 Careful review of cdb.c/h to change int to uint32 whenever an offset |
| 1250 in the mmap is stored finally got odb working, in due course should |
| 1251 try to fix and fork all of cdb |
| 1252 |
| 1253 Still crashing with 0-9.60 |
| 1254 |
| 1255 Ah, a further int pblm -- editted seek-pos to use uint32, now |
| 1256 |
| 1257 >: cdb-0.75/cdbtest <~/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.60.cdb |
| 1258 found: 31403644 |
| 1259 different record: 0 |
| 1260 bad length: 0 |
| 1261 not found: 0 |
| 1262 untested: 18200 |
| 1263 |
| 1264 Finally got a version of nndb (which cimports db as well as cdb) to |
| 1265 work by moving _all_ uses of _c_cdb into db. Still comparable in |
| 1266 speed to the mixed version in e.g. odb: |
| 1267 |
| 1268 >: python3 -c 'import odb' ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb 20190825142846http://71.43.189.10/dermorph/ 10000000 |
| 1269 testing... |
| 1270 mv 672917132 cdb 140317635072864 |
| 1271 2488 10 |
| 1272 1564555978 |
| 1273 tested |
| 1274 2.1426388323307037 |
| 1275 |
| 1276 >: python3 -c 'import nndb' ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb 20190825142846http://71.43.189.10/dermorph/ 10000000 |
| 1277 testing... /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb b'20190825142846http://71.43.189.10/dermorph/' x 10000000 |
| 1278 1 2488 10 |
| 1279 1564555978 |
| 1280 tested |
| 1281 2.0193762965500355 |
| 1282 2.702429808676243 10000000 |
| 1283 |
| 1284 Interesting that just adding a counter to the test loop slows it down |
| 1285 so much: |
| 1286 'cfind(probe)', |
| 1287 vs |
| 1288 '(X:=X+1) if cfind(probe)==1 else None', |
| 1289 setup = 'global X' |
1198 ================ | 1290 ================ |
1199 | 1291 |
1200 Try it with the existing _per segment_ index we have for 2019-35 | 1292 Try it with the existing _per segment_ index we have for 2019-35 |
1201 | 1293 |
1202 Assuming we have to key on segment / file and offset, as reconstructing the | 1294 Assuming we have to key on segment / file and offset, as reconstructing the |