changeset 68:3cd52d1849bb
build 16 .cdb
author | Henry S. Thompson <ht@inf.ed.ac.uk> |
---|---|
date | Mon, 10 Feb 2025 15:27:12 +0000 |
parents | 24ca6ab32e47 |
children | fb3dcd144e59 |
files | lurid3/notes.txt |
diffstat | 1 files changed, 108 insertions(+), 1 deletions(-) |
--- a/lurid3/notes.txt Tue Feb 04 12:50:52 2025 +0000
+++ b/lurid3/notes.txt Mon Feb 10 15:27:12 2025 +0000
@@ -820,6 +820,7 @@
   52382426 ks_80-89.tsv
   52295136 ks_90-99.tsv
  523863383 total
+Later saved the above in kslines.tsv
 
  >>> from pybloomfilter import BloomFilter
  >>> f=BloomFilter(523863383,0.1,'/tmp/hst/uris.bloom')
@@ -1315,7 +1316,18 @@
  2.474294847997953
  3.2440936170023633 10000000
 
-At least it works:
+[Forget that, it's back to more or less the same timing as before,
+mystery
+ >: python3 -c 'import nndb' ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb 20190825142846http://71.43.189.10/dermorph/ 10000000
+ testing... /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb b'20190825142846http://71.43.189.10/dermorph/' x 10000000
+ 1 2488 10
+ 1564555978
+ tested
+ 2.0566618730081245
+ 2.6779559339920525 10000000
+]
+
+And it works:
 
  >: python3 -c 'import get2' cdb/rts-tmp/sv.cdb cdb/rts-tmp/12.cdb two discard/tcp
  two cdb/rts-tmp/sv.cdb missing
@@ -1323,6 +1335,101 @@
  two cdb/rts-tmp/12.cdb Goodbye
  discard/tcp cdb/rts-tmp/12.cdb missing
 
+Spent a morning getting profiling working, but only from/for apps
+(cdbget, cdbtest) within the CDB distro. Had to change _exit(0) to
+exit(0) to get the profile data (mcount) actually written out.
+
+And guess what, by far the most time is spent in byte_copy. Noah is
+still correctly predicting things!
+
+ Each sample counts as 0.01 seconds.
+  %    cumulative     self               self    total
+  time    seconds  seconds     calls  ms/call  ms/call  name
+ 41.10       1.54     1.54 119212725     0.00     0.00  byte_copy
+ 13.21       2.04     0.50 479500623     0.00     0.00  cdb_hashadd
+  8.14       2.34     0.31  83427967     0.00     0.00  get
+  7.74       2.63     0.29   5226422     0.00     0.00  cdb_findnext
+  6.94       2.89     0.26   5226422     0.00     0.00  cdb_hash
+  5.60       3.10     0.21  83489614     0.00     0.00  buffer_get
+  4.27       3.26     0.16  17427786     0.00     0.00  byte_diff
+  3.20       3.38     0.12                              main
+  2.13       3.46     0.08   5226422     0.00     0.00  seek_cur
+  2.00       3.54     0.08   5226422     0.00     0.00  cdb_find
+  1.47       3.59     0.06  35723096     0.00     0.00  cdb_read
+  1.33       3.64     0.05  10474460     0.00     0.00  getnum
+And
+ ----------------------------------------------
+                0.00    0.00        15/119212725     buffer_put [19]
+                0.46    0.00  35723096/119212725     cdb_read [7]
+                1.08    0.00  83489614/119212725     buffer_get [5]
+[4]     41.1    1.54    0.00 119212725           byte_copy [4]
+
+Not clear why this has slowed down 20--25% since Cirrus reboot...
+Obvs. could try increasing buffer size which might well help...
+
+Well, changing to 64K didn't help.
+ it's giving the same results...
+
+Anyway, let's do a real experiment:
+
+
+ >: seq 1 9 | parallel -j 9 'echo {} $(wc -l < ks_{}0-{}9.tsv)' |sort -k1n,1 >> ks_lines.tsv
+ >: rename -n 's/\.(.*)/-\1.tsv/' ks.??
+ >: wc -l < ks.00
+ 32468548
+ >: wc -l < ../ks_0-9.60.cdb_in
+ 31421845
+ >: cdbmake ks-00.cdb ks-00.tmp < ks-00.cdb_in
+ >: ls -lh ks-00.cdb_in
+ -rw-r--r-- 1 hst dc007 3.5G Feb 10 14:46 ks-00.cdb_in
+
+Try more?
+
+ >: split -d -n l/1/15 /tmp/hst/ks.tsv > ks-00_15.tsv
+ >: ~/lib/python/cc/lmh/ks2cdb.py -f ks-00_15.tsv -c ks-00_15.cdb_in
+ 34382190
+ >: cdbmake ks-00_15.cdb ks-00_15.tmp < ks-00_15.cdb_in
+ cdbmake: fatal: unable to create ks-00_15.tmp: out of memory
+
+OK, 16 it is.
+
+ >: seq -f "%02g" 1 15 | parallel -j 8 '~/lib/python/cc/lmh/ks2cdb.py -f ks-{}.tsv -c ks-{}.cdb_in && cdbmake ks-{}.cdb ks-{}.tmp < ks -{}.cdb_in'
+ 33058467
+ /usr/bin/bash: line 1: ks: No such file or directory [oops]
+ 32925450
+ 32194266
+ 32611527
+ 33059551
+ 32235932
+ 32466927
+ 33341399
+ >: seq -f "%02g" 9 15 | parallel -j 7 '~/lib/python/cc/lmh/ks2cdb.py -f ks-{}.tsv -c ks-{}.cdb_in'
+ 32282702
+ 32852577
+ 33016202
+ 32294662
+ 32584480
+ 33069019
+ 33401674
+ >: seq -f "%02g" 1 15 | parallel -j 8 'cdbmake ks-{}.cdb ks-{}.tmp < ks-{}.cdb_in'
+ >: ls -lh *.cdb
+ -rw-r--r-- 1 hst dc007 3.9G Feb 10 14:47 ks-00.cdb
+ -rw-r--r-- 1 hst dc007 3.9G Feb 10 15:21 ks-01.cdb
+ -rw-r--r-- 1 hst dc007 4.0G Feb 10 15:21 ks-02.cdb
+ -rw-r--r-- 1 hst dc007 4.0G Feb 10 15:21 ks-03.cdb
+ -rw-r--r-- 1 hst dc007 3.9G Feb 10 15:21 ks-04.cdb
+ -rw-r--r-- 1 hst dc007 3.9G Feb 10 15:21 ks-05.cdb
+ -rw-r--r-- 1 hst dc007 3.9G Feb 10 15:21 ks-06.cdb
+ -rw-r--r-- 1 hst dc007 3.9G Feb 10 15:21 ks-07.cdb
+ -rw-r--r-- 1 hst dc007 3.9G Feb 10 15:21 ks-08.cdb
+ -rw-r--r-- 1 hst dc007 3.9G Feb 10 15:21 ks-09.cdb
+ -rw-r--r-- 1 hst dc007 4.0G Feb 10 15:22 ks-10.cdb
+ -rw-r--r-- 1 hst dc007 4.0G Feb 10 15:21 ks-11.cdb
+ -rw-r--r-- 1 hst dc007 3.9G Feb 10 15:22 ks-12.cdb
+ -rw-r--r-- 1 hst dc007 3.9G Feb 10 15:22 ks-13.cdb
+ -rw-r--r-- 1 hst dc007 3.9G Feb 10 15:22 ks-14.cdb
+ -rw-r--r-- 1 hst dc007 3.9G Feb 10 15:22 ks-15.cdb
+
 ================
 Try it with the existing _per segment_ index we have for 2019-35