changeset 68:3cd52d1849bb

build 16 .cdb
author Henry S. Thompson <ht@inf.ed.ac.uk>
date Mon, 10 Feb 2025 15:27:12 +0000
parents 24ca6ab32e47
children fb3dcd144e59
files lurid3/notes.txt
diffstat 1 files changed, 108 insertions(+), 1 deletions(-)
--- a/lurid3/notes.txt	Tue Feb 04 12:50:52 2025 +0000
+++ b/lurid3/notes.txt	Mon Feb 10 15:27:12 2025 +0000
@@ -820,6 +820,7 @@
        52382426 ks_80-89.tsv
        52295136 ks_90-99.tsv
       523863383 total
+Later saved the above in kslines.tsv
 
   >>> from pybloomfilter import BloomFilter
   >>> f=BloomFilter(523863383,0.1,'/tmp/hst/uris.bloom')
@@ -1315,7 +1316,18 @@
   2.474294847997953
   3.2440936170023633 10000000
 
-At least it works:
+[Forget that, it's back to more or less the same timing as before,
+mystery
+  >: python3 -c 'import nndb' ~/results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb 20190825142846http://71.43.189.10/dermorph/ 10000000
+  testing... /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0.cdb b'20190825142846http://71.43.189.10/dermorph/' x 10000000
+  1 2488 10
+  1564555978
+  tested
+  2.0566618730081245
+  2.6779559339920525 10000000
+]
+
+And it works:
 
   >: python3 -c 'import get2' cdb/rts-tmp/sv.cdb cdb/rts-tmp/12.cdb two discard/tcp
   two cdb/rts-tmp/sv.cdb missing
@@ -1323,6 +1335,101 @@
   two cdb/rts-tmp/12.cdb Goodbye
   discard/tcp cdb/rts-tmp/12.cdb missing
 
+Spent a morning getting profiling working, but only from/for apps
+(cdbget, cdbtest) within the CDB distro.  Had to change _exit(0) to
+exit(0) to get the profile data (mcount) actually written out.
+
+And guess what, by far the most time is spent in byte_copy.  Noah is
+still correctly predicting things!
+
+ Each sample counts as 0.01 seconds.
+  %   cumulative   self              self     total
+ time   seconds   seconds    calls  ms/call  ms/call  name
+ 41.10      1.54     1.54 119212725     0.00     0.00  byte_copy
+ 13.21      2.04     0.50 479500623     0.00     0.00  cdb_hashadd
+  8.14      2.34     0.31 83427967     0.00     0.00  get
+  7.74      2.63     0.29  5226422     0.00     0.00  cdb_findnext
+  6.94      2.89     0.26  5226422     0.00     0.00  cdb_hash
+  5.60      3.10     0.21 83489614     0.00     0.00  buffer_get
+  4.27      3.26     0.16 17427786     0.00     0.00  byte_diff
+  3.20      3.38     0.12                             main
+  2.13      3.46     0.08  5226422     0.00     0.00  seek_cur
+  2.00      3.54     0.08  5226422     0.00     0.00  cdb_find
+  1.47      3.59     0.06 35723096     0.00     0.00  cdb_read
+  1.33      3.64     0.05 10474460     0.00     0.00  getnum
+
+And
+ ----------------------------------------------
+                0.00    0.00      15/119212725     buffer_put [19]
+                0.46    0.00 35723096/119212725     cdb_read [7]
+                1.08    0.00 83489614/119212725     buffer_get [5]
+[4]     41.1    1.54    0.00 119212725         byte_copy [4]
+
+Not clear why this has slowed down 20--25% since Cirrus reboot...
+Obvs. could try increasing buffer size which might well help...
+
+Well, changing to 64K didn't help.
+ it's giving the same results...
+
+Anyway, let's do a real experiment:
+
+
+  >: seq 1 9 | parallel -j 9 'echo {}    $(wc -l < ks_{}0-{}9.tsv)' |sort -k1n,1 >> ks_lines.tsv
+  >: rename -n 's/\.(.*)/-\1.tsv/' ks.??
+  >: wc -l < ks.00
+  32468548
+  >: wc -l < ../ks_0-9.60.cdb_in
+  31421845
+  >: cdbmake ks-00.cdb ks-00.tmp < ks-00.cdb_in
+  >: ls -lh ks-00.cdb_in
+  -rw-r--r-- 1 hst dc007 3.5G Feb 10 14:46 ks-00.cdb_in
+
+Try more?
+
+  >: split -d -n l/1/15 /tmp/hst/ks.tsv > ks-00_15.tsv
+  >: ~/lib/python/cc/lmh/ks2cdb.py -f ks-00_15.tsv -c ks-00_15.cdb_in
+  34382190
+  >: cdbmake ks-00_15.cdb ks-00_15.tmp < ks-00_15.cdb_in
+  cdbmake: fatal: unable to create ks-00_15.tmp: out of memory
+
+OK, 16 it is.
+  >: seq -f "%02g" 1 15 | parallel -j 8 '~/lib/python/cc/lmh/ks2cdb.py -f ks-{}.tsv -c ks-{}.cdb_in && cdbmake ks-{}.cdb ks-{}.tmp < ks -{}.cdb_in'
+  33058467
+  /usr/bin/bash: line 1: ks: No such file or directory [oops]
+  32925450
+  32194266
+  32611527
+  33059551
+  32235932
+  32466927
+  33341399
+  >: seq -f "%02g" 9 15 | parallel -j 7 '~/lib/python/cc/lmh/ks2cdb.py -f ks-{}.tsv -c ks-{}.cdb_in'
+  32282702
+  32852577
+  33016202
+  32294662
+  32584480
+  33069019
+  33401674
+  >: seq -f "%02g" 1 15 | parallel -j 8 'cdbmake ks-{}.cdb ks-{}.tmp < ks-{}.cdb_in'
+  >: ls -lh *.cdb
+  -rw-r--r-- 1 hst dc007 3.9G Feb 10 14:47 ks-00.cdb
+  -rw-r--r-- 1 hst dc007 3.9G Feb 10 15:21 ks-01.cdb
+  -rw-r--r-- 1 hst dc007 4.0G Feb 10 15:21 ks-02.cdb
+  -rw-r--r-- 1 hst dc007 4.0G Feb 10 15:21 ks-03.cdb
+  -rw-r--r-- 1 hst dc007 3.9G Feb 10 15:21 ks-04.cdb
+  -rw-r--r-- 1 hst dc007 3.9G Feb 10 15:21 ks-05.cdb
+  -rw-r--r-- 1 hst dc007 3.9G Feb 10 15:21 ks-06.cdb
+  -rw-r--r-- 1 hst dc007 3.9G Feb 10 15:21 ks-07.cdb
+  -rw-r--r-- 1 hst dc007 3.9G Feb 10 15:21 ks-08.cdb
+  -rw-r--r-- 1 hst dc007 3.9G Feb 10 15:21 ks-09.cdb
+  -rw-r--r-- 1 hst dc007 4.0G Feb 10 15:22 ks-10.cdb
+  -rw-r--r-- 1 hst dc007 4.0G Feb 10 15:21 ks-11.cdb
+  -rw-r--r-- 1 hst dc007 3.9G Feb 10 15:22 ks-12.cdb
+  -rw-r--r-- 1 hst dc007 3.9G Feb 10 15:22 ks-13.cdb
+  -rw-r--r-- 1 hst dc007 3.9G Feb 10 15:22 ks-14.cdb
+  -rw-r--r-- 1 hst dc007 3.9G Feb 10 15:22 ks-15.cdb
+
 ================
 
 Try it with the existing _per segment_ index we have for 2019-35