changeset 71:6935ebce43e0 default tip

can't seem to give up on cdb...
author Henry S. Thompson <ht@inf.ed.ac.uk>
date Wed, 26 Feb 2025 19:53:07 +0000
parents db142018ff9e
children
files lurid3/notes.txt
diffstat 1 files changed, 89 insertions(+), 0 deletions(-) [+]
line wrap: on
line diff
--- a/lurid3/notes.txt	Tue Feb 11 17:10:31 2025 +0000
+++ b/lurid3/notes.txt	Wed Feb 26 19:53:07 2025 +0000
@@ -1516,6 +1516,95 @@
   >: cut -f 1 ../ks_divs.tsv | { i=0; while read j; do printf "ks_%s-%s\n" $i $j; i=$((j+1)); done ; printf "ks_%s-%s\n" $i 99; } | parallel -j 9 '~/lib/python/cc/lmh/ks2cdb.py -f {}.tsv -c {}.cdb_in' &
   >: cut -f 1 ../ks_divs.tsv | { i=0; while read j; do printf "ks_%s-%s\n" $i $j; i=$((j+1)); done ; printf "ks_%s-%s\n" $i 99; } | parallel -j 9 'cdbmake {}.cdb {}.tmp <{}.cdb_in' &
   >: cut -f 1 ../ks_divs.tsv | { i=0; while read j; do printf "ks_%s-%s\n" $i $j; i=$((j+1)); done ; printf "ks_%s-%s\n" $i 99; } | parallel -j 9 'cdb < {}.cdb' > cdbtest.out &
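The boundary-splitting shell loop above can be sketched in Python (a hypothetical helper, not part of the repo): it reads keyspace division boundaries and emits inclusive range names of the form ks_i-j, with a final range running up to 99.

```python
def range_names(boundaries, last=99):
    """Mimic the shell loop: turn a list of upper boundaries into
    successive inclusive range names ks_i-j, ending at `last`."""
    names, i = [], 0
    for j in boundaries:
        names.append(f"ks_{i}-{j}")
        i = j + 1
    names.append(f"ks_{i}-{last}")
    return names

# e.g. boundaries [5, 20] -> ['ks_0-5', 'ks_6-20', 'ks_21-99'],
# matching file names like ks_0-5.cdb seen later in the notes
```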
+
+Finished development of test_lookup3, now renamed test_cdb.py.
+Runs, but _very_ slowly: only 117,666 lines output in 30 minutes
+(roughly 65 lines/sec).
+top shows it's in state I for Idle:
+  1238723 hst       20   0   62.3g  89664  43840 I   2.0   0.0   0:01.81 test_cdb
+No better on a compute node.
+Maybe it's thrashing?
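One quick way to test the thrashing hypothesis on Linux (a sketch, not something from the notes): sample the cumulative major page-fault counter from /proc/&lt;pid&gt;/stat; a process that is genuinely thrashing shows majflt climbing rapidly between samples, while a steady count points elsewhere.

```python
import os
import time

def major_faults(pid: int) -> int:
    """Read the cumulative major page-fault count from /proc/<pid>/stat (Linux)."""
    with open(f"/proc/{pid}/stat") as f:
        stat = f.read()
    # comm (field 2) may contain spaces or parens; split after its closing ')'
    fields = stat.rsplit(')', 1)[1].split()
    # fields[0] is the state (overall field 3); majflt is overall field 12
    return int(fields[9])

# sample twice and compare: a large delta while "idle" suggests paging
before = major_faults(os.getpid())
time.sleep(0.1)
delta = major_faults(os.getpid()) - before
```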
+Tried 
+  >: cythonize -i test_cdb.py
+  >: PYTHONPATH=~/lib/python/cc/lmh python3 -c 'import test_cdb
+  test_cdb.mainp()
+  '  ks_%d-%d.cdb
+No better, AFAICS
+
+Try not using Python isal/g(un)zipping?  That didn't help either (see
+test_cdbp.py, even a version which only tests entries from segment 0).
+
+    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND  
+1546945 hst       20   0   62.3g  50180  41104 I   1.0   0.0   0:00.98 python3  
+
+But the supplied test code, which processes 31 million keys, is
+averaging 
+
+Whereas
+  >: time cdbtest < ks_0-5.cdb
+  found: 31,281,173
+  untested: 15781
+
+  real    2m45.149s
+  user    0m48.747s
+  sys     0m31.113s
+
+31M in 165 seconds (2.75 minutes) == 5.27e-06 s (5 microsec???) per key,
+compared to the nndb result of 2.67 _seconds_ for 10,000,000 (identical)
+probes, i.e. 2.67e-07 s per probe -- about 20x faster.
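The per-probe arithmetic can be double-checked in a few lines, using only the figures from the transcripts above:

```python
# cdbtest: 2m45.149s wall clock for ~31.3M keys found
cdb_per_key = 165.149 / 31_281_173
# nndb: 2.67 s for 10,000,000 identical probes
nndb_per_probe = 2.67 / 10_000_000

print(f"cdb:   {cdb_per_key:.2e} s/key")
print(f"nndb:  {nndb_per_probe:.2e} s/probe")
print(f"ratio: {cdb_per_key / nndb_per_probe:.0f}x")
```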
+
+Something weird just happened:
+    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND  
+1777850 hst       20   0   11.2g 334892 325768 R  91.8   0.1   0:43.41 python3  
+1777851 hst       20   0    5192   2900    608 S  23.9   0.0   0:10.28 igzip    
+  >: time uz /mnt/beegfs/pod12/common_crawl/CC-MAIN-2019-35/cdx/warc/cdx-00101.gz|python3 -c 'import test_cdbp
+  test_cdbp.mainp()
+  '  ks_%d-%d.cdb 3 0 >  ../cdx-00101 2>/tmp/hst/three
+
+  real    0m54.498s
+  user    0m44.699s
+  sys     0m17.472s
+  sing<4015>: fgrep -c lastmod ../cdx-00101
+  20203
+  sing<4016>: date
+  Wed Feb 26 05:59:06 PM GMT 2025
+  sing<4017>: ls -l date
+  ls: cannot access 'date': No such file or directory
+  sing<4018>: ls -l ../cdx-00101
+  -rw-r--r-- 1 hst dc007 7044044593 Feb 26 17:58 ../cdx-00101
+  sing<4019>: ls -l /tmp/hst/three
+  -rw-r--r-- 1 hst dc007 0 Feb 26 17:57 /tmp/hst/three
+_Not_ because of adding more:
+>: time uz /mnt/beegfs/pod12/common_crawl/CC-MAIN-2019-35/cdx/warc/cdx-00101.gz|python3 -c 'import test_cdbp
+test_cdbp.mainp()
+'  ks_%d-%d.cdb 1 0 >  ../cdx-00101 2>/tmp/hst/one
+
+real    0m52.195s
+user    0m42.983s
+sys     0m17.266s
+sing<4021>: fgrep -c lastmod ../cdx-00101
+20203
+sing<4022>: ls -l ../cdx-00101
+-rw-r--r-- 1 hst dc007 7044044593 Feb 26 18:04 ../cdx-00101
+
+Sometimes fast, sometimes not?
+>: time uz /mnt/beegfs/pod12/common_crawl/CC-MAIN-2019-35/cdx/warc/cdx-00101.gz|python3 -c 'import test_cdbp
+test_cdbp.mainp()
+'  ks_%d-%d.cdb 0 1 0 1 >  ../cdx-00101 2>/tmp/hst/one
+
+real    2m50.546s
+user    0m45.288s
+sys     0m20.332s
+>: time uz /mnt/beegfs/pod12/common_crawl/CC-MAIN-2019-35/cdx/warc/cdx-00101.gz | cat|python3 -c 'import test_cdbp
+test_cdbp.mainp()
+'  ks_%d-%d.cdb 0 1 0 1 >  ../cdx-00101x 2>/tmp/hst/onex
+
+real    0m49.305s
+user    0m41.800s
+sys     0m22.880s
+
+I thought having the 'cat' in the pipeline was making the difference,
+but no, just as fast w/o.  Something very odd.
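A crude way to pin down the sometimes-fast/sometimes-slow behaviour (a sketch, nothing from the notes): run the identical command several times in a row and compare wall-clock times. A large first-run-only penalty points at cold page/filesystem cache on BeeGFS rather than anything in the pipeline itself.

```python
import subprocess
import time

def time_cmd(cmd, runs=3):
    """Run a shell command `runs` times; return wall-clock seconds per run."""
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        subprocess.run(cmd, shell=True, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        times.append(time.perf_counter() - t0)
    return times

# e.g. time_cmd("uz .../cdx-00101.gz | python3 ... >/dev/null"):
# if the first element dwarfs the rest, suspect cache warm-up, not code
```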
+
 ================
 
 Try it with the existing _per segment_ index we have for 2019-35