cc/work: lurid3/notes.txt comparison

comparison lurid3/notes.txt @ 71:6935ebce43e0 default tip

can't seem to give up on cdb...

author	Henry S. Thompson <ht@inf.ed.ac.uk>
date	Wed, 26 Feb 2025 19:53:07 +0000
parents	db142018ff9e
children


-:db142018ff9e
+:6935ebce43e0
 >: cut -f 1 ../ks_divs.tsv | { i=0; while read j; do printf "%s\t%s\n" $i $j; i=$((j+1)); done ; printf "%s\t%s\n" $i 99 ; } | parallel --colsep "\t" 'echo cat ../\{{1}..{2}\}/ks.tsv \> ks_{1}-{2}.tsv \&'
 [couldn't make this work as written, hence the echo, followed by copy-paste]
 >: cut -f 1 ../ks_divs.tsv | { i=0; while read j; do printf "ks_%s-%s\n" $i $j; i=$((j+1)); done ; printf "ks_%s-%s\n" $i 99; } | parallel -j 9 '~/lib/python/cc/lmh/ks2cdb.py -f {}.tsv -c {}.cdb_in' &
 >: cut -f 1 ../ks_divs.tsv | { i=0; while read j; do printf "ks_%s-%s\n" $i $j; i=$((j+1)); done ; printf "ks_%s-%s\n" $i 99; } | parallel -j 9 'cdbmake {}.cdb {}.tmp <{}.cdb_in' &
 >: cut -f 1 ../ks_divs.tsv | { i=0; while read j; do printf "ks_%s-%s\n" $i $j; i=$((j+1)); done ; printf "ks_%s-%s\n" $i 99; } | parallel -j 9 'cdb < {}.cdb' > cdbtest.out &
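[The four pipelines above all build the same list of segment ranges from column 1 of ../ks_divs.tsv: each divider closes a range, the next range starts one past it, and the last range ends at 99. A minimal Python sketch of that bookkeeping follows. The function name and the divider values are made up for illustration; only the first divider, 5, is suggested by the ks_0-5.cdb file used further down.]

```python
def segment_ranges(dividers, last=99):
    """Yield (start, end) pairs, mirroring the shell loop:
    i=0; for each j: emit (i, j); i=j+1; finally emit (i, last)."""
    start = 0
    for end in dividers:
        yield (start, end)
        start = end + 1
    yield (start, last)


# Hypothetical dividers: the real ones are column 1 of ../ks_divs.tsv
example_dividers = [5, 17, 42]
names = ["ks_%d-%d" % pair for pair in segment_ranges(example_dividers)]
print(names)
```

[Each name is then the stem for the .tsv, .cdb_in and .cdb files that the parallel jobs read and write.]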
+Finished development of test_lookup3, now renamed as test_cdb.py.
+Runs, but _very_ slowly, only 117,666 lines output in 30 minutes.
+top shows it's in state I for Idle:
+1238723 hst       20   0   62.3g  89664  43840 I   2.0   0.0   0:01.81 test_cdb
+No better on a compute node.
+Maybe it's thrashing?
+Tried
+>: cythonize -i test_cdb.py
+>: PYTHONPATH=~/lib/python/cc/lmh python3 -c 'import test_cdb
+test_cdb.mainp()
+'  ks_%d-%d.cdb
+No better, AFAICS
+Try not using python isal/g(un)zipping?  That didn't help either (see
+test_cdbp.py, even a version which only tests entries from segment 0)
+PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
+1546945 hst       20   0   62.3g  50180  41104 I   1.0   0.0   0:00.98 python3
+But the supplied test code, which processes 31 million keys, is
+averaging
+Whereas
+>: time cdbtest < ks_0-5.cdb
+found: 31,281,173
+untested: 15781
+real    2m45.149s
+user    0m48.747s
+sys     0m31.113s
+31M in 165 seconds (2.75 minutes) == 5.27e-06 (5 microsec???) per key
+compared to nndb result of 2.67 _seconds_ for 10,000,000 (identical) probes
+i.e. 2.67e-7 per probe.
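[The per-key figures above can be rechecked with a few lines of Python. This is only a sketch of the arithmetic. It uses the wall-clock times and counts as recorded in these notes, the variable names are mine, and it treats the "found" count as the number of keys probed, which ignores the 15781 untested ones.]

```python
# Figures copied from the notes above
cdb_keys = 31_281_173            # "found" count reported by cdbtest
cdb_seconds = 2 * 60 + 45.149    # real 2m45.149s
nndb_probes = 10_000_000
nndb_seconds = 2.67
py_lines = 117_666               # lines output by test_cdb.py
py_seconds = 30 * 60             # in 30 minutes

cdb_per_key = cdb_seconds / cdb_keys
nndb_per_probe = nndb_seconds / nndb_probes
py_per_line = py_seconds / py_lines

print("cdbtest:  %.3g s per key" % cdb_per_key)
print("nndb:     %.3g s per probe" % nndb_per_probe)
print("test_cdb: %.3g s per line" % py_per_line)
print("cdbtest / nndb: %.1f" % (cdb_per_key / nndb_per_probe))
```

[On these numbers cdbtest is roughly 20 times slower per key than nndb, while the slow test_cdb.py run is around 15 milliseconds per output line, so several thousand times slower again than cdbtest.]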
+Something weird just happened:
+PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
+1777850 hst       20   0   11.2g 334892 325768 R  91.8   0.1   0:43.41 python3
+1777851 hst       20   0    5192   2900    608 S  23.9   0.0   0:10.28 igzip
+>: time uz /mnt/beegfs/pod12/common_crawl/CC-MAIN-2019-35/cdx/warc/cdx-00101.gz|python3 -c 'import test_cdbp
+test_cdbp.mainp()
+'  ks_%d-%d.cdb 3 0 >  ../cdx-00101 2>/tmp/hst/three
+real    0m54.498s
+user    0m44.699s
+sys     0m17.472s
+sing<4015>: fgrep -c lastmod ../cdx-00101
+20203
+sing<4016>: date
+Wed Feb 26 05:59:06 PM GMT 2025
+sing<4017>: ls -l date
+ls: cannot access 'date': No such file or directory
+sing<4018>: ls -l ../cdx-00101
+-rw-r--r-- 1 hst dc007 7044044593 Feb 26 17:58 ../cdx-00101
+sing<4019>: ls -l /tmp/hst/three
+-rw-r--r-- 1 hst dc007 0 Feb 26 17:57 /tmp/hst/three
+_Not_ because of adding more:
+>: time uz /mnt/beegfs/pod12/common_crawl/CC-MAIN-2019-35/cdx/warc/cdx-00101.gz|python3 -c 'import test_cdbp
+test_cdbp.mainp()
+'  ks_%d-%d.cdb 1 0 >  ../cdx-00101 2>/tmp/hst/one
+real    0m52.195s
+user    0m42.983s
+sys     0m17.266s
+sing<4021>: fgrep -c lastmod ../cdx-00101
+20203
+sing<4022>: ls -l ../cdx-00101
+-rw-r--r-- 1 hst dc007 7044044593 Feb 26 18:04 ../cdx-00101
+Sometimes fast, sometimes not?
+>: time uz /mnt/beegfs/pod12/common_crawl/CC-MAIN-2019-35/cdx/warc/cdx-00101.gz|python3 -c 'import test_cdbp
+test_cdbp.mainp()
+'  ks_%d-%d.cdb 0 1 0 1 >  ../cdx-00101 2>/tmp/hst/one
+real    2m50.546s
+user    0m45.288s
+sys     0m20.332s
+>: time uz /mnt/beegfs/pod12/common_crawl/CC-MAIN-2019-35/cdx/warc/cdx-00101.gz | cat|python3 -c 'import test_cdbp
+test_cdbp.mainp()
+'  ks_%d-%d.cdb 0 1 0 1 >  ../cdx-00101x 2>/tmp/hst/onex
+real    0m49.305s
+user    0m41.800s
+sys     0m22.880s
+I thought having the 'cat' in the pipeline was making the difference,
+but no, just as fast w/o.  Something very odd
 ================
 Try it with the existing _per segment_ index we have for 2019-35
 Assuming we have to key on segment / file and offset, as reconstructing the
