changeset 71:6935ebce43e0 default tip
can't seem to give up on cdb...
author    Henry S. Thompson <ht@inf.ed.ac.uk>
date      Wed, 26 Feb 2025 19:53:07 +0000
parents   db142018ff9e
children
files     lurid3/notes.txt
diffstat  1 files changed, 89 insertions(+), 0 deletions(-)
--- a/lurid3/notes.txt	Tue Feb 11 17:10:31 2025 +0000
+++ b/lurid3/notes.txt	Wed Feb 26 19:53:07 2025 +0000
@@ -1516,6 +1516,95 @@
 >: cut -f 1 ../ks_divs.tsv | { i=0; while read j; do printf "ks_%s-%s\n" $i $j; i=$((j+1)); done ; printf "ks_%s-%s\n" $i 99; } | parallel -j 9 '~/lib/python/cc/lmh/ks2cdb.py -f {}.tsv -c {}.cdb_in' &
 >: cut -f 1 ../ks_divs.tsv | { i=0; while read j; do printf "ks_%s-%s\n" $i $j; i=$((j+1)); done ; printf "ks_%s-%s\n" $i 99; } | parallel -j 9 'cdbmake {}.cdb {}.tmp <{}.cdb_in' &
 >: cut -f 1 ../ks_divs.tsv | { i=0; while read j; do printf "ks_%s-%s\n" $i $j; i=$((j+1)); done ; printf "ks_%s-%s\n" $i 99; } | parallel -j 9 'cdb < {}.cdb' > cdbtest.out &
+
+Finished development of test_lookup3, now renamed as test_cdb.py.
+Runs, but _very_ slowly, only 117,666 lines output in 30 minutes.
+top shows it's in state I for Idle:
+ 1238723 hst   20   0  62.3g  89664  43840 I   2.0   0.0   0:01.81 test_cdb
+No better on a compute node.
+Maybe it's thrashing?
+Tried
+ >: cythonize -i test_cdb.py
+ >: PYTHONPATH=~/lib/python/cc/lmh python3 -c 'import test_cdb
+ test_cdb.mainp()
+ ' ks_%d-%d.cdb
+No better, AFAICS
+
+Try not using python isal/g(un)zipping? That didn't help either (see
+test_cdbp.py, even a version which only tests entries from segment 0)
+
+    PID USER  PR  NI   VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
+1546945 hst   20   0  62.3g  50180  41104 I   1.0   0.0   0:00.98 python3
+
+But the supplied test code, which processes 31 million keys, is
+averaging
+
+Whereas
+ >: time cdbtest < ks_0-5.cdb
+ found: 31,281,173
+ untested: 15781
+
+ real	2m45.149s
+ user	0m48.747s
+ sys	0m31.113s
+
+31M in 165 seconds (2.75 minutes) == 5.27e-06 (5 microsec???) per key
+compared to nndb result of 2.67 _seconds_ for 10,000,000 (identical) probes
+i.e. 2.67e-7 per probe.
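[Editorial sketch] The three parallel pipelines above all generate the same list of shard names from ks_divs.tsv before fanning out to ks2cdb.py, cdbmake, and cdb. A minimal Python rendering of that shell loop, assuming (as the `while read` loop implies) that the first column of ks_divs.tsv holds the inclusive upper bound of each key-space shard and 99 is the overall maximum:

```python
# Sketch of the shard-naming loop from the shell pipelines above.
# Assumption: each value read from column 1 of ks_divs.tsv is the
# inclusive upper bound of one shard; shards tile 0..maximum.
def shard_names(bounds, maximum=99):
    """Yield shard names of the form ks_<lo>-<hi> covering 0..maximum."""
    lo = 0
    for hi in bounds:
        yield f"ks_{lo}-{hi}"   # printf "ks_%s-%s\n" $i $j
        lo = hi + 1             # i=$((j+1))
    yield f"ks_{lo}-{maximum}"  # final printf after the loop

# e.g. bounds [5, 11] -> ks_0-5, ks_6-11, ks_12-99
# (ks_0-5 matching the shard probed with cdbtest below)
```

Each name is then used three times with GNU parallel's `{}` substitution: once as the .tsv/.cdb_in pair for ks2cdb.py, once as cdbmake's output/temp pair, and once as the finished .cdb fed to cdb for testing.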
+
+Something weird just happened:
+    PID USER  PR  NI   VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
+1777850 hst   20   0  11.2g 334892 325768 R  91.8   0.1   0:43.41 python3
+1777851 hst   20   0   5192   2900    608 S  23.9   0.0   0:10.28 igzip
+ >: time uz /mnt/beegfs/pod12/common_crawl/CC-MAIN-2019-35/cdx/warc/cdx-00101.gz|python3 -c 'import test_cdbp
+ test_cdbp.mainp()
+ ' ks_%d-%d.cdb 3 0 > ../cdx-00101 2>/tmp/hst/three
+
+ real	0m54.498s
+ user	0m44.699s
+ sys	0m17.472s
+ sing<4015>: fgrep -c lastmod ../cdx-00101
+ 20203
+ sing<4016>: date
+ Wed Feb 26 05:59:06 PM GMT 2025
+ sing<4017>: ls -l date
+ ls: cannot access 'date': No such file or directory
+ sing<4018>: ls -l ../cdx-00101
+ -rw-r--r-- 1 hst dc007 7044044593 Feb 26 17:58 ../cdx-00101
+ sing<4019>: ls -l /tmp/hst/three
+ -rw-r--r-- 1 hst dc007 0 Feb 26 17:57 /tmp/hst/three
+_Not_ because of adding more:
+>: time uz /mnt/beegfs/pod12/common_crawl/CC-MAIN-2019-35/cdx/warc/cdx-00101.gz|python3 -c 'import test_cdbp
+test_cdbp.mainp()
+' ks_%d-%d.cdb 1 0 > ../cdx-00101 2>/tmp/hst/one
+
+real	0m52.195s
+user	0m42.983s
+sys	0m17.266s
+sing<4021>: fgrep -c lastmod ../cdx-00101
+20203
+sing<4022>: ls -l ../cdx-00101
+-rw-r--r-- 1 hst dc007 7044044593 Feb 26 18:04 ../cdx-00101
+
+Sometimes fast, sometimes not?
+>: time uz /mnt/beegfs/pod12/common_crawl/CC-MAIN-2019-35/cdx/warc/cdx-00101.gz|python3 -c 'import test_cdbp
+test_cdbp.mainp()
+' ks_%d-%d.cdb 0 1 0 1 > ../cdx-00101 2>/tmp/hst/one
+
+real	2m50.546s
+user	0m45.288s
+sys	0m20.332s
+>: time uz /mnt/beegfs/pod12/common_crawl/CC-MAIN-2019-35/cdx/warc/cdx-00101.gz | cat|python3 -c 'import test_cdbp
+test_cdbp.mainp()
+' ks_%d-%d.cdb 0 1 0 1 > ../cdx-00101x 2>/tmp/hst/onex
+
+real	0m49.305s
+user	0m41.800s
+sys	0m22.880s
+
+I thought having the 'cat' in the pipeline was making the difference,
+but no, just as fast w/o. Something very odd.
+
+================
+Try it with the existing _per segment_ index we have for 2019-35
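[Editorial sketch] The run-to-run variance above amounts to roughly a 3.5x swing in throughput. A quick check of the numbers, using the output-file size from `ls -l ../cdx-00101` (7,044,044,593 bytes, identical across runs) and the `real` times recorded above; the run labels are just shorthand for the argument lists:

```python
# Throughput per run, from figures in the transcript above:
# output size is constant, so bytes / real-time compares the runs fairly.
SIZE = 7_044_044_593  # bytes, from ls -l ../cdx-00101

runs = {
    "args '3 0'":            54.498,   # first run
    "args '1 0'":            52.195,   # "_Not_ because of adding more"
    "args '0 1 0 1'":       170.546,   # the slow run (2m50.546s)
    "args '0 1 0 1' + cat":  49.305,   # same args, cat in pipeline
}

for name, secs in runs.items():
    print(f"{name:24s} {SIZE / secs / 1e6:6.1f} MB/s")

# The slow run sits around 41 MB/s against ~129-143 MB/s for the other
# three: a ~3.5x gap with no change in arguments, which is what makes
# the 'cat' explanation look like coincidence.
```

That the same argument list (`0 1 0 1`) produced both the slowest and the fastest run supports the conclusion that the pipeline shape is not the cause; contention on the shared BeeGFS filesystem would be one hypothesis worth timing repeatedly.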