comparison lurid3/notes.txt @ 71:6935ebce43e0 default tip
can't seem to give up on cdb...
author | Henry S. Thompson <ht@inf.ed.ac.uk> |
---|---|
date | Wed, 26 Feb 2025 19:53:07 +0000 |
parents | db142018ff9e |
children |
70:db142018ff9e | 71:6935ebce43e0 |
---|---|
1514 >: cut -f 1 ../ks_divs.tsv | { i=0; while read j; do printf "%s\t%s\n" $i $j; i=$((j+1)); done ; printf "%s\t%s\n" $i 99 ; } | parallel --colsep "\t" 'echo cat ../\{{1}..{2}\}/ks.tsv \> ks_{1}-{2}.tsv \&' | 1514 >: cut -f 1 ../ks_divs.tsv | { i=0; while read j; do printf "%s\t%s\n" $i $j; i=$((j+1)); done ; printf "%s\t%s\n" $i 99 ; } | parallel --colsep "\t" 'echo cat ../\{{1}..{2}\}/ks.tsv \> ks_{1}-{2}.tsv \&' |
1515 [couldn't make this work as written, hence the echo, followed by copy-paste] | 1515 [couldn't make this work as written, hence the echo, followed by copy-paste] |
1516 >: cut -f 1 ../ks_divs.tsv | { i=0; while read j; do printf "ks_%s-%s\n" $i $j; i=$((j+1)); done ; printf "ks_%s-%s\n" $i 99; } | parallel -j 9 '~/lib/python/cc/lmh/ks2cdb.py -f {}.tsv -c {}.cdb_in' & | 1516 >: cut -f 1 ../ks_divs.tsv | { i=0; while read j; do printf "ks_%s-%s\n" $i $j; i=$((j+1)); done ; printf "ks_%s-%s\n" $i 99; } | parallel -j 9 '~/lib/python/cc/lmh/ks2cdb.py -f {}.tsv -c {}.cdb_in' & |
1517 >: cut -f 1 ../ks_divs.tsv | { i=0; while read j; do printf "ks_%s-%s\n" $i $j; i=$((j+1)); done ; printf "ks_%s-%s\n" $i 99; } | parallel -j 9 'cdbmake {}.cdb {}.tmp <{}.cdb_in' & | 1517 >: cut -f 1 ../ks_divs.tsv | { i=0; while read j; do printf "ks_%s-%s\n" $i $j; i=$((j+1)); done ; printf "ks_%s-%s\n" $i 99; } | parallel -j 9 'cdbmake {}.cdb {}.tmp <{}.cdb_in' & |
1518 >: cut -f 1 ../ks_divs.tsv | { i=0; while read j; do printf "ks_%s-%s\n" $i $j; i=$((j+1)); done ; printf "ks_%s-%s\n" $i 99; } | parallel -j 9 'cdb < {}.cdb' > cdbtest.out & | 1518 >: cut -f 1 ../ks_divs.tsv | { i=0; while read j; do printf "ks_%s-%s\n" $i $j; i=$((j+1)); done ; printf "ks_%s-%s\n" $i 99; } | parallel -j 9 'cdb < {}.cdb' > cdbtest.out & |
1519 | |
1520 Finished development of test_lookup3, now renamed as test_cdb.py. | |
1521 Runs, but _very_ slowly, only 117,666 lines output in 30 minutes. | |
1522 top shows it's in state I for Idle: | |
1523 1238723 hst 20 0 62.3g 89664 43840 I 2.0 0.0 0:01.81 test_cdb | |
1524 No better on a compute node. | |
1525 Maybe it's thrashing? | |
1526 Tried | |
1527 >: cythonize -i test_cdb.py | |
1528 >: PYTHONPATH=~/lib/python/cc/lmh python3 -c 'import test_cdb | |
1529 test_cdb.mainp() | |
1530 ' ks_%d-%d.cdb | |
1531 No better, AFAICS | |
1532 | |
1533 Try not using python isal/g(un)zipping? That didn't help either (see | |
1534 test_cdbp.py, even a version which only tests entries from segment 0) | |
1535 | |
1536 PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND | |
1537 1546945 hst 20 0 62.3g 50180 41104 I 1.0 0.0 0:00.98 python3 | |
1538 | |
1539 But the supplied test code, which processes 31 million keys, is | |
1540 averaging | |
1541 | |
1542 Whereas | |
1543 >: time cdbtest < ks_0-5.cdb | |
1544 found: 31,281,173 | |
1545 untested: 15781 | |
1546 | |
1547 real 2m45.149s | |
1548 user 0m48.747s | |
1549 sys 0m31.113s | |
1550 | |
1551 31M in 165 seconds (2.75 minutes) == 5.27e-06 (5 microsec???) per key | |
1552 compared to nndb result of 2.67 _seconds_ for 10,000,000 (identical) probes | |
1553 i.e. 2.67e-7 per probe. | |
1554 | |
1555 Something weird just happened: | |
1556 PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND | |
1557 1777850 hst 20 0 11.2g 334892 325768 R 91.8 0.1 0:43.41 python3 | |
1558 1777851 hst 20 0 5192 2900 608 S 23.9 0.0 0:10.28 igzip | |
1559 >: time uz /mnt/beegfs/pod12/common_crawl/CC-MAIN-2019-35/cdx/warc/cdx-00101.gz|python3 -c 'import test_cdbp | |
1560 test_cdbp.mainp() | |
1561 ' ks_%d-%d.cdb 3 0 > ../cdx-00101 2>/tmp/hst/three | |
1562 | |
1563 real 0m54.498s | |
1564 user 0m44.699s | |
1565 sys 0m17.472s | |
1566 sing<4015>: fgrep -c lastmod ../cdx-00101 | |
1567 20203 | |
1568 sing<4016>: date | |
1569 Wed Feb 26 05:59:06 PM GMT 2025 | |
1570 sing<4017>: ls -l date | |
1571 ls: cannot access 'date': No such file or directory | |
1572 sing<4018>: ls -l ../cdx-00101 | |
1573 -rw-r--r-- 1 hst dc007 7044044593 Feb 26 17:58 ../cdx-00101 | |
1574 sing<4019>: ls -l /tmp/hst/three | |
1575 -rw-r--r-- 1 hst dc007 0 Feb 26 17:57 /tmp/hst/three | |
1576 _Not_ because of adding more: | |
1577 >: time uz /mnt/beegfs/pod12/common_crawl/CC-MAIN-2019-35/cdx/warc/cdx-00101.gz|python3 -c 'import test_cdbp | |
1578 test_cdbp.mainp() | |
1579 ' ks_%d-%d.cdb 1 0 > ../cdx-00101 2>/tmp/hst/one | |
1580 | |
1581 real 0m52.195s | |
1582 user 0m42.983s | |
1583 sys 0m17.266s | |
1584 sing<4021>: fgrep -c lastmod ../cdx-00101 | |
1585 20203 | |
1586 sing<4022>: ls -l ../cdx-00101 | |
1587 -rw-r--r-- 1 hst dc007 7044044593 Feb 26 18:04 ../cdx-00101 | |
1588 | |
1589 Sometimes fast, sometimes not? | |
1590 >: time uz /mnt/beegfs/pod12/common_crawl/CC-MAIN-2019-35/cdx/warc/cdx-00101.gz|python3 -c 'import test_cdbp | |
1591 test_cdbp.mainp() | |
1592 ' ks_%d-%d.cdb 0 1 0 1 > ../cdx-00101 2>/tmp/hst/one | |
1593 | |
1594 real 2m50.546s | |
1595 user 0m45.288s | |
1596 sys 0m20.332s | |
1597 >: time uz /mnt/beegfs/pod12/common_crawl/CC-MAIN-2019-35/cdx/warc/cdx-00101.gz | cat|python3 -c 'import test_cdbp | |
1598 test_cdbp.mainp() | |
1599 ' ks_%d-%d.cdb 0 1 0 1 > ../cdx-00101x 2>/tmp/hst/onex | |
1600 | |
1601 real 0m49.305s | |
1602 user 0m41.800s | |
1603 sys 0m22.880s | |
1604 | |
1605 I thought having the 'cat' in the pipeline was making the difference, | |
1606 but no, just as fast w/o. Something very odd | |
1607 | |
1519 ================ | 1608 ================ |
1520 | 1609 |
1521 Try it with the existing _per segment_ index we have for 2019-35 | 1610 Try it with the existing _per segment_ index we have for 2019-35 |
1522 | 1611 |
1523 Assuming we have to key on segment / file and offset, as reconstructing the | 1612 Assuming we have to key on segment / file and offset, as reconstructing the |