comparison lurid3/notes.txt @ 71:6935ebce43e0 default tip

can't seem to give up on cdb...
author Henry S. Thompson <ht@inf.ed.ac.uk>
date Wed, 26 Feb 2025 19:53:07 +0000
parents db142018ff9e
70:db142018ff9e 71:6935ebce43e0
1514 >: cut -f 1 ../ks_divs.tsv | { i=0; while read j; do printf "%s\t%s\n" $i $j; i=$((j+1)); done ; printf "%s\t%s\n" $i 99 ; } | parallel --colsep "\t" 'echo cat ../\{{1}..{2}\}/ks.tsv \> ks_{1}-{2}.tsv \&' 1514 >: cut -f 1 ../ks_divs.tsv | { i=0; while read j; do printf "%s\t%s\n" $i $j; i=$((j+1)); done ; printf "%s\t%s\n" $i 99 ; } | parallel --colsep "\t" 'echo cat ../\{{1}..{2}\}/ks.tsv \> ks_{1}-{2}.tsv \&'
1515 [couldn't make this work as written, hence the echo, followed by copy-paste] 1515 [couldn't make this work as written, hence the echo, followed by copy-paste]
1516 >: cut -f 1 ../ks_divs.tsv | { i=0; while read j; do printf "ks_%s-%s\n" $i $j; i=$((j+1)); done ; printf "ks_%s-%s\n" $i 99; } | parallel -j 9 '~/lib/python/cc/lmh/ks2cdb.py -f {}.tsv -c {}.cdb_in' & 1516 >: cut -f 1 ../ks_divs.tsv | { i=0; while read j; do printf "ks_%s-%s\n" $i $j; i=$((j+1)); done ; printf "ks_%s-%s\n" $i 99; } | parallel -j 9 '~/lib/python/cc/lmh/ks2cdb.py -f {}.tsv -c {}.cdb_in' &
1517 >: cut -f 1 ../ks_divs.tsv | { i=0; while read j; do printf "ks_%s-%s\n" $i $j; i=$((j+1)); done ; printf "ks_%s-%s\n" $i 99; } | parallel -j 9 'cdbmake {}.cdb {}.tmp <{}.cdb_in' & 1517 >: cut -f 1 ../ks_divs.tsv | { i=0; while read j; do printf "ks_%s-%s\n" $i $j; i=$((j+1)); done ; printf "ks_%s-%s\n" $i 99; } | parallel -j 9 'cdbmake {}.cdb {}.tmp <{}.cdb_in' &
1518 >: cut -f 1 ../ks_divs.tsv | { i=0; while read j; do printf "ks_%s-%s\n" $i $j; i=$((j+1)); done ; printf "ks_%s-%s\n" $i 99; } | parallel -j 9 'cdb < {}.cdb' > cdbtest.out & 1518 >: cut -f 1 ../ks_divs.tsv | { i=0; while read j; do printf "ks_%s-%s\n" $i $j; i=$((j+1)); done ; printf "ks_%s-%s\n" $i 99; } | parallel -j 9 'cdb < {}.cdb' > cdbtest.out &
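The shell loop repeated in the commands above turns a list of cut points (column 1 of ks_divs.tsv) into segment ranges, with 99 as the final upper bound. A minimal Python sketch of the same bookkeeping follows. The function names and the example cut points are made up for illustration; only the ks_N-M naming, the ../N/ks.tsv path pattern and the final bound of 99 come from the commands above.

```python
def ranges_from_cuts(cuts, last=99):
    """Yield (first, last) segment ranges.

    Each cut is the inclusive upper bound of one range; the next range
    starts at cut + 1, and the final range runs up to `last`.
    """
    start = 0
    for cut in cuts:
        yield (start, cut)
        start = cut + 1
    yield (start, last)


def range_names(cuts, last=99):
    """Base names in the ks_N-M style used for the .tsv/.cdb_in/.cdb files."""
    return ["ks_%d-%d" % r for r in ranges_from_cuts(cuts, last)]


def segment_paths(first, last):
    """Per-segment input files for one range, listed explicitly
    instead of relying on shell brace expansion."""
    return ["../%d/ks.tsv" % i for i in range(first, last + 1)]


if __name__ == "__main__":
    # Hypothetical cut points; the real ones are in ../ks_divs.tsv
    for name in range_names([5, 17, 42]):
        print(name)
```

With those made-up cut points the names come out as ks_0-5, ks_6-17, ks_18-42 and ks_43-99. Listing the paths explicitly with segment_paths avoids needing the shell to expand {N..M} inside a parallel command template.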
1519
1520 Finished development of test_lookup3, now renamed as test_cdb.py.
1521 Runs, but _very_ slowly, only 117,666 lines output in 30 minutes.
1522 top shows it's in state I for Idle:
1523 1238723 hst 20 0 62.3g 89664 43840 I 2.0 0.0 0:01.81 test_cdb
1524 No better on a compute node.
1525 Maybe it's thrashing?
1526 Tried
1527 >: cythonize -i test_cdb.py
1528 >: PYTHONPATH=~/lib/python/cc/lmh python3 -c 'import test_cdb
1529 test_cdb.mainp()
1530 ' ks_%d-%d.cdb
1531 No better, AFAICS
1532
1533 Try not using python isal/g(un)zipping? That didn't help either (see
1534 test_cdbp.py, even a version which only tests entries from segment 0)
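One way to separate "the cdb probe is slow" from "something around it is slow" is to probe with as little machinery as possible. The following is a minimal sketch of a classic cdb lookup in pure Python, assuming the ks_*.cdb files are in the original 32-bit cdb layout (2048-byte header of 256 (position, slot count) pairs, then records, then hash tables). That assumption is not confirmed by these notes, and the classic layout cannot address files over 4 GiB. This is not the code in test_cdb.py; cdb_build is only an in-memory fixture so that the lookup can be exercised without a real file.

```python
import struct


def cdb_hash(key: bytes) -> int:
    """The cdb hash: h = 5381, then h = (h * 33) xor c, modulo 2**32."""
    h = 5381
    for c in key:
        h = (((h << 5) + h) ^ c) & 0xFFFFFFFF
    return h


def cdb_build(items) -> bytes:
    """Build a small classic-format cdb image in memory (fixture only)."""
    records = bytearray()
    tables = [[] for _ in range(256)]
    pos = 2048  # records start right after the fixed-size header
    for key, val in items:
        h = cdb_hash(key)
        tables[h & 255].append((h, pos))
        rec = struct.pack("<II", len(key), len(val)) + key + val
        records += rec
        pos += len(rec)
    header = bytearray()
    body = bytearray()
    for entries in tables:
        nslots = 2 * len(entries)  # half-empty tables, as cdbmake does
        header += struct.pack("<II", pos, nslots)
        slots = [(0, 0)] * nslots
        for h, p in entries:
            i = (h >> 8) % nslots
            while slots[i][1] != 0:  # linear probing to the next free slot
                i = (i + 1) % nslots
            slots[i] = (h, p)
        for h, p in slots:
            body += struct.pack("<II", h, p)
        pos += 8 * nslots
    return bytes(header) + bytes(records) + bytes(body)


def cdb_get(buf, key: bytes):
    """Return the first value stored under key, or None if absent.

    buf may be a bytes object or a read-only mmap of the .cdb file.
    """
    h = cdb_hash(key)
    tpos, nslots = struct.unpack_from("<II", buf, 8 * (h & 255))
    if nslots == 0:
        return None
    start = (h >> 8) % nslots
    for n in range(nslots):
        slot = tpos + 8 * ((start + n) % nslots)
        slot_hash, rpos = struct.unpack_from("<II", buf, slot)
        if rpos == 0:  # empty slot ends the probe sequence
            return None
        if slot_hash == h:
            klen, dlen = struct.unpack_from("<II", buf, rpos)
            kstart = rpos + 8
            if buf[kstart:kstart + klen] == key:
                return bytes(buf[kstart + klen:kstart + klen + dlen])
    return None
```

For a real file, buf would be something like mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ). Each successful probe touches the header, one hash-table slot and one record, so on a cold cache a single lookup can cost up to three page reads from the underlying filesystem.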
1535
1536 PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1537 1546945 hst 20 0 62.3g 50180 41104 I 1.0 0.0 0:00.98 python3
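The two top rows so far show the same pattern: a very large VIRT, a tiny RES, and almost no CPU. A small sketch for pulling those numbers out of a pasted top row follows, assuming the default column order shown in the header above and top's default KiB units for unsuffixed memory fields. The function names are made up for illustration. A low resident fraction on its own does not show thrashing; it only says that very little of the mapped data is in memory.

```python
def parse_size_kib(text):
    """Convert a top memory field to KiB; unsuffixed values are taken as KiB."""
    units = {"m": 1024, "g": 1024 ** 2, "t": 1024 ** 3}
    suffix = text[-1].lower()
    if suffix in units:
        return float(text[:-1]) * units[suffix]
    return float(text)


def parse_top_row(row):
    """Split one row laid out as
    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND."""
    f = row.split()
    return {
        "pid": int(f[0]),
        "user": f[1],
        "virt_kib": parse_size_kib(f[4]),
        "res_kib": parse_size_kib(f[5]),
        "state": f[7],
        "cpu": float(f[8]),
        "command": f[11],
    }


def resident_fraction(row):
    """RES as a fraction of VIRT for one top row."""
    p = parse_top_row(row)
    return p["res_kib"] / p["virt_kib"]


if __name__ == "__main__":
    rows = [
        "1238723 hst 20 0 62.3g 89664 43840 I 2.0 0.0 0:01.81 test_cdb",
        "1546945 hst 20 0 62.3g 50180 41104 I 1.0 0.0 0:00.98 python3",
    ]
    for row in rows:
        p = parse_top_row(row)
        print(p["command"], p["state"], p["cpu"], "%.5f" % resident_fraction(row))
```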
1538
1539 But the supplied test code, which processes 31 million keys, is
1540 averaging
1541
1542 Whereas
1543 >: time cdbtest < ks_0-5.cdb
1544 found: 31,281,173
1545 untested: 15781
1546
1547 real 2m45.149s
1548 user 0m48.747s
1549 sys 0m31.113s
1550
1551 31M in 165 seconds (2.75 minutes) == 5.27e-06 (5 microsec???) per key
1552 compared to nndb result of 2.67 _seconds_ for 10,000,000 (identical) probes
1553 i.e. 2.67e-7 per probe.
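The per-key figures above can be rechecked in a few lines. This sketch only restates numbers already in these notes (165 s for 31,281,173 keys, 2.67 s for 10,000,000 probes, 117,666 lines in 30 minutes). Treating each output line of test_cdb.py as one probe is an assumption made here for comparison, not something the notes state.

```python
def per_item(seconds, items):
    """Average cost in seconds per item."""
    return seconds / items


cdbtest = per_item(165, 31_281_173)      # cdbtest < ks_0-5.cdb
nndb = per_item(2.67, 10_000_000)        # earlier nndb result
test_cdb = per_item(30 * 60, 117_666)    # test_cdb.py, per output line

if __name__ == "__main__":
    print("cdbtest  %.3g s/key" % cdbtest)
    print("nndb     %.3g s/probe" % nndb)
    print("test_cdb %.3g s/line" % test_cdb)
    print("cdbtest / nndb      %.1f" % (cdbtest / nndb))
    print("test_cdb / cdbtest  %.0f" % (test_cdb / cdbtest))
```

So cdbtest is roughly 20 times slower per key than nndb, while the slow test_cdb.py run is in the region of milliseconds per line, about three orders of magnitude slower again than cdbtest.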
1554
1555 Something weird just happened:
1556 PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1557 1777850 hst 20 0 11.2g 334892 325768 R 91.8 0.1 0:43.41 python3
1558 1777851 hst 20 0 5192 2900 608 S 23.9 0.0 0:10.28 igzip
1559 >: time uz /mnt/beegfs/pod12/common_crawl/CC-MAIN-2019-35/cdx/warc/cdx-00101.gz|python3 -c 'import test_cdbp
1560 test_cdbp.mainp()
1561 ' ks_%d-%d.cdb 3 0 > ../cdx-00101 2>/tmp/hst/three
1562
1563 real 0m54.498s
1564 user 0m44.699s
1565 sys 0m17.472s
1566 sing<4015>: fgrep -c lastmod ../cdx-00101
1567 20203
1568 sing<4016>: date
1569 Wed Feb 26 05:59:06 PM GMT 2025
1570 sing<4017>: ls -l date
1571 ls: cannot access 'date': No such file or directory
1572 sing<4018>: ls -l ../cdx-00101
1573 -rw-r--r-- 1 hst dc007 7044044593 Feb 26 17:58 ../cdx-00101
1574 sing<4019>: ls -l /tmp/hst/three
1575 -rw-r--r-- 1 hst dc007 0 Feb 26 17:57 /tmp/hst/three
1576 _Not_ because of adding more:
1577 >: time uz /mnt/beegfs/pod12/common_crawl/CC-MAIN-2019-35/cdx/warc/cdx-00101.gz|python3 -c 'import test_cdbp
1578 test_cdbp.mainp()
1579 ' ks_%d-%d.cdb 1 0 > ../cdx-00101 2>/tmp/hst/one
1580
1581 real 0m52.195s
1582 user 0m42.983s
1583 sys 0m17.266s
1584 sing<4021>: fgrep -c lastmod ../cdx-00101
1585 20203
1586 sing<4022>: ls -l ../cdx-00101
1587 -rw-r--r-- 1 hst dc007 7044044593 Feb 26 18:04 ../cdx-00101
1588
1589 Sometimes fast, sometimes not?
1590 >: time uz /mnt/beegfs/pod12/common_crawl/CC-MAIN-2019-35/cdx/warc/cdx-00101.gz|python3 -c 'import test_cdbp
1591 test_cdbp.mainp()
1592 ' ks_%d-%d.cdb 0 1 0 1 > ../cdx-00101 2>/tmp/hst/one
1593
1594 real 2m50.546s
1595 user 0m45.288s
1596 sys 0m20.332s
1597 >: time uz /mnt/beegfs/pod12/common_crawl/CC-MAIN-2019-35/cdx/warc/cdx-00101.gz | cat|python3 -c 'import test_cdbp
1598 test_cdbp.mainp()
1599 ' ks_%d-%d.cdb 0 1 0 1 > ../cdx-00101x 2>/tmp/hst/onex
1600
1601 real 0m49.305s
1602 user 0m41.800s
1603 sys 0m22.880s
1604
1605 I thought having the 'cat' in the pipeline was making the difference,
1606 but no, just as fast w/o. Something very odd
1607
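One way to read the fast and slow runs above is to compare CPU time (user + sys) with wall-clock time (real). A minimal sketch follows, assuming the NmS.SSSs format printed by the shell's time keyword; the function names are made up for illustration. A ratio well below 1 means the pipeline spent most of its time waiting rather than computing. A ratio above 1 is possible because the figures cover the whole pipeline, whose stages run in parallel.

```python
import re


def seconds(field):
    """Convert a time field such as '2m50.546s' or '0m45.288s' to seconds."""
    m = re.fullmatch(r"(?:(\d+)m)?(\d+(?:\.\d+)?)s", field)
    if m is None:
        raise ValueError("unrecognised time field: %r" % field)
    return int(m.group(1) or 0) * 60 + float(m.group(2))


def cpu_to_wall(real, user, sys_):
    """(user + sys) / real for one timed run."""
    return (seconds(user) + seconds(sys_)) / seconds(real)


if __name__ == "__main__":
    runs = {
        "cdbtest ks_0-5": ("2m45.149s", "0m48.747s", "0m31.113s"),
        "test_cdbp slow": ("2m50.546s", "0m45.288s", "0m20.332s"),
        "test_cdbp fast": ("0m49.305s", "0m41.800s", "0m22.880s"),
    }
    for name, (real, user, sys_) in runs.items():
        print("%-15s cpu/wall = %.2f" % (name, cpu_to_wall(real, user, sys_)))
```

The slow and fast test_cdbp runs use almost the same CPU time (about 65 s each), so the extra two minutes in the slow run is time spent waiting, which is consistent with an I/O or caching effect rather than with anything in the Python code itself.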
1519 ================ 1608 ================
1520 1609
1521 Try it with the existing _per segment_ index we have for 2019-35 1610 Try it with the existing _per segment_ index we have for 2019-35
1522 1611
1523 Assuming we have to key on segment / file and offset, as reconstructing the 1612 Assuming we have to key on segment / file and offset, as reconstructing the