comparison lurid3/notes.txt @ 72:7901ce4a39e3

breakthrough wrt performance of chained cdb approach
author Henry Thompson <ht@markup.co.uk>
date Thu, 06 Mar 2025 01:45:24 +0000
parents 6935ebce43e0
children 1283a574260d
comparison
71:6935ebce43e0 72:7901ce4a39e3
1603 sys 0m22.880s 1603 sys 0m22.880s
1604 1604
1605 I thought having the 'cat' in the pipeline was making the difference, 1605 I thought having the 'cat' in the pipeline was making the difference,
1606 but no, just as fast w/o. Something very odd 1606 but no, just as fast w/o. Something very odd
1607 1607
1608 OK, after many false starts, it's fairly simple: if a cdb file is
1609 _not_ in the local cache for wherever /work is coming from, mmap
1610 only fetches it in chunks of 4MB or 8MB or something similarly small.
1611 But wc -l ...cdb a) runs in a few seconds and b) gets the whole thing
1612 in the cache. So, always do that first, and a full 17-step pipeline
1613 runs nearly as fast as a single step:
1614
1615 >: time uz /beegfs/common_crawl/CC-MAIN-2019-35/cdx/warc/cdx-00101.gz | ~/bin/cdb_chain.sh 17 >/tmp/hst/cdb-00101 2>/tmp/hst/ch.out &
1616
1617 real 1m5.802s
1618 user 1m27.873s
1619 sys 0m34.349s
1620
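The "always do that first" step could be wrapped up roughly as below. This is only a sketch: prewarm_cdbs is a hypothetical helper name, not something in cdb_chain.sh, and it assumes that any full sequential read has the same cache-filling effect that wc -l showed in the run above.

```shell
#!/bin/sh
# Sketch only: read each cdb file once, sequentially, so that it is in
# the local cache before any mmap-based lookup touches it.
# prewarm_cdbs is a hypothetical helper, not part of cdb_chain.sh.
prewarm_cdbs() {
  for f in "$@"; do
    if [ -r "$f" ]; then
      # wc -l is what the notes used; the line count itself is discarded.
      wc -l < "$f" > /dev/null
    else
      echo "skipping unreadable file: $f" >&2
    fi
  done
}
```

It would be called with the cdb files for all the steps before starting the pipeline, e.g. prewarm_cdbs followed by whatever list of cdb paths the chain uses (the actual paths are not recorded here).
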
1621 And it matches the earlier version:
1622 >: fgrep -c lastmod /tmp/hst/cdb-00101
1623 1864371
1624 >: ls -l /tmp/hst/cdb-00101
1625 -rw-r--r-- 1 hst dc007 7090034893 Mar 6 01:24 /tmp/hst/cdb-00101
1626 >: uz idx/cdx-00101.gz | fgrep -c lastmod
1627 1864371
1628 >: uz idx/cdx-00101.gz | wc
1629 14681147 314748509 7090034893
1630 >: uz idx/cdx-00101.gz | cmp - /tmp/hst/cdb-00101
1631 >: echo $?
1632 0
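
The checks above (same lastmod count, then a byte-for-byte cmp) could be bundled into one function along these lines. A sketch only: same_output is a hypothetical name, and it assumes both arguments are plain uncompressed files, whereas the run above compared against a decompressed stream.

```shell
#!/bin/sh
# Sketch only: check that a regenerated file matches a reference copy.
# same_output is a hypothetical helper; both arguments are plain files.
same_output() {
  ref=$1
  new=$2
  # grep -F -c is equivalent to fgrep -c: count lines containing the string.
  # "|| true" keeps a zero count from looking like a failure.
  ref_n=$(grep -F -c lastmod "$ref" || true)
  new_n=$(grep -F -c lastmod "$new" || true)
  if [ "$ref_n" != "$new_n" ]; then
    echo "lastmod count differs: $ref_n vs $new_n" >&2
    return 1
  fi
  # cmp -s prints nothing and exits 0 only if the files are byte-identical.
  cmp -s "$ref" "$new"
}
```

The cheap count comes first so that a gross mismatch is reported with numbers before the full comparison runs.
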
1608 ================ 1633 ================
1609 1634
1610 Try it with the existing _per segment_ index we have for 2019-35 1635 Try it with the existing _per segment_ index we have for 2019-35
1611 1636
1612 Assuming we have to key on segment / file and offset, as reconstructing the 1637 Assuming we have to key on segment / file and offset, as reconstructing the