comparison lurid3/notes.txt @ 72:7901ce4a39e3
breakthrough wrt performance of chained cdb approach
author | Henry Thompson <ht@markup.co.uk> |
---|---|
date | Thu, 06 Mar 2025 01:45:24 +0000 |
parents | 6935ebce43e0 |
children | 1283a574260d |
71:6935ebce43e0 | 72:7901ce4a39e3 |
---|---|
1603 sys 0m22.880s | 1603 sys 0m22.880s |
1604 | 1604 |
1605 I thought having the 'cat' in the pipeline was making the difference, | 1605 I thought having the 'cat' in the pipeline was making the difference, |
1606 but no, just as fast w/o. Something very odd | 1606 but no, just as fast w/o. Something very odd |
1607 | 1607 |
1608 OK, after many false starts, it's fairly simple: if a cdb file is | |
1609 _not_ in the local cache for wherever /work is coming from, mmap | |
1610 only accesses bits in 4MB or 8MB or something else relatively small. | |
1611 But wc -l ...cdb a) runs in a few seconds and b) gets the whole thing | |
1612 in the cache. So, always do that first, and a full 17-step pipeline | |
1613 runs nearly as fast as a single step: | |
1614 | |
1615 >: time uz /beegfs/common_crawl/CC-MAIN-2019-35/cdx/warc/cdx-00101.gz | ~/bin/cdb_chain.sh 17 >/tmp/hst/cdb-00101 2>/tmp/hst/ch.out & | |
1616 | |
1617 real 1m5.802s | |
1618 user 1m27.873s | |
1619 sys 0m34.349s | |
1620 | |
1621 And it matches the earlier version: | |
1622 >: fgrep -c lastmod /tmp/hst/cdb-00101 | |
1623 1864371 | |
1624 >: ls -l /tmp/hst/cdb-00101 | |
1625 -rw-r--r-- 1 hst dc007 7090034893 Mar 6 01:24 /tmp/hst/cdb-00101 | |
1626 >: uz idx/cdx-00101.gz | fgrep -c lastmod | |
1627 1864371 | |
1628 >: uz idx/cdx-00101.gz | wc | |
1629 14681147 314748509 7090034893 | |
1630 >: uz idx/cdx-00101.gz | cmp - /tmp/hst/cdb-00101 | |
1631 >: echo $? | |
1632 0 | |
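
The warm-up step described above (read each cdb file through once, so the whole thing is in the local cache before anything mmaps it) could be wrapped up roughly as follows. This is only a sketch: the function name warm_cache and the /path/to/cdbs location are made up for illustration, and how cdb_chain.sh itself locates its cdb files is not shown in these notes.

```sh
#!/bin/sh
# Sketch: pull each cdb file into the local cache with one sequential read,
# before any mmap-based lookup touches it.
# warm_cache is a hypothetical helper name, not part of the existing scripts.

warm_cache() {
    for f in "$@"; do
        if [ -r "$f" ]; then
            # Same trick as in the notes: wc -l reads the whole file once.
            wc -l <"$f" >/dev/null
        else
            echo "warm_cache: cannot read $f" >&2
            return 1
        fi
    done
}

# Hypothetical usage (the cdb directory is an assumption):
# warm_cache /path/to/cdbs/*.cdb &&
#   uz cdx-00101.gz | ~/bin/cdb_chain.sh 17 >/tmp/hst/cdb-00101
```

Any full sequential read (cat to /dev/null, for instance) should have the same effect; wc -l is used here only because that is what was actually timed above.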
1608 ================ | 1633 ================ |
1609 | 1634 |
1610 Try it with the existing _per segment_ index we have for 2019-35 | 1635 Try it with the existing _per segment_ index we have for 2019-35 |
1611 | 1636 |
1612 Assuming we have to key on segment / file and offset, as reconstructing the | 1637 Assuming we have to key on segment / file and offset, as reconstructing the |