diff lurid3/notes.txt @ 72:7901ce4a39e3
breakthrough wrt performance of chained cdb approach
author | Henry Thompson <ht@markup.co.uk> |
---|---|
date | Thu, 06 Mar 2025 01:45:24 +0000 |
parents | 6935ebce43e0 |
children | 1283a574260d |
--- a/lurid3/notes.txt Wed Feb 26 19:53:07 2025 +0000
+++ b/lurid3/notes.txt Thu Mar 06 01:45:24 2025 +0000
@@ -1605,6 +1605,31 @@
 I thought having the 'cat' in the pipeline was making the difference,
 but no, just as fast w/o. Something very odd
 
+OK, after many false starts, it's fairly simple: if a cdb file is
+_not_ in the local cache for wherever /work is coming from, mmap
+only accesses bits in 4MB or 8MB or something else relatively small.
+But wc -l ...cdb a) runs in a few seconds and b) gets the whole thing
+in the cache. So, always do that first, and a full 17-step pipeline
+runs nearly as fast as a single step:
+
+ >: time uz /beegfs/common_crawl/CC-MAIN-2019-35/cdx/warc/cdx-00101.gz | ~/bin/cdb_chain.sh 17 >/tmp/hst/cdb-00101 2>/tmp/hst/ch.out &
+
+ real 1m5.802s
+ user 1m27.873s
+ sys 0m34.349s
+
+And it matches the earlier version:
+ >: fgrep -c lastmod /tmp/hst/cdb-00101
+ 1864371
+ >: ls -l /tmp/hst/cdb-00101
+ -rw-r--r-- 1 hst dc007 7090034893 Mar 6 01:24 /tmp/hst/cdb-00101
+ >: uz idx/cdx-00101.gz | fgrep -c lastmod
+ 1864371
+ >: uz idx/cdx-00101.gz | wc
+ 14681147 314748509 7090034893
+ >: uz idx/cdx-00101.gz | cmp - /tmp/hst/cdb-00101
+ >: echo $?
+ 0
 ================
 
 Try it with the existing _per segment_ index we have for 2019-35
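
The "always do that first" step in the added note could be wrapped up roughly as follows. This is only a sketch: the function name warm_cache is made up here, and it assumes that one sequential read per file (the same thing wc -l does) is enough to pull the whole file into whatever cache sits in front of /work. Nothing here is taken from cdb_chain.sh, whose contents are not shown in this changeset.

```shell
#!/bin/sh
# warm_cache: hypothetical helper, not part of the notes or of cdb_chain.sh.
# Reads each named file once, front to back, the way 'wc -l ...cdb' does,
# on the assumption that this leaves the whole file in the local cache
# before anything mmaps it.
warm_cache() {
  for f in "$@"; do
    if [ -r "$f" ]; then
      # Only the read matters; the line count itself is thrown away.
      wc -l < "$f" > /dev/null
    else
      echo "warm_cache: cannot read $f" >&2
      return 1
    fi
  done
}
```

Usage would be something like warm_cache followed by the list of cdb files, run once before starting the chained pipeline. The non-zero return on an unreadable file is there so that a missing cdb shows up before the long run starts, not partway through it.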