changeset 72:7901ce4a39e3

breakthrough wrt performance of chained cdb approach
author Henry Thompson <ht@markup.co.uk>
date Thu, 06 Mar 2025 01:45:24 +0000
parents 6935ebce43e0
children 1283a574260d
files lurid3/notes.txt
diffstat 1 files changed, 25 insertions(+), 0 deletions(-)
--- a/lurid3/notes.txt	Wed Feb 26 19:53:07 2025 +0000
+++ b/lurid3/notes.txt	Thu Mar 06 01:45:24 2025 +0000
@@ -1605,6 +1605,31 @@
 I thought having the 'cat' in the pipeline was making the difference,
 but no, just as fast w/o.  Something very odd
 
+OK, after many false starts, it's fairly simple:  if a cdb file is
+_not_ in the local cache for wherever /work is coming from, mmap
+only accesses bits in 4MB or 8MB or something else relatively small.
+But wc -l ...cdb a) runs in a few seconds and b) gets the whole thing
+in the cache.  So, always do that first, and a full 17-step pipeline
+runs nearly as fast as a single step:
+
+  >: time uz /beegfs/common_crawl/CC-MAIN-2019-35/cdx/warc/cdx-00101.gz | ~/bin/cdb_chain.sh 17 >/tmp/hst/cdb-00101 2>/tmp/hst/ch.out &
+
+  real    1m5.802s
+  user    1m27.873s
+  sys     0m34.349s
+
+And it matches the earlier version:
+  >: fgrep -c lastmod /tmp/hst/cdb-00101
+  1864371
+  >: ls -l /tmp/hst/cdb-00101
+  -rw-r--r-- 1 hst dc007 7090034893 Mar  6 01:24 /tmp/hst/cdb-00101
+  >: uz idx/cdx-00101.gz | fgrep -c lastmod
+  1864371
+  >: uz idx/cdx-00101.gz | wc
+  14681147 314748509 7090034893
+  >: uz idx/cdx-00101.gz | cmp - /tmp/hst/cdb-00101
+  >: echo $?
+  0
 ================
 
 Try it with the existing _per segment_ index we have for 2019-35