changeset 52:8dffb8aa33da

prelim consistency check with published lmh-augmented cdx
author Henry S. Thompson <ht@inf.ed.ac.uk>
date Thu, 10 Oct 2024 17:44:58 +0100
parents dc24bb6e524f
children d533894173d0
files lurid3/notes.txt
diffstat 1 files changed, 19 insertions(+), 0 deletions(-)
line diff
--- a/lurid3/notes.txt	Wed Oct 09 22:55:27 2024 +0100
+++ b/lurid3/notes.txt	Thu Oct 10 17:44:58 2024 +0100
@@ -906,3 +906,22 @@
 Not bad, so order 20MB for the whole thing
 
 Next step, compare to my existing cdx with timestamp
+
+First check looks about right:
+
+  [cd .../warc_lmhx]
+  >: seq --format='%03g' 0 299 > /tmp/hst/cdx_nums
+  >: parallel -j 20 -a /tmp/hst/cdx_nums 'uz idx/cdx-00{}.gz | egrep -o "\"filename\": \"crawl-data/CC-MAIN-2019-35/segments/[^.]*[.]50.*\"lastmod\":" | sed "s/^.*-00//;s/^\(...\).*/\1/"| sus > /tmp/hst/checkseg_50_{}'
+
+  [cd .../aug_cdx/50]
+  >: wc -l 00123.tsv
+  9333 00123.tsv
+  >: egrep -h '123$' /tmp/hst/checkseg_50_??? | acut 1 | btot
+  9300
+  >: wc -l 00400.tsv
+  9477 00400.tsv
+  >: egrep -h '400$' /tmp/hst/checkseg_50_??? | acut 1 | btot
+  9439
+
+The differences (33 and 38) are presumably because the bogus timestamps
+aren't in the augmented cdx as shipped.
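
Note on the check above: the pipeline counts, per WARC file number, the published cdx entries in segment 50 that carry a "lastmod" field, and compares that count with the line count of the corresponding .tsv file. The following is a minimal Python sketch of the same counting, not the author's tooling. It assumes that `sus` behaves like `sort | uniq -c`, that `acut 1` selects the first field and that `btot` sums a column; these are guesses about private helpers. The helper `make_line` and the sample entries are made up for illustration.

```python
import re
from collections import Counter

# Same pattern as the egrep step in the notes: segment 50 entries that
# also carry a "lastmod" field later on the same line.
PATTERN = re.compile(
    r'"filename": "crawl-data/CC-MAIN-2019-35/segments/[^.]*[.]50.*"lastmod":'
)


def count_lastmod_by_warc(lines):
    """Count matching cdx entries per 3-character WARC file number."""
    counts = Counter()
    for line in lines:
        m = PATTERN.search(line)
        if m is None:
            continue
        # Mirrors the sed step: drop everything up to the last "-00",
        # then keep the next three characters.
        _, _, tail = m.group(0).rpartition("-00")
        counts[tail[:3]] += 1
    return counts


def make_line(segment, warc_num, with_lastmod):
    """Build a made-up cdx line (hypothetical values, for illustration only)."""
    name = (
        f"crawl-data/CC-MAIN-2019-35/segments/1566000000000.{segment}"
        f"/warc/CC-MAIN-20190817000000-20190817010000-00{warc_num}.warc.gz"
    )
    extra = ', "lastmod": "1565000000"' if with_lastmod else ""
    return (
        'com,example)/ 20190817000000 '
        f'{{"url": "http://example.com/", "filename": "{name}"{extra}}}'
    )


sample = [
    make_line("50", "123", True),
    make_line("50", "123", True),
    make_line("50", "123", False),  # no lastmod: not counted
    make_line("51", "123", True),   # other segment: not counted
    make_line("50", "400", True),
]

counts = count_lastmod_by_warc(sample)
print(dict(counts))
```

With real data, the count for a given file number would be compared with the line count of the matching .tsv file, as in the `wc -l` steps above.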