changeset 52:8dffb8aa33da
prelim consistency check with published lmh-augmented cdx
author | Henry S. Thompson <ht@inf.ed.ac.uk> |
---|---|
date | Thu, 10 Oct 2024 17:44:58 +0100 |
parents | dc24bb6e524f |
children | d533894173d0 |
files | lurid3/notes.txt |
diffstat | 1 files changed, 19 insertions(+), 0 deletions(-) |
--- a/lurid3/notes.txt Wed Oct 09 22:55:27 2024 +0100
+++ b/lurid3/notes.txt Thu Oct 10 17:44:58 2024 +0100
@@ -906,3 +906,22 @@
 Not bad, so order 20MB for the whole thing
 Next step, compare to my existing cdx with timestamp
+
+First check looks about right:
+
+  [cd .../warc_lmhx]
+  >: seq --format='%03g' 0 299 > /tmp/hst/cdx_nums
+  >: parallel -j 20 -a /tmp/hst/cdx_nums 'uz idx/cdx-00{}.gz | egrep -o "\"filename\": \"crawl-data/CC-MAIN-2019-35/segments/[^.]*[.]50.*\"lastmod\":" | sed "s/^.*-00//;s/^\(...\).*/\1/"| sus > /tmp/hst/checkseg_50_{}'
+
+  [cd .../aug_cdx/50]
+  >: wc -l 00123.tsv
+  9333
+  >: egrep -h '123$' /tmp/hst/checkseg_50_??? | acut 1 | btot
+  9300
+  >: wc -l 00400.tsv
+  9477 00400.tsv
+  >: egrep -h '400$' /tmp/hst/checkseg_50_??? | acut 1 | btot
+  9439
+
+Difference is presumable the bogus timestamps aren't in the augmented
+cdx as shipped.
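The summing step in the notes relies on `acut` and `btot`, which look like the author's own helper commands and are not defined in this changeset. As a rough sketch only, the same total can be computed with plain `awk`. It assumes each `/tmp/hst/checkseg_50_NNN` tally file holds lines of the form `<count> <3-digit WARC file number>`, which is what a `sort | uniq -c` style tally would produce. The function name `sum_counts` is hypothetical.

```shell
# Sum the per-index-file counts for one WARC file number.
# Assumption: every tally line is "<count> <file number>", e.g. "  42 123".
# Usage: sum_counts 123 /tmp/hst/checkseg_50_???
sum_counts() {
  num=$1
  shift
  # Compare as strings so that "050" only matches "050", then add up column 1.
  awk -v n="$num" '($2 "") == (n "") { total += $1 } END { print total + 0 }' "$@"
}
```

Against the figures recorded above, the published index comes up 33 lines short for `00123.tsv` (9333 vs 9300) and 38 lines short for `00400.tsv` (9477 vs 9439).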