comparison lurid3/notes.txt @ 52:8dffb8aa33da

prelim consistency check with published lmh-augmented cdx
author Henry S. Thompson <ht@inf.ed.ac.uk>
date Thu, 10 Oct 2024 17:44:58 +0100
parents dc24bb6e524f
children d533894173d0
comparison
51:dc24bb6e524f 52:8dffb8aa33da
904 1,902,916 904 1,902,916
905 905
906 Not bad, so order 20MB for the whole thing 906 Not bad, so order 20MB for the whole thing
907 907
908 Next step, compare to my existing cdx with timestamp 908 Next step, compare to my existing cdx with timestamp
909
910 First check looks about right:
911
912 [cd .../warc_lmhx]
913 >: seq --format='%03g' 0 299 > /tmp/hst/cdx_nums
914 >: parallel -j 20 -a /tmp/hst/cdx_nums 'uz idx/cdx-00{}.gz | egrep -o "\"filename\": \"crawl-data/CC-MAIN-2019-35/segments/[^.]*[.]50.*\"lastmod\":" | sed "s/^.*-00//;s/^\(...\).*/\1/"| sus > /tmp/hst/checkseg_50_{}'
915
916 [cd .../aug_cdx/50]
917 >: wc -l 00123.tsv
918 9333 00123.tsv
919 >: egrep -h '123$' /tmp/hst/checkseg_50_??? | acut 1 | btot
920 9300
921 >: wc -l 00400.tsv
922 9477 00400.tsv
923 >: egrep -h '400$' /tmp/hst/checkseg_50_??? | acut 1 | btot
924 9439
925
926 Difference is presumably that the bogus timestamps aren't in the augmented
927 cdx as shipped.
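The per-file tally above relies on private helpers (sus, acut, btot) whose definitions are not shown in these notes. As a rough sketch only, assuming sus means "sort | uniq -c | sort -n", acut N prints whitespace-separated field N, and btot sums one number per line, the check could be reproduced with standard tools as follows. The function name tally_for is hypothetical, and grep -E stands in for egrep.

```shell
#!/bin/sh
# Hypothetical stand-ins for the private helpers used in the notes.
# These definitions are assumptions, not the author's actual code.

# sus: assumed to count duplicate lines, giving "COUNT VALUE" per line
sus() { sort | uniq -c | sort -n; }

# acut N: assumed to print whitespace-separated field N of each line
acut() { awk -v n="$1" '{ print $n }'; }

# btot: assumed to sum one number per line (prints 0 on empty input)
btot() { awk '{ t += $1 } END { print t + 0 }'; }

# tally_for NUM FILE...
# Sum the counts for WARC file number NUM across the per-cdx tally
# files, i.e. all "COUNT NUM" lines whose last field ends in NUM.
tally_for() {
  num=$1
  shift
  grep -h -E "${num}\$" "$@" | acut 1 | btot
}
```

With these in place, something like "tally_for 123 /tmp/hst/checkseg_50_???" would correspond to the egrep | acut 1 | btot line above, and its result can be compared against "wc -l < 00123.tsv".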