comparison lurid3/notes.txt @ 52:8dffb8aa33da
prelim consistency check with published lmh-augmented cdx
author | Henry S. Thompson <ht@inf.ed.ac.uk> |
---|---|
date | Thu, 10 Oct 2024 17:44:58 +0100 |
parents | dc24bb6e524f |
children | d533894173d0 |
51:dc24bb6e524f | 52:8dffb8aa33da |
---|---|
904 1,902,916 | 904 1,902,916 |
905 | 905 |
906 Not bad, so order 20MB for the whole thing | 906 Not bad, so order 20MB for the whole thing |
907 | 907 |
908 Next step, compare to my existing cdx with timestamp | 908 Next step, compare to my existing cdx with timestamp |
909 | |
910 First check looks about right: | |
911 | |
912 [cd .../warc_lmhx] | |
913 >: seq --format='%03g' 0 299 > /tmp/hst/cdx_nums | |
914 >: parallel -j 20 -a /tmp/hst/cdx_nums 'uz idx/cdx-00{}.gz | egrep -o "\"filename\": \"crawl-data/CC-MAIN-2019-35/segments/[^.]*[.]50.*\"lastmod\":" | sed "s/^.*-00//;s/^\(...\).*/\1/"| sus > /tmp/hst/checkseg_50_{}' | |
915 | |
916 [cd .../aug_cdx/50] | |
917 >: wc -l 00123.tsv | |
918 9333 | |
919 >: egrep -h '123$' /tmp/hst/checkseg_50_??? | acut 1 | btot | |
920 9300 | |
921 >: wc -l 00400.tsv | |
922 9477 00400.tsv | |
923 >: egrep -h '400$' /tmp/hst/checkseg_50_??? | acut 1 | btot | |
924 9439 | |
925 | |
926 Difference is presumably that the bogus timestamps aren't in the augmented | |
927 cdx as shipped. |
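
The per-file check above can be sketched with standard tools only. This is a minimal sketch, not the method used in the notes: `tally_total` and `tsv_excess` are hypothetical names, and it assumes each line of the /tmp/hst/checkseg_50_* files has the form `<count> <filenum>`. The notes do not state that format, since `sus`, `acut` and `btot` are local commands.

```shell
#!/bin/sh
# Hypothetical helper: sum the leading counts on lines whose last field
# is the given file number, across any number of tally files.
# Assumption: each line looks like "<count> <filenum>", e.g. "  37 123".
tally_total() {
    num=$1; shift
    awk -v n="$num" '$NF == n { t += $1 } END { print t + 0 }' "$@"
}

# Hypothetical helper: line count of a TSV minus the tallied total,
# i.e. how many TSV rows have no counterpart in the tallies.
tsv_excess() {
    tsv=$1; num=$2; shift 2
    lines=$(wc -l < "$tsv" | tr -d ' ')
    echo $(( lines - $(tally_total "$num" "$@") ))
}
```

Under that assumption, `tsv_excess 00123.tsv 123 /tmp/hst/checkseg_50_???` corresponds to the first pair of figures above (9333 - 9300 = 33), and the same call for 00400.tsv to the second (9477 - 9439 = 38).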