comparison lurid3/notes.txt @ 53:d533894173d0

detailed consistency check with 7 segments from published lmh-augmented cdx
author Henry S. Thompson <ht@inf.ed.ac.uk>
date Fri, 11 Oct 2024 16:41:32 +0100
parents 8dffb8aa33da
children dd06d7afbfe0
52:8dffb8aa33da 53:d533894173d0
923 >: egrep -h '400$' /tmp/hst/checkseg_50_??? | acut 1 | btot 923 >: egrep -h '400$' /tmp/hst/checkseg_50_??? | acut 1 | btot
924 9439 924 9439
925 925
926 Difference is presumably that the bogus timestamps aren't in the augmented 926 Difference is presumably that the bogus timestamps aren't in the augmented
927 cdx as shipped. 927 cdx as shipped.
928
929 Note that the following 'bad' kind of timestamp is fixed before
930 sort_date.py does its thing:
931
932 ... sort_date.sh <(uz $arg/*00???.warc.gz | '"fgrep $'\t'|sed '/GMT$/s/\([^ ]\)GMT$/\1 GMT/')"' >$arg/ks.tsv
933
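As a quick illustration of the fix above, the sketch below applies the same sed expression to two made-up lines. The sample keys and timestamps are assumptions, not taken from the WARC data, and the helper name fix_gmt is hypothetical. Only a line that ends in GMT with no space before it is changed.

```shell
# Same sed expression as in the notes: insert a space before a trailing
# "GMT" when the preceding character is not already a space.
fix_gmt () {
  sed '/GMT$/s/\([^ ]\)GMT$/\1 GMT/'
}

# Hypothetical sample lines: a key, a tab, then a timestamp.
fixed=$(printf 'k1\tFri, 11 Oct 2024 15:41:32GMT\nk2\tFri, 11 Oct 2024 15:41:32 GMT\n' | fix_gmt)
printf '%s\n' "$fixed"
```

The second line already has its space, so the substitution does not match and the line passes through untouched.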
934
935 >: egrep -c '[^ ]GMT$' 50/00123.tsv
936 22
937 >: egrep -c '[^ ]GMT$' 50/00400.tsv
938 14
939
940 >: PYTHONPATH=~/.local/lib/python3.9/site-packages:$PYTHONPATH sort_date.sh <(uz ../warc_lmhx/50/*00123.warc.gz | fgrep $'\t'|sed '/GMT$/s/\([^ ]\)GMT$/\1 GMT/') 2> /tmp/hst/123_errs | wc -l
941 9300
942 >: fgrep -c Invalid /tmp/hst/123_errs
943 33
944 >: PYTHONPATH=~/.local/lib/python3.9/site-packages:$PYTHONPATH sort_date.sh <(uz ../warc_lmhx/50/*00400.warc.gz | fgrep $'\t'|sed '/GMT$/s/\([^ ]\)GMT$/\1 GMT/') 2> /tmp/hst/400_errs | wc -l
945 9439
946 >: fgrep -c Invalid /tmp/hst/400_errs
947 38
948
949 All good: the 9439 for 00400 matches the btot total above.
950
951 But check every file in segment 50:
952 >: seq --format='%03g' 0 559 > /tmp/hst/warc_nums
953 >: xx () {
954 r=$(diff -bw \
955 <(echo $((
956 $(sort_date.sh <(uz ../warc_lmhx/50/*00$1.warc.gz |
957 fgrep $'\t'|sed '/GMT$/s/\([^ ]\)GMT$/\1 GMT/') 2>/tmp/hst/ec_$1 |wc -l)
958 +
959 $(fgrep -c Invalid /tmp/hst/ec_$1)))) \
960 <(wc -l < 50/00$1.tsv))
961 if [ "$r" ]
962 then printf "%s:\n%s\n" $2 "$r"
963 fi
964 }
965 >: parallel -j 20 -a /tmp/hst/warc_nums xx '{}' '$(({#} - 1))' | tee /tmp/hst/aug_bugs
966 >: fgrep -c 1c1 /tmp/hst/aug_bugs
967 77
968 >: wc -l < /tmp/hst/aug_bugs
969 385
970 >: echo $((77 * 5))
971 385
972
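The 77 * 5 arithmetic follows from the shape of each report: for two one-line inputs that differ, diff prints four lines (1c1, the old line, a separator, the new line), and xx adds a label line in front. A minimal sketch of one such report, assuming made-up counts 10 and 12 and using temporary files in place of the process substitutions in xx:

```shell
# Two one-line files that differ, standing in for the two process
# substitutions in xx (the counts 10 and 12 are made up).
tmp=$(mktemp -d)
echo 10 > "$tmp/computed"
echo 12 > "$tmp/actual"

# diff exits non-zero when the files differ, hence the "|| true".
r=$(diff -bw "$tmp/computed" "$tmp/actual" || true)

# Label line plus diff output, in the format xx prints.
report=$(printf '%s:\n%s\n' 123 "$r")
printf '%s\n' "$report"

# One report is 1 label line + 4 diff lines.
report_lines=$(printf '%s\n' "$report" | wc -l | tr -d ' ')
echo "$report_lines"
rm -rf "$tmp"
```

So a file of 385 lines with 77 occurrences of 1c1 is consistent with exactly 77 mismatching files and nothing else in the output.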
973 OK, there are a few other error messages from date conversion, so count those as well:
974 >: xx () { r=$(diff -bw <(echo $(($(sort_date.sh <(uz ../warc_lmhx/50/*00$1.warc.gz | fgrep $'\t'|sed '/GMT$/s/\([^ ]\)GMT$/\1 GMT/') 2>/tmp/hst/ec_$1 |wc -l) + $(egrep -c 'Invalid|must be in|out of range' /tmp/hst/ec_$1)))) <(wc -l < 50/00$1.tsv)); if [ "$r" ]; then printf "%s:\n%s\n" $2 "$r"; fi; }
975 >: parallel -j 20 -a /tmp/hst/warc_nums xx '{}' '$(({#} - 1))' | tee /tmp/hst/aug_bugs2
976 [nothing]
977
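The bookkeeping that xx checks is that every input line is accounted for: the lines that come out of the date conversion plus the lines rejected with one of the three error messages should equal the line count of the .tsv file. A self-contained sketch on made-up files follows. The function name check_counts, the file names and the sample error wording are assumptions for illustration; only the three grep patterns come from the notes, and grep -E stands in for egrep.

```shell
# check_counts CONVERTED_FILE ERROR_LOG TSV_FILE
# Returns 0 silently when converted lines + error lines == tsv lines;
# otherwise prints both totals and returns 1.
check_counts () {
  converted=$(wc -l < "$1" | tr -d ' ')
  errors=$(grep -E -c 'Invalid|must be in|out of range' "$2")
  expected=$(wc -l < "$3" | tr -d ' ')
  if [ $((converted + errors)) -ne "$expected" ]; then
    printf 'got %s, expected %s\n' $((converted + errors)) "$expected"
    return 1
  fi
}

tmp=$(mktemp -d)
printf 'a\nb\nc\n' > "$tmp/converted"                       # 3 converted lines
printf 'Invalid date\nday is out of range\n' > "$tmp/errs"   # 2 rejected lines
printf '1\n2\n3\n4\n5\n' > "$tmp/all.tsv"                   # 5 input lines
check_counts "$tmp/converted" "$tmp/errs" "$tmp/all.tsv" && status=ok || status=mismatch
echo "$status"
rm -rf "$tmp"
```

Like egrep -c in the notes, grep -E -c counts matching lines rather than matches, so an error line that happens to contain two of the patterns is still counted once.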
978 So, I think we can believe we're OK for segment 50.
979 But 7 segments are better than 1:
980 >: xx () { r=$(diff -bw <(echo $(($(sort_date.sh <(uz ../warc_lmhx/$3/*00$1.warc.gz | fgrep $'\t'|sed '/GMT$/s/\([^ ]\)GMT$/\1 GMT/') 2>/tmp/hst/ec_$1 |wc -l) + $(egrep -c 'Invalid|must be in|out of range' /tmp/hst/ec_$1)))) <(wc -l < $3/00$1.tsv)); if [ "$r" ]; then printf "%s:\n%s\n" $2 "$r"; fi; }
981 >: for s in 49 {51..55}; do parallel -j 20 -a /tmp/hst/warc_nums xx '{}' '$(({#} - 1))' $s | tee /tmp/hst/aug_bugs_$s; done
982 [nothing]
983
984 Next step: ?
985
986
987