comparison lurid3/notes.txt @ 53:d533894173d0
detailed consistency check with 7 segments from published lmh-augmented cdx
author | Henry S. Thompson <ht@inf.ed.ac.uk> |
---|---|
date | Fri, 11 Oct 2024 16:41:32 +0100 |
parents | 8dffb8aa33da |
children | dd06d7afbfe0 |
comparison
52:8dffb8aa33da | 53:d533894173d0 |
---|---|
923 >: egrep -h '400$' /tmp/hst/checkseg_50_??? | acut 1 | btot | 923 >: egrep -h '400$' /tmp/hst/checkseg_50_??? | acut 1 | btot |
924 9439 | 924 9439 |
925 | 925 |
926 Difference is presumably that the bogus timestamps aren't in the augmented | 926 Difference is presumably that the bogus timestamps aren't in the augmented |
927 cdx as shipped. | 927 cdx as shipped. |
928 | |
929 Note that the following 'bad' kind of timestamp is fixed before | |
930 sort_date.py does its thing: | |
931 | |
932 ... sort_date.sh <(uz $arg/*00???.warc.gz | '"fgrep $'\t'|sed '/GMT$/s/\([^ ]\)GMT$/\1 GMT/')"' >$arg/ks.tsv | |
933 | |
934 | |
935 >: egrep -c '[^ ]GMT$' 50/00123.tsv | |
936 22 | |
937 >: egrep -c '[^ ]GMT$' 50/00400.tsv | |
938 14 | |
939 | |
940 >: PYTHONPATH=~/.local/lib/python3.9/site-packages:$PYTHONPATH sort_date.sh <(uz ../warc_lmhx/50/*00123.warc.gz | fgrep $'\t'|sed '/GMT$/s/\([^ ]\)GMT$/\1 GMT/') 2> /tmp/hst/123_errs | wc -l | |
941 9300 | |
942 >: fgrep -c Invalid /tmp/hst/123_errs | |
943 33 | |
944 >: PYTHONPATH=~/.local/lib/python3.9/site-packages:$PYTHONPATH sort_date.sh <(uz ../warc_lmhx/50/*00400.warc.gz | fgrep $'\t'|sed '/GMT$/s/\([^ ]\)GMT$/\1 GMT/') 2> /tmp/hst/400_errs | wc -l | |
945 9439 | |
946 >: fgrep -c Invalid /tmp/hst/400_errs | |
947 38 | |
948 | |
949 All good. | |
950 | |
951 But | |
952 >: seq --format='%03g' 0 559 > /tmp/hst/warc_nums | |
953 >: xx () { | |
954 r=$(diff -bw \ | |
955 <(echo $(( | |
956 $(sort_date.sh <(uz ../warc_lmhx/50/*00$1.warc.gz | | |
957 fgrep $'\t'|sed '/GMT$/s/\([^ ]\)GMT$/\1 GMT/') 2>/tmp/hst/ec_$1 |wc -l) | |
958 + | |
959 $(fgrep -c Invalid /tmp/hst/ec_$1)))) \ | |
960 <(wc -l < 50/00$1.tsv)) | |
961 if [ "$r" ] | |
962 then printf "%s:\n%s\n" $2 "$r" | |
963 fi | |
964 } | |
965 >: parallel -j 20 -a /tmp/hst/warc_nums xx '{}' '$(({#} - 1))' | tee /tmp/hst/aug_bugs | |
966 >: fgrep -c 1c1 /tmp/hst/aug_bugs | |
967 77 | |
968 sing<4318>: wc -l < /tmp/hst/aug_bugs | |
969 385 | |
970 sing<4319>: echo $((77 * 5)) | |
971 385 | |
972 | |
973 OK, there are a few other error messages from date conversion | |
974 >: xx () { r=$(diff -bw <(echo $(($(sort_date.sh <(uz ../warc_lmhx/50/*00$1.warc.gz | fgrep $'\t'|sed '/GMT$/s/\([^ ]\)GMT$/\1 GMT/') 2>/tmp/hst/ec_$1 |wc -l) + $(egrep -c 'Invalid|must be in|out of range' /tmp/hst/ec_$1)))) <(wc -l < 50/00$1.tsv)); if [ "$r" ]; then printf "%s:\n%s\n" $2 "$r"; fi; } | |
975 sing<4337>: parallel -j 20 -a /tmp/hst/warc_nums xx '{}' '$(({#} - 1))' | tee /tmp/hst/aug_bugs2 | |
976 [nothing] | |
977 | |
978 So, I think we can believe we're OK | |
979 But 7 segments are better than 1, so repeat the check for segments 49 and 51-55: | |
980 >: xx () { r=$(diff -bw <(echo $(($(sort_date.sh <(uz ../warc_lmhx/$3/*00$1.warc.gz | fgrep $'\t'|sed '/GMT$/s/\([^ ]\)GMT$/\1 GMT/') 2>/tmp/hst/ec_$1 |wc -l) + $(egrep -c 'Invalid|must be in|out of range' /tmp/hst/ec_$1)))) <(wc -l < $3/00$1.tsv)); if [ "$r" ]; then printf "%s:\n%s\n" $2 "$r"; fi; } | |
981 >: for s in 49 {51..55}; do parallel -j 20 -a /tmp/hst/warc_nums xx '{}' '$(({#} - 1))' $s | tee /tmp/hst/aug_bugs_$s; done | |
982 [nothing] | |
983 | |
984 Next step: ? | |
985 | |
986 | |
987 |
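
The check repeated throughout these notes rests on two ideas: a trailing `GMT` that is glued to the preceding field gets a space inserted before date conversion, and the rows the converter emits plus the rows it rejects should add up to the rows in the published `.tsv`. The following is a minimal sketch of just those two pieces, not the actual pipeline. The function names `fix_gmt` and `counts_agree` are hypothetical, and the sketch assumes the converter writes one error line per rejected row, which is what the counts in the notes suggest but does not prove. `sort_date.sh` and `uz` are not reproduced here.

```shell
#!/usr/bin/env bash

# Hypothetical helper: insert the missing space before a trailing "GMT".
# The sed expression is the one used in the notes; lines that already
# end in " GMT", or that do not end in "GMT" at all, pass through unchanged.
fix_gmt () {
  sed '/GMT$/s/\([^ ]\)GMT$/\1 GMT/'
}

# Hypothetical helper: succeed when converted rows plus rejected rows
# account for every expected row.
#   $1 = number of rows the converter wrote
#   $2 = file holding the converter's stderr
#   $3 = file holding the expected rows (one per line)
counts_agree () {
  local errs total
  # Assumption: each rejected row produces exactly one matching error line.
  errs=$(grep -E -c 'Invalid|must be in|out of range' "$2")
  # The arithmetic expansion strips any padding wc may add.
  total=$(( $(wc -l < "$3") ))
  [ $(( $1 + errs )) -eq "$total" ]
}
```

For example, `printf 'x\tMon, 01 Jan 2024 00:00:00GMT\n' | fix_gmt` yields the same line with ` GMT` at the end, and `counts_agree 9300 /tmp/hst/123_errs 50/00123.tsv` would return success only if the error lines make up the difference between the two counts.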