# HG changeset patch
# User Henry Thompson
# Date 1741569596 0
# Node ID ff6ef190f90112760f2bc0761ebb6264abd6309b
# Parent  1283a574260dcd1ea1e63992cdd97328cd50b764
cross-check wrt seg 15 of 2023-40

diff -r 1283a574260d -r ff6ef190f901 lurid3/notes.txt
--- a/lurid3/notes.txt	Sun Mar 09 19:59:50 2025 +0000
+++ b/lurid3/notes.txt	Mon Mar 10 01:19:56 2025 +0000
@@ -1728,7 +1728,24 @@
   all.cdx:12190045:20230922044250 "http://www.dryelf.com/downloads.asp?Sort=0&Cat=3",
   all.cdx:25401353:20230922053834 "http://www.dryelf.com/downloads.asp?Sort=2&Cat=0",
 
-Stupid -- mime-detected has a space in it...
+Stupid -- mime-detected sometimes has a space in it...
+
+  >: ls *.gz | parallel -j 10 "uz '{}' |fgrep .15/warc/CC-MAIN | cut -f 2,4 -d ' ' | sort > '/tmp/hst/15/orig_{#}.cdx'"
+  >: sort -m *.cdx | sponge all.cdx
+  >: wc -l all.cdx
+  34742533
+Diff looks plausible, and a few known cases are correct now:
+  >: fgrep -n '20230922034234 "https://www.hansaton.com/zh-hk/' test.cdb_in
+  4361:20230922034234 "https://www.hansaton.com/zh-hk/%E6%B6%88%E8%B2%BB%E8%80%85.html",
+  >: fgrep -n '20230922034234 "https://www.hansaton.com/zh-hk/' all.cdx
+  27516:20230922034234 "https://www.hansaton.com/zh-hk/%E6%B6%88%E8%B2%BB%E8%80%85.html",
+  >: fgrep dryelf.com test.cdb_in all.cdx
+  test.cdb_in:20230922034248 "http://www.dryelf.com/files/DryCookieInstall.exe",
+  test.cdb_in:20230922043436 "http://www.dryelf.com/files/DryLottoInstall.exe",
+  all.cdx:20230922034248 "http://www.dryelf.com/files/DryCookieInstall.exe",
+  all.cdx:20230922043436 "http://www.dryelf.com/files/DryLottoInstall.exe",
+  all.cdx:20230922044250 "http://www.dryelf.com/downloads.asp?Sort=0&Cat=3",
+  all.cdx:20230922053834 "http://www.dryelf.com/downloads.asp?Sort=2&Cat=0",
 ================
 
 Try it with the existing _per segment_ index we have for 2019-35