changeset 74:ff6ef190f901 default tip
cross-check wrt seg 15 of 2023-40
author | Henry Thompson <ht@markup.co.uk> |
---|---|
date | Mon, 10 Mar 2025 01:19:56 +0000 |
parents | 1283a574260d |
children | |
files | lurid3/notes.txt |
diffstat | 1 files changed, 18 insertions(+), 1 deletions(-) |
--- a/lurid3/notes.txt Sun Mar 09 19:59:50 2025 +0000
+++ b/lurid3/notes.txt Mon Mar 10 01:19:56 2025 +0000
@@ -1728,7 +1728,24 @@
 all.cdx:12190045:20230922044250 "http://www.dryelf.com/downloads.asp?Sort=0&Cat=3",
 all.cdx:25401353:20230922053834 "http://www.dryelf.com/downloads.asp?Sort=2&Cat=0",
 
-Stupid -- mime-detected has a space in it...
+Stupid -- mime-detected sometimes has a space in it...
+
+ >: ls *.gz | parallel -j 10 "uz '{}' |fgrep .15/warc/CC-MAIN | cut -f 2,4 -d ' ' | sort > '/tmp/hst/15/orig_{#}.cdx'"
+ >: sort -m *.cdx | sponge all.cdx
+ >: wc -l all.cdx
+ 34742533
+Diff looks plausible, and a few known cases are correct now:
+ >: fgrep -n '20230922034234 "https://www.hansaton.com/zh-hk/' test.cdb_in
+ 4361:20230922034234 "https://www.hansaton.com/zh-hk/%E6%B6%88%E8%B2%BB%E8%80%85.html",
+ >: fgrep -n '20230922034234 "https://www.hansaton.com/zh-hk/' all.cdx
+ 27516:20230922034234 "https://www.hansaton.com/zh-hk/%E6%B6%88%E8%B2%BB%E8%80%85.html",
+ >: fgrep dryelf.com test.cdb_in all.cdx
+ test.cdb_in:20230922034248 "http://www.dryelf.com/files/DryCookieInstall.exe",
+ test.cdb_in:20230922043436 "http://www.dryelf.com/files/DryLottoInstall.exe",
+ all.cdx:20230922034248 "http://www.dryelf.com/files/DryCookieInstall.exe",
+ all.cdx:20230922043436 "http://www.dryelf.com/files/DryLottoInstall.exe",
+ all.cdx:20230922044250 "http://www.dryelf.com/downloads.asp?Sort=0&Cat=3",
+ all.cdx:20230922053834 "http://www.dryelf.com/downloads.asp?Sort=2&Cat=0",
 
 ================
 Try it with the existing _per segment_ index we have for 2019-35
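A minimal sketch of why `cut -f 2,4 -d ' '` is not disturbed by a space in mime-detected. The two input lines are made up, and their layout (key, timestamp, then a JSON record whose first entry is the URL) is an assumption about the index files, not something copied from them. Fields 2 and 4 lie to the left of the mime-detected value, so a space inside it cannot shift them.

```shell
# Two invented CDX lines (layout assumed): key, timestamp, JSON record.
plain='com,example)/a 20230922034234 {"url": "http://www.example.com/a", "mime-detected": "text/html"}'
# Same shape, but with an invented mime-detected value that contains a space.
spaced='com,example)/b 20230922034248 {"url": "http://www.example.com/b", "mime-detected": "some type"}'

# Same selection as in the notes: field 2 is the timestamp,
# field 4 is the quoted URL together with its trailing comma.
plain_out=$(printf '%s\n' "$plain" | cut -f 2,4 -d ' ')
spaced_out=$(printf '%s\n' "$spaced" | cut -f 2,4 -d ' ')

printf '%s\n%s\n' "$plain_out" "$spaced_out"
```

Both lines come out in the same `timestamp "url",` shape as the entries in all.cdx above. Any selection that counted fields from the right-hand end would, by contrast, be off by one on the second line.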