changeset 74:ff6ef190f901 default tip

cross-check wrt seg 15 of 2023-40
author Henry Thompson <ht@markup.co.uk>
date Mon, 10 Mar 2025 01:19:56 +0000
parents 1283a574260d
children
files lurid3/notes.txt
diffstat 1 files changed, 18 insertions(+), 1 deletions(-)
--- a/lurid3/notes.txt	Sun Mar 09 19:59:50 2025 +0000
+++ b/lurid3/notes.txt	Mon Mar 10 01:19:56 2025 +0000
@@ -1728,7 +1728,24 @@
   all.cdx:12190045:20230922044250 "http://www.dryelf.com/downloads.asp?Sort=0&Cat=3",
   all.cdx:25401353:20230922053834 "http://www.dryelf.com/downloads.asp?Sort=2&Cat=0",
 
-Stupid -- mime-detected has a space in it...
+Stupid -- mime-detected sometimes has a space in it...
+
+  >: ls *.gz | parallel -j 10 "uz '{}' |fgrep .15/warc/CC-MAIN | cut -f 2,4 -d ' '  | sort > '/tmp/hst/15/orig_{#}.cdx'"
+  >: sort -m *.cdx | sponge all.cdx
+  >: wc -l all.cdx
+  34742533
+Diff looks plausible, and a few known cases are correct now:
+  >: fgrep -n '20230922034234 "https://www.hansaton.com/zh-hk/' test.cdb_in
+  4361:20230922034234 "https://www.hansaton.com/zh-hk/%E6%B6%88%E8%B2%BB%E8%80%85.html",
+  >: fgrep -n '20230922034234 "https://www.hansaton.com/zh-hk/' all.cdx
+  27516:20230922034234 "https://www.hansaton.com/zh-hk/%E6%B6%88%E8%B2%BB%E8%80%85.html",
+  >: fgrep dryelf.com test.cdb_in all.cdx
+  test.cdb_in:20230922034248 "http://www.dryelf.com/files/DryCookieInstall.exe",
+  test.cdb_in:20230922043436 "http://www.dryelf.com/files/DryLottoInstall.exe",
+  all.cdx:20230922034248 "http://www.dryelf.com/files/DryCookieInstall.exe",
+  all.cdx:20230922043436 "http://www.dryelf.com/files/DryLottoInstall.exe",
+  all.cdx:20230922044250 "http://www.dryelf.com/downloads.asp?Sort=0&Cat=3",
+  all.cdx:20230922053834 "http://www.dryelf.com/downloads.asp?Sort=2&Cat=0",
 ================
 
 Try it with the existing _per segment_ index we have for 2019-35