changeset 42:0c472ae05f71

nearly finished downloading for now
author Henry S. Thompson <ht@inf.ed.ac.uk>
date Mon, 02 Sep 2024 15:02:01 +0100
parents 64b7fb44e8dc
children 6ae6a21ccfb9
files lurid3/notes.txt
diffstat 1 files changed, 50 insertions(+), 0 deletions(-)
--- a/lurid3/notes.txt	Wed Aug 21 16:11:40 2024 +0100
+++ b/lurid3/notes.txt	Mon Sep 02 15:02:01 2024 +0100
@@ -71,3 +71,53 @@
   15
   15
 
+Fill a gap by downloading 2022-33
+
+  >: for s in 0; do ~/bin/getcc_multi.aws CC-MAIN-2022-33 $s 5; done > /tmp/hst/get_22-33_0.log &
+  130 minutes...
+  >: for s in 1; do ~/bin/getcc_multi.aws CC-MAIN-2022-33 $s 10; done > /tmp/hst/get_22-33_1.log &
+  59 minutes
+
+Another day to get to a quarter?
+  >: for s in {2..23}; do ~/bin/getcc_multi.aws CC-MAIN-2022-33 $s 10; done > /tmp/hst/get_22-33_2-23.log &
+
+
+And finally 2015-35
+Fetched in just 2 chunks, 0-9 and 10-99, e.g.
+  >: for s in {10..99}; do ~/bin/getcc_multi.aws CC-MAIN-2015-35 $s 10; done > /tmp/hst/get_15-35_10-99.log &
+
+Much smaller.
+Compare 2023-40, with 900 files per segment:
+  >: lss */orig/warc/*-0023?.* | cut -f 5 -d ' ' | stats
+  n	=	1000
+  min	=	1.14775e+09
+  max	=	1.26702e+09
+  sum	=	1.20192e+12
+  mean	=	1.20192e+09
+  sd	=	2.26049e+07
+
+with 2015-35, with 353 files per segment
+  >: lss */orig/warc/*-0023?-* | cut -f 5 -d ' ' | stats
+  n	=	930
+  min	=	1.66471e+08 [bug?]
+  max	=	9.6322e+08
+  sum	=	8.54009e+11
+  mean	=	9.1829e+08
+  sd	=	8.48938e+07
+
+The min files all come from segment 1440644060633.7, whose files are
+_all_ small:
+  >: uz *00123-*.gz | wc -l
+  12,759,931
+Compare to 1440644060103.8
+  >: zcat *00123-*.gz | wc -l
+  75,806,738
+Mystery
+
+Also faster
+Compare 2022-33:
+  >: fgrep -h BST /tmp/hst/get_22-33_{2-23,24-49,50-99}.log |  cut -f 1-7 -d ' ' | while read s; do if read e; then echo $((($(date --date="$e" +%s) - $(date --date="$s" +%s)) / 60)); fi; done   | stats n min max  mean    sd
+                              98 19 256  75.1   25.2
+with 2015-35:
+  >: fgrep -h BST /tmp/hst/get_15-35_{0-9,10-99}.log |  cut -f 1-7 -d ' ' | while read s; do if read e; then echo $((($(date --date="$e" +%s) - $(date --date="$s" +%s)) / 60)); fi; done   | stats n min max mean sd
+                              95 15  40 32.4 2.90
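
The timing one-liners above pair up successive timestamp lines (start, then end) and print the elapsed minutes for each pair. A minimal sketch of that step as a reusable function follows. The name pair_minutes is hypothetical, and it assumes GNU date and input that strictly alternates start and end timestamps, one per line, in a form date --date can parse. Whether the real logs always alternate cleanly is not verified here.

```shell
#!/usr/bin/env bash
# Sketch only: pair_minutes is a hypothetical helper, not part of the original notes.
# Assumes stdin alternates start/end timestamps, one per line, and GNU date.
pair_minutes() {
  local s e
  while IFS= read -r s; do
    # A trailing start line with no matching end line is ignored.
    IFS= read -r e || break
    # Convert both timestamps to epoch seconds; integer division truncates to whole minutes.
    echo $(( ( $(date --date="$e" +%s) - $(date --date="$s" +%s) ) / 60 ))
  done
}
```

It would slot in where the while loop sits in the commands above, e.g. fgrep -h BST /tmp/hst/get_15-35_{0-9,10-99}.log | cut -f 1-7 -d ' ' | pair_minutes | stats n min max mean sd.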