comparison lurid3/notes.txt @ 42:0c472ae05f71

nearly finished downloading for now
author Henry S. Thompson <ht@inf.ed.ac.uk>
date Mon, 02 Sep 2024 15:02:01 +0100
parents 64b7fb44e8dc
children 6ae6a21ccfb9
41:64b7fb44e8dc 42:0c472ae05f71
69 12 69 12
70 13 70 13
71 15 71 15
72 15 72 15
73 73
74 Fill a gap by downloading 2022-33
75
76 >: for s in 0; do ~/bin/getcc_multi.aws CC-MAIN-2022-33 $s 5; done > /tmp/hst/get_22-33_0.log &
77 130 minutes...
78 >: for s in 1; do ~/bin/getcc_multi.aws CC-MAIN-2022-33 $s 10; done > /tmp/hst/get_22-33_1.log &
79 59 minutes
80
81 Another day to get to a quarter?
82 >: for s in {2..23}; do ~/bin/getcc_multi.aws CC-MAIN-2022-33 $s 10; done > /tmp/hst/get_22-33_2-23.log &
83
84
85 And finally 2015-35
86 Fetched in just 2 chunks, 0-9 and 10-99, e.g.
87 >: for s in {10..99}; do ~/bin/getcc_multi.aws CC-MAIN-2015-35 $s 10; done > /tmp/hst/get_15-35_10-99.log &
88
89 Much smaller.
90 Compare 2023-40, with 900 files per segment:
91 >: lss */orig/warc/*-0023?.* | cut -f 5 -d ' ' | stats
92 n = 1000
93 min = 1.14775e+09
94 max = 1.26702e+09
95 sum = 1.20192e+12
96 mean = 1.20192e+09
97 sd = 2.26049e+07
98
99 with 2015-35, with 353 files per segment
100 >: lss */orig/warc/*-0023?-* | cut -f 5 -d ' ' | stats
101 n = 930
102 min = 1.66471e+08 [bug?]
103 max = 9.6322e+08
104 sum = 8.54009e+11
105 mean = 9.1829e+08
106 sd = 8.48938e+07
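
The `stats` command above is a local tool whose source is not shown here. A minimal sketch of an equivalent summary, for reproducing the size comparison elsewhere: `size_stats` is a hypothetical name, it expects one number per line, and it assumes that sd means the sample standard deviation (n-1 denominator), which may not match what `stats` does.

```shell
# Hypothetical stand-in for the local `stats` command (not the original tool).
# Reads one number per line on stdin and prints n, min, max, sum, mean, sd.
# Assumption: sd is the sample standard deviation (n-1 denominator).
size_stats() {
  awk '
    NR == 1 { min = $1; max = $1 }
    {
      if ($1 < min) min = $1
      if ($1 > max) max = $1
      sum += $1
      # Welford update: avoids subtracting huge sums of squares for ~1e9 byte sizes
      n++
      d = $1 - mean
      mean += d / n
      m2 += d * ($1 - mean)
    }
    END {
      if (n == 0) exit 1
      var = (n > 1) ? m2 / (n - 1) : 0
      printf "n = %d\nmin = %g\nmax = %g\nsum = %g\nmean = %g\nsd = %g\n", n, min, max, sum, mean, sqrt(var)
    }'
}
```

Where `lss` is not available, GNU `stat` can supply the sizes directly, e.g. `stat -c %s */orig/warc/*-0023?-* | size_stats` (assumes GNU coreutils).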
107
108 The min files all come from segment 1440644060633.7, whose files are
109 _all_ small:
110 >: uz *00123-*.gz | wc -l
111 12,759,931
112 Compare to 1440644060103.8
113 >: zcat *00123-*.gz | wc -l
114 75,806,738
115 Mystery
116
117 Also faster to download
118 Compare 2022-33 (the logs read below are the get_22-33 ones):
119 >: fgrep -h BST /tmp/hst/get_22-33_{2-23,24-49,50-99}.log | cut -f 1-7 -d ' ' | while read s; do if read e; then echo $((($(date --date="$e" +%s) - $(date --date="$s" +%s)) / 60)); fi; done | stats n min max mean sd
120 98 19 256 75.1 25.2
121 with 2015-35:
122 >: fgrep -h BST /tmp/hst/get_15-35_{0-9,10-99}.log | cut -f 1-7 -d ' ' | while read s; do if read e; then echo $((($(date --date="$e" +%s) - $(date --date="$s" +%s)) / 60)); fi; done | stats n min max mean sd
123 95 15 40 32.4 2.90
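
The timing pipelines above pair up consecutive timestamp lines (start, then end) and print the difference in whole minutes. The same pairing logic as a reusable sketch: `elapsed_minutes` is a hypothetical name, it needs GNU `date` for `--date`, and it assumes the input has already been reduced to one timestamp per line, as the `fgrep | cut` stage does.

```shell
# Hypothetical helper (not part of the original notes).
# Reads alternating start/end timestamps, one per line, and prints the
# elapsed whole minutes for each pair. A trailing unpaired start line is
# ignored, as in the original pipeline. Requires GNU date.
elapsed_minutes() {
  while IFS= read -r start && IFS= read -r end; do
    s=$(date --date="$start" +%s) || continue   # skip pairs date cannot parse
    e=$(date --date="$end" +%s) || continue
    echo $(( (e - s) / 60 ))
  done
}
```

Usage, following the commands above: `fgrep -h BST /tmp/hst/get_15-35_{0-9,10-99}.log | cut -f 1-7 -d ' ' | elapsed_minutes | stats n min max mean sd`.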