comparison lurid3/notes.txt @ 42:0c472ae05f71
nearly finished downloading for now
author | Henry S. Thompson <ht@inf.ed.ac.uk> |
---|---|
date | Mon, 02 Sep 2024 15:02:01 +0100 |
parents | 64b7fb44e8dc |
children | 6ae6a21ccfb9 |
41:64b7fb44e8dc | 42:0c472ae05f71 |
---|---|
69 12 | 69 12 |
70 13 | 70 13 |
71 15 | 71 15 |
72 15 | 72 15 |
73 | 73 |
74 Fill a gap by downloading 2022-33 | |
75 | |
76 >: for s in 0; do ~/bin/getcc_multi.aws CC-MAIN-2022-33 $s 5; done > /tmp/hst/get_22-33_0.log & | |
77 130 minutes... | |
78 >: for s in 1; do ~/bin/getcc_multi.aws CC-MAIN-2022-33 $s 10; done > /tmp/hst/get_22-33_1.log & | |
79 59 minutes | |
80 | |
81 Another day to get to a quarter? | |
82 >: for s in {2..23}; do ~/bin/getcc_multi.aws CC-MAIN-2022-33 $s 10; done > /tmp/hst/get_22-33_2-23.log & | |
83 | |
84 | |
85 And finally 2015-35 | |
86 Fetched in just 2 chunks, 0-9 and 10-99, e.g. | |
87 >: for s in {10..99}; do ~/bin/getcc_multi.aws CC-MAIN-2015-35 $s 10; done > /tmp/hst/get_15-35_10-99.log & | |
88 | |
89 Much smaller. | |
90 Compare 2023-40, with 900 files per segment: | |
91 >: lss */orig/warc/*-0023?.* | cut -f 5 -d ' ' | stats | |
92 n = 1000 | |
93 min = 1.14775e+09 | |
94 max = 1.26702e+09 | |
95 sum = 1.20192e+12 | |
96 mean = 1.20192e+09 | |
97 sd = 2.26049e+07 | |
98 | |
99 with 2015-35, which has 353 files per segment: | |
100 >: lss */orig/warc/*-0023?-* | cut -f 5 -d ' ' | stats | |
101 n = 930 | |
102 min = 1.66471e+08 [bug?] | |
103 max = 9.6322e+08 | |
104 sum = 8.54009e+11 | |
105 mean = 9.1829e+08 | |
106 sd = 8.48938e+07 | |
107 | |
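The `stats` command used in the two size comparisons above is a local helper that these notes do not show. As a rough stand-in, the sketch below computes the same six figures with awk from one number per line. The function name `summarise` is hypothetical, and the choice of the sample standard deviation (dividing by n - 1) is an assumption, so `sd` may not match what `stats` itself reports.

```shell
# Hypothetical stand-in for the local `stats` helper (not the original tool).
# Reads one number per line on stdin and prints n, min, max, sum, mean, sd.
summarise() {
  awk '
    NF == 0 { next }                      # skip blank lines
    {
      x = $1 + 0
      n++
      sum += x
      if (n == 1 || x < min) min = x
      if (n == 1 || x > max) max = x
      # Welford update: avoids subtracting two huge sums of squares,
      # which matters for byte counts around 1e9
      d = x - mean
      mean += d / n
      m2 += d * (x - mean)
    }
    END {
      if (n == 0) { print "n = 0"; exit }
      sd = (n > 1) ? sqrt(m2 / (n - 1)) : 0   # assumption: sample sd
      printf "n = %d\nmin = %g\nmax = %g\nsum = %g\nmean = %g\nsd = %g\n", n, min, max, sum, mean, sd
    }'
}
```

It would slot into the same pipelines in place of `stats`, for example `... | cut -f 5 -d ' ' | summarise`.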
108 The min files all come from segment 1440644060633.7, whose files are | |
109 _all_ small: | |
110 >: uz *00123-*.gz | wc -l | |
111 12,759,931 | |
112 Compare to 1440644060103.8 | |
113 >: zcat *00123-*.gz | wc -l | |
114 75,806,738 | |
115 Mystery | |
116 | |
117 Also faster | |
118 Compare 2022-33: | |
119 >: fgrep -h BST /tmp/hst/get_22-33_{2-23,24-49,50-99}.log | cut -f 1-7 -d ' ' | while read s; do if read e; then echo $((($(date --date="$e" +%s) - $(date --date="$s" +%s)) / 60)); fi; done | stats n min max mean sd | |
120 98 19 256 75.1 25.2 | |
121 with 2015-35: | |
122 >: fgrep -h BST /tmp/hst/get_15-35_{0-9,10-99}.log | cut -f 1-7 -d ' ' | while read s; do if read e; then echo $((($(date --date="$e" +%s) - $(date --date="$s" +%s)) / 60)); fi; done | stats n min max mean sd | |
123 95 15 40 32.4 2.90 |
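The two timing one-liners above pair each start timestamp with the following end timestamp and print the difference in whole minutes. The sketch below spells out that loop as a function for readability. The name `elapsed_minutes` is hypothetical, and it assumes GNU `date` (for `--date`) and strictly alternating start/end lines on stdin, which is what the `fgrep ... | cut ...` stage is assumed to deliver.

```shell
# Sketch only: expects alternating start/end timestamp lines on stdin,
# in any format GNU `date --date` accepts. Prints one duration per pair,
# in whole minutes (integer division, as in the one-liners above).
elapsed_minutes() {
  while IFS= read -r start; do
    # An unpaired trailing start line has no end time: stop quietly
    IFS= read -r end || break
    s=$(date --date="$start" +%s) || continue   # skip pairs that fail to parse
    e=$(date --date="$end" +%s) || continue
    echo $(( (e - s) / 60 ))
  done
}
```

The output has one number per line, so it can be piped straight into the same summary step as before.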