annotate lurid3/notes.txt @ 60:3be7b53d726e
using python dict test
author | Henry S. Thompson <ht@inf.ed.ac.uk> |
---|---|
date | Thu, 02 Jan 2025 15:01:48 +0000 |
parents | d9ba3ce783ff |
children | e6bab0972142 |
rev | line source |
---|---|
40 | 1 See old_notes.txt for all older notes on Common Crawl data processing, |
2 starting from Azure via Turing and then LURID and LURID2. | |
3 | |
4 Installed /beegfs/common_crawl/CC-MAIN-2024-33/cdx | |
5 >: cd results/CC-MAIN-2024-33/cdx/ | |
6 >: cut -f 2 counts.tsv | btot | |
7 2,793,986,828 | |
8 | |
9 State of play wrt data -- see status.xlsx | |
10 | |
41
64b7fb44e8dc
extract actual date info for WARC crawls
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
40
diff
changeset
|
11 [in trying to tabulate the date ranges of the crawls, I found that the |
12 WARC timestamp is sometimes bogus: |
13 |
14 >: fgrep ' 2009' CC-MAIN-2018-34/cdx/cluster.idx |
15 net,tyredeyes)/robots.txt 20090201191318 cdx-00230.gz 160573468 198277 920675 |
16 |
17 >: zgrep '^net,tyredeyes)/robots.txt' CC-MAIN-2018-34/cdx/warc/cdx-00230.gz |
18 net,tyredeyes)/robots.txt 20090201191318 {"url": "http://tyredeyes.net/robots.txt", "mime": "text/html", "mime-detected": "text/html", "status": "301", "digest": "QH732FYSV7UM34JYWVYMB7EZGR2CYM6B", "length": "582", "offset": "1224614", "filename": "crawl-data/CC-MAIN-2018-34/segments/1534221215075.58/robotstxt/CC-MAIN-20180819090604-20180819110604-00558.warc.gz"} |
19 net,tyredeyes)/robots.txt 20090201191319 {"url": "http://www.tyredeyes.net/robots.txt", "mime": "text/plain", "mime-detected": "text/plain", "status": "200", "digest": "PSX5IZU4B4SIXGNDKXCVFH75Q27VHUTJ", "length": "549", "offset": "2069841", "filename": "crawl-data/CC-MAIN-2018-34/segments/1534221215075.58/robotstxt/CC-MAIN-20180819090604-20180819110604-00485.warc.gz"} |
20 |
21 This happens in 2019-35 as well :-( |
22 |
23 >: fgrep ' 20181023' CC-MAIN-2019-35/cdx/cluster.idx |
24 com,gyshbsh)/robots.txt 20181023022000 cdx-00078.gz 356340085 162332 315406 |
25 >: zgrep ' 20181023' CC-MAIN-2019-35/cdx/warc/cdx-00078.gz |
26 com,gyshbsh)/robots.txt 20181023022000 {"url": "http://gyshbsh.com/robots.txt", "mime": "text/html", "mime-detected": "text/html", "status": "301", "digest": "3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ", "length": "529", "offset": "638892", "filename": "crawl-data/CC-MAIN-2019-35/segments/1566027315618.73/robotstxt/CC-MAIN-20190820200701-20190820222701-00120.warc.gz"} |
27 ... |
28 |
57 | 29 Full search: |
30 >: find CC*/cdx -type f -name cluster.idx > /tmp/hst/clus | |
31 >: cat /tmp/hst/clus | while read c; do printf '%s\t%s\n' $c $(cut -f 1 -d ' ' $c | fgrep -vc ${c:8:4}); done | |
32 CC-MAIN-2013-20/cdx/cluster.idx 0 | |
33 CC-MAIN-2014-35/cdx/cluster.idx 0 | |
34 CC-MAIN-2015-35/cdx/cluster.idx 0 | |
35 CC-MAIN-2016-30/cdx/cluster.idx 0 | |
36 CC-MAIN-2017-30/cdx/cluster.idx 0 | |
37 CC-MAIN-2018-30/cdx/warc/cluster.idx 0 | |
38 CC-MAIN-2018-34/cdx/cluster.idx 36 | |
39 CC-MAIN-2019-18/cdx/warc/cluster.idx 3 | |
40 CC-MAIN-2019-35/cdx/cluster.idx 1 | |
41 CC-MAIN-2020-34/cdx/cluster.idx 0 | |
42 CC-MAIN-2021-25/cdx/cluster.idx 0 | |
43 CC-MAIN-2021-31/cdx/cluster.idx 0 | |
44 CC-MAIN-2021-49/cdx/cluster.idx 0 | |
45 CC-MAIN-2022-21/cdx/warc/cluster.idx 0 | |
46 CC-MAIN-2022-33/cdx/warc/cluster.idx 0 | |
47 CC-MAIN-2022-40/cdx/warc/cluster.idx 0 | |
48 CC-MAIN-2022-49/cdx/warc/cluster.idx 0 | |
49 CC-MAIN-2023-40/cdx/warc/cluster.idx 0 | |
50 CC-MAIN-2023-50/cdx/warc/cluster.idx 0 | |
51 CC-MAIN-2024-33/cdx/warc/cluster.idx 0 | |
52 Emailed this info to Sebastian Nagel 2024-12-17 | |
53 | |
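The per-crawl count above can also be sketched in Python. This is only a minimal sketch, not the command used above: it assumes the second whitespace-separated field of each cluster.idx line is the 14-digit timestamp (as in the sample lines earlier in these notes), and it compares that timestamp's year with the crawl year instead of searching the line for the year string. The name `count_year_mismatches` is made up for illustration.

```python
def count_year_mismatches(lines, crawl_year):
    """Count index lines whose timestamp does not start with crawl_year."""
    bad = 0
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        # Assumption: field 2 is the YYYYMMDDhhmmss capture timestamp
        if not fields[1].startswith(crawl_year):
            bad += 1
    return bad


# The two cluster.idx lines quoted earlier in these notes
sample = [
    "net,tyredeyes)/robots.txt 20090201191318\tcdx-00230.gz\t160573468\t198277\t920675",
    "com,gyshbsh)/robots.txt 20181023022000\tcdx-00078.gz\t356340085\t162332\t315406",
]
print(count_year_mismatches(sample, "2018"))
```

A crawl that straddles a year boundary would need a range check rather than a single year prefix.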
41
54 Tabulate all the date ranges for the WARC files we have |
55 |
56 >: for d in {2017-30,2019-35,2020-34,2021-25,2023-40,2023-50}; do printf "%s\t" $d; (ls CC-MAIN-$d/*.{?,??}/orig/warc | fgrep .gz | cut -f 3,4 -d - | sort -u |tee /dev/fd/3 | head -1 ) 3> >( tail -1 ) | tr '\n' '\t'; echo; done | cut -f 1,2,4 -d - | sed 's/-20/ 20/;s/.$//' | tr ' ' '\t' > dates.tsv |
57 >: for d in {2018-30,2018-34}; do printf "%s\t" $d; (ls CC-MAIN-$d/{*.?,*.??} | fgrep warc.gz | cut -f 3,4 -d - | sort -u |tee /dev/fd/3 | { sleep 10 ; head -1 ; } ) 3> >( tail -1 ) | tr '\n' '\t'; echo; done >> dates.tsv |
58 >: for d in 2019-18; do printf "%s\t" $d; (ls CC-MAIN-$d/{*.?,*.??} | fgrep warc.gz | cut -f 3,4 -d - | sort -u | head -1); done |
59 2019-18 20190418101243-20190418122248 |
60 >: for d in 2019-18; do printf "%s\t" $d; (ls CC-MAIN-$d/{*.?,*.??} | fgrep warc.gz | cut -f 3,4 -d - | sort -u | tail -1); done |
61 2019-18 20190426153423-20190426175423 |
62 >: echo 2019-18 20190418101243-20190418122248 20190426153423-20190426175423 >> dates.tsv |
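The same start/end extraction from WARC file names can be sketched in Python. This is a hedged sketch: it assumes names of the form `CC-MAIN-<start>-<end>-<serial>.warc.gz`, as seen in the index entries quoted earlier, and it takes the earliest start and the latest end, which may differ slightly from the shell pipelines above (they take the end time of the last start-end pair in sorted order). `crawl_range` is a made-up name.

```python
import re

# Assumed file-name pattern: CC-MAIN-<start>-<end>-<serial>.warc.gz
NAME = re.compile(r"CC-MAIN-(\d{14})-(\d{14})-\d+\.warc\.gz$")


def crawl_range(filenames):
    """Return (earliest start, latest end), or None if nothing matches."""
    starts, ends = [], []
    for name in filenames:
        m = NAME.search(name)
        if m:
            starts.append(m.group(1))
            ends.append(m.group(2))
    if not starts:
        return None
    # Fixed-width 14-digit timestamps compare correctly as strings
    return min(starts), max(ends)


# File names taken from the index entries quoted earlier in these notes
names = [
    "CC-MAIN-20180819090604-20180819110604-00558.warc.gz",
    "CC-MAIN-20180819090604-20180819110604-00485.warc.gz",
    "CC-MAIN-20190820200701-20190820222701-00120.warc.gz",
]
print(crawl_range(names))
```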
63 >: pwd |
64 /beegfs/common_crawl/CC-MAIN-2016-30/cdx/warc |
65 >: echo {000..299} | tr ' ' '\n' | parallel -j 10 'uz cdx-00{}.gz | cut -f 2 -d " " | sort -u > /tmp/hst/{}' |
66 >: sort -mu /tmp/hst/??? > /tmp/hst/all |
67 >: wc -l /tmp/hst/all |
68 679686 /tmp/hst/all |
69 >: head -1 /tmp/hst/all |
70 20160723090435 |
71 >: tail -1 /tmp/hst/all |
72 20160731110639 |
73 >: cd ../../.. |
74 >: echo 2016-30 20160723090435 20160731110639 >> dates.tsv |
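For crawls where only the CDX shards are to hand, the first/last timestamp search above can be sketched in Python. This is an illustrative sketch, not the pipeline used above: it assumes each CDX line is `key SPACE timestamp SPACE json`, and the helper names `timestamp_range` and `read_cdx` are made up. The JSON part of the sample lines is abbreviated to `{}`.

```python
import gzip


def timestamp_range(cdx_lines):
    """Return (first, last, n_unique) over the timestamp field, or None."""
    seen = set()
    for line in cdx_lines:
        fields = line.split(" ", 2)
        if len(fields) >= 2:
            seen.add(fields[1])  # assumed: field 2 is YYYYMMDDhhmmss
    if not seen:
        return None
    return min(seen), max(seen), len(seen)


def read_cdx(path):
    """Yield text lines from one gzipped CDX shard."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        yield from f


# Keys and timestamps from the 2018-34 entries quoted earlier
sample = [
    "net,tyredeyes)/robots.txt 20090201191318 {}",
    "net,tyredeyes)/robots.txt 20090201191319 {}",
    "net,tyredeyes)/robots.txt 20090201191318 {}",
]
print(timestamp_range(sample))
```

To cover a whole crawl, the lines of all shards would be chained together, e.g. with `itertools.chain.from_iterable(read_cdx(p) for p in paths)`.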
75 tweaked and sorted in xemacs: |
76 2016-30 20160723090435 20160731110639 |
77 2017-30 20170720121902 20170729132938 |
78 2018-30 20180715183800 20180723184955 |
79 2018-34 20180814062251 20180822085454 |
80 2019-18 20190418101243 20190426175423 |
81 2019-35 20190817102624 20190826111356 |
82 2020-34 20200803083123 20200815214756 |
83 2021-25 20210612103920 20210625145905 |
84 2023-40 20230921073711 20231005042006 |
85 2023-50 20231128083443 20231212000408 |
86 |
87 Added to status.xlsx in shortened form, with number of days |
88 8 |
89 9 |
90 8 |
91 8 |
92 8 |
93 9 |
94 12 |
95 13 |
96 15 |
97 15 |
98 |
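One way to turn a first/last timestamp pair from the table above into a day count is sketched below. It is an assumption that calendar-date difference is a reasonable convention; how status.xlsx actually counts days is not recorded here, so the spreadsheet figures may follow a different rule. `span_days` is a made-up name.

```python
from datetime import datetime

STAMP_FORMAT = "%Y%m%d%H%M%S"  # 14-digit timestamps, e.g. 20160723090435


def span_days(first, last):
    """Difference in calendar days between two 14-digit timestamps."""
    start = datetime.strptime(first, STAMP_FORMAT).date()
    end = datetime.strptime(last, STAMP_FORMAT).date()
    return (end - start).days


# The 2016-30 row from the table above
print(span_days("20160723090435", "20160731110639"))  # prints 8
```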
42
0c472ae05f71
nearly finished downloading for now
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
41
diff
changeset
|
99 Fill a gap by downloading 2022-33 |
100 |
101 >: for s in 0; do ~/bin/getcc_multi.aws CC-MAIN-2022-33 $s 5; done > /tmp/hst/get_22-33_0.log & |
102 130 minutes... |
103 >: for s in 1; do ~/bin/getcc_multi.aws CC-MAIN-2022-33 $s 10; done > /tmp/hst/get_22-33_1.log & |
104 59 minutes |
105 |
106 Another day to get to a quarter? |
107 >: for s in {2..23}; do ~/bin/getcc_multi.aws CC-MAIN-2022-33 $s 10; done > /tmp/hst/get_22-33_2-23.log & |
108 |
109 |
110 And finally 2015-35 |
111 Fetched in just 2 chunks, 0-9 and 10-99, e.g. |
112 >: for s in {10..99}; do ~/bin/getcc_multi.aws CC-MAIN-2015-35 $s 10; done > /tmp/hst/get_15-35_10-99.log & |
113 |
114 Much smaller. |
115 Compare 2023-40, with 900 files per segment: |
116 >: lss */orig/warc/*-0023?.* | cut -f 5 -d ' ' | stats |
117 n = 1000 |
118 min = 1.14775e+09 |
119 max = 1.26702e+09 |
120 sum = 1.20192e+12 |
121 mean = 1.20192e+09 |
122 sd = 2.26049e+07 |
123 |
124 with 2015-35, with 353 files per segment |
125 >: lss */orig/warc/*-0023?-* | cut -f 5 -d ' ' | stats |
43 | 126 n = 1000 |
127 min = 1.66471e+08 | |
42
128 max = 9.6322e+08 |
43 | 129 sum = 9.19222e+11 |
130 mean = 9.19222e+08 | |
131 sd = 8.20542e+07 | |
42
132 |
133 The min files all come from segment 1440644060633.7, whose files are |
134 _all_ small: |
135 >: uz *00123-*.gz | wc -l |
136 12,759,931 |
137 Compare to 1440644060103.8 |
138 >: zcat *00123-*.gz | wc -l |
139 75,806,738 |
140 Mystery |
141 |
142 Also faster |
43 | 143 Compare 2022-33: |
42
144 >: fgrep -h BST /tmp/hst/get_22-33_{2-23,24-49,50-99}.log | cut -f 1-7 -d ' ' | while read s; do if read e; then echo $((($(date --date="$e" +%s) - $(date --date="$s" +%s)) / 60)); fi; done | stats n min max mean sd |
145 98 19 256 75.1 25.2 |
146 with 2015-35: |
147 >: fgrep -h BST /tmp/hst/get_15-35_{0-9,10-99}.log | cut -f 1-7 -d ' ' | while read s; do if read e; then echo $((($(date --date="$e" +%s) - $(date --date="$s" +%s)) / 60)); fi; done | stats n min max mean sd |
43 | 148 100 15 40 32.6 2.9 |
149 | |
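The timing comparison above pairs consecutive start/end timestamps from the download logs and summarises the elapsed minutes. A rough Python equivalent is sketched here, assuming the timestamps have already been parsed into `datetime` objects (the log line format is not shown in these notes) and that a sample standard deviation is acceptable; the author's `stats` tool may compute `sd` differently. All names and the sample times are hypothetical.

```python
from datetime import datetime
from statistics import mean, stdev


def paired_minutes(stamps):
    """Pair stamps as (start, end), (start, end), ... and return whole minutes.

    A trailing unpaired start is dropped, like the `if read e` guard above.
    """
    it = iter(stamps)
    return [int((end - start).total_seconds()) // 60 for start, end in zip(it, it)]


def summarize(values):
    """Return n, min, max, mean and sample sd (needs at least two values)."""
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "sd": stdev(values),
    }


# Invented timestamps, for illustration only
stamps = [
    datetime(2024, 8, 1, 10, 0), datetime(2024, 8, 1, 10, 30),
    datetime(2024, 8, 1, 11, 0), datetime(2024, 8, 1, 11, 50),
    datetime(2024, 8, 1, 12, 0),  # unpaired start, ignored
]
print(summarize(paired_minutes(stamps)))
```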
150 >: echo {000..299} | tr ' ' '\n' | parallel -j 10 'uz cdx-00{}.gz | cut -f 2 -d " " | sort -u > /tmp/hst/2015_{}' & | |
151 >: sort --parallel=10 -mu /tmp/hst/2015_??? > /tmp/hst/2015_all | |
152 >: head -1 /tmp/hst/2015_all | |
153 20150827191534 | |
154 >: tail -1 /tmp/hst/2015_all | |
155 20150905180914 | |
156 >: wc -l /tmp/hst/2015_all | |
157 698128 /tmp/hst/2015_all | |
158 | |
159 What about wet files -- do they include text from pdfs? What about | |
160 truncated pdfs? | |
161 | |
162 >: time for s in 0; do ~/bin/getcc_wet_multi.aws CC-MAIN-2019-35 $s 10; done > /tmp/hst/get_wet_19-35_0.log & | |
163 real 26m3.049s | |
164 user 0m1.225s | |
165 sys 0m1.310s | |
166 | |
167 In the segment 0 cdx file (!) we find 3747 probable truncations: | |
168 >: zgrep -a '"mime-detected": "application/pdf", ' cdx.gz > /tmp/hst/2019-35_seg0_pdf.idx | |
169 >: wc -l /tmp/hst/2019-35_seg0_pdf.idx | |
170 42345 /tmp/hst/2019-35_seg0_pdf.idx | |
171 >: egrep -a '"length": "10....."' /tmp/hst/2019-35_seg0_pdf.idx > /tmp/hst/2019-35_seg0_long_pdf.idx & | |
172 >: wc -l < /tmp/hst/2019-35_seg0_long_pdf.idx | |
173 3747 | |
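The two grep steps above can be combined in a short Python sketch that parses the JSON part of each CDX line instead of pattern-matching the raw text. It assumes the `key SPACE timestamp SPACE json` layout and applies the same heuristic as the egrep pattern: a 7-character `length` value beginning with `10`. The function name and the sample lines are invented for illustration and are not real index entries.

```python
import json


def probable_truncated_pdfs(cdx_lines):
    """Return parsed records that look like PDFs of roughly 1 MB."""
    hits = []
    for line in cdx_lines:
        parts = line.rstrip("\n").split(" ", 2)
        if len(parts) < 3:
            continue
        try:
            rec = json.loads(parts[2])
        except json.JSONDecodeError:
            continue  # skip lines whose JSON part is damaged
        length = rec.get("length", "")
        # Same test as egrep '"length": "10....."'
        if (rec.get("mime-detected") == "application/pdf"
                and len(length) == 7 and length.startswith("10")):
            hits.append(rec)
    return hits


# Invented sample lines (hypothetical URLs and lengths)
sample = [
    'org,example)/a.pdf 20190817224017 {"url": "https://example.org/a.pdf", "mime-detected": "application/pdf", "length": "1049051"}',
    'org,example)/b.pdf 20190817224018 {"url": "https://example.org/b.pdf", "mime-detected": "application/pdf", "length": "11588"}',
    'org,example)/c 20190817224019 {"url": "https://example.org/c", "mime-detected": "text/html", "length": "1012345"}',
]
print(len(probable_truncated_pdfs(sample)))
```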
174 Of the 42345 PDFs, 70 are in file 0: | |
175 >: egrep -a '.-00000\.' /tmp/hst/2019-35_seg0_pdf.idx > /tmp/hst/2019-35_seg0_file0_pdf.idx | |
176 >: wc -l /tmp/hst/2019-35_seg0_file0_pdf.idx | |
177 70 /tmp/hst/2019-35_seg0_file0_pdf.idx | |
178 | |
179 In segment 0 file 0 we find 70 application/pdf Content-Type headers: | |
180 >: ix.py -h -w -x </tmp/hst/2019-35_seg0_file0_pdf.idx |egrep '^(WARC-Target-URI:|Content-Length:) '|cut -f 2 -d ' ' |tr -d '\r'|while read l1; do read uri; read l2; printf '%s\t%s\t%s\n' $l1 $l2 "$uri"; done > ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv | |
181 >: wc -l < ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv | |
182 70 | |
183 >: head -3 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv | |
184 | |
185 | |
186 Of which 14 are truncated: | |
187 >: fgrep -c 1048576 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv | |
188 14 | |
189 | |
190 E.g. | |
191 >: fgrep 1048576 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv | head -3 | |
192 1049051 1048576 https://en.heks.ch/sites/default/files/documents/2017-09/HEKS_EPER_Mission_Statement_2016_e.pdf | |
193 1049469 1048576 https://bmcmicrobiol.biomedcentral.com/track/pdf/10.1186/s12866-017-0951-4 | |
194 1048824 1048576 https://citydocs.fcgov.com/?action=cao-cases&cmd=convert&docid=3332339 | |
195 | |
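The truncation check on seg0_file0_lengths.tsv can be sketched in Python as follows. It assumes the three tab-separated columns are the WARC record Content-Length, the HTTP Content-Length and the URI, and it compares the second column with 1048576, the value these notes treat as the truncation marker, rather than searching the whole line as `fgrep` does. `truncated_uris` is a made-up name.

```python
TRUNCATION_LIMIT = 1024 * 1024  # 1048576 bytes


def truncated_uris(rows):
    """Return URIs whose HTTP Content-Length equals the truncation limit."""
    out = []
    for row in rows:
        record_len, payload_len, uri = row.rstrip("\n").split("\t", 2)
        if int(payload_len) == TRUNCATION_LIMIT:
            out.append(uri)
    return out


# Rows quoted in these notes
rows = [
    "1049051\t1048576\thttps://en.heks.ch/sites/default/files/documents/2017-09/HEKS_EPER_Mission_Statement_2016_e.pdf",
    "1049469\t1048576\thttps://bmcmicrobiol.biomedcentral.com/track/pdf/10.1186/s12866-017-0951-4",
    "1048824\t1048576\thttps://citydocs.fcgov.com/?action=cao-cases&cmd=convert&docid=3332339",
    "11588\t10913\thttp://bdds.deux-sevres.com/recherche/simple/Editeur/2/Belfond/vignette?format=pdf",
]
print(len(truncated_uris(rows)))
```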
196 Are any of the pdfs in the corresponding wet file? | |
197 | |
198 Yes, 2: | |
199 >: cut -f 3 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv | fgrep -af - <(uz 1566027313501.0/orig/wet/*-00000.warc.wet.gz) | |
200 WARC-Target-URI: http://bdds.deux-sevres.com/recherche/simple/Editeur/2/Belfond/vignette?format=pdf | |
201 WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D00 | |
202 | |
203 Is it in fact corresponding? | |
204 >: diff -bw <(uz 1566027313501.0/orig/warc/*-00000.warc.gz | egrep -a '^WARC-Target-URI: ' | uniq | head -1000) <(uz 1566027313501.0/orig/wet/*-00000.warc.wet.gz | egrep -a '^WARC-Target-URI: ' | head -1000)|egrep -c '^<' | |
205 19 | |
206 | |
207 So, yes, mostly. About 2% (19 of the first 1000) are missing | |
208 | |
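The WARC/WET comparison can also be done order-independently. The sketch below is not the `diff` used above: it treats the two lists of target URIs as sets, so it reports URIs that never appear in the WET file regardless of position. The function name is hypothetical; the sample URIs come from these notes.

```python
def missing_from_wet(warc_uris, wet_uris):
    """Return (URIs seen in the WARC but not the WET, number of distinct WARC URIs)."""
    wet = set(wet_uris)
    seen = set()
    missing = []
    for uri in warc_uris:
        if uri in seen:
            continue  # the WARC repeats each URI across several record types
        seen.add(uri)
        if uri not in wet:
            missing.append(uri)
    return missing, len(seen)


a = "http://bdds.deux-sevres.com/recherche/simple/Editeur/2/Belfond/vignette?format=pdf"
b = "https://bmcmicrobiol.biomedcentral.com/track/pdf/10.1186/s12866-017-0951-4"
c = "https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005"
missing, total = missing_from_wet([a, a, a, b, b, b, c], [a, c])
print(len(missing), "of", total, "missing")
```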
209 Just checking the search: | |
210 >: cut -f 3 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv | fgrep -af - <(uz 1566027313501.0/orig/warc/*-00000.warc.gz) | wc -l | |
211 210 | |
212 Correct: 210 = 3 x 70, one each for the request, response and metadata records | |
213 | |
214 So, what pdfs make it into the WET: | |
215 >: cut -f 3 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv | fgrep -af - <(uz 1566027313501.0/orig/wet/*-00000.warc.wet.gz) > ~/results/CC-MAIN-2019-35/s0_file0_pdf.txt | |
216 >: wc -l < ~/results/CC-MAIN-2019-35/s0_file0_pdf.txt | |
217 2 | |
218 >: cut -f 2 -d ' ' ~/results/CC-MAIN-2019-35/s0_file0_pdf.txt | tr -d '\r' | fgrep -f - ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv | |
219 11588 10913 http://bdds.deux-sevres.com/recherche/simple/Editeur/2/Belfond/vignette?format=pdf | |
220 1048979 1048576 https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005 | |
221 | |
222 Here's the short one: | |
223 WARC/1.0 | |
224 WARC-Type: response | |
225 WARC-Date: 2019-08-17T22:40:17Z | |
226 WARC-Record-ID: <urn:uuid:ea98167b-c42a-4233-b57e-994aa627e38a> | |
227 Content-Length: 11588 | |
228 Content-Type: application/http; msgtype=response | |
229 WARC-Warcinfo-ID: <urn:uuid:f689f8d0-24f3-4824-9a38-4f3fee422a4e> | |
230 WARC-Concurrent-To: <urn:uuid:2d51c956-0012-4d78-affc-8f57fe9d2e15> | |
231 WARC-IP-Address: 92.175.114.24 | |
232 WARC-Target-URI: http://bdds.deux-sevres.com/recherche/simple/Editeur/2/Belfond/vignette?format=pdf | |
233 WARC-Payload-Digest: sha1:7VVIUDQ4Q6XKNOAURYU4VTMRSZNPHDQA | |
234 WARC-Block-Digest: sha1:OSTWXLV772XNHS22T4UBSCSJAAXM2J6T | |
235 WARC-Identified-Payload-Type: application/pdf | |
236 | |
237 HTTP/1.1 200 OK | |
238 Cache-Control: must-revalidate, post-check=0, pre-check=0,no-cache | |
239 Pragma: public,no-cache | |
240 Content-Type: application/pdf",text/html; charset=utf-8 | |
241 X-Crawler-Content-Encoding: gzip | |
242 Expires: 0 | |
243 Server: | |
244 X-Powered-By: | |
245 Set-Cookie: 166d74d734106ba68b20ea303011f622=301619e3fe31ecb98c8473f0ff5f35a2; path=/ | |
246 Content-Disposition: attachment; filename="Mdiathque dpartementale des Deux-Svres - Rsultats de la recherche Belfond.pdf" | |
247 Content-Transfer-Encoding: binary | |
248 P3P: CP="NOI ADM DEV PSAi COM NAV OUR OTRo STP IND DEM" | |
249 X-Content-Encoded-By: | |
250 X-Powered-By: | |
251 Date: Sat, 17 Aug 2019 22:40:16 GMT | |
252 X-Crawler-Content-Length: 5448 | |
253 Content-Length: 10913 | |
254 | |
255 %PDF-1.7 | |
256 %<E2><E3><CF><D3> | |
257 7 0 obj | |
258 << /Type /Page /Parent 1 0 R /LastModified (D:20190818004016+02'00') /Resources 2 | |
259 0 R /MediaBox [0.000000 0.000000 595.276000 841.890000] /CropBox [0.000000 0.000 | |
260 000 595.276000 841.890000] /BleedBox [0.000000 0.000000 595.276000 841.890000] /T | |
261 rimBox [0.000000 0.000000 595.276000 841.890000] /ArtBox [0.000000 0.000000 595.2 | |
262 76000 841.890000] /Contents 8 0 R /Rotate 0 /Group << /Type /Group /S /Transparen | |
263 cy /CS /DeviceRGB >> /PZ 1 >> | |
264 endobj | |
265 8 0 obj | |
266 | |
267 >: uz 1566027313501.0/orig/warc/*-00000.warc.gz|tail -n +1823434 | tail -n +24 | head -c 20000 > ~/results/CC-MAIN-2019-35/mediatheque.pdf | |
268 >: ps2ascii mediatheque.pdf | |
269 Médiathèque départementale des Deux-Sèvres - Résultats de la recherche Belfond | |
270 | |
271 Médiathèque départementale des Deux-Sèvres - Résultats de | |
272 la recherche Belfond | |
273 A charge de revanche | |
274 Titre : | |
275 Auteur : Grippando, James (1958-....) | |
276 ... | |
277 etc., three pages, no errors | |
278 | |
279 >: uz 1566027313501.0/orig/warc/*-00000.warc.gz|fgrep -an https://museum.wrap.gov.tw/GetFile4.ashx | |
280 38896837:WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005 | |
281 38896858:WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005 | |
282 38904590:WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005 | |
283 >: uz 1566027313501.0/orig/warc/*-00000.warc.gz|tail -n +38896858 | egrep -an '^%%EOF' | |
284 27:%%EOF | |
285 1114658:%%EOF | |
286 1313299:%%EOF | |
287 | |
288 Hunh? | |
289 | |
290 >: uz 1566027313501.0/orig/warc/*-00000.warc.gz|tail -n +38896858 | egrep -an '^(%%EOF|WARC)' | head -30 | |
291 1:WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005 | |
292 2:WARC-Payload-Digest: sha1:SZ53DQQHENC7DDN7GQ5IS7VMEPAXAMBE | |
293 3:WARC-Block-Digest: sha1:QTKJA6A7445Z7264K2YAFBUUM2OYH2T2 | |
294 4:WARC-Truncated: length | |
295 5:WARC-Identified-Payload-Type: application/pdf | |
296 27:%%EOF | |
297 7725:WARC/1.0 | |
298 7726:WARC-Type: metadata | |
299 7727:WARC-Date: 2019-08-17T22:59:14Z | |
300 7728:WARC-Record-ID: <urn:uuid:77df2747-e567-45d3-8646-3069ae9a9f25> | |
301 7731:WARC-Warcinfo-ID: <urn:uuid:f689f8d0-24f3-4824-9a38-4f3fee422a4e> | |
302 7732:WARC-Concurrent-To: <urn:uuid:eceb4adc-d81e-4497-82fe-eea61ce171f4> | |
303 7733:WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005 | |
304 7739:WARC/1.0 | |
305 | |
306 OK, so indeed truncated after 7700 lines or so... | |
307 >: uz 1566027313501.0/orig/warc/*-00000.warc.gz|tail -n +38896858 | tail -n +21 | head -c 1048576 > ~/results/CC-MAIN-2019-35/museum.pdf | |
308 >: ps2ascii ~/results/CC-MAIN-2019-35/museum.pdf | |
309 **** Error: An error occurred while reading an XREF table. | |
310 **** The file has been damaged. | |
311 Look in big_pdf? | |
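Rather than counting lines with `tail -n` and bytes with `head -c`, the payload of a response record can be cut out using the record's own Content-Length. The sketch below assumes the bytes of one uncompressed WARC response record are already in hand; `split_response_record` is a made-up helper and the demonstration record is synthetic.

```python
def split_response_record(record: bytes):
    """Split one WARC response record into (warc_headers, http_head, payload)."""
    warc_head, _, rest = record.partition(b"\r\n\r\n")
    headers = {}
    for line in warc_head.split(b"\r\n")[1:]:  # skip the WARC/1.0 version line
        name, _, value = line.partition(b":")
        headers[name.strip().lower()] = value.strip()
    # The WARC Content-Length covers the HTTP headers plus the payload
    block = rest[: int(headers[b"content-length"])]
    http_head, _, payload = block.partition(b"\r\n\r\n")
    return headers, http_head, payload


# Synthetic record, for illustration only
payload = b"%PDF-1.7\n1 0 obj\nendobj\n%%EOF\n"
http = b"HTTP/1.1 200 OK\r\nContent-Type: application/pdf\r\n\r\n" + payload
record = (
    b"WARC/1.0\r\nWARC-Type: response\r\nWARC-Truncated: length\r\n"
    b"Content-Length: " + str(len(http)).encode() + b"\r\n\r\n" + http + b"\r\n\r\n"
)
headers, http_head, body = split_response_record(record)
print(headers.get(b"warc-truncated"), len(body))
```

A `WARC-Truncated` entry in the returned headers flags a payload that was cut short, as with the museum PDF above.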
44
7209df5fa5b4
turn attention to nutch-cc and its Cdx code
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
43
diff
changeset
|
312 |
313 ====Modify the original CC indexer to write new indices including lastmod===== |
314 Looks like WarcRecordWriter.write, in |
315 src/nutch-cc/src/java/org/commoncrawl/util/WarcRecordWriter, is what |
316 needs to be edited to include LastModified date |
317 |
318 To rebuild nutch-cc, particularly to recompile jar files after editing |
319 anything |
320 |
321 >: cd $HHOME/src/nutch-cc |
322 >: ant |
323 |
324 Fixed deprecation bug in WarcCdxWriter.java |
325 |
326 Modified src/java/org/commoncrawl/util/WarcCdxWriter.java |
327 to include lastmod |
328 |
329 Can run just one test, which should allow testing this: |
330 |
331 >: ant test-core -Dtestcase='TestWarcRecordWriter' |
45 | 332 |
333 Logic is tricky, and there's no easy way in | |
334 | |
335 Basically, tools/WarcExport.java launches a hadoop job based on a | |
336 hadoop-runnable WarcExport instance. Hadoop will in due course call | |
337 ExportReducer.reduce, which will create an instance of WarcCapture | |
338 "for each page capture", and call ExportMapper.context.write with that instance (via | |
339 some configuration magic with the hadoop job Context). That in turn | |
340 uses (more magic) WarcOutputFormat.getRecordWriter, which | |
341 (finally!) calls a previously created WarcRecordWriter | |
342 instance.write(the capture). | |
343 | |
344 So to fake a test case, I need to build | |
345 1) a WarcRecordWriter instance | |
346 2) a WarcCapture instance | |
347 and then invoke 1.write(2) | |
348 | |
349 Got that working, although still can't figure out where in the normal | |
350 flow the metadata entry for Response.CONTENT_TYPE gets set. | |
351 | |
352 Now, add a test that takes a stream of WARC Response extracts and | |
353 rewrites their index entries | |
354 | |
355 >: head -8804 <(uz /beegfs/common_crawl/CC-MAIN-2019-35/cdx/warc/cdx-00150.gz)|tail -10| ix.py -h -w -x > /tmp/hst/headers.txt | |
356 >: cp /tmp/hst/headers.txt src/test/org/commoncrawl/util/ | |
357 >: shuf /tmp/hst/headers.txt > src/test/org/commoncrawl/util/headers_mixed.txt | |
358 | |
359 Won't quite work :-( | |
360 How do we reconstruct the WARC filename, offset and length from the | |
361 original index? | |
362 | |
363 Well, we can find the .warc.gz records! | |
364 Thanks to https://stackoverflow.com/a/37042747/2595465 | |
365 | |
366 >: ~/lib/python/unpackz.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz > /tmp/hst/recs.txt | |
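A minimal sketch of the underlying trick, assuming (the source of unpackz.py is not shown in these notes) that it relies on zlib reporting where one gzip member ends and the next begins. The name member_spans is hypothetical, and this version works on bytes already in memory, so it only suits small inputs.

```python
import zlib

GZIP_WBITS = zlib.MAX_WBITS | 16  # ask zlib to parse the gzip header and trailer


def member_spans(data):
    """Return [(offset, length), ...] for the gzip members in data (bytes)."""
    spans = []
    pos = 0
    while pos < len(data):
        d = zlib.decompressobj(GZIP_WBITS)
        d.decompress(data[pos:])      # output discarded; only the end point matters
        if not d.eof:
            break                     # trailing partial member: stop here
        # Whatever zlib did not consume belongs to the following members.
        length = len(data) - pos - len(d.unused_data)
        spans.append((pos, length))
        pos += length
    return spans
```

Each (offset, length) pair identifies one compressed record that can be extracted and decompressed on its own.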
367 | |
368 Nearly working, got 1/3rd of the way through a single WARC and then failed: | |
369 | |
370 >: n=0 && ~/lib/python/unpackz.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz 2>/tmp/hst/tt.txt|while read o l; do echo $((n+=1)); echo $o $l >> /tmp/hst/r3a; ix.py $l $o CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz| wc -l; done | |
371 ... | |
372 20 | |
373 10215 | |
374 CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz | |
375 Process fail: Compressed file ended before the end-of-stream marker was reached, input: | |
376 length=10762, offset=60784640, file=/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz | |
377 | |
378 >: head -10217 /tmp/hst/r3a | tail -4 | |
379 60784173 467 | |
380 60784640 10762 | |
381 60795402 463 | |
382 60795865 460 | |
383 >: ix.py 467 60784173 CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz|fgrep Target | |
384 WARC-Target-URI: http://drycarerestoration.co/corner-furniture-piece/unique-corner-decoration-pieces-or-corner-furniture-pieces-corner-corner-furniture-piece-corner-furniture-pieces-bedroom/ | |
385 | |
386 >: zcat /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/cdx/warc/cdx.gz | |
387 ... | |
388 co,drycarerestoration)/corner-furniture-piece/unique-corner-decoration-pieces-or-corner-furniture-pieces-corner-corner-furniture-piece-corner-furniture-pieces-bedroom 20190819020224 {"url": "http://drycarerestoration.co/corner-furniture-piece/unique-corner-decoration-pieces-or-corner-furniture-pieces-corner-corner-furniture-piece-corner-furniture-pieces-bedroom/", "mime": "text/html", "mime-detected": "text/html", "status": "200", "digest": "DTKGJL45XQDXUS7PTXPYR6POMPLG46RZ", "length": "2570", "offset": "60784640", "filename": "crawl-data/CC-MAIN-2019-35/segments/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz", "charset": "UTF-8", "languages": "eng"} | |
389 >: ix.py 2570 60784640 CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz|less | |
390 >: echo $((10762 - 2570)) | |
391 8192 | |
392 | |
393 Ah, the error I was dreading :-( I _think_ this happens when an | |
394 individual record ends exactly on an 8K boundary. | |
395 | |
396 Yes: | |
397 | |
398 >: echo $((60784640 % 8192)) | |
399 0 | |
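The arithmetic behind that diagnosis, restated as a small check. All figures are copied from the session above; the variable names are only for illustration.

```python
# Figures copied from the session above.
reported_length = 10762               # length reported for the record at 60784640
indexed_length = 2570                 # length of the same record in the original cdx.gz
record_offset = 60784640
prev_offset, prev_length = 60784173, 467
block = 8192                          # 8K read size

overshoot = reported_length - indexed_length
assert overshoot == block                           # too long by exactly one block
assert prev_offset + prev_length == record_offset   # the previous record ends here
assert record_offset % block == 0                   # and that point is an 8K boundary
```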
400 | |
46 | 401 Even with buffer 1MB: |
402 21 | |
403 160245 | |
404 CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz | |
405 Process fail: Compressed file ended before the end-of-stream marker was reached, input: | |
406 length=8415, offset=1059033915, file=/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz | |
407 0 | |
408 160246 | |
409 | |
410 >: tail -60 /tmp/hst/r3b|head -20 | |
411 1059013061 423 | |
412 1059013484 7218 | |
413 1059020702 425 | |
414 1059021127 424 | |
415 1059021551 11471 | |
416 1059033022 426 | |
417 1059033448 467 | |
418 1059033915 8415 | |
419 | |
420 Argh. This is at the _same_ point (fails 51 before EOF). Ah, | |
421 maybe that's the point -- this is the last read before EOF, and it's | |
422 not a full buffer! | |
423 | |
424 >: ix.py 467 1059033448 CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz|less | |
425 ... | |
426 WARC-Target-URI: https://zowiecarrpsychicmedium.com/tag/oracle/ | |
427 | |
428 Reran with more instrumentation, took at least all day: | |
429 | |
430 >: n=0 && ~/lib/python/unpackz.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz 2> /tmp/hst/r3e_err.txt | while read o l; do | |
431 echo $((n+=1)); echo $o $l >> /tmp/hst/r3e_val; ix.py $l $o CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz | wc -l; | |
432 done > /tmp/hst/r3e_log 2>&1 | |
433 >: wc -l /tmp/hst/r3e_err.txt | |
434 160296 /tmp/hst/r3e_err.txt | |
435 >: tail -60 /tmp/hst/r3e_err.txt|cat -n | grep -C2 True\ True | |
436 7 b 28738 28738 28312 426 False False | |
437 8 b 28312 28312 27845 467 False False | |
438 9 b 27845 378162 369747 8415 True True < this is the first to hit the last | |
439 (partial) block | |
440 10 b 369747 369747 369312 435 False True | |
441 11 b 369312 369312 368878 434 False True | |
442 | |
443 >: tail -55 /tmp/hst/r3e_val | head -3 | |
444 1059033022 426 | |
445 1059033448 467 | |
446 1059033915 8415 | |
447 >: dd ibs=1 skip=1059033022 count=426 if=/beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz of=/dev/stdout | uz -t | |
448 ... | |
449 426 bytes copied, 0.00468243 s, 91.0 kB/s | |
450 sing<3411>: dd ibs=1 skip=1059033448 count=467 if=/beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz of=/dev/stdout | uz -t | |
451 ... | |
452 467 bytes copied, 0.00382692 s, 122 kB/s | |
453 sing<3412>: dd ibs=1 skip=1059033915 count=8415 if=/beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz of=/dev/stdout | uz -t | |
454 igzip: Error (null) does not contain a complete gzip file | |
455 ... | |
456 8415 bytes (8.4 kB, 8.2 KiB) copied, 0.00968889 s, 869 kB/s | |
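The dd | uz -t check above can also be sketched in Python. This is an illustration only: is_complete_member is a hypothetical helper, not part of ix.py or unpackz.py. Note that a range covering exactly two whole members would also pass, so it confirms completeness rather than uniqueness.

```python
import gzip
import zlib


def is_complete_member(path, offset, length):
    """Return True if the byte range decompresses as complete gzip data."""
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read(length)
    if len(data) != length:
        return False  # the range runs past the end of the file
    try:
        gzip.decompress(data)
    except (EOFError, OSError, zlib.error):
        # EOFError: data ended before the end-of-stream marker;
        # OSError covers gzip.BadGzipFile (bad header or checksum).
        return False
    return True
```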
457 | |
458 So, tried one change to use the actual size rather than BUFSIZE at | |
459 one point, seems to work now: | |
460 | |
461 >: time ~/lib/python/unpackz.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz 2> /tmp/hst/r3f_err.txt | tee /tmp/hst/r3f_val | while read l o; do printf '%s\t%s\t%s\n' $l $o 'CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz'; | |
462 done 2>&1 | tee /tmp/hst/r3f_log | ix.py -w | egrep -c '^WARC/1\.0' | |
463 160296 | |
464 real 3m48.393s | |
465 user 0m47.997s | |
466 sys 0m26.641s | |
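A sketch of the general failure mode behind that fix, not the actual unpackz.py code (which is not shown here): if the bookkeeping charges a full BUFSIZE for every read, the short final block before EOF throws the totals off. The function name and the assume_full flag are hypothetical.

```python
BUFSIZE = 8192


def bytes_seen(stream, bufsize=BUFSIZE, assume_full=False):
    """Total the bytes read from stream, block by block."""
    total = 0
    while True:
        block = stream.read(bufsize)
        if not block:
            return total
        # assume_full reproduces the mistake: it charges bufsize even
        # for the short final block.
        total += bufsize if assume_full else len(block)
```

With 20000 bytes of input and the default block size, the correct total is 20000, while the assume_full variant reports 24576 (three full blocks).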
467 | |
468 >: tail /tmp/hst/r3f_val | |
469 10851 1059370472 | |
470 475 1059381323 | |
471 444 1059381798 | |
472 22437 1059382242 | |
473 447 1059404679 | |
474 506 1059405126 | |
475 15183 1059405632 | |
476 471 1059420815 | |
477 457 1059421286 | |
478 17754 1059421743 | |
479 | |
480 >: wc -l /tmp/hst/*_val | |
481 171 /tmp/hst/r3d_val | |
482 160297 /tmp/hst/r3e_val | |
483 160296 /tmp/hst/r3f_val | |
484 320764 total | |
485 >: uz /tmp/hst/head.warc.gz |egrep -c '^WARC/1\.0.$' | |
486 171 | |
487 >: tail -n 3 /tmp/hst/*_val | |
488 ==> /tmp/hst/r3d_val <== | |
489 454 1351795 | |
490 414 1352249 | |
491 0 1352663 [so the 171 above is bogus, and we're missing one] | |
492 | |
493 ==> /tmp/hst/r3e_val <== | |
494 1059393441 457 | |
495 1059393898 17754 | |
496 0 [likewise bogus, so see below] | |
497 | |
498 ==> /tmp/hst/r3f_val <== | |
499 471 1059420815 | |
500 457 1059421286 | |
501 17754 1059421743 [better, but still one missing] | |
502 >: uz /tmp/hst/head.warc.gz |egrep '^WARC-Type: ' | tee >(wc -l 1>&2) | tail -4 | |
503 WARC-Type: response | |
504 WARC-Type: metadata | |
505 WARC-Type: request | |
506 WARC-Type: response [missing] | |
507 171 | |
508 >: ls -lt /tmp/hst/*_val | |
509 -rw-r--r-- 1 hst dc007 1977 Sep 29 09:27 /tmp/hst/r3d_val | |
510 -rw-r--r-- 1 hst dc007 2319237 Sep 28 14:28 /tmp/hst/r3f_val | |
511 -rw-r--r-- 1 hst dc007 2319238 Sep 27 19:41 /tmp/hst/r3e_val | |
512 >: ls -l ~/lib/python/unpackz.py | |
513 -rwxr-xr-x 1 hst dc007 1821 Sep 28 15:13 .../dc007/hst/lib/python/unpackz.py | |
514 So e and f are stale, rerun | |
515 >: time ~/lib/python/unpackz.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz 2>/tmp/hst/r3f_err.txt| tee /tmp/hst/r3f_val|while read l o; do printf '%s\t%s\t%s\n' $l $o 'CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz' ;done |& tee /tmp/hst/r3f_log |ix.py -w |egrep '^WARC-Type: ' | tail -4 & | |
516 >: Reading length, offset, filename tab-delimited triples from stdin... | |
517 WARC-Type: response | |
518 WARC-Type: metadata | |
519 WARC-Type: request | |
520 WARC-Type: response | |
521 | |
522 real 3m49.760s | |
523 user 0m47.180s | |
524 sys 0m32.218s | |
525 So missing the final metadata... | |
526 Back to head.warc.gz, with debug info | |
527 | |
528 >: n=0 && ~/lib/python/unpackz.py /tmp/hst/head.warc.gz 2>/tmp/hst/ttd.txt|while read l o; do echo $((n+=1)); echo $l $o >> /tmp/hst/r3d_val; dd ibs=1 skip=$o count=$l if=/tmp/hst/head.warc.gz of=/dev/stdout 2>/tmp/hst/r3d_ido| uz -t ; done >/tmp/hst/r3d_log 2>&1 | |
529 >: tail -2 /tmp/hst/r3d_log | |
530 171 | |
531 igzip: Error invalid gzip header found for file (null) | |
532 >: tail -n 3 /tmp/hst/ttd.txt /tmp/hst/r3d_val | |
533 ==> /tmp/hst/ttd.txt <== | |
534 b 9697 9697 9243 454 False True | |
535 b 9243 9243 8829 414 False True | |
536 n 8829 | |
537 | |
538 ==> /tmp/hst/r3d_val <== | |
539 454 1351795 | |
540 414 1352249 | |
541 0 1352663 | |
542 | |
543 >: cat -n /tmp/hst/r3f_val | head -172 | tail -4 | |
544 169 454 1351795 | |
545 170 414 1352249 | |
546 171 8829 1352663 | |
547 172 446 1361492 | |
548 | |
549 Fixed, maybe | |
550 | |
551 >: tail -n 3 /tmp/hst/r3d_log /tmp/hst/r3d_val | |
552 ==> /tmp/hst/r3d_log <== | |
553 169 | |
554 170 | |
555 171 | |
556 | |
557 ==> /tmp/hst/r3d_val <== | |
558 454 1351795 | |
559 414 1352249 | |
560 8829 1352663 | |
561 | |
562 Yes! | |
563 | |
564 >: time ~/lib/python/unpackz.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz 2>/tmp/hst/r3f_err| tee /tmp/hst/r3f_val|while read l o; do printf '%s\t%s\t%s\n' $l $o 'CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz' ;done |& tee /tmp/hst/r3f_log |ix.py -w |egrep '^WARC-Type: ' | tail -4 | |
565 Reading length, offset, filename tab-delimited triples from stdin... | |
566 WARC-Type: metadata | |
567 WARC-Type: request | |
568 WARC-Type: response | |
569 WARC-Type: metadata | |
570 | |
571 real 3m26.042s | |
572 user 0m44.167s | |
573 sys 0m24.716s | |
574 >: tail -n 3 /tmp/hst/r3f* | |
575 ==> /tmp/hst/r3f_err <== | |
576 | |
577 ==> /tmp/hst/r3f_val <== | |
578 457 1059421286 | |
579 17754 1059421743 | |
580 425 1059439497 | |
581 | |
47
fbdaede4155a
cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
46
diff
changeset
|
582 Doubling buffer size doesn't speed things up |
583 >: time ~/lib/python/unpackz.py -b $((2 * 1024 * 1024)) /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz 2>/tmp/hst/r3g_err| tee /tmp/hst/r3g_val|while read l o; do printf '%s\t%s\t%s\n' $l $o 'CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz' ;done |& tee /tmp/hst/r3g_log |ix.py -w |egrep '^WARC-Type: ' | tail -4 |
584 Reading length, offset, filename tab-delimited triples from stdin... |
585 WARC-Type: metadata |
586 WARC-Type: request |
587 WARC-Type: response |
588 WARC-Type: metadata |
589 |
590 real 3m34.519s |
591 user 0m52.312s |
592 sys 0m24.875s |
593 |
594 Tried using FileIO.readinto([a fixed buffer]), but didn't immediately |
595 work. Abandoned because I still don't understand how zlib.decompress |
596 works at all... |
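For reference, a minimal sketch of one way readinto with a fixed buffer can be combined with zlib. It rests on documented behaviour of zlib decompression objects: decompress() consumes input up to the end of the current compressed stream, then eof becomes True and any leftover input is available as unused_data. The function name count_members is hypothetical, and this is not a reconstruction of the attempt described above.

```python
import zlib

GZIP_WBITS = zlib.MAX_WBITS | 16  # gzip framing


def count_members(raw, bufsize=65536):
    """Count gzip members in raw, reading into one reusable buffer."""
    buf = bytearray(bufsize)
    view = memoryview(buf)
    d = zlib.decompressobj(GZIP_WBITS)
    members = 0
    while True:
        n = raw.readinto(buf)         # number of bytes actually placed in buf
        if not n:
            break
        data = view[:n]               # only the filled part of the buffer
        while len(data):
            d.decompress(data)
            if not d.eof:
                break                 # member continues in the next read
            members += 1
            data = d.unused_data      # start of the next member, if any
            d = zlib.decompressobj(GZIP_WBITS)
    return members
```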
597 |
598 Time to convert unpackz to a library which takes a callback |
599 alternative to an output file -- Done |
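A sketch of what a callback-taking, streaming interface of this kind might look like. The names unpack and print_index, and the callback signature (offset, length, record), are assumptions for illustration and not the actual unpackz API.

```python
import zlib

GZIP_WBITS = zlib.MAX_WBITS | 16  # gzip framing


def unpack(stream, callback, bufsize=1024 * 1024):
    """Call callback(offset, length, record_bytes) once per gzip member."""
    offset, consumed, pieces, pending = 0, 0, [], b""
    d = zlib.decompressobj(GZIP_WBITS)
    while True:
        # Bytes left over from the previous member are processed before
        # reading any more from the stream.
        chunk = pending or stream.read(bufsize)
        pending = b""
        if not chunk:
            break
        pieces.append(d.decompress(chunk))
        if d.eof:
            # Count only the part of this chunk the member actually used.
            length = consumed + len(chunk) - len(d.unused_data)
            callback(offset, length, b"".join(pieces))
            offset, consumed, pieces = offset + length, 0, []
            pending = d.unused_data
            d = zlib.decompressobj(GZIP_WBITS)
        else:
            consumed += len(chunk)    # actual bytes read, not bufsize


def print_index(offset, length, record):
    """Example callback: print length and offset for each record."""
    print(length, offset)
```

A call such as unpack(open(path, "rb"), print_index) would then print one line per record; other callbacks can inspect the record bytes directly without a second pass over the file.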
600 |
601 W/o using callback, timing and structure for what we need for |
602 re-indexing task looks encouraging: |
603 >: time ~/lib/python/cc/unpackz.py -o /dev/stdout /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz |egrep -aA20 '^WARC-Type: response' | cut -f 1 -d ' ' | egrep -a '^WARC-' |sus | tee >(wc -l 1>&2) |
604 52468 WARC-Block-Digest: |
605 52468 WARC-Concurrent-To: |
606 52468 WARC-Date: |
607 52468 WARC-Identified-Payload-Type: |
608 52468 WARC-IP-Address: |
609 52468 WARC-Payload-Digest: |
610 52468 WARC-Record-ID: |
611 52468 WARC-Target-URI: |
612 52468 WARC-Type: |
613 52468 WARC-Warcinfo-ID: |
614 236 WARC-Truncated: |
615 11 |
616 |
617 real 0m20.308s |
618 user 0m19.720s |
619 sys 0m4.505s |
620 |
621 Whole thing, with no pre-filtering: |
622 |
623 >: time ~/lib/python/cc/unpackz.py -o /dev/stdout /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz | cut -f 1 -d ' ' | egrep -a '^(WARC-|Content-|Last-Modified)' |sus | tee >(wc -l 1>&2) |
624 211794 Content-Length: |
625 211162 Content-Type: |
626 159323 WARC-Target-URI: |
627 159311 WARC-Warcinfo-ID: |
628 159301 WARC-Record-ID: |
629 159299 WARC-Date: |
630 159297 WARC-Type: |
631 105901 WARC-Concurrent-To: |
632 105896 WARC-IP-Address: |
633 52484 WARC-Block-Digest: |
634 52484 WARC-Identified-Payload-Type: |
635 52482 WARC-Payload-Digest: |
636 9239 Last-Modified: |
637 3941 Content-Language: |
638 2262 Content-Security-Policy: |
639 642 Content-language: |
640 326 Content-Security-Policy-Report-Only: |
641 238 WARC-Truncated: |
642 114 Content-Disposition: |
643 352 Content-*: |
644 1 WARC-Filename: |
645 42 |
646 |
647 real 0m30.896s |
648 user 0m37.335s |
649 sys 0m7.542s |
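The header-name tally above can be sketched in Python, assuming that `sus` is shorthand for something like sort | uniq -c | sort -nr (that is a guess; it is not defined in these notes). The function name tally_header_names is hypothetical. As with the shell pipeline, the match is case-sensitive and will also count body lines that happen to start with one of the prefixes.

```python
from collections import Counter

PREFIXES = (b"WARC-", b"Content-", b"Last-Modified")


def tally_header_names(lines):
    """Count the first space-delimited field of each matching line."""
    counts = Counter()
    for line in lines:
        # Equivalent of: cut -f 1 -d ' '
        first = line.rstrip(b"\r\n").split(b" ", 1)[0]
        # Equivalent of: egrep -a '^(WARC-|Content-|Last-Modified)'
        if first.startswith(PREFIXES):
            counts[first.decode("latin-1")] += 1
    return counts
```

Calling most_common() on the result lists the names in descending order of frequency, which is the same shape as the listings here.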
650 |
651 First 51 after WARC-Type: response |
652 |
653 >: time ~/lib/python/cc/unpackz.py -o /dev/stdout /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz |egrep -aA50 '^WARC-Type: response' | cut -f 1 -d ' ' | egrep -a '^(WARC-|Content-|Last-Modified)' |sus | tee >(wc -l 1>&2) |
654 106775 Content-Length: |
655 106485 Content-Type: |
656 55215 WARC-Type: |
657 55123 WARC-Date: |
658 54988 WARC-Record-ID: |
659 54551 WARC-Warcinfo-ID: |
660 54246 WARC-Target-URI: |
661 54025 WARC-Concurrent-To: |
662 52806 WARC-IP-Address: |
663 52468 WARC-Block-Digest: |
664 52468 WARC-Identified-Payload-Type: |
665 52468 WARC-Payload-Digest: |
666 9230 Last-Modified: |
667 3938 Content-Language: |
668 2261 Content-Security-Policy: |
669 639 Content-language: |
670 324 Content-Security-Policy-Report-Only: |
671 236 WARC-Truncated: |
672 114 Content-Disposition: |
673 342 Content-*: |
674 41 |
675 |
676 real 0m21.483s |
677 user 0m22.372s |
678 sys 0m5.400s |
679 |
58 | 680 So, not worth the risk, let's try python: cdx_extras implements a |
681 callback for unpackz that outputs the LM header if it's there | |
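A sketch of the kind of record inspection such a callback might do, assuming each record arrives as decompressed bytes with a WARC header block, a blank line, and then the HTTP header block. The function name last_modified is hypothetical and the real cdx_extras code is not shown in these notes.

```python
def last_modified(record):
    """Return the Last-Modified value of a WARC response record, else None."""
    warc_head, sep, rest = record.partition(b"\r\n\r\n")
    if not sep or b"\r\nWARC-Type: response\r\n" not in warc_head + b"\r\n":
        return None                       # not a response record
    http_head = rest.partition(b"\r\n\r\n")[0]
    for line in http_head.split(b"\r\n")[1:]:   # skip the HTTP status line
        name, colon, value = line.partition(b":")
        # HTTP header names are case-insensitive, so compare in lower case.
        if colon and name.strip().lower() == b"last-modified":
            return value.strip().decode("latin-1")
    return None
```

Because it only looks inside the HTTP header block of response records, it avoids counting body lines that happen to begin with Last-Modified.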
47 | 682 |
683 >: time ~/lib/python/cc/cdx_extras.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz|wc -l |
684 9238 |
685 |
686 real 0m25.426s |
687 user 0m23.201s |
688 sys 0m0.711s |
689 |
690 Looks good, but why 9238 instead of 9239??? |
691 |
692 >: ~/lib/python/cc/unpackz.py -o /dev/stdout /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz | egrep -a '^Last-Modified: ' > /tmp/hst/lmo.tsv |
693 |
694 Argh. Serious bug in unpackz, wasn't handling cross-buffer-boundary |
695 records correctly. Fixed. Redoing the above... |
696 |
697 No pre-filter: |
698 >: uz /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz|egrep -c '^WARC/1\.0.$' |
699 160297 |
700 |
701 >: time ~/lib/python/cc/unpackz.py -o /dev/stdout /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz | cut -f 1 -d ' ' | egrep -a '^(WARC-|Content-|Last-Modified)' |sus | tee >(wc -l 1>&2) |
702 |
703 213719 Content-Length: |
704 213088 Content-Type: |
705 160297 WARC-Date: |
706 160297 WARC-Record-ID: |
707 160297 WARC-Type: |
708 160296 WARC-Target-URI: |
709 160296 WARC-Warcinfo-ID: |
710 106864 WARC-Concurrent-To: |
711 106864 WARC-IP-Address: |
712 53432 WARC-Block-Digest: [consistent with 160297 == (3 * 53432) + 1] |
713 53432 WARC-Identified-Payload-Type: |
714 53432 WARC-Payload-Digest: |
fbdaede4155a
cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
46
diff
changeset
|
715 9430 Last-Modified: |
fbdaede4155a
cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
46
diff
changeset
|
716 4006 Content-Language: |
fbdaede4155a
cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
46
diff
changeset
|
717 2325 Content-Security-Policy: |
fbdaede4155a
cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
46
diff
changeset
|
718 653 Content-language: |
fbdaede4155a
cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
46
diff
changeset
|
719 331 Content-Security-Policy-Report-Only: |
fbdaede4155a
cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
46
diff
changeset
|
720 298 WARC-Truncated: |
fbdaede4155a
cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
46
diff
changeset
|
721 128 Content-Disposition: |
fbdaede4155a
cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
46
diff
changeset
|
722 83 Content-Location: |
fbdaede4155a
cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
46
diff
changeset
|
723 67 Content-type: |
fbdaede4155a
cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
46
diff
changeset
|
724 51 Content-MD5: |
fbdaede4155a
cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
46
diff
changeset
|
725 45 Content-Script-Type: |
fbdaede4155a
cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
46
diff
changeset
|
726 42 Content-Style-Type: |
fbdaede4155a
cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
46
diff
changeset
|
727 31 Content-Transfer-Encoding: |
fbdaede4155a
cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
46
diff
changeset
|
728 13 Content-disposition: |
fbdaede4155a
cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
46
diff
changeset
|
729 8 Content-Md5: |
fbdaede4155a
cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
46
diff
changeset
|
730 5 Content-Description: |
fbdaede4155a
cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
46
diff
changeset
|
731 5 Content-script-type: |
fbdaede4155a
cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
46
diff
changeset
|
732 5 Content-style-type: |
fbdaede4155a
cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
46
diff
changeset
|
733 3 Content-transfer-encoding: |
fbdaede4155a
cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
46
diff
changeset
|
734 2 Content-Encoding-handler: |
fbdaede4155a
cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
46
diff
changeset
|
735 1 Content-DocumentTitle: |
fbdaede4155a
cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
46
diff
changeset
|
736 1 Content-Hash: |
fbdaede4155a
cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
46
diff
changeset
|
737 1 Content-ID: |
fbdaede4155a
cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
46
diff
changeset
|
738 1 Content-Legth: |
fbdaede4155a
cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
46
diff
changeset
|
739 1 Content-length: |
fbdaede4155a
cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
46
diff
changeset
|
740 1 Content-Range: |
fbdaede4155a
cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
46
diff
changeset
|
741 1 Content-Secure-Policy: |
fbdaede4155a
cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
46
diff
changeset
|
742 1 Content-security-policy: |
fbdaede4155a
cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
46
diff
changeset
|
743 1 Content-Type-Options: |
fbdaede4155a
cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
46
diff
changeset
|
744 1 WARC-Filename: |
fbdaede4155a
cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
46
diff
changeset
|
745 42 |
fbdaede4155a
cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
46
diff
changeset
|
746 |
fbdaede4155a
cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
46
diff
changeset
|
747 real 0m28.876s |
fbdaede4155a
cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
46
diff
changeset
|
748 user 0m35.703s |
fbdaede4155a
cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
46
diff
changeset
|
749 sys 0m6.976s |
fbdaede4155a
cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
46
diff
changeset
|
750 |
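For reference, a rough Python equivalent of the tally pipeline above, assuming `sus` is shorthand for sort | uniq -c | sort -nr (that is a guess at the alias, and `tally_headers` is a made-up name):

```python
from collections import Counter

PREFIXES = (b'WARC-', b'Content-', b'Last-Modified')

def tally_headers(lines):
    """Count the first space-delimited field of every line that starts
    with one of PREFIXES, most frequent first (like cut | egrep | sus)."""
    counts = Counter()
    for line in lines:
        first = line.rstrip(b'\r\n').split(b' ', 1)[0]
        if first.startswith(PREFIXES):
            counts[first] += 1
    return counts.most_common()
```

Like the egrep above, this is case-sensitive and also counts lines from payload bodies that happen to start with one of the prefixes.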
>: ~/lib/python/cc/unpackz.py -o /dev/stdout /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz | egrep -a '^Last-Modified: ' > /tmp/hst/lmo.tsv
>: wc -l /tmp/hst/lmo.tsv
9430 /tmp/hst/lmo.tsv
>: time ~/lib/python/cc/cdx_extras.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz > /tmp/hst/lm.tsv

real 0m17.191s
user 0m15.739s
sys 0m0.594s
>: wc -l /tmp/hst/lm.tsv
9423 /tmp/hst/lm.tsv

>: diff <(sed 's/^Last-Modified: //' /tmp/hst/lmo.tsv | tr -d '\r') <(cut -f 3 /tmp/hst/lm.tsv)
853d852
< Mon, 19 Aug 2019 01:46:49 GMT [in XML comment at very end of xHTML]
4058d4056
< Tue, 03 Nov 2015 21:31:18 GMT<br /> [in an HTML table]
4405d4402
< Mon, 19 Aug 2019 01:54:52 GMT [double lm]
5237,5238d5233
< 3 [bogus extension lines to preceding LM]
< Asia/Amman
7009d7003
< Mon, 19 Aug 2019 02:34:20 GMT [in XML comment at very end of xHTML]
9198d9191
< Mon, 19 Aug 2019 02:14:49 GMT [in XML comment at very end of xHTML]

All good. The only implausible case is
 < Mon, 19 Aug 2019 01:54:52 GMT
which turns out to be a case of two Last-Modified headers in the
same response record's HTTP headers. RFCs 2616 and 7230 rule it
out but neither specifies a recovery, so first-wins is as good as
anything, and indeed RFC 6797 specifies that for its own header.
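
A minimal sketch of the first-wins rule, not taken from cdx_extras.py: `first_header` is a hypothetical helper, and the case-insensitive match is an assumption (the egrep above was case-sensitive).

```python
def first_header(header_lines, name):
    """Return the value of the first header called `name` (bytes),
    ignoring any later repeats; None if the header is absent."""
    wanted = name.lower()
    for raw in header_lines:
        key, sep, value = raw.partition(b':')
        # Lines without a colon (e.g. the HTTP status line) are skipped
        if sep and key.strip().lower() == wanted:
            return value.strip()
    return None
```

For the double-LM record above this would keep whichever Last-Modified line comes first in the response headers.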

Start looking at how we do the merge of cdx_extras.py with existing index

====2024-12-19====

The above test shows 17.6% of entries have an LM value
For a 3 billion entry dataset, that means 530 million LM entries, call
this n.

Sizes? For a 10% false-positive rate, we need m bits = -n * ln(.1) / ln(2)^2

(- (/ (* n (log .1)) (* (log 2)(log 2)))) = 2,535,559,358 =~ 320MB

That's too much :-) Per segment, that becomes possible?
25,355,594 bits =~ 3.2MB
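
The same sizing arithmetic as a small Python sketch (function names are mine; the formulas are the standard Bloom filter ones, m = -n ln p / (ln 2)^2 and k = (m/n) ln 2):

```python
import math

def bloom_bits(n, p):
    """Bits needed for n entries at false-positive rate p."""
    return math.ceil(-n * math.log(p) / math.log(2) ** 2)

def bloom_hashes(m, n):
    """Optimal number of hash functions for m bits and n entries."""
    return max(1, round(m / n * math.log(2)))

if __name__ == '__main__':
    n = 523863383                      # total from wc -l ks*.tsv
    for p in (0.1, 0.05):
        m = bloom_bits(n, p)
        print(p, m, 'bits =~', m // 8, 'bytes, k =', bloom_hashes(m, n))
```

With n = 523863383 this comes out within a fraction of a percent of the uris.bloom and uris_20.bloom file sizes recorded in the ls -l listing in these notes.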

But maybe it's _not_ too much. One of the python implementations I
saw uses mmap:

https://github.com/prashnts/pybloomfiltermmap3

Build a Bloom filter with all the URIs whose entries have LM value
_and_ a python hashtable mapping from URI to LM and offset (is that
enough for deduping?)
Rewrite one index file at a time
 Probe with each URI, if positive
  look up in hashtable and use if found
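
A toy sketch of that probe-then-lookup plan. Assumptions: `TinyBloom` is an in-memory stand-in for the mmap-backed filter, `augment` and the row shapes are invented for illustration, and matching on offset as the dedup check is only a guess at an answer to the question above.

```python
import hashlib

class TinyBloom:
    """Toy in-memory Bloom filter (double hashing over one SHA-256 digest)."""
    def __init__(self, m_bits, k):
        self.m, self.k = m_bits, k
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, key):
        digest = hashlib.sha256(key.encode('utf-8')).digest()
        h1 = int.from_bytes(digest[:8], 'big')
        h2 = int.from_bytes(digest[8:16], 'big') | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, key):
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(key))

def augment(index_rows, bloom, lm_table):
    """index_rows: iterable of (uri, offset); lm_table: {uri: (lm, offset)}.
    Yields (uri, offset, lm), with lm None when nothing usable is found."""
    for uri, offset in index_rows:
        lm = None
        if uri in bloom:                   # cheap probe, may be a false positive
            hit = lm_table.get(uri)        # authoritative lookup
            if hit is not None and hit[1] == offset:
                lm = hit[0]
        yield uri, offset, lm
```

The filter only pays off if most index URIs are absent from the table, since every positive still costs a dict lookup.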

>: wc -l ks*.tsv
  52369734 ks_0-9.tsv
  52489306 ks_10-19.tsv
  52381115 ks_20-29.tsv
  52438862 ks_30-39.tsv
  52512044 ks_40-49.tsv
  52476964 ks_50-59.tsv
  52317116 ks_60-69.tsv
  52200680 ks_70-79.tsv
  52382426 ks_80-89.tsv
  52295136 ks_90-99.tsv
 523863383 total

>>> import timeit
>>> from pybloomfilter import BloomFilter
>>> f=BloomFilter(523863383,0.1,'/tmp/hst/uris.bloom')
>>> def bff(f,fn):
...   with open(fn) as uf:
...     while (l:=uf.readline()):
...       f.add(l.split('\t')[2])
...
>>> timeit.timeit("bff(f,'/dev/null')",number=1,globals=globals())
0.00012309104204177856
>>> timeit.timeit("bff(f,'/work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv')",number=1,globals=globals())
77.57737312093377
>>> 'http://71.43.189.10/dermorph' in f
False
>>> 'http://71.43.189.10/dermorph/' in f
True
>>> timeit.timeit("'http://71.43.189.10/dermorph/' in f",number=100000,globals=globals())
0.02377822808921337
>>> timeit.timeit("'http://71.43.189.10/dermorph' in f",number=100000,globals=globals())
0.019318239763379097

_That's_ encouraging...
Be sure to f.close()
Use BloomFilter.open for an existing bloom file
Copying a file from /tmp to work/... still gives good (quick) lookup,
but _creating and filling_ a file on work/... takes ... I stopped
waiting after an hour or so.

How much bigger is .05 false positive?
Less than expected:
>: ls -l /tmp/hst
-rwxr-xr-x 1 hst dc007 408301988 Jan 1 16:52 uris_20.bloom
-rwxr-xr-x 1 hst dc007 313830100 Jan 1 15:04 uris.bloom
And still same (?) fill time:
>>> g=BloomFilter(523863383,0.05,'/tmp/hst/uris_20.bloom')
>>> T=timeit.Timer("bff(g,'/work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv')",globals=globals())
>>> T.repeat(3,number=1)
[89.64385064691305, 90.9979057777673, 83.9632708914578]
Build a test harness wrt the python dict I'm going to need...
Can't immediately find a way to optimise a dict to have umpty millions
of entries...
>: cat /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv|~/lib/python/cc/lmh/test.py -n 1000000 -r 5
1000002
1000002
1000002
1000002
1000002
[1.229693355038762, 1.3374222852289677, 1.3509841952472925, 1.080365838482976, 1.1893387716263533]
Full as-it-were segment:
>: ~/lib/python/cc/lmh/test.py -n 6000000 -r 5 -f /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv
6000002
6000002
6000002
6000002
6000002
[7.250897390767932, 7.237801244482398, 7.239673590287566, 7.32976414449513, 7.23588689416647]
Full 10th of the data:
>: ~/lib/python/cc/lmh/test.py -r 3 -f /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv
52369734
52369734
52369734
[69.63967163302004, 69.09140252694488, 66.49750975705683]
That's tolerable.
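
For the record, roughly the shape such a harness could take. This is not the actual test.py: the function names are invented, and the column layout (URI in the third tab-separated field, as in bff above, everything else kept as the value) is an assumption.

```python
import timeit

def load_dict(lines, limit=None):
    """Build {uri: other fields} from TSV lines, reading at most `limit` lines."""
    table = {}
    for i, line in enumerate(lines):
        if limit is not None and i >= limit:
            break
        fields = line.rstrip('\n').split('\t')
        table[fields[2]] = tuple(fields[:2] + fields[3:])
    return table

def time_build(make_lines, repeat=3, limit=None):
    """Time `repeat` independent builds; make_lines() must give a fresh iterable."""
    return timeit.repeat(lambda: load_dict(make_lines(), limit),
                         repeat=repeat, number=1)

def time_lookups(table, key, number=100000):
    """Time `number` membership tests of one key."""
    return timeit.timeit(lambda: key in table, number=number)
```

Usage would be along the lines of time_build(lambda: open(path), repeat=3), accepting that the file handles are left for the garbage collector to close.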
>: ~/lib/python/cc/lmh/test_hash.py -r 3 -f /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv
52369734
52369734
52369734
[64.51177835091949, 71.6610240675509, 67.74966451153159]
[0.0034751780331134796, 0.0034532323479652405, 0.0033454522490501404]
Last line is 100000 lookups.

So, try a test:
>: time ~/lib/python/cc/lmh/test_hash.py -r 1 -p results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.pickle -f /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv
52369734
[70.98342595621943]
[0.0037928372621536255]

real 1m51.456s
user 1m32.901s
sys 0m17.937s
>: ls -lh results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.pickle
-rw-r--r-- 1 hst dc007 5.5G Jan 2 12:19 results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.pickle
cdx_out.write(b' ')
cdx_out.write(b' ')
>: time ~/lib/python/cc/lmh/test_lookup1.py
52369734
1076046 130318

real 1m52.668s
user 1m40.751s
sys 0m9.610s

Not bad. 1.5 minutes per file (x 300 index files), plus 10 x 20 secs
or so for the unpickles =~ 453 minutes, call it 8 hours.

Try pre-filter with the Bloom filter.
================


Try it with the existing _per segment_ index we have for 2019-35

Assuming we have to key on segment / file and offset, as reconstructing the
proper index key is such a pain / buggy / is going to change with the year.

Stay with segment 49

>: uz cdx.gz |wc -l
29,870,307

>: time uz cdx.gz|egrep -ao ' "length": "[0-9]*", "offset": "[0-9]*"' |wc
29,870,307 119,481,228 1,241,098,122
           = 4 * 29,870,307

So no bogons, not _too_ surprising :-)

Bad news is it's a _big_ file:

>: ls -lh cdx.gz
-rw-r--r-- 1 hst dc007 2.0G Mar 18 2021 cdx.gz

So not viable to paste offset as a key and then sort on command line,
or to load it in to python and do the work there...

Do it per warc file and then merge?

>: time uz cdx.gz |fgrep -a warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz | sort -n -t\" -k28,28 > /tmp/hst/558.warc.cdx

real 0m23.494s
user 0m14.541s
sys 0m9.158s

>: wc -l /tmp/hst/558.warc.cdx
53432 /tmp/hst/558.warc.cdx

49
deeac8a0a682
tentative plan for merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
48
diff
changeset
|
>: echo $((600 * 53432))
32,059,200

So, 600 of those, plus approx. same again for extracting, that pbly
_is_ doable in python, not more than 10 hours total, assuming internal
sort and external merge is not too expensive...

For each segment, suppose we pull out 60 groups of 10 target files
>: time uz cdx.gz |egrep -a warc/CC-MAIN-2019[^-]*-2019[^-]*-0000..warc.gz > /tmp/hst/0000.warc.cdx

real 0m42.129s
user 0m35.147s
sys 0m9.140s
>: wc -l /tmp/hst/0000.warc.cdx
533150

Key it with offset and sort:

>: time egrep -ao ' "length": "[0-9]*", "offset": "[0-9]*"' /tmp/hst/0000.warc.cdx | cut -f 5 -d ' ' | tr -d \" > /tmp/hst/0000_offsets

real 0m5.578s
user 0m5.593s
sys 0m0.265s

>: time paste /tmp/hst/0000_offsets /tmp/hst/0000.warc.cdx |sort -nk1,1 | cut -f 2 > /tmp/hst/0000_sorted.warc.cdx

real 0m4.185s
user 0m2.001s
sys 0m1.334s
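
The egrep/paste/sort/cut steps above could be collapsed into one in-memory step in Python, which looks feasible at ~530K lines per group. A sketch only, with a made-up function name, and assuming every line carries the "length"/"offset" pair in the form the egrep pattern expects:

```python
import re

# Same pattern as the egrep above, with the offset captured
OFFSET = re.compile(r' "length": "[0-9]*", "offset": "([0-9]+)"')

def sort_by_offset(cdx_lines):
    """Return the cdx lines ordered by numeric offset (stable for ties)."""
    def key(line):
        m = OFFSET.search(line)
        if m is None:
            raise ValueError('no offset in cdx line: %r' % line[:80])
        return int(m.group(1))
    return sorted(cdx_lines, key=key)
```

Ties (the same offset in different warc files of the group) keep their input order here, which may differ from what sort -nk1,1 produces.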

>: time seq 0 9 | parallel -j 10 "~/lib/python/cc/cdx_extras.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-*-*-0000'{}'.warc.gz > /tmp/hst/lm_0000'{}'.tsv"

real 0m24.610s
user 2m54.146s
sys 0m10.226s

>: head /tmp/hst/lm_00000.tsv
9398 16432 Mon, 19 Aug 2019 02:44:15 GMT
20796 26748 Tue, 16 Jul 2019 04:39:09 GMT
4648 340633 Fri, 07 Dec 2018 09:05:59 GMT
3465 357109 Sun, 18 Aug 2019 11:48:23 GMT
7450 914189 Mon, 19 Aug 2019 02:50:08 GMT
...
sing<3956>: fgrep '"length": "9398", "offset": "16432"' /tmp/hst/0000_sorted.warc.cdx
com,roommeme,0401a)/index.phtml?channel=&op=&p=140&put=show&r2= 20190819024416 {"url": "http://0401a.roommeme.com/index.phtml?PUT=SHOW&R2=&OP=&P=140&CHANNEL=", "mime": "text/html", "mime-detected": "text/html", "status": "200", "digest": "5DNDVX5HQBOOBHISSCOI4UBVMUL63L36", "length": "9398", "offset": "16432", "filename": "crawl-data/CC-MAIN-2019-35/segments/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00000.warc.gz", "charset": "Big5", "languages": "zho"}

bingo

So, the python code is pretty straightforward: open the 10 individual
lm_*.tsv outputs into an array, initialise a 10-elt array with the
first line of each and another with its offset, record the
fileno(s) of the lowest offset, then iterate

 read cdx lines and write unchanged until offset = lowest
 merge line from fileno and output
 remove fileno from list of matches
 read and store a new line for fileno [handle EOF]
 if list of matches is empty, redo setting of lowest

Resort the result by actual key
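
A sketch of that merge, in a simplified variant: instead of tracking the globally lowest offset, it takes the file number from each cdx line's own filename and keeps one pending LM line per file. Assumptions, all mine: the name `merge_lm`, the tab-separated length/offset/LM layout of the lm_*.tsv lines, and appending the value as a "lastmod" property at the end of the JSON (the value format is a guess).

```python
import json
import re

# offset and 5-digit file number, as laid out in the cdx line shown above
CDX_KEY = re.compile(r'"offset": "([0-9]+)", "filename": "[^"]*-([0-9]{5})\.warc\.gz"')

def merge_lm(cdx_lines, lm_streams):
    """cdx_lines: cdx lines sorted by offset, covering several WARC files.
    lm_streams: {file number: iterable of 'length<TAB>offset<TAB>LM' lines,
    in offset order}. Yields cdx lines, augmented where an LM value matches."""
    streams = {no: iter(s) for no, s in lm_streams.items()}
    heads = {}                                  # file number -> (offset, lm)

    def advance(no):
        line = next(streams[no], None)
        if line is None:                        # EOF for this file
            heads.pop(no, None)
        else:
            _length, offset, lm = line.rstrip('\n').split('\t', 2)
            heads[no] = (int(offset), lm)

    for no in streams:
        advance(no)
    for line in cdx_lines:
        m = CDX_KEY.search(line)
        if m:
            offset, no = int(m.group(1)), m.group(2)
            while no in heads and heads[no][0] < offset:
                advance(no)                     # LM entry with no cdx line
            if no in heads and heads[no][0] == offset:
                head, brace, tail = line.rpartition('}')
                line = head + ', "lastmod": ' + json.dumps(heads[no][1]) + brace + tail
                advance(no)
        yield line
```

The output is still in offset order, so the final re-sort by actual key is a separate step.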

Meanwhile, get a whole test set:
sbatch --output=slurm_aug_cdx_49_10-599-out --time=01:00:00 --ntasks=10 -c 36 --exclusive $HOME/bin/runme.sh -m 00 59 $PWD -t 18 -b 'export resdir=CC-MAIN-2019-35/aug_cdx/49
export DEC=$xarg' "export PYTHONPATH=./lib/python/cc:$PYTHONPATH
seq 0 9 | parallel -j 10 \"~/lib/python/cc/cdx_extras.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-*-*-00\${DEC}'{}'.warc.gz > \$resdir/00\${DEC}'{}'.tsv\""

Actually finished 360 in the hour.

Leaving

sbatch --output=slurm_aug_cdx_49_360-599-out --time=01:00:00 --ntasks=10 -c 36 --exclusive $HOME/bin/runme.sh -m 36 59 $PWD -t 18 -b 'export resdir=CC-MAIN-2019-35/aug_cdx/49
export DEC=$xarg' "export PYTHONPATH=./lib/python/cc:$PYTHONPATH
seq 0 9 | parallel -j 10 \"~/lib/python/cc/cdx_extras.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-*-*-00\${DEC}'{}'.warc.gz > \$resdir/00\${DEC}'{}'.tsv\""

But something is wrong, the number of jobs is all wrong:

5>: fgrep -c parallel slurm_aug_cdx_49_0-359-out
741
sing<4046>: ls -lt CC-MAIN-2019-35/aug_cdx/49/|wc -l
372

Every file is being produced twice.
51
dc24bb6e524f
done cdx_aux for segments 49--55 of 2019-35
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
50
diff
changeset
|

Took me a while to figure out my own code :-(

>: sbatch --output=slurm_aug_cdx_49_360-599-out --time=01:00:00 --ntasks=10 -c 36 --exclusive $HOME/bin/runme.sh -m 49 49 $PWD -t 18 -b 'export resdir=CC-MAIN-2019-35/aug_cdx/$xarg
export SEG=$xarg
share_by_task.sh -f "%03g\n" -s 360 599 $n $task > /tmp/hst_$task' -i 'cat /tmp/hst_$task' 'export PYTHONPATH=./lib/python/cc:$PYTHONPATH
~/lib/python/cc/cdx_extras.py /beegfs/common_crawl/CC-MAIN-2019-35/*.$SEG/orig/warc/CC-MAIN-*-*-00${arg}.warc.gz > $resdir/00${arg}.tsv'

Oops, only 560, not 600

Took 3.5 minutes for 200, so call it 10 for 560, so do 6 more in an
hour:

>: sbatch --output=slurm_aug_cdx_50-55_out --time=01:00:00 --ntasks=10 -c 36 --exclusive $HOME/bin/runme.sh -m 50 55 $PWD -t 18 -b 'export resdir=CC-MAIN-2019-35/aug_cdx/$xarg
mkdir -p $resdir
export SEG=$xarg
share_by_task.sh -f "%03g\n" -s 360 599 $n $task > /tmp/hst_$task' -i 'cat /tmp/hst_$task' 'export PYTHONPATH=./lib/python/cc:$PYTHONPATH
~/lib/python/cc/cdx_extras.py /beegfs/common_crawl/CC-MAIN-2019-35/*.$SEG/orig/warc/CC-MAIN-*-*-00${arg}.warc.gz > $resdir/00${arg}.tsv'

>: tail slurm_aug_cdx_50-55_out
...
Wed Oct 9 22:25:47 BST 2024 Finished 55
>: head -1 slurm_aug_cdx_50-55_out
Wed Oct 9 21:29:43 BST
 56:04

>: du -s CC-MAIN-2019-35/aug_cdx
1,902,916

Not bad, so order 20-30GB for the whole thing (du reports KB, and this is 7 segments out of 100)

Next step, compare to my existing cdx with timestamp
52
8dffb8aa33da
prelim consistency check with published lmh-augmented cdx
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
51
diff
changeset
|
1071 |
1072 First check looks about right: |
1073 |
1074 [cd .../warc_lmhx] |
1075 >: seq --format='%03g' 0 299 > /tmp/hst/cdx_nums |
1076 >: parallel -j 20 -a /tmp/hst/cdx_nums 'uz idx/cdx-00{}.gz | egrep -o "\"filename\": \"crawl-data/CC-MAIN-2019-35/segments/[^.]*[.]50.*\"lastmod\":" | sed "s/^.*-00//;s/^\(...\).*/\1/"| sus > /tmp/hst/checkseg_50_{}' |
1077 |
1078 [cd .../aug_cdx/50] |
1079 >: wc -l 00123.tsv |
1080 9333 |
1081 >: egrep -h '123$' /tmp/hst/checkseg_50_??? | acut 1 | btot |
1082 9300 |
1083 >: wc -l 00400.tsv |
1084 9477 00400.tsv |
1085 >: egrep -h '400$' /tmp/hst/checkseg_50_??? | acut 1 | btot |
1086 9439 |
1087 |
1088 Difference is presumably that the bogus timestamps aren't in the augmented |
1089 cdx as shipped. |
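The per-file tally behind that first check can be sketched in Python. This is only an approximation of the egrep/sed pipeline above: it reuses the same regular expression, assumes (as that pattern does) that "lastmod" follows "filename" on the line, and does not try to reproduce what the local helpers uz, sus, acut and btot do. The function name and the sample lines are invented for illustration.

```python
import re
from collections import Counter

# Same pattern as the egrep above: a segment directory ending in ".50"
# and a "lastmod" key somewhere after the filename.
MATCH = re.compile(
    r'"filename": "crawl-data/CC-MAIN-2019-35/segments/[^.]*[.]50.*"lastmod":'
)


def tally_warc_numbers(cdx_lines):
    """Count matching index lines per 3-character WARC file number."""
    counts = Counter()
    for line in cdx_lines:
        m = MATCH.search(line)
        if m is None:
            continue
        # sed 's/^.*-00//': drop everything through the last "-00"
        # (the whole string is kept if "-00" is absent, as with sed).
        rest = m.group(0).rpartition("-00")[2]
        # sed 's/^\(...\).*/\1/': keep the first three characters.
        counts[rest[:3]] += 1
    return counts


# Hypothetical index lines: segment ids and file names are invented.
PREFIX = '{"filename": "crawl-data/CC-MAIN-2019-35/segments/'
sample = [
    PREFIX + '111.50/warc/CC-MAIN-X-Y-00123.warc.gz", "lastmod": "1"}',
    PREFIX + '111.50/warc/CC-MAIN-X-Y-00400.warc.gz", "lastmod": "2"}',
    PREFIX + '222.51/warc/CC-MAIN-X-Y-00123.warc.gz", "lastmod": "3"}',
]
for number, count in sorted(tally_warc_numbers(sample).items()):
    print(count, number)
```

The count for a given file number is what gets compared with wc -l on the corresponding .tsv file.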
53
d533894173d0
detailed consistency check with 7 segments from published lmh-augmented cdx
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
52
diff
changeset
|
1090 |
1091 Note that the following 'bad' kind of timestamp is fixed before |
1092 sort_date.py does its thing: |
1093 |
1094 ... sort_date.sh <(uz $arg/*00???.warc.gz | '"fgrep $'\t'|sed '/GMT$/s/\([^ ]\)GMT$/\1 GMT/')"' >$arg/ks.tsv |
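For reference, the fgrep and sed steps in that command amount to the following, written as a minimal Python sketch. The function names are made up here, and the sample timestamps are invented, not taken from the crawl.

```python
import re

# A "GMT" at end of line that is glued to the preceding non-space character.
GLUED_GMT = re.compile(r"([^ ])GMT$")


def fix_gmt(line):
    """Insert the missing space before a trailing GMT, as the sed does."""
    return GLUED_GMT.sub(r"\1 GMT", line)


def clean(lines):
    """Keep only lines containing a tab (the fgrep step), then fix GMT."""
    return [fix_gmt(line) for line in lines if "\t" in line]


# Invented examples of the two forms.
print(fix_gmt("key\tMon, 01 Jan 2018 00:00:00GMT"))
print(fix_gmt("key\tMon, 01 Jan 2018 00:00:00 GMT"))
```

Lines that already have the space, or that do not end in GMT, pass through unchanged.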
1095 |
1096 |
1097 >: egrep -c '[^ ]GMT$' 50/00123.tsv |
1098 22 |
1099 >: egrep -c '[^ ]GMT$' 50/00400.tsv |
1100 14 |
1101 |
1102 >: PYTHONPATH=~/.local/lib/python3.9/site-packages:$PYTHONPATH sort_date.sh <(uz ../warc_lmhx/50/*00123.warc.gz | fgrep $'\t'|sed '/GMT$/s/\([^ ]\)GMT$/\1 GMT/') 2> /tmp/hst/123_errs | wc -l |
1103 9300 |
1104 >: fgrep -c Invalid /tmp/hst/123_errs |
1105 33 |
1106 >: PYTHONPATH=~/.local/lib/python3.9/site-packages:$PYTHONPATH sort_date.sh <(uz ../warc_lmhx/50/*00400.warc.gz | fgrep $'\t'|sed '/GMT$/s/\([^ ]\)GMT$/\1 GMT/') 2> /tmp/hst/400_errs | wc -l |
1107 9439 |
1108 >: fgrep -c Invalid /tmp/hst/400_errs |
1109 38 |
1110 |
1111 All good. |
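Spelling out why this counts as good: for each file, the lines that survive sort_date.sh plus the lines rejected as Invalid should equal the line count of the corresponding .tsv. A small sketch using only the figures recorded above (the 00123.tsv count is the 9333 shown for wc -l; the dictionary layout and function name are just for illustration):

```python
# (lines surviving sort_date.sh, "Invalid" messages, lines in the .tsv)
observed = {
    "00123": (9300, 33, 9333),
    "00400": (9439, 38, 9477),
}


def unexplained(kept, rejected, tsv_lines):
    """Lines in the .tsv not accounted for by kept + rejected."""
    return tsv_lines - (kept + rejected)


for name, (kept, rejected, tsv_lines) in observed.items():
    print(name, unexplained(kept, rejected, tsv_lines))
```

A result of zero for both files means every difference between the two sources is explained by a rejected timestamp.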
1112 |
1113 But |
1114 >: seq --format='%03g' 0 559 > /tmp/hst/warc_nums |
1115 >: xx () { |
1116 r=$(diff -bw \ |
1117 <(echo $(( |
1118 $(sort_date.sh <(uz ../warc_lmhx/50/*00$1.warc.gz | |
1119 fgrep $'\t'|sed '/GMT$/s/\([^ ]\)GMT$/\1 GMT/') 2>/tmp/hst/ec_$1 |wc -l) |
1120 + |
1121 $(fgrep -c Invalid /tmp/hst/ec_$1)))) \ |
1122 <(wc -l < 50/00$1.tsv)) |
1123 if [ "$r" ] |
1124 then printf "%s:\n%s\n" $2 "$r" |
1125 fi |
1126 } |
1127 >: parallel -j 20 -a /tmp/hst/warc_nums xx '{}' '$(({#} - 1))' | tee /tmp/hst/aug_bugs |
1128 >: fgrep -c 1c1 /tmp/hst/aug_bugs |
1129 77 |
1130 sing<4318>: wc -l < /tmp/hst/aug_bugs |
1131 385 |
1132 sing<4319>: echo $((77 * 5)) |
1133 385 |
1134 |
1135 OK, there are a few other error messages from date conversion |
1136 >: xx () { r=$(diff -bw <(echo $(($(sort_date.sh <(uz ../warc_lmhx/50/*00$1.warc.gz | fgrep $'\t'|sed '/GMT$/s/\([^ ]\)GMT$/\1 GMT/') 2>/tmp/hst/ec_$1 |wc -l) + $(egrep -c 'Invalid|must be in|out of range' /tmp/hst/ec_$1)))) <(wc -l < 50/00$1.tsv)); if [ "$r" ]; then printf "%s:\n%s\n" $2 "$r"; fi; } |
1137 sing<4337>: parallel -j 20 -a /tmp/hst/warc_nums xx '{}' '$(({#} - 1))' | tee /tmp/hst/aug_bugs2 |
1138 [nothing] |
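The effect of widening the error pattern can be illustrated in Python. This is a sketch, not the actual check: the three alternatives are the ones in the egrep above, but the sample stderr lines are hypothetical stand-ins and the function names are made up. Like egrep -c, it counts matching lines rather than individual matches.

```python
import re

NARROW = re.compile(r"Invalid")
WIDE = re.compile(r"Invalid|must be in|out of range")


def count_errors(stderr_lines, pattern=WIDE):
    """Number of stderr lines matching the pattern (like egrep -c)."""
    return sum(1 for line in stderr_lines if pattern.search(line))


def discrepancy(n_sorted, stderr_lines, n_tsv, pattern=WIDE):
    """Lines in the .tsv not explained by sorted output plus errors."""
    return n_tsv - (n_sorted + count_errors(stderr_lines, pattern))


# Hypothetical stderr content, for illustration only.
errs = [
    "Invalid date string",
    "month must be in 1..12",
    "day is out of range for month",
]
print(discrepancy(10, errs, 13, NARROW))  # narrow pattern misses two lines
print(discrepancy(10, errs, 13, WIDE))    # wide pattern accounts for all
```

With the narrow pattern a file whose stderr holds other date-conversion messages shows a spurious mismatch, which is consistent with the 77 files flagged in the first run and none in the second.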
1139 |
1140 So, I think we can believe we're OK |
1141 But 7 is better than 1: |
1142 >: xx () { r=$(diff -bw <(echo $(($(sort_date.sh <(uz ../warc_lmhx/$3/*00$1.warc.gz | fgrep $'\t'|sed '/GMT$/s/\([^ ]\)GMT$/\1 GMT/') 2>/tmp/hst/ec_$1 |wc -l) + $(egrep -c 'Invalid|must be in|out of range' /tmp/hst/ec_$1)))) <(wc -l < $3/00$1.tsv)); if [ "$r" ]; then printf "%s:\n%s\n" $2 "$r"; fi; } |
1143 >: for s in 49 {51..55}; do parallel -j 20 -a /tmp/hst/warc_nums xx '{}' '$(({#} - 1))' $s | tee /tmp/hst/aug_bugs_$s; done |
1144 [nothing] |
1145 |
1146 Next step: ? |