annotate lurid3/notes.txt @ 49:deeac8a0a682

tentative plan for merging
author Henry S. Thompson <ht@inf.ed.ac.uk>
date Fri, 04 Oct 2024 21:41:53 +0100
parents f688c437180b
children 5556c04c7597
40
4167d8f33325 start lab notes for LURID3
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
See old_notes.txt for all older notes on Common Crawl data processing,
starting from Azure via Turing and then LURID and LURID2.

Installed /beegfs/common_crawl/CC-MAIN-2024-33/cdx
>: cd results/CC-MAIN-2024-33/cdx/
>: cut -f 2 counts.tsv | btot
2,793,986,828

State of play wrt data -- see status.xlsx

41
64b7fb44e8dc extract actual date info for WARC crawls
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 40
diff changeset
[in trying to tabulate the date ranges of the crawls, I found that the
WARC timestamp is sometimes bogus:

>: fgrep ' 2009' CC-MAIN-2018-34/cdx/cluster.idx
net,tyredeyes)/robots.txt 20090201191318 cdx-00230.gz 160573468 198277 920675

>: zgrep '^net,tyredeyes)/robots.txt' CC-MAIN-2018-34/cdx/warc/cdx-00230.gz
net,tyredeyes)/robots.txt 20090201191318 {"url": "http://tyredeyes.net/robots.txt", "mime": "text/html", "mime-detected": "text/html", "status": "301", "digest": "QH732FYSV7UM34JYWVYMB7EZGR2CYM6B", "length": "582", "offset": "1224614", "filename": "crawl-data/CC-MAIN-2018-34/segments/1534221215075.58/robotstxt/CC-MAIN-20180819090604-20180819110604-00558.warc.gz"}
net,tyredeyes)/robots.txt 20090201191319 {"url": "http://www.tyredeyes.net/robots.txt", "mime": "text/plain", "mime-detected": "text/plain", "status": "200", "digest": "PSX5IZU4B4SIXGNDKXCVFH75Q27VHUTJ", "length": "549", "offset": "2069841", "filename": "crawl-data/CC-MAIN-2018-34/segments/1534221215075.58/robotstxt/CC-MAIN-20180819090604-20180819110604-00485.warc.gz"}

This happens in 2019-35 as well :-(

>: fgrep ' 20181023' CC-MAIN-2019-35/cdx/cluster.idx
com,gyshbsh)/robots.txt 20181023022000 cdx-00078.gz 356340085 162332 315406
>: zgrep ' 20181023' CC-MAIN-2019-35/cdx/warc/cdx-00078.gz
com,gyshbsh)/robots.txt 20181023022000 {"url": "http://gyshbsh.com/robots.txt", "mime": "text/html", "mime-detected": "text/html", "status": "301", "digest": "3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ", "length": "529", "offset": "638892", "filename": "crawl-data/CC-MAIN-2019-35/segments/1566027315618.73/robotstxt/CC-MAIN-20190820200701-20190820222701-00120.warc.gz"}
...
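
The bogus-timestamp check above can be sketched in Python. This is only an illustration, not the method used in these notes: the function names are hypothetical, and comparing the capture year with the crawl's nominal year is an assumed heuristic that would misfire for a crawl spanning a year boundary.

```python
from datetime import datetime

def parse_index_line(line):
    """Split an index line into (SURT key, 14-digit timestamp, remainder)."""
    key, stamp, rest = line.split(None, 2)
    return key, stamp, rest.rstrip("\n")

def is_suspect(stamp, crawl_year):
    """True when the capture year differs from the crawl's nominal year.

    Assumed heuristic: a crawl that runs across New Year would need a
    proper date window instead of a single year.
    """
    return datetime.strptime(stamp, "%Y%m%d%H%M%S").year != crawl_year

# Sample line copied from the cluster.idx output above.
line = "net,tyredeyes)/robots.txt 20090201191318 cdx-00230.gz 160573468 198277 920675"
key, stamp, _ = parse_index_line(line)
print(key, stamp, is_suspect(stamp, 2018))
```

Splitting on any whitespace keeps the sketch indifferent to whether the index separates fields with spaces or tabs.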

Tabulate all the date ranges for the WARC files we have

>: for d in {2017-30,2019-35,2020-34,2021-25,2023-40,2023-50}; do printf "%s\t" $d; (ls CC-MAIN-$d/*.{?,??}/orig/warc | fgrep .gz | cut -f 3,4 -d - | sort -u |tee /dev/fd/3 | head -1 ) 3> >( tail -1 ) | tr '\n' '\t'; echo; done | cut -f 1,2,4 -d - | sed 's/-20/ 20/;s/.$//' | tr ' ' '\t' > dates.tsv
>: for d in {2018-30,2018-34}; do printf "%s\t" $d; (ls CC-MAIN-$d/{*.?,*.??} | fgrep warc.gz | cut -f 3,4 -d - | sort -u |tee /dev/fd/3 | { sleep 10 ; head -1 ; } ) 3> >( tail -1 ) | tr '\n' '\t'; echo; done >> dates.tsv
>: for d in 2019-18; do printf "%s\t" $d; (ls CC-MAIN-$d/{*.?,*.??} | fgrep warc.gz | cut -f 3,4 -d - | sort -u | head -1); done
2019-18 20190418101243-20190418122248
>: for d in 2019-18; do printf "%s\t" $d; (ls CC-MAIN-$d/{*.?,*.??} | fgrep warc.gz | cut -f 3,4 -d - | sort -u | tail -1); done
2019-18 20190426153423-20190426175423
>: echo 2019-18 20190418101243-20190418122248 20190426153423-20190426175423 >> dates.tsv
>: pwd
/beegfs/common_crawl/CC-MAIN-2016-30/cdx/warc
>: echo {000..299} | tr ' ' '\n' | parallel -j 10 'uz cdx-00{}.gz | cut -f 2 -d " " | sort -u > /tmp/hst/{}'
>: sort -mu /tmp/hst/??? > /tmp/hst/all
>: wc -l /tmp/hst/all
679686 /tmp/hst/all
>: head -1 /tmp/hst/all
20160723090435
>: tail -1 /tmp/hst/all
20160731110639
>: cd ../../..
>: echo 2016-30 20160723090435 20160731110639 >> dates.tsv
tweaked and sorted in xemacs:
2016-30 20160723090435 20160731110639
2017-30 20170720121902 20170729132938
2018-30 20180715183800 20180723184955
2018-34 20180814062251 20180822085454
2019-18 20190418101243 20190426175423
2019-35 20190817102624 20190826111356
2020-34 20200803083123 20200815214756
2021-25 20210612103920 20210625145905
2023-40 20230921073711 20231005042006
2023-50 20231128083443 20231212000408

Added to status.xlsx in shortened form, with number of days
8
9
8
8
8
9
12
13
15
15

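
A minimal sketch of turning a first/last timestamp pair into a day count. The function name is hypothetical, and counting whole elapsed days is an assumed convention: the notes do not say how the spreadsheet figures were rounded, so results may differ from the column above for some crawls.

```python
from datetime import datetime

FMT = "%Y%m%d%H%M%S"  # 14-digit timestamp layout used in dates.tsv

def span_days(first, last):
    """Whole elapsed days between two 14-digit timestamps (assumed convention)."""
    return (datetime.strptime(last, FMT) - datetime.strptime(first, FMT)).days

# Two rows copied from the table above.
rows = [
    ("2016-30", "20160723090435", "20160731110639"),
    ("2020-34", "20200803083123", "20200815214756"),
]
for crawl, first, last in rows:
    print(crawl, span_days(first, last))
```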
42
0c472ae05f71 nearly finished downloading for now
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 41
diff changeset
Fill a gap by downloading 2022-33

>: for s in 0; do ~/bin/getcc_multi.aws CC-MAIN-2022-33 $s 5; done > /tmp/hst/get_22-33_0.log &
130 minutes...
>: for s in 1; do ~/bin/getcc_multi.aws CC-MAIN-2022-33 $s 10; done > /tmp/hst/get_22-33_1.log &
59 minutes

Another day to get to a quarter?
>: for s in {2..23}; do ~/bin/getcc_multi.aws CC-MAIN-2022-33 $s 10; done > /tmp/hst/get_22-33_2-23.log &


And finally 2015-35
Fetched in just 2 chunks, 0-9 and 10-99, e.g.
>: for s in {10..99}; do ~/bin/getcc_multi.aws CC-MAIN-2015-35 $s 10; done > /tmp/hst/get_15-35_10-99.log &

Much smaller.
Compare 2023-40, with 900 files per segment:
>: lss */orig/warc/*-0023?.* | cut -f 5 -d ' ' | stats
n = 1000
min = 1.14775e+09
max = 1.26702e+09
sum = 1.20192e+12
mean = 1.20192e+09
sd = 2.26049e+07

with 2015-35, with 353 files per segment
>: lss */orig/warc/*-0023?-* | cut -f 5 -d ' ' | stats
43
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
n = 1000
min = 1.66471e+08
max = 9.6322e+08
sum = 9.19222e+11
mean = 9.19222e+08
sd = 8.20542e+07
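
For readers without the local stats helper, the same six figures can be sketched in Python. This is not that tool: the function name is hypothetical, and using the sample standard deviation (n - 1 denominator) is an assumption, since the notes do not say which form stats reports.

```python
import statistics

def summarize(values):
    """Return n, min, max, sum, mean and sample sd for a list of numbers."""
    values = list(values)
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "sum": sum(values),
        "mean": statistics.mean(values),
        # Assumption: sample standard deviation, not population.
        "sd": statistics.stdev(values),
    }

# Toy input; real use would feed the file sizes from the listing.
print(summarize([2, 4, 4, 4, 5, 5, 7, 9]))
```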

The min files all come from segment 1440644060633.7, whose files are
_all_ small:
>: uz *00123-*.gz | wc -l
12,759,931
Compare to 1440644060103.8
>: zcat *00123-*.gz | wc -l
75,806,738
Mystery

Also faster
Compare 2022-33:
>: fgrep -h BST /tmp/hst/get_22-33_{2-23,24-49,50-99}.log | cut -f 1-7 -d ' ' | while read s; do if read e; then echo $((($(date --date="$e" +%s) - $(date --date="$s" +%s)) / 60)); fi; done | stats n min max mean sd
98 19 256 75.1 25.2
with 2015-35:
>: fgrep -h BST /tmp/hst/get_15-35_{0-9,10-99}.log | cut -f 1-7 -d ' ' | while read s; do if read e; then echo $((($(date --date="$e" +%s) - $(date --date="$s" +%s)) / 60)); fi; done | stats n min max mean sd
100 15 40 32.6 2.9
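
The pairing logic in the loops above (read a start line, read an end line, print the difference in whole minutes) can be sketched as follows. The function name and the timestamp format are assumptions: the real logs carry date output containing BST, which this sketch does not attempt to parse.

```python
from datetime import datetime

def paired_minutes(stamps, fmt="%Y-%m-%d %H:%M:%S"):
    """Pair consecutive (start, end) timestamps; return whole minutes elapsed.

    A trailing start with no matching end is dropped, as in the shell loop.
    """
    times = [datetime.strptime(s, fmt) for s in stamps]
    starts, ends = times[0::2], times[1::2]
    return [int((e - s).total_seconds() // 60) for s, e in zip(starts, ends)]

# Hypothetical log timestamps in the assumed format.
log = ["2024-10-01 10:00:00", "2024-10-01 11:15:30", "2024-10-01 11:16:00"]
print(paired_minutes(log))
```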

>: echo {000..299} | tr ' ' '\n' | parallel -j 10 'uz cdx-00{}.gz | cut -f 2 -d " " | sort -u > /tmp/hst/2015_{}' &
>: sort --parallel=10 -mu /tmp/hst/2015_??? > /tmp/hst/2015_all
>: head -1 /tmp/hst/2015_all
20150827191534
>: tail -1 /tmp/hst/2015_all
20150905180914
>: wc -l /tmp/hst/2015_all
698128 /tmp/hst/2015_all
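
If only the first and last timestamps are wanted, the sort/merge step can be skipped. A minimal sketch, assuming each index line is "key timestamp json" separated by single spaces; the function name is hypothetical. It relies on fixed-width 14-digit timestamps sorting the same way as strings and as dates.

```python
def timestamp_range(lines):
    """Return (earliest, latest) 14-digit timestamp seen in CDX-style lines."""
    lo = hi = None
    for line in lines:
        parts = line.split(" ", 2)
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        stamp = parts[1]
        if lo is None or stamp < lo:
            lo = stamp
        if hi is None or stamp > hi:
            hi = stamp
    return lo, hi

# Toy input reusing timestamps that appear earlier in these notes.
sample = [
    "net,tyredeyes)/robots.txt 20090201191319 {}",
    "net,tyredeyes)/robots.txt 20090201191318 {}",
    "",
]
print(timestamp_range(sample))
```

In real use the lines would come from something like gzip.open(path, "rt") over each cdx file, keeping a running minimum and maximum across files.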

What about wet files -- do they include text from pdfs? What about
truncated pdfs?

>: time for s in 0; do ~/bin/getcc_wet_multi.aws CC-MAIN-2019-35 $s 10; done > /tmp/hst/get_wet_19-35_0.log &
real 26m3.049s
user 0m1.225s
sys 0m1.310s

In the segment 0 cdx file (!) we find 3747 probable truncations:
>: zgrep -a '"mime-detected": "application/pdf", ' cdx.gz > /tmp/hst/2019-35_seg0_pdf.idx
>: wc -l /tmp/hst/2019-35_seg0_pdf.idx
42345 /tmp/hst/2019-35_seg0_pdf.idx
>: egrep -a '"length": "10....."' /tmp/hst/2019-35_seg0_pdf.idx > /tmp/hst/2019-35_seg0_long_pdf.idx &
>: wc -l < /tmp/hst/2019-35_seg0_long_pdf.idx
3747
Of the 42345 pdfs, 70 are in file 0:
>: egrep -a '.-00000\.' /tmp/hst/2019-35_seg0_pdf.idx > /tmp/hst/2019-35_seg0_file0_pdf.idx
>: wc -l /tmp/hst/2019-35_seg0_file0_pdf.idx
70 /tmp/hst/2019-35_seg0_file0_pdf.idx
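
The two grep filters above can be sketched over parsed index lines. The function names are hypothetical. The length test mirrors the egrep pattern (a 7-character value starting with 10), here narrowed to digits; it is the heuristic from these notes, not a documented truncation flag.

```python
import json
import re

LONG = re.compile(r"10\d{5}")  # 7-digit length beginning with "10"

def parse_cdx_line(line):
    """Split 'key timestamp {json}' into (key, timestamp, dict of fields)."""
    key, stamp, payload = line.rstrip("\n").split(" ", 2)
    return key, stamp, json.loads(payload)

def is_pdf(fields):
    return fields.get("mime-detected") == "application/pdf"

def looks_truncated(fields):
    return LONG.fullmatch(fields.get("length", "")) is not None

# Sample line copied from the zgrep output earlier in these notes.
line = ('net,tyredeyes)/robots.txt 20090201191318 {"url": "http://tyredeyes.net/robots.txt", '
        '"mime": "text/html", "mime-detected": "text/html", "status": "301", '
        '"digest": "QH732FYSV7UM34JYWVYMB7EZGR2CYM6B", "length": "582", "offset": "1224614", '
        '"filename": "crawl-data/CC-MAIN-2018-34/segments/1534221215075.58/robotstxt/'
        'CC-MAIN-20180819090604-20180819110604-00558.warc.gz"}')
key, stamp, fields = parse_cdx_line(line)
print(key, stamp, is_pdf(fields), looks_truncated(fields))
```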

In segment 0 file 0 we find 70 application/pdf Content-Type headers:
>: ix.py -h -w -x </tmp/hst/2019-35_seg0_file0_pdf.idx |egrep '^(WARC-Target-URI:|Content-Length:) '|cut -f 2 -d ' ' |tr -d '\r'|while read l1; do read uri; read l2; printf '%s\t%s\t%s\n' $l1 $l2 "$uri"; done > ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv
>: wc -l < ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv
70
>: head -3 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv


Of which 14 are truncated:
>: fgrep -c 1048576 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv
14

E.g.
>: fgrep 1048576 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv | head -3
1049051 1048576 https://en.heks.ch/sites/default/files/documents/2017-09/HEKS_EPER_Mission_Statement_2016_e.pdf
1049469 1048576 https://bmcmicrobiol.biomedcentral.com/track/pdf/10.1186/s12866-017-0951-4
1048824 1048576 https://citydocs.fcgov.com/?action=cao-cases&cmd=convert&docid=3332339
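
A slightly stricter version of the fgrep test, sketched in Python: it compares the second column exactly against 1048576 (2**20 bytes) rather than matching the digits anywhere on the line. The function name is hypothetical, and treating that value as the truncation signature follows these notes rather than any specification.

```python
LIMIT = 2 ** 20  # 1048576 bytes

def truncated_rows(tsv_text):
    """Return (first_length, second_length, uri) rows whose second length == LIMIT."""
    rows = []
    for line in tsv_text.splitlines():
        if not line:
            continue
        l1, l2, uri = line.split("\t", 2)
        if int(l2) == LIMIT:
            rows.append((int(l1), int(l2), uri))
    return rows

# Two rows with values that appear in these notes, tab-separated as in the .tsv file.
sample = (
    "1049051\t1048576\thttps://en.heks.ch/sites/default/files/documents/2017-09/HEKS_EPER_Mission_Statement_2016_e.pdf\n"
    "11588\t10913\thttp://bdds.deux-sevres.com/recherche/simple/Editeur/2/Belfond/vignette?format=pdf\n"
)
print(truncated_rows(sample))
```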

Are any of the pdfs in the corresponding wet file?

Yes, 2:
>: cut -f 3 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv | fgrep -af - <(uz 1566027313501.0/orig/wet/*-00000.warc.wet.gz)
WARC-Target-URI: http://bdds.deux-sevres.com/recherche/simple/Editeur/2/Belfond/vignette?format=pdf
WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D00

Is it in fact corresponding?
>: diff -bw <(uz 1566027313501.0/orig/warc/*-00000.warc.gz | egrep -a '^WARC-Target-URI: ' | uniq | head -1000) <(uz 1566027313501.0/orig/wet/*-00000.warc.wet.gz | egrep -a '^WARC-Target-URI: ' | head -1000)|egrep -c '^<'
19

So, yes, mostly. About 2% (19 of the first 1000) are missing
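
A sketch of the same correspondence check done by set membership instead of a positional diff. The function name is hypothetical, and the two approaches can disagree: diff compares the first 1000 lines of each listing in order, while this counts WARC URIs that never appear in the WET listing.

```python
def missing_fraction(warc_uris, wet_uris):
    """Return (count, fraction) of WARC URIs absent from the WET URI list."""
    wet = set(wet_uris)
    missing = [u for u in warc_uris if u not in wet]
    return len(missing), len(missing) / len(warc_uris)

# Toy data; real input would be the WARC-Target-URI values from each file.
print(missing_fraction(["a", "b", "c", "d"], ["a", "c", "d"]))
# For scale: 19 differing lines out of 1000 is a fraction of 0.019.
print(19 / 1000)
```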

Just checking the search:
>: cut -f 3 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv | fgrep -af - <(uz 1566027313501.0/orig/warc/*-00000.warc.gz) | wc -l
210
Correct: 210 = 70 x 3, presumably one request, one response and one metadata record per URI

So, what pdfs make it into the WET:
>: cut -f 3 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv | fgrep -af - <(uz 1566027313501.0/orig/wet/*-00000.warc.wet.gz) > ~/results/CC-MAIN-2019-35/s0_file0_pdf.txt
>: wc -l < ~/results/CC-MAIN-2019-35/s0_file0_pdf.txt
2
>: cut -f 2 -d ' ' ~/results/CC-MAIN-2019-35/s0_file0_pdf.txt | tr -d '\r' | fgrep -f - ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv
11588 10913 http://bdds.deux-sevres.com/recherche/simple/Editeur/2/Belfond/vignette?format=pdf
1048979 1048576 https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005

Here's the short one:
WARC/1.0
WARC-Type: response
WARC-Date: 2019-08-17T22:40:17Z
WARC-Record-ID: <urn:uuid:ea98167b-c42a-4233-b57e-994aa627e38a>
Content-Length: 11588
Content-Type: application/http; msgtype=response
WARC-Warcinfo-ID: <urn:uuid:f689f8d0-24f3-4824-9a38-4f3fee422a4e>
WARC-Concurrent-To: <urn:uuid:2d51c956-0012-4d78-affc-8f57fe9d2e15>
WARC-IP-Address: 92.175.114.24
WARC-Target-URI: http://bdds.deux-sevres.com/recherche/simple/Editeur/2/Belfond/vignette?format=pdf
WARC-Payload-Digest: sha1:7VVIUDQ4Q6XKNOAURYU4VTMRSZNPHDQA
WARC-Block-Digest: sha1:OSTWXLV772XNHS22T4UBSCSJAAXM2J6T
WARC-Identified-Payload-Type: application/pdf

HTTP/1.1 200 OK
Cache-Control: must-revalidate, post-check=0, pre-check=0,no-cache
Pragma: public,no-cache
Content-Type: application/pdf",text/html; charset=utf-8
X-Crawler-Content-Encoding: gzip
Expires: 0
Server:
X-Powered-By:
Set-Cookie: 166d74d734106ba68b20ea303011f622=301619e3fe31ecb98c8473f0ff5f35a2; path=/
Content-Disposition: attachment; filename="Mdiathque dpartementale des Deux-Svres - Rsultats de la recherche Belfond.pdf"
parents: 42
diff changeset
222 Content-Transfer-Encoding: binary
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
223 P3P: CP="NOI ADM DEV PSAi COM NAV OUR OTRo STP IND DEM"
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
224 X-Content-Encoded-By:
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
225 X-Powered-By:
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
226 Date: Sat, 17 Aug 2019 22:40:16 GMT
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
227 X-Crawler-Content-Length: 5448
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
228 Content-Length: 10913
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
229
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
230 %PDF-1.7
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
231 %<E2><E3><CF><D3>
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
232 7 0 obj
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
233 << /Type /Page /Parent 1 0 R /LastModified (D:20190818004016+02'00') /Resources 2
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
234 0 R /MediaBox [0.000000 0.000000 595.276000 841.890000] /CropBox [0.000000 0.000
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
235 000 595.276000 841.890000] /BleedBox [0.000000 0.000000 595.276000 841.890000] /T
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
236 rimBox [0.000000 0.000000 595.276000 841.890000] /ArtBox [0.000000 0.000000 595.2
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
237 76000 841.890000] /Contents 8 0 R /Rotate 0 /Group << /Type /Group /S /Transparen
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
238 cy /CS /DeviceRGB >> /PZ 1 >>
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
239 endobj
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
240 8 0 obj
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
241
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
242 >: uz 1566027313501.0/orig/warc/*-00000.warc.gz|tail -n +1823434 | tail -n +24 | head -c 20000 > ~/results/CC-MAIN-2019-35/mediatheque.pdf
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
243 >: ps2ascii mediatheque.pdf
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
244 Médiathèque départementale des Deux-Sèvres - Résultats de la recherche Belfond
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
245
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
246 Médiathèque départementale des Deux-Sèvres - Résultats de
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
247 la recherche Belfond
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
248 A charge de revanche
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
249 Titre :
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
250 Auteur : Grippando, James (1958-....)
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
251 ...
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
252 etc., three pages, no errors
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
253
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
254 >: uz 1566027313501.0/orig/warc/*-00000.warc.gz|fgrep -an https://museum.wrap.gov.tw/GetFile4.ashx
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
255 38896837:WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
256 38896858:WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
257 38904590:WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
258 >: uz 1566027313501.0/orig/warc/*-00000.warc.gz|tail -n +38896858 | egrep -an '^%%EOF'
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
259 27:%%EOF
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
260 1114658:%%EOF
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
261 1313299:%%EOF
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
262
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
263 Hunh?
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
264
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
265 >: uz 1566027313501.0/orig/warc/*-00000.warc.gz|tail -n +38896858 | egrep -an '^(%%EOF|WARC)' | head -30
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
266 1:WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
267 2:WARC-Payload-Digest: sha1:SZ53DQQHENC7DDN7GQ5IS7VMEPAXAMBE
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
268 3:WARC-Block-Digest: sha1:QTKJA6A7445Z7264K2YAFBUUM2OYH2T2
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
269 4:WARC-Truncated: length
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
270 5:WARC-Identified-Payload-Type: application/pdf
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
271 27:%%EOF
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
272 7725:WARC/1.0
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
273 7726:WARC-Type: metadata
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
274 7727:WARC-Date: 2019-08-17T22:59:14Z
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
275 7728:WARC-Record-ID: <urn:uuid:77df2747-e567-45d3-8646-3069ae9a9f25>
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
276 7731:WARC-Warcinfo-ID: <urn:uuid:f689f8d0-24f3-4824-9a38-4f3fee422a4e>
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
277 7732:WARC-Concurrent-To: <urn:uuid:eceb4adc-d81e-4497-82fe-eea61ce171f4>
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
278 7733:WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
279 7739:WARC/1.0
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
280
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
281 OK, so indeed truncated after 7700 lines or so...
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
282 >: uz 1566027313501.0/orig/warc/*-00000.warc.gz|tail -n +38896858 | tail -n +21 | head -c 1048576 > ~/results/CC-MAIN-2019-35/museum.pdf
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
283 >: ps2ascii ~/results/CC-MAIN-2019-35/museum.pdf
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
284 **** Error: An error occurred while reading an XREF table.
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
285 **** The file has been damaged.
6ae6a21ccfb9 more downloads,
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 42
diff changeset
286 Look in big_pdf?
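
[The truncation found above by grepping could also be detected from the record headers themselves. A minimal Python sketch, assuming the header block of one record has already been pulled out as text; the function names are hypothetical and nothing here is part of the existing tooling:

```python
def parse_warc_headers(block):
    """Parse the 'Name: value' lines of one WARC header block into a dict."""
    headers = {}
    for line in block.splitlines():
        name, sep, value = line.partition(":")
        if sep:  # skips the 'WARC/1.0' version line and blank lines
            headers[name.strip()] = value.strip()
    return headers


def is_truncated(headers):
    """True if the crawler marked the record payload as cut short."""
    return "WARC-Truncated" in headers
```

For the museum record this would report truncation (WARC-Truncated: length), which fits the HTTP Content-Length of 1048576 being exactly 2**20.]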
44
7209df5fa5b4 turn attention to nutch-cc and its Cdx code
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 43
diff changeset
287
7209df5fa5b4 turn attention to nutch-cc and its Cdx code
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 43
diff changeset
288 ====Modify the original CC indexer to write new indices including lastmod=====
7209df5fa5b4 turn attention to nutch-cc and its Cdx code
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 43
diff changeset
289 Looks like WarcRecordWriter.write, in
7209df5fa5b4 turn attention to nutch-cc and its Cdx code
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 43
diff changeset
290 src/nutch-cc/src/java/org/commoncrawl/util/WarcRecordWriter, is what
7209df5fa5b4 turn attention to nutch-cc and its Cdx code
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 43
diff changeset
291 needs to be edited to include LastModified date
7209df5fa5b4 turn attention to nutch-cc and its Cdx code
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 43
diff changeset
292
7209df5fa5b4 turn attention to nutch-cc and its Cdx code
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 43
diff changeset
293 To rebuild nutch-cc, particularly to recompile jar files after editing
7209df5fa5b4 turn attention to nutch-cc and its Cdx code
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 43
diff changeset
294 anything
7209df5fa5b4 turn attention to nutch-cc and its Cdx code
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 43
diff changeset
295
7209df5fa5b4 turn attention to nutch-cc and its Cdx code
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 43
diff changeset
296 >: cd $HHOME/src/nutch-cc
7209df5fa5b4 turn attention to nutch-cc and its Cdx code
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 43
diff changeset
297 >: ant
7209df5fa5b4 turn attention to nutch-cc and its Cdx code
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 43
diff changeset
298
7209df5fa5b4 turn attention to nutch-cc and its Cdx code
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 43
diff changeset
299 Fixed deprecation bug in WarcCdxWriter.java
7209df5fa5b4 turn attention to nutch-cc and its Cdx code
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 43
diff changeset
300
7209df5fa5b4 turn attention to nutch-cc and its Cdx code
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 43
diff changeset
301 Modified src/java/org/commoncrawl/util/WarcCdxWriter.java
7209df5fa5b4 turn attention to nutch-cc and its Cdx code
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 43
diff changeset
302 to include lastmod
7209df5fa5b4 turn attention to nutch-cc and its Cdx code
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 43
diff changeset
303
7209df5fa5b4 turn attention to nutch-cc and its Cdx code
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 43
diff changeset
304 Can run just one test, which should allow testing this:
7209df5fa5b4 turn attention to nutch-cc and its Cdx code
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 43
diff changeset
305
7209df5fa5b4 turn attention to nutch-cc and its Cdx code
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 43
diff changeset
306 >: ant test-core -Dtestcase='TestWarcRecordWriter'
45
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
307
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
308 Logic is tricky, and there's no easy way in
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
309
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
310 Basically, tools/WarcExport.java launches a hadoop job based on a
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
311 hadoop-runnable WarcExport instance. Hadoop will in due course call
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
312 ExportReducer.reduce, which will create an instance of WarcCapture
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
313 "for each page capture", and call ExportMapper.context.write with that instance (via
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
314 some configuration magic with the hadoop job Context). That in turn
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
315 uses (more magic) WarcOutputFormat.getRecordWriter, which
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
316 (finally!) calls a previously created WarcRecordWriter
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
317 instance.write(the capture).
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
318
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
319 So to fake a test case, I need to build
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
320 1) a WarcRecordWriter instance
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
321 2) a WarcCapture instance
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
322 and then invoke 1.write(2)
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
323
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
324 Got that working, although still can't figure out where in the normal
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
325 flow the metadata entry for Response.CONTENT_TYPE gets set.
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
326
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
327 Now, add a test that takes a stream of WARC Response extracts and
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
328 rewrites their index entries
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
329
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
330 >: head -8804 <(uz /beegfs/common_crawl/CC-MAIN-2019-35/cdx/warc/cdx-00150.gz)|tail -10| ix.py -h -w -x > /tmp/hst/headers.txt
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
331 >: cp /tmp/hst/headers.txt src/test/org/commoncrawl/util/
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
332 >: shuf /tmp/hst/headers.txt > src/test/org/commoncrawl/util/headers_mixed.txt
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
333
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
334 Won't quite work :-(
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
335 How do we reconstruct the WARC filename, offset and length from the
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
336 original index?
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
337
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
338 Well, we can find the individual .warc.gz records!
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
339 Thanks to https://stackoverflow.com/a/37042747/2595465
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
340
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
341 >: ~/lib/python/unpackz.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz > /tmp/hst/recs.txt
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
342
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
343 Nearly working, got 1/3rd of the way through a single WARC and then failed:
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
344
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
345 >: n=0 && ~/lib/python/unpackz.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz 2>/tmp/hst/tt.txt|while read o l; do echo $((n+=1)); echo $o $l >> /tmp/hst/r3a; ix.py $l $o CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz| wc -l; done
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
346 ...
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
347 20
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
348 10215
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
349 CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
350 Process fail: Compressed file ended before the end-of-stream marker was reached, input:
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
351 length=10762, offset=60784640, file=/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
352
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
353 >: head -10217 /tmp/hst/r3a | tail -4
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
354 60784173 467
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
355 60784640 10762
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
356 60795402 463
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
357 60795865 460
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
358 >: ix.py 467 60784173 CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz|fgrep Target
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
359 WARC-Target-URI: http://drycarerestoration.co/corner-furniture-piece/unique-corner-decoration-pieces-or-corner-furniture-pieces-corner-corner-furniture-piece-corner-furniture-pieces-bedroom/
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
360
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
361 >: zcat /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/cdx/warc/cdx.gz
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
362 ...
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
363 co,drycarerestoration)/corner-furniture-piece/unique-corner-decoration-pieces-or-corner-furniture-pieces-corner-corner-furniture-piece-corner-furniture-pieces-bedroom 20190819020224 {"url": "http://drycarerestoration.co/corner-furniture-piece/unique-corner-decoration-pieces-or-corner-furniture-pieces-corner-corner-furniture-piece-corner-furniture-pieces-bedroom/", "mime": "text/html", "mime-detected": "text/html", "status": "200", "digest": "DTKGJL45XQDXUS7PTXPYR6POMPLG46RZ", "length": "2570", "offset": "60784640", "filename": "crawl-data/CC-MAIN-2019-35/segments/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz", "charset": "UTF-8", "languages": "eng"}
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
364 >: ix.py 2570 60784640 CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz|less
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
365 >: echo $((10762 - 2570))
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
366 8192
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
367
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
368 Ah, the error I was dreading :-( I _think_ this happens when an
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
369 individual record ends exactly on an 8K boundary.
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
370
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
371 Yes:
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
372
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
373 >: echo $((60784640 % 8192))
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
374 0
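
[The two checks above (length overshoot and offset alignment) can be put in one place. A small Python sketch, assuming an 8192-byte read buffer as the arithmetic above suggests; the helper names are hypothetical:

```python
BUFSIZE = 8192  # assumed read-buffer size, matching the 8K figure above


def on_buffer_boundary(offset, bufsize=BUFSIZE):
    """True if a record starts exactly where a read buffer starts."""
    return offset % bufsize == 0


def overshoot(reported_length, indexed_length):
    """How many bytes the reported record length exceeds the index length."""
    return reported_length - indexed_length
```

With the numbers above, the failing record at offset 60784640 sits on a boundary and its reported length is one full buffer too long.]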
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 44
diff changeset
375
46
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
376 Even with buffer 1MB:
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
377 21
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
378 160245
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
379 CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
380 Process fail: Compressed file ended before the end-of-stream marker was reached, input:
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
381 length=8415, offset=1059033915, file=/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
382 0
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
383 160246
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
384
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
385 >: tail -60 /tmp/hst/r3b|head -20
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
386 1059013061 423
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
387 1059013484 7218
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
388 1059020702 425
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
389 1059021127 424
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
390 1059021551 11471
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
391 1059033022 426
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
392 1059033448 467
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
393 1059033915 8415
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
394
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
395 Argh. This is at the _same_ point (fails 51 before EOF). Ah,
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
396 maybe that's the point -- this is the last read before EOF, and it's
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
397 not a full buffer!
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
398
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
399 >: ix.py 467 1059033448 CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz|less
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
400 ...
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
401 WARC-Target-URI: https://zowiecarrpsychicmedium.com/tag/oracle/
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
402
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
403 Reran with more instrumentation, took at least all day:
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
404
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
405 >: n=0 && ~/lib/python/unpackz.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz 2> /tmp/hst/r3e_err.txt | while read o l; do
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
406 echo $((n+=1)); echo $o $l >> /tmp/hst/r3e_val; ix.py $l $o CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz | wc -l;
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
407 done > /tmp/hst/r3e_log 2>&1
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
408 >: wc -l /tmp/hst/r3e_err.txt
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
409 160296 /tmp/hst/r3e_err.txt
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
410 >: tail -60 /tmp/hst/r3e_err.txt|cat -n | grep -C2 True\ True
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
411 7 b 28738 28738 28312 426 False False
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
412 8 b 28312 28312 27845 467 False False
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
413 9 b 27845 378162 369747 8415 True True < this is the first to hit the last
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
414 (partial) block
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
415 10 b 369747 369747 369312 435 False True
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
416 11 b 369312 369312 368878 434 False True
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
417
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
418 >: tail -55 /tmp/hst/r3e_val | head -3
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
419 1059033022 426
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
420 1059033448 467
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
421 1059033915 8415
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
422 >: dd ibs=1 skip=1059033022 count=426 if=/beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz of=/dev/stdout | uz -t
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
423 ...
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
424 426 bytes copied, 0.00468243 s, 91.0 kB/s
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
425 sing<3411>: dd ibs=1 skip=1059033448 count=467 if=/beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz of=/dev/stdout | uz -t
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
426 ...
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
427 467 bytes copied, 0.00382692 s, 122 kB/s
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
428 sing<3412>: dd ibs=1 skip=1059033915 count=8415 if=/beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz of=/dev/stdout | uz -t
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
429 igzip: Error (null) does not contain a complete gzip file
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
430 ...
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
431 8415 bytes (8.4 kB, 8.2 KiB) copied, 0.00968889 s, 869 kB/s
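
[The dd | uz -t check above can be reproduced without leaving Python. A minimal sketch using only the standard zlib module; the function names are hypothetical, and this is not how ix.py or uz do it:

```python
import zlib


def read_slice(f, offset, length):
    """Read `length` bytes starting at `offset` from a seekable binary file."""
    f.seek(offset)
    return f.read(length)


def is_complete_member(data):
    """True if `data` is exactly one complete gzip member, no more, no less."""
    decomp = zlib.decompressobj(31)  # 31 = expect a gzip header and trailer
    try:
        decomp.decompress(data)
    except zlib.error:
        return False
    # eof is set only once the trailer has been consumed; leftover bytes
    # mean the slice runs on into the next member.
    return decomp.eof and not decomp.unused_data
```

A slice with the wrong length fails either way: too short and eof is never reached, too long and unused_data is non-empty.]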
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
432
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
433 So, tried one change to use the actual size rather than BUFSIZE at
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
434 one point, seems to work now:
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
435
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
436 >: time ~/lib/python/unpackz.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz 2> /tmp/hst/r3f_err.txt | tee /tmp/hst/r3f_val | while read l o; do printf '%s\t%s\t%s\n' $l $o 'CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz';
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
437 done 2>&1 | tee /tmp/hst/r3f_log | ix.py -w | egrep -c '^WARC/1\.0'
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
438 160296
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
439 real 3m48.393s
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
440 user 0m47.997s
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
441 sys 0m26.641s
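
[For the record, the general technique (walk a multi-member gzip file and report where each member starts and how long it is) can be sketched as below. This is an illustration under assumptions, not the actual unpackz.py: the function name, the (offset, length) output order and the default buffer size are all hypothetical. The point it shows is the fix described above: count the bytes actually read, not BUFSIZE, since the last read before EOF is usually short.

```python
import zlib

BUFSIZE = 8192  # assumed default; any positive size should give the same result


def gzip_members(stream, bufsize=BUFSIZE):
    """Yield (offset, length) for each gzip member in a binary stream."""
    offset = 0  # byte offset at which the current member starts
    fed = 0     # bytes of the current member already handed to zlib
    decomp = zlib.decompressobj(31)  # 31 = expect a gzip header and trailer
    buf = stream.read(bufsize)
    while buf:
        decomp.decompress(buf)  # decompressed output is discarded
        if decomp.eof:
            # Only part of buf belonged to this member; the rest is the
            # beginning of the next one.
            rest = decomp.unused_data
            length = fed + len(buf) - len(rest)
            yield offset, length
            offset += length
            fed = 0
            decomp = zlib.decompressobj(31)
            buf = rest if rest else stream.read(bufsize)
        else:
            fed += len(buf)  # actual size read, which may be < bufsize
            buf = stream.read(bufsize)
    if fed:
        raise ValueError("input ended in the middle of a gzip member")
```

Each (offset, length) pair can then be checked independently, e.g. by decompressing just that slice of the file.]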
49672e9b4c1c unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 45
diff changeset
442
>: tail /tmp/hst/r3f_val
10851 1059370472
475 1059381323
444 1059381798
22437 1059382242
447 1059404679
506 1059405126
15183 1059405632
471 1059420815
457 1059421286
17754 1059421743

>: wc -l /tmp/hst/*_val
171 /tmp/hst/r3d_val
160297 /tmp/hst/r3e_val
160296 /tmp/hst/r3f_val
320764 total
>: uz /tmp/hst/head.warc.gz |egrep -c '^WARC/1\.0.$'
171
>: tail -n 3 /tmp/hst/*_val
==> /tmp/hst/r3d_val <==
454 1351795
414 1352249
0 1352663 [so the 171 above is bogus, and we're missing one]

==> /tmp/hst/r3e_val <==
1059393441 457
1059393898 17754
0 [likewise bogus, so see below]

==> /tmp/hst/r3f_val <==
471 1059420815
457 1059421286
17754 1059421743 [better, but still one missing]
>: uz /tmp/hst/head.warc.gz |egrep '^WARC-Type: ' | tee >(wc -l 1>&2) | tail -4
WARC-Type: response
WARC-Type: metadata
WARC-Type: request
WARC-Type: response [missing]
171
>: ls -lt /tmp/hst/*_val
-rw-r--r-- 1 hst dc007 1977 Sep 29 09:27 /tmp/hst/r3d_val
-rw-r--r-- 1 hst dc007 2319237 Sep 28 14:28 /tmp/hst/r3f_val
-rw-r--r-- 1 hst dc007 2319238 Sep 27 19:41 /tmp/hst/r3e_val
>: ls -l ~/lib/python/unpackz.py
-rwxr-xr-x 1 hst dc007 1821 Sep 28 15:13 .../dc007/hst/lib/python/unpackz.py
So e and f are stale (both older than the current unpackz.py), rerun
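The staleness test used here is just an mtime comparison between an output file and the script that produced it. A minimal sketch, with is_stale as an illustrative name and temp files standing in for r3f_val and unpackz.py:

```python
import os
import tempfile


def is_stale(output_path, script_path):
    """True if the output was last written before the script was last modified."""
    return os.path.getmtime(output_path) < os.path.getmtime(script_path)


with tempfile.TemporaryDirectory() as tmp:
    out = os.path.join(tmp, "r3f_val")
    script = os.path.join(tmp, "unpackz.py")
    for path in (out, script):
        open(path, "w").close()
    os.utime(out, (1000, 1000))     # output written first ...
    os.utime(script, (2000, 2000))  # ... script edited afterwards
    stale_before_rerun = is_stale(out, script)
    os.utime(out, (3000, 3000))     # output regenerated by a rerun
    stale_after_rerun = is_stale(out, script)
```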
>: time ~/lib/python/unpackz.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz 2>/tmp/hst/r3f_err.txt| tee /tmp/hst/r3f_val|while read l o; do printf '%s\t%s\t%s\n' $l $o 'CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz' ;done |& tee /tmp/hst/r3f_log |ix.py -w |egrep '^WARC-Type: ' | tail -4 &
>: Reading length, offset, filename tab-delimited triples from stdin...
WARC-Type: response
WARC-Type: metadata
WARC-Type: request
WARC-Type: response

real 3m49.760s
user 0m47.180s
sys 0m32.218s
So missing the final metadata...
Back to head.warc.gz, with debug info

>: n=0 && ~/lib/python/unpackz.py /tmp/hst/head.warc.gz 2>/tmp/hst/ttd.txt|while read l o; do echo $((n+=1)); echo $l $o >> /tmp/hst/r3d_val; dd ibs=1 skip=$o count=$l if=/tmp/hst/head.warc.gz of=/dev/stdout 2>/tmp/hst/r3d_ido| uz -t ; done >/tmp/hst/r3d_log 2>&1
>: tail -2 /tmp/hst/r3d_log
171
igzip: Error invalid gzip header found for file (null)
>: tail -n 3 /tmp/hst/ttd.txt /tmp/hst/r3d_val
==> /tmp/hst/ttd.txt <==
b 9697 9697 9243 454 False True
b 9243 9243 8829 414 False True
n 8829

==> /tmp/hst/r3d_val <==
454 1351795
414 1352249
0 1352663

>: cat -n /tmp/hst/r3f_val | head -172 | tail -4
169 454 1351795
170 414 1352249
171 8829 1352663
172 446 1361492

Fixed, maybe

>: tail -n 3 /tmp/hst/r3d_log /tmp/hst/r3d_val
==> /tmp/hst/r3d_log <==
169
170
171

==> /tmp/hst/r3d_val <==
454 1351795
414 1352249
8829 1352663

Yes!

>: time ~/lib/python/unpackz.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz 2>/tmp/hst/r3f_err| tee /tmp/hst/r3f_val|while read l o; do printf '%s\t%s\t%s\n' $l $o 'CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz' ;done |& tee /tmp/hst/r3f_log |ix.py -w |egrep '^WARC-Type: ' | tail -4
Reading length, offset, filename tab-delimited triples from stdin...
WARC-Type: metadata
WARC-Type: request
WARC-Type: response
WARC-Type: metadata

real 3m26.042s
user 0m44.167s
sys 0m24.716s
>: tail -n 3 /tmp/hst/r3f*
==> /tmp/hst/r3f_err <==

==> /tmp/hst/r3f_val <==
457 1059421286
17754 1059421743
425 1059439497

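A check worth automating on the *_val files: every offset should equal the previous offset plus the previous length, and no length should be 0. A minimal sketch (check_contiguous is a made-up name), fed with the length/offset pairs shown in the transcripts above:

```python
def check_contiguous(pairs, start=None):
    """pairs: iterable of (length, offset) as written to the *_val files.

    Returns a list of (row_number, message); an empty list means no problems.
    Pass start=0 when checking a whole file, so the first offset is checked too.
    """
    problems = []
    expected = start
    for n, (length, offset) in enumerate(pairs, 1):
        if length <= 0:
            problems.append((n, "length %d" % length))
        if expected is not None and offset != expected:
            problems.append((n, "offset %d, expected %d" % (offset, expected)))
        expected = offset + length
    return problems


# Tail of r3d_val before the fix: the last record came out with length 0.
before = [(454, 1351795), (414, 1352249), (0, 1352663)]
# Rows 169-172 of r3f_val, where the same record has length 8829.
rows_169_172 = [(454, 1351795), (414, 1352249), (8829, 1352663), (446, 1361492)]
# Tail of r3f_val after the fix.
tail = [(457, 1059421286), (17754, 1059421743), (425, 1059439497)]

print(check_contiguous(before))        # [(3, 'length 0')]
print(check_contiguous(rows_169_172))  # []
print(check_contiguous(tail))          # []
```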
47
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
Doubling buffer size doesn't speed up
>: time ~/lib/python/unpackz.py -b $((2 * 1024 * 1024)) /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz 2>/tmp/hst/r3g_err| tee /tmp/hst/r3g_val|while read l o; do printf '%s\t%s\t%s\n' $l $o 'CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz' ;done |& tee /tmp/hst/r3g_log |ix.py -w |egrep '^WARC-Type: ' | tail -4
Reading length, offset, filename tab-delimited triples from stdin...
WARC-Type: metadata
WARC-Type: request
WARC-Type: response
WARC-Type: metadata

real 3m34.519s
user 0m52.312s
sys 0m24.875s

Tried using FileIO.readinto([a fixed buffer]), but it didn't immediately
work. Abandoned because I still don't understand how zlib.decompress
works at all...

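On the zlib side, the piece that matters for multi-member gzip files is zlib.decompressobj rather than zlib.decompress: the object stops at the end of one member, sets eof, and leaves the remaining bytes in unused_data. A minimal in-memory sketch of that behaviour (not the code in unpackz.py), with two small gzip members standing in for WARC records:

```python
import gzip
import zlib

# Two independent gzip members back to back, as in a .warc.gz file.
first = gzip.compress(b"record one\n")
second = gzip.compress(b"record two\n")
blob = first + second

d = zlib.decompressobj(31)  # wbits=31: expect a gzip header and trailer
out = d.decompress(blob)

# The object decodes exactly one member, then stops ...
reached_end = d.eof
# ... and hands back everything after that member, untouched.
leftover = d.unused_data
# So the compressed length of the member is what was fed minus what is left.
member_length = len(blob) - len(leftover)

# A fresh object is needed for the next member.
d2 = zlib.decompressobj(31)
out2 = d2.decompress(leftover)
```

Here member_length equals len(first), which is the quantity the length/offset listings record for each member.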
Time to convert unpackz to a library which takes a callback as an
alternative to an output file -- Done

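One possible shape for a callback-taking interface, as a sketch only. The actual unpackz.py signature is not recorded in these notes, so the name unpack, its parameters, and the callback arguments (offset, length, data) are all assumptions. The loop reads fixed-size buffers and carries a member across buffer boundaries.

```python
import gzip
import io
import zlib


def unpack(stream, callback, bufsize=1024 * 1024):
    """Call callback(offset, length, data) once per gzip member in stream.

    Hypothetical interface.  offset and length describe the compressed member,
    data is its decompressed content.  A truncated final member is dropped.
    """
    offset = 0  # file offset where the current member starts
    length = 0  # compressed bytes consumed so far by the current member
    parts = []  # decompressed pieces of the current member
    d = zlib.decompressobj(31)  # wbits=31: gzip header and trailer
    buf = stream.read(bufsize)
    while buf:
        parts.append(d.decompress(buf))
        if d.eof:
            rest = d.unused_data  # start of the next member, if any
            length += len(buf) - len(rest)
            callback(offset, length, b"".join(parts))
            offset += length
            length, parts = 0, []
            d = zlib.decompressobj(31)
            buf = rest or stream.read(bufsize)
        else:
            # Member continues into the next buffer.
            length += len(buf)
            buf = stream.read(bufsize)


records = [b"a" * 100, b"bb", b"ccc" * 50]
members = [gzip.compress(r) for r in records]
seen = []
unpack(io.BytesIO(b"".join(members)),
       lambda offset, length, data: seen.append((offset, length, data)),
       bufsize=7)  # tiny buffer, so every member straddles several reads
```

Passing a function that writes data to a file gives the output-file behaviour; passing one that only records (length, offset) gives an index.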
W/o using callback, timing and structure for what we need for the
re-indexing task look encouraging:
>: time ~/lib/python/cc/unpackz.py -o /dev/stdout /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz |egrep -aA20 '^WARC-Type: response' | cut -f 1 -d ' ' | egrep -a '^WARC-' |sus | tee >(wc -l 1>&2)
52468 WARC-Block-Digest:
52468 WARC-Concurrent-To:
52468 WARC-Date:
52468 WARC-Identified-Payload-Type:
52468 WARC-IP-Address:
52468 WARC-Payload-Digest:
52468 WARC-Record-ID:
52468 WARC-Target-URI:
52468 WARC-Type:
52468 WARC-Warcinfo-ID:
236 WARC-Truncated:
11

real 0m20.308s
user 0m19.720s
sys 0m4.505s

Whole thing, with no pre-filtering:

>: time ~/lib/python/cc/unpackz.py -o /dev/stdout /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz | cut -f 1 -d ' ' | egrep -a '^(WARC-|Content-|Last-Modified)' |sus | tee >(wc -l 1>&2)
211794 Content-Length:
211162 Content-Type:
159323 WARC-Target-URI:
159311 WARC-Warcinfo-ID:
159301 WARC-Record-ID:
159299 WARC-Date:
159297 WARC-Type:
105901 WARC-Concurrent-To:
105896 WARC-IP-Address:
52484 WARC-Block-Digest:
52484 WARC-Identified-Payload-Type:
52482 WARC-Payload-Digest:
9239 Last-Modified:
3941 Content-Language:
2262 Content-Security-Policy:
642 Content-language:
326 Content-Security-Policy-Report-Only:
238 WARC-Truncated:
114 Content-Disposition:
352 Content-*:
1 WARC-Filename:
42

real 0m30.896s
user 0m37.335s
sys 0m7.542s

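The cut | egrep | sus tally can be sketched in Python with a Counter. This assumes sus amounts to sort | uniq -c | sort -n; it is a local script, so that is a guess. tally_headers is an illustrative name, and the sample lines are invented for the demo.

```python
import re
from collections import Counter

# Same pattern as the egrep above, applied to the first space-delimited field.
WANTED = re.compile(rb"^(WARC-|Content-|Last-Modified)")


def tally_headers(lines):
    """Count header names in an iterable of bytes lines (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        # Equivalent of `cut -f 1 -d ' '`: keep the text before the first space.
        name = line.split(b" ", 1)[0].rstrip(b"\r\n")
        if WANTED.match(name):
            counts[name.decode("latin-1")] += 1
    return counts


sample = [
    b"WARC/1.0\r\n",
    b"WARC-Type: response\r\n",
    b"Content-Length: 10\r\n",
    b"\r\n",
    b"HTTP/1.1 200 OK\r\n",
    b"Content-Language: en\r\n",
    b"Content-language: fr\r\n",
    b"Last-Modified: Mon, 19 Aug 2019 01:10:34 GMT\r\n",
    b"WARC/1.0\r\n",
    b"WARC-Type: request\r\n",
]
counts = tally_headers(sample)
for name, n in counts.most_common():
    print(n, name)
```

As in the shell tallies, header names are compared case-sensitively, so Content-Language: and Content-language: are counted separately.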
First 51 after WARC-Type: response

>: time ~/lib/python/cc/unpackz.py -o /dev/stdout /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz |egrep -aA50 '^WARC-Type: response' | cut -f 1 -d ' ' | egrep -a '^(WARC-|Content-|Last-Modified)' |sus | tee >(wc -l 1>&2)
106775 Content-Length:
106485 Content-Type:
55215 WARC-Type:
55123 WARC-Date:
54988 WARC-Record-ID:
54551 WARC-Warcinfo-ID:
54246 WARC-Target-URI:
54025 WARC-Concurrent-To:
52806 WARC-IP-Address:
52468 WARC-Block-Digest:
52468 WARC-Identified-Payload-Type:
52468 WARC-Payload-Digest:
9230 Last-Modified:
3938 Content-Language:
2261 Content-Security-Policy:
639 Content-language:
324 Content-Security-Policy-Report-Only:
236 WARC-Truncated:
114 Content-Disposition:
342 Content-*:
41

real 0m21.483s
user 0m22.372s
sys 0m5.400s

So, pre-filtering saves under 10s (21.5s vs 30.9s) and loses a few
headers (9230 vs 9239 Last-Modified): not worth the risk, let's try python

>: time ~/lib/python/cc/cdx_extras.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz|wc -l
9238

real 0m25.426s
user 0m23.201s
sys 0m0.711s

Looks good, but why 9238 instead of 9239???

>: ~/lib/python/cc/unpackz.py -o /dev/stdout /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz | egrep -a '^Last-Modified: ' > /tmp/hst/lmo.tsv

Argh. Serious bug in unpackz, wasn't handling cross-buffer-boundary
records correctly. Fixed. Redoing the above...

No pre-filter:
>: uz /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz|egrep -c '^WARC/1\.0.$'
160297

>: time ~/lib/python/cc/unpackz.py -o /dev/stdout /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz | cut -f 1 -d ' ' | egrep -a '^(WARC-|Content-|Last-Modified)' |sus | tee >(wc -l 1>&2)

213719 Content-Length:
213088 Content-Type:
160297 WARC-Date:
160297 WARC-Record-ID:
160297 WARC-Type:
160296 WARC-Target-URI:
160296 WARC-Warcinfo-ID:
106864 WARC-Concurrent-To:
106864 WARC-IP-Address:
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
686 53432 WARC-Block-Digest: [consistent with 160297 == (3 * 53432) + 1]
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
687 53432 WARC-Identified-Payload-Type:
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
688 53432 WARC-Payload-Digest:
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
689 9430 Last-Modified:
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
690 4006 Content-Language:
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
691 2325 Content-Security-Policy:
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
692 653 Content-language:
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
693 331 Content-Security-Policy-Report-Only:
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
694 298 WARC-Truncated:
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
695 128 Content-Disposition:
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
696 83 Content-Location:
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
697 67 Content-type:
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
698 51 Content-MD5:
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
699 45 Content-Script-Type:
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
700 42 Content-Style-Type:
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
701 31 Content-Transfer-Encoding:
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
702 13 Content-disposition:
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
703 8 Content-Md5:
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
704 5 Content-Description:
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
705 5 Content-script-type:
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
706 5 Content-style-type:
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
707 3 Content-transfer-encoding:
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
708 2 Content-Encoding-handler:
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
709 1 Content-DocumentTitle:
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
710 1 Content-Hash:
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
711 1 Content-ID:
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
712 1 Content-Legth:
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
713 1 Content-length:
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
714 1 Content-Range:
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
715 1 Content-Secure-Policy:
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
716 1 Content-security-policy:
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
717 1 Content-Type-Options:
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
718 1 WARC-Filename:
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
719 42
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
720
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
721 real 0m28.876s
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
722 user 0m35.703s
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
723 sys 0m6.976s
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
724
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
725 >: ~/lib/python/cc/unpackz.py -o /dev/stdout /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz | egrep -a '^Last-Modified: ' > /tmp/hst/lmo.tsv
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
726 >: wc -l /tmp/hst/lmo.tsv
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
727 9430 /tmp/hst/lmo.tsv
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
728 >: time ~/lib/python/cc/cdx_extras.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz > /tmp/hst/lm.tsv
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
729
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
730 real 0m17.191s
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
731 user 0m15.739s
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
732 sys 0m0.594s
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
733 >: wc -l /tmp/hst/lm.tsv
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
734 9423 /tmp/hst/lm.tsv
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
735
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
736 >: diff <(sed 's/^Last-Modified: //' /tmp/hst/lmo.tsv | tr -d '\r') <(cut -f 3 /tmp/hst/lm.tsv)
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
737 853d852
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
738 < Mon, 19 Aug 2019 01:46:49 GMT
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
739 4058d4056
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
740 < Tue, 03 Nov 2015 21:31:18 GMT<br />
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
741 4405d4402
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
742 < Mon, 19 Aug 2019 01:54:52 GMT
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
743 5237,5238d5233
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
744 < 3
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
745 < Asia/Amman
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
746 7009d7003
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
747 < Mon, 19 Aug 2019 02:34:20 GMT
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
748 9198d9191
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
749 < Mon, 19 Aug 2019 02:14:49 GMT
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
750
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
751 All good. The only implausible case is
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
752 < Mon, 19 Aug 2019 01:54:52 GMT
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
753 which turns out to be a case of two Last-Modified headers in the same
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
754 response record's HTTP headers. RFCs 2616 and 7230 rule it
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
755 out but neither specifies a recovery, so first-wins is as good as
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
756 anything, and indeed RFC 6797 specifies that for its own header.
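[A sketch of first-wins lookup over a raw HTTP header block, assuming the block arrives as bytes with CRLF or LF line ends. first_header is a made-up name, not necessarily what cdx_extras.py does. Matching is case-insensitive, since the tallies above show both Content-Language and Content-language in the wild.]

```python
def first_header(header_block, name):
    """Return the value of the first header called name, or None.

    Later duplicates of the same header are ignored (first-wins).
    """
    wanted = name.lower().encode("ascii")
    for line in header_block.split(b"\n"):
        key, sep, value = line.partition(b":")
        # Lines without a colon (status line, blank line) are skipped
        if sep and key.strip().lower() == wanted:
            return value.strip().decode("latin-1")
    return None
```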
fbdaede4155a cdx_extras and unpackz.py working
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 46
diff changeset
757
48
f688c437180b thinking about merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 47
diff changeset
758 Start looking at how we do the merge of cdx_extras.py output with existing index
f688c437180b thinking about merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 47
diff changeset
759
f688c437180b thinking about merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 47
diff changeset
760 Try it with the existing _per segment_ index we have for 2019-35
f688c437180b thinking about merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 47
diff changeset
761
f688c437180b thinking about merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 47
diff changeset
762 Assuming we have to key on segment plus offset, as reconstructing the
f688c437180b thinking about merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 47
diff changeset
763 proper index key is such a pain / buggy / is going to change with the year.
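[A sketch of pulling a (filename, offset) key out of one index line. It assumes each line is "SURT-key timestamp JSON" with single spaces before the JSON, and that offset is stored as a string, as in the index lines shown in these notes. cdx_key is a made-up name.]

```python
import json

def cdx_key(line):
    """Return (filename, offset) for one CDX index line."""
    # Split off the two leading fields; the rest is the JSON properties
    _surt, _timestamp, props = line.rstrip("\n").split(" ", 2)
    fields = json.loads(props)
    return fields["filename"], int(fields["offset"])
```

The filename already contains the segment, so the pair identifies a record without reconstructing the SURT key.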
f688c437180b thinking about merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 47
diff changeset
764
f688c437180b thinking about merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 47
diff changeset
765 Stay with segment 49
f688c437180b thinking about merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 47
diff changeset
766
f688c437180b thinking about merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 47
diff changeset
767 >: uz cdx.gz |wc -l
f688c437180b thinking about merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 47
diff changeset
768 29,870,307
f688c437180b thinking about merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 47
diff changeset
769
f688c437180b thinking about merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 47
diff changeset
770 >: time uz cdx.gz|egrep -ao ' "length": "[0-9]*", "offset": "[0-9]*"' |wc
f688c437180b thinking about merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 47
diff changeset
771 29,870,307 119,481,228 1,241,098,122
f688c437180b thinking about merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 47
diff changeset
772 = 4 * 29,870,307
f688c437180b thinking about merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 47
diff changeset
773
f688c437180b thinking about merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 47
diff changeset
774 So no bogons, not _too_ surprising :-)
f688c437180b thinking about merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 47
diff changeset
775
f688c437180b thinking about merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 47
diff changeset
776 Bad news is it's a _big_ file:
f688c437180b thinking about merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 47
diff changeset
777
f688c437180b thinking about merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 47
diff changeset
778 >: ls -lh cdx.gz
f688c437180b thinking about merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 47
diff changeset
779 -rw-r--r-- 1 hst dc007 2.0G Mar 18 2021 cdx.gz
f688c437180b thinking about merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 47
diff changeset
780
f688c437180b thinking about merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 47
diff changeset
781 So not viable to paste offset as a key and then sort on command line,
f688c437180b thinking about merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 47
diff changeset
782 or to load it into Python and do the work there...
f688c437180b thinking about merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 47
diff changeset
783
f688c437180b thinking about merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 47
diff changeset
784 Do it per warc file and then merge?
f688c437180b thinking about merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 47
diff changeset
785
f688c437180b thinking about merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 47
diff changeset
786 >: time uz cdx.gz |fgrep -a warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz | sort -n -t\" -k28,28 > /tmp/hst/558.warc.cdx
f688c437180b thinking about merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 47
diff changeset
787
f688c437180b thinking about merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 47
diff changeset
788 real 0m23.494s
f688c437180b thinking about merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 47
diff changeset
789 user 0m14.541s
f688c437180b thinking about merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 47
diff changeset
790 sys 0m9.158s
f688c437180b thinking about merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 47
diff changeset
791
f688c437180b thinking about merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 47
diff changeset
792 >: wc -l /tmp/hst/558.warc.cdx
f688c437180b thinking about merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 47
diff changeset
793 53432 /tmp/hst/558.warc.cdx
f688c437180b thinking about merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 47
diff changeset
794
49
deeac8a0a682 tentative plan for merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 48
diff changeset
795 >: echo $((600 * 53432))
deeac8a0a682 tentative plan for merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 48
diff changeset
796 32,059,200
deeac8a0a682 tentative plan for merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 48
diff changeset
797
48
f688c437180b thinking about merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 47
diff changeset
798 So, 600 of those, plus approx. same again for extracting, that pbly
f688c437180b thinking about merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 47
diff changeset
799 _is_ doable in python, not more than 10 hours total, assuming internal
f688c437180b thinking about merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 47
diff changeset
800 sort and external merge is not too expensive...
49
deeac8a0a682 tentative plan for merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 48
diff changeset
801
deeac8a0a682 tentative plan for merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 48
diff changeset
802 For each segment, suppose we pull out 60 groups of 10 target files
deeac8a0a682 tentative plan for merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 48
diff changeset
803 >: time uz cdx.gz |egrep -a warc/CC-MAIN-2019[^-]*-2019[^-]*-0000..warc.gz > /tmp/hst/0000.warc.cdx
deeac8a0a682 tentative plan for merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 48
diff changeset
804
deeac8a0a682 tentative plan for merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 48
diff changeset
805 real 0m42.129s
deeac8a0a682 tentative plan for merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 48
diff changeset
806 user 0m35.147s
deeac8a0a682 tentative plan for merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 48
diff changeset
807 sys 0m9.140s
deeac8a0a682 tentative plan for merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 48
diff changeset
808 >: wc -l /tmp/hst/0000.warc.cdx
deeac8a0a682 tentative plan for merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 48
diff changeset
809 533150
deeac8a0a682 tentative plan for merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 48
diff changeset
810
deeac8a0a682 tentative plan for merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 48
diff changeset
811 Key it with offset and sort:
deeac8a0a682 tentative plan for merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 48
diff changeset
812
deeac8a0a682 tentative plan for merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 48
diff changeset
813 >: time egrep -ao ' "length": "[0-9]*", "offset": "[0-9]*"' /tmp/hst/0000.warc.cdx | cut -f 5 -d ' ' | tr -d \" > /tmp/hst/0000_offsets
deeac8a0a682 tentative plan for merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 48
diff changeset
814
deeac8a0a682 tentative plan for merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 48
diff changeset
815 real 0m5.578s
deeac8a0a682 tentative plan for merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 48
diff changeset
816 user 0m5.593s
deeac8a0a682 tentative plan for merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 48
diff changeset
817 sys 0m0.265s
deeac8a0a682 tentative plan for merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 48
diff changeset
818
deeac8a0a682 tentative plan for merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 48
diff changeset
819 >: time paste /tmp/hst/0000_offsets /tmp/hst/0000.warc.cdx |sort -nk1,1 | cut -f 2 > /tmp/hst/0000_sorted.warc.cdx
deeac8a0a682 tentative plan for merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 48
diff changeset
820
deeac8a0a682 tentative plan for merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 48
diff changeset
821 real 0m4.185s
deeac8a0a682 tentative plan for merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 48
diff changeset
822 user 0m2.001s
deeac8a0a682 tentative plan for merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 48
diff changeset
823 sys 0m1.334s
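[For when this moves into Python: a sketch of the same key-with-offset-and-sort step done in memory, using the same length/offset pattern as the egrep above. sort_by_offset is a made-up name. It assumes every line matches the pattern, which the "no bogons" count suggests, and it will raise on a line that does not.]

```python
import re

# Same shape as the egrep pattern, with the offset digits captured
OFFSET = re.compile(r' "length": "[0-9]*", "offset": "([0-9]+)"')

def sort_by_offset(lines):
    """Return the CDX lines ordered numerically by their offset field."""
    return sorted(lines, key=lambda line: int(OFFSET.search(line).group(1)))
```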
deeac8a0a682 tentative plan for merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 48
diff changeset
824
deeac8a0a682 tentative plan for merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 48
diff changeset
825 >: time seq 0 9 | parallel -j 10 "~/lib/python/cc/cdx_extras.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-*-*-0000'{}'.warc.gz > /tmp/hst/lm_0000'{}'.tsv"
deeac8a0a682 tentative plan for merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 48
diff changeset
826
deeac8a0a682 tentative plan for merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 48
diff changeset
827 real 0m24.610s
deeac8a0a682 tentative plan for merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 48
diff changeset
828 user 2m54.146s
deeac8a0a682 tentative plan for merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 48
diff changeset
829 sys 0m10.226s
deeac8a0a682 tentative plan for merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 48
diff changeset
830
deeac8a0a682 tentative plan for merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 48
diff changeset
831 >: head /tmp/hst/lm_00000.tsv
deeac8a0a682 tentative plan for merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 48
diff changeset
832 9398 16432 Mon, 19 Aug 2019 02:44:15 GMT
deeac8a0a682 tentative plan for merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 48
diff changeset
833 20796 26748 Tue, 16 Jul 2019 04:39:09 GMT
deeac8a0a682 tentative plan for merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 48
diff changeset
834 4648 340633 Fri, 07 Dec 2018 09:05:59 GMT
deeac8a0a682 tentative plan for merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 48
diff changeset
835 3465 357109 Sun, 18 Aug 2019 11:48:23 GMT
deeac8a0a682 tentative plan for merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 48
diff changeset
836 7450 914189 Mon, 19 Aug 2019 02:50:08 GMT
deeac8a0a682 tentative plan for merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 48
diff changeset
837 ...
deeac8a0a682 tentative plan for merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 48
diff changeset
838 sing<3956>: fgrep '"length": "9398", "offset": "16432"' /tmp/hst/0000_sorted.warc.cdx
deeac8a0a682 tentative plan for merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 48
diff changeset
839 com,roommeme,0401a)/index.phtml?channel=&op=&p=140&put=show&r2= 20190819024416 {"url": "http://0401a.roommeme.com/index.phtml?PUT=SHOW&R2=&OP=&P=140&CHANNEL=", "mime": "text/html", "mime-detected": "text/html", "status": "200", "digest": "5DNDVX5HQBOOBHISSCOI4UBVMUL63L36", "length": "9398", "offset": "16432", "filename": "crawl-data/CC-MAIN-2019-35/segments/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00000.warc.gz", "charset": "Big5", "languages": "zho"}
deeac8a0a682 tentative plan for merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 48
diff changeset
840
deeac8a0a682 tentative plan for merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 48
diff changeset
841 bingo
deeac8a0a682 tentative plan for merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 48
diff changeset
842
deeac8a0a682 tentative plan for merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 48
diff changeset
843 So, the python code is pretty straightforward: open the 10 individual
deeac8a0a682 tentative plan for merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 48
diff changeset
844 lm_*.tsv outputs into an array, initialise a 10-elt array with the
deeac8a0a682 tentative plan for merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 48
diff changeset
845 first line of each and another with its offset, record the
deeac8a0a682 tentative plan for merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 48
diff changeset
846 fileno(s) of the lowest offset, then iterate
deeac8a0a682 tentative plan for merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 48
diff changeset
847
deeac8a0a682 tentative plan for merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 48
diff changeset
848 read cdx lines and write unchanged until offset = lowest
deeac8a0a682 tentative plan for merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 48
diff changeset
849 merge line from fileno and output
deeac8a0a682 tentative plan for merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 48
diff changeset
850 remove fileno from list of matches
deeac8a0a682 tentative plan for merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 48
diff changeset
851 read and store a new line for fileno [handle EOF]
deeac8a0a682 tentative plan for merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 48
diff changeset
852 if list of matches is empty, redo setting of lowest
deeac8a0a682 tentative plan for merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 48
diff changeset
853
deeac8a0a682 tentative plan for merging
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 48
diff changeset
854 Re-sort the result by actual key
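[A sketch of the merge pass, as a variant of the steps above: instead of tracking the lowest pending offset, it uses the filename in each cdx line to pick which lm file to consult, which also settles ties when two WARC files share an offset. Assumptions: cdx lines arrive sorted by offset, each lm file is tab-separated length/offset/date in ascending offset order, and the merged value is stored under a "lastmod" key. That key name and merge_last_modified are made up here, not settled.]

```python
import json
import os

def _next_row(rows):
    """Return (offset, date) for the next lm line, or None at EOF."""
    line = next(rows, None)
    if line is None:
        return None
    _length, offset, stamp = line.rstrip("\n").split("\t", 2)
    return int(offset), stamp

def merge_last_modified(cdx_lines, lm_by_warc):
    """Yield cdx lines, adding "lastmod" where an lm row matches.

    lm_by_warc maps a WARC file basename to an iterable of its lm lines.
    Lines with no matching lm row are passed through unchanged.
    """
    iters = {name: iter(rows) for name, rows in lm_by_warc.items()}
    heads = {name: _next_row(rows) for name, rows in iters.items()}
    for line in cdx_lines:
        key, timestamp, props = line.rstrip("\n").split(" ", 2)
        fields = json.loads(props)
        name = os.path.basename(fields["filename"])
        offset = int(fields["offset"])
        # Drop lm rows that have no cdx line (offset already passed)
        while heads.get(name) is not None and heads[name][0] < offset:
            heads[name] = _next_row(iters[name])
        head = heads.get(name)
        if head is not None and head[0] == offset:
            fields["lastmod"] = head[1]
            props = json.dumps(fields, ensure_ascii=False)
            heads[name] = _next_row(iters[name])  # None here means EOF
        yield f"{key} {timestamp} {props}"
```

Only one line per lm file is held in memory at a time, so the cost per group of 10 should be dominated by reading the cdx lines.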