cc/work: lurid3/notes.txt annotate

annotate lurid3/notes.txt @ 60:3be7b53d726e

using python dict test

author	Henry S. Thompson <ht@inf.ed.ac.uk>
date	Thu, 02 Jan 2025 15:01:48 +0000
parents	d9ba3ce783ff
children	e6bab0972142

rev	line source
40 4167d8f33325 start lab notes for LURID3 Henry S. Thompson <ht@inf.ed.ac.uk> parents: diff changeset	1 See old_notes.txt for all older notes on Common Crawl data processing,
4167d8f33325 start lab notes for LURID3 Henry S. Thompson <ht@inf.ed.ac.uk> parents: diff changeset	2 starting from Azure via Turing and then LURID and LURID2.
4167d8f33325 start lab notes for LURID3 Henry S. Thompson <ht@inf.ed.ac.uk> parents: diff changeset	3
4167d8f33325 start lab notes for LURID3 Henry S. Thompson <ht@inf.ed.ac.uk> parents: diff changeset	4 Installed /beegfs/common_crawl/CC-MAIN-2024-33/cdx
4167d8f33325 start lab notes for LURID3 Henry S. Thompson <ht@inf.ed.ac.uk> parents: diff changeset	5 >: cd results/CC-MAIN-2024-33/cdx/
4167d8f33325 start lab notes for LURID3 Henry S. Thompson <ht@inf.ed.ac.uk> parents: diff changeset	6 >: cut -f 2 counts.tsv | btot
4167d8f33325 start lab notes for LURID3 Henry S. Thompson <ht@inf.ed.ac.uk> parents: diff changeset	7 2,793,986,828
4167d8f33325 start lab notes for LURID3 Henry S. Thompson <ht@inf.ed.ac.uk> parents: diff changeset	8
4167d8f33325 start lab notes for LURID3 Henry S. Thompson <ht@inf.ed.ac.uk> parents: diff changeset	9 State of play wrt data -- see status.xlsx
4167d8f33325 start lab notes for LURID3 Henry S. Thompson <ht@inf.ed.ac.uk> parents: diff changeset	10
41 64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	11 [in trying to tabulate the date ranges of the crawls, I found that the
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	12 WARC timestamp is sometimes bogus:
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	13
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	14 >: fgrep ' 2009' CC-MAIN-2018-34/cdx/cluster.idx
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	15 net,tyredeyes)/robots.txt 20090201191318 cdx-00230.gz 160573468 198277 920675
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	16
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	17 >: zgrep '^net,tyredeyes)/robots.txt' CC-MAIN-2018-34/cdx/warc/cdx-00230.gz
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	18 net,tyredeyes)/robots.txt 20090201191318 {"url": "http://tyredeyes.net/robots.txt", "mime": "text/html", "mime-detected": "text/html", "status": "301", "digest": "QH732FYSV7UM34JYWVYMB7EZGR2CYM6B", "length": "582", "offset": "1224614", "filename": "crawl-data/CC-MAIN-2018-34/segments/1534221215075.58/robotstxt/CC-MAIN-20180819090604-20180819110604-00558.warc.gz"}
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	19 net,tyredeyes)/robots.txt 20090201191319 {"url": "http://www.tyredeyes.net/robots.txt", "mime": "text/plain", "mime-detected": "text/plain", "status": "200", "digest": "PSX5IZU4B4SIXGNDKXCVFH75Q27VHUTJ", "length": "549", "offset": "2069841", "filename": "crawl-data/CC-MAIN-2018-34/segments/1534221215075.58/robotstxt/CC-MAIN-20180819090604-20180819110604-00485.warc.gz"}
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	20
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	21 This happens in 2019-35 as well :-(
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	22
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	23 >: fgrep ' 20181023' CC-MAIN-2019-35/cdx/cluster.idx
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	24 com,gyshbsh)/robots.txt 20181023022000 cdx-00078.gz 356340085 162332 315406
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	25 >: zgrep ' 20181023' CC-MAIN-2019-35/cdx/warc/cdx-00078.gz
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	26 com,gyshbsh)/robots.txt 20181023022000 {"url": "http://gyshbsh.com/robots.txt", "mime": "text/html", "mime-detected": "text/html", "status": "301", "digest": "3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ", "length": "529", "offset": "638892", "filename": "crawl-data/CC-MAIN-2019-35/segments/1566027315618.73/robotstxt/CC-MAIN-20190820200701-20190820222701-00120.warc.gz"}
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	27 ...
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	28
57 4b5117db4929 minor updates Henry S. Thompson <ht@inf.ed.ac.uk> parents: 56 diff changeset	29 Full search:
4b5117db4929 minor updates Henry S. Thompson <ht@inf.ed.ac.uk> parents: 56 diff changeset	30 >: find CC*/cdx -type f -name cluster.idx > /tmp/hst/clus
4b5117db4929 minor updates Henry S. Thompson <ht@inf.ed.ac.uk> parents: 56 diff changeset	31 >: cat /tmp/hst/clus | while read c; do printf '%s\t%s\n' $c $(cut -f 1 -d ' ' $c | fgrep -vc ${c:8:4}); done
4b5117db4929 minor updates Henry S. Thompson <ht@inf.ed.ac.uk> parents: 56 diff changeset	32 CC-MAIN-2013-20/cdx/cluster.idx 0
4b5117db4929 minor updates Henry S. Thompson <ht@inf.ed.ac.uk> parents: 56 diff changeset	33 CC-MAIN-2014-35/cdx/cluster.idx 0
4b5117db4929 minor updates Henry S. Thompson <ht@inf.ed.ac.uk> parents: 56 diff changeset	34 CC-MAIN-2015-35/cdx/cluster.idx 0
4b5117db4929 minor updates Henry S. Thompson <ht@inf.ed.ac.uk> parents: 56 diff changeset	35 CC-MAIN-2016-30/cdx/cluster.idx 0
4b5117db4929 minor updates Henry S. Thompson <ht@inf.ed.ac.uk> parents: 56 diff changeset	36 CC-MAIN-2017-30/cdx/cluster.idx 0
4b5117db4929 minor updates Henry S. Thompson <ht@inf.ed.ac.uk> parents: 56 diff changeset	37 CC-MAIN-2018-30/cdx/warc/cluster.idx 0
4b5117db4929 minor updates Henry S. Thompson <ht@inf.ed.ac.uk> parents: 56 diff changeset	38 CC-MAIN-2018-34/cdx/cluster.idx 36
4b5117db4929 minor updates Henry S. Thompson <ht@inf.ed.ac.uk> parents: 56 diff changeset	39 CC-MAIN-2019-18/cdx/warc/cluster.idx 3
4b5117db4929 minor updates Henry S. Thompson <ht@inf.ed.ac.uk> parents: 56 diff changeset	40 CC-MAIN-2019-35/cdx/cluster.idx 1
4b5117db4929 minor updates Henry S. Thompson <ht@inf.ed.ac.uk> parents: 56 diff changeset	41 CC-MAIN-2020-34/cdx/cluster.idx 0
4b5117db4929 minor updates Henry S. Thompson <ht@inf.ed.ac.uk> parents: 56 diff changeset	42 CC-MAIN-2021-25/cdx/cluster.idx 0
4b5117db4929 minor updates Henry S. Thompson <ht@inf.ed.ac.uk> parents: 56 diff changeset	43 CC-MAIN-2021-31/cdx/cluster.idx 0
4b5117db4929 minor updates Henry S. Thompson <ht@inf.ed.ac.uk> parents: 56 diff changeset	44 CC-MAIN-2021-49/cdx/cluster.idx 0
4b5117db4929 minor updates Henry S. Thompson <ht@inf.ed.ac.uk> parents: 56 diff changeset	45 CC-MAIN-2022-21/cdx/warc/cluster.idx 0
4b5117db4929 minor updates Henry S. Thompson <ht@inf.ed.ac.uk> parents: 56 diff changeset	46 CC-MAIN-2022-33/cdx/warc/cluster.idx 0
4b5117db4929 minor updates Henry S. Thompson <ht@inf.ed.ac.uk> parents: 56 diff changeset	47 CC-MAIN-2022-40/cdx/warc/cluster.idx 0
4b5117db4929 minor updates Henry S. Thompson <ht@inf.ed.ac.uk> parents: 56 diff changeset	48 CC-MAIN-2022-49/cdx/warc/cluster.idx 0
4b5117db4929 minor updates Henry S. Thompson <ht@inf.ed.ac.uk> parents: 56 diff changeset	49 CC-MAIN-2023-40/cdx/warc/cluster.idx 0
4b5117db4929 minor updates Henry S. Thompson <ht@inf.ed.ac.uk> parents: 56 diff changeset	50 CC-MAIN-2023-50/cdx/warc/cluster.idx 0
4b5117db4929 minor updates Henry S. Thompson <ht@inf.ed.ac.uk> parents: 56 diff changeset	51 CC-MAIN-2024-33/cdx/warc/cluster.idx 0
4b5117db4929 minor updates Henry S. Thompson <ht@inf.ed.ac.uk> parents: 56 diff changeset	52 Emailed this info to Sebastian Nagel 2024-12-17
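A field-based version of the check above, as a minimal sketch only. It counts index lines whose timestamp (second whitespace-separated field) does not start with the crawl year. The name count_off_year and the second sample line are made up for illustration; the test is stricter than a substring match, so a crawl that legitimately spans a year boundary would also be counted.

```shell
# Hypothetical helper, not part of the original notes.
# Reads cluster.idx-style lines on stdin: "<surt> <timestamp> <file> <offset> ..."
# Prints how many lines have a timestamp outside the given year.
count_off_year() {
  awk -v year="$1" 'substr($2, 1, 4) != year { n++ } END { print n + 0 }'
}

# First line is the bogus 2009 entry quoted above; second is an invented normal entry.
printf '%s\n' \
  'net,tyredeyes)/robots.txt 20090201191318 cdx-00230.gz 160573468 198277 920675' \
  'com,example)/ 20180819090604 cdx-00001.gz 0 1000 1' |
  count_off_year 2018   # prints 1
```

Against a real index this would be run as, e.g., count_off_year 2018 < CC-MAIN-2018-34/cdx/cluster.idx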
4b5117db4929 minor updates Henry S. Thompson <ht@inf.ed.ac.uk> parents: 56 diff changeset	53
41 64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	54 Tabulate all the date ranges for the WARC files we have
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	55
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	56 >: for d in {2017-30,2019-35,2020-34,2021-25,2023-40,2023-50}; do printf "%s\t" $d; (ls CC-MAIN-$d/*.{?,??}/orig/warc | fgrep .gz | cut -f 3,4 -d - | sort -u |tee /dev/fd/3 | head -1 ) 3> >( tail -1 ) | tr '\n' '\t'; echo; done | cut -f 1,2,4 -d - | sed 's/-20/ 20/;s/.$//' | tr ' ' '\t' > dates.tsv
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	57 >: for d in {2018-30,2018-34}; do printf "%s\t" $d; (ls CC-MAIN-$d/{.?,.??} | fgrep warc.gz | cut -f 3,4 -d - | sort -u |tee /dev/fd/3 | { sleep 10 ; head -1 ; } ) 3> >( tail -1 ) | tr '\n' '\t'; echo; done >> dates.tsv
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	58 >: for d in 2019-18; do printf "%s\t" $d; (ls CC-MAIN-$d/{.?,.??} | fgrep warc.gz | cut -f 3,4 -d - | sort -u | head -1); done
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	59 2019-18 20190418101243-20190418122248
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	60 >: for d in 2019-18; do printf "%s\t" $d; (ls CC-MAIN-$d/{.?,.??} | fgrep warc.gz | cut -f 3,4 -d - | sort -u | tail -1); done
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	61 2019-18 20190426153423-20190426175423
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	62 >: echo 2019-18 20190418101243-20190418122248 20190426153423-20190426175423 >> dates.tsv
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	63 >: pwd
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	64 /beegfs/common_crawl/CC-MAIN-2016-30/cdx/warc
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	65 >: echo {000..299} | tr ' ' '\n' | parallel -j 10 'uz cdx-00{}.gz | cut -f 2 -d " " | sort -u > /tmp/hst/{}'
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	66 >: sort -mu /tmp/hst/??? > /tmp/hst/all
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	67 >: wc -l /tmp/hst/all
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	68 679686 /tmp/hst/all
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	69 >: head -1 /tmp/hst/all
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	70 20160723090435
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	71 >: tail -1 /tmp/hst/all
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	72 20160731110639
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	73 >: cd ../../..
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	74 >: echo 2016-30 20160723090435 20160731110639 >> dates.tsv
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	75 tweaked and sorted in xemacs:
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	76 2016-30 20160723090435 20160731110639
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	77 2017-30 20170720121902 20170729132938
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	78 2018-30 20180715183800 20180723184955
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	79 2018-34 20180814062251 20180822085454
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	80 2019-18 20190418101243 20190426175423
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	81 2019-35 20190817102624 20190826111356
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	82 2020-34 20200803083123 20200815214756
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	83 2021-25 20210612103920 20210625145905
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	84 2023-40 20230921073711 20231005042006
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	85 2023-50 20231128083443 20231212000408
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	86
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	87 Added to status.xlsx in shortened form, with number of days
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	88 8
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	89 9
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	90 8
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	91 8
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	92 8
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	93 9
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	94 12
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	95 13
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	96 15
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	97 15
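How a day count can be derived from a pair of timestamps, as a rough sketch. The convention used in status.xlsx is not recorded here, so this simply takes the difference between the two calendar dates and ignores time of day; it may be off by one from the spreadsheet figures. days_between is a made-up name, and the -d option assumes GNU date (as used elsewhere in these notes).

```shell
# Hypothetical helper: whole calendar days between two 14-digit
# timestamps (YYYYMMDDhhmmss); the hhmmss part is ignored.
# Assumes a date(1) that accepts -d YYYY-MM-DD, e.g. GNU coreutils.
days_between() {
  s=$(printf '%s' "$1" | sed 's/^\(....\)\(..\)\(..\).*/\1-\2-\3/')
  e=$(printf '%s' "$2" | sed 's/^\(....\)\(..\)\(..\).*/\1-\2-\3/')
  echo $(( ($(date -u -d "$e" +%s) - $(date -u -d "$s" +%s)) / 86400 ))
}

days_between 20160723090435 20160731110639   # the 2016-30 row; prints 8
```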
64b7fb44e8dc extract actual date info for WARC crawls Henry S. Thompson <ht@inf.ed.ac.uk> parents: 40 diff changeset	98
42 0c472ae05f71 nearly finished downloading for now Henry S. Thompson <ht@inf.ed.ac.uk> parents: 41 diff changeset	99 Fill a gap by downloading 2022-33
0c472ae05f71 nearly finished downloading for now Henry S. Thompson <ht@inf.ed.ac.uk> parents: 41 diff changeset	100
0c472ae05f71 nearly finished downloading for now Henry S. Thompson <ht@inf.ed.ac.uk> parents: 41 diff changeset	101 >: for s in 0; do ~/bin/getcc_multi.aws CC-MAIN-2022-33 $s 5; done > /tmp/hst/get_22-33_0.log &
0c472ae05f71 nearly finished downloading for now Henry S. Thompson <ht@inf.ed.ac.uk> parents: 41 diff changeset	102 130 minutes...
0c472ae05f71 nearly finished downloading for now Henry S. Thompson <ht@inf.ed.ac.uk> parents: 41 diff changeset	103 >: for s in 1; do ~/bin/getcc_multi.aws CC-MAIN-2022-33 $s 10; done > /tmp/hst/get_22-33_1.log &
0c472ae05f71 nearly finished downloading for now Henry S. Thompson <ht@inf.ed.ac.uk> parents: 41 diff changeset	104 59 minutes
0c472ae05f71 nearly finished downloading for now Henry S. Thompson <ht@inf.ed.ac.uk> parents: 41 diff changeset	105
0c472ae05f71 nearly finished downloading for now Henry S. Thompson <ht@inf.ed.ac.uk> parents: 41 diff changeset	106 Another day to get to a quarter?
0c472ae05f71 nearly finished downloading for now Henry S. Thompson <ht@inf.ed.ac.uk> parents: 41 diff changeset	107 >: for s in {2..23}; do ~/bin/getcc_multi.aws CC-MAIN-2022-33 $s 10; done > /tmp/hst/get_22-33_2-23.log &
0c472ae05f71 nearly finished downloading for now Henry S. Thompson <ht@inf.ed.ac.uk> parents: 41 diff changeset	108
0c472ae05f71 nearly finished downloading for now Henry S. Thompson <ht@inf.ed.ac.uk> parents: 41 diff changeset	109
0c472ae05f71 nearly finished downloading for now Henry S. Thompson <ht@inf.ed.ac.uk> parents: 41 diff changeset	110 And finally 2015-35
0c472ae05f71 nearly finished downloading for now Henry S. Thompson <ht@inf.ed.ac.uk> parents: 41 diff changeset	111 Fetched in just 2 chunks, 0-9 and 10-99, e.g.
0c472ae05f71 nearly finished downloading for now Henry S. Thompson <ht@inf.ed.ac.uk> parents: 41 diff changeset	112 >: for s in {10..99}; do ~/bin/getcc_multi.aws CC-MAIN-2015-35 $s 10; done > /tmp/hst/get_15-35_10-99.log &
0c472ae05f71 nearly finished downloading for now Henry S. Thompson <ht@inf.ed.ac.uk> parents: 41 diff changeset	113
0c472ae05f71 nearly finished downloading for now Henry S. Thompson <ht@inf.ed.ac.uk> parents: 41 diff changeset	114 Much smaller.
0c472ae05f71 nearly finished downloading for now Henry S. Thompson <ht@inf.ed.ac.uk> parents: 41 diff changeset	115 Compare 2023-40, with 900 files per segment:
0c472ae05f71 nearly finished downloading for now Henry S. Thompson <ht@inf.ed.ac.uk> parents: 41 diff changeset	116 >: lss /orig/warc/-0023?.* | cut -f 5 -d ' ' | stats
0c472ae05f71 nearly finished downloading for now Henry S. Thompson <ht@inf.ed.ac.uk> parents: 41 diff changeset	117 n = 1000
0c472ae05f71 nearly finished downloading for now Henry S. Thompson <ht@inf.ed.ac.uk> parents: 41 diff changeset	118 min = 1.14775e+09
0c472ae05f71 nearly finished downloading for now Henry S. Thompson <ht@inf.ed.ac.uk> parents: 41 diff changeset	119 max = 1.26702e+09
0c472ae05f71 nearly finished downloading for now Henry S. Thompson <ht@inf.ed.ac.uk> parents: 41 diff changeset	120 sum = 1.20192e+12
0c472ae05f71 nearly finished downloading for now Henry S. Thompson <ht@inf.ed.ac.uk> parents: 41 diff changeset	121 mean = 1.20192e+09
0c472ae05f71 nearly finished downloading for now Henry S. Thompson <ht@inf.ed.ac.uk> parents: 41 diff changeset	122 sd = 2.26049e+07
0c472ae05f71 nearly finished downloading for now Henry S. Thompson <ht@inf.ed.ac.uk> parents: 41 diff changeset	123
0c472ae05f71 nearly finished downloading for now Henry S. Thompson <ht@inf.ed.ac.uk> parents: 41 diff changeset	124 with 2015-35, with 353 files per segment
0c472ae05f71 nearly finished downloading for now Henry S. Thompson <ht@inf.ed.ac.uk> parents: 41 diff changeset	125 >: lss /orig/warc/-0023?-* | cut -f 5 -d ' ' | stats
43 6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	126 n = 1000
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	127 min = 1.66471e+08
42 0c472ae05f71 nearly finished downloading for now Henry S. Thompson <ht@inf.ed.ac.uk> parents: 41 diff changeset	128 max = 9.6322e+08
43 6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	129 sum = 9.19222e+11
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	130 mean = 9.19222e+08
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	131 sd = 8.20542e+07
42 0c472ae05f71 nearly finished downloading for now Henry S. Thompson <ht@inf.ed.ac.uk> parents: 41 diff changeset	132
0c472ae05f71 nearly finished downloading for now Henry S. Thompson <ht@inf.ed.ac.uk> parents: 41 diff changeset	133 The min files all come from segment 1440644060633.7, whose files are
0c472ae05f71 nearly finished downloading for now Henry S. Thompson <ht@inf.ed.ac.uk> parents: 41 diff changeset	134 _all_ small:
0c472ae05f71 nearly finished downloading for now Henry S. Thompson <ht@inf.ed.ac.uk> parents: 41 diff changeset	135 >: uz 00123-.gz | wc -l
0c472ae05f71 nearly finished downloading for now Henry S. Thompson <ht@inf.ed.ac.uk> parents: 41 diff changeset	136 12,759,931
0c472ae05f71 nearly finished downloading for now Henry S. Thompson <ht@inf.ed.ac.uk> parents: 41 diff changeset	137 Compare to 1440644060103.8
0c472ae05f71 nearly finished downloading for now Henry S. Thompson <ht@inf.ed.ac.uk> parents: 41 diff changeset	138 >: zcat 00123-.gz | wc -l
0c472ae05f71 nearly finished downloading for now Henry S. Thompson <ht@inf.ed.ac.uk> parents: 41 diff changeset	139 75,806,738
0c472ae05f71 nearly finished downloading for now Henry S. Thompson <ht@inf.ed.ac.uk> parents: 41 diff changeset	140 Mystery
0c472ae05f71 nearly finished downloading for now Henry S. Thompson <ht@inf.ed.ac.uk> parents: 41 diff changeset	141
0c472ae05f71 nearly finished downloading for now Henry S. Thompson <ht@inf.ed.ac.uk> parents: 41 diff changeset	142 Also faster
43 6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	143 Compare 2022-33:
42 0c472ae05f71 nearly finished downloading for now Henry S. Thompson <ht@inf.ed.ac.uk> parents: 41 diff changeset	144 >: fgrep -h BST /tmp/hst/get_22-33_{2-23,24-49,50-99}.log | cut -f 1-7 -d ' ' | while read s; do if read e; then echo $((($(date --date="$e" +%s) - $(date --date="$s" +%s)) / 60)); fi; done | stats n min max mean sd
0c472ae05f71 nearly finished downloading for now Henry S. Thompson <ht@inf.ed.ac.uk> parents: 41 diff changeset	145 98 19 256 75.1 25.2
0c472ae05f71 nearly finished downloading for now Henry S. Thompson <ht@inf.ed.ac.uk> parents: 41 diff changeset	146 with 2015-35:
0c472ae05f71 nearly finished downloading for now Henry S. Thompson <ht@inf.ed.ac.uk> parents: 41 diff changeset	147 >: fgrep -h BST /tmp/hst/get_15-35_{0-9,10-99}.log | cut -f 1-7 -d ' ' | while read s; do if read e; then echo $((($(date --date="$e" +%s) - $(date --date="$s" +%s)) / 60)); fi; done | stats n min max mean sd
43 6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	148 100 15 40 32.6 2.9
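The stats command used above is a local tool whose implementation is not shown in these notes. As a stand-in sketch only, the same five figures (n, min, max, mean, sd) can be produced with awk. The name summarise is made up, and the use of the sample standard deviation (n-1 denominator) is an assumption; the real tool may use the population form.

```shell
# Hypothetical stand-in for "stats n min max mean sd".
# Reads one number per line on stdin; prints: n min max mean sd
summarise() {
  awk 'NR == 1 { min = max = $1 }
       { n++; sum += $1; sumsq += $1 * $1
         if ($1 < min) min = $1
         if ($1 > max) max = $1 }
       END { if (n == 0) exit 1
             mean = sum / n
             v = (n > 1) ? (sumsq - n * mean * mean) / (n - 1) : 0
             if (v < 0) v = 0      # guard against rounding below zero
             printf "%d %g %g %.1f %.1f\n", n, min, max, mean, sqrt(v) }'
}

printf '2\n4\n6\n' | summarise   # prints: 3 2 6 4.0 2.0
```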
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	149
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	150 >: echo {000..299} | tr ' ' '\n' | parallel -j 10 'uz cdx-00{}.gz | cut -f 2 -d " " | sort -u > /tmp/hst/2015_{}' &
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	151 >: sort --parallel=10 -mu /tmp/hst/2015_??? > /tmp/hst/2015_all
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	152 >: head -1 /tmp/hst/2015_all
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	153 20150827191534
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	154 >: tail -1 /tmp/hst/2015_all
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	155 20150905180914
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	156 >: wc -l /tmp/hst/2015_all
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	157 698128 /tmp/hst/2015_all
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	158
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	159 What about wet files -- do they include text from pdfs? What about
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	160 truncated pdfs?
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	161
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	162 >: time for s in 0; do ~/bin/getcc_wet_multi.aws CC-MAIN-2019-35 $s 10; done > /tmp/hst/get_wet_19-35_0.log &
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	163 real 26m3.049s
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	164 user 0m1.225s
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	165 sys 0m1.310s
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	166
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	167 In the segment 0 cdx file (!) we find 3747 probable truncations:
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	168 >: zgrep -a '"mime-detected": "application/pdf", ' cdx.gz > /tmp/hst/2019-35_seg0_pdf.idx
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	169 >: wc -l /tmp/hst/2019-35_seg0_pdf.idx
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	170 42345 /tmp/hst/2019-35_seg0_pdf.idx
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	171 >: egrep -a '"length": "10....."' /tmp/hst/2019-35_seg0_pdf.idx > /tmp/hst/2019-35_seg0_long_pdf.idx &
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	172 >: wc -l < /tmp/hst/2019-35_seg0_long_pdf.idx
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	173 3747
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	174 Of the 42345 pdf entries, 70 are in file 0:
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	175 >: egrep -a '.-00000\.' /tmp/hst/2019-35_seg0_pdf.idx > /tmp/hst/2019-35_seg0_file0_pdf.idx
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	176 >: wc -l /tmp/hst/2019-35_seg0_file0_pdf.idx
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	177 70 /tmp/hst/2019-35_seg0_file0_pdf.idx
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	178
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	179 In segment 0 file 0 we find 70 application/pdf Content-Type headers:
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	180 >: ix.py -h -w -x </tmp/hst/2019-35_seg0_file0_pdf.idx |egrep '^(WARC-Target-URI:|Content-Length:) '|cut -f 2 -d ' ' |tr -d '\r'|while read l1; do read uri; read l2; printf '%s\t%s\t%s\n' $l1 $l2 "$uri"; done > ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	181 >: wc -l < ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	182 70
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	183 >: head -3 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	184
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	185
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	186 Of which 14 are truncated:
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	187 >: fgrep -c 1048576 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	188 14
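fgrep -c 1048576 matches that string anywhere on the line, so a URL or a first-column length that happens to contain it would also be counted. A stricter variant, sketched here under the assumption that the second tab-separated column is the payload Content-Length and that truncation always lands on exactly 1048576 bytes (1 MiB), compares that column only. count_truncated is a made-up name; the sample rows are taken from these notes.

```shell
# Hypothetical helper: count rows whose second tab-separated field
# is exactly 1048576. Reads the named files, or stdin if none given.
count_truncated() {
  awk -F '\t' '$2 == 1048576 { n++ } END { print n + 0 }' "$@"
}

# One untruncated and one truncated row, in the seg0_file0_lengths.tsv layout
printf '%s\t%s\t%s\n' \
  11588 10913 'http://bdds.deux-sevres.com/recherche/simple/Editeur/2/Belfond/vignette?format=pdf' \
  1049051 1048576 'https://en.heks.ch/sites/default/files/documents/2017-09/HEKS_EPER_Mission_Statement_2016_e.pdf' |
  count_truncated   # prints 1
```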
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	189
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	190 E.g.
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	191 >: fgrep 1048576 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv | head -3
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	192 1049051 1048576 https://en.heks.ch/sites/default/files/documents/2017-09/HEKS_EPER_Mission_Statement_2016_e.pdf
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	193 1049469 1048576 https://bmcmicrobiol.biomedcentral.com/track/pdf/10.1186/s12866-017-0951-4
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	194 1048824 1048576 https://citydocs.fcgov.com/?action=cao-cases&cmd=convert&docid=3332339
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	195
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	196 Are any of the pdfs in the corresponding wet file?
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	197
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	198 Yes, 2:
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	199 >: cut -f 3 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv | fgrep -af - <(uz 1566027313501.0/orig/wet/*-00000.warc.wet.gz)
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	200 WARC-Target-URI: http://bdds.deux-sevres.com/recherche/simple/Editeur/2/Belfond/vignette?format=pdf
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	201 WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D00
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	202
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	203 Is it in fact corresponding?
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	204 >: diff -bw <(uz 1566027313501.0/orig/warc/-00000.warc.gz | egrep -a '^WARC-Target-URI: ' | uniq | head -1000) <(uz 1566027313501.0/orig/wet/-00000.warc.wet.gz | egrep -a '^WARC-Target-URI: ' | head -1000)|egrep -c '^<'
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	205 19
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	206
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	207 So, yes, mostly. About 2% (19 of the first 1000) are missing
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	208
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	209 Just checking the search:
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	210 >: cut -f 3 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv | fgrep -af - <(uz 1566027313501.0/orig/warc/*-00000.warc.gz) | wc -l
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	211 210
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	212 Correct
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	213
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	214 So, what pdfs make it into the WET:
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	215 >: cut -f 3 ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv \| fgrep -af - <(uz 1566027313501.0/orig/wet/*-00000.warc.wet.gz) > ~/results/CC-MAIN-2019-35/s0_file0_pdf.txt
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	216 >: wc -l < ~/results/CC-MAIN-2019-35/s0_file0_pdf.txt
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	217 2
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	218 >: cut -f 2 -d ' ' ~/results/CC-MAIN-2019-35/s0_file0_pdf.txt \| tr -d '\r' \| fgrep -f - ~/results/CC-MAIN-2019-35/seg0_file0_lengths.tsv
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	219 11588 10913 http://bdds.deux-sevres.com/recherche/simple/Editeur/2/Belfond/vignette?format=pdf
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	220 1048979 1048576 https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	221
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	222 Here's the short one:
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	223 WARC/1.0
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	224 WARC-Type: response
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	225 WARC-Date: 2019-08-17T22:40:17Z
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	226 WARC-Record-ID: <urn:uuid:ea98167b-c42a-4233-b57e-994aa627e38a>
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	227 Content-Length: 11588
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	228 Content-Type: application/http; msgtype=response
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	229 WARC-Warcinfo-ID: <urn:uuid:f689f8d0-24f3-4824-9a38-4f3fee422a4e>
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	230 WARC-Concurrent-To: <urn:uuid:2d51c956-0012-4d78-affc-8f57fe9d2e15>
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	231 WARC-IP-Address: 92.175.114.24
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	232 WARC-Target-URI: http://bdds.deux-sevres.com/recherche/simple/Editeur/2/Belfond/vignette?format=pdf
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	233 WARC-Payload-Digest: sha1:7VVIUDQ4Q6XKNOAURYU4VTMRSZNPHDQA
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	234 WARC-Block-Digest: sha1:OSTWXLV772XNHS22T4UBSCSJAAXM2J6T
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	235 WARC-Identified-Payload-Type: application/pdf
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	236
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	237 HTTP/1.1 200 OK
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	238 Cache-Control: must-revalidate, post-check=0, pre-check=0,no-cache
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	239 Pragma: public,no-cache
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	240 Content-Type: application/pdf",text/html; charset=utf-8
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	241 X-Crawler-Content-Encoding: gzip
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	242 Expires: 0
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	243 Server:
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	244 X-Powered-By:
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	245 Set-Cookie: 166d74d734106ba68b20ea303011f622=301619e3fe31ecb98c8473f0ff5f35a2; path=/
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	246 Content-Disposition: attachment; filename="Mdiathque dpartementale des Deux-Svres - Rsultats de la recherche Belfond.pdf"
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	247 Content-Transfer-Encoding: binary
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	248 P3P: CP="NOI ADM DEV PSAi COM NAV OUR OTRo STP IND DEM"
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	249 X-Content-Encoded-By:
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	250 X-Powered-By:
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	251 Date: Sat, 17 Aug 2019 22:40:16 GMT
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	252 X-Crawler-Content-Length: 5448
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	253 Content-Length: 10913
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	254
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	255 %PDF-1.7
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	256 %<E2><E3><CF><D3>
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	257 7 0 obj
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	258 << /Type /Page /Parent 1 0 R /LastModified (D:20190818004016+02'00') /Resources 2
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	259 0 R /MediaBox [0.000000 0.000000 595.276000 841.890000] /CropBox [0.000000 0.000
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	260 000 595.276000 841.890000] /BleedBox [0.000000 0.000000 595.276000 841.890000] /T
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	261 rimBox [0.000000 0.000000 595.276000 841.890000] /ArtBox [0.000000 0.000000 595.2
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	262 76000 841.890000] /Contents 8 0 R /Rotate 0 /Group << /Type /Group /S /Transparen
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	263 cy /CS /DeviceRGB >> /PZ 1 >>
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	264 endobj
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	265 8 0 obj
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	266
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	267 >: uz 1566027313501.0/orig/warc/*-00000.warc.gz\|tail -n +1823434 \| tail -n +24 \| head -c 20000 > ~/results/CC-MAIN-2019-35/mediatheque.pdf
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	268 >: ps2ascii mediatheque.pdf
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	269 Médiathèque départementale des Deux-Sèvres - Résultats de la recherche Belfond
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	270
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	271 Médiathèque départementale des Deux-Sèvres - Résultats de
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	272 la recherche Belfond
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	273 A charge de revanche
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	274 Titre :
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	275 Auteur : Grippando, James (1958-....)
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	276 ...
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	277 etc., three pages, no errors
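
The tail/head extraction above over-reads (head -c 20000 for a 10913-byte payload), which ps2ascii tolerated. A tighter cut is possible from the record itself. A minimal sketch, assuming the whole uncompressed response record is already in memory as bytes and that the HTTP Content-Length gives the stored payload size (10913 in the record above); the function names are hypothetical:

```python
def header_value(headers: bytes, name: bytes):
    """Return the value of the first header called `name` (case-insensitive), or None."""
    for line in headers.split(b"\r\n"):
        key, sep, value = line.partition(b":")
        if sep and key.strip().lower() == name.lower():
            return value.strip()
    return None


def split_response_record(record: bytes):
    """Split one WARC response record into (warc_headers, http_headers, payload)."""
    warc_headers, _, rest = record.partition(b"\r\n\r\n")
    http_headers, _, payload = rest.partition(b"\r\n\r\n")
    declared = header_value(http_headers, b"Content-Length")
    if declared is not None:
        # Trim the record terminator (and anything else) beyond the declared payload size.
        payload = payload[:int(declared)]
    return warc_headers, http_headers, payload


# Made-up miniature record, just to exercise the split.
body = b"%PDF-1.7\n%%EOF\n"
record = (
    b"WARC/1.0\r\nWARC-Type: response\r\n\r\n"
    b"HTTP/1.1 200 OK\r\nContent-Length: " + str(len(body)).encode() + b"\r\n\r\n"
    + body + b"\r\n\r\n"
)
warc_headers, http_headers, payload = split_response_record(record)
```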
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	278
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	279 >: uz 1566027313501.0/orig/warc/*-00000.warc.gz\|fgrep -an https://museum.wrap.gov.tw/GetFile4.ashx
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	280 38896837:WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	281 38896858:WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	282 38904590:WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	283 >: uz 1566027313501.0/orig/warc/*-00000.warc.gz\|tail -n +38896858 \| egrep -an '^%%EOF'
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	284 27:%%EOF
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	285 1114658:%%EOF
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	286 1313299:%%EOF
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	287
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	288 Hunh?
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	289
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	290 >: uz 1566027313501.0/orig/warc/*-00000.warc.gz\|tail -n +38896858 \| egrep -an '^(%%EOF\|WARC)' \| head -30
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	291 1:WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	292 2:WARC-Payload-Digest: sha1:SZ53DQQHENC7DDN7GQ5IS7VMEPAXAMBE
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	293 3:WARC-Block-Digest: sha1:QTKJA6A7445Z7264K2YAFBUUM2OYH2T2
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	294 4:WARC-Truncated: length
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	295 5:WARC-Identified-Payload-Type: application/pdf
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	296 27:%%EOF
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	297 7725:WARC/1.0
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	298 7726:WARC-Type: metadata
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	299 7727:WARC-Date: 2019-08-17T22:59:14Z
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	300 7728:WARC-Record-ID: <urn:uuid:77df2747-e567-45d3-8646-3069ae9a9f25>
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	301 7731:WARC-Warcinfo-ID: <urn:uuid:f689f8d0-24f3-4824-9a38-4f3fee422a4e>
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	302 7732:WARC-Concurrent-To: <urn:uuid:eceb4adc-d81e-4497-82fe-eea61ce171f4>
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	303 7733:WARC-Target-URI: https://museum.wrap.gov.tw/GetFile4.ashx?Serial=201609200919D005
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	304 7739:WARC/1.0
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	305
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	306 OK, so indeed truncated after 7700 lines or so...
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	307 >: uz 1566027313501.0/orig/warc/*-00000.warc.gz\|tail -n +38896858 \| tail -n +21 \| head -c 1048576 > ~/results/CC-MAIN-2019-35/museum.pdf
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	308 >: ps2ascii ~/results/CC-MAIN-2019-35/museum.pdf
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	309 **** Error: An error occurred while reading an XREF table.
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	310 **** The file has been damaged.
6ae6a21ccfb9 more downloads, Henry S. Thompson <ht@inf.ed.ac.uk> parents: 42 diff changeset	311 Look in big_pdf?
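
Note that the stored payload length, 1048576, is exactly 2**20 (1 MiB), which fits the WARC-Truncated: length header. A rough truncation check can be sketched as follows. The function name is hypothetical, and looking for %%EOF only in the last 1024 bytes is a heuristic assumption: as seen above, an early %%EOF (line 27) does not mean the file is complete.

```python
def looks_truncated(warc_headers: bytes, payload: bytes) -> bool:
    """Heuristic: flagged by the WARC header, or no %%EOF near the end of the PDF."""
    flagged = any(
        line.lower().startswith(b"warc-truncated:")
        for line in warc_headers.split(b"\r\n")
    )
    has_trailing_eof = b"%%EOF" in payload[-1024:]
    return flagged or not has_trailing_eof


# Made-up payloads: one complete, one cut off well after an early %%EOF.
complete_pdf = b"%PDF-1.7\n1 0 obj\nendobj\n%%EOF\n"
cut_off_pdf = b"%PDF-1.7\n%%EOF\n" + b"x" * 4096

plain_headers = b"WARC/1.0\r\nWARC-Type: response"
flagged_headers = plain_headers + b"\r\nWARC-Truncated: length"
```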
44 7209df5fa5b4 turn attention to nutch-cc and its Cdx code Henry S. Thompson <ht@inf.ed.ac.uk> parents: 43 diff changeset	312
7209df5fa5b4 turn attention to nutch-cc and its Cdx code Henry S. Thompson <ht@inf.ed.ac.uk> parents: 43 diff changeset	313 ====Modify the original CC indexer to write new indices including lastmod=====
7209df5fa5b4 turn attention to nutch-cc and its Cdx code Henry S. Thompson <ht@inf.ed.ac.uk> parents: 43 diff changeset	314 Looks like WarcRecordWriter.write, in
7209df5fa5b4 turn attention to nutch-cc and its Cdx code Henry S. Thompson <ht@inf.ed.ac.uk> parents: 43 diff changeset	315 src/nutch-cc/src/java/org/commoncrawl/util/WarcRecordWriter, is what
7209df5fa5b4 turn attention to nutch-cc and its Cdx code Henry S. Thompson <ht@inf.ed.ac.uk> parents: 43 diff changeset	316 needs to be edited to include LastModified date
7209df5fa5b4 turn attention to nutch-cc and its Cdx code Henry S. Thompson <ht@inf.ed.ac.uk> parents: 43 diff changeset	317
7209df5fa5b4 turn attention to nutch-cc and its Cdx code Henry S. Thompson <ht@inf.ed.ac.uk> parents: 43 diff changeset	318 To rebuild nutch-cc, particularly to recompile jar files after editing
7209df5fa5b4 turn attention to nutch-cc and its Cdx code Henry S. Thompson <ht@inf.ed.ac.uk> parents: 43 diff changeset	319 anything:
7209df5fa5b4 turn attention to nutch-cc and its Cdx code Henry S. Thompson <ht@inf.ed.ac.uk> parents: 43 diff changeset	320
7209df5fa5b4 turn attention to nutch-cc and its Cdx code Henry S. Thompson <ht@inf.ed.ac.uk> parents: 43 diff changeset	321 >: cd $HHOME/src/nutch-cc
7209df5fa5b4 turn attention to nutch-cc and its Cdx code Henry S. Thompson <ht@inf.ed.ac.uk> parents: 43 diff changeset	322 >: ant
7209df5fa5b4 turn attention to nutch-cc and its Cdx code Henry S. Thompson <ht@inf.ed.ac.uk> parents: 43 diff changeset	323
7209df5fa5b4 turn attention to nutch-cc and its Cdx code Henry S. Thompson <ht@inf.ed.ac.uk> parents: 43 diff changeset	324 Fixed deprecation bug in WarcCdxWriter.java
7209df5fa5b4 turn attention to nutch-cc and its Cdx code Henry S. Thompson <ht@inf.ed.ac.uk> parents: 43 diff changeset	325
7209df5fa5b4 turn attention to nutch-cc and its Cdx code Henry S. Thompson <ht@inf.ed.ac.uk> parents: 43 diff changeset	326 Modified src/java/org/commoncrawl/util/WarcCdxWriter.java
7209df5fa5b4 turn attention to nutch-cc and its Cdx code Henry S. Thompson <ht@inf.ed.ac.uk> parents: 43 diff changeset	327 to include lastmod
7209df5fa5b4 turn attention to nutch-cc and its Cdx code Henry S. Thompson <ht@inf.ed.ac.uk> parents: 43 diff changeset	328
7209df5fa5b4 turn attention to nutch-cc and its Cdx code Henry S. Thompson <ht@inf.ed.ac.uk> parents: 43 diff changeset	329 Can run just one test, which should allow testing this:
7209df5fa5b4 turn attention to nutch-cc and its Cdx code Henry S. Thompson <ht@inf.ed.ac.uk> parents: 43 diff changeset	330
7209df5fa5b4 turn attention to nutch-cc and its Cdx code Henry S. Thompson <ht@inf.ed.ac.uk> parents: 43 diff changeset	331 >: ant test-core -Dtestcase='TestWarcRecordWriter'
45 737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	332
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	333 Logic is tricky, and there's no easy way in
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	334
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	335 Basically, tools/WarcExport.java launches a hadoop job based on a
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	336 hadoop-runnable WarcExport instance. Hadoop will in due course call
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	337 ExportReducer.reduce, which will create an instance of WarcCapture
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	338 "for each page capture", and call ExportMapper.context.write with that instance (via
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	339 some configuration magic with the hadoop job Context). That in turn
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	340 uses (more magic) WarcOutputFormat.getRecordWriter, which
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	341 (finally!) calls a previously created WarcRecordWriter
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	342 instance.write(the capture).
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	343
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	344 So to fake a test case, I need to build
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	345 1) a WarcRecordWriter instance
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	346 2) a WarcCapture instance
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	347 and then invoke 1.write(2)
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	348
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	349 Got that working, although still can't figure out where in the normal
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	350 flow the metadata entry for Response.CONTENT_TYPE gets set.
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	351
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	352 Now, add a test that takes a stream of WARC Response extracts and
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	353 rewrites their index entries
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	354
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	355 >: head -8804 <(uz /beegfs/common_crawl/CC-MAIN-2019-35/cdx/warc/cdx-00150.gz)\|tail -10\| ix.py -h -w -x > /tmp/hst/headers.txt
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	356 >: cp /tmp/hst/headers.txt src/test/org/commoncrawl/util/
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	357 >: shuf /tmp/hst/headers.txt > src/test/org/commoncrawl/util/headers_mixed.txt
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	358
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	359 Won't quite work :-(
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	360 How do we reconstruct the Warc filename, offset and length from the
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	361 original index?
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	362
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	363 Well, we can find the individual .warc.gz records!
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	364 Thanks to https://stackoverflow.com/a/37042747/2595465
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	365
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	366 >: ~/lib/python/unpackz.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz > /tmp/hst/recs.txt
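
One way to recover (offset, length) for each gzip member of a multi-member .warc.gz is sketched below. This is not unpackz.py itself, and I have not reproduced the linked answer: it is a minimal sketch using zlib.decompressobj with its eof and unused_data attributes, and the function name gzip_members is hypothetical. It is written so that a member ending exactly at the end of a read buffer, and a short final read, are both handled.

```python
import gzip
import io
import zlib


def gzip_members(stream, bufsize=8192):
    """Yield (offset, length) in compressed bytes for each gzip member in `stream`."""
    offset = 0      # compressed offset where the current member starts
    consumed = 0    # compressed bytes fed to the current member so far
    decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)  # | 16 selects gzip framing
    pending = b""   # bytes already read that belong to the next member
    while True:
        data = pending or stream.read(bufsize)
        pending = b""
        if not data:
            break
        decomp.decompress(data)  # decompressed output is discarded
        if decomp.eof:
            # Only part of `data` belonged to this member; the rest starts the next one.
            consumed += len(data) - len(decomp.unused_data)
            yield offset, consumed
            offset += consumed
            consumed = 0
            pending = decomp.unused_data  # may be empty if the member ended on a buffer edge
            decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)
        else:
            consumed += len(data)
    if consumed:
        raise EOFError("input ended in the middle of a gzip member")


# Made-up data: three members, read with a buffer exactly the size of the first one.
payloads = [b"first record\n" * 50, b"second\n", b"third record\n" * 500]
members = [gzip.compress(p) for p in payloads]
blob = b"".join(members)
spans = list(gzip_members(io.BytesIO(blob), bufsize=len(members[0])))
```

Each (offset, length) pair can then be used to slice out one member and decompress it on its own.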
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	367
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	368 Nearly working, got 1/3rd of the way through a single WARC and then failed:
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	369
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	370 >: n=0 && ~/lib/python/unpackz.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz 2>/tmp/hst/tt.txt\|while read o l; do echo $((n+=1)); echo $o $l >> /tmp/hst/r3a; ix.py $l $o CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz\| wc -l; done
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	371 ...
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	372 20
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	373 10215
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	374 CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	375 Process fail: Compressed file ended before the end-of-stream marker was reached, input:
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	376 length=10762, offset=60784640, file=/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	377
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	378 >: head -10217 /tmp/hst/r3a \| tail -4
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	379 60784173 467
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	380 60784640 10762
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	381 60795402 463
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	382 60795865 460
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	383 >: ix.py 467 60784173 CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz\|fgrep Target
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	384 WARC-Target-URI: http://drycarerestoration.co/corner-furniture-piece/unique-corner-decoration-pieces-or-corner-furniture-pieces-corner-corner-furniture-piece-corner-furniture-pieces-bedroom/
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	385
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	386 >: zcat /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/cdx/warc/cdx.gz
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	387 ...
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	388 co,drycarerestoration)/corner-furniture-piece/unique-corner-decoration-pieces-or-corner-furniture-pieces-corner-corner-furniture-piece-corner-furniture-pieces-bedroom 20190819020224 {"url": "http://drycarerestoration.co/corner-furniture-piece/unique-corner-decoration-pieces-or-corner-furniture-pieces-corner-corner-furniture-piece-corner-furniture-pieces-bedroom/", "mime": "text/html", "mime-detected": "text/html", "status": "200", "digest": "DTKGJL45XQDXUS7PTXPYR6POMPLG46RZ", "length": "2570", "offset": "60784640", "filename": "crawl-data/CC-MAIN-2019-35/segments/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz", "charset": "UTF-8", "languages": "eng"}
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	389 >: ix.py 2570 60784640 CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz\|less
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	390 >: echo $((10762 - 2570))
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	391 8192
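
The comparison against the index can be sketched in Python. A minimal sketch, assuming CDX lines of the shape shown above (SURT key, timestamp, then a JSON object); the function name is hypothetical and the example line is shortened with a made-up URL and filename, keeping only the real offset and length:

```python
import json


def parse_cdx_line(line: str):
    """Return (surt, timestamp, offset, length, filename) from one CDX index line."""
    surt, timestamp, fields_json = line.split(" ", 2)
    fields = json.loads(fields_json)
    return surt, timestamp, int(fields["offset"]), int(fields["length"]), fields["filename"]


# Shortened, partly made-up line; offset and length are the values from the index above.
cdx_line = (
    'org,example)/page 20190819020224 '
    '{"url": "http://example.org/page", "status": "200", "length": "2570", '
    '"offset": "60784640", "filename": "crawl-data/example.warc.gz"}'
)
surt, timestamp, offset, length, filename = parse_cdx_line(cdx_line)

reported_length = 10762              # what the unpacking run reported for this offset
excess = reported_length - length    # 8192, i.e. one 8K buffer too many
```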
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	392
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	393 Ah, the error I was dreading :-( I _think_ this happens when an
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	394 individual record ends exactly on an 8K boundary.
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	395
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	396 Yes:
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	397
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	398 >: echo $((60784640 % 8192))
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	399 0
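
Spelling out the arithmetic behind that hypothesis, using only the numbers recorded above (the variable names are just for illustration): the preceding record ends exactly where an 8K read buffer ends, and the over-long length is out by exactly one buffer.

```python
BUF = 8192

prev_offset, prev_length = 60784173, 467   # record just before the failure
bad_offset, bad_length = 60784640, 10762   # as reported by the unpacking run
cdx_length = 2570                          # length for the same offset in the CDX index

prev_end = prev_offset + prev_length       # where the preceding record ends
on_boundary = prev_end % BUF == 0          # True: it ends on a buffer edge
overshoot = bad_length - cdx_length        # 8192: exactly one buffer
```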
737c61f98cbf foo Henry S. Thompson <ht@inf.ed.ac.uk> parents: 44 diff changeset	400
46 49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	401 Even with buffer 1MB:
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	402 21
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	403 160245
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	404 CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	405 Process fail: Compressed file ended before the end-of-stream marker was reached, input:
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	406 length=8415, offset=1059033915, file=/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	407 0
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	408 160246
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	409
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	410 >: tail -60 /tmp/hst/r3b\|head -20
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	411 1059013061 423
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	412 1059013484 7218
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	413 1059020702 425
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	414 1059021127 424
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	415 1059021551 11471
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	416 1059033022 426
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	417 1059033448 467
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	418 1059033915 8415
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	419
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	420 Argh. This is at the _same_ point (51 records before EOF). Ah,
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	421 maybe that's the point -- this is the last read before EOF, and it's
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	422 not a full buffer!
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	423
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	424 >: ix.py 467 1059033448 CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz\|less
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	425 ...
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	426 WARC-Target-URI: https://zowiecarrpsychicmedium.com/tag/oracle/
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	427
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	428 Reran with more instrumentation, took at least all day:
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	429
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	430 >: n=0 && ~/lib/python/unpackz.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz 2> /tmp/hst/r3e_err.txt \| while read o l; do
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	431 echo $((n+=1)); echo $o $l >> /tmp/hst/r3e_val; ix.py $l $o CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz \| wc -l;
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	432 done > /tmp/hst/r3e_log 2>&1
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	433 >: wc -l /tmp/hst/r3e_err.txt
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	434 160296 /tmp/hst/r3e_err.txt
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	435 >: tail -60 /tmp/hst/r3e_err.txt\|cat -n \| grep -C2 True\ True
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	436 7 b 28738 28738 28312 426 False False
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	437 8 b 28312 28312 27845 467 False False
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	438 9 b 27845 378162 369747 8415 True True <- this is the first to hit the last
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	439 (partial) block
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	440 10 b 369747 369747 369312 435 False True
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	441 11 b 369312 369312 368878 434 False True
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	442
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	443 >: tail -55 /tmp/hst/r3e_val \| head -3
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	444 1059033022 426
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	445 1059033448 467
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	446 1059033915 8415
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	447 >: dd ibs=1 skip=1059033022 count=426 if=/beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz of=/dev/stdout \| uz -t
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	448 ...
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	449 426 bytes copied, 0.00468243 s, 91.0 kB/s
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	450 sing<3411>: dd ibs=1 skip=1059033448 count=467 if=/beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz of=/dev/stdout \| uz -t
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	451 ...
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	452 467 bytes copied, 0.00382692 s, 122 kB/s
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	453 sing<3412>: dd ibs=1 skip=1059033915 count=8415 if=/beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz of=/dev/stdout \| uz -t
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	454 igzip: Error (null) does not contain a complete gzip file
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	455 ...
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	456 8415 bytes (8.4 kB, 8.2 KiB) copied, 0.00968889 s, 869 kB/s
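[The dd | uz -t check above can also be done directly in Python. This is only a sketch, not part of unpackz.py: the function name member_is_complete is made up for illustration, and it assumes each WARC record is stored as one gzip member. It reads length bytes at offset and reports whether they form exactly one complete member.]

```python
import zlib


def member_is_complete(path, offset, length):
    """True if bytes [offset, offset+length) of path are exactly one gzip member."""
    with open(path, 'rb') as f:
        f.seek(offset)
        chunk = f.read(length)
    # wbits=31 tells zlib to expect a gzip header and trailer.
    d = zlib.decompressobj(wbits=31)
    try:
        d.decompress(chunk)
    except zlib.error:
        # Not a gzip header at this offset, or corrupt data.
        return False
    # eof is set only once the trailer has been seen; unused_data holds
    # any bytes that follow the end of the member.
    return d.eof and not d.unused_data
```

[A slice that stops short of the end of the member leaves eof False, which would correspond to the "does not contain a complete gzip file" complaint above.]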
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	457
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	458 So, tried one change to use the actual size rather than BUFSIZE at
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	459 one point, seems to work now:
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	460
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	461 >: time ~/lib/python/unpackz.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz 2> /tmp/hst/r3f_err.txt \| tee /tmp/hst/r3f_val \| while read l o; do printf '%s\t%s\t%s\n' $l $o 'CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz';
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	462 done 2>&1 \| tee /tmp/hst/r3f_log \| ix.py -w \| egrep -c '^WARC/1\.0'
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	463 160296
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	464 real 3m48.393s
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	465 user 0m47.997s
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	466 sys 0m26.641s
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	467
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	468 >: tail /tmp/hst/r3f_val
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	469 10851 1059370472
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	470 475 1059381323
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	471 444 1059381798
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	472 22437 1059382242
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	473 447 1059404679
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	474 506 1059405126
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	475 15183 1059405632
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	476 471 1059420815
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	477 457 1059421286
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	478 17754 1059421743
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	479
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	480 >: wc -l /tmp/hst/*_val
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	481 171 /tmp/hst/r3d_val
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	482 160297 /tmp/hst/r3e_val
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	483 160296 /tmp/hst/r3f_val
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	484 320764 total
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	485 >: uz /tmp/hst/head.warc.gz \|egrep -c '^WARC/1\.0.$'
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	486 171
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	487 >: tail -n 3 /tmp/hst/*_val
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	488 ==> /tmp/hst/r3d_val <==
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	489 454 1351795
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	490 414 1352249
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	491 0 1352663 [so the 171 above is bogus, and we're missing one]
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	492
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	493 ==> /tmp/hst/r3e_val <==
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	494 1059393441 457
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	495 1059393898 17754
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	496 0 [likewise bogus, so see below]
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	497
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	498 ==> /tmp/hst/r3f_val <==
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	499 471 1059420815
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	500 457 1059421286
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	501 17754 1059421743 [better, but still one missing]
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	502 >: uz /tmp/hst/head.warc.gz \|egrep '^WARC-Type: ' \| tee >(wc -l 1>&2) \| tail -4
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	503 WARC-Type: response
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	504 WARC-Type: metadata
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	505 WARC-Type: request
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	506 WARC-Type: response [missing]
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	507 171
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	508 >: ls -lt /tmp/hst/*_val
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	509 -rw-r--r-- 1 hst dc007 1977 Sep 29 09:27 /tmp/hst/r3d_val
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	510 -rw-r--r-- 1 hst dc007 2319237 Sep 28 14:28 /tmp/hst/r3f_val
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	511 -rw-r--r-- 1 hst dc007 2319238 Sep 27 19:41 /tmp/hst/r3e_val
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	512 >: ls -l ~/lib/python/unpackz.py
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	513 -rwxr-xr-x 1 hst dc007 1821 Sep 28 15:13 .../dc007/hst/lib/python/unpackz.py
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	514 So e and f are stale, rerun
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	515 >: time ~/lib/python/unpackz.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz 2>/tmp/hst/r3f_err.txt\| tee /tmp/hst/r3f_val\|while read l o; do printf '%s\t%s\t%s\n' $l $o 'CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz' ;done \|& tee /tmp/hst/r3f_log \|ix.py -w \|egrep '^WARC-Type: ' \| tail -4 &
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	516 >: Reading length, offset, filename tab-delimited triples from stdin...
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	517 WARC-Type: response
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	518 WARC-Type: metadata
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	519 WARC-Type: request
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	520 WARC-Type: response
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	521
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	522 real 3m49.760s
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	523 user 0m47.180s
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	524 sys 0m32.218s
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	525 So missing the final metadata...
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	526 Back to head.warc.gz, with debug info
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	527
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	528 >: n=0 && ~/lib/python/unpackz.py /tmp/hst/head.warc.gz 2>/tmp/hst/ttd.txt\|while read l o; do echo $((n+=1)); echo $l $o >> /tmp/hst/r3d_val; dd ibs=1 skip=$o count=$l if=/tmp/hst/head.warc.gz of=/dev/stdout 2>/tmp/hst/r3d_ido\| uz -t ; done >/tmp/hst/r3d_log 2>&1
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	529 >: tail -2 /tmp/hst/r3d_log
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	530 171
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	531 igzip: Error invalid gzip header found for file (null)
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	532 >: tail -n 3 /tmp/hst/ttd.txt /tmp/hst/r3d_val
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	533 ==> /tmp/hst/ttd.txt <==
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	534 b 9697 9697 9243 454 False True
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	535 b 9243 9243 8829 414 False True
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	536 n 8829
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	537
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	538 ==> /tmp/hst/r3d_val <==
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	539 454 1351795
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	540 414 1352249
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	541 0 1352663
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	542
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	543 >: cat -n /tmp/hst/r3f_val \| head -172 \| tail -4
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	544 169 454 1351795
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	545 170 414 1352249
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	546 171 8829 1352663
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	547 172 446 1361492
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	548
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	549 Fixed, maybe
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	550
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	551 >: tail -n 3 /tmp/hst/r3d_log /tmp/hst/r3d_val
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	552 ==> /tmp/hst/r3d_log <==
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	553 169
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	554 170
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	555 171
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	556
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	557 ==> /tmp/hst/r3d_val <==
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	558 454 1351795
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	559 414 1352249
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	560 8829 1352663
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	561
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	562 Yes!
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	563
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	564 >: time ~/lib/python/unpackz.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz 2>/tmp/hst/r3f_err\| tee /tmp/hst/r3f_val\|while read l o; do printf '%s\t%s\t%s\n' $l $o 'CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz' ;done \|& tee /tmp/hst/r3f_log \|ix.py -w \|egrep '^WARC-Type: ' \| tail -4
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	565 Reading length, offset, filename tab-delimited triples from stdin...
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	566 WARC-Type: metadata
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	567 WARC-Type: request
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	568 WARC-Type: response
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	569 WARC-Type: metadata
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	570
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	571 real 3m26.042s
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	572 user 0m44.167s
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	573 sys 0m24.716s
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	574 >: tail -n 3 /tmp/hst/r3f*
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	575 ==> /tmp/hst/r3f_err <==
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	576
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	577 ==> /tmp/hst/r3f_val <==
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	578 457 1059421286
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	579 17754 1059421743
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	580 425 1059439497
49672e9b4c1c unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 45 diff changeset	581
47 fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	582 Doubling buffer size doesn't speed up
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	583 >: time ~/lib/python/unpackz.py -b $((2 * 1024 * 1024)) /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz 2>/tmp/hst/r3g_err\| tee /tmp/hst/r3g_val\|while read l o; do printf '%s\t%s\t%s\n' $l $o 'CC-MAIN-2019-35/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz' ;done \|& tee /tmp/hst/r3g_log \|ix.py -w \|egrep '^WARC-Type: ' \| tail -4
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	584 Reading length, offset, filename tab-delimited triples from stdin...
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	585 WARC-Type: metadata
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	586 WARC-Type: request
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	587 WARC-Type: response
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	588 WARC-Type: metadata
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	589
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	590 real 3m34.519s
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	591 user 0m52.312s
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	592 sys 0m24.875s
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	593
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	594 Tried using FileIO.readinto([a fixed buffer]), but didn't immediately
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	595 work. Abandoned because I still don't understand how zlib.decompress
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	596 works at all...
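[For reference, a minimal sketch of how zlib can walk a file of concatenated gzip members and report (length, offset) pairs in the same order as the _val files above. This is not the code in unpackz.py; gzip_members and its bufsize parameter are illustrative names. The relevant zlib behaviour: a decompressobj sets eof when one member ends, and unused_data then holds whatever bytes in the buffer belong to the next member. Note that the byte count uses len(buf), the size actually read, not bufsize, which matters for the final partial buffer.]

```python
import zlib


def gzip_members(f, bufsize=1024 * 1024):
    """Yield (length, offset) for each gzip member in binary file object f."""
    offset = 0      # file offset at which the current member starts
    consumed = 0    # compressed bytes of the current member seen so far
    d = zlib.decompressobj(wbits=31)   # 31 = gzip header and trailer
    buf = f.read(bufsize)
    while buf:
        d.decompress(buf)              # decompressed data is discarded here
        if d.eof:
            # The member ended inside buf; the tail belongs to the next one.
            rest = d.unused_data
            consumed += len(buf) - len(rest)
            yield consumed, offset
            offset += consumed
            consumed = 0
            d = zlib.decompressobj(wbits=31)
            buf = rest or f.read(bufsize)
        else:
            consumed += len(buf)       # actual bytes read, not bufsize
            buf = f.read(bufsize)
    if consumed:
        raise ValueError('truncated gzip member at offset %d' % offset)
```

[Each new member needs a fresh decompressobj, since one object stops consuming input once eof is set.]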
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	597
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	598 Time to convert unpackz to a library which takes a callback as an
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	599 alternative to an output file -- Done
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	600
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	601 W/o using callback, timing and structure for what we need for
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	602 re-indexing task looks encouraging:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	603 >: time ~/lib/python/cc/unpackz.py -o /dev/stdout /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz \|egrep -aA20 '^WARC-Type: response' \| cut -f 1 -d ' ' \| egrep -a '^WARC-' \|sus \| tee >(wc -l 1>&2)
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	604 52468 WARC-Block-Digest:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	605 52468 WARC-Concurrent-To:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	606 52468 WARC-Date:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	607 52468 WARC-Identified-Payload-Type:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	608 52468 WARC-IP-Address:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	609 52468 WARC-Payload-Digest:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	610 52468 WARC-Record-ID:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	611 52468 WARC-Target-URI:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	612 52468 WARC-Type:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	613 52468 WARC-Warcinfo-ID:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	614 236 WARC-Truncated:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	615 11
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	616
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	617 real 0m20.308s
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	618 user 0m19.720s
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	619 sys 0m4.505s
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	620
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	621 Whole thing, with no pre-filtering:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	622
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	623 >: time ~/lib/python/cc/unpackz.py -o /dev/stdout /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz \| cut -f 1 -d ' ' \| egrep -a '^(WARC-\|Content-\|Last-Modified)' \|sus \| tee >(wc -l 1>&2)
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	624 211794 Content-Length:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	625 211162 Content-Type:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	626 159323 WARC-Target-URI:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	627 159311 WARC-Warcinfo-ID:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	628 159301 WARC-Record-ID:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	629 159299 WARC-Date:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	630 159297 WARC-Type:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	631 105901 WARC-Concurrent-To:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	632 105896 WARC-IP-Address:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	633 52484 WARC-Block-Digest:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	634 52484 WARC-Identified-Payload-Type:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	635 52482 WARC-Payload-Digest:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	636 9239 Last-Modified:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	637 3941 Content-Language:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	638 2262 Content-Security-Policy:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	639 642 Content-language:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	640 326 Content-Security-Policy-Report-Only:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	641 238 WARC-Truncated:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	642 114 Content-Disposition:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	643 352 Content-*:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	644 1 WARC-Filename:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	645 42
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	646
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	647 real 0m30.896s
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	648 user 0m37.335s
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	649 sys 0m7.542s
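[The cut | egrep | sus tally above could be done in Python along these lines. A sketch only: count_header_names is a made-up name, and it assumes sus is a sort-and-count helper (the notes do not define it). It takes the first space-delimited field of each line, as cut -f 1 -d ' ' does, and counts the ones that start with WARC-, Content- or Last-Modified.]

```python
import re
from collections import Counter

# Same pattern as the egrep above, applied to bytes.
HEADER = re.compile(rb'^(WARC-|Content-|Last-Modified)')


def count_header_names(lines):
    """Count header names in an iterable of raw (bytes) lines."""
    counts = Counter()
    for line in lines:
        # First space-delimited field, without the line ending.
        name = line.rstrip(b'\r\n').split(b' ', 1)[0]
        if HEADER.match(name):
            counts[name.decode('latin-1')] += 1
    return counts
```

[counts.most_common() then gives the names in descending order of frequency, similar to the listing above.]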
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	650
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	651 First 51 after WARC-Type: response
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	652
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	653 >: time ~/lib/python/cc/unpackz.py -o /dev/stdout /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz \|egrep -aA50 '^WARC-Type: response' \| cut -f 1 -d ' ' \| egrep -a '^(WARC-\|Content-\|Last-Modified)' \|sus \| tee >(wc -l 1>&2)
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	654 106775 Content-Length:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	655 106485 Content-Type:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	656 55215 WARC-Type:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	657 55123 WARC-Date:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	658 54988 WARC-Record-ID:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	659 54551 WARC-Warcinfo-ID:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	660 54246 WARC-Target-URI:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	661 54025 WARC-Concurrent-To:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	662 52806 WARC-IP-Address:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	663 52468 WARC-Block-Digest:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	664 52468 WARC-Identified-Payload-Type:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	665 52468 WARC-Payload-Digest:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	666 9230 Last-Modified:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	667 3938 Content-Language:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	668 2261 Content-Security-Policy:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	669 639 Content-language:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	670 324 Content-Security-Policy-Report-Only:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	671 236 WARC-Truncated:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	672 114 Content-Disposition:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	673 342 Content-*:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	674 41
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	675
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	676 real 0m21.483s
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	677 user 0m22.372s
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	678 sys 0m5.400s
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	679
58 3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	680 So, not worth the risk, let's try python: cdx_extras implements a
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	681 callback for unpackz that outputs the LM header if it's there
47 fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	682
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	683 >: time ~/lib/python/cc/cdx_extras.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz\|wc -l
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	684 9238
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	685
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	686 real 0m25.426s
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	687 user 0m23.201s
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	688 sys 0m0.711s
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	689
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	690 Looks good, but why 9238 instead of 9239???
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	691
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	692 >: ~/lib/python/cc/unpackz.py -o /dev/stdout /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz \| egrep -a '^Last-Modified: ' > /tmp/hst/lmo.tsv
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	693
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	694 Argh. Serious bug in unpackz, wasn't handling cross-buffer-boundary
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	695 records correctly. Fixed. Redoing the above...
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	696
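[A minimal sketch of the kind of cross-buffer-boundary handling involved,
assuming the input is a multi-member gzip file (one member per WARC record,
as in Common Crawl) read in fixed-size buffers. This is not the code from
unpackz.py: iter_members and bufsize are hypothetical names, and the actual
bug may have been of a different shape.]

```python
import zlib

def iter_members(stream, bufsize=1 << 16):
    """Yield the decompressed bytes of each gzip member found in stream.

    A member may end in the middle of a buffer or span several buffers;
    both cases are handled by carrying zlib's unused_data forward.
    """
    d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)  # 16 => gzip framing
    parts = []
    buf = stream.read(bufsize)
    while buf:
        parts.append(d.decompress(buf))
        if d.eof:
            # Member complete: emit it, then restart on whatever followed it
            yield b"".join(parts)
            parts = []
            buf = d.unused_data  # start of the next member, if already read
            d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
            if not buf:
                buf = stream.read(bufsize)
        else:
            # Member continues past the end of this buffer: read more
            buf = stream.read(bufsize)
```

[Calling it with a deliberately tiny bufsize is a cheap way to force every
record across a boundary when testing.]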
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	697 No pre-filter:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	698 >: uz /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz\|egrep -c '^WARC/1\.0.$'
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	699 160297
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	700
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	701 >: time ~/lib/python/cc/unpackz.py -o /dev/stdout /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz \| cut -f 1 -d ' ' \| egrep -a '^(WARC-\|Content-\|Last-Modified)' \|sus \| tee >(wc -l 1>&2)
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	702
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	703 213719 Content-Length:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	704 213088 Content-Type:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	705 160297 WARC-Date:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	706 160297 WARC-Record-ID:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	707 160297 WARC-Type:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	708 160296 WARC-Target-URI:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	709 160296 WARC-Warcinfo-ID:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	710 106864 WARC-Concurrent-To:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	711 106864 WARC-IP-Address:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	712 53432 WARC-Block-Digest: [consistent with 160297 == (3 * 53432) + 1]
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	713 53432 WARC-Identified-Payload-Type:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	714 53432 WARC-Payload-Digest:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	715 9430 Last-Modified:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	716 4006 Content-Language:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	717 2325 Content-Security-Policy:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	718 653 Content-language:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	719 331 Content-Security-Policy-Report-Only:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	720 298 WARC-Truncated:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	721 128 Content-Disposition:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	722 83 Content-Location:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	723 67 Content-type:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	724 51 Content-MD5:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	725 45 Content-Script-Type:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	726 42 Content-Style-Type:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	727 31 Content-Transfer-Encoding:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	728 13 Content-disposition:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	729 8 Content-Md5:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	730 5 Content-Description:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	731 5 Content-script-type:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	732 5 Content-style-type:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	733 3 Content-transfer-encoding:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	734 2 Content-Encoding-handler:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	735 1 Content-DocumentTitle:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	736 1 Content-Hash:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	737 1 Content-ID:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	738 1 Content-Legth:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	739 1 Content-length:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	740 1 Content-Range:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	741 1 Content-Secure-Policy:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	742 1 Content-security-policy:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	743 1 Content-Type-Options:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	744 1 WARC-Filename:
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	745 42
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	746
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	747 real 0m28.876s
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	748 user 0m35.703s
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	749 sys 0m6.976s
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	750
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	751 >: ~/lib/python/cc/unpackz.py -o /dev/stdout /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz \| egrep -a '^Last-Modified: ' > /tmp/hst/lmo.tsv
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	752 >: wc -l /tmp/hst/lmo.tsv
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	753 9430 /tmp/hst/lmo.tsv
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	754 >: time ~/lib/python/cc/cdx_extras.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz > /tmp/hst/lm.tsv
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	755
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	756 real 0m17.191s
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	757 user 0m15.739s
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	758 sys 0m0.594s
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	759 >: wc -l /tmp/hst/lm.tsv
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	760 9423 /tmp/hst/lm.tsv
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	761
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	762 >: diff <(sed 's/^Last-Modified: //' /tmp/hst/lmo.tsv \| tr -d '\r') <(cut -f 3 /tmp/hst/lm.tsv)
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	763 853d852
58 3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	764 < Mon, 19 Aug 2019 01:46:49 GMT [in XML comment at very end of xHTML]
47 fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	765 4058d4056
58 3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	766 < Tue, 03 Nov 2015 21:31:18 GMT<br /> [in an HTML table]
47 fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	767 4405d4402
58 3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	768 < Mon, 19 Aug 2019 01:54:52 GMT [double lm]
47 fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	769 5237,5238d5233
58 3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	770 < 3 [bogus extension lines to preceding LM]
47 fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	771 < Asia/Amman
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	772 7009d7003
58 3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	773 < Mon, 19 Aug 2019 02:34:20 GMT [in XML comment at very end of xHTML]
47 fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	774 9198d9191
58 3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	775 < Mon, 19 Aug 2019 02:14:49 GMT [in XML comment at very end of xHTML]
47 fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	776
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	777 All good. The only implausible case is
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	778 < Mon, 19 Aug 2019 01:54:52 GMT
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	779 which turns out to be a case of two Last-Modified headers in the
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	780 same response record's HTTP headers. RFCs 2616 and 7230 rule it
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	781 out but neither specifies a recovery, so first-wins is as good as
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	782 anything, and indeed RFC 6797 specifies that for duplicate STS headers.
fbdaede4155a cdx_extras and unpackz.py working Henry S. Thompson <ht@inf.ed.ac.uk> parents: 46 diff changeset	783
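[A minimal sketch of a first-wins policy for repeated headers, assuming the
HTTP header block has already been split into text lines. first_wins is a
hypothetical helper, not taken from cdx_extras.py. Names are lower-cased
because the counts above show the same header in several capitalisations.]

```python
def first_wins(header_lines):
    """Map lower-cased header name -> first value seen.

    Later repeats of the same header name are ignored.
    """
    seen = {}
    for line in header_lines:
        name, sep, value = line.partition(":")
        if not sep:
            continue  # no colon: status line or junk, not "Name: value"
        seen.setdefault(name.strip().lower(), value.strip())
    return seen

# Illustrative (made-up) header block with a repeated Last-Modified
headers = first_wins([
    "HTTP/1.1 200 OK",
    "Last-Modified: Sun, 18 Aug 2019 10:00:00 GMT",
    "Content-language: en",
    "Last-Modified: Mon, 19 Aug 2019 01:54:52 GMT",
])
print(headers["last-modified"])
```

[Splitting on the first colon only keeps the time-of-day in the date value
intact.]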
48 f688c437180b thinking about merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 47 diff changeset	784 Start looking at how we do the merge of cdx_extras.py with existing index
f688c437180b thinking about merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 47 diff changeset	785
58 3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	786 ====2024-12-19====
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	787
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	788 The above test shows 17.6% of entries have an LM value
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	789 For a 3 billion entry dataset, that means 530 million LM entries, call
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	790 this n.
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	791
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	792 Sizes? For a 10% error rate, we need m bits = -n * ln(.1) / ln(2)^2
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	793
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	794 (- (/ (* n (log .1)) (* (log 2) (log 2)))) = 2,535,559,358 bits =~ 320MB
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	795
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	796 That's too much :-) Per segment, that becomes possible?
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	797 25,355,594 bits =~ 3.2MB
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	798
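[The sizing arithmetic above, as a small sketch. bloom_bits and
bloom_hashes are hypothetical helper names; the formulas are the standard
ones for an optimally sized Bloom filter. With n rounded to 530 million the
whole-dataset result comes out at roughly 2.54e9 bits, marginally above the
figure quoted, which presumably used the unrounded n.]

```python
import math

def bloom_bits(n, p):
    """Bits needed for n items at false-positive rate p: -n ln(p) / ln(2)^2."""
    return math.ceil(-n * math.log(p) / math.log(2) ** 2)

def bloom_hashes(n, m):
    """Number of hash functions that minimises the error rate: (m/n) ln 2."""
    return max(1, round(m / n * math.log(2)))

# Whole dataset vs. one segment (1/100th), at 10% and 5% error rates
for label, n in (("whole dataset", 530_000_000), ("one segment", 5_300_000)):
    for p in (0.1, 0.05):
        m = bloom_bits(n, p)
        print(f"{label}: p={p} bits={m:,} MB={m / 8 / 1e6:.1f} k={bloom_hashes(n, m)}")
```

[Size grows with ln(1/p), so halving the error rate from 0.1 to 0.05 costs
about 30% more bits rather than doubling them.]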
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	799 But maybe it's _not_ too much. One of the python implementations I
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	800 saw uses mmap:
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	801
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	802 https://github.com/prashnts/pybloomfiltermmap3
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	803
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	804 Build a Bloom filter with all the URIs whose entries have LM value
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	805 _and_ a python hashtable mapping from URI to LM and offset (is that
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	806 enough for deduping?)
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	807 Rewrite one index file at a time
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	808 Probe with each URI, if positive
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	809 look up in hashtable and use if found
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	810
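[The probe-then-look-up plan above can be sketched as follows. TinyBloom is
a toy pure-Python stand-in for the mmap-backed filter, written only so the
sketch is self-contained; build, lookup and the (uri, lm, offset) tuple
layout are likewise assumptions for illustration, not the eventual
implementation.]

```python
import hashlib
import math

class TinyBloom:
    """Toy in-memory Bloom filter (illustrative stand-in only)."""
    def __init__(self, capacity, error_rate):
        self.m = max(8, math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.k = max(1, round(self.m / capacity * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, key):
        # Double hashing: derive k bit positions from one 128-bit digest
        d = hashlib.blake2b(key.encode("utf-8"), digest_size=16).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:], "big") | 1
        return ((h1 + i * h2) % self.m for i in range(self.k))

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, key):
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(key))

def build(entries, error_rate=0.1):
    """entries: iterable of (uri, lm, offset) for records that have an LM value."""
    entries = list(entries)
    bloom = TinyBloom(max(1, len(entries)), error_rate)
    table = {}
    for uri, lm, offset in entries:
        bloom.add(uri)
        table[uri] = (lm, offset)
    return bloom, table

def lookup(bloom, table, uri):
    """Return (lm, offset) for uri, or None if there is no LM entry."""
    if uri not in bloom:
        return None        # definite miss: no dict access needed
    return table.get(uri)  # possible hit: the dict weeds out false positives
```

[Because the dict is consulted after every positive probe, a false positive
from the filter costs one wasted dict lookup but never a wrong LM value.]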
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	811 >: wc -l ks*.tsv
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	812 52369734 ks_0-9.tsv
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	813 52489306 ks_10-19.tsv
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	814 52381115 ks_20-29.tsv
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	815 52438862 ks_30-39.tsv
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	816 52512044 ks_40-49.tsv
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	817 52476964 ks_50-59.tsv
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	818 52317116 ks_60-69.tsv
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	819 52200680 ks_70-79.tsv
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	820 52382426 ks_80-89.tsv
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	821 52295136 ks_90-99.tsv
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	822 523863383 total
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	823
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	824 >>> from pybloomfilter import BloomFilter
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	825 >>> f=BloomFilter(523863383,0.1,'/tmp/hst/uris.bloom')
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	826 >>> def bff(f,fn):
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	827 ... with open(fn) as uf:
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	828 ... while (l:=uf.readline()):
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	829 ... f.add(l.split('\t')[2])
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	830 ...
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	831 >>> timeit.timeit("bff(f,'/dev/null')",number=1,globals=globals())
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	832 0.00012309104204177856
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	833 >>> timeit.timeit("bff(f,'/work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv')",number=1,globals=globals())
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	834 77.57737312093377
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	835 >>> 'http://71.43.189.10/dermorph' in f
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	836 False
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	837 >>> 'http://71.43.189.10/dermorph/' in f
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	838 True
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	839 >>> timeit.timeit("'http://71.43.189.10/dermorph/' in f",number=100000,globals=globals())
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	840 0.02377822808921337
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	841 >>> timeit.timeit("'http://71.43.189.10/dermorph' in f",number=100000,globals=globals())
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	842 0.019318239763379097
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	843
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	844 _That's_ encouraging...
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	845 Be sure to f.close()
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	846 Use BloomFilter.open for an existing bloom file
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	847 Copying a file from /tmp to work/... still gives good (quick) lookup,
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	848 but _creating and filling_ a file on work/... takes ... I stopped
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	849 waiting after an hour or so.
59 d9ba3ce783ff python dict testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 58 diff changeset	850
d9ba3ce783ff python dict testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 58 diff changeset	851 How much bigger is .05 false positive?
d9ba3ce783ff python dict testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 58 diff changeset	852 Less than expected:
d9ba3ce783ff python dict testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 58 diff changeset	853 >: ls -l /tmp/hst
d9ba3ce783ff python dict testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 58 diff changeset	854 -rwxr-xr-x 1 hst dc007 408301988 Jan 1 16:52 uris_20.bloom
d9ba3ce783ff python dict testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 58 diff changeset	855 -rwxr-xr-x 1 hst dc007 313830100 Jan 1 15:04 uris.bloom
d9ba3ce783ff python dict testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 58 diff changeset	856 And still same (?) fill time:
d9ba3ce783ff python dict testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 58 diff changeset	857 >>> g=BloomFilter(523863383,0.05,'/tmp/hst/uris_20.bloom')
d9ba3ce783ff python dict testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 58 diff changeset	858 >>> T=timeit.Timer("bff(g,'/work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv')",globals=globals())
d9ba3ce783ff python dict testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 58 diff changeset	859 >>> T.repeat(3,number=1)
d9ba3ce783ff python dict testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 58 diff changeset	860 [89.64385064691305, 90.9979057777673, 83.9632708914578]
d9ba3ce783ff python dict testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 58 diff changeset	861 Build a test harness wrt the python dict I'm going to need...
d9ba3ce783ff python dict testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 58 diff changeset	862 Can't immediately find a way to optimise a dict to have umpty millions
d9ba3ce783ff python dict testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 58 diff changeset	863 of entries...
d9ba3ce783ff python dict testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 58 diff changeset	864 >: cat /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv\|~/lib/python/cc/lmh/test.py -n 1000000 -r 5
d9ba3ce783ff python dict testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 58 diff changeset	865 1000002
d9ba3ce783ff python dict testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 58 diff changeset	866 1000002
d9ba3ce783ff python dict testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 58 diff changeset	867 1000002
d9ba3ce783ff python dict testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 58 diff changeset	868 1000002
d9ba3ce783ff python dict testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 58 diff changeset	869 1000002
d9ba3ce783ff python dict testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 58 diff changeset	870 [1.229693355038762, 1.3374222852289677, 1.3509841952472925, 1.080365838482976, 1.1893387716263533]
d9ba3ce783ff python dict testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 58 diff changeset	871 Full as-it-were segment:
d9ba3ce783ff python dict testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 58 diff changeset	872 >: ~/lib/python/cc/lmh/test.py -n 6000000 -r 5 -f /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv
d9ba3ce783ff python dict testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 58 diff changeset	873 6000002
d9ba3ce783ff python dict testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 58 diff changeset	874 6000002
d9ba3ce783ff python dict testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 58 diff changeset	875 6000002
d9ba3ce783ff python dict testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 58 diff changeset	876 6000002
d9ba3ce783ff python dict testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 58 diff changeset	877 6000002
d9ba3ce783ff python dict testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 58 diff changeset	878 [7.250897390767932, 7.237801244482398, 7.239673590287566, 7.32976414449513, 7.23588689416647]
d9ba3ce783ff python dict testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 58 diff changeset	879 Full 10th of the data:
d9ba3ce783ff python dict testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 58 diff changeset	880 >: ~/lib/python/cc/lmh/test.py -r 3 -f /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv
d9ba3ce783ff python dict testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 58 diff changeset	881 52369734
d9ba3ce783ff python dict testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 58 diff changeset	882 52369734
d9ba3ce783ff python dict testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 58 diff changeset	883 52369734
d9ba3ce783ff python dict testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 58 diff changeset	884 [69.63967163302004, 69.09140252694488, 66.49750975705683]
d9ba3ce783ff python dict testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 58 diff changeset	885 That's tolerable.
60 3be7b53d726e using python dict test Henry S. Thompson <ht@inf.ed.ac.uk> parents: 59 diff changeset	886 >: ~/lib/python/cc/lmh/test_hash.py -r 3 -f /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv
59 d9ba3ce783ff python dict testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 58 diff changeset	887 52369734
d9ba3ce783ff python dict testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 58 diff changeset	888 52369734
d9ba3ce783ff python dict testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 58 diff changeset	889 52369734
d9ba3ce783ff python dict testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 58 diff changeset	890 [64.51177835091949, 71.6610240675509, 67.74966451153159]
d9ba3ce783ff python dict testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 58 diff changeset	891 [0.0034751780331134796, 0.0034532323479652405, 0.0033454522490501404]
d9ba3ce783ff python dict testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 58 diff changeset	892 Last line is 100000 lookups.
60 3be7b53d726e using python dict test Henry S. Thompson <ht@inf.ed.ac.uk> parents: 59 diff changeset	893
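[A minimal sketch of a dict build-and-lookup timing harness of the kind
used above. It is not the code of test.py or test_hash.py: build_table,
time_build and time_lookups are hypothetical names, and taking the URI from
tab-separated field 2 is an assumption based on the bff function earlier in
these notes.]

```python
import timeit

def build_table(lines, uri_col=2, limit=None):
    """Build a dict keyed by URI from tab-separated lines."""
    table = {}
    for i, line in enumerate(lines):
        if limit is not None and i >= limit:
            break
        fields = line.rstrip("\n").split("\t")
        table[fields[uri_col]] = fields
    return table

def time_build(lines, repeat=3, **kw):
    """Time `repeat` independent builds; returns a list of seconds."""
    lines = list(lines)  # read once, so I/O is not included in the timing
    return timeit.repeat(lambda: build_table(lines, **kw), number=1, repeat=repeat)

def time_lookups(table, key, number=100000):
    """Time `number` membership tests for one key; returns total seconds."""
    return timeit.timeit(lambda: key in table, number=number)
```

[Holding the lines in memory first separates dict-construction cost from
file-reading cost, at the price of a larger footprint; for a full
52-million-line file, streaming from the open file may be the only option.]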
3be7b53d726e using python dict test Henry S. Thompson <ht@inf.ed.ac.uk> parents: 59 diff changeset	894 So, try a test:
3be7b53d726e using python dict test Henry S. Thompson <ht@inf.ed.ac.uk> parents: 59 diff changeset	895 >: time ~/lib/python/cc/lmh/test_hash.py -r 1 -p results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.pickle -f /work/dc007/dc007/hst/results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.tsv
3be7b53d726e using python dict test Henry S. Thompson <ht@inf.ed.ac.uk> parents: 59 diff changeset	896 52369734
3be7b53d726e using python dict test Henry S. Thompson <ht@inf.ed.ac.uk> parents: 59 diff changeset	897 [70.98342595621943]
3be7b53d726e using python dict test Henry S. Thompson <ht@inf.ed.ac.uk> parents: 59 diff changeset	898 [0.0037928372621536255]
3be7b53d726e using python dict test Henry S. Thompson <ht@inf.ed.ac.uk> parents: 59 diff changeset	899
3be7b53d726e using python dict test Henry S. Thompson <ht@inf.ed.ac.uk> parents: 59 diff changeset	900 real 1m51.456s
3be7b53d726e using python dict test Henry S. Thompson <ht@inf.ed.ac.uk> parents: 59 diff changeset	901 user 1m32.901s
3be7b53d726e using python dict test Henry S. Thompson <ht@inf.ed.ac.uk> parents: 59 diff changeset	902 sys 0m17.937s
3be7b53d726e using python dict test Henry S. Thompson <ht@inf.ed.ac.uk> parents: 59 diff changeset	903 >: ls -lh results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.pickle
3be7b53d726e using python dict test Henry S. Thompson <ht@inf.ed.ac.uk> parents: 59 diff changeset	904 -rw-r--r-- 1 hst dc007 5.5G Jan 2 12:19 results/CC-MAIN-2019-35/warc_lmhx/ks_0-9.pickle
3be7b53d726e using python dict test Henry S. Thompson <ht@inf.ed.ac.uk> parents: 59 diff changeset	905 cdx_out.write(b' ')
3be7b53d726e using python dict test Henry S. Thompson <ht@inf.ed.ac.uk> parents: 59 diff changeset	906 cdx_out.write(b' ')
3be7b53d726e using python dict test Henry S. Thompson <ht@inf.ed.ac.uk> parents: 59 diff changeset	907 >: time ~/lib/python/cc/lmh/test_lookup1.py
3be7b53d726e using python dict test Henry S. Thompson <ht@inf.ed.ac.uk> parents: 59 diff changeset	908 52369734
3be7b53d726e using python dict test Henry S. Thompson <ht@inf.ed.ac.uk> parents: 59 diff changeset	909 1076046 130318
3be7b53d726e using python dict test Henry S. Thompson <ht@inf.ed.ac.uk> parents: 59 diff changeset	910
3be7b53d726e using python dict test Henry S. Thompson <ht@inf.ed.ac.uk> parents: 59 diff changeset	911 real 1m52.668s
3be7b53d726e using python dict test Henry S. Thompson <ht@inf.ed.ac.uk> parents: 59 diff changeset	912 user 1m40.751s
3be7b53d726e using python dict test Henry S. Thompson <ht@inf.ed.ac.uk> parents: 59 diff changeset	913 sys 0m9.610s
3be7b53d726e using python dict test Henry S. Thompson <ht@inf.ed.ac.uk> parents: 59 diff changeset	914
3be7b53d726e using python dict test Henry S. Thompson <ht@inf.ed.ac.uk> parents: 59 diff changeset	915 Not bad. 1.5 minutes per file x 300 cdx files, plus 10 x 20 secs or so for the
3be7b53d726e using python dict test Henry S. Thompson <ht@inf.ed.ac.uk> parents: 59 diff changeset	916 unpickles =~ 453 minutes =~ 7.5 hours.
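
The estimate can be reproduced as below. The count of 300 cdx files is an assumption (it matches the 0..299 cdx numbering used elsewhere in these notes); the other figures are the ones quoted above.

```python
# Assumed: 300 cdx files, 1.5 minutes of lookup work per file,
# and 10 unpickles at roughly 20 seconds each.
n_files = 300
minutes_per_file = 1.5
unpickle_minutes = 10 * 20 / 60

total_minutes = n_files * minutes_per_file + unpickle_minutes
total_hours = total_minutes / 60
print(round(total_minutes), round(total_hours, 1))
```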
3be7b53d726e using python dict test Henry S. Thompson <ht@inf.ed.ac.uk> parents: 59 diff changeset	917
3be7b53d726e using python dict test Henry S. Thompson <ht@inf.ed.ac.uk> parents: 59 diff changeset	918 Try pre-filter with the Bloom filter.
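
A sketch of what pre-filtering could look like. This uses a toy pure-Python Bloom filter rather than pybloomfilter, whose API is not shown here; TinyBloom, lookup and the sample key are all hypothetical. The point is only that a negative answer from the filter is definite, so the dict need not be consulted.

```python
import hashlib


class TinyBloom:
    """Toy Bloom filter: k hash positions per key over an m-bit array."""

    def __init__(self, m_bits, k_hashes):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, key):
        # Derive k positions from one SHA-256 digest via double hashing.
        digest = hashlib.sha256(key).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


def lookup(key, bloom, table):
    """Consult the dict only when the filter says the key may be present."""
    if key not in bloom:
        return None  # definitely absent: skip the dict lookup
    return table.get(key)  # still None if the filter gave a false positive


table = {b"key1": "Mon, 19 Aug 2019 02:44:15 GMT"}
bloom = TinyBloom(m_bits=1024, k_hashes=3)
for k in table:
    bloom.add(k)
print(lookup(b"key1", bloom, table), lookup(b"absent", bloom, table))
```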
58 3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	919 ================
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	920
3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	921
48 f688c437180b thinking about merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 47 diff changeset	922 Try it with the existing _per segment_ index we have for 2019-35
f688c437180b thinking about merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 47 diff changeset	923
58 3012ca7fc6b7 pybloomfilter testing Henry S. Thompson <ht@inf.ed.ac.uk> parents: 57 diff changeset	924 Assuming we have to key on segment / file and offset, as reconstructing the
48 f688c437180b thinking about merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 47 diff changeset	925 proper index key is such a pain / buggy / is going to change with the year.
f688c437180b thinking about merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 47 diff changeset	926
f688c437180b thinking about merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 47 diff changeset	927 Stay with segment 49
f688c437180b thinking about merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 47 diff changeset	928
f688c437180b thinking about merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 47 diff changeset	929 >: uz cdx.gz |wc -l
f688c437180b thinking about merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 47 diff changeset	930 29,870,307
f688c437180b thinking about merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 47 diff changeset	931
f688c437180b thinking about merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 47 diff changeset	932 >: time uz cdx.gz|egrep -ao ' "length": "[0-9]*", "offset": "[0-9]*"' |wc
f688c437180b thinking about merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 47 diff changeset	933 29,870,307 119,481,228 1,241,098,122
f688c437180b thinking about merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 47 diff changeset	934 = 4 * 29,870,307
f688c437180b thinking about merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 47 diff changeset	935
f688c437180b thinking about merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 47 diff changeset	936 So no bogons, not _too_ surprising :-)
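
The same check, sketched in Python: a well-formed line has exactly one length/offset pair (four whitespace-separated words), so the word count above being 4 x the line count means no line is malformed. The pattern uses [0-9]* for the digit runs; count_bogons and the sample lines are made up for illustration.

```python
import re

# One length/offset pair is expected per cdx line.
LEN_OFF = re.compile(rb' "length": "[0-9]*", "offset": "[0-9]*"')


def count_bogons(lines):
    """Count lines that do not contain exactly one length/offset pair."""
    return sum(1 for line in lines if len(LEN_OFF.findall(line)) != 1)


# Synthetic lines: the second has a non-numeric length.
sample = [
    b'k1 20190819024416 {"status": "200", "length": "9398", '
    b'"offset": "16432", "filename": "a.warc.gz"}\n',
    b'k2 20190819024417 {"status": "200", "length": "oops", '
    b'"offset": "1", "filename": "a.warc.gz"}\n',
]
print(count_bogons(sample))
```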
f688c437180b thinking about merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 47 diff changeset	937
f688c437180b thinking about merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 47 diff changeset	938 Bad news is it's a _big_ file:
f688c437180b thinking about merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 47 diff changeset	939
f688c437180b thinking about merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 47 diff changeset	940 >: ls -lh cdx.gz
f688c437180b thinking about merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 47 diff changeset	941 -rw-r--r-- 1 hst dc007 2.0G Mar 18 2021 cdx.gz
f688c437180b thinking about merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 47 diff changeset	942
f688c437180b thinking about merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 47 diff changeset	943 So not viable to paste offset as a key and then sort on command line,
f688c437180b thinking about merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 47 diff changeset	944 or to load it into python and do the work there...
f688c437180b thinking about merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 47 diff changeset	945
f688c437180b thinking about merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 47 diff changeset	946 Do it per warc file and then merge?
f688c437180b thinking about merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 47 diff changeset	947
f688c437180b thinking about merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 47 diff changeset	948 >: time uz cdx.gz |fgrep -a warc/CC-MAIN-20190819011034-20190819033034-00558.warc.gz | sort -n -t\" -k28,28 > /tmp/hst/558.warc.cdx
f688c437180b thinking about merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 47 diff changeset	949
f688c437180b thinking about merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 47 diff changeset	950 real 0m23.494s
f688c437180b thinking about merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 47 diff changeset	951 user 0m14.541s
f688c437180b thinking about merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 47 diff changeset	952 sys 0m9.158s
f688c437180b thinking about merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 47 diff changeset	953
f688c437180b thinking about merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 47 diff changeset	954 >: wc -l /tmp/hst/558.warc.cdx
f688c437180b thinking about merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 47 diff changeset	955 53432 /tmp/hst/558.warc.cdx
f688c437180b thinking about merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 47 diff changeset	956
49 deeac8a0a682 tentative plan for merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 48 diff changeset	957 >: echo $((600 * 53432))
deeac8a0a682 tentative plan for merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 48 diff changeset	958 32,059,200
deeac8a0a682 tentative plan for merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 48 diff changeset	959
48 f688c437180b thinking about merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 47 diff changeset	960 So, 600 of those, plus approx. same again for extracting, that pbly
f688c437180b thinking about merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 47 diff changeset	961 _is_ doable in python, not more than 10 hours total, assuming internal
f688c437180b thinking about merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 47 diff changeset	962 sort and external merge is not too expensive...
49 deeac8a0a682 tentative plan for merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 48 diff changeset	963
deeac8a0a682 tentative plan for merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 48 diff changeset	964 For each segment, suppose we pull out 60 groups of 10 target files
deeac8a0a682 tentative plan for merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 48 diff changeset	965 >: time uz cdx.gz |egrep -a warc/CC-MAIN-2019[^-]*-2019[^-]*-0000..warc.gz > /tmp/hst/0000.warc.cdx
deeac8a0a682 tentative plan for merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 48 diff changeset	966
deeac8a0a682 tentative plan for merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 48 diff changeset	967 real 0m42.129s
deeac8a0a682 tentative plan for merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 48 diff changeset	968 user 0m35.147s
deeac8a0a682 tentative plan for merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 48 diff changeset	969 sys 0m9.140s
deeac8a0a682 tentative plan for merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 48 diff changeset	970 >: wc -l /tmp/hst/0000.warc.cdx
deeac8a0a682 tentative plan for merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 48 diff changeset	971 533150
deeac8a0a682 tentative plan for merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 48 diff changeset	972
deeac8a0a682 tentative plan for merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 48 diff changeset	973 Key it with offset and sort:
deeac8a0a682 tentative plan for merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 48 diff changeset	974
deeac8a0a682 tentative plan for merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 48 diff changeset	975 >: time egrep -ao ' "length": "[0-9]*", "offset": "[0-9]*"' /tmp/hst/0000.warc.cdx | cut -f 5 -d ' ' | tr -d \" > /tmp/hst/0000_offsets
deeac8a0a682 tentative plan for merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 48 diff changeset	976
deeac8a0a682 tentative plan for merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 48 diff changeset	977 real 0m5.578s
deeac8a0a682 tentative plan for merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 48 diff changeset	978 user 0m5.593s
deeac8a0a682 tentative plan for merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 48 diff changeset	979 sys 0m0.265s
deeac8a0a682 tentative plan for merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 48 diff changeset	980
deeac8a0a682 tentative plan for merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 48 diff changeset	981 >: time paste /tmp/hst/0000_offsets /tmp/hst/0000.warc.cdx |sort -nk1,1 | cut -f 2 > /tmp/hst/0000_sorted.warc.cdx
deeac8a0a682 tentative plan for merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 48 diff changeset	982
deeac8a0a682 tentative plan for merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 48 diff changeset	983 real 0m4.185s
deeac8a0a682 tentative plan for merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 48 diff changeset	984 user 0m2.001s
deeac8a0a682 tentative plan for merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 48 diff changeset	985 sys 0m1.334s
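
For comparison, a sketch of the same key-and-sort step in Python, assuming each cdx line has the shape 'key timestamp {json}' as in the sample line shown in these notes. cdx_offset and sort_by_offset are hypothetical names and the three lines are synthetic. The offsets must be compared as numbers: as strings, "16432" would sort before "9000".

```python
import json


def cdx_offset(line):
    """Numeric WARC offset from a 'key timestamp {json}' cdx line."""
    props = json.loads(line.split(" ", 2)[2])
    return int(props["offset"])


def sort_by_offset(lines):
    """Return the cdx lines ordered by numeric offset."""
    return sorted(lines, key=cdx_offset)


# Synthetic cdx lines, deliberately out of order.
lines = [
    'c)/ 20190819024416 {"length": "4648", "offset": "340633"}\n',
    'a)/ 20190819024417 {"length": "9398", "offset": "16432"}\n',
    'b)/ 20190819024418 {"length": "120", "offset": "9000"}\n',
]
print([cdx_offset(line) for line in sort_by_offset(lines)])
```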
deeac8a0a682 tentative plan for merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 48 diff changeset	986
deeac8a0a682 tentative plan for merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 48 diff changeset	987 >: time seq 0 9 | parallel -j 10 "~/lib/python/cc/cdx_extras.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-*-*-0000'{}'.warc.gz > /tmp/hst/lm_0000'{}'.tsv"
deeac8a0a682 tentative plan for merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 48 diff changeset	988
deeac8a0a682 tentative plan for merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 48 diff changeset	989 real 0m24.610s
deeac8a0a682 tentative plan for merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 48 diff changeset	990 user 2m54.146s
deeac8a0a682 tentative plan for merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 48 diff changeset	991 sys 0m10.226s
deeac8a0a682 tentative plan for merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 48 diff changeset	992
deeac8a0a682 tentative plan for merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 48 diff changeset	993 >: head /tmp/hst/lm_00000.tsv
deeac8a0a682 tentative plan for merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 48 diff changeset	994 9398 16432 Mon, 19 Aug 2019 02:44:15 GMT
deeac8a0a682 tentative plan for merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 48 diff changeset	995 20796 26748 Tue, 16 Jul 2019 04:39:09 GMT
deeac8a0a682 tentative plan for merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 48 diff changeset	996 4648 340633 Fri, 07 Dec 2018 09:05:59 GMT
deeac8a0a682 tentative plan for merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 48 diff changeset	997 3465 357109 Sun, 18 Aug 2019 11:48:23 GMT
deeac8a0a682 tentative plan for merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 48 diff changeset	998 7450 914189 Mon, 19 Aug 2019 02:50:08 GMT
deeac8a0a682 tentative plan for merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 48 diff changeset	999 ...
deeac8a0a682 tentative plan for merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 48 diff changeset	1000 sing<3956>: fgrep '"length": "9398", "offset": "16432"' /tmp/hst/0000_sorted.warc.cdx
deeac8a0a682 tentative plan for merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 48 diff changeset	1001 com,roommeme,0401a)/index.phtml?channel=&op=&p=140&put=show&r2= 20190819024416 {"url": "http://0401a.roommeme.com/index.phtml?PUT=SHOW&R2=&OP=&P=140&CHANNEL=", "mime": "text/html", "mime-detected": "text/html", "status": "200", "digest": "5DNDVX5HQBOOBHISSCOI4UBVMUL63L36", "length": "9398", "offset": "16432", "filename": "crawl-data/CC-MAIN-2019-35/segments/1566027314638.49/warc/CC-MAIN-20190819011034-20190819033034-00000.warc.gz", "charset": "Big5", "languages": "zho"}
deeac8a0a682 tentative plan for merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 48 diff changeset	1002
deeac8a0a682 tentative plan for merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 48 diff changeset	1003 bingo
deeac8a0a682 tentative plan for merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 48 diff changeset	1004
deeac8a0a682 tentative plan for merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 48 diff changeset	1005 So, the python code is pretty straightforward: open the 10 individual
deeac8a0a682 tentative plan for merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 48 diff changeset	1006 lm_*.tsv outputs into an array, initialise a 10-elt array with the
deeac8a0a682 tentative plan for merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 48 diff changeset	1007 first line of each and another with its offset, record the
deeac8a0a682 tentative plan for merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 48 diff changeset	1008 fileno(s) of the lowest offset, then iterate
deeac8a0a682 tentative plan for merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 48 diff changeset	1009
deeac8a0a682 tentative plan for merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 48 diff changeset	1010 read cdx lines and write unchanged until offset = lowest
deeac8a0a682 tentative plan for merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 48 diff changeset	1011 merge line from fileno and output
deeac8a0a682 tentative plan for merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 48 diff changeset	1012 remove fileno from list of matches
deeac8a0a682 tentative plan for merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 48 diff changeset	1013 read and store a new line for fileno [handle EOF]
deeac8a0a682 tentative plan for merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 48 diff changeset	1014 if list of matches is empty, redo setting of lowest
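
A sketch of the merge step, simplified from the plan above: rather than tracking the lowest pending offset across all 10 files, it keeps one pending row per TSV file and picks the file from the cdx line's filename. It assumes the cdx lines are sorted by offset, each TSV is sorted by offset with rows of the form length<TAB>offset<TAB>date, and the merged property is called "lastmod". All function names and the sample data are hypothetical.

```python
import json
import re

FILENO = re.compile(r"-(\d{5})\.warc\.gz$")


def parse_cdx(line):
    """Return (fileno, offset, props) for a 'key timestamp {json}' line."""
    props = json.loads(line.split(" ", 2)[2])
    fileno = int(FILENO.search(props["filename"]).group(1)) % 10
    return fileno, int(props["offset"]), props


def lm_offset(row):
    """Offset column of a 'length<TAB>offset<TAB>date' row."""
    return int(row.split("\t")[1])


def merge_lastmod(cdx_lines, lm_streams):
    """Yield cdx lines, adding 'lastmod' where a TSV row has the same offset."""
    heads = [next(s, None) for s in lm_streams]  # pending row per file
    for line in cdx_lines:
        fileno, offset, props = parse_cdx(line)
        # Drop TSV rows that lie before this cdx entry (no cdx match).
        while heads[fileno] is not None and lm_offset(heads[fileno]) < offset:
            heads[fileno] = next(lm_streams[fileno], None)
        head = heads[fileno]
        if head is not None and lm_offset(head) == offset:
            props["lastmod"] = head.rstrip("\n").split("\t")[2]
            key, stamp, _ = line.split(" ", 2)
            line = f"{key} {stamp} {json.dumps(props)}\n"
            heads[fileno] = next(lm_streams[fileno], None)  # None at EOF
        yield line  # unmatched lines pass through unchanged


def cdx(key, offset, fileno):
    """Build a synthetic cdx line for the example."""
    props = {"offset": str(offset),
             "filename": f"warc/CC-MAIN-a-b-{fileno:05d}.warc.gz"}
    return f"{key} 20190819024416 {json.dumps(props)}\n"


cdx_lines = [cdx("a)/", 16432, 0), cdx("b)/", 20000, 1), cdx("c)/", 26748, 0)]
lm0 = ["9398\t16432\tMon, 19 Aug 2019 02:44:15 GMT\n",
       "20796\t26748\tTue, 16 Jul 2019 04:39:09 GMT\n"]
lm1 = ["100\t30000\tFri, 07 Dec 2018 09:05:59 GMT\n"]
streams = [iter(lm0), iter(lm1)] + [iter(()) for _ in range(8)]
merged = list(merge_lastmod(cdx_lines, streams))
print("".join(merged), end="")
```

The output is still in offset order, so it would need re-sorting by the real index key afterwards.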
deeac8a0a682 tentative plan for merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 48 diff changeset	1015
deeac8a0a682 tentative plan for merging Henry S. Thompson <ht@inf.ed.ac.uk> parents: 48 diff changeset	1016 Re-sort the result by actual key
50 5556c04c7597 all of 49? Henry S. Thompson <ht@inf.ed.ac.uk> parents: 49 diff changeset	1017
5556c04c7597 all of 49? Henry S. Thompson <ht@inf.ed.ac.uk> parents: 49 diff changeset	1018 Meanwhile, get a whole test set:
5556c04c7597 all of 49? Henry S. Thompson <ht@inf.ed.ac.uk> parents: 49 diff changeset	1019 sbatch --output=slurm_aug_cdx_49_10-599-out --time=01:00:00 --ntasks=10 -c 36 --exclusive $HOME/bin/runme.sh -m 00 59 $PWD -t 18 -b 'export resdir=CC-MAIN-2019-35/aug_cdx/49
5556c04c7597 all of 49? Henry S. Thompson <ht@inf.ed.ac.uk> parents: 49 diff changeset	1020 export DEC=$xarg' "export PYTHONPATH=./lib/python/cc:$PYTHONPATH
5556c04c7597 all of 49? Henry S. Thompson <ht@inf.ed.ac.uk> parents: 49 diff changeset	1021 seq 0 9 | parallel -j 10 \"~/lib/python/cc/cdx_extras.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-*-*-00\${DEC}'{}'.warc.gz > \$resdir/00\${DEC}'{}'.tsv\""
5556c04c7597 all of 49? Henry S. Thompson <ht@inf.ed.ac.uk> parents: 49 diff changeset	1022
5556c04c7597 all of 49? Henry S. Thompson <ht@inf.ed.ac.uk> parents: 49 diff changeset	1023 Actually finished 360 in the hour.
5556c04c7597 all of 49? Henry S. Thompson <ht@inf.ed.ac.uk> parents: 49 diff changeset	1024
5556c04c7597 all of 49? Henry S. Thompson <ht@inf.ed.ac.uk> parents: 49 diff changeset	1025 Leaving
5556c04c7597 all of 49? Henry S. Thompson <ht@inf.ed.ac.uk> parents: 49 diff changeset	1026
5556c04c7597 all of 49? Henry S. Thompson <ht@inf.ed.ac.uk> parents: 49 diff changeset	1027 sbatch --output=slurm_aug_cdx_49_360-599-out --time=01:00:00 --ntasks=10 -c 36 --exclusive $HOME/bin/runme.sh -m 36 59 $PWD -t 18 -b 'export resdir=CC-MAIN-2019-35/aug_cdx/49
5556c04c7597 all of 49? Henry S. Thompson <ht@inf.ed.ac.uk> parents: 49 diff changeset	1028 export DEC=$xarg' "export PYTHONPATH=./lib/python/cc:$PYTHONPATH
5556c04c7597 all of 49? Henry S. Thompson <ht@inf.ed.ac.uk> parents: 49 diff changeset	1029 seq 0 9 | parallel -j 10 \"~/lib/python/cc/cdx_extras.py /beegfs/common_crawl/CC-MAIN-2019-35/1566027314638.49/orig/warc/CC-MAIN-*-*-00\${DEC}'{}'.warc.gz > \$resdir/00\${DEC}'{}'.tsv\""
5556c04c7597 all of 49? Henry S. Thompson <ht@inf.ed.ac.uk> parents: 49 diff changeset	1030
5556c04c7597 all of 49? Henry S. Thompson <ht@inf.ed.ac.uk> parents: 49 diff changeset	1031 But something is wrong, the number of jobs is all wrong:
5556c04c7597 all of 49? Henry S. Thompson <ht@inf.ed.ac.uk> parents: 49 diff changeset	1032
5556c04c7597 all of 49? Henry S. Thompson <ht@inf.ed.ac.uk> parents: 49 diff changeset	1033 5>: fgrep -c parallel slurm_aug_cdx_49_0-359-out
5556c04c7597 all of 49? Henry S. Thompson <ht@inf.ed.ac.uk> parents: 49 diff changeset	1034 741
5556c04c7597 all of 49? Henry S. Thompson <ht@inf.ed.ac.uk> parents: 49 diff changeset	1035 sing<4046>: ls -lt CC-MAIN-2019-35/aug_cdx/49/|wc -l
5556c04c7597 all of 49? Henry S. Thompson <ht@inf.ed.ac.uk> parents: 49 diff changeset	1036 372
5556c04c7597 all of 49? Henry S. Thompson <ht@inf.ed.ac.uk> parents: 49 diff changeset	1037
5556c04c7597 all of 49? Henry S. Thompson <ht@inf.ed.ac.uk> parents: 49 diff changeset	1038 Every file is being produced twice.
51 dc24bb6e524f done cdx_aux for segments 49--55 of 2019-35 Henry S. Thompson <ht@inf.ed.ac.uk> parents: 50 diff changeset	1039
dc24bb6e524f done cdx_aux for segments 49--55 of 2019-35 Henry S. Thompson <ht@inf.ed.ac.uk> parents: 50 diff changeset	1040 Took me a while to figure out my own code :-(
dc24bb6e524f done cdx_aux for segments 49--55 of 2019-35 Henry S. Thompson <ht@inf.ed.ac.uk> parents: 50 diff changeset	1041
dc24bb6e524f done cdx_aux for segments 49--55 of 2019-35 Henry S. Thompson <ht@inf.ed.ac.uk> parents: 50 diff changeset	1042 >: sbatch --output=slurm_aug_cdx_49_360-599-out --time=01:00:00 --ntasks=10 -c 36 --exclusive $HOME/bin/runme.sh -m 49 49 $PWD -t 18 -b 'export resdir=CC-MAIN-2019-35/aug_cdx/$xarg
dc24bb6e524f done cdx_aux for segments 49--55 of 2019-35 Henry S. Thompson <ht@inf.ed.ac.uk> parents: 50 diff changeset	1043 export SEG=$xarg
dc24bb6e524f done cdx_aux for segments 49--55 of 2019-35 Henry S. Thompson <ht@inf.ed.ac.uk> parents: 50 diff changeset	1044 share_by_task.sh -f "%03g\n" -s 360 599 $n $task > /tmp/hst_$task' -i 'cat /tmp/hst_$task' 'export PYTHONPATH=./lib/python/cc:$PYTHONPATH
dc24bb6e524f done cdx_aux for segments 49--55 of 2019-35 Henry S. Thompson <ht@inf.ed.ac.uk> parents: 50 diff changeset	1045 ~/lib/python/cc/cdx_extras.py /beegfs/common_crawl/CC-MAIN-2019-35/*.$SEG/orig/warc/CC-MAIN-*-*-00${arg}.warc.gz > $resdir/00${arg}.tsv'
dc24bb6e524f done cdx_aux for segments 49--55 of 2019-35 Henry S. Thompson <ht@inf.ed.ac.uk> parents: 50 diff changeset	1046
dc24bb6e524f done cdx_aux for segments 49--55 of 2019-35 Henry S. Thompson <ht@inf.ed.ac.uk> parents: 50 diff changeset	1047 Oops, only 560, not 600
dc24bb6e524f done cdx_aux for segments 49--55 of 2019-35 Henry S. Thompson <ht@inf.ed.ac.uk> parents: 50 diff changeset	1048
dc24bb6e524f done cdx_aux for segments 49--55 of 2019-35 Henry S. Thompson <ht@inf.ed.ac.uk> parents: 50 diff changeset	1049 Took 3.5 minutes for 200 files, so call it 10 minutes for 560, so we can do 6 more segments in an
dc24bb6e524f done cdx_aux for segments 49--55 of 2019-35 Henry S. Thompson <ht@inf.ed.ac.uk> parents: 50 diff changeset	1050 hour:
dc24bb6e524f done cdx_aux for segments 49--55 of 2019-35 Henry S. Thompson <ht@inf.ed.ac.uk> parents: 50 diff changeset	1051
dc24bb6e524f done cdx_aux for segments 49--55 of 2019-35 Henry S. Thompson <ht@inf.ed.ac.uk> parents: 50 diff changeset	1052 >: sbatch --output=slurm_aug_cdx_50-55_out --time=01:00:00 --ntasks=10 -c 36 --exclusive $HOME/bin/runme.sh -m 50 55 $PWD -t 18 -b 'export resdir=CC-MAIN-2019-35/aug_cdx/$xarg
dc24bb6e524f done cdx_aux for segments 49--55 of 2019-35 Henry S. Thompson <ht@inf.ed.ac.uk> parents: 50 diff changeset	1053 mkdir -p $resdir
dc24bb6e524f done cdx_aux for segments 49--55 of 2019-35 Henry S. Thompson <ht@inf.ed.ac.uk> parents: 50 diff changeset	1054 export SEG=$xarg
dc24bb6e524f done cdx_aux for segments 49--55 of 2019-35 Henry S. Thompson <ht@inf.ed.ac.uk> parents: 50 diff changeset	1055 share_by_task.sh -f "%03g\n" -s 360 599 $n $task > /tmp/hst_$task' -i 'cat /tmp/hst_$task' 'export PYTHONPATH=./lib/python/cc:$PYTHONPATH
dc24bb6e524f done cdx_aux for segments 49--55 of 2019-35 Henry S. Thompson <ht@inf.ed.ac.uk> parents: 50 diff changeset	1056 ~/lib/python/cc/cdx_extras.py /beegfs/common_crawl/CC-MAIN-2019-35/*.$SEG/orig/warc/CC-MAIN-*-*-00${arg}.warc.gz > $resdir/00${arg}.tsv'
dc24bb6e524f done cdx_aux for segments 49--55 of 2019-35 Henry S. Thompson <ht@inf.ed.ac.uk> parents: 50 diff changeset	1057
dc24bb6e524f done cdx_aux for segments 49--55 of 2019-35 Henry S. Thompson <ht@inf.ed.ac.uk> parents: 50 diff changeset	1058 >: tail slurm_aug_cdx_50-55_out
dc24bb6e524f done cdx_aux for segments 49--55 of 2019-35 Henry S. Thompson <ht@inf.ed.ac.uk> parents: 50 diff changeset	1059 ...
dc24bb6e524f done cdx_aux for segments 49--55 of 2019-35 Henry S. Thompson <ht@inf.ed.ac.uk> parents: 50 diff changeset	1060 Wed Oct 9 22:25:47 BST 2024 Finished 55
dc24bb6e524f done cdx_aux for segments 49--55 of 2019-35 Henry S. Thompson <ht@inf.ed.ac.uk> parents: 50 diff changeset	1061 >: head -1 slurm_aug_cdx_50-55_out
dc24bb6e524f done cdx_aux for segments 49--55 of 2019-35 Henry S. Thompson <ht@inf.ed.ac.uk> parents: 50 diff changeset	1062 Wed Oct 9 21:29:43 BST
dc24bb6e524f done cdx_aux for segments 49--55 of 2019-35 Henry S. Thompson <ht@inf.ed.ac.uk> parents: 50 diff changeset	1063 56:04
dc24bb6e524f done cdx_aux for segments 49--55 of 2019-35 Henry S. Thompson <ht@inf.ed.ac.uk> parents: 50 diff changeset	1064
dc24bb6e524f done cdx_aux for segments 49--55 of 2019-35 Henry S. Thompson <ht@inf.ed.ac.uk> parents: 50 diff changeset	1065 >: du -s CC-MAIN-2019-35/aug_cdx
dc24bb6e524f done cdx_aux for segments 49--55 of 2019-35 Henry S. Thompson <ht@inf.ed.ac.uk> parents: 50 diff changeset	1066 1,902,916
dc24bb6e524f done cdx_aux for segments 49--55 of 2019-35 Henry S. Thompson <ht@inf.ed.ac.uk> parents: 50 diff changeset	1067
dc24bb6e524f done cdx_aux for segments 49--55 of 2019-35 Henry S. Thompson <ht@inf.ed.ac.uk> parents: 50 diff changeset	1068 Not bad, ~1.9GB for 7 segments [du counts in KB], so order 20-30GB for the whole thing
dc24bb6e524f done cdx_aux for segments 49--55 of 2019-35 Henry S. Thompson <ht@inf.ed.ac.uk> parents: 50 diff changeset	1069
dc24bb6e524f done cdx_aux for segments 49--55 of 2019-35 Henry S. Thompson <ht@inf.ed.ac.uk> parents: 50 diff changeset	1070 Next step, compare to my existing cdx with timestamp
52 8dffb8aa33da prelim consistency check with published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 51 diff changeset	1071
8dffb8aa33da prelim consistency check with published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 51 diff changeset	1072 First check looks about right:
8dffb8aa33da prelim consistency check with published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 51 diff changeset	1073
8dffb8aa33da prelim consistency check with published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 51 diff changeset	1074 [cd .../warc_lmhx]
8dffb8aa33da prelim consistency check with published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 51 diff changeset	1075 >: seq --format='%03g' 0 299 > /tmp/hst/cdx_nums
8dffb8aa33da prelim consistency check with published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 51 diff changeset	1076 >: parallel -j 20 -a /tmp/hst/cdx_nums 'uz idx/cdx-00{}.gz | egrep -o "\"filename\": \"crawl-data/CC-MAIN-2019-35/segments/[^.]*[.]50.*\"lastmod\":" | sed "s/^.*-00//;s/^\(...\).*/\1/"| sus > /tmp/hst/checkseg_50_{}'
8dffb8aa33da prelim consistency check with published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 51 diff changeset	1077
8dffb8aa33da prelim consistency check with published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 51 diff changeset	1078 [cd .../aug_cdx/50]
8dffb8aa33da prelim consistency check with published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 51 diff changeset	1079 >: wc -l 00123.tsv
8dffb8aa33da prelim consistency check with published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 51 diff changeset	1080 9333
8dffb8aa33da prelim consistency check with published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 51 diff changeset	1081 >: egrep -h '123$' /tmp/hst/checkseg_50_??? | acut 1 | btot
8dffb8aa33da prelim consistency check with published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 51 diff changeset	1082 9300
8dffb8aa33da prelim consistency check with published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 51 diff changeset	1083 >: wc -l 00400.tsv
8dffb8aa33da prelim consistency check with published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 51 diff changeset	1084 9477 00400.tsv
8dffb8aa33da prelim consistency check with published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 51 diff changeset	1085 >: egrep -h '400$' /tmp/hst/checkseg_50_??? | acut 1 | btot
8dffb8aa33da prelim consistency check with published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 51 diff changeset	1086 9439
8dffb8aa33da prelim consistency check with published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 51 diff changeset	1087
8dffb8aa33da prelim consistency check with published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 51 diff changeset	1088 The difference is presumably that the bogus timestamps aren't in the augmented
8dffb8aa33da prelim consistency check with published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 51 diff changeset	1089 cdx as shipped.
53 d533894173d0 detailed consistency check with 7 segments from published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 52 diff changeset	1090
d533894173d0 detailed consistency check with 7 segments from published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 52 diff changeset	1091 Note that the following 'bad' kind of timestamp is fixed before
d533894173d0 detailed consistency check with 7 segments from published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 52 diff changeset	1092 sort_date.py does its thing:
d533894173d0 detailed consistency check with 7 segments from published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 52 diff changeset	1093
d533894173d0 detailed consistency check with 7 segments from published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 52 diff changeset	1094 ... sort_date.sh <(uz $arg/*00???.warc.gz | '"fgrep $'\t'|sed '/GMT$/s/\([^ ]\)GMT$/\1 GMT/')"' >$arg/ks.tsv
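
The repair applied here, a missing space before a trailing GMT, can be sketched in Python as follows. fix_gmt is a hypothetical helper, not part of sort_date.py, and the sample timestamps are made up.

```python
import re

# A non-space character immediately followed by GMT at end of line.
GLUED_GMT = re.compile(r"([^ ])GMT$")


def fix_gmt(stamp):
    """Insert the missing space when 'GMT' is glued to the time."""
    return GLUED_GMT.sub(r"\1 GMT", stamp)


print(fix_gmt("Mon, 19 Aug 2019 02:44:15GMT"))
print(fix_gmt("Mon, 19 Aug 2019 02:44:15 GMT"))
```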
d533894173d0 detailed consistency check with 7 segments from published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 52 diff changeset	1095
d533894173d0 detailed consistency check with 7 segments from published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 52 diff changeset	1096
d533894173d0 detailed consistency check with 7 segments from published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 52 diff changeset	1097 >: egrep -c '[^ ]GMT$' 50/00123.tsv
d533894173d0 detailed consistency check with 7 segments from published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 52 diff changeset	1098 22
d533894173d0 detailed consistency check with 7 segments from published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 52 diff changeset	1099 >: egrep -c '[^ ]GMT$' 50/00400.tsv
d533894173d0 detailed consistency check with 7 segments from published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 52 diff changeset	1100 14
d533894173d0 detailed consistency check with 7 segments from published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 52 diff changeset	1101
d533894173d0 detailed consistency check with 7 segments from published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 52 diff changeset	1102 >: PYTHONPATH=~/.local/lib/python3.9/site-packages:$PYTHONPATH sort_date.sh <(uz ../warc_lmhx/50/*00123.warc.gz | fgrep $'\t'|sed '/GMT$/s/\([^ ]\)GMT$/\1 GMT/') 2> /tmp/hst/123_errs | wc -l
d533894173d0 detailed consistency check with 7 segments from published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 52 diff changeset	1103 9300
d533894173d0 detailed consistency check with 7 segments from published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 52 diff changeset	1104 >: fgrep -c Invalid /tmp/hst/123_errs
d533894173d0 detailed consistency check with 7 segments from published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 52 diff changeset	1105 33
d533894173d0 detailed consistency check with 7 segments from published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 52 diff changeset	1106 >: PYTHONPATH=~/.local/lib/python3.9/site-packages:$PYTHONPATH sort_date.sh <(uz ../warc_lmhx/50/*00400.warc.gz | fgrep $'\t'|sed '/GMT$/s/\([^ ]\)GMT$/\1 GMT/') 2> /tmp/hst/400_errs | wc -l
d533894173d0 detailed consistency check with 7 segments from published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 52 diff changeset	1107 9439
d533894173d0 detailed consistency check with 7 segments from published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 52 diff changeset	1108 >: fgrep -c Invalid /tmp/hst/400_errs
d533894173d0 detailed consistency check with 7 segments from published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 52 diff changeset	1109 38
d533894173d0 detailed consistency check with 7 segments from published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 52 diff changeset	1110
d533894173d0 detailed consistency check with 7 segments from published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 52 diff changeset	1111 All good.
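[A minimal sketch of what the sed stage in the pipelines above appears to be for: putting a space before a trailing GMT that is glued to the preceding character, presumably so the date parser accepts it. The function name fix_gmt and the sample date are made up for illustration; they are not taken from the data.]

```shell
# fix_gmt is a hypothetical wrapper around the sed expression used above.
# It only touches lines ending in GMT where GMT directly follows a non-space.
fix_gmt () {
  sed '/GMT$/s/\([^ ]\)GMT$/\1 GMT/'
}

# Made-up sample input, not a real header from the crawl
printf '%s\n' 'Thu, 02 Jan 2025 15:01:48GMT' | fix_gmt
# prints: Thu, 02 Jan 2025 15:01:48 GMT
```

[Lines that already have the space are left untouched, since the substitution requires a non-space immediately before GMT.]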
d533894173d0 detailed consistency check with 7 segments from published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 52 diff changeset	1112
d533894173d0 detailed consistency check with 7 segments from published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 52 diff changeset	1113 But
d533894173d0 detailed consistency check with 7 segments from published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 52 diff changeset	1114 >: seq --format='%03g' 0 559 > /tmp/hst/warc_nums
d533894173d0 detailed consistency check with 7 segments from published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 52 diff changeset	1115 >: xx () {
d533894173d0 detailed consistency check with 7 segments from published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 52 diff changeset	1116 r=$(diff -bw
d533894173d0 detailed consistency check with 7 segments from published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 52 diff changeset	1117 <(echo $((
d533894173d0 detailed consistency check with 7 segments from published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 52 diff changeset	1118 $(sort_date.sh <(uz ../warc_lmhx/50/*00$1.warc.gz |
d533894173d0 detailed consistency check with 7 segments from published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 52 diff changeset	1119 fgrep $'\t'|sed '/GMT$/s/\([^ ]\)GMT$/\1 GMT/') 2>/tmp/hst/ec_$1 |wc -l)
d533894173d0 detailed consistency check with 7 segments from published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 52 diff changeset	1120 +
d533894173d0 detailed consistency check with 7 segments from published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 52 diff changeset	1121 $(fgrep -c Invalid /tmp/hst/ec_$1))))
d533894173d0 detailed consistency check with 7 segments from published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 52 diff changeset	1122 <(wc -l < 50/00$1.tsv))
d533894173d0 detailed consistency check with 7 segments from published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 52 diff changeset	1123 if [ "$r" ]
d533894173d0 detailed consistency check with 7 segments from published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 52 diff changeset	1124 then printf "%s:\n%s\n" $2 "$r"
d533894173d0 detailed consistency check with 7 segments from published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 52 diff changeset	1125 fi
d533894173d0 detailed consistency check with 7 segments from published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 52 diff changeset	1126 }
d533894173d0 detailed consistency check with 7 segments from published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 52 diff changeset	1127 >: parallel -j 20 -a /tmp/hst/warc_nums xx '{}' '$(({#} - 1))' | tee /tmp/hst/aug_bugs
d533894173d0 detailed consistency check with 7 segments from published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 52 diff changeset	1128 >: fgrep -c 1c1 /tmp/hst/aug_bugs
d533894173d0 detailed consistency check with 7 segments from published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 52 diff changeset	1129 77
d533894173d0 detailed consistency check with 7 segments from published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 52 diff changeset	1130 sing<4318>: wc -l < /tmp/hst/aug_bugs
d533894173d0 detailed consistency check with 7 segments from published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 52 diff changeset	1131 385
d533894173d0 detailed consistency check with 7 segments from published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 52 diff changeset	1132 sing<4319>: echo $((77 * 5))
d533894173d0 detailed consistency check with 7 segments from published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 52 diff changeset	1133 385
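[Why 77 * 5: each mismatch reported by xx is one label line from the printf plus a four-line diff of two one-line inputs (1c1, the < line, ---, the > line). A small sketch to confirm that count; report_lines and the numbers 100 and 101 are made up, and temp files stand in for the process substitutions used in xx.]

```shell
# report_lines is a hypothetical helper: it builds one mismatch report in the
# same "%s:\n%s\n" shape that xx prints and counts the lines in it.
report_lines () {
  a=$(mktemp); b=$(mktemp)
  echo 100 > "$a"; echo 101 > "$b"      # two differing one-line "counts"
  r=$(diff -bw "$a" "$b" || true)       # diff exits 1 when inputs differ
  rm -f "$a" "$b"
  printf '%s:\n%s\n' "$1" "$r" | wc -l | tr -d ' '
}

report_lines 0
# prints 5
```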
d533894173d0 detailed consistency check with 7 segments from published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 52 diff changeset	1134
d533894173d0 detailed consistency check with 7 segments from published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 52 diff changeset	1135 OK, there are a few other error messages from date conversion
d533894173d0 detailed consistency check with 7 segments from published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 52 diff changeset	1136 >: xx () { r=$(diff -bw <(echo $(($(sort_date.sh <(uz ../warc_lmhx/50/*00$1.warc.gz | fgrep $'\t'|sed '/GMT$/s/\([^ ]\)GMT$/\1 GMT/') 2>/tmp/hst/ec_$1 |wc -l) + $(egrep -c 'Invalid|must be in|out of range' /tmp/hst/ec_$1)))) <(wc -l < 50/00$1.tsv)); if [ "$r" ]; then printf "%s:\n%s\n" $2 "$r"; fi; }
d533894173d0 detailed consistency check with 7 segments from published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 52 diff changeset	1137 sing<4337>: parallel -j 20 -a /tmp/hst/warc_nums xx '{}' '$(({#} - 1))' | tee /tmp/hst/aug_bugs2
d533894173d0 detailed consistency check with 7 segments from published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 52 diff changeset	1138 [nothing]
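[The invariant xx is testing, written out on plain numbers: lines that survive sort_date.sh plus lines that produced a date-conversion error should equal the line count of the corresponding .tsv. A sketch only; check_counts and the toy numbers are hypothetical and it does not read any of the real files.]

```shell
# check_counts is a hypothetical helper, not part of the original session.
# Usage: check_counts LABEL SORTED_LINES ERROR_LINES TSV_LINES
# Silent (status 0) when the counts agree, one-line report (status 1) otherwise.
check_counts () {
  total=$(($2 + $3))
  if [ "$total" -ne "$4" ]; then
    printf '%s: %d + %d = %d, expected %d\n' "$1" "$2" "$3" "$total" "$4"
    return 1
  fi
  return 0
}

# Toy numbers for illustration only
check_counts ok 90 10 100            # prints nothing
check_counts bad 90 9 100 || true    # prints: bad: 90 + 9 = 99, expected 100
```

[This also shows why the first run reported 77 mismatches: counting only the Invalid messages leaves the other error kinds out of the sum.]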
d533894173d0 detailed consistency check with 7 segments from published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 52 diff changeset	1139
d533894173d0 detailed consistency check with 7 segments from published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 52 diff changeset	1140 So, I think we can believe we're OK
d533894173d0 detailed consistency check with 7 segments from published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 52 diff changeset	1141 But 7 is better than 1:
d533894173d0 detailed consistency check with 7 segments from published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 52 diff changeset	1142 >: xx () { r=$(diff -bw <(echo $(($(sort_date.sh <(uz ../warc_lmhx/$3/*00$1.warc.gz | fgrep $'\t'|sed '/GMT$/s/\([^ ]\)GMT$/\1 GMT/') 2>/tmp/hst/ec_$1 |wc -l) + $(egrep -c 'Invalid|must be in|out of range' /tmp/hst/ec_$1)))) <(wc -l < $3/00$1.tsv)); if [ "$r" ]; then printf "%s:\n%s\n" $2 "$r"; fi; }
d533894173d0 detailed consistency check with 7 segments from published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 52 diff changeset	1143 >: for s in 49 {51..55}; do parallel -j 20 -a /tmp/hst/warc_nums xx '{}' '$(({#} - 1))' $s | tee /tmp/hst/aug_bugs_$s; done
d533894173d0 detailed consistency check with 7 segments from published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 52 diff changeset	1144 [nothing]
d533894173d0 detailed consistency check with 7 segments from published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 52 diff changeset	1145
d533894173d0 detailed consistency check with 7 segments from published lmh-augmented cdx Henry S. Thompson <ht@inf.ed.ac.uk> parents: 52 diff changeset	1146 Next step: ?
