lurid3/notes.txt @ 42:0c472ae05f71 -- "nearly finished downloading for now"

author:   Henry S. Thompson <ht@inf.ed.ac.uk>
date:     Mon, 02 Sep 2024 15:02:01 +0100
parents:  64b7fb44e8dc
children: 6ae6a21ccfb9
See old_notes.txt for all older notes on Common Crawl data processing, starting from Azure via Turing and then LURID and LURID2.

Installed /beegfs/common_crawl/CC-MAIN-2024-33/cdx

>: cd results/CC-MAIN-2024-33/cdx/
>: cut -f 2 counts.tsv | btot
2,793,986,828

State of play wrt data -- see status.xlsx

[In trying to tabulate the date ranges of the crawls, I found that the WARC timestamp is sometimes bogus:

>: fgrep ' 2009' CC-MAIN-2018-34/cdx/cluster.idx
net,tyredeyes)/robots.txt 20090201191318	cdx-00230.gz	160573468	198277	920675
>: zgrep '^net,tyredeyes)/robots.txt' CC-MAIN-2018-34/cdx/warc/cdx-00230.gz
net,tyredeyes)/robots.txt 20090201191318 {"url": "http://tyredeyes.net/robots.txt", "mime": "text/html", "mime-detected": "text/html", "status": "301", "digest": "QH732FYSV7UM34JYWVYMB7EZGR2CYM6B", "length": "582", "offset": "1224614", "filename": "crawl-data/CC-MAIN-2018-34/segments/1534221215075.58/robotstxt/CC-MAIN-20180819090604-20180819110604-00558.warc.gz"}
net,tyredeyes)/robots.txt 20090201191319 {"url": "http://www.tyredeyes.net/robots.txt", "mime": "text/plain", "mime-detected": "text/plain", "status": "200", "digest": "PSX5IZU4B4SIXGNDKXCVFH75Q27VHUTJ", "length": "549", "offset": "2069841", "filename": "crawl-data/CC-MAIN-2018-34/segments/1534221215075.58/robotstxt/CC-MAIN-20180819090604-20180819110604-00485.warc.gz"}

This happens in 2019-35 as well :-(

>: fgrep ' 20181023' CC-MAIN-2019-35/cdx/cluster.idx
com,gyshbsh)/robots.txt 20181023022000	cdx-00078.gz	356340085	162332	315406
>: zgrep ' 20181023' CC-MAIN-2019-35/cdx/warc/cdx-00078.gz
com,gyshbsh)/robots.txt 20181023022000 {"url": "http://gyshbsh.com/robots.txt", "mime": "text/html", "mime-detected": "text/html", "status": "301", "digest": "3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ", "length": "529", "offset": "638892", "filename": "crawl-data/CC-MAIN-2019-35/segments/1566027315618.73/robotstxt/CC-MAIN-20190820200701-20190820222701-00120.warc.gz"}
...]

Tabulate all the date ranges for the WARC files we have (see the aside below on the first+last-in-one-pass trick):

>: for d in {2017-30,2019-35,2020-34,2021-25,2023-40,2023-50}; do printf "%s\t" $d; (ls CC-MAIN-$d/*.{?,??}/orig/warc | fgrep .gz | cut -f 3,4 -d - | sort -u | tee /dev/fd/3 | head -1 ) 3> >( tail -1 ) | tr '\n' '\t'; echo; done | cut -f 1,2,4 -d - | sed 's/-20/ 20/;s/.$//' | tr ' ' '\t' > dates.tsv
>: for d in {2018-30,2018-34}; do printf "%s\t" $d; (ls CC-MAIN-$d/{*.?,*.??} | fgrep warc.gz | cut -f 3,4 -d - | sort -u | tee /dev/fd/3 | { sleep 10 ; head -1 ; } ) 3> >( tail -1 ) | tr '\n' '\t'; echo; done >> dates.tsv
>: for d in 2019-18; do printf "%s\t" $d; (ls CC-MAIN-$d/{*.?,*.??} | fgrep warc.gz | cut -f 3,4 -d - | sort -u | head -1); done
2019-18	20190418101243-20190418122248
>: for d in 2019-18; do printf "%s\t" $d; (ls CC-MAIN-$d/{*.?,*.??} | fgrep warc.gz | cut -f 3,4 -d - | sort -u | tail -1); done
2019-18	20190426153423-20190426175423
>: echo 2019-18 20190418101243-20190418122248 20190426153423-20190426175423 >> dates.tsv

For 2016-30, get the range from the cdx timestamps themselves:

>: pwd
/beegfs/common_crawl/CC-MAIN-2016-30/cdx/warc
>: echo {000..299} | tr ' ' '\n' | parallel -j 10 'uz cdx-00{}.gz | cut -f 2 -d " " | sort -u > /tmp/hst/{}'
>: sort -mu /tmp/hst/??? > /tmp/hst/all
>: wc -l /tmp/hst/all
679686 /tmp/hst/all
>: head -1 /tmp/hst/all
20160723090435
>: tail -1 /tmp/hst/all
20160731110639
>: cd ../../..
>: echo 2016-30 20160723090435 20160731110639 >> dates.tsv
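[Aside: the dates.tsv one-liners above pull both the first and the last line out of a single sorted stream in one pass: tee copies the stream onto fd 3, head -1 takes the first line from stdout, and fd 3 is redirected into a process substitution running tail -1. A minimal sketch of the idiom, with seq standing in for the real ls pipeline:

>: ( seq 5 | tee /dev/fd/3 | head -1 ) 3> >( tail -1 )
1
5

The shell does not wait for the process substitution, so the 5 can arrive out of order, even after the next prompt; the sleep 10 in the 2018-* variant looks like insurance against exactly that.]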
Tweaked and sorted in xemacs:

2016-30 20160723090435 20160731110639
2017-30 20170720121902 20170729132938
2018-30 20180715183800 20180723184955
2018-34 20180814062251 20180822085454
2019-18 20190418101243 20190426175423
2019-35 20190817102624 20190826111356
2020-34 20200803083123 20200815214756
2021-25 20210612103920 20210625145905
2023-40 20230921073711 20231005042006
2023-50 20231128083443 20231212000408

Added to status.xlsx in shortened form, with number of days:
8 9 8 8 8 9 12 13 15 15

Fill a gap by downloading 2022-33:

>: for s in 0; do ~/bin/getcc_multi.aws CC-MAIN-2022-33 $s 5; done > /tmp/hst/get_22-33_0.log &
130 minutes...
>: for s in 1; do ~/bin/getcc_multi.aws CC-MAIN-2022-33 $s 10; done > /tmp/hst/get_22-33_1.log &
59 minutes
Another day to get to a quarter?
>: for s in {2..23}; do ~/bin/getcc_multi.aws CC-MAIN-2022-33 $s 10; done > /tmp/hst/get_22-33_2-23.log &

And finally 2015-35. Fetched in just 2 chunks, 0-9 and 10-99, e.g.

>: for s in {10..99}; do ~/bin/getcc_multi.aws CC-MAIN-2015-35 $s 10; done > /tmp/hst/get_15-35_10-99.log &

Much smaller. Compare 2023-40, with 900 files per segment:

>: lss */orig/warc/*-0023?.* | cut -f 5 -d ' ' | stats
n = 1000
min = 1.14775e+09
max = 1.26702e+09
sum = 1.20192e+12
mean = 1.20192e+09
sd = 2.26049e+07

with 2015-35, with 353 files per segment:

>: lss */orig/warc/*-0023?-* | cut -f 5 -d ' ' | stats
n = 930
min = 1.66471e+08 [bug?]
max = 9.6322e+08
sum = 8.54009e+11
mean = 9.1829e+08
sd = 8.48938e+07

The min files all come from segment 1440644060633.7, whose files are _all_ small:

>: uz *00123-*.gz | wc -l
12,759,931

Compare to 1440644060103.8:

>: zcat *00123-*.gz | wc -l
75,806,738

Mystery.

Also faster. Compare 2022-33 (the logs greped below are from that fetch):

>: fgrep -h BST /tmp/hst/get_22-33_{2-23,24-49,50-99}.log | cut -f 1-7 -d ' ' | while read s; do if read e; then echo $((($(date --date="$e" +%s) - $(date --date="$s" +%s)) / 60)); fi; done | stats
n	min	max	mean	sd
98	19	256	75.1	25.2

with 2015-35:

>: fgrep -h BST /tmp/hst/get_15-35_{0-9,10-99}.log | cut -f 1-7 -d ' ' | while read s; do if read e; then echo $((($(date --date="$e" +%s) - $(date --date="$s" +%s)) / 60)); fi; done | stats
n	min	max	mean	sd
95	15	40	32.4	2.90
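[The timing pipeline above assumes each segment contributes a pair of date-stamp lines to the log, start then finish, selected by the BST match; read s / read e consumes them two at a time, and date --date=... +%s turns each into epoch seconds. A standalone sketch of the same computation over two made-up timestamps in the C-locale date format:

>: printf '%s\n' 'Mon Aug 19 10:00:00 BST 2024' 'Mon Aug 19 11:15:00 BST 2024' | while read s; do if read e; then echo $((($(date --date="$e" +%s) - $(date --date="$s" +%s)) / 60)); fi; done
75

Note that an odd number of lines, e.g. from a fetch that died mid-segment, silently drops the final unpaired start time.]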
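[A possible follow-up on the bogus-WARC-timestamp problem noted at the top: now that dates.tsv gives each crawl's nominal window, out-of-window cluster.idx entries could be flagged mechanically. An untested sketch, using the 2018-34 window from the table above; it assumes awk's default whitespace splitting puts the CDX timestamp in $2 (as in the cluster.idx lines quoted earlier), and 14-digit timestamps are still exact as doubles:

>: lo=20180814062251 hi=20180822085454
>: awk -v lo=$lo -v hi=$hi '$2 < lo || $2 > hi' CC-MAIN-2018-34/cdx/cluster.idx

That would have caught the 2009 entry above; run per crawl with the matching row of dates.tsv.]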