changeset 41:64b7fb44e8dc

extract actual date info for WARC crawls
author Henry S. Thompson <ht@inf.ed.ac.uk>
date Wed, 21 Aug 2024 16:11:40 +0100
parents 4167d8f33325
children 0c472ae05f71
files lurid3/notes.txt lurid3/status.xlsx
diffstat 2 files changed, 63 insertions(+), 0 deletions(-)
--- a/lurid3/notes.txt	Tue Aug 20 15:27:47 2024 +0100
+++ b/lurid3/notes.txt	Wed Aug 21 16:11:40 2024 +0100
@@ -8,3 +8,66 @@
 
 State of play wrt data -- see status.xlsx
 
+In trying to tabulate the date ranges of the crawls, I found that the
+WARC timestamp in the cdx index is sometimes bogus:
+
+  >: fgrep ' 2009' CC-MAIN-2018-34/cdx/cluster.idx
+  net,tyredeyes)/robots.txt 20090201191318	cdx-00230.gz	160573468	198277	920675
+
+  >: zgrep '^net,tyredeyes)/robots.txt' CC-MAIN-2018-34/cdx/warc/cdx-00230.gz
+  net,tyredeyes)/robots.txt 20090201191318 {"url": "http://tyredeyes.net/robots.txt", "mime": "text/html", "mime-detected": "text/html", "status": "301", "digest": "QH732FYSV7UM34JYWVYMB7EZGR2CYM6B", "length": "582", "offset": "1224614", "filename": "crawl-data/CC-MAIN-2018-34/segments/1534221215075.58/robotstxt/CC-MAIN-20180819090604-20180819110604-00558.warc.gz"}
+  net,tyredeyes)/robots.txt 20090201191319 {"url": "http://www.tyredeyes.net/robots.txt", "mime": "text/plain", "mime-detected": "text/plain", "status": "200", "digest": "PSX5IZU4B4SIXGNDKXCVFH75Q27VHUTJ", "length": "549", "offset": "2069841", "filename": "crawl-data/CC-MAIN-2018-34/segments/1534221215075.58/robotstxt/CC-MAIN-20180819090604-20180819110604-00485.warc.gz"}
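+A minimal sketch in Python of the implied sanity check: parse the cdx
+timestamp column and compare it against the crawl's nominal fetch
+window.  The August 2018 window for CC-MAIN-2018-34 and the truncated
+sample payload are assumptions for illustration:

```python
# Flag a cdx entry whose timestamp falls outside the crawl's nominal window.
# The key and timestamp come from the zgrep output above; the JSON payload
# is abbreviated.
import json
from datetime import datetime

line = ('net,tyredeyes)/robots.txt 20090201191318 '
        '{"url": "http://tyredeyes.net/robots.txt", "status": "301"}')

def parse_cdx(line):
    """Split a cdx line into (SURT key, timestamp, JSON payload)."""
    key, ts, blob = line.split(' ', 2)
    return key, datetime.strptime(ts, '%Y%m%d%H%M%S'), json.loads(blob)

key, ts, info = parse_cdx(line)
# Assumed nominal window for the 2018-34 crawl: fetched during August 2018.
in_window = datetime(2018, 8, 1) <= ts <= datetime(2018, 9, 1)
print(ts.isoformat(), 'ok' if in_window else 'bogus')
```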
+
+This happens in 2019-35 as well :-(
+
+  >: fgrep ' 20181023' CC-MAIN-2019-35/cdx/cluster.idx
+  com,gyshbsh)/robots.txt 20181023022000	cdx-00078.gz	356340085	162332	315406
+  >: zgrep ' 20181023' CC-MAIN-2019-35/cdx/warc/cdx-00078.gz
+  com,gyshbsh)/robots.txt 20181023022000 {"url": "http://gyshbsh.com/robots.txt", "mime": "text/html", "mime-detected": "text/html", "status": "301", "digest": "3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ", "length": "529", "offset": "638892", "filename": "crawl-data/CC-MAIN-2019-35/segments/1566027315618.73/robotstxt/CC-MAIN-20190820200701-20190820222701-00120.warc.gz"}
+  ...
+
+Tabulate all the date ranges for the WARC files we have
+
+  >: for d in {2017-30,2019-35,2020-34,2021-25,2023-40,2023-50}; do printf "%s\t" $d; (ls CC-MAIN-$d/*.{?,??}/orig/warc | fgrep .gz | cut -f 3,4 -d - | sort -u |tee /dev/fd/3 | head -1 ) 3> >( tail -1 ) | tr '\n' '\t'; echo; done | cut -f 1,2,4 -d -  | sed 's/-20/ 20/;s/.$//' | tr ' ' '\t' > dates.tsv
+  >: for d in {2018-30,2018-34}; do printf "%s\t" $d; (ls CC-MAIN-$d/{*.?,*.??} | fgrep warc.gz | cut -f 3,4 -d - | sort -u |tee /dev/fd/3 | { sleep 10 ; head -1 ; } ) 3> >( tail -1 ) | tr '\n' '\t'; echo; done >> dates.tsv
+  >: for d in 2019-18; do printf "%s\t" $d; (ls CC-MAIN-$d/{*.?,*.??} | fgrep warc.gz | cut -f 3,4 -d - | sort -u | head -1); done
+2019-18	20190418101243-20190418122248
+  >: for d in 2019-18; do printf "%s\t" $d; (ls CC-MAIN-$d/{*.?,*.??} | fgrep warc.gz | cut -f 3,4 -d - | sort -u | tail -1); done
+2019-18	20190426153423-20190426175423
+  >: echo 2019-18       20190418101243-20190418122248   20190426153423-20190426175423 >> dates.tsv 
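+What the ls|cut|sort pipelines above boil down to, as a Python sketch:
+pull the <start>-<end> stanza out of each WARC filename and report the
+earliest start and latest end.  The sample names are made up but follow
+the CC-MAIN-<start>-<end>-<serial>.warc.gz pattern, with stamps chosen
+to match the 2019-18 result recorded above:

```python
# Extract the 14-digit start/end stamps from WARC filenames and find the
# overall date range, as the head -1 / tail -1 pipelines do.
import re

names = [
    'CC-MAIN-20190418101243-20190418122248-00000.warc.gz',
    'CC-MAIN-20190426153423-20190426175423-00558.warc.gz',
    'CC-MAIN-20190420115555-20190420141556-00120.warc.gz',
]

stamp = re.compile(r'CC-MAIN-(\d{14})-(\d{14})-\d+\.warc\.gz')
ranges = sorted(stamp.search(n).groups() for n in names)
first_start = ranges[0][0]
last_end = max(end for _, end in ranges)
print(first_start, last_end)   # 20190418101243 20190426175423
```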
+  >: pwd
+  /beegfs/common_crawl/CC-MAIN-2016-30/cdx/warc
+  >: echo {000..299} | tr ' ' '\n' | parallel -j 10 'uz cdx-00{}.gz | cut -f 2 -d " " | sort -u > /tmp/hst/{}'
+  >: sort -mu /tmp/hst/??? > /tmp/hst/all
+  >: wc -l /tmp/hst/all
+  679686 /tmp/hst/all
+  >: head -1 /tmp/hst/all
+  20160723090435
+  >: tail -1 /tmp/hst/all
+  20160731110639
+  >: cd ../../..
+  >: echo 2016-30       20160723090435  20160731110639 >> dates.tsv 
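+The parallel + sort -mu + head/tail sequence above is equivalent to
+merging per-file sorted timestamp lists and taking the first and last
+entries.  A sketch with two stand-in lists (the real inputs are the 300
+per-cdx-file outputs in /tmp/hst):

```python
# Merge already-sorted timestamp lists and dedupe, as sort -mu does,
# then report the earliest and latest timestamps.
import heapq

lists = [
    ['20160723090435', '20160725000000'],   # e.g. from cdx-00000.gz
    ['20160724120000', '20160731110639'],   # e.g. from cdx-00001.gz
]

merged = list(dict.fromkeys(heapq.merge(*lists)))  # merge + unique
print(merged[0], merged[-1])   # 20160723090435 20160731110639
```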
+tweaked and sorted in xemacs:
+  2016-30	20160723090435	20160731110639
+  2017-30	20170720121902	20170729132938
+  2018-30	20180715183800	20180723184955
+  2018-34	20180814062251	20180822085454
+  2019-18	20190418101243	20190426175423
+  2019-35	20190817102624	20190826111356
+  2020-34	20200803083123	20200815214756
+  2021-25	20210612103920	20210625145905
+  2023-40	20230921073711	20231005042006
+  2023-50	20231128083443	20231212000408
+
+Added to status.xlsx in shortened form, with the number of days each
+crawl spanned:
+  2016-30	8
+  2017-30	9
+  2018-30	8
+  2018-34	8
+  2019-18	8
+  2019-35	9
+  2020-34	12
+  2021-25	13
+  2023-40	15
+  2023-50	15
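+The day counts above were worked out by hand; a sketch of deriving one
+from the tabulated range (reproducing the 2016-30 entry):

```python
# Whole days between the first and last WARC timestamps of a crawl.
from datetime import datetime

def days(start, end):
    fmt = '%Y%m%d%H%M%S'
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).days

print(days('20160723090435', '20160731110639'))   # 8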
+
Binary file lurid3/status.xlsx has changed