annotate index.html @ 6:cc5cef8ba548 default tip

expanded with example script, updated to point to full paper, include slides
author Henry Thompson <ht@markup.co.uk>
date Thu, 23 May 2024 16:51:36 +0200
parents 268fe5fd117f
children
1
d6f13dda3a11 As sent to Lindahl and Nagel
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 0
diff changeset
<?xml version="1.0" encoding="US-ASCII"?>
<!DOCTYPE html
  PUBLIC "-//HST//DTD XHTML5 1.0 Transitional//EN" "http://www.ltg.ed.ac.uk/~ht/xhtml5.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head><meta name="copyright" content="Copyright &#xa9; 2024 &lt;a href=&#34;http://www.ltg.ed.ac.uk/~ht/&#34;&gt;Henry S. Thompson&lt;/a&gt;&amp;#160;&lt;a href=&#34;http://creativecommons.org/licenses/by-sa/3.0/deed.en&#34;&gt;CC-BY-SA&lt;/a&gt;"/><meta http-equiv="Content-type" content="text/html; charset=utf-8"/><style type="text/css">
ul.nolabel { margin: 0; margin-left: -2.5em}
ul.naked.nolabel {margin: 0; margin-left: 0; padding-left: 0}
ul.cdefn {clear: both}
div.ndli { margin-bottom: 1ex }
div.hidden {display: none}

ul.naked > li { list-style-type: none; background: none; margin-left: 2em;
  margin-bottom: 0 }
li ul.naked > li, dd ul.naked > li { list-style-type: none; background: none; margin-left: 0;
  margin-bottom: 0 }
li.cdefni {}
li.cdefni span.cl {display: inline-block; vertical-align: bottom}
li.cdefni span.cr {display: inline-block; margin-left: 1em; vertical-align: bottom}
pre.code {display: inline-block}
blockquote.vanilla {display: inline-block; margin-left: 1em;
  border: solid 1px; background: rgb(238,234,230);
  padding: .5ex .5em}
blockquote.vanilla ul.naked li {margin-left: 0 ! important;font-size: 100%}
ol ol ol, ol ol ol li {list-style-type: lower-roman}
ol ol, ol ol li {list-style-type: lower-alpha}
i i {font-style: normal}
li li {font-style: normal}
li ul li {font-style: normal}
li { line-height: 100%; margin-top: 0.3em}
.math {font-family: 'Arial Unicode MS', 'Lucida Sans Unicode', serif}
.sub {font-size: 80%; vertical-align: sub}
.termref {text-decoration: none; color: #606000}
.licence {margin-left: 1em; font-size: 70%}
.credits {margin-left: 1.5em; font-size: 70%}
.right {position: absolute}
.stackdown {vertical-align: text-top; margin-top: 0}
body {font-size: 12pt}
pre.numbered {
  white-space: pre-wrap;
}
div.counter {
  counter-reset: listing;
}
pre.numbered code {
  counter-increment: listing;
}
pre.cl code::before {
  content: "$ " ;
  font-size: 80%;
  width: 2em
}
pre.numbered code::before {
  content: counter(listing) ". ";
  display: inline-block;
  font-size: 80%;
  width: 3em;
  padding-left: auto;
  margin-left: auto;
  text-align: right;
}

/* content doesn't combine :-( */
pre.numbered.cl code::before {
  content: counter(listing) ". $ ";
  display: inline-block;
  font-size: 80%;
  width: 3em;
  padding-left: auto;
  margin-left: auto;
  text-align: right;
}
@page { size: A4 portrait; margin: 2cm;
  orphans: 2; widows: 2;}
@media screen {
  body {width: 20cm; margin-left: auto; margin-right: auto}
}
@media print {
  body {font-size: 10pt}
  h1, h2, h3, h4 {page-break-after: avoid}
}
pre.code {font-family: monospace;
  font-weight: bold;
  line-height: 120%;
  padding-top: 0.2em;
  padding-bottom: 0.2em;
  padding-left: 1em;
  padding-right: 1em;
  border-style: solid;
  border-left-width: 1em;
  border-top-width: thin;
  border-right-width: thin;
  border-bottom-width: thin;
  border-color: #95ABD0;
  color: #00428C;
  background-color: #E4E5E7;
}
pre {margin-left: 0em}
div.toc h2 {font-size: 120%; margin-top: 0em; margin-bottom: 0em}
div.toc h4 {font-size: 100%; margin-top: 0em; margin-bottom: 0em;
  margin-left: 1em}
div.toc h1 {font-size: 140%; margin-bottom: 0em}
div.toc ul {margin-top: 1ex}
.byline {font-size: 120%}
div.figure {margin-left: 2em}
div.caption {font-style: italic; font-weight: bold; margin-top: 1em}
i i {font-style: normal}
img {border: 0}
.copyright {font-size: 70%}
.note {width: 20%; float: right; clear: right; margin-left: .5em}
</style><title>Augmentations to Common Crawl</title></head><body style="font-family: DejaVu Sans, Arial; background: rgb(254,250,246)"><div style="text-align: center" class="head"><h1>Augmentations to Common Crawl</h1><hr/><div class="byline">Henry S. Thompson</div><div class="byline">23 May 2024</div><div class="copyright">Copyright &#xa9; 2024 <a href="http://www.ltg.ed.ac.uk/~ht/">Henry S. Thompson</a>&#160;<a href="http://creativecommons.org/licenses/by-sa/3.0/deed.en">CC-BY-SA</a></div></div><div class="body"><div><h2>1. Introduction</h2><p>This site contains a copy of my augmented index files
for <a href="https://commoncrawl.org/blog/august-2019-crawl-archive-now-available">CC-MAIN-2019-35</a>. The augmented index contains everything in <a href="https://commoncrawl.org/blog/august-2019-crawl-archive-now-available">the original index</a>, plus one additional field, <code>lastmod</code>, which is present in about 18% of the entries. This field gives the value of the <code>Last-Modified</code> header as a POSIX-format timestamp, which enables much finer-grained longitudinal study of the corresponding web resources. The filename, offset and length fields in the augmented index are unchanged, so they can still be used for retrieval from the original WARC files.</p><p>The format of Common Crawl's index files is described in <a href="https://commoncrawl.org/blog/announcing-the-common-crawl-index">this announcement</a>.</p></div><div><h2>2. Contents</h2><ul class=" "><li>My <a href="Thompson_WebSci24.pdf">paper</a>, presented at WebSci24, describing
the augmented index and its uses</li><li>The <a href="CC-MAIN-2019-35/cdx/warc/cluster.idx">top-level index file</a></li><li><a href="CC-MAIN-2019-35/cdx/warc/idx/">The directory containing
the individual gzipped index files themselves</a>, with names of the form
<code>cdx-00nnn.gz</code>, for <code>nnn</code> in <code>000&#x2013;299</code></li><li><a href="Thompson_WebSci24_slides.pdf">WebSci24 conference slides</a></li></ul></div><div><h2>3. Efficient access to Common Crawl using Amazon S3</h2><p>The University of Edinburgh's <a href="https://edinburgh-international-data-facility.ed.ac.uk/">Edinburgh International Data Facility</a> (EIDF) hosts a
copy of the augmented index on an Amazon S3 server. The server supports open
access to the index via unsigned (range-restricted) requests to
<b>s3:</b> URIs, for example using the <a href="https://aws.amazon.com/cli/">Amazon <code>aws</code>
Command Line Interface</a>.</p><p>Once you have read how the index itself works
(<a href="Thompson_WebSci24.pdf">section 2.1 of the paper</a>), the best way to understand how this access works is to work through <a href="eidf125_example.sh">an example</a> that uses the augmented index and a timestamp to access an individual
Common Crawl retrieval record.</p></div><div><h2>4. Licence and citation</h2><p>The paper and data contained herein are Copyright &#xa9; 2024 Henry S. Thompson, licensed under <a href="http://creativecommons.org/licenses/by-sa/3.0/deed.en">CC-BY-SA</a>.</p><p>Please cite material from this site as follows:</p><ul class="naked "><li><a name="For_the_paper"><b>For the paper</b></a>
&#xa0;&#xa0;<blockquote class="vanilla"><div>Henry S. Thompson. 2024. "Improved methodology for longitudinal Web
analytics using Common Crawl". In <i>ACM Web Science Conference (Websci &#x2019;24)</i>,
May 21&#x2013;24, 2024, Stuttgart, Germany. ACM, New York, NY, USA, 11 pages.
<a href="https://doi.org/10.1145/3614419.3644018">https://doi.org/10.1145/3614419.3644018</a>
</div></blockquote></li><li><a name="For_the_data"><b>For the data</b></a>
&#xa0;&#xa0;<blockquote class="vanilla"><div>Henry S. Thompson. 2024. <i>Augmented index
for Common Crawl August 2019, with Last-Modified timestamps</i>.
<a href="https://markup.co.uk/ccrawl/">https://markup.co.uk/ccrawl/</a>. Retrieved ...</div></blockquote></li></ul></div><div><h2>5. Acknowledgements</h2><p>Without the vision of those responsible for Common Crawl and the
generosity of Amazon in hosting it, this work could never have happened.</p><p>Access to the <a href="http://www.cirrus.ac.uk">Cirrus UK National Tier-2 HPC Service</a> at the Edinburgh
Parallel Computing Centre, which was used to produce the augmented index, was supported
by EPSRC and UKRI HPC Access awards to Henry S. Thompson.</p><p>Thanks to Sebastian Nagel of Common Crawl for many prompt and helpful
replies to my emails over the years, and to Greg Lindahl of Common
Crawl and Tom Morris for more recent help with consistency problems in the index
and with the challenges of increasing load on the Common Crawl servers.</p></div></div></body></html>