annotate index.xml @ 6:cc5cef8ba548 default tip

expanded with example script, updated to point to full paper, include slides
author Henry Thompson <ht@markup.co.uk>
date Thu, 23 May 2024 16:51:36 +0200
parents 268fe5fd117f
children
rev   line source
1
d6f13dda3a11 As sent to Lindahl and Nagel
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1 <?xml version='1.0'?>
2 <?xml-stylesheet type="text/xsl" href="../../../lib/xml/doc.xsl" ?>
3 <!DOCTYPE doc SYSTEM "../../../lib/xml/doc.dtd" >
4 <doc>
5 <head>
6 <title>Augmentations to Common Crawl</title>
7 <author>Henry S. Thompson</author>
6
cc5cef8ba548 expanded with example script,
Henry Thompson <ht@markup.co.uk>
parents: 4
diff changeset
8 <date>23 May 2024</date>
1
d6f13dda3a11 As sent to Lindahl and Nagel
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
9 </head>
10 <body>
11 <div>
12 <title>Introduction</title>
6
cc5cef8ba548 expanded with example script,
Henry Thompson <ht@markup.co.uk>
parents: 4
diff changeset
13 <p>This site contains a copy of my augmented index files
14 for <link href="https://commoncrawl.org/blog/august-2019-crawl-archive-now-available">CC-MAIN-2019-35</link>. This index contains all of <link href="https://commoncrawl.org/blog/august-2019-crawl-archive-now-available">the original index</link>, with one additional field, <code>lastmod</code>, in about 18% of the entries, giving the value of the <code>Last-Modified</code> header as a POSIX-format timestamp, enabling much finer-grained longitudinal study of the corresponding web resources. The filename, offset and length fields in the augmented index are unchanged, and so can be used for retrieval from the original WARC files.</p>
15 <p>The format of Common Crawl's index files is described in <link href="https://commoncrawl.org/blog/announcing-the-common-crawl-index">this announcement</link>.</p>
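As a rough illustration of how an augmented entry might be consumed, the sketch below parses one index line and converts its `lastmod` value to a UTC date. It assumes the usual Common Crawl layout of a SURT key, a capture timestamp and a JSON object separated by single spaces, and it assumes `lastmod` is stored in that JSON object as a string of digits; neither detail is confirmed here. The sample line and the helper names are made up for the example.

```python
import json
from datetime import datetime, timezone


def parse_index_line(line):
    """Split one index line into (SURT key, capture timestamp, field dict)."""
    key, capture, fields = line.rstrip("\n").split(" ", 2)
    return key, capture, json.loads(fields)


def lastmod_utc(fields):
    """Return lastmod as an aware UTC datetime, or None when the field is absent."""
    value = fields.get("lastmod")  # assumed field placement and type
    if value is None:
        return None
    return datetime.fromtimestamp(int(value), tz=timezone.utc)


def byte_range(fields):
    """Build an HTTP Range value covering exactly one WARC record."""
    first = int(fields["offset"])
    last = first + int(fields["length"]) - 1  # Range end is inclusive
    return f"bytes={first}-{last}"


# Made-up entry, for illustration only.
SAMPLE = (
    'org,example)/ 20190817000000 {"url": "http://example.org/", '
    '"status": "200", "length": "1000", "offset": "5000", '
    '"filename": "example.warc.gz", "lastmod": "1565000000"}'
)

if __name__ == "__main__":
    key, capture, fields = parse_index_line(SAMPLE)
    print(key, capture, lastmod_utc(fields), byte_range(fields))
```

Because the filename, offset and length fields are unchanged from the original index, the range computed this way should apply equally to the original WARC files.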
1
d6f13dda3a11 As sent to Lindahl and Nagel
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
16 </div>
17 <div>
18 <title>Contents</title>
19 <list>
6
cc5cef8ba548 expanded with example script,
Henry Thompson <ht@markup.co.uk>
parents: 4
diff changeset
20 <item>My <link href="Thompson_WebSci24.pdf">paper</link>, presented at WebSci24, describing
1
d6f13dda3a11 As sent to Lindahl and Nagel
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
21 the augmented index and its uses</item>
22 <item>The <link href="CC-MAIN-2019-35/cdx/warc/cluster.idx">top-level index file</link></item>
23 <item><link href="CC-MAIN-2019-35/cdx/warc/idx/">The directory containing
24 the individual gzipped index files themselves</link>, with names of the form
25 <code>cdx-00nnn.gz</code>, for <code>nnn</code> in <code>000&ndash;299</code></item>
4
268fe5fd117f add slides
Henry Thompson <ht@markup.co.uk>
parents: 1
diff changeset
26 <item><link href="Thompson_WebSci24_slides.pdf">WebSci 24 conference slides</link></item>
1
d6f13dda3a11 As sent to Lindahl and Nagel
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
27 </list>
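The file naming described in the list above can be sketched as follows. The shard names follow the `cdx-00nnn.gz` pattern given there; the `cluster.idx` line layout (key and timestamp, then tab-separated shard name, offset, length and sequence number) is an assumption based on the common ZipNum convention and should be checked against the actual file. The sample line is made up.

```python
def shard_names(count=300):
    """Names of the gzipped index shards: cdx-00000.gz up to cdx-00299.gz."""
    return [f"cdx-{n:05d}.gz" for n in range(count)]


def parse_cluster_line(line):
    """Parse one top-level index line (assumed ZipNum layout, tab-separated)."""
    key_and_time, shard, offset, length, seq = line.rstrip("\n").split("\t")
    key, capture = key_and_time.split(" ", 1)
    return {
        "key": key,
        "capture": capture,
        "shard": shard,
        "offset": int(offset),   # byte offset of the block inside the shard
        "length": int(length),   # compressed length of the block
        "seq": int(seq),
    }


# Made-up line, for illustration only.
SAMPLE_CLUSTER_LINE = "org,example)/ 20190817000000\tcdx-00123.gz\t4096\t2048\t17"

if __name__ == "__main__":
    names = shard_names()
    print(names[0], names[-1], len(names))
    print(parse_cluster_line(SAMPLE_CLUSTER_LINE))
```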
28 </div>
29 <div>
6
cc5cef8ba548 expanded with example script,
Henry Thompson <ht@markup.co.uk>
parents: 4
diff changeset
30 <title>Efficient access to Common Crawl using Amazon S3<!-- <item> <link href="">eidf125_example.sh</link>.</item>--></title>
31 <p>The University of Edinburgh's <link href="https://edinburgh-international-data-facility.ed.ac.uk/">Edinburgh International Data Facility</link> (EIDF) hosts a
32 copy of the augmented index on an Amazon S3 server. It supports open
33 access to the index via unsigned requests to (range-restricted)
34 <name>s3:</name> URIs, for example using the <link href="https://aws.amazon.com/cli/">Amazon <code>aws</code>
35 Command Line Interface</link>.</p>
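A minimal sketch of what an unsigned, range-restricted request could look like is given below. It only builds an `aws s3api get-object` command line and does not contact any server. The endpoint URL, bucket name and object key are placeholders, not the real EIDF values, which are not stated here; the `--no-sign-request`, `--endpoint-url` and `--range` options are standard `aws` CLI options.

```python
import shlex


def get_object_command(endpoint_url, bucket, key, offset, length, outfile):
    """Build an aws CLI command fetching `length` bytes starting at `offset`."""
    first = int(offset)
    last = first + int(length) - 1  # HTTP byte ranges are inclusive
    return [
        "aws", "s3api", "get-object",
        "--no-sign-request",            # unsigned (anonymous) request
        "--endpoint-url", endpoint_url,
        "--bucket", bucket,
        "--key", key,
        "--range", f"bytes={first}-{last}",
        outfile,
    ]


if __name__ == "__main__":
    # Placeholder endpoint, bucket and key: substitute the real values.
    cmd = get_object_command(
        "https://s3.example.invalid",
        "example-bucket",
        "CC-MAIN-2019-35/cdx/warc/idx/cdx-00123.gz",
        4096,
        2048,
        "block.gz",
    )
    print(shlex.join(cmd))
```

Building the argument list separately from running it makes it easy to inspect the exact range before any data is transferred.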
36 <p>The best way to understand how this works, once you've read how
37 the index itself works <link href="Thompson_WebSci24.pdf">in the paper, section 2.1</link>, is to work through <link href="eidf125_example.sh">an example</link> of using the augmented index to access an individual
38 Common Crawl retrieval record using a timestamp.</p>
39 </div>
40 <div>
1
d6f13dda3a11 As sent to Lindahl and Nagel
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
41 <title>Licence and citation</title>
42 <p>The paper and data contained herein are Copyright © 2024 Henry S. Thompson, licensed under <link href="http://creativecommons.org/licenses/by-sa/3.0/deed.en">CC-BY-SA</link>.</p>
43 <p>Please cite information from here as follows:</p>
44 <list type="1defn">
45 <item term="For the paper"><display>Henry S. Thompson. 2024. "Improved methodology for longitudinal Web
46 analytics using Common Crawl". In <emph>ACM Web Science Conference (Websci ’24)</emph>,
47 May 21–24, 2024, Stuttgart, Germany. ACM, New York, NY, USA, 11 pages.
6
cc5cef8ba548 expanded with example script,
Henry Thompson <ht@markup.co.uk>
parents: 4
diff changeset
48 <link href="https://doi.org/10.1145/3614419.3644018">https://doi.org/10.1145/3614419.3644018</link>
49 </display></item>
1
d6f13dda3a11 As sent to Lindahl and Nagel
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
50 <item term="For the data"><display>Henry S. Thompson. 2024. <emph>Augmented index
51 for Common Crawl August 2019, with Last-Modified timestamps</emph>.
52 <link href="https://markup.co.uk/ccrawl/">https://markup.co.uk/ccrawl/</link>. Retrieved ...</display></item>
53 </list>
54 </div>
55 <div>
56 <title>Acknowledgements</title>
57 <p>Without the vision of those responsible for Common Crawl and the
58 generosity of Amazon in hosting it, this work could never have happened.</p>
59 <p>Access to the <link href="http://www.cirrus.ac.uk">Cirrus UK National Tier-2 HPC Service</link> at the Edinburgh
60 Parallel Computing Centre, used to produce the augmented index, was supported
61 by EPSRC and UKRI HPC Access awards to Henry S. Thompson.</p>
62 <p>Thanks to Sebastian Nagel of Common Crawl for many prompt and helpful
63 replies to many emails over the years, and to Greg Lindahl of Common
64 Crawl and Tom Morris for more recent help with consistency problems in the index
65 and the challenges of increasing load on the Common Crawl servers.</p>
66 </div>
67 </body>
68 </doc>