annotate index.xml @ 6:cc5cef8ba548 default tip
expanded with example script,
updated to point to full paper,
include slides
author | Henry Thompson <ht@markup.co.uk> |
---|---|
date | Thu, 23 May 2024 16:51:36 +0200 |
parents | 268fe5fd117f |
children |
rev | line source |
---|---|
1
d6f13dda3a11
As sent to Lindahl and Nagel
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff
changeset
|
1 <?xml version='1.0'?> |
2 <?xml-stylesheet type="text/xsl" href="../../../lib/xml/doc.xsl" ?> |
3 <!DOCTYPE doc SYSTEM "../../../lib/xml/doc.dtd" > |
4 <doc> |
5 <head> |
6 <title>Augmentations to Common Crawl</title> |
7 <author>Henry S. Thompson</author> |
6
cc5cef8ba548
expanded with example script,
Henry Thompson <ht@markup.co.uk>
parents:
4
diff
changeset
|
8 <date>23 May 2024</date> |
1 | 9 </head> |
10 <body> |
11 <div> |
12 <title>Introduction</title> |
6 | 13 <p>This site contains a copy of my augmented index files |
14 for <link href="https://commoncrawl.org/blog/august-2019-crawl-archive-now-available">CC-MAIN-2019-35</link>. This index contains all of <link href="https://commoncrawl.org/blog/august-2019-crawl-archive-now-available">the original index</link>, with one additional field, <code>lastmod</code>, in about 18% of the entries. That field gives the value of the <code>Last-Modified</code> header as a POSIX-format timestamp, which enables much finer-grained longitudinal study of the corresponding web resources. The filename, offset and length fields in the augmented index are unchanged, so they can be used for retrieval from the original WARC files.</p> |
15 <p>The format of Common Crawl's index files is described in <link href="https://commoncrawl.org/blog/announcing-the-common-crawl-index">this announcement</link>.</p> |
1 | 16 </div> |
17 <div> |
18 <title>Contents</title> |
19 <list> |
6 | 20 <item>My <link href="Thompson_WebSci24.pdf">paper</link>, presented at WebSci24, describing |
1 | 21 the augmented index and its uses</item> |
22 <item>The <link href="CC-MAIN-2019-35/cdx/warc/cluster.idx">top-level index file</link></item> |
23 <item><link href="CC-MAIN-2019-35/cdx/warc/idx/">The directory containing |
24 the individual gzipped index files themselves</link>, with names of the form |
25 <code>cdx-00nnn.gz</code>, for <code>nnn</code> in <code>000–299</code></item> |
4 | 26 <item><link href="Thompson_WebSci24_slides.pdf">WebSci 24 conference slides</link></item> |
1 | 27 </list> |
28 </div> |
29 <div> |
6 | 30 <title>Efficient access to Common Crawl using Amazon S3<!-- <item> <link href="">eidf125_example.sh</link>.</item>--></title> |
31 <p>The University of Edinburgh's <link href="https://edinburgh-international-data-facility.ed.ac.uk/">Edinburgh International Data Facility</link> (EIDF) hosts a |
32 copy of the augmented index on an Amazon S3 server. It supports open |
33 access to the index via unsigned requests to (range-restricted) |
34 <name>s3:</name> URIs, for example using the <link href="https://aws.amazon.com/cli/">Amazon <code>aws</code> |
35 Command Line Interface</link>.</p> |
36 <p>The best way to understand how this works, once you've read how |
37 the index itself works <link href="Thompson_WebSci24.pdf">in the paper, section 2.1</link>, is to work through <link href="eidf125_example.sh">an example</link> of using the augmented index to access an individual |
38 Common Crawl retrieval record using a timestamp.</p> |
39 </div> |
40 <div> |
1 | 41 <title>Licence and citation</title> |
42 <p>The paper and data contained herein are Copyright © 2024 Henry S. Thompson, licensed under <link href="http://creativecommons.org/licenses/by-sa/3.0/deed.en">CC-BY-SA</link>.</p> |
43 <p>Please cite information from here as follows:</p> |
44 <list type="1defn"> |
45 <item term="For the paper"><display>Henry S. Thompson. 2024. "Improved methodology for longitudinal Web |
46 analytics using Common Crawl". In <emph>ACM Web Science Conference (Websci ’24)</emph>, |
47 May 21–24, 2024, Stuttgart, Germany. ACM, New York, NY, USA, 11 pages. |
6 | 48 <link href="https://doi.org/10.1145/3614419.3644018">https://doi.org/10.1145/3614419.3644018</link> |
49 </display></item> |
1 | 50 <item term="For the data"><display>Henry S. Thompson. 2024. <emph>Augmented index |
51 for Common Crawl August 2019, with Last-Modified timestamps</emph>. |
52 <link href="https://markup.co.uk/ccrawl/">https://markup.co.uk/ccrawl/</link>. Retrieved ...</display></item> |
53 </list> |
54 </div> |
55 <div> |
56 <title>Acknowledgements</title> |
57 <p>Without the vision of those responsible for Common Crawl and the |
58 generosity of Amazon in hosting it, this work could never have happened.</p> |
59 <p>Access to the <link href="http://www.cirrus.ac.uk">Cirrus UK National Tier-2 HPC Service</link> at the Edinburgh |
60 Parallel Computing Centre, used to produce the augmented index, was supported |
61 by EPSRC and UKRI HPC Access awards to Henry S. Thompson.</p> |
62 <p>Thanks to Sebastian Nagel of Common Crawl for many prompt and helpful |
63 replies to many emails over the years, and to Greg Lindahl of Common |
64 Crawl and Tom Morris for more recent help with consistency problems in the index |
65 and the challenges of increasing load on the Common Crawl servers.</p> |
66 </div> |
67 </body> |
68 </doc> |
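
The introduction in index.xml says that augmented entries carry a `lastmod` POSIX timestamp and that the filename, offset and length fields still locate the record in the original WARC files. A minimal Python sketch of reading one such entry might look as follows. It assumes the usual Common Crawl index layout of a SURT key, a 14-digit capture timestamp and a JSON object, separated by single spaces. The sample line, its filename and all its values are invented for illustration, and whether `lastmod` is stored as a string or a number is also an assumption (the code accepts either).

```python
import json
from datetime import datetime, timezone

# Hypothetical index line, made up for this sketch; not taken from the real index.
SAMPLE = (
    'com,example)/ 20190820000000 '
    '{"url": "http://example.com/", "status": "200", "length": "1000", '
    '"offset": "5000", "filename": "hypothetical/path/example.warc.gz", '
    '"lastmod": "1546300800"}'
)


def parse_cdx_line(line):
    """Split one index line into (urlkey, capture timestamp, JSON fields)."""
    urlkey, timestamp, payload = line.rstrip("\n").split(" ", 2)
    return urlkey, timestamp, json.loads(payload)


def last_modified(fields):
    """Return lastmod as an aware UTC datetime, or None if the entry has none."""
    raw = fields.get("lastmod")
    if raw is None:
        return None
    return datetime.fromtimestamp(int(raw), tz=timezone.utc)


def byte_range(fields):
    """Build the HTTP Range header value covering this record in its WARC file."""
    offset = int(fields["offset"])
    length = int(fields["length"])
    # Range end positions are inclusive, hence the -1.
    return f"bytes={offset}-{offset + length - 1}"


if __name__ == "__main__":
    urlkey, timestamp, fields = parse_cdx_line(SAMPLE)
    print(urlkey, timestamp, fields["filename"])
    print(last_modified(fields))
    print(byte_range(fields))
```

The value returned by `byte_range` is what a range-restricted request for the WARC record would carry. Entries without a `lastmod` field, which per the introduction are the majority, yield `None` from `last_modified`.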