annotate index.xml @ 3:7ec8f691a25a

rename
author Henry Thompson <ht@markup.co.uk>
date Wed, 22 May 2024 17:14:13 +0200
parents d6f13dda3a11
children 268fe5fd117f
rev   line source
1
d6f13dda3a11 As sent to Lindahl and Nagel
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1 <?xml version='1.0'?>
2 <?xml-stylesheet type="text/xsl" href="../../../lib/xml/doc.xsl" ?>
3 <!DOCTYPE doc SYSTEM "../../../lib/xml/doc.dtd" >
4 <doc>
5 <head>
6 <title>Augmentations to Common Crawl</title>
7 <author>Henry S. Thompson</author>
8 <date>15 Apr 2024</date>
9 </head>
10 <body>
11 <div>
12 <title>Introduction</title>
13 <p>This site contains a preliminary publication of my augmented <link href="https://commoncrawl.org/blog/announcing-the-common-crawl-index">index files</link>
14 for <link href="https://commoncrawl.org/blog/august-2019-crawl-archive-now-available">CC-MAIN-2019-35</link>. The augmented index adds one field, <code>lastmod</code>, to about 18% of the entries. It gives the value of the HTTP <code>Last-Modified</code> header as a POSIX-format timestamp, which enables much finer-grained longitudinal study of the corresponding web resources. The filename, offset and length fields in the augmented index are unchanged, and so can be used for retrieval from the original WARC files.</p>
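As a rough illustration of how an augmented entry might be read, the following Python sketch parses one index line, converts <code>lastmod</code> to a UTC date, and builds the byte range needed to fetch the record from its WARC file. It assumes each line has the usual Common Crawl shape (a URL key, a capture timestamp, then a JSON object) and that <code>lastmod</code> is stored in that JSON object as a number or numeric string; the sample line, its values and the helper names are hypothetical, not taken from the published files.

```python
import json
from datetime import datetime, timezone


def parse_cdx_line(line):
    """Split one index line into (URL key, capture timestamp, JSON fields).

    Assumes the layout 'key timestamp {json}' with single-space separators.
    """
    key, capture, payload = line.rstrip("\n").split(" ", 2)
    return key, capture, json.loads(payload)


def lastmod_utc(fields):
    """Return lastmod as a timezone-aware UTC datetime, or None if absent.

    Only a minority of entries are expected to carry the field.
    """
    value = fields.get("lastmod")
    if value is None:
        return None
    return datetime.fromtimestamp(int(value), tz=timezone.utc)


def byte_range(fields):
    """Return (filename, HTTP Range header value) for the WARC record."""
    offset = int(fields["offset"])
    length = int(fields["length"])
    # HTTP byte ranges are inclusive at both ends.
    return fields["filename"], f"bytes={offset}-{offset + length - 1}"


# Hypothetical entry, invented for illustration only.
sample = (
    'com,example)/ 20190820000000 {"url": "http://example.com/", '
    '"status": "200", "length": "1000", "offset": "500", '
    '"filename": "example.warc.gz", "lastmod": "1546300800"}'
)

key, capture, fields = parse_cdx_line(sample)
print(key, capture, lastmod_utc(fields), byte_range(fields))
```

Entries without <code>lastmod</code> simply yield <code>None</code> from <code>lastmod_utc</code>, so the same loop can process augmented and unaugmented lines alike.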
15 </div>
16 <div>
17 <title>Contents</title>
18 <list>
19 <item>My <link href="Thompson_WebSci24.pdf">forthcoming paper</link> describing
20 the augmented index and its uses</item>
21 <item>The <link href="CC-MAIN-2019-35/cdx/warc/cluster.idx">top-level index file</link></item>
22 <item><link href="CC-MAIN-2019-35/cdx/warc/idx/">The directory containing
23 the individual gzipped index files themselves</link>, with names of the form
24 <code>cdx-00nnn.gz</code>, for <code>nnn</code> in <code>000&ndash;299</code></item>
25 </list>
26 </div>
27 <div>
28 <title>Licence and citation</title>
29 <p>The paper and data contained herein are Copyright © 2024 Henry S. Thompson and are licensed under <link href="http://creativecommons.org/licenses/by-sa/3.0/deed.en">CC-BY-SA 3.0</link>.</p>
30 <p>Please cite information from here as follows:</p>
31 <list type="1defn">
32 <item term="For the paper"><display>Henry S. Thompson. 2024. "Improved methodology for longitudinal Web
33 analytics using Common Crawl". In <emph>ACM Web Science Conference (Websci ’24)</emph>,
34 May 21–24, 2024, Stuttgart, Germany. ACM, New York, NY, USA, 11 pages.
35 <link href="...">[coming soon]</link> </display><!--https://doi.org/10.1145/3614419.3644018--></item>
36 <item term="For the data"><display>Henry S. Thompson. 2024. <emph>Augmented index
37 for Common Crawl August 2019, with Last-Modified timestamps</emph>.
38 <link href="https://markup.co.uk/ccrawl/">https://markup.co.uk/ccrawl/</link>. Retrieved ...</display></item>
39 </list>
40 </div>
41 <div>
42 <title>Acknowledgements</title>
43 <p>Without the vision of those responsible for Common Crawl and the
44 generosity of Amazon in hosting it, this work could never have happened.</p>
45 <p>Access to the <link href="http://www.cirrus.ac.uk">Cirrus UK National Tier-2 HPC Service</link> at the Edinburgh
46 Parallel Computing Centre, used to produce the augmented index, was supported
47 by EPSRC and UKRI HPC Access awards to Henry S. Thompson.</p>
48 <p>Thanks to Sebastian Nagel of Common Crawl for many prompt and helpful
49 replies to many emails over the years, and to Greg Lindahl of Common
50 Crawl and Tom Morris for more recent help with consistency problems in the index
51 and the challenges of increasing load on the Common Crawl servers.</p>
52 </div>
53 </body>
54 </doc>