comparison index.xml @ 1:d6f13dda3a11

As sent to Lindahl and Nagel
author Henry S. Thompson <ht@inf.ed.ac.uk>
date Mon, 15 Apr 2024 15:25:50 +0100
parents
children 268fe5fd117f
0:104cc8b6789b 1:d6f13dda3a11
1 <?xml version='1.0'?>
2 <?xml-stylesheet type="text/xsl" href="../../../lib/xml/doc.xsl" ?>
3 <!DOCTYPE doc SYSTEM "../../../lib/xml/doc.dtd" >
4 <doc>
5 <head>
6 <title>Augmentations to Common Crawl</title>
7 <author>Henry S. Thompson</author>
8 <date>15 Apr 2024</date>
9 </head>
10 <body>
11 <div>
12 <title>Introduction</title>
13 <p>This site contains a preliminary publication of my augmented <link href="https://commoncrawl.org/blog/announcing-the-common-crawl-index">index files</link>
14 for <link href="https://commoncrawl.org/blog/august-2019-crawl-archive-now-available">CC-MAIN-2019-35</link>. About 18% of the entries in this index carry one additional field, <code>lastmod</code>, which gives the value of the <code>Last-Modified</code> header as a POSIX-format timestamp. This enables much finer-grained longitudinal study of the corresponding web resources. The filename, offset and length fields in the augmented index are unchanged, so they can still be used for retrieval from the original WARC files.</p>
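<p>As a minimal sketch of how an augmented entry might be read, the following assumes that each index line follows the usual Common Crawl layout (a SURT key, a capture timestamp and a JSON object) and that <code>lastmod</code> holds seconds since the epoch as a string or integer. The sample line and the function names are invented for illustration and are not taken from the real index.</p>

```python
import json
from datetime import datetime, timezone

# Made-up entry for illustration only; not taken from the real index.
# Assumption: lastmod is stored as seconds since the epoch.
SAMPLE_LINE = (
    'com,example)/ 20190820120000 '
    '{"url": "http://example.com/", "mime": "text/html", "status": "200", '
    '"length": "1234", "offset": "5678", '
    '"filename": "crawl-data/CC-MAIN-2019-35/segments/0/warc/example.warc.gz", '
    '"lastmod": "1546300800"}'
)


def parse_entry(line):
    """Split one index line into its SURT key, capture time and JSON fields."""
    surt_key, capture_time, payload = line.rstrip("\n").split(" ", 2)
    return surt_key, capture_time, json.loads(payload)


def last_modified(fields):
    """Return lastmod as an aware UTC datetime, or None when the field is absent."""
    if "lastmod" not in fields:
        return None  # only about 18% of entries carry the field
    return datetime.fromtimestamp(int(fields["lastmod"]), tz=timezone.utc)


def warc_range_header(fields):
    """Build an HTTP Range header value covering this record in its WARC file."""
    start = int(fields["offset"])
    end = start + int(fields["length"]) - 1  # the end of a byte range is inclusive
    return f"bytes={start}-{end}"


if __name__ == "__main__":
    key, captured, fields = parse_entry(SAMPLE_LINE)
    print(key, captured, last_modified(fields), warc_range_header(fields))
```

<p>Because the filename, offset and length fields are unchanged, the resulting byte range can be requested against the original WARC file named in <code>filename</code>. Entries without <code>lastmod</code> simply yield <code>None</code>.</p>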
15 </div>
16 <div>
17 <title>Contents</title>
18 <list>
19 <item>My <link href="Thompson_WebSci24.pdf">forthcoming paper</link> describing
20 the augmented index and its uses</item>
21 <item>The <link href="CC-MAIN-2019-35/cdx/warc/cluster.idx">top-level index file</link></item>
22 <item><link href="CC-MAIN-2019-35/cdx/warc/idx/">The directory containing
23 the individual gzipped index files themselves</link>, with names of the form
24 <code>cdx-00nnn.gz</code>, for <code>nnn</code> in <code>000&ndash;299</code></item>
25 </list>
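<p>The file naming scheme in the list above can be sketched as follows. The base URL is an assumption: it is the address given in the data citation, and the relative links on this page are taken to resolve against it. The function names are hypothetical.</p>

```python
# Assumption: relative links on this page resolve against this base URL,
# which is the address given in the data citation.
BASE_URL = "https://markup.co.uk/ccrawl/"
INDEX_DIR = "CC-MAIN-2019-35/cdx/warc/idx/"


def index_file_names():
    """Return the names cdx-00000.gz through cdx-00299.gz."""
    return [f"cdx-{n:05d}.gz" for n in range(300)]


def index_file_urls():
    """Return one URL per gzipped index file, under the assumed base URL."""
    return [BASE_URL + INDEX_DIR + name for name in index_file_names()]


if __name__ == "__main__":
    urls = index_file_urls()
    print(len(urls), urls[0], urls[-1])
```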
26 </div>
27 <div>
28 <title>Licence and citation</title>
29 <p>The paper and data contained herein are Copyright © 2024 Henry S. Thompson, and are licensed under <link href="http://creativecommons.org/licenses/by-sa/3.0/deed.en">CC-BY-SA 3.0</link>.</p>
30 <p>Please cite information from here as follows:</p>
31 <list type="1defn">
32 <item term="For the paper"><display>Henry S. Thompson. 2024. "Improved methodology for longitudinal Web
33 analytics using Common Crawl". In <emph>ACM Web Science Conference (Websci ’24)</emph>,
34 May 21–24, 2024, Stuttgart, Germany. ACM, New York, NY, USA, 11 pages.
35 <link href="...">[coming soon]</link>
36 </display><!--https://doi.org/10.1145/3614419.3644018--></item>
37 <item term="For the data"><display>Henry S. Thompson. 2024. <emph>Augmented index
38 for Common Crawl August 2019, with Last-Modified timestamps</emph>.
39 <link href="https://markup.co.uk/ccrawl/">https://markup.co.uk/ccrawl/</link>. Retrieved ...</display></item>
40 </list>
41 </div>
42 <div>
43 <title>Acknowledgements</title>
44 <p>Without the vision of those responsible for Common Crawl and the
45 generosity of Amazon in hosting it, this work could never have happened.</p>
46 <p>Access to the <link href="http://www.cirrus.ac.uk">Cirrus UK National Tier-2 HPC Service</link> at the Edinburgh
47 Parallel Computing Centre, which was used to produce the augmented index, was supported
48 by EPSRC and UKRI HPC Access awards to Henry S. Thompson.</p>
49 <p>Thanks to Sebastian Nagel of Common Crawl for many prompt and helpful
50 replies to my emails over the years, and to Greg Lindahl of Common
51 Crawl and Tom Morris for more recent help with consistency problems in the index
52 and the challenges of increasing load on the Common Crawl servers.</p>
53 </div>
54 </body>
55 </doc>