comparison index.xml @ 6:cc5cef8ba548 default tip

expanded with example script, updated to point to full paper, include slides
author Henry Thompson <ht@markup.co.uk>
date Thu, 23 May 2024 16:51:36 +0200
parents 268fe5fd117f
children
comparison
5:e265fcc42974 6:cc5cef8ba548
3 <!DOCTYPE doc SYSTEM "../../../lib/xml/doc.dtd" > 3 <!DOCTYPE doc SYSTEM "../../../lib/xml/doc.dtd" >
4 <doc> 4 <doc>
5 <head> 5 <head>
6 <title>Augmentations to Common Crawl</title> 6 <title>Augmentations to Common Crawl</title>
7 <author>Henry S. Thompson</author> 7 <author>Henry S. Thompson</author>
8 <date>22 May 2024</date> 8 <date>23 May 2024</date>
9 </head> 9 </head>
10 <body> 10 <body>
11 <div> 11 <div>
12 <title>Introduction</title> 12 <title>Introduction</title>
13 <p>This site contains a preliminary publication of my augmented <link href="https://commoncrawl.org/blog/announcing-the-common-crawl-index">index files</link> 13 <p>This site contains a copy of my augmented index files
14 for <link href="https://commoncrawl.org/blog/august-2019-crawl-archive-now-available">CC-MAIN-2019-35</link>. This index contains one additional field, <code>lastmod</code>, in about 18% of the entries, giving the value of the <code>Last-Modified</code> header as a POSIX-format timestamp, enabling much finer-grained longitudinal study of the corresponding web resources. The filename, offset and length fields in the augmented index are unchanged, and so can be used for retrieval from the original WARC files.</p> 14 for <link href="https://commoncrawl.org/blog/august-2019-crawl-archive-now-available">CC-MAIN-2019-35</link>. This index contains all of <link href="https://commoncrawl.org/blog/august-2019-crawl-archive-now-available">the original index</link>, with one additional field, <code>lastmod</code>, in about 18% of the entries, giving the value of the <code>Last-Modified</code> header as a POSIX-format timestamp, enabling much finer-grained longitudinal study of the corresponding web resources. The filename, offset and length fields in the augmented index are unchanged, and so can be used for retrieval from the original WARC files.</p>
15 <p>The format of the Common Crawl's index files is described in <link href="https://commoncrawl.org/blog/announcing-the-common-crawl-index">this announcement</link>.</p>
15 </div> 16 </div>
16 <div> 17 <div>
17 <title>Contents</title> 18 <title>Contents</title>
18 <list> 19 <list>
19 <item>My <link href="Thompson_WebSci24.pdf">forthcoming paper</link> describing 20 <item>My <link href="Thompson_WebSci24.pdf">paper</link>, presented at WebSci24, describing
20 the augmented index and its uses</item> 21 the augmented index and its uses</item>
21 <item>The <link href="CC-MAIN-2019-35/cdx/warc/cluster.idx">top-level index file</link></item> 22 <item>The <link href="CC-MAIN-2019-35/cdx/warc/cluster.idx">top-level index file</link></item>
22 <item><link href="CC-MAIN-2019-35/cdx/warc/idx/">The directory containing 23 <item><link href="CC-MAIN-2019-35/cdx/warc/idx/">The directory containing
23 the individual gzipped index files themselves</link>, with names of the form 24 the individual gzipped index files themselves</link>, with names of the form
24 <code>cdx-00nnn.gz</code>, for <code>nnn</code> in <code>000&ndash;299</code></item> 25 <code>cdx-00nnn.gz</code>, for <code>nnn</code> in <code>000&ndash;299</code></item>
25 <item><link href="Thompson_WebSci24_slides.pdf">WebSci 24 conference slides</link></item> 26 <item><link href="Thompson_WebSci24_slides.pdf">WebSci 24 conference slides</link></item>
26 </list> 27 </list>
27 </div> 28 </div>
28 <div> 29 <div>
30 <title>Efficient access to Common Crawl using Amazon S3<!-- <item> <link href="">eidf125_example.sh</link>.</item>--></title>
31 <p>The University of Edinburgh's <link href="https://edinburgh-international-data-facility.ed.ac.uk/">Edinburgh International Data Facility</link> (EIDF) hosts a
32 copy of the augmented index in an Amazon S3 server. It supports open
33 access to the index via unsigned requests to (range-restricted)
34 <name>s3:</name> URIs, for example using the <link href="https://aws.amazon.com/cli/">Amazon <code>aws</code>
35 Command Line Interface</link>.</p>
36 <p>The best way to understand how this works, once you've read how
37 the index itself works <link href="Thompson_WebSci24.pdf">in the paper, section 2.1</link>, is to work through <link href="eidf125_example.sh">an example</link> of using the augmented index to access an individual
38 Common Crawl retrieval record using a timestamp.</p>
39 </div>
40 <div>
29 <title>Licence and citation</title> 41 <title>Licence and citation</title>
30 <p>The paper and data contained herein are Copyright © 2024 Henry S. Thompson <link href="http://creativecommons.org/licenses/by-sa/3.0/deed.en">CC-BY-SA</link></p> 42 <p>The paper and data contained herein are Copyright © 2024 Henry S. Thompson <link href="http://creativecommons.org/licenses/by-sa/3.0/deed.en">CC-BY-SA</link></p>
31 <p>Please cite information from here as follows:</p> 43 <p>Please cite information from here as follows:</p>
32 <list type="1defn"> 44 <list type="1defn">
33 <item term="For the paper"><display>Henry S. Thompson. 2024. "Improved methodology for longitudinal Web 45 <item term="For the paper"><display>Henry S. Thompson. 2024. "Improved methodology for longitudinal Web
34 analytics using Common Crawl". In <emph>ACM Web Science Conference (Websci ’24)</emph>, 46 analytics using Common Crawl". In <emph>ACM Web Science Conference (Websci ’24)</emph>,
35 May 21–24, 2024, Stuttgart, Germany. ACM, New York, NY, USA, 11 pages. 47 May 21–24, 2024, Stuttgart, Germany. ACM, New York, NY, USA, 11 pages.
36 <link href="...">[coming soon]</link> 48 <link href="https://doi.org/10.1145/3614419.3644018">https://doi.org/10.1145/3614419.3644018</link>
37 </display><!--https://doi.org/10.1145/3614419.3644018--></item> 49 </display></item>
38 <item term="For the data"><display>Henry S. Thompson. 2024. <emph>Augmented index 50 <item term="For the data"><display>Henry S. Thompson. 2024. <emph>Augmented index
39 for Common Crawl August 2019, with Last-Modified timestamps</emph>. 51 for Common Crawl August 2019, with Last-Modified timestamps</emph>.
40 <link href="https://markup.co.uk/ccrawl/">https://markup.co.uk/ccrawl/</link>. Retrieved ...</display></item> 52 <link href="https://markup.co.uk/ccrawl/">https://markup.co.uk/ccrawl/</link>. Retrieved ...</display></item>
41 </list> 53 </list>
42 </div> 54 </div>
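The text in this revision says that the filename, offset and length fields of an augmented index entry can be used to retrieve the record from the original WARC files, and that lastmod holds a POSIX-format timestamp. A minimal Python sketch of how one entry might be read follows. It assumes the usual Common Crawl CDX line layout (URL key, capture timestamp, then a JSON object) and assumes numeric fields may arrive as strings. The sample line is hypothetical, not taken from the real index.

```python
import json
from datetime import datetime, timezone


def parse_cdx_line(line):
    """Split one index line into (urlkey, capture timestamp, properties dict).

    Assumes the layout 'urlkey timestamp {json}', with single spaces
    separating the three parts.
    """
    urlkey, timestamp, props = line.rstrip("\n").split(" ", 2)
    return urlkey, timestamp, json.loads(props)


def byte_range(props):
    """Build an HTTP Range header value covering the record in its WARC file."""
    offset = int(props["offset"])
    length = int(props["length"])
    # Range end positions are inclusive, hence the -1.
    return f"bytes={offset}-{offset + length - 1}"


def last_modified(props):
    """Return lastmod as a UTC datetime, or None when the entry has no lastmod."""
    value = props.get("lastmod")
    if value is None:
        return None
    return datetime.fromtimestamp(int(value), tz=timezone.utc)


# Hypothetical index line, for illustration only.
sample = (
    'com,example)/ 20190820123456 '
    '{"url": "http://example.com/", "status": "200", '
    '"length": "1000", "offset": "5000", '
    '"filename": "example.warc.gz", "lastmod": "1560000000"}'
)

urlkey, captured, props = parse_cdx_line(sample)
print(props["filename"], byte_range(props))
print(last_modified(props))
```

The range string is the kind of value a range-restricted request for the named WARC file would carry. Since only about 18% of entries have lastmod, callers need to handle a None result.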