comparison index.html @ 6:cc5cef8ba548 default tip

expanded with example script, updated to point to full paper, include slides
author Henry Thompson <ht@markup.co.uk>
date Thu, 23 May 2024 16:51:36 +0200
parents 268fe5fd117f
children
5:e265fcc42974 6:cc5cef8ba548
104 div.caption {font-style: italic; font-weight: bold; margin-top: 1em} 104 div.caption {font-style: italic; font-weight: bold; margin-top: 1em}
105 i i {font-style: normal} 105 i i {font-style: normal}
106 img {border: 0} 106 img {border: 0}
107 .copyright {font-size: 70%} 107 .copyright {font-size: 70%}
108 .note {width: 20%; float: right; clear: right; margin-left: .5em} 108 .note {width: 20%; float: right; clear: right; margin-left: .5em}
109 </style><title>Augmentations to Common Crawl</title></head><body style="font-family: DejaVu Sans, Arial; background: rgb(254,250,246)"><div style="text-align: center" class="head"><h1>Augmentations to Common Crawl</h1><hr/><div class="byline">Henry S. Thompson</div><div class="byline">22 May 2024</div><div class="copyright">Copyright &#xa9; 2024 <a href="http://www.ltg.ed.ac.uk/~ht/">Henry S. Thompson</a>&#160;<a href="http://creativecommons.org/licenses/by-sa/3.0/deed.en">CC-BY-SA</a></div></div><div class="body"><div><h2>1. Introduction</h2><p>This site contains a preliminary publication of my augmented <a href="https://commoncrawl.org/blog/announcing-the-common-crawl-index">index files</a> 109 </style><title>Augmentations to Common Crawl</title></head><body style="font-family: DejaVu Sans, Arial; background: rgb(254,250,246)"><div style="text-align: center" class="head"><h1>Augmentations to Common Crawl</h1><hr/><div class="byline">Henry S. Thompson</div><div class="byline">23 May 2024</div><div class="copyright">Copyright &#xa9; 2024 <a href="http://www.ltg.ed.ac.uk/~ht/">Henry S. Thompson</a>&#160;<a href="http://creativecommons.org/licenses/by-sa/3.0/deed.en">CC-BY-SA</a></div></div><div class="body"><div><h2>1. Introduction</h2><p>This site contains a copy of my augmented index files
110 for <a href="https://commoncrawl.org/blog/august-2019-crawl-archive-now-available">CC-MAIN-2019-35</a>. This index contains one additional field, <code>lastmod</code>, in about 18% of the entries, giving the value of the <code>Last-Modified</code> header as a POSIX-format timestamp, enabling much finer-grained longitudinal study of the corresponding web resources. The filename, offset and length fields in the augmented index are unchanged, and so can be used for retrieval from the original WARC files.</p></div><div><h2>2. Contents</h2><ul class=" "><li>My <a href="Thompson_WebSci24.pdf">forthcoming paper</a> describing 110 for <a href="https://commoncrawl.org/blog/august-2019-crawl-archive-now-available">CC-MAIN-2019-35</a>. This index contains all of <a href="https://commoncrawl.org/blog/august-2019-crawl-archive-now-available">the original index</a>, with one additional field, <code>lastmod</code>, in about 18% of the entries, giving the value of the <code>Last-Modified</code> header as a POSIX-format timestamp, enabling much finer-grained longitudinal study of the corresponding web resources. The filename, offset and length fields in the augmented index are unchanged, and so can be used for retrieval from the original WARC files.</p><p>The format of the Common Crawl's index files is described in <a href="https://commoncrawl.org/blog/announcing-the-common-crawl-index">this announcement</a>.</p></div><div><h2>2. Contents</h2><ul class=" "><li>My <a href="Thompson_WebSci24.pdf">paper</a>, presented at WebSci24, describing
111 the augmented index and its uses</li><li>The <a href="CC-MAIN-2019-35/cdx/warc/cluster.idx">top-level index file</a></li><li><a href="CC-MAIN-2019-35/cdx/warc/idx/">The directory containing 111 the augmented index and its uses</li><li>The <a href="CC-MAIN-2019-35/cdx/warc/cluster.idx">top-level index file</a></li><li><a href="CC-MAIN-2019-35/cdx/warc/idx/">The directory containing
112 the individual gzipped index files themselves</a>, with names of the form 112 the individual gzipped index files themselves</a>, with names of the form
113 <code>cdx-00nnn.gz</code>, for <code>nnn</code> in <code>000&#x2013;299</code></li><li><a href="Thompson_WebSci24_slides.pdf">WebSci 24 conference slides</a></li></ul></div><div><h2>3. Licence and citation</h2><p>The paper and data contained herein are Copyright &#xa9; 2024 Henry S. Thompson <a href="http://creativecommons.org/licenses/by-sa/3.0/deed.en">CC-BY-SA</a></p><p>Please cite information from here as follows:</p><ul class="naked "><li><a name="For_the_paper"><b>For the paper</b></a> 113 <code>cdx-00nnn.gz</code>, for <code>nnn</code> in <code>000&#x2013;299</code></li><li><a href="Thompson_WebSci24_slides.pdf">WebSci 24 conference slides</a></li></ul></div><div><h2>3. Efficient access to Common Crawl using Amazon S3</h2><p>The University of Edinburgh's <a href="https://edinburgh-international-data-facility.ed.ac.uk/">Edinburgh International Data Facility</a> (EIDF) hosts a
114 copy of the augmented index in an Amazon S3 server. It supports open
115 access to the index via unsigned requests to (range-restricted)
116 <b>s3:</b> URIs, for example using the <a href="https://aws.amazon.com/cli/">Amazon <code>aws</code>
117 Command Line Interface</a>.</p><p>The best way to understand how this works, once you've read how
118 the index itself works <a href="Thompson_WebSci24.pdf">in the paper, section 2.1</a>, is to work through <a href="eidf125_example.sh">an example</a> of using the augmented index to access an individual
119 Common Crawl retrieval record using a timestamp.</p></div><div><h2>4. Licence and citation</h2><p>The paper and data contained herein are Copyright &#xa9; 2024 Henry S. Thompson <a href="http://creativecommons.org/licenses/by-sa/3.0/deed.en">CC-BY-SA</a></p><p>Please cite information from here as follows:</p><ul class="naked "><li><a name="For_the_paper"><b>For the paper</b></a>
114 &#xa0;&#xa0;<blockquote class="vanilla"><div>Henry S. Thompson. 2024. "Improved methodology for longitudinal Web 120 &#xa0;&#xa0;<blockquote class="vanilla"><div>Henry S. Thompson. 2024. "Improved methodology for longitudinal Web
115 analytics using Common Crawl". In <i>ACM Web Science Conference (Websci &#x2019;24)</i>, 121 analytics using Common Crawl". In <i>ACM Web Science Conference (Websci &#x2019;24)</i>,
116 May 21&#x2013;24, 2024, Stuttgart, Germany. ACM, New York, NY, USA, 11 pages. 122 May 21&#x2013;24, 2024, Stuttgart, Germany. ACM, New York, NY, USA, 11 pages.
117 <a href="...">[coming soon]</a> 123 <a href="https://doi.org/10.1145/3614419.3644018">https://doi.org/10.1145/3614419.3644018</a>
118 </div></blockquote></li><li><a name="For_the_data"><b>For the data</b></a> 124 </div></blockquote></li><li><a name="For_the_data"><b>For the data</b></a>
119 &#xa0;&#xa0;<blockquote class="vanilla"><div>Henry S. Thompson. 2024. <i>Augmented index 125 &#xa0;&#xa0;<blockquote class="vanilla"><div>Henry S. Thompson. 2024. <i>Augmented index
120 for Common Crawl August 2019, with Last-Modified timestamps</i>. 126 for Common Crawl August 2019, with Last-Modified timestamps</i>.
121 <a href="https://markup.co.uk/ccrawl/">https://markup.co.uk/ccrawl/</a>. Retrieved ...</div></blockquote></li></ul></div><div><h2>4. Acknowledgements</h2><p>Without the vision of those responsible for Common Crawl and the 127 <a href="https://markup.co.uk/ccrawl/">https://markup.co.uk/ccrawl/</a>. Retrieved ...</div></blockquote></li></ul></div><div><h2>5. Acknowledgements</h2><p>Without the vision of those responsible for Common Crawl and the
122 generosity of Amazon in hosting it this work could never have happened.</p><p>Access to the <a href="http://www.cirrus.ac.uk">Cirrus UK National Tier-2 HPC Service</a> at the Edinburgh 128 generosity of Amazon in hosting it this work could never have happened.</p><p>Access to the <a href="http://www.cirrus.ac.uk">Cirrus UK National Tier-2 HPC Service</a> at the Edinburgh
123 Parallel Computing Centre used to produce the augmented index was supported 129 Parallel Computing Centre used to produce the augmented index was supported
124 by EPSRC and UKRI HPC Access awards to Henry S. Thompson.</p><p>Thanks to Sebastian Nagel of Common Crawl for many prompt and helpful 130 by EPSRC and UKRI HPC Access awards to Henry S. Thompson.</p><p>Thanks to Sebastian Nagel of Common Crawl for many prompt and helpful
125 replies to many emails over the years, and to Greg Lindahl of Common 131 replies to many emails over the years, and to Greg Lindahl of Common
126 Crawl and Tom Morris for more recent help with consistency problems in the index 132 Crawl and Tom Morris for more recent help with consistency problems in the index