diff index.html @ 4:268fe5fd117f

add slides
author Henry Thompson <ht@markup.co.uk>
date Wed, 22 May 2024 17:18:23 +0200
parents d6f13dda3a11
children cc5cef8ba548
--- a/index.html	Wed May 22 17:14:13 2024 +0200
+++ b/index.html	Wed May 22 17:18:23 2024 +0200
@@ -106,11 +106,11 @@
        img {border: 0}
        .copyright {font-size: 70%}
        .note {width: 20%; float: right; clear: right; margin-left: .5em}
-     </style><title>Augmentations to Common Crawl</title></head><body style="font-family: DejaVu Sans, Arial; background: rgb(254,250,246)"><div style="text-align: center" class="head"><h1>Augmentations to Common Crawl</h1><hr/><div class="byline">Henry S. Thompson</div><div class="byline">15 Apr 2024</div><div class="copyright">Copyright &#xa9; 2024 <a href="http://www.ltg.ed.ac.uk/~ht/">Henry S. Thompson</a>&#160;<a href="http://creativecommons.org/licenses/by-sa/3.0/deed.en">CC-BY-SA</a></div></div><div class="body"><div><h2>1.  Introduction</h2><p>This site contains a preliminary publication of my augmented <a href="https://commoncrawl.org/blog/announcing-the-common-crawl-index">index files</a>
+     </style><title>Augmentations to Common Crawl</title></head><body style="font-family: DejaVu Sans, Arial; background: rgb(254,250,246)"><div style="text-align: center" class="head"><h1>Augmentations to Common Crawl</h1><hr/><div class="byline">Henry S. Thompson</div><div class="byline">22 May 2024</div><div class="copyright">Copyright &#xa9; 2024 <a href="http://www.ltg.ed.ac.uk/~ht/">Henry S. Thompson</a>&#160;<a href="http://creativecommons.org/licenses/by-sa/3.0/deed.en">CC-BY-SA</a></div></div><div class="body"><div><h2>1.  Introduction</h2><p>This site contains a preliminary publication of my augmented <a href="https://commoncrawl.org/blog/announcing-the-common-crawl-index">index files</a>
 for <a href="https://commoncrawl.org/blog/august-2019-crawl-archive-now-available">CC-MAIN-2019-35</a>.  This index contains one additional field, <code>lastmod</code>, in about 18% of the entries, giving the value of the <code>Last-Modified</code> header as a POSIX-format timestamp, enabling much finer-grained longitudinal study of the corresponding web resources.  The filename, offset and length fields in the augmented index are unchanged, and so can be used for retrieval from the original WARC files.</p></div><div><h2>2.  Contents</h2><ul class=" "><li>My <a href="Thompson_WebSci24.pdf">forthcoming paper</a> describing
 the augmented index and its uses</li><li>The <a href="CC-MAIN-2019-35/cdx/warc/cluster.idx">top-level index file</a></li><li><a href="CC-MAIN-2019-35/cdx/warc/idx/">The directory containing
 the individual gzipped index files themselves</a>, with names of the form
-<code>cdx-00nnn.gz</code>, for <code>nnn</code> in <code>000&#x2013;299</code></li></ul></div><div><h2>3.  Licence and citation</h2><p>The paper and data contained herein are Copyright &#xa9; 2024 Henry S. Thompson <a href="http://creativecommons.org/licenses/by-sa/3.0/deed.en">CC-BY-SA</a></p><p>Please cite information from here as follows:</p><ul class="naked  "><li><a name="For_the_paper"><b>For the paper</b></a>
+<code>cdx-00nnn.gz</code>, for <code>nnn</code> in <code>000&#x2013;299</code></li><li><a href="Thompson_WebSci24_slides.pdf">WebSci 24 conference slides</a></li></ul></div><div><h2>3.  Licence and citation</h2><p>The paper and data contained herein are Copyright &#xa9; 2024 Henry S. Thompson <a href="http://creativecommons.org/licenses/by-sa/3.0/deed.en">CC-BY-SA</a></p><p>Please cite information from here as follows:</p><ul class="naked  "><li><a name="For_the_paper"><b>For the paper</b></a>
 &#xa0;&#xa0;<blockquote class="vanilla"><div>Henry S. Thompson. 2024. "Improved methodology for longitudinal Web
 analytics using Common Crawl". In <i>ACM Web Science Conference (Websci &#x2019;24)</i>,
 May 21&#x2013;24, 2024, Stuttgart, Germany. ACM, New York, NY, USA, 11 pages.