changeset 4:268fe5fd117f

add slides
author Henry Thompson <ht@markup.co.uk>
date Wed, 22 May 2024 17:18:23 +0200
parents 7ec8f691a25a
children e265fcc42974
files index.html index.xml
diffstat 2 files changed, 6 insertions(+), 4 deletions(-)
--- a/index.html	Wed May 22 17:14:13 2024 +0200
+++ b/index.html	Wed May 22 17:18:23 2024 +0200
@@ -106,11 +106,11 @@
        img {border: 0}
        .copyright {font-size: 70%}
        .note {width: 20%; float: right; clear: right; margin-left: .5em}
-     </style><title>Augmentations to Common Crawl</title></head><body style="font-family: DejaVu Sans, Arial; background: rgb(254,250,246)"><div style="text-align: center" class="head"><h1>Augmentations to Common Crawl</h1><hr/><div class="byline">Henry S. Thompson</div><div class="byline">15 Apr 2024</div><div class="copyright">Copyright &#xa9; 2024 <a href="http://www.ltg.ed.ac.uk/~ht/">Henry S. Thompson</a>&#160;<a href="http://creativecommons.org/licenses/by-sa/3.0/deed.en">CC-BY-SA</a></div></div><div class="body"><div><h2>1.  Introduction</h2><p>This site contains a preliminary publication of my augmented <a href="https://commoncrawl.org/blog/announcing-the-common-crawl-index">index files</a>
+     </style><title>Augmentations to Common Crawl</title></head><body style="font-family: DejaVu Sans, Arial; background: rgb(254,250,246)"><div style="text-align: center" class="head"><h1>Augmentations to Common Crawl</h1><hr/><div class="byline">Henry S. Thompson</div><div class="byline">22 May 2024</div><div class="copyright">Copyright &#xa9; 2024 <a href="http://www.ltg.ed.ac.uk/~ht/">Henry S. Thompson</a>&#160;<a href="http://creativecommons.org/licenses/by-sa/3.0/deed.en">CC-BY-SA</a></div></div><div class="body"><div><h2>1.  Introduction</h2><p>This site contains a preliminary publication of my augmented <a href="https://commoncrawl.org/blog/announcing-the-common-crawl-index">index files</a>
 for <a href="https://commoncrawl.org/blog/august-2019-crawl-archive-now-available">CC-MAIN-2019-35</a>.  This index contains one additional field, <code>lastmod</code>, in about 18% of the entries, giving the value of the <code>Last-Modified</code> header as a POSIX-format timestamp, enabling much finer-grained longitudinal study of the corresponding web resources.  The filename, offset and length fields in the augmented index are unchanged, and so can be used for retrieval from the original WARC files.</p></div><div><h2>2.  Contents</h2><ul class=" "><li>My <a href="Thompson_WebSci24.pdf">forthcoming paper</a> describing
 the augmented index and its uses</li><li>The <a href="CC-MAIN-2019-35/cdx/warc/cluster.idx">top-level index file</a></li><li><a href="CC-MAIN-2019-35/cdx/warc/idx/">The directory containing
 the individual gzipped index files themselves</a>, with names of the form
-<code>cdx-00nnn.gz</code>, for <code>nnn</code> in <code>000&#x2013;299</code></li></ul></div><div><h2>3.  Licence and citation</h2><p>The paper and data contained herein are Copyright &#xa9; 2024 Henry S. Thompson <a href="http://creativecommons.org/licenses/by-sa/3.0/deed.en">CC-BY-SA</a></p><p>Please cite information from here as follows:</p><ul class="naked  "><li><a name="For_the_paper"><b>For the paper</b></a>
+<code>cdx-00nnn.gz</code>, for <code>nnn</code> in <code>000&#x2013;299</code></li><li><a href="Thompson_WebSci24_slides.pdf">WebSci 24 conference slides</a></li></ul></div><div><h2>3.  Licence and citation</h2><p>The paper and data contained herein are Copyright &#xa9; 2024 Henry S. Thompson <a href="http://creativecommons.org/licenses/by-sa/3.0/deed.en">CC-BY-SA</a></p><p>Please cite information from here as follows:</p><ul class="naked  "><li><a name="For_the_paper"><b>For the paper</b></a>
 &#xa0;&#xa0;<blockquote class="vanilla"><div>Henry S. Thompson. 2024. "Improved methodology for longitudinal Web
 analytics using Common Crawl". In <i>ACM Web Science Conference (Websci &#x2019;24)</i>,
 May 21&#x2013;24, 2024, Stuttgart, Germany. ACM, New York, NY, USA, 11 pages.
--- a/index.xml	Wed May 22 17:14:13 2024 +0200
+++ b/index.xml	Wed May 22 17:18:23 2024 +0200
@@ -5,7 +5,7 @@
  <head>
   <title>Augmentations to Common Crawl</title>
   <author>Henry S. Thompson</author>
-  <date>15 Apr 2024</date>
+  <date>22 May 2024</date>
  </head>
  <body>
   <div>
@@ -22,6 +22,7 @@
     <item><link href="CC-MAIN-2019-35/cdx/warc/idx/">The directory containing
 the individual gzipped index files themselves</link>, with names of the form
 <code>cdx-00nnn.gz</code>, for <code>nnn</code> in <code>000&ndash;299</code></item>
+    <item><link href="Thompson_WebSci24_slides.pdf">WebSci 24 conference slides</link></item>
    </list>
   </div>
   <div>
@@ -32,7 +33,8 @@
     <item term="For the paper"><display>Henry S. Thompson. 2024. "Improved methodology for longitudinal Web
 analytics using Common Crawl". In <emph>ACM Web Science Conference (Websci ’24)</emph>,
 May 21–24, 2024, Stuttgart, Germany. ACM, New York, NY, USA, 11 pages.
-<link href="...">[coming soon]</link></display><!--https://doi.org/10.1145/3614419.3644018--></item>
+<link href="...">[coming soon]</link>
+</display><!--https://doi.org/10.1145/3614419.3644018--></item>
     <item term="For the data"><display>Henry S. Thompson. 2024. <emph>Augmented index
 for Common Crawl August 2019, with Last-Modified timestamps</emph>.
 <link href="https://markup.co.uk/ccrawl/">https://markup.co.uk/ccrawl/</link>. Retrieved ...</display></item>