comparison index.xml @ 4:268fe5fd117f

add slides
author Henry Thompson <ht@markup.co.uk>
date Wed, 22 May 2024 17:18:23 +0200
parents d6f13dda3a11
children cc5cef8ba548
comparison
3:7ec8f691a25a 4:268fe5fd117f
3 <!DOCTYPE doc SYSTEM "../../../lib/xml/doc.dtd" > 3 <!DOCTYPE doc SYSTEM "../../../lib/xml/doc.dtd" >
4 <doc> 4 <doc>
5 <head> 5 <head>
6 <title>Augmentations to Common Crawl</title> 6 <title>Augmentations to Common Crawl</title>
7 <author>Henry S. Thompson</author> 7 <author>Henry S. Thompson</author>
8 <date>15 Apr 2024</date> 8 <date>22 May 2024</date>
9 </head> 9 </head>
10 <body> 10 <body>
11 <div> 11 <div>
12 <title>Introduction</title> 12 <title>Introduction</title>
13 <p>This site contains a preliminary publication of my augmented <link href="https://commoncrawl.org/blog/announcing-the-common-crawl-index">index files</link> 13 <p>This site contains a preliminary publication of my augmented <link href="https://commoncrawl.org/blog/announcing-the-common-crawl-index">index files</link>
20 the augmented index and its uses</item> 20 the augmented index and its uses</item>
21 <item>The <link href="CC-MAIN-2019-35/cdx/warc/cluster.idx">top-level index file</link></item> 21 <item>The <link href="CC-MAIN-2019-35/cdx/warc/cluster.idx">top-level index file</link></item>
22 <item><link href="CC-MAIN-2019-35/cdx/warc/idx/">The directory containing 22 <item><link href="CC-MAIN-2019-35/cdx/warc/idx/">The directory containing
23 the individual gzipped index files themselves</link>, with names of the form 23 the individual gzipped index files themselves</link>, with names of the form
24 <code>cdx-00nnn.gz</code>, for <code>nnn</code> in <code>000&ndash;299</code></item> 24 <code>cdx-00nnn.gz</code>, for <code>nnn</code> in <code>000&ndash;299</code></item>
25 <item><link href="Thompson_WebSci24_slides.pdf">WebSci 24 conference slides</link></item>
25 </list> 26 </list>
26 </div> 27 </div>
27 <div> 28 <div>
28 <title>Licence and citation</title> 29 <title>Licence and citation</title>
29 <p>The paper and data contained herein are Copyright © 2024 Henry S. Thompson <link href="http://creativecommons.org/licenses/by-sa/3.0/deed.en">CC-BY-SA</link></p> 30 <p>The paper and data contained herein are Copyright © 2024 Henry S. Thompson <link href="http://creativecommons.org/licenses/by-sa/3.0/deed.en">CC-BY-SA</link></p>