changeset 4:268fe5fd117f
add slides
author | Henry Thompson <ht@markup.co.uk> |
---|---|
date | Wed, 22 May 2024 17:18:23 +0200 |
parents | 7ec8f691a25a |
children | e265fcc42974 |
files | index.html index.xml |
diffstat | 2 files changed, 6 insertions(+), 4 deletions(-) |
--- a/index.html Wed May 22 17:14:13 2024 +0200
+++ b/index.html Wed May 22 17:18:23 2024 +0200
@@ -106,11 +106,11 @@
 img {border: 0}
 .copyright {font-size: 70%}
 .note {width: 20%; float: right; clear: right; margin-left: .5em}
- </style><title>Augmentations to Common Crawl</title></head><body style="font-family: DejaVu Sans, Arial; background: rgb(254,250,246)"><div style="text-align: center" class="head"><h1>Augmentations to Common Crawl</h1><hr/><div class="byline">Henry S. Thompson</div><div class="byline">15 Apr 2024</div><div class="copyright">Copyright © 2024 <a href="http://www.ltg.ed.ac.uk/~ht/">Henry S. Thompson</a> <a href="http://creativecommons.org/licenses/by-sa/3.0/deed.en">CC-BY-SA</a></div></div><div class="body"><div><h2>1. Introduction</h2><p>This site contains a preliminary publication of my augmented <a href="https://commoncrawl.org/blog/announcing-the-common-crawl-index">index files</a>
+ </style><title>Augmentations to Common Crawl</title></head><body style="font-family: DejaVu Sans, Arial; background: rgb(254,250,246)"><div style="text-align: center" class="head"><h1>Augmentations to Common Crawl</h1><hr/><div class="byline">Henry S. Thompson</div><div class="byline">22 May 2024</div><div class="copyright">Copyright © 2024 <a href="http://www.ltg.ed.ac.uk/~ht/">Henry S. Thompson</a> <a href="http://creativecommons.org/licenses/by-sa/3.0/deed.en">CC-BY-SA</a></div></div><div class="body"><div><h2>1. Introduction</h2><p>This site contains a preliminary publication of my augmented <a href="https://commoncrawl.org/blog/announcing-the-common-crawl-index">index files</a>
 for <a href="https://commoncrawl.org/blog/august-2019-crawl-archive-now-available">CC-MAIN-2019-35</a>. This index contains one additional field, <code>lastmod</code>, in about 18% of the entries, giving the value of the <code>Last-Modified</code> header as a POSIX-format timestamp, enabling much finer-grained longitudinal study of the corresponding web resources.
 The filename, offset and length fields in the augmented index are unchanged, and so can be used for retrieval from the original WARC files.</p></div><div><h2>2. Contents</h2><ul class=" "><li>My <a href="Thompson_WebSci24.pdf">forthcoming paper</a> describing the augmented index and its uses</li><li>The <a href="CC-MAIN-2019-35/cdx/warc/cluster.idx">top-level index file</a></li><li><a href="CC-MAIN-2019-35/cdx/warc/idx/">The directory containing the individual gzipped index files themselves</a>, with names of the form
-<code>cdx-00nnn.gz</code>, for <code>nnn</code> in <code>000–299</code></li></ul></div><div><h2>3. Licence and citation</h2><p>The paper and data contained herein are Copyright © 2024 Henry S. Thompson <a href="http://creativecommons.org/licenses/by-sa/3.0/deed.en">CC-BY-SA</a></p><p>Please cite information from here as follows:</p><ul class="naked "><li><a name="For_the_paper"><b>For the paper</b></a>
+<code>cdx-00nnn.gz</code>, for <code>nnn</code> in <code>000–299</code></li><li><a href="Thompson_WebSci24_slides.pdf">WebSci 24 conference slides</a></li></ul></div><div><h2>3. Licence and citation</h2><p>The paper and data contained herein are Copyright © 2024 Henry S. Thompson <a href="http://creativecommons.org/licenses/by-sa/3.0/deed.en">CC-BY-SA</a></p><p>Please cite information from here as follows:</p><ul class="naked "><li><a name="For_the_paper"><b>For the paper</b></a>
   <blockquote class="vanilla"><div>Henry S. Thompson. 2024. "Improved methodology for longitudinal Web analytics using Common Crawl". In <i>ACM Web Science Conference (Websci ’24)</i>, May 21–24, 2024, Stuttgart, Germany. ACM, New York, NY, USA, 11 pages.
--- a/index.xml Wed May 22 17:14:13 2024 +0200
+++ b/index.xml Wed May 22 17:18:23 2024 +0200
@@ -5,7 +5,7 @@
 <head>
 <title>Augmentations to Common Crawl</title>
 <author>Henry S. Thompson</author>
- <date>15 Apr 2024</date>
+ <date>22 May 2024</date>
 </head>
 <body>
 <div>
@@ -22,6 +22,7 @@
 <item><link href="CC-MAIN-2019-35/cdx/warc/idx/">The directory containing the individual gzipped index files themselves</link>, with names of the form <code>cdx-00nnn.gz</code>, for <code>nnn</code> in <code>000–299</code></item>
+ <item><link href="Thompson_WebSci24_slides.pdf">WebSci 24 conference slides</link></item>
 </list>
 </div>
 <div>
@@ -32,7 +33,8 @@
 <item term="For the paper"><display>Henry S. Thompson. 2024. "Improved methodology for longitudinal Web analytics using Common Crawl". In <emph>ACM Web Science Conference (Websci ’24)</emph>, May 21–24, 2024, Stuttgart, Germany. ACM, New York, NY, USA, 11 pages.
-<link href="...">[coming soon]</link> </display><!--https://doi.org/10.1145/3614419.3644018--></item>
+<link href="...">[coming soon]</link>
+</display><!--https://doi.org/10.1145/3614419.3644018--></item>
 <item term="For the data"><display>Henry S. Thompson. 2024. <emph>Augmented index for Common Crawl August 2019, with Last-Modified timestamps</emph>. <link href="https://markup.co.uk/ccrawl/">https://markup.co.uk/ccrawl/</link>. Retrieved ...</display></item>
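
The page text in this changeset describes an added `lastmod` field holding the `Last-Modified` value as a POSIX timestamp, alongside unchanged filename, offset and length fields. The following is a minimal sketch of reading one such entry, not the author's tooling. It assumes the usual Common Crawl index line layout (URL key, capture timestamp, then a JSON object) and that `lastmod` can be parsed with `int()`. The function names, the sample line and its `example.warc.gz` filename are made up for illustration.

```python
import json
from datetime import datetime, timezone

# Made-up sample entry; real entries come from the gzipped cdx-00nnn.gz files.
SAMPLE_LINE = (
    'com,example)/ 20190820000000 '
    '{"url": "http://example.com/", "status": "200", "length": "1234", '
    '"offset": "5678", "filename": "example.warc.gz", "lastmod": "1566000000"}'
)


def parse_cdx_line(line):
    """Split one index line into (urlkey, capture timestamp, JSON fields)."""
    urlkey, timestamp, payload = line.rstrip("\n").split(" ", 2)
    return urlkey, timestamp, json.loads(payload)


def lastmod_datetime(fields):
    """Return lastmod as a UTC datetime, or None when the entry has no lastmod."""
    raw = fields.get("lastmod")
    if raw is None:
        return None
    # Assumption: the value is an integer number of seconds since the epoch.
    return datetime.fromtimestamp(int(raw), tz=timezone.utc)


def warc_byte_range(fields):
    """Return (filename, first byte, last byte) for an HTTP Range request."""
    offset = int(fields["offset"])
    length = int(fields["length"])
    return fields["filename"], offset, offset + length - 1


if __name__ == "__main__":
    urlkey, timestamp, fields = parse_cdx_line(SAMPLE_LINE)
    print(urlkey, timestamp, lastmod_datetime(fields), warc_byte_range(fields))
```

Since only about 18% of entries carry `lastmod`, the sketch returns `None` for the rest instead of raising an error.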