# HG changeset patch
# User Henry Thompson
# Date 1716391103 -7200
# Node ID 268fe5fd117f5743a741ecae049a8c3a7388892f
# Parent 7ec8f691a25ab29779dce483d69c36d7455c0fb4
add slides

diff -r 7ec8f691a25a -r 268fe5fd117f index.html
--- a/index.html Wed May 22 17:14:13 2024 +0200
+++ b/index.html Wed May 22 17:18:23 2024 +0200
@@ -106,11 +106,11 @@
 img {border: 0}
 .copyright {font-size: 70%}
 .note {width: 20%; float: right; clear: right; margin-left: .5em}
- Augmentations to Common Crawl

Augmentations to Common Crawl


1. Introduction

This site contains a preliminary publication of my augmented index files
+ Augmentations to Common Crawl

Augmentations to Common Crawl


1. Introduction

This site contains a preliminary publication of my augmented index files for CC-MAIN-2019-35. This index contains one additional field, lastmod, in about 18% of the entries, giving the value of the Last-Modified header as a POSIX-format timestamp, enabling much finer-grained longitudinal study of the corresponding web resources. The filename, offset and length fields in the augmented index are unchanged, and so can be used for retrieval from the original WARC files.
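To illustrate how the fields described above might be used, here is a minimal Python sketch. It rests on assumptions not stated in the page itself: that each index line follows the usual Common Crawl layout (SURT key, capture timestamp, then a JSON object), and that lastmod appears as a key in that JSON object holding seconds since the POSIX epoch. The helper names and the sample entry are hypothetical, invented for illustration only.

```python
import json
from datetime import datetime, timezone


def parse_index_line(line):
    """Split one index line into (urlkey, capture timestamp, fields dict).

    Assumption: SURT key, 14-digit timestamp and a JSON object,
    separated by single spaces.
    """
    urlkey, timestamp, payload = line.rstrip("\n").split(" ", 2)
    return urlkey, timestamp, json.loads(payload)


def lastmod_datetime(fields):
    """Return lastmod as a UTC datetime, or None when the entry has none.

    Assumption: lastmod is seconds since the POSIX epoch, stored either
    as a number or as a numeric string.
    """
    value = fields.get("lastmod")
    if value is None:
        return None
    return datetime.fromtimestamp(int(value), tz=timezone.utc)


def warc_byte_range(fields):
    """Return (filename, HTTP Range header value) for the WARC record."""
    start = int(fields["offset"])
    end = start + int(fields["length"]) - 1  # Range end is inclusive
    return fields["filename"], f"bytes={start}-{end}"


# Hypothetical entry: every value here is a placeholder, not real data.
sample = (
    'org,example)/ 20190817102030 '
    '{"url": "https://example.org/", "status": "200", '
    '"length": "1000", "offset": "5000", '
    '"filename": "crawl-data/CC-MAIN-2019-35/segments/EXAMPLE/warc/EXAMPLE.warc.gz", '
    '"lastmod": "1560000000"}'
)

if __name__ == "__main__":
    urlkey, captured, fields = parse_index_line(sample)
    print(urlkey, captured, lastmod_datetime(fields))
    print(warc_byte_range(fields))
```

Since only about 18% of entries carry lastmod, the lookup returns None for the rest instead of raising. The byte range is computed from the unchanged offset and length fields, so it would apply equally to the original index.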

2. Contents

3. Licence and citation

The paper and data contained herein are Copyright © 2024 Henry S. Thompson CC-BY-SA

Please cite information from here as follows:


  • For the paper   
    Henry S. Thompson. 2024. "Improved methodology for longitudinal Web analytics using Common Crawl". In ACM Web Science Conference (Websci ’24), May 21–24, 2024, Stuttgart, Germany. ACM, New York, NY, USA, 11 pages.
diff -r 7ec8f691a25a -r 268fe5fd117f index.xml
--- a/index.xml Wed May 22 17:14:13 2024 +0200
+++ b/index.xml Wed May 22 17:18:23 2024 +0200
@@ -5,7 +5,7 @@
 Augmentations to Common Crawl
 Henry S. Thompson
- 15 Apr 2024
+ 22 May 2024
@@ -22,6 +22,7 @@
 The directory containing the individual gzipped index files themselves, with names of the form cdx-00nnn.gz, for nnn in 000–299
+ WebSci 24 conference slides
@@ -32,7 +33,8 @@
 Henry S. Thompson. 2024. "Improved methodology for longitudinal Web analytics using Common Crawl". In ACM Web Science Conference (Websci ’24), May 21–24, 2024, Stuttgart, Germany. ACM, New York, NY, USA, 11 pages.
-[coming soon]
+[coming soon]
+ Henry S. Thompson. 2024. Augmented index for Common Crawl August 2019, with Last-Modified timestamps. https://markup.co.uk/ccrawl/. Retrieved ...