# HG changeset patch
# User Henry S. Thompson
# Date 1713191150 -3600
# Node ID d6f13dda3a11ce2197e1ddc8df73efa445f3ca59
# Parent 104cc8b6789b25819fe4e2cc1bcbeb29cec2857d
As sent to Lindahl and Nagel

diff -r 104cc8b6789b -r d6f13dda3a11 Thompson_WebSci24.pdf
Binary file Thompson_WebSci24.pdf has changed
diff -r 104cc8b6789b -r d6f13dda3a11 index.html
--- a/index.html Mon Apr 15 07:44:37 2024 -0400
+++ b/index.html Mon Apr 15 15:25:50 2024 +0100
@@ -0,0 +1,127 @@

Augmentations to Common Crawl


1. Introduction

This site contains a preliminary publication of my augmented index files for CC-MAIN-2019-35. This index contains one additional field, lastmod, in about 18% of the entries, giving the value of the Last-Modified header as a POSIX-format timestamp, enabling much finer-grained longitudinal study of the corresponding web resources. The filename, offset and length fields in the augmented index are unchanged, and so can be used for retrieval from the original WARC files.
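
As a rough illustration of how the fields described above might be used, here is a minimal Python sketch. It rests on several assumptions that are not stated on this page: that each index line follows the usual Common Crawl layout of a URL key, a capture timestamp and a JSON object; that lastmod is a key in that JSON object holding seconds since the epoch; and that offset and length are byte counts. The sample line, including its file path, is invented for the example.

```python
import json
from datetime import datetime, timezone


def parse_index_line(line):
    """Split one index line into (urlkey, capture timestamp, field dict).

    Assumes the layout 'urlkey timestamp {json}'.
    """
    urlkey, capture_ts, payload = line.rstrip("\n").split(" ", 2)
    return urlkey, capture_ts, json.loads(payload)


def lastmod_datetime(fields):
    """Return lastmod as an aware UTC datetime, or None when the field is absent."""
    raw = fields.get("lastmod")
    if raw is None:
        return None
    # Assumption: lastmod is seconds since the POSIX epoch (string or number).
    return datetime.fromtimestamp(int(raw), tz=timezone.utc)


def byte_range(fields):
    """Build an HTTP Range header value covering one record in its WARC file."""
    start = int(fields["offset"])
    end = start + int(fields["length"]) - 1  # Range end is inclusive
    return f"bytes={start}-{end}"


# Invented sample line; the path is hypothetical, not a real WARC file.
sample = (
    'com,example)/ 20190820000000 '
    '{"url": "https://example.com/", "status": "200", '
    '"length": "1000", "offset": "5000", '
    '"filename": "hypothetical/path/example.warc.gz", '
    '"lastmod": "1566000000"}'
)

urlkey, capture_ts, fields = parse_index_line(sample)
print(lastmod_datetime(fields).isoformat())  # 2019-08-17T00:00:00+00:00
print(byte_range(fields))                    # bytes=5000-5999
```

Because only about 18% of the entries carry lastmod, the sketch returns None for a missing field rather than raising an error.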

2. Contents

  • My forthcoming paper describing the augmented index and its uses
  • The top-level index file
  • The directory containing the individual gzipped index files themselves, with names of the form cdx-00nnn.gz, for nnn in 000–299

3. Licence and citation

The paper and data contained herein are Copyright © 2024 Henry S. Thompson and are licensed under CC-BY-SA.

Please cite information from here as follows:

  • For the paper
    Henry S. Thompson. 2024. "Improved methodology for longitudinal Web analytics using Common Crawl". In ACM Web Science Conference (Websci ’24), May 21–24, 2024, Stuttgart, Germany. ACM, New York, NY, USA, 11 pages. [coming soon]
  • For the data
    Henry S. Thompson. 2024. Augmented index for Common Crawl August 2019, with Last-Modified timestamps. https://markup.co.uk/ccrawl/. Retrieved ...

4. Acknowledgements

Without the vision of those responsible for Common Crawl and the generosity of Amazon in hosting it, this work could never have happened.

Access to the Cirrus UK National Tier-2 HPC Service at the Edinburgh Parallel Computing Centre, used to produce the augmented index, was supported by EPSRC and UKRI HPC Access awards to Henry S. Thompson.

Thanks to Sebastian Nagel of Common Crawl for many prompt and helpful replies to many emails over the years, and to Greg Lindahl of Common Crawl and Tom Morris for more recent help with consistency problems in the index and the challenges of increasing load on the Common Crawl servers.

\ No newline at end of file
diff -r 104cc8b6789b -r d6f13dda3a11 index.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/index.xml Mon Apr 15 15:25:50 2024 +0100
@@ -0,0 +1,54 @@
Augmentations to Common Crawl
Henry S. Thompson
15 Apr 2024

Introduction

This site contains a preliminary publication of my augmented index files for CC-MAIN-2019-35. This index contains one additional field, lastmod, in about 18% of the entries, giving the value of the Last-Modified header as a POSIX-format timestamp, enabling much finer-grained longitudinal study of the corresponding web resources. The filename, offset and length fields in the augmented index are unchanged, and so can be used for retrieval from the original WARC files.

Contents

  • My forthcoming paper describing the augmented index and its uses
  • The top-level index file
  • The directory containing the individual gzipped index files themselves, with names of the form cdx-00nnn.gz, for nnn in 000–299
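
The naming pattern for the individual index files, cdx-00nnn.gz for nnn in 000–299, can be expanded mechanically. The following Python sketch only generates the 300 file names implied by that pattern. The function name and its `count` parameter are illustrative, and nothing here assumes a particular URL for the directory that holds the files.

```python
def index_part_names(count=300):
    """Return the gzipped index file names cdx-00000.gz .. cdx-00299.gz.

    The five-digit, zero-padded number follows the cdx-00nnn.gz pattern;
    count=300 reflects the stated range 000-299.
    """
    return [f"cdx-{n:05d}.gz" for n in range(count)]


names = index_part_names()
print(names[0], names[-1], len(names))  # cdx-00000.gz cdx-00299.gz 300
```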
Licence and citation

The paper and data contained herein are Copyright © 2024 Henry S. Thompson and are licensed under CC-BY-SA.


Please cite information from here as follows:

  • Henry S. Thompson. 2024. "Improved methodology for longitudinal Web analytics using Common Crawl". In ACM Web Science Conference (Websci ’24), May 21–24, 2024, Stuttgart, Germany. ACM, New York, NY, USA, 11 pages. [coming soon]
  • Henry S. Thompson. 2024. Augmented index for Common Crawl August 2019, with Last-Modified timestamps. https://markup.co.uk/ccrawl/. Retrieved ...

Acknowledgements

Without the vision of those responsible for Common Crawl and the generosity of Amazon in hosting it, this work could never have happened.


Access to the Cirrus UK National Tier-2 HPC Service at the Edinburgh Parallel Computing Centre, used to produce the augmented index, was supported by EPSRC and UKRI HPC Access awards to Henry S. Thompson.


Thanks to Sebastian Nagel of Common Crawl for many prompt and helpful replies to many emails over the years, and to Greg Lindahl of Common Crawl and Tom Morris for more recent help with consistency problems in the index and the challenges of increasing load on the Common Crawl servers.
