# HG changeset patch
# User Henry Thompson
# Date 1716475896 -7200
# Node ID cc5cef8ba548b374b5ec326353758f346dacf3df
# Parent e265fcc4297487ac1b9e75c48989365aabc5e6ac
expanded with example script, updated to point to full paper, include slides

diff -r e265fcc42974 -r cc5cef8ba548 index.html
--- a/index.html Thu May 23 15:00:40 2024 +0100
+++ b/index.html Thu May 23 16:51:36 2024 +0200
@@ -106,19 +106,25 @@
 img {border: 0}
 .copyright {font-size: 70%}
 .note {width: 20%; float: right; clear: right; margin-left: .5em}
- Augmentations to Common Crawl

Augmentations to Common Crawl


1. Introduction

This site contains a preliminary publication of my augmented index files
-for CC-MAIN-2019-35. This index contains one additional field, lastmod, in about 18% of the entries, giving the value of the Last-Modified header as a POSIX-format timestamp, enabling much finer-grained longitudinal study of the corresponding web resources. The filename, offset and length fields in the augmented index are unchanged, and so can be used for retrieval from the original WARC files.

2. Contents

  • My forthcoming paper describing + Augmentations to Common Crawl

    Augmentations to Common Crawl


    1. Introduction

    This site contains a copy of my augmented index files
+for CC-MAIN-2019-35. This index contains all of the original index, with one additional field, lastmod, in about 18% of the entries, giving the value of the Last-Modified header as a POSIX-format timestamp, enabling much finer-grained longitudinal study of the corresponding web resources. The filename, offset and length fields in the augmented index are unchanged, and so can be used for retrieval from the original WARC files.

    The format of the Common Crawl's index files is described in this announcement.
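As an illustration of how the lastmod field might be consumed, here is a minimal Python sketch. It assumes each index line has the usual Common Crawl CDX layout (a SURT key, a capture timestamp, then a JSON object) and that lastmod holds seconds since the POSIX epoch, stored either as a string or as a number. The sample entry, including its URL, filename and all numeric values, is invented for the example and does not come from the real index.

```python
import json
from datetime import datetime, timezone


def parse_index_line(line):
    """Split one index line into (SURT key, capture timestamp, JSON fields)."""
    key, capture, payload = line.rstrip("\n").split(" ", 2)
    return key, capture, json.loads(payload)


def lastmod_as_datetime(fields):
    """Return lastmod as an aware UTC datetime, or None when the field is absent."""
    value = fields.get("lastmod")
    if value is None:
        return None  # most entries (roughly 82%) carry no lastmod field
    return datetime.fromtimestamp(int(value), tz=timezone.utc)


# Hypothetical entry: every value below is invented for illustration.
sample = (
    'org,example)/ 20190817000000 {"url": "https://example.org/", '
    '"status": "200", "length": "1234", "offset": "5678", '
    '"filename": "crawl-data/example.warc.gz", "lastmod": "1565000000"}'
)

key, capture, fields = parse_index_line(sample)
print(key, capture, lastmod_as_datetime(fields).isoformat())
```

Entries without lastmod simply yield None, so the same function can be mapped over the whole index without special-casing.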

    2. Contents

    3. Licence and citation

    The paper and data contained herein are Copyright © 2024 Henry S. Thompson CC-BY-SA

    Please cite information from here as follows:

    3. Efficient access to Common Crawl using Amazon S3

    The University of Edinburgh's Edinburgh International Data Facility (EIDF) hosts a
+copy of the augmented index in an Amazon S3 server. It supports open
+access to the index via unsigned requests to (range-restricted)
+s3: URIs, for example using the Amazon aws
+Command Line Interface.

    The best way to understand how this works, once you've read how
+the index itself works in the paper, section 2.1, is to work through an example of using the augmented index to access an individual
+Common Crawl retrieval record using a timestamp.
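The range arithmetic behind such a retrieval can be sketched in Python without any network access. This is not the example script referred to above. It only assumes that offset and length from an index entry delimit one independently gzipped record inside a larger archive, which is how Common Crawl WARC files are laid out. The two-record archive is built in memory and its contents are invented.

```python
import gzip


def byte_range(offset, length):
    """Inclusive byte range, in HTTP Range header form, covering one record."""
    return f"bytes={offset}-{offset + length - 1}"


def extract_record(archive_bytes, offset, length):
    """Slice one gzip member out of a concatenated archive and decompress it."""
    return gzip.decompress(archive_bytes[offset:offset + length])


# Stand-in archive: two records, each gzipped separately, then concatenated.
# The record contents are invented placeholders, not real WARC records.
records = [
    b"WARC/1.0\r\nWARC-Type: response\r\n\r\nfirst record body",
    b"WARC/1.0\r\nWARC-Type: response\r\n\r\nsecond record body",
]
members = [gzip.compress(r) for r in records]
archive = b"".join(members)

# An index entry for the second record would carry these two values.
offset, length = len(members[0]), len(members[1])

print(byte_range(offset, length))
print(extract_record(archive, offset, length).decode("ascii"))
```

Against a real store, the string from byte_range would be passed as the range of the request, and the returned bytes decompressed in the same way.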

    4. Licence and citation

    The paper and data contained herein are Copyright © 2024 Henry S. Thompson CC-BY-SA

    Please cite information from here as follows:

    4. Acknowledgements

    Without the vision of those responsible for Common Crawl and the +https://markup.co.uk/ccrawl/. Retrieved ...

5. Acknowledgements

Without the vision of those responsible for Common Crawl and the generosity of Amazon in hosting it this work could never have happened.

Access to the Cirrus UK National Tier-2 HPC Service at the Edinburgh Parallel Computing Centre used to produce the augmented index was supported by EPSRC and UKRI HPC Access awards to Henry S. Thompson.

Thanks to Sebastian Nagel of Common Crawl for many prompt and helpful
diff -r e265fcc42974 -r cc5cef8ba548 index.xml
--- a/index.xml Thu May 23 15:00:40 2024 +0100
+++ b/index.xml Thu May 23 16:51:36 2024 +0200
@@ -5,18 +5,19 @@
 Augmentations to Common Crawl
 Henry S. Thompson
- 22 May 2024
+ 23 May 2024

Introduction -

This site contains a preliminary publication of my augmented index files
-for CC-MAIN-2019-35. This index contains one additional field, lastmod, in about 18% of the entries, giving the value of the Last-Modified header as a POSIX-format timestamp, enabling much finer-grained longitudinal study of the corresponding web resources. The filename, offset and length fields in the augmented index are unchanged, and so can be used for retrieval from the original WARC files.

+

This site contains a copy of my augmented index files
+for CC-MAIN-2019-35. This index contains all of the original index, with one additional field, lastmod, in about 18% of the entries, giving the value of the Last-Modified header as a POSIX-format timestamp, enabling much finer-grained longitudinal study of the corresponding web resources. The filename, offset and length fields in the augmented index are unchanged, and so can be used for retrieval from the original WARC files.

+

The format of the Common Crawl's index files is described in this announcement.

Contents
- My forthcoming paper describing
+ My paper, presented at WebSci24, describing
 the augmented index and its uses
 The top-level index file
 The directory containing
@@ -26,6 +27,17 @@
+ Efficient access to Common Crawl using Amazon S3<!-- <item> <link href="">eidf125_example.sh</link>.</item>--> +

The University of Edinburgh's Edinburgh International Data Facility (EIDF) hosts a
+copy of the augmented index in an Amazon S3 server. It supports open
+access to the index via unsigned requests to (range-restricted)
+s3: URIs, for example using the Amazon aws
+Command Line Interface.

+

The best way to understand how this works, once you've read how
+the index itself works in the paper, section 2.1, is to work through an example of using the augmented index to access an individual
+Common Crawl retrieval record using a timestamp.

+
+
Licence and citation

The paper and data contained herein are Copyright © 2024 Henry S. Thompson CC-BY-SA

Please cite information from here as follows:

@@ -33,8 +45,8 @@
 Henry S. Thompson. 2024. "Improved methodology for longitudinal Web analytics using Common Crawl". In ACM Web Science Conference (Websci ’24), May 21–24, 2024, Stuttgart, Germany. ACM, New York, NY, USA, 11 pages.
-[coming soon]
-
+https://doi.org/10.1145/3614419.3644018
+
 Henry S. Thompson. 2024. Augmented index for Common Crawl August 2019, with Last-Modified timestamps. https://markup.co.uk/ccrawl/. Retrieved ...