comparison index.html @ 4:268fe5fd117f

add slides
author Henry Thompson <ht@markup.co.uk>
date Wed, 22 May 2024 17:18:23 +0200
parents d6f13dda3a11
children cc5cef8ba548
comparison
3:7ec8f691a25a 4:268fe5fd117f
104 div.caption {font-style: italic; font-weight: bold; margin-top: 1em} 104 div.caption {font-style: italic; font-weight: bold; margin-top: 1em}
105 i i {font-style: normal} 105 i i {font-style: normal}
106 img {border: 0} 106 img {border: 0}
107 .copyright {font-size: 70%} 107 .copyright {font-size: 70%}
108 .note {width: 20%; float: right; clear: right; margin-left: .5em} 108 .note {width: 20%; float: right; clear: right; margin-left: .5em}
109 </style><title>Augmentations to Common Crawl</title></head><body style="font-family: DejaVu Sans, Arial; background: rgb(254,250,246)"><div style="text-align: center" class="head"><h1>Augmentations to Common Crawl</h1><hr/><div class="byline">Henry S. Thompson</div><div class="byline">15 Apr 2024</div><div class="copyright">Copyright &#xa9; 2024 <a href="http://www.ltg.ed.ac.uk/~ht/">Henry S. Thompson</a>&#160;<a href="http://creativecommons.org/licenses/by-sa/3.0/deed.en">CC-BY-SA</a></div></div><div class="body"><div><h2>1. Introduction</h2><p>This site contains a preliminary publication of my augmented <a href="https://commoncrawl.org/blog/announcing-the-common-crawl-index">index files</a> 109 </style><title>Augmentations to Common Crawl</title></head><body style="font-family: DejaVu Sans, Arial; background: rgb(254,250,246)"><div style="text-align: center" class="head"><h1>Augmentations to Common Crawl</h1><hr/><div class="byline">Henry S. Thompson</div><div class="byline">22 May 2024</div><div class="copyright">Copyright &#xa9; 2024 <a href="http://www.ltg.ed.ac.uk/~ht/">Henry S. Thompson</a>&#160;<a href="http://creativecommons.org/licenses/by-sa/3.0/deed.en">CC-BY-SA</a></div></div><div class="body"><div><h2>1. Introduction</h2><p>This site contains a preliminary publication of my augmented <a href="https://commoncrawl.org/blog/announcing-the-common-crawl-index">index files</a>
110 for <a href="https://commoncrawl.org/blog/august-2019-crawl-archive-now-available">CC-MAIN-2019-35</a>. This index contains one additional field, <code>lastmod</code>, in about 18% of the entries, giving the value of the <code>Last-Modified</code> header as a POSIX-format timestamp, enabling much finer-grained longitudinal study of the corresponding web resources. The filename, offset and length fields in the augmented index are unchanged, and so can be used for retrieval from the original WARC files.</p></div><div><h2>2. Contents</h2><ul class=" "><li>My <a href="Thompson_WebSci24.pdf">forthcoming paper</a> describing 110 for <a href="https://commoncrawl.org/blog/august-2019-crawl-archive-now-available">CC-MAIN-2019-35</a>. This index contains one additional field, <code>lastmod</code>, in about 18% of the entries, giving the value of the <code>Last-Modified</code> header as a POSIX-format timestamp, enabling much finer-grained longitudinal study of the corresponding web resources. The filename, offset and length fields in the augmented index are unchanged, and so can be used for retrieval from the original WARC files.</p></div><div><h2>2. Contents</h2><ul class=" "><li>My <a href="Thompson_WebSci24.pdf">forthcoming paper</a> describing
111 the augmented index and its uses</li><li>The <a href="CC-MAIN-2019-35/cdx/warc/cluster.idx">top-level index file</a></li><li><a href="CC-MAIN-2019-35/cdx/warc/idx/">The directory containing 111 the augmented index and its uses</li><li>The <a href="CC-MAIN-2019-35/cdx/warc/cluster.idx">top-level index file</a></li><li><a href="CC-MAIN-2019-35/cdx/warc/idx/">The directory containing
112 the individual gzipped index files themselves</a>, with names of the form 112 the individual gzipped index files themselves</a>, with names of the form
113 <code>cdx-00nnn.gz</code>, for <code>nnn</code> in <code>000&#x2013;299</code></li></ul></div><div><h2>3. Licence and citation</h2><p>The paper and data contained herein are Copyright &#xa9; 2024 Henry S. Thompson <a href="http://creativecommons.org/licenses/by-sa/3.0/deed.en">CC-BY-SA</a></p><p>Please cite information from here as follows:</p><ul class="naked "><li><a name="For_the_paper"><b>For the paper</b></a> 113 <code>cdx-00nnn.gz</code>, for <code>nnn</code> in <code>000&#x2013;299</code></li><li><a href="Thompson_WebSci24_slides.pdf">WebSci 24 conference slides</a></li></ul></div><div><h2>3. Licence and citation</h2><p>The paper and data contained herein are Copyright &#xa9; 2024 Henry S. Thompson <a href="http://creativecommons.org/licenses/by-sa/3.0/deed.en">CC-BY-SA</a></p><p>Please cite information from here as follows:</p><ul class="naked "><li><a name="For_the_paper"><b>For the paper</b></a>
114 &#xa0;&#xa0;<blockquote class="vanilla"><div>Henry S. Thompson. 2024. "Improved methodology for longitudinal Web 114 &#xa0;&#xa0;<blockquote class="vanilla"><div>Henry S. Thompson. 2024. "Improved methodology for longitudinal Web
115 analytics using Common Crawl". In <i>ACM Web Science Conference (Websci &#x2019;24)</i>, 115 analytics using Common Crawl". In <i>ACM Web Science Conference (Websci &#x2019;24)</i>,
116 May 21&#x2013;24, 2024, Stuttgart, Germany. ACM, New York, NY, USA, 11 pages. 116 May 21&#x2013;24, 2024, Stuttgart, Germany. ACM, New York, NY, USA, 11 pages.
117 <a href="...">[coming soon]</a> 117 <a href="...">[coming soon]</a>
118 </div></blockquote></li><li><a name="For_the_data"><b>For the data</b></a> 118 </div></blockquote></li><li><a name="For_the_data"><b>For the data</b></a>