comparison index.html @ 4:268fe5fd117f
add slides
author | Henry Thompson <ht@markup.co.uk> |
---|---|
date | Wed, 22 May 2024 17:18:23 +0200 |
parents | d6f13dda3a11 |
children | cc5cef8ba548 |
comparison
3:7ec8f691a25a | 4:268fe5fd117f |
---|---|
104 div.caption {font-style: italic; font-weight: bold; margin-top: 1em} | 104 div.caption {font-style: italic; font-weight: bold; margin-top: 1em} |
105 i i {font-style: normal} | 105 i i {font-style: normal} |
106 img {border: 0} | 106 img {border: 0} |
107 .copyright {font-size: 70%} | 107 .copyright {font-size: 70%} |
108 .note {width: 20%; float: right; clear: right; margin-left: .5em} | 108 .note {width: 20%; float: right; clear: right; margin-left: .5em} |
109 </style><title>Augmentations to Common Crawl</title></head><body style="font-family: DejaVu Sans, Arial; background: rgb(254,250,246)"><div style="text-align: center" class="head"><h1>Augmentations to Common Crawl</h1><hr/><div class="byline">Henry S. Thompson</div><div class="byline">15 Apr 2024</div><div class="copyright">Copyright © 2024 <a href="http://www.ltg.ed.ac.uk/~ht/">Henry S. Thompson</a> <a href="http://creativecommons.org/licenses/by-sa/3.0/deed.en">CC-BY-SA</a></div></div><div class="body"><div><h2>1. Introduction</h2><p>This site contains a preliminary publication of my augmented <a href="https://commoncrawl.org/blog/announcing-the-common-crawl-index">index files</a> | 109 </style><title>Augmentations to Common Crawl</title></head><body style="font-family: DejaVu Sans, Arial; background: rgb(254,250,246)"><div style="text-align: center" class="head"><h1>Augmentations to Common Crawl</h1><hr/><div class="byline">Henry S. Thompson</div><div class="byline">22 May 2024</div><div class="copyright">Copyright © 2024 <a href="http://www.ltg.ed.ac.uk/~ht/">Henry S. Thompson</a> <a href="http://creativecommons.org/licenses/by-sa/3.0/deed.en">CC-BY-SA</a></div></div><div class="body"><div><h2>1. Introduction</h2><p>This site contains a preliminary publication of my augmented <a href="https://commoncrawl.org/blog/announcing-the-common-crawl-index">index files</a> |
110 for <a href="https://commoncrawl.org/blog/august-2019-crawl-archive-now-available">CC-MAIN-2019-35</a>. This index contains one additional field, <code>lastmod</code>, in about 18% of the entries, giving the value of the <code>Last-Modified</code> header as a POSIX-format timestamp, enabling much finer-grained longitudinal study of the corresponding web resources. The filename, offset and length fields in the augmented index are unchanged, and so can be used for retrieval from the original WARC files.</p></div><div><h2>2. Contents</h2><ul class=" "><li>My <a href="Thompson_WebSci24.pdf">forthcoming paper</a> describing | 110 for <a href="https://commoncrawl.org/blog/august-2019-crawl-archive-now-available">CC-MAIN-2019-35</a>. This index contains one additional field, <code>lastmod</code>, in about 18% of the entries, giving the value of the <code>Last-Modified</code> header as a POSIX-format timestamp, enabling much finer-grained longitudinal study of the corresponding web resources. The filename, offset and length fields in the augmented index are unchanged, and so can be used for retrieval from the original WARC files.</p></div><div><h2>2. Contents</h2><ul class=" "><li>My <a href="Thompson_WebSci24.pdf">forthcoming paper</a> describing |
111 the augmented index and its uses</li><li>The <a href="CC-MAIN-2019-35/cdx/warc/cluster.idx">top-level index file</a></li><li><a href="CC-MAIN-2019-35/cdx/warc/idx/">The directory containing | 111 the augmented index and its uses</li><li>The <a href="CC-MAIN-2019-35/cdx/warc/cluster.idx">top-level index file</a></li><li><a href="CC-MAIN-2019-35/cdx/warc/idx/">The directory containing |
112 the individual gzipped index files themselves</a>, with names of the form | 112 the individual gzipped index files themselves</a>, with names of the form |
113 <code>cdx-00nnn.gz</code>, for <code>nnn</code> in <code>000–299</code></li></ul></div><div><h2>3. Licence and citation</h2><p>The paper and data contained herein are Copyright © 2024 Henry S. Thompson <a href="http://creativecommons.org/licenses/by-sa/3.0/deed.en">CC-BY-SA</a></p><p>Please cite information from here as follows:</p><ul class="naked "><li><a name="For_the_paper"><b>For the paper</b></a> | 113 <code>cdx-00nnn.gz</code>, for <code>nnn</code> in <code>000–299</code></li><li><a href="Thompson_WebSci24_slides.pdf">WebSci 24 conference slides</a></li></ul></div><div><h2>3. Licence and citation</h2><p>The paper and data contained herein are Copyright © 2024 Henry S. Thompson <a href="http://creativecommons.org/licenses/by-sa/3.0/deed.en">CC-BY-SA</a></p><p>Please cite information from here as follows:</p><ul class="naked "><li><a name="For_the_paper"><b>For the paper</b></a> |
114   <blockquote class="vanilla"><div>Henry S. Thompson. 2024. "Improved methodology for longitudinal Web | 114   <blockquote class="vanilla"><div>Henry S. Thompson. 2024. "Improved methodology for longitudinal Web |
115 analytics using Common Crawl". In <i>ACM Web Science Conference (Websci ’24)</i>, | 115 analytics using Common Crawl". In <i>ACM Web Science Conference (Websci ’24)</i>, |
116 May 21–24, 2024, Stuttgart, Germany. ACM, New York, NY, USA, 11 pages. | 116 May 21–24, 2024, Stuttgart, Germany. ACM, New York, NY, USA, 11 pages. |
117 <a href="...">[coming soon]</a> | 117 <a href="...">[coming soon]</a> |
118 </div></blockquote></li><li><a name="For_the_data"><b>For the data</b></a> | 118 </div></blockquote></li><li><a name="For_the_data"><b>For the data</b></a> |