comparison index.html @ 1:d6f13dda3a11
As sent to Lindahl and Nagel
author | Henry S. Thompson <ht@inf.ed.ac.uk> |
---|---|
date | Mon, 15 Apr 2024 15:25:50 +0100 |
parents | 104cc8b6789b |
children | 268fe5fd117f |
comparison
0:104cc8b6789b | 1:d6f13dda3a11 |
---|---|
1 <?xml version="1.0" encoding="US-ASCII"?> | |
2 <!DOCTYPE html | |
3 PUBLIC "-//HST//DTD XHTML5 1.0 Transitional//EN" "http://www.ltg.ed.ac.uk/~ht/xhtml5.dtd"> | |
4 <html xmlns="http://www.w3.org/1999/xhtml"><head><meta name="copyright" content="Copyright © 2024 <a href="http://www.ltg.ed.ac.uk/~ht/">Henry S. Thompson</a>&#160;<a href="http://creativecommons.org/licenses/by-sa/3.0/deed.en">CC-BY-SA</a>"/><meta http-equiv="Content-type" content="text/html; charset=utf-8"/><style type="text/css"> | |
5 ul.nolabel { margin: 0; margin-left: -2.5em} | |
6 ul.naked.nolabel {margin: 0; margin-left: 0; padding-left: 0} | |
7 ul.cdefn {clear: both} | |
8 div.ndli { margin-bottom: 1ex } | |
9 div.hidden {display: none} | |
10 | |
11 ul.naked > li { list-style-type: none; background: none; margin-left: 2em; | |
12 margin-bottom: 0 } | |
13 li ul.naked > li, dd ul.naked > li { list-style-type: none; background: none; margin-left: 0; | |
14 margin-bottom: 0 } | |
15 li.cdefni {} | |
16 li.cdefni span.cl {display: inline-block; vertical-align: bottom} | |
17 li.cdefni span.cr {display: inline-block; margin-left: 1em; vertical-align: bottom} | |
18 pre.code {display: inline-block} | |
19 blockquote.vanilla {display: inline-block; margin-left: 1em; | |
20 border: solid 1px; background: rgb(238,234,230); | |
21 padding: .5ex .5em} | |
22 blockquote.vanilla ul.naked li {margin-left: 0 ! important;font-size: 100%} | |
23 ol ol ol, ol ol ol li {list-style-type: lower-roman} | |
24 ol ol, ol ol li {list-style-type: lower-alpha} | |
25 i i {font-style: normal} | |
26 li li {font-style: normal} | |
27 li ul li {font-style: normal} | |
28 li { line-height: 100%; margin-top: 0.3em} | |
29 .math {font-family: 'Arial Unicode MS', 'Lucida Sans Unicode', serif} | |
30 .sub {font-size: 80%; vertical-align: sub} | |
31 .termref {text-decoration: none; color: #606000} | |
32 .licence {margin-left: 1em; font-size: 70%} | |
33 .credits {margin-left: 1.5em; font-size: 70%} | |
34 .right {position: absolute} | |
35 .stackdown {vertical-align: text-top; margin-top: 0} | |
36 body {font-size: 12pt} | |
37 pre.numbered { | |
38 white-space: pre-wrap; | |
39 } | |
40 div.counter { | |
41 counter-reset: listing; | |
42 } | |
43 pre.numbered code { | |
44 counter-increment: listing; | |
45 } | |
46 pre.cl code::before { | |
47 content: "$ " ; | |
48 font-size: 80%; | |
49 width: 2em | |
50 } | |
51 pre.numbered code::before { | |
52 content: counter(listing) ". "; | |
53 display: inline-block; | |
54 font-size: 80%; | |
55 width: 3em; | |
56 padding-left: auto; | |
57 margin-left: auto; | |
58 text-align: right; | |
59 } | |
60 | |
61 /* content doesn't combine :-( */ | |
62 pre.numbered.cl code::before { | |
63 content: counter(listing) ". $ "; | |
64 display: inline-block; | |
65 font-size: 80%; | |
66 width: 3em; | |
67 padding-left: auto; | |
68 margin-left: auto; | |
69 text-align: right; | |
70 } | |
71 @page { size: A4 portrait; margin: 2cm; | |
72 orphans: 2; widows: 2;} | |
73 @media screen { | |
74 body {width: 20cm; margin-left: auto; margin-right: auto} | |
75 } | |
76 @media print { | |
77 body {font-size: 10pt} | |
78 h1, h2, h3, h4 {page-break-after: avoid} | |
79 } | |
80 pre.code {font-family: monospace; | |
81 font-weight: bold; | |
82 line-height: 120%; | |
83 padding-top: 0.2em; | |
84 padding-bottom: 0.2em; | |
85 padding-left: 1em; | |
86 padding-right: 1em; | |
87 border-style: solid; | |
88 border-left-width: 1em; | |
89 border-top-width: thin; | |
90 border-right-width: thin; | |
91 border-bottom-width: thin; | |
92 border-color: #95ABD0; | |
93 color: #00428C; | |
94 background-color: #E4E5E7; | |
95 } | |
96 pre {margin-left: 0em} | |
97 div.toc h2 {font-size: 120%; margin-top: 0em; margin-bottom: 0em} | |
98 div.toc h4 {font-size: 100%; margin-top: 0em; margin-bottom: 0em; | |
99 margin-left: 1em} | |
100 div.toc h1 {font-size: 140%; margin-bottom: 0em} | |
101 div.toc ul {margin-top: 1ex} | |
102 .byline {font-size: 120%} | |
103 div.figure {margin-left: 2em} | |
104 div.caption {font-style: italic; font-weight: bold; margin-top: 1em} | |
105 i i {font-style: normal} | |
106 img {border: 0} | |
107 .copyright {font-size: 70%} | |
108 .note {width: 20%; float: right; clear: right; margin-left: .5em} | |
109 </style><title>Augmentations to Common Crawl</title></head><body style="font-family: DejaVu Sans, Arial; background: rgb(254,250,246)"><div style="text-align: center" class="head"><h1>Augmentations to Common Crawl</h1><hr/><div class="byline">Henry S. Thompson</div><div class="byline">15 Apr 2024</div><div class="copyright">Copyright © 2024 <a href="http://www.ltg.ed.ac.uk/~ht/">Henry S. Thompson</a> <a href="http://creativecommons.org/licenses/by-sa/3.0/deed.en">CC-BY-SA</a></div></div><div class="body"><div><h2>1. Introduction</h2><p>This site contains a preliminary publication of my augmented <a href="https://commoncrawl.org/blog/announcing-the-common-crawl-index">index files</a> | |
110 for <a href="https://commoncrawl.org/blog/august-2019-crawl-archive-now-available">CC-MAIN-2019-35</a>. This index contains one additional field, <code>lastmod</code>, in about 18% of the entries, giving the value of the <code>Last-Modified</code> header as a POSIX-format timestamp, enabling much finer-grained longitudinal study of the corresponding web resources. The filename, offset and length fields in the augmented index are unchanged, and so can be used for retrieval from the original WARC files.</p></div><div><h2>2. Contents</h2><ul class=" "><li>My <a href="Thompson_WebSci24.pdf">forthcoming paper</a> describing | |
111 the augmented index and its uses</li><li>The <a href="CC-MAIN-2019-35/cdx/warc/cluster.idx">top-level index file</a></li><li><a href="CC-MAIN-2019-35/cdx/warc/idx/">The directory containing | |
112 the individual gzipped index files themselves</a>, with names of the form | |
113 <code>cdx-00nnn.gz</code>, for <code>nnn</code> in <code>000–299</code></li></ul></div><div><h2>3. Licence and citation</h2><p>The paper and data contained herein are Copyright © 2024 Henry S. Thompson <a href="http://creativecommons.org/licenses/by-sa/3.0/deed.en">CC-BY-SA</a></p><p>Please cite information from here as follows:</p><ul class="naked "><li><a name="For_the_paper"><b>For the paper</b></a> | |
114   <blockquote class="vanilla"><div>Henry S. Thompson. 2024. "Improved methodology for longitudinal Web | |
115 analytics using Common Crawl". In <i>ACM Web Science Conference (Websci ’24)</i>, | |
116 May 21–24, 2024, Stuttgart, Germany. ACM, New York, NY, USA, 11 pages. | |
117 <a href="...">[coming soon]</a> | |
118 </div></blockquote></li><li><a name="For_the_data"><b>For the data</b></a> | |
119   <blockquote class="vanilla"><div>Henry S. Thompson. 2024. <i>Augmented index | |
120 for Common Crawl August 2019, with Last-Modified timestamps</i>. | |
121 <a href="https://markup.co.uk/ccrawl/">https://markup.co.uk/ccrawl/</a>. Retrieved ...</div></blockquote></li></ul></div><div><h2>4. Acknowledgements</h2><p>Without the vision of those responsible for Common Crawl and the | |
122 generosity of Amazon in hosting it, this work could never have happened.</p><p>Access to the <a href="http://www.cirrus.ac.uk">Cirrus UK National Tier-2 HPC Service</a> at the Edinburgh | |
123 Parallel Computing Centre used to produce the augmented index was supported | |
124 by EPSRC and UKRI HPC Access awards to Henry S. Thompson.</p><p>Thanks to Sebastian Nagel of Common Crawl for prompt and helpful | |
125 replies to many emails over the years, and to Greg Lindahl of Common | |
126 Crawl and Tom Morris for more recent help with consistency problems in the index | |
127 and the challenges of increasing load on the Common Crawl servers.</p></div></div></body></html> |
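
The introduction in the page above says that the augmented index adds a `lastmod` field holding a POSIX timestamp and leaves `filename`, `offset` and `length` unchanged, so that they still work for retrieval from the original WARC files. The following is a minimal Python sketch of how one index entry might be read. It assumes the usual Common Crawl CDX line layout (SURT key, capture timestamp, then a JSON object) and assumes `lastmod` is stored as seconds since the epoch, either as a string or a number. The sample line, its values and the helper names are hypothetical and are not taken from the published index.

```python
import json
from datetime import datetime, timezone


def parse_cdx_line(line):
    """Split one CDX line into (urlkey, capture timestamp, JSON fields).

    Assumed layout: '<SURT key> <14-digit timestamp> <JSON object>'.
    """
    urlkey, timestamp, payload = line.rstrip("\n").split(" ", 2)
    return urlkey, timestamp, json.loads(payload)


def last_modified(fields):
    """Return the lastmod value as an aware UTC datetime, or None.

    Only some entries carry lastmod, so a missing key is expected.
    """
    raw = fields.get("lastmod")
    if raw is None:
        return None
    return datetime.fromtimestamp(int(raw), tz=timezone.utc)


def warc_range_header(fields):
    """Build an HTTP Range header value covering one WARC record.

    offset and length are byte counts into the file named by 'filename';
    HTTP byte ranges are inclusive, hence the -1.
    """
    offset = int(fields["offset"])
    length = int(fields["length"])
    return f"bytes={offset}-{offset + length - 1}"


# Hypothetical entry, for illustration only.
sample = (
    'org,example)/ 20190817000000 '
    '{"url": "http://example.org/", "status": "200", '
    '"length": "1000", "offset": "5000", '
    '"filename": "crawl-data/EXAMPLE/example.warc.gz", '
    '"lastmod": "1546300800"}'
)

if __name__ == "__main__":
    _, _, fields = parse_cdx_line(sample)
    print(last_modified(fields))
    print(fields["filename"], warc_range_header(fields))
```

In use, each line of a decompressed `cdx-00nnn.gz` file would be passed through `parse_cdx_line`, and entries where `last_modified` returns `None` would simply be skipped in any analysis that depends on modification times.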