annotate index.html @ 6:cc5cef8ba548 default tip
expanded with example script,
updated to point to full paper,
include slides
author | Henry Thompson <ht@markup.co.uk> |
---|---|
date | Thu, 23 May 2024 16:51:36 +0200 |
parents | 268fe5fd117f |
children |
rev | line source |
---|---|
1
d6f13dda3a11
As sent to Lindahl and Nagel
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
0
diff
changeset
|
1 <?xml version="1.0" encoding="US-ASCII"?> |
|
2 <!DOCTYPE html |
|
3 PUBLIC "-//HST//DTD XHTML5 1.0 Transitional//EN" "http://www.ltg.ed.ac.uk/~ht/xhtml5.dtd"> |
|
4 <html xmlns="http://www.w3.org/1999/xhtml"><head><meta name="copyright" content="Copyright © 2024 &lt;a href=&quot;http://www.ltg.ed.ac.uk/~ht/&quot;&gt;Henry S. Thompson&lt;/a&gt;&amp;#160;&lt;a href=&quot;http://creativecommons.org/licenses/by-sa/3.0/deed.en&quot;&gt;CC-BY-SA&lt;/a&gt;"/><meta http-equiv="Content-type" content="text/html; charset=utf-8"/><style type="text/css"> |
|
5 ul.nolabel { margin: 0; margin-left: -2.5em} |
|
6 ul.naked.nolabel {margin: 0; margin-left: 0; padding-left: 0} |
|
7 ul.cdefn {clear: both} |
|
8 div.ndli { margin-bottom: 1ex } |
|
9 div.hidden {display: none} |
|
10 |
|
11 ul.naked > li { list-style-type: none; background: none; margin-left: 2em; |
|
12 margin-bottom: 0 } |
|
13 li ul.naked > li, dd ul.naked > li { list-style-type: none; background: none; margin-left: 0; |
|
14 margin-bottom: 0 } |
|
15 li.cdefni {} |
|
16 li.cdefni span.cl {display: inline-block; vertical-align: bottom} |
|
17 li.cdefni span.cr {display: inline-block; margin-left: 1em; vertical-align: bottom} |
|
18 pre.code {display: inline-block} |
|
19 blockquote.vanilla {display: inline-block; margin-left: 1em; |
|
20 border: solid 1px; background: rgb(238,234,230); |
|
21 padding: .5ex .5em} |
|
22 blockquote.vanilla ul.naked li {margin-left: 0 ! important;font-size: 100%} |
|
23 ol ol ol, ol ol ol li {list-style-type: lower-roman} |
|
24 ol ol, ol ol li {list-style-type: lower-alpha} |
|
25 i i {font-style: normal} |
|
26 li li {font-style: normal} |
|
27 li ul li {font-style: normal} |
|
28 li { line-height: 100%; margin-top: 0.3em} |
|
29 .math {font-family: 'Arial Unicode MS', 'Lucida Sans Unicode', serif} |
|
30 .sub {font-size: 80%; vertical-align: sub} |
|
31 .termref {text-decoration: none; color: #606000} |
|
32 .licence {margin-left: 1em; font-size: 70%} |
|
33 .credits {margin-left: 1.5em; font-size: 70%} |
|
34 .right {position: absolute} |
|
35 .stackdown {vertical-align: text-top; margin-top: 0} |
|
36 body {font-size: 12pt} |
|
37 pre.numbered { |
|
38 white-space: pre-wrap; |
|
39 } |
|
40 div.counter { |
|
41 counter-reset: listing; |
|
42 } |
|
43 pre.numbered code { |
|
44 counter-increment: listing; |
|
45 } |
|
46 pre.cl code::before { |
|
47 content: "$ " ; |
|
48 font-size: 80%; |
|
49 width: 2em |
|
50 } |
|
51 pre.numbered code::before { |
|
52 content: counter(listing) ". "; |
|
53 display: inline-block; |
|
54 font-size: 80%; |
|
55 width: 3em; |
|
56 padding-left: 0; |
|
57 margin-left: auto; |
|
58 text-align: right; |
|
59 } |
|
60 |
|
61 /* content doesn't combine :-( */ |
|
62 pre.numbered.cl code::before { |
|
63 content: counter(listing) ". $ "; |
|
64 display: inline-block; |
|
65 font-size: 80%; |
|
66 width: 3em; |
|
67 padding-left: 0; |
|
68 margin-left: auto; |
|
69 text-align: right; |
|
70 } |
|
71 @page { size: A4 portrait; margin: 2cm; |
|
72 orphans: 2; widows: 2;} |
|
73 @media screen { |
|
74 body {width: 20cm; margin-left: auto; margin-right: auto} |
|
75 } |
|
76 @media print { |
|
77 body {font-size: 10pt} |
|
78 h1, h2, h3, h4 {page-break-after: avoid} |
|
79 } |
|
80 pre.code {font-family: monospace; |
|
81 font-weight: bold; |
|
82 line-height: 120%; |
|
83 padding-top: 0.2em; |
|
84 padding-bottom: 0.2em; |
|
85 padding-left: 1em; |
|
86 padding-right: 1em; |
|
87 border-style: solid; |
|
88 border-left-width: 1em; |
|
89 border-top-width: thin; |
|
90 border-right-width: thin; |
|
91 border-bottom-width: thin; |
|
92 border-color: #95ABD0; |
|
93 color: #00428C; |
|
94 background-color: #E4E5E7; |
|
95 } |
|
96 pre {margin-left: 0em} |
|
97 div.toc h2 {font-size: 120%; margin-top: 0em; margin-bottom: 0em} |
|
98 div.toc h4 {font-size: 100%; margin-top: 0em; margin-bottom: 0em; |
|
99 margin-left: 1em} |
|
100 div.toc h1 {font-size: 140%; margin-bottom: 0em} |
|
101 div.toc ul {margin-top: 1ex} |
|
102 .byline {font-size: 120%} |
|
103 div.figure {margin-left: 2em} |
|
104 div.caption {font-style: italic; font-weight: bold; margin-top: 1em} |
|
105 i i {font-style: normal} |
|
106 img {border: 0} |
|
107 .copyright {font-size: 70%} |
|
108 .note {width: 20%; float: right; clear: right; margin-left: .5em} |
6
cc5cef8ba548
expanded with example script,
Henry Thompson <ht@markup.co.uk>
parents:
4
diff
changeset
|
109 </style><title>Augmentations to Common Crawl</title></head><body style="font-family: DejaVu Sans, Arial; background: rgb(254,250,246)"><div style="text-align: center" class="head"><h1>Augmentations to Common Crawl</h1><hr/><div class="byline">Henry S. Thompson</div><div class="byline">23 May 2024</div><div class="copyright">Copyright © 2024 <a href="http://www.ltg.ed.ac.uk/~ht/">Henry S. Thompson</a> <a href="http://creativecommons.org/licenses/by-sa/3.0/deed.en">CC-BY-SA</a></div></div><div class="body"><div><h2>1. Introduction</h2><p>This site contains a copy of my augmented index files |
|
110 for <a href="https://commoncrawl.org/blog/august-2019-crawl-archive-now-available">CC-MAIN-2019-35</a>. This index contains all of <a href="https://commoncrawl.org/blog/august-2019-crawl-archive-now-available">the original index</a>, with one additional field, <code>lastmod</code>, in about 18% of the entries, giving the value of the <code>Last-Modified</code> header as a POSIX-format timestamp, enabling much finer-grained longitudinal study of the corresponding web resources. The filename, offset and length fields in the augmented index are unchanged, and so can be used for retrieval from the original WARC files.</p><p>The format of the Common Crawl's index files is described in <a href="https://commoncrawl.org/blog/announcing-the-common-crawl-index">this announcement</a>.</p></div><div><h2>2. Contents</h2><ul class=" "><li>My <a href="Thompson_WebSci24.pdf">paper</a>, presented at WebSci24, describing |
1
|
111 the augmented index and its uses</li><li>The <a href="CC-MAIN-2019-35/cdx/warc/cluster.idx">top-level index file</a></li><li><a href="CC-MAIN-2019-35/cdx/warc/idx/">The directory containing |
|
112 the individual gzipped index files themselves</a>, with names of the form |
6
|
113 <code>cdx-00nnn.gz</code>, for <code>nnn</code> in <code>000–299</code></li><li><a href="Thompson_WebSci24_slides.pdf">WebSci 24 conference slides</a></li></ul></div><div><h2>3. Efficient access to Common Crawl using Amazon S3</h2><p>The University of Edinburgh's <a href="https://edinburgh-international-data-facility.ed.ac.uk/">Edinburgh International Data Facility</a> (EIDF) hosts a |
|
114 copy of the augmented index on an Amazon S3 server. It supports open |
|
115 access to the index via unsigned requests to (range-restricted) |
|
116 <b>s3:</b> URIs, for example using the <a href="https://aws.amazon.com/cli/">Amazon <code>aws</code> |
|
117 Command Line Interface</a>.</p><p>The best way to understand how this works, once you've read how |
|
118 the index itself works <a href="Thompson_WebSci24.pdf">in the paper, section 2.1</a>, is to work through <a href="eidf125_example.sh">an example</a> of using the augmented index to access an individual |
|
119 Common Crawl retrieval record using a timestamp.</p></div><div><h2>4. Licence and citation</h2><p>The paper and data contained herein are Copyright © 2024 Henry S. Thompson <a href="http://creativecommons.org/licenses/by-sa/3.0/deed.en">CC-BY-SA</a></p><p>Please cite information from here as follows:</p><ul class="naked "><li><a name="For_the_paper"><b>For the paper</b></a> |
1
|
120   <blockquote class="vanilla"><div>Henry S. Thompson. 2024. "Improved methodology for longitudinal Web |
|
121 analytics using Common Crawl". In <i>ACM Web Science Conference (Websci ’24)</i>, |
|
122 May 21–24, 2024, Stuttgart, Germany. ACM, New York, NY, USA, 11 pages. |
6
|
123 <a href="https://doi.org/10.1145/3614419.3644018">https://doi.org/10.1145/3614419.3644018</a> |
1
|
124 </div></blockquote></li><li><a name="For_the_data"><b>For the data</b></a> |
|
125   <blockquote class="vanilla"><div>Henry S. Thompson. 2024. <i>Augmented index |
|
126 for Common Crawl August 2019, with Last-Modified timestamps</i>. |
6
|
127 <a href="https://markup.co.uk/ccrawl/">https://markup.co.uk/ccrawl/</a>. Retrieved ...</div></blockquote></li></ul></div><div><h2>5. Acknowledgements</h2><p>Without the vision of those responsible for Common Crawl and the |
1
|
128 generosity of Amazon in hosting it, this work could never have happened.</p><p>Access to the <a href="http://www.cirrus.ac.uk">Cirrus UK National Tier-2 HPC Service</a> at the Edinburgh |
|
129 Parallel Computing Centre, used to produce the augmented index, was supported |
|
130 by EPSRC and UKRI HPC Access awards to Henry S. Thompson.</p><p>Thanks to Sebastian Nagel of Common Crawl for many prompt and helpful |
|
131 replies to many emails over the years, and to Greg Lindahl of Common |
|
132 Crawl and Tom Morris for more recent help with consistency problems in the index |
|
133 and the challenges of increasing load on the Common Crawl servers.</p></div></div></body></html> |
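
The introduction in the page source above says that each augmented index entry keeps the original filename, offset and length fields and may carry an extra `lastmod` field holding a POSIX timestamp. As a rough illustration only (it is not taken from the linked `eidf125_example.sh`, which is not reproduced here), the following Python sketch parses one index line, reads `lastmod` when present, and builds the HTTP `Range` value that would select a single record from the named WARC file. The sample line and every value in it are hypothetical, the function names are illustrative, and the code assumes that `lastmod`, `offset` and `length` are stored as decimal digits.

```python
import json
from datetime import datetime, timezone


def parse_cdx_line(line):
    """Split one index line into (urlkey, capture timestamp, JSON fields)."""
    urlkey, timestamp, payload = line.rstrip("\n").split(" ", 2)
    return urlkey, timestamp, json.loads(payload)


def byte_range_header(fields):
    """Build a Range header value covering exactly one record in the WARC file."""
    start = int(fields["offset"])
    end = start + int(fields["length"]) - 1  # the end of an HTTP byte range is inclusive
    return f"bytes={start}-{end}"


def last_modified(fields):
    """Return lastmod as an aware UTC datetime, or None if the entry has no lastmod."""
    value = fields.get("lastmod")
    if value is None:
        return None
    # Assumption: lastmod is a count of seconds since the POSIX epoch.
    return datetime.fromtimestamp(int(value), tz=timezone.utc)


# Hypothetical entry: every value below is made up for illustration.
sample = (
    'com,example)/ 20190817000000 {"url": "http://example.com/", '
    '"mime": "text/html", "status": "200", "digest": "EXAMPLEDIGEST", '
    '"length": "1000", "offset": "5000", '
    '"filename": "crawl-data/example.warc.gz", "lastmod": "1565000000"}'
)

urlkey, captured, fields = parse_cdx_line(sample)
print(fields["filename"], byte_range_header(fields), last_modified(fields).isoformat())
```

The resulting `bytes=start-end` value can be sent as the `Range` header of a request for the file named in `filename`. Entries without `lastmod` (most of the index, given the roughly 18% coverage stated above) yield `None` from `last_modified`.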