annotate LURID3.xml @ 63:663e55844c1d

comparing profiles w and w/o cython
author Henry S. Thompson <ht@inf.ed.ac.uk>
date Mon, 06 Jan 2025 17:59:20 +0000
parents 052f4ff4eae6
children
rev   line source
33
f899b1a922ce getting started
Henry S. Thompson <ht@inf.ed.ac.uk>
parents:
diff changeset
1 <?xml version='1.0'?>
2 <?xml-stylesheet type="text/xsl" href="../../../lib/xml/doc.xsl" ?>
3 <!DOCTYPE doc SYSTEM "../../../lib/xml/doc.dtd" >
4 <doc xmlns:x="http://www.w3.org/1999/xhtml">
5 <head>
6 <title>LURID3: Longitudinal studies of the World Wide Web<x:br/>
7 <span style="font-size:80%">UKRI reference APP39557</span></title>
8 <author>Henry S. Thompson</author>
9 <date>22 Apr 2024</date>
10 </head>
11 <body>
12 <div>
13 <title>Vision</title>
14 <div>
15 <title>Motivation</title>
16 <p>Empirical evidence of how use of the Web has changed in the past provides crucial input to decisions about its future. Creative uses of the mechanisms the Web provides expand its potential, but also sometimes put it at risk, so it’s worrying that there’s surprisingly little empirical evidence available to guide standardization and planning more generally. Which aspects of the Web’s functionality are widely used? Hardly ever used? How is this changing over time?</p>
17 <p>The kind of evidence needed to answer such questions is hard to come by.
18 The proposed research builds on our previous work in this area [Thompson and
34
052f4ff4eae6 struggling w. page limit
Henry S. Thompson <ht@inf.ed.ac.uk>
parents: 33
diff changeset
19 Tong 2018], [Chen 2021], [Thompson 2024], taking advantage of the computational resource Cirrus provides to validate and expand our work on the Common Crawl web archive.</p>
20 <p>Common Crawl (CC) is a very-large-scale web archive, containing
21 petabytes of data from more than 65 monthly/bi-monthly archives, totalling over
22 100 billion web pages. Collection began in 2008 with annual archives,
23 expanding steadily to the point that since 2017 archives were collected monthly
24 until 2023, since when collection has been bi-monthly. Recent archives contain over
25 3×10^9 pages, about 75 terabytes (TB) compressed. Together with Edinburgh colleagues, we have created local copies of 8 months of CC in a petabyte store attached to Cirrus. For our purposes it is important to note that the overlap between any two archives, as measured by Jaccard similarity of page checksums, is less than 0.02 [Nagel 2022].</p>
26 <p>The very large size of CC's constituent archives makes using it for research, particularly
27 longitudinal studies, which necessarily involve multiple archives, very
28 expensive in terms of compute time and storage space and/or web bandwidth. The proposed work will build on our just-completed project
29 (<name>LURID2: Assessing the validity of Common Crawl</name>, EPSRC Access to
30 HPC Award from 2022&ndash;12 to 2023&ndash;04). As reported in [Thompson
31 2024], the two main results of that project for
32 addressing the expense problem are based on exploiting and extending the much smaller (&lt;200 gigabytes (GB) compressed) <emph>index</emph> which is available for each archive, as follows:</p>
33 <list>
34 <item>By adding Last-Modified timestamps to the index we enable
35 fine-grained longitudinal exploration using only a single archive;</item>
36 <item>By comparing the distribution of index features for each of the 100
37 segments into which each archive is packaged for access with their distribution over the whole archive, we identified the least and most representative segments for a number of recent archives. This allows the most representative segment(s) of an archive to be used as proxies for the whole.</item>
38 </list>
39 <p>Combining these two approaches allowed us to produce a fine-grained
40 analysis of how URI lengths have changed over time, leading to an unanticipated
41 insight into how the process of creating Web pages <emph>itself</emph> has changed.</p>
42 </div>
43 <div>
44 <title>Objectives</title>
45 <list type="1defn">
46 <item term="Objective 1: Foster research">The augmented index from
47 CC's August 2019 data is now available online. In </item>
48 </list>
49 </div>
50 </div>
51 <div>
52 <title>Approach</title>
53 </div>
54 </body>
55 </doc>