changeset 34:052f4ff4eae6

struggling w. page limit
author Henry S. Thompson <ht@inf.ed.ac.uk>
date Mon, 22 Apr 2024 18:18:00 +0100
parents f899b1a922ce
children 88f4a1f0a8fa
files LURID3.xml
diffstat 1 files changed, 25 insertions(+), 5 deletions(-)
--- a/LURID3.xml	Mon Apr 22 15:17:02 2024 +0100
+++ b/LURID3.xml	Mon Apr 22 18:18:00 2024 +0100
@@ -16,16 +16,36 @@
     <p>Empirical evidence of how use of the Web has changed in the past provides crucial input to decisions about its future.   Creative uses of the mechanisms the Web provides expand its potential, but also sometimes put it at risk, so it’s worrying that there’s surprisingly little empirical evidence available to guide standardization and planning more generally. Which aspects of the Web’s functionality are widely used? Hardly ever used? How is this changing over time?</p>
     <p>The kind of evidence needed to answer such questions is hard to come by.
   The proposed research builds on our previous work in this area [Thompson and
-Tong 2018], [Thompson 2024], taking advantage of the computational resource Cirrus provides to validate and expand our work on the Common Crawl web archive.</p>
+Tong 2018], [Chen 2021], [Thompson 2024], taking advantage of the computational resource Cirrus provides to validate and expand our work on the Common Crawl web archive.</p>
     <p>Common Crawl (CC) is a very-large-scale web archive, containing
 petabytes of data from more than 65 monthly/bi-monthly archives, totalling over
 100 billion web pages.  Collection began in 2008 with annual archives,
 expanding steadily to the point that since 2017 archives were collected monthly
-until 2023, since when it's been bi-monthly. Recent archives contain over 3x10^9 pages, about 50 Terabytes (compressed).  Together with Edinburgh colleagues we have created local copies of 8 months of CC in a petabyte store attached to Cirrus.  For our purposes it is important to note that the overlap between any two archives as measured by Jaccard similarity of page checksums is less than .02 [9].</p>
-    <p>The proposed work will build on results from our just-completed project
+until 2023, since when it's been bi-monthly. Recent archives contain over
+3x10^9 pages, about 75 Terabytes (compressed).  Together with Edinburgh colleagues we have created local copies of 8 months of CC in a petabyte store attached to Cirrus.  For our purposes it is important to note that the overlap between any two archives as measured by Jaccard similarity of page checksums is less than .02 [Nagel 2022].</p>
+    <p>The very large size of CC's constituent archives makes using it for research, particularly
+longitudinal studies, which necessarily involve multiple archives, very
+expensive in terms of compute time and storage space and/or web bandwidth. The proposed work will build on our just-completed project
 (<name>LURID2: Assessing the validity of Common Crawl</name>, EPSRC Access to
-HPC Award from 2022&ndash;12 to 2023&ndash;04) on the Common Crawl web
-archive (CC), in the course of which we accomplished almost all of our four main objectives.</p>
+HPC Award from 2022&ndash;12 to 2023&ndash;04).  As reported in [Thompson
+2024], the two main results of that project for
+addressing the expense problem are based on exploiting and extending the much smaller (&lt;200 gigabytes (GB) compressed) <emph>index</emph> which is available for each archive, as follows:</p>
+    <list>
+     <item>By adding Last-Modified timestamps to the index we enable
+fine-grained longitudinal exploration using only a single archive;</item>
+     <item>By comparing the distribution of index features for each of the 100
+segments into which each archive is packaged for access with their distribution over the whole archive, we identified the least and most representative segments for a number of recent archives. This allows the segment(s) most representative of an archive to be used as proxies for the whole.</item>
+    </list>
+    <p>Combining these two approaches allowed us to produce a fine-grained
+analysis of how URI lengths have changed over time, leading to an unanticipated
+insight into how the process of creating Web pages <emph>itself</emph> has changed.</p>
+   </div>
+   <div>
+    <title>Objectives</title>
+    <list type="1defn">
+     <item term="Objective 1: Foster research">The augmented index from
+CC's August 2019 data is now available online.  In </item>
+    </list>
    </div>
   </div>
   <div>
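
Note on the overlap measure cited in the changed text: the Jaccard similarity of two archives' page checksums is the size of the intersection of the two checksum sets divided by the size of their union. The following is a minimal sketch of that calculation only, not the project's code; the function name `jaccard` and the sample checksum values are hypothetical, and treating two empty sets as having similarity 0.0 is a convention assumed here.

```python
def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two collections of checksums."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumed convention: two empty archives have similarity 0.0
        return 0.0
    return len(a & b) / len(union)


# Made-up page checksums standing in for two archives (illustration only)
archive_a = {"c1", "c2", "c3", "c4"}
archive_b = {"c4", "c5", "c6", "c7"}

# One shared checksum out of seven distinct ones
overlap = jaccard(archive_a, archive_b)
print(f"Jaccard similarity: {overlap:.3f}")
```

A value below .02, as reported for any pair of CC archives, would mean that fewer than 2% of the distinct checksums across the two archives appear in both.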