
Live versus archive: comparing a web archive to a population of web pages.

Abstract

We examine whether biases exist in the British websites stored in the Internet Archive data. We find that the Internet Archive contains a surprisingly small subset, about 24%, of the web pages of the website used for our case study (the travel site, TripAdvisor). Furthermore, the subset of data we found in the Internet Archive appears to be biased toward large prominent web pages and is not a random sample of the web pages on the site. This bias could create serious problems for research using archived websites, and we discuss this issue at the end of the chapter.
2 Live versus archive: Comparing a web archive to a population of webpages
Scott A. Hale, Grant Blank and Victoria D. Alexander
Introduction
With its seemingly limitless scope, the World Wide Web promises
enormous advantages, along with enormous problems, to researchers
who seek to use it as a source of data. Websites change continually
and a high level of flux makes it challenging to capture a snapshot
of the web, or even a cross-section of a small subset of the web. Web
archives, such as those at the Internet Archive, promise to store and
deliver repeated cross-sections of the web, offering the potential for
longitudinal analysis. Whether this potential is realized depends on
the extent to which the archive has fully captured the web. Therefore,
a crucial question for Internet researchers is: ‘How good are the archival
data?’
We ask if there are systematic biases in the Internet Archive, using
a case study to address this question. Specifically, we are interested in
whether biases exist in the British websites stored in the Internet Archive
data. We find that the Internet Archive contains a surprisingly small subset,
about 24%, of the web pages of the website used for our case study
(the travel site, TripAdvisor). Furthermore, the subset of data we found
in the Internet Archive appears to be biased and is not a random sample
of the web pages on the site. The archived data we examine has a bias
toward prominent web pages. This bias could create serious problems
for research using archived websites, and we discuss this issue at the
end of the chapter.
Please cite as: Hale, Scott A., Grant Blank and Victoria D. Alexander (2017).
‘Live versus archive: comparing a web archive to a population of web pages’ in
Niels Brügger and Ralph Schroeder (eds.) The web as history. London: UCL
Press, pp. 63-79.
The web has always been an extremely dynamic object. One widely
quoted study found that 35–40% of web pages changed content in any
given week (Fetterly et al., 2004). Another study found that 26% of all
web pages visited by users twice within an hour had changed content,
and 69% of web pages revisited within a day had changed (Weinreich
etal., 2008). For researchers interested in the evolution of the web or
any part of the web (such as the diusion of certain web technologies),
this is a serious challenge. They need historical data, and almost all of
this history islost.
This problem was recognized early in the development of the web,
and the Internet Archive was incorporated in 1996 by Bruce Gilliat and
Brewster Kahle (Kimpton and Ubios, 2006). The goal of the Internet
Archive is to collect digital data in danger of disappearing. There has
never been any way to completely enumerate all web pages; so, all
attempts to archive the web are to some extent incomplete. The general
approach is to use a web crawler, a software program that starts with a
list of Uniform Resource Locators (URLs) to visit (a seed list) and down-
loads a copy of the content at each of these URLs. Each downloaded web
page is examined to find all the hyperlinks, which are then added to
the list of URLs to be downloaded (subject to certain policies about how
much content and what types of content to download). In this way, the
software ‘crawls’ from page to page following hyperlinks somewhat like
snowball sampling. Despite its best efforts, the Internet Archive cannot
collect everything. This leads to the question: How much of the web is
archived?
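The crawling procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Internet Archive's crawler: the function names (`crawl`, `toy_links`), the `max_pages` limit and the toy link graph are all hypothetical, and no network requests are made.

```python
from collections import deque


def crawl(seed_urls, get_links, max_pages=100):
    """Breadth-first crawl: visit the seeds, then any newly discovered links."""
    queue = deque(seed_urls)
    seen = set(seed_urls)
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)  # a real crawler would download and store the page here
        for link in get_links(url):
            if link not in seen:  # each URL is queued at most once
                seen.add(link)
                queue.append(link)
    return visited


# Toy link graph standing in for the web (hypothetical pages).
TOY_WEB = {
    "a.example/": ["a.example/1", "a.example/2"],
    "a.example/1": ["a.example/2", "a.example/3"],
    "a.example/2": [],
    "a.example/3": ["a.example/"],
    "a.example/orphan": [],  # no page links here, so the crawler never finds it
}


def toy_links(url):
    """Return the hyperlinks found on a page of the toy web."""
    return TOY_WEB.get(url, [])


pages = crawl(["a.example/"], toy_links)
```

In this sketch the page `a.example/orphan` is never collected because nothing links to it, and a low `max_pages` value cuts the crawl short. Both effects are ways in which a crawl-based archive can end up with a subset rather than the full population of pages.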
In order to answer this question, we looked at two different collections
of web pages, one that was collected and archived by the Internet
Archive, and one that we collected ourselves. In this way, we are able
to examine the completeness of the data that are held in the Internet
Archive, at least with respect to our case study. To achieve this, we
needed a case where we could reasonably find and download the full
population of historical web pages. It is extremely difficult to find such
a population since the Internet is constantly changing, and purposely
collected archives are often the only source of historical web pages. We
chose TripAdvisor as our case study as the website stores all reviews,
including those written years ago, and thus allows us to reconstruct a
historical population of webpages.
Our case study compares a full population of web pages from
TripAdvisor with the subset stored by the Internet Archive. We
defined our population as all tourist attractions in London listed
on the TripAdvisor website. We downloaded these attractions from
the current TripAdvisor site and found the earliest review of each
attraction. We call this data the ‘live data’, and compare it to Internet
Archive data. The specific data we use for comparison are a copy of
all the Internet Archive data for all web pages in the .uk country-code
top-level domain from 1996 to 2013 that were copied to the British
Library, which is where we obtained them. We refer to these data as
the ‘archived data’ and note that they form a ‘subset’ rather than a
‘sample’ of the web because the Internet Archive does not claim to
select a probability sample.
While others have looked at archive coverage in terms of web pages
(URLs) generally, notably Ainsworth et al. (2013), this chapter is the
first attempt to look at the extent of coverage of an individual website in
depth. The remainder of this chapter is organized as follows. We review
the existing literature comparing archived coverage to the web. We
describe the Internet Archive and the source of our data before discuss-
ing TripAdvisor. We report our methodology and results and then turn
to the implications of these results for research using web archival data.
Literature
Prior research on the success of web archiving is surprisingly sparse.
Two studies, based on small subsets, address this issue. Thelwall and
Vaughan (2004) studied differences in website coverage. They used randomly
constructed names up to four letters long to find a total of 521
commercial websites related to four countries: the USA, Taiwan, China
and Singapore and found large differences across the countries. They
found that the Internet Archive in 2004 had at least one page stored for
92% of the US commercial websites, but had at least one page stored for
only 58% of the Chinese commercial websites. Russell and Kane (2008)
looked at web citations in history journals. They attempted to retrieve,
from the Internet Archive, those citations that were no longer available
on live websites. Only 57% of the citations not available online were
retrievable from the Internet Archive.
Both of these studies examined only a small number of websites,
and Russell and Kane’s selection was not a random sample. The most
complete study on the extent to which the web is archived is Ainsworth
etal. (2013).1 They sampled 1,000 URLs each from the Open Directory
Project (DMOZ), the recent URLs bookmarked on the social book-
marking site Delicious, randomly created hash values from Bitly, and
the Google search engine index. They used the Memento API (Van de
Sompel et al., 2009; Van de Sompel et al., 2010) to search 12 archives
(including the Internet Archive) for each of the samples of 1,000 URLs
and found that between 35% and 90% of the web was archived.
This is not a very satisfactory answer because it is such a wide
range, but it broadly confirms the results from the smaller projects of
Thelwall and Vaughan (2004) and Russell and Kane (2008). Large parts
of the web are not included in any archive. A major weakness of these
studies is a lack of detail about how much of each website has been
archived. Thelwall and Vaughan (2004) counted a website as present
in the archive as long as at least one page was archived. Ainsworth et al.
(2013) and Russell and Kane (2008) looked at web pages (URLs) from
many websites but did not examine how much of each site was in the
archive. We address this gap by analysing how much of a website has
been archived and whether the archived pages in the website differ in a
systematic way from the population of all pages on the website.
There is a large literature on the use of Internet Archive data.
However, this literature is less helpful to scholars than it could be, as
it largely discusses what authors think should be possible without reference
to the reality of what actually is possible (e.g. Arms et al., 2006;
Weber, 2014). Our study uses a computational approach to assess what
can actually be learnt from Internet Archive data.
Case selection
We study London attractions found on the travel website TripAdvisor
(TripAdvisor.co.uk). TripAdvisor, according to its own strapline, is the
‘world’s largest travel website’. TripAdvisor (2014) cites Google Analytics
as showing that it received an average of 315 million unique visitors
each month in the third quarter of 2014. This figure shows the extraordi-
nary importance of TripAdvisor in the travel business. It is therefore not
surprising that most academic research on TripAdvisor is found in the
tourism literature and focuses on hotel reviews. Previous studies tend to
focus on practical issues such as how users decide how to trust reviews,
the response of hotels to reviews, or the content of negative reviews
and complaints (O’Connor, 2008; Cunningham et al., 2010; Sparks
and Browning, 2010; Stringam and Gerdes, 2010; Ayeh et al., 2013).
In contrast, our substantive interest, discussed elsewhere, is in how
TripAdvisor works to convey cultural meanings. By studying reviews of
cultural organizations, we examine the blurring of distinctions between
high and popular culture and between commercial and non-profit venues
(Alexander et al., in preparation).
TripAdvisor displays user-generated reviews across categories
such as hotels, restaurants and attractions. (Attractions encompass all
elements of a city that are not restaurants or hotels.) Each review com-
prises a star rating, a title and a textual description. When starting a
review, users enter the name of the hotel, restaurant or attraction, and
if the target has been reviewed already, TripAdvisor suggests matches.
Users can choose to review an item that already exists in TripAdvisor,
or they can create an entry for a new, previously unreviewed establish-
ment. For each review, users must choose a star rating, ranging from
one star (negative) to five stars (positive). It is not possible for users to
post reviews without choosing a star rating. Users then enter a short
title or description in a free-form text box, and this serves as the title
of their review. They then write the review itself, which can be as short
or as long as they wish. TripAdvisor ranks hotels and attractions within
categories based on their reviews using a proprietary method and these
rankings may have a profound effect on the livelihood of hoteliers (Scott
and Orlikowski, 2012). From our perspective, however, a crucial benefit
of the reviews is that they provide a simple star rating combined with
a more nuanced textual description. The star ratings allow an explicit
comparison across different types of data, in this case, the archived data
and our own live data.
We limited our live data to TripAdvisor’s user-generated reviews
of London attractions on TripAdvisor’s UK site (tripadvisor.co.uk).
This offers major advantages. London is a world-class metropolis with
an enormous variety of attractions, providing us with a large range of
reviews. Despite its size, however, London is still a bounded space so
that our dataset can include the entire population of attractions and the
entire population of reviews. Using TripAdvisor’s UK site for London
attractions makes it an appropriate vehicle for comparison to the
archived data.2
At the time of data collection, the British Museum was the top
attraction in London, and was described as ‘#1 of 1,277 things to do
in London’ (TripAdvisor, 2015). We have compiled a dataset of these
attractions, as detailed in Table2.1. This allows us to compare across
datasets (live data versus archived data) on easily measured variables,
such as number of attractions and reviews, the average star rating for
each attraction, and the dates of reviews. Table2.1 lists example attrac-
tions in each of TripAdvisor’s top- level categories.
Table2.1 Categories of attractions on TripAdvisor in2015
Category Number of
attractions
in categorya
Example attractions
Amusement parks 3 The London Dungeon; Shrek’s
Adventure!
Boat tours &
watersports
45 Canal and River Cruises Day Tours;
Capital Pleasure Boats
Casinos &
gambling
17 Hippodrome Casino; Kempton Park
Racecourse
Classes &
workshops
90 Hairy Goat Photography Tours; Bread
Angels; East London Wine School
Food & drinkb120 Eating London Food Tours; Spice
Monkey Cookery School
Fun & games 232 ClueQuest– The Live Escape Game;
HintHunt; Secret Studio
Museums 280 Victoria and Albert Museum; National
Gallery
Nature & parks 129 St James’s Park; Thames River;
Nightlife 1231 City of London Distillery; Comedy
Store London; The Cavern Freehouse
Outdoor activities 139 London Duck Tours; Moo Canoes Ltd.;
Fishing London Coaching and Guide
Service
Shopping 571 Covent Garden; Harrods
Sites & landmarks 519 Houses of Parliament; Big Ben
Spas & wellness 210 Pure Massage; The Body Retreat
Theatre & concerts 292 Les Miserables; Brick Lane Music Hall
Tours & activities 521 Alternative London Tours; BrakeAway
Bike Tours; Shoreditch Street Art Tours
Transportation 67 London Tube; King’s Cross Station
Traveller resources 30 Barbican Centre; City of London
Information Centre
Zoos & aquariums 6 London Zoo
a Attr actions often a ppear in more th an one category; s o, the total adds to more t han the numbe r
of attrac tions in the data set.
b The Food and Drink category does not include restaurants, but does include food and drink
available in other attrac tions, such as a museum café, cookery school, or food- relatedtour.
Source:Data on categories and number of subtopics is from the live data on TripAdvisor. The
number of at tractions per c ategory and examples are dr awn from TripAdvisor (2015).
Data and methods
There are many technical issues to resolve in order to study web pages.
We found all the London attraction pages on TripAdvisor had the form
of ‘Attraction_Review-.*-London_England.html’ where ‘.*’ indicates any
(or no) characters. We used the sitemap files published by tripadvisor.co.uk
that list all web pages on the site to create a complete list of all
the attractions in London available on TripAdvisor for the current, live
site and wrote a custom web crawler in Python3 to fetch the HTML of
all the pages. Each attraction page had up to ten user reviews on it. For
attractions with more than ten reviews, we downloaded all the additional
pages of reviews.
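As a hedged illustration of the URL pattern described above, the following Python sketch filters a list of URLs for London attraction pages. It is not the authors' crawler: the function name `london_attraction_urls` is hypothetical, the sample URLs are invented placeholders, and escaping the dot before `html` and anchoring the pattern at the end of the URL are assumptions added here.

```python
import re

# The chapter's pattern, where '.*' matches any (or no) characters.
# Assumption: the dot before 'html' is escaped and the match is anchored at the end.
ATTRACTION_RE = re.compile(r"Attraction_Review-.*-London_England\.html$")


def london_attraction_urls(urls):
    """Keep only the URLs that look like London attraction pages."""
    return [u for u in urls if ATTRACTION_RE.search(u)]


# Invented placeholder URLs for illustration only.
sample_urls = [
    "https://www.tripadvisor.co.uk/Attraction_Review-g000-d111-Reviews-Some_Museum-London_England.html",
    "https://www.tripadvisor.co.uk/Hotel_Review-g000-d222-Reviews-Some_Hotel-London_England.html",
    "https://www.tripadvisor.co.uk/Attraction_Review-g000-d333-Reviews-Some_Castle-Edinburgh_Scotland.html",
]

matches = london_attraction_urls(sample_urls)
```

Only the first placeholder URL passes the filter: the second is not an attraction page and the third is not in London. The same filter could in principle be applied both to a list of URLs taken from sitemap files and to the URLs stored in an archive.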
We crafted regular expressions to extract the elements of the
attractions and user reviews in which we were interested. For attrac-
tions, we extracted the following elements:
• the name of the attraction;
• the number of reviews for the attraction;
• the average star rating of the attraction;
• the category of the attraction as determined by TripAdvisor/its users;
• the ranking of the attraction amongst other attractions in London;
• the total number of 5-star, 4-star, 3-star, 2-star and 1-star reviews.
We also extracted the date that each review was added to each attrac-
tion. We performed all data collection in July 2015. Our final live dataset
therefore contains all London attractions listed on TripAdvisor at that
time and all available reviews of these attractions.
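The extraction step can be illustrated with a small sketch that uses regular expressions. The HTML snippet, class names and attribute names below are invented for illustration and do not reproduce TripAdvisor's actual markup, which is not given in this chapter; the function `parse_attraction` is likewise hypothetical.

```python
import re

# Invented markup for illustration; real pages differ and change over time.
SAMPLE_HTML = """
<h1 class="heading">Example Museum</h1>
<span class="review-count">1,234 reviews</span>
<span class="rating" data-stars="4.5"></span>
"""

NAME_RE = re.compile(r'<h1 class="heading">(.*?)</h1>')
COUNT_RE = re.compile(r'<span class="review-count">([\d,]+) reviews</span>')
STARS_RE = re.compile(r'data-stars="([\d.]+)"')


def parse_attraction(html):
    """Extract name, review count and average star rating; None if a field is missing."""
    name = NAME_RE.search(html)
    count = COUNT_RE.search(html)
    stars = STARS_RE.search(html)
    return {
        "name": name.group(1) if name else None,
        "reviews": int(count.group(1).replace(",", "")) if count else None,
        "stars": float(stars.group(1)) if stars else None,
    }


record = parse_attraction(SAMPLE_HTML)
```

Returning `None` for a missing field, rather than raising an error, is one possible design choice when pages vary in layout; it makes gaps in the extracted data easy to count afterwards.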
TripAdvisor, like many websites, does not include all content in
the HTML of each web page, but loads some content separately using
JavaScript. For TripAdvisor, the text of all user reviews is truncated in
the HTML page and foreign-language reviews are not included at all. As
the website still exists, we were able to emulate the JavaScript requests
needed to collect the full text of reviews as well as foreign-language
reviews for the live site but not for the archived data. Even so, within
the live dataset, we were unable to collect 123 foreign-language reviews
and hence our dataset contains 516,641 (99.98%) of the 516,764 reviews
available in July 2015.
The Internet Archive is the oldest and biggest web archive, founded
in 1996. A non-profit organization headquartered in San Francisco, it
was created to preserve a historical copy of the World Wide Web. The
UK Joint Information Systems Committee (JISC, now ‘Jisc’, a third-sector,
charitable body) commissioned the Internet Archive to extract
all stored web pages within the .uk domain from its archives. These
data were stored in a new data centre at the British Library and form
the JISC UK Domain Dataset (UK Web Archive Open Data, n.d.). These
Internet Archive data are the data we use within this chapter, and we note
that these data form the broadest dataset of UK domains available for
the time period we study (1996–2013).3 In partnership with the British
Library, we extracted all TripAdvisor web pages stored in the archive
with URLs matching ‘Attraction_Review-.*-London_England.html’. The
data include the HTML of the web pages as well as information about
when the pages were added to the archive. We refer to these data simply
as the archived data.
Results
Data overview
The earliest review in the live dataset was written on 26 August 2001,
and the number of reviews on the site has been growing exponentially
since that time (Figure 2.1, note that the vertical axis is a logarithmic
scale).
TripAdvisor does not indicate when an attraction was first added
to the website; we therefore take the date of the earliest review as a
proxy for this measure. Measuring growth in this way, we found that the
number of attractions on the website has also been growing each year
(Figure2.2, again note the logarithmic scale on the verticalaxis).
The archived data contain 1,169 TripAdvisor web pages containing
340 unique attractions. The web pages of most attractions (57%)
were only archived once, but some attractions were archived multiple
times. The median number of copies was 1, the mean 3.4, and the maximum
31 (the most-archived attraction was ‘Alternative London Tours’).
The most recent data in the archived dataset are from 1 May 2013.
Using the live dataset and the date of the first review for each attrac-
tion as a proxy for when that attraction was added to TripAdvisor, we
estimate there were at least 1,406 attractions listed on the TripAdvisor
website at that time. Thus, the 340 attractions covered in the archived
dataset represent at most 24% of all the attractions available on the site
at that time. This is the first indication of what proportion of the website
is contained within the archived dataset. The top panel of Figure 2.3
shows the number of new attractions added to the archived dataset
each month based on the date that the web page was crawled. The bottom
panel of Figure 2.3 shows the number of new attractions added to
the live website each month based on the date of the earliest review.
Figure 2.4 shows the estimated proportion of attractions in the archived
data compared to the live dataset.
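The coverage estimate above can be reproduced with simple arithmetic. The sketch below is illustrative only: the function names and the toy dictionary of earliest review dates are hypothetical, while the figures 340, 1,406 and 1,169 are taken from the text.

```python
from datetime import date


def attractions_listed_by(earliest_review_dates, cutoff):
    """Count attractions whose earliest review falls on or before the cutoff date."""
    return sum(1 for d in earliest_review_dates.values() if d <= cutoff)


def coverage_upper_bound(n_archived, n_live_at_cutoff):
    """Share of live attractions present in the archive (an upper bound)."""
    return n_archived / n_live_at_cutoff


# Toy data (hypothetical): attraction name -> date of earliest review.
toy_first_reviews = {
    "attraction_a": date(2005, 3, 1),
    "attraction_b": date(2012, 11, 20),
    "attraction_c": date(2014, 6, 5),
}
cutoff = date(2013, 5, 1)  # date of the most recent archived data
toy_count = attractions_listed_by(toy_first_reviews, cutoff)

# Figures reported in the chapter.
share = coverage_upper_bound(340, 1406)
mean_copies = 1169 / 340  # archived pages per archived attraction
```

With the chapter's figures, `share` is roughly 0.24 and `mean_copies` roughly 3.4. The share is an upper bound because the denominator is itself described as a minimum count of attractions on the site at that date.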
The actual percentage of attractions stored in the archived dataset
is probably lower as the live dataset does not include attractions
that were on TripAdvisor but later removed. This appears to apply to 37
attractions in the archived dataset that do not appear in the live dataset.
This means that there are actually 303 attractions in both the archived
data and the live data. In addition, our numbers do not include the 734
attractions in the live data (8 of these are in the archived data) with no
reviews and hence no proxy for when they were added.
Figure 2.1 Cumulative number of reviews in the live dataset (horizontal
axis: date, 2002–2016; vertical axis: cumulative number of reviews,
logarithmic scale)
Figure 2.2 Cumulative number of attractions in the live dataset by
first appearance. The date of the earliest review is used as the date the
attraction first appeared on the site (horizontal axis: date, 2002–2016;
vertical axis: cumulative number of attractions, logarithmic scale)
Figure 2.3 The number of new London attractions added each month
to the TripAdvisor website based on the archived data and live data. For
the archived data the date of a new attraction is the date that the webpage
of the attraction was first crawled, while for the live data the date
of a new attraction is the date of the oldest review for that attraction
(panels: Archived, Live; vertical axis: number of new attractions)
Figure 2.4 The proportion of attractions stored in the archived
dataset increased irregularly to around 24% of all attractions on the
TripAdvisor website from 2007 to 2013 even as the overall number of
attractions on TripAdvisor continued to grow (horizontal axis: date,
2008–2013; vertical axis: percentage of attractions crawled)
Comparing the two datasets
We proceed by comparing the 303 attractions in both the archived dataset
and the live site with the 1,409 attractions known to be on the live
site at the last date of a new page being added to the archived data. We
find that the attractions in the archived dataset differ significantly and
are not representative of those on the live site.
Attractions within the archived dataset have a considerably different
distribution of reviews per attraction than attractions in the live
dataset. We demonstrate these differences using two statistical techniques.4
Figure 2.5 shows the distribution of the number of reviews per
attraction using a kernel density (note that the horizontal axis uses a
logarithmic scale). Since the live data represent the actual population, we
use a one-sample t-test, which shows that the mean number of reviews
per attraction in the archived data differs significantly from the population
mean (t = 5.7, p < 0.001, N = 303). The distribution of the archived
data is skewed to the right; it contains attractions with 928 more reviews
on average, probably an indication that the archived data have a bias
towards more visible and prominent web pages. Figure 2.6 (also a kernel
density, but with linear scales) shows that attractions in the archived
dataset have higher average star ratings compared to attractions in the
live dataset: an indication that the archived data tend to be biased toward
more popular attractions. This difference is confirmed by a one-sample
t-test (t = 3.2, p = 0.002, N = 303). Finally, Figure 2.7 (also a kernel
density with linear scales) shows that attractions in the archived dataset
tend to have a similar distribution of ranks. A one-sample t-test shows
that the mean rank of attractions in the archived data does not differ
significantly from the mean of the population, the live data (t = –1.2, p =
0.22, N = 303). The fact that one of the three measures of bias does not
show a statistically significant difference is noteworthy; however, rankings
are probably the least useful indicator because TripAdvisor reports
attraction rankings within a number of different subcategories and the
particular ranking criteria are not public.
Figure 2.5 Distribution of reviews per attraction in the live dataset
and the archived data. Vertical lines are means. Note that the horizontal
axis uses a logarithmic scale (horizontal axis: number of reviews;
vertical axis: proportion)
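The one-sample t-tests reported above compare the mean of the archived subset with a known population mean. The following sketch shows how such a t statistic is computed using only the Python standard library. The data are toy values, not the chapter's data, and the function name `one_sample_t` is hypothetical.

```python
import math
import statistics


def one_sample_t(sample, population_mean):
    """Return the t statistic and degrees of freedom for a one-sample t-test."""
    n = len(sample)
    mean = statistics.fmean(sample)
    sd = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
    t = (mean - population_mean) / (sd / math.sqrt(n))
    return t, n - 1


# Toy values (hypothetical): reviews per attraction in an archived subset,
# compared with the mean of the full population.
toy_archived_reviews = [120, 340, 95, 410, 260, 180]
toy_population_mean = 150.0

t_stat, dof = one_sample_t(toy_archived_reviews, toy_population_mean)
```

A positive t statistic indicates that the subset mean lies above the population mean. Obtaining a p-value additionally requires the t distribution; SciPy's `scipy.stats.ttest_1samp(sample, popmean)` returns both the statistic and the p-value, assuming SciPy is available.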
Finally, in Table 2.2 we examine the percentage of attractions
in each dataset in each of the 18 top- level categories on the cur-
rent TripAdvisor website. Museums are most overrepresented in the
archived dataset, 9percentage points higher than in the live data. The
archived data also include an excessive number of Tours and Activities
(6.6percentage points higher). Nightlife is the most underrepresented,
6.9percentage points less in the archived data compared to the live
data. If a researcher were interested in using the archived data as a
proxy for attractions, these deviations could certainly bias results.
Figure 2.6 Distribution of star ratings in live dataset and the archived
data. Vertical lines are means (horizontal axis: average star rating;
vertical axis: proportion)
Figure 2.7 Distribution of attraction rankings in the live dataset and
the archived data. Vertical lines are means (horizontal axis: attraction
ranking; vertical axis: proportion)
Table2.2 Percentages in each attraction category in the live data and
archiveddata
Category Live data Archived data Dierence
Amusement parks 0.1 0.4 0.3
Boat tours & water sports 1.5 2.3 0.8
Casinos & gambling 0.5 0.8 0.3
Classes & workshops 1.9 1.9 0.0
Food & drink 1.4 1.2 – 0.3
Fun & games 5.8 5.0 – 0.8
Museums 11.8 20.8 9.0
Nature & parks 5.6 5.8 0.2
Nightlife 18.1 11.2 6.9
Outdoor activities 3.6 5.8 2.1
Shopping 15.3 12.3 – 3.0
Sights & landmarks 22.0 24.2 2.2
Spas & wellness 4.0 0.8 – 3.2
Theatre & concerts 11.2 12.7 1.5
Tours & activities 15.7 22.3 6.6
Transportation 0.7 1.9 1.2
Traveller resources 1.3 1.2 0.1
Zoos & aquariums 0.3 1.2 0.9
Note:The percentages in the live data and the archived data add to more than 100% because
some attr actions are categorized i n more than one categor y.
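The category comparison in Table 2.2 amounts to subtracting the live percentage from the archived percentage for each category. A minimal sketch, using a handful of rows copied from the table (the function name `representation_gaps` is hypothetical):

```python
# (live %, archived %) for a few categories, copied from Table 2.2.
TABLE_2_2 = {
    "Museums": (11.8, 20.8),
    "Nightlife": (18.1, 11.2),
    "Tours & activities": (15.7, 22.3),
    "Spas & wellness": (4.0, 0.8),
    "Shopping": (15.3, 12.3),
}


def representation_gaps(table):
    """Return (category, archived - live) pairs, most overrepresented first."""
    gaps = {cat: round(arch - live, 1) for cat, (live, arch) in table.items()}
    return sorted(gaps.items(), key=lambda kv: kv[1], reverse=True)


gaps = representation_gaps(TABLE_2_2)
```

Positive values mark categories that are overrepresented in the archived data and negative values mark underrepresented ones, so the first and last entries of `gaps` correspond to Museums and Nightlife respectively.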
Discussion
Much has been promised for the use of web archives, and there have
been a number of studies. For example, Chu et al. (2007) tracked the
longitudinal development of site content on e-commerce websites.
Mike Thelwall with various colleagues (Thelwall and Wilkinson,
2003; Vaughan and Thelwall, 2003; Payne and Thelwall, 2007) used
web data to demonstrate the interdependence of academic institutions
on the web. Hackett and Parmanto (2005) used the Internet Archive’s
Wayback Machine to analyse how technological advances were manifest
in changes in website design over time. Hale et al. (2014) studied
the evolution of the presence of British universities on the web using the
same .uk web archive dataset that we used here.
The work with web archives has not been as extensive as the
original founders anticipated, because, at least in part, there remain
major challenges to using web archives. Scholars using the biggest
archive, the Internet Archive, are mining data from a 9-petabyte dataset
as of August 2014 (Internet Archive, 2014). Few tools exist to help
scholars find information in this enormous amount of data.
Furthermore, web pages are not well-structured or consistently
structured, and they can be extremely difficult to transform into a format
that can be used for large-scale quantitative research. In addition,
changes in web page format and changes in content often occur simultaneously.
This complicates longitudinal research because just getting
the data into a consistent format may be difficult and slow. It may not
be something that many scholars will want to invest in, given pressures
to publish.
Once the data have been put into a consistent format, what,
exactly, do researchers have? This is the question we have addressed.
First, researchers using web archive data have a subset of the full web.
Using Ainsworth et al.’s (2013) estimates of web pages, they might have
between 35% and 90% of the web. By constructing their sample of
URLs from DMOZ, Delicious, Bitly, and Google, Ainsworth et al. (2013)
almost certainly examined the inclusion of more popular and prominent
URLs (i.e. the URLs included in DMOZ or added to Delicious are by definition
more popular and prominent than the URLs that no one adds to
these platforms). We have avoided this bias by comparing archived data
to the entire population of London attraction web pages on TripAdvisor.
Although TripAdvisor is a prominent website, we still found that only
24% of the web pages about London attractions were archived.
This suggests that previous results are dramatic overestimates of
the amount of the web that has been stored in archives. Our findings
also complement the results from previous studies that have examined
the percentage of websites included in web archives (e.g. Thelwall and
Vaughan, 2004). Whereas these studies looked at the inclusion of at
least one page of a website in the archive, we looked deeper into the site
itself at whether web pages within the site are stored. Even though the
TripAdvisor site itself is included in our archived data, at most 24%
of the pages about London attractions have been stored. This may also
suggest that there are enormous variations in the archival coverage, and
the simple presence of one web page from a website in the archive does
not provide an indication of how much of that website is actually within
the archive.
We also found that the archived pages do not resemble a random
probability sample. There is a clear bias toward prominent, well-known
and highly-rated web pages. Smaller, less well-known and lower-rated
web pages are less likely to be archived. It is worth noting that all the
archived data we used came from the Internet Archive; so, the archived
data are probably the best, most complete source possible for this time
period, but they are clearly not complete, and they contain significant biases.
In 2014, the British Library began conducting its own crawls of UK websites,
but the representativeness and completeness of these data are yet
to be determined.
What are the implications of these results for research using web
archives? Much of the appeal of the Internet is that it seems to provide
broader data than conventional sources. Advocates talk about it being
unrestricted in scale or geographic scope. One reason web archives
were seen as valuable was that they promised to provide full historical
data on things such as diffusion of innovations, community formation,
emergence of issues and the formation and dynamics of networks
(Arms et al., 2006). The Internet is certainly broader than most
conventional data sources, but the web archive we examined is broad
only in a particular way: it focuses on the big and the prominent. Due to the
limits on the number of pages found and crawled from any one website,
web archives are necessarily incomplete even when they start with a
seed list of all domain names (as is now the case for the British Library
crawls of the .uk country-code top-level domain). In some instances
the limit on the number of pages for each website is relatively high,
as is the case for the national web archive in Denmark (see Brügger,
2017), but it remains difficult to assess what content is not archived
(as archiving strategies change over time and technical issues in capturing
dynamic/JavaScript content arise). Therefore, a web archive-based
study of diffusion of innovation on the Internet would actually be a study
of diffusion among prominent, highly-rated web pages, not among all
web pages. A study of network formation or network dynamics would
be a study of networks of well-known, highly-rated web pages. It would
not be a study of diffusion among all web pages. Hale et al.’s (2014)
study of British university websites, for instance, is a study biased
toward hyperlinks on more prominent webpages.
The incomplete nature of web archives limits the type of analyses
available to researchers. We were only able to conduct our analysis, for
instance, at the level of attractions in London and not about the content
of reviews: the archived data are so incomplete with reference to review
text that it did not make sense to even attempt such a comparison. These
problems are only getting worse as content moves off the web to other
channels (e.g. mobile apps), personalization means there is no definitive
version, and dynamic sites use JavaScript or other technologies to fetch
content separately from the HTML pages.
The promise raised by Arms et al. (2006) was that web archives
would eliminate the need to proactively collect data for longitudinal
studies of networks, innovations, community formation, etc., and
instead allow for fine-grained, retrospective analyses over longer
periods of time. Web archive data can certainly provide insights that would
otherwise be unavailable (e.g. we were able to find attractions in the
archive that had been deleted from TripAdvisor and were unavailable on
the live site). With suitable modelling, networks of hyperlinks from web
archive data may be compared to null model controls. However, our
study highlights that web archive data do not replace the need to
collect specific data proactively over set periods of time for many types of
longitudinal analysis. The level of incompleteness of web archive data
also raises questions about the extent to which archived web data can
be used to conduct longitudinal research at all. An approach that would
yield much higher quality data is the same one we might have used for
pre-Internet longitudinal data. That is, collect repeated cross-sectional
datasets proactively in real time and then do retrospective, time-series
analyses of the data only at the end of the study period. The irony is
striking, but the point is that web archives do not provide a free lunch for
good research.
These are serious problems. Web archives are an extensive and
permanent record, but they are also an incomplete and biased record.
While it is certainly possible to analyse larger numbers of many things,
are large, biased numbers a good idea? The answer is that a biased set
of data remains biased no matter how many cases it contains, and biased
datasets provide biased answers regardless of their size. So researchers
have to confront the bias problem. Web archives do not contain a com-
plete population, except perhaps in certain limited areas, and what is
missing from the archives is often unknown.
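The claim that a biased dataset stays biased however large it grows can be illustrated with a small simulation. This is a sketch under stated assumptions rather than a model of any real archive: ratings are drawn uniformly from 1 to 5, and the chance that a page is archived is assumed to be proportional to its rating. The function `simulate` and both assumptions are hypothetical.

```python
import random


def simulate(n_pages, seed=0):
    """Return (true mean rating, mean rating of the archived subset)."""
    rng = random.Random(seed)
    ratings = [rng.randint(1, 5) for _ in range(n_pages)]
    # Assumption: a page with rating r is archived with probability r / 5,
    # so highly-rated pages are over-represented in the archive.
    archived = [r for r in ratings if rng.random() < r / 5]
    true_mean = sum(ratings) / len(ratings)
    archived_mean = sum(archived) / len(archived)
    return true_mean, archived_mean


for n in (1_000, 10_000, 100_000):
    true_mean, archived_mean = simulate(n)
    print(n, round(true_mean, 2), round(archived_mean, 2))
```

Under these assumptions the expected mean rating is 3 for the full population and 11/3 (about 3.67) for the archived subset. Adding pages narrows the random noise around each value, but the gap between them does not close.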