Losing My Revolution
How Many Resources Shared on Social Media
Have Been Lost?
Hany M. SalahEldeen and Michael L. Nelson
Old Dominion University, Department of Computer Science
Norfolk VA, 23529, USA
{hany,mln}@cs.odu.edu
Abstract. Social media content has grown exponentially in recent years
and the role of social media has evolved from just narrating life
events to actually shaping them. In this paper we explore how many
resources shared in social media are still available on the live web or
in public web archives. By analyzing six different event-centric datasets
of resources shared in social media in the period from June 2009 to
March 2012, we found about 11% lost and 20% archived after just a
year and an average of 27% lost and 41% archived after two and a half
years. Furthermore, we found a nearly linear relationship between time
of sharing of the resource and the percentage lost, with a slightly less
linear relationship between time of sharing and archiving coverage of
the resource. From this model we conclude that after the first year of
publishing, nearly 11% of shared resources will be lost and after that we
will continue to lose 0.02% per day.
Keywords: Web Archiving, Social Media, Digital Preservation
1 Introduction
With more than 845 million Facebook users at the end of 2011 [5] and over 140
million tweets sent daily in 2011 [16], users can take photos and videos, post their
opinions, and report incidents as they happen. Many of the posts and tweets concern
quotidian events, and their preservation is debatable. However, some concern
culturally important events whose preservation is less controversial. In this paper
we shed light on the importance of archiving social media content about these
events and estimate how much of this content is archived, still available, or lost
with no possibility of recovery.
To emphasize the culturally important commentary and sharing, we col-
lected data about six events in the time period of June 2009 to March 2012:
the H1N1 virus outbreak, Michael Jackson’s death, the Iranian elections and
protests, Barack Obama’s Nobel Peace Prize, the Egyptian revolution, and the
Syrian uprising.
arXiv:1209.3026v1 [cs.DL] 13 Sep 2012
2 Related Work
To our knowledge, no prior study has analyzed the amount of shared resources
in social media lost through time. There have been many studies analyzing the
behavior of users within a social network, how they interact, and what content
they share [3, 19, 20, 23]. As for Twitter, Kwak et al. [6] studied its nature and
its topological characteristics and found a deviation from known characteristics
of human social networks that were analyzed by Newman and Park [10]. Lee
analyzed the reasons behind sharing news in social media and found that infor-
mativeness was the strongest motivation in predicting news sharing intention,
followed by socializing and status seeking [4]. Shared content in social media
such as Twitter also moves and diffuses relatively quickly, as shown by Yang et al. [22].
Furthermore, many concerns have been raised about the persistence of shared
resources and web content in general. Nelson and Allen studied the persistence
of objects in a digital library and found that, after just over a year, 3% of
their sample was no longer available [9]. Sanderson et al.
analyzed the persistence and availability of web resources referenced from papers
in scholarly repositories using Memento and found that 28% of these resources
have been lost [14]. Memento [17] is a collection of HTTP extensions that enables
uniform, inter-archive access. Ainsworth et al. [1] examined how much of the
web is archived and found it ranges from 16% to 79%, depending on the starting
seed URIs. McCown et al. examined the factors affecting reconstructing websites
(using caches and archives) and found that PageRank, Age, and the number of
hops from the top-level of the site were most influential [8].
3 Data Gathering
We compiled a list of URIs that were shared in social media and correspond to
specific culturally important events. In this section we describe the data acqui-
sition and sampling process we performed to extract six different datasets which
will be tested and analyzed in the following sections.
3.1 Stanford SNAP Project Dataset
The Stanford Large Network Dataset is a collection of about 50 large network
datasets having millions of nodes, edges and tuples. It was collected as a part
of the Stanford Network Analysis Platform (SNAP) project [15]. It includes
social networks, web graphs, road networks, Internet networks, citation networks,
collaboration networks, and communication networks. For the purpose of our
investigation, we selected their Twitter posts dataset. This dataset was collected
from June 1st, 2009 to December 31st, 2009 and contains nearly 476 million
tweets posted by nearly 17 million users. The dataset is estimated to cover 20%-
30% of all posts published on Twitter during that time frame [21]. To select which
events will be covered in this study, we examined CNN’s 2009 events timeline1.
We wanted to select a small number of events that were diverse, with limited
overlap, and relatively important to a large number of people. Given that, we
selected four events: the H1N1 virus outbreak, the Iranian protests and elections,
Michael Jackson’s death, and Barack Obama’s Nobel Peace Prize award.
Preparation: A tweet is typically composed of text, hashtags, embedded resources
or URIs, and usertags, all spanning a maximum of 140 characters. Here
is an example of a tweet record in the SNAP dataset:

T 2009-07-31 23:57:18
U http://Twitter.com/nickgotch
W RT @rockingjude: December 21, 2009 Depopulation by Food Will Begin http://is.gd/1WMZb WHOA..BETTER WATCH RT plz #pwa #tcot

The line starting with the letter T indicates the date and time of the tweet’s
creation, while the line starting with U holds a link to the user who authored
this particular tweet. Finally, the line starting with W holds the entire tweet,
including all the user references (“@rockingjude”), the embedded URIs
(“http://is.gd/1WMZb”), and hashtags (“#pwa #tcot”).
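The T/U/W record format above can be parsed with a few lines of code. The following sketch (the function and field names are ours, not part of the SNAP tooling) reads one record into its components and pulls out the hashtags, user references, and embedded URIs used in later steps:

```python
import re

def parse_snap_record(record: str) -> dict:
    """Parse one SNAP T/U/W tweet record.

    T carries the creation timestamp, U a link to the authoring user,
    and W the tweet text, from which hashtags, user references, and
    embedded URIs are extracted.
    """
    fields = {}
    for line in record.strip().splitlines():
        if line and line[0] in "TUW":
            fields[line[0]] = line[1:].strip()
    text = fields.get("W", "")
    return {
        "created_at": fields.get("T", ""),
        "user": fields.get("U", ""),
        "text": text,
        "hashtags": re.findall(r"#(\w+)", text),   # e.g. pwa, tcot
        "mentions": re.findall(r"@(\w+)", text),   # e.g. rockingjude
        "uris": re.findall(r"https?://\S+", text), # e.g. http://is.gd/1WMZb
    }
```

Applied to the example record above, this yields the hashtags `pwa` and `tcot` and the single embedded URI `http://is.gd/1WMZb`.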
Tag Expansion: We wanted to select tweets that we can say with high confi-
dence are about a selected event. In this case, precision is more important than
recall as collecting every single tweet published about a certain event is less
important than making sure that the selected tweets are definitely about that
event. Several studies focused on estimating the aboutness of a certain web page
or a resource in general [12, 18]. Fortunately, in Twitter, hashtags incorporated
within a tweet can help us estimate its “aboutness”. Users normally add certain
hashtags to their tweets to ease search and discoverability for those following
a certain topic. These hashtags will be utilized in the event-centric filtration
process.
For each event, we selected initial tags that describe it (Table 1). Those initial
tags were derived empirically after examining some event-related tweets. Next
we extracted all the hashtags that co-occurred with our initial set of hashtags.
For example, in class H1N1 we extracted all the other hashtags that appeared
along with #h1n1 within the same tweet and kept count of their frequency.
Those extracted hashtags were sorted in descending order of the frequency of
their appearance in tweets. We removed all the general-scope tags like #cnn,
#health, #death, #war, and others. With regard to aboutness, removing general
tags will indeed decrease recall but will increase precision. Finally, we picked the
top 8-10 hashtags to represent this event-class and be utilized in the filtration
process. Table 1 shows the final set of tags selected for each class.
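The tag-expansion step described above amounts to a co-occurrence count over the hashtag sets of the tweets. A minimal sketch (the function name and toy data are ours):

```python
from collections import Counter

def cooccurring_tags(tweet_tag_sets, seed_tag):
    """Count how often every other hashtag appears in the same tweet
    as the seed tag, most frequent first."""
    counts = Counter()
    for tags in tweet_tag_sets:
        if seed_tag in tags:           # only tweets containing the seed
            for tag in tags:
                if tag != seed_tag:
                    counts[tag] += 1
    return counts.most_common()        # [(tag, frequency), ...] descending
```

For the H1N1 class, for example, `cooccurring_tags(tweets, "h1n1")` would return tags such as `swine` and `swineflu` ranked by how often they appeared alongside `#h1n1`; the general-scope tags are then pruned from this ranking by hand.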
Tweet Filtration: In the previous step we extracted the tags that will help us
classify and filter tweets in the dataset according to each event. This filtration
1http://www.cnn.com/2009/US/12/16/year.timeline/index.html
Event: H1N1 Outbreak
  Initial hashtag: ‘h1n1’=61,351
  Top co-occurring: ‘swine’=61,829, ‘swineflu’=56,419, ‘flu’=8,436, ‘pandemic’=6,839, ‘influenza’=1,725, ‘grippe’=1,559, ‘tamiflu’=331

Event: M. Jackson’s Death
  Initial hashtag: ‘michaeljackson’=22,934
  Top co-occurring: ‘michael’=27,075, ‘mj’=18,584, ‘thisisit’=8,770, ‘rip’=3,559, ‘jacko’=3,325, ‘kingofpop’=2,888, ‘jackson’=2,559, ‘thriller’=1,357, ‘thankyoumichael’=1,050

Event: Iranian Elections
  Initial hashtag: ‘iranelection’=911,808
  Top co-occurring: ‘iran’=949,641, ‘gr88’=197,113, ‘neda’=191,067, ‘tehran’=109,006, ‘mousavi’=16,587, ‘freeiran’=13,378, ‘united4iran’=9,198, ‘iranrevolution’=7,295

Event: Obama’s Nobel Prize
  Initial hashtags: ‘obama’=48,161 & ‘peace‘=3,721
  Top co-occurring: ‘nobel’=2,261, ‘barack’=1,292, ‘nobelpeace’=113, ‘nobelpeaceprize’=107, ‘obamanobel’=14, ‘nobelprize’

Table 1. Twitter hashtags generated for filtering and their frequency of occurrence
process aims to extract a reasonably sized dataset of tweets for each event and to
minimize the inter-event overlap. Since the life and persistence of the tweet itself
is not the focus of this study but rather the associated resource that appears
in the tweet (image, video, shortened URI or other embedded resource), we will
extract only the tweets that contain an embedded resource. This step resulted in
181 million tweets with embedded resources (http://is.gd/1WMZb in the prior
example). These tweets were further filtered to keep only the tweets that have
at least one of the expanded tags obtained from Table 1. The number of tweets
after this phase reached 1.1 million tweets.
Filtering the tweets based on the occurrence of only one of the hashtags
is undesirable, as it causes two problems. First, it introduces possible
event overlap, due to general tweets discussing two or more topics. Second,
a single tag occurrence yields a huge number of tweets, which we need to
reduce to a more manageable size. Intuitively, strongly related hashtags
co-occur often. For example, a tweet that has #h1n1 along with #swineflu
and #pandemic is most likely about the H1N1 outbreak, unlike a tweet having
just the tag #flu or just #sick. Filtering on this co-occurrence solves both
problems: by increasing relevance to a particular event, general tweets that
discuss several events are filtered out, diminishing the overlap and reducing
the size of the dataset.
Next, we increase the precision of the tweets associated with each event from
the set of 1.1 million tweets. In the first iteration we selected the tag that had the
highest frequency of co-occurrence in the dataset with the initial tag and added
it to a set we will call the selection set. After that we check the co-occurrence
of all the remaining extracted tags with the tag in the selection set and record
the frequencies of co-occurrence. After sorting these frequencies, we pick the
tag with the highest one and add it to the selection set. We repeat this step
of counting co-occurrences, but now against all the hashtags accumulated in
the selection set from previous iterations.
To elaborate, for H1N1 assume that the hashtag ‘#h1n1’ had the highest
frequency of appearance in the dataset, so we add it to the selection set. In the
next iteration we record how many times each tag in the list appeared along
with ‘#h1n1’ in the same tweet. If we select ‘#swine’ as the one with the highest
frequency of co-occurrence with the initial tag ‘#h1n1’, we add it to the selection
set, and in the next iteration we record the frequency of co-occurrence of the
remaining hashtags with both of the extracted tags ‘#h1n1’ and ‘#swine’. We
repeat this step, for each event, until we reach a dataset of manageable size
whose ‘aboutness’ in relation to the event we are confident in.
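The iterative narrowing described above can be sketched as a greedy loop: start from the seed tag, repeatedly add the candidate tag that most often co-occurs with the tags selected so far, and keep only the tweets containing every selected tag. The function name and the stopping threshold are ours:

```python
from collections import Counter

def greedy_filter(tweets, seed, candidates, max_size):
    """Greedily grow a selection set of hashtags until the tweets
    containing all of them number at most max_size.

    tweets: list of hashtag sets, one per tweet.
    Returns (selection set, matching tweets).
    """
    selection = {seed}
    matching = [t for t in tweets if selection <= t]
    remaining = set(candidates) - selection
    while len(matching) > max_size and remaining:
        counts = Counter()
        for tags in matching:            # co-occurrence within current matches
            for tag in tags & remaining:
                counts[tag] += 1
        if not counts:                   # no candidate co-occurs any more
            break
        best, _ = counts.most_common(1)[0]
        selection.add(best)
        remaining.discard(best)
        matching = [t for t in matching if best in t]
    return selection, matching
```

For the Iran event this corresponds to the chain iran, iranelection, gr88, neda, tehran of Table 2, with each added tag shrinking the matching set.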
Event   Hashtags selected for filtration              Tweets     Operation    Final Tweets
MJ      michael                                       27,075
        michael & michaeljackson                      22,934     Sample 10%   2,293
Iran    iran                                          949,641
        iran & iranelection                           911,808
        iran & iranelection & gr88                    189,757
        iran & iranelection & gr88 & neda             91,815
        iran & iranelection & gr88 & neda & tehran    34,294     Sample 10%   3,429
H1N1    h1n1                                          61,351
        h1n1 & swine                                  44,972
        h1n1 & swine & swineflu                       42,574
        h1n1 & swine & swineflu & pandemic            5,517      Take All     5,517
Obama   obama                                         48,161
        obama & nobel                                 1,118      Take All     1,118
Table 2. Tweet filtration iterations and final tweet collections
Two problems appeared with this approach in the Iran and Michael Jackson
datasets. In the Iran dataset the number of tweets was in the hundreds of
thousands, and even with 5 co-occurring tags it was still over 34K tweets. To
solve this we randomly sampled 10% of the resulting tweets, producing a smaller,
manageable dataset. The second problem arose with the Michael Jackson dataset:
upon using 5 tags to decrease it to a manageable size, we realized there were few
unique domains among the embedded resources. A closer look revealed that this
combination of tags was mostly borderline tweet spam (MJ ringtones). To solve
this we used only the two top tags, “#michael” and “#michaeljackson”, and then
randomly sampled 10% of the resulting tweets to reach the desired dataset size
(Table 2).
3.2 Egyptian Revolution Dataset
The one-year anniversary of this event was the original motivation for this
study [13]. In this case, we started with an event and then tried to get so-
cial media content describing it. Despite its ubiquity, gathering social media for
a past event is surprisingly hard. We picked the Egyptian revolution due to the
role of the social media in curating and driving the incidents that led to the
resignation of the president. Several initiatives were commenced to collect and
curate the social media content during the revolution, like R-shief.org2, which
specializes in social content analysis of issues in the Arab world using
aggregate data from Twitter and the Web. We are currently in the process of
obtaining the millions of records related to the Arab Spring of 2011. Meanwhile,
we decided to build our own dataset manually.
2http://www.r-shief.org/
There are several sites that curate resources about the Egyptian Revolution,
and we wanted to investigate as many of them as possible. At the same time,
we needed to diversify our sources and the types of digital artifacts embedded
in them. Tweets, videos, images, embedded links, entire web pages
and books were included in our investigation. For the sake of consistency, we
limited our analysis to resources created within the period from the 20th of
January 2011 to the 1st of March 2011. In the next subsections we explain each
of the resources we utilized in our data acquisition in detail.
Storify: Storify is a website that enables users to create stories by creating
collections of URIs (e.g., Tweets, images, videos, links) and arrange them tem-
porally. These entries are posted by reference to their host websites. Thus, adding
content to Storify does not necessarily mean it is archived: if a user adds a video
from YouTube and the publisher of that video later decides to remove it, the
user is left with a gap in their Storify entry. We gathered all the Storify entries
created between the 20th of January 2011 and the 1st of March 2011, resulting
in 219 unique resources.
IAmJan25: Entire websites were dedicated to serving as collection hubs of media
curating the revolution. Based on public contributions, those websites collect
different types of media, classify them, order them chronologically, and publish
them to the public. We picked one such website, IAmJan25.com, to analyze
and investigate. The administrators of the website received selected videos and
images of notable events and actions that happened during the revolution. Users
vouched for those images and videos as being of some importance and sent the
resources’ URIs to the website administrators. The website itself is divided into
two collections: a
video collection and an image collection. The video collection had 2,387 unique
URIs while the image collection had 3,525 unique URIs.
Tweets From Tahrir: Several books were published in 2011 documenting the
revolution and the Arab Spring. To bridge the gap between books and digital
media we analyzed a book entitled Tweets from Tahrir [11] which was pub-
lished on April 21st, 2011. As the name states, this book tells a story formed by
tweets of people during the revolution and the clashes with the past regime. We
analyzed this book as a collection of tweets that had the luxury of paperback
preservation and focused on the tweeted media, in this case images. The book
had a total of 1,118 tweets containing 23 unique images.
3.3 Syria Dataset
This dataset has been selected to represent a current (March 2012) event. Using
the Twitter search API, we followed the same pattern of data acquisition as
in section 3.1. We started with one hashtag, #Syria, and expanded it. Table 3
shows the tags produced by the tag expansion step. After that, each of those
tags was input into a process utilizing the Twitter streaming API that produced
the first 1000 results matching each tag. From this set, we randomly sampled
10%. As a result, 1,955 tweets were extracted, each having one or more embedded
resources and tags from the expanded tags in Table 3.
Initial Hashtags Extracted Hashtags
‘Syria’ ‘Bashar’ ‘RiseDamascus’ ‘GenocideInSyria’ ‘STOPASSAD2012’ ‘AssadCrimes’ ‘Assad’
Table 3. Twitter hashtags generated for filtering the Syrian uprising
Table 4 shows the resources collected along with the top level domains that
those resources belong to for each event.
Event Top Domains (number of resources found)
MJ youtube (110), twitpic (45), latimes (43), cnn (30), amazon (30)
Iran youtube (385), twitpic (36), blogspot (30), roozonline (29)
H1N1 rhizalabs (676), reuters (17), google (16), flutrackers (16), calgaryherald (11)
Obama blogspot (16), nytimes (15), wordpress (12), youtube (11), cnn (10)
Egypt youtube (2414), cloudfront (2303), yfrog (1255), twitpic (114), imageshack.us (20)
Syria youtube (130), twitter (61), hostpic.biz (9), telegraph.co.uk (5)
Table 4. The top-level domains found for each event, in descending order of the
number of resources.
4 Uniqueness and Existence
From the previous data gathering step we obtained six different datasets related
to six different historic events. For each event we extracted a list of URIs that
were shared in tweets or uploaded to sites like Storify or IAmJan25. To answer
the question of how much of the social media content is missing, we first process
the URIs of each dataset to eliminate URI aliases, in which several URIs identify
the same resource. Upon obtaining the unique URIs, we examine how many of
them are still available on the live web and how many are available in public
web archives.
4.1 Uniqueness
Some URIs, especially those that appear in Twitter, may be aliases for the
same resource. For example “http://bit.ly/2EEjBl” and “http://goo.gl/2ViC”
both resolve to “http://www.cnn.com”. To solve this, we resolved all the URIs,
following redirects to the final URI. The HTTP response of the last redirect has
a Location field that contains the original long URI of the resource. This step
reduced the total number of URIs in the six datasets from 21,625 to 11,051.
Table 5 shows the number of unique resources in every dataset.
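The alias elimination above amounts to following each URI's redirect chain to its final location and de-duplicating on that. In the sketch below, a plain mapping stands in for the live HTTP requests and their Location headers, and the function names are ours:

```python
def resolve(uri, redirects, max_hops=20):
    """Follow a chain of redirects (uri -> next uri) until reaching
    a URI that no longer redirects, guarding against redirect loops."""
    seen = set()
    while uri in redirects and uri not in seen and len(seen) < max_hops:
        seen.add(uri)
        uri = redirects[uri]
    return uri

def unique_resources(uris, redirects):
    """De-duplicate URIs by their final, post-redirect location."""
    return {resolve(u, redirects) for u in uris}
```

With this, the two shortened URIs of the example collapse to the single resource http://www.cnn.com, which is how 21,625 shared URIs reduce to 11,051 unique ones.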
4.2 Existence on the Live-Web
After obtaining the unique URIs from the previous step, we resolve all of them and
classify them as Success or Failure. The Success class includes all the resources
that ultimately return a “200 OK” HTTP response. The Failure class includes
all the resources that return a response in the “4XX” family, such as “404 Not
Found”, “403 Forbidden”, and “410 Gone”, responses in the “30X” redirect family
caught in infinite redirect loops, and server errors with a “50X” response. To
avoid transient errors we repeated the requests, on all datasets, several times
over the course of a week.
We also test for “soft 404s”, pages that return a “200 OK” response code but
are not a representation of the resource, using a technique based on a heuristic
for automatically discovering soft 404s from Bar-Yossef et al. [2]. We also count
no response from the server, as well as DNS timeouts, as failures. Note that
failure means that the resource is missing from the live web. Table 5 summarizes,
for each dataset, the total percentage of resources missing from the live web,
computed as the number of missing resources divided by the total number of
unique resources.

Event   All     Unique            Available &      Available &       Missing &      Missing &       Missing        Archived
                (% of All)        Archived         Not Archived      Archived       Not Archived    (total)        (total)
MJ      2,293   1,187 (51.77%)    316 (26.62%)     474 (39.93%)      90 (7.58%)     307 (25.86%)    397 (33.45%)   406 (34.20%)
Iran    3,429   1,340 (39.08%)    415 (30.97%)     586 (43.73%)      101 (7.54%)    238 (17.76%)    339 (25.30%)   516 (38.51%)
H1N1    5,517   1,645 (29.82%)    595 (36.17%)     656 (39.88%)      98 (5.96%)     296 (17.99%)    394 (23.95%)   693 (42.12%)
Obama   1,118   370 (33.09%)      143 (38.65%)     135 (36.49%)      33 (8.92%)     59 (15.95%)     92 (24.86%)    176 (47.57%)
Egypt   7,313   6,154 (84.15%)    1,069 (17.37%)   4,440 (72.15%)    173 (2.81%)    472 (7.67%)     645 (10.48%)   1,242 (20.18%)
Syria   1,955   355 (18.16%)      19 (5.35%)       311 (87.61%)      0 (0%)         25 (7.04%)      25 (7.04%)     19 (5.35%)
Table 5. Percentages of unique resources among all the extracted ones, per event,
and the presence of those unique resources on the live web and in archives.
Percentages after the Unique column are relative to each event’s unique resources.
All resources = 21,625, unique resources = 11,051
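The Success/Failure classification of this section reduces to a rule over the final HTTP status of each request. A minimal sketch (the function is ours; the content-based soft-404 heuristic of Bar-Yossef et al. [2] is a separate step not reproduced here):

```python
def classify(status):
    """Classify a resource by the final HTTP status of its request.

    status: integer status code of the last response in the redirect
    chain, or None for no response / DNS timeout.
    """
    if status is None:            # no response from server or DNS timeout
        return "Failure"
    if status == 200:             # "200 OK": resource is on the live web
        return "Success"
    if 300 <= status < 400:       # redirect that never resolved (e.g. loop)
        return "Failure"
    if 400 <= status < 600:       # 4XX client errors and 50X server errors
        return "Failure"
    return "Failure"              # anything else, conservatively a failure
```

As in the paper, the requests would be repeated several times over a week before a Failure verdict is accepted, to rule out transient errors.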
4.3 Existence in the Archives
In the previous step we tested the existence of the unique list of URIs for each
event on the live web. Next, we evaluate how many URIs have been archived
in public web archives. To check those archives we utilize the Memento frame-
work. If there is a memento for the URI, we download its memento timemap and
analyze it. The timemap is a datestamp-ordered list of all known archived versions
(called “mementos”) of a URI. Next, we parse this timemap and extract
the number of mementos that point to versions of the resource in the public
archives. We declare the resource to be archived if it has at least one memento.
This step was also repeated several times to avoid the transient states of the
archives before deeming a resource as unarchived. The results of this experiment
along with the archive coverage percentage are presented in Table 5.
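Timemaps are serialized in the HTTP link format, so counting mementos amounts to counting entries whose rel value includes "memento". The following sketch uses a simplified regex-based parse rather than a full link-format parser, and the function names are ours:

```python
import re

def count_mementos(timemap: str) -> int:
    """Count entries in a link-format timemap whose rel value
    includes "memento" (e.g. "memento", "first memento")."""
    count = 0
    # Each entry looks like: <uri>; rel="..."; datetime="..."
    for entry in re.findall(r'<[^>]+>[^<]*', timemap):
        rel = re.search(r'rel="([^"]*)"', entry)
        if rel and "memento" in rel.group(1).split():
            count += 1
    return count

def is_archived(timemap: str) -> bool:
    """A resource is declared archived if it has at least one memento."""
    return count_mementos(timemap) > 0
```

Entries with rel values such as "original" or "self" (the resource itself and the timemap itself) are not mementos and are excluded from the count.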
5 Existence as a Function of Time
Inspecting the results from the previous steps suggests that the number of missing
shared resources in social media corresponding to an event is directly proportional
to its age. To determine dates for each of the events, we extracted
all the creation dates from all the tweet-based datasets and sorted them. For
each event, we plotted a graph illustrating the number of tweets per day related
to that event, as shown in Figure 1. Since the dataset is separated temporally into
3 partitions, and in order to display all the events on one graph, we reduced the
size of the x-axis by removing the time periods not covered in our study.
Fig. 1. URIs shared per day corresponding to each event and showing the two peaks
in the non-Syrian and non-Egyptian events
Upon examining the graph we found an interesting phenomenon in the non-
Syrian and non-Egyptian events: each event has two peaks. Upon investigating
historic timelines we came to the conclusion that those peaks reflect a second
wave of social media interaction, the result of a new incident within the same
event after a period of time. For example, in the H1N1 dataset, the first peak
illustrates the world-wide outbreak announcement while the second peak denotes
the release of the vaccine. In the Iran dataset, the first peak shows the height of
the elections while the second peak pinpoints the Iranian trials. As for the MJ
dataset, the first peak corresponds to his death and the second peak reflects the
rumors that Michael Jackson died of unnatural causes and a possible homicide.
For the Obama dataset, the first peak reveals the announcement of his winning
the prize while the second peak marks the award-giving ceremony in Oslo. For
the Egyptian revolution, the resources all fall within a small time slot of two
weeks around the 11th of February. As for the Syrian event, since the collection
was very recent there were no obvious peaks. The peaks we examined become
the temporal centroids of the social content collections (the datasets): MJ (June
25th & July 10th, 2009), Iran (June 13th & August 1st, 2009), H1N1 (September
11th & October 5th, 2009), and Obama (October 9th & December 10th, 2009).
Egypt had one centroid (February 11th, 2011) and the Syria dataset also had
one centroid, on March 27th, 2012. We split each dataset according to its
centroids. Figure 1 shows those peaks and Table 6 shows the missing content
and the archived content percentages corresponding to each centroid.
            MJ              Iran            H1N1            Obama           Egypt   Syria
            1st     2nd     1st     2nd     1st     2nd     1st     2nd
% Missing   36.24%  31.62%  26.98%  24.47%  23.49%  25.64%  24.59%  26.15%  10.48%  7.04%
% Archived  39.45%  30.78%  43.08%  36.26%  41.65%  43.87%  47.87%  46.15%  20.18%  5.35%
Table 6. The split datasets: one column per temporal centroid (first and second
peak) for MJ, Iran, H1N1, and Obama; one centroid each for Egypt and Syria
Fig. 2. Percentage of content missing and archived for the events as a function of time.
Figure 2 shows the missing and archived values from Table 6 as a function of time
since sharing. Equation 1 shows the modeled estimate for the percentage of shared
resources lost, where Age is in days. While the relationship between time and
being archived is less linear, Equation 2 shows the modeled estimate for the
percentage of shared resources archived in a public archive.

Content Lost Percentage = 0.02 × (Age in days) + 4.20    (1)

Content Archived Percentage = 0.04 × (Age in days) + 6.74    (2)

Given these observations and our curve fitting, we estimate that after a year from
publishing about 11% of content shared in social media will be gone. After this
point, we are losing roughly 0.02% of this content per day.
6 Conclusions and Future Work
We can conclude that there is a nearly linear relationship between time of shar-
ing in the social media and the percentage lost. Although not as linear, there is
a similar relationship between the time of sharing and the expected percentage
of coverage in the archives. To reach this conclusion, we extracted collections of
tweets and other social media content that was posted and shared in relation to
six different events that occurred in the time period from June 2009 to March
2012. Next we extracted the embedded resources within this social media content
and tested their existence on the live web and in the archives. After analyzing
the percentages lost and archived in relation to time and plotting them we used
a linear regression model to fit those points. Finally, we presented two linear
models that estimate the existence of a resource posted or shared in social media,
on the live web and in the archives, as a function of its age in social media.
In the next stage of our research we need to expand the datasets and import
other similar datasets, especially in the uncovered temporal areas (e.g., the year
2010 and before 2009). Examining more datasets across extended points in time
could enable us to better model these two functions of time. Several other factors
besides time will also be analyzed to understand their effect on persistence on
the live web and on archiving coverage, such as publishing venue, rate of sharing,
popularity of authors, and the nature of the related event.
7 Acknowledgments
This work was supported in part by the Library of Congress and NSF IIS-
1009392.
References
1. Ainsworth, Scott G. and Alsum, Ahmed and SalahEldeen, Hany and Weigle, Michele
C. and Nelson, Michael L.: How Much of the Web Is Archived? In Proceedings of the
11th annual international ACM/IEEE joint conference on Digital libraries, JCDL
’11, pages 133-136, (2011).
2. Bar-Yossef, Ziv and Broder, Andrei Z. and Kumar, Ravi and Tomkins, Andrew.: Sic
Transit Gloria Telae: Towards an Understanding of the Web’s Decay. In Proceedings
of the 13th international conference on World Wide Web, WWW ’04, pages 328-337,
(2004).
3. F. Benevenuto, T. Rodrigues, M. Cha, and V. Almeida.: Characterizing User Behavior
in Online Social Networks. In Proc. of ACM SIGCOMM Internet Measurement
Conference, IMC ’09, pages 49-62, (2009).
4. Lee, Chei and Ma, Long and Goh, Dion.: Why Do People Share News in Social
Media? Active Media Technology, Springer Berlin / Heidelberg, pages 129-140,
Volume 6890, (2011).
5. Facebook official fact sheet, http://newsroom.fb.com/content/default.aspx?NewsAreaId=22
6. Kwak, Haewoon and Lee, Changhyun and Park, Hosung and Moon, Sue.: What is
Twitter, a Social Network or a News Media? In Proceedings of the 19th international
conference on World wide web, WWW ’10, pages 591-600, (2010).
7. Gordon Mohr, Michele Kimpton, Micheal Stack and Igor Ranitovic.: Introduction
to Heritrix, an Archival Quality Web Crawler. In 4th International Web Archiving
Workshop, IWAW ’04,(2004).
8. Frank McCown and Norou Diawara and Michael L. Nelson.: Factors Affecting
Website Reconstruction from the Web Infrastructure. In Proceedings of the 7th
ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL ’07, pages 39-48,
(2007).
9. Michael L. Nelson, B. Danette Allen.: Object Persistence and Availability in Digital
Libraries. D-Lib Magazine, Volume 8, Number 1, January (2002)
10. M. E. J. Newman and J. Park.: Why social networks are different from other types
of networks. Phys. Rev. E, 68(3):036122, September, (2003).
11. Alex Nunns and Nadia Idle.: Tweets From Tahrir. ISBN-10: 1935928457.
12. T. A. Phelps and R. Wilensky.: Robust Hyperlinks Cost Just Five Words Each.
Technical Report, UCB/CSD-00-1091, EECS Department, University of California,
Berkeley, (2000).
13. Hany M. SalahEldeen, Michael L. Nelson.: Losing My Revolution: A year after
the Egyptian Revolution, 10% of the social media documentation is gone.
http://ws-dl.blogspot.com/2012/02/2012-02-11-losing-my-revolution-year.html
14. Robert Sanderson, Mark Phillips and Herbert Van de Sompel.: Analyzing the
Persistence of Referenced Web Resources with Memento. CoRR, arXiv:1105.3459,
(2011)
15. Stanford SNAP Project Dataset, http://snap.stanford.edu/
16. Twitter numbers, http://blog.Twitter.com/2011/03/numbers.html
17. H. Van de Sompel, M. L. Nelson, R. Sanderson, L. L. Balakireva, S. Ainsworth, H.
Shankar.: Memento: Time Travel for the Web, Technical Report, arXiv:0911.1112,
November, (2009).
18. X. Wan and J. Yang.: WordRank-based Lexical Signatures for Finding Lost or Related Web Pages. In Proceedings of the 8th Asia-Pacific Web Conference on Frontiers of WWW Research and Development, APWeb '06, pages 843-849, (2006).
19. C. Wilson, B. Boe, A. Sala, K. P. Puttaswamy, and B. Y. Zhao.: User Interactions
in Social Networks and their Implications. In Proceedings of the 4th ACM European
conference on Computer systems, EuroSys ’09, pages 205-218, (2009).
20. Shaomei Wu, Jake M. Hofman, Winter A. Mason, and Duncan J. Watts.: Who Says What to Whom on Twitter. In Proceedings of the 20th International Conference on World Wide Web, WWW '11, pages 705-714, (2011).
21. Jaewon Yang and Jure Leskovec.: Patterns of Temporal Variation in Online Media. In ACM International Conference on Web Search and Data Mining, WSDM '11, pages 177-186, (2011).
22. J. Yang and S. Counts.: Predicting the Speed, Scale, and Range of Information
Diffusion in Twitter. In 4th International AAAI Conference on Weblogs and Social
Media, ICWSM ’10, May, (2010).
23. D. Zhao and M. B. Rosson.: How and Why People Twitter: The Role that Micro-blogging Plays in Informal Communication at Work. In Proceedings of the ACM 2009 International Conference on Supporting Group Work, GROUP '09, pages 243-252, (2009).