How Much Data Do You Need? Twitter Decahose Data
Analysis
Quanzhi Li, Sameena Shah, Merine Thomas, Kajsa Anderson, Xiaomo Liu, Armineh
Nourbakhsh, Rui Fang
Research and Development
Thomson Reuters
3 Times Square, NYC, NY 10036
{quanzhi.li, sameena.shah, merine.thomas, kajsa.anderson,
xiaomo.liu, armineh.nourbakhsh,
rui.fang}@thomsonreuters.com
Abstract. Twitter generates between 500-700 million tweets a day. It is expensive, unnecessary and almost impossible to process the entire tweet data set for any application. Twitter's 1% Streaming API and Search API have their own limitations. In this paper, we present our findings on an alternative source, the 10% Decahose, to help researchers and businesses decide how much tweet data they need. This paper reports on the following analysis for the Decahose data: entity and metadata distribution; entity coverage and novelty evolution from 1% to 10% of the Decahose; the amount of information change from 1% to 10%, as measured by recall of test tweets; and a statistical comparison between Twitter's 1% streaming data and the Decahose data.
Keywords: Twitter Decahose, Twitter Streaming API, tweet metadata, social data analysis, Twitter, social media
1 Introduction
Twitter makes its data available through several mechanisms. One is the real-time Streaming API, which delivers a 1% sample of all tweets. This data is available at no cost. The Streaming API has limitations: in addition to the volume limit, it also has query limits if its filtering function is used. Another approach for accessing tweet data is the Search API. The Search API's results are not real time, and it limits the number of queries per user, per application, and per 15-minute window [11].
Neither the Search API nor the Streaming API is designed for enterprise access, which usually requires high coverage of the data. The enterprise access options are the Decahose and the Halfhose, which are 10% and 50% randomly sampled data streams, respectively [2]. Researchers are very interested in the Decahose, but because of its non-trivial cost, access is often out of reach. It has nevertheless already been applied in specific verticals, such as identifying topical authorities [9]. In this paper, we report on statistical properties of the Decahose. We hope the analysis will provide helpful information for both the research community and businesses.
In this paper, we analyze: (1) the distribution of entities and metadata available through the Decahose, such as hashtags, Urls and tweet topics; (2) entity coverage and novelty evolution as the data sample size increases from 1% to 10%; (3) how the amount of information changes from the 1% to the 10% sample size, which we measure by computing the recall for a set of test tweets based on each tweet's textual content; and (4) whether there are any differences between Twitter's 1% streaming data and the Decahose data. These analyses provide insight into the characteristics of the Decahose, and help users understand how much Twitter data they may need. They also show how significant and representative the different-sized samples are.
2 Related Work
In [7], the authors used Twitter's Sample API to test whether its 1% Streaming API is biased. They took one hashtag and observed its trend line over one day in both the Sample API and the Streaming API. They claimed that there was a small bias because they observed two spikes and the spikes were not identical between the two APIs.
In [6], the authors questioned whether data obtained through Twitter's sampled Streaming API is a sufficient representation of activity on Twitter as a whole. They compared the Firehose to the Streaming API in terms of top-n hashtags and geo-tagged tweets. They found that the Streaming API estimates the top n hashtags well when n is large, but is often misleading when n is small. For geo-tagged tweets, they found that the Streaming API returns almost the complete set of geo-tagged tweets despite covering only a sub-sample of the Firehose.
Although the previous two studies [6, 7] are related to Twitter data, they focus on comparing streaming data to Firehose or sample data. To the best of our knowledge, our study is the first one exploring how information changes gradually as the data size changes. No previous study has focused on our research questions 2 and 3 introduced in the next section. There are several studies that analyze the entire network sample of Twitter [3, 4, 12], but none of them analyzes the effect of data size.
3 Research Questions and Data Set
Following each question, we briefly introduce the methodology for answering it. The
detailed research methodology will be described in the result analysis sections.
Q1. What is the metadata distribution across the Decahose data?
An individual tweet carries multiple metadata fields. We study two types of metadata: those derived from the tweet's source and those derived from its content. Source-related metadata include whether the author is a verified user, a news agency user, or an influential user. Content-related metadata include the tweet's topic, whether it is a retweet, and its hashtags, mentions, Urls, and proper nouns (named entities). This question is answered in Section 4.
Q2. As the tweet data is increased by 1% at each step going from 1% to 10%, how does entity coverage change? What does the novelty curve look like, that is, how many new entities become available at each percentage point addition?
We answer these questions in Section 5, where we analyze the changes of the following entities: hashtag, Url, user, verified user, proper noun and mention.
Q3. Given a tweet as a target, what percentage of tweet data do we need in order
to find other tweets relevant to the target? At each increment from 1% to 10%, how
many relevant tweets are there?
We set this up as a recall experiment, in which recall is computed over a set of test tweets. Tweet text is used to measure the similarity between tweets for computing recall. The recall is computed separately for news tweets and non-news tweets.
Q4. Are there any differences, in terms of metadata distribution, between Twitter's
1% streaming data and Decahose data?
To answer this question, we compare Twitter's 1% streaming data to a 1% sampling of the Decahose. The comparison is based on both user-related and tweet-content-related metadata. The result of this comparison is reported in Section 7.
Decahose Dataset. The Decahose is a 10% random sample of the whole Twitter stream. We obtained one month of Decahose data covering the entire month of Oct. 2015. In total, there are 1.04 billion tweets in the dataset. Among them, there are 280 million English-language tweets. The data used in the experiments are all English tweets; non-English tweets are removed. Some experiments in this study require the Decahose data to be split into 10 parts, each representing 1% of Twitter data. To do this, each tweet was randomly assigned a sequence number, from 1 to 10, when it was ingested into our storage.
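For illustration, this bucketing step can be sketched as follows, assuming tweets arrive as an iterable of JSON-like dicts; the "bucket" field name and the file name in the usage comment are illustrative placeholders, not part of our pipeline.

import random

def assign_buckets(tweets, n_buckets=10, seed=42):
    """Randomly tag each tweet with a bucket number from 1 to n_buckets.

    Keeping all tweets tagged 1..k then yields roughly k% of the full
    Twitter stream, since each Decahose bucket is about 1% of Twitter.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    for tweet in tweets:
        tweet["bucket"] = rng.randint(1, n_buckets)  # uniform over 1..n_buckets
        yield tweet

# Usage sketch: materialize the 1% sample (bucket 1) from a Decahose dump.
#   import json
#   decahose = (json.loads(line) for line in open("decahose-2015-10.json"))
#   one_percent = [t for t in assign_buckets(decahose) if t["bucket"] == 1]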
Test Dataset for Recall Study. For the recall analysis, we need a set of tweets as the standard dataset on which the recall metrics are computed. These tweets need to be from the same time period as the Decahose data, i.e. October 2015. We have two types of test tweets - tweets from news organization accounts, and tweets from other accounts, such as those of politicians and sports figures. They will be called news tweets and non-news tweets, respectively, hereafter. The reason for this distinction is that news tweets are more related to important events, and we want to see whether there is any difference between these two types of tweets in terms of recall. The news tweets were collected from 108 news organization accounts, such as CNN and Reuters; 3,875 news tweets were collected through Twitter's Search API. The non-news tweets were collected from 547 non-news accounts, yielding 2,704 non-news tweets.
4 Tweet Metadata Distribution Analysis
This section tries to answer research question Q1.
4.1 Metadata Generation
Tweet topic. Each tweet is marked with one topic based on its content. We used OpenCalais for tweet topic classification, as in previous studies [10].
Url. This is the link presented in a tweet. A short link is resolved to its absolute
address.
Verified user. This identifies whether a user is verified by Twitter.
Influential user. Following previous studies [1, 5], we use the number of followers to measure a user's influence level. We define two types of influential users by using two thresholds: number of followers greater than 5,000 and greater than 10,000, respectively.
News organization user. These are user accounts that belong to news organizations, such as CNN. 2,040 news accounts are used in this study.
Proper noun. A proper noun (named entity) refers to the name of an organization, person, or other type of entity. A tweet with proper nouns usually conveys more meaning. The TweetNLP package [8] was used to identify proper nouns in tweet text.
Other metadata. Retweet - whether it is a retweet; Hashtag - the hashtags in the tweet; Mention - the mentions in each tweet; Media - whether the tweet contains media content.
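As an illustration of how these flags can be derived, the sketch below assumes the standard Twitter API JSON payload field names (user.verified, user.followers_count, entities, retweeted_status). The news_account_ids argument stands in for our list of 2,040 news accounts, and the proper-noun flags are omitted since they require TweetNLP tagging.

def tweet_metadata(tweet, news_account_ids=frozenset()):
    """Derive the boolean metadata flags above from one tweet dict.

    Field names follow the standard Twitter API JSON payload;
    news_account_ids stands in for the list of news accounts.
    """
    user = tweet.get("user", {})
    entities = tweet.get("entities", {})
    followers = user.get("followers_count", 0)
    return {
        "verified": user.get("verified", False),
        "news_org": user.get("id") in news_account_ids,
        "influential_5k": followers > 5000,
        "influential_10k": followers > 10000,
        "is_retweet": "retweeted_status" in tweet,
        "has_hashtag": bool(entities.get("hashtags")),
        "has_mention": bool(entities.get("user_mentions")),
        "has_url": bool(entities.get("urls")),
        "has_media": bool(entities.get("media")),
    }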
4.2 Metadata Distribution
Table 1 presents the metadata distribution. The result is based on one week's data (Oct. 1-7) and covers both the 10% and 1% datasets. The total number of tweets for the one-week dataset is 63,699,142 in the 10% Decahose, and 6,369,072 in the 1% portion. The table shows that, in terms of the distribution of metadata, there is no major difference between the 1% and 10% data, which is expected because the 1% dataset is already very large from a statistical point of view.
Table 1. Metadata distribution (1% and 10%)

Metadata                             Decahose 1%   Decahose 10%
User related
  Verified                           0.51%         0.51%
  News organization                  0.02%         0.02%
  Influential (followers > 5k)       5.94%         5.95%
  Influential (followers > 10k)      3.24%         3.25%
Tweet related
  Is retweet                         37.99%        37.98%
  Has hashtag                        19.27%        19.25%
  Has mention                        63.89%        63.88%
  Has url                            22.65%        22.62%
  Has media                          17.47%        17.46%
  Has > 1 proper noun                35.95%        35.92%
  Has > 2 proper nouns               12.36%        12.36%
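For completeness, a small sketch of how Table 1's percentages can be aggregated from the per-tweet flags, reusing the tweet_metadata helper sketched in Section 4.1, is given below.

from collections import Counter

def metadata_distribution(tweets, news_account_ids=frozenset()):
    """Percentage of tweets for which each metadata flag holds."""
    counts, total = Counter(), 0
    for tweet in tweets:
        flags = tweet_metadata(tweet, news_account_ids)  # Section 4.1 sketch
        counts.update(name for name, value in flags.items() if value)
        total += 1
    return {name: 100.0 * count / total for name, count in counts.items()}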
5 Entity Coverage and Novelty Analysis
This analysis tries to answer the research question Q2. When we talk about coverage
in this study, it is based on the 10% Decahose. The dataset used for this experiment is
similar to the one used in the last section: one week of Decahose data ranging from
6.37 million tweets for 1% to 63.7 million for the 10% Decahose.
Table 2. Entity coverage change from 1% to 10% Decahose

Decahose data (%)   Hashtag   Url       Verified User   Proper Noun   Mention
1%                  20.87%    13.57%    35.00%          17.72%        24.25%
2%                  33.63%    25.05%    52.28%          30.12%        39.04%
3%                  44.40%    35.69%    63.85%          40.96%        50.60%
4%                  54.03%    45.75%    72.39%          50.81%        60.29%
5%                  62.83%    55.43%    79.14%          60.00%        68.68%
6%                  71.01%    64.79%    84.54%          68.68%        76.15%
7%                  78.80%    73.90%    89.14%          76.98%        82.91%
8%                  86.17%    82.78%    93.24%          84.92%        89.07%
9%                  93.21%    91.47%    96.79%          92.58%        94.72%
10%                 100%      100%      100%            100%          100%
Fig. 1. Entity novelty change rate over data level
Table 2 presents the coverage result. It shows what percentage of each entity type can be found at each data level. The table reveals some interesting findings. For user information, at the 5% level we can find about 80% of the verified users, and 72% of all users. It also shows that at just the 2% level we can already find more than half of the verified users present in the 10% Decahose data. One explanation is that users, especially verified users, usually author or retweet multiple tweets during this period of time. In terms of Urls, only 55% of them are discovered at the 5% data level, which means the rate at which new Urls emerge is nearly linear in the data level, and Urls are repeated less often across different tweets.
Table 3. Recall result using cosine similarity threshold of 0.75

                    News account tweets          Non-news account tweets
Decahose data (%)   Recall   Avg. no. matches    Recall   Avg. no. matches
1                   47.0%    3.4                 23.3%    7.2
2                   63.3%    5.1                 34.1%    9.9
3                   71.9%    6.8                 41.2%    12.2
4                   77.4%    8.4                 46.1%    14.5
5                   77.9%    7.4                 50.3%    16.6
6                   84.6%    11.5                54.0%    18.6
7                   86.5%    13.2                56.8%    20.7
8                   88.5%    14.7                58.4%    22.9
9                   89.9%    16.3                60.7%    24.8
10                  91.0%    17.8                62.4%    26.9
Figure 1 shows the result from the novelty point of view: how many new entities emerge at each data level? We can see that the verified user line drops very fast from the 1% to the 10% level, which means very few new verified users emerge once the sample size reaches a certain level. In contrast, the Url line is almost straight, which means Urls maintain a high novelty level as the volume of data increases.
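Both the coverage curves of Table 2 and the novelty curves of Figure 1 can be computed with a single cumulative-set pass over the ten buckets. The sketch below assumes each bucket is an iterable of entity strings (e.g. hashtags); the hashtags_in_bucket loader in the usage comment is hypothetical.

def coverage_and_novelty(buckets):
    """Cumulative coverage and per-step novelty over the 1%..10% levels.

    buckets: list of 10 iterables, each yielding the entities (e.g.
    hashtags) observed in one 1% bucket of the Decahose.
    """
    per_bucket = [set(b) for b in buckets]
    all_entities = set().union(*per_bucket)    # entities in the full 10%
    seen, rows = set(), []
    for level, bucket in enumerate(per_bucket, start=1):
        new = bucket - seen                    # entities first seen here
        seen |= bucket
        rows.append((level,
                     len(seen) / len(all_entities),  # coverage vs. the 10%
                     len(new)))                      # novelty at this step
    return rows

# Usage sketch (hashtags_in_bucket is a hypothetical loader):
#   rows = coverage_and_novelty([hashtags_in_bucket(i) for i in range(1, 11)])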
6 Tweet Content Recall Analysis
This section tries to answer question Q3. As described before, there are two types of
test tweets, news tweets and non news tweets. Each test tweet was compared to all
Decahose tweets in this 3-day time range: the day the test tweet was created, 1 day
before, and 1 day after. A tweet event usually lasts for a couple of days, and we think
a 3-day window is a reasonable time period for finding relevant tweets. Expanding
this window may increase the recall value, but the increase is small based on our test-
ing. On average, each test tweet was compared to about 27 million tweets. Cosine
similarity is used to measure the similarity between two tweets. Cosine similarity is a
popular measure for computing the similarity between two sets of text, and has been
used by many previous studies; a value of 1 means the two text segments are the same
and 0 means totally different. Before the calculation, some basic pre-processing is
applied to the tweet text, such as stopword removal. Different applications may
choose different cosine values, usually greater than 0.5, as the threshold for compu-
ting recall. The recall result in Table 3 is based on 0.75.
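A minimal sketch of this recall computation is given below. The preprocessing choices (lowercasing, a tiny stopword set, raw term-frequency vectors) are illustrative assumptions rather than our exact pipeline, as is taking the average number of matches over all test tweets.

import math
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "for", "on"}

def vectorize(text):
    """Bag-of-words term-frequency vector after basic preprocessing."""
    tokens = re.findall(r"[a-z0-9#@']+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)

def cosine(u, v):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * \
           math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def recall_at_level(test_tweets, candidate_tweets, threshold=0.75):
    """Fraction of test tweets with at least one candidate above the
    threshold, plus the average number of matches (cf. Table 3)."""
    cand_vecs = [vectorize(t) for t in candidate_tweets]
    hits, matches = 0, 0
    for t in test_tweets:
        v = vectorize(t)
        n = sum(1 for c in cand_vecs if cosine(v, c) >= threshold)
        hits += n > 0
        matches += n
    return hits / len(test_tweets), matches / len(test_tweets)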
From Table 3 we can see that news tweets have a much higher recall than non-news tweets. Since the tweets from news agencies are usually about important events, there are usually more tweets talking about them; in contrast, non-news tweets usually attract less attention. One interesting observation is that although non-news tweets have lower recall, their average number of matches is higher than that of news tweets. This means that a non-news tweet either has no related tweets or, if it does, it may have a large number of them; for example, an event about Justin Bieber may go viral on Twitter. Another observation is that news tweets' recall is already close to 0.5 at the 1% data level, and at 10% it reaches 0.91. This means that if one is only interested in tweets related to news, the 10% Decahose will provide coverage very close to the 100% Firehose.
Table 4. Metadata comparison of Twitter 1% streaming data with Decahose data

Metadata                             1% Decahose   1% Twitter streaming
User related
  Verified                           0.51%         0.42%
  News organization                  0.024%        0.020%
  Influential (followers > 5k)       5.95%         6.40%
  Influential (followers > 10k)      3.25%         3.71%
Tweet related
  Is retweet                         37.98%        39.62%
  Has hashtag                        19.24%        18.65%
  Has mention                        63.89%        64.55%
  Has url                            22.63%        22.35%
  Has media                          17.45%        18.37%
  Has > 1 proper noun                35.92%        35.18%
  Has > 2 proper nouns               12.36%        11.82%
7 Comparison of 1% Twitter Streaming Data with Decahose Data
We try to address the research question Q4 in this section. Twitter claims that the 1%
streaming data is randomly sampled from the 100% Twitter data in real time, but how
exactly that is done is not clear. People may wonder if there is any difference between
the 1% streaming data and the Decahose data. We have both the 1% streaming data
and the Decahose data from the same period of time, which makes the comparison
possible. The 1% Decahose sample used in this study was generated as follows: a
tweet from Decahose was randomly assigned to one of ten buckets; after all tweets
were processed, one bucket was randomly selected as the 1% sampling of the
Decahose.
Table 4 presents the comparison results for the general metadata. One limitation of this comparison is that the 1% Decahose data has fewer tweets than the 1% streaming data; the size difference between the two data sets is about 15%. The reason is that when Twitter handed the Decahose data to us, some tweets had already been deleted, either by their authors or by Twitter. Twitter deletes tweets that are considered spam by its off-line spam filter, or that violate copyright or other rules. This might be one main reason that some of the distributions differ between these two data sets. Table 4 shows that the Decahose data has a slightly higher percentage of verified users, while it has slightly lower percentages of influential users. One explanation is that verified users are more careful when they tweet, so they are less likely to delete their own tweets, and it is also rare for Twitter to delete tweets from verified users. In contrast, users with many followers tend to tweet more, and the chance of their tweets being deleted is also higher than for ordinary users.
Table 5. Topic comparison of Twitter's 1% streaming data to Decahose data

Tweet Topic           1% Decahose   1% Twitter streaming
Business/Finance      2.23%         2.18%
Technology/Internet   1.50%         1.48%
Politics              1.08%         0.98%
Sports                11.57%        11.21%
Entertainment         9.55%         9.66%
Health/Medical        1.55%         1.56%
Crisis/War/Disaster   1.61%         1.58%
Weather               0.47%         0.46%
Law/Crime             1.01%         0.98%
Life/Society          66.94%        67.28%
Other                 2.48%         2.63%
In this analysis, in addition to the metadata used in previous sections, the topic of each tweet is identified by the topic classifier described earlier. Table 5 shows the topic distribution for both data sets. The result tells us that the streaming data and the Decahose have essentially the same topic distributions. Because the majority of tweets talk about people's daily lives, a large portion of tweets are classified as Life/Society.
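We report the distributions side by side without a formal significance test, but at these sample sizes even small gaps can be checked with a standard two-proportion z-test. The sketch below is illustrative only: it plugs in the verified-user shares from Table 4 (0.51% vs. 0.42%) with assumed sample sizes of roughly 6.4 million and 7.3 million tweets, the latter derived from the roughly 15% size difference noted above.

import math

def two_proportion_ztest(k1, n1, k2, n2):
    """Two-sided z-test for the difference between two proportions."""
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)                   # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return z, p_value

# Illustrative numbers only: verified-user shares from Table 4 applied to
# assumed week-sized samples of ~6.4M (1% Decahose) and ~7.3M (streaming).
z, p = two_proportion_ztest(k1=int(0.0051 * 6_369_072), n1=6_369_072,
                            k2=int(0.0042 * 7_300_000), n2=7_300_000)
print(f"z = {z:.1f}, p = {p:.3g}")  # p underflows to 0.0 for very large |z|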
8 Conclusion
In this paper, we analyzed Twitter's Decahose dataset and reported on the following analyses: the distribution of a rich set of metadata; how the volume of entities evolves as the Decahose data grows from 1% to 10%; the amount of information change at different data levels; and the potential differences between Twitter's 1% streaming data and the Decahose. We hope the statistics and findings will provide insight and help interested parties decide the amount of Twitter data needed for their applications.
References
1. Castillo, C.; Mendoza, M.; and Poblete, B. 2011. Information credibility on twitter. In Proceedings of WWW 2011, 675-684.
2. Gnip. 2015. An overview of twitter's streaming API.
3. Java, A.; Song, X.; Finin, T.; and Tseng, B. 2007. Why we twitter: Understanding microblogging usage and communities. In Proceedings of the 9th WebKDD, New York, NY.
4. Kwak, H.; Lee, C.; Park, H.; and Moon, S. 2010. What is twitter, a social network or a news media? In Proceedings of WWW 2010.
5. Liao, Q., and Shi, L. 2013. She gets a sports car from our donation: rumor transmission in a chinese microblogging community. In Proceedings of CSCW 2013.
6. Morstatter, F.; Pfeffer, J.; Liu, H.; and Carley, K. 2013. Is the sample good enough? Comparing data from twitter's streaming API with twitter's firehose. In Proceedings of ICWSM 2013.
7. Morstatter, F.; Pfeffer, J.; and Liu, H. 2014. When is it biased? Assessing the representativeness of twitter's streaming API. In Proceedings of WWW 2014 Companion.
8. Owoputi, O.; O'Connor, B.; Dyer, C.; Gimpel, K.; Schneider, N.; and Smith, N. 2013. Improved part-of-speech tagging for online conversational text with word clusters. In Proceedings of NAACL 2013.
9. Pal, A., and Counts, S. 2011. Identifying topical authorities in microblogs. In Proceedings of WSDM 2011.
10. Quercia, D.; Askham, H.; and Crowcroft, J. 2012. TweetLDA: Supervised topic classification and link prediction in twitter. In the 4th ACM Web Science Conference.
11. Twitter. 2015. Twitter API rate limits.
12. Wu, S.; Hofman, J. M.; Mason, W. A.; and Watts, D. J. 2011. Who says what to whom on twitter. In Proceedings of WWW 2011.