Article
Redundancy Reduction in Twitter Event Streams
Nane Kratzke
Lübeck University of Applied Sciences; nane.kratzke@th-luebeck.de
Version February 12, 2020 submitted to Information
Abstract: The data from social networks like Twitter is a valuable source for research but full of redundancy, making it hard to provide large-scale, self-contained, and small datasets. Data recording is a common problem in social media-based studies and could be standardized. Sadly, this is hardly ever done. This paper reports on lessons learned from a long-term evaluation study recording the complete public sample of the German and English Twitter streams. It presents a recording solution proposal that merely chunks a linear stream of events to reduce redundancy. If events are observed multiple times within the time span of a chunk, only the latest observation is written to the chunk. A 10 GB Twitter raw dataset covering 1.2 million tweets from 120,000 users recorded between June and September 2017 was used to analyze expectable compression rates. It turned out that the resulting datasets need only between 10% and 20% of the original data size without losing any event, metadata, or the relationships between single events. This kind of redundancy-reducing recording makes it possible to curate large-scale (even nation-wide), self-contained, and small datasets of social networks for research in a standardized and reproducible manner.

Keywords: Twitter; dataset; redundancy; reduction; archive
1. Introduction
The systematic, data-driven analysis of social media data is becoming more and more common in the (social) sciences and is, among other domains, applied to understand the influence of social media on our everyday life. E.g., Barberá and Rivero emphasize "the opportunities offered by Twitter for the analysis of public opinion: messages are exchanged by numerous users in a public forum, and they may contain valuable information about individual preferences and reactions to different (...) events in an environment that is fully accessible to the researcher" [1].
Twitter provides samples of these data for free via its streaming APIs. At least for large datasets [2], these samples "truthfully reflect the daily and hourly activity patterns of the Twitter users (...) and preserve the relative importance (...) of content terms" [3]. So, although Twitter might not be the most influential social network, with more than 330 million monthly and approximately 145 million daily active users in 2019 it is a valid and free-to-use data source for research. Therefore, Twitter data has been used for a variety of interesting studies:
- Analysis of the political representativeness of Twitter users [3]
- Real-time Twitter analysis [4–6]
- Democratic elections [7–13]
- The uses of Twitter by populists [14]
- Misinformation dissemination and event detection in social networks [15–17]
- Online public shaming [18]
- and many more
Furthermore, there exist plenty of publicly available datasets. E.g., a curated collection [19] is hosted on Zenodo [20] and covers several datasets of political campaigns [8,9], online misinformation networks [21], event detection in temporal networks [22], public shaming [23], Twitter-related word vectors [24], retweeting time series [25], and even continuously updated samples of a nation-wide Twitter usage [26].

Submitted to Information, pages 1–10, www.mdpi.com/journal/information
So, there is no problem with a lack of data. The problem lies more in the variety of data and in the variety of different collection methodologies and software stacks used to collect the data. The mentioned datasets are provided in varying formats (CSV, JSON, TXT, XML, and further proprietary, even binary, data formats). What is more, almost none of the above-referenced studies reported in repeatable detail the exact methodology and tool suite used to archive the data. This mix of data formats and vaguely described collection methodologies makes it hard to compare different studies and datasets. However, we can distinguish two major data collection approaches:
- Streaming approaches like the programmable Tweepy API [27] or BotSlayer [28] make use of the Twitter streaming API. They make it possible to record large amounts of data in real time. However, the filtering must be specified up front of a study, which can be tricky if the specific filtering can hardly be predicted, e.g., in the case of sudden and unpredictable events like earthquakes [5] or terror attacks. Moreover, the programmable libraries leave plenty of room for very proprietary recording solutions that are hardly reproducible by other researchers.
- Scraping approaches like TWINT [29] make use of Web scraping techniques. Because of the scraping approach, these tools are more limited regarding the amount of recordable data. However, the data can be collected even in the aftermath of an event. Because they are "backwards"-looking, they make it hard to analyze real-time effects. What is more, scraping technologies need an initial set of search terms or user accounts to start scraping. A slightly different set of search terms may result in substantially different datasets. So, scraping-based approaches are vulnerable to unaware and non-obvious biases. Therefore, it is hard to use them to collect datasets that can be used as an objective and unbiased "ground truth" of a social network.
So, to create a large-scale, objective, and unbiased "ground truth," it is necessary to archive even nation-scale social network traffic with as few up-front filters as possible. Thus, the resulting archives can be filtered for research questions in the aftermath of events independently by different researchers. However, the resulting sizes of the datasets must be manageable and reasonable. This paper focuses on the inherent redundancy of social network event streams, which mainly repeat content. If archiving solutions effectively eliminated the intrinsic redundancy in social network streams, large-scale, self-contained, and (relatively) small datasets would become possible that could be used for a broad range of research. A long-term evaluation study archiving the German Twitter stream demonstrates that this is possible. The monthly updated dataset is provided as open source and might be inspected by the reader [26].
The rest of the paper is outlined as follows. Section 2 explains the overall problem that the inherent redundancy of social network event streams makes it necessary to find solutions for self-containing but redundancy-reduced data formats for archiving. It furthermore proposes a redundancy reduction solution that can be configured using different chunk sizes of events. This solution proposal is implemented in a recording solution called Twista [30]. The reader might inspect Twista on GitHub. Section 3 explains the methodology of the evaluation of Twista's recording capabilities, which is based on replaying already existing raw data of social network event streams. Section 4 shows different compression rate results. It turns out that social network event streams can be archived consuming less than 20% of the raw data space without losing information. Section 5 discusses the results critically and addresses internal and external threats to the validity of this study. The paper closes with a conclusion on the results of this study in Section 6.
2. Problem Statement
The Twitter content dissemination mechanics are well structured. Figure 1 presents the conceptual metamodel as a UML class diagram.
Figure 1. UML data model of Twitter events
Every user interaction on Twitter starts with a Status post. Status posts are used to disseminate some updates or news. Other users are informed about such Status posts and can interact using special kinds of Reactions to comment on or support the dissemination of Tweets. These Reactions refer to this initial Status or to observed follow-up Reactions. Such Reactions that flow as events through the streaming API are:
- Retweet (used to broadcast other Tweets to one's own followers; usually this expresses a kind of support for the content)
- Quote (similar to a Retweet, but own content is attached to the original Tweet; this might change the tenor of the original tweet, e.g., by sarcastic comments)
- Reply (to comment on different kinds of posts; these can be supportive, neutral, or contradictive comments)
A further reaction is a "Like". A "Like" expresses some support for the content of a User. Nevertheless, "Likes" do not flow as events through the Twitter streaming API. So, for this paper, "Likes" are not considered. However, each event contains metadata that counts how many "Likes" a Tweet got. So, "Likes" are recorded and can be analyzed, although they do not flow as events through the Twitter streaming API.
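For illustration, the conceptual metamodel of Figure 1 can be sketched as a small Python class (the names and fields are hypothetical and chosen for this sketch only; they are not Twista's actual data structures):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Event:
    """One entity of the Twitter event stream (illustrative sketch)."""
    id: str
    user: str
    kind: str = "status"             # status | retweet | quote | reply
    text: str = ""
    refers_to: Optional[str] = None  # id of the referred event, if any
    likes: int = 0                   # metadata counter, not a stream event

# A Quote Q of a Retweet R of a Status S forms the referral chain Q -> R -> S.
S = Event(id="S", user="alice", text="Some news")
R = Event(id="R", user="bob", kind="retweet", refers_to="S")
Q = Event(id="Q", user="carol", kind="quote", text="Really?", refers_to="R")
```

In the raw API format, R and Q would additionally embed the full JSON of the events they refer to, which is exactly the redundancy discussed next.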
Except for Replies (for historical and backward compatibility reasons of the Twitter API), Quotes and Retweets contain the referred content completely. For instance, a Quote Q of a Retweet R of a Status S contains R and S (the Retweet R that includes the Status S). On the Twitter streaming API, this looks like a linear stream of events (see Figure 2, TOP). However, there exist referrals between some events. Figure 2 (BOTTOM) presents a more precise representation of the situation. Moreover, every refer-to link is accompanied by redundancy (the referring event always includes the referred event). This redundancy is very comfortable for (mobile) streaming applications because every streaming event contains its full context. So, no expensive follow-up queries are necessary to fetch the context of a Tweet. However, for archiving Twitter content, this self-containment is a redundancy nightmare.
According to studies focussing on political campaigning [9], almost 2/3 of all Tweets are Retweets or Quotes, and only 6% are Status posts (see Figure 7). Although these percentages depend to some degree on the language (German, English, etc.) or the context (political campaigns, general, lifestyle, etc.), they can be observed to some degree similarly in different contexts (see Figure 7). So, more than 60% of all streaming events repeat 6% of the original content. That is a factor of 10!
Figure 2. Stream of Twitter events
However, this provides plenty of opportunities to reduce redundancy. Like the original streaming API, we want to record events as chunks of self-contained sets of events but without unnecessary redundancy. Twista does this via its recording component by logging a Twitter stream as a linear stream of events. Every n-th event (a chunk), all events recorded so far in this chunk are written to a log, but duplicate events are eliminated. So each log contains unique events, but all referrals between these events are preserved. Figure 3 shows this effect by a constructed example for different chunk sizes. So, we can create much smaller records that are still self-contained (containing all referrals).
Figure 3. Effect of different chunk sizes, larger chunks reduce redundancy
Figure 3 furthermore shows that this redundancy reduction is correlated with the chunk size. The redundancy shrinks for larger chunk sizes. However, Figure 3 only presents a constructed example to demonstrate the effect. The question is whether this principle holds for real-world datasets. We will address this question in the following sections.
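The chunking idea itself fits in a few lines of code. The following sketch is not the actual twista/recorder.py; it assumes simplified event dicts with an `id` and a hypothetical `embedded` field holding the referred event. It collects events until a chunk holds the configured number of unique entities and keeps only the latest observation of each:

```python
def unpack(event):
    """Yield the event itself (without its embedded copy) plus all
    transitively embedded events, e.g. the retweeted status."""
    yield {k: v for k, v in event.items() if k != "embedded"}
    if "embedded" in event:
        yield from unpack(event["embedded"])

def record(stream, chunk_size, write_chunk):
    """Chunk a linear event stream, eliminating duplicate entities.
    Referrals survive because every entity keeps its id (and any
    refers_to field), so links can be resolved within the chunk."""
    chunk = {}  # entity id -> latest observed version
    for event in stream:
        for entity in unpack(event):
            chunk[entity["id"]] = entity  # later observation wins
        if len(chunk) >= chunk_size:
            write_chunk(list(chunk.values()))
            chunk = {}
    if chunk:  # flush the final, partially filled chunk
        write_chunk(list(chunk.values()))
```

E.g., five retweets embedding the same status end up as six unique entities in one chunk instead of five full copies of the status.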
3. Methodology
The principle demonstrated in Figures 2 and 3 has been implemented in Twista [30] to measure the effect of redundancy elimination. The reader can study the straightforward implementation of the recording component (twista/recorder.py) on GitHub to inspect the solution proposal.
In a second step, Twitter streaming API raw data had to be selected for evaluation. For this study, it was decided to work with the #BTW17 dataset [9]. It comprises approximately 10 GB of raw data of German tweets (1 GB if compressed as a ZIP archive) that was recorded during the political campaigns for the 2017 German federal elections (Bundestag). Other raw datasets would have been possible as well. However, no further documented Twitter streaming raw datasets were found in public dataset repositories. The collection of the #BTW17 dataset is explained and described in detail in [8].
Figure 4. #BTW17 dataset, active accounts, taken from [9], the 26th Sep. 2017 was the election day
The #BTW17 dataset [8,9] comprises more than 1,200,000 tweets from 120,000 users recorded between June and September 2017. These recorded tweets and users are stored in precisely the JSON-based API (raw) format provided by the public Twitter streaming API. Figure 4 shows the observed tweeting activity throughout the recording. This dataset can be taken as a typical sample of what is "going on" on Twitter every day.
Therefore, the #BTW17 dataset was injected time-ordered into the Twista recording engine with different recording chunk sizes. A chunk size of 1 effectively means storing every event immediately and is therefore very similar to the behaviour of the Twitter streaming API itself. A chunk size of 10,000 means eliminating all duplicates until 10,000 unique entities could be collected. The Twista recording engine recorded this stream as it would have recorded a live stream. In the aftermath, the resulting data sizes of the records produced by different chunk sizes could be compared to reason about the redundancy reduction efficiency of varying chunk sizes.
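This replay methodology can be mimicked on synthetic data. The sketch below is an assumption-laden toy model, not the actual evaluation code: it uses simplified event dicts, JSON string lengths as a proxy for record size, and invented tweet-type shares. It replays the same stream once with a chunk size of 1 (approximating the raw streaming API behaviour) and once with one large de-duplicated chunk:

```python
import json
import random

def record_size(events, chunk_size):
    """Replay a time-ordered stream through a chunked de-duplicating
    recorder and return the total size (bytes) of all written chunks."""
    total, chunk = 0, {}
    for event in events:
        embedded = event.get("embedded")
        if embedded is not None:
            chunk[embedded["id"]] = embedded  # referred event, stored once
            event = {"id": event["id"], "refers_to": embedded["id"]}
        chunk[event["id"]] = event
        if len(chunk) >= chunk_size:
            total += len(json.dumps(list(chunk.values())))
            chunk = {}
    if chunk:
        total += len(json.dumps(list(chunk.values())))
    return total

# Synthetic stream: few original statuses, many retweets embedding them,
# echoing the skewed shares discussed in Section 2 (purely invented data).
random.seed(1)
statuses = [{"id": f"S{i}", "text": "x" * 200} for i in range(60)]
stream = []
for i in range(1000):
    if i % 17 == 0:                   # occasional original status post
        stream.append(dict(random.choice(statuses)))
    else:                             # retweet embedding a status
        stream.append({"id": f"R{i}", "embedded": random.choice(statuses)})

raw = record_size(stream, 1)            # every event written immediately
deduped = record_size(stream, 100_000)  # one large de-duplicated chunk
print(f"compressed to {deduped / raw:.0%} of the raw size")
```

With this synthetic stream, the de-duplicated record needs only a fraction (roughly a fifth) of the raw size; the exact figure depends on the simulated shares and text lengths.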
4. Results
Since April 2019, Twista has recorded a sample of the complete German Twitter stream [26]. This long-term evaluation demonstrates Twista's large-scale recording capabilities. Figure 5 shows the number of recorded tweets and active users per month using the presented redundancy reduction approach. If the reader compares Figure 5 with Figure 4, it becomes evident that the given recording solution is capable of archiving much larger datasets than the selected #BTW17 reference dataset. The evaluation showed that it is possible to archive the complete public sample of the German language Twitter stream. Even the full public sample of the English language Twitter stream has been recorded for several days without problems. This data has been used to compile Figure 7 and to deduce language-dependent Twitter usage characteristics.
Figure 5. Long-term evaluation of Twista (recording the complete public sample of the German Twitter stream [26])
Figure 6 shows the evaluation results of processing the #BTW17 dataset. The compression analysis presented in Figure 6 (1+2) shows that increasing the chunk size decreases the overall size of the recorded dataset. The effect is much more significant for smaller chunk sizes than for larger chunks. So, enlarging the chunk size obviously results in more massive records, but the redundancy reduction effect is hardly measurable at the right end. Chunk sizes larger than 250,000 entities per file make little sense and only increase the record size (but hardly minimize the redundancy any more).
Figure 6. Replaying of #BTW17 data, results of evaluation
Increasing the chunk size also increases the time span that is covered by a record. According to Figure 6 (4), records with 100 recorded events cover about 100 hours (4 days) between the youngest and the oldest event. This period can increase to more than 50,000 hours (almost six years) for a chunk size of 500,000 events. This astonishingly long period has to do with the fact that, in some rare cases, really "old" tweets are referenced. The time spans in particular depend deeply on the recorded dataset and may show completely different characteristics.
All these effects shall be considered when selecting a chunk size for a large-scale recording. According to the experiences made, chunk sizes larger than 250,000 make little sense (even for massive streams like the complete English Twitter stream). For less frequent streams, like the German Twitter stream, chunk sizes of 100,000 per record seem to be a reasonable balance between redundancy reduction and record size.
In general, more massive Twitter streams need larger chunk sizes; less frequent Twitter streams should prefer smaller chunk sizes (otherwise, the generation of a record takes impractically long). However, archive sizes can easily be reduced to 20% compared with the raw data, although some initial explorative experiments might be necessary to figure out an appropriate chunk size.
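Such an explorative experiment can be automated. The following is a minimal sketch under stated assumptions: `size_at` is any hypothetical helper that replays a stream sample and returns the resulting record size for a candidate chunk size, and the toy size curve below is invented to mimic the diminishing returns visible in Figure 6 (1+2):

```python
def choose_chunk_size(size_at, candidates, gain_threshold=0.02):
    """Pick the smallest chunk size beyond which the additional
    redundancy reduction drops below gain_threshold (e.g. 2%)."""
    best = candidates[0]
    prev = size_at(best)
    for candidate in candidates[1:]:
        current = size_at(candidate)
        if (prev - current) / prev < gain_threshold:
            break  # diminishing returns: stop enlarging the chunk
        best, prev = candidate, current
    return best

# Toy size model with diminishing returns (invented for this sketch):
toy_size = lambda c: 100 + 900 / c ** 0.5
print(choose_chunk_size(toy_size, [100, 1_000, 10_000, 100_000, 250_000]))
# prints 100000
```

For this toy curve, the search stops at 100,000, matching the chunk size that proved reasonable for the German stream.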
5. Critical Discussion
The Twitter streaming API returns only a sample of all tweets flowing through the Twitter social network. Data analysis must consider this and should take corresponding studies into consideration [2]. Twitter does not guarantee how big this sample size is. However, Twitter states a range between 1% and 10% for tweets. Studies that measured this sample size reported a sample size between 0.95% and 9.6% for tweets and between 10% and 45% for users [2,3]. Wang et al. concluded that "the sample datasets truthfully reflect the daily and hourly activity patterns of the Twitter users. (...) Even with a very small sampling ratio (i.e., 0.95%), the sample datasets (...) preserve the relative importance (i.e., frequency of appearance) of the content terms" [3].
Figure 7. Example of observable differences across languages and datasets; the general German and English tweets have been recorded from 9th to 12th Feb. 2020 for cross-validation purposes; the data for the #BTW17 dataset is taken from [8].
The evaluation of the compression rate in Section 4 has been done using the #BTW17 dataset [9]. This dataset was not recorded for that purpose. It was taken merely because it is one of the rare Twitter raw datasets that exist publicly. However, according to its general characteristics (see Figure 4 and [9]), it should be big enough and typically shaped to derive a realistic picture of the compression capabilities of Twista. Nevertheless, if datasets are collected that differ significantly from the percentages of tweet types shown in Figure 7, different compression rates are expectable. For instance, an unrealistic dataset containing only status posts would result in almost no compression at all. Figure 7 shows that slightly different percentages of tweet types are expectable across different languages. According to Figure 7, German tweeting users tend to reply more often than English tweeting users. On the other hand, English tweeting users seem to retweet more often than German tweeting users. Because retweets contain the retweeted tweet and replies do not (due to historical backward compatibility reasons of the Twitter API), English tweets should compress slightly better than German tweets. However, this study does not investigate such aspects more deeply. Nevertheless, effects like that should be expected to some degree. Figure 7 furthermore shows another interesting aspect. Even within the same language (here German), the focus of the recording can influence the percentages of retweets, replies, statuses, and quotes. In political contexts, retweets and replies seem to occur more often than in other, not explicitly specified contexts (at least in Germany). So, political recordings should compress slightly better than the general Twitter "basic noise".
Twitter interactions happen in the open space, and every Twitter user is aware of that by accepting the Twitter terms and conditions. However, recording with Twista enables curating large-scale and long-term datasets of Twitter social network events from which more in-depth insights about individuals might be deduced than are detectable by only analyzing the short-lived real-time stream of social network interactions. Therefore, it is emphasized that the Twitter User Protection terms of use and general ethical considerations must be respected under all circumstances; these comprise explicitly the following aspects:
- The data may not be used to conduct surveillance or gather intelligence with the primary purpose of isolating a group of individuals or any single individual for any discriminatory purpose.
- The data may not be used to target, segment, or profile individuals due to their political affiliation or any other category of personal information.
6. Conclusions
Twitter provides a free sample of all events flowing through its streaming APIs and is, therefore, a valuable source for research. However, this data must be captured and archived. Plenty of studies have made use of this data, but the recording, scraping, and archiving of this data seems to be more of an "art" than the systematic application of standardized tools. Almost every social network-related study seems to develop its own specific recording and data processing tool suite. This situation leads to datasets in varying formats captured with hardly documented toolsets and recording methodologies. In other words, the datasets are hardly comparable.
One problem is the amount of data that 145 million active users a day are producing. It was shown that this data is full of redundancy, quickly resulting in terabytes of data and in datasets that are hardly processable and shareable for research.
Gladly, it is possible to minimize the redundancy to make effective use of this valuable data source for research. Social network datasets can be reduced to about 20% of the original raw data without losing the valuable relationships between social network events.
This paper presented and evaluated a solution proposal. Twista can be used as a standardized means to record and generate Twitter datasets for research. Twista is available as open-source software [30]. In a long-term evaluation since April 2019, Twista has recorded the complete public sample of the German Twitter stream [26]. Although the dataset for each month is about 1 GB of data, this is astonishingly small for a social network stream covering all German tweets. That such large-scale datasets are effectively recordable is only possible because of the systematic exploitation of the inherent redundancy of such social network event streams.
Ongoing work will focus on better integration with dataset platforms like Zenodo and graph databases like Neo4j to simplify the curation, sharing, and updating of large-scale social network datasets as well as their comparable and reproducible analysis.
Funding: This research received no external funding. Especially not from Twitter, or governmental agencies with technological needs or wishes for mass surveillance.

Conflicts of Interest: The author declares no conflict of interest.
Abbreviations

The following abbreviations are used in this manuscript:

API  Application Programming Interface
CSV  Comma Separated Values (data format)
JSON JavaScript Object Notation (data format)
UML  Unified Modeling Language
TXT  Text file (data format)
XML  Extensible Markup Language (data format)
ZIP  compressed data format
References

1. Barberá, P.; Rivero, G. Understanding the Political Representativeness of Twitter Users. Social Science Computer Review 2015, 33, 712–729. doi:10.1177/0894439314558836.
2. Morstatter, F.; Pfeffer, J.; Liu, H.; Carley, K.M. Is the Sample Good Enough? Comparing Data from Twitter's Streaming API with Twitter's Firehose. Seventh International AAAI Conference on Weblogs and Social Media, 2013.
3. Wang, Y.; Callan, J.; Zheng, B. Should We Use the Sample? Analyzing Datasets Sampled from Twitter's Stream API. ACM Trans. Web 2015, 9. doi:10.1145/2746366.
4. Wang, H.; Can, D.; Kazemzadeh, A.; Bar, F.; Narayanan, S. A System for Real-Time Twitter Sentiment Analysis of the 2012 US Presidential Election Cycle. Proceedings of the ACL 2012 System Demonstrations. Association for Computational Linguistics, 2012, pp. 115–120.
5. Crooks, A.; Croitoru, A.; Stefanidis, A.; Radzikowski, J. #Earthquake: Twitter as a Distributed Sensor System. Transactions in GIS 2013, 17, 124–147. doi:10.1111/j.1467-9671.2012.01359.x.
6. Oliveira, R.; Almeida, P.; de Abreu, J.F. From Live TV Events to Twitter Status Updates: A Study on Delays. Iberoamerican Conference on Applications and Usability of Interactive TV. Springer, 2016, pp. 117–128.
7. Gayo-Avello, D.; Metaxas, P.T.; Mustafaraj, E. Limits of Electoral Predictions Using Twitter. Fifth International AAAI Conference on Weblogs and Social Media, 2011.
8. Kratzke, N. The BTW17 Twitter Dataset – Recorded Tweets of the Federal Election Campaigns of 2017 for the 19th German Bundestag. Data 2017, 2. doi:10.3390/data2040034.
9. Kratzke, N. The #BTW17 Twitter Dataset – Recorded Tweets of the Federal Election Campaigns of 2017 for the 19th German Bundestag, 2017. doi:10.5281/zenodo.835735.
10. Cook, J.M. Twitter Adoption and Activity in US Legislatures: A 50-State Study. American Behavioral Scientist 2017, 61, 724–740.
11. Fraisier, O.; Cabanac, G.; Pitarch, Y.; Besancon, R.; Boughanem, M. #Élysée2017fr: The 2017 French Presidential Campaign on Twitter. Twelfth International AAAI Conference on Web and Social Media, 2018.
12. Stier, S.; Bleier, A.; Bonart, M.; Mörsheim, F.; Bohlouli, M.; Nizhegorodov, M.; Posch, L.; Maier, J.; Rothmund, T.; Staab, S. Systematically Monitoring Social Media: The Case of the German Federal Election 2017. arXiv preprint arXiv:1804.02888, 2018.
13. Baviera, T.; Calvo, D.; Llorca-Abad, G. Mediatisation in Twitter: An Exploratory Analysis of the 2015 Spanish General Election. The Journal of International Communication 2019, 25, 275–300.
14. Waisbord, S.; Amado, A. Populist Communication by Digital Means: Presidential Twitter in Latin America. Information, Communication & Society 2017, 20, 1330–1346.
15. Shao, C.; Hui, P.M.; Wang, L.; Jiang, X.; Flammini, A.; Menczer, F.; Ciampaglia, G.L. Anatomy of an Online Misinformation Network. PLOS ONE 2018, 13, 1–23. doi:10.1371/journal.pone.0196087.
16. Moriano, P.; Finke, J.; Ahn, Y.Y. Community-Based Event Detection in Temporal Networks. Scientific Reports 2019, 9, 1–9.
17. Mazza, M.; Cresci, S.; Avvenuti, M.; Quattrociocchi, W.; Tesconi, M. RTbust: Exploiting Temporal Patterns for Botnet Detection on Twitter. Proceedings of the 10th ACM Conference on Web Science; Association for Computing Machinery: New York, NY, USA, 2019; WebSci '19, pp. 183–192. doi:10.1145/3292522.3326015.
18. Basak, R.; Sural, S.; Ganguly, N.; Ghosh, S.K. Online Public Shaming on Twitter: Detection, Analysis, and Mitigation. IEEE Transactions on Computational Social Systems 2019, 6, 208–220. doi:10.1109/TCSS.2019.2895734.
19. Kratzke, N. Twitter Datasets, 2017–2020. https://zenodo.org/communities/twitter-datasets.
20. Nielsen, L.H.; Smith, T. Introducing ZENODO, 2013. doi:10.5281/zenodo.7111.
21. Shao, C.; Hui, P.M.; Wang, L.; Jiang, X.; Flammini, A.; Menczer, F.; Ciampaglia, G.L. Anatomy of an Online Misinformation Network, 2018. doi:10.5281/zenodo.1154840.
22. Moriano, P.; Finke, J.; Ahn, Y.Y. Community-Based Event Detection in Temporal Networks, 2018. doi:10.5281/zenodo.1321085.
23. Basak, R. Online Public Shaming on Twitter – Dataset, 2019. doi:10.5281/zenodo.2587843.
24. Halasz, P. Twitter Pre-Trained Word Vectors, 2019. doi:10.5281/zenodo.3237458.
25. Mazza, M.; Cresci, S.; Avvenuti, M.; Quattrociocchi, W.; Tesconi, M. Italian Retweets Timeseries, 2019. doi:10.5281/zenodo.2653138.
26. Kratzke, N. Monthly Samples of German Tweets, 2019–2020. doi:10.5281/zenodo.2783954.
27. Roesslein, J. Tweepy, 2009–2020. https://tweepy.org.
28. Hui, P.M.; Yang, K.C.; Torres-Lugo, C.; Monroe, Z.; McCarty, M.; Serrette, B.; Pentchev, V.; Menczer, F. BotSlayer: Real-Time Detection of Bot Amplification on Twitter. Journal of Open Source Software 2019. doi:10.21105/joss.01706.
29. Zacharias, C.; Poldi, F. TWINT – Twitter Intelligence Tool, 2017–2020. https://github.com/twintproject/twint.
30. Kratzke, N. Twista – A Twitter Streaming and Analysis Command Line Tool Suite, 2017–2020. doi:10.5281/zenodo.845856.
© 2020 by the author. Submitted to Information for possible open access publication under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
ResearchGate has not been able to resolve any citations for this publication.
Article
Full-text available
We propose a method for detecting large events based on the structure of temporal communication networks. Our method is motivated by findings that viral information spreading has distinct diffusion patterns with respect to community structure. Namely, we hypothesize that global events trigger viral information cascades that easily cross community boundaries and can thus be detected by monitoring intra- and inter-community communications. By comparing the amount of communication within and across communities, we show that it is possible to detect events, even when they do not trigger a significantly larger communication volume. We demonstrate the effectiveness of our method using two examples—the email communication network of Enron and the Twitter communication network during the Boston Marathon bombing.
Article
Full-text available
Massive amounts of fake news and conspiratorial content have spread over social media before and after the 2016 US Presidential Elections despite intense fact-checking efforts. How do the spread of misinformation and fact-checking compete? What are the structural and dynamic characteristics of the core of the misinformation diffusion network, and who are its main purveyors? How to reduce the overall amount of misinformation? To explore these questions we built Hoaxy, an open platform that enables large-scale, systematic studies of how misinformation and fact-checking spread and compete on Twitter. Hoaxy filters public tweets that include links to unverified claims or fact-checking articles. We perform k-core decomposition on a diffusion network obtained from two million retweets produced by several hundred thousand accounts over the six months before the election. As we move from the periphery to the core of the network, fact-checking nearly disappears, while social bots proliferate. The number of users in the main core reaches equilibrium around the time of the election, with limited churn and increasingly dense connections. We conclude by quantifying how effectively the network can be disrupted by penalizing the most central nodes. These findings provide a first look at the anatomy of a massive online misinformation diffusion network.
Article
Social media feeds are rapidly emerging as a novel avenue for the contribution and dissemination of information that is often geographic. Their content often includes references to events occurring at, or affecting specific locations. Within this article we analyze the spatial and temporal characteristics of the twitter feed activity responding to a 5.8 magnitude earthquake which occurred on the East Coast of the United States (US) on August 23, 2011. We argue that these feeds represent a hybrid form of a sensor system that allows for the identification and localization of the impact area of the event. By contrasting this with comparable content collected through the dedicated crowdsourcing ‘Did You Feel It?’ (DYFI) website of the U.S. Geological Survey we assess the potential of the use of harvested social media content for event monitoring. The experiments support the notion that people act as sensors to give us comparable results in a timely manner, and can complement other sources of data to enhance our situational awareness and improve our understanding and response to such events.
Conference Paper
This paper describes a system for real-time analysis of public sentiment toward presidential candidates in the 2012 U.S. election as expressed on Twitter, a micro-blogging service. Twitter has become a central site where people express their opinions and views on political parties and candidates. Emerging events or news are often followed almost instantly by a burst in Twitter volume, providing a unique opportunity to gauge the relation between expressed public sentiment and electoral events. In addition, sentiment analysis can help explore how these events affect public opinion. While traditional content analysis takes days or weeks to complete, the system demonstrated here analyzes sentiment in the entire Twitter traffic about the election, delivering results instantly and continuously. It offers the public, the media, politicians and scholars a new and timely perspective on the dynamics of the electoral process and public opinion.
Article
In this article, we analyze the structure and content of the political conversations that took place through the microblogging platform Twitter in the context of the 2011 Spanish legislative elections and the 2012 U.S. presidential elections. Using a unique database of nearly 70 million tweets collected during both election campaigns, we find that Twitter replicates most of the existing inequalities in public political exchanges. Twitter users who write about politics tend to be male, to live in urban areas, and to have extreme ideological preferences. Our results have important implications for future research on the relationship between social media and politics, since they highlight the need to correct for potential biases derived from these sources of inequality.
Article
Researchers have begun studying content obtained from microblogging services such as Twitter to address a variety of technological, social, and commercial research questions. The large number of Twitter users and even larger volume of tweets often make it impractical to collect and maintain a complete record of activity; therefore, most research and some commercial software applications rely on samples, often relatively small samples, of Twitter data. For the most part, sample sizes have been based on availability and practical considerations. Relatively little attention has been paid to how well these samples represent the underlying stream of Twitter data. To fill this gap, this article performs a comparative analysis on samples obtained from two of Twitter's streaming APIs with a more complete Twitter dataset to gain an in-depth understanding of the nature of Twitter data samples and their potential for use in various data mining tasks.
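A truly uniform sample of a stream — the baseline such comparisons measure against — can be drawn with reservoir sampling. This sketch is an illustration of that baseline only; it does not reproduce Twitter's actual sampling mechanism, which is not uniform in this sense.

```python
# Sketch: uniform k-item sample of an arbitrarily long stream
# (Vitter's Algorithm R), usable as a reference sample.
import random

def reservoir_sample(stream, k, seed=42):
    """Return k items drawn uniformly from a stream of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)       # fill the reservoir first
        else:
            j = rng.randint(0, i)     # replace with decreasing probability
            if j < k:
                sample[j] = item
    return sample

sample = reservoir_sample(range(100_000), 100)
```

Comparing statistics (hashtag ranks, user activity) between such a uniform sample and an API-provided sample is one way to quantify the representativeness gap the article investigates.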
Morstatter, F.; Pfeffer, J.; Liu, H.; Carley, K.M. Is the sample good enough? Comparing data from Twitter's streaming API with Twitter's Firehose. Seventh International AAAI Conference on Weblogs and Social Media, 2013.