Just Another Day on Twitter: A Complete 24 Hours of Twitter Data
Jürgen Pfeffer1, Daniel Matter1, Kokil Jaidka2, Onur Varol3, Afra Mashhadi4, Jana Lasser5,15,
Dennis Assenmacher6, Siqi Wu7, Diyi Yang8, Cornelia Brantner9, Daniel M. Romero7, Jahna
Otterbacher10, Carsten Schwemmer11, Kenneth Joseph12, David Garcia13, Fred Morstatter14
1School of Social Sciences and Technology, Technical University of Munich, 2Centre for Trusted Internet and Community,
National University of Singapore, 3Sabanci University, 4University of Washington (Bothell), 5Graz University of Technology,
6GESIS - Leibniz Institute for the Social Sciences, 7University of Michigan, 8Stanford University, 9Karlstad University,
10Open University of Cyprus & CYENS CoE, 11Ludwig Maximilian University of Munich, 12University at Buffalo,
13University of Konstanz, 14Information Sciences Institute, University of Southern California, 15Complexity Science Hub Vienna
Abstract
At the end of October 2022, Elon Musk concluded his acqui-
sition of Twitter. In the weeks and months before that, sev-
eral questions were publicly discussed that were not only of
interest to the platform’s future buyers, but also of high rele-
vance to the Computational Social Science research commu-
nity. For example, how many active users does the platform
have? What percentage of accounts on the site are bots? And,
what are the dominating topics and sub-topical spheres on the
platform? In a globally coordinated effort of 80 scholars to
shed light on these questions, and to offer a dataset that will
equip other researchers to do the same, we have collected all
375 million tweets published within a 24-hour time period
starting on September 21, 2022. To the best of our knowl-
edge, this is the first complete 24-hour Twitter dataset that
is available for the research community. With it, the present
work aims to accomplish two goals. First, we seek to an-
swer the aforementioned questions and provide descriptive
metrics about Twitter that can serve as references for other
researchers. Second, we create a baseline dataset for future
research that can be used to study the potential impact of the
platform’s ownership change.
Introduction
On March 21, 2006, Twitter’s first CEO Jack Dorsey sent
the first message on the platform. In the subsequent 16 years,
close to 3 trillion tweets have been sent.1 Roughly two-thirds of these are no longer available: the senders deleted them, the accounts (and all their tweets) were banned from the platform or made private by the users, or the tweets are otherwise inaccessible via historical search with the v2 API endpoints. We estimate that about 900 billion public tweets were on the platform when Elon Musk acquired Twitter in October 2022 for $44B, i.e., he paid about 5 cents per tweet.
Besides its possible economic value, Twitter has been
instrumental in studying human behavior with social me-
dia data, and the entire field of Computational Social Science (CSS) has heavily relied on data from Twitter.
1 While we do not have an official source for this number, it rep-
resents an educated guess from a collaboration of dozens of schol-
ars of Twitter.
At the AAAI International Conference on Web and Social Media (ICWSM), in the past two years alone (2021–2022), over 30 scientific papers analyzed a subset of Twitter, covering topics ranging from public and mental health analyses to politics and partisanship. Indeed, since its emer-
gence, Twitter has been described as a digital socioscope
(i.e., social telescope) by researchers in fields of social sci-
ence (Mejova, Weber, and Macy 2015), “a massive antenna
for social science that makes visible both the very large (e.g.,
global patterns of communications) and the very small (e.g.,
hourly changes in emotions)”. Beyond CSS, there is increas-
ing use of Twitter data for training large pre-trained language
models in the field of natural language processing and ma-
chine learning, such as Bernice (DeLucia et al. 2022), where
2.5 billion tweets are used to develop representations for
Twitter-specific languages, and TwHIN-BERT (Zhang et al.
2022) that leverages 7 billion tweets covering over 100 dis-
tinct languages to model short, noisy, and user-generated
text.
Although Twitter data has fostered interdisciplinary re-
search across many fields and has become a “model organ-
ism” of big data, scholarship using Twitter data has also
been criticized for various forms of bias that can emerge
during analyses (Tufekci 2014). One major challenge giving rise to these biases is gaining access to data and understanding its quality and possible biases (Ruths and Pfeffer 2014; González-Bailón et al. 2014; Olteanu et al. 2019).
While Twitter has long served as one of the most collabora-
tive big social media platforms in the context of data-sharing
with academic researchers, there nonetheless exists a lack
of transparency in sampling procedures and possible biases
created from technical artifacts (Morstatter et al. 2013; Pf-
effer, Mayer, and Morstatter 2018). These unknown biases
may jeopardize research quality. At the same time, unfiltered/unsampled Twitter data is nearly impossible to access, and thus the above-mentioned studies, as well as thousands of others, still retain unknown and potentially significant biases in their use of sampled data.
Contributions. The data collection efforts presented in
this paper were driven by a desire to address these concerns
about sampling bias that exist because of the lack of a com-
plete sample of Twitter data. Consequently, the main contri-
bution of this article is to create the first complete dataset of
24 hours on Twitter and make these Tweets available via fu-
ture collaborations with the authors and contributors of this
article. The dataset collected and described here can be used
by the research community to:
• Promote a better understanding of the communication
dynamics on the platform. For example, it can be used
to answer questions like, how many active (posting) ac-
counts are on the platform? And, what are the dominating
languages and topics?
• Create a set of descriptive metrics that can serve as refer-
ences for the research community and provide context to
past and present research papers on Twitter.
• Provide a baseline for the situation before the recent
sale of Twitter. With the new ownership of Twitter, plat-
form policies as well as the company structures are un-
der significant change, which will create questions about
whether previous Twitter studies will be still valuable ref-
erences for future studies.
In the following sections, we describe the data collection
process and provide some descriptive analyses of the dataset.
We also discuss ethical considerations and data availability.
Data
Data Collection. We have collected 24 hours of Twitter data, from September 20, 2022, 15:00:00 UTC to September 21, 2022, 14:59:59 UTC. The data collection was accomplished by
utilizing the Academic API (Pfeffer et al. 2022) that is free
and openly available for researchers. The technical setup of
the data collection pipeline was dominated by two major
challenges: First, how can we avoid—at least to a satisfying
extent—a temporal bias in data collection? Second, how can
we get a good representation of Twitter? In the following,
these two aspects are discussed in more detail.
What is a complete dataset? What does complete mean
when we want to collect a day’s worth of Twitter data? It has
been shown previously that the availability of tweets fluctu-
ates, especially in the first couple of minutes (Pfeffer et al.
2022)—people might delete their tweets because of typos,
tweets might be removed because of violations of terms of
service, etc. To reduce this initial uncertainty, we have de-
cided to collect the data 10 minutes after the tweets were
sent. Consequently, this dataset does not include all tweets
that were sent on the collection day but instead tries to create
a somewhat stable representation of Twitter.
Avoiding temporal collection bias. We wanted to collect
a set of tweets close to the time when they were created.
However, collecting data takes time, which can introduce
possible temporal bias, e.g., if we want to collect data from
the previous hour and the data collection job takes three
hours, then the data that is collected at the end of the col-
lection job will be much older (with potentially more tweet
removals) than the data that is collected at the beginning. To
tackle this challenge, we have split the day into 86,400 col-
lection tasks, each consisting of 1 second of Twitter activ-
ity. The collection of every second of data started exactly 10
minutes after the data creation time.
Figure 1: Tweets per minute over the 24-hour collection period, time in UTC.
Because the data collec-
tion of a second took more than a minute during peak times,
we have distributed the workload to 80 collection processes,
i.e., Academic API tokens, in order to avoid backlogs.
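To make this setup concrete, the following is a minimal sketch of a single one-second collection task, assuming Twitter's v2 full-archive search endpoint as used by the Academic API. The query placeholder, field selection, and the function name collect_second are illustrative only; the actual pipeline's scheduling, rate-limit handling, and error recovery are omitted.

```python
import datetime as dt
import requests

SEARCH_URL = "https://api.twitter.com/2/tweets/search/all"  # v2 full-archive search
QUERY = "<query matching (nearly) all tweets>"              # placeholder; not the actual production query


def collect_second(second_start: dt.datetime, bearer_token: str) -> list[dict]:
    """Collect the tweets created within a single second of the target day.

    In the pipeline described above, this task is scheduled to run exactly
    10 minutes after `second_start`.
    """
    headers = {"Authorization": f"Bearer {bearer_token}"}
    params = {
        "query": QUERY,
        "start_time": second_start.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "end_time": (second_start + dt.timedelta(seconds=1)).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "max_results": 500,
        "tweet.fields": "created_at,lang,public_metrics,referenced_tweets,geo,entities",
        "expansions": "author_id",
        "user.fields": "public_metrics,verified,created_at",
    }
    tweets = []
    while True:
        resp = requests.get(SEARCH_URL, headers=headers, params=params, timeout=30)
        resp.raise_for_status()
        page = resp.json()
        tweets.extend(page.get("data", []))
        next_token = page.get("meta", {}).get("next_token")
        if next_token is None:
            return tweets
        params["next_token"] = next_token
```

In the actual pipeline, one such task exists for every second of the day (86,400 in total), each launched 10 minutes behind real time and distributed over the 80 tokens to avoid backlogs.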
Number of tweets. With the above-described process, we have collected 374,937,971 tweets within the 24-hour time span. On average, this amounts to 4,340 tweets per second (min=2,989, max=8,955). Fig. 1 plots the number of tweets per minute (avg=260,374, min=192,322, max=435,721). The data collection started at 15:00 UTC, when almost the entire Twitter world is awake. Then, from Japan to Europe, we can see one time zone after another getting off the platform. While Europe and the Americas are sleeping, Asia keeps the number of tweets at around 200,000 per minute. Starting at 7:00
UTC, Europe is getting active again, followed by the Amer-
icas from East to West. Another astonishing observation of
this time series is that the first minute of every hour has on
average 15.5% more tweets than the minute before—most
likely due to bot activities and other timed tweet releases,
e.g., news.
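The per-minute statistics and the first-minute-of-the-hour effect can be reproduced with a few lines of pandas. The sketch below assumes the collected tweets have been loaded into a DataFrame df with a created_at column; this local setup is an assumption, not part of the release format.

```python
import pandas as pd

# df: one row per collected tweet with a "created_at" timestamp column (assumption).
df["created_at"] = pd.to_datetime(df["created_at"], utc=True)
per_minute = df.set_index("created_at").resample("1min").size()

print(per_minute.mean(), per_minute.min(), per_minute.max())  # avg/min/max tweets per minute

# Compare the first minute of every hour (minute :00) with the minute before it (:59).
is_first = per_minute.index.minute == 0
uplift = (per_minute[is_first] / per_minute.shift(1)[is_first] - 1).dropna().mean()
print(f"first minute of the hour vs. preceding minute: +{uplift:.1%} on average")
```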
Descriptive Analyses
Active Users
The 375 million tweets in our dataset were sent by
40,199,195 accounts. While the publicly communicated
numbers of users of a platform are often based on the num-
ber of active and passive visitors, we can state that Twitter
has (or at least had on our observed day) 40 million active
contributors who have sent at least one tweet. Fewer than 100 accounts created about 1% (=3.5M) of all tweets, and 175,000 accounts (0.44%) created 50% of all tweets (see Table 1).
These numbers are not surprising when we consider that
>95% of active accounts have sent one or two tweets. How-
ever, these numbers lend more nuance to recent reports from
the Pew Research Center, which reported that while the ma-
jority of Americans use social media, approximately 97% of
all tweets were posted by 25% of the users (McClain 2021).
Figure 2: All languages occurring in at least 1% of the tweets. Proportions: en 0.31, ja 0.165, es 0.073, tr 0.053, ar 0.05, und 0.049, pt 0.044, th 0.04, ko 0.03, fa 0.024, zxx 0.023, in 0.022, qme 0.017, fr 0.015, hi 0.01.
In fact, our dataset suggests that worldwide, the numbers
may be more skewed than previously suggested.
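Concentration figures of this kind (the share of accounts behind a given share of tweets, as in Table 1) can be derived directly from per-author tweet counts. The sketch below is a minimal illustration assuming a DataFrame df of the collected tweets with an author_id column.

```python
import numpy as np
import pandas as pd

# df: one row per collected tweet with an "author_id" column (assumption).
tweets_per_user = df["author_id"].value_counts()               # sorted in descending order
cum_share = tweets_per_user.cumsum() / tweets_per_user.sum()   # cumulative share of all tweets


def users_for_tweet_share(share: float) -> tuple[int, float, int]:
    """Smallest set of most active accounts that produces `share` of all tweets."""
    n_users = int(np.searchsorted(cum_share.to_numpy(), share)) + 1
    return n_users, n_users / len(tweets_per_user), int(tweets_per_user.iloc[n_users - 1])


for s in (0.01, 0.10, 0.25, 0.50, 0.75, 0.90):
    n, frac, min_tweets = users_for_tweet_share(s)
    print(f"{s:.0%} of tweets: {n:,} accounts ({frac:.5%}), each with >= {min_tweets} tweets")
```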
User metrics
Followers. The active accounts on our day of Twitter
data have a mean of 2,123 followers (median=99). We can
find six accounts with more than 100 million followers
(max=133,301,854), 427 accounts with more than 10 million followers, and 8,635 with more than 1 million. Exactly 50% of accounts that were active on our collection day have fewer than 100 followers.
Following. These accounts follow far fewer other accounts: mean=547, median=197, range: 0–4,103,801. Interestingly, there are 2,377 accounts that follow more than 100,000 other accounts. One-third of accounts follow fewer than 100 accounts, but only 1.7% of accounts follow zero other accounts.
Listed. Lists are a Twitter feature for users to organize
accounts around topics and filter tweets. While there is lit-
tle evidence that lists are used widely on the platform, this
feature might be useful for getting an impression about the
number of interesting content creators on the platform. The
40 million active accounts in our dataset are listed (i.e., the number of lists that include a user) in 0 to 3,086,443 lists (mean=10.1, median=0); 1,692 accounts appear in at least 10,000 lists, and 46,139 in at least 1,000 lists.
Tweets sent. The user information of the tweet metadata
also includes the number of tweets that a user has sent—or
at least how many of those tweets are still available on Twitter. The sum of the sent tweets variable of all 40 million accounts is 404 billion (mean=9,704, median=1,522).
Table 1: Distribution of user activity
% of Total Tweets  % of Total Users  Min. no. of Tweets
1%  0.00023%  2,267
10%  0.01199%  465
25%  0.07284%  152
50%  0.43526%  39
75%  1.70955%  11
90%  4.18836%  3
If we
assume that our initial estimate of having 900 billion tweets
on the platform at the time of data collection is somewhat
correct, the accounts active in our dataset have contributed
45% of all of the available tweets over the entire lifetime
of Twitter.
Verified accounts. At the time of our data collection, we identified 221,246 verified accounts among the 40 million active users.
Tweets and retweets
79.2% of all tweets refer to other tweets, i.e. they are
retweets or quotes of or replies to other tweets. Conse-
quently, 20.8% of the tweets in our dataset are original
tweets. Relative to all tweets, the referencing tweets break down into 50.7% retweets, 4.3% quotes, and 24.2% replies, i.e., half of all tweets are retweets and a quarter are replies.
Retweeted and liked. Studying the retweet and like numbers from the tweets' metadata has yielded little insight, since the top retweeted tweets are very old tweets that happened to be retweeted on our collection day. Furthermore, we can see the number of likes only for tweets that have been tweeted and retweeted. In any case, the retweet numbers are interesting: the 374 million tweets have been retweeted 401 billion times in total. In other words, significant parts of historic Twitter get retweeted on a daily basis.
Languages
Twitter annotates a language variable for every tweet. Fig. 2
shows those languages that were annotated on at least 1% of
our dataset. Together, these 15 languages make up 92.5% of
all tweets. Besides the most common languages on Twitter,
we can also find interesting language codes in this list: und
stands for undefined and represents tweets for which Twitter
was not able to identify a language; qme and zxx seem to
be used by Twitter for tweets consisting of only media or a
Twitter card.
Media
There are 112,779,266 media attachments in our data collec-
tion (76.9% photos, 20.7% videos, 2.4% animated GIFs), of
which 37,803,473 have unique media keys (83.8% photos,
10.0% videos, 6.2% animated GIFs).
Geo-tags
We found only 0.5% of tweets to be geo-tagged. This is
not surprising, as previous works have shown that the percentage of geo-tagged tweets on Twitter has been declining (Ajao,
Hong, and Liu 2015). Fig. 3 shows the distribution of the
geo-tagged tweets across the world, with USA (20%), Brazil
(11%), Japan (8%), Saudi Arabia (6%) and India (4%) being
the top five countries.
Figure 3: Choropleth map of the geo-tagged tweets across the world.
Estimating prevalence of bot accounts
Twitter has a pivotal role in public discourse, and entities seeking power and influence often utilize this platform through social bots and other means of automated activities. Since the early days of Twitter, researchers have been
studying bot behavior, and it has become an active research
area (Ferrara et al. 2016; Cresci 2020). The first estimate of bot prevalence on Twitter indicated that 9–15% of Twitter accounts exhibit automated behavior (Varol et al. 2017),
while others have observed significantly higher percentages
of tweets produced by bot-likely accounts on specific dis-
courses (Uyheng and Carley 2021; Antenore, Camacho Ro-
driguez, and Panizzi 2022). One major challenge in estimat-
ing bot prevalence is the variety of definitions, datasets, and
models used for detection (Varol 2022).
In this study, we employed BotometerLite (Yang et al. 2020), a scalable and lightweight version of Botometer (Sayyadiharikandeh et al. 2020), to compute bot scores for the unique accounts in our collection. In Fig. 4a, we present the distribution of bot scores; nearly 20% of the 40 million active accounts have scores above 0.5, suggesting bot-like behavior.
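For readers who want to reproduce this kind of scoring, the sketch below uses the BotometerLite client from the botometer-python package. The constructor arguments, batch limits, and returned fields are assumptions based on that package's public documentation and should be checked against the installed version; the credential is a placeholder.

```python
from botometer import BotometerLite

# Placeholder credential; a RapidAPI key for the BotometerLite endpoint is required.
blt = BotometerLite(rapidapi_key="YOUR_RAPIDAPI_KEY")

# BotometerLite can score accounts from tweet objects (user profile plus one tweet),
# which matches our setting: every account in the dataset has at least one tweet.
# `tweet_batch` is assumed to be a small list of tweets in the format the client
# expects (v1.1-style dictionaries at the time of writing), within the per-request limit.
scores = blt.check_accounts_from_tweets(tweet_batch)
for entry in scores:
    print(entry["user_id"], entry["botscore"])
```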
While the identification of bots is a complex and possibly controversial challenge, plotting the distributions of BotometerLite scores grouped by account age in Fig. 4b suggests that the proportion of accounts showing bot-like behavior has increased dramatically in recent years. This result may
also suggest that the longevity of simpler bot accounts is
limited and they are no longer active on the platform. In Fig.
4c, we also present the distribution of bot scores for differ-
ent rates of activities in our dataset. Accounts that have over
1,000 posts exhibit higher rates of bot-like behaviors.
It is important to mention that accounts studied in this pa-
per were identified due to their content creation activities.
Our collection cannot capture passive accounts that are sim-
ply used to boost follower counts without visible activity on
tweet streams. A fair assessment of bot prevalence is only possible with complete access to Twitter's internal database, since activity streams, network data, and historical tweet archives can capture different sets of accounts (Varol 2022).
Content on Twitter
The top 500 hashtags occurred 81,468,508 times in the tweets. Via manual inspection, we were able to identify the meaning of 95% of these top hashtags. They can be aggregated into ten categories.
Table 2 suggests that a large proportion of tweets referred
to entertainment, which together comprised about 30% of
tweets. These included mentions of celebrities (25.5%) and
other entertainment-related tweets (5.4%) such as mentions
of South Korean boy band members, and other references
to music, movies, and TV shows. Our data collection time
window occurred during Fall/Winter 2022, when the world
was discussing the protests in Iran after the death of Mahsa
Amini. Therefore, the Iranian protests also comprised a large
proportion of the hashtag volume at 16.6%.
Finally, and perhaps surprisingly, the category sex com-
prised over a quarter of all content covered by the top hash-
tags, and was almost completely related to escorts. “Other”
topics reflect that on “regular” Twitter days, sports, tech, and
art may take up only about 3.3% of Twitter volume.
Fig. 5 is a hashtag visualization that attempts to provide an
overview of the entire content on Twitter. We first removed
all tweets from accounts with more than 240 tweets to re-
duce the noise from bots using random trending hashtags.
From the remaining tweets, we extracted the 10,000 most frequently used hashtags in our dataset and created a hashtag similarity matrix based on the number of accounts that used each pair of hashtags on the day of data collection. Every el-
ement in Fig. 5 represents a hashtag. The position is the re-
sult of Multidimensional Scaling (MDS) and the color shows
the dominant language that was used in the tweets with the
particular hashtag. In this figure, we can see how languages
separate the Twitter universe but that there are also topical
sub-communities within languages.
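A small-scale version of this procedure can be sketched as follows, assuming a dictionary hashtags_by_account that maps each account to the set of hashtags it used on the collection day. The co-usage-to-dissimilarity conversion shown here (maximum count minus pairwise count) is one simple choice and not necessarily the exact transformation behind Fig. 5, and the sketch is limited to 1,000 hashtags because classical MDS on 10,000 items is computationally heavy.

```python
from collections import Counter
from itertools import combinations

import numpy as np
from sklearn.manifold import MDS

# hashtags_by_account: dict mapping account id -> set of hashtags used that day (assumption).
tag_counts = Counter(tag for tags in hashtags_by_account.values() for tag in tags)
top_tags = [tag for tag, _ in tag_counts.most_common(1000)]
index = {tag: i for i, tag in enumerate(top_tags)}

# Similarity: number of accounts that used both hashtags of a pair.
sim = np.zeros((len(top_tags), len(top_tags)))
for tags in hashtags_by_account.values():
    kept = [t for t in tags if t in index]
    for a, b in combinations(kept, 2):
        sim[index[a], index[b]] += 1
        sim[index[b], index[a]] += 1

# Turn co-usage into a dissimilarity matrix and embed it in two dimensions.
dissim = sim.max() - sim
np.fill_diagonal(dissim, 0.0)
coords = MDS(n_components=2, dissimilarity="precomputed", random_state=0).fit_transform(dissim)
# coords[i] is the 2-D position of top_tags[i]; each point can then be colored
# by the dominant language of the tweets using that hashtag.
```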
Discussion and Potential Applications
Twitter is a social media platform with a worldwide user-
base. Open access to its data also makes it attractive to a
large community of researchers, journalists, technologists,
and policymakers who are interested in examining social
and civic behavior online. Early studies of Twitter explored
who says what to whom on Twitter (Wu et al. 2011), char-
acterizing its primary use as a communication tool. Other
early work mapped follower communities through ego net-
works (Gruzd, Wellman, and Takhteyev 2011). However,
Twitter has since expanded into its own universe, with a
plethora of users, uses, modalities, communities, and real-
life implications.
Table 2: The categories of the top 500 hashtags in the dataset
Category  # Hashtags  Occurrences  Share
Celebrities 159 20,809,742 25.5%
Sex 104 20,529,196 25.2%
Iranian Protests 15 13,488,295 16.6%
Entertainment 45 4,392,227 5.4%
Advertisement 32 4,644,540 5.7%
Politics 38 3,858,550 4.7%
Finance 30 3,549,107 4.4%
Games 21 3,348,128 4.1%
Other 31 2,672,291 3.3%
Unknown 25 4,176,432 5.1%
Sum 500 81,468,508 100.0%
Figure 4: BotometerLite scores distribution: (a) histogram and cumulative distribution, (b) by account age, (c) by tweet counts
in our dataset.
Twitter is increasingly the source of breaking news, and many studies from the U.S. and Europe have
reported that Twitter is one of the primary sources of news
for their citizens. Twitter has been used for political engage-
ment and citizen activism worldwide. During the COVID-19 pandemic, Twitter even assumed the role of an official mouthpiece and crisis communication tool through which many governments contacted their citizens and citizens could seek help and information.
Fig. 3 confirms prior reports that geotagging practices are
limited in many low- and middle-income countries (Malik
et al. 2015); however, this should not deter scholars from ex-
ploring alternative methods of triangulating the location of
users (Schwartz et al. 2013), and creating post-stratified es-
timates of regional language use (Jaidka et al. 2020; Giorgi
et al. 2022). So far, the difficulties of large-scale data collection and analysis have meant that, in prior studies, most answers are based on smaller samples (usually constrained by geography, for convenience) of a burgeoning Twitter population. Fig. 5 and Table 2 also impressively illustrate that
Twitter is about so much more than US politics.
We hope that our dataset is the first step in creating al-
ternatives for conducting a representative and truly inclusive
analysis of the Twitterverse. Temporal snapshots are invalu-
able to map the national and international migration patterns
that increasingly blur geopolitical boundaries (Zagheni et al.
2014).
The increasing popularity of Twitter has led to issues of scale, where its moderation can no longer check the large proportion of bots on the platform. Our findings in Fig. 4
indicate that the infestation of bots may be more pernicious
than previously imagined. We are especially concerned that the escalation of Russia's war on Ukraine may be reflected in a spike (in our dataset) in the online activity of bots from Russia operated either by the Russian government or its allied intelligence agencies (Badawy, Ferrara, and Lerman 2018).
These and other bots serve to amplify trending topics and
facilitate the spread of misinformation (though, perhaps, at
a rate less than humans do (Vosoughi, Roy, and Aral 2018)).
They may also misuse hashtags to divert attention away from
social or political topics (Earl, Maher, and Pan 2022; Broni-
atowski et al. 2018) or strategically target influential users
(Shao et al. 2018; Varol and Uluturk 2020). We hope that
our work will spur more studies on these topics, and we wel-
come researchers to explore our data.
By observing bursts of discussions around politically
charged events and characterizing the temporal spikes in
Twitter topics, we can better rationalize how our experience
of Twitter as a political hotbed differs from the simplified
understanding of the American Twitter landscape reported
in Mukerjee, Jaidka, and Lelkes (2022), which suggested
that politics is largely a sideshow on Twitter. It is worth con-
sidering that these politically active users may not be rep-
resentative of social media users at large (McClain 2021;
Wojcieszak et al. 2022).
Twitter is also under scrutiny for how its platform gover-
nance may conflict with users’ interests and rights (Van Di-
jck, Poell, and De Waal 2018). Concerns have been raised
about alleged biases in the algorithmic amplification (and
deamplification) of content, with evidence from France,
Germany, Turkey, and the United States, among other coun-
tries (Majó-Vázquez et al. 2021; Tanash et al. 2015; Jaidka,
Mukerjee, and Lelkes 2023). Other scholars have also criti-
cized Twitter’s use as a censorship weapon by governments
and political propagandists worldwide (Varol 2016; Elmas,
Overdorf, and Aberer 2021; Jakesch et al. 2021). They, and
others, may be interested in examining the trends in the en-
forcement of content moderation policies by Twitter.
Figure 5: MDS of top 10,000 hashtags based on co-usage by same accounts; colors represent dominant language in tweets using a hashtag.
Besides answering questions of data, representativeness, access, and censorship, we anticipate that our dataset is suited to explore the temporal dynamics of online (mis)information in the following directions:
Content characteristics: We have provided a high-level exploration of the topics on Twitter. However, more can
be done with regard to understanding users’ concerns and
priorities. While hashtags act as signposts for the broader
Twitter community to find and engage in topics of mu-
tual interest (Cunha et al. 2011), tweets without hashtags
may offer a different understanding of Twitter discourse,
where users may engage in more interpersonal discus-
sions of news, politics, and sports than the numbers sug-
gest (Rajadesingan, Budak, and Resnick 2021).
Patterns of information dissemination: Informational
exchanges occurring on Twitter can overcome spatio-
temporal limitations as they essentially reconfigure user
connections to create newly emergent communities.
However, these communities may vanish as quickly as
they are created, as the lifecycle of a tweet determines
how long it continues to circulate on Twitter timelines.
To the best of our knowledge, no prior research has re-
ported on the average “age” of a tweet, and we hope that
a 24-hour snapshot will enable us to answer this question
empirically.
Content moderation and fake news: Prior research
suggests that 0.1% of Twitter users accounted for 80%
of all fake news sources shared in the lead-up to a
US election (Grinberg et al. 2019). However, we ex-
pect there to be cross-lingual differences in this distri-
bution, especially for low- or under-resourced languages
with fewer open tools for fact-checking. Similarly, we
expect that the quality of moderation and hate speech
will vary by geography and language, and recommend
the use of multilingual large language models to explore
these trends (with attention to persisting representative-
ness caveats (Wu and Dredze 2020)).
Mass mobilization: Twitter is increasingly the hotbed of
protest, which has led to some activists donning the role
of “movement spilloverers” (Zhou and Yang 2021) or se-
rial activists (Bastos and Mercea 2016) who broker infor-
mation across different online movements, thereby acting
as key coordinators, itinerants, or gatekeepers in the ex-
change of information. Such users, as well as the constant
communities in which they presumably reside (Chowd-
hury et al. 2022), may be easier to study through tempo-
ral snapshots, as facilitated by this dataset.
Echo chambers and filter bubbles: On Twitter, algo-
rithms can affect the information diets of users in over
200 countries, with an estimated 396.5 million monthly
users (Kemp 2022). Recent surveys of the literature have
considered the evidence on how platforms' designs and affordances influence users' behaviors, attitudes, and beliefs (González-Bailón and Lelkes 2022). Studies of the
structural and informational networks based on snapshots
of Twitter can offer clues to solving these puzzles with-
out the constraints of data selection.
Ethics Statement and Data Availability
Ethics statement. We acknowledge that privacy and ethi-
cal concerns are associated with collecting and using social
media data for research. However, we took several steps to avoid risks to human subjects, since participants do not, in a traditional sense, opt into being part of our study (Zimmer 2020). In our analysis, we only studied and reported population-level, aggregated observations of our dataset.
We share publicly only the tweet IDs with the research com-
munity to account for privacy issues and Twitter’s TOS. For
this purpose, we use a data sharing and long-term archiving
service provided by GESIS - Leibniz Institute for the Social
Sciences, a German infrastructure institute for the social sciences.2
With regards to data availability, this repository adheres
to the FAIR principles (Wilkinson et al. 2016) as follows:
Findability: In compliance with Twitter’s terms of ser-
vice, only tweet IDs are made publicly available at DOI:
https://doi.org/10.7802/2516. A unique Digital Object Identifier (DOI) is associated with the dataset. Its metadata and licenses are also readily available.
Accessibility: The dataset can be downloaded using standard APIs and communication protocols (the REST API and OAI-PMH); a rehydration sketch is shown after this list.
Interoperability: The data is provided in raw text for-
mat.
Reusability: The CC BY 4.0 license implies that re-
searchers are free to use the data with proper attribution.
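As a usage illustration, the shared tweet IDs can be rehydrated into full tweet objects through the v2 tweet-lookup endpoint, provided the researcher holds API access comparable to what was available at the time of collection. The sketch below is minimal: the bearer token is a placeholder, and rate-limit handling is omitted.

```python
import requests

LOOKUP_URL = "https://api.twitter.com/2/tweets"   # v2 tweet lookup endpoint


def rehydrate(tweet_ids: list[str], bearer_token: str) -> list[dict]:
    """Fetch full tweet objects for shared tweet IDs, 100 IDs per request."""
    headers = {"Authorization": f"Bearer {bearer_token}"}
    tweets = []
    for i in range(0, len(tweet_ids), 100):
        params = {
            "ids": ",".join(tweet_ids[i:i + 100]),
            "tweet.fields": "created_at,lang,public_metrics,referenced_tweets",
        }
        resp = requests.get(LOOKUP_URL, headers=headers, params=params, timeout=30)
        resp.raise_for_status()
        # Tweets deleted or made private since collection are silently omitted.
        tweets.extend(resp.json().get("data", []))
    return tweets
```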
Furthermore, we want to invite the broader research com-
munity to approach one or more of the authors and collab-
orators (see Acknowledgments) of this paper with research
ideas about what can be done with this dataset. We will be
very happy to collaborate with you on your ideas!
2 https://www.gesis.org/en/data-services/share-data
Acknowledgments
The data collection effort described in this paper could
not have been possible without the great collaboration of
a large number of scholars, here are some of them (in
random order): Chris Schoenherr, Leonard Husmann, Diyi
Liu, Benedict Witzenberger, Joan Rodriguez-Amat, Flo-
rian Angermeir, Stefanie Walter, Laura Mahrenbach, Isaac
Bravo, Anahit Sargsyan, Luca Maria Aiello, Sophie Brandt,
Wienke Strathern, Bilal Çakir, David Schoch, Yuliia Holu-
bosh, Savvas Zannettou, Kyriaki Kalimeri.
References
Ajao, O.; Hong, J.; and Liu, W. 2015. A survey of loca-
tion inference techniques on Twitter. Journal of Information
Science, 41(6): 855–864.
Antenore, M.; Camacho Rodriguez, J. M.; and Panizzi, E.
2022. A Comparative Study of Bot Detection Techniques
With an Application in Twitter Covid-19 Discourse. Social
Science Computer Review, 08944393211073733.
Badawy, A.; Ferrara, E.; and Lerman, K. 2018. Analyzing
the digital traces of political manipulation: The 2016 Rus-
sian interference Twitter campaign. In 2018 IEEE/ACM in-
ternational conference on advances in social networks anal-
ysis and mining (ASONAM), 258–265. IEEE.
Bastos, M. T.; and Mercea, D. 2016. Serial activists: Polit-
ical Twitter beyond influentials and the twittertariat. New
Media & Society, 18(10): 2359–2378.
Broniatowski, D. A.; Jamison, A. M.; Qi, S.; AlKulaib, L.;
Chen, T.; Benton, A.; Quinn, S. C.; and Dredze, M. 2018.
Weaponized health communication: Twitter bots and Rus-
sian trolls amplify the vaccine debate. American journal of
public health, 108(10): 1378–1384.
Chowdhury, A.; Srinivasan, S.; Bhowmick, S.; Mukherjee,
A.; and Ghosh, K. 2022. Constant community identifica-
tion in million-scale networks. Social Network Analysis and
Mining, 12(1): 1–17.
Cresci, S. 2020. A decade of social bot detection. Commu-
nications of the ACM, 63(10): 72–83.
Cunha, E.; Magno, G.; Comarela, G.; Almeida, V.;
Gonçalves, M. A.; and Benevenuto, F. 2011. Analyzing the
dynamic evolution of hashtags on twitter: a language-based
approach. In Proceedings of the workshop on language in
social media (LSM 2011), 58–65.
DeLucia, A.; Wu, S.; Mueller, A.; Aguirre, C.; Dredze, M.;
and Resnik, P. 2022. Bernice: A Multilingual Pre-trained
Encoder for Twitter.
Earl, J.; Maher, T. V.; and Pan, J. 2022. The digital repres-
sion of social movements, protest, and activism: A synthetic
review. Science Advances, 8(10): eabl8198.
Elmas, T.; Overdorf, R.; and Aberer, K. 2021. A Dataset of
State-Censored Tweets. In ICWSM, 1009–1015.
Ferrara, E.; Varol, O.; Davis, C.; Menczer, F.; and Flammini,
A. 2016. The rise of social bots. Communications of the
ACM, 59(7): 96–104.
Giorgi, S.; Lynn, V. E.; Gupta, K.; Ahmed, F.; Matz, S.; Un-
gar, L. H.; and Schwartz, H. A. 2022. Correcting Sociode-
mographic Selection Biases for Population Prediction from
Social Media. In Proceedings of the International AAAI
Conference on Web and Social Media, volume 16, 228–240.
González-Bailón, S.; and Lelkes, Y. 2022. Do social media
undermine social cohesion? A critical review. Social Issues
and Policy Review.
González-Bailón, S.; Wang, N.; Rivero, A.; Borge-Holthoefer, J.; and Moreno, Y. 2014. Assessing the bias in samples of large online networks. Social Networks, 38: 16–27.
Grinberg, N.; Joseph, K.; Friedland, L.; Swire-Thompson,
B.; and Lazer, D. 2019. Fake news on Twitter during the
2016 US presidential election. Science, 363(6425): 374–
378.
Gruzd, A.; Wellman, B.; and Takhteyev, Y. 2011. Imagining
Twitter as an imagined community. American Behavioral
Scientist, 55(10): 1294–1318.
Jaidka, K.; Giorgi, S.; Schwartz, H. A.; Kern, M. L.; Ungar,
L. H.; and Eichstaedt, J. C. 2020. Estimating geographic
subjective well-being from Twitter: A comparison of dictio-
nary and data-driven language methods. Proceedings of the
National Academy of Sciences, 117(19): 10165–10171.
Jaidka, K.; Mukerjee, S.; and Lelkes, Y. 2023. Silenced on
social media: the gatekeeping functions of shadowbans in
the American Twitterverse. Journal of Communication.
Jakesch, M.; Garimella, K.; Eckles, D.; and Naaman, M.
2021. Trend alert: A cross-platform organization manip-
ulated Twitter trends in the Indian general election. Pro-
ceedings of the ACM on Human-Computer Interaction,
5(CSCW2): 1–19.
Kemp, S. 2022. Digital 2022: Global overview report. Tech-
nical report, DataReportal.
Majó-Vázquez, S.; Congosto, M.; Nicholls, T.; and Nielsen,
R. K. 2021. The Role of Suspended Accounts in Political
Discussion on Social Media: Analysis of the 2017 French,
UK and German Elections. Social Media+ Society, 7(3):
20563051211027202.
Malik, M.; Lamba, H.; Nakos, C.; and Pfeffer, J. 2015. Pop-
ulation bias in geotagged tweets. In proceedings of the in-
ternational AAAI conference on web and social media, vol-
ume 9, 18–27.
McClain, C. 2021. 70% of U.S. social media users never or
rarely post or share about political, social issues. Technical
report, Pew Research Center.
Mejova, Y.; Weber, I.; and Macy, M. W. 2015. Twitter: a
digital socioscope. Cambridge University Press.
Morstatter, F.; Pfeffer, J.; Liu, H.; and Carley, K. M. 2013. Is
the Sample Good Enough? Comparing Data from Twitter’s
Streaming API with Twitter’s Firehose. In Seventh Inter-
national AAAI Conference on Weblogs and Social Media,
400–408.
Mukerjee, S.; Jaidka, K.; and Lelkes, Y. 2022. The Political
Landscape of the US Twitterverse. Political Communica-
tion, 1–31.
Olteanu, A.; Castillo, C.; Diaz, F.; and Kıcıman, E. 2019.
Social data: Biases, methodological pitfalls, and ethical
boundaries. Frontiers in Big Data, 2: 13.
Pfeffer, J.; Mayer, K.; and Morstatter, F. 2018. Tampering
with Twitter’s Sample API. EPJ Data Science, 7(50).
Pfeffer, J.; Mooseder, A.; Lasser, J.; Hammer, L.; Stritzel,
O.; and Garcia, D. 2022. This Sample seems to be good
enough! Assessing Coverage and Temporal Reliability of
Twitter’s Academic API.
Rajadesingan, A.; Budak, C.; and Resnick, P. 2021. Political
discussion is abundant in non-political subreddits (and less
toxic). In Proceedings of the Fifteenth International AAAI
Conference on Web and Social Media, volume 15.
Ruths, D.; and Pfeffer, J. 2014. Social Media for Large Stud-
ies of Behavior. Science, 346(6213): 1063–1064.
Sayyadiharikandeh, M.; Varol, O.; Yang, K.-C.; Flammini,
A.; and Menczer, F. 2020. Detection of novel social bots
by ensembles of specialized classifiers. In Proceedings of
the 29th ACM international conference on information &
knowledge management, 2725–2732.
Schwartz, H.; Eichstaedt, J.; Kern, M.; Dziurzynski, L.; Lu-
cas, R.; Agrawal, M.; Park, G.; Lakshmikanth, S.; Jha, S.;
Seligman, M.; et al. 2013. Characterizing geographic varia-
tion in well-being using tweets. In Proceedings of the Inter-
national AAAI Conference on Web and Social Media, vol-
ume 7, 583–591.
Shao, C.; Ciampaglia, G. L.; Varol, O.; Yang, K.-C.; Flam-
mini, A.; and Menczer, F. 2018. The spread of low-
credibility content by social bots. Nature communications,
9(1): 1–9.
Tanash, R. S.; Chen, Z.; Thakur, T.; Wallach, D. S.; and Sub-
ramanian, D. 2015. Known unknowns: An analysis of Twit-
ter censorship in Turkey. In Proceedings of the 14th ACM
Workshop on Privacy in the Electronic Society, 11–20.
Tufekci, Z. 2014. Big questions for social media big data:
Representativeness, validity and other methodological pit-
falls. In Eighth international AAAI conference on weblogs
and social media.
Uyheng, J.; and Carley, K. M. 2021. Computational Analy-
sis of Bot Activity in the Asia-Pacific: A Comparative Study
of Four National Elections. In Proceedings of the Inter-
national AAAI Conference on Web and Social Media, vol-
ume 15, 727–738.
Van Dijck, J.; Poell, T.; and De Waal, M. 2018. The plat-
form society: Public values in a connective world. Oxford
University Press.
Varol, O. 2016. Spatiotemporal analysis of censored content
on twitter. In Proceedings of the 8th ACM Conference on
Web Science, 372–373.
Varol, O. 2022. Should we agree to disagree about Twitter’s
bot problem? arXiv preprint arXiv:2209.10006.
Varol, O.; Ferrara, E.; Davis, C.; Menczer, F.; and Flam-
mini, A. 2017. Online human-bot interactions: Detection,
estimation, and characterization. In Proceedings of the in-
ternational AAAI conference on web and social media, vol-
ume 11, 280–289.
Varol, O.; and Uluturk, I. 2020. Journalists on Twitter: self-
branding, audiences, and involvement of bots. Journal of
Computational Social Science, 3(1): 83–101.
Vosoughi, S.; Roy, D.; and Aral, S. 2018. The spread of true
and false news online. science, 359(6380): 1146–1151.
Wilkinson, M. D.; Dumontier, M.; Aalbersberg, I. J.; Apple-
ton, G.; Axton, M.; Baak, A.; Blomberg, N.; Boiten, J.-W.;
da Silva Santos, L. B.; Bourne, P. E.; et al. 2016. The FAIR
Guiding Principles for scientific data management and stew-
ardship. Scientific data, 3(1): 1–9.
Wojcieszak, M.; Casas, A.; Yu, X.; Nagler, J.; and Tucker,
J. A. 2022. Most users do not follow political elites on Twit-
ter; those who do show overwhelming preferences for ideo-
logical congruity. Science advances, 8(39): eabn9418.
Wu, S.; and Dredze, M. 2020. Are All Languages Created
Equal in Multilingual BERT? In Proceedings of the 5th
Workshop on Representation Learning for NLP, 120–130.
Wu, S.; Hofman, J. M.; Mason, W. A.; and Watts, D. J. 2011.
Who says what to whom on twitter. In Proceedings of the
20th international conference on World wide web, 705–714.
Yang, K.-C.; Varol, O.; Hui, P.-M.; and Menczer, F. 2020.
Scalable and generalizable social bot detection through data
selection. In Proceedings of the AAAI conference on artifi-
cial intelligence, volume 34, 1096–1103.
Zagheni, E.; Garimella, V. R. K.; Weber, I.; and State, B.
2014. Inferring international and internal migration patterns
from twitter data. In Proceedings of the 23rd international
conference on world wide web, 439–444.
Zhang, X.; Malkov, Y.; Florez, O.; Park, S.; McWilliams, B.;
Han, J.; and El-Kishky, A. 2022. TwHIN-BERT: A Socially-
Enriched Pre-trained Language Model for Multilingual
Tweet Representations. arXiv preprint arXiv:2209.07562.
Zhou, A.; and Yang, A. 2021. The Longitudinal Dimension
of Social-Mediated Movements: Hidden Brokerage and the
Unsung Tales of Movement Spilloverers. Social Media+
Society, 7(3): 20563051211047545.
Zimmer, M. 2020. “But the data is already public”: on the
ethics of research in Facebook. In The Ethics of Information
Technologies, 229–241. Routledge.