Hunting malicious bots on Twitter: an
unsupervised approach
Zhouhan Chen, Rima S. Tanash, Richard Stoll, and Devika Subramanian
Rice University, Houston TX 77005, USA,
zc12@rice.edu, rtanash@rice.edu, stoll@rice.edu, devika@rice.edu
Abstract. Malicious bots violate Twitter’s terms of service – they in-
clude bots that post spam content, adware and malware, as well as bots
that are designed to sway public opinion. How prevalent are such bots
on Twitter? Estimates vary, with Twitter [3] itself stating that less than
5% of its over 300 million active accounts are bots. Using a supervised
machine learning approach with a manually curated set of Twitter bots,
[13] estimate that between 9% to 15% of active Twitter accounts are bots
(both benign and malicious). In this paper, we propose an unsupervised
approach to hunt for malicious bot groups on Twitter. Key structural and
behavioral markers for such bot groups are the use of URL shortening
services, duplicate tweets and content coordination over extended peri-
ods of time. While these markers have been identified in prior work [16]
[10], we devise a new protocol to automatically harvest such bot groups
from live Tweet streams. Our experiments with this protocol show that
between 4% to 23% (mean 10.5%) of all accounts that use shortened
URLs are bots and bot networks that evade detection over a long pe-
riod of time, with significant heterogeneity in distribution based on the
URL shortening service. We compare our detection approach with two
state-of-the-art methods for bot detection on Twitter: a supervised learn-
ing approach called BotOrNot [11] and an unsupervised technique called
DeBot [9]. We show that BotOrNot misclassifies around 40% of the mali-
cious bots identified by our protocol. The overlap between bots detected
by our approach and DeBot, which uses synchronicity of tweeting as a
primary behavioral marker, is around 7%, indicating that the detection
approaches target very different types of bots. Our protocol effectively
identifies malicious bots in a language-independent, as well as topic and
keyword independent framework in real-time in an entirely unsupervised
manner and is a useful supplement to existing bot detection tools.
Keywords: Bot Detection, Social Network Analysis, Data Mining
1 Introduction
In recent years, Twitter, with its easy enrollment process and attractive user
interface, has seen a proliferation of automated accounts or bots [12], [14]. While
a few of these automated accounts engage in human conversation or provide
community benefits [1], many are malicious. We define malicious bots as those
that violate Twitter’s terms of service [6], including those that post spam content,
adware and malware, as well as bots that are part of sponsored campaigns to
sway public opinion.
How prevalent are bots and bot networks on Twitter? Estimates vary, with
Twitter [3] itself stating that less than 5% of its over 300 million active accounts
are bots. Using a supervised machine learning approach with a manually curated
set of Twitter bots, [13] estimate that between 9% to 15% of active Twitter
accounts are bots (both benign and malicious). An open question is how to
efficiently obtain a census of Twitter bots and how to reliably estimate the
percentage of malicious bots among them. In addition, it is important to estimate
the percentage of tweets contributed by these bots so we have an understanding
of the impact such accounts have on a legitimate Twitter user’s experience.
In particular, malicious bots can seriously distort analyses such as [15] based on
tweet counts, because these bots cut and paste real content from trending tweets
[10].
In this paper, we propose an unsupervised approach to hunt for malicious
bot groups on Twitter. Key structural and behavioral markers for such bot
groups are the use of shortened URLs, typically to disguise final landing sites, the
tweeting of duplicate content, and content coordination over extended periods
of time. While the use of shortened URLs and tweeting of duplicate content
have been separately identified in prior work [16] [10], we devise a new protocol
that follows this up by verifying content coordination between bot groups over
extended periods of time. Our bot detection protocol has four sequential phases
as illustrated in Figure 1. Our unit of analysis is a cluster of accounts. The initial
clustering is based on duplicate text content and the use of shortened URLs. The
final detection decision is made by examining the long term behavior of these
account clusters and the extent of content coordination between them.
Our experiments with this protocol on actively gathered tweets with shortened URLs from the nine most popular URL shortening services show a complex picture of the prevalence and distribution of malicious bots. Fewer than 6% of accounts tweeting shortened URLs are malicious bots, except for ln.is (27%) and dlvr.it (8%). The tweet traffic generated by malicious bot accounts using shortened URLs from bit.ly, ift.tt, ow.ly, goo.gl and tinyurl.com is under 6%, but malicious bots using dlvr.it, dld.bz, viid.me and ln.is account for 13% to 27% of tweets.
The gold standard for confirming bots is suspension by Twitter. However,
as noted by [9] and [11], there is a time lag between detection of bots by re-
searchers and their suspension by Twitter. If we had a reference list of bots,
we could give recall and precision measures for our detection approach. In the
absence of such a list, we can only provide a precision measure by comparing our
bots with those detected by state of the art methods: a supervised learning ap-
proach called BotOrNot [11] and an unsupervised technique called DeBot [9]. We
show that BotOrNot misclassifies around 40% of the malicious bots identified
by our protocol. Unlike BotOrNot, our approach identifies entire bot groups by
Fig. 1. The four phases of our bot detector architecture. Note the second round of
tweet collection from a group of accounts that tweet near identical content and which
use shortened URLs. This phase analyzes the long-term tweeting behavior of potential
malicious bots.
their collective behavior over a period of time, rather than using decision rules
for classifying individual accounts as bots. DeBot is focused on the question
of detecting groups of bots, not necessarily malicious ones, by exploiting coor-
dinated temporal patterns in their tweeting behavior. Since we use duplicated
content over a time period, and shortened URLs as primary behavioral markers,
in contrast to synchronicity of tweeting, we find a far more diverse group of ma-
licious bots than DeBot. Thus there is only a small overlap (7%) between bot
groups found by our protocol and DeBot’s, which mainly finds bots associated
with news services rather than those that hijack users to spam, ad and malware
sites.
In sum, our protocol effectively identifies malicious bots by focusing on short-
ened URLs and tweeting of near duplicate content over an extended period of
time. It is language-independent, as well as topic and keyword independent and
is completely unsupervised. We have validated our protocol on tweets from nine
URL shortening services and characterized the heterogeneity of the distribution
of malicious bots in the Twitterverse.
The remainder of the paper is organized as follows. Section 2 provides brief
background on existing Twitter bot detection methods and the novelty and
efficacy of our protocol. Section 3 describes and motivates each phase of our
bot detection protocol. Section 4 introduces the tweet sets collected from nine
URL shortening services used in our experiments, and provides results of our
detection protocol on these sets. It also includes results of comparisons of our
method with DeBot and BotOrNot. We conclude the paper in Section 5 with
our main results and directions for future exploration.
2 Background
There is a significant literature on detecting bots on Twitter and other social
media forums – recent surveys are in [14] and [12]. Spam bots on Twitter are
constantly evolving, and there is an ongoing arms race between spammers and
Twitter’s account suspension policies and systems. For instance, link-farming bots were a dominant category of spam bots in 2012; however, Twitter’s new detection algorithms [2] have since driven them to extinction.
Current Twitter bot detection methods can be placed in two major cate-
gories: ones based on supervised learning which rely on curated training sets of
known bots, and unsupervised approaches that need no training data. Super-
vised methods generally work at the level of individual accounts. They extract
hundreds to thousands of features from each account based on properties of the
account itself and its interaction with others, and learn decision rules to distin-
guish bots from human accounts using a variety of machine learning algorithms
[16] [10] [8]. The state of the art in this line of work is BotOrNot [11], which
is a Random Forest classifier that uses more than a thousand features to dis-
tinguish bots from humans. Features include user profile metadata, sentiment,
content, friends, network and timing of tweets. This method has two limitations.
First, it makes decisions at the level of individual accounts and therefore fails
to identify groups of accounts that act in concert. Second, it works only on accounts tweeting in English, and it is not adaptive: it requires re-training with new human-labeled data as new types of bots emerge.
The state-of-the-art in unsupervised bot detection is DeBot [9], which uses
dynamic time warping to identify accounts with synchronized tweeting patterns.
This protocol relies on tweet creation times rather than tweet content. Our results show
that DeBot mostly finds news bots with temporally synchronized tweeting pat-
terns. It does not capture malicious spam bots that are temporally uncorrelated,
but that tweet duplicate trending content in order to hijack users to spam and
malware sites.
Our unsupervised detection method fills a gap in Twitter bot detection re-
search. The underlying features of bots identified by our method are accounts
using shortened URLs and near duplicate content over an extended period of
time. Because the method is unsupervised, it is not biased by any keyword,
topic, hashtag, country or language. It does not require human labeled training
sets, and can be deployed in a real time online fashion.
3 Our Methods
Our spam detection method consists of four components run sequentially: crawler,
duplicate filter, collector and bot detector. The crawler collects live tweets from
the Twitter Streaming API using keyword filtering [5]. We choose prefixes of the
domain name of a URL shortening service as keywords. The duplicate filter
selects suspicious groups of accounts for further analysis. It first hashes all tweet
content extracted from the text field of the tweet’s JSON representation and maps
each unique tweet text to a group of users who tweet that content. The filter
selects duplicate tweeting groups of size 20 or greater. This threshold of 20 en-
ables us to focus on more significant bot groups. To make sure accounts do in
fact violate Twitter’s terms of service over a period of time, we perform a second
level of tweet collection on each member of a suspicious group. The collector
gathers the 200 most recent tweets of every account in each suspicious group
using Twitter’s REST API. This step ensures that we filter out innocent users
who happen to tweet a few of the same texts as bots. The bot detector clusters
accounts in a group that have most of their historical tweets (200 most recent
tweets) identical to each other. Given a group $G$ of $n$ accounts $a_1, \ldots, a_n$ and the sets $T(a_1), \ldots, T(a_n)$ of tweets, where $T(a_i) = \{t_{i1}, \ldots, t_{i200}\}$ is the set of the 200 most recent tweets of account $a_i$, $1 \le i \le n$, it constructs the set $C$ of tweets that are tweeted by at least $\alpha$ accounts in the group. That is,

$$t \in C \iff |\{\, i \mid t \in T(a_i),\ 1 \le i \le n \,\}| \ge \alpha \qquad (1)$$

In the next step, the detector measures the overlap between the tweet set $T(a_i)$ associated with an individual account and the set $C$ of tweets for the group $G$ that account $a_i$ is a member of. The potential bots in the group, denoted by the set $S$, are identified as follows:

$$a_i \in S \iff \frac{|T(a_i) \cap C|}{|T(a_i)|} \ge \beta \qquad (2)$$
Thus there are two parameters in our detection protocol: α, which we call
minimum duplicate factor, which influences the construction of the most frequent
tweet set C, and β, which we call the overlap ratio, which determines the ratio of
frequent tweets in the tweet set associated with an account. Accounts that meet
criteria (1) and (2) for a specific choice of α and β are identified as malicious
bots in our protocol. In all of our experiments reported in the next section, we
use α = 3 and β = 0.6. These parameters were obtained after cross-validation
studies which we do not have space to document here.
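To make the two thresholds concrete, the following Python sketch illustrates the duplicate filter and the bot detector phases under simplifying assumptions (tweets are represented as plain text strings already collected from the APIs); the function names and the toy example are ours, not part of the released implementation.

```python
from collections import Counter, defaultdict

MIN_GROUP_SIZE = 20   # duplicate filter: keep groups of >= 20 accounts
ALPHA = 3             # minimum duplicate factor (criterion 1)
BETA = 0.6            # overlap ratio (criterion 2)

def duplicate_groups(stream, min_size=MIN_GROUP_SIZE):
    """Duplicate filter: map each distinct tweet text to the set of accounts
    that posted it, and keep only suspiciously large groups.

    `stream` yields (account_id, tweet_text) pairs from the crawler phase.
    """
    by_text = defaultdict(set)
    for account, text in stream:
        by_text[text].add(account)
    return [accounts for accounts in by_text.values() if len(accounts) >= min_size]

def detect_bots(group_timelines, alpha=ALPHA, beta=BETA):
    """Bot detector: apply criteria (1) and (2) to one suspicious group.

    `group_timelines` maps an account id to the list of its (up to) 200
    most recent tweet texts gathered in the second collection phase.
    """
    # For every distinct tweet text, count how many accounts in the group posted it.
    account_counts = Counter()
    for tweets in group_timelines.values():
        account_counts.update(set(tweets))

    # C: tweets posted by at least `alpha` accounts in the group (criterion 1).
    C = {text for text, n in account_counts.items() if n >= alpha}

    # S: accounts whose recent timeline overlaps C by at least `beta` (criterion 2).
    S = set()
    for account, tweets in group_timelines.items():
        if tweets and sum(t in C for t in tweets) / len(tweets) >= beta:
            S.add(account)
    return S

# Toy example: three accounts posting the same promotional links, one ordinary user.
spam = [f"win a prize http://bit.ly/x{i}" for i in range(10)]
timelines = {
    "acct_a": spam,
    "acct_b": spam,
    "acct_c": spam[:8] + ["hello", "nice day"],
    "acct_d": ["my holiday photos", "lunch time"],
}
print(detect_bots(timelines))   # -> {'acct_a', 'acct_b', 'acct_c'}
```

With α = 3 and β = 0.6, an account is flagged only when at least 60% of its recent timeline consists of tweets that at least three accounts in its group also posted.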
4 Experimental Evaluation
4.1 Landscape of Twitter trending URLs
To justify the use of URL shortening as a marker in our bot detection protocol,
we examined the distribution of all URLs in the Twitter stream to estimate
the fraction of tweets containing URL shorteners. We first streamed more than
thirty million live tweets using the keyword http, extracted all URLs within each
tweet, and sorted them by frequency of occurrence. While this is only a sample
of all tweets with embedded URLs, we believe it is an unbiased sample since the
Twitter Streaming API does not favor one particular region/language/account
over another. Figure 2 shows top trending URLs on Twitter constructed with
this sample. Shortened URLs clearly constitute a major fraction of tweet traffic.
Fig. 2. Top trending URLs on Twitter (Green bars are social media websites, red bars
are websites tweeting Quran verses, orange bars are URL shortening services, and blue
bars are others. In this paper we focus on the orange bars. For real-time, complete top
100 trending URLs, visit our Twitter Spam Monitor [4].)
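To illustrate how such a ranking can be computed, the sketch below tallies URL domains from a file of collected tweets. The file name, and the assumption that tweets are stored one JSON object per line with URLs listed under entities.urls as in the standard Streaming API payload, are ours rather than part of the paper's tooling.

```python
import json
from collections import Counter
from urllib.parse import urlparse

def domain_frequencies(path):
    """Count the domains of all URLs embedded in a file of streamed tweets."""
    counts = Counter()
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            try:
                tweet = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip keep-alive lines or truncated records
            for url in tweet.get("entities", {}).get("urls", []):
                target = url.get("expanded_url") or url.get("url")
                if target:
                    counts[urlparse(target).netloc.lower()] += 1
    return counts

if __name__ == "__main__":
    # Hypothetical input file of streamed tweets, one JSON object per line.
    for domain, n in domain_frequencies("streamed_tweets.jsonl").most_common(20):
        print(f"{domain:25s} {n}")
```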
4.2 Datasets and Results
Based on our analysis of the distribution of URLs on Twitter, we choose to study
nine popular URL shortening services: bit.ly, ift.tt, ow.ly, goo.gl, tinyurl.com, dlvr.it, dld.bz, viid.me and ln.is. In our sample, more than 24% of tweets with
embedded URLs are generated from these nine service providers.
Table 1 shows the number and percentage of bot accounts identified by our
detection protocol, the percentage of our bot accounts also identified by DeBot,
and the percentage of accounts that are identified by our protocol and later become
suspended by Twitter. We note that the percentages of bot accounts vary greatly
among the nine URL shorteners, ranging from 2.4% to 23%, and the percentage of tweets generated by these bot accounts varies from 2.7% to 26.65%, with an average of 10.51%. Four URL shorteners (dlvr.it, dld.bz, viid.me and ln.is) have more than 10% of their tweets generated by bot accounts, suggesting that those
URL shorteners are more abused by malicious users than the other five.
4.3 Scaling Experiments
In July 2017, we performed a series of experiments on all nine URL shortening
services in which we scaled the number of tweets gathered to around 500,000 in
order to verify the robustness of our results in Table 1. Two of the URL shorteners
were no longer available: ln.is is suspended [7], while viid.me has been blocked
by Twitter due to its malicious behavior. Table 2 gives the detailed statistics of
our bot detection performance on the remaining URL shortening services.
Table 1. Statistics of Twitter accounts from nine URL shortening services. Note the uptick in suspended accounts identified by us on viid.me.

URL shortener | Total # of accts | Total # of bots | % bots we found | % tweets from bots we found | % our bots found by DeBot | % our bots susp. by Twitter until 6/10/17 | % our bots susp. by Twitter until 7/17/17
bit.ly | 28964 | 696 | 2.40% | 4.44% | 12.93% | 3.74% | 4.74%
ift.tt | 12543 | 321 | 2.56% | 3.54% | 11.21% | 2.80% | 9.97%
ow.ly | 28416 | 894 | 3.15% | 3.22% | 6.04% | 45.30% | 48.21%
tinyurl.com | 20005 | 705 | 3.52% | 5.70% | 1.99% | 5.39% | 7.66%
dld.bz | 6893 | 304 | 4.41% | 13.36% | 10.20% | 8.22% | 11.84%
viid.me | 2605 | 129 | 4.95% | 21.66% | 22.48% | 38.76% | 55.81%
goo.gl | 11250 | 710 | 6.31% | 2.70% | 8.73% | 0.42% | 3.24%
dlvr.it | 15122 | 1194 | 7.90% | 13.34% | 22.86% | 7.37% | 9.13%
ln.is | 25384 | 5857 | 23.07% | 26.65% | 3.57% | 1.11% | 1.25%
Compared to our first set of experiments, we see an increase in the percent-
age of bot accounts in six out of the seven remaining URL shortening services, and increased percentages of tweets from bots in all of them. The percentages of bot
accounts range from 5.88% to 17.80%, and the percentage of tweets generated
by these bot accounts vary from 14.74% to 56.46%. Together with Table 1, we
find that the rate of bot account creation outpaces Twitter suspensions.
Table 2. Statistics of Twitter accounts from nine URL shortening services (10X scale experiments).

URL shortener service | Total # of accts | Total # of bots | % bots | % tweets from bots
bit.ly | 193207 | 22938 | 11.87% | 16.11%
ift.tt | 75024 | 4415 | 5.88% | 16.70%
ow.ly | 182539 | 31416 | 17.21% | 26.07%
tinyurl.com | 49563 | 4644 | 9.37% | 14.74%
dld.bz | 11705 | 1036 | 8.85% | 56.46%
goo.gl | 177030 | 31515 | 17.80% | 27.88%
dlvr.it | 86830 | 6517 | 7.51% | 18.58%
ln.is | N/A | N/A | N/A | N/A
viid.me | N/A | N/A | N/A | N/A
4.4 Comparison with existing bot detection methods
Twitter suspension system We revisited the account status of bots detected
by our protocol to check if they have been suspended by Twitter. The last
two columns in Table 1 show percentages of suspended accounts among all bot
accounts identified by our protocol, one collected in June and the other in July.
As of July, more than 56% of bot accounts using viid.me and 48% of accounts
using ow.ly have been suspended. However, fewer than 15% of bot accounts that
use the other seven URL shortening services have been suspended. Twitter’s
suspension system appears to be unsystematic and incomplete.
DeBot We compare our results with DeBot in the following manner. For all
Twitter accounts in the nine datasets, we queried the DeBot API to determine
whether or not the account is archived in its bot database. Table 3 documents
the number of bots identified by our method and by DeBot. The intersection of
those two groups is small in all cases.

Table 3. Overlap of results from our Bot Detector and DeBot

URL shortener | # bot accts we found | # verified accts we found | # bot accts DeBot found | # verified accts DeBot found | Overlap in accts
bit.ly | 605 | 2 | 1657 | 57 | 91
ift.tt | 321 | 0 | 989 | 8 | 38
ow.ly | 894 | 0 | 1500 | 34 | 55
tinyurl.com | 705 | 0 | 826 | 9 | 14
dld.bz | 304 | 0 | 473 | 2 | 31
viid.me | 129 | 0 | 515 | 0 | 31
goo.gl | 710 | 0 | 822 | 9 | 62
dlvr.it | 1194 | 17 | 1843 | 19 | 281
ln.is | 5857 | 0 | 2383 | 7 | 216

To investigate the low overlap in detected bots, we checked for verified accounts among the ones identified as bots by both
methods. To determine the percentage of news bots (defined as an account that
tweets at least once with a URL from a list of established news media URLs),
we used the Twitter REST API to collect the 200 most recent tweets from these
accounts. Table 4 shows that more than 50% of bot accounts identified by DeBot
are news bots, compared to 15% based on our method.
Table 4. Comparison of percentage of news accounts

Protocol | Total bots | News bots | % News bots
Our protocol | 696 | 102 | 14.66%
DeBot | 1748 | 947 | 54.18%
Thus, what DeBot finds but our method does not are news bots linked to large news media accounts. What both methods find are bot groups that tweet highly synchronously with duplicate content, and what our method finds but DeBot does not are bot groups using shortened URLs that do not tweet simultaneously.
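A minimal sketch of the news-bot labeling step used in this comparison is given below, assuming the URLs of each account's 200 most recent tweets have already been collected; the domain list shown is illustrative only and is not the list of established news media URLs used in the paper.

```python
from urllib.parse import urlparse

# Illustrative news domains only; the paper uses its own curated list.
NEWS_DOMAINS = {"nytimes.com", "bbc.co.uk", "cnn.com", "reuters.com", "theguardian.com"}

def is_news_bot(recent_tweet_urls):
    """A bot account is labeled a news bot if at least one of its recent
    tweets links to a domain on the news media list."""
    for url in recent_tweet_urls:
        domain = urlparse(url).netloc.lower()
        if domain.startswith("www."):
            domain = domain[4:]
        if domain in NEWS_DOMAINS:
            return True
    return False

def news_bot_share(accounts):
    """`accounts` maps an account id to the list of URLs found in its
    200 most recent tweets; returns the percentage labeled as news bots."""
    if not accounts:
        return 0.0
    flagged = sum(is_news_bot(urls) for urls in accounts.values())
    return 100.0 * flagged / len(accounts)
```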
BotOrNot We also compared our results with BotOrNot, a supervised, account
based Twitter bot classifier. BotOrNot assigns a score between 0 and 1 to an account
based on more than 1000 features, including temporal, sentiment and social
network information [11]. A score close to 0 suggests a human account, while a
score close to 1 suggests a bot account. Like DeBot, BotOrNot also provides a
public API to interact with its service. Table 5 shows the statistics of BotOrNot
scores of all bots we identified in the nine datasets. In 5 out of the 9 datasets,
more than 50% of the scores of accounts identified as bots by our protocol fall
in the range 0.4 to 0.6. We expect scores of bots detected by our protocol
to exceed 0.6, so we interpret these results as misclassifications by BotOrNot.
Table 5. Bot account scores from BotOrNot

URL shortener | Average score | % bots with score in [0.4, 0.6]
bit.ly | 0.50 | 50.65%
ift.tt | 0.56 | 50.46%
ow.ly | 0.52 | 40.83%
tinyurl.com | 0.56 | 66.08%
dld.bz | 0.71 | 17.75%
viid.me | 0.68 | 24.81%
goo.gl | 0.53 | 68.99%
dlvr.it | 0.49 | 44.43%
ln.is | 0.44 | 56.81%
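For reference, the per-shortener figures in Table 5 can be computed from a mapping of accounts to BotOrNot scores as in the sketch below; this assumes the scores have already been retrieved through the public API, and the example values are made up.

```python
from statistics import mean

def summarize_scores(scores, low=0.4, high=0.6):
    """`scores` maps an account id to its BotOrNot score in [0, 1].

    Returns the average score and the percentage of accounts whose score
    falls in the ambiguous [low, high] band (interpreted here as likely
    misclassifications of accounts our protocol already flagged as bots).
    """
    values = list(scores.values())
    in_band = sum(low <= s <= high for s in values)
    return mean(values), 100.0 * in_band / len(values)

# Hypothetical scores for four accounts flagged by our protocol.
avg, pct_band = summarize_scores({"a": 0.50, "b": 0.45, "c": 0.72, "d": 0.55})
print(f"average score {avg:.2f}, {pct_band:.1f}% in [0.4, 0.6]")
```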
5 Conclusions
In this paper we present a Twitter bot detection method that hunts for a spe-
cific class of malicious bots: those that use shortened URLs and tweet near-duplicate content over an extended period of time. Our unsupervised method does not
require labeled training data, and is not biased toward any language, topic or
keyword. Arguably our method does not capture the most sophisticated bots
on Twitter, yet it is surprising that between 4% and 23% of the accounts we
sampled from the streaming API satisfy our bot criteria and remain active on
Twitter, generating 4% to 27% of tweet traffic. In the absence of identified bot
lists, we resort to comparisons with two of the best bot detection protocols on
Twitter to evaluate the effectiveness of our approach. Our work gives us a more
nuanced understanding of the demographics of Twitter bots and the severity
of bot proliferation on Twitter. Bot removal is a necessary step in any analysis
relying on raw counts of collected tweets, so our work is useful for anyone with a
Twitter dataset from which duplicates of trending tweets generated by malicious
bots need to be eliminated. Our future work involves devising better approaches
to evaluating bot detection tools, and developing new criteria for discovering
more sophisticated bots that contaminate the Twitter stream. Malicious bot de-
tection is an arms race between bot makers and social media platforms and we
hope our work contributes to the design of better bot account detection policies.
References
1. Earthquakebot. https://twitter.com/earthquakebot?lang=en, accessed: 2017-03-30
2. Fighting spam with botmaker. https://blog.twitter.com/2014/fighting-spam-with-botmaker, accessed: 2017-03-20
3. Twitter Annual Report. http://files.shareholder.com/downloads/AMDA-2F526X/4335316487x0xS1564590-17-2584/1418091/filing.pdf, accessed: 2017-04-22
4. Twitter Bot Monitor project on GitHub. https://github.com/Joe--Chen/TwitterBotProject
5. Twitter developer documentation. https://dev.twitter.com/streaming/overview/request-parameters, accessed: 2017-03-20
6. The Twitter Rules. https://support.twitter.com/articles/18311, accessed: 2017-01-13
7. We need you to help to get Linkis back to work. http://blog.linkis.com/2017/06/02/we-need-you-to-help-to-get-linkis-back-to-work, accessed: 2017-07-19
8. Cao, C., Caverlee, J.: Detecting spam URLs in social media via behavioral analysis. In: European Conference on Information Retrieval. pp. 703–714. Springer (2015)
9. Chavoshi, N., Hamooni, H., Mueen, A.: DeBot: Twitter bot detection via warped correlation. In: Proceedings of the 16th IEEE International Conference on Data Mining (2016)
10. Chu, Z., Gianvecchio, S., Wang, H., Jajodia, S.: Detecting automation of Twitter accounts: Are you a human, bot, or cyborg? IEEE Transactions on Dependable and Secure Computing 9(6), 811–824 (2012)
11. Davis, C.A., Varol, O., Ferrara, E., Flammini, A., Menczer, F.: BotOrNot: A system to evaluate social bots. In: Companion to Proceedings of the 25th International Conference on the World Wide Web. pp. 273–274. International World Wide Web Conferences Steering Committee (2016)
12. Ferrara, E., Varol, O., Davis, C., Menczer, F., Flammini, A.: The rise of social bots. Communications of the ACM 59(7), 96–104 (2016)
13. Ferrara, E., Varol, O., Davis, C.A., Menczer, F., Flammini, A.: Online human-bot interactions: Detection, estimation, and characterization. arXiv preprint arXiv:1703.03107 (2017)
14. Jiang, M., Cui, P., Faloutsos, C.: Suspicious behavior detection: Current trends and future directions. IEEE Intelligent Systems 31(1), 31–39 (2016)
15. Montesinos, L., Rodríguez, S.J.P., Orchard, M., Eyheramendy, S.: Sentiment analysis and prediction of events in Twitter. In: 2015 CHILEAN Conference on Electrical, Electronics Engineering, Information and Communication Technologies (CHILECON). pp. 903–910 (Oct 2015)
16. Wang, D., Navathe, S., Liu, L., Irani, D., Tamersoy, A., Pu, C.: Click traffic analysis of short URL spam on Twitter. In: 9th International Conference on Collaborative Computing: Networking, Applications and Worksharing (CollaborateCom), 2013. pp. 250–259. IEEE (2013)