Detect Me If You Can: Spam Bot Detection Using Inductive Representation Learning
Seyed Ali Alhosseini
Hasso-Plattner-Institute
University of Potsdam
Potsdam, Germany
seyedali.alhosseini@hpi.de
Raad Bin Tareaf
Hasso-Plattner-Institute
University of Potsdam
Potsdam, Germany
raad.bintareaf@hpi.de
Pejman Najafi
Hasso-Plattner-Institute
University of Potsdam
Potsdam, Germany
pejman.naja@hpi.de
Christoph Meinel
Hasso-Plattner-Institute
University of Potsdam
Potsdam, Germany
christoph.meinel@hpi.de
ABSTRACT
Spam bots have become a threat to online social networks with their malicious behavior, posting misinformation messages and influencing online platforms to fulfill their motives. As spam bots have become more advanced over time, creating algorithms to identify bots remains an open challenge. Learning low-dimensional embeddings for nodes in graph-structured data has proven useful in various domains. In this paper, we propose a model based on graph convolutional neural networks (GCNN) for spam bot detection. Our hypothesis is that to better detect spam bots, in addition to defining a feature set, the social graph must also be taken into consideration. GCNNs are able to leverage both the features of a node and aggregate the features of a node's neighborhood. We compare our approach with two methods that work solely on a feature set and on the structure of the graph. To our knowledge, this work is the first attempt at using graph convolutional neural networks in spam bot detection.
CCS CONCEPTS
• Information systems → Social networks; • Security and privacy → Social network security and privacy; • Computing methodologies → Neural networks.
KEYWORDS
Social Media Analysis, Bot Detection, Graph Embedding, Graph Convolutional Neural Networks
ACM Reference Format:
Seyed Ali Alhosseini, Raad Bin Tareaf, Pejman Najafi, and Christoph Meinel. 2019. Detect Me If You Can: Spam Bot Detection Using Inductive Representation Learning. In Companion Proceedings of the 2019 World Wide Web Conference (WWW '19 Companion), May 13–17, 2019, San Francisco, CA, USA. ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3308560.3316504
This paper is published under the Creative Commons Attribution 4.0 International (CC-BY 4.0) license. Authors reserve their rights to disseminate the work on their personal and corporate Web sites with the appropriate attribution.
WWW '19 Companion, May 13–17, 2019, San Francisco, CA, USA
© 2019 IW3C2 (International World Wide Web Conference Committee), published under Creative Commons CC-BY 4.0 License.
ACM ISBN 978-1-4503-6675-5/19/05.
https://doi.org/10.1145/3308560.3316504
1 INTRODUCTION
Online Social Networks (OSN) have provided a means of communication for individuals to share information and express their opinions in a free and simple manner. Twitter, Facebook and other social media websites have changed the way we consume news and interact with one another. The important role of these platforms has resulted in attempts by interest groups to influence users, seize their attention and ultimately change public opinion [3, 11].
Social bots are automated user accounts operated by a computer program mimicking human behavior with the intention of abusing the social media platform [3, 6]. They have evidently become a threat to online social networks with their malicious behavior: spamming with advertisements and scam URLs, promoting specific hashtags, spreading misinformation and impacting elections.
The research community has proposed several approaches for bot detection. These works differ in their definition of a bot account, the feature set selected to represent accounts, and the machine-learning algorithm used to distinguish bot accounts from normal user accounts.
However, spam bot detection remains an open challenge for several reasons. The first lies in the definition of bot accounts: there is no single definition that precisely determines an account to be a bot, which matters especially when building a ground-truth dataset. Another issue, as reported by [3, 17], is that bots have become more advanced and sophisticated in avoiding the existing detection methods; in fact, bots have been evolving over time. [6] has drawn attention to the rise of social bots that are designed to emulate human-like behavior: they are able to interact with other accounts, post tweets on different topics, and display activity similar to humans [3].
Recent advances in deep learning for graph-structured data have led to a new representation learning method named Graph Convolutional Networks (GCNs) [9, 10]. The main idea of GCNs is to represent a node in a vector space based on its features and the features of its neighboring nodes using neural networks. The advantage of GCNs is that they capture both the node features and the graph structure to learn a low-dimensional representation of nodes [7, 8].
In this work, we propose an inductive representation learning approach for bot detection based on user profile features and the social network graph. The main contributions of this work are summarized as follows:
• We deploy graph convolutional neural networks on a well-known spam bots dataset previously used in the literature.
• We compare our approach with two baseline algorithms: an MLP classifier and Belief Propagation applied to the dataset.
• We show that using the graph structure in our method yields better performance in spambot detection.
The remainder of this paper is structured as follows. First, we cover previous related work in spam bot detection and graph convolutional neural networks. Section 3 describes the dataset in detail. In Section 4, we provide an overview of our methodology. We illustrate our results in Section 5 and discuss the limitations of our work and suggestions for future work. Finally, we conclude this paper in Section 6.
2 RELATED WORK
In this section, we first review the literature on spambot detection, comparing each work by its definition of spambots, the features used and the classification algorithm employed. Next, we look at graph convolutional networks.
Lee et al. [12] proposed a method working as a honeypot trap for bot accounts. They created 60 Twitter accounts and started posting meaningless tweets that would hold no interest for humans. Despite this fact, they were able to draw some accounts' attention to follow the accounts they made. Analyzing these accounts in detail showed that they were in fact bot accounts trying to increase their following lists.
Yang et al. [17, 18] used a conservative definition of bot accounts, considering only accounts that post URLs linking to malicious content. They also introduced several robust features and used them with a BayesNet classifier to predict spam accounts, and they investigated the different approaches bots take to avoid detection by Twitter. Their findings show that bots tend to increase the reputation of their accounts by purchasing followers and posting more tweets.
Cresci et al. [4] introduced a DNA-inspired technique that models each account as a sequence of behavioral information and detects spambots based on similar sequences. They categorized each user's tweets into different types: based on whether a tweet contains URLs, hashtags, pictures, etc., it is assigned a different character. The similarity of accounts is then measured by the longest common substring of their DNA sequences.
BotOrNot [5] used a random forest classifier on more than 1000 features to detect bots. The features are categorized into six groups: network (degree distribution, clustering coefficient, ...), users' account information, friends (number of followers, followings, ...), temporal (tweet rate, ...), content (natural language processing, ...) and sentiment features. The downside of BotOrNot is that it was trained on English tweets, so its performance declines on bots tweeting in languages other than English.
DeBot [1, 2] is an unsupervised bot detection system. The idea behind this work is that accounts with highly correlated activities (tweets, retweets, ...) have a high chance of being bots. DeBot monitors the activities of accounts over a specific period and creates a time series for each account. It then clusters accounts based on the similarity of their time series using a lag-sensitive hashing method. Finally, DeBot reports the accounts with high correlation as bots.
Nasim et al. [13] defined spam bots as content polluters that try to take over a discussion for political or advertising reasons. Their approach considers individual tweets for detecting bots. Instead of focusing on the friend and follower network, they created an event network in which the nodes are users and the edges connect users who have tweeted on the same event. They also compute the diversity of a tweet based on the URLs and hashtags it mentions. The results of their work indicate that spam bots operate as a group, often tweeting at the same time.
2.1 Graph Convolutional Networks
Graph structures are used in many domains and applications, such as social networks and recommender systems. The challenging task is how to use graph structures in machine learning algorithms. Initial works in this area used statistics of the graph, such as node degree, centrality and betweenness coefficients, as features for training models. In other words, they treated the graph structure as a pre-processing step for extracting structural information; these approaches therefore do not use the graph structure in the learning phase. Another drawback of these approaches is that computing the graph statistics has high complexity, and the output cannot be reused on unseen data.
With recent advances in Convolutional Neural Networks (CNNs), there have been efforts to adapt this popular deep learning model for encoding graph structures. Two main approaches have been used for embedding the graph structure into a vector space, differing in how the convolution operation is defined. The first family of approaches takes fixed-length node sequences from the graph structure and uses them directly in the original CNN models that work in the Euclidean domain. Alternatively, the other methods model the graph structure in non-Euclidean domains. Kipf and Welling [10] proposed the graph convolutional network (GCN), which applies spectral convolutions on graph structures; the term convolutional is used since a node's neighborhood is considered as its representation. Their method can be considered the initial step for graph semi-supervised classification tasks. However, the drawbacks of their approach are that it requires the full graph Laplacian to be calculated and that the output embedding of a node in each layer depends on all its neighbors at the previous layer.
Most recently, Hamilton et al. [8] introduced GraphSAGE, a node embedding algorithm that uses neural networks to learn embeddings for nodes in the graph structure. Their main contribution is that they resolve the limitations mentioned above and show how to aggregate information from a node's neighborhood. Their method consists of two main phases:

(1) Defining the computation graph and training the neural networks. The structure of a node's neighborhood defines the computation graph for training the neural networks. In this phase, the objective is to build neural networks that ensure nodes close to each other have similar embeddings while nodes far from one another have different embeddings (a toy sketch follows below).

(2) Propagation. For each node, the information of its neighbors is aggregated and passed through the neural networks trained in the first phase.
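To make phase (1) concrete, the following toy sketch (ours, purely illustrative; GraphSAGE additionally samples fixed-size neighborhoods for scalability) collects the set of nodes whose features feed a single node's depth-k embedding:

def computation_graph(node, adj, k):
    """Nodes whose features feed the depth-k embedding of `node`.

    adj maps each node to the list of its neighbors.
    """
    frontier, seen = {node}, {node}
    for _ in range(k):
        # expand the frontier by one hop
        frontier = {u for v in frontier for u in adj[v]}
        seen |= frontier
    return seen

# Example: with k = 2 the embedding of "a" depends on "a", its
# neighbors, and their neighbors.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
assert computation_graph("a", adj, 2) == {"a", "b", "c"}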
3 DATASET
There are several well-known datasets collected by different research groups specifically for bot detection on Twitter. Lee et al. [12] provide a social honeypot dataset that contains approximately 22,000 content polluters, together with the accounts' metadata and the tweets of each account. However, in their released dataset they anonymized the Twitter account ids, so collecting further information is not possible. Cresci et al. have worked on different Twitter datasets in [3] and labeled the different types of accounts using a crowdsourcing platform. [16] released the Twitter ids of the accounts they detected as spambots.
Yang et al. [17] collected Twitter spammers, and their dataset contains each account's followers and followings. To the best of our knowledge, this is the only dataset we found that gathers this information for the Twitter accounts. The authors of that work have kindly shared their dataset, and we use it in the present paper. The dataset contains 11,000 nodes and 2,342,816 edges between them.
Table 1 shows the statistics of the dataset used in this paper. The age, tweets and neighbors columns indicate the average amount in each group. The age column is the average age of the accounts, reported in days and normalized by setting the oldest day as the first day. The majority of edges between nodes are user-to-user connections; however, around 5.4% of the edge relations involve bot accounts.
          Accounts   Age       Tweets    Neighbors
bots      1000       3023.80   220.90    1963.84
users     10000      3174.28   4658.52   21579.76

relation   bot-bot   bot-user   user-bot   user-user
           2673      73363      50153      2216627
           0.11%     3.13%      2.14%      94.61%

Table 1: Dataset statistics
Figure 1 shows the degree distribution of the accounts in the dataset. Most accounts have a small number of followers and followings, and a few accounts have more than 1000 accounts in their neighborhood.
Figure 2 shows the age (a) and the account name length (b) for both bot and user accounts. As shown in Figure 2(a) and reported in previous work [6], bot accounts have a smaller age, meaning they were created more recently than user accounts. Also, as [13] indicated, there is no significant difference in the length of account names.
4 METHODOLOGY
We used an inductive representation learning approach similar to [8, 9] for detecting Twitter bot accounts.

Figure 1: The degree distribution of the nodes in the graph. The figure is drawn in log-log scale.
4.1 Problem denition
Let $G = (V, E)$ be a graph where for each $v \in V$ there exists a feature vector $X_v$ and a binary label $y \in \{0, 1\}$ associated with it. The goal is to find an embedding vector $h_v$ for each node $v \in V$ such that $f(h_v)$ predicts the label of the node in the graph.
Similar to convolution filters in image processing, graph convolutional networks consider the attributes of a node's neighbors as a representation for that node. Let us define $k$ as the depth of the neighborhood from which information is aggregated. If $k = 1$, only the information from the node's own neighbors is considered; for $k = 2$, information is also gathered from the neighbors of its neighbors, and so on. The output $h_v^k$ at each depth is calculated as follows:

$h_{N(v)}^{k} = \mathrm{mean}\left(\left\{ h_u^{k-1}, \forall u \in N(v) \right\}\right)$   (1)

$h_v^{k} = f_k\left(h_v^{k-1}, h_{N(v)}^{k}\right) = \sigma\left(W_k \cdot \mathrm{concat}\left(h_v^{k-1}, h_{N(v)}^{k}\right)\right)$   (2)

where $h_{N(v)}^{k}$ is the average of the embedding vectors of $v$'s neighbors, which is concatenated with $v$'s previous embedding and transformed to produce the output $h_v^{k}$.

The neural networks are optimized based on the cross-entropy loss function:

$J\left(f_k(h_v^{k-1}, h_{N(v)}^{k}), y\right) = -\sum_{v \in V} \left[\, y \log\left(f(X_v)\right) + (1 - y) \log\left(1 - f(X_v)\right) \,\right]$   (3)
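As a concrete reading of Equations (1) and (2), the following minimal NumPy sketch (ours, not the authors' implementation; a ReLU stands in for the nonlinearity sigma, and the graph is a plain adjacency dictionary) computes $h_v^k$ recursively:

import numpy as np

def embed(v, k, X, adj, W):
    """Depth-k embedding h_v^k of node v, following Eqs. (1)-(2).

    X:   dict mapping node -> initial feature vector (h_v^0 = X_v)
    adj: dict mapping node -> list of neighbors N(v)
    W:   list of weight matrices; W[k-1] is applied at depth k
    """
    if k == 0:
        return X[v]                                 # base case: raw features
    # Eq. (1): mean of the neighbors' depth-(k-1) embeddings
    h_neigh = np.mean([embed(u, k - 1, X, adj, W) for u in adj[v]], axis=0)
    # Eq. (2): concatenate with v's previous embedding, project, nonlinearity
    h_prev = embed(v, k - 1, X, adj, W)
    return np.maximum(W[k - 1] @ np.concatenate([h_prev, h_neigh]), 0.0)

Training then amounts to minimizing the cross-entropy loss of Equation (3) over the labeled nodes with respect to the weight matrices $W_k$.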
4.2 Features
The initial vector ($X_v$) for each user consists of the features that can be retrieved directly from the Twitter API user object (https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/user-object.html). The feature vector consists of:
Figure 2: Bot (red) and user (blue) attributes: (a) age, (b) account name length
#   Feature name           Description
1   age                    The created_at attribute returns the datetime
                           that an account was created on Twitter. The age
                           feature is the number of days since the
                           created_at date.
2   favourites_count       The number of tweets a user has liked.
3   statuses_count         The number of tweets, including retweets, a
                           user has posted.
4   account name length    The length of an account's name.
5   followers_count        The number of followers an account has.
6   friends_count          The number of accounts the user is following.

Table 2: Features
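To illustrate Table 2, here is a sketch of how the six features might be assembled from a Twitter API v1.1 user object (a parsed JSON dictionary; the field names follow the API's user-object documentation, while the helper itself is ours and purely illustrative):

from datetime import datetime, timezone

def user_features(user, now=None):
    """Assemble the six-feature vector of Table 2 from a user object."""
    # created_at arrives in the form "Wed Oct 10 20:19:24 +0000 2018"
    created = datetime.strptime(user["created_at"], "%a %b %d %H:%M:%S %z %Y")
    now = now or datetime.now(timezone.utc)
    return [
        (now - created).days,        # 1: age in days (normalized afterwards)
        user["favourites_count"],    # 2: tweets the user has liked
        user["statuses_count"],      # 3: tweets posted, including retweets
        len(user["name"]),           # 4: account name length
        user["followers_count"],     # 5: number of followers
        user["friends_count"],       # 6: number of accounts followed
    ]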
5 EVALUATION
In this section, we evaluate the performance of our approach. We conducted 5-fold cross-validation on the dataset to evaluate the accuracy of the model. Figure 3 shows the area under the curve for each fold; on average, the GCNN reaches 0.94 accuracy measured by the area under the ROC curve.

We measured the precision, recall and f1 metrics shown in Table 5 for the evaluation. Choosing a meaningful evaluation metric for the classification task is important. For example, it is possible to use the precision measure defined in Equation 4 to evaluate the performance of a model; in this case, the measures are calculated over all the data, disregarding the class labels:
$\mathrm{Precision}_{micro} = \dfrac{\sum_c TP}{\sum_c TP + \sum_c FP}$   (4)

$\mathrm{Recall}_{micro} = \dfrac{\sum_c TP}{\sum_c TP + \sum_c FN}$   (5)
Figure 3: ROC curve over 5-fold cross-validation
$f1_{micro} = \dfrac{2 \cdot \mathrm{Precision}_{micro} \cdot \mathrm{Recall}_{micro}}{\mathrm{Precision}_{micro} + \mathrm{Recall}_{micro}}$   (6)
However, by this definition, for a dataset where the majority of labels belong to one class, the precision score remains high even if the model has not detected the labels of the other class correctly. Therefore, for a better evaluation of the model, we compute the precision, recall and f1 score for each class separately and report the average score over the two classes. This is also known as the macro score in the scikit-learn Python library [15].
$\mathrm{Precision}_{macro} = \dfrac{1}{|c|} \sum_c \dfrac{TP}{TP + FP}$   (7)

$\mathrm{Recall}_{macro} = \dfrac{1}{|c|} \sum_c \dfrac{TP}{TP + FN}$   (8)

$f1_{macro} = \dfrac{2 \cdot \mathrm{Precision}_{macro} \cdot \mathrm{Recall}_{macro}}{\mathrm{Precision}_{macro} + \mathrm{Recall}_{macro}}$   (9)
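The distinction matters here because the dataset is imbalanced (ten users for every bot). A small scikit-learn illustration with toy labels (ours) makes the difference visible:

from sklearn.metrics import precision_recall_fscore_support

# A classifier that never predicts "bot" on a 9:1 imbalanced toy sample.
y_true = [0] * 9 + [1]    # nine users, one bot
y_pred = [0] * 10         # every account predicted as "user"

for avg in ("micro", "macro"):
    p, r, f, _ = precision_recall_fscore_support(
        y_true, y_pred, average=avg, zero_division=0)
    print(avg, round(p, 2), round(r, 2), round(f, 2))
# micro 0.9 0.9 0.9    -- looks strong despite missing every bot
# macro 0.45 0.5 0.47  -- exposes the failure on the bot class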
ψij(xi, xj)   xj = user   xj = bot
xi = user     0.5 + wϵ    0.5 − wϵ
xi = bot      0.5 − wϵ    0.5 + wϵ

Table 3: Edge potential matrix
Node                 P(user)   P(bot)
User                 0.99      0.01
Bot                  0.01      0.99
Unknown/Validation   0.5       0.5

Table 4: Node potentials based on the original state
5.1 Comparison with MLP and Belief Propagation
We further evaluated our approach by comparing it with two other methods. As graph convolutional neural networks take both the feature set and the graph structure into consideration, we demonstrate the performance of this method by comparing it with a multilayer perceptron (MLP) and belief propagation (BP).
The MLP classier is trained based on the feature set dened
in the Features section. The input layer takes the feature vectors
normalized to values between 0 and 1 for each account. The hidden
layers consist of two layers with 25 and 10 neurons respectively
and use a rectied linear unit as the transfer function. The log
loss function is optimized using stochastic gradient descent with a
learning rate of 0.0001.
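A sketch of this baseline with scikit-learn (the library choice is our assumption; the hyperparameters follow the description above):

from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Features scaled to [0, 1]; two hidden layers of 25 and 10 ReLU units;
# log loss optimized by stochastic gradient descent, learning rate 0.0001.
mlp = make_pipeline(
    MinMaxScaler(),
    MLPClassifier(hidden_layer_sizes=(25, 10), activation="relu",
                  solver="sgd", learning_rate_init=0.0001),
)
# Usage: mlp.fit(X_train, y_train); scores = mlp.predict_proba(X_test)[:, 1]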
The Belief Propagation algorithm, on the other hand, runs solely on the graph structure. BP, originally proposed by Judea Pearl [14], infers a node's label from prior knowledge about that node and its neighboring nodes by iteratively passing messages between all pairs of adjacent nodes in the graph. A message indicates a node's belief regarding the state of its neighbor; for details, please refer to [19]. In these experiments, we adopted the original BP with the node and edge potentials given in Tables 3 and 4, and ran it for 7 iterations, as the messages passed across nodes showed no significant changes after 7 iterations.
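For concreteness, a minimal synchronous-update sketch of this procedure with the potentials of Tables 3 and 4 (our simplification following [14, 19], not the authors' code; w_eps denotes the wϵ term of Table 3):

import numpy as np

def belief_propagation(adj, priors, w_eps=0.1, iters=7):
    """adj: node -> list of neighbors (edges listed in both directions);
    priors: node -> np.array([P(user), P(bot)]) as in Table 4."""
    # Edge potential psi(x_i, x_j) from Table 3
    psi = np.array([[0.5 + w_eps, 0.5 - w_eps],
                    [0.5 - w_eps, 0.5 + w_eps]])
    # m[(i, j)]: message from i to j about j's state, initialized uniform
    m = {(i, j): np.full(2, 0.5) for i in adj for j in adj[i]}
    for _ in range(iters):
        new = {}
        for i in adj:
            for j in adj[i]:
                # i's prior times messages from all neighbors except j ...
                h = priors[i].copy()
                for k in adj[i]:
                    if k != j:
                        h *= m[(k, i)]
                msg = psi.T @ h            # ... marginalized over x_i
                new[(i, j)] = msg / msg.sum()
        m = new
    # Final belief: prior times all incoming messages, normalized
    beliefs = {}
    for i in adj:
        b = priors[i] * np.prod([m[(k, i)] for k in adj[i]], axis=0)
        beliefs[i] = b / b.sum()
    return beliefs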
We plotted the Receiver Operating Characteristic (ROC) curves for the different models, as shown in Figure 4. We observe that the area under the ROC curve is 94% for the GCNN approach, which is 8 and 16 percentage points higher than the MLP and BP approaches, respectively. Table 5 breaks the comparison down by macro precision, recall and f1.

Figure 4: Comparison of the area under the curve of different algorithms.

                             Precision_macro   Recall_macro   f1_macro
MLP                          0.81              0.73           0.77
BP                           0.56              0.54           0.55
GCNN (features 1, 2, 3, 4)   0.85              0.77           0.80
GCNN (features 5, 6)         0.80              0.69           0.72
GCNN (all features)          0.89              0.80           0.84

Table 5: Comparison of different algorithms on the dataset

While neural networks have been shown to perform well in various domains, they are often considered black boxes when it comes to explaining their outputs. Interpreting each entry of the output and the meaning of the generated embedding vectors remains an open question and a topic for future research.
6 CONCLUSION AND FUTURE WORK
In this paper, we have examined a new approach for detecting malicious accounts and social bots on Twitter using graph convolutional networks. The main idea of our method is to employ the graph structure and the relationships of Twitter accounts for classifying the accounts: each account aggregates the feature information from its neighborhood. To demonstrate the efficacy of our proposal, we have worked on a well-known dataset previously used in bot detection. Results show that our approach outperforms the state-of-the-art classification algorithms, with an 8% improvement in area-under-curve accuracy.
Since the Twitter API limits requests to 15 per 15-minute rate-limit window, building the Twitter graph structure from the follower and friend relations of accounts is not an easy task. We are aware this may be considered a limitation of our approach. It can thus be suggested to build the graph structure from the retweet graph of user accounts instead. Finally, a specific extension for future work is to deploy this method in real time on Twitter's streaming API for spambot detection.
ACKNOWLEDGMENTS
The authors would like to thank the HPI Future SOC Lab for providing access to the resources during the period of Fall 2018.
REFERENCES
[1] N. Chavoshi, H. Hamooni, and A. Mueen. 2016. DeBot: Twitter Bot Detection via Warped Correlation. In 2016 IEEE 16th International Conference on Data Mining (ICDM). 817–822. https://doi.org/10.1109/ICDM.2016.0096
[2] Nikan Chavoshi, Hossein Hamooni, and Abdullah Mueen. 2017. Temporal Patterns in Bot Activities. In Proceedings of the 26th International Conference on World Wide Web Companion (WWW '17 Companion). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 1601–1606. https://doi.org/10.1145/3041021.3051114
[3] Stefano Cresci, Roberto Di Pietro, Marinella Petrocchi, Angelo Spognardi, and Maurizio Tesconi. 2017. The Paradigm-Shift of Social Spambots: Evidence, Theories, and Tools for the Arms Race. In Proceedings of the 26th International Conference on World Wide Web Companion (WWW '17 Companion). 963–972. https://doi.org/10.1145/3041021.3055135
[4] S. Cresci, R. Di Pietro, M. Petrocchi, A. Spognardi, and M. Tesconi. 2016. DNA-Inspired Online Behavioral Modeling and Its Application to Spambot Detection. IEEE Intelligent Systems 31, 5 (Sept 2016), 58–64. https://doi.org/10.1109/MIS.2016.29
[5] Clayton Allen Davis, Onur Varol, Emilio Ferrara, Alessandro Flammini, and Filippo Menczer. 2016. BotOrNot: A System to Evaluate Social Bots. In Proceedings of the 25th International Conference Companion on World Wide Web (WWW '16 Companion). International World Wide Web Conferences Steering Committee, 273–274. https://doi.org/10.1145/2872518.2889302
[6] Emilio Ferrara, Onur Varol, Clayton Davis, Filippo Menczer, and Alessandro Flammini. 2016. The Rise of Social Bots. Commun. ACM 59, 7 (June 2016), 96–104. https://doi.org/10.1145/2818717
[7] Palash Goyal, Homa Hosseinmardi, Emilio Ferrara, and Aram Galstyan. 2018. Embedding Networks with Edge Attributes. In Proceedings of the 29th on Hypertext and Social Media (HT '18). ACM, New York, NY, USA, 38–42. https://doi.org/10.1145/3209542.3209571
[8] Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive Representation Learning on Large Graphs. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.). Curran Associates, Inc., 1024–1034. http://papers.nips.cc/paper/6703-inductive-representation-learning-on-large-graphs.pdf
[9] William L. Hamilton, Rex Ying, and Jure Leskovec. 2017. Representation Learning on Graphs: Methods and Applications. CoRR abs/1709.05584 (2017). arXiv:1709.05584 http://arxiv.org/abs/1709.05584
[10] Thomas N. Kipf and Max Welling. 2016. Semi-Supervised Classification with Graph Convolutional Networks. CoRR abs/1609.02907 (2016). arXiv:1609.02907 http://arxiv.org/abs/1609.02907
[11] Srijan Kumar and Neil Shah. 2018. False Information on Web and Social Media: A Survey. CoRR abs/1804.08559 (2018). arXiv:1804.08559 http://arxiv.org/abs/1804.08559
[12] Kyumin Lee, Brian David Eoff, and James Caverlee. 2011. Seven Months with the Devils: A Long-Term Study of Content Polluters on Twitter. In AAAI Int'l Conference on Weblogs and Social Media (ICWSM).
[13] Mehwish Nasim, Andrew Nguyen, Nick Lothian, Robert Cope, and Lewis Mitchell. 2018. Real-time Detection of Content Polluters in Partially Observable Twitter Networks. In Companion Proceedings of the The Web Conference 2018 (WWW '18). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 1331–1339. https://doi.org/10.1145/3184558.3191574
[14] Judea Pearl. 2014. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Elsevier.
[15] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
[16] Onur Varol, Emilio Ferrara, Clayton A. Davis, Filippo Menczer, and Alessandro Flammini. 2017. Online Human-Bot Interactions: Detection, Estimation, and Characterization. In Proceedings of the Eleventh International Conference on Web and Social Media, ICWSM 2017, Montréal, Québec, Canada, May 15–18, 2017. AAAI Press.
[17] Chao Yang, Robert Harkreader, and Guofei Gu. 2013. Empirical Evaluation and New Design for Fighting Evolving Twitter Spammers. IEEE Transactions on Information Forensics and Security (2013).
[18] Chao Yang, Robert Harkreader, Jialong Zhang, Seungwon Shin, and Guofei Gu. 2012. Analyzing Spammers' Social Networks for Fun and Profit: A Case Study of Cyber Criminal Ecosystem on Twitter. In Proceedings of the 21st International Conference on World Wide Web (WWW '12). ACM, New York, NY, USA, 71–80. https://doi.org/10.1145/2187836.2187847
[19] Jonathan S. Yedidia, William T. Freeman, and Yair Weiss. 2003. Understanding Belief Propagation and Its Generalizations. Exploring Artificial Intelligence in the New Millennium 8 (2003), 236–239.
Conference Paper
Predicting links in information networks requires deep understanding and careful modeling of network structure. Network embedding, which aims to learn low-dimensional representations of nodes, has been used successfully for the task of link prediction in the past few decades. Existing methods utilize the observed edges in the network to model the interactions between nodes and learn representations which explain the behavior. In addition to the presence of edges, networks often have information which can be used to improve the embedding. For example, in author collaboration networks, the bag of words representing the abstract of co-authored paper can be used as edge attributes. In this paper, we propose a novel approach, which uses the edges and their associated labels to learn node embeddings. Our model jointly optimizes higher order node neighborhood, social roles and edge attributes reconstruction error using deep architecture which can model highly non-linear interactions. We demonstrate the efficacy of our model over existing state-of-the-art methods on two real world data sets. We observe that such attributes can improve the quality of embedding and yield better performance in link prediction.