Camouflaged bot detection using the friend list
Maxim Kolomeets∗†‡, Olga Tushkanova, Dmitry Levshun∗†‡, Andrey Chechulin
{kolomeec,tushkanova,levshun,chechulin}@comsec.spb.ru
Laboratory of Computer Security Problems
St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS)
14th line of VO 39, St. Petersburg, Russia
ITMO University
Kronverksky Pr. 49, St. Petersburg, Russia
Université Paul Sabatier Toulouse III
118 Route de Narbonne, Toulouse, France
Abstract—The paper considers the task of bot detection in social networks. The study is focused on the case when the account is closed by privacy settings and the bot needs to be identified by its friend list. The paper proposes a solution based on machine learning and statistical methods. The VKontakte social network is used as a data source.
The paper provides a review of the data that one needs to obtain from the social network for bot detection when the profile is closed by privacy settings.
The paper includes a description of feature extraction from the VKontakte social network with an evaluation of extraction complexity, and a description of feature construction using statistics, Benford's law, and the Gini index.
The paper describes the experiment. To collect data for training, we collected bot and real-user datasets. To collect bots, we made fake groups and bought bots of different quality for them from 3 different companies.
We performed two series of experiments. In the first series, all the features were used to train and evaluate the classifiers. In the second series, the features were preliminarily examined for the presence of strong correlation between them.
The results demonstrate the feasibility of rather high-accuracy private-account bot detection by means of unsophisticated off-the-shelf algorithms in combination with data scaling. The Random Forest Classifier yields the best results, with ROC AUC above 0.9 and FPR below 0.3.
We also discuss the limitations of the experimental part and plans for future research.
Index Terms—social network, bot, bot detection, social relationship, information security
I. INTRODUCTION
Searching for bots on social networks is one of the most requested security functions among commercial companies and law enforcement agencies. In many cases, bots are used for reputation manipulation, unfair competition, spreading disinformation, and fraud. In some cases, bot activity can pose a serious threat to the integrity of online communities. For example, recent investigations [4]–[7], [15] show that electoral interference, the spread of anti-vaccine conspiracy theories, and stock market manipulations are carried out using bots.
This research was supported by the Russian Science Foundation under grant number 18-71-10094 in SPC RAS.
The demand for such services has led to a significant increase in bot detection studies: many machine learning, graph-based, and statistics-based techniques for bot detection have been developed. On the other hand, since security specialists pay much attention to techniques for creating and operating bots, bots began to use better camouflage methods. Bots try to look like real users, and the highest-quality bots are hacked accounts of real persons. But the simplest and least expensive method of camouflage is to use the legal privacy settings of the social network. Account privacy settings prevent obtaining the profile information that is necessary for bot recognition.
Thus, it becomes necessary to recognize bots in cases when the analyzed account is hidden by privacy settings. Our proposal is based on the fact that even if the account's profile is closed by privacy settings, its friend list can still be analyzed. In this paper, we analyze the possibility of detecting bots by the list of friends, without analyzing the content of the account page itself.
It is important to note that in this paper we do not consider the problem of obtaining the list of friends of a closed account. We assume that the list of friends is available for analysis.
The novelty of the paper lies in a new approach to bot detection which can recognize artificial profiles even if these profiles are closed by privacy settings. The contribution of the paper is the list of features and the description of their extraction, construction, and selection.
The paper is organized as follows. The second section is the state of the art, where we describe bot detection techniques and types of source data. The third section presents the proposed features: feature extraction (how we extract data from the VKontakte social network and how complex it is) and feature construction (which data and techniques we used to construct features). The fourth section describes the bot detection approach and experimental results: methods (which classification techniques we used), source data (how we made the datasets), and experiments (where we summarize and discuss the results obtained). The fifth section is the discussion, where we review the pros and cons of the proposed approach and the limitations of the experiment. The last section is the conclusion, where we summarise results and present plans for future work.
Preprint. PDP-21 conference.
II. STATE OF THE ART
Nowadays there exists a market where anyone can buy bots of different quality. At the same time, quality means different things [1]. For example, one seller claims that high-quality bots are accounts with more complete profiles and a larger number of friends (in comparison with low-quality bots). Another case is when high-quality bots are hacked accounts of real users. The third option is when high-quality bots are real people who perform someone's tasks for money. It is impossible to reliably establish which bot management strategy is used for each quality level. But all sellers claim that higher-quality bots are harder to recognize.
So we can say that the cost of a bot directly depends on the effectiveness of its camouflage. The most common camouflage technique is to restrict access to the social network profile using legal privacy settings. Privacy settings help to hide some of the profile fields, so the bot cannot be distinguished from a real person who uses the privacy settings.
If one has access to all the data, it is possible to use 4 "families" of methods [17] for bot detection or feature extraction:
Statistical: methods based on searching for anomalies in the distributions of features. For bot detection, thresholds on some statistical measures (mean, quartiles, p-value, etc.) can be used. For example, an account can be recognized as a bot if some number of measures is above or below a certain threshold [10]. Statistical methods are also the most common for feature extraction.
Network science: methods based on the analysis of graph structures [3] that can be formed by users or content of a social network. Network science algorithms can be applied to various types of graphs to calculate centrality measures, such as betweenness centrality, eigenvector centrality, the clustering coefficient, and others.
Analytical: various methods of manual mapping of bots [4]–[7]. It can be segmentation into users and bots by viewing the user's home page, visual analysis of graph structures formed by sets of users, manual text markup, and similar manual techniques. These methods cannot provide high accuracy and are very expensive. Also, it is difficult to justify decisions based on analytical techniques: often, the criteria used in such methods are very subjective. However, social media companies have a larger arsenal of data than others, so they can reasonably detect bots by manual analysis of user HTTP requests, used IP addresses, and other information inaccessible to a wide audience [16]. Analytical methods can also be used during development, for example, text markup for future machine learning, or visualization [8] of graphs with centrality measures when developing algorithms.
Machine learning: feature-based bot recognition methods [1], [2], [9]. Machine learning methods are used most often because they allow one to use the results of all the described methods (statistical, network science, analytical, and machine learning itself) as features for training. Machine learning is also the common method for the analysis of media (e.g., image recognition) and text (e.g., sentiment analysis [18]).
To use these methods, one needs to get the account's features from the social network. The list of features differs from one social network to another, but there are 5 big groups of features [2] that can be extracted:
Account based: features that one can extract from the account's home page. Home pages vary depending on the social network. They usually include the name, city, hobbies, and other general information that helps describe the user. As numerical values, one can use the length of a field, the number of words in a field, the numbers of friends/posts/photos/followers/subscriptions/etc., the account age or serial number, and other values depending on the social network.
Adjacent account based: distribution parameters of account based features, which one can extract from adjacent accounts. There are many ways to obtain the list of adjacent accounts. The most obvious are the lists of friends, followers, accounts that replied to the user's posts, family members, etc. More lists of adjacent accounts can be obtained by complex requests, for example: accounts with which the user starts discussions, accounts that regularly "liked", and others. From these lists of adjacent accounts one can extract the distributions of numerical account based features. After that, one can extract the basic statistical parameters: quartiles, mean, dispersion, statistical hypothesis test results, and others.
Text and media: features that one can extract from content. The content can include text in posts or comments, videos, photos, polls, emojis, live streams, gifts, and so on. The extracted numerical values depend on the content type. For text it can be the length of the text, number of words, text entropy, emotions, etc. For video it can be the length, number of views, types of objects in the video, etc.
Graph centrality measures: features that can be obtained by the analysis of graphs of accounts (the vertex is an account) or graphs of content (the vertex is text/media/etc.). The graphs of accounts can be formed in the same ways as adjacent account lists (e.g., a graph of friends, a graph of accounts that regularly "liked"). Usually the edge in such graphs is friendship, but it is also possible to use other complex relations of users. The graphs of content can be formed from the dependencies of user content [11], for example: one can build a tree of posts where an edge is a re-post, or one can link posts that were commented on by one person, etc. Also, weights can be applied to the vertices of graphs. Weights for account-vertices can be account based features. Weights for content-vertices can be text/media based features. It is also possible to apply more complex weights: for example, weights for content-vertices can be features of the account that created the post. By applying network science algorithms to such graphs, it is possible to obtain a distribution of centrality measures. After that, one can extract the basic statistical parameters.
Temporal: features that can be extracted by the analysis of a timeline. Such features describe the variability of all previously described features over time. For example, one can extract the number of posts per week, the increase in the number of subscribers per month, the shift in the mean of some centrality measure per week, and so on.
If the profile is closed by privacy settings, it is obvious that one cannot use some of the account based, text and media, and temporal features.
In this paper, we analyse the possibility to detect a bot while the profile is closed by privacy settings (we cannot extract information from the profile directly). We do this by the analysis of adjacent account based features (the friend list) using statistical and machine learning methods.
III. PROPOSED FEATURES
A. Features extraction
An algorithm for data collection from the VKontakte social network can be divided into 4 main steps:
1) Collection and parsing of files with input data: unique identifiers of social network users.
2) Collection of data about user profiles based on their identifiers: page status, birth date, and friends, groups, followers, subscriptions, albums, photos and posts counts.
3) Collection of unique identifiers of friends of the users from the previous step.
4) Collection of data about user profiles based on the identifiers obtained during the previous step; the collected data is similar to step 2.
Data collection from the VKontakte social network is based [14] on the VK API, the VK Bridge library, web applications, and access tokens. The API can be used by anyone with access to the social network, but there are limitations on how it can be used and for what purpose. For example:
There can't be more than 3 requests per 1000 milliseconds from one access token.
Some requests can't be used more than a limited number of times per day from one access token (for example, "wall.get" 5000, "newsfeed.search" 1000, etc.).
Some requests can't return more than a limited number of items per request (for example, "friends.get" 5000, "groups.get" 1000, "wall.get" 100, etc.).
The availability of different requests depends on the type of access token and the access rights provided to it.
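The per-token limit of 3 requests per 1000 milliseconds can be respected with a simple client-side throttle. Below is a minimal sketch; the `ThrottledClient` class name and the `request_fn` callback are our own illustrative assumptions, not part of any VK SDK:

```python
import time

class ThrottledClient:
    """Minimal sketch of an API client that respects a per-token
    requests-per-second limit (3 req/s for one VK access token)."""

    def __init__(self, request_fn, max_rps=3):
        self.request_fn = request_fn      # callable(method, params) -> dict
        self.min_interval = 1.0 / max_rps # minimal gap between two calls
        self.last_call = 0.0

    def call(self, method, params):
        # Sleep just enough to keep at most `max_rps` calls per second.
        wait = self.min_interval - (time.monotonic() - self.last_call)
        if wait > 0:
            time.sleep(wait)
        self.last_call = time.monotonic()
        return self.request_fn(method, params)
```

With several access tokens, one such throttle per token can be run in parallel, which is what the time estimates below assume.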
For the experiments, a special application in the infrastructure of the VKontakte social network was developed: a VK mini app. Such an application can request access to its users' access tokens and collect data based on them. As was mentioned before, data about user profiles were collected. To collect all the data it was required to query the API 7 times per user, but if the user's page is not open, the data collection process was stopped for him or her after the first request (or during the process if, for example, the user was banned in the middle of it).
So, if one wants to calculate the amount of time required to collect data about users' page status, birth date and friends, groups, followers, subscriptions, albums, photos and posts counts, the following formulas can be used:
Time = ([N_d], [N_h], [N_m], [N_s]),

where [N_d] is the whole part of the number of days required to collect all the data; [N_h] is the whole part of the number of hours; [N_m] is the whole part of the number of minutes; [N_s] is the whole part of the number of seconds.

N_d = (N_u × N_API) / (min(N_API_rpd) × N_at),

where N_d is the number of days required to collect all the data; N_u is the number of users to collect the data for; N_API is the average number of API requests per user needed to collect the data; N_API_rpd is the set of limits related to the APIs, showing the possible number of times per day each can be requested from one access token; min(N_API_rpd) is a function that returns the minimal possible number of requests per day among the APIs (the bottleneck); N_at is the number of available access tokens.

N_s = (N_u × N_API − min(N_API_rpd) × N_at × [N_d]) / (3 × N_at),

where N_s is the number of seconds required to collect the data left after [N_d] days of collection (at 3 requests per second per access token).

N_h = N_s ÷ 3600,

where N_h is the number of hours required to collect the data left after [N_d] days of collection.

N_m = (N_s − [N_h] × 3600) ÷ 60,

where N_m is the number of minutes required to collect the data left after [N_d] days and [N_h] hours of collection.

N_s = N_s − [N_h] × 3600 − [N_m] × 60,

where N_s is the number of seconds required to collect the data left after [N_d] days, [N_h] hours, and [N_m] minutes of collection.
For example, approximately 216 days, 2 minutes and 14 seconds are required to collect numerical profile data about 1 000 000 unique users if N_API = 5.402 (this value depends on the percentage of open pages; the current one was taken according to our database statistics), N_at = 5, and min(N_API_rpd) = 5000.
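The formulas above can be packaged into a small helper; this is a sketch under the stated assumptions (the function and argument names are ours), and it reproduces the worked example for 1 000 000 users:

```python
def collection_time(n_users, n_api, rpd_limit, n_tokens, rps=3):
    """Estimate data-collection time per the formulas above:
    full days bounded by the daily request limit, then the leftover
    requests served at `rps` requests/second per access token."""
    total_requests = n_users * n_api        # N_u * N_API
    per_day = rpd_limit * n_tokens          # min(N_API_rpd) * N_at
    days = int(total_requests // per_day)   # [N_d]
    leftover = total_requests - days * per_day
    seconds = leftover / (rps * n_tokens)   # time for the remaining requests
    hours = int(seconds // 3600)            # [N_h]
    minutes = int((seconds - hours * 3600) // 60)  # [N_m]
    seconds = int(seconds - hours * 3600 - minutes * 60)  # [N_s]
    return days, hours, minutes, seconds
```

For the parameters in the example (`collection_time(1_000_000, 5.402, 5000, 5)`), the helper gives 216 days and roughly 2 minutes 13 to 14 seconds, depending on rounding of the fractional second.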
B. Features construction
Since the profile is closed by privacy settings, we cannot retrieve data from its fields. We assume that in such a case the bot can be detected by the friend list. Every user on every social network has a list of relations similar to friends: it can be a subscription list (Twitter), circles (Google+), a contact list (Telegram), and so on.
First, we need to get data that can be used for feature construction. To do so, we collect the following data for users:
The number of friends. Each user has a friend list. To make friends, a user must send a request, and the other user must confirm it.
The number of groups. In VKontakte, a user can join groups. Groups can be dedicated to some event or topic.
The number of subscriptions. A person can subscribe to another person. The difference between a subscription and a friendship is that one does not need the other person's confirmation to subscribe.
The number of followers. Followers are people who subscribed to the account.
The number of photos. A person can upload photos.
The number of albums. A person can combine photos into albums.
The number of posts. Each VKontakte user has his/her own microblog. The number of posts is the number of records in this microblog.
Based on these data we form the distribution of the number of friends, the distribution of the number of groups, and so on. For each distribution we construct the following features:
Basic statistical metrics: mean, Q1, Q2, and Q3 quartiles.
The number of non-empty values in the distribution. Each empty value represents an account that closed access to its profile by privacy settings.
The p-value of the distribution's agreement with Benford's law.
The Gini index.
We also add as a feature the length of the friend list on the basis of which the distributions were made.
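The basic statistical features for one such distribution can be computed as follows; this is an illustrative sketch (the function name and the use of `None` for privacy-closed accounts are our assumptions):

```python
import numpy as np

def distribution_features(values):
    """Basic statistical features of one distribution collected from
    a friend list (e.g. the number of friends of each friend).
    `None` marks a friend whose profile is closed by privacy settings."""
    observed = np.array([v for v in values if v is not None], dtype=float)
    return {
        "mean": observed.mean(),
        "q1": np.percentile(observed, 25),   # Q1 quartile
        "q2": np.percentile(observed, 50),   # Q2 quartile (median)
        "q3": np.percentile(observed, 75),   # Q3 quartile
        "n_not_empty": len(observed),        # accounts not hidden by privacy
    }
```

The Benford's-law p-value and the Gini index, described next, are appended to this feature set, together with the friend-list length.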
It is worth dwelling on two methods in more detail.
Benford's law is an empirical law which says that a dataset satisfies Benford's law for the leading digit if the probability of observing d as the first digit is approximately log10((d + 1)/d). In our paper [10], we studied how bots obey Benford's law and found that real users obey it more often than bots. Using the Kolmogorov-Smirnov test we calculated the p-value that expresses the agreement of the distribution's first digits with Benford's distribution. Here the p-value was used as a feature.
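The agreement with Benford's law can be sketched as the Kolmogorov-Smirnov distance between the empirical first-digit distribution and Benford's distribution (the paper converts this statistic into a p-value; the function name below is ours):

```python
import numpy as np

# Benford's law: P(first digit = d) = log10((d + 1) / d), d = 1..9
BENFORD_P = np.log10(1 + 1 / np.arange(1, 10))
BENFORD_CDF = np.cumsum(BENFORD_P)

def benford_ks_stat(values):
    """KS distance between the empirical first-digit distribution of
    `values` and Benford's distribution. Zeros and empty values are
    skipped, since they have no leading digit."""
    digits = [int(str(abs(int(v)))[0]) for v in values if v and int(v) != 0]
    counts = np.bincount(digits, minlength=10)[1:10]  # counts for digits 1..9
    emp_cdf = np.cumsum(counts) / max(len(digits), 1)
    return float(np.abs(emp_cdf - BENFORD_CDF).max())
```

A sample whose first digits follow Benford's frequencies yields a small distance (high p-value), while a degenerate distribution, e.g. all values starting with 9, yields a large one.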
The Gini index is a statistical measure that is very popular in economics for expressing income inequality. It is sometimes used in other areas as well, for example, to assess biodiversity [13]. We assume that the values for the number of friends, photos, etc. can be interpreted as the "wealth" of a social network user, because to increase these numbers users need to perform more actions. So the Gini index can indicate equality or inequality of the users in the friend list.
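A standard way to compute the Gini index from such a distribution is the sorted-cumulative-sum formulation; a minimal sketch (the function name is ours):

```python
import numpy as np

def gini_index(values):
    """Gini index of a non-negative distribution (e.g. the friend
    counts of the accounts in a friend list): 0 means perfect
    equality, values near 1 mean extreme inequality."""
    x = np.sort(np.asarray(values, dtype=float))
    n = len(x)
    if n == 0 or x.sum() == 0:
        return 0.0
    # Equivalent to the mean-absolute-difference definition of Gini.
    cum = np.cumsum(x)
    return float((n + 1 - 2 * (cum / cum[-1]).sum()) / n)
```

For a friend list whose members all have the same "wealth" the index is 0; when one account holds everything it approaches (n − 1)/n.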
IV. BOT DETECTION APPROACH AND EXPERIMENTAL RESULTS
A. Methods
In this study, we face the task of binary classification ('bot' vs. 'not bot') associated with account-level bot detection, as at this stage of the research we have combined all types of bots, namely standard quality bots (dataset bot4), high-quality bots (dataset bot5), and live users (dataset bot6), into a single 'bot' class. Further we describe the approaches that we use to address this challenge.
As shown in Section III, a small number of well-interpreted features are used to describe user accounts, which require almost no preliminary data preprocessing. This allowed us at this stage of the research to use several ready-made classical machine learning methods for classification, namely:
SGD Classifier,
Random Forest Classifier,
AdaBoost Classifier,
Naïve Bayes Classifier.
We performed two series of experiments with data described
in Section III. In the first series, all the features described
in Section III were used to train and evaluate the mentioned
classifiers.
In the second series of experiments, the features were preliminarily examined for the presence of strong correlation between them. As a result, some of the features were removed from the set.
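The correlation-based pruning can be sketched as follows; the exact correlation threshold is not stated in the paper, so the value 0.9 below is an illustrative assumption, as is the function name:

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.9):
    """Drop one feature from every pair of features whose absolute
    Pearson correlation exceeds `threshold`, keeping the first
    column of each correlated pair."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)
```

Applied to the full feature table, this yields the reduced, weakly correlated feature set used in the second series of experiments.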
The correlation matrix of the features that were retained in
the dataset during the second series of experiments is shown
in Figure 1.
Additionally, for both series of experiments we tried to use a min-max scaling strategy for all features in order to check how it affects the classifiers' accuracy.
To train and evaluate the classifiers, the dataset was divided into two samples in a ratio of 70% to 30%. The first sample was used to search for optimal parameters for each classifier using the F-measure and 4-fold cross-validation. The second sample was used for the final evaluation of each classifier with optimal parameters using various metrics, namely False Positive Rate (FPR), Precision, Recall, F1-score, Accuracy, and Area Under the Receiver Operating Characteristic Curve (ROC AUC).
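This evaluation protocol can be sketched with scikit-learn (the library choice, the parameter grid, and the function name are our assumptions; the paper does not name an implementation):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

def evaluate(X, y):
    """Sketch of the protocol above: 70/30 split, parameter search by
    4-fold cross-validated F-measure on the first sample, ROC AUC on
    the held-out second sample."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              random_state=0)
    model = Pipeline([("scale", MinMaxScaler()),
                      ("clf", RandomForestClassifier(random_state=0))])
    search = GridSearchCV(model, {"clf__n_estimators": [50, 100]},
                          scoring="f1", cv=4)
    search.fit(X_tr, y_tr)
    proba = search.predict_proba(X_te)[:, 1]
    return roc_auc_score(y_te, proba)
```

The same skeleton applies to the other three classifiers by swapping the final pipeline step and its parameter grid.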
B. Source data
We collect data [12] from VKontakte social networks. The
summary of collected bots and users is provided in the table I.
To collect bots we found 3 companies that provide promo-
tion services in social networks using bots. Those companies
have different strategies on how to manage bots. Each com-
pany also provide different bot quality:
Vtope company provide bots. Provide 3 types of
quality: standard quality bots (dataset bot 1), high-quality
bots (dataset bot 2), and live users (dataset bot 3).
Martinismm company provide bots. Provide 3 types of
quality: standard quality bots (dataset bot 4), high-quality
bots (dataset bot 5), and live users (dataset bot 6).
Fig. 1. The correlation matrix for the features used for the second series of experiments with classifiers.
The Vktarget company is an exchange platform, where one can describe the type of promotion and real users perform it for money. It provides 2 types of quality: standard quality (dataset bot 7) and high-quality (dataset bot 8).
Companies use a single set of bots for different actions, so the complexity of their activity does not affect the quality.
To collect real users we selected 10 random posts. We selected posts that have around 250 likes (datasets user N) and collected the accounts that liked each post (only the account IDs). We chose such a collection strategy for real users so that bots and real users perform the same action: liking a post. But we cannot be sure that these are 100% real users.
To collect bots, we created 3 groups on the VKontakte social network (one for each company) to buy their promotion. During a week we filled these groups with content, because the companies refused to provide promotion services to empty groups. We made these groups absurd to make sure that during the experiments real users would not accidentally join a group, so that we would collect only bots. The first group was dedicated to "Wigs for hairless cats". The second, to "Deer Registration at the Ministry of Nature". The third, to "Minibus from Jõgeva to Agayakan". During the experiments, only one real user joined a group, and we excluded him from the dataset.
For each company we used the following algorithm: create a post in the group, buy 300 likes for the post, collect the accounts that liked the post (only the account IDs) into a file, delete the post. After each iteration, we performed it again with bots of another quality. Thus, we did not mix bots of different quality or different companies.
For Vtope and Martinismm companies the whole process
TABLE I
DATASETS DESCRIPTION
Dataset Type Number of profiles Description
bot 1 standard quality bots 301 Probably bots that are controlled by software.
bot 2 mid-quality bots 295 Probably more user-like bots that are controlled by software.
bot 3 live-users bots 298 Probably bots that are controlled by human or users from exchange platform.
bot 4 standard quality bots 301 Probably bots that are controlled by software.
bot 5 mid-quality bots 303 Probably more user-like bots that are controlled by software.
bot 6 live-users bots 304 Probably bots that are controlled by humans, or users from an exchange platform.
bot 7 standard quality bots 302 Probably users from the exchange platform.
bot 8 high-quality bots 357 Probably users from the exchange platform with more filled profiles.
user 1 activists 385 Post in group ”velosipedization” that is dedicated to development of bicycle transport.
user 2 mass media 298 Post in group ”belteanews” that is dedicated to Belarussian news.
user 3 developers 332 Post in group ”tproger” that is dedicated to software development.
user 4 sport 224 Post in group ”mhl” that is dedicated to youth hockey.
user 5 mass media 420 Post in group ”true lentach” that is dedicated to Russian news.
user 6 blog 251 Post in group ”mcelroy dub” that is dedicated to re-playing of funny videos.
user 7 commerce 284 Post in group ”sevcableport” that is dedicated to creative space in Saint-Petersburg.
user 8 festival 259 Post in group ”bigfestava” that is dedicated to cartoon festival.
user 9 sport 181 Post in group ”hcakbars” that is dedicated to fun community of hockey club Ak Bars.
user 10 developers 397 Post in group ”tnull” that is dedicated to software development and memes.
for one post took one day, because the companies were afraid to give likes too quickly and try to simulate natural growth. For the Vktarget company, collection took 3 days because Vktarget provides not bots but referrals. Referrals are real users (not operated by a program) who registered on the bot exchange platform and give likes, comments, and other activity for money. Thus, we can be 100% sure that we have collected only bots.
C. Experiments
In this section, we summarize and discuss the results
obtained using the approaches and methods described in the
previous section.
Table II shows the results of the first series of experiments, when all features are included in the dataset. We can conclude that the SGD Classifier, with or without feature scaling, shows the worst results, with ROC AUC less than 0.83. The Random Forest Classifier yields the best results, with ROC AUC above 0.9 and FPR below 0.3. Taking all metrics into account, the Random Forest Classifier with MinMax Scaling achieves the best results.
Overall, we demonstrated the feasibility of rather high-accuracy private-account bot detection by means of unsophisticated off-the-shelf algorithms with data scaling.
Table III shows the results of the second series of experiments, when only weakly correlated features are retained in the dataset. We can conclude that deleting strongly correlated features from the dataset does not noticeably degrade the accuracy of the classifiers. On the contrary, some classifiers, namely Naïve Bayes, improved their accuracy scores (which is reasonable, as non-collinearity of the features is a necessary requirement for the Naïve Bayes Classifier).
As a result, it is shown that with a minimal number of uncorrelated features, simple ready-made algorithms in combination with simple data preprocessing and scaling make it possible to achieve very good performance (ROC AUC above 0.9) in the private-account bot detection task.
TABLE II
CLASSIFICATION PERFORMANCE OF SGD, RANDOM FOREST, ADABOOST AND NAÏVE BAYES CLASSIFIERS WITH AND WITHOUT MINMAX SCALING FOR THE DATASET THAT INCLUDES ALL FEATURES. THE RESULTS PRESENTED ARE A GOOD ILLUSTRATION OF HOW IT IS POSSIBLE TO ACHIEVE VERY GOOD ACCURACY OF ACCOUNT-LEVEL BOT DETECTION WITHOUT THE NEED FOR COMPLEX DEEP ARCHITECTURES. FOR EACH METRIC THE BEST MODEL IS HIGHLIGHTED IN BOLD FONT. RANDOM FOREST CLASSIFIER WITH MINMAX SCALING ACHIEVES THE BEST RESULTS ACROSS MOST OF THE METRICS.
Model FPR Precision Recall F1 Accuracy ROC AUC
SGD 0.310 0.830 0.837 0.824 0.783 0.763
SGD+MMS 0.285 0.845 0.838 0.829 0.793 0.848
RF 0.295 0.847 0.942 0.892 0.855 0.903
RF+MMS 0.289 0.849 0.938 0.891 0.855 0.905
AB 0.292 0.846 0.930 0.886 0.886 0.892
AB+MMS 0.292 0.846 0.930 0.886 0.886 0.892
NB 0.589 0.737 0.953 0.831 0.83 0.838
NB+MMS 0.577 0.741 0.951 0.833 0.833 0.840
TABLE III
CLASSIFICATION PERFORMANCE OF SGD, RANDOM FOREST, ADABOOST AND NAÏVE BAYES CLASSIFIERS WITH AND WITHOUT MINMAX SCALING FOR THE DATASET THAT INCLUDES ONLY WEAKLY CORRELATED FEATURES. THE RESULTS SHOW THAT IT IS POSSIBLE TO ACHIEVE VERY GOOD ACCURACY OF ACCOUNT-LEVEL BOT DETECTION USING ONLY A MINIMAL SET OF UNCORRELATED FEATURES. RANDOM FOREST CLASSIFIER WITH MINMAX SCALING ACHIEVES THE BEST RESULTS ACROSS ALMOST ALL METRICS.
Model FPR Precision Recall F1 Accuracy ROC AUC
SGD 0.325 0.822 0.819 0.811 0.766 0.747
SGD+MMS 0.322 0.826 0.837 0.796 0.779 0.812
RF 0.296 0.846 0.937 0.889 0.851 0.901
RF+MMS 0.291 0.848 0.936 0.890 0.853 0.901
AB 0.292 0.846 0.929 0.885 0.848 0.887
AB+MMS 0.292 0.846 0.929 0.885 0.848 0.887
NB 0.497 0.768 0.950 0.850 0.786 0.849
NB+MMS 0.468 0.778 0.946 0.854 0.794 0.850
V. DISCUSSION
The advantage of the proposed approach is that it is able to detect camouflaged bots that use privacy settings. In addition, text, media, and profile information are not used to identify bots: all features are based on statistical analysis of the distribution of friends. This information is easy to extract, as there is no need to download a lot of data. In addition, the proposed method can be used for blocked or deleted accounts if the list of friends is known.
Another advantage is that the sample includes the three most common categories of bots: standard software bots, bots that sellers tried to make as human-like as possible, and users performing tasks for money on an exchange platform.
Since the proposed method gives a high false-positive rate, it is not suitable for "hard countermeasures" such as blocking profiles. Nevertheless, it can be used for investigations of bot impact or as part of a "soft countermeasures" policy: reducing the API limit, captcha, etc.
It is important to note that in this paper we do not consider
possible scenarios for obtaining a list of friends. We analyzed
open accounts as if they were closed by privacy settings.
Usually, to get the friend list of a closed account, it
is necessary to analyze the friend lists of all open
accounts of the social network. In this case, one can reconstruct
the friend list in the opposite way: one needs to find the
closed account in other users' lists (this is true for all social networks
where one can see a user's friends or subscriptions,
including Facebook, Twitter, Instagram, etc.). This method
may sound ambitious, but it is a feasible task. It does not
require downloading profile information, text, media, or other
"heavy" data.
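The inverse reconstruction described above can be sketched as an inverted lookup over the friend lists of open accounts. This is only an illustration (the function and account names are hypothetical); it relies on friendships being mutual, as they are in VKontakte, so that every open account listing the closed account is one of its friends.

```python
def reconstruct_friends(open_friend_lists, closed_account):
    """Recover the friend list of a closed account from open accounts' lists.

    open_friend_lists maps an open account id to its visible friend list.
    Because friendships are mutual, every open account that lists the
    closed account is itself a friend of that account.
    """
    return {
        account
        for account, friends in open_friend_lists.items()
        if closed_account in friends
    }

# Toy network: "carol" is closed, but appears in two open friend lists.
open_lists = {
    "alice": ["carol", "bob"],
    "bob": ["carol", "dave"],
    "dave": ["bob"],
}
friends_of_carol = reconstruct_friends(open_lists, "carol")  # {"alice", "bob"}
```

On a real network the expensive step is crawling the open friend lists, not the lookup itself; friends of the closed account who are themselves closed would still be missed by this sketch.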
Another issue that needs to be discussed is the datasets.
We are almost completely sure about the reliability of the bot
datasets, but less sure about the datasets of real users. We
chose a collection strategy for real users so that bots and real
users perform the same action, liking a post. But we cannot
be absolutely sure that no bots got into the real-user
datasets. In general, further research is needed on how to obtain a
representative sample of users that does not include bots.
VI. CONCLUSION
We developed bot detection models requiring a minimal
number of features that are generated from information available
even for private accounts in the VKontakte social network.
We have thus demonstrated the ability to detect bots with
fairly high accuracy using standard off-the-shelf algorithms
combined with simple preprocessing and scaling of data from
private VKontakte accounts.
The training datasets include bots of different quality:
standard software bots, bots that sellers tried to make as
human-like as possible, and users performing tasks for money
on an exchange platform.
The proposed approach is based on the analysis of friend
lists and does not take profile information into account, so it
is able to detect camouflaged bots that use privacy settings.
It can also be used for blocked or deleted accounts if the list
of friends is known.
In the next stages of the study, we plan to apply more
complex classification models to the task, for example, deep
neural networks. We also plan to consider account-level bot
detection as a multi-class classification problem with
the following classes: not a bot, standard-quality bots, high-quality
bots, and live users (acting as bots).
In this paper, we analyzed open accounts as if they were
closed by privacy settings. We therefore also plan to consider
possible scenarios for obtaining the friend lists of closed
accounts and to evaluate their complexity.
REFERENCES
[1] V. Subrahmanian, A. Azaria, S. Durst, V. Kagan, A. Galstyan, K.
Lerman, L. Zhu, E. Ferrara, A. Flammini, F. Menczer, "The DARPA
Twitter Bot Challenge," Computer (Long. Beach. Calif)., vol. 49, no. 6,
pp. 38–46, 2016, doi: 10.1109/MC.2016.183.
[2] G. Dong and H. Liu, Feature Engineering for Machine Learning
and Data Analytics. CRC Press, 2018.
[3] M. E. J. Newman, Networks. Oxford: Oxford University Press, 2019.
[4] J. D. Gallacher, V. Barash, P. N. Howard, and J. Kelly, “Junk News
on Military Affairs and National Security: Social Media Disinformation
Campaigns Against US Military Personnel and Veterans,” arXiv Prepr.
no. October, 2018. Available: http://arxiv.org/abs/1802.03572.
[5] E. Ferrara, “Disinformation and social bot operations in the run up to the
2017 French presidential election,” First Monday, vol. 22, no. 8, 2017,
doi: 10.5210/fm.v22i8.8005.
[6] R. Faris, H. Roberts, B. Etling, N. Bourassa, E. Zuckerman, and Y.
Benkler, “Partisanship Propaganda Disinformation: Online Media and
the 2016 US Presidential Election,” 2017. doi: 10.1109/LARS.
[7] F. Pierri, A. Artoni, and S. Ceri, “Investigating Italian disinformation
spreading on Twitter in the context of 2019 European elections, PLoS
One, vol. 15, no. 1, 2020, doi: 10.1371/journal.pone.0227821.
[8] A. Swito, A. Michalczuk, and H. Josi, “Appliance of Social Network
Analysis and Data Visualization Techniques in Analysis of Information
Propagation,” vol. 1, no. 1. Springer International Publishing, 2019.
[9] C. A. Davis, O. Varol, E. Ferrara, A. Flammini, and F. Menczer,
“BotOrNot,” in Proc. 25th International Conference Companion on
World Wide Web - WWW ’16 Companion, 2016, pp. 273–274, doi:
10.1145/2872518.2889302.
[10] M. Kolomeets, D. Levshun, S. Soloviev, A. Chechulin, and I. Kotenko,
“Social networks bot detection using Benford’s law,” in Proc. 13th
International Conference on Security of Information and Networks SIN
2020, November 4–7, 2020, Merkez, Turkey.
[11] M. Kolomeets, A. Benachour, D. El Baz, A. Chechulin, M. Strecker, and
I. Kotenko, "Reference architecture for social networks graph analysis,"
J. Wirel. Mob. Networks, Ubiquitous Comput. Dependable Appl., vol.
10, no. 4, pp. 109–125, 2019, doi: 10.22667/JOWUA.2019.12.31.109.
[12] M. Kolomeets, O. Tushkanova, D. Levshun, A. Chechulin “VKontakte
bots dataset,” GitHub. Available: https://github.com/guardeec/datasets.
[13] R. C. Guiasu and S. Guiasu, “The weighted Gini-Simpson index: Revi-
talizing an old index of biodiversity,” International Journal of Ecology,
vol. 2012, pp. 1–10, 2012.
[14] VKontakte API Limits. Available: https://vk.com/dev/api requests
[15] S. Oates and J. Gray, "#Kremlin: Using hashtags to analyze Russian
disinformation strategy and dissemination on Twitter," SSRN Electron.
J., 2019.
[16] T. Stein, E. Chen, and K. Mangla, "Facebook immune system,"
in Proc. 4th Workshop on Social Network Systems, 2011, pp. 1–8.
[17] A. Karataş and S. Şahin, "A review on social bot detection tech-
niques and research directions," in Proc. Int. Security and Cryptology
Conference Turkey, 2017, pp. 156–161.
[18] J.P. Dickerson, V. Kagan, and V.S. Subrahmanian. “Using sentiment
to detect bots on twitter: Are humans more opinionated than bots?,”
in Proc. 2014 IEEE/ACM International Conference on Advances in
Social Networks Analysis and Mining (ASONAM 2014), 2014. IEEE,
620–627.