Camouflaged bot detection using the friend list
Maxim Kolomeets∗†‡ , Olga Tushkanova∗, Dmitry Levshun∗†‡ , Andrey Chechulin∗
{kolomeec,tushkanova,levshun,chechulin}@comsec.spb.ru
∗Laboratory of Computer Security Problems
St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS)
14th line of VO 39, St. Petersburg, Russia
†ITMO University
Kronverksky Pr. 49, St. Petersburg, Russia
‡Université Paul Sabatier Toulouse III
118 Route de Narbonne, Toulouse, France
Abstract—The paper considers the task of bot detection in
social networks. The study focuses on the case when the account
is closed by privacy settings and the bot needs to be identified
by its friend list. The paper proposes a solution based
on machine learning and statistical methods. The social network
VKontakte is used as a data source.
The paper reviews the data that one needs to obtain
from the social network for bot detection in the case when the
profile is closed by privacy settings.
The paper includes a description of feature extraction from
the VKontakte social network with an evaluation of extraction complexity,
and a description of feature construction using statistics, Benford's law,
and the Gini index.
The paper describes the experiment. To collect training data,
we gathered datasets of bots and real users. To collect bots,
we created fake groups and bought bots of different quality for
them from 3 different companies.
We performed two series of experiments. In the first series,
all the features were used to train and evaluate the classifiers. In
the second series of experiments, the features were first
examined for the presence of strong correlations between them.
The results demonstrate the feasibility of rather high-accuracy
bot detection for private accounts by means of unsophisticated
off-the-shelf algorithms in combination with data scaling.
The Random Forest Classifier yields the best results, with a
ROC AUC above 0.9 and FPR below 0.3.
We also discuss the limitations of the experimental
part and plans for future research.
Index Terms—social network, bot, bot detection, social relationship, information security
I. INTRODUCTION
Searching for bots on social networks is one of the security
functions most requested by commercial companies and
law enforcement agencies. In many cases, bots are used to
damage reputations, enable unfair competition, spread disinformation,
and commit fraud. In some cases, bot activity can pose a serious threat
to the integrity of online communities. For example, recent
investigations [4]–[7], [15] show that electoral interference, the
spread of anti-vaccine conspiracy theories, and stock market
manipulations are carried out using bots.
The demand for such services has led to a significant
increase in bot detection studies.
This research was supported by the Russian Science Foundation under grant
number 18-71-10094 in SPC RAS.
Many machine learning, graph-based, and statistics-based techniques
for bot detection have been developed. On the other hand, since security specialists
pay much attention to the techniques for creating and operating
bots, bots have begun to use better camouflaging methods. Bots
try to look like real users, and the highest quality bots are
hacked accounts of real persons. But the simplest and least
expensive camouflaging method is to use the legal privacy
settings of the social network. Account privacy settings prevent
obtaining the profile information that is necessary for bot
recognition.
Thus, it becomes necessary to recognize bots in cases when
the analyzed account is hidden by privacy settings. Our
proposal is based on the fact that even if the account's profile
is closed by privacy settings, its friend list can still be
analyzed. In this paper, we analyze the possibility of detecting
bots by the list of friends, without analyzing the content of
the account page itself.
It is important to note that in this paper we do not consider
the problem of obtaining a list of friends of a closed account.
We assume that the list of friends is available for analysis.
The novelty of the paper lies in a new approach to bot
detection which can recognize artificial profiles even if these
profiles are closed by privacy settings. The contribution of the
paper is the list of features and the description of their extraction,
construction, and selection.
The paper is organized as follows. The second section presents the
state of the art, where we describe bot detection techniques
and types of source data. The third section presents the proposed
features, where we describe: feature extraction – how we
extract data from the VKontakte social network and how complex
it is; feature construction – which data and techniques we
used to construct features. The fourth section describes the
bot detection approach and experimental results, where we
describe: methods – which classification techniques we used;
source data – how we made the datasets; experiments – where we
summarize and discuss the results obtained. The fifth section
is the discussion, where we review the pros and cons of the
proposed approach and the limitations of the experiment. The last
section is the conclusion, where we summarize the results and present
plans for future work.
Preprint. PDP-21 conference.
II. STATE OF THE ART
Nowadays there exists a market where anyone can buy
bots of different quality. At the same time, quality
means different things [1]. For example, one seller claims
that high-quality bots are accounts with more complete profiles
and a larger number of friends (in comparison with low-quality
bots). Another case is when high-quality bots are hacked
accounts of real users. The third option is when high-quality
bots are real people who perform someone's tasks for money.
It is impossible to reliably establish which bot management
strategy is used for each quality level. But all sellers claim
that higher quality bots are harder to recognize.
So we can say that the cost of a bot directly depends on the
effectiveness of its camouflage. The most common camouflage
technique is to restrict access to a social network profile
using legal privacy settings. Privacy settings help to hide some
of the profile fields, so the bot cannot be distinguished from
a real person who uses the privacy settings.
If one has access to all the data, it is possible to use
4 "families" of methods [17] for bot detection or feature
extraction:
•Statistical – methods based on searching for
anomalies in distributions of features. For bot detection,
a threshold on some statistical measures
(mean, quartiles, p-value, etc.) can be used. For example, an account can
be recognized as a bot if some number of measures is above
or below a certain threshold [10]. Statistical methods
are also the most common for feature extraction.
•Network science – methods based on the analysis
of graph structures [3] that can be formed by users or
content of a social network. Network science algorithms
can be applied to various types of graphs to calculate
centrality measures, such as betweenness centrality,
eigenvector centrality, clustering coefficient, and others.
•Analytical – various methods of manual identification of bots
[4]–[7]. These can be segmentation into users and bots by inspecting
a user's home page, visual analysis of graph structures
formed by sets of users, manual text markup, and
similar manual techniques. These methods cannot provide
high accuracy and are very expensive. It is also difficult
to justify decisions based on analytical techniques: often,
the criteria used in such methods are very
subjective.
However, social media companies have a larger arsenal
of data than others. So they can reliably detect bots
by manual analysis of user HTTP requests, used IP
addresses, and other information inaccessible to a wide
audience [16].
Analytical methods can also be used during development,
for example, text markup for future machine learning,
or visualization [8] of graphs with centrality measures
when developing algorithms.
•Machine learning – feature-based bot recognition methods
[1], [2], [9]. Machine learning methods are used most
often because they allow one to use the results of all the
described methods (statistical, network science, analytical,
and machine learning itself) as features for training.
Machine learning is also the common method for analysis
of media (e.g. image recognition) and text (e.g. sentiment
analysis [18]).
To use these methods, one needs to get account features
from the social network. The list of features differs from one social
network to another. But there are 5 big groups of features [2] that
can be extracted:
•Account based – features that one can extract from the
account's home page. Home pages vary depending on
the social network. They usually include the name, city,
hobbies, and other general information that helps describe
the user. As numerical values, one can use the length
of a field, the number of words in a field, the numbers of
friends/posts/photos/followers/subscriptions/etc., account
age or serial number, and other values depending on the
social network.
•Adjacent account based – distribution parameters of account
based features that one can extract from adjacent
accounts. There are many ways to obtain the list of
adjacent accounts. The most obvious are the lists of
friends, followers, accounts that replied to the user's posts,
family members, etc. More lists of adjacent accounts can
be obtained by complex requests, for example: accounts
with which the user starts discussions, accounts that regularly
"liked" the user, and others.
From these lists of adjacent accounts, one can extract
the distributions of numerical account based features.
After that, one can extract basic statistical parameters
– quartiles, mean, dispersion, statistical hypothesis test
results, and others.
•Text and media – features that one can extract from content.
The content can include text in posts or comments,
videos, photos, polls, emojis, live streams, gifts, and so on.
The extracted numerical values depend on the content
type. For text, it can be the length of the text, number of words,
text entropy, emotions, etc. For video, it can be the length,
number of views, types of objects in the video, etc.
•Graph centrality measures – features that can be obtained
by analysis of graphs of accounts (a vertex is an account)
or graphs of content (a vertex is a text/media item/etc.).
Graphs of accounts can be formed in the same ways
as adjacent account lists (e.g. a graph of friends, a graph of
accounts that regularly "liked"). Usually the edge in such
graphs is friendship, but it is also possible to use other,
more complex relations between users.
Graphs of content can be formed from the dependencies
of user content [11], for example: one can build a
tree of posts where an edge is a re-post; or one can link posts
that were commented on by the same person; etc.
Weights can also be applied to the vertices of graphs.
Weights for account vertices can be account based features.
Weights for content vertices can be text/media
based features. It is also possible to apply more complex
weights. For example, weights for content vertices can be
features of the account that created the post.
By applying network science algorithms to such graphs,
it is possible to obtain distributions of centrality measures.
After that, one can extract basic statistical parameters.
•Temporal – features that can be extracted by analysis
of the timeline. Such features describe the variability of all
previously described features over time. For example, one
can extract the number of posts per week, the increase in the
number of subscribers per month, the shift in the mean of some
centrality measure per week, and so on.
If the profile is closed by privacy settings, it is obvious
that one cannot use some account based, text and media, and
temporal features.
In this paper, we analyse the possibility of detecting a bot
while its profile is closed by privacy settings (we cannot extract
information from the profile directly). We do this by the analysis
of adjacent account based features (the friend list) using statistical
and machine learning methods.
III. PROPOSED FEATURES
A. Features extraction
An algorithm for data collection from VKontake social
network can be divided into 4 main steps:
1) Collection and parsing of files with input data – unique
identifiers of social network users.
2) Collection of data about users' profiles based on their
identifiers – page status, birth date, and counts of friends,
groups, followers, subscriptions, albums, photos, and posts.
3) Collection of unique identifiers of friends of the users from
the previous step.
4) Collection of data about users' profiles based on the
identifiers obtained during the previous step – the collected
data is similar to step 2.
Data collection from the VKontakte social network is based [14]
on the VK API, the VK Bridge library, web applications, and access
tokens. The API can be used by anyone with access to the social
network, but there are limitations on how it can be used and
for what purpose. For example:
•There cannot be more than 3 requests per 1000 milliseconds
from one access token.
•Some requests cannot be used more than a limited number
of times per day from one access token (for example,
"wall.get" – 5000, "newsfeed.search" – 1000, etc.).
•Some requests cannot return more than a limited number
of items per request (for example, "friends.get" – 5000,
"groups.get" – 1000, "wall.get" – 100, etc.).
•The availability of different requests depends on the type of
access token and the access rights provided to it.
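The rate constraints above can be respected with a simple throttling wrapper. The sketch below is illustrative, not the actual collector: the `send` callable and the daily-limit table are assumptions, and only two of the capped methods from the text are listed. It keeps one token under 3 requests per second and refuses to exceed a per-day cap for capped methods.

```python
import time

# Illustrative per-token daily caps taken from the text; real limits
# depend on the endpoint and the access token type.
DAILY_LIMITS = {"wall.get": 5000, "newsfeed.search": 1000}
MAX_RPS = 3  # at most 3 requests per 1000 ms from one access token

def throttled_requests(requests, send):
    """Send API requests with one token, respecting rate and daily limits.

    `requests` is a list of (method, params) pairs; `send` is a
    hypothetical callable that performs the actual API call.
    """
    used_today = {method: 0 for method in DAILY_LIMITS}
    results = []
    for method, params in requests:
        if method in used_today:
            if used_today[method] >= DAILY_LIMITS[method]:
                raise RuntimeError(f"daily limit reached for {method}")
            used_today[method] += 1
        results.append(send(method, params))
        time.sleep(1.0 / MAX_RPS)  # stay under 3 requests per second
    return results
```

With several access tokens, one such loop can run per token in parallel, which is what makes the number of tokens a multiplier in the collection-time estimate below.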
For the experiments, a special application in the infrastructure
of the VKontakte social network was developed – a
VK Mini App. Such an application can request access to its
users' access tokens and collect data based on them. As
mentioned before, data about users' profiles were collected.
To collect all the data, it was required to query the API 7 times per
user, but if the user's page was not open, the data collection
process was stopped for him or her after the first request
(or during the process if, for example, the user was banned in
the middle of it).
So, if one wants to calculate the amount of time required
to collect data about users' page status, birth date, and counts of
friends, groups, followers, subscriptions, albums, photos, and
posts, the following formulas can be used:

Time = ([N_d], [N_h], [N_m], [N_s]),

where [N_d], [N_h], [N_m], and [N_s] are the whole parts of the
numbers of days, hours, minutes, and seconds required to collect
all the data.

N_d = (N_u × N_API) / (min(N_rpd^API) × N_at),

where N_d is the number of days required to collect all
the data; N_u is the number of users to collect data for; N_API is the
average number of API requests per user; N_rpd^API is the
set of per-token daily request limits of the APIs; min(N_rpd^API) is the
limit of the API with the minimal possible requests per day (the
bottleneck); N_at is the number of available access tokens.

N_s = (N_u × N_API − min(N_rpd^API) × N_at × [N_d]) / (3 × N_at),

where N_s is the number of seconds required to collect the
data left after [N_d] days of collection, at the rate of 3 requests
per second per token.

N_h = N_s ÷ 3600,

where N_h is the number of hours required to collect the
data left after [N_d] days of collection.

N_m = (N_s − [N_h] × 3600) ÷ 60,

where N_m is the number of minutes required to collect
the data left after [N_d] days and [N_h] hours of collection.

N_s = N_s − [N_h] × 3600 − [N_m] × 60,

where N_s is the number of seconds required to collect
the data left after [N_d] days, [N_h] hours, and [N_m] minutes of
collection.
For example, approximately 216 days, 2 minutes, and 14 seconds
are required to collect numerical profile data about 1,000,000
unique users if N_API = 5.402 (this value depends on the
percentage of open pages; the current one was taken according
to our database statistics), N_at = 5, and min(N_rpd^API) = 5000.
B. Features construction
Since the profile is closed by privacy settings, we cannot
retrieve data from its fields. We assume that in such a case,
the bot can be detected by the friend list. Every user on every
social network has a list of relations similar to friends
– it can be a subscription list (Twitter), circles (Google+),
a contact list (Telegram), and so on.
First, we need to get data that can be used for feature
construction. To do this, we collect the following data for users:
•The number of friends. Each user has a friend list. To
make friends, a user must send a request, and the other
user must confirm it.
•The number of groups. In VKontakte, a user can join
groups. Groups can be dedicated to some event or topic.
•The number of subscriptions. A person can subscribe to
another person. The difference between a subscription
and a friendship is that one does not need the other person's
confirmation to subscribe.
•The number of followers. Followers are people who
subscribed to one's account.
•The number of photos. A person can upload photos.
•The number of albums. A person can combine photos
into albums.
•The number of posts. Each VKontakte user has his/her own
microblog. The number of posts is the number of records
in this microblog.
Based on this data, we form the distribution of the number of friends,
the distribution of the number of groups, and so on. For each
distribution we construct the following features:
•Basic statistical metrics: mean and the Q1, Q2, and Q3 quartiles.
•The number of non-empty values in the distribution. Each
empty value represents an account that closed access to
its profile by privacy settings.
•The p-value of the distribution's agreement with Benford's
law.
•The Gini index.
We also add as a feature the length of the friend list on the
basis of which the distributions were made.
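The basic statistical features for one such distribution can be sketched as follows. The function name is illustrative; `None` marks a friend whose value could not be collected because of privacy settings, and the friend-list length (which the text adds once, for the friend list itself) is included here per call for compactness. The sketch assumes at least two open profiles in the list.

```python
from statistics import mean, quantiles

def distribution_features(values):
    """Basic statistical features for one distribution, e.g. the
    numbers of friends of each friend of the analyzed account.
    `None` marks a friend whose profile is closed by privacy settings.
    """
    present = [v for v in values if v is not None]
    q1, q2, q3 = quantiles(present, n=4)  # Q1, Q2 (median), Q3
    return {
        "mean": mean(present),
        "q1": q1, "q2": q2, "q3": q3,
        "n_not_empty": len(present),   # friends with open profiles
        "list_length": len(values),    # length of the friend list itself
    }
```

The Benford p-value and Gini index features described next would be computed from the same `present` list.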
It is worth dwelling on two methods in more detail.
Benford's law is an empirical law which says that a dataset
satisfies Benford's law for the leading digit if the probability of
observing the first digit d is approximately log10((d+1)/d). In
our paper [10], we studied how bots obey Benford's law
and we found that real users obey it more often than bots.
Using the Kolmogorov-Smirnov test we calculated the p-value
that expresses the agreement of a distribution's first digits with
Benford's distribution. Here the p-value is used as a feature.
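A rough sketch of this feature is given below. It computes the KS statistic between the empirical first-digit distribution and Benford's distribution, then converts it to a p-value via the asymptotic Kolmogorov formula; this is an approximation for discrete data, not an exact reproduction of the test used in [10], and the function names are illustrative.

```python
from math import log10, sqrt, exp

# Benford's law: P(first digit = d) ~ log10((d+1)/d), d = 1..9
BENFORD = [log10(1 + 1 / d) for d in range(1, 10)]

def first_digit(x):
    return int(str(abs(int(x)))[0])

def benford_pvalue(values):
    """Approximate KS-test p-value for agreement of the first-digit
    distribution of `values` with Benford's law (zeros are skipped)."""
    digits = [first_digit(v) for v in values if v and int(v) != 0]
    n = len(digits)
    d_stat = emp_cdf = theo_cdf = 0.0
    for d in range(1, 10):  # compare the two CDFs over digits 1..9
        emp_cdf += digits.count(d) / n
        theo_cdf += BENFORD[d - 1]
        d_stat = max(d_stat, abs(emp_cdf - theo_cdf))
    # asymptotic Kolmogorov distribution with a finite-n correction
    lam = (sqrt(n) + 0.12 + 0.11 / sqrt(n)) * d_stat
    p = 2 * sum((-1) ** (k - 1) * exp(-2 * k * k * lam * lam)
                for k in range(1, 101))
    return max(0.0, min(1.0, p))
```

A friend list whose counts roughly follow Benford's first-digit frequencies yields a high p-value, while a uniform first-digit distribution yields a p-value near zero.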
The Gini index is a statistical measure that is very popular
in economics for expressing income inequality. It is
sometimes used in other areas as well, for example, to assess
biodiversity [13]. We assume that the values for the number
of friends, photos, etc. can be interpreted as the "wealth"
of a social network user, because to increase these numbers
users need to perform more actions. So the Gini index can indicate
the equality or inequality of the users in the friend list.
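The Gini index of such a distribution can be computed with the standard sorted-rank formula; the function name is illustrative. A value of 0 means perfect equality (every friend has the same count), while values near 1 mean the "wealth" is concentrated in a few accounts.

```python
def gini_index(values):
    """Gini index of a non-negative distribution, e.g. the numbers of
    friends of each friend. `None` values (closed profiles) are skipped."""
    xs = sorted(v for v in values if v is not None)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # sorted-rank formulation of the mean absolute difference
    cum = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * cum) / (n * total) - (n + 1) / n
```

For example, `gini_index([5, 5, 5, 5])` is 0.0, while `gini_index([0, 0, 0, 10])` is 0.75, the maximum for four values.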
IV. BOT DETECTION APPROACH AND EXPERIMENTAL RESULTS
A. Methods
In this study, we face a binary classification task ('bot'
vs. 'not bot') associated with account-level bot detection,
as at this stage of the research we have combined all types of
bots, namely standard quality bots (dataset bot 4), high-quality
bots (dataset bot 5), and live users (dataset bot 6), into a single
'bot' class. Further, we describe the approaches that we use to
address this challenge.
As shown in Section III, a small number of well-interpreted
features are used to describe user accounts, which require
almost no preliminary data preprocessing. This allowed us at
this stage of the research to use several ready-made classical
machine learning methods for classification, namely:
•SGD Classifier,
•Random Forest Classifier,
•AdaBoost Classifier,
•Naïve Bayes Classifier.
We performed two series of experiments with data described
in Section III. In the first series, all the features described
in Section III were used to train and evaluate the mentioned
classifiers.
In the second series of experiments, the features were
first examined for the presence of strong correlations
between them. As a result, some of the features were removed
from the set.
The correlation matrix of the features that were retained in
the dataset during the second series of experiments is shown
in Figure 1.
Additionally, for both series of experiments, we tried the
min-max scaling strategy for all features in order to check
how it affects the classifiers' accuracy.
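Min-max scaling maps each feature column to the [0, 1] range, which can matter for scale-sensitive classifiers such as SGD and Naïve Bayes. A minimal sketch (the function name is illustrative; constant columns are mapped to zero here by convention):

```python
def min_max_scale(column):
    """Rescale one feature column to [0, 1]: x -> (x - min) / (max - min)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0] * len(column)  # constant feature carries no information
    return [(v - lo) / (hi - lo) for v in column]
```

In practice the minimum and maximum would be computed on the training sample only and then reused for the evaluation sample.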
To train and evaluate the classifiers, the dataset was divided
into two samples in a ratio of 70% to 30%. The first sample
was used to search for optimal parameters for each classifier using
the F-measure and 4-fold cross-validation. The second sample
was used for the final evaluation of each classifier with optimal
parameters using various metrics, namely False Positive Rate
(FPR), Precision, Recall, F1-score, Accuracy, and Area Under
the Receiver Operating Characteristic Curve (ROC AUC).
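The threshold-based evaluation metrics listed above follow directly from the confusion matrix of the binary task. A minimal sketch (the function name is illustrative; ROC AUC is omitted since it requires classifier scores rather than hard labels):

```python
def binary_metrics(y_true, y_pred):
    """Confusion-matrix metrics for a binary 'bot' (1) vs 'not bot' (0) task."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "FPR": fp / (fp + tn),  # real users wrongly flagged as bots
        "Precision": precision,
        "Recall": recall,
        "F1": 2 * precision * recall / (precision + recall),
        "Accuracy": (tp + tn) / len(y_true),
    }
```

Here FPR is the fraction of real users misclassified as bots, which is why a high FPR rules out "hard countermeasures" such as blocking, as discussed later.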
B. Source data
We collected data [12] from the VKontakte social network. A
summary of the collected bots and users is provided in Table I.
To collect bots, we found 3 companies that provide promotion
services in social networks using bots. These companies
have different strategies for managing bots. Each company
also provides different bot quality:
•Vtope company – provides bots of 3 quality types:
standard quality bots (dataset bot 1), high-quality
bots (dataset bot 2), and live users (dataset bot 3).
•Martinismm company – provides bots of 3 quality types:
standard quality bots (dataset bot 4), high-quality
bots (dataset bot 5), and live users (dataset bot 6).
Fig. 1. The correlation matrix for the features used for the second series of experiments with classifiers.
•Vktarget company – an exchange platform, where one
can describe the type of promotion and real users perform
it for money. Provides 2 quality types: standard quality
(dataset bot 7) and high quality (dataset bot 8).
Companies use a single set of bots for different actions, so
the complexity of their activity does not affect the quality.
To collect real users, we selected 10 random posts. We
selected posts that have around 250 likes (datasets user N)
and collected the accounts that liked each post (only the account IDs).
We chose such a collection strategy for real users so that
bots and real users perform the same action – liking a post.
But we cannot be sure that these are 100% real users.
To collect bots, we created 3 groups on the VKontakte social network (one for
each company) to buy their promotion. During a week we
filled these groups with content, because companies refused to
provide promotion services to empty groups. We made these
groups absurd to make sure that during the experiments real
users would not accidentally join a group, so that we would collect
only bots. The first group was dedicated to "Wigs for hairless
cats". The second – "Deer Registration at the Ministry of
Nature". The third – "Minibus from Jõgeva to Agayakan".
During the experiments, only one real user joined a group,
and we excluded him from the dataset.
For each company we used the following algorithm: create a post
in a group → buy 300 likes for the post → collect the accounts
that liked the post (only the account IDs) into a file → delete the
post. After an iteration, we performed it again with bots of another
quality. Thus, we did not mix bots of different quality or
different companies.
For Vtope and Martinismm, the whole process
TABLE I
DATASETS DESCRIPTION
Dataset Type Number of profiles Description
bot 1 standard quality bots 301 Probably bots that are controlled by software.
bot 2 mid-quality bots 295 Probably more user-like bots that are controlled by software.
bot 3 live-users bots 298 Probably bots that are controlled by human or users from exchange platform.
bot 4 standard quality bots 301 Probably bots that are controlled by software.
bot 5 mid-quality bots 303 Probably more user-like bots that are controlled by software.
bot 6 live-users bots 304 Probably bots that are controlled by humans, or users from an exchange platform.
bot 7 standard quality bots 302 Probably users from the exchange platform.
bot 8 high-quality bots 357 Probably users from the exchange platform with more filled profiles.
user 1 activists 385 Post in group ”velosipedization” that is dedicated to development of bicycle transport.
user 2 mass media 298 Post in group ”belteanews” that is dedicated to Belarussian news.
user 3 developers 332 Post in group ”tproger” that is dedicated to software development.
user 4 sport 224 Post in group ”mhl” that is dedicated to youth hockey.
user 5 mass media 420 Post in group ”true lentach” that is dedicated to Russian news.
user 6 blog 251 Post in group ”mcelroy dub” that is dedicated to re-playing of funny videos.
user 7 commerce 284 Post in group ”sevcableport” that is dedicated to creative space in Saint-Petersburg.
user 8 festival 259 Post in group ”bigfestava” that is dedicated to cartoon festival.
user 9 sport 181 Post in group ”hcakbars” that is dedicated to the fan community of hockey club Ak Bars.
user 10 developers 397 Post in group ”tnull” that is dedicated to software development and memes.
for one post took one day, because the companies are afraid to
give likes too quickly, and they try to simulate natural
growth. For Vktarget, collection took 3 days because
Vktarget provides not bots but referrals. Referrals are real
users (not operated by a program) who registered on the bot
exchange platform and give likes, comments, and other activity
for money. Thus, we can be 100% sure that we have collected
only bots.
C. Experiments
In this section, we summarize and discuss the results
obtained using the approaches and methods described in the
previous section.
Table II shows the results of the first series of experiments,
when all features are included in the dataset. We
can conclude that the SGD Classifier, with or without feature
scaling, shows the worst results, with a ROC AUC less than 0.83.
The Random Forest Classifier yields the best results, with a ROC
AUC above 0.9 and FPR below 0.3. Taking all metrics into
account, the Random Forest Classifier with min-max
scaling achieves the best results.
Overall, we demonstrated the feasibility of rather high-accuracy
bot detection for private accounts by means of unsophisticated
off-the-shelf algorithms with data scaling.
Table III shows the results of the second series of
experiments, when only weakly correlated features are retained
in the dataset. We can conclude that deleting strongly correlated
features from the dataset does not noticeably harm the
accuracy of the classifiers. On the contrary, some
classifiers, namely Naïve Bayes, improved their accuracy
scores (which is reasonable, as non-collinearity of the features
is a necessary requirement for the Naïve Bayes Classifier).
As a result, it is shown that with a minimal number of uncorrelated
features, simple ready-made algorithms in combination
with simple data preprocessing and scaling make it possible to
achieve very good performance (ROC AUC above 0.9) in the
private account bot detection task.
TABLE II
CLASSIFICATION PERFORMANCE OF THE SGD, RANDOM FOREST,
ADABOOST, AND NAÏVE BAYES CLASSIFIERS WITH AND WITHOUT
MIN-MAX SCALING FOR THE DATASET THAT INCLUDES ALL FEATURES.
THE RESULTS ILLUSTRATE HOW IT IS POSSIBLE TO ACHIEVE VERY GOOD
ACCURACY OF ACCOUNT-LEVEL BOT DETECTION WITHOUT THE NEED FOR
COMPLEX DEEP ARCHITECTURES. FOR EACH METRIC, THE BEST MODEL IS
HIGHLIGHTED IN BOLD FONT. THE RANDOM FOREST CLASSIFIER WITH
MIN-MAX SCALING ACHIEVES THE BEST RESULTS ACROSS MOST OF THE METRICS.
Model FPR Precision Recall F1 Accuracy ROC AUC
SGD 0.310 0.830 0.837 0.824 0.783 0.763
SGD+MMS 0.285 0.845 0.838 0.829 0.793 0.848
RF 0.295 0.847 0.942 0.892 0.855 0.903
RF+MMS 0.289 0.849 0.938 0.891 0.855 0.905
AB 0.292 0.846 0.930 0.886 0.886 0.892
AB+MMS 0.292 0.846 0.930 0.886 0.886 0.892
NB 0.589 0.737 0.953 0.831 0.83 0.838
NB+MMS 0.577 0.741 0.951 0.833 0.833 0.840
TABLE III
CLASSIFICATION PERFORMANCE OF THE SGD, RANDOM FOREST,
ADABOOST, AND NAÏVE BAYES CLASSIFIERS WITH AND WITHOUT
MIN-MAX SCALING FOR THE DATASET THAT INCLUDES ONLY WEAKLY
CORRELATED FEATURES. THE RESULTS SHOW THAT IT IS POSSIBLE TO
ACHIEVE VERY GOOD ACCURACY OF ACCOUNT-LEVEL BOT DETECTION
USING ONLY A MINIMAL SET OF UNCORRELATED FEATURES. THE RANDOM
FOREST CLASSIFIER WITH MIN-MAX SCALING ACHIEVES THE BEST
RESULTS ACROSS ALMOST ALL METRICS.
Model FPR Precision Recall F1 Accuracy ROC AUC
SGD 0.325 0.822 0.819 0.811 0.766 0.747
SGD+MMS 0.322 0.826 0.837 0.796 0.779 0.812
RF 0.296 0.846 0.937 0.889 0.851 0.901
RF+MMS 0.291 0.848 0.936 0.890 0.853 0.901
AB 0.292 0.846 0.929 0.885 0.848 0.887
AB+MMS 0.292 0.846 0.929 0.885 0.848 0.887
NB 0.497 0.768 0.950 0.850 0.786 0.849
NB+MMS 0.468 0.778 0.946 0.854 0.794 0.850
V. DISCUSSION
The advantage of the proposed approach is that it is able to
detect camouflaged bots that use privacy settings. In addition,
text, media, and profile information are not used to identify
bots. All features are based on statistical analysis of the
distributions over the friend list. This information is easy to extract,
as there is no need to download a lot of data. In addition, the
proposed method can be used for blocked or deleted accounts
if the list of friends is known.
Another advantage is that the sample includes the three
most common categories of bots: standard software bots, bots
that sellers tried to make as human-like as possible, and users
performing tasks for money on an exchange platform.
Since the proposed method gives a high false-positive rate,
it is not suitable for "hard countermeasures" such as blocking
profiles. Nevertheless, it can be used for investigations of
bot impact or as part of a "soft countermeasures" policy
– reducing API limits, captchas, etc.
It is important to note that in this paper we do not consider
possible scenarios for obtaining a list of friends. We analyzed
open accounts as if they were closed by privacy settings.
Usually, to get the friend list of a closed account, it
is necessary to analyze all the friend lists of all open
accounts of the social network. In this case, one can reconstruct
the friend list in the opposite way – one needs to find the
closed account in other accounts' lists (this is true for all social networks
where one can see a list of a user's friends or subscriptions,
including Facebook, Twitter, Instagram, etc.). This method
may sound ambitious, but it is a feasible task. It does not
require downloading profile information, text, media, or other
"heavy" information.
Another point that needs to be discussed is the datasets.
We are almost 100% sure about the reliability of the bot
datasets, but not about the real user datasets. We
chose a collection strategy for real users so that bots and real
users perform the same action – liking a post. But we cannot
be absolutely sure that bots have not gotten into the datasets of
real users. In general, research is needed on how to obtain a
representative sample of users that does not include bots.
VI. CONCLUSION
We developed bot detection models requiring a minimal
number of features that are generated using information available
even for private accounts in the VKontakte social network.
We have demonstrated the ability to detect bots with
fairly high accuracy using standard off-the-shelf algorithms,
combined with simple preprocessing and scaling of data from
private VKontakte accounts.
The training datasets include bots of different quality:
standard software bots, bots that sellers tried to make as
human-like as possible, and users performing tasks for money
on an exchange platform.
The proposed approach is based on the analysis of friend
lists and does not take into account profile information. So it
is able to detect camouflaged bots that use privacy settings.
It can also be used for blocked or deleted accounts if the list
of friends is known.
In the next stages of the study, we plan to apply more
complex classification models to this task, for example, deep
neural networks. We also plan to treat account-level bot
detection as a multi-class classification problem with the
following classes: not a bot, standard-quality bot, high-quality
bot, and live user acting as a bot.
Finally, since in this paper we analyzed open accounts as if
they were closed by privacy settings, we also plan to consider
possible scenarios for obtaining the friend lists of closed
accounts and to evaluate their complexity.