Social networks bot detection using Benford's law

Maxim Kolomeets, kolomeec@comsec.spb.ru, St. Petersburg Federal Research Center of the Russian Academy of Sciences, Saint-Petersburg, Russia
Dmitry Levshun, levshun@comsec.spb.ru, St. Petersburg Federal Research Center of the Russian Academy of Sciences, Saint-Petersburg, Russia
Sergei Soloviev, sergei.soloviev@irit.fr, Université Paul Sabatier Toulouse III, Toulouse, France
Andrey Chechulin, chechulin@comsec.spb.ru, St. Petersburg Federal Research Center of the Russian Academy of Sciences, Saint-Petersburg, Russia
Igor Kotenko, ivkote@comsec.spb.ru, St. Petersburg Federal Research Center of the Russian Academy of Sciences, Saint-Petersburg, Russia
Abstract
The paper considers the task of bot detection in social networks. It checks the hypothesis that bots break Benford's law much more often than human users, so that this disagreement can be used to identify them. A bot detection approach is proposed based on experiments in which the test results for bot datasets of different classes and real-user datasets of different communities are evaluated and compared. The experiments show that automatically controlled bots can possibly be identified by their disagreement with Benford's law, while human-orchestrated bots cannot.

CCS Concepts: • Security and privacy → Social network security and privacy.

Keywords: bot detection, information security, social networks, Benford's law
ACM Reference Format:
Maxim Kolomeets, Dmitry Levshun, Sergei Soloviev, Andrey Chechulin, and Igor Kotenko. 2020. Social networks bot detection using Benford's law. In ?????????????. ACM, New York, NY, USA, 8 pages. https://doi.org/??????????????????
1 Introduction
Detection of social bots is especially important in cybersecurity contexts involving social networks, because bots are software agents designed to mimic humans but programmed for specific purposes. Bots may seriously threaten
the integrity of networks and various networked installations. Bots are used for metrics cheating and information spreading by different actors: commerce companies, scammers, and even political manipulators.
The difficulty of bot detection is an important factor that enables the proliferation of bots [14]. Statistical methods play a leading role in social bot detection due to the multiplicity of bots and the difficulty of access to the underlying code.
In this paper, we analyze the possibility of detecting bots by applying Benford's law [12], which requires just one metric of a user (one API request per user).
Benford's law is already used for the detection of fraud [1, 9]. Our hypothesis is that it can also be applied to bot detection in social networks, since bots break Benford's law significantly more often than human users because of their unnatural behavior. We analyze and test experimentally different numerical parameters of accounts to find ones that obey Benford's distribution for users and do not obey it for bots.
The novelty of the paper is the use of Benford's law for the detection of different types of bots in social networks. The contribution also consists in the results of large-scale experiments that involved purchased bots and human users and showed that Benford's law can be applied to detect bots that are controlled by software. The results of the experiments for live bots (involving human input) were not conclusive.
The paper is organized as follows. The second section is the state of the art, where we describe current approaches in bot detection. The third section presents the bot detection approach based on Benford's law. The fourth section presents the experiments and evaluation, where we describe the datasets, the software that we developed for collection and analysis, and the results of the experiments. The fifth section is the discussion, where we explain the results of the experiments. The last section is the conclusion, where we summarize the results and present plans for future work.
2 State of the art
Bot detection. Nowadays there exists a full-fledged market where different types of bots are sold. Usually, sellers indicate the "quality" of bots; higher quality means a higher price. Quality tags such as "automatic", "standard", "high", "auto-bot", "live", "real user", "referrals" and others are used. It is hard to find out exactly what these tags mean (e.g., are "real users" human persons or just hacked accounts?), but the basic idea of the categorization is clear. There are:
• Software bots / automated bots – bots that are controlled by some software.
• Live bots / human-orchestrated bots – bots at least partly controlled by humans.
The border between these types is blurred. Software bots represented by hacked accounts can have natural profiles that look like those of human users (photos, friend lists, and other content) but show anomalous activity common to bots (identical comments, too numerous likes, etc.). At the same time, there exist automatically generated bots (with anomalous profiles) orchestrated by real persons (and thus with more natural behavior).
There also exist exchange platforms and troll factories. Exchange platforms pay users for their actions. Analyzing this type of bots, we have found that their "abnormal" activity (like massive reposts of publicity, political events, clearly fraudulent "cries for help" and others) overlaps with "normal" activity (like speaking with family members, greeting friends on holidays, etc.). So these are rather bot accounts used (or abused) by real persons for a specific cause, in exchange for services, or merely for money. A more extreme case is troll factories: "notorious" companies that hire paid staff who, completely manually and without automatic methods, create accounts and conduct information attacks.
This explains the interest in bot identification. Many techniques were developed for this purpose [11]. However, none of them can guarantee 100% accuracy, and usually several methods are used together. The main methods of bot detection are: statistical methods, graph-based methods, visualization, forensics, and machine learning based on the results of all previous methods as features.
Statistical methods are based on the comparison of distributions of user metrics and bot metrics, such as the number of friends, number of posts, word occurrence, number of comments per day, and others [14, 17]. Graph-based methods are network science algorithms that analyze graph topology to find communities (as bots prefer to add other bots to their friend lists [15]) and patterns in network structures. Facebook also uses a social-graph credibility system, the "Facebook immune system" [15]. Visualization methods are used for the representation of social network graphs, statistics, and word clouds so that an analyst can manually characterise a dataset [6–8, 13].
Forensics methods use multiple case-based techniques. For example, in [7], to analyze the impact of bots on the French election, the dataset was selected by collecting tweets with specific hashtags related to the MacronLeaks information campaign. Similar case-based methods were used in investigations of the impact of bots on the European [13] and United States [6] elections. Also, for the owners of social networks some additional methods are available that typical researchers cannot access, such as comparison of IP addresses or login history.
Machine learning methods are the most popular methods because they can unite the results of multiple methods. They use various features of users extracted from numerical (e.g., number of photos), textual (e.g., text of the posts), media (e.g., photos), and topological (e.g., community graph) data. For example, the BotOrNot [3] system uses around 1000 features for the classification of bots and real users on Twitter. Sentibot [4] is a framework for bot detection whose novelty is the use of different measures of sentiment as features for classification. Also, a challenge [16] was organized by DARPA in which 6 teams participated, using various methods for the detection of software bots.
Usually, researchers use the following features [5] in machine learning:
• User-based features – can be obtained from the profile of the account. For example, user name, age, length of the user's description, and others.
• Friend features – metrics that can be obtained from the friend list of the analyzed user, such as the number of friends, the friends' age distribution, their name length distribution, and others.
• Network features – metrics that one can get using graph-based methods. For example, centrality measures of the graph of friends, distance to the root in the repost tree, and others.
• Content and language features – metrics based on word counts and text entropy, such as the number of verbs, nouns, and others.
• Sentiment features – metrics that can be obtained by using sentiment extraction and natural language processing techniques. For example, text happiness, polarization, some specific classes of text, and others.
• Temporal features – features that reflect the dynamics of user activity, such as profile login frequency, the number of posts per day, and others.
Benford's law. As noted in [12]: "The history of Benford's Law is a fascinating and unexpected story of the interplay between theory and applications... Currently, hundreds of papers are being written by accountants, computer scientists, engineers, mathematicians, statisticians, and many others."
In its simplest form, "Benford's law for the leading digit" is formulated as follows. First, every positive number $x$ has a unique representation in "scientific notation", that is, in the form $S(x) \cdot 10^k$, where $S(x) \in [1, 10)$ is called its significand and $k$ is an integer called its exponent. The integer part $d$ of $S(x)$ is called its first digit. The definition of Benford's law uses logarithms, and taking into account that for finite datasets frequencies cannot be irrational numbers, the following working definition is used (see [12]):
A dataset satisfies Benford's law for the leading digit if the probability of observing a first digit of $d$ is approximately $\log_{10}\left(\frac{d+1}{d}\right)$.
In our setting, we did not yet need a more refined definition.
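For illustration, the nine first-digit probabilities under this working definition can be computed directly. A minimal sketch in Python (ours, not part of the paper's tooling):

    import math

    # Benford first-digit probabilities: P(d) = log10((d + 1) / d)
    for d in range(1, 10):
        print(d, round(math.log10((d + 1) / d), 3))

    # Output:
    # 1 0.301   2 0.176   3 0.125   4 0.097   5 0.079
    # 6 0.067   7 0.058   8 0.051   9 0.046

So about 30% of first digits are expected to be 1, and fewer than 5% to be 9.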
As one of the works most closely related to ours, let us mention [10], where random samples of users from the Twitter, Google Plus, Pinterest, Facebook, and LiveJournal social networks were studied. The research showed that the friends and followers metrics obey Benford's law (with the exception of Pinterest followers, which can be simply explained). Although the study was not about bot detection, some parts of the datasets that had a low correlation with Benford's distribution (under 0.5%) represented spam and bots.
In this paper, we check this observation and verify whether it is possible to use Benford's law for security purposes.
3 Bot detection approach
For bot detection, we propose an approach based on the combined use of Benford's law tests for multiple user metrics. The test result is positive for a metric if the metric's distribution agrees with Benford's distribution (the p-value is bigger than 0.95). The test result is negative for a metric if the metric's distribution does not agree with Benford's distribution (the p-value is lower than 0.95).
The metrics are obtained from a dataset: a list of accounts which need to be checked (whether they are bots). For the accounts, we calculate p-values for the following metric distributions: number of friends, number of groups, number of followers, number of subscriptions, number of photo albums, number of photos, and number of posts. If the proportion of positive results (p-value bigger than 0.95) is above a certain threshold, we recognize the dataset as representing human users. Otherwise, we recognize the dataset as representing bots.
In our experiments we use the following schema (a sketch of the resulting decision rule is given at the end of this section):
At the first step, for each dataset we get the list of the first significant digits of each metric, and for each metric we test the agreement of the metric with Benford's distribution using the Kolmogorov-Smirnov test. We decide that the test is passed if the p-value is > 0.95. The pipeline of this process is presented in Figure 1.
At the second step, we check that the user datasets passed the test. If a large number of user datasets did not pass the test for some metric, we exclude this metric from the analysis.
At the third step, if the bot datasets did not pass the test for 2/3 of the metrics (those not removed at the previous step), we conclude that the bot datasets were recognized correctly.
Figure 1. Data processing pipeline for the first step.
This approach is very simple, but it provides a sufficient basis for experiments and makes it possible to evaluate whether Benford's law can be used for bot detection.
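The decision rule described above can be sketched as follows (Python; the function and parameter names are ours, since the paper's implementation is an R pipeline described in Section 4):

    def classify_dataset(p_values, pass_share=2/3, p_threshold=0.95):
        """Label a dataset of accounts from per-metric Benford-test p-values.

        p_values: dict mapping a metric name (e.g. "friends") to the p-value
        of the Kolmogorov-Smirnov test of that metric's first-digit
        distribution against Benford's distribution. Following the approach
        above, a metric "passes" when its p-value is > 0.95, and the dataset
        is treated as human users when enough metrics pass.
        """
        passed = sum(1 for p in p_values.values() if p > p_threshold)
        return "users" if passed / len(p_values) >= pass_share else "bots"

    # Example with three metrics, two of which pass:
    # classify_dataset({"friends": 0.98, "groups": 0.99, "subscriptions": 0.34})
    # -> "users"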
4 Experiments and evaluation
To evaluate the capacity of Benford's law to detect bots, we conducted the following experiment. We developed software to collect account metrics from the Vkontakte social network, bought bots and collected 8 bot datasets, collected 10 real-user datasets, ran Kolmogorov-Smirnov tests for each metric of each dataset, and compared the results for bots and real users.
4.1 Dataset description
Our data were divided into 2 parts: bots and users.
To collect bots, we bought the promotion of 8 posts from 3 different companies at different quality levels. An important note: these companies can have different strategies for how to create bots (create a new account, buy an account, hack an account, or pay the account owner for some action) and how to manage them (automatic, manual, or mixed control). The quality also depends on the strategy, so bots of higher quality have more complicated strategies and act more naturally.
At first, we found 3 companies that provide post promotion using bots:
• Vtope company. Provides 3 types of quality: standard-quality bots (dataset bot_1), high-quality bots (dataset bot_2), and live users (dataset bot_3).
• Martinismm company. Also provides 3 types of quality: standard-quality bots (dataset bot_4), high-quality bots (dataset bot_5), and live users (dataset bot_6).
• Vktarget company. Provides 2 types of quality: standard-quality bots (dataset bot_7) and high-quality bots (dataset bot_8).
The quality is the option that one can select when buying bots; it is given as stated on the companies' sites.
For each company we created a group (3 groups in total) and filled them with some posts, photos, and subscribers to create the appearance of activity:
• Wigs for hairless cats – the group for the Vtope company (datasets bot_1, bot_2, bot_3).
• Deer Registration at the Ministry of Nature – for the Martinismm company (datasets bot_4, bot_5, bot_6).
• Minibus from Jõgeva to Agayakan – for the Vktarget company (datasets bot_7, bot_8).
We made these groups absurd to make sure that during the experiments real users would not join them (they could accidentally see our groups) and we would collect only bots. During the experiments, only one real user joined a group, and we excluded him from the dataset.
For each company we followed the same algorithm: create a post in a group → buy 300 likes under the post → collect the accounts that liked the post (only the account IDs) into a file → delete the post. After each iteration, we performed it again with bots of another quality.
For the Vtope and Martinismm companies the whole process for one post took one day, because the companies are afraid to give likes too quickly and try to simulate natural growth. For the Vktarget company the collection took 3 days, because Vktarget provides not bots but referrals. Referrals are real users (not operated by a program) who registered on the bot exchange platform and give likes, comments, and other activity for money.
To collect real users, we selected 10 different posts of 7 different focuses. We selected posts that have around 250 likes (datasets user_N).
The summary of the collected bots and users is provided in Table 1.
4.2 Implementation
We have the IDs of users and bots, and we need to scan them to obtain the metrics. The software for metrics collection is an application in the Vkontakte social network. It works according to the following schema (Figure 2).
Figure 2. Data collection schema implementation.
At the first step, the task manager application takes files with unique account IDs as input.
At the second step, we use multiple Vkontakte accounts to run Vkontakte IFrame application clients [2]. The task manager distributes the account IDs between these clients.
At the third step, the IFrame apps scan account profiles by the IDs given by the task manager. The work of these clients is based on the current user access key: one needs to be logged in to Vkontakte to run the application.
Using IFrame apps we have a limitation: no more than 3 Vkontakte API requests per second. Another limitation of the data collection process is that some API requests have restrictions on the allowed number of requests per day. For example, the "wall.get" API request (which helps to obtain the number of posts in a profile) cannot be called more than 5000 times per day from one account.
In our application, the Vkontakte API is called 6 times per ID for all the necessary data collection, with a 350-millisecond delay between API requests. This means that full data collection for one user takes approximately 2 seconds and 450 milliseconds. To make the data collection process faster, it is parallelized between application users: they collect data about different users of the social network without intersecting with each other. A sketch of the per-profile collection loop is given below.
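A minimal sketch of the per-profile loop (Python; call_api is a hypothetical stand-in for the IFrame-app client, since only the "wall.get" method is named above, and the mapping of counters to concrete API methods is our assumption):

    import time

    API_DELAY = 0.35  # 350 ms pause between requests, as described above
    METRICS = ("friends", "groups", "followers", "subscriptions",
               "albums", "photos", "posts")

    def collect_profile(user_id, call_api):
        """Collect the seven public counters for one account, rate-limited.

        call_api(metric, user_id) stands in for the real Vkontakte client.
        Here one call is made per counter for simplicity; in the paper's
        setup the seven counters are obtained with 6 API calls, so one
        profile takes roughly 6 * 0.35 s plus response time (~2.45 s).
        """
        counters = {}
        for metric in METRICS:
            counters[metric] = call_api(metric, user_id)
            time.sleep(API_DELAY)  # stay under the 3-requests-per-second limit
        return counters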
We collected all the public numerical data from the profiles: the friends count, groups count, followers count, subscriptions count, albums count, photos count, and posts count. This is the maximum numerical information about a user that can be collected without obtaining additional access rights to his or her page. If a user's page is deleted, blocked, or banned, none of this data can be collected. Also, users can use privacy settings to hide some profile information.
At the fourth step, the IFrame apps enter the metrics and IDs into the database.
Table 1. Datasets description

Dataset | Type | Number of profiles | Description
bot_1 | software bots | 301 | Provided by Vtope company. Low-quality bots.
bot_2 | software bots | 295 | Provided by Vtope company. Mid-quality bots.
bot_3 | live bots | 298 | Provided by Vtope company. Live-quality bots.
bot_4 | software bots | 301 | Provided by Martinismm company. Low-quality bots.
bot_5 | software bots | 303 | Provided by Martinismm company. Mid-quality bots.
bot_6 | live bots | 304 | Provided by Martinismm company. Live-quality bots.
bot_7 | live bots | 302 | Provided by Vktarget company. Low-quality bots. Exchange platform.
bot_8 | live bots | 357 | Provided by Vktarget company. High-quality bots. Exchange platform.
user_1 | activists | 385 | Post in group "velosipedization", dedicated to the development of bicycle transport in Saint-Petersburg.
user_2 | mass media | 298 | Post in group "belteanews", dedicated to Belarussian news.
user_3 | developers | 332 | Post in group "tproger", dedicated to software development.
user_4 | sport | 224 | Post in group "mhl", dedicated to youth hockey.
user_5 | mass media | 420 | Post in group "true_lentach", dedicated to Russian news.
user_6 | blog | 251 | Post in group "mcelroy_dub", dedicated to re-playing of funny videos.
user_7 | commerce | 284 | Post in group "sevcableport", dedicated to a creative space in Saint-Petersburg.
user_8 | festival | 259 | Post in group "bigfestava", dedicated to a cartoon festival.
user_9 | sport | 181 | Post in group "hcakbars", dedicated to the fan community of the hockey club Ak Bars.
user_10 | developers | 397 | Post in group "tnull", dedicated to software development and memes.
After collecting the metrics, we generate CSV files from the database (one per dataset). We then analyze the CSV files with the account metrics in R, using the Kolmogorov-Smirnov test and the DescTools library, which provides Benford's distribution. The logic of the analysis for each metric is the following (a Python sketch of the same logic is given after this list):
• filter out metric values lower than 1 (removing deleted, blocked, and banned users, and users who hide the parameter using privacy settings);
• get the first significant digit of the metric;
• run the Kolmogorov-Smirnov test to check the fit of the metric to Benford's distribution.
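The same per-metric logic, sketched in Python with NumPy and SciPy instead of R's DescTools (our sketch, assuming the KS test compares first digits against Benford's step CDF, mirroring the R pipeline):

    import numpy as np
    from scipy import stats

    # Benford first-digit probabilities P(d) = log10((d + 1) / d), d = 1..9
    BENFORD_P = np.log10(1 + 1 / np.arange(1, 10))
    BENFORD_CUM = np.concatenate(([0.0], np.cumsum(BENFORD_P)))

    def benford_cdf(x):
        """Step CDF of Benford's first-digit distribution."""
        d = np.clip(np.floor(np.asarray(x, dtype=float)).astype(int), 0, 9)
        return BENFORD_CUM[d]

    def benford_pvalue(values):
        """KS-test p-value of a metric's first digits against Benford's law."""
        v = np.asarray(values, dtype=float)
        v = v[v >= 1]                    # filter out hidden/empty metric values
        first = (v // 10 ** np.floor(np.log10(v))).astype(int)  # first digit
        return stats.kstest(first, benford_cdf).pvalue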
4.3 Results
We unite the results into 2 tables: one for bots (Table 3) and one for users (Table 2). The p-values higher than 0.95 are colored green (test passed); lower values are colored red (test failed).
We processed the results based on the proposed approach. In the bot datasets, the followers and posts metrics passed the test in 5 cases out of 8 (Table 3), while in the user datasets the albums and photos metrics did not pass the test in 8 and 7 cases out of 10, respectively (Table 2). So we do not use these metrics (highlighted yellow in the tables) and evaluate based on the friends, groups, and subscriptions metrics.
Based on these 3 metrics, we can formulate how we make the decision: if at least 2 metrics out of 3 passed the test, we recognize the dataset as real users, and if not, as bots. According to this, 3 bot datasets out of 8 in Table 3 were recognized as users: bot_6, bot_7 and bot_8. Also, 1 user dataset out of 10 in Table 2 was recognized as bots: user_4. The cause and meaning of these errors will be discussed in the next section.
Table 2. p-values for the real-user datasets

Dataset | friends | groups | followers | subscriptions | albums | photos | posts
user_1 | 0.9895 | 0.9999 | 0.9999 | 0.9895 | 0.9895 | 0.9999 | 0.9895
user_2 | 0.9895 | 0.9794 | 0.3364 | 0.9999 | 0.3364 | 0.3364 | 0.9999
user_3 | 0.9794 | 0.9895 | 0.9999 | 0.9895 | 0.4647 | 0.6994 | 0.9794
user_4 | 0.6994 | 0.3364 | 0.3364 | 0.9794 | 0.3364 | 0.6994 | 0.9794
user_5 | 0.9794 | 0.9794 | 0.9895 | 0.9895 | 0.1243 | 0.7301 | 0.9999
user_6 | 0.9895 | 0.6994 | 0.9999 | 0.9794 | 0.3364 | 0.6994 | 0.7301
user_7 | 0.9794 | 0.9794 | 0.7301 | 0.6994 | 0.7301 | 0.1243 | 0.6994
user_8 | 0.9895 | 0.9895 | 0.9999 | 0.9895 | 0.3364 | 0.9794 | 0.9794
user_9 | 0.6994 | 0.9895 | 0.9794 | 0.9794 | 0.9539 | 0.9296 | 0.9794
user_10 | 0.9999 | 0.9794 | 0.9999 | 0.9794 | 0.3364 | 0.9895 | 0.9794
Table 3. p-values for the bot datasets

Dataset | friends | groups | followers | subscriptions | albums | photos | posts
bot_1 | 0.1243 | 0.0016 | 0.9894 | 0.3364 | 0.9439 | 0.5136 | 0.9793
bot_2 | 0.6993 | 0.0366 | 0.9894 | 0.3364 | 0.9776 | 0.6171 | 0.6993
bot_3 | 0.6993 | 0.0630 | 0.9793 | 0.1243 | 0.9907 | 0.7344 | 0.9793
bot_4 | 0.5436 | 0.5727 | 0.8239 | 0.6993 | 0.7090 | 0.7740 | 0.5436
bot_5 | 0.3364 | 0.3364 | 0.6993 | 0.3364 | 0.5454 | 0.8994 | 0.6993
bot_6 | 0.9793 | 0.9999 | 0.6993 | 0.9999 | 0.6993 | 0.6993 | 0.9999
bot_7 | 0.9793 | 0.9999 | 0.9793 | 0.7301 | 0.5907 | 0.7229 | 0.9894
bot_8 | 0.9999 | 0.9894 | 0.6993 | 0.9793 | 0.9793 | 0.9793 | 0.9999
5 Discussion
The experiment showed that Benford's law is not an ultimate rule that can detect bots. Even after filtering out part of the metrics, we got three bot datasets that were recognized as users and one real-user dataset that was recognized as bots.
At the same time, according to Table 1, the bot datasets that were recognized incorrectly (bot_6, bot_7 and bot_8) correspond to "live" bots. These bots represent real users who participate in the exchange market (see Table 1), where they can earn money by their activity. We can say this definitely for the bot_7 and bot_8 datasets, because the company where we bought them represents such an exchange platform. Possibly, the company that provided the bot_6 dataset also works through a similar exchange platform.
It is also important to note that one "live" bot dataset, bot_3, did not pass the test, and its bots were recognized correctly. We also compare the bots by their connectivity in Figure 3. As one can see, the "live quality" bot_3 friends-graph looks the same as those of the software bots bot_1, bot_2, bot_4 and bot_5. Possibly the company does not provide "live" bots and deceives its customers. At the same time, bot_6 really looks like it works through an exchange platform, because its graph is similar to the graphs of real users.
As we can see, given the descriptions provided by the companies, it is difficult to divide bots into software and live ones, so the labels must be treated conditionally. If we take into account Figure 3 and the descriptions given by the companies, we can conclude that the bots in bot_6, bot_7, and bot_8 are live users from the exchange market. So we can say that the Benford's-distribution-based test works only for the detection of "software" bots and cannot be used for the correct detection of "live" bots.
Another point of discussion is the test results for real users. There are 2 metrics that did not pass the test, so we removed them: albums and photos. This is probably because of privacy settings. In Vkontakte, when a user applies privacy settings to hide friends, groups, and other metrics, we cannot get any information about them, while the privacy settings for albums are more flexible: users can hide part of their photos while leaving the rest visible. Also, Vkontakte users prefer to hide photos more often than other metrics.
In some cases, other metrics also did not pass the test. We can even see that one user dataset, user_4, was recognized as bots. We cannot say for sure why this is so: it can be because of privacy settings or specific features of communities. Also, commerce groups can use bots as part of their social media marketing strategy.
So the experiments showed that "software" bots can possibly be identified by their disagreement with Benford's law. For an accurate assessment, it is necessary to extend the experiments and include more datasets. It is also necessary to increase the size of the datasets (because of privacy settings) and evaluate how specific features of real-user communities affect the false-positive results.
It is also important to note that some results of the experiments may be affected by features of the Vkontakte social network. In other social networks, users may have less flexible privacy settings.
In future work, we plan to use the p-value of agreement with Benford's distribution as a feature in machine learning approaches (for example, those used in BotOrNot [3]). It is also necessary to check whether it is possible to identify bots individually by applying the Benford's law agreement test to a bot's friend list.
Benford's law for bot detection can be used by social network companies, government structures, journalists, and companies as part of anti-competitive intelligence.
An important ethical point is that bot companies can also use Benford's law to increase the quality of their bots. In that case, it can become more complicated to detect bots. At the same time, it will increase the complexity and cost of creating such bots and managing their activity. Thus, bots will be more expensive, and the costs for people using bots will increase.
6 Conclusion
In this paper, we analyzed an approach for bot detection in social networks based on the application of Benford's law. The proposed approach is quite simple: if the p-value of agreement between the profiles' distribution and Benford's distribution is below a certain threshold, we recognize these profiles as bots (or bots as a substantial part of such a profile set).
To evaluate the proposed approach, we performed experiments with 8 bot datasets that we bought from 3 companies and 10 real-user datasets. Our experiments showed that "software" bots can be identified by the Benford's law approach. "Live" bots, which are represented by hacked accounts and real users from exchange platforms, cannot be identified by Benford's law. Some sets of real users can also be falsely identified as bots, we think because of privacy settings and features of communities.
For an accurate evaluation of the proposed approach, it is necessary to extend the experiments with additional datasets. In future work, we plan to increase the size and number of datasets and analyze how specific features of real-user communities can affect false-positive results. We also plan to use the p-value of agreement with Benford's distribution as a feature in machine learning approaches and evaluate how it affects accuracy. It is also necessary to check whether it is possible to identify bots individually by applying the Benford's law agreement test to a bot's friend list.
Acknowledgments
This research was supported by the Russian Science Foundation under grant number 18-71-10094 in SRC RAS.
References
[1] Andrea Cerioli, Lucio Barabesi, Andrea Cerasa, Mario Menegatti, and Domenico Perrotta. 2019. Newcomb–Benford law and the detection of frauds in international trade. Proceedings of the National Academy of Sciences 116, 1 (2019), 106–115.
[2] VKontakte company. 2017. IFrame Applications. Retrieved September 12, 2020 from https://vk.com/dev/IFrame_apps
[3] Clayton Allen Davis, Onur Varol, Emilio Ferrara, Alessandro Flammini, and Filippo Menczer. 2016. BotOrNot: A system to evaluate social bots. In Proceedings of the 25th International Conference Companion on World Wide Web. 273–274.
[4] John P Dickerson, Vadim Kagan, and VS Subrahmanian. 2014. Using sentiment to detect bots on Twitter: Are humans more opinionated than bots?. In 2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2014). IEEE, 620–627.
[5] Guozhu Dong and Huan Liu. 2018. Feature engineering for machine learning and data analytics. CRC Press.
[6] Rob Faris, Hal Roberts, Bruce Etling, Nikki Bourassa, Ethan Zuckerman, and Yochai Benkler. 2017. Partisanship, Propaganda, and Disinformation: Online Media and the 2016 U.S. Presidential Election. Retrieved September 9, 2020 from https://cyber.harvard.edu/publications/2017/08/mediacloud
[7] Emilio Ferrara. 2017. Disinformation and social bot operations in the run up to the 2017 French presidential election. arXiv:1707.00086
[8] John D Gallacher, Vlad Barash, Philip N Howard, and John Kelly. 2018. Junk news on military affairs and national security: Social media disinformation campaigns against US military personnel and veterans. arXiv:1802.03572
[9] Nicolas Gauvrit, Jean-Charles Houillon, and Jean-Paul Delahaye. 2017. Generalized Benford's Law as a lie detector. Advances in Cognitive Psychology 13, 2 (2017), 121.
[10] Jennifer Golbeck. 2015. Benford's law applies to online social networks. PloS One 10, 8 (2015).
[11] Arzum Karataş and Serap Şahin. 2017. A review on social bot detection techniques and research directions. In Proc. Int. Security and Cryptology Conference Turkey. 156–161.
[12] Steven J Miller. 2015. Benford's Law. Princeton University Press.
[13] Francesco Pierri, Alessandro Artoni, and Stefano Ceri. 2020. Investigating Italian disinformation spreading on Twitter in the context of 2019 European elections. PloS One 15, 1 (2020), e0227821.
[14] Giovanni C Santia, Munif Ishad Mujib, and Jake Ryland Williams. 2019. Detecting Social Bots on Facebook in an Information Veracity Context. In Proceedings of the International AAAI Conference on Web and Social Media, Vol. 13. 463–472.
[15] Tao Stein, Erdong Chen, and Karan Mangla. 2011. Facebook immune system. In Proceedings of the 4th Workshop on Social Network Systems. 1–8.
[16] VS Subrahmanian, Amos Azaria, Skylar Durst, Vadim Kagan, Aram Galstyan, Kristina Lerman, Linhong Zhu, Emilio Ferrara, Alessandro Flammini, and Filippo Menczer. 2016. The DARPA Twitter bot challenge. Computer 49, 6 (2016), 38–46.
[17] Lidia Vitkova, Igor Kotenko, Maxim Kolomeets, Olga Tushkanova, and Andrey Chechulin. 2019. Hybrid Approach for Bots Detection in Social Networks Based on Topological, Textual and Statistical Features. In International Conference on Intelligent Information Technologies for Industry. Springer, 412–421.
Figure 3. Graph analysis of bot quality and comparison with some user datasets.