Social networks bot detection using Benford’s law
Maxim Kolomeets
kolomeec@comsec.spb.ru
St. Petersburg Federal Research
Center of the Russian Academy of
Sciences
Saint-Petersburg, Russia
Dmitry Levshun
levshun@comsec.spb.ru
St. Petersburg Federal Research
Center of the Russian Academy of
Sciences
Saint-Petersburg, Russia
Sergei Soloviev
sergei.soloviev@irit.fr
Université Paul Sabatier Toulouse III
Toulouse, France
Andrey Chechulin
chechulin@comsec.spb.ru
St. Petersburg Federal Research
Center of the Russian Academy of
Sciences
Saint-Petersburg, Russia
Igor Kotenko
ivkote@comsec.spb.ru
St. Petersburg Federal Research
Center of the Russian Academy of
Sciences
Saint-Petersburg, Russia
Abstract
The paper considers the task of bot detection in social networks. It checks the hypothesis that bots break Benford's law much more often than users, so one can identify them. A bot detection approach is proposed based on experiments where the test results for bot datasets of different classes and real-user datasets of different communities are evaluated and compared. The experiments show that automatically controlled bots can possibly be identified by their disagreement with Benford's law, while human-orchestrated bots cannot.
CCS Concepts: • Security and privacy → Social network security and privacy.

Keywords: bot detection, information security, social networks, Benford's law
ACM Reference Format:
Maxim Kolomeets, Dmitry Levshun, Sergei Soloviev, Andrey Chechulin, and Igor Kotenko. 2020. Social networks bot detection using Benford's law. In ?????????????. ACM, New York, NY, USA, 8 pages. https://doi.org/??????????????????
1 Introduction
Detection of social bots is especially important in cybersecurity contexts involving social networks because bots are software agents designed to mimic humans but programmed for specific purposes. Bots may seriously threaten
the integrity of networks and various networked installations. Bots are used for metrics cheating and information spreading by different actors: commerce companies, scammers, and even political manipulators.
The diculty of bots detection is an important factor that
enables the proliferation of bots [
14
]. Statistical methods play
a leading role in social bots detection due to the multiplicity
of bots and diculty of access to the underlying code.
In this paper, we analyze the possibility of detecting bots by applying Benford's law [12], which requires just one metric of a user (one API request per user).
Benford's law is already used for the detection of fraud [1, 9]. Our hypothesis is that it can also be applied to bot detection in social networks, since bots break Benford's law significantly more often than human users because of their unnatural behavior. We analyze and experimentally test different numerical parameters of accounts to find ones that obey Benford's distribution for users and do not obey it for bots.
The novelty of the paper is the use of Benford's law for the detection of different types of bots in social networks. The contribution also includes the results of large-scale experiments that involved purchased bots and human users and showed that Benford's law can be applied to detect bots that are controlled by software. The results of the experiments for live bots (involving human input) were not conclusive.
The paper is organized as follows. The second section is the state of the art, where we describe current approaches in bot detection. The third section presents the bot detection approach based on Benford's law. The fourth section presents the experiments and evaluation, where we describe the datasets, the software that we developed for collection and analysis, and the results of the experiments. The fifth section is the discussion, where we explain the results of the experiments. The last section is the conclusion, where we summarize the results and present plans for future work.
2 State of the art
Bots detection. Nowadays there exists a full-fledged market where different types of bots are sold. Usually, sellers indicate the "quality" of bots. Higher quality means a higher price. Quality tags such as "automatic", "standard", "high", "auto-bot", "live", "real user", "referrals" and others are used. It is hard to find out exactly what these tags mean, e.g., are "real users" human persons or just hacked accounts? But the basic idea of the categorization is clear. There are:
• Software bots / automated bots: bots that are controlled by some software.
• Live bots / human-orchestrated bots: bots that are at least partly controlled by humans.
The border between these types is blurred. Software bots represented by hacked accounts can have natural profiles that look like those of human users (photos, friend lists, or other content) but show anomalous activity common to bots (identical comments, too numerous likes, etc.). At the same time, there exist automatically generated bots (with anomalous profiles) orchestrated by real persons (and thus with more natural behavior).
There also exist exchange platforms and troll factories. Exchange platforms pay users for their actions. Analyzing this type of bots, we have found that their "abnormal" activity (like massive re-posts of publicity, political events, clearly fraudulent "cries for help" and others) overlaps with "normal" activity (like speaking with family members, greeting friends on holidays, etc.). So these are rather bot accounts used (or abused) by real persons for a specific cause, an exchange of services, or merely for money. A more extreme case is troll factories, "notorious" companies that hire paid staff who, completely manually and without automatic methods, create accounts and conduct information attacks.
This explains the interest in bot identification. Many techniques have been developed for this purpose [11]. However, none of them can guarantee 100% accuracy, and usually several methods are used together. The main methods of bot detection are: statistical methods, graph-based methods, visualization, forensics, and machine learning that uses the results of all previous methods as features.
Statistical methods are based on the comparison of distributions of user metrics and bot metrics, such as the number of friends, number of posts, word occurrence, number of comments per day, and others [14, 17]. Graph-based methods are network science algorithms that analyze graph topology to find communities (as bots prefer to add other bots to their friend lists [15]) and patterns in network structures. Facebook also uses a social-graph credibility system, the "Facebook immune system" [15]. Visualization methods are used for the representation of social network graphs, statistics, and word clouds, so that an analyst can manually characterize a dataset [6–8, 13].
Forensics methods use multiple case-based techniques. For example, in [7] the dataset for the analysis of bot impact on the French election was selected by collecting tweets with specific hashtags related to the MacronLeaks information campaign. Similar case-based methods were used in investigations of bot impact on the European [13] and United States [6] elections. Also, the owners of social networks have access to additional methods that typical researchers cannot use, such as comparison of IP addresses or login history.
Machine learning methods are the most popular methods because they can unite the results of multiple other methods. They use various features of users extracted from numerical (e.g., number of photos), textual (e.g., text of the posts), media (e.g., photos), and topological (e.g., community graph) data. For example, the BotOrNot [3] system uses around 1000 features for the classification of bots and real users on Twitter. Sentibot [4] is a framework for bot detection whose novelty is the use of different measures of sentiment as features for classification. Also, DARPA organized a challenge [16] in which 6 teams participated, using various methods for the detection of software bots.
Usually, researchers use the following features [5] in machine learning:
• User-based features: can be obtained from the profile of the account. For example, user name, age, length of the user's description, and others.
• Friend features: metrics that can be obtained from the friend list of the analyzed user, such as the number of friends, friends' age distribution, their name-length distribution, and others.
• Network features: metrics that one can get using graph-based methods. For example, centrality measures of the graph of friends, distance to the root of the repost tree, and others.
• Content and language features: metrics based on word counts and text entropy, such as the number of verbs, nouns, and others.
• Sentiment features: metrics that can be obtained by using sentiment extraction and natural language processing techniques. For example, text happiness, polarization, some specific classes of text, and others.
• Temporal features: features that reflect the dynamics of user activity, such as profile login frequency, the number of posts per day, and others.
Benford's law. As noticed in [12]: "The history of Benford's Law is a fascinating and unexpected story of the interplay between theory and applications... Currently, hundreds of papers are being written by accountants, computer scientists, engineers, mathematicians, statisticians, and many others."
In its simplest form, "Benford's law for the leading digit" is formulated as follows. First, every positive number x has a unique representation in "scientific notation", that is, in the form S(x) · 10^k, where S(x) ∈ [1, 10) is called
its signicant and
𝑘
is an integer called its exponent. The
integer part
𝑑
of
𝑆(𝑥)
is called its rst digit. Benford’s law
denition uses logarithm, and taking into account that for
nite datasets frequencies cannot be irrational numbers, the
following working denition is used (see [12]):
A dataset satises Benford’s law for leading digit if the
probability of observing a rst digit of
𝑑
is approximately
𝑙𝑜𝑔10 (𝑑+1
𝑑).
In our setting, we did not need yet a more rened deni-
tion.
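For illustration, the leading-digit frequencies implied by this working definition can be computed directly. The following short Python sketch is ours, not part of the paper (whose analysis is done in R, as described in Section 4.2); it only evaluates the formula above.

```python
import math

# Benford's law for the leading digit: P(d) = log10((d + 1) / d), d = 1..9
benford = {d: math.log10((d + 1) / d) for d in range(1, 10)}

for d, p in benford.items():
    print(f"digit {d}: {p:.4f}")   # digit 1: 0.3010 ... digit 9: 0.0458

# The nine probabilities cover all possible leading digits and sum to 1.
assert abs(sum(benford.values()) - 1.0) < 1e-12
```

About 30.1% of leading digits are expected to be 1 and only about 4.6% to be 9; it is this skew that the tests below check against.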
As one of the works most closely related to ours, let us mention [10], where random samples of users from the Twitter, Google Plus, Pinterest, Facebook, and LiveJournal social networks were studied. The research showed that the friends and followers metrics obey Benford's law (with the exception of Pinterest followers, which can be simply explained). Although the study was not about bot detection, the parts of the datasets that had a low correlation with Benford's distribution (under 0.5%) represented spam and bots.
In this paper, we check this observation and verify whether it is possible to use Benford's law for security purposes.
3 Bot detection approach
For bot detection, we propose an approach based on the combined use of Benford's law tests for multiple user metrics. The test result is positive for a metric if the metric distribution agrees with Benford's distribution (p-value greater than 0.95). The test result is negative for a metric if the metric distribution does not agree with Benford's distribution (p-value lower than 0.95).
The metrics are obtained from a dataset: a list of accounts that need to be checked (whether they are bots). For these accounts, we calculate p-values for the following metric distributions: number of friends, number of groups, number of followers, number of subscriptions, number of photo albums, number of photos, and number of posts. If the proportion of positive results (p-value greater than 0.95) is above a certain threshold, we recognize the dataset as representing human users. Otherwise, we recognize the dataset as representing bots.
In our experiments we use the following schema:
At the first step, for each dataset we get the list of the first significant digits of each metric. Then, for each metric, we test the agreement of the metric with Benford's distribution using the Kolmogorov-Smirnov test. We decide that the test is passed if the p-value is > 0.95. The pipeline of this process is presented in Figure 1.
At the second step, we check that the user datasets pass the test. If a large number of user datasets do not pass the test for some metric, we exclude this metric from the analysis.
At the third step, if the bot datasets do not pass the test for 2/3 of the metrics (that were not removed at the previous step), we conclude that the bot datasets were recognized correctly.
Figure 1. Data processing pipeline for the first step.
This approach is very simple, but it provides a sufficient basis for experiments and permits evaluating whether it is possible to use Benford's law for bot detection.
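As a minimal sketch of this schema (the paper's own analysis is implemented in R with the DescTools library; Python with SciPy is used here purely for illustration, and the p > 0.95 criterion and the 2/3 share of passing metrics follow the description above), the per-metric test and the dataset-level decision could look as follows.

```python
import math
import numpy as np
from scipy import stats

# Cumulative Benford distribution over leading digits 1..9 (index 0 = "below 1").
_BENFORD_CDF = np.concatenate(
    ([0.0], np.cumsum([math.log10((d + 1) / d) for d in range(1, 10)]))
)

def benford_cdf(x):
    """CDF of the Benford leading-digit distribution, evaluated pointwise."""
    idx = np.clip(np.floor(np.asarray(x, dtype=float)).astype(int), 0, 9)
    return _BENFORD_CDF[idx]

def first_digits(values):
    """First significant digit of every value >= 1 (other values are filtered out)."""
    return np.array([int(str(int(v))[0]) for v in values if v >= 1])

def metric_p_value(values):
    """Kolmogorov-Smirnov fit of the first-digit sample to Benford's distribution."""
    return stats.kstest(first_digits(values), benford_cdf).pvalue

def classify_dataset(metrics, passing_share=2 / 3, p_threshold=0.95):
    """metrics: dict mapping a metric name (friends, groups, ...) to its list of counts."""
    passed = sum(metric_p_value(vals) > p_threshold for vals in metrics.values())
    return "users" if passed / len(metrics) >= passing_share else "bots"
```

For example, classify_dataset({"friends": [...], "groups": [...], "subscriptions": [...]}) labels a dataset as users when at least two of the three metrics fit Benford's distribution, which is the decision rule applied to the reduced metric set in Section 4.3.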
4 Experiments and evaluation
To evaluate the capacity of Benford's law to detect bots, we conducted the following experiment. We developed software to collect account metrics from the Vkontakte social network, bought bots and collected 8 bot datasets, collected 10 real-user datasets, ran Kolmogorov-Smirnov tests for each metric of each dataset, and compared the results for bots and real users.
4.1 Dataset description
Our data were divided into 2 parts: bots and users.
To collect bots, we bought the promotion of 8 posts by 3 different companies with different quality levels. An important note: those companies can have different strategies for how to create bots (create a new account, buy an account, hack an account, or pay the account owner for some action) and how to manage them (automatic, manual, or mixed control). The quality also depends on the strategy, so bots of higher quality have more complicated strategies and act more naturally.
At rst, we found 3 companies that provide posts promo-
tion using bots:
Vtope company. Provide 3 types of quality: standard
quality bots (dataset bot_1), high-quality bots (dataset
bot_2), and live users (dataset bot_3).
Martinismm company. Also provide 3 types of quality:
standard quality bots (dataset bot_4), high-quality bots
(dataset bot_5), and live users (dataset bot_6).
Vktarget company. Provide 2 types of quality: stan-
dard quality bots (dataset bot_7) and high-quality bots
(dataset bot_8).
The quality is an option that one can select when buying bots. The quality labels are given as provided on the companies' sites.
For each company we created a group (3 groups in total) and filled them with some posts, photos, and subscribers to create the appearance of activity:
• Wigs for hairless cats. Group for the Vtope company (datasets bot_1, bot_2, bot_3).
• Deer Registration at the Ministry of Nature. For the Martinismm company (datasets bot_4, bot_5, bot_6).
• Minibus from Jõgeva to Agayakan. For the Vktarget company (datasets bot_7, bot_8).
We made those groups absurd to make sure that during the experiments real users would not join them (they could accidentally see our group) and we would collect only bots. During the experiments, only one real user joined a group, and we excluded him from the dataset.
For each company we followed the same algorithm: create a post in a group → buy 300 likes under the post → collect the accounts that liked the post (only the account IDs) to a file → delete the post. After an iteration, we performed it again with bots of another quality.
For the Vtope and Martinismm companies the whole process for one post took one day, because the companies are afraid to give likes too quickly and try to simulate natural growth. For the Vktarget company the collection took 3 days because Vktarget provides not bots but referrals. Referrals are real users (not operated by a program) who registered on the bot exchange platform and give likes, comments, and other activity for money.
To collect real users, we selected 10 different posts with 7 different focuses. We selected posts that have around 250 likes (datasets user_N).
The summary of the collected bots and users is provided in Table 1.
4.2 Implementation
We have the IDs of users and bots, and we need to scan them to obtain the metrics. The software for metric collection is an application in the Vkontakte social network. It works according to the following schema (Figure 2).
Figure 2. Data collection schema implementation.
At rst step,task manager application takes les with
unique IDs of accounts as an input.
At second step
, we use multiple Vkontakte accounts to
run Vkontakte IFrame application clients [
2
]. Task manager
distributes IDs of accounts between these clients.
At the third step
, IFrame apps scan accounts proles by
their IDs that were given by the task manager. The work of
these clients is based on the current user access key one
needs to be logged in Vkontakte to run the application.
Using IFrame apps we have a limitation: no more than 3 Vkontakte API requests per second. Another limitation of the data collection process is that some of the API requests have restrictions on the allowed number of requests per day. For example, the "wall.get" API request cannot be called more than 5000 times per day from one account (that request helps to obtain the number of posts in a profile).
In our application, for all necessary data collection the Vkontakte API is called 6 times per ID, with a 350-millisecond delay between API requests. It means that full data collection for one user takes approximately 2 seconds and 450 milliseconds. To make the data collection process faster, it is parallelized between application users. They collect data about different users of the social network without intersecting with each other.
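The timing constraints above reduce to a simple rate-limited loop. The sketch below is illustrative only: the paper's implementation is a Vkontakte IFrame application rather than a standalone script, and the method names, parameters, and required permissions beyond the generic https://api.vk.com/method/<name> call pattern are assumptions to be verified against the VK API documentation.

```python
import time
import requests

API_URL = "https://api.vk.com/method/{method}"
DELAY_S = 0.35        # 350 ms between calls keeps us under ~3 requests per second
API_VERSION = "5.103" # assumed API version string; use a current one in practice

def vk_call(method, access_token, **params):
    """One rate-limited call to the VK REST API (illustrative wrapper)."""
    params.update({"access_token": access_token, "v": API_VERSION})
    reply = requests.get(API_URL.format(method=method), params=params, timeout=10)
    time.sleep(DELAY_S)  # respect the per-second request limit
    return reply.json().get("response")

def collect_posts_count(user_id, access_token):
    """Number of wall posts for one profile via the wall.get request mentioned above.

    With 6 such calls per ID and a 350 ms delay, one profile takes roughly
    2.45 s, which matches the figure reported in the text.
    """
    data = vk_call("wall.get", access_token, owner_id=user_id, count=1)
    return data["count"] if data else None  # None if the page is closed or deleted
```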
We collected all public numerical data from profiles: friends count, groups count, followers count, subscriptions count, albums count, photos count, and posts count. This is the maximum numerical information about a user that can be collected without obtaining additional access rights to his or her page. If the user's page is deleted, blocked, or banned, none of this data can be collected. Also, users can use privacy settings to hide some profile information.
At the fourth step, the IFrame apps enter the metrics and IDs into the database.
Table 1. Datasets description

Dataset | Type          | Number of profiles | Description
bot_1   | software bots | 301 | Provided by Vtope company. Low-quality bots.
bot_2   | software bots | 295 | Provided by Vtope company. Mid-quality bots.
bot_3   | live bots     | 298 | Provided by Vtope company. Live-quality bots.
bot_4   | software bots | 301 | Provided by Martinismm company. Low-quality bots.
bot_5   | software bots | 303 | Provided by Martinismm company. Mid-quality bots.
bot_6   | live bots     | 304 | Provided by Martinismm company. Live-quality bots.
bot_7   | live bots     | 302 | Provided by Vktarget company. Low-quality bots. Exchange platform.
bot_8   | live bots     | 357 | Provided by Vktarget company. High-quality bots. Exchange platform.
user_1  | activists     | 385 | Post in group "velosipedization", dedicated to the development of bicycle transport in Saint-Petersburg.
user_2  | mass media    | 298 | Post in group "belteanews", dedicated to Belarussian news.
user_3  | developers    | 332 | Post in group "tproger", dedicated to software development.
user_4  | sport         | 224 | Post in group "mhl", dedicated to youth hockey.
user_5  | mass media    | 420 | Post in group "true_lentach", dedicated to Russian news.
user_6  | blog          | 251 | Post in group "mcelroy_dub", dedicated to re-playing of funny videos.
user_7  | commerce      | 284 | Post in group "sevcableport", dedicated to a creative space in Saint-Petersburg.
user_8  | festival      | 259 | Post in group "bigfestava", dedicated to a cartoon festival.
user_9  | sport         | 181 | Post in group "hcakbars", dedicated to the fan community of the hockey club Ak Bars.
user_10 | developers    | 397 | Post in group "tnull", dedicated to software development and memes.
After collecting the metrics, we generate CSV files from the database (one per dataset). We then analyze the CSV files with the account metrics in R, using the Kolmogorov-Smirnov test and the DescTools library, which provides Benford's distribution. The logic of the analysis for each metric is as follows (a sketch of these steps is given after the list):
• filter out metric values that are lower than 1 (this removes deleted, blocked, and banned users, as well as users who hide the parameter using privacy settings);
• get the first significant digit of the metric;
• run the Kolmogorov-Smirnov test to test the fit of the metric to Benford's distribution.
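A compact Python rendering of this per-metric logic is sketched below for concreteness; the paper performs these steps in R with DescTools, and the CSV file name and column names here are hypothetical.

```python
import math
import numpy as np
import pandas as pd
from scipy import stats

# Cumulative Benford leading-digit distribution, as in the sketch in Section 3.
BENFORD_CDF = np.concatenate(
    ([0.0], np.cumsum([math.log10((d + 1) / d) for d in range(1, 10)]))
)

def benford_cdf(x):
    return BENFORD_CDF[np.clip(np.floor(np.asarray(x, dtype=float)).astype(int), 0, 9)]

def metric_p_value(column):
    values = column[column >= 1]                                 # drop deleted/blocked/hidden profiles
    digits = values.astype(int).astype(str).str[0].astype(int)   # first significant digit
    return stats.kstest(digits, benford_cdf).pvalue              # fit to Benford's distribution

df = pd.read_csv("bot_1.csv")  # hypothetical file name: one CSV per dataset
for metric in ["friends", "groups", "followers", "subscriptions",
               "albums", "photos", "posts"]:
    print(metric, round(metric_p_value(df[metric]), 4))
```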
4.3 Results
We unite the results into 2 tables: one for bots (Table 3) and one for users (Table 2). The p-values that are higher than 0.95 are colored green (test passed), and the lower ones red (test failed).
We processed the results based on the proposed approach. In the bot datasets, the followers and posts metrics pass the test in 5 cases out of 8 (Table 3), while in the user datasets the albums and photos metrics do not pass the test in 8 and 7 cases out of 10, respectively (Table 2). So we do not use these metrics (we highlight them in yellow in the tables) and evaluate based on the friends, groups, and subscriptions metrics.
Based on these 3 metrics we can formulate how we make the decision. If at least 2 metrics out of 3 pass the test, we recognize the dataset as real users, and if not, as bots. According to this, 3 bot datasets out of 8 in Table 3 were recognized as users: bot_6, bot_7 and bot_8. Also, 1 user dataset out of 10 in Table 2 was recognized as bots: user_4. The cause and meaning of these errors are discussed in the next section.
Table 2. P-values for the real user datasets
Dataset friends groups followers subscriptions albums photos posts
user_1 0.9895 0.9999 0.9999 0.9895 0.9895 0.9999 0.9895
user_2 0.9895 0.9794 0.3364 0.9999 0.3364 0.3364 0.9999
user_3 0.9794 0.9895 0.9999 0.9895 0.4647 0.6994 0.9794
user_4 0.6994 0.3364 0.3364 0.9794 0.3364 0.6994 0.9794
user_5 0.9794 0.9794 0.9895 0.9895 0.1243 0.7301 0.9999
user_6 0.9895 0.6994 0.9999 0.9794 0.3364 0.6994 0.7301
user_7 0.9794 0.9794 0.7301 0.6994 0.7301 0.1243 0.6994
user_8 0.9895 0.9895 0.9999 0.9895 0.3364 0.9794 0.9794
user_9 0.6994 0.9895 0.9794 0.9794 0.9539 0.9296 0.9794
user_10 0.9999 0.9794 0.9999 0.9794 0.3364 0.9895 0.9794
Table 3. P-values for the bot datasets
Dataset friends groups followers subscriptions albums photos posts
bot_1 0.1243 0.0016 0.9894 0.3364 0.9439 0.5136 0.9793
bot_2 0.6993 0.0366 0.9894 0.3364 0.9776 0.6171 0.6993
bot_3 0.6993 0.0630 0.9793 0.1243 0.9907 0.7344 0.9793
bot_4 0.5436 0.5727 0.8239 0.6993 0.7090 0.7740 0.5436
bot_5 0.3364 0.3364 0.6993 0.3364 0.5454 0.8994 0.6993
bot_6 0.9793 0.9999 0.6993 0.9999 0.6993 0.6993 0.9999
bot_7 0.9793 0.9999 0.9793 0.7301 0.5907 0.7229 0.9894
bot_8 0.9999 0.9894 0.6993 0.9793 0.9793 0.9793 0.9999
5 Discussion
The experiment showed that Benford's law is not an ultimate rule that can detect bots. Even after filtering out a part of the metrics, we got three bot datasets that were recognized as users and one real-user dataset that was recognized as bots.
At the same time, according to Table 1, the bot datasets that were recognized incorrectly (bot_6, bot_7 and bot_8) correspond to "live" bots. These bots represent real users that participate in the exchange market (see Table 1), where they can earn money by their activity. We can definitely say that for the bot_7 and bot_8 datasets, because the company where we bought them represents such an exchange platform. Possibly, the company that provided us the bot_6 dataset also works through a similar exchange platform.
It is also important to note that one "live" bot dataset, bot_3, did not pass the test, and its bots were recognized correctly. We also compare the bots by their connectivity in Figure 3. As one can see, the friends-graph of the "live quality" bot_3 looks the same as those of the software bots bot_1, bot_2, bot_4 and bot_5. Possibly the company does not actually provide "live" bots and deceives its customers.
At the same time, bot_6 really does look like it works through an exchange platform, because its graph is similar to the graphs of real users.
As we can see, based on the descriptions given by the companies, it is difficult to divide bots into software and live ones, so the labels must be treated as conditional.
If we take into account Figure 3 and the descriptions given by the companies, we can conclude that the bots in bot_6, bot_7, and bot_8 are live users from the exchange market. So we can say that the Benford's distribution-based test works only for the detection of "software" bots and cannot be used for the correct detection of "live" bots.
Another point of discussion is the results of the tests for real users. There are 2 metrics that did not pass the test, so we removed them: albums and photos. This is probably because of privacy settings. In Vkontakte, when a user applies privacy settings to hide friends, groups, and other metrics, we cannot get any information about them. Privacy settings for albums are more flexible: users can hide part of their photos while leaving the rest visible. Also, in Vkontakte users prefer to hide photos more often than other metrics.
In some cases, other metrics also did not pass the test. We can even see that one user dataset, user_4, was recognized as bots. We cannot say for sure why: it can be because of privacy settings or specific features of the community. Also, commerce groups can use bots as part of their social media marketing strategy.
So the experiments showed that "software" bots can possibly be identified by their disagreement with Benford's law. For an accurate assessment, it is necessary to extend the experiments and include more datasets. It is also necessary to increase the size of the datasets (because of the privacy settings) and to evaluate how specific features of real-user communities affect false-positive results.
It is also important to note that some results of the experiments may be affected by features of the Vkontakte social network. In other social networks, users may have less flexible privacy settings.
In future work, we plan to use the Benford's distribution agreement p-value as a feature in machine learning approaches (for example, those used in BotOrNot [3]). It is also necessary to check whether it is possible to identify bots individually by applying the Benford's law agreement test to a bot's friend list.
Benford's law for bot detection can be used by social network companies, government structures, journalists, and companies as part of anti-competitive intelligence.
An important ethical point is that bot companies can also use Benford's law to increase the quality of their bots. In that case, it can become more complicated to detect bots. At the same time, it will increase the complexity and cost of creating such bots and managing their activity. Thus, bots will be more expensive, and the costs for people using bots will increase.
6 Conclusion
In this paper, we analyzed an approach for bot detection in social networks based on the application of Benford's law. The proposed approach is quite simple: if the p-value of agreement between the profiles' metric distribution and Benford's distribution is below a certain threshold, we recognize these profiles as bots (or bots as a substantial part of such a profile set).
To evaluate the proposed approach, we performed experiments with 8 bot datasets that we bought from 3 companies and 10 real-user datasets. Our experiments showed that "software" bots can be identified by the Benford's law approach. "Live" bots that are represented by hacked accounts and real users from exchange platforms cannot be identified by Benford's law. Some sets of real users can also be falsely identified as bots, as we think because of privacy settings and features of communities.
For an accurate evaluation of the proposed approach, it is necessary to extend the experiments with additional datasets. In future work, we plan to increase the size and number of datasets and analyze how specific features of real-user communities can affect false-positive results. We also plan to use the Benford's distribution agreement p-value as a feature in machine learning approaches and evaluate how it affects accuracy. It is also necessary to check whether it is possible to identify bots individually by applying the Benford's law agreement test to a bot's friend list.
Acknowledgments
This research was supported by the Russian Science Foundation under grant number 18-71-10094 in SRC RAS.
References
[1] Andrea Cerioli, Lucio Barabesi, Andrea Cerasa, Mario Menegatti, and Domenico Perrotta. 2019. Newcomb–Benford law and the detection of frauds in international trade. Proceedings of the National Academy of Sciences 116, 1 (2019), 106–115.
[2] VKontakte company. 2017. IFrame Applications. Retrieved September 12, 2020 from https://vk.com/dev/IFrame_apps
[3] Clayton Allen Davis, Onur Varol, Emilio Ferrara, Alessandro Flammini, and Filippo Menczer. 2016. BotOrNot: A system to evaluate social bots. In Proceedings of the 25th International Conference Companion on World Wide Web. 273–274.
[4] John P Dickerson, Vadim Kagan, and VS Subrahmanian. 2014. Using sentiment to detect bots on Twitter: Are humans more opinionated than bots?. In 2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2014). IEEE, 620–627.
[5] Guozhu Dong and Huan Liu. 2018. Feature Engineering for Machine Learning and Data Analytics. CRC Press.
[6] Rob Faris, Hal Roberts, Bruce Etling, Nikki Bourassa, Ethan Zuckerman, and Yochai Benkler. 2017. Partisanship, Propaganda, and Disinformation: Online Media and the 2016 U.S. Presidential Election. Retrieved September 9, 2020 from https://cyber.harvard.edu/publications/2017/08/mediacloud
[7] Emilio Ferrara. 2017. Disinformation and social bot operations in the run up to the 2017 French presidential election. arXiv:1707.00086
[8] John D Gallacher, Vlad Barash, Philip N Howard, and John Kelly. 2018. Junk news on military affairs and national security: Social media disinformation campaigns against US military personnel and veterans. arXiv:1802.03572
[9] Nicolas Gauvrit, Jean-Charles Houillon, and Jean-Paul Delahaye. 2017. Generalized Benford's Law as a lie detector. Advances in Cognitive Psychology 13, 2 (2017), 121.
[10] Jennifer Golbeck. 2015. Benford's law applies to online social networks. PLoS ONE 10, 8 (2015).
[11] Arzum Karataş and Serap Şahin. 2017. A review on social bot detection techniques and research directions. In Proc. Int. Security and Cryptology Conference, Turkey. 156–161.
[12] Steven J Miller. 2015. Benford's Law. Princeton University Press.
[13] Francesco Pierri, Alessandro Artoni, and Stefano Ceri. 2020. Investigating Italian disinformation spreading on Twitter in the context of 2019 European elections. PLoS ONE 15, 1 (2020), e0227821.
[14] Giovanni C Santia, Munif Ishad Mujib, and Jake Ryland Williams. 2019. Detecting Social Bots on Facebook in an Information Veracity Context. In Proceedings of the International AAAI Conference on Web and Social Media, Vol. 13. 463–472.
[15] Tao Stein, Erdong Chen, and Karan Mangla. 2011. Facebook immune system. In Proceedings of the 4th Workshop on Social Network Systems. 1–8.
[16] VS Subrahmanian, Amos Azaria, Skylar Durst, Vadim Kagan, Aram Galstyan, Kristina Lerman, Linhong Zhu, Emilio Ferrara, Alessandro Flammini, and Filippo Menczer. 2016. The DARPA Twitter bot challenge. Computer 49, 6 (2016), 38–46.
[17] Lidia Vitkova, Igor Kotenko, Maxim Kolomeets, Olga Tushkanova, and Andrey Chechulin. 2019. Hybrid Approach for Bots Detection in Social Networks Based on Topological, Textual and Statistical Features. In International Conference on Intelligent Information Technologies for Industry. Springer, 412–421.
Figure 3. Graph analysis of bot quality and comparison with some user datasets.