Comparison of Classification Algorithms to tell Bots and Humans Apart

1Christian Hadiwijaya Saputra, 2Erwin Adi, 3Shintia Revina
1,*2, 3 Bina Nusantara University, Jakarta, Indonesia, eadi@binus.edu
Abstract
Websites are vulnerable to many types of attack, so detecting web activities caused by ‘bots’
becomes imperative, since they do more harm than good to websites. We compared three classification
algorithms: Naïve Bayes, Support Vector Machine, and K-Nearest Neighbor to find which classifier
performs the best to distinguish between bots and humans. A link obfuscation technique was employed
in order for the system to have an additional attribute to distinguish bots from humans.
Keywords: classifier, bot, web security, spam
1. Introduction
‘Bots’, or software that surfs the Internet on behalf of a person, have shown increasing prevalence
on the Internet in recent years. A search engine is an example of a bot that harvests data from the
Internet to serve the human need for information. Not all bots exist to please the stakeholders of the
Internet, however. They consume bandwidth, their behavior may be outside the remit of what an
Internet service is meant to be used for, and their objectives may financially benefit one party at the
expense of others.
Many researchers have aimed to classify the actors that make up network traffic into bots and
humans. These include identifying who the players were in online games [1], detecting click fraud in
online advertising websites [2], and identifying ‘spambots’ [3]. Spambots are bots that post messages in
forums for the purpose of advertising a product or another web link regardless of the interest of the
readers.
In this paper, we detail our methodology for separating spambots from humans and search
engine bots. The motivation for this research was the need to guard our website (this is briefly
explained in section 4 of this paper). The detailed design and function of our website have already
been published [4]. As we aimed to invite considerable traffic to the website, we realized through analysis
that, among other possible attacks, it would mostly be vulnerable to spambots. However, it would be
unwise to ban all bots, since our website could be made popular through the work of search engine
bots. Mimicking the currently available techniques for identifying spambots was not straightforward:
the technique described in [3] aimed for fast detection of bots but did not suit our need, as we were
aiming for accuracy. Therefore, our study used three classifiers and selected the one with the highest accuracy
measure, as discussed in section 5.2, in preference to the approach described in [5], which used only one classifier.
The remainder of the paper is arranged as follows. Section 2 describes some classification
techniques. Section 3 explains the normalization method while section 4 describes the nature of the
website we aimed to guard, and defines the problem in formal notations. Section 5 details the
methodology, while section 6 presents the results, which are further discussed in section 7. We
conclude the paper in section 8.
2. Classification techniques
A classifier is a function that accepts input values (features) and predicts which class the data
belong to. A classifier must first be trained on a set of labeled data (the training set) [6]. The classifier
learns from the training set so that it can classify unseen data (the test set). If it can
generalize from the training set to correctly classify the test set, the classifier is considered good. When
training a classifier, the challenge is to avoid overfitting and underfitting. Overfitting happens when
the classifier model is so complex that it describes ‘noise’. On the other hand, underfitting occurs
when the classifier is too simple to model a complex data set. Overfitting is undesirable because unseen
data will not conform to the classifier’s prediction [7]. One symptom of overfitting is that the
classifier performs well on the training set but not on the test set. A few factors that contribute to
overfitting [7][8] are small training set sizes, noisy training sets, large feature sizes, and incorrect
parameter selections. One way to detect and reduce overfitting is cross-validation.
We considered three classification techniques for our purposes: Naïve Bayes, Support Vector
Machine (SVM), and K-Nearest Neighbor (KNN).
2.1 Naïve Bayes
The Naïve Bayes classifier is a classification method based on Bayes' theorem. The main idea of
the Bayesian approach is to calculate how a belief should change given new evidence. Bayes'
rule can be used to calculate [9]:
$$P(C = c \mid X = x) = \frac{P(C = c)\, P(X = x \mid C = c)}{P(X = x)}$$
Here $C$ denotes the class of an observed instance and $c$ a particular class label, whereas
$X = \langle X_1, X_2, \dots, X_n \rangle$ represents the observed attributes and
$x = \langle x_1, x_2, \dots, x_n \rangle$ a particular observed attribute value vector. Given that
the denominator is invariant across classes (i.e. it does not depend on $c$), it is
effectively constant. Hence we can express the rule as:
$$P(C = c \mid X = x) \propto P(C = c)\, P(X = x \mid C = c)$$
The training data are used to estimate the probabilities $P(C = c)$ and $P(X = x \mid C = c)$. However,
we will not be able to estimate $P(X = x \mid C = c)$ directly when the feature vector is high-dimensional.
Therefore simplifications are commonly used, such as assuming that the features are independent given
a class. This is expressed as follows:
$$P(X = x \mid C = c) = \prod_{i=1}^{n} P(X_i = x_i \mid C = c)$$
Here, we assume that the occurrence of a particular observed value $x_i$ is statistically independent of
the occurrence of any other $x_j$ ($j \neq i$), given a class $c$. Under this assumption, we can typically model
$P(X = x \mid C = c)$ with relatively few parameters.
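As an illustration of the independence assumption, the following is a minimal sketch of a Naïve Bayes classifier over categorical features. It is not the implementation used in this study (the experiments were run in Weka); the function names, the toy data, and the add-one (Laplace) smoothing are assumptions made only for this example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Collect the counts needed for P(C=c) and P(X_i=x_i | C=c)."""
    class_counts = Counter(labels)
    # value_counts[c][i][v] = number of class-c rows whose feature i equals v
    value_counts = defaultdict(lambda: defaultdict(Counter))
    feature_values = defaultdict(set)
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            value_counts[c][i][v] += 1
            feature_values[i].add(v)
    return class_counts, value_counts, feature_values


def predict_naive_bayes(model, row):
    """Return the class with the largest log P(C=c) + sum_i log P(X_i=x_i | C=c)."""
    class_counts, value_counts, feature_values = model
    total = sum(class_counts.values())
    best_class, best_score = None, -math.inf
    for c, n_c in class_counts.items():
        score = math.log(n_c / total)
        for i, v in enumerate(row):
            # Add-one smoothing (an assumption of this sketch) avoids log(0)
            numerator = value_counts[c][i][v] + 1
            denominator = n_c + len(feature_values[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_class, best_score = c, score
    return best_class


# Hypothetical toy data: (posting volume, reads thread first); +1 = bot, -1 = human
rows = [("many", "no"), ("many", "no"), ("few", "yes"), ("few", "yes")]
labels = [1, 1, -1, -1]
model = train_naive_bayes(rows, labels)
prediction = predict_naive_bayes(model, ("many", "no"))
```

Working in log space turns the product over features into a sum, which avoids numerical underflow when the number of features is large.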
2.2 Support Vector Machine (SVM)
Support Vector Machine (SVM) is a kernel-based classifier, which calculates dot products of the
data [10]. Given some training samples $(x_i, y_i)$, $i = 1, \dots, m$, with $x_i \in \mathbb{R}^n$,
the label $y_i \in \{-1, +1\}$ indicates the class to which
the feature vector $x_i$ belongs. SVM outputs a separating hyperplane with a maximum margin in the
higher-dimensional feature space induced by a kernel function $k(x, z)$ [11].
There are four basic kernels in SVM: linear, polynomial, radial basis function (RBF), and sigmoid.
Each kernel has different parameters that can be tuned to improve the classifier’s performance [12].
Polynomial, RBF, and sigmoid are examples of nonlinear kernels. Among these three
nonlinear kernels, RBF is reported to perform better than polynomial or sigmoid [12]. Hence we
adopted RBF as the SVM kernel for our study. The RBF kernel can be written as follows [11]:
$$k(x, z) = \exp\left(-\gamma \lVert x - z \rVert^2\right)$$
Here, $\gamma$ (gamma) is the kernel parameter which determines the RBF width. As shown later in section 6.2, we
tried gamma values of $2^{-5}, 2^{-4}, 2^{-3}, 2^{-2}, 2^{-1}, 2^{0}, 2^{1}, 2^{2}, 2^{3}$ and selected the single best value.
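The kernel and the gamma grid can be sketched in a few lines. This is an illustrative sketch only, assuming the common RBF form exp(-gamma * ||x - z||^2); the function name `rbf_kernel` and the variable `gamma_grid` are hypothetical and do not come from Weka.

```python
import math


def rbf_kernel(x, z, gamma):
    """RBF kernel value for two equal-length feature vectors."""
    squared_distance = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * squared_distance)


# The nine candidate gamma values 2^-5, 2^-4, ..., 2^3 tried in this study
gamma_grid = [2.0 ** exponent for exponent in range(-5, 4)]

# Identical vectors always give a kernel value of 1; distant vectors approach 0
same = rbf_kernel([0.5, -1.0], [0.5, -1.0], gamma=0.25)
far = rbf_kernel([1.0, 1.0], [-1.0, -1.0], gamma=0.25)
```

A larger gamma makes the kernel value fall off faster with distance, so each training sample influences a smaller neighbourhood.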
2.3 K-Nearest Neighbor (KNN)
K-Nearest Neighbor (KNN) is a classifier that works by looking at the neighbors of the data. The k
value denotes how many neighbors are considered when making a classification. The value of k
needs to be selected carefully: a small k makes the result more sensitive to noise, while a large k
is computationally expensive. As shown later in section 6.3, we experimented with k values that
ranged from 1 to 10 and selected the single best value.
The operation consists of two steps: first, a distance metric is used to choose the nearest neighbors.
Second, weighted voting is done among the neighbors to determine the class label [13]. Two
important parameters for KNN are the value of k, which determines the number of nearest neighbors used,
and the distance function. Some known distance metrics are ‘Euclidean’, ‘Manhattan’, ‘Minkowski’,
and ‘Earth Mover’ distances [13]. Our study used the Euclidean distance $d(x, z)$ between
instances $x$ and $z$ [14]:
$$d(x, z) = \sqrt{\sum_{i=1}^{n} (x_i - z_i)^2}$$
where $x_i$ and $z_i$ are particular observed values of the instances $x$ and $z$.
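The two steps can be sketched as follows. This is a simplified illustration rather than the Weka implementation used in the study: it uses an unweighted majority vote instead of weighted voting, and the function names and toy data are assumptions.

```python
import math
from collections import Counter


def euclidean(x, z):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((xi - zi) ** 2 for xi, zi in zip(x, z)))


def knn_predict(train_x, train_y, query, k):
    """Step 1: pick the k nearest training points. Step 2: majority vote."""
    ranked = sorted(zip(train_x, train_y),
                    key=lambda pair: euclidean(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-feature toy data: +1 = bot, -1 = human
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = [-1, -1, -1, 1, 1, 1]
near_humans = knn_predict(train_x, train_y, (0.2, 0.1), k=3)
near_bots = knn_predict(train_x, train_y, (5.5, 5.5), k=3)
```

Every prediction scans the whole training set, which is why the cost of KNN grows with the amount of stored data.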
3. Normalization method
In machine learning, a feature vector is an n-dimensional vector that contains numerical
measurements. Feature vectors usually exist in a high-dimensional space, and their components
are often not in the same range [15]. To equalize the ranges of the features, feature vector
normalization is required so that the features can contribute equally to the classification process.
Equal contribution of the features is essential when performing classification. However, we noted
that some classification studies did not apply normalization methods [5][16]. Our study differs in that
we normalize our attributes in order to have equal contribution of the features.
The normalization procedure can be employed in different ways, depending on the characteristics of
the data set. A common normalization procedure is to transform the feature component $x$ to a random
variable $x'$ with zero mean and unit variance by:
$$x' = \frac{x - \mu}{\sigma}$$
where $\mu$ is the sample mean and $\sigma$ is the sample standard deviation of the feature [17]. Another
common normalization method is the removal of out-of-range records, assuming that the outliers can
be eliminated [17]. However, applying this method could lead to a loss of information from the
collected data.
In this study, we modify the normalization procedure proposed by [15]. This approach normalizes the
data by using the following procedure:
$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
Here, $x'$ is the normalized attribute value, while $x_{\max}$ and $x_{\min}$ denote the maximum and the minimum
values of the attribute respectively. In the above method, the attribute data is scaled to fit in the [0,1] range,
while in our study we intend to have the data set in the [-1,1] range. The normalization procedure we
applied is as follows:
$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \left(r_{\max} - r_{\min}\right) + r_{\min}$$
where $r_{\max}$ is the upper bound and $r_{\min}$ is the lower bound of the range, in our case +1 and -1 respectively.
We normalized the data within this range because we wanted to classify our data into
bots or humans, represented by +1 and -1 respectively, as explained in the following section.
In addition, in a data set with different ranges and units of measurement, attributes with large
ranges contribute more to a direct measurement than those with small ranges. Therefore, in that
situation, we should not run the classification process without first normalizing the attributes;
otherwise one particular attribute could dominate the others.
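The scaling of one attribute to the [-1, 1] range can be sketched as below. This is a minimal sketch, not the exact preprocessing code of the study; the function name `scale_to_range` and the handling of a constant attribute are assumptions. The example values use the observed bounds of the Average Navigation Time attribute reported in Table 1.

```python
def scale_to_range(values, lower=-1.0, upper=1.0):
    """Min-max scale a list of attribute values into [lower, upper]."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # Constant attribute: mapping it to the midpoint is an assumption
        return [(lower + upper) / 2.0 for _ in values]
    span = v_max - v_min
    return [(v - v_min) / span * (upper - lower) + lower for v in values]


# Minimum, midpoint, and maximum of the range [0.25, 137.5]
navigation_times = [0.25, 68.875, 137.5]
scaled = scale_to_range(navigation_times)
```

The minimum of the attribute maps to -1, the maximum to +1, and every other value lands proportionally in between, regardless of the attribute's original unit.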
4. Problem definition
The website [4] we aimed to guard against bots contains public transportation routes of a
metropolitan city that can be edited by registered users. It is a wiki-like system in which a user can add
a route and add bus stations on a map. These can then be edited by other users for the purpose of
making the information more reliable. The website also has a forum where users can post messages.
The website was vulnerable against bots that could register as new users, spread off-topic messages
in the forum, or supply misguided information to the established map. Abuse by humans was also
possible, but that was not the concern of this study. We were looking for ways to separate bots from
humans.
Thus, this was a binary classification task with bots and humans as the classes. It resembled the
spam classification problem discussed in [18]. Let $U$ be the data set of website visitors or users, with
$u_1$ as the first visitor, $u_2$ as the second visitor and so on:
$$U = \{u_1, u_2, \dots, u_{|U|}\}$$
Suppose $C$ is the set of classes, with $c_1$ as bots and $c_2$ as humans:
$$C = \{c_1, c_2\}$$
Each website visitor can be assigned to only one class; for example, if $u_1$ is a
human, then it is not a bot. Thus, the decision function is defined as:
$$\phi(u_i, c_j): U \times C \rightarrow \{+1, -1\}$$
It can be simplified as:
$$\phi(u_i): U \rightarrow \{+1, -1\}$$
If $\phi(u_i)$ is +1, then $u_i$ is a bot user. If $u_i$ is a human user, $\phi(u_i)$ is -1.
5. Methodology
We captured the navigation behavior of each user visiting the website. We mimicked the
technique described in [5], in which a set of actions was recorded. For example, a click by a user to
browse a message was considered one action. The set of actions acted as input vectors for the
classifiers. Our input vectors consisted of 68 action frequencies and 1 average navigation time. This is
illustrated in Figure 1.
Figure 1. Input vectors
We observed that a bot was programmed to register as a new user and post a lot of spam within a
short amount of time without having to first read the topic of the thread. Humans on the other hand,
read messages from one thread to another, and might post a message. Hence identifying bots or humans
was possible. To further strengthen our detection technique, we employed a link obfuscation service
[19]. The approach was to place many decoy links on the web pages in a way that they were invisible
to humans. For example, links that were colored the same as their background and links that were
placed at the far edge of the page were very unlikely to entice humans to click on them. Although it
was nevertheless possible for humans to click these invisible links, frequent and indeterminate visits of
the links were likely done by automated mechanisms.
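To make the input vector concrete, the following sketch builds one from a single session log. It is only an illustration: the study used 68 action frequencies, whereas the short action list here reuses four attribute names from Table 1, and the session format and the definition of average navigation time as the mean gap between consecutive actions are assumptions of this sketch.

```python
from collections import Counter

# Hypothetical subset of the 68 recorded actions (names taken from Table 1)
ACTIONS = ["Forum FAQ", "View User Group", "Post New Topic", "Add Place"]


def build_input_vector(session, actions=ACTIONS):
    """session: list of (action_name, timestamp_in_seconds) in time order."""
    counts = Counter(name for name, _ in session)
    frequencies = [counts.get(name, 0) for name in actions]
    times = [timestamp for _, timestamp in session]
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    # Assumed definition: mean time between consecutive actions
    average_time = sum(gaps) / len(gaps) if gaps else 0.0
    return frequencies + [average_time]


# A hypothetical session that posts twice and then adds a place
session = [("Post New Topic", 0.0), ("Post New Topic", 2.0), ("Add Place", 6.0)]
vector = build_input_vector(session)
```

A click on a decoy link could be counted in the same way, by adding it as one more entry in the action list.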
5.1. Data collection
We surveyed the web navigation log from our website, of which we collected 100 human navigation
logs and 100 bot navigation logs. In order to get human navigation logs, we asked anonymous
volunteers, whom we contacted mostly through a social networking site, a forum, and some social
bookmarking sites, to participate in trying to use the web application.
Collecting unsolicited bot navigation logs was arduous. Initially we only recorded 2 bot visits after
one month. We could identify these bots through observing their post content and the time needed by
those two to accomplish all of their actions in the website. In order to be able to have the desired
amount of data, we decided to create our own bots. We deployed these bots and observed their behavior
while they traversed our website; this was a common practice that other studies implemented [19] [20].
We deployed six bots, four of which were built using LWP (the World-Wide Web
library for Perl) and two of which were third-party software. The first four were created to simulate
random clicking behavior. The two third-party software bots were deployed to simulate spamming
activities.
The bots that we built using LWP are detailed as follows, with behaviors that were inspired by
another bot [21]. Bot A clicks links randomly, and stops when it finds a dead link. Bot B is a slightly
modified version of bot A. When it clicks on a dead link, it navigates back starting from the previous
link. Bot C does not only navigate back under this situation, but also blacklists the dead link so that it
will not be clicked again. Bot D determines the link’s visibility to humans. It checks the style attribute
of the links and parses them to calculate the link positions. It clicks links that it categorizes as visible,
and otherwise blacklists the invisible links.
The third-party products that we used were IIP WebBot and XRumer. IIP WebBot is an application
to help people automate repetitive tasks. It allows its users to train the bot before it performs the
intended routine tasks. This bot is suitable for our purpose since it records navigation data, such as
clicks on a link or posting a message in a forum. The other bot, XRumer, is essentially a Search Engine
Optimization tool used to improve the page rank of a website. It is capable of avoiding spam detection
by registering itself as a user who asks a question, as well as another user who answers the question by
referring to a link or product name.
5.2. Performance measurement
In order to compare the three classifiers (NB, SVM, and KNN), we chose accuracy [22] and the
Matthews Correlation Coefficient (MCC) [23] as the methods to measure the performance of the
classifiers. Accuracy is commonly used for measurement as it provides basic information about how
well a classifier performs. However, accuracy does not tell us whether the predictions made by the
classifier are really based on observations or just random guesses [24]. This is where MCC comes in
handy. This performance measurement has a range between +1 and -1, where +1 means the prediction
and the observation are perfectly related, 0 means a random prediction, and -1 means the prediction
contradicts the observation. In comparison, accuracy does not look into class distributions and may be
biased if the data is unbalanced in terms of the number of class instances.
Other performance measurements exist, such as precision and recall. They are not suitable for our
case since they only look at positive cases. MCC takes both true positives and true negatives into
account as correct predictions [25].
Accuracy was defined as follows [22]:
$$\text{Accuracy} = \frac{\text{number of correctly classified instances}}{\text{total number of instances}}$$
or
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
where:
TP was True Positives: the number of bot instances that were correctly classified as bots,
i.e., the predicted label was +1 when +1 was expected.
TN was True Negatives: the number of human instances that were correctly classified as
human, i.e., the predicted label was -1 when -1 was expected.
FP was False Positives: the number of human instances that were classified as bots, i.e.,
the predicted label was +1 when -1 was expected.
FN was False Negatives: the number of bot instances that were classified as human, i.e.,
the predicted label was -1 when +1 was expected.
MCC was defined as follows [23]:
$$\text{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

6. Results
Since we provided a balanced data set, i.e., 100 human navigation logs and 100 bot navigation logs,
the performance measurement using accuracy was representative (unbiased).
To assess the performance of the three candidate classifiers, we used k-fold cross validation with k =
10. That is, we divided the data into 10 equal parts and, in each of 10 rounds, used nine parts (90% of
the data) for training and the remaining part for testing. The value of
k = 10 is considered by general consensus in the data mining community to be a good compromise
[26]. This 10-fold cross validation task was executed using Weka, software that provides data
mining tools and algorithms such as NB, SVM, and KNN. Before we fed these classifiers with the data,
we varied some independent variables to select optimum parameters.
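The splitting step of 10-fold cross validation can be sketched as below. This is not how Weka implements it internally; the function name `k_fold_indices`, the fixed shuffling seed, and the round-robin assignment of samples to folds are assumptions made for illustration.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs, one pair per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (nearly) equal size
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 200 navigation logs split into 10 folds: 180 for training, 20 for testing
splits = list(k_fold_indices(200, k=10))
```

Each log appears in exactly one test fold, so every instance is used for testing once and for training nine times.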
In addition, all of the attributes were initially normalized before the data was processed.
Table 1 shows the range of several attributes before normalization was applied:
Table 1. Range of several attributes before normalization
Attribute                  Range
Forum FAQ                  [0, 30]
View User Group            [0, 2]
Post New Topic             [0, 10]
Add Place                  [0, 6]
Average Navigation Time    [0.25, 137.5]
The normalization procedure approximately equalized the range of all attributes to [-1, 1].
This was required in the data preprocessing as our input vectors consisted of 68 action
frequencies and 1 average navigation time. These vectors had different units of measurement,
such as frequency and number of seconds. Thus, the attributes resulted in different ranges as
shown above. By applying the normalization technique, all attributes had an opportunity to
contribute equally in the classification process and avoid domination of one particular attribute.
6.1. Naïve Bayes
Table 2 shows that the classifier’s accuracy was 0.910, given that it correctly predicted 182 out of
200 instances in total. The MCC value was 0.834.
Table 2. Performance result of Naïve Bayes
TP FP TN FN Accuracy MCC
82 0 100 18 0.91 0.834
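The figures in Table 2 can be checked directly from the four counts. The sketch below recomputes them with the standard definitions of accuracy and MCC; the function names are hypothetical, and returning 0 when the MCC denominator is zero is a common convention assumed here rather than something stated in this paper.

```python
import math


def accuracy(tp, fp, tn, fn):
    """Fraction of instances that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient from the four confusion counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumed convention when a row or column sum is zero
    return (tp * tn - fp * fn) / denominator


# Naive Bayes counts from Table 2: TP = 82, FP = 0, TN = 100, FN = 18
nb_accuracy = round(accuracy(82, 0, 100, 18), 3)  # 182 / 200
nb_mcc = round(mcc(82, 0, 100, 18), 3)
```

Both values agree with Table 2: 0.91 for accuracy and 0.834 for MCC.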
6.2. Support Vector Machine (SVM)
As we used the RBF kernel for the SVM, the parameter that we controlled was the gamma value,
while the classification accuracy was the dependent variable. Table 3 shows the measured accuracy
with gamma values ranging from $2^{-5}$ to $2^{3}$. Gamma = $2^{-2}$ led to the highest
accuracy, 0.945. Thus, the row for gamma = $2^{-2}$ supplied the components needed to calculate the MCC
value. We obtained MCC = 0.894.
Table 3. Performance result of SVM
Gamma   TP    FP   TN   FN   Accuracy   MCC
2^-5    99    18   82   1    0.905      -
2^-4    100   20   80   0    0.900      -
2^-3    99    13   87   1    0.930      -
2^-2    99    10   90   1    0.945      0.894
2^-1    98    12   88   2    0.930      -
2^0     96    13   87   4    0.915      -
2^1     93    22   78   7    0.855      -
2^2     95    32   68   5    0.815      -
2^3     95    40   60   5    0.775      -
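The selection of gamma amounts to picking the row of Table 3 with the highest accuracy. The sketch below replays that selection on the reported counts; the variable and function names are hypothetical, and the counts are copied from Table 3 rather than produced by retraining an SVM.

```python
# (exponent e, TP, FP, TN, FN) for gamma = 2**e, copied from Table 3
table3 = [
    (-5, 99, 18, 82, 1),
    (-4, 100, 20, 80, 0),
    (-3, 99, 13, 87, 1),
    (-2, 99, 10, 90, 1),
    (-1, 98, 12, 88, 2),
    (0, 96, 13, 87, 4),
    (1, 93, 22, 78, 7),
    (2, 95, 32, 68, 5),
    (3, 95, 40, 60, 5),
]


def row_accuracy(row):
    """Accuracy of one table row: (TP + TN) / total."""
    _, tp, fp, tn, fn = row
    return (tp + tn) / (tp + fp + tn + fn)


best_row = max(table3, key=row_accuracy)
best_gamma = 2.0 ** best_row[0]
best_accuracy = round(row_accuracy(best_row), 3)
```

The same one-parameter search applies to the k value of KNN, with k in place of the gamma exponent.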
6.3. K-Nearest Neighbor (KNN)
The performance of a KNN classifier depends on the k value, i.e. the number of nearest neighbors to
be considered, and the distance metric. For this research we used the Euclidean distance metric. The
independent variable was the k value, and the dependent variable was the accuracy.
Table 4. Performance result of KNN
k    TP   FP   TN   FN   Accuracy   MCC
1    92   5    95   8    0.935      -
2    98   8    92   2    0.950      0.902
3    93   5    95   7    0.940      -
4    96   7    93   4    0.945      -
5    93   7    93   7    0.930      -
6    94   8    92   6    0.930      -
7    93   7    93   7    0.930      -
8    96   7    93   4    0.945      -
9    95   7    93   5    0.940      -
10   95   9    91   5    0.930      -
Table 4 summarizes experiment results on various k values, ranging from 1 to 10, on the KNN
classifier. The optimal parameter was achieved with k = 2 which resulted in accuracy = 0.950. We used
the data from this row, and obtained MCC = 0.902.
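The three classifiers can be put side by side using the best row reported for each. The sketch below recomputes accuracy and MCC from those counts and ranks the classifiers; it only replays the reported figures, and the dictionary layout and names are assumptions of this example.

```python
import math


def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient from the four confusion counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator


# (TP, FP, TN, FN) of the best configuration per classifier (Tables 2 to 4)
best_rows = {
    "NB": (82, 0, 100, 18),
    "SVM": (99, 10, 90, 1),   # gamma = 2^-2
    "KNN": (98, 8, 92, 2),    # k = 2
}

summary = {}
for name, (tp, fp, tn, fn) in best_rows.items():
    acc = round((tp + tn) / (tp + fp + tn + fn), 3)
    summary[name] = (acc, round(mcc(tp, fp, tn, fn), 3))

# Rank by accuracy first, then by MCC
ranking = sorted(summary, key=lambda name: summary[name], reverse=True)
```

Both measures give the same ordering here, with KNN first, SVM second, and Naïve Bayes third.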
7. Discussion
Our results showed that KNN outperformed the other classifiers. This agrees with a previous work
which showed that KNN is superior over SVM and NB [27]. The difference was that our work
classified the user behaviors of a website, while the previous study worked on classifying texts.
We further observed that bots were more attracted to higher-traffic websites. In order to explain this,
we explain our data collection method as follows. We deployed our website into 4 different web
hosting sites, namely host A, host B, host C, and host D. Host A was made popular through our
announcements on Facebook and Kaskus (www.kaskus.us) – probably the most popular forums in
Indonesia. We observed that host A was visited by search engine spiders, bots, and humans. Host B was
visited by only search engine spiders, and it was made popular through our announcements on social
bookmarking sites. We did not make any effort to announce the websites deployed on both host C and
host D, and no navigation data was recorded.
Our observed bots had a navigation time of between 0 and 6 seconds. Readers may question how we
distinguished spamming bots and search engine spiders, since they are both bots in nature. Our
explanation is that ethical spiders register their IP address. For example, one of the IP addresses in our
log, 66.249.71.42, was listed as a Google bot in [28].
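Such a lookup can be sketched with Python's standard `ipaddress` module. This is only an illustration: the set below holds the single address quoted above, whereas a real deployment would load a maintained list of registered spider addresses, and the names `KNOWN_SPIDER_IPS` and `is_registered_spider` are hypothetical.

```python
import ipaddress

# Assumed lookup table; only the address quoted in the text is included
KNOWN_SPIDER_IPS = {ipaddress.ip_address("66.249.71.42")}


def is_registered_spider(address, known=KNOWN_SPIDER_IPS):
    """Return True if the visitor's IP address is on the spider list."""
    return ipaddress.ip_address(address) in known


google_hit = is_registered_spider("66.249.71.42")
unknown_hit = is_registered_spider("192.0.2.1")  # documentation-only address
```

Visitors that match the list can be excluded before classification, so that only the remaining traffic is labelled as spambot or human.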
Spamming bots register themselves as a user. Figure 2 shows our log which reveals that a bot
registered itself as “gemenero” and carried out a total of 43 actions (1 + 15 + 21 + … + 1) with an
average navigation time of 5.7 seconds per action.
Figure 2. A navigation log of a bot
We learned from the log that the bot posted a message in a forum. Upon reading the message, we
evaluated that its content was not in accordance with the topic of the forum. Figure 3 shows this
message.
Figure 3. A message posted by a spamming bot
It may not be easy for the readers to tell if the message shown in Figure 3 was spam or not, since it
requires the readers to understand the topic (and it is impossible to display the whole thread in this
space). This fact shows that human intervention is still required to train the classifiers. A hybrid that
uses human assistance to automate classification is generally called a cyborg [29].
The classifier’s model was validated with 10-fold cross validation. It had not been evaluated with
new data separate from the test data. In other words, our results were obtained based on lab data: all
human navigation logs were obtained from volunteers who were asked to accomplish certain jobs
rather than posting on their own volition; only 2% of bot navigation data was obtained from bots
captured ‘from the wild’. Hence some cyborg interventions will be needed to make our website
immune from unwanted bot activities.
8. Conclusion and future works
Our results showed that we should use KNN as our classifier for our website. We believe that
human intervention will be necessary to maintain a desirable service of the website for several reasons.
First, although we had chosen a k value that returned the highest performance, periodic supervised
training will be required to adjust the independent variable of the classifier. Second, bot authors get
smarter. Determined attackers can learn which bots are the fittest, read related publications, and create
bots that bypass our detection mechanism. A seemingly secure website will likely become vulnerable at
a later date if left unmonitored.
In order to stay ahead of the pack, we intend to anticipate when such smarter bots will exist. For
example, the bots that we created for this paper were meant to test the link obfuscation technique.
However, further observation will be required before we can conclude which criteria
caused some bots to be classified as humans.
Furthermore, we propose to build an adaptive mechanism in the future to minimize human
intervention as we described above.
9. References
[1] P. Hingston, "A Turing test for computer game bots," IEEE Trans. Computational Intelligence and
AI in Games, vol. 1, no. 3, pp. 169-186, Sep. 2009.
[2] L. Zhang and Y. Guan, "Detecting click fraud in pay-per-click streams of online advertising
networks," in 28th Int. Conf. Distributed Computing Systems, 2008, pp. 77-84.
[3] P. Hayati, V. Potdar, A. Talevski, and W. F. Smyth, "Rule-based on-the-fly web spambot detection
using action string," in 7th Annu. Collaboration, Electronic messaging, Anti-Abuse and Spam
Conf., Redmond, 2012.
[4] E. Adi and S. A. Ningsih, "Ubiquitous public transportation route guide for a developing country,"
in Proc. IEEE 24th Int. Conf. Advanced Information Networking and Applications Workshops,
Perth, 2010, pp. 263-268.
[5] P. Hayati, V. Potdar, K. Chai, and A. Talevski, "Web spambot detection based on web navigation
behaviour," in 24th IEEE Conf. Advanced Information Networking and Applications, Perth, 2010,
pp. 797-803.
[6] F. Pereira, T. Mitchell, and M. Botvinick, "Machine learning classifiers and fMRI: a tutorial
overview," NeuroImage, no. 45, pp. S199-S209, Nov. 2008.
[7] W. Sarle. (2011, November) What is overfitting and how can I avoid it? [Online]. Available:
http://www.faqs.org/faqs/ai-faq/neural-nets/part3/section-3.html#b
[8] M. A. Babyak, "What you see may not be what you get: A brief, nontechnical introduction to
overfitting in regression-type models," Psychosomatic Medicine, no. 66, pp. 411-421, 2004.
[9] Y. Yang and G. Webb, "Non-disjoint discretization for Naive-Bayes classifiers," in Proc. Int.
Conf. Machine Learning, 2002, pp. 666-673.
[10] A. Ben-Hur and J. Weston, "A User's Guide to Support Vector Machines," Methods in Molecular
Biology, vol. 609, pp. 223-239, 2010.
[11] H. Cao, T. Naito, and Y. Ninomiya, "Approximate RBF kernel SVM and its applications in
pedestrian classification," in 1st Int. Workshop on Machine Learning for Vision-based Motion
Analysis, 2008.
[12] C. W. Hsu, C. C. Chang, and C. J. Lin. (2010, Apr 15). A Practical Guide to Support Vector
Classification [Online]. Available: http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf.
[13] P. Cunningham and S. J. Delany, "k-Nearest Neighbour Classifiers," University College Dublin,
Dublin, Tech. Rep. UCD-CSI-2007-4, 2007.
[14] L. Jiang et al., "Survey of improving k-nearest-neighbor for classification," in 4th Int. Conf. Fuzzy
Systems and Knowledge Discovery, 2007, pp. 679 - 683.
[15] S. Aksoy and R. Haralick, "Feature normalization and likelihood-based similarity measures for
image retrieval," Pattern Recognition Letters, vol. 22, pp. 563–582, 2001.
[16] E. Frank, M. Hall, and B. Pfahringer, "Locally weighted naive Bayes," Conf. Uncertainty in
Artificial Intelligence, 2003, pp. 249-256.
[17] D. T. Pham, Y. I. Prostov, and M. M. Suarez-Alvarez, "Statistical approach to numerical databases
clustering using Normalized Minkowski Metrics," in 2nd I*PROMS Virtual International
Conference, 2006, pp. 356-361.
[18] L. Zhang, J. Zhu, and T. Yao, "An Evaluation of Statistical Spam Filtering Techniques," ACM
Transactions on Asian Language Information Processing, vol. 3, no. 4, pp. 243-269, Dec. 2004.
[19] D. Brewer, K. Li, L. Ramaswamy, and C. Pu, "A link obfuscation service to detect webbots," in
IEEE Int. Conf. Services Computing, Miami, 2010, pp. 433-440.
[20] K. T. Chen et al., "Identifying MMORPG bots: A traffic analysis approach," EURASIP J.
Advances in Signal Processing, 2009.
[21] N. Daswani and M. Stoppelman, "The anatomy of Clickbot.A," in Proc. 1st Workshop Hot Topics
in Understanding Botnets, Cambridge, 2007, pp. 11-21.
[22] JCGM 200:2008. (2008). Bureau International des Poids et Mesures [Online]. Available
http://www.bipm.org/utils/common/documents/jcgm/JCGM_200_2008.pdf
[23] B. W. Matthews, "Comparison of the predicted and observed secondary structure of T4 phage
lysozyme," Biochimica et Biophysica Acta (BBA) - Protein Structure, vol. 405, no. 2, pp. 442-451,
1975.
[24] P. Baldi et al. "Assessing the accuracy of prediction algorithms for classification: an overview,"
Bioinformatics, vol. 16, no. 5, pp. 412-424, Feb. 2000.
[25] O. Carugo, "Detailed estimation of bioinformatics prediction reliability through the Fragmented
Prediction Performance Plots," BMC Bioinformatics, vol. 8, Oct. 2007.
[26] P. Refaeilzadeh, L. Tang, and H. Liu, "Cross-Validation," in Encyclopedia of Database Systems,
M. Tamer Özsu and L. Liu, Ed. Springer, 2009, pp. 532-538.
[27] F. Colas and P. Brazdil, "Comparison of SVM and some older classification algorithms in text
classification tasks," IFIP International Federation for Information Processing, vol. 217, pp. 169-
178, 2006.
[28] D. Kramer. (2011). IP Addresses of Search Engine Spiders. [Online]. Available
http://www.iplists.com/
[29] Z. Chu, S. Gianvecchio, H. Wang, and S. Jajodia, "Detecting automation of Twitter accounts: Are
you a human, bot, or cyborg?," IEEE Trans. Dependable and Secure Computing, vol. 9, no. 6, pp.
811-824, Dec. 2012.
... Regarding application of ML techniques to bot detection, most approaches exploit supervised learning. The following methods have been investigated: decision trees, association rule mining, k-Nearest Neighbors, support vector machine, artificial neural network, Bayesian classifiers, and ensemble methods [20], [21], [22], [23], [13], [24], [3], [25], [26], [27], [28]. Most of these approaches address the problem of binary classification of user sessions (bots vs. humans). ...
Article
Full-text available
Recent studies reported that about half of Web users nowadays are intelligent agents (Web bots). Many bots are impersonators operating at a very high sophistication level, trying to emulate navigational behaviors of legitimate users (humans). Moreover, bot technology continues to evolve which makes bot detection even harder. To deal with this problem, many advanced methods for differentiating bots from humans have been proposed, a large part of which relies on supervised machine learning techniques. In this paper, we propose a novel approach to identify various profiles of bots and humans which combines feature selection and unsupervised learning of HTTP-level traffic patterns to develop a user session classification model. Session clustering is performed with the agglomerative Information Bottleneck (aIB) algorithm, as well as with some other reference algorithms. The model is then used to classify new sessions to one of the profiles and to label the sessions as performed by bots or humans. An extensive experimental study, based on real server log data, demonstrates the ability of aIB clustering to distinguish user profiles and confirms high performance of the classification model in terms of accuracy, F1, recall, and precision.
... The ability of the inferred function to determine correct class labels for new, unseen samples is assessed on a test dataset. Many supervised learning techniques demonstrated their efficiency in classification of bots and humans, e.g., decision trees (Gržinić et al., 2015; Kwon et al., 2012; Tan and Kumar, 2002), support vector machine (Gržinić et al., 2015; Jacob et al., 2012), neural networks (Bomhardt et al., 2005), and k-Nearest Neighbours (Stevanovic et al., 2012; Saputra et al., 2013). All supervised learning approaches, however, share a common disadvantage, related to a difficulty with preparation of a reliable training dataset, in particular with assigning accurate class labels to sessions of camouflaged robots. ...
Article
Web traffic on e-business sites is increasingly dominated by artificial agents (Web bots) which pose a threat to the website security, privacy, and performance. To develop efficient bot detection methods and discover reliable e-customer behavioural patterns, the accurate separation of traffic generated by legitimate users and Web bots is necessary. This paper proposes a machine learning solution to the problem of bot and human session classification, with a specific application to e-commerce. The approach studied in this work explores the use of unsupervised learning (k-means and Graded Possibilistic c-Means), followed by supervised labelling of clusters, a generative learning strategy that decouples modelling the data from labelling them. Its efficiency is evaluated through experiments on real e-commerce data, in realistic conditions, and compared to that of supervised learning classifiers (a multi-layer perceptron neural network and a support vector machine). Results demonstrate that the classification based on unsupervised learning is very efficient, achieving a similar performance level as the fully supervised classification. This is an experimental indication that the bot recognition problem can be successfully dealt with using methods that are less sensitive to mislabelled data or missing labels. A very small fraction of sessions remain misclassified in both cases, so an in-depth analysis of misclassified samples was also performed. This analysis exposed the superiority of the proposed approach which was able to correctly recognize more bots, in fact, and identified more camouflaged agents, that had been erroneously labelled as humans.
... Most research in this field has been devoted to telling bots and humans apart with the use of classification techniques, such as Bayesian classifiers [18], [23], decision trees [12], [24], support vector machines [8], association rule mining [10], or ensemble methods [17]. Some studies aimed at comparing the efficiency of various classification algorithms [3], [16], [19]. Furthermore, unsupervised classification techniques revealed a high potential for differentiating between bots and humans [20], [28]. ...
... Thus, programmers and web designers embarked upon the task of improving user authentication, arriving at what today is known as CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart), one of the most popular authentication methods. According to [9], the average time a human takes to solve an OCR-based authentication challenge is 16 seconds; in comparison, the average time a bot takes to produce an attempted solution, which may be right or wrong, is approximately 6 seconds, regardless of the method used [10]. ...
Article
A large fraction of traffic on present-day Web servers is generated by bots — intelligent agents able to traverse the Web and execute various advanced tasks. Since bots’ activity may raise concerns about server security and performance, many studies have investigated traffic features discriminating bots from human visitors and developed methods for automated traffic classification. Very few previous works, however, aim at identifying bots on-the-fly, trying to classify active sessions as early as possible. This paper proposes a novel method for binary classification of streams of Web server requests in order to label each active session as “bot” or “human”. A machine learning approach has been developed to discover traffic patterns from historical usage data. The model, built on a neural network, is used to classify each incoming HTTP request and a sequential probabilistic analysis approach is then applied to capture relationships between subsequent HTTP requests in an ongoing session to assess the likelihood of the session being generated by a bot or a human, as soon as possible. A performance evaluation study with real server traffic data confirmed the effectiveness of the proposed classifier in discriminating bots from humans at early stages of their visits, leaving very few of them undecided, with very low number of false positives.
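The idea of deciding about a session as early as possible can be illustrated with a small Python sketch. The abstract does not spell out its sequential analysis, so the code below substitutes a threshold rule in the spirit of Wald's sequential probability ratio test: it accumulates the log-odds of hypothetical per-request bot probabilities (standing in for the neural network's outputs) and stops once either bound is crossed. The function name, the error rates `alpha` and `beta`, and the use of classifier probabilities as likelihood ratios are all assumptions made for illustration.

```python
import math

def classify_session(request_probs, alpha=0.01, beta=0.01):
    """Label a session from a stream of per-request bot probabilities.

    Assumption: each probability comes from some per-request classifier and
    its odds are treated as a likelihood ratio (i.e. equal class priors).
    Returns (label, number_of_requests_examined).
    """
    upper = math.log((1 - beta) / alpha)   # crossing this bound -> "bot"
    lower = math.log(beta / (1 - alpha))   # crossing this bound -> "human"
    score = 0.0
    for i, p in enumerate(request_probs, start=1):
        p = min(max(p, 1e-6), 1 - 1e-6)    # avoid log(0) and division by zero
        score += math.log(p / (1 - p))     # accumulate the log-odds
        if score >= upper:
            return "bot", i
        if score <= lower:
            return "human", i
    # The stream ended before either bound was reached.
    return "undecided", len(request_probs)

print(classify_session([0.9, 0.9, 0.9, 0.9]))  # ('bot', 3)
```

With these thresholds, three consecutive requests scored at 0.9 are enough to stop, while a session whose scores hover around 0.5 stays undecided.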
Article
The paper concerns the issue of modeling and generating a representative Web workload for Web server performance evaluation through simulation experiments. Web traffic analysis has been carried out for two decades, usually based on Web server log data. However, while the character of the overall Web traffic has been extensively studied and modeled, relatively few studies have been devoted to the analysis of Web traffic generated by Internet robots (Web bots). Moreover, the overwhelming majority of studies concern the traffic on non-e-commerce websites. In this paper we address the problem of modeling a realistic arrival process of bots’ requests on an e-commerce Web server. Based on real log data for an online store, sessions generated by bots were reconstructed and their key features were analyzed, including the interarrival time of bot sessions, the number of HTTP requests per session, and the interarrival time of requests in session. To deal with the problem of non-stationarity of the Web traffic, chunks associated with times of day were distinguished based on the intensity of bot sessions’ arrivals and then features of sessions in individual time chunks were analyzed separately. Using regression analysis, a mathematical model of the bots’ traffic features was developed and implemented in a bot traffic generator. Our findings confirm the existence of a heavy tail in the distributions of bot traffic features. The bots’ session interarrival times and request interarrival times are best modeled by a Weibull and a sigmoid distribution, respectively, while the model proposed for the numbers of requests per bot session is based on a hybrid function combining one exponential and two normal distribution functions. The suitable fit of the model was confirmed by the high correlation of the real and model data. Furthermore, a visual inspection of the simulation results showed that the estimated values represent distributions close to those of the empirical data.
Conference Paper
A large part of Web traffic on e-commerce sites is generated not by human users but by Internet robots: search engine crawlers, shopping bots, hacking bots, etc. In practice, not all robots, especially the malicious ones, disclose their identities to a Web server and thus there is a need to develop methods for their detection and identification. This paper proposes the application of a Bayesian approach to robot detection based on characteristics of user sessions. The method is applied to the Web traffic from a real e-commerce site. Results show that the classification model based on the cluster analysis with the Ward's method and the weighted Euclidean metric is very effective in robot detection, even obtaining accuracy of above 90%.
Article
A significant volume of Web traffic nowadays can be attributed to robots. Although some of them, e.g., search-engine crawlers, perform useful tasks on a website, others may be malicious and should be banned. Consequently, there is a growing need to identify bots and to characterize their behavior. This paper investigates the share of bot-generated traffic on an e-commerce site and studies differences in bots' and humans' session-based traffic by analyzing data recorded in Web server log files. Results show that both kinds of sessions reveal different characteristics, including the session duration, the number of pages visited in session, the number of requests, the volume of data transferred, the mean time per page, the number of images per page, and the percentage of pages with unassigned referrers.
Article
Twitter is a new web application playing dual roles of online social networking and microblogging. Users communicate with each other by publishing text-based posts. The popularity and open structure of Twitter have attracted a large number of automated programs, known as bots, which appear to be a double-edged sword for Twitter. Legitimate bots generate a large amount of benign tweets delivering news and updating feeds, while malicious bots spread spam or malicious content. More interestingly, in the middle ground between human and bot, the cyborg has emerged, referring to either a bot-assisted human or a human-assisted bot. To assist human users in identifying who they are interacting with, this paper focuses on the classification of human, bot, and cyborg accounts on Twitter. We first conduct a set of large-scale measurements with a collection of over 500,000 accounts. We observe the difference among human, bot, and cyborg in terms of tweeting behavior, tweet content, and account properties. Based on the measurement results, we propose a classification system that includes the following four parts: 1) an entropy-based component, 2) a spam detection component, 3) an account properties component, and 4) a decision maker. It uses the combination of features extracted from an unknown user to determine the likelihood of being a human, bot, or cyborg. Our experimental evaluation demonstrates the efficacy of the proposed classification system.
Article
Web spambots are a new type of internet robot that spread spam content through Web 2.0 applications like online discussion boards, blogs, wikis, social networking platforms etc. These robots are intelligently designed to act like humans in order to fool safeguards and other users. Such spam content not only wastes valuable resources and time but also may mislead users with unsolicited content. Spam content typically intends to misinform users (scams), generate traffic, make sales (marketing/advertising), and occasionally compromise parties, people or systems by spreading spyware or malware. Current countermeasures do not effectively identify and prevent web spambots. Proactive measures to deter spambots from entering a site are limited to question / response scenarios. The remaining efforts then focus on spam content identification as a passive activity. Spammers have evolved their techniques to bypass existing anti-spam filters. In this paper, we describe a rule-based web usage behaviour action string that can be analysed using Trie data structures to detect web spambots. Our experimental results show the proposed system is successful for on-the-fly classification of web spambots, hence eliminating spam in Web 2.0 applications.
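To make the Trie idea concrete, here is a minimal Python sketch of storing labelled action sequences in a trie and scoring a new sequence by the deepest prefix it shares with the stored ones. The abstract does not define the action vocabulary or the rules, so the action names, the class names, and the bot-fraction score are hypothetical and chosen only for illustration.

```python
class TrieNode:
    """One action in a stored sequence, with counts of sessions passing through."""
    def __init__(self):
        self.children = {}
        self.bot_count = 0
        self.human_count = 0

class ActionTrie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, actions, is_bot):
        """Insert one labelled session, given as a sequence of action names."""
        node = self.root
        for action in actions:
            node = node.children.setdefault(action, TrieNode())
            if is_bot:
                node.bot_count += 1
            else:
                node.human_count += 1

    def bot_ratio(self, actions):
        """Fraction of bot sessions at the deepest matched prefix, or None."""
        node, matched = self.root, None
        for action in actions:
            node = node.children.get(action)
            if node is None:
                break
            matched = node
        if matched is None:
            return None  # not even the first action has been seen before
        return matched.bot_count / (matched.bot_count + matched.human_count)

# Hypothetical training sessions (action names are made up).
trie = ActionTrie()
trie.add(["visit_home", "register", "post"], is_bot=True)
trie.add(["visit_home", "register", "post"], is_bot=True)
trie.add(["visit_home", "read_thread", "reply"], is_bot=False)

print(trie.bot_ratio(["visit_home", "register", "post"]))  # 1.0
```

Because each step of the walk is a dictionary lookup, a session can be scored incrementally as its actions arrive, which suits on-the-fly use.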
Article
Despite its simplicity, the naive Bayes classifier has surprised machine learning researchers by exhibiting good performance on a variety of learning problems. Encouraged by these results, researchers have looked to overcome naive Bayes' primary weakness, attribute independence, and improve the performance of the algorithm. This paper presents a locally weighted version of naive Bayes that relaxes the independence assumption by learning local models at prediction time. Experimental results show that locally weighted naive Bayes rarely degrades accuracy compared to standard naive Bayes and, in many cases, improves accuracy dramatically. The main advantage of this method compared to other techniques for enhancing naive Bayes is its conceptual and computational simplicity.
Article
Perhaps the most straightforward classifier in the arsenal of Machine Learning techniques is the Nearest Neighbour Classifier: classification is achieved by identifying the nearest neighbours to a query example and using those neighbours to determine the class of the query. This approach to classification is of particular importance, because issues of poor runtime performance are not such a problem these days with the computational power that is available. This article presents an overview of techniques for Nearest Neighbour classification focusing on: mechanisms for assessing similarity (distance), computational issues in identifying nearest neighbours, and mechanisms for reducing the dimension of the data. This article is the second edition of a paper previously published as a technical report [16]. Sections on similarity measures for time-series, retrieval speedup, and intrinsic dimensionality have been added. An Appendix is included, providing access to Python code for the key methods.
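The nearest-neighbour rule described above can be sketched in a few lines of Python. This is a brute-force illustration and not the code from the article's appendix; the function name, the choice of Euclidean distance, and the toy session features are assumptions.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (feature_vector, label) pairs; distance is Euclidean.
    """
    # Sort training examples by distance to the query and keep the k closest.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy sessions: (requests per minute, fraction of requests for images).
train = [
    ((2.0, 0.60), "human"),
    ((3.0, 0.55), "human"),
    ((2.5, 0.70), "human"),
    ((40.0, 0.05), "bot"),
    ((55.0, 0.00), "bot"),
    ((35.0, 0.10), "bot"),
]

print(knn_predict(train, (45.0, 0.02)))  # bot
```

In this toy data the first feature has a much larger range than the second, so it dominates the distance; this is why feature scaling usually precedes a distance-based classifier.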
Article
In this paper, a version of the Turing Test is proposed, to test the ability of computer game playing agents (“bots”) to imitate human game players. The proposed test has been implemented as a bot design and programming competition, the 2K BotPrize Contest. The results of the 2008 competition are presented and analyzed. We find that the Test is challenging, but that current techniques show promise. We also suggest probable future directions for developing improved bots.
Article
Pre-processing or normalisation of data sets is widely used in a number of fields of machine intelligence. Contrary to the overwhelming majority of other normalisation procedures, when data is scaled to a unit range, it is argued in the paper that after normalisation of a data set, the average contributions of all features to the measure employed to assess the similarity of the data have to be equal to one another. Using the Minkowski distance as an example of a similarity metric, new normalised metrics are introduced such that the means of all attributes are the same and, hence, contributions of the features to similarity measures are approximately equalised. Such a normalisation is achieved by scaling of the numerical attributes, i.e. by dividing the database values by the means of the appropriate components of the metric.
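A small Python sketch may help to show the effect of scaling by means. It takes the simplest reading of the abstract, dividing every attribute by its own mean so that all attributes end up with a mean of one, and then computes a Minkowski distance on the scaled data. The paper's exact scaling constants may differ; the function names and toy data are assumptions, and the sketch presumes each attribute has a non-zero mean.

```python
def mean_normalise(rows):
    """Divide each column by its mean, so every attribute has mean 1.

    Assumes every column has a non-zero mean.
    """
    n_cols = len(rows[0])
    means = [sum(row[j] for row in rows) / len(rows) for j in range(n_cols)]
    return [[row[j] / means[j] for j in range(n_cols)] for row in rows]

def minkowski(a, b, p=2):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

# Two attributes on very different scales.
data = [[1.0, 1000.0], [2.0, 3000.0], [3.0, 2000.0]]
scaled = mean_normalise(data)

print(scaled[0])  # [0.5, 0.5]
print(minkowski(data[0], data[1], p=1), minkowski(scaled[0], scaled[1], p=1))
```

Before scaling, the distance between the first two rows is driven almost entirely by the second attribute; after scaling, both attributes contribute on a comparable footing.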
Article
This paper provides a detailed case study of the architecture of the Clickbot.A botnet that attempted a low-noise click fraud attack against syndicated search engines. The botnet of over 100,000 machines was controlled using a HTTP-based botmaster. Google identified all clicks on its ads exhibiting Clickbot.A-like patterns and marked them as invalid. We disclose the results of our investigation of this botnet to educate the security research community and provide information regarding the novelties of the attack.
Article
Predictions of the secondary structure of T4 phage lysozyme, made by a number of investigators on the basis of the amino acid sequence, are compared with the structure of the protein determined experimentally by X-ray crystallography. Within the amino terminal half of the molecule the locations of helices predicted by a number of methods agree moderately well with the observed structure; however, within the carboxyl half of the molecule the overall agreement is poor. For eleven different helix predictions, the coefficients giving the correlation between prediction and observation range from 0.14 to 0.42. The accuracy of the predictions for both beta-sheet regions and for turns is generally lower than for the helices, and in a number of instances the agreement between prediction and observation is no better than would be expected for a random selection of residues. The structural predictions for T4 phage lysozyme are much less successful than was the case for adenylate kinase (Schulz et al. (1974) Nature 250, 140-142). No one method of prediction is clearly superior to all others, and although empirical predictions based on larger numbers of known protein structures tend to be more accurate than those based on a limited sample, the improvement in accuracy is not dramatic, suggesting that the accuracy of current empirical predictive methods will not be substantially increased simply by the inclusion of more data from additional protein structure determinations.
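A correlation between binary predictions and observations of this kind is nowadays usually computed from a confusion matrix and called the Matthews correlation coefficient. The abstract does not state the formula, so the Python sketch below uses the standard confusion-matrix form as an assumption about what was computed; the function name and the zero-denominator convention are choices made for illustration.

```python
import math

def matthews_corrcoef(tp, tn, fp, fn):
    """Correlation between binary predictions and observations.

    tp/tn/fp/fn are the counts of true positives, true negatives,
    false positives and false negatives. Returns a value in [-1, 1].
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # convention used here when a row or column is empty
    return (tp * tn - fp * fn) / denom

print(matthews_corrcoef(tp=50, tn=50, fp=0, fn=0))    # 1.0
print(matthews_corrcoef(tp=25, tn=25, fp=25, fn=25))  # 0.0
```

A value of 1 means perfect agreement, 0 means no better than chance, and -1 means total disagreement, which puts the reported range of 0.14 to 0.42 into context. The same measure applies to any two-class problem, including bot versus human classification.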