adfa, p. 34, 2011.
© Springer-Verlag Berlin Heidelberg 2011
An Evaluation on KNN-SVM Algorithm for Detection
and Prediction of DDoS Attack
Ahmad Riza’ain Yusof (1,2), Nur Izura Udzir (1), Ali Selamat (2)
(1) Faculty of Computer Science and Information Technology, Universiti Putra Malaysia, 43400 Serdang, Selangor, Malaysia
{izura@upm.edu.my}
(2) Faculty of Computing, Universiti Teknologi Malaysia, UTM Johor Bahru, 81310 Johor Bahru, Malaysia
{rizaain, aselamat}@utm.my
Abstract. The damage caused by DDoS attacks increases year by year. Along with advances in communication technology, this kind of attack has evolved and become more complicated and harder to detect, using flash crowd agents, slow-rate attacks and amplification attacks that exploit a vulnerability in DNS servers. Fast detection of DDoS attacks, quick response mechanisms and proper mitigation are a must for an organization. This paper investigates DDoS attacks and analyzes the details of their phases, using machine learning techniques to classify the network status. We propose a hybrid KNN-SVM method for classifying, detecting and predicting DDoS attacks. The simulation results show that each phase of the attack scenario is partitioned well and that we can detect precursors of a DDoS attack as well as the attack itself.
Keywords: Distributed denial of service (DDoS), Machine learning classifiers, Security, Intrusion detection, Prediction, Support vector machine (SVM), k-nearest neighbor (KNN), KNN-SVM
1 Introduction
Computer security is usually concerned with three aspects: integrity, confidentiality and availability. Security threats correspondingly fall into three categories: breach of confidentiality, failure of authenticity and unauthorized denial of service [1]. Distributed Denial of Service (DDoS) attacks have become a major problem and a current threat to the users, organizations and infrastructure of the Internet. In this type of intrusion, the attacker attempts to disrupt a target by flooding it with illegitimate packets, exhausting its resources and overwhelming it so that legitimate requests cannot get through. According to the security report of Arbor 2005-2010 [8].
This paper addresses current research challenges in DDoS defense by evaluating machine learning algorithms for detecting and predicting DDoS attacks, covering feature extraction, classification and clustering. In addition, various hybrid approaches have been employed. The evaluation results indicate that these research challenges are well suited to machine learning techniques.
This paper is organized as follows. Section 2 reviews related work and briefly describes a number of machine learning techniques for intrusion detection. Section 3 describes the proposed work, including the classifiers and the feature extraction. Section 4 presents the experimental results. The conclusion is given in Section 5.
2 Related Study
Many recent reports describe DDoS attacks on commercial or government websites [4]. As DDoS attack techniques have advanced, research on detection has also evolved and, as a result, various methods have been suggested to counter DDoS attacks. DDoS attacks can be classified into anomaly-based, congestion-based and other types [3]. A network traffic controller using machine learning (ML) techniques was proposed in 1990, aiming to maximize call completion in a circuit-switched telecommunications network [1]. This was one of the works that marked the point at which ML techniques expanded their application space into the telecommunications networking field. In 1994, ML was first utilized for Internet flow classification in the context of intrusion detection. This was the starting point for much of the subsequent work using ML techniques in Internet traffic classification.
Gavrilis et al. [4] utilized an RBF-NN detector, which is a two-layer neural network. It uses nine packet parameters, and the frequencies of these parameters are estimated. Based on the frequencies, the RBF-NN classifies traffic as attack or normal. In this study, IP spoofing, which is one of the most definite pieces of evidence of a DDoS attack, is not considered for attack detection. For UDP attacks, the detection efficiency is lower than for TCP attacks and is noticeably low at the beginning of an attack. Defining the k-means centers that minimize the quantization error is also a difficult task.
The hybrid technique proposed by Ming-Yang Su et al. [5] weights the features of DDoS attacks and analyzes the relationship between detection performance and the number of features. The study combined a genetic algorithm (GA) with KNN (k-nearest neighbor) for feature selection and weighting. All 35 initial features were weighted in the training phase, and the top ones were selected to implement a Network Intrusion Detection System (NIDS) for testing. To detect DDoS attacks quickly, all of these features are extracted from the headers of the network traffic, including IP, TCP, UDP, ICMP, ARP and IGMP. Following the GA framework, the proposed NIDS is described in three parts: the first presents all the features considered in the study; the second states the encoding of a chromosome and the fitness function; the third provides details on selection, crossover and mutation in the GA. Suresh [10] also evaluated machine learning techniques for DDoS attack detection, and the results indicate that fuzzy c-means clustering gives better classification and is faster than the other algorithms.
3 Proposed Work
In this section, we discuss the details of the methods used in this work for detection and prediction of DDoS attacks. These are k-nearest neighbor (KNN) and support vector machine (SVM), combined as KNN-SVM.
3.1 Support Vector Machine (SVM)
Support Vector Machines are among the most common and popular methods for classification and regression tasks in machine learning [16]. In this method, a set of training examples is given, each marked as belonging to one of two categories. The SVM algorithm then builds a model that predicts whether a new example falls into one category or the other.
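The idea of learning a two-class model from labeled examples can be sketched as follows. This is a minimal linear soft-margin SVM trained by stochastic subgradient descent on the hinge loss. It is an illustrative assumption, not the WEKA implementation used in the experiments, and the toy data, the function names and the hyperparameters `lam`, `lr` and `epochs` are hypothetical.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Fit w, b for labels in {-1, +1} by subgradient descent on the hinge loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # L2 regularization shrinks the weights at every step.
            w = [wj - lr * lam * wj for wj in w]
            if margin < 1:
                # The example violates the margin: adjust w and b to enlarge it.
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Hypothetical two-feature toy data: +1 = attack, -1 = normal.
X = [(2.0, 2.0), (3.0, 1.0), (2.0, 3.0), (-2.0, -2.0), (-3.0, -1.0), (-1.0, -3.0)]
y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(X, y)
predictions = [predict(w, b, x) for x in X]
```

A kernel SVM, as commonly used in practice, replaces the dot product with a kernel function; the linear form is chosen here only to keep the sketch short.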
3.2 K-Nearest Neighbor (k-NN)
A k-NN algorithm has been shown to be very effective for a variety of problem domains, including text categorization [17]. It determines the class label of a test example based on the k training examples nearest to it. The similarity score of each neighbor document to the test document is used as the weight of the categories of that neighbor document. As illustrated in Fig. 1, classification relies on the distances between the test example and its neighbors.
Fig. 1. A k-nearest neighbor (KNN) classifier [16]
3.3 Features Extraction
Various types of DDoS attacks are studied to select the traffic parameters that change unusually during such attacks. Eight features are extracted from both datasets using information gain ranking: we rank all the features to identify which ones are most relevant. Many machine learning problems can improve their accuracy by applying feature selection and extraction, which indicates that feature selection is also important for ranking [10]. Information gain is applied to measure the importance of each feature. The information gain of a given attribute X with respect to the class Y is the reduction in uncertainty about the value of Y after observing values of X. The uncertainty about the value of Y is measured by its entropy, defined as

$$H(Y) = -\sum_{i} P(y_i) \log_2 P(y_i) \qquad (1)$$

where $P(y_i)$ is the prior probability of each value $y_i$ of Y. The uncertainty about the value of Y after observing values of X is given by the conditional entropy of Y given X, defined as

$$H(Y \mid X) = -\sum_{j} P(x_j) \sum_{i} P(y_i \mid x_j) \log_2 P(y_i \mid x_j) \qquad (2)$$

where $P(y_i \mid x_j)$ is the posterior probability of Y given the values of X. The information gain is thus defined as

$$IG(Y \mid X) = H(Y) - H(Y \mid X) \qquad (3)$$
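The three quantities above can be estimated directly from counts. The following is a minimal sketch, assuming discrete feature values and using hypothetical toy records (a protocol-type column and a traffic class column); the function names are illustrative and not taken from the tooling used in the experiments.

```python
import math
from collections import Counter


def entropy(labels):
    """H(Y) = -sum_i P(y_i) log2 P(y_i), estimated from label counts."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def conditional_entropy(feature, labels):
    """H(Y|X) = sum_j P(x_j) * H(Y | X = x_j)."""
    total = len(labels)
    groups = {}
    for x, y in zip(feature, labels):
        groups.setdefault(x, []).append(y)
    return sum(len(ys) / total * entropy(ys) for ys in groups.values())


def information_gain(feature, labels):
    """IG(Y|X) = H(Y) - H(Y|X)."""
    return entropy(labels) - conditional_entropy(feature, labels)


# Hypothetical toy records: protocol type versus traffic class.
protocol = ["icmp", "icmp", "icmp", "tcp", "tcp", "tcp", "udp", "udp"]
label = ["attack", "attack", "attack", "normal", "normal", "attack", "normal", "normal"]
gain = information_gain(protocol, label)
```

On this toy data the class entropy is 1 bit and the gain is roughly 0.66 bits, so knowing the protocol removes about two thirds of the uncertainty about the class. Ranking features then amounts to sorting them by this value.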
| Info. Gain Rank | Feature No. | Feature | Description |
|---|---|---|---|
| 1 | 5 | Src bytes | Number of data bytes from source to destination |
| 2 | 23 | Count | Number of connections to the same host as the current connection in the past two seconds |
| 3 | 3 | Service | Network service on the destination, e.g., http, telnet, etc. |
| 4 | 24 | Srv count | Number of connections to the same service as the current connection in the past two seconds |
| 5 | 36 | Dst host same src port rate | Percentage of connections to the current host having the same source port |
| 6 | 2 | Protocol Type | Connection protocol (TCP, UDP, ICMP) |
| 7 | 33 | Dst host srv count | Count of connections having the same destination host and using the same service |
| 8 | 35 | Dst host diff srv rate | Percentage of different services on the current host |

Table 1. List of extracted features
By calculating information gain, each attribute can be ranked by its correlation with the class. The most important attributes can then be selected based on the ranking. Based on the result, the eight features listed in Table 1 are selected for detection of DDoS attacks.
3.4 Machine Learning Algorithms
In this part, we briefly describe the machine learning algorithms used in our experiment.
3.4.1 Naive Bayes
Naïve Bayes is a simple probabilistic classifier. According to Livadas et al. [12], a widely used framework for classification is provided by a simple theorem of probability known as Bayes' rule, Bayes' theorem, or Bayes' formula:

$$P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)}$$

where $c$ is a class and $x$ is the observed feature vector.
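A minimal sketch of how Bayes' rule turns class priors and per-feature likelihoods into a class posterior is given below. The probabilities are hypothetical values chosen for illustration, not estimates from the dataset, and the function name is an assumption.

```python
def naive_bayes_posterior(priors, feature_likelihoods):
    """Return P(class | features), assuming features are independent given the class.

    priors: {class: P(class)}
    feature_likelihoods: {class: [P(feature_1 | class), P(feature_2 | class), ...]}
    """
    joint = {}
    for cls, prior in priors.items():
        p = prior
        for likelihood in feature_likelihoods[cls]:
            p *= likelihood  # the "naive" independence assumption
        joint[cls] = p
    evidence = sum(joint.values())  # P(features), the normalizing constant
    return {cls: p / evidence for cls, p in joint.items()}


# Hypothetical probabilities for one connection with two observed features.
priors = {"attack": 0.8, "normal": 0.2}
likelihoods = {"attack": [0.5, 0.9], "normal": [0.1, 0.3]}
posterior = naive_bayes_posterior(priors, likelihoods)
predicted = max(posterior, key=posterior.get)
```

The classifier predicts the class with the largest posterior; the evidence term is the same for every class, so it only rescales the scores.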
3.4.2 C4.5
Among classification algorithms, the C4.5 system of Quinlan [13] is the result of machine learning research that traces back to the ID3 system [14], which tries to find small decision trees.
3.4.3 K-Means Clustering
K-means, or hard c-means, clustering is a partitioning method that treats the observations in the data as objects and groups them based on their locations and the distances between the input data points. It partitions the objects into K mutually exclusive clusters such that objects within each cluster remain as close as possible to each other but as far as possible from objects in other clusters [15].
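The partitioning described above is commonly computed by alternating an assignment step and a center update step (Lloyd's algorithm). The sketch below assumes squared Euclidean distance and a naive initialization from the first k points; the data and names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans(points, k, iterations=20):
    centers = [tuple(p) for p in points[:k]]  # naive initialization (assumption)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster.
        new_centers = []
        for j in range(k):
            if clusters[j]:
                new_centers.append(tuple(sum(c) / len(clusters[j]) for c in zip(*clusters[j])))
            else:
                new_centers.append(centers[j])  # keep the center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
    return centers, labels


# Hypothetical two-feature records forming two well-separated groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.0), (1.0, 0.5), (9.0, 8.5)]
centers, labels = kmeans(points, k=2)
```

Each point receives exactly one label, which is what makes the clusters mutually exclusive, in contrast to the fuzzy memberships of FCM.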
3.4.4 K-NN Classifier
The k-NN algorithm is a similarity-based learning algorithm and is known to
be highly effective in various problem domains, including classification problems.
Given a test element dt, the k-NN algorithm finds its k-nearest neighbors among the
training elements, which form the neighborhood of dt. Majority voting among the ele-
ments in the neighborhood is used to decide the class for dt.
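The neighborhood search and majority vote can be sketched in a few lines. This is a minimal example assuming Euclidean distance and hypothetical two-feature training elements; ties in the vote are resolved by whichever label `Counter` lists first.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify query by majority vote among its k nearest training elements.

    train: list of (feature_vector, label) pairs.
    """
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical training elements with two normalized features each.
train = [
    ((0.10, 0.20), "normal"),
    ((0.20, 0.10), "normal"),
    ((0.15, 0.25), "normal"),
    ((0.90, 0.80), "attack"),
    ((0.80, 0.90), "attack"),
    ((0.85, 0.95), "attack"),
]
near_attack = knn_predict(train, (0.82, 0.88), k=3)
near_normal = knn_predict(train, (0.12, 0.18), k=3)
```

Choosing an odd k avoids ties in a two-class problem such as attack versus normal.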
3.4.5 FCM Clustering
Fuzzy c-means (FCM) is a method of clustering which allows one piece of data to belong to two or more clusters. This method (developed by Dunn in 1973 and improved by Bezdek in 1981) is frequently used in pattern recognition. It is based on minimization of the following objective function:

$$J_m(U, v) = \sum_{i=1}^{N} \sum_{j=1}^{C} u_{ij}^{m} \, \lVert x_i - v_j \rVert^{2} \qquad (4)$$

where $m$ is any number greater than 1, $u_{ij}$ is the degree of membership of $x_i$ in the cluster $j$, $x_i$ is the $i$th of the $d$-dimensional measured data, $v_j$ is the $d$-dimensional center of the cluster, $N$ is the number of data points, $C$ is the number of clusters, and $\lVert \cdot \rVert$ is any norm expressing the similarity between any measured data and the center.
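The objective is usually minimized by alternating the standard membership and center updates. The sketch below assumes the Euclidean norm, m = 2 and hand-picked initial centers; the data and function names are hypothetical and only illustrate the technique.

```python
import math


def memberships(points, centers, m=2.0, eps=1e-12):
    """u_ij = 1 / sum_k (d_ij / d_ik)^(2 / (m - 1)); each row sums to 1."""
    power = 2.0 / (m - 1.0)
    U = []
    for x in points:
        # eps avoids division by zero when a point coincides with a center.
        d = [max(math.dist(x, v), eps) for v in centers]
        U.append([1.0 / sum((dj / dk) ** power for dk in d) for dj in d])
    return U


def update_centers(points, U, m=2.0):
    """v_j = sum_i u_ij^m x_i / sum_i u_ij^m."""
    dims = range(len(points[0]))
    centers = []
    for j in range(len(U[0])):
        weights = [row[j] ** m for row in U]
        total = sum(weights)
        centers.append(tuple(sum(w * p[d] for w, p in zip(weights, points)) / total for d in dims))
    return centers


def objective(points, U, centers, m=2.0):
    """J_m(U, v) = sum_i sum_j u_ij^m * ||x_i - v_j||^2."""
    return sum(
        U[i][j] ** m * math.dist(points[i], centers[j]) ** 2
        for i in range(len(points))
        for j in range(len(centers))
    )


def fcm(points, centers, m=2.0, iterations=50):
    """Alternate membership and center updates, then return the final memberships."""
    for _ in range(iterations):
        U = memberships(points, centers, m)
        centers = update_centers(points, U, m)
    return memberships(points, centers, m), centers


# Hypothetical two-feature records and initial centers.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (9.0, 9.0), (8.0, 8.0), (9.0, 8.5)]
initial_centers = [(0.0, 0.0), (10.0, 10.0)]
U, centers = fcm(points, initial_centers)
```

Unlike k-means, every point keeps a membership in every cluster; a hard label can be obtained by taking the cluster with the largest membership.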
4 Experimental Result
The KDD99 dataset [17] is used in the experiments as the attack component. Clas-
sification of attack and normal traffic is done using WEKA. Table 2 shows the dataset
and the normal traffic. Table 3 shows the correct classification and the attack detection
time. Table 4 shows the F-measure details and Fig. 2 shows the evaluation results using
ROC curves for the selected machine learning techniques.
4.1 Performance Evaluation Criteria
Two criteria are chosen for evaluating the performance of the classifier: True Positive Rate (TPR) and False Positive Rate (FPR).

$$\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN} \qquad (5)$$

In formula (5), TP (True Positive), FN (False Negative), FP (False Positive) and TN (True Negative) are defined in [9]. TPR describes the sensitivity of our classifier, while FPR shows the rate of false alarms. From TPR and FPR, a Receiver Operating Characteristic (ROC) curve, which comes from signal detection theory, can be drawn.
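As a worked example of the two rates in (5), the sketch below applies the standard definitions to the Fuzzy C Mean counts listed in Table 4. The function name is illustrative.

```python
def tpr_fpr(tp, fp, tn, fn):
    """Return (TPR, FPR): TP / (TP + FN) and FP / (FP + TN)."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return tpr, fpr


# Counts for Fuzzy C Mean as listed in Table 4.
tpr, fpr = tpr_fpr(tp=298, fp=2, tn=270, fn=3)
```

For these counts the sensitivity is about 0.990 and the false alarm rate about 0.007. Each such (FPR, TPR) pair is one point on a ROC curve.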
Fig. 1. Attack traffic trace at 11.30 a.m. [17]

| Network Data | Data type | Total number of records |
|---|---|---|
| Trained | Full set data | 494,021 |
| | Normal | 97,277 |
| | DDoS Attack | 391,458 |

Table 2. Sample collected

| Method Used | Correct Classification (%) | Detection Time (s) |
|---|---|---|
| SVM | 96.4 | 0.23 |
| KNN | 96.6 | 0.26 |
| Decision Tree | 95.6 | 0.25 |
| K-Mean | 96.7 | 0.20 |
| Naive Bayesian | 92.9 | 0.52 |
| Fuzzy C Mean | 98.7 | 0.15 |

Table 3. Classification results

| Method | TP | FP | TN | FN | F-Measure |
|---|---|---|---|---|---|
| SVM | 281 | 18 | 253 | 20 | 0.96 |
| KNN | 280 | 20 | 243 | 30 | 0.97 |
| Decision Tree | 277 | 22 | 218 | 55 | 0.96 |
| K-Mean | 285 | 15 | 273 | 0 | 0.97 |
| Naive Bayesian | 292 | 10 | 256 | 17 | 0.97 |
| Fuzzy C Mean | 298 | 2 | 270 | 3 | 0.99 |

Table 4. F-Measure details of classifiers
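The paper does not state how the F-measure is computed. As a sketch, assuming the balanced F-measure (the harmonic mean of precision and recall), the Fuzzy C Mean row of Table 4 can be evaluated as follows; the function name is illustrative.

```python
def f_measure(tp, fp, fn):
    """Balanced F-measure: harmonic mean of precision and recall (assumed definition)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Counts for Fuzzy C Mean as listed in Table 4.
score = f_measure(tp=298, fp=2, fn=3)
```

Under this assumption the Fuzzy C Mean counts give about 0.99, the value listed for that row.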
5 Conclusion
The dataset is evaluated using machine learning algorithms for effectively detecting DDoS attacks. The KDD99 dataset is used as the attack data, and relevant features are selected based on information gain ranking. Experimental results show that fuzzy c-means clustering gives the best classification (98.7% correct) and the shortest detection time (0.15 s) among the evaluated algorithms.
6 Acknowledgement
The authors would like to thank anonymous reviewers for their constructive
comments and valuable suggestions. The authors wish to thank Universiti Teknologi
Malaysia (UTM) under Research University Grant Vot-02G31 and Ministry of Higher
Education Malaysia (MOHE) under the Fundamental Research Grant Scheme (FRGS
Vot-4F551) for completion of the research.
References
1. B. Silver, “Netman: A learning network traffic controller,” in Proc. Third International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems, Association for Computing Machinery, 1990.
2. I. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations (Second Edition). Morgan Kaufmann Publishers, 2005.
3. P. Ferguson and D. Senie, “Network ingress filtering: Defeating denial of service attacks which employ IP source address spoofing,” RFC 2267, January 1998.
4. Gavrilis, D., & Dermatas, E. (2005). Real-time detection of distributed denial-of-service attacks using RBF networks and statistical features. Computer Networks, 48(2), 235–245. doi:10.1016/j.comnet.2004.08.014
5. Lee, K., Kim, J., Kwon, K.H., Han, Y., Kim, S.: DDoS Attack Detection Method using Cluster Analysis. Expert Systems with Applications 34, 1659–1665 (2008)
6. Panda, M., Patra, M.R.: Evaluating Machine Learning Algorithms for Detecting Network Intrusions. International Journal of Recent Trends in Engineering 1(1), 472–477 (2009)
7. Arbor Networks – Annual report (2015), http://www.arbornetworks.com/resources/annual-security-report [accessed on January 2016]
8. Geng, X., Liu, T., Qin, T., & Li, H. (2007). Feature Selection for Ranking 2. Learning, (49), 407–414.
9. Suresh, M., & Anitha, R. (2011). Evaluating Machine Learning Algorithms for Detecting DDoS Attacks. 4th International Conference, CNSA 2011, Chennai, India, 441–452. doi:10.1007/978-3-642-22540-6_42
10. Livadas, C., Walsh, R., Lapsley, D., & Strayer, W. T. (2006). Using Machine Learning Techniques to Identify Botnet Traffic. Local Computer Networks, Proceedings 2006 31st IEEE Conference on, 967–974. doi:10.1109/LCN.2006.322210
11. Quinlan, J. R. (1996). Improved use of continuous attributes in C4.5. Journal of Artificial Intelligence Research, 4, 77–90. doi:10.1613/jair.279
12. J.R. Quinlan, “Induction of Decision Trees,” Machine Learning, vol. 1, no. 1, pp. 81–106, 1986.
13. Ghosh, S., & Dubey, S. (2013). Comparative Analysis of K-Means and Fuzzy C-Means Algorithms. Ijacsa, 4(4), 35–39. doi:10.14569/IJACSA.2013.040406
14. Vapnik, V.: The Nature of Statistical Learning Theory. Springer, Heidelberg (1995)
15. Guo, G., Wang, H., Bell, D., Bi, Y., & Greer, K. (2006). Using kNN Model-based Approach for Automatic Text Categorization. Soft Computing, 10(5), 423–430.
16. M. Tavallaee, E. Bagheri, W. Lu, and A. A. Ghorbani, “A detailed analysis of the KDD CUP 99 data set,” IEEE Symp. Comput. Intell. Secur. Def. Appl. CISDA 2009, pp. 1–6, 2009.
17. “The CAIDA UCSD ‘DDoS Attack 2007’ Dataset,” http://www.caida.org/data/passive/ddos-20070804_dataset.xml, 2013.