20th International Conference on Pattern Recognition (ICPR 2010), Istanbul, Turkey, 23-26 August 2010
DOI: 10.1109/ICPR.2010.378

Towards a Generic Feature-Selection Measure for Intrusion Detection
Hai Thanh Nguyen, Katrin Franke and Slobodan Petrović
Norwegian Information Security Laboratory
Gjøvik University College, Norway
{hai.nguyen, katrin.franke, slobodan.petrovic}@hig.no
Abstract
Performance of a pattern recognition system depends strongly on the employed feature-selection method. We perform an in-depth analysis of two main measures used in the filter model: the correlation-feature-selection (CFS) measure and the minimal-redundancy-maximal-relevance (mRMR) measure. We show that these measures can be fused and generalized into a generic feature-selection (GeFS) measure. Further on, we propose a new feature-selection method that ensures globally optimal feature sets. The new approach is based on solving a mixed 0-1 linear programming (M01LP) problem by means of the branch-and-bound algorithm. In this M01LP problem, the number of constraints and variables is linear (O(n)) in the number n of full-set features. In order to evaluate the quality of our GeFS measure, we chose the design of an intrusion detection system (IDS) as a possible application. Experimental results obtained over the KDD Cup '99 test data set for IDS show that the GeFS measure removes 93% of irrelevant and redundant features from the original data set, while keeping or even improving the classification accuracy.
1. Introduction
An intrusion detection system (IDS) can be regarded as a pattern recognition system, in which feature selection is an important pre-processing step. By removing irrelevant and redundant features, we can improve classification performance and reduce computational complexity, thus increasing the time available for detecting intrusions. Most feature-selection work in intrusion detection practice is still done manually, and the quality of the selected features depends strongly on expert knowledge. For automatic feature selection, the wrapper and the filter models from machine learning are frequently applied [7]. The wrapper model assesses selected features by the performance of a learning algorithm and therefore requires considerable time and computational resources to find the best feature subsets. The filter model considers statistical characteristics of a data set directly, without involving any learning algorithm. Owing to this computational efficiency, the filter method is usually used to select features from high-dimensional data sets, such as those encountered in intrusion detection. The filter model encompasses two groups of methods: feature-ranking methods and feature-subset-evaluating methods. Feature-ranking methods assign weights to features individually, based on their relevance to the target concept. Feature-subset-evaluating methods estimate feature subsets not only by their relevance, but also by the relationships between features that make certain features redundant. It is well known that redundant features can reduce the performance of a pattern recognition system, so the feature-subset-evaluating methods are more suitable for selecting features for intrusion detection. A major challenge in the IDS feature-selection process is to choose measures that can precisely determine the relevance of, and the relationships between, the features of a given data set.

Since relevance and relationships are usually characterized in terms of correlation or mutual information [5, 6], we focus on two measures: the correlation-feature-selection (CFS) measure [1] and the minimal-redundancy-maximal-relevance (mRMR) measure [2]. We show that these two measures can be fused and generalized into a generic feature-selection (GeFS) measure. We reformulate the feature-selection problem by means of the GeFS measure as a polynomial mixed 0-1 fractional programming (PM01FP) problem. We improve Chang's method [3] in order to equivalently reduce this PM01FP problem to a mixed 0-1 linear programming (M01LP) problem. Finally, we propose to use the branch-and-bound algorithm to solve this M01LP problem, whose optimal solution is also the globally optimal feature subset. Experimental results obtained over the KDD Cup '99 data set [4] show that
the GeFS measure removes 93% of irrelevant and redundant features from the original data set, while keeping or even improving the classification accuracy.
The paper is organized as follows. Section 2 formally defines the generic feature-selection (GeFS) measure and describes the CFS and mRMR measures as its instances. The background on PM01FP and M01LP problems and on Chang's method [3] is introduced in Section 3. Section 4 describes our new feature-selection method for obtaining globally optimal feature subsets. We present experimental results in Section 5. The last section summarizes our findings.
2. A Generic Feature-Selection Measure
2.1. Definitions
Definition 1: A generic feature-selection measure used in the filter model is a function $GeFS(x)$ of the following form, with $x = (x_1, \ldots, x_n)$:

$$GeFS(x) = \frac{a_0 + \sum_{i=1}^{n} A_i(x)\,x_i}{b_0 + \sum_{i=1}^{n} B_i(x)\,x_i}, \quad x \in \{0,1\}^n \qquad (1)$$

In this definition, the binary value of the variable $x_i$ indicates the appearance ($x_i = 1$) or the absence ($x_i = 0$) of the feature $f_i$; $a_0$, $b_0$ are constants; $A_i(x)$, $B_i(x)$ are linear functions of the variables $x_1, \ldots, x_n$.
Definition 2: The feature-selection problem is to find $x \in \{0,1\}^n$ that maximizes the function $GeFS(x)$:

$$\max_{x \in \{0,1\}^n} GeFS(x) = \frac{a_0 + \sum_{i=1}^{n} A_i(x)\,x_i}{b_0 + \sum_{i=1}^{n} B_i(x)\,x_i} \qquad (2)$$
Several feature-selection measures can be represented in the form (1), such as the correlation-feature-selection (CFS) measure, the minimal-redundancy-maximal-relevance (mRMR) measure, and the Mahalanobis distance.
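
Before specializing the measure, it may help to see problem (2) stated operationally. The Python sketch below evaluates (1) for coefficient callables A and B and solves (2) by exhaustive search; this brute-force baseline is our own illustration (the function names, the toy coefficients, and the exclusion of the all-zero vector are assumptions), and its exponential cost is exactly what the M01LP reformulation of Sections 3 and 4 avoids.

```python
from itertools import product

def gefs_value(x, a0, b0, A, B):
    """Evaluate the generic measure (1) at a binary tuple x.

    A(x, i) and B(x, i) return the linear coefficients A_i(x) and B_i(x).
    """
    num = a0 + sum(A(x, i) * x[i] for i in range(len(x)))
    den = b0 + sum(B(x, i) * x[i] for i in range(len(x)))
    return num / den

def brute_force_gefs(n, a0, b0, A, B):
    """Solve problem (2) by enumerating all 2^n - 1 nonempty subsets."""
    candidates = (xs for xs in product([0, 1], repeat=n) if any(xs))
    return max(candidates, key=lambda xs: gefs_value(xs, a0, b0, A, B))

# Toy instance with constant coefficients (an assumption for illustration):
# maximize (0.9*x1 + 0.8*x2 + 0.1*x3) / (1 + x1 + x2 + x3).
best = brute_force_gefs(3, a0=0.0, b0=1.0,
                        A=lambda x, i: [0.9, 0.8, 0.1][i],
                        B=lambda x, i: 1.0)
print("optimal x =", best)   # drops the weak third feature
```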
2.2. The mRMR Feature Selection Measure
In 2005, Peng et al. [2] proposed a feature-selection method based on mutual information, in which relevant and redundant features are considered simultaneously. In terms of mutual information, the relevance of a feature set $S$ for the class $c$ is defined as the average value of all mutual-information values between the individual features $f_i$ and the class $c$:

$$D(S, c) = \frac{1}{|S|} \sum_{f_i \in S} I(f_i; c).$$

The redundancy of all features in the set $S$ is the average value of all mutual-information values between pairs of features $f_i$ and $f_j$:

$$R(S) = \frac{1}{|S|^2} \sum_{f_i, f_j \in S} I(f_i; f_j).$$

The mRMR criterion combines the two measures given above and is defined as follows:

$$\max_{S} \left[ \frac{1}{|S|} \sum_{f_i \in S} I(f_i; c) - \frac{1}{|S|^2} \sum_{f_i, f_j \in S} I(f_i; f_j) \right] \qquad (3)$$
Suppose that there are $n$ features in the full set. We use the binary value of the variable $x_i$ to indicate the appearance ($x_i = 1$) or the absence ($x_i = 0$) of the feature $f_i$ in the globally optimal feature set. Denoting the mutual-information values $I(f_i; c)$ and $I(f_i; f_j)$ by the constants $c_i$ and $a_{ij}$, respectively, problem (3) can be written as the following optimization problem:

$$\max_{x \in \{0,1\}^n} \left[ \frac{\sum_{i=1}^{n} c_i x_i}{\sum_{i=1}^{n} x_i} - \frac{\sum_{i,j=1}^{n} a_{ij} x_i x_j}{\left(\sum_{i=1}^{n} x_i\right)^2} \right] \qquad (4)$$

It is obvious that the mRMR measure is an instance of the GeFS measure; we denote it by $GeFS_{mRMR}$.
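
To make the quantities in (3) and (4) concrete, the following Python sketch scores a candidate subset on discrete-valued data. It is a minimal illustration, not the paper's implementation: the helper name mrmr_score, the use of scikit-learn's mutual_info_score estimator, and the synthetic data are our own assumptions.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def mrmr_score(X, y, subset):
    """mRMR value D(S, c) - R(S) of a candidate subset, as in criterion (3).

    X: (n_samples, n_features) array of discrete feature values,
    y: (n_samples,) array of class labels,
    subset: list of selected feature indices (the set S).
    """
    k = len(subset)
    # Relevance D(S, c): average mutual information I(f_i; c).
    relevance = sum(mutual_info_score(X[:, i], y) for i in subset) / k
    # Redundancy R(S): average pairwise mutual information I(f_i; f_j),
    # summed over all ordered pairs, matching the 1/|S|^2 normalization.
    redundancy = sum(mutual_info_score(X[:, i], X[:, j])
                     for i in subset for j in subset) / k**2
    return relevance - redundancy

# Toy usage on random discrete data (illustrative only).
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 5))
y = (X[:, 0] + X[:, 1] > 2).astype(int)   # class depends on features 0 and 1
print(mrmr_score(X, y, [0, 1]), mrmr_score(X, y, [3, 4]))
```

For feature selection one would then search over subsets, either incrementally as in [2] or, as proposed here, globally via the M01LP reformulation.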
2.3. Correlation Feature Selection Measure
The correlation-feature-selection (CFS) measure evaluates subsets of features on the basis of the following hypothesis: "Good feature subsets contain features highly correlated with the classification, yet uncorrelated to each other" [1]. The following equation gives the merit of a feature subset $S$ consisting of $k$ features:

$$Merit_{S_k} = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}}$$

Here, $\overline{r_{cf}}$ is the average value of all feature-classification correlations, and $\overline{r_{ff}}$ is the average value of all feature-feature correlations. The CFS criterion is defined as follows:

$$\max_{S_k} \left[ \frac{r_{cf_1} + r_{cf_2} + \cdots + r_{cf_k}}{\sqrt{k + 2(r_{f_1 f_2} + \cdots + r_{f_i f_j} + \cdots + r_{f_k f_1})}} \right] \qquad (5)$$
By using binary values of the variables $x_i$, as in the case of the mRMR measure, to indicate the appearance or absence of the feature $f_i$, and writing $a_i = r_{cf_i}$ and $b_{ij} = r_{f_i f_j}$, we can rewrite problem (5) as the following optimization problem (squaring the objective removes the square root):

$$\max_{x \in \{0,1\}^n} \frac{\left(\sum_{i=1}^{n} a_i x_i\right)^2}{\sum_{i=1}^{n} x_i + \sum_{i \neq j} 2\,b_{ij} x_i x_j} \qquad (6)$$

It is obvious that the CFS measure is an instance of the GeFS measure; we denote it by $GeFS_{CFS}$.
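
As an illustration of the merit function behind (5) and (6), the Python sketch below computes the CFS merit of a candidate subset from Pearson correlations. The helper name, the use of absolute correlations, and the singleton handling are our own assumptions rather than details fixed by the paper.

```python
import numpy as np

def cfs_merit(X, y, subset):
    """CFS merit k*r_cf / sqrt(k + k(k-1)*r_ff) of a feature subset.

    X: (n_samples, n_features) array, y: (n_samples,) class labels,
    subset: list of selected feature indices.
    """
    k = len(subset)
    # Average absolute feature-class correlation r_cf.
    r_cf = np.mean([abs(np.corrcoef(X[:, i], y)[0, 1]) for i in subset])
    if k == 1:
        return r_cf
    # Average absolute feature-feature correlation r_ff over distinct pairs.
    pairs = [(i, j) for a, i in enumerate(subset) for j in subset[a + 1:]]
    r_ff = np.mean([abs(np.corrcoef(X[:, i], X[:, j])[0, 1]) for i, j in pairs])
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)
```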
3. Polynomial Mixed 0-1 Fractional Programming
A general polynomial mixed 0-1 fractional programming (PM01FP) problem [3] is represented as follows, where s.t. denotes the set of constraints:

$$\min \sum_{i=1}^{m} \frac{a_i + \sum_{j=1}^{n} a_{ij} \prod_{k \in J} x_k}{b_i + \sum_{j=1}^{n} b_{ij} \prod_{k \in J} x_k} \qquad (7)$$

$$\text{s.t.} \quad b_i + \sum_{j=1}^{n} b_{ij} \prod_{k \in J} x_k > 0, \quad i = 1, \ldots, m,$$
$$c_p + \sum_{j=1}^{n} c_{pj} \prod_{k \in J} x_k \leq 0, \quad p = 1, \ldots, m,$$
$$x_k \in \{0,1\}, \; k \in J; \quad a_i, b_i, c_p, a_{ij}, b_{ij}, c_{pj} \in \mathbb{R}.$$
Replacing the denominators in (7) by positive variables $y_i$ ($i = 1, \ldots, m$) leads to the following equivalent polynomial mixed 0-1 programming problem:

$$\min \sum_{i=1}^{m} \left( a_i y_i + \sum_{j=1}^{n} a_{ij} \prod_{k \in J} x_k\, y_i \right) \qquad (8)$$

$$\text{s.t.} \quad b_i y_i + \sum_{j=1}^{n} b_{ij} \prod_{k \in J} x_k\, y_i = 1; \quad y_i > 0,$$
$$c_p + \sum_{j=1}^{n} c_{pj} \prod_{k \in J} x_k \leq 0, \quad p = 1, \ldots, m, \qquad (9)$$
$$x_k \in \{0,1\}; \quad a_i, b_i, c_p, a_{ij}, b_{ij}, c_{pj} \in \mathbb{R}.$$
In order to solve this problem, Chang [3] proposed a linearization technique that transfers the terms $\prod_{k \in J} x_k\, y_i$ into a set of mixed 0-1 linear inequalities. Based on this technique, the PM01FP becomes a mixed 0-1 linear programming (M01LP) problem, which can be solved by means of the branch-and-bound method to obtain the global solution.
Proposition 1: A polynomial mixed 0-1 term $\prod_{k \in J} x_k\, y_i$ from (8) can be represented by the following program [3], where $M$ is a large positive value:

$$\min z_i$$
$$\text{s.t.} \quad z_i \geq 0,$$
$$z_i \geq M\left(\sum_{k \in J} x_k - |J|\right) + y_i \qquad (10)$$
Proposition 2: A polynomial mixed 0-1 term $\prod_{k \in J} x_k\, y_i$ from (9) can be represented by a continuous variable $v_i$, subject to the following linear inequalities [3], where $M$ is a large positive value:

$$v_i \geq M\left(\sum_{k \in J} x_k - |J|\right) + y_i,$$
$$v_i \leq M\left(|J| - \sum_{k \in J} x_k\right) + y_i, \qquad (11)$$
$$0 \leq v_i \leq M x_i.$$
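
To see why these inequalities pin $v_i$ to the product $\prod_{k \in J} x_k\, y_i$, the short Python check below enumerates all binary assignments for a small $J$ and computes the feasible range of $v_i$. The constants ($M$, the sample $y_i$) are our own illustrative choices, and we read the bound $0 \leq v_i \leq M x_k$ as holding for each $k \in J$; this is an assumption about the intended indexing, not a statement from the paper.

```python
from itertools import product

# Illustrative constants (our own choice): a big-M value and a sample y_i > 0.
M, y = 1000.0, 0.37
J = 3  # |J|, the number of binary variables in the product term

for xs in product([0, 1], repeat=J):
    s = sum(xs)
    # v >= M(sum x_k - |J|) + y together with v >= 0:
    lower = max(0.0, M * (s - J) + y)
    # v <= M(|J| - sum x_k) + y together with v <= M*x_k (for each k in J):
    upper = min([M * (J - s) + y] + [M * x for x in xs])
    target = y if s == J else 0.0  # the intended value (prod x_k) * y
    assert abs(lower - target) < 1e-9 and abs(upper - target) < 1e-9
    print(xs, "-> v is forced to", target)
```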
We now formulate the feature-selection problem (2) as a polynomial mixed 0-1 fractional programming (PM01FP) problem.

Proposition 3: The feature-selection problem (2) is a polynomial mixed 0-1 fractional programming (PM01FP) problem.
Remark: By applying Chang's method [3], we can transform this PM01FP problem into an M01LP problem. The number of variables and constraints is then quadratic ($O(n^2)$) in the number $n$ of full-set features, because the number of terms $x_i x_j$ in (2) that must be replaced by new variables is $n(n+1)/2$. The branch-and-bound algorithm can then be used to solve this M01LP problem, but the efficiency of the method depends strongly on the number of variables and constraints: the larger these numbers are, the more complicated the branch-and-bound algorithm becomes.
In the next section, we present an improvement of Chang's method that yields an M01LP problem in which the number of variables and constraints is linear ($O(n)$) in the number $n$ of full-set features.
4. Optimization of the GeFS Measure
By introducing an additional positive variable, denoted by $y$, we now consider the following problem, equivalent to (2):

$$\min_{x \in \{0,1\}^n} \left( -GeFS(x) \right) = -a_0 y - \sum_{i=1}^{n} A_i(x)\,x_i\,y \qquad (12)$$

$$\text{s.t.} \quad b_0 y + \sum_{i=1}^{n} B_i(x)\,x_i\,y = 1; \quad y > 0. \qquad (13)$$

This problem is transformed into a mixed 0-1 linear programming problem as follows:
Proposition 4: A term $A_i(x)\,x_i\,y$ from (12) can be represented by the following program, where $M$ is a large positive value:

$$\min z_i$$
$$\text{s.t.} \quad z_i \geq 0,$$
$$z_i \geq M(x_i - 1) + A_i(x)\,y. \qquad (14)$$
Proposition 5: A term $B_i(x)\,x_i\,y$ from (13) can be represented by a continuous variable $v_i$, subject to the following linear inequality constraints, where $M$ is a large positive value:

$$v_i \geq M(x_i - 1) + B_i(x)\,y,$$
$$v_i \leq M(1 - x_i) + B_i(x)\,y, \qquad (15)$$
$$0 \leq v_i \leq M x_i.$$
We substitute each term $x_i y$ in (14) and (15) by a new variable $t_i$ satisfying the constraints from Proposition 2.
Table 1. Number of selected features

Data Set     Full set   GeFS_mRMR   GeFS_CFS
Nor&DoS      41         22          3
Nor&Probe    41         14          6
Nor&U2R      41         5           1
Nor&R2L      41         6           2
Table 2. Classification accuracies (%) of C4.5 and BayesNet on the KDD Cup '99 data set

                       C4.5                               BayesNet
Data Set     Full set   GeFS_mRMR   GeFS_CFS    Full set   GeFS_mRMR   GeFS_CFS
Nor&DoS      97.80      99.98       98.89       99.99      99.36       98.87
Nor&Probe    99.98      99.35       99.70       98.96      98.65       97.63
Nor&U2R      99.97      99.94       99.96       99.85      99.94       99.95
Nor&R2L      98.70      99.19       99.11       99.33      99.17       98.81
Average      99.11      99.61       99.42       99.53      99.28       98.82
The total number of variables in the resulting M01LP problem is $4n + 1$: the variables $x_i$, $y$, $t_i$, $z_i$ and $v_i$ ($i = 1, \ldots, n$). Consequently, the number of constraints on these variables is also a linear function of $n$. As mentioned above, with Chang's method [3] the number of variables and constraints grows as the square of $n$; our new method therefore improves on it by reducing the complexity of the branch-and-bound algorithm.
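
To illustrate the pipeline end to end, the following Python sketch applies the same $y$-substitution and big-M linearization to the simplest member of the GeFS family, a 0-1 fractional objective with constant coefficients $A_i(x) = a_i$ and $B_i(x) = b_i$, and hands the resulting M01LP to SciPy's HiGHS-based branch-and-bound solver. SciPy >= 1.9, the big-M value, and the toy coefficients are our own assumptions; the paper's full construction additionally linearizes the cross terms inside $A_i(x)$ and $B_i(x)$ via Propositions 4 and 5.

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

# Toy fractional objective (a0 + sum a_i x_i) / (b0 + sum b_i x_i);
# all coefficients are illustrative assumptions.
a0, a = 0.0, np.array([0.9, 0.8, 0.1])
b0, b = 1.0, np.array([1.0, 1.0, 1.0])
n, M = len(a), 100.0

# Variable vector: [x_1..x_n, y, t_1..t_n], with t_i standing for x_i * y.
nv = 2 * n + 1
c = np.zeros(nv)
c[n] = -a0                       # maximize a0*y + sum a_i t_i ...
c[n + 1:] = -a                   # ... i.e. minimize the negative

rows, lbs, ubs = [], [], []
# Denominator normalization, as in (13): b0*y + sum b_i t_i = 1.
row = np.zeros(nv); row[n] = b0; row[n + 1:] = b
rows.append(row); lbs.append(1.0); ubs.append(1.0)
for i in range(n):
    # t_i <= M * x_i
    row = np.zeros(nv); row[n + 1 + i] = 1.0; row[i] = -M
    rows.append(row); lbs.append(-np.inf); ubs.append(0.0)
    # t_i <= y
    row = np.zeros(nv); row[n + 1 + i] = 1.0; row[n] = -1.0
    rows.append(row); lbs.append(-np.inf); ubs.append(0.0)
    # t_i >= y - M*(1 - x_i), rewritten as y - t_i + M*x_i <= M
    row = np.zeros(nv); row[n] = 1.0; row[n + 1 + i] = -1.0; row[i] = M
    rows.append(row); lbs.append(-np.inf); ubs.append(M)

bounds = Bounds(lb=np.r_[np.zeros(n), 1e-9, np.zeros(n)],
                ub=np.r_[np.ones(n), np.inf * np.ones(n + 1)])
integrality = np.r_[np.ones(n), np.zeros(n + 1)]   # x binary, y and t continuous
res = milp(c=c, constraints=LinearConstraint(np.array(rows), lbs, ubs),
           integrality=integrality, bounds=bounds)
print("selected features:", np.round(res.x[:n]).astype(int),
      "objective value:", -res.fun)
```

On this toy instance the solver keeps the two strong features and drops the weak one, matching the brute-force baseline from Section 2, while the number of variables and constraints stays linear in $n$.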
5. Experimental Results
For evaluating our new GeFS measure, we conducted an experiment on the KDD Cup '99 data set [4]. The goal was to find optimal feature subsets by means of the $GeFS_{CFS}$ and $GeFS_{mRMR}$ measures. The selected subsets were then compared with each other by means of the classification accuracies of two machine-learning algorithms: C4.5 and BayesNet.

We performed our experiment using 10% of the overall KDD Cup '99 data set (5 million instances) [4]. This data set contains normal traffic (Nor) and four attack classes: Denial of Service (DoS), Probe, User to Root (U2R) and Remote to Local (R2L). Since the attack classes are distributed very differently (e.g., the ratio of the number of U2R instances to the number of DoS instances is $1.3 \cdot 10^{-4}$), a feature-selection algorithm might concentrate only on the most frequent class and neglect the others. We therefore chose to process the attack classes separately. The $GeFS_{CFS}$ measure was compared with the $GeFS_{mRMR}$ measure regarding the number of selected features and the 5-fold cross-validation classification accuracies of the BayesNet and C4.5 algorithms. All obtained results are listed in Tables 1 and 2.
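
The evaluation protocol can be sketched in Python as follows. This is only an approximation of the setup above: scikit-learn's DecisionTreeClassifier stands in for C4.5, no BayesNet analogue is shown, and the data here are synthetic placeholders rather than the KDD Cup '99 slices.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def evaluate_subset(X, y, selected):
    """Mean 5-fold cross-validated accuracy on a feature subset (cf. Table 2)."""
    clf = DecisionTreeClassifier()  # rough stand-in for C4.5
    return cross_val_score(clf, X[:, selected], y, cv=5, scoring="accuracy").mean()

# Placeholder data: in the experiment, X and y would hold one two-class slice
# of KDD Cup '99 (e.g. Nor&DoS) and `selected` the subset found by GeFS.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 41))            # 41 features, as in KDD Cup '99
y = (X[:, 0] + X[:, 5] > 0).astype(int)   # synthetic two-class labels
print(evaluate_subset(X, y, [0, 5, 7]))
```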
It can be observed from Tables 1 and 2 that $GeFS_{CFS}$ removes 93% of the irrelevant and redundant features, while keeping or even improving the classification accuracy. The $GeFS_{CFS}$ measure outperforms the $GeFS_{mRMR}$ measure, removing more than 21% more of the redundant features.
6. Conclusions
We have studied two main feature-selection measures used in the filter model: the CFS measure and the mRMR measure. We showed that these two measures can be fused and generalized into a generic feature-selection (GeFS) measure. We proposed a new, efficient approach that ensures globally optimal feature sets. The new approach is based on solving a mixed 0-1 linear programming problem by means of the branch-and-bound algorithm, with a number of constraints and variables that is linear in the number of full-set features. Experimental results obtained over the KDD Cup '99 test data set for intrusion detection systems show that the GeFS measure removes 93% of irrelevant and redundant features from the original data set, while keeping or even improving the classification accuracy.
References
[1] M. Hall. Correlation-based Feature Selection for Machine Learning. Doctoral dissertation, Department of Computer Science, University of Waikato, 1999.
[2] H. Peng, F. Long, and C. Ding. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226-1238, 2005.
[3] C.-T. Chang. On the polynomial mixed 0-1 fractional programming problems. European Journal of Operational Research, vol. 131, issue 1, pp. 224-227, 2001.
[4] KDD Cup 1999 data set. http://www.sigkdd.org/kddcup/index.php?section=1999&method=data
[5] I. Guyon, S. Gunn, M. Nikravesh, and L. A. Zadeh. Feature Extraction: Foundations and Applications. Series Studies in Fuzziness and Soft Computing, Physica-Verlag, Springer, 2006.
[6] H. Liu and H. Motoda. Computational Methods of Feature Selection. Chapman & Hall/CRC, 2008.
[7] Y. Chen, Y. Li, X.-Q. Cheng, and L. Guo. Survey and taxonomy of feature selection algorithms in intrusion detection system. In Proceedings of Inscrypt 2006, LNCS 4318, pp. 153-167, 2006.