
Towards a Generic Feature-Selection Measure for Intrusion Detection

Hai Thanh Nguyen, Katrin Franke and Slobodan Petrovi´c

Norwegian Information Security Laboratory

Gjøvik University College, Norway

{hai.nguyen, katrin.franke, slobodan.petrovic}@hig.no

Abstract

Performance of a pattern recognition system depends strongly on the employed feature-selection method. We perform an in-depth analysis of two main measures used in the filter model: the correlation-feature-selection (CFS) measure and the minimal-redundancy-maximal-relevance (mRMR) measure. We show that these measures can be fused and generalized into a generic feature-selection (GeFS) measure. Further on, we propose a new feature-selection method that ensures globally optimal feature sets. The new approach is based on solving a mixed 0-1 linear programming (M01LP) problem by using the branch-and-bound algorithm. In this M01LP problem, the number of constraints and variables is linear (O(n)) in the number n of full-set features. In order to evaluate the quality of our GeFS measure, we chose the design of an intrusion detection system (IDS) as a possible application. Experimental results obtained over the KDD Cup'99 test data set for IDS show that the GeFS measure removes 93% of irrelevant and redundant features from the original data set, while keeping or yielding an even better classification accuracy.

1. Introduction

An intrusion detection system (IDS) can be considered a pattern recognition system, in which feature selection is an important pre-processing step. By removing irrelevant and redundant features, we can improve classification performance and reduce the computational complexity, thus increasing the available time for detecting intrusions. Most of the feature selection work in intrusion detection practice is still done manually, and the quality of the selected features depends strongly on expert knowledge. For automatic feature selection, the wrapper and the filter models from machine learning are frequently applied [7]. The wrapper model assesses selected features by a learning algorithm's performance; therefore, the wrapper method requires a lot of time and computational resources to find the best feature subsets. The filter model considers statistical characteristics of a data set directly, without involving any learning algorithm. Due to this computational efficiency, the filter method is usually used to select features from high-dimensional data sets, such as those arising in intrusion detection. The filter model encompasses two groups of methods: feature-ranking methods and feature-subset-evaluating methods. Feature-ranking methods assign weights to features individually, based on their relevance to the target concept. Feature-subset-evaluating methods estimate feature subsets not only by their relevance, but also by the relationships between features that make certain features redundant. It is well known that redundant features can reduce the performance of a pattern recognition system; therefore, feature-subset-evaluating methods are more suitable for selecting features for intrusion detection. A major challenge in the IDS feature-selection process is to choose appropriate measures that can precisely determine the relevance of, and the relationships between, the features of a given data set.

Since the relevance and the relationships are usually characterized in terms of correlation or mutual information [5, 6], we focus on two measures: the correlation-feature-selection (CFS) measure [1] and the minimal-redundancy-maximal-relevance (mRMR) measure [2]. We show that these two measures can be fused and generalized into a generic feature-selection (GeFS) measure. We reformulate the feature selection problem by means of the GeFS measure as a polynomial mixed 0-1 fractional programming (PM01FP) problem. We improve Chang's method [3] in order to equivalently reduce this PM01FP problem to a mixed 0-1 linear programming (M01LP) problem. Finally, we propose to use the branch-and-bound algorithm to solve this M01LP problem, whose optimal solution is also the globally optimal feature subset. Experimental results obtained over the KDD Cup'99 data set [4] show that

the GeFS measure removes 93% of irrelevant and redundant features from the original data set, while keeping or yielding an even better classification accuracy.

(2010 International Conference on Pattern Recognition, 1051-4651/10 $26.00 © 2010 IEEE, DOI 10.1109/ICPR.2010.378)

The paper is organized as follows. Section 2 formally defines the generic feature-selection (GeFS) measure and describes the CFS and mRMR measures as instances. The background regarding the PM01FP and M01LP problems and Chang's method [3] is introduced in Section 3. Section 4 describes our new feature-selection method for obtaining globally optimal feature subsets. We present experimental results in Section 5. The last section summarizes our findings.

2. A Generic Feature-Selection Measure

2.1. Deﬁnitions

Definition 1: A generic feature-selection measure used in the filter model is a function $GeFS(x)$, which has the following form with $x = (x_1, \ldots, x_n)$:

$$GeFS(x) = \frac{a_0 + \sum_{i=1}^{n} A_i(x)\,x_i}{b_0 + \sum_{i=1}^{n} B_i(x)\,x_i}, \quad x \in \{0,1\}^n \quad (1)$$

In this definition, binary values of the variable $x_i$ indicate the appearance ($x_i = 1$) or the absence ($x_i = 0$) of the feature $f_i$; $a_0$, $b_0$ are constants; $A_i(x)$, $B_i(x)$ are linear functions of the variables $x_1, \ldots, x_n$.

Definition 2: The feature selection problem is to find $x \in \{0,1\}^n$ that maximizes the function $GeFS(x)$:

$$\max_{x\in\{0,1\}^n} GeFS(x) = \frac{a_0 + \sum_{i=1}^{n} A_i(x)\,x_i}{b_0 + \sum_{i=1}^{n} B_i(x)\,x_i} \quad (2)$$
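Definition 2 can be made concrete by exhaustive search: for small $n$ the $2^n$ binary vectors can simply be enumerated (the paper's contribution, developed in Section 4, is precisely how to avoid this enumeration). This is a minimal sketch; the coefficient matrices are hypothetical, and the linear functions are modeled homogeneously as $A_i(x) = \sum_j A[i][j]\,x_j$:

```python
from itertools import product

def gefs(x, a0, b0, A, B):
    """Evaluate the GeFS measure (1). A[i] and B[i] are coefficient rows,
    so A_i(x) = sum_j A[i][j]*x[j] is a linear function of x."""
    n = len(x)
    num = a0 + sum(sum(A[i][j] * x[j] for j in range(n)) * x[i] for i in range(n))
    den = b0 + sum(sum(B[i][j] * x[j] for j in range(n)) * x[i] for i in range(n))
    return num / den

def maximize_gefs(n, a0, b0, A, B):
    """Definition 2 by brute force: the x in {0,1}^n maximizing GeFS(x).
    Only feasible for small n, since it enumerates 2^n candidates."""
    best_x, best_val = None, float("-inf")
    for x in product((0, 1), repeat=n):
        den = b0 + sum(sum(B[i][j] * x[j] for j in range(n)) * x[i] for i in range(n))
        if den <= 0:                      # measure undefined for this x: skip
            continue
        val = gefs(x, a0, b0, A, B)
        if val > best_val:
            best_x, best_val = x, val
    return best_x, best_val

# Hypothetical 3-feature instance (diagonal A and B for readability), so
# GeFS(x) = (2*x1 + x2 - x3) / (1 + x1 + x2 + x3); the maximum value is 1.0.
A = [[2, 0, 0], [0, 1, 0], [0, 0, -1]]
B = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
x_star, val = maximize_gefs(3, a0=0.0, b0=1.0, A=A, B=B)
```

Candidates with a non-positive denominator are skipped, since the fractional measure is undefined there.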

There are several feature-selection measures that can be represented in the form (1), such as the correlation-feature-selection (CFS) measure, the minimal-redundancy-maximal-relevance (mRMR) measure, and the Mahalanobis distance.

2.2. The mRMR Feature Selection Measure

In 2005, Peng et al. [2] proposed a feature-selection method based on mutual information, in which relevant features and redundant features are considered simultaneously. In terms of mutual information, the relevance of a feature set $S$ for the class $c$ is defined by the average value of all mutual information values between the individual features $f_i$ and the class $c$: $D(S,c) = \frac{1}{|S|}\sum_{f_i \in S} I(f_i; c)$. The redundancy of all features in the set $S$ is the average value of all mutual information values between the features $f_i$ and $f_j$: $R(S) = \frac{1}{|S|^2}\sum_{f_i, f_j \in S} I(f_i; f_j)$.

The mRMR criterion is a combination of the two measures given above and is defined as follows:

$$\max_{S}\left[\frac{1}{|S|}\sum_{f_i\in S} I(f_i; c) - \frac{1}{|S|^2}\sum_{f_i,f_j\in S} I(f_i; f_j)\right] \quad (3)$$

Suppose that there are $n$ features in the full set. We use binary values of the variable $x_i$ in order to indicate the appearance ($x_i = 1$) or the absence ($x_i = 0$) of the feature $f_i$ in the globally optimal feature set. We denote the mutual information values $I(f_i; c)$, $I(f_i; f_j)$ by constants $c_i$, $a_{ij}$, respectively. Therefore, the problem (3) can be described as an optimization problem as follows:

$$\max_{x\in\{0,1\}^n}\left[\frac{\sum_{i=1}^{n} c_i x_i}{\sum_{i=1}^{n} x_i} - \frac{\sum_{i,j=1}^{n} a_{ij} x_i x_j}{\left(\sum_{i=1}^{n} x_i\right)^2}\right] \quad (4)$$

It is obvious that the mRMR measure is an instance of the GeFS measure, which we denote by $GeFS_{mRMR}$.
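Given estimates of the constants $c_i = I(f_i; c)$ and $a_{ij} = I(f_i; f_j)$, objective (4) is straightforward to evaluate. A minimal sketch with made-up mutual-information values (estimating them from data is a separate step not shown here):

```python
def mrmr_objective(x, c, a):
    """GeFS_mRMR objective (4): mean relevance of the selected features
    minus their mean pairwise redundancy; x is a binary selection vector."""
    k = sum(x)
    if k == 0:
        return float("-inf")              # empty subset is not a valid choice
    n = len(x)
    relevance = sum(c[i] * x[i] for i in range(n)) / k
    redundancy = sum(a[i][j] * x[i] * x[j]
                     for i in range(n) for j in range(n)) / (k * k)
    return relevance - redundancy

# Made-up values for 3 features: f1 and f2 are relevant but mutually redundant.
c = [0.8, 0.7, 0.1]                       # c_i = I(f_i; class)
a = [[1.0, 0.9, 0.0],                     # a_ij = I(f_i; f_j), symmetric
     [0.9, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
score = mrmr_objective([1, 1, 0], c, a)   # relevance 0.75 minus redundancy 0.95
```

Note that the double sum follows (4) literally and therefore includes the diagonal terms $a_{ii}$.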

2.3. Correlation Feature Selection Measure

The Correlation Feature Selection (CFS) measure evaluates subsets of features on the basis of the following hypothesis: "Good feature subsets contain features highly correlated with the classification, yet uncorrelated to each other" [1]. The following equation gives the merit of a feature subset $S$ consisting of $k$ features:

$$Merit_{S_k} = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}}$$

Here, $\overline{r_{cf}}$ is the average value of all feature-classification correlations, and $\overline{r_{ff}}$ is the average value of all feature-feature correlations. The CFS criterion is defined as follows:

$$\max_{S_k}\left[\frac{r_{cf_1} + r_{cf_2} + \cdots + r_{cf_k}}{\sqrt{k + 2(r_{f_1f_2} + \cdots + r_{f_if_j} + \cdots + r_{f_kf_1})}}\right] \quad (5)$$

By using binary values of the variable $x_i$, as in the case of the mRMR measure, to indicate the appearance or the absence of the feature $f_i$, we can also rewrite the problem (5) as an optimization problem as follows:

$$\max_{x\in\{0,1\}^n} \frac{\left(\sum_{i=1}^{n} a_i x_i\right)^2}{\sum_{i=1}^{n} x_i + \sum_{i\neq j} 2 b_{ij} x_i x_j} \quad (6)$$

It is obvious that the CFS measure is an instance of the GeFS measure. We denote this measure by $GeFS_{CFS}$.
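Similarly, with the correlation constants $a_i = r_{cf_i}$ and $b_{ij} = r_{f_if_j}$ in hand, objective (6), the squared CFS merit, can be evaluated directly. A minimal sketch; the correlations below are made up for illustration:

```python
def cfs_objective(x, a, b):
    """GeFS_CFS objective (6): squared CFS merit of the subset encoded by x,
    with a[i] = r(class, f_i) and b[i][j] = r(f_i, f_j).
    The pair sum runs over i < j, matching the 2*(r_f1f2 + ...) term in (5)."""
    n = len(x)
    k = sum(x)
    if k == 0:
        return 0.0
    num = sum(a[i] * x[i] for i in range(n)) ** 2
    den = k + sum(2 * b[i][j] * x[i] * x[j]
                  for i in range(n) for j in range(i + 1, n))
    return num / den

# Made-up correlations: f1 and f2 are relevant but highly inter-correlated.
a = [0.9, 0.8, 0.2]
b = [[1.0, 0.7, 0.1],
     [0.7, 1.0, 0.1],
     [0.1, 0.1, 1.0]]
score = cfs_objective([1, 1, 0], a, b)    # (0.9 + 0.8)**2 / (2 + 2*0.7)
```

Squaring the merit leaves the maximizer unchanged when the numerator is non-negative, which is why (6) drops the square root of (5).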

3. Polynomial Mixed 0-1 Fractional Programming

A general polynomial mixed 0-1 fractional programming (PM01FP) problem [3] is represented as follows, where s.t. denotes the set of constraints:

$$\min \sum_{i=1}^{m} \frac{a_i + \sum_{j=1}^{n} a_{ij}\prod_{k\in J} x_k}{b_i + \sum_{j=1}^{n} b_{ij}\prod_{k\in J} x_k} \quad (7)$$

$$\text{s.t.}\quad \begin{cases} b_i + \sum_{j=1}^{n} b_{ij}\prod_{k\in J} x_k > 0, & i = \overline{1,m},\\ c_p + \sum_{j=1}^{n} c_{pj}\prod_{k\in J} x_k \le 0, & p = \overline{1,m},\\ x_k \in \{0,1\},\ k\in J;\quad a_i, b_i, c_p, a_{ij}, b_{ij}, c_{pj} \in \mathbb{R}. \end{cases}$$

By replacing the denominators in (7) by positive variables $y_i$ ($i = \overline{1,m}$), the PM01FP then leads to the following equivalent polynomial mixed 0-1 programming problem:

$$\min \sum_{i=1}^{m} \left( a_i y_i + \sum_{j=1}^{n} a_{ij}\prod_{k\in J} x_k\, y_i \right) \quad (8)$$

$$\text{s.t.}\quad \begin{cases} b_i y_i + \sum_{j=1}^{n} b_{ij}\prod_{k\in J} x_k\, y_i = 1;\quad y_i > 0,\\ c_p + \sum_{j=1}^{n} c_{pj}\prod_{k\in J} x_k \le 0,\quad p = \overline{1,m},\\ x_k \in \{0,1\};\quad a_i, b_i, c_p, a_{ij}, b_{ij}, c_{pj} \in \mathbb{R}. \end{cases} \quad (9)$$

In order to solve this problem, Chang [3] proposed a linearization technique to transfer the terms $\prod_{k\in J} x_k\, y_i$ into a set of mixed 0-1 linear inequalities. Based on this technique, the PM01FP then becomes a mixed 0-1 linear programming (M01LP) problem, which can be solved by means of the branch-and-bound method to obtain the global solution.

Proposition 1: A polynomial mixed 0-1 term $\prod_{k\in J} x_k\, y_i$ from (8) can be represented by the following program [3], where $M$ is a large positive value:

$$\min z_i \quad \text{s.t.}\quad \begin{cases} z_i \ge 0,\\ z_i \ge M\left(\sum_{k\in J} x_k - |J|\right) + y_i. \end{cases} \quad (10)$$

Proposition 2: A polynomial mixed 0-1 term $\prod_{k\in J} x_k\, y_i$ from (9) can be represented by a continuous variable $v_i$, subject to the following linear inequalities [3], where $M$ is a large positive value:

$$\begin{cases} v_i \ge M\left(\sum_{k\in J} x_k - |J|\right) + y_i,\\ v_i \le M\left(|J| - \sum_{k\in J} x_k\right) + y_i,\\ 0 \le v_i \le M x_i. \end{cases} \quad (11)$$
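The behaviour of these big-M constraints is easy to verify numerically: when every $x_k$, $k\in J$, equals 1, the bound in (10) pins the minimal $z_i$ to $y_i$; when any $x_k = 0$, the big-M term drives the bound below zero, so the minimum is 0. In both cases this equals $\prod_{k\in J} x_k \cdot y_i$. A small sketch with hypothetical values, where $M = 10^6$ stands in for "large":

```python
def min_z(xs, y, M=1e6):
    """Proposition 1: minimal z_i subject to z_i >= 0 and
    z_i >= M*(sum(xs) - |J|) + y_i. For binary xs and y > 0 this
    equals prod(xs) * y, i.e. the linearization is exact."""
    return max(0.0, M * (sum(xs) - len(xs)) + y)

def v_interval(xs, y, M=1e6):
    """Proposition 2: feasible interval [lo, hi] for v_i from (11),
    ignoring the 0 <= v_i <= M*x_i cap; it collapses to {y} when
    all x_k = 1 and contains 0 otherwise."""
    lo = M * (sum(xs) - len(xs)) + y
    hi = M * (len(xs) - sum(xs)) + y
    return max(0.0, lo), hi

y = 0.25
for xs in [(1, 1, 1), (1, 0, 1), (0, 0, 0)]:
    p = 1
    for v in xs:
        p *= v                            # prod(xs): the nonlinear term
    assert min_z(xs, y) == p * y          # linearized value equals the product
assert v_interval((1, 1, 1), y) == (y, y)
```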

We now formulate the feature selection problem (2) as a polynomial mixed 0-1 fractional programming (PM01FP) problem.

Proposition 3: The feature selection problem (2) is a polynomial mixed 0-1 fractional programming (PM01FP) problem.

Remark: By applying Chang's method [3], we can transform this PM01FP problem into an M01LP problem. The number of variables and constraints is quadratic ($O(n^2)$) in the number $n$ of full-set features, because the number of terms $x_i x_j$ in (2) that are replaced by new variables is $n(n+1)/2$. The branch-and-bound algorithm can then be used to solve this M01LP problem, but the efficiency of the method depends strongly on the number of variables and constraints: the larger the number of variables and constraints an M01LP problem has, the more expensive the branch-and-bound algorithm becomes.

In the next section, we present an improvement of Chang's method that yields an M01LP problem in which the number of variables and constraints is linear ($O(n)$) in the number $n$ of full-set features.

4. Optimization of the GeFS Measure

By introducing an additional positive variable, denoted by $y$, we now consider the following problem, which is equivalent to (2):

$$\min_{x\in\{0,1\}^n} \left(-GeFS(x)\right) = -a_0 y - \sum_{i=1}^{n} A_i(x)\,x_i\,y \quad (12)$$

$$\text{s.t.}\quad b_0 y + \sum_{i=1}^{n} B_i(x)\,x_i\,y = 1;\quad y > 0. \quad (13)$$

This problem is transformed into a mixed 0-1 linear programming problem as follows:

Proposition 4: A term $A_i(x)\,x_i\,y$ from (12) can be represented by the following program, where $M$ is a large positive value:

$$\min z_i \quad \text{s.t.}\quad \begin{cases} z_i \ge 0,\\ z_i \ge M(x_i - 1) + A_i(x)\,y. \end{cases} \quad (14)$$

Proposition 5: A term $B_i(x)\,x_i\,y$ from (13) can be represented by a continuous variable $v_i$, subject to the following linear inequality constraints, where $M$ is a large positive value:

$$\begin{cases} v_i \ge M(x_i - 1) + B_i(x)\,y,\\ v_i \le M(1 - x_i) + B_i(x)\,y,\\ 0 \le v_i \le M x_i. \end{cases} \quad (15)$$
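As with Propositions 1 and 2, one can check numerically that (14) recovers the product: with $x_i = 1$ the binding constraint is $z_i \ge A_i(x)\,y$, while with $x_i = 0$ the big-M term makes that bound negative, so the minimum under $z_i \ge 0$ is 0. This sketch assumes $A_i(x)\,y \ge 0$, mirroring $y_i > 0$ in Proposition 1, and uses a hypothetical value for $A_i(x)\,y$:

```python
def min_zi(x_i, Ai_y, M=1e6):
    """Proposition 4: minimal z_i subject to z_i >= 0 and
    z_i >= M*(x_i - 1) + A_i(x)*y. For binary x_i and Ai_y >= 0 this
    equals x_i * Ai_y, i.e. the term A_i(x)*x_i*y of (12)."""
    return max(0.0, M * (x_i - 1) + Ai_y)

Ai_y = 0.6                                # hypothetical value of A_i(x) * y
assert min_zi(1, Ai_y) == 1 * Ai_y        # x_i = 1: z_i = A_i(x) * y
assert min_zi(0, Ai_y) == 0.0             # x_i = 0: z_i = 0
```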

We substitute each term $x_i\,y$ in (14) and (15) by a new variable $t_i$ satisfying the constraints from Proposition 2.

Table 1. Number of selected features

Data Set     Full-set   GeFS_mRMR   GeFS_CFS
Nor&DoS      41         22          3
Nor&Probe    41         14          6
Nor&U2R      41         5           1
Nor&R2L      41         6           2


Table 2. Classification accuracies of C4.5 and BayesNet performed on the KDD Cup'99 data set

             ----------- C4.5 -----------    --------- BayesNet ---------
Data Set     Full-Set  GeFS_mRMR  GeFS_CFS   Full-Set  GeFS_mRMR  GeFS_CFS
Nor&DoS      97.80     99.98      98.89      99.99     99.36      98.87
Nor&Probe    99.98     99.35      99.70      98.96     98.65      97.63
Nor&U2R      99.97     99.94      99.96      99.85     99.94      99.95
Nor&R2L      98.70     99.19      99.11      99.33     99.17      98.81
Average      99.11     99.61      99.42      99.53     99.28      98.82

The total number of variables in the M01LP problem will be $4n+1$, as they are $x_i$, $y$, $t_i$, $z_i$ and $v_i$ ($i = \overline{1,n}$). Therefore, the number of constraints on these variables will also be a linear function of $n$. As mentioned above, with Chang's method [3] the number of variables and constraints depends on the square of $n$; thus our new method improves on that method by reducing the complexity of the branch-and-bound algorithm.

5. Experimental Results

For evaluating our new GeFS measure, we conducted an experiment on the KDD Cup'99 data set [4]. The goal was to find optimal feature subsets by means of the $GeFS_{CFS}$ and $GeFS_{mRMR}$ measures. These subsets were then compared with each other using the classification accuracies of two machine-learning algorithms: C4.5 and BayesNet.

We performed our experiment using 10% of the overall (5 million instances) KDD Cup'99 data set [4]. This data set contains normal traffic (Nor) and four attack classes: Denial of Service (DoS), Probe, User to Root (U2R) and Remote to Local (R2L) attacks. As the attack classes are distributed very differently (e.g., the ratio of the number of U2R instances to the number of DoS instances is $1.3 \times 10^{-4}$), a feature selection algorithm might concentrate only on the most frequent class and neglect the others. Therefore, we chose to process these attack classes separately. The $GeFS_{CFS}$ measure was compared with the $GeFS_{mRMR}$ measure regarding the number of selected features and the 5-fold cross-validation classification accuracies of the BayesNet and C4.5 algorithms. All the obtained results are listed in Tables 1 and 2.

It can be observed from Tables 1 and 2 that $GeFS_{CFS}$ removes 93% of the irrelevant and redundant features, while keeping or yielding an even better classification accuracy. The $GeFS_{CFS}$ measure outperforms the $GeFS_{mRMR}$ measure by removing more than 21% of the redundant features.

6. Conclusions

We have studied two main feature-selection measures used in the filter model: the CFS measure and the mRMR measure. We showed that these two measures can be fused and generalized into a generic feature-selection (GeFS) measure. We proposed a new, efficient approach that ensures globally optimal feature sets. The new approach is based on solving a mixed 0-1 linear programming problem by using the branch-and-bound algorithm, with a number of constraints and variables that is linear in the number of full-set features. Experimental results obtained over the KDD Cup'99 test data set for intrusion detection systems show that the GeFS measure removes 93% of irrelevant and redundant features from the original data set, while keeping or yielding an even better classification accuracy.

References

[1] M. Hall. Correlation-based Feature Selection for Machine Learning. Doctoral dissertation, Department of Computer Science, University of Waikato, 1999.

[2] H. Peng, F. Long, and C. Ding. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226-1238, 2005.

[3] C.-T. Chang. On the polynomial mixed 0-1 fractional programming problems. European Journal of Operational Research, vol. 131, no. 1, pp. 224-227, 2001.

[4] KDD Cup 1999 data set. http://www.sigkdd.org/kddcup/index.php?section=1999&method=data.

[5] I. Guyon, S. Gunn, M. Nikravesh, and L. A. Zadeh. Feature Extraction: Foundations and Applications. Studies in Fuzziness and Soft Computing, Physica-Verlag, Springer, 2006.

[6] H. Liu and H. Motoda. Computational Methods of Feature Selection. Chapman & Hall/CRC, 2008.

[7] Y. Chen, Y. Li, X.-Q. Cheng, and L. Guo. Survey and taxonomy of feature selection algorithms in intrusion detection system. In Proceedings of Inscrypt 2006, LNCS 4318, pp. 153-167, 2006.
