# Optimizing a Class of Feature Selection Measures

**ABSTRACT** Feature selection is an important processing step in machine learning and the design of pattern-recognition systems. A major challenge consists in the selection of relevant features in cases of high-dimensional data sets. In order to tackle the computational complexity, heuristic, sequential or random search strategies are applied frequently. These methods, however, often yield only locally optimal feature sets that might be globally sub-optimal. The aim of our research is to derive a new, efficient approach that ensures globally optimal feature sets. We focus on the so-called filter methods. We show that a number of feature-selection measures, e.g., the correlation-feature-selection measure, the minimal-redundancy-maximal-relevance measure and others, can be fused and generalized. We formulate the feature selection problem as a polynomial mixed 0-1 fractional programming problem (PM01FP). To solve it, we transform the PM01FP problem into a mixed 0-1 linear programming (M01LP) problem. This transformation is performed by applying an improved version of Chang's method of grouping additional variables. To obtain the globally optimal solution to the M01LP problem, the branch-and-bound algorithm can be used. Experimental results obtained over the UCI database show that our globally optimal method outperforms other heuristic search procedures by removing up to 10% more redundant or confusing features from the original data set, while keeping or even improving classification accuracy.



Hai Nguyen, Katrin Franke and Slobodan Petrovi´ c

NISlab, Department of Computer Science and Media Technology,

Gjøvik University College, P.O. box 191, N-2802 Gjøvik, Norway

Email:{hai.nguyen, katrin.franke, slobodan.petrovic}@hig.no


## 1 Introduction

Feature selection is the first important step in data processing, e.g., in machine learning and pattern-recognition applications like intrusion detection. There are three categories of feature selection algorithms: the wrapper, the filter and the hybrid models [8, 9]. The wrapper model uses learning algorithms in feature selection and assesses the selected features by the learning algorithm's performance. The filter model considers statistical characteristics of a data set directly, without involving any learning algorithm. The filter model comprises several feature selection measures, such as the correlation-feature-selection (CFS) measure [2], the minimal-redundancy-maximal-relevance (mRMR) measure [1], the discriminant function [9] and the Mahalanobis distance [9]. The wrapper and the filter models have their own advantages and disadvantages. Features selected with the wrapper method are usually better adapted to a machine-learning algorithm if the learning algorithm is chosen in advance and used in the feature-selection process. However, the wrapper method also requires more computational resources than the filter method. Thus, the wrapper model can be applied when the number of features is small. As the number of features becomes very large, the filter model is more appropriate due to its computational efficiency. In order to combine the advantages of both models, the hybrid model has been proposed [8, 9]. With this model, some good subsets of features are first selected from a high-dimensional data set by using the filter model. The wrapper model is then applied to those good subsets to pick the best one.

A typical problem when using the measures (objective functions) applied in so-called filter methods for feature selection is that it is difficult to find the optimal subset of features. For example, with the CFS or mRMR measures and a full set of 80 features, the brute-force method needs to scan all 2^80 possible subsets of features in order to find the best one, which is impractical in general. With heuristic, sequential and random search strategies we can deal with high-dimensional data sets, but these methods usually give us locally optimal subsets of features. It is desirable to get globally optimal subsets of relevant features by means of these measures, with the hope of removing many more redundant features while keeping or even improving classification accuracy.

In this paper, we focus on generalizing several feature-selection measures applied in so-called filter methods. We formally define a generic measure and propose a new method to find the globally optimal subset of relevant features by means of an instance of this generic measure. For our study, we consider the mRMR measure, since the new method applied to this measure can also be used for optimizing other measure instances of the defined generic measure. Firstly, we reformulate the problem of feature selection by means of the mRMR measure as an optimization problem and then transform this optimization problem into a polynomial mixed 0-1 fractional programming (PM01FP) problem. We improve Chang's method [3] in order to equivalently reduce this PM01FP problem to a mixed 0-1 linear programming (M01LP) problem. Finally, we propose to use the branch-and-bound algorithm to solve this M01LP problem, whose optimal solution is also the globally optimal subset of relevant features. We have compared our globally optimal method with the heuristic search method proposed by Peng [1]. Experimental results obtained over the UCI database [7] show that our method outperforms the method from [1] by removing an additional 10% of redundant or confusing features from the original data set, while keeping or yielding an even better accuracy. The new method can also be applied to other measure instances of the generic measure of filter methods.

The paper is organized as follows. Section 2 formally defines a generic feature-selection measure and describes the mRMR measure as an instance. We show how to reformulate the problem of feature selection by means of the mRMR measure as an optimization problem and then as a polynomial mixed 0-1 fractional programming (PM01FP) problem. The background regarding PM01FP and M01LP problems and Chang's method is introduced in Section 3. Section 4 describes our new approach to obtain globally optimal subsets of features by means of the mRMR measure. We present experimental results in Section 5. The last section summarizes our findings.

## 2 A Generic Feature-Selection Measure

### 2.1 Definitions

**Definition 1**: A generic feature-selection measure applied in so-called filter methods is a function F(x) of the following form:

$$F(x) = \frac{a_0 + \sum_{i=1}^{n} a_i \prod_{k \in J_i} x_k}{b_0 + \sum_{j=1}^{n} b_j \prod_{k \in J_j} x_k}, \quad x = (x_1, x_2, \dots, x_n) \in \{0,1\}^n. \qquad (1)$$

In this definition, the binary value of the variable x_i indicates the presence (x_i = 1) or the absence (x_i = 0) of the feature f_i. For a particular measure instance of this generic measure, a_i, b_j are constants, such as correlation coefficients or mutual information values.

**Definition 2**: The feature selection task is to find x = (x_1, x_2, ..., x_n) that maximizes the function F(x):

$$\max_{x \in \{0,1\}^n} F(x) = \max_{x \in \{0,1\}^n} \frac{a_0 + \sum_{i=1}^{n} a_i \prod_{k \in J_i} x_k}{b_0 + \sum_{j=1}^{n} b_j \prod_{k \in J_j} x_k}. \qquad (2)$$

There are several feature selection measures that can be represented in the form (1), such as the correlation-feature-selection (CFS) measure, the minimal-redundancy-maximal-relevance (mRMR) measure, the discriminant function and the Mahalanobis distance. In the following, we consider the mRMR measure and propose a new method to obtain the optimal subset of features by means of this measure. The new method can also be applied to other measure instances of the generic measure defined above.
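To make Definition 1 concrete, the sketch below evaluates F(x) for one binary vector x. This is an illustration only: the coefficients a_i, b_j and the index sets J_i are invented, and for brevity the same index sets are used in the numerator and the denominator.

```python
# Hedged sketch: evaluating the generic filter measure F(x) of
# Definition 1 for a single binary vector x.  All coefficient values
# and index sets below are made up for illustration.
from math import prod

def generic_measure(x, a0, a, b0, b, J):
    """F(x) = (a0 + sum_i a_i * prod_{k in J_i} x_k)
            / (b0 + sum_j b_j * prod_{k in J_j} x_k)."""
    num = a0 + sum(ai * prod(x[k] for k in Ji) for ai, Ji in zip(a, J))
    den = b0 + sum(bj * prod(x[k] for k in Jj) for bj, Jj in zip(b, J))
    return num / den

# Hypothetical instance with n = 3 features; x selects features 0 and 2.
x = (1, 0, 1)
a0, a = 0.0, [0.5, 0.8, 0.3]   # relevance-like coefficients (invented)
b0, b = 1.0, [1.0, 1.0, 1.0]   # redundancy-like coefficients (invented)
J = [(0,), (1,), (0, 2)]       # index sets J_i (invented)

print(generic_measure(x, a0, a, b0, b, J))  # 0.8 / 3.0 ≈ 0.2667
```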


### 2.2 The mRMR Feature Selection Measure

In 2005, Peng et al. [1] proposed a new feature-selection method based on mutual information, in which relevant features and redundant features are considered simultaneously. In terms of mutual information, the relevance of a feature set S for the class c is defined by the mean value of all mutual information values between the individual features f_i and the class c as follows:

$$D(S, c) = \frac{1}{|S|} \sum_{f_i \in S} I(f_i; c). \qquad (3)$$

The redundancy of all features in the set S is the mean value R(S) of all mutual information values between pairs of features f_i and f_j:

$$R(S) = \frac{1}{|S|^2} \sum_{f_i, f_j \in S} I(f_i; f_j). \qquad (4)$$
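As a minimal illustration of equations (3) and (4), the sketch below computes D(S, c) and R(S) from precomputed mutual-information values; here `I_fc[i]` stands for I(f_i; c) and `I_ff[i][j]` for I(f_i; f_j), and all the numbers are invented.

```python
# Sketch of equations (3) and (4): relevance D(S, c) and redundancy R(S)
# from precomputed mutual-information values (invented toy numbers).

def relevance(S, I_fc):
    # D(S, c) = (1/|S|) * sum_{f_i in S} I(f_i; c)
    return sum(I_fc[i] for i in S) / len(S)

def redundancy(S, I_ff):
    # R(S) = (1/|S|^2) * sum_{f_i, f_j in S} I(f_i; f_j)
    return sum(I_ff[i][j] for i in S for j in S) / len(S) ** 2

I_fc = [0.9, 0.1, 0.6]
I_ff = [[1.0, 0.2, 0.7],
        [0.2, 1.0, 0.1],
        [0.7, 0.1, 1.0]]

S = [0, 2]
print(relevance(S, I_fc), redundancy(S, I_ff))  # ≈ 0.75 and ≈ 0.85
```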

The mRMR criterion combines the two measures given above and is defined as follows:

$$\max_S \Phi_S(D, R) = \max_S \left[ \frac{1}{|S|} \sum_{f_i \in S} I(f_i; c) - \frac{1}{|S|^2} \sum_{f_i, f_j \in S} I(f_i; f_j) \right]. \qquad (5)$$

Suppose that there are n full set features. The task of feature selection by means of this mRMR measure is to find the subset S that maximizes Φ_S(D, R) over all 2^n possible feature subsets. As we mentioned above, this task is not easy when the number n is large. Therefore, heuristic, sequential and random search strategies are usually chosen instead of the brute-force method due to their computational efficiency. Consequently, the obtained results will usually be locally optimal feature sets. It is desirable to get globally optimal feature sets. In the sequel, we propose a new method to find these sets.

Firstly, we formulate the above-mentioned task as an optimization problem. As mentioned above, we use the binary value of the variable x_i to indicate the presence (x_i = 1) or the absence (x_i = 0) of the feature f_i in the globally optimal feature set. We denote the mutual information values I(f_i; c) and I(f_i; f_j) by the constants c_i and a_ij, respectively. Therefore, the problem of selecting features by means of the mRMR measure can be described as the following optimization problem:

$$\max_{x = (x_1, \dots, x_n) \in \{0,1\}^n} F(x) = \frac{\sum_{i=1}^{n} c_i x_i}{\sum_{i=1}^{n} x_i} - \frac{\sum_{i,j=1}^{n} a_{ij} x_i x_j}{\left(\sum_{i=1}^{n} x_i\right)^2}, \qquad (6)$$

or in the following equivalent form:

$$\max_{x = (x_1, \dots, x_n) \in \{0,1\}^n} F(x) = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} (c_j - a_{ij}) x_j x_i}{\sum_{i=1}^{n} \sum_{j=1}^{n} x_j x_i}. \qquad (7)$$

It is obvious that the function F(x) in (7) can be represented in the form (1). In the next section, we consider the optimization problem stated above as a polynomial mixed 0-1 fractional programming (PM01FP) problem and show how to solve it.
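For small n, the globally optimal subset under criterion (5) can be found by exhaustive search over all 2^n − 1 non-empty subsets, which is exactly the brute-force baseline that the M01LP reformulation is meant to replace for large n. The mutual-information values below are invented for illustration; with them, the pair of strong, non-redundant features wins and the weak, redundant feature is dropped.

```python
# Brute-force maximization of the mRMR objective (5) over all non-empty
# feature subsets -- feasible only for small n.  The MI values are
# invented: feature 1 is weak and highly redundant with feature 0.
from itertools import combinations

def mrmr_score(S, I_fc, I_ff):
    D = sum(I_fc[i] for i in S) / len(S)                     # relevance (3)
    R = sum(I_ff[i][j] for i in S for j in S) / len(S) ** 2  # redundancy (4)
    return D - R

def brute_force_mrmr(I_fc, I_ff):
    n = len(I_fc)
    subsets = (S for r in range(1, n + 1) for S in combinations(range(n), r))
    best = max(subsets, key=lambda S: mrmr_score(S, I_fc, I_ff))
    return best, mrmr_score(best, I_fc, I_ff)

I_fc = [0.9, 0.05, 0.6, 0.8]            # I(f_i; c), invented
I_ff = [[1.0, 0.9, 0.7, 0.6],           # I(f_i; f_j), invented
        [0.9, 1.0, 0.1, 0.1],
        [0.7, 0.1, 1.0, 0.5],
        [0.6, 0.1, 0.5, 1.0]]

best, score = brute_force_mrmr(I_fc, I_ff)
print(best)  # (0, 3): the redundant feature 1 is dropped
```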

## 3 Polynomial Mixed 0-1 Fractional Programming

A general polynomial mixed 0-1 fractional programming (PM01FP) problem [3] is represented as follows:

$$\min \sum_{i=1}^{m} \frac{a_i + \sum_{j=1}^{n} a_{ij} \prod_{k \in J} x_k}{b_i + \sum_{j=1}^{n} b_{ij} \prod_{k \in J} x_k} \qquad (8)$$

subject to the following constraints:

$$b_i + \sum_{j=1}^{n} b_{ij} \prod_{k \in J} x_k > 0, \quad i = 1, \dots, m,$$
$$c_p + \sum_{j=1}^{n} c_{pj} \prod_{k \in J} x_k \le 0, \quad p = 1, \dots, m,$$
$$x_k \in \{0,1\}, \; k \in J; \quad a_i, b_i, c_p, a_{ij}, b_{ij}, c_{pj} \in \mathbb{R}.$$


By replacing the denominators in (8) by positive variables y_i (i = 1, ..., m), the PM01FP leads to the following equivalent polynomial mixed 0-1 programming problem:

$$\min \sum_{i=1}^{m} \left[ a_i y_i + \sum_{j=1}^{n} a_{ij} \prod_{k \in J} x_k \, y_i \right] \qquad (9)$$

subject to the following constraints:

$$b_i y_i + \sum_{j=1}^{n} b_{ij} \prod_{k \in J} x_k \, y_i = 1, \quad y_i > 0, \quad i = 1, \dots, m,$$
$$c_p + \sum_{j=1}^{n} c_{pj} \prod_{k \in J} x_k \le 0, \quad p = 1, \dots, m, \qquad (10)$$
$$x_k \in \{0,1\}, \; k \in J; \quad a_i, b_i, c_p, a_{ij}, b_{ij}, c_{pj} \in \mathbb{R}.$$

In order to solve this problem, Chang [3] proposed a linearization technique to transfer the terms $\prod_{k \in J} x_k \, y_i$ into a set of mixed 0-1 linear inequalities. Based on this technique, the PM01FP becomes a mixed 0-1 linear programming (M01LP) problem, which can be solved by means of the branch-and-bound method to obtain the global solution.

**Proposition 1**: A polynomial mixed 0-1 term $\prod_{k \in J} x_k \, y_i$ from (9) can be represented by the following program [3]:

$$\min z_i$$

subject to the following constraints:

$$z_i \ge 0,$$
$$z_i \ge M\left(\sum_{k \in J} x_k - |J|\right) + y_i, \qquad (11)$$

where M is a large positive value.

**Proposition 2**: A polynomial mixed 0-1 term $\prod_{k \in J} x_k \, y_i$ from (10) can be represented by a continuous variable v_i, subject to the following linear inequalities [3]:

$$v_i \ge M\left(\sum_{k \in J} x_k - |J|\right) + y_i,$$
$$v_i \le M\left(|J| - \sum_{k \in J} x_k\right) + y_i,$$
$$0 \le v_i \le M x_i, \qquad (12)$$

where M is a large positive value.

We now formulate the optimization problem of the mRMR measure (7) as a polynomial mixed 0-1 fractional programming (PM01FP) problem.

**Proposition 3**: The optimization problem of the mRMR measure (7) is a polynomial mixed 0-1 fractional programming (PM01FP) problem.

**Remark**: By applying Chang's method [3], we can transform this PM01FP problem into an M01LP problem. The number of variables and constraints depends on the square of n, where n is the number of features, because the number of terms x_i x_j in (7) that are replaced by new variables is n(n + 1)/2. The branch-and-bound algorithm can then be used to solve this M01LP problem, but the efficiency of the method depends strongly on the number of variables and constraints: the more variables and constraints an M01LP problem has, the more demanding the branch-and-bound algorithm becomes.

In the next section, we present an improvement of Chang's method that yields an M01LP problem in which the number of variables and constraints is linear in the number of features in the full set. We also give a new search strategy to obtain the relevant subsets of features by means of the mRMR measure.
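The following numeric check (ours, not from [3]) illustrates why Proposition 1 works: for fixed binary x and y > 0, the smallest z satisfying the two constraints equals y · ∏_{k∈J} x_k, provided M is chosen larger than y.

```python
# Numeric sanity check of Chang-style linearization (Proposition 1):
# min z  s.t.  z >= 0  and  z >= M * (sum_{k in J} x_k - |J|) + y
# reproduces the product term y * prod_{k in J} x_k for big enough M.
from itertools import product

def linearized_term(x, y, J, M=1e4):
    # If every x_k (k in J) equals 1, the second bound is exactly y;
    # otherwise it drops to at most y - M, so the minimum feasible z is:
    return max(0.0, M * (sum(x[k] for k in J) - len(J)) + y)

J, y = (0, 1, 2), 3.5
for x in product((0, 1), repeat=3):
    exact = y * x[0] * x[1] * x[2]
    assert linearized_term(x, y, J) == exact
print("linearization matches the product term on all 8 binary vectors")
```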

## 4 Optimization of the mRMR Measure

By introducing an additional positive variable, denoted by y, we now consider the following problem, equivalent to (7):

$$\min \{-F(x)\} = \sum_{i=1}^{n} \left[ \sum_{j=1}^{n} (a_{ij} - c_j) x_j \right] x_i y \qquad (13)$$

subject to the following constraints:

$$y > 0, \quad x = (x_1, x_2, \dots, x_n) \in \{0,1\}^n, \quad \sum_{i=1}^{n} \left( \sum_{j=1}^{n} x_j \right) x_i y = 1. \qquad (14)$$

Each sum in (13) and (14) contains n terms, which will be equivalently replaced by new variables with constraints following the two propositions given below:

**Proposition 4**: A polynomial mixed 0-1 term $\left[ \sum_{j=1}^{n} (a_{ij} - c_j) x_j \right] x_i y$ from (13) can be represented by the following program:

$$\min z_i$$

subject to the following constraints:

$$z_i \ge 0,$$
$$z_i \ge M(x_i - 1) + \sum_{j=1}^{n} (a_{ij} - c_j) x_j y, \qquad (15)$$

where M is a large positive value.

**Proposition 5**: A polynomial mixed 0-1 term $\sum_{j=1}^{n} x_j x_i y$ from (14) can be represented by a continuous variable v_i, subject to the following linear inequality constraints:

$$v_i \ge M(x_i - 1) + \sum_{j=1}^{n} x_j y,$$
$$v_i \le M(1 - x_i) + \sum_{j=1}^{n} x_j y,$$
$$0 \le v_i \le M x_i, \qquad (16)$$

where M is a large positive value.

We substitute each term x_i y in (15) and (16) by a new variable t_i satisfying the constraints from Proposition 2. The total number of variables for the M01LP problem is then 4n + 1, namely x_i, y, t_i, z_i and v_i (i = 1, ..., n). Therefore, the number of constraints on these variables is also a linear function of n. As mentioned above, with Chang's method [3] the number of variables and constraints depends on the square of n, so our new method improves on it by reducing the complexity of the branch-and-bound algorithm.

We now present a new search strategy for obtaining globally optimal subsets of relevant features by means of the mRMR measure.

**A new search method for the globally optimal feature set by means of the mRMR measure:**

• Step 1: Calculate all mutual information values I(f_i; c) and I(f_i; f_j) from the data set.

• Step 2: Construct the optimization problem (7) from the mutual information values calculated above. In this step, expert knowledge can be used by assigning the value 1 to the variable x_i if the feature f_i is known to be relevant and the value 0 otherwise.

• Step 3: Transform the optimization problem of the mRMR measure into a mixed 0-1 linear programming (M01LP) problem, which can be solved by using the branch-and-bound algorithm. A non-zero value of x_i in the optimal solution indicates the relevance of the feature f_i regarding the mRMR measure.
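Step 1 above requires the mutual-information values I(f_i; c) and I(f_i; f_j). A minimal plug-in estimator for discrete variables, computed from empirical joint and marginal frequencies, might look as follows; the toy data are invented, and the M01LP solving of Steps 2-3 (done with TOMLAB in the paper) is out of scope here.

```python
# Step 1 sketch: a plug-in mutual-information estimator for discrete
# sequences, from empirical counts.  Toy data, not from the paper.
from collections import Counter
from math import log2

def mutual_information(a, b):
    """Plug-in estimate of I(a; b) in bits for two discrete sequences."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    # I(a; b) = sum_{x,y} p(x,y) * log2( p(x,y) / (p(x) * p(y)) )
    return sum((c / n) * log2(c * n / (pa[x] * pb[y]))
               for (x, y), c in pab.items())

# f0 determines the class exactly, f2 is independent noise:
f0 = [0, 0, 0, 0, 1, 1, 1, 1]
f2 = [0, 1, 0, 1, 0, 1, 0, 1]
c  = list(f0)

print(mutual_information(f0, c))  # 1.0: f0 carries one full bit about c
print(mutual_information(f2, c))  # 0.0: f2 is irrelevant to c
```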

## 5 Experimental Results

In order to validate our new method, we conducted an experiment on data sets from the public UCI database [7]. The goal was to find optimal feature subsets by means of the mRMR measure. These subsets were then compared with the full set of features and with the features selected by Peng's heuristic search [1], using the classification accuracy of the applied machine-learning algorithms. We chose 5 data sets from the UCI database. Some of them have a large number of features, such as "Ticdata" with 85 features or "Test lung s3" with 325 features. We applied the C4.5 [4] and BayesNet [5] machine-learning algorithms with 5-fold cross-validation to evaluate the classification accuracy on the full sets of features and on the selected features. In order to solve the M01LP problem, we used the TOMLAB tool [6]. All the obtained results are listed in Table 1 and Table 2.


It can be observed from Table 1 that our new method significantly reduces the dimensionality of the original feature sets and selects a smaller number of relevant features than the heuristic search. At the same time, our new method still matches or even improves the performance of the C4.5 and BayesNet classifiers, as can be seen from Table 2.

Table 1: Number of selected features (H-S: heuristic search)

| Data Set | Full-set | Our-method | H-S |
|---|---:|---:|---:|
| Zoo | 16 | 8 | 10 |
| Ticdata | 85 | 1 | 5 |
| Optdigits | 64 | 31 | 30 |
| Test lung s3 | 325 | 25 | 30 |
| Letter recognition | 16 | 12 | 10 |

Table 2: Classification accuracies of C4.5 and BayesNet

| Data Set | C4.5 Full-Set | C4.5 Our-method | C4.5 H-S | BayesNet Full-Set | BayesNet Our-method | BayesNet H-S |
|---|---:|---:|---:|---:|---:|---:|
| Zoo | 94.06 | 94.06 | 93.06 | 94.06 | 93.07 | 93.06 |
| Ticdata | 93.91 | 94.02 | 94.02 | 86.12 | 94.02 | 93.90 |
| Optdigits | 90.03 | 90.40 | 90.56 | 92.73 | 92.52 | 91.86 |
| Test lung s3 | 58.90 | 68.49 | 68.49 | 83.56 | 87.67 | 86.30 |
| Letter recognition | 87.33 | 87.82 | 87.18 | 74.08 | 74.34 | 73.41 |

## 6 Conclusion

We have studied feature selection methods for high-dimensional data sets. We have formally generalized the feature-selection measures applied in so-called filter methods. For our study, we have considered the mRMR measure as an instance of this generic measure. We have also proposed a new search method to get the globally optimal subset of relevant features. Further on, we transformed the mRMR optimization problem into a polynomial mixed 0-1 fractional programming (PM01FP) problem. From this PM01FP problem, we used our improved version of Chang's method [3] to obtain a mixed 0-1 linear programming (M01LP) problem in which the number of constraints and variables depends linearly on the number of features in the full set. We applied the branch-and-bound algorithm to solve that M01LP problem. Experimental results show that our method outperforms Peng's method [1] by removing an additional 10% of redundant or confusing features from the original data set, while keeping or yielding even better classification accuracy. Our new method can also be applied to any measure instance derived from the defined generic measure of filter methods.

## References

[1] H. Peng, F. Long, and C. Ding. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226-1238, 2005.

[2] M. Hall. Correlation-Based Feature Selection for Machine Learning. Doctoral dissertation, Department of Computer Science, University of Waikato, 1999.

[3] C.-T. Chang. On the polynomial mixed 0-1 fractional programming problems. European Journal of Operational Research, vol. 131, no. 1, pp. 224-227, 2001.

[4] J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[5] R.O. Duda, P.E. Hart, and D.G. Stork. Pattern Classification. John Wiley & Sons, 2001.

[6] TOMLAB: the optimization environment in MATLAB. http://tomopt.com/tomlab/.

[7] UCI Machine Learning Repository. http://archive.ics.uci.edu/ml/.

[8] I. Guyon, S. Gunn, M. Nikravesh, and L.A. Zadeh. Feature Extraction: Foundations and Applications. Studies in Fuzziness and Soft Computing, Springer, 2006.

[9] H. Liu and H. Motoda. Computational Methods of Feature Selection. Chapman & Hall/CRC, 2008.
