
Combined SVM-based Feature Selection and Classification

Julia Neumann (jneumann@uni-mannheim.de)∗, Christoph Schnörr (schnoerr@uni-mannheim.de) and Gabriele Steidl (steidl@uni-mannheim.de)

Dept. of Mathematics and Computer Science, University of Mannheim, D-68131 Mannheim, Germany

March 16, 2005

Abstract. Feature selection is an important combinatorial optimisation problem in the context of supervised pattern classification. This paper presents four novel continuous feature selection approaches directly minimising the classifier performance. In particular, we include linear and nonlinear Support Vector Machine classifiers. The key ideas of our approaches are additional regularisation and embedded nonlinear feature selection. To solve our optimisation problems, we apply difference of convex functions programming, which is a general framework for non-convex continuous optimisation. Experiments with artificial data and with various real-world problems including organ classification in computed tomography scans demonstrate that our methods accomplish the desired feature selection and classification performance simultaneously.

Keywords: feature selection, SVMs, embedded methods, mathematical programming, difference of convex functions programming, non-convex optimisation

∗Phone: +49 621 181 2746, Fax +49 621 181 2744

© 2005 Kluwer Academic Publishers. Printed in the Netherlands.


1. Introduction

Overview and Related Work. In the context of supervised pattern classification, feature selection aims at picking out some of the original input dimensions (features) (i) for performance issues, by facilitating data collection and reducing storage space and classification time, (ii) to perform semantics analysis, helping to understand the problem, and (iii) to improve prediction accuracy by avoiding the “curse of dimensionality” (cf. (Guyon and Elisseeff, 2003)).

According to (Guyon and Elisseeff, 2003; John et al., 1994; Bradley, 1998), feature selection approaches divide into filters, wrappers and embedded approaches. Most known approaches are filters, which act as a preprocessing step independently of the final classifier (Hermes and Buhmann, 2000; Duda et al., 2000). In contrast, wrappers take the classifier into account as a black box (John et al., 1994; Weston et al., 2001). An example of a wrapper method for nonlinear SVMs is (Weston et al., 2001), where, instead of minimising the classification error, the features are selected to minimise a generalisation error bound. Finally, embedded approaches determine features and classifier simultaneously during the training process. The embedded methods in (Bradley and Mangasarian, 1998) are based on a linear classifier. As for the wrapper methods, to date only a few embedded methods address feature selection in connection with nonlinear classifiers. An embedded approach for the quadratic 1-norm SVM was suggested in (Zhu et al., 2004). The authors penalise the features by the ℓ1-norm and apply the nonlinear mapping explicitly. This makes the approach feasible only for low-dimensional feature maps such as the quadratic one. In particular, original features are not suppressed, so that no performance improvements or semantics analysis are possible. Finally, in (Jebara and Jaakkola, 2000) a feature selection method was developed as an extension to the so-called maximum entropy discrimination, i.e., from a discriminative (probabilistic) perspective.

Contribution. In this work, we focus on embedded approaches for feature selection. The starting point for our investigation is the approach of (Bradley and Mangasarian, 1998) that minimises the training errors of a linear classifier while penalising the number of features by a concave penalty approximating the ℓ0-“norm”. In this way, the linear classifier is constructed while implicitly discarding features. The first objective of our work is to extend this feature selection approach with the aim to improve the generalisation performance of the classifiers. Taking into account that the Support Vector Machine (SVM) provides good


generalisation ability by its ℓ2-regulariser, we propose new methods by introducing additional regularisation terms.

In the second part of our work, we construct feature selection methods for nonlinear SVM classifiers that directly minimise the classification objective. First, we generalise the approach for the quadratic SVM of (Zhu et al., 2004) in two directions: we apply the approximate ℓ0-penalty considered superior to the ℓ1-norm in (Bradley and Mangasarian, 1998), and we focus on feature selection in the original feature space to further improve the performance and enable semantics analysis. Next, we incorporate “kernel-target alignment” (Cristianini et al., 2002) within this framework, which performs appropriate feature selection if, e.g., the Gaussian kernel SVM is used as classifier. This approach is essentially different from multiple kernel learning techniques addressed, e.g., in (Bach et al., 2004).

Some of our new approaches require the solution of non-convex optimisation problems. To solve these problems, we apply a general difference of convex functions (d.c.) optimisation algorithm in an appropriate way. Moreover, we show that the Successive Linearization Algorithm (SLA) proposed in (Bradley and Mangasarian, 1998) for concave minimisation is in effect a special case of our general optimisation approach. A short summary of our algorithms has been announced in (Neumann et al., 2004).

Feature selection is especially profitable for high-dimensional problems. To illustrate this, we investigate, as part of our in-depth method evaluation, the problem of selecting a suitable subset from 650 image features in order to classify organs in computed tomography (CT) scans.

Organisation. After reviewing the linear embedded approaches proposed in (Bradley and Mangasarian, 1998), we introduce our enhanced approaches both for linear and nonlinear classification in Sec. 3. The d.c. optimisation approach and its application to our feature selection problems are described in Sec. 4. Numerical results illustrating and evaluating the various approaches, including the CT organ classification, are given in Sec. 5.

Notation. We denote vectors and matrices by bold small and capital letters, respectively. The matrix I denotes the identity matrix in appropriate dimensions. The vector 0 signifies a vector of zeros and e a vector of ones. All vectors will be column vectors unless transposed by the superior symbol T. If x ∈ R^n denotes a vector, in general, we will indicate its components by x_i (i = 1,...,n). We set |w| := (|w_1|, |w_2|, ...)^T and assume vector inequalities to hold componentwise. Furthermore, [−v, v] for v ∈ R^d signifies the cuboid {w ∈ R^d : −v ≤ w ≤ v}. We use the function x_+ := max(x, 0) and the indicator function χ_C of a feasible


convex set C, which is defined by χ_C(x) = 0 if x ∈ C, and χ_C(x) = ∞ otherwise.

2. Classifier Regularisation and Feature Penalties

Given a training set {(x_i, y_i) ∈ X × {−1, 1} : i = 1,...,n} with X ⊂ R^d, the first goal is to find a classifier F : X → {−1, 1}. We will introduce in Sec. 2.1 the linear classifier on which the presented embedded feature selection approaches are based, and then add penalties for feature suppression and for improving the generalisation performance in Sec. 2.2.

2.1. Robust Linear Programming

Our starting point is the class of linear classification approaches that construct two parallel bounding hyperplanes in R^d such that the differently labelled sets are maximally located in the two opposite half spaces determined by these hyperplanes. More precisely, one solves the minimisation problem

    f_RLP(w, b) := ∑_{i=1}^{n} (1 − y_i(w^T x_i + b))_+  →  min_{w ∈ R^d, b ∈ R} .        (1)

If (w, b) is the solution of (1), then the classifier is F(x) = sgn(w^T x + b). The linear method (1) was proposed as Robust Linear Programming (RLP) by Bennett and Mangasarian (Bennett and Mangasarian, 1992). Note that these authors weighted the training errors by 1/n_{±1}, where n_{±1} = |{i : y_i = ±1}|.
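To make (1) concrete, the RLP problem can be posed as a linear program over the variables (w, b, ξ) by writing the hinge terms as slack constraints. The following sketch is our own illustration, not code from the paper; the function name rlp_train, the solver choice (SciPy's linprog) and the toy data are assumptions for demonstration only.

```python
import numpy as np
from scipy.optimize import linprog

def rlp_train(X, y):
    """Solve problem (1) as an LP over z = (w, b, xi):
    minimise sum_i xi_i  s.t.  y_i (w^T x_i + b) >= 1 - xi_i,  xi >= 0."""
    n, d = X.shape
    c = np.concatenate([np.zeros(d + 1), np.ones(n)])    # cost only on the slacks xi
    # hinge constraints rewritten as  -y_i (w^T x_i + b) - xi_i <= -1
    A_ub = np.hstack([-y[:, None] * X, -y[:, None], -np.eye(n)])
    b_ub = -np.ones(n)
    bounds = [(None, None)] * (d + 1) + [(0, None)] * n  # w, b free; xi >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:d], res.x[d]

# toy linearly separable data: the first feature decides the class
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
w, b = rlp_train(X, y)
```

On separable data such as this, the optimal slacks vanish, so every training point ends up satisfying y_i(w^T x_i + b) ≥ 1.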

2.2. Regularisation and Feature Penalties

In general, optimisation approaches to statistical classification include an additional penalty term ρ besides a “goodness of fit” term such as f_RLP in (1), whose competition is controlled by a weight parameter λ ∈ [0, 1):

    min_{w ∈ R^d, b ∈ R} (1 − λ) f_RLP(w, b) + λ ρ(w) .        (2)

In the following, we consider different penalties.


2.2.1. SVM
In order to maximise the margin between the two parallel hyperplanes, the original SVM penalises the ℓ2-norm of w. Then (2) yields

    min_{w ∈ R^d, b ∈ R} (1 − λ) ∑_{i=1}^{n} (1 − y_i(w^T x_i + b))_+ + (λ/2) w^T w        (3)

which can be solved by a convex Quadratic Program (QP). The Support Vectors (SVs) are those patterns x_i for which the dual solution is positive, which implies y_i(w^T x_i + b) ≤ 1.
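The paper treats (3) as a convex QP. Purely as an illustration of the objective itself, and not of the authors' algorithm, the same functional can also be minimised by plain subgradient descent on (w, b); all names, step sizes and the toy data below are our own assumptions.

```python
import numpy as np

def svm_objective(w, b, X, y, lam):
    """Objective (3): (1 - lam) * sum of hinge losses + (lam / 2) * ||w||^2."""
    hinge = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return (1.0 - lam) * hinge.sum() + 0.5 * lam * (w @ w)

def svm_subgradient_train(X, y, lam=0.1, lr=0.01, iters=2000):
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(iters):
        margin = y * (X @ w + b)
        active = margin < 1.0                  # margin-violating patterns
        gw = -(1.0 - lam) * (y[active][:, None] * X[active]).sum(axis=0) + lam * w
        gb = -(1.0 - lam) * y[active].sum()
        w -= lr * gw
        b -= lr * gb
    return w, b

X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
w, b = svm_subgradient_train(X, y)
```

The points with margin < 1 that drive the subgradient correspond exactly to the support vectors characterised by y_i(w^T x_i + b) ≤ 1 above.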

2.2.2. ℓ1-SVM
In order to suppress features, i.e. components of the vector w, ℓp-norms of w with p < 2 are used as feature penalties. In (Bradley and Mangasarian, 1998), the ℓ1-norm (lasso penalty) ρ(w) = ‖w‖_1 led to good feature selection and classification results. Accordingly, (2) reads

    min_{w ∈ R^d, b ∈ R} (1 − λ) ∑_{i=1}^{n} (1 − y_i(w^T x_i + b))_+ + λ e^T |w|        (4)

which can be solved by a linear program. This penalty term was originally introduced in the statistical context of linear regression as the ‘lasso’ (‘Least Absolute Shrinkage and Selection Operator’) in (Tibshirani, 1996), and also applied in (Zhu et al., 2004).
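As with (1), problem (4) becomes a linear program once |w| is bounded by an auxiliary vector v with −v ≤ w ≤ v, the same device that appears in the constraints of (6) below. The sketch uses SciPy's linprog; the function name, the value λ = 0.5 and the toy data are our own illustrative assumptions, not the paper's experiments.

```python
import numpy as np
from scipy.optimize import linprog

def l1_svm_train(X, y, lam=0.5):
    """Problem (4) as an LP over z = (w, b, xi, v) with |w| <= v,
    minimising (1 - lam) * sum(xi) + lam * sum(v)."""
    n, d = X.shape
    c = np.concatenate([np.zeros(d + 1), (1.0 - lam) * np.ones(n), lam * np.ones(d)])
    # hinge constraints:  -y_i (w^T x_i + b) - xi_i <= -1
    A1 = np.hstack([-y[:, None] * X, -y[:, None], -np.eye(n), np.zeros((n, d))])
    # w - v <= 0  and  -w - v <= 0, i.e. |w| <= v componentwise
    A2 = np.hstack([np.eye(d), np.zeros((d, n + 1)), -np.eye(d)])
    A3 = np.hstack([-np.eye(d), np.zeros((d, n + 1)), -np.eye(d)])
    A_ub = np.vstack([A1, A2, A3])
    b_ub = np.concatenate([-np.ones(n), np.zeros(2 * d)])
    bounds = [(None, None)] * (d + 1) + [(0, None)] * (n + d)  # w, b free; xi, v >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:d], res.x[d]

# the second feature carries no class information, so the l1 penalty drives w_2 to zero
X = np.array([[0.0, 0.3], [0.0, 0.7], [2.0, 0.5], [2.0, 0.4]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
w, b = l1_svm_train(X, y, lam=0.5)
```

This is the feature-suppression effect described above: the irrelevant coordinate receives exactly zero weight at the LP optimum, while classification accuracy is preserved.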

2.2.3. Feature Selection Concave (FSV)
Feature selection can be further improved by using the so-called ℓ0-“norm” ‖w‖_0 := |{i : w_i ≠ 0}| (Bradley and Mangasarian, 1998; Weston et al., 2003). Note that ‖·‖_0 is no norm because, unlike for ℓp-norms (p ≥ 1), the homogeneity property does not hold. Since the ℓ0-“norm” is non-smooth, it was approximated in (Bradley and Mangasarian, 1998) by the concave functional

    ρ(w) = e^T (e − e^{−α|w|}) ≈ ‖w‖_0        (5)

with approximation parameter α ∈ R_+. Problem (2) with penalty term (5) yields, with suitable constraints, the mathematical program

    min_{w ∈ R^d, b ∈ R, ξ ∈ R^n, v ∈ R^d} (1 − λ) e^T ξ + λ e^T (e − e^{−αv})
    subject to   y_i(w^T x_i + b) ≥ 1 − ξ_i ,   i = 1,...,n ,
                 ξ ≥ 0 ,
                 −v ≤ w ≤ v        (6)

which is known as Feature Selection concaVe (FSV). Note that this problem is non-convex, and high quality solutions can be obtained