Combined SVM-based Feature Selection
and Classification
Julia Neumann (jneumann@uni-mannheim.de)∗, Christoph Schnörr
(schnoerr@uni-mannheim.de) and Gabriele Steidl
(steidl@uni-mannheim.de)
Dept. of Mathematics and Computer Science
University of Mannheim
D-68131 Mannheim, Germany
March 16, 2005
Abstract. Feature selection is an important combinatorial optimisation problem in the context of supervised pattern classification. This paper presents four novel continuous feature selection approaches directly minimising the classifier performance. In particular, we include linear and nonlinear Support Vector Machine classifiers. The key ideas of our approaches are additional regularisation and embedded nonlinear feature selection. To solve our optimisation problems, we apply difference of convex functions programming, which is a general framework for non-convex continuous optimisation. Experiments with artificial data and with various real-world problems, including organ classification in computed tomography scans, demonstrate that our methods accomplish the desired feature selection and classification performance simultaneously.

Keywords: feature selection, SVMs, embedded methods, mathematical programming, difference of convex functions programming, non-convex optimisation
∗Phone: +49 621 181 2746, Fax +49 621 181 2744
© 2005 Kluwer Academic Publishers. Printed in the Netherlands.
1. Introduction
Overview and Related Work. In the context of supervised pattern classification, feature selection aims at picking out some of the original input dimensions (features) (i) for performance issues, by facilitating data collection and reducing storage space and classification time, (ii) to perform semantics analysis, helping to understand the problem, and (iii) to improve prediction accuracy by avoiding the "curse of dimensionality" (cf. (Guyon and Elisseeff, 2003)).
According to (Guyon and Elisseeff, 2003; John et al., 1994; Bradley, 1998), feature selection approaches divide into filters, wrappers and embedded approaches. Most known approaches are filters, which act as a preprocessing step independently of the final classifier (Hermes and Buhmann, 2000; Duda et al., 2000). In contrast, wrappers take the classifier into account as a black box (John et al., 1994; Weston et al., 2001). An example of a wrapper method for nonlinear SVMs is (Weston et al., 2001), where, instead of minimising the classification error, the features are selected to minimise a generalisation error bound. Finally, embedded approaches simultaneously determine features and classifier during the training process. The embedded methods in (Bradley and Mangasarian, 1998) are based on a linear classifier. As for the wrapper methods, only few embedded methods addressing feature selection in connection with nonlinear classifiers exist up to now. An embedded approach for the quadratic 1-norm SVM was suggested in (Zhu et al., 2004). The authors penalise the features by the $\ell_1$-norm and apply the nonlinear mapping explicitly. This makes the approach feasible only for low-dimensional feature maps such as the quadratic one. In particular, original features are not suppressed, so that no performance improvements or semantics analysis are possible. Finally, in (Jebara and Jaakkola, 2000) a feature selection method was developed as an extension to the so-called maximum entropy discrimination, i.e., from a discriminative (probabilistic) perspective.
Contribution. In this work, we focus on embedded approaches for feature selection. The starting point for our investigation is the approach of (Bradley and Mangasarian, 1998) that minimises the training errors of a linear classifier while penalising the number of features by a concave penalty approximating the $\ell_0$-"norm". In this way, the linear classifier is constructed while implicitly discarding features. The first objective of our work is to extend this feature selection approach with the aim to improve the generalisation performance of the classifiers. Taking into account that the Support Vector Machine (SVM) provides good generalisation ability by its $\ell_2$-regulariser, we propose new methods by introducing additional regularisation terms.
In the second part of our work, we construct direct objective minimising feature selection methods for nonlinear SVM classifiers. First, we generalise the approach for the quadratic SVM of (Zhu et al., 2004) in two directions. We apply the approximate $\ell_0$-penalty considered superior to the $\ell_1$-norm in (Bradley and Mangasarian, 1998), and we focus on feature selection in the original feature space to further improve the performance and enable semantics analysis. Next, we incorporate "kernel-target alignment" (Cristianini et al., 2002) within this framework, which performs appropriate feature selection if, e.g., the Gaussian kernel SVM is used as classifier. This approach is essentially different from multiple kernel learning techniques addressed, e.g., in (Bach et al., 2004).

Some of our new approaches require the solution of non-convex optimisation problems. To solve these problems, we apply a general difference of convex functions (d.c.) optimisation algorithm in an appropriate way. Moreover, we show that the Successive Linearization Algorithm (SLA) proposed in (Bradley and Mangasarian, 1998) for concave minimisation is in effect a special case of our general optimisation approach. A short summary of our algorithms has been announced in (Neumann et al., 2004).
Feature selection is especially profitable for high-dimensional problems. To illustrate this, we investigate, as part of our in-depth method evaluation, the problem of selecting a suitable subset from 650 image features in order to classify organs in computed tomography (CT) scans.
Organisation. After reviewing the linear embedded approaches proposed in (Bradley and Mangasarian, 1998) in Sec. 2, we introduce our enhanced approaches both for linear and nonlinear classification in Sec. 3. The d.c. optimisation approach and its application to our feature selection problems is described in Sec. 4. Numerical results illustrating and evaluating the various approaches, including the CT organ classification, are given in Sec. 5.

Notation. We denote vectors and matrices by bold small and capital letters, respectively. The matrix $I$ denotes the identity matrix of appropriate dimensions. The vector $0$ signifies a vector of zeros and $e$ a vector of ones. All vectors will be column vectors unless transposed by the superior symbol $^T$. If $x \in \mathbb{R}^n$ denotes a vector, we will in general indicate its components by $x_i$ $(i = 1,\dots,n)$. We set $w := (w_1, w_2, \dots)^T$ and assume vector inequalities to hold componentwise. Furthermore, $[-v, v]$ for $v \in \mathbb{R}^d$ signifies the cuboid $\{w \in \mathbb{R}^d : -v \le w \le v\}$. We use the function $x_+ := \max(x, 0)$ and the indicator function $\chi_C$ of a feasible convex set $C$, which is defined by $\chi_C(x) = 0$ if $x \in C$, and $\chi_C(x) = \infty$ otherwise.
2. Classifier Regularisation and Feature Penalties
Given a training set $\{(x_i, y_i) \in X \times \{-1, 1\} : i = 1,\dots,n\}$ with $X \subset \mathbb{R}^d$, the first goal is to find a classifier $F : X \to \{-1, 1\}$. We will introduce in Sec. 2.1 the linear classifier on which the presented embedded feature selection approaches are based, and then add penalties for feature suppression and for improving the generalisation performance in Sec. 2.2.
2.1. Robust Linear Programming

Our starting point are linear classification approaches for constructing two parallel bounding hyperplanes in $\mathbb{R}^d$ such that the differently labelled sets are maximally located in the two opposite half spaces determined by these hyperplanes. More precisely, one solves the minimisation problem

$$f_{\mathrm{RLP}}(w, b) := \sum_{i=1}^{n} \big(1 - y_i(w^T x_i + b)\big)_+ \;\longrightarrow\; \min_{w \in \mathbb{R}^d,\, b \in \mathbb{R}} . \quad (1)$$

If $(w, b)$ is the solution of (1), then the classifier is $F(x) = \mathrm{sgn}(w^T x + b)$. The linear method (1) was proposed as Robust Linear Programming (RLP) by Bennett and Mangasarian (Bennett and Mangasarian, 1992). Note that these authors weighted the training errors by $1/n_{\pm 1}$, where $n_{\pm 1} = |\{i : y_i = \pm 1\}|$.
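As a small illustration (ours, not from the paper), the RLP objective (1) is simply a sum of hinge losses and can be evaluated directly; the data and the hyperplane below are made-up toy values.

```python
import numpy as np

def f_rlp(w, b, X, y):
    """RLP objective (1): sum of hinge losses (1 - y_i (w^T x_i + b))_+."""
    margins = 1.0 - y * (X @ w + b)
    return np.maximum(margins, 0.0).sum()

# Toy data in R^2, separable along the first feature.
X = np.array([[1.0, 0.5], [2.0, -0.5], [-1.0, 0.3], [-2.0, -0.2]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w, b = np.array([1.0, 0.0]), 0.0   # a separator with functional margin >= 1
loss = f_rlp(w, b, X, y)           # zero hinge loss for this hyperplane
```

Every training point with functional margin at least 1 contributes nothing to the objective; misclassified or margin-violating points contribute linearly.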
2.2. Regularisation and Feature Penalties

In general, optimisation approaches to statistical classification include an additional penalty term $\rho$ besides a "goodness of fit" term such as $f_{\mathrm{RLP}}$ in (1), whose competition is controlled by a weight parameter $\lambda \in [0, 1)$:

$$\min_{w \in \mathbb{R}^d,\, b \in \mathbb{R}} (1 - \lambda) f_{\mathrm{RLP}}(w, b) + \lambda \rho(w) . \quad (2)$$

In the following, we consider different penalties.
2.2.1. SVM

In order to maximise the margin between the two parallel hyperplanes, the original SVM penalises the $\ell_2$-norm of $w$. Then (2) yields

$$\min_{w \in \mathbb{R}^d,\, b \in \mathbb{R}} (1 - \lambda) \sum_{i=1}^{n} \big(1 - y_i(w^T x_i + b)\big)_+ + \frac{\lambda}{2} w^T w \quad (3)$$

which can be solved by a convex Quadratic Program (QP). The Support Vectors (SVs) are those patterns $x_i$ for which the dual solution is positive, which implies $y_i(w^T x_i + b) \le 1$.
2.2.2. $\ell_1$-SVM

In order to suppress features, i.e. components of the vector $w$, $\ell_p$-norms of $w$ with $p < 2$ are used as feature penalties. In (Bradley and Mangasarian, 1998), the $\ell_1$-norm (lasso penalty) $\rho(w) = \|w\|_1$ led to good feature selection and classification results. Accordingly, (2) reads

$$\min_{w \in \mathbb{R}^d,\, b \in \mathbb{R}} (1 - \lambda) \sum_{i=1}^{n} \big(1 - y_i(w^T x_i + b)\big)_+ + \lambda \|w\|_1 \quad (4)$$

which can be solved by a linear program. This penalty term was originally introduced, in the statistical context of linear regression, as the 'lasso' ('Least Absolute Shrinkage and Selection Operator') in (Tibshirani, 1996), and was also applied in (Zhu et al., 2004).
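A hedged sketch (our own, not the authors' implementation): the LP behind (4) can be written with auxiliary variables $\xi$ for the hinge losses and $v$ for $|w|$, and handed to any LP solver, here scipy's linprog. The data and $\lambda$ are made-up toy values.

```python
import numpy as np
from scipy.optimize import linprog

def l1_svm(X, y, lam):
    """l1-SVM LP (4): min (1-lam) * sum_i xi_i + lam * e^T v over (w, b, xi, v)
    subject to y_i (w^T x_i + b) >= 1 - xi_i, xi >= 0, -v <= w <= v."""
    n, d = X.shape
    # variable layout: [w (d) | b (1) | xi (n) | v (d)]
    c = np.concatenate([np.zeros(d + 1), (1 - lam) * np.ones(n), lam * np.ones(d)])
    # -y_i (w^T x_i + b) - xi_i <= -1
    A1 = np.hstack([-y[:, None] * X, -y[:, None], -np.eye(n), np.zeros((n, d))])
    # w - v <= 0  and  -w - v <= 0
    A2 = np.hstack([np.eye(d), np.zeros((d, 1)), np.zeros((d, n)), -np.eye(d)])
    A3 = np.hstack([-np.eye(d), np.zeros((d, 1)), np.zeros((d, n)), -np.eye(d)])
    A = np.vstack([A1, A2, A3])
    rhs = np.concatenate([-np.ones(n), np.zeros(2 * d)])
    bounds = [(None, None)] * (d + 1) + [(0, None)] * n + [(0, None)] * d
    res = linprog(c, A_ub=A, b_ub=rhs, bounds=bounds, method="highs")
    return res.x[:d], res.x[d]

# Toy problem: feature 1 separates the classes, feature 2 is irrelevant.
X = np.array([[1.0, 0.5], [2.0, -0.5], [-1.0, 0.3], [-2.0, -0.2]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = l1_svm(X, y, lam=0.1)
```

On this toy data the lasso penalty drives the weight of the irrelevant second feature to zero while keeping all margins intact, which is exactly the feature-suppression effect described above.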
2.2.3. Feature Selection Concave (FSV)

Feature selection can be further improved by using the so-called $\ell_0$-"norm" $\|w\|_0 = |\{i : w_i \neq 0\}|$ (Bradley and Mangasarian, 1998; Weston et al., 2003). Note that $\|\cdot\|_0$ is not a norm because, unlike for the $\ell_p$-norms $(p \ge 1)$, homogeneity fails: $\|c\,w\|_0 = \|w\|_0$ for every $c \neq 0$. Since the $\ell_0$-"norm" is non-smooth, it was approximated in (Bradley and Mangasarian, 1998) by the concave functional

$$\rho(w) = e^T(e - e^{-\alpha |w|}) \approx \|w\|_0 \quad (5)$$

with approximation parameter $\alpha \in \mathbb{R}_+$. Problem (2) with penalty term (5) yields, with suitable constraints, the mathematical program

$$\begin{aligned} \min_{w \in \mathbb{R}^d,\, b \in \mathbb{R},\, \xi \in \mathbb{R}^n,\, v \in \mathbb{R}^d} \;& (1 - \lambda) e^T \xi + \lambda e^T(e - e^{-\alpha v}) \\ \text{subject to} \;& y_i(w^T x_i + b) \ge 1 - \xi_i , \quad i = 1,\dots,n , \\ & \xi \ge 0 , \\ & -v \le w \le v \end{aligned} \quad (6)$$

which is known as Feature Selection concaVe (FSV). Note that this problem is non-convex; high-quality solutions can be obtained by, e.g., the Successive Linearization Algorithm (SLA) presented in Sec. 4.2.1.
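To give a feel for the approximation (5), a quick numeric check (illustrative only, with the value $\alpha = 5$ used later in the experiments):

```python
import numpy as np

def l0_approx(w, alpha=5.0):
    """Concave approximation (5): e^T (e - e^{-alpha |w|}), close to ||w||_0."""
    return np.sum(1.0 - np.exp(-alpha * np.abs(w)))

w = np.array([0.0, 0.0, 1.5, -2.0])
exact = np.count_nonzero(w)   # ||w||_0 = 2
approx = l0_approx(w)         # close to 2 already for alpha = 5
```

Larger $\alpha$ tightens the approximation but makes the penalty steeper near zero, which is the non-convexity the SLA/DCA iterations of Sec. 4 have to cope with.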
3. New Feature Selection Approaches

3.1. Combined $\ell_p$-Penalties

FSV performs well for feature selection. However, its classification accuracy can be improved by applying a standard SVM on the selected features only, as shown in (Jakubik, 2003) and also indicated in (Weston et al., 2003). Therefore, since the $\ell_2$-penalty term is responsible for the very good SVM classification results while the $\ell_1$- and $\ell_0$-penalty terms focus on feature selection, we suggest combinations of these terms. Consequently, we need two weight parameters $\mu, \nu \in \mathbb{R}_+$.
3.1.1. $\ell_2$-$\ell_1$-SVM

For the $\ell_2$-$\ell_1$-SVM, we are interested in solving the constrained convex QP

$$\begin{aligned} \min_{w \in \mathbb{R}^d,\, b \in \mathbb{R},\, \xi \in \mathbb{R}^n,\, v \in \mathbb{R}^d} \;& \frac{\mu}{n} e^T \xi + \frac{1}{2} w^T w + \nu e^T v \\ \text{subject to} \;& y_i(w^T x_i + b) \ge 1 - \xi_i , \quad i = 1,\dots,n , \\ & \xi \ge 0 , \\ & -v \le w \le v . \end{aligned} \quad (7)$$

It is advisable here to solve the dual problem because it involves fewer variables and has a simpler structure, similar to the SVM case.
3.1.2. $\ell_2$-$\ell_0$-SVM

For the $\ell_2$-$\ell_0$-SVM with approximate $\ell_0$-"norm", we minimise

$$f(w, b, v) := \frac{\mu}{n} \sum_{i=1}^{n} \big(1 - y_i(w^T x_i + b)\big)_+ + \frac{1}{2} w^T w + \nu e^T(e - e^{-\alpha v}) + \chi_{[-v,v]}(w) . \quad (8)$$

An appropriate approach to optimise (8) is developed in Sec. 4.
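A minimal sketch (ours, not the authors' code) of the $\ell_2$-$\ell_1$-SVM QP (7), using scipy's general-purpose SLSQP solver on a made-up toy problem; for serious use the dual QP with a dedicated solver is preferable, as noted above.

```python
import numpy as np
from scipy.optimize import minimize

def l2l1_svm(X, y, mu, nu):
    """Solve QP (7) with variable layout [w (d) | b (1) | xi (n) | v (d)]."""
    n, d = X.shape
    def split(z):
        return z[:d], z[d], z[d + 1:d + 1 + n], z[d + 1 + n:]
    def obj(z):
        w, b, xi, v = split(z)
        return mu / n * xi.sum() + 0.5 * w @ w + nu * v.sum()
    cons = [
        # y_i (w^T x_i + b) - 1 + xi_i >= 0
        {"type": "ineq", "fun": lambda z: y * (X @ split(z)[0] + split(z)[1]) - 1 + split(z)[2]},
        {"type": "ineq", "fun": lambda z: split(z)[3] - split(z)[0]},  # v - w >= 0
        {"type": "ineq", "fun": lambda z: split(z)[3] + split(z)[0]},  # v + w >= 0
        {"type": "ineq", "fun": lambda z: split(z)[2]},                # xi >= 0
    ]
    res = minimize(obj, np.zeros(2 * d + n + 1), constraints=cons, method="SLSQP")
    w, b, _, _ = split(res.x)
    return w, b

# Toy data: feature 1 separates the classes, feature 2 is irrelevant.
X = np.array([[1.0, 0.5], [2.0, -0.5], [-1.0, 0.3], [-2.0, -0.2]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = l2l1_svm(X, y, mu=1.0, nu=0.1)
```

With a moderate loss weight $\mu$ the solution trades some margin for a smaller $\|w\|$, but the toy training set is still classified correctly.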
3.2. Nonlinear Classification

For problems which are not linearly separable, a so-called feature map $\phi$ is commonly used which maps the set $X \subset \mathbb{R}^d$ into a higher dimensional space $\phi(X) \subset \mathbb{R}^{d'}$ $(d' \ge d)$. Then a linear classification approach (1) or (3) can be applied in the new feature space $\phi(X)$. This results in a nonlinear classification in the original space $\mathbb{R}^d$, i.e., in nonlinear separating surfaces. Below, we consider two popular feature maps in connection with feature selection.
3.2.1. Quadratic FSV

We start with the simple quadratic feature map

$$\phi : X \to \mathbb{R}^{d'}, \quad x \mapsto (x^\alpha : \alpha \in \mathbb{N}_0^d,\ 0 < \|\alpha\|_1 \le 2) = (x_1^{\alpha_1} x_2^{\alpha_2} \cdots x_d^{\alpha_d} : \alpha \in \mathbb{N}_0^d,\ 0 < \|\alpha\|_1 \le 2) ,$$

where $d' = \frac{d(d+3)}{2}$. Straightforward application of the $\ell_0$-penalty in $\mathbb{R}^{d'}$ by FSV leads to the minimisation problem

$$(1 - \lambda) \sum_{i=1}^{n} \big(1 - y_i(w^T \phi(x_i) + b)\big)_+ + \lambda e^T(e - e^{-\alpha v}) + \sum_{i=1}^{d'} \chi_{[-v_i, v_i]}(w_i)$$

for $w \in \mathbb{R}^{d'}$, $b \in \mathbb{R}$ and $v \in \mathbb{R}^{d'}$. This approach, as well as a similar one for the $\ell_1$-penalty in (Zhu et al., 2004), achieves feature selection only in the transformed feature space $\mathbb{R}^{d'}$. Our goal, however, is to select features in the original space $\mathbb{R}^d$, in order to get insight into our original problem, too, and to reduce the number of primary features. To this end, instead of penalising $v_i$ for $v \in \mathbb{R}^{d'}$, we examine for each $w_i$ $(i = 1,\dots,d')$ which original features are included in computing $\phi_i$. If $e_j \in \mathbb{R}^d$ denotes the $j$th unit vector and $\phi_i(e_j) \neq 0$, we penalise the corresponding $v_j$ for $v \in \mathbb{R}^d$:

$$\begin{aligned} f(w, b, v) := {}& (1 - \lambda) \sum_{i=1}^{n} \big(1 - y_i(w^T \phi(x_i) + b)\big)_+ + \lambda e^T(e - e^{-\alpha v}) \\ & + \sum_{i=1}^{d'} \sum_{j : \phi_i(e_j) \neq 0} \chi_{[-v_j, v_j]}(w_i) \;\longrightarrow\; \min_{w \in \mathbb{R}^{d'},\, b \in \mathbb{R},\, v \in \mathbb{R}^d} . \end{aligned} \quad (9)$$

In the following, we refer to (9) as quadratic FSV. In principle, the approach can be extended to other explicit feature maps $\phi$, especially by choosing other polynomial degrees.

In the same manner as done for FSV here, it is possible to generalise the $\ell_2$-$\ell_p$-SVMs for $p = 0, 1$ by explicitly applying, e.g., the quadratic feature map. This leads to solving a sequence of convex QPs instead of LPs, as will be seen in the next section.
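For concreteness, the quadratic map and its dimension $d' = d(d+3)/2$ can be written out as follows (a sketch under our own ordering of the monomials; the paper does not fix one):

```python
import numpy as np
from itertools import combinations_with_replacement

def quad_feature_map(x):
    """phi(x) = (x^alpha : alpha in N_0^d, 0 < ||alpha||_1 <= 2):
    all monomials of degree 1 or 2 in the components of x."""
    d = len(x)
    feats = list(x)                                    # degree-1 monomials x_j
    for i, j in combinations_with_replacement(range(d), 2):
        feats.append(x[i] * x[j])                      # degree-2 monomials x_i x_j
    return np.array(feats)

x = np.array([1.0, 2.0, 3.0])
phi = quad_feature_map(x)   # (x1, x2, x3, x1^2, x1*x2, x1*x3, x2^2, x2*x3, x3^2)
```

There are $d$ linear monomials and $\binom{d+1}{2}$ quadratic ones, giving $d + d(d+1)/2 = d(d+3)/2$ transformed features, which is why the explicit map is only feasible for low polynomial degrees.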
3.2.2. Kernel-Target Alignment Approach

Compared with linear SVMs, further improvements of classification accuracy in our context may be achieved by using Gaussian kernel SVMs, as has been confirmed by experiments in (Jakubik, 2003). Therefore, we also consider SVMs with the feature map $\phi : X \to \ell_2$ induced by $K(x, z) = \langle \phi(x), \phi(z) \rangle$ for the Gaussian kernel

$$K(x, z) = K_\theta(x, z) = e^{-\|x - z\|_{2,\theta}^2 / 2\sigma^2} \quad (10)$$

with weighted $\ell_2$-norm $\|x\|_{2,\theta}^2 = \sum_{k=1}^{d} \theta_k x_k^2$, for all $x, z \in X$. As the feature space has infinite dimension, feature selection as done for the quadratic feature map is no longer applicable. We apply the common SVM classifier without bias term $b$. For further information on nonlinear SVMs see, e.g., (Schölkopf and Smola, 2002). We obtain the commonly used kernel and classifier for $\theta = e$. Direct feature selection, i.e., setting as many $\theta_k$ to zero as possible while retaining or improving the classification ability, is a difficult problem. One possible approach is to use a wrapper as in (Weston et al., 2001). Instead, we aim at directly maximising the alignment $\hat{A}(K, yy^T) = y^T K y / (n \|K\|_F)$ proposed in (Cristianini et al., 2002) as a measure of conformance of a kernel, represented by $K = (K(x_i, x_j))_{i,j=1}^n$, with a learning task. To simplify this optimisation task, we drop the denominator, which is justified in view of the boundedness of the kernel elements (10). To cope with unequal sample partitioning as, e.g., in Fig. 1 left, we replace $y$ by $y_n = (y_i / n_{y_i})_{i=1}^n$. This leads to

$$y_n^T K y_n = \Bigg\| \frac{1}{n_{+1}} \sum_{\{i : y_i = +1\}} \phi(x_i) - \frac{1}{n_{-1}} \sum_{\{i : y_i = -1\}} \phi(x_i) \Bigg\|^2 \quad (11)$$

which is the squared distance of the class centres in the feature space. A different view on the alignment criterion is obtained by considering the classifier $F$ with $w = \sum_{i=1}^{n} y_{n,i} \phi(x_i)$, $b = 0$ in feature space. Then maximising the correct class responses $\sum_{i=1}^{n} y_{n,i} F(x_i)$ also leads to the expression above. Relaxing the binary $\theta \in \{0, 1\}^d$ to $\theta \in [0, 1]^d$ and adding penalty (5), we define as our kernel-target alignment approach to feature selection

$$f(\theta) := -(1 - \lambda) \frac{1}{2} y_n^T K_\theta y_n + \lambda \frac{1}{d} e^T(e - e^{-\alpha \theta}) + \chi_{[0,e]}(\theta) \;\longrightarrow\; \min_{\theta \in \mathbb{R}^d} . \quad (12)$$

The scaling factors $\frac{1}{2}$ and $\frac{1}{d}$ ensure that both objective terms take values in $[0, 1]$. The minimisation problem (12) is subjected to bound constraints only, but the variable $\theta$ is included in the exponential norm expressions in the first term as well as in the concave second term. As a result, the problem is likely to have many local minima and will be difficult to optimise. Considering the boundary values, it follows for $\theta = 0$ that $K_\theta = (1)_{n \times n}$ and $y_n^T K_\theta y_n = 0$. For $\theta \to \infty$, we have $K_\theta \to I$ and $y_n^T K_\theta y_n \to \frac{1}{n_{+1}} + \frac{1}{n_{-1}}$.
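The boundary behaviour just described is easy to verify numerically; the following sketch (our own, with made-up data) evaluates the weighted Gaussian kernel (10) and the balanced alignment (11):

```python
import numpy as np

def weighted_gaussian_kernel(X, theta, sigma):
    """K_theta(x, z) = exp(-||x - z||^2_{2,theta} / (2 sigma^2)) as in (10)."""
    diff2 = (X[:, None, :] - X[None, :, :]) ** 2      # pairwise squared differences
    return np.exp(-(diff2 * theta).sum(axis=2) / (2 * sigma ** 2))

def balanced_alignment(X, y, theta, sigma):
    """y_n^T K_theta y_n with y_n = (y_i / n_{y_i}): squared class-centre distance (11)."""
    K = weighted_gaussian_kernel(X, theta, sigma)
    n_pos, n_neg = (y == 1).sum(), (y == -1).sum()
    yn = np.where(y == 1, 1.0 / n_pos, -1.0 / n_neg)
    return yn @ K @ yn

# Made-up data: four distinct points, balanced classes.
X = np.array([[0.0, 0.0], [1.0, 0.5], [3.0, 1.0], [4.0, 1.5]])
y = np.array([1, 1, -1, -1])
a_zero = balanced_alignment(X, y, np.zeros(2), 1.0)      # theta = 0: alignment 0
a_inf = balanced_alignment(X, y, 1e6 * np.ones(2), 1.0)  # theta large: K -> I
```

For $\theta = 0$ the kernel matrix is all ones and the balanced alignment vanishes; for very large $\theta$ the kernel matrix approaches the identity and the alignment approaches $1/n_{+1} + 1/n_{-1}$.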
4. D.C. Decomposition and Optimisation

Whereas RLP (1), SVM (3) and the $\ell_1$-SVM (4) are still convex programs, adding the concave penalty term (5) makes the problems FSV (6), the $\ell_2$-$\ell_0$-SVM (8), quadratic FSV (9) and, particularly, the kernel-target alignment approach (12) difficult to optimise due to possible local minima.

A robust algorithm for minimising non-convex problems is the Difference of Convex functions Algorithm (DCA) proposed in (Pham Dinh and Hoai An, 1998) in a different context. It can be used to minimise a non-convex function $f : \mathbb{R}^d \to \mathbb{R} \cup \{\infty\}$ which reads

$$f(x) = g(x) - h(x) \;\longrightarrow\; \min_{x \in \mathbb{R}^d} , \quad (13)$$

where $g, h : \mathbb{R}^d \to \mathbb{R} \cup \{\infty\}$ are lower semi-continuous, proper convex functions, cf. (Rockafellar, 1970). A property of this approach, particularly convenient for applications, is that $f$ may be non-smooth. For example, constraint sets $C \ni x$ may be taken into account by adding a corresponding indicator function $\chi_C$ to the objective function $f$. In the next subsections, we first sketch the DCA and then apply it to our non-convex feature selection problems, where the precise algorithm is determined by the appropriate decomposition of $f$ in each case.
4.1. D.C. Programming

According to (Rockafellar, 1970; Pham Dinh and Hoai An, 1998), for a lower semi-continuous, proper convex function $f : \mathbb{R}^d \to \mathbb{R} \cup \{\infty\}$ we use the standard notation

$$\begin{aligned} \mathrm{dom}\, f &:= \{x \in \mathbb{R}^d : f(x) < \infty\} , && \text{(domain)} \\ f^*(\tilde{x}) &:= \sup_{x \in \mathbb{R}^d} \{\langle x, \tilde{x} \rangle - f(x)\} , && \text{(conjugate function)} \\ \partial f(z) &:= \{\tilde{x} \in \mathbb{R}^d : f(x) \ge f(z) + \langle x - z, \tilde{x} \rangle \ \forall x \in \mathbb{R}^d\} && \text{(subdifferential)} \end{aligned}$$

for $z, \tilde{x} \in \mathbb{R}^d$. For differentiable functions we have $\partial f(z) = \{\nabla f(z)\}$. According to (Rockafellar, 1970, Theorem 23.5), it holds that

$$\partial f(x) = \arg\max_{\tilde{x} \in \mathbb{R}^d} \{x^T \tilde{x} - f^*(\tilde{x})\} , \qquad \partial f^*(\tilde{x}) = \arg\max_{x \in \mathbb{R}^d} \{\tilde{x}^T x - f(x)\} . \quad (14)$$
In the remainder of this section, we will apply the following general algorithm:

Algorithm 4.1: D.C. minimisation Algorithm (DCA)(g, h, tol)

  choose $x^0 \in \mathrm{dom}\, g$ arbitrarily
  for $k \in \mathbb{N}_0$ do
    select $\tilde{x}^k \in \partial h(x^k)$ arbitrarily
    select $x^{k+1} \in \partial g^*(\tilde{x}^k)$ arbitrarily
    if $\min\big( |x_i^{k+1} - x_i^k| ,\, \big| \frac{x_i^{k+1} - x_i^k}{x_i^k} \big| \big) \le \mathrm{tol}$ for all $i = 1,\dots,d$
    then return $(x^{k+1})$

The following theorem was proven in (Pham Dinh and Hoai An, 1998, Lemma 3.6, Theorem 3.7):

Theorem 1 (DCA convergence). If $g, h : \mathbb{R}^d \to \mathbb{R} \cup \{\infty\}$ are lower semi-continuous, proper convex functions so that $\mathrm{dom}\, g \subset \mathrm{dom}\, h$ and $\mathrm{dom}\, h^* \subset \mathrm{dom}\, g^*$, then it holds for the DCA Algorithm 4.1:

(i) The sequences $(x^k)_{k \in \mathbb{N}_0}$ and $(\tilde{x}^k)_{k \in \mathbb{N}_0}$ are well defined.

(ii) $\big(f(x^k) = g(x^k) - h(x^k)\big)_{k \in \mathbb{N}_0}$ is monotonously decreasing.

(iii) Every limit point of $(x^k)_{k \in \mathbb{N}_0}$ is a critical point of $f = g - h$. In particular, if $f(x^{k+1}) = f(x^k)$, then $x^k$ is a critical point of $f$ in (13).

Hence the algorithm converges to a local minimum that is controlled by the start value $x^0$ and, of course, by the d.c. decomposition (13) of the objective. In case of non-global solutions, one may restart the DCA with a new initial point $x^0$. However, (Pham Dinh and Hoai An, 1998) state that the DCA often converges to a global solution.

We point out that a similar optimisation approach has been proposed by (Yuille and Rangarajan, 2003), obviously unaware of previous related work in the mathematical literature (Pham Dinh and El Bernoussi, 1988; Pham Dinh and Hoai An, 1998). Whereas the approach of (Yuille and Rangarajan, 2003) assumes differentiable objective functions, our approach, adopted from (Pham Dinh and Hoai An, 1998), is applicable to a significantly larger class of non-smooth optimisation problems. This allows, for example, constraint sets to be included in a natural way.
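To make Algorithm 4.1 concrete, here is a small self-contained sketch (our own illustration, not from the paper) for the common case of differentiable $h$, where the two selection steps reduce to $\tilde{x}^k = \nabla h(x^k)$ and $x^{k+1} = \mathrm{argmin}_x\, g(x) - \langle x, \tilde{x}^k \rangle$. The 1-d example problem, with a closed-form inner minimiser, is hypothetical, and the stopping test is simplified to the absolute change only.

```python
import numpy as np

def dca(grad_h, argmin_g_linear, x0, tol=1e-10, max_iter=200):
    """DCA for f = g - h with differentiable h: linearise h at x_k,
    then minimise g(x) - <x, grad_h(x_k)>."""
    x = x0
    for _ in range(max_iter):
        x_new = argmin_g_linear(grad_h(x))
        if np.max(np.abs(x_new - x)) <= tol:
            return x_new
        x = x_new
    return x

# Hypothetical 1-d problem: f(x) = 0.5 (x - c)^2 + nu (1 - exp(-alpha x)) on x >= 0,
# split as g(x) = 0.5 (x - c)^2 + indicator(x >= 0), h(x) = nu (exp(-alpha x) - 1).
c, nu, alpha = 2.0, 0.5, 5.0
grad_h = lambda x: -nu * alpha * np.exp(-alpha * x)
# argmin over x >= 0 of g(x) - t * x is the projected point max(0, c + t).
argmin_g_linear = lambda t: np.maximum(0.0, c + t)

x_star = dca(grad_h, argmin_g_linear, x0=c)     # converges near x = c
x_stuck = dca(grad_h, argmin_g_linear, x0=0.0)  # stays at the critical point 0
```

The two runs illustrate Theorem 1: the iterates decrease $f$ monotonically but only reach a critical point, and which one is reached here depends on the start value $x^0$.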
4.2. Application to Direct Objective Minimising Feature Selection

The crucial point in applying the DCA is to define a suitable d.c. decomposition (13) of the objective function. The aim of this section is to propose such decompositions for the different approaches under consideration.

4.2.1. FSV

Consider general problems of the form

$$\min_{x \in X} f(x) \quad (15)$$

where $f : \mathbb{R}^d \to \mathbb{R}$ is concave but not necessarily differentiable, and $X \subset \mathbb{R}^d$ is a polyhedral set. According to (Mangasarian, 1997), $f$ always takes its minimum value at a vertex of the polyhedral feasible set $X$, and 'argmin' may be written as 'arg vertex min'. The symbol $\partial f$ now denotes the superdifferential of $f$ which, for concave $f$, is the analogue of the subdifferential for (not necessarily differentiable) convex functions. For such problems, and especially for the concave problem FSV (6), the following iterative algorithm was proposed in (Bradley and Mangasarian, 1998):

Algorithm 4.2: Successive Linearization Algorithm (SLA)(f, X)

  choose $x^0 \in \mathbb{R}^d$ arbitrarily
  for $k \in \mathbb{N}_0$ do
    select $z \in \partial f(x^k)$ arbitrarily
    select $x^{k+1} \in \mathrm{arg\,vertex\,min}_{x \in X}\, z^T(x - x^k)$ arbitrarily
    if $z^T(x^{k+1} - x^k) = 0$
    then return $(x^k)$

The algorithm produces a sequence of linear programs and terminates after a finite number of iterations (Mangasarian, 1997).

Now let us solve the general non-convex problems in the d.c. optimisation framework. It turns out not only that our new feature selection approaches generalise the FSV approach, but also that the SLA is a special case of the DCA: we show that the DCA applied to a particular d.c. decomposition (13) of (15) coincides with the SLA.

Corollary 2 (SLA equivalence). Let $f : \mathbb{R}^d \to \mathbb{R}$ be concave and $X \subset \mathbb{R}^d$ be a polyhedral set. Then, for solving the concave minimisation problem (15), the SLA with $x^0 \in X$ and the DCA with $\mathrm{tol} = 0$ are equivalent.
Proof. Modelling problem (15) as a d.c. problem reads

$$\min_{x \in \mathbb{R}^d} \chi_X(x) - (-f(x)) ,$$

where the first term is defined as the function $g$ in (13), and the second one as $h$. Then we have in the DCA Algorithm 4.1:

- $x^0 \in \mathrm{dom}\, g \Leftrightarrow x^0 \in X$, and for $k \in \mathbb{N}_0$:
- $\tilde{x}^k \in \partial h(x^k) \Leftrightarrow \tilde{x}^k \in -\partial f(x^k)$,
- $x^{k+1} \in \partial g^*(\tilde{x}^k) \overset{(14)}{\Leftrightarrow} x^{k+1} \in \mathrm{argmin}_{x \in X} -(\tilde{x}^k)^T(x - x^k)$.

The problem given in the theorem has exactly the form for which the SLA Algorithm 4.2 is defined. Algorithm 4.2 and the above DCA are almost identical, with $z = -\tilde{x}^k$. If we use $\mathrm{tol} = 0$ in the DCA, choose the start value $x^0 \in X$ in the SLA and apply, e.g., the simplex algorithm to obtain only vertex solutions, the algorithms are identical.
4.2.2. $\ell_2$-$\ell_0$-SVM

A viable d.c. decomposition (13) for (8) reads

$$\begin{aligned} g(w, b, v) &= \frac{\mu}{n} \sum_{i=1}^{n} \big(1 - y_i(w^T x_i + b)\big)_+ + \frac{1}{2} w^T w + \chi_{[-v,v]}(w) , \\ h(v) &= -\nu e^T(e - e^{-\alpha v}) . \end{aligned}$$

Here, and for the following problems, $h$ is differentiable, so in the first step of DCA iteration $k \in \mathbb{N}_0$ we have $\tilde{x}^k = \nabla h(x^k)$. Combining the two DCA steps for each $k$ by (14) leads to $x^{k+1} \in \partial g^*(\nabla h(x^k)) = \mathrm{argmax}_x \{\nabla h(x^k)^T x - g(x)\}$, so that we arrive at the constrained convex QP

$$\begin{aligned} \min_{w \in \mathbb{R}^d,\, b \in \mathbb{R},\, \xi \in \mathbb{R}^n,\, v \in \mathbb{R}^d} \;& \frac{\mu}{n} e^T \xi + \frac{1}{2} w^T w + \nu \alpha\, v^T e^{-\alpha v^k} \\ \text{subject to} \;& y_i(w^T x_i + b) \ge 1 - \xi_i , \quad i = 1,\dots,n , \\ & \xi \ge 0 , \\ & -v \le w \le v . \end{aligned}$$

Note that the sequence of solutions to these QPs converges, due to Theorem 1, as $f$ is bounded from below.
4.2.3. Quadratic FSV

To solve (9), we use the d.c. decomposition

$$\begin{aligned} g(w, b, v) &= (1 - \lambda) \sum_{i=1}^{n} \big(1 - y_i(w^T \phi(x_i) + b)\big)_+ + \sum_{i=1}^{d'} \sum_{j : \phi_i(e_j) \neq 0} \chi_{[-v_j, v_j]}(w_i) , \\ h(v) &= -\lambda e^T(e - e^{-\alpha v}) , \end{aligned}$$

which, analogously to the previous approach, in each DCA step $k \in \mathbb{N}_0$ leads to a linear program

$$\begin{aligned} \min_{w \in \mathbb{R}^{d'},\, b \in \mathbb{R},\, \xi \in \mathbb{R}^n,\, v \in \mathbb{R}^d} \;& (1 - \lambda) e^T \xi + \lambda \alpha\, v^T e^{-\alpha v^k} \\ \text{subject to} \;& y_i(w^T \phi(x_i) + b) \ge 1 - \xi_i , \quad i = 1,\dots,n , \\ & \xi \ge 0 , \\ & -v_j \le w_i \le v_j , \quad i = 1,\dots,d';\ \phi_i(e_j) \neq 0 . \end{aligned}$$
4.2.4. Kernel-Target Alignment Approach

For the function defined in (12), as the kernel (10) is convex in $\theta$, we split $f$ as

$$\begin{aligned} g(\theta) &= \frac{1 - \lambda}{2 n_{+1} n_{-1}} \sum_{\substack{i,j = 1 \\ y_i \neq y_j}}^{n} e^{-\|x_i - x_j\|_{2,\theta}^2 / 2\sigma^2} + \chi_{[0,e]}(\theta) , \\ h(\theta) &= \frac{1 - \lambda}{2} \sum_{\substack{i,j = 1 \\ y_i = y_j}}^{n} \frac{1}{n_{y_i}^2} e^{-\|x_i - x_j\|_{2,\theta}^2 / 2\sigma^2} - \frac{\lambda}{d} e^T(e - e^{-\alpha \theta}) . \end{aligned}$$

Again $h$ is differentiable, so by applying the DCA we find the solution in the first step of iteration $k$ as

$$\tilde{\theta}^k = \nabla h(\theta^k) = -\frac{1 - \lambda}{4\sigma^2} \sum_{\substack{i,j = 1 \\ y_i = y_j}}^{n} \frac{1}{n_{y_i}^2} e^{-\|x_i - x_j\|_{2,\theta^k}^2 / 2\sigma^2} \big( (x_{il} - x_{jl})^2 \big)_{l=1}^{d} - \frac{\lambda}{d} \alpha e^{-\alpha \theta^k} .$$
In the second step, looking for $\theta^{k+1} \in \partial g^*(\tilde{\theta}^k) \overset{(14)}{=} \mathrm{argmax}_\theta \{\theta^T \tilde{\theta}^k - g(\theta)\}$ leads to solving the convex non-quadratic problem

$$\begin{aligned} \min_{\theta \in \mathbb{R}^d} \;& \frac{1 - \lambda}{2 n_{+1} n_{-1}} \sum_{\substack{i,j = 1 \\ y_i \neq y_j}}^{n} e^{-\|x_i - x_j\|_{2,\theta}^2 / 2\sigma^2} - \theta^T \tilde{\theta}^k \\ \text{subject to} \;& 0 \le \theta \le e \end{aligned} \quad (16)$$

with a valid initial point $0 \le \theta^0 \le e$. We solve the problems (16) efficiently by a trust region method using the function fmincon in MATLAB's optimisation toolbox (MathWorks, 2002). Alternatively, a penalty/barrier multiplier method with logarithmic-quadratic penalty function, as proposed in (Ben-Tal and Zibulevsky, 1997), also reliably solves the problems.
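The paper solves (16) with MATLAB's fmincon; purely as an illustration, the same bound-constrained convex subproblem can be set up in Python and handled with scipy's L-BFGS-B (a quasi-Newton bound-constrained method, not the trust-region method used by the authors). The data, $\lambda$, $\sigma$ and the linearisation point $\tilde{\theta}$ below are made-up.

```python
import numpy as np
from scipy.optimize import minimize

def solve_subproblem(X, y, theta_tilde, lam, sigma):
    """One DCA subproblem (16): minimise the between-class kernel sum
    minus the linearised term theta^T theta_tilde over 0 <= theta <= e."""
    n, d = X.shape
    n_pos, n_neg = (y == 1).sum(), (y == -1).sum()
    diff2 = (X[:, None, :] - X[None, :, :]) ** 2      # (n, n, d)
    opposite = y[:, None] != y[None, :]               # pairs with y_i != y_j

    def obj(theta):
        K = np.exp(-(diff2 * theta).sum(axis=2) / (2 * sigma ** 2))
        return (1 - lam) / (2 * n_pos * n_neg) * K[opposite].sum() - theta @ theta_tilde

    res = minimize(obj, 0.5 * np.ones(d), bounds=[(0.0, 1.0)] * d,
                   method="L-BFGS-B")
    return res.x

# Made-up data: feature 1 separates the classes, feature 2 is constant.
X = np.array([[0.0, 0.3], [0.0, 0.3], [1.0, 0.3], [1.0, 0.3]])
y = np.array([1, 1, -1, -1])
# Hypothetical linearisation point mimicking the penalty gradient in theta_tilde.
theta = solve_subproblem(X, y, theta_tilde=np.full(2, -0.1), lam=0.5, sigma=1.0)
```

On this toy problem the subproblem pushes the weight of the separating feature to its upper bound, while the penalty part of $\tilde{\theta}$ drives the useless feature's weight to zero, which is the selection behaviour the full DCA iteration relies on.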
5. Evaluation

To study the performance of our new methods in detail, we first present computer-generated ground truth experiments illustrating the general behaviour and robustness of the nonlinear classification methods in Sec. 5.1. To evaluate the performance of the suggested approaches at large, we study various real-world problems in Sec. 5.2 and finally examine the high-dimensional research problem of organ classification in CT scans in Sec. 5.3.
5.1. Ground Truth Experiments

In this section, we consider artificial training sets in $\mathbb{R}^2$ and $\mathbb{R}^4$ where $y$ is a function of the first two features $x_1$ and $x_2$. We examine specially designed points $(x_1, x_2) \in \mathbb{R}^2$ on the left of the figures and $n = 64$ randomly distributed points $(x_1, x_2, x_3, x_4) \in \mathbb{R}^4$ on the right.

The examples in Fig. 1 show that our quadratic FSV approach indeed performs feature selection and finds classification rules for quadratic, not linearly separable problems. Ranking methods for feature selection, as well as linear classification approaches, fail to recognise the feature relevance for these problems.
Figure 1. Quadratic classification problems with $y = \mathrm{sgn}(x_1^2 + x_2^2 - 1)$. Top: training points and decision boundaries (white lines) computed by (9) for $\lambda = 0.1$; left: deterministic points in $\mathbb{R}^2$, right: projection of $\mathbb{R}^4$ onto the selected features for random points in $\mathbb{R}^4$ (four normal random variables $x_1, \dots, x_4$ with variances 1, 1, 1 and 2). Bottom: features determined by (9).

For the non-quadratic chess board classification problems in Fig. 2, our kernel-target alignment approach performs very well, in contrast to all other feature selection approaches presented. Again, the features by themselves do not contain relevant information, and all linear methods are doomed to fail. In both test examples, only relevant feature sets are selected by our methods, as can be seen in the bottom plots. In particular, the correct feature set $\{1, 2\}$ is selected for most values of $\lambda$. This clearly shows the favourable properties of embedded feature selection also in connection with nonlinear classification.
Figure 2. Chess board classification problems with $\frac{y+1}{2} = (\lfloor \frac{x_1}{2} \rfloor \bmod 2) \oplus (\lfloor \frac{x_2}{2} \rfloor \bmod 2)$. Top: training points and Gaussian SVM decision boundaries (white lines) for $\sigma = 1$, $\lambda = 0.1$; left: deterministic points $x_i \in \{-3, -1, 1, 3\}^2$ in $\mathbb{R}^2$, right: zoomed projection of $\mathbb{R}^4$ onto the selected features for points with four random features (same as Fig. 1 right). Bottom: features determined by (12).

Fig. 2 shows on the right a remarkable property: the alignment approach discards the two noise features even for $\lambda = 0$, which indicates that the alignment functional (11) incorporates implicit feature selection. This is due to the isotropic properties of the Gaussian kernel, where the feature space distances are bounded by $\|\phi(x)\|^2 = \langle \phi(x), \phi(x) \rangle = K(x, x) = 1$. As argued in Sec. 3.2.2, maximising the alignment term $y_n^T K_\theta y_n$ amounts to maximising the class-centre distance of the feature vectors, which lie on the unit sphere in $\ell_2$. Adding random features disturbs the original distances $\|x_i - x_j\|$ and so distributes the feature vectors $\phi(x_i)$ more uniformly on the sphere, potentially moving the class means closer to each other. More precisely, adding features

$$x \mapsto \begin{pmatrix} x \\ \tilde{x} \end{pmatrix}$$

leads for $\theta = e$ to kernel matrix elements

$$e^{(-\|x - z\|_2^2 - \|\tilde{x} - \tilde{z}\|_2^2)/2\sigma^2} = K(x, z) \cdot e^{-\|\tilde{x} - \tilde{z}\|_2^2 / 2\sigma^2}$$

for $x, z \in X$. If the new features are random, roughly all off-diagonal elements are damped by the same factor $\alpha$. Splitting the diagonal from the off-diagonal terms, the original alignment $y_n^T K y_n =: (\frac{1}{n_{+1}} + \frac{1}{n_{-1}}) + c$ is reduced to $(\frac{1}{n_{+1}} + \frac{1}{n_{-1}}) + \alpha c$, i.e., it is reduced whenever $c > 0$, or equivalently $y_n^T K y_n > \frac{1}{n_{+1}} + \frac{1}{n_{-1}}$. For large $n_i$, the value of the alignment term is reduced by almost the factor $\alpha$ too. The implicit feature selection of the alignment functional does not apply for arbitrary kernels: the linear kernel, e.g., leads to a (nonnegative) alignment summand for each feature.
5.2. Real-World Data

We compare our approaches with RLP (1) and FSV (6), favoured over the $\ell_1$-SVM in (Bradley and Mangasarian, 1998), with standard linear and Gaussian kernel SVMs, as well as with the fast SVM-based filter method for feature selection (Heiler et al., 2001), which ranks the features according to the linear SVM decision function.
5.2.1. Data Sets and Preprocessing

To test all our methods on real-world data, we use several data sets from the UCI repository (Blake and Merz, 1998) as well as the high-dimensional Colon Cancer data set from (Weston et al., 2003). The problems mostly treat medical diagnoses based on genuine patient data and are summarised in Table I, where we use distinct short names for the databases. (See also (Bradley and Mangasarian, 1998) for a brief review of most of the data sets used.)

Table I. Statistics for data sets used

  data set            no. of features d   no. of samples n   class distribution n+1/n-1
  wpbc60                     32                  110                 41/69
  wpbc24                     32                  155                 28/127
  liver                       6                  345                145/200
  cleveland                  13                  297                160/137
  ionosphere                 34                  351                225/126
  pima                        8                  768                500/268
  bcw (breast
  cancer-wisconsin)           9                  683                444/239
  sonar                      60                  208                111/97
  musk                      166                  476                207/269
  microarray               2000                   62                 22/40

It is essential that the features are normalised, especially for the kernel-target alignment approach, as their variances influence its sensitive objective with initially equal weights. Experiments show that otherwise features with large variances are preferred. So we rescale the features linearly to zero mean and unit variance.
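The normalisation step can be sketched as follows (our own minimal version; the guard for constant features is an added assumption the paper does not discuss):

```python
import numpy as np

def standardise(X):
    """Rescale each feature (column) linearly to zero mean and unit variance."""
    mu = X.mean(axis=0)
    sd = X.std(axis=0)
    sd[sd == 0] = 1.0          # assumption: leave constant features at zero
    return (X - mu) / sd

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
Z = standardise(X)
```

Without this step, features with large variances would dominate the weighted distances in (10) and hence the alignment objective, regardless of their relevance.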
5.2.2. Choice of Parameters
As to the parameters, we set α = 5 in (5) as proposed in (Bradley and
Mangasarian, 1998) and σ = √d/2 in (10), which maximises the alignment of
the problems. We start the DCA with v0 = e for the ℓ2-ℓ0-SVM, FSV
and quadratic FSV, and with θ0 = e/2 for the kernel-target alignment
approach, respectively. We stop with tolerance 10⁻⁵ on v resp. 10⁻³ on θ.
To determine the weight parameters, we discretise their range of
values and perform a parameter selection step minimising the error on
an independent validation set before actually applying the feature selection
algorithm. The validation set is chosen randomly as one half of each
run's (cross-validation) training set to select ln µ ∈ {0,...,10} and ln ν ∈
{−5,...,5}, or λ ∈ {0.05, 0.1, 0.2,...,0.9, 0.95} for (quadratic) FSV, or
λ ∈ {0, 0.1,...,0.9} for the kernel-target alignment approach. In case
of equal validation error, we choose the larger values for (ν, µ) resp. λ.
In the same manner, the SVM weight parameter λ with (1−λ)/λ ∈
{e⁻⁵, e⁻⁴,...,e⁵} is chosen according to the smallest validation error,
independently of the selected features. For the filter method, we successively
include features until the validation error has failed five times to drop
0.1% below the current value. The final classifier is then built from the
training and validation sets. To solve the elementary optimisation problems,
we use the CPLEX solver library (Ilog, Inc., 2001), MATLAB's QP solver
quadprog for the common SVMs, as well as its constrained optimisation
method fmincon, documented in (MathWorks, 2002).
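The validation loop just described can be sketched as a plain grid search over (ln µ, ln ν); `train_and_validate` is a hypothetical callback standing in for training the feature selector on one half of the data and measuring the error on the validation half, and the exact tie-breaking order is an assumption:

```python
import itertools

def select_weights(train_and_validate):
    """Grid search over (ln mu, ln nu) in {0,...,10} x {-5,...,5} minimising
    the validation error; on ties, later (larger) parameter values win."""
    best_err, best_params = float("inf"), None
    for ln_mu, ln_nu in itertools.product(range(0, 11), range(-5, 6)):
        err = train_and_validate(ln_mu, ln_nu)
        if err <= best_err:  # '<=' so that larger values are kept on ties
            best_err, best_params = err, (ln_mu, ln_nu)
    return best_params, best_err

# toy stand-in: error depends only on ln_mu, minimal at ln_mu = 3
params, err = select_weights(lambda m, n: abs(m - 3))
```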
5.2.3. Results
We first partition the data equally into a training, a validation and
a test set. The validated parameters and test results for the linear
classifiers are summarised in Table II, where the number of features
is determined as #{j = 1,...,d : |wj| > 10⁻⁸}. As a result of the
validation, the optimal combination for (µ,ν) mostly falls within the
range of discretised values. Further, the methods are often stable for
large regions of values for ν or for the ratio µ/ν. Our linear methods
achieve feature selection and are often able to improve the classification
performance compared with the baseline RLP classifier. Especially for
the very high dimensional ’microarray’ data, both our linear feature
selection methods ?2?1SVM and ?2?0SVM are more accurate than
even the linear SVM.
For more thorough cross-validation experiments, the aggregate results
are summarised in Table III for linear and in Table IV for nonlinear
classifiers. The number of features is again determined as
#{j = 1,...,d : |wj| > 10⁻⁸} resp. #{j = 1,...,d : θj > 10⁻²}.
The results for the quadratic FSV on data set 'musk' are not given
due to the high problem dimension. It is clear that all proposed
approaches perform feature selection: linear FSV discards most features,
followed by the kernel-target alignment approach, the SVM ranking
method, then the ℓ2-ℓ0-SVM and finally the ℓ2-ℓ1-SVM. In addition,
for all approaches the test error is often smaller than for RLP. The
quadratic FSV performs well mainly for special problems (e.g., 'liver'
and 'ionosphere'), but the classification is good in general for all other
approaches. For the kernel-target alignment approach, apart from the
apparent feature reduction, the number of SVs is also generally reduced,
as can be seen in Table IV. This again allows faster classification and
may also lead to a higher generalisation ability. The average number of
DC iterations given in Table IV, for a run with ten validation calls and
the final evaluation, is still moderate.
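Counting the selected features with the above thresholds amounts to the following (a small sketch; the function and variable names are placeholders):

```python
import numpy as np

def n_selected(w, tol=1e-8):
    """Number of features whose weight magnitude exceeds the threshold."""
    return int(np.sum(np.abs(w) > tol))

w = np.array([0.5, 0.0, 1e-12, -0.3])       # linear weight vector w
theta = np.array([0.4, 0.005, 0.2])         # kernel feature weights theta
dim_linear = n_selected(w)                  # threshold 1e-8 for w
dim_kernel = n_selected(theta, tol=1e-2)    # threshold 1e-2 for theta
```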
Average runtimes for the final feature selection and classifier training
during cross-validation are given in Table V. Taking into account that
RLP is FSV for λ = 0 and that the common SVMs are determined by
the MATLAB solver, the runtimes reflect the problem types. The
kernel-target alignment approach is also tractable. Besides, one should
be aware that the final classification is fast for all approaches.
We already pointed out in Sec. 5.1 that the alignment approach performs
feature selection implicitly, that is, without feature penalty
(λ = 0). To illustrate this, the respective results are given in Table VI.
Table II. Feature selection and linear classification performance (number of features,
test error [%]) and weight parameters that minimise the classification error on the
validation set

             RLP (1)    lin. SVM (3)  ranking    FSV (6)           ℓ2-ℓ1-SVM (7)           ℓ2-ℓ0-SVM (8)
data set     dim  err    dim  err     dim  err   dim  err  λ∗      dim  err  (lnµ∗,lnν∗)    dim  err  (lnµ∗,lnν∗)
wpbc60        32   44     32   31       2   31     0   31  0.95     27   33  ( 1,4)          27   33  ( 1,5)
wpbc24        32   25     32   22       1   22     0   22  0.95     19   20  (10, 5)         13   22  ( 8, 5)
liver          6   28      6   30       6   30     2   33  0.3       6   30  ( 9, 5)          6   31  ( 5, 1)
cleveland     13   17     13   16       6   18     4   23  0.05      9   17  ( 8, 5)          7   17  ( 2,2)
ionosphere    33   12     34   11       9   14     2   14  0.2      19   11  ( 9, 5)          3   15  ( 6, 3)
pima           8   26      8   27       6   27     1   29  0.05      7   27  ( 6,1)           8   27  ( 5,3)
bcw            9    4      9    4       8    4     1    9  0.2       9    4  ( 3,2)           8    4  ( 5,3)
sonar         48   34     60   32      15   35    20   31  0.05     30   35  ( 6, 2)         58   26  (10,4)
musk         113   27    166   18       3   39    17   22  0.05    163   18  (10,3)         160   18  (10,4)
microarray    41   40   2000   10       4   30     1   15  0.3      21    5  ( 1, 0)         18    5  ( 0,3)
Table III. Feature selection and linear classification ten-fold cross-validation performance
(average number of features, average test error [%], error variance [%]); bold numbers
indicate the lowest errors of the feature selection methods, including Table IV

             RLP               linear SVM        ranking           FSV               ℓ2-ℓ1-SVM         ℓ2-ℓ0-SVM
data set     dim   err   var   dim   err   var   dim   err   var   dim   err   var   dim   err   var   dim   err   var
wpbc60      32.0  40.9   2.7  32.0  33.6   1.5   4.9  36.4   2.1   0.4  36.4   1.7  12.4  35.5   1.2  13.4  37.3   1.4
wpbc24      32.0  27.7   1.1  32.0  18.1   1.0   1.8  18.1   1.0   0.0  18.1   1.0  12.6  17.4   0.9   2.9  18.1   1.0
liver        6.0  31.9   0.7   6.0  32.5   0.7   4.5  33.3   0.7   2.1  36.2   1.0   6.0  35.1   1.0   5.0  34.2   1.6
cleveland   13.0  16.2   0.6  13.0  15.8   0.5   6.9  16.2   0.4   1.8  23.6   1.0   9.9  16.5   0.5   8.2  16.5   0.4
ionosphere  33.0  13.4   0.1  34.0  13.4   0.1  10.0  14.0   0.2   2.3  21.7   1.0  24.8  13.4   0.3  14.0  15.7   0.6
pima         8.0  22.5   0.3   8.0  23.2   0.2   5.5  24.0   0.1   0.6  30.1   0.4   6.6  25.1   0.2   6.1  24.7   0.2
bcw          9.0   3.4   0.0   9.0   2.9   0.0   8.7   3.1   0.0   2.4   4.8   0.0   8.7   3.2   0.0   7.9   3.1   0.0
sonar       51.6  27.9   0.7  60.0  26.0   0.3  10.0  27.9   0.6   4.6  27.4   0.4  50.4  22.6   0.1  40.3  23.6   0.2
musk       116.0  20.6   0.2 166.0  15.3   0.1  12.6  29.2   0.4   4.0  28.2   0.2 125.1  18.3   0.3 105.2  16.8   0.2
Table IV. Feature selection and nonlinear classification ten-fold cross-validation average
performance (number of features, test error [%], error variance [%], number of DCA
iterations, number of Support Vectors); bold numbers indicate the lowest errors of the
feature selection methods, including Table III

             Gaussian SVM             quad. FSV         kernel-target alignment
data set     dim   err   var    SVs   dim   err   var   dim   err   var   DCA iter    SVs
wpbc60      32.0  32.7   2.3   94.3   3.2  37.3   1.7   4.4  35.5   3.0      248.1   92.0
wpbc24      32.0  16.8   0.9  123.8   0.0  18.1   1.0   1.9  18.1   1.0      215.2  131.5
liver        6.0  33.3   0.8  233.1   3.2  32.5   0.8   2.5  35.4   1.5      242.6  262.3
cleveland   13.0  15.8   0.5  241.0   9.2  32.3   1.4   3.2  23.6   0.3      139.6  224.4
ionosphere  34.0   7.1   0.2  159.7  32.9  10.5   0.4   6.6   7.7   0.3      192.2  109.6
pima         8.0  23.4   0.2  481.1   4.7  29.9   0.4   1.4  27.0   0.2      202.2  444.2
bcw          9.0   2.9   0.0  229.0   5.9   9.4   0.1   2.8   4.2   0.0       74.9  160.5
sonar       60.0  12.5   0.8  159.1  60.0  24.0   0.7   9.6  27.4   0.6      268.2  110.7
musk       166.0   5.5   0.1  311.7     –     –     –  41.0  15.5   0.2      676.5  218.9
Table V. Feature selection and classification time relative to fastest method

data set   RLP   lin. SVM   ranking   FSV   ℓ2-ℓ1-SVM   ℓ2-ℓ0-SVM   Gauss. SVM   quad. FSV   k.t. align.
wpbc60      55         67       105     –           1           9           38         454           263
Table VI. SVM ten-fold cross-validation performance (average number of features,
average test error [%]) with features chosen by (12) for λ = 0

             kernel-target alignment
data set       dim    err
wpbc60         9.0   38.2
wpbc24         6.5   17.4
liver          4.0   29.6
cleveland      4.2   19.9
ionosphere     8.9    7.1
pima           2.0   25.9
bcw            3.0    4.0
sonar         13.6   24.5
musk          48.4   14.3
Of course the number of selected features is larger than with a feature
penalty as in Table IV, but many features are discarded inherently,
along with a sound classification performance. Note that this yields a
reliable feature selection approach without any need for parameter
selection.
5.3. Organ Classification in CT Scans
The results on the 'microarray' data set in the previous section already
indicate that feature selection methods become more important in higher
dimensions. The evaluation of medical data is a prominent area where
this occurs: since the relevant factors and the problem structure are
often unknown, large feature sets are typically collected at first.
Here, we study the classification of specific organs in CT scans, for
which no satisfactory algorithms exist up to now, although such automatic
detection is essential for the treatment of, e.g., cancer patients. The
data originates from three-dimensional CT scans of the male hip
region. An exemplary two-dimensional image slice is depicted in Fig. 3.
To label the images, the adjacent organs bladder and prostate have been
masked manually by experts. The contours of both organs are shown
in Fig. 3; the organs are very difficult to distinguish visually.
As described in (Schmidt, 2004), the images are filtered by a
three-dimensional steerable pyramid filter bank with 16 angular orientations
and four decomposition levels. Then local histograms are built for the
Figure 3. Sample CT slice from data set ’organs22’ with contours of both organs
filter responses with ten bins per channel. Including the original grey
values, this results in 650 features per image voxel. The task is to label
each voxel with the correct organ. Here, the high-dimensional feature
space is induced by the filtering, which requires many orientations
since the input images are three-dimensional. In total, for problem
'organs22', the data for the region containing bladder or prostate
amount to 117 × 80 × 31 feature vectors ∈ R⁶⁵⁰.
In our experiments, we consider three different patients or data sets.
For each of those, we select 500 feature vectors from each class. From
those, we use 334 arbitrary samples for training and testing, respectively,
during the parameter validation, and then train our final classifier on
all 1000 training vectors. Note that, by choosing an equal number of
training samples from both classes, in contrast to the entire test set
where n₊₁/n₋₁ ∈ [1/12, 1/4], we put more weight on the errors of the
smaller class 'prostate'.

As done in (Schmidt, 2004), we also apply an SVM classifier with the
χ²-kernel

    K_θ(x, z) = exp( −ρ ∑_{k=1}^{d} θ_k (x_k − z_k)² / (x_k + z_k) )

for x, z ∈ R^d with ρ = 2⁻¹¹ on unmodified features. According to
(Haasdonk and Bahlmann, 2004), this kernel is positive definite.
Nevertheless, we include a bias term b as in the linear case. This kernel
achieved a performance significantly superior to the Gaussian kernel
for histogram features in (Chapelle et al., 1999). In order to apply
the kernel-target alignment approach of Sec. 4.2.4 for feature selection,
one merely has to replace the Gaussian kernel by the new kernel, which
is still convex in θ.

In our experiments, we include the filter method (Heiler et al., 2001)
for the χ²-SVM, now determining the ranking and the number of features
by cross-validation on the final training set. For the other approaches,
we use the same parameter settings as in the previous section.
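A sketch of this weighted χ²-kernel in plain NumPy (the function and array names are our own, and the small `eps` guarding empty histogram bins is an assumption not specified in the text):

```python
import numpy as np

def chi2_kernel(X, Z, theta, rho=2.0 ** -11, eps=1e-12):
    """K(x,z) = exp(-rho * sum_k theta_k * (x_k - z_k)^2 / (x_k + z_k))."""
    # pairwise terms, shape (n, m, d)
    diff2 = (X[:, None, :] - Z[None, :, :]) ** 2
    denom = X[:, None, :] + Z[None, :, :] + eps
    # weighted chi^2 distances, shape (n, m)
    dist = np.einsum("nmd,d->nm", diff2 / denom, theta)
    return np.exp(-rho * dist)

# toy nonnegative "histogram" features
X = np.abs(np.random.default_rng(0).random((4, 6)))
K = chi2_kernel(X, X, np.ones(6))
```

For nonnegative features the weighted distance is nonnegative, so the kernel matrix has unit diagonal and entries in (0, 1].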
Table VII. Feature selection and linear classification performance for CT data
(number of features, test error [%]) with weight parameters chosen to minimise
the classification error on the validation set

             RLP          lin. SVM     FSV          ℓ2-ℓ1-SVM    ℓ2-ℓ0-SVM
data set     dim   err    dim   err    dim   err    dim   err    dim   err
organs4      225  13.2    650   1.1      4   2.3     61   0.9     18   0.7
organs20     242  15.2    650   1.4      6   3.6     79   1.5     43   2.7
organs22     231  11.7    650   1.3      3  11.4    106   2.2     66   2.2
Table VIII. Feature selection and nonlinear classification performance for CT
data (number of features, test error [%]) with weight parameters chosen to
minimise the classification error on the validation set

             χ²-SVM       Gauss. SVM   χ²-SVM ranking   k.t. align.
data set     dim   err    dim   err    dim   err        dim   err
organs4      650   1.5    650   1.6     25   0.8         16   1.2
organs20     650   2.3    650   1.9     32   1.1         29   1.8
organs22     650   2.2    650   3.9     22   1.9         35   2.7
The results for the three patients are given in Table VII for the linear
and in Table VIII for the nonlinear classification methods.
The data sets seem to be well linearly separable, which also results in
much lower classification and training times. Moreover, the Gaussian
SVM yields astonishingly bad results compared with its linear and χ²
variants, although reasonable values for the weight λ are selected and
our chosen kernel width σ produces an alignment of around 12% on
the training set, which is maximised for a nearby kernel width in [σ/2, σ].
This slight overestimation of σ is due to the sparsity of the histogram
features. The test error of the Gaussian SVM always increased compared
with its validation error of 0.3–2.1%, whereas it decreased for the other
SVMs. But the scant superiority of Gaussian SVMs over linear ones is
also consistent with (Chapelle et al., 1999).
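The alignment figures quoted here are the kernel-target alignment of (Cristianini et al., 2002), which for labels y ∈ {±1}ⁿ can be computed as follows (a minimal sketch; names are placeholders):

```python
import numpy as np

def alignment(K, y):
    """Kernel-target alignment <K, y y^T>_F / (||K||_F * ||y y^T||_F)."""
    yyT = np.outer(y, y)
    return float(np.sum(K * yyT) / (np.linalg.norm(K) * np.linalg.norm(yyT)))

y = np.array([1.0, 1.0, -1.0, -1.0])
A_perfect = alignment(np.outer(y, y), y)  # the ideal kernel aligns perfectly
```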
Both our linear methods perform very well: they sometimes reduce
the classification error compared with RLP and the linear SVM on the
whole feature set, and they reliably reduce the number of features. The
alignment approach and the filter method select very few features, in
particular only a few features corresponding to each filter subband.
So the alignment approach copes well with the redundancy of the
histogram features. The classification results for the ℓ2-ℓ0-SVM on the
data set 'organs4' may be compared visually with the mask considered
as ground truth in Fig. 4. The organs are classified with a high accuracy
although the classes are again difficult to distinguish visually.
The dimension reduction also leads to a reduced classification time for
Figure 4. Sample results for the ℓ2-ℓ0-SVM on classification problem 'organs4'
(panel pairs: CT scan and labelling yF(x)); classes are marked black and white
all feature selection approaches, which is essential in real-time medical
applications.
6. Summary and Conclusions
We proposed several novel methods that extend existing linear embedded
feature selection approaches towards better generalisation ability
by improved regularisation, and constructed feature selection methods
in connection with nonlinear classifiers. In order to apply the DCA, we
found appropriate splittings of our nonconvex objective functions.
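As a toy illustration of such a DC splitting and the resulting DCA iteration, consider minimising f(x) = (x² − 1)² with the splitting g(x) = x⁴ + 1 and h(x) = 2x² (both convex; this example is ours, not one of the paper's objectives). Each DCA step minimises g minus the linearisation of h at the current iterate, which here has a closed form:

```python
def dca_quartic(x0, iters=60):
    """DCA for f(x) = (x^2 - 1)^2 = g(x) - h(x) with g(x) = x^4 + 1 and
    h(x) = 2 x^2.  The convex subproblem
        x_{k+1} = argmin_x g(x) - h'(x_k) * x
    gives 4 x^3 = 4 x_k, i.e. x_{k+1} = x_k ** (1/3) (for x0 > 0)."""
    x = x0
    for _ in range(iters):
        x = x ** (1.0 / 3.0)  # closed-form solution of the convex subproblem
    return x

x_star = dca_quartic(0.5)  # converges to the minimiser x = 1
```

The iterates approach the global minimiser x = 1 of f from any positive start, illustrating how the DCA reduces a nonconvex problem to a sequence of convex ones.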
Our results show that embedded nonlinear methods, which have rarely
been examined up to now, are indispensable for feature selection.
In the experiments with real data, our methods always carried out
effective feature selection in conjunction with a small classification
error. Hence feature selection by direct objective minimisation is
profitable and viable for different types of classifiers. In higher
dimensions, the curse of dimensionality affects the classification error
even more, so that our methods become all the more important there.
For multiclass classification problems solved by a sequence of binary
classifiers, one could select features for every binary classifier or apply
one of the embedded approaches for all classifiers simultaneously. This
is left for future research.
The approaches may also be extended to incorporate other feature
maps in the same manner as quadratic FSV. For the kernel-target
alignment approach, an application to kernels other than the Gaussian
is possible, as we have shown in the experiments with histogram features.
Acknowledgements
This work was funded by the DFG, Grant Schn 457/5.
Further thanks go to Stefan Schmidt for the reliable collaboration
concerning the evaluation of the CT data, as well as to Dr. Pekar and
Dr. Kaus, Philips Research Hamburg, for providing the data and for
stimulating discussions.
References
Bach, F., G. Lanckriet, and M. Jordan: 2004, 'Multiple Kernel Learning, Conic
Duality, and the SMO Algorithm'. In: Proc. Twenty-first Int. Conf. on
Mach. Learning. New York, NY, USA: ACM Press.
Ben-Tal, A. and M. Zibulevsky: 1997, 'Penalty/Barrier Multiplier Methods for
Convex Programming Problems’. SIAM Journal on Optimization 7(2), 347–366.
Bennett, K. P. and O. L. Mangasarian: 1992, 'Robust Linear Programming
Discrimination of Two Linearly Inseparable Sets'. Optimization Methods and
Software 1, 23–34.
Blake, C. L. and C. J. Merz: 1998, ‘UCI Repository of Machine Learning Databases’.
Bradley, P. S.: 1998, ‘Mathematical Programming Approaches to Machine Learning
and Data Mining’. Ph.D. thesis, University of Wisconsin, Computer Sciences
Dept., Madison, WI, USA. TR9811.
Bradley, P. S. and O. L. Mangasarian: 1998, 'Feature Selection via Concave
Minimization and Support Vector Machines'. In: Proc. of the 15th International
Conference on Machine Learning. San Francisco, CA, USA, pp. 82–90, Morgan
Kaufmann.
Chapelle, O., P. Haffner, and V. N. Vapnik: 1999, 'SVMs for Histogram-Based Image
Classification’. IEEE Transactions on Neural Networks 10(5), 1055–1064.
Cristianini, N., J. Shawe-Taylor, A. Elisseeff, and J. Kandola: 2002, 'On
Kernel-Target Alignment'. In: T. G. Dietterich, S. Becker, and Z. Ghahramani (eds.):
Advances in Neural Information Processing Systems 14. Cambridge, MA, USA:
MIT Press, pp. 367–373.
Duda, R., P. Hart, and D. Stork: 2000, Pattern Classification. New York, NY, USA:
John Wiley & Sons, second edition.
Guyon, I. and A. Elisseeff: 2003, ‘An Introduction to Variable and Feature Selection’.
Journal of Machine Learning Research 3, 1157–1182.
Haasdonk, B. and C. Bahlmann: 2004, 'Learning with Distance Substitution
Kernels'. In: C. E. Rasmussen, H. H. Bülthoff, M. A. Giese, and B. Schölkopf (eds.):
Pattern Recognition, Proc. of 26th DAGM Symposium, Vol. 3175 of LNCS. pp.
220–227, Springer.
Heiler, M., D. Cremers, and C. Schnörr: 2001, 'Efficient Feature Subset Selection for
Support Vector Machines’. Technical Report TR01021, Comp. science series,
Dept. of Mathematics and Computer Science, University of Mannheim.
Hermes, L. and J. M. Buhmann: 2000, ‘Feature Selection for Support Vector
Machines’. In: Proc. of the International Conference on Pattern Recognition
(ICPR’00), Vol. 2. pp. 716–719.
Ilog, Inc.: 2001, ‘ILOG CPLEX 7.5’.
Jakubik, O. J.: 2003, ‘Feature Selection with Concave Minimization’. Master’s thesis,
Dept. of Mathematics and Computer Science, University of Mannheim.
Jebara, T. and T. Jaakkola: 2000, ‘Feature Selection and Dualities in Maximum
Entropy Discrimination’. In: Proc. 16th Conf. on Uncertainty in Artif. Intell.
San Francisco, CA, pp. 291–300, Morgan Kaufmann Publ. Inc.
John, G. H., R. Kohavi, and K. Pfleger: 1994, ‘Irrelevant Features and the Subset
Selection Problem’. In: Proc. of the 11th International Conference on Machine
Learning. pp. 121–129.
Mangasarian, O. L.: 1997, 'Minimum-Support Solutions of Polyhedral Concave
Programs'. Technical Report TR199705, Mathematical Programming, University
of Wisconsin.
MathWorks: 2002, ‘Optimization Toolbox User’s Guide’. The MathWorks, Inc.
Neumann, J., C. Schnörr, and G. Steidl: 2004, 'SVM-based Feature Selection by
Direct Objective Minimisation'. In: C. E. Rasmussen, H. H. Bülthoff, M. A. Giese,
and B. Schölkopf (eds.): Pattern Recognition, Proc. of 26th DAGM Symposium,
Vol. 3175 of LNCS. pp. 212–219, Springer.
Pham Dinh, T. and S. Elbernoussi: 1988, 'Duality in d.c. (difference of convex
functions) optimization. Subgradient methods'. In: Trends in Mathematical
Optimization, Vol. 84 of Int. Series of Numer. Math. Basel: Birkhäuser Verlag,
pp. 277–293.
Pham Dinh, T. and L. T. Hoai An: 1998, 'A D.C. Optimization Algorithm for Solving
the Trust-Region Subproblem'. SIAM Journal on Optimization 8(2), 476–505.
Rockafellar, R. T.: 1970, Convex Analysis. Princeton, NJ, USA: Princeton University
Press.
Schmidt, S.: 2004, 'Context-Sensitive Image Labeling Based on Logistic
Regression'. Master's thesis, Dept. of Mathematics and Computer Science,
University of Mannheim.
Schölkopf, B. and A. J. Smola: 2002, Learning with Kernels. Cambridge, MA, USA:
MIT Press.
Tibshirani, R.: 1996, ‘Regression Shrinkage and Selection via the Lasso’. Journal of
the Royal Statistical Society, Series B 58(1), 267–288.
Weston, J., A. Elisseeff, B. Schölkopf, and M. Tipping: 2003, 'Use of the Zero-Norm
with Linear Models and Kernel Methods'. Journal of Machine Learning Research
3, 1439–1461.
Weston, J., S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik: 2001,
‘Feature Selection for SVMs’. In: T. K. Leen, T. G. Dietterich, and V. Tresp
(eds.): Advances in Neural Information Processing Systems 13. Cambridge, MA,
USA: MIT Press, pp. 668–674.
Yuille, A. and A. Rangarajan: 2003, 'The Convex-Concave Procedure'. Neural
Computation 15, 915–936.
Zhu, J., S. Rosset, T. Hastie, and R. Tibshirani: 2004, '1-norm Support Vector
Machines'. In: S. Thrun, L. Saul, and B. Schölkopf (eds.): Advances in Neural
Information Processing Systems 16. Cambridge, MA, USA: MIT Press.