Vol. 23 no. 9 2007, pages 1106–1114
MSVM-RFE: extensions of SVM-RFE for multiclass gene selection
on DNA microarray data
Xin Zhou and David P. Tuck*
Department of Pathology, Yale University School of Medicine, New Haven, Connecticut 06510, USA
Received on December 7, 2006; revised on January 24, 2007; accepted on January 26, 2007
Associate Editor: David Rocke
Motivation: Given the thousands of genes and the small number of
samples, gene selection has emerged as an important research
problem in microarray data analysis. Support Vector Machine—
Recursive Feature Elimination (SVM-RFE) is one of a group
of recently described algorithms which represent the state-of-the-art
for gene selection. Just like SVM itself, SVM-RFE was originally
designed to solve binary gene selection problems. Several groups
have extended SVM-RFE to solve multiclass problems using one-
versus-all techniques. However, the genes selected from one binary
gene selection problem may reduce the classification performance in
other binary problems.
Results: In the present study, we propose a family of four extensions
to SVM-RFE (called MSVM-RFE) to solve the multiclass gene
selection problem, based on different frameworks of multiclass
SVMs. By simultaneously considering all classes during the gene
selection stages, our proposed extensions identify genes leading to
more accurate classification.
Supplementary information: Supplementary materials, including a
detailed review of both binary and multiclass SVMs, and complete
experimental results, are available at Bioinformatics online.
1 INTRODUCTION
Microarray technology allows researchers to measure the
quantity of mRNA for tens of thousands of genes simulta-
neously. It has the power to create a comprehensive overview of
the gene regulatory network. Studies on DNA microarray
data offer opportunities for advancing fundamental biological
research and clinical practice (Golub et al., 1999; Ramaswamy
et al., 2001; Ross et al., 2000; Staunton et al., 2001; West
et al., 2001).
Normally, a microarray data set contains a large number
of genes (usually several thousand or more) and a relatively
small number of samples or conditions (usually <100).
Among all the genes, many are irrelevant, insignificant or
redundant to the discriminant problem under investigation.
Hence the identification of informative genes, which have the
greatest power for classification, is of fundamental and
practical importance to the investigation of specific discrimi-
nant problems, such as cancer versus non-cancer (Alon et al.,
1999) or different tumor types (Ramaswamy et al., 2001).
Many gene selection algorithms have been proposed in the
context of microarray data analysis over the past few years.
However, most of the studies are related to binary (two-class)
gene selection problems (Golub et al., 1999; Guyon et al., 2002;
Lee et al., 2003; Zhou and Mao, 2005a,b; Zhang et al., 2006,
to name a few), and only a few involve multiclass (i.e. more
than two classes) gene selection and classification (Chai and
Domeniconi, 2004; Li et al., 2004; Ramaswamy et al., 2001;
Tibshirani et al., 2002). As the multiclass problem is
intrinsically more difficult and presents more challenges, it is
worthy of further investigation.
The well-known SVM-RFE (Support Vector Machine—
Recursive Feature Elimination) (Guyon et al., 2002) is a simple
and efficient algorithm which conducts gene selection in a
backward elimination procedure. It has been widely applied in
analyzing high-dimensional biological data, such as gene
expression data (Frank et al., 2006; Ramaswamy et al., 2001;
Zheng et al., 2006), sequence analysis (Das et al., 2006), and
protein mass spectra data (Hilario et al., 2006). Just like SVM,
SVM-RFE was initially proposed for binary cases. Some
researchers (Chai and Domeniconi, 2004; Ramaswamy et al.,
2001) employed a simple one-versus-all technique to apply
SVM-RFE to multiclass problems. However, the genes selected
from one binary gene selection problem may reduce the
classification performance in other binary problems.
In the present study, we propose four extensions of SVM-
RFE to solve the multiclass gene selection problem. By taking
all classes simultaneously into consideration during the gene
selection stage, our proposed extensions provide genes leading
to more accurate classification performance.
The paper is organized as follows. The binary and multiclass
SVMs, as well as SVM-RFE, are first briefly reviewed in
Section 2. In Section 3, we discuss the existing extension of
SVM-RFE for multiclass problems, and propose four new
MSVM-RFE algorithms based on different frameworks of
multiclass SVMs, respectively, to overcome the shortcoming of
the existing extension. The performance of MSVM-RFE
algorithms is finally tested with six microarray data sets in Section 4.
*To whom correspondence should be addressed.
© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: firstname.lastname@example.org
by guest on May 30, 2013
Consider n training data pairs $S = \{\langle x_i, y_i\rangle\}$, $i = 1,\ldots,n$, where
$x_i \in \mathbb{R}^p$ is a feature vector representing the i-th sample, and $y_i$
is the class label of $x_i$. For a binary problem, $y_i \in \{-1, 1\}$, while
for a k-class ($k > 2$) problem, $y_i \in \{1, 2, \ldots, k\}$.
2.1 Support vector machines
Support vector machines (SVMs) (Vapnik, 1998) are powerful
classification algorithms that have shown state-of-the-art
performance in a variety of biological classification tasks.
SVMs are not very sensitive to the curse of dimensionality, and
they are well-suited to work with high dimensional data, such
as microarray gene expression data (Brown et al., 2000;
Furey et al., 2000).
The first generation of SVMs was only designed for binary
classification. However, most real-life diagnostic tasks are not
binary, and solving multiclass classification problems is much
harder than solving binary ones (Mukherjee, 2003; Rifkin et al.,
2003). Fortunately, several algorithms have been proposed for
extending binary SVMs to multiclass classification. These
algorithms are grouped into two types. One is by constructing
and combining several binary SVM classifiers. One-versus-all
(OVA) and one-versus-one (OVO) SVMs (Kreßel, 1999) are
two typical methods in this type. The other type, called
‘all-together’, is to directly solve one optimization problem
which takes all classes into consideration (Crammer and Singer,
2001; Weston and Watkins, 1999; Lee et al., 2004). In this
section, we will briefly introduce the binary and multiclass
SVMs. A detailed review of both binary and multiclass SVMs,
including their mathematical formulations, is presented in the
Supplementary Material.

2.1.1 Binary SVMs. Intuitively, an SVM searches for a
hyperplane with maximal distance between itself and the closest
samples from each of the two classes (Vapnik, 1998). The decision
function of an SVM, like that of other linear classifiers, is
$f(x) = w^T x + b$, where $w = [w_1, w_2, \ldots, w_p]^T$ is the weight vector
and b is a scalar. The SVM is trained by solving the
following optimization problem:

$$\min_{w,\,b,\,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i$$
$$\text{subject to}\ \ y_i\,[w^T x_i + b] \geq 1 - \xi_i,\ \ \xi_i \geq 0,\ \ i = 1,\ldots,n, \qquad (1)$$

where the predefined parameter C is a trade-off between
training accuracy and generalization. Note that the SVM was
originally designed for the binary classification problem, hence
$y_i \in \{-1, 1\}$ as mentioned before. The solution of this
optimization problem is obtained by solving the corresponding
dual problem, a quadratic programming (QP) problem with
n variables.
2.1.2 Multiclass SVMs: one-versus-all. This is the simplest
and probably the earliest implementation for multiclass SVMs
(Bottou et al., 1994). Here, k binary SVM classifiers are
constructed: class 1 (positive) versus all other classes (negative),
class 2 versus all other classes, ..., class k versus all
other classes. The decision function of the r-th classifier
($r = 1,\ldots,k$), $f_r(x) = w_r^T x + b_r$, where $w_r = [w_{r1}, w_{r2}, \ldots, w_{rp}]^T$
is the r-th weight vector and $b_r$ is a scalar, can be obtained by
solving the following optimization problem:

$$\min_{w_r,\,b_r,\,\xi^r}\ \frac{1}{2}\|w_r\|^2 + C\sum_{i=1}^{n}\xi_i^r$$
$$\text{subject to}\ \ z_i^r\,[w_r^T x_i + b_r] \geq 1 - \xi_i^r,\ \ \xi_i^r \geq 0,\ \ i = 1,\ldots,n, \qquad (2)$$

where $z_i^r$ is the class label for the r-th classifier, that is, $z_i^r = 1$
if $y_i = r$ and $-1$ otherwise. Note that for multiclass cases
$y_i \in \{1, 2, \ldots, k\}$. When all k classifiers are constructed, the
combined k binary classifiers $[f_1, f_2, \ldots, f_k]^T$ predict the class y
of a (test) sample x as the one whose binary classifier gives the
maximal value, that is,

$$y = \arg\max_{r=1,\ldots,k} f_r(x).$$

In practice we solve the dual problem of (2), whose number of
variables is the same as the number of data points, so k QP problems
with n variables are solved in this implementation of
multiclass SVMs.
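To make the OVA decision rule concrete, here is a minimal sketch in Python (not from the paper; the weight vectors and biases are made-up numbers for a hypothetical k = 3, p = 4 problem):

```python
import numpy as np

# Hypothetical trained OVA parameters for a k = 3 class, p = 4 feature
# problem: W holds the k weight vectors as rows, b the k biases.
W = np.array([[ 0.9, -0.2, 0.1,  0.0],
              [-0.3,  0.8, 0.0,  0.2],
              [ 0.1,  0.1, 0.7, -0.4]])
b = np.array([-0.1, 0.0, 0.2])

def ova_predict(x):
    """Assign x to the class whose binary classifier scores highest."""
    scores = W @ x + b                 # f_r(x) = w_r^T x + b_r, r = 1..k
    return int(np.argmax(scores)) + 1  # classes labelled 1..k

x = np.array([1.0, 0.2, -0.5, 0.3])
print(ova_predict(x))                  # prints 1
```

Here the first classifier's score dominates, so the sample is assigned to class 1.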
2.1.3 Multiclass SVMs: one-versus-one. Another major
method based on multiple binary SVMs is called the OVO
method (Kreßel, 1999). This method involves the construction
of $k(k-1)/2$ SVM classifiers, each trained on data
from two classes. In total we need to solve $k(k-1)/2$ QP
problems with fewer than n variables.
Hsu and Lin (2002) compared several multiclass SVMs, and
their experiments indicate that the OVO SVM is more suitable
for practical use (with large or medium sample sizes) concerning
both accuracy and running speed. However, some researchers
(Statnikov et al., 2005; Yeang et al., 2001) found that, on small-sample
microarray data, OVO SVMs perform worse than
OVA SVMs and other 'all-together' multiclass SVMs. The same is
observed in our experimental studies (refer to the
Supplementary Material). The reason is perhaps that each
binary classifier in the OVO SVM uses only a fraction of the total
training samples, and the small training set might make these
binary classifiers subject to overfitting.
2.1.4 Multiclass SVMs: method by Weston and Watkins (WW).
This is the first 'all-together' implementation of
multiclass SVMs, solving one single optimization problem
(Vapnik, 1998; Weston and Watkins, 1999). The idea is similar
to the OVA approach. It simultaneously constructs k binary
classifiers, where the r-th function, $w_r^T x + b_r$, separates class
r from the others. The formulation is as follows:

$$\min_{w,\,b,\,\xi}\ P(w, \xi) = \frac{1}{2}\sum_{r=1}^{k}\|w_r\|^2 + C\sum_{i=1}^{n}\sum_{r \neq y_i}\xi_i^r$$
$$\text{subject to}\ \ w_{y_i}^T x_i + b_{y_i} \geq w_r^T x_i + b_r + 2 - \xi_i^r,\ \ \xi_i^r \geq 0,$$
$$i = 1,\ldots,n,\ \ r \in \{1,\ldots,k\}\setminus y_i. \qquad (3)$$
The decision function is the same as that of OVA SVMs,

$$y = \arg\max_{r=1,\ldots,k}\,(w_r^T x + b_r). \qquad (4)$$

The QP problem arising from (3) has $(k-1)n$ variables.
Given that the computational complexity of QP
optimization algorithms is polynomial in the number of
variables, it is computationally much more expensive to solve
multiclass classification using ‘all-together’ methods than using
methods based on binary SVMs.
2.1.5 Multiclass SVMs: method by Crammer and Singer (CS).
This method is similar to WW but uses fewer slack
variables in the constraints of the optimization problem, and
there is no bias term b in the decision function (Crammer and
Singer, 2001). Basically, this method solves the following problem:

$$\min_{w,\,\xi}\ \frac{1}{2}\sum_{r=1}^{k}\|w_r\|^2 + C\sum_{i=1}^{n}\xi_i$$
$$\text{subject to}\ \ w_{y_i}^T x_i - w_r^T x_i \geq 1 - \xi_i,\ \ \xi_i \geq 0,\ \ i = 1,\ldots,n,\ \ r \neq y_i. \qquad (5)$$

The decision function is

$$y = \arg\max_{r=1,\ldots,k}\,w_r^T x.$$

The corresponding dual of (5) is a QP problem with kn
variables. Although this QP problem is larger than that from
WW, Crammer and Singer (2001) decomposed the single
problem into multiple QP problems of reduced size, which
saves much computation.
2.1.6 Multiclass SVMs: method by Lee, Lin and Wahba (LLW).
This method has a promising theoretical property:
its solution for the multiclass problem resembles the Bayes rule
asymptotically. The formulation is as follows:

$$\min_{w,\,b,\,\xi}\ \frac{1}{2}\sum_{r=1}^{k}\|w_r\|^2 + C\sum_{i=1}^{n}\sum_{r \neq y_i}\xi_i^r$$
$$\text{subject to}\ \ w_r^T x_i + b_r \leq -\frac{1}{k-1} + \xi_i^r,\ \ \xi_i^r \geq 0,\ \ i = 1,\ldots,n,\ \ r \in \{1,\ldots,k\}\setminus y_i,$$
$$\sum_{r=1}^{k}\,(w_r^T x_i + b_r) = 0,\ \ i = 1,\ldots,n. \qquad (6)$$

The formulation presented here is slightly different from that in
Lee et al. (2004): here we assume that a linear kernel and
equal misclassification costs are employed. Please refer to
Lee et al. (2004) for the general form. The final decision
function is the same as (4). The single QP dual problem
of (6) has $(k-1)n$ variables.
2.2 Support Vector Machine—Recursive Feature Elimination (SVM-RFE)
SVM-RFE (Guyon et al., 2002) conducts feature selection in a
sequential backward elimination manner, which starts with all
the features and discards one feature at a time. Just like SVM,
SVM-RFE was initially proposed for binary problems. The
squared coefficients $w_j^2$ ($j = 1,\ldots,p$) of the weight vector w
obtained from the binary problem (1) are employed as feature
ranking criteria. Intuitively, the features with the largest
weights are the most informative. Thus, in each iteration
of SVM-RFE one trains the SVM classifier,
computes the ranking criterion $w_j^2$ for all remaining features, and discards
the feature with the smallest ranking criterion. The procedure is
repeated until a small subset of features is obtained.
This is explained by the Optimal Brain Damage (OBD)
algorithm (LeCun et al., 1990). With the criterion
$J = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i$, the change in J when the j-th feature is
discarded can be expanded in a Taylor series to second order:

$$\Delta J(j) = \frac{\partial J}{\partial w_j}\,\Delta w_j + \frac{\partial^2 J}{\partial w_j^2}\,\frac{(\Delta w_j)^2}{2}. \qquad (7)$$

At the optimum of J, the first-order term can be neglected¹, and
with $\Delta w_j = w_j$ (the corresponding weight is set to 0),
Equation (7) becomes $\Delta J(j) \approx (\Delta w_j)^2$. If we denote by J(j)
the value of J when the j-th feature is removed, it follows that

$$J(j) \approx J + w_j^2. \qquad (8)$$

Therefore, removing the feature with the smallest $w_j^2$ will cause the
least increase of J, which at the same time best preserves generalization
performance. In other words, SVM-RFE aims to
find the gene subset that yields the minimal criterion J. This
criterion was also employed by Zhou and Mao (2006) in a
sequential forward selection.
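As an illustrative sketch (not the authors' C/C++ implementation), the binary SVM-RFE loop can be written as follows, with scikit-learn's LinearSVC standing in for the SVM trainer and synthetic data in place of microarray measurements:

```python
import numpy as np
from sklearn.svm import LinearSVC

def svm_rfe(X, y, n_keep, C=1.0):
    """Binary SVM-RFE sketch: repeatedly train a linear SVM and drop
    the feature with the smallest squared weight w_j^2."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        clf = LinearSVC(C=C, max_iter=10000).fit(X[:, remaining], y)
        w2 = clf.coef_.ravel() ** 2           # ranking criterion w_j^2
        remaining.pop(int(np.argmin(w2)))     # discard the weakest feature
    return remaining

# Synthetic two-class data: only features 0 and 1 are informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
print(svm_rfe(X, y, n_keep=2))
```

On this toy data the two informative features are expected to survive to the end, since the SVM places most of its weight on them.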
3 EXTENSIONS OF SVM-RFE FOR MULTICLASS GENE SELECTION
3.1 Existing extension: OVA-RFE
SVM-RFE (Guyon et al., 2002) was originally designed for binary
problems. Some researchers (Chai and Domeniconi, 2004;
Ramaswamy et al., 2001; Rifkin et al., 2003) employed RFE for
multiclass gene selection by ranking genes for each class
separately in the OVA manner.
For a k-class problem ($k \geq 3$), k binary SVM classifiers are
constructed using the OVA method described in Section 2.1.2. For
the r-th binary classification problem, SVM-RFE is carried out
to identify a feature subset Sr for class r against all other
classes. After k feature subsets are selected, the final selected
subset for the whole multiclass problem is the combination
of all the k subsets. This extension of SVM-RFE is called
OVA-RFE in the present work.
¹Actually, the criterion for OBD is not $J = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i$, but the
Lagrangian function from (1), that is,
$L = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i - \sum_{i=1}^{n}\alpha_i\,[y_i(w^T x_i + b) - 1 + \xi_i] - \sum_{i=1}^{n}\mu_i\,\xi_i$,
since in general $\partial J/\partial w_j \neq 0$. However, at the optimum of (1) and the corresponding dual
problem, $J = L$; moreover, $\partial^2 J/\partial w_j^2 = \partial^2 L/\partial w_j^2 = 1$, so for simplicity
here we use J instead of L. In the following, we use similarly simplified
forms for the extensions of SVM-RFE.
X.Zhou and D.P.Tuck
The k subsets $S_r$ ($r = 1,\ldots,k$) are identified independently.
According to Sections 2.1.2 and 2.2, the subset $S_r$ minimizes the criterion

$$J_r = \frac{1}{2}\|w_r\|^2 + C\sum_{i=1}^{n}\xi_i^r. \qquad (9)$$

However, it cannot be guaranteed that the final selected feature
subset ($S_1 \cup S_2 \cup \cdots \cup S_k$) simultaneously minimizes all k
criteria $J_1, J_2, \ldots, J_k$. For example, some features selected
from one binary problem may reduce the classification
performance in other binary problems. As a consequence, the
feature subset selected by OVA-RFE might not be an optimal one.
3.2 New extensions: MSVM-RFEs
To overcome the limitations of OVA-RFE, we propose
MSVM-RFE algorithms for feature selection on multiclass
data. As there are two types of multiclass SVMs: methods
based on multiple binary SVMs and ‘all-together’ methods, our
proposed extensions are derived from these two groups, respectively.

3.2.1 MSVM-RFE based on OVA SVMs. According to our
discussion in Section 3.1, the optimal feature subset should
simultaneously minimize the k criteria in Equation (9) with
$r = 1,\ldots,k$. Therefore, multiclass feature selection is cast as
a multi-objective optimization problem. The simplest way to
solve a multi-objective optimization problem is the Weighting
Objectives Method (Geoffrion, 1968). Assume that the k classes
contribute equally to the classification task. The multi-objective
optimization is then changed into the following single-objective
optimization problem by employing equal weights:

$$J = \frac{1}{k}\sum_{r=1}^{k} J_r = \frac{1}{k}\sum_{r=1}^{k}\left[\frac{1}{2}\|w_r\|^2 + C\sum_{i=1}^{n}\xi_i^r\right]. \qquad (10)$$
The OBD algorithm (LeCun et al., 1990) is employed to solve
this optimization problem. As in the binary SVM-RFE, the
criterion J in (10) is expanded in a Taylor series to second order,

$$\Delta J(j) = \frac{1}{k}\sum_{r=1}^{k}\left[\frac{\partial J_r}{\partial w_{rj}}\,\Delta w_{rj} + \frac{\partial^2 J_r}{\partial w_{rj}^2}\,\frac{(\Delta w_{rj})^2}{2}\right]. \qquad (11)$$

At the optimum of J, $\Delta J(j) \approx \frac{1}{k}\sum_{r=1}^{k}(\Delta w_{rj})^2$. Similar to the
binary SVM-RFE, it follows that

$$J(j) \approx J + \frac{1}{k}\sum_{r=1}^{k} w_{rj}^2, \qquad (12)$$

where J(j) is the value of J when the j-th feature is eliminated.
In consequence, removing the feature with the smallest $\sum_{r=1}^{k} w_{rj}^2$
will cause the least increase of J. Therefore, $\sum_{r=1}^{k} w_{rj}^2$ is
employed as the ranking criterion in the recursive elimination
procedure described in Algorithm 1. Compared with OVA-RFE,
which identifies features for each class independently,
this new extension takes all k classes into consideration simultaneously.
It is straightforward to apply the same technique to other binary-SVM-based
methods, such as the OVO SVM (Kreßel, 1999).
However, the extension based on OVO SVMs does not perform
as well as other extensions proposed in this paper (see
experimental results in the Supplementary Material) on the
small-sample microarray data. This might be a direct
consequence of the overfitting of OVO classifiers mentioned in
Section 2.1.3. With large sample sizes, we expect the extension
based on OVO SVMs to perform as well as or better than other
extensions proposed here.
3.2.2 MSVM-RFE based on 'all-together' methods. We take
WW (Weston and Watkins, 1999; Section 2.1.4) as an example
to illustrate our extensions of the 'all-together' methods. The
criterion used in the OBD algorithm (LeCun et al., 1990) is just
the cost function in (3), that is,

$$J = \frac{1}{2}\sum_{r=1}^{k}\|w_r\|^2 + C\sum_{i=1}^{n}\sum_{r \neq y_i}\xi_i^r. \qquad (13)$$

As in the binary SVM-RFE, the criterion J is expanded in a
Taylor series to second order. At the optimum of J, it follows that

$$J(j) \approx J + \sum_{r=1}^{k} w_{rj}^2, \qquad (14)$$

so $\sum_{r=1}^{k} w_{rj}^2$ is still an appropriate ranking criterion for
the 'all-together' methods.
The recursive elimination procedure of our proposed
extensions is summarized as Algorithm 1. We propose four
MSVM-RFE algorithms based on the different multiclass SVMs,
respectively: MSVM-RFE(OVA), MSVM-RFE(WW),
MSVM-RFE(CS) and MSVM-RFE(LLW). To
reduce the computational cost, the algorithm can be generalized
to discard more than one gene at each step, although discarding
several genes at a time may degrade the selection performance.
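A minimal sketch of the recursive elimination (Algorithm 1) for the OVA variant, assuming scikit-learn's one-versus-rest LinearSVC in place of the authors' C/C++ implementation; the `drop_frac` parameter mirrors the bottom-10% elimination used in the experiments below, and the data are synthetic:

```python
import numpy as np
from sklearn.svm import LinearSVC

def msvm_rfe_ova(X, y, n_keep, C=1.0, drop_frac=0.1):
    """MSVM-RFE(OVA) sketch: rank gene j by sum_r w_rj^2 over the k
    one-versus-all weight vectors and drop the worst genes each step."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        clf = LinearSVC(C=C, max_iter=10000).fit(X[:, remaining], y)
        crit = (clf.coef_ ** 2).sum(axis=0)       # sum_r w_rj^2 per gene
        n_drop = min(max(1, int(drop_frac * len(remaining))),
                     len(remaining) - n_keep)
        worst = sorted(np.argsort(crit)[:n_drop], reverse=True)
        for j in worst:                           # pop from the back first
            remaining.pop(int(j))
    return sorted(remaining)

# Synthetic 3-class data: the class is decided by the first three genes.
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 20))
y = np.argmax(X[:, :3], axis=1) + 1               # labels 1..3
genes = msvm_rfe_ova(X, y, n_keep=3)
print(genes)
```

For k-class labels, LinearSVC trains k one-versus-rest classifiers and exposes their weight vectors as the rows of `coef_`, so summing squared columns gives the multiclass ranking criterion directly.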
4 EXPERIMENTS AND DISCUSSION
We tested our proposed MSVM-RFE algorithms on six
microarray data sets. The specifications of the data sets are
listed in Table 1. Please refer to the Supplementary Material for
more detailed data description and preprocessing.
For assessment of gene selection algorithms, Ambroise and
McLachlan (2002) suggested external cross-validation, in which
the gene selection and validation are performed on different
parts of the sample set, to obtain an unbiased performance
estimate. In this work, we employed external 4-fold stratified
cross-validation to evaluate the selection performance. The
stratified cross-validation (Breiman et al., 1984) is slightly
different from regular cross-validation. In k-fold stratified
cross-validation, a data set is randomly partitioned into k
equally sized folds such that the class distribution in each fold is
approximately the same as that in the entire data set. In
contrast, regular cross-validation randomly partitions the data
set into k-folds without considering class distributions. Kohavi
(1995) reported that stratified cross-validation has smaller bias
and variance than regular cross-validation. To obtain a more
reliable estimate, the external 4-fold cross-validation process
was repeated 100 times using different partitions of the data.
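The external cross-validation scheme above can be sketched as follows; the gene selector is re-run inside every training fold and the held-out fold is used only for scoring. StratifiedKFold stands in for the authors' implementation, a toy variance-based selector stands in for RFE, and only 5 repeats are used (the paper uses 100) to keep the sketch fast:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import LinearSVC

def external_cv_error(X, y, select_genes, n_splits=4, n_repeats=5, seed=0):
    """External stratified CV: gene selection is redone inside each
    training fold, so the held-out fold gives an unbiased error estimate."""
    errs = []
    for rep in range(n_repeats):
        skf = StratifiedKFold(n_splits=n_splits, shuffle=True,
                              random_state=seed + rep)
        for tr, te in skf.split(X, y):
            genes = select_genes(X[tr], y[tr])    # selector never sees X[te]
            clf = LinearSVC(max_iter=10000).fit(X[tr][:, genes], y[tr])
            errs.append(np.mean(clf.predict(X[te][:, genes]) != y[te]))
    return float(np.mean(errs))

# Synthetic 3-class data in which genes 0-2 carry the class signal.
rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=60)
X = rng.normal(size=(60, 30))
X[np.arange(60), y] += 2.0
top_variance = lambda Xt, yt: list(np.argsort(Xt.var(axis=0))[-5:])
err = external_cv_error(X, y, top_variance)
print(round(err, 3))
```

Running the selector inside the folds is exactly what distinguishes external cross-validation from the biased scheme in which genes are selected once on the full data set.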
We compared our MSVM-RFE algorithms with two multi-
class gene selection algorithms: BSS/WSS and OVA-RFE. BSS/
WSS (Ratio of Between-group to Within-groups Sum of
Squares) was proposed by Dudoit et al. (2002) to perform
gene selection for multiclass problems. BSS/WSS is an
individual feature-ranking criterion. For a gene j, the ratio is

$$\mathrm{BSS}(j)/\mathrm{WSS}(j) = \frac{\sum_{i=1}^{n}\sum_{r=1}^{k} I(y_i = r)\,(\bar{x}_{rj} - \bar{x}_{\cdot j})^2}{\sum_{i=1}^{n}\sum_{r=1}^{k} I(y_i = r)\,(x_{ij} - \bar{x}_{rj})^2},$$

where $I(\cdot)$ is the indicator function, $\bar{x}_{\cdot j}$ denotes the average
expression level of gene j across all samples and $\bar{x}_{rj}$ denotes the
average expression level of gene j across the samples belonging to
class r. Intuitively, genes with large BSS/WSS ratios (that is,
relatively large variation among classes and relatively small
variation within classes) are likely relevant to class separation.
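As a small self-contained sketch (with made-up numbers, not microarray data), the ratio can be computed per gene as follows; the class-wise form $\sum_r n_r(\bar{x}_{rj} - \bar{x}_{\cdot j})^2$ used in the code is equivalent to the indicator sums in the formula:

```python
import numpy as np

def bss_wss(X, y):
    """BSS/WSS ratio per gene: between-class over within-class
    sum of squares (Dudoit et al., 2002)."""
    overall = X.mean(axis=0)                     # x̄_.j
    bss = np.zeros(X.shape[1])
    wss = np.zeros(X.shape[1])
    for r in np.unique(y):
        Xr = X[y == r]
        class_mean = Xr.mean(axis=0)             # x̄_rj
        bss += len(Xr) * (class_mean - overall) ** 2
        wss += ((Xr - class_mean) ** 2).sum(axis=0)
    return bss / wss

# Tiny example: gene 0 separates the two classes, gene 1 is noise.
X = np.array([[1.0, 0.3], [1.2, -0.1], [3.0, 0.2], [3.1, 0.0]])
y = np.array([1, 1, 2, 2])
ratios = bss_wss(X, y)
print(ratios[0] > ratios[1])                     # prints True
```

Gene 0 has large between-class spread and tiny within-class spread, so its ratio dwarfs that of the noise gene.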
OVA-RFE is described in Section 3.1.
In Algorithm 1, the MSVM-RFE algorithms remove the one gene
with the smallest ranking criterion at each step. However, the
number of genes in microarray data is very large (normally
several thousand), and if only one gene were removed at each step, the
computational cost would be very high. Therefore, in the
present work the genes whose ranking criteria fall in the bottom
10% of the remaining genes were removed at each step to
expedite the selection procedure.
To select the relevant genes using MSVM-RFE algorithms, it
is important to find the appropriate value of C which is used to
construct the multiclass SVMs during the recursive elimination
procedure. Generally, the appropriate value of C varies
when multiclass SVM classifiers are trained on different gene
subsets. However, it might not be easy to tune the parameter C
for each SVM classifier, and such tuning would demand
substantially increased computation. In the present study, we
employed fixed C during the RFE procedure. The parameter
was tuned through a simple technique. We assessed the
performance of the BSS/WSS algorithm using external cross-
validation with a sequence of given values of C (for example,
C = 10, 1, 0.1, 0.01, 0.001 and 0.0001). The optimal
value of C is chosen to be the one which gives the minimal
external cross-validation error on the selected gene subsets. For
the GCM data set, C was set to 0.1 for OVA SVM and
corresponding RFE algorithms, such as OVA-RFE and
MSVM-RFE (OVA). Accordingly, C¼0.1 for LLW related
algorithms, while C¼0.01 for WW and CS related algorithms.
Another issue is to decide the number of genes to be selected.
Generally, finding the optimal number of genes is very difficult.
In the present study, we employed the solution of Li et al.
(2004). A set of simulations was conducted on the GCM
dataset by varying the number of selected genes. When the
number of selected genes exceeds 400, the selection performance does
not increase significantly, and the variance of the external cross-
validation error is small and stable. The same results were
observed on other datasets. Hence, a maximum of 400 genes
were identified in all experiments.
All programs are written in C/C++. We used LIBSVM
(Chang and Lin, 2001) to implement OVA SVMs and the
corresponding gene selection algorithms, including OVA-RFE
and MSVM-RFE(OVA). BSVM (Hsu and Lin, 2002) was
employed to implement the WW² and CS related algorithms.
As there is no available fast implementation of LLW SVMs,
we implemented this SVM framework using the OOQP package
(Gertz and Wright, 2003) to solve the arising QP problems. The
QP problem from LLW SVMs has $(k-1)n$ variables, so the
computational cost is very high, especially for large data sets. In
this work, we only tested the selection performance when
k, 2k, ... genes were selected using LLW SVMs on
the GCM and 11_Tumors data sets.
The results and genes selected on the GCM data set are
presented here for illustration of typical results. (However, all
the experimental results are available as Supplementary
Material). For the GCM data set, we compared the selection
performances of six gene selection algorithms, as shown in
Figure 1. As an individual ranking method, the performance of
Table 1. Description of the data sets
Data set    Samples    Genes    Classes    Reference
Ramaswamy et al. (2001)
Su et al. (2001)
Ross et al. (2000)
Staunton et al. (2001)
Bhattacharjee et al. (2001)
Armstrong et al. (2002)
²The WW SVM algorithm implemented in BSVM is slightly different
from the original algorithm. BSVM solves a modified optimization
problem, which adds a term $\frac{1}{2}\sum_{r=1}^{k} b_r^2$ to the objective function (3). By
doing so, the corresponding dual problem is simplified and can easily be
solved using fast decomposition techniques. Considering its fast
running speed, we used BSVM for classification and gene selection on the
large data sets (11_Tumors and GCM), but for the other, smaller
data sets we still used the original WW SVM algorithm implemented with
the OOQP package (Gertz and Wright, 2003).
BSS/WSS algorithm is much worse than those of SVM-RFE
extensions. This is because individual ranking methods may
select many redundant genes, which add little discriminatory
power to the classification problem under investigation, while
SVM-RFE extensions that actually evaluate gene subsets
instead of individual genes implicitly take into account the
effect of irrelevant or redundant genes. The MSVM-RFE
algorithms proposed here outperform OVA-RFE as shown in
Figure 1. We believe the good performance is due to
considering all classes simultaneously during the gene selection
procedure. However, the performances of four MSVM-RFE
algorithms are not the same: MSVM-RFE(LLW) is not as good
as the other three. MSVM-RFE(CS) achieves better
performance when only 50–150 genes are selected, and gives
performance similar to MSVM-RFE(OVA) and MSVM-RFE(WW)
when more than 200 genes are selected.
The test performances in Figure 1 are based on different
classification algorithms, that is, genes selected by BSS/WSS,
OVA-RFE or MSVM-RFE(OVA) are evaluated using OVA
SVMs, while MSVM-RFEs (CS, WW, LLW) are tested by CS,
WW and LLW SVMs, respectively. Therefore, we carried out a
fair comparison with all gene selection algorithms assessed by
one multiclass SVM. The comparison results are shown in
Table 2. The results are similar to those in Figure 1 no matter
which multiclass SVM algorithm is employed. If $\mu_A$, $\sigma_A$, $\mu_B$ and
$\sigma_B$ represent the means and standard deviations of the error rates
from methods A and B, respectively, we conclude
that method A is significantly better than method B when
$\mu_B > \mu_A + \sigma_A$. Based on this rule of thumb, our MSVM-RFE
algorithms are significantly better than OVA-RFE and
BSS/WSS. Furthermore, the performances of the MSVM-RFE
algorithms are also significantly better than the performance
obtained without any gene selection.
We combined all the genes selected during the cross-validation
procedure to identify the most important ones. As we applied
our gene selection algorithms 400 times on different subsets of
the original data set (4-fold cross-validation repeated 100
times), we actually obtained 400 different selected gene subsets
(each contains 400 genes). By computing the frequency of each
gene appearing in all the 400 gene subsets, we can identify the
important genes which have been most frequently selected for
different re-sampling sets. For MSVM-RFE(OVA) on the
GCM data set, we identified 342 genes with frequency ≥ 200,
which are shown in Figure 2, while for MSVM-RFE(WW),
MSVM-RFE(CS) and MSVM-RFE(LLW), 351, 353 and 297 such markers
were identified, respectively. A complete list of the selected markers
is provided in the Supplementary Material.
The MSVM-RFE algorithms proposed here thus provide
improved performance for the challenge of selecting genes
which contribute to learning in multiple classes simultaneously.
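The frequency count over the re-sampled selections can be sketched as follows (hypothetical subsets stand in for the 400 real runs, scaled down to 20 runs over 100 genes with 10 genes per run):

```python
import numpy as np

# Stand-in for the 400 selected subsets (4-fold CV x 100 repeats),
# scaled down; gene 0 is forced into every subset for illustration.
rng = np.random.default_rng(0)
n_runs, n_genes, n_sel = 20, 100, 10
subsets = [np.concatenate(([0], rng.choice(np.arange(1, n_genes),
                                           size=n_sel - 1, replace=False)))
           for _ in range(n_runs)]

freq = np.zeros(n_genes, dtype=int)
for s in subsets:
    freq[s] += 1                     # appearance count across runs

# Keep genes selected in at least half of the runs (cf. frequency >= 200/400).
stable = np.flatnonzero(freq >= n_runs // 2)
print(freq[0], 0 in stable)          # prints 20 True
```

Genes that survive the selection across many different re-sampled training sets are the most trustworthy markers, which is exactly the stability argument used above.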
Improving classification performance also affords the opportunity
to contribute to biological discovery. However, interpretation
of results from these methods requires care, since not
all genes will be selected based on specificity for a single tumor
type. The algorithms proposed here select a mixture of genes
which exhibit molecular features specific for the neoplastic
process along with features characteristic of a tumor’s tissue of
Fig. 1. Comparison of MSVM-RFEs, OVA-RFE and BSS/WSS in
terms of external cross-validation error rate on the GCM dataset. The
performances of BSS/WSS, OVA-RFE and MSVM-RFE (OVA) are
assessed by OVA SVMs, while those of MSVM-RFE(WW), MSVM-
RFE(CS) and MSVM-RFE(LLW) are assessed by WW, CS and LLW SVMs,
respectively.
Fig. 2. Expression values for genes selected by MSVM-RFE(OVA) on
the GCM data set are shown. Columns represent 198 tumor samples
ordered by class. Rows represent 342 most frequently selected genes
during the external cross-validation procedure. The genes were
reordered simply to group them into 14 clusters. Red indicates a high relative
level of expression, and blue represents low relative level of expression.
BR, breast adenocarcinoma; PR, prostate adenocarcinoma; LU, lung
adenocarcinoma; CR, colorectal adenocarcinoma; LY, lymphoma; BL,
bladder transitional cell carcinoma; ML, melanoma; UT, uterine
adenocarcinoma; LE, leukemia; RE, renal cell carcinoma; PA,
pancreatic adenocarcinoma; OV, ovarian adenocarcinoma; ME, pleural
mesothelioma; CNS, central nervous system.
origin. However, beyond this, optimal classification perfor-
mance may also lead to selection of genes which may not be
associated primarily with a single tumor or tissue type, but
which may contribute to learning differences among various
combinations of samples. The potential contribution of gene
selection techniques for discovery of biological relationships
has been noted by most previous researchers, although their
focus has been primarily on the genes which are specific to a
single tissue or tumor subtype. Uncovering the true relationships
may prove more complex than identifying the role of a
gene that is highly specific to a single tumor type.
Many of the most tumor- or tissue-specific genes identified
in previous analyses of the data sets we studied
were also selected by our algorithms. For the GCM data set, for
instance, we find that a number of the top selected genes are
genes which have well established histories as biomarkers for
specific tumors, similar to what was noted by Ramaswamy
et al., (2001). CEA (carcinoembryonic antigen) is the top
ranked gene for colorectal cancer, PSA (prostate-specific
antigen) for prostate cancer, and CD20 for lymphomas. Perhaps
more interestingly, we have also identified several genes which
have more recently been noted to have close biological
associations with established biomarkers. Several recent pub-
lications have begun to explore roles for these molecules as
biomarkers in their own right, and in combination with other
markers. For instance, GATA3 was consistently identified as
one of the top markers by all the algorithms. It is associated
with estrogen receptor status and biological function and is
increasingly recognized as a biomarker in its own right, with its
own role in breast cancer prognosis (Oh et al., 2006).
Galectin-4, also consistently identified as one of the most
informative genes, is a lectin for which CEA has been identified
as a potential ligand (Ideo et al., 2005).
Again, similar to previous studies (Ramaswamy et al., 2001), a
number of genes are selected for optimal multiclass classifiers
due to their tissue specificity, rather than their association with
the neoplastic process, per se. Examples include genes with
gastrointestinal tissue origin (intestinal Trefoil factor 3) which
are correlated with colorectal tumors, lymphoid specific
functions (such as Lymphotoxin beta, and EBI1-ligand
chemokine), multiple genes specific to lung tissue (Pulmonary
Surfactant-associated protein B precursor, SFTP1, SP-C1).
Several genes which are tissue specific to prostate were strongly
selected, including PSCA (prostate stem cell antigen), semenogelin,
prostatic secretory protein 57 and prostatic acid phosphatase.
Genes selected for leukemia include those known to be involved in the
leukemic process, such as Myc and Myb, as well as
tissue-specific molecules and various genes known to be involved
in normal hematopoietic differentiation, sometimes associated
with specific types of leukocytes, such as the leukocyte surface
marker CD37, terminal transferase, IL-8 and IL-6. Examples
for melanoma include melanoma inhibitory activity (MIA) and the
differentiation antigen melan-A (MLANA); for pancreas, genes
directly or indirectly related to pancreatic function such as
insulin, glucagon, spasmolytic polypeptide precursor; for CNS,
the Zic gene identified by MSVM-RFE(LLW) encodes a
member of the ZIC family of C2H2-type zinc finger proteins.
Aberrant expression of this gene is seen in medulloblastoma,
but perhaps more relevantly it encodes a transcription factor
that can bind and transactivate the apolipoprotein E gene; it
may therefore serve as an identifier of tissue specificity more
than of tumor specificity.
Other genes which were consistently selected may characterize
a third type of gene: less specific to a single tumor type, but
a key contributor to optimal multiclass classification. Rather
than serving as biomarkers for a specific tumor type, this group
of genes may mark important areas for further exploration and
for novel discoveries of molecular elements involved in
biological processes relevant to malignancy across multiple
tumor types. Among
genes with established functions, Claudin-4, for instance,
consistently selected as one of the top five genes overall in
contributing to multiclass classification, is suspected to be
involved in several epithelium-derived tumor types. Claudin-4 is a
member of a family of transmembrane proteins, essential in the
formation and maintenance of tight junctions (Hewitt et al.,
2006). Several genes were selected which did not appear even in
the top 1000 genes for each individual tumor type on the GCM
data set (Ramaswamy et al., 2001). For instance, tumor-
associated antigen CO-029 is a tetraspanin shown to be
associated with integrins (Gesierich et al., 2005) and expressed
in several different tumor types. Although it was not specific to
a single tumor type, it contributed to sub-classification within
multiple groups and thus provided information relevant to
Table 2. External cross-validation error rates and standard deviations (%) of six gene selection algorithms tested by four multiclass SVMs (OVA SVM, WW SVM, CS SVM and LLW SVM), when 400 genes are selected on the GCM data set. The last row gives the cross-validation error rates and standard deviations (%) on the entire data set, without any gene selection being performed. [Table body not recoverable from the extracted text.]
the multiclass problem. Therefore, further exploration of the
consistently selected genes of unknown function could reveal
either additional genes with tissue or tumor specificity, or
genes involved more fundamentally in the neoplastic process.
In this study, we proposed four MSVM-RFE algorithms,
which show promising performance compared with other
multiclass gene selection algorithms. However, it is difficult to
single out the best gene selection approach among the extensions
we proposed; there is no clear winner. For example,
MSVM-RFE(OVA) achieves the best result on the MLL data set,
while MSVM-RFE(LLW) is best on the NCI Ross and Lung data
sets; MSVM-RFE(CS) performs best on the GCM, 11 Tumor and
NCI Staunton data sets, and MSVM-RFE(WW) performs similarly
to MSVM-RFE(CS) on the 11 Tumor and NCI Staunton data sets.
Which MSVM-RFE variant is best may therefore depend on the
particular data set. Somewhat surprisingly, although
MSVM-RFE(CS) has the best performance on fully half of the
data sets, it performs poorly on the Lung and MLL data sets
(see the Supplementary Material for details). The reason for
this poor performance is not clear; we conjecture that it may
be due to the lack of a bias term b in the CS SVM formulation
(Poggio et al., 2002). Identifying the factors that influence
this requires further investigation.
In this paper, we proposed four extensions of SVM-RFE
(called MSVM-RFE) to solve the multiclass gene selection
problem, based on different frameworks for multiclass SVMs.
By simultaneously considering all classes during the gene
selection stages, our proposed extensions identify genes leading
to more accurate classification.
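The core of each extension is the same recursive elimination loop as binary SVM-RFE, except that a gene's ranking score sums its squared weights over all class weight vectors at once. The following is a minimal sketch of that idea; it uses scikit-learn's one-vs-all `LinearSVC` as a stand-in for the joint multiclass formulations (WW, CS, LLW) used in the paper, and synthetic data in place of microarray expression values:

```python
# Hedged sketch of multiclass SVM-RFE: illustrates the multiclass ranking
# criterion (sum of squared weights across all class weight vectors), not
# the authors' exact implementation. LinearSVC (one-vs-all) stands in for
# the WW/CS/LLW multiclass SVM formulations; the data are synthetic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

def msvm_rfe(X, y, n_keep, step=1):
    """Recursively eliminate features; return indices of the survivors."""
    surviving = np.arange(X.shape[1])
    while surviving.size > n_keep:
        clf = LinearSVC(C=1.0, max_iter=10000).fit(X[:, surviving], y)
        # coef_ holds one weight vector per class; score each gene by the
        # sum of its squared weights over all classes simultaneously.
        scores = (clf.coef_ ** 2).sum(axis=0)
        n_drop = min(step, surviving.size - n_keep)
        surviving = np.delete(surviving, np.argsort(scores)[:n_drop])
    return surviving

X, y = make_classification(n_samples=120, n_features=50, n_informative=5,
                           n_classes=3, n_clusters_per_class=1,
                           random_state=0)
kept = msvm_rfe(X, y, n_keep=10)
print(sorted(int(i) for i in kept))
```

Eliminating one gene per iteration (`step=1`) matches the original RFE formulation; in practice, larger steps are often used early on to keep the number of SVM retrainings manageable for tens of thousands of genes.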
The authors would like to thank Drs Robert Bjornson and
Nicholas Carriero for their help in running code on the
Bulldog Cluster. This work is supported by NIH P30
DK072442. Some of the simulations were run on the Bulldog
Cluster, funded by NIH grant RR19895-02, at the Yale Center
for High Performance Computation in Biology and Biomedicine.
Conflict of Interest: none declared.
Alon,U. et al. (1999) Broad patterns of gene expression revealed by clustering
analysis of tumor and normal colon tissues probed by oligonucleotide arrays.
Proc. Nat. Acad. Sci., 96, 6745–6750.
Ambroise,C. and McLachlan,G.J. (2002) Selection bias in gene extraction on the
basis of microarray gene-expression data. Proc. Nat. Acad. Sci., 99, 6562–6566.
Armstrong,S.A. et al. (2002) MLL translocations specify a distinct gene
expression profile that distinguishes a unique leukemia. Nat. Genet., 30,
Bhattacharjee,A. et al. (2001) Classification of human lung carcinomas by
mrna expression profiling reveals distinct adenocarcinoma subclasses.
Proc. Nat. Acad. Sci., 98, 13790–13795.
Bottou,L. et al. (1994) Comparison of classifier methods: a case study
in handwriting digit recognition. In Proceedings of the International
Conference on Pattern Recognition, Vol. 2, IEEE Computer Society Press,
Los Alamitos, CA, pp. 77–82.
Breiman,L. et al. (1984) Classification and Regression Trees. Wadsworth and
Brooks, Monterey, CA.
Brown,M.P.S. et al. (2000) Knowledge-based analysis of microarray gene
expression data by using support vector machines. Proc. Nat. Acad. Sci., 97,
Chai,H. and Domeniconi,C. (2004) An evaluation of gene selection methods for
multi-class microarray data classification. In Scheffer,T. (ed.) Proceedings of
the Second European Workshop on Data Mining and Text Mining in
Bioinformatics, pp. 3–10.
Chang,C.-C. and Lin,C.-J. (2001) LIBSVM: a library for support vector machines.
Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Crammer,K. and Singer,Y. (2001) On the algorithmic implementation of
multiclass kernel-based vector machines. J. Mach. Learn. Res., 2, 265–292.
Das,R. et al. (2006) Computational prediction of methylation status in human
genomic sequences. Proc. Nat. Acad. Sci., 103, 10713–10716.
Dudoit,S. et al. (2002) Comparison of discrimination methods for the
classification of tumors using gene expression data. J. Am. Stat. Assoc., 97,
Frank,O. et al. (2006) Gene expression signature of primary imatinib-resistant
chronic myeloid leukemia patients. Leukemia, 20, 1400–1407.
Furey,T. et al. (2000) Support vector machine classification and validation of
cancer tissue samples using microarray expression data. Bioinformatics, 16,
Geoffrion,A.M. (1968) Proper efficiency and the theory of vector maximization.
J. Math. Anal. Appl., 22, 618–630.
Gertz,E.M. and Wright,S.J. (2003) Object-oriented software for quadratic
programming. ACM T. Math. Software, 29, 58–81.
Gesierich,S. et al. (2005) Colocalization of the tetraspanins, CO-029 and CD151,
with integrins in human pancreatic adenocarcinoma: Impact on cell motility.
Clin. Cancer Res., 11, 2840–2852.
Golub,T. et al. (1999) Molecular classification of cancer: class discovery and class
prediction by gene expression monitoring. Science, 286, 531–537.
Guyon,I. et al. (2002) Gene selection for cancer classification using support vector
machines. Mach. Learn., 46, 389–422.
Hewitt,K.J. et al. (2006) The claudin gene family: expression in normal and
neoplastic tissues. BMC Cancer, 6, 186.
Hilario,M. et al. (2006) Processing and classification of protein mass spectra.
Mass Spectrom. Rev., 25.
Hsu,C.-W. and Lin,C.-J. (2002) A comparison of methods for multiclass support
vector machines. IEEE T. Neural Networ., 13, 415–425.
Ideo,H. et al. (2005) Galectin-4 binds to sulfated glycosphingolipids and
carcinoembryonic antigen in patches on the cell surface of human colon
adenocarcinoma cells. J. Biol. Chem., 280, 4730–4737.
Kohavi,R. (1995) A study of cross-validation and bootstrap for accuracy
estimation and model selection. In Proceedings of the Fourteenth International
Joint Conference on Artificial Intelligence, Morgan Kaufmann Publishers,
San Francisco, pp. 1137–1145.
Kreßel,U.H.-G. (1999) Pairwise classification and support vector machines. In
Schölkopf,B., Burges,C.J.C. and Smola,A.J. (eds.) Advances in Kernel
Methods: Support Vector Learning, MIT Press, Cambridge, MA, USA.
LeCun,Y. et al. (1990) Optimal brain damage. In Touretzky,D.S. (ed.) Advances
in Neural Information Processing Systems II, Morgan Kaufmann Publishers,
San Mateo,CA, pp. 598–605.
Lee,K.E. et al. (2003) Gene selection: a Bayesian variable selection approach.
Bioinformatics, 19, 90–97.
Lee,Y. et al. (2004) Multicategory support vector machines: theory and
application to the classification of microarray data and satellite radiance
data. J. Am. Stat. Assoc., 99, 67–81.
Li,T. et al. (2004) A comparative study of feature selection and multiclass
classification methods for tissue classification based on gene expression.
Bioinformatics, 20, 2429–2437.
Mukherjee,S. (2003) Classifying microarray data using support vector machines.
In Berrar,D.P. Dubitzky,W. and Granzow,M. (eds.) A Practical Approach to
Microarray Data Analysis, Kluwer Academic Publishers, Boston, MA,
Oh,D.S. et al. (2006) Estrogen-regulated genes predict survival in hormone
receptor-positive breast cancers. J. Clin. Oncol., 24, 1656–1664.
Poggio,T. et al. (2002) b. In Winkler,J. and Niranjan,M. (eds.) Uncertainty in
Geometric Computations, Kluwer Academic Publishers, Norwell, MA, pp. 131–141.
Ramaswamy,S. et al. (2001) Multiclass cancer diagnosis using tumor gene
expression signatures. Proc. Nat. Acad. Sci., 98, 15149–15154.
Rifkin,R. et al. (2003) An analytical method for multiclass molecular cancer
classification. SIAM Review, 45, 706–723.
Ross,D.T. et al. (2000) Systematic variation in gene expression patterns in human
cancer cell lines. Nat. Genet., 24, 227–235.
Statnikov,A. et al. (2005) A comprehensive evaluation of multicategory
classification methods for microarray gene expression cancer diagnosis.
Bioinformatics, 21, 631–643.
Staunton,J.E. et al. (2001) Chemosensitivity prediction by transcriptional
profiling. Proc. Nat. Acad. Sci., 98, 10787–10792.
Su,A.I. et al. (2001) Molecular classification of human carcinomas by use of gene
expression signatures. Cancer Res., 61, 7388–7393.
Tibshirani,R. et al. (2002) Diagnosis of multiple cancer types by shrunken
centroids of gene expression. Proc. Nat. Acad. Sci., 99, 6567–6572.
Vapnik,V.N. (1998) Statistical Learning Theory. John Wiley and Sons, New York.
West,M. et al. (2001) Predicting the clinical status of human breast cancer by
using gene expression profiles. Proc. Nat. Acad. Sci., 98, 11462–11467.
Weston,J. and Watkins,C. (1999) Support vector machines for multiclass pattern
recognition. In Proceedings of the Seventh European Symposium on Artificial
Neural Networks.
Yeang,C.-H. et al. (2001) Molecular classification of multiple tumor types.
Bioinformatics, 17(Suppl. 1), S316–S322.
Zhang,H.H. et al. (2006) Gene selection using support vector machines with non-
convex penalty. Bioinformatics, 22, 88–95.
Zheng,C. et al. (2006) Gene expression profiling of CD34+ cells identifies a
molecular signature of chronic myeloid leukemia blast crisis. Leukemia, 20,
Zhou,X. and Mao,K.Z. (2005a) Gene selection of DNA microarray data based
on Regularization Networks. In Gallagher,M., Hogan,J.M. and Maire,F. (eds.)
IDEAL, Vol. 3578 of Lecture Notes in Computer Science, Springer, pp. 414–
Zhou,X. and Mao,K.Z. (2005b) LS Bound based gene selection for DNA
microarray data. Bioinformatics, 21, 1559–1564.
Zhou,X. and Mao,K.Z. (2006) The ties problem resulting from counting-based
error estimators and its impact on gene selection algorithms. Bioinformatics,