Journal of Machine Learning Research 10 (2009) 591-622
Submitted 8/06; Revised 9/08; Published 3/09
NEUROSVM: An Architecture to Reduce the Effect of the Choice of
Kernel on the Performance of SVM
Pradip Ghanty
Praxis Softek Solutions Pvt. Ltd.
Module 616, SDF Building, Sector V, Salt Lake City
Calcutta - 700 091, India
Samrat Paul
IBM India Pvt. Ltd.
DLF IT Park, 4th Floor, Tower C, New Town Rajarhut
Calcutta - 700 156, India
PRADIP.GHANTY@GMAIL.COM
SAMRAT.PAUL@GMAIL.COM
Nikhil R. Pal
Electronics and Communication Sciences Unit
Indian Statistical Institute
203, B. T. Road
Calcutta - 700 108, India
NIKHIL@ISICAL.AC.IN
Editor: Yoshua Bengio
Abstract
In this paper we propose a new multilayer classifier architecture. The proposed hybrid architecture
has two cascaded modules: feature extraction module and classification module. In the feature
extraction module we use the multilayered perceptron (MLP) neural networks, although other tools
such as radial basis function (RBF) networks can be used. In the classification module we use
support vector machines (SVMs); here too, other tools such as MLP or RBF networks can be used. The feature
extraction module has several sub-modules each of which is expected to extract features capturing
the discriminating characteristics of different areas of the input space. The classification module
classifies the data based on the extracted features. The resultant architecture with MLP in feature
extraction module and SVM in classification module is called NEUROSVM. The NEUROSVM is
tested on twelve benchmark data sets and the performance of the NEUROSVM is found to be better
than both MLP and SVM. We also compare the performance of proposed architecture with that of
two ensemble methods: majority voting and averaging. Here also the NEUROSVM is found to
perform better than these two ensemble methods. Further we explore the use of MLP and RBF in
the classification module of the proposed architecture. The most attractive feature of NEUROSVM
is that it practically eliminates the severe dependency of SVM on the choice of kernel. This has
been verified with respect to both linear and non-linear kernels. We have also demonstrated that for
the feature extraction module, the full training of MLPs is not needed.
Keywords: feature extraction, neural networks (NNs), support vector machines (SVMs), hybrid
system, majority voting, averaging
© 2009 Pradip Ghanty, Samrat Paul and Nikhil R. Pal.
GHANTY, PAUL AND PAL
1. Introduction
A classifier designed from a data set $X = \{x_i \mid i = 1, 2, \dots, N,\ x_i \in \Re^p\}$, where $\Re^p$ is the $p$-dimensional
real space, can be defined as a function $G: \Re^p \to N_c$. Here $N_c = \{y \in \Re^c : y_k \in \{0,1\}\ \forall k,\ \sum_{k=1}^{c} y_k = 1\}$
is the set of label vectors and $c$ is the number of classes. For any input vector $x \in \Re^p$, $G(x)$ is a vector
in $c$ dimensions with only one component equal to 1 and all others 0. In this paper our primary objective
is to find a good $G$ by combining neural networks (NNs) and support vector machines (SVMs).
In machine learning literature NN and SVM are two widely used classifiers. NNs have been
developed for many years and been used in various applications (Haykin, 1999; Pal et al., 2006).
The SVM (Vapnik, 1995) is a classification and regression tool. It is a comparatively new family of
learning tools including training algorithms for optimal margin classifiers (Boser et al., 1992) and
support vector networks (Cortes and Vapnik, 1995). In SVM the input data are often transformed
into a high dimensional space using some kernel function. A linear separating hyperplane with
the maximal margin between the closest positive and closest negative samples in the mapped space
is found. The SVM works by solving a quadratic optimization problem that minimizes a sum of
two terms. The first term is the squared norm of the weight vector associated with the hyperplane
(the margin is proportional to the reciprocal of this norm) and the second term is related to the
sum of classification errors. The SVM is a
very active topic of research (von Luxburg et al., 2004; Adankon and Cheriet, 2007) and it has been
successfully applied to many areas including handwritten digit recognition (Vapnik, 1995), object
recognition (Pontil and Verri, 1998), protein structure prediction (Nguyen and Rajapakse, 2003) and
texture classification (Kim et al., 2002). But there are some computational difficulties associated
with using SVM. One of them is the required memory, which grows very quickly with the size of
the training data since the SVM algorithm involves solving a large quadratic programming problem
where every training data point forms a constraint. This limits the application of SVM to
very large data sets. More importantly, the performance of SVM depends significantly on the
choice of kernel. Needless to say, for data that are not linearly separable, the performance of
linear and nonlinear SVMs also differs significantly.
Use of an ensemble of classifiers is a popular approach to improve the classification
performance. Many researchers have reported that ensemble methods improve performance
over a single classifier (Hansen, 1999; Maqsood et al., 2004; Chawla et al., 2004). An ensemble
of classifiers can be constructed using both homogeneous and heterogeneous classifiers (Hansen,
1999; Prevost et al., 2003; Garcia-Pedrajas et al., 2005). An ensemble of neural networks is often
used for pattern classification problems (Garcia-Pedrajas et al., 2005; Islam et al., 2003) including
face recognition (Melin et al., 2005), weather forecasting (Maqsood et al., 2004) and protein secondary
structure prediction (Guimaraes et al., 2003). Different approaches for constructing ensembles of
neural networks have been suggested in the literature (Wu et al., 2001; Zhou et al., 2002; Windeatt,
2006). In this paper for the purpose of comparison we have considered two ensemble methods for
neural networks: one uses the average output of the ensemble of networks while the other makes
the ensemble vote on a classification task.
In this context, the ensemble method of Garcia-Pedrajas et al. (2007) deserves special attention
because it also uses a multilayer perceptron network for feature extraction, and hence one may
get the false impression that it is quite similar to our proposed method.
This is an ensemble method in which a large number of classifiers are trained and then their
outcomes are aggregated using the majority voting rule. It is an interesting method but quite different
from our proposed scheme.
NEUROSVM: AN ARCHITECTURE TO REDUCE THE EFFECT OF THE CHOICE OF SVM KERNEL
Like AdaBoost the first baseline classifier is trained using the original training data while each
of the subsequent classifiers is trained using a projected data set created using the hidden output
of a trained MLP. The second baseline classifier uses data projected through the hidden layer of a
projection network (MLP here). The projection network is an MLP network with number of hidden
nodes equal to the number of inputs in the original training data and it is trained using only that
subset of the training data which are not classified correctly by the first baseline classifier. The
projection network (again an MLP with number of hidden nodes equal to the number of inputs in
the original data) for the third baseline classifier is trained using the original data points whose
projected versions are wrongly classified by the second baseline classifier. The process is repeated
to generate a large number of baseline classifiers.
Note that our proposed method falls in the category of hybrid systems. There have been several
attempts to combine different machine learning tools to develop efficient hybrid systems for pattern
classification problems (Huang and LeCun, 2006; Happel and Murre, 1994; Vincent and Bengio,
2000; Mitra et al., 2006, 2005). To design a hybrid system, different combinations of classifiers are used,
including neural network-SVM (Mitra et al., 2005, 2006; Vincent and Bengio, 2000) and convolution
network-SVM (Huang and LeCun, 2006). Neural networks and support vector machines are used to
design a hybrid system for text classification in Mitra et al. (2005) and Lidar detection of underwater
objects in Mitra et al. (2006). Mitra et al. (2005) proposed a hybrid system called neuro-SVM which
takes the component wise product of the outputs of a cascaded-SVM classifier and a recurrent neural
network, and applies a set of heuristic rules to decide on the class. In the work of Mitra et al. (2006),
after preprocessing, the Lidar signal is modeled using a polynomial as well as a linear predictor. The
optimal coefficients of the polynomial are used as inputs to train an RBF network, while the coefficients of the
linear predictor are used to train an MLP. The products of the corresponding components of the
output vectors from the two networks are used as input to a cascaded-SVM classifier. Huang and
LeCun (2006) presented a hybrid system for object recognition that uses the outputs of the last
hidden layer of a convolution network to train an SVM with Gaussian kernel. The convolution
network is generally used for computer vision problems. A convolution network has several hidden
layers alternately consisting of convolution layer and sub-sampling layer. In a convolution network,
the successive layers are designed to learn progressively higher-level features until the last layer,
which produces categories.
There have been a number of attempts to develop modular networks to solve complex
problems efficiently (Ronco and Gawthrop, 1995; Bottou and Gallinari, 1991). The basic philosophy
of developing a modular network is to divide the task into a number of, preferably, meaningful
subtasks, and then design one module for each subtask. Finally one needs to devise a mechanism
to integrate these modules; this will dictate how different modules interact and lead to the final
output. Sometimes the knowledge of the problem domain can be used to find the subtasks, but often
clustering is used for this purpose. For example in Pal et al. (2003) a self organized map (SOM) is
used to find natural clusters (subtasks) in the data and then for each cluster a separate network is
trained. A given input is routed to the appropriate MLP module using the SOM. Jenkins and Yuhas
(1993) have presented a simple solution to the truck backer-upper problem by decomposing it into
subtasks. Then all subtasks are realized in parallel (that is, off line) to obtain the final two-layer
feed-forward network, which is used to control the truck. Although our proposed architecture uses
several modules, this is not designed following the usual principle of modular network.
In this paper we propose a new classifier architecture called NEUROSVM. The proposed
classifier has two modules. In the first module we have used an MLP. We view the first module as a
feature extraction module (FM), because outputs of this module can be used as inputs to any other
classifier. This new set of features is used in the next module, termed as the classification module
(CM). In the classification module we have used SVM with different kernel functions. Instead of
SVM, one can use any other classifier also. We also consider the MLP and RBF neural networks in
the CM of our proposed architecture. To further demonstrate the effectiveness of NEUROSVM we
compare it with two other ensemble methods: majority voting and averaging. We demonstrate the
effect of the kernel on SVM and NEUROSVM.
Our proposed method is neither an ensemble method nor has any relation to boosting. There is
only one classifier. The classifier uses features extracted from the hidden nodes of several trained
networks where typically the number of hidden nodes in a network is smaller than input dimension.
Each network used for feature extraction is trained using the same data and each network sees
the entire input space as represented through the training data. Thus, to get improved
performance we typically need fewer feature extraction networks than would be needed by
ensemble-type methods.
2. Methods
The section is arranged as follows. First, we provide a brief description of neural networks for
the sake of completeness. Next, we give a brief description of the support vector machine (SVM)
classifier and how several binary SVMs can be combined to solve a multiclass problem. Then we
explain two popular existing ensemble methods that will be used for comparison. This is followed
by a detailed discussion of the proposed method.
2.1 MLP and RBF Neural Networks
The two most widely used neural networks for pattern recognition are multilayer perceptron (MLP)
and radial basis function (RBF) networks (Haykin, 1999). We have used the back-propagation
algorithm for training MLP networks with single hidden layer.
The RBF network consists of exactly three layers: input layer, basis function layer and output
layer. Unlike MLP, the activation functions of the hidden nodes are not of sigmoidal type, rather
each hidden node represents a radial basis function. The transformation from the input space to the
hidden space is nonlinear but each node in the output layer computes just the weighted sum of the
outputs of the previous layer, that is, each output layer node makes a linear transformation. The
learning of RBF network is usually performed in two phases. An unsupervised learning method
is applied to estimate the basis function parameters. Then a supervised learning method, such as
gradient descent or least square error estimate, is applied to tune the network weights between
the hidden layer and the output layer. However, the parameters of the basis functions can also be
tuned using gradient descent technique. Here we have used the MATLAB implementation of RBF
network.
2.2 Support Vector Machines (SVMs)
The basic SVM (Haykin, 1999; Vapnik, 1995) formulation is for two class problems. If the training
data are linearly separable, then SVM finds an optimum hyperplane that maximizes the margin of
separation between the two classes.
Given a training set $(X, Y)$, with $x_i \in X$, $x_i \in \Re^p$, and $y_i \in Y$ the class label associated with $x_i$,
$y_i \in \{-1, +1\}$, the learning problem for SVMs is to find the weight vector $w$ and bias $b$ such that
they satisfy the constraints:
$$ x_i \cdot w + b \ge +1 \quad \text{for } y_i = +1 \tag{1} $$
$$ x_i \cdot w + b \le -1 \quad \text{for } y_i = -1 \tag{2} $$
and the weight vector $w$ minimizes the cost function
$$ \Phi(w) = \frac{1}{2} w^T w. $$
The constraints written in Equations (1)-(2) can be combined as
$$ y_i (x_i \cdot w + b) \ge +1 \quad \forall i. $$
If the training points are not linearly separable, then there is no hyperplane that separates them
into positive and negative classes. In this case, the problem is reformulated considering the slack
variables $\xi_i \ge 0$; $i = 1, 2, \dots, N$. For most $x_i$, $\xi_i = 0$. The constraints are now modified as follows:
$$ x_i \cdot w + b \ge +1 - \xi_i \quad \text{for } y_i = +1 \tag{3} $$
$$ x_i \cdot w + b \le -1 + \xi_i \quad \text{for } y_i = -1 \tag{4} $$
$$ \xi_i \ge 0, \quad \forall i. \tag{5} $$
The SVM then finds $w$, minimizing
$$ \Phi(w, \xi) = \frac{1}{2} w^T w + C \sum_{i=1}^{N} \xi_i $$
subject to the constraints in Equations (3)-(5). The constant $C$ is termed the regularization parameter
as it controls the trade-off between the complexity of the machine and the number of misclassifications.
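As a rough illustration of the soft-margin objective, the following Python sketch evaluates $\Phi(w, \xi)$ for a fixed hyperplane. It assumes that each slack $\xi_i$ takes the smallest value allowed by constraints (3)-(5), that is $\xi_i = \max(0, 1 - y_i(x_i \cdot w + b))$. The function name and the toy data are illustrative assumptions and are not taken from the experiments reported here.

```python
def soft_margin_objective(X, y, w, b, C):
    """Evaluate Phi(w, xi) = 0.5 * w.w + C * sum(xi) for a fixed hyperplane.

    Each slack xi_i is the smallest value satisfying
    y_i * (x_i . w + b) >= 1 - xi_i and xi_i >= 0 (an assumption of this sketch).
    """
    def dot(u, v):
        return sum(a * c for a, c in zip(u, v))

    slacks = [max(0.0, 1.0 - yi * (dot(xi, w) + b)) for xi, yi in zip(X, y)]
    return 0.5 * dot(w, w) + C * sum(slacks), slacks


# Hypothetical toy data: two points per class in the plane.
X = [(2.0, 0.0), (0.5, 0.0), (-2.0, 0.0), (-1.0, 0.0)]
y = [+1, +1, -1, -1]
value, slacks = soft_margin_objective(X, y, w=(1.0, 0.0), b=0.0, C=10.0)
```

Only the second point lies inside the margin, so it is the only one with a non-zero slack; a larger $C$ would penalize that violation more heavily relative to the norm term.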
Typically, when the training points are not linearly separable, a nonlinear mapping $\varphi$ is used to
map the training data from $\Re^p$ to some higher dimensional feature space $H$, with a hope that the
data may be linearly separable in $H$. The mapping is implicitly realized using a kernel function.
Two kernels that are popular for non-linear SVMs are:
1. Polynomial of degree $d$: $K(x, x_i) = (s\, x \cdot x_i + 1)^d$, where $s$ is the scaling coefficient of the dot
product.
2. Radial Basis Function (RBF): $K(x, x_i) = e^{-\gamma \|x - x_i\|^2}$, $\gamma > 0$.
In this study, we shall extensively use the RBF kernel with a wide range of $\gamma$. We shall also
demonstrate the utility of the proposed method with the polynomial kernel.
We have used the SVMlight (Joachims, 2002) software for learning the SVM classifier. Note that
NEUROSVM uses SVMlight in the classification module. We also use SVMs on the original data
to compare their performance with that of NEUROSVM.
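The two kernels can be written down directly from their definitions. The sketch below is a plain Python illustration, not the SVMlight implementation; the function names and default parameter values are assumptions made only for this example.

```python
import math


def polynomial_kernel(x, xi, d=2, s=1.0):
    """Polynomial kernel K(x, xi) = (s * x.xi + 1)^d."""
    return (s * sum(a * b for a, b in zip(x, xi)) + 1.0) ** d


def rbf_kernel(x, xi, gamma=0.5):
    """RBF kernel K(x, xi) = exp(-gamma * ||x - xi||^2), with gamma > 0."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, xi))
    return math.exp(-gamma * sq_dist)


k_poly = polynomial_kernel((1.0, 2.0), (3.0, 0.5), d=2, s=1.0)
k_same = rbf_kernel((1.0, 2.0), (1.0, 2.0))            # identical points
k_far = rbf_kernel((0.0, 0.0), (1.0, 1.0), gamma=0.5)   # squared distance 2
```

The RBF kernel equals 1 for identical points and decays towards 0 as the points move apart, with $\gamma$ controlling how fast the decay is.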
2.3 SVM for Multiclass Problems
The preceding SVM formulation is for two class problems. Multiclass SVMs are generally realized
using several two class SVMs. We use the One versus One (OVO) method (Nguyen and Rajapakse,
2003; Weston and Watkins, 1999). Let us assume that we have a c class problem. In this method
we construct one binary classifier for every pair of distinct classes. So we get c×(c−1)/2 binary
classifiers for a $c$ class problem. In the training data, suppose $k_i$ samples are from class $i$, with $N = \sum_{i=1}^{c} k_i$.
For the class pair $(i, j)$, a binary classifier $C_{ij}$ is trained using the $k_i$ and $k_j$ data points from classes $i$ and $j$.
An unknown sample $x$ is then classified by each of the $c \times (c-1)/2$ different classifiers. If classifier
$C_{ij}$ classifies $x$ as class $i$ then the vote for class $i$ for sample $x$ is increased by one. Otherwise, the vote
for class $j$ for sample $x$ is increased by one. In this way for sample $x$, the votes for all $c$ classes are
calculated using the output of all $c \times (c-1)/2$ classifiers. After that we assign $x$ to class $l$ if class
$l$ has the largest number of votes for $x$. Ties are randomly resolved.
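The One versus One voting rule can be sketched as follows. The binary classifiers are represented by hypothetical callables stored in a dictionary keyed by the class pair; in practice each would be a trained two-class SVM. All names and the toy decision rules are assumptions made for illustration.

```python
import random
from itertools import combinations


def ovo_predict(x, classes, pairwise, rng=random):
    """Classify x by One-versus-One voting.

    pairwise[(i, j)] is a hypothetical callable for the class pair (i, j),
    i < j, that returns either i or j for the sample x.
    """
    votes = {c: 0 for c in classes}
    for i, j in combinations(classes, 2):    # c * (c - 1) / 2 class pairs
        votes[pairwise[(i, j)](x)] += 1
    best = max(votes.values())
    winners = [c for c in classes if votes[c] == best]
    return rng.choice(winners), votes        # ties are resolved at random


# Toy stand-ins for trained binary classifiers on a one-dimensional input.
classes = [0, 1, 2]
pairwise = {
    (0, 1): lambda x: 0 if x < 1.0 else 1,
    (0, 2): lambda x: 0 if x < 2.0 else 2,
    (1, 2): lambda x: 1 if x < 3.0 else 2,
}
label, votes = ovo_predict(1.5, classes, pairwise)
```

For the input 1.5, class 1 wins two of its pairwise contests and class 0 wins one, so the sample is assigned to class 1 without any tie.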
2.4 Ensemble Methods: Majority Voting and Averaging
Different methods of classifier fusion are available in the literature (Maqsood et al., 2004; Ko et al.,
2007; Brown et al., 2005; Tang et al., 2006; Kuncheva and Whitaker, 2003; Windeatt, 2006; Islam
et al., 2003), of which the majority voting scheme is probably the most popular method (Stepenosky
et al., 2006). In this method, the final class is determined by the maximum number of votes counted
among all the classifiers fused. Let us consider a c class problem and let m be the number of
classifiers to be fused. For an unknown sample $x$, the vote for class $j$, $v_j$ $(j = 1, 2, \dots, c)$, is computed
from the ensemble of classifiers $C_i$ $(i = 1, 2, \dots, m)$. If $C_i$ assigns sample $x$ to class
$j$ then $v_j$ is incremented by 1. Note that the vote for every class is initialized to 0; that is, $v_j = 0$
$(j = 1, 2, \dots, c)$. The final class determined by the ensemble for sample $x$ is $k$ if $v_k = \max_{j=1}^{c} \{v_j\}$.
Averaging also is a simple but effective method and is used in many classification problems
(Guimaraes et al., 2003; Naftaly et al., 1997). In this method, the final class is determined by the
average of continuous outputs of all classifiers (here MLPs) fused. For an unknown sample x, let
the output for class $j$ $(j = 1, 2, \dots, c)$ from classifier $C_i$ $(i = 1, 2, \dots, m)$ be $o_{ij}$. Then the output from
the ensemble classifier is obtained as $O_j = \frac{1}{m} \sum_{i=1}^{m} o_{ij}$, $j = 1, 2, \dots, c$. The final class assignment by
the ensemble to $x$ is $k$ if $O_k = \max_{j=1}^{c} \{O_j\}$.
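The two fusion rules can be compared on a small example. The following Python sketch assumes classes are indexed from 0 and that a tie goes to the lowest class index; the tie rule is an assumption of this sketch, since none is specified for these ensemble methods. The classifier outputs are invented for illustration.

```python
def majority_vote(labels, c):
    """labels[i] is the class (0..c-1) assigned by classifier C_i."""
    votes = [0] * c                          # v_j initialized to 0
    for j in labels:
        votes[j] += 1
    return votes.index(max(votes)), votes    # lowest index wins a tie


def average_outputs(outputs):
    """outputs[i][j] is o_ij, the output of classifier C_i for class j."""
    m, c = len(outputs), len(outputs[0])
    O = [sum(outputs[i][j] for i in range(m)) / m for j in range(c)]
    return O.index(max(O)), O


# Hypothetical outputs of m = 3 classifiers on a c = 3 class problem.
outputs = [
    [0.40, 0.45, 0.15],
    [0.40, 0.45, 0.15],
    [0.90, 0.05, 0.05],
]
labels = [row.index(max(row)) for row in outputs]
voted, votes = majority_vote(labels, c=3)
averaged, O = average_outputs(outputs)
```

Here the two rules disagree: two classifiers weakly prefer the second class, so majority voting selects it, while the one confident classifier pulls the averaged output towards the first class.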
2.5 Proposed NEUROSVM Classifier
The proposed multilayer architecture can be thought of as a combination of two types of modules:
feature extraction module (FM) and classification module (CM). The FM consists of a number of
sub-modules $SFM_i$, $i = 1, 2, \dots, m$. Each sub-module $SFM_i$ takes the same $p$-dimensional data
$x = (x_1, x_2, \dots, x_p)^T$ as input and produces an $n_i$-dimensional output vector $v_i = (v_{i1}, v_{i2}, \dots, v_{in_i})^T$.
Thus the $n = \sum_{i=1}^{m} n_i$ output values together, as shown in Equations (6) and (7), constitute an $n$-dimensional
input to the classification module:
$$ z = \begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_m \end{pmatrix} \in \Re^{n_1 + n_2 + \dots + n_m} \tag{6} $$
and
$$ \begin{aligned}
v_1 &= (v_{11}, v_{12}, \dots, v_{1n_1})^T, \\
v_2 &= (v_{21}, v_{22}, \dots, v_{2n_2})^T, \\
&\;\;\vdots \\
v_m &= (v_{m1}, v_{m2}, \dots, v_{mn_m})^T.
\end{aligned} \tag{7} $$
In general, different $SFM_i$ can use different methods of feature extraction or they can use the
same principle for feature extraction. Similarly, the classification module can use any principle like
neurocomputing, support vector machines and so on.
In this investigation, the sub-modules $SFM_i$ are derived from multilayer perceptron networks,
while the classification module consists of support vector machines. Hence, we call the
resulting architecture NEUROSVM.
In order to constitute the $i$th sub-module $SFM_i$, we consider an MLP with just one hidden layer,
with architecture $(p, n_i, c)$, where $p$ is the input dimension, $n_i$ is the number of nodes in the hidden
layer and $c$ is the number of classes. Note that, although the number of input and output nodes
in each MLP remains the same, the number of nodes in the hidden layer could be different for
different MLPs. Each MLP is then trained using the training data $X = \{x_i; i = 1, 2, \dots, N\} \subset \Re^p$,
$Y = \{y_i; i = 1, 2, \dots, N\} \subset \Re^c$, where $y_i$ is the target output corresponding to $x_i$.
Once each network is trained, the output of the hidden layer can be taken as the extracted
features. These features capture characteristics of the data that can discriminate between classes;
hence using these features we can do the classification job using just a single layer network.
Note that, instead of MLP, we can use RBF also in the feature extraction module. In Figure
1, the top panel has $m$ different trained MLPs labeled as $MLP_1, MLP_2, \dots, MLP_m$. After the
training, we remove the output layer and its associated connections from each of the MLPs and then
the truncated two-layer sub-networks are taken as feature extraction sub-modules. The subnets
$SFM_1, SFM_2, \dots, SFM_m$ in the lower panel of the NEUROSVM are constructed from the trained
MLPs in the upper panel. The first two layers of $MLP_i$ constitute $SFM_i$, $i = 1, 2, \dots, m$.
As depicted in the lower panel of Figure 1, the outputs from the $m$ sub-modules, taken together,
constitute the input to the classification module. Here we consider SVMs for classification, but
other classifiers such as neural networks (MLP or RBF network) can also be used. Note that each
sub-module receives the same input $x = (x_1, x_2, \dots, x_p)^T$.
Given the training data $X$ and $Y$, in order to train the CM we use the following data set. For
each $x_i \in X$, the FM produces an output $z_i \in \Re^n$ as in Equation (6). Like an MLP, every node in
the second layer of NEUROSVM computes the weighted sum of its input and applies a sigmoidal
activation function to produce its output. Thus $Z = \{z_i; i = 1, 2, \dots, N\}$, $z_i \in \Re^n$, as in Equation (6),
is used as the input data and corresponding to each $z_i \in Z$, the associated $y_i \in \Re^c$, $y_i \in Y$, is taken as
the target output. The CM is trained using $(Z, Y)$.
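The construction of the CM training inputs can be sketched in a few lines of Python. The sketch assumes that the hidden-layer weights and biases of each trained MLP are already available as plain lists, and that the hidden nodes use the logistic sigmoid; the weight values, the helper names and the sample inputs are all hypothetical.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def hidden_features(x, W, b):
    """Output v_i of one sub-module SFM_i: sigmoid(W x + b), one value per hidden node."""
    return [sigmoid(sum(wk * xk for wk, xk in zip(row, x)) + bk)
            for row, bk in zip(W, b)]


def extract_features(x, submodules):
    """Concatenate v_1, ..., v_m into the vector z of Equation (6)."""
    z = []
    for W, b in submodules:
        z.extend(hidden_features(x, W, b))
    return z


# Hypothetical hidden-layer parameters of m = 2 trained MLPs on p = 3 inputs,
# with n_1 = 2 and n_2 = 1 hidden nodes (values chosen only for illustration).
submodules = [
    ([[0.5, -0.2, 0.1], [0.0, 0.3, -0.4]], [0.0, 0.1]),
    ([[1.0, 1.0, 1.0]], [-1.5]),
]
X = [(0.2, 0.7, 0.1), (0.9, 0.1, 0.5)]
Z = [extract_features(x, submodules) for x in X]   # inputs for training the CM
```

Each row of `Z` has $n = n_1 + n_2 = 3$ components; paired with the original targets it forms the data set $(Z, Y)$ on which any classifier, an SVM in NEUROSVM, can be trained.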
In the present case the CM has two layers. The first layer, as shown in Figure 1, is the SVM
kernel layer where each node is associated with a mapped training sample $z_i$ (it is the output from an
FM that represents a support vector) and it computes the kernel output $K(z, z_i)$ on a mapped input
$z$, while the other layer is the output layer.
Figure 1: The proposed NEUROSVM classifier
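The two-layer view of the CM can be illustrated with a small Python sketch for a two-class SVM with RBF kernel. It assumes the standard SVM decision function, in which the output node forms a weighted sum of the kernel-layer outputs plus a bias; the support vectors, coefficients and bias below are hypothetical values, not the result of any training.

```python
import math


def rbf_kernel(z, zi, gamma):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(z, zi)))


def svm_decision(z, support_vectors, coeffs, bias, gamma):
    """Kernel layer followed by a linear output node.

    coeffs[i] plays the role of alpha_i * y_i for the support vector z_i
    (standard SVM decision function, assumed for this sketch).
    """
    kernel_layer = [rbf_kernel(z, zi, gamma) for zi in support_vectors]
    return sum(c * k for c, k in zip(coeffs, kernel_layer)) + bias


# Hypothetical support vectors in the extracted-feature space.
support_vectors = [(0.9, 0.8), (0.1, 0.2)]
coeffs = [1.0, -1.0]
score = svm_decision((0.9, 0.8), support_vectors, coeffs, bias=0.0, gamma=1.0)
label = +1 if score >= 0 else -1
```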
2.6 Advantages of the Proposed Method
A natural question is why such an architecture (NEUROSVM) should be better or more useful
than the usual SVM or MLP. There are a number of reasons for this. Note that we are not
considering very simple data sets where most classifiers will lead to zero training and test errors.
1. Typically, due to the local minima problem of MLP training and its dependence on
initialization, different MLPs may learn different areas of the input space better. Hence when we
combine the outputs of the hidden layers of different networks to generate new features, the
learning task becomes simpler for the CM. This is true irrespective of whether the CM is a
neural network or an SVM.
2. The extracted features result in simpler classification boundaries because a single layer
network can classify the new data (consider a two layer network consisting of the hidden and
output layers of an MLP). This also makes the learning task of the CM simpler.
3. For high dimensional data, typically the number of nodes in the hidden layer is much smaller
than the number of input nodes and one does not need many feature extraction
sub-modules (SFMs). Hence, the dimensionality of the input for the CM can be reduced
compared to the original dimension of the input. This yields a simpler error surface and faster
learning, and allows us to do more experiments, if the CM is a neural network.
4. This is not an ensemble method but it fuses salient characteristics of the input
space as extracted/learnt by different feature extraction networks. It can at least be viewed as
an implicit fusion of multiple classifiers, and hence improvement in performance is expected.
5. For large data sets, it may not be necessary to fully train the MLPs for constructing
the $SFM_i$, because the objective of the MLPs here is to capture the inherent attributes of the
data by the FM.
For low dimensional data sets or simpler data sets this method may not have much advantage
because then $n$ (dimension of input to the CM) can be more than $p$ (original dimension of the input)
and different SFMs may capture the same attributes of the data, resulting in little benefit.
Note that the advantages mentioned in 2 and 3 are also applicable to MLPs.
3. Experiments
The section is arranged as follows. First we have listed the selected data sets to validate our proposed
method. Then the experimental setup is described. Next, the experimental results are presented. Finally,
a control experiment to justify one of the advantages of the proposed method is demonstrated.
3.1 Data Sets
To demonstrate the effectiveness of the proposed method, we consider twelve data sets from the UCI
Machine Learning Repository (Blake and Merz, 1998). We divide the data sets into two Groups:
A and B. The Group A consists of eight data sets: Iris, Vehicle, Breast Cancer (WDBC), Glass,
Sonar, Ionosphere, Lymphography Domain (Lymph) and Pima Indians Diabetes (Pima) data. The
Group B contains Pendigits, Image-Segmentation (Img. Seg.), Landsat satellite image (Sat. Img.)
and Optdigits data. For Group A data sets some results are available in the literature but the details
of the experimental protocols (such as training/test divisions) used are not available. Hence, we
report the performance with ten-fold cross-validation experiments. Each data set is divided into ten
subsets of almost equal size. One of the subsets is used for testing and the remaining nine subsets
are used for training. The procedure is repeated ten times and the average performance is reported.
We report the results in terms of mean test error and its standard error for Group A data sets. For the
four data sets in Group B, benchmark results with different classifiers are available along with the
training-test partition. Hence we have used the same training-test partition here and report the error
on the fixed test set. Table 1 and Table 2 summarize the Group A and Group B data sets respectively.
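The ten-fold protocol used for the Group A data sets can be sketched as follows. The sketch assumes a simple random shuffle without stratification by class, which is an assumption of this example; the function names and the seed are likewise illustrative.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Split the indices 0..n-1 into k disjoint folds of almost equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]


def train_test_splits(n, k=10, seed=0):
    """Yield (train, test) index lists; each fold serves as the test set once."""
    folds = k_fold_indices(n, k, seed)
    for f in range(k):
        train = [i for g, fold in enumerate(folds) if g != f for i in fold]
        yield train, folds[f]


splits = list(train_test_splits(n=214, k=10))   # 214 is the size of the Glass data
```

Averaging the test error over the ten `(train, test)` pairs gives the mean test error reported for a Group A data set.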
Data set     No. of classes   No. of features   Size of the data set and class-wise distribution
Iris         3                4                 150 (= 50 + 50 + 50)
Vehicle      4                18                846 (= 212 + 217 + 218 + 199)
WDBC         2                30                569 (= 212 + 357)
Glass        6                9                 214 (= 70 + 76 + 17 + 13 + 9 + 29)
Sonar        2                60                208 (= 97 + 111)
Ionosphere   2                34                351 (= 225 + 126)
Lymph        4                18                148 (= 2 + 81 + 61 + 4)
Pima         2                8                 768 (= 500 + 268)
Table 1: Group A data sets
Data set    No. of classes   No. of features   Training size   Training class distribution                          Test size   Test class distribution
Pendigits   10               16                7494            780, 779, 780, 719, 780, 720, 720, 778, 719, 719     3498        363, 364, 364, 336, 364, 335, 336, 364, 336, 336
Img. Seg.   7                18                210             30 in each class                                     2100        300 in each class
Sat. Img.   6                4                 500             104, 68, 108, 47, 58, 115                            5935        1429, 635, 1250, 579, 649, 1393
Optdigits   10               64                3823            376, 389, 380, 389, 387, 376, 377, 387, 380, 382     1797        178, 182, 177, 183, 181, 182, 181, 179, 174, 180
Table 2: Group B data sets
3.2 Experimental Setup
In this subsection we describe the selection method for hyper parameters of MLP and SVM
classifiers. To select the optimal architecture for an MLP, Andersen and Martinezr (1999) used ten-fold
cross-validation experiments. Adankon and Cheriet (2007) discussed another scheme for SVM
model selection. Here we have used ten-fold cross-validation experiments for MLP architecture
selection as well as for selection of SVM kernel parameters. For Group B data sets training-test
partitions are fixed and hence we have used ten-fold cross-validation on the training set to select the
hyper parameters of classifiers. For Group A data sets, as mentioned earlier, the performances are
reported based on ten-fold cross-validation. So, we perform double blind ten-fold cross-validation
experiments to select hyper parameters of classifiers for Group A data sets.
Note that, for the FM of NEUROSVM, we need to select $m \ge 1$ MLPs. A natural choice would
be to select the best $m$ architectures corresponding to the smallest $m$ values of validation errors.
Based on validation error we choose m architectures for each of the ten folds for Group A data sets
and m architectures for each of the Group B data sets for NEUROSVM.
In a similar manner the regularization parameter $C$ and spread $\gamma$ of the RBF kernels of SVMs are
chosen based on ten-fold cross-validation experiments. We have experimented with $n_c$ choices of
$C$ and $n_g$ choices of $\gamma$. So, we have used $n_c \times n_g$ sets of parameter choices. For each choice, the
ten-fold cross-validation experiments are conducted. Here we also select the $(C, \gamma)$ pair that leads to
the minimum average validation error. In this investigation $n_c = 12$ and $n_g = 15$ are used, resulting
in a total of 180 pairs of parameters.
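The grid search over $(C, \gamma)$ can be sketched as below. Only the grid sizes (12 and 15) are stated in the text; the actual grid values are not, so the powers of two used here are purely hypothetical. The `validation_error` argument is also a hypothetical callable standing in for a full ten-fold cross-validation run, and a toy error surface is substituted so that the code can be executed.

```python
from itertools import product


def select_c_gamma(c_values, gamma_values, validation_error):
    """Return the (C, gamma) pair with the smallest average validation error.

    validation_error(C, gamma) is a hypothetical callable that would run the
    ten-fold cross-validation and return the average validation error.
    """
    grid = list(product(c_values, gamma_values))
    return min(grid, key=lambda pair: validation_error(*pair)), grid


# Hypothetical grids: the text fixes only their sizes (12 and 15).
c_values = [2.0 ** k for k in range(-2, 10)]        # 12 choices of C
gamma_values = [2.0 ** k for k in range(-10, 5)]    # 15 choices of gamma


def toy_error(C, gamma):
    """Stand-in error surface with a known minimum, used only to run the sketch."""
    return abs(C - 8.0) + abs(gamma - 0.25)


best, grid = select_c_gamma(c_values, gamma_values, toy_error)
```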
We have also used ten-fold cross-validation to find the sub-modules for NEUROSVM. The
hyper parameters of SVMs in the classification module of NEUROSVM are also estimated through
ten-fold cross-validation experiments. Note that for Group A data sets we have used double blind
ten-fold cross-validation. We have selected $m$ (= 5) SFMs. Hence using the $m$ SFMs we can generate
$2^m - 1$ different combinations of feature subsets. In our case it is $2^5 - 1 = 31$. Then for each of
the 31 combinations with all 180 pairs of $(C, \gamma)$ we have conducted the ten-fold cross-validation
experiments on the training set(s). We have obtained the best $(C, \gamma)$ for each of the 31 combinations.
Then the best combination is chosen based on the minimum average validation error over all 31
combinations. Finally, using the best combination and the corresponding $(C, \gamma)$ pair, the performance of
NEUROSVM is reported.
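The count of 31 combinations is the number of non-empty subsets of five sub-modules. A short Python sketch makes the enumeration explicit; the labels `SFM1` to `SFM5` are placeholders for the selected sub-modules.

```python
from itertools import combinations


def nonempty_subsets(items):
    """All non-empty subsets of items: 2^m - 1 of them for m items."""
    return [combo
            for r in range(1, len(items) + 1)
            for combo in combinations(items, r)]


sfms = ["SFM1", "SFM2", "SFM3", "SFM4", "SFM5"]
subsets = nonempty_subsets(sfms)
```

Each subset would then be evaluated with all 180 $(C, \gamma)$ pairs, which gives $31 \times 180 = 5580$ cross-validation runs per training set.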
We have performed statistical tests (Dietterich, 1998) to compare the performance of the proposed
algorithms with that of standard algorithms. For Group A data sets where cross-validation is performed, we have
applied the ten-fold cross-validation paired t-test with 9 degrees of freedom at the 5% significance
level (95% confidence). For the four data sets of Group B where a single test set is employed, we have used the
McNemar test with 1 degree of freedom at the same significance level. The formulations of these tests
are as follows.
3.2.1 K-FOLD CROSS-VALIDATION PAIRED T-TEST (DIETTERICH, 1998)
Consider two classifier models, $D_1$ and $D_2$, and a data set X. The data set is split into K parts
of approximately equal sizes, and each part is used in turn for testing of a classifier built on the
pooled remaining K−1 parts. Classifiers $D_1$ and $D_2$ are trained on the training set and tested on the
test set. Denote the observed test accuracies as $P_1$ and $P_2$, respectively. This process is repeated K
times and test accuracies are tagged with superscript (i), i = 1,2,...,K. Thus a set of K differences is
obtained, $P^{(1)} = P_1^{(1)} - P_2^{(1)}$ to $P^{(K)} = P_1^{(K)} - P_2^{(K)}$. Under the null hypothesis ($H_0$: equal accuracies),
the following statistic has a t-distribution with K−1 degrees of freedom

$$ t = \frac{\bar{P}\,\sqrt{K}}{\sqrt{\sum_{i=1}^{K} \left(P^{(i)} - \bar{P}\right)^2 / (K-1)}}, $$

where $\bar{P} = (1/K) \sum_{i=1}^{K} P^{(i)}$. If the calculated t is greater than the tabulated value for the chosen level of
significance (here 0.05) and K−1 (here 9) degrees of freedom, we reject the null hypothesis $H_0$ and
accept that there are significant differences in the two compared classifier models.
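A minimal Python sketch of this statistic is given below, using only the standard library. The function name and the two-sided reading of the test are assumptions made here. The constant 2.262 is the standard tabulated two-sided critical value of the t-distribution for 9 degrees of freedom at the 0.05 level.

```python
import math

# Tabulated two-sided critical value, t-distribution, 9 degrees of freedom, 0.05 level.
T_CRIT_9DF_005 = 2.262


def paired_t_statistic(acc1, acc2):
    """K-fold cross-validated paired t statistic for two lists of fold accuracies.

    acc1[i] and acc2[i] are the test accuracies of the two classifiers on fold i.
    """
    k = len(acc1)
    if k != len(acc2) or k < 2:
        raise ValueError("need two accuracy lists of equal length K >= 2")
    diffs = [a - b for a, b in zip(acc1, acc2)]          # P^(i)
    mean_diff = sum(diffs) / k                           # P-bar
    variance = sum((d - mean_diff) ** 2 for d in diffs) / (k - 1)
    if variance == 0.0:
        raise ValueError("all fold differences are identical; t is undefined")
    return mean_diff * math.sqrt(k) / math.sqrt(variance)


def is_significant(t_value, critical_value=T_CRIT_9DF_005):
    """Two-sided decision (an assumption of this sketch): reject H0 if |t| exceeds the table value."""
    return abs(t_value) > critical_value
```

The default critical value applies only to K = 10 folds; for another K, the tabulated value for K−1 degrees of freedom has to be supplied.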
3.2.2 MCNEMAR TEST (DIETTERICH, 1998)
As before, consider two classifiers $D_1$ and $D_2$. Let us define the following: $N_{00}$ = number of
samples which both $D_1$ and $D_2$ classify incorrectly, $N_{01}$ = number of samples which $D_1$ classifies
incorrectly but $D_2$ classifies correctly, $N_{10}$ = number of samples which $D_1$ classifies correctly but
$D_2$ classifies incorrectly and $N_{11}$ = number of samples which both $D_1$ and $D_2$ classify correctly. Let
$N = N_{00} + N_{01} + N_{10} + N_{11}$ be the total number of samples in the test set. The null hypothesis, $H_0$,
is that there is no difference between the accuracies of the two classifiers. If the null hypothesis is
correct, then the expected counts for $N_{01}$ and $N_{10}$ are both $\frac{1}{2}(N_{01} + N_{10})$. The discrepancy between the
expected and the observed counts is measured by the following statistic

$$ \chi^2 = \frac{\left(|N_{01} - N_{10}| - 1\right)^2}{N_{01} + N_{10}}, $$

which is approximately distributed as $\chi^2$ with 1 degree of freedom. To carry out the test we simply
calculate $\chi^2$ and compare it with the tabulated $\chi^2$ value for a given level of significance, say, 0.05
(in our case).
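The test can be sketched in Python as follows. The sketch uses the continuity-corrected form of the statistic given by Dietterich (1998), (|N01 − N10| − 1)² / (N01 + N10). The function names are introduced here for illustration, and 3.841 is the standard tabulated χ² critical value for 1 degree of freedom at the 0.05 level.

```python
# Tabulated critical value, chi-square distribution, 1 degree of freedom, 0.05 level.
CHI2_CRIT_1DF_005 = 3.841


def disagreement_counts(y_true, pred1, pred2):
    """Return (N01, N10): samples that only D2 gets right, and that only D1 gets right."""
    n01 = sum(1 for y, a, b in zip(y_true, pred1, pred2) if a != y and b == y)
    n10 = sum(1 for y, a, b in zip(y_true, pred1, pred2) if a == y and b != y)
    return n01, n10


def mcnemar_statistic(n01, n10):
    """Continuity-corrected McNemar statistic (|N01 - N10| - 1)^2 / (N01 + N10)."""
    if n01 + n10 == 0:
        raise ValueError("the classifiers never disagree; the statistic is undefined")
    return (abs(n01 - n10) - 1) ** 2 / (n01 + n10)


def is_significant(chi2_value, critical_value=CHI2_CRIT_1DF_005):
    """Reject H0 (equal accuracies) when the statistic exceeds the tabulated value."""
    return chi2_value > critical_value
```

Only the samples on which the two classifiers disagree enter the statistic, so N00 and N11 are not needed to compute it.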
We have performed all experiments using two Sun Blade 2500 workstations with dual processors. The
svm_learn and svm_classify modules of the SVMlight (Joachims, 2002) software are used for training
and classification with binary SVMs. For the RBF neural network, the MATLAB toolbox is used. All
other programs are written in C.
3.3 Experimental Results
In this subsection we first list the hyper parameters of MLP and SVM selected by cross-validation
experiments. Next, the selection of sub-modules and hyper parameters of NEUROSVM is discussed.
The performance comparison of NEUROSVM with the baseline classifiers and standard ensemble
methods is then presented. Finally, we present the performance of other variants of NEUROSVM and
compare it with that of the baseline classifiers as well as the ensemble methods.
3.3.1 SELECTION OF HYPER PARAMETERS FOR MLPS TO CONSTRUCT THE FM
For Group A data sets we use double blind ten-fold cross-validation. The partitioning of data for
Group A data sets is explained in Appendix A. For each outer-level cross-validation loop,
finding the optimal number of hidden nodes and computing the test error are explained in the
procedure RunMLP in Appendix B. The initial weights of the MLPs are chosen randomly in [-0.5,
+0.5] and the learning rate used to train the MLPs is 0.60. The number of iterations used to train
the networks for each data set is chosen based on a few trial experiments. For each data set,
a set of choices of the number of hidden nodes is used to train the MLPs. Table 3 lists the number of
iterations and the hidden-node choices used to train the MLPs for the twelve data sets.
We have decided to use m = 5 neural networks for the feature extraction module and hence, for
each fold, we have to select five choices of the number of hidden nodes to train five MLPs.
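The two-level structure can be sketched in Python under the assumption that "double blind" ten-fold cross-validation means a nested scheme: an outer loop that holds out one fold for testing, and an inner ten-fold loop on the remaining data for validation. The actual partitioning used in the paper is given in Appendix A and may differ, for example in how classes are balanced across folds. All function names here are hypothetical.

```python
import random


def k_fold_indices(indices, k, rng):
    """Shuffle the indices and deal them into k folds of near-equal size."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def nested_cv_splits(n_samples, k_outer=10, k_inner=10, seed=0):
    """Yield (train, test, inner) for each outer fold.

    inner is a list of (fit, validation) index lists built from train only,
    so the outer test fold is never seen while hyper parameters are selected.
    Assumption: plain random folds, with no stratification by class.
    """
    rng = random.Random(seed)
    outer_folds = k_fold_indices(range(n_samples), k_outer, rng)
    for o, test in enumerate(outer_folds):
        train = [i for j, fold in enumerate(outer_folds) if j != o for i in fold]
        inner_folds = k_fold_indices(train, k_inner, rng)
        inner = []
        for v, validation in enumerate(inner_folds):
            fit = [i for j, fold in enumerate(inner_folds) if j != v for i in fold]
            inner.append((fit, validation))
        yield train, test, inner
```

In this arrangement the number of hidden nodes is chosen from the inner validation errors alone, and the outer test fold is used once, to report the error of the selected configuration.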
First we display the variation of the average validation error of cross-validation experiments as
a function of the number of hidden nodes for both Group A and Group B data sets. Since for each
data set in Group A 10 panels are required for the 10 folds, we include the figure for only one data
set, Vehicle, in Figure 2. In Figure 3, four panels are included, one for each of the four data sets in
Group B. In both Figures 2 and 3 we also include the average training errors. As mentioned earlier,
for the FM of NEUROSVM, we want to use m = 5 networks (SFMs). Consider a data set in Group
A. Suppose we have trained MLPs with M different architectures, that is, with M different choices
of the number of hidden nodes. Then for each outer-level fold, we have M hidden-node choices, each
associated with an average validation error. We order these M choices in ascending order
of the associated validation error and select the top five from this ordered list. These
five choices of hidden nodes are used to train five MLPs for feature extraction for that
particular fold. For each data set, Table 4 lists the selected hidden nodes for each fold
(outer level). As an example, for the IRIS data, the selected hidden nodes for the first fold (outer level)
are (7, 2, 5, 6, 8). This means that for the first fold (outer level) we obtained the smallest validation
error with 7 hidden nodes, the next smallest validation error with 2 hidden nodes, and so
on.
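The ranking step can be sketched in Python as follows. The function name is hypothetical, and the error values in the example are invented solely so that the output reproduces the ordering (7, 2, 5, 6, 8) quoted above; they are not results from the paper.

```python
def top_m_hidden_nodes(avg_val_error, m=5):
    """Return the m hidden-node counts with the smallest average validation error.

    avg_val_error maps a hidden-node count to its average validation error.
    Ties keep the insertion order of the mapping, because sorted() is stable.
    """
    ranked = sorted(avg_val_error, key=avg_val_error.get)
    return ranked[:m]


# Invented validation errors for hidden-node counts 2..10 (illustration only).
example_errors = {2: 0.04, 3: 0.09, 4: 0.08, 5: 0.05, 6: 0.06,
                  7: 0.03, 8: 0.07, 9: 0.10, 10: 0.11}
selected = top_m_hidden_nodes(example_errors)
print(selected)  # [7, 2, 5, 6, 8]
```

The first element of the returned list is the choice that would also be used when the MLP is evaluated as a stand-alone classifier.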
Since the first element of this set of five resulted in the smallest validation error, we use this
choice of hidden nodes to train MLPs when we report the performance of the MLP networks as
classifiers. For each data set in Group B, since the training and test partitions are fixed, we have
only one outer loop and hence only one set with five choices of hidden nodes as shown in Table 4.
We follow the same protocol as that of Group A data sets to choose the number of hidden nodes for
computing the performance of MLP networks.
Data set      Training iterations   Hidden nodes explored
Iris          1500                  2-10
Vehicle       2000                  3-16
WDBC          1500                  3, 5-10, 12, 15, 20
Glass         1500                  2-15
Sonar         2000                  3, 5, 7, 10, 12, 15, 20, 25, 30, 35, 40
Ionosphere    1500                  5-10, 12, 15, 20, 25, 30
Lymph         1500                  4-10, 12, 15, 20
Pima          1500                  2-10
Pendigits     1500                  5-10, 12, 15, 18, 20, 25
Img. Seg.     1500                  3-10, 12, 15, 20
Sat. Img.     5000                  2-10
Optdigits     1500                  5, 8, 10, 12, 15, 18, 20, 25, 30, 35, 40, 50
Table 3: Hidden-node choices explored and number of training iterations for the MLPs on the twelve data sets
3.3.2 SELECTION OF HYPER PARAMETERS FOR SVMS
In this section we consider the problem of selecting hyper parameters for a regular SVM, which we
use as a benchmark for comparison with NEUROSVM. To select
the regularization parameter C and the spread γ of the RBF kernel of the SVM classifiers, we have tried a wide
range of C and γ. In this experiment we have used 12 different values of C and 15 different values of
γ, resulting in a total of 180 pairs of (C,γ). The 12 different values of C are 0.001, 0.01, 0.10, 0.20,
0.50, 1.00, 2.00, 5.00, 10.00, 20.00, 100.00 and 1000.00. The 15 different values of γ that we have
used are 0.0001, 0.001, 0.01, 0.10, 0.20, 0.40, 0.80, 1.00, 2.00, 5.00, 10.00, 20.00, 100.00, 1000.00
and 10000.00. In a manner similar to the way the optimal number of hidden nodes is chosen for each
fold (outer level), the optimal (C,γ) is chosen using ten-fold cross-validation experiments. This is
further explained by Procedure RunSVM included in Appendix C. For each of the twelve data sets,
Figure 2: Variation of cross-validation error with the number of hidden nodes for MLPs on the
Vehicle data set, shown for each of the ten folds. The lines with cross-marks
denote the validation error while the lines with circles denote the training error.
Figure 3: Variation of cross-validation error with different choices of number of hidden nodes for
MLPs on four data sets in Group B: (a) Pendigits (b) Img. Seg. (c) Sat. Img. and
(d) Optdigits. The lines with cross-mark denote the validation error while the lines with
circles denote the training error.