2007 IEEE International Conference on Signal Processing and Communications (ICSPC 2007), 24-27 November 2007, Dubai, United Arab Emirates
PATTERN CLASSIFICATION USING SVM WITH GMM DATA SELECTION TRAINING METHOD
Ali Reza Bayesteh Tashk, Abolghasem Sayadiyan, Pejman Mowlaee Begzadeh Mahale, Mohammad Nazari
Electrical Engineering Department, Amirkabir University of Technology, Tehran, Iran, 15914
ABSTRACT
In pattern recognition, the Support Vector Machine (SVM), a discriminative classifier, and the Gaussian mixture model (GMM), a generative model classifier, are two of the most popular techniques. Current state-of-the-art systems try to combine them to achieve more classification power and improve the performance of recognition systems. Most recent works focus on probabilistic SVM/GMM hybrid methods, but this paper presents a novel method for SVM/GMM hybrid pattern classification based on training data selection. This system uses the output of the Gaussian mixture model to choose the training data for the SVM classifier. Results on databases are provided to demonstrate the effectiveness of this system. We are able to achieve error rates that are better than those of current systems.
Index Terms— Support vector machine, Gaussian mixture model
1. INTRODUCTION
In recent years, several researchers have demonstrated various constrained recognition and classification tasks in pattern recognition, such as speech source separation and speaker recognition, under SVM-based frameworks. Support Vector Machines (SVMs) are state-of-the-art tools for linear and nonlinear knowledge discovery. Being based on the maximum margin classifier, SVMs are able to outperform classical classifiers in the presence of high-dimensional data, even when working with nonlinear machines.
An SVM is a powerful and accurate model and can represent pattern variations when a sufficiently large training set is available. Discriminative models have the good property of making full use of the discriminative information of different classes. However, this is troublesome and time-consuming for systems, because each pattern must have a large training set.
Generative models include the Hidden Markov Model (HMM) and the Gaussian Mixture Model (GMM), among others. Each of them can construct class models for pattern recognition tasks, and previous studies show good performance. Generative models use statistical information. Earlier works try to combine generative models, particularly GMMs and HMMs, with a discriminative framework like the SVM [4]. In these systems, classifiers are trained to discriminate between individual frames of data; the likelihood scores of each frame are then combined using an averaging step [5] or via an HMM to give an overall utterance score from which the authenticity of the speaker may be determined.
A discriminative classifier discards information that the objective function considers irrelevant. Especially in speaker recognition tasks, discrimination at the frame level is therefore not optimal for sequence classification, since information relevant to sequence discrimination may be discarded inadvertently [6].
In this paper we introduce a solution that combines the discriminative model and the generative model to exploit both kinds of models, and we describe how the GMM can be used to select the SVM training data set in order to overcome an important weakness of the SVM on large-scale databases. By using the GMM score (likelihood) to classify the easy-to-find members of each class and keeping only the hard-to-find members, we can reduce the number of support vectors.
The remainder of this paper is organized as follows. In Section 2 we briefly describe the SVM and the GMM. In Section 3 we introduce our classifier. In Section 4 we present experimental results for our classifier. Conclusions follow in Section 5.
2. SVM CLASSIFIER WITH GMM TRAINING SET SELECTION
2.1. Support Vector Machines (SVM)
The principle of the Support Vector Machine (SVM) relies on a linear separation in a high-dimensional feature space into which the data have been previously mapped, in order to take into account the eventual non-linearities of the problem.
If we assume that the training set $X = (x_i)_{i=1}^{l} \subset \mathbb{R}^R$, where $l$ is the number of training vectors, $\mathbb{R}$ stands for the real line and $R$ is the number of modalities, is labeled with two-class targets $Y = (y_i)_{i=1}^{l}$, $y_i \in \{-1, +1\}$, then a mapping

$\Phi : \mathbb{R}^R \to F$   (1)
1-4244-1236-6/07/$25.00 © 2007 IEEE
maps the data into a feature space $F$. Vapnik has proved that maximizing the minimum distance in the space $F$ between $\Phi(X)$ and the separating hyperplane $H(w, b)$ is a good means of reducing the generalization risk, where:

$H(w, b) = \{ f \in F : \langle w, f \rangle_F + b = 0 \}$   (2)

($\langle \cdot, \cdot \rangle$ is the inner product).
Vapnik also proved that the optimal hyperplane can be obtained by solving the convex quadratic programming (QP) problem:

Minimize $\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} \xi_i$
subject to $y_i(\langle w, \Phi(x_i) \rangle + b) \ge 1 - \xi_i$, $i = 1, \ldots, l$   (3)

where the constant $C$ and the slack variables $\xi_i$ are introduced to take into account the eventual non-separability of $\Phi(X)$ in $F$.
In practice this criterion is softened to the minimization of a cost function involving both the complexity of the classifier and the degree to which marginal points are misclassified; the trade-off between these factors is managed through a margin-of-error parameter (usually designated C) which is tuned through cross-validation procedures.
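The cross-validation tuning of C mentioned above can be sketched as follows; this is a minimal illustration, not the paper's setup — the toy data and the search grid are assumptions:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Illustrative sketch of tuning the trade-off parameter C by
# cross-validation; the synthetic data and the grid are assumptions.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-1.0, 1.0, (80, 2)),
               rng.normal(+1.0, 1.0, (80, 2))])
y = np.array([0] * 80 + [1] * 80)

# 5-fold cross-validation over a small grid of C values.
search = GridSearchCV(SVC(kernel="rbf", gamma=0.5),
                      {"C": [0.1, 1.0, 10.0, 100.0]}, cv=5)
search.fit(X, y)
print("best C:", search.best_params_["C"])
```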
Although the SVM is based upon a linear discriminator, it is not restricted to making linear hypotheses. Nonlinear decisions are made possible by a nonlinear mapping of the data to a higher-dimensional space. The phenomenon is analogous to folding a flat sheet of paper into any three-dimensional shape and then cutting it into two halves; the resultant nonlinear boundary in the two-dimensional space is revealed by unfolding the pieces.
The SVM's nonparametric mathematical formulation
allows these transformations to be applied efficiently and
implicitly: the SVM's objective is a function of the dot
product between pairs of vectors; the substitution of the
original dot products with those computed in another
space eliminates the need to transform the original data
points explicitly to the higher space. The computation of
dot products between vectors without explicitly mapping
to another space is performed by a kernel function.
The nonlinear projection of the data is performed by this kernel function. There are several common kernel functions in use, such as the linear kernel, the polynomial kernel ($K(x, y) = (\langle x, y \rangle_{\mathbb{R}^R} + 1)^d$) and the sigmoidal kernel ($K(x, y) = \tanh(\langle x, y \rangle_{\mathbb{R}^R} + a)$), where $x$ and $y$ are feature vectors in the input space.
The other popular kernel is the Gaussian (or "radial basis function") kernel, defined as:

$K(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right)$   (4)

where $\sigma$ is a scale parameter, and $x$ and $y$ are feature vectors in the input space. The Gaussian kernel has two hyperparameters that control performance: $C$ and the scale parameter $\sigma$. In this paper we used the radial basis function (RBF) kernel.
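A minimal sketch of an RBF-kernel SVM of this kind, using scikit-learn; the toy data and the hyperparameter values are illustrative assumptions, not the paper's configuration:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two Gaussian blobs in 2-D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 0.6, (100, 2)),
               rng.normal(+1.0, 0.6, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

# K(x, y) = exp(-||x - y||^2 / (2 sigma^2)): scikit-learn's `gamma`
# corresponds to 1 / (2 sigma^2), so the two hyperparameters C and
# sigma from the text map to C and gamma here.
sigma = 1.0
clf = SVC(kernel="rbf", C=1.0, gamma=1.0 / (2.0 * sigma ** 2))
clf.fit(X, y)

print("support vectors:", clf.support_vectors_.shape[0])
print("training accuracy:", clf.score(X, y))
```

Note that the number of support vectors reported here is what the selection method of Section 3 aims to reduce.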
2.2. Gaussian Mixture Models
Gaussian Mixture Models (GMMs) provide a good approximation of the originally observed feature probability density functions by a mixture of weighted Gaussians. The mixture coefficients are computed by use of the Expectation-Maximization (EM) algorithm. Each class is modeled by one GMM, and the decision is made for the maximum-likelihood model. We used diagonal-covariance GMMs as a baseline classifier. The output of the GMM for class $i$ is:

$p_{GMM}(X \mid C_i) = \sum_{m=1}^{M} c_{im} \, N(X; \mu_{im}, \Sigma_{im})$   (5)

where:

$N(X; \mu_{im}, \Sigma_{im}) = \frac{1}{(2\pi)^{d/2} |\Sigma_{im}|^{1/2}} \exp\left[-\frac{1}{2}(X - \mu_{im})^T \Sigma_{im}^{-1} (X - \mu_{im})\right]$   (6)

Here, $c_{im}$, $\mu_{im}$ and $\Sigma_{im}$ are the weight, mean and covariance, respectively, of the $m$-th mixture for class $i$. The GMM reflects the intra-class information.
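The baseline above can be sketched with scikit-learn: one diagonal-covariance GMM per class, fitted by EM, with a maximum-likelihood decision. The toy data and M = 2 components per class are illustrative assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical data: each class is a pair of 2-D clusters.
rng = np.random.default_rng(1)
X0 = np.vstack([rng.normal(-2, 0.5, (60, 2)),   # class 0: two clusters
                rng.normal(0, 0.5, (60, 2))])
X1 = np.vstack([rng.normal(2, 0.5, (60, 2)),    # class 1: two clusters
                rng.normal(4, 0.5, (60, 2))])

# One diagonal-covariance GMM per class, fitted by EM (Eqs. (5)-(6)).
gmm0 = GaussianMixture(n_components=2, covariance_type="diag",
                       random_state=0).fit(X0)
gmm1 = GaussianMixture(n_components=2, covariance_type="diag",
                       random_state=0).fit(X1)

def classify(X):
    # score_samples returns log p(X | class); pick the larger one.
    return (gmm1.score_samples(X) > gmm0.score_samples(X)).astype(int)

print(classify(np.array([[-2.0, -2.0], [2.0, 2.0]])))  # → [0 1]
```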
3. GMM TRAINING DATA SELECTION
In this section we describe how we can use the GMM to select the SVM training data set. An important weakness of the SVM on large-scale databases is that real-time recognition is time-consuming because of the large number of support vectors.
In this case, if we use the GMM score (likelihood) to classify the easy-to-find members of each class and keep only the other, hard-to-find members, we can reduce the number of support vectors. This requires defining a margin $\varepsilon$ for the GMM score; the samples that fall within this margin are used for SVM training.
In this work we discuss two-class problems, in which each sample belongs to one of two classes $\omega_1$ or $\omega_2$. The GMM score is the difference between the log-likelihoods of the two models:

$l(X) = \log\frac{P(X \mid \omega_1)}{P(X \mid \omega_2)} = \log P(X \mid \omega_1) - \log P(X \mid \omega_2)$   (7)
The decision boundary is:
$X \in \omega_1$ if $l(X) > \log\frac{P_2}{P_1}$, otherwise $X \in \omega_2$   (8)

where $P_i$ is the a priori probability of class $\omega_i$. If we add a margin $\varepsilon$ to the GMM score:

$X \in \omega_1$ if $l(X) > \log\frac{P_2}{P_1} + \varepsilon$
$X \in \omega_2$ if $l(X) < \log\frac{P_2}{P_1} - \varepsilon$
otherwise, $X$ passes to the SVM classifier   (9)

Figure 1. All training data and the two mixture contours estimated by EM.
The margin $\varepsilon$ is calculated experimentally. It is clear that the value of $\varepsilon$ is very important for the generalization characteristic of the classifier. Hence, a trade-off can be made between the GMM and the SVM by employing the epsilon parameter: $\varepsilon = 0$ results in pure GMM classification, while $\varepsilon = \infty$ gives pure SVM classification. The selection of epsilon depends on the application.
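The $\varepsilon$-margin selection rule described above can be sketched as follows. This is a minimal illustration of the idea, not the paper's implementation — the synthetic data, $\varepsilon = 0.7$, the mixture counts and the kernel settings are all assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

# Hypothetical overlapping two-class data (class 0 plays omega_1).
rng = np.random.default_rng(2)
X0 = rng.normal(-1.0, 1.0, (200, 2))
X1 = rng.normal(+1.0, 1.0, (200, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 200)

# One diagonal-covariance GMM per class, fitted by EM.
g0 = GaussianMixture(n_components=2, covariance_type="diag",
                     random_state=0).fit(X0)
g1 = GaussianMixture(n_components=2, covariance_type="diag",
                     random_state=0).fit(X1)

# l(X) = log p(X|omega_1) - log p(X|omega_2); with equal priors the
# threshold log(P2/P1) is 0, so the margin is centered at 0.
l = g0.score_samples(X) - g1.score_samples(X)
eps = 0.7
ambiguous = np.abs(l) < eps            # samples inside the margin

# Only the hard, ambiguous samples are used to train the SVM.
svm = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X[ambiguous], y[ambiguous])

def predict(X_new):
    ln = g0.score_samples(X_new) - g1.score_samples(X_new)
    out = np.where(ln > 0.0, 0, 1)     # confident GMM decisions
    amb = np.abs(ln) < eps
    if amb.any():
        out[amb] = svm.predict(X_new[amb])   # defer to the SVM
    return out

print("SVM trained on", int(ambiguous.sum()), "of", len(X), "samples")
```

The SVM here sees only a small fraction of the training set, which is how the method reduces the number of support vectors.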
4. EXPERIMENTAL RESULTS
The synthetic two-class problem and the Pima Indians Diabetes data set, taken from [8], were used to investigate the proposed algorithm.
4.1. Synthetic Data
The dimension of the feature space was m = 2. The training set contained 250 samples and the test set had 1000 points. The optimal Bayes error rate for this example is around 8%. Our classifier works with a GMM with two mixtures per class for training data set selection, passing the selected training data to the SVM classifier.
In Figure 1, all training data and the two mixture contours are illustrated. For $\varepsilon = 0.7$ (selected here only for comparison purposes), we select the best training data set for training the SVM classifier. The selected data set is shown in Figure 2.
The decision boundaries of our classifier are illustrated in Fig. 3 for $\varepsilon = 0.7$. Tipping [9] applied the SVM and the RVM to this data set, using only 100 randomly selected samples from the 250-point training data set for training. The results given in [9] are compared with our result in Table 1.
Figure 2. Selected best training data set for training the SVM classifier with $\varepsilon = 0.7$.
4.2. Pima Diabetes Data
The dimension of the input space was m = 7; the training data set contained 200 samples and the test data set had 332 samples. Results obtained by the SVM and RVM methods are quoted from [9].
Figure 3. Decision boundaries of our classifier for $\varepsilon = 0.7$ (legend: Class 1, Class 2).
Table 2. Error rate for Pima Diabetes Data

Method       Classifier size   Error rate
SVM          109               21.1%
GMM          -                 31.4%
RVM          4                 19.6%
Our method   21                21.5%
5. CONCLUSION
A novel method has been proposed for improving the accuracy and speed of the SVM by combining it with a GMM scoring system. We showed that our method can use the GMM score to find the hard-to-learn regions of the feature space and then pass those data to the SVM (as a maximum-margin nonlinear discriminative classifier) for training, and we described how the GMM can be used to select the SVM training data set.
The proposed method has the ability to reduce the number of support vectors significantly. This reduction improves the speed of the SVM classifier, and using the GMM helps achieve more accuracy. The experimental results presented demonstrate the effectiveness of the proposed technique.
6. REFERENCES
[1] V. N. Vapnik, The Nature of Statistical Learning Theory, Second Edition, New York: Springer-Verlag, 1999.
[2] J. Platt, "Probabilities for SV machines," in Advances in Large Margin Classifiers, pp. 61-74, MIT Press, 1999.
[3] G. Wahba, "Multivariate function and operator estimation, based on smoothing splines and reproducing kernels," in Nonlinear Modeling and Forecasting, SFI Studies in the Sciences of Complexity, Proc. Vol. XII, pp. 95-112, Addison-Wesley, 1992.
[4] G. Doddington, M. Przybocki, A. Martin, and D. Reynolds, "The NIST speaker recognition evaluation: overview, methodology, systems, results, perspective," Speech Communication, vol. 31, no. 2-3, pp. 225-254, 2000.
[5] D. A. Reynolds, "Speaker identification and verification using Gaussian mixture speaker models," Speech Communication, vol. 17, pp. 91-108, 1995.
[6] V. Wan and S. Renals, "SVMSVM: support vector machine speaker verification methodology," in Proc. ICASSP'03, vol. 2, pp. 221-224, 2003.
[7] F. Hou and B. Wang, "Text-independent speaker recognition using probabilistic SVM with GMM adjustment," in Proc. Natural Language Processing and Knowledge Engineering Conf., pp. 305-308, 2003.
[8] B. D. Ripley, Pattern Recognition and Neural Networks, Cambridge, U.K.: Cambridge Univ. Press, 1996.
[9] M. E. Tipping, "Sparse Bayesian learning and the relevance vector machine," J. Machine Learn. Res., vol. 1, pp. 211-244, 2001.
[10] S. Fine, J. Navratil and R. A. Gopinath, "A hybrid GMM/SVM approach to speaker identification," in Proc. ICASSP 2001.