# Sparse Kernel Logistic Regression using Incremental Feature Selection for Text-Independent Speaker Identification

**ABSTRACT** Logistic regression is a well-known classification method in the field of statistical learning. Recently, a kernelized version of logistic regression has become very popular, because it allows non-linear probabilistic classification and shows promising results on several benchmark problems. In this paper we show that kernel logistic regression (KLR) and especially its sparse extensions (SKLR) are useful alternatives to standard Gaussian mixture models (GMMs) and support vector machines (SVMs) in speaker recognition. While the classification results of KLR and SKLR are similar to the results of SVMs, we show that SKLR produces highly sparse models. Unlike SVMs, kernel logistic regression also provides an estimate of the conditional probability of class membership. In speaker identification experiments the SKLR methods outperform the SVM and the GMM baseline system on the POLYCOST database.


Marcel Katz, Martin Schafföner, Edin Andelic, Sven E. Krüger, Andreas Wendemuth

IESK, Cognitive Systems

University of Magdeburg, Germany

marcel.katz@e-technik.uni-magdeburg.de


1. Introduction

In this work we introduce sparse kernel logistic regression as a classification method for text-independent speaker identification.

Logistic regression (LR) and its non-linear version, kernel logistic regression (KLR), are popular discriminative classifiers [1]. In contrast to informative classifiers like linear discriminant analysis (LDA), where the underlying class densities are estimated, discriminative classifiers like LR focus on modeling the a-posteriori probability of membership in each of C classes.

The use of speaker recognition and its applications is already widespread, e.g. in telephone banking systems where it is important to verify that the person providing the credit card number is the owner of the card, or in access to private areas.

The field of speaker recognition can be divided into two tasks, namely speaker identification and speaker verification. The speaker identification task consists of a set of known speakers or clients (closed-set) and the problem is to decide which person from the set is talking.

In speaker verification the recognition system has to verify whether a person is the one he claims to be (open-set). The main difficulty in this setup is that the system has to deal with known and unknown speakers, so-called clients and impostors, respectively.

In speaker recognition Gaussian mixtures are usually a good choice for modeling the distribution of the speaker-specific speech samples. In speech recognition or in text-dependent speaker recognition systems the temporal dynamics of speech can be represented very efficiently by multi-state left-to-right Hidden Markov Models (HMMs), and for text-independent approaches by single-state HMMs [2].

It has to be noted that good performance of GMMs depends on a sufficient amount of data for the parameter estimation. In speaker recognition the amount of speech data of each client is limited. Normally, only a few seconds of speech are available and so the parameter estimation might not be very robust. Especially if the data is limited, discriminative classifiers like SVMs show a great ability of generalization and a better classification performance than GMMs. Also, SVMs were integrated into HMMs to model the acoustic feature vectors [3] in continuous speech recognition.

There have been several approaches of integrating SVMs into speaker verification environments. One method is not to discriminate between frames but between entire utterances. These utterances have different lengths and so a mapping from a variable length pattern to a fixed size vector is needed. Several methods like the Fisher-kernel [4] map the sequences into a high-dimensional score-space where an SVM is then used to classify the data, e.g. [5]. Alternatively, one can use the sequence kernel introduced by [6] which maps the support vectors into a single vector for each speaker by a kernel which is derived from generalized linear discriminants.

Logistic regression is used in speaker recognition as well. In [7] a kind of multi-class KLR was used on a speaker identification task. This version differs from our approach in the way of parameter estimation, and it is not sparse, so that the size of the training data is very limited.

The paper is organized as follows. We first describe the standard methods in speaker recognition in section 2. In sections 3 and 4 we give a short overview of kernel-based methods and the SVM, respectively. We give a brief introduction to logistic regression and its non-linear extension to kernel functions in section 5. We describe a method of incorporating sparseness into the KLR model. The speaker identification results of (S)KLR, SVMs and GMMs are compared in section 7 and we conclude with a short discussion in section 8.

2. GMM Baseline System

In the past several years there has been a lot of progress in the field of speaker recognition [8][2]. State-of-the-art speaker verification systems are based on modeling speaker-dependent GMMs for each client as well as an additional speaker-independent model (world-model) for the decision process.

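The GMM is the weighted sum of M Gaussian densities:

$$p(x|\lambda) = \sum_{i=1}^{M} c_i \, \mathcal{N}(x|\mu_i,\Sigma_i) \tag{1}$$

where $c_i$ is the weight of the i'th component and $\mathcal{N}(x|\mu_i,\Sigma_i)$ is a multivariate Gaussian with mean vector $\mu_i$ and covariance matrix $\Sigma_i$:

$$\mathcal{N}(x|\mu_i,\Sigma_i) = \frac{1}{\sqrt{(2\pi)^n |\Sigma_i|}} \exp\left[-\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i)\right] \tag{2}$$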

These mixture models are trained for each speaker in the closed speaker set using estimation methods like the Expectation Maximization (EM) algorithm.

The probability that a test sentence $X = \{x_1,\ldots,x_N\}$ is generated by the speaker model $\lambda$ is calculated by the log-likelihood over the whole sequence of length N:

$$\log P(X|\lambda) = \sum_{i=1}^{N} \log P(x_i|\lambda). \tag{3}$$

Given a set of speakers, the task of speaker identification is to deduce the most likely client C from a given speech sequence:

$$\hat{C}_k = \arg\max_{k} \log(P(X|\lambda_k)). \tag{4}$$
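Equations (1) to (4) can be illustrated with a short NumPy sketch. It is only a toy under stated assumptions: the two speaker models are set by hand rather than trained with EM, and the function names (`log_gaussian`, `gmm_log_likelihood`, `identify`) as well as the synthetic test frames are illustrative choices, not part of the baseline system.

```python
import numpy as np

def log_gaussian(x, mean, cov):
    """Logarithm of equation (2) for every row of x (frames x n)."""
    n = mean.shape[0]
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)
    return -0.5 * (n * np.log(2.0 * np.pi) + logdet + maha)

def gmm_log_likelihood(x, weights, means, covs):
    """Equation (3): sum over frames of log p(x_i | lambda), p as in (1)."""
    comp = np.stack([np.log(c) + log_gaussian(x, m, s)
                     for c, m, s in zip(weights, means, covs)])
    top = comp.max(axis=0)                       # log-sum-exp for stability
    frame_ll = top + np.log(np.exp(comp - top).sum(axis=0))
    return frame_ll.sum()

def identify(x, speaker_models):
    """Equation (4): index of the model with the highest log-likelihood."""
    scores = [gmm_log_likelihood(x, *model) for model in speaker_models]
    return int(np.argmax(scores)), scores

# Two toy speakers with hand-set (not EM-trained) two-component models.
eye = np.eye(2)
models = [
    (np.array([0.5, 0.5]), [np.array([0.0, 0.0]), np.array([1.0, 1.0])], [eye, eye]),
    (np.array([0.5, 0.5]), [np.array([5.0, 5.0]), np.array([6.0, 6.0])], [eye, eye]),
]
rng = np.random.default_rng(0)
test_frames = rng.normal(loc=5.5, scale=1.0, size=(50, 2))  # synthetic frames
best, scores = identify(test_frames, models)
```

The mixture is evaluated in the log domain because a product of many small frame likelihoods would underflow quickly for sentences of realistic length.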

2.1. Score Normalization

There are several methods of score normalization in speaker recognition systems, e.g. the T-norm. The T-norm is a method for normalizing the score of a speaker model by a collection of background models:

$$S(X|\lambda) = \frac{\log P(X|\lambda) - \mu_x}{\sigma_x} \tag{5}$$

with the mean $\mu_x$ and the standard deviation $\sigma_x$ of the log-likelihood scores of the test sentence against the set of background speakers. This normalization has no effect on the decision of equation (4), but we will need this normalization for the comparison of the different scores produced by the GMM, SVM and (S)KLR models.
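A minimal sketch of equation (5) follows. The score values are invented for illustration, and `t_norm` is a hypothetical helper name. Because the same mean and standard deviation are applied to every client score of a test sentence, the ranking of the clients is unchanged, which is why the decision of equation (4) is not affected.

```python
import numpy as np

def t_norm(score, background_scores):
    """Equation (5): normalize one score by background-model statistics."""
    mu = np.mean(background_scores)
    sigma = np.std(background_scores)   # population standard deviation
    return (score - mu) / sigma

# Hypothetical log-likelihoods of one test sentence.
client_scores = np.array([-310.0, -295.0, -330.0])        # three client models
background = np.array([-340.0, -352.0, -347.0, -361.0])   # background models

normalized = np.array([t_norm(s, background) for s in client_scores])
```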

3. Kernel-Based Methods

Discriminative classifiers like SVMs and KLR are linear classification methods, where different classes are separated by hyperplanes. Instead of applying the linear methods directly to the input space $\mathbb{R}^d$, they are applied to a higher-dimensional feature space F, which is non-linearly related to the input space, $\Phi : \mathbb{R}^d \rightarrow F$, by the so-called kernel trick. The kernel trick can be used if the algorithm to be generalized uses the training vectors only in the form of Euclidean dot-products $(x^T \cdot y)$. Thus, it suffices to calculate the dot-product in the feature space $(\Phi(x) \cdot \Phi(y))$, which is equal to the kernel function $k(x,y)$, i.e.

$$k(x,y) = (\Phi(x) \cdot \Phi(y)) \tag{6}$$

if $k(x,y)$ fulfills Mercer's conditions [9]. Important kernel functions which fulfill these conditions are the polynomial kernel

$$k(x,y) = (x \cdot y + 1)^d \tag{7}$$

which maps the input data into the space of all monomials up to degree d, and the Gaussian radial basis function (RBF) kernel

$$k(x,y) = \exp\left(\frac{-\|x - y\|^2}{2\sigma^2}\right). \tag{8}$$
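The kernel trick of equation (6) can be checked numerically for a small case. The sketch below, written under the assumption of two-dimensional input and degree 2, compares the polynomial kernel (7) with the dot product of an explicit monomial feature map; `phi_degree2` is an illustrative helper and the input vectors are arbitrary.

```python
import numpy as np

def polynomial_kernel(x, y, degree):
    """Equation (7)."""
    return (np.dot(x, y) + 1.0) ** degree

def rbf_kernel(x, y, sigma):
    """Equation (8)."""
    diff = x - y
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

def phi_degree2(x):
    """Explicit feature map for degree 2 and two-dimensional input."""
    x1, x2 = x
    r2 = np.sqrt(2.0)
    return np.array([1.0, r2 * x1, r2 * x2, x1 ** 2, x2 ** 2, r2 * x1 * x2])

x = np.array([0.5, -1.0])
y = np.array([2.0, 0.25])
k_implicit = polynomial_kernel(x, y, degree=2)        # computed in input space
k_explicit = np.dot(phi_degree2(x), phi_degree2(y))   # computed in feature space
```

Both values agree, although the kernel never forms the six-dimensional feature vectors. For the RBF kernel the feature space is infinite-dimensional, so only the implicit computation is available.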

4. Support Vector Machines

SVMs were first introduced by Vapnik and developed from the theory of Structural Risk Minimization (SRM) [10]. We now give a short overview of SVMs and refer to [11] for more details and further references. Given a training set of input samples $x \in \mathbb{R}^d$ and corresponding targets $y \in \{1,-1\}$, a kernel function $k(x_i,x_j)$ and a parameter C, the SVM tries to find an optimal separating hyperplane in F. This is done by solving the quadratic programming problem of maximizing

$$W(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j k(x_i,x_j) \tag{9}$$

subject to the constraints

$$\sum_{i=1}^{N} \alpha_i y_i = 0 \quad \text{and} \quad 0 \le \alpha_i \le C \;\; \forall i. \tag{10}$$

The regularization parameter C > 0 allows us to specify how strictly we want the classifier to fit the training data.

4.1. Probabilistic SVM Output

The output of the SVM is a distance measure between a pattern and the decision boundary:

$$f(x) = \sum_{i=1}^{N} \alpha_i y_i k(x_i,x) + b \tag{11}$$

where the pattern x is assigned to the positive class if $f(x) > 0$. The training vectors $x_i$ for which the weights $\alpha_i$ are greater than zero are called support vectors.

For the posterior class probability we have to model the class-conditional probabilities $P(f|y = +1)$ and $P(f|y = -1)$ of $f(x)$. Using two exponential functions, the probability of the class given the output is computed by Bayes' rule [12]:

$$P(y = +1|x) = g(f(x),A,B) = \frac{1}{1 + \exp(A f(x) + B)} \tag{12}$$

where the parameters A and B can be calculated by a maximum likelihood estimation from the training set [12].
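The maximum likelihood fit of A and B in equation (12) can be sketched as follows. This is a simplified illustration and not the procedure of [12]: it uses plain gradient descent on the negative log-likelihood, unmodified 0/1 targets, and synthetic SVM outputs; the names `fit_platt`, `lr` and `steps` are assumptions made for the example.

```python
import numpy as np

def platt_probability(f, A, B):
    """Equation (12): P(y = +1 | x) from the SVM output f(x)."""
    return 1.0 / (1.0 + np.exp(A * f + B))

def fit_platt(f, y, lr=0.1, steps=5000):
    """Fit A and B by gradient descent on the negative log-likelihood."""
    t = (y > 0).astype(float)            # targets in {0, 1}
    A, B = 0.0, 0.0
    for _ in range(steps):
        p = platt_probability(f, A, B)
        # d(NLL)/dA = sum((t - p) * f), d(NLL)/dB = sum(t - p); averaged here
        A -= lr * np.mean((t - p) * f)
        B -= lr * np.mean(t - p)
    return A, B

# Synthetic SVM outputs: positive class around +1, negative class around -1.
rng = np.random.default_rng(1)
y = np.concatenate([np.ones(200), -np.ones(200)])
f = np.concatenate([rng.normal(1.0, 1.0, 200), rng.normal(-1.0, 1.0, 200)])
A, B = fit_platt(f, y)
```

With this parameterization a useful fit has A < 0, so that larger SVM outputs map to larger probabilities of the positive class.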

5. Kernel Logistic Regression

Consider again a binary classification problem, now with targets $y \in \{0,1\}$. The probability that the sample x belongs to class 1 is given by $P(y = 1|x)$, and the probability that it belongs to class 0 is $P(y = 0|x) = 1 - P(y = 1|x)$.

In logistic regression we want to model the posterior probability of the class membership via the linear function:

$$f(x) = \beta^T x \tag{13}$$

where $\beta$ denotes the weight vector, including a bias $\beta_0$, while the sample x is augmented by a constant entry of 1. Interpreting the output of $f(x)$ as an estimate of a probability $P(x,\beta)$, we have to rearrange equation (13) by the logit transfer function

$$\mathrm{logit}\{P(x,\beta)\} = \log \frac{P(x,\beta)}{1 - P(x,\beta)} = \beta^T x \tag{14}$$

which results in the probability:

$$P(x,\beta) = \frac{1}{1 + \exp(-f(x))}. \tag{15}$$

If we assume that the training data are drawn from a Bernoulli distribution conditioned on the samples x, the conditional probability $P(y|x,\beta)$ is

$$P(y|x,\beta) = P(x,\beta)^{y} \, (1 - P(x,\beta))^{1-y} \tag{16}$$

$$\phantom{P(y|x,\beta)} = \begin{cases} P(x,\beta) & y = 1 \\ 1 - P(x,\beta) & y = 0 \end{cases} \tag{17}$$

and the negative log-likelihood (NLL) of equation (17) can be written as

$$l\{\beta\} = \sum_{i=1}^{N} \left[ -y_i \beta^T x_i + \log\left(1 + \exp(\beta^T x_i)\right) \right]. \tag{18}$$

To avoid over-fitting to the training data it is necessary to impose a penalty on large fluctuations of the estimated parameters $\beta$. The most popular method is the ridge penalty $\frac{\lambda}{2}\|\beta\|^2$ that was introduced by [13]. This quadratic regularization is added to the NLL function

$$l\{\beta\}_{ridge} = l\{\beta\} + \frac{\lambda}{2} \|\beta\|_2^2 \tag{19}$$

where $\lambda$ is the regularization parameter.

To minimize the regularized NLL we set the derivatives $\partial l\{\beta\}_{ridge} / \partial \beta$ to zero and use the Newton-Raphson algorithm to iteratively solve equation (19). In this case the algorithm is also referred to as iteratively re-weighted least squares (IRLS), see e.g. [14]:

$$\beta^{new} = \left( X^T W X + \lambda I \right)^{-1} X^T W z \tag{20}$$

with the adjusted response:

$$z = X\beta^{old} + W^{-1}(y - p) \tag{21}$$

where X is the matrix whose rows are the augmented training samples, y is the vector of targets, p is the vector of fitted probabilities with the i'th element $P(x_i,\beta^{old})$, W is the N × N weight matrix with entries $P(x_i,\beta^{old})(1 - P(x_i,\beta^{old}))$ on the diagonal, and I is the identity matrix.
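The IRLS update of equations (20) and (21) can be written in a few lines of NumPy. The following is a sketch on synthetic data, with illustrative names (`irls_ridge`, `ridge_nll`) and an arbitrary choice of $\lambda = 1$. It uses the identity $Wz = WX\beta^{old} + (y - p)$ so that $W^{-1}$ never has to be formed, which would be numerically fragile when fitted probabilities approach 0 or 1.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def ridge_nll(beta, X, y, lam):
    """Equation (19): NLL of equation (18) plus the ridge penalty."""
    a = X @ beta
    return np.sum(-y * a + np.logaddexp(0.0, a)) + 0.5 * lam * beta @ beta

def irls_ridge(X, y, lam, iters=25):
    """Newton-Raphson / IRLS updates of equations (20) and (21)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ beta)
        w = p * (1.0 - p)                         # diagonal of W
        wz = w * (X @ beta) + (y - p)             # W z without forming W^{-1}
        lhs = X.T @ (w[:, None] * X) + lam * np.eye(X.shape[1])
        beta = np.linalg.solve(lhs, X.T @ wz)
    return beta

# Synthetic two-dimensional data, augmented by a constant 1 for the bias.
rng = np.random.default_rng(2)
features = rng.normal(size=(200, 2))
X = np.hstack([np.ones((200, 1)), features])
true_beta = np.array([-0.5, 2.0, -1.0])
y = (rng.uniform(size=200) < sigmoid(X @ true_beta)).astype(float)

lam = 1.0
beta_hat = irls_ridge(X, y, lam)
# Gradient of equation (19); it should vanish at the solution.
gradient = X.T @ (sigmoid(X @ beta_hat) - y) + lam * beta_hat
```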

The extension from the linear model to the non-linear one is realized by a non-linear mapping $\Phi : \mathbb{R}^d \rightarrow F$ to a feature space F. Following the Representer Theorem [15], every $\beta \in F$ that solves (20) lies in the span of all $\Phi(x_i)$:

$$\beta = \sum_{i=1}^{N} \alpha_i \Phi(x_i). \tag{22}$$

We apply this expansion of $\beta$ to equation (20) so that all samples appear in the form of dot products $(\Phi(x_i) \cdot \Phi(x_j))$ in the feature space. Introducing the kernel matrix K with $K_{ij} = (\Phi(x_i) \cdot \Phi(x_j)) = k(x_i,x_j)$ we can write equation (20) as

$$\alpha^{new} = \left( K + \lambda W^{-1} \right)^{-1} \tilde{z} \tag{23}$$

where the adjusted response in F is given by

$$\tilde{z} = \left( K\alpha^{old} + W^{-1}(y - p) \right). \tag{24}$$
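A compact sketch of kernelized IRLS is given below. It assumes the update in the form $(K + \lambda W^{-1})\,\alpha^{new} = \tilde{z}$ and multiplies both sides by W to avoid the inverse. The ring-shaped toy data, the kernel width and $\lambda = 0.5$ are arbitrary illustrative choices, and `rbf_gram` and `fit_klr` are hypothetical helper names.

```python
import numpy as np

def rbf_gram(A, B, sigma):
    """Gram matrix of the RBF kernel (8) between rows of A and rows of B."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def fit_klr(K, y, lam, iters=30):
    """IRLS for KLR: solve (K + lam W^{-1}) alpha = z~ in every step."""
    n = K.shape[0]
    alpha = np.zeros(n)
    for _ in range(iters):
        f = K @ alpha
        p = 1.0 / (1.0 + np.exp(-f))
        w = p * (1.0 - p)
        # Multiplied by W: (W K + lam I) alpha = W K alpha_old + (y - p)
        alpha = np.linalg.solve(w[:, None] * K + lam * np.eye(n), w * f + (y - p))
    return alpha

# Toy data that is not linearly separable: class 1 inside, class 0 on a ring.
rng = np.random.default_rng(5)
inner = rng.normal(0.0, 0.3, (40, 2))
angles = np.linspace(0.0, 2.0 * np.pi, 40, endpoint=False)
radii = 2.0 + rng.normal(0.0, 0.2, 40)
outer = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
X = np.vstack([inner, outer])
y = np.concatenate([np.ones(40), np.zeros(40)])

lam = 0.5
K = rbf_gram(X, X, sigma=1.0)
alpha = fit_klr(K, y, lam)
p = 1.0 / (1.0 + np.exp(-(K @ alpha)))
accuracy = np.mean((p > 0.5) == (y == 1))
```

At the fixed point of this iteration $\lambda\alpha = y - p$ holds, so every training sample with a non-zero residual receives a non-zero weight. This is the lack of sparseness that motivates the next section.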

5.1. Sparse KLR using ridge regularization

The main drawback of the kernel logistic regression is that all training vectors are involved in the final solution, which is not acceptable for large datasets like speech recognition tasks. A sparse solution of the kernel expansion could be achieved if we involve only basis functions corresponding to a subset S of the training set R:

$$f(x) = \sum_{i=1}^{q} \alpha_i k(x_i,x), \qquad q \ll N \tag{25}$$

with q training samples. If we apply equation (25) instead of (22) in the IRLS algorithm of equation (20) we get the following sparse formulation:

$$\alpha^{new} = \left( K_{Nq}^T W K_{Nq} + \lambda K_{qq} \right)^{-1} K_{Nq}^T W \tilde{z} \tag{26}$$

with $\tilde{z} = \left( K_{Nq}\alpha^{old} + W^{-1}(y - p) \right)$, the N × q matrix $K_{Nq} = k(x_i,x_j);\; x_i \in R, x_j \in S$ and the q × q regularization matrix $K_{qq} = k(x_i,x_j);\; x_i,x_j \in S$.

This variant was originally introduced by [16] and is referred to as Import Vector Machine (IVM). The IVM aims to iteratively minimize the NLL by adding training samples to a subset S of selected training vectors until the minimum of the NLL is reached. Starting with an empty subset we have to minimize the NLL for each $x_l$ of the training set R:

$$l\{x_l\} = -y^T (K_{Nq}^{l} \alpha^{l}) + \mathbf{1}^T \log(1 + \exp(K_{Nq}^{l} \alpha^{l})) + \frac{\lambda}{2} \alpha^{lT} K_{qq}^{l} \alpha^{l} \tag{27}$$

with the N × (q + 1) matrix $K_{Nq}^{l} = k(x_i,x_j);\; x_i \in R, x_j \in S \cup \{x_l\}$ and the (q + 1) × (q + 1) regularization matrix $K_{qq}^{l} = k(x_i,x_j);\; x_i,x_j \in S \cup \{x_l\}$. Then we add the vector for which we get the highest decrease in NLL to the subset S:

$$x_l^{*} = \arg\min_{x_l \in R} l\{x_l\}. \tag{28}$$

While in the original KLR algorithm we iteratively estimate the parameter $\alpha$ by the IRLS algorithm, we can use a one-step approximation here [16]. In each step we approximate the new $\alpha$ with the fitted result from the current subset S which we estimated in the previous minimization process.

Since we want to minimize the NLL of equation (27) we also calculate the ratio

$$\frac{|l\{\alpha\}_{l} - l\{\alpha\}_{l-1}|}{|l\{\alpha\}_{l}|}$$

in each iteration and stop the algorithm if the ratio is less than a predefined value $\epsilon$. The resulting fit of the KLR$_{ridge}$ approximates the full KLR model, while the number of kernel functions used in the final solution is only a small fraction of the number of training vectors.
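The greedy selection of equations (26) to (28) together with the stopping ratio can be sketched as below. This is a simplified illustration and not the implementation used in the experiments: it runs a full IRLS fit for every candidate instead of the one-step approximation of [16], skips candidates that are already in S, adds a tiny diagonal jitter for numerical safety, and works on two synthetic Gaussian blobs. All names and parameter values (`fit_subset`, `import_vector_machine`, `eps`, `max_vectors`) are assumptions made for the example.

```python
import numpy as np

def rbf_gram(A, B, sigma):
    """Gram matrix of the RBF kernel (8) between rows of A and rows of B."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def fit_subset(K, y, subset, lam, iters=12):
    """IRLS of equation (26) on the columns in `subset`; returns alpha and (27)."""
    Knq = K[:, subset]                        # N x q
    Kqq = K[np.ix_(subset, subset)]           # q x q
    q = len(subset)
    alpha = np.zeros(q)
    for _ in range(iters):
        f = Knq @ alpha
        p = 1.0 / (1.0 + np.exp(-f))
        w = p * (1.0 - p)
        wz = w * f + (y - p)                  # W z~ without forming W^{-1}
        lhs = Knq.T @ (w[:, None] * Knq) + lam * Kqq + 1e-8 * np.eye(q)
        alpha = np.linalg.solve(lhs, Knq.T @ wz)
    f = Knq @ alpha
    nll = np.sum(-y * f + np.logaddexp(0.0, f)) + 0.5 * lam * alpha @ Kqq @ alpha
    return alpha, nll

def import_vector_machine(K, y, lam, eps=1e-3, max_vectors=10):
    """Add the sample with the lowest regularized NLL until the ratio < eps."""
    subset, alpha, prev = [], np.zeros(0), None
    while len(subset) < max_vectors:
        candidates = [l for l in range(K.shape[0]) if l not in subset]
        fits = [fit_subset(K, y, subset + [l], lam) for l in candidates]
        best = int(np.argmin([nll for _, nll in fits]))      # equation (28)
        subset.append(candidates[best])
        alpha, nll = fits[best]
        if prev is not None and abs(nll - prev) / abs(nll) < eps:
            break
        prev = nll
    return subset, alpha

rng = np.random.default_rng(4)
X = np.vstack([rng.normal([-2.0, 0.0], 0.7, (30, 2)),
               rng.normal([2.0, 0.0], 0.7, (30, 2))])
y = np.concatenate([np.zeros(30), np.ones(30)])
K = rbf_gram(X, X, sigma=1.0)
subset, alpha = import_vector_machine(K, y, lam=0.1)
accuracy = np.mean((K[:, subset] @ alpha > 0) == (y == 1))
```

Only the kernel columns of the selected samples are needed at test time, which is the source of the computational saving compared with the full expansion of equation (22).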

6. Multi-class problems

Kernel logistic regression can be extended naturally to multi-class problems. The conditional probability that x belongs to class j is written as

$$p(y_j|x,\alpha) = \frac{\exp(f_j(x))}{\sum_{c=1}^{C} \exp(f_c(x))} \tag{29}$$

where in the kernelized version $f_j(x)$ is defined as

$$f_j(x) = \sum_{i=1}^{N} \alpha_{ij} k(x_i,x). \tag{30}$$

But for comparison with binary classifiers like the SVM we decided to use a common one-versus-one approach where a classifier learns to discriminate one class from one other class. This leads to $C(C - 1)/2$ pairwise classification rules [17], where we need to train all possible pairs out of C classes. The main advantage of the one-versus-one approach is that it is easier to solve $C(C - 1)/2$ small classification problems than to solve C large problems.

The resulting pairwise probabilities $P_{ij}(q_i|x)$ of a class $q_i$, given a sample vector x belonging to either $q_i$ or $q_j$, are transformed to the posterior probability $p(q_i|x)$ by [18]:

$$p(q_i|x) = 1 \Big/ \left( \sum_{j=1, j \neq i}^{C} \frac{1}{P_{ij}(q_i|x)} - (C - 2) \right). \tag{31}$$
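Equation (31) can be verified on a small example. The sketch below builds pairwise probabilities that are exactly consistent with a known three-class posterior, $P_{ij} = p_i / (p_i + p_j)$, and checks that the coupling formula recovers it. The posterior values and the name `couple_pairwise` are illustrative assumptions; with pairwise classifiers trained on real data the pairwise estimates are generally not consistent and the result is only an approximation.

```python
import numpy as np

def couple_pairwise(P):
    """Equation (31). P[i, j] holds P_ij(q_i | x); the diagonal is ignored."""
    C = P.shape[0]
    posteriors = np.empty(C)
    for i in range(C):
        inv_sum = sum(1.0 / P[i, j] for j in range(C) if j != i)
        posteriors[i] = 1.0 / (inv_sum - (C - 2))
    return posteriors

# Pairwise probabilities that are exactly consistent with a known posterior.
true_p = np.array([0.6, 0.3, 0.1])
P = true_p[:, None] / (true_p[:, None] + true_p[None, :])
recovered = couple_pairwise(P)
```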

7. Experiments

We compare the presented approach of (S)KLR to a baseline GMM and an SVM system on the POLYCOST dataset [19]. This dataset contains 110 speakers (63 females and 47 males) from different European countries. The dataset is divided into 4 baseline experiments (BE1-BE4) from which we used the text-independent set BE4 for speaker identification experiments.

During feature extraction the speech data is divided into frames of 25 ms with a frame shift of 10 ms, and a voiced/unvoiced decision is obtained using a pitch detection algorithm [20]. Only the voiced speech is then parameterized using 12 Mel-Cepstrum coefficients as well as the frame-energy. The first and the second order differences are added, resulting in feature vectors of 39 dimensions. The parameters of the baseline GMM models were estimated using the HTK toolkit [21]. For the SVM and (S)KLR approaches we used a modified version of HTK utilizing a runtime plug-in library for external computation of probabilities.

As in the baseline system, the (S)KLR and the SVM models were applied at the frame-level. For the identification experiments we used 2 sentences of each speaker for the training and 2 sentences as development test set. The evaluation set contains up to 5 sentences per speaker, all in all 664 true identity tests. There is a total amount of only 10 to 20 seconds of free speech for the training of the speaker models and about 5 seconds per speaker for the evaluation. For the SVM and the KLR classifiers we used the RBF kernel function (8) and validated the kernel and the regularization parameters of the different classifiers on the development set. Because all speakers are known to the system, the error rate is simply computed as Identification Error Rate (IER):

$$\mathrm{IER} = \frac{\text{number of incorrect identifications}}{\text{total number of identification tests}} \tag{32}$$

Over the whole test sentence the log-likelihood is computed by equation (3) and the sentence is classified to the speaker with the highest probability over the speech sequence. The IER is presented in the left column of table 1. We also generated an N-best list of each test sentence and give the results of finding the correct speaker in the 5-best alternatives in the right column of the table.

Table 1: Speaker identification experiments on the BE4 set of the POLYCOST database.

| Classifier | IER (%) | 5-best (%) |
|------------|---------|------------|
| GMM        | 10.84   | 6.48       |
| SVM        | 8.89    | 4.67       |
| KLR        | 8.58    | 4.97       |
| SKLR       | 8.58    | 5.87       |

As can be seen in table 1, both the SVM and the KLR classifiers clearly outperform the GMM baseline system. The SKLR classifier with ridge regularization decreases the IER of the baseline system by more than 20% relative. Furthermore, evaluating the classification on the 5-best alternatives, the SVM reached the lowest error rate followed by the non-sparse KLR.
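The relative improvements can be recomputed from the IER values of table 1 with a few lines of Python. The helper names are illustrative, and the count of 57 errors is inferred from 8.58% of 664 tests rather than reported in the experiments.

```python
def identification_error_rate(n_incorrect, n_tests):
    """Equation (32), returned in percent."""
    return 100.0 * n_incorrect / n_tests

def relative_reduction(baseline, system):
    """Relative decrease of an error rate with respect to a baseline, in percent."""
    return 100.0 * (baseline - system) / baseline

# IER values (%) as listed in table 1.
ier_table = {"GMM": 10.84, "SVM": 8.89, "KLR": 8.58, "SKLR": 8.58}
sklr_gain = relative_reduction(ier_table["GMM"], ier_table["SKLR"])
svm_gain = relative_reduction(ier_table["GMM"], ier_table["SVM"])

# 57 errors is an inferred count (8.58% of 664 tests), not a reported figure.
sklr_ier = identification_error_rate(57, 664)
```

The SKLR reduction evaluates to roughly 20.8% and the SVM reduction to roughly 18%, which matches the statement that the SKLR lowers the baseline IER by more than 20% relative.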

Additionally, we want to compare the sparseness of the different discriminative classifiers. While the number of support vectors in the SVM depends on the size of the training set, we want to show that the number of relevant vectors for the SKLR solutions is independent of the training set size.

Table 2: Sparseness (%) of SKLR compared to SVM on different training sizes of the BE4 dataset.

| Dataset | SVM  | SKLR | KLR |
|---------|------|------|-----|
| BE4-50  | 66.8 | 82.8 | 0.0 |
| BE4-60  | 61.3 | 81.4 | 0.0 |
| BE4-70  | 56.2 | 80.3 | 0.0 |
| BE4-80  | 51.3 | 79.3 | 0.0 |
| BE4-90  | 46.6 | 78.8 | 0.0 |
| BE4-100 | 41.7 | 78.1 | 0.0 |

To compare the number of kernel functions used in the final solutions, we divided the training set into subsets of 50 to 90% of the full BE4 task. The sparseness is defined as the portion of the training data not used in the final solution. The lower the sparseness, the more feature vectors are part of the classifier.

Figure 1: Sparseness (%) of the discriminative classifiers using different amounts of training data. (Axes: size of training set (%) versus amount of support/relevant vectors (%); curves: SVM, SKLR.)

Using the same parameters as for the full dataset, the results in table 2 show that on all subsets the models of the SKLR approach are sparser than the SVM models. On the full set the SVM solution needs about 58.3% of all training data. The SKLR gives a good performance with 21.9%. Naturally, the non-sparse KLR needs all training vectors in the experiments and its sparsity level is 0%. As can be seen in figure 1, the number of support vectors increased nearly linearly with the amount of training data, which results in a decreasing sparseness, while the number of relevant feature vectors for the SKLR increased only slightly for larger datasets. This is a computational advantage over the SVM because a much smaller portion of kernel products has to be calculated in the test.
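The link between the sparseness values of table 2 and the share of training data kept in each model is a simple complement. The following sketch recomputes it; the dictionary layout and the helper name `used_fraction` are illustrative.

```python
# Sparseness (%) from table 2, keyed by training-set size (%).
sparseness = {
    "SVM":  {50: 66.8, 60: 61.3, 70: 56.2, 80: 51.3, 90: 46.6, 100: 41.7},
    "SKLR": {50: 82.8, 60: 81.4, 70: 80.3, 80: 79.3, 90: 78.8, 100: 78.1},
}

def used_fraction(sparse_percent):
    """Share of training vectors kept in the model, in percent."""
    return round(100.0 - sparse_percent, 1)

used = {name: {size: used_fraction(s) for size, s in rows.items()}
        for name, rows in sparseness.items()}
```

For the full set this gives 58.3% for the SVM and 21.9% for the SKLR, the two values quoted in the text.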

For a comparison of the decision quality of the three identification systems, we normalized the log-likelihood scores in the N-best list by the T-norm. The background speaker models for the (S)KLR and the SVM systems were trained in the same manner (one-versus-one) as the speakers in the closed set. In figure 2 the histograms of the true speakers (right) and the average of the 10-best alternatives in the N-best list (left) are given. The means of the GMM-score distributions are closer than the means of the other systems, which results in a lower confidence of the decision. The largest distance of the score means and the highest confidence can be observed for the SVM models.

Figure 2: Scores of the 10-best alternatives (left) compared to the scores of the true speakers (right) for the different classification methods. (Panels: gmm, svm, sgklr, klr.)

Introducing a threshold for accepting or rejecting a speaker of the identification set, the Equal Error Rate (EER) is the error rate at the threshold where the rates of false-accept and false-reject errors are equal. The results are given in table 3.

As we can see, the SVM clearly outperforms all other systems, and considering the distribution of the scores in the histograms it is not surprising that the KLR method decreases the baseline error rate more than the SKLR.

Table 3: Equal Error Rates of the speaker identification experiments using a decision threshold.

| Classifier | EER (%) |
|------------|---------|
| GMM        | 6.69    |
| SVM        | 5.15    |
| KLR        | 5.39    |
| SKLR       | 6.15    |
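The equal error rate can be estimated from two sets of scores by sweeping the decision threshold. The sketch below is a generic illustration on synthetic Gaussian scores, not the evaluation script behind table 3; the function name and the score distributions are assumptions.

```python
import numpy as np

def equal_error_rate(true_scores, other_scores):
    """Sweep a threshold and return the point where FAR and FRR are closest."""
    thresholds = np.sort(np.concatenate([true_scores, other_scores]))
    best_gap, eer, best_thr = np.inf, None, None
    for thr in thresholds:
        far = np.mean(other_scores >= thr)   # wrong speakers accepted
        frr = np.mean(true_scores < thr)     # true speakers rejected
        if abs(far - frr) < best_gap:
            best_gap, eer, best_thr = abs(far - frr), 0.5 * (far + frr), thr
    return eer, best_thr

# Synthetic normalized scores: true speakers score higher on average.
rng = np.random.default_rng(3)
true_scores = rng.normal(3.0, 1.0, 500)
other_scores = rng.normal(1.0, 1.0, 500)
eer, thr = equal_error_rate(true_scores, other_scores)
```

The further apart the two score distributions lie, the lower the resulting EER, which is the connection between the histograms of figure 2 and the values in table 3.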

8. Conclusions

In this paper we presented a sparse version of the kernel logistic regression for the task of speaker identification. The two KLR classification methods outperform the SVM and the GMM baseline system on the speaker identification task. Furthermore, the SKLR provides a very sparse solution by an incremental feature selection procedure. On the other hand, the SVM classifier performs better if one introduces a threshold for acceptance or rejection of a client. The main drawback of the kernel classifiers is the computational cost of the parameter estimation.

Future work will concentrate on an objective measure of the score quality of the moderated SVM output and the SKLR probability scores. Furthermore, we are planning to use the SVM and the SKLR on speaker verification tasks. We want to investigate our approaches regarding the integration of the discriminative classifiers, especially on larger datasets.

9. References

[1] Volker Roth, “Probabilistic discriminative kernel classifiers for multi-class problems,” in Lecture Notes in Computer Science. 2001, vol. 2191, pp. 246–253, Springer.

[2] Douglas A. Reynolds, “An overview of automatic speaker recognition technology,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, 2002, vol. 4, pp. 4072–4075.

[3] Sven E. Krüger, Martin Schafföner, Marcel Katz, Edin Andelic, and Andreas Wendemuth, “Speech recognition with support vector machines in a hybrid system,” in Proc. EuroSpeech, 2005, pp. 993–996.

[4] Tommi S. Jaakkola and David Haussler, “Exploiting generative models in discriminative classifiers,” in Advances in Neural Information Processing Systems, Michael S. Kearns, Sarah A. Solla, and David A. Cohn, Eds., Cambridge, MA, USA, 1999, vol. 11, pp. 487–493, MIT Press.

[5] Vincent Wan and Steve Renals, “Speaker verification using sequence discriminant support vector machines,” IEEE Transactions on Speech and Audio Processing, vol. 13, no. 2, pp. 203–210, 2005.

[6] William Campbell, “Generalized linear discriminant sequence kernels for speaker recognition,” in International Conference on Acoustics, Speech, and Signal Processing, Orlando, Florida, 2002, vol. 1, pp. 161–164.

[7] Tomoko Matsui and Kunio Tanabe, “Speaker identification with dual penalized logistic regression machine,” in Proceedings of ODYSSEY - The Speaker and Language Recognition Workshop, 2004, pp. 363–368.

[8] Mark Przybocki and Alvin F. Martin, “NIST speaker recognition evaluation chronicles,” in Proceedings of ODYSSEY - The Speaker and Language Recognition Workshop, 2004.

[9] Nachman Aronszajn, “Theory of reproducing kernels,” Transactions of the American Mathematical Society, vol. 68, pp. 337–404, 1950.

[10] Vladimir N. Vapnik, The Nature of Statistical Learning Theory, Information Science and Statistics. Springer, Berlin, 2nd edition, 2000.

[11] Christopher J. C. Burges, “A tutorial on support vector machines for pattern recognition,” Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 121–167, June 1998.

[12] John C. Platt, “Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods,” in Advances in Large-Margin Classifiers, Peter J. Bartlett, Bernhard Schölkopf, Dale Schuurmans, and Alexander Smola, Eds., pp. 61–74. MIT Press, Cambridge, MA, USA, Oct. 2000.

[13] Arthur Hoerl and Robert Kennard, “Ridge regression: Biased estimation for nonorthogonal problems,” Technometrics, vol. 12, pp. 55–67, 1970.

[14] Ian T. Nabney, “Efficient training of RBF networks for classification,” in Artificial Neural Networks – ICANN 1999, 1999, vol. 1, pp. 210–215.

[15] George S. Kimeldorf and Grace Wahba, “Some results on Tchebycheffian spline functions,” Journal of Mathematical Analysis and Applications, vol. 33, pp. 82–95, 1971.

[16] Ji Zhu and Trevor Hastie, “Kernel logistic regression and the import vector machine,” Journal of Computational and Graphical Statistics, vol. 14, pp. 185–205, 2005.

[17] Trevor Hastie and Robert Tibshirani, “Classification by pairwise coupling,” in Advances in Neural Information Processing Systems 10, Michael I. Jordan, Michael J. Kearns, and Sara A. Solla, Eds. MIT Press, Cambridge, MA, USA, June 1998.

[18] David Price, Stefan Knerr, Leon Personnaz, and Gérard Dreyfus, “Pairwise neural network classifiers with probabilistic outputs,” in Advances in Neural Information Processing Systems 7, Gerald Tesauro, David S. Touretzky, and Todd K. Leen, Eds., pp. 1109–1116. MIT Press, Cambridge, MA, USA, 1995.

[19] Håkan Melin and Johan Lindberg, “Guidelines for experiments on the POLYCOST database,” in Proceedings of a COST 250 workshop on Application of Speaker Recognition Techniques in Telephony, Vigo, Spain, 1996, pp. 59–69.

[20] Paul Taylor, Richard Caley, Alan W. Black, and Simon King, “Edinburgh speech tools library,” Tech. Rep., University of Edinburgh, 1999.

[21] Steve Young, Gunnar Evermann, Dan Kershaw, Gareth Moore, Julian Odell, Dave Ollason, Dan Povey, Valtcho Valtchev, and Phil Woodland, The HTK Book, Cambridge University Engineering Department, 2002.
