
DISCRIMINATIVELY TRAINED PROBABILISTIC LINEAR DISCRIMINANT ANALYSIS

FOR SPEAKER VERIFICATION

Lukáš Burget1, Oldřich Plchot1, Sandro Cumani2, Ondřej Glembek1, Pavel Matějka1, Niko Brümmer3

1Brno University of Technology, Czech Rep., {burget,iplchot,glembek,matejkap}@fit.vutbr.cz,
2Politecnico di Torino, Italy, sandro.cumani@polito.it, 3AGNITIO, S. Africa, niko.brummer@gmail.com

ABSTRACT

Recently, i-vector extraction and Probabilistic Linear Discriminant

Analysis (PLDA) have proven to provide state-of-the-art speaker

veriﬁcation performance. In this paper, the speaker veriﬁcation score

for a pair of i-vectors representing a trial is computed with a func-

tional form derived from the successful PLDA generative model. In

our case, however, parameters of this function are estimated based on

a discriminative training criterion. We propose to use the objective

function to directly address the task in speaker veriﬁcation: discrimi-

nation between same-speaker and different-speaker trials. Compared

with a baseline which uses a generatively trained PLDA model, dis-

criminative training provides up to 40% relative improvement on the

NIST SRE 2010 evaluation task.

Index Terms—Speaker veriﬁcation, Discriminative training,

Probabilistic Linear Discriminant Analysis

1. INTRODUCTION

In this paper, we show that discriminative training can be used to

improve the performance of state-of-the-art speaker veriﬁcation sys-

tems based on i-vector extraction and Probabilistic Linear Discrim-

inant Analysis (PLDA). Recently, systems based on i-vectors [1, 2]

extracted from cepstral features have provided superior performance

in speaker veriﬁcation. The so-called i-vector is an information-rich

low-dimensional ﬁxed length vector extracted from the feature se-

quence representing a speech segment (see section 2 for more details

on i-vector extraction). A speaker veriﬁcation score is then produced

by comparing the two i-vectors corresponding to the segments in the

veriﬁcation trial. The function taking two i-vectors as an input and

producing the corresponding veriﬁcation score is typically designed

to give a good approximation of the log-likelihood ratio between

the “same-speaker” and “different-speaker” hypotheses. Typically,

the function is also designed to produce a symmetric score (i.e. to

produce output that is independent of which segment is enrollment

and which is test — unlike traditional systems, which distinguish

the two). In [1], good performance was reported when scores were

computed as cosine distances between i-vectors normalized using

within-class covariance normalization (WCCN). Best performance,

however, is currently obtained with PLDA [2], a generative model that models i-vector distributions, allowing for direct evaluation of the desired log-likelihood ratio verification score (see section 3 for details on the specific form of PLDA used in our work).

This work was funded by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), through the Army Research Laboratory (ARL). All statements of fact, opinion or conclusions contained herein are those of the authors and should not be construed as representing the official views or policies of IARPA, the ODNI, or the U.S. Government. The work was also partly supported by the Grant Agency of Czech Republic project No. 102/08/0707, and Czech Ministry of Education project No. MSM0021630528.

In this paper, we propose to estimate veriﬁcation scores using a

discriminative model rather than a generative PLDA model. More

speciﬁcally, the speaker veriﬁcation score for a pair of i-vectors is

computed using a function having the functional form derived from

the PLDA generative model. The parameters of the function, how-

ever, are estimated using a discriminative training criterion. We use

an objective function that directly addresses the speaker veriﬁcation

task, i.e. the discrimination between “same-speaker” and “different-

speaker” trials. In other words, a binary classiﬁer that takes a pair of

i-vectors as input is trained to answer the question of whether

or not the two i-vectors come from the same speaker. We show

that the functional form derived from PLDA can be interpreted as

a binary linear classiﬁer in a nonlinearly expanded space of i-vector

pairs. We have experimented with two discriminative linear clas-

siﬁers, namely linear support vector machines (SVM) and logistic

regression. The advantage of logistic regression is its probabilistic

interpretation: the linear output of this classiﬁer can be directly in-

terpreted as the desired log-likelihood ratio veriﬁcation score. On

the NIST SRE 2010 evaluation task, we show that up to 40% rela-

tive improvement over the PLDA baseline can be obtained with such

discriminatively trained models.

There has been previous work on discriminative training for

speaker recognition, such as GMM-SVM [3]. This and similar ap-

proaches, however, do not directly address the objective of discrim-

inating between same-speaker and different-speaker trials. Instead,

SVMs are trained as discriminative models representing each tar-

get speaker. As a consequence, this approach cannot fully beneﬁt

from discriminative training, as there is a very limited number of

positive examples (usually only one enrollment segment) available

for training of each model. In contrast, in our approach, a model is

trained using a large number of positive and negative examples, each

of which is one of many possible same-speaker or different-speaker

trials that can be constructed from the training segments.

The very same idea of discriminatively training a PLDA-like

model for speaker veriﬁcation was originally proposed in [4] and

some initial work has been done in [5]. At that time, however,

speaker factors extracted using Joint Factor Analysis (JFA) [6] were

used as a suboptimal input for the classiﬁer, and state-of-the-art per-

formance would not have been achieved.

2. I-VECTORS

The i-vector approach has become state of the art in the speaker veri-

ﬁcation ﬁeld [1]. The approach provides an elegant way of reducing

large-dimensional input data to a small-dimensional feature vector

while retaining most of the relevant information. The technique was

originally inspired by the JFA framework [6]. The basic principle

is that on some data, we train the i-vector extractor and then for

each speech segment, we extract the i-vector as a low-dimensional fixed-length representation of the segment. The main idea is that the speaker- and session-dependent supervectors of concatenated Gaussian mixture model (GMM) means can be modeled as

$s = m + Tx,$  (1)

where m is the Universal Background Model (UBM) GMM mean supervector, T is a matrix of bases spanning the subspace covering the important variability (both speaker- and session-specific) in the supervector space, and x is a standard-normally distributed latent variable. For each observation sequence representing a segment, our i-vector φ is the MAP point estimate of the latent variable x.
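To make the extraction step concrete, the following is a minimal NumPy sketch of how the MAP point estimate can be computed from zero- and first-order Baum-Welch statistics collected against the UBM. The function name and the assumption of diagonal UBM covariances are illustrative simplifications (the system in Section 5 uses full-covariance components), not the authors' implementation.

import numpy as np

def extract_ivector(N, F, T, ubm_means, ubm_vars):
    """MAP point estimate of the latent variable x in s = m + Tx (eq. 1).

    N         : (C,)    zero-order Baum-Welch statistics per UBM component
    F         : (C, D)  first-order Baum-Welch statistics per component
    T         : (C*D, R) total-variability matrix (R = i-vector dimension)
    ubm_means : (C, D)  UBM component means (the supervector m, reshaped)
    ubm_vars  : (C, D)  UBM component variances; diagonal covariances are
                        assumed here for brevity
    """
    C, D = ubm_means.shape
    R = T.shape[1]
    F_centered = F - N[:, None] * ubm_means        # F_c - N_c * m_c
    precision = np.eye(R)                          # posterior precision of x
    proj = np.zeros(R)
    for c in range(C):
        Tc = T[c * D:(c + 1) * D, :]               # (D, R) block of T for component c
        Tc_w = Tc / ubm_vars[c][:, None]           # Sigma_c^{-1} T_c
        precision += N[c] * (Tc.T @ Tc_w)
        proj += Tc_w.T @ F_centered[c]
    return np.linalg.solve(precision, proj)        # posterior mean = MAP estimate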

3. PLDA

3.1. Two covariance model

To facilitate comparison of i-vectors in a veriﬁcation trial, we model

the distribution of i-vectors using a Probabilistic LDA model [7, 2].

We ﬁrst consider only a special form of PLDA, a two-covariance

model, in which speaker and inter-session variability are modeled

using across-class and within-class full covariance matrices Σac and

Σwc. The two-covariance model is a generative linear-Gaussian

model, where latent vectors y representing speakers (or, more generally, classes) are assumed to be distributed according to the prior distribution

$p(y) = \mathcal{N}(y;\, \mu,\, \Sigma_{ac}).$  (2)

For a given speaker represented by a vector ŷ, the distribution of i-vectors is assumed to be

$p(\phi \,|\, \hat{y}) = \mathcal{N}(\phi;\, \hat{y},\, \Sigma_{wc}).$  (3)

The ML estimates of the model parameters, μ,Σac, and Σwc , can

be obtained using an EM algorithm as in [2]. The training i-vectors

come from a database comprising recordings of many speakers (to

capture across-class variability), each recorded in several sessions

(to capture within-class variability).
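As a rough illustration of what the model parameters represent, the sketch below computes simple moment-based estimates of μ, Σac, and Σwc from labelled training i-vectors, using per-speaker means in place of the latent vectors y. This is only a crude stand-in for the EM algorithm of [2]; the function name and conventions are assumptions for illustration.

import numpy as np

def estimate_two_cov_model(ivectors, speaker_ids):
    """Moment-based estimates of mu, Sigma_ac, Sigma_wc from labelled i-vectors.

    ivectors    : (N, D) matrix, one i-vector per row
    speaker_ids : (N,)   speaker label of each i-vector
    """
    ivectors = np.asarray(ivectors, dtype=float)
    speaker_ids = np.asarray(speaker_ids)
    mu = ivectors.mean(axis=0)
    dim = ivectors.shape[1]
    Sigma_wc = np.zeros((dim, dim))
    Sigma_ac = np.zeros((dim, dim))
    for spk in np.unique(speaker_ids):
        X = ivectors[speaker_ids == spk]
        spk_mean = X.mean(axis=0)
        Dw = X - spk_mean                      # within-speaker deviations
        Sigma_wc += Dw.T @ Dw
        da = (spk_mean - mu)[:, None]          # speaker-mean deviation from mu
        Sigma_ac += len(X) * (da @ da.T)
    n = len(ivectors)
    return mu, Sigma_ac / n, Sigma_wc / n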

In the more general case, the speaker and/or inter-session vari-

ability can be modeled using subspaces [1]. For example, in our

baseline system, speaker variability is not modeled using a full co-

variance matrix. Instead, a low-rank across-class covariance matrix is modeled as $\Sigma_{ac} = V V^T$, which limits speaker variability to live in a subspace spanned by the columns of the reduced-rank matrix V.

3.2. Evaluation of veriﬁcation score

Consider the process of generating the two i-vectors φ1 and φ2 forming a trial. In the case of a same-speaker trial, a single vector ŷ representing a speaker is generated from the prior p(y), from which both φ1 and φ2 are generated via p(φ|ŷ). For a different-speaker trial, two latent vectors representing two different speakers are independently generated from p(y), and for each latent vector, one of the i-vectors φ1 and φ2 is generated. Given a trial, we want to test two hypotheses: Hd, that the trial is a different-speaker trial, and Hs, that the trial is a same-speaker trial. The speaker verification score can now be calculated as the log-likelihood ratio between the two hypotheses Hs and Hd:

$s = \log \frac{p(\phi_1, \phi_2 \,|\, H_s)}{p(\phi_1, \phi_2 \,|\, H_d)}$  (4)

$\;\;\, = \log \frac{\int p(\phi_1 \,|\, y)\, p(\phi_2 \,|\, y)\, p(y)\, dy}{p(\phi_1)\, p(\phi_2)},$  (5)

where in the numerator we integrate over the distribution of speaker

vectors and, for each possible speaker, the likelihood of producing

both i-vectors from the speaker is calculated. In the denominator, we

simply multiply the marginal likelihoods $p(\phi) = \int p(\phi \,|\, y)\, p(y)\, dy$.

The integrals, which can be interpreted as convolutions of Gaussians,

can be evaluated analytically giving

$s = \log \mathcal{N}\!\left( \begin{bmatrix} \phi_1 \\ \phi_2 \end{bmatrix};\ \begin{bmatrix} \mu \\ \mu \end{bmatrix},\ \begin{bmatrix} \Sigma_{tot} & \Sigma_{ac} \\ \Sigma_{ac} & \Sigma_{tot} \end{bmatrix} \right) - \log \mathcal{N}\!\left( \begin{bmatrix} \phi_1 \\ \phi_2 \end{bmatrix};\ \begin{bmatrix} \mu \\ \mu \end{bmatrix},\ \begin{bmatrix} \Sigma_{tot} & 0 \\ 0 & \Sigma_{tot} \end{bmatrix} \right),$  (6)

where the total covariance matrix is given as $\Sigma_{tot} = \Sigma_{ac} + \Sigma_{wc}$.

By expanding the log of Gaussian distributions and simplifying the

ﬁnal expression, we obtain

$s = \phi_1^T \Lambda \phi_2 + \phi_2^T \Lambda \phi_1 + \phi_1^T \Gamma \phi_1 + \phi_2^T \Gamma \phi_2 + (\phi_1 + \phi_2)^T c + k,$  (7)

where

$\Gamma = -\tfrac{1}{4}(\Sigma_{wc} + 2\Sigma_{ac})^{-1} - \tfrac{1}{4}\Sigma_{wc}^{-1} + \tfrac{1}{2}\Sigma_{tot}^{-1},$

$\Lambda = -\tfrac{1}{4}(\Sigma_{wc} + 2\Sigma_{ac})^{-1} + \tfrac{1}{4}\Sigma_{wc}^{-1},$

$c = \left((\Sigma_{wc} + 2\Sigma_{ac})^{-1} - \Sigma_{tot}^{-1}\right)\mu,$

$k = \log|\Sigma_{tot}| - \tfrac{1}{2}\log|\Sigma_{wc} + 2\Sigma_{ac}| - \tfrac{1}{2}\log|\Sigma_{wc}| + \mu^T\left(\Sigma_{tot}^{-1} - (\Sigma_{wc} + 2\Sigma_{ac})^{-1}\right)\mu.$  (8)
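The mapping from the two-covariance parameters to (Λ, Γ, c, k) and the scoring rule (7) translate directly into code. The sketch below is an illustrative NumPy transcription of equations (7) and (8), not the authors' implementation; the function names are assumptions.

import numpy as np

def plda_to_score_params(mu, Sigma_ac, Sigma_wc):
    """Map the two-covariance PLDA parameters to (Lambda, Gamma, c, k) of eq. (8)."""
    Sigma_tot = Sigma_ac + Sigma_wc
    P = np.linalg.inv(Sigma_wc + 2.0 * Sigma_ac)    # (Sigma_wc + 2 Sigma_ac)^{-1}
    Q = np.linalg.inv(Sigma_wc)                     # Sigma_wc^{-1}
    Tinv = np.linalg.inv(Sigma_tot)                 # Sigma_tot^{-1}
    Gamma = -0.25 * P - 0.25 * Q + 0.5 * Tinv
    Lambda = -0.25 * P + 0.25 * Q
    c = (P - Tinv) @ mu
    k = (np.linalg.slogdet(Sigma_tot)[1]
         - 0.5 * np.linalg.slogdet(Sigma_wc + 2.0 * Sigma_ac)[1]
         - 0.5 * np.linalg.slogdet(Sigma_wc)[1]
         + mu @ (Tinv - P) @ mu)
    return Lambda, Gamma, c, k

def llr_score(phi1, phi2, Lambda, Gamma, c, k):
    """Verification score of eq. (7) for a single trial (phi1, phi2)."""
    return (2.0 * phi1 @ Lambda @ phi2
            + phi1 @ Gamma @ phi1 + phi2 @ Gamma @ phi2
            + (phi1 + phi2) @ c + k)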

We recall that the computation of a bilinear form $x^T A y$ can be expressed in terms of the Frobenius inner product as $x^T A y = \langle A, xy^T\rangle = \mathrm{vec}(A)^T \mathrm{vec}(xy^T)$, where vec(·) stacks the columns of a matrix into a vector. Therefore, the log-likelihood ratio score can be written as a dot product of a vector of weights w and an expanded vector ϕ(φ1, φ2) representing a trial:

$s = w^T \varphi(\phi_1, \phi_2) = \begin{bmatrix} \mathrm{vec}(\Lambda) \\ \mathrm{vec}(\Gamma) \\ c \\ k \end{bmatrix}^T \begin{bmatrix} \mathrm{vec}(\phi_1\phi_2^T + \phi_2\phi_1^T) \\ \mathrm{vec}(\phi_1\phi_1^T + \phi_2\phi_2^T) \\ \phi_1 + \phi_2 \\ 1 \end{bmatrix}.$  (9)
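For a quick sanity check of the expansion, the following sketch builds ϕ(φ1, φ2) and the stacked weight vector w of (9); with Λ, Γ, c, k taken from (8), their dot product should reproduce the score of (7) up to numerical precision. The function names are illustrative.

import numpy as np

def expand_trial(phi1, phi2):
    """Expanded trial vector of eq. (9); vec(.) stacks columns (order='F')."""
    return np.concatenate([
        (np.outer(phi1, phi2) + np.outer(phi2, phi1)).ravel(order='F'),
        (np.outer(phi1, phi1) + np.outer(phi2, phi2)).ravel(order='F'),
        phi1 + phi2,
        [1.0],
    ])

def stack_weights(Lambda, Gamma, c, k):
    """Weight vector w of eq. (9), stacked in the same order as expand_trial."""
    return np.concatenate([Lambda.ravel(order='F'),
                           Gamma.ravel(order='F'),
                           c,
                           [k]])

# With (Lambda, Gamma, c, k) obtained from eq. (8),
#   stack_weights(...) @ expand_trial(phi1, phi2)
# reproduces the score of eq. (7) up to numerical precision.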

Hence, we have obtained a generative generalized linear classifier [8], where the probability of a same-speaker trial can be computed from the log-likelihood ratio score using the sigmoid activation function as

$p(H_s \,|\, \phi_1, \phi_2) = \sigma(s) = (1 + \exp(-s))^{-1}.$  (10)

Here, we have assumed equal priors for both hypotheses. To allow for different priors, we can simply adjust the constant k in the vector of weights by adding logit(p(Hs)).

4. DISCRIMINATIVE CLASSIFIERS

In this section, we describe how we train the weights w directly, in order to discriminate between same-speaker and different-speaker trials, without having to explicitly model the distributions of i-vectors. To represent a trial, we keep the same expansion ϕ(φ1, φ2) as defined in (9). Hence, we reuse the functional form

for computing veriﬁcation scores that provided excellent results with

generative PLDA. We consider two standard discriminative linear

classiﬁers, namely logistic regression and SVMs.

4.1. Objective functions

The set of training examples, which we continue referring to as train-

ing trials, comprises both different-speaker and same-speaker trials.

Let us use the coding scheme t ∈ {−1, 1} to represent labels for the different-speaker and same-speaker trials, respectively. Assigning each trial a log-likelihood ratio s and the correct label t, the log probability of recognizing the trial correctly can be expressed as

$\log p(t \,|\, \phi_1, \phi_2) = -\log(1 + \exp(-st)).$  (11)

This is easy to see from equation (10), recalling that $\sigma(-s) = 1 - \sigma(s)$. In the case of logistic regression, the objective function to maximize is the log probability of correctly classifying all training examples, i.e. the sum of expressions (11) evaluated over all training trials. Equivalently, this can be expressed as minimizing the cross-entropy error function, which is a sum over all training trials:

$E(w) = \sum_{n=1}^{N} \alpha_n E_{LR}(t_n s_n) + \frac{\lambda}{2}\|w\|^2,$  (12)

where the logistic regression loss function

$E_{LR}(ts) = \log(1 + \exp(-ts))$  (13)

is simply the negative log probability (11) of correctly recognizing a

trial. We have also added the regularization term $\frac{\lambda}{2}\|w\|^2$, where $\lambda$ is a constant controlling the tradeoff between the error function and the regularizer. The coefficients $\alpha_n$ allow us to weight individual trials.

Speciﬁcally, we use them to assign different weights to same-speaker

and different-speaker trials. This allows us to select a particular op-

erating point, around which we want to optimize the performance of

our system without relying on the proportion of same- and different-

speaker trials in the training set. The advantage of using the cross-

entropy objective for training is that it reﬂects performance of the

system over a wide range of operating points (around the selected

one). For this reason, a similar function was also proposed as a per-

formance measure for the speaker veriﬁcation task [9]. Another ad-

vantage of using the logistic regression classiﬁer is its probabilistic

nature: it trains the weights so that the score $s = w^T \varphi(\phi_1, \phi_2)$ can be interpreted as the log-likelihood ratio between the hypotheses Hs and Hd.

Taking (12) and replacing ELR(ts) with the hinge loss function

$E_{SV}(ts) = \max(0,\, 1 - ts),$  (14)

we obtain an SVM, which is a classiﬁer traditionally understood to

maximize the margin separating class samples. Alternatively, one

can see the hinge loss function as a piecewise approximation to the

logistic regression loss function. Therefore, one can assume that

the score $s = w^T \varphi(\phi_1, \phi_2)$ obtained from an SVM classifier will

still be a reasonable approximation to the log-likelihood ratio (after

a linear calibration).
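A compact way to see how (12)-(14) fit together is the following sketch of the regularized, trial-weighted objective, selectable between the logistic and hinge losses. The argument names and the way the α weights and λ are passed in are assumptions for illustration, not the authors' code.

import numpy as np

def objective(w, trials, labels, alphas, lam, loss='lr'):
    """Regularized, trial-weighted error function of eq. (12).

    trials : (N, D) matrix of expanded trial vectors phi(phi_1, phi_2)
    labels : (N,)   t_n, +1 for same-speaker and -1 for different-speaker trials
    alphas : (N,)   per-trial weights alpha_n
    lam    : regularization constant lambda
    """
    ts = labels * (trials @ w)                   # t_n * s_n
    if loss == 'lr':
        losses = np.logaddexp(0.0, -ts)          # log(1 + exp(-ts)), eq. (13)
    else:
        losses = np.maximum(0.0, 1.0 - ts)       # hinge loss, eq. (14)
    return np.sum(alphas * losses) + 0.5 * lam * np.dot(w, w)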

4.2. Gradient evaluation

In order to numerically optimize the parameters w of the classifier, we want to evaluate the gradient of the error function

$\nabla E(w) = \sum_{n=1}^{N} \alpha_n \frac{\partial E(t_n s_n)}{\partial s_n} \frac{\partial s_n}{\partial w} + \lambda w,$  (15)

where the derivative of the loss function E(tn sn) w.r.t. the score sn depends on the particular choice of the loss function. For the logistic regression loss function, it is defined as

$\frac{\partial E_{LR}(ts)}{\partial s} = -t\,\sigma(-ts),$  (16)

while for the hinge loss function it becomes

$\frac{\partial E_{SV}(ts)}{\partial s} = \begin{cases} 0 & \text{if } ts \ge 1 \\ -t & \text{otherwise.} \end{cases}$  (17)

Finally, the derivative of the score w.r.t. the classifier parameters just gives the expanded trial vector

$\frac{\partial s}{\partial w} = \frac{\partial}{\partial w} w^T \varphi(\phi_1, \phi_2) = \varphi(\phi_1, \phi_2).$  (18)
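The two loss derivatives (16) and (17) are the only loss-specific ingredients needed later; a small vectorized sketch (an assumed helper, not from the paper) is:

import numpy as np

def dloss_dscore(t, s, loss='lr'):
    """Derivative of the loss w.r.t. the score: eq. (16) for 'lr', eq. (17) otherwise."""
    ts = t * s
    if loss == 'lr':
        return -t / (1.0 + np.exp(ts))           # -t * sigma(-ts)
    return np.where(ts >= 1.0, 0.0, -t)          # 0 if margin satisfied, -t otherwise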

4.3. Efﬁcient score and gradient evaluation

Given a trained classiﬁer, we can obtain a veriﬁcation score for a trial

by forming the expanded vector ϕ(φ1, φ2) and computing the dot product (9). However, as we have already seen, the same score can be obtained from the two original i-vectors φ1, φ2 using formula (7), which is both memory and computationally efficient. Now, consider two sets of i-vectors stored as columns of the matrices Φe and Φt. For illustration, let us call these sets enrollment and test i-vectors, although they play symmetrical roles in our scoring scheme. We can efficiently score each enrollment i-vector against each test i-vector and obtain the full matrix of scores as

$S = 2\,\Phi_e^T \Lambda \Phi_t + \left((\Phi_e^T \Gamma) \circ \Phi_e^T\right)\mathbf{1}\mathbf{1}^T + \mathbf{1}\mathbf{1}^T\left(\Phi_t \circ (\Gamma \Phi_t)\right) + \Phi_e^T c\,\mathbf{1}^T + \mathbf{1}\,c^T \Phi_t + k\,\mathbf{1}\mathbf{1}^T,$  (19)

where ◦ denotes the Hadamard, or “entrywise”, product.
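Equation (19) maps naturally onto matrix operations; the sketch below evaluates the full score matrix, with NumPy broadcasting standing in for the explicit outer products with the all-ones vectors. It is an illustrative transcription of (19), assuming i-vectors are stored as columns as in the text.

import numpy as np

def score_matrix(Phi_e, Phi_t, Lambda, Gamma, c, k):
    """Full matrix of verification scores S of eq. (19).

    Phi_e : (D, Ne) enrollment i-vectors stored as columns
    Phi_t : (D, Nt) test i-vectors stored as columns
    """
    cross = 2.0 * Phi_e.T @ Lambda @ Phi_t                  # 2 phi_i^T Lambda phi_j
    quad_e = np.sum((Phi_e.T @ Gamma) * Phi_e.T, axis=1)    # phi_i^T Gamma phi_i
    quad_t = np.sum(Phi_t * (Gamma @ Phi_t), axis=0)        # phi_j^T Gamma phi_j
    lin_e = Phi_e.T @ c                                     # phi_i^T c
    lin_t = c @ Phi_t                                       # c^T phi_j
    return (cross
            + quad_e[:, None] + quad_t[None, :]
            + lin_e[:, None] + lin_t[None, :]
            + k)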

Similarly, the naïve way of evaluating the gradient would be to explicitly ex-

pand every training trial and then to apply equations (15) to (18).

However, again taking into account the functional form for com-

puting scores (7), the gradient can be evaluated much more efﬁ-

ciently without any need for explicit trial expansion. Let all the i-

vectors, which we have available for training, be stored in columns

of a matrix Φ. Now consider forming a training trial using every

possible pair of i-vectors from the matrix. Let sij be the score for

the trial formed by the i-th and j-th columns of Φ, calculated using the parameters w for which we wish to evaluate the gradient. Let tij and αij be the corresponding label and trial weight, respectively. Further, let dij be the corresponding derivative of the loss function E(tij sij) w.r.t. the score sij, given in (16) or (17) depending on

the loss function used. The gradient can now be efﬁciently evaluated

as

$\nabla E(w) = \begin{bmatrix} \nabla_{\Lambda} L \\ \nabla_{\Gamma} L \\ \nabla_{c} L \\ \nabla_{k} L \end{bmatrix} = \begin{bmatrix} 2 \cdot \mathrm{vec}\!\left(\Phi G \Phi^T\right) \\ 2 \cdot \mathrm{vec}\!\left(\Phi\left[\Phi^T \circ (G \mathbf{1}\mathbf{1}^T)\right]\right) \\ 2 \cdot \left(\mathbf{1}^T\left[\Phi^T \circ (G \mathbf{1}\mathbf{1}^T)\right]\right)^T \\ \mathbf{1}^T G \mathbf{1} \end{bmatrix} + \lambda w,$  (20)

where the elements of the matrix G are $g_{ij} = d_{ij} \cdot \alpha_{ij}$.
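Putting the pieces together, a sketch of the gradient (20) over the all-pairs training set is given below. It reuses the score_matrix and dloss_dscore helpers from the earlier sketches and assumes symmetric label and weight matrices (every pair is considered in both orders); this illustrates the bookkeeping, and is not the authors' code.

import numpy as np

def gradient(Phi, labels, alphas, Lambda, Gamma, c, k, lam, w, loss='lr'):
    """Gradient of eq. (20) over all-pairs training trials.

    Phi    : (D, N) training i-vectors stored as columns
    labels : (N, N) symmetric matrix of labels t_ij
    alphas : (N, N) symmetric matrix of trial weights alpha_ij
    w      : current parameters stacked as [vec(Lambda); vec(Gamma); c; k]
    """
    S = score_matrix(Phi, Phi, Lambda, Gamma, c, k)     # scores s_ij via eq. (19)
    G = alphas * dloss_dscore(labels, S, loss)          # g_ij = d_ij * alpha_ij
    ones = np.ones(Phi.shape[1])
    row_sums = G @ ones                                 # G 1
    grad_Lambda = 2.0 * (Phi @ G @ Phi.T)
    grad_Gamma = 2.0 * (Phi * row_sums) @ Phi.T         # 2 Phi [Phi^T o (G 1 1^T)]
    grad_c = 2.0 * (Phi @ row_sums)
    grad_k = ones @ G @ ones                            # 1^T G 1
    return np.concatenate([grad_Lambda.ravel(order='F'),
                           grad_Gamma.ravel(order='F'),
                           grad_c,
                           [grad_k]]) + lam * w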

5. EXPERIMENTS

The i-vector extractor and the baseline PLDA system are taken from the ABC system submitted to the NIST SRE 2010 evaluation [10]. The

i-vector extractor uses 60-dimensional cepstral features and a 2048-

component full covariance GMM. The UBM and i-vector extractor

are trained on NIST SRE 2004, 2005 and 2006, Switchboard and

Fisher data. All PLDA systems and discriminative classiﬁers are

trained using 400-dimensional i-vectors extracted from 21663 seg-

ments from 1384 female speakers and 16969 segments from 1051

male speakers from NIST SRE 2004, NIST SRE 2005, NIST SRE

2006, Switchboard II Phases 2 and 3, and Switchboard Cellular Parts

1 and 2. Table 1 presents results for the extended condition 5 (tel-tel) from the NIST SRE 2010 evaluation.

Table 1. Normalized newDCF, oldDCF and EER for the extended condition 5 (tel-tel) from the NIST SRE 2010 evaluation.

                 Female Set                Male Set                  Pooled
System       newDCF oldDCF  EER       newDCF oldDCF  EER       newDCF oldDCF  EER
PLDA          0.40   0.15   3.57       0.42   0.13   2.86       0.41   0.14   3.23
LR            0.40   0.12   2.94       0.39   0.10   2.22       0.40   0.11   2.62
SVM           0.39   0.11   2.35       0.31   0.08   1.55       0.37   0.10   1.94
HT-PLDA       0.34   0.11   2.22       0.33   0.08   1.47       0.34   0.10   1.88

Error Rate (EER) and normalized minimum Decision Cost Functions

for the two operating points as deﬁned by NIST for the SRE 2008

(oldDCF) and SRE 2010 (newDCF) evaluations [11].

The system denoted as PLDA, which serves as our baseline, is

based on a generatively trained PLDA model with a 90-dimensional

speaker variability subspace [10]. On telephone data, this configuration was found to give the best newDCF, the primary performance measure in the NIST SRE 2010 evaluation, which focused on low false alarm rates. As a tradeoff, the system gives somewhat

poorer performance at the oldDCF and EER.

The system denoted as LR is the discriminative linear classiﬁer,

where parameters were initialized from the baseline system using (8)

and retrained to optimize the logistic regression objective function.

We have used the conjugate gradient trust region method [12] as im-

plemented in [13] to numerically optimize the parameters. No regu-

larization was used in this case. Signiﬁcant improvements compared

to the baseline can be observed, especially at oldDCF and EER.

Even larger improvements were observed for the SVM-based

classiﬁer, where 10%, 30% and 40% relative improvements over the

baseline were obtained for newDCF, oldDCF and EER respectively.

The improvements over the LR system can probably be attributed

mainly to the presence of the regularization term. Often, SVM clas-

sifiers are trained using a solver for the dual problem, where a Gram

matrix needs to be evaluated. The Gram matrix is a matrix com-

prising dot products between every pair of training examples, which

are the trials in our case. Since we decided to construct a training

trial for every pair of i-vectors, the size of the Gram matrix would

be unmanageably large (the number of training i-vectors to the 4th

power). Therefore, we instead train the linear SVM by solving the primal problem using the solver of [14], which makes use of the efficient evaluation of the gradient. To make SVM regularization effective, we

have found that it is necessary to ﬁrst normalize input i-vectors using

within-class covariance normalization (WCCN) [1], i.e. to normalize

i-vectors to have identity within-class covariance matrix. More de-

tails on the SVM-based system described in this paper can be found

in our parallel paper [15].
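For completeness, a common way to implement the WCCN step mentioned above is sketched here: estimate the within-class covariance from labelled training i-vectors and apply a linear transform that makes it the identity. The Cholesky-based construction and the per-speaker averaging convention are assumptions; other equivalent whitening transforms would serve as well.

import numpy as np

def wccn_transform(ivectors, speaker_ids):
    """Linear transform B such that the within-class covariance of B^T phi is identity."""
    ivectors = np.asarray(ivectors, dtype=float)
    speaker_ids = np.asarray(speaker_ids)
    dim = ivectors.shape[1]
    W = np.zeros((dim, dim))
    speakers = np.unique(speaker_ids)
    for spk in speakers:
        X = ivectors[speaker_ids == spk]
        Dw = X - X.mean(axis=0)
        W += Dw.T @ Dw / len(X)
    W /= len(speakers)                                   # average per-speaker covariance
    B = np.linalg.cholesky(np.linalg.inv(W))             # B B^T = W^{-1}
    return B

# Usage: B = wccn_transform(train_ivectors, train_speaker_ids); whitened = ivectors @ B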

Finally, for comparison, we also include results with Heavy-

tailed PLDA (HT-PLDA) [2], which are so far the best results we

have obtained with the same set of training and test i-vectors. In

heavy-tailed PLDA, speaker and intersession variability are modeled

using Student’s t, rather than Gaussian distributions. In our system,

the dimensionality of i-vectors was ﬁrst reduced from 400 to 120

and the ﬁnal vectors were modeled with full-rank speaker and in-

tersession subspaces. Nevertheless, the price paid for the excellent

results obtained with heavy-tailed PLDA is the very computationally

demanding score evaluation. As we can see, competitive results can

be obtained with our discriminatively trained models, for which the

score evaluation is several orders of magnitude faster.

6. CONCLUSIONS

Recent advances in speaker veriﬁcation build on i-vector extraction

and Probabilistic Linear Discriminant Analysis (PLDA). In this pa-

per, we have proposed to use a PLDA-like functional form for evaluat-

ing the speaker veriﬁcation score for a pair of i-vectors represent-

ing a trial. However, estimation of the function parameters is based

on a discriminative rather than a generative training criterion. We

have shown the beneﬁt of using the objective function to directly ad-

dress the task in speaker veriﬁcation: discrimination between same-

speaker and different-speaker trials. On the NIST SRE 2010 eval-

uation task, our results show a signiﬁcant (up to 40%) relative im-

provement from this approach, compared to a baseline that uses a

generatively trained PLDA model.

In future work, we would like to test our method on additional

conditions beyond the telephone speech, and to develop techniques

for adapting the trained system to be able to cope with new chan-

nel conditions. Various methods for regularizing logistic regression

training are also worth investigating. We would also like to experi-

ment with models based on more general forms of the PLDA model.

Functional forms for veriﬁcation scores derived from PLDA with

low-rank speaker or channel subspaces would allow us to control

the number of trainable parameters. Another interesting alternative

would be a functional form that would more closely simulate the

heavy-tailed PLDA generative model [2], which is currently provid-

ing better performance than PLDA based on Gaussian distributions.

7. REFERENCES

[1] N. Dehak, P. Kenny, et al., “Front-end factor analysis for speaker verification,” IEEE Trans. on Audio, Speech and Lang. Process., 2010.

[2] P. Kenny, “Bayesian speaker veriﬁcation with heavy–tailed priors,”

keynote presentation, Proc. of Odyssey 2010, June 2010.

[3] W.M. Campbell, D.E. Sturim, D.A. Reynolds, and A. Solomonoff, “SVM based speaker verification using a GMM supervector kernel and NAP variability compensation,” in Proc. ICASSP, May 2006, vol. 1.

[4] N. Brümmer, “A farewell to SVM: Bayes factor speaker detection in

supervector space,” http://sites.google.com/site/nikobrummer/.

[5] L. Burget et al., “Robust speaker recognition over varying channels,”

in Johns Hopkins University CLSP Summer Workshop Report, 2008,

www.clsp.jhu.edu/workshops/ws08/documents/jhu report main.pdf.

[6] P. Kenny et al., “Joint factor analysis versus eigenchannels in speaker

recognition,” IEEE Transactions on Audio, Speech, and Language Pro-

cessing, vol. 15, no. 7, pp. 2072–2084, 2007.

[7] S. J. D. Prince and J. H. Elder, “Probabilistic linear discriminant anal-

ysis for inferences about identity,” in 11th International Conference on

Computer Vision, 2007, pp. 1–8.

[8] C. M. Bishop, Pattern Recognition and Machine Learning, chapter 4.2,

Springer, 2006.

[9] N. Brümmer and J. A. du Preez, “Application-independent evaluation

of speaker detection,” Computer Speech & Language, vol. 20, no. 2-3,

pp. 230–275, 2006.

[10] N. Brümmer, L. Burget, P. Kenny, et al., “ABC system description for

NIST SRE 2010,” in Proc. NIST 2010 Speaker Recognition Evaluation.

[11] NIST, “The NIST year 2008 and 2010 speaker recognition evaluation

plans,” http://www.itl.nist.gov/iad/mig/tests/sre.

[12] Jorge Nocedal and Stephen J. Wright, Numerical Optimization,

Springer, August 2000.

[13] E. de Villiers and N. Brümmer, “BOSARIS toolkit,”

https://sites.google.com/site/bosaristoolkit/.

[14] C.H. Teo, A. Smola, et al., “A scalable modular convex solver for

regularized risk minimization,” in Proc. of KDD, 2007, pp. 727–736.

[15] S. Cumani, N. Brümmer, L. Burget, and P. Laface, “Fast discriminative speaker verification in the i-vector space,” submitted to Proc. of

ICASSP 2011, Prague.
