DISCRIMINATIVELY TRAINED PROBABILISTIC LINEAR DISCRIMINANT ANALYSIS FOR SPEAKER VERIFICATION
Lukáš Burget1, Oldřich Plchot1, Sandro Cumani2, Ondřej Glembek1, Pavel Matějka1, Niko Brümmer3
1Brno University of Technology, Czech Rep., {burget,iplchot,glembek,matejkap}@fit.vutbr.cz
2Politecnico di Torino, Italy, sandro.cumani@polito.it
3AGNITIO, S. Africa, niko.brummer@gmail.com
ABSTRACT
Recently, i-vector extraction and Probabilistic Linear Discriminant Analysis (PLDA) have proven to provide state-of-the-art speaker verification performance. In this paper, the speaker verification score for a pair of i-vectors representing a trial is computed with a functional form derived from the successful PLDA generative model. In our case, however, parameters of this function are estimated based on a discriminative training criterion. We propose to use the objective function to directly address the task in speaker verification: discrimination between same-speaker and different-speaker trials. Compared with a baseline which uses a generatively trained PLDA model, discriminative training provides up to 40% relative improvement on the NIST SRE 2010 evaluation task.
Index Terms: Speaker verification, Discriminative training, Probabilistic Linear Discriminant Analysis
1. INTRODUCTION
In this paper, we show that discriminative training can be used to improve the performance of state-of-the-art speaker verification systems based on i-vector extraction and Probabilistic Linear Discriminant Analysis (PLDA). Recently, systems based on i-vectors [1, 2] extracted from cepstral features have provided superior performance in speaker verification. The so-called i-vector is an information-rich, low-dimensional, fixed-length vector extracted from the feature sequence representing a speech segment (see section 2 for more details on i-vector extraction). A speaker verification score is then produced by comparing the two i-vectors corresponding to the segments in the verification trial. The function taking two i-vectors as an input and producing the corresponding verification score is typically designed to give a good approximation of the log-likelihood ratio between the “same-speaker” and “different-speaker” hypotheses. Typically, the function is also designed to produce a symmetric score (i.e. to produce output that is independent of which segment is enrollment and which is test, unlike traditional systems, which distinguish the two). In [1], good performance was reported when scores were computed as cosine distances between i-vectors normalized using within-class covariance normalization (WCCN). Best performance, however, is currently obtained with PLDA [2], a generative model of i-vector distributions that allows for direct evaluation of the desired log-likelihood ratio verification score (see section 3 for details on the specific form of PLDA used in our work).

This work was funded by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), through the Army Research Laboratory (ARL). All statements of fact, opinion or conclusions contained herein are those of the authors and should not be construed as representing the official views or policies of IARPA, the ODNI, or the U.S. Government. The work was also partly supported by the Grant Agency of the Czech Republic, project No. 102/08/0707, and the Czech Ministry of Education, project No. MSM0021630528.
In this paper, we propose to estimate verification scores using a discriminative model rather than a generative PLDA model. More specifically, the speaker verification score for a pair of i-vectors is computed using a function whose functional form is derived from the PLDA generative model. The parameters of the function, however, are estimated using a discriminative training criterion. We use an objective function that directly addresses the speaker verification task, i.e. the discrimination between “same-speaker” and “different-speaker” trials. In other words, a binary classifier that takes a pair of i-vectors as an input is trained to answer the question of whether or not the two i-vectors come from the same speaker. We show that the functional form derived from PLDA can be interpreted as a binary linear classifier in a nonlinearly expanded space of i-vector pairs. We have experimented with two discriminative linear classifiers, namely linear support vector machines (SVM) and logistic regression. The advantage of logistic regression is its probabilistic interpretation: the linear output of this classifier can be directly interpreted as the desired log-likelihood ratio verification score. On the NIST SRE 2010 evaluation task, we show that up to 40% relative improvement over the PLDA baseline can be obtained with such discriminatively trained models.
There has been previous work on discriminative training for speaker recognition, such as GMM-SVM [3]. This and similar approaches, however, do not directly address the objective of discriminating between same-speaker and different-speaker trials. Instead, SVMs are trained as discriminative models representing each target speaker. As a consequence, this approach cannot fully benefit from discriminative training, as there is a very limited number of positive examples (usually only one enrollment segment) available for training of each model. In contrast, in our approach, a model is trained using a large number of positive and negative examples, each of which is one of many possible same-speaker or different-speaker trials that can be constructed from the training segments.
The very same idea of discriminatively training a PLDA-like model for speaker verification was originally proposed in [4], and some initial work was done in [5]. At that time, however, speaker factors extracted using Joint Factor Analysis (JFA) [6] were used as a suboptimal input for the classifier, and state-of-the-art performance would not have been achieved.
2. I-VECTORS
The i-vector approach has become the state of the art in the speaker verification field [1]. The approach provides an elegant way of reducing large-dimensional input data to a small-dimensional feature vector while retaining most of the relevant information. The technique was originally inspired by the JFA framework [6]. The basic principle is that we first train an i-vector extractor on development data and then, for each speech segment, extract the i-vector as a low-dimensional fixed-length representation of that segment. The main idea is that the speaker- and session-dependent supervectors of concatenated Gaussian mixture model (GMM) means can be modeled as
$$\mathbf{s} = \mathbf{m} + \mathbf{T}\mathbf{x}, \quad (1)$$

where m is the Universal Background Model (UBM) GMM mean supervector, T is a matrix of bases spanning the subspace covering the important variability (both speaker- and session-specific) in the supervector space, and x is a standard-normally distributed latent variable. For each observation sequence representing a segment, our i-vector φ is the MAP point estimate of the latent variable x.
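For concreteness, the following sketch shows how such a MAP point estimate is typically computed from Baum-Welch sufficient statistics. It assumes NumPy and, for simplicity, a diagonal-covariance UBM (the system in section 5 uses full covariances), so it illustrates the principle rather than our exact implementation:

```python
import numpy as np

def extract_ivector(T, Sigma_diag, N, f):
    """MAP point estimate of the latent variable x in s = m + Tx.

    T          -- (D, R) subspace matrix (D = supervector dimension)
    Sigma_diag -- (D,)   UBM covariances, assumed diagonal here
    N          -- (D,)   zero-order statistics, replicated per dimension
    f          -- (D,)   first-order statistics centered around the UBM means
    """
    TtSi = T.T / Sigma_diag                  # T^T Sigma^-1 (diagonal Sigma)
    L = np.eye(T.shape[1]) + (TtSi * N) @ T  # posterior precision of x
    return np.linalg.solve(L, TtSi @ f)      # posterior mean = i-vector phi
```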
3. PLDA
3.1. Two-covariance model
To facilitate comparison of i-vectors in a verification trial, we model the distribution of i-vectors using a Probabilistic LDA model [7, 2]. We first consider only a special form of PLDA, the two-covariance model, in which speaker and inter-session variability are modeled using across-class and within-class full covariance matrices Σac and Σwc. The two-covariance model is a generative linear-Gaussian model, where latent vectors y representing speakers (or, more generally, classes) are assumed to be distributed according to the prior distribution

$$p(\mathbf{y}) = \mathcal{N}(\mathbf{y};\, \boldsymbol{\mu},\, \boldsymbol{\Sigma}_{ac}). \quad (2)$$

For a given speaker represented by a vector ŷ, the distribution of i-vectors is assumed to be

$$p(\boldsymbol{\phi} \mid \hat{\mathbf{y}}) = \mathcal{N}(\boldsymbol{\phi};\, \hat{\mathbf{y}},\, \boldsymbol{\Sigma}_{wc}). \quad (3)$$
The ML estimates of the model parameters μ, Σac, and Σwc can be obtained using an EM algorithm as in [2]. The training i-vectors come from a database comprising recordings of many speakers (to capture across-class variability), each recorded in several sessions (to capture within-class variability).
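As an illustration, the generative story of the two-covariance model can be simulated in a few lines of NumPy (toy dimensions and randomly constructed covariances, not our trained parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4                                      # toy i-vector dimension
mu = rng.normal(size=D)
A = rng.normal(size=(D, D))
B = rng.normal(size=(D, D))
Sigma_ac = A @ A.T + np.eye(D)             # across-class covariance
Sigma_wc = B @ B.T + np.eye(D)             # within-class covariance

# same-speaker trial: one speaker vector from the prior, two sessions around it
y = rng.multivariate_normal(mu, Sigma_ac)
phi1 = rng.multivariate_normal(y, Sigma_wc)
phi2 = rng.multivariate_normal(y, Sigma_wc)

# different-speaker trial: two independent speaker vectors
y1 = rng.multivariate_normal(mu, Sigma_ac)
y2 = rng.multivariate_normal(mu, Sigma_ac)
phi1_d = rng.multivariate_normal(y1, Sigma_wc)
phi2_d = rng.multivariate_normal(y2, Sigma_wc)
```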
In the more general case, the speaker and/or inter-session variability can be modeled using subspaces [1]. For example, in our baseline system, speaker variability is not modeled using a full covariance matrix. Instead, a low-rank across-class covariance matrix is modeled as $\boldsymbol{\Sigma}_{ac} = \mathbf{V}\mathbf{V}^T$, which limits speaker variability to a subspace spanned by the columns of the reduced-rank matrix V.
3.2. Evaluation of verification score
Consider the process of generating two i-vectors φ1 and φ2 forming a trial. In the case of a same-speaker trial, a single vector ŷ representing a speaker is generated from the prior p(y), and both φ1 and φ2 are generated from p(φ|ŷ). For a different-speaker trial, two latent vectors representing two different speakers are independently generated from p(y), and each of the i-vectors φ1 and φ2 is generated from one of them. Given a trial, we want to test two hypotheses: Hd, that the trial is a different-speaker trial, and Hs, that the trial is a same-speaker trial. The speaker verification score can now be calculated as a log-likelihood ratio between the two hypotheses Hs and Hd as

$$s = \log \frac{p(\boldsymbol{\phi}_1, \boldsymbol{\phi}_2 \mid H_s)}{p(\boldsymbol{\phi}_1, \boldsymbol{\phi}_2 \mid H_d)} \quad (4)$$

$$= \log \frac{\int p(\boldsymbol{\phi}_1 \mid \mathbf{y})\, p(\boldsymbol{\phi}_2 \mid \mathbf{y})\, p(\mathbf{y})\, d\mathbf{y}}{p(\boldsymbol{\phi}_1)\, p(\boldsymbol{\phi}_2)}, \quad (5)$$
where in the numerator we integrate over the distribution of speaker vectors and, for each possible speaker, calculate the likelihood of producing both i-vectors from that speaker. In the denominator, we simply multiply the marginal likelihoods $p(\boldsymbol{\phi}) = \int p(\boldsymbol{\phi} \mid \mathbf{y})\, p(\mathbf{y})\, d\mathbf{y}$. The integrals, which can be interpreted as convolutions of Gaussians, can be evaluated analytically, giving

$$s = \log \mathcal{N}\!\left(\begin{bmatrix}\boldsymbol{\phi}_1\\ \boldsymbol{\phi}_2\end{bmatrix};\, \begin{bmatrix}\boldsymbol{\mu}\\ \boldsymbol{\mu}\end{bmatrix},\, \begin{bmatrix}\boldsymbol{\Sigma}_{tot} & \boldsymbol{\Sigma}_{ac}\\ \boldsymbol{\Sigma}_{ac} & \boldsymbol{\Sigma}_{tot}\end{bmatrix}\right) - \log \mathcal{N}\!\left(\begin{bmatrix}\boldsymbol{\phi}_1\\ \boldsymbol{\phi}_2\end{bmatrix};\, \begin{bmatrix}\boldsymbol{\mu}\\ \boldsymbol{\mu}\end{bmatrix},\, \begin{bmatrix}\boldsymbol{\Sigma}_{tot} & \mathbf{0}\\ \mathbf{0} & \boldsymbol{\Sigma}_{tot}\end{bmatrix}\right), \quad (6)$$
where the total covariance matrix is given as $\boldsymbol{\Sigma}_{tot} = \boldsymbol{\Sigma}_{ac} + \boldsymbol{\Sigma}_{wc}$. By expanding the log of the Gaussian distributions and simplifying the final expression, we obtain

$$s = \boldsymbol{\phi}_1^T \boldsymbol{\Lambda} \boldsymbol{\phi}_2 + \boldsymbol{\phi}_2^T \boldsymbol{\Lambda} \boldsymbol{\phi}_1 + \boldsymbol{\phi}_1^T \boldsymbol{\Gamma} \boldsymbol{\phi}_1 + \boldsymbol{\phi}_2^T \boldsymbol{\Gamma} \boldsymbol{\phi}_2 + (\boldsymbol{\phi}_1 + \boldsymbol{\phi}_2)^T \mathbf{c} + k, \quad (7)$$
where

$$\begin{aligned}
\boldsymbol{\Gamma} &= -\tfrac{1}{4}(\boldsymbol{\Sigma}_{wc} + 2\boldsymbol{\Sigma}_{ac})^{-1} - \tfrac{1}{4}\boldsymbol{\Sigma}_{wc}^{-1} + \tfrac{1}{2}\boldsymbol{\Sigma}_{tot}^{-1}\\
\boldsymbol{\Lambda} &= -\tfrac{1}{4}(\boldsymbol{\Sigma}_{wc} + 2\boldsymbol{\Sigma}_{ac})^{-1} + \tfrac{1}{4}\boldsymbol{\Sigma}_{wc}^{-1}\\
\mathbf{c} &= \left((\boldsymbol{\Sigma}_{wc} + 2\boldsymbol{\Sigma}_{ac})^{-1} - \boldsymbol{\Sigma}_{tot}^{-1}\right)\boldsymbol{\mu}\\
k &= \log|\boldsymbol{\Sigma}_{tot}| - \tfrac{1}{2}\log|\boldsymbol{\Sigma}_{wc} + 2\boldsymbol{\Sigma}_{ac}| - \tfrac{1}{2}\log|\boldsymbol{\Sigma}_{wc}| + \boldsymbol{\mu}^T\left(\boldsymbol{\Sigma}_{tot}^{-1} - (\boldsymbol{\Sigma}_{wc} + 2\boldsymbol{\Sigma}_{ac})^{-1}\right)\boldsymbol{\mu}.
\end{aligned} \quad (8)$$
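The mapping from the generative parameters (μ, Σac, Σwc) to the score parameters (Λ, Γ, c, k), together with the scoring formula (7), is a direct transcription of the equations above; the following NumPy sketch spells it out:

```python
import numpy as np

def plda_score_params(mu, Sigma_ac, Sigma_wc):
    """Map two-covariance model parameters to (Lambda, Gamma, c, k) of (7)/(8)."""
    Sigma_tot = Sigma_ac + Sigma_wc
    P1 = np.linalg.inv(Sigma_wc + 2.0 * Sigma_ac)
    Pwc = np.linalg.inv(Sigma_wc)
    Ptot = np.linalg.inv(Sigma_tot)
    Lambda = -0.25 * P1 + 0.25 * Pwc
    Gamma = -0.25 * P1 - 0.25 * Pwc + 0.5 * Ptot
    c = (P1 - Ptot) @ mu
    _, ld_tot = np.linalg.slogdet(Sigma_tot)
    _, ld_1 = np.linalg.slogdet(Sigma_wc + 2.0 * Sigma_ac)
    _, ld_wc = np.linalg.slogdet(Sigma_wc)
    k = ld_tot - 0.5 * ld_1 - 0.5 * ld_wc + mu @ (Ptot - P1) @ mu
    return Lambda, Gamma, c, k

def plda_score(phi1, phi2, Lambda, Gamma, c, k):
    """Verification log-likelihood ratio of eq. (7)."""
    return (2.0 * phi1 @ Lambda @ phi2
            + phi1 @ Gamma @ phi1 + phi2 @ Gamma @ phi2
            + (phi1 + phi2) @ c + k)
```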
We recall that the computation of a bilinear form $\mathbf{x}^T \mathbf{A} \mathbf{y}$ can be expressed in terms of the Frobenius inner product as $\mathbf{x}^T \mathbf{A} \mathbf{y} = \langle \mathbf{A},\, \mathbf{x}\mathbf{y}^T \rangle = \mathrm{vec}(\mathbf{A})^T \mathrm{vec}(\mathbf{x}\mathbf{y}^T)$, where vec(·) stacks the columns of a matrix into a vector. Therefore, the log-likelihood ratio score can be written as a dot product of a vector of weights w and an expanded vector ϕ(φ1, φ2) representing a trial:

$$s = \mathbf{w}^T \varphi(\boldsymbol{\phi}_1, \boldsymbol{\phi}_2) = \begin{bmatrix} \mathrm{vec}(\boldsymbol{\Lambda}) \\ \mathrm{vec}(\boldsymbol{\Gamma}) \\ \mathbf{c} \\ k \end{bmatrix}^T \begin{bmatrix} \mathrm{vec}(\boldsymbol{\phi}_1\boldsymbol{\phi}_2^T + \boldsymbol{\phi}_2\boldsymbol{\phi}_1^T) \\ \mathrm{vec}(\boldsymbol{\phi}_1\boldsymbol{\phi}_1^T + \boldsymbol{\phi}_2\boldsymbol{\phi}_2^T) \\ \boldsymbol{\phi}_1 + \boldsymbol{\phi}_2 \\ 1 \end{bmatrix}. \quad (9)$$
Hence, we have obtained a generative generalized linear classifier [8], where the probability of a same-speaker trial can be computed from the log-likelihood ratio score using the sigmoid activation function as

$$p(H_s \mid \boldsymbol{\phi}_1, \boldsymbol{\phi}_2) = \sigma(s) = (1 + \exp(-s))^{-1}. \quad (10)$$

Here, we have assumed equal priors for both hypotheses. To allow for different priors, we can simply adjust the constant k in the vector of weights by adding logit(p(Hs)).
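The correspondence between (7) and (9) is easy to verify numerically. The sketch below, reusing names from the earlier snippets, builds the expanded trial vector and the stacked weight vector and checks that their dot product reproduces the bilinear score:

```python
import numpy as np

def expand_trial(phi1, phi2):
    """Nonlinear trial expansion phi(phi1, phi2) of eq. (9)."""
    return np.concatenate([
        np.ravel(np.outer(phi1, phi2) + np.outer(phi2, phi1), order='F'),
        np.ravel(np.outer(phi1, phi1) + np.outer(phi2, phi2), order='F'),
        phi1 + phi2,
        [1.0],
    ])

def weights_from_plda(Lambda, Gamma, c, k):
    """Stack (Lambda, Gamma, c, k) into the weight vector w of eq. (9)."""
    return np.concatenate([np.ravel(Lambda, order='F'),
                           np.ravel(Gamma, order='F'), c, [k]])

# sanity check, reusing mu, Sigma_ac, Sigma_wc, phi1, phi2 from above
Lambda, Gamma, c, k = plda_score_params(mu, Sigma_ac, Sigma_wc)
w = weights_from_plda(Lambda, Gamma, c, k)
assert np.isclose(w @ expand_trial(phi1, phi2),
                  plda_score(phi1, phi2, Lambda, Gamma, c, k))
```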
4. DISCRIMINATIVE CLASSIFIERS
In this section, we describe how we train the weights w directly, in order to discriminate between same-speaker and different-speaker trials, without having to explicitly model the distributions of i-vectors. To represent a trial, we keep the same expansion ϕ(φ1, φ2) as defined in (9). Hence, we reuse the functional form for computing verification scores that provided excellent results with generative PLDA. We consider two standard discriminative linear classifiers, namely logistic regression and SVMs.
4.1. Objective functions
The set of training examples, which we continue referring to as training trials, comprises both different-speaker and same-speaker trials. Let us use the coding scheme $t \in \{-1, 1\}$ to represent labels for the different-speaker and same-speaker trials, respectively. Assigning each trial a log-likelihood ratio s and the correct label t, the log probability of recognizing the trial correctly can be expressed as

$$\log p(t \mid \boldsymbol{\phi}_1, \boldsymbol{\phi}_2) = -\log(1 + \exp(-st)). \quad (11)$$
This is easy to see from equation (10), recalling that $\sigma(-s) = 1 - \sigma(s)$. In the case of logistic regression, the objective function to maximize is the log probability of correctly classifying all training examples, i.e. the sum of expressions (11) evaluated for all training trials. Equivalently, this can be expressed as minimizing the cross-entropy error function, which is a sum over all training trials:

$$E(\mathbf{w}) = \sum_{n=1}^{N} \alpha_n E_{LR}(t_n s_n) + \frac{\lambda}{2}\|\mathbf{w}\|^2, \quad (12)$$

where the logistic regression loss function

$$E_{LR}(ts) = \log(1 + \exp(-ts)) \quad (13)$$
is simply the negative log probability (11) of correctly recognizing a trial. We have also added the regularization term $\frac{\lambda}{2}\|\mathbf{w}\|^2$, where λ is a constant controlling the tradeoff between the error function and the regularizer. The coefficients $\alpha_n$ allow us to weight individual trials. Specifically, we use them to assign different weights to same-speaker and different-speaker trials. This allows us to select a particular operating point, around which we want to optimize the performance of our system, without relying on the proportion of same-speaker and different-speaker trials in the training set. The advantage of using the cross-entropy objective for training is that it reflects the performance of the system over a wide range of operating points (around the selected one). For this reason, a similar function was also proposed as a performance measure for the speaker verification task [9]. Another advantage of using the logistic regression classifier is its probabilistic nature: it trains the weights so that the score $s = \mathbf{w}^T \varphi(\boldsymbol{\phi}_1, \boldsymbol{\phi}_2)$ can be interpreted as the log-likelihood ratio between hypotheses Hs and Hd.
Taking (12) and replacing $E_{LR}(ts)$ with the hinge loss function

$$E_{SV}(ts) = \max(0,\, 1 - ts), \quad (14)$$

we obtain an SVM, which is a classifier traditionally understood to maximize the margin separating class samples. Alternatively, one can see the hinge loss function as a piecewise-linear approximation to the logistic regression loss function. Therefore, one can assume that the score $s = \mathbf{w}^T \varphi(\boldsymbol{\phi}_1, \boldsymbol{\phi}_2)$ obtained from an SVM classifier will still be a reasonable approximation to the log-likelihood ratio (after a linear calibration).
4.2. Gradient evaluation
In order to numerically optimize the parameters w of the classifier, we need to evaluate the gradient of the error function

$$\nabla E(\mathbf{w}) = \sum_{n=1}^{N} \alpha_n \frac{\partial E(t_n s_n)}{\partial s_n} \frac{\partial s_n}{\partial \mathbf{w}} + \lambda \mathbf{w}, \quad (15)$$

where the derivative of the loss function $E(t_n s_n)$ w.r.t. the score $s_n$ depends on the particular choice of the loss function. For the logistic regression loss function, it is

$$\frac{\partial E_{LR}(ts)}{\partial s} = -t\,\sigma(-ts), \quad (16)$$

while for the hinge loss function it becomes

$$\frac{\partial E_{SV}(ts)}{\partial s} = \begin{cases} 0 & \text{if } ts \geq 1 \\ -t & \text{otherwise.} \end{cases} \quad (17)$$

Finally, the derivative of the score w.r.t. the classifier parameters is just the expanded trial vector

$$\frac{\partial s}{\partial \mathbf{w}} = \frac{\partial}{\partial \mathbf{w}} \mathbf{w}^T \varphi(\boldsymbol{\phi}_1, \boldsymbol{\phi}_2) = \varphi(\boldsymbol{\phi}_1, \boldsymbol{\phi}_2). \quad (18)$$
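The two loss functions and their derivatives, (13), (14), (16) and (17), translate directly into code; a minimal NumPy version follows:

```python
import numpy as np

def lr_loss(ts):
    """Logistic regression loss, eq. (13), computed stably."""
    return np.logaddexp(0.0, -ts)

def lr_dloss_ds(t, s):
    """Eq. (16): derivative of E_LR(ts) w.r.t. s, i.e. -t * sigmoid(-ts)."""
    return -t / (1.0 + np.exp(t * s))

def svm_loss(ts):
    """Hinge loss, eq. (14)."""
    return np.maximum(0.0, 1.0 - ts)

def svm_dloss_ds(t, s):
    """Eq. (17): zero where the margin ts >= 1, -t elsewhere."""
    return np.where(t * s >= 1.0, 0.0, -t)
```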
4.3. Efficient score and gradient evaluation
Given a trained classifier, we can obtain a verification score for a trial by forming the expanded vector ϕ(φ1, φ2) and computing the dot product (9). However, as we have already seen, the same score can be obtained from the two original i-vectors φ1, φ2 using formula (7), which is both memory- and computationally efficient. Now, consider two sets of i-vectors stored as columns of the matrices Φe and Φt. For illustration, let us call these sets enrollment and test trials, although they play symmetrical roles in our scoring scheme. We can efficiently score each enrollment trial against each test trial and obtain the full matrix of scores as

$$\mathbf{S} = 2\boldsymbol{\Phi}_e^T \boldsymbol{\Lambda} \boldsymbol{\Phi}_t + \left((\boldsymbol{\Phi}_e^T \boldsymbol{\Gamma}) \circ \boldsymbol{\Phi}_e^T\right)\mathbf{1}\mathbf{1}^T + \mathbf{1}\mathbf{1}^T\left(\boldsymbol{\Phi}_t \circ (\boldsymbol{\Gamma}\boldsymbol{\Phi}_t)\right) + \boldsymbol{\Phi}_e^T \mathbf{c}\,\mathbf{1}^T + \mathbf{1}\,\mathbf{c}^T \boldsymbol{\Phi}_t + k\,\mathbf{1}\mathbf{1}^T, \quad (19)$$

where ∘ denotes the Hadamard, or “entrywise”, product.
Similarly, the naïve way of evaluating the gradient would be to explicitly expand every training trial and then to apply equations (15) to (18). However, again taking into account the functional form (7) for computing scores, the gradient can be evaluated much more efficiently, without any need for explicit trial expansion. Let all the i-vectors that we have available for training be stored in the columns of a matrix Φ. Now consider forming a training trial from every possible pair of i-vectors in the matrix. Let $s_{ij}$ be the score for the trial formed by the i-th and j-th columns of Φ, calculated using the parameters w for which we wish to evaluate the gradient. Let $t_{ij}$ and $\alpha_{ij}$ be the corresponding label and trial weight, respectively. Further, let $d_{ij}$ be the corresponding derivative of the loss function $E(t_{ij} s_{ij})$ w.r.t. the score $s_{ij}$, given in (16) or (17) depending on the loss function used. The gradient can now be efficiently evaluated as
$$\nabla E(\mathbf{w}) = \begin{bmatrix} \partial E / \partial\,\mathrm{vec}(\boldsymbol{\Lambda}) \\ \partial E / \partial\,\mathrm{vec}(\boldsymbol{\Gamma}) \\ \partial E / \partial \mathbf{c} \\ \partial E / \partial k \end{bmatrix} = \begin{bmatrix} 2 \cdot \mathrm{vec}\!\left(\boldsymbol{\Phi} \mathbf{G} \boldsymbol{\Phi}^T\right) \\ 2 \cdot \mathrm{vec}\!\left(\boldsymbol{\Phi}\left[\boldsymbol{\Phi}^T \circ (\mathbf{G}\mathbf{1}\mathbf{1}^T)\right]\right) \\ 2 \cdot \left[\boldsymbol{\Phi}^T \circ (\mathbf{G}\mathbf{1}\mathbf{1}^T)\right]^T \mathbf{1} \\ \mathbf{1}^T \mathbf{G} \mathbf{1} \end{bmatrix} + \lambda\mathbf{w}, \quad (20)$$

where the elements of the matrix G are $g_{ij} = d_{ij} \cdot \alpha_{ij}$.
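Assuming G is symmetric (which it is when trials are formed from all pairs of training i-vectors with symmetric labels and weights), (20) can be transcribed as:

```python
import numpy as np

def gradient(Phi, G, w, lam):
    """Gradient of the regularized objective, eq. (20).

    Phi -- (D, N) matrix with all training i-vectors as columns
    G   -- (N, N) symmetric matrix with g_ij = d_ij * alpha_ij
    """
    r = G.sum(axis=1)                    # row sums, i.e. G 1
    dLambda = 2.0 * Phi @ G @ Phi.T
    dGamma = 2.0 * (Phi * r) @ Phi.T     # 2 * Phi diag(G 1) Phi^T
    dc = 2.0 * Phi @ r
    dk = r.sum()                         # 1^T G 1
    return np.concatenate([np.ravel(dLambda, order='F'),
                           np.ravel(dGamma, order='F'),
                           dc, [dk]]) + lam * w
```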
5. EXPERIMENTS
The i-vector extractor and the baseline PLDA system are taken from the ABC system submitted to the NIST SRE 2010 evaluation [10]. The i-vector extractor uses 60-dimensional cepstral features and a 2048-component full-covariance GMM. The UBM and i-vector extractor are trained on NIST SRE 2004, 2005 and 2006, Switchboard and Fisher data. All PLDA systems and discriminative classifiers are trained using 400-dimensional i-vectors extracted from 21663 segments from 1384 female speakers and 16969 segments from 1051 male speakers from NIST SRE 2004, NIST SRE 2005, NIST SRE 2006, Switchboard II Phases 2 and 3, and Switchboard Cellular Parts 1 and 2. Table 1 presents results for the extended condition 5 (tel-tel) from the NIST SRE 2010 evaluation. The reported numbers are Equal Error Rate (EER) and normalized minimum Decision Cost Functions for the two operating points defined by NIST for the SRE 2008 (oldDCF) and SRE 2010 (newDCF) evaluations [11].

System   |     Female Set      |      Male Set       |       Pooled
         | newDCF oldDCF  EER  | newDCF oldDCF  EER  | newDCF oldDCF  EER
PLDA     |  0.40   0.15   3.57 |  0.42   0.13   2.86 |  0.41   0.14   3.23
LR       |  0.40   0.12   2.94 |  0.39   0.10   2.22 |  0.40   0.11   2.62
SVM      |  0.39   0.11   2.35 |  0.31   0.08   1.55 |  0.37   0.10   1.94
HT-PLDA  |  0.34   0.11   2.22 |  0.33   0.08   1.47 |  0.34   0.10   1.88

Table 1. Normalized newDCF, oldDCF and EER for the extended condition 5 (tel-tel) from the NIST SRE 2010 evaluation.
The system denoted PLDA, which serves as our baseline, is based on a generatively trained PLDA model with a 90-dimensional speaker variability subspace [10]. On telephone data, this configuration was found to give the best newDCF, which was the primary performance measure in the NIST SRE 2010 evaluation and which focused on low false alarm rates. As a tradeoff, the system gives somewhat poorer performance at the oldDCF and EER operating points.
The system denoted LR is the discriminative linear classifier whose parameters were initialized from the baseline system using (8) and retrained to optimize the logistic regression objective function. We used the conjugate gradient trust region method [12], as implemented in [13], to numerically optimize the parameters. No regularization was used in this case. Significant improvements over the baseline can be observed, especially at oldDCF and EER.
Even larger improvements were observed for the SVM-based classifier, where 10%, 30% and 40% relative improvements over the baseline were obtained for newDCF, oldDCF and EER, respectively. The improvements over the LR system can probably be attributed mainly to the presence of the regularization term. SVM classifiers are often trained using a solver for the dual problem, in which a Gram matrix needs to be evaluated. The Gram matrix comprises the dot products between every pair of training examples, which in our case are the trials. Since we decided to construct a training trial from every pair of i-vectors, the size of the Gram matrix would be unmanageably large (the number of training i-vectors to the 4th power). Therefore, we train the linear SVM by again solving the primal problem, using the solver of [14], which makes use of the efficient gradient evaluation described above. To make SVM regularization effective, we have found it necessary to first normalize the input i-vectors using within-class covariance normalization (WCCN) [1], i.e. to normalize the i-vectors so that they have an identity within-class covariance matrix (a sketch of this transformation is given below). More details on the SVM-based system described in this paper can be found in our parallel paper [15].
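For reference, a simple WCCN transformation might be implemented as below; the within-class covariance estimate shown (total within-speaker scatter divided by the number of i-vectors) is one common convention, not necessarily the exact one used in [1]:

```python
import numpy as np

def wccn_transform(Phi, speaker_ids):
    """Whiten i-vectors so their within-class covariance becomes identity.

    Phi         -- (D, N) i-vectors as columns
    speaker_ids -- (N,)   NumPy array with a speaker label per column
    """
    D, N = Phi.shape
    W = np.zeros((D, D))
    for spk in np.unique(speaker_ids):
        X = Phi[:, speaker_ids == spk]
        X = X - X.mean(axis=1, keepdims=True)   # center within the speaker
        W += X @ X.T
    W /= N                                      # within-class covariance estimate
    A = np.linalg.cholesky(np.linalg.inv(W))    # W^-1 = A A^T
    return A.T @ Phi                            # transformed i-vectors
```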
Finally, for comparison, we also include results for heavy-tailed PLDA (HT-PLDA) [2], which are so far the best results we have obtained with the same set of training and test i-vectors. In heavy-tailed PLDA, speaker and inter-session variability are modeled using Student's t distributions rather than Gaussians. In our system, the dimensionality of the i-vectors was first reduced from 400 to 120, and the resulting vectors were modeled with full-rank speaker and inter-session subspaces. The price paid for the excellent results obtained with heavy-tailed PLDA, however, is its very computationally demanding score evaluation. As we can see, competitive results can be obtained with our discriminatively trained models, for which score evaluation is several orders of magnitude faster.
6. CONCLUSIONS
Recent advances in speaker verification build on i-vector extraction and Probabilistic Linear Discriminant Analysis (PLDA). In this paper, we have proposed to use a functional form derived from PLDA for evaluating the speaker verification score for a pair of i-vectors representing a trial. However, the parameters of this function are estimated using a discriminative rather than a generative training criterion. We have shown the benefit of using an objective function that directly addresses the task in speaker verification: discrimination between same-speaker and different-speaker trials. On the NIST SRE 2010 evaluation task, our results show a significant (up to 40%) relative improvement from this approach compared to a baseline that uses a generatively trained PLDA model.
In future work, we would like to test our method on additional conditions beyond telephone speech, and to develop techniques for adapting the trained system to cope with new channel conditions. Various methods for regularizing logistic regression training are also worth investigating. We would also like to experiment with models based on more general forms of the PLDA model. Functional forms for verification scores derived from PLDA with low-rank speaker or channel subspaces would allow us to control the number of trainable parameters. Another interesting alternative would be a functional form that more closely simulates the heavy-tailed PLDA generative model [2], which currently provides better performance than PLDA based on Gaussian distributions.
7. REFERENCES
[1] N. Dehak, P. Kenny, et al., “Front-end factor analysis for speaker verification,” IEEE Trans. on Audio, Speech and Lang. Process., 2010.
[2] P. Kenny, “Bayesian speaker verification with heavy-tailed priors,” keynote presentation, Proc. of Odyssey 2010, June 2010.
[3] W. M. Campbell, D. E. Sturim, D. A. Reynolds, and A. Solomonoff, “SVM based speaker verification using a GMM supervector kernel and NAP variability compensation,” in Proc. ICASSP, May 2006, vol. 1.
[4] N. Brümmer, “A farewell to SVM: Bayes factor speaker detection in supervector space,” http://sites.google.com/site/nikobrummer/.
[5] L. Burget et al., “Robust speaker recognition over varying channels,” in Johns Hopkins University CLSP Summer Workshop Report, 2008, www.clsp.jhu.edu/workshops/ws08/documents/jhu report main.pdf.
[6] P. Kenny et al., “Joint factor analysis versus eigenchannels in speaker recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 7, pp. 2072–2084, 2007.
[7] S. J. D. Prince and J. H. Elder, “Probabilistic linear discriminant analysis for inferences about identity,” in 11th International Conference on Computer Vision, 2007, pp. 1–8.
[8] C. M. Bishop, Pattern Recognition and Machine Learning, chapter 4.2, Springer, 2006.
[9] N. Brümmer and J. A. du Preez, “Application-independent evaluation of speaker detection,” Computer Speech & Language, vol. 20, no. 2-3, pp. 230–275, 2006.
[10] N. Brümmer, L. Burget, P. Kenny, et al., “ABC system description for NIST SRE 2010,” in Proc. NIST 2010 Speaker Recognition Evaluation.
[11] NIST, “The NIST year 2008 and 2010 speaker recognition evaluation plans,” http://www.itl.nist.gov/iad/mig/tests/sre.
[12] J. Nocedal and S. J. Wright, Numerical Optimization, Springer, August 2000.
[13] E. de Villiers and N. Brümmer, “BOSARIS toolkit,” https://sites.google.com/site/bosaristoolkit/.
[14] C. H. Teo, A. Smola, et al., “A scalable modular convex solver for regularized risk minimization,” in Proc. of KDD, 2007, pp. 727–736.
[15] S. Cumani, N. Brümmer, L. Burget, and P. Laface, “Fast discriminative speaker verification in the i-vector space,” submitted to Proc. of ICASSP 2011, Prague.