A logarithmic based pole-zero vocal tract model estimation for speaker verification.
Ewald Enzinger1, Peter Balazs1, Damián Marelli2, Timo Becker3
1Acoustics Research Institute, Austrian Academy of Sciences, Vienna, Austria
2School of Elect. Engineering and Computer Science, University of Newcastle, Australia
3Federal Criminal Police Office, Germany
ABSTRACT
In this paper we investigate the use of formant and anti-
formant measurements of nasal consonants for speaker verifi-
cation. The features are obtained using a pole-zero vocal tract
model estimate optimized by minimizing a logarithmic crite-
rion which is motivated by the perception of amplitude by the
human auditory system. A GMM-UBM approach is used for
performing speaker comparisons within the likelihood-ratio
framework. Results are compared with systems based on Mel
Frequency Cepstral Coefficients (MFCCs) as well as formant
center frequencies and bandwidths obtained using the Snack
Toolkit. The formant and anti-formant based system attains
comparable results to the MFCC system and outperforms the
formant-based approach while offering a more straightfor-
ward interpretation in terms of a physical speech production
model.
Index Terms— Speaker recognition, speech analysis,
pole-zero model, formants, anti-formants
1. INTRODUCTION
Automatic speaker verification systems, especially those
targeted at the forensic field, predominantly use the Gaus-
sian Mixture Model – Universal Background Model (GMM-
UBM) approach, combined with cepstral features such as Mel
Frequency Cepstral Coefficients (MFCC). While this combi-
nation of classifier and features provides good performance,
the lack of a straightforward interpretation of these features
with regard to a physical model of vocal tract properties of
a speaker leaves them as an unfavorable choice for certain
applications such as providing evidence to the court.
On the other hand, formant features as they are used in
acoustic-phonetic approaches to forensic speaker comparison
[1] can be related to the resonance cavities of the vocal tract.
Formant center frequencies and their bandwidths are suffi-
cient to determine the areas of an acoustic tube formed by
cascading M uniform cylindrical sections of equal length [2].
They have been successfully applied to the task of forensic
speaker comparison using the GMM-UBM approach [3].
Formants are usually measured by methods based on all-pole models of the
speech production filter, which provide a good characterization of some
speech sound categories. However, representations of the vocal tract for
unvoiced, nasal and lateral sounds contain anti-resonances (zeros) in
addition to resonances (poles). Pole-zero models therefore offer an
advantage for these sounds. Here we use the estimation method described
in [4].
2. POLE-ZERO MODEL
Speech production is modeled by a linear, slowly time-
varying filter, the speech production filter (SPF), which mod-
els the combined effect of the vocal tract and the radiation of
the lips, as well as the glottal pulse shape in the case of voiced
sounds. It is assumed to be time-invariant during a short-time
period of approximately 20-40ms.
In the speech production model, the sampled speech signal $y(t)$ is assumed
to be generated by an excitation signal $u(t)$ filtered by the SPF
$g_t(\tau)$, i.e.

$$y(t) = \sum_{\tau \in \mathbb{Z}} g_t(\tau)\, u(t - \tau). \tag{1}$$

The signal $u(t)$ is assumed to be a train of impulses for voiced sounds, or
white noise in the case of unvoiced sounds.
The frequency response of the SPF is given by

$$G(z,\theta) = \frac{B(z,\theta)}{A(z,\theta)} = \frac{\sum_{l=0}^{n} b_l z^{-l}}{\sum_{l=0}^{m} a_l z^{-l}}, \tag{2}$$

where $n$ and $m$ denote the orders of the numerator and denominator,
respectively. The set of parameters $\theta$ denoted as

$$\theta = [b_0, b_1, \cdots, b_n, a_1, \cdots, a_m]^T \tag{3}$$

is tuned to fit a frequency response estimate $\hat{G}(\omega_k)$ of the SPF
at a discrete set of frequencies $\{\omega_k, k = 1, \cdots, K\}$. The
present work uses the method described in [5], which obtains
$\hat{G}(\omega_k)$ by interpolating spectral peaks found within
neighborhoods of the multiples of the pitch frequency.
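As a minimal sketch of how (2) and (3) fit together, the following evaluates the model response on the unit circle from a parameter vector. The function name `spf_response`, the example values, and the convention $a_0 = 1$ are assumptions made for illustration (the vector in (3) carries no $a_0$ entry); they are not taken from the paper.

```python
import cmath
import math


def spf_response(theta, n, omega):
    """Evaluate G(e^{j*omega}, theta) = B / A at one normalized frequency (rad/sample).

    theta = [b_0, ..., b_n, a_1, ..., a_m]; a_0 = 1 is an assumption of this sketch.
    """
    b = list(theta[: n + 1])
    a = [1.0] + list(theta[n + 1:])
    z_inv = cmath.exp(-1j * omega)  # z^{-1} on the unit circle
    num = sum(b_l * z_inv ** l for l, b_l in enumerate(b))
    den = sum(a_l * z_inv ** l for l, a_l in enumerate(a))
    return num / den


# Hypothetical first-order example: one zero at z = -1, one pole at z = 0.5.
theta = [1.0, 1.0, -0.5]
g_dc = spf_response(theta, n=1, omega=0.0)          # response at 0 rad/sample
g_nyquist = spf_response(theta, n=1, omega=math.pi)  # response at pi rad/sample
```

In this toy case the zero at $z = -1$ suppresses the response at the Nyquist frequency, which is the kind of anti-resonance an all-pole model cannot represent directly.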
978-1-4577-0539-7/11/$26.00 ©2011 IEEE, ICASSP 2011, p. 4820
Motivated by the fact that the human auditory system perceives the amplitude
of the frequency content of a sound signal on a logarithmic scale [6], the
coefficients are optimized by minimizing the following logarithmic criterion:

$$\theta = \arg\min_{\theta'} \sum_{k=1}^{K} \left| \log|\hat{G}(\omega_k)| - \log\left|\frac{B(e^{j\omega_k},\theta')}{A(e^{j\omega_k},\theta')}\right| \right|^2. \tag{4}$$
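The criterion (4) can be sketched numerically as follows. This is an illustration only: the function names, the frequency grid and the first-order parameter values are hypothetical, and $a_0 = 1$ is assumed as in the parameter vector (3).

```python
import cmath
import math


def model_magnitude(theta, n, omega):
    """|B/A| at normalized frequency omega, with a_0 = 1 assumed."""
    b = theta[: n + 1]
    a = [1.0] + list(theta[n + 1:])
    z_inv = cmath.exp(-1j * omega)
    num = sum(c * z_inv ** l for l, c in enumerate(b))
    den = sum(c * z_inv ** l for l, c in enumerate(a))
    return abs(num / den)


def log_criterion(theta, n, omegas, g_hat_mag):
    """Sum over k of (log|G_hat(w_k)| - log|B/A|)^2, following (4)."""
    return sum(
        (math.log(g) - math.log(model_magnitude(theta, n, w))) ** 2
        for w, g in zip(omegas, g_hat_mag)
    )


omegas = [0.1 * k for k in range(1, 31)]  # hypothetical grid inside (0, pi)
true_theta = [1.0, -0.9, -0.5]            # hypothetical first-order model
target = [model_magnitude(true_theta, 1, w) for w in omegas]

v_true = log_criterion(true_theta, 1, omegas, target)        # cost at the generating parameters
v_off = log_criterion([1.0, -0.7, -0.5], 1, omegas, target)  # cost with a displaced zero
```

Because the differences are taken between log magnitudes, a given ratio between estimate and model contributes the same cost in a spectral valley as at a spectral peak.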
The optimization problem (4) can be written as

$$\theta = \arg\min_{\theta'} V(\theta'), \tag{5}$$

$$V(\theta) = \sum_{k=1}^{K} [F(\theta)]_k^2, \tag{6}$$

where $[F(\theta)]_k$ denotes the k-th component of the real-valued vector
$F(\theta)$, which is a function of the d-dimensional real-valued vector
$\theta$. Then, (5)-(6) are equivalent to (4) if we define

$$[F(\theta)]_k = \log\left|\frac{\hat{G}(\omega_k)}{G(e^{j\omega_k},\theta)}\right|, \quad \text{for all } k = 1, \cdots, K. \tag{7}$$

Using Newton-like methods, (5)-(6) is solved using the following iterative
procedure

$$\theta_{i+1} = \theta_i - \alpha_i \tilde{\theta}_i, \tag{8}$$

where $\tilde{\theta}_i$ is the solution of

$$H_i \tilde{\theta}_i = g_i, \tag{9}$$

the scalar $\alpha_i$ denotes the step size at iteration $i$, the
d-dimensional vector $g_i$ denotes the gradient of $V(\theta)$ at
$\theta_i$, and the $d \times d$ matrix $H_i$ denotes either the Hessian of
$V(\theta)$ at $\theta_i$ or an approximation of it.
Let $J(\theta)$ denote the Jacobian of $F(\theta)$, i.e.,

$$[J(\theta)]_{k,l} = \frac{\partial [F(\theta)]_k}{\partial [\theta]_l}. \tag{10}$$

The gradient $g_i$ can be computed from the Jacobian information by

$$g_i = 2 J^T(\theta_i) F(\theta_i). \tag{11}$$
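The relation (11) can be checked numerically. The sketch below approximates the Jacobian (10) by central differences, which is an assumption of this example (the paper does not state how the Jacobian is obtained), and compares $2J^TF$ with a direct finite-difference gradient of $V$. All names and parameter values are hypothetical, and $a_0 = 1$ is assumed.

```python
import numpy as np


def log_model(theta, n, omegas):
    """log|G(e^{jw}, theta)| with a_0 = 1 assumed (theta holds no a_0)."""
    b = theta[: n + 1]
    a = np.concatenate(([1.0], theta[n + 1:]))
    z_inv = np.exp(-1j * omegas)
    num = np.polyval(b[::-1], z_inv)  # sum_l b_l z^{-l}
    den = np.polyval(a[::-1], z_inv)  # sum_l a_l z^{-l}
    return np.log(np.abs(num / den))


def residuals(theta, n, omegas, log_g_hat):
    """F(theta) as in (7): log|G_hat| - log|G|."""
    return log_g_hat - log_model(theta, n, omegas)


def value(theta, n, omegas, log_g_hat):
    """V(theta) as in (6)."""
    return float(np.sum(residuals(theta, n, omegas, log_g_hat) ** 2))


def numerical_jacobian(theta, n, omegas, log_g_hat, eps=1e-6):
    """Central-difference approximation of [J]_{k,l} = d[F]_k / d[theta]_l."""
    jac = np.empty((len(omegas), len(theta)))
    for l in range(len(theta)):
        step = np.zeros_like(theta)
        step[l] = eps
        jac[:, l] = (residuals(theta + step, n, omegas, log_g_hat)
                     - residuals(theta - step, n, omegas, log_g_hat)) / (2 * eps)
    return jac


omegas = np.linspace(0.1, 3.0, 40)                              # hypothetical grid
log_g_hat = log_model(np.array([1.0, -0.9, -0.5]), 1, omegas)   # synthetic target
theta = np.array([1.0, -0.7, -0.4])                             # current iterate

F = residuals(theta, 1, omegas, log_g_hat)
J = numerical_jacobian(theta, 1, omegas, log_g_hat)
grad = 2.0 * J.T @ F                                            # equation (11)

eps = 1e-6
grad_fd = np.array([
    (value(theta + eps * e, 1, omegas, log_g_hat)
     - value(theta - eps * e, 1, omegas, log_g_hat)) / (2 * eps)
    for e in np.eye(len(theta))
])
```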
We use the Broyden-Fletcher-Goldfarb-Shanno (BFGS) formula [7], an iterative
procedure that directly approximates $H_i^{-1}$:

$$H_{i+1}^{-1} = H_i^{-1} + \left(1 + \frac{q_i^T H_i^{-1} q_i}{s_i^T q_i}\right) \frac{s_i s_i^T}{s_i^T q_i} - \frac{s_i q_i^T H_i^{-1} + H_i^{-1} q_i s_i^T}{s_i^T q_i},$$

$$s_i = \theta_{i+1} - \theta_i, \qquad q_i = g_{i+1} - g_i.$$
The step-size parameter $\alpha_i$ is obtained from a line search algorithm
using a sub-iterative procedure (i.e., formed of sub-iterations of the main
iterations (8)-(9)) in which, starting from the initial value
$\alpha_i = 1$, the value of $\alpha_i$ is halved at each sub-iteration until

$$V(\theta_i - \alpha_i \tilde{\theta}_i) < V(\theta_i),$$

or a maximum number of iterations is reached.
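The iteration (8)-(9) with the BFGS inverse update and the step-halving search can be sketched as below. This is a generic illustration run on a hypothetical two-dimensional quadratic cost, not on the log-spectral criterion; the identity matrix as initial $H^{-1}$, the iteration limits and the guard on $s_i^T q_i$ are assumptions of this sketch.

```python
import numpy as np


def bfgs_inverse_update(h_inv, s, q):
    """Return H_{i+1}^{-1} from H_i^{-1}, s = theta_{i+1} - theta_i, q = g_{i+1} - g_i."""
    sq = s @ q
    hq = h_inv @ q   # H^{-1} q
    qh = q @ h_inv   # q^T H^{-1}
    return (h_inv
            + (1.0 + (q @ hq) / sq) * np.outer(s, s) / sq
            - (np.outer(s, qh) + np.outer(hq, s)) / sq)


def halving_step(value, theta, direction, max_halvings=30):
    """Start at alpha = 1 and halve until V(theta - alpha * direction) < V(theta)."""
    alpha, v0 = 1.0, value(theta)
    for _ in range(max_halvings):
        if value(theta - alpha * direction) < v0:
            break
        alpha *= 0.5
    return alpha


def minimize(value, gradient, theta, iterations=50):
    h_inv = np.eye(len(theta))       # assumed initial inverse-Hessian approximation
    g = gradient(theta)
    for _ in range(iterations):
        direction = h_inv @ g        # solves H_i * direction = g_i, as in (9)
        alpha = halving_step(value, theta, direction)
        theta_new = theta - alpha * direction   # update (8)
        g_new = gradient(theta_new)
        s, q = theta_new - theta, g_new - g
        theta, g = theta_new, g_new
        if abs(s @ q) < 1e-12:       # guard: update undefined, stop iterating
            break
        h_inv = bfgs_inverse_update(h_inv, s, q)
    return theta


# Hypothetical test problem: V(theta) = sum_j c_j * (theta_j - t_j)^2.
target = np.array([1.0, -2.0])
scales = np.array([1.0, 10.0])


def quad_value(th):
    return float(np.sum(scales * (th - target) ** 2))


def quad_gradient(th):
    return 2.0 * scales * (th - target)


theta_hat = minimize(quad_value, quad_gradient, np.zeros(2))

# The update satisfies the secant condition H_{i+1}^{-1} q_i = s_i.
s_demo, q_demo = np.array([0.125, -2.5]), np.array([0.25, -50.0])
secant_lhs = bfgs_inverse_update(np.eye(2), s_demo, q_demo) @ q_demo
```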
[Figure 1: x-axis Time [sec] (0.01 to 0.06), y-axis Frequency [Hz] (0 to 4000).]
Fig. 1. Formant (x) and anti-formant (o) measurements over
the length of a /n/ consonant
The formant and anti-formant measurements are obtained from the roots of the
denominator and numerator polynomials in the z-plane, sorted in ascending
order. The signal is divided into frames using a 40 ms Hann window with 95%
overlap. An order of 11 is selected for both numerator and denominator; this
order was determined based on a subset of the data. The set of coefficients
is first initialized by a weighted linear least-squares algorithm [8] and
then optimized by the proposed method. The procedure does not employ any
tracking algorithm, i.e. it does not consider the temporal inter-relation of
the estimated poles and zeros, and it imposes no continuity conditions on
the obtained values. Fig. 1 shows an example of formant and anti-formant
measurements of an /n/ sound.
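One way to read frequencies off the polynomial roots is sketched below. The sampling rate of 8000 Hz, the helper names and the synthetic pole and zero positions are assumptions for illustration; dropping real-valued roots and keeping one member of each conjugate pair is also a choice made for this sketch rather than a detail stated in the paper.

```python
import numpy as np


def root_frequencies(coeffs, fs):
    """Sorted frequencies (Hz) of the complex roots of c_0 + c_1 z^{-1} + ... + c_N z^{-N}.

    Only roots with positive imaginary part are kept, i.e. one per conjugate
    pair; real roots are discarded (an assumption of this sketch).
    """
    roots = np.roots(coeffs)               # roots in z of c_0 z^N + ... + c_N
    roots = roots[np.imag(roots) > 0]
    return np.sort(np.angle(roots) * fs / (2.0 * np.pi))


fs = 8000.0  # hypothetical sampling rate


def conjugate_pair(freq_hz, radius):
    root = radius * np.exp(2j * np.pi * freq_hz / fs)
    return [root, np.conj(root)]


# Synthetic model: poles near 500 Hz and 1500 Hz, a zero pair near 1000 Hz.
den = np.real(np.poly(conjugate_pair(500.0, 0.97) + conjugate_pair(1500.0, 0.95)))
num = np.real(np.poly(conjugate_pair(1000.0, 0.98)))

formants = root_frequencies(den, fs)       # from the denominator A(z)
anti_formants = root_frequencies(num, fs)  # from the numerator B(z)
```

A per-frame feature vector in the sense of Section 3 would then take the first three entries of each sorted array.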
3. SPEAKER VERIFICATION SYSTEM
The automatic speaker verification system used in this study
is based on the GMM-UBM approach [9] and extends previ-
ous work in [3, 10] where it was applied to formant center
frequencies and bandwidths. Feature vectors consist of the
first three formants and anti-formants, as determined by the
pole-zero model, yielding 6 features per frame. Speakers are
modeled by a Gaussian mixture model (GMM) denoted by
$$\lambda := (p_i, \mu_i, \Sigma_i)_{i=1,\ldots,M}, \tag{12}$$

where $p_i$, $\mu_i$ and $\Sigma_i$ represent the mixture weights, means
and covariance matrices. The universal background model
(UBM) is created by training a GMM from pooled feature
vectors of different speakers using a maximum likelihood criterion, which is
solved using the Expectation-Maximization (EM) algorithm. Speaker models are
derived from the UBM using maximum a-posteriori (MAP) adaption. This was
found to provide better results than the original approach in [3, 10].
Eight mixture components are used, in accordance with [10]. Full covariance
matrices are used in order to properly model within-speaker variability. The
likelihood of a set of feature vectors X given a GMM λ is calculated by
$$P(X|\lambda) = \prod_{i=1}^{n} f(x_i|\lambda), \tag{13}$$

where $f(x_i|\lambda)$ is the Gaussian mixture density function for the
specified model $\lambda$. In each speaker comparison the likelihood ratio
(LR) is computed for a set of test feature vectors X and the models of a
speaker and the UBM:

$$\mathrm{LR} = \frac{P(X|\lambda_{\mathrm{speaker}})}{P(X|\lambda_{\mathrm{UBM}})}. \tag{14}$$
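The likelihood (13) and the ratio (14) are usually evaluated in the log domain to avoid numerical underflow. The sketch below does this for full-covariance mixtures, assuming the per-frame densities combine as a product (a sum of logs). The function name and the single-component toy models are hypothetical and much smaller than the 8-component, 6-dimensional models described in the text.

```python
import numpy as np


def gmm_log_likelihood(x, weights, means, covs):
    """log P(X | lambda) = sum_i log f(x_i | lambda) for a full-covariance GMM."""
    x = np.atleast_2d(x)
    d = x.shape[1]
    log_terms = []
    for p, mu, sigma in zip(weights, means, covs):
        diff = x - mu
        solved = np.linalg.solve(sigma, diff.T).T   # Sigma^{-1} (x_i - mu) per frame
        maha = np.sum(diff * solved, axis=1)        # squared Mahalanobis distances
        _, logdet = np.linalg.slogdet(sigma)
        log_terms.append(np.log(p) - 0.5 * (d * np.log(2.0 * np.pi) + logdet + maha))
    log_terms = np.array(log_terms)                 # shape (M, n)
    peak = log_terms.max(axis=0)
    per_frame = peak + np.log(np.exp(log_terms - peak).sum(axis=0))  # log-sum-exp over components
    return float(per_frame.sum())


# Hypothetical one-component models in two dimensions.
speaker = ([1.0], [np.array([1.0, 1.0])], [np.eye(2)])
ubm = ([1.0], [np.array([0.0, 0.0])], [np.eye(2)])

X = np.array([[1.0, 1.0], [1.0, 0.0]])  # two test feature vectors
log_lr = gmm_log_likelihood(X, *speaker) - gmm_log_likelihood(X, *ubm)
```

A positive `log_lr` means the test vectors are better explained by the speaker model than by the UBM.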
The raw score (14) usually does not represent a proper LR, which requires
that same-speaker comparisons report high LR values while different-speaker
comparisons report low values, with values close to one offering no support
to either of the two hypotheses. Therefore, an automatic calibration
procedure based on logistic regression is applied to the scores using the
methods provided by the FoCaL toolkit [11]. The parameters are estimated in
a cross-validation setting, using the scores of all speakers except those
involved in the current trial.
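A rough sketch of score calibration by logistic regression follows. It fits an affine map from raw log scores to calibrated log LRs by plain gradient descent on a class-balanced logistic loss. This is not the FoCaL toolkit's interface; the function names, learning rate, step count and scores are all hypothetical, and the leave-out cross-validation described above is omitted.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid(x):
    return 0.5 * (1.0 + math.tanh(0.5 * x))


def calibration_loss(a, b, tar, non):
    """Balanced logistic loss of the calibrated scores a * s + b."""
    loss_tar = sum(softplus(-(a * s + b)) for s in tar) / len(tar)
    loss_non = sum(softplus(a * s + b) for s in non) / len(non)
    return 0.5 * (loss_tar + loss_non)


def fit_calibration(tar, non, steps=2000, lr=0.05):
    """Gradient descent on calibration_loss, starting from the identity map."""
    a, b = 1.0, 0.0
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s in tar:
            w = sigmoid(-(a * s + b)) / (2 * len(tar))
            grad_a -= w * s
            grad_b -= w
        for s in non:
            w = sigmoid(a * s + b) / (2 * len(non))
            grad_a += w * s
            grad_b += w
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b


# Hypothetical raw log scores for same-speaker and different-speaker trials.
target_scores = [4.0, 2.5, 3.0, -0.5, 5.0]
non_target_scores = [-3.0, -1.0, 0.5, -4.0, 3.5]

a, b = fit_calibration(target_scores, non_target_scores)
loss_before = calibration_loss(1.0, 0.0, target_scores, non_target_scores)
loss_after = calibration_loss(a, b, target_scores, non_target_scores)
calibrated = [a * s + b for s in target_scores + non_target_scores]
```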
4. EVALUATION
Three systems are compared: the proposed method; a baseline GMM-UBM speaker
verification system using basic MFCC features, described in Section 4.1;
and an approach using formant center frequencies and bandwidths extracted
with the Snack Toolkit (footnote 1), subsequently denoted as Formants/BW.
The latter approach is akin to [3], but uses MAP adaption to obtain speaker
models.
All three systems are applied to the same /n/ consonant data, which is
described in Section 4.2. The configurations of the systems and features are
chosen as examples for this data, in line with previous work on speaker
verification [3, 10, 9]. Their optimization will be dealt with in future
work.
The equal error rate (EER) and the log likelihood-ratio cost (Cllr) metric
[11] are used as performance measures. Detection error trade-off (DET) plots
are used to show the trade-off between type I and II errors when the
decision threshold is varied over the LR range. Tippett plots characterize
the cumulative proportion of LRs from target trials less than or equal to
the value indicated on the abscissa and of non-target trials greater than or
equal to the value on the abscissa.
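Both measures can be computed from lists of log LRs, as in the sketch below. The Cllr expression follows the usual definition from [11] (mean of log2(1 + 1/LR) over target trials and of log2(1 + LR) over non-target trials, averaged). The EER routine uses a simple threshold sweep, which is only one of several possible conventions, and the example scores are hypothetical.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def cllr(target_llrs, non_target_llrs):
    """Log-likelihood-ratio cost in bits, from natural-log LRs."""
    tar = sum(softplus(-s) for s in target_llrs) / len(target_llrs)
    non = sum(softplus(s) for s in non_target_llrs) / len(non_target_llrs)
    return 0.5 * (tar + non) / math.log(2.0)


def equal_error_rate(target_scores, non_target_scores):
    """Sweep thresholds over observed scores; report the point where the miss
    rate and false alarm rate are closest (mean of the two rates)."""
    best_gap, eer = float("inf"), None
    for t in sorted(set(target_scores) | set(non_target_scores)):
        miss = sum(s < t for s in target_scores) / len(target_scores)
        false_alarm = sum(s >= t for s in non_target_scores) / len(non_target_scores)
        gap = abs(miss - false_alarm)
        if gap < best_gap:
            best_gap, eer = gap, 0.5 * (miss + false_alarm)
    return eer


# Hypothetical log LRs with one error of each type.
target_llrs = [2.0, 3.0, -1.0, 4.0]
non_target_llrs = [-2.0, -3.0, 1.0, -4.0]

eer = equal_error_rate(target_llrs, non_target_llrs)
cost = cllr(target_llrs, non_target_llrs)
uninformative = cllr([0.0, 0.0], [0.0, 0.0])  # LR = 1 for every trial
```

A system that always reports LR = 1 has a Cllr of exactly 1 bit, so values well below 1 indicate informative, reasonably calibrated LRs.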
Footnote 1: http://www.speech.kth.se/snack/
4.1. Baseline system description
A GMM-UBM system using Mel Frequency Cepstral Coefficients (MFCCs) [9] is
used as the baseline for comparing speaker verification performance. Feature
vectors of 13 MFCCs are computed every 10 ms using a 20 ms Hamming window.
After extraction, cepstral mean reduction (CMR) is applied to the feature
vectors. The system is based on Gaussian mixture models with 1024 mixture
components and diagonal covariance matrices (footnote 2). Models of
individual speakers are obtained through MAP adaption from the UBM. No
further score normalizations such as the T-norm are applied.
4.2. Database
The evaluations in this study are based on nasal /n/ consonants in
recordings of 106 male adult German speakers, which were selected from the
Pool2010 corpus [12]. To obtain a sufficient number of items, an automatic
phone-level alignment [13] was performed on recordings of the German version
of "The North Wind and the Sun" read by the speakers in one studio session.
Subsequently, auditory validation of the segments was performed to check for
possible alignment errors.
30 speakers were used for UBM training. This number
was chosen based on the results in [10]. The data of the re-
maining 76 speakers was split into two equally-sized train and
test datasets of about 25 /n/ segments with a median duration
of 60ms, allowing for 76 target and 5700 non-target trials.
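The trial counts follow from the speaker split: each of the 76 test speakers is compared once with their own model and once with each of the other 75 models. A short check, with variable names chosen for illustration:

```python
total_speakers = 106
ubm_speakers = 30
test_speakers = total_speakers - ubm_speakers        # speakers with train and test data

target_trials = test_speakers                        # same-speaker comparisons
non_target_trials = test_speakers * (test_speakers - 1)  # ordered different-speaker pairs
```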
5. RESULTS AND DISCUSSION
Table 1 provides the EER and Cllr values of the different systems. The
proposed method provides discrimination performance in terms of EER equal to
the MFCC based system and outperforms the formant features. In terms of
Cllr, it incurs a slightly higher cost than the baseline system and a lower
cost than the formant-based system. In the DET plot in Fig. 2 the proposed
method displays characteristics similar to those of the MFCC system except
for thresholds minimizing the false alarm rate. The Tippett plot, which is
of interest in the context of forensics, is shown in Fig. 3.

Table 1. EER and Cllr results

Features           EER     Cllr
proposed method    3.9%    0.1325
Formants/BW        5.3%    0.2226
MFCC               3.9%    0.1296
6. CONCLUSIONS
In this paper, a new set of features consisting of formant and anti-formant
measurements obtained from a logarithmic based pole-zero model estimate of
the speech production filter [4] is applied to the task of speaker
verification. These features are advantageous due to their more
straightforward interpretation. The features were extracted in an
unsupervised procedure and subsequently used in a GMM-UBM speaker comparison
approach. In an evaluation based on nasal /n/ consonants, this set of
features achieves performance values comparable to an MFCC based approach
and outperforms an approach based on formant frequencies and bandwidths.

Footnote 2: A similar configuration was used in [9] for single-gender UBMs.

[Figure 2: x-axis Percent False Alarm Probability, y-axis Percent Miss Probability (both 0.1 to 40); curves: proposed method, MFCC, Formants/BW.]
Fig. 2. DET plot of the compared systems
Further tests are needed to evaluate the method on non-contemporaneous
speech as well as its susceptibility to channel mismatch, such as
transmission over telephone using speech codecs, and to differences in
speaking style and duration, which is especially important for forensic
applications [14]. Further improvements of the proposed method could be
achieved by using a perceptual frequency scale, as applied in Perceptual
Linear Prediction (PLP), and by adding derivatives of the features, as
commonly performed on MFCCs in speaker verification systems. Furthermore,
the amount of complementary information to MFCC/PLP features and the order
of improvement achievable through fusion techniques need to be investigated.
[Figure 3: x-axis Log Likelihood Ratio (−30 to 30), y-axis Cumulative proportion (0.0 to 1.0); curves: proposed method, MFCC, Formants/BW.]
Fig. 3. Tippett plot of the compared systems

7. REFERENCES
[1] P. Rose, Forensic Speaker Identification, Taylor & Francis, 2002.
[2] B. S. Atal and S. L. Hanauer, “Speech analysis and synthesis by linear prediction of the speech wave,” J. Acoust. Soc. Amer., vol. 50, no. 2B, pp. 637–655, 1971.
[3] T. Becker, M. Jessen, and C. Grigoras, “Forensic Speaker Verification Using Formant Features and Gaussian Mixture Models,” in Proc. Interspeech, Brisbane, 2008, pp. 1505–1508.
[4] D. Marelli and P. Balazs, “On pole-zero model estimation methods minimizing a logarithmic criterion for speech analysis,” IEEE Trans. Audio Speech Lang. Process., vol. 18, no. 2, pp. 237–248, Feb. 2010.
[5] H. Hermansky, H. Fujisaki, and Y. Sato, “Spectral envelope sampling and interpolation in linear predictive analysis of speech,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Mar. 1984, vol. 9, pp. 53–56.
[6] W. Hartmann, Signals, Sounds and Sensation, Springer, New York, 1998.
[7] R. Fletcher, Practical Methods of Optimization, ser. A Wiley-Interscience Publication. Wiley, Chichester, U.K., 2nd ed., 1987.
[8] T. Kobayashi and S. Imai, “Design of IIR digital filters with arbitrary log magnitude function by WLS techniques,” IEEE Trans. Acoust., Speech, Signal Process., vol. 38, no. 2, pp. 247–252, Feb. 1990.
[9] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, “Speaker verification using adapted Gaussian mixture models,” Digital Signal Process., vol. 10, pp. 19–41, 2000.
[10] T. Becker, M. Jessen, and C. Grigoras, “Speaker verification based on formants using Gaussian mixture models,” in Proc. NAG/DAGA, Rotterdam, 2009.
[11] N. Brümmer and J. du Preez, “Application-independent evaluation of speaker detection,” Comput. Speech Lang., vol. 20, pp. 230–275, 2006.
[12] M. Jessen, O. Köster, and S. Gfroerer, “Influence of vocal effort on average and variability of fundamental frequency,” Int. J. Speech, Language, and the Law, vol. 12, no. 2, pp. 174–213, 2005.
[13] S. Rapp, “Automatic phonemic transcription and linguistic annotation from known text with hidden Markov models,” in Proc. ELSNET Goes East and IMACS Workshop, Moscow, 1995.
[14] J. P. Campbell, W. Shen, W. M. Campbell, R. Schwartz, J. F. Bonastre, and D. Matrouf, “Forensic speaker recognition: A need for caution,” IEEE Signal Processing Magazine, Special Issue on Digital Forensics, vol. 26, no. 2, pp. 95–103, 2009.