VTLN Warping Factor Estimation Using Accumulation of Sufficient Statistics
ABSTRACT
In this paper we present an efficient and flexible approach to VTLN warping factor estimation. Due to the equivalence of frequency warping and linear transformation of cepstral coefficients, warping factors can be efficiently estimated by accumulating the sufficient statistics for linear transformation estimation, and searching the constrained space of transformations given by the explicit mapping between warping factors and linear transformation matrices. We show that the positive effect of using a properly normalized optimization criterion for warping factor estimation, which has been previously demonstrated for a signal analysis front-end without a filter-bank, carries over to an MFCC front-end, resulting in a net improvement in word error rate.
Jonas Lööf and Hermann Ney
Lehrstuhl für Informatik VI, Comp. Sci. Dept.
RWTH Aachen Univ, 52056 Aachen, Germany
{loof, ney}@cs.rwth-aachen.de
Srinivasan Umesh
Department of Electrical Engineering
Indian Institute of Technology, Kanpur, India
sumesh@iitk.ac.in
1. INTRODUCTION
Vocal tract length normalization (VTLN) [1] is an important method to compensate for inter-speaker variation in speaker-independent automatic speech recognition (ASR). To achieve this, the frequency axis is warped using a parameterized invertible function, and the parameter, or warping factor, is optimized for each speaker.
The equivalence of VTLN and linear transformation for a general frequency warping was demonstrated in [2]. This work was later refined in [3] to explicitly take into account the frequency-discrete nature of the ASR signal processing front-end. We briefly review the refined method.
This work was partially funded by the European Union under the integrated project TC-STAR - Technology and Corpora for Speech to Speech Translation - (IST-2002-FP6-506738, http://www.tc-star.org).

As described in [3], if the spectrum is quefrency limited, samples of the warped spectrum can be obtained exactly from the unwarped spectrum. For plain cepstral coefficients (without filter-bank smoothing and discrete cosine transform (DCT)) the cepstrum is computed from the spectrum using

$$C_k = \frac{1}{N} \sum_{q=0}^{N-1} \log|X[q]|^2 \, e^{+j\frac{2\pi}{N}qk}, \tag{1}$$

where $C_k$ are the cepstral coefficients and $X[q]$ is the spectrum. Since $C_k$ and $\log|X[q]|^2$ form a discrete Fourier transform pair, we can recover the second from the first. $X[q]$ itself cannot be recovered, because of the magnitude operation. We are interested only in the warped log-magnitude spectrum $\log|\tilde{X}[q]|^2$, however, and $\log|\tilde{X}[q]|^2$ can be reconstructed exactly from $\log|X[q]|^2$ if $C_k$ is quefrency limited and unaliased. If this condition holds, the warped spectrum can be computed directly by

$$\log|\tilde{X}[l]|^2 = \log|\tilde{X}(\tilde{\omega}_l)|^2 = \log|X(g^{-1}(\omega_l))|^2 = \sum_{k=0}^{N-1} C_k \, e^{-j\frac{2\pi}{N} g^{-1}(\omega_l) k}. \tag{2}$$

By substituting (2) into (1), a linear transformation between $C_k$ and the warped cepstral coefficients $\tilde{C}_n$ is reached, i.e.

$$\tilde{C}_n = \sum_{k=0}^{N-1} C_k \, \frac{1}{N} \sum_{l=0}^{N-1} e^{-j\frac{2\pi}{N} g^{-1}(\omega_l) k} \, e^{+j\frac{2\pi}{N} l n} = \sum_{k=0}^{N-1} W_{nk} C_k. \tag{3}$$
To use the above relation with a typical ASR system using DCT-based cepstral coefficients derived from filter-bank smoothed spectra, a relation between plain and DCT cepstra can be derived. The DCT-based cepstral coefficients are given by

$$d_k = \sum_{q=0}^{M-1} \log|X_{FB}[q]|^2 \cos\frac{k\pi(q + 1/2)}{N}. \tag{4}$$

Similarly, the plain cepstra of the filter-bank output are given by

$$C_k = \frac{1}{2(M-1)} \sum_{q=0}^{M-1} b_q \log|X_{FB}[q]|^2 \cos\frac{qk\pi}{M-1}, \tag{5}$$
where $b_q = 1$ for $q = 0$ or $q = M - 1$ and $b_q = 2$ otherwise. From these two equations a linear transformation between DCT and plain cepstra can be derived. Furthermore, combining this transformation and its inverse with (3), a linear transform between warped and unwarped DCT cepstra is reached. It should be noted that the above holds for any invertible warping function.
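The construction of the matrix $W_{nk}$ in (3) can be illustrated with a small numerical sketch. It is a literal transcription of (3), not the front-end implementation of [3]. The function names and the convention that `inv_warp(l)` returns $g^{-1}(\omega_l)$ in units of bin index are assumptions made for illustration. The entries come out complex in general; exploiting the symmetry of the log spectrum to obtain a real-valued transform is not shown.

```python
import cmath
import math


def warp_matrix(n_bins, inv_warp):
    """Build W[n][k] as in (3).

    inv_warp(l) is assumed to return the (possibly fractional) source bin
    that warped bin l is read from, i.e. g^{-1}(omega_l) in bin units.
    """
    scale = 2.0 * math.pi / n_bins
    matrix = []
    for n in range(n_bins):
        row = []
        for k in range(n_bins):
            acc = 0j
            for l in range(n_bins):
                acc += (cmath.exp(-1j * scale * inv_warp(l) * k)
                        * cmath.exp(1j * scale * l * n))
            row.append(acc / n_bins)
        matrix.append(row)
    return matrix


def apply_warp(matrix, cepstrum):
    """Warped cepstrum as one matrix-vector product."""
    return [sum(w * c for w, c in zip(row, cepstrum)) for row in matrix]


# The identity warp must reproduce the identity matrix.
identity = warp_matrix(8, lambda l: float(l))
# A hypothetical linear warp; any invertible function can be passed instead.
warped = warp_matrix(8, lambda l: 0.9 * l)
```

Because the matrix depends only on the warping function, it can be precomputed once per candidate warping factor and reused for every frame and speaker.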
1-4244-0469-X/06/$20.00 ©2006 IEEE. ICASSP 2006
Using these results, warping factor estimation can be seen as a constrained linear transform estimation, where the constraint is given by the mapping between warping factors and transformation matrices described above. It is sufficient to accumulate, for each speaker, the sufficient statistics needed for estimating linear transformations, and to perform the constrained optimization off-line.
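As presented in [4], the auxiliary function for estimating a linear feature transform using the expectation maximization (EM) algorithm is given by

$$Q(M, \hat{M}) = \beta \log|W|^2 - \frac{1}{2} \sum_{d=1}^{D} \left( w_d G^{(d)} w_d^T - 2 w_d k^{(d)T} \right), \tag{6}$$

where terms constant with respect to the transform $W$ have been omitted and $w_d$ denotes the $d$-th row of $W$. The sufficient statistics, $G$ and $k$, are given by

$$G^{(d)} = \sum_{s=1}^{S} \frac{1}{\sigma_d^{(s)2}} \sum_{t=1}^{T} \gamma_s(t) \, x_t x_t^T \tag{7}$$

$$k^{(d)} = \sum_{s=1}^{S} \frac{1}{\sigma_d^{(s)2}} \, \mu_d^{(s)} \sum_{t=1}^{T} \gamma_s(t) \, x_t^T, \tag{8}$$

and $\beta$ is $\gamma_s(t)$ accumulated over $s$ and $t$.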
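The accumulation in (7) and (8) and the evaluation of (6) can be sketched as follows. This is a minimal illustration, assuming diagonal Gaussians indexed by state `s`, posteriors `posteriors[t][s]` already computed, and a caller-supplied value of $\log|W|^2$. The function names and the toy data are hypothetical and are not taken from the system described here.

```python
def accumulate(frames, posteriors, means, variances):
    """Accumulate G^(d), k^(d) and beta as in (7) and (8).

    frames[t]        : untransformed feature vector x_t
    posteriors[t][s] : gamma_s(t)
    means[s][d], variances[s][d] : diagonal Gaussian parameters
    """
    d_in = len(frames[0])
    d_out = len(means[0])
    G = [[[0.0] * d_in for _ in range(d_in)] for _ in range(d_out)]
    k = [[0.0] * d_in for _ in range(d_out)]
    beta = 0.0
    for x, gammas in zip(frames, posteriors):
        for s, gamma in enumerate(gammas):
            beta += gamma
            for d in range(d_out):
                weight = gamma / variances[s][d]
                for i in range(d_in):
                    k[d][i] += weight * means[s][d] * x[i]
                    for j in range(d_in):
                        G[d][i][j] += weight * x[i] * x[j]
    return G, k, beta


def auxiliary(W, G, k, beta, log_det_sq):
    """Evaluate (6) for a candidate W; log_det_sq is log|W|^2."""
    total = beta * log_det_sq
    for d, w in enumerate(W):
        size = len(w)
        quad = sum(w[i] * G[d][i][j] * w[j]
                   for i in range(size) for j in range(size))
        lin = sum(w[i] * k[d][i] for i in range(size))
        total -= 0.5 * (quad - 2.0 * lin)
    return total


# Toy data: one frame, one Gaussian whose mean equals the frame.
G, k, beta = accumulate([[1.0, 2.0]], [[1.0]], [[1.0, 2.0]], [[1.0, 1.0]])
q_identity = auxiliary([[1.0, 0.0], [0.0, 1.0]], G, k, beta, log_det_sq=0.0)
```

Once `G`, `k` and `beta` are stored, any number of candidate transforms can be scored without touching the audio again, which is the property the method relies on.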
VTLN warping factors are usually estimated using grid search by directly evaluating the acoustic scores when aligning a reference transcription with the speaker independent (SI) acoustic model. As has been pointed out earlier [2], this fails to take into account the Jacobian determinant of the warping transformation, and thus fails to properly normalize the model distributions. Although previous studies [2] showed only small performance gains from using the Jacobian, the closer resemblance of the current method to standard MFCC analysis motivates us to repeat this experiment.
2. ACCUMULATOR BASED WARPING FACTOR ESTIMATION
In this section we describe our approach to VTLN warping factor estimation. The starting point is the signal analysis front-end as presented in [3], an MFCC-based front-end with certain modifications to ensure that the resulting cepstrum is quefrency limited, which guarantees the equivalence between frequency warping and linear transformation. These changes are briefly described below.
Instead of integrating the Mel-warping into the filter-bank as is usually done, a uniformly spaced (in Hz) filter-bank is used, and the Mel-warping is included in the warping transform. In order to still get the same amount of smoothing as for normal MFCC, the filter width is constant in Mels (the same as for MFCC). To further ensure quefrency limitedness, the number of cepstral coefficients, and hence the number of filters, had to be increased. In total 129 filters were used, making sure that the cepstral coefficients decay to zero. For the output side of the transform only the first 16 cepstral coefficients (the same number as in our baseline system) were used, making the warping transform a projecting transform.
2.1. Interaction with Subsequent Signal Analysis Steps
To be able to accumulate the sufficient statistics for a linear transform, we require the transform to be the last step before calculating the likelihood. In our system, the warping transform is followed by cepstral mean normalization and dynamic feature generation (derivatives or LDA), which we wish to either combine with the warp transform, or move before it.
When doing cepstral mean normalization, the mean (computed over a window) of each cepstral coefficient is subtracted from the coefficient. For the warped and normalized cepstral coefficients a simple calculation, $c_w - \bar{c}_w = Wc - \overline{Wc} = W(c - \bar{c})$, shows that it is possible to change the order of cepstral mean normalization and linear transformation.
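The exchange of cepstral mean normalization and linear transformation can be checked numerically. The following sketch assumes that the mean is taken over the same window in both orderings (here, the whole toy sequence). The matrix and the frames are arbitrary illustrative values.

```python
def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def subtract_mean(frames):
    """Cepstral mean normalization over all given frames (one window)."""
    dim = len(frames[0])
    mean = [sum(f[i] for f in frames) / len(frames) for i in range(dim)]
    return [[f[i] - mean[i] for i in range(dim)] for f in frames]


# Toy projecting transform (3 -> 2) and three toy cepstral frames.
W = [[1.0, 0.5, 0.0], [0.0, 2.0, -1.0]]
frames = [[1.0, 2.0, 3.0], [0.0, -1.0, 4.0], [2.0, 2.0, 2.0]]

# Order 1: warp every frame, then normalize the warped frames.
warp_then_cmn = subtract_mean([matvec(W, c) for c in frames])
# Order 2: normalize the unwarped frames, then warp.
cmn_then_warp = [matvec(W, c) for c in subtract_mean(frames)]
```

Both orderings give the same frames up to rounding, so the normalization can be applied to the unwarped cepstra and the warp kept as the last step.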
If one wishes to use an LDA-based system for warping factor estimation, the following method can be used. The LDA step consists of splicing (in our case five) consecutive acoustic frames, followed by a projecting linear transform down to a lower number of output dimensions (in our case 45). The splicing can be moved before the warp transform; the warp transform will then be block diagonal, containing the original warping matrix once per spliced frame. Statistics are accumulated to optimize the transform from the spliced unwarped cepstra to the warped LDA-transformed ones. For each warping factor considered in the optimization, a block diagonal warping matrix is combined with the previously computed LDA transform and evaluated using the accumulated statistics.
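The combination of a block diagonal warping matrix with an LDA transform can be sketched with toy dimensions. Two spliced frames and a 4-to-2 LDA matrix are used here instead of the five frames and 45 output dimensions mentioned above. All matrices and helper names are illustrative assumptions.

```python
def matmul(a, b):
    """Matrix-matrix product for lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def matvec(a, v):
    """Matrix-vector product."""
    return [sum(x * y for x, y in zip(row, v)) for row in a]


def block_diag(block, copies):
    """Repeat `block` along the diagonal `copies` times."""
    rows, cols = len(block), len(block[0])
    out = [[0.0] * (cols * copies) for _ in range(rows * copies)]
    for c in range(copies):
        for i in range(rows):
            for j in range(cols):
                out[c * rows + i][c * cols + j] = block[i][j]
    return out


W = [[1.0, 0.5, 0.0], [0.0, 2.0, -1.0]]               # toy warp, 3 -> 2
lda = [[1.0, -1.0, 0.5, 0.0], [0.0, 2.0, 0.0, 1.0]]   # toy LDA, 4 -> 2
frames = [[1.0, 2.0, 3.0], [0.0, -1.0, 4.0]]

spliced = frames[0] + frames[1]                        # splice first
combined = matmul(lda, block_diag(W, copies=2))        # one 2 x 6 matrix
via_combined = matvec(combined, spliced)
# Reference: warp each frame, splice, then apply the LDA matrix.
direct = matvec(lda, matvec(W, frames[0]) + matvec(W, frames[1]))
```

The single matrix `combined` maps spliced unwarped cepstra directly to the final features, so it is the transform whose statistics need to be accumulated.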
When using a system with derivatives (regression features) it is possible to exchange the order of the warp transformation and the application of the derivatives, since a discrete derivative of any order commutes with matrix multiplication. After the exchange, the warping transform will be block diagonal, with one repeated copy for each order of dynamic features used.
2.2. Optimization Criteria
The maximum likelihood (ML) estimation of linear transformations, as described in [4], requires computing the Jacobian of the transform. The warping transforms we are considering are projections, making the plain ML estimation method unsuitable in unmodified form. One possibility is to simply ignore the Jacobian term in the ML calculation, using only the distance. This is numerically equivalent to the standard method of warping factor estimation; it will be called the naive criterion. Another possibility would be to use the heteroscedastic discriminant analysis (HDA) [5] criterion, which extends the transform to be non-projecting. The application of HDA to the current problem has not been studied in this work.
A further possibility would be to use a standard discriminative criterion such as maximum mutual information (MMI), but for unconstrained transformation estimation results show that this requires interpolation with an ML estimated matrix to be useful [6]. Although this is not likely to be a problem for warping factor estimation, since only one parameter is optimized, a simpler criterion was desired. One criterion that proved to be useful for optimizing parameters in the signal processing front-end [7] is a likelihood ratio criterion motivated in [7] as a simplification of the MMI criterion.
Starting with the MMI criterion, the competing model is exchanged with a single full covariance Gaussian model that is optimized on the same data as the transformation. Explicitly solving for the mean and covariance of the Gaussian and inserting into the equation leads to

$$g_{\mathrm{MMI}'} = T \log \Sigma' - P(\xi_1^T \mid M, w_1^N), \tag{9}$$

where $\xi$ denotes the features as given by the front-end, and $\Sigma'$ is the full covariance matrix of $\xi_1^T$. The resulting criterion is called the MMI' criterion. In [7] this criterion was used in a direct optimization framework, using multiple passes over the training data to compute the objective function and its derivative. On the other hand, the close formal similarity to the standard ML criterion makes it possible to use the EM algorithm by defining an auxiliary function, in exact correspondence to the ML case. The auxiliary function is given by

$$Q(M, \hat{M}) = \sum_{s=1}^{S} \sum_{t=1}^{T} \gamma_s(t) \log \Sigma' - P(\xi_t \mid s, M_s). \tag{10}$$

For the specific case of linear transform estimation the auxiliary function is given by

$$Q'(M, \hat{M}) = \beta \log|W \Sigma W^T| - \frac{1}{2} \sum_{d=1}^{D} \left( w_d G^{(d)} w_d^T - 2 w_d k^{(d)T} \right), \tag{11}$$

where terms constant with respect to the transform $W$ have been omitted. $\Sigma$ is the full covariance of the untransformed adaptation data, and $G$, $k$, and $\beta$ are defined as in Section 1. Using these equations, EM optimization can be carried out in exactly the same way as for non-projecting linear transforms, by iteratively optimizing $\gamma_s(t)$ using the forward-backward algorithm, and $W$ by accumulating the sufficient statistics and optimizing $Q$.
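The normalization term $\beta \log|W \Sigma W^T|$ of (11) remains well defined for a projecting $W$, because $W \Sigma W^T$ is square even when $W$ is not. A minimal sketch of its evaluation follows, assuming $W \Sigma W^T$ is symmetric positive definite so that a Cholesky factorization can be used. The helper names and toy values are illustrative only.

```python
import math


def matmul(a, b):
    """Matrix-matrix product for lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def log_det_spd(a):
    """Log-determinant of a symmetric positive definite matrix (Cholesky)."""
    n = len(a)
    lower = [[0.0] * n for _ in range(n)]
    log_det = 0.0
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(lower[i][m] * lower[j][m] for m in range(j))
            if i == j:
                lower[i][j] = math.sqrt(s)
                log_det += 2.0 * math.log(lower[i][j])
            else:
                lower[i][j] = s / lower[j][j]
    return log_det


def normalization_term(W, sigma, beta):
    """First term of (11): beta * log|W Sigma W^T|."""
    return beta * log_det_spd(matmul(matmul(W, sigma), transpose(W)))


# Toy projection (3 -> 2) and a toy diagonal data covariance.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
sigma = [[4.0, 0.0, 0.0], [0.0, 9.0, 0.0], [0.0, 0.0, 16.0]]
term = normalization_term(W, sigma, beta=2.0)
```

In a warping factor search this term would be added to the distance term computed from the accumulated `G` and `k` for each candidate matrix; since $\Sigma$ is fixed per speaker, only one small determinant per candidate is required.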
2.3. Implementation Considerations
Since the warping matrices are large, it is important to implement the accumulation in an efficient way. Using global accumulation of $G$ and $k$ (equations (7) and (8)) requires $O(D^2 \tilde{D})$ time per frame for accumulation ($D$ and $\tilde{D}$ are the untransformed and transformed feature dimensions). With one accumulator per covariance the time complexity decreases to $O(D^2)$ at the cost of increasing memory complexity from $O(D^2 \tilde{D})$ to $O(C D^2)$ ($C$ is the number of covariances). For the system used here this is not a problem, since only one globally pooled covariance was used. Even for a system without covariance tying the storage requirements should not be a problem, since warping factor estimation is typically done using a single Gaussian acoustic model.
Since the accumulator based approach differs significantly from standard VTLN warping factor estimation, the speed advantage is likely to be largely system dependent. In our system the accumulator based implementation uses one third of the time required for the standard warping factor estimation, the latter being a 21 point grid search. A further advantage of the approach is the possibility to refine the search precision without large increases in run-time.
3. EXPERIMENTAL RESULTS
All recognition experiments were performed on the TC-Star project EPPS corpus as used in the 2005 evaluation [8]. The training material includes 41 hours of manually transcribed recordings. The tests were performed on the 4 hour development set. The system was based on an MFCC front-end, and the models used consisted of roughly 200k Gaussians sharing a single globally pooled covariance. For VTLN a piecewise linear warping function was used, and warping factor estimation in training was performed as a grid search over the interval 0.8 to 1.2. In recognition the so-called fast VTLN [9] method was used, where a Gaussian mixture model is trained for each warping factor (on the training data), and the speech segments in recognition are assigned a warping factor by the models.
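A piecewise linear warping function and the grid search over 0.8 to 1.2 can be sketched as follows. The exact form of the warping function is not specified here, so the breakpoint at 0.8 of the Nyquist frequency is a hypothetical choice, as are the function names and the `score` callback. The 21 grid points match the grid size mentioned in Section 2.3.

```python
import math


def piecewise_linear_warp(omega, alpha, break_frac=0.8):
    """Warp omega in [0, pi]: slope alpha up to the breakpoint, then a
    straight line that ends at (pi, pi). break_frac is an assumed value."""
    omega0 = break_frac * math.pi
    if omega <= omega0:
        return alpha * omega
    slope = (math.pi - alpha * omega0) / (math.pi - omega0)
    return alpha * omega0 + slope * (omega - omega0)


def warp_grid(lo=0.8, hi=1.2, points=21):
    """Equally spaced candidate warping factors."""
    step = (hi - lo) / (points - 1)
    return [round(lo + i * step, 2) for i in range(points)]


def grid_search(score, grid):
    """Return the warping factor with the highest score.

    score(alpha) stands for whatever criterion is evaluated per factor,
    e.g. an auxiliary function computed from accumulated statistics."""
    return max(grid, key=score)


grid = warp_grid()
# Toy score with a known maximum, only to exercise the search.
best = grid_search(lambda alpha: -(alpha - 1.04) ** 2, grid)
```

With accumulated statistics, `score` costs a few matrix operations per factor, so `points` can be increased without another pass over the audio.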
To demonstrate the usefulness of our approach we compared the results of the accumulator based linear transformation implementation with the standard VTLN system. Since we had to use an increased number of filters in order to achieve quefrency limitedness, we also performed experiments on a modified standard VTLN system using the same number of filters (129) as for the linear transform case, to allow a fair comparison. Since we wanted to demonstrate the feasibility of the linear transformation approach itself, we used the naive optimization criterion for warping factor estimation, and we used the same search grid size for all experiments.
As seen in Table 1, the increase in the number of filters results in no performance gain, while the linear transformation (LT) approach outperforms the baseline.
Table 1. Recognition performance
System       MFCC   VTLN
Baseline     14.9   14.4
Extended     14.9   14.4
LT (Naive)   14.7   14.3
To investigate the effect of the optimization criterion, experiments were performed comparing MMI' with the naive criterion. Since accumulator based warping factor estimation was used, a more refined search method could have been used. However, to facilitate the use of fast VTLN in recognition, grid search was still used. The grid search resolution was increased, though, since this only has a small effect on the run-time in the accumulation case.
Table 2. Optimization criteria
Criterion   WER
Naive       14.3
MMI'        14.2
As can be seen in Table 2, the MMI' criterion performs slightly better. A probability of improvement of 76% for MMI' over the naive criterion was estimated using a bootstrap estimate. Similar improvements were reported in [2], but a front-end without a filter-bank was used. The current result is the first showing an improvement over an optimized baseline of a VTLN-MFCC system when using a criterion that takes normalization into account.
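A bootstrap estimate of the probability of improvement can be sketched in a few lines. The resampling unit, the number of replicates, and the error counts below are illustrative assumptions; the procedure actually used for the 76% figure is not detailed here. Both systems are scored on the same segments, so comparing summed error counts is equivalent to comparing word error rates.

```python
import random


def prob_of_improvement(errors_a, errors_b, n_boot=1000, seed=0):
    """Fraction of bootstrap replicates in which system B makes fewer
    errors than system A. errors_x[i] is the error count on segment i."""
    rng = random.Random(seed)
    n = len(errors_a)
    wins = 0
    for _ in range(n_boot):
        # Resample segments with replacement, same draw for both systems.
        sample = [rng.randrange(n) for _ in range(n)]
        diff = sum(errors_b[i] - errors_a[i] for i in sample)
        if diff < 0:
            wins += 1
    return wins / n_boot


# Made-up per-segment error counts, for illustration only.
always_better = prob_of_improvement([3, 2, 4], [2, 1, 3])
no_difference = prob_of_improvement([3, 2, 4], [3, 2, 4])
```

Values near 0.5 indicate that the two systems cannot be separated on the given test set, while values near 1.0 indicate a consistent gain.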
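[Figure: histograms of the number of speakers per estimated warping factor (horizontal axis 0.80 to 1.20, vertical axis 0 to 120), with labels "female" and "male", for the Naive and MMI' criteria.]
Fig. 1. Histogram of estimated warping factors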
In order to further analyze the effect of properly normalized training criteria, Figure 1 shows histograms of the number of speakers assigned to each warping factor in the grid search. Comparing the histograms for the naive and the MMI' case, we observe that the histogram is narrower in the MMI' case. A similar effect was previously observed in [2] using a non-MFCC front-end. We also observe that the MMI' histogram is not centered around warping factor 1.0. One possible explanation for this effect could be that the Jacobian-like term in the MMI' criterion is sensitive to the fact that all frames, including the non-speech frames, were used for the estimation. It could also indicate that the basic Mel warping as used in our system is suboptimal and that the MMI' criterion identifies this.
4. SUMMARY
In this paper we have presented an efficient and flexible approach to VTLN warping factor estimation. We have shown that the positive effect of using a properly normalized optimization criterion for warping factor estimation, which has been previously demonstrated for non-standard signal processing front-ends, carries over to a standard MFCC setup. Even though the positive effect is small it is still of practical interest, since it improves on an optimized baseline. Future work will be conducted in two directions. Different optimization criteria will be investigated. In particular the HDA criterion [5] will be considered, since it performs well for unconstrained transforms. Preliminary experiments have shown that it outperforms the MMI' criterion for unconstrained linear projection estimation. The other direction is to investigate further refinements to the warping transforms, in order to allow for using smaller matrices, which would speed up the method.
5. REFERENCES
[1] E. Eide and H. Gish, “A parametric approach to vocal tract length normalization,” in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Atlanta, GA, May 1996, vol. 1, pp. 346–349.
[2] M. Pitz, Investigations on Linear Transformations for Speaker Adaptation and Normalization, Ph.D. thesis, RWTH Aachen University, Aachen, Germany, Mar. 2005.
[3] S. Umesh, A. Zolnay, and H. Ney, “Implementing frequency-warping and VTLN through linear transformation of conventional MFCC,” in Proc. European Conf. on Speech Communication and Technology, Lisbon, Portugal, Sept. 2005, vol. 1, pp. 269–272.
[4] M. J. F. Gales, “Maximum likelihood linear transformations for HMM-based speech recognition,” Computer Speech and Language, vol. 12, no. 2, pp. 75–98, Apr. 1998.
[5] N. Kumar and A. G. Andreou, “Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition,” Speech Communication, vol. 26, no. 4, pp. 283–297, Dec. 1998.
[6] L. F. Uebel and P. C. Woodland, “Discriminative linear transforms for speaker adaptation,” in ISCA ITR-Workshop on Adaptation Methods in Speech Recognition, Sophia Antipolis, France, Aug. 2001, pp. 61–64.
[7] K. Visweswariah and R. Gopinath, “Adaptation of front end parameters in a speech recognizer,” in Proc. Int. Conf. on Spoken Language Processing, Jeju Island, Korea, Oct. 2004, pp. 21–24.
[8] C. Gollan, M. Bisani, S. Kanthak, R. Schlüter, and H. Ney, “Cross domain automatic transcription on the TC-Star EPPS corpus,” in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Mar. 2005, vol. 1, pp. 825–828.
[9] L. Welling, S. Kanthak, and H. Ney, “Improved methods for vocal tract normalization,” in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Phoenix, AZ, Apr. 1999, vol. 2, pp. 761–764.