COMPARISON OF DISCRIMINATIVE TRAINING CRITERIA
Ralf Schlüter and Wolfgang Macherey
Lehrstuhl für Informatik VI
RWTH Aachen – University of Technology
D-52056 Aachen, Germany
Email: schlueter@informatik.rwth-aachen.de
ABSTRACT
In this paper, a formally unifying approach for a class of discriminative training criteria including the Maximum Mutual Information (MMI) and the Minimum Classification Error (MCE) criterion is presented, including the optimization methods gradient descent (GD) and extended Baum-Welch (EB) algorithm. Comparisons are discussed for the MMI and the MCE criterion, including the determination of the sets of word sequence hypotheses for discrimination using word graphs. Experiments have been carried out on the SieTill corpus for telephone line recorded German continuous digit strings. Using several approaches for acoustic modeling, the word error rates obtained by MMI training using single densities always were better than those for Maximum Likelihood (ML) using mixture densities. Finally, the results obtained for corrective training (CT), i.e. using only the best recognized word sequence in addition to the spoken word sequence, could not be improved by using word graph based discriminative training.
1. INTRODUCTION
In an increasing number of applications, discriminative training criteria such as Maximum Mutual Information (MMI) [6] and Minimum Classification Error (MCE) [1] have been used. In MCE training, an approximation of the error rate on the training data is optimized, whereas MMI optimizes the a posteriori probability of the training utterances and hence the class separability. Based on [6], we present a formally unifying approach for a class of discriminative criteria including the MMI and the MCE criterion, thus extending a comparison done in [7]. In a previous study [9], we also found a unifying approach for the optimization methods gradient descent and extended Baum-Welch (EB) algorithm, which was transferred to the unified criterion presented here.
Experimental results are presented for the SieTill corpus for telephone line recorded German connected digit strings. In order to investigate the abilities of discriminative training to improve ML training results, we performed comparative experiments for several approaches of acoustic modeling, such as single vs. mixture densities, pooled vs. state specific variances, and an optional linear discriminant analysis (LDA) [3].
Following previous studies [9], we also performed experiments comparing GD with EB optimization for MMI training of mixture densities, showing no significant differences. Furthermore, for determining the sets of competing word hypotheses for discrimination, we performed experiments using CT [6] or word graphs for an efficient representation of all competing word hypotheses. These experiments were initialized with our best results using CT, where only the best recognized word sequence is used for discrimination. We did not observe further improvements in word error rate, although in the case of word graphs a further convergence of the criterion was found.
2. DISCRIMINATIVE TRAINING
The training data shall be given by R training utterances (X_r, W_r), r = 1, …, R, each consisting of a sequence X_r = x_{r1} … x_{rT_r} of acoustic observation vectors and the corresponding sequence W_r of spoken words. The a posteriori probability for the word sequence W_r given the acoustic observation vectors X_r shall be denoted by p_θ(W_r|X_r). Similarly, p_θ(X_r|W_r) and p(W_r) represent the according emission and language model probabilities, respectively. In the following, the language model probabilities are supposed to be given. Hence the parameter θ represents the set of all parameters of the emission probabilities p_θ(X_r|W_r). Finally, let M_r denote the set of word sequences which are considered for discrimination in utterance r. A class of discriminative training criteria F_D including MMI and MCE could then be defined by the expression

F_D(\theta) = \sum_{r=1}^{R} f\left( \log \frac{ p^{\kappa}(W_r)\, p_{\theta}^{\kappa}(X_r|W_r) }{ \sum_{W \in M_r} p^{\kappa}(W)\, p_{\theta}^{\kappa}(X_r|W) } \right)
The choice of the exponent κ, the smoothing function f and the set M_r of word sequences for discrimination decide which criterion is represented. In particular, choosing κ = 1 and f(x) = x yields the MMI criterion. On the other hand, using the sigmoid function f(x) = (1 + exp(-2ρx))^{-1} yields an equivalent version of the MCE criterion, which is to be maximized. Ideally, in the case of the MMI criterion the set M_r would contain all possible word sequences. In practice, M_r is obtained through a recognition pass and is represented by N-best lists or word graphs. For MCE the spoken word sequence has to be excluded from this set. The contribution of each competing sentence to reestimation is controlled by the exponent κ, where very large values of κ lead to a maximum approximation. For the MMI criterion the latter is called corrective training (CT), where only the best recognized word sequences are used for discrimination. The smoothing function f leads to an optional weighting on the level of whole training utterances, as can be seen in the following derivation of the iteration equations for the case of Gaussian mixture densities.

An optimization of the class of discriminative training criteria defined above tries to simultaneously maximize the emission probabilities of the spoken training sentences and to minimize a
weighted sum over the emission probabilities of each competing sentence given the acoustic observation sequence for each training utterance. Thus, these criteria optimize the class separability of the word sequences under consideration of the language model.
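To make the roles of the exponent κ, the smoothing function f and the set M_r concrete, here is a small illustrative Python sketch (not the authors' implementation; utterance scores are assumed to be given as joint log-probabilities log p(W) + log p_θ(X_r|W) for the spoken sequence and for each competing sequence in M_r):

```python
import math

def log_sum_exp(vals):
    # numerically stable log of a sum of exponentials
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def unified_criterion(utterances, kappa=1.0, f=lambda x: x):
    """Unified discriminative criterion F_D.

    utterances: list of (log_spoken, log_competing) pairs, where
    log_spoken is log[p(W_r) p(X_r|W_r)] for the spoken sequence and
    log_competing lists log[p(W) p(X_r|W)] for all W in M_r.
    kappa is the exponent; f is the smoothing function
    (identity -> MMI, sigmoid -> a smoothed sentence error as in MCE).
    """
    total = 0.0
    for log_spoken, log_competing in utterances:
        num = kappa * log_spoken
        den = log_sum_exp([kappa * lp for lp in log_competing])
        total += f(num - den)
    return total

def sigmoid(x, rho=1.0):
    # smoothing function yielding an MCE-style criterion
    return 1.0 / (1.0 + math.exp(-2.0 * rho * x))
```

With κ = 1 and the identity for f this evaluates the MMI criterion of the expression above; replacing f by the sigmoid (and excluding the spoken sequence from the competing set) gives the MCE-style smoothed error.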
2.1. Parameter Optimization
One possibility to maximize discriminative training criteria con
sists of a gradient descent with the following reestimation formula
for the parameters:
\hat{\theta} = \theta + \epsilon\, \frac{\partial F_D(\theta)}{\partial \theta}

A mixture density for an acoustic observation vector x given an HMM state s shall be denoted by p(x|s; θ_s). The according parameters θ_s of a mixture density are the weights c_{sl} and the parameters θ_{sl} of the densities l of the mixture, and mixture densities shall be calculated in maximum approximation. Then the derivative of the general discriminative criterion F_D with respect to the parameters θ_{sl} is given by:

\frac{\partial F_D(\theta)}{\partial \theta_{sl}} = \Gamma_{sl}\!\left( \frac{\partial \log c_{sl}\, p(x|\theta_{sl})}{\partial \theta_{sl}} \right)
where the discriminative averages Γ_{sl} are defined by:

\Gamma_{sl}(g(x)) = \sum_{r=1}^{R} f_r \sum_{t=1}^{T_r} \left[ \gamma_{rt}(s|W_r) - \gamma_{rt}(s) \right] \gamma_{rt}(l|s)\, g(x_{rt}), \qquad \Gamma_{s}(g(x)) = \sum_{l} \Gamma_{sl}(g(x))     (1)

where we have utterance weights f_r which have to be considered if the smoothing function f is not the identity:

f_r = f'\!\left( \log \frac{ p^{\kappa}(W_r)\, p_{\theta}^{\kappa}(X_r|W_r) }{ \sum_{W \in M_r} p^{\kappa}(W)\, p_{\theta}^{\kappa}(X_r|W) } \right)
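Written out, the discriminative average of Equation (1) is a weighted sum over utterances and frames. The following sketch assumes the FB probabilities are already computed; the data layout is hypothetical and chosen only for illustration:

```python
def discriminative_average(utts, g):
    """Gamma_sl(g) for one density (s, l), following Eq. (1).

    utts: list of (f_r, frames) pairs, one per utterance, with frames
    a list of (gamma_spoken, gamma_gen, gamma_density, x) tuples per
    time frame, i.e. gamma_rt(s|W_r), gamma_rt(s), gamma_rt(l|s), x_rt.
    g: function applied to the observation, e.g. lambda x: x.
    """
    total = 0.0
    for f_r, frames in utts:
        for gam_spk, gam_gen, gam_l, x in frames:
            # spoken-minus-competing FB difference, weighted per utterance
            total += f_r * (gam_spk - gam_gen) * gam_l * g(x)
    return total
```

Note that frames where the spoken and generalized FB probabilities agree contribute nothing, which is the mechanism behind the MMI/MCE comparison discussed below.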
Applying the maximum approximation for the calculation of mixture densities, the density probabilities γ_{rt}(l|s) are determined by

\gamma_{rt}(l|s) = \delta\!\left( l,\ \operatorname{argmax}_{k}\ c_{sk}\, p(x_{rt}|\theta_{sk}) \right)

with the Kronecker delta δ(i, j). The discriminative averages also make use of the Forward-Backward (FB) probabilities of the spoken word sequence W_r:

\gamma_{rt}(s|W_r) = p_{\theta}(s_t = s \mid X_r, W_r)

and the generalized FB probabilities for the total of all competing word sequences W defined by the sets M_r:

\gamma_{rt}(s) = \sum_{W \in M_r} \frac{ p_{\theta}^{\kappa}(X_r, W) }{ \sum_{V \in M_r} p_{\theta}^{\kappa}(X_r, V) }\, \gamma_{rt}(s|W)
The generalized FB probability is simply a sum over the conven
tional FB probabilities of each competing sentence weighted by its
renormalized posterior probability.
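This weighted sum can be sketched directly (hypothetical data structures, not the paper's implementation): each hypothesis contributes its conventional FB probabilities, scaled by its renormalized sentence posterior.

```python
import math

def generalized_fb(hyps, kappa=1.0):
    """gamma_rt(s): posterior-weighted sum of conventional FB probabilities.

    hyps: list of (log_p, gamma) pairs, one per word sequence W in M_r,
    where log_p = log p_theta(X_r, W) and gamma maps (t, s) to the
    conventional FB probability gamma_rt(s|W) of that hypothesis.
    Returns a function (t, s) -> gamma_rt(s).
    """
    logs = [kappa * lp for lp, _ in hyps]
    m = max(logs)
    weights = [math.exp(l - m) for l in logs]
    z = sum(weights)
    posteriors = [w / z for w in weights]  # renormalized over M_r

    def gamma(t, s):
        return sum(p * g.get((t, s), 0.0)
                   for p, (_, g) in zip(posteriors, hyps))
    return gamma
```

Large values of κ sharpen the sentence posteriors toward the best hypothesis, which reproduces the maximum approximation mentioned above.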
Using the Viterbi approximation [4], i.e. calculating the FB probabilities from the according time alignment, the sum over all competing word hypotheses for the calculation of the generalized FB probability could be separated from the time alignment. Then the according word posterior probabilities needed could be calculated efficiently by applying an FB calculation scheme on the basis of word hypotheses on a word graph. Thus word graphs could also be used if κ is not 1, which would not be possible if the word graph FB scheme were applied on the state level already, as done for the MMI criterion in [10]. It should be noted that the calculation of word posterior probabilities also finds applications in other areas of speech recognition such as the determination of confidence measures, e.g. [8].

Discriminative training with the MMI criterion usually applies an extended version of Baum-Welch training, the EB algorithm [5, 6]. We extended this approach to the general criterion F_D, which could be maximized via the following auxiliary function:
S(\theta, \hat{\theta}) = \sum_{s} \sum_{r=1}^{R} f_r \sum_{t=1}^{T_r} \left[ \gamma_{rt}(s|W_r) - \gamma_{rt}(s) \right] \log p(x_{rt}|s; \hat{\theta}_s) + \sum_{s} D_s \int dx\; p(x|s; \theta_s) \log p(x|s; \hat{\theta}_s)
which is to be optimized iteratively. Differentiation with respect to the iterated parameters θ̂_{sl} leads to the following expression, from which reestimation formulae can be derived:

\frac{\partial S(\theta, \hat{\theta})}{\partial \hat{\theta}_{sl}} = \Gamma_{sl}\!\left( \frac{\partial \log p(x|s; \hat{\theta}_s)}{\partial \hat{\theta}_{sl}} \right) + D_s \int dx\; p(x|s; \theta_s)\, \frac{\partial \log p(x|s; \hat{\theta}_s)}{\partial \hat{\theta}_{sl}}
Using discriminative averages for writing down reestimation formulae yields expressions which are formally independent of the particular criterion chosen. Thus, differences between criteria are introduced via the discriminative averages only, and comparisons could be reduced to this level.
Performing the EB algorithm, we obtain the following reestimation equations for the means μ̂_{sl}, state specific diagonal variances σ̂²_s and mixture weights ĉ_{sl} of Gaussian mixture densities:

\hat{\mu}_{sl} = \frac{ \Gamma_{sl}(x) + D_s\, c_{sl}\, \mu_{sl} }{ \Gamma_{sl}(1) + D_s\, c_{sl} }

\hat{\sigma}_s^2 = \frac{ \Gamma_{s}(x^2) + D_s \left( \sigma_s^2 + \sum_{l} c_{sl}\, \mu_{sl}^2 \right) }{ \Gamma_{s}(1) + D_s } - \sum_{l} \frac{ \Gamma_{sl}(1) + D_s\, c_{sl} }{ \Gamma_{s}(1) + D_s }\, \hat{\mu}_{sl}^2

\hat{c}_{sl} = \frac{ \dfrac{\Gamma^{spk}_{sl}(1)}{\Gamma^{spk}_{s}(1)} - \dfrac{\Gamma^{gen}_{sl}(1)}{\Gamma^{gen}_{s}(1)} + C_s }{ \sum_{l'} c_{sl'} \left( \dfrac{\Gamma^{spk}_{sl'}(1)}{\Gamma^{spk}_{s}(1)} - \dfrac{\Gamma^{gen}_{sl'}(1)}{\Gamma^{gen}_{s}(1)} \right) + C_s }\; c_{sl}
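As a minimal sketch of the mean update above (scalar case; variable names are illustrative, and the discriminative averages and the iteration constant D_s are assumed to be precomputed):

```python
def eb_mean_update(gamma_x, gamma_1, mean, weight, d_s):
    """Extended Baum-Welch update for the mean of density (s, l).

    gamma_x -- discriminative average Gamma_sl(x)
    gamma_1 -- discriminative average Gamma_sl(1); may be negative
    mean    -- current mean mu_sl
    weight  -- current mixture weight c_sl
    d_s     -- iteration constant D_s, chosen large enough to keep
               the denominator (and the new variance) positive
    """
    return (gamma_x + d_s * weight * mean) / (gamma_1 + d_s * weight)
```

Note how D_s interpolates between the discriminative statistics and the current model: for very large D_s the update leaves the mean essentially unchanged, which is the low-step-size regime discussed below.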
An alternative would be to perform gradient descent on the criterion F_D. Doing this and comparing both sets of reestimation formulae, we arrive at step sizes for gradient descent [9] which lead to reestimation formulae that differ only for the variances, by terms containing the squared step sizes of the means of the according mixture:

\hat{\mu}_{sl}^{GD} = \hat{\mu}_{sl}^{EB}

\left( \hat{\sigma}_s^2 \right)^{GD} = \left( \hat{\sigma}_s^2 \right)^{EB} + \sum_{l} \frac{ \Gamma_{sl}(1) + D_s\, c_{sl} }{ \Gamma_{s}(1) + D_s } \left( \hat{\mu}_{sl} - \hat{\mu}_{sl}^{BW} \right)^2

\hat{c}_{sl}^{GD} = \hat{c}_{sl}^{EB}
The reestimation formulae for the mixture weights do not result directly from the optimization of the criterion but are smoothed versions for better convergence [5]. For this version the discriminative averages Γ_{s(l)}, as defined in Equation 1, are separated according to the FB probability for the spoken (spk) word sequence and the generalized (gen) FB probability for the total of all competing word sequences.

Setting κ = 1 for comparison purposes, we observe only two differences between MCE and MMI. Firstly, for MMI the spoken word sequence is considered for discrimination, whereas it has to be excluded when using MCE. Since the word posterior probabilities of correct words securely recognized will be nearly 1, the differences of FB probabilities in the discriminative averages for MMI are nearly zero, such that those words do not contribute significantly to reestimation. Secondly, the worse the recognition results for an utterance are, the more it will contribute to MMI reestimation, which is not the case for MCE. For MCE, hopelessly badly recognized utterances together with securely recognized ones are weighted down as a whole according to their posterior probabilities via the smoothing function f.

Fast convergence is achieved if the iteration constants D_s are chosen such that the denominators in the reestimation equations and the according variances are kept positive:

D_s = h \cdot \max\left\{ D_s^{min},\ \Gamma_{s}(1) \right\}     (2)

Here, D_s^{min} denotes an estimation for the minimal iteration constant which guarantees the positivity of the variance in state s, and the iteration factor h ≥ 1 controls the convergence of the iteration process, high values leading to low step sizes. The maximum with Γ_s(1) is chosen to prevent overflow caused by low-valued denominators. Similarly, the iteration parameters C_s for the mixture weights are chosen such that all weights are positive:

C_s = \max_{l} \left\{ -\left( \frac{\Gamma^{spk}_{sl}(1)}{\Gamma^{spk}_{s}(1)} - \frac{\Gamma^{gen}_{sl}(1)}{\Gamma^{gen}_{s}(1)} \right),\ 0 \right\} + \varepsilon

with a small constant ε.
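Assuming the reconstructed form of Equation (2), the choice of the iteration constants can be sketched as follows (illustrative names; D_s^{min} and Γ_s(1) are assumed given):

```python
def iteration_constant(d_s_min, gamma_s_1, h=2.0):
    """D_s = h * max{D_s_min, Gamma_s(1)}, following Equation (2).

    d_s_min   -- estimated minimal constant keeping the variance positive
    gamma_s_1 -- discriminative average Gamma_s(1); taking the maximum
                 with it guards against low-valued denominators
    h         -- iteration factor h >= 1; higher h gives lower step sizes
    """
    return h * max(d_s_min, gamma_s_1)
```

The constant C_s for the weights would be chosen analogously, as the smallest value that keeps every updated mixture weight positive, plus a small ε.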
3. RESULTS
Experiments were performed on the SieTill corpus [2] for telephone line recorded German continuous digit strings. The SieTill corpus consists of approximately 43k spoken digits in 13k sentences for both training and test.
The recognition system for the SieTill corpus is based on whole word HMMs using continuous emission distributions. It is characterized as follows:
- Gaussian mixture emission distributions,
- pooled or state dependent variance vectors,
- gender dependent whole word HMMs for 11 German digits including 'zwo' and gender dependent silence models,
- per gender 132 states plus one for silence,
- 12 cepstral features plus first derivatives and the second derivative of the energy.

Table 1: Comparison of recognition results on the SieTill corpus for ML and discriminative training for different acoustic modeling and training techniques (WER split into insertions, deletions and total).

corp   LDA   var   dns   crit   opt   ins[%]   del[%]   WER[%]
test   no    PV      1   ML     -        1.0      0.7      5.6
test   no    PV      4   ML     -        1.8      0.4      4.8
test   no    PV      1   CT     GD       0.6      0.8      3.3
test   no    SV      1   ML     -        1.6      0.6      5.2
test   no    SV      4   ML     -        1.7      0.5      4.6
test   no    SV     25   ML     -        1.6      0.4      4.1
test   no    SV      1   CT     GD       0.7      0.8      3.4
test   yes   PV      1   ML     -        0.7      0.5      4.0
test   yes   PV      1   CT     GD       0.7      0.5      2.8
test   yes   PV      4   ML     -        1.0      0.3      3.0
test   yes   PV      4   CT     GD       0.5      0.6      2.5
test   yes   PV      4   CT     EB       0.5      0.6      2.6
test   yes   PV      4   WG     GD       0.5      0.7      2.6
The baseline recognizer applies ML training using the Viterbi approximation [4], which serves as a starting point for the additional discriminative training. A detailed description of the baseline system could be found in [11].

In Table 1 the recognition results obtained for several acoustic modeling approaches using maximum likelihood training are indicated by ML. For ML training, state specific variances (SV) gave better results than pooled variances (PV) for both single and mixture densities. The best results for state specific variances were obtained using approximately 25 densities per mixture, whereas for pooled variances the best results were already obtained for approximately 4 densities per mixture. Adding linear discriminant analysis (LDA) to pooled variances gave our best ML results, with 4.0% word error rate for single densities and 3.0% word error rate for mixture densities with approx. 4 densities per mixture. It should be noted that the LDA gave a relative improvement of over a third in word error rate compared to pooled variances without LDA and still more than a quarter compared to state specific variances without LDA.
For CT, a sufficiently high iteration factor h leads to relatively smooth convergence. Fig. 1 shows a plot of the MMI criterion for the male portion of the SieTill training corpus for CT using both GD and EB optimization, starting from the according ML result using Gaussian mixture densities with approx. 4 densities per mixture, a pooled variance vector and LDA. After CT has converged (indicated by the vertical line), a plot of the MMI criterion using word graphs for discrimination (WG) with an average number of about 47 word hypotheses per spoken word is added. Certainly the absolute values of the MMI criterion using CT and WG, respectively, are not comparable. In a region where CT does not converge any more, the MMI criterion clearly converges, although the word error rate obtained by CT is even slightly better than that for WG (cf. Table 1). The reason for this could be that an utterance which is correctly recognized does not contribute to reestimation for CT. Thus, incorrectly hypothesized word sequences for such utterances are also not considered for discrimination using CT, even if
[Figure 1: plot of the MMI criterion vs. iteration index, curves for MMI: CT/EB, CT/GD and WG/GD]

Figure 1: MMI criterion for the male speakers of the SieTill training corpus for corrective training (CT) and the use of word graphs (WG).
their posterior probabilities are only marginally smaller than the maximum. Contrarily, using WG would try to reduce the posterior probability of such marginal second best hypotheses, although this might not be necessary as long as these wrong hypotheses remain second best at most. Thus, this further rearrangement of the posterior probabilities done using WG might have no or even negative effects on the word error rate in comparison to CT, as observed in our experiments. Table 1 summarizes the recognition results for the SieTill test corpus using Gaussian emission densities. For different levels of acoustic modeling, we compare ML results with the according discriminative training results using the MMI criterion with the corrective training (CT) approximation and gradient descent (GD) optimization. The largest improvements using CT were obtained for our simplest system using single densities with pooled variances (PV), where the ML training word error rate was reduced by 41% relative. Although the initial ML result for single densities with density specific variances (SV) was better than the according result for pooled variances, the improvement obtained by additional CT was smaller, and the word error rate obtained was even slightly higher than that for CT using single densities with pooled variance. It should be noted that the result for CT using single densities with density specific variances was even better than the according ML result using mixture densities with 25 densities per mixture. The best results for CT were obtained using mixture densities with 4 densities per mixture, pooled variance (PV) and LDA, leading to a word error rate of 2.5%, which is the same as reported in [2]. Still, the according result for CT with single densities is slightly better than the results for ML training using 4 densities per mixture. Finally, the best CT result using GD optimization was compared to CT using EB optimization, showing no significant difference for mixture densities, as was the case for single densities [9].
4. CONCLUSION
We presented a formally unifying approach for a class of discriminative training criteria and optimization methods including the Maximum Mutual Information (MMI) and the Minimum Classification Error (MCE) criterion, which were compared. For the MMI criterion, experiments were performed on the SieTill corpus. Relative improvements in word error rate of up to 41% compared to ML training were obtained, and MMI training using single densities always produced better results than ML training using mixture densities. For the best initial result using ML training, the relative improvement obtained by a subsequent MMI training was about 1/6, leading to a word error rate of 2.5%. This result, which was obtained for corrective training, i.e. using only the best recognized word sequence in addition to the spoken word sequence, could not be improved by using word graph based discriminative training.

Acknowledgement. This work was partly supported by Siemens AG, Munich.
5. REFERENCES
[1] W. Chou, C.H. Lee, B.H. Juang. “Minimum Error Rate
Training based on
Conf. on Acoustics, Speech and Signal Processing, Min
neapolis, MN, Vol. 2, pp. 652655, April 1993.
[2] T. Eisele, R. HaebUmbach, D. Langmann, “A comparative
study of linear feature transformation techniques for auto
maticspeech recognition,” in Proc.Int. Conf. on Spoken Lan
guage Processing, Philadelphia, PA, Vol. I, pp. 252255, Oc
tober 1996.
[3] R. HaebUmbach, H. Ney. “Linear Discriminant Analysis
forImproved LargeVocabulary Continuous SpeechRecogni
tion,” Proc. Int. Conf. Acoustics, Speech and Signal Process.
1992, San Francisco, CA, Vol. 1, pp. 1316, March 1992.
[4] H. Ney. “Acoustic Modeling of Phoneme Units for Continu
ous Speech Recognition,” Proc. Fifth Europ. Signal Process
ing Conf., Barcelona, pp 6572, September 1990.
[5] Y. Normandin. Hidden Markov Models, Maximum Mutual
Information Estimation, and the Speech Recognition Prob
lem, Ph.D. thesis, Department of Electrical Engineering,
McGill University, Montreal, 1991.
[6] Y. Normandin. “Maximum Mutual Information Estimation
of Hidden Markov Models,” Automatic Speech and Speaker
Recognition, C.H.Lee, F.K.Soong, K.K.Paliwal(eds.),pp.
5781, Kluwer Academic Publishers, Norwell, MA, 1996.
[7] W. Reichl, G. Ruske. “Discriminative Training for Continu
ous Speech Recognition,” Proc. 1995 Europ. Conf. onSpeech
Communication and Technology, Madrid, Vol. 1, pp. 537
540, September 1995.
[8] T. Kemp, T. Schaaf. “Estimating Confidence using Word
Lattices,” Proc. 1997 Europ. Conf. on Speech Communica
tion and Technology, Rhodes, Greece, Vol. 2, pp. 827830,
September 1997.
[9] R. Schl¨ uter, W. Macherey, S. Kanthak, H. Ney, L. Welling.
“Comparison of Optimization Methods for Discriminative
Training Criteria,” Proc. 1997 Europ. Conf. on Speech Com
munication and Technology, Rhodes, Greece, Vol. 1, pp. 15
18, September 1997.
[10] V. Valtchev, J. J. Odell, P. C. Woodland, S. J. Young.
“LatticeBased Discriminative Training For Large Vocab
ulary Speech Recognition,” In Proc. Int. Conf. Acoustics,
Speech and Signal Process. 1996, Atlanta, GA, Vol. 2, pp.
605608, May 1996.
[11] L. Welling, H. Ney, A. Eiden, C. Forbrig. “Connected Digit
Recognition using Statistical Template Matching,” Proc.
1995 Europ. Conf. on Speech Communication and Technol
ogy, Madrid, Vol. 2, pp. 14831486, September 1995.
NBest String Models,” Proc. 1993 Int.