RATE-OF-SPEECH MODELING FOR LARGE VOCABULARY
CONVERSATIONAL SPEECH RECOGNITION
Jing Zheng, Horacio Franco and Andreas Stolcke
Speech Technology and Research Laboratory
SRI International, Menlo Park, CA, USA
{zj, hef, stolcke}@speech.sri.com
ABSTRACT
Variations in rate of speech (ROS) produce changes in both
spectral features and word pronunciations that affect automatic
speech recognition (ASR) systems. To deal with these ROS
effects, we propose to use parallel, rate-specific, acoustic
models: one for fast speech, the other for slow speech. Rate
switching is permitted at word boundaries, to allow modeling
within-sentence speech rate variation, which is common in
conversational speech. Due to the parallel structure of rate-
specific models and the maximum likelihood decoding method,
we do not need high-quality ROS estimation before recognition,
which is usually hard to achieve. In this paper, we evaluate our
approach on a large-vocabulary conversational speech
recognition (LVCSR) task over the telephone, with several
minimal pair comparisons based on different baseline systems.
Experiments show that on a development set for the 2000 Hub-5
evaluation, introducing word-level ROS-dependent models
results in a 1.9% absolute win over a baseline system without
multiword pronunciation modeling, and a 0.7% absolute win
over a stronger baseline that itself gains 4.0% absolute from
multiword pronunciation modeling. The combination of
rate-dependent acoustic models with rate-dependent
pronunciations obtained by using a data-driven approach is also
explored and shown to produce an additional win.
1. INTRODUCTION
Rate of speech (ROS) is an important factor that
affects the performance of a transcription system [1],[2].
Possible reasons are that some features commonly used in
recognition systems are duration related and clearly
influenced by speech rate, such as delta and delta delta
features, and that some pronunciation phenomena such as
coarticulation and reduction are also speech rate related.
Thus, using rate-dependent acoustic models seems to be a
promising way to improve robustness against speech rate
variation.
In previous research work, rate-dependent acoustic
models were often used at the sentence level. In the
typical framework, an input utterance was first classified
as fast or slow using a ROS estimator, and then fed to a
rate-specific system that was tuned to fast or slow speech
[2]. This method has two drawbacks. First, it presumes
that the speech rate within an utterance is uniform, which
is often not the case in conversational speech. In our
earlier research work on broadcast news [3], we found
that speech rate variation within sentences is common,
and thus we proposed to use a more local rate dependency
for the acoustic models. Second, this approach is based on
sequential classification, so errors in the ROS
classification will most likely trigger errors in the
recognition step. This paper proposes a new approach:
rate-dependent acoustic modeling at the word-level.
Under this approach, each word in the lexicon is given two
parallel rate-specific pronunciations: a fast-version
pronunciation and a slow-version pronunciation, each
consisting of rate-specific phones. The recognizer is
allowed to select the fast or the slow pronunciation for
each word automatically during search, based on the
maximum likelihood criterion. This way, we can model
the within-sentence speech rate variation, and avoid the
requirement of pre-recognition ROS classification. To
train the rate-specific phone models, we use a duration-
based ROS measure to partition the training data into rate-
specific categories. Due to the availability of training
transcriptions, robust and accurate ROS estimation for
training data can be achieved.
We also explore a new method to model rate-
dependent pronunciation variation. Based on the concept
of a zero-length phone [3], we enable short phones to be
skipped without changing the contexts of neighboring
phones. A data-driven method is used to generate an
expanded rate-dependent pronunciation dictionary.
In Section 2 we introduce the ROS measure used for
partitioning the training data. In Section 3 we show the
experimental results of rate-dependent acoustic modeling
based on SRI’s 1998 Hub-5 evaluation system, and
compare different training approaches. In Section 4 we
describe the work for the March 2000 Hub-5 evaluation
system, and specifically address the effect of multiwords
in rate-dependent acoustic modeling. In Section 5 we
report initial results on rate-dependent pronunciation
modeling. Finally, in Section 6, we summarize our results.
2. ROS MEASURE
Two methods are typically used to estimate the ROS of an
input utterance. One is based on phone durations, which
are often obtained from phone-level segmentations by
using forced alignments. When the utterance transcription
is known, this duration-based method can provide robust
ROS estimation [2]; however, when the transcription is
unknown, we can only use the hypothesis from a prior
recognition run, whose quality is hard to guarantee. The
second method involves estimating ROS directly from the
waveform or acoustic features of the input utterance [4].
To achieve robust ROS estimation, the computation is
often based on a data window with sufficient length.
Under our proposed approach, to train the rate-specific
models we need to partition the training data into rate-
specific categories at the word level, and we therefore
need the ROS for each word to be estimated locally. The
output of this process should give each word in the
training transcription a rate class label. As a first step toward
ROS modeling, we decided to use only two ROS classes:
“fast” or “slow”. Since we only need to compute ROS for
the training data that have transcriptions, it is relatively
straightforward to obtain the duration of each word and its
component phones by computing forced Viterbi
alignments, and then applying duration-based ROS
estimation methods.
Absolute ROS measures, such as phones per second
(PPS) and inverse mean duration (IMD) [2], were often
used in previous work. However, we felt that these
measures are not informative enough since they do not
consider the fact that different types of phones have
different duration distributions. Fig. 1 illustrates the
duration distributions of some different categories of
monophones estimated from the training corpus. As we
can see, the duration distribution across different phone
types differs substantially. When taking PPS or IMD as
the ROS measure, words composed of short phones are
more easily treated as fast than those composed of long
phones, even though they are not actually spoken faster
than the normal rate. In our approach, we use a relative
ROS measure, R_W(D), defined as a percentile of a word's
ROS distribution:

    R_W(D) = 1 − ∑_{d=0}^{D} P_W(d)        (1)

where W is a given word, D is the duration of W, and
P_W(d) is the probability of a word of that type having
duration d. R_W(D) is thus the probability of W having a
duration longer than D. The measure R_W(D) always falls
within the range [0,1], and can be compared between
different word categories. However, in practice, P_W(d) is
hard to estimate directly due to data sparseness. To
address this problem we assume that, within a word, the
duration distributions of its component subword units,
such as phones, are independent of each other. Thus, a
word's duration distribution equals the convolution of its
component subword unit distributions, which are easier to
estimate reliably from training data. We currently use
triphones as the subword units for ROS estimation.
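The computation behind Eq. 1 can be sketched in a few lines. The following is a minimal illustration (not the paper's implementation), assuming each subword unit's duration distribution is available as a discrete PMF over frame counts; the phone PMF in the example is invented for illustration.

```python
# Sketch of the relative ROS measure R_W(D) from Eq. 1: a word's duration
# PMF is the convolution of its component phone PMFs (under the
# independence assumption), and R_W(D) is the probability that the word
# lasts longer than the observed duration D.

def convolve(p, q):
    """Convolve two duration PMFs (lists indexed by frame count)."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out

def word_duration_pmf(phone_pmfs):
    """Word PMF = convolution of its component phone PMFs."""
    pmf = [1.0]
    for p in phone_pmfs:
        pmf = convolve(pmf, p)
    return pmf

def relative_ros(phone_pmfs, duration):
    """R_W(D) = P(word duration > D). A large value means the observed
    duration is short for this word type, i.e. the token is fast."""
    pmf = word_duration_pmf(phone_pmfs)
    return sum(p for d, p in enumerate(pmf) if d > duration)

# Example: a three-phone word whose phones each last 3-5 frames.
phone = [0.0, 0.0, 0.0, 0.5, 0.3, 0.2]       # illustrative P(d), d = 0..5
r = relative_ros([phone, phone, phone], 10)  # token observed at 10 frames
print(round(r, 3))                           # prints 0.65
```

Because R_W(D) is a tail probability of the word's own duration distribution, it is directly comparable across words built from short and long phones, unlike PPS or IMD.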
Figure 1: Duration distributions of different phone types
We used this measure to calculate the ROS for all
word tokens in the training data, and found that 80% of
sentences with five or more words have at least one word
belonging to the fastest one third and one word belonging
to the slowest one third of all the words. This suggests
that in conversational speech, speech rate is usually not
uniform within a sentence.
The measure defined in Eq. 1 can also be applied to
subword units, thus allowing us to calculate the ROS of
phones. Using this measure, we studied the phone ROS
variation within words vs. within sentences. Fig. 2 shows
a histogram of the standard deviation of the phone ROS
within words and within sentences for all training data.
The data suggest that the word is a better unit than the
sentence for ROS modeling, since the average phone-
level ROS variation within a word is significantly smaller
than within a sentence.
Figure 2: Histogram of standard deviation of phone-level ROS
within words (mean: 0.18) vs. within sentences (mean: 0.25)
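The within-word vs. within-sentence comparison behind Fig. 2 can be illustrated with a small sketch; the phone-level ROS values below are invented, and `pstdev` is the population standard deviation from the Python standard library.

```python
# Sketch of the Fig. 2 statistic: for each grouping unit (word or
# sentence), take the standard deviation of its phones' relative ROS
# values, then average over units. Values are illustrative.
from statistics import mean, pstdev

def mean_within_std(groups):
    """Average per-group standard deviation of phone-level ROS."""
    return mean(pstdev(g) for g in groups if len(g) > 1)

# One sentence of three words; phone ROS varies more across words
# than within any single word.
words = [[0.20, 0.25, 0.22], [0.70, 0.75, 0.72], [0.40, 0.45, 0.42]]
sentence = [sum(words, [])]  # flatten into one sentence-level group

within_words = mean_within_std(words)
within_sentence = mean_within_std(sentence)
assert within_words < within_sentence  # word is the more homogeneous unit
```

When within-word spread is consistently smaller than within-sentence spread, as the paper reports, the word is the better-matched unit for rate-specific modeling.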
3. RATE-DEPENDENT ACOUSTIC
MODELING
In our proposed method, each word is given parallel
pronunciations of fast- and slow-version phones. Both
fast- and slow-version pronunciations are initialized from
the original rate-independent version, with the simple
replacement of rate-independent phones by rate-specific
phones. For example, the original rate-independent
pronunciation of "WORD" is /w er d/. Consequently the
fast-version pronunciation is /w_f er_f d_f/ and the slow-
version /w_s er_s d_s/, consisting of fast and slow phones,
respectively. The recognizer automatically finds the
pronunciations that maximize the likelihood score during
search, and thus avoids the need for ROS estimation
before recognition. In addition, the search algorithm is
allowed to select pronunciations of different rates across
word boundaries, thus coping with the problem of speech
rate variation within a sentence.
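The parallel-pronunciation construction can be sketched as follows; the `_f`/`_s` suffix convention is a notational assumption, not necessarily the system's internal naming, and the lexicon entry is illustrative.

```python
# Sketch of expanding a rate-independent lexicon into the parallel
# rate-specific form described above. Because both variants coexist in
# the lexicon, the decoder can switch rate at every word boundary.

def expand_lexicon(lexicon):
    """Give every word two parallel pronunciations built from
    rate-specific phones; the search picks one per word token by
    maximum likelihood."""
    expanded = {}
    for word, phones in lexicon.items():
        expanded[word] = [
            [p + "_f" for p in phones],  # fast-version pronunciation
            [p + "_s" for p in phones],  # slow-version pronunciation
        ]
    return expanded

lexicon = {"WORD": ["w", "er", "d"]}
print(expand_lexicon(lexicon)["WORD"])
# [['w_f', 'er_f', 'd_f'], ['w_s', 'er_s', 'd_s']]
```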
Acoustic Training
Our initial experiments were based on SRI’s 1998 Hub-5
evaluation system, which uses continuous-density genonic
hidden Markov models (HMMs) [5]. The original
evaluation system used a multi-pass recognition strategy
[6]. For the sake of simplicity, we ran our experiments
with only the first-pass recognizer, based on gender-
dependent non-crossword genonic HMMs (1730 genones
with 64 Gaussians each for male, 1458 genones for
female) and a bigram grammar with a 33,275-word
vocabulary. The recognition lexicon was derived from the
CMU V0.4 lexicon with stress information stripped. The
recognizer used a two-pass (forward pass and backward
pass) Viterbi beam search algorithm; in the first pass a
lexical tree was used in the grammar backoff node to
speed up search. Below we report results from the
backward pass. The features used were 9 cepstral
coefficients (C1-C8 plus C0) with their first- and second-
order derivatives in 10ms time frames. The acoustic
training corpus, containing 121,000 male sentences and
149,000 female sentences, came from (A) Macrophone
telephone speech, (B) 3,094 conversation sides from the
BBN-segmented Switchboard-1 training set (with some
hand-corrections), and (C) 100 CallHome English training
conversations.
We first calculated the ROS for all the words in the
training corpus based on the above-mentioned measure,
sorted these words accordingly, and then split them into
two categories: fast and slow. The ROS threshold for
splitting was selected to achieve equal amounts of training
data for the fast and the slow speech. The training
transcriptions were labeled accordingly. We then prepared
a special training lexicon: words with a fast label were
given the fast-version pronunciation, and words with a
slow label the slow-version pronunciation. In this way, we
were able to train the fast and slow models
simultaneously.
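The equal-data split can be sketched as below, assuming "equal amounts of training data" is measured in frames (the paper does not specify the unit) and that a higher relative ROS value from Eq. 1 means a faster token; the token values are illustrative.

```python
# Sketch of the fast/slow partition: sort word tokens by relative ROS
# (fastest first) and pick the threshold that puts half of the training
# frames on each side.

def split_by_ros(tokens):
    """tokens: list of (ros, duration_frames) per word token.
    Returns (threshold, labels); labels[i] is 'fast' if ros >= threshold."""
    order = sorted(range(len(tokens)), key=lambda i: -tokens[i][0])
    total = sum(d for _, d in tokens)
    acc = 0.0
    threshold = tokens[order[-1]][0]
    for i in order:
        acc += tokens[i][1]
        if acc >= total / 2:  # half the frames are at or above this ROS
            threshold = tokens[i][0]
            break
    return threshold, ["fast" if r >= threshold else "slow" for r, _ in tokens]

tokens = [(0.10, 10), (0.20, 10), (0.80, 10), (0.90, 10)]
threshold, labels = split_by_ros(tokens)
print(threshold, labels)  # prints 0.8 ['slow', 'slow', 'fast', 'fast']
```

The resulting labels determine which rate-specific pronunciation each training token receives, so both model sets are estimated in a single training run.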
We used the DECIPHER genonic training tools to run
standard maximum likelihood estimation (MLE) gender-
dependent training [5] and obtained rate-dependent
models with 3233 genones for male speech and 2501
genones for female speech. The genone clustering for
rate-dependent models used the same information loss
threshold as the training of rate-independent models.
We compared the rate-dependent acoustic model with
the rate-independent acoustic model (baseline system) on
a development subset of the 1998 Hub-5 evaluation data,
consisting of 1143 sentences from 20 speakers (9 male, 11
female). Table 1 shows the word error rate (WER) for
both models. Note that all results reported here are based
on speaker-independent within-word triphone acoustic
models and bigram language models, and are therefore
not comparable with the full evaluation system.
                                    male   female   all
rate-independent model              55.3   63.4     59.8
rate-dependent model from training  52.9   61.9     57.9
Table 1:
Comparison between the baseline system with rate-
independent models and the system with rate-dependent
models (% WER on the development set)
Rate-dependent modeling brings an absolute WER
reduction of 1.9%, which is statistically significant. To
eliminate the possible effect of different numbers of
parameters, we adjusted the information loss threshold for
genone clustering to obtain another rate-independent
model that had a number of parameters similar to that of
the rate-dependent model. However, we did not observe
any improvement from the increased number of
parameters. This suggests the win is indeed due to the
introduction of rate dependency.
Adaptation vs. Standard Training
In our previous work on the Broadcast News corpus
(Hub-4) [3], instead of using the training method
described above, we trained the rate-dependent model
using a modified Bayesian adaptation scheme [7], by
adapting the rate-independent model to rate-specific data
to obtain rate-specific models. This was motivated by the
small amount of available training data relative to the
model size. In [3], we used a baseline system with a very
large model comprising 256,000 Gaussians, and classified
the training data into three categories: fast, slow, and
medium. For this model size, the training data was not
sufficient to perform standard training. However, in the
current task of Hub-5 telephone speech transcription we
had significantly more training data, and we used a
different strategy to partition the data into two classes
instead of three, yielding more training data for each rate
class. In addition, the optimal models we started with
were smaller. Thus, we were able to train the rate-
dependent model robustly with standard training methods.
For comparison we tested the Bayesian adaptation
approach that we used in [3] on the current training set.
Similar to [3], even though we used separate rate-specific
models for each triphone, we did not create separate
copies of the genones, but let the fast and slow models for
a given triphone share the same genone. In this way, we
used the same number of Gaussians for the rate-dependent
model as for the rate-independent model.
                                      male   female   all
rate-independent model                55.3   63.4     59.8
rate-dependent model from adaptation  54.0   62.6     58.8
Table 2:
Comparison between the baseline system with rate-
independent model and the system with rate-dependent model
from adaptation (% WER on the development set)
Table 2 shows the results of the same development data
set we used in the previous section. We see that this
approach brings a win of 1.0% over the baseline, less than
the standard training scheme. This indicates that the
difference between fast and slow speech in the acoustic
space is significant, and that standard training might be
better than the previous adaptation scheme at capturing
this difference. In fact, standard training optimizes the
parameter tying for the rate-dependent model, reestimates
the HMM transition probabilities, and performs multiple
iterations of parameter reestimation. The adaptation
approach, on the other hand, does not recompute genonic
clustering, does not change the transition probabilities,
and includes only one iteration of reestimation for the
rate-dependent model on top of the rate-independent
model. These differences might explain why the
adaptation scheme did not perform as well as standard
training.
4. EXPERIMENTS IN THE 2000 NIST HUB-5
EVALUATION SYSTEM
For the March 2000 NIST Hub-5 benchmark, numerous
improvements were made to SRI’s 1998 evaluation
system [8], and the baseline system had been enhanced
substantially. Below we show some minimal pair
experiments based on different baseline systems during
the development process. The baseline system in Table 3
used a wider-band front end (with 13 cepstral coefficients
instead of 9), and vocal tract length (VTL) normalization
[9] during training. As we can see, the win from
introducing word-level rate dependency is still 1.9%, over
a baseline that was itself improved by 5.0%.
                              male   female   all
WER of baseline system        50.6   57.9     54.6
WER of rate-dependent system  49.2   55.6     52.7
Table 3:
Minimal pair comparison based on an improved
baseline system using a wider front end and VTL
normalization (% WER on the development set)
Another major addition to the evaluation system was
the introduction of multiword pronunciations. A
multiword is a high-frequency word bigram or trigram,
such as “a lot of”, that is handled as a single unit in the
vocabulary. By using handcrafted phonetic pronunciations
describing various kinds of pronunciation reduction
phenomena for these multiwords, we achieved better
modeling of crossword coarticulation. In SRI’s March
2000 evaluation system, 1389 multiwords were included
in the dictionary, for a win of about 4.0% absolute on top
of the improved baseline system in Table 3 [8].
We tried applying our rate-dependent modeling
approach to the multiword-augmented system by treating
the multiwords as ordinary words. In this case, we
obtained a smaller win of 0.5%, as shown in Table 4.
(Compared to Table 3, a small part of the baseline WER
reduction -- about 1.3% absolute -- comes from other
improvements, such as variance normalization and
pronunciation probabilities.)
                              male   female   all
WER of baseline system        44.3   53.3     49.3
WER of rate-dependent system  43.6   53.0     48.8
Table 4:
Minimal pair comparison based on a multiword-
augmented baseline system on the development set
There are several possible reasons for the diminished
effectiveness of ROS modeling with multiwords. First,
each multiword is given multiple parallel pronunciations
reflecting both full and reduced forms. This by itself
models fast and slow speech variants to some extent.
However, since this affects only the 1389 multiwords
(covering about 40% of word tokens), there should still be
room for improvement from rate-dependent modeling.
Second, by treating multiwords as ordinary words, we fail
to model the rate variation occurring within the
multiwords, and thus may influence the quality of the
rate-dependent acoustic models. Third, in our current
implementation, the introduction of multiwords made the
search much more expensive than before; rate-dependent
modeling on top of the multiword dictionary made this
problem even worse, and may have produced a loss in
performance due to search pruning.
Based on the above analysis, we tested another scheme:
instead of treating multiwords as ordinary words we
trained them with multiword-specific phone units, that is,
using separate phonetic models to describe the
multiwords. Similar to the original approach, we trained
three classes of phone models simultaneously: fast models
for ordinary words, slow models for ordinary words, and
a separate set of phone models trained only on the
multiword data. With this approach, we improved the
WER reduction to 0.7%, as shown in Table 5.
                              male   female   all
WER of baseline system        44.3   53.3     49.3
WER of rate-dependent system  43.6   52.6     48.6
Table 5:
Minimal pair comparison on the development set
between the multiword-augmented baseline system and the
rate-dependent system with multiword-specific phone models
Finally, we replicated the same experiment on the
March 2000 Hub-5 evaluation data set, comprising 4466
utterances from 80 speakers (29 male, 51 female). As
shown in Table 6, we again obtained a win of 0.7%
absolute, which is statistically significant for this data set.
                              male   female   all
WER of baseline system        40.0   41.8     41.2
WER of rate-dependent system  39.7   41.0     40.5
Table 6:
Minimal pair comparison on the March 2000 Hub-5
evaluation set between the multiword-augmented baseline
system and the rate-dependent system with multiword-specific
phone models
5. PRELIMINARY RESULTS WITH RATE-
DEPENDENT PRONUNCIATION MODELING
In the above experiments, we used identical
pronunciations but different phone types to model fast
versus slow speech. However, some pronunciation
variation phenomena, such as coarticulation and
reduction, are very possibly related to ROS. In our
previous research on the Hub-4 corpus [3], we proposed
the notion of “zero-length phone” to model highly
coarticulated pronunciations. The basic idea is to allow
certain short phones to be skipped during search, while
preserving their contextual influence on their neighboring
phones. By letting different phones in a pronunciation be
zero-length, we can generate different pronunciation
variants for that word. By using rate-dependent
pronunciations, we hope to better model pronunciation
variation than by using rate-independent pronunciations.
In [3], we allowed certain types of monophones to
optionally be zero-length and used the recognizer to
collect the pronunciations that actually occurred, but did
not retrain the acoustic models. In this paper, we
improved the method in several aspects. First, we
analyzed the duration distribution for each type of
triphone based on forced Viterbi alignments of the
training data with rate-independent acoustic models. We
defined a pool of triphones that are candidates for
realization as zero-length phones by using the following
criterion: each triphone must have (1) at least 30 instances
in the training corpus and (2) more than a 35% probability
of having a duration of three frames. Note that since we used
three-state HMMs for all triphones, three frames is the
minimum duration for a triphone. Then, based on forced
Viterbi alignments, we collected pronunciation variants
for each word that has at least five instances in the
training data. If a triphone in a pronunciation instance
belongs to the pool of zero-length phone candidates, and
its duration is exactly three frames, we convert this
triphone to a zero-length phone, and save the resulting
pronunciation for that word. To make this method robust
with limited training data, we used both male and female
data to obtain the pronunciation variants, and only kept
those that occurred at least twice in the training corpus.
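The candidate-selection criterion can be sketched directly from the two conditions above; the triphone names and duration counts are invented for illustration.

```python
# Sketch of the zero-length-phone candidate pool: a triphone qualifies
# if it has at least 30 training instances and more than 35% of them
# last exactly 3 frames (the minimum duration for a 3-state HMM).

MIN_INSTANCES = 30
MIN_FRACTION_AT_FLOOR = 0.35
FLOOR_FRAMES = 3  # minimum duration for a 3-state triphone HMM

def zero_length_candidates(duration_counts):
    """duration_counts: {triphone: {duration_frames: count}}."""
    pool = set()
    for tri, counts in duration_counts.items():
        n = sum(counts.values())
        if (n >= MIN_INSTANCES
                and counts.get(FLOOR_FRAMES, 0) / n > MIN_FRACTION_AT_FLOOR):
            pool.add(tri)
    return pool

counts = {
    "t-ax+d": {3: 20, 4: 15, 5: 5},  # 40 instances, 50% at 3 frames
    "aa-r+t": {3: 5, 6: 40, 8: 15},  # only ~8% at 3 frames
    "rare":   {3: 10},               # too few instances
}
print(sorted(zero_length_candidates(counts)))  # prints ['t-ax+d']
```

A pronunciation variant is then generated whenever an aligned instance of a pooled triphone lasts exactly the three-frame floor, the case where the phone is most plausibly reduced away.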
With this pronunciation dictionary, we trained the rate-
dependent acoustic models in the same way as described
in Section 3, and assigned each pronunciation a
probability based on count information. To accommodate
the Viterbi search algorithm, the pronunciation
probabilities are scaled in such a way that the most
frequently used pronunciation for a word has probability
1. Note that since we used different word tokens for fast
and slow speech in training, the same pronunciation
would have different probabilities for fast versus slow
speech. By using this approach, we achieved rate-
dependent pronunciation modeling.
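The count-based scaling can be sketched as follows; the per-rate-class word key and the variant counts are invented for illustration.

```python
# Sketch of the pronunciation-probability scaling described above:
# within each word entry (kept separate per rate class), divide each
# variant's count by the count of the most frequent variant, so the top
# variant gets probability 1 and incurs no penalty in the Viterbi search.

def scale_pron_probs(variant_counts):
    """variant_counts: {word: {pronunciation: count}} -> scaled probs."""
    scaled = {}
    for word, counts in variant_counts.items():
        top = max(counts.values())
        scaled[word] = {pron: c / top for pron, c in counts.items()}
    return scaled

counts = {"WORD_fast": {"w_f er_f d_f": 90, "w_f er_f": 45}}
print(scale_pron_probs(counts))
# {'WORD_fast': {'w_f er_f d_f': 1.0, 'w_f er_f': 0.5}}
```

Because fast and slow tokens are counted separately during training, the same variant can end up with different scaled probabilities in the fast and slow entries, which is what makes the pronunciation model rate dependent.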
As a preliminary experiment, we applied this method to
the system without multiwords that was introduced in
Section 4, Table 3. Table 7 shows a 0.3% additional
improvement brought by the rate-dependent
pronunciation modeling with zero-length phones.
Altogether, we achieved 2.2% absolute WER reduction
with respect to the rate-independent baseline.
                                                male   female   all
Rate-independent system                         50.6   57.9     54.6
Using rate-dependent phones                     49.2   55.6     52.7
Using rate-dependent phones and pronunciations  48.5   55.6     52.4
Table 7:
WER of a rate-independent baseline system, a system
with rate-dependent phones and a system with both rate-
dependent phones and pronunciations
6. SUMMARY
We proposed a rate-dependent acoustic modeling
scheme, which is able to model within-sentence speech
rate variation, and does not rely on ROS estimation prior
to recognition. Experiments show that this method results
in a 1.9% (absolute) word error rate reduction on a
conversational telephone speech test set. When combined
with multiword pronunciation modeling, our method led
to a win of 0.7% on the same data set, and a statistically
significant win of 0.7% on the March 2000 Hub-5
evaluation set. The combination of rate-dependent
acoustic models with rate-dependent pronunciations
containing zero-length phones further improves the rate-
dependent recognition system.
REFERENCES
[1] M.A. Siegler and R.M. Stern, "On the Effects of Speech Rate
    in Large Vocabulary Speech Recognition Systems," Proc.
    ICASSP, vol. 1, pp. 612-615, 1995.
[2] N. Mirghafori, E. Fosler, and N. Morgan, "Towards
    Robustness to Fast Speech in ASR," Proc. ICASSP, vol. 1,
    pp. 335-338, 1996.
[3] J. Zheng, H. Franco, F. Weng, A. Sankar, and H. Bratt,
    "Word-level Rate-of-Speech Modeling Using Rate-Specific
    Phones and Pronunciations," Proc. ICASSP, vol. 3, pp.
    1775-1778, 2000.
[4] N. Morgan and E. Fosler, "Combining Multiple Estimators
    of Speaking Rate," Proc. ICASSP, vol. 2, pp. 729-732, 1995.
[5] V. Digalakis, P. Monaco, and H. Murveit, "Genones:
    Generalized Mixture Tying in Continuous Hidden Markov
    Model-based Speech Recognizers," IEEE TSAP, vol. 4, no. 4,
    pp. 281-289, 1996.
[6] H. Murveit, J. Butzberger, V. Digalakis, and M. Weintraub,
    "Large-Vocabulary Dictation Using SRI's DECIPHER(TM)
    Speech Recognition System: Progressive-Search
    Techniques," Proc. ICASSP, vol. 2, pp. 319-322, 1993.
[7] V. Digalakis and L.G. Neumeyer, "Speaker Adaptation
    Using Combined Transformation and Bayesian Methods,"
    IEEE TSAP, vol. 4, no. 4, pp. 294-300, 1996.
[8] A. Stolcke, H. Bratt, J. Butzberger, H. Franco, V.R. Rao
    Gadde, M. Plauche, C. Rickey, E. Shriberg, K. Sonmez,
    F. Weng, and J. Zheng, "The SRI March 2000 Hub-5
    Conversational Speech Transcription System," Proc. NIST
    Speech Transcription Workshop, College Park, MD, May 2000.
[9] S. Wegmann, D. McAllaster, J. Orloff, and B. Peskin,
    "Speaker Normalization on Conversational Telephone
    Speech," Proc. ICASSP, vol. 1, pp. 339-341, 1996.