RATE-OF-SPEECH MODELING FOR LARGE VOCABULARY
CONVERSATIONAL SPEECH RECOGNITION
Jing Zheng, Horacio Franco and Andreas Stolcke
Speech Technology and Research Laboratory
SRI International, Menlo Park, CA, USA
{zj, hef, stolcke}@speech.sri.com
ABSTRACT
Variations in rate of speech (ROS) produce changes in both
spectral features and word pronunciations that affect automatic
speech recognition (ASR) systems. To deal with these ROS
effects, we propose to use parallel, rate-specific, acoustic
models: one for fast speech, the other for slow speech. Rate
switching is permitted at word boundaries, to allow modeling
within-sentence speech rate variation, which is common in
conversational speech. Due to the parallel structure of rate-
specific models and the maximum likelihood decoding method,
we do not need high-quality ROS estimation before recognition,
which is usually hard to achieve. In this paper, we evaluate our
approach on a large-vocabulary conversational speech
recognition (LVCSR) task over the telephone, with several
minimal pair comparisons based on different baseline systems.
Experiments show that on a development set for the 2000 Hub-5
evaluation, introducing word-level ROS-dependent models
results in a 1.9% absolute win over a baseline system without
multiword pronunciation modeling, and a 0.7% absolute win
over a baseline system that incorporates a 4.0% absolute win
from multiword pronunciation modeling. The combination of
rate-dependent acoustic models with rate-dependent
pronunciations obtained by using a data-driven approach is also
explored and shown to produce an additional win.
1. INTRODUCTION
Rate of speech (ROS) is an important factor that
affects the performance of a transcription system [1],[2].
Possible reasons are that some features commonly used in
recognition systems are duration related and clearly
influenced by speech rate, such as delta and delta delta
features, and that some pronunciation phenomena such as
coarticulation and reduction are also speech rate related.
Thus, using rate-dependent acoustic models seems to be a
promising way to improve robustness against speech rate
variation.
In previous research work, rate-dependent acoustic
models were often used at the sentence level. In the
typical framework, an input utterance was first classified
as fast or slow using a ROS estimator, and then fed to a
rate-specific system that was tuned to fast or slow speech
[2]. This method has two drawbacks. First, it presumes
that the speech rate within an utterance is uniform, which
is often not the case in conversational speech. In our
earlier research work on broadcast news [3], we found
that speech rate variation within sentences is common,
and thus we proposed to use a more local rate dependency
for the acoustic models. Second, this approach is based on
sequential classification, so errors in the ROS
classification will most likely trigger errors in the
recognition step. This paper proposes a new approach:
rate-dependent acoustic modeling at the word-level.
Under this approach, each typical word is given two
parallel rate-specific pronunciations: a fast-version
pronunciation and a slow-version pronunciation, each
consisting of rate-specific phones. The recognizer is
allowed to select the fast or the slow pronunciation for
each word automatically during search, based on the
maximum likelihood criterion. This way, we can model
the within-sentence speech rate variation, and avoid the
requirement of pre-recognition ROS classification. To
train the rate-specific phone models, we use a duration-
based ROS measure to partition the training data into rate-
specific categories. Due to the availability of training
transcriptions, robust and accurate ROS estimation for
training data can be achieved.
We also explore a new method to model rate-
dependent pronunciation variation. Based on the concept
of a zero-length phone [3], we enable short phones to be
skipped without changing the contexts of neighboring
phones. A data-driven method is used to generate an
expanded rate-dependent pronunciation dictionary.
In Section 2 we introduce the ROS measure used for
partitioning the training data. In Section 3 we show the
experimental results of rate-dependent acoustic modeling
based on SRI’s 1998 Hub-5 evaluation system, and
compare different training approaches. In Section 4 we
describe the work for the March 2000 Hub-5 evaluation
system, and specifically address the effect of multiwords
in rate-dependent acoustic modeling. In Section 5 we
report initial results on rate-dependent pronunciation
modeling. Finally, in Section 6, we summarize our results.
2. ROS MEASURE
Two methods are typically used to estimate the ROS of an
input utterance. One is based on phone durations, which
are often obtained from phone-level segmentations by
using forced alignments. When the utterance transcription
is known, this duration-based method can provide robust
ROS estimation [2]; however, when the transcription is
unknown, we can only use the hypothesis from a prior
recognition run, whose quality is hard to guarantee. The
second method involves estimating ROS directly from the
waveform or acoustic features of the input utterance [4].
To achieve robust ROS estimation, the computation is
often based on a data window with sufficient length.
Under our proposed approach, to train the rate-specific
models we need to partition the training data into rate-
specific categories at the word level, and we therefore
need the ROS for each word to be estimated locally. The
output of this process should give each word in the
training transcription a rate class label. As a first step
toward ROS modeling, we decided to use only two ROS
classes: “fast” or “slow”. Since we only need to compute ROS for
the training data that have transcriptions, it is relatively
straightforward to obtain the duration of each word and its
component phones by computing forced Viterbi
alignments, and then applying duration-based ROS
estimation methods.
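As an illustration of this step, the sketch below accumulates word and phone durations from an alignment; the tuple format and function names are assumptions for exposition, not the actual DECIPHER tooling.

```python
from collections import defaultdict

def word_durations(alignment):
    """Sum component phone durations to get each word token's duration.

    `alignment` is assumed to be a list of (word_token_id, phone,
    start_frame, end_frame) tuples from a forced Viterbi alignment,
    with 10 ms frames.
    """
    word_durs = defaultdict(int)     # token id -> duration in frames
    word_phones = defaultdict(list)  # token id -> [(phone, frames), ...]
    for token_id, phone, start, end in alignment:
        frames = end - start
        word_durs[token_id] += frames
        word_phones[token_id].append((phone, frames))
    return word_durs, word_phones
```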
Absolute ROS measures, such as phones per second
(PPS) and inverse mean duration (IMD) [2], were often
used in previous work. However, we felt that these
measures are not informative enough since they do not
consider the fact that different types of phones have
different duration distributions. Fig. 1 illustrates the
duration distributions of some different categories of
monophones estimated from the training corpus. As we
can see, the duration distribution across different phone
types differs substantially. When taking PPS or IMD as
the ROS measure, words composed of short phones are
more easily treated as fast than those composed of long
phones, even though they are not actually spoken faster
than the normal rate. In our approach, we use a relative
ROS measure, $R_W(D)$, defined as a percentile of a word's
ROS distribution:

$$R_W(D) = P_W(d > D) = 1 - \sum_{d \le D} P_W(d) \qquad (1)$$

where $W$ is a given word, $D$ is the duration of $W$, and
$P_W(d)$ is the probability of that type of word having
duration $d$. $R_W(D)$ is the probability of $W$ having a
duration longer than $D$. The measure $R_W(D)$ always falls
within the range [0,1], and can be compared between
different word categories. However, in practice, $P_W(d)$ is
hard to estimate directly due to data sparseness. To
address this problem we assume that, within a word, the
duration distributions of its component subword units,
such as phones, are independent of each other. Thus, a
word's duration distribution equals the convolution of its
component subword unit distributions, which are easier to
estimate reliably from training data. We currently use
triphones as the subword units for ROS estimation.
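A minimal sketch of this estimate, assuming each triphone's duration PMF has already been collected as a normalized histogram over frame counts; the function names and the NumPy-based convolution are our own illustration, not the original implementation.

```python
import numpy as np

def word_duration_pmf(subunit_pmfs):
    """Convolve the duration PMFs of a word's component subword units
    (triphones here), assumed independent, into a word-level duration
    PMF. Each input PMF is indexed by duration in frames."""
    pmf = np.array([1.0])
    for unit_pmf in subunit_pmfs:
        pmf = np.convolve(pmf, unit_pmf)
    return pmf

def relative_ros(word_pmf, observed_frames):
    """Eq. (1): R_W(D) = P(d_W > D). Values near 1 mean the token is
    much shorter than typical for this word (fast); values near 0
    mean it is unusually long (slow)."""
    cutoff = min(observed_frames + 1, len(word_pmf))
    return float(word_pmf[cutoff:].sum())
```

Because $R_W(D)$ is a percentile, a value of 0.5 always means "typical duration for this word," regardless of whether the word is made of intrinsically short or long phones.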
Figure 1:
Duration distributions of different phone types
We used this measure to calculate the ROS for all
word tokens in the training data, and found that 80% of
sentences with five or more words have at least one word
belonging to the fastest one third and one word belonging
to the slowest one third of all the words. This suggests
that in conversational speech, speech rate is usually not
uniform within a sentence.
The measure defined in Eq. 1 can also be applied to
subword units, thus allowing us to calculate the ROS of
phones. Using this measure, we studied the phone ROS
variation within words vs. within sentences. Fig. 2 shows
a histogram of the standard deviation of the phone ROS
within words and within sentences for all training data.
The data suggest that the word is a better unit than the
sentence for ROS modeling, since the average phone-
level ROS variation within a word is significantly smaller
than within a sentence.
Figure 2: Histogram of the standard deviation of phone-level ROS:
within words (mean: 0.18) vs. within sentences (mean: 0.25)
3. RATE-DEPENDENT ACOUSTIC
MODELING
In our proposed method, each word is given parallel
pronunciations of fast- and slow-version phones. Both
fast- and slow-version pronunciations are initialized from
the original rate-independent version, with the simple
replacement of rate-independent phones by rate-specific
phones. For example, the original rate-independent
pronunciation of “WORD” is /w er d/. Consequently the
fast-version pronunciation is /w_f er_f d_f/ and the
slow-version is /w_s er_s d_s/, consisting of fast and slow
phones, respectively. The recognizer automatically finds the
pronunciations that maximize the likelihood score during
search, and thus avoids the need for ROS estimation
before recognition. In addition, the search algorithm is
allowed to select pronunciations of different rates across
word boundaries, thus coping with the problem of speech
rate variation within a sentence.
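A sketch of how such a lexicon expansion could look, assuming pronunciations are stored as phone lists and rate-specific phones are marked with hypothetical "_f"/"_s" suffixes:

```python
def expand_lexicon(lexicon):
    """Build the parallel rate-specific lexicon: every pronunciation is
    duplicated with fast ('_f') and slow ('_s') phone variants, and the
    decoder is free to pick either version per word during search."""
    expanded = {}
    for word, prons in lexicon.items():
        expanded[word] = (
            [[ph + "_f" for ph in pron] for pron in prons]
            + [[ph + "_s" for ph in pron] for pron in prons]
        )
    return expanded

# expand_lexicon({"word": [["w", "er", "d"]]}) ==
#     {"word": [["w_f", "er_f", "d_f"], ["w_s", "er_s", "d_s"]]}
```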
Acoustic Training
Our initial experiments were based on SRI's 1998 Hub-5
evaluation system, which uses continuous-density genonic
hidden Markov models (HMMs) [5]. The original
evaluation system used a multi-pass recognition strategy
[6]. For the sake of simplicity, we ran our experiments
with only the first-pass recognizer, based on gender-
dependent non-crossword genonic HMMs (1730 genones
with 64 Gaussians each for male, 1458 genones for
female) and a bigram grammar with a 33,275-word
vocabulary. The recognition lexicon was derived from the
CMU V0.4 lexicon with stress information stripped. The
recognizer used a two-pass (forward pass and backward
pass) Viterbi beam search algorithm; in the first pass a
lexical tree was used in the grammar backoff node to
speed up search. Below we report results from the
backward pass. The features used were 9 cepstral
coefficients (C1-C8 plus C0) with their first- and second-
order derivatives in 10ms time frames. The acoustic
training corpus, containing 121,000 male sentences and
149,000 female sentences, came from (A) Macrophone
telephone speech, (B) 3,094 conversation sides from the
BBN-segmented Switchboard-1 training set (with some
hand-corrections), and (C) 100 CallHome English training
conversations.
We first calculated the ROS for all the words in the
training corpus based on the above-mentioned measure,
sorted these words accordingly, and then split them into
two categories: fast and slow. The ROS threshold for
splitting was selected to achieve equal amounts of training
data for the fast and the slow speech. The training
transcriptions were labeled accordingly. We then prepared
a special training lexicon: words with a fast label were
given the fast-version pronunciation, and words with a
slow label the slow-version pronunciation. In this way, we
were able to train the fast and slow models
simultaneously.
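The following sketch illustrates this partitioning, assuming per-token $R_W(D)$ values from Eq. (1) are already available; the median split here is a stand-in for the equal-data threshold described above.

```python
import numpy as np

def label_rate_classes(tokens):
    """tokens: list of (utterance_id, word, ros) where ros = R_W(D)
    from Eq. (1). A median split approximates the equal-data
    threshold; low R_W(D) means an unusually long token, i.e. slow."""
    threshold = np.median([ros for _, _, ros in tokens])
    return [
        (utt, word, "slow" if ros < threshold else "fast")
        for utt, word, ros in tokens
    ]
```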
We used the DECIPHER genonic training tools to run
standard maximum likelihood estimation (MLE) gender-
dependent training [5] and obtained rate-dependent
models with 3233 genones for male speech and 2501
genones for female speech. The genone clustering for
rate-dependent models used the same information loss
threshold as the training of rate-independent models.
We compared the rate-dependent acoustic model with
the rate-independent acoustic model (baseline system) on
a development subset of the 1998 Hub-5 evaluation data,
consisting of 1143 sentences from 20 speakers (9 male, 11
female). Table 1 shows the word error rate (WER) for
both models. Note that all results reported here are based
on speaker-independent within-word triphone acoustic
models and bigram language models, and are therefore
not comparable with the full evaluation system.
                                      male   female   all
rate-independent model                55.3   63.4     59.8
rate-dependent model from training    52.9   61.9     57.9
Table 1:
Comparison between the baseline system with rate-
independent models and the system with rate-dependent
models (% WER on the development set)
Rate-dependent modeling brings an absolute WER
reduction of 1.9%, which is statistically significant. To
eliminate the possible effect of different numbers of
parameters, we adjusted the information loss threshold for
genone clustering to obtain another rate-independent
model that had a number of parameters similar to that of
the rate-dependent model. However, we did not observe
any improvement from the increased number of
parameters. This suggests the win is indeed due to the
introduction of rate dependency.
Adaptation vs. Standard Training
In our previous work on the Broadcast News corpus
(Hub-4) [3], instead of using the training method
described above, we trained the rate-dependent model
using a modified Bayesian adaptation scheme [7], by
adapting the rate-independent model to rate-specific data
to obtain rate-specific models. This was motivated by the
small amount of available training data relative to the
model size. In [3], we used a baseline system with a very
large model comprising 256,000 Gaussians, and classified
the training data into three categories: fast, slow, and
medium. For this model size, the training data was not
sufficient to perform standard training. However, in the
current task of Hub-5 telephone speech transcription we
had significantly more training data, and we used a
different strategy to partition the data into two classes
instead of three, yielding more training data for each rate
class. In addition, the optimal models we started with
were smaller. Thus, we were able to train the rate-
dependent model robustly with standard training methods.
For comparison we tested the Bayesian adaptation
approach that we used in [3] on the current training set.
Similar to [3], even though we used separate rate-specific
models for each triphone, we did not create separate
copies of the genones, but let the fast and slow models for
a given triphone share the same genone. In this way, we
used the same number of Gaussians for the rate-dependent
model as for the rate-independent model.
                                        male   female   all
rate-independent model                  55.3   63.4     59.8
rate-dependent model from adaptation    54.0   62.6     58.8
Table 2:
Comparison between the baseline system with rate-
independent model and the system with rate-dependent model
from adaptation (% WER on the development set)
Table 2 shows the results on the same development data
set we used in the previous section. We see that this
approach brings a win of 1.0% over the baseline, less than
the standard training scheme. This indicates that the
difference between fast and slow speech in the acoustic
space is significant, and that standard training might be
better than the previous adaptation scheme at capturing
this difference. In fact, standard training optimizes the
parameter tying for the rate-dependent model, reestimates
the HMM transition probabilities, and performs multiple
iterations of parameter reestimation. The adaptation
approach, on the other hand, does not recompute genonic
clustering, does not change the transition probabilities,
and includes only one iteration of reestimation for the
rate-dependent model on top of the rate-independent
model. These differences might explain why the
adaptation scheme did not perform as well as standard
training.
4. EXPERIMENTS IN THE 2000 NIST HUB-5
EVALUATION SYSTEM
For the March 2000 NIST Hub-5 benchmark, numerous
improvements were made to SRI's 1998 evaluation
system [8], and the baseline system had been enhanced
substantially. Below we show some minimal pair
experiments based on different baseline systems during
the development process. The baseline system in Table 3
used a wider-band front end (with 13 cepstral coefficients
instead of 9), and vocal tract length (VTL) normalization
[9] during training. As we can see, the win from
introducing word-level rate dependency is still 1.9%, over
a baseline that was itself improved by 5.0%.
                                  male   female   all
WER of baseline system            50.6   57.9     54.6
WER of rate-dependent system      49.2   55.6     52.7
Table 3:
Minimal pair comparison based on an improved
baseline system using a wider front end and VTL
normalization (% WER on the development set)
Another major addition to the evaluation system was
the introduction of multiword pronunciations. A
multiword is a high-frequency word bigram or trigram,
such as “a lot of”, that is handled as a single unit in the
vocabulary. By using handcrafted phonetic pronunciations
describing various kinds of pronunciation reduction
phenomena for these multiwords, we achieved better
modeling of crossword coarticulation. In SRI's March
2000 evaluation system, 1389 multiwords were included
in the dictionary, for a win of about 4.0% absolute on top
of the improved baseline system in Table 3 [8].
We tried applying our rate-dependent modeling
approach to the multiword-augmented system by treating
the multiwords as ordinary words. In this case, we
obtained a smaller win of 0.5%, as shown in Table 4.
(Compared to Table 3, a small part of the baseline WER
reduction -- about 1.3% absolute -- comes from other
improvements, such as variance normalization and
pronunciation probabilities.)
                                  male   female   all
WER of baseline system            44.3   53.3     49.3
WER of rate-dependent system      43.6   53.0     48.8
Table 4:
Minimal pair comparison based on a multiword-
augmented baseline system on the development set
There are several possible reasons for the diminished
effectiveness of ROS modeling with multiwords. First,
each multiword is given multiple parallel pronunciations
reflecting both full and reduced forms. This by itself
models fast and slow speech variants to some extent.
However, since this affects only the 1389 multiwords
(covering about 40% of word tokens), there should still be
room for improvement from rate-dependent modeling.
Second, by treating multiwords as ordinary words, we fail
to model the rate variation occurring within the
multiwords, and thus may influence the quality of the
rate-dependent acoustic models. Third, in our current
implementation, the introduction of multiwords made the
search much more expensive than before; rate-dependent
modeling on top of the multiword dictionary made this
problem even worse, and may have produced a loss in
performance due to search pruning.
Based on the above analysis, we tested another scheme:
instead of treating multiwords as ordinary words we
trained them with multiword-specific phone units, that is,
using separate phonetic models to describe the
multiwords. Similar to the original approach, we trained
three classes of phone models simultaneously: fast models
for ordinary words, slow models for ordinary words, and
a separate set of phone models trained only on the
multiword data. With this approach, we improved the
WER reduction to 0.7%, as shown in Table 5.
                                  male   female   all
WER of baseline system            44.3   53.3     49.3
WER of rate-dependent system      43.6   52.6     48.6
Table 5:
Minimal pair comparison on the development set
between the multiword-augmented baseline system and the
rate-dependent system with multiword-specific phone models
Finally, we replicated the same experiment on the
March 2000 Hub-5 evaluation data set, comprising 4466
utterances from 80 speakers (29 male, 51 female). As
shown in Table 6, we again obtained a win of 0.7%
absolute, which is statistically significant for this data set.
                                  male   female   all
WER of baseline system            40.0   41.8     41.2
WER of rate-dependent system      39.7   41.0     40.5
Table 6:
Minimal pair comparison on the March 2000 Hub-5
evaluation set between the multiword-augmented baseline
system and the rate-dependent system with multiword-specific
phone models
5. PRELIMINARY RESULTS WITH RATE-
DEPENDENT PRONUNCIATION MODELING
In the above experiments, we used identical
pronunciations but different phone types to model fast
versus slow speech. However, some pronunciation
variation phenomena, such as coarticulation and
reduction, are very possibly related to ROS. In our
previous research on the Hub-4 corpus [3], we proposed
the notion of a “zero-length phone” to model highly
coarticulated pronunciations. The basic idea is to allow
certain short phones to be skipped during search, while
preserving their contextual influences on their neighboring
phones. By letting different phones in a pronunciation be
zero-length, we can generate different pronunciation
variants for that word. By using rate-dependent
pronunciations, we hope to better model pronunciation
variation than by using rate-independent pronunciations.
In [3], we allowed certain types of monophones to
optionally be zero-length and used the recognizer to
collect the pronunciations that actually occurred, but did
not retrain the acoustic models. In this paper, we
improved the method in several aspects. First, we
analyzed the duration distribution for each type of
triphone based on forced Viterbi alignments of the
training data with rate-independent acoustic models. We
defined a pool of triphones that are candidates for
realization as zero-length phones by using the following
criterion: each triphone must have (1) at least 30 instances
in the training corpus and (2) more than a 35% probability
of having a duration of three frames. Note that since we used
three-state HMMs for all triphones, three frames is the
minimum duration for a triphone. Then, based on forced
Viterbi alignments, we collected pronunciation variants
for each word that has at least five instances in the
training data. If a triphone in a pronunciation instance
belongs to the pool of zero-length phone candidates, and
its duration is exactly three frames, we convert this
triphone to a zero-length phone, and save the resulting
pronunciation for that word. To make this method robust
with limited training data, we used both male and female
data to obtain the pronunciation variants, and only kept
those that occurred at least twice in the training corpus.
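A sketch of the candidate selection and variant collection just described; the input formats and the "*" zero-length marker are hypothetical, and the thresholds are the ones stated above.

```python
from collections import Counter

MIN_INSTANCES = 30   # criterion (1): at least 30 training instances
MIN_P3 = 0.35        # criterion (2): P(duration == 3 frames) > 35%
MIN_FRAMES = 3       # minimum duration of a 3-state HMM

def zero_length_pool(triphone_durations):
    """Select triphones eligible for realization as zero-length phones.
    `triphone_durations` maps triphone -> list of durations (frames)."""
    pool = set()
    for tri, durs in triphone_durations.items():
        if len(durs) >= MIN_INSTANCES:
            p3 = sum(d == MIN_FRAMES for d in durs) / len(durs)
            if p3 > MIN_P3:
                pool.add(tri)
    return pool

def collect_variants(instances, pool):
    """`instances`: (word, [(triphone, frames), ...]) per aligned token.
    Pool triphones realized at exactly the minimum duration are marked
    zero-length (hypothetical '*' suffix); variants seen fewer than
    twice in the training corpus are dropped."""
    counts = Counter()
    for word, phones in instances:
        pron = tuple(
            tri + "*" if tri in pool and frames == MIN_FRAMES else tri
            for tri, frames in phones
        )
        counts[(word, pron)] += 1
    return {key: n for key, n in counts.items() if n >= 2}
```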
With this pronunciation dictionary, we trained the rate-
dependent acoustic models in the same way as described
in Section 3, and assigned each pronunciation a
probability based on count information. To accommodate
the Viterbi search algorithm, the pronunciation
probabilities are scaled in such a way that the most
frequently used pronunciation for a word has probability
1. Note that since we used different word tokens for fast
and slow speech in training, the same pronunciation
would have different probabilities for fast versus slow
speech. By using this approach, we achieved rate-
dependent pronunciation modeling.
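The scaling itself is simple; a sketch, assuming per-variant counts have been collected separately for fast and slow word tokens so that the same pronunciation can end up with different probabilities in the two rate classes:

```python
def scale_pron_probs(variant_counts):
    """Scale pronunciation probabilities for one word (and one rate
    class) so the most frequent variant gets probability 1, as the
    Viterbi search requires; rarer variants are penalized in
    proportion to their relative frequency."""
    top = max(variant_counts.values())
    return {pron: count / top for pron, count in variant_counts.items()}
```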
As a preliminary experiment, we applied this method to
the system without multiwords that was introduced in
Section 4, Table 3. Table 7 shows a 0.3% additional
improvement brought by the rate-dependent
pronunciation modeling with zero-length phones.
Altogether, we achieved 2.2% absolute WER reduction
with respect to the rate-independent baseline.
                                                      male   female   all
Rate-independent system                               50.6   57.9     54.6
Using rate-dependent phones                           49.2   55.6     52.7
Using both rate-dependent phones and pronunciations   48.5   55.6     52.4
Table 7:
WER of a rate-independent baseline system, a system
with rate-dependent phones and a system with both rate-
dependent phones and pronunciations
6. SUMMARY
We proposed a rate-dependent acoustic modeling
scheme, which is able to model within-sentence speech
rate variation, and does not rely on ROS estimation prior
to recognition. Experiments show that this method results
in a 1.9% (absolute) word error rate reduction on a
conversational telephone speech test set. When combined
with multiword pronunciation modeling, our method led
to a win of 0.7% on the same data set, and a statistically
significant win of 0.7% on the March 2000 Hub-5
evaluation set. The combination of rate-dependent
acoustic models with rate-dependent pronunciations
containing zero-length phones further improves the rate-
dependent recognition system.
REFERENCES
[1] M. A. Siegler and R. M. Stern, "On the Effects of Speech Rate in Large Vocabulary Speech Recognition Systems," Proc. ICASSP, vol. 1, pp. 612-615, 1995.
[2] N. Mirghafori, E. Fosler, and N. Morgan, "Towards Robustness to Fast Speech in ASR," Proc. ICASSP, vol. 1, pp. 335-338, 1996.
[3] J. Zheng, H. Franco, F. Weng, A. Sankar, and H. Bratt, "Word-level Rate-of-Speech Modeling Using Rate-Specific Phones and Pronunciations," Proc. ICASSP, vol. 3, pp. 1775-1778, 2000.
[4] N. Morgan and E. Fosler, "Combining Multiple Estimators of Speaking Rate," Proc. ICASSP, vol. 2, pp. 729-732, 1995.
[5] V. Digalakis, P. Monaco, and H. Murveit, "Genones: Generalized Mixture Tying in Continuous Hidden Markov Model-based Speech Recognizers," IEEE Trans. Speech and Audio Processing, vol. 4, no. 4, pp. 281-289, 1996.
[6] H. Murveit, J. Butzberger, V. Digalakis, and M. Weintraub, "Large-Vocabulary Dictation Using SRI's DECIPHER(TM) Speech Recognition System: Progressive-Search Techniques," Proc. ICASSP, vol. 2, pp. 319-322, 1993.
[7] V. Digalakis and L. G. Neumeyer, "Speaker Adaptation Using Combined Transformation and Bayesian Methods," IEEE Trans. Speech and Audio Processing, vol. 4, no. 4, pp. 294-300, 1996.
[8] A. Stolcke, H. Bratt, J. Butzberger, H. Franco, V. R. Rao Gadde, M. Plauche, C. Rickey, E. Shriberg, K. Sonmez, F. Weng, and J. Zheng, "The SRI March 2000 Hub-5 Conversational Speech Transcription System," Proc. NIST Speech Transcription Workshop, College Park, MD, May 2000.
[9] S. Wegmann, D. McAllaster, J. Orloff, and B. Peskin, "Speaker Normalization on Conversational Telephone Speech," Proc. ICASSP, vol. 1, pp. 339-341, 1996.