
A comparative study of speaker adaptation techniques

Authors: Leonardo Neumeyer, Ananth Sankar, and Vassilios Digalakis
SRI International, Speech Technology and Research Laboratory, Menlo Park, CA 94025, USA
e-mail: {leo,sankar,vas}@speech.sri.com
ABSTRACT
In previous work, we showed how to constrain the estimation of continuous mixture-density hidden Markov models (HMMs) when the amount of adaptation data is small. We used maximum-likelihood (ML) transformation-based approaches and Bayesian techniques to achieve near-native performance when testing nonnative speakers of the recognizer language. In this paper, we study various ML-based techniques and compare experimental results on data sets with recordings from nonnative and native speakers of American English. We divide the transformation-based techniques into two groups. In feature-space techniques, we hypothesize an underlying transformation in the feature space that results in a transformation of the HMM parameters. In model-space techniques, we hypothesize a direct transformation of the HMM parameters. In the experimental section we show how the combination of the best ML and Bayesian adaptation techniques results in significant improvements in recognition accuracy. All the experiments were carried out with SRI's DECIPHER(TM) speech recognition system [1][2].
1. INTRODUCTION
Automatic speech recognition (ASR) performance degrades rapidly when a mismatch exists between the training and the testing conditions. For example, the performance of ASR systems trained using native speakers degrades dramatically when tested on nonnative speakers [3]. Current methods to minimize the effect of such a mismatch include ML transformation-based approaches [4][5][6] and Bayesian adaptation [3][7][8].
This work focuses on comparing various ML transformation-based techniques and finding the optimum method for a given task. We also investigate combinations of these adaptation techniques.
2. THEORY
2.1. ML Adaptation Techniques
In ML transformation-based techniques [4][5][6], adaptation is achieved via a transformation of the speaker-independent observation densities. The transformation parameters $\theta$ are obtained by maximizing the likelihood of the adaptation data $X$ given the corresponding word string $W$:

$$\hat{\theta} = \arg\max_{\theta}\, p(X \mid \theta, W). \qquad (1)$$

A separate transformation is used for each group of Gaussian densities. The number of such transformations can be adjusted based on the available amount of adaptation data [4].
We assume that the speaker-independent (SI) HMM has state observation densities of the form

$$p_{SI}(y_t \mid s_t) = \sum_i p(\omega_i \mid s_t)\, N(y_t;\ \mu_{ig},\ \Sigma_{ig}), \qquad (2)$$

where $g$ is the index of the Gaussian codebook used by state $s_t$. In this paper we investigate the adaptation of this system by jointly transforming all the Gaussians of each codebook.
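For concreteness, Eq. (2) can be evaluated directly once the codebook for a state is known. The sketch below is an illustrative reconstruction, not DECIPHER code: the function and array names are ours, and diagonal covariances are assumed.

```python
import numpy as np

def state_likelihood(y, weights, means, variances):
    """Evaluate the mixture density of Eq. (2) for one frame y:
    p_SI(y | s_t) = sum_i p(w_i | s_t) * N(y; mu_ig, Sigma_ig),
    where codebook g is shared by state s_t. Diagonal covariances assumed.
    Shapes: y (D,), weights (I,), means (I, D), variances (I, D)."""
    d = y.shape[0]
    # log of the normalization term of each Gaussian component
    log_norm = -0.5 * (d * np.log(2.0 * np.pi) + np.sum(np.log(variances), axis=1))
    # log of the exponent of each component
    log_expo = -0.5 * np.sum((y - means) ** 2 / variances, axis=1)
    return float(np.dot(weights, np.exp(log_norm + log_expo)))
```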
As in [5], we consider transformations in two spaces: 1) the feature space, and 2) the model space. In the feature-space approach, the "original" features $y_t$ are transformed to the observed features $x_t$ by a hypothesized transformation $f_\nu(y_t)$, where $\nu$ are the parameters to be estimated. In the model-based approach, the original model $\lambda_y$ is transformed to the new model $\lambda_x$ by $\lambda_x = g_\eta(\lambda_y)$, where $\eta$ are the parameters to be estimated.
The next two subsections describe the proposed transformation methods. Table 1 summarizes the methods.

Table 1. Transformations of the means and variances for the various methods.

Feature-space methods:
  I    Diagonal Affine:      $\mu_i^{SA} = a_{ii}\,\mu_i^{SI} + b_i$      $\sigma_i^{2\,SA} = a_{ii}^{2}\,\sigma_i^{2\,SI}$
  II   Additive:             $\mu_i^{SA} = \mu_i^{SI} + b_i$              $\sigma_i^{2\,SA} = \sigma_i^{2\,SI}$
  III  Stochastic Additive:  $\mu_i^{SA} = \mu_i^{SI} + \mu_{b,i}$        $\sigma_i^{2\,SA} = \sigma_i^{2\,SI} + \sigma_{b,i}^{2}$

Model-space methods:
  IV   Full Affine:          $\mu^{SA} = A_f\,\mu^{SI} + b$               $\sigma_i^{2\,SA} = \sigma_i^{2\,SI}$
  V    Structured Affine:    $\mu^{SA} = A_s\,\mu^{SI} + b$               $\sigma_i^{2\,SA} = \sigma_i^{2\,SI}$
  VI   Scaled Variance:      $\mu_i^{SA} = \mu_i^{SI} + b_i$              $\sigma_i^{2\,SA} = \alpha_i\,\sigma_i^{2\,SI}$
2.1.1. Transformations in the Feature Space
Method I (Diagonal Affine Transform) [4]. In this method we assume that, given the HMM state index $s_t$, the observed features $x_t$ can be obtained from the original features $y_t$ through the transformation

$$x_t = A_g y_t + b_g. \qquad (3)$$

Under this assumption, the speaker-adapted (SA) observation densities will have the form

$$p_{SA}(x_t \mid s_t) = \sum_i p(\omega_i \mid s_t)\, N(x_t;\ A_g \mu_{ig} + b_g,\ A_g \Sigma_{ig} A_g^{T}), \qquad (4)$$

where the parameters $\{A_g, b_g\}$, $g = 1, \ldots, N_g$, are estimated using the ML approach of Eq. (1), and $N_g$ is the number of distinct transformations. We use the EM algorithm [9] to derive the ML estimates of the parameters $A_g$ and $b_g$. When $A_g$ is a diagonal matrix, closed-form estimates can be obtained for $A_g$ and $b_g$, as described in [4][5]. When $A_g$ is a full matrix, however, the estimation problem is more tedious. In this paper we use a diagonal matrix.
Method II (Additive Transform). This case is identical to Method I with $A_g = I$.
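To illustrate how Methods I and II act on the model parameters, the following sketch (ours, with hypothetical names, not the original implementation) applies a diagonal affine transform to one Gaussian codebook, producing the SA means and variances implied by Eq. (4); Method II is recovered by fixing the scale to one.

```python
import numpy as np

def adapt_codebook_diagonal(means, variances, a, b):
    """Method I (Diagonal Affine): with A_g = diag(a), Eq. (4) maps each
    codebook Gaussian N(mu, diag(var)) to N(a * mu + b, diag(a^2 * var)).
    means, variances: (I, D); a, b: (D,)."""
    sa_means = a * means + b
    sa_variances = (a ** 2) * variances
    return sa_means, sa_variances

# Method II (Additive) is the special case A_g = I, i.e. a = np.ones(D).
```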
Method III (Stochastic Additive Transform) [5]. In this case we model the acoustic mismatch using a stochastic transformation. The stochastic transform is given by

$$x_t = y_t + b_g(\mu_{b,g}, \sigma_{b,g}), \qquad (5)$$

where $b_g(\mu_{b,g}, \sigma_{b,g})$ is a Gaussian random variable with mean $\mu_{b,g}$ and variance $\sigma_{b,g}^{2}$. In this context, we can view Method II as a special case in which the additive term $b_g$ is a deterministic parameter.
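Since the additive term in Eq. (5) is Gaussian and independent of the original features, the adapted observation density follows by convolving the two densities. Consistent with the mean and variance transforms listed in Table 1, this gives (our one-line derivation):

$$p_{SA}(x_t \mid s_t) = \sum_i p(\omega_i \mid s_t)\, N\!\left(x_t;\ \mu_{ig} + \mu_{b,g},\ \Sigma_{ig} + \sigma_{b,g}^{2} I\right).$$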
2.1.2. Transformations in the Model Space
Method IV (Full Affine Transform) [6]. An alternative to Method I is to transform the means of the Gaussian density functions using a full matrix and leave the variances unchanged. The advantage of using a full matrix is that we can model the correlation between feature components, at the expense of a quadratic increase in the number of adaptation parameters. The observation densities in this case will have the form

$$p_{SA}(x_t \mid s_t) = \sum_i p(\omega_i \mid s_t)\, N(x_t;\ A_g \mu_{ig} + b_g,\ \Sigma_{ig}). \qquad (6)$$
Method V (Structured Affine Transform). Our continuous-density HMM system uses a single feature vector stream, which is the augmented vector composed of three basic feature vectors: the cepstrum, the first derivative, and the second derivative of the cepstrum. The structured matrix $A_g$ has non-zero values only in the elements whose rows and columns correspond to the same basic vector. For example, in Eq. (6) an element of the mean vector corresponding to a cepstrum component will only be predicted by the mean subvector that corresponds to the cepstrum and will not depend on the delta components.
The motivation for proposing this method is that the estimation of $A_g$ involves inverting a sample correlation matrix. The dependencies between the cepstrum and its derivatives may result in an ill-conditioned sample correlation matrix, resulting in bad estimates.
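To make the block structure of Method V concrete, the sketch below builds the mask of allowed non-zero entries, assuming our 39-dimensional stream of 13 cepstral features (12 cepstra plus energy) and their first and second derivatives; the names and the mask-based formulation are illustrative, not the original code.

```python
import numpy as np

def structured_mask(block_dims=(13, 13, 13)):
    """Method V: 0/1 mask of the structured matrix A_g. An entry may be
    non-zero only if its row and column fall in the same basic-feature
    block (cepstrum, first derivative, second derivative)."""
    d = sum(block_dims)
    mask = np.zeros((d, d))
    start = 0
    for size in block_dims:
        mask[start:start + size, start:start + size] = 1.0
        start += size
    return mask

# A structured mean update then reads: mu_SA = (A * structured_mask()) @ mu_SI + b,
# so each block of A can be estimated (and its sample correlation matrix
# inverted) independently of the others.
```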
Method VI (Scaled Variance Transform) [5]. In this
case we transform the means using an additive shift and the
variance using a scaling factor. The difference between this
approach and Method I is that, in this case, the scale factor only
affects the variance and is not tied to the scaling of the means.
2.2. Bayesian Adaptation Technique
In the Bayesian adaptation approach, the prior informa-
tion is encapsulated in the SI models [3][7][8]. The Bayesian
algorithms asymptotically converge to the speaker-dependent
performance as the amount of adaptation speech increases.
However, the adaptation rate is usually slow.
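To see why the convergence is slow, recall the standard MAP estimate of a Gaussian mean [7][8], which interpolates between the prior mean $\mu_0$ (here, the SI or transformed mean) and the sample mean of the adaptation frames; with occupation probabilities $\gamma_t$ and prior weight $\tau$,

$$\hat{\mu} = \frac{\tau\,\mu_0 + \sum_t \gamma_t\, x_t}{\tau + \sum_t \gamma_t},$$

so with little adaptation data the estimate stays near the prior, and it approaches the ML estimate only as $\sum_t \gamma_t$ grows.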
2.3. Combined Adaptation Technique
Finally, we use a combination of ML and Bayesian techniques to achieve the quick adaptation characteristics of the ML transformation-based methods together with the asymptotic properties of the Bayesian methods [3]. In this approach we first use the ML transformation-based method to adapt the SI models to the new speaker. These adapted models are then used as priors for the Bayesian adaptation method. The advantage of this approach is
that the priors obtained by the ML transformation method are
more closely matched to the observed data than the SI models.
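A toy numerical sketch of this two-stage pipeline, restricted to Gaussian means, is given below. It is our illustration, not the DECIPHER implementation: a single global shift stands in for the ML transformation stage, and a simple MAP mean update stands in for the Bayesian stage.

```python
import numpy as np

def combined_adapt_means(si_means, frames, assignments, tau=10.0):
    """Section 2.3 in miniature, for means only.
    Stage 1 (ML transform): one global additive shift moves all SI means
    toward the adaptation data with very few parameters.
    Stage 2 (Bayesian/MAP): per-Gaussian refinement using the shifted
    means as priors: mu = (tau * prior + sum_x) / (tau + n).
    si_means: (I, D); frames: (T, D); assignments: (T,) Gaussian indices."""
    # Stage 1: global shift = mean residual of the assigned frames
    shift = np.mean(frames - si_means[assignments], axis=0)
    priors = si_means + shift
    # Stage 2: MAP update of each Gaussian that received data
    sa_means = priors.copy()
    for i in np.unique(assignments):
        x = frames[assignments == i]
        sa_means[i] = (tau * priors[i] + x.sum(axis=0)) / (tau + len(x))
    return sa_means
```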
3. EXPERIMENTS
Experiments were carried out using SRI's DECIPHER(TM) speech recognition system configured with a six-feature front end that outputs 12 cepstral coefficients, cepstral energy, and their first- and second-order derivatives. The cepstral features are computed from a fast Fourier transform (FFT) filterbank, and subsequent cepstral-mean normalization is performed on a sentence basis. We used genonic HMMs with an arbitrary degree of Gaussian sharing across different HMM states, as described in [2]. The SI continuous HMM systems that we used as seed models for adaptation were gender-dependent, trained on 140 speakers and 17,000 sentences for each gender. Each of the two systems had 12,000 context-dependent phonetic models that shared 500 Gaussian codebooks (1,000 in the native-speaker experiments) with 32 Gaussian components per codebook. For testing, we used the Wall Street Journal (WSJ) task [10]. For fast experimentation, we used the progressive search framework [1]: an initial SI recognizer with a bigram language model outputs word lattices for all the utterances in the test set. These word lattices are then rescored using SA models. We used the baseline WSJ 5,000-word (20,000-word for native-speaker experiments), closed-vocabulary bigram and trigram language models provided by the MIT Lincoln Laboratory. The trigram language model was used in the N-best rescoring paradigm, by rescoring the list of the N-best sentence hypotheses generated using the bigram language model.
3.1. Nonnative Speakers
We evaluated the adaptation algorithms on the 1994
“Spoke 3” task of the phase-1, large-vocabulary WSJ corpus
[11][12]. For the first set of experiments we created a subset of the dev94 test set consisting of 5 nonnative speakers, with 20 test sentences and 20 adaptation sentences per speaker. A bigram language model was used to compare performance between the different adaptation methods. The experimental results are
shown in Table 2. In the second column we indicate the adaptation method used (IV + III means we adapted using Method IV followed by Method III). All experiments were optimized for the number of transformations, and only the best result is shown. The main conclusions from these experiments can be summarized as follows:
- Adapting the means with a Full transform produces better results (7% improvement) than adapting the means and the variances with a Diagonal transform (NNat2 vs NNat7).
- Bayesian adaptation helps when combined with ML adaptation, in both the Diagonal and Full transform cases. In the Diagonal case (NNat2 vs NNat3), we obtained an 8% improvement; in the Full case (NNat7 vs NNat8), we obtained an 18% improvement. Bayesian adaptation did not help in NNat11 when compared to NNat10.
- The Stochastic Additive transform is more effective than the deterministic Additive transform (NNat5 vs NNat4). In NNat10 and NNat11, the stochastic transform was used after the means were adapted with the Full Affine transform.
- The Structured Affine transform produced an improvement of 8% compared to the Full Affine case (NNat7 vs NNat9).
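For reference, the relative improvements quoted above follow the usual convention

$$\text{relative improvement} = \frac{\text{WER}_{\text{base}} - \text{WER}_{\text{new}}}{\text{WER}_{\text{base}}},$$

so, for example, NNat2 vs NNat7 gives $(18.9 - 17.5)/18.9 \approx 7.4\%$, and NNat7 vs NNat8 gives $(17.5 - 14.4)/17.5 \approx 18\%$.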
To see how these methods generalize to a larger data set and more adaptation sentences, we used the best techniques on the full WSJ Spoke 3 development and evaluation sets. After some initial tests we decided to use Method IV, followed by Method III, followed by Bayesian adaptation. The results are presented in Table 3. These results show how the Full matrix transform and the Stochastic transform produced improvements of 20% on the dev94 set and 7% on the eval94 set over Method I + Bayes.

Table 2. Word error rates for various supervised adaptation methods on a subset of the WSJ Spoke 3 (nonnatives) dev94 set using a bigram language model. Twenty adaptation sentences are used per speaker.

  Expt    Method                Number of Transforms   Word Error Rate (%)
  NNat1   Speaker Independent   NA                     24.9
  NNat2   I                     162                    18.9
  NNat3   I + Bayes             162                    17.4
  NNat4   II                    162                    19.2
  NNat5   III                   162                    17.9
  NNat6   VI                    162                    18.5
  NNat7   IV                    5                      17.5
  NNat8   IV + Bayes            5                      14.4
  NNat9   V                     30                     16.1
  NNat10  IV + III              10/200                 15.0
  NNat11  IV + III + Bayes      10/200                 15.2

Table 3. Speaker-independent and speaker-adapted word error rates on the WSJ Spoke 3 benchmark test using a trigram language model. Forty adaptation sentences are used per speaker.

  Data Set    SI     SA (I + Bayes)   SA (IV + III + Bayes)
  S3 Dev 94   23.1   13.2             10.5
  S3 Eval 94  23.2   11.3             10.5
3.2. Native Speakers
Some of the adaptation methods described in this paper were also tested on native speakers. We used 10 native speakers on the 20,000-word, closed-vocabulary WSJ task (a total of 230 test sentences) and 40 adaptation sentences per speaker. The results are presented in Table 4. Unlike the nonnative case, we did not see a significant improvement after adapting the models using Method I (Nat2). The Full Affine transform (Nat3), however, produced a significant improvement of 14% after adaptation. Further improvement was gained when using the Structured Affine transform (Nat3 vs Nat5). Bayesian adaptation produced some improvement in the Full case (Nat3 vs Nat4) and no significant improvement in the Structured case (Nat5 vs Nat6).

Table 4. Word error rates for various supervised adaptation methods on native speakers using a bigram language model. Forty adaptation sentences are used per speaker.

  Expt   Method                Number of Transforms   Word Error Rate (%)
  Nat1   Speaker Independent   NA                     20.9
  Nat2   I                     160                    20.5
  Nat3   IV                    2                      17.9
  Nat4   IV + Bayes            2                      17.5
  Nat5   V                     10                     17.6
  Nat6   V + Bayes             10                     17.5
4. DISCUSSION
We compared six ML-based adaptation approaches and some combinations with Bayesian techniques. We found that transforming the means of the Gaussian density functions with a full matrix produces a significant improvement over the joint adaptation of the means and variances with a Diagonal transform. The variances can be adapted in a second stage using the Stochastic transform, and further improvement can be obtained in a third stage using Bayesian adaptation.
We also proposed a structured transformation of the means that overcomes the problem of inverting ill-conditioned sample correlation matrices. Other techniques, such as singular
value decomposition, can be used to overcome this problem and
will be studied in more detail in the future.
Acknowledgments
This work was partially supported by ARPA through the Office of Naval Research Contract N00014-92-C-0154 and by Telia Research AB of Sweden.
The US Government has certain rights in this material. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the Government funding agencies.
REFERENCES
1. H. Murveit, J. Butzberger, V. Digalakis, and M. Weintraub, "Large-Vocabulary Dictation Using SRI's DECIPHER Speech Recognition System: Progressive Search Techniques," 1993 IEEE ICASSP, pp. II-319 - II-322.
2. V. Digalakis and H. Murveit, "GENONES: Optimizing the Degree of Mixture Tying in a Large Vocabulary Hidden Markov Model Based Speech Recognizer," 1994 IEEE ICASSP, pp. I-537 - I-540.
3. V. Digalakis and L. Neumeyer, "Speaker Adaptation Using Combined Transformation and Bayesian Methods," 1995 IEEE ICASSP, pp. I-680 - I-683.
4. V. Digalakis, D. Rtischev, and L. Neumeyer, "Fast Speaker Adaptation Using Constrained Estimation of Gaussian Mixtures," IEEE Trans. on Speech and Audio Processing; to appear.
5. A. Sankar and C.-H. Lee, "Stochastic Matching for Robust Speech Recognition," IEEE Signal Processing Letters, Vol. 1, pp. 124-125, August 1994.
6. C. J. Leggetter and P. C. Woodland, "Flexible Speaker Adaptation Using Maximum Likelihood Linear Regression," ARPA SLT Workshop, pp. 110-115, January 1995.
7. C.-H. Lee, C.-H. Lin, and B.-H. Juang, "A Study on Speaker Adaptation of the Parameters of Continuous Density Hidden Markov Models," IEEE Trans. on Acoust., Speech and Signal Proc., Vol. ASSP-39(4), pp. 806-814, April 1991.
8. C.-H. Lee and J.-L. Gauvain, "Speaker Adaptation Based on MAP Estimation of HMM Parameters," 1993 IEEE ICASSP, pp. II-558 - II-561.
9. A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum Likelihood Estimation from Incomplete Data," Journal of the Royal Statistical Society (B), Vol. 39, No. 1, pp. 1-38, 1977.
10. G. Doddington, "CSR Corpus Development," 1992 DARPA SLS Workshop, pp. 363-366.
11. D. Pallett et al., "1994 Benchmark Tests for the ARPA Spoken Language Program," ARPA SLT Workshop, pp. 5-36, January 1995.
12. F. Kubala, "Design of the 1994 CSR Benchmark Tests," ARPA SLT Workshop, pp. 41-46, January 1995.
Presents an approach to decrease the acoustic mismatch between a test utterance Y and a given set of speech hidden Markov models /spl Lambda//sub X/ to reduce the recognition performance degradation caused by possible distortions in the test utterance. This is accomplished by a parametric function that transforms either U or /spl Lambda//sub X/ to better match each other. The functional form of the transformation depends on prior knowledge about the mismatch, and the parameters are estimated along with the recognized string in a maximum-likelihood manner. experimental results verify the efficacy of the approach in improving the performance of a continuous speech recognition system in the presence of mismatch due to different transducers and transmission channels.< >