
A comparative study of speaker adaptation techniques

September 1995

DOI:10.21437/Eurospeech.1995-282

Source
DBLP

Conference: Fourth European Conference on Speech Communication and Technology, EUROSPEECH 1995, Madrid, Spain, September 18-21, 1995

Authors:

Leonardo Neumeyer

Amazon

Vassilis V Digalakis

Technical University of Crete


ABSTRACT

In previous work, we showed how to constrain the estimation of continuous mixture-density hidden Markov models (HMMs) when the amount of adaptation data is small. We used maximum-likelihood (ML) transformation-based approaches and Bayesian techniques to achieve near-native performance when testing nonnative speakers of the recognizer language. In this paper, we study various ML-based techniques and compare experimental results on data sets with recordings from nonnative and native speakers of American English. We divide the transformation-based techniques into two groups. In feature-space techniques, we hypothesize an underlying transformation in the feature space that results in a transformation of the HMM parameters. In model-space techniques, we hypothesize a direct transformation of the HMM parameters. In the experimental section we show how the combination of the best ML and Bayesian adaptation techniques results in significant improvements in recognition accuracy. All the experiments were carried out with SRI’s DECIPHER™ speech recognition system [1][2].

1. INTRODUCTION

Automatic speech recognition (ASR) performance degrades rapidly when a mismatch exists between the training and the testing conditions. For example, the performance of ASR systems trained using native speakers degrades dramatically when tested on nonnative speakers [3]. Current methods to minimize the effect of such a mismatch include ML transformation-based approaches [4][5][6] and Bayesian adaptation [3][7][8].

This work focuses on comparing various ML transformation-based techniques and finding the optimum method for a given task. We also investigate combinations of these adaptation techniques.

2. THEORY

2.1. ML Adaptation Techniques

In ML transformation-based techniques [4][5][6], adaptation is achieved via a transformation of the speaker-independent observation densities. The transformation parameters $\theta$ are obtained by maximizing the likelihood of the adaptation data $X$ given the corresponding word string $W$,

$$\hat{\theta} = \arg\max_{\theta} \, p(X \mid \theta, W). \tag{1}$$

A separate transformation is used for each group of Gaussian densities. The number of such transformations can be adjusted based on the available amount of adaptation data [4].
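As a concrete instance of Eq. (1), the sketch below estimates a single additive bias (the simplest transformation considered later in Section 2.1.1, Method II) with EM on a one-dimensional Gaussian mixture. It is only an illustration under stated assumptions: the function names, the toy mixture, and the synthetic data are hypothetical and not taken from the paper, and a real recognizer would accumulate the same statistics over HMM state alignments rather than over a standalone mixture.

```python
import math
import random


def gauss_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian N(mean, var) at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_additive_bias(data, weights, means, variances, n_iter=20):
    """ML estimate of b for x = y + b, where y follows the given mixture.

    Hypothetical helper: the mixture itself stays fixed and only b is updated.
    """
    b = 0.0
    for _ in range(n_iter):
        num = 0.0
        den = 0.0
        for x in data:
            # E-step: component posteriors under the currently shifted model.
            likes = [w * gauss_pdf(x, m + b, v)
                     for w, m, v in zip(weights, means, variances)]
            total = sum(likes)
            for like, m, v in zip(likes, means, variances):
                gamma = like / total
                # M-step accumulators: precision-weighted residuals.
                num += gamma * (x - m) / v
                den += gamma / v
        b = num / den
    return b


# Toy "speaker-independent" mixture and synthetic adaptation data (assumed values).
rng = random.Random(0)
weights, means, variances = [0.4, 0.6], [-2.0, 3.0], [1.0, 0.5]
true_bias = 1.5
data = []
for _ in range(2000):
    i = 0 if rng.random() < weights[0] else 1
    data.append(rng.gauss(means[i], math.sqrt(variances[i])) + true_bias)

b_hat = em_additive_bias(data, weights, means, variances)
print(f"estimated bias: {b_hat:.2f} (data were generated with {true_bias})")
```

The update `b = num / den` is the closed-form maximizer of the EM auxiliary function for a shared shift with per-component variances, which is why no numerical optimizer is needed in this simple case.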

We assume that the speaker-independent (SI) HMM has state observation densities of the form

$$p_{SI}(y_t \mid s_t) = \sum_i p(\omega_i \mid s_t)\, N(y_t; \mu_{ig}, \Sigma_{ig}), \tag{2}$$

where $g$ is the index of the Gaussian codebook used by state $s_t$. In this paper we investigate the adaptation of this system by jointly transforming all the Gaussians of each codebook.

As in [5], we consider transformations in two spaces: 1) the feature space, and 2) the model space. In the feature-space approach, the “original” features $y_t$ are transformed to the observed features $x_t$ by a hypothesized transformation $f_\nu(y_t)$, where $\nu$ are the parameters to be estimated. In the model-based approach, the original model $\lambda_y$ is transformed to the new model $\lambda_x$ by $\lambda_x = g_\eta(\lambda_y)$, where $\eta$ are the parameters to be estimated.

The next two subsections describe the proposed transformation methods. Table 1 summarizes the methods.

2.1.1. Transformations in the Feature Space

Method I (Diagonal Affine Transform) [4]. In this method we assume that, given the HMM state index $s_t$, the observed features $x_t$ can be obtained from the original features $y_t$ through the transformation

$$x_t = A_g y_t + b_g. \tag{3}$$

Under this assumption, the speaker-adapted (SA) observation densities will have the form

$$p_{SA}(x_t \mid s_t) = \sum_i p(\omega_i \mid s_t)\, N(x_t; A_g \mu_{ig} + b_g, A_g \Sigma_{ig} A_g^T). \tag{4}$$

The parameters $A_g, b_g$, $g = 1, \ldots, N_g$, are estimated using the ML approach of Eq. (1), where $N_g$ is the number of distinct transformations. We use the EM algorithm [9] to derive the ML estimates of the parameters $A_g$ and $b_g$. When $A_g$ is a diagonal matrix, closed-form estimates can be obtained for $A_g$ and $b_g$ as described in [4][5]. When $A_g$ is a full matrix, however, the estimation problem is more tedious. In this paper we use a diagonal matrix.

Method II (Additive Transform). This case is identical to Method I with $A_g = I$.
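To make the effect of Methods I and II on a codebook concrete, the following sketch applies Eq. (4) with a diagonal matrix to a small set of diagonal-covariance Gaussians. It is a minimal illustration, assuming the transform parameters have already been estimated; the function names and the numbers are hypothetical and not from the paper.

```python
def adapt_diagonal_affine(means, variances, a, b):
    """Apply Eq. (4) with diagonal A_g to every Gaussian of one codebook.

    means, variances: one list per Gaussian, one entry per feature dimension.
    a: diagonal of A_g; b: bias vector b_g (both assumed already estimated).
    """
    new_means = [[a_d * m_d + b_d for a_d, m_d, b_d in zip(a, mu, b)]
                 for mu in means]
    # With diagonal A_g, A_g * Sigma * A_g^T scales each variance by a_d squared.
    new_vars = [[a_d * a_d * v_d for a_d, v_d in zip(a, var)]
                for var in variances]
    return new_means, new_vars


def adapt_additive(means, variances, b):
    """Method II: the special case of Method I with A_g equal to the identity."""
    return adapt_diagonal_affine(means, variances, [1.0] * len(b), b)


# Hypothetical codebook with two Gaussians in three dimensions.
si_means = [[1.0, 2.0, 3.0], [0.0, 0.0, 4.0]]
si_vars = [[1.0, 1.0, 4.0], [2.0, 1.0, 1.0]]

sa_means, sa_vars = adapt_diagonal_affine(
    si_means, si_vars, a=[2.0, 1.0, 0.5], b=[1.0, 0.0, -1.0])
shift_means, shift_vars = adapt_additive(si_means, si_vars, b=[1.0, 0.0, -1.0])
print(sa_means, sa_vars)
print(shift_means, shift_vars)
```

Because the same `a` and `b` are shared by all Gaussians of the codebook, a handful of parameters moves every component at once, which is what makes the approach usable with little adaptation data.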

A COMPARATIVE STUDY OF SPEAKER ADAPTATION

TECHNIQUES

Leonardo Neumeyer, Ananth Sankar and Vassilios Digalakis

e-mail: {leo,sankar,vas}@speech.sri.com

SRI International

Speech Technology and Research Laboratory

Menlo Park, CA, 94025, USA

Method III (Stochastic Additive Transform) [5]. In this case we model the acoustic mismatch using a stochastic transformation. The stochastic transform is given by

$$x_t = y_t + b_g(\mu_{b,g}, \sigma_{b,g}), \tag{5}$$

where $b_g(\mu_{b,g}, \sigma_{b,g})$ is a Gaussian random variable with mean $\mu_{b,g}$ and variance $\sigma^2_{b,g}$. In this context, we can view Method II as a special case in which the additive term is a deterministic parameter.

2.1.2. Transformations in the Model Space

Method IV (Full Affine Transform) [6]. An alternative to Method I is to transform the means of the Gaussian density functions using a full matrix and leave the variances unchanged. The advantage of using a full matrix is that we can model the correlation between feature components at the expense of a quadratic increase in the number of adaptation parameters. The observation densities in this case will have the form

$$p_{SA}(x_t \mid s_t) = \sum_i p(\omega_i \mid s_t)\, N(x_t; A_g \mu_{ig} + b_g, \Sigma_{ig}). \tag{6}$$

Method V (Structured Affine Transform). Our continuous density HMM system uses a single feature vector stream, which is the augmented vector composed of three basic feature vectors: cepstrum, first derivative, and second derivative of the cepstrum. The structured matrix has non-zero values only in the elements whose rows and columns correspond to the same basic vector. For example, in Eq. (6) an element of the mean vector corresponding to a cepstrum component will only be predicted by the mean subvector that corresponds to the cepstrum and will not depend on the delta components.

The motivation for proposing this method is that the estimation of the full matrix $A_g$ involves inverting a sample correlation matrix. The dependencies between the cepstrum and its derivatives may result in an ill-conditioned sample correlation matrix, and hence in poor estimates.

Method VI (Scaled Variance Transform) [5]. In this case we transform the means using an additive shift and the variance using a scaling factor. The difference between this approach and Method I is that, in this case, the scale factor only affects the variance and is not tied to the scaling of the means.

2.2. Bayesian Adaptation Technique

In the Bayesian adaptation approach, the prior information is encapsulated in the SI models [3][7][8]. The Bayesian algorithms asymptotically converge to the speaker-dependent performance as the amount of adaptation speech increases. However, the adaptation rate is usually slow.

2.3. Combined Adaptation Technique

Finally, we use a combination of ML and Bayesian techniques to achieve the quick adaptation characteristics of the ML transformation-based methods with the asymptotic properties of Bayesian methods [3]. In this approach we first use the ML transformation-based method to adapt the SI models to the new speaker. These adapted models are then used as priors for the Bayesian adaptation method. The advantage of this approach is that the priors obtained by the ML transformation method are more closely matched to the observed data than the SI models.
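The benefit of a better-matched prior can be illustrated with a small sketch. The paper does not spell out its Bayesian update, so the code below assumes the commonly used MAP mean update with a conjugate prior, in which a prior weight `tau` (a hypothetical name and value) balances the prior mean against the adaptation frames. All data are made up for illustration.

```python
def map_mean(prior_mean, frames, posteriors, tau=5.0):
    """MAP re-estimate of one Gaussian mean (assumed conjugate-prior form).

    prior_mean: mean used as the prior (SI mean, or ML-transformed mean).
    frames: adaptation feature vectors; posteriors: occupation probability
    of this Gaussian for each frame; tau: assumed prior weight.
    """
    occupancy = sum(posteriors)
    weighted_sum = [sum(g * x[d] for g, x in zip(posteriors, frames))
                    for d in range(len(prior_mean))]
    return [(tau * m + s) / (tau + occupancy)
            for m, s in zip(prior_mean, weighted_sum)]


# Hypothetical numbers: the new speaker's frames cluster around 2.0.
si_mean = [0.0, 0.0]
ml_adapted_mean = [1.0, 1.0]          # SI mean after an assumed ML shift of +1
few_frames = [[2.0, 2.0]] * 5
few_post = [1.0] * 5

from_si_prior = map_mean(si_mean, few_frames, few_post)
from_ml_prior = map_mean(ml_adapted_mean, few_frames, few_post)
print(from_si_prior, from_ml_prior)

# With much more data the estimate approaches the data mean, whatever the prior.
many_frames = [[2.0, 2.0]] * 1000
asymptotic = map_mean(ml_adapted_mean, many_frames, [1.0] * 1000)
print(asymptotic)
```

With only five frames the estimate that starts from the ML-adapted prior ends closer to the speaker's data than the one that starts from the SI prior, while with many frames both priors are eventually outweighed, mirroring the fast-then-asymptotic behavior described above.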

3. EXPERIMENTS

Experiments were carried out using SRI’s DECIPHER™ speech recognition system configured with a six-feature front end that outputs 12 cepstral coefficients, cepstral energy, and their first- and second-order derivatives. The cepstral features are computed from a fast Fourier transform (FFT) filterbank, and subsequent cepstral-mean normalization on a sentence basis is performed. We used genonic HMMs with an arbitrary degree of Gaussian sharing across different HMM states as described in [2]. The SI continuous HMM systems that we used as seed models for adaptation were gender-dependent, trained on 140 speakers and 17,000 sentences for each gender. Each of the two systems had 12,000 context-dependent phonetic models that shared 500 Gaussian codebooks (1000 in the native speaker experiments) with 32 Gaussian components per codebook. For testing, we used the Wall Street Journal (WSJ) task [10]. For fast experimentation, we used the progressive search framework [1]: an initial SI recognizer with a bigram language model outputs word lattices for all the utterances in the test set. These word lattices are then rescored using SA models. We used the baseline WSJ 5,000-word (20,000 for native speaker experiments), closed-vocabulary bigram and trigram language models provided by the MIT Lincoln Laboratory. The trigram language model was used in the N-best rescoring paradigm, by rescoring the list of the N-best sentence hypotheses generated using the bigram language model.
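The sentence-level cepstral-mean normalization mentioned above can be sketched in a few lines. This is a generic illustration, not DECIPHER code; the function name and the toy frames are assumptions.

```python
def cepstral_mean_normalize(frames):
    """Subtract the per-sentence mean of each cepstral coefficient.

    frames: list of feature vectors (one per analysis frame) for ONE sentence.
    """
    n_frames = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n_frames for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in frames]


# Toy two-frame sentence, and the same sentence with a constant channel offset.
sentence = [[1.0, 2.0], [3.0, 6.0]]
with_offset = [[x + 0.5, y - 1.0] for x, y in sentence]

normalized = cepstral_mean_normalize(sentence)
normalized_offset = cepstral_mean_normalize(with_offset)
print(normalized, normalized_offset)
```

A constant additive offset in the cepstral domain is removed entirely by this step, so the adaptation methods studied here only have to model the mismatch that remains after normalization.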

Table 1. Transformations of the means and variances for various methods.

| Space | Method | Transformation name | Mean transform | Variance transform |
|---|---|---|---|---|
| Feature | I | Diagonal Affine | $\mu_i^{SA} = a_{ii}\,\mu_i^{SI} + b_i$ | $\sigma_i^{2,SA} = a_{ii}^2\,\sigma_i^{2,SI}$ |
| Feature | II | Additive | $\mu_i^{SA} = \mu_i^{SI} + b_i$ | $\sigma_i^{2,SA} = \sigma_i^{2,SI}$ |
| Feature | III | Stochastic Additive | $\mu_i^{SA} = \mu_i^{SI} + \mu_{b_i}$ | $\sigma_i^{2,SA} = \sigma_i^{2,SI} + \sigma_{b_i}^2$ |
| Model | IV | Full Affine | $\mu^{SA} = A_f\,\mu^{SI} + b$ | $\sigma_i^{2,SA} = \sigma_i^{2,SI}$ |
| Model | V | Structured Affine | $\mu^{SA} = A_s\,\mu^{SI} + b$ | $\sigma_i^{2,SA} = \sigma_i^{2,SI}$ |
| Model | VI | Scaled Variance | $\mu_i^{SA} = \mu_i^{SI} + b_i$ | $\sigma_i^{2,SA} = \alpha_i\,\sigma_i^{2,SI}$ |

3.1. Nonnative Speakers

We evaluated the adaptation algorithms on the 1994 “Spoke 3” task of the phase-1, large-vocabulary WSJ corpus [11][12]. For the first set of experiments we created a subset of the dev94 test set consisting of 5 nonnative speakers with 20 test sentences and 20 adaptation sentences per speaker. A bigram language model was used to compare performance between the different adaptation methods. The experimental results are shown in Table 2. In the second column we indicate the adaptation method used (IV + III means we adapted using Method IV followed by Method III). All experiments were optimized for the number of transformations and only the best result is shown. The main conclusions from these experiments can be summarized as follows:

• Adapting the means with a Full transform produces better results (7% improvement) than adapting the means and the variances with a Diagonal transform (NNat2 vs NNat7).

• Bayesian adaptation helps when combined with ML adaptation, in both Diagonal and Full transform cases. In the Diagonal case (NNat2 vs NNat3), we obtained an 8% improvement; in the Full case (NNat7 vs NNat8), we obtained an 18% improvement. Bayesian adaptation did not help in NNat11 when compared to NNat10.

• The Stochastic Additive transform is more effective than the deterministic Additive transform (NNat5 vs NNat4). In NNat10 and NNat11, the stochastic transform was used after the means were adapted with the Full Affine transform.

• The Structured Affine transform produced an improvement of 8% compared to the Full Affine case (NNat7 vs NNat9).

Table 2. Word error rates for various supervised adaptation methods on a subset of the WSJ Spoke 3 (nonnatives) dev94 set using a bigram language model. Twenty adaptation sentences are used per speaker.

| Nonnative expts | Method | Number of transforms | Word error rate (%) |
|---|---|---|---|
| NNat1 | Speaker Independent | NA | 24.9 |
| NNat2 | I | 162 | 18.9 |
| NNat3 | I + Bayes | 162 | 17.4 |
| NNat4 | II | 162 | 19.2 |
| NNat5 | III | 162 | 17.9 |
| NNat6 | VI | 162 | 18.5 |
| NNat7 | IV | 5 | 17.5 |
| NNat8 | IV + Bayes | 5 | 14.4 |
| NNat9 | V | 30 | 16.1 |
| NNat10 | IV + III | 10/200 | 15.0 |
| NNat11 | IV + III + Bayes | 10/200 | 15.2 |

To see how these methods generalize when using a larger data set and more adaptation sentences, we used the best techniques on the full WSJ Spoke 3 development and evaluation sets. After some initial tests we decided to use Method IV followed by Method III followed by Bayesian adaptation. The results are presented in Table 3. These results show how the Full matrix transform and the Stochastic transform produced improvements of 20% in the dev94 set and 7% in the eval94 set over Method I + Bayes.
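The percentages quoted in this section are relative reductions in word error rate. The short check below recomputes them from the word error rates reported in Tables 2 and 3; the function name is a hypothetical helper, and the numbers are copied from those tables.

```python
def relative_reduction(baseline_wer, adapted_wer):
    """Relative word-error-rate reduction, in percent of the baseline."""
    return 100.0 * (baseline_wer - adapted_wer) / baseline_wer


# (label, baseline WER, adapted WER), taken from Tables 2 and 3.
comparisons = [
    ("Diagonal vs Full (NNat2 vs NNat7)", 18.9, 17.5),
    ("Diagonal + Bayes (NNat2 vs NNat3)", 18.9, 17.4),
    ("Full + Bayes (NNat7 vs NNat8)", 17.5, 14.4),
    ("Structured vs Full (NNat7 vs NNat9)", 17.5, 16.1),
    ("Dev 94: I + Bayes vs IV + III + Bayes", 13.2, 10.5),
    ("Eval 94: I + Bayes vs IV + III + Bayes", 11.3, 10.5),
]

rounded = {label: round(relative_reduction(base, adapted))
           for label, base, adapted in comparisons}
for label, value in rounded.items():
    print(f"{label}: {value}%")
```

Reading the improvements as relative rather than absolute matters: the 18% gain for NNat7 vs NNat8, for example, corresponds to an absolute drop of 3.1 points of word error rate.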

3.2. Native Speakers

Some of the adaptation methods described in this paper were also tested on native speakers. We used 10 native speakers on the 20,000-word, closed-vocabulary WSJ task (a total of 230 test sentences) and 40 adaptation sentences per speaker. The results are presented in Table 4. Unlike the nonnative case, we did not see a significant improvement after adapting the models using Method I (Nat2). The Full Affine transform (Nat3), however, produced a significant improvement of 14% after adaptation. Further improvement was gained when using the Structured Affine transform (Nat3 vs Nat5). The Bayesian adaptation produced some improvement in the Full case (Nat3 vs Nat4) and no significant improvement in the Structured case (Nat5 vs Nat6).

Table 3. Speaker-independent and speaker-adapted word error rates (%) on the WSJ Spoke 3 benchmark test using a trigram language model. Forty adaptation sentences are used per speaker.

| Data set | SI | SA (I + Bayes) | SA (IV + III + Bayes) |
|---|---|---|---|
| S3 Dev 94 | 23.1 | 13.2 | 10.5 |
| S3 Eval 94 | 23.2 | 11.3 | 10.5 |

Table 4. Word error rates for various supervised adaptation methods on native speakers using a bigram language model. Forty adaptation sentences are used per speaker.

| Native expts | Method | Number of transforms | Word error rate (%) |
|---|---|---|---|
| Nat1 | Speaker Independent | NA | 20.9 |
| Nat2 | I | 160 | 20.5 |
| Nat3 | IV | 2 | 17.9 |
| Nat4 | IV + Bayes | 2 | 17.5 |
| Nat5 | V | 10 | 17.6 |
| Nat6 | V + Bayes | 10 | 17.5 |

4. DISCUSSION

We compared six ML-based adaptation approaches and some combinations with Bayesian techniques. We found that transforming the means of the Gaussian density functions with a full matrix produces a significant improvement over the joint adaptation of the means and variances with a Diagonal transform. The variances can be adapted in a second stage using the Stochastic transform, and further improvement can be obtained in a third stage using Bayesian adaptation.

We also proposed a structured transformation of the means that overcomes the problem of inverting ill-conditioned sample correlation matrices. Other techniques, such as singular value decomposition, can be used to overcome this problem and will be studied in more detail in the future.
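The structured transformation can be pictured as a block-diagonal constraint on the mean-transformation matrix. The sketch below builds such a mask, counts free parameters per transform, and applies a masked transform to a mean vector. It assumes a 39-dimensional vector split into three 13-dimensional blocks (12 cepstra plus energy, and their first and second derivatives); that split, like all names and numbers in the code, is an illustrative assumption rather than a detail confirmed by the paper.

```python
def block_mask(block_sizes):
    """0/1 mask that is 1 only where row and column fall in the same block."""
    dim = sum(block_sizes)
    mask = [[0] * dim for _ in range(dim)]
    start = 0
    for size in block_sizes:
        for r in range(start, start + size):
            for c in range(start, start + size):
                mask[r][c] = 1
        start += size
    return mask


def n_free_params(mask):
    """Free entries of the matrix plus one bias term per dimension."""
    return sum(map(sum, mask)) + len(mask)


def apply_mask(matrix, mask):
    """Zero every matrix element that the mask does not allow."""
    return [[a * m for a, m in zip(row, mrow)] for row, mrow in zip(matrix, mask)]


def affine_mean(matrix, bias, mean):
    """Mean update of Eq. (6): A * mu + b."""
    return [sum(a * m for a, m in zip(row, mean)) + b_r
            for row, b_r in zip(matrix, bias)]


# Parameter counts per transform under the assumed 13 + 13 + 13 layout.
full_count = n_free_params(block_mask([39]))
structured_count = n_free_params(block_mask([13, 13, 13]))
diagonal_count = n_free_params(block_mask([1] * 39))
print(full_count, structured_count, diagonal_count)

# Tiny example: blocks of size 2 and 1, so cross-block terms are dropped.
a_structured = apply_mask([[1.0, 2.0, 5.0], [0.0, 1.0, 5.0], [5.0, 5.0, 3.0]],
                          block_mask([2, 1]))
new_mean = affine_mean(a_structured, [0.0, 0.0, 1.0], [1.0, 1.0, 1.0])
print(a_structured, new_mean)
```

Under these assumptions the structured matrix needs roughly a third of the parameters of the full matrix, and each block can be estimated from a smaller correlation matrix that excludes the cepstrum-to-derivative dependencies.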

Acknowledgments

This work was partially supported by ARPA through the Office of Naval Research Contract N00014-92-C-0154 and by Telia Research AB of Sweden.

The US Government has certain rights in this material. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the Government funding agencies.

REFERENCES

1. H. Murveit, J. Butzberger, V. Digalakis, and M. Weintraub, “Large-Vocabulary Dictation Using SRI’s DECIPHER Speech Recognition System: Progressive Search Techniques,” 1993 IEEE ICASSP, pp. II-319 - II-322.

2. V. Digalakis and H. Murveit, “GENONES: Optimizing the Degree of Mixture Tying in a Large Vocabulary Hidden Markov Model Based Speech Recognizer,” 1994 IEEE ICASSP, pp. I-537 - I-540.

3. V. Digalakis and L. Neumeyer, “Speaker Adaptation Using Combined Transformation and Bayesian Methods,” 1995 IEEE ICASSP, pp. I-680 - I-683.

4. V. Digalakis, D. Rtischev, and L. Neumeyer, “Fast Speaker Adaptation Using Constrained Estimation of Gaussian Mixtures,” IEEE Trans. on Speech and Audio Processing, to appear.

5. A. Sankar and C.-H. Lee, “Stochastic Matching for Robust Speech Recognition,” IEEE Signal Processing Letters, Vol. 1, pp. 124-125, August 1994.

6. C.J. Leggetter and P.C. Woodland, “Flexible Speaker Adaptation Using Maximum Likelihood Linear Regression,” ARPA SLT Workshop, pp. 110-115, January 1995.

7. C.-H. Lee, C.-H. Lin, and B.-H. Juang, “A Study on Speaker Adaptation of the Parameters of Continuous Density Hidden Markov Models,” IEEE Trans. on Acoust., Speech and Signal Proc., Vol. ASSP-39(4), pp. 806-814, April 1991.

8. C.-H. Lee and J.-L. Gauvain, “Speaker Adaptation Based on MAP Estimation of HMM Parameters,” 1993 IEEE ICASSP, pp. II-558 - II-561.

9. A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum Likelihood Estimation from Incomplete Data,” Journal of the Royal Statistical Society (B), Vol. 39, No. 1, pp. 1-38, 1977.

10. G. Doddington, “CSR Corpus Development,” 1992 DARPA SLS Workshop, pp. 363-366.

11. D. Pallett, et al., “1994 Benchmark Tests for the ARPA Spoken Language Program,” ARPA SLT Workshop, pp. 5-36, January 1995.

12. F. Kubala, “Design of the 1994 CSR Benchmark Tests,” ARPA SLT Workshop, pp. 41-46, January 1995.
