Conference Paper

Robust speaker adaptation using a piecewise linear acoustic mapping

Authors: Bellegarda, de Souza, et al.

Abstract

In a large vocabulary speech recognition system, it is desirable to make use of previously acquired speech data when encountering new speakers. The authors describe an adaptation strategy based on a piecewise linear mapping between the feature space of a new speaker and that of a reference speaker. This speaker-normalizing mapping is used to transform the previously acquired parameters of the reference speaker onto the space of the new speaker. This results in a robust speaker adaptation procedure which allows for a drastic reduction in the amount of training data required from the new speaker. The performance of this method is illustrated on an isolated utterance speech recognition task with a vocabulary of 20000 words.
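
As a rough illustration of the idea (a minimal sketch under simplifying assumptions, not the authors' exact algorithm): partition one speaker's feature space into regions and fit one affine transform per region. The frame alignment between the two speakers, the use of k-means, and all names below are illustrative assumptions.

# Minimal sketch of a piecewise linear feature-space mapping (illustrative).
# Assumes time-aligned feature pairs (new_feats[i], ref_feats[i]) are already
# available, e.g. from a DTW alignment.
import numpy as np
from sklearn.cluster import KMeans

def fit_piecewise_linear_map(new_feats, ref_feats, n_regions=8):
    """Partition the new speaker's feature space and fit one affine
    transform per region, mapping new-speaker frames to reference space."""
    km = KMeans(n_clusters=n_regions, n_init=10).fit(new_feats)
    maps = []
    for r in range(n_regions):
        idx = km.labels_ == r
        X = np.hstack([new_feats[idx], np.ones((idx.sum(), 1))])  # affine term
        W, *_ = np.linalg.lstsq(X, ref_feats[idx], rcond=None)
        maps.append(W)
    return km, maps

def apply_map(km, maps, feats):
    """Transform each frame with the map of the region it falls into."""
    regions = km.predict(feats)
    out = np.empty_like(feats)
    for i, (x, r) in enumerate(zip(feats, regions)):
        out[i] = np.append(x, 1.0) @ maps[r]
    return out

In the paper the mapping is used in the reverse direction as well, to carry the reference speaker's previously trained parameters into the new speaker's space; the direction shown here is just one choice.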

... Early work in data augmentation often assumed that beneficial techniques would produce images close to the true data distribution [1,28]. However, with many of the techniques above, it is clear that the resulting images are unnatural and likely to be out-of-distribution with respect to the test set (see Figure 1 for examples). ...
... Data Augmentation Early examples of data augmentations often focused on creating realistic but 'different' training examples [1] and often included horizontal flips, crops, and minor color distortions to MNIST and CIFAR-10 images [5,27,28]. However, this is clearly no longer the case in modern models as data augmentation techniques that produce out-of-distribution and heavily modified images have been shown to significantly improve performance [2,13,41,39]. ...
Preprint
Data augmentation has emerged as a powerful technique for improving the performance of deep neural networks and led to state-of-the-art results in computer vision. However, state-of-the-art data augmentation strongly distorts training images, leading to a disparity between examples seen during training and inference. In this work, we explore a recently proposed training paradigm in order to correct for this disparity: using an auxiliary BatchNorm for the potentially out-of-distribution, strongly augmented images. Our experiments then focus on how to define the BatchNorm parameters that are used at evaluation. To eliminate the train-test disparity, we experiment with using the batch statistics defined by clean training images only, yet surprisingly find that this does not yield improvements in model performance. Instead, we investigate using BatchNorm parameters defined by weak augmentations and find that this method significantly improves the performance of common image classification benchmarks such as CIFAR-10, CIFAR-100, and ImageNet. We then explore a fundamental trade-off between accuracy and robustness coming from using different BatchNorm parameters, providing greater insight into the benefits of data augmentation on model performance.
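
As a sketch of the auxiliary-BatchNorm idea described above (illustrative PyTorch, not the paper's code; the module name and routing flag are assumptions):

# Route strongly augmented batches through an auxiliary BatchNorm so their
# statistics do not contaminate the BatchNorm used at evaluation.
import torch.nn as nn

class DualBatchNorm2d(nn.Module):
    def __init__(self, num_features):
        super().__init__()
        self.bn_main = nn.BatchNorm2d(num_features)  # clean / weak augmentations
        self.bn_aux = nn.BatchNorm2d(num_features)   # strong augmentations

    def forward(self, x, strongly_augmented=False):
        return self.bn_aux(x) if strongly_augmented else self.bn_main(x)

At test time only bn_main would be used; per the abstract, its statistics are best estimated from weakly augmented rather than purely clean training batches.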
... In Sec. 4 we review and discuss the second series of algorithms which provide greater recognition accuracy by virtue of more detailed modeling of the statistics of degraded speech. ...
... These approaches are also similar to complementary work performed at other sites, including the piecewise-linear mapping and noise-adaptive prototypes developed at IBM [4,18] and the probabilistic optimal filtering (POF) algorithm developed at SRI [19]. The POF algorithm, for example, is typically realized with many more free environmental parameters than are commonly used in algorithms like MFCDCN or MPDCN to characterize the environment function. ...
Article
Full-text available
The accuracy of speech recognition systems degrades when operated in adverse acoustical environments. This paper reviews various methods by which more detailed mathematical descriptions of the effects of environmental degradation can improve speech recognition accuracy using both "data-driven" and "model-based" compensation strategies. Data-driven methods learn environmental characteristics through direct comparisons of speech recorded in the noisy environment with the same speech recorded under optimal conditions. Model-based methods use a mathematical model of the environment and attempt to use samples of the degraded speech to estimate model parameters. These general approaches to environmental compensation are discussed in terms of recent research in environmental robustness at CMU, and in terms of similar efforts at other sites. These compensation algorithms are evaluated in a series of experiments measuring recognition accuracy for speech from the ARPA Wall Street Journal database that is corrupted by artificially-added noise at various signal-to-noise ratios (SNRs), and in more natural speech recognition tasks.
... Since early uses of data augmentation in training neural networks, it has been assumed that they work because they simulate realistic samples from the true data distribution: "[augmentation strategies are] reasonable since the transformed reference data is now extremely close to the original data. In this way, the amount of training data is effectively increased" (Bellegarda et al., 1992). Because of this, augmentations have often been designed with the heuristic of incurring minimal distribution shift from the training data. ...
... Since early uses of data augmentation in training neural networks, there has been an assumption that effective transforms for data augmentation are those that produce images from an "overlapping but different" distribution (Bengio et al., 2011;Bellegarda et al., 1992). Indeed, elastic distortions as well as distortions in the scale, position, and orientation of training images have been used on MNIST (Ciregan et al., 2012;Sato et al., 2015;Simard et al., 2003;Wan et al., 2013), while horizontal flips, random crops, and random distortions to color channels have been used on CIFAR-10 and ImageNet (Krizhevsky et al., 2012;Zagoruyko & Komodakis, 2016;Zoph et al., 2017). ...
Preprint
Though data augmentation has become a standard component of deep neural network training, the underlying mechanism behind the effectiveness of these techniques remains poorly understood. In practice, augmentation policies are often chosen using heuristics of either distribution shift or augmentation diversity. Inspired by these, we seek to quantify how data augmentation improves model generalization. To this end, we introduce interpretable and easy-to-compute measures: Affinity and Diversity. We find that augmentation performance is predicted not by either of these alone but by jointly optimizing the two.
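
A rough reading of the two measures, as simplified stand-ins rather than the paper's exact definitions:

# Illustrative versions of Affinity and Diversity (my simplified reading of
# the abstract, not the paper's formulas).
import numpy as np

def affinity(acc_clean_val, acc_augmented_val):
    """How far an augmentation shifts data from what a clean-trained model
    expects: ratio of augmented-val accuracy to clean-val accuracy."""
    return acc_augmented_val / acc_clean_val

def diversity(final_train_losses_augmented):
    """How hard the augmented data is to fit: e.g. the final training loss
    of a model trained with the augmentation."""
    return float(np.mean(final_train_losses_augmented))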
... A second approach to SA consists of mapping the incoming speech features to a new space using transformations designed to minimize the distance between the new speech vectors and a set of "reference" speech vectors. The forms of the transformation can be linear, piecewise linear, or even non-linear (as modeled by a neural network) [4] [2]. For example, Huang described a method of using neural networks to map between the acoustic spaces of different speakers' speech [12]. ...
Article
Full-text available
In an effort to reduce the degradation in speech recognition performance caused by variation in vocal tract shape among speakers, a frequency warping approach to speaker normalization is investigated. A set of low complexity, maximum likelihood based frequency warping procedures has been applied to speaker normalization for a telephone based connected digit recognition task. This paper presents an efficient means for estimating a linear frequency warping factor and a simple mechanism for implementing frequency warping by modifying the filterbank in mel-frequency cepstrum feature analysis. An experimental study comparing these techniques to other well-known techniques for reducing variability is described. The results have shown that frequency warping is consistently able to reduce word error rate by 20% even for very short utterances.
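
A minimal sketch of linear frequency warping with a grid-searched warp factor (illustrative; the paper's maximum-likelihood estimation and filterbank details are omitted, and loglik_fn is an assumed scoring hook):

# Warp the mel filterbank by scaling its center frequencies, then pick the
# warp factor whose features score best under the acoustic model.
import numpy as np

def warp_centers(center_freqs_hz, alpha):
    """Linearly warp filterbank center frequencies by factor alpha."""
    return np.asarray(center_freqs_hz) * alpha

def estimate_alpha(utterance, loglik_fn, alphas=np.linspace(0.88, 1.12, 13)):
    """Grid search over candidate warp factors; loglik_fn(utterance, alpha)
    is assumed to return the recognizer's log-likelihood of the utterance
    analyzed with the alpha-warped filterbank."""
    return max(alphas, key=lambda a: loglik_fn(utterance, a))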
... maximum-likelihood linear regression (fMLLR) [2], [3], the testing acoustic features are modified to match the training acoustics more closely. In model-based adaptation, the SI model parameters are modified so that the adapted model better fits the new speaker. ...
Article
This paper proposes a nonlinear generalization of the popular maximum-likelihood linear regression (MLLR) adaptation algorithm using kernel methods. The proposed method, called maximum penalized likelihood kernel regression adaptation (MPLKR), applies kernel regression with appropriate regularization to determine the affine model transform in a kernel-induced high-dimensional feature space. Although this is not the first attempt to apply kernel methods to conventional linear adaptation algorithms, unlike most other kernelized adaptation methods such as kernel eigenvoice or kernel eigen-MLLR, MPLKR has the advantage that it is a convex optimization and its solution is always guaranteed to be globally optimal. In fact, the adapted Gaussian means can be obtained analytically by simply solving a system of linear equations. From the Bayesian perspective, MPLKR can also be considered the kernel version of maximum a posteriori linear regression (MAPLR) adaptation. Supervised and unsupervised speaker adaptation using MPLKR were evaluated on the Resource Management and Wall Street Journal 5K tasks, respectively, achieving word error rate reductions of 23.6% and 15.5% over the speaker-independent model.
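
The closed-form, convex character of the method can be illustrated with plain kernel ridge regression (a simplified stand-in for the MPLKR objective, not its actual formulation):

# Regularized kernel regression solved as a single linear system.
import numpy as np

def kernel_ridge_fit(K, Y, lam=1e-2):
    """Solve (K + lam*I) A = Y for the dual coefficients A.
    K: (n, n) kernel Gram matrix; Y: (n, d) regression targets."""
    n = K.shape[0]
    return np.linalg.solve(K + lam * np.eye(n), Y)

def kernel_ridge_predict(K_test, A):
    """K_test: (m, n) kernel values between test and training points."""
    return K_test @ A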
... However, this increases the amount of adaptation data required from the speaker so that there is sufficient data to estimate each transform. A piecewise-linear approach based on phone classes has also been used with discrete HMMs (Bellegarda et al., 1992). Here, we extend the ideas of Jaschul (1982) and Hewett (1989) to adapting the parameters of continuous density HMMs. ...
Article
A method of speaker adaptation for continuous density hidden Markov models (HMMs) is presented. An initial speaker-independent system is adapted to improve the modelling of a new speaker by updating the HMM parameters. Statistics are gathered from the available adaptation data and used to calculate a linear regression-based transformation for the mean vectors. The transformation matrices are calculated to maximize the likelihood of the adaptation data and can be implemented using the forward–backward algorithm. By tying the transformations among a number of distributions, adaptation can be performed for distributions which are not represented in the training data. An important feature of the method is that arbitrary adaptation data can be used—no special enrolment sentences are needed. Experiments have been performed on the ARPA RM1 database using an HMM system with cross-word triphones and mixture Gaussian output distributions. Results show that adaptation can be performed using as little as 11 s of adaptation data, and that as more data is used the adaptation performance improves. For example, using 40 adaptation utterances, a 37% reduction in error from the speaker-independent system was achieved with supervised adaptation and a 32% reduction in unsupervised mode.
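
A sketch of an MLLR-style mean transform, simplified to the identity-covariance case so the maximum-likelihood solution reduces to weighted least squares (the forward-backward pass is assumed to have produced the occupation counts and observation sums; this is not the paper's full derivation):

# Estimate W (d, d+1) such that adapted_mean = W @ [1, mu].
import numpy as np

def estimate_mllr_transform(means, gamma, obs_sums):
    """means: (G, d) Gaussian means; gamma: (G,) occupation counts;
    obs_sums: (G, d) per-Gaussian sums of gamma-weighted observations."""
    G, d = means.shape
    xi = np.hstack([np.ones((G, 1)), means])     # extended means, (G, d+1)
    lhs = (xi * gamma[:, None]).T @ xi            # sum_g gamma_g xi xi^T
    rhs = xi.T @ obs_sums                         # sum_g xi_g obs_sum_g^T
    return np.linalg.solve(lhs, rhs).T            # (d, d+1)

def adapt_means(W, means):
    xi = np.hstack([np.ones((means.shape[0], 1)), means])
    return xi @ W.T

Tying one such W across a group of Gaussians is what lets distributions unseen in the adaptation data be updated too.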
... Parameter values of these models are estimated from samples of speech in the testing environments, and either the features of the incoming speech or the internally-stored representations of speech in the system are modified. Typical structural models for adaptation to acoustical variability assume that speech is corrupted either by additive noise with an unknown power spectrum (Porter & Boll, 1984; Ephraim, 1992; Erell & Weintraub, 1990; Gales & Young, 1992; Lockwood, Boudy, et al., 1992; Bellegarda, de Souza, et al., 1992), or by a combination of additive noise and linear filtering (Acero & Stern, 1990). Much of the early work in robust recognition involved a re-implementation of techniques developed to remove additive noise for the purpose of speech enhancement, as reviewed in section 10.3. ...
Article
Full-text available
This book surveys the state of the art of human language technology. The goal of the survey is to provide an interested reader with an overview of the field---the main areas of work, the capabilities and limitations of current technology, and the technical challenges that must be overcome to realize the vision of graceful human computer interaction using natural communication skills. The book consists of thirteen chapters written by 97 different authors.
... Parameter values of these models are estimated from samples of speech in the testing environments, and either the features of the incoming speech or the internally-stored representations of speech in the system are modified. Typical structural models for adaptation to acoustical variability assume that speech is corrupted either by additive noise with an unknown power spectrum (Porter & Boll, 1984;Ephraim, 1992;Erell & Weintraub, 1990;Gales & Young, 1992;Lockwood, Boudy, et al., 1992;Bellegarda, de Souza, et al., 1992), or by a combination of additive noise and linear filtering (Acero & Stern, 1990). Much of the early work in robust recognition involved a re-implementation of techniques developed to remove additive noise for the purpose of speech enhancement, as reviewed in section 10.3. ...
Chapter
Full-text available
Spoken language interfaces to computers is a topic that has lured and fascinated engineers and speech scientists alike for over five decades. For many, the ability to converse freely with a machine represents the ultimate challenge to our understanding of the production and perception processes involved in human speech communication. In addition to being a provocative topic, spoken language interfaces are fast becoming a necessity. In the near future, interactive networks will provide easy access to a wealth of information and services that will fundamentally affect how people work, play and conduct their daily affairs. Today, such networks are limited to people who can read and have access to computers—a relatively small part of the population, even in the most developed countries. Advances in human language technology are needed to enable the average citizen to communicate with networks using natural communication skills and everyday devices, such as telephones and televisions. Without fundamental advances in user-centered interfaces, a large portion of society will be prevented from participating in the age of information, resulting in further stratification of society and tragic loss of human potential. The first chapter in this survey deals with spoken language input technologies. A speech interface, in a user's own language, is ideal because it is the most natural, flexible, efficient, and economical form of human communication. The following sections summarize spoken input technologies that will facilitate such an interface. Spoken input to computers embodies many different technologies and applications, as illustrated in Figure 1.1. In some cases, as shown at the bottom of the figure, one is interested not in the underlying linguistic content but in the identity of the speaker or the language being spoken. Speaker recognition can involve identifying a specific speaker out of a known population, which has forensic implications, or verifying the claimed identity of a user, thus enabling controlled access to locales (e.g., a computer room) and services (e.g., voice banking).
Article
A main difficulty in speech recognition for unspecified speakers is the variation of acoustic features between different speakers. In order to deal with this difficulty, we introduce a parallel phoneme labeling (PPL) method. In this approach, all speakers are assumed to fall into a limited number of groups, within each of which speakers have similar or close voice characteristics. Phoneme labeling for each group is then carried out in parallel, and an unspecified speaker is assumed to match one of the parallel branches. In experiments, the PPL method showed better performance than the usual method with only a single universal reference pattern set.
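
A sketch of the branch-selection step this implies (the distortion criterion and codebook representation are illustrative assumptions, not the paper's procedure):

# Score the input against each speaker group's reference codebook and keep
# the branch with the lowest average quantization distortion.
import numpy as np

def select_branch(input_feats, group_codebooks):
    """input_feats: (T, d) frames; group_codebooks: list of (K, d) arrays.
    Returns the index of the best-matching speaker group."""
    def distortion(codebook):
        d = np.linalg.norm(input_feats[:, None, :] - codebook[None], axis=-1)
        return d.min(axis=1).mean()   # mean nearest-codeword distance
    return min(range(len(group_codebooks)),
               key=lambda g: distortion(group_codebooks[g]))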
Article
Full-text available
The goal of SRI's consistency modeling project is to improve the raw acoustic modeling component of SRI's DECIPHER speech recognition system and develop consistency modeling technology. Consistency modeling aims to reduce the number of improper independence assumptions used in traditional speech recognition algorithms so that the resulting speech recognition hypotheses are more self-consistent and, therefore, more accurate. At the initial stages of this effort, SRI focused on developing the appropriate base technologies for consistency modeling. We first developed the Progressive Search technology that allowed us to perform large-vocabulary continuous speech recognition (LVCSR) experiments. Since its conception and development at SRI, this technique has been adopted by most laboratories, including other ARPA contracting sites, doing research on LVCSR. Another goal of the consistency modeling project is to attack difficult modeling problems where there is a mismatch between the training and testing phases. Such mismatches may include outlier speakers, different microphones and additive noise. We were able to either develop new, or transfer and evaluate existing, technologies that adapted our baseline genonic HMM recognizer to such difficult conditions.
Article
The authors describe a supervised approach to the construction of context-sensitive acoustic prototypes for use in speech recognition systems using allophonic subword hidden Markov models (HMMs). A properly fine partition of the underlying acoustic space(s) is achieved by incorporating contextual supervision to relate the HMM allophonic models to their acoustic manifestations. By decoupling the overall procedure into a clustering phase followed by a pruning phase, it becomes possible to uncover satisfactorily once and for all the general interrelationships between various acoustic subevents, while customizing the acoustic prototypes themselves according to the available training data. This makes for a more efficient utilization of the training sentences, as evidenced by a substantial reduction in the error rate with respect to a baseline system not taking advantage of supervision. The performance of this method is illustrated on an isolated utterance speech recognition task with a vocabulary of 20000 words.
Chapter
The performance of a large vocabulary speech recognition system is critically tied to the quality of the acoustic prototypes that are established in the relevant feature space(s). This is especially true in continuous speech and/or for speaker-independent tasks, where pronunciation variability is the greatest. In this chapter, we will discuss a number of clustering techniques which can be used to derive high quality acoustic prototypes.
Chapter
Techniques to account for the presence of noise, stress and channel distortions in the automatic recognition process are reviewed. These techniques have been classified in two classes depending on where they are applied in the recognition process and the fact that they tend to restore the clean speech signal or compensate for distortions. In the first class, filtering techniques, signal estimation methods based on statistical modeling, linear and non-linear spectral subtraction, and mapping algorithms are developed. The second class, comprising mostly model compensation algorithms, includes HMM composition and decomposition techniques, noise masking, adaptation methods, minimum error training, and channel and stress compensation. In the stress compensation subsection, we have summarized several algorithms which have been shown to improve ASR of Lombard speech.
Conference Paper
Complex multidimensional data may naturally require the decomposition of a regression/classification problem over local regions. Moreover, both global and local anisotropy can be present. We propose to address both problems with a flexible neural network structure embedding data quantization and coordinate transformations. The solution is applied in this paper to speaker normalization. The spectral mapping is realized as a weighted superposition of local neural mappings, estimated between subregions of a new speaker's acoustic space and that of a reference speaker, combined with global and local space transformations. The local mappings are realized using the generalized resource allocating network model, a general radial basis function scheme that allows recursive allocation of kernels. The space transformations are based upon projections over the principal components, separately estimated for the global space and for the local subregions of the input and output acoustic spaces.
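
A much-simplified sketch of the weighted superposition: local affine maps blended by normalized Gaussian weights around region centroids (stand-ins for the paper's GRAN/RBF machinery and PCA-based transformations):

# Softly gated combination of local mappings.
import numpy as np

def blended_map(x, centroids, local_maps, sigma=1.0):
    """x: (d,) input frame; centroids: (R, d) region centers;
    local_maps: list of (d+1, d) affine maps, one per region."""
    d2 = ((centroids - x) ** 2).sum(axis=1)
    w = np.exp(-d2 / (2 * sigma ** 2))
    w /= w.sum()                              # normalized gating weights
    xe = np.append(x, 1.0)                    # affine term
    return sum(wi * (xe @ Wi) for wi, Wi in zip(w, local_maps))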
Conference Paper
An attempt has been made to develop improved and more sophisticated SMMs (speaker Markov models) capable of modeling the acoustic differences between two speakers more accurately, thus leading to improved recognition rates for the adapted speech recognition system. The original SMM approach has been improved by the introduction of three features: the use of fenonic speaker Markov models, the introduction of phoneme-dependent SMM parameters, and the use of special weighting between the short original training data of the new speaker and the adapted training data of the reference speaker. It was found that the phoneme recognition performance of these improved SMMs can be more than twice as high as that of the original SMM approach, which had already yielded satisfactory adaptation results for a large-vocabulary speech recognition task.
Chapter
The problem of finding criteria through which a model will be chosen to match a problem and available data and give optimal future performance is a crucial issue in practical applications, not to be underestimated when proposing model combination to solve a complex regression or classification task. How can it be ensured that each specialised model has been trained with enough material and that the aggregate model has the optimal structure for reducing error on novel inputs? What if a key requirement is minimisation of training material and time? This chapter introduces bootstrap error estimation for automatic model selection in combined networks: the resulting model is embedded in the acoustic front-end of an automatic speech recognition system based on hidden Markov models. The method is evaluated in two applications: a large-vocabulary (10,000 words) continuous speech recognition task and digit recognition over a noisy telephone line. Bootstrap estimates of minimum MSE allow selection of regression models that improve system recognition performance. The procedure allows a flexible strategy for dealing with inter-speaker variability without requiring an additional validation set. Recognition results are compared for linear, generalised Radial Basis Function and Multilayer Perceptron network architectures and with system re-training methods.
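
A minimal bootstrap MSE estimator of the kind described (illustrative; fit_fn and predict_fn stand in for whichever regressor (linear, RBF, or MLP) is being compared, and out-of-bag scoring is one common variant, not necessarily the chapter's):

# Bootstrap estimate of generalization MSE without a held-out validation set.
import numpy as np

def bootstrap_mse(X, Y, fit_fn, predict_fn, n_boot=50, rng=None):
    rng = rng or np.random.default_rng(0)
    n = len(X)
    errs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample with replacement
        oob = np.setdiff1d(np.arange(n), idx)     # out-of-bag test points
        if len(oob) == 0:
            continue
        model = fit_fn(X[idx], Y[idx])
        errs.append(np.mean((predict_fn(model, X[oob]) - Y[oob]) ** 2))
    return float(np.mean(errs))

Model selection then amounts to picking the architecture with the lowest bootstrap MSE, reusing all the data for both training and error estimation.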
Article
Full-text available
Adapting the parameters of a statistical speaker-independent continuous-speech recognizer to the speaker and the channel can significantly improve the recognition performance and robustness of the system. In continuous mixture-density hidden Markov models the number of component densities is typically very large, and it may not be feasible to acquire a sufficient amount of adaptation data for robust maximum-likelihood estimates. To solve this problem, we have recently proposed a constrained estimation technique for Gaussian mixture densities. To improve the behavior of our adaptation scheme for large amounts of adaptation data, we combine it here with Bayesian techniques. We evaluate our algorithms on the large-vocabulary Wall Street Journal corpus for nonnative speakers of American English. The recognition error rate is approximately halved with only a small amount of adaptation data, and it approaches the speaker-independent accuracy achieved for native speakers.
Article
Full-text available
A trend in automatic speech recognition systems is the use of continuous mixture-density hidden Markov models (HMMs). Despite the good recognition performance that these systems achieve on average in large vocabulary applications, there is a large variability in performance across speakers. Performance degrades dramatically when the user is radically different from the training population. A popular technique that can improve the performance and robustness of a speech recognition system is adapting speech models to the speaker, and more generally to the channel and the task. In continuous mixture-density HMMs the number of component densities is typically very large, and it may not be feasible to acquire a sufficient amount of adaptation data for robust maximum-likelihood estimates. To solve this problem, the authors propose a constrained estimation technique for Gaussian mixture densities. The algorithm is evaluated on the large-vocabulary Wall Street Journal corpus for both native and nonnative speakers of American English. For nonnative speakers, the recognition error rate is approximately halved with only a small amount of adaptation data, and it approaches the speaker-independent accuracy achieved for native speakers. For native speakers, the recognition performance after adaptation improves to the accuracy of speaker-dependent systems that use six times as much training data.
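
The Bayesian side of such a combination can be illustrated with the standard MAP update of a Gaussian mean, which interpolates the speaker-independent prior with the adaptation data (a simplification of the paper's constrained-estimation scheme; tau is an assumed prior weight):

# MAP re-estimation of one Gaussian mean from adaptation statistics.
def map_adapt_mean(prior_mean, obs_sum, gamma, tau=10.0):
    """prior_mean: (d,) speaker-independent mean; obs_sum: (d,) sum of
    gamma-weighted observations; gamma: total occupation count."""
    return (tau * prior_mean + obs_sum) / (tau + gamma)

With little data (small gamma) the estimate stays near the prior; as adaptation data grows, it approaches the maximum-likelihood estimate, which is exactly the large-data behavior the abstract is after.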
Article
Table of contents (excerpt):
Acknowledgments
Chapter 1: Introduction
1.1. Approaches to Overcoming Environmental Variability
  1.1.1. Re-Training
  1.1.2. Multi-Style Training
  1.1.3. Environmental Compensation Using Dynamic Adaptation
1.2. Towards Environment-Independent Recognition
  1.2.1. Sources of Environmental Variability
  1.2.2. Performance Evaluation
1.3. Dissertation Outline
Chapter 2: Overview of Environmental Robustness in Speech Recognition
2.1. Sources of Degradation ...
Article
Speaker adaptation typically involves customizing some existing (reference) models in order to account for the characteristics of a new speaker. This work considers the slightly different paradigm of customizing some reference data for the purpose of populating the new speaker's space, and then using the resulting (augmented) data to derive the customized models. The data augmentation technique is based on the metamorphic algorithm first proposed in Bellegarda et al. [1992], assuming that a relatively modest amount of data (100 sentences) is available from each new speaker. This constraint requires that reference speakers be selected with some care. The performance of this method is illustrated on a portion of the Wall Street Journal task.
Article
Full-text available
Speech recognition is formulated as a problem of maximum likelihood decoding. This formulation requires statistical models of the speech production process. In this paper, we describe a number of statistical models for use in speech recognition. We give special attention to determining the parameters for such models from sparse data. We also describe two decoding methods, one appropriate for constrained artificial languages and one appropriate for more realistic decoding tasks. To illustrate the usefulness of the methods described, we review a number of decoding results that have been obtained with them.
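
The decoding formulation referred to here is the standard one: choose the word sequence W that maximizes the posterior probability given the acoustic evidence A, which factors into a language-model term and an acoustic-model term:

\hat{W} = \arg\max_{W} P(W \mid A) = \arg\max_{W} P(W)\, P(A \mid W)

The models described in the paper supply P(A | W) (the acoustic channel model) and P(W) (the language model), with the sparse-data techniques addressing the estimation of their parameters.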
Article
A set of techniques to perform fast speaker adaptation for a large vocabulary, natural-language speech recognition system is presented. The experimentation has been carried out using a 20000-word, real-time, natural-language speech recognizer for the Italian language. To perform speaker adaptation within the framework of the probabilistic approach to speech recognition, two different problems must be addressed: codebook adaptation and hidden Markov model parameter adaptation. The basic idea is to use a set of data collected from several different speakers as a source of a priori knowledge, together with a small speech sample provided by the new speaker, to perform the adaptation task. Several different techniques for codebook adaptation have been tried and discussed.
Conference Paper
Vector quantization (VQ) is a technique that drastically reduces computation and memory requirements. In this paper, speaker adaptation algorithms based on VQ are proposed in order to improve speaker-independent recognition. The algorithms use VQ codebooks of a reference speaker and an input speaker: adaptation is performed by substituting vectors in the reference speaker's codebook with vectors from the input speaker's codebook, or vice versa. To confirm the effectiveness of these algorithms, word recognition experiments were carried out using the IBM office correspondence task uttered by 11 speakers. The total number of words is 1174 for each speaker, and the number of different words is 422. The average word recognition rate using a different speaker's reference through speaker adaptation is 80.9%, and the rate within the second choice is 92.0%.
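
A sketch of the substitution step described (the nearest-neighbor pairing rule is an illustrative assumption, not necessarily the paper's):

# Replace each reference codeword by the closest input-speaker codeword,
# yielding an adapted codebook in the input speaker's space.
import numpy as np

def adapt_codebook(ref_codebook, input_codebook):
    """ref_codebook: (K, d); input_codebook: (M, d)."""
    d = np.linalg.norm(ref_codebook[:, None, :] - input_codebook[None], axis=-1)
    nearest = d.argmin(axis=1)        # closest input codeword per reference one
    return input_codebook[nearest]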
Conference Paper
A description is presented of the authors' current research on automatic speech recognition of continuously read sentences from a naturally-occurring corpus: office correspondence. The recognition system combines features from their current isolated-word recognition system and from their previously developed continuous-speech recognition system. It consists of an acoustic processor, an acoustic channel model, a language model, and a linguistic decoder. Some new features in the recognizer relative to the isolated-word speech recognition system include the use of a fast match to prune rapidly to a manageable number the candidates considered by the detailed match, multiple pronunciations of all function words, and modeling of interphone coarticulatory behavior. The authors recorded training and test data from a set of ten male talkers. The perplexity of the test sentences was found to be 93; none of the sentences was part of the data used to generate the language model. Preliminary (speaker-dependent) recognition results on these talkers yielded an average word error rate of 11.0%.
Conference Paper
An alternative approach to speaker adaptation for a large-vocabulary hidden-Markov-model-based speech recognition system is described. The goal of this investigation was to train the IBM speech recognition system with only five minutes of speech data from a new speaker instead of the usual 20 minutes, without the recognition rate dropping by more than 1-2%. The approach is based on the use of a stochastic model representing the different properties of the new speaker and an old speaker for which the full training set of 20 minutes is available. It is called a speaker Markov model. It is shown how the parameters of such a model can be derived and how it can be used for transforming the training set of the old speaker in order to use it in addition to the short training set of the new speaker. The adaptation algorithm was tested with 12 speakers. The average recognition rate dropped from 96.4% to 95.2% for a 5000-word vocabulary task. The decoding time increased by a factor of 1.35; this factor is often 3-5 if other adaptation algorithms are used.
Conference Paper
The Speech Recognition Group at IBM Research has developed a real-time, isolated-word speech recognizer called Tangora, which accepts natural English sentences drawn from a vocabulary of 20000 words. Despite its large vocabulary, the Tangora recognizer requires only about 20 minutes of speech from each new user for training purposes. The accuracy of the system and its ease of training are largely attributable to the use of hidden Markov models in its acoustic match component. An automatic technique for constructing Markov word models is described, and results of experiments with speaker-dependent and speaker-independent models on several isolated-word recognition tasks are included.
Conference Paper
A technique for using the speech of multiple reference speakers as a basis for speaker adaptation in large-vocabulary continuous-speech recognition is introduced. In contrast to other methods that use a pooled reference model, this technique normalizes the training speech from multiple reference speakers to a single common feature space before pooling it. The normalized and pooled speech is then treated as if it came from a single reference speaker for training the reference hidden Markov model (HMM). The usual probabilistic spectrum transformation can be applied to the reference HMM to model a new speaker. Preliminary experimental results are reported from applying this approach to over 100 reference speakers from the speaker-independent portion of the DARPA 1000-Word Resource Management Database.
Article
This paper describes an experimental real-time recognizer of isolated word dictation implemented at the IBM Thomas J. Watson Research Center, on a system of commercially available computers and array processors. The recognizer's intended use is creation of office memoranda. It is based on a 5000-word vocabulary. A specially designed workstation enables the user to correct and edit the transcribed speech. The paper outlines the self-organized, statistical approach underlying the basic algorithms of the recognizer. Results of several recognition experiments are then presented. The rest of the paper considers important issues in the future development of dictation recognizers, such as vocabulary selection, language model creation, and human factors.
Speaker Adaptation for Speech Recognition Systems: Multiple Linear Regression and Multilayer Perceptrons
J. P. Tubach, G. Chollet, K. Choukri, C. Montacié, C. Mokbel, H. Valbret