Article

A variational perspective on noise-robust speech recognition

Abstract

Model compensation methods for noise-robust speech recognition have shown good performance. Predictive linear transformations can approximate these methods to balance computational complexity and compensation accuracy. This paper examines both of these approaches from a variational perspective. Using a matched-pair approximation at the component level yields a number of standard forms of model compensation and predictive linear transformations. However, a tighter bound can be obtained by using variational approximations at the state level. Both model-based and predictive linear transform schemes can be implemented in this framework. Preliminary results show that the tighter bound obtained from the state-level variational approach can yield improved performance over standard schemes.
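The two levels of approximation in this abstract can be illustrated with a toy numerical sketch (all parameters are invented, and 1-D Gaussians stand in for real acoustic models). Samples drawn per component are either scored only by the component they came from (a matched-pair bound at the component level) or soft-assigned to every component of the state (the variational, state-level bound). The latter can only be tighter, since the log-sum over components dominates any single term:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_gauss(x, mu, var):
    # Log-density of a 1-D Gaussian.
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

# Toy corrupted-speech state: a 2-component GMM (illustrative parameters).
w = np.array([0.5, 0.5])
mu = np.array([-1.0, 2.0])
var = np.array([1.0, 0.64])

# Corrupted-speech samples drawn per component, as in iterative DPMC.
samples = [rng.normal(m, 1.2, 500) for m in mu]

# Matched-pair bound: each sample is scored only by its own component.
matched = np.mean([np.mean(np.log(w[k]) + log_gauss(s, mu[k], var[k]))
                   for k, s in enumerate(samples)])

# State-level variational bound: soft-assign every sample to all components.
x = np.concatenate(samples)
log_joint = np.log(w) + log_gauss(x[:, None], mu, var)
state = np.mean(np.logaddexp.reduce(log_joint, axis=1))
```

This mirrors the reported result: relaxing the constraint that samples model the Gaussians they were drawn from can only tighten the likelihood bound.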

... The iterative DPMC method [133] solves this problem by sampling from GMMs instead of Gaussians and then uses the Baum-Welch algorithm to re-estimate the distorted speech parameters. This is extended in [220], where a variational method is used to remove the constraint that the samples must be used to model the Gaussians they are originally drawn from. This is also extended to variational PCMLLR [220], which is shown to be better than PCMLLR [211] and has a much lower computational cost than variational DPMC. ...
... In [221] the Gaussian at the input of the nonlinearity is approximated by a GMM whose individual components have a smaller variance than the original Gaussian. ...
Article
New waves of consumer-centric applications, such as voice search and voice interaction with mobile devices and home entertainment systems, increasingly require automatic speech recognition (ASR) to be robust to the full range of real-world noise and other acoustic distorting conditions. Despite its practical importance, however, the inherent links between and distinctions among the myriad of methods for noise-robust ASR have yet to be carefully studied in order to advance the field further. To this end, it is critical to establish a solid, consistent, and common mathematical foundation for noise-robust ASR, which is lacking at present. This article is intended to fill this gap and to provide a thorough overview of modern noise-robust techniques for ASR developed over the past 30 years. We emphasize methods that are proven to be successful and that are likely to sustain or expand their future applicability. We distill key insights from our comprehensive overview in this field and take a fresh look at a few old problems, which nevertheless are still highly relevant today. Specifically, we have analyzed and categorized a wide range of noise-robust techniques using five different criteria: 1) feature-domain vs. model-domain processing, 2) the use of prior knowledge about the acoustic environment distortion, 3) the use of explicit environment-distortion models, 4) deterministic vs. uncertainty processing, and 5) the use of acoustic models trained jointly with the same feature enhancement or model adaptation process used in the testing stage. With this taxonomy-oriented review, we equip the reader with the insight to choose among techniques and with the awareness of the performance-complexity tradeoffs. The pros and cons of using different noise-robust ASR techniques in practical application scenarios are provided as a guide to interested practitioners. The current challenges and future research directions in this field are also carefully analyzed.
Article
One way of making speech recognisers more robust to noise is model compensation. Rather than enhancing the incoming observations, model compensation techniques modify a recogniser's state-conditional distributions so they model the speech in the target environment. Because the interaction between speech and noise is non-linear, even for Gaussian speech and noise the corrupted speech distribution has no closed form. Thus, model compensation methods approximate it with a parametric distribution, such as a Gaussian or a mixture of Gaussians. The impact of this approximation has never been quantified. This paper therefore introduces a non-parametric method to compute the likelihood of a corrupted speech observation. It uses sampling and, given speech and noise distributions and a mismatch function, is exact in the limit. It therefore gives a theoretical bound for model compensation. Though computing the likelihood is computationally expensive, the novel method enables a performance comparison based on the criterion that model compensation methods aim to minimise: the KL divergence to the ideal compensation. It gives the point where the Kullback–Leibler (KL) divergence is zero. This paper examines the performance of various compensation methods, such as vector Taylor series (VTS) and data-driven parallel model combination (DPMC). It shows that more accurate modelling than Gaussian-for-Gaussian compensation improves the performance of speech recognition.
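The DPMC idea referred to here can be sketched in a few lines (a toy with invented 1-D log-spectral parameters and the common log-add mismatch function, not the paper's actual setup): sample clean speech and noise, map the sample pairs through the mismatch, and moment-match a Gaussian to the result.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented log-spectral clean speech and noise components (1-D for clarity).
mu_x, var_x = 8.0, 1.0   # clean speech log-energy
mu_n, var_n = 6.0, 0.5   # noise log-energy

# Log-add mismatch function: y = log(exp(x) + exp(n)).
x = rng.normal(mu_x, np.sqrt(var_x), 100_000)
n = rng.normal(mu_n, np.sqrt(var_n), 100_000)
y = np.logaddexp(x, n)

# DPMC-style single-Gaussian compensation: moment-match the samples.
mu_y, var_y = y.mean(), y.var()
```

Because the true corrupted-speech distribution has no closed form, this moment-matched Gaussian is itself an approximation, which is exactly the gap the paper's non-parametric likelihood quantifies.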
Book
Full-text available
HTK is a toolkit for building Hidden Markov Models (HMMs). HMMs can be used to model any time series and the core of HTK is similarly general-purpose. However, HTK is primarily designed for building HMM-based speech processing tools, in particular recognisers. Thus, much of the infrastructure support in HTK is dedicated to this task. There are two major processing stages involved. Firstly, the HTK training tools are used to estimate the parameters of a set of HMMs using training utterances and their associated transcriptions. Secondly, unknown utterances are transcribed using the HTK recognition tools.
Conference Paper
Full-text available
To make speech recognisers robust to noise, either the features or the models can be compensated. Feature enhancement is often fast; model compensation is often more accurate, because it predicts the corrupted speech distribution. It is therefore able, for example, to take uncertainty about the clean speech into account. This paper re-analyses the recently-proposed predictive linear transformations for noise compensation as minimising the KL divergence between the predicted corrupted speech and the adapted models. New schemes are then introduced which apply observation-dependent transformations in the front-end to adapt the back-end distributions. One applies transforms in the exact same manner as the popular minimum mean square error (MMSE) feature enhancement scheme, and is as fast. The new method performs better on AURORA 2. Index Terms: speech recognition, noise robustness
Conference Paper
Full-text available
Model compensation techniques for noise-robust speech recognition approximate the corrupted speech distribution. This paper introduces a sampling method that, given speech and noise distributions and a mismatch function, in the limit calculates the corrupted speech likelihood exactly. Though it is too slow to compensate a speech recognition system, it enables a more fine-grained assessment of compensation techniques, based on the KL divergence of individual components. This makes it possible to evaluate the impact of approximations that compensation schemes make, such as the form of the mismatch function. Index Terms: speech recognition, noise robustness
Conference Paper
Full-text available
Rapidly adapting a speech recognition system to new speakers using a small amount of adaptation data is important to improve initial user experience. In this paper, a count-smoothing framework for incorporating prior information is extended to allow for the use of different forms of dynamic prior and improve the robustness of transform estimation on small amounts of data. Prior information is obtained from existing rapid adaptation techniques like VTLN and PCMLLR. Results using VTLN as a dynamic prior for CMLLR estimation show that transforms estimated on just one utterance can yield relative gains of 15% and 46% over a baseline gender independent model on two tasks.
Conference Paper
Full-text available
In this paper, we propose to improve our previously developed method for joint compensation of additive and convolutive distortions (JAC) applied to model adaptation. The improvement entails replacing the vector Taylor series (VTS) approximation with unscented transform (UT) in formulating both the static and dynamic model parameter adaptation. Our new JAC-UT method differentiates itself from other UT-based approaches in that it combines the online noise and channel distortion estimation and model parameter adaptation in a unified UT framework. Experimental results on the standard Aurora 2 task show that the new algorithm enjoys 20.0% and 16.9% relative word error rate reductions over the previous JAC-VTS algorithm when using the simple and complex backend models, respectively.
Conference Paper
Full-text available
In this work, we derive a Monte Carlo expectation maximization algorithm for estimating noise from a noisy utterance. In contrast to earlier approaches, where the distribution of noise was estimated based on a vector Taylor series expansion, we use a combination of importance sampling and Parzen-window density estimation to numerically approximate the occurring integrals with the Monte Carlo method. Experimental results show that the proposed algorithm has superior convergence properties, compared to previous implementations of the EM algorithm. Its application to speech feature enhancement reduced the word error rate by over 30% on a phone number recognition task recorded in a (real) noisy car environment.
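The sampling-plus-Parzen-window idea can be sketched as follows (a hedged toy: 1-D features, invented speech and noise priors, and plain sampling from the priors rather than a tuned importance proposal). Because the mismatch function is deterministic, naive Monte Carlo cannot evaluate the corrupted-speech density at a point; smoothing the mapped samples with a Parzen (Gaussian) window can.

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented 1-D log-spectral priors for clean speech and noise.
x = rng.normal(8.0, 1.0, 20_000)
n = rng.normal(6.0, 0.7, 20_000)

# Deterministic log-add mismatch maps each (x, n) draw to corrupted speech.
y_samples = np.logaddexp(x, n)

def parzen_density(points, data, h=0.1):
    # Gaussian Parzen-window estimate of the density at `points`.
    z = (np.asarray(points)[:, None] - data) / h
    return np.exp(-0.5 * z ** 2).mean(axis=1) / (h * np.sqrt(2 * np.pi))

# Density near the bulk of the corrupted distribution vs. far in the tail.
p = parzen_density([8.1, 12.0], y_samples)
```

The bandwidth `h` is an invented constant here; in the paper's setting such numerical approximations of the occurring integrals are embedded in an EM loop for noise estimation.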
Conference Paper
Full-text available
The Kullback–Leibler (KL) divergence is a widely used tool in statistics and pattern recognition. The KL divergence between two Gaussian mixture models (GMMs) is frequently needed in the fields of speech and image recognition. Unfortunately the KL divergence between two GMMs is not analytically tractable, nor does any efficient computational algorithm exist. Some techniques cope with this problem by replacing the KL divergence with other functions that can be computed efficiently. We introduce two new methods, the variational approximation and the variational upper bound, and compare them to existing methods. We discuss seven different techniques in total and weigh the benefits of each one against the others. To conclude we evaluate the performance of each one through numerical experiments.
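The variational approximation described above can be written down compactly for 1-D GMMs. This is a sketch of that style of formula, with invented example parameters; it matches clean pairwise Gaussian KL terms in the numerator against cross-GMM terms in the denominator:

```python
import numpy as np

def kl_gauss(m1, v1, m2, v2):
    # Exact KL divergence between two 1-D Gaussians N(m1, v1) and N(m2, v2).
    return 0.5 * (np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def kl_gmm_variational(wf, mf, vf, wg, mg, vg):
    # Variational approximation of KL(f || g) for two 1-D GMMs.
    total = 0.0
    for a in range(len(wf)):
        num = sum(wf[b] * np.exp(-kl_gauss(mf[a], vf[a], mf[b], vf[b]))
                  for b in range(len(wf)))
        den = sum(wg[b] * np.exp(-kl_gauss(mf[a], vf[a], mg[b], vg[b]))
                  for b in range(len(wg)))
        total += wf[a] * np.log(num / den)
    return total

# Invented example GMMs.
f_w, f_m, f_v = [0.5, 0.5], [0.0, 4.0], [1.0, 1.0]
g_w, g_m, g_v = [0.5, 0.5], [2.0, 6.0], [1.0, 1.0]

d_self = kl_gmm_variational(f_w, f_m, f_v, f_w, f_m, f_v)  # identical GMMs
d_fg = kl_gmm_variational(f_w, f_m, f_v, g_w, g_m, g_v)
```

For identical GMMs the approximation is exactly zero, matching the true divergence; in general it is only an approximation, which is why the paper compares several such surrogates numerically.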
Article
Full-text available
Table of contents (excerpt): Acknowledgments; Chapter 1, Introduction (1.1 Thesis goals; 1.2 Dissertation Outline); Chapter 2, The SPHINX-II Recognition System (2.1 An Overview of the SPHINX-II System: 2.1.1 Signal Processing, 2.1.2 Hidden Markov Models, 2.1.3 Recognition Unit, 2.1.4 Training, 2.1.5 Recognition; 2.2 Experimental Tasks and Corpora) ...
Thesis
A standard way of improving the robustness of speech recognition systems to noise is model compensation. This replaces a speech recogniser's distributions over clean speech by ones over noise-corrupted speech. For each clean speech component, model compensation techniques usually approximate the corrupted speech distribution with a diagonal-covariance Gaussian distribution. This thesis looks into improving on this approximation in two ways: firstly, by estimating full-covariance Gaussian distributions; secondly, by approximating corrupted-speech likelihoods without any parameterised distribution. The first part of this work is about compensating for within-component feature correlations under noise. For this, the covariance matrices of the computed Gaussians should be full instead of diagonal. The estimation of off-diagonal covariance elements turns out to be sensitive to approximations. A popular approximation is the one that state-of-the-art compensation schemes, like VTS compensation, use for dynamic coefficients: the continuous-time approximation. Standard speech recognisers contain both per-time slice, static, coefficients, and dynamic coefficients, which represent signal changes over time, and are normally computed from a window of static coefficients. To remove the need for the continuous-time approximation, this thesis introduces a new technique. It first compensates a distribution over the window of statics, and then applies the same linear projection that extracts dynamic coefficients. It introduces a number of methods that address the correlation changes that occur in noise within this framework. The next problem is decoding speed with full covariances. This thesis re-analyses the previously-introduced predictive linear transformations, and shows how they can model feature correlations at low and tunable computational cost. The second part of this work removes the Gaussian assumption completely. 
It introduces a sampling method that, given speech and noise distributions and a mismatch function, in the limit calculates the corrupted speech likelihood exactly. For this, it transforms the integral in the likelihood expression, and then applies sequential importance resampling. Though it is too slow to use for recognition, it enables a more fine-grained assessment of compensation techniques, based on the KL divergence to the ideal compensation for one component. The KL divergence proves to predict the word error rate well. This technique also makes it possible to evaluate the impact of approximations that standard compensation schemes make.
Chapter
Although the SDCN technique performs acceptably, it has the disadvantage that new microphones must be “calibrated” by collecting long-term statistics from a new stereo database. Since this stereo database will not be available in general, SDCN cannot adapt to a new environment. A new algorithm, Codeword-Dependent Cepstral Normalization (CDCN), was proposed to circumvent these problems, and will be the topic of this chapter.
Book
This dissertation describes a number of algorithms developed to increase the robustness of automatic speech recognition systems with respect to changes in the environment. These algorithms attempt to improve the recognition accuracy of speech recognition systems when they are trained and tested in different acoustical environments, and when a desk-top microphone (rather than a close-talking microphone) is used for speech input. Without such processing, mismatches between training and testing conditions produce an unacceptable degradation in recognition accuracy.
Article
Model compensation is a standard way of improving the robustness of speech recognition systems to noise. A number of popular schemes are based on vector Taylor series (VTS) compensation, which uses a linear approximation to represent the influence of noise on the clean speech. To compensate the dynamic parameters, the continuous time approximation is often used. This approximation uses a point estimate of the gradient, which fails to take into account that dynamic coefficients are a function of a number of consecutive static coefficients. In this paper, the accuracy of dynamic parameter compensation is improved by representing the dynamic features as a linear transformation of a window of static features. A modified version of VTS compensation is applied to the distribution of the window of static features and, importantly, their correlations. These compensated distributions are then transformed to distributions over standard static and dynamic features. With this improved approximation, it is also possible to obtain full-covariance corrupted speech distributions. This addresses the correlation changes that occur in noise. The proposed scheme outperformed the standard VTS scheme by 10% to 20% relative on a range of tasks.
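The first-order VTS compensation this article builds on can be sketched for a single log-spectral dimension (illustrative parameters; real systems work on cepstral features with the DCT and its inverse folded into the mismatch function, and compensate dynamic parameters as discussed above):

```python
import numpy as np

def vts_compensate(mu_x, var_x, mu_n, var_n):
    """First-order VTS compensation of a 1-D log-spectral Gaussian
    under the log-add mismatch y = log(exp(x) + exp(n))."""
    g = 1.0 / (1.0 + np.exp(mu_n - mu_x))      # dy/dx at the expansion point
    mu_y = mu_x + np.log1p(np.exp(mu_n - mu_x))  # mismatch at the means
    var_y = g ** 2 * var_x + (1.0 - g) ** 2 * var_n
    return mu_y, var_y

# Invented clean-speech and noise parameters.
mu_y, var_y = vts_compensate(8.0, 1.0, 6.0, 0.5)

# High-SNR limit: the gradient g tends to 1 and compensation vanishes.
mu_hi, var_hi = vts_compensate(20.0, 1.0, 0.0, 0.5)
```

Note that the linearisation shrinks the compensated variance relative to a naive sum of speech and noise variances; this point-estimate gradient is exactly what the windowed-statics scheme in the abstract improves on for dynamic coefficients.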
Article
This paper examines the application of linear transformations for speaker and environmental adaptation in an HMM-based speech recognition system. In particular, transformations that are trained in a maximum likelihood sense on adaptation data are investigated. Only model-based linear transforms are considered, since, for linear transforms, they subsume the appropriate feature-space transforms. The paper compares the two possible forms of model-based transforms: (i) unconstrained, where any combination of mean and variance transform may be used, and (ii) constrained, which requires the variance transform to have the same form as the mean transform. Re-estimation formulae for all appropriate cases of transform are given. This includes a new and efficient full variance transform and the extension of the constrained model-space transform from the simple diagonal case to the full or block-diagonal case. The constrained and unconstrained transforms are evaluated in terms of computational cost, recognition time efficiency, and use for speaker adaptive training. The recognition performance of the two model-space transforms on a large vocabulary speech recognition task using incremental adaptation is investigated. In addition, initial experiments using the constrained model-space transform for speaker adaptive training are detailed.
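The claim that constrained model-space transforms subsume feature-space transforms can be checked numerically. This is a sketch with invented parameters: adapting the Gaussian with A and b gives the same log-likelihood as transforming the observation with A^{-1}(o - b) and adding the log-Jacobian term.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 3

# Invented constrained (CMLLR-style) transform: mean -> A mu + b,
# covariance -> A Sigma A^T. Kept near the identity so it is invertible.
A = np.eye(d) + 0.1 * rng.standard_normal((d, d))
b = rng.standard_normal(d)
mu = rng.standard_normal(d)
Sigma = np.eye(d)
o = rng.standard_normal(d)  # an observation vector

def log_gauss(o, mu, Sigma):
    # Log-density of a multivariate Gaussian.
    diff = o - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (len(mu) * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.solve(Sigma, diff))

# Model-space view: adapt the Gaussian's parameters.
ll_model = log_gauss(o, A @ mu + b, A @ Sigma @ A.T)

# Feature-space view: transform the observation, add the log-Jacobian.
o_hat = np.linalg.solve(A, o - b)
ll_feat = log_gauss(o_hat, mu, Sigma) - np.linalg.slogdet(A)[1]
```

The feature-space form is what makes constrained transforms cheap at decode time: the transform is applied once per frame rather than once per Gaussian.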
Conference Paper
A database of continuous read speech has been designed and recorded within the DARPA strategic computing speech recognition program. The data is intended for use in designing and evaluating algorithms for speaker-independent, speaker-adaptive and speaker-dependent speech recognition. The data consists of read sentences appropriate to a naval resource management task built around existing interactive database and graphics programs. The 1000-word task vocabulary is intended to be logically complete and habitable. The database, which represents over 21000 recorded utterances from 160 talkers with a variety of dialects, includes a partition of sentences and talkers for training and for testing purposes.
Article
Model-based noise compensation techniques are a powerful approach to improve speech recognition performance in noisy environments. However, one of the major issues with these schemes is that they are computationally expensive. Though techniques have been proposed to address this problem, they often result in degradations in performance. This paper proposes a new, highly flexible, approach which allows the computational load required for noise compensation to be controlled while maintaining good performance. The scheme applies the improved joint uncertainty decoding with the predictive linear transform framework. The final compensation is implemented as a set of linear transforms of the features, decoupling the computational cost of compensation from the complexity of the recognition system acoustic models. Furthermore, by using linear transforms, changes in the correlations in the feature vector can also be efficiently modeled. The proposed methods can be easily applied in an adaptive training scheme, including discriminative adaptive training. The performance of the approach is compared to a number of standard schemes on Aurora 2 as well as in-car speech recognition tasks. Results indicate that the proposed scheme is an attractive alternative to existing approaches.
Conference Paper
We recently proposed a new algorithm to perform acoustic model adaptation to noisy environments called Linear Spline Interpolation (LSI). In this method, the nonlinear relationship between clean and noisy speech features is modeled using linear spline regression. Linear spline parameters that minimize the error between the predicted noisy features and the actual noisy features are learned from training data. A variance associated with each spline segment captures the uncertainty in the assumed model. In this work, we extend the LSI algorithm in two ways. First, the adaptation scheme is extended to compensate for the presence of linear channel distortion. Second, we show how the noise and channel parameters can be updated during decoding in an unsupervised manner within the LSI framework. Using LSI, we obtain an average relative improvement in word error rate of 10.8% over VTS adaptation on the Aurora 2 task with improvements of 15-18% at SNRs between 10 and 15 dB.
Article
The NOISEX-92 experiment and database is described and discussed. NOISEX-92 specifies a carefully controlled experiment on artificially noisy speech data, examining performance for a limited digit recognition task but with a relatively wide range of noises and signal-to-noise ratios. Example recognition results are given.
Conference Paper
Model compensation is a standard way of improving speech recognisers' robustness to noise. Most model compensation techniques produce diagonal covariances. However, this fails to handle changes in the feature correlations due to the noise. This paper presents a scheme that allows full covariance matrices to be estimated. One problem is that full covariance matrix estimation will be more sensitive to approximations, like those for dynamic parameters which are known to be crude. In this paper a linear transformation of a window of consecutive frames is used as the basis for dynamic parameter compensation. A second problem is that the resulting full covariance matrices slow down decoding. This is addressed by using predictive linear transforms that decorrelate the feature space, so that the decoder can then use diagonal covariance matrices. On a noise-corrupted Resource Management task, the proposed scheme outperformed the standard VTS compensation scheme. Index Terms: Noise robust speech recognition, vector Taylor series, joint uncertainty decoding.
Conference Paper
In this paper we present an analytic derivation of the moments of the phase factor between clean speech and noise cepstral or log-mel-spectral feature vectors. The development shows, among other things, that the probability density of the phase factor is of sub-Gaussian nature and that it is independent of the noise type and the signal-to-noise ratio, though it does depend on the mel filter bank index. Further we show how to compute the contribution of the phase factor to both the mean and the variance of the noisy speech observation likelihood, which relates the speech and noise feature vectors to those of noisy speech. The resulting phase-sensitive observation model is then used in model-based speech feature enhancement, leading to significant improvements in word accuracy on the AURORA2 database.
Conference Paper
In this paper we address the problem of robustness of speech recognition systems in noisy environments. The goal is to estimate the parameters of a HMM that is matched to a noisy environment, given a HMM trained with clean speech and knowledge of the acoustical environment. We propose a method based on truncated vector Taylor series that approximates the performance of a system trained with that corrupted speech. We also provide insight on the approximations used in the model of the environment and compare them with the lognormal approximation in PMC.
Conference Paper
Model compensation is a standard way of improving speech recognisers' robustness to noise. Currently popular schemes are based on vector Taylor series (VTS) compensation. They often use the continuous time approximation to compensate dynamic parameters. In this paper, the accuracy of dynamic parameter compensation is improved by representing the dynamic features as a linear transformation of a window of static features. A modified version of VTS compensation is applied to the distribution of the window of static features and, importantly, their correlations. These compensated distributions are then transformed to standard static and dynamic distributions. The proposed scheme outperformed the standard VTS scheme by about 10% relative.
Conference Paper
It is well known that the addition of background noise alters the correlations between the elements of, for example, the MFCC feature vector. However, standard model-based compensation techniques do not modify the feature-space in which the diagonal covariance matrix Gaussian mixture models are estimated. One solution to this problem, which yields good performance, is joint uncertainty decoding (JUD) with full transforms. Unfortunately, this results in a high computational cost during decoding. This paper contrasts two approaches to approximating full JUD while lowering the computational cost. Both use predictive linear transforms to modify the feature-space: adaptation-based linear transforms, where the model parameters are restricted to be the same as the original clean system; and precision matrix modelling approaches, in particular semi-tied covariance matrices. These predictive transforms are estimated using statistics derived from the full JUD transforms rather than noisy data. The schemes are evaluated on AURORA 2 and a noise-corrupted resource management task.
Article
... observed in terms of both a distance measure, the average Kullback-Leibler number on a feature vector component level, and the effect on word accuracy. For best performance in noise-corrupted environments, it is necessary to compensate all these parameters. Various methods for compensating the HMMs are described. These may be split into two classes. The first, non-iterative PMC, assumes that the frame/state component alignment associated with the speech models and the clean speech data is unaltered by the addition of noise. This implies that the corrupted-speech distributions are approximately Gaussian, which is known to be false. However, this assumption allows rapid adaptation of the model parameters. The second class of PMC is iterative PMC, where only the frame/state alignment is assumed unaltered. By allowing the component alignment within a state to vary, it is possible to better model the corrupted-speech distribution. One implementation is described, Data-driven Parallel Model Combination (DPMC).