Conference Paper

Limited Training Data Robust Speech Recognition Using Kernel-Based Acoustic Models

Dept. of Electr. Eng. & Inf. Technol., Otto-von-Guericke-Univ., Magdeburg
DOI: 10.1109/ICASSP.2006.1660226 Conference: 2006 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2006), Volume 1
Source: IEEE Xplore


Contemporary automatic speech recognition uses hidden Markov models (HMMs) to model the temporal structure of speech, where one HMM is used for each phonetic unit. The states of the HMMs are associated with state-conditional probability density functions (PDFs), which are typically realized as mixtures of Gaussian PDFs (GMMs). Training of GMMs is error-prone, especially if the amount of training data is limited. This paper evaluates two new methods of modeling state-conditional PDFs using probabilistically interpreted support vector machines and kernel Fisher discriminants. Extensive experiments on the RM1 corpus (P. Price et al., 1988) yield substantially improved recognition rates compared to traditional GMMs. Due to their generalization ability, our new methods reduce the word error rate by up to 13% using the complete training set and by up to 33% when the training set size is reduced.
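To make the modeling idea concrete, here is a minimal sketch (illustrative only, not the authors' implementation): the raw output of a binary discriminant such as an SVM or KFD is calibrated into a posterior probability with a Platt-style sigmoid, and Bayes' rule then converts that posterior into a scaled likelihood that can stand in for the GMM score at an HMM state. All function names, calibration parameters, and numbers below are hypothetical.

    import numpy as np

    def platt_posterior(decision_value, a, b):
        # Sigmoid calibration of a raw SVM/KFD decision value into P(state | x).
        # a and b are assumed to have been fitted on held-out data (Platt scaling).
        return 1.0 / (1.0 + np.exp(a * decision_value + b))

    def scaled_likelihood(posterior, state_prior):
        # Bayes' rule: p(x | state) is proportional to P(state | x) / P(state);
        # the factor p(x) is the same for every state and cancels in HMM decoding.
        return posterior / state_prior

    # Illustrative numbers only: one feature frame scored against one HMM state.
    decision_value = 1.7   # raw output of the binary discriminant for this state
    a, b = -2.0, 0.1       # hypothetical sigmoid calibration parameters
    state_prior = 0.02     # relative frequency of the state in the training alignment

    post = platt_posterior(decision_value, a, b)
    print("posterior         P(state | x):", post)
    print("scaled likelihood p(x | state) ~", scaled_likelihood(post, state_prior))

In decoding, such a scaled likelihood would simply replace the GMM emission score; dividing by the state prior keeps frequently occurring states from dominating the search.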

ABSTRACT: During the last decade, a new learning paradigm called Structural Risk Minimization (SRM), derived from Statistical Learning Theory, has become widely studied in machine learning. Machines implementing SRM, e.g., Support Vector Machines (SVMs) and Kernel Fisher Discriminants (KFDs), have been used very successfully for solving pattern recognition and function regression problems. SRM's ability to simultaneously minimize the risk of error on training data and the complexity of a learning machine results in better generalization capability than plain Empirical Risk Minimization (ERM), especially if the amount of training data is limited. The present work is devoted to applying SRM to the problem of probability density function (PDF) estimation. When modeling sequences of continuous-valued events using Hidden Markov Models (HMMs), e.g., in automatic speech recognition (ASR), PDFs are used to model the emission probabilities of the HMMs' states. This thesis investigates and develops methods to efficiently train sparse kernel PDF models by regression of the empirical cumulative distribution function (ECDF). A new method for obtaining a sparse approximation of the orthogonal least-squares regression solution by forward selection of relevant samples is presented, in which a novel memory-efficient thin update of the orthogonal decomposition is used. This method is evaluated on standard benchmark problems of up to five dimensions, showing superior performance to traditional parametric Gaussian Mixture Models (GMMs) and similar performance to the theoretically optimal, non-sparse Parzen-window PDF models. However, it is found that this new method cannot be applied to the problem of estimating PDFs for ASR due to the complexity of the ECDF in high dimensions. Instead, posterior class probabilities calibrated from the outputs of binary discriminants such as SVMs or KFDs are turned into class-conditional PDFs using Bayes' rule. This approach is tested within a monophone HMM ASR system on the Resource Management task, significantly outperforming traditional HMM-GMM systems, especially on randomly drawn limited training samples, which demonstrates the new models' improved generalization ability on small-sample problems. In order to realize these large-scale experiments, a novel machine learning software library is presented. Its primary focus is on fast computation, simplicity both in terms of expressing algorithms and extending functionality, and flexibility, in order to properly appreciate algorithms' properties and advantages. The software library follows an object-oriented design and has been implemented in C++. For productivity, the library is equipped with fine-grained tracing, an object-oriented persistence model, transparent error handling, and parallelization on distributed-memory computer clusters.
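The ECDF-regression idea from the first part of this thesis can be illustrated with a small one-dimensional sketch (assumptions: toy data, a fixed bandwidth, a crude fixed subset of kernel centres, and plain least squares in place of the sparse forward-selection and thin-update machinery described above):

    import numpy as np
    from scipy.stats import norm

    # Toy 1-D data from a two-component mixture (illustrative only).
    rng = np.random.default_rng(0)
    x = np.sort(np.concatenate([rng.normal(-1.0, 0.5, 200),
                                rng.normal( 2.0, 0.8, 200)]))
    n = len(x)
    F_emp = np.arange(1, n + 1) / n    # empirical CDF evaluated at the sorted samples

    # Kernel dictionary: Gaussian CDFs centred on a subset of the samples.
    centres = x[::20]                  # crude fixed subset; the thesis selects centres greedily
    h = 0.4                            # assumed bandwidth
    Phi = norm.cdf((x[:, None] - centres[None, :]) / h)

    # Plain least-squares regression of the ECDF; the thesis instead uses sparse
    # orthogonal forward selection with a memory-efficient thin update.
    w, *_ = np.linalg.lstsq(Phi, F_emp, rcond=None)

    # Differentiating the fitted CDF model yields a kernel PDF estimate.
    grid = np.linspace(x.min(), x.max(), 5)
    pdf_hat = (w[None, :] * norm.pdf((grid[:, None] - centres[None, :]) / h) / h).sum(axis=1)
    print(np.round(pdf_hat, 3))

Note that unconstrained least squares does not guarantee non-negative weights or unit total mass, which is one reason a more careful training procedure is needed for a proper PDF model.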
ABSTRACT: Due to the growing need for security applications, speaker recognition, the biometric task of authenticating a claimant by voice, has become a focus of interest. Traditionally, approaches in the area of speaker recognition were mainly based on generative classifiers such as Gaussian Mixture Models (GMMs). More recently, however, other classifiers such as Support Vector Machines (SVMs) have been successfully applied to several fields of pattern recognition. These discriminative classifiers, which are theoretically derived from statistical learning theory, achieve high generalization ability, and have therefore been discussed as a promising approach to improving the performance of speaker recognition systems. Following this train of thought, this work focuses on the development and integration of different discriminative classifiers into the field of speaker recognition. As an alternative to the SVM, we present Sparse Kernel Logistic Regression (SKLR), a sparse non-linear expansion of the well-known logistic regression. In contrast to Support Vector Machines, the SKLR directly models the posterior probability of class membership and therefore naturally provides a probabilistic output. For this reason, a new speaker recognition environment is designed and implemented which includes two different recognition approaches, one for limited and one for large (so-called extended) training data. In the first approach, the discriminative classifiers are applied directly to feature vectors from parameterized speech frames, and it is shown that both the SVM and the SKLR outperform traditional GMM methods. In the second approach, a state-of-the-art speaker recognition system for large amounts of training data is designed that combines Gaussian Mixture Models (GMMs) with discriminative classifiers and integrates the SKLR into this system. Furthermore, we investigate different feature extraction methods for speaker recognition on large amounts of training data. It is shown that fusion schemes which combine these subsystems yield a significant improvement in recognition performance compared to single subsystems. All presented approaches are evaluated on internationally recognized corpora and have been published in appropriate international media. The comparison of our speaker recognition systems with other state-of-the-art systems revealed equal or significantly better recognition performance.
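As a rough illustration of the kernel logistic regression underlying the SKLR (a minimal dense sketch under assumed settings: an RBF kernel, a plain L2 penalty, batch gradient descent, toy data, and no sparsity, which is the defining ingredient of the actual SKLR):

    import numpy as np

    def rbf_kernel(A, B, gamma=0.5):
        # Gaussian (RBF) kernel matrix between the row vectors of A and B.
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)

    def fit_kernel_logistic_regression(X, y, gamma=0.5, lam=1e-2, lr=0.1, iters=500):
        # Minimal dense kernel logistic regression trained by gradient descent.
        # The SKLR of the cited work additionally enforces sparsity in alpha;
        # this sketch uses a plain L2 penalty and keeps all training points.
        K = rbf_kernel(X, X, gamma)
        alpha, b = np.zeros(len(X)), 0.0
        for _ in range(iters):
            p = 1.0 / (1.0 + np.exp(-(K @ alpha + b)))   # posterior P(y=1 | x)
            alpha -= lr * (K.T @ (p - y) / len(X) + lam * alpha)
            b -= lr * (p - y).mean()
        return alpha, b

    def predict_proba(X_train, alpha, b, X_new, gamma=0.5):
        # Posterior probability of the target class for new feature vectors.
        return 1.0 / (1.0 + np.exp(-(rbf_kernel(X_new, X_train, gamma) @ alpha + b)))

    # Toy 2-D "impostor vs. target speaker" data, illustrative only.
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(3.0, 1.0, (50, 2))])
    y = np.concatenate([np.zeros(50), np.ones(50)])
    alpha, b = fit_kernel_logistic_regression(X, y)
    print(predict_proba(X, alpha, b, np.array([[0.0, 0.0], [3.0, 3.0]])))

The direct probabilistic output shown here is the property the abstract highlights: unlike an SVM score, it can be used as a posterior without a separate calibration step.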