Article

Comparison of Discriminative Training Criteria


Abstract

In this paper, a formally unifying approach to a class of discriminative training criteria, including the Maximum Mutual Information (MMI) and the Minimum Classification Error (MCE) criterion, is presented, together with the optimization methods gradient descent (GD) and the extended Baum-Welch (EB) algorithm. Comparisons are discussed for the MMI and the MCE criterion, including the determination of the sets of word sequence hypotheses for discrimination using word graphs. Experiments have been carried out on the SieTill corpus of telephone-line recorded German continuous digit strings. Across several approaches to acoustic modeling, the word error rates obtained by MMI training using single densities were always better than those obtained by Maximum Likelihood (ML) training using mixture densities. Finally, the results obtained for corrective training (CT), i.e. using only the best recognized word sequence in addition to the spoken word sequence, could not be improved by word graph based discriminative training.
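The unifying view can be made concrete with a small sketch. Below, both criteria are expressed as a smoothing function applied to the log likelihood ratio of the spoken word sequence against a set of competing sequences; the function names and the simple competitor list are illustrative assumptions, not the paper's implementation.

```python
import math

def smoothed_ratio(log_p_spoken, log_p_competing, smoothing=None):
    """Smoothed log likelihood ratio for one training utterance.

    log_p_spoken    : joint log score log p(x|W) + log P(W) of the spoken sequence
    log_p_competing : joint log scores of the competing word sequences
    smoothing       : optional smoothing function f applied to the log ratio
    """
    # log-sum-exp over competitors for numerical stability
    m = max(log_p_competing)
    log_denom = m + math.log(sum(math.exp(s - m) for s in log_p_competing))
    ratio = log_p_spoken - log_denom
    return smoothing(ratio) if smoothing else ratio

# MMI: the competing set includes the spoken sequence and f is the identity.
# MCE: the spoken sequence is excluded and f is a sigmoid, so only utterances
#      near the decision boundary contribute strongly to the criterion.
def mce_loss_smoothing(ratio, alpha=1.0):
    # smoothed error count: ~0 when the spoken sequence clearly wins, ~1 otherwise
    return 1.0 / (1.0 + math.exp(alpha * ratio))
```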


... Note that for ease of representation we skip the dimension index d in the following formulae. ... with iteration constant D. $\Gamma_{ki}(g(x))$ and $\Gamma_k(g(x))$ are discriminative averages of functions $g(x)$ of the training observations, defined by

$$\Gamma_{ki}(g(x)) = \sum_n \left[ \delta_{i,\, i_{k,n}}\, \delta_{k,\, k_n} - p(k \mid x_n) \right] g(x_n) \qquad (8)$$

$$\Gamma_k(g(x)) = \sum_i \Gamma_{ki}(g(x)) \qquad (9)$$

$\delta_{i,j}$ is the Kronecker delta, i.e. given a training observation $x_n$ of class $k_n$, $\delta_{i,\, i_{k,n}} = 1$ only if $i$ is the 'best-fitting' component density $i_{k,n}$ given class $k$, and $\delta_{k,\, k_n} = 1$ only if $k = k_n$. For fast but reliable convergence of the MMI criterion, the choice of the iteration constant D is crucial. ...
... For instance, the reestimation formula (7) for the mixture weights $c_{ki}$ is known to converge very slowly. We are currently implementing modified reestimation formulae which are known to give better convergence [8,9]. Future work also includes realizing other discriminative criteria such as the minimum classification error criterion. ...
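The excerpt above leaves the reestimation formula itself to the cited paper. As a rough illustration only, a commonly used extended Baum-Welch update for a Gaussian mean built from the discriminative averages of Eq. (8) looks like the following minimal sketch; the exact formula and the choice of D in the cited work may differ.

```python
import numpy as np

def eb_mean_update(gamma_x, gamma_1, mu_old, D):
    """One extended Baum-Welch style mean update for component i of class k.

    gamma_x : Gamma_ki(x), discriminative average of the observations (Eq. 8)
    gamma_1 : Gamma_ki(1), discriminative average of the constant function 1
    mu_old  : current mean vector of the component density
    D       : iteration constant; chosen large enough that gamma_1 + D > 0
    """
    return (np.asarray(gamma_x) + D * np.asarray(mu_old)) / (gamma_1 + D)
```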
Conference Paper
Full-text available
In this paper we present a discriminative training procedure for Gaussian mixture densities. Conventional maximum likelihood (ML) training of such mixtures proved to be very efficient for object recognition, even though each class is treated separately in training. Discriminative criteria offer the advantage that they also use out-of-class data, that is they aim at optimizing class separability. We present results on the US Postal Service (USPS) handwritten digits database and compare the discriminative results to those obtained by ML training. We also compare our best results with those reported by other groups, proving them to be state-of-the-art.
Article
Multilabel categorization, which is more difficult but more practical than conventional binary and multiclass categorization, has received a great deal of attention in recent years. This paper proposes a novel probabilistic generative model, the label correlation mixture model (LCMM), to describe multiply labeled documents; it can be used for multilabel spoken document categorization as well as multilabel text categorization. In LCMM, labels and topics are in one-to-one correspondence. The LCMM consists of two important components: 1) a label correlation model and 2) a multilabel conditioned document model. The label correlation model formulates the generating process of labels, taking the dependences between labels into account. We also propose an efficient algorithm for calculating the probability of generating an arbitrary subset of labels. The multilabel conditioned document model can be regarded as a supervised label mixture model in which the labels of a document are known. Each label is characterized by distributions over words. For the parameter learning of the multilabel conditioned document model, in addition to maximum-likelihood estimation, a discriminative approach based on minimum classification error rate training is proposed. To evaluate LCMM, extensive multilabel categorization experiments are conducted on a spoken document data set and three standard text data sets. The experimental results, in comparison with other competitive methods, demonstrate the effectiveness of LCMM.
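The abstract does not spell out the subset-probability algorithm. Purely as a toy illustration of how label dependences can enter such a computation, one might score an ordered label subset with a first-order chain, as sketched below; all names are hypothetical and this is not the paper's actual model.

```python
def subset_log_prob(labels, log_prior, log_cond):
    """Toy score for an ordered label subset under a first-order chain assumption.

    labels    : list of label ids, e.g. [3, 7, 1]
    log_prior : dict mapping label -> log P(label starts the subset)
    log_cond  : dict mapping (prev, cur) -> log P(cur | prev)

    NOT the paper's algorithm; it only illustrates how pairwise label
    dependences can enter the probability of generating a subset of labels.
    """
    if not labels:
        return 0.0
    score = log_prior[labels[0]]
    for prev, cur in zip(labels, labels[1:]):
        score += log_cond[(prev, cur)]
    return score
```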
Article
Full-text available
We study whether it is possible to infer from eye movements measured during reading what is relevant for the user in an information retrieval task. Inference is made using hidden Markov and discriminative hidden Markov models. The result of this feasibility study is that prediction of relevance is possible to a certain extent, and models benefit from taking into account the time series nature of the data.
Article
The aim of this work is to build up a common framework for a class of discriminative training criteria and optimization methods for continuous speech recognition. A unified discriminative criterion based on likelihood ratios of correct and competing models with optional smoothing is presented. The unified criterion leads to particular criteria through the choice of competing word sequences and the choice of smoothing. Analytic and experimental comparisons are presented for both the maximum mutual information (MMI) and the minimum classification error (MCE) criterion together with the optimization methods gradient descent (GD) and extended Baum (EB) algorithm. A tree search-based restricted recognition method using word graphs is presented, so as to reduce the computational complexity of large vocabulary discriminative training. Moreover, for MCE training, a method using word graphs for efficient calculation of discriminative statistics is introduced. Experiments were performed for continuous speech recognition using the ARPA Wall Street Journal (WSJ) corpus with a vocabulary of 5k words and for the recognition of continuously spoken digit strings using both the TI digit string corpus for American English digits, and the SieTill corpus for telephone line recorded German digits. For the MMI criterion, neither analytical nor experimental results indicate significant differences between EB and GD optimization. For acoustic models of low complexity, MCE training gave significantly better results than MMI training. The recognition results for large vocabulary MMI training on the WSJ corpus show a significant dependence on the context length of the language model used for training. Best results were obtained using a unigram language model for MMI training. No significant correlation has been observed between the language models chosen for training and recognition.
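Word graphs make such statistics tractable because posterior quantities can be accumulated by a forward-backward pass over the graph instead of enumerating full sentence hypotheses. The following is a minimal sketch of arc posterior computation on a word graph; the data layout and the assumption of topologically ordered integer node ids are ours, not the paper's.

```python
import math
from collections import defaultdict

def arc_posteriors(arcs, start, end):
    """Arc posterior probabilities in a word graph via forward-backward.

    arcs  : list of (from_node, to_node, log_score) triples forming a DAG;
            node ids are integers assumed to be topologically ordered, and
            every node is assumed to lie on some path from `start` to `end`
    Returns one posterior probability per arc, in the input order.
    """
    def logsumexp(vals):
        m = max(vals)
        return m + math.log(sum(math.exp(v - m) for v in vals))

    incoming, outgoing = defaultdict(list), defaultdict(list)
    for idx, (u, v, s) in enumerate(arcs):
        outgoing[u].append(idx)
        incoming[v].append(idx)

    nodes = sorted({n for u, v, _ in arcs for n in (u, v)})
    fwd = {start: 0.0}   # forward log scores
    for n in nodes:
        if n != start:
            fwd[n] = logsumexp([fwd[arcs[i][0]] + arcs[i][2] for i in incoming[n]])
    bwd = {end: 0.0}     # backward log scores
    for n in reversed(nodes):
        if n != end:
            bwd[n] = logsumexp([arcs[i][2] + bwd[arcs[i][1]] for i in outgoing[n]])

    total = fwd[end]     # total log score of all paths through the graph
    return [math.exp(fwd[u] + s + bwd[v] - total) for (u, v, s) in arcs]
```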
Article
Full-text available
In this work we compare two parameter optimization techniques for discriminative training using the MMI criterion: the extended Baum-Welch (EBW) algorithm and the generalized probabilistic descent (GPD) method. Using Gaussian emission densities, we found special expressions for the step sizes in GPD, leading to reestimation formulae very similar to those derived for the EBW algorithm. Results were produced on both the TI digit string corpus and the SieTill corpus for continuously spoken American English and German digit strings. The results for the two techniques do not show significant differences. These experimental results support the strong link between EBW and GPD expected from the analytic comparison.
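The link between the two methods can be made explicit for a single Gaussian mean. A minimal sketch in our own notation, where gamma_x and gamma_1 denote discriminative averages of x and of the constant 1; the cited paper's exact expressions may differ.

```python
def gpd_mean_update(mu, sigma2, gamma_x, gamma_1, epsilon):
    """Gradient (GPD-style) update of a Gaussian mean under the MMI criterion.

    The MMI gradient w.r.t. the mean is (gamma_x - mu * gamma_1) / sigma2,
    with gamma_x, gamma_1 the discriminative averages of x and of 1.
    """
    return mu + epsilon * (gamma_x - mu * gamma_1) / sigma2

def ebw_mean_update(mu, gamma_x, gamma_1, D):
    """Extended Baum-Welch update of the same mean with iteration constant D."""
    return (gamma_x + D * mu) / (gamma_1 + D)

# Choosing the density-dependent step size
#     epsilon = sigma2 / (gamma_1 + D)
# makes the GPD update coincide exactly with the EBW update, which is the
# analytic link between the two methods discussed above.
```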
Article
Full-text available
Hidden Markov Models (HMMs) are one of the most powerful speech recognition tools available today. Even so, the inadequacies of HMMs as a "correct" modeling framework for speech are well known. In that context, we argue that the maximum mutual information estimation (MMIE) formulation for training is more appropriate vis-a-vis maximum likelihood estimation (MLE) for reducing the error rate. We also show how MMIE paves the way for new training possibilities. We introduce Corrective MMIE training, a very efficient new training algorithm which uses a modified version of a discrete reestimation formula recently proposed by Gopalakrishnan et al. We propose reestimation formulas for the case of diagonal Gaussian densities, experimentally demonstrate their convergence properties, and integrate them into our training algorithm. In a connected digit recognition task, MMIE consistently improves the recognition performance of our recognizer.
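The discrete reestimation formula referred to here is a growth transformation for rational objective functions over discrete distributions. A minimal sketch of its usual form is shown below, with the constant C playing the stabilizing role described in the abstract; variable names are ours.

```python
def discrete_reestimate(p, grad, C):
    """One growth-transformation step for a discrete probability distribution.

    p    : current probabilities (nonnegative, summing to 1)
    grad : partial derivatives dF/dp_i of the (rational) objective F
    C    : constant chosen large enough that all numerators stay positive
    """
    num = [pi * (gi + C) for pi, gi in zip(p, grad)]
    Z = sum(num)
    return [n / Z for n in num]
```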
Conference Paper
Full-text available
This paper describes a framework for optimising the parameters of a continuous density HMM-based large vocabulary recognition system using a maximum mutual information estimation (MMIE) criterion. To limit the computational complexity arising from the need to find confusable speech segments in the large search space of alternative utterance hypotheses, word lattices generated from the training data are used. Experiments are presented on the Wall Street Journal database using up to 66 hours of training data. These show that lattices combined with an improved estimation algorithm make MMIE training practicable even for very complex recognition systems and large training sets. Furthermore, experimental results show that MMIE training can yield useful increases in recognition accuracy.
Article
Full-text available
In this paper we describe the optimization of 'conventional' template matching techniques for connected digit recognition (TI/NIST connected digit corpus). In particular, we carried out a series of experiments studying various aspects of signal processing, acoustic modeling, mixture densities, and linear transforms of the acoustic vector. After all optimization steps, our best string error rate on the TI/NIST connected digit corpus was 1.71% for single densities and 0.74% for mixture densities. 1. INTRODUCTION Over the last five years much progress has been made in connected digit recognition [3, 7, 8, 9]. This paper describes how the systematic optimization of various components of a 'conventional' recognition system leads to high performance comparable with that of systems using much more complicated techniques. Experimental results on the adult corpus of the TI/NIST connected digit corpus are given. The optimization steps presented in this paper are: 1. Several methods f...
Article
Full-text available
For many practical applications of speech recognition systems, it is desirable to have an estimate of confidence for each hypothesized word, i.e. an estimate of which words of the speech recognizer's output are likely to be correct and which are not reliable. Many of today's speech recognition systems use word lattices as a compact representation of a set of alternative hypotheses. We exploit such word lattices as information sources for the measure-of-confidence tagger JANKA [1]. In experiments on spontaneous human-to-human speech data, the use of word lattice related information significantly improves the tagging accuracy.
Chapter
This chapter describes ways in which the concept of maximum mutual information estimation (MMIE) can be used to improve the performance of HMM-based speech recognition systems. First, the basic MMIE concept is introduced with some intuition on how it works. Then we show how the concept can be extended to improve the power of the basic models. Since estimating HMM parameters with MMIE training can be computationally expensive, this problem is studied at length and some solutions proposed and demonstrated. Experiments are presented to demonstrate the usefulness of the MMIE technique.
Article
The authors study issues related to string-level acoustic modeling in continuous speech recognition. They derive the formulation of minimum string error rate training, and describe a minimum string error rate training algorithm, segmental minimum string error rate training. It takes a further step in modeling the basic speech recognition units by directly applying discriminative analysis to string-level acoustic model matching. One of the advantages of this training algorithm lies in its ability to model strings which are competitive with the correct string but are unseen in the training material. The robustness and acoustic resolution of the unit model set can therefore be significantly improved. Various experimental results have shown that significant error rate reduction can be achieved using this approach.
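The misclassification measure underlying minimum string error rate training is commonly written as the score of the correct string against a soft maximum over competing strings, passed through a sigmoid. A minimal sketch of that computation follows; the eta, gamma, theta parameters and the N-best competitor list are standard choices, not necessarily the authors' exact setup.

```python
import math

def string_mce_loss(g_correct, g_competing, eta=2.0, gamma=1.0, theta=0.0):
    """Smoothed string-level classification error for one training utterance.

    g_correct   : discriminant (log) score of the correct word string
    g_competing : scores of competing strings, e.g. from an N-best list
    eta         : soft-max sharpness; large eta approaches the best competitor
    gamma, theta: slope and offset of the sigmoid smoothing
    """
    # soft maximum over competing strings (stabilized log-sum-exp)
    m = max(g_competing)
    anti = m + (1.0 / eta) * math.log(
        sum(math.exp(eta * (g - m)) for g in g_competing) / len(g_competing))
    d = anti - g_correct          # misclassification measure: > 0 means error-prone
    return 1.0 / (1.0 + math.exp(-gamma * d + theta))
```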
Article
The interaction of linear discriminant analysis (LDA) and a modeling approach using continuous Laplacian mixture density HMMs is studied experimentally. The largest improvements in speech recognition were obtained when the classes for the LDA transform were defined to be sub-phone units. On a 12,000-word German recognition task with small overlap between training and test vocabularies, a reduction in error rate by one-fifth was achieved compared to the case without LDA. On the development set of the DARPA RM1 task the error rate was reduced by one-third. For the DARPA speaker-dependent no-grammar case, the error rate averaged over 12 speakers was 9.9%. This was achieved with a recognizer using LDA and a set of only 47 Viterbi-trained context-independent phonemes.
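For reference, an LDA transform of this kind is estimated from within-class and between-class scatter matrices. A minimal sketch assuming sub-phone class labels are already assigned to each acoustic vector; regularization of the scatter matrices and whitening details are omitted.

```python
import numpy as np

def lda_transform(X, y, dim):
    """Estimate an LDA projection from labeled feature vectors.

    X   : (n_samples, n_features) acoustic vectors
    y   : class label per sample (e.g. sub-phone unit indices)
    dim : number of discriminant directions to keep
    """
    mean = X.mean(axis=0)
    Sw = np.zeros((X.shape[1], X.shape[1]))  # within-class scatter
    Sb = np.zeros_like(Sw)                   # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mean)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    # generalized eigenproblem Sb v = lambda Sw v, solved via Sw^{-1} Sb
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(-eigvals.real)
    return eigvecs[:, order[:dim]].real
```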
Conference Paper
Although widely used, there are still open questions concerning which properties of linear discriminant analysis (LDA) account for its success in many speech recognition systems. In order to gain more insight into the nature of the transformation, we compare LDA with mel-cepstral feature vectors with respect to the following criteria: decorrelation and ordering properties; invariance under linear transforms; automatic learning of dynamical features; and data dependence of the transformation.
Article
Discriminative training techniques for Hidden Markov Models were recently proposed and successfully applied to automatic speech recognition. In this paper a discussion of the Minimum Classification Error and the Maximum Mutual Information objectives is presented. An extended reestimation formula is used for the HMM parameter update for both objective functions. The discriminative training methods were applied in speaker-independent phoneme recognition experiments and improved the phoneme recognition rates for both techniques. 1. INTRODUCTION Recently, discriminative training techniques for Hidden Markov Models (HMM) have been used successfully for automatic speech recognition. They provide better performance than Maximum Likelihood Estimation (MLE), since training concentrates on the estimation of class boundaries rather than on the parameters of assumed model distributions [1,12]. Although MLE and discriminative training are theoretically equivalent (if suff...