Comparison of Discriminative Training Criteria

Source: CiteSeer

ABSTRACT In this paper, a formally unifying approach for a class of discriminative training criteria including Maximum Mutual Information (MMI) and Minimum Classification Error (MCE) criterion is presented, including the optimization methods gradient descent (GD) and extended Baum-Welch (EB) algorithm. Comparisons are discussed for the MMI and the MCE criterion, including the determination of the sets of word sequence hypotheses for discrimination using word graphs. Experiments have been carried out on the SieTill corpus for telephone line recorded German continuous digit strings. Using several approaches for acoustic modeling, the word error rates obtained by MMI training using single densities always were better than those for Maximum Likelihood (ML) using mixture densities. Finally, results obtained for corrective training (CT), i.e. using only the best recognized word sequence in addition to the spoken word sequence, could not be improved by using the word graph based discriminative training.

  • [Show abstract] [Hide abstract]
    ABSTRACT: Hidden Markov Models (HMMs) are one of the most powerful speech recognition tools available today. Even so, the inadequacies of HMMs as a "correct" modeling framework for speech are well known. In that context, we argue that the maximum mutual information estimation (MMIE) formulation for training is more appropriate vis-a-vis maximum likelihood estimation (MLE) for reducing the error rate. We also show how MMIE paves the way for new training possibilities. We introduce Corrective MMIE training, a very efficient new training algorithm which uses a modified version of a discrete reestimation formula recently proposed by Gopalakrishnan et al. We propose reestimation formulas for the case of diagonal Gaussian densities, experimentally demonstrate their convergence properties, and integrate them into our training algorithm. In a connected digit recognition task, MMIE consistently improves the recognition performance of our recognizer.
  • [Show abstract] [Hide abstract]
    ABSTRACT: The interaction of linear discriminant analysis (LDA) and a modeling approach using continuous Laplacian mixture density HMM is studied experimentally. The largest improvements in speech recognition could be obtained when the classes for the LDA transform were defined to be sub-phone units. On a 12000 word German recognition task with small overlap between training and test vocabulary a reduction in error rate by one-fifth was achieved compared to the case without LDA. On the development set of the DARPA RM1 task the error rate was reduced by one-third. For the DARPA speaker-dependent no-grammar case, the error rate averaged over 12 speakers was 9.9%. This was achieved with a recognizer using LDA and a set of only 47 Viterbi-trained context-independent phonemes.
    Proceedings - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing 01/1992; 1:13-16. DOI:10.1109/ICASSP.1992.225984
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: This paper describes a framework for optimising the parameters of a continuous density HMM-based large vocabulary recognition system using a maximum mutual information estimation (MMIE) criterion. To limit the computational complexity arising from the need to find confusable speech segments in the large search space of alternative utterance hypotheses, word lattices generated from the training data are used. Experiments are presented on the Wall Street journal database using up to 66 hours of training data. These show that lattices combined with an improved estimation algorithm makes MMIE training practicable even for very complex recognition systems and large training sets. Furthermore, experimental results show that MMIE training can yield useful increases in recognition accuracy
    Acoustics, Speech, and Signal Processing, 1996. ICASSP-96. Conference Proceedings., 1996 IEEE International Conference on; 06/1996