H.-K.J. Kuo

Publications (12) · 23.15 total impact

  • ABSTRACT: Egyptian Arabic (EA) is a colloquial variety of Arabic. It is a low-resource, morphologically rich language that poses problems for Large Vocabulary Continuous Speech Recognition (LVCSR). Building LMs at the morpheme level is considered a better choice for achieving higher lexical coverage and better LM probabilities. Another approach is to utilize information from additional features such as morphological tags. Meanwhile, LMs based on Neural Networks (NNs) with a single hidden layer have shown superiority over conventional n-gram LMs, and Deep Neural Networks (DNNs) with multiple hidden layers have recently achieved better performance on various tasks. In this paper, we explore the use of feature-rich DNN-LMs, where the inputs to the network are a mixture of words and morphemes along with their features. Significant Word Error Rate (WER) reductions are achieved compared to traditional word-based LMs.
    Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on; 01/2013 · 4.63 Impact Factor
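    A minimal sketch of how a feature-rich input layer for a feed-forward DNN LM might assemble mixed word/morpheme contexts with morphological-tag features; the vocabularies, dimensions, and single hidden layer below are illustrative assumptions, not the paper's configuration.

      import numpy as np

      rng = np.random.default_rng(0)

      # Illustrative inventories: context units may be words or morphemes, each
      # paired with a morphological-tag feature (all entries are assumed examples).
      unit_vocab = {"<unk>": 0, "el": 1, "kitab": 2, "+i": 3, "fi": 4}
      tag_vocab = {"<none>": 0, "NOUN": 1, "PRON_SUFF": 2, "PREP": 3}

      EMB_DIM, TAG_DIM, HID_DIM = 32, 8, 64
      unit_emb = rng.normal(scale=0.1, size=(len(unit_vocab), EMB_DIM))
      tag_emb = rng.normal(scale=0.1, size=(len(tag_vocab), TAG_DIM))

      def input_layer(context):
          """Concatenate unit and tag embeddings over a fixed context window."""
          parts = []
          for unit, tag in context:
              parts.append(np.concatenate([unit_emb[unit_vocab.get(unit, 0)],
                                           tag_emb[tag_vocab.get(tag, 0)]]))
          return np.concatenate(parts)

      # One hidden layer shown; a DNN-LM stacks several before the output softmax.
      x = input_layer([("el", "NOUN"), ("kitab", "NOUN"), ("+i", "PRON_SUFF")])
      W1 = rng.normal(scale=0.1, size=(HID_DIM, x.size))
      h = np.tanh(W1 @ x)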
  • ABSTRACT: We report word error rate improvements with syntactic features using a neural probabilistic language model through N-best re-scoring. The syntactic features we use include exposed head words and their non-terminal labels both before and after the predicted word. Neural network LMs generalize better to unseen events by modeling words and other context features in continuous space. They are suitable for incorporating many different types of features, including syntactic features, where there is no pre-defined back-off order. We choose an N-best re-scoring framework to be able to take full advantage of the complete parse tree of the entire sentence. Using syntactic features, along with morphological features, improves the word error rate (WER) by up to 5.5% relative, from 9.4% to 8.6%, on the latest GALE evaluation test set.
    Automatic Speech Recognition & Understanding, 2009. ASRU 2009. IEEE Workshop on; 01/2010
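    A toy sketch of the N-best re-scoring step: each hypothesis carries an acoustic score, a baseline LM score, and a log-probability from the syntactic-feature neural LM, and the list is re-ranked under a weighted combination. All sentences, scores, and weights below are made up for illustration.

      import math

      # Hypothetical N-best list: (hypothesis, acoustic log-score,
      # baseline LM log-prob, syntactic-feature NNLM log-prob).
      nbest = [
          ("the bill was passed today", -1200.5, -35.2, -33.8),
          ("the bill what passed today", -1198.9, -41.0, -40.5),
          ("a bill was passed to day",   -1203.1, -38.7, -37.9),
      ]

      def rescore(nbest, lm_weight=12.0, nnlm_interp=0.5):
          """Re-rank hypotheses with an interpolated baseline-LM / NNLM score."""
          best, best_score = None, -math.inf
          for hyp, ac, lm, nnlm in nbest:
              lm_combined = nnlm_interp * nnlm + (1.0 - nnlm_interp) * lm
              score = ac + lm_weight * lm_combined
              if score > best_score:
                  best, best_score = hyp, score
          return best

      print(rescore(nbest))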
  • H.-K.J. Kuo, B. Kingsbury, G. Zweig
    ABSTRACT: Finite-state decoding graphs integrate the decision trees, pronunciation model and language model for speech recognition into a unified representation of the search space. We explore discriminative training of the transition weights in the decoding graph in the context of large vocabulary speech recognition. In preliminary experiments on the RT-03 English Broadcast News evaluation set, the word error rate was reduced by about 5.7% relative, from 23.0% to 21.7%. We discuss how this method is particularly applicable to low-latency and low-resource applications such as real-time closed captioning of broadcast news and interactive speech-to-speech translation.
    Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on; 05/2007 · 4.63 Impact Factor
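    The paper trains the decoding-graph transition weights discriminatively; the sketch below illustrates the general idea with a perceptron-style update over arc identities, which is only one of several possible discriminative criteria and not necessarily the one used in the paper.

      from collections import Counter

      # Transition-weight corrections keyed by arc id; a perceptron-style update
      # is shown purely for illustration.
      weights = Counter()

      def update(decoded_arcs, reference_arcs, lr=0.1):
          """Push weight toward arcs on the reference path, away from errorful arcs."""
          dec, ref = Counter(decoded_arcs), Counter(reference_arcs)
          for arc in set(dec) | set(ref):
              weights[arc] += lr * (ref[arc] - dec[arc])

      # Hypothetical arc-id sequences for one utterance.
      update(decoded_arcs=["a1", "a7", "a9"], reference_arcs=["a1", "a3", "a9"])
      print(dict(weights))   # a7 penalized, a3 boosted, shared arcs unchanged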
  • I. Zitouni, H.-K.J. Kuo
    ABSTRACT: Backoff hierarchical class n-gram language models use a class hierarchy to define an appropriate context. Each node in the hierarchy is a class containing all the words of the descendant nodes (classes). The closer a node is to the root, the more general the corresponding class, and consequently the context. We demonstrate experimentally the effectiveness of the backoff hierarchical class n-gram language modeling approach to model unseen events in speech recognition: improvement is achieved over regular backoff n-gram models. We also study the performance of this approach on vocabularies of different sizes and investigate the impact of the hierarchy depth on the performance of the model. Performance is presented on several databases such as Switchboard, CallHome and Wall Street Journal (WSJ). Experiments on the Switchboard and CallHome databases, which contain a few unseen events in the test set, show up to 6% improvement in unseen-event perplexity with a vocabulary of 16,800 words. With a relatively large number of unseen events on the WSJ test corpus and using two vocabulary sets of 5,000 and 20,000 words, we obtain up to 26% improvement in unseen-event perplexity and up to 12% improvement in WER when a backoff hierarchical class trigram language model is used on an ASR test set. Results confirm that larger improvements are achieved when the number of unseen events increases.
    Automatic Speech Recognition and Understanding, 2003. ASRU '03. 2003 IEEE Workshop on; 01/2007
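    A toy sketch of hierarchical class backoff: when a word-level n-gram is unseen, the predicted word is replaced by increasingly general classes from the hierarchy. Backoff weights and proper normalization are omitted, and the hierarchy, counts, and discount are assumed for illustration only.

      # Toy class hierarchy: each node maps to its parent (more general class).
      parent = {"boston": "CITY", "denver": "CITY", "CITY": "LOCATION", "LOCATION": "ROOT"}

      # Hypothetical bigram counts over words and classes, plus a unigram fallback.
      bigram = {("to", "CITY"): 40, ("to", "LOCATION"): 55}
      unigram = {"boston": 0.001, "denver": 0.0008}
      context_count = {"to": 1000}

      def p_hier(word, context, discount=0.5):
          """Back off through increasingly general classes of the predicted word."""
          label = word
          while label in parent:                      # climb the hierarchy
              c = bigram.get((context, label))
              if c:
                  return max(c - discount, 0) / context_count[context]
              label = parent[label]
          return unigram.get(word, 1e-6)              # final fallback: unigram / floor

      print(p_hier("boston", "to"))   # unseen ("to", "boston") backs off to ("to", "CITY")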
  • Yuqing Gao, Liang Gu, H.-K.J. Kuo
    ABSTRACT: Statistical methods commonly used in developing interactive dialogue systems require large amounts of training data to achieve high accuracy and robustness. This becomes a major bottleneck in building free-style dialogue systems in a new domain or for a new language. Portability challenges hence arise regarding how to build statistical models rapidly and with low cost in terms of data collection, transcription and annotation. In this paper, we discuss challenges as well as potential solutions in several critical issues of efficient language modeling, utilization of untranscribed speech data, automatic annotation, and cross-lingual modeling. We believe that current approaches in these areas are far from mature and call for serious efforts from the research community.
    Acoustics, Speech, and Signal Processing, 2005. Proceedings. (ICASSP '05). IEEE International Conference on; 04/2005 · 4.63 Impact Factor
  • C. Wu, D. Lubensky, J. Huerta, X. Li, H.-K.J. Kuo
    ABSTRACT: A framework is proposed for developing enterprise automated call routing systems and deploying large-scale natural language call routing applications, based on IBM's speech recognition and NLU application engagement practices in recent years. To make it easy to integrate different call classification algorithms, the framework provides a plug-and-play environment for evaluating promising call routing algorithms and a systematic approach to carrying out large-scale enterprise application deployments. The paradigm covers the complementary efforts involved in developing an automatic call routing application for enterprise call centers, from call classification algorithm investigation to the application programming model. Experimental results on a live test set collected from an enterprise call center show strong performance for the call classification algorithm implemented in this framework.
    Natural Language Processing and Knowledge Engineering, 2003. Proceedings. 2003 International Conference on; 11/2003
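    A sketch of the plug-and-play idea: call classification algorithms implement a common interface so they can be swapped inside the routing framework. The interface and the trivial keyword classifier below are assumptions for illustration, not IBM's implementation.

      from abc import ABC, abstractmethod
      from collections import Counter

      class CallRouter(ABC):
          """Common interface so different classification algorithms can be swapped."""
          @abstractmethod
          def train(self, utterances, destinations): ...
          @abstractmethod
          def route(self, utterance): ...

      class KeywordRouter(CallRouter):
          """Stand-in classifier: route to the destination whose training
          utterances share the most words with the caller's request."""
          def train(self, utterances, destinations):
              self.vocab = {}
              for text, dest in zip(utterances, destinations):
                  self.vocab.setdefault(dest, Counter()).update(text.lower().split())
          def route(self, utterance):
              words = utterance.lower().split()
              return max(self.vocab, key=lambda d: sum(self.vocab[d][w] for w in words))

      router: CallRouter = KeywordRouter()
      router.train(["I lost my credit card", "check my account balance"],
                   ["card_services", "account_info"])
      print(router.route("what is my balance"))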
  • ABSTRACT: We propose a new formulation of minimum verification error training and apply it to the problem of topic verification as an example. In topic verification, a decision is made as to whether a document truly belongs to a particular topic of interest. Such a decision typically depends on a comparison between a model for the desired topic and a model for background topics, using a decision threshold. We propose modeling the background topics as a cohort model consisting of a weighted combination of the M closest topics discovered from the training data. The weights and the decision threshold are optimized using the generalized probabilistic descent algorithm to explicitly minimize the verification error rate, which is defined to be a weighted sum of the Type I (false rejection) and Type II (false acceptance) errors.
    Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03). 2003 IEEE International Conference on; 05/2003 · 4.63 Impact Factor
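    A minimal sketch of the training loop: the verification score is compared against a decision threshold, Type I and Type II errors are smoothed with a sigmoid, and the threshold is updated by gradient descent. The paper also optimizes the cohort weights; here only the threshold update is shown, with a numerical gradient and made-up scores.

      import math

      def sigmoid(x):
          return 1.0 / (1.0 + math.exp(-x))

      def smoothed_verification_error(scores, labels, theta, w1=1.0, w2=1.0, gamma=2.0):
          """Weighted sum of smoothed Type I (false rejection) and Type II
          (false acceptance) errors; scores are log-likelihood ratios vs. a cohort."""
          fr = [sigmoid(-gamma * (s - theta)) for s, y in zip(scores, labels) if y == 1]
          fa = [sigmoid(+gamma * (s - theta)) for s, y in zip(scores, labels) if y == 0]
          return w1 * sum(fr) / max(len(fr), 1) + w2 * sum(fa) / max(len(fa), 1)

      def gpd_step(scores, labels, theta, lr=0.05, eps=1e-4):
          """One generalized-probabilistic-descent step on the decision threshold,
          using a numerical gradient to keep the sketch short."""
          g = (smoothed_verification_error(scores, labels, theta + eps)
               - smoothed_verification_error(scores, labels, theta - eps)) / (2 * eps)
          return theta - lr * g

      # Hypothetical topic-verification scores: label 1 = truly in-topic document.
      scores = [1.2, 0.4, -0.3, -1.1]
      labels = [1, 1, 0, 0]
      theta = 0.0
      for _ in range(100):
          theta = gpd_step(scores, labels, theta)
      print(round(theta, 3))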
  • ABSTRACT: We investigate both generative and statistical approaches for language modeling in spoken dialogue systems. Semantic class-based finite state and n-gram grammars are used for improving coverage and modeling accuracy when little training data is available. We have implemented dialogue-state-specific language model adaptation to reduce perplexity and improve the efficiency of grammars for spoken dialogue systems. A novel algorithm for combining state-independent n-gram and state-dependent finite state grammars using acoustic confidence scores is proposed. Using this combination strategy, a relative word error reduction of 12% is achieved for certain dialogue states within a travel reservation task. Finally, semantic class multigrams are proposed and briefly evaluated for language modeling in dialogue systems.
    Acoustics, Speech, and Signal Processing, 2002. Proceedings. (ICASSP '02). IEEE International Conference on; 02/2002
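    A sketch of one plausible way to combine the two decoding results with acoustic confidence scores, matching only the high-level description in the abstract; the selection rule, threshold, and example hypotheses are assumptions, not the paper's algorithm.

      def combine_hypotheses(fsg_hyp, fsg_conf, ngram_hyp, ngram_conf, conf_threshold=0.6):
          """Prefer the state-dependent finite-state-grammar result when its acoustic
          confidence is high enough; otherwise fall back to the state-independent
          n-gram hypothesis."""
          if fsg_conf >= conf_threshold and fsg_conf >= ngram_conf:
              return fsg_hyp
          return ngram_hyp

      # Hypothetical outputs for one user turn in a flight-booking dialogue state.
      print(combine_hypotheses("i want to fly to boston", 0.82,
                               "i want to fly to austin", 0.71))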
  • ABSTRACT: We describe how discriminative training can be applied to language models for speech recognition. Language models are important for guiding the speech recognition search, particularly in compensating for mistakes in acoustic decoding. A frequently used measure of the quality of language models is perplexity; however, what matters more for accurate decoding is not necessarily having the maximum-likelihood hypothesis, but rather the best separation of the correct string from the competing, acoustically confusable hypotheses. Discriminative training can help improve language models for the purpose of speech recognition by improving the separation of the correct hypothesis from the competing hypotheses. We describe the algorithm and demonstrate modest improvements in word and sentence error rates on the DARPA Communicator task without any increase in language model complexity.
    Acoustics, Speech, and Signal Processing, 2002. Proceedings. (ICASSP '02). IEEE International Conference on; 02/2002
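    A toy sketch of discriminative language model training: n-gram weights are nudged so that the correct transcription outscores an acoustically confusable competitor. The perceptron-style update and example strings are illustrative; the paper's exact objective may differ.

      from collections import Counter

      def ngram_feats(words, n=2):
          """Bigram feature counts for a hypothesis word string."""
          return Counter(zip(words, words[1:])) if n == 2 else Counter()

      # Discriminative correction weights added on top of a baseline n-gram LM.
      corrective = Counter()

      def discriminative_update(reference, competitor, lr=0.1):
          """Move n-gram weights so the correct string outscores an acoustically
          confusable competitor."""
          ref, comp = ngram_feats(reference.split()), ngram_feats(competitor.split())
          for ng in set(ref) | set(comp):
              corrective[ng] += lr * (ref[ng] - comp[ng])

      discriminative_update("i want to leave on may first",
                            "i want to leave on may fourth")
      print({" ".join(k): round(v, 2) for k, v in corrective.items() if v != 0.0})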
  • E. Fosler-Lussier, H.-K.J. Kuo
    ABSTRACT: When dialogue system developers tackle a new domain, much effort is required; the development of different parts of the system usually proceeds independently. Yet it may be profitable to coordinate development efforts between different modules. We focus our efforts on extending small amounts of language model training data by integrating semantic classes that were created for a natural language understanding module. By converting finite state parses of a training corpus into a probabilistic context free grammar and subsequently generating artificial data from the context free grammar, we can significantly reduce perplexity and automatic speech recognition (ASR) word error for situations with little training data. Experiments are presented using data from the ATIS and DARPA Communicator travel corpora.
    Proceedings - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing 05/2001; 1:553-556.
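    A toy version of the data-generation step: a small PCFG with semantic-class nonterminals is sampled top-down to produce artificial sentences for LM training. In the paper the rules and probabilities come from finite-state parses of the training corpus; here they are hand-written assumptions.

      import random

      random.seed(0)

      # Toy PCFG with semantic-class nonterminals (all rules and weights assumed).
      rules = {
          "S":       [(0.7, ["I", "want", "to", "fly", "FROMLOC", "TOLOC"]),
                      (0.3, ["show", "me", "flights", "FROMLOC", "TOLOC"])],
          "FROMLOC": [(1.0, ["from", "CITY"])],
          "TOLOC":   [(1.0, ["to", "CITY"])],
          "CITY":    [(0.5, ["boston"]), (0.5, ["denver"])],
      }

      def generate(symbol="S"):
          """Sample one sentence top-down from the PCFG."""
          if symbol not in rules:
              return [symbol]                        # terminal word
          probs, expansions = zip(*rules[symbol])
          expansion = random.choices(expansions, weights=probs)[0]
          return [w for sym in expansion for w in generate(sym)]

      # Artificial corpus to augment sparse LM training data.
      for _ in range(3):
          print(" ".join(generate()))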
  • H.K. Kuo, Chin-Hui Lee
    ABSTRACT: We study techniques that allow us to relax some constraints imposed by expert knowledge in task specifications of a natural language call router design. We intend to fully automate the training of the routing matrix while still maintaining the same level of performance (over 90% accuracy) as that of an optimized system. Two specific issues are investigated: (1) reducing the matrix size by removing word pairs and triplets from the key term definition and using only single-word terms; and (2) increasing the matrix size by removing the need to define stop words and perform stop word filtering. Since simplification of design often implies a degradation in performance, discriminative training of the routing matrix parameters becomes an essential procedure. We show in our experiments that the performance degradation caused by relaxing the design constraints can be compensated for entirely by minimum classification error (MCE) training, even with the above two simplifications. We believe the procedure is applicable to algorithms addressing a broad range of speech understanding, topic identification, and information retrieval problems.
    Acoustics, Speech, and Signal Processing, 2001. Proceedings. (ICASSP '01). 2001 IEEE International Conference on; 02/2001 · 4.63 Impact Factor
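    A minimal sketch of one MCE/GPD update of the routing matrix: routing scores are computed from single-word term counts, a smoothed misclassification measure compares the correct destination with its best rival, and the corresponding matrix rows are adjusted. Dimensions, learning rate, and the example request vector are assumed.

      import numpy as np

      rng = np.random.default_rng(0)

      def mce_step(R, x, k, lr=0.2, gamma=1.0):
          """One minimum-classification-error (GPD) update of the routing matrix R.
          x: term-count vector for a caller request, k: correct destination index."""
          g = R @ x                                   # routing scores
          j = int(np.argmax(np.where(np.arange(len(g)) == k, -np.inf, g)))  # best rival
          d = -g[k] + g[j]                            # misclassification measure
          l = 1.0 / (1.0 + np.exp(-gamma * d))        # smoothed 0/1 loss
          grad = gamma * l * (1.0 - l)
          R[k] += lr * grad * x                       # pull correct destination up
          R[j] -= lr * grad * x                       # push best rival down
          return R

      # Hypothetical 3 destinations x 4 single-word terms, one training request.
      R = rng.normal(scale=0.1, size=(3, 4))
      x = np.array([1.0, 0.0, 2.0, 0.0])              # counts of key terms
      R = mce_step(R, x, k=1)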
  • H.-K.J. Kuo, Yuqing Gao
    ABSTRACT: Traditional statistical models for speech recognition have all been based on a Bayesian framework using generative models such as hidden Markov models (HMMs). The paper focuses on a new framework for speech recognition using maximum entropy direct modeling, where the probability of a state or word sequence given an observation sequence is computed directly from the model. In contrast to HMMs, features can be asynchronous and overlapping. This model therefore allows for the potential combination of many different types of features. A specific kind of direct model, the maximum entropy Markov model (MEMM), is studied. Even with conventional acoustic features, the approach already shows promising results for phone level decoding. The MEMM significantly outperforms traditional HMMs in word error rate when used as stand-alone acoustic models. Preliminary results combining the MEMM scores with HMM and language model scores show modest improvements over the best HMM speech recognizer.
    Automatic Speech Recognition and Understanding, 2003. ASRU '03. 2003 IEEE Workshop on
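    A small sketch of the direct-modeling idea behind the MEMM: the next-state distribution is a locally normalized maximum-entropy model over arbitrary, possibly overlapping features of the previous state and the current observation. The feature functions, states, and weights below are illustrative assumptions, not trained values.

      import math

      # Hypothetical feature functions over (previous state, next state, observation);
      # weights would be learned by maximum-entropy training.
      def features(prev, state, obs):
          return {
              f"trans:{prev}->{state}": 1.0,
              f"obs:{state}:{'high_energy' if obs > 0.5 else 'low_energy'}": 1.0,
          }

      weights = {"trans:sil->ah": 0.4, "trans:ah->ah": 0.9,
                 "obs:ah:high_energy": 1.2, "obs:sil:low_energy": 0.8}

      def next_state_distribution(prev, obs, states=("sil", "ah", "t")):
          """p(s_t | s_{t-1}, o_t) under a maximum-entropy Markov model."""
          scores = {s: sum(weights.get(name, 0.0) * v
                           for name, v in features(prev, s, obs).items())
                    for s in states}
          z = sum(math.exp(v) for v in scores.values())
          return {s: math.exp(v) / z for s, v in scores.items()}

      print(next_state_distribution("ah", obs=0.7))   # locally normalized, unlike an HMM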