Category-Based Statistical Language Models

Source: CiteSeer

ABSTRACT: [...] this document. The first section, in chapter 3, develops a model for syntactic dependencies based on word-category n-grams. The second section, in chapter 4, extends this model by incorporating selected word n-grams, which allows short-range word relations to be captured. Finally, chapter 5 presents a technique that also permits the inclusion of long-range word-pair relationships.
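
As a rough illustration of the word-category n-gram idea mentioned in the abstract, the sketch below estimates a category bigram model in which the probability of a word is factored as P(category | previous category) × P(word | category). This is the common textbook form of a class-based bigram and is not taken from the thesis itself; the function name `train_category_bigram`, the toy corpus, the tag names, the single tag per word token, and the absence of smoothing are all assumptions made for the example.

```python
from collections import Counter

def train_category_bigram(tagged_sentences):
    """Estimate a category bigram model by relative frequency (no smoothing).

    Hypothetical helper: each sentence is a list of (word, category) pairs.
    """
    cat_bigrams = Counter()   # counts of (previous category, category)
    cat_history = Counter()   # counts of a category used as history
    word_cat = Counter()      # counts of (word, category)
    cat_total = Counter()     # counts of each category

    for sentence in tagged_sentences:
        prev = "<s>"          # assumed sentence-start symbol
        for word, cat in sentence:
            cat_bigrams[(prev, cat)] += 1
            cat_history[prev] += 1
            word_cat[(word, cat)] += 1
            cat_total[cat] += 1
            prev = cat

    def prob(word, cat, prev_cat):
        # P(cat | prev_cat) * P(word | cat); 0.0 for unseen histories or categories
        if cat_history[prev_cat] == 0 or cat_total[cat] == 0:
            return 0.0
        p_cat = cat_bigrams[(prev_cat, cat)] / cat_history[prev_cat]
        p_word = word_cat[(word, cat)] / cat_total[cat]
        return p_cat * p_word

    return prob

# Toy corpus, invented for illustration only
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
]
prob = train_category_bigram(corpus)
print(prob("cat", "NOUN", "DET"))
```

Because the statistics are pooled over categories rather than individual words, such a model needs far fewer parameters than a word bigram, at the cost of losing word-specific relations; the abstract describes adding selected word n-grams to recover those.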

  • ABSTRACT: This paper highlights the value of a language model for improving the performance of on-line handwriting recognition systems. Models based on statistical approaches, trained on written corpora, have been investigated. Two kinds of models have been studied: n-gram models and n-class models. In the latter case, the classes result from either a syntactic criterion or a contextual criterion. To allow integration into small-capacity systems (mobile devices), an n-class model has been designed by combining these criteria. It outperforms bulkier models based on n-grams. Integration into an on-line handwriting recognition system demonstrates a substantial performance improvement due to the language model.
    Document Analysis and Recognition, International Conference on. 01/2003; 2:1053.
  • ABSTRACT: In statistical language models, integrating diverse linguistic knowledge into a general framework for long-distance dependencies is a challenging issue. In this paper, an improved language model incorporating linguistic structure into a maximum entropy framework is presented. The proposed model combines a trigram with structural knowledge of base phrases: the trigram captures the local relations between words, while the structural knowledge of base phrases represents the long-distance relations between syntactic structures. Knowledge of syntax, semantics and words is integrated into the maximum entropy framework. Experimental results show that, relative to the trigram model, the proposed model improves language model perplexity by 24% and increases the sign language recognition rate by about 3%.
    Journal of Computer Science and Technology 01/2003; 18:131-138. · 0.48 Impact Factor
  • ABSTRACT: In this work several sets of categories obtained by a statistical clustering algorithm, as well as a linguistic set, were used to design category-based language models. The proposed language models were evaluated, as usual, in terms of perplexity of the text corpus. They were then integrated into an ASR system and also evaluated in terms of system performance. The results show that category-based language models can perform better, also in terms of WER, when categories are obtained through statistical models instead of linguistic techniques. They also show that better system performance is obtained when the language model interpolates category-based and word-based models.
    Progress in Pattern Recognition, Image Analysis and Applications, 10th Iberoamerican Congress on Pattern Recognition, CIARP 2005, Havana, Cuba, November 15-18, 2005, Proceedings; 01/2005

Available from: Thomas Niesler, May 21, 2014