Category-Based Statistical Language Models

Source: CiteSeer

ABSTRACT: … this document. The first section, in chapter 3, develops a model for syntactic dependencies based on word-category n-grams. The second section, in chapter 4, extends this model by allowing short-range word relations to be captured through the incorporation of selected word n-grams. Finally, chapter 5 presents a technique which also permits the inclusion of long-range word-pair relationships.
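The category-based idea can be sketched with a toy class-bigram model: word probabilities are factored through part-of-speech categories, so P(w, c | c_prev) = P(w | c) · P(c | c_prev). This is a minimal maximum-likelihood illustration with invented tagged data, not the thesis's actual model or smoothing.

```python
from collections import Counter, defaultdict

# Toy tagged corpus (hypothetical data, purely for illustration):
# each sentence is a list of (word, category) pairs.
corpus = [
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]

# Count category bigrams and word-given-category emissions.
cat_bigrams = Counter()
cat_prev = Counter()
emissions = defaultdict(Counter)
for sent in corpus:
    cats = ["<s>"] + [c for _, c in sent]
    for prev, cur in zip(cats, cats[1:]):
        cat_bigrams[(prev, cur)] += 1
        cat_prev[prev] += 1
    for word, cat in sent:
        emissions[cat][word] += 1

def p_word(word, cat, prev_cat):
    """P(w, c | c_prev) = P(w | c) * P(c | c_prev), maximum likelihood."""
    p_w_given_c = emissions[cat][word] / sum(emissions[cat].values())
    p_c_given_prev = cat_bigrams[(prev_cat, cat)] / cat_prev[prev_cat]
    return p_w_given_c * p_c_given_prev

print(round(p_word("dog", "NOUN", "DET"), 4))  # 2/3 * 1.0 -> 0.6667
```

Because parameters are tied through categories, the model generalises to word pairs never seen together, which is the motivation for category n-grams in the thesis.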

    ABSTRACT: Theoretically, a language model improves as n-gram size increases from 3 to 5 or higher. As the n-gram size increases, however, the number of parameters, the computation, and the storage requirement all grow very rapidly if we attempt to store all possible n-gram combinations. To avoid these problems, the reduced n-grams approach previously developed by O'Boyle and Smith (1993) can be applied. A reduced n-gram language model, called a reduced model, can efficiently store an entire corpus's phrase-history length within feasible storage limits. Another advantage of reduced n-grams is that they are usually semantically complete. In our experiments, the O'Boyle-Smith reduced n-gram creation algorithm was applied to a large Chinese corpus. The Chinese reduced n-gram Zipf curves are presented here and compared with previously obtained conventional Chinese n-grams. The Chinese reduced model lowered perplexity by 8.74% and the language model size by a factor of 11.49. This paper is the first attempt to model Chinese reduced n-grams and may provide important insights for Chinese linguistic research.
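The storage motivation can be illustrated numerically: conventional models keep every distinct fixed-length n-gram, while the reduced idea keeps far fewer, variable-length phrases. The sketch below keeps only n-grams the toy corpus actually repeats; this is a loose illustration of the storage saving, not the O'Boyle-Smith algorithm itself, and the corpus is invented (the paper's experiments used a large Chinese corpus).

```python
from collections import Counter

# Hypothetical toy corpus, purely for illustration.
tokens = "the cat sat on the mat the cat sat on the rug".split()

def ngrams(seq, n):
    """All contiguous n-grams of seq as tuples."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

# Conventional storage: every distinct n-gram for n = 1..5 is kept.
conventional = {g for n in range(1, 6) for g in ngrams(tokens, n)}

# Loose stand-in for the reduced idea (NOT the O'Boyle-Smith
# algorithm): store only phrases the corpus repeats, so each entry
# corresponds to a genuinely recurring variable-length phrase.
counts = Counter(g for n in range(1, 6) for g in ngrams(tokens, n))
reduced = {g for g, c in counts.items() if c >= 2}

print(len(conventional), len(reduced))
```

Even on this tiny corpus the repeated-phrase inventory is much smaller than the full fixed-n inventory, which is the effect the paper quantifies as an 11.49-fold model-size reduction.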

Full-text (2 sources), available from May 21, 2014.

Thomas Niesler