Category-Based Statistical Language Models

Source: CiteSeer


this document. The first section, in chapter 3, develops a model for syntactic dependencies based on word-category n-grams. The second section, in chapter 4, extends this model by allowing short-range word relations to be captured through the incorporation of selected word n-grams. Finally, a technique which also permits the inclusion of long-range word-pair relationships is presented in chapter 5.
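The core idea of a category-based model — decomposing next-word probability into a word-given-category term and a category n-gram term — can be sketched in a few lines. The tagged corpus, category names, and counts below are invented for illustration and are not from the thesis:

```python
from collections import defaultdict

# Toy category-based bigram model in the spirit of the thesis:
# P(w_t | h_t) ~= P(w_t | c_t) * P(c_t | c_{t-1}).
# The tagged corpus below is invented for the example.
tagged = [("the", "DET"), ("cat", "NOUN"), ("sat", "VERB"),
          ("the", "DET"), ("dog", "NOUN"), ("ran", "VERB")]

word_given_cat = defaultdict(lambda: defaultdict(int))  # counts of w within c
cat_bigram = defaultdict(lambda: defaultdict(int))      # counts of c after c_prev

prev_cat = "<s>"
for word, cat in tagged:
    word_given_cat[cat][word] += 1
    cat_bigram[prev_cat][cat] += 1
    prev_cat = cat

def prob(word, cat, prev_cat):
    """Maximum-likelihood estimate of P(word | cat) * P(cat | prev_cat)."""
    wc, cc = word_given_cat[cat], cat_bigram[prev_cat]
    p_w = wc[word] / sum(wc.values()) if wc else 0.0
    p_c = cc[cat] / sum(cc.values()) if cc else 0.0
    return p_w * p_c

prob("cat", "NOUN", "DET")  # P(cat|NOUN)=0.5, P(NOUN|DET)=1.0 -> 0.5
```

Because categories are far fewer than words, the category bigram counts are much denser than word bigram counts, which is what makes this decomposition attractive for sparse data.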



Available from: Thomas Niesler, May 13, 2014
    • "As explained above, the purpose of statistical language models is to help accurately predict the next word w_t based on its current history h_t = [w_0, ..., w_{t−1}]. Over the past years, considerable efforts have been reported to find the factors in h_t that best predict w_t, including the use of syntactic and semantic information, see for instance (Niesler, 1997; Rosenfeld, 1996; Bod, 2000; Chelba and Jelinek, 2000; Charniak, 2001). Much of the previous work has been carried out to model natural languages, such as French or English, with varied characteristics and singularities."
    ABSTRACT: In this work, we present an extension of n-gram-based translation models based on factored language models (FLMs). Translation units employed in the n-gram-based approach to statistical machine translation (SMT) are based on mappings of sequences of raw words, while translation model probabilities are estimated through standard language modeling of such bilingual units. Therefore, similar to other translation model approaches (phrase-based or hierarchical), the sparseness problem of the units being modeled leads to unreliable probability estimates, even under conditions where large bilingual corpora are available. In order to tackle this problem, we extend the n-gram-based approach to SMT by tightly integrating more general word representations, such as lemmas and morphological classes, and we use the flexible framework of FLMs to apply a number of different back-off techniques. In this work, we show that FLMs can also be successfully applied to translation modeling, yielding more robust probability estimates that integrate larger bilingual contexts during the translation process. Keywords: Statistical machine translation; Bilingual n-gram language models; Factored language models
    Machine Translation 06/2010; 24(2):159-175. DOI:10.1007/s10590-010-9082-5
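    The back-off idea behind FLMs — falling back from a sparse surface-word estimate to a coarser factor such as a lemma or morphological class — can be illustrated with a minimal sketch. The counts, threshold, and factor encoding below are invented; this shows only the back-off decision, not a full, normalized factored model:

```python
# Hedged sketch of factor-level back-off in the spirit of FLMs.
# Counts are invented; "go|VERB" is a hypothetical lemma+POS factor.
word_counts = {("goes", "go|VERB"): 3, ("went", "go|VERB"): 0}
factor_counts = {"go|VERB": 5}   # occurrences of the factor overall
total_factors = 10               # total factor tokens in the toy corpus

def backoff_prob(word, factor, threshold=1):
    """Use the word-level relative frequency when the word was seen
    often enough given its factor; otherwise back off to the
    factor-level relative frequency."""
    c = word_counts.get((word, factor), 0)
    if c >= threshold:
        return c / factor_counts[factor]   # word-level estimate
    return factor_counts[factor] / total_factors  # factor-level back-off

backoff_prob("goes", "go|VERB")  # word-level: 3/5 = 0.6
backoff_prob("went", "go|VERB")  # backed off: 5/10 = 0.5
```

The point of the FLM framework is that this fallback need not be a single fixed chain; different back-off paths over the available factors can be combined.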
    • "In [6], an approach based on word posterior probabilities computed on a confusion network is proposed to detect OOV words. Finally, works on tagging texts containing OOV words rely on POS categories which are used in conjunction with n-gram LMs to achieve better results [10]. "
    ABSTRACT: In this paper we investigate the use of linguistic information given by language models to deal with word recognition errors on handwritten sentences. We focus especially on errors due to out-of-vocabulary (OOV) words. First, word posterior probabilities are computed and used to detect error hypotheses on output sentences. An SVM classifier allows these errors to be categorized according to defined types. Then, a post-processing step is performed using a language model based on Part-of-Speech (POS) tags, which is combined with the n-gram model previously used. Thus, error hypotheses can be further recognized and POS tags can be assigned to the OOV words. Experiments on on-line handwritten sentences show that the proposed approach allows a significant reduction of the word error rate.
    07/2009; DOI:10.1109/ICDAR.2009.78
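    The word-posterior step described above can be sketched on a single confusion-network slot: competing hypotheses in the slot are normalized into posteriors, and low-posterior words are flagged as likely errors. The scores and the 0.2 threshold below are invented for illustration:

```python
# Toy confusion-network slot: competing word hypotheses with
# (invented) accumulated arc scores; "<eps>" is the empty hypothesis.
slot = {"cat": 2.0, "cap": 1.0, "<eps>": 0.5}

# Posterior of each word = its score normalized over the slot.
total = sum(slot.values())
posteriors = {w: s / total for w, s in slot.items()}

# Hypotheses whose posterior falls below a threshold are flagged
# as likely recognition errors (candidates for OOV handling).
suspect = [w for w, p in posteriors.items() if p < 0.2]
```

In the cited work the flagged hypotheses are then classified by an SVM and re-scored with the POS-based model; this sketch covers only the detection step.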
    • "Considering the intrinsic defect of the POS and word form, some other information with different granularity has been investigated. One of them is word categories, which have been used to improve the performance of statistical language models [4]. Some other efforts focus on the acquisition of verb subcategorization frames [5]."
    ABSTRACT: This paper proposes a new way to improve the performance of a dependency parser: subdividing verbs according to their grammatical functions and integrating the information of verb subclasses into a lexicalized parsing model. Firstly, the scheme of verb subdivision is described. Secondly, a maximum entropy model is presented to distinguish verb subclasses. Finally, a statistical parser is developed to evaluate the verb subdivision. Experimental results indicate that the use of verb subclasses improves parsing performance.
    Journal of Electronics (China) 04/2007; 24(3):347-352. DOI:10.1007/s11767-005-0193-8