  • ABSTRACT: This paper proposes an objective measure, based on the information-theoretic notion of mutual information, for estimating word association norms from computer-readable corpora. (The standard method of obtaining word association norms, testing a few thousand subjects on a few hundred words, is both costly and unreliable.) The proposed measure, the association ratio, estimates word association norms directly from computer-readable corpora, making it possible to estimate norms for tens of thousands of words.
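The association ratio described above is, in effect, pointwise mutual information estimated from raw corpus counts: I(x, y) = log2( P(x, y) / (P(x) P(y)) ). As a rough illustration only (this is not code from the paper; the windowed pair counting and the shared normalization by corpus length are simplifying assumptions), a minimal Python sketch:

```python
import math
from collections import Counter

def association_ratio(tokens, window=5):
    """Estimate I(x, y) = log2(P(x, y) / (P(x) P(y))) from raw counts.
    y is counted within `window` words after x; window=5 follows the
    paper's suggestion, but the normalization here is simplified."""
    n = len(tokens)
    unigram = Counter(tokens)                      # f(x)
    pair = Counter()                               # f(x, y) within the window
    for i, x in enumerate(tokens):
        for y in tokens[i + 1 : i + 1 + window]:
            pair[(x, y)] += 1
    return {
        (x, y): math.log2((f_xy / n) / ((unigram[x] / n) * (unigram[y] / n)))
        for (x, y), f_xy in pair.items()
    }
```

On a large corpus, associated pairs such as ("doctor", "nurse") score well above zero, while unrelated pairs fall near or below it.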
  • ABSTRACT: In principle, n-gram probabilities can be estimated from a large sample of text by counting the number of occurrences of each n-gram of interest and dividing by the size of the training sample. This method, known as the maximum likelihood estimator (MLE), is very simple, but it is unsuitable because n-grams that do not occur in the training sample are assigned zero probability. This is qualitatively wrong for use as a prior model, because it would never allow such an n-gram, while clearly some of the unseen n-grams will occur in other texts. For non-zero frequencies, the MLE is quantitatively wrong. Moreover, at all frequencies, the MLE does not separate bigrams with the same frequency.
    We study two alternative methods. The first is an enhanced version of the method due to Good and Turing (I. J. Good [1953], Biometrika, 40, 237–264). Under the modest assumption that the distribution of each bigram is binomial, Good provided a theoretical result that increases estimation accuracy. The second is an enhanced version of the deleted estimation method (F. Jelinek & R. Mercer [1985], IBM Technical Disclosure Bulletin, 28, 2591–2594). It assumes even less, merely that the training and test corpora are generated by the same process.
    We emphasize three points about these methods. First, by using a second predictor of the probability in addition to the observed frequency, it is possible to estimate different probabilities for bigrams with the same frequency. We refer to this use of a second predictor as "enhancement." With enhancement, we find 1200 significantly different probabilities (with a range of five orders of magnitude) for the group of bigrams not observed in the training text; the MLE method would not be able to distinguish any one of these bigrams from any other. The probabilities found by the enhanced methods agree quite closely, in qualitative comparisons, with the standard calculated from the test corpus. Second, the enhanced Good-Turing method provides accurate predictions of the variances of the standard probabilities estimated from the test corpus. Third, we introduce a refined testing method that enables us to measure the prediction errors directly and accurately and thus to study small differences between methods. We find that while the errors of both methods are small, owing to the large amount of data we use, the enhanced Good-Turing method is three to four times as efficient in its use of data as the enhanced deleted estimate method, and is therefore preferable. Both methods are much better than the MLE.
    Computer Speech & Language 01/1991; DOI:10.1016/0885-2308(91)90016-J
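For concreteness, the basic (un-enhanced) Turing estimator underlying the first method replaces an observed count r with r* = (r + 1) N_{r+1} / N_r, where N_r is the number of distinct bigrams seen exactly r times, and reserves probability mass N_1 / N for unseen bigrams. The sketch below implements only these textbook formulas; the paper's enhanced version, which smooths N_r and conditions on a second predictor, is not attempted here.

```python
from collections import Counter

def turing_adjusted_counts(bigram_counts):
    """Plain Good-Turing: r* = (r + 1) * N_{r+1} / N_r.
    Returns adjusted counts plus the total probability mass
    N_1 / N reserved for bigrams never seen in training."""
    n_r = Counter(bigram_counts.values())        # N_r: frequency of frequencies
    adjusted = {}
    for bigram, r in bigram_counts.items():
        # Where N_{r+1} = 0 (sparse high counts), fall back to the raw
        # count; the enhanced method would smooth N_r instead.
        adjusted[bigram] = (r + 1) * n_r[r + 1] / n_r[r] if n_r[r + 1] else r
    total = sum(bigram_counts.values())          # N: total bigram tokens
    return adjusted, n_r[1] / total

tokens = "the cat sat on the mat the cat ran".split()
counts = Counter(zip(tokens, tokens[1:]))        # raw bigram counts
adjusted, p_unseen = turing_adjusted_counts(counts)
```

Note that the adjusted counts for singletons shrink below 1; that discount is exactly how the method transfers probability mass to unseen bigrams.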
  • Computational Linguistics 01/1993; 19:1-24.