Conference Paper

The Significance of Recall in Automatic Metrics for MT Evaluation

DOI: 10.1007/978-3-540-30194-3_16
Conference: Machine Translation: From Real Users to Research, 6th Conference of the Association for Machine Translation in the Americas, AMTA 2004, Washington, DC, USA, September 28-October 2, 2004, Proceedings
Source: DBLP


Recent research has shown that a balanced harmonic mean (F1 measure) of unigram precision and recall outperforms the widely used BLEU and NIST metrics for Machine Translation evaluation in terms of correlation with human judgments of translation quality. We show that significantly better correlations can be achieved by placing more weight on recall than on precision. While this may seem unexpected, since BLEU and NIST focus on n-gram precision and disregard recall, our experiments show that correlation with human judgments is highest when almost all of the weight is assigned to recall. We also show that stemming is significantly beneficial not just to simpler unigram precision and recall based metrics, but also to BLEU and NIST.
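The weighting described in the abstract can be illustrated with a short sketch. The abstract does not give the exact formula, so the code below assumes the standard weighted harmonic mean, F = P*R / (alpha*P + (1 - alpha)*R), where alpha is the share of weight placed on recall (alpha = 0.5 gives the balanced F1 measure). The function names, the whitespace tokenization, and the example sentences are hypothetical and are not taken from the paper.

```python
from collections import Counter


def unigram_precision_recall(candidate, reference):
    """Return (precision, recall) over clipped unigram matches."""
    cand_counts = Counter(candidate)
    ref_counts = Counter(reference)
    # Counter intersection keeps the minimum count per token, so a word
    # repeated in the candidate is credited at most as often as it
    # appears in the reference.
    matches = sum((cand_counts & ref_counts).values())
    precision = matches / len(candidate) if candidate else 0.0
    recall = matches / len(reference) if reference else 0.0
    return precision, recall


def weighted_f(precision, recall, alpha=0.5):
    """Weighted harmonic mean; alpha is the share of weight on recall.

    This parameterization is an assumption for illustration, not
    necessarily the one used in the paper.
    """
    if precision == 0.0 or recall == 0.0:
        return 0.0
    return (precision * recall) / (alpha * precision + (1.0 - alpha) * recall)


# Hypothetical example: the candidate drops one "the" from the reference.
candidate = "the cat sat on mat".split()
reference = "the cat sat on the mat".split()
p, r = unigram_precision_recall(candidate, reference)
for alpha in (0.5, 0.9, 1.0):
    print(alpha, round(weighted_f(p, r, alpha), 4))
```

In this example every candidate word is matched (precision 1.0), but only five of the six reference words are covered (recall 5/6). With alpha = 0.5 the score is 10/11, and as alpha approaches 1.0 the score approaches recall alone, so the missing word is penalized more heavily.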

  • Source
    • "Their experiment also revealed that the enhancement of both BLEU and NIST is correlated to human evaluation. To overcome the weaknesses of the above mentioned metrics, a new metric was proposed by Lavie et al. (2004), called METEOR. The concept of METEOR is rather different from that of the above metrics in that all the other metrics relied on unigram precision (the two SL and TL identical word strings) only, while METEOR gives more weight to unigram precision and unigram recall. "
    ABSTRACT: This article aims to investigate and evaluate the translation of English verb-noun collocations into Arabic by the Google and Bing online translation engines. A number of sentences were used as a testing dataset to evaluate both engines. Human translations by three bilingual speakers were used as a gold standard. A simple evaluation metric was proposed to calculate the translation accuracy of verb-noun collocations. The results showed that Bing scored a verb-noun collocation value of 0.72 with a trend estimation ranging between 0.65 and 0.67. Google scored a verb-noun collocation value of 0.75 (3% higher than Bing) with a trend estimation ranging between 0.63 and 0.85. The results also showed that, in most cases, the Arabic translation output of both engines produced a single verb synonym that did not collocate with the different nouns in the testing data sentences. These results indicate that Google and Bing, so far, have not been able to resolve the verb-noun collocability problem in their Arabic output. This study and its results may help to shed some light on the problem and to develop new methods to improve Arabic verb-noun collocability in the output translation of current machine translation engines.
    Full-text · Article · Jan 2016
  • Source
    • "ORdering) is an automatic evaluation metric for the machine translation output. Lavie, Kenji and Jayaraman study [3] proposes and casts METEOR metric for the first time in 2004, and aimed to improve correlation with human judgments of MT quality at the segment level. METEOR scores machine translation hypotheses by aligning them to one or more reference translations. "

    Full-text · Article · Nov 2015 · International Journal of Advanced Computer Science and Applications
  • Source
    • "meteor, initially proposed and released in 2004 (Lavie et al., 2004) was explicitly designed to improve correlation with human judgments of MT quality at the segment level. Previous publications on Meteor (Lavie et al., 2004; Banerjee and Lavie, 2005; Lavie and Agarwal, 2007) have described the details underlying the metric and have extensively compared its performance with Bleu and several other MT evaluation metrics. "
    ABSTRACT: We describe our submission to the NIST Metrics for Machine Translation Challenge consisting of 4 metrics - two versions of meteor, m-bleu and m-ter. We first give a brief description of Meteor. That is followed by a description of m-bleu and m-ter, enhanced versions of two other widely used metrics bleu and ter, which extend the exact word matching used in these metrics with the flexible matching based on stemming and Wordnet in Meteor.
    Preview · Article · Apr 2012