Conference Paper

The Significance of Recall in Automatic Metrics for MT Evaluation

DOI: 10.1007/978-3-540-30194-3_16 Conference: Machine Translation: From Real Users to Research, 6th Conference of the Association for Machine Translation in the Americas, AMTA 2004, Washington, DC, USA, September 28-October 2, 2004, Proceedings
Source: DBLP


Recent research has shown that a balanced harmonic mean (F1 measure) of unigram precision and recall outperforms the widely used BLEU and NIST metrics for Machine Translation evaluation in terms of correlation with human judgments of translation quality. We show that significantly better correlations can be achieved by placing more weight on recall than on precision. While this may seem unexpected, since BLEU and NIST focus on n-gram precision and disregard recall, our experiments show that correlation with human judgments is highest when almost all of the weight is assigned to recall. We also show that stemming is significantly beneficial not just to simpler unigram precision and recall based metrics, but also to BLEU and NIST.
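The recall-weighted score described in the abstract can be sketched as a weighted harmonic mean of unigram precision and recall, where a weight near 1.0 on recall corresponds to the paper's best-correlating setting. This is an illustrative sketch only, assuming single-reference, lowercase token matching; the paper's experiments also use stemming and multiple references, which are omitted here.

```python
from collections import Counter

def unigram_prf(candidate, reference, alpha=0.9):
    """Weighted harmonic mean of unigram precision and recall.

    alpha is the weight on recall: alpha=0.5 gives the balanced F1,
    while values near 1.0 assign almost all weight to recall.
    Sketch only: single reference, whitespace tokens, no stemming.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    matches = sum((cand & ref).values())  # clipped unigram matches
    if matches == 0:
        return 0.0
    precision = matches / sum(cand.values())
    recall = matches / sum(ref.values())
    # Weighted harmonic mean: weight alpha on recall, (1 - alpha) on precision
    return (precision * recall) / (alpha * precision + (1 - alpha) * recall)
```

For example, a hypothesis that matches every one of its own words but covers only half the reference gets a high score under alpha=0.1 (precision-heavy) and a much lower one under alpha=0.9 (recall-heavy).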

  • Source
    • "meteor, initially proposed and released in 2004 (Lavie et al., 2004) was explicitly designed to improve correlation with human judgments of MT quality at the segment level. Previous publications on Meteor (Lavie et al., 2004; Banerjee and Lavie, 2005; Lavie and Agarwal, 2007) have described the details underlying the metric and have extensively compared its performance with Bleu and several other MT evaluation metrics. "
    ABSTRACT: We describe our submission to the NIST Metrics for Machine Translation Challenge consisting of 4 metrics - two versions of Meteor, m-bleu and m-ter. We first give a brief description of Meteor. That is followed by a description of m-bleu and m-ter, enhanced versions of two other widely used metrics, bleu and ter, which extend the exact word matching used in these metrics with the flexible matching based on stemming and WordNet in Meteor.
  • Source
    • "This single addition gives statistically significant improvements over Ter at the segment and document levels. This validates similar observations of the importance of recall noted by Lavie et al. (2004). The other three features of Terp (stemming, synonymy, and paraphrases) are added on top of the optimized TER condition since optimization is required to determine the edit costs for the new features. "
    ABSTRACT: This paper describes a new evaluation metric, TER-Plus (TERp) for automatic evaluation of machine translation (MT). TERp is an extension of Translation Edit Rate (TER). It builds on the success of TER as an evaluation metric and alignment tool and addresses several of its weaknesses through the use of paraphrases, stemming, synonyms, as well as edit costs that can be automatically optimized to correlate better with various types of human judgments. We present a correlation study comparing TERp to BLEU, METEOR and TER, and illustrate that TERp can better evaluate translation adequacy. Keywords: Machine translation evaluation, Paraphrasing, Alignment
    Machine Translation 09/2009; 23(2):117-127. DOI:10.1007/s10590-009-9062-9
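The TER family scores a hypothesis by the number of edits needed to turn it into the reference, normalized by reference length. A minimal word-level sketch, assuming a single reference and omitting TER's block shifts and TERp's tuned stemming, synonym, and paraphrase edits:

```python
def ter_no_shifts(candidate, reference):
    """Simplified Translation Edit Rate: word-level edit distance
    (insertions, deletions, substitutions) divided by reference length.
    Real TER also allows block shifts; TERp further adds stemming,
    synonym, and paraphrase edits with automatically optimized costs.
    """
    hyp = candidate.split()
    ref = reference.split()
    # Standard dynamic-programming edit distance over word tokens
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(hyp)][len(ref)] / len(ref)
```

Lower is better: an exact match scores 0.0, and each required edit adds 1/|reference| to the score.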
  • Source
    • "Even though the Bleu metric is widely used and has greatly driven progress in statistical MT, it suffers from several weaknesses which we specifically aimed to address in the design of our Meteor metric: − Lack of Recall: Our early experiments (Lavie et al., 2004) led us to believe that the lack of recall within Bleu was a significant weakness, and that the "Brevity Penalty" in the Bleu metric does not adequately compensate for the lack of recall. It has since been demonstrated by several evaluations of metrics that recall strongly correlates with human judgments of translation quality, and that recall is thus an extremely important feature component in automatic metrics (Lavie et al., 2004). "
    ABSTRACT: The Meteor Automatic Metric for Machine Translation evaluation, originally developed and released in 2004, was designed with the explicit goal of producing sentence-level scores which correlate well with human judgments of translation quality. Several key design decisions were incorporated into Meteor in support of this goal. In contrast with IBM’s Bleu, which uses only precision-based features, Meteor uses and emphasizes recall in addition to precision, a property that has been confirmed by several metric evaluations as being critical for high correlation with human judgments. Meteor also addresses the problem of reference translation variability by utilizing flexible word matching, allowing for morphological variants and synonyms to be taken into account as legitimate correspondences. Furthermore, the feature ingredients within Meteor are parameterized, allowing for the tuning of the metric’s free parameters in search of values that result in optimal correlation with human judgments. Optimal parameters can be separately tuned for different types of human judgments and for different languages. We discuss the initial design of the Meteor metric, subsequent improvements, and performance in several independent evaluations in recent years.
    Machine Translation 09/2009; 23(2-3):105-115. DOI:10.1007/s10590-009-9059-4