Conference Paper

The Significance of Recall in Automatic Metrics for MT Evaluation.

DOI: 10.1007/978-3-540-30194-3_16 Conference: Machine Translation: From Real Users to Research, 6th Conference of the Association for Machine Translation in the Americas, AMTA 2004, Washington, DC, USA, September 28-October 2, 2004, Proceedings
Source: DBLP

ABSTRACT Recent research has shown that a balanced harmonic mean (F1 measure) of unigram precision and recall outperforms the widely used BLEU and NIST metrics for Machine Translation evaluation in terms of correlation with human judgments of translation quality. We show that signican tly better correlations can be achieved by placing more weight on recall than on precision. While this may seem unexpected, since BLEU and NIST focus on n-gram precision and disregard recall, our experiments show that correlation with human judgments is highest when almost all of the weight is assigned to recall. We also show that stemming is signican tly benecial not just to simpler unigram precision and recall based metrics, but also to BLEU and NIST.

  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: This paper describes a new evaluation metric, TER-Plus (TERp) for automatic evaluation of machine translation (MT). TERp is an extension of Translation Edit Rate (TER). It builds on the success of TER as an evaluation metric and alignment tool and addresses several of its weaknesses through the use of paraphrases, stemming, synonyms, as well as edit costs that can be automatically optimized to correlate better with various types of human judgments. We present a correlation study comparing TERp to BLEU, METEOR and TER, and illustrate that TERp can better evaluate translation adequacy. KeywordsMachine translation evaluation-Paraphrasing-Alignment
    Machine Translation 09/2009; 23(2):117-127. DOI:10.1007/s10590-009-9062-9
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: The Meteor Automatic Metric for Machine Translation evaluation, originally developed and released in 2004, was designed with the explicit goal of producing sentence-level scores which correlate well with human judgments of translation quality. Several key design decisions were incorporated into Meteor in support of this goal. In contrast with IBM’s Bleu, which uses only precision-based features, Meteor uses and emphasizes recall in addition to precision, a property that has been confirmed by several metrics as being critical for high correlation with human judgments. Meteor also addresses the problem of reference translation variability by utilizing flexible word matching, allowing for morphological variants and synonyms to be taken into account as legitimate correspondences. Furthermore, the feature ingredients within Meteor are parameterized, allowing for the tuning of the metric’s free parameters in search of values that result in optimal correlation with human judgments. Optimal parameters can be separately tuned for different types of human judgments and for different languages. We discuss the initial design of the Meteor metric, subsequent improvements, and performance in several independent evaluations in recent years.
    Machine Translation 09/2009; 23(2-3):105-115. DOI:10.1007/s10590-009-9059-4
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: This paper describes usage of MT metrics in choosing the best candidates for MT-based query translation resources. Our main metrics is METEOR, but we also use NIST and BLEU. Language pair of our evaluation is English → German, because MT metrics still do not offer very many language pairs for comparison. We evaluated translations of CLEF 2003 topics of four different MT programs with MT metrics and compare the metrics evaluation results to results of CLIR runs. Our results show, that for long topics the correlations between achieved MAPs and MT metrics is high (0.85-0.94), and for short topics lower but still clear (0.63-0.72). Overall it seems that MT metrics can easily distinguish the worst MT programs from the best ones, but smaller differences are not so clearly shown. Some of the intrinsic properties of MT metrics do not also suit for CLIR resource evaluation purposes, because some properties of translation metrics, especially evaluation of word order, are not significant in CLIR.
    Advances in Information Retrieval, 31th European Conference on IR Research, ECIR 2009, Toulouse, France, April 6-9, 2009. Proceedings; 01/2009