Conference Paper

The Significance of Recall in Automatic Metrics for MT Evaluation.

DOI: 10.1007/978-3-540-30194-3_16 Conference: Machine Translation: From Real Users to Research, 6th Conference of the Association for Machine Translation in the Americas, AMTA 2004, Washington, DC, USA, September 28-October 2, 2004, Proceedings
Source: DBLP

ABSTRACT Recent research has shown that a balanced harmonic mean (F1 measure) of unigram precision and recall outperforms the widely used BLEU and NIST metrics for Machine Translation evaluation in terms of correlation with human judgments of translation quality. We show that significantly better correlations can be achieved by placing more weight on recall than on precision. While this may seem unexpected, since BLEU and NIST focus on n-gram precision and disregard recall, our experiments show that correlation with human judgments is highest when almost all of the weight is assigned to recall. We also show that stemming is significantly beneficial not just to simpler unigram precision and recall based metrics, but also to BLEU and NIST.
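The weighted harmonic mean described in the abstract can be sketched in a few lines. The following is a minimal illustration, not the paper's exact implementation: the function name, tokenization (lowercased whitespace split), and the `recall_weight` parameter are assumptions made for this sketch. With `recall_weight=0.5` it reduces to the balanced F1; values near 1.0 place almost all of the weight on recall, as the paper recommends.

```python
from collections import Counter

def weighted_f(candidate, reference, recall_weight=0.9):
    """Weighted harmonic mean of unigram precision and recall.

    Illustrative sketch only: real metrics also handle multiple
    references, stemming, and corpus-level aggregation.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    # Clipped unigram matches: each reference token matches at most once.
    matches = sum((Counter(cand) & Counter(ref)).values())
    if matches == 0:
        return 0.0
    precision = matches / len(cand)
    recall = matches / len(ref)
    # F = P*R / (w*P + (1-w)*R); w = 0.5 gives the balanced F1.
    return (precision * recall) / (
        recall_weight * precision + (1 - recall_weight) * recall)
```

For a perfectly precise but incomplete candidate (precision 1.0, recall 0.5), raising `recall_weight` pulls the score toward the low recall, which is exactly the behavior the paper exploits.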

  • Source
    ABSTRACT: Beyond manual and automated post-editing, we describe an approach that uses post-editing information to automatically improve the underlying rules and lexical entries of a transfer-based Machine Translation (MT) system. This process can be divided into two main steps. In the first step, an online post-editing tool allows for easy error diagnosis and implicit error categorization. In the second step, an Automatic Rule Refiner performs error remediation by tracking errors and suggesting repairs that are mostly lexical and morpho-syntactic in nature (such as word order or incorrect agreement in transfer rules). This approach directly improves the intelligibility of corrected MT output and, more interestingly, it generalizes over unseen data, providing improved MT output for similar sentences that have not been corrected. Hence our approach is an alternative to fully automated post-editing.
  • Source
    ABSTRACT: We describe our submission to the NIST Metrics for Machine Translation Challenge, consisting of four metrics: two versions of Meteor, m-bleu, and m-ter. We first give a brief description of Meteor, followed by descriptions of m-bleu and m-ter, enhanced versions of the two other widely used metrics BLEU and TER, which extend the exact word matching used in those metrics with the flexible matching based on stemming and WordNet in Meteor.
  • Source
    ABSTRACT: This study compares the effectiveness of two popular machine translation systems (Google Translate and the Babylon machine translation system) in translating English sentences into Arabic, relative to the effectiveness of English-to-Arabic human translation. Among the many automatic methods used to evaluate machine translators, the Bilingual Evaluation Understudy (BLEU) method was adopted and implemented to achieve the main goal of this study. BLEU is based on matching machine translation output to human reference translations: the higher the score, the closer the translation is to the human translation. Well-known English sayings, in addition to sentences collected manually from various Internet web sites, were used for evaluation purposes. The results of this study showed that the Google machine translation system is better than the Babylon machine translation system in terms of precision of translation from English to Arabic.
    International Journal of Advanced Computer Science and Applications 01/2013; 4(1):66-73. · 1.32 Impact Factor
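The BLEU scoring that the abstracts above contrast with recall-weighted metrics rests on clipped n-gram precision combined with a brevity penalty. The following single-reference, sentence-level sketch is an assumption-laden simplification (real BLEU is corpus-level and supports multiple references); the function names and smoothing choices here are illustrative only.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU against a single reference.

    Geometric mean of clipped 1..max_n-gram precisions, scaled by a
    brevity penalty for candidates shorter than the reference.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(ngrams(cand, n))
        ref_ngrams = Counter(ngrams(ref, n))
        # Clipped matches: a candidate n-gram counts only as often
        # as it appears in the reference.
        matches = sum((cand_ngrams & ref_ngrams).values())
        total = sum(cand_ngrams.values())
        if matches == 0 or total == 0:
            return 0.0  # any zero precision zeroes the geometric mean
        log_precisions.append(math.log(matches / total))
    # Brevity penalty punishes candidates shorter than the reference;
    # note that nothing here rewards covering the reference (no recall).
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)
```

The absence of any recall term in this computation is precisely the gap that the recall-weighted harmonic mean in the main paper addresses.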