The Significance of Recall in Automatic Metrics
for MT Evaluation
Alon Lavie, Kenji Sagae, and Shyamsundar Jayaraman
Language Technologies Institute
Carnegie Mellon University
Abstract. Recent research has shown that a balanced harmonic mean
(F1 measure) of unigram precision and recall outperforms the widely used
BLEU and NIST metrics for Machine Translation evaluation in terms
of correlation with human judgments of translation quality. We show
that significantly better correlations can be achieved by placing more
weight on recall than on precision. While this may seem unexpected,
since BLEU and NIST focus on n-gram precision and disregard recall,
our experiments show that correlation with human judgments is highest
when almost all of the weight is assigned to recall. We also show that
stemming is significantly beneficial not just to simpler metrics based on
unigram precision and recall, but also to BLEU and NIST.
1 Introduction
Automatic metrics for machine translation (MT) evaluation have been receiving
significant attention in the past two years, since IBM's BLEU metric was
proposed and made available [1]. BLEU and the closely related NIST metric [2]
have been extensively used for comparative evaluation of the various MT
systems developed under the DARPA TIDES research program, as well as by other
MT researchers. Several other automatic metrics for MT evaluation have been
proposed since the early 1990s. These include various formulations of measures
of “edit distance” between an MT-produced output and a reference translation
[3, 4], and similar measures such as “word error rate” and
“position-independent word error rate” [5, 6].
The utility and attractiveness of automatic metrics for MT evaluation have
been widely recognized by the MT community. Evaluating an MT system using
such automatic metrics is much faster, easier, and cheaper than human
evaluation, which requires trained bilingual evaluators. In addition to their
utility for comparing the performance of different systems on a common
translation task, automatic metrics can be applied on a frequent and ongoing
basis during system development, in order to guide development based on
concrete performance improvements.
In this paper, we present a comparison between the widely used BLEU and
NIST metrics and a set of easily computable metrics based on unigram precision
and recall. Using several empirical evaluation methods that have been proposed
in the recent literature as concrete means of assessing the correlation between
automatic metrics and human judgments, we show that higher correlations can be
obtained with fairly simple and straightforward metrics. While recent work
[7, 8] has shown that a balanced combination of precision and recall (the F1
measure) correlates better with human judgments than BLEU and NIST, we claim
that even better correlations can be obtained by assigning more weight to
recall than to precision. In fact, our experiments show that the best
correlations are achieved when recall is assigned almost all the weight.
Previous work by Lin and Hovy [9] has shown that a recall-based automatic
metric for evaluating summaries outperforms the BLEU metric on that task. Our
results show that this is also the case for the evaluation of MT. We also
demonstrate that stemming both MT output and reference strings prior to their
comparison, which allows different morphological variants of a word to be
considered as “matches”, provides a further significant improvement in the
performance of the metrics.
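
For concreteness, this family of metrics can be written as a weighted harmonic
mean of unigram precision P and recall R, following van Rijsbergen [10]; the
formulation below is the standard one, shown here for reference, with α
denoting the weight placed on precision:

\[ F_{\alpha} = \frac{P \cdot R}{\alpha \cdot R + (1 - \alpha) \cdot P}, \qquad 0 \le \alpha \le 1 \]

Setting α = 0.5 yields the balanced F1 measure, while letting α approach 0
assigns almost all of the weight to recall (F reduces to R exactly at α = 0),
which is the regime where we observe the strongest correlations.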
We describe the metrics used in our evaluation in Section 2, where we also
discuss certain characteristics of the BLEU and NIST metrics that may account
for the advantage of metrics based on unigram recall. Our evaluation
methodology and the data used in our experiments are described in Section 3.
Our experiments and their results are described in Section 4. Future
directions and extensions of this work are discussed in Section 5.
2 Evaluation Metrics
The metrics used in our evaluations, in addition to BLEU and NIST, are based on
explicit word-to-word matches between the translation being evaluated and each
of one or more reference translations. If more than a single reference translation
is available, the translation is matched with each reference independently, and
the best-scoring match is selected. While this does not allow us to simultaneously
match different portions of the translation with different references, it supports
the use of recall as a component in scoring each possible match. For each metric,
including BLEU and NIST, we examine the case where matching requires that
the matched word in the translation and reference be identical (the standard
behavior of BLEU and NIST), and the case where stemming is applied to both
strings prior to the matching.¹ In the second case, we stem both the
translation and the references prior to matching and then require identity on
stems. We plan to experiment in the future with less strict matching schemes
that also consider matches between synonymous words (with some cost), as
described in Section 5.
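
As a minimal sketch of this matching scheme, the listing below computes
unigram precision and recall of a translation against each reference
independently and keeps the best-scoring match. The function names, the weight
parameter alpha, and the toy suffix-stripping stem function are illustrative
stand-ins (a real stemmer is used in practice), not the exact implementation
used in our experiments.

    from collections import Counter

    def stem(word):
        # Toy stand-in for a real stemmer (e.g. Porter): strips a few
        # common English suffixes so morphological variants can match.
        for suffix in ("ing", "ed", "es", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[:-len(suffix)]
        return word

    def unigram_pr(translation, reference, use_stemming=False):
        # Unigram precision and recall against a single reference.
        hyp = translation.lower().split()
        ref = reference.lower().split()
        if use_stemming:
            hyp = [stem(w) for w in hyp]
            ref = [stem(w) for w in ref]
        # Clipped word-to-word matches: a word is matched at most as many
        # times as it occurs in the reference.
        matches = sum((Counter(hyp) & Counter(ref)).values())
        precision = matches / len(hyp) if hyp else 0.0
        recall = matches / len(ref) if ref else 0.0
        return precision, recall

    def best_match_score(translation, references, alpha=0.5,
                         use_stemming=False):
        # Match the translation with each reference independently and
        # select the best-scoring match; alpha is the weight on precision
        # (alpha = 0.5 gives F1, small alpha shifts the weight to recall).
        best = 0.0
        for reference in references:
            p, r = unigram_pr(translation, reference, use_stemming)
            if p > 0.0 and r > 0.0:
                best = max(best, (p * r) / (alpha * r + (1.0 - alpha) * p))
        return best

    print(best_match_score("the cats sat on the mat",
                           ["the cat sat on a mat", "a cat is on the mat"],
                           alpha=0.1, use_stemming=True))

Because each reference is scored independently, a single reference must
account for all matched words, which is what makes recall well-defined for
each translation-reference pair.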
2.1 BLEU and NIST
The main principle behind IBM's BLEU metric [1] is the measurement of the
overlap in unigrams (single words) and higher order n-grams of words between a
translation being evaluated and a set of one or more reference translations.
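
The following sketch illustrates this overlap computation: clipped (or
“modified”) n-gram precision at a single n-gram order against a single
reference. Full BLEU combines several n-gram orders via a geometric mean,
clips against multiple references, and applies a brevity penalty, none of
which are shown here.

    from collections import Counter

    def ngrams(tokens, n):
        # All contiguous n-grams of a token list, as tuples.
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def modified_ngram_precision(translation, reference, n):
        # Clipped n-gram precision: each translation n-gram is credited at
        # most as many times as it occurs in the reference.
        hyp = ngrams(translation.split(), n)
        ref_counts = Counter(ngrams(reference.split(), n))
        clipped = sum(min(count, ref_counts[gram])
                      for gram, count in Counter(hyp).items())
        return clipped / len(hyp) if hyp else 0.0

    print(modified_ngram_precision("the the the cat", "the cat sat", 1))  # 0.5

Note that nothing in this computation rewards covering the content of the
reference; that is the role recall plays in the metrics described above.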
¹ We include BLEU and NIST in our evaluations on stemmed data, but since
neither metric includes stemming as part of its definition, the resulting
BLEU-stemmed and NIST-stemmed scores are not truly BLEU and NIST scores. They
serve to illustrate the effectiveness of stemming in MT evaluation.
This research was funded in part by NSF grant number IIS-0121631.
References
1. Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a
Method for Automatic Evaluation of Machine Translation. In Proceedings of the
40th Annual Meeting of the Association for Computational Linguistics (ACL).
Philadelphia, PA. July 2002. pp. 311–318.
2. Doddington, George. 2002. Automatic Evaluation of Machine Translation Quality
Using N-gram Co-Occurrence Statistics. In Proceedings of the Second Conference
on Human Language Technology (HLT-2002). San Diego, CA. pp. 128–132.
3. K.-Y. Su, M.-W. Wu, and J.-S. Chang. 1992. A New Quantitative Quality Measure
for Machine Translation Systems. In Proceedings of the Fifteenth International
Conference on Computational Linguistics (COLING-92). Nantes, France. pp. 433–
4. Y. Akiba, K. Imamura, and E. Sumita. 2001. Using Multiple Edit Distances to
Automatically Rank Machine Translation Output. In Proceedings of MT Summit
VIII. Santiago de Compostela, Spain. pp. 15–20.
5. S. Niessen, F. J. Och, G. Leusch, and H. Ney. 2000. An Evaluation Tool for Machine
Translation: Fast Evaluation for Machine Translation Research. In Proceedings
of the Second International Conference on Language Resources and Evaluation
(LREC-2000). Athens, Greece. pp. 39–45.
6. Gregor Leusch, Nicola Ueffing and Hermann Ney. 2003. String-to-String Distance
Measure with Applications to Machine Translation Evaluation. In Proceedings of
MT Summit IX. New Orleans, LA. Sept. 2003. pp. 240–247.
7. I. Dan Melamed, R. Green and J. Turian. 2003. Precision and Recall of Machine
Translation. In Proceedings of HLT-NAACL 2003. Edmonton, Canada. May 2003.
Short Papers: pp. 61–63.
8. Joseph P. Turian, Luke Shen and I. Dan Melamed. 2003. Evaluation of Machine
Translation and its Evaluation. In Proceedings of MT Summit IX. New Orleans,
LA. Sept. 2003. pp. 386–393.
9. Chin-Yew Lin and Eduard Hovy. 2003. Automatic Evaluation of Summaries Using
N-gram Co-occurrence Statistics. In Proceedings of HLT-NAACL 2003. Edmonton,
Canada. May 2003. pp. 71–78.
10. C. van Rijsbergen. 1979. Information Retrieval. Butterworths. London, England.
11. Deborah Coughlin. 2003. Correlating Automated and Human Assessments of
Machine Translation Quality. In Proceedings of MT Summit IX. New Orleans,
LA. Sept. 2003. pp. 63–70.
12. Bradley Efron and Robert Tibshirani. 1986. Bootstrap Methods for Standard
Errors, Confidence Intervals, and Other Measures of Statistical Accuracy.
Statistical Science, 1(1). pp. 54–77.
13. George Doddington. 2003. Automatic Evaluation of Language Translation
using N-gram Co-occurrence Statistics. Presentation at the DARPA/TIDES 2003 MT
Workshop. NIST, Gaithersburg, MD. July 2003.
14. Bo Pang, Kevin Knight and Daniel Marcu. 2003. Syntax-based Alignment of
Multiple Translations: Extracting Paraphrases and Generating New Sentences. In
Proceedings of HLT-NAACL 2003. Edmonton, Canada. May 2003. pp. 102–109.