The Significance of Recall in Automatic Metrics
for MT Evaluation
Alon Lavie and Kenji Sagae and Shyamsundar Jayaraman
Language Technologies Institute
Carnegie Mellon University
Abstract. Recent research has shown that a balanced harmonic mean
(F1 measure) of unigram precision and recall outperforms the widely used
BLEU and NIST metrics for Machine Translation evaluation in terms
of correlation with human judgments of translation quality. We show
that significantly better correlations can be achieved by placing more
weight on recall than on precision. While this may seem unexpected,
since BLEU and NIST focus on n-gram precision and disregard recall,
our experiments show that correlation with human judgments is highest
when almost all of the weight is assigned to recall. We also show that
stemming is significantly beneficial not just to simpler unigram precision
and recall based metrics, but also to BLEU and NIST.
Automatic metrics for machine translation (MT) evaluation have been receiving
significant attention in the past two years, since IBM’s BLEU metric was
proposed and made available. BLEU and the closely related NIST metric
have been extensively used for comparative evaluation of the various MT systems
developed under the DARPA TIDES research program, as well as by other
MT researchers. Several other automatic metrics for MT evaluation have been
proposed since the early 1990s. These include various formulations of measures
of “edit distance” between an MT-produced output and a reference translation,
and similar measures such as “word error rate” and “position-independent
word error rate”.
The utility and attractiveness of automatic metrics for MT evaluation have
been widely recognized by the MT community. Evaluating an MT system using
such automatic metrics is much faster, easier and cheaper than human
evaluation, which requires trained bilingual evaluators. In addition to their utility
for comparing the performance of different systems on a common translation
task, automatic metrics can be applied on a frequent and ongoing basis during
system development, in order to guide the development of the system based on
concrete performance improvements.
In this paper, we present a comparison between the widely used BLEU and
NIST metrics, and a set of easily computable metrics based on unigram precision
and recall. Using several empirical evaluation methods that have been proposed
in the recent literature as concrete means to assess the level of correlation
between automatic metrics and human judgments, we show that higher correlations
can be obtained with fairly simple and straightforward metrics. While recent
researchers have shown that a balanced combination of precision and recall
(F1 measure) has improved correlation with human judgments compared to BLEU and
NIST, we claim that even better correlations can be obtained by assigning more
weight to recall than to precision. In fact, our experiments show that the best
correlations are achieved when recall is assigned almost all the weight. Previous
work by Lin and Hovy has shown that a recall-based automatic metric for
evaluating summaries outperforms the BLEU metric on that task. Our results
show that this is also the case for evaluation of MT. We also demonstrate that
stemming both MT-output and reference strings prior to their comparison, which
allows different morphological variants of a word to be considered as “matches”,
significantly further improves the performance of the metrics.
We describe the metrics used in our evaluation in section 2. We also discuss
certain characteristics of the BLEU and NIST metrics that may account for the
advantage of metrics based on unigram recall. Our evaluation methodology and
the data used for our experimentation are described in section 3. Our experiments
and their results are described in section 4. Future directions and extensions of
this work are discussed in section 5.
2 Evaluation Metrics
The metrics used in our evaluations, in addition to BLEU and NIST, are based on
explicit word-to-word matches between the translation being evaluated and each
of one or more reference translations. If more than a single reference translation
is available, the translation is matched with each reference independently, and
the best-scoring match is selected. While this does not allow us to simultaneously
match different portions of the translation with different references, it supports
the use of recall as a component in scoring each possible match. For each metric,
including BLEU and NIST, we examine the case where matching requires that
the matched word in the translation and reference be identical (the standard
behavior of BLEU and NIST), and the case where stemming is applied to both
strings prior to the matching (see footnote 1). In the second case, we stem both translation
and references prior to matching and then require identity on stems. We plan
to experiment in the future with less strict matching schemes that will consider
matching synonymous words (with some cost), as described in section 5.
Footnote 1: We include BLEU and NIST in our evaluations on stemmed data, but since neither
one includes stemming as part of the metric, the resulting BLEU-stemmed and
NIST-stemmed scores are not truly BLEU and NIST scores. They serve to illustrate
the effectiveness of stemming in MT evaluation.
2.1 BLEU and NIST
The main principle behind IBM’s BLEU metric is the measurement of the
overlap in unigrams (single words) and higher order n-grams of words, between a
translation being evaluated and a set of one or more reference translations. The
main component of BLEU is n-gram precision: the proportion of the matched n-
grams out of the total number of n-grams in the evaluated translation. Precision
is calculated separately for each n-gram order, and the precisions are combined
via geometric averaging. BLEU does not take recall into account directly.
Recall, the proportion of the matched n-grams out of the total number of
n-grams in the reference translation, is extremely important for assessing the
quality of MT output, as it reflects to what degree the translation covers the
entire content of the translated sentence. BLEU does not use recall because
the notion of recall is unclear when simultaneously matching against multiple
reference translations (rather than a single reference). To compensate for the
lack of recall, BLEU uses a Brevity Penalty, which penalizes translations for
being “too short”.
The NIST metric is conceptually similar to BLEU in most aspects, including the
weaknesses discussed below:
– The Lack of Recall: We believe that the brevity penalty in BLEU does
not adequately compensate for the lack of recall. Our experimental results
strongly support this claim.
– Lack of Explicit Word-matching Between Translation and Reference:
N-gram counts do not require an explicit word-to-word matching, but
this can result in counting incorrect “matches”, particularly for common
function words. A more advanced metric that we are currently developing
(see section 4.3) uses the explicit word-matching to assess the grammatical
coherence of the translation.
– Use of Geometric Averaging of N-grams: Geometric averaging results
in a score of “zero” whenever one of the component n-gram scores is zero.
Consequently, BLEU scores at the sentence level can be meaningless. While
BLEU was intended to be used only for aggregate counts over an entire
test-set (and not at the sentence level), a metric that exhibits high levels
of correlation with human judgments at the sentence level would be highly
desirable. In experiments we conducted, a modified version of BLEU that
uses equal-weight arithmetic averaging of n-gram scores was found to have
better correlation with human judgments at both the sentence and system levels.
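The effect of the averaging scheme can be illustrated with a minimal sketch. It assumes a single reference, whitespace tokenization, n-gram orders 1 to 4, and no brevity penalty, so it is a simplification and not the official BLEU implementation. The function names and the example sentences are illustrative only.

```python
import math
from collections import Counter


def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against one reference."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


def geometric_mean(scores):
    # A single zero component forces the whole average to zero.
    if any(s == 0.0 for s in scores):
        return 0.0
    return math.exp(sum(math.log(s) for s in scores) / len(scores))


def arithmetic_mean(scores):
    return sum(scores) / len(scores)


# Invented example sentences (not taken from the evaluation data).
candidate = "the cat sat on a mat".split()
reference = "the cat is on the mat".split()

precisions = [ngram_precision(candidate, reference, n) for n in range(1, 5)]
geo = geometric_mean(precisions)
arith = arithmetic_mean(precisions)
print(precisions, geo, arith)
```

In this example no trigram or 4-gram matches, so the geometric average is zero even though two thirds of the unigrams match, while the equal-weight arithmetic average remains positive.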
2.2 Metrics Based on Unigram Precision and Recall
The following metrics were used in our evaluations:
1. Unigram Precision: As mentioned before, we consider only exact one-to-one
matches between words. Precision is calculated as follows:
$$P = \frac{m}{w_t}$$
where m is the number of words in the translation that match words in the
reference translation, and $w_t$ is the number of words in the translation. This
may be interpreted as the fraction of the words in the translation that are
present in the reference translation.
2. Unigram Precision with Stemming: Same as above, but the translation
and references are stemmed before precision is computed.
3. Unigram Recall: As with precision, only exact one-to-one word matches
are considered. Recall is calculated as follows:
$$R = \frac{m}{w_r}$$
where m is the number of matching words, and $w_r$ is the number of words in
the reference translation. This may be interpreted as the fraction of words
in the reference that appear in the translation.
4. Unigram Recall with Stemming: Same as above, but the translation
and references are stemmed before recall is computed.
5. F1: The harmonic mean of precision and recall. F1 is computed as follows:
$$F_1 = \frac{2PR}{P + R}$$
6. F1 with Stemming: Same as above, but using the stemmed version of both
precision and recall.
7. Fmean: This is similar to F1, but recall is weighted nine times more heavily
than precision. The precise amount by which recall outweighs precision is
less important than the fact that most of the weight is placed on recall. The
balance used here was estimated using a development set of translations and
references (we also report results on a large test set that was not used in any
way to determine any parameters in any of the metrics). Fmean is calculated
as follows:
$$F_{mean} = \frac{10PR}{9P + R}$$
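The seven metrics above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: tokens are obtained by whitespace splitting, one-to-one matching is approximated by a multiset intersection of word counts, and `strip_suffix` is a toy, hypothetical stand-in for a real stemmer. The helper `best_over_references` follows the rule from section 2 of scoring against each reference independently and keeping the best match.

```python
from collections import Counter


def strip_suffix(word):
    """Toy suffix stripper (hypothetical stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def unigram_scores(translation, reference, stem=None):
    """Return unigram P, R, F1 and Fmean for one translation/reference pair."""
    t = translation.split()
    r = reference.split()
    if stem is not None:
        t = [stem(w) for w in t]
        r = [stem(w) for w in r]
    # One-to-one matching: a reference word can be matched at most once.
    m = sum((Counter(t) & Counter(r)).values())
    p = m / len(t) if t else 0.0
    rec = m / len(r) if r else 0.0
    f1 = 2 * p * rec / (p + rec) if m else 0.0
    fmean = 10 * p * rec / (9 * p + rec) if m else 0.0
    return {"P": p, "R": rec, "F1": f1, "Fmean": fmean}


def best_over_references(translation, references, key="Fmean", stem=None):
    """Score against each reference independently and keep the best match."""
    scored = (unigram_scores(translation, ref, stem) for ref in references)
    return max(scored, key=lambda s: s[key])


# Invented example (not taken from the evaluation data).
plain = unigram_scores("the cats sat", "the cat sat on the mat")
stemmed = unigram_scores("the cats sat", "the cat sat on the mat", stem=strip_suffix)
print(plain)
print(stemmed)
```

Without stemming, "cats" fails to match "cat", so m = 2; with the toy stemmer, all three translation words match. Because Fmean places most of its weight on recall, the short translation is still scored well below its precision in both cases.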
3 Evaluating MT Evaluation Metrics
We evaluated the metrics described in section 2 and compared their performance
with that of BLEU and NIST on two large data sets: the DARPA/TIDES 2002
and 2003 Chinese-to-English MT Evaluation sets. The data in both cases consists
of approximately 900 sentences with four reference translations each. Both
evaluations had corresponding human assessments, with two human judges eval-
uating each translated sentence. The human judges assign an Adequacy Score
and a Fluency Score to each sentence. Each score ranges from one to five (with
one being the poorest grade and five the highest). The adequacy and fluency
scores of the two judges for each sentence are averaged together, and an overall
average adequacy and average fluency score is calculated for each evaluated sys-
tem. The total human score for each system is the sum of the average adequacy
and average fluency scores, and can range from two to ten. The data from the
2002 evaluation contains system output and human evaluation scores for seven
systems. The 2003 data includes system output and human evaluation scores for
six systems. The 2002 set was used in determining the weights of precision and
recall in the Fmean metric.
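The aggregation of human judgments described above amounts to simple arithmetic, sketched here. The data layout (one pair of (adequacy, fluency) tuples per sentence, one tuple per judge) and the function name are assumptions made for illustration, and the scores are invented.

```python
from statistics import mean


def system_human_score(judgments):
    """Total human score for one system.

    judgments: list of per-sentence pairs ((adeq1, flu1), (adeq2, flu2)),
    one (adequacy, fluency) tuple per judge, each score between 1 and 5.
    """
    # Average the two judges per sentence, then average over sentences.
    adequacy = mean([(a1 + a2) / 2 for (a1, _), (a2, _) in judgments])
    fluency = mean([(f1 + f2) / 2 for (_, f1), (_, f2) in judgments])
    # The total is the sum of the two averages, so it lies between 2 and 10.
    return adequacy + fluency


# Invented judgments for a two-sentence "test set".
example = [((4, 3), (5, 3)), ((2, 2), (3, 1))]
total = system_human_score(example)
print(total)
```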
Our goal in the evaluation of the MT scoring metrics is to effectively quantify
how well each metric correlates with human judgments of MT quality. Several
different experimental methods have been proposed and used in recent work by
various researchers. In our experiments reported here, we use the following two methods:
1. Correlation of Automatic Metric Scores and Human Scores at the
System-level: We plot the automatic metric score assigned to each tested
system against the average total human score assigned to the system, and
calculate a correlation coefficient between the metric scores and the human
scores. Melamed et al. suggest using the Spearman rank correlation
coefficient as an appropriate measure for this type of correlation experiment.
The rank correlation coefficient abstracts away from the absolute scores and
measures the extent to which the two scores (human and automatic) similarly
rank the systems. We feel that this rank correlation is not a sufficiently
sensitive evaluation criterion, since even poor automatic metrics are capable
of correctly ranking systems that are very different in quality. We therefore
opted to evaluate the correlation using the Pearson correlation coefficient,
which takes into account the distances of the data points from an optimal
regression curve. This method has been used by various other researchers 
and also in the official DARPA/TIDES evaluations.
2. Correlation of Score Differentials between Pairs of Systems: For
each pair of systems we calculate the differentials between the systems for
both the human score and the metric score. We then plot these differentials
and calculate a Pearson correlation coefficient between the differentials. This
method was suggested by Coughlin. It provides significantly more data
points for establishing correlation between the MT metric and the human
scores. It assumes that the differentials of automatic metric scores and
human scores should be highly correlated. This assumption is
reasonable if both human scores and metric scores are linear in nature, which
is generally true for the metrics we compare here.
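Both assessment methods reduce to a Pearson correlation, computed either over per-system scores or over the differentials of all system pairs. The sketch below is illustrative: the function names are our own, and the six metric and human scores are invented placeholders, not results from the evaluation.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


def pairwise_differentials(scores):
    """Score differences for every unordered pair of systems, in a fixed order."""
    return [a - b for a, b in combinations(scores, 2)]


# Invented scores for six hypothetical systems.
metric_scores = [0.41, 0.45, 0.48, 0.52, 0.55, 0.61]
human_scores = [5.1, 5.6, 5.4, 6.2, 6.0, 6.9]

# Method 1: correlation of the per-system scores.
system_level_r = pearson(metric_scores, human_scores)

# Method 2: correlation of the score differentials between system pairs.
metric_diffs = pairwise_differentials(metric_scores)
human_diffs = pairwise_differentials(human_scores)
differential_r = pearson(metric_diffs, human_diffs)
print(system_level_r, differential_r)
```

With six systems the second method yields 15 data points instead of 6, which is the source of the additional data mentioned above.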
As mentioned before, the values presented in this paper are Pearson’s
correlation coefficients, and consequently they range from -1 to 1, with 1 representing
a perfect positive linear association between the automatic score and the human score.
Thus the different metrics are assessed primarily by looking at which metric has
a higher correlation coefficient in each scenario.
In order to validate the statistical significance of the differences in the scores,
we apply a commonly used bootstrap sampling technique to estimate
the variability over the test set, and establish confidence intervals for each of the
system scores and the correlation coefficients.
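One common way to obtain such intervals is the percentile bootstrap, sketched below for the system-level correlation coefficient. The specifics are assumptions rather than details given in the text: sentences are resampled with replacement, each system's score is taken as the mean of its sentence-level scores, 1000 resamples are drawn, and a 95% interval is read off the sorted results. All data are synthetic.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def bootstrap_ci(metric, human, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the system-level correlation.

    metric[s][i] and human[s][i] hold the scores of system s on sentence i.
    """
    rng = random.Random(seed)
    n_sent = len(metric[0])
    stats = []
    for _ in range(n_resamples):
        # Resample sentence indices with replacement, shared by all systems.
        idx = [rng.randrange(n_sent) for _ in range(n_sent)]
        m_sys = [sum(row[i] for i in idx) / n_sent for row in metric]
        h_sys = [sum(row[i] for i in idx) / n_sent for row in human]
        stats.append(pearson(m_sys, h_sys))
    stats.sort()
    low = stats[int((alpha / 2) * n_resamples)]
    high = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


# Synthetic data: three hypothetical systems, 50 sentences each.
data_rng = random.Random(1)
quality = [0.2, 0.5, 0.8]  # invented per-system quality levels
metric = [[q + data_rng.gauss(0, 0.1) for _ in range(50)] for q in quality]
human = [[2 + 8 * q + data_rng.gauss(0, 1.0) for _ in range(50)] for q in quality]

ci_low, ci_high = bootstrap_ci(metric, human)
print(ci_low, ci_high)
```

The same resampling loop could record each system's mean score instead of the correlation to obtain intervals for the individual system scores.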