Are Multiple Reference Translations Necessary?
Investigating the Value of Paraphrased Reference Translations
in Parameter Optimization
Nitin Madnani§, Philip Resnik§, Bonnie J. Dorr§ & Richard Schwartz
§Laboratory for Computational Linguistics and Information Processing
§Institute for Advanced Computer Studies
§University of Maryland, College Park
BBN Technologies
Abstract

Most state-of-the-art statistical machine translation systems use log-linear models, which are defined in terms of hypothesis features and weights for those features. It is standard to tune the feature weights in order to maximize a translation quality metric, using held-out test sentences and their corresponding reference translations. However, obtaining reference translations is expensive. In our earlier work (Madnani et al., 2007), we introduced a new full-sentence paraphrase technique, based on English-to-English decoding with an MT system, and demonstrated that the resulting paraphrases can be used to cut the number of human reference translations needed in half. In this paper, we take the idea a step further, asking how far it is possible to get with just a single good reference translation for each item in the development set. Our analysis suggests that it is necessary to invest in four or more human translations in order to significantly improve on a single translation augmented by monolingual paraphrases.
1 Introduction
Most state-of-the-art statistical machine translation
systems use log-linear models, which are defined in
terms of hypothesis features and weights for those
features. Such models usually take the form

score(ē) = Σ_i λ_i h_i(f̄, ē)    (1)

where h_i are features of the hypothesis ē and λ_i are weights associated with those features.
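Concretely, Equation 1 reduces to a weighted sum of feature values. A minimal sketch in Python; the feature names and weight values below are illustrative assumptions, not the actual system's features:

```python
# Scoring a translation hypothesis under a log-linear model:
# score(e) = sum_i lambda_i * h_i(f, e).
# Feature names and values below are illustrative, not the system's actual features.

def loglinear_score(features, weights):
    """Weighted sum of hypothesis feature values; unknown features get weight 0."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

hypothesis_features = {
    "log_p_tm": -4.2,    # translation model log-probability (hypothetical value)
    "log_p_lm": -7.5,    # language model log-probability (hypothetical value)
    "word_count": 12.0,  # word penalty feature
}
weights = {"log_p_tm": 1.0, "log_p_lm": 0.7, "word_count": -0.1}

score = loglinear_score(hypothesis_features, weights)  # -4.2 - 5.25 - 1.2 = -10.65
```

Tuning then amounts to searching over the weight vector while the feature functions stay fixed.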
It is standard practice to tune the feature weights
in models of this kind in order to maximize a trans-
lation quality metric such as BLEU (Papineni et al.,
2002) or TER (Snover et al., 2006), using held-
out “development” sentences paired with their corre-
sponding reference translations. Och (2003) showed
that a system achieves its best performance when the
model parameters are tuned using the same objective
function being used for evaluating the system. How-
ever, this reliance on multiple reference translations
creates a problem, because reference translations are
labor intensive and expensive to obtain. For exam-
ple, producing reference translations at the Linguis-
tic Data Consortium, a common source of translated
data for MT research, requires undertaking an elab-
orate process that involves translation agencies, de-
tailed translation guidelines, and quality control pro-
cesses (Strassel et al., 2006).
In our previous work (Madnani et al., 2007),
we introduced an automatic paraphrasing technique
based on English-to-English translation of full sen-
tences using a statistical MT system, and demon-
strated that, using this technique in the context of pa-
rameter tuning, it is possible to cut in half the usual
number of reference translations used—when each
of two human reference translations is paraphrased
automatically, tuning on the resulting four transla-
tions yields translation performance that is no worse
than that obtained using four human translations.
Our method enables the generation of paraphrases
for thousands of sentences in a very short amount
of time (much shorter than creating other low-cost
human references).
In this paper, we take the idea a step further, ask-
ing how far it is possible to get with just a single
good reference translation for each item in the de-
velopment set. This question is important for a num-
ber of reasons. First, with a few exceptions — no-
tably NIST’s annual MT evaluations — most new
MT research data sets are provided with only a sin-
gle reference translation. Second, obtaining mul-
tiple reference translations in rapid development,
low-density source language scenarios (e.g., Oard, 2003) is likely to be severely limited (or made entirely impractical) by limitations of time, cost, and
ready availability of qualified translators. Finally, if
a single good reference translation turns out to suf-
fice for parameter tuning, this opens the door to fu-
ture investigations in which we ask how good such
translations need to be. Ultimately, it may be pos-
sible to remove human development-set translations
from the statistical MT process altogether, instead
simply holding out a subset of sentence pairs that
are already part of the training bitext.
The next section lays out the critical research
questions that we wish to address in this work. Sec-
tion 3 describes the paraphrasing model that we used
for the experiments in this paper. Section 4 presents
experimentation and results, followed by discussion
and conclusions in Section 5.
2 Research Questions
There are a number of important research ques-
tions that need to be answered in order to determine
whether it is feasible to eliminate the need for mul-
tiple reference translations, using automatic para-
phrases of a single reference translation instead.
1. If only a single reference translation is available
for tuning, can adding a paraphrased reference
provide significant gains?
2. Can k-best paraphrasing instead of just 1-best
lead to better optimization, and how does this
compare with using additional human refer-
ences translations?
3. Does the full-sentence paraphraser always need
to be trained on all of the training data be-
ing used by the MT system (as it was in our
previous work) or can it be trained on only a
subset of the data? The answer to this ques-
tion is essential to test the hypothesis that the
paraphraser may not actually be producing the
claimed n-gram diversity but just performing a
form of smoothing over the feature value estimates.
4. To what extent are the gains obtained from this
technique contingent on the quality of the human references that are being paraphrased, if at all?
5. How severely does the genre mismatch affect
any gains that are to be had? For example, can
using paraphrased references still provide large
gains if the validation set is of a different genre
than the one that the paraphraser is trained on?
6. Given the claim that the paraphraser provides
additional n-gram diversity, can it be useful in
situations where the tuning criterion does not
depend heavily on such overlap?
Answering these questions will make it possible
to characterize the utility of paraphrase-based optimization in real-world scenarios, and how best to leverage it in those scenarios where it does prove useful.

3 Paraphrasing Model
We generate sentence-level paraphrases via English-
to-English translation using phrase table pivoting,
following Madnani et al. (2007). The transla-
tion system we use (for both paraphrase generation
and translation) is based on a state-of-the-art hierar-
chical phrase-based translation model as described
in (Chiang, 2007). English-to-English hierarchical
phrases are induced using the pivot-based technique
proposed in (Bannard and Callison-Burch, 2005)
with primary features similar to those used by Madnani et al. (2007): the joint probability p(ē1, ē2), the two conditionals p(ē1|ē2) and p(ē2|ē1), and the target language model.
To limit noise during pivoting, we only keep the
top 20 paraphrase pairs resulting from each pivot, as
determined by the induced fractional counts.
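The pivoting step can be sketched as follows. This is a toy illustration of the marginalization p(ē2 | ē1) = Σ_f p(ē2 | f) · p(f | ē1); the phrase-table entries and the pivot string are hypothetical, not taken from the actual system:

```python
# Pivot-based induction of English-to-English paraphrase probabilities
# (Bannard and Callison-Burch, 2005):
#   p(e2 | e1) = sum over foreign pivot phrases f of p(e2 | f) * p(f | e1).
# The toy phrase-table entries and the pivot string are hypothetical.

from collections import defaultdict

def pivot_paraphrases(p_f_given_e, p_e_given_f):
    """Marginalize over shared foreign pivot phrases to get p(e2 | e1)."""
    para = defaultdict(float)
    for e1, pivots in p_f_given_e.items():
        for f, p1 in pivots.items():
            for e2, p2 in p_e_given_f.get(f, {}).items():
                if e2 != e1:  # a phrase is not a paraphrase of itself
                    para[(e1, e2)] += p1 * p2
    return dict(para)

# p(f | e): English phrase -> {foreign phrase: probability}
p_f_given_e = {"passed": {"tongguo": 0.6}, "adopted": {"tongguo": 0.5}}
# p(e | f): foreign phrase -> {English phrase: probability}
p_e_given_f = {"tongguo": {"passed": 0.55, "adopted": 0.4}}

table = pivot_paraphrases(p_f_given_e, p_e_given_f)
# p("adopted" | "passed") = 0.6 * 0.4 = 0.24
```

In the real system the same marginalization runs over hierarchical phrase pairs, and only the top pairs per pivot are retained, as described above.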
Furthermore, we pre-process the source to
identify all named entities using BBN Identi-
Finder (Bikel et al., 1999) and strongly bias our decoder to leave them unchanged during the paraphrasing (translation) process, to avoid any erroneous paraphrasing of entities.
4 Experiments
Before presenting paraphrase-based tuning experi-
ments, we outline some general information that is
common to all of the experiments described below:
We choose Chinese-English translation as our
test-bed since there are sufficient resources
available in this language pair to conduct all of
our desired experiments.
Unless otherwise specified, we use 2 million sentences of newswire text as our training corpus for the Chinese-English MT system for all experiments, but train the paraphraser only on a subset (1 million sentences) instead of the full set.
We use a 1-3 split of the 4 reference translations from the NIST MT02 test set to tune the feature weights for the paraphraser, similar to Madnani et al. (2007).
No changes are made to the number of references in any validation set. Only the tuning sets differ in the number of references across different experiments.
BLEU and TER are calculated on lowercased
translation output. Brevity penalties for BLEU
are indicated if not equal to 1.
For each experiment, BLEU scores shown in
bold are significantly better (Koehn, 2004) than
the appropriate baselines for that experiment
(p < 0.05).
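The significance test referenced here is Koehn's (2004) paired bootstrap resampling. A rough sketch of the idea, using a generic sentence-level score in place of true corpus-level BLEU (which aggregates n-gram counts over the whole corpus rather than summing per-sentence scores); the per-sentence scores are made up:

```python
# Paired bootstrap resampling (Koehn, 2004), sketched with a generic
# sentence-level score; real BLEU aggregates n-gram counts over the whole
# corpus rather than summing per-sentence scores.

import random

def paired_bootstrap(scores_a, scores_b, samples=1000, seed=0):
    """Fraction of bootstrap resamples in which system A outscores system B."""
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(samples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample sentences with replacement
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / samples

# Hypothetical per-sentence scores for two tuned systems:
a = [0.42, 0.55, 0.38, 0.61, 0.47, 0.50]
b = [0.40, 0.52, 0.39, 0.58, 0.45, 0.49]
p_better = paired_bootstrap(a, b)
```

System A is declared significantly better at p < 0.05 when it wins in at least 95% of the resamples.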
4.1 Single Reference Datasets
In this section, we attempt to gauge the utility of
the paraphrase approach in a realistic scenario where
only a single reference translation is available for the
tuning set. We use the NIST MT03 data, which has
four references per development item, to simulate a
tuning set in which only a single reference translation is available.¹

¹ The reasons for choosing a set with 4 references will become clear in Section 4.2.
One way to create such a simulated set is simply
to choose one of the 4 reference sets, i.e., all the
translations with the same system identifier for all
source documents in the set. However, for the NIST
sets, each of the reference sets is typically created
by a different human translator. In order to imitate a
more realistic scenario where multiple human trans-
lators collaborate to produce a single set of reference
translations instead of multiple sets, it is essential
to normalize over any translator idiosyncrasies so as
to avoid any bias. Therefore, we create the simu-
lated single-reference set by choosing, at random,
for each source document in the set, one of the 4
available reference translations.
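The per-document sampling just described can be sketched as follows; the document IDs and reference strings are hypothetical placeholders:

```python
# Simulating a single-reference tuning set: for each source document,
# pick one of the 4 available reference sets at random, normalizing over
# translator idiosyncrasies.  Document IDs and strings are hypothetical.

import random

def simulate_single_reference(references_by_doc, seed=42):
    """references_by_doc maps doc_id -> list of 4 candidate reference sets
    (each a list of reference sentences for that document)."""
    rng = random.Random(seed)
    return {doc: rng.choice(refs) for doc, refs in references_by_doc.items()}

refs = {
    "doc01": [["ref A s1", "ref A s2"], ["ref B s1", "ref B s2"],
              ["ref C s1", "ref C s2"], ["ref D s1", "ref D s2"]],
    "doc02": [["ref A s1"], ["ref B s1"], ["ref C s1"], ["ref D s1"]],
}
single_ref = simulate_single_reference(refs)  # one reference set per document
```

Sampling at the document level, rather than picking one translator's set wholesale, is what removes any single translator's bias from the simulated set.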
As our baseline, we use this simulated single-
reference set as the tuning set (1H=1 Human) and
evaluate on a held-out validation set consisting of
both the NIST MT04 and MT05 data sets (a total of
2870 sentences), hereafter referred to as MT04+05.
We then paraphrase the simulated set, extract the 1-
best paraphrase as an additional reference, and tune
the MT system on this new 2-reference tuning set
(1H+1P=1 Human, 1 Paraphrase).
The results, shown in Table 1, confirm that using
a paraphrased reference when only a single human
reference is available is extremely useful and leads
to huge gains in both the BLEU and TER scores on
the validation set. In addition, since we see gains de-
spite the fact that the paraphraser is only trained on
half of the MT training corpus, we can conclude that
these improvements are not the result of fortuitous
smoothing, but rather of increased n-gram diversity
on the target side of the development set.
Table 1: BLEU and TER scores are shown for MT04+05.
1H=Tuning with 1 human reference, 1H+1P=Tuning
with the human reference and its paraphrase. Lower TER
scores are better.
        BLEU   TER
1H      37.65  56.39
1H+1P   39.32  54.39
4.2 Using k-best Paraphrases
Since the paraphraser is an English-to-English SMT
system, it can generate n-best hypothesis para-
phrases from the chart for each source sentence. An
obvious extension to the above experiment then is to
Figure 1: MT04+05 BLEU scores as additional references (human and paraphrased) are added to the single reference tuning set.
see whether using k-best paraphrase hypotheses as
additional reference translations, instead of just the
1-best, can alleviate the reference sparsity to a larger
extent during the optimization process. For this ex-
periment, we use the top 1,2and 3paraphrases for
the MT03 simulated single reference set as addi-
tional references; three tuning sets 1H+1P, 1H+2P
and 1H+3P respectively. As points of comparison,
we also construct the tuning sets 2H, 3H and 4H
from MT03 in the same simulated fashion² as the
single reference tuning set 1H. The results for this
experiment are shown in Figure 1.
Table 2: MT04+05 BLEU and TER scores are shown,
as additional references—human and paraphrased—are
added to the single reference tuning set.
# tuning refs    Human (BLEU / TER)    Paraphrased (BLEU / TER)
1 (1H+0)         37.65 / 56.39         37.65 / 56.39
2 (1H+1)         39.20 / 54.48         39.32 / 54.39
3 (1H+2)         40.01 / 53.50         39.79 / 53.71
4 (1H+3)         40.56 / 53.31         39.21 / 53.46
The graph shows that starting from the simulated single reference set, adding one more human reference translation leads to a significant gain in BLEU score, and adding more human references provides smaller but consistent gains at each step.

² By randomly choosing the required number of reference translations from the available 4 for each source document.

Table 2 shows the BLEU and TER scores corresponding to
Figure 1. With paraphrased references, gains con-
tinue up to 3 references, and then drop off; presum-
ably beyond the top two paraphrases or so, n-best
paraphrasing adds more noise than genuine diver-
sity (one can observe this drop off in provided diversity in the example shown in Figure 2).³ Crucially, however, it is important to note that only the performance difference with four references, between the human and the paraphrase condition, is statistically significant.

O: (hong kong, macau and taiwan) macau passed legalization to avoid double tax.
P1: macao adopted bills to avoidance of double taxation (hong kong, macao and taiwan)
P2: (hong kong, macao and taiwan) macao adopted bills and avoidance of double taxation
P3: (hong kong, macao and taiwan) macao approved bills and avoidance of double taxation

Figure 2: The 3-best paraphrase hypotheses for the original sentence O, with Chinese as the pivot language. The amount of n-gram diversity decreases with each successive hypothesis.
4.3 Effect of Genre Mismatch
It is extremely important to test the utility of opti-
mization with paraphrased references when there is
a mismatch between the genre of the data that the
paraphraser is trained on and the genre of the actual
test set that the system will eventually be scored on.
To measure the effect of such mismatch, we conducted two different sets of experiments, each related to a common scenario encountered in MT research.
³ This lack of diversity is found in most forms of n-best lists
used in language processing systems and has been documented
elsewhere in more detail (Langkilde, 2000; Mi et al., 2008).
4.3.1 Mixed-genre Test Set
For this experiment, we use the same paraphraser
training data, MT training data and tuning sets as
in Section 4.2. However, we now use a mixed-
genre test set (MT06-GALE) as our validation set.
MT06-GALE is a data set released by NIST in 2006
with 779 sentences, each with only a single refer-
ence translation. The composition of this set is as
follows: 369 from the newswire genre and 410 sen-
tences from the newsgroup genre. Since we are us-
ing MT03 for this experiment as well, we can also
test whether using k-best paraphrases instead of just
the 1-best helps on this mixed-genre validation set.
The results are shown in Figure 3 and Table 3.
Figure 3: Testing the paraphraser on a mixed-genre validation set (MT06-GALE). The graph depicts MT06-GALE 4-gram precision scores as additional references, human and paraphrased, are added to the single reference tuning set.
The first thing to notice about these results is
that as we use additional references (human or para-
phrased) for tuning the system, the brevity penalty
on the validation set increases significantly. This is a
well-known weakness of tuning for BLEU with multiple references and testing on a set with a single reference.⁴ However, we can focus on the 4-gram precision, which is the component that would be directly
affected by larger n-gram diversity. The precision
⁴ In the NIST formulation of the BLEU metric, the brevity
penalty is calculated against the shortest of the available refer-
ence translations. With multiple references available, it’s very
likely that the brevity penalty will be higher than if there was
only a single reference.
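The decomposition of BLEU into a clipped n-gram precision and a brevity penalty, which the analysis below relies on, can be sketched as follows. This is a simplified single-reference, single-order version, not the full NIST BLEU (which combines precisions of orders 1 through 4 over the whole corpus):

```python
# Separating BLEU into a clipped n-gram precision and a brevity penalty.
# A simplified single-reference, single-order sketch; full NIST BLEU combines
# precisions of orders 1-4 over the whole corpus.

import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def precision_and_bp(hyp, ref, n=4):
    """Return (clipped n-gram precision of order n, brevity penalty)."""
    hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
    matched = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    total = max(sum(hyp_counts.values()), 1)
    # Penalize hypotheses shorter than the reference; no bonus for longer ones.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return matched / total, bp
```

For example, with bigrams, the hypothesis "a b c d e" scored against the reference "a b c d f" gives a precision of 3/4 with no length penalty.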
Table 3: The BLEU scores (Prec.=4-gram precision,
BP=brevity penalty) are shown here along with TER
scores for MT06-GALE as additional references—
human and paraphrased—are added to the single refer-
ence tuning set.
BLEU components:
              1H+0   1H+1   1H+2   1H+3
Human  Prec.  19.83  20.33  20.83  21.87
       BP      0.86   0.79   0.76   0.72
Para   Prec.  19.83  20.23  20.47  20.22
       BP      0.86   0.77   0.76   0.76

TER:
              1H+0   1H+1   1H+2   1H+3
Human         64.09  64.02  63.99  63.37
Para          64.09  64.78  63.99  63.35
increases fairly regularly with additional human ref-
erences. However, with additional paraphrased ref-
erences, there are no statistically significant gains to
be seen. In fact, as seen in Section 4.2, adding more
paraphrases leads to a noisier tuning set. The TER
scores, although following a similar trend, seem to
provide no statistically significant evidence for either the human or the paraphrase portion of this experiment.

4.3.2 Porting to New Genres
Another important challenge in the MT world
arises when systems are used to translate data from
genres that are fairly new and for which a large
amount of parallel data is not yet available. One such
genre that has recently gained in popularity is the
weblog genre. In order to test how the paraphrase
approach works in that genre, we train both the MT
system and the paraphraser on 400,000 sentences
of weblog data. Note that this is less than half the
amount of newswire text that we previously used to
train the paraphraser. From our experience with this
genre, we find that if BLEU is used as the tuning cri-
terion for this genre, the TER scores on held-out val-
idation sets tend to be disproportionately worse and
that a better criterion to use is a hybrid TER-BLEU
measure given by
TER-BLEU = 0.5 · TER + 0.5 · (1 − BLEU)
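In code, with both metrics expressed on a 0-1 scale, the criterion is a one-liner; both terms are errors, so lower TER-BLEU is better (the sample values below are made up for illustration):

```python
# Hybrid tuning criterion from Section 4.3.2: both terms are expressed as
# errors, so lower TER-BLEU is better.  Sample values are illustrative.

def ter_bleu(ter, bleu):
    """TER-BLEU = 0.5 * TER + 0.5 * (1 - BLEU), inputs on a 0-1 scale."""
    return 0.5 * ter + 0.5 * (1.0 - bleu)

value = ter_bleu(0.68, 0.17)  # 0.5*0.68 + 0.5*0.83 = 0.755
```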
We used the same measure for tuning our MT sys-
tem in this experiment because we want to test how
the use of a criterion that’s not as heavily dependent
on n-gram diversity as BLEU affects the utility of
the paraphrasing approach in a real-world scenario.
As our tuning set, we use an actual weblog data
set with only a single reference translation. As our
validation set, we used a different weblog data set
(WEB) containing 767 sentences, also with a single
reference translation. The results are shown in Ta-
ble 4.
Table 4: BLEU and TER scores for using paraphrases in
tuning the web genre.
        Prec.   BP    TER
1H      16.85  0.90  68.35
1H+1P   17.25  0.88  68.00
Since our validation set has a single reference
translation, we separate out the 4-gram precision and
brevity penalty components of BLEU scores so that
we can focus on the precision which is directly af-
fected by the increased n-gram diversity supplied
by the paraphrase. However, for this experiment,
we find that while there seem to be improvements in both the 4-gram precision and TER scores,
they are statistically insignificant. In order to iso-
late whether the lack of improvement is due to the
relatively small size of the training data or the met-
ric mismatch, we re-run the same experiment with
BLEU as the tuning criterion instead of TER-BLEU.
Table 5: A significant gain in BLEU is achieved only
when the tuning criterion for the MT system can take ad-
vantage of the diversity.
        Prec.   BP    TER
1H      17.05  0.89  70.32
1H+1P   18.30  0.87  69.94
The results, shown in Table 5, indicate a signifi-
cant gain in both the 4-gram precision and the over-
all BLEU score. They indicate that while a relatively
small amount of training data may not hamper the
paraphraser’s effectiveness for parameter tuning, a
tuning criterion that doesn’t benefit from added n-
gram diversity certainly can.
Figure 4: Measuring the impact of reference quality on
use of paraphrased references. The graph shows the
BLEU and TER scores computed for MT04+05 for cases
where tuning utilizes reference translations created by
different human translators and their corresponding paraphrases. The tuning usefulness of human translations varies widely (e.g., Refset #2 vs Refset #4) and, in turn, impacts the utility of the paraphraser.
4.4 Impact of Human Translation Quality
Each of the 4 sets of reference translations in MT03
was created by a different human translator. Since
human translators are likely to vary significantly in
the quality of translations that they produce, it is im-
portant to gauge the impact of the quality of a ref-
erence on the effectiveness of using its paraphrase,
at least as produced by the paraphraser, as an additional reference. To do this, we choose each of the 4 reference sets from MT03 in turn to create the simulated single-reference set⁵ (1H), paraphrase it, and use the 1-best paraphrase as an additional reference to create a 2-reference tuning set (1H+1P). We then use each of the 8 tuning sets to tune the SMT system and compute BLEU and TER scores on MT04+05.
Figure 4 and Table 6 show these results in graph-
ical form and tabular form, respectively. These re-
sults allow for two very interesting observations:
The human reference translations do vary sig-
nificantly in quality. This is clearly seen from
the significant differences in the BLEU and
TER scores between the 1H conditions, e.g.,
the third and the fourth human reference translations seem to be better suited for tuning than, say, the second reference. Note that the term “better” does not necessarily refer to a more fluent translation but to one that is closer to the output of the MT system.

⁵ Note that these per-translator simulated sets are different from the bias-free simulated set created in Sections 4.1 and 4.2.

Table 6: MT04+05 BLEU and TER results are shown for cases where tuning utilizes reference translations created by different human translators and their corresponding paraphrases.

BLEU       #1     #2     #3     #4
1H      37.56  35.86  38.39  38.41
1H+1P   39.19  37.94  38.85  38.90

TER        #1     #2     #3     #4
1H      57.23  60.55  54.50  54.12
1H+1P   54.21  56.42  53.40  53.51
The quality of the human reference has a sig-
nificant impact on the effectiveness of its para-
phrase as an additional tuning reference. Using
paraphrases for references that are not very in-
formative, e.g. the second one, leads to signif-
icant gains in both BLEU and TER scores. On
the other hand, references that are already well-
suited to the tuning process, e.g., the fourth
one, show much smaller improvements in both
BLEU and TER on MT04+05.
In addition, we also want to see how genre mis-
match interacts with reference quality. Therefore,
we also measure the BLEU and TER scores of each
tuned MT system on MT06-GALE, a mixed-genre
validation set with a single reference translation de-
scribed earlier. These results—shown in Table 7—
confirm our observations. The improvements in the
TER scores with additional paraphrased references
are proportional to how good the original reference
was; in fact, for the fourth set of reference transla-
tions that seem best suited to tuning, adding a para-
phrased reference amounts to adding noise and leads
to lower performance on the mixed-genre set. As
for the BLEU scores, we see similar trends with its 4-gram precision⁶ component: it improves significantly for reference sets that are not as useful for tuning on their own but does not change (or even degrades) for the others.

⁶ Since MT06-GALE is a single reference validation set, brevity penalties are usually higher when scoring a system tuned with multiple references.
Table 7: Measuring the impact of reference quality on
MT06-GALE, a mixed-genre validation set.
BLEU components:
               #1     #2     #3     #4
1H     Prec.  19.09  19.19  20.34  20.61
       BP      0.88   0.88   0.82   0.84
1H+1P  Prec.  20.63  19.98  20.60  20.31
       BP      0.79   0.83   0.73   0.74

TER:
               #1     #2     #3     #4
1H            64.98  66.30  63.42  62.98
1H+1P         63.19  64.01  63.64  63.98
4.5 Effect of Larger Tuning Sets
An obvious question to ask is whether the para-
phrased references are equally useful with larger
tuning sets. More precisely, would using a larger
set of sentences (with a single human reference
translation) be as effective as using a paraphraser
to produce additional artificial reference transla-
tions? Given that creating additional human refer-
ence translations is so expensive, the most realistic
and cost-effective option of scaling to larger tuning
sets is to take the required number of sentences from
the training data and add them to the tuning set. The
parallel nature of the training corpus facilitates the use of the same corpus as a tuning set with a single human-authored reference translation.
In order to replicate this scenario, we choose the
single reference MT03 bias-free tuning set described
previously as our starting point. To add to this tuning set, we remove a block of sentences from the MT training corpus⁷ and add sentences from this block to the baseline MT03 tuning set in three steps to create three new tuning sets, as shown in Table 8.
Once we create the larger tuning sets, we use each
of them to tune the parameters of the MT system
(which is trained on bitext excluding this block of
sentences) and score the MT04+05 validation set.
To see how this compares to the paraphrase-based approach, we paraphrase each of the tuning sets and use the paraphrases as additional reference translations for tuning the MT system. Figure 5 and Table 9 show these results in graphical form and tabular form, respectively.

⁷ We made sure that these sentences did not overlap with the paraphraser training data.

Table 8: Creating larger single reference tuning sets by adding sentences from the training corpus to the single reference base tuning set (MT03).

Tuning Set       # of Sentences
Base (MT03)            919
T1 (Base+600)         1519
T2 (T1+500)           2019
T3 (T2+500)           2519
The most salient observation we can make from
the results is that doubling or even tripling the tun-
ing set by adding more sentences from the training
data does not lead to statistically significant gains.
However, adding the paraphrases of the corresponding human reference translations as additional references for tuning always leads to significant gains,
irrespective of the size of the tuning set.
Figure 5: BLEU scores for the MT04+05 validation set as the tuning set is enlarged by adding sentences from the training data, tuning with human references only vs. human + paraphrase.
5 Conclusion & Future Work
In this paper, we have examined in detail the value
of multiple human reference translations, as com-
pared with a single human reference augmented by
means of fully automatic paraphrasing obtained via
English-to-English statistical translation. We found
that for the largest leap in performance, going from
Table 9: BLEU and TER scores are shown for the
MT04+05 validation set as the tuning set is enlarged by
borrowing from the training data.
BLEU:
         Base     T1     T2     T3
1H      36.40  36.85  36.95  37.00
1H+1P   38.25  38.59  38.60  38.55

TER:
         Base     T1     T2     T3
1H      56.17  58.23  58.60  59.03
1H+1P   54.20  55.43  55.59  55.77
a single reference to two references, an automated
paraphrase does quite as well as a second human
translation, and using n-best paraphrasing we found
that the point of diminishing returns is not hit un-
til four human translations are available. In addi-
tion, we performed a number of additional analy-
ses in order to understand in more detail how the
paraphrase-based approach is affected by a variety
of factors, including genre mismatch, human trans-
lation quality and tuning criteria that may not find
additional n-gram diversity as valuable as BLEU
does. The same analyses also validate the hypoth-
esis that the paraphraser indeed works by providing
additional n-gram diversity and not by means of ac-
cidental smoothing.
For these analyses, we used only a subset of the
data used to train the MT system (2 million sen-
tences). The point of this artificial restriction was
to verify that the gains achieved by paraphrasing
are not simply due to an inadvertent smoothing of
the feature values in the MT system. Of course,
a great advantage of the pivot-based full-sentence
paraphrase technique is that it does not require any
resources beyond those needed for building the MT
system: a bitext and an MT decoder. Therefore, the
best (and simplest) way to employ this technique
is to use the full MT training set for training the
paraphraser which, we believe, should provide even
larger gains.
Another important issue that must be discussed
concerns the brevity penalty component of the
BLEU score. One might question whether the suc-
cess of the paraphrase-based references derives pri-
marily from the potential for generating longer out-
puts, thereby bypassing the brevity penalty. How-
ever, our TER results offer conclusive evidence that
this is, in fact, not the case. If all this method did was
to force longer MT outputs without contributing any
meaningful content, then we would have observed a
large loss in TER scores (due to an increase in the
number of errors).
In order to achieve detailed comparisons with
multiple human reference translations, our exper-
imentation was done using a carefully translated
NIST development set. However, the results here
clearly point in a more ambitious direction: doing
away entirely with any human translations beyond
those already part of the training material expected by statistical MT systems. If the quality of the translations in the training set is good enough
— or if a high quality subset can be identified —
then the paraphrasing techniques we have applied
here may suffice to obtain the target-language vari-
ation needed to tune statistical MT systems effec-
tively. Experimentation of this kind is clearly a pri-
ority for future work.
We also intend to take advantage of one aspect of
the paraphraser that radically differentiates it from
an MT system: the fact that the source and the
target languages are the same. This fact will allow us to develop features and incorporate additional knowledge (much more easily than for a bilingual MT system) that can substantially improve the performance of the paraphraser and make it even more
useful in scenarios where it may not yet perform up
to its potential.
Finally, another avenue of further research is the
tuning metric used for the paraphrasers. Currently
the feature weights for the paraphraser features are
tuned as described in (Madnani et al., 2007), i.e., by
iteratively “translating” a set of source paraphrases,
comparing the answers to a set of reference para-
phrases according to the BLEU metric and updating
the feature weights to maximize the BLEU value in
the next iteration. While this is not unreasonable,
it is not optimal or even close to optimal: in ad-
dition to striving for semantic equivalence, an au-
tomatic paraphraser should also aim for lexical
diversity, especially if such diversity is required in a
downstream application. However, the BLEU met-
ric is designed to reward larger n-gram overlap with
reference translations. Therefore, using BLEU as
the metric for the tuning process might actually lead
to paraphrases with lower lexical diversity. Met-
rics recently proposed for the task of detecting para-
phrases and entailment (Dolan et al., 2004; Cordeiro
et al., 2007a; Cordeiro et al., 2007b) might be better suited
to this task.
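The diversity problem can be illustrated with the clipped n-gram precision at BLEU's core (Papineni et al., 2002). In this single-sentence sketch we treat the input sentence itself as a worst-case reference (corpus BLEU additionally combines several n-gram orders with the brevity penalty): an identity "paraphrase" achieves a perfect score, while a genuinely diverse paraphrase is penalized for exactly the substitutions we want it to make.

```python
from collections import Counter

def ngram_precision(candidate, reference, n=2):
    """Clipped n-gram precision, the core of BLEU, for one sentence pair."""
    c, r = candidate.split(), reference.split()
    c_ngrams = Counter(tuple(c[i:i + n]) for i in range(len(c) - n + 1))
    r_ngrams = Counter(tuple(r[i:i + n]) for i in range(len(r) - n + 1))
    if not c_ngrams:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs
    # in the reference ("clipping").
    clipped = sum(min(count, r_ngrams[g]) for g, count in c_ngrams.items())
    return clipped / sum(c_ngrams.values())

source = "the economy grew rapidly last year"
identity = "the economy grew rapidly last year"     # zero lexical diversity
diverse = "the economy expanded quickly last year"  # a good paraphrase

print(ngram_precision(identity, source))  # 1.0: copying is rewarded
print(ngram_precision(diverse, source))   # < 1.0: diversity is punished
```

A metric that optimizes this quantity alone therefore pulls the paraphraser toward copying, which is the motivation for the paraphrase- and entailment-oriented metrics cited above.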
6 Acknowledgments
This work was supported, in part, by BBN un-
der DARPA/IPTO contract HR0011-06-C-0022 and
IBM under HR0011-06-2-001. Any opinions, find-
ings, conclusions or recommendations expressed in
this paper are those of the authors and do not nec-
essarily reflect the view of DARPA. We are grateful
to Necip Fazil Ayan, Christof Monz, Adam Lopez,
Smaranda Muresan, Chris Dyer and other colleagues
for their valuable input. Finally, we would also like
to thank the anonymous reviewers for their useful
comments and suggestions.
References
Colin Bannard and Chris Callison-Burch. 2005. Para-
phrasing with bilingual parallel corpora. In Proceed-
ings of ACL.
Daniel M. Bikel, Richard L. Schwartz, and Ralph M.
Weischedel. 1999. An algorithm that learns what’s
in a name. Machine Learning, 34(1-3):211–231.
David Chiang. 2007. Hierarchical phrase-based transla-
tion. Computational Linguistics, 33(2):201–228.
William Dolan, Chris Quirk, and Chris Brockett. 2004.
Unsupervised construction of large paraphrase cor-
pora: Exploiting massively parallel news sources. In
Proceedings of COLING 2004, Geneva, Switzerland.
João Cordeiro, Gaël Dias, and Pavel Brazdil. 2007a. A
metric for paraphrase detection. In Proceedings of the
Second International Multi-Conference on Computing
in the Global Information Technology.
João Cordeiro, Gaël Dias, and Pavel Brazdil. 2007b.
New functions for unsupervised asymmetrical para-
phrase detection. Journal of Software, 2(4).
Philipp Koehn. 2004. Statistical significance tests for
machine translation evaluation. In Proceedings of
EMNLP.
Irene Langkilde. 2000. Forest-based statistical sentence
generation. In Proceedings of NAACL.
Nitin Madnani, Necip Fazil Ayan, Philip Resnik, and
Bonnie J. Dorr. 2007. Using paraphrases for param-
eter tuning in statistical machine translation. In Pro-
ceedings of the Second ACL Workshop on Statistical
Machine Translation.
Haitao Mi, Liang Huang, and Qun Liu. 2008. Forest-
based translation. In Proceedings of ACL-08: HLT,
pages 192–199, Columbus, Ohio, June. Association
for Computational Linguistics.
D. W. Oard. 2003. The surprise language exercises.
ACM Transactions on Asian Language Information
Processing, 2(3).
Franz Josef Och. 2003. Minimum error rate training in
statistical machine translation. In Proceedings of ACL.
K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. 2002.
Bleu: a method for automatic evaluation of machine
translation. In Proceedings of ACL.
M. Snover, B. Dorr, R. Schwartz, L. Micciulla, and
J. Makhoul. 2006. A study of translation edit rate with
targeted human annotation. In Proceedings of AMTA.
S. Strassel, C. Cieri, A. Cole, D. DiPersio, M. Liberman,
X. Ma, M. Maamouri, and K. Maeda. 2006. Inte-
grated linguistic resources for language exploitation
technologies. In Proceedings of LREC.