Language-independent Model for Machine Translation Evaluation with
Reinforced Factors
Aaron Li-Feng Han, Derek F. Wong, Lidia S. Chao, Liangye He
Yi Lu, Junwen Xing, and Xiaodong Zeng
Natural Language Processing & Portuguese-Chinese Machine Translation Laboratory
Department of Computer and Information Science
University of Macau, Macau S.A.R., China
hanlifengaaron@gmail.com, {derekfw, lidiasc}@umac.mo
{wutianshui0515,takamachi660,nlp2ct.anson,nlp2ct.samuel}@gmail.com
Abstract
Conventional machine translation evaluation metrics tend to perform well on certain language pairs but weakly on others. Furthermore, some evaluation metrics work only on particular language pairs, i.e. they are not language-independent. Finally, ignoring linguistic information usually leads to low correlation with human judgments, while employing too many linguistic features or external resources makes a metric complicated and difficult to replicate. To address these problems, a novel language-independent evaluation metric is proposed in this work with enhanced factors and a modest amount of optional linguistic information (part-of-speech tags, n-grams). To make the metric perform well on different language pairs, extensive factors are designed to reflect translation quality, and the assigned parameter weights are tunable according to the characteristics of the language pair in focus. Experiments show that this novel evaluation metric yields better performance than several classic evaluation metrics (including BLEU, TER and METEOR) and two state-of-the-art ones, ROSE and MPF.¹

¹ The final publication is available at http://www.mt-archive.info/
1 Introduction
Machine translation (MT) began as early as the 1950s (Weaver, 1955) and has made great progress since the 1990s due to the development of computers (storage capacity and computational power) and the availability of enlarged bilingual corpora (Marino et al., 2006). For example, (Och, 2003) presented MERT (Minimum Error Rate Training) for log-linear statistical machine translation (SMT) models to achieve better translation quality, (Su et al., 2009) used the Thematic Role Templates model to improve translation, and (Xiong et al., 2011) employed a maximum-entropy model. Statistical MT (Koehn, 2010) became the main approach in the MT literature. Due to the widespread development of MT systems, MT evaluation has become more and more important for telling us how well MT systems perform and whether they make progress. However, MT evaluation is difficult for several reasons: language variability results in no single correct translation, natural languages are highly ambiguous, and different languages do not always express the same content in the same way (Arnold, 2003).
How to evaluate the quality of each MT system, and what the criteria should be, have become new challenges facing MT researchers. The earliest human assessment methods include intelligibility (measuring how understandable the sentence is) and fidelity (measuring how much information the translated sentence retains compared to the original), used by the Automatic Language Processing Advisory Committee (ALPAC) around 1966 (Carroll, 1966), and the subsequently proposed adequacy (similar to fidelity), fluency (whether the sentence is well-formed and fluent) and comprehension (improved intelligibility) by the Defense Advanced Research Projects Agency (DARPA) of the US (White et al., 1994). Manual evaluations suffer from the main disadvantage that they are time-consuming and thus too expensive to perform frequently.
The early automatic evaluation metrics include the word error rate WER (Su et al., 1992) (the edit distance between the system output and the closest reference translation), the position-independent word error rate PER (Tillmann et al., 1997) (a variant of WER that disregards word ordering), BLEU (Papineni et al., 2002) (the geometric mean of the n-gram precision of the system output with respect to reference translations), NIST (Doddington, 2002) (adding information weights) and GTM (Turian et al., 2003). Recently, many other methods have been proposed to revise or improve these earlier works.
One category is the lexical similarity based metrics. Metrics of this kind include edit distance based methods, such as TER (Snover et al., 2006) and the work of (Akiba et al., 2001) in addition to WER and PER; precision based methods such as SIA (Liu and Gildea, 2006) in addition to BLEU and NIST; recall based methods such as ROUGE (Lin and Hovy, 2003); the word order information utilized by (Wong and Kit, 2008), (Isozaki et al., 2010) and (Talbot et al., 2011); and combinations of precision and recall such as Meteor-1.3 (Denkowski and Lavie, 2011) (a modified version of Meteor that includes ranking and adequacy versions and overcomes some weaknesses of earlier versions, such as noise in paraphrase matching, lack of punctuation handling and lack of discrimination between word types), BLANC (Lita et al., 2005), LEPOR (Han et al., 2012) and PORT (Chen et al., 2012). Another category is the employment of linguistic features. Metrics of this kind include syntactic similarity, such as the part-of-speech information used by ROSE (Song and Cohn, 2011) and MPF (Popovic, 2011) and the phrase information employed by (Echizen-ya and Araki, 2010) and (Han et al., 2013b), and semantic similarity, such as textual entailment used by (Mirkin et al., 2009), synonyms by (Chan and Ng, 2008) and paraphrases by (Snover et al., 2009).
The previously proposed evaluation methods suffer, to varying degrees, from several main weaknesses: they perform well on certain language pairs but weakly on others, which we call the language-bias problem; they consider no linguistic information (which is not reasonable from the perspective of linguistic analysis) or too many linguistic features (which makes them difficult to replicate), which we call the extremism problem; and they rely on incomprehensive factors (e.g. BLEU focuses on precision only). To address these problems, a novel automatic evaluation metric is proposed in this paper with enhanced factors, tunable parameters and optional linguistic information (part-of-speech, n-gram).
2 Designed Model
2.1 Employed Internal Factors
Firstly, we introduce the internal factors utilized in
the calculation model.
2.1.1 Enhanced Length Penalty
The enhanced length penalty ELP is designed to put a penalty on both longer and shorter system output translations (an enhanced version of the brevity penalty in BLEU):

ELP = \begin{cases} e^{1 - r/c}, & c < r \\ e^{1 - c/r}, & c \ge r \end{cases}    (1)

where the parameters c and r are the sentence lengths of the automatic output (candidate) and the reference translation respectively.
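For illustration, a minimal Python sketch of Eq. (1) follows. It only computes ELP from the candidate length c and reference length r; the guard for zero-length inputs is our own assumption, since the paper does not discuss degenerate cases.

```python
import math

def enhanced_length_penalty(c: int, r: int) -> float:
    """Enhanced length penalty (Eq. 1): penalises candidates that are
    either shorter or longer than the reference."""
    if c == 0 or r == 0:
        return 0.0  # our own guard; not specified in the paper
    if c < r:
        return math.exp(1.0 - r / c)
    return math.exp(1.0 - c / r)
```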
2.1.2 N-gram Position Difference Penalty
The n-gram position difference penalty NPosPenal is developed to compare the word order between the output and the reference translation:

NPosPenal = e^{-NPD}    (2)

where NPD is defined as:

NPD = \frac{1}{Length_{output}} \sum_{i=1}^{Length_{output}} |PD_i|    (3)

where Length_{output} is the length of the system output sentence and PD_i is the position difference value of the i-th output word. Every word from both the output translation and the reference should be aligned only once. When there is no match, the value of PD_i is assigned zero by default for that output token.
Two steps are designed to measure the NPD value. The first step is the context-dependent n-gram alignment: the n-gram method is used and given higher priority, which means the surrounding context of the candidate words is considered when selecting the matched pairs between the output and reference sentences. The nearest match is accepted as a backup choice to establish the alignment when there are nearby matches on both sides or when there are no other matched words surrounding the potential word pair. The alignment is one-directional, from the output sentence to the reference.
Assume that w_x represents the current word in the output sentence and w_{x+k} the k-th word before it (k < 0) or after it (k > 0). Correspondingly, w^r_y is the word matching w_x in the reference, and w^r_{y+j} has the same meaning as w_{x+k} but in the reference sentence. The variable Distance is the position difference value between the matching words in the output and the reference. The operation process and pseudo code of the context-dependent n-gram word alignment algorithm are shown in Figure 1, and an example is given in Figure 2. In the calculation step, each word is labeled with the quotient of its position number divided by the sentence length (the total number of tokens in the sentence).
Let us use the example in Figure 2 to introduce the NPD calculation (Figure 3). Each output word is labeled with a position quotient from 1/6 to 6/6 (indicating the word position normalised by the sentence length, which is 6). The words in the reference sentence are labeled in the same way.
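A hedged Python sketch of Eqs. (2)-(3) follows. The paper's full context-dependent n-gram alignment (Figure 1) also inspects surrounding words to disambiguate repeated tokens; the sketch below substitutes a simplified nearest-match alignment, which is our own assumption made only to keep the example short.

```python
import math

def npos_penal(output_tokens, ref_tokens):
    """Sketch of Eqs. (2)-(3): n-gram position difference penalty.
    Alignment is a simplified nearest match by normalised position,
    each reference position being used at most once."""
    out_len, ref_len = len(output_tokens), len(ref_tokens)
    if out_len == 0 or ref_len == 0:
        return 0.0, set()  # our own guard for degenerate input
    used = set()
    npd_sum = 0.0
    for i, tok in enumerate(output_tokens):
        # Reference positions still free for this token.
        candidates = [j for j, r in enumerate(ref_tokens)
                      if r == tok and j not in used]
        if not candidates:
            continue  # unmatched output word contributes PD_i = 0
        # Nearest match by normalised position (position / length).
        j = min(candidates,
                key=lambda j: abs((i + 1) / out_len - (j + 1) / ref_len))
        used.add(j)
        npd_sum += abs((i + 1) / out_len - (j + 1) / ref_len)
    npd = npd_sum / out_len
    return math.exp(-npd), used  # (NPosPenal, matched reference positions)
```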
2.1.3 Precision and Recall
Precision and recall are two commonly used criteria in the NLP literature. We use HPR to represent the weighted harmonic mean of precision and recall, i.e. Harmonic(αR, βP), where the weights are the tunable parameters α and β:

HPR = \frac{(\alpha + \beta) \, Precision \times Recall}{\alpha \, Precision + \beta \, Recall}    (4)

Precision = \frac{Aligned_{num}}{Length_{output}}    (5)

Recall = \frac{Aligned_{num}}{Length_{reference}}    (6)

where Aligned_{num} represents the number of successfully matched words appearing in both the translation and the reference.
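A minimal sketch of Eqs. (4)-(6) in Python; the guard against empty inputs is our own assumption.

```python
def hpr(aligned_num: int, out_len: int, ref_len: int,
        alpha: float, beta: float) -> float:
    """Sketch of Eqs. (4)-(6): weighted harmonic mean of precision and
    recall, Harmonic(alpha*R, beta*P)."""
    if aligned_num == 0 or out_len == 0 or ref_len == 0:
        return 0.0  # our own guard; the paper does not discuss this case
    precision = aligned_num / out_len
    recall = aligned_num / ref_len
    return (alpha + beta) * precision * recall / (alpha * precision + beta * recall)
```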
2.2 Sentence Level Score
Secondly, we introduce the mathematical harmonic mean to group multiple variables (X_1, X_2, ..., X_n):

Harmonic(X_1, X_2, ..., X_n) = \frac{n}{\sum_{i=1}^{n} \frac{1}{X_i}}    (7)

where n is the number of factors. Then, the weighted harmonic mean for multiple variables is:

Harmonic(w_{X_1} X_1, w_{X_2} X_2, ..., w_{X_n} X_n) = \frac{\sum_{i=1}^{n} w_{X_i}}{\sum_{i=1}^{n} \frac{w_{X_i}}{X_i}}    (8)

where w_{X_i} is the weight of variable X_i. Finally, the sentence-level score of the developed evaluation metric hLEPOR (Harmonic mean of enhanced Length Penalty, Precision, n-gram Position difference penalty and Recall) is measured by:

hLEPOR = \frac{\sum_{i=1}^{n} w_i}{\sum_{i=1}^{n} \frac{w_i}{Factor_i}} = \frac{w_{ELP} + w_{NPosPenal} + w_{HPR}}{\frac{w_{ELP}}{ELP} + \frac{w_{NPosPenal}}{NPosPenal} + \frac{w_{HPR}}{HPR}}    (9)

where ELP, NPosPenal and HPR are the three factors explained in the previous sections, with tunable weights w_{ELP}, w_{NPosPenal} and w_{HPR} respectively.
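A minimal sketch of Eq. (9), assuming the three factor values have already been computed (e.g. with the sketches above); the zero guard is our own assumption.

```python
def hlepor_sentence(elp: float, npos_penal: float, hpr_val: float,
                    w_elp: float, w_npp: float, w_hpr: float) -> float:
    """Sketch of Eq. (9): weighted harmonic mean of the three factors."""
    weights = (w_elp, w_npp, w_hpr)
    factors = (elp, npos_penal, hpr_val)
    if any(f == 0.0 for f in factors):
        return 0.0  # our own guard against division by zero
    return sum(weights) / sum(w / f for w, f in zip(weights, factors))
```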
2.3 System-level Score
The system-level score is the arithmetic mean of the sentence-level scores:

\overline{hLEPOR} = \frac{1}{SentNum} \sum_{i=1}^{SentNum} hLEPOR_i    (10)

where \overline{hLEPOR} represents the system-level score of hLEPOR, SentNum is the number of sentences in the test document, and hLEPOR_i is the score of the i-th sentence.
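A short sketch of Eq. (10); it simply averages the per-sentence scores produced by the sentence-level sketch above.

```python
def hlepor_system(sentence_scores) -> float:
    """Sketch of Eq. (10): system-level score as the arithmetic mean
    of the sentence-level hLEPOR scores."""
    scores = list(sentence_scores)
    return sum(scores) / len(scores) if scores else 0.0
```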
3 Enhanced Version
This section introduces an enhanced version of the developed metric hLEPOR, named hLEPOR_E. As discussed by many researchers, language variability results in no single correct translation, and different languages do not always express the same content in the same way. In addition to the augmented factors of the designed metric hLEPOR, we show that optional linguistic information can be combined into this metric concisely. As an example, we show how part-of-speech (POS) information can be employed in this metric. First, we calculate the system-level hLEPOR score on the surface words (hLEPOR_{word}). Then we apply the same algorithms of hLEPOR to the corresponding POS sequences of the words (hLEPOR_{POS}). Finally, we combine the two system-level scores with tunable weights (w_{hw} and w_{hp}) into the final score:

hLEPOR_E = \frac{1}{w_{hw} + w_{hp}} (w_{hw} \times hLEPOR_{word} + w_{hp} \times hLEPOR_{POS})    (11)
We mention POS information because it sometimes plays a role similar to synonyms, e.g. "there is a big bag" and "there is a large bag" can have the same meaning but use different surface words "big" and "large" (with the same POS, adjective). POS information has been proved helpful in the research works on ROSE (Song and Cohn, 2011) and MPF (Popovic, 2011). The POS information could be replaced by any other concise linguistic information in our designed model.
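A one-line sketch of Eq. (11), the weighted combination of the word-level and POS-level system scores:

```python
def hlepor_e(hlepor_word: float, hlepor_pos: float,
             w_hw: float, w_hp: float) -> float:
    """Sketch of Eq. (11): combine word-level and POS-level scores."""
    return (w_hw * hlepor_word + w_hp * hlepor_pos) / (w_hw + w_hp)
```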
4 Evaluating the Evaluation Metric
In order to distinguish the reliability of different MT evaluation metrics, the Spearman rank correlation coefficient ρ is commonly used to calculate the correlation in the annual workshop on statistical machine translation (WMT) of the Association for Computational Linguistics (ACL) (Callison-Burch et al., 2011). When there are no ties, the Spearman rank correlation coefficient is calculated as:

\rho_{\varphi(XY)} = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}    (12)

where d_i is the difference (D-value) between the two corresponding rank variables \vec{X} = \{x_1, x_2, ..., x_n\} and \vec{Y} = \{y_1, y_2, ..., y_n\} describing the system φ, and n is the number of variables in the system.
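A direct sketch of Eq. (12), valid only when there are no ties in either ranking:

```python
def spearman_rho(x_ranks, y_ranks) -> float:
    """Sketch of Eq. (12): Spearman rank correlation without ties."""
    n = len(x_ranks)
    assert n == len(y_ranks) and n > 1
    d_squared = sum((x - y) ** 2 for x, y in zip(x_ranks, y_ranks))
    return 1.0 - 6.0 * d_squared / (n * (n ** 2 - 1))

# Example: two rankings of four systems that differ only in the top two.
# spearman_rho([1, 2, 3, 4], [2, 1, 3, 4]) == 0.8
```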
5 Experiments
The experiment corpora are from the ACL special interest group on machine translation, SIGMT (the WMT workshop), and contain eight corpora covering English-to-other (Spanish, Czech, French and German) and other-to-English translation. There are many POS tagger tools available for different languages. We conducted an evaluation with different POS taggers and found that employing POS information increases the correlation score with human judgments for some language pairs but has little or no effect on others. The employed POS tagging tools include the Berkeley POS tagger for French, English and German (Petrov et al., 2006), the COMPOST Czech morphology tagger (Collins, 2002) and the TreeTagger Spanish tagger (Schmid, 1994). To avoid the overfitting problem, the WMT 2008 data (http://www.statmt.org/wmt08/) are used in the development stage for the tuning of the parameters, and the WMT 2011 corpora are used in testing. The tuned parameter values for the different language pairs are shown in Table 1. The abbreviations EN, CZ, DE, ES and FR mean English, Czech, German, Spanish and French respectively. In the n-gram word (POS) alignment, bigrams are selected for all language pairs. To keep the model concise and use as few external resources as possible, the value "N/A" means the POS information of that language pair is not employed, because it makes little or no difference in the correlation scores. The labels "(W)" and "(POS)" mean the parameters are tuned on words and on POS respectively. "NPP" abbreviates NPosPenal to save table width. The tuned parameter values also show that different language pairs embrace different characteristics.
The testing results on the WMT 2011 corpora (http://www.statmt.org/wmt11/) are shown in Table 2. The comparison with language-independent evaluation metrics covers the classic metrics (BLEU, TER and METEOR) and two state-of-the-art metrics, MPF and ROSE. We select MPF and ROSE because these two metrics also employ POS information, and MPF yielded the highest correlation score with human judgments among all the language-independent metrics (covering eight language pairs) in WMT 2011. The numbers of participating automatic MT systems in WMT 2011 are 10, 22, 15 and 17 respectively for English-to-other (CZ, DE, ES and FR) and 8, 20, 15 and 18 respectively for the opposite translation direction. The gold standard reference data for those corpora consist of 3,003 manually produced sentences. Automatic MT evaluation metrics are evaluated by their correlation coefficients with the human judgments.
Ratio                Other-to-English               English-to-Other
                     CZ-EN  DE-EN  ES-EN  FR-EN     EN-CZ  EN-DE  EN-ES  EN-FR
HPR:ELP:NPP (W)      7:2:1  3:2:1  7:2:1  3:2:1     3:2:1  1:3:7  3:2:1  3:2:1
HPR:ELP:NPP (POS)    N/A    3:2:1  N/A    3:2:1     N/A    7:2:1  N/A    3:2:1
α:β (W)              1:9    9:1    1:9    9:1       9:1    9:1    9:1    9:1
α:β (POS)            N/A    9:1    N/A    9:1       N/A    9:1    N/A    9:1
w_hw : w_hp          N/A    1:9    N/A    9:1       N/A    1:9    N/A    9:1
Table 1: Values of tuned weight parameters
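To make the roles of the tuned parameters concrete, the following hedged usage sketch plugs the DE-EN values from Table 1 into the helper functions sketched in Sections 2 and 3 (those helpers are our own illustrative code, not the released hLEPOR implementation); the two sentences are invented toy data, not taken from the WMT corpora.

```python
# Toy example with the DE-EN weights from Table 1:
# HPR:ELP:NPP = 3:2:1 (words), alpha:beta = 9:1, w_hw:w_hp = 1:9.
hyp = "there is a large bag".split()
ref = "there is a big bag".split()

npp, aligned = npos_penal(hyp, ref)                         # Eqs. (2)-(3)
elp = enhanced_length_penalty(len(hyp), len(ref))           # Eq. (1)
h = hpr(len(aligned), len(hyp), len(ref), alpha=9, beta=1)  # Eq. (4)
score_word = hlepor_sentence(elp, npp, h,
                             w_elp=2, w_npp=1, w_hpr=3)     # Eq. (9), ~0.889
# The POS-level score would be computed the same way on the POS sequences
# and combined at system level via hlepor_e(word_sys, pos_sys, w_hw=1, w_hp=9).
```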
Several conclusions can be drawn from the results. First, some evaluation metrics show good performance on part of the language pairs but low performance on others; e.g. ROSE reaches a 0.92 correlation with human judgments on the Spanish-to-English corpus but drops to 0.41 on English-to-German, and METEOR gets a 0.93 score on French-to-English but 0.30 on English-to-German. Second, hLEPOR_E generally yields good performance on the different language pairs, except for English-to-Czech, and achieves the highest mean correlation score of 0.83 over the eight corpora. Third, the recently developed methods (e.g. MPF, 0.81 mean score) correlate better with human judgments than the traditional ones (e.g. BLEU, 0.74 mean score), indicating progress in the research. Finally, no metric yields high performance on all language pairs, which shows that there remains large potential for improvement.
6 Conclusion and Perspectives
This work proposes a language-independent model for machine translation evaluation. Considering the different characteristics of different languages, hLEPOR_E has been designed extensively from different aspects, spanning word order (context-dependent n-gram alignment), output accuracy (precision), loyalty (recall) and translation length performance (sentence length). Different weight parameters are assigned to adjust the importance of each factor; for instance, word position can be free in some languages but strictly constrained in others. In practice, the features employed by hLEPOR_E are also the vital ones when people perform language translation. This is the philosophy behind the formulation and the study of this work, and we believe that human translation ideology is the exact direction that MT systems should try to approach. Furthermore, this work shows that different external resources or linguistic information can be integrated into this model easily. As suggested by other works, e.g. (Avramidis et al., 2011), POS information is considered in the experiments and shows some improvements on certain language pairs.
There are several main contributions of this paper compared with our previous work (Han et al., 2013). This work combines the use of surface words and linguistic features (instead of relying on the consilience of the POS sequences only). This paper measures the system-level hLEPOR score as the arithmetic mean of the sentence-level scores (instead of the harmonic mean of system-level internal factors). This paper shows the performance of the enhanced method hLEPOR_E on all eight language pairs released on the WMT official website (instead of only part of the language pairs, as in the previous work), and most of the performances have improved over the previous work on the same language pairs (e.g. the correlation score on German-English is 0.86, increased from 0.83; the correlation score on French-English is 0.92, increased from 0.74). Other potential linguistic features can easily be employed in the flexible model built in this paper.
There are also several aspects that should be addressed in future work. Firstly, more language pairs in addition to the European languages will be tested, such as Japanese, Korean and Chinese, and the performance of linguistic features (e.g. POS tagging) will also be explored on the new language pairs. Secondly, the tuning of the weight parameters to achieve high correlation with human judgments during the development period will be performed automatically. Thirdly, since the use of multiple references helps the usual translation quality measures correlate with human judgments, a scheme for how to use multiple references will be designed.
Metrics     Other-to-English               English-to-Other                 Mean
            CZ-EN  DE-EN  ES-EN  FR-EN     EN-CZ  EN-DE  EN-ES  EN-FR
hLEPOR_E    0.93   0.86   0.88   0.92      0.56   0.82   0.85   0.83      0.83
MPF         0.95   0.69   0.83   0.87      0.72   0.63   0.87   0.89      0.81
ROSE        0.88   0.59   0.92   0.86      0.65   0.41   0.90   0.86      0.76
METEOR      0.93   0.71   0.91   0.93      0.65   0.30   0.74   0.85      0.75
BLEU        0.88   0.48   0.90   0.85      0.65   0.44   0.87   0.86      0.74
TER         0.83   0.33   0.89   0.77      0.50   0.12   0.81   0.84      0.64
Table 2: Correlation coefficients with human judgments
The source code developed for this paper is freely available online for research purposes. The source code of the hLEPOR measuring algorithm is available at https://github.com/aaronlifenghan/aaron-project-hlepor. The final publication is available at http://www.mt-archive.info/.
Acknowledgments.
The authors are grateful to the Science and
Technology Development Fund of Macau and the
Research Committee of the University of Macau
for the funding support for our research, under
the reference No. 017/2009/A and RG060/09-
10S/CS/FST. The authors also wish to thank the
anonymous reviewers for many helpful comments.
References
Akiba, Y., K. Imamura, and E. Sumita. 2001. Using
Multiple Edit Distances to Automatically Rank Ma-
chine Translation Output. Proceedings of MT Sum-
mit VIII , Santiago de Compostela, Spain.
Arnold, D. 2003. Why translation is difficult for com-
puters. In Computers and Translation: A transla-
tor’s guide , Benjamins Translation Library.
Avramidis, E., Popovic, M., Vilar, D., Burchardt, A.
2011. Evaluate with Confidence Estimation: Ma-
chine ranking of translation outputs using grammat-
ical features. Proceedings of ACL-WMT , pages 65-
70, Edinburgh, Scotland, UK.
Callison-Burch, C., Koehn, P., Monz, C. and Zaidan, O.
F. 2011. Findings of the 2011 Workshop on Statisti-
cal Machine Translation. Proceedings of ACL-WMT
, pages 22-64, Edinburgh, Scotland, UK.
Carroll, J. B. 1966. An experiment in evaluating
the quality of translation. Languages and machines:
computers in translation and linguistics , Automat-
ic Language Processing Advisory Committee (AL-
PAC), Publication 1416, Division of Behavioral Sci-
ences, National Academy of Sciences, National Re-
search Council, page 67-75.
Chan, Y. S. and Ng, H. T. 2008. MAXSIM: A maxi-
mum similarity metric for machine translation eval-
uation. Proceedings of ACL 2008: HLT , pages
55–62.
Chen, Boxing, Roland Kuhn and Samuel Larkin.
2012. PORT: a Precision-Order-Recall MT Evalu-
ation Metric for Tuning. Proceedings of 50th ACL) ,
pages 930–939, Jeju, Republic of Korea.
Collins, M. 2002. Discriminative Training Method-
s for Hidden Markov Models: Theory and Experi-
ments with Perceptron Algorithms. Proceedings of
the ACL-02 conference, Volume 10 (EMNLP 02) ,
pages 1-8. Stroudsburg, PA, USA .
Denkowski, M. and Lavie, A. 2011. Meteor 1.3: Automatic metric for reliable optimization and
evaluation of machine translation systems. Proceed-
ings of (ACL-WMT) ,pages 85-91, Edinburgh, Scot-
land, UK.
Doddington, G. 2002. Automatic evaluation of ma-
chine translation quality using n-gram co-occurrence
statistics. Proceedings of the second internation-
al conference on Human Language Technology Re-
search , pages 138-145, San Diego, California, USA.
Echizen-ya, H. and Araki, K. 2010. Automatic eval-
uation method for machine translation using noun-
phrase chunking. Proceedings of ACL 2010 , pages
108–117. Association for Computational Linguistics.
Han, Aaron L.-F., Derek F. Wong, and Lidia S. Chao.
2012. LEPOR: A Robust Evaluation Metric for Ma-
chine Translation with Augmented Factors. Pro-
ceedings of the 24th International Conference of
COLING, Posters, pages 441-450, Mumbai, India.
Han, Aaron L.-F., Derek F. Wong, Lidia S. Chao, and
Liangye He. 2013. Automatic Machine Trans-
lation Evaluation with Part-of-Speech Information.
Proceedings of the 16th International Conference of
Text, Speech and Dialogue (TSD 2013), LNCS Vol-
ume Editors: Vaclav Matousek et al. Springer-Verlag
Berlin Heidelberg. Plzen, Czech Republic.
Han, Aaron L.-F., Derek F. Wong, Lidia S. Chao,
Liangye He, Shuo Li, and Ling Zhu. 2013b.
Phrase Mapping for French and English Treebank
and the Application in Machine Translation Evalu-
ation. Proceedings of the International Conference
of the German Society for Computational Linguis-
tics and Language Technology (GSCL 2013), LNCS. Volume Editors: Iryna Gurevych, Chris Biemann
and Torsten Zesch. Darmstadt, Germany.
Isozaki, H., Hirao, T., Duh, K., Sudoh, K., and Tsuka-
da, H. 2010. Automatic evaluation of translation
quality for distant language pairs. Proceedings of
the 2010 Conference on EMNLP , pages 944–952,
Cambridge, MA.
Koehn, P. 2010. Statistical Machine Translation. Cam-
bridge University Press .
Marino, B. Jose, Rafael E. Banchs, Josep M. Crego,
Adria de Gispert, Patrik Lambert, Jose A. Fonollosa,
and Marta R. Costa-jussa. 2006. N-gram based machine translation. Computational Linguistics, Vol. 32, No. 4, pp. 527-549, MIT Press.
Lin, Chin-Yew and E.H. Hovy. 2003. Automatic Eval-
uation of Summaries Using N-gram Co-occurrence
Statistics. Proceedings of HLT-NAACL 2003, Ed-
monton, Canada.
Lita, Lucian Vlad, Monica Rogati and Alon Lavie.
2005. BLANC: Learning Evaluation Metrics for
MT. Proceedings of the HLT/EMNLP, pages
740–747, Vancouver.
Liu D. and Daniel Gildea. 2006. Stochastic iterative
alignment for machine translation evaluation. Pro-
ceedings of ACL-06, Sydney.
Mirkin S., Lucia Specia, Nicola Cancedda, Ido Dagan,
Marc Dymetman, and Idan Szpektor. 2009. Source-
Language Entailment Modeling for Translating Un-
known Terms. Proceedings of the ACL-IJCNLP
2009) , pages 791–799, Suntec, Singapore.
Och, F. J. 2003. Minimum Error Rate Training for Statistical Machine Translation. Proceedings of ACL-2003, pp. 160-167.
Papineni, K., Roukos, S., Ward, T. and Zhu, W. J. 2002.
BLEU: a method for automatic evaluation of ma-
chine translation. Proceedings of the ACL 2002 ,
pages 311-318, Philadelphia, PA, USA.
Petrov, S., Leon Barrett, Romain Thibaux, and Dan Klein. 2006. Learning accurate, compact, and interpretable tree annotation. Proceedings of the 21st ACL, pages 433–440, Sydney.
Popovic, M. 2011. Morphemes and POS tags for n-
gram based evaluation metrics. Proceedings of WMT
, pages 104-107, Edinburgh, Scotland, UK.
Schmid, H. 1994. Probabilistic Part-of-Speech Tag-
ging Using Decision Trees. Proceedings of Inter-
national Conference on New Methods in Language
Processing , Manchester, UK.
Snover, M., Dorr, B., Schwartz, R., Micciulla, L. and
Makhoul J. 2006. A study of translation edit rate
with targeted human annotation. Proceedings of the
AMTA, pages 223-231, Boston, USA.
Snover, Matthew G., Nitin Madnani, Bonnie Dorr, and
Richard Schwartz. 2009. TER-Plus: paraphrase, se-
mantic, and alignment enhancements to Translation
Edit Rate. J. Machine Translation, 23: 117-127.
Song, X. and Cohn, T. 2011. Regression and rank-
ing based optimisation for sentence level MT eval-
uation. Proceedings of the WMT , pages 123-129,
Edinburgh, Scotland, UK.
Su, Hung-Yu and Chung-Hsien Wu. 2009. Improving
Structural Statistical Machine Translation for Sign
Language With Small Corpus Using Thematic Role
Templates as Translation Memory. IEEE TRANS-
ACTIONS ON AUDIO, SPEECH, AND LANGUAGE
PROCESSING , VOL. 17, NO. 7.
Su, Keh-Yih, Wu Ming-Wen and Chang Jing-Shin.
1992. A New Quantitative Quality Measure for Ma-
chine Translation Systems. Proceedings of COL-
ING, pages 433–439, Nantes, France.
Talbot, D., Kazawa, H., Ichikawa, H., Katz-Brown, J.,
Seno, M. and Och, F. 2011. A Lightweight Evalu-
ation Framework for Machine Translation Reorder-
ing. Proceedings of the WMT, pages 12-21, Edin-
burgh, Scotland, UK.
Tillmann, C., Stephan Vogel, Hermann Ney, Arkaitz
Zubiaga, and Hassan Sawaf. 1997. Accelerated
DP Based Search For Statistical Translation. Pro-
ceedings of the 5th European Conference on Speech
Communication and Technology .
Turian, J. P., Shen, L. and Melamed, I. D. 2003. Evaluation of machine translation and its evaluation.
Proceedings of MT Summit IX , pages 386-393, New
Orleans, LA, USA.
Weaver, Warren. 1955. Translation. Machine Trans-
lation of Languages: Fourteen Essays, In William
Locke and A. Donald Booth, editors, John Wiley and
Sons, New York, pages 15-23.
White, J. S., O’Connell, T. A., and O’Mara, F. E. 1994.
The ARPA MT evaluation methodologies: Evolu-
tion, lessons, and future approaches. Proceedings
of AMTA, pp. 193-205.
Wong, B. T-M and Kit, C. 2008. Word choice and
word position for automatic MT evaluation. Work-
shop: MetricsMATR of AMTA, short paper, 3 pages,
Waikiki, Hawai’I, USA.
Xiong, D., M. Zhang, H. Li. 2011. A Maximum-
Entropy Segmentation Model for Statistical Machine
Translation. IEEE Transactions on Audio, Speech,
and Language Processing , Volume: 19, Issue: 8,
2011 , pp. 2494- 2505.
Figure 1: N-gram word alignment algorithm
Figure 2: Example of n-gram word alignment
Figure 3: Example of NPD calculation
Automatic evaluation metrics are fundamentally important for Machine Translation, allowing comparison of systems performance and efficient training. Current evaluation metrics fall into two classes: heuristic approaches, like BLEU, and those using supervised learning trained on human judgement data. While many trained metrics provide a better match against human judgements, this comes at the cost of including lots of features, leading to unwieldy, non-portable and slow metrics. In this paper, we introduce a new trained metric, ROSE, which only uses simple features that are easy portable and quick to compute. In addition, ROSE is sentence-based, as opposed to document-based, allowing it to be used in a wider range of settings. Results show that ROSE performs well on many tasks, such as ranking system and syntactic constituents, with results competitive to BLEU. Moreover, this still holds when ROSE is trained on human judgements of translations into a different language compared with that use in testing.