The Prague Bulletin of Mathematical Linguistics
NUMBER 108 JUNE 2017 159–170
A Linguistic Evaluation of Rule-Based, Phrase-Based, and Neural MT Engines
Aljoscha Burchardt,ᵃ Vivien Macketanz,ᵃ Jon Dehdari,ᵃ Georg Heigold,ᵃ
Jan-Thorsten Peter,ᵇ Philip Williamsᶜ
ᵃ German Research Center for Artificial Intelligence (DFKI)
ᵇ RWTH Aachen University
ᶜ University of Edinburgh
Abstract
In this paper, we report an analysis of the strengths and weaknesses of several Machine Translation (MT) engines implementing the three most widely used paradigms. The analysis is based on a manually built test suite that comprises a large range of linguistic phenomena. Two main observations are, on the one hand, the striking improvement of a commercial online system when turning from a phrase-based to a neural engine, and, on the other hand, that the successful translations of neural MT systems sometimes bear a resemblance to the translations of a rule-based MT system.
1. Introduction
Test suites are a familiar tool in NLP in areas such as grammar checking, where one may wish to ensure that a parser is able to analyse certain sentences correctly, or to test the parser after changes to see if it still behaves in the expected way. In contrast to a “real-life” corpus, the input in a test suite may well be made up or edited to isolate and illustrate issues.
Apart from several singular attempts (King and Falkedal, 1990; Isahara, 1995; Koh et al., 2001, etc.), broadly-defined test suites have not generally been used in MT research. One of the reasons for this might be the fear that the performance of statistical MT systems depends so much on the particular input data, parameter settings, etc., that final conclusions about the errors they make, particularly about the different
reasons (e.g., length of n-grams, missing training examples), are difficult to obtain. A related concern is that statistical MT systems are designed to maximise scores on test corpora that are comparable to the training/tuning corpora, and that it is therefore unreliable to test these systems in different settings. While these concerns may hold for systems trained on very narrowly-defined domains, genres, and topics (such as biomedical patent abstracts), in fact many systems are trained on large amounts of data covering mixed sources and are expected to generalize to some degree.
A last reason might be that “correct” MT output cannot be specified in the same way as the output of other language processing tasks like parsing or fact extraction, where the expected results can be more or less clearly defined. Due to the variation of language, ambiguity, etc., checking and evaluating MT output can be almost as difficult as the translation itself. Still, people have tried to automatically classify errors by comparing MT output to reference translations or post-edited MT output using tools like Hjerson (Popović, 2011).
In narrow domains there seems to be interest in detecting differences between systems and within the development of one system, e.g., in terms of verb-particle constructions (Schottmüller and Nivre, 2014) or pronouns (Guillou and Hardmeier, 2016). Bentivogli et al. (2016) performed a comparison of neural with phrase-based MT systems on IWSLT data using a coarse-grained error typology. Neural systems were found to make fewer morphological, lexical, and word-order errors.
Below, we present a pioneering effort to address translation barriers in a systematic fashion. We are convinced that testing system performance on error classes leads to insights that can guide future research and improvements of systems. By using test suites, MT developers will be able to see how their systems perform on scenarios that are likely to lead to failure, and can take corrective action.
This paper is structured as follows: After the general introduction (Section 1), Section 2 will briefly introduce the test suite we have used in the experiments reported in Section 3. Section 4 concludes the paper.
2. The Test Suite
The experiments reported below are based on a test suite for MT quality that we are currently building for the language pair English – German in the QT21 project. The test suite itself will be described in more detail in a future publication. In brief, it contains segments selected from various parallel corpora and drawn from other sources such as grammatical resources, e.g., the TSNLP Grammar Test Suite (Lehmann et al., 1996), and online lists of typical translation errors.
Each test sentence is annotated with the phenomenon category and the phenomenon it represents. An example showing these fields can be seen in Table 1, with the first column containing the source segment and the second and third columns containing the phenomenon category and the phenomenon, respectively. The fourth column shows the translation given by the old Google Translate system and the last column
contains a post-edit of the MT output that is created by making as few changes as possible. In our latest version of the test suite, we have a collection of about 5,000 segments per language direction that are classified into about 15 categories (most of them similar in both language directions) and about 120 phenomena (many of them similar, but also some differing, as they are language-specific). Depending on the nature of the phenomenon, each is represented by at least 20 test segments in order to guarantee a balanced test set. The categories cover a wide range of different grammatical aspects that might or might not lead to translation difficulties for an MT system. Currently, we are still in the process of optimising our test segments and working on an automatic solution for the evaluation.
Source                                           | Phenomenon Category    | Phenomenon             | Target (raw)                                         | Target (edited)
Lena machte sich früh vom Acker.                 | MWE                    | Idiom                  | Lena [left the field early].                         | Lena left early.
Lisa hat Lasagne gemacht, sie ist schon im Ofen. | Non-verbal agreement   | Coreference            | Lisa has made lasagne, [she] is already in the oven. | Lisa has made lasagna, it is already in the oven.
Ich habe der Frau das Buch gegeben.              | Verb tense/aspect/mood | Ditransitive - perfect | I [have] the woman of the Book.                      | I have given the woman the book.

Table 1. Example test suite entries German–English (simplified for display purposes).
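To make the structure of such an entry concrete, the following minimal sketch shows one way the annotated segments could be represented in code; the class and field names are our own illustration, not the authors' actual data format.

```python
from dataclasses import dataclass

@dataclass
class TestSuiteEntry:
    """One annotated segment, mirroring the columns of Table 1."""
    source: str          # source-language segment
    category: str        # phenomenon category, e.g. "MWE"
    phenomenon: str      # specific phenomenon, e.g. "Idiom"
    target_raw: str      # raw MT output (brackets mark the error, as in Table 1)
    target_edited: str   # minimal post-edit of that output

# First row of Table 1 as an example entry
entry = TestSuiteEntry(
    source="Lena machte sich früh vom Acker.",
    category="MWE",
    phenomenon="Idiom",
    target_raw="Lena [left the field early].",
    target_edited="Lena left early.",
)
```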
For the experiments presented here, we have used a preliminary version of our test suite (ca. 800 items per language direction, to a large extent verb paradigms) to include the changes in Google Translate, which has recently been switched from a phrase-based to a neural approach according to the company's publications. There are more than 100 different linguistic phenomena that we investigated in this version of the test suite in each language direction. In this preliminary version, the number of instances reported in the experiments below varies strongly among the categories (as well as between the languages).
3. Evaluating PBMT, NMT, and RBMT Engines and an Online System
3.1. System Description
We have evaluated several engines from leading machine translation research groups and a commercial rule-based system on the basis of the very same test suite version, to be able to compare performance with the leading online system that has recently switched to a neural model. We included a number of different NMT systems with different properties and levels of sophistication to shed light on how these
new types of systems perform on the different kinds of phenomena. Below, we briefly describe the systems.

O-PBMT: Old version of Google Translate (web interface, Feb. 2016).

O-NMT: New version of Google Translate (web interface, Nov. 2016).

OS-PBMT: Open-source phrase-based system that primarily uses a default configuration to serve as a baseline. This includes a 5-gram modified Kneser-Ney language model, mkcls and MGiza for alignment, GDFA phrase extraction with a maximum phrase length of five, msd-bidi-fe lexical reordering, and the Moses decoder (Koehn et al., 2007). The WMT’16 data was Moses-tokenized and normalized, truecased, and deduplicated.

DFKI-NMT: Barebone neural system from DFKI. The MT engine is based on the encoder-decoder neural architecture with attention. The model was trained on the respective parallel WMT’16 data.

ED-NMT: Neural system from the University of Edinburgh. This MT engine is the top-ranked system that was submitted to the WMT’16 news translation task (Sennrich et al., 2016). The system was built using the Nematus toolkit (https://github.com/rsennrich/nematus). Among other features, it uses byte-pair encoding (BPE) to split the vocabulary into subword units (a toy illustration of BPE follows this list), uses additional parallel data generated by back-translation, uses an ensemble of four epochs (of the same training run), and uses a reversed right-to-left model to rescore n-best output.

RWTH-NMT: NMT system from RWTH (only used for German – English experiments). This system is identical to the ensemble of 8 NMT systems optimized on TEDX that was used in the IWSLT campaign (Peter et al., 2016). The eight networks make use of subword units and are fine-tuned to perform well on the IWSLT 2016 MSLT German-to-English task.

RBMT: Commercial rule-based system Lucy (Alonso and Thurmair, 2003).
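Since subword segmentation comes up repeatedly in these descriptions, the following toy re-implementation of the BPE merge-learning step (after the algorithm popularised by Sennrich et al.) may help: it repeatedly fuses the most frequent adjacent symbol pair. This is an illustrative sketch with made-up example words, not the actual Nematus preprocessing.

```python
import re
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merges: repeatedly fuse the most frequent adjacent pair."""
    # Represent each word as space-separated symbols plus an end-of-word marker.
    vocab = {" ".join(word) + " </w>": freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace the pair "a b" by the fused symbol "ab" in every word.
        pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(best)) + r"(?!\S)")
        vocab = {pattern.sub("".join(best), word): freq
                 for word, freq in vocab.items()}
    return merges

# Toy vocabulary; real systems learn tens of thousands of merges.
print(learn_bpe({"lower": 5, "low": 7, "newest": 6, "widest": 3}, 4))
```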
3.2. Evaluation Procedure
In order to evaluate a system’s performance on the categories in the test suite, we concentrate solely on the phenomenon in the respective sentence and disregard other errors. This means that we have to determine whether a translation error is linked to the phenomenon under examination or whether it is independent of the phenomenon. If the former is the case, the segment is counted as incorrect. If, however, the error in the translation cannot be traced back to the phenomenon, the segment is counted as correct.
Currently, the system outputs are automatically compared to a “reference translation” which is, in fact, a post-edit of the O-PBMT output, as those were the very first translations to be generated and evaluated when we started building the test suite (see the description of the test suite in Section 2 and Table 1).
Category                | #   | O-PBMT | O-NMT | RBMT | OS-PBMT | DFKI-NMT | RWTH-NMT | ED-NMT
Ambiguity               | 17  | 12%    | 35%   | 42%* | 24%     | 35%      | 12%      | 35%
Composition             | 11  | 27%    | 73%*  | 55%  | 27%     | 45%      | 45%      | 73%*
Function words          | 19  | 5%     | 68%*  | 21%  | 11%     | 26%      | 68%*     | 42%
LDD & interrogative     | 66  | 12%    | 79%*  | 62%  | 21%     | 36%      | 55%      | 52%
MWE                     | 42  | 14%    | 36%*  | 7%   | 21%     | 10%      | 12%      | 19%
NE & terminology        | 25  | 48%    | 48%   | 40%  | 52%*    | 40%      | 48%      | 40%
Subordination           | 36  | 22%    | 58%*  | 50%  | 31%     | 47%      | 42%      | 31%
Verb tense/aspect/mood  | 529 | 59%    | 80%   | 91%* | 52%     | 53%      | 74%      | 63%
Verb valency            | 32  | 16%    | 50%*  | 44%  | 13%     | 47%      | 38%      | 50%*
Sum                     | 777 | 358    | 567   | 583  | 337     | 367      | 490      | 435
Average                 |     | 46%    | 73%   | 75%  | 43%     | 47%      | 63%      | 56%

Table 2. Results of German – English translations. An asterisk indicates the best system(s) on each category (row).
In a second step, all the translations that do not match the “reference” are manually evaluated by a professional linguist, since the translations might be very different from the O-PBMT post-edit but nevertheless correct. As this is a very time-consuming process, we are currently working on automating this evaluation process by providing regular expressions for various possible translation outputs, naturally focusing only on the phenomenon under investigation.
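As an illustration of the envisaged workflow, the sketch below shows one possible decision procedure: an exact match against the post-edited reference, then phenomenon-focused regular expressions, and a fallback to manual review. The function name and patterns are our own assumptions; the paper does not publish its implementation.

```python
import re

def judge_segment(mt_output: str, reference: str, patterns: list[str]) -> str:
    """Return "correct", or "manual" if no automatic decision is possible.

    patterns: regular expressions, each describing an acceptable
    realisation of the phenomenon under investigation.
    """
    if mt_output.strip() == reference.strip():
        return "correct"                 # matches the post-edited reference
    if any(re.search(p, mt_output) for p in patterns):
        return "correct"                 # phenomenon realised in an accepted way
    return "manual"                      # hand over to a professional linguist

# Hypothetical patterns for the separable verb "aufhören" -> "stop"
patterns = [r"\bstop(s|ped)?\b"]
print(judge_segment("Why did Mr. Muschler stop the strike?",
                    "Why did Mr. Muschler stop painting?",
                    patterns))           # -> "correct" (phenomenon handled)
```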
We refrain from creating an independent reference, as we think that generating regular expressions that focus solely on the phenomena is the more sophisticated solution in this context. As a consequence, we cannot compute automatic scores like BLEU. We do not see this as a disadvantage, as with the test suite we want to focus on gaining insights into the nature of the translations rather than on how well translations match a certain reference.
3.3. Results German – English
Table 2 shows the results for the translations from German to English from the different systems on the categories. The second column in the table (“#”) contains the number of instances per category. As the distribution of examples per category in this old version of our test suite was very unbalanced, with some categories having only very few examples, some further categories we tested were excluded from the analysis we present here.
Before we discuss the results, we want to point out that the selection of phenomena
and the number of instances used here is not representative of their occurrence in
corpora. Consequently, it cannot be our goal to find out which of the systems is the globally “best” or winning system. Our goal is to check and illustrate the strengths and weaknesses of system (types) with respect to the range of phenomena we cover with this version of the test suite. Using this evaluation approach, researchers and system developers can ideally form hypotheses about the reasons why certain errors happen (systematically) and can come up with a prioritised strategy for improving the systems. Our ultimate goal is to represent all phenomena relevant for translation in our test suite.
Coming to the analysis, it is first of all striking how much better the neural version of Google Translate (O-NMT) is compared to its previous phrase-based version (O-PBMT). Interestingly, the O-NMT and the RBMT – two very different approaches – are the best-performing systems on average, achieving almost the same proportion of correct translations, i.e., 73% and 75%, respectively. Looking at the scores per category, however, reveals that the performance of the two systems is in fact very diverse. While the O-NMT system is the most frequent best-performing system per category, as it is best on composition, function words, long distance dependency (LDD) & interrogative, multi-word expressions (MWE), subordination, and verb valency, the RBMT is only the best system on ambiguity and verb tense/aspect/mood. (The good performance of RBMT on ambiguity can be explained by the very small number of items; it is more or less accidental that the preferred readings were the ones the RBMT has coded in its lexicon.) The high number of instances of the latter category leads to the high average score of the RBMT system, as verb paradigms are part of the linguistic information RBMT systems are based on.
The OS-PBMT reaches the lowest average score, but it is nevertheless the best-performing system on named entities (NE) & terminology. The DFKI-NMT system reaches a higher average score than the OS-PBMT system (four percentage points more). The RWTH-NMT is (along with the O-NMT) the best-performing system on function words; on average it reaches 63% correct translations. The ED-NMT outperforms (also along with the O-NMT) the other systems on composition and verb valency and reaches 56% correct translations on average.
In order to see whether we can find interesting correlations that might serve as a preview for more extensive analyses with a more solid and balanced number of test segments in the future, we have calculated Pearson’s coefficient over the phenomenon counts (being aware that we are dealing with very small numbers here). As the correlations for the direction English – German were higher, and for space reasons, we will show the numbers only for that direction in the following subsection, to give an indication of possible future work.
One general impression, which will also be supported by the examples below, is that NMT seems to learn some capabilities that the RBMT system has. This may lead to the speculation that NMT indeed learns something like the rules of the language. This, however, needs more intensive investigation. Another interesting observation is that
the RWTH-NMT system has a lower overall correlation with the other NMT systems. This might be because it has also been trained and optimised on transcripts of spoken language, as opposed to the other systems, which were trained solely on written language.
The following examples depict interesting findings from the analysis and comparison of the different systems. When a system created a correct output (on the respective category), the system’s name is marked with an asterisk.
(1) Source:     Warum hörte Herr Muschler mit dem Streichen auf?
    Reference:  Why did Mr. Muschler stop painting?
    O-PBMT:     Why heard Mr Muschler on with the strike?
    O-NMT*:     Why did Mr. Muschler stop the strike?
    RBMT*:      Why did Mr Muschler stop with the strike?
    OS-PBMT:    Why was Mr Muschler by scrapping on?
    DFKI-NMT:   Why did Mr Muschler listen to the rich?
    RWTH-NMT:   Why did Mr. Muschler listen to the stroke?
    ED-NMT*:    Why did Mr. Muschler stop with the stump?
Example (1) contains a phrasal verb and belongs to the category composition. German phrasal verbs have the characteristic that their prefix may be separated from the verb and moved to the end of the sentence in certain constructions, as has happened in example (1), with the prefix auf being separated from the rest of the verb hören. The verb aufhören means to stop, but the verb hören without the prefix simply means to listen. Thus, phrasal verbs can pose translation barriers in MT when the system translates the verb separately, not taking into account the prefix at the end of the sentence. The output of the O-PBMT, DFKI-NMT and RWTH-NMT indicates that this might have happened. The O-NMT, RBMT and the ED-NMT correctly translate the verb, which could mean that more context (and thus the prefix auf at the end of the sentence) was taken into account for the generation of the output.
(2) Source:     Warum macht der Tourist drei Fotos?
    Reference:  Why does the tourist take three fotos?
    O-PBMT:     Why does the tourist three fotos?
    O-NMT*:     Why does the tourist make three fotos?
    RBMT*:      Why does the tourist make three fotos?
    OS-PBMT:    Why does the tourist three fotos?
    DFKI-NMT*:  Why does the tourist make three fotos?
    RWTH-NMT*:  Why is the tourist taking three fotos?
    ED-NMT*:    Why does the tourist make three fotos?
One of the phenomena in the category LDD & interrogative is wh-movement. It is, for example, involved in wh-questions, like the sentence in (2). A wh-question in English is usually built with an auxiliary verb and a full verb, e.g., wh-word + to
have/to be/to do + full verb. In German, on the other hand, an auxiliary verb is not necessarily needed. This fact might lead to translation difficulties, as can be seen in (2), where the O-PBMT and the OS-PBMT treat the verb does as a full verb instead of an auxiliary verb. All the other systems translate the question with two verbs; however, except for the RWTH-NMT, they all mistranslate ein Foto machen as to make a foto (literal translation) instead of to take a foto. Nevertheless, these translations count as correct, since they do contain an auxiliary verb + a full verb.
(3) Source:     Die Arbeiter müssten in den sauren Apfel beißen.
    Reference:  The workers would have to bite the bullet.
    O-PBMT*:    The workers would have to bite the bullet.
    O-NMT:      The workers would have to bite into the acid apple.
    RBMT:       The workers would have to bite in the acid apple.
    OS-PBMT*:   The workers would have to bite the bullet.
    DFKI-NMT:   Workers would have to bite in the acid apple.
    RWTH-NMT:   The workers would have to bite into the clean apple.
    ED-NMT:     The workers would have to bite in the acidic apple.
Idioms are an interesting phenomenon within the category MWE. The meaning of an idiom in one language cannot be transferred to another language by simply translating the separate words, as the meaning of these multi-word units goes beyond the meaning of the separate words. As a consequence, idioms have to be transferred to another language as a whole. For German <> English it is often the case that an idiom in one language can be mapped to an idiom in the other language. This is also the case in example (3). The German idiom in den sauren Apfel beißen can be translated as to bite the bullet. Only the two PBMT systems correctly translate this idiom; the other systems all give a literal translation, with the RWTH-NMT translating sauren as clean instead of acid(ic) like the other systems, probably not knowing the word sauren and instead translating the similar word sauberen. This is one example where a phrase-based approach has a real advantage (if the phrase was in the training data).
(4) Source:     Wie kann ich die Farbe, mit der ich arbeite, ändern?
    Reference:  How can I change the color I am working with?
    O-PBMT:     How can I change the color with which I work to change?
    O-NMT*:     How can I change the color with which I work?
    RBMT*:      How can I change the color with which I work?
    OS-PBMT:    How can I change the colour, with whom i work, change?
    DFKI-NMT*:  How can I change the color I work with?
    RWTH-NMT*:  How can I change the color I work with?
    ED-NMT*:    How can I change the color I work with?
The sentence in (4) contains a relative clause, which belongs to the category subordination. Relative clauses in English can, but do not have to, contain a relative pronoun. The outputs in (4) show both variants. The O-PBMT and the OS-PBMT double the verb change; the remaining systems correctly translate the relative clause.
(5) Source:     Ich hätte nicht lesen gedurft.
    Reference:  I would not have been allowed to read.
    O-PBMT*:    I would not have been allowed to read.
    O-NMT:      I should not have read.
    RBMT*:      I would not have been allowed to read.
    OS-PBMT:    I would not have read gedurft.
    DFKI-NMT:   I would not have been able to read.
    RWTH-NMT:   I wouldn’t have read.
    ED-NMT:     I wouldn’t have read.
Verb paradigms (verb tense/aspect/mood) make up about one third of the whole test suite. Example (5) shows a sentence with a negated modal verb in the pluperfect subjunctive II. This is a quite complex construction, so it is not surprising that only a few systems translate the sentence correctly. As might be expected, one of them is the RBMT system. The second one is the O-PBMT; the neural version of this system, on the other hand, does not produce a correct output.
3.4. Results English – German
The results for the English – German translations can be found in Table 3. For this language direction, only five systems were available, instead of seven as for the other direction. As in the analysis for the other language direction, we excluded the categories that had too few instances from the table. Nevertheless, similarities between the categories of both language directions can be found.
As in the German – English translations, the RBMT system performs best of all systems on average, reaching 83%. It performs best of all systems on verb tense/aspect/mood and verb valency. The second-best system is – just like in the other language direction, but with a greater distance (seven percentage points less on average, namely 76%) – the O-NMT. The O-NMT shows quite contrasting results on the different categories compared to the RBMT: it outperforms (most of) the other systems on the remaining categories, i.e., on coordination & ellipsis, LDD & interrogative, MWE, NE & terminology, special verb types, and subordination.
The third-best system on average is the ED-NMT system. It reaches an average of 61% correct translations. The other remaining NMT system, the barebone DFKI-NMT system, reaches 11 percentage points less on average than the ED-NMT, namely 50%. But it outperforms the other systems on subordination, along with the O-NMT. The system with the lowest average score is the previous version of Google Translate, namely the O-PBMT. With 35% on average, it reaches less than half of the score of the O-NMT.
Category                | #   | O-PBMT | O-NMT | RBMT | DFKI-NMT | ED-NMT
Coordination & ellipsis | 17  | 6%     | 47%*  | 29%  | 24%      | 35%
LDD & interrogative     | 70  | 19%    | 61%*  | 54%  | 41%      | 40%
MWE                     | 42  | 21%    | 29%*  | 19%  | 21%      | 26%
NE & terminology        | 20  | 25%    | 80%*  | 40%  | 45%      | 65%
Special verb types      | 14  | 14%    | 86%*  | 79%  | 29%      | 64%
Subordination           | 35  | 11%    | 71%*  | 54%  | 71%*     | 69%
Verb tense/aspect/mood  | 600 | 41%    | 82%   | 96%* | 53%      | 66%
Verb valency            | 22  | 36%    | 59%   | 68%* | 64%      | 59%
Sum                     | 820 | 287    | 622   | 679  | 410      | 499
Average                 |     | 35%    | 76%   | 83%  | 50%      | 61%

Table 3. Results of English – German translations. An asterisk indicates the best system(s) on each category (row).
Correlations | O-PBMT | O-NMT | RBMT | DFKI-NMT | ED-NMT
O-PBMT       | 1.00   |       |      |          |
O-NMT        | 0.34   | 1.00  |      |          |
RBMT         | 0.39   | 0.55  | 1.00 |          |
DFKI-NMT     | 0.28   | 0.29  | 0.36 | 1.00     |
ED-NMT       | 0.30   | 0.33  | 0.43 | 0.55     | 1.00

Table 4. Overall correlation of English – German systems.
The results of the calculation of Pearson’s coefficient can be found in Table 4. Only categories with more than 25 observations had their correlation analysed. For the interpretation, we used a rule of thumb mentioned in the literature (http://www.dummies.com/education/math/statistics/how-to-interpret-a-correlation-coefficient-r).
In the overall correlation, the RBMT has a moderate correlation with the O-NMT, which might be traced back to the fact that these are the two systems that correctly translate most of the test segments compared to the other systems. The two neural systems, DFKI-NMT and ED-NMT, also have a moderate correlation. All the other systems have weak correlations with each other.
Again, given the small and unbalanced numbers of samples, we do not want to put too much emphasis on the observations regarding correlations. This type of analysis might, however, become more informative in future work.
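For readers who want to reproduce this kind of analysis, the sketch below computes Pearson's r for two accuracy vectors. For illustration it uses the per-category percentages from Table 3; the paper itself computed the coefficient over per-phenomenon counts, so the resulting value need not match Table 4.

```python
from math import sqrt

def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Per-category accuracies (%) from Table 3, in row order
o_nmt = [47, 61, 29, 80, 86, 71, 82, 59]
rbmt  = [29, 54, 19, 40, 79, 54, 96, 68]
print(round(pearson(o_nmt, rbmt), 2))
```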
4. Conclusions and Outlook
While the selection of test items/categories, and even more the selection of examples we discussed, provides a selective view on the performance of the systems, we are convinced that this type of quantitative and qualitative evaluation provides valuable insights and ideas for improving the systems, e.g., by adding linguistic knowledge in one way or another. The first main observation we want to repeat here is the striking improvement of the commercial online system when turning from a phrase-based to a neural engine. A second observation is that the successful translations of some NMT systems often bear a resemblance to the translations of the RBMT system. Hybrid combinations or pipelines where RBMT systems generate training material for NMT systems seem a promising future research direction to us.
While the extracted examples above give very interesting insights into the systems’ performance on the categories, they are only more or less random spot tests. However, taking a close look at the separate phenomena at a larger scale and in more detail will lead to more general, systematic observations. This is what we aim to do with our current version of the test suite, which is much more extensive and systematic and therefore also allows for more general observations and more quantitative statements in future experiments.
Our ultimate goal is to automate the test suite testing. To this end, we are currently working on a method that uses regular expressions for automatically checking the output of engines on the test suite. The idea is to manually provide positive and negative tokens for each test item, which can range from expected words in the case of disambiguation, over verbs and their prefixes with wildcards in between, up to complete sentences in the case of verb paradigms.
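The following sketch illustrates, with invented patterns, the three kinds of checks just described; the actual expressions used in the test suite are not published in this paper.

```python
import re

# 1. Disambiguation / idiom: an expected word must occur, and a word
#    signalling the unwanted literal reading must not (cf. example (3)).
positive = re.compile(r"\bbullet\b")
negative = re.compile(r"\b(sour|acid|acidic)\s+apple\b")

# 2. Separable verb: verb and prefix with a wildcard in between
#    (German output for English "stop": "hört ... auf").
separable = re.compile(r"\bhört\b.*\bauf\b")

# 3. Verb paradigm: a complete expected sentence (cf. example (5)).
paradigm = re.compile(r"^I would not have been allowed to read\.$")

output = "The workers would have to bite the bullet."
print(bool(positive.search(output)) and not negative.search(output))  # True
```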
Acknowledgements
This research is supported by the EC’s Horizon 2020 research and innovation programme under grant agreement no. 645452 (QT21).
Bibliography
Alonso, Juan A and Gregor Thurmair. The Comprendium Translator system. In Proceedings
of the Ninth Machine Translation Summit. International Association for Machine Translation
(IAMT), 2003.
Bentivogli, Luisa, Arianna Bisazza, Mauro Cettolo, and Marcello Federico. Neural versus
Phrase-Based Machine Translation Quality: a Case Study. CoRR, abs/1608.04631, 2016.
Guillou, Liane and Christian Hardmeier. PROTEST: A Test Suite for Evaluating Pronouns in Machine Translation. In Calzolari, Nicoletta (Conference Chair), Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris, France, May 2016. European Language Resources Association (ELRA). ISBN 978-2-9517408-9-1.
Isahara, Hitoshi. JEIDA’s test-sets for quality evaluation of MT systems: Technical evaluation
from the developer’s point of view. In Proceedings of the MT Summit V. Luxembourg, 1995.
King, Margaret and Kirsten Falkedal. Using Test Suites in Evaluation of Machine Translation
Systems. In Proceedings of the 13th Conference on Computational Linguistics - Volume 2, COLING
’90, pages 211–216, Stroudsburg, PA, USA, 1990. Association for Computational Linguistics.
Koehn, Philipp, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic, June 2007. Association for Computational Linguistics.
Koh, Sungryong, Jinee Maeng, Ji-Young Lee, Young-Sook Chae, and Key-Sun Choi. A test suite
for evaluation of English-to-Korean machine translation systems. In Proceedings of the MT
Summit VIII. Santiago de Compostela, Spain, 2001.
Lehmann, Sabine, Stephan Oepen, Sylvie Regnier-Prost, Klaus Netter, Veronika Lux, Judith Klein, Kirsten Falkedal, Frederik Fouvry, Dominique Estival, Eva Dauphin, Hervé Compagnion, Judith Baur, Lorna Balkan, and Doug Arnold. TSNLP – Test Suites for Natural Language Processing. In Proceedings of the 16th International Conference on Computational Linguistics, pages 711–716, 1996.
Peter, Jan-Thorsten, Andreas Guta, Nick Rossenbach, Miguel Graça, and Hermann Ney. The
RWTH Aachen Machine Translation System for IWSLT 2016. In International Workshop on
Spoken Language Translation, Seattle, USA, Dec. 2016.
Popović, Maja. Hjerson: An Open Source Tool for Automatic Error Classification of Machine Translation Output. The Prague Bulletin of Mathematical Linguistics, 96:59–68, October 2011.
Schottmüller, Nina and Joakim Nivre. Issues in Translating Verb-Particle Constructions from
German to English. In Proc. of the 10th Workshop on Multiword Expressions (MWE), pages
124–131, Gothenburg, Sweden, April 2014. Association for Computational Linguistics.
Sennrich, Rico, Barry Haddow, and Alexandra Birch. Edinburgh Neural Machine Translation
Systems for WMT 16. CoRR, abs/1606.02891, 2016.
Address for correspondence:
Aljoscha Burchardt
aljoscha.burchardt@dfki.de
German Research Center for Articial Intelligence (DFKI)
Language Technology Lab, Alt-Moabit 91c, 10559 Berlin, Germany