A. Przepiórkowski et al. (Eds.): Computational Linguistics, SCI 458, pp. 241–261, 2013.
DOI: 10.1007/978-3-642-34399-5_13 © Springer-Verlag Berlin Heidelberg 2013
Machine Translation at Work
Aljoscha Burchardt, Cindy Tscherwinka, Eleftherios Avramidis,
and Hans Uszkoreit*
Abstract. Machine translation (MT) is – not only historically – a prime applica-
tion of language technology. After years of seeming stagnation, the price pressure
on language service providers (LSPs) and the increased translation need have led
to new momentum for the inclusion of MT in industrial translation workflows. On
the research side, this trend is backed by improvements in translation perfor-
mance, especially in the area of hybrid MT approaches. Nevertheless, it is clear
that translation quality is far from perfect in many applications. Therefore, human
post-editing today seems the only way to go. This chapter reports on a system that
is being developed as part of taraXŰ, an ongoing joint project between industry
and research partners. By combining state-of-the-art language technology applica-
tions, developing informed selection mechanisms using the outputs of different
MT engines, and incorporating qualified translator feedback throughout the devel-
opment process, the project aims to make MT economically feasible and techni-
Just a few years ago, English was considered the lingua franca of the future, at
least in business contexts. Today, the situation has drastically changed, particu-
larly in light of the developments in web communication and publishing. The
amount of online content in other languages has exploded. According to some es-
timates, the European market volume for translation and interpretation, including
software localisation and website globalisation, was €5.7 billion in 2008 and was
expected to grow by 10% per annum.1 Yet this existing capacity based mostly on
human translation is by far not enough to satisfy current and future translation
needs. The integration of MT would seem to be the most promising way of
managing translation in a cost-effective and timely fashion in the future but, sur-
prisingly, neither the economic feasibility of MT nor the correlation with the real-
world needs of professional translators and LSPs have been analysed in depth up
to now. From the LSP’s point of view, MT lacks basic functionality to be a true
help in professional translation processes. Therefore, despite continuous
improvements to MT quality, this technology is economically viable only in very
specific translation scenarios.
Aljoscha Burchardt ⋅ Eleftherios Avramidis ⋅ Hans Uszkoreit
DFKI LT Lab
1 European Commission Directorate-General for Translation, Size of the language industry
in the EU, Kingston Upon Thames, 2009.
Within research, the move from rule-based to statistical MT systems and more
recently to hybrid ones has led to translation results one would not have dreamt of
just a few years ago. Still, there is no one-size-fits-all MT engine available. The
various MT paradigms have different strengths and shortcomings – not only in
terms of quality. For example, rule-based MT (RBMT) offers good control of the
overall translation process, but setting up and maintaining such a system is very
costly as it requires trained specialists. Statistical MT (SMT) is cheap, but it
requires huge amounts of computing power and training data, which may not be
available when new languages and domains are to be included. Translation Memory
Systems (TMS) provide human translation quality, but are limited in coverage
due to their underlying design. Selecting the right system (combination) for the
right task is an open question.
This paper reports on a system that is being developed as part of the taraXŰ
project, which aims to find answers to questions such as:
1. Can MT help human translators reduce translation costs without sacrificing
2. How can MT be integrated into professional translation processes?
3. How can a hybrid system utilise input properties, metadata about system be-
haviour, linguistic analysis of input and output text, etc.?
4. When is post-editing most effective?
5. How can the result of human post-editing be fed back into the system to help
To answer these questions, integrated and human-centric research is needed. Some
related work is available on supporting human translators in using MT and
post-editing (e.g., Casacuberta et al., 2009; Koehn, 2009; Specia, 2011), but for the
most part, no language professionals were involved in the design and calibration
of such systems. The question of how professional translators can optimally be
supported in their translation workflow is still open. In this article we focus on
questions 2-3 and 5. Questions 1 and 4 will be addressed in the further course of
As the first part of its analytic process, the taraXŰ system makes a selection
within a frame of hybrid MT, including RBMT, TMS, and SMT. Then, a self-
calibration component is applied, supplemented by controlled language technol-
ogy and human post-processing results from previous translations in order to
match real-world translation concerns. A novel feature of this project is that
human translators are integrated into the development process from the very be-
ginning: after several rounds of collecting and implementing user feedback, the se-
lection and calibration mechanisms will be refined and iteratively improved.
In this chapter we will consider the motivations behind the work from two dif-
ferent angles: We will give an overview of the language service provider’s general
conditions when working with MT in a professional translation environment and
the scientific motivation for working on hybrid MT in the taraXŰ project.
After that, we will describe the actual system being built in the project and show
how it takes account of the prerequisites for using MT in a professional translation
workflow. The final part of the chapter provides initial observations on the user
feedback and shows the feasibility of a sample selection mechanism. Some notes
on further work and conclusions close the chapter.
2 Translation Quality and Economic Pressure – The Language
Service Provider’s Motivation
The provision of high quality translations – preferably at a low cost – is a matter
of balancing a maximum level of automation in the translation process on the one
hand with carefully designed quality assurance measures on the other hand. The
ever-growing translation volumes and increasingly tight schedules cry out for the
integration of machine translation technology in professional translation
workflows. However, for a variety of reasons, this is economically feasible only under
very specific conditions. This section provides a background on LSPs’
requirements concerning translation workflows and in particular Translation Memory
Systems (TMS). This background is helpful for understanding the design of our
system, which we describe later on in this chapter.
2.1 Re-using Language Resources
For a long time machine translation technology has been developed and improved
with the clear focus on providing understandable translations for end users (in-
bound translation). Its primary goal has been to put people in a position to under-
stand the meaning of some foreign language sentence (information gisting) using
machine translation that preserves the source language meaning and complies with
general target language requirements such as spelling and grammar. While this is
a helpful procedure in a zero-resources scenario, the typical LSP scenario is that of
outbound translation for a specific customer (content producer). Usually, language
resources from past translations are available, having been revised and approved.
The contained material should not undergo MT again; instead, the resources
should be accounted for, during the translation process, in order to reduce redun-
dancy in the manual effort of post-editing and proofreading.
Professional translators needed a comfortable translation environment that en-
ables maximum reuse of past translations, in order to minimise translation effort
and ensure maximum consistency. This need resulted in the development of
Translation Memory (TM) systems, which are now standard in professional trans-
In TM systems, language resources do not only exist in the form of aligned
source and target language sentences in the memory. Usually, a client-specific
terminology database is available for translation as well. Such a termbase stores
multilingual terminology and has usually been carefully compiled and validated
during product development. It is a very important part of corporate identity and
binding for translation. Apart from terms, it also stores additional information that
helps the translator, such as definitions, sample sentences or usage information.
Most TM systems support term recognition, which integrates the termbase into the
editor and helps the translator by displaying relevant terminology for the current
segment. Terminology verifiers check the translation for compliance with the
termbase and warn the translator if source language terms are not translated ac-
cording to the information stored in the termbase.
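The terminology check described above can be sketched in a few lines. This is only an illustration under simplifying assumptions: the termbase is reduced to a dictionary from source terms to approved target terms, matching is done on surface strings without morphological analysis, and the function name `find_term_violations` and the sample entry are hypothetical rather than taken from any particular TM system.

```python
import re


def find_term_violations(source, target, termbase):
    """Return (source_term, approved_target_term) pairs for every termbase
    entry whose source term occurs in `source` while the approved
    translation is missing from `target`."""
    violations = []
    source_lower = source.lower()
    target_lower = target.lower()
    for source_term, target_term in termbase.items():
        # Whole-word match on the source side.
        pattern = r"\b" + re.escape(source_term.lower()) + r"\b"
        if re.search(pattern, source_lower):
            # Plain substring check on the target side (no lemmatisation).
            if target_term.lower() not in target_lower:
                violations.append((source_term, target_term))
    return violations


# Hypothetical client termbase entry and illustrative sentence pair.
termbase = {"user": "Benutzer"}
print(find_term_violations(
    "Click the green plus sign to add a new user and save.",
    "Klicken Sie das grüne Pluszeichen, um einen neuen Bediener hinzuzufügen.",
    termbase,
))
```

A production verifier would have to deal with inflected forms and multi-word terms, which a plain substring test cannot capture.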
By comparison, in MT terminological assets can be factored into translation,
but compliance with given terminology still has to be manually checked by the
translator. All lexical choices that do not comply with a given terminology have
to be identified and manually modified by the human translator. That is true even
if the machine translation result would be acceptable from a semantic point of
When employing MT, one has to keep in mind that target language require-
ments may also go beyond the lexical level and concern phrasing at the sentence
level. In this respect, even semantically suitable and grammatical translations
may be rejected if stylistic requirements of the target language are not met. By
contrast, when working with a TM system the translations suggested to the
translator are human-made, client-specific, validated and therefore expected to be
grammatically and orthographically correct and in line with the target language
requirements concerning style and terminology.
The matching algorithms of the TM systems provide an estimate of the similar-
ity of a proposed translation derived from the memory with the correct translation
of a given sentence. As a rule of thumb, 100% matches – where the sentence to be
translated exactly matches a source language sentence in the memory including
formatting information – do not need to be modified; high fuzzy matches require
minor post-editing effort and low fuzzy matches require a new translation. These
estimates enable a calculation of translation effort and are the basis of pricing. The
more translations can be re-used, the lower the price.
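The rule of thumb above can be illustrated with a small sketch. It assumes a word-level similarity computed with Python's standard `difflib.SequenceMatcher`, which is not the matching algorithm of any commercial TM system, and a hypothetical threshold of 85 for separating high from low fuzzy matches; formatting information is ignored.

```python
from difflib import SequenceMatcher


def match_score(new_segment, tm_segment):
    """Word-level similarity between a new source segment and a source
    segment stored in the memory, on a 0-100 scale."""
    new_words, tm_words = new_segment.split(), tm_segment.split()
    return 100.0 * SequenceMatcher(None, new_words, tm_words).ratio()


def effort_band(score, high_fuzzy_threshold=85.0):
    """Map a match score to the effort categories of the rule of thumb.
    The threshold value is an assumption made for this sketch."""
    if score >= 100.0:
        return "100% match: reuse without modification"
    if score >= high_fuzzy_threshold:
        return "high fuzzy match: minor post-editing"
    return "low fuzzy match: new translation"


score = match_score("Click the green plus sign to add a new user.",
                    "Click the green plus sign to add a new group.")
print(round(score, 1), effort_band(score))
```

Because such a score is available before any translation work starts, it can be turned directly into a price estimate, which is exactly what is missing for MT output.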
2.2 Machine Translation and Post-editing Effort
The advantage of MT over Translation Memories is that it can deliver translations
even if there are no past translation projects that can be reused for a new translation
job. In its current form, however, it is missing a valuable feature of TMs: It is not
possible to estimate machine translation quality or the post-editing effort required to
create a high quality translation without reference to the post-edited result. An MT
translation may have the value of a 100% match as well as a zero match. That poses
a real problem for the LSP. On what basis does it make a fair offer?
Assessing the time and effort needed to post-edit MT output may be possible in
a very restricted scenario, in which source language document characteristics are
known, the MT engine is optimised according to the expected source, and when
there is already some experience of MT results in that scenario. But even this is an
assessment at the document level and subject to change. For the economic applica-
tion of MT in any given scenario, methods for automatically assessing MT quality
at the sentence level are needed. As mentioned earlier, such an assessment
includes not only an evaluation of orthography and grammar of the source
language content, but also how well target language requirements, such as termi-
nology and style, are met. In the context of software documentation, there may
also be more general criteria such as the maximum permissible length of a sen-
2.3 Translation Workflow Requirements
Apart from these content-related requirements, there are also workflow require-
ments that have to be taken into account when deciding how MT can be deployed
in an economically feasible way, the most important being:
• the processability of various file formats apart from plain text,
• the possibility of accessing MT from the familiar and functionally mature
working environment of the professional translator,
• the possibility of making use of the post-edited machine translation results for
future translation jobs, and
• the continuous optimisation of the MT engines with minimum manual effort.
The professional translator generally works with a TM system that consists of a
translation memory, a terminology database and an editor. The TM system per-
forms the task of stripping formatting information from the translatable content.
Most TM systems come with filters that can handle a variety of file formats.
Translatable content is then compared with the translation memory and filled up
with target language content according to a defined threshold value in a bilingual
file. The translator then uses the editor, from which he or she has access to the
translation memory and the terminology database for translation. The proposals, if
there are any, from the translation memory are fed into the editor where they are
manually edited. After translation, the new source and target language pairs are
stored in the translation memory and the bilingual file is saved as a target language
file in the original file format.
In an effort to avoid the need to work with multiple software tools and to pro-
vide the translator with MT in his or her familiar working environment, a trend for
integrating MT into the translation memory has emerged. The major TM system
vendors have already integrated MT engines into their products and innovative
LSPs have built their own solutions to integrate MT.
Following this approach, the translator’s working environment does not change,
except that there is one more proposed translation – marked as being MT – to
check for suitability for post-editing, meaning that the effort of post-editing is
smaller than that of creating a new translation. Even the remaining problem of
saving the post-edited machine translations for future translation jobs in the mem-
ory is handled by the TMS.
2.4 Optimisation of MT Engines
For rule-based systems, substantial quality leaps can be achieved by means of dic-
tionary enhancement. System improvement is a targeted measure and applies
mainly to a specific document to be translated. In an optimal translation workflow,
terminology is provided and translated before document translation and stored in
the terminology database to allow for active terminology recognition during trans-
lation via the TMS and thus ensure consistent translation. However, because there
are usually tight deadlines, terminology often has to be collected during document
translation. If the source language documents always concern a certain domain,
subsequent dictionary enhancement will most likely increase the machine transla-
tion quality of future translation jobs. Otherwise, subsequent dictionary enhance-
ment does not justify the effort, as it is a costly process and requires the skills of a
Concerning statistical systems, improvement is currently achieved by training
the system with ever more bilingual data, and weighting certain training content
for domain specificity. As opposed to the optimisation of rule-based systems, it is
impossible to optimise statistical MT systems on the basis of the content to be
translated. Instead, comparable material is used in the optimisation process to
In summary, the LSP is confronted with a variety of issues when introducing
MT into existing translation workflows. Some can be resolved easily, while others
certainly require some development effort, and it still remains completely unclear
how others are best dealt with. Questions such as how to determine MT quality
and post-editing effort before translation takes place, how to automatically choose
the best machine translation given a specific translation scenario, and how to eco-
nomically optimise MT quality are the subject of research and development in the
3 Hybrid Machine Translation – Scientific Motivation
This section explains the scientific motivation for including a hybrid MT system
in our project by briefly discussing strengths and weaknesses of the main MT
paradigms. We will also discuss different modes of MT support that are offered by
The idea of using computers to translate between languages can be traced back
to the late 1940s and was followed by substantial funding for research during the
1950s and again in the 1980s. At its most basic level, MT simply substitutes words
in one natural language with words in another language. This naive approach can
be useful in subject domains that have a very restricted, formulaic language such
as weather reports. However, to obtain a good translation of less standardised
texts, larger text units (phrases, sentences, or even whole passages) must be
matched to their closest counterparts in the target language. One way of approach-
ing this is based on linguistic rules. Research on rule-based MT goes back to the
early days of Artificial Intelligence in the 1960s, and some systems have reached a
high level of sophistication (e.g., Schwall & Thurmair, 1997). While this symbolic
approach allows close control of the translation process, maintenance is expensive
as it relies on the availability of large sets of grammar rules carefully designed by
a skilled linguist.
Fig. 1 Statistical (left) and rule-based (right) MT (from: Burchardt et al., 2012)
In the late 1980s when computational power increased and became cheaper,
interest in statistical models for MT began to grow. Statistical translation models are
derived by analysing parallel, bilingual text corpora such as the
Europarl parallel corpus, which contains the proceedings of the European
Parliament in 21 European languages. Given enough data, statistical MT works well
enough to derive approximate translations of foreign language texts. Since the
mid-1990s, statistical MT has become the prevalent MT approach in the research
community (e.g., Koehn et al., 2007; Li et al., 2010). Well-known large free
online translation services also rely on statistical MT. Compared with rule-based
systems, statistical MT more often generates ungrammatical output. Fig. 1
schematically illustrates a statistical and a rule-based MT pipeline.
Hybrid MT is a recent trend (e.g., Federmann et al., 2009; Chen et al., 2009) for
leveraging the quality of MT. Based on the observation that different MT systems
often have complementary strengths and weaknesses, different methods for
hybridisation are investigated that aim to “fuse” an improved translation out of the
good parts of several translations.
Typical difficulties for statistical MT are morphology, sentence structure,
long-range re-ordering and missing words, while strengths are disambiguation and
lexical choice. Rule-based MT systems are typically strong in morphology, sentence
structure, and have the ability to handle long-range phenomena. Weaknesses arise
from parsing errors and wrong lexical choice. The following examples illustrate
the complementary nature of such system errors.
• (1) Input: Wir sollten ihn auf keinen Fall heute zerstören.
− Human translation: We definitely shouldn’t destroy it/him today.
− Rule-based MT system: We should not destroy it in any case today.
− Statistical MT system: We should, in any case, it today.
• (2) Input: Für eine ganze Reihe von Bereichen bringt dies die Drosselung und
Einstellung von Aktivitäten mit sich.
− Human translation: For a wide range of sectors this means the reduction
and phasing out of activities.
− Rule-based MT system: For a whole series of fields this brings the
restriction and step-by-step attitude from activities with themselves.
− Statistical MT system: For a whole series of areas of this brings the restric-
tion and phasing out of activities.
In (1), the rule-based system produced an intelligible and almost perfect transla-
tion while the statistical system dropped both the negation, which is hidden in auf
keinen Fall (under no circumstances) and the main verb zerstören (destroy), which
has a sentence-final position in the German input. In (2), the statistical system
came up with a good translation. The rule-based system chose the wrong sense of
Einstellung (attitude instead of discontinuation) and mistranslated the German
bringt … mit sich (brings) literally into brings … with themselves.
In the design of a hybrid system, a fundamental conceptual decision is whether to
merge the translation results of different systems into a new sentence, a process
known as system combination, or to select the best sentence from the different
translation results. The taraXŰ system described in the next section
follows the latter approach in selecting system outputs from all major MT
paradigms in a hybrid architecture.
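The difference between the two design options is easiest to see in code. The following minimal sketch shows sentence-level selection only: for each source sentence, one complete engine output is kept and nothing is merged. The function name, the `estimate_quality` callback and the numeric estimates are assumptions made for illustration; how such estimates could actually be obtained is the subject of the selection mechanism described later.

```python
def select_per_sentence(candidates_per_sentence, estimate_quality):
    """Pick, for every source sentence, the engine output with the highest
    quality estimate.

    candidates_per_sentence: list of dicts mapping engine name -> translation.
    estimate_quality: callable (engine, translation) -> float, higher is better.
    """
    selection = []
    for candidates in candidates_per_sentence:
        best_engine = max(
            candidates,
            key=lambda engine: estimate_quality(engine, candidates[engine]),
        )
        selection.append((best_engine, candidates[best_engine]))
    return selection


# Engine outputs for examples (1) and (2) above.
candidates = [
    {"RBMT": "We should not destroy it in any case today.",
     "SMT": "We should, in any case, it today."},
    {"RBMT": "For a whole series of fields this brings the restriction and "
             "step-by-step attitude from activities with themselves.",
     "SMT": "For a whole series of areas of this brings the restriction and "
            "phasing out of activities."},
]

# Hypothetical quality estimates, invented for illustration only.
toy_estimates = {
    candidates[0]["RBMT"]: 0.9, candidates[0]["SMT"]: 0.2,
    candidates[1]["RBMT"]: 0.3, candidates[1]["SMT"]: 0.8,
}

for engine, text in select_per_sentence(candidates, lambda e, t: toy_estimates[t]):
    print(engine, "->", text)
```

With these invented estimates the rule-based output is kept for the first sentence and the statistical output for the second, so the resulting document mixes engines sentence by sentence.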
3.2 Measuring Translation Quality
In the development of MT engines, a great deal of effort has been spent finding
measures that correlate well with human judgements when comparing translation
systems for quality. Commonly used measures of MT quality such as BLEU (Pap-
ineni et al., 2001) depend on the availability of human reference translations,
which are only available in artificial development scenarios and cannot be taken as
absolute because of the inherent subjectivity of the human judgement. Further-
more, the correlation of these metrics to human judgments has been mainly
measured on ranking between different machine translation systems (e.g., Calli-
son-Burch et al., 2006). While ranking systems is an important first step, it does
not provide many scientific insights for their improvement.
On the other hand, a number of common translation quality standards are used
to assess translations produced by professional human translators that mostly as-
sess surface criteria at the sentence level such as the well-formedness of the trans-
lation (e.g., SAE J2450) and sometimes more abstract criteria at the document
level such as terminology, style, or consistency (e.g., LISA QA). The notion of
translation quality, however, is also relative to factors such as the purpose of the
translation (e.g., providing information vs. giving instructions) or the expectation
of the recipient (e.g., elaboration vs. brevity).
Overall, research and development in MT is confronted with heterogeneous re-
quirements from human translation workflows, a variety of different MT para-
digms, and quality criteria that partially rely on human judgement. Therefore, the
taraXŰ project follows a human-centric hybrid approach.
3.3 Post-editing vs. Standalone MT
Two different translation scenarios are studied in the project and thus have to be
handled by the project’s system:
• Standalone MT. As the name suggests, this is a pure MT scenario, where the
hybrid system selection mechanism has to find the translation that best pre-
serves the meaning of the source language sentence.
• Human post-editing. In this scenario, a human translator post-edits the
translation result in order to reach the level of a high-quality human translation. The
system should thus select and offer the translation that is easiest to edit.
Ideally, it would even indicate if none of the outputs is good enough for
post-editing and creating a translation manually would require the least effort.
The sentences humans select as the “best translation” and “easiest-to-post-edit” in
fact differ. To anticipate one result from an MT post-editing task described later,
here is the result of a human expert ranking of four different results of MT engines
and the sentence that was chosen for post-editing:
• Rank 1: Our experience shows that the majority of the customers in the three
department stores views not more at all on the prices.
• Rank 2: Our experience shows that the majority of the customers doesn’t look
on the prices in the three department stores any more.
• Rank 3: Our experience shows that the majority of the customers does not look
at the prices anymore at all in the three department stores.
• Rank 4: Our experience shows that the majority of customers in the three
Warenhäusern do not look more on prices.
• Editing result (on Rank 4): Our experience shows that the majority of custom-
ers in the three department stores no longer look at the prices.
The rationale behind this is clear. Due to the untranslated word Warenhäusern
(department stores), this translation is ranked worst. But by translating one word
and making a few other changes, it can be made acceptable.
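One simple way to make "few changes" measurable is to count word-level edit operations between an MT output and its post-edited version. The sketch below uses plain Levenshtein distance over whitespace-separated tokens; this is an assumption chosen for illustration and not necessarily the measure used in the project. Punctuation attached to words is treated as part of the token.

```python
def word_edit_distance(hypothesis, reference):
    """Minimum number of word insertions, deletions and substitutions
    needed to turn `hypothesis` into `reference`."""
    hyp, ref = hypothesis.split(), reference.split()
    # prev[j] holds the distance between the processed hypothesis prefix
    # and the first j reference words.
    prev = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        curr = [i]
        for j, ref_word in enumerate(ref, start=1):
            substitution = 0 if hyp_word == ref_word else 1
            curr.append(min(prev[j] + 1,          # delete hyp_word
                            curr[j - 1] + 1,      # insert ref_word
                            prev[j - 1] + substitution))
        prev = curr
    return prev[-1]


post_edited = ("Our experience shows that the majority of customers in the "
               "three department stores no longer look at the prices.")
rank_1 = ("Our experience shows that the majority of the customers in the "
          "three department stores views not more at all on the prices.")
rank_4 = ("Our experience shows that the majority of customers in the "
          "three Warenhäusern do not look more on prices.")

for name, output in [("Rank 1", rank_1), ("Rank 4", rank_4)]:
    print(name, word_edit_distance(output, post_edited))
```

Such a count can only be computed after post-editing has taken place, so it serves as an evaluation measure rather than as a predictor of effort.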
4 The taraXŰ MT System at Work
4.1 Translation Workflow
In this section, we provide a high-level description of the core parts of the taraXŰ
system. The results of experiments and evaluations are presented in the next sec-
tion. Our system follows the approach of deploying several MT engines in one in-
frastructure and selecting the best of all machine translations for further use. To al-
low for maximum usability in professional translation environments we embedded
the process of requesting the MT results, selecting the best translation and post-
editing in a TMS workflow. In such a workflow, MT is applied to new content
only and input can be processed in a variety of file formats. The translator uses his
or her familiar working environment and various quality assurance measures that
come with the TMS can be applied to the bilingual content. Once post-edited,
the machine translation becomes part of the memory content and is available for
In taraXŰ the translator is offered the most suitable machine translation out of
all available machine translations. The target language document is therefore a
patchwork of the best translations selected on a sentence level, i.e. the best
translation for each one of the original sentences. That is at least true for a gisting
scenario, in which the target language document is not subject to post-editing and
needs to be fully translated. In the case of post-editing there still can be source
language content in the translated file if the expected quality of the machine
translation is so low that it would not speed up the translation process and would
possibly irritate the translator.
4.2 System Architecture
The taraXŰ framework integrates rule-based and statistical machine translation
engines, a system for automatic source and target language quality assessment and
a translation memory system.
Each of the systems we use in our experiments is proven and well established
in its own closed field of application. Given that, and as a practical extension to
current attempts at combining several MT engines, taraXŰ offers an informed
selection mechanism that chooses the best out of a choice of machine translations.
The selection mechanism evaluates all available system[…]
that could be derived from the systems. Most of the systems generate information
that gives insight into the translation process. We are convinced that at least part of
that information can be used to estimate translation quality and possibly
post-editing effort. Additional workflow characteristics for choosing among the
available translations are incorporated into the selection mechanism. The optimal
translation output is then […]
Fig. 2 System architecture showing main steps: input analysis/translation, output checking […]
Fig. 2 depicts the components of the taraXŰ approach and the way they
interact. As can be seen, the segmented source language content is routed through
the embedded systems. The source language quality assessment component (AA)
provides the selection mechanism (Select) with categorised information about
spelling, grammar and style. The TMS (TMS) and all MT (RBMT, SMT) engines
produce target language output and provide system-immanent metadata, which is
also fed to the selection mechanism (Select). The machine translation results are
subject to target language quality assessment (AA) and the same categorised
information as for source language quality assessment is provided to the selection
mechanism. Fig. 3 illustrates the process of generating a target language file.
Fig. 3 Overview of the selection […]
4.3 Selection Mechanism
The Selection Mechanism takes over the task of automatically assessing the
quality of machine translation, given the field of application. To be of real practical
value we are convinced that the selection mechanism must closely model human
selection behaviour. Our approach is to model the correlation between
machine-accessible information and human selection, using Machine Learning techniques
for fine-tuning and extracting patterns. The incorporation of pragmatic and
sy[…] information that is derived from the
• translation workflow (information gisting or post-editing),
• source language quality,
• target language quality,
• behaviour of the MT engines
is relevant to this. Following our approach, the following aspects are taken into
account:
(i) Source Language Quality Assessment. Regarding the complementary
strengths and weaknesses of the different MT paradigms, evaluating the source
content characteristics on a sentence level should have a positive effect on MT
selection. Moreover, in a practical scenario not only well-formed and grammatically
correct source language content can be expected. Non-native writers may produce
erroneous content with respect to spelling and grammar that is handled differently
by statistical and rule-based MT engines. The following examples show how
source language quality affects machine translation quality.
• Source: Edit the script […]
• Target: Überarbeiten Sie das Schriftgebrüll:
• Source: The Greens plus click to add a new user and save.
• Target: Das Grünzeug plus Klick, um einen neuen Bediener hinzuzufügen und […]
The German translations are incomprehensible nonsense. Correction of the source
language sentences results in understandable machine translations that can be
made perfect with a few post-edits.
• Source: Edit the script below:
• Target: Geben Sie das Skript unten heraus:
• Source: Click the green plus sign to add a new user and save.
• Target: Klicken Sie das grüne Pluszeichen, um einen neuen Benutzer hinzu-
zufügen, und sichern.
(ii) Target Language Quality Assessment. A target-language quality assurance
functionality can obviously be of help for the selection. Primarily the degree of
compliance with target language requirements (and in particular with terminology,
grammar and style) allows the graded assessment of post-editing effort.
(iii) Behaviour of the MT Engines. Instead of treating the machine translation
engines as a “black box”, we look further into their internal functioning. As the
translation process consists of several “decoding” steps, such an observation
should lead to useful conclusions regarding the quality of its result. The simplest
indication is the inability of a system to find the correct translation of a word.
More advanced indications depend on the respective translation method. For ex-
ample, a statistical system’s scoring mechanism may signify too many ambiguities
regarding phrasal/lexical selection, whereas rule-based systems may flag their in-
ability to fully analyse part of the input text.
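The simplest indication mentioned above, a word the engine could not translate, can be approximated even without access to engine internals. The sketch below flags tokens of the MT output that also occur verbatim in the source sentence. It is a surface heuristic under stated assumptions: the function names, the minimum token length and the German source fragment are hypothetical, and real engines may expose richer signals (decoder scores, parse failures) that are not modelled here.

```python
import re


def _tokens(text):
    """Lower-cased word tokens; punctuation is discarded."""
    return re.findall(r"\w+", text.lower())


def untranslated_tokens(source, translation, min_length=4):
    """Tokens of the MT output that also occur verbatim in the source.
    Short tokens are skipped because they are often legitimately shared
    between languages."""
    source_tokens = set(_tokens(source))
    return [token for token in _tokens(translation)
            if token in source_tokens and len(token) >= min_length]


def behaviour_features(source, translation):
    """A tiny feature dictionary that a learned selection model could use."""
    source_len = len(_tokens(source))
    output_len = len(_tokens(translation))
    return {
        "untranslated": len(untranslated_tokens(source, translation)),
        "length_ratio": output_len / max(source_len, 1),
    }


# Hypothetical source fragment; the output fragment mirrors the Rank 4 example.
print(untranslated_tokens("in den drei Warenhäusern", "in the three Warenhäusern"))
print(behaviour_features("in den drei Warenhäusern", "in the three Warenhäusern"))
```

Proper names, numbers and cognates that are correctly copied from the source would also be flagged, so a practical feature extractor would need an exception list or access to the engine's own unknown-word report.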
5 Observations From System Operation and Further
TaraXŰ, aiming to offer a full solution for translation workflows as outlined in
Section 2, is subject to ongoing research and development. This is accomplished
within an iterative process, where the operation of the first modules produces re-
sults that are used as a driving force for the development of extensions for the sys-
tem. As we have now outlined the current structure of the system, we will go on to
discuss the results and observations from its operation in our controlled develop-
5.1 Observing Translator Preferences and MT Behaviour
As the goals of the system are human-centric, the iterative process of evaluation
and further development includes the involvement of actual human users, namely
professional translators. In this section we provide a prototype instance of the
aforementioned selection mechanism (i.e. with no actual automatic functionality)
and observe the translator preferences on the MT outputs offered. These prefer-
ences are analysed to show the relative performance of each MT system, in par-
ticular errors and hints as to the degree to which they complement each other.
Experiment Structure. Two tasks were performed by the professional translators,
mirroring the two modes of our selection mechanism:
1. In the first task, annotators ranked the output of the four systems, according to
how well these preserve the meaning of the source sentence. In a subsequent
step, they classified the two main types of errors (if any) of the best translation.
We used a subset of the error types suggested by Vilar et al. (2006).
2. In the second task, the translators selected the translation that is easiest to post-
edit and then edited it. They were asked to perform only the minimal post-
editing necessary to achieve acceptable translation quality. No target language
requirements were set.
Technical Details. In order to accomplish the tasks, we employed external Lan-
guage Service Providers that offer professional translation services by humans, hop-
ing to get as close to the target user group as possible. Annotation guidelines
describing the evaluation tasks as well as examples on the correct usage of the error
classification scheme and minimal post-editing were provided to the translators. The
evaluation interface was set up taking advantage of the Appraise evaluation tool
(Federmann 2010), which through its web-based platform allows for remote work
and interoperability. A sample screenshot of the interface can be seen in Fig. 4.
Fig. 4 The Appraise interface used within taraXŰ
Data. The experiment was performed based on the language directions German-
English, English-German and Spanish-German. The corpus size was about 50,000
words in total or 2,000 sentences per language pair.
The test data was domain-oriented. The “news” part of the test set consisted of
1,030 test sentences from two WMT shared tasks (Callison-Burch et al., 2008 and
Callison-Burch et al., 2010), subsampled proportionally to each of the documents
contained. The “technical documentation” part contained 400 sentences extracted
from the community-based translations of the OpenOffice2 documentation
(Tiedemann, 2009), post-processed manually in order to remove wrong alignments
and translation mismatches.
The translation outputs were provided by four distinct state-of-the-art MT/TM
implementations: the statistical MT systems Moses (Koehn, 2005), trained using
Europarl and News Corpus (Callison-Burch et al., 2010), and Google Translate,
the rule-based system Lucy (Alonso & Thurmair, 2003) and the translation memory
system Trados (Carroll, 2000). The latter was included in order to investigate
a baseline scenario where no MT is involved. It was filled with the same bilingual
material that was used for training Moses. In order to enforce high fuzzy matches
for part of the evaluation data, we took some sentences from the bilingual material
and modified them minimally as regards grammar, spelling and negation – mirror-
ing a real-life translation scenario with existing language resources. There were no
training data constraints for Google Translate and Lucy. None of the MT engines
were optimized before the evaluation.
Observations. The main observations and conclusions from the ranking and error
classification task are described below.
i) User preference. An overview of the users’ preference for each of the systems
is given in Table 1, where system ranks are averaged for each domain and
language pair (bold type indicates the best system). The upper part of the table
contains the overall ranking among the four listed systems for each language
direction. The bottom part of the table contains domain-specific results from the
news and the technical domain respectively.
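The averaging of ranks described above can be sketched as follows. The input format (one dictionary of system-to-rank per judged sentence) and the function name are assumptions for illustration; the data layout actually used in the evaluation is not specified here.

```python
from collections import defaultdict


def average_ranks(rankings):
    """Average the rank each system received over all judged sentences.

    `rankings` is a list of dicts mapping a system name to the rank
    (1 = best) it received for one sentence.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for ranking in rankings:
        for system, rank in ranking.items():
            totals[system] += rank
            counts[system] += 1
    return {system: totals[system] / counts[system] for system in totals}
```

A lower average indicates a system that the translators preferred more often; grouping the input by domain or language direction before calling the function would give the domain-specific figures.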
When possible, we measured inter-annotator agreement in terms of Scott’s pi
(Scott, 1955). Pi values were 0.607 for German-English (2 annotators) and 0.356
for German-English (3 annotators), which can be respectively interpreted as sub-
stantial and fair (following Landis and Koch, 1977). This comparably low agree-
ment seems to stem partly from the deliberately simple design of the interface: the
sentences were presented without context and we did not offer a special treatment
for ties, i.e. the evaluators had to enforce a ranking even if two or more sentences
were equally good/bad. A second reason for disagreement may have been missing
“project specifications” such as target language requirements concerning termi-
nology, style, etc.
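For reference, Scott's pi for two annotators can be computed as in the following sketch. It assumes that each annotator assigns one categorical label (for example, a rank) to every item; the generalisation to three annotators is not covered, and the function name is hypothetical.

```python
from collections import Counter


def scotts_pi(labels_a, labels_b):
    """Scott's pi for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement: squared pooled label proportions over both annotators.
    pooled = Counter(labels_a) + Counter(labels_b)
    expected = sum((count / (2 * n)) ** 2 for count in pooled.values())
    if expected == 1.0:
        return 1.0  # only one category was ever used, so agreement is trivial
    return (observed - expected) / (1 - expected)
```

Unlike Cohen's kappa, the expected agreement is derived from the pooled label distribution of both annotators, which is what distinguishes Scott's pi.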
Trados is listed here only for orientation purposes as it does not produce trans-
lations, but is more of a storage system for existing bilingual material. Its transla-
tion results provide translations of already known content whose source was
similar to the original test sentences.
One observation is that the MT system ranks are comparably close to each
other. The narrow margin between the ranks indicates that no system is better than
another in all cases, which supports the hypothesis that their performance is com-
plementary and that their combination can thus be beneficial.
ii) Error classification. As described above, the evaluators were asked to choose the
two most important errors of the best translation. The results of this classification are
presented in Table 2. The table reads as follows: 3.2% of errors made by system Lucy
were missing content words, 34.6% wrong content words, etc. It can be seen that the
most common errors in all systems are wrong content words. The next most frequent
error type is incorrect word order, followed by wrong functional words and incorrect
word form. This indicates the need for improvement of reordering and lexical choice
techniques for all translation systems.

Table 1 Human ranking results, as the average position of each system in each task
Lucy Moses Google Trados
Overall 2.00 2.38

Table 2 Human error classification: error distribution for each translation system (error
classes are based on Vilar et al., 2006)
                         Lucy     Moses    Trados
Missing content words    3.2%     16.8%    12.6%
Wrong content words      34.6%    24.6%    33.2%
Wrong functional words   18.6%    11.8%    11.0%
Incorrect word form      13.1%    14.6%    9.1%
Incorrect word order     16.1%    22.0%    13.4%
Incorrect punctuation    3.7%     3.4%     2.1%
Other error              10.7%    6.8%     18.6%
Table 3 Five types of automatically classified edits for three translation systems as a
distribution over all translated words per system. Percentages averaged over the total
number of errors of each system
          Word form   Word order   Missing word   Extra word   Lexical choice
Lucy      4.3%        7.0%         4.4%           6.2%         23.7%
Moses     4.9%        9.0%         7.5%           4.9%         21.8%
Trados    2.6%        4.9%         8.1%           6.5%         47.7%
iii) Post-editing. A similar analysis has been carried out on the outputs that were
chosen to be post-edited, which are not necessarily the best ranked ones. In fact,
only 33% of post-edited outputs were ranked as the best. More experiments are,
however, needed to check whether the post-editors’ intuitive choices really mirror
It should be noted that the Google Translate system was not considered as an
option for editing. We took this decision because we have no way of influencing
the system and wanted to avoid futile efforts.
The post-edited output was compared with the original MT output, in order to
conclude which types of editing are most frequent for each system. The following
five types of edits (Popović and Burchardt, 2011) are taken into account: correct-
ing word form (morphology), correcting word order, adding missing word, delet-
ing extra word and correcting lexical choice.
Table 3 presents numbers for each of the five correction types for the three sys-
tems. It reads as follows: In 4.3% of the words translated by Lucy, the human
translator corrected the word form, for 7% of the words, the order was corrected,
etc. The most frequent correction for all systems is lexical choice; however, for the
Trados system the number of lexical corrections is significantly higher than for the
other systems. This is explained by the fact that the choice of the target language
sentence proposed for editing is not based on a semantic level but on the level of
string similarity of the source sentences. The next most frequent type of correction
is the word order. The Moses system shows slightly more incorrectly positioned
words (order errors) than other systems. These results confirm the human error
classification: the main weakness for all systems is incorrect lexical choice,
i.e. wrong content words, and incorrect word order. These two aspects should be
taken into account for further improvement of the systems.
5.2 Selection Mechanism
In order to show the feasibility of our approach, we include here the results of
the first experimental implementation of the Selection Mechanism (Avramidis,
2011). It was trained using 1,000 Spanish sentences, each of them accompanied
by five system outputs in English. The goal of the training was that, for each input
sentence, the mechanism selects the system output which is the closest to the known
translation, using Levenshtein distance (Levenshtein, 1966). The model was trained
with a Support Vector Machine algorithm (Tsochantaridis et al., 2004), which was given
system-immanent information by the five systems such as the use of segment
combination in the case of RBMT (see Avramidis, 2011 for details).
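The training target described above, i.e. the system output closest to the known translation, can be illustrated with the following sketch. It assumes a word-level Levenshtein distance over whitespace tokens; the tokenisation and the function names are assumptions made for illustration, and the learning step with the Support Vector Machine is not shown.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def closest_output(outputs, reference):
    """Name of the system whose output has the smallest word-level distance."""
    ref = reference.split()
    return min(outputs, key=lambda name: levenshtein(outputs[name].split(), ref))
```

In a setting like the one described, `closest_output` would supply the label for each training sentence, while the features given to the classifier would come from the systems themselves.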
The resulting mechanism was tested on a test set of 1,000 sentences, translated by
the same systems, also including the system-immanent information from the transla-
tion process. We collected only the best chosen output for each sentence and evalu-
ated the quality of the output using the automatic evaluation metric of BLEU.
Our model reported having successfully compared the quality of the sentences
in 63% of the cases. Although this is relatively low, the overall translation quality
is encouraging: as can be seen in Table 4, we achieved a BLEU score that is at
least comparable with the best systems in the task. We consider this to be an en-
couraging result, indicating some feasibility for future efforts on the selection
mechanism. A more suitable metric for the purpose of post-editing is Word Error
Rate (WER), which averages the edit distance between the suggested MT output and
the goal translation; our mechanism decreased it significantly (Table 4). In addition,
our selection mechanism was ranked better than the state-of-the-art system
combination approaches in ML4HMT-2011 (Federmann et al., 2012; Table 5).
Finally, we want to report on speed improvement as one indicator for improved
efficiency of the translation process. Table 6 contrasts average processing time for
translation done from scratch versus post-editing of MT output on a collection of
documents including 1806 sentences sampled from the most recent evaluation
campaign in taraXŰ. These results indicate that post-editing the machine-
translation output is significantly faster than translating the same sentences from
scratch. Our comparative usage statistics show an average speed-up of about 16
seconds per sentence, which sums up to about 8 working hours for the entire set.
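The figures quoted above follow directly from the “all” row of Table 6 and the size of the sample, as the following short calculation shows (the variable names are chosen for illustration only).

```python
SENTENCES = 1806          # size of the evaluated sample
FROM_SCRATCH_SEC = 63.32  # average time per sentence, translation from scratch
POST_EDIT_SEC = 46.93     # average time per sentence, post-editing MT output

saving_per_sentence = FROM_SCRATCH_SEC - POST_EDIT_SEC  # about 16.4 seconds
total_hours = saving_per_sentence * SENTENCES / 3600    # about 8.2 hours
relative_saving = saving_per_sentence / FROM_SCRATCH_SEC  # about 26%

print(f"{saving_per_sentence:.2f} s per sentence, "
      f"{total_hours:.1f} h in total, {relative_saving:.0%} faster")
```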
Table 4 The translation performance of a sample implementation of the selection mecha-
nism for Spanish-English using only system-immanent features trained using Levenshtein
System                BLEU    WER
Hierarchical SMT      19.68   62.37
Lucy RBMT             23.37   64.78
Metis SMT             12.62   77.62
Apertium RBMT         22.30   64.91
Moses-based SMT       23.14   60.66
Selection Mechanism   23.54   46.13
Table 5 Results of human ranking when comparing our Selection Mechanism with other
system combination methods
System                Overall rank
Syscomb DCU           2.52
Lucy RBMT enhanced    2.05
Syscomb LIUM          2.87
Selection Mechanism   2.50
Table 6 Average processing time for translation done from scratch versus post-editing of
machine translation output
Sentence length   Average time (sec)
(words)           from scratch   post-editing
0 – 19            45.94          33.71
20 – 39           96.80          74.59
40 – 59           155.84         99.09
60 – 79           73.83          18.00
all               63.32          46.93
6 Further Work
The experiments described were performed with “off-the-shelf” core systems,
which means using general corpora to train the SMT engine and general vocabu-
lary for RBMT. Since then, different methods of optimisation have been applied to
the engines to achieve better overall translation results. Future experiments will
allow us to assess potential quality improvements with respect to the characteristics of
the different test sets. We are convinced that source language characteristics have
a major effect on translation quality and optimisation potential.
This gradual approach allows for the comparison of MT quality with respect to
the three intertwining levels: domain adaptation, source language characteristics
and more detailed error analysis on a word basis. Future experiments will also in-
clude more language directions.
Optimisation will be handled pragmatically, as would be feasible in a profes-
sional translation workflow. The focus will be on domain adaptation, which means
re-training the SMT engines with domain-specific material and adding domain-
specific terminology to the RBMT dictionaries. Moreover, it is planned to give the
translators more guidance regarding the target language requirements, e.g., regard-
ing style and expected translation quality.
More features that we expect to have a definite impact on quality assessment
will be used for the selection mechanism. They will be generated primarily from
TMS functionality, like the matching algorithms and quality assurance functional-
ity mentioned above. We expect that a deviation from terminological requirements
may have more weight on the semantic level than on post-editing effort, but that
grammatical misconstructions will most likely have a greater impact on post-
editing effort than on semantic closeness to the original meaning.
Lastly, the threshold for post-editing has not yet been defined in the context of
our larger experimental study. This threshold strongly depends on the individual
memory, but we hope to find some guidance for automatic assessment in the
course of the next evaluation studies.
In this chapter, we have shed light on the economic and scientific requirements for
integrating machine translation (MT) into professional human translation work-
flows. We have described the taraXŰ system prototype that is being developed in
an ongoing joint project between language service providers and MT research
By pairing a hybrid MT system with a translation memory (TM) architecture,
we believe that we have brought together the best of both worlds – the state of the
art in MT and professional translation workflows. So far the evaluation results are
in line with our presumption that different systems have complementary strengths
and weaknesses. Unsurprisingly, all systems leave room for improvements, which
will be examined in the next phase of the project.
As regards the assessment of translation quality, the divergences in ranking and
post-editing results between several annotators indicate that more flexible auto-
matic measures are needed to assess quality realistically. This subjective factor in
translation also runs counter to the use of reference translations for automatically
assessing translation quality.
What is definitely needed for effective post-editing is support with respect to
the type of divergence from target language requirements. This is particularly the
case when the MT is correct in terms of spelling and grammar, but needs to be
adapted regarding lexical choice or style.
Acknowledgments. The work presented here is a joint work. The authors would like to
thank Patrick Bessler, Christian Federmann, Horst Liebscher, Maja Popović, Marcus Watts,
and David Vilar for their contributions. Many thanks also go to the anonymous reviewers
for helping clarify certain points. The taraXŰ project is financed by TSB
Technologiestiftung Berlin – Zukunftsfonds Berlin, co-financed by the European Union – European
Regional Development Fund.
Alonso, J., Thurmair, G., Deutschland, C.: The Comprendium Translator System. In: Pro-
ceedings of the Ninth Machine Translation Summit, New Orleans (2003)
Avramidis, E.: DFKI System Combination with Sentence Ranking at ML4HMT-2011. In:
Proceedings of the International Workshop on Using Linguistic Information for Hybrid
Machine Translation and of the Shared Task on Applying Machine Learning Techniques
to Optimising the Division of Labour in Hybrid Machine Translation, San Francisco
Burchardt, A., Egg, M., Eichler, K., Krenn, B., Kreutel, J., Leßmöllmann, A., Rehm, G.,
Stede, M., Uszkoreit, H.: The German Language in the Digital Age. Springer (2012)
Callison-Burch, C., Fordyce, C., Koehn, P., et al.: Further Meta-Evaluation of Machine
Translation. In: Proceedings of the Third Workshop on Statistical Machine Translation,
pp. 70–106. Association for Computational Linguistics, Columbus (2008)
Callison-Burch, C., Koehn, P., Monz, C., et al.: Findings of the 2010 Joint Workshop on
Statistical Machine Translation and Metrics for Machine Translation. Proceedings of the
Joint Fifth Workshop on Statistical Machine Translation and Metrics, pp. 17–53. Asso-
ciation for Computational Linguistics, Uppsala (2010)
Callison-Burch, C., Osborne, M., Koehn, P.: Re-evaluating the Role of Bleu in Machine
Translation Research. In: Proceedings of the 11th Conference of the European Chapter
of the Association for Computational Linguistics, Trento, pp. 249–256 (2006)
Carroll, S.: Introducing the TRADOS workflow development. Translating and the Comput-
er 22. Aslib Proceedings, London (2000)
Casacuberta, F., Civera, J., Cubel, E., et al.: Human Interaction for High Quality Machine
Translation. Communications of the ACM 52, 135–138 (2007)
Chen, Y., Jellinghaus, M., Eisele, A., et al.: Combining Multi-Engine Translations with
Moses. In: Proceedings of the Fourth Workshop on Statistical Machine Translation, pp.
42–46. Association for Computational Linguistics, Athens (2009)
Federmann, C.: Appraise: An Open-Source Toolkit for Manual Phrase-Based Evaluation of
Translations. In: Proceedings of the Seventh International Conference on Language Re-
sources and Evaluation, Valletta (2010)
Federmann, C., Avramidis, E., Costa-Jussa, M.R., et al.: The ML4HMT Workshop on
Optimising the Division of Labour in Hybrid Machine Translation. In: Proceedings of the
Eighth International Conference on Language Resources and Evaluation, Istanbul
Federmann, C., Theison, S., Eisele, A., et al.: Translation Combination using Factored
Word Substitution. In: Proceedings of the Fourth Workshop on Statistical Machine
Translation, pp. 70–74. Association for Computational Linguistics, Athens (2009)
Koehn, P.: Europarl: A Parallel Corpus for Statistical Machine Translation. In: Proceedings
of the Tenth Machine Translation Summit, Phuket (2005)
Koehn, P.: A Process Study of Computer-aided Translation. Machine Translation 23,
Koehn, P., Hoang, H., Birch, A., et al.: Moses: Open Source Toolkit for Statistical Machine
Translation. In: Proceedings of the Forty-Fifth Annual Meeting of the Association for
Computational Linguistics, pp. 177–180. Association for Computational Linguistics,
Landis, J.R., Koch, G.G.: The Measurement of Observer Agreement for Categorical Data.
Biometrics 33, 159–174 (1977)
Levenshtein, V.: Binary Codes Capable of Correcting Deletions and Insertions and Rever-
sals. Soviet Physics Doklady 10, 707–710 (1966)
Li, Z., Callison-Burch, C., Dyer, C., et al.: Joshua 2.0: A Toolkit for Parsing-Based Ma-
chine Translation with Syntax, Semirings, Discriminative Training and Other Goodies.
In: Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and Me-
trics, pp. 133–137. Association for Computational Linguistics, Uppsala (2010)
Papineni, K., Roukos, S., Ward, T., Zhu, W.-J.: BLEU: A Method for Automatic
Evaluation of Machine Translation. In: Proceedings of the Fortieth Annual Meeting of the
Association for Computational Linguistics, pp. 311–318. Association for Computational
Linguistics, Pennsylvania (2002)
Popović, M., Burchardt, A.: From Human to Automatic Error Classification for Machine
Translation Output. In: Proceedings of the Fifteenth International Conference of the Eu-
ropean Association for Machine Translation, Leuven (2011)
Schwall, U., Thurmair, G.: From METAL to T1: Systems and Components for Machine
Translation Applications. In: Proceedings of the Sixth Machine Translation Summit,
pp. 180–190 (1997)
Scott, W.A.: Reliability of Content Analysis: The Case of Nominal Scale Coding. Public
Opinion Quarterly 19, 321–325 (1955), doi:10.1086/266577
Specia, L.: Exploiting Objective Annotations for Measuring Translation Post-editing Effort.
In: Proceedings of the Fifteenth International Conference of the European Association
for Machine Translation, Leuven (2011)
Tiedemann, J.: News from OPUS—A Collection of Multilingual Parallel Corpora with
Tools and Interfaces. Recent Advances in Natural Language Processing 5, 237–248
Tsochantaridis, I., Hofmann, T., Joachims, T., Altun, Y.: Support Vector Machine Learning
for Interdependent and Structured Output Spaces. In: Proceedings of the Twenty-First
International Conference on Machine Learning, Banff, Alberta (2004)
Vilar, D., Xu, J., D’Haro, L.F., Ney, H.: Error Analysis of Machine Translation Output. In:
Proceedings of the Fifth International Conference on Language Resources and Evalua-
tion, Genoa, pp. 697–702 (2006)