Machine Translation Infrastructure
and Post-editing Performance at Autodesk
Ventsislav Zhechev
Autodesk Development Sàrl
Rue de Puits-Godet 6
2000 Neuchâtel, Switzerland
ventsislav.zhechev@autodesk.com
Abstract
In this paper, we present the Moses-based
infrastructure we developed and use as a pro-
ductivity tool for the localisation of software
documentation and user interface (UI) strings
at Autodesk into twelve languages. We de-
scribe the adjustments we have made to the
machine translation (MT) training workflow
to suit our needs and environment, our server
environment and the MT Info Service that
handles all translation requests and allows the
integration of MT in our various localisation
systems. We also present the results of our
latest post-editing productivity test, where we
measured the productivity gain for translators
post-editing MT output versus translating
from scratch. Our analysis of the data indi-
cates the presence of a strong correlation be-
tween the amount of editing applied to the raw
MT output by the translators and their produc-
tivity gain. In addition, within the last calen-
dar year our system has processed over thir-
teen million tokens of documentation content
of which we have a record of the performed
post-editing. This has allowed us to evaluate
the performance of our MT engines for the
different languages across our product portfo-
lio, as well as spotlight potential issues with
MT in the localisation process.
1 Introduction
Autodesk is a company with a very broad range of
software products that are distributed worldwide.
The high-quality localisation of these products is a
major part of our commitment to a great user expe-
rience for all our clients. The translation of soft-
ware documentation and UI strings plays a central
role in our localisation process and we need to
provide a fast turnaround of very large volumes of
data. To accomplish this, we use an array of tools
from document- and localisation-management
systems to machine translation.
In this paper, we focus on the effect of the inte-
gration of MT in our localisation workflows. We
start in Section 2 with an in-depth look at our MT
infrastructure. Section 3 focuses on the productiv-
ity test we organised to evaluate the potential
benefit of our MT engines to translators. In Section
4, we turn to the analysis of our production post-
editing data from the last twelve calendar months.
Finally, we conclude in Section 5.
2 MT Infrastructure at Autodesk
In this section, we present the MT infrastructure
that we have built to support the localisation effort
at Autodesk. We actively employ MT as a produc-
tivity tool and we are constantly improving our
toolkit to widen our language coverage and
achieve better perceived quality. At the core of this
toolkit are the tools developed and distributed with
the open-source Moses project (Koehn et al.,
2007). Currently, we use MT for translating from
US English into twelve languages: Czech, German,
Spanish, French, Italian, Japanese, Korean, Polish,
Brazilian Portuguese, Russian, Simplified and Tra-
ditional Chinese (hereafter, we will use standard
short language codes). We are introducing MT for
translating into Hungarian as a pilot this year.
2.1 MT Training Workflow
We start with the training of our MT engines.
Training Data
Of course, no training is possible unless sufficient
amount of high-quality parallel data is available. In
our case, we create the parallel corpora for training
by aggregating data from three internal sources.
The smallest source by far consists of translation
memories (TMs) used for the localisation of mar-
keting materials. The next source is our reposito-
ries for translated UI strings. This data contains
many short sentences and partial phrases, as well
as some strings that contain UI variables and/or
UI-specific formatting. The biggest source of paral-
lel data is our main TMs used for the localisation
of the software documentation for all our products.
To ensure broader lexical coverage, as well as to
reduce the administrative load, we do not divide the
parallel data based on product or domain. Instead,
we lump all available data for each language together
and use them as one single corpus per language.
The sizes of the corpora are shown on Chart 1.
Chart 1: Training Corpora Sizes in Millions of Segments (per language pair, broken down into Documentation, Software UI and Marketing data)
You may notice that we have the least amount
of data for PT-BR and HU, while our biggest cor-
pus by far is for JA. You can refer to this chart
when we discuss the evaluation of MT perform-
ance: it turns out that language difficulty is a
stronger factor there than training data volume.
Data Preprocessing
After we have gathered all available data from the
different sources, we are ready to train our MT sys-
tems. For this, we have created a dedicated script
that handles the complete training workflow. In
effect, we simply need to point the script to the
corpus for a particular language and, after a cer-
tain amount of time, we get a ready-to-deploy MT
system. The first step in this training workflow is the
preprocessing of the data, which we turn to now.
For the majority of the languages that we sup-
port, the preprocessing step consists simply of to-
kenisation, masking of problematic characters and
lowercasing. Some languages require additional
preprocessing and we will discuss the details later
in this section.
To perform the tokenisation, we have developed
a custom Perl tool that consists mostly of a cascade
of highly specialised regular expressions. We opted
for this tailored approach as our data contains a
large number of file paths and URLs, as well as
specific formatting conventions and non-content
placeholders that could be broken by a non-
specialised tool. We also built abbreviation lists
based on abbreviations observed in our data.
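The tool itself is internal and written in Perl; as a rough illustration of the protect-then-split idea behind such a cascade, a simplified stand-in (sketched here in Python, with made-up patterns) could look as follows:

import re

# Simplified stand-in for the cascade tokeniser: protect non-content spans
# (URLs, Windows paths, placeholders) behind masks, split punctuation off
# ordinary words, then restore the protected spans unchanged.
PROTECT = [
    re.compile(r'https?://\S+'),        # URLs
    re.compile(r'[A-Za-z]:\\[^\s"]+'),  # Windows-style file paths
    re.compile(r'\{\d+\}'),             # WorldServer-style placeholders
]

def tokenise(segment):
    masks = {}
    def protect(match):
        key = '__MASK%d__' % len(masks)
        masks[key] = match.group(0)
        return ' %s ' % key
    for pattern in PROTECT:
        segment = pattern.sub(protect, segment)
    # Split punctuation off ordinary words; the masked spans stay untouched.
    segment = re.sub(r'([.,;:!?()\[\]"])', r' \1 ', segment)
    return [masks.get(token, token) for token in segment.split()]

print(tokenise('Open C:\\Autodesk\\acad.exe and see https://www.autodesk.com for details.'))
# ['Open', 'C:\\Autodesk\\acad.exe', 'and', 'see', 'https://www.autodesk.com',
#  'for', 'details', '.']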
Another preprocessing step is lowercasing,
which is a standard procedure used to improve
lexical coverage and reduce lexical ambiguity.
The preprocessing scripts are used both to pre-
pare the corpora for the MT engine training and to
handle any data that has been received for transla-
tion at run time.
Data Post-processing
Although this is not a part of the training workflow,
we will have a quick look at the post-processing
tools we use, as they are closely related to the pre-
processing tools we just discussed.
Post-processing takes place after a sentence has
been translated and we have obtained the transla-
tion from the MT engine. As we tokenise and low-
ercase the data before training, we need to restore
the proper case and detokenise the output of the
MT engine to make it usable to humans.
For the former task, we use a statistical recaser.
This is realised as an additional monolingual MT
engine per language which is trained to translate
lowercased input into proper-case output. Of
course, this adds an additional element of uncer-
tainty and opportunity to produce errors, but with
the amount of data that we have available the per-
formance is subjectively reasonable. On the other
hand, it is much simpler to maintain statistical
recasers (they are trained each time we train the
regular MT engines) rather than rule-based re-
caser tools. The latter might require constant adap-
tation as new data is added to our TMs.
In an effort to recover from some potential er-
rors the statistical recaser might introduce, we have
added two specific rules. The first makes sure that
the sentence-initial capitalisation of the MT output
matches that of the English input. The second rule
handles the capitalisation of unknown tokens.
These tokens will most likely be new variable
names or new URLs that the MT engine does not
recognise. The recaser is not able to restore the
proper case, which leads to hard-to-detect errors
and frustration for the translators. Thus, we make
sure that the casing of unknown tokens in the final
MT output matches the provided input.
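A minimal sketch of these two rules (how the unknown tokens are identified, for instance from the decoder's unknown-word reporting, is left as an assumption):

def fix_casing(source_tokens, mt_tokens, unknown_tokens):
    # unknown_tokens is assumed to be a set of lowercased source tokens
    # the engine did not know.
    # Rule 1: sentence-initial capitalisation mirrors the EN source.
    fixed = list(mt_tokens)
    if fixed and source_tokens:
        if source_tokens[0][:1].isupper():
            fixed[0] = fixed[0][:1].upper() + fixed[0][1:]
        else:
            fixed[0] = fixed[0][:1].lower() + fixed[0][1:]
    # Rule 2: unknown tokens (new variable names, URLs, ...) keep the casing
    # they had in the source, since the recaser cannot restore it.
    source_casing = {token.lower(): token for token in source_tokens}
    for i, token in enumerate(fixed):
        if token.lower() in unknown_tokens:
            fixed[i] = source_casing.get(token.lower(), token)
    return fixed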
The detokenisation is a much simpler task and
is realised as a cascade of regular expressions.
Language-specific Processing
Due to their specific makeup, some languages re-
quire extra preprocessing before we are able to
handle them with MT. In our case, these languages
are JA, KO, ZH-HANS and ZH-HANT.
Firstly, JA, ZH-HANS and ZH-HANT do not
use spaces in written text, which makes it impossi-
ble to directly use a phrase-based MT system like
Moses. We need to segment the data into word-like
tokens that will then be used to align against Eng-
lish words. From the available tools on the market,
we chose the open-source tool KyTea (Neubig et
al., 2011), because it allows us to handle all three
languages in question with the same process.
As expected, after translation we need to re-
verse these preprocessing actions to produce the
final MT output. The de-segmentation for ZH-
HANS and ZH-HANT is straightforward. We need
to take extra care when desegmenting JA, however,
as there are cases where the spaces need to remain
in place, mostly within transliterated multipart
English terms.
A harder issue to resolve with JA arises from
the significant difference in syntactic structure be-
tween EN and JA, namely, EN is a Subject-Verb-
Object language, while JA is a Subject-Object-
Verb language. Hence, the linear distance between
the verb in the EN source and its translation in the
JA target may be very large, making it difficult
for a phrase-based system like Moses to handle.
Our solution to the problem is to reorder the EN
source to make it more Japanese-like, thus reduc-
ing the linear distance between corresponding to-
kens in the EN and JA sentences. First, the EN
source is assigned its phrase-based syntactic struc-
ture using the OpenNLP parser (opennlp.apache.org).
Then, we use a rule-based tool developed in-house
to move the syntactic heads of the EN sentence to
positions corresponding to JA syntax. Our tests
have shown this reordering to significantly increase
the translators’ post-editing productivity, compared
to translating from scratch. In fact, using a plain
(non-reordered) JA engine does not lead to a mean-
ingful productivity increase, even though we have
by far the largest amount of parallel data for the
pair EN→JA compared to our other corpora.
Improvements to the Moses Training Toolkit
As stated above, we use the de facto standard
Moses toolkit for training and decoding. However,
early in the process of integrating MT in our local-
isation workflow, we ran into resource issues dur-
ing the MT training. The main problem for us was
that we could not reliably predict the amount of free
disk space that might be required during training,
which led to many interrupted trainings due to our
servers running out of disk space. Also, the training
process appeared to perform an excessive amount
of disk input-output (I/O) operations, which led to
significant slowdowns exacerbated by the particu-
lar server architecture we use at our company.
These issues led us to embark on an initiative
to improve the Moses training toolkit to reduce the
number of I/O operations and the peak disk usage. As
a starting point we took a Moses release from mid-
2010, as we considered it the most stable at the time.
The improvements we introduced were focused
mostly on avoiding the generation of temporary
files during the training process unless absolutely
necessary. Where two tools could not directly talk
to one another, we used UNIX-style named pipes
to handle the data flow, which significantly reduced
peak disk usage.
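The following sketch illustrates the named-pipe idea in isolation; the two commands are hypothetical placeholders for actual training steps, not the real Moses scripts:

import os
import subprocess
import tempfile

# Connect two training steps through a named pipe so that the intermediate
# file (e.g. the multi-gigabyte phrase extract) never has to hit the disk.
workdir = tempfile.mkdtemp()
pipe_path = os.path.join(workdir, 'extract.pipe')
os.mkfifo(pipe_path)

# Both processes are launched before either blocks on the pipe: the writer's
# open() waits until the reader opens the other end, then data streams through.
writer = subprocess.Popen('./extract_phrases.sh > %s' % pipe_path, shell=True)
reader = subprocess.Popen('LC_ALL=C sort %s > %s/extract.sorted' % (pipe_path, workdir),
                          shell=True)
writer.wait()
reader.wait()
os.unlink(pipe_path)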
Finally, we noticed that a number of the training
steps are independent of one another and could be
run in parallel. We exploited this feature by modi-
fying the training script (train-model.perl) to run
the relevant training steps in parallel. The resulting
memory-based data flow during the parallel execu-
tion of training steps is shown in Figure 1.
!"#$%&#'
()$%*!*
+!,!$%#!'
$!-$.!$/,+'
#%01!
0/,%$/*!'
$!-$.!$/,+'
#%01!
0/,%$/*!'
()$%*!'#%01!
*-$#'!"#$%&#
+!,!$%#!'
()$%*!2
#%01!3)%14345!
&-,*-1/.%#!'
()$%*!'
#%01!*
*-$#'
!"#$%&#3/,6
*-$#'
!"#$%&#3-
+!,!$%#!'
()$%*!2
#%01!3)%143!54
*-$#'()$%*!2
#%01!2)%143!54
Figure 1: Data Flow for the Parallel Steps
of the Moses Training Workflow
A comparison of the peak disk usage and I/O
operations during the training of an EN→JA en-
gine with the original and improved workflows is
shown in Table 1.
                            Original Workflow    Improved Workflow
extract file size           7,5 GB               uses pipe
phrase-table.half size      1,7 GB               uses pipe
phrase-table size           2 GB                 uses pipe
reordering-table size       2,5 GB               uses pipe
total disk I/O              196 GB               23 GB
peak disk usage             45 GB                12 GB
disk usage after training   9 GB                 6 GB

Table 1: Disk Usage Statistics for EN→JA MT Training
The modifications to the Moses training toolkit
listed above were provided to the MosesCore FP7
project for merging with the main Moses tree.
2.2 MT Info Service
We now turn to the MT Info Service that is the
centrepiece of our MT infrastructure, handling all
MT requests from within Autodesk. This service
and all its components are entirely Perl-based and
interact both internally and externally over TCP.
The first elements of this infrastructure are the
MT servers that provide the interface to the avail-
able MT engines running in a data centre. At
launch time, the server code initiates the Moses
decoder for the requested language, together with
any necessary pre- and post-processing tools. The
MT servers read data one segment per line and
output translations as soon as they are available,
with the communication occurring over TCP. For
each language that we use in production, we cur-
rently have seven MT engines running simultane-
ously on different servers to provide higher overall
throughput.
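The wire format of our MT servers is internal; assuming the plain one-segment-per-line protocol described above, a minimal client sketch (the host, port and framing are made up for illustration) could be:

import socket

def translate_segments(segments, host='mt-server.example.com', port=9001):
    # Hypothetical client: send one segment per line over TCP and read one
    # translation per line back, in order.
    translations = []
    with socket.create_connection((host, port)) as connection:
        with connection.makefile('rw', encoding='utf-8', newline='\n') as stream:
            for segment in segments:
                stream.write(segment.replace('\n', ' ') + '\n')
                stream.flush()
                translations.append(stream.readline().rstrip('\n'))
    return translations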
The MT Info Service itself acts as a central dis-
patcher and hides the details of the MT servers’
setup, number and location from the clients. It is
the single entry point for any MT-related queries,
be it requests for translation, for information on the
server setup or administrative functions. It has real-
time data on the availability of MT servers for all
supported languages and performs load balancing
for all incoming translation requests to best utilise
the available resources. In real-life production, we
often see up to twenty concurrent requests for
translation that need to be handled by the system,
some of them for translation into the same language.
We have devised a simple and easy-to-use API for
communication with the MT Info Service clients.
During the last twelve months, the MT Info
Service received over 180 000 translation requests
that were split into almost 700 000 jobs for load
balancing. Among these requests were over one
million documentation segments, as well as a large
volume of UI strings.
2.3 Integrating MT
in the Localisation Workflow
Once we have our MT infrastructure in place and
we have trained all MT engines, we need to make
this service available within our localisation work-
flow so that raw data is machine translated and the
output reaches the translators in due course. We
use two main localisation tools: SDL Passolo for
UI localisation and SDL WorldServer for docu-
mentation localisation.
Unfortunately, the current version of Passolo
that we use does not provide good integration with
MT and requires a number of manual steps. First,
the data needs to be exported into ‘Passolo bun-
dles’. These are then processed with in-house Py-
thon scripts that send any data that has not been
matched against previous translations to the MT
info service. The processed bundles are then
passed on to the translators for post-editing. Due to
limitations of Passolo, the MT output is not visibly
marked as such and Passolo has no way to distin-
guish it from human-produced data. We expect this
to be addressed in an upcoming version of the tool.
It is much easier to integrate MT within World-
Server. As this is a Java-based tool, it allows us to
build Java-based plugins that provide additional
functionality. In particular, we have developed an
MT adapter for WorldServer that communicates
directly with the MT Info Service over TCP and
sends all appropriate segments for machine transla-
tion. The MT output is then clearly marked for the
convenience of the translators both in the on-line
workbench provided by WorldServer and in the
files used to transfer data from WorldServer to
standalone desktop CAT tools.
WorldServer, however, does present us with its
own specific issues to handle, with its use of
placeholders (PHs) to mask XML tags. The major-
ity of our software documentation is authored us-
ing DITA-based XML and one goal of World-
Server is to hide the XML tags from the transla-
tors, as they do not represent actual content. The
first issue here is that WorldServer only stores the
PHs in the TMs and not the actual content they
mask. For example, the segment
The <b>new</b> features of AutoCAD <ver/> are:
will be stored as
The {1}new{2} features of AutoCAD {3} are:
Please note that any PH may be either an opening
or closing formatting tag, or a standalone tag with
or without semantic meaning in the structure of the
sentence.
A major issue is that in the TMs the PHs are
stored with IDs numbered by segment, i.e. in each
segment the PHs start from 1; while during transla-
tion, the PHs are numbered continuously for the
whole project, sometimes reaching IDs into the
thousands. This means that any PH with an ID
above about 40 will be treated as an unknown
word, thus adding significant penalty during trans-
lation. We avoid this issue by temporarily renum-
bering PHs during translation, making sure that,
for any segment that the MT engines see, the PHs
start with ID 1. The original IDs are then restored
in the MT output. We found out that, with this
process, our MT engines produce very few errors
in the placement of PHs and we do not expect to
achieve better performance by, say, first removing
the PHs and then using word and/or phrase align-
ment information to reinsert them in the target.
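A rough sketch of this renumbering step, using the {n} notation from the WorldServer example above (the real implementation sits in our pre- and post-processing layer; the French output below is purely illustrative):

import re

PH = re.compile(r'\{(\d+)\}')

def renumber_phs(segment):
    # Map project-wide PH IDs (possibly in the thousands) onto 1, 2, 3, ...
    # before MT and return the reverse mapping needed to restore them.
    mapping = {}
    def replace(match):
        original_id = match.group(1)
        mapping.setdefault(original_id, str(len(mapping) + 1))
        return '{%s}' % mapping[original_id]
    renumbered = PH.sub(replace, segment)
    return renumbered, {new: old for old, new in mapping.items()}

def restore_phs(mt_output, reverse_mapping):
    return PH.sub(lambda m: '{%s}' % reverse_mapping.get(m.group(1), m.group(1)),
                  mt_output)

renumbered, back = renumber_phs('The {1041}new{1042} features of AutoCAD {1043} are:')
# renumbered == 'The {1}new{2} features of AutoCAD {3} are:'
print(restore_phs("Les {1}nouvelles{2} fonctionnalités d'AutoCAD {3} sont :", back))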
Finally, as most PHs mask formatting XML
tags, the whitespace surrounding the PHs is signifi-
cant. It, however, gets lost during tokenisation and
could lead to errors that are hard to identify and fix
for the translators. For this, we added an extra
processing layer during MT that preserves to the
largest extent possible the whitespace surrounding
the PHs in the source, regardless of the output of
the MT engine and detokeniser.
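A simplified stand-in for that layer, copying the immediate whitespace context of each PH from the EN source onto the MT output (again only a sketch, with an invented French example):

import re

PH_WITH_CONTEXT = re.compile(r'(\s?)(\{\d+\})(\s?)')

def copy_ph_whitespace(source, mt_output):
    # Record whether each PH in the EN source has a space before/after it,
    # then force the same spacing around that PH in the MT output.
    spacing = {match.group(2): (match.group(1), match.group(3))
               for match in PH_WITH_CONTEXT.finditer(source)}
    def adjust(match):
        before, after = spacing.get(match.group(2), (match.group(1), match.group(3)))
        return before + match.group(2) + after
    return PH_WITH_CONTEXT.sub(adjust, mt_output)

print(copy_ph_whitespace('Click {1}Save{2} to continue.',
                         'Cliquez sur {1} Enregistrer {2} pour continuer.'))
# 'Cliquez sur {1}Enregistrer{2} pour continuer.'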
So far we have presented in detail the complex MT in-
frastructure at Autodesk. The question that arises is
whether the use of MT for localisation brings any
practical benefit and how we can measure this poten-
tial benefit. We present our answer in the next section.
3 Post-editing Productivity Test
We now turn to the setup of our last productivity
test and analyse the data that we collected. The
main purpose of the productivity test was to meas-
ure the productivity increase (or decrease) when
translators are presented with raw MT output for
post-editing, rather than translating from scratch.
This is already the third productivity test that
Autodesk has performed. The results of the first test in
2009 are discussed in (Plitt and Masselot, 2010).
Each of the tests has had a specific practical goal in
mind. With the first productivity test we simply
needed a clear indicator that would help us decide
whether to use MT in production or not and it only
included DE, ES, FR and IT. The second test fo-
cused on a different set of languages, for which we
planned to introduce MT into production, like RU
and ZH-HANS.
The goal of the productivity test described in
this paper was mainly to confirm our findings from
the previous tests, to help us pick among several
MT options for some languages and to
compare MT performance across products. In the
following discussion we will only concentrate on
the overall outcome of the productivity test and on
our analysis of the post-editing performance versus
automatic edit-distance-based indicators.
3.1 Test Setup
The main challenge for the setup of the productiv-
ity test is the data preparation. It is obviously not
possible for the same translator to first translate a
text from scratch and then post-edit an MT version
without any bias: the second time around the text
will be too familiar and this will skew the produc-
tivity evaluation. Instead, we need to prepare data
sets that are similar enough, but not exactly the
same, while at the same time taking into account
that the translators cannot translate as much text
from scratch as they can post-edit—as our experi-
ence from previous productivity tests has shown.
This is further exacerbated by the fact that we need
to find data that has not been processed yet during
the production cycle and has not yet been included
in the training data for the MT engines.
We put together test sets with data from four
different products, but most translators only man-
aged to process meaningful amounts of data from
two products, as they ran out of time for various
reasons (connectivity issues, picking the wrong data
set, etc.). These included three tutorials for Auto-
CAD users and a user manual for PhysX (a plug-
in for 3ds Max).
Due to resource restrictions, we only tested nine
out of the twelve production languages: DE, ES,
FR, IT, JA, KO, PL, PT-BR and ZH-HANS. For
each language, we engaged four translators (one
each from our usual localisation vendors) for two
business days, i.e. sixteen hours. We let our vendors
select the translators as per their usual process.
The translators used a purpose-built online
post-editing workbench that we developed in-
house. While this workbench lacked a number of
features common in traditional CAT tools (like e.g.
TM and terminology search), it allowed us to cal-
culate the time the translators took to look at and
translate / post-edit each individual segment. For
future productivity tests we plan to move away
from this tool and use a modified version of Pootle
(translate.sourceforge.net) instead, as it is easier
to manage and provides typical CAT functionality.
3.2 Evaluating Productivity
After gathering the raw productivity data, we
automatically removed any outlier segments, for
which the translators took an unreasonably long time
to translate or post-edit. From the remaining data,
we averaged the productivity (measured in words
per eight-hour business day, WPD) for translat-
ing from scratch, taking a specific average for each
translator and product combination. We had to use
these separate baselines, as the variation between
individual translators, as well as between different
products for the same translator, is very large.
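As a sketch of this calculation (the per-segment record layout is hypothetical), the per-translator baselines and the per-segment deltas could be computed as:

SECONDS_PER_DAY = 8 * 60 * 60  # an eight-hour business day, for WPD figures

def words_per_day(words, seconds):
    return words / seconds * SECONDS_PER_DAY

def productivity_deltas(scratch_segments, postedited_segments):
    # Each segment is a dict with 'translator', 'product', 'words' and
    # 'seconds' keys (a made-up layout for this sketch).
    totals = {}
    for segment in scratch_segments:
        key = (segment['translator'], segment['product'])
        words, seconds = totals.get(key, (0, 0))
        totals[key] = (words + segment['words'], seconds + segment['seconds'])
    # One from-scratch baseline per translator and product combination.
    baselines = {key: words_per_day(w, s) for key, (w, s) in totals.items()}

    deltas = []
    for segment in postedited_segments:
        baseline = baselines[(segment['translator'], segment['product'])]
        speed = words_per_day(segment['words'], segment['seconds'])
        deltas.append(speed / baseline - 1)  # e.g. 0.6 means +60% productivity
    return deltas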
Comparing against the corresponding baselines
established in this way, we calculated the apparent produc-
tivity delta for each segment that the translators
post-edited. The calculated average productivity
increase per language is presented in Chart 2.
Chart 2: Average Productivity Increase per Language (FR 92,33%; KO 81,93%; IT 63,60%; PT-BR 63,23%; ES 59,43%; JA 59,07%; ZH-HANS 58,02%; DE 44,94%; PL 37,13%)
A caveat is in order here. We need to point out
that, due to the setup of our online workbench,
we exclude certain translator tasks that are inde-
pendent of the quality of MT from the productivity
calculation. This includes in particular the time that
translators would usually spend looking up termi-
nology and consulting the relevant style guides.
The calculation also does not include any pauses
taken for rest, coffee, etc.
3.3 Analysing the Post-editing Performance
Going deeper, we went on to analyse the post-
edited data using a battery of metrics. The metric
scores were computed on a per-segment basis so
that we could look for a correlation between the
amount of post-editing undertaken by the transla-
tors and their productivity increase.
The metrics we used were the following:
METEOR (Banerjee and Lavie, 2005) treating
punctuation as regular tokens, GTM (Turian et al.,
2003) with exponent set to three, TER (Snover et
al., 2006), PER (Position-independent Error
Rate—Tillmann et al., 1997) calculated as the in-
verse of the token-based F-measure, SCFS
(Character-based Fuzzy Score, taking whitespace
into account), CFS (Character-based Fuzzy Score,
on tokenised data), WFS (Word-based Fuzzy
Score). The Fuzzy Scores are calculated as the in-
verse of the Levenshtein edit distance (Leven-
shtein, 1965) weighted by the token or character
count of the longer segment. They produce similar,
but not equal, results to the Fuzzy Match scores
familiar from the standard CAT tools. All score
calculations took character case into account.
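These Fuzzy Scores can be re-implemented with a standard Levenshtein distance; the sketch below is our own simplification, not the original evaluation code:

def levenshtein(a, b):
    # Standard dynamic-programming edit distance over two sequences
    # (characters for SCFS/CFS, whitespace-separated words for WFS).
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                        # deletion
                               current[j - 1] + 1,                     # insertion
                               previous[j - 1] + (item_a != item_b)))  # substitution
        previous = current
    return previous[-1]

def fuzzy_score(a, b):
    # Edit distance weighted by the length of the longer segment, inverted.
    longer = max(len(a), len(b))
    return 1.0 if longer == 0 else 1.0 - levenshtein(a, b) / longer

def scfs(raw_mt, post_edited):   # character-based, whitespace included
    return fuzzy_score(raw_mt, post_edited)

def wfs(raw_mt, post_edited):    # word-based
    return fuzzy_score(raw_mt.split(), post_edited.split())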
After calculating the scores for all relevant seg-
ments, we obtained an extensive data set that we
used to evaluate the correlation between the listed
metrics and the measured productivity increase.
The correlation calculation was performed for each
language individually, as well as lumping the data
for all languages together. We used Spearman’s 𝜌
(1907) and Kendall’s 𝜏 (1938) as the correlation
measures. The results are shown in Table 2.
                       𝜌 vs. ProdIncrease   𝜏 vs. ProdIncrease
JFS                         0,609                0,439
WFS                         0,603                0,436
GTM, SCFS, CFS         0,577 to 0,583        0,406 to 0,416
METEOR                      0,541                0,386
TER, PER             -0,594 to -0,578      -0,427 to -0,415
Segment length             -0,143               -0,097

Table 2: Automatic Metric Correlation with
Translator Productivity Increase
We see that among the metrics listed above,
WFS exhibits the highest correlation with the
measured productivity increase, while METEOR
shows the least correlation. The results also show
that there is no significant correlation between the
productivity increase and the length of the transla-
tion. This suggests, for example, that a segment-
length-based payment model for MT may not be a
fair option. Also, we do not need to impose strong
guidelines for segment length on the technical writers.
Considering the results, we decided to look for
a possibility to create a joint metric that might ex-
hibit an even higher level of correlation. The best
available combination turned out to be taking the
minimum of SCFS and WFS, which we list in the
table as JFS (Joint Fuzzy Score). This metric repre-
sents the worst-case editing scenario based on the
character and token levels. All other metric combi-
nations we evaluated resulted in lower correlation
than WFS. Chart 3 presents the JFS scores per lan-
guage and the corresponding average productivity
increase and post-editing speed. It also lists the
total number of segments that were post-edited for
each language.
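Building on the fuzzy-score helpers sketched above, JFS and the per-language rank correlations can be computed, for example, with scipy (the segment layout is again hypothetical):

from scipy.stats import spearmanr, kendalltau

def jfs(raw_mt, post_edited):
    # Joint Fuzzy Score: the worst case of the character- and word-level views.
    return min(scfs(raw_mt, post_edited), wfs(raw_mt, post_edited))

def correlation_with_productivity(segments):
    # segments: list of (raw_mt, post_edited, productivity_delta) tuples.
    scores = [jfs(raw_mt, post_edited) for raw_mt, post_edited, _ in segments]
    deltas = [delta for _, _, delta in segments]
    rho, _ = spearmanr(scores, deltas)
    tau, _ = kendalltau(scores, deltas)
    return rho, tau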
In Charts 4–11, we investigate the distribution
of the JFS scores for the different languages tested.
The per-segment data is distributed into categories
based on the percentile rank. Due to their particular
makeup, we separate the segments that received a
score of 0% (worst translations) and those that re-
ceived a score of 100% (perfect translations) from
the rest. For each rank, we show the maximum ob-
served JFS (on the right scale). This gives us the
maximum JFS up to which the observed average
productivity increase is marked by the lower line
on the chart (on the left scale). For all languages,
we can observe a sharp rise in the productivity in-
crease for the perfect translations, while otherwise the
productivity increase grows mostly monotonically.
Additionally, for each percentile rank the left
bar on the graph shows the percentage of the total
number of tokens, while the right bar shows the
percentage of the total number of segments.
We do not include a chart for KO, as it does not
appear to follow the monotonicity trend and, in-
deed, our evaluation of the KO data on its own
showed a 𝜌 coefficient of only 0,361. We suspect
that this is due to one of the KO translators ignor-
ing the MT suggestions and translating everything
from scratch. Because of this peculiarity of the KO
data, we excluded it when calculating the overall
results shown in Table 2. This also suggests that
the productivity increase for KO shown in Chart 2
might not be realistic.
Chart 3: Edit Distance and Productivity Data for All Languages (per-language JFS, average productivity increase, post-editing speed in WPD and the number of post-edited segments, which ranged from 450 to 1778 per language)
Chart 4: JFS to Productivity Correlation FR
Chart 5: JFS to Productivity Correlation IT
Chart 6: JFS to Productivity Correlation PT-BR
Chart 7: JFS to Productivity Correlation ES
Chart 8: JFS to Productivity Correlation JA
Chart 9: JFS to Productivity Correlation ZH-HANS
Chart 10: JFS to Productivity Correlation DE
Chart 11: JFS to Productivity Correlation PL
A common observation for all languages is that
both the worst and the perfect translations are pre-
dominantly short segments, which is as expected.
First, it is much easier to achieve a perfect transla-
tion for a relatively short segment—especially
given that JFS takes whitespace into account and
our detokeniser is not perfect. Second, a complete
rewrite of the MT suggestion usually results from
an out-of-context translation of very short segments.
We also see that the JFS scores for the lan-
guages with the highest productivity increase (see
Chart 2) are predominantly in the higher ranges,
while for DE and PL there is a larger amount of
segments with lower JFS.
In the next section, we try to apply the same
evaluation methods to real-life post-editing data.
4 Evaluating Real-life Data
A new initiative at Autodesk, which will be extended
significantly in the future, provided for the archival
of all documentation segments that are post-edited
in production. Currently, we store the EN source,
the TM or MT target and the final target produced
by the translators, but we do not have available any
statistics on this data. In the future, we will store
the original Fuzzy Match score from our TMs, as
well as other metrics that we still need to decide on.
Of course, we do not have productivity data
attached to the production segments, as our pro-
duction environment does not provide for the ag-
gregation of such data. Nonetheless, this is a
wealth of post-editing data that we can analyse
using the automatic metrics from Section 3.
Chart 12: Proportion of Worst and Perfect MT (per language, CS to ZH-HANT; worst translations stay below 2,5% of tokens, while perfect translations range from roughly 3% to 16,5%)
The first interesting piece of information is the
proportion of worst and perfect MT translations,
based on the performed post-editing. It is taken as
the number of tokens in the worst / perfect transla-
tions versus all tokens for each language. Remem-
ber that only documentation segments that receive
a fuzzy match score below 75% against our TMs are
sent to MT. This statistic is presented in Chart 12.
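Using the jfs helper sketched earlier, the token-weighted proportions behind Chart 12 can be approximated as follows (a sketch over an assumed list of segment pairs, not our production code):

def worst_perfect_proportions(segments):
    # segments: list of (raw_mt, post_edited) pairs for one language.
    worst_tokens = perfect_tokens = total_tokens = 0
    for raw_mt, post_edited in segments:
        tokens = len(raw_mt.split())
        total_tokens += tokens
        score = jfs(raw_mt, post_edited)
        if score == 0.0:        # completely rewritten by the translator
            worst_tokens += tokens
        elif score == 1.0:      # left untouched by the translator
            perfect_tokens += tokens
    return worst_tokens / total_tokens, perfect_tokens / total_tokens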
The most important takeaway from this chart is
that the proportion of worst translations is negligi-
bly low. On the other hand, there are many perfect
translations, despite the disadvantage of Machine
Translating only those source segments that were
not found in the TMs.
As a further analysis step, we can order the MT
engines for the individual languages based on a spe-
cific metric per software product. The language or-
der based on the derived JFS metric is presented in
Chart 13 for the eight products with the largest
translation volume.
Chart 13: Language Order per Product according to JFS
Although this chart does not include data across
all languages for all products, some trends are
clearly visible. Namely, ES, IT and PT-BR often
present the best JFS, while KO, JA and RU per-
form poorly on average. While we could expect
lower quality MT for KO and JA, the data for RU
need an extra explanation. In this case, the poor
performance was due to a Unicode-related bug in
the recaser for RU that was not detected until late
in the production cycle. Had we analysed the data
earlier, we would have caught the bug much sooner.
Another trend is the lower average perform-
ance for App5. As it turned out, this was due to
one single component within that product, for
which the segmentation had failed and many seg-
ments contained new line characters. This could
not be handled by the MT infrastructure and resulted
in MT output that did not match the EN source.
We plan to integrate this type of analysis in a
dedicated monitoring system, where we will auto-
matically point our teams to potential issues with
the localisation process. This will be accomplished
by looking for suspicious patterns in the evolution
of the JFS metric: a larger number of over- or
under-edited segments may often be due to either MT
issues or translator under-performance.
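A hedged sketch of such a check, flagging product and language combinations whose share of unedited or fully rewritten segments drifts away from its historical level (the threshold is an arbitrary illustrative choice, not a tuned value):

def flag_suspicious(historical_share, current_share, threshold=2.0):
    # Both arguments map (product, language) to the share of segments with
    # JFS = 1 (unedited) or JFS = 0 (fully rewritten); combinations whose
    # current share exceeds the historical one by more than `threshold`
    # times are reported for manual inspection.
    flags = []
    for key, share in current_share.items():
        baseline = historical_share.get(key)
        if baseline and share > threshold * baseline:
            flags.append((key, baseline, share))
    return flags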
For example, we are currently investigating the
higher-than-average number of unedited PT-BR
segments, given that PT-BR has our smallest
training corpus across all languages. We suspect
that this could be due to translators leaving the raw
MT output unedited without properly checking its
correctness. This suspicion is also supported by the
presence of a very large number of unedited Fuzzy
matches for PT-BR.
5 Conclusion
In this paper, we described the MT infrastructure at
Autodesk that is used to facilitate the localisation
of software documentation and UI strings from
English into twelve languages. We also investi-
gated the data collected during our last post-editing
productivity test and found a strong correlation
between the edit distance after post-editing and the
productivity increase compared to translating from
scratch. Finally, we had a look at the post-edited
data generated during production in the last twelve
months, comparing the MT engine performance for
some of our products.
We plan to use the insights from the presented
data analysis to continuously monitor the perform-
ance of our MT engines and for the (semi-) auto-
matic detection of potential issues in the MT process.
References
Banerjee, Satanjeev and Alon Lavie. 2005. METEOR:
An Automatic Metric for MT Evaluation with
Improved Correlation with Human Judgements.
In Proceedings of the Workshop on Intrinsic and
Extrinsic Evaluation Measures for MT and/or
Summarization at the 43rd Annual Meeting of the
Association for Computational Linguistics (ACL ’05),
pp. 65–72. Ann Arbor, MI.
Kendall, Maurice G. 1938. A New Measure of Rank
Correlation [June 1938]. Biometrika, 30 (1/2): 81–93.
Koehn, Philipp, Hieu Hoang, Alexandra Birch, Chris
Callison-Burch, Marcello Federico, Nicola Bertoldi,
Brooke Cowan, Wade Shen, Christine Moran,
Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra
Constantin and Evan Herbst. 2007. Moses: Open
Source Toolkit for Statistical Machine Translation.
In Proceedings of the Demo and Poster Sessions of
the 45th Annual Meeting of the Association for
Computational Linguistics (ACL ’07), pp. 177–180.
Prague, Czech Republic.
Levenshtein, Vladimir I. 1965. Двоичные коды с
исправлением выпадений, вставок и замещений
символов (Binary Codes Capable of Correcting
Deletions, Insertions, and Reversals). Доклады
Академий Наук СССР, 163 (4): 845–848.
[reprinted in: Soviet Physics Doklady, 10: 707–710.].
Neubig, Graham, Yosuke Nakata and Shinsuke Mori.
2011. Pointwise Prediction for Robust, Adaptable
Japanese Morphological Analysis. In The 49th
Annual Meeting of the Association for Computational
Linguistics: Human Language Technologies
(ACL-HLT1’11). Portland, OR.
Plitt, Mirko and François Masselot. 2010. A Productivity
Test of Statistical Machine Translation Post-Editing
in a Typical Localisation Context. The Prague
Bulletin of Mathematical Linguistics, 93: 7–16.
Snover, Matthew, Bonnie J. Dorr, Richard Schwartz,
Linnea Micciulla and John Makhoul. 2006. A Study
of Translation Edit Rate with Targeted Human
Annotation. In Proceedings of the 7th Conference of
the Association for Machine Translation in the
Americas (AMTA1’06), pp. 223–231. Cambridge, MA.
Spearman, Charles. 1907. Demonstration of Formulæ
for True Measurement of Correlation [April 1907].
The American Journal of Psychology, 18 (2): 161–169.
Tillmann, Christoph, Stefan Vogel, Hermann Ney, Alex
Zubiaga and Hassan Sawaf. 1997. Accelerated
DP-Based Search for Statistical Translation.
In Proceedings of the Fifth European Conference
on Speech Communication and Technology
(Eurospeech '97), pp. 2667–2670. Rhodes, Greece.
Turian, Joseph P., Luke Shen and I. Dan Melamed.
2003. Evaluation of Machine Translation and its
Evaluation: Computer Science Department,
New York University.