Statistical Phrase-based Post-editing

Michel Simard, Cyril Goutte and Pierre Isabelle
Interactive Language Technologies
National Research Council of Canada
Gatineau, Canada, K1A 0R6
FirstName.LastName@nrc.gc.ca

To appear in: Proceedings of NAACL-HLT 2007. NRC 49288.
Abstract

We propose to use a statistical phrase-based machine translation system in a post-editing task: the system takes as input raw machine translation output (from a commercial rule-based MT system) and produces post-edited target-language text. We report on experiments performed on data collected in precisely such a setting: pairs of raw MT output and their manually post-edited versions. In our evaluation, the output of our automatic post-editing (APE) system is not only of better quality than the rule-based MT output (in terms of both the BLEU and TER metrics), it is also better than the output of a state-of-the-art phrase-based MT system used in standalone translation mode. These results indicate that automatic post-editing constitutes a simple and efficient way of combining rule-based and statistical MT technologies.
1 Introduction

The quality of machine translation (MT) is generally considered insufficient for use in the field without a significant amount of human correction. In the translation world, the term post-editing is often used to refer to the process of manually correcting MT output. While the conventional wisdom is that post-editing MT is usually not cost-efficient compared to full human translation, there appear to be situations where it is appropriate and even profitable. Unfortunately, there are few reports in the literature about such experiences (but see Allen (2004) for examples).
One of the characteristics of the post-editing task, as opposed to the revision of human translation for example, is its partly repetitive nature. Most MT systems invariably produce the same output when confronted with the same input; in particular, this means that they tend to make the same mistakes over and over again, which the post-editors must correct repeatedly. Batch corrections are sometimes possible when multiple occurrences of the same mistake appear in the same document, but when the mistake is repeated over several documents, or equivalently, when the output of the same machine translation system is handled by multiple post-editors, then factoring corrections becomes much more complex. MT users typically try to reduce the post-editing load by customizing their MT systems. However, in Rule-based Machine Translation (RBMT), which still constitutes the bulk of the current commercial offering, customization is usually restricted to the development of "user dictionaries". Not only is this time-consuming and expensive, it can only fix a subset of the MT system's problems.
The advent of Statistical Machine Translation, and most recently of phrase-based approaches (PBMT; see Marcu and Wong (2002), Koehn et al. (2003)), into the commercial arena seems to hold the promise of a solution to this problem: because the MT system learns directly from existing translations, it can be automatically customized to new domains and tasks. However, the success of this operation crucially depends on the amount of training data available. Moreover, the current state of the technology is still insufficient for consistently producing human-readable translations.
This state of affairs has prompted some to examine the possibility of automating the post-editing process itself, at least as far as "repetitive errors" are concerned. Allen and Hogan (2000) sketch the outline of such an automated post-editing (APE) system, which would automatically learn post-editing rules from a tri-parallel corpus of source, raw MT and post-edited text. Elming (2006) suggests using transformation-based learning to automatically acquire error-correcting rules from such data; however, the proposed method only applies to lexical choice errors. Knight and Chander (1994) also argue in favor of using a separate APE module, which is then portable across multiple MT systems and language pairs, and suggest that the post-editing task could be performed using statistical machine translation techniques. To the best of our knowledge, however, this idea has never been implemented.
In this paper, we explore the idea of using a PBMT system as an automated post-editor. The underlying intuition is simple: if we collect a parallel corpus of raw machine-translation output along with its human post-edited counterpart, we can train the system to translate from the former into the latter. In Section 2, we present the case study that motivates our work and the associated data. In Section 3, we describe the phrase-based post-editing model that we use for improving the output of the automatic translation system. In Section 4, we illustrate this on a dataset of moderate size containing job ads and their translations. With less than 500k words of training material, the phrase-based MT system already outperforms the rule-based MT baseline. However, a phrase-based post-editing model trained on the output of that baseline outperforms both by a fairly consistent margin. The resulting BLEU score increases by up to 50% (relative) and the TER is cut by one third.
2 Background
2.1 Context
The Canadian government's department of Human Resources and Social Development (HRSDC) maintains a web site called Job Bank (http://www.jobbank.gc.ca), where potential employers can post ads for open positions in Canada. Over one million ads are posted on Job Bank every year, totalling more than 180 million words. By virtue of Canada's Official Languages Act, HRSDC is under legal obligation to post all ads in both French and English. In practice, this means that ads submitted in English must be translated into French, and vice versa.
To address this task, the department has put together a complex setup involving text databases, translation memories, machine translation and human post-editing. Employers submit ads to the Job Bank website by means of HTML forms containing "free text" data fields. Some employers post identical ads periodically; the department therefore maintains a database of previously posted ads along with their translations, and new ads are systematically checked against this database. The translation of one third of all ads posted on the Job Bank is actually recovered this way. Also, employers will often post ads which, while not entirely identical, still contain identical sentences. The department therefore also maintains a translation memory of individual sentence pairs from previously posted ads; another third of all text is typically found verbatim in this way.
The remaining text is submitted to machine translation, and the output is post-edited by human experts. Overall, only a third of all submitted text requires human intervention. This is nevertheless very labour-intensive, as the department tries to ensure that ads are posted at most 24 hours after submission. The Job Bank currently employs as many as 20 post-editors working full-time, most of whom are junior translators.
2.2 The Data
HRSDC kindly provided us with a sample of data from the Job Bank. This corpus consists of a collection of parallel "blocks" of textual data. Each block contains three parts: the source-language text, as submitted by the employer; its machine translation, produced by a commercial rule-based MT system; and its final post-edited version, as posted on the website.
The entire corpus contains less than one million words in each language. This corresponds to the data processed by the Job Bank in less than a week. Basic statistics are given in Table 1 (see Section 4.1). Most blocks contain only one sentence, but some blocks may contain many sentences. The longest block contains 401 tokens over several sentences. Overall, blocks are quite short: the median number of tokens per source block is only 9 for French-to-English and 7 for English-to-French. As a consequence, no effort was made to segment the blocks further for processing.
We evaluated the quality of the machine translation contained in the corpus using the Translation Edit Rate (TER, cf. Snover et al. (2006)). TER counts the number of edit operations, including phrasal shifts, needed to change a hypothesis translation into an adequate and fluent sentence, and normalises this count by the length of the final sentence. Note that this closely corresponds to the post-editing operation performed on the Job Bank application. This motivates the choice of TER as the main metric in our case, although we also report BLEU scores in our experiments. Note that the emphasis of our work is on reducing the post-editing effort, which is well estimated by TER; it is not directly on quality, so the question of which metric better estimates translation quality is not so relevant here.
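To make the metric concrete, here is a minimal sketch of a TER-style score in Python. It is a deliberate simplification: it counts word-level insertions, deletions and substitutions via standard edit distance, but omits the phrasal-shift search of the real metric.

```python
def ter_no_shifts(hyp, ref):
    """Simplified TER: word-level edit distance (insertions, deletions,
    substitutions) normalised by the reference length. The full metric
    (Snover et al., 2006) also searches for phrasal shifts, each counted
    as a single edit; that search is omitted here."""
    m, n = len(hyp), len(ref)
    # d[i][j] = minimum edits turning hyp[:i] into ref[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete all of hyp[:i]
    for j in range(n + 1):
        d[0][j] = j                      # insert all of ref[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = d[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[m][n] / n

# Three edits against a two-word reference give a TER of 150%,
# which is how block-level scores above 100% can arise (see below).
print(ter_no_shifts("a b c".split(), "x y".split()))  # 1.5
```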
The global TER (over all blocks) is 58.77% for French-to-English and 53.33% for English-to-French. This means that more than half the words have to be post-edited in some way (delete / substitute / insert / shift). This apparently harsh result is somewhat mitigated by two factors.

First, the distribution of the block-based TER (contrary to BLEU or NIST, TER naturally decomposes into block-based scores) shows a large disparity in performance, cf. Figure 1. About 12% of blocks have a TER higher than 100%: this is because TER normalises on the length of the reference, and if the raw MT output is longer than its post-edited counterpart, then the number of edit operations may be larger than that length. (A side effect of the normalisation is that larger TER values are measured on short sentences, e.g. 3 errors for 2 reference words.) At the other end of the spectrum, it is also clear that many blocks have low TER; in fact more than 10% have a TER of 0. The global score therefore hides a large range of performance.

[Figure 1: Distribution of TER on 39005 blocks from the French-English corpus (thresholded at 150%). Histogram of block-level TER for the rule-based MT output.]
The second factor is that TER measures the distance to an adequate and fluent result. A high TER does not mean that the raw MT output is not understandable; however, many edit operations may be needed to make it fluent.
3 Phrase-based Post-editing
Translation post-editing can be viewed as a simple transformation process, which takes as input raw target-language text coming from an MT system and produces as output target-language text in which "errors" have been corrected. While the automation of this process can be envisaged in many different ways, the task is not conceptually very different from the translation task itself. Therefore, there seems to be no good reason why a machine translation system could not handle the post-editing task. In particular, given data such as described in Section 2.2, the idea of using a statistical MT system for post-editing is appealing. Portage is precisely such a system, which we describe here.
Portage is a phrase-based statistical machine translation system developed at the National Research Council of Canada (NRC) (Sadat et al., 2005). A version of the Portage system is made available by the NRC to Canadian universities for research and education purposes. Like other SMT systems, it learns to translate from existing parallel corpora.
The system translates text in three main phases: preprocessing of raw data into tokens; decoding to produce one or more translation hypotheses; and error-driven rescoring to choose the best final hypothesis. For languages such as French and English, the first of these phases (tokenization) is mostly a straightforward process; we do not describe it any further here.
Decoding is the central phase in SMT, involving a search for the hypotheses t that have the highest probabilities of being translations of the current source sentence s, according to a model for P(t|s). Portage implements a dynamic-programming beam search decoding algorithm similar to that of Koehn (2004), in which translation hypotheses are constructed by combining in various ways the target-language parts of phrase pairs whose source-language parts match the input. These phrase pairs come from large phrase tables constructed by collecting matching pairs of contiguous text segments from word-aligned bilingual corpora.
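To illustrate how such phrase tables come about, below is a minimal sketch of the textbook consistency criterion for extracting phrase pairs from one word-aligned sentence pair (cf. Koehn et al., 2003). This is not Portage's actual extraction code; the alignment representation and the phrase-length cap are our assumptions.

```python
def extract_phrase_pairs(src, tgt, alignment, max_len=7):
    """Collect all phrase pairs (src[i:j], tgt[k:l]) consistent with a
    word alignment: no alignment link may leave the box [i,j) x [k,l)."""
    pairs = []
    for i in range(len(src)):
        for j in range(i + 1, min(i + max_len, len(src)) + 1):
            # target positions linked to the source span [i, j)
            linked = [t for s, t in alignment if i <= s < j]
            if not linked:
                continue
            k, l = min(linked), max(linked) + 1
            # consistency: no target word in [k, l) links outside [i, j)
            if all(i <= s < j for s, t in alignment if k <= t < l):
                pairs.append((tuple(src[i:j]), tuple(tgt[k:l])))
    return pairs

# e.g. extract_phrase_pairs(["raw", "output"], ["sortie", "brute"],
#                           [(0, 1), (1, 0)])
# yields ("raw",)-("brute",), ("output",)-("sortie",) and the full pair.
```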
Portage's model for P(t|s) is a log-linear combination of four main components: one or more n-gram target-language models, one or more phrase translation models, a distortion (word-reordering) model, and a sentence-length feature. The phrase-based translation model is similar to that of Koehn, with the exception that phrase probability estimates P(s̃|t̃) are smoothed using the Good-Turing technique (Foster et al., 2006). The distortion model is also very similar to Koehn's, with the exception of a final cost to account for sentence endings.
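Abstractly, such a model scores candidate translations with a weighted sum of log-domain feature functions; a generic form (with the specific features h_k being those just listed) is:

```latex
P(t \mid s) \propto \exp\Big(\sum_{k=1}^{K} \lambda_k\, h_k(s, t)\Big),
\qquad
\hat{t} = \arg\max_t \sum_{k=1}^{K} \lambda_k\, h_k(s, t)
```

where the h_k include the target-language model log-probability, the phrase translation log-probabilities, the distortion penalty and the sentence-length feature, and the weights λ_k are tuned as described next.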
Feature function weights in the log-linear model are set using Och's minimum error rate training algorithm (Och, 2003). This is essentially an iterative two-step process: for a given set of source sentences, generate n-best translation hypotheses that are representative of the entire decoding search space; then apply a variant of Powell's algorithm to find weights that optimize the BLEU score of these hypotheses against reference translations. This process is repeated until the set of translations stabilizes, i.e. no new translations are produced at the decoding step.
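Schematically, the loop looks as follows. This is a sketch only: decode_nbest and powell_optimize_bleu are hypothetical stand-ins for the decoder and the Powell-style line search, which the paper does not spell out.

```python
def mert(weights, dev_src, dev_refs, n=100):
    """Minimum error rate training (Och, 2003), sketched: alternate
    n-best generation and BLEU-optimal weight search until decoding
    produces no new hypotheses."""
    pools = [set() for _ in dev_src]   # accumulated candidates per sentence
    while True:
        grew = False
        for i, src in enumerate(dev_src):
            for hyp in decode_nbest(weights, src, n):  # hypothetical decoder
                if hyp not in pools[i]:
                    pools[i].add(hyp)
                    grew = True
        if not grew:
            return weights             # translation set has stabilized
        # Powell-style search for weights maximizing corpus BLEU over the
        # pooled hypotheses, against the references (hypothetical helper).
        weights = powell_optimize_bleu(pools, dev_refs)
```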
To improve the raw output from decoding, Portage relies on a rescoring strategy: given a list of n-best translations from the decoder, the system reorders this list, this time using a more elaborate log-linear model incorporating feature functions beyond those of the decoding model. These typically include IBM-1 and IBM-2 model probabilities (Brown et al., 1993) and an IBM-1-based feature function designed to detect whether any word in one language appears to have been left without a satisfactory translation in the other language; all of these feature functions can be used in both language directions, i.e. source-to-target and target-to-source.
In the experiments reported in the next section, the Portage system is used both as a translation and as an APE system. While we can think of a number of modifications to such a system to better adapt it to the post-editing task (some of which are discussed later on), we have made no such modifications here. In fact, whether the system is used for translation or post-editing, we have used exactly the same translation model configuration and training procedure.
4 Evaluation
4.1 Data and experimental setting
The corpus described in Section 2.2 is available for two language pairs: English-to-French and French-to-English. (Note that, in a post-editing context, translation direction is crucially important; it is not possible to use the same corpus in both directions.) In each direction, each block is available in three versions (or slices): the original text (or source), the output of the commercial rule-based MT system (or baseline), and the final, post-edited version (or reference).

In each direction (French-to-English and English-to-French), we held out two subsets of approximately 1000 randomly picked blocks. The validation set is used for testing the impact of various high-level choices such as pre-processing, or for obtaining preliminary results on the basis of which we set up new experiments. The test set is used only once, in order to obtain the final experimental results reported here.

The rest of the data constitutes the training set, which is split in two. We sampled a subset of 1000 blocks as train-2, which is used for optimizing the log-linear model parameters used for decoding and rescoring. The rest is the train-1 set, used for estimating IBM translation models, constructing phrase tables and estimating a target language model.

            English-to-French                  French-to-English
Corpus      blocks  source baseline reference  blocks  source baseline reference
train-1     28577   310k   382k     410k       36005   485k   501k     456k
train-2      1000    11k    14k      14k        1000    13k    14k      12k
validation    881    10k    13k      13k         966    13k    14k      12k
test          899    10k    12k      13k         953    13k    13k      12k

Table 1: Data and split used in our experiments (word counts in thousands). 'baseline' is the output of the commercial rule-based MT system and 'reference' is the final, post-edited text.

The composition of the various sets is detailed in Table 1. All data was tokenized and lowercased; all evaluations were performed independently of case. Note that the validation and test sets were originally made of 1000 blocks sampled randomly from the data. These sets turned out to contain blocks identical to blocks from the training sets. Considering that these would normally have been handled by the translation memory component (see the HRSDC workflow description in Section 2.1), we removed those blocks for which the source part was already found in the training set (in either train-1 or train-2), hence their smaller sizes.
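A minimal sketch of that filtering step, assuming each block is stored as a dict with 'source', 'baseline' and 'reference' slices (the representation is our assumption):

```python
def filter_heldout(heldout, train1, train2):
    """Drop held-out blocks whose source side already occurs in the
    training data; in production these would have been served by the
    translation memory rather than by MT."""
    seen = {b["source"] for b in train1} | {b["source"] for b in train2}
    return [b for b in heldout if b["source"] not in seen]
```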
In order to check the sensitivity of experimental results to the choice of the train-2 set, we did a run of preliminary experiments using different subsets of 1000 blocks. The experimental results were nearly identical and highly consistent, showing that the choice of a particular train-2 subset has no influence on our conclusions. In the experiments reported below, we therefore use a single, identical train-2 set.
We initially performed two sets of experiments on this data. The first was intended to compare the performance of the Portage PBMT system as an alternative to the commercial rule-based MT system on this type of data. In these experiments, English-to-French and French-to-English translation systems were trained on the source and reference (manually post-edited target language) slices of the training set. In addition to the target language model estimated on the train-1 data, we used an external contribution: a trigram target language model trained on a fairly large quantity of data from the Canadian Hansard.

                         TER    BLEU
English-to-French
  Baseline               53.5   32.9
  Portage translation    53.7   36.0
  Baseline + Portage APE 47.3   41.6
French-to-English
  Baseline               59.3   31.2
  Portage translation    43.9   41.0
  Baseline + Portage APE 41.0   44.9

Table 2: Experimental results. For TER, lower (error) is better, while for BLEU, higher (score) is better.

The goal of the second set of experiments was to assess the potential of the Portage technology in automatic post-editing mode. Again, we built systems for both language directions, but this time using the existing rule-based MT output as source and the reference as target. Apart from the use of different source data, the training procedure and system configurations of the translation and post-editing systems were in all points identical.
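The only difference between the two setups, then, is which slice plays the role of the source side. A sketch of the data assembly, reusing the block representation assumed earlier (train_portage is a hypothetical stand-in for the actual training pipeline):

```python
def make_training_pairs(blocks, mode):
    """'translation' pairs the original source with the post-edited
    reference; 'ape' pairs the rule-based baseline MT output with it."""
    key = {"translation": "source", "ape": "baseline"}[mode]
    return [(b[key], b["reference"]) for b in blocks]

# Illustrative usage (train_portage and train1 are hypothetical):
# translation_system = train_portage(make_training_pairs(train1, "translation"))
# ape_system = train_portage(make_training_pairs(train1, "ape"))
```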
4.2 Experimental results
The results of both experiments are presented in Table 2. Results are reported both in terms of the TER and BLEU metrics; Baseline refers to the commercial rule-based MT output.
The first observation from these results is that, while the performance of Portage in translation mode is approximately equivalent to that of the baseline system when translating into French, its performance is much better than the baseline when translating into English. Two factors possibly contribute to this result: first, the fact that the baseline system itself performs better when translating into French; second, and possibly more importantly, the fact that we had access to less training data for English-to-French translation.

The second observation is that when Portage is used in automatic post-editing mode, on top of the baseline MT system, it achieves better quality than either of the two translation systems used on its own. This appears to be true regardless of the translation direction or metric. This is an extremely interesting result, especially in light of how little data was actually available to train the post-editing system.
One aspect of statistical MT systems is that, contrary to rule-based systems, their performance (usually) increases as more training data is available. In order to quantify this effect in our setting, we have computed learning curves by training the Portage translation and Portage APE systems on subsets of the training data of increasing sizes. We start with as little as 1000 blocks, which corresponds to around 10-15k words.

Figure 2 (next page) compares the learning rates of the two competing approaches (Portage translation vs. Portage APE). Both approaches display very steady learning rates (note the logarithmic scale for training data size). These graphs strongly suggest that both systems would continue to improve given more training data. The most impressive aspect is how little data is necessary to improve upon the baseline, especially when translating into English: as little as 8000 blocks (around 100k words) for direct translation and 2000 blocks (around 25k words) for automatic post-editing. This suggests that such a post-editing setup might be worth implementing even for specialized domains with very small volumes of data.
4.3 Extensions
Given the encouraging results of the Portage APE approach in the above experiments, we were curious to see whether a Portage+Portage combination might be as successful: after all, if Portage was good at correcting some other system's output, could it not manage to correct the output of another Portage translator?
We tested this in two settings. In the first, we use the output of the Portage translation system obtained above, i.e. trained on the same data. In our second experiment, we use the output of a Portage translator trained on data from a different domain (the Canadian Hansard), but with much larger amounts of training material (over 85 million words per language). In both sets of experiments, the Portage APE system was trained as previously, but using Portage translations of the Job Bank data as input text.

                      TER    BLEU
English-to-French
  Portage Job Bank    53.7   36.0
    + Portage APE     53.7   36.2
  Portage Hansard     76.9   13.0
    + Portage APE     64.6   26.2
French-to-English
  Portage Job Bank    43.9   41.0
    + Portage APE     43.9   41.4
  Portage Hansard     80.1   14.0
    + Portage APE     57.7   28.6

Table 3: Portage translation + Portage APE system combination experimental results.
The results of both experiments are presented in Table 3. The first observation in these results is that there is nothing to be gained from post-editing when both the translation and APE systems are trained on the same data sets (the Portage Job Bank + Portage APE experiments). In other words, the translation system is apparently already making the best possible use of the training data, and additional layers do not help (but nor do they hurt, interestingly).
However, when the translation system has been trained on distinct data (the Portage Hansard + Portage APE experiments), post-editing makes a large difference, comparable to that observed with the rule-based MT output provided with the Job Bank data. In this case, however, the Portage translation system performs very poorly in spite of the large size of its training set, much worse in fact than the "baseline" system. This highlights the fact that both the Job Bank and Hansard data are very much domain-specific, and that access to appropriate training material is crucial for phrase-based translation technology.
[Figure 2: TER and BLEU learning curves of the phrase-based translation and post-editing models, into English and into French, as the amount of training data increases (log scale, 1000 to 20000 blocks). The horizontal lines correspond to the performance of the baseline system (rule-based translation).]

In this context, combining two phrase-based systems as done here can be seen as a way of adapting an existing MT system to a new text domain; the APE system then acts as an "adapter", so to speak. Note however that, in our experiments, this setup does not perform as well as a single Portage translation system trained directly and exclusively on the Job Bank data.
Such an adaptation strategy should be contrasted with one in which the translation models of the old and new domains are "merged" to create a new translation system. As mentioned earlier, Portage allows using multiple phrase translation tables and language models concurrently. For example, in the current context, we can extract phrase tables and language models from the Job Bank data, as when training the "Portage Job Bank" translation system, and then build a Portage translation model using both the Hansard and Job Bank model components. Log-linear model parameters are then optimized on the Job Bank data, so as to find the model weights that best fit the new domain.
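Schematically, the merged model simply carries one copy of each component per domain, with all weights tuned on the new domain (a generic form, not Portage's exact parameterization):

```latex
\hat{t} = \arg\max_t \Bigg[
  \sum_{d \in \{\text{Hansard},\ \text{JobBank}\}}
    \Big( \lambda_d^{\text{lm}} \log P_d^{\text{lm}}(t)
        + \lambda_d^{\text{tm}} \log P_d^{\text{tm}}(s \mid t) \Big)
  + \text{other features} \Bigg]
```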
In a straightforward implementation of this idea, we obtained performance almost identical to that of the Portage translation system trained solely on Job Bank data. Upon closer examination of the model parameters, we observed that the Hansard model components (language model, phrase tables, IBM translation models) were systematically attributed negligible weights. Again, the amount of training material for the new domain may be critical in choosing between alternative adaptation mechanisms.
5 Conclusions and Future Work
We have proposed using a phrase-based MT system to automatically post-edit the output of another MT system, and have tested this idea with the Portage MT system on the Job Bank data set, a corpus of manually post-edited French-English machine translations. In our experiments, not only does phrase-based APE significantly improve the quality of the output translations, it also outperforms a standalone phrase-based translation system.
While these results are very encouraging, the learning curves of Figure 2 suggest that the output quality of the PBMT systems increases faster than that of the APE systems as more data is used for training. So while the combination strategy clearly performs better with limited amounts of training data, there is reason to believe that, given sufficient training data, it would eventually be outperformed by a direct phrase-based translation strategy. Of course, this remains to be verified empirically, something which will obviously require more data than is currently available to us. But this sort of behavior is to be expected: while both types of system improve as more training data is used, inevitably some details of the source text will be lost by the front-end MT system, which the APE system will never be able to retrieve. (As a trivial example, imagine an MT system that "deletes" out-of-vocabulary words.) Ultimately, the APE system will be weighed down by the inherent limitations of the front-end MT system.
One way around this problem would be to modify the APE system so that it uses not only the baseline MT output, but also the source-language input. In the Portage system, this could be achieved, for example, by introducing feature functions into the log-linear model that relate target-language phrases to the source-language text. This is one research avenue that we are currently exploring.
Alternatively, we could combine these two inputs differently within Portage: for example, use the source-language text as the primary input, and the raw MT output as a secondary source. In this perspective, if we have multiple MT systems available, nothing precludes using all of them as providers of secondary inputs. In such a setting, the phrase-based system becomes a sort of combination MT system. We intend to explore such alternatives in the near future as well.
Acknowledgements

The work reported here was part of a collaboration between the National Research Council of Canada and the department of Human Resources and Social Development Canada. Special thanks go to Souad Benayyoub, Jean-Frédéric Hübsch and the rest of the Job Bank team at HRSDC for preparing data that was essential to this project.
References

Jeffrey Allen and Christofer Hogan. 2000. Toward the development of a post-editing module for Machine Translation raw output: a new productivity tool for processing controlled language. In Third International Controlled Language Applications Workshop (CLAW2000), Washington, USA.

Jeffrey Allen. 2004. Case study: Implementing MT for the translation of pre-sales marketing and post-sales software deployment documentation. In Proceedings of AMTA-2004, pages 1–6, Washington, USA.

Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19(2):263–311.

Jakob Elming. 2006. Transformation-based corrections of rule-based MT. In Proceedings of the EAMT 11th Annual Conference, Oslo, Norway.

George Foster, Roland Kuhn, and Howard Johnson. 2006. Phrasetable Smoothing for Statistical Machine Translation. In Proceedings of EMNLP 2006, pages 53–61, Sydney, Australia.

Kevin Knight and Ishwar Chander. 1994. Automated Postediting of Documents. In Proceedings of the National Conference on Artificial Intelligence, pages 779–784, Seattle, USA.

Philipp Koehn, Franz J. Och, and Daniel Marcu. 2003. Statistical Phrase-Based Translation. In Proceedings of HLT-NAACL 2003, pages 127–133, Edmonton, Canada.

Philipp Koehn. 2004. Pharaoh: a Beam Search Decoder for Phrase-Based Statistical Machine Translation Models. In Proceedings of AMTA 2004, pages 115–124, Washington, USA.

Daniel Marcu and William Wong. 2002. A Phrase-Based, Joint Probability Model for Statistical Machine Translation. In Proceedings of EMNLP 2002, Philadelphia, USA.

Franz Josef Och. 2003. Minimum error rate training in Statistical Machine Translation. In Proceedings of ACL-2003, pages 160–167, Sapporo, Japan.

Fatiha Sadat, Howard Johnson, Akakpo Agbago, George Foster, Roland Kuhn, Joel Martin, and Aaron Tikuisis. 2005. PORTAGE: A Phrase-Based Machine Translation System. In Proceedings of the ACL Workshop on Building and Using Parallel Texts, pages 129–132, Ann Arbor, USA.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A Study of Translation Edit Rate with Targeted Human Annotation. In Proceedings of AMTA-2006, Cambridge, USA.