Incremental Adaptation Using Translation Information and Post-Editing Analysis
eric Blain*†, Holger Schwenk*
e du Maine, Avenue Laennec
72085 Le Mans, France
5, rue Feydeau
75002 Paris, France
It is well known that statistical machine translation sys-
tems perform best when they are adapted to the task. In this
paper we propose new methods to quickly perform incre-
mental adaptation without the need to obtain word-by-word
alignments from GIZA or similar tools. The main idea is
to use an automatic translation as pivot to infer alignments
between the source sentence and the reference translation,
or user correction. We compared our approach to the standard
method for incremental re-training. We achieve similar BLEU
scores using fewer computational resources. Fast retraining is
particularly interesting when we want to integrate user
feedback almost instantly, for instance in a post-editing
context or in a machine-translation-assisted CAT tool. We also
explore several methods to combine the translation models.
Due to the multiplication of resources and the diversity of
languages, Machine Translation (MT) systems are widely used
as a valuable aid for human translators. Most of the systems
used today are based on the statistical approach. Those
systems extract all their knowledge from the provided data.
Nevertheless, these systems have some limits: in particular,
the specific resources available at time t could be less
appropriate at time t+1. Consequently, they need to be
regularly re-trained in order to be updated, which is usually
computationally demanding. The goal of incremental adaptation
is then to adapt the system on the fly when new resources
become available, without re-training the entire system.
Post-Editing (PE) of the output of SMT systems is widely
used, among others, by professional translators in
localization services who need, for example, to translate
technical data in specific domains into several languages.
However, the use of PE is restricted by some aspects that must
be taken into consideration. As summarized by , the time spent
by the post-editor is a commonly used measure of the PE effort,
which, in the case of poor translation quality, should not
exceed the time needed to translate from scratch. Even if this
temporal aspect can be seen as the most important, PE effort
can also be evaluated using automatic metrics based on the edit
distance. These metrics commonly use the number of edits
required to turn the MT system output into a reference
translation. Hence, the combination of PE and incremental
adaptation can be seen as a way to reduce the task effort
by allowing an MT system to gradually learn from its own
errors, especially considering the repetitive nature of the
task highlighted by .
However, incremental adaptation is still a tricky task:
how to adapt the system correctly? Adaptation should not de-
grade system performance and accuracy. Some approaches
are possible and we will try to see the impact of several of
them in the second part of this article.
First of all, we present a new experimental approach for
incremental adaptation of an MT system using PE analysis.
Starting from a generic baseline, we have gradually adapted
our system by translating an in-domain corpus which was
split beforehand. Each part of the corpus was translated using
the translation model adapted at the previous step, i.e.
updated with newly extracted phrases. These phrases result
from the word-to-word alignment combination presented in
Section 2.1.
1.1. Similar work
The most similar approach in the literature is proposed in
 who present an incremental re-training algorithm to sim-
ulate a post-editing situation. It is proposed to extract new
phrases from approximate alignments which were obtained
by a modiﬁed version of Giza-pp . An initial alignment
with one-to-one links between the same sentence positions
is created and then iteratively updated as long as improve-
ments are observed. In practice, a greedy search algorithm
is used to ﬁnd the locally optimal word alignment. All
source positions carrying only one link are tried, and the sin-
gle link change which produces the highest probability in-
crease according to the Giza-pp model 4 is kept. The result-
ing alignment is improved with two simple post-processing
steps. First, each unknown word on the source side is aligned
with the first non-aligned unknown word on the target side.
Second, unaligned pairs of positions surrounded by corre-
sponding alignments are automatically aligned.
In this paper, we present a very fast word-to-word alignment
algorithm which is partially based on the edit-distance
algorithm. As argued in , “to be practical, incremental
retraining must be performed in less than one second”.
For comparison, our entire alignment process takes a few
hundredths of a second for 1500 sentences, compared to several
seconds per sentence as reported in .
Figure 1: Incremental adaptation workflow as a three-step protocol. 1. Translation and source-translation alignment: source
sentences are translated using the SMT system Moses, and alignment links are generated during the translation step; 2. Edit distance
on translation-reference: the MT system output and its reference translation are aligned using the edit distance algorithm of TER; 3.
Source-reference alignment: the alignment links are deduced from the combination of the alignments of steps 1 and 2. Phrase pairs
are then extracted, scored and added to the translation model, which is finally re-trained.
 present stream based incremental adaptation using an
online version of the EM algorithm. This approach, designed
for large amounts of incoming data, is not well suited to
the post-editing context. Like , we propose an incremental
adaptation workflow that is more oriented to real-time
processing.
As part of our experiments, we have compared our approach
with the freely available tool Inc-Giza-pp, an incremental
version of Giza-pp. It is precisely
intended to inject new data into an SMT system without hav-
ing to restart the entire word alignment procedure. To our
knowledge, this is the standard method currently used in the
ﬁeld. In our experiments, we achieve similar results with re-
spect to the BLEU score using less time.
The remainder of this paper is organized as follows. In
the next section we ﬁrst describe our incremental adaptation
workﬂow and more particularly the word-to-word alignment
methodology based on the edit distance. Section 3 is ded-
icated to the experimental protocols and compares the per-
formance of our approach with the standard method using
Inc-Giza-pp. The paper concludes with a discussion of per-
spectives of this work.
2. Incremental Adaptation Workﬂow
In this paper, we present a new methodology to perform in-
cremental training and domain adaptation. Starting with a
generic phrase-based MT baseline system (PBMT), we have
sequentially translated the source side of an in-domain cor-
pus. At each step, like , we have simulated a human post-
editing the translations by using the corresponding reference
translations of the data. At the sentence level, the source and
its reference translation are aligned in order to subsequently
retrieve the corresponding phrase pairs. The extracted phrase
pairs are then scored and used to retrain (i.e. adapt) the trans-
lation model of our PBMT system.
We have developed an alignment protocol which operates
in three steps, named “translation”, “analysis” and
“adaptation”. These three steps are linked together by a
word-to-word alignment algorithm which allows us to align a
source sentence and its reference translation and then to
extract new phrase pairs with which the MT system will be
adapted. This algorithm is illustrated in Figure 1 and
explained in detail in the following subsections.
2.1. Word-to-word alignment combination
Our approach to align the source and its corresponding refer-
ence translation could be seen as a combination of the source
to hypothesis word alignments and an analysis of the edit dis-
tance between the hypothesis and the reference. The central
element of this approach is an automatic translation of the
source sentence into the target language. The principle of
this idea is illustrated in Figure 2.
Figure 2: Example of a source-to-reference alignment using the automatic translation as pivot. The alignment links
between the source sentence and the translation are generated by the MT system. Those between the translation and its post-
edited version (i.e. the reference) are calculated by TER. Finally, the source-to-reference alignment links are deduced by an
alignment combination based on both alignment sets computed before.
2.1.1. Translation: source to translation alignment
The SMT system used to translate the source sentences is
based on the Moses SMT toolkit . Moses can provide the
word-to-word alignments between the source sentence and
the translation hypothesis. This alignment information
represents the first part of our alignment combination. The
automatic translation is then compared with the reference
translation using an edit distance algorithm.
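As an illustration of the alignment information used in this step, the following sketch parses word alignments written as space-separated `i-j` index pairs (source position, hypothesis position). The text format and the function name `parse_alignment` are assumptions made for this example; the exact output depends on how the decoder is configured.

```python
from typing import List, Tuple


def parse_alignment(line: str) -> List[Tuple[int, int]]:
    """Parse 'i-j' pairs into (source index, hypothesis index) tuples.

    Assumes 0-based positions separated by a single hyphen.
    """
    links = []
    for token in line.split():
        src, tgt = token.split("-")
        links.append((int(src), int(tgt)))
    return links


# Hypothetical alignment string for a three-word source sentence.
links = parse_alignment("0-0 1-2 2-1")
print(links)
```

Each tuple is one source-to-translation link, which is the first of the two inputs of the alignment combination.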
2.1.2. Analysis: edit distance alignment
In this paper, we use the Translation Error Rate (TER) algo-
rithm as proposed in . TER is an extension of the Word
Error Rate (WER) which is more suitable for machine
translation since it can take word reorderings into account.
TER uses the following edit types: insertion, deletion,
substitution and shift.
The TER is computed between the output of our SMT
system and the corresponding reference translation, and the
word-to-word alignments are inferred from the resulting edit
path. We only keep the aligned and substituted edit types in
order to extract what we consider to be the most interesting
phrase pairs. Indeed, we argue that what is aligned
corresponds to what our system knows, while what is
substituted corresponds to what our system does not know.
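The selection of edit types can be illustrated with a simplified word-level edit distance. The sketch below is a plain Levenshtein alignment without shifts, so it is closer to WER than to full TER; the function names `edit_path` and `kept_links` and the toy sentences are assumptions made for this example.

```python
from typing import List, Tuple


def edit_path(hyp: List[str], ref: List[str]) -> List[Tuple[str, int, int]]:
    """Word-level Levenshtein alignment (no shifts, unlike full TER).

    Returns (edit_type, hyp_index, ref_index) triples; the index is -1
    on the side that has no word (insertion or deletion).
    """
    n, m = len(hyp), len(ref)
    # cost[i][j] = minimal number of edits turning hyp[:i] into ref[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Backtrace from the bottom-right corner to recover the edit path.
    path, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and cost[i][j] == cost[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1])):
            kind = "align" if hyp[i - 1] == ref[j - 1] else "substitution"
            path.append((kind, i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            path.append(("deletion", i - 1, -1))
            i -= 1
        else:
            path.append(("insertion", -1, j - 1))
            j -= 1
    path.reverse()
    return path


def kept_links(path: List[Tuple[str, int, int]]) -> List[Tuple[int, int]]:
    """Keep only the 'align' and 'substitution' edits, as in the text."""
    return [(h, r) for kind, h, r in path if kind in ("align", "substitution")]


hyp = "the cat sat".split()
ref = "the dog sat down".split()
path = edit_path(hyp, ref)
print(path)
print(kept_links(path))
```

In this toy case the word inserted in the reference ("down") produces no hypothesis-to-reference link, while the substituted pair ("cat", "dog") is kept.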
Our approach can be extended to use TER-Plus , an
extension of TER using paraphrases, stemming and syn-
onyms in order to obtain better word-to-word alignments.
2.1.3. Adaptation: source to reference alignment
Considering the SMT translation hypothesis as a “pivot” for
aligning both source and its reference sentence, we have de-
signed the word-to-word alignment algorithm shown by Al-
gorithm 1. It combines source-to-translation and translation-
to-reference alignments, and then deduces the source-to-
reference alignment path. From this path, the translation
model is ﬁnally updated using the standard training phrase
extraction and scoring script provided with Moses.
Data: src-to-tgt word alignments, tgt-to-ref edit-path
foreach src-to-tgt word alignment do
    alignment(src-word, tgt-word) = 1;
if edit-path has shift then
    foreach shift do
foreach edit-type of edit-path do
    if edit-type is ‘align’ or ‘substitution’ then
        alignment(tgt-word, ref-word) = 1;
foreach ref-word of ref do
    foreach tgt-word aligned to ref-word do
        if isAligned?(src-word, tgt-word) then
            alignment(src-word, ref-word) = 1;
Algorithm 1: Source-to-reference alignment algorithm
at word level. Using both the source-to-translation
alignments and the translation-to-reference edit path, the
source-to-reference alignment path is built.
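The composition step of Algorithm 1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that shifts have already been resolved, so that the hypothesis positions in both inputs refer to the same word order, and that the translation-to-reference links contain only 'align' and 'substitution' edits. The name `combine_alignments` is hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

Link = Tuple[int, int]


def combine_alignments(src_tgt: Iterable[Link], tgt_ref: Iterable[Link]) -> List[Link]:
    """Compose src->tgt and tgt->ref links into src->ref links."""
    # Index the source positions linked to each hypothesis position.
    srcs_of_tgt: Dict[int, Set[int]] = {}
    for s, t in src_tgt:
        srcs_of_tgt.setdefault(t, set()).add(s)
    # A source word is linked to a reference word whenever both are
    # linked to the same hypothesis word (the pivot).
    src_ref: Set[Link] = set()
    for t, r in tgt_ref:
        for s in srcs_of_tgt.get(t, ()):
            src_ref.add((s, r))
    return sorted(src_ref)


# Toy example: the decoder swapped source words 1 and 2, and the
# edit path links hypothesis and reference positions one-to-one.
src_tgt = [(0, 0), (1, 2), (2, 1)]
tgt_ref = [(0, 0), (1, 1), (2, 2)]
print(combine_alignments(src_tgt, tgt_ref))
```

Hypothesis words that are inserted or deleted in the edit path simply contribute no source-to-reference link in this sketch.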
3. Experimental evaluation
The approach described in the previous section is compared
to inc-Giza-pp which is considered as the state-of-the-art tool
for incremental training. In our first experiments, each
system uses a single translation model which is updated and
entirely retrained after each iteration. For the results we
present hereinafter, the system with inc-Giza-pp will be
called “inc-Giza-pp” and the system with our approach will be
called “noGizapp”.
3.1. Training data
The experiments were performed on data which was made
available by the French COSMAT project. The goal of this
project is to provide task-speciﬁc automatic translations of
scientific texts on the French HAL archive. This archive
contains a large number of scientific publications and PhD
theses. The MT system is closely integrated into the workflow
of the HAL archive. In particular, the author has the
possibility to correct the provided automatic translations.
These translations will then be used to improve the system.
In this paper, we consider the automatic translation from
English
Three corpora of parallel data are available to train the
translation model: two generic corpora and an in-domain
corpus for adaptation. The first two corpora are Europarl and
News Commentary with 50 million and 3 million words,
respectively. They were used to train our SMT baseline
systems. The third corpus, named “absINFO”, contains 500
thousand words randomly selected from abstracts of scientific
papers in the domain of Computer Science. Information on the
sub-domains is also available (networks, AI, databases,
theoretical CS, ...), but was not used in this study. The
corpus is freely available to support research in domain
adaptation and was already used by the 2012 JHU summer
workshop on this topic. A detailed description of this corpus
can be found in .
This in-domain corpus was split into three sub-corpora:
• absINFO.corr.train is composed of 350k words and
is used to simulate the user post-editing or corrective
• absINFO.dev is a set of 75k words used for development
• absINFO.test is another set of 75k words used as a test
corpus to monitor the performance of our adaptation
Moreover, in order to better simulate a sequential post-
editing process, the absINFO.corr.train corpus was split into
10 sub-sets (about 1.5k sentences with 35k words each). This
corresponds quite well to the update of an MT system after a
post-correction of an entire document.
3.2. Baseline Training
The baseline SMT systems were constructed using the stan-
dard Moses pipeline and Giza-pp for word alignment. In or-
der to later use Inc-Giza-pp, the incremental version of Giza-
pp, we had to train a speciﬁc baseline system using the Hid-
den Markov Model (HMM) word alignment model option.
However, to make a fair comparison of the two adaptation
techniques, the baseline and following systems were trained
on the same data and tuned with MERT  with the same
initial parametrization. The inc-Giza-pp and noGizapp base-
line SMT systems achieve a BLEU score of 35.27 and 35.32
BLEU points on the development corpus respectively, and
31.89 and 32.27 BLEU points on the test corpus.
3.3. Analysis of processing time and alignment quality
The two incremental training approaches are compared with
respect to the BLEU score obtained by adding the additional
aligned data. We also report the time needed to perform the
word alignments. For inc-Giza-pp, the alignment protocol is
composed of several steps (for more details, see footnote 3 and
“Incremental Training” in the “Advanced Features” section of
the Moses user documentation). First, one has to preprocess the data
for use by Giza-pp. This involves updating the vocab ﬁles,
converting the sentences into the snt format of Giza-pp, and
then, updating the co-occurrence ﬁle. Then, Giza-pp is exe-
cuted to update and compute the alignments for the new data.
This is performed in both directions, source-to-translation
and translation-to-source. For each iteration of our experi-
ment, this process takes about 14 minutes.
For the noGizapp system, the time required to perform
the source-to-translation alignment can be considered
negligible because the alignment is obtained implicitly during
translation. The TER between the SMT translation and the
reference translation is computed using a fast and freely
available C++ implementation. This tool can align about 35k
words in about three seconds (corresponding to 1.5k sentences
in the 10% subset of the absINFO.corr.train corpus). The
alignment combination of the source and reference translation,
described in Algorithm 1, takes less than a second. Overall,
we can obtain the source-to-reference alignments of 35k words
in a few seconds only.
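As a rough back-of-the-envelope comparison of these timings, the following lines compute the speed-up per iteration. The one-second figure for the alignment combination is an assumption (the text only states "less than a second"), so the result should be read as an order of magnitude.

```python
# Timings per iteration (one 35k-word subset), taken from the text.
inc_giza_seconds = 14 * 60   # about 14 minutes for inc-Giza-pp
ter_seconds = 3              # about three seconds for TER
combine_seconds = 1          # assumed upper bound ("less than a second")

no_giza_seconds = ter_seconds + combine_seconds
speedup = inc_giza_seconds / no_giza_seconds
words_per_second = 35_000 / no_giza_seconds

print(f"speed-up: about {speedup:.0f}x")
print(f"throughput: about {words_per_second:.0f} words/s")
```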
The BLEU scores on the development (left part) and test
data (right part) are compared in Figure 3. The following
systems were built:
Gizapp: for each subcorpus of the absINFO.corr.train training
data (10%, 20%, 30%, ..., 100%), all the available training
data is concatenated and the full training pipeline is
performed, including a new word alignment which considers all
the training data. We consider this as the upper limit of the
performance we could achieve by incremental training. This
procedure is very time-consuming.
inc-Giza-pp: the subcorpora of the absINFO.corr.train
training data are added using the incremental version of
Giza. This resulted in a slight decrease of the BLEU score on
the development data and a quite unstable performance on the
test data.
noGizapp: incremental training using the new approach
described in this paper. We always used the same baseline SMT
system to translate the additional adaptation data.
inc-noGizapp: like noGizapp, but using the system adapted in
the previous step to translate the additional adaptation data.
Footnote 3: Available online: http://www.statmt.org
Figure 3: BLEU scores of incremental adaptation for our two PBMT systems on both the development and test corpora. The
Inc-Giza-pp system uses the incremental version of Giza-pp to align sentence pairs, while the noGizapp system uses the approach we
present in this paper, which is based on a combination of translation information and edit distance. The ‘Gizapp’ and ‘noGizapp’
curves represent the BLEU scores obtained with an in-domain adaptation of our baseline systems, without the incremental approach,
while the ‘Inc-Giza-pp’ and ‘(incremental) noGizapp’ curves represent the in-domain adaptation scores obtained incrementally.
The proposed approach to obtain incremental word alignments
achieves slightly better BLEU scores on both the development
and the test corpus, and runs much faster.
The large variations on the test corpus could be explained
by two potential reasons. The first one could be the
characteristics of the absINFO.corr.train corpus. It was
created from abstracts of (Computer Science) sub-domains which
were randomly selected. Consequently, a sub-domain
predominantly represented in a sub-corpus of the absINFO corpus
might not be represented in the test corpus. The second reason
could be the use of only one translation model. As explained
above, this translation model is updated with new phrase pairs
extracted at each iteration. Because we only keep the ‘align’
and ‘substitution’ edit types during the edit distance analysis
(see Section 2.1.2), the extracted phrase pairs could be
generic or in-domain. Added to all entries already in the
translation model, these new phrases disturb the probability
distribution. This could also explain why our incremental
systems perform worse than the non-incremental systems (what
we have called “oracle systems”), for which the probability
distribution is better tuned.
Another possibility could be to use two translation mod-
els like . In this way, we can quickly create a phrase-table
from the word alignments of the additional data.
3.4. Combination of translation models
In this section, we present results achieved by combining
several translation models. The techniques described in the
previous sections can significantly speed up the word-alignment
process, in comparison to running incremental Giza-pp, but
we still need to create a new phrase table on all the data.
Therefore, we propose to create a new phrase table on the
newly added data only and to combine it with the original
unadapted phrase table.
3.4.1. Back-off Models
Moses supports several modes to use multiple phrase tables.
We first explored the back-off mode which favors the principal
phrase table: the second phrase table is only considered if
the word or phrase is not found in the first one. The results
are shown in Figure 4. The dotted curve represents the use of
the incrementally trained in-domain translation model with the
generic one as back-off. The crossed curve represents the use
of these same models but in reverse order.
As we can see, we got very different results depending
on which translation model is used ﬁrst, but this can be
easily explained by the nature of the back-off models. Our
in-domain translation model is built with the incrementally
added data only, i.e. very small amounts of data, in particular
during the ﬁrst iterations.
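The back-off principle, and why the order of the two tables matters, can be illustrated with a toy lookup. This is a simplified sketch of the idea and not the Moses implementation; the dictionary-based phrase tables, the function `backoff_lookup` and all entries and scores are made up for the example.

```python
from typing import Dict, List, Tuple

PhraseTable = Dict[str, List[Tuple[str, float]]]


def backoff_lookup(phrase: str, primary: PhraseTable,
                   secondary: PhraseTable) -> List[Tuple[str, float]]:
    """Return options from `primary`; consult `secondary` only on a miss."""
    if phrase in primary:
        return primary[phrase]
    return secondary.get(phrase, [])


# Toy tables with made-up entries and scores.
in_domain: PhraseTable = {"neural network": [("réseau de neurones", 0.9)]}
generic: PhraseTable = {
    "neural network": [("réseau neuronal", 0.6)],
    "the house": [("la maison", 0.8)],
}

# In-domain table first: the generic entry for the same phrase is ignored.
print(backoff_lookup("neural network", in_domain, generic))
# Generic table first: the in-domain entry is never consulted.
print(backoff_lookup("neural network", generic, in_domain))
# Phrase missing from the small in-domain table: back off to the generic one.
print(backoff_lookup("the house", in_domain, generic))
```

With a very small in-domain table, as in the first iterations, most lookups fall through to the generic table when the in-domain table is primary, whereas in the reverse order the in-domain entries are only reached for phrases the generic table does not contain.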
Figure 5 presents the results obtained when jointly using both
translation models. In this configuration, separate translation
options are created for each occurrence, the scores being
combined if the same translation option is found in both
translation models. Compared to the use of only one translation
model, we can observe a significant degradation near 80% of
adaptation data before finally achieving a similar final BLEU
score (up to +0.2 points) compared to inc-Giza-pp and noGizapp.
Figure 4: Results for the use of “back-off” models. The crossed
curve represents our PBMT system using only one translation
model, while the dotted and third curves represent the impact
of using two back-off models in different orders.
Figure 5: Comparison between the use of back-off (dotted curve)
and non-back-off models. The crossed curve represents our
PBMT system using only one translation model. The third
curve represents a PBMT system using both of its translation
models for the decoding path, while the dotted curve shows
our results when using our translation models in back-off mode.
Once again, we believe that the nature of our absINFO
corpus may explain the evolution of our score. When our
SMT system has to translate more generic sentences, it
is likely that the translation options are provided by our
generic translation model rather than our in-domain model.
Based on this observation, we tried to limit edit distance
analysis to substitutions only.
Figure 6: Use of two translation models without back-off,
with and without keeping only the substitutions.
3.4.2. Filtering by edit-distance type
Figure 6 shows the results obtained with an in-domain
translation model trained only on substitutions which were
detected during the edit distance analysis. As we argued in
section 2.1.2, we consider that the “substitution” edit type
corresponds to what the MT system does not know since it
was necessary to ﬁx its output.
As we can see, the previous degradation is less pronounced.
Overall, the evolution of the BLEU score is smoother than for
the other approaches tested so far. By keeping only the phrase
pairs corresponding to substitutions (in the edit path), we
have also limited the contextual phrases in our in-domain
translation model. One should also take into account the
alignment errors, which would have a larger impact in this
configuration on the quality of the translation.
3.4.3. N-best alignment generation
One of the key points presented in this paper is the use of
the translations to generate the alignment links between a
source sentence and its translation generated by the system.
By default, our MT system returns the best translation
candidate after decoding. This means that this translation has
obtained the highest decoding score, but that does not
necessarily mean that the alignment associated with it is the
best.
Based on this observation, we explored the n most likely
translation hypotheses (n-best list). Indeed, a source sentence
could be translated into the same translation using different
segmentations into phrase pairs. With our approach, if we have
multiple alignment candidates for the same sentence-translation
pair, we can generate more source-to-reference alignments and
thus potentially reinforce our in-domain translation model.
Using only the two best non-distinct translation candidates, we
obtained the results shown in Figure 7. Unfortunately, the
results are worse than expected. In future work, we will
investigate other options to use the information in the n-best
lists.
Figure 7: Use of n-best translation candidates to reinforce
alignment possibilities and thus extend our phrase-pair
generation. The starred curve represents our PBMT system for
which we used the first two translation candidates in order
to extract phrase pairs, while the second curve represents the
same system using only the 1-best translation candidate.
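The idea of collecting several alignment candidates for the same translation can be sketched as follows. The representation of an n-best entry as a (translation string, alignment links) pair and the function name `alignments_per_translation` are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Link = Tuple[int, int]


def alignments_per_translation(
    nbest: List[Tuple[str, List[Link]]]
) -> Dict[str, List[List[Link]]]:
    """Group the distinct alignments found for each translation string."""
    grouped: Dict[str, List[List[Link]]] = defaultdict(list)
    for translation, links in nbest:
        key = sorted(links)  # normalise link order before comparing
        if key not in grouped[translation]:
            grouped[translation].append(key)
    return dict(grouped)


# Toy n-best list: the same translation produced by different
# segmentations, hence with different alignment links.
nbest = [
    ("la maison bleue", [(0, 0), (1, 2), (2, 1)]),
    ("la maison bleue", [(0, 0), (1, 1), (1, 2), (2, 1)]),
    ("la maison bleue", [(0, 0), (2, 1), (1, 2)]),  # duplicate of the first
]
print(alignments_per_translation(nbest))
```

Each distinct alignment could then be combined with the same translation-to-reference edit path to produce an additional source-to-reference alignment.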
3.4.4. No tuning step
In the ﬁnal part of the paper, results from an incremental
adaptation of a PBMT system without tuning step are pre-
sented. This procedure is very time-efﬁcient and stable since
we do not apply tuning at every adaptation step. We argue
that we do not need to re-tune our models since adaptation
only adds small amounts of information. Tuning is only ap-
plied at the creation of the model, and the resulting parame-
ters are maintained during the adaptation process. The results
of this procedure are shown in Figure 8.
First, we can observe a clear difference between the
squared and the dotted curves at the 10% adaptation level,
even though they result from the same approach. This is due
to the baseline that we applied: by default, our PBMT system
uses only one phrase table. However, we need to tune a
“new baseline system” using two phrase tables (the one at the
10% level), whose tuning weights are then kept fixed
throughout adaptation.
Second, the resulting curve is rather smooth, which suggests
that the fluctuations observed with tuning at every step stem
from the instability of the tuning process.
To sum up, by applying our incremental adaptation, we obtain
a clear improvement in BLEU scores (+0.5 points) without the
need to retune at every adaptation step. Tuning can be
performed at larger time intervals, for example, in an
industrial post-editing context, every night or as soon as
processing resources become available.
Figure 8: Results for incremental adaptation with no tuning
step. The squared curve represents a PBMT system with the
normal tuning process performed at each adaptation iteration,
while the dotted curve represents the same system for which
the tuning weights obtained at the 10% level are kept fixed
throughout the entire adaptation.
4. Conclusion and Future Work
In this paper, we have presented a new word-to-word
alignment methodology for incremental adaptation using a
phrase-based MT system. This method uses the information
generated during the translation step and then relies on an
analysis of a (simulated) post-editing step to infer a source-
to-reference alignment at the word level.
Compared to incremental Giza, the standard method currently
used in the field, the first part of our experiments shows
that our approach allows us to obtain similar performance
in terms of BLEU score at a significantly improved speed.
Incremental Giza needs several minutes to align two corpora
of about 35k words while the approach proposed in this paper
runs in a few seconds. Our approach could therefore be
integrated into an interface dedicated to post-editing which
would exploit user feedback in real time.
The second part of this article was dedicated to experi-
ments on translation model combination. These experiments
show that we can get better results by jointly using two trans-
lation models instead of only one. The results of these exper-
iments suggest some directions for future research. For ex-
ample, the use of the TER algorithm for analyzing the post-
editing result could be reinforced by the notion of “Post Edit
Actions” introduced by , in order to better identify errors
of the SMT system.
This research was partially ﬁnanced by the DGA and
the ANRT under CIFRE-Defense 7/2009, the French ANR
project COSMAT under ANR-09-CORD-004, and the European
Commission under the project MATE CAT, ICT-2011.4.2 – 287688.
 M. Koponen, “Comparing human perceptions of post-editing
effort with post-editing operations,” in Proceedings of the
Seventh Workshop on Statistical Machine Translation,
pp. 181–190, June 2012.
 F. Blain, J. Senellart, H. Schwenk, M. Plitt, and
J. Roturier, “Qualitative analysis of post-editing for
high quality machine translation,” in Machine Translation
Summit XIII, Asia-Pacific Association for Machine Translation
(AAMT), Ed., Xiamen (China), 19-23 Sept. 2011.
 D. Hardt and J. Elming, Incremental Re-training for
Post-editing SMT., 2010.
 F. Och and H. Ney, “A systematic comparison of var-
ious statistical alignment models,” Computational lin-
guistics, vol. 29, no. 1, pp. 19–51, 2003.
 A. Levenberg, C. Callison-Burch, and M. Osborne,
“Stream-based translation models for statistical ma-
chine translation,” in Human Language Technologies:
The 2010 Annual Conference of the North American
Chapter of the Association for Computational Linguis-
tics. Association for Computational Linguistics, 2010,
 P. Koehn, H. Hoang, A. Birch, C. Callison-Burch,
M. Federico, N. Bertoldi, B. Cowan, W. Shen,
C. Moran, R. Zens, et al., “Moses: Open source toolkit
for statistical machine translation,” in Annual meeting-
association for computational linguistics, vol. 45, no. 2,
2007, p. 2.
 M. Snover, B. Dorr, R. Schwartz, L. Micciulla, and
J. Makhoul, “A study of translation edit rate with tar-
geted human annotation,” in Proceedings of Associa-
tion for Machine Translation in the Americas, 2006, pp.
 M. Snover, N. Madnani, B. Dorr, and R. Schwartz,
“Fluency, adequacy, or HTER? Exploring different human
judgments with a tunable MT metric,” in Proceedings of
the Fourth Workshop on Statistical Machine Translation,
vol. 30. Association for Computational Linguistics,
2009, pp. 259–268.
 P. Lambert, H. Schwenk, and F. Blain, “Automatic
translation of scientific documents in the HAL archive,” in
Proceedings of the Eighth International Conference on
Language Resources and Evaluation (LREC’12). Istanbul,
Turkey: European Language Resources Association (ELRA),
May 2012, pp. 3933–3936.
 F. Och, “Minimum error rate training in statistical ma-
chine translation,” in Proceedings of the 41st Annual
Meeting on Association for Computational Linguistics-
Volume 1. Association for Computational Linguistics,
2003, pp. 160–167.