Conference PaperPDF Available

MT Adaptation for Under-Resourced Domains – What Works and What Not

Authors:

Abstract and Figures

In this paper the authors present various techniques of how to achieve MT domain adaptation with limited in-domain resources. This paper gives a case study of what works and what not if one has to build a domain specific machine translation system. Systems are adapted using in-domain comparable monolingual and bilingual corpora (crawled from the Web) and bilingual terms and named entities. The authors show how to efficiently integrate terms within statistical machine translation systems, thus significantly improving upon the baseline.
Content may be subject to copyright.
MT Adaptation for Under-Resourced
Domains – What Works and What Not
MƗrcis PINNIS
1
and Raivis SKADIƻŠ
2
Tilde, Latvia
Abstract. In this paper the authors present various techniques of how to achieve
MT domain adaptation with limited in-domain resources. This paper gives a case
study of what works and what not if one has to build a domain specific machine
translation system. Systems are adapted using in-domain comparable monolingual
and bilingual corpora (crawled from the Web) and bilingual terms and named
entities. The authors show how to efficiently integrate terms within statistical
machine translation systems, thus significantly improving upon the baseline.
Keywords. statistical machine translation, domain adaptation, comparable corpora
Introduction
This paper focuses on a very practical aspect of statistical machine translation (SMT) –
tailoring it to a particular narrow domain. The state-of-the-art SMT has reached a level
when it can be used by professional translators to improve productivity (for example,
see [1]), but to train practically usable domain specific SMT systems we need a
significant amount of parallel and monolingual corpora [2][3][1]. In this paper we are
searching for methods allowing us to build a domain specific SMT system even if we
have a very limited in-domain parallel corpus consisting of just a few thousand parallel
sentences.
If we do not have a big domain specific parallel corpus we can look for other
resources that could compensate for it. In this paper we show how we can benefit from
in-domain texts in the Web, e.g., how we can collect or crawl an in-domain comparable
corpus from the Web and how we can use it to build domain specific SMT systems. We
are showing how general out-of-domain SMT systems can be tailored using data
extracted from the in-domain comparable corpus. Particularly we are dealing with
domain specific terminology and named entities (NE). We extract terms and named
entities from initial parallel training data. These terms and named entities are used to
collect a comparable corpus from the Web. Then we extract parallel terms from the
collected comparable corpus, and finally we integrate them in the SMT system. The
1
Corresponding author: Tilde, VienƯbas gatve 75a, Riga, Latvia; E-mail: marcis.pinnis@tilde.lv
2
Corresponding author: Tilde, VienƯbas gatve 75a, Riga, Latvia; E-mail: raivis.skadins@tilde.lv
Human Language Technologies – The Baltic Perspective
A. Tavast et al. (Eds.)
© 2012 The Authors and IOS Press.
This article is published online with Open Access by IOS Press and distributed under the terms
of the Creative Commons Attribution Non-Commercial License.
doi:10.3233/978-1-61499-133-5-176
176
adapted SMT system quality changes are evaluated in respect to a general out-of-
domain baseline system. The process is thoroughly described in the further sections.
1. Baseline System
We start our experiments with the creation of an English-Latvian baseline system. In
the experiments we assume that the following data is available beforehand:
xa relatively large out-of-domain parallel corpus. For this paper we used the
publicly available DGT-TM
3
English-Latvian parallel corpus (release of 2007).
The corpus consists of 804,501 unique parallel sentence pairs and 791,144
unique Latvian sentences. The monolingual corpus is used for language
modeling.
xa small amount of in-domain parallel sentences (up to two or three thousand
parallel sentences). In our experiments we have selected the automotive
domain (more precisely, service manuals) as the target domain. The in-domain
data is split in two sets - tuning and evaluation. The tuning set and the
evaluation set consist of 1,745 and 872 unique sentence pairs from the
automotive domain. All systems were tuned with minimum error rate training
(MERT [4]) using the in-domain tuning set and evaluated on the evaluation set.
For MT system training (including the baseline system) we use the LetsMT! [5]
Web-based platform for SMT system creation. The LetsMT! platform is built upon the
state-of-the-art Moses [6] SMT experiment management system (EMS).
The baseline system’s results using different automatic evaluation methods (BLEU
[7], NIST [8], TER [9], and METEOR [10]) are given in Table 1.
Table 1 Baseline system’s evaluation results
Case sensitive BLEU NIST TER METEOR
No 10.97 3.9355 89.75 0.1724
Yes 10.31 3.7953 90.40 0.1301
2. Initial Extraction and Alignment of Terms and Named Entities
The first step in our SMT system adaptation technique is acquisition of translated in-
domain term pairs. Bilingual terminology will allow making the SMT system term-
aware and will allow finding better translation candidates for narrow domain
translation tasks. To acquire the term pairs we use bilingual comparable corpora from
the Web.
In order to find important domain specific documents on the Web, we use the
small amount of available parallel data and extract seed terms and named entities for a
focussed narrow domain Web crawl. Terms and named entities are monolingually
tagged in the parallel in-domain data. For terms we use the Tilde’s Wrapper System for
CollTerm (TWSC) [11] and for named entities – TildeNER [12] for Latvian and
3
The DGT Multilingual Translation Memory of the Acquis Communautaire: DGT-TM (available at:
http://langtech.jrc.ec.europa.eu/DGT-TM.html).
M. Pinnis and R. Skadi¸nš / MT Adaptation for Under-Resourced Domains 177
OpenNLP
4
for English. In parallel, a Moses phrase table is created from the in-domain
parallel data.
Then the monolingually tagged terms and NEs (in our experiment 542 unique
English and 786 unique Latvian units in total) are bilingually aligned using the Moses
phrase table. At first we try to find all symmetric term and named entity phrases in the
phrase table that have been monolingually tagged in both languages. We allow only
full phrase table entry and term or named entity alignments, that is, a phrase is
considered valid only if all tokens from the phrase are identical to tokens of the
corresponding term or named entity. In order to allow also inflective form alignments,
all tokens of all terms, named entities and phrases are stemmed prior to alignment. This
allows finding more translation candidates in cases when some inflective forms have
not been tagged as terms, but others have.
After the symmetric alignment we align also terms and named entities that have
been tagged by only one of the monolingual taggers. If a phrase is aligned in the phrase
table with multiple phrases from the other language, we select the translation candidate
that has the highest averaged (source-to-target and target-to-source) translation
probability within the phrase table. This step allows finding terms and NEs, which have
been missed by one of the monolingual taggers, thus increasing the amount of extracted
term and named entity phrases. The alignment method on the in-domain parallel data
produced 783 bilingually aligned term and NE phrases.
3. Comparable Corpora Collection
The second step in our SMT system adaptation technique requires collection of
bilingual in-domain comparable corpora from the Web. We use the bilingual terms and
NEs that were extracted from the parallel in-domain data as seed terms for focussed
monolingual crawling of two monolingual narrow domain Web corpora with the FMC
[13] crawler. By using bilingually aligned seed terms we ensure that the crawled
corpora will be comparable and within one domain for both English and Latvian
languages. As the aligned seed terms may contain also out-of-domain or cross-domain
term and NE phrases, we apply a ranking method based on reference corpus statistics,
more precisely, we use the inverse document frequency (IDF) [14] scores of words
from general (broad) domain corpora (for instance, the whole Wikipedia and current
news corpora) to weigh the specificity of a phrase. We rank each bilingual phrase using
the following equation:
ܴ൫݌
௦௥௖
ǡ݌
௧௥௚
൯ൌ݉݅݊
σܫܦܨ
௦௥௖
൫݌
௦௥௖
݅
ȁೞೝ೎ȁ
௜ୀଵ
ǡσܫܦܨ
௧௥௚
ቀ݌
௧௥௚
݆
ห௣೟ೝ೒
௝ୀଵ
(1)
where ݌
௦௥௖
and ݌
௧௥௚
denote phrases in the source and target languages and ܫܦܨ
௦௥௖
and ܫܦܨ
௧௥௚
denote the respective language IDF score functions that return an IDF score
for a given token. The ranking method was selected through a heuristic analysis so that
specific in-domain term and named entity phrases would be ranked higher than broad-
domain or cross-domain phrases. This technique also allows filtering out phrase pairs
4
Apache OpenNLP (available at: http://opennlp.apache.org/).
M. Pinnis and R. Skadi¸nš / MT Adaptation for Under-Resourced Domains178
where a phrase may have a more general meaning in one language, but a specific
meaning in the other language. After applying a threshold on the ranks, 614 phrase
pairs were kept in the seed term list for corpora collection.
Additionally to the seed terms FMC requires seed URLs. In total 55 English and
14 Latvian in-domain seed URLs were manually collected.
When the seed terms and seed URLs were acquired, a 48 hour focussed
monolingual web crawl was initiated for both languages. The collected English and
Latvian corpora were filtered for duplicates, broken into sentences and tokenised. The
statistics of the collected corpora are given in Table 2.
Table 2 Monolingual automotive domain corpora statistics
Language Unique
Documents
Sentences Tokens Unique
Sentences
Tokens in Unique
Sentences
English 34,540 8,743,701 58,526,502
1,481,331 20,134,075
Latvian 6,155 1,664,403 15,776,967
271,327 4,290,213
Both monolingual corpora were aligned in the document level using DictMetric
[15], a tool that scores document pair comparability and aligns document pairs that
exceed a specified comparability score threshold. Executing DictMetric on narrow
domain comparable corpora may cause over-generation of document pairs, that is,
every document from one language can be paired with many documents from the other
language. Therefore, we filtered the document alignments so that each Latvian
document would be paired with the top three comparable English documents and vice
versa, thus creating 81,373 document pairs. The comparable corpus statistics after
document level alignment are given in Table 3.
Table 3 English-Latvian automotive comparable corpus statistics
Language Unique
Documents
Unique
Sentences
Tokens in Unique
Sentences
English 24,124 1,114,609 15,660,911
Latvian 5,461 247,846 3,939,921
4. Extraction of Term Pairs from Comparable Corpus
Once the bilingual comparable corpus is collected, the third step is to extract
translated term pairs. Both parts (the Latvian and the English documents) similarly as
in the first step are monolingually tagged with TWSC. In this step we tag only terms as
the precision of named entity mapping without a phrase table is well below 90% and
this would create unnecessary noise in the extracted data for SMT adaptation. Then
using the document alignment information of the comparable corpus we map terms
bilingually using the TerminologyAligner (TEA) [15][11] tool with a translation
confidence score threshold of 0.7 (with a precision of 90% and higher [11]). In total
369 in-domain term pairs were extracted from the bilingual comparable corpus.
It is possible to use these newly extracted terms in an iterative comparable corpora
collection process, thus bootstrapping also the in-domain translated term pair collection.
M. Pinnis and R. Skadi¸nš / MT Adaptation for Under-Resourced Domains 179
However, in this paper we limit corpora collection to only one iteration in order to have
a proof-of-concept of the whole SMT system adaptation process.
5. SMT System Adaptation
Following domain adaptation methods suggested in earlier research [2][3] we start the
SMT adaptation task by adding an in-domain language model built using the Latvian
monolingual comparable corpora that was collected in the second step. We built the
SMT system (named Int_LM) using two language models (a general and an in-domain
model). Both language models have different weights determined with system tuning
(MERT). The in-domain monolingual language model increases SMT quality to 11.3
BLEU points (a relative increase of only 3.0% over the baseline system). We trained
also an SMT system (named In-domain_LM_only) using only the in-domain language
model. The experiment achieved 11.16 BLEU points, which is an increase over the
baseline system, but also a decrease over the Int_LM system. This was expected as
MERT has tuned the in-domain language model to be more important, but the in-
domain language model may not contain some general language phrases that the broad
domain corpus has (thus also interpolation of the two models achieves a higher score).
We continue our experiments by adding the translated term pairs (in total 610) that
were extracted from the in-domain tuning set to the parallel data corpus and the
corresponding Latvian translations to the in-domain monolingual corpus, from which
the SMT system is trained. This simple addition of in-domain term translations to the
SMT system (named Int_LM+T_Terms) increased the quality to 12.93 BLEU points (a
relative increase of 17.8% over the baseline system). After adding also term pairs
extracted from the comparable corpus collected from the Web (in total 369 new pairs)
the quality of the system (named Int_LM+T&CC_Terms) increased to 13.5 BLEU
points (a relative increase of 23.1% over the baseline system).
Considering also term banks as possible translated term resources, we extracted
6,767 unique in-domain automotive term pairs from EuroTermBank
5
. Then we trained
an SMT system (named Int_LM+ETB_Terms) with the same parameters as the
Int_LM+T_Terms system. The system achieved 11.26 BLEU points, which is a
decrease in comparison with the Int_LM system and much worse than
Int_LM+T&CC_Terms (the best thus far performing system). The reason for the
decrease is fairly simple – term banks in many cases provide multiple translation
candidates for a single term. This causes ambiguities in the translation model and can
result in selection of the wrong translation hypothesis. To solve this issue (at least
partially), the term pairs from the term bank would have to be semantically
disambiguated in respect to the required domain so that only the correct in-domain
pairs would be used in the SMT system training.
Recent results in MT system adaptation [16] suggest that pseudo-parallel sentence
pairs extracted from in-domain comparable corpora and used for SMT system training
can significantly improve SMT system quality. Using the same pseudo-parallel
sentence extraction tool (LEXACC [15]) we extracted 6,718 and 678 unique sentence
pairs with two parallelism confidence score thresholds 0.51 and 0.35 (the thresholds
were based on previous evaluation on comparable news domain corpora). These
sentence pairs were then added to the available parallel data and the in-domain
5
EuroTermBank - the largest free online terminology resource (http://www.eurotermbank.com/)
M. Pinnis and R. Skadi¸nš / MT Adaptation for Under-Resourced Domains180
monolingual corpus. The results after training the SMT systems (named
Int_LM+LEXACC_0.35 and Int_LM+LEXACC_0.51) show a decrease in BLEU points
(10.75 and 11.08 respectively) in comparison with the Int_LM system. After manually
analysing the MT output of Int_LM+LEXACC_0.35 in comparison with the baseline
system, it is evident that the translation quality has decreased because of non-parallel
sentence alignments in the LEXACC extracted sentence pairs that cause in-domain
term phrase pairs to receive lower weights (translation probability scores) in the
translation model. Although, in-domain terms in the pseudo-parallel sentences are in
many cases paired with correct translations, they are often also paired with incorrect
translations, thus creating noise for the translation model. This is not to say that the
pseudo-parallel sentences in general do not help improving SMT quality, but that for
very narrow and under-resourced domains where it is difficult to find strongly
comparable in-domain corpora in the Web, the results can lower translation quality
because of incorrect term translation hypothesis. We have shown in [17] that in cases
where large strongly comparable in-domain corpora are available, the pseudo-parallel
sentences extracted from the corpora (up to 500,000 sentence pairs and more) can
achieve a translation quality increase of up to five times in comparison to the baseline
system. The challenge, however, is finding such in-domain strongly comparable
corpora.
So far in our experiments only the in-domain language model helps distinguishing
in-domain translation hypotheses from broad (general) domain hypotheses. Therefore,
in the next step we transformed the Moses phrase table of the translation model to an
in-domain term-aware phrase table. We do this by adding a sixth feature to the default
5 features that are used in Moses phrase tables. The 6
th
feature receives the following
values:
x“1” if a phrase on both sides (in both languages) does not contain a term pair
from a bilingual term list. If a phrase contains a term only on one side (in one
language), but not on the other, it receives the value “1” as such situations
indicate about possible out-of-domain (wrong) translation candidates.
x“2” if a phrase on both sides (in both languages) contains a term pair from the
term list.
In order to find out whether a phrase in the phrase table contains a given term or
not, phrases and terms are stemmed prior to comparison. This allows finding inflected
forms of term phrases even if those are not given in the bilingual term list. The sixth
feature identifies phrases containing in-domain term translations and allows filtering
out out-of-domain (wrong) translation hypothesis in the translation process.
With the described methodology we transformed phrase tables of the systems
Int_LM+T_Terms (using the 610 tuning data term pairs) and Int_LM+T&CC_Terms
(using additionally the 369 term pairs from the comparable corpora) to term-aware
phrase tables. After tuning with MERT two new systems were created. The system
Int_LM+T_Terms+6
th
achieves 13.19 BLEU points and the system
Int_LM+T&CC_Terms+6
th
achieves 13.61 BLEU point (a relative increase of 24.1%
over the baseline system and the highest measured increase in this experiment).
Although the increase in translation quality over the systems without the 6
th
feature is
relatively small, the translations show better translation hypotheses selection for in-
domain terminology.
Complete results of the previously described automotive domain systems are
shown in Table 4 (“CS” stands for “Case Sensitive” evaluation).
M. Pinnis and R. Skadi¸nš / MT Adaptation for Under-Resourced Domains 181
To show that improvements in SMT quality are consistent also using larger
corpora, we trained a new English-Latvian baseline system (Big_Baseline) using
5,363,043 parallel sentence pairs for translation model training and 33,270,743
monolingual Latvian sentences for the language model training. The system was tuned
using the same tuning set and evaluated on the same evaluation set as before. The
adapted systems (Big_Int_LM+T&CC_Terms and Big_Int_LM+T&CC_Terms+6
th
)
were built exactly as the Int_LM+T&CC_Terms and Int_LM+T&CC_Terms+6
th
systems from the previous experiment. The results (in Table 5) show a relative BLEU
increase of 8.8% and 14.9% for the system without the 6
th
feature and with the 6
th
feature over the baseline. As more data creates higher ambiguity, the 6
th
feature allows
increasing the results significantly more than in the previous experiment. This shows
the potential of the method when applied on larger corpora.
Table 4 English-Latvian automotive domain SMT system adaptation results
System BLEU BLEU
(CS)
NIST NIST
(CS)
TER TER
(CS)
METEOR METEOR
(CS)
Baseline 10.97 10.31 3.9355 3.7953 89.75 90.40 0.1724 0.1301
Int_LM 11.30 10.61 3.9606 3.8190 89.74 90.34 0.1736 0.1312
In-domain_LM_only 11.16 10.52 3.9447 3.8074 89.31 89.92 0.1726 0.1305
Int_LM+T_Terms 12.93 12.12 4.2243 4.0598 88.58 89.32 0.1861 0.1418
Int_LM+T&CC_Terms 13.50 12.65 4.2927 4.1105 88.86 89.70 0.1878 0.1443
Int_LM+ETB_Terms 11.26 10.52 3.9456 3.7882 89.43 90.04 0.1737 0.1290
Int_LM+LEXACC_0.35 10.75 10.09 3.7935 3.6682 90.31 90.86 0.1646 0.1229
Int_LM+LEXACC_0.51 11.08 10.28 3.9132 3.7709 90.23 90.78 0.1706 0.1286
Int_LM+T_Terms+6
th
13.19 12.36 4.2657 4.0962 88.84 89.62 0.1876 0.1439
Int_LM+T&CC_Terms+6
th
13.61 12.78 4.3514 4.1747 88.54 89.32 0.1920 0.1469
Table 5 English-Latvian automotive domain big SMT system adaptation results
System BLEU BLEU
(CS)
NIST NIST
(CS)
TER TER
(CS)
METEOR METEOR
(CS)
Big_Baseline 15.85 15.00 4.8448 4.6934 73.80 75.12 0.2098 0.1651
Big_Int_LM+T&
CC_Terms 17.24 16.12 5.0020 4.8278 72.16 73.59 0.2163 0.1717
Big_Int_LM+T&
CC_Terms+6
th
18.21 17.08 5.1476 4.9626 70.22 71.62 0.2191 0.1747
Conclusion
In this paper we have presented techniques for SMT domain adaptation utilizing
bilingual terms and bilingual comparable corpora collected from the Web. The
experiment results show that integration of terminology within SMT systems even with
simple techniques (adding translated term pairs to the parallel data corpus or adding an
in-domain language model) can achieve an SMT system quality improvement of up to
23.1% over the baseline system. Transformation of translation model phrase tables into
term-aware phrase tables can boost the quality up to 24.1% over the baseline system
mostly because of wrong translation candidate filtering in the translation process.
The experiments also show that the usage of pseudo-parallel sentence pairs extracted
from weakly comparable narrow-domain corpora and term pairs acquired from term
banks without a sophisticated term sense disambiguation and semantic analysis of the
M. Pinnis and R. Skadi¸nš / MT Adaptation for Under-Resourced Domains182
source text may not result in increased SMT quality due to the added noise in in-
domain translation hypotheses.
Acknowledgements
The research within the project ACCURAT leading to these results has received
funding from the European Union Seventh Framework Programme (FP7/2007-2013),
grant agreement n° 248347. The research within the project TaaS leading to these
results has received funding from the European Union Seventh Framework Programme
(FP7/2007-2013), grant agreement n° 296312. This work has been supported by the
European Social Fund within the project «Support for Doctoral Studies at University of
Latvia».
References
[1] 01SkadiƼš, R., PuriƼš, M., SkadiƼa, I., and Vasiƺjevs, A. Evaluation of SMT in localization to under-
resourced inflected language. In Proceedings of the 15th International Conference of the European
Association for Machine Translation EAMT 2011, 2011, p. 35-40, Leuven, Belgium.
[2] 02Koehn, P. and Schroeder, J. Experiments in domain adaptation for statistical machine translation. In:
Proceedings of the Second Workshop on Statistical Machine Translation, 2007, Prague.
[3] 03Lewis, W., Wendt, C. and Bullock, D., Achieving Domain Specificity in SMT without Overt Siloing.
In: Proceedings of the 7
th
International Conference on Language Resources and Evaluation (LREC
2010), 2010.
[4] 04Bertoldi, N., Haddow, B., Fouet, J.B. Improved Minimum Error Rate Training in Moses, The Prague
Bulletin of Mathematical Linguistics, Vol. 91 (2009), p. 7-16, Prague, Czech Republic.
[5] 05Vasiljevs, A., Gornostay, T. and Skadins, R. LetsMT! – Online Platform for Sharing Training Data and
Building User Tailored Machine Translation. In: Proceedings of the Fourth International Conference
Baltic HLT 2010, 2010, p. 133-140, Riga, Latvia.
[6] 06Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W.,
Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A. and Herbst, E. Moses: Open Source Toolkit for
Statistical Machine Translation, Annual Meeting of the Association for Computational Linguistics
(ACL), demonstration session, June 2007, Prague, Czech Republic.
[7] 07Papineni, K., Roukos, S., Ward, T., and Zhu, W. J. BLEU: a method for automatic evaluation of
machine translation. In Proceedings of ACL-2002: 40
th
Annual meeting of the Association for
Computational Linguistics, 2002, p. 311-318.
[8] 08Doddington, G. Automatic evaluation of machine translation quality using n-gram co-occurrence
statistics. In: Proceedings of the second international conference on Human Language Technology
Research (HLT 2002), 2002, p. 138-145, San Diego, USA.
[9] 09Snover, M., Dorr, B., Schwartz, R., Micciulla, L. and Makhoul, J. A Study of Translation Edit Rate
with Targeted Human Annotation. In: Proceedings of Association for Machine Translation in the
Americas, 2006.
[10] 10Banerjee, S. and Lavie, A. METEOR: An Automatic Metric for MT Evaluation with Improved
Correlation with Human Judgments. In: Proceedings of the Workshop on Intrinsic and Extrinsic
Evaluation Measures for MT and/or Summarization at the 43rd Annual Meeting of the Association of
Computational Linguistics (ACL 2005), June 2005, Michigan, USA.
[11] 11Pinnis, M., Ljubešiü, N., ùtefănescu, D., SkadiƼa, I., Tadiü, M. and Gornostay, T. Term Extraction,
Tagging and Mapping Tools for Under-Resourced Languages. In: Proceedings of the 10
th
Conference
on Terminology and Knowledge Engineering (TKE 2012), 20-21 June, 2012, Madrid, Spain.
[12] 12Pinnis, M. Latvian and Lithuanian Named Entity Recognition with TildeNER. In Proceedings of
LREC 2012, 21-27 May, 2012, Istanbul, Turkey.
[13] 13ACCURAT D3.5. Tools for building comparable corpus from the Web, version 3.0, 29
th
June, 2012
(http://www.accurat-project.eu/), 46 pages.
[14] 14Spärck Jones, K. A statistical interpretation of term specificity and its application in retrieval. Journal
of Documentation, volume 28, 1972, p. 11-21.
M. Pinnis and R. Skadi¸nš / MT Adaptation for Under-Resourced Domains 183
[15] 15ACCURAT D2.6 (2011). Toolkit for multi-level alignment and information extraction from
comparable corpora, version 3.0. 29
th
June, 2012 (http://www.accurat-project.eu/), 164 pages.
[16] 16ùtefănescu, D., Ion, R. and Hunsicker, S. Hybrid Parallel Sentence Mining from Comparable
Corpora. In: Proceedings of the 16th Annual Conference of the European Association for Machine
Translation (EAMT 2012), 2012, p. 137-144, Trento, Italy.
[17] 17ACCURAT D5.4. Report on requirements, implementation and evaluation of usability in application
for software localization, version 1.0. 29
th
June, 2012 (http://www.accurat-project.eu/), 38 pages.
M. Pinnis and R. Skadi¸nš / MT Adaptation for Under-Resourced Domains184
... Only few studies dealing with some specific groups of MWEs (e.g. phrasal verbs, terminology) investigate automatic translation of MWEs into morphology rich free word order under-resourced language [7,8,9]. ...
... For domain adaptation where MWEs mainly represent terminological units several authors [6,9] use automatically extracted bilingual domain dictionaries of multiword expressions as an additional resource during training. Three strategies which were already mentioned before -retraining SMT with MWEs as parallel sentence pairs, new feature in translation table for bilingual MWES, additional phrase table -are applied. ...
... For English-Latvian machine translation, Deksne et al. [8] propose to use special dictionary of MWEs and include MWE processing step in a rule-based machine translation. Pinnis and Skadiņš [9] analyse terminology translation problem for a narrow domain English-Latvian SMT system. They report transformation of translation model into term-aware phrase tables as the most successful approach. ...
... translation (Pinnis & Skadiņš, 2012). ...
... For terms, the methods described in section 2.2 were used and for named entities the TildeNER (Pinnis, 2012) named entity recogniser for Latvian and OpenNLP 22 for English were used. Then, the monolingually identified terms and named entities (542 unique English and 786 unique Latvian units in total) were cross-lingually mapped using the parallel tuning data and the methodology by Pinnis & Skadiņš (2012). As a result, 783 term and NE phrase pairs were identified. ...
... Consequently, the recall and the effectiveness of the method is higher than for morphologically richer languages. E.g., for Latvian, which is a morphologically rich language, even when translating from English, as shown by Pinnis and Skadiņš (2012), the method does not show quality improvements when using a term collection from an authoritative source (the EuroTermBank) because of two main reasons: 1) in Latvian terms appear in many different inflected forms, and 2) many of the terms in the authoritative data base are ambiguous (they may have multiple translation equivalents listed, which all may represent the same terminological concept), thus the addition of new term pairs causes more statistical uncertainty for the SMT system. However, it also does not show a quality decrease, which for the method in general is a positive result. ...
Thesis
Full-text available
The aim of this doctoral thesis is to research methods and develop tools that allow successfully integrating bilingual terminology into statistical machine translation systems so that the translation quality of terminology would increase and that the overall translation quality of the source text would increase. The author presents novel methods for terminology integration in SMT systems during training (through static integration) and during translation (through dynamic integration). The work focusses not only on the SMT integration techniques, but also on methods for acquisition of linguistic resources necessary for different tasks involved in workflows for terminology integration in SMT systems. The thesis describes and evaluates methods designed and implemented by the author for: 1) monolingual term identification in SMT system training data as well as documents submitted for translation, 2) term normalisation for acquisition of canonical forms of terms from terms in different inflected forms, 3) cross-lingual term mapping in parallel and comparable corpora collected from the Web, 4) probabilistic dictionary filtering in order to acquire resources for cross-lingual term mapping, 5) development of character-based SMT transliteration systems from probabilistic dictionaries, 6) inflected form generation for terms through rule-based morphological synthesis or monolingual corpus look-up, and other methods involved in the workflows for static and dynamic terminology integration in SMT systems. The terminology integration methods have been evaluated using the Moses SMT system and the LetsMT platform. The evaluation efforts show that the methods for monolingual term identification and cross-lingual term mapping allow achieving state-of-the-art performance, which has been also validated by third party (independent) evaluation efforts. The static terminology integration methods allow achieving a cumulative SMT quality improvement by up to 28.1% (or 3.56 absolute BLEU points) over an initial baseline system for the English-Latvian language pair. However, the most impressive achievement of the author’s work is the dynamic terminology integration method in SMT systems using a source text pre-processing workflow. In almost all experiments performed in the scope of the thesis the methods allowed achieving SMT quality improvements. Automatic evaluation for four investigated language pairs in the automotive domain shows SMT quality improvements by up to 26.9% (or 3.41 absolute BLEU points) over baseline systems. Manual comparative evaluation performed for seven language pairs in the information technology domain shows that the proportion of correctly translated terms increases for all language pairs by up to +52.6%.
... ||0.166667">The blue spruce</t> is a popular Christmas tree. (1) • The linguistically, statistically, and reference corpus motivated term identification method for semi-automatic creation of term collections • The Fast Term Identification method for term identification in SMT system training data designed for morphologically rich languages (Pinnis & Skadiņš, 2012;Pinnis, 2015) • The context independent cross-lingual term mapping method that performs term mapping in sub-word level using maximised character alignment maps (Pinnis, 2013) • The probabilistic dictionary filtering method using character-based SMT transliteration systems (Aker et al., 2014) • The character-based SMT transliteration system creation method using transliteration dictionaries that have been automatically created with a bootstrapping method (Pinnis, 2014) • The method for static terminology integration in SMT systems that transforms the SMT system phrase tables into term-aware phrase tables (Pinnis & Skadiņš, 2012) • The multi-dimensional method for dynamic terminology integration in SMT systems using a source text pre-processing workflow (Pinnis, 2015) Main Results ...
... ||0.166667">The blue spruce</t> is a popular Christmas tree. (1) • The linguistically, statistically, and reference corpus motivated term identification method for semi-automatic creation of term collections • The Fast Term Identification method for term identification in SMT system training data designed for morphologically rich languages (Pinnis & Skadiņš, 2012;Pinnis, 2015) • The context independent cross-lingual term mapping method that performs term mapping in sub-word level using maximised character alignment maps (Pinnis, 2013) • The probabilistic dictionary filtering method using character-based SMT transliteration systems (Aker et al., 2014) • The character-based SMT transliteration system creation method using transliteration dictionaries that have been automatically created with a bootstrapping method (Pinnis, 2014) • The method for static terminology integration in SMT systems that transforms the SMT system phrase tables into term-aware phrase tables (Pinnis & Skadiņš, 2012) • The multi-dimensional method for dynamic terminology integration in SMT systems using a source text pre-processing workflow (Pinnis, 2015) Main Results ...
... Terms and their translations, as one of the sub-sentential units that can be frequently found in comparable corpora, have shown to be beneficial for statistical machine translation system adaptation to narrow domains (Pinnis and Skadiņš 2012). Therefore, this section presents three different directions of methods that allow extracting bilingual term pairs from comparable corpora. ...
Chapter
Comparable corpora may comprise different types of single-word and multi-word phrases that can be considered as reciprocal translations, which may be beneficial for many different natural language processing tasks. This chapter describes methods and tools developed within the ACCURAT project that allow utilising comparable corpora in order to (1) identify terms, named entities (NEs), and other lexical units in comparable corpora, and (2) to cross-lingually map the identified single-word and multi-word phrases in order to create automatically extracted bilingual dictionaries that can be further utilised in machine translation, question answering, indexing, and other areas where bilingual dictionaries can be useful.
... As reported in sect. 2, methods in which terminology is extracted from other resources and then added to training data sets (Bouamor et al., 2012) are less effective than, for example, approaches in which terminology is extracted from the training data set (Ren et al., 2009;Pinnis and Skadinš, 2012) or injected dynamically, i.e. at run-time without re-training, into the MT engine (Arcan et al., 2014a). In our case the Italian-English term pairs were 4,143 against the 34,800 sentence pairs of the training data set, while for German-English we had 5,465 term pairs as compared to 27,300 sentence pairs. ...
... We show that approaches based on word alignment and term translation applied to the daily extracted data are more robust and more efficient than the state-of-the-art method based on pre-trained classifiers (Aker, Paramita and Gaizauskas 2013). Second, we cannot choose an SMT solution that requires the translation service to be regularly stopped to re-train the model, as shown in Bouamor, Semmar and Zweigenbaum (2012) or Pinnis and Skadins (2012). Instead, we investigate the integration of cache-based translation and language models (Bertoldi, Cettolo and Federico 2013) in the context of terminology integration. ...
Article
This work focuses on the extraction and integration of automatically aligned bilingual terminology into a Statistical Machine Translation (SMT) system in a Computer Aided Translation scenario. We evaluate the proposed framework that, taking as input a small set of parallel documents, gathers domain-specific bilingual terms and injects them into an SMT system to enhance translation quality. Therefore, we investigate several strategies to extract and align terminology across languages and to integrate it in an SMT system. We compare two terminology injection methods that can be easily used at run-time without altering the normal activity of an SMT system: XML markup and cache-based model. We test the cache-based model on two different domains (information technology and medical) in English, Italian and German, showing significant improvements ranging from 2.23 to 6.78 BLEU points over a baseline SMT system and from 0.05 to 3.03 compared to the widely-used XML markup approach.
... There has recently been a growing interest for terminology integration into MT models. Direct integration of terminology into the SMT model has been considered, either by extending SMT training data (Carl and Langlais, 2002), or via adding an additional term indicator feature (Pinnis and Skadins, 2012;Skadins et al., 2013) into the translation model. However none of the above is possible when we deal with an external black-box MT service. ...
Article
Full-text available
The paper presents series of experiments that aim to find best method how to treat multi-word expressions (MWE) in machine translation task. Methods have been investigated in a framework of statistical machine translation (SMT) for translation form English into Latvian. MWE candidates have been extracted using pattern-based and statistical approaches. Different techniques for MWE integration into SMT system are analysed. The best result - +0.59 BLEU points – has been achieved by combining two phrase tables - bilingual MWE dictionary and phrase table created from the parallel corpus in which statistically extracted MWE candidates are treated as single tokens. Using only bilingual dictionary as additional source of information the best result (+0.36 BLEU points) is obtained by combining two phrase tables. In case of statistically btained MWE lists, the best result (+0.51 BLEU points) is achieved with the largest list of MWE candidates.
Conference Paper
Full-text available
In this paper the author presents methods for dynamic terminology integration in statistical machine translation systems using a source text pre-processing work-flow. The workflow consists of exchange-able components for term identification, inflected form generation for terms, and term translation candidate ranking. Automatic evaluation for three language pairs shows a translation quality improvement from 0.9 to 3.41 BLEU points over the baseline. Manual evaluation for seven language pairs confirms the positive results; the proportion of correctly translated terms increases from 1.6% to 52.6% over the baseline.
Conference Paper
Full-text available
The authors present a service-based model for semi-automatic gener-ation of multilingual terminology resources which, if performed manually, is very time consuming. In this model, the automation of individual terminology work tasks is rendered as a set of interoperable cloud-based services integrated into workflows. These services automate the identification of term candidates in user documents, and the lookup of translation equivalents in online terminology resources and on the Web by automatically extracting multilingual terminology from comparable and parallel online resources. Collaborative involvement of users contributes to the refinement and enrichment of the raw terminological data. Finally, we present the TaaS platform, which implements this service-based model, particularly focusing on the processing of Web content.
Conference Paper
Full-text available
Although Machine Translation is very popular for personal tasks, its use in localization and other business applications is still very limited. The paper presents an experiment on the evaluation of an English-Latvian SMT system integrated into SDL Trados which has been used in an actual localization assignment by a professional localization company. We show that such an integrated localization environment can increase the productivity of localization by 32.9% without a critical reduction in quality.
Conference Paper
Full-text available
In this paper the author presents TildeNER ― an open source freely available named entity recognition toolkit and the first multi-class named entity recognition system for Latvian and Lithuanian languages. The system is built upon a supervised conditional random field classifier and features heuristic and statistical refinement methods that improve supervised classification, thus boosting the overall system's performance. The toolkit provides means for named entity recognition model bootstrapping, plaintext document and also pre-processed (morpho-syntactically tagged) tab-separated document named entity tagging and evaluation on test data. The paper presents the design of the system, describes the most important data formats and briefly discusses extension possibilities to different languages. It also gives evaluation on human annotated gold standard test corpora for Latvian and Lithuanian languages as well as comparative performance analysis to a state-of-the art English named entity recognition system using parallel and strongly comparable corpora. The author gives analysis of the Latvian and Lithuanian named entity tagged corpora annotation process and the created named entity annotated corpora.
Conference Paper
Full-text available
This position paper presents the recently started European collaboration project LetsMT!. This project creates a platform that gathers public and user-provided MT training data and generates multiple MT systems by combining and prioritizing this data. The project extends the use of existing state-of-the-art SMT methods that are applied to data supplied by users to increase quality, scope and language coverage of machine translation. The paper describes the background and motivation for this work, key approaches, and the technologies used.
Conference Paper
Full-text available
This paper presents a fast and accurate parallel sentence mining algorithm for comparable corpora called LEXACC based on the Cross-Language Information Retrieval framework combined with a trainable translation similarity measure that detects pairs of parallel and quasi-parallel sentences. LEXACC obtains state-of-the-art results in comparison with established approaches.
Conference Paper
Full-text available
Although term extraction has been researched for more than 20 years, only a few studies focus on under-resourced languages. Moreover, bilingual term mapping from comparable corpora for these languages has attracted researchers only recently. This paper presents methods for term extraction, term tagging in documents, and bilingual term mapping from comparable corpora for four under-resourced languages: Croatian, Latvian, Lithuanian, and Romanian. Methods described in this paper are language independent as long as language specific parameter data is provided by the user and the user has access to a part of speech or a morpho-syntactic tagger.
Article
We describe an open-source soware for minimum error rate training (MERT) for statistical machine translation (SMT). is was implemented within the Moses toolkit, although it is essentially standsalone, with the aim of replacing the existing implementation with a cleaner, more flexible design, in order to facilitate further research in weight optimisation. A description of the design is given, as well as experi- ments to compare performance with the previous implementation and to demonstrate extensibility.
Article
We describe METEOR, an automatic metric for machine translation evaluation that is based on a generalized concept of unigram matching between the machine- produced translation and human-produced reference translations. Unigrams can be matched based on their surface forms, stemmed forms, and meanings; further- more, METEOR can be easily extended to include more advanced matching strate- gies. Once all generalized unigram matches between the two strings have been found, METEOR computes a score for this matching using a combination of unigram-precision, unigram-recall, and a measure of fragmentation that is designed to directly capture how well-ordered the matched words in the machine translation are in relation to the reference. We evaluate METEOR by measuring the cor- relation between the metric scores and human judgments of translation quality. We compute the Pearson R correlation value between its scores and human qual- ity assessments of the LDC TIDES 2003 Arabic-to-English and Chinese-to-English datasets. We perform segment-by- segment correlation, and show that METEOR gets an R correlation value of 0.347 on the Arabic data and 0.331 on the Chinese data. This is shown to be an im- provement on using simply unigram- precision, unigram-recall and their har- monic F1 combination. We also perform experiments to show the relative contribu- tions of the various mapping modules.
Article
The exhaustivity of document descriptions and the specificity of index terms are usually regarded as independent. It is suggested that specificity should be interpreted statistically, as a function of term use rather than of term meaning. The effects on retrieval of variations in term specificity are examined, experiments with three test collections showing in particular that frequently-occurring terms are required for good overall performance. It is argued that terms should be weighted according to collection frequency, so that matches on less frequent, more specific, terms are of greater value than matches on frequent terms. Results for the test collections show that considerable improvements in performance are obtained with this very simple procedure.
Article
Evaluation is recognized as an extremely helpful forcing function in Human Language Technology R&D. Unfortunately, evaluation has not been a very powerful tool in machine translation (MT) research because it requires human judgments and is thus expensive and time-consuming and not easily factored into the MT research agenda. However, at the July 2001 TIDES PI meeting in Philadelphia, IBM described an automatic MT evaluation technique that can provide immediate feedback and guidance in MT research. Their idea, which they call an "evaluation understudy", compares MT output with expert reference translations in terms of the statistics of short sequences of words (word N-grams). The more of these N-grams that a translation shares with the reference translations, the better the translation is judged to be. The idea is elegant in its simplicity. But far more important, IBM showed a strong correlation between these automatically generated scores and human judgments of translation quality. As a result, DARPA commissioned NIST to develop an MT evaluation facility based on the IBM work. This utility is now available from NIST and serves as the primary evaluation measure for TIDES MT research.