Conference Paper

A Dynamic Programming Approach to Improving Translation Memory Matching and Retrieval Using Paraphrases

Abstract

Translation memory tools lack semantic knowledge like paraphrasing when they perform matching and retrieval. As a result, paraphrased segments are often not retrieved. One of the primary reasons for this is the lack of a simple and efficient algorithm to incorporate paraphrasing in the TM matching process. Gupta and Orăsan [1] proposed an algorithm which incorporates paraphrasing based on greedy approximation and dynamic programming. However, because of greedy approximation, their approach does not make full use of the paraphrases available. In this paper we propose an efficient method for incorporating paraphrasing in matching and retrieval based on dynamic programming only. We tested our approach on English-German, English-Spanish and English-French language pairs and retrieved better results for all three language pairs compared to the earlier approach [1].
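The idea of folding paraphrases into edit distance with dynamic programming alone can be illustrated with a small sketch. This is not the authors' algorithm, whose details are not given in the abstract. It is a minimal word-level Levenshtein variant under stated assumptions: the function name `paraphrase_edit_distance`, the `paraphrases` dictionary layout (a TM phrase mapped to its alternative phrasings), and the zero cost for a paraphrase match are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Sequence, Tuple

Phrase = Tuple[str, ...]


def paraphrase_edit_distance(
    query: Sequence[str],
    tm_segment: Sequence[str],
    paraphrases: Dict[Phrase, List[Phrase]],
) -> int:
    """Word-level edit distance in which a TM phrase may be matched
    at zero cost by any of its listed paraphrases (assumed cost model)."""
    n, m = len(query), len(tm_segment)
    # dp[i][j] = cheapest way to align query[:i] with tm_segment[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i
    for j in range(1, m + 1):
        dp[0][j] = j

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if query[i - 1] == tm_segment[j - 1] else 1
            best = min(
                dp[i - 1][j] + 1,        # word present only in the query
                dp[i][j - 1] + 1,        # word present only in the TM segment
                dp[i - 1][j - 1] + sub,  # match or substitution
            )
            # Extra transition: a TM phrase ending at j is covered by one
            # of its paraphrases ending at i in the query.
            for src, alternatives in paraphrases.items():
                k = len(src)
                if k == 0 or k > j or tuple(tm_segment[j - k:j]) != src:
                    continue
                for alt in alternatives:
                    length = len(alt)
                    if length == 0 or length > i:
                        continue
                    if tuple(query[i - length:i]) == alt:
                        best = min(best, dp[i - length][j - k])
            dp[i][j] = best
    return dp[n][m]
```

With the hypothetical entry `{("referred", "to", "in"): [("laid", "down", "in")]}`, the segments "the period laid down in article 4" and "the period referred to in article 4" are two substitutions apart under plain edit distance and zero apart once the paraphrase is taken into account. Because every paraphrase is considered inside the same table, no greedy pre-selection of paraphrases is needed in this sketch.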

... Due to the limitation of the TM algorithms, various researchers have focused on how to improve semantic matching in TMs. Gupta et al. [6,7]; Gupta and Orasan [8] offer a semantically enhanced edit-distance method by introducing a paraphrase database into the edit-distance metric during the matching process. The extra paraphrase TM database contains semantic information such as lexical, phrasal and syntactic paraphrases. ...
Conference Paper
Abstract. The aim of this paper is to investigate the similarity measurement approach of translation memory (TM) in five representative computer-aided translation (CAT) tools when retrieving inflectional verb-variation sentences in Arabic to English translation. In English, inflectional affixes in verbs include suffixes only; unlike English, verbs in Arabic derive voice, mood, tense, number and person through various inflectional affixes, e.g. before or after a verb root. The research question focuses on how the TM matching metrics measure a combination of the inflectional affixes when retrieving a segment. If it is dealt with as a character intervention, are the types of intervention penalized equally or differently? This paper experimentally examines, through a black box testing methodology and a test suite instrument, the penalties that TM systems’ current algorithms impose when input segments and retrieved TM sources are exactly the same, except for a difference in an inflectional affix. It would be expected that, if TM systems had some linguistic knowledge, the penalty would be very light, which would be useful to translators, since a high-scoring match would be presented near the top of the list of proposals. However, analysis of TM systems’ output shows that inflectional affixes are penalized more heavily than expected, and in different ways. They may be treated as an intervention on the whole word, or as a single character change.
... Further work towards the development of third-generation TM systems included the more recent studies conducted by members of the Research Group in Computational Linguistics, University of Wolverhampton (Gupta et al., 2016a; Gupta et al., 2016b) who experimented with paraphrasing the TM with a view to securing more matches. The authors sought to embed information from PPDB, a database of paraphrases (Ganitkevitch et al., 2013), in the edit distance metric by employing dynamic programming (DP) as well as dynamic programming and greedy approximation (DPGA). ...
Chapter
Corpus-based contrastive and translation research are areas that keep evolving in the digital age, as the range of new corpus resources and tools expands, opening up to different approaches and application contexts. The current book contains a selection of papers which focus on corpora and translation research in the digital age, outlining some recent advances and explorations. After an introductory chapter which outlines language technologies applied to translation and interpreting with a view to identifying challenges and research opportunities, the first part of the book is devoted to current advances in the creation of new parallel corpora for under-researched areas, the development of tools to manage parallel corpora or as an alternative to parallel corpora, and new methodologies to improve existing translation memory systems. The contributions in the second part of the book address a number of cutting-edge linguistic issues in the area of contrastive discourse studies and translation analysis on the basis of comparable and parallel corpora in several languages such as English, German, Swedish, French, Italian, Spanish, Portuguese and Turkish, thus showcasing the richness of the linguistic diversity carried out in these recent investigations. Given the multiplicity of topics, methodologies and languages studied in the different chapters, the book will be of interest to a wide audience working in the fields of translation studies, contrastive linguistics and the automatic processing of language.
... Recent work on new generation TM systems (Gupta 2015; Gupta et al. 2016a; Gupta et al. 2016b; Timonera and Mitkov 2015) shows that when NLP techniques such as paraphrasing or clause splitting are applied, TM system performance is enhanced. ...
Book
This workshop addresses BOTH the most recent developments in contributions of NLP to translation/interpreting and the contributions of translation/interpreting to NLP/MT. In this way it addresses the interests of researchers & specialists in both areas and their joint collaborations, aiming for example to improve their own tasks with the techniques & knowledge of the other field or to help the development of the other field with their own techniques & knowledge.
... Recent work on new generation TM systems (Gupta 2015; Gupta et al. 2016a; Gupta et al. 2016b; Timonera and Mitkov 2015) shows that when NLP techniques such as paraphrasing or clause splitting are applied, TM system performance is enhanced. ...
Conference Paper
Nowadays there is a pressing need to develop interpreting-related technologies, with practitioners and other end-users increasingly calling for tools tailored to their needs and their new interpreting scenarios. But, at the same time, interpreting as a human activity has resisted complete automation for various reasons, such as fear, unawareness, communication complexities, lack of dedicated tools, etc. Several computer-assisted interpreting tools and resources for interpreters have been developed, although they are rather modest in terms of the support they provide. In the same vein, and despite the pressing need to aid multilingual mediation, machine interpreting is still under development, with the exception of a few success stories. This paper will present the results of VIP, an R&D project on language technologies applied to interpreting. It is the ‘seed’ of a family of projects on interpreting technologies which are currently being developed or have just been completed at the Research Institute of Multilingual Language Technologies (IUITLM), University of Malaga.
Conference Paper
This paper investigates to what extent the use of paraphrasing in translation memory (TM) matching and retrieval is useful for human translators. Current translation memories lack semantic knowledge like paraphrasing in matching and retrieval. Due to this, paraphrased segments are often not retrieved. Lack of semantic knowledge also results in inappropriate ranking of the retrieved segments. Gupta and Orasan (2014) proposed an improved matching algorithm which incorporates paraphrasing. Its automatic evaluation suggested that it could be beneficial to translators. In this paper we perform an extensive human evaluation of the use of paraphrasing in the TM matching and retrieval process. We measure post-editing time, keystrokes, two subjective evaluations, and HTER and HMETEOR to assess the impact on human performance. Our results show that paraphrasing improves TM matching and retrieval, resulting in translation performance increases when translators use paraphrase-enhanced TMs.
Conference Paper
This paper describes Meteor Universal, released for the 2014 ACL Workshop on Statistical Machine Translation. Meteor Universal brings language specific evaluation to previously unsupported target languages by (1) automatically extracting linguistic resources (paraphrase tables and function word lists) from the bitext used to train MT systems and (2) using a universal parameter set learned from pooling human judgments of translation quality from several language directions. Meteor Universal is shown to significantly outperform baseline BLEU on two new languages, Russian (WMT13) and Hindi (WMT14).
Article
While a number of Translation Memory (TM) programs and tools have been developed which are now regarded as indispensable for the work of professional translators, it has been noted that a serious weakness of the current TM technology is the fact that its matching capability is far from perfect. An obvious shortcoming of current TM systems is the fact that they have no access to the meaning of the translated text and operate on its surface form. As a result, they fail to match sentences that have the same meaning, but different syntactic structure. To overcome this shortcoming Pekar and Mitkov (2007) developed the so-called 3rd Generation Translation Memory (3GTM) methodology which analyses the segments not only in terms of syntax but also in terms of semantics. Whereas this technology is a promising way forward, the limitations of current semantic processing may cast doubt on its use in a practical environment. To enhance the overall low performance of semantic processing tasks, we propose the employment of rhetorical predicates to improve the accuracy of the matching algorithm. The paper will introduce the novel 3GTM developed by us and will show how rhetorical predicates can be used to enhance its performance.
Conference Paper
Current Translation Memory (TM) systems work at the surface level and lack semantic knowledge while matching. This paper presents an approach to incorporating semantic knowledge in the form of paraphrasing in matching and retrieval. Most of the TMs use Levenshtein edit-distance or some variation of it. Generating additional segments based on the paraphrases available in a segment results in exponential time complexity while matching. The reason is that a particular phrase can be paraphrased in several ways and there can be several possible phrases in a segment which can be paraphrased. We propose an efficient approach to incorporating paraphrasing with edit-distance. The approach is based on greedy approximation and dynamic programming. We have obtained significant improvement in both retrieval and translation of retrieved segments for TM thresholds of 100%, 95% and 90%.
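The exponential growth mentioned in this abstract can be made concrete with a short sketch. It is an illustration under simplifying assumptions, not the paper's procedure: the segment is pre-split into non-overlapping slots, each slot lists the original phrase followed by its paraphrases, and the helper names `expand_segment` and `count_variants` as well as the sample phrases are hypothetical.

```python
from itertools import product
from math import prod
from typing import List, Sequence


def expand_segment(slots: Sequence[List[str]]) -> List[str]:
    """Enumerate every segment obtainable by picking one option per slot."""
    return [" ".join(choice) for choice in product(*slots)]


def count_variants(slots: Sequence[List[str]]) -> int:
    """Number of generated segments: the product of the slot sizes."""
    return prod(len(options) for options in slots)


# Hypothetical segment with two paraphrasable phrases.
slots = [
    ["the period"],
    ["referred to in", "laid down in", "provided for in"],
    ["article 4"],
    ["shall apply", "applies"],
]
variants = expand_segment(slots)
```

Here the two paraphrasable slots already yield 3 × 2 = 6 segments to compare against. With ten paraphrasable phrases of three options each, `count_variants` gives 3^10 = 59,049 segments for a single TM entry, which is why the alternatives are handled inside the edit-distance computation rather than by generating each variant.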
Article
The European Commission's (EC) Directorate General for Translation, together with the EC's Joint Research Centre, is making available a large translation memory (TM; i.e. sentences and their professionally produced translations) covering twenty-two official European Union (EU) languages and their 231 language pairs. Such a resource is typically used by translation professionals in combination with TM software to improve speed and consistency of their translations. However, this resource also has many uses for translation studies and for language technology applications, including Statistical Machine Translation (SMT), terminology extraction, Named Entity Recognition (NER), multilingual classification and clustering, and many more. In this reference paper for DGT-TM, we introduce this new resource, provide statistics regarding its size, and explain how it was produced and how to use it.
Article
Translation memories (TMs) are very useful tools for translating texts in narrow domains. We propose the use of paraphrases for searching TMs. By using paraphrases, we can retrieve sentences that have the same meaning as the input sentences even if they do not match exactly. The paraphrase pairs used in our system are obtained from parallel corpora and are used to retrieve sentences in a statistical framework.
Article
The TELA structure, a set of layered and linked lattices, and the notion of Similarity between TELA structures, based on the Edit Distance, are introduced in order to formalize Translation Memories (TM). We show how this approach leads to a real gain in recall and precision, and allows extending TM towards rudimentary, yet useful Example-Based Machine Translation that we call Shallow Translation.
Conference Paper
We describe an open-source toolkit for statistical machine translation whose novel contributions are (a) support for linguistically motivated factors, (b) confusion network decoding, and (c) efficient data formats for translation models and language models. In addition to the SMT decoder, the toolkit also includes a wide variety of tools for training, tuning and applying the system to many translation tasks.
Article
Human evaluations of machine translation are extensive but expensive. Human evaluations can take months to finish and involve human labor that cannot be reused.
Article
This paper describes the new linguistic approaches of the MetaMorphoTM system, a linguistically enriched translation memory. Our aim was to develop an improved TM system that uses linguistic analysis in both source and destination languages to yield more exact matches to the source sentence. The MetaMorphoTM stores and retrieves sub-sentential segments and uses a linguistically based measure to determine similarity between two source-language segments, and attempts to assemble a sensible translation using translations of source-language chunks if the entire source segment was not found.
Conference Paper
Although undeniably useful for the translation of certain types of repetitive document, current translation memory technology is limited by the rudimentary techniques employed for approximate matching. Such systems, moreover, incorporate no real notion of a document, since the databases that underlie them are essentially composed of isolated sentence strings. As a result, current TM products can only exploit a small portion of the knowledge residing in translators’ past production. This paper examines some of the changes that will have to be implemented if the technology is to be made more widely applicable.
Pekar, V., Mitkov, R.: New Generation Translation Memory: Content-Sensitive Matching. In: Proceedings of the 40th Anniversary Congress of the Swiss Association of Translators, Terminologists and Interpreters. (2007)
Gupta, R.: Use of Language Technology to Improve Matching and Retrieval in Translation Memory. PhD thesis, University of Wolverhampton (2016)
Clark, J.P.: System, method, and product for dynamically aligning translations in a translation-memory system (February 5 2002) US Patent 6,345,244.
Timonera, K., Mitkov, R.: Improving translation memory matching through clause splitting. In: Proceedings of the Workshop on Natural Language Processing for Translation Memories (NLP4TM), Hissar, Bulgaria (2015) 17-23
Somers, H.: Translation memory systems. Computers and Translation: A Translator's Guide 35 (2003) 31-48
Planas, E., Furuse, O.: Formalizing Translation Memories. In: Proceedings of the 7th Machine Translation Summit. (1999) 331-339
Hodász, G., Pohl, G.: MetaMorpho TM: a linguistically enriched translation memory. In: International Workshop, Modern Approaches in Translation Technologies. (2005)
Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the ACL. (2002) 311-318