About
Publications: 88
Reads: 7,973
Citations: 1,399
Additional affiliations
August 2011 - present
Systran
Position
- Engineer
March 2008 - August 2011
September 2003 - February 2008
Publications (88)
Retrieval-augmented machine translation leverages examples from a translation memory by retrieving similar instances. These examples are used to condition the predictions of a neural decoder. We aim to improve the upstream retrieval step and consider a fixed downstream edit-based model: the multi-Levenshtein Transformer. The task consists of findin...
In our globalized world, a growing number of situations arise where people are required to communicate in one or several foreign languages. In the case of written communication, users with a good command of a foreign language may find assistance from computer-aided translation (CAT) technologies. These technologies often allow users to access exter...
Machine Translation (MT) is usually viewed as a one-shot process that generates the target language equivalent of some source text from scratch. We consider here a more general setting which assumes an initial target sequence, that must be transformed into a valid translation of the source, thereby restoring parallelism between source and target. F...
Non-autoregressive machine translation (NAT) has recently made great progress. However, most works to date have focused on standard translation tasks, even though some edit-based NAT models, such as the Levenshtein Transformer (LevT), seem well suited to translate with a Translation Memory (TM). This is the scenario considered here. We first analyz...
As the amount of audio-visual content increases, the need to develop automatic captioning and subtitling solutions to match the expectations of a growing international audience appears as the only viable way to boost throughput and lower the related post-production costs. Automatic captioning and subtitling often need to be tightly intertwined to a...
When building machine translation systems, one often needs to make the best out of heterogeneous sets of parallel data in training, and to robustly handle inputs from unexpected domains in testing. This multi-domain scenario has attracted a lot of recent work that falls under the general umbrella of transfer learning. In this study, we revisit multi...
This paper describes SYSTRAN's systems submitted to the WMT 2017 shared news translation task for English-German, in both translation directions. Our systems are built using OpenNMT, an open-source neural machine translation system, implementing sequence-to-sequence models with LSTM encoder/decoders and attention. We experimented using mono-ling...
Machine translation systems are very sensitive to the domains they were trained on. Several domain adaptation techniques have been deeply studied. We propose a new technique for neural machine translation (NMT) that we call domain control which is performed at runtime using a unique neural network covering multiple domains. The presented approach s...
This paper describes SYSTRAN's systems submitted to the WMT 2017 shared news translation task for English-German, in both translation directions. Our systems are built using OpenNMT, an open-source neural machine translation system, implementing sequence-to-sequence models with LSTM encoder/decoders and attention. We experimented using monolingual...
Text simplification aims at reducing the lexical, grammatical and structural complexity of a text while keeping the same meaning. In the context of machine translation, we introduce the idea of simplified translations in order to boost the learning ability of deep neural translation models. We conduct preliminary experiments showing that translatio...
Domain adaptation is a key feature in Machine Translation. It generally encompasses terminology, domain and style adaptation, especially for human post-editing workflows in Computer Assisted Translation (CAT). With Neural Machine Translation (NMT), we introduce a new notion of domain adaptation that we call "specialization" and which is showing pro...
Training efficiency is one of the main problems for Neural Machine Translation (NMT). Deep networks, very large data and many training iterations are necessary to achieve state-of-the-art performance for NMT. This results in very high computation cost and slows down research and industrialization. In this paper, we first investigate the instability...
Since the first online demonstration of Neural Machine Translation (NMT) by LISA, NMT development has recently moved from laboratory to production systems as demonstrated by several entities announcing roll-out of NMT engines to replace their existing technologies. NMT systems have a large number of training configurations and the training process...
A major weakness of extant statistical machine translation (SMT) systems is their lack of a proper training procedure. Phrase extraction and scoring processes rely on a chain of crude heuristics, a situation judged problematic by many. In this paper, we recast the machine translation problem in the familiar terms of a sequence labeling task, th...
The Quaero program is an international project promoting research and industrial innovation on technologies for automatic analysis and classification of multimedia and multilingual documents. Within the program framework, research organizations and industrial partners collaborate to develop prototypes of innovating applications and services for...
This paper describes n-code, an open source statistical machine translation (SMT) toolkit for translation models estimated as n-gram language models of bilingual units (tuples). This toolkit includes tools for extracting tuples, estimating models and performing translation. It can be easily coupled to several other open source toolkits to yield a co...
This paper describes LIMSI's submissions to the Sixth Workshop on Statistical Machine Translation. We report results for the French-English and German-English shared translation tasks in both directions. Our systems use n-code, an open source Statistical Machine Translation system based on bilingual n-grams. For the French-English task, we focussed...
We introduce a generic framework in Statistical Machine Translation (SMT) in which lexical hypotheses, in the form of a target language model local to the input sentence, are used to guide the search for the best translation, thus performing a lexical microadaptation. An instantiation of this framework is presented and evaluated on three language...
This paper describes our Statistical Machine Translation systems for the WMT10 evaluation, where LIMSI participated for two language pairs (French-English and German-English, in both directions). For German-English, we concentrated on normalizing the German side through a proper preprocessing, aimed at reducing the lexical redundancy and at splitti...
In this work, we present an extension of n-gram-based translation models based on factored language models (FLMs). Translation units employed in the n-gram-based approach to statistical machine translation (SMT) are based on mappings of sequences of raw words, while translation model probabilities are estimated through standard language modeling of...
We present a generic framework in statistical machine translation (SMT) in which lexical predictions, in the form of a language model local to the sentence being translated, are used to guide the search for the best translation hypothesis, thereby performing a lexical micro-adaptation. We propose an instanti...
This paper advocates a complementary measure of translation performance that focuses on the contrastive ability of two or more systems or system versions to adequately translate source words. This is motivated by three main reasons: 1) existing automatic metrics sometimes do not show significant differences that can be revealed by fine-grained fo...
We present a new reordering model estimated as a standard n-gram language model with units built from morpho-syntactic information of the source and target languages. It can be seen as a model that translates the morpho-syntactic structure of the input sentence, in contrast to standard translation models which take care of the surface word forms. W...
We present a framework where auxiliary MT systems are used to provide lexical predictions to a main SMT system. In this work, predictions are obtained by means of pivoting via auxiliary languages, and introduced into the main SMT system in the form of a low order language model, which is estimated on a sentence-by-sentence basis. The linear combina...
This paper describes a technique to exploit multiple pivot languages when using machine translation (MT) on language pairs with scarce bilingual resources, or where no translation system for a language pair is available. The principal idea is to generate intermediate translations in several pivot languages, translate them separately into the targe...
This paper presents an extension for a bilingual n-gram statistical machine translation (SMT) system based on allowing translation units with gaps. Our gappy translation units can be seen as a first step towards introducing hierarchical units similar to those employed in hierarchical MT systems. Our goal is twofold. On the one hand we aim at captu...
This paper describes our Statistical Machine Translation systems for the WMT09 (en:fr) shared task. For this evaluation, we have developed four systems, using two different MT Toolkits: our primary submission, in both directions, is based on Moses, boosted with contextual information on phrases, and is contrasted with a conventional Moses-bas...
Abstract. Statistical translation systems integrate different types of models whose predictions are combined during decoding in order to produce the best possible translations. Correctly translating polysemous words, such as the French word avocat into English (lawyer or avocado), requires the use of...
The Internet gives us access to a wealth of information in languages we don't understand. The investigation of automated or semi-automated approaches to translation has become a thriving research field with enormous commercial potential. This volume investigates how Machine Learning techniques can improve Statistical Machine Translation, currently...
We describe two methods to improve SMT accuracy using shallow syntax information. First, we use chunks to refine the set of word alignments typically used as a starting point in SMT systems. Second, we extend an N-gram-based SMT system with chunk tags to better account for long-distance reorderings. Experiments are reported on an Arabic-English...
This paper reports on the participation of the TALP Research Center of the UPC (Universitat Politècnica de Catalunya) in the ACL WMT 2008 evaluation campaign. This year's system is the evolution of the one we employed for the 2007 campaign. Main updates and extensions involve linguistically motivated word reordering based on the reordering patt...
In present Statistical Machine Translation (SMT) systems, alignment is trained in a previous stage as the translation model. Consequently, alignment model parameters are not tuned as a function of the translation task, but only indirectly. In this paper, we propose a novel framework for discriminative training of alignment models with automated...
This paper addresses the problem of reordering in statistical machine translation (SMT). We describe an elegant and efficient approach to couple reordering (word order monotonization) and decoding, which does not need any additional model. We use linguistically motivated reordering rules to extend a monotonic search graph (with reordering hypot...
In the framework of the Tc-Star project, we analyze and propose a combination of two Statistical Machine Translation systems: a phrase-based and an N-gram-based one. The exhaustive analysis includes a comparison of the translation models in terms of efficiency (number of translation units used in the search and computational time) and an examination...
In this paper we present several extensions of MARIE, a freely available N-gram-based statistical machine translation (SMT) decoder. The extensions mainly consist of the ability to accept and generate word graphs and the introduction of two new N-gram models in the log-linear combination of feature functions the decoder implements. Addi...
This paper addresses the problem of word re-ordering in statistical machine translation. We follow a word order monotonization strategy making use of syntax information (dependency parse tree) of the source language to build a set of automatically extracted reordering rules. The input sentence is extended to a graph built with reordering hypotheses...
This paper describes TALPtuples, the 2007 N-gram-based statistical machine translation system developed at the TALP Research Center of the UPC (Universitat Politècnica de Catalunya) in Barcelona. Emphasis is put on improvements and extensions of the system of previous years. Mainly, these include optimizing alignment parameters in function of trans...
This paper describes the 2007 Ngram-based statistical machine translation system developed at the TALP Research Center of the UPC (Universitat Politècnica de Catalunya) in Barcelona. Emphasis is put on improvements and extensions of the previous years' system, being highlighted and empirically compared. Mainly, these include a novel word orderi...
In this paper we present several extensions of MARIE, a freely available N-gram-based statistical machine translation (SMT) decoder. The extensions mainly consist of the ability to accept and generate word graphs and the introduction of two new N-gram models in the loglinear combination of feature functions the decoder implements. Additionally, the...
This article describes in detail an n-gram approach to statistical machine translation. This approach consists of a log-linear combination of a translation model based on n-grams of bilingual units, which are referred to as tuples, along with four specific feature functions. Translation performance, which happens to be in the state of the art, is d...
In this paper we describe an elegant and efficient approach to coupling reordering and decoding in statistical machine translation, where the n-gram translation model is also employed as distortion model. The reordering search problem is tackled through a set of linguistically motivated rewrite rules, which are used to extend a monotonic search gra...
This paper presents a reordering framework for statistical machine translation (SMT) where source-side reorderings are integrated into SMT decoding, allowing for a highly constrained reordered search graph. The monotone search is extended by means of a set of reordering patterns (linguistically motivated rewrite patterns). Patterns are automatical...
This paper reports translation results for the "Exploiting Parallel Texts for Statistical Machine Translation" (HLT-NAACL Workshop on Parallel Texts 2006). We have studied different techniques to improve the standard Phrase-Based translation system. Mainly we introduce two reordering approaches and add morphological information.
This work presents translation results for the three data sets made available in the shared task "Exploiting Parallel Texts for Statistical Machine Translation" of the HLT-NAACL 2006 Workshop on Statistical Machine Translation. All results presented were generated by using the N-gram-based statistical machine translation system which has been enhan...
This paper describes TALPtuples, the 2006 Ngram-based statistical machine translation system developed at the TALP Research Center of the UPC (Universitat Politècnica de Catalunya) in Barcelona. Emphasis is put on improvements and extensions of the system of previous years, being highlighted and empirically compared. Mainly, these include a nov...
This article presents some experimental results on Chinese to Spanish machine translation. The implemented translation system is based on the statistical framework and, more specifically, it implements the bilingual n-gram approach. Since, as far as we know, no Chinese-Spanish parallel corpus is currently available for training purposes, an alterna...
This paper introduces a rule-based classification of single-word and compound verbs into a statistical machine translation approach. By substituting verb forms by the lemma of their head verb, the data sparseness problem caused by highly-inflected languages can be successfully addressed. On the other hand, the information of seen verb forms can b...
In this paper we describe MARIE, an Ngram-based statistical machine translation decoder. It is implemented using a beam search strategy, with distortion (or reordering) capabilities. The underlying translation model is based on an Ngram approach, extended to introduce reordering at the phrase level. The search graph structure is designed to perfor...
This communication introduces a stochastic machine translation system based on Ngram modelling of the joint probability of bilingual texts. The basic unit of this model is called a tuple and consists of a pair of both source (to be translated) language and target language (translation) word-strings. Translation is driven by a log-linear combination...
This work discusses translation results for the four Euparl data sets which were made available for the shared task "Exploiting Parallel Texts for Statistical Machine Translation". All results presented were generated by using a statistical machine translation system which implements a log-linear combination of feature functions along with a biling...
In Statistical Machine Translation, the use of reordering for certain language pairs can produce a significant improvement on translation accuracy. However, the search problem is shown to be NP-hard when arbitrary reorderings are allowed. This paper addresses the question of reordering for an Ngram-based SMT approach following two complementary s...
Abstract: This paper proposes a method for incorporating linguistic knowledge about verb forms into stochastic translation systems. By means of a knowledge-based classification of these forms, and their substitution by the lemma of the main verb during the training phase, we obtain a better...
This paper shows the common framework that underlies the translation systems based on phrases or driven by finite state transducers, and summarizes a first comparison between them. In both approaches the translation process is based on pairs of source and target strings of words (segments) related by word alignment. Their main difference comes from...
This paper describes the design and development of a trilingual spontaneous speech corpus for statistical speech-to-speech translation. The languages considered are Catalan, Spanish and US-English. This corpus has been built bearing in mind the strong need for multi-lingual collections of on-line data within the area of statistical translation, as...
This paper describes LIMSI's Statistical Machine Translation systems (SMT) for the IWSLT evaluation, where we participated in two tasks (Talk for English to French and BTEC for Turkish to English). For the Talk task, we studied an extension of our in-house n-code SMT system (the integration of a bilingual reordering model over generalized translati...
This work summarizes a comparison between two approaches to Statistical Machine Translation (SMT), namely Ngram-based and Phrase-based SMT. In both approaches, the translation process is based on bilingual units related by word-to-word alignments (pairs of source and target words), while the main differences are based on the extraction process of...
This article presents and describes an experimental prototype system for performing Chinese-to-Spanish and Spanish-to-Chinese machine translation. The system is based on the statistical machine translation (SMT) framework and, more specifically, it implements the bilingual n-gram SMT approach. Since, as far as we know, no large Chinese-Spanish par...
This paper introduces a phrase alignment strategy that seeks phrase and word links in two stages using cooccurrence measures and linguistic information. In a first stage, the algorithm finds high-precision links involving a linguistically-derived set of phrases, leaving word alignment to be performed in a second phase. Experiments have been c...
This paper describes the TALP phrase-based statistical machine translation system, enriched with the statistical machine reordering technique. We also report the combination of this system and the TALP-tuple, the n-gram-based statistical machine translation system. We report the results for all the tasks (Chinese, Arabic, Italian and Japanese t...
This paper describes the UPC's bilingual n-gram approach to statistical machine translation, which implements the log-linear combination of a bilingual n-gram translation model with six additional feature functions. A brief description of the complete system is presented and special attention is devoted to the novel features and reordering strategi...
This paper provides a description of TALP-Ngram, the tuple-based statistical machine translation system developed at the TALP Research Center of the UPC (Universitat Politècnica de Catalunya). Briefly, the system performs a log-linear combination of a translation model and additional feature functions. The translation model is estimated a...