Conference Paper

Beyond Parallel Data: Joint Word Alignment and Decipherment Improves Machine Translation

Authors: Qing Dou, Ashish Vaswani, and Kevin Knight

... Both approaches have been shown to improve the quality of MT systems for domain adaptation (Daumé and Jagarlamudi, 2011; Dou and Knight, 2012; Irvine et al., 2013) and low-density languages (Irvine and Callison-Burch, 2013a; Dou et al., 2014). Meanwhile, each has its own advantages and disadvantages. ...
... The significant differences in word order pose great challenges for both parsing and decipherment. Table 2 lists the sizes of monolingual and parallel data used in this experiment, released by Dou et al. (2014). The monolingual data in Malagasy contains news text collected from Madagascar websites. ...
... The baseline system is the same as the baseline used in Spanish/English decipherment experiments. We use data provided in previous work (Dou et al., 2014) to build a Malagasy dependency parser. For English, we use the Turbo parser, trained on the Penn Treebank (Martins et al., 2013). ...
... However, monolingual corpora can be collected from various sources on the Internet and are much easier to obtain than parallel corpora. Recent research has created machine translation systems using only monolingual corpora [163][164][165], applying unsupervised methods to remove the dependency on sentence-aligned parallel corpora. These systems are based on both SMT [166,167] and NMT [168]. ...
... The authors used the alphabets of the two languages to extend the word embeddings, modifying the similarity score functions of previous word-embedding methods to include an orthographic similarity measure. Bilingual lexicons have been shown to improve machine translation in both RBMT [170] and CBMT [163,176,177]. ...
Article
Full-text available
Machine translation is one of the applications of natural language processing which has been explored in different languages. Recently, researchers have started paying attention to machine translation for resource-poor languages and closely related languages. A widespread and underlying problem for these machine translation systems is the linguistic difference and variation in orthographic conventions, which causes many issues for traditional approaches. Two languages written in two different orthographies are not easily comparable, but orthographic information can also be used to improve the machine translation system. This article offers a survey of research regarding orthography’s influence on machine translation of under-resourced languages. It introduces under-resourced languages in terms of machine translation and how orthographic information can be utilised to improve machine translation. We describe previous work in this area, discussing what underlying assumptions were made and showing how orthographic knowledge improves the performance of machine translation of under-resourced languages. We discuss different types of machine translation and demonstrate a recent trend that seeks to link orthographic information with well-established machine translation methods. Considerable attention is given to current efforts using cognate information at different levels of machine translation and the lessons that can be drawn from this. Additionally, multilingual neural machine translation of closely related languages is given a particular focus in this survey. This article ends with a discussion of the way forward in machine translation with orthographic information, focusing on multilingual settings and bilingual lexicon induction.
... To start deciphering the Chinese BINet, we need a few seeds based on prior knowledge as a starting point. Inspired by previous work on bilingual dictionary induction (e.g., [9,10]), decipherment (e.g., [11,12,13]) and name translation mining (e.g., [14,15]), we utilize a few linguistic resources: a bilingual lexicon and language-universal representations such as time/calendar dates, numbers, website URLs, currency and emoticons to decipher a subset of Chinese nodes. For the example shown in Figure 1, we can decipher some nodes in the Chinese BINet such as "7-6" (to "7-6") and "种子" (to "seed"). ...
... Our work is the first to apply it to a cross-lingual setting. This is also the first attempt to apply the decipherment idea (e.g., [11,12,13]) to graph structures instead of sequence data. Another work related to ours is [41,45], which used phonetic transliteration and frequency correlation to discover transliteration of entities. ...
Article
Full-text available
Aligning coordinated text streams from multiple sources and multiple languages has opened many new research venues in cross-lingual knowledge discovery. In this paper we aim to advance the state of the art by (1) extending coarse-grained topic-level knowledge mining to fine-grained information units such as entities and events, and (2) following a novel Data-to-Network-to-Knowledge (D2N2K) paradigm to construct and utilize network structures to capture and propagate reliable evidence. We introduce a novel Burst Information Network (BINet) representation that can display the most important information and illustrate the connections among bursty entities, events and keywords in the corpus. We propose an effective approach to construct and decipher BINets, incorporating novel criteria based on multi-dimensional clues from pronunciation, translation, burst, neighbors and graph topological structure. The experimental results on Chinese and English coordinated text streams show that our approach can accurately decipher the nodes with high confidence in the BINets and that the algorithm can be run efficiently in parallel, which makes it possible to apply it to huge amounts of streaming data for never-ending language and information decipherment.
... On the other hand, large monolingual corpora can be easily downloaded from the internet for most languages. Decipherment algorithms exploit such monolingual corpora in order to learn translation model parameters, when parallel data is limited or unavailable (Koehn and Knight, 2000; Ravi and Knight, 2011; Dou et al., 2014). ...
... However, the objective functions for both EM and the latent-variable log-linear model are non-convex, and the results may vary drastically based on initialization (Berg-Kirkpatrick and Klein, 2013). In the future, we would like to start with a small parallel corpus and initialize the decipherment models with the parameters learned from it (Dou et al., 2014). We would also like to experiment with a more sophisticated translation model that incorporates NULL words, local reordering of neighboring words, and word fertilities (Ravi, 2013). ...
Article
Full-text available
Orthographic similarities across languages provide a strong signal for probabilistic decipherment, especially for closely related language pairs. The existing decipherment models, however, are not well-suited for exploiting these orthographic similarities. We propose a log-linear model with latent variables that incorporates orthographic similarity features. Maximum likelihood training is computationally expensive for the proposed log-linear model. To address this challenge, we perform approximate inference via MCMC sampling and contrastive divergence. Our results show that the proposed log-linear model with contrastive divergence scales to large vocabularies and outperforms the existing generative decipherment models by exploiting the orthographic features.
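To make the training recipe above concrete, here is a minimal, assumption-laden sketch of a contrastive-divergence (CD-k) weight update for a feature-based decipherment score. The toy features, the sampler, and the learning rate are illustrative choices of ours, not the authors' implementation:

    import math
    import random
    from collections import defaultdict

    def features(c, p):
        # Toy orthographic features over a cipher/plaintext word pair
        # (illustrative only; the paper's feature set differs).
        return {("prefix_match", c[:2] == p[:2]): 1.0,
                ("length_diff", abs(len(c) - len(p))): 1.0}

    def sample_plain(weights, c, vocab):
        # Draw a plaintext candidate with probability proportional to exp(score).
        scores = [math.exp(sum(weights[f] * v for f, v in features(c, p).items()))
                  for p in vocab]
        return random.choices(vocab, weights=scores)[0]

    def cd_k_update(weights, c, p_pos, vocab, lr=0.05, k=1):
        # One CD-k step for cipher token c. p_pos is a "clamped" sample
        # (e.g., drawn from the posterior under a language model); the
        # negative sample comes from k further steps of the model's sampler.
        p_neg = p_pos
        for _ in range(k):
            p_neg = sample_plain(weights, c, vocab)
        for f, v in features(c, p_pos).items():
            weights[f] += lr * v   # raise weights on clamped statistics
        for f, v in features(c, p_neg).items():
            weights[f] -= lr * v   # lower weights on model statistics

    weights = defaultdict(float)
    cd_k_update(weights, "gato", "cat", ["cat", "dog", "gate"])

The point of CD here is that the exact gradient of the log-linear likelihood requires an intractable expectation over all plaintext candidates; a few sampler steps stand in for it.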
... While Klementiev et al. (2012) propose an approach to estimating phrase translation probabilities from monolingual corpora, Zhang and Zong (2013) directly extract parallel phrases from monolingual corpora using retrieval techniques. Another important line of research is to treat translation on monolingual corpora as a decipherment problem (Ravi and Knight, 2011; Dou et al., 2014). ...
Preprint
While end-to-end neural machine translation (NMT) has made remarkable progress recently, NMT systems only rely on parallel corpora for parameter estimation. Since parallel corpora are usually limited in quantity, quality, and coverage, especially for low-resource languages, it is appealing to exploit monolingual corpora to improve NMT. We propose a semi-supervised approach for training NMT models on the concatenation of labeled (parallel corpora) and unlabeled (monolingual corpora) data. The central idea is to reconstruct the monolingual corpora using an autoencoder, in which the source-to-target and target-to-source translation models serve as the encoder and decoder, respectively. Our approach can not only exploit the monolingual corpora of the target language, but also of the source language. Experiments on the Chinese-English dataset show that our approach achieves significant improvements over state-of-the-art SMT and NMT systems.
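As a hedged reading of the training objective sketched in this abstract (the notation below is ours, not quoted from the paper): with parallel data D_p and source/target monolingual corpora D_x and D_y, the semi-supervised criterion combines two supervised likelihoods with autoencoder reconstruction terms,

    J(\theta_{xy}, \theta_{yx}) = \sum_{(x,y) \in D_p} [\log P(y|x;\theta_{xy}) + \log P(x|y;\theta_{yx})]
        + \sum_{x \in D_x} \log \sum_{y} P(y|x;\theta_{xy}) \, P(x|y;\theta_{yx})
        + \sum_{y \in D_y} \log \sum_{x} P(x|y;\theta_{yx}) \, P(y|x;\theta_{xy}),

where the latent translation plays the role of the autoencoder's hidden representation and the inner sums are approximated by sampling in practice.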
... Dou et al. (2015) showed that the mapping between monolingual word embeddings gives a good base distribution for the decipherment process. Dou et al. (2014) proposed to learn word alignment and decipherment jointly. ...
Article
Full-text available
Unsupervised Neural Machine Translation (UNMT) approaches have gained widespread popularity in recent times. Though these approaches show impressive translation performance using only monolingual corpora of the languages involved, these approaches have mostly been tried on high-resource European language pairs viz. English–French, English–German, etc. In this paper, we explore UNMT for 6 Indic language pairs viz., Hindi–Bengali, Hindi–Gujarati, Hindi–Marathi, Hindi–Malayalam, Hindi–Tamil, and Hindi–Telugu which are low-resource language pairs. We additionally perform experiments on 4 European language pairs viz., English–Czech, English–Estonian, English–Lithuanian, and English–Finnish. We observe that the lexical divergence within these language pairs plays a big role in the success of UNMT. In this context, we explore three approaches viz., (i) script conversion, (ii) unsupervised bilingual embedding-based initialization to bring the vocabulary of the two languages closer, and (iii) dictionary word substitution using a bilingual dictionary. We found that the script conversion using a simple rule-based system benefits language pairs that have high cognate overlap but use different scripts. We observe that script conversion combined with word substitution using a dictionary further improves the UNMT performance. We use a ground truth bilingual dictionary in our dictionary word substitution experiments, and such dictionaries can also be obtained using unsupervised bilingual embeddings. We empirically demonstrate that minimizing lexical divergence using simple heuristics leads to significant improvements in the BLEU score for both related and distant language pairs.
... The idea of unsupervised MT dates back to word-based decipherment methods (Knight et al., 2006; Ravi and Knight, 2011). They learn only lexicon models at first, but add alignment models (Dou et al., 2014; Nuhn, 2019) or heuristic features (Naim et al., 2018) later. Finally, Artetxe et al. (2018a) and Lample et al. (2018b) train a fully-fledged phrase-based MT system in an unsupervised way. ...
Preprint
Full-text available
This paper studies the practicality of the current state-of-the-art unsupervised methods in neural machine translation (NMT). In ten translation tasks with various data settings, we analyze the conditions under which the unsupervised methods fail to produce reasonable translations. We show that their performance is severely affected by linguistic dissimilarity and domain mismatch between source and target monolingual data. Such conditions are common for low-resource language pairs, where unsupervised learning works poorly. In all of our experiments, supervised and semi-supervised baselines with 50k-sentence bilingual data outperform the best unsupervised results. Our analyses pinpoint the limits of the current unsupervised NMT and also suggest immediate research directions.
... Moreover, our BINets are cheap to construct and can be easily extended to other languages. This is also the first attempt to apply the decipherment idea (e.g., (Ravi and Knight, 2011; Dou and Knight, 2012; Dou et al., 2014)) to graph structures instead of sequence data. ...
... Leveraging useful information from monolingual corpora can be extremely helpful for learning translation models for low- and no-resource language pairs. Decipherment algorithms (so called because of the assumption that one language is a cipher for the other) aim to exploit such monolingual corpora in order to learn translation model parameters, when parallel data is limited or unavailable (Koehn and Knight 2000; Ravi and Knight 2011; Dou, Vaswani, and Knight 2014). The key intuition is that similar words and n-grams tend to have similar distributional properties across languages. ...
Article
Full-text available
Orthographic similarities across languages provide a strong signal for unsupervised probabilistic transduction (decipherment) for closely related language pairs. The existing decipherment models, however, are not well suited for exploiting these orthographic similarities. We propose a log-linear model with latent variables that incorporates orthographic similarity features. Maximum likelihood training is computationally expensive for the proposed log-linear model. To address this challenge, we perform approximate inference via Markov chain Monte Carlo sampling and contrastive divergence. Our results show that the proposed log-linear model with contrastive divergence outperforms the existing generative decipherment models by exploiting the orthographic features. The model both scales to large vocabularies and preserves accuracy in low- and no-resource contexts.
... This entails translating from language A to an intermediate language I and then to the target language B, a process called triangulation (Singla et al. 2014). However, this method requires parallel data between languages A and B, and at least one more language I. Decipherment techniques (Dou and Knight 2013; Dou et al. 2014) have been used to induce translation lexicons from non-parallel data and improve translation not only for out-of-vocabulary (OOV) words but also for observed words. More recently, Zoph et al. (2016) use transfer learning to enhance MT for low-resource languages. ...
Article
Full-text available
The problem of a total absence of parallel data is present for a large number of language pairs and can severely harm the quality of machine translation. We describe a language-independent method to enable machine translation between a low-resource language (LRL) and a third language, e.g. English. We deal with cases of LRLs for which there is no readily available parallel data between the low-resource language and any other language, but there is ample training data between a closely related high-resource language (HRL) and the third language. We take advantage of the similarities between the HRL and the LRL in order to transform the HRL data into data similar to the LRL using transliteration. The transliteration models are trained on transliteration pairs extracted from Wikipedia article titles. Then, we automatically back-translate monolingual LRL data with the models trained on the transliterated HRL data and use the resulting parallel corpus to train our final models. Our method achieves significant improvements in translation quality, close to the results that can be achieved by a general-purpose neural machine translation system trained on a significant amount of parallel data. Moreover, the method does not rely on the existence of any parallel data for training, but attempts to bootstrap already existing resources in a related language.
... By contrast, most of the prior work depends on parallel data in the form of a small bitext (Genzel, 2005), a gold seed lexicon (Mikolov et al., 2013b), or document-aligned comparable corpora (Vulić and Moens, 2015). Other prior work assumes access to additional resources or features, such as dependency parsers (Dou and Knight, 2013; Dou et al., 2014), temporal and web-based features (Irvine and Callison-Burch, 2013), or BabelNet (Wang and Sitbon, 2014). ...
... Most previous efforts have concentrated on learning parallel lexicons from non-parallel corpora, including parallel sentence and lexicon extraction via bootstrapping (Fung and Cheung, 2004), inducing parallel lexicons via canonical correlation analysis (Haghighi et al., 2008), training IBM models on monolingual corpora as decipherment (Ravi and Knight, 2011; Nuhn et al., 2012; Dou et al., 2014), and deriving parallel lexicons from bilingual word embeddings (Vulić and Moens, 2013; Mikolov et al., 2013; Vulić and Moens, 2015). ...
Conference Paper
We introduce an agreement-based approach to learning parallel lexicons and phrases from non-parallel corpora. The basic idea is to encourage two asymmetric latent-variable translation models (i.e., source-to-target and target-to-source) to agree on identifying latent phrase and word alignments. The agreement is defined at both word and phrase levels. We develop a Viterbi EM algorithm for jointly training the two unidirectional models efficiently. Experiments on the Chinese-English dataset show that agreement-based learning significantly improves both alignment and translation performance.
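In equation form, the agreement idea can be sketched as follows (our hedged reconstruction from the abstract; the symbols are ours): with source-to-target parameters \theta_1 and target-to-source parameters \theta_2,

    \max_{\theta_1, \theta_2} \; L(\theta_1) + L(\theta_2) + \lambda \sum_{n} \Delta\big(P(a \mid x_n; \theta_1),\; P(a \mid x_n; \theta_2)\big),

where \Delta rewards the two models for placing probability on the same latent word and phrase alignments a, and the Viterbi EM procedure alternates between choosing the best agreed-upon alignments and re-estimating \theta_1 and \theta_2.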
... Table 1 shows the numbers of parallel sentences and words. We used the dataset from (Dou et al., 2014) for Malagasy, and the Urdu data is a part of the NIST MT evaluations of 2008-2012. We used 2000 sentences for the development and held-out test sets. ...
Article
We present a neural network architecture based on bidirectional LSTMs to compute representations of words in their sentential contexts. These context-sensitive word representations are suitable for, e.g., distinguishing different word senses and other context-modulated variations in meaning. To learn the parameters of our model, we use cross-lingual supervision, hypothesizing that a good representation of a word in context will be one that is sufficient for selecting the correct translation into a second language. We evaluate the quality of our representations as features in three downstream tasks: prediction of semantic supersenses (which assign nouns and verbs into a few dozen semantic classes), low-resource machine translation, and a lexical substitution task, and obtain state-of-the-art results on all of these.
Chapter
In this chapter, we first review attention-based neural machine translation (NMT). Attentional mechanisms make NMT outperform non-attentional models. The related works confirm the importance of attentional mechanisms. Then we investigate how bidirectional information is captured in SMT and NMT. Next we summarize a number of works which incorporate additional data resources, such as monolingual corpora and pivot language corpora, into machine translation systems. Finally, we give a brief review of the studies on contrastive learning, which is a key technique in our fourth work.
Article
We use bilingual lexicon induction techniques, which learn translations from monolingual texts in two languages, to build an end-to-end statistical machine translation (SMT) system without the use of any bilingual sentence-aligned parallel corpora. We present detailed analysis of the accuracy of bilingual lexicon induction, and show how a discriminative model can be used to combine various signals of translation equivalence (like contextual similarity, temporal similarity, orthographic similarity and topic similarity). Our discriminative model produces higher accuracy translations than previous bilingual lexicon induction techniques. We reuse these signals of translation equivalence as features on a phrase-based SMT system. These monolingually estimated features enhance low resource SMT systems in addition to allowing end-to-end machine translation without parallel corpora.
Conference Paper
Full-text available
In this paper, we propose a new Bayesian inference method to train statistical machine translation systems using only non-parallel corpora. Following a probabilistic decipherment approach, we first introduce a new framework for decipherment training that is flexible enough to incorporate any number/type of features (besides simple bag-of-words) as side information used for estimating translation models. In order to perform fast, efficient Bayesian inference in this framework, we then derive a hash sampling strategy that is inspired by the work of Ahmed et al. (2012). The new translation hash sampler enables us to scale elegantly to complex models (for the first time) and large vocabulary/corpora sizes. We show empirical results on the OPUS data: our method yields the best BLEU scores compared to existing approaches, while achieving significant computational speedups (several orders of magnitude faster). We also report, for the first time, BLEU score results for a large-scale MT task using only non-parallel data (the EMEA corpus).
Conference Paper
Full-text available
In this paper we show how to train statistical machine translation systems on real-life tasks using only non-parallel monolingual data from two languages. We present a modification of the method shown in (Ravi and Knight, 2011) that is scalable to vocabulary sizes of several thousand words. On the task shown in (Ravi and Knight, 2011) we obtain better results with only 5% of the computational effort when running our method with an n-gram language model. The efficiency improvement of our method allows us to run experiments with vocabulary sizes of around 5,000 words, such as a non-parallel version of the VERBMOBIL corpus. We also report results using data from the monolingual French and English GIGAWORD corpora.
Article
Full-text available
We describe a series of five statistical models of the translation process and give algorithms for estimating the parameters of these models given a set of pairs of sentences that are translations of one another. We define a concept of word-by-word alignment between such pairs of sentences. For any given pair of such sentences each of our models assigns a probability to each of the possible word-by-word alignments. We give an algorithm for seeking the most probable of these alignments. Although the algorithm is suboptimal, the alignment thus obtained accounts well for the word-by-word relationships in the pair of sentences. We have a great deal of data in French and English from the proceedings of the Canadian Parliament. Accordingly, we have restricted our work to these two languages; but we feel that because our algorithms have minimal linguistic content they would work well on other pairs of languages. We also feel, again because of the minimal linguistic content of our algorithms, that it is reasonable to argue that word-by-word alignments are inherent in any sufficiently large bilingual corpus.
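As a concrete illustration of the estimation procedure underlying these models, here is a minimal EM loop for Model 1 lexical translation probabilities (a textbook sketch: no NULL word, fertility, or distortion; all names are ours):

    from collections import defaultdict

    def ibm_model1(bitext, iterations=10):
        # bitext: list of (target_tokens, source_tokens) sentence pairs.
        # Returns t[(f, e)] ~ p(f | e), the lexical translation table.
        t = defaultdict(lambda: 1e-3)   # near-uniform initialization
        for _ in range(iterations):
            count = defaultdict(float)  # expected counts (E-step)
            total = defaultdict(float)
            for e_sent, f_sent in bitext:
                for f in f_sent:
                    norm = sum(t[(f, e)] for e in e_sent)
                    for e in e_sent:
                        delta = t[(f, e)] / norm   # posterior alignment probability
                        count[(f, e)] += delta
                        total[e] += delta
            for (f, e), c in count.items():        # re-estimate (M-step)
                t[(f, e)] = c / total[e]
        return t

    bitext = [(["the", "house"], ["la", "casa"]),
              (["the", "cat"], ["el", "gato"])]
    t = ibm_model1(bitext)

After a few iterations, t concentrates probability on pairings such as ("casa", "house"), because EM explains each foreign word by the English words it consistently co-occurs with.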
Conference Paper
Full-text available
In this paper, we describe a new model for word alignment in statistical translation and present experimental results. The idea of the model is to make the alignment probabilities dependent on the differences in the alignment positions rather than on the absolute positions. To achieve this goal, the approach uses a first-order Hidden Markov model (HMM) for the word alignment problem, as HMMs are used successfully in speech recognition for the time alignment problem. The difference to the time alignment HMM is that there is no monotonicity constraint for the possible word orderings. We describe the details of the model and test the model on several bilingual corpora.
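In equation form (a standard presentation of the model just described, not a quotation from the paper), the HMM factors the probability of a source sentence f_1^J with alignment a_1^J given a target sentence e_1^I as

    P(f_1^J, a_1^J \mid e_1^I) = \prod_{j=1}^{J} p(a_j \mid a_{j-1}, I)\; p(f_j \mid e_{a_j}),

where the transition probability p(a_j | a_{j-1}, I) depends only on the jump width a_j - a_{j-1}: exactly the "differences in alignment positions rather than absolute positions" idea.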
Conference Paper
Full-text available
For Chinese POS tagging, word segmentation is a preliminary step. To avoid error propagation and improve segmentation by utilizing POS information, segmentation and tagging can be performed simultaneously. A challenge for this joint approach is the large combined search space, which makes efficient decoding very hard. Recent research has explored the integration of segmentation and POS tagging, by decoding under restricted versions of the full combined search space. In this paper, we propose a joint segmentation and POS tagging model that does not impose any hard constraints on the interaction between word and POS information. Fast decoding is achieved by using a novel multiple-beam search algorithm. The system uses a discriminative statistical model, trained using the generalized perceptron algorithm. The joint model gives an error reduction in segmentation accuracy of 14.6% and an error reduction in tagging accuracy of 12.2%, compared to the traditional pipeline approach.
Conference Paper
Full-text available
We propose a cascaded linear model for joint Chinese word segmentation and part-of-speech tagging. With a character-based perceptron as the core, combined with real-valued features such as language models, the cascaded model is able to efficiently utilize knowledge sources that are inconvenient to incorporate into the perceptron directly. Experiments show that the cascaded model achieves improved accuracies on both segmentation only and joint segmentation and part-of-speech tagging. On the Penn Chinese Treebank 5.0, we obtain an error reduction of 18.5% on segmentation and 12% on joint segmentation and part-of-speech tagging over the perceptron-only baseline.
Conference Paper
Full-text available
We describe an open-source toolkit for statistical machine translation whose novel contributions are (a) support for linguistically motivated factors, (b) confusion network decoding, and (c) efficient data formats for translation models and language models. In addition to the SMT decoder, the toolkit also includes a wide variety of tools for training, tuning and applying the system to many translation tasks.
Conference Paper
Full-text available
In this work, we tackle the task of machine translation (MT) without parallel training data. We frame the MT problem as a decipherment task, treating the foreign text as a cipher for English and present novel methods for training translation models from non-parallel text.
Conference Paper
Full-text available
We present a method for learning bilingual translation lexicons from monolingual corpora. Word types in each language are characterized by purely monolingual features, such as context counts and orthographic substrings. Translations are induced using a generative model based on canonical correlation analysis, which explains the monolingual lexicons in terms of latent matchings. We show that high-precision lexicons can be learned in a variety of language pairs and from a range of corpus types.
Conference Paper
Full-text available
This paper investigates the potential for projecting linguistic annotations including part-of-speech tags and base noun phrase bracketings from one language to another via automatically word-aligned parallel corpora. First, experiments assess the accuracy of unmodified direct transfer of tags and brackets from the source language English to the target languages French and Chinese, both for noisy machine-aligned sentences and for clean hand-aligned sentences. Performance is then substantially boosted over both of these baselines by using training techniques optimized for very noisy data, yielding 94-96% core French part-of-speech tag accuracy and 90% French bracketing F-measure for stand-alone monolingual tools trained without the need for any human-annotated data in the given language.
Article
When using a machine translation (MT) model trained on OLD-domain parallel data to translate NEW-domain text, one major challenge is the large number of out-of-vocabulary (OOV) and new-translation-sense words. We present a method to identify new translations of both known and unknown source language words that uses NEW-domain comparable document pairs. Starting with a joint distribution of source-target word pairs derived from the OLD-domain parallel corpus, our method recovers a new joint distribution that matches the marginal distributions of the NEW-domain comparable document pairs, while minimizing the divergence from the OLD-domain distribution. Adding learned translations to our French-English MT model results in gains of about 2 BLEU points over strong baselines.
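The estimation problem described here (match the NEW-domain marginals while minimizing divergence from the OLD-domain joint) is the classic setting of iterative proportional fitting. The sketch below shows that generic procedure under our reading of the abstract; it is not the authors' exact algorithm:

    import numpy as np

    def ipf(joint_old, row_marg, col_marg, iters=200, eps=1e-12):
        # Rescale an OLD-domain joint distribution over (source, target)
        # word pairs until its marginals match the NEW-domain marginals.
        # IPF converges to the minimum-KL projection onto those constraints.
        q = joint_old.copy()
        for _ in range(iters):
            q *= (row_marg / (q.sum(axis=1) + eps))[:, None]  # fix source marginal
            q *= (col_marg / (q.sum(axis=0) + eps))[None, :]  # fix target marginal
        return q

    old = np.array([[0.4, 0.1],
                    [0.1, 0.4]])
    new_joint = ipf(old, row_marg=np.array([0.7, 0.3]),
                    col_marg=np.array([0.5, 0.5]))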
Conference Paper
We present fast, accurate, direct non-projective dependency parsers with third-order features. Our approach uses AD3, an accelerated dual decomposition algorithm which we extend to handle specialized head automata and sequential head bigram models. Experiments in fourteen languages yield parsing speeds competitive with projective parsers, with state-of-the-art accuracies for the largest datasets (English, Czech, and German).
Conference Paper
In this paper we address the problem of solving substitution ciphers using a beam search approach. We present a conceptually consistent and easy-to-implement method that improves the current state of the art for decipherment of substitution ciphers and is able to use high-order n-gram language models. We show experiments with 1:1 substitution ciphers in which the guaranteed optimal solution for 3-gram language models has 38.6% decipherment error, while our approach achieves 4.13% decipherment error in a fraction of the time by using a 6-gram language model. We also apply our approach to the famous Zodiac-408 cipher and obtain slightly better (and near-optimal) results than previously published. Unlike the previous state-of-the-art approach, which uses additional word lists to evaluate possible decipherments, our approach only uses a letter-based 6-gram language model. Furthermore, we use our algorithm to solve large-vocabulary substitution ciphers and improve the best published decipherment error rate on the Gigaword corpus from 7.8% to 6.0%.
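To make the search procedure concrete, here is a minimal beam search over partial substitution keys for a 1:1 cipher. The scorer lm_score is an assumed placeholder for n-gram language model scoring of a partial decipherment; the beam width and the symbol ordering are likewise illustrative:

    import heapq

    def beam_search_decipher(cipher_syms, plain_syms, lm_score, beam_size=100):
        # beam holds (score, partial cipher->plain mapping) pairs.
        beam = [(0.0, {})]
        for c in cipher_syms:              # extend the key one symbol at a time
            candidates = []
            for _, mapping in beam:
                used = set(mapping.values())
                for p in plain_syms:
                    if p in used:          # enforce a 1:1 substitution key
                        continue
                    new_map = dict(mapping)
                    new_map[c] = p
                    candidates.append((lm_score(new_map), new_map))
            beam = heapq.nlargest(beam_size, candidates, key=lambda x: x[0])
        return max(beam, key=lambda x: x[0])

Pruning to the top beam_size hypotheses after each extension is what allows a 6-gram language model to be used where exact search is only feasible with 3-grams.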
Conference Paper
Two decades after their invention, the IBM word-based translation models, widely available in the GIZA++ toolkit, remain the dominant approach to word alignment and an integral part of many statistical translation systems. Although many models have surpassed them in accuracy, none have supplanted them in practice. In this paper, we propose a simple extension to the IBM models: an l0 prior to encourage sparsity in the word-to-word translation model. We explain how to implement this extension efficiently for large-scale data (also released as a modification to GIZA++) and demonstrate, in experiments on Czech, Arabic, Chinese, and Urdu to English translation, significant improvements over IBM Model 4 in both word alignment (up to +6.7 F1) and translation quality (up to +1.4 BLEU).
Conference Paper
We apply slice sampling to Bayesian decipherment and use our new decipherment framework to improve out-of-domain machine translation. Compared with the state of the art algorithm, our approach is highly scalable and produces better results, which allows us to decipher ciphertext with billions of tokens and hundreds of thousands of word types with high accuracy. We decipher a large amount of monolingual data to improve out-of-domain translation and achieve significant gains of up to 3.8 BLEU points.
Conference Paper
We estimate the parameters of a phrase-based statistical machine translation system from monolingual corpora instead of a bilingual parallel corpus. We extend existing research on bilingual lexicon induction to estimate both lexical and phrasal translation probabilities for MT-scale phrase-tables. We propose a novel algorithm to estimate reordering probabilities from monolingual data. We report translation results for an end-to-end translation system using these monolingual features alone. Our method only requires monolingual corpora in source and target languages, a small bilingual dictionary, and a small bitext for tuning feature weights. In this paper, we examine an idealization where a phrase-table is given. We examine the degradation in translation performance when bilingually estimated translation probabilities are removed and show that 80%+ of the loss can be recovered with monolingually estimated features alone. We further show that our monolingual features add 1.5 BLEU points when combined with standard bilingually estimated phrase table features.
Conference Paper
Prior research into learning translations from source and target language monolingual texts has treated the task as an unsupervised learning problem. Although many techniques take advantage of a seed bilingual lexicon, this work is the first to use that data for supervised learning to combine a diverse set of signals derived from a pair of monolingual corpora into a single discriminative model. Even in a low resource machine translation setting, where induced translations have the potential to improve performance substantially, it is reasonable to assume access to some amount of data to perform this kind of optimization. Our work shows that only a few hundred translation pairs are needed to achieve strong performance on the bilingual lexicon induction task, and our approach yields an average relative gain in accuracy of nearly 50% over an unsupervised baseline. Large gains in accuracy hold for all 22 languages (low and high resource) that we investigate.
Article
We make an analogy between images and statistical mechanics systems. Pixel gray levels and the presence and orientation of edges are viewed as states of atoms or molecules in a lattice-like physical system. The assignment of an energy function in the physical system determines its Gibbs distribution. Because of the equivalence between the Gibbs distribution and Markov random fields (MRFs), this assignment also determines an MRF image model. The energy function is a more convenient and natural mechanism for embodying picture attributes than are the local characteristics of the MRF. For a range of degradation mechanisms, including blurring, nonlinear deformations, and multiplicative or additive noise, the posterior distribution is an MRF with a structure akin to the image model. By the analogy, the posterior distribution defines another (imaginary) physical system. Gradual temperature reduction in the physical system isolates low energy states ("annealing"), or what is the same thing, the most probable states under the Gibbs distribution. The analogous operation under the posterior distribution yields the maximum a posteriori (MAP) estimate of the image given the degraded observations. The result is a highly parallel "relaxation" algorithm for MAP estimation. We establish convergence properties of the algorithm and we experiment with some simple pictures, for which good restorations are obtained at low signal-to-noise ratios.
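The annealing procedure itself is compact in code. Below is a generic simulated-annealing loop for MAP estimation under a Gibbs distribution P(x) proportional to exp(-U(x)/T); energy and propose (e.g., relabel one pixel) are assumed problem-specific callables, and the geometric cooling schedule is illustrative rather than the logarithmic schedule analyzed in the paper:

    import math
    import random

    def simulated_annealing(x0, energy, propose, t0=10.0, cool=0.999, steps=20000):
        x, t = x0, t0
        e_x = energy(x)
        for _ in range(steps):
            y = propose(x)                 # local move in state space
            e_y = energy(y)
            d = e_y - e_x
            if d <= 0 or random.random() < math.exp(-d / t):
                x, e_x = y, e_y            # accept downhill always, uphill w.p. exp(-d/T)
            t *= cool                      # gradually lower the temperature
        return x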
Conference Paper
Minimum Error Rate Training (MERT) is an effective means to estimate the feature function weights of a linear model such that an automated evaluation criterion for measuring system performance can directly be optimized in training. To accomplish this, the training procedure determines for each feature function its exact error surface on a given set of candidate translations. The feature function weights are then adjusted by traversing the error surface combined over all sentences and picking those values for which the resulting error count reaches a minimum. Typically, candidates in MERT are represented as N-best lists which contain the N most probable translation hypotheses produced by a decoder. In this paper, we present a novel algorithm that allows for efficiently constructing and representing the exact error surface of all translations that are encoded in a phrase lattice. Compared to N-best MERT, the number of candidate translations thus taken into account increases by several orders of magnitude. The proposed method is used to train the feature function weights of a phrase-based statistical machine translation system. Experiments conducted on the NIST 2008 translation tasks show significant runtime improvements and moderate BLEU score gains over N-best MERT.
Conference Paper
We study a number of natural language decipherment problems using unsupervised learning. These include letter substitution ciphers, character code conversion, phonetic decipherment, and word-based ciphers with relevance to machine translation. Straightforward unsupervised learning techniques most often fail on the first try, so we describe techniques for understanding errors and significantly increasing performance.
Conference Paper
Broad-coverage annotated treebanks necessary to train parsers do not exist for many resource-poor languages. The wide availability of parallel text and accurate parsers in English has opened up the possibility of grammar induction through partial transfer across bitext. We consider generative and discriminative models for dependency grammar induction that use word-level alignments and a source language parser (English) to constrain the space of possible target trees. Unlike previous approaches, our framework does not require full projected parses, allowing partial, approximate transfer through linear expectation constraints on the space of distributions over trees. We consider several types of constraints that range from generic dependency conservation to language-specific annotation rules for auxiliary verb analysis. We evaluate our approach on Bulgarian and Spanish CoNLL shared task data and show that we consistently outperform unsupervised methods and can outperform supervised learning for limited training data.
Conference Paper
We show that unseen words account for a large part of the translation error when moving to new domains. Using an extension of a recent approach to mining translations from comparable corpora (Haghighi et al., 2008), we are able to find translations for otherwise OOV terms. We show several approaches to integrating such translations into a phrase-based translation system, yielding consistent improvements in translation quality (between 0.5 and 1.5 BLEU points) on four domains and two language pairs.
Conference Paper
Speakers of many different languages use the Internet. A common activity among these users is uploading images and associating these images with words (in their own language) as captions, filenames, or surrounding text. We use these explicit, monolingual, image-to-word connections to successfully learn implicit, bilingual, word-to-word translations. Bilingual pairs of words are proposed as translations if their corresponding images have similar visual features. We generate bilingual lexicons in 15 language pairs, focusing on words that have been automatically identified as physical objects. The use of visual similarity substantially improves performance over standard approaches based on string similarity: for generated lexicons with 1000 translations, including visual information leads to an absolute improvement in accuracy of 8-12% over string edit distance alone.
Article
The purpose of this paper is to provide guidelines for building a word alignment evaluation scheme. The notion of word alignment quality depends on the application: here we review standard scoring metrics for full text alignment and give explanations on how to use them better. We discuss strategies to build a reference corpus, and show that the ratio between ambiguous and unambiguous links in the reference has a great impact on scores measured with these metrics. In particular, automatically computed alignments with higher precision or higher recall can be favoured depending on the value of this ratio. Finally, we suggest a strategy to build a reference corpus particularly adapted to applications where recall plays a significant role, like in machine translation. The manually aligned corpus we built for the Spanish-English European Parliament corpus is also described. This corpus is freely available.
Article
A broadly applicable algorithm for computing maximum likelihood estimates from incomplete data is presented at various levels of generality. Theory showing the monotone behaviour of the likelihood and convergence of the algorithm is derived. Many examples are sketched, including missing value situations, applications to grouped, censored or truncated data, finite mixture models, variance component estimation, hyperparameter estimation, iteratively reweighted least squares and factor analysis.
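For reference, the two steps in standard notation (a textbook formulation consistent with, but not copied from, the article): with observed data X, latent variables Z, and parameters \theta,

    E-step:  Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{Z \sim p(Z \mid X, \theta^{(t)})}\big[\log p(X, Z \mid \theta)\big]
    M-step:  \theta^{(t+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(t)}),

and each iteration is guaranteed not to decrease the observed-data likelihood, L(\theta^{(t+1)}) \ge L(\theta^{(t)}), which is the monotone behaviour referred to above.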
Article
Often, the training procedure for statistical machine translation models is based on maximum likelihood or related criteria. A general problem of this approach is that there is only a loose relation to the final translation quality on unseen text. In this paper, we analyze various training criteria which directly optimize translation quality.
Article
This paper presents work on the task of constructing a word-level translation lexicon purely from unrelated monolingual corpora. We combine various clues such as cognates, similar context, preservation of word similarity, and word frequency. Experimental results for the construction of a German-English noun lexicon are reported.
Article
Common algorithms for sentence and word-alignment allow the automatic identification of word translations from parallel texts. This study suggests that the identification of word translations should also be possible with non-parallel and even unrelated texts. The method proposed is based on the assumption that there is a correlation between the patterns of word co-occurrences in texts of different languages.
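One common way to operationalize this assumption is to build context vectors in each language and compare them across languages through a small seed lexicon. The sketch below is our generic realization of that idea; the function names and the dot-product similarity are assumptions, and the paper's actual correlation measure may differ:

    import numpy as np

    def context_vectors(corpus, vocab, window=2):
        # Row i: normalized co-occurrence counts of vocab[i] with all
        # vocabulary words within +/- window positions.
        idx = {w: i for i, w in enumerate(vocab)}
        m = np.zeros((len(vocab), len(vocab)))
        for sent in corpus:
            for i, w in enumerate(sent):
                if w not in idx:
                    continue
                lo, hi = max(0, i - window), min(len(sent), i + window + 1)
                for j in range(lo, hi):
                    if j != i and sent[j] in idx:
                        m[idx[w], idx[sent[j]]] += 1
        return m / (m.sum(axis=1, keepdims=True) + 1e-12)

    def rank_translations(src_vecs, tgt_vecs, seed):
        # seed[k] = index of the source context dimension whose seed-lexicon
        # translation is target context dimension k.
        projected = src_vecs[:, seed]   # express source words in target dimensions
        sims = projected @ tgt_vecs.T   # compare against every target word
        return sims.argmax(axis=1)      # best target candidate per source word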
Qing Dou and Kevin Knight. 2013. Dependency-based decipherment for resource-limited machine translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.

Pascale Fung and Lo Yuen Yee. 1998. An IR approach for translating new words from nonparallel, comparable texts. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1. Association for Computational Linguistics.

Nikesh Garera, Chris Callison-Burch, and David Yarowsky. 2009. Improving translation lexicon induction from monolingual corpora via dependency contexts and part-of-speech equivalences. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning. Association for Computational Linguistics.

Ann Irvine, Chris Quirk, and Hal Daumé III. 2013. Monolingual marginal matching for translation model adaptation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.

Philipp Koehn and Kevin Knight. 2002. Learning a translation lexicon from monolingual corpora. In Proceedings of the ACL-02 Workshop on Unsupervised Lexical Acquisition. Association for Computational Linguistics.