Article

Automatic identification of word translations from unrelated English and German corpora

Authors: Reinhard Rapp

Abstract

Algorithms for the alignment of words in translated texts are well established. However, only recently have new approaches been proposed to identify word translations from non-parallel or even unrelated texts. This task is more difficult, because most statistical clues useful in the processing of parallel texts cannot be applied to non-parallel texts. Whereas for parallel texts some studies have shown up to 99% of word alignments to be correct, the accuracy for non-parallel texts has so far been around 30%. The current study, which is based on the assumption that there is a correlation between the patterns of word co-occurrences in corpora of different languages, makes a significant improvement to about 72% of word translations identified correctly.
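
The assumption stated in the abstract can be illustrated with a small, purely invented example: if the rows and columns of two monolingual co-occurrence matrices are ordered so that translation pairs line up, the matrices show correlated patterns. The vocabularies and counts below are hypothetical and only serve to make the idea concrete.

    # Toy sketch (invented data): aligned co-occurrence matrices correlate.
    import numpy as np

    en_vocab = ["teacher", "school", "pupil"]      # hypothetical English words
    de_vocab = ["Lehrer", "Schule", "Schueler"]    # their German counterparts

    # Invented window co-occurrence counts for the two corpora.
    C_en = np.array([[0, 8, 5],
                     [8, 0, 6],
                     [5, 6, 0]])
    C_de = np.array([[0, 7, 6],
                     [7, 0, 5],
                     [6, 5, 0]])

    # With rows/columns ordered by translation pairs, the patterns correlate.
    corr = np.corrcoef(C_en.flatten(), C_de.flatten())[0, 1]
    print(f"correlation of aligned co-occurrence patterns: {corr:.2f}")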


... Rapp (1995) presents evidence and correctly speculates that there may be sufficient signal. Fung (1997); Rapp (1999) used an initial dictionary to successfully expand the vocabulary further using a mutual-information-based association matrix. Alvarez-Melis and Jaakkola (2018) operate on an association matrix generated from word vectors. ...
... Normalization is a common and crucial element of all working methods. The methods of Rapp (1995); Fung (1997); Rapp (1999) were based on statistical principles but with modifications to make normalized measurements. Levy and Goldberg (2014); Pennington et al. (2014) actually proposed better normalization methods too that would equally apply to co-occurrences (Appendix A). ...
... Fung (1997) did not work with the prescribed ℓ2 criteria (not shown) but worked with ℓ1 as well. We think Rapp (1999) is similar to our modification of Fung (1997) with ℓ1, which contains its most important term. However, using ℓ1 led to lower data efficiency and lower accuracy in methods that also work with ℓ2. ...
Preprint
Full-text available
The striking ability of unsupervised word translation has been demonstrated with the help of word vectors / pretraining; however, these require large amounts of data and usually fail if the data come from different domains. We propose coocmap, a method that can use either high-dimensional co-occurrence counts or their lower-dimensional approximations. Freed from the limits of low dimensions, we show that relying on low-dimensional vectors and their incidental properties misses out on better denoising methods and useful world knowledge in high dimensions, thus stunting the potential of the data. Our results show that unsupervised translation can be achieved more easily and robustly than previously thought -- less than 80MB and minutes of CPU time are required to achieve over 50% accuracy for English to Finnish, Hungarian, and Chinese translations when trained on similar data; even under domain mismatch, we show coocmap still works fully unsupervised on English NewsCrawl to Chinese Wikipedia and English Europarl to Spanish Wikipedia, among others. These results challenge prevailing assumptions on the necessity and superiority of low-dimensional vectors, and suggest that similarly processed co-occurrences can outperform dense vectors on other tasks too.
... C (Mikolov et al. 2013; Pennington et al. 2014; Vulic and Moens 2016; Artetxe et al. 2016; Ethan Fast 2017; Hazem and Morin 2017; Xu et al. 2018). Once contexts, usually consisting of cooccurring words, have been identified, they are mapped across languages using a bilingual lexicon (Fung 1998; Rapp 1999). As the bilingual lexicon usually used is either a large, general dictionary or a small, domain-specific lexicon, there is a high risk of missing potential associations across languages when trying to extract bilingual lexicons in specific domains (Gaussier et al. 2004; Déjean et al. 2005; Tamura et al. 2012; Irvine and Callison-Burch 2013; Linard et al. 2015; Vulic and Moens 2016; Morin and Hazem 2016). ...
... In our experiments, as comparative baselines, we consider the standard approach (Rapp 1999) and two recent unsupervised, neural-based approaches introduced in Xu et al. (2018) and Zhang et al. (2017). Section 2. It is used here with weights based on word embeddings, as described in Section 5.1. ...
... • The standard approach (Rapp 1999): The standard approach follows the three steps (modeling contexts, calculating context similarities, and finding translation pairs) described in Section 2. ...
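
A minimal sketch of this three-step pipeline is given below. It only illustrates the general recipe: the window size, the seed dictionary, and the overlap-based scoring are placeholder choices, not the exact settings of Rapp (1999).

    # Sketch of the "standard approach": (1) model contexts, (2) compare
    # context vectors across languages, (3) pick the best-matching target word.
    from collections import Counter

    def context_vectors(tokens, seed_vocab, window=3):
        """Step 1: count co-occurrences of every word with seed-vocabulary words."""
        vecs = {w: Counter() for w in set(tokens)}
        for i, w in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i and tokens[j] in seed_vocab:
                    vecs[w][tokens[j]] += 1
        return vecs

    def translate(src_word, src_vecs, tgt_vecs, seed_dict):
        """Steps 2-3: project the source context vector through the seed
        dictionary and rank target words by overlap with their context vectors."""
        projected = Counter()
        for ctx, n in src_vecs[src_word].items():
            if ctx in seed_dict:
                projected[seed_dict[ctx]] += n
        scores = {t: sum(min(projected[c], vec[c]) for c in projected)
                  for t, vec in tgt_vecs.items()}
        return max(scores, key=scores.get) if scores else None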
Article
Bilingual corpora are an essential resource used to cross the language barrier in multilingual natural language processing tasks. Among bilingual corpora, comparable corpora have been the subject of many studies as they are both frequent and easily available. In this paper, we propose to make use of formal concept analysis to first construct concept vectors which can be used to enhance comparable corpora through clustering techniques. We then show how one can extract bilingual lexicons of improved quality from these enhanced corpora. We finally show that the bilingual lexicons obtained can complement existing bilingual dictionaries and improve cross-language information retrieval systems.
... The starting point of this strategy is a list of bilingual expressions that are used to build the context vectors of all words in both languages. This starting list, or initial dictionary, is named the seed dictionary (Fung, 1995) and is usually provided by an external bilingual dictionary (Rapp, 1999; Chiao and Zweigenbaum, 2002; Fung and McKeown, 1997; Fung and Yee, 1998). Some of the recent methods use small parallel corpora to create their seed list (Otero, 2007), and others use no dictionary in the starting phase (Rapp and Zock, 2010). ...
... There is a growing number of approaches focused on extracting word translations from comparable corpora (Fung and McKeown, 1997; Fung and Yee, 1998; Rapp, 1999; Chiao and Zweigenbaum, 2002; Déjean et al., 2002; Kaji, 2005; Otero, 2007; Otero and Campos, 2010; Rapp and Zock, 2010; Bouamor et al., 2013; Irimia, 2012; E. Morin and Prochasson, 2013; Emmanuel and Hazem, 2014). ...
... Another interesting issue considered in recent years is the effect of the degree of comparability on the accuracy of extracted resources (Li and Gaussier, 2010; Sharoff, 2013). As described before, it is assumed that there is a small bilingual dictionary available at the beginning. Most methods use an existing dictionary (Rapp, 1999; Chiao and Zweigenbaum, 2002; Fung and McKeown, 1997; Fung and Yee, 1998) or build one with some small parallel resources (Otero, 2007). Entries in the dictionary are used as an initial list of seed words. ...
Conference Paper
Full-text available
Bilingual dictionaries are very important in various fields of natural language processing. In recent years, research on extracting new bilingual lexicons from non-parallel (comparable) corpora has been proposed. Almost all approaches use a small existing dictionary or other resources to make an initial list called the "seed dictionary". In this paper, we discuss the use of different types of dictionaries as the initial starting list for creating a bilingual Persian-Italian lexicon from a comparable corpus. Our experiments apply state-of-the-art techniques on three different seed dictionaries: an existing dictionary, a dictionary created with a pivot-based schema, and a dictionary extracted from a small Persian-Italian parallel text. The interesting challenge of our approach is to find a way to combine different dictionaries in order to produce a better and more accurate lexicon. We propose two different novel combination models and examine their effect on various comparable corpora that have differing degrees of comparability. We conclude our work with a new weighting schema to improve the extracted lexicon. The experimental results show the efficiency of our proposed models.
... However, most parallel corpora are owned by private companies, such as language service providers, who consider them to be their intellectual property and are reluctant to share them publicly. For this reason (and in particular for language pairs not involving English) considerable efforts have also been invested into researching bilingual terminology extraction from comparable corpora (Fung and Yee 1998; Rapp 1999; Chiao and Zweigenbaum 2002; Cao and Li 2002; Daille and Morin 2005; Morin et al. 2008; Vintar 2010; Bouamor et al. 2013; Morin 2016, 2017). ...
... Note that under ''dataset'', we include corpora, gold standard termlists, seed dictionaries and all other linguistic resources needed to conduct the experiments in the paper. For example, we consider the following paragraph from Rapp (1999) to be a valid description of a dataset: As the German corpus, we used 135 million words of the newspaper Frankfurter Allgemeine Zeitung (1993 to 1996), and as the English corpus 163 million words of the Guardian (1990 to 1994). On the other hand, this paragraph from Ideue et al. (2011) is not considered a valid description: We extracted bilingual term candidates from a Japanese-English parallel corpus consisting of documents related to apparel products. ...
... In fact, the earliest paper analyzed, Kupiec (1993), provides a reference to a publicly available corpus (Canadian Hansards (Gale and Church 1993)). The first paper to have a separate section with data/resource description is Rapp (1999), and from this point on, almost all papers have such a section, usually titled ''Data and Resources'', ''Resources and Experimental Setup'', ''Linguistic resources'' or similar. ...
Article
Full-text available
In this paper, we look at the issue of reproducibility and replicability in bilingual terminology alignment (BTA). We propose a set of best practices for reproducibility and replicability of NLP papers and analyze several influential BTA papers from this perspective. Next, we present our attempts at replication and reproduction, where we focus on a bilingual terminology alignment approach described by Aker et al. (Extracting bilingual terminologies from comparable corpora. In: Proceedings of the 51st annual meeting of the association for computational linguistics, vol. 1 402–411, 2013) who treat bilingual term alignment as a binary classification problem and train an SVM classifier on various dictionary and cognate-based features. Despite closely following the original paper with only minor deviations—in areas where the original description is not clear enough—we obtained significantly worse results than the authors of the original paper. We then analyze the reasons for the discrepancy and describe our attempts at adaptation of the approach to improve the results. Only after several adaptations, we achieve results which are close to the results published in the original paper. Finally, we perform the experiments to verify the replicability and reproducibility of our own code. We publish our code and datasets online to assure the reproducibility of the results of our experiments and implement the selected BTA models in an online platform making them easily reusable even by the technically less-skilled researchers.
... The current work can be seen as a continuation of our previous work (Rapp 1995, 1999). We present a novel algorithm and provide quantitative results for six language pairs rather than for just one. ...
... The word with the smallest value of the product is considered to be the translation of the source language word. This algorithm turned out to be a significant improvement over the previous one described by Rapp (1999). It provides better accuracy and considerably higher robustness with regard to sampling errors. ...
... Since semantic patterns are more reliable than syntactic patterns across language families, we hoped that eliminating the function words would increase the generality of our method. Rapp (1999) used a list of 100 German test words together with their English translations as the gold standard for testing results. As this list is rather small, and as we also needed French translations, we decided to compile a larger trilingual list of test words. ...
Chapter
This chapter describes several approaches of using comparable corpora beyond the area of MT for under-resourced languages, which is the primary focus of the ACCURAT project. Section 7.1, which is based on Rapp and Zock (Automatic dictionary expansion using non-parallel corpora. In: A. Fink, B. Lausen, W. Seidel, & A. Ultsch (Eds.) Advances in Data Analysis, Data Handling and Business Intelligence. Proceedings of the 32nd Annual Meeting of the GfKl, 2008. Springer, Heidelberg, 2010), addresses the task of creating resources for bilingual dictionaries using a seed lexicon; Sect. 7.2 (based on Rapp et al., Identifying word translations from comparable documents without a seed lexicon. Proceedings of LREC 2012, Istanbul, 2012) develops and evaluates a novel methodology of creating bilingual dictionaries without an initial lexicon. Section 7.3 proposes a novel system that can extract Chinese–Japanese parallel sentences from quasi-comparable and comparable corpora.
... However, such a scenario is not feasible for all language pairs or domains, because ready-made parallel corpora do not exist for many of them, and compilation of such corpora is slow and expensive. This is why an alternative approach that relies on texts in two languages, which are not parallel but nevertheless share several parameters, such as topic, time of publication and communicative goal (Fung 1998; Rapp 1999), has been increasingly explored in the past decade. Compilation of such comparable corpora is much easier, especially given the availability of rich web data (Xiao and McEnery 2006). ...
... The seminal papers in bilingual lexicon construction are Fung (1998) and Rapp (1999), who showed that texts do not need to be parallel in order to extract translation equivalents from them. Instead, their main assumption, central to distributional semantics, is that a term and its translation appear in similar contexts. ...
... A number of different vector similarity measures have been investigated. Rapp (1999) applies the city-block metric, whilst Fung (1998) works with cosine similarity. Recent work often uses the Jaccard index or Dice coefficient (Saralegi et al. 2008). ...
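
For reference, the four similarity measures mentioned in this snippet can be computed on a pair of context vectors as sketched below; the weighted forms of Jaccard and Dice are one common generalisation among several, so treat the exact formulas as an assumption.

    # Sketch of the vector similarity measures named above, for two context
    # vectors a and b of equal length containing non-negative association scores.
    import numpy as np

    def city_block(a, b):            # used by Rapp (1999); smaller = more similar
        return np.abs(a - b).sum()

    def cosine(a, b):                # used by Fung (1998)
        return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def jaccard(a, b):               # weighted Jaccard index
        return np.minimum(a, b).sum() / np.maximum(a, b).sum()

    def dice(a, b):                  # weighted Dice coefficient
        return 2 * np.minimum(a, b).sum() / (a.sum() + b.sum())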
Chapter
Comparable corpora may comprise different types of single-word and multi-word phrases that can be considered as reciprocal translations, which may be beneficial for many different natural language processing tasks. This chapter describes methods and tools developed within the ACCURAT project that allow utilising comparable corpora in order to (1) identify terms, named entities (NEs), and other lexical units in comparable corpora, and (2) to cross-lingually map the identified single-word and multi-word phrases in order to create automatically extracted bilingual dictionaries that can be further utilised in machine translation, question answering, indexing, and other areas where bilingual dictionaries can be useful.
... These methods work on the assumption that a word in a particular context will have its translation in a similar context in the other language [26]. The researchers focused on the similarity of context and calculated the similarity level. ...
... As such, the context-based methods require large lexicons; otherwise accuracy is low. As a solution, different methods were attempted which either make no use of a seed lexicon [26] or make it size-independent [19]. The commonly used methods to measure similarity scores make use of three steps: first of all, words are converted to vectors. ...
... These methods are used for extraction of bilingual lexicons and not for extracting parallel sentences. The first and most relevant reference for bilingual lexicon extraction using non-parallel text for single-word terms is probably [13], followed by [26]. [4] This approach, also called the Standard Method, utilises the concept of vector formation. ...
Article
Full-text available
Bilingual lexicons are important resources for performing a number of bilingual tasks in machine translation (MT) and cross-language information retrieval (CLIR). Since the manual building of bilingual lexicons is a tedious affair, researchers have focused upon the automatic extraction of bilingual lexicons from corpora. Another issue is the use of parallel and comparable corpora for extraction. Much success has been achieved in the use of parallel corpora, but these are only available for a few language pairs and for limited domains. Therefore, the use of comparable corpora comes as an alternative, but a lot needs to be done in this field. The paper presents a review of different techniques and methods which have been used for automatic extraction of bilingual lexicons, suggesting that an integrated approach can give better results than using individual approaches. The paper also contains a proposed method for extraction of bilingual lexicons using a combined approach.
... With regards to resource-poor languages, one approach that is indeed beneficial is to use comparable/non-parallel corpora. Although comparable corpora have been known to be helpful ( [14]), their application to this task has been rather limited ( [21], [9], [26]). ...
... The most common approach for extracting translation equivalents from parallel corpora is to use Statistical Machine Translation (SMT) ( [27]). Recently, several studies have suggested approaches for extracting parallel segments from comparable corpora for several different tasks, including bilingual lexicon construction ( [21], [9], [11], [4]), and sentence alignment for improving SMT ( [10], [25], [18]). Corpus-based distributional similarity has been used in a bilingual context to automatically discover translationally-equivalent words from comparable corpora ( [20], [21], [9]). ...
... Recently, several studies have suggested approaches for extracting parallel segments from comparable corpora for several different tasks, including bilingual lexicon construction ( [21], [9], [11], [4]), and sentence alignment for improving SMT ( [10], [25], [18]). Corpus-based distributional similarity has been used in a bilingual context to automatically discover translationally-equivalent words from comparable corpora ( [20], [21], [9]). It is not clear, however, whether a similar approach can be used for finding the translations of multi-word collocations. ...
... Gaussier, Renders, Matveeva, Goutte, and Dejean (2004) presented a geometric view of this process. Previous studies have used different definitions of context such as window-based context (Fung 1995; Rapp 1999; Koehn and Knight 2002; Haghighi, Liang, Berg-Kirkpatrick, and Klein 2008; Prochasson and Fung 2011; Tamura, Watanabe, and Sumita 2012), sentence-based context (Fung and Yee 1998), and syntax-based context (Garera, Callison-Burch, and Yarowsky 2009; Yu and Tsujii 2009; Qian, Wang, Zhou, and Zhu 2012). To quantify the strength of the association between a word and its context word, different association measures have been used, such as log likelihood ratio (Rapp 1999), term frequency-inverse document frequency (TF-IDF) (Fung and Yee 1998) and pointwise mutual information (Andrade, Nasukawa, and Tsujii 2010). ...
... Previous studies have used different definitions of context such as window-based context (Fung 1995; Rapp 1999; Koehn and Knight 2002; Haghighi, Liang, Berg-Kirkpatrick, and Klein 2008; Prochasson and Fung 2011; Tamura, Watanabe, and Sumita 2012), sentence-based context (Fung and Yee 1998), and syntax-based context (Garera, Callison-Burch, and Yarowsky 2009; Yu and Tsujii 2009; Qian, Wang, Zhou, and Zhu 2012). To quantify the strength of the association between a word and its context word, different association measures have been used, such as log likelihood ratio (Rapp 1999), term frequency-inverse document frequency (TF-IDF) (Fung and Yee 1998) and pointwise mutual information (Andrade, Nasukawa, and Tsujii 2010). Previous studies have also used different measures to compute the similarity between the vectors, such as cosine similarity (Fung and Yee 1998; Garera et al. 2009; Prochasson and Fung 2011; Tamura et al. 2012), Euclidean distance (Fung 1995; Yu and Tsujii 2009), the city-block metric (Rapp 1999), and Spearman rank order (Koehn and Knight 2002). ...
... To quantify the strength of the association between a word and its context word, different association measures have been used, such as log likelihood ratio (Rapp 1999), term frequency-inverse document frequency (TF-IDF) (Fung and Yee 1998) and pointwise mutual information (Andrade, Nasukawa, and Tsujii 2010). Previous studies have also used different measures to compute the similarity between the vectors, such as cosine similarity (Fung and Yee 1998; Garera et al. 2009; Prochasson and Fung 2011; Tamura et al. 2012), Euclidean distance (Fung 1995; Yu and Tsujii 2009), the city-block metric (Rapp 1999), and Spearman rank order (Koehn and Knight 2002). Laroche and Langlais (2010) conducted a systematic study using different association and similarity measures for CBM. ...
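
A rough sketch of two of these association measures is given below, computed from raw co-occurrence counts; the exact smoothing and normalisation choices vary between papers and are not prescribed here.

    # Sketch of pointwise mutual information and the log likelihood ratio as
    # word-context association measures; counts come from a 2x2 contingency
    # table (word present/absent x context word present/absent).
    import math

    def pmi(cooc, freq_word, freq_ctx, total):
        """PMI = log2( P(word, ctx) / (P(word) * P(ctx)) )."""
        if cooc == 0:
            return 0.0
        return math.log2((cooc / total) / ((freq_word / total) * (freq_ctx / total)))

    def log_likelihood_ratio(k11, k12, k21, k22):
        """G^2 = 2 * sum over cells of O * log(O / E), with E from the margins."""
        total = k11 + k12 + k21 + k22
        row1, row2 = k11 + k12, k21 + k22
        col1, col2 = k11 + k21, k12 + k22
        def cell(observed, row, col):
            expected = row * col / total
            return observed * math.log(observed / expected) if observed > 0 else 0.0
        return 2 * (cell(k11, row1, col1) + cell(k12, row1, col2)
                    + cell(k21, row2, col1) + cell(k22, row2, col2))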
Article
Parallel corpora are crucial for statistical machine translation (SMT); however, they are quite scarce for most language pairs and domains. As comparable corpora are far more available, many studies have been conducted to extract parallel sentences from them for SMT. Parallel sentence extraction relies highly on bilingual lexicons that are also very scarce. We propose an unsupervised bilingual lexicon extraction based parallel sentence extraction system that first extracts bilingual lexicons from comparable corpora and then extracts parallel sentences using the lexicons. Our bilingual lexicon extraction method is based on a combination of topic model and context based methods in an iterative process. The proposed method does not rely on any prior knowledge, and the performance can be improved iteratively. The parallel sentence extraction method uses a binary classifier for parallel sentence identification. The extracted bilingual lexicons are used for the classifier to improve the performance of parallel sentence extraction. Experiments conducted with the Wikipedia data indicate that the proposed bilingual lexicon extraction method greatly outperforms existing methods, and the extracted bilingual lexicons significantly improve the performance of parallel sentence extraction for SMT.
... Results are given for a set of 100 English and German word translation pairs. Later formulations of the problem, including Fung and Yee (1998) and Rapp (1999), used small seed dictionaries to project word-based context vectors from the vector space of one language into the vector space of the other language. That is, each position in contextual vector v corresponds to a word in the source vocabulary, and vectors v are computed for each source word in the test set. ...
... where maxn is the maximum frequency of any of the words in the corpus, and f_i is the frequency of word i. Rapp (1999) uses the same projection method as Fung and Yee (1998) but uses log-likelihood ratios instead of TF · IDF. Once source and target language contextual vectors are built, each position in the source language vectors is projected onto the target side using a seed bilingual dictionary. ...
... We use the vector space approach of Rapp (1999) to compute similarity between words in the source and target languages. More formally, assume that (s_1, s_2, ... ...
Article
Full-text available
Bilingual lexicon induction is the task of inducing word translations from monolingual corpora in two languages. In this article we present the most comprehensive analysis of bilingual lexicon induction to date. We present experiments on a wide range of languages and data sizes. We examine translation into English from 25 foreign languages: Albanian, Azeri, Bengali, Bosnian, Bulgarian, Cebuano, Gujarati, Hindi, Hungarian, Indonesian, Latvian, Nepali, Romanian, Serbian, Slovak, Somali, Spanish, Swedish, Tamil, Telugu, Turkish, Ukrainian, Uzbek, Vietnamese, and Welsh. We analyze the behavior of bilingual lexicon induction on low-frequency words, rather than testing solely on high-frequency words, as previous research has done. Low-frequency words are more relevant to statistical machine translation, where systems typically lack translations of rare words that fall outside of their training data. We systematically explore a wide range of features and phenomena that affect the quality of the translations discovered by bilingual lexicon induction. We provide illustrative examples of the highest ranking translations for orthogonal signals of translation equivalence like contextual similarity and temporal similarity. We analyze the effects of frequency and burstiness, and the sizes of the seed bilingual dictionaries and the monolingual training corpora. Additionally, we introduce a novel discriminative approach to bilingual lexicon induction. Our discriminative model is capable of combining a wide variety of features that individually provide only weak indications of translation equivalence. When feature weights are discriminatively set, these signals produce dramatically higher translation quality than previous approaches that combined signals in an unsupervised fashion (e.g., using minimum reciprocal rank). We also directly compare our model's performance against a sophisticated generative approach, the matching canonical correlation analysis (MCCA) algorithm used by Haghighi et al. (2008). Our algorithm achieves an accuracy of 42% versus MCCA's 15%.
... The starting point of this strategy is a list of bilingual expressions that are used to build the context vectors of all words in both languages. This starting list, or initial dictionary, is named the seed dictionary (Fung 1995) and is usually provided by an external bilingual dictionary (Rapp 1999; Chiao & Zweigenbaum 2002; Fung & McKeown 1997; Fung & Yee 1998). Some of the recent methods use small parallel corpora to create their seed list (Otero 2007), and others use no dictionary in the starting phase (Rapp & Zock 2010). ...
... There is a growing number of approaches focused on extracting word translations from comparable corpora (Fung & McKeown 1997; Fung & Yee 1998; Rapp 1999; Chiao & Zweigenbaum 2002; Déjean, Gaussier & Sadat 2002; Kaji 2005; Otero 2007; Otero & Campos 2010; Rapp & Zock 2010; Bouamor, Semmar & Zweigenbaum 2013; Irimia 2012; E. Morin & Prochasson 2013; Emmanuel & Hazem 2014). ...
... As described before, it is assumed that there is a small bilingual dictionary available at the beginning. Most methods use an existing dictionary (Rapp 1999; Chiao & Zweigenbaum 2002; Fung & McKeown 1997; Fung & Yee 1998) or build one with some small parallel resources (Otero 2007). Entries in the dictionary are used as an initial list of seed words. ...
Article
Full-text available
Bilingual dictionaries are very important in various fields of natural language processing. In recent years, research on extracting new bilingual lexicons from non-parallel (comparable) corpora has been proposed. Almost all approaches use a small existing dictionary or other resource to make an initial list called the "seed dictionary". In this paper we discuss the use of different types of dictionaries as the initial starting list for creating a bilingual Persian-Italian lexicon from a comparable corpus. Our experiments apply state-of-the-art techniques on three different seed dictionaries: an existing dictionary, a dictionary created with a pivot-based schema, and a dictionary extracted from a small Persian-Italian parallel text. The interesting challenge of our approach is to find a way to combine different dictionaries in order to produce a better and more accurate lexicon. In order to combine seed dictionaries, we propose two different combination models and examine the effect of our novel combination models on various comparable corpora that have differing degrees of comparability. We conclude with a proposal for a new weighting system to improve the extracted lexicon. The experimental results produced by our implementation show the efficiency of our proposed models.
... Distributional word representations obtained from a word co-occurrence matrix have been applied to word alignment since the 1990s (Fung, 1995; Rapp, 1999). This section introduces word alignment based on the distributional representation, the details of which can be found in Section 2.1. ...
... The historical context-based projection approach, also known as the standard approach (SA) has been studied in a variety of works (Fung, 1995;Rapp, 1999;Chiao and Zweigenbaum, 2002;Bouamor et al., 2013;Hazem and Morin, 2016;Jakubina and Langlais, 2017). The first step consists in building the distributional word representation for each language (see Section 2.1 for details). ...
Thesis
Significant advances have been achieved in bilingual word-level alignment from comparable corpora, yet the challenge remains for phrase-level alignment. Traditional methods for phrase alignment can only handle phrases of equal length, while word embedding based approaches that learn phrase embeddings as individual vocabulary entries suffer from data sparsity and cannot handle out-of-vocabulary phrases. Since bilingual alignment is a vector comparison task, phrase representation plays a key role. In this thesis, we study approaches for unified phrase modeling and cross-lingual phrase alignment, ranging from co-occurrence models to the most recent neural state-of-the-art approaches. We review supervised and unsupervised frameworks for modeling cross-lingual phrase representations. Two contributions are proposed in this work. First, a new architecture called tree-free recursive neural network (TF-RNN) for modeling phrases of variable length which, combined with a wrapped context prediction training objective, outperforms the state-of-the-art approaches on the monolingual phrase synonymy task with only plain text training data. Second, for cross-lingual modeling, we propose to incorporate an architecture derived from TF-RNN in an encoder-decoder model with a pseudo back translation mechanism inspired by unsupervised neural machine translation. Our proposal significantly improves bilingual alignment of phrases of different lengths.
... Despite its scarcity, parallel cross-lingual data plays a crucial role in a variety of tasks in natural language processing. Traditionally, machine translation approaches have leveraged parallel sentences as training data for use with sequence-to-sequence models. ...
... Because training data can be scarce, previous works have shown that training on sentences extracted from parallel or comparable documents mined from the Web can improve machine translation models [24]. On the document level, parallel cross-lingual documents can also be used for learning word-level translation lexicons [13,25]. Other tasks that leverage parallel data include cross-lingual information retrieval as well as cross-lingual document classification. ...
... Bilingual models of distributional semantics have been used to automatically find word translations from both parallel and comparable corpora between several languages [8,10,32]. Besides, and also using parallel and comparable corpora, several attempts have been made to identify translations of different types of multiword expressions (MWEs), using external resources such as bilingual dictionaries or cross-lingual models of distributional semantics [2,11,15,34,38]. Most of these approaches, however, often consider the semantic load of each MWE component as similar, so that they obtain candidates in the target languages as word-to-word translations, which are then filtered and ranked using different metrics. ...
... Finally, our work also takes advantage of cross-lingual distributional semantics models, which map representations of different languages into the same vector space. Apart from bilingual models trained on parallel corpora [40], both count-based techniques [9,32] and recent neural network algorithms [1,22,27] obtain high quality bilingual models using comparable and unrelated corpora. ...
Chapter
Full-text available
This paper presents a method to automatically identify bilingual equivalents of collocations using only monolingual corpora in two languages. The method takes advantage of cross-lingual distributional semantics models mapped into a shared vector space, and of compositional methods to find appropriate translations of non-congruent collocations (e.g., pay attention–prestar atenção in English–Portuguese). This strategy is evaluated in the translation of English–Portuguese and English–Spanish collocations belonging to two syntactic patterns: adjective-noun and verb-object, and compared to other methods proposed in the literature. The results of the experiments performed show that the compositional approach, based on a weighted additive model, behaves better than the other strategies that have been evaluated, and that both the asymmetry and the compositional properties of collocations are captured by the combined vector representations. This paper also contributes with two freely available gold-standard data sets which are useful to evaluate the performance of automatic extraction of multilingual equivalents of collocations.
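
The weighted additive model referred to in this abstract can be sketched roughly as follows; the weights and the nearest-neighbour retrieval step are illustrative assumptions rather than the exact configuration used by the authors.

    # Sketch of a weighted additive composition of cross-lingual word vectors
    # for a two-word collocation, followed by nearest-neighbour search in the
    # target space (weights alpha/beta are placeholders).
    import numpy as np

    def compose(head_vec, dep_vec, alpha=0.6, beta=0.4):
        """Weighted additive model: the collocation vector is a weighted sum."""
        v = alpha * head_vec + beta * dep_vec
        return v / np.linalg.norm(v)

    def nearest_target(collocation_vec, tgt_vectors):
        """Pick the target-language candidate whose vector is closest by cosine."""
        return max(tgt_vectors,
                   key=lambda t: collocation_vec @ tgt_vectors[t]
                                 / np.linalg.norm(tgt_vectors[t]))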
... There has been some research conducted on statistical-based methods utilising comparable data, initially provided by Rapp [1995], Fung [1998], and further developed by Rapp [1999], Haghighi et al. [2008], Schafer and Yarowsky [2002], Koehn and Knight [2002], Gaussier et al. [2004]. However, recent advances reoriented to neural-network-based approaches, and currently, they produce more research papers than statistical approaches [Artetxe et al., 2018a, Kementchedjhieva et al., 2018, Woller et al., 2021, Bai et al., 2019, Marchisio et al., 2022]. ...
Preprint
Full-text available
The importance of inducing bilingual dictionary components in many natural language processing (NLP) applications is indisputable. However, the dictionary compilation process requires extensive work and combines two disciplines, NLP and lexicography, while the former often omits the latter. In this paper, we present the most common approaches from NLP that endeavour to automatically induce one of the essential dictionary components, translation equivalents, and focus on the neural-network-based methods using comparable data. We analyse them from a lexicographic perspective since their viewpoints are crucial for improving the described methods. Moreover, we identify the methods that integrate these viewpoints and can be further exploited in various applications that require them. This survey encourages a connection between the NLP and lexicography fields as the NLP field can benefit from lexicographic insights, and it serves as helpful and inspiring material for further research in the context of neural-network-based methods utilising comparable data.
... Therefore, researchers have focused their efforts on finding word translation pairs from non-parallel data, which is both more significant and more challenging (Koehn and Knight 2002;Fung and Cheung 2004;Haghighi et al. 2008). Most traditional approaches hinge on cross-lingual signals to link independent monolingual spaces: each word is associated with a vector that comprises monolingual statistics like PMI, and then the monolingual vector spaces are connected through bilingual signals, such as a seed lexicon or a bilingual topic model (Rapp 1999;Gaussier et al. 2004;Vulić, Smet, and Moens 2011;Vulić and Moens 2013a). ...
Article
Building bilingual lexica from non-parallel data is a long-standing natural language processing research problem that could benefit thousands of resource-scarce languages which lack parallel data. Recent advances of continuous word representations have opened up new possibilities for this task, e.g. by establishing cross-lingual mapping between word embeddings via a seed lexicon. The method is however unreliable when there are only a limited number of seeds, which is a reasonable setting for resource-scarce languages. We tackle the limitation by introducing a novel matching mechanism into bilingual word representation learning. It captures extra translation pairs exposed by the seeds to incrementally improve the bilingual word embeddings. In our experiments, we find the matching mechanism to substantially improve the quality of the bilingual vector space, which in turn allows us to induce better bilingual lexica with seeds as few as 10.
... In recent years, scholars have conducted significant research on the construction methods of bilingual dictionaries. Previously, Rapp et al. [9] proposed an algorithm based on a word-relation matrix. Although a word appears in different contexts in a monolingual text, the set of words that appear together with it is roughly the same. ...
Article
Full-text available
Bilingual lexicon extraction is useful, especially for low-resource languages that can leverage from high-resource languages. The Uyghur language is a derivative language, and its language resources are scarce and noisy. Moreover, it is difficult to find a bilingual resource to utilize the linguistic knowledge of other large resource languages, such as Chinese or English. There is little related research on unsupervised extraction for the Chinese-Uyghur languages, and the existing methods mainly focus on term extraction methods based on translated parallel corpora. Accordingly, unsupervised knowledge extraction methods are effective, especially for the low-resource languages. This paper proposes a method to extract a Chinese-Uyghur bilingual dictionary by combining the inter-word relationship matrix mapped by the neural network cross-language word embedding vector. A seed dictionary is used as a weak supervision signal. A small Chinese-Uyghur parallel data resource is used to map the multilingual word vectors into a unified vector space. As the word-particles of these two languages are not well-coordinated, stems are used as the main linguistic particles. The strong inter-word semantic relationship of word vectors is used to associate Chinese-Uyghur semantic information. Two retrieval indicators, nearest neighbor retrieval and cross-domain similarity local scaling, are used to calculate similarity to extract bilingual dictionaries. The experimental results show that the accuracy of the Chinese-Uyghur bilingual dictionary extraction method proposed in this paper is improved to 65.06%. This method helps to improve Chinese-Uyghur machine translation, automatic knowledge extraction, and multilingual translations.
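
The two retrieval criteria named in the abstract, plain nearest-neighbour search and cross-domain similarity local scaling (CSLS), can be sketched as below over row-normalised source and target embedding matrices; the neighbourhood size k is an assumption.

    # Sketch of nearest-neighbour retrieval and CSLS over row-normalised
    # embedding matrices X (source words) and Y (target words).
    import numpy as np

    def nearest_neighbour(X, Y):
        return np.argmax(X @ Y.T, axis=1)        # best target index per source word

    def csls_scores(X, Y, k=10):
        """CSLS(x, y) = 2*cos(x, y) - mean cos of x's k nearest target neighbours
                                    - mean cos of y's k nearest source neighbours."""
        sims = X @ Y.T                                        # cosine similarities
        r_src = np.sort(sims, axis=1)[:, -k:].mean(axis=1)    # hubness penalty for x
        r_tgt = np.sort(sims, axis=0)[-k:, :].mean(axis=0)    # hubness penalty for y
        return 2 * sims - r_src[:, None] - r_tgt[None, :]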
... In terms of word translation, there exists a significant body of work in the area of bilingual lexicon induction, which is the task of translating words across languages without any parallel data (Fung and Yee, 1998;Rapp, 1999). Approaches can be divided into two types, text-based, which aim to find word translations by employing the words' linguistic information, and vision-based that use the words' images as pivots for translation (Bergsma and Van Durme, 2011;Kiela et al., 2015). ...
... In this approach, hierarchical phrase transduction probabilities are used to handle a range of reordering phenomena in the correct fashion. There has been a growing interest in approaches focused on extracting word translations from comparable corpora (Fung and McKeown, 1997;Fung and Yee, 1998;Rapp, 1999;Chiao and Zweigenbaum, 2002;Dejean et al., 2002;Kaji, 2005;Gamallo, 2007;Saralegui et al., 2008). ...
... This performance outperforms state-of-the-art work on extraction from comparable corpora, whose best scores were about 70% accuracy in Rapp (1999) and 60-83% in Aker et al. (2013). The correctness of the generated translation equivalents is similar to that achieved using parallel corpora. ...
Chapter
Full-text available
This volume assesses the state of the art of parallel corpus research as a whole, reporting on advances in both recent developments of parallel corpora – with some particular references to comparable corpora as well– and in ways of exploiting them for a variety of purposes. The first part of the book is devoted to new roles that parallel corpora can and should assume in translation studies and in contrastive linguistics, to the usefulness and usability of parallel corpora, and to advances in parallel corpus alignment, annotation and retrieval. There follows an up-to-date presentation of a number of parallel corpus projects currently being carried out in Europe, some of them multimodal, with certain chapters illustrating case studies developed on the basis of the corpora at hand. In most of these chapters, attention is paid to specific technical issues of corpus building. The third part of the book reflects on specific applications and on the creation of bilingual resources from parallel corpora. This volume will be welcomed by scholars, postgraduate and PhD students in the fields of contrastive linguistics, translation studies, lexicography, language teaching and learning, machine translation, and natural language processing.
... There are also works that attempt to discover the parallel word or phrase from unrelated documents. The work by Rapp (1999) was based on the idea that there is a correlation between the co-occurrences of words that are translations of each other in unrelated corpora of different languages [18]. In other words, if two words co-occur very frequently, then the translations of both words will also be found co-occurring frequently in the text of the other language. ...
Article
Full-text available
Parallel texts are essential resources in linguistics, natural language processing, and multilingual information retrieval. Many studies attempt to extract parallel text from existing resources, particularly from comparable texts. The approaches to extract parallel text from comparable text can be divided into sentence-level approaches and fragment-level approaches. In this paper, an approach that combines the sentence-level approach and the fragment-level approach is proposed. The study was evaluated using statistical machine translation (SMT) and neural machine translation (NMT). The experiment results show a very significant improvement in the BLEU scores of SMT and NMT. The BLEU scores for SMT on the test sets in the computer science domain and the news domain increase from 17.45 and 41.45 to 18.56 and 48.65 respectively. On the other hand, the BLEU scores for NMT in the computer science domain and news domain increase from 14.42 and 19.39 to 21.17 and 41.75 respectively.
... The method proposed in this paper also relies on count-based techniques to build bilingual vectors from monolingual corpora (Fung and McKeown, 1997;Rapp, 1999;Saralegi et al., 2008;Ansari et al., 2014). Neural-based strategies also have been used to learn translation equivalents from word embeddings (Mikolov et al., 2013a;Artetxe et al., 2016Artetxe et al., , 2018a. ...
Conference Paper
Full-text available
This article describes a dependency-based strategy that uses compositional distributional semantics and cross-lingual word embeddings to translate multiword expressions (MWEs). Our unsupervised approach performs translation as a process of word contextualization by taking into account lexico-syntactic contexts and selectional preferences. This strategy is suited to translate phraseological combinations and phrases whose constituent words are lexically restricted by each other. Several experiments in adjective-noun and verb-object compounds show that mutual contextualization (co-compositionality) clearly outperforms other compositional methods. The paper also contributes with a new freely available dataset of English-Spanish MWEs used to validate the proposed compositional strategy.
... Previous studies on cross-lingual text stream alignment tend to focus on coarse-grained (i.e., topic-level) alignment for finding common patterns (Wang et al., 2007;De Smet and Moens, 2009;Wang et al., 2009;Zhang et al., 2010;Hu et al., 2012) and discovering parallel sentences and documents (Munteanu and Marcu, 2005;Enright and Kondrak, 2007;Uszkoreit et al., 2010;Smith et al., 2010;Smith, 2011, 2016) across languages. Studies on fine-grained crosslingual alignment are mainly for bilingual lexicon induction (e.g., (Fung and Yee, 1998;Rapp, 1999;Koehn and Knight, 2002;Schafer and Yarowsky, 2002;Shao and Ng, 2004;Schafer III, 2006;Hassan et al., 2007;Haghighi et al., 2008;Udupa et al., 2009;Klementiev and Callison-Burch, 2010;Tamura et al., 2012;Callison-Burch, 2013, 2015b;Kiela et al., 2015;Irvine and Callison-Burch, 2015a;Vulic and Moens, 2015;Cao et al., 2016;Zhang et al., 2017b,a)) and name translation mining (e.g., (Sproat et al., 2006;Klementiev and Roth, 2006;Udupa et al., 2008;Ji, 2009;won You et al., 2010;Kotov et al., 2011;Lin et al., 2011;Sellami et al., 2014)) from nonparallel corpora. However, these approaches are mainly developed for general comparable corpora, not specially for cross-lingual text streams; thus many of them did not use the powerful streamlevel information (e.g., co-burst across languages). ...
... Most approaches to extract translation equivalents from monolingual corpora define the contextual distribution of a word by considering bilingual pairs of seed words. In most cases, seed words are provided by external bilingual dictionaries (Fung and McKeown 1997;Fung and Yee 1998;Rapp 1999;Chiao and Zweigenbaum 2002;Shao and Ng 2004;Saralegi, Vicente, and Gurrutxaga 2008;Gamallo 2007;Gamallo and Pichel 2008;Yu and Tsujii 2009a;Ismail and Manandhar 2010;Rubino and Linarés 2011;Tamura, Watanabe, and Sumita 2012;Aker, Paramita, and Gaizauskas 2013;Ansari et al. 2014). So, a word in the target language is a translation candidate of a word in the source language if they tend to co-occur with the pairs of words from the seed words. ...
Article
Full-text available
This article describes a compositional distributional method to generate contextualized senses of words and identify their appropriate translations in the target language using monolingual corpora. Word translation is modeled in the same way as contextualization of word meaning, but in a bilingual vector space. The contextualization of meaning is carried out by means of distributional composition within a structured vector space with syntactic dependencies, and the bilingual space is created by means of transfer rules and a bilingual dictionary. A phrase in the source language, consisting of a head and a dependent, is translated into the target language by selecting both the nearest neighbor of the head given the dependent, and the nearest neighbor of the dependent given the head. This process is expanded to larger phrases by means of incremental composition. Experiments were performed on English and Spanish monolingual corpora in order to translate phrasal verbs in context. A new bilingual data set to evaluate strategies aimed at translating phrasal verbs in restricted syntactic domains has been created and released.
... In pivot-based machine translation, the context-based pruning method uses a context vector to identify the exact meaning of the pivot language [56]. Given source-pivot and pivot-target corpora, the source context vector S, the target context vector T and the pivot context vector P can be calculated following Rapp [65]. ...
Article
Full-text available
Machine translation, which will be used widely in human-computer interaction services for the Internet of Things (IoT), is a key technology in the artificial intelligence field. This paper presents a minimum Bayes-risk (MBR) phrase table pruning method for pivot-based statistical machine translation (SMT). The SMT system requires a great amount of bilingual data to build a high-performance translation model. For some language pairs, such as Chinese-English, massive bilingual data are available on the web. However, for most language pairs, large-scale bilingual data are hard to obtain. Pivot-based SMT is proposed to solve the data scarcity problem: it introduces a pivot language to bridge the source language and the target language. Therefore, a source-target translation model based on well-trained source-pivot and pivot-target translation models can be derived with the pivot-based approach. However, due to the ambiguities of the pivot language, source and target phrases with different meanings may be wrongly matched. Consequently, the derived source-target phrase table may contain incorrect phrase pairs. To alleviate this problem, we apply the MBR method to prune the phrase table. The MBR pruning method removes the phrase pairs with the lowest risk from the phrase table. Experimental results on Europarl data show that the proposed method can both reduce the size of phrase tables and improve the performance of translations. This study also gives a useful reference for many IoT research fields and smart web services.
... These methods often construct word vectors through a context matrix, then either use the vectors directly, perform some factorisation of the matrix, or alternatively use the context in a neural network that produces vectors for each word. The information captured by these embeddings can be exploited for bilingual translation by learning a translation matrix that allows one to match relative positions across two monolingual vector spaces [3, 4, 6-9, 11, 12, 15, 18, 19, 22, 24]. The assumption that a word appears in similar contexts and is distributed within its language similarly to its equivalent in another language is key to translation through a learned matrix. ...
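
A rough sketch of learning such a translation matrix from a seed dictionary is given below; both a plain least-squares mapping and the orthogonal (Procrustes) variant are shown, with X and Y standing for matrices whose rows are the embeddings of the seed translation pairs (an assumed setup, not the specific method of the chapter).

    # Sketch of learning a translation matrix W that maps source embeddings
    # onto target embeddings using a seed dictionary of translation pairs.
    import numpy as np

    def least_squares_map(X, Y):
        """W minimising ||X W - Y||_F (a Mikolov-style linear mapping)."""
        return np.linalg.lstsq(X, Y, rcond=None)[0]

    def orthogonal_map(X, Y):
        """Orthogonal solution via Procrustes: W = U V^T from the SVD of X^T Y."""
        U, _, Vt = np.linalg.svd(X.T @ Y)
        return U @ Vt

    # A new source word s is then translated by finding the target embedding
    # closest to s @ W, e.g. by cosine similarity or CSLS.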
Chapter
Full-text available
Methods used to learn bilingual word embedding mappings, which project the source-language embeddings into the target embedding space, are compared in this paper. Orthogonal transformations, which are robust to noise, can learn to translate between word pairs they have never seen during training (zero-shot translation). Using multiple translation paths, e.g. Finnish → English → Russian and Finnish → French → Russian, at the same time and combining the results was found to improve the results of this process. Four new methods are presented for the calculation of either the single most similar or the five most similar words, based on the results of multiple translation paths. Of these, the Summation method was found to improve the P@1 translation precision by 1.6 percentage points compared to the best result obtained with a direct translation (Fi → Ru). The probability margin is presented as a confidence score. With similar coverages, the probability margin was found to outperform probability as a confidence score in terms of P@1 and P@5.
... For example, Fung (1998) used a statistical method for extracting SWTs from a bilingual English-Chinese corpus taken from the Wall Street Journal for English and the Nikkei Financial News for Chinese. Similarly, Rapp (1999) used a statistical method to extract SWTs from a comparable journalistic English-German corpus. Déjean and Gaussier (2002) investigated a medical comparable English-German corpus. ...
Article
This paper presents a methodology for the automatic extraction of specialized Arabic, English and French verbs of the field of computing. Since nominal terms are predominant in terminology, our interest is to explore to what extent verbs can also be part of a terminological analysis. Hence, our objective is to verify how an existing extraction tool will perform when it comes to specialized verbs in a given specialized domain. Furthermore, we want to investigate any particularities that a language can represent regarding verbal terms from the automatic extraction perspective. Our choice to operate on three different languages reflects our desire to see whether the chosen tool can perform better on one language compared to the others. Moreover, given that Arabic is a morphologically rich and complex language, we consider investigating the results yielded by the extraction tool. The extractor used for our experiment is TermoStat (Drouin 2003). So far, our results show that the extraction of verbs of computing represents certain differences in terms of quality and particularities of these units in this specialized domain between the languages under question.
... Traditional methods build statistical models for monolingual word co-occurrence, and combine cross-lingual supervision to solve the task. As word alignment for parallel sentences can produce fairly good bilingual lexica (Och and Ney, 2003), these methods focus on non-parallel data with a seed lexicon as cross-lingual supervision (Rapp, 1999;Gaussier et al., 2004). ...
... Comparable corpora are by definition multilingual and cross-lingual text collections. The use of comparable corpora for word similarity is a well-known task (Fung and McKeown, 1997;Rapp, 1999;Saralegi et al., 2008;Gamallo, 2007;Gamallo and Pichel, 2008;Ansari et al., 2014;Hazem and Morin, 2014). The main advantage of comparable corpora is that the Web can be used as a huge resource of multilingual texts. ...
Conference Paper
Full-text available
This article describes the distributional strategy submitted by the Citius team to the SemEval 2017 Task 2. Even though the team participated in two sub-tasks, namely monolingual and cross-lingual word similarity, the article is mainly focused on the cross-lingual sub-task. Our method uses comparable corpora and syntactic dependencies to extract count-based and transparent bilingual distributional contexts. The evaluation of the results shows that our method is competitive with other cross-lingual strategies, even those using aligned and parallel texts.
... For English we used the English 2012 Google 5-gram corpus, for French we used the French 2012 Google 5-gram corpus, for German we used the German 2012 Google 5-gram corpus, and for Spanish we used the Spanish 2012 Google 5-gram corpus. From these corpora we compute word context similarity scores across languages using Rapp's method (Rapp, 1995, 1999). The intuition behind this method is that cognates are more likely to occur in correlating context windows and this statistic inferred from large amounts of data captures this correlation. ...
... Inducing bilingual lexica from non-parallel data is a long-standing cross-lingual task. Except for the decipherment approach, traditional statistical methods all require cross-lingual signals (Rapp, 1999;Koehn and Knight, 2002;Fung and Cheung, 2004;Gaussier et al., 2004;Haghighi et al., 2008;Vulić et al., 2011;Vulić and Moens, 2013). Recent advances in cross-lingual word embeddings (Vulić and Korhonen, 2016;Upadhyay et al., 2016) have rekindled interest in bilingual lexicon induction. ...
... Object matching is the task of finding correspondence between objects in different domains, such as images and annotations (Socher and Fei-Fei 2010), user identifiers in different databases (Li et al. 2009), sentences written in different languages (Gale and Church 1991;Rapp 1999). Most object matching methods involve similarity or correspondence information. ...
Article
Full-text available
Unsupervised cluster matching is a task to find matching between clusters of objects in different domains. Examples include matching word clusters in different languages without dictionaries or parallel sentences and matching user communities across different friendship networks. Existing methods assume that every object is assigned into a cluster. However, in real-world applications, some objects would not form clusters. These irrelevant objects deteriorate the cluster matching performance, since mistakenly estimated matchings affect the estimation of matchings of other objects. In this paper, we propose a probabilistic model for robust unsupervised cluster matching that discovers relevance of objects and matching of object clusters simultaneously, given multiple networks. The proposed method finds correspondence only for relevant objects, and keeps irrelevant objects unmatched, which enables us to improve the matching performance since the adverse impact of irrelevant objects is eliminated. With the proposed method, relevant objects in different networks are clustered into a shared set of clusters by assuming that different networks are generated from a common network probabilistic model, which is an extension of stochastic block models. Objects assigned into the same clusters are considered as matched. Edges for irrelevant objects are assumed to be generated from a noise distribution irrespective of cluster assignments. We present an efficient Bayesian inference procedure for the proposed model based on collapsed Gibbs sampling. In our experiments, we demonstrate the effectiveness of the proposed method using synthetic and real-world data sets, including multilingual corpora and movie ratings.
... For English we used the English 2012 Google 5-gram corpus, for French we used the French 2012 Google 5-gram corpus, for German we used the German 2012 Google 5-gram corpus, and for Spanish we used the Spanish 2012 Google 5-gram corpus. From these corpora we compute word context similarity scores across languages using Rapp's method (Rapp, 1995, 1999). The intuition behind this method is that cognates are more likely to occur in correlating context windows, and this statistic, inferred from large amounts of data, captures this correlation. ...
Article
Global constraints and reranking have not been used in cognates detection research to date. We propose methods for using global constraints by rescoring the score matrices produced by state-of-the-art cognates detection systems. Rescoring with global constraints is complementary to state-of-the-art cognates detection methods and yields significant improvements beyond current state-of-the-art performance on publicly available datasets with different language pairs and under various conditions, such as different levels of baseline performance and different data sizes, including larger and more realistic data sizes than have been evaluated in the past.
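One simple way to impose a global constraint on a cognate score matrix is to reward the cells selected by an optimal one-to-one assignment, as sketched below. This is only an illustration of the general idea of rescoring with a global constraint, not the specific rescoring procedure of the paper; the bonus value is arbitrary.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def rescore_one_to_one(scores, bonus=1.0):
    """Boost cells of a cognate score matrix that survive a global
    one-to-one assignment over all source/target words."""
    rows, cols = linear_sum_assignment(-scores)   # negate to maximise total score
    rescored = scores.copy()
    rescored[rows, cols] += bonus                 # reward globally consistent pairs
    return rescored

scores = np.array([[0.90, 0.80, 0.10],
                   [0.85, 0.90, 0.20],
                   [0.10, 0.20, 0.70]])
print(rescore_one_to_one(scores))
```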
... Other approaches make assumptions about the languages or corpora, such as syntactic structure, orthographic similarities, presence of cognates, monogenetic relationships, and domain-specific content [Rapp 1999; Laroche and Langlais 2010; Haghighi et al. 2008; Morin et al. 2008; Koehn and Knight 2002; Rubino and Linarès 2011; Fišer and Ljubešic 2011]. Mausam et al. [2009] and Kaji et al. [2008] use existing dictionaries to induce translation correspondences. ...
Article
Identifying translations from comparable corpora is a well-known problem with several applications. Existing methods rely on linguistic tools or high-quality corpora. The absence of such resources, especially in Indian languages, makes this problem hard; for example, state-of-the-art techniques achieve a mean reciprocal rank of 0.66 for English-Italian, but a mere 0.187 for Telugu-Kannada. In this work, we address the problem of comparable corpora-based translation correspondence induction (CC-TCI) when the only resources available are small noisy comparable corpora extracted from Wikipedia. We observe that translations in the source and target languages have many topically related words in common in other "auxiliary" languages. To model this, we define the notion of a translingual theme, a set of topically related words from auxiliary language corpora, and present a probabilistic framework for CC-TCI. Extensive experiments on 35 comparable corpora showed dramatic improvements in performance. We extend these ideas to propose a method for measuring cross-lingual semantic relatedness (CLSR) between words. To stimulate further research in this area, we make publicly available two new high-quality human-annotated datasets for CLSR. Experiments on the CLSR datasets show more than 200% improvement in correlation on the CLSR task. We apply the method to the real-world problem of cross-lingual Wikipedia title suggestion and build the WikiTSu system. A user study on WikiTSu shows a 20% improvement in the quality of the titles suggested.
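The core observation, that true translation pairs share topically related words in auxiliary languages, can be caricatured with a simple set-overlap score, as in the sketch below. The paper's translingual themes and probabilistic framework are considerably richer than this, and the example words are invented.

```python
def theme_overlap(src_theme, tgt_theme):
    """Jaccard overlap between the sets of auxiliary-language words
    topically related to a source word and to a target candidate."""
    src, tgt = set(src_theme), set(tgt_theme)
    union = src | tgt
    return len(src & tgt) / len(union) if union else 0.0

# Hypothetical auxiliary-language words related to an English word
# and to a candidate translation in another language.
print(theme_overlap({"nadi", "jal", "pravah"}, {"nadi", "jal", "teer"}))
```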
Chapter
In the beginning of the 2000s the use of comparable corpora was on the margins of NLP research. Existing MT systems were nearly always based on fully parallel corpora, while NLP applications were mostly built separately in each language without the advantages of cross-lingual transfer.
Chapter
The aim of the Bilingual Lexicon Induction (BLI) task is to produce a bilingual lexicon using a pair of comparable corpora and either a small set of seed translations (a supervised setting) or no seeds at all (an unsupervised setting). A traditional bilingual dictionary usually offers a structure of senses and conditions for their translations, as well as POS tags for disambiguation. In contrast to this task, building bilingual lexicons as the aim for the BLI task involves a number of simplifications.
Chapter
Recent works rely on comparable corpora to extract bilingual lexicons efficiently. Most approaches in the literature for bilingual lexicon extraction are based on context vectors (CV). These approaches suffer from noisy vectors that affect their accuracy. This paper presents new approaches which rely on advanced text mining methods to extract association rules between terms (AR) and extend them to contextual meta-rules (MR). In this respect, we propose to extract bilingual lexicons by deploying standard context vectors, association rules, and contextual meta-rules. These proposed approaches exploit correlations between co-occurrence patterns across languages. An experimental validation conducted on specialized comparable corpora highlights a significant improvement of the MR-based bilingual lexicon over the standard approach.
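As a reminder of what the AR step amounts to, the sketch below mines pairwise association rules with support and confidence from co-occurrence windows. The thresholds are arbitrary placeholders, and the contextual meta-rules (MR) of the paper build on top of such rules rather than being shown here.

```python
from collections import Counter
from itertools import combinations

def mine_association_rules(windows, min_support=0.01, min_confidence=0.3):
    """Extract rules term_a -> term_b with support and confidence
    computed over co-occurrence windows (each window is a list of terms)."""
    n = len(windows)
    term_count, pair_count = Counter(), Counter()
    for w in windows:
        terms = set(w)
        term_count.update(terms)                          # document-frequency style counts
        pair_count.update(combinations(sorted(terms), 2))
    rules = []
    for (a, b), c in pair_count.items():
        support = c / n
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = c / term_count[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return rules
```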
Article
Significant advances have been achieved in bilingual word-level alignment, yet the challenge remains for phrase-level alignment. Moreover, the need for parallel data is a critical drawback for the alignment task. This work proposes a system that alleviates these two problems: a unified phrase representation model using cross-lingual word embeddings as input, and an unsupervised training algorithm inspired by recent work on neural machine translation. The system consists of a sequence-to-sequence architecture in which a short-sequence encoder constructs cross-lingual representations of phrases of any length, and an LSTM network then decodes them with respect to their contexts. After training with comparable corpora and existing key phrase extraction, our encoder provides cross-lingual phrase representations that can be compared without further transformation. Experiments on five data sets show that our method obtains state-of-the-art results on the bilingual phrase alignment task and improves the results of different-length phrase alignment by a mean of 8.8 MAP points.
Chapter
Whether you wish to deliver on a promise, take a walk down memory lane or even on the wild side, phraseological units (also often referred to as phrasemes or multiword expressions) are present in most communicative situations and in all of the world's languages. Phraseology, the study of phraseological units, has therefore become a rare unifying theme across linguistic theories. In recent years, an increasing number of studies have been concerned with the computational treatment of multiword expressions: these pertain, among others, to their automatic identification, extraction or translation, and to the role they play in various Natural Language Processing applications. Computational phraseology is a comparatively new field where better understanding and more advances are urgently needed. This book aims to address this pressing need by bringing together contributions focusing on different perspectives of this promising interdisciplinary field.
Chapter
Methods for bilingual lexicon induction are often based on word embedding (WE) similarity. These methods must be able to project the WE into the same space. Uncontextualized WE have proved useful for this task. We compare them to contextualized WE and bag-of-words representations, using specialized and general datasets. We also evaluate the impact of seed lexicons and check the validity of existing reference lists, claiming that extracting the translation of some words in those lists is not useful and confirming the need for more fine-grained reference lists.
Article
Comparable corpora are valuable alternatives to expensive parallel corpora. They comprise informative parallel fragments that are useful resources for different natural language processing tasks. In this work, a generative model is proposed for efficient extraction of parallel fragments from a pair of comparable documents. The core of the proposed model is a graph called the Matching Graph. The ability of the Matching Graph to be trained on a small initial seed makes it a suitable model for language pairs suffering from resource scarcity. Experiments show that the Matching Graph performs significantly better than other recently published models. According to the experiments on English-Persian and Arabic-Persian language pairs, the extracted parallel fragments can be used instead of parallel data for training statistical machine translation systems. Results reveal that the extracted fragments, in the best case, are able to retrieve about 90% of the information of a statistical machine translation system trained on a parallel corpus. Moreover, it is shown that using the extracted fragments as additional information for training statistical machine translation systems leads to an improvement of about 2% for English-Persian and about 1% for Arabic-Persian translation in BLEU score.
Chapter
The availability of parallel corpora is limited, especially for under-resourced languages and narrow domains. On the other hand, the number of comparable documents in these areas that are freely available on the Web is continuously increasing. Algorithmic approaches to identify these documents from the Web are needed for the purpose of automatically building comparable corpora for these under-resourced languages and domains. How do we identify these comparable documents? What approaches should be used in collecting these comparable documents from different Web sources? In this chapter, we firstly present a review of previous techniques that have been developed for collecting comparable documents from the Web. Then we describe in detail three new techniques to gather comparable documents from three different types of Web sources: Wikipedia, news articles, and narrow domains.
Chapter
The concept of comparability, or linguistic relatedness, or closeness between textual units or corpora has many possible applications in computational linguistics. Consequently, the task of measuring comparability has increasingly become a core technological challenge in the field, and needs to be developed and evaluated systematically. Many practical applications require corpora with controlled levels of comparability, which are established by comparability metrics. From this perspective, it is important to understand the linguistic and technological mechanisms and implications of comparability and develop a systematic methodology for developing, evaluating and using comparability metrics. This chapter presents our approach to developing and using such metrics for machine translation (MT), especially for under-resourced languages. We address three core areas: (1) systematic meta-evaluation (or calibration) of the metrics on the basis of parallel corpora; (2) the development of feature-selection techniques for the metrics on the basis of aligned comparable texts, such as Wikipedia articles and (3) applying the developed metrics for the tasks of MT for under-resourced languages and measuring their effectiveness for corpora with unknown degrees of comparability. This has led to redefining the vague linguistic concept of comparability in terms of task-specific performance of the tools, which extract phrase-level translation equivalents from comparable texts.
Preprint
We investigate the behavior of maps learned by machine translation methods. The maps translate words by projecting between word embedding spaces of different languages. We locally approximate these maps using linear maps, and find that they vary across the word embedding space. This demonstrates that the underlying maps are non-linear. Importantly, we show that the locally linear maps vary by an amount that is tightly correlated with the distance between the neighborhoods on which they are trained. Our results can be used to test non-linear methods, and to drive the design of more accurate maps for word translation.
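The local analysis described here can be sketched as follows: fit a least-squares linear map on a word's nearest dictionary neighbours and their translations, then measure how much maps fitted on different neighbourhoods differ. The embedding dictionaries, the seed dictionary, and the neighbourhood size below are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def local_linear_map(word, src_emb, tgt_emb, dictionary, k=10):
    """Fit W minimising ||X W - Y|| on the k nearest neighbours of `word`
    (restricted to seed-dictionary entries) and their translations.
    src_emb / tgt_emb are dicts mapping words to embedding vectors."""
    pairs = [(s, t) for s, t in dictionary.items() if s in src_emb and t in tgt_emb]
    anchor = src_emb[word]
    pairs.sort(key=lambda p: np.linalg.norm(src_emb[p[0]] - anchor))
    X = np.stack([src_emb[s] for s, _ in pairs[:k]])
    Y = np.stack([tgt_emb[t] for _, t in pairs[:k]])
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

def map_distance(W1, W2):
    """How much two locally fitted maps differ (Frobenius norm)."""
    return float(np.linalg.norm(W1 - W2))
```

Comparing map_distance for maps fitted on nearby versus distant neighbourhoods is the kind of probe the abstract describes: if the maps were globally linear, the distance would not grow with neighbourhood separation.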
Article
Named entity translation equivalent extraction plays a critical role in machine translation (MT) and cross-language information retrieval (CLIR). Traditional methods are often based on large-scale parallel or comparable corpora. However, the applicability of these studies is constrained, mainly because of the scarcity of parallel corpora of the required scale, especially for language pairs such as Chinese and Japanese. In this paper, we propose a method that considers the characteristics of Chinese and Japanese to automatically extract Chinese-Japanese named entity (NE) translation equivalents based on inductive learning (IL) from monolingual corpora. The method adopts the Chinese Hanzi and Japanese Kanji Mapping Table (HKMT) to calculate the similarity of NE instances between Japanese and Chinese. Then, we use IL to obtain partial translation rules for NEs by extracting the differing parts from high-similarity NE instances in Chinese and Japanese. Finally, feedback processing updates the Chinese-Japanese NE similarity and the rule sets. Experimental results show that our simple, efficient method overcomes the insufficiency of traditional methods, which are severely dependent on bilingual resources. Compared with other methods, our method combines the language features of Chinese and Japanese with IL for automatically extracting NE pairs. Our use of weakly correlated bilingual text sets and minimal additional knowledge to extract NE pairs effectively reduces the cost of building the corpus and the need for additional knowledge. Our method may help to build a large-scale Chinese-Japanese NE translation dictionary from monolingual corpora.
Chapter
Application of semantic resources often requires linking phrases expressed in a natural language to formally defined notions. In the case of ontologies, lexical layers may be used for that purpose. In this paper we propose an automatic machine translation method for translating multi-word labels from the lexical layers of domain ontologies. The method takes advantage of Wikipedia and dictionary services available on the Internet in order to provide translations of thematic texts from a given area of interest. Experimental evaluation shows the usefulness of the proposed method in translating specialized thematic dictionaries.
Conference Paper
This paper presents a method for inducing the parts of speech of a language and part-of-speech labels for individual words from a large text corpus. Vector representations for the part-of-speech of a word are formed from entries of its near lexical neighbors. A dimensionality reduction creates a space representing the syntactic categories of unambiguous words. A neural net trained on these spatial representations classifies individual contexts of occurrence of ambiguous words. The method classifies both ambiguous and unambiguous words correctly with high accuracy.
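A minimal sketch of the first two stages described above (neighbour-based vector representations and dimensionality reduction) might look like the following; the neural-net classification of ambiguous contexts is omitted, and the vocabulary, neighbour count, and dimensionality are placeholder values rather than the settings of the paper.

```python
import numpy as np
from collections import Counter

def pos_spaces(sentences, vocab, top_neighbors=250, dim=20):
    """Build left/right neighbour count vectors for each word in `vocab`,
    then reduce them with an SVD; rows of the result live in a space whose
    clusters approximate syntactic categories."""
    freq = Counter(w for s in sentences for w in s)
    neigh = [w for w, _ in freq.most_common(top_neighbors)]
    idx = {w: i for i, w in enumerate(neigh)}
    word_row = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), 2 * len(neigh)))
    for s in sentences:
        for i, w in enumerate(s):
            if w not in word_row:
                continue
            if i > 0 and s[i - 1] in idx:
                M[word_row[w], idx[s[i - 1]]] += 1               # left neighbour
            if i + 1 < len(s) and s[i + 1] in idx:
                M[word_row[w], len(neigh) + idx[s[i + 1]]] += 1  # right neighbour
    U, S, _ = np.linalg.svd(M, full_matrices=False)
    k = min(dim, len(S))
    return U[:, :k] * S[:k]        # reduced syntactic space, one row per vocab word
```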
Article
How do people know as much as they do with as little information as they get? The problem takes many forms; learning vocabulary from text is an especially dramatic and convenient case for research. A new general theory of acquired similarity and knowledge representation, latent semantic analysis (LSA), is presented and used to successfully simulate such learning and several other psycholinguistic phenomena. By inducing global knowledge indirectly from local co-occurrence data in a large body of representative text, LSA acquired knowledge about the full vocabulary of English at a comparable rate to schoolchildren. LSA uses no prior linguistic or perceptual similarity knowledge; it is based solely on a general mathematical learning method that achieves powerful inductive effects by extracting the right number of dimensions (e.g., 300) to represent objects and contexts. Relations to other theories, phenomena, and problems are sketched.
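The dimension-reduction step at the heart of LSA is a truncated SVD of a weighted term-by-document matrix. A minimal sketch is given below, with a crude log damping standing in for LSA's usual log-entropy weighting and with the roughly 300-dimensional target mentioned in the abstract as a default.

```python
import numpy as np

def lsa_embeddings(term_doc_counts, dim=300):
    """Truncated SVD of a term-by-document count matrix; rows of the result
    are the LSA vectors used to measure acquired similarity between words."""
    X = np.log1p(term_doc_counts)                 # simple damping in place of
                                                  # LSA's log-entropy weighting
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    k = min(dim, len(S))
    return U[:, :k] * S[:k]

def word_similarity(vecs, i, j):
    """Cosine similarity between two word rows in the reduced space."""
    a, b = vecs[i], vecs[j]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```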
Article
We propose a novel context heterogeneity similarity measure between words and their translations in helping to compile bilingual lexicon entries from a non-parallel English-Chinese corpus. Current algorithms for bilingual lexicon compilation rely on occurrence frequencies, length or positional statistics derived from parallel texts. There is little correlation between such statistics of a word and its translation in non-parallel corpora. On the other hand, we suggest that words with productive context in one language translate to words with productive context in another language, and words with rigid context translate into words with rigid context. Context heterogeneity measures how productive the context of a word is in a given domain, independent of its absolute occurrence frequency in the text. Based on this information, we derive statistics of bilingual word pairs from a non-parallel corpus. These statistics can be used to bootstrap a bilingual dictionary compilation algor...
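Following the description above, context heterogeneity can be sketched as the pair (distinct left-neighbour types / frequency, distinct right-neighbour types / frequency), with candidate translation pairs ranked by the distance between such pairs. The exact normalisation used in the paper may differ; the helper names below are illustrative only.

```python
import numpy as np

def context_heterogeneity(word, sentences):
    """(distinct left-neighbour types / frequency,
        distinct right-neighbour types / frequency) for `word`."""
    left, right, freq = set(), set(), 0
    for s in sentences:
        for i, w in enumerate(s):
            if w != word:
                continue
            freq += 1
            if i > 0:
                left.add(s[i - 1])
            if i + 1 < len(s):
                right.add(s[i + 1])
    if freq == 0:
        return np.zeros(2)
    return np.array([len(left) / freq, len(right) / freq])

def heterogeneity_distance(src_word, src_sents, tgt_word, tgt_sents):
    """Smaller distance = more plausible translation pair under this measure."""
    return float(np.linalg.norm(
        context_heterogeneity(src_word, src_sents)
        - context_heterogeneity(tgt_word, tgt_sents)))
```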
Article
In this paper, we present an initial algorithm for translating technical terms using a pair of non-parallel corpora. Evaluation results show translation precisions at around 30% when only the top candidate is considered. While this precision is lower than that achieved with parallel corpora, we show that the top-20 candidate output from our algorithm allows translators to increase their accuracy by 50.9%. In the following sections, we first describe the pair of non-parallel corpora we use for experiments, and then we introduce the Word Relation Matrix (WoRM), a statistical word feature representation for technical term translation from non-parallel corpora. We evaluate the effectiveness of this feature with two sets of experiments, using English/English and English/Japanese non-parallel corpora.
Article
A statistical model is presented which predicts the strengths of word associations from the relative frequencies of the common occurrences of words in large bodies of text. These predictions are compared with the Minnesota association norms for 100 stimulus words. The average agreement between the predicted and the observed responses is only slightly weaker than the agreement between the responses of an arbitrary subject and the responses of the other subjects. It is shown that the approach leads to equally good results for both English and German. In the association experiment first used by Galton (1880), subjects are asked to respond to a stimulus word with the first word that comes to their mind. These associative responses have been explained in psychology by the principle of learning by contiguity: "Objects once experienced together tend to become associated in the imagination, so that when any one of them is thought of, the others are likely to be thought of also,...
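A hedged sketch of predicting associative responses from co-occurrence data is given below: candidate responses are ranked by window co-occurrence frequency damped by the candidate's corpus frequency. The damping exponent and window size are placeholders, and the statistical model in the paper uses its own, more principled formula.

```python
from collections import Counter

def predict_associations(stimulus, sentences, window=10, beta=0.66, top_n=5):
    """Rank candidate responses to a stimulus word by a damped ratio of
    window co-occurrence frequency to the candidate's corpus frequency."""
    cooc, freq = Counter(), Counter()
    for s in sentences:
        freq.update(s)
        for i, w in enumerate(s):
            if w != stimulus:
                continue
            for c in s[max(0, i - window): i + window + 1]:
                if c != stimulus:
                    cooc[c] += 1
    scores = {c: n / (freq[c] ** beta) for c, n in cooc.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```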