Conference Paper

Evaluating Resource-Lean Cross-Lingual Embedding Models in Unsupervised Retrieval


Abstract

Cross-lingual embeddings (CLE) facilitate cross-lingual natural language processing and information retrieval. Recently, a wide variety of resource-lean projection-based models for inducing CLEs has been introduced, requiring limited or no bilingual supervision. Despite potential usefulness in downstream IR and NLP tasks, these CLE models have almost exclusively been evaluated on word translation tasks. In this work, we provide a comprehensive comparative evaluation of projection-based CLE models for both sentence-level and document-level cross-lingual Information Retrieval (CLIR). We show that in some settings resource-lean CLE-based CLIR models may outperform resource-intensive models using full-blown machine translation (MT). We hope our work serves as a guideline for choosing the right model for CLIR practitioners.


... In previous work, Litschko et al. (2019) have shown that language transfer by means of cross-lingual embedding spaces (CLWEs) can be used to yield state-of-the-art performance in a range of unsupervised ad-hoc CLIR setups. This approach uses very weak cross-lingual (in this case, bilingual) supervision (i.e., only a bilingual dictionary spanning 1K-5K word translation pairs), or even no bilingual supervision at all, in order to learn a mapping that aligns two monolingual word embedding spaces (Vulić et al. 2019). ...
... Let d = {t_1, t_2, …, t_|D|} ∈ D be a document with |D| terms t_i. CLIR with static CLWEs represents queries and documents as vectors Q, D ∈ ℝ^d in a d-dimensional shared embedding space (Vulić and Moens 2015; Litschko et al. 2019). Each term is represented independently with a pre-computed static embedding vector t_i = emb(t_i). ...
... Given the shared CLWE space, both query and document representations are obtained as aggregations of their term embeddings. We follow Litschko et al. (2019) and represent documents as the weighted sum of their terms' vectors, where each term's weight corresponds to its inverse document frequency (idf). Documents are then ranked in decreasing order of the cosine similarity between their embeddings and the query embedding. ...
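To make this aggregation-and-ranking procedure concrete, here is a minimal sketch in Python. The dictionaries `clwe` (term to vector in the shared cross-lingual space) and `idf` (term to inverse document frequency) are hypothetical inputs used only for illustration, not artifacts released with the paper.

```python
import numpy as np

def embed_text(terms, clwe, idf, dim=300):
    """Aggregate a query or document into one vector: the idf-weighted
    sum of its terms' cross-lingual word embeddings."""
    vec = np.zeros(dim)
    for t in terms:
        if t in clwe:                      # skip out-of-vocabulary terms
            vec += idf.get(t, 1.0) * clwe[t]
    return vec

def rank_documents(query_terms, docs, clwe, idf):
    """Rank documents by cosine similarity to the query embedding.
    `docs` maps a document id to its list of terms."""
    q = embed_text(query_terms, clwe, idf)
    ranking = []
    for doc_id, terms in docs.items():
        d = embed_text(terms, clwe, idf)
        denom = np.linalg.norm(q) * np.linalg.norm(d)
        ranking.append((doc_id, float(q @ d) / denom if denom else 0.0))
    return sorted(ranking, key=lambda x: x[1], reverse=True)
```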
Article
Full-text available
Pretrained multilingual text encoders based on neural transformer architectures, such as multilingual BERT (mBERT) and XLM, have recently become a default paradigm for cross-lingual transfer of natural language processing models, rendering cross-lingual word embedding spaces (CLWEs) effectively obsolete. In this work we present a systematic empirical study focused on the suitability of the state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks across a number of diverse language pairs. We first treat these models as multilingual text encoders and benchmark their performance in unsupervised ad-hoc sentence- and document-level CLIR. In contrast to supervised language understanding, our results indicate that for unsupervised document-level CLIR—a setup with no relevance judgments for IR-specific fine-tuning—pretrained multilingual encoders on average fail to significantly outperform earlier models based on CLWEs. For sentence-level retrieval, we do obtain state-of-the-art performance: the peak scores, however, are met by multilingual encoders that have been further specialized, in a supervised fashion, for sentence understanding tasks, rather than using their vanilla 'off-the-shelf' variants. Following these results, we introduce localized relevance matching for document-level CLIR, where we independently score a query against document sections. In the second part, we evaluate multilingual encoders fine-tuned in a supervised fashion (i.e., we learn to rank) on English relevance data in a series of zero-shot language and domain transfer CLIR experiments. Our results show that, despite the supervision, and due to the domain and language shift, supervised re-ranking rarely improves the performance of multilingual transformers as unsupervised base rankers. Finally, only with in-domain contrastive fine-tuning (i.e., same domain, only language transfer), we manage to improve the ranking quality. We uncover substantial empirical differences between cross-lingual retrieval results and results of (zero-shot) cross-lingual transfer for monolingual retrieval in target languages, which point to "monolingual overfitting" of retrieval models trained on monolingual (English) data, even if they are based on multilingual transformers.
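The abstract above only states that localized relevance matching scores a query against document sections independently. The sketch below illustrates one plausible reading of that idea, under the assumptions that sections are fixed-size overlapping token windows and that per-section scores are aggregated with a max; both choices are assumptions, not details taken from the paper.

```python
import numpy as np

def split_into_sections(tokens, size=128, stride=64):
    """Overlapping token windows standing in for document sections."""
    return [tokens[i:i + size]
            for i in range(0, max(len(tokens) - size, 0) + 1, stride)]

def localized_score(query_vec, section_vecs):
    """Score the query against each section independently and keep the
    best-matching section (max cosine similarity)."""
    q = query_vec / (np.linalg.norm(query_vec) + 1e-9)
    best = -1.0
    for s in section_vecs:
        s = s / (np.linalg.norm(s) + 1e-9)
        best = max(best, float(q @ s))
    return best
```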
... In previous work, Litschko et al. [39] have shown that language transfer by means of cross-lingual embedding spaces (CLWEs) can be used to yield state-of-the-art performance in a range of unsupervised ad-hoc CLIR setups. This approach uses very weak cross-lingual (in this case, bilingual) supervision (i.e., only a bilingual dictionary spanning 1K-5K word translation pairs), or even no bilingual supervision at all, in order to learn a mapping that aligns two monolingual word embedding spaces [24,69]. ...
... It is unclear, however, whether these general-purpose multilingual text encoders can be used directly for ad-hoc CLIR without any additional supervision (i.e., crosslingual relevance judgments). Further, can they outperform unsupervised CLIR approaches based on static CLWEs [39]? How do they perform depending on the (properties of the) language pair at hand? ...
... , t_|D|} ∈ D be a document with |D| terms t_i. CLIR with static CLWEs represents queries and documents as vectors in a shared embedding space [70,39]. Each term is represented independently with a pre-computed static embedding vector ...
Preprint
Full-text available
In this work we present a systematic empirical study focused on the suitability of the state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks across a number of diverse language pairs. We first treat these models as multilingual text encoders and benchmark their performance in unsupervised ad-hoc sentence- and document-level CLIR. In contrast to supervised language understanding, our results indicate that for unsupervised document-level CLIR -- a setup with no relevance judgments for IR-specific fine-tuning -- pretrained multilingual encoders on average fail to significantly outperform earlier models based on CLWEs. For sentence-level retrieval, we do obtain state-of-the-art performance: the peak scores, however, are met by multilingual encoders that have been further specialized, in a supervised fashion, for sentence understanding tasks, rather than using their vanilla 'off-the-shelf' variants. Following these results, we introduce localized relevance matching for document-level CLIR, where we independently score a query against document sections. In the second part, we evaluate multilingual encoders fine-tuned in a supervised fashion (i.e., we learn to rank) on English relevance data in a series of zero-shot language and domain transfer CLIR experiments. Our results show that supervised re-ranking rarely improves the performance of multilingual transformers as unsupervised base rankers. Finally, only with in-domain contrastive fine-tuning (i.e., same domain, only language transfer), we manage to improve the ranking quality. We uncover substantial empirical differences between cross-lingual retrieval results and results of (zero-shot) cross-lingual transfer for monolingual retrieval in target languages, which point to "monolingual overfitting" of retrieval models trained on monolingual data.
... In previous work, Litschko et al. [27] have shown that language transfer through cross-lingual embedding spaces (CLWEs) can be used to yield state-of-the-art performance in a range of unsupervised ad-hoc CLIR setups. This approach uses very weak supervision (i.e., only a bilingual dictionary spanning 1K-5K word translation pairs), or even no supervision at all, in order to learn a mapping that aligns two monolingual word embedding spaces [19,45]. ...
... First, it is unclear whether these general-purpose multilingual text encoders can be used directly for ad-hoc CLIR without any additional supervision (i.e., relevance judgments). Further, can they outperform the previous unsupervised CLIR approaches based on static CLWEs [27]? How do they perform depending on the (properties of the) language pair at hand? ...
... , t_|D|} ∈ D be a document consisting of |D| terms t_i. A typical approach to CLIR with static CLWEs is to represent queries and documents as vectors [46,27]. Each term is represented independently and obtained by performing a lookup on a pre-computed static embedding table ...
Preprint
Full-text available
Pretrained multilingual text encoders based on neural Transformer architectures, such as multilingual BERT (mBERT) and XLM, have achieved strong performance on a myriad of language understanding tasks. Consequently, they have been adopted as a go-to paradigm for multilingual and cross-lingual representation learning and transfer, rendering cross-lingual word embeddings (CLWEs) effectively obsolete. However, questions remain as to what extent this finding generalizes 1) to unsupervised settings and 2) to ad-hoc cross-lingual IR (CLIR) tasks. Therefore, in this work we present a systematic empirical study focused on the suitability of the state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks across a large number of language pairs. In contrast to supervised language understanding, our results indicate that for unsupervised document-level CLIR -- a setup with no relevance judgments for IR-specific fine-tuning -- pretrained encoders fail to significantly outperform models based on CLWEs. For sentence-level CLIR, we demonstrate that state-of-the-art performance can be achieved. However, the peak performance is not met using the general-purpose multilingual text encoders 'off-the-shelf', but rather relying on their variants that have been further specialized for sentence understanding tasks.
... Embedding the translation component in the fine-tuning stage along with the ranking makes the training of deep neural models for the CLIR more challenging, particularly when dealing with resource-lean languages [1,23]. Pre-trained language models such as BERT [12] have shown promising performance gains for monolingual information retrieval [15,34,46,49]. ...
... These models offer a shared representation space for a large number of languages, and the representation of a token is contextualized based on the other tokens in a sequence. Thus, these approaches capture higher-level semantics compared to CLWEs and, once fine-tuned, they have been shown to be effective across a wide variety of tasks, including CLIR [23,36,47]. However, we assume that the translation gap still exists in the multilingual transformers and it is important to inject translation knowledge into such architectures. ...
... Evaluation. For evaluating retrieval effectiveness, we follow prior work on the CLEF dataset [2,23] and report mean average precision (MAP) of the top 100 ranked documents and precision of the top 10 retrieved documents (P@10). We determine statistical significance using the two-tailed paired t-test with a p-value less than 0.05 (i.e., 95% confidence level). ...
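For reference, the two reported metrics can be computed with a short sketch like the one below. It assumes `run` maps each query id to a ranked list of document ids and `qrels` maps each query id to the set of relevant document ids; edge-case conventions (e.g., queries with no relevant documents) may differ from trec_eval.

```python
def precision_at_k(ranked_doc_ids, relevant, k=10):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in ranked_doc_ids[:k] if d in relevant) / k

def average_precision(ranked_doc_ids, relevant, cutoff=100):
    """Average precision over the top-`cutoff` ranked documents."""
    hits, ap = 0, 0.0
    for rank, d in enumerate(ranked_doc_ids[:cutoff], start=1):
        if d in relevant:
            hits += 1
            ap += hits / rank
    return ap / max(len(relevant), 1)

def mean_average_precision(run, qrels, cutoff=100):
    """MAP over all queries in `run`."""
    aps = [average_precision(run[q], qrels.get(q, set()), cutoff) for q in run]
    return sum(aps) / max(len(aps), 1)
```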
Preprint
Full-text available
Pretrained contextualized representations have brought great success to many downstream tasks, including document ranking. The multilingual versions of such pretrained representations offer the possibility of jointly learning many languages with the same model. Although joint training is expected to bring large gains, in the case of cross-lingual information retrieval (CLIR) the models under a multilingual setting do not achieve the same level of performance as those under a monolingual setting. We hypothesize that the performance drop is due to the translation gap between queries and documents. In the monolingual retrieval task, because of the shared lexical inputs, it is easier for the model to identify the query terms that occur in documents. However, in multilingual pretrained models, where the words of different languages are projected into the same space, the model tends to translate query terms into related terms, i.e., terms that appear in a similar context, in addition to or sometimes rather than synonyms in the target language. This property makes it difficult for the model to connect terms that co-occur in both query and document. To address this issue, we propose a novel Mixed Attention Transformer (MAT) that incorporates external word-level knowledge, such as a dictionary or translation table. We design a sandwich-like architecture to embed MAT into recent transformer-based deep neural models. By encoding the translation knowledge into an attention matrix, the model with MAT is able to focus on the mutually translated words in the input sequence. Experimental results demonstrate the effectiveness of the external knowledge and the significant improvement that the MAT-embedded neural reranking model brings to the CLIR task.
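As a rough illustration of the general idea of encoding translation knowledge into an attention matrix, the sketch below adds an additive bias to scaled dot-product attention at positions marked as mutual translations. This is not the MAT architecture itself; the mask construction, the bias weight `alpha`, and the single-head formulation are all simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def translation_biased_attention(Q, K, V, trans_mask, alpha=1.0):
    """Scaled dot-product attention with an additive bias that boosts
    positions marked as mutual translations in `trans_mask`.

    Q: (batch, seq_q, d); K, V: (batch, seq_k, d)
    trans_mask: (batch, seq_q, seq_k), 1.0 where the query-side and
    document-side tokens are translations according to an external
    dictionary or translation table, else 0.0.
    """
    d = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d ** 0.5   # (batch, seq_q, seq_k)
    scores = scores + alpha * trans_mask          # favour translated pairs
    weights = F.softmax(scores, dim=-1)
    return weights @ V
```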
... In a recent study, the authors evaluate cross-lingual embeddings in an information retrieval task [15]. Apart from the fact that we also include a classification task in this paper, we also focus on the inference part rather than relying on simple cosine similarities. ...
Preprint
Word embeddings are high-dimensional vector representations of words that capture their semantic similarity in the vector space. There exist several algorithms for learning such embeddings, both for a single language and for several languages jointly. In this work we propose to evaluate collections of embeddings by adapting downstream natural language tasks to the optimal transport framework. We show how the family of Wasserstein distances can be used to solve the cross-lingual document retrieval and cross-lingual document classification problems. We argue for the advantages of this approach compared to more traditional methods of evaluating embeddings, such as bilingual lexicon induction. Our experimental results suggest that using Wasserstein distances on these problems outperforms several strong baselines and performs on par with state-of-the-art models.
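As a toy illustration of embedding-based transport distances between documents, the sketch below computes the relaxed word mover's distance, a cheap lower bound on the exact Wasserstein distance in which each source word sends all of its mass to its nearest target word. The paper itself solves the full optimal transport problem; this relaxation is only meant to show the ingredients (word embeddings, a cost matrix, and mass weights).

```python
import numpy as np

def relaxed_wmd(src_vecs, tgt_vecs, src_weights=None):
    """Relaxed word mover's distance between two documents given as
    arrays of word embeddings (lower bound on the exact EMD)."""
    src = np.asarray(src_vecs, dtype=float)    # (n, d) source-document embeddings
    tgt = np.asarray(tgt_vecs, dtype=float)    # (m, d) target-document embeddings
    if src_weights is None:
        src_weights = np.full(len(src), 1.0 / len(src))
    # pairwise Euclidean costs between every source and target word
    costs = np.linalg.norm(src[:, None, :] - tgt[None, :, :], axis=-1)  # (n, m)
    # each source word moves all of its mass to its cheapest target word
    return float(np.sum(src_weights * costs.min(axis=1)))
```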
... Unsupervised multilingual word embeddings. Cross-lingual embeddings of words can be obtained by post-hoc alignment of monolingual word embeddings (Mikolov et al., 2013) and mean-pooled with IDF weights to represent sentences (Litschko et al., 2019). Unsupervised techniques to find a linear mapping between embedding spaces were proposed by Artetxe et al. (2018) and Conneau et al. (2018), using iterative self-learning or adversarial training. ...
Preprint
Existing models of multilingual sentence embeddings require large parallel data resources which are not available for low-resource languages. We propose a novel unsupervised method to derive multilingual sentence embeddings relying only on monolingual data. We first produce a synthetic parallel corpus using unsupervised machine translation, and use it to fine-tune a pretrained cross-lingual masked language model (XLM) to derive the multilingual sentence representations. The quality of the representations is evaluated on two parallel corpus mining tasks with improvements of up to 22 F1 points over vanilla XLM. In addition, we observe that a single synthetic bilingual corpus is able to improve results for other language pairs.
... Word Embeddings. Crosslingual embedding methods perform cross-lingual relevance prediction by representing query and passage terms of different languages in a shared semantic space (Vulić and Moens, 2015; Litschko et al., 2019, 2018). Both supervised approaches trained on parallel sentence corpora (Levy et al., 2017; Luong et al., 2015) and unsupervised approaches with no parallel data (Lample et al., 2018; Artetxe et al., 2018) have been proposed to train cross-lingual word embeddings. ...
Preprint
This paper proposes an approach to cross-language sentence selection in a low-resource setting. It uses data augmentation and negative sampling techniques on noisy parallel sentence data to directly learn a cross-lingual embedding-based query relevance model. Results show that this approach performs as well as or better than multiple state-of-the-art machine translation + monolingual retrieval systems trained on the same parallel data. Moreover, when a rationale training secondary objective is applied to encourage the model to match word alignment hints from a phrase-based statistical machine translation model, consistent improvements are seen across three language pairs (English-Somali, English-Swahili and English-Tagalog) over a variety of state-of-the-art baselines.
... Word Embeddings. Crosslingual embedding methods perform cross-lingual relevance prediction by representing query and passage terms of different languages in a shared semantic space (Vulić and Moens, 2015; Litschko et al., 2019, 2018; Joulin et al., 2018). Both supervised approaches trained on parallel sentence corpora (Levy et al., 2017; Luong et al., 2015) and unsupervised approaches with no parallel data (Lample et al., 2018; Artetxe et al., 2018) have been proposed to train cross-lingual word embeddings. ...
... This does, however, assume that the vector spaces are isomorphic, which need not be true for languages that are typologically different, and this approach has been shown not to work well in such cases [184]. Litschko et al. [98] and Litschko et al. [99] used these unsupervised approaches to perform fully unsupervised CLIR in language pairs involving limited training resources. However, the effectiveness of fully unsupervised approaches was found to be rather limited when compared to supervised methods (Vulić et al. [184]). ...
Preprint
Full-text available
Two key assumptions shape the usual view of ranked retrieval: (1) that the searcher can choose words for their query that might appear in the documents that they wish to see, and (2) that ranking retrieved documents will suffice because the searcher will be able to recognize those which they wished to find. When the documents to be searched are in a language not known by the searcher, neither assumption is true. In such cases, Cross-Language Information Retrieval (CLIR) is needed. This chapter reviews the state of the art for cross-language information retrieval and outlines some open research questions.
Chapter
Pretrained multilingual text encoders based on neural Transformer architectures, such as multilingual BERT (mBERT) and XLM, have achieved strong performance on a myriad of language understanding tasks. Consequently, they have been adopted as a go-to paradigm for multilingual and cross-lingual representation learning and transfer, rendering cross-lingual word embeddings (CLWEs) effectively obsolete. However, questions remain as to what extent this finding generalizes 1) to unsupervised settings and 2) to ad-hoc cross-lingual IR (CLIR) tasks. Therefore, in this work we present a systematic empirical study focused on the suitability of the state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks across a large number of language pairs. In contrast to supervised language understanding, our results indicate that for unsupervised document-level CLIR – a setup with no relevance judgments for IR-specific fine-tuning – pretrained encoders fail to significantly outperform models based on CLWEs. For sentence-level CLIR, we demonstrate that state-of-the-art performance can be achieved. However, the peak performance is not met using the general-purpose multilingual text encoders 'off-the-shelf', but rather relying on their variants that have been further specialized for sentence understanding tasks.
Chapter
Full-text available
We tested four methods of making document representations cross-lingual for the task of semantic search for similar papers, based on a corpus of papers from three Russian conferences on NLP: Dialogue, AIST and AINL. The pipeline consisted of three stages: preprocessing, word-by-word vectorisation using models obtained with various methods to map vectors from two independent vector spaces to a common one, and search for the most similar papers based on the cosine similarity of text vectors. The four methods used can be grouped into two approaches: 1) aligning two pretrained monolingual word embedding models with a bilingual dictionary on our own (for example, with the VecMap algorithm) and 2) using pre-aligned cross-lingual word embedding models (MUSE). To find out which approach brings more benefit to the task, we conducted a manual evaluation of the results and calculated the average precision of recommendations for all the methods mentioned above. MUSE turned out to have the highest search relevance, but the other methods produced more recommendations in a language other than that of the target paper.
Article
Projection-based methods for generating high-quality Cross-Lingual Embeddings (CLEs) have shown state-of-the-art performance in many multilingual applications. Supervised methods that rely on character-level information or unsupervised methods that need only monolingual information are both popular and have their pros and cons. However, there are still problems in terms of the quality of monolingual word embedding spaces and the generation of the seed dictionaries. In this work, we aim to generate effective CLEs with auxiliary Topic Models. We utilize both monolingual and bilingual topic models in the procedure of generating monolingual embedding spaces and seed dictionaries for projection. We present a comprehensive evaluation of our proposed model by means of bilingual lexicon extraction, cross-lingual semantic word similarity and cross-lingual document classification tasks. We show that our proposed model outperforms existing supervised and unsupervised CLE models built on basic monolingual embedding spaces and seed dictionaries. It also exceeds CLE models generated from representative monolingual topical word embeddings.
Article
Full-text available
The alignment of word embedding spaces in different languages into a common crosslingual space has recently been in vogue. Strategies that do so compute pairwise alignments and then map multiple languages to a single pivot language (most often English). These strategies, however, are biased towards the choice of the pivot language, given that language proximity and the linguistic characteristics of the target language can strongly impact the resultant crosslingual space to the detriment of topologically distant languages. We present a strategy that eliminates the need for a pivot language by learning the mappings across languages in a hierarchical way. Experiments demonstrate that our strategy significantly improves vocabulary induction scores in all existing benchmarks, as well as in a new non-English–centered benchmark we built, which we make publicly available.
Article
Cross-lingual information retrieval (CLIR) methods have quickly made the transition from translation-based approaches to semantic-based approaches. In this paper, we examine the limitations of current unsupervised neural CLIR methods, especially those leveraging aligned cross-lingual word embedding (CLWE) spaces. At the moment, CLWEs are normally constructed on the monolingual corpus of bilingual texts through an iterative induction process. Homonymy and polysemy have become major obstacles in this process. On the other hand, contextual text representation methods often fail to outperform static CLWE methods significantly for CLIR. We propose a method utilizing a novel neural generative model with Wasserstein autoencoders to learn neural topic-enhanced CLWEs for CLIR purposes. Our method requires minimal or no supervision at all. On the CLEF test collections, we perform a comparative evaluation of the state-of-the-art semantic CLWE methods along with our proposed method for neural CLIR tasks. We demonstrate that our method outperforms the existing CLWE methods and multilingual contextual text encoders. We also show that our proposed method obtains significant improvements over the CLWE methods based upon representative topical embeddings.
Article
Full-text available
State-of-the-art methods for learning cross-lingual word embeddings have relied on bilingual dictionaries or parallel corpora. Recent works showed that the need for parallel data supervision can be alleviated with character-level information. While these methods showed encouraging results, they are not on par with their supervised counterparts and are limited to pairs of languages sharing a common alphabet. In this work, we show that we can build a bilingual dictionary between two languages without using any parallel corpora, by aligning monolingual word embedding spaces in an unsupervised way. Without using any character information, our model even outperforms existing supervised methods on cross-lingual tasks for some language pairs. Our experiments demonstrate that our method works very well also for distant language pairs, like English-Russian or English-Chinese. We finally show that our method is a first step towards fully unsupervised machine translation and describe experiments on the English-Esperanto language pair, on which there only exists a limited amount of parallel data.
Conference Paper
Full-text available
We present a suite of query expansion methods that are based on word embeddings. Using Word2Vec's CBOW embedding approach, applied over the entire corpus on which search is performed, we select terms that are semantically related to the query. Our methods either use the terms to expand the original query or integrate them with the effective pseudo-feedback-based relevance model. In the former case, retrieval performance is significantly better than that of using only the query, and in the latter case the performance is significantly better than that of the relevance model.
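A minimal gensim-based sketch of this kind of embedding-based query expansion is shown below; the toy corpus and the choice to simply append the top related terms are assumptions, and the integration with a pseudo-feedback relevance model described in the paper is not reproduced.

```python
from gensim.models import Word2Vec

# Hypothetical corpus: one tokenized "document" per list.
corpus = [["cross", "lingual", "retrieval"], ["word", "embeddings", "retrieval"]]

# Train CBOW embeddings over the search corpus itself (sg=0 selects CBOW).
model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=0)

def expand_query(query_terms, topn=3):
    """Add the terms most similar to the query terms in embedding space."""
    in_vocab = [t for t in query_terms if t in model.wv]
    if not in_vocab:
        return list(query_terms)
    related = [w for w, _ in model.wv.most_similar(positive=in_vocab, topn=topn)]
    return list(query_terms) + related
```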
Article
Cross-lingual representations of words enable us to reason about word meaning in multilingual contexts and are a key facilitator of cross-lingual transfer when developing natural language processing models for low-resource languages. In this survey, we provide a comprehensive typology of cross-lingual word embedding models. We compare their data requirements and objective functions. The recurring theme of the survey is that many of the models presented in the literature optimize for the same objectives, and that seemingly different models are often equivalent, modulo optimization strategies, hyper-parameters, and such. We also discuss the different ways cross-lingual word embeddings are evaluated, as well as future challenges and research horizons.
Conference Paper
We propose a fully unsupervised framework for ad-hoc cross-lingual information retrieval (CLIR) which requires no bilingual data at all. The framework leverages shared cross-lingual word embedding spaces in which terms, queries, and documents can be represented, irrespective of their actual language. The shared embedding spaces are induced solely on the basis of monolingual corpora in two languages through an iterative process based on adversarial neural networks. Our experiments on the standard CLEF CLIR collections for three language pairs of varying degrees of language similarity (English-Dutch/Italian/Finnish) demonstrate the usefulness of the proposed fully unsupervised approach. Our CLIR models with unsupervised cross-lingual embeddings outperform baselines that utilize cross-lingual embeddings induced relying on word-level and document-level alignments. We then demonstrate that further improvements can be achieved by unsupervised ensemble CLIR models. We believe that the proposed framework is the first step towards development of effective CLIR models for language pairs and domains where parallel data are scarce or non-existent.
Article
Cross-lingual embedding models allow us to project words from different languages into a shared embedding space. This allows us to apply models trained on languages with a lot of data, e.g. English, to low-resource languages. In the following, we will survey models that seek to learn cross-lingual embeddings. We will discuss them based on the type of approach and the nature of parallel data that they employ. Finally, we will present challenges and summarize how to evaluate cross-lingual embedding models.
Article
Usually bilingual word vectors are trained "online". Mikolov et al. showed they can also be found "offline", whereby two pre-trained embeddings are aligned with a linear transformation, using dictionaries compiled from expert knowledge. In this work, we prove that the linear transformation between two spaces should be orthogonal. This transformation can be obtained using the singular value decomposition. We introduce a novel "inverted softmax" for identifying translation pairs, with which we improve the precision @1 of Mikolov's original mapping from 34% to 43%, when translating a test set composed of both common and rare English words into Italian. Orthogonal transformations are more robust to noise, enabling us to learn the transformation without expert bilingual signal by constructing a "pseudo-dictionary" from the identical character strings which appear in both languages, achieving 40% precision on the same test set. Finally, we extend our method to retrieve the true translations of English sentences from a corpus of 200k Italian sentences with a precision @1 of 68%.
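The orthogonal mapping described here is the classical Procrustes solution and can be sketched in a few lines of NumPy; the variable names are illustrative, and the inverted softmax retrieval step is omitted.

```python
import numpy as np

def orthogonal_map(X, Y):
    """Solve min_W ||XW - Y|| subject to W being orthogonal (Procrustes).
    X, Y: (n, d) matrices of source/target embeddings for n dictionary pairs.
    The optimal W is U V^T, where U, V come from the SVD of X^T Y."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Usage sketch: project all source-language vectors into the target space.
# X_dict, Y_dict hold row-aligned embeddings of the seed (or pseudo-) dictionary.
# W = orthogonal_map(X_dict, Y_dict)
# X_mapped = X_all @ W
```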
Conference Paper
We propose a new unified framework for monolingual (MoIR) and cross-lingual information retrieval (CLIR) which relies on the induction of dense real-valued word vectors known as word embeddings (WE) from comparable data. To this end, we make several important contributions: (1) We present a novel word representation learning model called Bilingual Word Embeddings Skip-Gram (BWESG) which is the first model able to learn bilingual word embeddings solely on the basis of document-aligned comparable data; (2) We demonstrate a simple yet effective approach to building document embeddings from single word embeddings by utilizing models from compositional distributional semantics. BWESG induces a shared cross-lingual embedding vector space in which both words, queries, and documents may be presented as dense real-valued vectors; (3) We build novel ad-hoc MoIR and CLIR models which rely on the induced word and document embeddings and the shared cross-lingual embedding space; (4) Experiments for English and Dutch MoIR, as well as for English-to-Dutch and Dutch-to-English CLIR using benchmarking CLEF 2001-2003 collections and queries demonstrate the utility of our WE-based MoIR and CLIR models. The best results on the CLEF collections are obtained by the combination of the WE-based approach and a unigram language model. We also report on significant improvements in ad-hoc IR tasks of our WE-based framework over the state-of-the-art framework for learning text representations from comparable data based on latent Dirichlet allocation (LDA).
Article
The distributional hypothesis of Harris (1954), according to which the meaning of words is evidenced by the contexts they occur in, has motivated several effective techniques for obtaining vector space semantic representations of words using unannotated text corpora. This paper argues that lexico-semantic content should additionally be invariant across languages and proposes a simple technique based on canonical correlation analysis (CCA) for incorporating multilingual evidence into vectors generated monolingually. We evaluate the resulting word representations on standard lexical semantic evaluation tasks and show that our method produces substantially better semantic representations than monolingual techniques.
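A minimal sketch of the CCA step with scikit-learn is shown below; the random matrices stand in for row-aligned monolingual embeddings of dictionary translation pairs, and the number of canonical components is an arbitrary choice.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# X, Y: monolingual embeddings of translation pairs, row-aligned
# via a bilingual dictionary (hypothetical toy data here).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 100))   # source-language vectors
Y = rng.normal(size=(500, 100))   # target-language vectors

cca = CCA(n_components=50, max_iter=1000)
cca.fit(X, Y)
X_c, Y_c = cca.transform(X, Y)    # both projected into the shared CCA space
```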
Article
We present a novel technique for learning semantic representations, which extends the distributional hypothesis to multilingual data and joint-space embeddings. Our models leverage parallel data and learn to strongly align the embeddings of semantically equivalent sentences, while maintaining sufficient distance between those of dissimilar sentences. The models do not rely on word alignments or any syntactic information and are successfully applied to a number of diverse languages. We extend our approach to learn semantic representations at the document level, too. We evaluate these models on two cross-lingual document classification tasks, outperforming the prior state of the art. Through qualitative analysis and the study of pivoting effects we demonstrate that our representations are semantically plausible and can capture semantic relationships across languages without parallel data.
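A small PyTorch sketch of a margin-based alignment objective of this flavour is given below: embeddings of parallel sentences are pulled together while the other sentences in the batch act as negatives. The exact loss used in the paper may differ; this is only an illustration of the "align equivalents, separate dissimilar" principle.

```python
import torch
import torch.nn.functional as F

def alignment_hinge_loss(src, tgt, margin=1.0):
    """Pull embeddings of aligned (parallel) sentences together and push
    apart non-aligned ones sampled from the same batch.

    src, tgt: (batch, d) sentence embeddings; row i of src is parallel to
    row i of tgt, and every other row in the batch serves as a negative."""
    pos = F.pairwise_distance(src, tgt)                    # (batch,)
    neg = torch.cdist(src, tgt)                            # (batch, batch)
    mask = 1.0 - torch.eye(len(src), device=src.device)    # drop the positives
    # hinge: each positive pair should be at least `margin` closer than negatives
    losses = F.relu(margin + pos.unsqueeze(1) - neg) * mask
    return losses.sum() / mask.sum()
```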
Article
Dictionaries and phrase tables are the basis of modern statistical machine translation systems. This paper develops a method that can automate the process of generating and extending dictionaries and phrase tables. Our method can translate missing word and phrase entries by learning language structures based on large monolingual data and mapping between languages from small bilingual data. It uses distributed representation of words and learns a linear mapping between vector spaces of languages. Despite its simplicity, our method is surprisingly effective: we can achieve almost 90% precision@5 for translation of words between English and Spanish. This method makes little assumption about the languages, so it can be used to extend and refine dictionaries and translation tables for any language pairs.
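The linear mapping between embedding spaces can be sketched with a closed-form least-squares fit, as below. The original method learns the mapping with stochastic gradient descent; the closed-form solution to the same objective is used here purely for brevity, and the retrieval helper is illustrative.

```python
import numpy as np

def linear_map(X, Y):
    """Learn W minimizing ||XW - Y||^2 from a small seed dictionary.
    X, Y: (n, d) embeddings of n known translation pairs."""
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

def translate(word_vec, W, tgt_matrix, tgt_words, k=5):
    """Map a source vector and return its k nearest target-language words."""
    mapped = word_vec @ W
    sims = tgt_matrix @ mapped / (
        np.linalg.norm(tgt_matrix, axis=1) * np.linalg.norm(mapped) + 1e-9)
    return [tgt_words[i] for i in np.argsort(-sims)[:k]]
```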
Conference Paper
Distributed representations of words have proven extremely useful in numerous natural language processing tasks. Their appeal is that they can help alleviate data sparsity problems common to supervised learning. Methods for inducing these representations require only unlabeled language data, which are plentiful for many natural languages. In this work, we induce distributed representations for a pair of languages jointly. We treat it as a multitask learning problem where each task corresponds to a single word, and task relatedness is derived from co-occurrence statistics in bilingual parallel data. These representations can be used for a number of crosslingual learning tasks, where a learner can be trained on annotations present in one language and applied to test data in another. We show that our representations are informative by using them for crosslingual document classification, where classifiers trained on these representations substantially outperform strong baselines (e.g. machine translation) when applied to a new language.
Article
We collected a corpus of parallel text in 11 languages from the proceedings of the European Parliament, which are published on the web. This corpus has found widespread use in the NLP community. Here, we focus on its acquisition and its application as training data for statistical machine translation (SMT). We trained SMT systems for 110 language pairs, which reveal interesting clues into the challenges ahead.
Sebastian Ruder, Anders Søgaard and Ivan Vulić