Figure 2. Percentage of categories with at least one domain term in the title for the two languages and the three domains under study.
Source publication
Multiple approaches for gathering comparable data from the Web have been developed to date. Nevertheless, producing a high-quality comparable corpus on a specific topic is not straightforward. We present a model for the automatic extraction of comparable texts in multiple languages and on specific topics from Wikipedia. In order to prove the val...
Context in source publication
Context 1
... The relatively small size of these vocabularies allows us to manually check that 10% is the best option to characterise the desired category; higher percentages add more noise than in-domain terms. The plots in Figure 2 show the percentage of categories with at least one domain term in the title.

Table 2:
Vocabulary    en    es    en    es
CS             4   130   106   447
Sc            29     3   464   140
Sp             3    10   122   100

When extracting the corpus, one must decide the adequate percentage of positive categories allowed. High thresholds lead to small corpora, whereas low thresholds lead to larger, but noisier, corpora. ...
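The threshold trade-off described in this context can be made concrete with a minimal sketch. This is an illustration only, not the paper's implementation: the vocabulary, the helper names and the 0.5 threshold are assumptions for the example.

```python
# Minimal sketch of threshold-based category filtering (illustrative only).

def is_positive(category_title, domain_vocabulary):
    """A category is 'positive' if its title contains a domain term."""
    tokens = set(category_title.lower().split())
    return bool(tokens & domain_vocabulary)

def keep_article(article_categories, domain_vocabulary, threshold=0.5):
    """Keep an article if the fraction of positive categories reaches the
    threshold. Higher thresholds yield smaller, cleaner corpora; lower
    thresholds yield larger but noisier ones."""
    if not article_categories:
        return False
    positive = sum(is_positive(c, domain_vocabulary) for c in article_categories)
    return positive / len(article_categories) >= threshold

# Example with an illustrative Computer Science vocabulary fragment:
vocab = {"algorithm", "software", "computing", "programming"}
print(keep_article(["Computing platforms", "1980s software"], vocab, 0.5))
```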
Similar publications
The human brain processes language to optimise efficient communication. Studies have provided extensive evidence that the brain's response to language is affected both by lower-level features, such as word length and frequency, and by syntactic and semantic violations within sentences. However, our understanding of cognitive processes at discourse level...
Parallel corpora are vital components in several applications of Natural Language Processing (NLP), especially in machine translation. In this paper, we present a novel method to automatically create parallel sentences from comparable corpora. The method requires a bilingual dictionary as well as an adequate word vectorisation method. We use Arabic...
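As a rough illustration of the kind of pipeline this abstract describes (a bilingual dictionary plus a word vectorisation method), here is a hedged sketch; the `dictionary` and `embeddings` inputs and the 0.7 threshold are assumptions, not the authors' actual method.

```python
import numpy as np

def sentence_vector(words, embeddings):
    # Average the available word vectors into a sentence vector.
    vecs = [embeddings[w] for w in words if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else None

def similarity(src_words, tgt_words, dictionary, embeddings):
    # Bridge languages: map source words through the bilingual dictionary,
    # then compare both sentences in the target embedding space.
    translated = [dictionary[w] for w in src_words if w in dictionary]
    v1 = sentence_vector(translated, embeddings)
    v2 = sentence_vector(tgt_words, embeddings)
    if v1 is None or v2 is None:
        return 0.0
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

def mine_pairs(src_sents, tgt_sents, dictionary, embeddings, threshold=0.7):
    # Keep, for each source sentence, its best-scoring target candidate.
    for src in src_sents:
        scored = [(similarity(src, tgt, dictionary, embeddings), tgt)
                  for tgt in tgt_sents]
        score, best = max(scored, key=lambda p: p[0])
        if score >= threshold:
            yield src, best, score
```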
This paper presents a network-theoretical approach to modeling the semantics of large text networks. Taking the German Wikipedia as an example, we demonstrate how to estimate the structuring of the topics addressed by large corpora of natural language texts. Algorithms of this sort are needed to implement distributional semantics of textual manifestations in lar...
In this paper, we describe our participation in GermEval-2019 Task 2, which requires identifying and classifying offensive content in German tweets. For all three challenging subtasks, i.e. (i) Subtask 1, a binary classification between Offensive and Non-Offensive tweets; (ii) Subtask 2, a fine-grained classification into three different categories: Pr...
Citations
... In this paper, we used five datasets, PAN11 [22], JRC-ACQUIS [23], EUROPARL [24], Wikipedia [25], and conference papers [26], for three language pairs: En-Fr, En-Es, and En-De. In the case of English-French, we used 10,620 plagiarized (P) documents, combining data from two sources: conference papers and JRC-Acquis. ...
The pervasive availability of vast online information has fundamentally altered our approach to acquiring knowledge. Nevertheless, this wealth of data has also presented significant challenges to academic integrity, notably in the realm of cross-lingual plagiarism. This type of plagiarism involves the unauthorized copying or translation of ideas or works from one language into another without proper citation. This research introduces a methodology for identifying multilingual plagiarism, utilizing a pre-trained multilingual bidirectional and auto-regressive transformers (mBART) model for document feature extraction. Additionally, a siamese long short-term memory (SLSTM) model is employed for classifying pairs of documents as either "plagiarized" or "non-plagiarized". Our approach exhibits notable performance across various languages, including English (En), Spanish (Es), German (De), and French (Fr). Notably, experiments focusing on the En-Fr language pair yielded exceptional results, with an accuracy of 98.83%, precision of 98.42%, recall of 99.32%, and F-score of 98.87%. For En-Es, the model achieved an accuracy of 97.94%, precision of 98.57%, recall of 97.47%, and an F-score of 98.01%. In the case of En-De, the model demonstrated an accuracy of 95.59%, precision of 95.21%, recall of 96.85%, and F-score of 96.02%. These outcomes underscore the effectiveness of combining the mBART transformer and SLSTM models for cross-lingual plagiarism detection.
... This problem has been addressed in two main ways in previous work, not specifically related to news translation. Computational techniques have been employed to mine parallel sentences from comparable corpora and noisy parallel corpora (Barrón-Cedeño, España-Bonet, Boldoba, & Màrquez, 2015; Gete et al., 2022). Extracting parallel sentences from similar multilingual corpora is a well-known problem, addressed as a necessary step when gathering data for training and testing of machine translation systems, as well as for cross-lingual information retrieval algorithms. ...
This contribution addresses the challenging issue of building corpus resources for the study of news translation, a domain in which the coexistence of radical rewriting and close translation makes the use of established corpus-assisted analytical techniques problematic. In an attempt to address these challenges, we illustrate and test two related methods for identifying translated segments within trilingual (Spanish, French and English) sets of dispatches issued by the global news agency Agence France-Presse. One relies on machine translation and semantic similarity scores, the other on multilingual sentence embeddings. To evaluate these methods, we apply them to a benchmark dataset of translations from the same domain and perform manual evaluation of the dataset under study. We finally leverage the cross-linguistic equivalences thus identified to build a ‘comparallel’ corpus, which combines the parallel and comparable corpus architectures, highlighting its affordances and limitations for the study of news translation. We conclude by discussing the theoretical and methodological implications of our findings both for the study of news translation and more generally for the study of contemporary, novel forms of translation.
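A minimal sketch of the first of the two methods (machine translation plus similarity scoring) might look as follows; `translate_fn` is a hypothetical injection point for any real MT system, and `SequenceMatcher` merely stands in for the semantic similarity scores the authors use.

```python
from difflib import SequenceMatcher

def find_translated_segments(src_sents, tgt_sents, translate_fn, threshold=0.8):
    # Machine-translate each source sentence, then score every candidate
    # target sentence against the MT output; high scores flag translations.
    pairs = []
    for src in src_sents:
        mt = translate_fn(src)
        for tgt in tgt_sents:
            score = SequenceMatcher(None, mt.lower(), tgt.lower()).ratio()
            if score >= threshold:
                pairs.append((src, tgt, score))
    return pairs

# Toy usage with a stub "translator" standing in for a real MT call:
print(find_translated_segments(["bonjour le monde"], ["hello world"],
                               translate_fn=lambda s: "hello world"))
```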
... In order to have sufficient training data, we have gathered four datasets: PAN-PC-11, JRC-Acquis, Europarl, and Wikipedia (Spanish-English) [36,37,38,39]. PAN 2011 (PAN-PC-11) is an evaluation corpus for automatic plagiarism detection algorithms. ...
Academic plagiarism has become a serious concern as it leads to the retardation of scientific progress and violation of intellectual property. In this context, we present a study aimed at the detection of cross-language plagiarism based on Natural Language Processing (NLP), embedding techniques, and deep learning. Many systems have been developed to tackle this problem, and many rely on machine learning and deep learning methods. In this paper, we propose a Cross-Language Plagiarism Detection (CL-PD) method based on Doc2Vec embedding techniques and a Siamese Long Short-Term Memory (SLSTM) model. Embedding techniques help capture the text's contextual meaning and improve the CL-PD system's performance. To show the effectiveness of our method, we conducted a comparative study with other techniques such as GloVe, FastText, BERT, and Sen2Vec on a dataset combining PAN11, JRC-Acquis, Europarl, and Wikipedia. The experiments for the Spanish-English language pair show that Doc2Vec+SLSTM achieves the best results compared to other relevant models, with an accuracy of 99.81%, a precision of 99.75%, a recall of 99.88%, an F-score of 99.70%, and a very small loss in the test phase.
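For illustration, a Siamese LSTM of the general kind described in these two abstracts can be sketched in a few lines of PyTorch; the dimensions and the absolute-difference similarity head are assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class SiameseLSTM(nn.Module):
    def __init__(self, emb_dim=300, hidden=128):
        super().__init__()
        # One shared encoder processes both documents (the "Siamese" part).
        self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, 1)

    def encode(self, x):
        _, (h, _) = self.encoder(x)  # final hidden state as document vector
        return h[-1]

    def forward(self, doc_a, doc_b):
        a, b = self.encode(doc_a), self.encode(doc_b)
        # |a - b| feeds a logistic head: plagiarized vs. non-plagiarized.
        return torch.sigmoid(self.classifier(torch.abs(a - b))).squeeze(-1)

model = SiameseLSTM()
a = torch.randn(2, 50, 300)  # batch of 2 docs, 50 tokens, 300-d embeddings
b = torch.randn(2, 50, 300)
print(model(a, b))  # probability that each pair is plagiarized
```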
... Unlike parallel corpora, so-called comparable corpora do not necessarily possess parallel structures, but merely share the same topics per corresponding unit (e.g., articles). Wikipedia can be seen as a comparable corpus, since a correspondence relation between languages can be established for individual articles (McEnery and Xiao, 2007; Otero and López, 2010; Barrón-Cedeño et al., 2015). ...
This work describes a blueprint for an application that generates language learning exercises from parallel corpora. Word alignment and parallel structures allow for the automatic assessment of sentence pairs in the source and target languages, while users of the application continuously improve the quality of the data with their interactions, thus crowdsourcing parallel language learning material. Through triangulation, their assessment can be transferred to language pairs other than the original ones if multiparallel corpora are used as a source.
Several challenges need to be addressed for such an application to work, and we will discuss three of them here. First, the question of how adequate learning material can be identified in corpora has received some attention in the last decade, and we will detail what the structure of parallel corpora implies for that selection. Second, we will consider which types of exercises can be generated automatically from parallel corpora such that they foster learning and keep learners motivated; a sketch of one such exercise follows below. Third, we will highlight the potential of employing users, that is, both teachers and learners, as crowdsourcers to help improve the material.
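As a concrete example of the exercise-generation step, a cloze gap can be produced directly from a word-aligned sentence pair; the alignment format (pairs of source and target token indices) is an assumption made for this sketch.

```python
import random

def make_cloze(src_sent, tgt_sent, alignment):
    """Blank one aligned target word; the aligned source word is the hint."""
    src_tokens, tgt_tokens = src_sent.split(), tgt_sent.split()
    src_i, tgt_i = random.choice(alignment)
    answer = tgt_tokens[tgt_i]
    tgt_tokens[tgt_i] = "____"
    return {"prompt": " ".join(tgt_tokens),
            "hint": src_tokens[src_i],
            "answer": answer}

# Toy aligned pair: English source, Spanish target, 1-to-1 alignment.
pair = ("the cat sleeps", "el gato duerme", [(0, 0), (1, 1), (2, 2)])
print(make_cloze(*pair))
```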
... Wikipedia's inter-language links are crucial to obtain an aligned comparable corpus. The value of the Wikipedia as a source of highly comparable and parallel sentences has been appreciated over the years [1,5,9,37,47-49,55]. With the rise of deep learning for NLP and the need for large amounts of clean data, the use of Wikipedia has grown exponentially, not only for parallel sentence extraction and machine translation [25,44,46,53], but also for semantics. ...
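The inter-language links mentioned here can be queried through the public MediaWiki API; the following sketch pairs an English article with its Spanish counterpart. Error handling and batching are omitted, and the example language pair is an arbitrary choice.

```python
import requests

def cross_lingual_title(title, src="en", tgt="es"):
    # Ask the source edition for the inter-language link to the target one.
    resp = requests.get(
        f"https://{src}.wikipedia.org/w/api.php",
        params={"action": "query", "prop": "langlinks", "titles": title,
                "lllang": tgt, "format": "json"},
    ).json()
    page = next(iter(resp["query"]["pages"].values()))
    links = page.get("langlinks", [])
    return links[0]["*"] if links else None

print(cross_lingual_title("Machine translation"))
```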
... This results in a collection of 741 categories. For comparison purposes, categories used in previous research are added if not already present: Archaeology, Linguistics, Physics, Biology, and Sport [22]; Mountaineering [38] and Computer Science [5]. Observe that Computer Science does not exist in the Greek edition nor Mountaineering in the Occitan one. ...
We propose a language-independent graph-based method to build à-la-carte article collections on user-defined domains from the Wikipedia. The core model is based on the exploration of the encyclopedia’s category graph and can produce both mono- and multilingual comparable collections. We run thorough experiments to assess the quality of the obtained corpora in 10 languages and 743 domains. According to an extensive manual evaluation, our graph model reaches an average precision of 84% on in-domain articles, outperforming an alternative model based on information retrieval techniques. As manual evaluations are costly, we introduce the concept of domainness and design several automatic metrics to account for the quality of the collections. Our best metric for domainness shows a strong correlation with human judgments, representing a reasonable automatic alternative to assess the quality of domain-specific corpora. We release the WikiTailor toolkit with the implementation of the extraction methods, the evaluation measures and several utilities.
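The core graph exploration can be pictured with a short breadth-first sketch; `subcategories`, `articles_in`, and the fixed depth cut-off are stand-ins for the toolkit's actual traversal and its domainness-based stopping criteria.

```python
from collections import deque

def collect_domain_articles(root_category, subcategories, articles_in,
                            max_depth=3):
    # Breadth-first walk of the category graph from a user-defined root,
    # collecting the articles attached to every visited category.
    seen, collected = {root_category}, set()
    queue = deque([(root_category, 0)])
    while queue:
        category, depth = queue.popleft()
        collected.update(articles_in(category))
        if depth < max_depth:  # a domainness criterion would decide this
            for sub in subcategories(category):
                if sub not in seen:
                    seen.add(sub)
                    queue.append((sub, depth + 1))
    return collected
```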
... Most likely, adding more language pairs and using ideas from recent work should help improve the accuracy of our models. Wikipedia has always been an interesting dataset for solving NLP problems, including machine translation (Li et al., 2012; Patry and Langlais, 2011; Lin et al., 2011; Tufiş et al., 2013; Barrón-Cedeño et al., 2015; Ruiter et al., 2019). The WikiMatrix data (Schwenk et al., 2019a) is the most similar effort to ours in terms of using Wikipedia, but it uses supervised translation models. ...
We present a simple but effective approach for leveraging Wikipedia for neural machine translation as well as cross-lingual tasks of image captioning and dependency parsing without using any direct supervision from external parallel data or supervised models in the target language. We show that first sentences and titles of linked Wikipedia pages, as well as cross-lingual image captions, are strong signals for a seed parallel data to extract bilingual dictionaries and cross-lingual word embeddings for mining parallel text from Wikipedia. Our final model achieves high BLEU scores that are close to or sometimes higher than strong supervised baselines in low-resource languages; e.g. supervised BLEU of 4.0 versus 12.1 from our model in English-to-Kazakh. Moreover, we tailor our wikily translation models to unsupervised image captioning and cross-lingual dependency parser transfer. In image captioning, we train a multi-tasking machine translation and image captioning pipeline for Arabic and English from which the Arabic training data is a wikily translation of the English captioning data. Our captioning results in Arabic are slightly better than those of its supervised counterpart. In dependency parsing, we translate a large amount of monolingual text, and use it as artificial training data in an annotation projection framework. We show that our model outperforms recent work on cross-lingual transfer of dependency parsers.
... The value of the Wikipedia as a source of highly comparable and parallel sentences was soon observed too (Adafre and de Rijke, 2006; Yasuda and Sumita, 2008; Smith et al., 2010; Plamada and Volk, 2012; Barrón-Cedeño et al., 2015). With the rise of deep learning for NLP and the need for large amounts of clean data, the use of Wikipedia has grown exponentially, not only for parallel sentence extraction and machine translation (Varga, 2017; Harsha Ramesh and Prasad Sankaranarayanan, 2018; Ruiter et al., 2019), but also for training models to obtain semantic representations of words and sentences. ...
... In order to extract the domains, we explore the WCG and, as a result of avoiding their "strict" strategy based on the exact category, we are able to extract more articles. This idea was first sketched in Barrón-Cedeño et al. (2015), where we also extracted parallel sentences from the comparable corpora in Computer Science, Science and Sport to successfully domain-adapt a machine translation system. ...
... This cleanup results in a final collection of 741 categories. Categories used in previous research are included, if not already present, for comparison purposes: Archaeology, Linguistics, Physics, Biology, and Sport (Gamallo Otero and González López, 2011); Mountaineering (Plamada and Volk, 2013) and Computer Science (Barrón-Cedeño et al., 2015). Observe that Computer Science does not exist in the Greek edition nor Mountaineering in the Occitan one. ...
We propose an automatic language-independent graph-based method to build à-la-carte article collections on user-defined domains from the Wikipedia. The core model is based on the exploration of the encyclopaedia's category graph and can produce both monolingual and multilingual comparable collections. We run thorough experiments to assess the quality of the obtained corpora in 10 languages and 743 domains. According to an extensive manual evaluation, our graph-based model outperforms a retrieval-based approach and reaches an average precision of 84% on in-domain articles. As manual evaluations are costly, we introduce the concept of "domainness" and design several automatic metrics to account for the quality of the collections. Our best metric for domainness shows a strong correlation with the human-judged precision, representing a reasonable automatic alternative to assess the quality of domain-specific corpora. We release the WikiTailor toolkit with the implementation of the extraction methods, the evaluation measures and several utilities. WikiTailor makes obtaining multilingual in-domain data from the Wikipedia easy.
... Experimental Setup We use Wikipedia (WP) as a comparable corpus to train the self-supervised system. We download the English, French, German and Spanish WP dumps, pre-process them and extract the comparable articles per language pair using WikiTailor (Barrón-Cedeño et al., 2015). All articles are normalized, tokenized and truecased using standard Moses (Koehn et al., 2007) scripts. ...
Self-supervised neural machine translation (SS-NMT) learns how to extract/select suitable training data from comparable, rather than parallel, corpora and how to translate, in a way that the two tasks support each other in a virtuous circle. SS-NMT has been shown to be competitive with state-of-the-art unsupervised NMT. In this study we provide an in-depth analysis of the sampling choices the SS-NMT model takes during training. We show that, without it having been told to do so, the model selects samples of increasing (i) complexity and (ii) task-relevance in combination with (iii) a denoising curriculum. We observe that the dynamics of the mutual-supervision of both system internal representation types is vital for the extraction and hence translation performance. We show that in terms of the human Gunning-Fog Readability index (GF), SS-NMT starts by extracting and learning from Wikipedia data suitable for high school (GF = 10-11) and quickly moves towards content suitable for first-year undergraduate students (GF = 13).
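The Gunning-Fog index used in this analysis has a simple closed form, GF = 0.4 * (words/sentences + 100 * complex_words/words), where "complex" words have three or more syllables. The sketch below uses a crude vowel-group heuristic as its syllable counter, which is an approximation for illustration only.

```python
import re

def syllables(word):
    # Crude heuristic: count groups of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def gunning_fog(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    complex_words = [w for w in words if syllables(w) >= 3]
    return 0.4 * (len(words) / len(sentences)
                  + 100 * len(complex_words) / len(words))

print(round(gunning_fog("The encyclopaedia documents comparable phenomena. "
                        "Cats sleep."), 1))
```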
... Experimental Setup We use Wikipedia (WP) as a comparable corpus and download the English, French, German and Spanish dumps, pre-process them and extract comparable articles per language pair using WikiTailor (Barrón-Cedeño et al., 2015; España-Bonet et al., 2020). All articles are normalized, tokenized and truecased using standard Moses (Koehn et al., 2007) scripts. ...
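The Moses preprocessing chain named in both setups (punctuation normalization, tokenization, truecasing) can be scripted roughly as follows; the mosesdecoder checkout path and the truecasing model filename are assumptions for the sketch.

```python
import subprocess

MOSES = "mosesdecoder/scripts"  # assumed path to a mosesdecoder checkout

def moses_preprocess(infile, outfile, lang="en", tc_model="truecase-model.en"):
    # Standard Moses chain: normalize punctuation -> tokenize -> truecase.
    pipeline = (
        f"perl {MOSES}/tokenizer/normalize-punctuation.perl -l {lang} < {infile} | "
        f"perl {MOSES}/tokenizer/tokenizer.perl -l {lang} | "
        f"perl {MOSES}/recaser/truecase.perl --model {tc_model} > {outfile}"
    )
    subprocess.run(pipeline, shell=True, check=True)
```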
... The ACCURAT project (Ştefănescu et al., 2012; Skadiņa et al., 2012) also devoted efforts to parallel sentence mining in Wikipedia. Later, Barrón-Cedeño et al. (2015) used a combination of cross-lingual similarity measures to extract domain-specific parallel sentences. The most recent initiative is the so-called LASER (Artetxe and Schwenk, 2019b), which relies on vector representations of sentences to extract similar pairs. ...
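In the spirit of the margin-based scoring LASER popularised (Artetxe and Schwenk, 2019), a candidate pair's cosine similarity can be normalized by the average similarity to each sentence's nearest neighbours, penalising "hub" sentences that are close to everything. The sketch below is illustrative only, with random vectors standing in for real sentence embeddings.

```python
import numpy as np

def cosine_matrix(X, Y):
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return Xn @ Yn.T

def margin_scores(X, Y, k=4):
    sim = cosine_matrix(X, Y)
    # Average similarity to the k nearest neighbours in the other language.
    knn_x = np.sort(sim, axis=1)[:, -k:].mean(axis=1)  # per source sentence
    knn_y = np.sort(sim, axis=0)[-k:, :].mean(axis=0)  # per target sentence
    return sim / ((knn_x[:, None] + knn_y[None, :]) / 2)

X = np.random.rand(10, 64)  # stand-ins for source sentence embeddings
Y = np.random.rand(12, 64)  # stand-ins for target sentence embeddings
best = margin_scores(X, Y).argmax(axis=1)  # best target per source sentence
```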
We introduce GeBioToolkit, a tool for extracting multilingual parallel corpora at sentence level, with document and gender information, from Wikipedia biographies. Despite the gender inequalities present in Wikipedia, the toolkit has been designed to extract a corpus balanced in gender. While our toolkit is customizable to any number of languages (and different domains), in this work we present a corpus of 2,000 sentences in English, Spanish and Catalan, which has been post-edited by native speakers to become a high-quality dataset for machine translation evaluation. While GeBioCorpus aims at being one of the first non-synthetic gender-balanced test datasets, GeBioToolkit aims at paving the way towards standardized procedures to produce gender-balanced datasets.