Figure 2 - uploaded by Jeremy Barnes
Binary and four-class macro F1 on Spanish (ES), Catalan (CA), and Basque (EU).


Source publication
Conference Paper
Full-text available
Sentiment analysis in low-resource languages suffers from a lack of annotated corpora to estimate high-performing models. Machine translation and bilingual word embeddings provide some relief through cross-lingual sentiment approaches. However, they either require large amounts of parallel data or do not sufficiently capture sentiment information. ...

Contexts in source publication

Context 1
... Figure 2, we report the results of all four methods. Our method substantially outperforms the other projection methods (the baselines ARTETXE and BARISTA) on four of the six experiments. ...

Similar publications

Article
Full-text available
ESLORA is a corpus of Spanish made up of semi-directed interviews and spontaneous conversations recorded in Galicia between 2007 and 2015. The design and construction of the corpus meet three objectives: to register the use of a variety of Spanish that has so far been scarcely documented, to gain additional insight into the methods for the const...
Conference Paper
Full-text available
In this paper we describe the systems developed at Ixa for our participation in the WMT20 Biomedical shared task in three language pairs: en-eu, en-es and es-en. When defining our approach, we have put the focus on making efficient use of corpora recently compiled for training Machine Translation (MT) systems to translate Covid-19-related text, as...
Conference Paper
Full-text available
Natural Language Processing (NLP) is the field of artificial intelligence that gives computers the ability to interpret, perceive and extract appropriate information from human languages. Contemporary NLP is predominantly a data-driven process. It employs machine learning and statistical algorithms to learn language structures from textual corp...
Article
Full-text available
The objective of this work is to set out a corpus-driven methodology to automatically quantify diachronic language distance between chronological periods of several languages. We apply a perplexity-based measure to written text representing different historical periods of three languages: European English, European Portuguese, and European Spanish. For...
Article
Full-text available
In this paper, two parallel longitudinal corpora of interlanguage are presented: the CORpus del ESPañol de los Italianos (CORESPI, 474 texts, 124 648 words) and the CORpus del Italiano de los Españoles (CORITE, 385 texts, 103 147 words). The texts have been produced by 45 pairs of learners of Spanish and Italian as foreign languages (A1-B2). The mann...

Citations

... Furthermore, researchers have delved into multi-modal sentiment evaluation, integrating visual and auditory signals alongside text to enhance accuracy [8]. Cross-lingual and cross-domain sentiment evaluation have also gained prominence, tackling the challenges of analyzing sentiments across diverse languages and fields [9,10]. ...
Article
Full-text available
Sentiment evaluation plays a crucial role in deciphering public perception and consumer responses in today's digital landscape. This investigation offers a thorough assessment of diverse sentiment evaluation techniques, contrasting conventional machine learning methodologies with cutting-edge deep learning frameworks. In particular, the research scrutinizes the efficacy of Bidirectional Encoder Representations from Transformers (BERT)-derived architectures (BERT-Base and Robustly Optimized BERT Pretraining Approach (RoBERTa)), Convolutional Neural Networks (CNN), Long Short-Term Memory Networks (LSTM), Support Vector Machines (SVM), and Naive Bayes classifiers. The study gauges these approaches based on their precision, recall, F1-metric, overall accuracy, and computational efficiency using an extensive sentiment evaluation dataset. The results reveal that BERT-based models, particularly RoBERTa, achieve the highest accuracy (87.44%) and F1-score (0.8746), though they also require the longest training time (approximately 3 hours). CNN and LSTM models strike a balance between performance and efficiency, while traditional methods like SVM and Naive Bayes offer faster training and deployment with moderate accuracy. The insights gained from this study are valuable for both researchers and practitioners, highlighting the trade-offs between model performance, computational demands, and practical deployment considerations in sentiment analysis applications.
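To make the reported trade-off concrete, here is a minimal sketch of the classical baselines named above (SVM and Naive Bayes), evaluated with the same metrics the study reports: precision, recall, F1 and accuracy. The tiny `texts`/`labels` dataset is a placeholder, not the study's data.

```python
# Classical sentiment baselines on TF-IDF features, reported with
# precision/recall/F1/accuracy via scikit-learn's classification_report.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# Placeholder data; substitute any labeled sentiment dataset.
texts = ["great movie", "terrible plot", "loved it",
         "boring and slow", "fantastic acting", "awful script"]
labels = [1, 0, 1, 0, 1, 0]  # 1 = positive, 0 = negative

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, stratify=labels, random_state=0)

vec = TfidfVectorizer()
X_train_v = vec.fit_transform(X_train)
X_test_v = vec.transform(X_test)

for name, clf in [("SVM", LinearSVC()), ("Naive Bayes", MultinomialNB())]:
    clf.fit(X_train_v, y_train)
    preds = clf.predict(X_test_v)
    print(name)
    print(classification_report(y_test, preds, zero_division=0))
```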
... BLSE (Barnes and Klinger, 2018) presents a model for the sentiment analysis task, relying on supervised parallel bilingual data. ...
Preprint
Cross-lingual representation learning transfers knowledge from resource-rich data to resource-scarce ones to improve the semantic understanding abilities of different languages. However, previous works rely on shallow unsupervised data generated by token surface matching, regardless of the global context-aware semantics of the surrounding text tokens. In this paper, we propose an Unsupervised Pseudo Semantic Data Augmentation (UniPSDA) mechanism for cross-lingual natural language understanding to enrich the training data without human interventions. Specifically, to retrieve the tokens with similar meanings for the semantic data augmentation across different languages, we propose a sequential clustering process in 3 stages: within a single language, across multiple languages of a language family, and across languages from multiple language families. Meanwhile, considering the multi-lingual knowledge infusion with context-aware semantics while alleviating computation burden, we directly replace the key constituents of the sentences with the above-learned multi-lingual family knowledge, viewed as pseudo-semantic. The infusion process is further optimized via three de-biasing techniques without introducing any neural parameters. Extensive experiments demonstrate that our model consistently improves the performance on general zero-shot cross-lingual natural language understanding tasks, including sequence classification, information extraction, and question answering.
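The following is a highly schematic sketch of the token-replacement idea behind this kind of pseudo-semantic augmentation: cluster word vectors from several languages in a shared space, then swap selected tokens for a same-cluster word from another language. The toy vocabulary, vectors, and single k-means step are illustrative assumptions; the paper's three-stage sequential clustering and de-biasing techniques are not reproduced here.

```python
# Toy pseudo-semantic swap: replace a token with a same-cluster word
# from a different language, using k-means over a shared embedding space.
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical multilingual embedding table: (word, language) -> vector.
vocab = {("dog", "en"): [0.9, 0.1], ("perro", "es"): [0.88, 0.12],
         ("cat", "en"): [0.1, 0.9], ("gato", "es"): [0.12, 0.88]}
words = list(vocab)
vectors = np.array([vocab[w] for w in words])

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

def pseudo_semantic_swap(token, src_lang="en"):
    """Replace `token` with a same-cluster word from another language."""
    idx = words.index((token, src_lang))
    for j, (w, lang) in enumerate(words):
        if clusters[j] == clusters[idx] and lang != src_lang:
            return w
    return token  # no cross-lingual neighbour found; keep the token

print(pseudo_semantic_swap("dog"))  # -> "perro" in this toy setup
```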
... Also, in [1], the authors experimented with the linear transformation method from [27] on English, Spanish and Chinese. In [6], an approach for training bilingual sentiment word embeddings is presented. The embeddings are jointly optimized to represent (a) semantic information in the source and target languages using a small bilingual dictionary and (b) sentiment information obtained from the source language only. ...
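As a condensed sketch of the joint optimization described in this excerpt, the PyTorch snippet below learns two projection matrices so that (a) bilingual-dictionary translation pairs land close together in the shared space and (b) projected source vectors support a sentiment classifier. All tensors, dimensions, and the trade-off weight `alpha` are illustrative placeholders, not the published implementation.

```python
# Joint semantics + sentiment objective over two projection matrices,
# in the spirit of bilingual sentiment embeddings.
import torch
import torch.nn as nn

d, n_classes = 300, 2
M_src = nn.Linear(d, d, bias=False)  # source-language projection
M_tgt = nn.Linear(d, d, bias=False)  # target-language projection
clf = nn.Linear(d, n_classes)        # sentiment classifier (source side)

opt = torch.optim.Adam(list(M_src.parameters()) +
                       list(M_tgt.parameters()) + list(clf.parameters()))

# Placeholders: rows of src_dict/tgt_dict are embeddings of translation
# pairs from a small bilingual dictionary; src_docs are averaged source
# document embeddings with sentiment labels y.
src_dict, tgt_dict = torch.randn(100, d), torch.randn(100, d)
src_docs, y = torch.randn(32, d), torch.randint(0, n_classes, (32,))

alpha = 0.5  # assumed trade-off between alignment and sentiment losses
for _ in range(10):
    opt.zero_grad()
    align = ((M_src(src_dict) - M_tgt(tgt_dict)) ** 2).sum(dim=1).mean()
    sent = nn.functional.cross_entropy(clf(M_src(src_docs)), y)
    loss = alpha * align + (1 - alpha) * sent
    loss.backward()
    opt.step()
# At test time, target documents are projected with M_tgt and fed to clf.
```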
Preprint
Full-text available
This paper deals with cross-lingual sentiment analysis in Czech, English and French languages. We perform zero-shot cross-lingual classification using five linear transformations combined with LSTM and CNN based classifiers. We compare the performance of the individual transformations, and in addition, we confront the transformation-based approach with existing state-of-the-art BERT-like models. We show that the pre-trained embeddings from the target domain are crucial to improving the cross-lingual classification results, unlike in the monolingual classification, where the effect is not so distinctive.
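One member of the family of linear transformations compared in such work can be sketched as a Mikolov-style least-squares map fit on a bilingual dictionary; the arrays below are placeholders under that assumption.

```python
# Least-squares linear map from target-language embeddings into the
# source space, fit on embedding pairs from a bilingual dictionary.
import numpy as np

d = 300
X_tgt = np.random.randn(5000, d)  # target-side vectors of dictionary pairs
X_src = np.random.randn(5000, d)  # corresponding source-side vectors

# Solve min_W ||X_tgt @ W - X_src||_F^2 in closed form.
W, *_ = np.linalg.lstsq(X_tgt, X_src, rcond=None)

def to_source_space(tgt_doc_vec):
    """Map an averaged target-language document vector into the source
    space, where a classifier trained on source data applies zero-shot."""
    return tgt_doc_vec @ W
```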
... Emoji information was used as a new bridge for text sentiment analysis and was encoded into the generated word embeddings [48]. Similar to the works in [40], [41], Barnes proposed a semi-supervised method to incorporate sentiment information into word embedding representations [49]. A Bilingual Sentiment Embedding (BLSE) model was proposed to learn the projection matrices of the source and target languages, which are jointly optimized to represent both semantic information and sentiment information. ...
Article
Full-text available
Cross-lingual sentiment analysis (CLSA) leverages one or several source languages to help low-resource languages perform sentiment analysis. Therefore, the problem of a lack of annotated corpora in many non-English languages can be alleviated. Along with the development of economic globalization, CLSA has attracted much attention in the field of sentiment analysis, and the last decade has seen a surge of research in this area. Numerous methods, datasets and evaluation metrics have been proposed in the literature, raising the need for a comprehensive and updated survey. This paper fills the gap by reviewing the state-of-the-art CLSA approaches from 2004 to the present. This paper teases out the research context of cross-lingual sentiment analysis and elaborates on the following methods in detail: (1) the early main methods of CLSA, including those based on Machine Translation and its improved variants, parallel corpora or bilingual sentiment lexicons; (2) CLSA based on cross-lingual word embeddings; (3) CLSA based on multilingual BERT and other pre-trained models. We further analyze their main ideas, methodologies, shortcomings, etc., and attempt to reach a conclusion on the coverage of languages, datasets and their performance. Finally, we look into the future development of CLSA and the challenges facing the research area.
... Generally, these CLSA approaches can be divided into two patterns: parallel-corpus-based approaches and unsupervised approaches. Through the cross-lingual supervision provided by a parallel corpus, the former can bridge the semantic gap between the source and target languages via Auto-Encoders (Zhou et al., 2014), Neural Translation Models (Eriguchi et al., 2018) and Bilingual Word Embeddings (Barnes et al., 2018). However, the availability of parallel corpora limits the usability of these works. ...
... CLSA aims at transferring sentiment models trained on a source language to a target language (Zhou et al., 2014; Eriguchi et al., 2018; Barnes et al., 2018). ... (Ziser and Reichart, 2018) and parallel corpora (Xu and Yang, 2017) are also used to bridge the cross-lingual gap. Recent studies propose the unsupervised CLSA (UCLSA) setting to avoid relying on expensive parallel corpora. ...
... parallel or aligned corpora as in early work on cross-lingual transfer [6]. In particular, we adopt workflows for using LLOD in cross-lingual transfer learning based on task-informed bilingual word embeddings (adapted from bilingual sentiment embeddings [7]) presented in [8] and apply them to a different target language (Spanish vs. French), a much more varied task (HRQoL aspect detection vs. sentiment analysis) and a different text genre (online health community posts vs. medical experts' interview transcripts). ...
... Our approach to language- and task-informed transfer learning (LTTL) relies on the framework described in our previous work [8]. Using this architecture based on bilingual word embeddings [7], task-informed bilingual embedding spaces can be learned for any task that can be framed as text classification. Following this idea, we apply LTTL to HRQoL concept detection in this paper. ...
... The experiments reported in this section address the problem of HRQoL concept detection from French online health communities. We simulate a real-world setting in which no labeled training examples are available in the target language. ...
Chapter
Full-text available
We describe the use of Linguistic Linked Open Data (LLOD) to support a cross-lingual transfer framework for concept detection in online health communities. Our goal is to develop multilingual text analytics as an enabler for analyzing health-related quality of life (HRQoL) from self-reported patient narratives. The framework capitalizes on supervised cross-lingual projection methods, so that labeled training data for a source language are sufficient and are not needed for target languages. Cross-lingual supervision is provided by LLOD lexical resources to learn bilingual word embeddings that are simultaneously tuned to represent an inventory of HRQoL concepts based on the World Health Organization’s quality of life surveys (WHOQOL). We demonstrate that lexicon induction from LLOD resources is a powerful method that yields rich and informative lexical resources for the cross-lingual concept detection task which can outperform existing domain-specific lexica. Furthermore, in a comparative evaluation we find that our models based on bilingual word embeddings exhibit a high degree of complementarity with an approach that integrates machine translation and rule-based extraction algorithms. In a combined configuration, our models rival the performance of state-of-the-art cross-lingual transformers, despite being of considerably lower model complexity.
... In those cases, we leave the words as they are, resulting in an imperfect, code-switched translation of the monolingual sentence. Previous work on text classification has shown that fine-tuning multilingual models such as multilingual BERT on code-switched data can improve performance on few-shot and zero-shot classification tasks (Akyürek et al., 2020; Qin et al., 2020) ranging from frame classification (Liu et al., 2019a) to natural language inference, sentiment classification (Barnes et al., 2018), document classification (Schwenk and Li, 2018), dialogue state tracking (Mrkšić et al., 2017), and spoken language understanding (Schuster et al., 2019). We include code-switched sentences that have at least 20% of their words translated from the original sentences to augment our training data. ...
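A minimal sketch of this dictionary-based code-switching heuristic, assuming a toy bilingual dictionary and the 20% threshold mentioned above (the function and data are illustrative, not the cited implementation):

```python
# Translate words via a bilingual dictionary; keep the sentence only if
# at least 20% of its words were actually translated.
def code_switch(sentence, bilingual_dict, min_ratio=0.2):
    words = sentence.split()
    out, translated = [], 0
    for w in words:
        if w.lower() in bilingual_dict:
            out.append(bilingual_dict[w.lower()])
            translated += 1
        else:
            out.append(w)  # no dictionary entry: keep the original word
    if translated / max(len(words), 1) >= min_ratio:
        return " ".join(out)
    return None  # too few translations; skip this sentence

toy_dict = {"house": "casa", "small": "pequeña"}
print(code_switch("the small house is quiet", toy_dict))
# -> "the pequeña casa is quiet"
```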
Preprint
Full-text available
We conduct an empirical study of unsupervised neural machine translation (NMT) for truly low-resource languages, exploring the case when both parallel training data and compute resources are lacking, reflecting the reality of most of the world's languages and the researchers working on them. We propose a simple and scalable method to improve unsupervised NMT, showing how adding comparable data mined using a bilingual dictionary, along with modest additional compute resources to train the model, can significantly improve its performance. We also demonstrate how using the dictionary to code-switch monolingual data to create more comparable data can further improve performance. With this weak supervision, our best method achieves BLEU scores that improve over supervised results for English→Gujarati (+18.88), English→Kazakh (+5.84), and English→Somali (+1.16), showing the promise of weakly-supervised NMT for many low-resource languages with modest compute resources. To the best of our knowledge, our work is the first to quantitatively showcase the impact of different modest compute resources in low-resource NMT.
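The dictionary-based mining step can be sketched as ranking target-side candidates by their lexical overlap with a dictionary-translated source sentence; the scoring and threshold below are illustrative assumptions, not the paper's exact mining procedure.

```python
# Score a target sentence by how many dictionary translations of the
# source sentence's words it contains, then keep high-overlap candidates.
def overlap_score(src_sentence, tgt_sentence, bilingual_dict):
    src_translated = {bilingual_dict[w] for w in src_sentence.lower().split()
                      if w in bilingual_dict}
    tgt_words = set(tgt_sentence.lower().split())
    if not src_translated:
        return 0.0
    return len(src_translated & tgt_words) / len(src_translated)

def mine_comparable(src_sentence, tgt_corpus, bilingual_dict, threshold=0.5):
    """Return target sentences comparable to the source sentence."""
    return [t for t in tgt_corpus
            if overlap_score(src_sentence, t, bilingual_dict) >= threshold]
```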
... We therefore aim to harness the knowledge previously collected in resource-rich languages [52]. Word-embedding vectors contain information that allows learning a translation matrix, which enables bilingual translation by matching the relative positions of word vectors in two monolingual vector spaces [53], [54]. Translation matrices rely on the assumption that the same words in different languages are used in similar contexts and are distributed similarly. ...
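A common closed-form instance of such a translation matrix is the orthogonal Procrustes solution, sketched below with placeholder matrices; the cited works' exact (least-squares) formulation may differ.

```python
# Learn an orthogonal translation matrix between two monolingual embedding
# spaces from a seed dictionary, then translate by nearest neighbour.
import numpy as np

X = np.random.randn(5000, 300)  # source-side vectors of a seed dictionary
Y = np.random.randn(5000, 300)  # aligned target-side vectors

# W = argmin over orthogonal W of ||X W - Y||_F, solved via SVD.
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

def translate(src_vec, tgt_vocab_vectors):
    """Map a source vector into the target space and return the index of
    the closest target-vocabulary vector (cosine similarity)."""
    q = src_vec @ W
    sims = tgt_vocab_vectors @ q / (
        np.linalg.norm(tgt_vocab_vectors, axis=1) * np.linalg.norm(q) + 1e-9)
    return int(np.argmax(sims))
```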
... As it is prohibitively expensive to obtain training data for all languages of interest, cross-lingual sentiment analysis (CLSA) (Barnes et al., 2018; Zhou et al., 2016b; Xu and Wan, 2017; Wan, 2009; Demirtas and Pechenizkiy, 2013; Xiao and Guo, 2012; Zhou et al., 2016a) offers the possibility of learning sentiment classification models for a target language using only annotated data from a different source language where large annotated data is available. These models often rely on bilingual lexicons, pre-trained cross-lingual word embeddings, or Machine Translation to bridge the gap between the source and target languages. ...
... MT, often trained from parallel corpora, may not be available for low-resource languages. Other CLSA methods (Barnes et al., 2018; Zhou et al., 2016b; Xu and Wan, 2017) ... [Figure 1 caption: CLAN architecture, illustrated with source language l_s = English (solid line) and target language l_t = French (dotted line).] ...
... uses bilingual lexicons or cross-lingual word embeddings (CLWE) to project words with similar meanings from different languages into nearby spaces, enabling the training of cross-lingual sentiment classifiers. CLWE often depends on a bilingual lexicon (Barnes et al., 2018) or parallel or comparable corpora (Mogadala and Rettinger, 2016; Vulić and Moens, 2016). Recently, CLWE methods (Lample and ) that rely on no parallel resources have been proposed, but they require very large monolingual corpora to train. ...