Rodrigo Agerri

University of the Basque Country | UPV/EHU · Computer Languages and Systems

PhD in Computer Science

105
Publications
18,232
Reads
908
Citations
Introduction
My research currently focuses on Computational Semantics and Information Extraction (Named Entity Recognition, Opinion Mining, Fake News and Stance Detection, etc.), with an emphasis on multilingual and cross-lingual approaches.
Additional affiliations
December 2005 - November 2008
University of Birmingham
Position
  • Researcher
Description
  • Working in the NLP group on computational linguistics, focusing on metaphor processing and semantic inference.

Publications (105)
Conference Paper
Full-text available
IXA pipes is a modular set of Natural Language Processing tools (or pipes) which provide easy access to NLP technology. It offers robust and efficient linguistic annotation to both researchers and non-NLP experts with the aim of lowering the barriers of using NLP technology either for research purposes or for small industrial developers and SMEs. I...
Article
Full-text available
Requirements in computational power have grown dramatically in recent years. This is also the case in many language processing tasks, due to the overwhelming and ever increasing amount of textual information that must be processed in a reasonable time frame. This scenario has led to a paradigm shift in the computing architectures and large-scale da...
Article
Full-text available
We present a multilingual Named Entity Recognition approach based on a robust and general set of features across languages and datasets. Our system combines shallow local information with clustering semi-supervised features induced on large amounts of unlabeled text. Understanding via empirical experimentation how to effectively combine various typ...
Article
Full-text available
In this article, we describe a system that reads news articles in four different languages and detects what happened, who is involved, where and when. This event-centric information is represented as episodic situational knowledge on individuals in an interoperable RDF format that allows for reasoning on the implications of the events. Our system...
Article
Full-text available
In this research note we present a language independent system to model Opinion Target Extraction (OTE) as a sequence labelling task. The system consists of a combination of clustering features implemented on top of a simple set of shallow local features. Experiments on the well known Aspect Based Sentiment Analysis (ABSA) benchmarks show that our...
Preprint
Full-text available
We introduce a professionally translated extension of the TruthfulQA benchmark designed to evaluate truthfulness in Basque, Catalan, Galician, and Spanish. Truthfulness evaluations of large language models (LLMs) have primarily been conducted in English. However, the ability of LLMs to maintain truthfulness across languages remains under-explored....
Preprint
Full-text available
The development of Large Language Models (LLMs) has brought impressive performances on mitigation strategies against misinformation, such as counterargument generation. However, LLMs are still seriously hindered by outdated knowledge and by their tendency to generate hallucinated content. In order to circumvent these issues, we propose a new task,...
Chapter
The need for transparent AI systems in sensitive domains like medicine has become key. In this paper we present ANTIDOTE, a software suite proposing different tools for argumentation-driven explainable Artificial Intelligence for digital medicine. Our system offers the following functionalities: multilingual argumentative analysis for the medical d...
Preprint
Full-text available
Explaining Artificial Intelligence (AI) decisions is a major challenge nowadays in AI, in particular when applied to sensitive scenarios like medicine and law. However, the need to explain the rationale behind decisions is a main issue also for human-based deliberation as it is important to justify why a certain decision has been taken. Re...
Preprint
Full-text available
Recent research on sequence labelling has been exploring different strategies to mitigate the lack of manually annotated data for the large majority of the world languages. Among others, the most successful approaches have been based on (i) the cross-lingual transfer capabilities of multilingual pre-trained language models (model-transfer), (ii) da...
Article
Full-text available
In this paper, we introduce ProxMetrics, a novel toolkit designed to evaluate similarity among social media entities through proxemic dimensions. Proxemics is the science that studies the organization of space and the effects of distances on behavior and interactions. It encompasses 5 core dimensions: Distance, Identity, Location, Movement, and Ori...
Preprint
Full-text available
The proliferation of misinformation and harmful narratives in online discourse has underscored the critical need for effective Counter Narrative (CN) generation techniques. However, existing automatic evaluation methods often lack interpretability and fail to capture the nuanced relationship between generated CNs and human perception. Aiming to ach...
Preprint
Full-text available
Political leaning can be defined as the inclination of an individual towards certain political orientations that align with their personal beliefs. Political leaning inference has traditionally been framed as a binary classification problem, namely, to distinguish between left vs. right or conservative vs liberal. Furthermore, although some recent...
Preprint
Full-text available
Social media users express their political preferences via interaction with other users, by spontaneous declarations or by participation in communities within the network. This makes a social network such as Twitter a valuable data source to study computational science approaches to political leaning inference. In this work we focus on three diver...
Article
Full-text available
Lemmatization is a natural language processing (NLP) task that consists of producing, from a given inflected word, its canonical form or lemma. Lemmatization is one of the basic tasks that facilitate downstream NLP applications, and is of particular importance for highly inflected languages. Given that the process to obtain a lemma from an inflected...
Article
Full-text available
TextBI is an interactive dashboard for visualizing multidimensional indicators over large volumes of multilingual social media data. It targets four main dimensions of analysis: spatial, temporal, thematic and personal, while also integrating contextual data such as sentiment and engageme...
Preprint
Full-text available
Providing high quality explanations for AI predictions based on machine learning is a challenging and complex task. To work well it requires, among other factors: selecting a proper level of generality/specificity of the explanation; considering assumptions about the familiarity of the explanation beneficiary with the AI task under consideration; r...
Chapter
Full-text available
This chapter landscapes the field of Language Technology (LT) and language-centric AI by assembling a comprehensive state-of-the-art of basic and applied research in the area. It sketches all recent advances in AI, including the most recent deep learning neural technologies. The chapter brings to light not only where language-centric AI as a whole...
Article
Full-text available
Detecting and normalizing temporal expressions is an essential step for many NLP tasks. While a variety of methods have been proposed for detection, the best normalization approaches rely on hand-crafted rules. Furthermore, most of them have been designed only for English. In this paper we present a modular multilingual temporal processing system combi...
Preprint
Full-text available
Detecting and normalizing temporal expressions is an essential step for many NLP tasks. While a variety of methods have been proposed for detection, the best normalization approaches rely on hand-crafted rules. Furthermore, most of them have been designed only for English. In this paper we present a modular multilingual temporal processing system combi...
Chapter
In this paper, we introduce a novel way to model and analyze social media interactions by leveraging the proxemics theory. Proxemics is the science that studies the effect of space and distance on interactions and behaviors. It is generally applied to the physical space but we hypothesize that adapting it to social media could provide a generic way...
Article
Full-text available
In parliaments, the analysis of speeches given in the context of deliberative democracies is important, since these speeches are closely tied to political action and to the explanation of the reasons for developing legislative initiatives. Moreover, in recent years, automated analysis has expanded the options for analysing discourse, and information vol...
Preprint
Full-text available
Lemmatization is a Natural Language Processing (NLP) task which consists of producing, from a given inflected word, its canonical form or lemma. Lemmatization is one of the basic tasks that facilitate downstream NLP applications, and is of particular importance for highly inflected languages. Given that the process to obtain a lemma from an inflected...
Preprint
Full-text available
Nowadays the medical domain is receiving more and more attention in applications involving Artificial Intelligence. Clinicians have to deal with an enormous amount of unstructured textual data to make a conclusion about patients' health in their everyday life. Argument mining helps to provide a structure to such data by detecting argumentative comp...
Preprint
Full-text available
In the absence of readily available labeled data for a given task and language, annotation projection has been proposed as one of the possible strategies to automatically generate annotated data which may then be used to train supervised systems. Annotation projection has often been formulated as the task of projecting, on parallel corpora, some la...
Preprint
Full-text available
Given the impact of language models on the field of Natural Language Processing, a number of Spanish encoder-only masked language models (aka BERTs) have been trained and released. These models were developed either within large projects using very large private corpora or by means of smaller scale academic efforts leveraging freely available data....
Chapter
In this article, we propose a generic method to build thematic datasets from social media. Many research works gather their data from social media, but the extraction processes used are mostly ad hoc and do not follow a formal or standardized method. We aim at extending the processes currently used by designing an iterative, generic and domain-inde...
Preprint
Full-text available
Zero-resource cross-lingual transfer approaches aim to apply supervised models from a source language to unlabelled target languages. In this paper we perform an in-depth study of the two main techniques employed so far for cross-lingual zero-resource sequence labelling, based either on data or model transfer. Although previous research has propose...
Preprint
Full-text available
The lack of wide coverage datasets annotated with everyday metaphorical expressions for languages other than English is striking. This means that most research on supervised metaphor detection has been published only for that language. In order to address this issue, this work presents the first corpus annotated with naturally occurring metaphors i...
Preprint
Full-text available
The large majority of the research performed on stance detection has been focused on developing more or less sophisticated text classification systems, even when many benchmarks are based on social network data such as Twitter. This paper aims to take on the stance detection task by placing the emphasis not so much on the text itself but on the int...
Preprint
Full-text available
In Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022), pages 3382–3390. Marseille, 20-25 June 2022. Parliamentary transcripts provide a valuable resource to understand the reality and know about the most important facts that occur over time in our societies. Furthermore, the political debates captured in these tra...
Book
Full-text available
During the last few months, the COVID-19 pandemic and its associated issues have monopolized the debate on the media and social networks. Thus, the news actors in the hybrid media system have incorporated the fight against the infodemic and disinformation as another priority in a society that has demanded constant and reliable information in multip...
Preprint
Full-text available
The vast majority of non-English corpora are derived from automatically filtered versions of CommonCrawl. While prior work has identified major issues on the quality of these datasets (Kreutzer et al., 2021), it is not clear how this impacts downstream performance. Taking Basque as a case study, we explore tailored crawling (manually identifying an...
Preprint
Full-text available
The growing interest in employing counter narratives for hatred intervention brings with it a focus on dataset creation and automation strategies. In this scenario, learning to recognize counter narrative types from natural text is expected to be useful for applications such as hate speech countering, where operators from non-governmental organizat...
Preprint
Full-text available
In this paper we take into account both social and linguistic aspects to perform demographic analysis by processing a large number of tweets in the Basque language. The study of demographic characteristics and social relationships is approached by applying machine learning and modern deep-learning Natural Language Processing (NLP) techniques, combinin...
Article
In this paper, we take into account both social and linguistic aspects to perform demographic analysis by processing a large number of tweets in the Basque language. The study of demographic characteristics and social relationships is approached by applying machine learning and modern deep-learning Natural Language Processing (NLP) techniques, combini...
Article
Full-text available
Popular social media networks provide the perfect environment to study the opinions and attitudes expressed by users. While interactions in social media such as Twitter occur in many natural languages, research on stance detection (the position or attitude expressed with respect to a specific topic) within the Natural Language Processing field has...
Article
Full-text available
Users voluntarily generate large amounts of textual content by expressing their opinions, in social media and specialized portals, on every possible issue, including transport and sustainability. In this work we have leveraged such User Generated Content to obtain a high accuracy sentiment analysis model which automatically analyses the negative an...
Preprint
Full-text available
Popular social media networks provide the perfect environment to study the opinions and attitudes expressed by users. While interactions in social media such as Twitter occur in many natural languages, research on stance detection (the position or attitude expressed with respect to a specific topic) within the Natural Language Processing field has...
Article
Full-text available
The main goal of this work is to identify the salient named entities mentioned in Basque-language media content, performing this identification in real time. To that end, a system has been developed to automatically extract and label named entities from news published in Basque, using state-of-the-art Deep Learning models. Named...
Conference Paper
Full-text available
In this paper we describe our participation in the SardiStance shared task held at EVALITA 2020. We developed a set of classifiers that combined text features, such as the best performing systems based on large pre-trained language models, together with user profile features, such as psychological traits and social media user interactions. The clas...
Conference Paper
Full-text available
In this paper we present a language independent system to model Opinion Target Extraction (OTE) as a sequence labelling task. The system consists of a combination of clustering features implemented on top of a simple set of shallow local features. Experiments on the well known Aspect Based Sentiment Analysis (ABSA) benchmarks show that our approach...
Preprint
Full-text available
Word embeddings and pre-trained language models make it possible to build rich representations of text and have enabled improvements across most NLP tasks. Unfortunately they are very expensive to train, and many small companies and research groups tend to use models that have been pre-trained and made available by third parties, rather than building their ow...
Preprint
Full-text available
Stance detection aims to determine the attitude of a given text with respect to a specific topic or claim. While stance detection has been fairly well researched in recent years, most of the work has been focused on English. This is mainly due to the relative lack of annotated data in other languages. The TW-10 Referendum Dataset released at IberEva...
Preprint
Full-text available
This paper presents a new technique for creating monolingual and cross-lingual meta-embeddings. Our method integrates multiple word embeddings created from complementary techniques, textual sources, knowledge bases and languages. Existing word vectors are projected to a common semantic space using linear transformations and averaging. With our meth...
Chapter
Full-text available
In this paper we describe our participation in the SardiStance shared task held at EVALITA 2020. We developed a set of classifiers that combined text features, such as the best performing systems based on large pre-trained language models, together with user profile features, such as psychological traits and social media user interactions. The clas...
Article
Full-text available
Social networks like Twitter are increasingly important in the creation of new ways of communication. They have also become useful tools for social and linguistic research due to the massive amounts of public textual data available. This is particularly important for less resourced languages, as it makes it possible to apply current natural language processin...
Preprint
Full-text available
In this research note we present a language independent system to model Opinion Target Extraction (OTE) as a sequence labelling task. The system consists of a combination of clustering features implemented on top of a simple set of shallow local features. Experiments on the well known Aspect Based Sentiment Analysis (ABSA) benchmarks show that our...
Conference Paper
Preprint
Full-text available
Talaia is a platform for monitoring social media and digital press. A configurable crawler gathers content with respect to user defined domains or topics. Crawled data is processed by means of IXA-pipes NLP chain and EliXa sentiment analysis system. A Django powered interface provides data visualization to provide the user analysis of the data. Thi...
Conference Paper
Full-text available
We present enetCollect, a large European COST action network set up with the aim of promoting a research trend combining the well-established domain of Language Learning with recent and successful crowdsourcing approaches. More specifically, the challenge of enetCollect is to foster the language skills of all citizens regardless of their background...
Conference Paper
Full-text available
We present a multilingual Named Entity Recognition approach based on a robust and general set of features across languages and datasets. Our system combines shallow local information with clustering semi-supervised features induced on large amounts of unlabeled text. Understanding via empirical experimentation how to effectively combine various type...
Article
Full-text available
In this paper we present an approach to extract ordered timelines of events, their participants, locations and times from a set of Multilingual and Cross-lingual data sources. Based on the assumption that event-related information can be recovered from different documents written in different languages, we extend the Cross-document Event Ordering t...
Article
Full-text available
This paper presents a supervised Aspect Based Sentiment Analysis (ABSA) system. Our aim is to develop a modular platform which makes it easy to conduct experiments by replacing the modules or adding new features. We obtain the best result in the Opinion Target Extraction (OTE) task (slot 2) using an off-the-shelf sequence labeler. The target polari...
Article
Full-text available
This paper presents a simple, robust and (almost) unsupervised dictionary-based method, qwn-ppv (Q-WordNet as Personalized PageRanking Vector) to automatically generate polarity lexicons. We show that qwn-ppv outperforms other automatically generated lexicons for the four extrinsic evaluations presented here. It also shows very competitive and robu...
Preprint
Full-text available
In this paper we present an approach to extract ordered timelines of events, their participants, locations and times from a set of multilingual and cross-lingual data sources. Based on the assumption that event-related information can be recovered from different documents written in different languages, we extend the Cross-document Event Ordering t...
Conference Paper
Full-text available
We describe a novel modular system for cross-lingual event extraction for English, Spanish, Dutch and Italian texts. The system consists of a ready-to-use modular set of advanced multilingual Natural Language Processing (NLP) tools. The pipeline integrates modules for basic NLP processing as well as more advanced tasks such as cross-lingual Named...
Conference Paper
Full-text available
This paper presents a supervised Aspect Based Sentiment Analysis (ABSA) system. Our aim is to develop a modular platform which makes it easy to conduct experiments by replacing the modules or adding new features. We obtain the best result in the Opinion Target Extraction (OTE) task (slot 2) using an off-the-shelf sequence labeler. The target polari...
Article
Full-text available
The European project NewsReader develops advanced technology to process daily news streams in 4 languages, extracting what happened, when and where it happened and who was involved. NewsReader reads massive amounts of news coming from thousands of sources. It compares the results across sources to complement information and determine where the diff...