About
216 Publications
38,086 Reads
4,384 Citations
Introduction
Information Retrieval, Natural Language Processing, Evaluation Metrics and Methodologies
Additional affiliations
January 1995 - present
Publications (216)
Recent studies comparing AI-generated and human-authored literary texts have produced conflicting results: some suggest AI already surpasses human quality, while others argue it still falls short. We start from the hypothesis that such divergences can be largely explained by genuine differences in how readers interpret and value literature, rather...
In LLM evaluations, reasoning is often distinguished from recall/memorization by performing numerical variations to math-oriented questions. Here we introduce a general variation method for multiple-choice questions that completely dissociates the correct answer from previously seen tokens or concepts, requiring LLMs to understand and reason (rathe...
In this article we present UNED-ACCESS 2024, a bilingual dataset that consists of 1003 multiple-choice questions of university entrance level exams in Spanish and English. Questions are originally formulated in Spanish and manually translated into English, and have never been publicly released, ensuring minimal contamination when evaluating Lar...
In this article we present UNED-ACCESS 2024, a bilingual dataset that consists of 1003 multiple-choice questions of university entrance level exams in Spanish and English. Questions are originally formulated in Spanish and translated manually into English, and have never been publicly released. A selection of current open-source and proprietary...
The paper describes the EXIST 2024 lab on Sexism identification in social networks, which is expected to take place at the CLEF 2024 conference and represents the fourth edition of the EXIST challenge. The lab comprises five tasks in two languages, English and Spanish, with the initial three tasks building upon those from EXIST 2023 (sexism identifi...
In recent years, the rapid increase in the dissemination of offensive and discriminatory material aimed at women through social media platforms has emerged as a significant concern. This trend has had adverse effects on women’s well-being and their ability to freely express themselves. The EXIST campaign has been promoting research in online sexism...
The paper describes the lab on Sexism identification in social networks (EXIST 2023) that will be hosted as a lab at the CLEF 2023 conference. The lab consists of three tasks, two of which are continuation of EXIST 2022 (sexism detection and sexism categorization) and a third and novel one on source intention identification. For this edition new te...
The paper describes the organization, goals, and results of the sEXism Identification in Social neTworks (EXIST) 2022 challenge, a shared task proposed for the second year at IberLEF. EXIST 2022 consists of two challenges: sexism identification and sexism categorization of tweets and gabs, both in Spanish and English. We have received a total of 45...
The design and analysis of experimental research in Data Mining (DM) is anchored in a correct choice of the type of task addressed (clustering, classification, regression, etc.). However, although DM is a relatively mature discipline, there is no consensus yet on the taxonomy of DM tasks, their formal characteristics, and their...
Online reputation management (ORM) comprises the collection of techniques that help monitor and improve the public image of an entity (companies, products, institutions) on the Internet. ORM experts try to minimize the negative impact of the information about an entity while maximizing the positive material for being more trustworthy to th...
Although computing similarity is one of the fundamental challenges of Information Access tasks, the notion of similarity in this context is not yet completely understood from a formal, axiomatic perspective. In this paper we show how axiomatic explanations of similarity from other fields (Tversky’s axioms from the point of view of cognitive science...
In Ordinal Classification tasks, items have to be assigned to classes that have a relative ordering, such as positive, neutral, negative in sentiment analysis. Remarkably, the most popular evaluation metrics for ordinal classification tasks either ignore relevant information (for instance, precision/recall on each of the classes ignores their relat...
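The metric gap described in this abstract can be illustrated with a toy computation. This is a sketch contrasting plain accuracy, which is blind to class order, with mean absolute error over ordinal ranks, which is not; it uses standard definitions and made-up data, not code from the paper:

```python
# Toy sentiment labels mapped to ordinal ranks.
RANKS = {"negative": 0, "neutral": 1, "positive": 2}

def accuracy(gold, pred):
    # Order-blind: confusing "positive" with "negative" costs the same
    # as confusing it with the adjacent class "neutral".
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def ordinal_mae(gold, pred):
    # Order-aware: errors between distant classes weigh more.
    return sum(abs(RANKS[g] - RANKS[p]) for g, p in zip(gold, pred)) / len(gold)

gold  = ["positive", "neutral", "negative", "positive"]
sys_a = ["neutral",  "neutral", "negative", "neutral"]   # near misses
sys_b = ["negative", "neutral", "negative", "negative"]  # far misses

# Both systems get the same accuracy (0.5)...
assert accuracy(gold, sys_a) == accuracy(gold, sys_b) == 0.5
# ...but only the order-aware metric separates them.
assert ordinal_mae(gold, sys_a) < ordinal_mae(gold, sys_b)
```

The two asserts show exactly the failure mode the abstract points at: an order-blind metric ranks the two systems as ties even though one only makes adjacent-class mistakes.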
Producing online reputation summaries for an entity (company, brand, etc.) is a focused summarization task with a distinctive feature: issues that may affect the reputation of the entity take priority in the summary. In this paper we (i) present a new test collection of manually created (abstractive and extractive) reputation reports which summariz...
Given the task of finding influencers of a given domain (e.g. banking) in a social network, in this paper we investigate (i) the importance of characterizing followers for the automatic detection of influencers; (ii) the most effective way to combine signals obtained from followers and from the main profiles for the automatic detection of influence...
The emergence of social media and the huge amount of opinions that are posted everyday have influenced online reputation management. Reputation experts need to filter and control what is posted online and, more importantly, determine if an online post is going to have positive or negative implications towards the entity of interest. This task is ch...
Over a period of 3 years, RepLab was a CLEF initiative where computer scientists and online reputation experts worked together to identify and formalize the computational challenges in the area of online reputation monitoring. Two main results emerged from RepLab: a community of researchers engaged in the problem, and an extensive Twitter test coll...
Although document filtering is simple to define, there is a wide range of different evaluation measures that have been proposed in the literature, all of which have been subject to criticism. Our goal is to compare metrics from a formal point of view, to understand whether, why, and when each metric is appropriate, in order to achieve a bet...
Given the task of finding influencers (opinion makers) for a given domain in a social network, we investigate (a) what is the relative importance of domain and authority signals, (b) what is the most effective way of combining signals (voting, classification, learning to rank, etc.) and how best to model the vocabulary signal, and (c) how large is...
We describe the state-of-the-art in performance modeling and prediction for Information Retrieval (IR), Natural Language Processing (NLP) and Recommender Systems (RecSys) along with its shortcomings and strengths. We present a framework for further research, identifying five major problem areas: understanding measures, performance analysis, making...
The news industry has undergone a revolution in the past decade, with substantial changes continuing to this day. News consumption habits are changing due to the increase in the volume of news and the variety of sources. Readers need new mechanisms to cope with this vast volume of information in order to not only find a signal in the noise, but als...
This paper reports the findings of the Dagstuhl Perspectives Workshop 17442 on performance modeling and prediction in the domains of Information Retrieval, Natural Language Processing and Recommender Systems. We present a framework for further research, which identifies five major problem areas: understanding measures, performance analysis, making...
This is a report on the eighth edition of the Conference and Labs of the Evaluation Forum (CLEF 2017), held in early September 2017, in Dublin, Ireland. CLEF was a four day event combining a Conference and an Evaluation Forum. The Conference featured keynotes by Leif Azzopardi and Vincent Wade, and presentation of 32 peer reviewed research papers c...
The EvALL online evaluation service aims to provide a unified evaluation framework for Information Access systems that makes results completely comparable and publicly available for the whole research community. For researchers working on a given test collection, the framework makes it possible to: (i) evaluate results in a way compliant with measurement theo...
One of the core tasks of Online Reputation Monitoring is to determine whether a text mentioning the entity of interest has positive or negative implications for its reputation. A challenging aspect of the task is that many texts are polar facts, i.e. they do not convey sentiment but they do have reputational implications (e.g. A Samsung smartphone...
We present an in-depth formal and empirical comparison of unsupervised signal combination approaches in the context of tasks based on textual similarity. Our formal study introduces the concept of Similarity Information Quantity, and proves that the most salient combination methods are all estimations of Similarity Information Quantity under differ...
This book constitutes the refereed proceedings of the 8th International Conference of the CLEF Initiative, CLEF 2017, held in Dublin, Ireland, in September 2017. The 7 full papers and 9 short papers presented together with 6 best of the labs papers were carefully reviewed and selected from 38 submissions. In addition, this volume contains the resu...
In this paper, the EvALL framework for the automatic evaluation of information access tasks is presented. With just one click, providing the system outputs of the algorithms, EvALL allows researchers to automatically generate a LaTeX report including the results of their algorithms, statistical significance tests, measure descriptions, and refe...
Monitoring Online Reputation has already become a key part of Public Relations for organizations and individuals; and current search technologies do not suffice to help reputation experts cope with the vast stream of online content relevant to their clients. One of the reasons is that the amount of relevant content for medium and large companies...
Producing online reputation reports for an entity (company, brand, etc.) is a focused summarization task with a distinctive feature: issues that may affect the reputation of the entity take priority in the summary. In this paper we (i) propose a novel methodology to evaluate summaries in the context of online reputation which profits from an analog...
In this tutorial we present a formal account of evaluation metrics for three of the most salient information related tasks: Retrieval, Clustering, and Filtering. We focus on the most popular metrics and, by exploiting measurement theory, we show some constraints for suitable metrics in each of the three tasks. We also systematically compare metrics...
This paper describes the organisation and results of RepLab 2014, the third competitive evaluation campaign for Online Reputation Management systems. This year the focus lay on two new tasks: reputation dimensions classification and author profiling, which complement the aspects of reputation analysis studied in the previous campaigns. The partici...
Reputation management experts have to constantly monitor--among other sources--Twitter, and decide, at any given time, what is being said about the entity of interest (a company, organization, personality...). Solving this reputation monitoring problem automatically as a topic detection task is both essential--manual processing of data is either costly or...
Recently, significant progress has been made in research on what we call semantic matching (SM), in web search, question answering, online advertisement, cross-language information retrieval, and other tasks. Advanced technologies based on machine learning have been developed. Let us take Web search as an example of the problem that also pervades the...
In this tutorial we will present, review, and compare the most popular evaluation metrics for some of the most salient information related tasks, covering: (i) Information Retrieval, (ii) Clustering, and (iii) Filtering. The tutorial will make a special emphasis on the specification of constraints for suitable metrics in each of the three tasks, an...
We present a semi-automatic tool that assists experts in their daily work of monitoring the reputation of entities—companies, organizations or public figures—on Twitter. The tool automatically annotates tweets for relevance (Is the tweet about the entity?), reputational polarity (Does the tweet convey positive or negative implications for the reput...
Many Artificial Intelligence tasks cannot be evaluated with a single quality criterion and some sort of weighted combination is needed to provide system rankings. A problem of weighted combination measures is that slight changes in the relative weights may produce substantial changes in the system rankings. This paper introduces the Unanimous Impro...
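The unanimity idea behind this line of work can be sketched in a few lines. The following is an illustrative reconstruction under an assumed definition — a test case counts for a system only when it is at least as good on every metric and strictly better on at least one — and the paper's exact normalization may differ:

```python
def unanimous_improvement_ratio(scores_a, scores_b):
    # Sketch of the unanimity idea: a case counts for A only when A is
    # at least as good as B on *every* metric and not identical to it.
    # Mixed cases (A better on one metric, worse on another) count for
    # neither, so no relative metric weights are needed.
    n_a = n_b = 0
    for a, b in zip(scores_a, scores_b):
        if a != b and all(x >= y for x, y in zip(a, b)):
            n_a += 1
        elif a != b and all(y >= x for x, y in zip(a, b)):
            n_b += 1
    return (n_a - n_b) / len(scores_a)

# Hypothetical per-test-case (metric1, metric2) pairs for two systems.
sys_a = [(0.9, 0.8), (0.7, 0.7), (0.6, 0.9)]
sys_b = [(0.8, 0.7), (0.7, 0.7), (0.7, 0.8)]
# A unanimously wins case 1; case 2 ties; case 3 is mixed.
assert unanimous_improvement_ratio(sys_a, sys_b) == 1 / 3
```

Because mixed cases are discarded rather than weighted, small changes in metric importance cannot flip the comparison, which is the robustness property the abstract describes.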
Microblogs play an important role for Online Reputation Management. Companies and organizations in general have an increasing interest in obtaining up-to-the-minute information about the emerging topics that concern their reputation. In this paper, we present a new technique to cluster a collection of tweets emitted within a short time spa...
This paper summarizes the goals, organization, and results of the second RepLab competitive evaluation campaign for Online Reputation Management Systems (RepLab 2013). RepLab focused on the process of monitoring the reputation of companies and individuals, and asked participant systems to annotate different types of information on tweets containin...
A major problem in monitoring the online reputation of companies, brands, and other entities is that entity names are often ambiguous (apple may refer to the company, the fruit, the singer, etc.). The problem is particularly hard in microblogging services such as Twitter, where texts are very short and there is little context to disambiguate. In th...
A number of key Information Access tasks -- Document Retrieval, Clustering, Filtering, and their combinations -- can be seen as instances of a generic "document organization" problem that establishes priority and relatedness relationships between documents (in other words, a problem of forming and ranking clusters). As far as we know, no analys...
In this paper we describe the collaborative participation of UvA & UNED at RepLab 2013. We propose an active learning approach for the filtering subtask, using features based on the detected semantics in the tweet (using Entity Linking with Wikipedia), as well as tweet-inherent features such as hashtags and usernames. The tweets manually inspected...
This paper describes the UNED Online Reputation Monitoring Team's participation at RepLab 2013 [3]. Several approaches were tested: first, an instance-based learning approach that uses Heterogeneity Based Ranking to combine seven different similarity measures was applied to all the subtasks. The filtering subtask was also tackled by automatical...
The development of summarization systems requires reliable similarity (evaluation) measures that compare system outputs with human references. A reliable measure should have correspondence with human judgements. However, the reliability of measures depends on the test collection in which the measure is meta-evaluated; for this reason, it has not ye...
This paper describes the participation of UNED NLP group in the SEMEVAL 2012 Semantic Textual Similarity task. Our contribution consists of an unsupervised method, Heterogeneity Based Ranking (HBR), to combine similarity measures. Our runs focus on combining standard similarity measures for Machine Translation. The Pearson correlation achieved is o...
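Since the meta-evaluation criterion mentioned here is Pearson correlation between combined similarity scores and gold judgements, a minimal self-contained version of that computation may help (standard formula; the data is hypothetical, not from the SEMEVAL run):

```python
from math import sqrt

def pearson(xs, ys):
    # Pearson correlation between system similarity scores and human
    # judgements, the usual meta-evaluation criterion for STS tasks.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

human  = [1.0, 2.0, 3.0, 4.0]          # gold similarity judgements
system = [0.2, 0.4, 0.6, 0.8]          # perfectly linear system scores
assert abs(pearson(human, system) - 1.0) < 1e-9
```

Note that Pearson rewards a linear relationship with the gold scores, not identical values, which is why differently scaled similarity measures can still be compared on it.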
This paper explores the real-time summarization of scheduled events such as soccer games from torrential flows of Twitter streams. We propose and evaluate an approach that substantially shrinks the stream of tweets in real-time, and consists of two steps: (i) sub-event detection, which determines if something new has occurred, and (ii) tweet select...
This paper summarizes the goals, organization and results of the first RepLab competitive evaluation campaign for Online Reputation Management Systems (RepLab 2012). RepLab focused on the reputation of companies, and asked participant systems to annotate different types of information on tweets containing the names of several companies. Two tasks w...
Although document filtering is simple to define, there is a wide range of different evaluation measures that have been proposed in the literature, all of which have been subject to criticism. We present a unified, comparative view of the strengths and weaknesses of proposed measures based on two formal constraints (which should be satisfied by any...
The practical goal of information retrieval (IR) research is to create ways to support humans to better access information in order to better carry out their tasks. Because of this, IR research has a primarily technological interest in knowledge creation ...
Automatically produced texts (e.g. translations or summaries) are usually evaluated with n-gram based measures such as BLEU or ROUGE, while the wide set of more sophisticated measures that have been proposed in the last years remains largely ignored for practical purposes. In this paper we first present an in-depth analysis of the state of the art...
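For readers unfamiliar with the n-gram based measures mentioned here, the shared core idea of BLEU/ROUGE-style metrics is clipped n-gram overlap. This is a deliberately minimal sketch; the real metrics add brevity penalties, multiple n-gram orders, multiple references, and other refinements:

```python
from collections import Counter

def ngrams(tokens, n):
    # Multiset of n-grams in a token sequence.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_recall(reference, candidate, n=1):
    # ROUGE-style n-gram recall: clipped overlap divided by the number
    # of n-grams in the reference.
    ref, cand = ngrams(reference, n), ngrams(candidate, n)
    overlap = sum(min(c, ref[g]) for g, c in cand.items() if g in ref)
    return overlap / sum(ref.values())

ref  = "the cat sat on the mat".split()
cand = "the cat lay on the mat".split()
assert ngram_recall(ref, cand, 1) == 5 / 6   # one unigram differs
assert ngram_recall(ref, cand, 2) == 3 / 5   # bigrams are stricter
```

The drop from unigram to bigram recall on the same sentence pair shows why higher-order n-grams reward fluency as well as word choice.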
Monitoring the online reputation of a company starts by retrieving all (fresh) information where the company is mentioned; and a major problem in this context is that company names are often ambiguous (apple may refer to the company, the fruit, the singer, etc.). The problem is particularly hard in microblogging, where there is little context to di...
This book constitutes the refereed proceedings of the Second International Conference on Multilingual and Multimodal Information Access Evaluation, in continuation of the popular CLEF campaigns and workshops that have run for the last decade, CLEF 2011, held in Amsterdam, The Netherlands, in September 2011. The 14 revised full papers presented toge...
The third WePS (Web People Search) Evaluation campaign took place in 2009-2010 and attracted the participation of 13 research groups from Europe, Asia and North America. Given the top web search results for a person name, two tasks were addressed: a clustering task, which consists of grouping together web pages referring to the same person, and an...
This paper summarizes the definition, resources, evaluation methodology and metrics, participation and comparative results for the second task of the WePS-3 evaluation campaign. The so-called Online Reputation Management task consists of filtering Twitter posts containing a given company name depending on whether the post is actually related with th...
Is it possible to use sense inventories to improve Web search results diversity for one word queries? To answer this question, we focus on two broad-coverage lexical resources of a different nature: WordNet, as a de-facto standard used in Word Sense Disambiguation experiments; and Wikipedia, as a large coverage, updated encyclopaedic resource whi...
Information retrieval access research is based on evaluation as the main vehicle of research: benchmarking procedures are regularly pursued by all contributors to the field. But benchmarking is only one half of evaluation: to validate the results the evaluation must include the study of user behaviour while performing tasks for which the system und...
We have participated in the SENSEVAL-2 English tasks (all words and lexical sample) with an unsupervised system based on mutual information measured over a large corpus (277 million words) and some additional heuristics. A supervised extension of the system was also presented to the lexical sample task. Our system scored first among unsupervised sy...
There is a wide set of evaluation metrics available to compare the quality of text clustering algorithms. In this article, we define a few intuitive formal constraints on such metrics which shed light on which aspects of the quality of a clustering are captured by different metric families. These formal constraints are validated in an experiment in...
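The Purity / Inverse Purity family discussed across these clustering-evaluation papers can be written down concretely. These are the standard definitions (not code from the article); the degenerate one-cluster and all-singletons solutions show the trade-off that formal constraints are meant to expose:

```python
from collections import Counter

def purity(clusters, labels):
    # Purity: each cluster is credited with its majority gold class.
    # `clusters` is a list of lists of item ids; `labels[i]` is the
    # gold class of item i (any indexable mapping works).
    total = sum(len(c) for c in clusters)
    return sum(Counter(labels[i] for i in c).most_common(1)[0][1]
               for c in clusters) / total

def inverse_purity(clusters, labels):
    # Inverse Purity: swap the roles of clusters and gold classes.
    gold = {}
    for i, l in enumerate(labels):
        gold.setdefault(l, []).append(i)
    assignment = {i: k for k, c in enumerate(clusters) for i in c}
    return purity(list(gold.values()), assignment)

labels = ["a", "a", "a", "b", "b", "b"]

one_cluster = [[0, 1, 2, 3, 4, 5]]       # degenerate: everything together
assert purity(one_cluster, labels) == 0.5
assert inverse_purity(one_cluster, labels) == 1.0

singletons = [[i] for i in range(6)]     # degenerate: everything apart
assert purity(singletons, labels) == 1.0
assert inverse_purity(singletons, labels) == 1 / 3
```

Each degenerate clustering maximizes one metric while hurting the other, which is exactly why unweighted combinations of the two are needed to rank systems sensibly.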
In this paper we summarize the analysis performed on the logs of multilingual image search provided by iCLEF09 and its comparison with the logs released in the iCLEF08 campaign. We have processed more than one million log lines in order to identify and characterize 5,243 individual search sessions. We focus on the analysis of users’ behavior and th...
This paper summarises activities from the iCLEF 2009 task. As in 2008, the task was organised based on users participating in an interactive cross-language image search experiment. Organizers provided a default multilingual search system (Flickling) which accessed images from Flickr, with the whole iCLEF experiment run as an online game. Interactio...
Searching for a person name in a Web Search Engine usually leads to a number of web pages that refer to several people sharing the same name. In this paper we study whether it is reasonable to assume that pages about the desired person can be filtered by the user by adding query terms. Our results indicate that, although on most occasions there is...
A number of approaches to Automatic MT Evaluation based on deep linguistic knowledge have been suggested. However, n-gram based metrics are still today the dominant approach. The main reason is that the advantages of employing deeper linguistic information have not been clarified yet. In this work, we propose a novel approach for meta-evalu...
The ambiguity of person names in the Web has become a new area of interest for NLP researchers. This challenging problem has been formulated as the task of clustering Web search results (returned in response to a person name query) according to the individual they mention. In this paper we compare the coverage, reliability and independence of a num...
The second WePS (Web People Search) Evaluation campaign took place in 2008-2009 with the participation of 19 research groups from Europe, Asia and North America. Given the output of a Web Search Engine for a (usually ambiguous) person name as query, two tasks were addressed: a clustering task, which consists of grouping together web pages referri...
This paper presents the Unanimous Improvement Ratio (UIR), a measure that makes it possible to compare systems using two evaluation metrics without dependencies on relative metric weights. For clustering tasks, this kind of measure becomes necessary given the trade-off between precision and recall oriented metrics (e.g. Purity and Inverse Purity) which usua...
The goal of the project is to analyze, experiment, and develop intelligent, interactive and multilingual Text Mining technologies, as a key element of the next generation of search engines, systems with the capacity to find "the need behind the query". This new generation will provide specialized services and interfaces according to the search doma...
In this paper, we summarize our analysis over the logs of multilingual image searches in Flickr provided to iCLEF 2008 participants. We have studied: a) correlations between the language skills of searchers in the target language and other session parameters, such as success (was the image found?), number of query refinements, etc.; b) usage of spe...
This paper summarises activities from the iCLEF 2008 task. In an attempt to encourage greater participation in user-orientated experiments, a new task was organised based on users participating in an interactive cross-language image search experiment. Organizers provided a default multilingual search system which accessed images from Flickr, with t...
The goal of this project is to analyze, experiment with, and develop intelligent, interactive and multilingual text mining technologies, as a key element of the next generation of search and text analysis engines: systems capable of finding "the need behind the query". These technologies will offer services and interfaces...
The Cross Language Evaluation Forum has been an activity of DELOS for the last eight years. During this period, it has promoted research in the domain of multilingual information retrieval. This activity has produced considerable results; in particular it has encouraged experimentation with all kinds of multilingual information access – from the de...
This paper presents the motivation, resources and results for the first Web People Search task, which was organized as part of the SemEval-2007 evaluation exercise. We also describe a survey and a proposal for a new task, "attribute extraction", which is planned for inclusion in the second evaluation, scheduled for autumn 2008.
Objectives. Information retrieval is an empirical science; the field cannot move forward unless there are means of evaluating the innovations devised by researchers. However, the methodologies conceived in the early years of IR and used in the campaigns of today are starting to show their age, and new research is emerging to understand how to overcom...
Is Cross-Language answer finding harder than Monolingual answer finding for users? In this paper we provide initial quantitative and qualitative evidence to answer this question. In our study, which involves 16 users searching questions under four different system conditions, we find that interactive cross-language answer finding is not substantiall...
The importance of evaluation in promoting research and development in the information retrieval and natural language processing domains has long been recognised but is this sufficient? In many areas there is still a considerable gap between the results achieved by the research community and their implementation in commercial applications. This is p...
Participation in evaluation campaigns for interactive information retrieval systems has received variable success over the years. In this paper we discuss the large-scale interactive evaluation of multilingual information access systems, as part of the Cross-Language Evaluation Forum evaluation campaign. In particular, we describe the evaluation p...
A possible way of solving the knowledge acquisition bottleneck in word sense disambiguation is mining very large corpora (most prominently the World Wide Web) to automatically acquire lexical information and examples to feed supervised learning methods. Although this area of research remains largely unexplored, it has already revealed a strong pote...
This paper presents the task definition, resources, participation, and comparative results for the Web People Search task, which was organized as part of the SemEval-2007 evaluation exercise. This task consists of clustering a set of documents that mention an ambiguous person name according to the actual entities referred to using that name.
In this paper, we present our participation in the ImageCLEF 2005 ad-hoc task. After a pool of preliminary tests in which we evaluated the impact of different-size dictionaries using three distinct approaches, we found that the biggest differences were obtained by recognizing named entities and launching structured queries over the metadata. Thus,...
This paper summarizes the participation of UNED in the CLEF 2006 interactive task. Our goal was to measure the attitude of users towards cross-language searching when the search system provides the possibility (as an option) of searching cross-language, and when the search tasks can clearly benefit from searching in multiple languages. Our results...
This paper summarizes the task design for iCLEF 2006 (the CLEF interactive track). Compared to previous years, we have proposed a radically new task: searching images in a naturally multilingual database, Flickr, which has millions of photographs shared by people all over the planet, tagged and described in a wide variety of languages. Participants...
This paper presents a proposal for iCLEF 2006, the interactive track of the CLEF cross-language evaluation campaign. In the past, iCLEF has addressed applications such as information retrieval and question answering. However, for 2006 the focus has turned to text-based image retrieval from Flickr. We describe Flickr, the challenges this kind of col...
We present a comparative study on Machine Translation Evaluation according to two different criteria: Human Likeness and Human Acceptability. We provide empirical evidence that there is a relationship between these two kinds of evaluation: Human Likeness implies Human Acceptability but the reverse is not true. From the point of view of automatic...