Julio Gonzalo

  • PhD
  • Professor (Full) at National University of Distance Education

About

216
Publications
38,086
Reads
4,384
Citations
Introduction
Information Retrieval, Natural Language Processing, Evaluation Metrics and Methodologies
Current institution
National University of Distance Education
Current position
  • Professor (Full)
Additional affiliations
January 1995 - present
National University of Distance Education
Position
  • Head of Department

Publications (216)
Preprint
Recent studies comparing AI-generated and human-authored literary texts have produced conflicting results: some suggest AI already surpasses human quality, while others argue it still falls short. We start from the hypothesis that such divergences can be largely explained by genuine differences in how readers interpret and value literature, rather...
Preprint
Full-text available
In LLM evaluations, reasoning is often distinguished from recall/memorization by performing numerical variations to math-oriented questions. Here we introduce a general variation method for multiple-choice questions that completely dissociates the correct answer from previously seen tokens or concepts, requiring LLMs to understand and reason (rathe...
Article
Full-text available
In this article we present UNED-ACCESS 2024, a bilingual dataset that consists of 1003 multiple-choice questions of university entrance level exams in Spanish and English. Questions are originally formulated in Spanish and manually translated into English, and have not ever been publicly released, ensuring minimal contamination when evaluating Lar...
Preprint
Full-text available
In this article we present UNED-ACCESS 2024, a bilingual dataset that consists of 1003 multiple-choice questions of university entrance level exams in Spanish and English. Questions are originally formulated in Spanish and translated manually into English, and have not ever been publicly released. A selection of current open-source and proprietary...
Chapter
Full-text available
The paper describes the EXIST 2024 lab on Sexism identification in social networks, that is expected to take place at the CLEF 2024 conference and represents the fourth edition of the EXIST challenge. The lab comprises five tasks in two languages, English and Spanish, with the initial three tasks building upon those from EXIST 2023 (sexism identifi...
Chapter
Full-text available
In recent years, the rapid increase in the dissemination of offensive and discriminatory material aimed at women through social media platforms has emerged as a significant concern. This trend has had adverse effects on women’s well-being and their ability to freely express themselves. The EXIST campaign has been promoting research in online sexism...
Conference Paper
Full-text available
The paper describes the lab on Sexism identification in social networks (EXIST 2023) that will be hosted as a lab at the CLEF 2023 conference. The lab consists of three tasks, two of which are continuations of EXIST 2022 (sexism detection and sexism categorization) and a third and novel one on source intention identification. For this edition new te...
Article
Full-text available
The paper describes the organization, goals, and results of the sEXism Identification in Social neTworks (EXIST) 2022 challenge, a shared task proposed for the second year at IberLEF. EXIST 2022 consists of two challenges: sexism identification and sexism categorization of tweets and gabs, both in Spanish and English. We have received a total of 45...
Article
The design and analysis of experimental research in Data Mining (DM) is anchored in a correct choice of the type of task addressed (clustering, classification, regression, etc.). However, although DM is a relatively mature discipline, there is no consensus yet about what is the taxonomy of DM tasks, which are their formal characteristics, and their...
Article
Full-text available
Online reputation management (ORM) comprises the collection of techniques that help monitoring and improving the public image of an entity (companies, products, institutions) on the Internet. The ORM experts try to minimize the negative impact of the information about an entity while maximizing the positive material for being more trustworthy to th...
Article
Full-text available
Although computing similarity is one of the fundamental challenges of Information Access tasks, the notion of similarity in this context is not yet completely understood from a formal, axiomatic perspective. In this paper we show how axiomatic explanations of similarity from other fields (Tversky’s axioms from the point of view of cognitive science...
Preprint
Full-text available
In Ordinal Classification tasks, items have to be assigned to classes that have a relative ordering, such as positive, neutral, negative in sentiment analysis. Remarkably, the most popular evaluation metrics for ordinal classification tasks either ignore relevant information (for instance, precision/recall on each of the classes ignores their relat...
Article
Full-text available
Producing online reputation summaries for an entity (company, brand, etc.) is a focused summarization task with a distinctive feature: issues that may affect the reputation of the entity take priority in the summary. In this paper we (i) present a new test collection of manually created (abstractive and extractive) reputation reports which summariz...
Article
Full-text available
Given the task of finding influencers of a given domain (i.e. banking) in a social network, in this paper we investigate (i) the importance of characterizing followers for the automatic detection of influencers; (ii) the most effective way to combine signals obtained from followers and from the main profiles for the automatic detection of influence...
Article
The emergence of social media and the huge amount of opinions that are posted everyday have influenced online reputation management. Reputation experts need to filter and control what is posted online and, more importantly, determine if an online post is going to have positive or negative implications towards the entity of interest. This task is ch...
Chapter
Over a period of 3 years, RepLab was a CLEF initiative where computer scientists and online reputation experts worked together to identify and formalize the computational challenges in the area of online reputation monitoring. Two main results emerged from RepLab: a community of researchers engaged in the problem, and an extensive Twitter test coll...
Article
Full-text available
Although document filtering is simple to define, there is a wide range of different evaluation measures that have been proposed in the literature, all of which have been subject to criticism. Our goal is to compare metrics from a formal point of view, in order to understand whether each metric is appropriate, why and when, in order to achieve a bet...
Article
Full-text available
Given the task of finding influencers (opinion makers) for a given domain in a social network, we investigate (a) what is the relative importance of domain and authority signals, (b) what is the most effective way of combining signals (voting, classification, learning to rank, etc.) and how best to model the vocabulary signal, and (c) how large is...
Article
Full-text available
We describe the state-of-the-art in performance modeling and prediction for Information Retrieval (IR), Natural Language Processing (NLP) and Recommender Systems (RecSys) along with its shortcomings and strengths. We present a framework for further research, identifying five major problem areas: understanding measures, performance analysis, making...
Article
The news industry has undergone a revolution in the past decade, with substantial changes continuing to this day. News consumption habits are changing due to the increase in the volume of news and the variety of sources. Readers need new mechanisms to cope with this vast volume of information in order to not only find a signal in the noise, but als...
Article
Full-text available
This paper reports the findings of the Dagstuhl Perspectives Workshop 17442 on performance modeling and prediction in the domains of Information Retrieval, Natural Language Processing and Recommender Systems. We present a framework for further research, which identifies five major problem areas: understanding measures, performance analysis, making...
Article
Full-text available
This is a report on the eighth edition of the Conference and Labs of the Evaluation Forum (CLEF 2017), held in early September 2017, in Dublin, Ireland. CLEF was a four day event combining a Conference and an Evaluation Forum. The Conference featured keynotes by Leif Azzopardi and Vincent Wade, and presentation of 32 peer reviewed research papers c...
Conference Paper
Full-text available
The EvALL online evaluation service aims to provide a unified evaluation framework for Information Access systems that makes results completely comparable and publicly available for the whole research community. For researchers working on a given test collection, the framework allows to: (i) evaluate results in a way compliant with measurement theo...
Conference Paper
Full-text available
One of the core tasks of Online Reputation Monitoring is to determine whether a text mentioning the entity of interest has positive or negative implications for its reputation. A challenging aspect of the task is that many texts are polar facts, i.e. they do not convey sentiment but they do have reputational implications (e.g. A Samsung smartphone...
Conference Paper
We present an in-depth formal and empirical comparison of unsupervised signal combination approaches in the context of tasks based on textual similarity. Our formal study introduces the concept of Similarity Information Quantity, and proves that the most salient combination methods are all estimations of Similarity Information Quantity under differ...
Book
This book constitutes the refereed proceedings of the 8th International Conference of the CLEF Initiative, CLEF 2017, held in Dublin, Ireland, in September 2017. The 7 full papers and 9 short papers presented together with 6 best of the labs papers were carefully reviewed and selected from 38 submissions. In addition, this volume contains the resu...
Article
In this paper, the Evall framework for the automatic evaluation of information systems task is presented. With just one click and providing the system outputs of the algorithms, Evall allows researchers to automatically generate a Latex report including the results of their algorithms, statistical significance tests, measures descriptions, and refe...
Conference Paper
Monitoring Online Reputation has already become a key part of Public Relations for organizations and individuals; and current search technologies do not suffice to help reputation experts to cope with the vast stream of online content relevant to their clients. One of the reasons is that the amount of relevant content for medium and large companies...
Conference Paper
Producing online reputation reports for an entity (company, brand, etc.) is a focused summarization task with a distinctive feature: issues that may affect the reputation of the entity take priority in the summary. In this paper we (i) propose a novel methodology to evaluate summaries in the context of online reputation which profits from an analog...
Conference Paper
In this tutorial we present a formal account of evaluation metrics for three of the most salient information related tasks: Retrieval, Clustering, and Filtering. We focus on the most popular metrics and, by exploiting measurement theory, we show some constraints for suitable metrics in each of the three tasks. We also systematically compare metrics...
Conference Paper
Full-text available
This paper describes the organisation and results of RepLab 2014, the third competitive evaluation campaign for Online Reputation Management systems. This year the focus lay on two new tasks: reputation dimensions classification and author profiling, which complement the aspects of reputation analysis studied in the previous campaigns. The partici...
Conference Paper
Full-text available
Reputation management experts have to monitor--among others--Twitter constantly and decide, at any given time, what is being said about the entity of interest (a company, organization, personality...). Solving this reputation monitoring problem automatically as a topic detection task is both essential--manual processing of data is either costly or...
Article
Recently, significant progress has been made in research on what we call semantic matching (SM), in web search, question answering, online advertisement, cross-language information retrieval, and other tasks. Advanced technologies based on machine learning have been developed. Let us take Web search as example of the problem that also pervades the...
Conference Paper
In this tutorial we will present, review, and compare the most popular evaluation metrics for some of the most salient information related tasks, covering: (i) Information Retrieval, (ii) Clustering, and (iii) Filtering. The tutorial will make a special emphasis on the specification of constraints for suitable metrics in each of the three tasks, an...
Conference Paper
Full-text available
We present a semi-automatic tool that assists experts in their daily work of monitoring the reputation of entities—companies, organizations or public figures—in Twitter. The tool automatically annotates tweets for relevance (Is the tweet about the entity?), reputational polarity (Does the tweet convey positive or negative implications for the reput...
Article
Full-text available
Many Artificial Intelligence tasks cannot be evaluated with a single quality criterion and some sort of weighted combination is needed to provide system rankings. A problem of weighted combination measures is that slight changes in the relative weights may produce substantial changes in the system rankings. This paper introduces the Unanimous Impro...
Conference Paper
Full-text available
Microblogs play an important role for Online Reputation Management. Companies and organizations in general have an increasing interest in obtaining the last minute information about which are the emerging topics that concern their reputation. In this paper, we present a new technique to cluster a collection of tweets emitted within a short time spa...
Conference Paper
Full-text available
This paper summarizes the goals, organization, and results of the second RepLab competitive evaluation campaign for Online Reputation Management Systems (RepLab 2013). RepLab focused on the process of monitoring the reputation of companies and individuals, and asked participant systems to annotate different types of information on tweets containin...
Article
Full-text available
A major problem in monitoring the online reputation of companies, brands, and other entities is that entity names are often ambiguous (apple may refer to the company, the fruit, the singer, etc.). The problem is particularly hard in microblogging services such as Twitter, where texts are very short and there is little context to disambiguate. In th...
Conference Paper
A number of key Information Access tasks -- Document Retrieval, Clustering, Filtering, and their combinations -- can be seen as instances of a generic document organization problem that establishes priority and relatedness relationships between documents (in other words, a problem of forming and ranking clusters). As far as we know, no analys...
Conference Paper
Full-text available
In this paper we describe the collaborative participation of UvA & UNED at RepLab 2013. We propose an active learning approach for the filtering subtask, using features based on the detected semantics in the tweet (using Entity Linking with Wikipedia), as well as tweet-inherent features such as hashtags and usernames. The tweets manually inspected...
Conference Paper
Full-text available
This paper describes the UNED's Online Reputation Monitoring Team participation at RepLab 2013 [3]. Several approaches were tested: First, an instance-based learning approach that uses Heterogeneity Based Ranking to combine seven different similarity measures was applied for all the subtasks. The filtering subtask was also tackled by automatical...
Conference Paper
The development of summarization systems requires reliable similarity (evaluation) measures that compare system outputs with human references. A reliable measure should have correspondence with human judgements. However, the reliability of measures depends on the test collection in which the measure is meta-evaluated; for this reason, it has not ye...
Conference Paper
This paper describes the participation of UNED NLP group in the SEMEVAL 2012 Semantic Textual Similarity task. Our contribution consists of an unsupervised method, Heterogeneity Based Ranking (HBR), to combine similarity measures. Our runs focus on combining standard similarity measures for Machine Translation. The Pearson correlation achieved is o...
Article
Full-text available
This paper explores the real-time summarization of scheduled events such as soccer games from torrential flows of Twitter streams. We propose and evaluate an approach that substantially shrinks the stream of tweets in real-time, and consists of two steps: (i) sub-event detection, which determines if something new has occurred, and (ii) tweet select...
Article
Full-text available
This paper summarizes the goals, organization and results of the first RepLab competitive evaluation campaign for Online Reputation Management Systems (RepLab 2012). RepLab focused on the reputation of companies, and asked participant systems to annotate different types of information on tweets containing the names of several companies. Two tasks w...
Conference Paper
Although document filtering is simple to define, there is a wide range of different evaluation measures that have been proposed in the literature, all of which have been subject to criticism. We present a unified, comparative view of the strengths and weaknesses of proposed measures based on two formal constraints (which should be satisfied by any...
Article
Full-text available
The practical goal of information retrieval (IR) research is to create ways to support humans to better access information in order to better carry out their tasks. Because of this, IR research has a primarily technological interest in knowledge creation ...
Conference Paper
Full-text available
Automatically produced texts (e.g. translations or summaries) are usually evaluated with n-gram based measures such as BLEU or ROUGE, while the wide set of more sophisticated measures that have been proposed in the last years remains largely ignored for practical purposes. In this paper we first present an in-depth analysis of the state of the art...
Conference Paper
Full-text available
Monitoring the online reputation of a company starts by retrieving all (fresh) information where the company is mentioned; and a major problem in this context is that company names are often ambiguous (apple may refer to the company, the fruit, the singer, etc.). The problem is particularly hard in microblogging, where there is little context to di...
Book
This book constitutes the refereed proceedings of the Second International Conference on Multilingual and Multimodal Information Access Evaluation, in continuation of the popular CLEF campaigns and workshops that have run for the last decade, CLEF 2011, held in Amsterdam, The Netherlands, in September 2011. The 14 revised full papers presented toge...
Conference Paper
Full-text available
The third WePS (Web People Search) Evaluation campaign took place in 2009-2010 and attracted the participation of 13 research groups from Europe, Asia and North America. Given the top web search results for a person name, two tasks were addressed: a clustering task, which consists of grouping together web pages referring to the same person, and an...
Conference Paper
Full-text available
This paper summarizes the definition, resources, evaluation methodology and metrics, participation and comparative results for the second task of the WEPS-3 evaluation campaign. The so-called Online Reputation Management task consists of filtering Twitter posts containing a given company name depending on whether the post is actually related with th...
Conference Paper
Full-text available
Is it possible to use sense inventories to improve Web search results diversity for one word queries? To answer this question, we focus on two broad-coverage lexical resources of a different nature: WordNet, as a de-facto standard used in Word Sense Disambiguation experiments; and Wikipedia, as a large coverage, updated encyclopaedic resource whi...
Chapter
Information retrieval access research is based on evaluation as the main vehicle of research: benchmarking procedures are regularly pursued by all contributors to the field. But benchmarking is only one half of evaluation: to validate the results the evaluation must include the study of user behaviour while performing tasks for which the system und...
Article
Full-text available
We have participated in the SENSEVAL-2 English tasks (all words and lexical sample) with an unsupervised system based on mutual information measured over a large corpus (277 million words) and some additional heuristics. A supervised extension of the system was also presented to the lexical sample task. Our system scored first among unsupervised sy...
Article
Full-text available
There is a wide set of evaluation metrics available to compare the quality of text clustering algorithms. In this article, we define a few intuitive formal constraints on such metrics which shed light on which aspects of the quality of a clustering are captured by different metric families. These formal constraints are validated in an experiment in...
Conference Paper
In this paper we summarize the analysis performed on the logs of multilingual image search provided by iCLEF09 and its comparison with the logs released in the iCLEF08 campaign. We have processed more than one million log lines in order to identify and characterize 5,243 individual search sessions. We focus on the analysis of users’ behavior and th...
Conference Paper
Full-text available
This paper summarises activities from the iCLEF 2009 task. As in 2008, the task was organised based on users participating in an interactive cross-language image search experiment. Organizers provided a default multilingual search system (Flickling) which accessed images from Flickr, with the whole iCLEF experiment run as an online game. Interactio...
Conference Paper
Full-text available
Searching for a person name in a Web Search Engine usually leads to a number of web pages that refer to several people sharing the same name. In this paper we study whether it is reasonable to assume that pages about the desired person can be filtered by the user by adding query terms. Our results indicate that, although in most occasions there is...
Conference Paper
Full-text available
A number of approaches to Automatic MT Evaluation based on deep linguistic knowledge have been suggested. However, n-gram based metrics are still today the dominant approach. The main reason is that the advantages of employing deeper linguistic information have not been clarified yet. In this work, we propose a novel approach for meta-evalu...
Conference Paper
Full-text available
The ambiguity of person names in the Web has become a new area of interest for NLP researchers. This challenging problem has been formulated as the task of clustering Web search results (returned in response to a person name query) according to the individual they mention. In this paper we compare the coverage, reliability and independence of a num...
Article
Full-text available
The second WePS (Web People Search) Evaluation campaign took place in 2008-2009 with the participation of 19 research groups from Europe, Asia and North America. Given the output of a Web Search Engine for a (usually ambiguous) person name as query, two tasks were addressed: a clustering task, which consists of grouping together web pages referri...
Article
Full-text available
This paper presents the Unanimous Improvement Ratio (UIR), a measure that allows to compare systems using two evaluation metrics without dependencies on relative metric weights. For clustering tasks, this kind of measure becomes necessary given the trade-off between precision and recall oriented metrics (e.g. Purity and Inverse Purity) which usua...
Article
Full-text available
The goal of the project is to analyze, experiment, and develop intelligent, interactive and multilingual Text Mining technologies, as a key element of the next generation of search engines, systems with the capacity to find "the need behind the query". This new generation will provide specialized services and interfaces according to the search doma...
Conference Paper
Full-text available
In this paper, we summarize our analysis over the logs of multilingual image searches in Flickr provided to iCLEF 2008 participants. We have studied: a) correlations between the language skills of searchers in the target language and other session parameters, such as success (was the image found?), number of query refinements, etc.; b) usage of spe...
Conference Paper
Full-text available
This paper summarises activities from the iCLEF 2008 task. In an attempt to encourage greater participation in user-orientated experiments, a new task was organised based on users participating in an interactive cross-language image search experiment. Organizers provided a default multilingual search system which accessed images from Flickr, with t...
Article
Full-text available
The goal of this project is to analyze, experiment with, and develop intelligent, interactive, and multilingual text mining technologies, as a key component of the next generation of search and text analysis engines, systems capable of finding “the need behind the query”. These technologies will offer services and interfaces...
Article
Full-text available
The Cross Language Evaluation Forum has been an activity of DELOS for the last eight years. During this period, it has promoted research in the domain of multilingual information retrieval. This activity has produced considerable results; in particular it has encouraged experimentation with all kinds of multilingual information access – from the de...
Conference Paper
Full-text available
This paper presents the motivation, resources and results for the first Web People Search task, which was organized as part of the SemEval-2007 evaluation exercise. Also, we will describe a survey and proposal for a new task, "attribute extraction", which is planned for inclusion in the second evaluation, planned for autumn, 2008.
Conference Paper
Full-text available
Objectives. Information retrieval is an empirical science; the field cannot move forward unless there are means of evaluating the innovations devised by researchers. However the methodologies conceived in the early years of IR and used in the campaigns of today are starting to show their age and new research is emerging to understand how to overcom...
Article
Is Cross-Language answer finding harder than Monolingual answer finding for users? In this paper we provide initial quantitative and qualitative evidence to answer this question. In our study, which involves 16 users searching questions under four different system conditions, we find that interactive cross-language answer finding is not substantiall...
Conference Paper
Full-text available
The importance of evaluation in promoting research and development in the information retrieval and natural language processing domains has long been recognised but is this sufficient? In many areas there is still a considerable gap between the results achieved by the research community and their implementation in commercial applications. This is p...
Article
Full-text available
Participation in evaluation campaigns for interactive information retrieval systems has received variable success over the years. In this paper we discuss the large-scale interactive evaluation of multilingual information access systems, as part of the Cross-Language Evaluation Forum evaluation campaign. In particular, we describe the evaluation p...
Chapter
Full-text available
A possible way of solving the knowledge acquisition bottleneck in word sense disambiguation is mining very large corpora (most prominently the World Wide Web) to automatically acquire lexical information and examples to feed supervised learning methods. Although this area of research remains largely unexplored, it has already revealed a strong pote...
Article
Full-text available
This paper presents the task definition, resources, participation, and comparative results for the Web People Search task, which was organized as part of the SemEval-2007 evaluation exercise. This task consists of clustering a set of documents that mention an ambiguous person name according to the actual entities referred to using that name.
Conference Paper
Full-text available
This paper presents the task definition, resources, participation, and comparative results for the Web People Search task, which was organized as part of the SemEval-2007 evaluation exercise. This task consists of clustering a set of documents that mention an ambiguous person name according to the actual entities referred to using that name.
Conference Paper
The Cross Language Evaluation Forum has been an activity of DELOS for the last eight years. During this period, it has promoted research in the domain of multilingual information retrieval. This activity has produced considerable results; in particular it has encouraged experimentation with all kinds of multilingual information access – from the de...
Conference Paper
Full-text available
In this paper, we present our participation in the ImageCLEF 2005 ad-hoc task. After a pool of preliminary tests in which we evaluated the impact of different-size dictionaries using three distinct approaches, we proved that the biggest differences were obtained by recognizing named entities and launching structured queries over the metadata. Thus,...
Conference Paper
Full-text available
This paper summarizes the participation of UNED in the CLEF 2006 interactive task. Our goal was to measure the attitude of users towards cross-language searching when the search system provides the possibility (as an option) of searching cross-language, and when the search tasks can clearly benefit from searching in multiple languages. Our results...
Conference Paper
Full-text available
This paper summarizes the task design for iCLEF 2006 (the CLEF interactive track). Compared to previous years, we have proposed a radically new task: searching images in a naturally multilingual database, Flickr, which has millions of photographs shared by people all over the planet, tagged and described in a wide variety of languages. Participants...
Article
Full-text available
This paper presents a proposal for iCLEF 2006, the interactive track of the CLEF cross-language evaluation campaign. In the past, iCLEF has addressed applications such as information retrieval and question answering. However, for 2006 the focus has turned to text-based image retrieval from Flickr. We describe Flickr, the challenges this kind of col...
Conference Paper
Full-text available
We present a comparative study on Machine Translation Evaluation according to two different criteria: Human Likeness and Human Acceptability. We provide empirical evidence that there is a relationship between these two kinds of evaluation: Human Likeness implies Human Acceptability but the reverse is not true. From the point of view of automatic...
