
José Hernández-Orallo
- PhD
- Professor at Polytechnic University of Valencia
About
255 Publications
135,857 Reads
6,117 Citations
Publications (255)
Adversarial benchmark construction, where harder instances challenge new generations of AI systems, is becoming the norm. While this approach may lead to better machine learning models, on average and for the new benchmark, it is unclear how these models behave on the original distribution. Two opposing effects are intertwined here. On the one han...
Exemplar-based explainable artificial intelligence (XAI) aims at creating human understanding about the behaviour of an AI system, usually a machine learning model, through examples. The advantage of this approach is that the human creates their own explanation in their own internal language. However, what examples should be chosen? Existing framew...
In curriculum learning the order of concepts is determined by the teacher but not the examples for each concept, while in machine teaching it is the examples that are chosen by the teacher to minimise the learning effort, though the concepts are taught in isolation. Curriculum teaching is the natural combination of both, where both concept order an...
Even with obvious deficiencies, large prompt-commanded multimodal models are proving to be flexible cognitive tools representing an unprecedented generality. But the directness, diversity, and degree of user interaction create a distinctive “human-centred generality” (HCG), rather than a fully autonomous one. HCG implies that —for a specific user—...
Language models have become very popular recently and many claims have been made about their abilities, including for commonsense reasoning. Given the increasingly better results of current language models on previous static benchmarks for commonsense reasoning, we explore an alternative dialectical evaluation. The goal of this kind of evaluation i...
Aggregate metrics and lack of access to results limit understanding.
The automation of data science and other data manipulation processes depends on the integration and formatting of ‘messy’ data. Data wrangling is an umbrella term for these tedious and time-consuming tasks. Tasks such as transforming dates, units or names expressed in different formats have been challenging for machine learning because (1) users exp...
The current and future capabilities of Artificial Intelligence (AI) are typically assessed with an ever increasing number of benchmarks, competitions, tests and evaluation standards, which are meant to work as AI evaluation instruments (EI). These EIs are not only increasing in number, but also in complexity and diversity, making it hard to underst...
The Workshop Program of the Association for the Advancement of Artificial Intelligence’s Thirty-Sixth Conference on Artificial Intelligence was held virtually from February 22 – March 1, 2022. There were thirty-nine workshops in the program: Adversarial Machine Learning and Beyond, AI for Agriculture and Food Systems, AI for Behavior Change, AI for...
Over the past decades in the field of machine teaching, several restrictions have been introduced to avoid ‘cheating’, such as collusion-free or non-clashing teaching. However, these restrictions forbid several teaching situations that we intuitively consider natural and fair, especially those ‘changes of mind’ of the learner as more evidence is gi...
We present a framework for analysing the impact of AI on occupations. This framework maps 59 generic tasks from different occupational datasets to 14 cognitive abilities and these to a comprehensive list of 328 AI benchmarks used to evaluate research intensity in AI. The use of cognitive abilities as an intermediate layer allows for an identificati...
In AI evaluation, performance is often calculated by averaging across various instances. But to fully understand the capabilities of an AI system, we need to understand the factors that cause its pattern of success and failure. In this paper, we present a new methodology to identify and build informative instance features that can provide explanato...
The new generation of language models is reported to solve some extraordinary tasks the models were never trained for specifically, in few-shot or zero-shot settings. However, these reports usually cherry-pick the tasks, use the best prompts, and unwrap or extract the solutions leniently even if they are followed by nonsensical text. In sum, they a...
Many present and future problems associated with artificial intelligence are not due to its limitations, but to our poor assessment of its behaviour. Our evaluation procedures produce aggregated performance metrics that lack detail and quantified uncertainty about the following question: how will an AI system, with a particular profile π, behave...
One of the challenges of artificial intelligence as a whole is robustness. Many issues such as adversarial examples, out-of-distribution performance, Clever Hans phenomena, and the wider areas of AI evaluation and explainable AI have to do with the following question: Did the system fail because it is a hard instance or because something else? In this...
Given the complexity of data science projects and related demand for human expertise, automation has the potential to transform the data science process.
Success in all sorts of situations is the most classical interpretation of general intelligence. Under limited resources, however, the capability of an agent must necessarily be limited too, and generality needs to be understood as comprehensive performance up to a level of difficulty. The degree of generality then refers to the way an agent’s cap...
The progress of some AI paradigms such as deep learning is said to be linked to an exponential growth in the number of parameters. There are many studies corroborating these trends, but does this translate into an exponential increase in energy consumption? In order to answer this question we focus on inference costs rather than training costs, as...
Machine teaching under strong simplicity priors can teach any concept in universal languages. Remarkably, recent experiments suggest that the teaching sets are shorter than the concept description itself. This raises many important questions about the complexity of concepts and their teaching size, especially when concepts are taught incrementally....
Machine teaching under strong simplicity priors can teach any concept in universal languages. Remarkably, recent experiments suggest that the teaching sets are shorter than the concept description itself. This raises many important questions about the complexity of concepts and their teaching size, especially when concepts are taught incrementally...
The Workshop Program of the Association for the Advancement of Artificial Intelligence’s Thirty-Fifth Conference on Artificial Intelligence was held virtually from February 8-9, 2021. There were twenty-six workshops in the program: Affective Content Analysis, AI for Behavior Change, AI for Urban Mobility, Artificial Intelligence Safety, Combating O...
Machine intelligence differs significantly from human intelligence. While human perception has similarities to the way machine perception works, human learning is mostly a directed process, guided by other people: parents, teachers, ... The area of machine teaching is becoming increasingly popular as a different paradigm for making machines learn. I...
The widespread use of experimental benchmarks in AI research has created competition and collaboration dynamics that are still poorly understood. Here we provide an innovative methodology to explore these dynamics and analyse the way different entrants in these challenges, from academia to tech giants, behave and react depending on their own or oth...
Recent research in machine teaching has explored the instruction of any concept expressed in a universal language. In this compositional context, new experimental results have shown that there exist data teaching sets surprisingly shorter than the concept description itself. However, there exists a bound for those remarkable experimental findings t...
Given the complexity of typical data science projects and the associated demand for human expertise, automation has the potential to transform the data science process. Key insights: * Automation in data science aims to facilitate and transform the work of data scientists, not to replace them. * Important parts of data science are already being aut...
Artificial Intelligence (AI) offers the potential to transform our lives in radical ways. However, the main unanswered questions about this foreseen transformation are its depth, breadth and timelines. To answer them, not only do we lack the tools to determine what achievements will be attained in the near future, but we do not even know what various te...
We present muppets, a framework for partitioning the cells of a table into segments that fulfil the same semantic role or belong to the same semantic data type, similar to how image segmentation is used to group pixels that represent the same semantic object in computer vision. Flexible constraints can be imposed on these segmentations for different use...
Matrices are a very common way of representing and working with data in data science and artificial intelligence. Writing a small snippet of code to make a simple matrix transformation is frequently frustrating, especially for those people without extensive programming expertise. We present AUTOMAT[R]IX, a system that is able to induce R program...
Artificial Intelligence is making rapid and remarkable progress in the development of more sophisticated and powerful systems. However, the acknowledgement of several problems with modern machine learning approaches has prompted a shift in AI benchmarking away from task-oriented testing (such as Chess and Go) towards ability-oriented testing, in wh...
In the EU, the right to timely access the ‘affordable, preventive and curative health care of good quality’ and the right to ‘affordable long-term services of good quality’ are enshrined in the European Pillar of Social Rights (C(2017) 2600 final). The backbone of health and long-term care (LTC) systems’ capacity to ensure that EU citizens can exer...
This paper attempts to answer a central question in unsupervised learning: what does it mean to “make sense” of a sensory sequence? In our formalization, making sense involves constructing a symbolic causal theory that both explains the sensory sequence and also satisfies a set of unity conditions. The unity conditions insist that the constituents...
The extended mind thesis maintains that the functional contributions of tools and artifacts can become so essential for our cognition that they can be constitutive parts of our minds. In other words, our tools can be on a par with our brains: our minds and cognitive processes can literally “extend” into the tools. Several extended mind theorists ha...
In the last 20 years the Turing test has been left further behind by new developments in artificial intelligence. At the same time, however, these developments have revived some key elements of the Turing test: imitation and adversarialness. On the one hand, many generative models, such as generative adversarial networks (GAN), build imitators unde...
The Association for the Advancement of Artificial Intelligence 2020 Workshop Program included twenty-three workshops covering a wide range of topics in artificial intelligence. This report contains the required reports, which were submitted by most, but not all, of the workshop chairs.
The Apperception Engine is an unsupervised learning system. Given a sequence of sensory inputs, it constructs a symbolic causal theory that both explains the sensory sequence and also satisfies a set of unity conditions. The unity conditions insist that the constituents of the theory - objects, properties, and laws - must be integrated into a coher...
A common way of learning to perform a task is to observe how it is carried out by experts. However, it is well known that for most tasks there is no unique way to perform them. This is especially noticeable the more complex the task is because factors such as the skill or the know-how of the expert may well affect the way she solves the task. In ad...
In this paper we develop a framework for analysing the impact of AI on occupations. Leaving aside the debates on robotisation, digitalisation and online platforms as well as workplace automation, we focus on the occupational impact of AI that is driven by rapid progress in machine learning. In our framework we map 59 generic tasks from several work...
Data quality is essential for database integration, machine learning and data science in general. Despite the increasing number of tools for data preparation, the most tedious tasks of data wrangling –and feature manipulation in particular– still resist automation partly because the problem strongly depends on domain information. For instance, if t...
Programming languages such as R or Python are commonplace in data science projects. However, transforming data is usually tricky and the composition of the right primitives (using the appropriate libraries) to get the most elegant code transformation is not always easy. In this paper, we present the first system that is able to automatically synthe...
In this paper we present a setting for examining the relation between the distribution of research intensity in AI research and the relevance for a range of work tasks (and occupations) in current and simulated scenarios. We perform a mapping between labour and AI using a set of cognitive abilities as an intermediate layer. This setting favours a t...
This paper attempts to answer a central question in unsupervised learning: what does it mean to "make sense" of a sensory sequence? In our formalization, making sense involves constructing a symbolic causal theory that explains the sensory sequence and satisfies a set of unity conditions. This model was inspired by Kant's discussion of the syntheti...
The theoretical hardness of machine teaching has usually been analyzed for a range of concept languages under several variants of the teaching dimension: the minimum number of examples that a teacher needs to figure out so that the learner identifies the concept. However, for languages where concepts have structure (and hence size), such as Turing-...
Recent advances in artificial intelligence have been strongly driven by the use of game environments for training and evaluating agents. Games are often accessible and versatile, with well-defined state-transitions and goals allowing for intensive training and experimentation. However, agents trained in a particular environment are usually tested o...
The workshop program of the Association for the Advancement of Artificial Intelligence’s 33rd Conference on Artificial Intelligence (AAAI-19) was held in Honolulu, Hawaii, on Sunday and Monday, January 27–28, 2019. There were fifteen workshops in the program: Affective Content Analysis: Modeling Affect-in-Action, Agile Robotics for Industrial Autom...
The quality of the decisions made by a machine learning model depends on the data and the operating conditions during deployment. Often, operating conditions such as class distribution and misclassification costs have changed during the time since the model was trained and evaluated. When deploying a binary classifier that outputs scores, once we k...
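The threshold choice described in this abstract can be illustrated with a minimal sketch: for a well-calibrated probabilistic classifier, the cost-optimal cutoff depends only on the two misclassification costs. This is an illustrative textbook baseline under that calibration assumption, not the paper's own procedure.

```python
def bayes_threshold(c_fp: float, c_fn: float) -> float:
    """Cost-optimal cutoff for a calibrated score p = P(y=1|x).

    Predicting positive is cheaper than predicting negative exactly
    when (1 - p) * c_fp < p * c_fn, i.e. when p exceeds this value.
    """
    return c_fp / (c_fp + c_fn)

# Equal costs recover the familiar 0.5 cutoff.
print(bayes_threshold(1.0, 1.0))  # 0.5
# If false negatives are four times as costly, the cutoff drops to 0.2.
print(bayes_threshold(1.0, 4.0))  # 0.2
```

In this sketch, a change in class distribution at deployment is absorbed by recalibrating the scores to the new prior before applying the cutoff.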
AI systems are usually evaluated on a range of problem instances and compared to other AI systems that use different strategies. These instances are rarely independent. Machine learning, and supervised learning in particular, is a very good example of this. Given a machine learning model, its behaviour for a single instance cannot be understood in...
The causes underlying unfair decision making are complex, being internalised in different ways by decision makers, other actors dealing with data and models, and ultimately by the individuals being affected by these decisions. One frequent manifestation of all these latent causes arises in the form of missing values: protected groups are more reluc...
Many areas of AI today use benchmarks and competitions with larger and wider sets of tasks. This tries to deter AI systems (and research effort) from specialising to a single task, and encourage them to be prepared to solve previously unseen tasks. It is unclear, however, whether the methods with best performance are actually those that are most ge...
Humans infer much of the intentions of others by just looking at their gaze. Similarly, we want to understand how machine learning systems solve a problem. New tools are developed to find out what strategies a learning machine is using, such as what it is paying attention to when classifying images.
Humans and AI systems are usually portrayed as separate systems that we need to align in values and goals. However, there is a great deal of AI technology found in non-autonomous systems that are used as cognitive tools by humans. Under the extended mind thesis, the functional contributions of these tools become as essential to our cognition as our...
Artificial intelligence is set to rival the human mind, just as the engine did the horse. José Hernández-Orallo looks at how we compare cognitive performance.
With the purpose of better analysing the result of AI benchmarks, we present two indicators on the side of the AI problems, difficulty and discrimination, and two indicators on the side of the AI systems, ability and generality. The first three are adapted from psychometric models in item response theory (IRT), whereas generality is defined as a ne...
Item response theory (IRT) can be applied to the analysis of the evaluation of results from AI benchmarks. The two-parameter IRT model provides two indicators (difficulty and discrimination) on the side of the item (or AI problem) while only one indicator (ability) on the side of the respondent (or AI agent). In this paper we analyse how to make th...
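The two-parameter model mentioned in this abstract has a standard logistic form; a minimal sketch of the conventional 2PL item characteristic curve (illustrative, not code from the paper):

```python
import math

def irt_2pl(theta: float, difficulty: float, discrimination: float) -> float:
    """Two-parameter logistic IRT model: probability that a respondent
    (here, an AI agent) with ability `theta` responds correctly to an
    item with the given difficulty and discrimination."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))

# When ability equals difficulty, the success probability is 0.5,
# regardless of discrimination.
print(irt_2pl(0.0, 0.0, 2.0))            # 0.5
# Higher ability raises it; discrimination controls the slope.
print(round(irt_2pl(1.0, 0.0, 2.0), 3))  # 0.881
```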
Given one or two examples, humans are good at understanding how to solve a problem independently of its domain, because they are able to detect what the problem is and to choose the appropriate background knowledge according to the context. For instance, presented with the string "8/17/2017" to be transformed to "17th of August of 2017", humans wil...
We address the novel question of determining which kind of machine learning model is behind the predictions when we interact with a black-box model. This may allow us to identify families of techniques whose models exhibit similar vulnerabilities and strengths. In our method, we first consider how an adversary can systematically query a given black...
New types of artificial intelligence (AI), from cognitive assistants to social robots, are challenging meaningful comparison with other kinds of intelligence. How can such intelligent systems be catalogued, evaluated, and contrasted, with representations and projections that offer meaningful insights? To catalyse the research in AI and the future o...
Machine learning (ML) models make decisions for governments, companies, and individuals. Accordingly, there is the increasing concern of not having a rich explanatory and predictive account of the behaviour of these ML models relative to the users’ interests (goals) and (pre-)conceptions (ontologies). We argue that the recent research trends in fin...
This paper presents a multidisciplinary task approach for assessing the impact of artificial intelligence on the future of work. We provide definitions of a task from two main perspectives: socio-economic and computational. We propose to explore ways in which we can integrate or map these perspectives, and link them with the skills or capabilities...
We present nine facets for the analysis of the past and future evolution of AI. Each facet has also a set of edges that can summarise different trends and contours in AI. With them, we first conduct a quantitative analysis using the information from two decades of AAAI/IJCAI conferences and around 50 years of documents from AI topics, an official d...
This document contains the outcome of the first Human behaviour and machine intelligence (HUMAINT) workshop that took place 5-6 March 2018 in Barcelona, Spain. The workshop was organized in the context of a new research programme at the Centre for Advanced Studies, Joint Research Centre of the European Commission, which focuses on studying the pote...
We analyze and reframe AI progress. In addition to the prevailing metrics of performance, we highlight the usually neglected costs paid in the development and deployment of a system, including: data, expert knowledge, human oversight, software resources, computing cycles, hardware and network facilities, development time, etc. These costs are paid...
We investigate the teaching of infinite concept classes through the effect of the learning bias (which is used by the learner to prefer some concepts over others and by the teacher to devise the teaching examples) and the sampling bias (which determines how the concepts are sampled from the class). We analyse two important classes: Turing machines...
We report on a series of new platforms and events dealing with AI evaluation that may change the way in which AI systems are compared and their progress is measured. The introduction of a more diverse and challenging set of tasks in these platforms can feed AI research in the years to come, shaping the notion of success and the directions of the fi...
The evaluation of artificial intelligence systems and components is crucial for the progress of the discipline. In this paper we describe and critically assess the different ways AI systems are evaluated, and the role of components and techniques in these systems. We first focus on the traditional task-oriented evaluation approach. We identify thre...
We present an event detection system in a laparoscopic surgery domain, as part of a more ambitious supervision by observation project. The system, which only requires the incorporation of two cameras in a laparoscopic training box, integrates several computer vision and machine learning techniques to detect the states and movements of the elements...
We propose an extension of the Cross Industry Standard Process for Data Mining (CRISP-DM) which addresses specific challenges of machine learning and data mining for context and model reuse handling. This new general context-aware process model is mapped onto the CRISP-DM reference model, proposing some new or enhanced outputs.
One of the major difficulties in activity recognition stems from the lack of a model of the world where activities and events are to be recognised. When the domain is fixed and repetitive we can manually include this information using some kind of ontology or set of constraints. On many occasions, however, there are many new situations for which on...
While some computational models of intelligence test problems were proposed throughout the second half of the XXth century, in the first years of the XXIst century we have seen an increasing number of computer systems being able to score well on particular intelligence test tasks. However, despite this increasing trend there has been no general acco...
The progression in several cognitive tests for the same subjects at different ages provides valuable information about their cognitive development. One question that has caught recent interest is whether the same approach can be used to assess the cognitive development of artificial systems. In particular, can we assess whether the ‘fluid’ or ‘crys...
Are psychometric tests valid for a new reality of artificial intelligence systems, technology-enhanced humans, and hybrids yet to come? Are the Turing Test, the ubiquitous CAPTCHAs, and the various animal cognition tests the best alternatives? In this fascinating and provocative book, José Hernández-Orallo formulates major scientific questions, int...
We describe a systematic approach called reframing, defined as the process of preparing a machine learning model (e.g., a classifier) to perform well over a range of operating contexts. One way to achieve this is by constructing a versatile model, which is not fitted to a particular context, and thus enables model reuse. We formally characterise re...
Item response theory (IRT) is widely used to measure latent abilities of subjects (specially for educational testing) based on their responses to items with different levels of difficulty. The adaptation of IRT has been recently suggested as a novel perspective for a better understanding of the results of machine learning experiments and, by extens...
Some supervised tasks are presented with a numerical output but decisions have to be made in a discrete, binarised, way, according to a particular cutoff. This binarised regression task is a very common situation that requires its own analysis, different from regression and classification—and ordinal regression. We first investigate the application...
While some computational models of intelligence test problems were proposed throughout the second half of the XXth century, in the first years of the XXIst century we have seen an increasing number of computer systems being able to score well on particular intelligence test tasks. However, despite this increasing trend there has been no ge...
The integration of multidimensional data and machine learning seems to be natural in the area of business intelligence. On-Line Analytical Processing (OLAP) tools are frequent in this area where the data are usually represented in multidimensional datamarts and data mining tools are integrated in some of these tools. However, the efforts for a full...
Much of the world's population uses computers for everyday tasks, but most fail to benefit from the power of computation due to their inability to program. Most crucially, users often have to perform repetitive actions manually because they are not able to use the macro languages available for many application programs. Recently, a first mass-market...
Identifying the balance between remembering and forgetting is the key to abstraction in the human brain and, therefore, the creation of memories and knowledge. We present an incremental, lifelong view of knowledge acquisition which tries to improve task after task by determining what to keep, consolidate and forget, overcoming the stability–plastic...
Multidimensional data is systematically analysed at multiple granularities by applying aggregate and disaggregate operators (e.g., by the use of OLAP tools). For instance, in a supermarket we may want to predict sales of tomatoes for next week, but we may also be interested in predicting sales for all vegetables (higher up in the product hierarchy)...
In recent years the number of research projects on computer programs solving human intelligence problems in artificial intelligence (AI), artificial general intelligence, as well as in Cognitive Modelling, has significantly grown. One reason could be the interest of such problems as benchmarks for AI algorithms. Another, more fundamental, motivatio...
We establish a setting for asynchronous stochastic tasks that account for episodes, rewards and responses, and, most especially, the computational complexity of the algorithm behind an agent solving a task. This is used to determine the difficulty of a task as the (logarithm of the) number of computational steps required to acquire an acceptable po...
We explore the aggregation of tasks by weighting them using a difficulty function that depends on the complexity of the (acceptable) policy for the task (instead of a universal distribution over tasks or an adaptive test). The resulting aggregations and decompositions are (now retrospectively) seen as the natural (and trivial) interactive generalis...
The evaluation of an ability or skill happens in some kind of testbed, and so does with social intelligence. Of course, not all testbeds are suitable for this matter. But, how can we be sure of their appropriateness? In this paper we identify the components that should be considered in order to measure social intelligence, and provide some instrume...
This note revisits the concepts of task and difficulty. The notion of cognitive task and its use for the evaluation of intelligent systems is still replete with issues. The view of tasks as MDP in the context of reinforcement learning has been especially useful for the formalisation of learning tasks. However, this alternate interaction does not ac...
The application of cognitive mechanisms to support knowledge acquisition is, from our point of view, crucial for making the resulting models coherent, efficient, credible, easy to use and understandable. In particular, there are two characteristic features of intelligence that are essential for knowledge development: forgetting and consolidation. B...
Mundus vult decipi, ergo decipiatur—the world wants to be deceived, so let it be deceived. Artificial intelligence (AI) has been a deceiving discipline: AI addresses those tasks that, if performed by humans, would require intelligence, but have been solved without featuring any genuine intelligence. This delusion has come, in return, with algorithmi...
In this exploratory note we ask the question of what a measure of performance for all tasks is like if we use a weighting of tasks based on a difficulty function. This difficulty function depends on the complexity of the (acceptable) solution for the task (instead of a universal distribution over tasks or an adaptive test). The resulting aggregatio...
A more effective vision of machine learning systems entails tools that are able to improve task after task and to reuse the patterns and knowledge that are acquired previously for future tasks. This incremental, long-life view of machine learning goes beyond most of state-of-the-art machine learning techniques that learn throwaway models. In this p...
Common-day applications of predictive models usually involve the full use of the available contextual information. When the operating context changes, one may fine-tune the by-default (incontextual) prediction or may even abstain from predicting a value (a reject). Global reframing solutions, where the same function is applied to adapt the estimate...
Artificial intelligence develops techniques and systems whose performance must be evaluated on a regular basis in order to certify and foster progress in the discipline. We will describe and critically assess the different ways AI systems are evaluated. We first focus on the traditional task-oriented evaluation approach. We see that black-box (beha...
Social intelligence in natural and artificial systems is usually measured by the evaluation of associated traits or tasks that are deemed to represent some facets of social behaviour. The amalgamation of these traits is then used to configure the intuitive notion of social intelligence. Instead, in this paper we start from a parametrised definition...
The problem of estimating the class distribution (or prevalence) for a new unlabelled dataset (from a possibly different distribution) is a very common problem which has been addressed in one way or another in the past decades. This problem has been recently reconsidered as a new task in data mining, renamed quantification when the estimation is pe...
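The prevalence-estimation task described in this abstract is often illustrated with the classic Adjusted Classify & Count correction; a minimal sketch of that textbook baseline (not necessarily the method studied in the paper):

```python
def adjusted_count(cc: float, tpr: float, fpr: float) -> float:
    """Adjusted Classify & Count: correct the raw positive-prediction
    rate `cc` observed on the unlabelled data, using tpr/fpr estimated
    on labelled data, to recover the true prevalence."""
    prevalence = (cc - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, prevalence))  # clip to a valid proportion

# A classifier with tpr=0.9 and fpr=0.2 flags 55% of a new sample as
# positive; the corrected estimate is (0.55 - 0.2) / (0.9 - 0.2) = 0.5.
print(round(adjusted_count(0.55, 0.9, 0.2), 6))  # 0.5
```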
We present and develop the notion of ‘universal psychometrics’ as a subject of study, and eventually a discipline, that focusses on the measurement of cognitive abilities for the machine kingdom, which comprises any (cognitive) system, individual or collective, either artificial, biological or hybrid. Universal psychometrics can be built, of course...
The notion of a universal intelligence test has been recently advocated as a means to assess humans, non-human animals and machines in an integrated, uniform way. While the main motivation has been the development of machine intelligence tests, the mere concept of a universal test has many implications in the way human intelligence tests are unders...
This paper presents a way to estimate the difficulty and discriminating power of any task instance. We focus on a very general setting for tasks: interactive (possibly multi-agent) environments where an agent acts upon observations and rewards. Instead of analysing the complexity of the environment, the state space or the actions that are performed...
Receiver Operating Characteristic (ROC) analysis is one of the most popular tools for the visual assessment and understanding of classifier performance. In this paper we present a new representation of regression models in the so-called regression ROC (RROC) space. The basic idea is to represent over-estimation against under-estimation. The curves...
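The core construction, plotting total over-estimation against total under-estimation as the predictions are shifted by a constant, can be sketched as follows (an illustrative reading of the abstract; the helper name is my own, not code from the paper):

```python
def rroc_point(errors, shift):
    """One point of a regression ROC (RROC) curve: after adding a
    constant `shift` to every prediction, return total over-estimation
    (x-axis) and total under-estimation (y-axis)."""
    shifted = [e + shift for e in errors]
    over = sum(e for e in shifted if e > 0)
    under = sum(e for e in shifted if e < 0)
    return over, under

# Residuals (prediction - actual) of a toy regression model.
errors = [-2.0, -0.5, 1.0, 1.5]
print(rroc_point(errors, 0.0))  # (2.5, -2.5)
# Shifting all predictions upwards trades under- for over-estimation,
# tracing out the curve.
print(rroc_point(errors, 0.5))  # (3.5, -1.5)
```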