Cèsar Ferri

Universitat Politècnica de València | UPV · Department of Computer Systems and Computation

About

124
Publications
28,462
Reads
2,954
Citations
Citations since 2017
27 Research Items
1560 Citations
[Chart: citations per year, 2017-2023]

Publications (124)
Article
Full-text available
The automation of data science and other data manipulation processes depends on the integration and formatting of ‘messy’ data. Data wrangling is an umbrella term for these tedious and time-consuming tasks. Tasks such as transforming dates, units or names expressed in different formats have been challenging for machine learning because (1) users exp...
Conference Paper
Over the past decades in the field of machine teaching, several restrictions have been introduced to avoid ‘cheating’, such as collusion-free or non-clashing teaching. However, these restrictions forbid several teaching situations that we intuitively consider natural and fair, especially those ‘changes of mind’ of the learner as more evidence is gi...
Chapter
We present muppets, a framework for partitioning cells in a table in segments that fulfil the same semantic role or belong to the same semantic data type, similar to how image segmentation is used to group pixels that represent the same semantic object in computer vision. Flexible constraints can be imposed on these segmentations for different use...
Article
Full-text available
Matrices are a very common way of representing and working with data in data science and artificial intelligence. Writing a small snippet of code to make a simple matrix transformation is frequently frustrating, especially for people without extensive programming expertise. We present AUTOMAT[R]IX, a system that is able to induce R program...
Article
A common way of learning to perform a task is to observe how it is carried out by experts. However, it is well known that for most tasks there is no unique way to perform them. This is especially noticeable the more complex the task is because factors such as the skill or the know-how of the expert may well affect the way she solves the task. In ad...
Chapter
Data quality is essential for database integration, machine learning and data science in general. Despite the increasing number of tools for data preparation, the most tedious tasks of data wrangling –and feature manipulation in particular– still resist automation partly because the problem strongly depends on domain information. For instance, if t...
Chapter
Programming languages such as R or Python are commonplace in data science projects. However, transforming data is usually tricky and the composition of the right primitives (using the appropriate libraries) to get the most elegant code transformation is not always easy. In this paper, we present the first system that is able to automatically synthe...
Article
CRISP-DM (CRoss-Industry Standard Process for Data Mining) has its origins in the second half of the nineties and is thus about two decades old. According to many surveys and user polls it is still the de facto standard for developing data mining and knowledge discovery projects. However, undoubtedly the field has moved on considerably in twenty year...
Thesis
Full-text available
Spectrum monitoring is an important part of the radio spectrum management process, providing feedback on the workflow that allows for our current wirelessly interconnected lifestyle. The constantly increasing number of users and uses of wireless technologies is pushing the limits and capabilities of the existing infrastructure, demanding new altern...
Article
Full-text available
The theoretical hardness of machine teaching has usually been analyzed for a range of concept languages under several variants of the teaching dimension: the minimum number of examples that a teacher needs to figure out so that the learner identifies the concept. However, for languages where concepts have structure (and hence size), such as Turing-...
Article
Full-text available
The quality of the decisions made by a machine learning model depends on the data and the operating conditions during deployment. Often, operating conditions such as class distribution and misclassification costs have changed during the time since the model was trained and evaluated. When deploying a binary classifier that outputs scores, once we k...
Preprint
Full-text available
The causes underlying unfair decision making are complex, being internalised in different ways by decision makers, other actors dealing with data and models, and ultimately by the individuals being affected by these decisions. One frequent manifestation of all these latent causes arises in the form of missing values: protected groups are more reluc...
Article
Full-text available
In this work, we present a dataset which provides information on the scientific program of a set of machine learning conferences. Data were extracted from the IEEE Xplore Digital Library and the official web site of the International Conference on Machine Learning Applications (ICMLA). We include data from four different editions (from 2014 to 2017)....
Chapter
In this paper, we introduce SALER, an ongoing project developed by the Universitat Politècnica de València (Spain) which aims at detecting and preventing bad practices and fraud in public administration. The main contribution of the project is the development of a data science-based solution to systematically assist managing authorities to increase...
Preprint
Full-text available
Given one or two examples, humans are good at understanding how to solve a problem independently of its domain, because they are able to detect what the problem is and to choose the appropriate background knowledge according to the context. For instance, presented with the string "8/17/2017" to be transformed to "17th of August of 2017", humans wil...
Chapter
Machine learning (ML) models make decisions for governments, companies, and individuals. Accordingly, there is the increasing concern of not having a rich explanatory and predictive account of the behaviour of these ML models relative to the users’ interests (goals) and (pre-)conceptions (ontologies). We argue that the recent research trends in fin...
Article
The improvement in the performance of classifiers has been the focus of attention of many researchers over the last few decades. Obtaining accurate predictions becomes more complicated as the number of classes increases. Most families of classification techniques generate models that define decision boundaries trying to separate the classes as well...
Article
Full-text available
Creating sessions in scientific conferences consists in grouping papers with common topics taking into account the size restrictions imposed by the conference schedule. Therefore, this problem can be considered as semi-supervised clustering of documents based on their content. This paper aims to propose modifications in traditional clustering algor...
Article
Full-text available
In the last decades, one issue that has received a lot of attention in classification problems is how to obtain better classifications. This problem becomes even more complicated when the number of classes is high. In this multiclass scenario, it is assumed that the class labels are independent of each other, and thus, most techniques and methods p...
Conference Paper
We present an event detection system in a laparoscopic surgery domain, as part of a more ambitious supervision by observation project. The system, which only requires the incorporation of two cameras in a laparoscopic training box, integrates several computer vision and machine learning techniques to detect the states and movements of the elements...
Article
Full-text available
We propose an extension of the Cross Industry Standard Process for Data Mining (CRISP-DM) which addresses specific challenges of machine learning and data mining for context and model reuse handling. This new general context-aware process model is mapped to the CRISP-DM reference model, proposing some new or enhanced outputs.
Article
The progression in several cognitive tests for the same subjects at different ages provides valuable information about their cognitive development. One question that has caught recent interest is whether the same approach can be used to assess the cognitive development of artificial systems. In particular, can we assess whether the ‘fluid’ or ‘crys...
Article
The wide propagation of devices, such as mobile phones, that include a global positioning system (GPS) sensor has popularised the storing of geographic information for different kinds of activities, many of them recreational, such as sport. Extracting and learning knowledge from GPS data can provide useful geographic information that can be used for...
Article
People living in urban areas are exposed to outdoor air pollution. Air contamination is linked to numerous premature and pre-native deaths each year. Urban air pollution is estimated to cost approximately 2% of GDP in developed countries and 5% in developing countries. Some works reckon that vehicle emissions produce over 90% of air pollution in ci...
Article
Full-text available
Some supervised tasks are presented with a numerical output but decisions have to be made in a discrete, binarised, way, according to a particular cutoff. This binarised regression task is a very common situation that requires its own analysis, different from regression and classification—and ordinal regression. We first investigate the application...
Article
Identifying the balance between remembering and forgetting is the key to abstraction in the human brain and, therefore, the creation of memories and knowledge. We present an incremental, lifelong view of knowledge acquisition which tries to improve task after task by determining what to keep, consolidate and forget, overcoming the stability–plastic...
Article
The first international workshop on Learning over Multiple Contexts, devoted to generalization and reuse of machine learning models over multiple contexts, was held on September 19th, 2014, as part of the 7th European machine learning and data mining conference (ECML-PKDD 2014) in Nancy, France. This short report summarizes the presentations and di...
Article
Full-text available
The application of cognitive mechanisms to support knowledge acquisition is, from our point of view, crucial for making the resulting models coherent, efficient, credible, easy to use and understandable. In particular, there are two characteristic features of intelligence that are essential for knowledge development: forgetting and consolidation. B...
Article
This paper presents Airvlc, an application for producing real-time urban air pollution forecasts for the city of Valencia in Spain. Although many cities provide air quality data, in many cases, this information is presented with significant delays (three hours for the city of Valencia) and it is limited to the area where the measurement stations ar...
Conference Paper
Full-text available
A more effective vision of machine learning systems entails tools that are able to improve task after task and to reuse the patterns and knowledge that were acquired previously for future tasks. This incremental, lifelong view of machine learning goes beyond most state-of-the-art machine learning techniques, which learn throwaway models. In this p...
Article
The problem of estimating the class distribution (or prevalence) for a new unlabelled dataset (from a possibly different distribution) is a very common problem which has been addressed in one way or another in the past decades. This problem has been recently reconsidered as a new task in data mining, renamed quantification when the estimation is pe...
Article
Full-text available
In this paper, we push forward the idea of machine learning systems whose operators can be modified and fine-tuned for each problem. This allows us to propose a learning paradigm where users can write (or adapt) their operators, according to the problem, data representation and the way the information should be navigated. To achieve this goal, data...
Book
In this volume, the authors apply insights from a variety of perspectives to explore the alignment among strategy, organization design, process and human resource management, and e-business practices on developing successful social networking programs, with particular regard to applying such initiatives against the backdrop of the global financial c...
Article
ROC curves and cost curves are two popular ways of visualising classifier performance, finding appropriate thresholds according to the operating condition, and deriving useful aggregated measures such as the area under the ROC curve (AUC) or the area under the optimal cost curve. In this paper we present new findings and connections between ROC spa...
Conference Paper
Full-text available
Policy reuse is a kind of transfer learning to improve a reinforcement learning system by reusing part of the information of the state-value function from previous problems to new problems. In this paper we overhaul this approach in the context of a general learning framework for structured prediction using user-defined operators and a functional p...
Article
Distance-based and generalization-based methods are two families of artificial intelligence techniques that have been successfully used over a wide range of real-world problems. In the first case, general algorithms can be applied to any data representation by just changing the distance. The metric space sets the search and learning space, which is...
Article
Full-text available
Many performance metrics have been introduced in the literature for the evaluation of classification performance, each of them with different origins and areas of application. These metrics include accuracy, unweighted accuracy, the area under the ROC curve or the ROC convex hull, the mean absolute error and the Brier score or mean squared error (w...
Conference Paper
Full-text available
In this paper, we push forward the idea of machine learning systems for which the operators can be modified and fine-tuned for each problem. This allows us to propose a learning paradigm where users can write (or adapt) their operators, according to the problem, data representation and the way the information should be navigated. To achieve this goa...
Article
A general approach to classifier combination considers each model as a probabilistic classifier which outputs a class membership posterior probability. In this general scenario, it is not only the quality and diversity of the models which are relevant, but the level of calibration of their estimated probabilities as well. In this paper, we study th...
Article
Negotiation and agreement generally require models of the peers who are involved in the negotiation. One typical area where negotiation takes place is in selling and retailing, which is also known as Customer Relationship Management (CRM). Customers and products are usually modelled using previous retailing experiences with similar or dissimilar cu...
Article
Full-text available
In this paper we analyse three different techniques to establish an optimal-cost class threshold when training data is not available. One technique is directly derived from the definition of cost, a second one is derived from a ranking of estimated probabilities and the third one is based on ROC analysis. We analyse the approaches theoretically and...
Article
Full-text available
Learning methods based on distances are widely used to deal with structured information, since several distance functions can be found for the most common sorts of data. In these algorithms the justification of the labelling of a new case is normally guided by a pattern expressing the similarity to a prototype. Other patterns based on the structur...
Chapter
The evaluation of machine learning models is a crucial step before their application because it is essential to assess how well a model will behave for every single case. In many real applications, not only is it important to know the “total” or the “average” error of the model, it is also important to know how this error is distributed and how wel...
Article
Full-text available
Many performance metrics have been introduced for the evaluation of classification performance, with different origins and niches of application: accuracy, macro-accuracy, area under the ROC curve, the ROC convex hull, the absolute error, and the Brier score (with its decomposition into refinement and calibration). One way of understanding the rela...
Article
Full-text available
Data mining is usually concerned with the construction of accurate models from data, which are usually applied to well-defined problems that can be clearly isolated and formulated independently from other problems. Although much computational effort is devoted to their training and statistical evaluation, model deployment can also represent a scient...
Conference Paper
In some data mining problems, there are some input features that can be freely modified at prediction time. Examples happen in retailing, prescription or control (prices, warranties, medicine doses, delivery times, temperatures, etc.). If a traditional model is learned, many possible values for the special attribute will have to be tried to attain...
Article
Full-text available
ROC curves and cost curves are two popular ways of visualising classifier performance, finding appropriate thresholds according to the operating condition, and deriving useful aggregated measures such as the area under the ROC curve (AUC) or the area under the optimal cost curve. In this note we present some new findings and connections between ROC...
Conference Paper
Full-text available
It is often necessary to evaluate classifier performance over a range of operating conditions, rather than as a point estimate. This is typically assessed through the construction of 'curves' over a 'space', visualising how one or two performance metrics vary with the operating condition. For binary classifiers in particular, cost space is a natura...
Conference Paper
Full-text available
The area under the ROC curve (AUC), a well-known measure of ranking performance, is also often used as a measure of classification performance, aggregating over decision thresholds as well as class and cost skews. However, David Hand has recently argued that AUC is fundamentally incoherent as a measure of aggregated classifier performance and propo...
Article
Terms are the basis for functional and logic programming representations. In turn, functional and logic programming can be used for knowledge representation in a variety of applications (knowledge-based systems, data mining, etc.). Distances between terms provide a very useful tool to compare terms and arrange the search space in many of these appl...
Conference Paper
Full-text available
In this work, we present an instantiation of our framework for Hierarchical Distance-based Conceptual Clustering (HDCC) using sequences, a particular kind of structured data. We analyse the relationship between distances and generalisation operators for sequences in the context of HDCC. HDCC is a general approach to conceptual clustering that exten...
Conference Paper
Full-text available
This paper presents Newton trees, a redefinition of probability estimation trees (PET) based on a stochastic understanding of decision trees that follows the principle of attraction (relating mass and distance through the Inverse Square Law). The structure, application and the graphical representation of Newton trees provide a way to make their sto...
Conference Paper
Full-text available
Quantification is the name given to a novel machine learning task which deals with correctly estimating the number of elements of one class in a set of examples. The output of a quantifier is a real value. Since training instances are the same as in a classification problem, a natural approach is to train a classifier and to derive a quantifier from i...
Conference Paper
Full-text available
In this work, we introduce a new distance function for data representations based on first-order logic (atoms, to be more precise) which integrates the main advantages of the distances that have been previously presented in the literature. Basically, our distance simultaneously takes into account some relevant aspects, concerning atom-based present...
Conference Paper
Full-text available
In some application areas, similarities and distances are used to calculate how similar two objects are in order to use these measurements to find related objects, to cluster a set of objects, to make classifications or to perform an approximate search guided by the distance. In many other application areas, we require patterns to describe similari...
Article
Full-text available
In this work, we introduce a new distance function for data representations based on first-order logic (atoms, to be more precise) which integrates the main advantages of the distances that have been previously presented in the literature. Basically, our distance simultaneously takes into account some relevant aspects, concerning atom-based presen...
Conference Paper
Full-text available
In this paper we revisit the problem of classifier calibration, motivated by the issue that existing calibration methods ignore the problem attributes (i.e., they are univariate). We propose a new calibration method inspired by binning-based methods in which the calibrated probabilities are obtained from k instances from a dataset. Bins are constru...
Conference Paper
Full-text available
In this work we analyse the relationship between distance and generalisation operators for real numbers, nominal data and tuples in the context of hierarchical distance-based conceptual clustering (HDCC). HDCC is a general approach to conceptual clustering that extends the traditional algorithm for hierarchical clustering by producing conceptual ge...
Article
Performance metrics in classification are fundamental in assessing the quality of learning methods and learned models. However, many different measures have been defined in the literature with the aim of making better choices in general or for a specific application area. Choices made by one metric are claimed to be different from choices made by o...
Article
Evaluation of machine learning methods is a crucial step before application, because it is essential to assess how well a model will behave for every single case. In many real applications, not only is the "total" or the "average" error of the model important, but it is also important to know how this error is distributed or how well confid...
Conference Paper
Full-text available
In this work we analyse the relation between hierarchical distance-based clustering and the concepts that can be obtained from the hierarchy by generalisation. Many inconsistencies may arise, because the distance and the conceptual generalisation operator are usually incompatible. To overcome this, we propose an algorithm which integrates distance-...
Conference Paper
Full-text available
Frequently, organisations have to face complex situations where decision making is difficult. In these scenarios, several related decisions must be made at a time, which are also bounded by constraints (e.g. inventory/stock limitations, costs, limited resources, time schedules, etc). In this paper, we present a new method to make a good global deci...
Conference Paper
Full-text available
The area under the ROC curve (AUC) has been widely used to measure ranking performance for binary classification tasks. AUC only employs the classifier’s scores to rank the test instances; thus, it ignores other valuable information conveyed by the scores, such as sensitivity to small differences in the score values. However, as such differences are...
Article
Full-text available
Learning from structured data is becoming increasingly important. Besides the well-known approaches which deal directly with complex data representations (inductive logic programming and multi-relational data mining), new techniques have been recently proposed by upgrading propositional learning algorithms. Focusing on distance-based metho...