Project

EU H2020-MSCA-RISE NeEDS: Research and Innovation Staff Exchange Network of European Data Scientists

Goal: NeEDS (Network of European Data Scientists) provides an integrated modelling and computing environment that facilitates data analysis and data visualization to enhance interaction. NeEDS brings together an excellent interdisciplinary research team that integrates expertise from three relevant academic disciplines, namely Mathematical Optimization, Visualization and Network Science, and is well placed to tackle these challenges. NeEDS develops mathematical models yielding results that are interpretable, easy to visualize, and flexible enough to incorporate user knowledge from complex data. These models require the numerical resolution of computationally demanding Mixed Integer Nonlinear Programming formulations, and for this purpose NeEDS develops innovative mathematical optimization-based heuristics.

Date: 1 January 2019 - 31 December 2022


Project log

Cristina Molero-Río
added a research item
In this paper, we tailor optimal randomized regression trees to handle multivariate functional data. A compromise between prediction accuracy and sparsity is sought. While fitting the tree model, we detect a reduced number of intervals that are critical for prediction and control their length. Local and global sparsity can be modeled through the inclusion of LASSO-type regularization terms over the coefficients associated with functional predictor variables. The resulting optimization problem is formulated as a nonlinear continuous and smooth model with linear constraints. We illustrate that our approach with small depth is competitive against benchmarks.
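To make the regularization concrete, here is a minimal numpy sketch of a LASSO-type penalty over functional coefficients: each coefficient function is penalized by the (trapezoidal) integral of its absolute value over the observation grid, driving whole coefficient functions to zero. The function name and grid are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def functional_lasso_penalty(betas, grid):
    """LASSO-type penalty for functional coefficients: for each functional
    predictor j, approximate the integral of |beta_j(t)| over the time grid
    with the trapezoidal rule and sum across predictors."""
    total = 0.0
    for b in betas:
        ab = np.abs(b)
        total += np.sum((ab[1:] + ab[:-1]) / 2.0 * np.diff(grid))
    return total

# Two functional predictors observed on a common grid over [0, 1].
grid = np.linspace(0.0, 1.0, 101)
beta1 = np.sin(np.pi * grid)   # active coefficient function
beta2 = np.zeros_like(grid)    # inactive coefficient function: contributes 0
penalty = functional_lasso_penalty([beta1, beta2], grid)
print(round(penalty, 4))  # ≈ 2/pi ≈ 0.6366, the integral of |sin(pi t)| on [0, 1]
```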
M. Remedios Sillero-Denamiel
added a research item
Many real-life applications consider nominal categorical predictor variables that have a hierarchical structure, e.g. economic activity data in Official Statistics. In this paper, we focus on linear regression models built in the presence of this type of nominal categorical predictor variables, and study the consolidation of their categories to have a better tradeoff between interpretability and fit of the model to the data. We propose the so-called Tree based Linear Regression (TLR) model that optimizes both the accuracy of the reduced linear regression model and its complexity, measured as a cost function of the level of granularity of the representation of the hierarchical categorical variables. We show that finding non-dominated outcomes for this problem boils down to solving Mixed Integer Convex Quadratic Problems with Linear Constraints, and small to medium size instances can be tackled using off-the-shelf solvers. We illustrate our approach in two real-world datasets, as well as a synthetic one, where our methodology finds a much less complex model with a very mild worsening of the accuracy.
Marcela Galvis-Restrepo
added a research item
In recent years, supervised classification has been used to support or even replace human decisions in high stakes domains such as pre-trial risk assessment, police stop-and-frisk programs, credit scoring, insurance premiums and healthcare access. The training of these algorithms uses historical data which might be biased against individuals with certain sensitive characteristics. The increasing concern over potential biases has motivated lawmakers to pass anti-discrimination laws which prohibit unfair treatment based on characteristics such as gender or race. In this paper, we propose a methodology that enhances the trade-off between accuracy and unfairness in classification. We use a numerical method that shrinks the possible values the predictors can take guided by a linear combination of the accuracy and the disparate mistreatment, our measure of unfairness, of the shrunk model. We illustrate the performance of our methodology in terms of accuracy and unfairness on a collection of real-world datasets.
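The unfairness measure can be illustrated with a small sketch: one common form of disparate mistreatment is the gap in false-positive rates between two sensitive groups (the false-negative-rate gap is analogous). This is a generic illustration, not the paper's exact criterion or shrinking method.

```python
import numpy as np

def disparate_mistreatment(y_true, y_pred, s):
    """Absolute gap in false-positive rates between sensitive groups s == 0
    and s == 1; a fair classifier should keep this gap small."""
    def fpr(t, p):
        neg = t == 0                      # true negatives
        return np.mean(p[neg] == 1)       # share wrongly predicted positive
    mask = s == 0
    return abs(fpr(y_true[mask], y_pred[mask]) - fpr(y_true[~mask], y_pred[~mask]))

# All individuals are true negatives; group 1 gets far more false positives.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 0, 0, 0, 1, 1, 1, 0])
s      = np.array([0, 0, 0, 0, 1, 1, 1, 1])
gap = disparate_mistreatment(y_true, y_pred, s)
print(gap)  # 0.5: FPR is 0.25 in group 0 vs 0.75 in group 1
```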
Jasone Ramírez-Ayerbe
added a research item
Due to the increasing use of complex machine learning models in high stakes decisions, post-hoc explanations have become crucial to be able to understand and explain their behaviour. An effective class of post-hoc explanations are counterfactual explanations, which are minimal perturbations of the predictor variables to change the prediction for a specific instance. Most of the research on counterfactual explainability focuses on tabular and image data and much less on models dealing with functional data. In this paper we propose a novel Mathematical Optimization formulation for constructing counterfactual explanations when dealing with functional data. Our approach is able to generate sparse and plausible counterfactuals and to identify the samples of the dataset from which the counterfactual explanation is made of. Our methodology can be used with different distance measures and is applicable to any score-based classifier. We illustrate our methodology using two different real-world datasets, one univariate and another multivariate.
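For intuition, counterfactual explanations have a closed form in the special case of a linear score: the sketch below computes the minimal L2 perturbation that flips the sign of w·x + b. The paper's formulation is far more general (functional data, sparsity, plausibility, any score-based classifier); the margin parameter here is an illustrative assumption.

```python
import numpy as np

def linear_counterfactual(x, w, b, margin=0.1):
    """Minimal L2 perturbation of x that flips the sign of the linear score
    w.x + b, moving x just past the decision boundary by `margin`."""
    score = w @ x + b
    step = (score + np.sign(score) * margin) / (w @ w)
    return x - step * w

x = np.array([2.0, 1.0])
w = np.array([1.0, -1.0])
b = 0.0
x_cf = linear_counterfactual(x, w, b)
print(w @ x_cf + b)  # ≈ -0.1: the prediction has flipped
```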
Kseniia Kurishchenko
added a research item
In this paper, we tackle the problem of enhancing the interpretability of the results of Cluster Analysis. Our goal is to find an explanation for each cluster, such that clusters are characterized as precisely and distinctively as possible, i.e., the explanation is fulfilled by as many as possible individuals of the corresponding cluster, true positive cases, and by as few as possible individuals in the remaining clusters, false positive cases. We assume that a dissimilarity between the individuals is given, and propose distance-based explanations, namely those defined by individuals that are close to their so-called prototype. To find the set of prototypes, we address the biobjective optimization problem that maximizes the total number of true positive cases across all clusters and minimizes the total number of false positive cases, while controlling the true positive rate as well as the false positive rate in each cluster. We develop two mathematical optimization models, inspired by classic Location Analysis problems, that differ in the way individuals are allocated to prototypes. We illustrate the explanations provided by these models and their accuracy on both real-life and simulated data.
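A greedy numpy sketch of the idea (not the paper's exact biobjective models): for a given cluster and radius, score each candidate prototype by true positives minus false positives of the distance-based explanation "being within the radius of the prototype".

```python
import numpy as np

def best_prototype(D, labels, cluster, radius):
    """For one cluster, pick the prototype p (an individual in the cluster)
    whose ball {i : D[i, p] <= radius} covers as many cluster members (true
    positives) and as few outsiders (false positives) as possible,
    scoring TP - FP."""
    members = np.where(labels == cluster)[0]
    best, best_score = None, -np.inf
    for p in members:
        covered = D[:, p] <= radius
        tp = np.sum(covered & (labels == cluster))
        fp = np.sum(covered & (labels != cluster))
        if tp - fp > best_score:
            best, best_score = p, tp - fp
    return best, best_score

# Toy 1-D data: points 0, 1, 2 form cluster 0; points 10, 11 form cluster 1.
pts = np.array([0.0, 1.0, 2.0, 10.0, 11.0])
D = np.abs(pts[:, None] - pts[None, :])   # pairwise dissimilarities
labels = np.array([0, 0, 0, 1, 1])
proto, score = best_prototype(D, labels, cluster=0, radius=1.0)
print(proto, score)  # the middle point covers all of cluster 0, no outsiders
```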
Dolores Romero Morales
added an update
2021 KPIs of the Machine Learning NeEDS Mathematical Optimization Online Seminar Series, https://congreso.us.es/mlneedsmo/:
44 speakers from 15 countries
> 1000 colleagues from > 80 countries subscribed
> 5000 views on @needs_project @YouTube channel, https://www.youtube.com/c/NeEDSNetworkofEuropeanDataScientists
> 3500 views on @imus_us @YouTube channel,
In 2022, we will have another round of great speakers to tell us about the latest advances in adversarial learning, fairness, Bayesian learning, mathematical optimization, machine learning and much more!
The organizers
Emilio Carrizosa, IMUS - Instituto de Matemáticas Universidad de Sevilla
Kseniia Kurishchenko, Copenhagen Business School
Cristina Molero del Río, IMUS - Instituto de Matemáticas Universidad de Sevilla
Dolores Romero Morales, Copenhagen Business School
 
M. Remedios Sillero-Denamiel
added 2 research items
The Naïve Bayes is a tractable and efficient approach for statistical classification. In general classification problems, the consequences of misclassifications may be rather different in different classes, making it crucial to control misclassification rates in the most critical and, in many real-world problems, minority cases, possibly at the expense of higher misclassification rates in less problematic classes. One traditional approach to address this problem consists of assigning misclassification costs to the different classes and applying the Bayes rule, by optimizing a loss function. However, fixing precise values for such misclassification costs may be problematic in real-world applications. In this paper we address the issue of misclassification for the Naïve Bayes classifier. Instead of requesting precise values of misclassification costs, threshold values are used for different performance measures. This is done by adding constraints to the optimization problem underlying the estimation process. Our findings show that, under a reasonable computational cost, the performance measures under consideration indeed achieve the desired levels, yielding a user-friendly constrained classification procedure.
The Naïve Bayes has proven to be a tractable and efficient method for classification in multivariate analysis. However, features are usually correlated, a fact that violates the Naïve Bayes’ assumption of conditional independence, and may deteriorate the method’s performance. Moreover, datasets are often characterized by a large number of features, which may complicate the interpretation of the results as well as slow down the method’s execution. In this paper we propose a sparse version of the Naïve Bayes classifier that is characterized by three properties. First, the sparsity is achieved taking into account the correlation structure of the covariates. Second, different performance measures can be used to guide the selection of features. Third, performance constraints on groups of higher interest can be included. Our proposal leads to a smart search, which yields competitive running times while retaining flexibility in the choice of the performance measure for classification. Our findings show that, when compared against well-referenced feature selection approaches, the proposed sparse Naïve Bayes obtains competitive results regarding accuracy, sparsity and running times for balanced datasets. In the case of datasets with unbalanced (or with different importance) classes, a better compromise between classification rates for the different classes is achieved.
Kseniia Kurishchenko
added a research item
In this paper, we make Cluster Analysis more interpretable with a new approach that simultaneously allocates individuals to clusters and gives rule-based explanations to each cluster. The traditional homogeneity metric in clustering, namely the sum of the dissimilarities between individuals in the same cluster, is enriched by considering also, for each cluster and its associated explanation, two explainability criteria, namely, the accuracy of the explanation, i.e., how many individuals within the cluster satisfy its explanation, and the distinctiveness of the explanation, i.e., how many individuals outside the cluster satisfy its explanation. Finding the clusters and the explanations optimizing a joint measure of homogeneity, accuracy, and distinctiveness is formulated as a multi-objective Mixed Integer Linear Optimization problem, and tested on a well-known real-world dataset.
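The two explainability criteria can be computed directly. The sketch below evaluates a candidate rule's accuracy and distinctiveness for one cluster; it is an illustrative helper, not the paper's multi-objective MILO model, and the rule and data are made up.

```python
import numpy as np

def rule_quality(X, labels, cluster, rule):
    """Evaluate a rule-based cluster explanation: accuracy = share of the
    cluster's individuals that satisfy the rule; distinctiveness = share of
    individuals outside the cluster that do NOT satisfy it (both ideally 1)."""
    sat = np.array([rule(row) for row in X])
    inside = labels == cluster
    accuracy = sat[inside].mean()
    distinctiveness = (~sat[~inside]).mean()
    return accuracy, distinctiveness

X = np.array([[1.0], [1.2], [0.9], [3.0], [3.1]])
labels = np.array([0, 0, 0, 1, 1])
acc, dist = rule_quality(X, labels, cluster=0, rule=lambda x: x[0] < 2.0)
print(acc, dist)  # 1.0 1.0: the rule "x < 2" exactly characterizes cluster 0
```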
Dolores Romero Morales
added a research item
Due to the increasing use of Machine Learning models in high stakes decision making settings, it has become increasingly important to be able to understand how models arrive at decisions. Assuming an already trained Supervised Classification model, an effective class of post-hoc explanations are counterfactual explanations, i.e., a set of actions that can be taken by an instance such that the given Machine Learning model would have classified it in a different class. For score-based multiclass classification models, we propose novel Mathematical Optimization formulations to construct the so-called collective counterfactual explanations, i.e., explanations for a group of instances in which we minimize the perturbation in the data (at the individual and group level) to have them labelled by the classifier in a given group. Although the approach is valid for any classification model based on scores, we focus on additive tree models, like random forests or XGBoost. Our approach is capable of generating diverse, sparse, plausible and actionable collective counterfactuals. Real-world data are used to illustrate our method.
Marcela Galvis-Restrepo
added a research item
We propose a method to reduce the complexity of Generalized Linear Models in the presence of categorical predictors. The traditional one-hot encoding, where each category is represented by a dummy variable, can be wasteful, difficult to interpret, and prone to overfitting, especially when dealing with high-cardinality categorical predictors. This paper addresses these challenges by finding a reduced representation of the categorical predictors by clustering their categories. This is done through a numerical method which aims to preserve (or even, improve) accuracy, while reducing the number of coefficients to be estimated for the categorical predictors. Thanks to its design, we are able to derive a proximity measure between categories of a categorical predictor that can be easily visualized. We illustrate the performance of our approach in real-world classification and count-data datasets where we see that clustering the categorical predictors reduces complexity substantially without harming accuracy.
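As a rough illustration of category clustering (not the paper's numerical method): merge categories of a predictor whose per-category mean responses fall within a tolerance, yielding a coarser encoding with fewer coefficients to estimate. The mean-response proximity used here is a simplistic stand-in for the paper's coefficient-based proximity measure.

```python
import numpy as np

def merge_categories(cats, y, tol):
    """Reduce a high-cardinality categorical predictor by merging categories
    whose per-category mean responses differ by less than `tol`.
    Returns a mapping: original category -> merged group id."""
    levels = sorted(set(cats))
    means = {c: np.mean([yi for ci, yi in zip(cats, y) if ci == c]) for c in levels}
    order = sorted(levels, key=means.get)   # scan categories by mean response
    mapping, group, prev = {}, 0, None
    for c in order:
        if prev is not None and means[c] - means[prev] > tol:
            group += 1                      # gap too large: start a new group
        mapping[c] = group
        prev = c
    return mapping

cats = ["a", "a", "b", "b", "c", "c"]
y = np.array([1.0, 1.1, 1.05, 0.95, 5.0, 5.2])
m = merge_categories(cats, y, tol=0.5)
print(m)  # "a" and "b" merge into one group; "c" stays separate
```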
Dolores Romero Morales
added a research item
Since the seminal paper by Bates and Granger in 1969, a vast number of ensemble methods that combine different base regressors to generate a unique one have been proposed in the literature. The so-obtained regressor method may have better accuracy than its components, but at the same time it may overfit, it may be distorted by base regressors with low accuracy, and it may be too complex to understand and explain. This paper proposes and studies a novel Mathematical Optimization model to build a sparse ensemble, which trades off the accuracy of the ensemble and the number of base regressors used. The latter is controlled by means of a regularization term that penalizes regressors with a poor individual performance. Our approach is flexible to incorporate desirable properties one may have on the ensemble, such as controlling the performance of the ensemble in critical groups of records, or the costs associated with the base regressors involved in the ensemble. We illustrate our approach with real data sets arising in the COVID-19 context.
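A minimal sketch of the sparsity mechanism, assuming a plain L1 penalty over the ensemble weights rather than the paper's performance-weighted regularizer: coordinate descent over the base regressors' predictions zeroes out weak members of the ensemble.

```python
import numpy as np

def sparse_ensemble_weights(P, y, lam, n_iter=200):
    """Combine base regressors sparsely: P[:, j] holds regressor j's
    predictions; minimize 0.5*||y - P w||^2 + lam*||w||_1 by cyclic
    coordinate descent with soft-thresholding, so weak regressors get
    weight exactly zero."""
    n, k = P.shape
    w = np.zeros(k)
    for _ in range(n_iter):
        for j in range(k):
            r = y - P @ w + P[:, j] * w[j]   # residual excluding regressor j
            rho = P[:, j] @ r
            w[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / (P[:, j] @ P[:, j])
    return w

rng = np.random.default_rng(0)
y = rng.normal(size=50)
good = y + 0.1 * rng.normal(size=50)   # accurate base regressor
bad = rng.normal(size=50)              # uninformative base regressor
w = sparse_ensemble_weights(np.column_stack([good, bad]), y, lam=5.0)
print(w)  # weight close to 1 on the good regressor, (near) zero on the bad one
```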
Marcela Galvis-Restrepo
added 2 research items
In this paper, we propose a methodology to deal with Generalized Linear Models including interactions between categorical predictors. In the presence of categorical predictors, searching for interaction effects can quickly become a highly combinatorial problem when we have many of them or even a few high-cardinality categorical predictors. In these cases, including all potential interactions in the model is computationally time consuming, if not intractable. To alleviate this and at the same time enhance model interpretability without compromising accuracy, we propose to find an alternative representation for each categorical predictor as a binarized predictor via a clustering procedure. We apply our methodology to both real-world and simulated data demonstrating the usefulness of our approach.
Cristina Molero-Río
added a research item
Classification and regression trees, as well as their variants, are off-the-shelf methods in Machine Learning. In this paper, we review recent contributions within the Continuous Optimization and the Mixed-Integer Linear Optimization paradigms to develop novel formulations in this research area. We compare those in terms of the nature of the decision variables and the constraints required, as well as the optimization algorithms proposed. We illustrate how these powerful formulations enhance the flexibility of tree models, being better suited to incorporate desirable properties such as cost-sensitivity, explainability, and fairness, and to deal with complex data, such as functional data.
Cristina Molero-Río
added a research item
Classification and Regression Trees (CARTs) are off-the-shelf techniques in modern Statistics and Machine Learning. CARTs are traditionally built by means of a greedy procedure, sequentially deciding the splitting predictor variable(s) and the associated threshold. This greedy approach trains trees very fast, but, by its nature, their classification accuracy may not be competitive against other state-of-the-art procedures. Moreover, controlling critical issues, such as the misclassification rates in each of the classes, is difficult. To address these shortcomings, optimal decision trees have been recently proposed in the literature, which use discrete decision variables to model the path each observation will follow in the tree. Instead, we propose a new approach based on continuous optimization. Our classifier can be seen as a randomized tree, since at each node of the decision tree a random decision is made. The computational experience reported demonstrates the good performance of our procedure.
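The randomized-split idea can be sketched in a few lines: replacing the hard node test with a sigmoid routing probability makes the tree's output a smooth function of all its parameters, and hence amenable to continuous optimization. This is a depth-1 toy, not the paper's trained classifier.

```python
import numpy as np

def soft_tree_predict(x, a, mu, c_left, c_right):
    """Depth-1 randomized tree: instead of the hard test a.x <= mu, the
    observation goes right with probability sigmoid(a.x - mu); the
    prediction is the probability-weighted mix of the two leaf values."""
    p_right = 1.0 / (1.0 + np.exp(-(a @ x - mu)))
    return (1.0 - p_right) * c_left + p_right * c_right

a = np.array([1.0, 1.0])
mu = 0.0
lo = soft_tree_predict(np.array([-5.0, -5.0]), a, mu, 0.0, 1.0)  # deep left
hi = soft_tree_predict(np.array([5.0, 5.0]), a, mu, 0.0, 1.0)    # deep right
mid = soft_tree_predict(np.array([0.0, 0.0]), a, mu, 0.0, 1.0)   # boundary
print(lo, hi, mid)  # ≈ 0, ≈ 1, and exactly 0.5 on the boundary
```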
Dolores Romero Morales
added an update
Dear colleagues,
We warmly welcome you to the Online Seminar Series “Machine Learning NeEDS Mathematical Optimization” (https://congreso.us.es/mlneedsmo/).
This is a weekly seminar series that will take place every Monday, at 16.30 (CET). It will be 100% online-access, and it will have speakers from around the globe. To receive weekly updates about the online seminars and uploaded videos, please fill in this form here: 
We have lined up a number of presentations from leading academics in the field of Data Science and Analytics that will cover important topics such as explainability, fairness, fraud, privacy, etc. Mathematical Modeling and Mathematical Optimization will be at the core of their presentations. We will also have presentations from junior academics showing their latest results in this burgeoning area.
Looking forward to e-seeing you at the seminar series,
Emilio Carrizosa
IMUS-Instituto de Matemáticas de la Universidad de Sevilla
@emiliocarrizosa
Dolores Romero Morales
Copenhagen Business School
@DoloresRomeroM
Sponsors:
IMUS-Instituto de Matemáticas de la Universidad de Sevilla, www.imus.us.es
Copenhagen Business School, www.cbs.dk
H2020 EU RISE NeEDS project, www.riseneeds.eu
 
Dolores Romero Morales
added an update
SAVE-the-DATE!!!!
July 1, 2pm-5pm (CEST)
Online "Workshop on Data and Decisions in the COVID19 times"
jointly organised by IMUS-Instituto de Matemáticas de la Universidad de Sevilla and Copenhagen Business School, within the H2020 RISE NeEDS – Network of European Data Scientists, www.riseneeds.eu.
Colleagues from CARTO, Danmarks Statistik, Instituto de Estadística y Cartografía de Andalucía, KU Leuven, Universidad de Chile, Universidad de Sevilla, Università degli Studi di Milano and University of Duisburg-Essen, will present contributions from Data Science, Official Statistics and Mathematical Modeling to enhance Data Driven Decision Making in the COVID19 times.
Confirmed speakers are:
Sandra Benítez Peña, Giulia Carella, Iria Enrique Regueira, Jonas Klingwort, Alessandra Micheletti, Cristina Molero del Río, Laust Mortensen, Klass Nelissen, Hector Ramirez Cabrera, Remedios Sillero Denamiel
To register, follow the link
 
Dolores Romero Morales
added a research item
Since the seminal paper by Bates and Granger in 1969, a vast number of ensemble methods that combine different base regressors to generate a unique one have been proposed in the literature. The so-obtained regressor method may have better accuracy than its components, but at the same time it may overfit, it may be distorted by base regressors with low accuracy, and it may be too complex to understand and explain. This paper proposes and studies a novel Mathematical Optimization model, which trades off the accuracy of the ensemble and the number of base regressors used. The latter is controlled by means of a regularization term that penalizes bad regressors. Our approach is flexible to incorporate desirable properties one may have on the ensemble, such as controlling the performance of the ensemble in critical groups of records, or the costs associated with the base regressors involved in the ensemble. We illustrate our approach with a real data set arising in the COVID-19 context.
Sandra Benítez-Peña
added a research item
Support vector machines (SVMs) are widely used and constitute one of the best examined and used machine learning models for 2-class classification. Classification in SVM is based on a score procedure, yielding a deterministic classification rule, which can be transformed into a probabilistic rule (as implemented in off-the-shelf SVM libraries), but is not probabilistic in nature. On the other hand, the tuning of the regularization parameters in SVM is known to imply a high computational effort and generates pieces of information which are not fully exploited, and not used to build a probabilistic classification rule. In this paper we propose a novel approach to generate probabilistic outputs for the SVM. The highlights of the paper are: first, an SVM method is designed to be cost-sensitive, and thus the different importance of sensitivity and specificity is readily accommodated in the model. Second, the SVM is embedded in an ensemble method to improve its performance, making use of the valuable information generated in the parameter tuning process. Finally, the probability estimation is done via bootstrap estimates, avoiding the parametric models used by competing approaches to probability estimation in SVM. Numerical tests show the advantages of our procedures.
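A simplified sketch of the bootstrap idea using scikit-learn's LinearSVC (assumed available; the paper's method additionally reuses the tuning information and is cost-sensitive): the class-1 probability of a new point is estimated as the fraction of bootstrap-refitted SVMs voting for class 1.

```python
import numpy as np
from sklearn.svm import LinearSVC

def bootstrap_svm_proba(X, y, X_new, n_boot=25, seed=0):
    """Non-parametric probability estimates for an SVM: refit the SVM on
    bootstrap resamples of the training set and report, for each new point,
    the fraction of resampled SVMs that predict class 1."""
    rng = np.random.default_rng(seed)
    votes = np.zeros(len(X_new))
    for _ in range(n_boot):
        idx = rng.integers(0, len(X), len(X))       # bootstrap resample
        clf = LinearSVC(dual=False).fit(X[idx], y[idx])
        votes += clf.predict(X_new) == 1
    return votes / n_boot

# Two well-separated Gaussian classes in the plane.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, (40, 2)), rng.normal(2, 1, (40, 2))])
y = np.array([0] * 40 + [1] * 40)
proba = bootstrap_svm_proba(X, y, np.array([[-2.0, -2.0], [2.0, 2.0]]))
print(proba)  # near 0 for the class-0 point, near 1 for the class-1 point
```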
Cristina Molero-Río
added a research item
In this paper, we model an optimal regression tree through a continuous optimization problem, where a compromise between prediction accuracy and both types of sparsity, namely local and global, is sought. Our approach can accommodate important desirable properties for the regression task, such as cost-sensitivity and fairness. Thanks to the smoothness of the predictions, we can derive local explanations on the continuous predictor variables. The computational experience reported shows the outperformance of our approach in terms of prediction accuracy against standard benchmark regression methods such as CART, OLS and LASSO. Moreover, the scalability of our approach with respect to the size of the training sample is illustrated.
Dolores Romero Morales
added a research item
In this paper, we propose a mathematical optimization approach to cluster the rows and/or columns of contingency tables to detect possible statistical dependencies among the observed variables. With this, we obtain a clustered contingency table of smaller size, which is desirable when interpreting the statistical dependence results of the observed variables. An assignment and set partitioning mathematical formulations for this problem are proposed. Our model is able to successfully incorporate user-knowledge on the structure of the clusters sought including, e.g., cannot-link constraints on the rows and/or columns that cannot be merged as well as relational constraints on the ones that must be merged together. We illustrate the usefulness of the stated methodology using a dataset of a medical study, for which structural requirements in the clusters are imposed stemming from the inherent nature of the data.
M. Remedios Sillero-Denamiel
added 2 research items
In this paper, we study linear regression models built on categorical predictor variables that have a hierarchical structure. For such variables, the categories are arranged as a directed tree, where the categories in the leaf nodes give the highest granularity in the representation of the variable. Instead of taking the fully detailed model, the user can go upstream the tree and use a less complex, and thus more interpretable, model with fewer coefficients to be estimated and interpreted, hopefully without damaging the accuracy. We study the mathematical optimization problem that trades off the accuracy of the linear regression model and its complexity, measured as a cost function of the level of granularity of the representation of the hierarchical categorical variables. We show that finding non-dominated solutions for this problem boils down to solving a Mixed Integer Quadratic Problem with Linear Constraints. We illustrate our approach in a real-world cancer trial dataset, as well as in a simulated one, where our methodology finds a much less complex model with a very mild worsening of the accuracy.
The Lasso has become a benchmark data analysis procedure, and numerous variants have been proposed in the literature. Although the Lasso formulations are stated so that overall prediction error is optimized, they allow no direct control over the prediction accuracy on certain individuals of interest. In this work we propose a novel version of the Lasso in which quadratic performance constraints are added to Lasso-based objective functions, in such a way that threshold values are set to bound the prediction errors in the different groups of interest (not necessarily disjoint). As a result, a constrained sparse regression model is defined by a nonlinear optimization problem. This cost-sensitive constrained Lasso has a direct application in heterogeneous samples where data are collected from distinct sources, as is standard in many biomedical contexts. Both theoretical properties and empirical studies concerning the new method are explored in this paper. In addition, two illustrations of the method in biomedical and sociological contexts are considered.
Dolores Romero Morales
added 3 research items
COVID-19 is an infectious disease that was first identified in China in December 2019. COVID-19 subsequently spread broadly, arriving in Spain by the end of January 2020. This pandemic triggered confinement measures in order to reduce the expansion of the virus so as not to saturate the health care system. With the aim of providing Spanish authorities information about the short-term behavior of variables of interest on the virus spread, the Spanish Commission of Mathematics (CEMat) made a call among researchers in different areas to collaborate and build a cooperative predictor. Our research group is particularly focused on the seven-days-ahead prediction of the number of hospitalized patients, as well as ICU patients, in Andalusia. This manuscript describes the data pre-processing and methodology. This contribution is based at the Institute of Mathematics of the University of Seville (IMUS).
Dolores Romero Morales
added a research item
Decision trees are popular Classification and Regression tools and, when small-sized, easy to interpret. Traditionally, a greedy approach has been used to build the trees, yielding a very fast training process; however, controlling sparsity (a proxy for interpretability) is challenging. In recent studies, optimal decision trees, where all decisions are optimized simultaneously, have shown a better learning performance, especially when oblique cuts are implemented. In this paper, we propose a continuous optimization approach to build sparse optimal classification trees, based on oblique cuts, with the aim of using fewer predictor variables in the cuts as well as along the whole tree. Both types of sparsity, namely local and global, are modeled by means of regularizations with polyhedral norms. The computational experience reported supports the usefulness of our methodology. In all our data sets, local and global sparsity can be improved without harming classification accuracy. Unlike greedy approaches, our ability to easily trade in some of our classification accuracy for a gain in global sparsity is shown.
Dolores Romero Morales
added an update
Registration is open
for
The EURO PhD School “Data Driven Decision Making and Optimization”, IMUS-Mathematical Institute of the University of Seville, Spain, July 10-19, 2020.
Please follow the link https://congreso.us.es/epsdata/.
The EURO PhD School (EPS) will be focused on giving participants advanced training on the use of Data Science to aid Data Driven Decision Making. There will be two main components in this EPS. First, methodological training on the role of Mathematical Optimization in Data Science will be given in the format of lectures and computer workshops. The lectures will highlight the mathematical modelling and numerical optimization behind data analysis and data visualization tools. The computer workshops will put this knowledge in action. Second, applications of the acquired knowledge to the modeling of specific industrial problems will be presented by professionals from industry and worked out by the PhD students. Mathematical models and numerical solution approaches will be developed and communicated, following a collaborative approach, in which the PhD students will work in small groups under the guidance of the instructors.
Chairs of the Scientific Committee
Prof Emilio Carrizosa
IMUS-Institute of Mathematics of the University of Seville, Spain
&
Prof Dolores Romero Morales
Copenhagen Business School, Denmark
 
Dolores Romero Morales
added an update
The registration for the second NeEDS Modeling Week is now open. The details are below. The modeling week is organized by the NeEDS scientific leader at the Universidad de Chile, Richard Weber.
Second NeEDS Modelling Week
Faculty of Physical and Mathematical Sciences
Universidad de Chile (Santiago de Chile, Chile)
January 6th-11th, 2020
The aims of the NeEDS Modelling Week are
(i) to train students in modeling real-world problems using quantitative approaches, drawn from Data Science and Mathematical Optimization,
(ii) to stimulate their collaboration and communication skills in an international and intersectoral environment,
(iii) to bring together academia and professionals from the public and private sector, to show the potential that advanced quantitative models can have for practice, as well as to nurture the respective research agendas.
The format of the Second NeEDS Modeling Week is to spend one week working to solve real-world problems using different quantitative modeling approaches. Small groups of Master students, Ph.D. students, and junior researchers (such as post-doctoral students) will be assigned to each problem in terms of their preferences and skills. During the first day of the event the problems will be presented. An instructor leads each of these groups throughout the week. During the following four days, students will work on solving the problems under the guidance of the instructor and industrial collaborators. The final day of the meeting will be devoted to the presentation of the results.
To register for this modeling week, please follow the link
 
Dolores Romero Morales
added a research item
Exploratory Factor Analysis (EFA) is a widely used statistical technique to discover the structure of latent unobserved variables, called factors, from a set of observed variables. EFA exploits the property of rotation invariance of the factor model to enhance factors' interpretability by building a sparse loading matrix. In this paper, we propose an optimization-based procedure to give meaning to the factors arising in EFA by means of an additional set of variables, called explanatory variables, which may include in particular the set of observed variables. A goodness-of-fit criterion is introduced which quantifies the quality of the interpretation given this way. Our methodology also exploits the rotational invariance of EFA to obtain the best orthogonal rotation of the factors, in terms of the goodness-of-fit, but making them match to some of the explanatory variables, thus going beyond traditional rotation methods. Therefore, our approach allows the analyst to interpret the factors not only in terms of the observed variables, but in terms of a broader set of variables. Our experimental results demonstrate how our approach enhances interpretability in EFA, first in an empirical dataset, concerning volumes of reservoirs in California, and second in a synthetic data example.
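The rotation idea above can be pictured in a stylized two-factor setting. This is not the paper's procedure, which rotates a loading matrix and optimizes a dedicated goodness-of-fit criterion; the hypothetical toy below instead scores a planar rotation by the sum of squared correlations between the rotated factors and the explanatory variables, and recovers a hidden rotation by grid search. All names and data are invented for illustration.

```python
import math
import random

def rotate(F, theta):
    """Apply the 2x2 rotation R(theta) to each row of F."""
    c, s = math.cos(theta), math.sin(theta)
    return [[f1 * c - f2 * s, f1 * s + f2 * c] for f1, f2 in F]

def corr(a, b):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = math.sqrt(sum((x - ma) ** 2 for x in a))
    vb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (va * vb)

def goodness_of_fit(F, Z, theta):
    """Sum of squared correlations between the rotated factors and
    the explanatory variables they should match, column by column."""
    R = rotate(F, theta)
    return sum(corr([r[j] for r in R], [z[j] for z in Z]) ** 2
               for j in range(2))

# Synthetic example: the observed factors are an unknown rotation of
# the explanatory variables; a grid search recovers the best rotation.
random.seed(1)
Z = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(50)]
F = rotate(Z, math.radians(30))          # hidden rotation of 30 degrees

best = max(range(360), key=lambda d: goodness_of_fit(F, Z, math.radians(d)))
```

Because squared correlation is invariant under a joint sign flip of both factors, the hidden rotation and its 180-degree flip attain the same maximal fit of 2, so the grid search lands on one of those two angles.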
Dolores Romero Morales
added a research item
This paper proposes an integrative approach to feature (input and output) selection in Data Envelopment Analysis (DEA). The DEA model is enriched with zero-one decision variables modelling the selection of features, yielding a Mixed Integer Linear Programming formulation. This single-model approach can handle different objective functions as well as constraints to incorporate desirable properties from the real-world application. Our approach is illustrated on the benchmarking of electricity Distribution System Operators (DSOs). The numerical results highlight the advantages our single-model approach provides to the user, in terms of choosing the number of features, as well as modeling their costs and their nature.
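To make the modelling idea concrete, here is one plausible way to enrich the classical multiplier form of DEA with zero-one selection variables; this is a generic sketch, not necessarily the exact formulation used in the paper. Binary variables $z_r$ and $s_i$ switch outputs and inputs on and off, and a big-$M$ constant links them to the multipliers:

```latex
\begin{align*}
\max_{u,v,z,s}\ & \sum_{r} u_r y_{r0} \\
\text{s.t.}\ & \sum_{i} v_i x_{i0} = 1 \\
& \sum_{r} u_r y_{rj} - \sum_{i} v_i x_{ij} \le 0 && \forall j \\
& u_r \le M z_r, \quad v_i \le M s_i && \forall r,\, i \\
& \sum_{r} z_r + \sum_{i} s_i \le k \\
& u, v \ge 0, \quad z_r,\, s_i \in \{0,1\}
\end{align*}
```

Here $y_{rj}$ and $x_{ij}$ are the outputs and inputs of decision-making unit $j$, unit $0$ is the one under evaluation, and $k$ caps the number of selected features. Since all constraints are linear, the result is a Mixed Integer Linear Program, and feature costs or other side constraints can be imposed on $z$ and $s$, in line with the single-model flexibility the abstract describes.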
Dolores Romero Morales
added an update
ECMI Postgraduate / VI Iberian / NeEDS Modelling Week
July 7-13, 2019 @ Instituto de Matemáticas de la Universidad de Sevilla, IMUS, Seville, Spain
The ECMI Postgraduate / VI Iberian / NeEDS Modelling Week will be held at the Mathematical Institute of the University of Seville https://www.imus.us.es/en/ (Seville, Spain) on July 7th-13th, 2019. It is co-organized by the European Consortium for Mathematics and Industry (https://ecmiindmath.org/), the Spanish Network for Mathematics-Industry (http://www.math-in.net/?q=en), the Portuguese Network of Mathematics for Industry and Innovation (https://www.spm.pt/PT-MATHS-IN/), and the H2020-MSCA-RISE NeEDS project (http://www.riseneeds.eu/), and is one of the satellite meetings of the 9th International Congress on Industrial and Applied Mathematics https://iciam2019.org/ (July 15th-19th, 2019, Valencia, Spain).
The format of the Modelling Week is to spend one week working to solve real problems that can be tackled through mathematical modeling. Small groups of multinational Master students, Ph.D. students and junior researchers (such as post-doctoral students) will be assigned to each problem in terms of their preferences and skills on the first day of the event, after the presentation of the problems. An instructor, an expert in the area of the proposed problem, leads each of these groups. During the following four days students will work on solving the problems under the guidance of the instructor and industrial collaborators. The last day of the meeting will be devoted to the presentation of the results, which will be collected in the proceedings of the event.
To register for this modeling week, please follow the link
For information on competitive financial support, please follow the link
 
Min Chen
added a research item
Visualization is a human-centric process, which is inevitably associated with potential biases in humans' judgment and decision-making. While the discussions on human biases have been heavily influenced by the work of Daniel Kahneman, as summarized in his book “Thinking, Fast and Slow”, there have also been viewpoints in psychology in favor of heuristics, such as by Gigerenzer. In this chapter, we present a balanced discourse on human heuristics and biases as two sides of the same coin. In particular, we examine these two aspects from a probabilistic perspective, and relate them to the notions of global and local sampling. We use three case studies from Kahneman's book to illustrate the potential biases of human- and machine-centric decision processes. Our discourse leads to a concrete conclusion that visual analytics, where interactive visualization is integrated with statistics and algorithms, offers an effective and efficient means to overcome biases in data intelligence.
Dolores Romero Morales
added an update
Dear all,
The project website is now up and running!
Best,
Dolores.
 
Dolores Romero Morales
added an update
I am thrilled to announce that our H2020-MSCA-RISE NeEDS project has just received the final thumbs-up from the European Commission.
Abstract
NeEDS (Network of European Data Scientists) consists of six academic participants and eight industrial ones from five EU countries, the USA and Latin America, with strong and complementary expertise from industry sectors ranging from energy, retailing and insurance to banking, as well as national statistical offices. With this composition, NeEDS is in a unique position to deliver cutting-edge multidisciplinary research to advance academic thinking on Data Science in Europe, and to improve the Data Science capabilities of industry and the public sector.
Participants
COPENHAGEN BUSINESS SCHOOL, Denmark (Coordinator)
UNIVERSIDAD DE SEVILLA, Spain
KATHOLIEKE UNIVERSITEIT LEUVEN, Belgium
THE UNIVERSITY OF OXFORD, United Kingdom
DUKE UNIVERSITY, United States
UNIVERSIDAD DE CHILE, Chile
AGEAS, Belgium
CENTRAAL BUREAU VOOR DE STATISTIEK, The Netherlands
DANMARKS STATISTIK, Denmark
GEOGRAPHICA, Spain
IECISA, Spain
REPSOL S.A., Spain
TESCO STORES LIMITED, United Kingdom
BANCO DEL ESTADO DE CHILE, Chile
Call: H2020-MSCA-RISE-2018
Duration: 48 months
Start Date: 01 Jan 2019
EU Contribution: €1,168,400.00
 
Dolores Romero Morales
added a project goal
NeEDS (Network of European Data Scientists) provides an integrated modelling and computing environment that facilitates data analysis and data visualization to enhance interaction. NeEDS brings together an excellent interdisciplinary research team that integrates expertise from three relevant academic disciplines, Mathematical Optimization, Visualization and Network Science, and is excellently placed to tackle the challenges. NeEDS develops mathematical models, yielding results which are interpretable, easy-to-visualize, and flexible enough to incorporate user knowledge from complex data. These models require the numerical resolution of computationally demanding Mixed Integer Nonlinear Programming formulations, and for this purpose NeEDS develops innovative mathematical optimization based heuristics.