Short-Term Predictions of the Evolution of COVID-19 in
Andalusia. An Ensemble Method
Sandra Benítez-Peña1,2, Emilio Carrizosa1,2, Vanesa Guerrero3, María Dolores
Jiménez-Gamero1,2, Belén Martín-Barragán4, Cristina Molero-Río1,2, Pepa
Ramírez-Cobo1,5, Dolores Romero Morales6, and M. Remedios Sillero-Denamiel1,2
1Instituto de Matemáticas de la Universidad de Sevilla, Seville, Spain
2Departamento de Estadística e Investigación Operativa, Universidad de Sevilla, Seville, Spain
3Departamento de Estadística, Universidad Carlos III de Madrid, Getafe, Spain
4The University of Edinburgh Business School, University of Edinburgh, Edinburgh, UK
5Departamento de Estadística e Investigación Operativa, Universidad de Cádiz, Cádiz, Spain
6Department of Economics, Copenhagen Business School, Frederiksberg, Denmark
May 2, 2020
Abstract
COVID-19 is an infectious disease that was first identified in China in December 2019. Subsequently, COVID-19 spread broadly, reaching Spain by the end of January 2020. This pandemic triggered confinement measures intended to slow the expansion of the virus and avoid saturating the health care system. With the aim of providing the Spanish authorities with information about the short-term behavior of variables of interest related to the virus spread, the Spanish Commission of Mathematics (CEMat) issued a call among researchers in different areas to collaborate and build a cooperative predictor. Our research group focuses on the seven-days-ahead prediction of the number of hospitalized patients, as well as ICU patients, in Andalusia. This manuscript describes the data pre-processing and the methodology. This contribution is based at the Institute of Mathematics of the University of Seville (IMUS).
Key words: COVID-19; Time Series; Ensemble Method; Support Vector Regression; Random Forests; Sparse Optimal Randomized Regression Trees; Rolling Forecasting Origin; Box-Cox Transformations
1 Data pre-processing
Raw data
The data provided by the Spanish Commission of Mathematics (CEMat) are available online at https://covid19.isciii.es/. They consist of daily values (Fecha) of five variables of interest: the cumulative number of cases (Casos), hospitalized patients (Hospitalizados), ICU patients (UCI), deceased (Fallecidos) and recovered (Recuperados) people in Spain since the 20th of February 2020, disaggregated by region (CCAA). For further information, the reader is referred to http://matematicas.uclm.es/cemat/covid19/en/.
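As an illustration, the following R sketch loads a local copy of the series and keeps the Andalusian rows; the file name and the region code "AN" are hypothetical, while the column names are those listed above.

```r
# Hedged sketch: read a local copy of the CEMat/ISCIII data (hypothetical
# file name) and keep the rows for Andalusia (hypothetical CCAA code "AN").
# Columns follow the paper: Fecha, CCAA, Casos, Hospitalizados, UCI,
# Fallecidos, Recuperados.
raw <- read.csv("serie_historica_acumulados.csv", stringsAsFactors = FALSE)
andalusia <- raw[raw$CCAA == "AN", ]
hosp <- andalusia$Hospitalizados   # cumulative hospitalized patients
```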
Clean data
Given a variable of interest in a region, the corresponding univariate time series $\{X_t,\ t = 1, \dots, T\}$ is converted into a multivariate setting in order to implement the Machine Learning tools discussed in Section 2. These models will be trained to make predictions for a variable that depends on its past values (up to $nlags$ days). The resulting multivariate data consist of a $(T - nlags) \times (nlags + 1)$ matrix. For each $t = nlags + 1, \dots, T$, one wants to predict $X_t$ using $(X_{t-nlags}, \dots, X_{t-1})$, the attribute vector consisting of the values of the variable in the past $nlags$ days. Zero rows, due to the late spread of the virus in some regions, are removed.
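A minimal R sketch of this conversion, built on the base function embed (the helper name make_lagged is ours):

```r
# Turn a univariate series x into the (T - nlags) x (nlags + 1) matrix
# described above. embed() returns rows (x[t], x[t-1], ..., x[t-nlags]);
# the first column is the response X_t, the rest are the attributes.
make_lagged <- function(x, nlags = 7) {
  m <- embed(x, nlags + 1)
  list(y = m[, 1],                  # response X_t
       X = m[, -1, drop = FALSE])   # attributes X_{t-1}, ..., X_{t-nlags}
}
```

Rows consisting entirely of zeros would then be dropped before training.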
Enriched clean data: Incorporation of information of all regions
Two different approaches are used for training the models. The first one only considers information about the variable of interest in the Andalusian region, while the second incorporates the information of all regions (CCAAs) in Spain.
Enriched clean data: Box Cox transformations
To better capture nonlinearities in the data, as is customary in Regression Analysis [9], Box-Cox transformations of the data are explored. Two standard transformations, $\log(x+1)$ and $x^2$, are considered in addition to the plain clean data.
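A sketch of the two transformations, each paired with the inverse needed to map predictions back to the original scale (the list structure and names are ours):

```r
# The two Box-Cox-type transformations explored, plus the identity,
# each with its inverse (used to undo the transformation on predictions).
transforms <- list(
  plain  = list(f = identity,                inv = identity),
  log1p  = list(f = function(x) log(x + 1),  inv = function(z) exp(z) - 1),
  square = list(f = function(x) x^2,         inv = function(z) sqrt(pmax(z, 0)))
)
# pmax() guards against tiny negative predictions before the square root.
```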
Enriched clean data: Differenced series
Moreover, it is possible to add information about the monotonicity and the curvature of the original time series. This is obtained by adding to the multivariate data, for each response value $X_t$, new attributes defined by the first and second differenced series:
$$\{Y_t := \nabla X_t = X_t - X_{t-1}\}$$
and
$$\{W_t := \nabla Y_t = Y_t - Y_{t-1}\},$$
respectively.
To ensure that the predictions show the monotonicity property of the data, we use as response variable $Y_t = X_t - X_{t-1}$, or $\log(1 + Y_t)$, instead of $X_t$. In order to obtain the predictions of $X_t$, we have to undo this transformation.
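In R, the differenced attributes and the inverse mapping can be sketched as follows (x and y_hat are assumed to hold the clean series and the predicted increments, respectively):

```r
# First and second differenced series, added as extra attributes.
Y <- diff(x)        # Y_t = X_t - X_{t-1} (monotonicity information)
W <- diff(Y)        # W_t = Y_t - Y_{t-1} (curvature information)

# Undoing the transformation: if y_hat holds the predicted increments for
# days T+1, ..., T+7, the predictions on the original scale are obtained
# by accumulating them onto the last observed value.
x_hat <- tail(x, 1) + cumsum(y_hat)
```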
2 Methodology
The proposed methodology is divided into three phases.
Phase 1: Construction of basic regressors
Three state-of-the-art Machine Learning tools are used: Support Vector Regression (SVR) [6],
Random Forest (RF) [1, 5], and Sparse Optimal Randomized Regression Trees (S-ORRT) [2, 3, 4].
These models are trained using all the available data except for the last $q$ days, which are used in Phase 2; i.e., these models are trained on the subseries $\{X_t,\ t = 1, \dots, T - q\}$.
Parameter tuning and time series cross-validation
Both SVR and RF need parameter tuning. We have implemented a well-known time series cross-validation technique called rolling forecasting origin [10], which extends the classic cross-validation approach, see [8]. If all regions (CCAAs) in Spain are considered, each fold has a test set composed of one observation per region. The selected configuration of parameters is the one that leads to the smallest mean squared error in the test sets.
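A hedged sketch of this tuning loop for SVR on the Andalusia-only data (the function and variable names are ours; the parameter grid follows Section 3):

```r
library(e1071)  # svm() for Support Vector Regression

# Rolling-forecasting-origin tuning: each fold trains on an expanding
# window and tests on the single next observation. X and y are the lagged
# attribute matrix and the response from the clean data.
tune_svr_rolling <- function(X, y, n_folds = 8) {
  grid <- expand.grid(cost = 2^(-10:10), gamma = 2^(-10:10))
  origins <- (nrow(X) - n_folds):(nrow(X) - 1)
  grid$mse <- apply(grid, 1, function(g) {
    mean(sapply(origins, function(o) {
      fit <- svm(X[1:o, , drop = FALSE], y[1:o],
                 cost = g["cost"], gamma = g["gamma"])
      (predict(fit, X[o + 1, , drop = FALSE]) - y[o + 1])^2
    }))
  })
  grid[which.min(grid$mse), c("cost", "gamma")]  # smallest mean squared error
}
```

With all regions included, the test set of each fold would instead contain one observation per region.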
To predict several consecutive days, we start by predicting the first one and use this prediction to compute the values of the attributes that serve as input for the next day. The procedure is repeated until the last day.
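A sketch of this recursive scheme (the helper name predict_ahead is ours; the attribute order must match the one used when training the model):

```r
# Recursive multi-step prediction: each one-day-ahead prediction is
# appended to the series and used to build the attributes of the next day.
predict_ahead <- function(model, x, horizon = 7, nlags = 7) {
  for (h in seq_len(horizon)) {
    attrs <- matrix(rev(tail(x, nlags)), nrow = 1)  # X_{t-1}, ..., X_{t-nlags}
    x <- c(x, as.numeric(predict(model, attrs)))    # feed prediction back
  }
  tail(x, horizon)  # the predicted values for the horizon
}
```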
Phase 2: Construction of the ensemble predictor
According to Section 1 and Phase 1, different prediction methods based on SVRs, RFs and S-ORRTs are obtained in terms of the following options: (1) data are transformed by a Box-Cox transformation or not; (2) series are differenced or not; (3) only information from Andalusia is incorporated or, in contrast, the information from the complete set of Spanish regions is considered. This leads to 12 different regressors using SVR and, similarly, 12 using S-ORRT; one obtains 6 regressors based on the RF framework. Note that we have eliminated 6 RF regressors, namely the ones not using the differenced series, since they are outperformed by the ones that include them. Let $\mathcal{R}$ denote the set of the 30 regressors. Then, the resulting predictions $p_{rt}$ for $t = T - q + 1, \dots, T$ and $r \in \mathcal{R}$ are used as regressors in an ensemble regressor [13] defined as follows:
$$
\begin{aligned}
\min\ & \frac{1}{q} \sum_{t=T-q+1}^{T} \Bigl( \sum_{r \in \mathcal{R}} \alpha_r p_{rt} - X_t \Bigr)^{2} + \lambda \sum_{r \in \mathcal{R}} \alpha_r \, \frac{1}{q} \sum_{t=T-q+1}^{T} \bigl( p_{rt} - X_t \bigr)^{2} \\
\text{s.t.}\ & \sum_{r \in \mathcal{R}} \alpha_r = 1, \qquad (1) \\
& \alpha_r \geq 0, \quad r \in \mathcal{R},
\end{aligned}
$$
where $\alpha_r$ is the weight associated with regressor $r \in \mathcal{R}$, $X_t$ is the observed response value at instant $t$, and $\lambda$ is a scalar parameter. Problem (1) seeks the optimal linear combination of regressors that minimizes the mean squared error in these $q$ observations, i.e., in the subseries $\{X_t,\ t = T - q + 1, \dots, T\}$, and, in turn, penalizes, with the same weights $\alpha_r$, the individual performance of the regressors. This tradeoff is parametrized by $\lambda$.
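The paper solves Problem (1) with Gurobi; the following sketch uses the quadprog package as a stand-in, with a tiny ridge term added only so that the quadratic matrix is numerically positive definite (that term is not part of the model):

```r
library(quadprog)

# Problem (1): P is the q x |R| matrix of predictions p_rt, x_obs the q
# observed values X_t, and lambda the trade-off parameter.
ensemble_weights <- function(P, x_obs, lambda = 0.01) {
  q <- nrow(P); R <- ncol(P)
  e <- colMeans((P - x_obs)^2)            # individual MSE of each regressor
  D <- (2 / q) * crossprod(P) +           # quadratic part of the objective
       1e-8 * diag(R)                     # tiny ridge, numerical safeguard only
  d <- (2 / q) * as.vector(crossprod(P, x_obs)) - lambda * e
  A <- cbind(rep(1, R), diag(R))          # sum-to-one, then alpha_r >= 0
  sol <- solve.QP(Dmat = D, dvec = d, Amat = A,
                  bvec = c(1, rep(0, R)), meq = 1)
  sol$solution                            # the weights alpha_r
}
```

The derivation is direct: expanding the first term of (1) gives the quadratic matrix $(2/q)P^{\top}P$ and the linear term $(2/q)P^{\top}x$, while the penalty contributes $-\lambda e$ to the linear part, matching solve.QP's $\frac{1}{2}\alpha^{\top}D\alpha - d^{\top}\alpha$ form.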
Phase 3: Predictions seven-days-ahead
The final step is to train again every regressor in $\mathcal{R}$ using the whole series $\{X_t,\ t = 1, \dots, T\}$. The selected configuration of parameters found in Phase 1 is used. Finally, the predictions obtained for the seven-day period are weighted according to the weights found in Phase 2. The output is submitted to CEMat, to be part of the combined predictor available at http://matematicas.uclm.es/cemat/covid19/.
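Schematically, if preds7 is a $|\mathcal{R}| \times 7$ matrix whose rows hold each regressor's seven-day predictions (already mapped back to the original scale) and alpha holds the Phase-2 weights, the final output is (names are ours):

```r
# Weighted combination of the seven-day-ahead predictions of all
# regressors; preds7 is |R| x 7, alpha is the vector of Phase-2 weights.
final_pred <- as.vector(t(preds7) %*% alpha)
```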
3 Setup
Our goal is to make seven-days-ahead predictions for the values of two variables of interest in
Andalusia: Hospitalizados and UCI.
In our computational experiments, we fix $nlags = 7$, $q = 4$ and $\lambda = 0.01$. If only information from Andalusia is included, eight-fold cross-validation is used. However, when information from all regions is included, we limit this to five-fold cross-validation, due to the small amount of data and the lack of observations in some regions of Spain, as is the case for Ceuta and Melilla.
The e1071 [12] and randomForest [11] R packages have been used for running SVR and RF, respectively. The grid of tuning parameters explored by SVR was cost, gamma $\in \{2^a : a = -10, \dots, 10\}$; for RF, ntree $= 500$ and mtry took eight random values between 1 and the number of attributes. The computational details for running S-ORRT are those in [3]. Finally, Problem (1) was solved using Gurobi [7], a well-known efficient solver for quadratically constrained programming problems.
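For RF, the configuration described above can be sketched as follows (X and y are the lagged attributes and response as before; each candidate would then be assessed with the same rolling-forecasting-origin scheme):

```r
library(randomForest)

# RF tuning as described: ntree fixed at 500, and eight mtry values drawn
# at random between 1 and the number of attributes (with replacement, in
# case there are fewer than eight attributes).
set.seed(1)  # arbitrary seed, only for reproducibility of the draw
mtry_grid <- sample.int(ncol(X), 8, replace = TRUE)
fits <- lapply(mtry_grid, function(m)
  randomForest(X, y, ntree = 500, mtry = m))
```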
Acknowledgements
This research has been financed in part by research projects EC H2020 MSCA RISE NeEDS (Grant agreement ID: 822214); FQM-329 and P18-FR-2369 (Junta de Andalucía, Spain); MTM2017-89422-P (Ministerio de Economía, Industria y Competitividad, Spain); PR2019-029 (Universidad de Cádiz, Spain); PITUFLOW-CM-UC3M (Comunidad de Madrid and Universidad Carlos III de Madrid, Spain); and EP/R00370X/1 (EPSRC, United Kingdom). This support is gratefully acknowledged.
References
[1] G. Biau and E. Scornet. A random forest guided tour. TEST, 25(2):197–227, 2016.
[2] R. Blanquero, E. Carrizosa, C. Molero-Río, and D. Romero Morales. Optimal randomized classification trees. Technical report, IMUS, Sevilla, Spain, https://www.researchgate.net/publication/326901224_Optimal_Randomized_Classification_Trees, 2018.
[3] R. Blanquero, E. Carrizosa, C. Molero-Río, and D. Romero Morales. On sparse optimal regression trees. Technical report, IMUS, Sevilla, Spain, https://www.researchgate.net/publication/341099512_On_Sparse_Optimal_Regression_Trees, 2020.
[4] R. Blanquero, E. Carrizosa, C. Molero-Río, and D. Romero Morales. Sparsity in optimal randomized classification trees. European Journal of Operational Research, 284(1):255–272, 2020.
[5] L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
[6] E. Carrizosa and D. Romero Morales. Supervised classification and mathematical optimization.
Computers & Operations Research, 40(1):150–165, 2013.
[7] Gurobi Optimization, LLC. Gurobi optimizer reference manual, 2018.
[8] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer,
New York, 2nd edition, 2009.
[9] T. Hastie, R. Tibshirani, and M. Wainwright. Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, 2015.
[10] R. J. Hyndman and G. Athanasopoulos. Forecasting: Principles and Practice. OTexts, 2018.
[11] A. Liaw and M. Wiener. Classification and Regression by randomForest. R News, 2(3):18–22,
2002.
[12] D. Meyer, E. Dimitriadou, K. Hornik, A. Weingessel, and F. Leisch. e1071: Misc Functions of
the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien, 2019.
R package version 1.7-1.
[13] Y. Ren, L. Zhang, and P. N. Suganthan. Ensemble classification and regression-recent developments, applications and future directions. IEEE Computational Intelligence Magazine, 11(1):41–53, 2016.