Short-Term Predictions of the Evolution of COVID-19 in
Andalusia. An Ensemble Method
Sandra Benítez-Peña1,2, Emilio Carrizosa1,2, Vanesa Guerrero3, María Dolores
Jiménez-Gamero1,2, Belén Martín-Barragán4, Cristina Molero-Río1,2, Pepa
Ramírez-Cobo1,5, Dolores Romero Morales6, and M. Remedios Sillero-Denamiel1,2
1Instituto de Matemáticas de la Universidad de Sevilla, Seville, Spain
2Departamento de Estadística e Investigación Operativa, Universidad de Sevilla, Seville, Spain
3Departamento de Estadística, Universidad Carlos III de Madrid, Getafe, Spain
4The University of Edinburgh Business School, University of Edinburgh, Edinburgh, UK
5Departamento de Estadística e Investigación Operativa, Universidad de Cádiz, Cadiz, Spain
6Department of Economics, Copenhagen Business School, Frederiksberg, Denmark
May 2, 2020
Abstract
COVID-19 is an infectious disease that was first identified in China in December 2019. It subsequently spread widely, reaching Spain by the end of January 2020. The pandemic triggered confinement measures intended to slow the expansion of the virus and avoid saturating the health care system. With the aim of providing the Spanish authorities with short-term information about the behavior of variables of interest related to the spread of the virus, the Spanish Commission of Mathematics (CEMat) issued a call for researchers from different areas to collaborate and build a cooperative predictor. Our research group focuses on the seven-days-ahead prediction of the number of hospitalized patients, as well as ICU patients, in Andalusia. This manuscript describes the data pre-processing and the methodology. This contribution is based at the Institute of Mathematics of the University of Seville (IMUS).
Key words COVID-19; Time Series; Ensemble Method; Support Vector Regression; Random Forests; Sparse Optimal Randomized Regression Trees; Rolling Forecasting Origin; Box-Cox Transformations
1 Data pre-processing
Raw data
The data provided by the Spanish Commission of Mathematics (CEMat) is available online at https://covid19.isciii.es/. This data consists of values per day (Fecha) of five variables of interest: the cumulative number of cases (Casos), hospitalized patients (Hospitalizados), ICU patients (UCI), deceased (Fallecidos) and recovered (Recuperados) people in Spain since the 20th of February 2020, disaggregated by region (CCAA). For further information, the reader is referred to http://matematicas.uclm.es/cemat/covid19/en/.
Clean data
Given a variable of interest in a region, the corresponding univariate time series
$\{X_t,\ t = 1, \dots, T\}$
is converted into a multivariate setting in order to implement the Machine Learning tools discussed in Section 2. These models are trained to make predictions for a variable that depends on its past values (up to $n_{\text{lags}}$ days). The resulting multivariate data consist of a $(T - n_{\text{lags}}) \times (n_{\text{lags}} + 1)$ matrix. For each $t = n_{\text{lags}} + 1, \dots, T$, one wants to predict $X_t$ using $(X_{t - n_{\text{lags}}}, \dots, X_{t-1})$, the attribute vector consisting of the values of the variable in the past $n_{\text{lags}}$ days. Rows of zeros, which arise in regions reached later by the virus, are removed.
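A minimal R sketch of this conversion is given below; the function name build_lagged_matrix, the default value of nlags and the toy series are ours, introduced only for illustration.

# Minimal sketch: turn a univariate series x = (X_1, ..., X_T) into a
# (T - nlags) x (nlags + 1) matrix whose last column is the response X_t
# and whose first nlags columns are (X_{t-nlags}, ..., X_{t-1}).
build_lagged_matrix <- function(x, nlags = 7) {
  T <- length(x)
  stopifnot(T > nlags)
  rows <- t(sapply((nlags + 1):T, function(t) x[(t - nlags):t]))
  colnames(rows) <- c(paste0("lag", nlags:1), "response")
  # Drop rows that are entirely zero (period before the virus reached the region)
  rows[rowSums(rows != 0) > 0, , drop = FALSE]
}

# Example usage on a toy cumulative series
x <- c(0, 0, 1, 3, 7, 12, 20, 33, 50, 71, 100)
head(build_lagged_matrix(x, nlags = 3))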
Enriched clean data: Incorporation of information of all regions
Two different approaches are used for training the models. The first one only considers information about the variable of interest in Andalusia, while the second incorporates the information from all regions (CCAAs) in Spain.
Enriched clean data: Box-Cox transformations
To better capture nonlinearities in the data, as customary in Regression Analysis [9], Box-Cox transformations of the data are explored. Two standard transformations, $\log(x + 1)$ and $x^2$, are considered in addition to the plain clean data.
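For illustration, the three data versions can be generated as in the following sketch; the helper name and list labels are ours, not part of the original pipeline.

# Sketch: the three versions of the clean series that are explored,
# namely the raw values, log(x + 1) and x^2.
make_transformed_series <- function(x) {
  list(
    raw    = x,
    log1p  = log1p(x),   # log(x + 1), well defined for x = 0
    square = x^2
  )
}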
Enriched clean data: Differenced series
Moreover, it is possible to add information about the monotonicity and the curvature of the original
time series. This is obtained by adding to the multivariate data, for each response value Xt, new
attributes defined by the first and second differenced series:
{Yt:= 5Xt=Xt−Xt−1}
and
{Wt:= 5Yt=Yt−Yt−1},
respectively.
To ensure that the predictions show the monotonicity property of the data, we use as response
variable Yt=Xt−Xt−1, or log(1 + Yt), instead of Xt. In order to obtain the values for the
predictions of Xt, we have to undo this transformation.
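The sketch below, in R, illustrates the differencing and one way to recover a prediction of $X_t$ from a prediction of $Y_t$ or $\log(1 + Y_t)$; the helper names and the example values are ours.

# First and second differenced series (Y_t and W_t above)
first_diff  <- function(x) diff(x)        # Y_t = X_t - X_{t-1}
second_diff <- function(x) diff(diff(x))  # W_t = Y_t - Y_{t-1}

# Undo the response transformation: given the last observed value X_{t-1}
# and a prediction of Y_t (or of log(1 + Y_t)), recover a prediction of X_t.
undo_transform <- function(last_x, pred, log_scale = FALSE) {
  y_hat <- if (log_scale) exp(pred) - 1 else pred
  last_x + y_hat
}

# Example: last observed value 120, predicted increment given on the log scale
undo_transform(120, log(1 + 15), log_scale = TRUE)  # returns 135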
2 Methodology
The proposed methodology is divided into three phases.
Phase 1: Construction of basic regressors
Three state-of-the-art Machine Learning tools are used: Support Vector Regression (SVR) [6],
Random Forest (RF) [1, 5], and Sparse Optimal Randomized Regression Trees (S-ORRT) [2, 3, 4].
These models are trained using all the available data except for the last $q$ days, which are reserved for Phase 2, i.e., these models are trained on the subseries $\{X_t,\ t = 1, \dots, T - q\}$.
Parameter tuning and time series cross-validation
Both SVR and RF need parameter tuning. We have implemented a well-known time series cross-validation technique called rolling forecasting origin [10], which extends the classic cross-validation approach, see [8]. If all regions (CCAAs) in Spain are considered, each fold has a test set composed of one observation per region. The selected configuration of parameters is the one that leads to the smallest mean squared error in the test sets.
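A minimal R sketch of rolling forecasting origin splits follows; the helper name and the example sizes are ours and are not the exact implementation used.

# Sketch: rolling forecasting origin splits for a series of length n.
# Fold k trains on observations 1..(origin + k - 1) and tests on the next
# observation, so the training window always precedes the test point.
rolling_origin_splits <- function(n, n_folds) {
  origin <- n - n_folds   # size of the smallest training window
  lapply(seq_len(n_folds), function(k) {
    list(train = seq_len(origin + k - 1),
         test  = origin + k)
  })
}

# Example: 8 folds, as when only the Andalusian data are used (Section 3)
splits <- rolling_origin_splits(n = 40, n_folds = 8)
splits[[1]]$train  # observations 1..32
splits[[1]]$test   # observation 33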
To predict several consecutive days, we start by predicting the first one and use this prediction to compute the values of the attributes that serve as input for the next day. The procedure is repeated until the last day.
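The following R sketch illustrates this recursive scheme for a generic fitted model; the helper predict_recursive and the column naming convention are ours, chosen to match the lagged matrix described in Section 1.

# Sketch: recursive multi-step prediction. At each step the most recent
# nlags values (observed or previously predicted) form the attribute vector.
predict_recursive <- function(model, last_values, horizon = 7) {
  history <- last_values            # the last nlags observed values
  preds <- numeric(horizon)
  for (h in seq_len(horizon)) {
    new_x <- matrix(history, nrow = 1)
    colnames(new_x) <- paste0("lag", length(history):1)
    preds[h] <- predict(model, newdata = as.data.frame(new_x))
    # Slide the window: drop the oldest value, append the new prediction
    history <- c(history[-1], preds[h])
  }
  preds
}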
Phase 2: Construction of the ensemble predictor
According to Section 1 and Phase 1, different prediction methods based on SVRs, RFs and S-ORRTs are obtained in terms of the following options: (1) the data are transformed by a Box-Cox transformation or not; (2) the series are differenced or not; (3) only the information from Andalusia is incorporated or, in contrast, the information from the complete set of Spanish regions is considered. This leads to 12 different regressors using SVR and 12 using S-ORRT; similarly, one obtains 6 regressors based on the RF framework. Note that we have eliminated 6 RF regressors, namely the ones not using the differenced series, since they are outperformed by the ones that include them. Let $\mathcal{R}$ denote the set of the 30 regressors. Then, the resulting predictions $p_{rt}$, for $t = T - q + 1, \dots, T$ and $r \in \mathcal{R}$, are used as regressors in an ensemble regressor [13] defined as follows:
\[
\begin{aligned}
\min \quad & \frac{1}{q}\sum_{t=T-q+1}^{T}\left(\sum_{r\in\mathcal{R}}\alpha_r\,p_{rt}-X_t\right)^{2}
+\lambda\sum_{r\in\mathcal{R}}\alpha_r\,\frac{1}{q}\sum_{t=T-q+1}^{T}\left(p_{rt}-X_t\right)^{2}\\
\text{s.t.}\quad & \sum_{r\in\mathcal{R}}\alpha_r = 1, \qquad\qquad (1)\\
& \alpha_r \ge 0,\quad r\in\mathcal{R},
\end{aligned}
\]
where $\alpha_r$ is the weight associated with regressor $r \in \mathcal{R}$, $X_t$ is the observed response value at time $t$, and $\lambda$ is a scalar parameter. Problem (1) seeks the optimal linear combination of regressors that minimizes the mean squared error over these $q$ observations, i.e., over the subseries $\{X_t,\ t = T - q + 1, \dots, T\}$, and, in turn, penalizes the individual performance of the regressors with the same weights $\alpha_r$. This tradeoff is parametrized by $\lambda$.
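Since Problem (1) is solved with Gurobi (Section 3), the following is a hedged sketch of how it could be cast as a quadratic program via the gurobi R interface; the constant term in $X_t$ is dropped from the objective, and the names P, X_obs and fit_ensemble_weights are ours.

library(gurobi)  # vendor's R interface to the Gurobi solver

# Sketch: compute the ensemble weights of Problem (1).
#   P      : q x |R| matrix, column r holding the predictions p_rt of regressor r
#   X_obs  : length-q vector of observed values X_t, t = T-q+1, ..., T
#   lambda : penalization parameter (0.01 in Section 3)
fit_ensemble_weights <- function(P, X_obs, lambda = 0.01) {
  q <- nrow(P)
  R <- ncol(P)
  mse_r <- colMeans((P - X_obs)^2)   # individual mean squared error of each regressor
  model <- list(
    Q          = crossprod(P) / q,                                  # (1/q) alpha' P'P alpha
    obj        = lambda * mse_r - (2 / q) * as.vector(crossprod(P, X_obs)),
    A          = matrix(1, nrow = 1, ncol = R),                     # sum of weights
    sense      = "=",
    rhs        = 1,
    lb         = rep(0, R),                                         # alpha_r >= 0
    modelsense = "min"
  )
  gurobi(model)$x   # optimal weights alpha_r
}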
Phase 3: Predictions seven-days-ahead
The final step is to train again every regressor in $\mathcal{R}$ using the whole series $\{X_t,\ t = 1, \dots, T\}$. The selected configuration of parameters found in Phase 1 is used. Finally, the predictions obtained for the seven-day period are weighted according to the weights found in Phase 2. The output is submitted to CEMat, to be part of the combined predictor available at http://matematicas.uclm.es/cemat/covid19/.
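For completeness, a small sketch (with object names of our own choosing) of how the Phase 2 weights combine the retrained regressors' seven-day forecasts:

# P_future: 7 x |R| matrix whose column r holds the seven-days-ahead
#           predictions of regressor r, retrained on the whole series
# alpha   : weights obtained from Problem (1) in Phase 2
ensemble_forecast <- function(P_future, alpha) {
  as.vector(P_future %*% alpha)   # weighted combination, one value per day
}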
3 Setup
Our goal is to make seven-days-ahead predictions for the values of two variables of interest in
Andalusia: Hospitalizados and UCI.
In our computational experiments, we fix $n_{\text{lags}} = 7$, $q = 4$ and $\lambda = 0.01$. If only the information from Andalusia is included, eight-fold cross-validation is used. However, when the information from all regions is included, we limit this to five-fold cross-validation, due to the small amount of data and the lack of observations in some regions of Spain, as is the case for Ceuta and Melilla.
The e1071 [12] and randomForest [11] R packages have been used for running SVR and RF, respectively. The grid of tuning parameters explored for SVR was cost, gamma $\in \{2^a : a = -10, \dots, 10\}$; for RF, ntree $= 500$ and mtry took eight random values between 1 and the number of attributes. The computational details for running S-ORRT are those in [3]. Finally, Problem (1) was solved using Gurobi [7], a well-known efficient solver for quadratically constrained programming problems.
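As an illustration of the model-fitting step with these two packages, here is a sketch only: the toy series, grid indices and variable names are ours, and the actual tuning loops over the whole grid with the rolling forecasting origin procedure of Phase 1.

library(e1071)         # SVR
library(randomForest)  # RF

# Toy training data frame with the lagged-matrix layout of Section 1
set.seed(1)
x <- cumsum(rpois(40, lambda = 5))
train <- as.data.frame(t(sapply(8:40, function(t) x[(t - 7):t])))
names(train) <- c(paste0("lag", 7:1), "response")

# Grid for SVR: cost and gamma in {2^a : a = -10, ..., 10}; one fit is shown
cost_grid  <- 2^(-10:10)
gamma_grid <- 2^(-10:10)
svr_fit <- svm(response ~ ., data = train, type = "eps-regression",
               kernel = "radial", cost = cost_grid[11], gamma = gamma_grid[11])

# RF with ntree = 500 and one candidate mtry value; Phase 1 tries eight
# random mtry values between 1 and the number of attributes
rf_fit <- randomForest(response ~ ., data = train, ntree = 500,
                       mtry = sample(seq_len(ncol(train) - 1), 1))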
Acknowledgements
This research has been financed in part by research projects EC H2020 MSCA RISE NeEDS (Grant agreement ID: 822214); FQM-329 and P18-FR-2369 (Junta de Andalucía, Spain); MTM2017-89422-P (Ministerio de Economía, Industria y Competitividad, Spain); PR2019-029 (Universidad de Cádiz, Spain); PITUFLOW-CM-UC3M (Comunidad de Madrid and Universidad Carlos III de Madrid, Spain); and EP/R00370X/1 (EPSRC, United Kingdom). This support is gratefully acknowledged.
References
[1] G. Biau and E. Scornet. A random forest guided tour. TEST, 25(2):197–227, 2016.
[2] R. Blanquero, E. Carrizosa, C. Molero-Río, and D. Romero Morales. Optimal randomized classification trees. Technical report, IMUS, Sevilla, Spain, https://www.researchgate.net/publication/326901224_Optimal_Randomized_Classification_Trees, 2018.
[3] R. Blanquero, E. Carrizosa, C. Molero-Río, and D. Romero Morales. On sparse optimal regression trees. Technical report, IMUS, Sevilla, Spain, https://www.researchgate.net/publication/341099512_On_Sparse_Optimal_Regression_Trees, 2020.
[4] R. Blanquero, E. Carrizosa, C. Molero-Río, and D. Romero Morales. Sparsity in optimal randomized classification trees. European Journal of Operational Research, 284(1):255–272, 2020.
[5] L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
[6] E. Carrizosa and D. Romero Morales. Supervised classification and mathematical optimization.
Computers & Operations Research, 40(1):150–165, 2013.
[7] Gurobi Optimization, LLC. Gurobi optimizer reference manual, 2018.
[8] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer,
New York, 2nd edition, 2009.
[9] T. Hastie, R. Tibshirani, and M. Wainwright. Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, 2015.
[10] R. J. Hyndman and G. Athanasopoulos. Forecasting: Principles and Practice. OTexts, 2018.
[11] A. Liaw and M. Wiener. Classification and Regression by randomForest. R News, 2(3):18–22,
2002.
[12] D. Meyer, E. Dimitriadou, K. Hornik, A. Weingessel, and F. Leisch. e1071: Misc Functions of
the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien, 2019.
R package version 1.7-1.
[13] Y. Ren, L. Zhang, and P. N. Suganthan. Ensemble classification and regression-recent developments, applications and future directions. IEEE Computational Intelligence Magazine, 11(1):41–53, 2016.