European Journal of Operational Research
Computational Intelligence & Information Management
https://doi.org/10.1016/j.ejor.2021.04.016
On sparse ensemble methods: An application to short-term predictions of the evolution of COVID-19

Sandra Benítez-Peña (a,b), Emilio Carrizosa (a,b), Vanesa Guerrero (c), M. Dolores Jiménez-Gamero (a,b), Belén Martín-Barragán (d), Cristina Molero-Río (a,b), Pepa Ramírez-Cobo (e,a), Dolores Romero Morales (f,*), M. Remedios Sillero-Denamiel (a,b)

(a) Instituto de Matemáticas de la Universidad de Sevilla, Seville, Spain
(b) Departamento de Estadística e Investigación Operativa, Universidad de Sevilla, Seville, Spain
(c) Departamento de Estadística, Universidad Carlos III de Madrid, Getafe, Spain
(d) The University of Edinburgh Business School, University of Edinburgh, Edinburgh, UK
(e) Departamento de Estadística e Investigación Operativa, Universidad de Cádiz, Cadiz, Spain
(f) Department of Economics, Copenhagen Business School, Frederiksberg, Denmark

(*) Corresponding author.
E-mail addresses: sbenitez1@us.es (S. Benítez-Peña), ecarrizosa@us.es (E. Carrizosa), vanesa.guerrero@uc3m.es (V. Guerrero), dolores@us.es (M.D. Jiménez-Gamero), Belen.Martin@ed.ac.uk (B. Martín-Barragán), mmolero@us.es (C. Molero-Río), pepa.ramirez@uca.es (P. Ramírez-Cobo), drm.eco@cbs.dk (D. Romero Morales), rsillero@us.es (M.R. Sillero-Denamiel).
Article info

Article history: Received 27 May 2020; Accepted 7 April 2021; Available online xxx.

Keywords: Machine Learning; Ensemble Method; Mathematical Optimization; Selective Sparsity; COVID-19
Abstract
Since the seminal paper by Bates and Granger in 1969, a vast number of ensemble methods that combine
different base regressors to generate a unique one have been proposed in the literature. The so-obtained
regressor method may have better accuracy than its components, but at the same time it may overfit,
it may be distorted by base regressors with low accuracy, and it may be too complex to understand
and explain. This paper proposes and studies a novel Mathematical Optimization model to build a sparse
ensemble, which trades off the accuracy of the ensemble and the number of base regressors used. The
latter is controlled by means of a regularization term that penalizes regressors with a poor individual
performance. Our approach is flexible to incorporate desirable properties one may have on the ensemble,
such as controlling the performance of the ensemble in critical groups of records, or the costs associated
with the base regressors involved in the ensemble. We illustrate our approach with real data sets arising
in the COVID-19 context.
© 2021 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
1. Introduction
A plethora of methodologies of very different nature is currently available for predicting a continuous response variable, as is the case in regression as well as in time series forecasting. Those methodologies come mainly from Machine Learning, such as Support Vector Machines (Carrizosa & Romero Morales, 2013; Vapnik, 1995), Random Forests (Breiman, 2001), Optimal Trees (Bertsimas & Dunn, 2017; Blanquero, Carrizosa, Molero-Río, & Romero Morales, 2021; Carrizosa, Molero-Río, & Romero Morales, 2021), Deep Learning (Gambella, Ghaddar, & Naoum-Sawaya, 2021); or from Statistics, such as Generalized Linear Models (Hastie, Tibshirani, & Wainwright, 2015), Semi- and Nonparametric approaches to regression (such as smoothing techniques) (Härdle, 1990), Regression models for time series analysis (Kedem & Fokianos, 2005), or Random Effects models (Lee, Nelder, & Pawitan, 2018). Some of these techniques have shown a relatively high degree of success in COVID-19 time series forecasting (Benítez-Peña et al., 2020b; Nikolopoulos, Punia, Schäfers, Tsinopoulos, & Vasilakis, 2021), which is the application that has inspired this work.
In this way, the user has at hand a long list of fitted regression models, referred to in what follows as base regressors, and faces the problem of deciding which one to choose, or alternatively, how to combine (some of) the competing approaches, that is, how to build an ensemble. While a thorough computational study of the different models may help the user to identify the most convenient one, such an approach becomes unworkable when predicting new phenomena in real-time, like the evolution of the COVID-19 counts (confirmed cases, hospitalized patients, ICU patients, recovered patients, and fatalities). Here, the most accurate method will probably change over time since we are dealing with a dynamic setting, but also because of the non-stationarity of the data caused, for instance, by the different interventions of authorities to flatten the curve.
Hence, it may be more convenient to build an ensemble where some accuracy measure, such as a (cross-validation) estimate of the expected squared error or of the absolute error (Ando & Li, 2014; Bates & Granger, 1969), is optimized at each forecast origin. With this approach other relevant issues can be modeled, such as sparsity in the feature space (Bertsimas, King, & Mazumder, 2016; Carrizosa, Mortensen, Romero Morales, & Sillero-Denamiel, 2020a; Carrizosa, Olivares-Nadal, & Ramírez-Cobo, 2017b; Fountoulakis & Gondzio, 2016), interpretability (Carrizosa, Nogales-Gómez, & Romero Morales, 2016; 2017a; Carrizosa, Olivares-Nadal, & Ramírez-Cobo, 2020b; Martín-Barragán, Lillo, & Romo, 2014), critical values of features (Carrizosa, Martín-Barragán, & Romero Morales, 2010; 2011), measurement costs (Carrizosa, Martín-Barragán, & Romero Morales, 2008), or cost-sensitive performance constraints (Benítez-Peña, Blanquero, Carrizosa, & Ramírez-Cobo, 2019a; 2020a; Blanquero, Carrizosa, Ramírez-Cobo, & Sillero-Denamiel, 2020). See Friese, Bartz-Beielstein, and Emmerich (2016), Mendes-Moreira, Soares, Jorge, and Sousa (2012), Ren, Zhang, and Suganthan (2016) and references therein for the role of mathematical optimization when constructing ensembles, and Friese, Bartz-Beielstein, Bäck, Naujoks, and Emmerich (2019) for the use of ensembles to enhance the optimization of black-box expensive functions.
In this paper, we propose an optimization approach to build a sparse ensemble. In contrast to existing proposals in the literature, our paper focuses on an innovative definition of sparsity, the so-called selective sparsity. Our goal is to build a sparse ensemble which takes into account the individual performance of each base regressor, in such a way that only good base regressors are allowed to take part in the ensemble. This is done with the aim of adapting to dynamic settings, such as in COVID-19 counts, where the composition of the ensemble may change over time, but also to avoid the ensemble being distorted by base regressors with low accuracy or being too complex to understand and explain. Ours can be seen as a sort of what Mendes-Moreira et al. (2012) call ensemble pruning, where the ensemble is constructed by using a subset of all available base regressors. The novelty of our approach resides in the fact that the selection of the subset and the weights in the ensemble are simultaneously optimized.
We propose a Mathematical Optimization model that trades off the accuracy of the ensemble and the number of base regressors used. The latter is controlled by means of a regularization term that penalizes regressors with a poor individual performance. Our approach is flexible to incorporate desirable properties one may have on the ensemble, such as controlling the performance of the ensemble in critical groups of records, or the costs associated with the base regressors involved in the ensemble. Our data-driven approach is applied to short-term predictions of the evolution of COVID-19, as an alternative to model-based prediction algorithms as in Achterberg et al. (2020) and references therein.
The remainder of the paper is structured as follows. Section 2 formulates the Mathematical Optimization problem to construct the sparse ensemble. Theoretical properties of the optimal solution are studied, and how to accommodate some desirable properties on the ensemble is also discussed. Section 3 illustrates our approach with real data sets arising in the COVID-19 context, where one can see how the ensemble composition changes over time. The paper ends with some concluding remarks and lines for future research in Section 4.
2. The optimization model

This section presents the new ensemble approach. Section 2.1 describes the formulation of the model in terms of an optimization problem with linear constraints. Section 2.2 establishes the connection of the approach with the constrained Lasso (Blanquero et al., 2020; Gaines, Kim, & Zhou, 2018) and some theoretical results on the solution are derived. Finally, Section 2.3 considers some extensions of the model concerning the control of the set of base regressors or control of the performance in critical groups.
2.1. The formulation

Let $\mathcal{F}$ be a finite set of base regressors for the response variable $y$. No restriction is imposed on the collection of base regressors. It may include a variety of state-of-the-art models and methodologies for setting their parameters and hyperparameters. It may even use alternative samples for training, for example where individuals are characterized by different sets of features. By taking convex combinations of the base regressors in $\mathcal{F}$, we obtain a broader class of regressors, namely,
$$\mathrm{co}(\mathcal{F}) = \Big\{ F = \sum_{f \in \mathcal{F}} \alpha_f f \;:\; \sum_{f \in \mathcal{F}} \alpha_f = 1,\; \alpha_f \ge 0,\; f \in \mathcal{F} \Big\}.$$
Throughout this section, vectors will be denoted with bold typesetting, e.g., $\boldsymbol{\alpha} = (\alpha_f)_{f \in \mathcal{F}}$.
The selection of one combined regressor from $\mathrm{co}(\mathcal{F})$ will be made by optimizing a function which takes into account two criteria. The first and fundamental criterion is the overall accuracy of the combined regressor, measured through a loss function $L$, defined on $\mathrm{co}(\mathcal{F})$,
$$L : \mathrm{co}(\mathcal{F}) \longrightarrow \mathbb{R}, \qquad F \longmapsto L(F).$$
For each base regressor $f \in \mathcal{F}$ we assume its individual loss $L_f$ is given. This may be simply defined as $L_f = L(f)$, but other options are possible too, in which, for instance, $L_f$ and $L$ are both empirical losses, as in Section 2.2, but use different training samples.
With the second criterion, a selective sparsity is pursued to make the method more reluctant to choose base regressors $f \in \mathcal{F}$ with lower reliability, i.e., with higher individual loss $L_f$, thus reducing overfitting. To achieve this, we add a regularization term in which the weight of base regressor $f$, say $\alpha_f$, is multiplied by its individual loss $L_f$. The selective sparse ensemble is obtained by solving the following Mathematical Optimization problem with linear constraints:
$$\min_{\boldsymbol{\alpha} \in S} \; L\Big(\sum_{f \in \mathcal{F}} \alpha_f f\Big) + \lambda \sum_{f \in \mathcal{F}} \alpha_f L_f, \tag{1}$$
where $S$ is the unit simplex in $\mathbb{R}^{|\mathcal{F}|}$,
$$S = \Big\{ \boldsymbol{\alpha} \in \mathbb{R}^{|\mathcal{F}|} : \sum_{f \in \mathcal{F}} \alpha_f = 1,\; \alpha_f \ge 0,\; f \in \mathcal{F} \Big\},$$
and $\lambda \ge 0$ is a regularization parameter, which trades off the importance given to the loss of the ensemble regressor and to the selective sparsity of the base regressors used.
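For concreteness, the following minimal sketch (ours, not the authors' implementation) solves Problem (1) with the OLS loss (3) of Section 2.2 playing the role of $L$; the prediction matrix P (one column of base-regressor predictions per $f \in \mathcal{F}$), the responses y, and the individual losses L_f are placeholder inputs the reader must supply.

```python
import numpy as np
from scipy.optimize import minimize

def solve_problem_1(P, y, L_f, lam):
    """Minimize L(sum_f alpha_f f) + lam * sum_f alpha_f L_f over the unit simplex,
    with L the OLS loss sum_i (y_i - sum_f alpha_f f(x_i))^2."""
    n_reg = P.shape[1]

    def objective(alpha):
        r = y - P @ alpha                   # residuals y_i - sum_f alpha_f f(x_i)
        return r @ r + lam * (alpha @ L_f)  # ensemble loss + selective penalty

    constraints = [{"type": "eq", "fun": lambda a: a.sum() - 1.0}]  # sum_f alpha_f = 1
    bounds = [(0.0, 1.0)] * n_reg                                   # alpha_f >= 0
    alpha0 = np.full(n_reg, 1.0 / n_reg)
    return minimize(objective, alpha0, method="SLSQP",
                    bounds=bounds, constraints=constraints).x

# Toy example with random placeholder data.
rng = np.random.default_rng(0)
P = rng.normal(size=(20, 5))
y = rng.normal(size=20)
L_f = np.array([((y - P[:, j]) ** 2).sum() for j in range(5)])  # L_f = L(f)
print(solve_problem_1(P, y, L_f, lam=0.5))
```

For the losses of Section 2.2 a dedicated LP/QP solver is preferable (see Section 3.4); the general-purpose SLSQP routine is used here only to keep the sketch short.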
2.2. Theoretical results

In general, Problem (1) has a nonlinear objective function and linear constraints. For loss functions commonly used in the literature, we can rewrite its objective as a linear or a convex quadratic function while the constraints remain linear. Therefore, for these loss functions, Problem (1) is easily tractable with commercial solvers. In addition, and under some mild assumptions, we characterize the behavior of the optimal solution with respect to the parameter $\lambda$.

First, we will rewrite the second term in the objective function, so that the proposed model can be seen as a particular case of the constrained Lasso. As for Lasso models and extensions of them, having a sparse model reduces the danger of overfitting.
Remark 1. The so-called selective $\ell_1$ norm $\|\cdot\|_1^{sel}$ in $\mathbb{R}^{|\mathcal{F}|}$ is defined as
$$\|\boldsymbol{\alpha}\|_1^{sel} = \sum_{f \in \mathcal{F}} L_f |\alpha_f|.$$
The objective function in Problem (1) can be written as $L\big(\sum_{f \in \mathcal{F}} \alpha_f f\big) + \lambda \|\boldsymbol{\alpha}\|_1^{sel}$. With this, and for well-known losses $L$, Problem (1) can be seen as a constrained Lasso problem (Blanquero et al., 2020; Gaines, Kim, & Zhou, 2018), in which a selective sparsity is sought, as opposed to a plain sparsity with as few nonzero coefficients $\alpha_f$ as possible.
Remark 2. Let $I$ be a training sample, in which each individual $i \in I$ is characterized by its feature vector $\mathbf{x}_i \in \mathbb{R}^p$ and its response $y_i$. Let $L$ be the empirical loss of quantile regression (Koenker & Hallock, 2001) for $I$,
$$L\Big(\sum_{f \in \mathcal{F}} \alpha_f f\Big) = \sum_{i \in I} \rho_\tau\Big( y_i - \sum_{f \in \mathcal{F}} \alpha_f f(\mathbf{x}_i) \Big), \tag{2}$$
where
$$\rho_\tau(s) = \begin{cases} \tau s, & \text{if } s \ge 0 \\ -(1-\tau)s, & \text{if } s < 0, \end{cases}$$
for some $\tau \in (0,1)$. Then, as in e.g. Koenker and Ng (2005), Problem (1) can be expressed as a linear program and thus efficiently solved with Linear Programming solvers.
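To make the linear program explicit, one standard reformulation (the usual residual-splitting trick for quantile regression; the paper does not spell it out) introduces $u_i, v_i \ge 0$ with $u_i - v_i$ equal to the $i$-th residual:
$$\min_{\boldsymbol{\alpha} \in S,\; \mathbf{u}, \mathbf{v} \ge \mathbf{0}} \;\; \sum_{i \in I} \big( \tau u_i + (1-\tau) v_i \big) + \lambda \sum_{f \in \mathcal{F}} \alpha_f L_f \quad \text{s.t.} \quad u_i - v_i = y_i - \sum_{f \in \mathcal{F}} \alpha_f f(\mathbf{x}_i), \;\; i \in I.$$
At an optimum, at most one of $u_i, v_i$ is positive, so $\tau u_i + (1-\tau) v_i = \rho_\tau\big(y_i - \sum_{f \in \mathcal{F}} \alpha_f f(\mathbf{x}_i)\big)$, recovering the objective with loss (2).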
Remark 3. Let $I$ be a training sample, in which each individual $i \in I$ is characterized by its feature vector $\mathbf{x}_i \in \mathbb{R}^p$ and its response $y_i$. Let $L$ be the empirical loss of Ordinary Least Squares (OLS) regression for $I$, i.e.,
$$L\Big(\sum_{f \in \mathcal{F}} \alpha_f f\Big) = \sum_{i \in I} \Big( y_i - \sum_{f \in \mathcal{F}} \alpha_f f(\mathbf{x}_i) \Big)^2. \tag{3}$$
Hence, Problem (1) is a convex quadratic problem with linear constraints, which, by Remark 1, can be seen as a constrained Lasso. In particular, the results in Gaines, Kim, and Zhou (2018) apply, and thus we can assert that, if the design matrix $\big(f(\mathbf{x}_i)\big)_{i \in I, f \in \mathcal{F}}$ has full rank, then:

1. For any $\lambda \ge 0$, Problem (1) has a unique optimal solution $\boldsymbol{\alpha}^{\lambda}$.
2. The path of optimal solutions $\boldsymbol{\alpha}^{\lambda}$ is piecewise linear in $\lambda$.
Under mild conditions on $L$, applicable in particular for the quantile and OLS empirical loss functions, we characterize the optimal solution of Problem (1) for large values of the parameter $\lambda$. Intuitively speaking, for $\lambda$ growing to infinity, the first term in the objective function becomes negligible, and thus we only need to solve the Linear Programming problem of minimizing $\sum_{f \in \mathcal{F}} \alpha_f L_f$ in the simplex $S$. This problem attains its optimum at one of the extreme points of the feasible region, i.e., at some $f^* \in \mathcal{F}$, namely, one for which $L_{f^*} \le L_f$ for all $f$. We formalize this intuition in the following proposition, where, under the assumption of convexity of $L$, we show that a finite value of $\lambda$ exists for which such a sparse solution is optimal. Before stating it, notice that, since the set $\mathcal{F}$ is given, we can define
$$L : \Omega \longrightarrow \mathbb{R}, \qquad \mathbf{w} \longmapsto L(\mathbf{w}) = L\Big(\sum_{f \in \mathcal{F}} w_f f\Big),$$
for some $\Omega \subseteq \mathbb{R}^{|\mathcal{F}|}$ such that $S \subseteq \Omega$.
Proposition 1. Assume that $L$ is convex in an open convex set $\Omega \supseteq S$. Furthermore, assume that there exists a base regressor $f^*$ such that $L_{f^*} < L_f$ for all $f \in \mathcal{F}$, $f \ne f^*$. Then, there exists $\bar{\lambda} < +\infty$ such that, for any $\lambda \ge \bar{\lambda}$, $f^*$ is an optimal solution to Problem (1).

Proof. Let $f^*$ be as in the statement of the proposition, and let $\bar{\boldsymbol{\alpha}} \in S$ denote the vector with 1 in its component corresponding to $f^*$ and 0 otherwise. Since $L$ is defined in the open set $\Omega \ni \bar{\boldsymbol{\alpha}}$, the subdifferential $\partial L(\bar{\boldsymbol{\alpha}})$ of the convex function $L$ at $\bar{\boldsymbol{\alpha}}$ is not empty. Let $\mathbf{p} \in \partial L(\bar{\boldsymbol{\alpha}})$, and let $N(\bar{\boldsymbol{\alpha}})$ denote the normal cone of $S$ at $\bar{\boldsymbol{\alpha}}$. Then,
$$\mathbf{0} \in \mathbf{p} + \lambda (L_f)_{f \in \mathcal{F}} + N(\bar{\boldsymbol{\alpha}}) \iff p_{f^*} + \lambda L_{f^*} \le p_f + \lambda L_f \quad \forall f \in \mathcal{F}, \tag{4}$$
which is satisfied iff
$$\lambda \ge \max\left\{ \frac{p_{f^*} - p_f}{L_f - L_{f^*}} : f \in \mathcal{F},\; f \ne f^* \right\}. \tag{5}$$
Setting $\bar{\lambda}$ equal to the value on the right-hand side of (5), and taking into account that the condition on the left-hand side of (4) is necessary and sufficient for the optimality of $\bar{\boldsymbol{\alpha}}$, the result follows. □
2.3. Extensions

Problem (1) can be enriched to address some desirable properties one may seek for the ensemble. Three of them are discussed in what follows. The first two properties relate to the transparency and interpretability of the ensemble, Deng (2019) and Florez-Lopez and Ramon-Jeronimo (2015), while the third one relates to the performance of the ensemble in critical groups.

As mentioned in the introduction, the ensemble may contain base regressors built with several methodologies of very diverse nature. Therefore, one may want to control the number of methodologies used in the final ensemble. For instance, in the application described in Section 3, we consider four methodologies, namely, Support Vector Regression, Random Forests, Optimal Trees, and Linear Regression. Let $\mathcal{F} = \bigcup_{m \in M} \mathcal{F}^{type}_m$, where $\mathcal{F}^{type}_m$ is the set of base regressors using methodology $m \in M$, and let $\boldsymbol{\alpha}^{type}_m$ be the corresponding subvector of $\boldsymbol{\alpha}$, namely, the one containing the components in $\boldsymbol{\alpha}$ referring to methodology $m \in M$. With this, we can extend the objective function of Problem (1) to
$$L\Big(\sum_{f \in \mathcal{F}} \alpha_f f\Big) + \lambda \sum_{f \in \mathcal{F}} \alpha_f L_f + \lambda^{type} \sum_{m \in M} \big\|\boldsymbol{\alpha}^{type}_m\big\|_\infty. \tag{6}$$
Fig. 1. Cumulative number of hospitalized patients in Andalusia (Spain) for COVID-19 in the period 10/03/2020–20/05/2020.

In a similar fashion, one may want to control the set of features used by the ensemble. Let $\mathcal{F}^{fea}_j \subseteq \mathcal{F}$ be the set of base regressors using feature $j \in \{1,\dots,p\}$, and let $\boldsymbol{\alpha}^{fea}_j$ be the corresponding subvector of $\boldsymbol{\alpha}$, namely, the one containing the components in $\boldsymbol{\alpha}$ referring to feature $j \in \{1,\dots,p\}$. With this, we can extend the objective function of Problem (1) to
$$L\Big(\sum_{f \in \mathcal{F}} \alpha_f f\Big) + \lambda \sum_{f \in \mathcal{F}} \alpha_f L_f + \lambda^{fea} \sum_{j=1}^{p} \big\|\boldsymbol{\alpha}^{fea}_j\big\|_\infty. \tag{7}$$

In both cases, the $\ell_\infty$ terms can be rewritten using new decision variables and linear constraints, and thus the structure of the problem is not changed. This way, if $L$ is the quantile regression (respectively, the Ordinary Least Squares) empirical loss, the optimization problem with objective as in (6) is written as a linear problem (respectively, as a convex quadratic problem with linear constraints). The same holds for the optimization problem with objective as in (7).
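The rewriting of the $\ell_\infty$ terms alluded to above is standard; as a sketch for (6) (the paper does not spell it out): since $\alpha_f \ge 0$, each $\|\boldsymbol{\alpha}^{type}_m\|_\infty$ equals $\max_{f \in \mathcal{F}^{type}_m} \alpha_f$, which can be replaced by an auxiliary variable $u_m$:
$$\min_{\boldsymbol{\alpha} \in S,\; \mathbf{u}} \; L\Big(\sum_{f \in \mathcal{F}} \alpha_f f\Big) + \lambda \sum_{f \in \mathcal{F}} \alpha_f L_f + \lambda^{type} \sum_{m \in M} u_m \quad \text{s.t.} \quad u_m \ge \alpha_f, \;\; f \in \mathcal{F}^{type}_m,\; m \in M.$$
Minimization drives each $u_m$ down to $\max_{f \in \mathcal{F}^{type}_m} \alpha_f$, so the two formulations coincide; the analogous rewriting applies to (7).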
In addition, our approach can easily incorporate cost-sensitive performance constraints to ensure that we control not only the overall accuracy of the regressor, but also the accuracy on a number of critical groups, as in Benítez-Peña et al. (2019a), Benítez-Peña, Blanquero, Carrizosa, and Ramírez-Cobo (2019b), Blanquero et al. (2020) and Datta and Das (2015). With this, if $\delta_g > 0$ denotes the threshold on the loss $L^g$ for group $g \in G$, we can add to the feasible region of Problem (1) the constraints
$$L^g\Big(\sum_{f \in \mathcal{F}} \alpha_f f\Big) \le \delta_g, \quad g \in G. \tag{8}$$
For the quantile and Ordinary Least Squares empirical loss functions, these constraints are linear or convex quadratic, respectively, and thus the optimization problems can be addressed with the very same numerical tools as before.
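For instance, under the OLS loss (3), writing $I_g \subseteq I$ for the records in group $g$ (our notation, since the paper leaves the group sample implicit), constraint (8) reads
$$\sum_{i \in I_g} \Big( y_i - \sum_{f \in \mathcal{F}} \alpha_f f(\mathbf{x}_i) \Big)^2 \le \delta_g, \quad g \in G,$$
a convex quadratic constraint in $\boldsymbol{\alpha}$ that QP solvers accept directly.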
3. Short-term predictions of the evolution of COVID-19

The purpose of this section is to illustrate how, thanks to the selective sparsity term in Problem (1), we can provide good ensembles in terms of accuracy. For this, we use data sets arising in the context of COVID-19.
3.1. The data

COVID-19 was first identified in China in December 2019 and, subsequently, started to spread broadly. Quickly after this, data started to be collected daily by the different countries. Several variables of interest, such as confirmed cases, hospitalized patients, ICU patients, recovered patients, and fatalities, among others, were considered. Different initiatives around the world emerged in order to get to know this new scenario.

In this section, we focus on the evolution of the pandemic in Spain and Denmark. The first cases were confirmed in Spain and Denmark in late February 2020 and early March 2020, respectively. In this paper, the considered variable of interest is the cumulative number of hospitalized patients in the regions of Andalusia (Spain) and Sjælland (Denmark). Figs. 1 and 2 display the data in the periods 10/03/2020–20/05/2020 for Andalusia and 06/03/2020–20/05/2020 for Sjælland, which can be found at the repositories in Fernández-Casal (2020) and Statens Serum Institut (2020), respectively.
The univariate time series $\{X_t,\ t = 1,\dots,T\}$, with $X_t$ representing the cumulative number of hospitalized patients in the region under consideration on day $t$, is converted into a multivariate series using seven lags. In other words, the data fed to the base regressors is not the time series itself, but the vectors of covariates and responses in Fig. 3. This training set is just one of the different options we have considered to create base regressors. In the next section, we discuss other data choices, which we will refer to as Country, Transformation and Differences.
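As an illustration of this seven-lag conversion, here is a minimal Python sketch (ours, not the authors' code); the layout of Fig. 3 is assumed to be the lags $(X_{t-7},\dots,X_{t-1})$ as covariates with response $X_t$.

```python
import numpy as np

def make_lagged(series, n_lags=7):
    """Convert a univariate series into a supervised set: row t holds the lags
    (X_{t-7}, ..., X_{t-1}) as covariates, with X_t as response."""
    X, y = [], []
    for t in range(n_lags, len(series)):
        X.append(series[t - n_lags:t])
        y.append(series[t])
    return np.array(X), np.array(y)

# Toy nondecreasing "cumulative count" series.
counts = np.array([0, 1, 3, 6, 10, 14, 20, 27, 35, 44], dtype=float)
X, y = make_lagged(counts)  # X has shape (3, 7); y has shape (3,)
```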
3.2. Options for feeding the data

We first discuss the Country data choice. Let $R$ be the number of regions of the country under consideration and, without loss of generality, let us assume that the first one is the region under consideration. The time series $\{X^r_t,\ t = 1,\dots,T\}$, for regions $r = 2,\dots,R$, were also available. Such time series are correlated with the one under consideration. We had to decide whether to incorporate these additional time series in our forecasting model. If we do so, the feeding data contains the 7-uples in Fig. 3 from the region under consideration, as well as the ones from the other $R - 1$ regions, see Fig. 4. We now move to the Transformation choice. For the two choices in Figs. 3 and 4, either the crude data $X$ are used or they are transformed using some standard Box-Cox transformations, Hastie, Tibshirani and Wainwright (2015), namely, $X^2$ and $\log(X+1)$. Finally, with respect to the Differences choice, we have also considered whether information about the monotonicity (first difference, $\nabla X_t := X_t - X_{t-1}$) and the curvature (second difference, $\nabla^2 X_t := \nabla X_t - \nabla X_{t-1}$) is added to the feeding data as predictors, thus yielding 6 and 5 new predictors because of monotonicity and curvature, respectively.
To end this section, observe that the time series $\{X_t,\ t = 1,\dots,T\}$ of cumulative number of hospitalized patients in the region under consideration is, by nature, nondecreasing. However, some of the methodologies in the next section used to build base regressors do not guarantee such monotonicity. To ensure that the predictions show the monotonicity property present in the data, we use as response variable $\log(1 + \nabla X_t)$, instead of $X_t$. Once the procedure is completed, we undo this transformation to predict the original response variable $X_t$. Figs. 5 and 6 display $\log(1 + \nabla X_t)$ for Andalusia and Sjælland, respectively, where $t$ is as in Figs. 1 and 2.

Fig. 2. Cumulative number of hospitalized patients in Sjælland (Denmark) for COVID-19 in the period 06/03/2020–20/05/2020.

Fig. 3. Covariates (in parentheses) and response variable for the cumulative number of hospitalized patients in the region under consideration.

Fig. 4. Covariates (in parentheses) and response variable for the cumulative number of hospitalized patients in each of the R regions of the country.
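A sketch of this response transformation and its inversion as we read it (assuming $\nabla X_t = X_t - X_{t-1}$ as defined above; the clipping of negative predictions is a safeguard of ours, not discussed in the paper):

```python
import numpy as np

def to_response(X):
    """Response log(1 + (X_t - X_{t-1})) for a cumulative, nondecreasing series X."""
    return np.log1p(np.diff(X))

def undo_transformation(last_observed, z_pred):
    """Invert the transformation: cumulate the predicted increments expm1(z),
    clipping negative predictions to zero so the output stays nondecreasing."""
    increments = np.expm1(np.clip(z_pred, 0.0, None))
    return last_observed + np.cumsum(increments)

X = np.array([100.0, 120.0, 150.0, 190.0])
z = to_response(X)                                        # targets fed to the base regressors
X_hat = undo_transformation(X[-1], np.array([3.0, 2.5]))  # two-day-ahead cumulative predictions
```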
3.3. The base regressors

We consider four base methodologies to build the set of base regressors $\mathcal{F}$. This includes three state-of-the-art Machine Learning tools, namely Support Vector Regression (SVR) (Carrizosa & Romero Morales, 2013), Random Forest (RF) (Breiman, 2001), and Sparse Optimal Randomized Regression Trees (S-ORRT) (Blanquero, Carrizosa, Molero-Río, & Romero Morales, 2020a), as well as the classic Linear Regression (LR). Each of them is fed each time with one of the data choices described in Sections 3.1 and 3.2. See Table 1 for a description of the elements of $\mathcal{F} = \mathcal{F}_{SVR} \cup \mathcal{F}_{RF} \cup \mathcal{F}_{LR} \cup \mathcal{F}_{S\text{-}ORRT} = \{f_j : j = 1,\dots,36\}$ according to their methodology and the data choices. These methodologies have some parameters which must be tuned, and we explain below the tuning we have performed together with other computational details.

To tune the parameters, the different base regressors are trained using all the available data, except for the last four days, i.e., these models are trained on $t \in \{1,\dots,T-4\}$. The e1071 (Meyer, Dimitriadou, Hornik, Weingessel, & Leisch, 2019) and randomForest (Liaw & Wiener, 2002) R packages have been used for training SVR and RF, respectively, while the lm routine in R is used for LR. The computational details for training S-ORRT are those in Blanquero et al. (2020a). For SVR, we use the RBF kernel and perform a grid search in $\{2^a : a = -10,\dots,10\}$ for both parameters, cost and gamma. For RF, we set ntree = 500, and for mtry we try out five random values. If only information from the region under consideration is included ('Country No' data option in Table 1), eight-fold cross-validation is used. However, when information from all regions in the country is included, we limit this to five-fold cross-validation, due to the small amount of data and the lack of observations in some regions. Such cross-validation estimates are used to select the best values of the parameters. With those best values, for each combination of feeding data and methodology, the base regressors $f \in \mathcal{F}$ are built using information from $t \in \{1,\dots,T-4\}$, see Fig. 7.
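The authors tune SVR with the R package e1071; the following is a rough Python analogue using scikit-learn (our sketch, not the authors' code; scikit-learn's C and gamma play the role of e1071's cost and gamma).

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

# Placeholder lagged covariates/responses for t in {1, ..., T-4}.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(40, 7))
y_train = rng.normal(size=40)

grid = {"C": [2.0 ** a for a in range(-10, 11)],       # e1071's "cost"
        "gamma": [2.0 ** a for a in range(-10, 11)]}
search = GridSearchCV(SVR(kernel="rbf"), grid, cv=8,
                      scoring="neg_mean_squared_error")
search.fit(X_train, y_train)
best_svr = search.best_estimator_  # one tuned base regressor f, refit on the training data
```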
3.4. The pseudocode of the complete procedure

The complete procedure for making short-term predictions with our selective sparse ensemble methodology is summarized in Algorithm 1 and can be visualized in Fig. 7. The considered grid of values for the tradeoff parameter $\lambda$ in Problem (1) is $\{0, 2^{-10}, 2^{-9}, \dots, 2^{3}\}$. For the tests considered in this section, this grid is wide enough. On one extreme, we have included the trivial value $\lambda = 0$, for which the selective sparsity term does not play a role. On the other extreme, with this grid we ensure that $\lambda = \bar{\lambda}$ is reached, for which, by Proposition 1, the ensemble shows the highest level of sparsity.

We start by training the base regressors $\mathcal{F}$ in Table 1, with tuning parameters as in Section 3.3, using the data available up to day $T - 4$. We then move to solve Problem (1) for the different values of $\lambda$ in the grid.
Fig. 5. Representation of the function $\log(1 + \nabla X_t)$, where $X_t$ denotes the cumulative number of hospitalized patients in Andalusia for COVID-19 in the period 10/03/2020–20/05/2020.

Fig. 6. Representation of the function $\log(1 + \nabla X_t)$, where $X_t$ denotes the cumulative number of hospitalized patients in Sjælland for COVID-19 in the period 06/03/2020–20/05/2020.
Table 1. Description of the chosen base regressors $f_1,\dots,f_{36}$ according to the data choices on Country (No / Yes), Transformation ($X$, $\log(X+1)$, $X^2$) and Differences (Yes / No), and the four methodologies used, with tuning parameters as in Section 3.3. In the original layout, $f_1,\dots,f_{18}$ are grouped under $\mathcal{F}_{SVR}$ and $\mathcal{F}_{RF}$, and $f_{19},\dots,f_{36}$ under $\mathcal{F}_{LR}$ and $\mathcal{F}_{S\text{-}ORRT}$. [Table body omitted: the marks matching each $f_j$ to its data-choice combination are not recoverable.]
Fig. 7. The timeline of building the base regressors in F, solving Problem (1) to obtain the sparse ensemble for a given value of λ, and making the out-of-sample predictions.
Algorithm 1: Pseudocode for the complete procedure.

1  Input: $\{X_t,\ t = 1,\dots,T\}$, $\{X^r_t,\ t = 1,\dots,T\}$ for $r = 2,\dots,R$, and $\mathcal{F}$ as in Table 1
2  Set $L$ equal to the loss defined in (3)
3  Train the base regressors in $\mathcal{F}$ on $t \in \{1,\dots,T-4\}$
4  for $\lambda$ in $\{0, 2^{-10}, 2^{-9}, \dots, 2^{3}\}$ do
5      Solve Problem (1) for $\lambda$ on $t \in \{T-3,\dots,T\}$ and obtain an optimal solution $\boldsymbol{\alpha}^{\lambda}$
6  end
7  Train the base regressors in $\mathcal{F}$ on $t \in \{1,\dots,T\}$
8  for $\lambda$ in $\{0, 2^{-10}, 2^{-9}, \dots, 2^{3}\}$ do
9      Build the final ensemble regressor with weights $\boldsymbol{\alpha} = \boldsymbol{\alpha}^{\lambda}$
10     Compute the predictions given by the final ensemble regressor with weights $\boldsymbol{\alpha} = \boldsymbol{\alpha}^{\lambda}$ on $t \in \{T+1,\dots,T+14\}$
11 end
12 Output: For each $\lambda$, the fourteen-days-ahead out-of-sample predictions of the final ensemble regressor with weights $\boldsymbol{\alpha} = \boldsymbol{\alpha}^{\lambda}$
For this, we have chosen the loss $L$ as in (3), where $I$ consists of the data in the four days left out when tuning the base regressors, namely, $T-3$, $T-2$, $T-1$, $T$, while the individual losses are taken as $L_f = L(f)$. For each value of $\lambda$, we obtain the optimal weights $\boldsymbol{\alpha}^{\lambda}$ returned by Problem (1). With these weights, the final ensemble regressor is built using all the data up to day $T$, and this final ensemble regressor is used to make fourteen-day-ahead predictions in $t \in \{T+1,\dots,T+14\}$.

The commercial optimization package Gurobi (Gurobi Optimization, 2018) has been used to solve the convex quadratic problems with linear constraints arising when solving Problem (1) with the loss in (3). Our experiments have been conducted on a PC with an Intel® Core™ i7-8550U CPU 1.80 GHz processor and 8 GB RAM. The operating system is 64 bits.
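As a sketch of such a solve, the following reconstructs Problem (1) with the loss in (3) in Gurobi's Python interface, under the same placeholder conventions as the sketch in Section 2.1 (P holds the base-regressor predictions on the four validation days); this is our reconstruction, not the authors' code.

```python
import numpy as np
import gurobipy as gp
from gurobipy import GRB

def solve_problem_1_gurobi(P, y, L_f, lam):
    """Problem (1) with the OLS loss (3): a convex QP over the unit simplex."""
    n, k = P.shape
    model = gp.Model("selective_sparse_ensemble")
    model.Params.OutputFlag = 0
    alpha = model.addVars(k, lb=0.0, name="alpha")               # alpha_f >= 0
    model.addConstr(gp.quicksum(alpha[j] for j in range(k)) == 1.0)
    # Residuals r_i = y_i - sum_f alpha_f f(x_i); squared in the objective.
    resid = [y[i] - gp.quicksum(P[i, j] * alpha[j] for j in range(k))
             for i in range(n)]
    model.setObjective(gp.quicksum(r * r for r in resid)
                       + lam * gp.quicksum(L_f[j] * alpha[j] for j in range(k)),
                       GRB.MINIMIZE)
    model.optimize()
    return np.array([alpha[j].X for j in range(k)])
```

Looping this function over the grid $\{0, 2^{-10}, \dots, 2^{3}\}$ corresponds to lines 4–6 of Algorithm 1 under these assumptions.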
3.5. The numerical results

The out-of-sample prediction performance of our approach is illustrated in three training and testing splits, with all training periods starting on 10/03/2020 for Andalusia and on 06/03/2020 for Sjælland, and all testing periods containing 14 days. For Andalusia, we have 10/03/2020–03/04/2020 (Training Period 1) and 04/04/2020–17/04/2020 (Testing Period 1); 10/03/2020–14/04/2020 (Training Period 2) and 15/04/2020–28/04/2020 (Testing Period 2); and 10/03/2020–06/05/2020 (Training Period 3) and 07/05/2020–20/05/2020 (Testing Period 3). Similar periods are chosen for Sjælland, where all training periods start on 06/03/2020.

For each value of $\lambda$ in the considered grid, the fourteen-days-ahead predictions made by the ensemble together with the realized values of the variable can be found in Tables 2–7 for each period and region, while Tables 8 and 9 report the Mean Squared Error (MSE) and the Mean Absolute Error (MAE) over the fourteen days. In Tables 8 and 9, we highlight in bold the best MSE performance of the ensemble across all the values of $\lambda$ considered, and denote by $\lambda_{best}$ the value of the parameter where the minimum MSE is achieved. Note that in this case, for each period and region combination, the best MAE is also achieved at $\lambda = \lambda_{best}$. Figs. 14 and 15 present the weights of the base regressors in the ensembles as a function of $\lambda$ by means of heatmaps. The color bar of each heatmap transitions from white to black, where darker means a higher weight.
Figs. 8–13 depict the realized values of the variable at hand, cumulative number of hospitalized patients in the respective region (in red), as well as the fourteen-days-ahead predictions for three different ensembles. In the first ensemble, with $\lambda = 0$, the selective sparsity term does not play a role by construction (blue line). In the second ensemble, with $\lambda = \lambda_{best}$, the ensemble is the one that performs the best in terms of MSE among all values of $\lambda$ considered (black line). Finally, in the third ensemble, with $\lambda = \bar{\lambda}$, the ensemble is the one showing the highest level of sparsity (green line).
We start by discussing the results obtained for Period 1 in Andalusia. In Fig. 8, we can see that it is possible to improve the out-of-sample prediction performance by taking a strictly positive value of $\lambda$. As pointed out in the introduction, this is one of the advantages of our approach, namely, when seeking selective sparsity one may also obtain improvements in the out-of-sample prediction performance. A great benefit is observed with the ensemble that performs the best (black line), which is rather close to the actual values (red line). While the ensemble with $\lambda = 0$ presents an MAE of 532.71, for $\lambda_{best} = 2^{-6}$ the MAE is reduced to 40.50. This ensemble consists of the base regressors $f_2 \in \mathcal{F}_{SVR}$ and $f_{21}, f_{23} \in \mathcal{F}_{LR}$, with respective weights 0.71, 0.14 and 0.15. In Fig. 9, we plot the out-of-sample information for Andalusia and Period 2. Similar conclusions hold. In addition, the best ensemble is the one with $\lambda_{best} = 2^{-5}$, and consists of $f_5, f_{11} \in \mathcal{F}_{SVR}$, with respective weights 0.25 and 0.75. This means that the ensemble composition has changed over time, which can be explained by the non-stationarity of the data. If, after having built the best ensemble for Training Period 1, one had discarded these two base regressors because they were not selected, we would have lost the best combination for Training Period 2. This illustrates another advantage of our approach, namely, its adaptability. The ensemble composition changes again in Training Period 3 in Andalusia, where
Fig. 8. Fourteen-day-ahead predictions for the cumulative number of hospitalized patients in Andalusia for COVID-19 in Testing Period 1 for three values of the tradeoff parameter $\lambda$, together with the actual values of the variable. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 9. Fourteen-day-ahead predictions for the cumulative number of hospitalized patients in Andalusia for COVID-19 in Testing Period 2 for three values of the tradeoff parameter $\lambda$, together with the actual values of the variable. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Table 2