
ARTICLE IN PRESS

JID: EOR [m5G; May 14, 2021;10:45 ]

European Journal of Operational Research xxx (xxxx) xxx


Computational Intelligence & Inform. Management

On sparse ensemble methods: An application to short-term predictions of the evolution of COVID-19

Sandra Benítez-Peña a,b, Emilio Carrizosa a,b, Vanesa Guerrero c, M. Dolores Jiménez-Gamero a,b, Belén Martín-Barragán d, Cristina Molero-Río a,b, Pepa Ramírez-Cobo e,a, Dolores Romero Morales f,∗, M. Remedios Sillero-Denamiel a,b

a Instituto de Matemáticas de la Universidad de Sevilla, Seville, Spain
b Departamento de Estadística e Investigación Operativa, Universidad de Sevilla, Seville, Spain
c Departamento de Estadística, Universidad Carlos III de Madrid, Getafe, Spain
d The University of Edinburgh Business School, University of Edinburgh, Edinburgh, UK
e Departamento de Estadística e Investigación Operativa, Universidad de Cádiz, Cadiz, Spain
f Department of Economics, Copenhagen Business School, Frederiksberg, Denmark

Article info

Article history:
Received 27 May 2020
Accepted 7 April 2021
Available online xxx

Keywords:
Machine Learning
Ensemble Method
Mathematical Optimization
Selective Sparsity
COVID-19

Abstract

Since the seminal paper by Bates and Granger in 1969, a vast number of ensemble methods that combine different base regressors to generate a unique one have been proposed in the literature. The so-obtained regressor may have better accuracy than its components, but at the same time it may overfit, it may be distorted by base regressors with low accuracy, and it may be too complex to understand and explain. This paper proposes and studies a novel Mathematical Optimization model to build a sparse ensemble, which trades off the accuracy of the ensemble and the number of base regressors used. The latter is controlled by means of a regularization term that penalizes regressors with a poor individual performance. Our approach is flexible to incorporate desirable properties one may have on the ensemble, such as controlling the performance of the ensemble in critical groups of records, or the costs associated with the base regressors involved in the ensemble. We illustrate our approach with real data sets arising in the COVID-19 context.

© 2021 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

1. Introduction

A plethora of methodologies of very different nature is currently available for predicting a continuous response variable, as is the case in regression as well as in time series forecasting. Those methodologies come mainly from Machine Learning, such as Support Vector Machines (Carrizosa & Romero Morales, 2013; Vapnik, 1995), Random Forests (Breiman, 2001), Optimal Trees (Bertsimas & Dunn, 2017; Blanquero, Carrizosa, Molero-Río, & Romero Morales, 2021; Carrizosa, Molero-Río, & Romero Morales, 2021), Deep Learning (Gambella, Ghaddar, & Naoum-Sawaya, 2021); or from Statistics, such as Generalized Linear Models (Hastie, Tibshirani, & Wainwright, 2015), Semi- and Nonparametric approaches to regression (such as smoothing techniques) (Härdle, 1990), Regression models for time series analysis (Kedem & Fokianos, 2005), or Random Effects models (Lee, Nelder, & Pawitan, 2018). Some of these techniques have shown a relatively high degree of success in COVID-19 time series forecasting (Benítez-Peña et al., 2020b; Nikolopoulos, Punia, Schäfers, Tsinopoulos, & Vasilakis, 2021), which is the application that has inspired this work.

∗ Corresponding author.
E-mail addresses: sbenitez1@us.es (S. Benítez-Peña), ecarrizosa@us.es (E. Carrizosa), vanesa.guerrero@uc3m.es (V. Guerrero), dolores@us.es (M.D. Jiménez-Gamero), Belen.Martin@ed.ac.uk (B. Martín-Barragán), mmolero@us.es (C. Molero-Río), pepa.ramirez@uca.es (P. Ramírez-Cobo), drm.eco@cbs.dk (D. Romero Morales), rsillero@us.es (M.R. Sillero-Denamiel).

In this way, the user has at hand a long list of fitted regression models, referred to in what follows as base regressors, and faces the problem of deciding which one to choose, or alternatively, how to combine (some of) the competing approaches, that is, how to build an ensemble. While a thorough computational study of the different models may help the user to identify the most convenient one, such an approach becomes unworkable when predicting

https://doi.org/10.1016/j.ejor.2021.04.016


new phenomena in real-time, like the evolution of the COVID-19 counts (confirmed cases, hospitalized patients, ICU patients, recovered patients, and fatalities). Here, the most accurate method will probably change over time since we are dealing with a dynamic setting, but also because of the non-stationarity of the data caused, for instance, by the different interventions of authorities to flatten the curve.

Hence, it may be more convenient to build an ensemble where some accuracy measure, such as a (cross-validation) estimate of the expected squared error or of the absolute error (Ando & Li, 2014; Bates & Granger, 1969), is optimized at each forecast origin. With this approach other relevant issues can be modeled, such as sparsity in the feature space (Bertsimas, King, & Mazumder, 2016; Carrizosa, Mortensen, Romero Morales, & Sillero-Denamiel, 2020a; Carrizosa, Olivares-Nadal, & Ramírez-Cobo, 2017b; Fountoulakis & Gondzio, 2016), interpretability (Carrizosa, Nogales-Gómez, & Romero Morales, 2016; 2017a; Carrizosa, Olivares-Nadal, & Ramírez-Cobo, 2020b; Martín-Barragán, Lillo, & Romo, 2014), critical values of features (Carrizosa, Martín-Barragán, & Romero Morales, 2010; 2011), measurement costs (Carrizosa, Martín-Barragán, & Romero Morales, 2008), or cost-sensitive performance constraints (Benítez-Peña, Blanquero, Carrizosa, & Ramírez-Cobo, 2019a; 2020a; Blanquero, Carrizosa, Ramírez-Cobo, & Sillero-Denamiel, 2020). See (Friese, Bartz-Beielstein, & Emmerich, 2016; Mendes-Moreira, Soares, Jorge, & Sousa, 2012; Ren, Zhang, & Suganthan, 2016) and references therein for the role of mathematical optimization when constructing ensembles, and (Friese, Bartz-Beielstein, Bäck, Naujoks, & Emmerich, 2019) for the use of ensembles to enhance the optimization of black-box expensive functions.

In this paper, we propose an optimization approach to build a sparse ensemble. In contrast to existing proposals in the literature, our paper focuses on an innovative definition of sparsity, the so-called selective sparsity. Our goal is to build a sparse ensemble, which takes into account the individual performance of each base regressor, in such a way that only good base regressors are allowed to take part in the ensemble. This is done with the aim to adapt to dynamic settings, such as in COVID-19 counts, where the composition of the ensemble may change over time, but also to avoid that the ensemble is distorted by base regressors with low accuracy or becomes too complex to understand and explain. Ours can be seen as a sort of what Mendes-Moreira et al. (2012) call ensemble pruning, where the ensemble is constructed by using a subset of all available base regressors. The novelty of our approach resides in the fact that the selection of the subset and the weights in the ensemble are simultaneously optimized.

We propose a Mathematical Optimization model that trades off the accuracy of the ensemble and the number of base regressors used. The latter is controlled by means of a regularization term that penalizes regressors with a poor individual performance. Our approach is flexible to incorporate desirable properties one may have on the ensemble, such as controlling the performance of the ensemble in critical groups of records, or the costs associated with the base regressors involved in the ensemble. Our data-driven approach is applied to short-term predictions of the evolution of COVID-19, as an alternative to model-based prediction algorithms as in Achterberg et al. (2020) and references therein.

The remainder of the paper is structured as follows. Section 2 formulates the Mathematical Optimization problem to construct the sparse ensemble. Theoretical properties of the optimal solution are studied, and how to accommodate some desirable properties on the ensemble is also discussed. Section 3 illustrates our approach with real data sets arising in the COVID-19 context, where one can see how the ensemble composition changes over time. The paper ends with some concluding remarks and lines for future research in Section 4.

2. The optimization model

This section presents the new ensemble approach. Section 2.1 describes the formulation of the model in terms of an optimization problem with linear constraints. Section 2.2 establishes the connection of the approach with the constrained Lasso (Blanquero et al., 2020; Gaines, Kim, & Zhou, 2018), and some theoretical results on the solution are derived. Finally, Section 2.3 considers some extensions of the model concerning the control of the set of base regressors or the control of the performance in critical groups.

2.1. The formulation

Let F be a finite set of base regressors for the response variable y. No restriction is imposed on the collection of base regressors. It may include a variety of state-of-the-art models and methodologies for setting their parameters and hyperparameters. It may even use alternative samples for training, for example where individuals are characterized by different sets of features. By taking convex combinations of the base regressors in F, we obtain a broader class of regressors, namely,

co(F) = { F = Σ_{f∈F} α_f f : Σ_{f∈F} α_f = 1, α_f ≥ 0, ∀ f ∈ F }.

Throughout this section, vectors will be denoted with bold typesetting, e.g., α = (α_f)_{f∈F}.

The selection of one combined regressor from co(F) will be made by optimizing a function which takes into account two criteria. The first and fundamental criterion is the overall accuracy of the combined regressor, measured through a loss function L, defined on co(F),

L : co(F) → R,  F ↦ L(F).

For each base regressor f ∈ F we assume its individual loss L_f is given. This may be simply defined as L_f = L(f), but other options are possible too, in which, for instance, L_f and L are both empirical losses, as in Section 2.2, but use different training samples.

With the second criterion, a selective sparsity is pursued to make the method more reluctant to choose base regressors f ∈ F with lower reliability, i.e., with higher individual loss L_f, thus reducing overfitting. To achieve this, we add a regularization term in which the weight of base regressor f, say α_f, is multiplied by its individual loss L_f. The selective sparse ensemble is obtained by solving the following Mathematical Optimization problem with linear constraints:

min_{α∈S}  L( Σ_{f∈F} α_f f ) + λ Σ_{f∈F} α_f L_f,    (1)

where S is the unit simplex in R^{|F|},

S = { α ∈ R^{|F|} : Σ_{f∈F} α_f = 1, α_f ≥ 0, ∀ f ∈ F },

and λ ≥ 0 is a regularization parameter, which trades off the importance given to the loss of the ensemble regressor and to the selective sparsity of the base regressors used.
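To make the formulation concrete, below is a minimal pure-Python sketch of solving Problem (1) when L is a squared-error empirical loss (as in Section 2.2), handling the simplex constraint by exponentiated-gradient (mirror-descent) updates. This is only an illustration, not the paper's implementation — the paper solves (1) with an off-the-shelf solver (Section 3.4) — and the function name, step size, and iteration count are our own choices.

```python
import math

# Illustrative solver for Problem (1) with a squared-error loss: minimize
#   sum_i (y_i - sum_f alpha_f * preds[f][i])^2 + lam * sum_f alpha_f * L_f
# over the unit simplex S, via exponentiated-gradient (mirror-descent)
# updates, which keep alpha nonnegative and normalised at every step.
def sparse_ensemble_weights(preds, y, losses, lam, steps=4000, eta=0.05):
    """preds[f][i]: prediction of base regressor f on record i;
    losses[f]: individual loss L_f of regressor f; lam: parameter lambda."""
    m, n = len(preds), len(y)
    alpha = [1.0 / m] * m                              # simplex barycentre
    for _ in range(steps):
        resid = [y[i] - sum(alpha[f] * preds[f][i] for f in range(m))
                 for i in range(n)]
        # gradient of the loss term plus the selective-sparsity term
        grad = [-2.0 * sum(resid[i] * preds[f][i] for i in range(n))
                + lam * losses[f] for f in range(m)]
        alpha = [alpha[f] * math.exp(-eta * grad[f]) for f in range(m)]
        s = sum(alpha)
        alpha = [a / s for a in alpha]                 # back onto the simplex
    return alpha
```

With λ = 0 the weights chase in-sample accuracy only; as λ grows, mass migrates towards base regressors with small individual loss L_f.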

2.2. Theoretical results

In general, Problem (1) has a nonlinear objective function and linear constraints. For loss functions commonly used in the literature, we can rewrite its objective as a linear or a convex quadratic function while the constraints remain linear. Therefore, for these loss functions, Problem (1) is easily tractable with commercial solvers. In addition, and under some mild assumptions, we characterize the behavior of the optimal solution with respect to the parameter λ.

First, we will rewrite the second term in the objective function, so that the proposed model can be seen as a particular case of the constrained Lasso. As for Lasso models and extensions of them, having a sparse model reduces the danger of overfitting.

Remark 1. The so-called selective ℓ1 norm ‖·‖_1^sel in R^{|F|} is defined as

‖α‖_1^sel = Σ_{f∈F} L_f |α_f|.

The objective function in Problem (1) can be written as L( Σ_{f∈F} α_f f ) + λ ‖α‖_1^sel. With this, and for well-known losses L, Problem (1) can be seen as a constrained Lasso problem (Blanquero et al., 2020; Gaines, Kim, & Zhou, 2018), in which a selective sparsity is sought, as opposed to a plain sparsity with as few nonzero coefficients α_f as possible.

Remark 2. Let I be a training sample, in which each individual i ∈ I is characterized by its feature vector x_i ∈ R^p and its response y_i. Let L be the empirical loss of quantile regression (Koenker & Hallock, 2001) for I,

L( Σ_{f∈F} α_f f ) = Σ_{i∈I} ρ_τ( y_i − Σ_{f∈F} α_f f(x_i) ),    (2)

where

ρ_τ(s) = τ s, if s ≥ 0;  −(1−τ) s, if s < 0,

for some τ ∈ (0, 1). Then, as in e.g. Koenker and Ng (2005), Problem (1) can be expressed as a linear program and thus efficiently solved with Linear Programming solvers.
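The linear program behind Remark 2 can be made explicit. Following the standard split of each residual into its positive and negative parts (as in Koenker and Ng, 2005), and using that |α_f| = α_f on the simplex, a sketch of the reformulation is as follows; the auxiliary variables u_i, v_i are ours:

```latex
\begin{aligned}
\min_{\boldsymbol{\alpha},\,\mathbf{u},\,\mathbf{v}} \quad
  & \sum_{i\in I}\bigl(\tau u_i + (1-\tau)v_i\bigr)
    + \lambda \sum_{f\in\mathcal{F}} \alpha_f L_f \\
\text{s.t.} \quad
  & y_i - \sum_{f\in\mathcal{F}} \alpha_f f(x_i) = u_i - v_i,
    \qquad u_i,\, v_i \ge 0, \qquad i \in I, \\
  & \sum_{f\in\mathcal{F}} \alpha_f = 1,
    \qquad \alpha_f \ge 0, \qquad f \in \mathcal{F}.
\end{aligned}
```

At an optimum, u_i and v_i recover the positive and negative parts of the i-th residual, so the objective equals the loss (2) plus the selective-sparsity penalty.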

Remark 3. Let I be a training sample, in which each individual i ∈ I is characterized by its feature vector x_i ∈ R^p and its response y_i. Let L be the empirical loss of Ordinary Least Squares (OLS) regression for I, i.e.,

L( Σ_{f∈F} α_f f ) = Σ_{i∈I} ( y_i − Σ_{f∈F} α_f f(x_i) )².    (3)

Hence, Problem (1) is a convex quadratic problem with linear constraints, which, by Remark 1, can be seen as a constrained Lasso. In particular, the results in Gaines, Kim, and Zhou (2018) apply, and thus we can assert that, if the design matrix ( f(x_i) )_{i∈I, f∈F} has full rank, then:

1. For any λ ≥ 0, Problem (1) has a unique optimal solution α_λ.
2. The path of optimal solutions α_λ is piecewise linear in λ.

Under mild conditions on L, applicable in particular for the quantile and OLS empirical loss functions, we characterize the optimal solution of Problem (1) for large values of the parameter λ. Intuitively speaking, for λ growing to infinity, the first term in the objective function becomes negligible, and thus we only need to solve the Linear Programming problem of minimizing Σ_{f∈F} α_f L_f in the simplex S. This problem attains its optimum at one of the extreme points of the feasible region, i.e., at some f* ∈ F, namely, one for which L_{f*} ≤ L_f, ∀ f. We formalize this intuition in the following proposition, where under the assumption of convexity of L, we show that a finite value of λ exists for which such a sparse solution is optimal. Before stating it, notice that, since the set F is given, we can define

L : Ω → R,  w ↦ L(w) = L( Σ_{f∈F} w_f f ),

for some Ω ⊆ R^{|F|} such that Ω ⊇ S.

Proposition 1. Assume that L is convex in an open convex set Ω ⊇ S. Furthermore, assume that there exists a base regressor f° such that L_{f°} < L_f for all f ∈ F, f ≠ f°. Then, there exists λ° < +∞ such that, for any λ ≥ λ°, f° is an optimal solution to Problem (1).

Proof. Let f° be as in the statement of the proposition, and let α° ∈ S denote the vector with 1 in its component corresponding to f° and 0 otherwise. Since L is defined in the open set Ω, which contains α°, the subdifferential ∂L(α°) of the convex function L at α° is not empty. Let p ∈ ∂L(α°), and let N(α°) denote the normal cone of S at α°. Then,

0 ∈ p + λ (L_f)_{f∈F} + N(α°)  iff  p_{f°} + λ L_{f°} ≤ p_f + λ L_f, ∀ f ∈ F,    (4)

which is satisfied iff

λ ≥ max{ (p_{f°} − p_f) / (L_f − L_{f°}) : f ∈ F, f ≠ f° }.    (5)

Setting λ° equal to the value on the right-hand side of (5), and taking into account that the condition on the left-hand side of (4) is necessary and sufficient for the optimality of α°, the result follows.
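Proposition 1 can be checked numerically on a toy instance. The sketch below brute-forces Problem (1) with the OLS loss of Remark 3 and two base regressors over a fine grid on the simplex; the data and the individual losses are invented for illustration, and the function names are ours.

```python
# Brute-force check of Proposition 1 on a two-regressor instance: weights are
# (a, 1 - a) with a on a fine grid, the loss is the OLS empirical loss (3),
# and losses[] are the (invented) individual losses L_f.
def ensemble_objective(a, preds, y, losses, lam):
    """Objective of Problem (1) at weights (a, 1 - a)."""
    fit = sum((yi - (a * p0 + (1.0 - a) * p1)) ** 2
              for yi, p0, p1 in zip(y, preds[0], preds[1]))
    return fit + lam * (a * losses[0] + (1.0 - a) * losses[1])

def best_weight(preds, y, losses, lam, grid=1000):
    """Grid-minimiser of the objective over the segment a in [0, 1]."""
    return min((k / grid for k in range(grid + 1)),
               key=lambda a: ensemble_objective(a, preds, y, losses, lam))

y = [1.0, 2.0, 3.0, 4.0]
preds = [[v + 0.2 for v in y],   # f0: slightly biased, but L_{f0} small
         list(y)]                # f1: exact in sample, but L_{f1} large
losses = [0.1, 4.0]
```

Here `best_weight(preds, y, losses, lam=0.0)` returns 0.0 (all weight on the in-sample-exact f1), whereas `best_weight(preds, y, losses, lam=10.0)` returns 1.0: beyond a finite λ, all weight moves to the regressor with the smallest individual loss, as the proposition predicts.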

2.3. Extensions

Problem (1) can be enriched to address some desirable properties one may seek for the ensemble. Three of them are discussed in what follows. The first two properties relate to the transparency and interpretability of the ensemble, Deng (2019) and Florez-Lopez and Ramon-Jeronimo (2015), while the third one relates to the performance of the ensemble in critical groups.

As mentioned in the introduction, the ensemble may contain base regressors built with several methodologies of very diverse nature. Therefore, one may want to control the number of methodologies used in the final ensemble. For instance, in the application described in Section 3, we consider four methodologies, namely, Support Vector Regression, Random Forests, Optimal Trees, and Linear Regression. Let F = ∪_{m∈M} F^type_m, where F^type_m is the set of base regressors using methodology m ∈ M, and let α^type_m be the corresponding subvector of α, namely, the one containing the components in α referring to methodology m ∈ M. With this, we can extend the objective function of Problem (1) to

L( Σ_{f∈F} α_f f ) + λ Σ_{f∈F} α_f L_f + λ^type Σ_{m∈M} ‖α^type_m‖_∞.    (6)

In a similar fashion, one may want to control the set of features used by the ensemble. Let F^fea_j ⊆ F be the set of base regressors using feature j ∈ {1, …, p}, and let α^fea_j be the corresponding subvector of α, namely, the one containing the components in α referring to feature j ∈ {1, …, p}. With this, we can extend the objective function of Problem (1) to

L( Σ_{f∈F} α_f f ) + λ Σ_{f∈F} α_f L_f + λ^fea Σ_{j=1}^{p} ‖α^fea_j‖_∞.    (7)

In both cases, the ℓ∞ terms can be rewritten using new decision variables and linear constraints, and thus the structure of the problem is not changed. This way, if L is the quantile regression (respectively, the Ordinary Least Squares) empirical loss, the optimization problem with objective as in (6) is written as a linear problem (respectively, as a convex quadratic problem with linear constraints). The same holds for the optimization problem with objective as in (7).

Fig. 1. Cumulative number of hospitalized patients in Andalusia (Spain) for COVID-19 in the period 10/03/2020–20/05/2020.
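The rewriting of the ℓ∞ terms can be sketched explicitly for objective (6); the auxiliary variables t_m below are ours. Since α_f ≥ 0 on S, ‖α^type_m‖_∞ = max_{f∈F^type_m} α_f, and this maximum can be bounded by one new variable per methodology:

```latex
\begin{aligned}
\min_{\boldsymbol{\alpha}\in S,\; \mathbf{t}\ge 0} \quad
  & L\Bigl(\sum_{f\in\mathcal{F}} \alpha_f f\Bigr)
    + \lambda \sum_{f\in\mathcal{F}} \alpha_f L_f
    + \lambda^{type} \sum_{m\in M} t_m \\
\text{s.t.} \quad
  & \alpha_f \le t_m, \qquad f \in \mathcal{F}^{type}_m,\; m \in M.
\end{aligned}
```

Because λ^type ≥ 0, each t_m is pushed down onto max_{f∈F^type_m} α_f at an optimum, so this problem is equivalent to minimizing (6); the analogous rewriting with one variable per feature handles (7).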

In addition, our approach can easily incorporate cost-sensitive performance constraints to ensure that we control not only the overall accuracy of the regressor, but also the accuracy on a number of critical groups, as in Benítez-Peña et al. (2019a), Benítez-Peña, Blanquero, Carrizosa, and Ramírez-Cobo (2019b), Blanquero et al. (2020) and Datta and Das (2015). With this, if δ_g > 0 denotes the threshold on the loss L^g for group g ∈ G, we can add to the feasible region of Problem (1) the constraints

L^g( Σ_{f∈F} α_f f ) ≤ δ_g, ∀ g ∈ G.    (8)

For the quantile and Ordinary Least Squares empirical loss functions, these constraints are linear or convex quadratic, respectively, and thus the optimization problems can be addressed with the very same numerical tools as before.

3. Short-term predictions of the evolution of COVID-19

The purpose of this section is to illustrate how, thanks to the selective sparsity term in Problem (1), we can provide good ensembles in terms of accuracy. For this, we use data sets arising in the context of COVID-19.

3.1. The data

COVID-19 was first identified in China in December 2019 and, subsequently, started to spread broadly. Quickly after this, data started to be collected daily by the different countries. Several variables of interest, such as confirmed cases, hospitalized patients, ICU patients, recovered patients, and fatalities, among others, were considered. Different initiatives around the world emerged in order to get to know this new scenario.

In this section, we focus on the evolution of the pandemic in Spain and Denmark. The first cases were confirmed in Spain and Denmark in late February 2020 and early March 2020, respectively. In this paper, the considered variable of interest is the cumulative number of hospitalized patients in the regions of Andalusia (Spain) and Sjælland (Denmark). Figs. 1 and 2 display the data in the periods 10/03/2020–20/05/2020 for Andalusia and 06/03/2020–20/05/2020 for Sjælland, which can be found at the repositories in Fernández-Casal (2020) and Statens Serum Institut (2020), respectively.

The univariate time series {X_t, t = 1, …, T}, with X_t representing the cumulative number of hospitalized patients in the region under consideration on day t, is converted into a multivariate series using seven lags. In other words, the data fed to the base regressors is not the time series itself, but the vectors of covariates and responses in Fig. 3. This training set is just one of the different options we have considered to create base regressors. In the next section, we discuss other data choices, which we will refer to as Country, Transformation and Differences.

3.2. Options for feeding the data

We first discuss the Country data choice. Let R be the number of regions of the country under consideration, and, without loss of generality, let us assume that the first one is the region under consideration. The time series {X^r_t, t = 1, …, T}, for regions r = 2, …, R, were also available. Such time series are correlated with the one under consideration. We had to decide whether to incorporate these additional time series in our forecasting model. If we do so, the feeding data contains the 7-uples in Fig. 3 from the region under consideration, as well as the ones from the other R − 1 regions, see Fig. 4. We now move to the Transformation choice. For the two choices in Figs. 3 and 4, either the crude data X are used or they are transformed using some standard Box-Cox transformations, Hastie, Tibshirani and Wainwright (2015), namely, X² and log(X + 1). Finally, with respect to the Differences choice, we have also considered whether information about the monotonicity (first difference, ∇X_t := X_t − X_{t−1}) and the curvature (second difference, ∇²X_t := ∇X_t − ∇X_{t−1}) is added to the feeding data as predictors, thus yielding 6 and 5 new predictors because of monotonicity and curvature, respectively.
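The construction of the feeding data just described can be sketched as follows; the function name and layout are ours. Each record consists of the seven lagged values as covariates, optionally extended with the 6 first-difference and 5 second-difference predictors, and the next value of the series as response.

```python
# Sketch of the feeding-data construction of Section 3.2: seven lags as
# covariates, the next value as response, and (optionally) the first and
# second differences of the lag window as 6 + 5 extra predictors.
def build_rows(x, lags=7, differences=True):
    """x: univariate series X_1, ..., X_T. Returns (covariates, responses)."""
    rows, responses = [], []
    for t in range(lags, len(x)):
        window = x[t - lags:t]                 # X_{t-7}, ..., X_{t-1}
        features = list(window)
        if differences:
            d1 = [window[k] - window[k - 1] for k in range(1, lags)]  # 6 values
            d2 = [d1[k] - d1[k - 1] for k in range(1, lags - 1)]      # 5 values
            features += d1 + d2
        rows.append(features)
        responses.append(x[t])                 # response X_t
    return rows, responses
```

A series of length T thus yields T − 7 records with 7 predictors each, or 7 + 6 + 5 = 18 when the difference predictors are included.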

To end this section, observe that the time series {X_t, t = 1, …, T} of the cumulative number of hospitalized patients in the region under consideration is, by nature, nondecreasing. However, some of the methodologies in the next section used to build base regressors do not guarantee such monotonicity. To ensure that the predictions show the monotonicity property present in the data, we use as response variable log(1 + X_t), instead of X_t. Once the procedure is completed, we undo this transformation to predict the original response variable X_t. Figs. 5 and 6 display log(1 + X_t) for Andalusia and Sjælland, respectively, where t is as in Figs. 1 and 2.

Fig. 2. Cumulative number of hospitalized patients in Sjælland (Denmark) for COVID-19 in the period 06/03/2020–20/05/2020.

Fig. 3. Covariates (in parentheses) and response variable for the cumulative number of hospitalized patients in the region under consideration.

Fig. 4. Covariates (in parentheses) and response variable for the cumulative number of hospitalized patients in each of the R regions of the country.

3.3. The base regressors

We consider four base methodologies to build the set of base regressors F. This includes three state-of-the-art Machine Learning tools, namely Support Vector Regression (SVR) (Carrizosa & Romero Morales, 2013), Random Forest (RF) (Breiman, 2001), and Sparse Optimal Randomized Regression Trees (S-ORRT) (Blanquero, Carrizosa, Molero-Río, & Romero Morales, 2020a), as well as the classic Linear Regression (LR). Each of them is fed each time with one of the data choices described in Section 3.1. See Table 1 for a description of the elements of F = F_SVR ∪ F_RF ∪ F_LR ∪ F_{S-ORRT} = {f_j : j = 1, …, 36} according to their methodology and the data choices. These methodologies have some parameters which must be tuned, and we explain below the tuning we have performed together with other computational details.

To tune the parameters, the different base regressors are trained using all the available data, except for the last four days, i.e., these models are trained on t ∈ {1, …, T − 4}. The e1071 (Meyer, Dimitriadou, Hornik, Weingessel, & Leisch, 2019) and randomForest (Liaw & Wiener, 2002) R packages have been used for training SVR and RF, respectively, while the lm routine in R is used for LR. The computational details for training S-ORRT are those in Blanquero et al. (2020a). For SVR, we use the RBF kernel and perform a grid search in {2^a : a = −10, …, 10} for both parameters, cost and gamma. For RF, we set ntree = 500, and for mtry we try out five random values. If only information from the region under consideration is included (the ‘Country No’ data option in Table 1), eight-fold cross-validation is used. However, when information from all regions in the country is included, we limit this to five-fold cross-validation, due to the small amount of data and the lack of observations in some regions. Such cross-validation estimates are used to select the best values of the parameters. With those best values, for each combination of feeding data and methodology, the base regressors f ∈ F are built using information from t ∈ {1, …, T − 4}, see Fig. 7.
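The tuning scheme just described can be sketched generically as follows; `fit` stands in for any of the base methodologies and must return a fitted model, i.e., a callable mapping a covariate vector to a prediction. The function names and the toy usage below are ours, not the paper's.

```python
# Sketch of the tuning in Section 3.3: a k-fold cross-validated grid search
# over {2^a : a = -10, ..., 10} (the grid the paper uses for the SVR cost
# and gamma parameters), keeping the value with the smallest CV error.
def cv_mse(fit, X, y, param, k):
    """Cross-validated mean squared error of `fit` with parameter `param`."""
    n = len(y)
    folds = [list(range(i, n, k)) for i in range(k)]   # k interleaved folds
    total = 0.0
    for fold in folds:
        train = [i for i in range(n) if i not in fold]
        model = fit([X[i] for i in train], [y[i] for i in train], param)
        total += sum((y[i] - model(X[i])) ** 2 for i in fold)
    return total / n

def grid_search(fit, X, y, k):
    """Return the grid value with the smallest cross-validated MSE."""
    grid = [2.0 ** a for a in range(-10, 11)]
    return min(grid, key=lambda p: cv_mse(fit, X, y, p, k))
```

The paper uses k = 8 when only the region's own data are fed and k = 5 when all regions are included; note that interleaved folds ignore the temporal ordering, which a stricter time-series validation scheme would respect.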

3.4. The pseudocode of the complete procedure

The complete procedure for making short-term predictions with our selective sparse ensemble methodology is summarized in Algorithm 1 and can be visualized in Fig. 7. The considered grid of values for the tradeoff parameter λ in Problem (1) is {0, 2^{−10}, 2^{−9}, …, 2^3}. For the tests considered in this section, this grid is wide enough. On one extreme, we have included the trivial value λ = 0, for which the selective sparsity term does not play a role. On the other extreme, with this grid we ensure that λ = λ° is reached, for which, by Proposition 1, the ensemble shows the highest level of sparsity.

We start by training the base regressors F in Table 1, with tuning parameters as in Section 3.3, using the data available up to day T − 4. We then move to solve Problem (1) for the different values of λ in the grid. For this, we have chosen the loss L as in (3), where I consists of the data in the four days left out when tuning


Fig. 5. Representation of the function log(1 + X_t), where X_t denotes the cumulative number of hospitalized patients in Andalusia for COVID-19 in the period 10/03/2020–20/05/2020.

Fig. 6. Representation of the function log(1 + X_t), where X_t denotes the cumulative number of hospitalized patients in Sjælland for COVID-19 in the period 06/03/2020–20/05/2020.

Table 1
Description of the chosen base regressors according to the data choices on Country, Transformation and Differences and the four methodologies used, with tuning parameters as in Section 3.3. The columns index the base regressors f_1, …, f_18 (in F_SVR and F_RF) and f_19, …, f_36 (in F_LR and F_{S-ORRT}); the rows index the data choices Country (No/Yes), Transformation (X, log(X + 1), X²) and Differences (Yes/No).


Fig. 7. The timeline of building the base regressors in F, solving Problem (1) to obtain the sparse ensemble for a given value of λ, and making the out-of-sample predictions.

Algorithm 1: Pseudocode for the complete procedure.

1: Input: {X_t, t = 1, …, T}, {X^r_t, t = 1, …, T}, r = 2, …, R, and F as in Table 1
2: Set L equal to the loss defined in (3)
3: Train the base regressors in F on t ∈ {1, …, T − 4}
4: for λ in {0, 2^{−10}, 2^{−9}, …, 2^3} do
5:   Solve Problem (1) for λ on t ∈ {T − 3, …, T} and obtain an optimal solution α_λ
6: end
7: Train the base regressors in F on t ∈ {1, …, T}
8: for λ in {0, 2^{−10}, 2^{−9}, …, 2^3} do
9:   Build the final ensemble regressor with weights α = α_λ
10:  Compute the predictions given by the final ensemble regressor with weights α = α_λ for t ∈ {T + 1, …, T + 14}
11: end
12: Output: For each λ, the fourteen-days-ahead out-of-sample predictions of the final ensemble regressor with weights α = α_λ

the base regressors, namely, T − 3, T − 2, T − 1, T, while the individual losses are taken as L_f = L(f). For each value of λ, we obtain the optimal weights α_λ returned by Problem (1). With these weights, the final ensemble regressor is built using all the data up to day T, and this final ensemble regressor is used to make fourteen-day-ahead predictions in t ∈ {T + 1, …, T + 14}.

The commercial optimization package Gurobi (Gurobi Optimization, 2018) has been used to solve the convex quadratic problems with linear constraints arising when solving Problem (1) with the loss in (3). Our experiments have been conducted on a PC with an Intel® Core™ i7-8550U CPU 1.80 GHz processor, 8 GB RAM, and a 64-bit operating system.

3.5. The numerical results

The out-of-sample prediction performance of our approach is illustrated in three training and testing splits, with all training periods starting on 10/03/2020 for Andalusia and on 06/03/2020 for Sjælland, and all testing periods containing 14 days. For Andalusia, we have 10/03/2020–03/04/2020 (Training Period 1) and 04/04/2020–17/04/2020 (Testing Period 1), 10/03/2020–14/04/2020 (Training Period 2) and 15/04/2020–28/04/2020 (Testing Period 2), and 10/03/2020–06/05/2020 (Training Period 3) and 07/05/2020–20/05/2020 (Testing Period 3). Similar periods are chosen for Sjælland, where all training periods start on 06/03/2020.

For each value of λ in the considered grid, the fourteen-days-ahead predictions made by the ensemble together with the realized values of the variable can be found in Tables 2–7 for each period and region, while Tables 8 and 9 report the Mean Squared Error (MSE) and the Mean Absolute Error (MAE) over the fourteen days. In Tables 8 and 9, we highlight in bold the best MSE performance of the ensemble across all the values of λ considered, and denote by λ_best the value of the parameter where the minimum MSE is achieved. Note that in this case, for each period and region combination, the best MAE is also achieved at λ = λ_best. Figs. 14 and 15 present the weights of the base regressors in the ensembles as a function of λ by means of heatmaps. The color bar of each heatmap transitions from white to black, where darker means a higher weight.

Figs. 8–13 depict the realized values of the variable at hand, the cumulative number of hospitalized patients in the respective region (in red), as well as the fourteen-days-ahead predictions for three different ensembles. In the first ensemble, with λ = 0, the selective sparsity term does not play a role by construction (blue line). In the second ensemble, with λ = λ_best, the ensemble is the one that performs the best in terms of MSE among all values of λ considered (black line). Finally, in the third ensemble, with λ = λ°, the ensemble is the one showing the highest level of sparsity (green line).

We start by discussing the results obtained for Period 1 in Andalusia. In Fig. 8, we can see that it is possible to improve the out-of-sample prediction performance by taking a strictly positive value of λ. As pointed out in the introduction, this is one of the advantages of our approach, namely, when seeking selective sparsity one may also obtain improvements on the out-of-sample prediction performance. A great benefit is observed with the ensemble that performs the best (black line), which is rather close to the actual values (red line). While the ensemble with λ = 0 presents a MAE of 532.71, for λ_best = 2^{−6} the MAE is reduced to 40.50. This ensemble consists of the base regressors f_2 ∈ F_SVR and f_21, f_23 ∈ F_LR, with respective weights 0.71, 0.14 and 0.15. In Fig. 9, we plot the out-of-sample information for Andalusia and Period 2. Similar conclusions hold. In addition, the best ensemble is the one with λ_best = 2^{−5}, and consists of f_5, f_11 ∈ F_SVR, with respective weights 0.25 and 0.75. This means that the ensemble composition has changed over time, which can be explained by the non-stationarity of the data. If, after having built the best ensemble for Training Period 1, one had discarded these two base regressors because they were not selected, we would have lost the best combination for Training Period 2. This illustrates another advantage of our approach, namely, its adaptability. The ensemble composition changes again in Training Period 3 in Andalusia, where


Fig. 8. Fourteen-day-ahead predictions for the cumulative number of hospitalized patients in Andalusia for COVID-19 in Testing Period 1 for three values of the tradeoff parameter λ, together with the actual values of the variable. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 9. Fourteen-day-ahead predictions for the cumulative number of hospitalized patients in Andalusia for COVID-19 in Testing Period 2 for three values of the tradeoff parameter λ, together with the actual values of the variable. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Table 2