Forecasting multivariate time series with boosted
configuration networks
Thierry Moudiki
ISFA - Laboratoire SAF, Université Lyon I, 69007 Lyon, France
Abstract
This paper contributes to the development of Randomized Neural Networks, and more specifically of the Stochastic Configuration Networks (SCNs). We present a family of learning algorithms based on the SCNs and on ensembles of single layer feedforward networks (SLFNs). They are close to Gradient Boosting and to Matching Pursuit algorithms, and are denoted here as boosted configuration networks (BCNs). In the BCN framework, as with SCNs, the networks' hidden layers are chosen in a supervised way, ensuring that the universal approximation property of Neural Networks is met. However, the learning mechanism of the BCNs incorporates a learning rate, which allows for a slower learning of the model's residuals. It also applies a subsampling of the models' explanatory variables that decorrelates the base learners. The BCNs are compared to other ensembles of Randomized Neural Networks and to other forecasting techniques, on real world multivariate time series data. An interesting direction for future work would be to apply the BCNs to regression/classification data which are not time series.
Keywords: Neural Networks, Gradient Boosting, universal approximation property, Randomized Neural Networks, forecasting
2010 MSC: 00-01, 99-00
∗Corresponding author
Email address: thierry.moudiki@gmail.com (Thierry Moudiki)
Preprint submitted to Journal of LaTeX Templates, October 31, 2018
1. Introduction
The goal of ensemble learning is to astutely combine two or more individual statistical/machine learning models, the base learners, into one, with the expectation of an improved out-of-sample error of the ensemble over the base learners.
Gradient Boosting (see Friedman (2001), Bühlmann & Yu (2003), Hothorn et al. (2010)) is an ensemble learning procedure, whose general idea is to fit the model's in-sample residuals with the base learners, iteratively and slowly, and to stop learning those residuals just before the out-of-sample error of the model starts to increase.
In this paper, we discuss a family of statistical/machine learning algorithms based on the Stochastic Configuration Networks (SCNs) from Wang & Li (2017b), and on the boosting of Single Layer Feedforward Networks (SLFNs) (specifically on the LS Boost from Friedman (2001)). We denote this family of algorithms as Boosted Configuration Networks (BCNs).
The idea of the SCNs from Wang & Li (2017b) followed related works on constructive Neural Networks (Friedman & Stuetzle (1981), Jones et al. (1992), Barron (1993) and Kwok & Yeung (1997)) and on constructive Random Basis Approximators (Igelnik & Pao (1995) and Li & Wang (2017)). The common philosophy behind these references was to construct learning algorithms for target functions incrementally, with bounded functions of the explanatory variables (of the target) as base learners, until the residual error of the resulting model falls under a certain level of tolerance. The universal approximation property (Hornik et al. (1989)), that is, the ability of the constructive model to converge (in $L^2$) towards the target function as the number of base learners grows, is also a general concern examined in these references. Other resources containing insights on this body of work are Vincent & Bengio (2002) and Mallat & Zhang (1993), who put a greater emphasis on Matching Pursuit algorithms (Friedman & Stuetzle (1981)).
Wang & Li (2017b) introduced three different algorithms for SCN learning, denoted as SC-I, SC-II, SC-III. In these algorithms, the base learners are SLFNs, and each one of the three algorithms verifies the universal approximation property. SC-I does successive orthogonal projections of the target function on the base learners. The parameters of the base learners (nodes of the hidden layers) are chosen in a supervised way, ensuring that the universal approximation property of the additive basis expansion of SC-I is met. SC-III is a modified version of SC-I, in which all the weights of the target function in the additive expansion are recalculated at each learning iteration of the residuals, and not just the output weight (the most recent weight of the additive expansion) as in SC-I.
SC-III provides the best results of the three algorithms presented in Wang & Li (2017b), but is said to be less suited than SC-I for large-scale data, notably because of its recalculation of all the weights of the expansion. SC-II mixes ingredients from SC-I and SC-III, to serve as a compromise between them when both accuracy and scalability are needed. Another contribution to the SCN literature is Wang & Li (2017a), in which a version of the SCNs that is more robust to noisy samples and to outliers is introduced.
In the BCN framework introduced here, the SCNs from Wang & Li (2017b) are brought closer to the LS Boost from Friedman (2001). The base learners $(x, a) \mapsto h(x, a)$ (parametrized functions of the input variables $x$, characterized by parameters $a = \{a_1, a_2, \ldots\}$) of the LS Boost are SLFNs, and the parameters $a$ of the hidden layers are chosen in a supervised way that guarantees the universal approximation property of the BCN expansion. Contrary to the SCNs, and similarly to Gradient Boosting machines, the BCNs incorporate both a learning rate that allows for a slow learning of the residuals, and a subsampling of the models' explanatory variables that achieves an increased diversity of the base learners.
The contributions of this paper are thus twofold:
• define the BCN algorithms as supervised and automated ways of constructing a model's new features; a more general version of the SCNs; a Gradient Boosting procedure that verifies the universal approximation property of Neural Networks.
• employ the BCNs for multivariate time series (MTS) forecasting and compare them to other ensembles of Randomized Neural Networks and to usual forecasting techniques, on various real world MTS datasets.
The BCNs are described in detail in Section 2, and their forecasting capabilities are examined in Section 3, on nine different MTS datasets. In Section 2, we notably demonstrate the universal approximation property of the BCNs, after including a learning rate and a subsampling of the covariates to the SCNs. We also discuss how some parameters of the model influence the convergence of the BCN expansion towards its target.
2. The boosted configuration networks
We start with a general remark that holds for all the models described in this paper, including the BCNs. For the construction of training/testing/validation sets, we always consider $p \in \mathbb{N}^*$ time series $(X^{(j)}_t)_{t \geq 0}$ for $j \in \{1, \ldots, p\}$, observed at $n \in \mathbb{N}^*$ discrete dates, and we are interested in obtaining simultaneous forecasts of these time series at time $n + h$, $h \in \mathbb{N}^*$. Each series is allowed to be influenced by the others, in the spirit of VAR models (Lütkepohl (2005); Pfaff (2008)).
For the purpose of forecasting these $p$ variables, $k < n$ lags of each time series are used. Therefore, the output variables to be explained are:
$$Y^{(j)} = \left( X^{(j)}_n, \ldots, X^{(j)}_{k+1} \right)^T \qquad (1)$$
for $j \in \{1, \ldots, p\}$. $X^{(j)}_n$ is the most recent observed value of the $j$th time series, and $X^{(j)}_{k+1}$ was observed $k$ dates earlier in time for $(X^{(j)}_t)_{t \geq 0}$. These output variables are stored in a matrix:
$$Y \in \mathbb{R}^{(n-k) \times p} \qquad (2)$$
and the original explanatory variables (before any nonlinear transformation) are stored in a matrix:
$$X \in \mathbb{R}^{(n-k) \times (k \times p)} \qquad (3)$$
$X$ consists of $p$ blocks of $k$ time series lags each. For example, the $j_0$th block of $X$, for $j_0 \in \{1, \ldots, p\}$, contains in its columns:
$$\left( X^{(j_0)}_{n-i}, \ldots, X^{(j_0)}_{k+1-i} \right)^T \qquad (4)$$
with $i \in \{1, \ldots, k\}$. It is also possible to add other regressors, such as dummy variables or indicators of special events, but as in Moudiki et al. (2018), we consider only the inclusion of lags of the observed time series. For example, if we have observed $p = 2$ time series $(X^{(1)}_{t_1}, \ldots, X^{(1)}_{t_5})$ and $(X^{(2)}_{t_1}, \ldots, X^{(2)}_{t_5})$ at $n = 5$ dates $t_1 < \ldots < t_5$, and would like to use $k = 2$ lags to construct the explanatory variables, the response variables will be stored in:
$$Y = \begin{pmatrix} X^{(1)}_{t_5} & X^{(2)}_{t_5} \\ X^{(1)}_{t_4} & X^{(2)}_{t_4} \\ X^{(1)}_{t_3} & X^{(2)}_{t_3} \end{pmatrix} \qquad (5)$$
And the original explanatory variables will be stored in:
$$X = \begin{pmatrix} X^{(1)}_{t_4} & X^{(1)}_{t_3} & X^{(2)}_{t_4} & X^{(2)}_{t_3} \\ X^{(1)}_{t_3} & X^{(1)}_{t_2} & X^{(2)}_{t_3} & X^{(2)}_{t_2} \\ X^{(1)}_{t_2} & X^{(1)}_{t_1} & X^{(2)}_{t_2} & X^{(2)}_{t_1} \end{pmatrix} \qquad (6)$$
In the remainder of the paper, we will sometimes work on only one of the $p$ response variables and denote it as $y \in \mathbb{R}^{(n-k)}$. Hence, the observation at time $t_i$ of $y$ will be denoted as $y_i$, and $y_i$ will be explained as a function of $x_i$, the $i$th row of matrix $X$. All of the $p$ response variables share the same set of predictors, in a multitask learning (Caruana (1998)) fashion. Their treatment is thus equivalent.
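As a concrete sketch of this construction (the helper name build_lagged_matrices is ours, not from the paper's code), assuming the input rows are ordered from oldest to newest:

```python
import numpy as np

def build_lagged_matrices(series, k):
    """Build the response matrix Y and the lagged predictor matrix X
    from p time series observed at n dates, using k lags each.

    series: array of shape (n, p), rows ordered from oldest to newest.
    Returns Y of shape (n - k, p) and X of shape (n - k, k * p),
    both ordered from the most recent observation (first row) to the oldest.
    """
    n, p = series.shape
    # Response: observations at dates n, ..., k+1 (most recent first), Eq. (1).
    Y = series[k:][::-1]
    # Predictors: p blocks of k lags each, Eq. (4).
    blocks = []
    for j in range(p):                 # block for the j-th series
        for i in range(1, k + 1):      # lag i = 1, ..., k
            blocks.append(series[k - i:n - i][::-1, j])
    X = np.column_stack(blocks)
    return Y, X
```

With the $p = 2$, $n = 5$, $k = 2$ example above, the function reproduces the layouts of Eqs. (5) and (6).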
Now, borrowing some notation from Friedman (2001), the problem that we would like to solve is finding $F^*$ that verifies:
$$F^* = \arg\min_F \, \mathbb{E}_X \left[ \mathbb{E}_y \left[ L(y, F(X)) \right] \mid X \right] \qquad (7)$$
where $(y, X) \mapsto L(y, F(X)) = (y - F(X))^2$ is a squared-error loss function. With these notations in place, we will now present the BCNs.
2.1. General description of the boosted configuration networks (BCNs)
In the SCN framework from Wang & Li (2017b), and similarly to simple Matching Pursuit algorithms (see Vincent & Bengio (2002) and Mallat & Zhang (1993)), it is assumed that an approximation $F_{L-1}$ of $F$ (observation $y_i$) has already been constructed as:
$$F_{L-1}(x_i) = \beta_0 + \sum_{l=1}^{L-1} \beta_l \, g(x_i^T w_l + b_l) \qquad (8)$$
with $F_0 = \beta_0$. The $L-1$ functions $g_l : (x_i; w_l, b_l) \mapsto g(x_i^T w_l + b_l)$ for $l \in \{1, \ldots, L-1\}$, with their respective parameters, constitute the base learners of the additive expansion $F_{L-1}$. $g$ is an activation function that transforms the linear inputs into nonlinear features. The output weight $\beta_l$, the parameters $w_l \in \mathbb{R}^{kp}$ and the bias parameter $b_l \in \mathbb{R}^*$ are to be optimized, as will be shown at the end of the section. The current residual error between $F$ and $F_{L-1}$ at step $L-1$ is given by:
$$e_{L-1} = F - F_{L-1} \qquad (9)$$
In order to iterate from step $L-1$ to step $L$, if $\|e_{L-1}\|_F$ (the Frobenius norm of the residuals) still exceeds a given tolerance level $\epsilon$, we need to generate a new base learner $g_L : (x_i; w_L^*, b_L^*) \mapsto g(x_i^T w_L^* + b_L^*)$. That is, we need to find optimal $w_L^*$ and $b_L^*$, and evaluate the output weight $\beta_L^*$, so that the norm of the residuals is decreased and:
$$F_L = F_{L-1} + \beta_L^* g_L \qquad (10)$$
Wang & Li (2017b) showed that if the activation function $g$ is bounded by a constant $b_g > 0$ ($0 < \|g\|_{L^2} < b_g$), and the following condition is fulfilled for each of the $p$ response variables (here, the most recent observations of the time series):
$$\langle e_{L-1,q}, g_L \rangle_{L^2}^2 \geq b_g^2 \, \delta_{L,q}, \quad q = 1, \ldots, p \qquad (11)$$
then, when $L \to \infty$, the additive expansion will converge towards $F$:
$$F_L \xrightarrow{L^2} F \qquad (12)$$
with $0 < r < 1$, $\mu_L = \frac{1-r}{L+1}$, and $\delta_{L,q} = (1 - r - \mu_L)\|e_{L-1,q}\|^2$, $q = 1, \ldots, p$.
In this paper, we use a slightly modified version of the SCNs: the BCNs. The BCNs are close to Matching Pursuit algorithms, and possess two additional hyperparameters when compared to the SCNs:
• a learning rate $0 < \nu \leq 1$, so that the update at iteration $L$, based on Eq. (10), is now:
$$F_L = F_{L-1} + \nu \beta_L^* g_L \qquad (13)$$
• a subsampling percentage of the covariates, between 0.5 and 1.
The learning rate $\nu$ will allow the model to learn the residuals more slowly than when $\nu$ is always equal to 1, whereas subsampling the covariates will increase the diversity of the base learners at each boosting iteration. Indeed, as mentioned in Friedman (2001), the learning rate controls the degree of fit: it can be roughly seen as taking smaller or larger steps in a gradient descent. The subsampling coefficient makes it possible to randomly select a fraction of the covariates at each boosting iteration, and reduces overfitting. If it is equal to 0.5, then only one half of the covariates are randomly selected at each boosting iteration. It must be chosen in such a way that the number of covariates at each boosting iteration is always positive.
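This covariate subsampling can be sketched as follows (a minimal illustration with hypothetical names, not the paper's implementation):

```python
import numpy as np

def subsample_covariates(X, col_sample, rng):
    """Randomly select a fraction col_sample (between 0.5 and 1)
    of the columns of X, for one boosting iteration."""
    d = X.shape[1]
    # Rounding up (and keeping at least 1 column) guarantees that the
    # number of selected covariates is always positive.
    n_cols = max(1, int(np.ceil(col_sample * d)))
    cols = rng.choice(d, size=n_cols, replace=False)
    return X[:, cols], cols

rng = np.random.default_rng(123)
X = np.random.default_rng(0).standard_normal((10, 8))
X_sub, cols = subsample_covariates(X, 0.5, rng)  # keeps 4 of the 8 columns
```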
In Section 2.2, based on the SCN framework, we show that the universal approximation property of the BCNs is verified.
2.2. Universal approximation property of the boosted configuration networks (BCNs)
Including a learning rate $\nu$ in the SCN algorithms (Wang & Li (2017b)), in order to obtain the BCNs, leads to considering a new condition for the convergence of $F_L$ towards $F$:
$$\langle e_{L-1,q}, g_L \rangle_{L^2}^2 \geq \frac{b_g^2 \, \delta_{L,q}}{\nu(2 - \nu)}, \quad q = 1, \ldots, p \qquad (14)$$
with $0 < r < 1$, $\mu_L = \frac{1-r}{L+1}$, and $\delta_{L,q} = (1 - r - \mu_L)\|e_{L-1,q}\|^2$, $q = 1, \ldots, p$. Indeed, in this case, updating the residuals from step $L-1$ to step $L$ is done by obtaining:
$$e_{L,q} = e_{L-1,q} - \nu \beta_{L,q} \, g_L, \quad q = 1, \ldots, p \qquad (15)$$
Consequently, by defining $\delta_L = \sum_{q=1}^{p} \delta_{L,q}$, and following the steps of the proof of Theorem 6 in Wang & Li (2017b), we have for the SC-I:

Proof.
\begin{align*}
\|e_L\|^2 - (r + \mu_L)\|e_{L-1}\|^2 &= (1 - r - \mu_L)\|e_{L-1}\|^2 - \nu(2 - \nu) \sum_{q=1}^{p} \frac{\langle e_{L-1,q}, g_L \rangle_{L^2}^2}{\|g_L\|^2} \\
&= \delta_L - \nu(2 - \nu) \sum_{q=1}^{p} \frac{\langle e_{L-1,q}, g_L \rangle_{L^2}^2}{\|g_L\|^2} \\
&\leq \delta_L - \nu(2 - \nu) \sum_{q=1}^{p} \frac{\langle e_{L-1,q}, g_L \rangle_{L^2}^2}{b_g^2} \\
&\leq 0
\end{align*}
$g : x \mapsto \tanh(x)$ is used as the activation function for all the models presented in this paper. As required by the universal approximation property of the SCNs, $g : x \mapsto \tanh(x)$ is indeed bounded.
We observe that this proof of the universal approximation property only depends on the boundedness of $g$ and on the condition from Eq. (14) (given $0 < r < 1$ and $L$ sufficiently high). Therefore, it still holds when the columns of $x_i$ are subsampled, as long as the condition from Eq. (14) is met. Similarly, for the adaptation of SC-III to the BCN framework, the universal approximation property holds, by analogous arguments and Theorem 7 from Wang & Li (2017b).
The algorithms that we used for implementing the BCNs are described in the next section, 2.3. As in the SCN framework, there could be three versions of BCN algorithms, but we only describe the adaptation of SC-I; the SC-II and SC-III implementations would follow similar ideas.
2.3. Algorithm for the boosted configuration networks (BCNs)
Having chosen $0 < r < 1$; $\lambda > 0$, a regularization parameter for $w_l$ and $b_l$; $B$, the budget number of boosting iterations; and $\epsilon$, a tolerance level for the Frobenius norm of the successive matrices of residuals, algorithm SC-I from Wang & Li (2017b) is modified as indicated in Algorithm 1, with the following notations (relying on the original notations from Wang & Li (2017b)):
$$e_{L-1}(X) = \left[ e_{L-1,1}(X) \ldots e_{L-1,p}(X) \right] \qquad (16)$$
with $e_{L-1}(X) \in \mathbb{R}^{(n-k) \times p}$. And:
$$e_{L-1,q}(X) = \left( e_{L-1,q}(x_1), \ldots, e_{L-1,q}(x_{n-k}) \right)^T \qquad (17)$$
$e_{L-1,q}(X) \in \mathbb{R}^{n-k}$ for $q = 1, \ldots, p$. Plus:
$$h_L(X) = \left( g(x_1^T w_L), \ldots, g(x_{n-k}^T w_L) \right)^T \qquad (18)$$
$h_L(X) \in \mathbb{R}^{n-k}$, and:
$$\xi_{L,q} = \nu(2 - \nu) \frac{\left( e_{L-1,q}(X)^T h_L(X) \right)^2}{h_L(X)^T h_L(X)} - (1 - r - \mu_L) \, e_{L-1,q}(X)^T e_{L-1,q}(X) \qquad (19)$$
Algorithm 1 Algorithm for BCN learning
1: procedure BCN-I($Y$, $X$, $B$, $\nu$, $\lambda$, $r$, $\epsilon$) ▷ BCN learning of $Y$ by $X$
2: $e_0 \leftarrow Y$; $\beta_0 \leftarrow \bar{Y}$; $e_0 \leftarrow e_0 - \beta_0$; $L \leftarrow 1$ ▷ Initialization
3: while $L \leq B$ and $\|e_0\|_F > \epsilon$ do ▷ Loop until the budget or the tolerance is reached
4: Obtain
$$(w_L^*, b_L^*) \leftarrow \arg\max_{w_L, b_L \in [-\lambda, \lambda]^{kp}} \left[ \left( \sum_{q=1}^{p} \xi_{L,q} \right) \mathbb{1}_{\min(\xi_{L,1}, \ldots, \xi_{L,p}) \geq 0} \right]$$
5: Obtain $h_L^*$, $\xi_{L,q}^*$, based on Eq. (18) and (19), and $\mu_L := (1-r)/(L+1)$
6: Do
$$\beta_L^* \leftarrow \text{lsfit}(\text{response} = e_{L-1}, \text{covariates} = h_L^*)$$
▷ least squares regression of $e_{L-1}$ on $h_L^*$: obtain the output weight $\beta_L^*$
7: $e_L \leftarrow e_{L-1} - \nu \beta_L^* h_L^*$ ▷ Update the residuals
8: $e_0 \leftarrow e_L$; $L \leftarrow L + 1$ ▷ Next iteration
In Algorithm 1, Step 4 is typically achieved by using derivative-free optimization. In Section 2.4, we discuss the convergence rate of the BCN expansion towards its target.
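A compact sketch of Algorithm 1 follows; for illustration only, the derivative-free optimization of Step 4 is replaced by a simple random search over candidate $(w_L, b_L)$ pairs drawn uniformly in $[-\lambda, \lambda]$ (all names and defaults below are ours):

```python
import numpy as np

def bcn_i(Y, X, B=10, nu=0.3, lam=1.0, r=0.9, tol=1e-6,
          n_candidates=100, seed=42):
    """Illustrative BCN-I: boosted SLFN base learners with a learning rate.
    Random search stands in for the derivative-free optimization of Step 4."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    beta0 = Y.mean(axis=0)          # F_0: the column means of Y
    e = Y - beta0                   # initial residuals e_0
    learners = []
    for L in range(1, B + 1):
        if np.linalg.norm(e) <= tol:
            break
        mu_L = (1.0 - r) / (L + 1)
        best, best_score = None, -np.inf
        for _ in range(n_candidates):       # random search over (w_L, b_L)
            w = rng.uniform(-lam, lam, size=d)
            b = rng.uniform(-lam, lam)
            h = np.tanh(X @ w + b)          # bounded base learner, Eq. (18)
            hh = h @ h
            # xi_{L,q} from Eq. (19), one value per response variable
            xi = nu * (2 - nu) * (e.T @ h) ** 2 / hh \
                 - (1 - r - mu_L) * np.sum(e ** 2, axis=0)
            score = xi.sum() if xi.min() >= 0 else -np.inf
            if score > best_score:
                best_score, best = score, (w, b, h)
        if best is None:
            break                           # no admissible base learner found
        w, b, h = best
        beta = (h @ e) / (h @ h)            # least-squares output weights
        e = e - nu * np.outer(h, beta)      # Step 7: residual update
        learners.append((w, b, beta))
    return beta0, learners, e

# Example: two response series driven by three covariates
Xd = np.random.default_rng(0).standard_normal((50, 3))
Yd = np.column_stack([np.tanh(Xd @ np.array([1.0, -0.5, 0.3])), 0.2 * Xd[:, 0]])
beta0, learners, resid = bcn_i(Yd, Xd, B=5)
```

The admissibility constraint $\min_q \xi_{L,q} \geq 0$ is what certifies the geometric decrease of the residual norm; when no admissible candidate is found, this sketch simply stops.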
2.4. On the convergence of boosted configuration networks (BCNs)
In Section 2.2, we showed that for fixed $L > 0$, $0 < r < 1$, and under condition (14), we have:
$$\|e_L\|^2 \leq (r + \mu_L)\|e_{L-1}\|^2 \qquad (20)$$
Because the target function to be explained is assumed to be square-integrable, this leads to the following inequality:
$$\|e_L\|^2 \leq C (r + \mu_L)^L \qquad (21)$$
where $C > 0$ is a positive constant. A sufficient condition for the convergence of $F_L$ towards $F$ (in $L^2$), and of $\|e_L\|^2$ towards 0, when $L \to \infty$ is:
$$0 < r + \mu_L < 1 \qquad (22)$$
which is always true when $0 < r < 1$ and $L > 0$. In Figure 1, we present the function $(r, L) \mapsto (r + \mu_L)^L$, the convergence rate of $\|e_L\|^2$ towards 0, for 100 values of $r \in [0.01, 1]$ and 25 values of $L \in \{1, 2, 3, \ldots, 25\}$.
Figure 1: Convergence rate of the residuals of the BCNs towards 0, as a function of $r$ and $L$.
Figure 1 suggests that a very high value of $L$ would not necessarily be required for the convergence of the residuals to occur. $r$ must be chosen adequately: relatively close to 1 to prevent overfitting, but not too close, to avoid a divergence of the BCN expansion. $r$ and $L$ are both chosen, along with the other hyperparameters of the BCNs, by cross-validation, as demonstrated in Section 3.
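The bound $(r + \mu_L)^L$ from Eq. (21) is easy to check numerically; a short illustration:

```python
# Convergence rate (r + mu_L)^L of Eq. (21), with mu_L = (1 - r) / (L + 1).
def rate(r, L):
    mu_L = (1.0 - r) / (L + 1)
    return (r + mu_L) ** L

# The bound is below 1 as soon as 0 < r < 1, and decays quickly in L ...
assert rate(0.5, 25) < rate(0.5, 5) < 1.0
# ... but more slowly as r approaches 1 (slower learning of the residuals).
assert rate(0.99, 25) > rate(0.5, 25)
```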
3. Numerical examples
In this section, we examine the forecasting capabilities of the BCNs on 9 real world MTS datasets, presented in Section 3.2. Each dataset was partitioned into two parts:
1. A training/testing set containing 75% of the observations.
2. A validation set that contains the remaining 25% of the observations, unseen at point 1, and used to assess the out-of-sample accuracy.
Based on the 9 datasets presented in Section 3.2, we compare the out-of-sample accuracy of the BCNs (on the validation set) to:
• the accuracy of two naive forecasting methods: random walk and sample mean.
• the accuracy of other models based on Randomized Neural Networks.
• the accuracy of other usual forecasting techniques.
All these competing models are described in Section 3.1. Their optimal hyperparameters, used to produce forecasts on the validation set, are obtained through a rolling forecasting methodology (Bergmeir et al. (2015)) applied to the training/testing set.
The rolling forecasting methodology proceeds as follows: a fixed window of length training is set for training the model, and another window, contiguous to the first one, and of length testing, is set for testing it. The origin of the training set is then advanced by 1 observation, and the training/testing procedure is repeated until no more training/testing sets can be constructed.
Typically, for all the competing models, we used a fixed rolling window of length training = 18 points for the training, and testing sets of increasing lengths testing = 3, 6, 9, and 12 points. The measure of accuracy comparing a forecast $\hat{y}$ to the observed data $y$ for each series is the Root Mean Squared Error (RMSE):
$$\text{RMSE} = \sqrt{ \frac{1}{H} \sum_{h=1}^{H} \left( y_{n+h} - \hat{y}_{n+h} \right)^2 } \qquad (23)$$
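A direct transcription of Eq. (23), as a sketch:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error over an H-step forecast horizon, Eq. (23)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])  # sqrt((0 + 0 + 4) / 3)
```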
For the optimization of the training/testing RMSE, we use a Bayesian optimization algorithm on each model's hyperparameters. The Bayesian optimization algorithm employed is the one described as GP EI Opt in Snoek et al. (2012), with 250 iterations and 10 repeats. The bounds for GP EI Opt's hyperparameter search are those given in Appendix 5.1, and the detailed cross-validation results can be found at https://github.com/thierrymoudiki/ins-sirn-2018. The rolling forecasting methodology from Bergmeir et al. (2015) is presented in Figure 2:
Figure 2: Time series cross-validation procedure illustrated.
3.1. Competing models
The first forecasting models that we compared to the BCNs (BCN-I and BCN-III) on MTS forecasting are two naive ones: a random walk, denoted here and in the Github repo as rw, and a sample mean, denoted as mean. The random walk obtains h-steps-ahead forecasts of a time series by using its last observed value:
$$\hat{y}_{n+h} = y_n, \quad h > 0 \qquad (24)$$
whereas for the sample mean, we would have for each time series:
$$\hat{y}_{n+h} = \bar{y}_n, \quad h > 0 \qquad (25)$$
The sample mean is calculated on rolling windows of length training. Then, other popular models were considered: an unrestricted VAR (Lütkepohl (2005); Pfaff (2008)), denoted as VAR, and a Lasso VAR (see Cavalcante et al. (2017), Davidson et al. (2004), Nicholson et al. (2014)), denoted as lassoVAR. As a Lasso VAR model, we used Fu (1998) to implement the row Lasso VAR (rLV) model presented in detail in Cavalcante et al. (2017).
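The two naive baselines of Eqs. (24) and (25) can be sketched as follows (helper names are ours):

```python
import numpy as np

def rw_forecast(y, h):
    """Random walk, Eq. (24): repeat the last observed value for h steps."""
    return np.full(h, y[-1], dtype=float)

def mean_forecast(y, h, window):
    """Sample mean, Eq. (25), computed here on a rolling window made of
    the last `window` observations."""
    return np.full(h, np.mean(y[-window:]), dtype=float)

y = np.array([1.0, 2.0, 3.0, 4.0])
rw_forecast(y, 3)              # array([4., 4., 4.])
mean_forecast(y, 3, window=2)  # array([3.5, 3.5, 3.5])
```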
Finally, three models based on Randomized Neural Networks and on ensembles of Randomized Neural Networks were considered:
• The MTS forecasting model from Moudiki et al. (2018), based on Quasi-Randomized Neural Networks, and denoted here as ridge2.
• A model based on Hothorn et al. (2010) and Hothorn et al. (2017), applied to MTS forecasting, denoted as glmboost. In this model (using ideas from Section 2.1), after $B$ boosting iterations, $F_B$ has the form:
$$F_B(x_i) = \beta_0 + \sum_{l=1}^{B} \sum_{m=1}^{M} \nu \beta_l^{(m)} g_l\left( x_i; W_l^{(m)}, b_l^{(m)} \right) \qquad (26)$$
$W_l \in \mathbb{R}^{(kp) \times M}$, for each boosting iteration $l \in \{1, \ldots, B\}$, are matrices of model parameters drawn from a quasi-random Sobol sequence (see Niederreiter (1992) and Joe & Kuo (2008)), as in Moudiki et al. (2018). These matrices $W_l$, with columns $W_l^{(m)}$, give the number of nodes in the hidden layer (which is equal to $M$, and denoted as nb hidden), and create a new set of $M$ features from $x_i$. The number of nodes in the hidden layer, $M$, is a hyperparameter of the whole procedure (fixed in advance), and the functions $(x_i; W_l, b_l) \mapsto g_l(x_i; W_l, b_l)$ are the base learners. The general form of $g_l$ is:
$$g_l : (x_i, W_l, b_l) \mapsto g\left( x_i^T W_l + b_l \right) \qquad (27)$$
where $g$ is the hyperbolic tangent (tanh), because the SCNs and BCNs require a bounded activation function for their convergence.
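The creation of these randomized hidden-layer features can be sketched as follows; for simplicity, plain pseudo-random uniform draws stand in here for the quasi-random Sobol sequence used in the paper (names and ranges are ours):

```python
import numpy as np

def hidden_features(X, M, seed=0):
    """Create M nonlinear features g(x^T W + b), with g = tanh
    (bounded, as required by the SCNs/BCNs) and randomized weights.
    Plain uniform draws replace the paper's Sobol sequence in this sketch."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.uniform(-1.0, 1.0, size=(d, M))   # stands in for W in Eq. (27)
    b = rng.uniform(-1.0, 1.0, size=M)
    return np.tanh(X @ W + b)

X = np.random.default_rng(1).standard_normal((20, 6))
H = hidden_features(X, M=10)  # 10 new bounded features per observation
```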
• The Partial Least Squares (PLS) algorithm (see Friedman et al. (2001)), denoted as pls. As stated in Friedman et al. (2001), the PLS seeks directions that have high variance and high correlation with the response, contrary to principal components regression, which focuses only on the variance. If we denote by $T$ the matrix of explanatory variables with $M$ columns, and by $y$ the response variable, then for the PLS we have:
1. Standardize each column of $T$ to have mean 0 and variance 1. Set $\hat{y}_0 = \bar{y} \mathbf{1}$ and $T_0^{(j)} = T^{(j)}$, $j = 1, \ldots, M$ (the columns of $T$ are $T^{(j)}$, $j = 1, \ldots, M$).
2. For $m = 1, \ldots, M$:
– $z_m = \sum_{j=1}^{M} \hat{\varphi}_{mj} T_{m-1}^{(j)}$, where $\hat{\varphi}_{mj} = \langle T_{m-1}^{(j)}, y \rangle$
– $\hat{\theta}_m = \frac{\langle z_m, y \rangle}{\|z_m\|^2}$
– $\hat{y}_m = \hat{y}_{m-1} + \hat{\theta}_m z_m$
3. Output the sequence of fitted vectors $\{\hat{y}_m\}_{m=1}^{M}$.
The algorithm can be stopped before the loop on $m$ (the number of PLS directions) reaches $M$. Here, we apply the SIMPLS algorithm from De Jong (1993) to a set of new features obtained by transformations of the original ones:
$$g\left( x_i^T W + b \right) \qquad (28)$$
We start by choosing a fixed number of nodes in the hidden layer, $M$ (denoted as nb hidden in the results), then obtain a quasi-random Sobol sequence for $W \in \mathbb{R}^{(kp) \times M}$. A new set of $M$ features is obtained with formula (28), and stored in a matrix $T$, the matrix of transformed predictors. $M$ is a hyperparameter for the algorithm, and the PLS is applied to $T$.
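The PLS steps listed above can be sketched as follows; the deflation of the predictors, implicit in the $T_{m-1}$ notation of the listing (step 2(d) of Algorithm 3.3 in Friedman et al. (2001)), is written out explicitly here:

```python
import numpy as np

def pls_directions(T, y, n_dirs):
    """Partial Least Squares fits, following Algorithm 3.3 of Friedman
    et al. (2001), for a univariate response y. Returns the sequence of
    fitted vectors y_hat_1, ..., y_hat_{n_dirs}."""
    T = np.asarray(T, dtype=float).copy()
    y = np.asarray(y, dtype=float)
    # Step 1: standardize columns; initialize the fit with the mean of y.
    T = (T - T.mean(axis=0)) / T.std(axis=0)
    y_hat = np.full_like(y, y.mean())
    fits = []
    for m in range(n_dirs):
        phi = T.T @ y                     # phi_mj = <T_{m-1}^{(j)}, y>
        z = T @ phi                       # z_m = sum_j phi_mj T_{m-1}^{(j)}
        theta = (z @ y) / (z @ z)         # theta_m = <z_m, y> / ||z_m||^2
        y_hat = y_hat + theta * z         # update the fitted vector
        # Deflation: orthogonalize each column of T with respect to z_m.
        T = T - np.outer(z, (z @ T) / (z @ z))
        fits.append(y_hat.copy())
    return fits
```

With as many directions as (linearly independent) columns, the final fitted vector coincides with the least squares fit, a well-known property of PLS.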
3.2. Datasets used for benchmarking the models
9 datasets are used for comparing the algorithms from Section 3.1 to the BCNs:
• The usconsumption dataset. 2 series, 164 observations. The quarterly percentage changes of real personal expenditure and real personal disposable income in the United States, from March 1970 to December 2010. The source for this dataset is the Federal Reserve Bank of St Louis: http://data.is/AnVtzB and http://data.is/wQPcjU. usconsumption is also available in R package fpp (Hyndman (2013)). No transformation is applied to the dataset for stationarity.
• The Canada dataset. 4 series, 83 observations of economic indicators observed in Canada from 1980 (first quarter) to 2000 (fourth quarter). The source for this dataset is the OECD: http://www.oecd.org. Canada is also available in R package vars (Pfaff (2008)). We transform each one of the original time series $(I^{(j)}_t)_t$, $j = 1, \ldots, 4$ as $\log\left( I^{(j)}_{t+1} / I^{(j)}_t \right)$, $j = 1, \ldots, 4$.
• The ausmacro dataset used in Jiang et al. (2017). 35 series, 121 observations of macroeconomic indicators, available at http://ausmacrodata.org/research.php (accessed on August 4th, 2018). No transformation is applied to the dataset for stationarity.
• usexp, a dataset from Makridakis et al. (2008). 2 series, 87 observations. Available in R package fma under the name capital: seasonally adjusted quarterly capital expenditure and appropriations in U.S. manufacturing from 1953 to 1974. The time series are transformed as for Canada, for stationarity.
• The germancons dataset. 3 series, 91 observations. Quarterly seasonally adjusted West German fixed investment, disposable income and consumption expenditures, in billions of Deutsche Marks, from March 1960 to December 1982. Available at https://datamarket.com (accessed on August 4th, 2018). The time series are transformed as for Canada, for stationarity.
• Table F2.2: 10 series, 51 observations, the U.S. Gasoline Market from 1953 to 2004 (tableF2 2 in the results and Github data). And Table F5.2: 9 series, 203 observations, a macroeconomics data set, 1950I to 2000IV (tableF5 2 in the results and Github data). The Table F2.2 and Table F5.2 datasets both come from the 7th and 8th editions of Greene (2003), and are available at http://pages.stern.nyu.edu/~wgreene/Text/econometricanalysis.htm (accessed on August 4th, 2018) with their respective sources. The time series are transformed as for Canada, for stationarity.
• housing: 3 series, 81 observations. Monthly housing starts, construction contracts and average new home mortgage rates (from January 1983 to October 1989). Available in R package fma. The time series are transformed as for Canada, for stationarity.
• ips. 7 series, 575 observations. Industrial production data in the US, from 1959:I to 2006:IV. Obtained from Stock & Watson (2009) and the Global Insights Basic Economics Database, with the content column labeled 'E.F.?' equal to 1. The time series are transformed as for Canada, for stationarity.
3.3. Results
This section presents the rankings of the competing models based on their out-of-sample RMSE, and Diebold-Mariano tests (Diebold & Mariano (1995)) of the forecasting accuracies.
3.3.1. Ranking the models by out-of-sample RMSE
Once the optimal hyperparameters are chosen by cross-validation on the training/testing set (75%) for each dataset, they are used on the validation set (25%) to determine which model has the lowest RMSE on unseen data. The detailed cross-validation results on the training, testing and validation sets can be found at https://github.com/thierrymoudiki/ins-sirn-2018.
Since the datasets are not on the same scale, the RMSEs are not directly comparable across the competing models. Instead, we compare the rank of each model among the eight others, based on the out-of-sample RMSE. In Table 1 and Figure 3, we report the average rankings of each model, calculated on all the 9 datasets.
Table 1: Average RMSE rankings and standard deviation on the 9 validation sets
model horizon = 3 horizon = 6 horizon = 9 horizon = 12 Avg. rank
glmboost 5.11 +/- 1.90 5.33 +/- 2.55 5.88 +/- 2.37 4.88 +/- 2.62 5.30
lassoVAR 4.44 +/- 2.55 4.22 +/- 1.09 3.77 +/- 1.39 3.66 +/- 1.50 4.03
mean 4.00 +/- 2.87 4.11 +/- 2.09 3.11 +/- 2.31 3.00 +/- 2.24 3.55
pls 4.55 +/- 1.88 4.66 +/- 2.64 6.11 +/- 1.54 6.22 +/- 2.33 5.39
ridge2 2.66 +/- 1.50 2.88 +/- 2.15 3.88 +/- 2.20 4.00 +/- 1.58 3.36
rw 7.00 +/- 3.16 8.44 +/- 1.01 8.11 +/- 1.96 7.88 +/- 2.62 7.86
bcnI 5.44 +/- 2.13 4.66 +/- 2.06 3.55 +/- 1.94 3.44 +/- 1.59 4.28
bcnIII 4.00 +/- 2.18 3.55 +/- 2.24 3.77 +/- 1.92 4.66 +/- 2.24 4.00
VAR 7.77 +/- 1.30 7.11 +/- 2.42 6.77 +/- 2.86 7.22 +/- 1.99 7.22
Figure 3: Average RMSE rankings and standard deviation for ridge2, mean, bcnI, bcnIII
Based on the results in Table 1, no method is uniformly superior to the others. For short-term forecasts, ridge2 (Moudiki et al. (2018)) does better than the other methods, followed by bcnIII. The sample mean wins on long-term forecasts, followed by bcnI. The bcn models are therefore highly competitive on time series data.
The unrestricted VAR is well-known to overfit the data, because its learning mechanism uses many unconstrained parameters. But the lassoVAR largely improves on it, by forcing some of its parameters to be equal to 0.
Overall, ridge2 has the best average rank of all the methods. And the very good rank obtained by the mean confirms some observations of Jiang et al. (2017), who remarked that it is difficult to outperform the naive sample mean, especially on long-term forecasts here.
However, contrary to ridge2 or lassoVAR for example, the sample mean model provides almost no insights to the analyst (on how the covariates influence the response, how a shock on a covariate would affect the response, etc.), and mostly serves as a benchmark.
It might be beneficial to construct ensembles combining methods that perform well on short-term forecasts, ridge2 and bcnIII, with methods that perform well on long-term forecasts, mean and bcnI.
3.3.2. Diebold-Mariano tests of bcnIII vs mean
In this section, we compare the forecasts of bcnIII and mean on two datasets:
• a dataset on which bcnIII has a better average rank than mean: usexp
• a dataset on which mean has a better average rank than bcnIII: ips
Details on the rank of each model among the competing models presented in Section 3.1 can be found at https://github.com/thierrymoudiki/ins-sirn-2018.
In order to compare the forecasts of bcnIII and mean, we use the Diebold-Mariano test (Diebold & Mariano (1995)) on the residuals of the validation set, and obtain some information about the significance of the difference between the models' forecasts (at 5% and 10%). Negative values of the statistic in Tables 2 and 3 indicate superiority of the bcnIII forecasts over the sample mean forecasts. The null hypothesis is that the two forecasts have, on average, the same accuracy. One asterisk denotes significance relative to the asymptotic null distribution at the 10% level, and two asterisks denote significance at the 5% level.
Table 2: DM statistic for out-of-sample forecasting accuracy comparison, on usexp
series horizon = 3 horizon = 6 horizon = 9 horizon = 12
Appropriations 0.04 -0.11 -3.38∗∗ -1.88∗
Expenditure -1.88∗ 0.79 -0.75 -2.05∗∗
Table 3: DM statistic for out-of-sample forecasting accuracy comparison, on ips
series horizon = 3 horizon = 6 horizon = 9 horizon = 12
IPS13 0.01 0.62 0.86 1.25
IPS18 1.81∗ 2.29∗∗ 2.40∗∗ 3.33∗∗
IPS25 -1.03 -1.58 -1.83∗ -1.24
IPS34 -0.68 -0.37 0.15 0.86
IPS38 0.66 1.86∗ 2.52∗∗ 3.00∗∗
IPS43 -0.77 -0.70 -0.39 0.35
IPS306 1.78∗ 2.38∗∗ 2.80∗∗ 3.61∗∗
Looking at each dataset/series individually can lead to more nuanced observations than those reported in Section 3.3.1. Typically, on usexp, the bcnIII forecasts are always superior to the mean's for long-term horizons; the difference is significant 3 times out of 4. Also, on ips, for the IPS25 series, bcnIII is always superior to mean, the two forecasts being significantly different for horizon = 9.
4. Conclusion
In this paper, we discussed the Boosted Configuration Networks (BCNs) as a learning algorithm derived from the Stochastic Configuration Networks (SCNs) (Wang & Li (2017b)), the LS Boost from Friedman (2001), and Matching Pursuit algorithms (Vincent & Bengio (2002)). The hidden layers of the BCNs are chosen in a supervised way, to ensure that the universal approximation property of Neural Networks is met, as with the SCNs. But contrary to the SCNs, the BCNs incorporate both a learning rate that allows for a slow learning of the residuals, and a subsampling of the models' explanatory variables that decorrelates the predictors and reduces overfitting. The optimization of the hidden layers is also carried out by using derivative-free optimization.
The results obtained here by the BCNs on various multivariate time series datasets are promising. Interestingly, the overall performance of the BCNs on these specific datasets is superior to the performance of the unrestricted VAR and the lasso VAR. As mentioned in Exterkate et al. (2016) and in the conclusion of Makridakis et al. (2018), though, time series data are a very specific type of data, in which there is a serial dependence between the observations, and a strong correlation between the covariates. These characteristics can lead relatively complex models to fail, whereas relatively simple models produce accurate forecasts. This could be part of the reason why the BCNs do not perform as well as the model from Moudiki et al. (2018) on these datasets, except in given cases (based on out-of-sample RMSE ranks).
In a future work, it could be interesting to test other bounded activation functions against the hyperbolic tangent. In addition, combining methods that performed well here on short-term forecasts (ridge2 and bcnIII) with methods that performed well on long-term forecasts (mean and bcnI) could be beneficial. Finally, we would also want to assess how well the BCNs do on regression/classification problems not based on time series data.
5. Appendix
5.1. Bounds for hyperparameters search
For choosing the hyperparameters, we use Bayesian Optimization (the GP EI Opt of Snoek et al. (2012)), with the following bounds for each model:
Table 4: Bounds for hyperparameters search: glmboost
B ν lags nb hidden
Lower bound 1 0.01 1 2
Upper bound 10 0.5 4 100
Table 5: Bounds for hyperparameters search: lassoVAR
lags λ
Lower bound 1 1e-02
Upper bound 4 1e04
Table 6: Bounds for hyperparameters search: pls
B lags nb hidden
Lower bound 1 1 2
Upper bound 10 4 100
Table 7: Bounds for hyperparameters search: ridge2
lags nb hidden λ1 λ2
Lower bound 1 2 1e-02 1e-02
Upper bound 4 100 1e04 1e04
Table 8: Bounds for hyperparameters search: bcn
B lags ν λ r ε col sample
Lower bound 2 1 0.01 1e-02 0.8 1e-06 0.5
Upper bound 10 4 0.5 1e04 0.99 1e-02 1
Where:
• B: number of boosting iterations for glmboost, bcnI, bcnIII.
• ν: learning rate for glmboost, bcnI, bcnIII.
• lags: number of lags of each time series included in the regression (see details at the beginning of Section 2).
• nb hidden: number of nodes in the hidden layer (M in Section 2) for ridge2, glmboost, pls.
• λ: regularization parameter for lassoVAR (see rLV in Cavalcante et al. (2017)), bcnI, bcnIII (see Algorithm 1).
• λ1, λ2: regularization parameters for ridge2 (see Moudiki et al. (2018)).
• ε: level of tolerance for the Frobenius norm of the residuals, for bcnI, bcnIII.
• col sample: percentage of the covariates used at each boosting iteration, for bcnI, bcnIII.
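For illustration, the bounds of Table 8 can be searched as sketched below. Note that the paper itself uses the GP EI Bayesian optimization of Snoek et al. (2012); the plain random search here is a simpler stand-in, and the objective shown is a placeholder for the actual cross-validation RMSE.

```python
import math
import random

# Bounds for the bcn hyperparameters (Table 8); B and lags are integers.
BOUNDS = {
    "B":          (2, 10),       # boosting iterations
    "lags":       (1, 4),        # lags per series
    "nu":         (0.01, 0.5),   # learning rate
    "lam":        (1e-2, 1e4),   # regularization (sampled on a log scale)
    "r":          (0.8, 0.99),
    "epsilon":    (1e-6, 1e-2),  # residual tolerance (log scale)
    "col_sample": (0.5, 1.0),
}

def sample_config(rng):
    """Draw one configuration inside the Table 8 bounds; lam and epsilon
    are sampled log-uniformly, as is customary for such parameters."""
    cfg = {}
    for name, (lo, hi) in BOUNDS.items():
        if name in ("B", "lags"):
            cfg[name] = rng.randint(lo, hi)
        elif name in ("lam", "epsilon"):
            cfg[name] = math.exp(rng.uniform(math.log(lo), math.log(hi)))
        else:
            cfg[name] = rng.uniform(lo, hi)
    return cfg

def random_search(objective, n_iter=25, seed=0):
    """Keep the configuration with the lowest objective value."""
    rng = random.Random(seed)
    best_cfg, best_val = None, float("inf")
    for _ in range(n_iter):
        cfg = sample_config(rng)
        val = objective(cfg)
        if val < best_val:
            best_cfg, best_val = cfg, val
    return best_cfg, best_val

# Placeholder objective: penalise distance of nu from an arbitrary target.
best, value = random_search(lambda c: (c["nu"] - 0.1) ** 2)
print(best["nu"], value)
```

In the paper's setting, the lambda passed to `objective` would drive a full fit-and-forecast cycle on the training folds, as described in the cross-validation setup.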
The detailed cross-validation results can be found at https://github.com/thierrymoudiki/ins-sirn-2018.
References

Barron, A. R. (1993). Universal Approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39, 930–945.

Bergmeir, C., Hyndman, R. J., Koo, B. et al. (2015). A note on the validity of cross-validation for evaluating Time Series prediction. Monash University, Department of Econometrics and Business Statistics, Tech. Rep.

Bühlmann, P., & Yu, B. (2003). Boosting with the L2 loss: Regression and Classification. Journal of the American Statistical Association, 98, 324–339.

Caruana, R. (1998). Multitask Learning. In Learning to Learn (pp. 95–133). Springer.

Cavalcante, L., Bessa, R. J., Reis, M., & Browell, J. (2017). Lasso Vector Autoregression structures for very short-term wind power forecasting. Wind Energy, 20, 657–675.

Davidson, R., MacKinnon, J. G. et al. (2004). Econometric Theory and Methods (Vol. 5). Oxford University Press, New York.

De Jong, S. (1993). SIMPLS: an alternative approach to Partial Least Squares Regression. Chemometrics and Intelligent Laboratory Systems, 18, 251–263.

Diebold, F., & Mariano, R. (1995). Comparing Predictive Accuracy. Journal of Business & Economic Statistics, 13, 253–263.

Exterkate, P., Groenen, P. J., Heij, C., & van Dijk, D. (2016). Nonlinear forecasting with many predictors using Kernel Ridge Regression. International Journal of Forecasting, 32, 736–753.

Friedman, J., Hastie, T., & Tibshirani, R. (2001). The Elements of Statistical Learning (Vol. 1). Springer Series in Statistics, New York.

Friedman, J. H. (2001). Greedy function approximation: a Gradient Boosting Machine. Annals of Statistics, (pp. 1189–1232).
Friedman, J. H., & Stuetzle, W. (1981). Projection Pursuit Regression. Journal of the American Statistical Association, 76, 817–823.

Fu, W. J. (1998). Penalized Regressions: the Bridge versus the Lasso. Journal of Computational and Graphical Statistics, 7, 397–416.

Greene, W. H. (2003). Econometric Analysis (5th ed.). Upper Saddle River, NJ, (pp. 89–140).

Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer Feedforward Networks are Universal Approximators. Neural Networks, 2, 359–366.

Hothorn, T., Bühlmann, P., Kneib, T., Schmid, M., & Hofner, B. (2010). Model-based Boosting 2.0. Journal of Machine Learning Research, 11, 2109–2113.

Hothorn, T., Bühlmann, P., Kneib, T., Schmid, M., & Hofner, B. (2017). mboost: Model-Based Boosting. R package version 2.8-1. URL: http://CRAN.R-project.org/package=mboost.

Hyndman, R. J. (2013). fpp: Data for "Forecasting: principles and practice". R package version 0.5. URL: https://CRAN.R-project.org/package=fpp.

Igelnik, B., & Pao, Y.-H. (1995). Stochastic choice of basis functions in adaptive function approximation and the functional-link net. IEEE Transactions on Neural Networks, 6, 1320–1329.

Jiang, B., Athanasopoulos, G., Hyndman, R. J., Panagiotelis, A., Vahid, F. et al. (2017). Macroeconomic forecasting for Australia using a large number of predictors. Monash Econometrics and Business Statistics Working Papers, 2, 17.

Joe, S., & Kuo, F. (2008). Notes on generating Sobol sequences. http://web.maths.unsw.edu.au/~fkuo/sobol/joe-kuo-notes.pdf.

Jones, L. K. et al. (1992). A simple lemma on greedy approximation in Hilbert space and convergence rates for projection pursuit regression and Neural Network training. The Annals of Statistics, 20, 608–613.
Kwok, T.-Y., & Yeung, D.-Y. (1997). Objective functions for training new hidden units in constructive Neural Networks. IEEE Transactions on Neural Networks, 8, 1131–1148.

Li, M., & Wang, D. (2017). Insights into randomized algorithms for Neural Networks: Practical issues and common pitfalls. Information Sciences, 382, 170–178.

Lütkepohl, H. (2005). New Introduction to Multiple Time Series Analysis. Springer Science & Business Media.

Makridakis, S., Spiliotis, E., & Assimakopoulos, V. (2018). Statistical and Machine Learning forecasting methods: Concerns and ways forward. PLoS One, 13.

Makridakis, S., Wheelwright, S. C., & Hyndman, R. J. (2008). Forecasting Methods and Applications. John Wiley & Sons.

Mallat, S., & Zhang, Z. (1993). Matching pursuit with time-frequency dictionaries. Technical Report, Courant Institute of Mathematical Sciences, New York, United States.

Moudiki, T., Planchet, F., & Cousin, A. (2018). Multiple Time Series Forecasting Using Quasi-Randomized Functional Link Neural Networks. Risks, 6, 22.

Nicholson, W. B., Matteson, D. S., & Bien, J. (2014). Structured regularization for large Vector Autoregression. Cornell University.

Niederreiter, H. (1992). Random Number Generation and Quasi-Monte Carlo Methods. SIAM.

Pfaff, B. (2008). VAR, SVAR and SVEC Models: Implementation Within R Package vars. Journal of Statistical Software, 27. URL: http://www.jstatsoft.org/v27/i04/.
Snoek, J., Larochelle, H., & Adams, R. P. (2012). Practical Bayesian Optimization of Machine Learning algorithms. In Advances in Neural Information Processing Systems (pp. 2951–2959).

Stock, J. H., & Watson, M. (2009). Forecasting in dynamic factor models subject to structural instability. The Methodology and Practice of Econometrics. A Festschrift in Honour of David F. Hendry, 173, 205.

Vincent, P., & Bengio, Y. (2002). Kernel Matching Pursuit. Machine Learning, 48, 165–187.

Wang, D., & Li, M. (2017a). Robust Stochastic Configuration Networks with kernel density estimation for uncertain data regression. Information Sciences, 412, 210–222.

Wang, D., & Li, M. (2017b). Stochastic Configuration Networks: Fundamentals and algorithms. IEEE Transactions on Cybernetics, 47, 3466–3479.