Bayesian Quasi-Random functional link networks for Machine Learning uncertainty quantification and scalable optimization
T. Moudiki
August 12, 2024
Abstract
This paper contributes to adding Bayesian Quasi-Random Vector Functional Link neural networks (BQRVFLs) to the Machine Learning practitioner's toolbox. A BQRVFL is a hybrid penalized regression/neural network model that takes into account the input data's heterogeneity through clustering. Its regression coefficients are governed by a multivariate Gaussian prior distribution, and the hidden layer's nodes are drawn from a deterministic Sobol sequence. In addition to being interpretable as a linear model, the model is capable of producing highly nonlinear outputs. I show that a BQRVFL is amenable to Online Machine Learning, and I use it as a workhorse for the scalable Bayesian optimization of various black box/complex functions.
1. Introduction
In this paper, I present a Bayesian Quasi-Random Vector Functional Link (BQRVFL) network model: a hybrid nonlinear regression/neural network model derived from the class of RVFL models (Schmidt et al. (1992) and Pao et al. (1994)). RVFL networks are a class of scalable "neural" networks in which, in addition to a single hidden layer that incorporates the nonlinear effects, there is a so-called direct link between the explanatory variables and the output variable that captures the linear effects. They have been successfully applied to solving different types of classification, regression and time series forecasting problems (see Dehuri and Cho (2010) and Moudiki et al. (2018) for example).
BQRVFLs can be used for solving supervised Machine Learning (ML) problems requiring predictive uncertainty quantification, and I will show that, when used as a workhorse for Bayesian optimization (Mockus et al. (1978)), they can obtain performances that are on par with Gaussian Processes' (GPs, the usual choice for this type of task). The clear advantage of BQRVFLs over GPs in this context is their scalability. Indeed, whereas BQRVFLs mainly involve relatively cheap matrix inversions in feature space, vanilla GPs scale cubically with the number of observations of the input dataset, and perform poorly in high dimensions (see Snoek et al. (2015) and Springenberg et al. (2016) for discussions on this).
The BQRVFL setting (more details in section 2) can be summarized as follows:

- Input explanatory variables are clustered before entering the regression, in order to take into account the data's heterogeneity (similarities or dissimilarities between observations), as a Gaussian Process (Rasmussen and Williams (2006)) or a kernel machine (Ferwerda et al. (2017)) would do.

- In order to obtain a nonlinear model, a hidden layer is employed, whose nodes are quasi-randomized (see Moudiki et al. (2018) and Niederreiter (1992)). Before clustering and applying the hidden layer to the inputs, the explanatory variables are scaled. This makes it a 3-step scaling procedure (3 because the whole set of features is then scaled before being adjusted to the response), with each transformation applied to the training set being subsequently applied to the test set.

- The regression model's coefficients are governed by a zero-mean multivariate Gaussian prior distribution (see Rasmussen and Williams (2006), Chapter 2) with constant variance. See section 2 for more details.

- The regression model's residuals are governed by a zero-mean multivariate Gaussian distribution (see Rasmussen and Williams (2006), Chapter 2) with constant variance. See section 2 for more details. As for GPs, these relatively strong hypotheses on priors - and residuals - do not prevent the model from capturing heteroskedastic patterns, as shown in section 3.3.
In addition to being interpretable as a linear model (typically, by taking partial derivatives of the response with respect to the model's explanatory variables), the BQRVFL is amenable to Online Machine Learning. Section 2 presents the model's characteristics and its sophisticated-looking but simple posterior distribution. I notably show how BQRVFLs can be used for Online Machine Learning (Harold et al. (1997)) in section 2.2. Section 3 presents several robust numerical examples of use of the model on various datasets, notably for Bayesian Optimization (Mockus et al. (1978), Snoek et al. (2012)) tasks.
2. Model description
2.1. A Bayesian Quasi-Random Vector Functional Link network model (BQRVFL)
Let $y \in \mathbb{R}^n$ be the centered model response, the variable to be explained, and $Z$ the standardized (columns' means equal to 0 and standard deviations equal to 1), transformed explanatory variables. $y | Z$ is modeled as:

$$y = Z\beta + \epsilon \qquad (1)$$

where:

$$Z \in \mathbb{R}^{n \times (p+J+q)} \qquad (2)$$

and

$$Z := [X \;\; \Phi(X)] \qquad (3)$$

$Z$ is the column-wise concatenation of the matrices $X$ and $\Phi(X)$. The first $p + J$ columns of $Z$ contain $X$, the model's standardized input data with $p$ covariates, enriched with their clustering information. The clustering information on the input data consists of $J$ one-hot encoded covariates, one for each k-means (Hartigan and Wong (1979)) cluster (or Gaussian Mixture, or other clustering method; only k-means is used here). The $J$ additional covariates aim at taking into account the input data's heterogeneity: similarities and dissimilarities between observations.
For each row $i \in \{1, \ldots, n\}$ of the matrix $\Phi(X)$, we have the following terms:

$$\Phi(X)_i = g\left(x_i^T W\right) \qquad (3)$$

$x_i$ is the $i$th row of the matrix $X$. $W \in \mathbb{R}^{(p+J) \times q}$ - which is deterministic and not learned by the model - is drawn from a quasi-random Sobol sequence (see Niederreiter (1992) and Moudiki et al. (2018)). $g$ is an activation function that produces nonlinearity in the model outputs. Typically, the activation function could be the ReLU, $g: x \mapsto \max(x, 0)$, the hyperbolic tangent, $g: x \mapsto \tanh(x)$, the sigmoid, $g: x \mapsto \frac{1}{1+e^{-x}}$, or any other activation function used in "neural" networks. $q$ denotes the number of nodes in the hidden layer, and also the number of additional covariates constructed from $X$ via $W$.
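To make the construction of $Z$ concrete, here is a minimal Python sketch of equations (2)-(3) (an illustration only, not the paper's implementation; it assumes NumPy, SciPy's quasi-Monte Carlo module and scikit-learn's KMeans, and all function and variable names are hypothetical):

```python
import numpy as np
from scipy.stats import qmc
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def build_features(X, J=3, q=100, activation=np.tanh, seed=123):
    """Builds Z = [X, clusters, Phi(X)] as in equations (2)-(3)."""
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)          # p standardized covariates

    km = KMeans(n_clusters=J, n_init=10, random_state=seed).fit(X_scaled)
    clusters = np.eye(J)[km.labels_]            # J one-hot encoded cluster covariates

    X_aug = np.hstack([X_scaled, clusters])     # p + J columns

    # deterministic Sobol points used as the (p + J) x q hidden-layer weights W
    W = qmc.Sobol(d=q, scramble=False).random(X_aug.shape[1])
    Phi = activation(X_aug @ W)                 # q nonlinear covariates

    Z = np.hstack([X_aug, Phi])                 # n x (p + J + q)
    return Z, scaler, km, W                     # keep the fitted objects for test data
```

In practice, the same fitted scaler, clustering and matrix $W$ are reused to transform test or newly arriving observations, and the whole of $Z$ can be scaled again before the regression, following the 3-step procedure described in the introduction.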
$\beta \in \mathbb{R}^{p+J+q}$ is an unknown vector of regression coefficients, to be estimated from $y$ and $Z$, and we set the following prior distribution on $\beta$:

$$\beta \sim \mathcal{N}\left(0_{\mathbb{R}^{p+J+q}},\; s^2 I_{p+J+q}\right), \quad s > 0 \qquad (4)$$

$\epsilon \in \mathbb{R}^n$ are the model residuals, on which we set the following distribution:

$$\epsilon \sim \mathcal{N}\left(0_{\mathbb{R}^n},\; \sigma^2 I_n\right), \quad \sigma > 0 \qquad (5)$$
In a Bayesian Linear Regression setting (see Rasmussen and Williams (2006) or Bishop (2006)), the posterior distribution of the estimated $\hat{\beta}$ given $Z$ is a multivariate Gaussian with parameters:

$$\mu_{\hat{\beta}|Z} = C_n^{-1} Z^T y \qquad (6)$$

$$\Sigma_{\hat{\beta}|Z} = s^2 \left(I_{p+J+q} - C_n^{-1} Z^T Z\right) \qquad (7)$$

where:

$$C_n^{-1} := \left(Z^T Z + \frac{\sigma^2}{s^2} I_{p+J+q}\right)^{-1} \qquad (8)$$
$\frac{\sigma^2}{s^2} =: \lambda$ is equivalent to the Ridge regression (Hoerl and Kennard (1970)) regularization hyperparameter. Now, for new observations $Z$ arriving into the model, we can obtain the following mean and covariance for the multivariate Gaussian distribution of the model's predictions:

$$\mu_{y|Z} = Z \mu_{\hat{\beta}|Z} \qquad (9)$$

$$\Sigma_{y|Z} = Z \Sigma_{\hat{\beta}|Z} Z^T + \sigma^2 I_n \qquad (10)$$

The model's hyperparameters involved in $\mu_{y|Z}$ and $\Sigma_{y|Z}$ are $q$, $\sigma^2$, $s^2$ and $J$. In order to obtain good values for these hyperparameters in the applications, an optimization of a Generalized Cross Validation (GCV) (Golub et al. (1979)) criterion can be used. Another approach, resource-hungry but free of any hyperparameter choice, based on grids of hyperparameters, is also presented in section 3.3.
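A minimal NumPy sketch of the posterior and predictive formulas (6)-(10), together with the GCV criterion mentioned above, could look as follows (illustrative only; `lam` stands for $\lambda = \sigma^2/s^2$ and the function names are hypothetical):

```python
import numpy as np

def posterior(Z, y, lam, s2=1.0):
    """Posterior mean/covariance of beta, equations (6)-(8); lam = sigma^2 / s^2."""
    k = Z.shape[1]                                    # k = p + J + q
    Cn_inv = np.linalg.inv(Z.T @ Z + lam * np.eye(k))
    mu_beta = Cn_inv @ Z.T @ y                        # equation (6)
    Sigma_beta = s2 * (np.eye(k) - Cn_inv @ Z.T @ Z)  # equation (7)
    return mu_beta, Sigma_beta, Cn_inv

def predict(Z_new, mu_beta, Sigma_beta, sigma2):
    """Predictive mean/covariance for new rows Z_new, equations (9)-(10)."""
    mean = Z_new @ mu_beta
    cov = Z_new @ Sigma_beta @ Z_new.T + sigma2 * np.eye(Z_new.shape[0])
    return mean, cov

def gcv(Z, y, lam):
    """Generalized Cross Validation criterion (Golub et al. (1979)) for a given lam."""
    n, k = Z.shape
    S = Z @ np.linalg.inv(Z.T @ Z + lam * np.eye(k)) @ Z.T   # smoother matrix
    resid = y - S @ y
    return (np.sum(resid ** 2) / n) / (1.0 - np.trace(S) / n) ** 2
```

Minimizing `gcv` over a small set of candidate values of $\lambda$ (and of $q$, $J$) is one way of setting the hyperparameters before the predictive distribution is used.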
Once $\hat{q}$, $\hat{\sigma}^2$, $\hat{s}^2$, $\hat{J}$, $\mu_{y|Z}$ and $\hat{\Sigma}_{y|Z}$ are set, obtaining a sensitivity of the response $y$ to a change in an explanatory variable $X^{(l)}$, for $l \in \{1, \ldots, p\}$, is straightforward, as in a linear model; this makes BQRVFLs highly nonlinear but directly interpretable white boxes.
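One simple, model-agnostic way of obtaining such sensitivities is by finite differences through the whole fitted pipeline; the sketch below is only an illustration (the `predict_mean` function, mapping raw inputs to the posterior predictive mean, is an assumed placeholder):

```python
import numpy as np

def sensitivities(predict_mean, X, eps=1e-4):
    """Finite-difference sensitivities of the predictive mean to each covariate X^(l),
    averaged over the rows of X."""
    base = predict_mean(X)
    grads = []
    for l in range(X.shape[1]):
        X_pert = X.copy()
        X_pert[:, l] += eps               # small perturbation of the l-th covariate
        grads.append(np.mean((predict_mean(X_pert) - base) / eps))
    return np.array(grads)
```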
2.2. Updating the model in an Online Learning fashion

BQRVFLs can be updated in an Online Machine Learning fashion, following some ideas from Harold et al. (1997). Typically, here, we would like to update the estimates $\mu_{\hat{\beta}|Z}$ and $\Sigma_{\hat{\beta}|Z}$ for a new, unseen observation arriving into the model, instead of re-training the whole model, while keeping the same fixed hyperparameters.

Without disclosing all the mathematical details, keeping the same fixed hyperparameters and not re-training the whole model relies, in this context, on the assumption that the conditional distribution $y|Z$ remains relatively stable as time passes by (no distribution shift or regime switching, for example).
If, at time $t = n$, as indicated in the previous paragraph, we have estimates $\mu^{(n)}_{\hat{\beta}|Z}$ and $\Sigma^{(n)}_{\hat{\beta}|Z}$ for $\mu_{\hat{\beta}|Z}$ and $\Sigma_{\hat{\beta}|Z}$, then we have the following updating formulas for the model described in section 2:

$$\mu^{(n+1)}_{\hat{\beta}|Z} = \mu^{(n)}_{\hat{\beta}|Z} + C_{n+1}^{-1} z_{(n+1)}^T \left[ y_{(n+1)} - z_{(n+1)} \mu^{(n)}_{\hat{\beta}|Z} \right]$$

$$\Sigma^{(n+1)}_{\hat{\beta}|Z} = \left( I_{p+J+q} - C_{n+1}^{-1} z_{(n+1)}^T z_{(n+1)} \right) \Sigma^{(n)}_{\hat{\beta}|Z}$$

with:

$$C_{n+1}^{-1} = C_n^{-1} - \frac{C_n^{-1} z_{(n+1)}^T z_{(n+1)} C_n^{-1}}{1 + z_{(n+1)} C_n^{-1} z_{(n+1)}^T} \qquad (20)$$

$$C_{n+1}^{-1} z_{(n+1)}^T = \frac{C_n^{-1} z_{(n+1)}^T}{1 + z_{(n+1)} C_n^{-1} z_{(n+1)}^T} \qquad (21)$$
Indeed, based on section 2.1, these are given by:

$$\mu^{(n)}_{\hat{\beta}|Z} = C_n^{-1} Z_{(n)}^T y_{(n)} \qquad (22)$$

$$\Sigma^{(n)}_{\hat{\beta}|Z} = s^2 \left( I_{p+J+q} - C_n^{-1} Z_{(n)}^T Z_{(n)} \right) \qquad (23)$$

For a new, unseen observation $\left(z_{(n+1)}, y_{(n+1)}\right)$ arriving in the model, we'd have:

$$\mu^{(n+1)}_{\hat{\beta}|Z} = C_{n+1}^{-1} Z_{(n+1)}^T y_{(n+1)} \qquad (24)$$

$$= C_{n+1}^{-1} \left( Z_{(n)}^T y_{(n)} + y_{(n+1)} z_{(n+1)}^T \right) \qquad (25)$$

So that:
$$\begin{aligned}
\mu^{(n+1)}_{\hat{\beta}|Z} - \mu^{(n)}_{\hat{\beta}|Z} &= \left( C_{n+1}^{-1} - C_n^{-1} \right) Z_{(n)}^T y_{(n)} + y_{(n+1)} C_{n+1}^{-1} z_{(n+1)}^T \\
&= C_{n+1}^{-1} \left( C_n - C_{n+1} \right) C_n^{-1} Z_{(n)}^T y_{(n)} + y_{(n+1)} C_{n+1}^{-1} z_{(n+1)}^T \\
&= - C_{n+1}^{-1} z_{(n+1)}^T z_{(n+1)} C_n^{-1} Z_{(n)}^T y_{(n)} + y_{(n+1)} C_{n+1}^{-1} z_{(n+1)}^T \\
&= C_{n+1}^{-1} z_{(n+1)}^T \left[ y_{(n+1)} - z_{(n+1)} C_n^{-1} Z_{(n)}^T y_{(n)} \right] \\
&= C_{n+1}^{-1} z_{(n+1)}^T \left[ y_{(n+1)} - z_{(n+1)} \mu^{(n)}_{\hat{\beta}|Z} \right]
\end{aligned}$$
Similarly, for the covariance matrix:

$$\begin{aligned}
\frac{1}{s^2}\left( \Sigma^{(n+1)}_{\hat{\beta}|Z} - \Sigma^{(n)}_{\hat{\beta}|Z} \right) &= C_n^{-1} Z_{(n)}^T Z_{(n)} - C_{n+1}^{-1} Z_{(n+1)}^T Z_{(n+1)} \\
&= C_n^{-1} Z_{(n)}^T Z_{(n)} - C_{n+1}^{-1} \left( Z_{(n)}^T Z_{(n)} + z_{(n+1)}^T z_{(n+1)} \right) \\
&= \left( C_n^{-1} - C_{n+1}^{-1} \right) Z_{(n)}^T Z_{(n)} - C_{n+1}^{-1} z_{(n+1)}^T z_{(n+1)} \\
&= C_{n+1}^{-1} z_{(n+1)}^T z_{(n+1)} C_n^{-1} Z_{(n)}^T Z_{(n)} - C_{n+1}^{-1} z_{(n+1)}^T z_{(n+1)} \\
&= C_{n+1}^{-1} z_{(n+1)}^T z_{(n+1)} \left[ C_n^{-1} Z_{(n)}^T Z_{(n)} - I_{p+J+q} \right] \\
&= -\frac{1}{s^2} C_{n+1}^{-1} z_{(n+1)}^T z_{(n+1)} \Sigma^{(n)}_{\hat{\beta}|Z}
\end{aligned}$$
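These recursions amount to a rank-one (Sherman-Morrison) update of $C_n^{-1}$, so each new observation is absorbed without refitting on the full history. A minimal NumPy sketch, with illustrative names, could be:

```python
import numpy as np

def online_update(mu_beta, Sigma_beta, Cn_inv, z_new, y_new):
    """Absorbs one new observation (z_new, y_new); z_new has shape (1, p+J+q).

    Implements equations (20)-(21) and the updating formulas of section 2.2."""
    denom = 1.0 + float(z_new @ Cn_inv @ z_new.T)                     # 1 + z C_n^{-1} z^T
    Cn1_inv = Cn_inv - (Cn_inv @ z_new.T @ z_new @ Cn_inv) / denom    # equation (20)
    gain = Cn1_inv @ z_new.T                                          # C_{n+1}^{-1} z^T, equation (21)
    resid = float(y_new - z_new @ mu_beta)                            # innovation on the new response
    mu_new = mu_beta + (gain * resid).ravel()                         # mean update
    Sigma_new = (np.eye(Cn_inv.shape[0]) - gain @ z_new) @ Sigma_beta # covariance update
    return mu_new, Sigma_new, Cn1_inv
```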
3. BQRVFL in action

We start this section by presenting test set predictions and posterior simulations of a BQRVFL trained on a set of 4 macroeconomic variables (Longley (1967)). In section 3.3, out-of-sample predictions are obtained on 4 real-world datasets, with hyperparameters simulated on a grid, as suggested in section 2.1. Section 3.4 applies the Online BQRVFL to the Bayesian optimization of complex functions and to Machine Learning hyperparameters' tuning.
3.1. Examples on real-world datasets
3.2. A forecasting problem on macroeconomic data
A BQRVFL model is applied to Longley (1967)'s macroeconomic data set, for anticipating the random evolution of the 'noninstitutionalized' population over 14 years of age, as a function of unemployment, the number of people enrolled in the armed forces, and the Gross National Product (GNP). It is worth mentioning that considering these indicators' lags might be very beneficial in this context.

The training set contains observations from 1947 to 1957, and the test set contains observations from 1958 to 1962. Also, $\hat{q} = 100$, $\hat{J} = 3$, and $\hat{\lambda} = \frac{\hat{\sigma}^2}{\hat{s}^2} = 10$. The true values are depicted as a black bold line (left and right). Mean predicted values (left) are depicted as a red line. Shaded regions (left) represent 80% and 95% prediction intervals around the mean. On the right is a spaghetti plot of 1000 posterior predictive simulations.
3.3. Without hyperparameter tuning

In this section, the BQRVFL hyperparameters are not optimized. Instead, a grid containing 5000 combinations of the BQRVFL's hyperparameters is constructed as follows for $q$, $\lambda := \frac{\sigma^2}{s^2}$ and $J$:

- 20 values for the number of nodes: $q \in \{5, 10, 25, 40, 50\} \cup \{\text{15 Sobol numbers in } [0, 1000]\}$

- 50 values for the regularization parameter: $\lambda = 10^x$, for 50 values of $x \in [-5, 4]$

- 5 values for the number of k-means clusters: $J \in \{2, 3, 4, 5, 6\}$

The 4 data sets used are mtcars, quakes, UScrime and motors from the R package MASS. The BQRVFL is trained on each of these 4 data sets, on 80% of the data, for each one of the 5000 hyperparameter combinations.
This procedure is more time-consuming than the one presented in the previous section, but it is also embarrassingly parallelizable over the grid of hyperparameters. It has the advantage of avoiding the task of choosing the hyperparameters.
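For concreteness, the grid can be generated along these lines (a sketch only: the exact 15 Sobol values for $q$ used in the experiments are not specified, so the ones below are merely plausible stand-ins):

```python
import itertools
import numpy as np
from scipy.stats import qmc

# 20 values for q: 5 fixed values plus 15 deterministic Sobol numbers scaled to [0, 1000]
sobol_pts = qmc.Sobol(d=1, scramble=False).random(16)[1:]      # skip the initial 0
qs = [5, 10, 25, 40, 50] + sorted(int(v * 1000) for v in sobol_pts.ravel())

lambdas = [10.0 ** x for x in np.linspace(-5, 4, 50)]          # 50 regularization values
Js = [2, 3, 4, 5, 6]                                           # 5 numbers of k-means clusters

grid = list(itertools.product(qs, lambdas, Js))                # 20 x 50 x 5 = 5000 combinations
```

Each of the 5000 fits being independent, the loop over `grid` can be distributed naively over workers.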
3.4. Bayesian optimization

3.4.1. OPTIMIZING COMPLEX FUNCTIONS

The numerical examples of this section use BQRVFLs as workhorses for the optimization of complex functions. These functions are: Alpine, Branin (2D and 5D), and Hartmann (3D and 6D).

As a reminder, Bayesian Optimization is useful and efficient for finding minima or maxima of black box functions, whose evaluations are expensive and whose gradients are not necessarily available in closed form. It has been shown to be very effective on challenging optimization functions (see Jones (2001) and Snoek et al. (2012)).

The idea of Bayesian Optimization is to optimize an alternative, cheaper function called the acquisition function, rather than the main, difficult one. For doing this, the uncertainty around the predictions of an alternative machine learning model - the surrogate model - is used.

The surrogate model's posterior distribution tries to approximate the objective function in a probabilistic way, and the quality of the description of the objective by this posterior improves as more points are evaluated during the optimization procedure.
GPs with Matérn 5/2 kernels are often used as surrogate models (see Snoek et al. (2012) for example). Here, we also use the BQRVFL model as a surrogate, and the acquisition function will be the Expected Improvement (EI), defined as:

$$a_{EI}(x; \theta) = \mathbb{E}\left[ \max\left( f^* - \tilde{f}(x, \theta),\; 0 \right) \right] \qquad (26)$$

where $f^*$ is the current minimum value found after a few evaluations of the objective function $f$, and $\tilde{f}(x, \theta)$ is the prediction of the surrogate model with hyperparameters $\theta$.
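Since the BQRVFL (like a GP) delivers Gaussian predictions, the expectation in (26) has a well-known closed form; here is an illustrative NumPy version for minimization (the predictive mean and standard deviation are assumed to come from the formulas of section 2.1):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    """Closed-form EI for minimization, for Gaussian predictions N(mu, sigma^2).

    mu, sigma: predictive mean and standard deviation at the candidate points;
    f_best: current minimum of the objective among the points already evaluated."""
    sigma = np.maximum(sigma, 1e-12)            # guard against zero predictive variance
    gamma = (f_best - mu) / sigma
    return (f_best - mu) * norm.cdf(gamma) + sigma * norm.pdf(gamma)
```

The next point to evaluate on the true objective is the candidate maximizing this acquisition.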
Alpine, Branin, Hartmann functions

These functions are widely used as benchmarks for testing optimization methods on black box functions. Some examples of optimization for this type of function can be found in Picheny et al. (2013). The global minima $f_{min}$ of these functions are first determined, and we minimize the immediate regrets, computed relative to the minima $f^{BQRVFL}_{min}$ and $f^{\text{Matérn } 5/2}_{min}$ found by using the 2 surrogates' predictions:

$$f^{BQRVFL}_{min} - f_{min} \qquad (27)$$

and

$$f^{\text{Matérn } 5/2}_{min} - f_{min} \qquad (28)$$

The Bayesian minimization procedure is restarted 50 times with different random seeds and random (uniform) initial designs (the first set of points on which the black box function is fully evaluated, and on which the hyperparameters of the surrogate are calibrated) containing 10 points each, so that we can obtain a distribution of the regrets.
In order to find good hyperparameter values for the Matérn 5/2 surrogate, we use maximum likelihood. For the BQRVFL hyperparameters' choice, we minimize the GCV criterion, with 3 different activation functions: ReLU, sigmoid and tanh. Figure 1 presents the distribution (over 50 repeats) of immediate regrets for the Branin 2D minimization, as a function of the number of iterations of the optimization algorithms.
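Putting the pieces together, the minimization loop used in these experiments has roughly the following shape (a skeleton only; `fit_surrogate`, `surrogate.predict` and `sample_candidates` are hypothetical placeholders for the BQRVFL fit of section 2.1, its predictive distribution, and a sampler of the search domain):

```python
import numpy as np

def bayes_opt(objective, sample_candidates, n_init=10, n_iter=200, seed=None):
    """Skeleton of a Bayesian minimization run with a Gaussian surrogate and EI."""
    rng = np.random.default_rng(seed)
    X = sample_candidates(n_init, rng)                  # random uniform initial design
    y = np.array([objective(x) for x in X])
    for _ in range(n_iter):
        surrogate = fit_surrogate(X, y)                 # e.g. the BQRVFL posterior
        cands = sample_candidates(1000, rng)            # cheap candidate set
        mu, sigma = surrogate.predict(cands)            # predictive mean and std dev
        x_next = cands[np.argmax(expected_improvement(mu, sigma, y.min()))]
        y = np.append(y, objective(x_next))             # one expensive evaluation per iteration
        X = np.vstack([X, x_next])
    return X[np.argmin(y)], y.min()
```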
Figure 1: Distribution (on 50 repeats) of immediate regrets for Branin 2D
minimization, with Matérn and BQRVFL ReLU surrogate models.
          Matérn 5/2   ReLU     Sigmoid   Tanh
Min.      0.0004       0.0018   0.0003    0.0001
1st Qt.   0.0101       0.0091   0.0087    0.0102
Median    0.0101       0.0101   0.0101    0.0101
3rd Qt.   0.0101       0.0101   0.0101    0.0101
Max.      0.0203       0.0207   0.0222    0.0207
Figure 2: Distribution of immediate regrets found by each algorithm after
200 iterations on Branin 2D
          Matérn 5/2   ReLU     Sigmoid   Tanh
Min.      0.1185       0.0690   0.0996    0.0365
1st Qt.   0.2236       0.2236   0.2236    0.2236
Median    0.2236       0.2236   0.2236    0.2236
3rd Qt.   0.2236       0.2236   0.2294    0.3436
Max.      0.3785       0.6038   1.1831    1.2872
Figure 3: Distribution of immediate regrets found by each algorithm after
200 iterations on Hartmann 3D (surrogate models in columns).
          Matérn 5/2   ReLU     Sigmoid   Tanh
Min.      0.8434       0.5100   0.6637    0.6637
1st Qt.   1.8440       1.7690   1.8440    1.7725
Median    1.8440       1.8440   1.8440    1.8440
3rd Qt.   1.8440       2.0310   2.4387    2.3159
Max.      1.8440       3.9640   5.1001    3.4202
Figure 4: Distribution of immediate regrets found by each algorithm after 200 iterations on Alpine 4D (surrogate models in columns)
          Matérn 5/2   ReLU     Sigmoid   Tanh
Min.      0.0023       0.0005   0.0010    0.0005
1st Qt.   0.0101       0.0071   0.0081    0.0072
Median    0.0101       0.0101   0.0101    0.0101
3rd Qt.   0.0101       0.0204   0.0101    0.0204
Max.      0.0204       0.0220   0.0207    0.0283
Figure 5: Distribution of immediate regrets found by each algorithm after
200 iterations on Branin 5D (surrogate models in columns).
          Matérn 5/2   ReLU     Sigmoid   Tanh
Min.      0.2866       0.2866   0.2866    0.2866
1st Qt.   0.6026       0.6026   0.6026    0.6026
Median    0.6026       0.6026   0.7186    1.1123
3rd Qt.   0.6026       1.1123   1.1123    1.3724
Max.      1.7189       1.9232   1.7189    2.2739
Figure 6: Distribution of immediate regrets found by each algorithm after
200 iterations on Hartmann 6D (surrogate models in columns).
3.4.2. M3 (3003) TIME SERIES FORECASTING COMPETITION

In this example, we use the 3003 series of the M3 forecasting competition (Makridakis et al. (1982) and Makridakis and Hibon (2000)), available in the R package Mcomp (Hyndman (2018)). The forecasting error metric is the Mean Absolute Percentage Error (MAPE).

We compute ensembles of time series forecasting models by using automatic ARIMA and exponential smoothing from Hyndman and Khandakar (2008), and the Theta method from Assimakopoulos and Nikolopoulos (2000). These methods are all implemented in the R package forecast (Hyndman (2015)). We denote by:

- E: the exponential smoothing model, which has an average out-of-sample MAPE of 0.1729 on the 3003 series

- A: the automatic ARIMA model, which has an average out-of-sample MAPE of 0.1882 on the 3003 series

- T: the Theta model, which has an average out-of-sample MAPE of 0.1709 on the 3003 series

- EAT: $\frac{1}{3}$(ets + auto.arima + thetaf), which has an average out-of-sample MAPE of 0.1697 on the 3003 series

- ET := $\alpha$E + $\beta$T, where $\alpha$ and $\beta$ are unknown hyperparameters to be optimized for ET on the 3003 series.
Bayesian Optimization with the Matérn 5/2 and BQRVFL surrogates is repeated 5 times, with 10 points in the initial design and 25 iterations of the algorithm. Ranges of average MAPEs for ET are reported in the following table for the different surrogates; the best values found for $\alpha$ and $\beta$ are 0.4294 and 0.5706.
          Matérn 5/2   ReLU        Sigmoid     Tanh
Min.      0.1670916    0.1670916   0.1670916   0.1670916
Median    0.1670934    0.1670929   0.1670925   0.1670924
Max.      0.1671164    0.1670982   0.1670982   0.1670982
Figure 7: Average MAPEs found by each algorithm on 5 repeats.
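The quantity handed to the Bayesian optimizer in this experiment is simply the average MAPE of the convex combination, which could be sketched as follows (illustrative names; `e_forecasts`, `t_forecasts` and `actuals` would hold the per-series ets forecasts, thetaf forecasts and held-out values):

```python
import numpy as np

def mape(actual, forecast):
    return np.mean(np.abs((actual - forecast) / actual))

def et_objective(params, e_forecasts, t_forecasts, actuals):
    """Average MAPE of the combination alpha * E + beta * T over all series."""
    alpha, beta = params
    return np.mean([mape(a, alpha * e + beta * t)
                    for e, t, a in zip(e_forecasts, t_forecasts, actuals)])
```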
References
V Assimakopoulos and K Nikolopoulos. The theta model: a decomposition ap-
proach to forecasting. International journal of forecasting, 16(4):521-530, 2000.
Christopher M Bishop. Pattern recognition and machine learning. Springer, 2006.
Satchidananda Dehuri and Sung-Bae Cho. A comprehensive survey on
functional link neural networks and an adaptive pso-bp learning for cflnn.
Neural Computing and Applications, 19(2):187-205, 2010.
Jeremy Ferwerda, Jens Hainmueller, and Chad J Hazlett. Kernel-based reg-
ularized least squares in r (krls) and stata (krls). Journal of Statistical Software,
79(3):1-26, 2017.
Gene H Golub, Michael Heath, and Grace Wahba. Generalized cross-validation
as a method for choosing a good ridge parameter. Technometrics, 21(2):215-
223, 1979.
J Harold, G Kushner, and George Yin. Stochastic approximation and recursive algorithms and applications. Applications of Mathematics, 35(10), 1997.
John A Hartigan and Manchek A Wong. Algorithm as 136: A k-means
clustering algorithm. Journal of the royal statistical society. series c (applied
statistics), 28(1):100-108, 1979.
Arthur E Hoerl and Robert W Kennard. Ridge regression: Biased estima-
tion for nonorthogonal problems. Technometrics, 12(1):55-67, 1970.
Rob Hyndman. Mcomp: Data from the M-Competitions, 2018. URL https://CRAN.R-project.org/package=Mcomp. R package version 2.8.
Rob J Hyndman. forecast: Forecasting functions for time series and linear models, 2015. R package version 6.2.
Rob J Hyndman and Yeasmin Khandakar. Automatic time series forecast-
ing: The forecast package for r. Journal of Statistical Software, 27(3):1-22, 2008.
Donald R Jones. A taxonomy of global optimization methods based on
response surfaces. Journal of global optimization, 21(4):345-383, 2001.
James W Longley. An appraisal of least squares programs for the electronic
computer from the point of view of the user. Journal of the American Statistical
association, 62(319):819-841, 1967.
Spyros Makridakis and Michele Hibon. The m3-competition: results, con-
clusions and implications. International journal of forecasting, 16(4):451-476,
2000.
Spyros Makridakis, Allan Andersen, Roberto Carbone, Robert Fildes, Michele
Hibon, Robin Lewandowski, John Newton, Emanuel Parzen, and Robert Win-
kler. The accuracy of extrapolation (time series) methods: Results of a forecast-
ing competition. Journal of forecasting, 1(2):111-153, 1982.
Jonas Mockus, Vytautas Tiesis, and Antanas Zilinskas. Toward global optimization, volume 2, chapter Bayesian methods for seeking the extremum. 1978.
T. Moudiki, Frédéric Planchet, and Areski Cousin. Multiple time series
forecasting using quasi-randomized functional link neural networks. Risks,
6(1):22, 2018.
Harald Niederreiter. Random number generation and quasi-Monte Carlo
methods. SIAM, 1992.
Yoh-Han Pao, Gwo-Hshiung Park, and Dusan J Sobajic. Learning and gen-
eralization characteristics of the random vector functional-link net. Neurocom-
puting, 6(2):163-180, 1994.
Victor Picheny, Tobias Wagner, and David Ginsbourger. A benchmark of kriging-based infill criteria for noisy optimization. Structural and Multidisciplinary Optimization, 48(3):607-626, 2013.
Carl Edward Rasmussen and Christopher KI Williams. Gaussian processes for machine learning. MIT Press, 2006.
Werner F Schmidt, Martin A Kraaijveld, and Robert PW Duin. Feedforward
neural networks with random weights. In Pattern Recognition, 1992. Vol. II.
Conference B: Pattern Recognition Methodology and Systems, Proceedings.,
11th IAPR International Conference on, pages 1-4. IEEE, 1992.
Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical bayesian op-
timization of machine learning algorithms. In Advances in neural information
processing systems, pages 2951-2959, 2012.
Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Mostofa Patwary, Mr Prabhat, and Ryan Adams. Scalable bayesian optimization using deep neural networks. In International conference on machine learning, pages 2171-2180. PMLR, 2015.
Jost Tobias Springenberg, Aaron Klein, Stefan Falkner, and Frank Hutter. Bayesian optimization with robust bayesian neural networks. Advances in neural information processing systems, 29, 2016.