LSBoost, gradient boosted penalized
nonlinear least squares
T. Moudiki
https://thierrymoudiki.github.io
21st November 2020
Contents
1 Introduction
2 Algorithm description
3 Numerical examples
Abstract
The LSBoost model presented in this document is a gradient boosting
Statistical/Machine Learning procedure; a close cousin of the LS Boost
described in Friedman (2001). LSBoost’s specificity resides in its usage of
randomized neural networks as base learners.
1 Introduction
The LSBoost model presented in this document is a gradient boosted Statistical/Machine Learning (ML hereafter) algorithm; a close cousin of the LS Boost from Friedman (2001). LSBoost's specificity resides in its usage of randomized neural networks. Several examples employing LSBoost, both in Python and R, can be found in:
• LSBoost: Explainable ’AI’ using Gradient Boosted randomized networks (with examples in R and Python)
• Explainable ’AI’ using Gradient Boosted randomized networks Pt2 (the Lasso)
Section 2 describes the LSBoost algorithm in more detail, and Section 3 contains some numerical examples.
2 Algorithm description
Let y ∈ R^n be the centered response variable of interest, and x_i ∈ R^p the standardized explanatory variables for the ith observation. We are interested in characterizing E[y_i | x_i]. That is what LS Boost does; the algorithm is presented in (Fig. 1):
Figure 1: LS Boost algorithm from Friedman (2001)
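For readers without the reference at hand, the LS Boost algorithm depicted in (Fig. 1) can be paraphrased as follows (a summary of Algorithm 2 in Friedman (2001), whose line numbers are referred to below):

1. $F_0(x) = \bar{y}$
2. For $m = 1$ to $M$ do:
3. $\quad \tilde{y}_i = y_i - F_{m-1}(x_i), \quad i = 1, \ldots, N$
4. $\quad (\rho_m, a_m) = \operatorname{argmin}_{a,\, \rho} \sum_{i=1}^{N} \left[ \tilde{y}_i - \rho\, h(x_i; a) \right]^2$
5. $\quad F_m(x) = F_{m-1}(x) + \rho_m\, h(x; a_m)$
6. end For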
The general principles of (Fig. 1) do apply to LSBoost too. However, in the latter, at each boosting iteration m, line 4 of (Fig. 1) is replaced by the following randomized neural networks model (cf. (Fig. 2) below, with y replaced by ỹ, the current model residuals):
Figure 2: LSBoost units (base learners)
That is:

\beta_m = \operatorname{argmin}_{\beta \in \mathbb{R}^{p+L}} \sum_{i=1}^{N} \left[ \tilde{y}_{i,m} - \beta\, h(x_i; w_m) \right]^2 \qquad (1)

subject to

\|\beta\|_2^2 \le s, \quad s > 0 \qquad (2)
w is drawn from a sequence of pseudo-random U([0, 1]) numbers. The current main difference from Moudiki et al. (2018) is the use of pseudo-random numbers here, for diversity (the aim being to have many varied ways of attacking the current residuals ỹ), instead of the deterministic quasirandom Sobol sequences used there. An interesting experiment would be to compare a scrambled Sobol sequence with a deterministic Sobol sequence in this setting.
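To fix ideas, here is a minimal Python sketch of one such base learner, assuming the constraint (2) is handled through its ridge (Lagrangian) form; the function and parameter names (fit_base_learner, n_hidden, reg_lambda) are illustrative only and do not refer to the actual implementation in Moudiki (2020). Keeping the original columns of X alongside the L hidden features is one way of obtaining β ∈ R^{p+L}.

import numpy as np

def fit_base_learner(X, residuals, n_hidden=5, reg_lambda=0.1, seed=123):
    """One LSBoost base learner (Eq. 1): a randomized hidden layer
    followed by a ridge-penalized least squares fit on the residuals."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    # hidden-layer weights drawn from U([0, 1]), as described above
    W = rng.uniform(size=(p, n_hidden))
    # h: elementwise ReLU applied to XW; the original columns of X are
    # kept alongside the hidden features
    H = np.hstack([X, np.maximum(X @ W, 0)])
    # ridge solution: beta = (H'H + lambda I)^(-1) H' residuals
    beta = np.linalg.solve(H.T @ H + reg_lambda * np.eye(H.shape[1]),
                           H.T @ residuals)
    return W, beta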
In addition:

• h applies an activation function to XW, currently an elementwise ReLU, x ↦ max(0, x), in Moudiki (2020)

• rows and columns of X (a matrix containing the x_i's in its rows) can be subsampled to increase the final ensemble's diversity. If no subsampling is applied and a deterministic sequence is employed, the h's in (Eq. 1) would be quite stereotypical: the same function at each boosting iteration
• at line 5 of (Fig. 1), as suggested by Friedman (2001), a learning rate

0 < \nu \le 1 \qquad (3)

is incorporated into LSBoost, so that:

F_m(x) = F_{m-1}(x) + \nu\, \beta_m h(x; w_m) \qquad (4)

The effect of ν is to slow down the learning/gradient descent procedure
• the LSBoost procedure can be stopped early, before m = M is attained, as soon as, for a small, given tolerance parameter η > 0:

\Delta_m \|\tilde{y}\|_2^2 < \eta \qquad (5)

where Δ_m denotes the change observed at boosting iteration m. Early stopping, like the incorporation of ν in (Fig. 1), prevents overfitting from occurring. Plus, when applicable, it can substantially reduce the computational burden of looping until M (cf. examples in Section 3)
• dropout (Srivastava et al. (2014)) can be utilized when computing h, as another way to combat overfitting and to increase the ensemble's diversity

• the least squares minimization at line 4 of (Fig. 1) is penalized, leading currently in Moudiki (2020) to ridge regression (Hoerl and Kennard (1970)) or lasso-based (Tibshirani (1996)) solutions to the gradient boosting problem (a sketch of the resulting boosting loop is given after this list)

• when LSBoost deals with a classification problem, the response y ∈ N^n is one-hot encoded, and the problem is solved as multiple regression problems on class probabilities (as many as the total number of classes)

• as in Moudiki (2019) (Proposition 2.1), LSBoost is interpretable, especially when the activation function embedded in h is chosen to be differentiable
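Putting these components together, here is a minimal sketch of the resulting LSBoost regression loop, under simplifying assumptions: ridge-penalized base learners as in (Eq. 1)-(Eq. 2), no subsampling, no dropout, and an early stopping rule based on the improvement of the squared residual norm. Function and parameter names are illustrative and do not correspond to the implementation in Moudiki (2020).

import numpy as np

def fit_lsboost(X, y, n_estimators=100, learning_rate=0.1,
                n_hidden=5, reg_lambda=0.1, tolerance=1e-4, seed=123):
    """Sketch of the LSBoost regression loop: F_0 = mean(y); at each
    iteration a randomized, ridge-penalized base learner is fit to the
    current residuals and added with learning rate nu (Eq. 4); the loop
    stops early when the squared residual norm stops improving (Eq. 5)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    F = np.full(n, y.mean())                       # F_0(x), line 1 of (Fig. 1)
    learners, prev_norm = [], None
    for m in range(n_estimators):
        resid = y - F                              # current residuals, y tilde
        W = rng.uniform(size=(p, n_hidden))        # w drawn from U([0, 1])
        H = np.hstack([X, np.maximum(X @ W, 0)])   # h: ReLU on XW, plus X
        beta = np.linalg.solve(H.T @ H + reg_lambda * np.eye(H.shape[1]),
                               H.T @ resid)        # ridge solution of (Eq. 1)
        F = F + learning_rate * H @ beta           # update (Eq. 4)
        learners.append((W, beta))
        new_norm = np.sum((y - F) ** 2)
        # early stopping (Eq. 5): improvement of ||y tilde||^2 below tolerance
        if prev_norm is not None and prev_norm - new_norm < tolerance:
            break
        prev_norm = new_norm
    return y.mean(), learners

def predict_lsboost(X, intercept, learners, learning_rate=0.1):
    """Evaluate F_M(x); learning_rate must match the one used for fitting."""
    F = np.full(X.shape[0], intercept)
    for W, beta in learners:
        H = np.hstack([X, np.maximum(X @ W, 0)])
        F = F + learning_rate * H @ beta
    return F

For classification, as noted above, one would one-hot encode y and run this loop once per class, on class probabilities.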
3 Numerical examples
The examples presented in this section are based on a breast cancer classification dataset, first studied by Street et al. (1993). They can be found in the project's repository, as a notebook.
For this experiment, the breast cancer dataset is split into training/testing (80%/20%) sets. The ensemble's base learners are ridge regression models applied through h (cf. (Eq. 1)), i.e., randomized neural networks. All the hyperparameters are set to their default values, except for:

• tolerance, related to (Eq. 5); that is, η

• learning_rate, related to (Eq. 3) and (Eq. 4); that is, ν

• n_estimators, related to (Fig. 1); that is, M
These are the hyperparameters whose properties are of interest in this document. On the training set,

\|\tilde{y}_m\|_2 \qquad (6)

is examined. It is supposed to be a decreasing function of the number of boosting iterations M, if the new LSBoost has been implemented correctly in Moudiki (2020).
Conversely, on the testing set, nothing much can be said a priori, as the three hyperparameters can be configured in many different ways, with an influence that typically depends on the dataset at hand.
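For reference, the experiment can be set up along the following lines in Python. This is a sketch only: it assumes that the mlsauce package (Moudiki 2020) exposes an LSBoostClassifier estimator whose constructor arguments match the hyperparameter names listed above; the class name, the argument names, and the predict_proba method are assumptions, not guaranteed to match the released API.

import mlsauce as ms  # Moudiki (2020); names used below are assumed
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# breast cancer dataset, 80%/20% training/testing split
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123)

# default hyperparameters, except tolerance (eta), learning_rate (nu)
# and n_estimators (M); values below correspond to one of the four cases studied
clf = ms.LSBoostClassifier(n_estimators=100, learning_rate=0.7, tolerance=0.1)
clf.fit(X_train, y_train)

preds = clf.predict(X_test)
print("Out-of-sample accuracy:", accuracy_score(y_test, preds))
# assumes a scikit-learn-style predict_proba for computing the AUC
print("Out-of-sample AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))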
Looking at the graphs in (Fig. 3), we can observe that ||ỹ_m||_2 is indeed, all else being equal, a monotonically decreasing function of the number of boosting iterations: the criterion (Eq. 5) is not totally absurd.

In the first row of this figure, there is no early stopping (tolerance = 0); LSBoost iterates until 100, the default budget of boosting iterations. The learning rate ν is then increased from 0.1 to 0.7: the fitting procedure is accelerated, and ||ỹ_m||_2 can converge to lower values (that's overfitting).
Figure 3: ||ỹ_m||_2 as a function of the number of boosting iterations
At the bottom, still in (Fig. 3), I consider η = 0.1 (cf. (Eq. 5)), and an increase of ν from 0.1 to 0.7. The first graph on the left shows that 17 boosting iterations (4 for the graph at the bottom right) were necessary before the early stopping criterion was met. This is pretty much the expected behavior. Here are the results for each one of these 4 cases (out-of-sample accuracy and area under the curve):
Sub-plot   Out-of-sample accuracy   Out-of-sample AUC
(1, 1)     95.61%                   95.51%
(1, 2)     92.98%                   93.45%
(2, 1)     97.37%                   96.88%
(2, 2)     97.37%                   96.88%
Subplot (2, 2), with η = 0.1 and ν = 0.7, is the clear winner of this mini-contest: highest accuracy, highest area under the curve, and only 4 boosting iterations required to achieve this performance.
References
Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics, pages 1189–1232.

Hoerl, A. E. and Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67.

Moudiki, T. (2019). Online Bayesian quasi-random functional link networks; application to the optimization of black box functions.

Moudiki, T. (2019–2020). mlsauce, Miscellaneous Statistical/Machine Learning stuff. https://github.com/thierrymoudiki/mlsauce. BSD 3-Clause Clear License. Version 0.7.5.

Moudiki, T., Planchet, F., and Cousin, A. (2018). Multiple time series forecasting using quasi-randomized functional link neural networks. Risks, 6(1):22.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.

Street, W. N., Wolberg, W. H., and Mangasarian, O. L. (1993). Nuclear feature extraction for breast tumor diagnosis. In Biomedical Image Processing and Biomedical Visualization, volume 1905, pages 861–870. International Society for Optics and Photonics.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288.