LSBoost, gradient boosted penalized
nonlinear least squares
T. Moudiki
https://thierrymoudiki.github.io
21st November 2020
Contents
1 Introduction
2 Algorithm description
3 Numerical examples
Abstract
The LSBoost model presented in this document is a gradient boosting Statistical/Machine Learning procedure; a close cousin of the LS Boost described in Friedman (2001). LSBoost’s specificity lies in its use of randomized neural networks as base learners.
1 Introduction
The LSBoost model presented in this document is a gradient boosted Statistical/Machine Learning (ML hereafter) algorithm; a close cousin of the LS Boost from Friedman (2001). LSBoost’s specificity lies in its use of randomized neural networks as base learners. Several examples employing LSBoost, both in Python and R, can be found in:
• LSBoost: Explainable ’AI’ using Gradient Boosted randomized networks (with examples in R and Python)
• Explainable ’AI’ using Gradient Boosted randomized networks Pt2 (the Lasso)
Section 2 describes the LSBoost algorithm in more detail, and section 3 contains some numerical examples.
2 Algorithm description
Let $y \in \mathbb{R}^n$ be the centered response variable piquing our curiosity, and $x_i \in \mathbb{R}^p$ the standardized explanatory variables for the $i$th observation. We are interested in characterizing $E[y_i | x_i]$. That’s what LS Boost does; it is presented in (Fig. 1):
Figure 1: LS Boost algorithm from Friedman (2001)
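For readers without Friedman (2001) at hand, the LS Boost steps in (Fig. 1) (Algorithm 2, LS_Boost, in that paper) can be paraphrased as follows; the line numbers are the ones referenced below:
\[
\begin{aligned}
&1:\; F_0(x) = \bar{y}\\
&2:\; \text{For } m = 1 \text{ to } M \text{ do:}\\
&3:\; \quad \tilde{y}_i = y_i - F_{m-1}(x_i), \quad i = 1, \ldots, N\\
&4:\; \quad (\rho_m, a_m) = \underset{a,\,\rho}{\operatorname{argmin}} \sum_{i=1}^{N} \left[ \tilde{y}_i - \rho\, h(x_i; a) \right]^2\\
&5:\; \quad F_m(x) = F_{m-1}(x) + \rho_m\, h(x; a_m)\\
&6:\; \text{endFor}
\end{aligned}
\]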
The general principles of (Fig. 1) do apply to LSBoost too. However, in the latter, at each boosting iteration $m$, line 4 of (Fig. 1) is replaced by the following randomized neural networks model (cf. (Fig. 2) below, and replacing $y$ by $\tilde{y}$, the current model residuals):
Figure 2: LSBoost units (base learners)
That is:
\[
\beta_m = \underset{\beta \in \mathbb{R}^{p+L}}{\operatorname{argmin}} \; \sum_{i=1}^{N} \left[ \tilde{y}_{i,m} - \beta\, h(x_i; w_m) \right]^2 \tag{1}
\]
subject to
\[
\|\beta\|_2^2 \leq s, \quad s > 0 \tag{2}
\]
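The constrained problem (Eq. 1)–(Eq. 2) can equivalently be written in penalized (Lagrangian) form. Writing $H_m$ for the matrix whose $i$th row is $h(x_i; w_m)$, and $\lambda \geq 0$ for the penalty parameter associated with $s$ (notation introduced here for convenience only), the ridge solution has the usual closed form:
\[
\beta_m = \underset{\beta \in \mathbb{R}^{p+L}}{\operatorname{argmin}} \; \sum_{i=1}^{N} \left[ \tilde{y}_{i,m} - \beta\, h(x_i; w_m) \right]^2 + \lambda \|\beta\|_2^2 \;=\; \left( H_m^T H_m + \lambda I \right)^{-1} H_m^T \tilde{y}_m
\]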
$w$ is drawn from a sequence of pseudo-random $U([0, 1])$ numbers. The current main difference from Moudiki et al. (2018) is the use of pseudo-random numbers here, for diversity (I wanted a lot of varied ways to attack the current residuals $\tilde{y}$), instead of the deterministic quasirandom Sobol sequences used there. An interesting experiment would be to try a scrambled Sobol sequence here, versus a deterministic Sobol sequence.
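As an illustration of (Eq. 1) and (Eq. 2) in their penalized form, the sketch below draws a random matrix $W$ from $U([0, 1])$, builds the randomized hidden features, and solves one ridge problem on the current residuals. This is a minimal sketch in Python, assuming (as in the RVFL networks of Moudiki et al. (2018), and as detailed in the list below) that the original features are concatenated with the ReLU-transformed random projection; names and defaults are illustrative, not those of Moudiki (2020).

import numpy as np

def fit_base_learner(X, residuals, n_hidden=5, reg_lambda=0.1, seed=0):
    # One LSBoost base learner (sketch): random projection + ReLU, then ridge.
    # Assumption: h(x; w) = [x, max(0, x W)], of dimension p + L.
    rng = np.random.default_rng(seed)
    W = rng.uniform(0.0, 1.0, size=(X.shape[1], n_hidden))  # w ~ U([0, 1])
    H = np.hstack([X, np.maximum(X @ W, 0.0)])              # h(x_i; w_m) in rows
    # Ridge solution of (Eq. 1)-(Eq. 2) in penalized form
    beta = np.linalg.solve(H.T @ H + reg_lambda * np.eye(H.shape[1]),
                           H.T @ residuals)
    return W, beta

# Toy usage on synthetic data
X = np.random.randn(50, 3)
y = np.sin(X[:, 0]) + 0.1 * np.random.randn(50)
W, beta = fit_base_learner(X, y - y.mean())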
In addition:
• $h$ applies an activation function, currently in Moudiki (2020) an element-wise ReLU, $x \mapsto \max(0, x)$, to $XW$
• rows and columns of $X$, a matrix containing the $x_i$’s in its rows, can be subsampled to increase the final ensemble’s diversity. If no subsampling is applied and a deterministic sequence is employed, we can see how stereotypical the $h$’s in (Eq. 1) could be (the $h$’s being the same function at each boosting iteration).
• at line 5 of (Fig. 1), as suggested by Friedman (2001), a learning rate
\[
0 < \nu \leq 1 \tag{3}
\]
is incorporated into LSBoost, so that:
\[
F_m(x) = F_{m-1}(x) + \nu\, \beta_m h(x; w) \tag{4}
\]
The effect of $\nu$ is to slow down the learning/gradient descent procedure.
• the LSBoost procedure can be stopped early, before $m = M$ is attained, and as soon as we have, for a small, given tolerance parameter $\eta > 0$:
\[
\|\tilde{y}_m\|_2^2 < \eta \tag{5}
\]
Early stopping, like incorporating $\nu$ in (Fig. 1), prevents overfitting from occurring. Plus, when applicable, it can substantially reduce the computational burden of looping until $M$ (cf. examples in section 3, and the sketch after this list).
• dropout (Srivastava et al. (2014)) can be utilized for computing $h$, as another way to combat overfitting and increase the ensemble’s diversity
• the least squares minimization at line 4 in (Fig. 1) is penalized, leading currently in Moudiki (2020) to ridge regression (Hoerl and Kennard (1970)) or lasso-based (Tibshirani (1996)) solutions to the gradient boosting problem
• when LSBoost deals with a classification problem, the response $y \in \mathbb{N}^n$ is one-hot encoded, and the problem is solved as multiple (as many as the total number of classes) regression problems on class probabilities.
• as in Moudiki (2019) (Proposition 2.1), LSBoost is interpretable, especially when the activation function embedded in $h$ is chosen to be a differentiable one
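Putting the pieces of this list together, a bare-bones regression version of the LSBoost loop could look like the following sketch (a toy illustration, not the implementation of Moudiki (2020)): the update implements (Eq. 4), and the loop exits early when the criterion (Eq. 5) is met.

import numpy as np

def relu_features(X, W):
    # h(x; w): inputs concatenated with ReLU(X W) (RVFL-style assumption)
    return np.hstack([X, np.maximum(X @ W, 0.0)])

def lsboost_fit(X, y, n_estimators=100, learning_rate=0.1,
                n_hidden=5, reg_lambda=0.1, tolerance=1e-4, seed=0):
    # Toy LSBoost: gradient boosting with ridge-penalized randomized base learners
    rng = np.random.default_rng(seed)
    f0 = y.mean()
    F = np.full(len(y), f0)                      # F_0(x)
    learners = []
    for m in range(n_estimators):
        residuals = y - F                        # current residuals, tilde{y}_m
        if np.sum(residuals ** 2) < tolerance:   # early stopping, (Eq. 5)
            break
        W = rng.uniform(0.0, 1.0, size=(X.shape[1], n_hidden))
        H = relu_features(X, W)
        beta = np.linalg.solve(H.T @ H + reg_lambda * np.eye(H.shape[1]),
                               H.T @ residuals)  # (Eq. 1)-(Eq. 2), ridge form
        F = F + learning_rate * (H @ beta)       # (Eq. 4), 0 < nu <= 1
        learners.append((W, beta))
    return f0, learners

def lsboost_predict(f0, learners, X, learning_rate=0.1):
    F = np.full(X.shape[0], f0)
    for W, beta in learners:
        F = F + learning_rate * (relu_features(X, W) @ beta)
    return F

# Toy usage on synthetic data
X = np.random.randn(200, 3)
y = np.sin(X[:, 0]) + 0.1 * np.random.randn(200)
f0, learners = lsboost_fit(X, y)
print(np.mean((y - lsboost_predict(f0, learners, X)) ** 2))

For classification, as described above, the same loop would be run on the one-hot encoded class indicators, one regression per class.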
3 Numerical examples
The examples presented in this section are based on a breast cancer classification dataset, first studied by Street et al. (1993). They can be found in the project’s repository, as a notebook.
For this experiment, the breast cancer dataset is split into training/testing (80%/20%) sets. The ensemble’s base learners are Ridge regression models applied through $h$ (cf. (Eq. 1)), as randomized neural networks. All the hyperparameters are set to their default values, except for:
• the tolerance related to (Eq. 5); that is, $\eta$
• the learning rate related to (Eq. 3) and (Eq. 4); that is, $\nu$
• n_estimators related to (Fig. 1); that is, $M$
These are the hyperparameters whose properties are of interest in this document. On the training set,
\[
\|\tilde{y}_m\|_2 \tag{6}
\]
is examined. It is supposed to be a decreasing function of the number of boosting iterations $M$, if the new LSBoost has been implemented correctly in Moudiki (2020).
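For reference, the kind of setup used in these examples might resemble the sketch below. It assumes that mlsauce (Moudiki (2020)) exposes an LSBoostClassifier with scikit-learn-style fit/predict methods and hyperparameters named n_estimators, learning_rate and tolerance; the exact class and argument names should be checked against the notebook in the project’s repository.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score
import mlsauce as ms

# Breast cancer data (Street et al. (1993)), 80%/20% train/test split
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123)

# Hypothetical API: hyperparameter names follow the ones discussed above
clf = ms.LSBoostClassifier(n_estimators=100,   # M in (Fig. 1)
                           learning_rate=0.1,  # nu in (Eq. 3)-(Eq. 4)
                           tolerance=1e-4)     # eta in (Eq. 5)
clf.fit(X_train, y_train)

preds = clf.predict(X_test)
print("Out-of-sample accuracy:", accuracy_score(y_test, preds))
print("Out-of-sample AUC:", roc_auc_score(y_test, preds))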
Conversely, on the testing set, nothing much can be said a priori, as the three hyperparameters will typically have many different configurations and influences, depending on the dataset at hand.
Looking at the graphs in (Fig. 3), we can observe that $\|\tilde{y}_m\|_2$ is indeed, all else being equal, a monotonically decreasing function of the number of boosting iterations: the criterion (Eq. 5) is not totally absurd.
In the first row of this figure, there is no early stopping (tolerance = 0); LSBoost will iterate until 100, the default budget for boosting iterations. The learning rate $\nu$ is then increased from 0.1 to 0.7: the fitting procedure is accelerated, and $\|\tilde{y}_m\|_2$ can converge to lower values (that’s overfitting).
Figure 3: $\|\tilde{y}_m\|_2$ as a function of the number of boosting iterations
At the bottom, still in (Fig. 3), I consider $\eta = 0.1$ (cf. (Eq. 5)), and an increase of $\nu$ from 0.1 to 0.7. The first graph on the left shows that 17 boosting iterations (4 for the graph at the bottom right) were necessary before meeting the early stopping criterion. This sounds pretty much like the expected behavior. Here are the results for each one of these 4 cases (out-of-sample accuracy and area under the curve):
Sub-plot    Out-of-sample Accuracy    Out-of-sample AUC
11          95.61%                    95.51%
12          92.98%                    93.45%
21          97.37%                    96.88%
22          97.37%                    96.88%
Subplot 22, with $\eta = 0.1$ and $\nu = 0.7$, is the clear winner of this mini-contest: highest accuracy, highest area under the curve, and only 4 boosting iterations required to achieve this superior performance.
References
Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics, pages 1189–1232.
Hoerl, A. E. and Kennard, R. W. (1970). Ridge regression: Biased estimation
for nonorthogonal problems. Technometrics, 12(1):55–67.
Moudiki, T. (2019). Online bayesian quasi-random functional link networks;
application to the optimization of black box functions.
Moudiki, T. (2019–2020). mlsauce, Miscellaneous Statistical/Machine Learn-
ing stuff. https://github.com/thierrymoudiki/mlsauce. BSD 3-Clause
Clear License. Version 0.7.5.
Moudiki, T., Planchet, F., and Cousin, A. (2018). Multiple time series forecast-
ing using quasi-randomized functional link neural networks. Risks, 6(1):22.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958.
Street, W. N., Wolberg, W. H., and Mangasarian, O. L. (1993). Nuclear feature
extraction for breast tumor diagnosis. In Biomedical image processing and
biomedical visualization, volume 1905, pages 861–870. International Society
for Optics and Photonics.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal
of the Royal Statistical Society: Series B (Methodological), 58(1):267–288.