Sequential Training of Neural Networks

with Gradient Boosting

SEYEDSAMAN EMAMI, GONZALO MARTÍNEZ-MUÑOZ

Escuela Politécnica Superior, Universidad Autónoma de Madrid, Francisco Tomás y Valiente, 11, 28049 Madrid, Spain (e-mail: emami.seyedsaman@uam.es)

Corresponding author: Seyedsaman Emami (e-mail: emami.seyedsaman@uam.es).

This work was supported by PID2019-106827GB-I00/AEI/10.13039/501100011033

ABSTRACT This paper presents a novel technique based on gradient boosting to train the final layers of a neural network (NN). Gradient boosting is an additive expansion algorithm in which a series of models are trained sequentially to approximate a given function. A neural network can also be seen as an additive expansion in which the scalar product of the responses of the last hidden layer and its weights provides the final output of the network. Instead of training the network as a whole, the proposed algorithm trains the network sequentially in T steps. First, the bias term of the network is initialized with a constant approximation that minimizes the average loss of the data. Then, at each step, a portion of the network, composed of J neurons, is trained to approximate the pseudo-residuals on the training data computed from the previous iterations. Finally, the T partial models and bias are integrated as a single NN with T × J neurons in the hidden layer. Extensive experiments in classification and regression tasks, as well as in combination with deep neural networks, are carried out, showing a competitive generalization performance with respect to neural networks trained with different standard solvers, such as Adam, L-BFGS and SGD, and with respect to deep models. Furthermore, we show that the design of the proposed method permits switching off a number of hidden units at test time (the units that were trained last) without a significant reduction of its generalization ability. This allows the model to be adapted to different classification speed requirements on the fly.

INDEX TERMS Gradient Boosting, Neural Network

I. INTRODUCTION

Machine learning is becoming a fundamental piece for the

success of more and more applications every day. Some

examples of novel applications include bioactive molecule

prediction [1], renewable energy prediction [2], classiﬁcation

of galactic sources [3], or mapping soil contamination in agricultural areas [4]. It is of capital importance to find algo-

rithms that can efﬁciently handle complex data. Ensemble

methods are very effective at improving the generalization

accuracy of multiple simple models [5], [6] or even complex

models such as MLPs [7] or DeepCNNs [8].

In recent years, gradient boosting [9], [10], a fairly old technique, has gained much attention, especially due to the novel and computationally efficient version of gradient boost-

ing called eXtreme Gradient Boosting or XGBoost [11].

Gradient boosting builds a model as an additive expansion of

regressors to gradually minimize a given loss function. When

gradient boosting is combined with several stochastic techniques, such as bootstrapping or the feature sampling of random forests [12], its performance generally improves [13]. In fact,

this combination of randomization techniques and optimiza-

tion has placed XGBoost among the top contenders in Kaggle

competitions [11] and provides excellent performance in

a variety of applications, such as the ones mentioned above.

Based on the success of XGBoost, other techniques have

been proposed like CatBoost [14] and LightGBM [15], which

propose improvements in training speed and generalization

performance. More details about these methods can be seen

in the comparative analysis of Bentéjac et al. [16]. Another widespread type of boosting algorithm is AdaBoost [17], initially

developed for binary classiﬁcation and then for multi-class

classiﬁcation (AdaBoost-SAMME) [18] and regression [19].

On the other hand, convolutional deep architectures have

shown outstanding performances especially with structured

data such as images, speech, etc. [20], [21]. However, in the

context of tabular data, ensembles of classiﬁers or simple

MLPs are generally more effective than convolutional deep

neural networks [22]. In [23], the performance of Deep

Neural Networks on tabular data is compared with traditional machine learning methods, such as XGBoost. The study

shows that XGBoost outperforms deep models on the ana-

lyzed datasets and provides interesting insights to consider


when choosing a model for real-life applications, including

the model performance, computational inference cost, hy-

perparameter optimization time, and so on. The objective

of our study is to combine the stage-wise optimization of

gradient boosting into the training procedure of the last layers

of a neural network. The result of the proposed algorithm

is an alternative for training a single neural network (not an

ensemble of networks) and is specially suited for tabular data.

Several related studies propose hybrid algorithms that, for

instance, transform a decision forest into a single neural net-

work [24], [25] or that use a deep architecture to train a tree

forest [26]. In [24], it is shown that a pre-trained tree forest

can be cast into a two-layer neural network with the same

predictive outputs. First, each tree is converted into a neural

network. To do so, each split in the tree is transformed into an

individual neuron that is connected to a single input attribute

(split attribute) and whose activation threshold is set to the

split threshold. In this way, and by a proper combination

of the outputs of these neurons (splits) the network mimics

the behavior of the decision tree. Finally, all neurons are

combined through a second layer, which recovers the forest

decision. The weights of this network can be later retrained

to obtain further improvements [25]. In [26], a decision forest

is trained jointly by means of a deep neural network that

learns all splits of all trees of the forest. To guide the network

to learn the splits of the trees, a procedure that trains the

trees using back-propagation, is proposed. The ﬁnal output

of the algorithm is a decision forest whose performance is

remarkable in image classiﬁcation tasks.

In another related line of work [27], [28], boosting is ap-

plied to the construction of Deep Residual Learning models

[29]. In [27], a novel ResNet weight estimation model is

proposed by generalizing the boosting functional gradient

minimization [30] to the feature extraction space of the

network. The work presented in [28] also builds a ResNet layer by layer by boosting over features; however, it is based on a different boosting framework [17]. The model (called

BoostResNet) works by learning a linear classiﬁer on the

output of each residual network block to build an ensemble

of shallow blocks. One important advantage of BoostResNet

over standard ResNet is its lower computational complexity

although the reported performance is not consistently better

with respect to ResNet. In contrast, the method proposed here, based on [9], builds a simple shallow network in width rather than a complex model in depth and shows very

good performance in tabular datasets with respect to stan-

dard back-propagation training methods. Furthermore, our

proposal can adapt on the ﬂy to the use of a reduced number

of hidden neurons.

Another model that resembles the idea proposed in this

paper –yet with a different optimization process, ﬁnal model

and objective– was presented in [31]. They propose a convex

optimization algorithm for training a neural network that

theoretically could reach the global optimum, although its

exact implementation is only feasible for a very low number

of input features. In order to reach the global optimum they

control the number of hidden neurons of the model by adding

one neuron at a time to the network and by including an L1

regularization on the top layer. The proposed idea is a step-

wise algorithm as all weights of the network are optimized at

each iteration. This is done in three optimization steps. First,

a new neuron (i.e. linear model) is added and trained on a

weighted loss function similarly to Adaboost. This weighted

loss can only be solved exactly for a very low number of

input features. Then, the output layer, and potentially all input

weights, are optimized using the proposed convex formula-

tion. Finally, the output weights are regularized to reduce the

complexity of the network. This ﬁnal step sets to zero some

of the output weights to effectively remove the corresponding

neurons. The algorithm is tested on one simple 2D problem in

order to assess the validity of the global optimum approach.

In this paper, we propose a combination of ensembles and

neural networks that is, in a sense, complementary to the work of [26]: a single neural network that is trained using

an ensemble training algorithm. Speciﬁcally, we propose to

train a neural network iteratively as an additive expansion

of simpler models. The algorithm is equivalent to gradient

boosting: ﬁrst, a constant approximation is computed (as-

signed to the bias term of the neural network), then at each

step, a regression neural network with a single (or very few)

neuron(s) in the hidden layer is trained to ﬁt the residuals

of the previous models. All these models are then combined

to form a single neural network with one hidden layer.

This training procedure provides an alternative to standard

training solvers (such as Adam, L-BFGS or SGD) for training

a neural network. Other works related to the optimization

and convergence of Adam have been recently proposed [32]–

[34]. They showed that its convergence can be improved in the context of high-dimensional, complex image classification tasks. However, their focus is mainly on deep models for image classification tasks.

In addition, the proposed method has an additive neural

architecture in which the latest computed neurons contribute

less to the ﬁnal decision. This can be useful in computation-

ally intensive applications, as the number of active models (or neurons) can be adjusted on the fly to the available computational resources without a significant loss in generalization

accuracy. The proposed model is tested on multiple classi-

ﬁcation and regression problems, as well as in conjunction

with deep models posed as transfer learning problems. These

experiments show that the proposed method for training the

last layers of a neural network is a good alternative to other

standard methods.

The paper is organized as follows: Section II describes

gradient boosting and how to apply it to train a single neural

network; in Section III the results of several experimental analyses are shown; finally, the conclusions are summarized

in the last section.

II. METHODOLOGY

In this section, we present the gradient boosting mathematical framework [9] and the modifications applied in order to use


it for training a neural network sequentially. The proposed

algorithm is valid for multi-class and binary classiﬁcation,

and for regression. Finally, an illustrative example is given.

A. GRADIENT BOOSTING

Given a training dataset $D = \{x_i, y_i\}_1^N$, the goal of machine learning algorithms is to find an approximation, $\hat{F}(x)$, of the objective function $F^*(x)$, which maps instances $x$ to their output values $y$. In general, the learning process can be posed as an optimization problem in which the expected value of a given loss function, $E[L(y, F(x))]$, is minimized. A data-based estimate can be used to approximate this expected loss: $\sum_{i=1}^{N} L(y_i, F(x_i))$.

In the specific case of gradient boosting, the model is built using an additive expansion
$$F_t(x) = F_{t-1}(x) + \rho_t h_t(x), \qquad (1)$$
where $\rho_t$ is the weight of the $t$th function, $h_t(x)$. The approximation is constructed stage-wise in the sense that at each step, a new model $h_t$ is built without modifying any of the previously created models included in $F_{t-1}(x)$. First, the additive expansion is initialized with a constant approximation
$$F_0(x) = \operatorname*{argmin}_{\alpha} \sum_{i=1}^{N} L(y_i, \alpha) \qquad (2)$$
and the following models are built in order to minimize
$$(\rho_t, h_t(x)) = \operatorname*{argmin}_{\rho, h_t} \sum_{i=1}^{N} L\left(y_i, F_{t-1}(x_i) + \rho h_t(x_i)\right). \qquad (3)$$

However, instead of jointly solving the optimization for $\rho$ and $h_t$, the problem is split into two steps. First, each model $h_t$ is trained to learn the data-based gradient vector of the loss function. For that, each model $h_t$ is trained on a new dataset $D = \{x_i, r_{ti}\}_{i=1}^{N}$, where the pseudo-residuals, $r_{ti}$, are the negative gradient of the loss function at $F_{t-1}(x_i)$
$$r_{ti} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x) = F_{t-1}(x)}. \qquad (4)$$
The function $h_t$ is expected to output values close to the pseudo-residuals at the given data points, which are parallel to the gradient of $L$ at $F_{t-1}(x)$. Note, however, that the training process of $h$ is generally guided by a square-error loss, which may be different from the given objective loss function. Notwithstanding, the value of $\rho_t$ is subsequently computed by solving a line search optimization problem on the given loss function
$$\rho_t = \operatorname*{argmin}_{\rho} \sum_{i=1}^{N} L\left(y_i, F_{t-1}(x_i) + \rho h_t(x_i)\right). \qquad (5)$$

The original formulation of gradient boosting (as given

in [9]) is, in some ﬁnal derivations, only valid for decision

trees. Here, we present an extension of the formulation of

gradient boosting to be able to use any possible regressor as

base model and we describe how to integrate this process to

train a single neural network, which is the focus of the paper.
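As a rough illustration of this two-step scheme with an arbitrary base regressor, the following sketch (our own simplified example, not the paper's implementation) performs one boosting round using the squared-error loss, for which the pseudo-residuals reduce to $y - F_{t-1}(x)$ up to a constant factor:

import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.tree import DecisionTreeRegressor  # any regressor could be used here

def boosting_round(X, y, F_prev, base_regressor=None):
    # Squared-error loss L(y, F) = (y - F)^2 is used only as an example.
    base_regressor = base_regressor or DecisionTreeRegressor(max_depth=2)
    residuals = y - F_prev                   # Eq. (4): negative gradient (up to a factor)
    h = base_regressor.fit(X, residuals)     # h_t is fitted to the pseudo-residuals
    h_x = h.predict(X)
    # Eq. (5): one-dimensional line search for the step weight rho_t
    rho = minimize_scalar(lambda r: np.mean((y - (F_prev + r * h_x)) ** 2)).x
    return F_prev + rho * h_x, h, rho        # Eq. (1): updated additive expansion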

B. BINARY CLASSIFICATION

For binary classification, in which $y \in \{-1, 1\}$, we will consider the logistic loss
$$L(y, F(\cdot)) = \ln\left(1 + \exp\left(-2yF(\cdot)\right)\right), \qquad (6)$$
which is optimized by the logit function $F(x) = \frac{1}{2}\ln\frac{p(y=1|x)}{p(y=-1|x)}$. For this loss function the constant approximation of Eq. 2 is given by
$$F_0 = \operatorname*{argmin}_{\alpha} \sum_{i=1}^{N} \ln\left(1 + \exp(-2y_i\alpha)\right) = \frac{1}{2}\ln\frac{p(y=1)}{p(y=-1)} = \frac{1}{2}\ln\frac{1+\bar{y}}{1-\bar{y}}, \qquad (7)$$
where $\bar{y}$ is the mean value of the class labels $y_i$. The pseudo-residuals given by Eq. 4, on which the model $h_t$ is trained for the logistic loss, can be calculated as
$$r_{ti} = 2y_i / \left(1 + \exp\left(2y_iF_{t-1}(x_i)\right)\right). \qquad (8)$$
Once $h_t$ is built, the value of $\rho_t$ is computed using Eq. 5 by minimizing
$$f(\rho) = \sum_{i=1}^{N} \ln\left(1 + \exp\left(-2y_i\left(F_{t-1}(x_i) + \rho h_t(x_i)\right)\right)\right). \qquad (9)$$
There is no closed-form solution for this equation. However, the value of $\rho$ can be approximated by a single Newton-Raphson step
$$\rho_t \approx -\frac{f'(\rho=0)}{f''(\rho=0)} = \frac{\sum_{i=1}^{N} r_{ti}\, h_t(x_i)}{\sum_{i=1}^{N} r_{ti}\,(2y_i - r_{ti})\, h_t^2(x_i)}. \qquad (10)$$
This equation is valid for any base regressor and not only for decision trees as in the general gradient boosting framework [9]. The formulation in [9] is adapted to the fact that decision trees can be seen as piecewise-constant additive models, which allows for the use of a different $\rho$ for each tree leaf.

Finally, the output for binary classification of gradient boosting composed of $T$ models for instance $x$ is given by the probability of $y=1|x$
$$p(y=1|x) = 1/\left(1 + \exp\left(-2F_T(x)\right)\right). \qquad (11)$$
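A minimal numerical sketch (ours) of the binary-classification quantities above; $y$ takes values in $\{-1, +1\}$ and all arrays are NumPy vectors over the training set:

import numpy as np

def f0(y):
    # Eq. (7): constant initialization from the mean label value
    y_bar = y.mean()
    return 0.5 * np.log((1 + y_bar) / (1 - y_bar))

def pseudo_residuals(y, F_prev):
    # Eq. (8): negative gradient of the logistic loss at F_{t-1}
    return 2 * y / (1 + np.exp(2 * y * F_prev))

def newton_rho(y, r, h_x):
    # Eq. (10): single Newton-Raphson step, valid for any base regressor h_t
    return np.sum(r * h_x) / np.sum(r * (2 * y - r) * h_x ** 2)

def predict_proba(F_T):
    # Eq. (11): probability of the positive class given the final expansion
    return 1.0 / (1.0 + np.exp(-2 * F_T))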

C. MULTI-CLASS CLASSIFICATION

For multi-class classification with $K > 2$ classes, the labels are defined with 1-of-$K$ vectors, $\mathbf{y}$, such that $y_k = 1$ if the instance belongs to class $k$ and $y_k = 0$ otherwise. In this context the output of the model for $x$ is also a $K$-dimensional vector $\mathbf{F}(x)$. The cross-entropy loss function is used in this context
$$L(\mathbf{y}, \mathbf{F}(\cdot)) = -\sum_{k=1}^{K} y_k \ln p_k(\cdot), \qquad (12)$$
where $p_k(\cdot)$ is the probability of a given instance of being of class $k$
$$p_k(\cdot) = \frac{\exp(F_k(\cdot))}{\sum_{l=1}^{K} \exp(F_l(\cdot))}. \qquad (13)$$


The additive model is initialized with a constant value of 0, i.e. $F_{k,0} = 0 \;\forall k$, which corresponds to a probability equal to $1/K$ for all classes and instances.

The pseudo-residuals, given by Eq. 4, on which the model $h_t$ is trained for the multi-class loss are the derivative of Eq. 12 with respect to $F_k$ evaluated at $F_{k,t-1}$
$$r_{tik} = \sum_{j=1}^{K} \frac{y_{ij}}{p_j(x_i)} \left[\frac{\partial p_j(x_i)}{\partial F_k(x_i)}\right]_{F_k(x)=F_{k,t-1}(x)} = \sum_{j=1}^{K} y_{ij}\left(\delta_{kj} - p_{k,t-1}(x_i)\right) = y_{ik} - p_{k,t-1}(x_i), \qquad (14)$$
where $\delta_{kj}$ is the Kronecker delta and the fact that $\sum_{j=1}^{K} y_{ij} = 1 \;\forall i$ is used in the final step. In this study, a single model $h_t$ is trained per iteration to fit the residuals for all $K$ classes, in contrast to the $K$ decision trees per iteration that are built in gradient boosting. Then a line search for each of the $K$ outputs of the model is computed by minimizing
$$f(\rho_k) = -\sum_{i=1}^{N} \sum_{j=1}^{K} y_{ij} \ln\left[\frac{\exp\left(F_{j,t-1}(x_i) + \rho_j h_{j,t}(x_i)\right)}{\sum_{l=1}^{K} \exp\left(F_{l,t-1}(x_i) + \rho_l h_{l,t}(x_i)\right)}\right] \qquad (15)$$
with a Newton-Raphson step
$$\rho_{k,t} = -\frac{f'(\rho_k=0)}{f''(\rho_k=0)} = -\left[\frac{\sum_{i=1}^{N} h_{k,t}(x_i)\left(y_{ik} - p_{k,t-1}(x_i)\right)}{\sum_{i=1}^{N} h_{k,t}^2(x_i)\, p_{k,t-1}(x_i)\left(p_{k,t-1}(x_i) - 1\right)}\right]. \qquad (16)$$
In the same way as for Eq. 10, this equation is valid for all types of base learners and not specific to decision trees as in the original formulation [9], which opens the possibility of applying gradient boosting to other base learners.

The final output of gradient boosting composed of $T$ models for multi-class tasks is the probability of $y_k = 1|x$
$$p(y_k=1|x) = \frac{\exp(F_{k,T}(x))}{\sum_{l=1}^{K} \exp(F_{l,T}(x))}. \qquad (17)$$
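The multi-class counterparts can be sketched as follows (our own sketch); Y is an N×K one-hot label matrix, while F and H are N×K matrices holding $F_{k,t-1}(x_i)$ and $h_{k,t}(x_i)$:

import numpy as np

def softmax(F):
    # Eq. (13), computed in a numerically stable way
    e = np.exp(F - F.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def pseudo_residuals(Y, F_prev):
    # Eq. (14): r_tik = y_ik - p_{k,t-1}(x_i)
    return Y - softmax(F_prev)

def newton_rho(Y, F_prev, H):
    # Eq. (16): one Newton-Raphson step per class k (returns a length-K vector)
    P = softmax(F_prev)
    num = np.sum(H * (Y - P), axis=0)
    den = np.sum(H ** 2 * P * (P - 1.0), axis=0)
    return -num / den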

D. NEURAL NETWORK AS AN ADDITIVE EXPANSION

A multi-layered neural network can be seen as an additive expansion of its last hidden layer. The output of the last hidden layer for a fully connected neural network for binary tasks is (using the parametrization shown in Fig. 1, top left)
$$p(y=1|x) = \sigma\!\left(\sum_{t=0}^{T} \omega_t z_t\right) \qquad (18)$$
and for multi-class (bottom left in Fig. 1)
$$p(y=k|x) = \sigma\!\left(\sum_{t=0}^{T} \omega_{tk} z_t\right) \qquad (19)$$


FIGURE 1: Illustration of binary and multi-class neural

networks and their parameters (left diagrams) and neural

networks with one unit highlighted in black that represents

the tth model trained in gradient boosted neural network

(right diagrams).

with $z_t$ being the outputs of the last hidden layer, $\omega_t$ and $\omega_{tk}$ the weights of the last layer, and $\sigma$ the activation function (for classification). For regression, no activation function is used. It is not straightforward to adapt this process to train deeper layers, as we take advantage of the additive nature of the final output of the network. In the standard NN training procedure all parameters of the model (i.e. $v_t$ and $\omega_t$ or $\omega_{tk}$ as shown in Fig. 1, left diagrams) are fitted jointly with back-propagation. In the proposed method, instead of learning the weights jointly, the parameters are trained sequentially using gradient boosting. To do so, after computing $F_0$, a fully connected regression neural network with a single (or few) neuron(s) in the last hidden layer is trained using standard back-propagation. This neural network is trained on the residuals given by the previous iteration, as given by Eq. 8 for binary classification and Eq. 14 for multi-class classification tasks. Note that this network trained on the residuals could have a single or several units in the hidden layer. In the remainder of this section we will assume that at each step of the boosting procedure a network with one single unit in the hidden layer is used; generalizing this to larger networks is trivial. Fig. 1 (right diagrams) shows, highlighted in black, the regression neural network with a single neuron in the hidden layer that is trained in iteration $t$ and that corresponds to model $h_t$. After model $t$ has been trained, the value of $\rho_t$ is computed using Eq. 10 for binary problems and $\rho_{k,t}$ (Eq. 16) for multi-class classification. Once all $T$ models have been trained, a neural network, as shown in Fig. 1 (left diagrams), with $T$ units in the last hidden layer is obtained by assigning all the weights necessary to compute the $z_t$ variables (i.e. $v'_t$ in Fig. 1, right) to the corresponding weights in the final NN (i.e. $v_t$ in Fig. 1, left) for binary and multi-class tasks
$$v_t = v'_t \qquad t = 1, \ldots, T \qquad (20)$$


and the weights $\omega_t$ of the output layer for binary classification are assigned to
$$\omega_0 = F_0 + \sum_{t=1}^{T} 2\rho_t\,\omega'_{t0} \qquad (21)$$
$$\omega_t = 2\rho_t\,\omega'_t \qquad t = 1, \ldots, T \qquad (22)$$
where $\omega'_t$ and $\omega'_{t0}$ are the weights from the hidden neuron to the output and the bias term, respectively, of the $h_t$ model (as shown in Fig. 1, right-top diagram). For multi-class with $K$ classes, the final assignment is
$$\omega_{0k} = \sum_{t=1}^{T} \rho_{k,t}\,\omega'_{t0k} \qquad (23)$$
$$\omega_{tk} = \rho_{k,t}\,\omega'_{tk} \qquad t = 1, \ldots, T \quad k = 1, \ldots, K \qquad (24)$$
where $\omega'_{tk}$ and $\omega'_{t0k}$ are the output weights of the $h_t$ multi-class model (Fig. 1, right-bottom diagram).

Finally, to recover the probability of $y=k|x$ in the output of the NN as given by Eq. 11 and Eq. 17, the activation function should be the sigmoid (i.e. $\sigma(x) = 1/(1+\exp(-x))$) for binary classification with the logistic loss (Eq. 6), and the softmax activation function (i.e. $\sigma(x)_k = \exp(x_k)/\sum_{l}\exp(x_l)$) for multi-class tasks when using the cross-entropy loss (Eq. 12). This training procedure can be easily modified to larger increments of neurons, so that instead of a single neuron per step (a linear model), a more flexible model, comprising more units ($J$), can be trained at each iteration. The outline of the proposed method is shown in Algorithm 1, and a code sketch of the weight assignment is given below.
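The following sketch (ours, for the binary case) shows how the $T$ trained sub-networks are merged into one single-hidden-layer network following Eqs. 20-22; each sub-network is represented by its hidden-layer input weights and bias $(v_t, b_t)$ and its output weight and bias $(\omega'_t, \omega'_{t0})$:

import numpy as np

def assemble_network(F0, subnets, rhos):
    # subnets: list of tuples (v_t, b_t, w_out_t, w_bias_t) for t = 1, ..., T
    V = np.column_stack([v for v, _, _, _ in subnets])     # Eq. (20): hidden-layer weights
    b = np.array([b for _, b, _, _ in subnets])            # hidden-layer biases
    w = np.array([2 * rho * w_out                          # Eq. (22): output weights
                  for (_, _, w_out, _), rho in zip(subnets, rhos)])
    w0 = F0 + sum(2 * rho * w_bias                         # Eq. (21): output bias
                  for (_, _, _, w_bias), rho in zip(subnets, rhos))
    return V, b, w, w0   # parameters of the final NN, to be used with a sigmoid output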

The proposed training procedure can be further tuned by applying subsampling and/or shrinkage, as generally used in gradient boosting [9], [11], [13]. In shrinkage, the additive expansion process is regularized by multiplying each term $\rho_t h_t$ by a constant learning rate, $\nu \in (0, 1]$, to prevent overfitting when multiple models are combined [9]. Subsampling consists of training each model on a random subsample drawn without replacement from the original training data. Subsampling has been shown to improve the performance of gradient boosting [13].
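A brief sketch (ours) of how shrinkage and subsampling would enter the update of each boosting step:

import numpy as np

def subsample_indices(n_samples, fraction, rng):
    # random subsample without replacement used to fit the next model h_t
    return rng.choice(n_samples, size=int(fraction * n_samples), replace=False)

def shrunken_update(F, rho_t, h_t, X, nu=0.5):
    # each term rho_t * h_t(x) is multiplied by the learning rate nu in (0, 1]
    return F + nu * rho_t * h_t.predict(X)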

The overall computational time complexity to train the

proposed algorithm is the same as that of a standard NN

with the same number of hidden neurons and training epochs.

Note that, at each step of the proposed algorithm, one NN

with one (or few) neurons in the hidden layer is trained.

Hence, at the end of the process, the same number of weight updates is performed. Note, however, that the proposed

method is sequential and cannot be easily parallelized. Once

the model is trained, the computational complexity for clas-

sifying new instances is equivalent to that of a standard NN

as the generated model is a standard neural network.
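Putting the previous pieces together, a simplified end-to-end sketch (ours, binary case, following Algorithm 1) of the sequential training loop could look as follows; the scikit-learn MLPRegressor with $J$ hidden units plays the role of the sub-network $h_t$:

import numpy as np
from sklearn.neural_network import MLPRegressor

def train_gbnn_binary(X, y, T=100, J=1, nu=0.5):
    # y takes values in {-1, +1}; F stores F_t(x_i) on the training set
    y_bar = y.mean()
    F0 = 0.5 * np.log((1 + y_bar) / (1 - y_bar))                    # Eq. (7)
    F = np.full(len(y), F0)
    models, rhos = [], []
    for _ in range(T):
        r = 2 * y / (1 + np.exp(2 * y * F))                         # Eq. (8)
        h = MLPRegressor(hidden_layer_sizes=(J,), solver="lbfgs").fit(X, r)
        h_x = h.predict(X)
        rho = np.sum(r * h_x) / np.sum(r * (2 * y - r) * h_x ** 2)  # Eq. (10)
        F = F + nu * rho * h_x                                      # Eq. (1) with shrinkage
        models.append(h)
        rhos.append(nu * rho)
    return F0, models, rhos   # ready to be merged into a single NN (Eqs. 20-22)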

E. ILLUSTRATIVE EXAMPLE

To illustrate the workings of this algorithm, we show its

performance in a toy classification problem. The toy problem consists of a 2D version of the ringnorm problem [35], where both classes are 2D Gaussian distributions, one with (0,0) mean and covariance four times the identity matrix,

Algorithm 1 Training a neural network as an additive expansion

Input:
1: Input data $D = \{x_i, y_i\}_1^N$
2: Number of neurons $T$
3: Loss function

Training the model:
1: Initialize $\hat{F}_0$
2: for $t = 1$ to $T$ do
3:   Compute the pseudo-residuals with Eq. 8 or Eq. 14 ($r_{ti}$ for binary and $r_{tik}$ for multi-class classification)
4:   Fit a new regressor network model on the residuals
5:   Compute the gradient descent step through the Newton-Raphson step ($\rho_t$ for binary, Eq. 10, and $\rho_{k,t}$ for multi-class classification, Eq. 16)
6:   Update the model using Eq. 1
7: end for
8: Create the final network by assigning the weights with Eqs. 20, 22 and 24

and the second class with mean value at (2/√2,2/√2) and

the identity matrix as covariance.
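A possible way to generate such a training sample (our sketch; the assignment of the two Gaussians to the class labels is our own choice):

import numpy as np

rng = np.random.default_rng(0)
n_per_class = 100                                   # 200 instances in total
X_a = rng.multivariate_normal([0.0, 0.0], 4 * np.eye(2), size=n_per_class)
X_b = rng.multivariate_normal([2 / np.sqrt(2), 2 / np.sqrt(2)], np.eye(2), size=n_per_class)
X = np.vstack([X_a, X_b])
y = np.concatenate([-np.ones(n_per_class), np.ones(n_per_class)])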

The proposed gradient boosted neural network (GBNN)

with T= 100 and one hidden layer is trained on 200

randomly generated instances of this problem. In addition,

100 independent neural nets with hidden units in the range

[1,100] are also trained using the same training set. In Fig 2,

the boundaries for the different stages of the process are

shown graphically. In detail, the ﬁrst and second rows show

the results for GBNN and NN, respectively. Each column

shows the results for $T = 1$, $T = 2$, $T = 3$, $T = 4$ and $T = 100$ neurons in the hidden layer, respectively. Note that the plots for GBNN are sequential; that is, the first column shows the first trained model, the second column the first two models combined, and so on. For the NN, each column corresponds to a different NN with a different number of neurons in the hidden layer. For each column, the architecture of the networks and the number of weights (but not their values) are the same. The color of the plots represents the probability $p(y=1|x)$ given by the models (Eq. 11) using the viridis colormap. In addition, all plots show the training points.

As we can see both GBNN and NN start, as expected,

with very similar models (column T= 1). As the number

of models (neurons for NN) increases, GBNN builds up

the boundary from previous models. On the other hand, the

standard neural network, as it creates a new model for each

size, is able to adjust faster to the data. However, as the

number of neurons increases, NN tends to overﬁt in this

problem (as shown in the bottom right-most plot). On the

contrary, GBNN tends to focus on the unsolved parts of

the problem: the decision boundary becomes deﬁned only

asymptotically, as the number of models (neurons) becomes

large.


[Figure 2 panels, left to right: T=1, T=2, T=3, T=4, T=100.]

FIGURE 2: Classiﬁcation boundaries for gradient boosted neural network (top row) and for a standard neural network (bottom

row). Each column shows the results for a combination of a different number of models (hidden units). Top plots are the

sequential results of a single GBNN model, whereas the bottom plots are independent neural network models.

III. EXPERIMENTAL RESULTS

In this section, the performance of the proposed neural network training method based on gradient boosting is analyzed on twelve binary classification tasks, eight multi-class

problems, seven regression datasets from the UCI repository

[36], and two datasets related to image processing [37], [38].

These datasets, shown in Table 1, have different numbers of instances and attributes and come from different fields

of application. The Energy dataset has two different target

columns (cooling and heating), so the different algorithms

were executed for both objectives separately. We modiﬁed

some of the datasets. For Diabetes, duplicated instances and

instances with missing values were removed. In addition,

categorical values were substituted by dummy variables in:

German Credit Data, Hepatitis, Indian Liver Patient, MAGIC

and Tic-tac-toe. Finally, CIFAR-10 and MNIST were normal-

ized by dividing the attributes by 255.

Two batches of experiments were carried out: experiments

on tabular data and experiments on image and large datasets.

In the ﬁrst batch, the proposed method is compared with

respect to standard neural networks using different solvers

and with respect to dense deep neural networks. In the second

batch, a transfer learning approach was followed, training

the last layer of deep models with the proposed method and

fully connected NN. The ﬁrst experiment is carried out in

all the classiﬁcation and regression tasks shown in Table 1,

except for Covertype, Poker Hand, MNIST and CIFAR-10.

For these large datasets, the comparison of the proposed

method was carried out with respect to deep dense or convo-

lutional neural networks, depending on the type of problem.

For the ﬁrst batch of experiments, the scikit-learn

package [39] was used. For the second batch, we adopted

the Keras library [40]. The implementation of the proposed method (GBNN) is done in Python following the standards

TABLE 1: Details of the datasets used in the experimental analysis

Binary classification
  Dataset                      Area               Instances   Attribs.
  Australian Credit Approval   Financial                690        14
  German Credit Data           Financial              1,000        20
  Banknote                     Computer               1,372         5
  Spambase                     Computer               4,601        57
  Tic-tac-toe                  Game                     958         9
  Breast cancer                Life                     569        32
  Diabetes                     Life                     381         9
  Hepatitis                    Life                     155        19
  Indian Liver Patient         Life                     583        10
  MAGIC Gamma Telescope        Physical              19,020        11
  Ionosphere                   Physical                 351        34
  Sonar                        Physical                 208        60

Multi-class classification
  Digits                       Computer               1,797        64
  Poker Hand                   Game               1,025,010        10
  Iris                         Life                     150         4
  Covertype                    Life                 581,012        54
  Vehicle                      Transportation           946        18
  Vowel                        Education                990        11
  Waveform                     Physical               5,000        21
  Wine                         Physical                 178        13

Regression
  Concrete                     Physical               1,030         9
  Energy                       Computer                 768         8
  Power                        Computer               9,568         4
  Boston Housing               Business                 506        14
  Wine quality-red             Business               1,599        12
  Wine quality-white           Business               4,898        12

Image processing
  MNIST                        Handwritten digits    60,000     28x28
  CIFAR-10                     Labeled images        60,000     32x32

of scikit-learn. The implementation of the algorithm is available at https://github.com/GAA-UAM/GBNN/.


A. EXPERIMENTS WITH TABULAR DATA

For the ﬁrst set of experiments on tabular regression and

classiﬁcation, single hidden layer networks were trained

using three different standard solvers (Adam, L-BFGS and

SGD), and using the proposed gradient boosting approach.

Also, we considered a three-layer deep dense neural network

trained with the Adam solver. Furthermore, AdaBoost [17]

using small neural networks as the base models was also

included in the comparison (AdaBoost–NN). Note that this approach is different from the proposed GBNN in two aspects.

First, for classiﬁcation tasks, the base models in Adaboost are

classiﬁers that are combined by weighted majority voting.

Hence, the ﬁnal model is a collection of small base classi-

ﬁers and not a single neural network as in GBNN. Second,

Adaboost is based on modifying the instance weights during

training so that difﬁcult instances tend to get higher weights.

This poses a difﬁculty in the training of the neural networks

as they do not handle weighted instances. In order to run AdaBoost with neural networks, we included a weighted resampling step prior to the training of each individual network. The scikit-learn library does not include this functionality in AdaBoost. The comparison for classification and

regression problems was carried out using 5×10-fold cross-

validation in order to have stable results. In addition, the same

splits were used for all methods. In this way, all methods

are working under the same conditions. For the Waveform

dataset we also considered 10 random train-test partitions

using 300 instances for training the models and the rest for

test, as experiments with this dataset generally consider this

experimental setup [35]. All datasets were standardized using the training set so that all attributes have zero mean and unit variance.

All methods were carefully tuned in order to obtain their

best possible performance and to obtain a fair comparison

between them. The optimum hyper-parameter setting for

each method was estimated using within-train 10-fold cross-

validation. For the standard neural network with the standard

solvers, the grid with the number of units in the hidden

layer was set to [1, 3, 5, 7, 11, 12, 17, 22, 27, 32, 37,

42, 47, 52, 60, 70, 80, 90, 100, 150, 200]. In addition, for

the SGD solver, the learning policies [Adaptive, Constant]

were also considered in its grid search. The rest of the hyper-

parameters were set to their default values. Regarding the

three-layer deep network, 100 neurons per layer were used

and all the other hyper-parameters were left to their default

values. For GBNN, a sequentially trained neural network

with 200 hidden units is built in steps of $J$ units per iteration.

For this model, the hyper-parameter grid search was car-

ried out using the following values for binary classiﬁcation:

[0.1, 0.25, 0.5, 1.0] for the learning rate, [0.5, 0.75, 1.0] for the subsample rate and $J \in [1, 2, 3]$. For the multi-class and

regression problems the hyper-parameter grid was extended

a bit because in some datasets the values at the extremes

were always selected. Hence, the grid for multi-class and

regression problems is set to: [0.025, 0.05, 0.1, 0.5, 1] for the learning rate, [0.25, 0.5, 0.75, 1.0] for the subsample rate and $J \in [1, 2, 3, 4]$. Moreover, for the proposed procedure, the

sub-networks are trained using the L-BFGS solver for all

datasets. This decision was made after some preliminary

experiments that showed that L-BFGS provided good results

in general for the small networks we are training at each

iteration. In addition, for both the standard neural networks (with SGD and Adam optimizers) and GBNN, early stopping was applied during training using the default values of the scikit-learn library (i.e., training stops if, after 10 epochs, the loss is not reduced by at least $10^{-4}$). Regarding AdaBoost-NN, a

one-hidden-layer classification/regression perceptron is set as the base estimator. The number of estimators to include in the

ensemble and number of hidden neurons were set respec-

tively to the following pairs: (200,1), (100,2) and (67,3). For

multi-class problems the pair (50, 4) was also considered.

This is done in order to produce a model with a combined

total number of 200 hidden neurons as for previous networks.

Subsequently, the best sets of hyper-parameters obtained in the grid search for each method were used to train, on the whole training set, one standard neural network for each solver, the dense deep neural network, AdaBoost–NN, and the pro-

posed gradient boosted neural network. Finally, the average

generalization performance of the models was estimated in

the left-out test set.
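As an illustration of this protocol, the within-train grid search for GBNN on a binary task could be set up as follows (our sketch; GBNNClassifier stands for the implementation available at https://github.com/GAA-UAM/GBNN/, and its exact constructor arguments and parameter names are assumptions here):

from sklearn.model_selection import GridSearchCV
# from gbnn import GBNNClassifier            # hypothetical import of the GBNN estimator

param_grid = {
    "learning_rate": [0.1, 0.25, 0.5, 1.0],  # shrinkage grid for binary classification
    "subsample": [0.5, 0.75, 1.0],           # subsampling rate
    "units_per_step": [1, 2, 3],             # J, neurons added per boosting step
}
# search = GridSearchCV(GBNNClassifier(total_units=200), param_grid, cv=10)
# search.fit(X_train, y_train)
# test_accuracy = search.score(X_test, y_test)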

For all analyzed datasets, the average generalization per-

formance and standard deviations are shown in Table 2

for the proposed method (column GBNN), the three-layer

deep neural network (column Deep-NN), standard neural

networks trained with Adam (NN–Adam), L-BFGS (NN–

L-BFGS), SGD (NN–SGD), and AdaBoost–NN. For clas-

siﬁcation problems the generalization performance is given

as average accuracy and, for regression, as average root

mean square error. Table 2 is structured in three blocks

depending on the problem type: binary classiﬁcation tasks,

multi-class classification, and regression. The best result for each dataset is highlighted with a light yellow background.

An overall comparison of these results is shown graphically

in Fig 3 using the methodology described in [41]. These

plots show the average rank for the studied methods across

the analyzed datasets where a higher rank (i.e. lower values)

indicates better results. The figure shows the Demšar plots for the classification tasks only (subplot a), for regression

only (subplot b), and for all analyzed tasks (subplot c). The

statistical differences between methods are determined using

a Nemenyi test. In the plot, the difference in the average rank

of the two methods is statistically signiﬁcant if the methods

are not connected with a horizontal solid line. The critical

distance (CD) in average rank over which the performance

of two methods is considered signiﬁcant is shown in the plot

for reference (CD = 1.72, 2.84, and 1.47 for 19 classiﬁcation,

seven regression and all datasets respectively, considering six

methods and p-value <0.05).
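For reference, the Nemenyi critical distance is $CD = q_\alpha \sqrt{k(k+1)/(6N)}$ for $k$ methods compared over $N$ datasets; with $q_{0.05} \approx 2.85$ for $k = 6$ this gives values close to the CDs quoted above (our sketch):

import math

def nemenyi_cd(n_methods, n_datasets, q_alpha=2.85):
    # critical distance of the Nemenyi post-hoc test used in the Demsar plots
    return q_alpha * math.sqrt(n_methods * (n_methods + 1) / (6.0 * n_datasets))

for n in (19, 7, 26):        # classification, regression and all datasets
    print(n, round(nemenyi_cd(6, n), 2))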

From Table 2 and Fig 3, it can be observed that, for

the studied datasets, the best overall performing method

is GBNN. This method reached the best results in 13 out

of 26 tasks. The deep-NN model performed best in ﬁve


TABLE 2: Average generalization performance and standard deviation for the gradient boosted neural network (GBNN), the deep neural network, neural networks trained with the Adam, L-BFGS and SGD solvers, and AdaBoost. Accuracy is reported for the classification tasks and root mean square error (RMSE) for the regression problems. The best result for each dataset is highlighted with a light yellow background.

Dataset                      GBNN           Deep-NN        NN–Adam        NN–L-BFGS      NN–SGD         AdaBoost–NN

Binary classification (accuracy, %)
Australian Credit Approval   85.91±4.81     82.81±4.26     85.88±4.43     86.20±4.32     86.35±4.28     84.58±3.32
Banknote                     99.99±0.04    100.00±0.00     99.99±0.04     99.81±0.42     97.32±1.15    100.00±0.00
Breast cancer                96.87±1.93     97.40±2.06     97.36±1.97     96.83±1.55     96.55±1.69     96.45±4.72
Diabetes                     76.12±3.93     71.30±4.59     76.35±4.05     76.67±4.15     76.96±4.21     74.81±3.25
German Credit Data           74.16±3.64     69.84±4.53     73.74±3.99     72.34±3.41     73.28±3.87     72.44±4.42
Hepatitis                    82.81±10.60    84.10±10.83    85.10±10.86    82.11±9.21     85.09±10.65    85.54±7.82
Indian Liver Patient         72.51±5.30     70.88±5.20     70.69±5.81     69.44±5.88     71.10±5.52     69.26±4.84
Ionosphere                   90.94±4.86     93.28±3.69     91.34±3.92     90.43±4.58     87.90±5.96     92.25±3.52
MAGIC Gamma Telescope        87.52±0.67     85.62±0.85     87.65±0.57     87.58±0.62     86.28±1.48     87.19±0.75
Sonar                        78.84±7.30     86.22±7.57     85.56±6.48     85.29±5.07     77.67±7.67     85.91±8.26
Spambase                     94.44±1.00     94.08±1.22     94.61±1.01     93.43±1.27     93.42±1.27     74.20±13.16
Tic-tac-toe                  98.70±1.17     95.78±1.80     90.44±3.19     93.53±2.56     70.94±4.30     86.75±3.69

Multi-class classification (accuracy, %)
Digits                       97.18±1.21     97.55±1.06     98.04±1.11     97.17±1.10     96.44±1.09     95.66±1.18
Iris                         95.73±6.02     94.40±6.58     95.33±5.58     94.93±6.57     85.07±8.71     94.72±2.16
Vehicle                      84.61±3.92     83.41±3.82     83.43±3.52     82.96±4.21     72.99±4.51     72.13±4.15
Vowel                        89.88±2.57     96.71±1.96     94.28±2.36     93.13±3.09     53.04±4.34     51.45±3.37
Waveform                     87.00±1.16     82.60±1.62     86.65±1.40     86.46±1.22     86.68±1.37     84.99±1.46
Waveform-300                 82.94±0.17     82.07±0.63     83.69±0.70     79.51±0.02     83.92±0.77     84.55±0.40
Wine                         98.88±2.35     97.87±3.35     97.77±3.22     97.65±3.50     95.72±4.91     97.32±3.48

Regression (RMSE)
Boston Housing                3.03±0.74      3.18±0.79      4.12±0.77      3.50±0.98      3.40±0.87     14.43±1.49
Concrete                      4.80±0.59      5.20±0.56      9.61±0.58      4.73±0.59      6.04±0.47     26.78±1.10
Energy-Cooling                0.95±0.16      2.16±0.31      3.45±0.46      1.14±0.17      3.12±0.41     15.91±1.69
Energy-Heating                0.43±0.08      1.40±0.27      2.96±0.38      0.49±0.06      2.75±0.33     12.95±1.29
Power                         3.84±0.18      4.40±0.39      4.25±0.16      4.11±0.17      4.14±0.15    123.37±9.70
Wine quality-red              0.60±0.04      0.68±0.06      0.64±0.05      0.64±0.73      0.65±0.04      0.83±0.06
Wine quality-white            0.67±0.03      0.72±0.04      0.68±0.03      0.70±0.03      0.70±0.03      0.74±0.02

datasets. The neural network method with Adam as its solver

captured the best results in three datasets. SGD and L-

BFGS solvers obtained the best outcome in two and one datasets, respectively. Finally, AdaBoost–NN obtained the best performance in three datasets. In classification, the

differences in average accuracy among the different methods

are generally favorable to GBNN, NN–Adam and Deep–NN,

although the differences between these methods are small

in many datasets. For instance, in Magic, the accuracy of

GBNN is only 0.13 percentage points worse than that of NN–Adam.

However, small differences are not always the case. One of

the most notable differences between GBNN and the second top-ranked method is in Tic-tac-toe, where Deep–NN is 2.92%

worse than the result obtained by GBNN and NN-Adam is

more than eight points worse. In contrast, the most favorable

outcome for Deep–NN is obtained in Vowel and Sonar, where

its accuracy is 6.83% and 7.38% better than that of GBNN,

respectively. The results in classiﬁcation for NN–L–BFGS,

NN–SGD and AdaBoost–NN are generally worse than those

of the other three methods.

In regression, the results are more clearly in favor of

GBNN. The proposed method obtains the best performance in all tested datasets except for one, Concrete, in which GBNN obtains the second best result. The performance of NN–Adam in regression is suboptimal, obtaining the worst performance in general. The L-BFGS solver performs closer to GBNN.

These results can also be observed from Fig 3. The perfor-

mance of NN–L-BFGS, NN–SGD and AdaBoost–NN in the

classiﬁcation tasks is worse than that of NN–Adam, GBNN

and Deep–NN (subplot a). NN-Adam and GBNN take the

highest rank in the statistical test. For regression (subplot b),

the best-performing method is GBNN. The performance of

GBNN is signiﬁcantly better than that of NN–Adam, NN–

SGD and AdaBoost–NN. The overall results (subplot c) show

that GBNN has the best rank followed by NN–Adam and

Deep–NN. Overall, the performance of GBNN is statistically

better than the performance of NN–SGD and AdaBoost-NN.

To analyze the evolution of the networks generated by the

proposed method as more hidden units are included, another

experiment was carried out to build a GBNN with 200 hidden

units, and neural networks trained using the Adam solver

with 1 to 200 neurons in the hidden layer. In addition, the


[Figure 3 panels: (a) Classification, (b) Regression, (c) All of the datasets; each panel shows the average ranks of GBNN, Deep-NN, NN–Adam, NN–L-BFGS, NN–SGD and AdaBoost–NN with the critical distance (CD) marked.]

FIGURE 3: Average ranks (higher rank is better) for GBNN, Deep-NN, NN–Adam, NN–L-BFGS, NN–SGD and AdaBoost–NN for 26 datasets. a) The Demšar plot for binary and multi-class datasets. b) The Demšar plot for regression datasets. c) The Demšar plot considering all of the datasets.

ﬁnal accuracy of a three-layer deep neural network with 100

neurons in each layer is also computed for reference. For this

experiment, 10-fold cross-validation was used. The average

evolution is shown in Fig 4 for Spambase (left plot), Tic-

tac-toe (middle plot) and Energy-Heating (right plot). Due to

computational limitations in this case, the hyper-parameters

were not tuned for each partition. Instead, they were set to

the values more often selected in the previous experiment for

each dataset. Speciﬁcally, they were set to J= 3, learning

rate to 0.5 and subsampling to 1 for Tic-tac-toe, to J= 1,

learning rate to 0.5 and subsampling to 0.75 for Spambase

and to J= 4, learning rate to 0.5 and subsampling to

1 for Energy-Heating for all partitions. Note that for each

sequence of GBNN, only one model is trained. For NN, 200

independent models with 1 to 200 neurons need to be trained

in order to obtain the sequence.

From Fig 4, it can be observed that the average test

accuracy of GBNN improves as more units are considered.

More importantly, we can observe that GBNN tends to sta-

bilize with the number of units. Hence, if the last-trained units are removed, the model accuracy is not damaged to a great extent. For instance, in Spambase, if the

number of units is reduced from $T = 200$ to $T = 100$, the model accuracy only drops from 94.78% to ≈94.63%.

In Tic-tac-toe the same size reduction does not reduce the

accuracy of the model. This observation also applies for the

Energy-Heating regression task. This property can be useful, as a single model can be adapted to different computational requirements on the fly. This shows that the proposed method

is not very prone to over-ﬁtting when more units are included.

This could be explained by the fact that later units in GBNN

are trained on smaller pseudo-residuals. In consequence, later models (units) have less influence on the final NN. This

is consistent with gradient boosting ensembles, in which

the important aspect is to tune other regularization hyper-

parameters, like the learning rate, instead of the number of

trees (see section 5 of ref [9] for an interesting discussion

of this aspect). On the other hand, the performance of NN

could be higher than that of GBNN at some stages (as shown

for Spambase); however, the number of hidden neurons to use has to be decided during training to reduce over-fitting. In addition, the performance of two neural networks with $T$ and $T+1$ neurons trained on the same data presents a higher variance than that of a single GBNN model using $T$ and $T+1$ units. Hence, the performance of the sequence of NNs is not as monotonic as that of the GBNN sequence.
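This on-the-fly adaptation only requires evaluating the first T′ hidden units of the assembled network, as in the following sketch (ours; the hidden activation function is an assumption, and V, b, w, w0 are the assembled parameters from Section II-D):

import numpy as np

def predict_proba_truncated(X, V, b, w, w0, t_prime):
    # keep only the first t_prime hidden units, i.e. the ones trained first
    Z = np.tanh(X @ V[:, :t_prime] + b[:t_prime])           # hidden activations (tanh assumed)
    return 1.0 / (1.0 + np.exp(-(Z @ w[:t_prime] + w0)))    # sigmoid output, as in Eq. (18)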

In order to compare the computational performance of the

tested methods, the training time for GBNN, Deep-NN, NN–

Adam, NN-L-BFGS and NN-SGD are shown in Table 3.

For this experiment, we applied 10-fold cross-validation and

computed the average fit time of the final model that used the set of hyper-parameters most frequently selected in the cross-

validation. The size for GBNN and the different networks

was set to 200. Deep-NN was built, as before with three

layers of 100 neurons each. GBNN uses the solver L-BFGS.

This experiment is done on CPU using AMD Ryzen 7

5800H 3.20 GHz processor. As Table 3 illustrates, the most

computationally efﬁcient method is NN–SGD followed by

NN–L-BFGS. In general, the differences are rather contained

among most methods and datasets. Some exceptions include

the Deep-NN with respect to the other methods, which could be up to 20 times slower (e.g. Power or Energy-Heating). The

variance in the average times among the different methods

for each dataset is also due to the fact that the default hyper-

parameters for all methods include an early-stopping strategy


[Figure 4 panels: (a) Spambase, (b) Tic-tac-toe, (c) Energy-Heating; each panel plots the average generalization performance of GBNN, NN and Deep-NN against the number of hidden neurons.]

FIGURE 4: Average generalization accuracy for GBNN (blue curve), NN (orange curve), and Deep-NN (green curve) with

respect to the number of hidden units

in which the training stops if the loss does not drop at least

0.0001 in ten training iterations. In order to visualize this

effect, in Fig 5 we show the average training time with respect to the number of neurons in the hidden layer for GBNN,

Deep-NN and NN-Adam for two datasets (one favorable to

GBNN and one favorable to NN). The left plot shows the

results considering early stopping and the right plot forces

the networks to train for 200 epochs. Note that the results

of GBNN are monotonic since a single model is used to

obtain the training time sequence. For Deep-NN and NN-

Adam, a different model is trained every 20 hidden neurons,

which explains the peaks in the curves especially when

early stopping is active (left plots). When the models are

forced to train for 200 epochs, the evolution of the shallow models shows a clear linear complexity with respect to the

number of hidden units. The deep model shows a quadratic

complexity in accordance with the growth of the number of

weights. In addition, the training time performance of GBNN

is also affected by the model setting. This can account for the

favorable and unfavorable results for GBNN shown in Fig 5.

In particular, some exploratory experiments we carried out using different learning rates showed that the higher the learning rate, the longer the training time, especially for classification.

B. EXPERIMENTS WITH LARGE DATASETS AND DEEP MODELS

For the second batch of experiments, a three-layer deep dense network and CNNs were used. In order to compare the deep

models with the proposed method, we followed a transfer

learning approach. For this, once the deep models are trained,

the last dense layer and the output layer of the CNN and

Deep-NN are removed. Then, the weights of the ﬁrst lay-

ers are frozen. Finally, a GBNN model is trained on top of the frozen layers using the same training instances. We

term this model Deep-GBNN. For Covertype and Poker Hand, a three-layer deep dense neural network with 100 hidden neurons per layer was trained for 200 epochs. The

activation functions are ReLU for the internal layers and sigmoid for the output layer. The SGD solver was used.

TABLE 3: Total training time in seconds for the gradient boosted neural network (GBNN), the deep neural network, and neural networks trained with the Adam, L-BFGS and SGD solvers

Dataset                      GBNN     Deep-NN   NN–Adam   NN–L-BFGS   NN–SGD

Binary classification
Australian Credit Approval   1.105     0.236     0.093      0.614      0.359
Banknote                     0.454     0.391     0.515      0.065      1.055
Breast cancer                0.165     0.295     0.305      0.510      0.054
Diabetes                     0.882     0.253     0.698      0.619      0.221
German Credit Data           0.129     0.187     0.208      0.130      0.524
Hepatitis                    0.316     0.060     0.023      0.247      0.046
Indian Liver Patient         0.557     0.220     0.288      0.073      0.400
Ionosphere                   0.791     0.618     0.431      0.085      0.400
MAGIC Gamma Telescope        5.273    37.267     6.177     12.931      5.038
Spambase                     2.240     2.461     0.903      5.057      0.639
Sonar                        0.568     0.581     0.327      0.166      0.307
Tic-tac-toe                  0.732     1.215     1.011      0.413      0.956

Multi-class classification
Digits                       0.976     1.095     1.016      0.463      2.369
Iris                         1.288     0.321     0.108      0.142      0.102
Vehicle                      0.256     0.519     0.269      0.549      0.391
Vowel                        1.088     1.992     1.419      0.578      1.085
Waveform                     3.896     8.676     5.452      3.644      5.146
Wine                         0.116     0.348     0.022      0.071      0.041

Regression
Boston Housing               0.636     1.026     0.468      0.344      0.030
Concrete                     0.453     0.650     0.672      0.673      0.045
Energy-Cooling               0.077     0.768     0.632      0.573      0.037
Energy-Heating               0.083     1.023     0.643      0.488      0.037
Power                        0.198     4.093     1.709      1.330      0.321
Wine quality-red             1.192     1.748     0.865      0.995      0.450
Wine quality-white           1.148     1.953     2.647      3.150      0.220

For CIFAR-10 and MNIST, a CNN was trained for 200 epochs on the training set. The CNN includes the following layers: two convolutions (32 filters), max-pooling (2x2), dropout (0.2), two convolutions (64 filters), max-pooling (2x2), dropout (0.3), two convolutions (128 filters), max-pooling (2x2), dropout (0.4), two dense layers (128 units), and one output dense layer for ten classes. The activation functions are ReLU for the internal layers and SoftMax for the output layer, as sketched below.
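A Keras sketch (ours) of this architecture; the 3×3 kernel size and "same" padding are assumptions, since only the number of filters per block is stated in the text:

from tensorflow import keras
from tensorflow.keras import layers

def build_cnn(input_shape=(32, 32, 3), n_classes=10):
    model = keras.Sequential()
    model.add(keras.Input(shape=input_shape))
    for filters, drop in ((32, 0.2), (64, 0.3), (128, 0.4)):
        model.add(layers.Conv2D(filters, 3, padding="same", activation="relu"))
        model.add(layers.Conv2D(filters, 3, padding="same", activation="relu"))
        model.add(layers.MaxPooling2D((2, 2)))
        model.add(layers.Dropout(drop))
    model.add(layers.Flatten())
    model.add(layers.Dense(128, activation="relu"))
    model.add(layers.Dense(128, activation="relu"))
    model.add(layers.Dense(n_classes, activation="softmax"))
    return model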

For these datasets, a single partition train/test was carried out. The dataset is divided into

a training and a test set as deﬁned by the dataset: 50,000 train

and 10,000 test for CIFAR-10; 60,000 train and 10,000 test


for MNIST and 25,010/1,000,000 for Poker. For Covertype,

a random stratiﬁed partition of 70%-30% was done. The

optimum hyper-parameter configuration was estimated using within-train 5-fold cross-validation, except for Covertype

where a 2-fold cross-validation was used. For Deep-GBNN,

the values for the grid search were in the following ranges:

[0.1,1] for the learning rate, [0.25,1] for subsample and

[1,15] for the step size. Also, the solver for the GBNN

method was set to Adam, due to better performance in deal-

ing with high-dimensional datasets. In addition, we ran the final experiment using pre-trained networks. For that, we used

the InceptionV3 [42] and VGG16 [43] models pre-trained on the ImageNet dataset [44]. These models were loaded

and subsequently ﬁne-tuned on the CIFAR-10 dataset using

the default train partition composed of 50,000 instances.

They are validated on the remaining 10,000 test instances.

We then froze the weights of VGG16 and InceptionV3 and replaced their last dense layer with an optimized GBNN and NN with 500 units, where the optimization was done by grid search on the training set. Finally, the accuracy of NN and GBNN is estimated on the test set.
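As a sketch (ours) of the feature-freezing part of this setup with Keras, using VGG16 and a plain dense head with 500 units (attaching the GBNN head instead follows the same idea of training only on the frozen features):

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.applications import VGG16

base = VGG16(weights="imagenet", include_top=False, input_shape=(32, 32, 3))
base.trainable = False                        # freeze the pre-trained weights

model = keras.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(500, activation="relu"),     # 500 units, as in the text
    layers.Dense(10, activation="softmax"),   # CIFAR-10 output layer
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(x_train, y_train, epochs=..., validation_data=(x_test, y_test))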

Table 4 presents the accuracy achieved on the test set for four datasets (first two columns) and the estimation of the generalization accuracy obtained in the in-train cross-validation (last two columns). The best in-train and test results are highlighted with a yellow background. The results show uneven performance across the different datasets. The proposed method manages to improve the performance of the deep models in MNIST, Covertype and Poker Hand by 0.06, 1.76 and 0.12 percentage points, respectively. In CIFAR-10, performance drops by 0.03%. Even if the differences are in general marginal, an interesting aspect is that the in-train performance can be used to select the best model on each dataset except MNIST, although the differences between methods on this dataset, both in-train and in test, are negligible.

Finally, the performances of the GBNN and NN models trained on the output of InceptionV3 and VGG16 for CIFAR-10, as well as those of the fine-tuned transfer learning models, are shown in Table 5. The application of GBNN on top of the pre-trained and fine-tuned models gained 0.08 and 0.07 percentage points in accuracy with respect to InceptionV3 and VGG16, respectively. The results of NN are 0.08 percentage points worse and 0.08 better than the deep fine-tuned models. The accuracy improvement of using GBNN on top of the deep models is small, but so is the additional computational training cost. The classification computational cost remains the same.

IV. CONCLUSIONS

In this paper, we present a novel iterative method to train

a neural network based on gradient boosting. The proposed

algorithm builds at each step a regression neural network

with one or few (J) hidden unit(s) ﬁtted to the data residuals.

The weights of the network are then updated with a Newton-

Raphson step to minimize a given loss function. Then, the

data residuals are updated to train the model of the next

TABLE 4: Average generalization performance (first two columns) and in-train performance (last two columns) for deep gradient boosted neural networks (Deep-GBNN) and deep convolutional or dense neural networks (Deep-CNN / Deep-NN)

                 Test estimation           In-train estimation
Dataset          Deep-GBNN   Deep-CNN      Deep-GBNN   Deep-CNN
CIFAR-10             84.09      84.12          82.77      83.04
MNIST                99.52      99.46          99.35      99.42

Dataset          Deep-GBNN   Deep-NN       Deep-GBNN   Deep-NN
CoverType            90.00      88.24          87.53      83.45
Poker Hand           99.45      99.33          98.57      97.72

TABLE 5: Generalization performance of the fine-tuned transfer learning models (InceptionV3 and VGG16), and generalization and in-train performance of the one-layer GBNN and NN models trained on the output of the pre-trained models on CIFAR-10

                     Fine-tuned      GBNN                NN
Pre-trained model    Test            Test    In-train    Test    In-train
InceptionV3          93.12           93.20   99.99       93.04   99.98
VGG16                92.92           92.99   99.98       93.00   99.98

[FIGURE 5 panels: (a) Power (tol = 1e-4, Max_epochs = 10), (b) Power (tol = 0, Max_epochs = 200), (c) Indian Liver Patient (tol = 1e-4, Max_epochs = 10), (d) Indian Liver Patient (tol = 0, Max_epochs = 200); each panel plots the average training time (in seconds) against the number of neurons for NN, GBNN and Deep-NN.]

FIGURE 5: Average training time (in seconds) for the gradient boosted neural network (blue curve), standard neural network (orange), and deep neural network (green), for waveform (top row) and Indian liver patient (bottom row). In the left column, the default early stopping procedure was used to train the networks of all models (tolerance set to 1e-4 and iterations with no change set to ten). In the right column, all networks are forced to train for 200 epochs (tolerance set to 0 and max iterations without change set to 200).

iteration. The resulting T regressors constitute a single neural network with T×J hidden units. In addition, the formulation derived in this work opens the possibility of creating gradient boosting ensembles composed of base models different from decision trees, as done in previous implementations of gradient boosting.
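As a concrete illustration of this sequential procedure, the following is a minimal sketch assuming squared-error loss, for which the Newton-Raphson step length has a closed form; the function names, the use of scikit-learn's MLPRegressor as the J-unit base learner, and the hyperparameter values are illustrative assumptions rather than the authors' implementation.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    def fit_gbnn(X, y, T=20, J=1, lr=0.5):
        # Constant initial approximation (the bias term of the final network).
        f0 = float(np.mean(y))
        F = np.full(len(y), f0)
        steps = []
        for _ in range(T):
            r = y - F  # pseudo-residuals of the squared loss
            h = MLPRegressor(hidden_layer_sizes=(J,), max_iter=500).fit(X, r)
            pred = h.predict(X)
            # For squared loss, the Newton-Raphson step length reduces to <r, h> / <h, h>.
            rho = float(pred @ r) / (float(pred @ pred) + 1e-12)
            F = F + lr * rho * pred
            steps.append((h, lr * rho))  # partial model and its output weight
        return f0, steps

    def predict_gbnn(f0, steps, X):
        # The T partial models act together as a single hidden layer with T*J units.
        out = np.full(X.shape[0], f0)
        for h, w in steps:
            out += w * h.predict(X)
        return out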

In the analyzed problems, the proposed method achieves a generalization accuracy that converges with the number of combined regressors (or hidden units). This property allows the combined model to be used fully or partially by deactivating the units in reverse order of their creation, depending on the classification speed requirements, and this can be done on the fly at test time (see the sketch below). In addition, we showed that the training complexity is equivalent to that of training a network with a standard solver.
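As a hedged sketch of this on-the-fly adaptation (the array shapes and names below are assumptions, not the authors' code), once the T×J hidden units have been assembled into a single layer, a prediction that uses only the first k units simply ignores the activations and output weights of the most recently trained units:

    import numpy as np

    def predict_with_first_k_units(bias, hidden_activations, output_weights, k):
        # hidden_activations: (n_samples, T*J) hidden-unit outputs, in training order.
        # output_weights:     (T*J,) output-layer weights.
        # Units beyond position k (the last ones trained) are switched off.
        return bias + hidden_activations[:, :k] @ output_weights[:k]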

The proposed method was tested on a variety of classification, regression and image processing tasks. The results show, in general, a performance favorable to the proposed method. The proposed approach achieved the best overall average rank in the tested classification and regression problems, with statistically significant differences with respect to the SGD and L-BFGS approaches. In addition, for the deep models a transfer learning approach was followed, and the results were favorable to the proposed method in some of the tasks, although the differences were small. Notwithstanding, the proposed iterative training procedure opens novel alternatives for training neural networks. This is particularly evident for regression tasks, where the proposed method achieved the best result in most of the analyzed datasets.

REFERENCES

[1] Ismail Babajide Mustapha and Faisal Saeed, “Bioactive molecule predic-

tion using extreme gradient boosting,” Molecules, vol. 21, no. 8, 2016.

[2] Alberto Torres-Barrán, Álvaro Alonso, and José R. Dorronsoro, “Re-

gression tree ensembles for wind energy and solar radiation prediction,”

Neurocomputing (2017), 2017.

[3] N. Mirabal, E. Charles, E. C. Ferrara, P. L. Gonthier, A. K. Harding, M. A.

Sánchez-Conde, and D. J. Thompson, “3fgl demographics outside the

galactic plane using supervised machine learning: Pulsar and dark matter

subhalo interpretations,” The Astrophysical Journal, vol. 825, no. 1, pp.

69, 2016.

[4] Xiyue Jia, Yining Cao, David O’Connor, Jin Zhu, Daniel CW Tsang,

Bin Zou, and Deyi Hou, “Mapping soil pollution by using drone image

recognition and machine learning at an arsenic-contaminated agricultural

ﬁeld,” Environmental Pollution, vol. 270, pp. 116281, 2021.

[5] Manuel Fernández-Delgado, Eva Cernadas, Senén Barro, and Dinani

Amorim, “Do we need hundreds of classiﬁers to solve real world

classiﬁcation problems?,” Journal of Machine Learning Research, vol. 15,

pp. 3133–3181, 2014.

[6] Rich Caruana and Alexandru Niculescu-Mizil, “An empirical comparison

of supervised learning algorithms,” in ICML ’06: Proceedings of the 23rd

international conference on Machine learning, New York, NY, USA, 2006,

pp. 161–168, ACM Press.

[7] Holger Schwenk and Yoshua Bengio, “Boosting neural networks,” Neural

Computation, vol. 12, no. 8, pp. 1869–1887, 2000.

[8] Mohammad Moghimi, Mohammad Saberian, Jian Yang, Li-Jia Li, Nuno

Vasconcelos, and Serge Belongie, “Boosted convolutional neural net-

works,” in British Machine Vision Conference (BMVC), York, UK, 2016.

[9] Jerome H. Friedman, “Greedy function approximation: a Gradient Boost-

ing machine,” The Annals of Statistics, vol. 29, no. 5, pp. 1189 – 1232,

2001.

[10] Jerome Friedman, Trevor Hastie, Robert Tibshirani, et al., “Additive

logistic regression: a statistical view of boosting (with discussion and a

rejoinder by the authors),” The annals of statistics, vol. 28, no. 2, pp. 337–

407, 2000.

[11] Tianqi Chen and Carlos Guestrin, “Xgboost: A scalable tree boosting sys-

tem,” in Proceedings of the 22Nd ACM SIGKDD International Conference

on Knowledge Discovery and Data Mining, New York, NY, USA, 2016,

KDD ’16, pp. 785–794, ACM.

[12] Leo Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp.

5–32, 2001.

[13] Jerome H. Friedman, “Stochastic gradient boosting,” Computational

Statistics & Data Analysis, vol. 38, no. 4, pp. 367 – 378, 2002, Nonlinear

Methods and Data Mining.

[14] Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika

Dorogush, and Andrey Gulin, “Catboost: unbiased boosting with categori-

cal features,” in Advances in neural information processing systems, 2018,

pp. 6638–6648.

[15] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong

Ma, Qiwei Ye, and Tie-Yan Liu, “Lightgbm: A highly efﬁcient gradient

boosting decision tree,” in Advances in neural information processing

systems, 2017, pp. 3146–3154.

[16] Candice Bentéjac, Anna Csörgő, and Gonzalo Martínez-Muñoz, "A comparative analysis of gradient boosting algorithms," Artificial Intelligence Review, vol. (in press), 2020.

[17] Yoav Freund and Robert E Schapire, “A decision-theoretic generalization

of on-line learning and an application to boosting,” Journal of computer

and system sciences, vol. 55, no. 1, pp. 119–139, 1997.

[18] Trevor Hastie, Saharon Rosset, Ji Zhu, and Hui Zou, “Multi-class ad-

aboost,” Statistics and its Interface, vol. 2, no. 3, pp. 349–360, 2009.

[19] Harris Drucker, “Improving regressors using boosting techniques,” in

ICML. Citeseer, 1997, vol. 97, pp. 107–115.

[20] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton, “Deep learning,”

Nature, vol. 521, pp. 436–444, 2015.

[21] Jürgen Schmidhuber, “Deep learning in neural networks: An overview,”

Neural Networks, vol. 61, pp. 85 – 117, 2015.

[22] Chongsheng Zhang, Changchang Liu, Xiangliang Zhang, and George

Almpanidis, “An up-to-date comparison of state-of-the-art classiﬁcation

algorithms,” Expert Systems with Applications, vol. 82, pp. 128–150,

2017.

[23] Ravid Shwartz-Ziv and Amitai Armon, “Tabular data: Deep learning is not

all you need,” Information Fusion, vol. 81, pp. 84–90, 2022.

[24] Johannes Welbl, “Casting random forests as artiﬁcial neural networks (and

proﬁting from it),” in Pattern Recognition, Xiaoyi Jiang, Joachim Horneg-

ger, and Reinhard Koch, Eds. 2014, Springer International Publishing.

[25] Gérard Biau, Erwan Scornet, and Johannes Welbl, “Neural random

forests,” Sankhya A, vol. In press, 2018.

[26] Peter Kontschieder, Madalina Fiterau, Antonio Criminisi, and Samuel

Rota Bulo, “Deep neural decision forests,” in The IEEE International

Conference on Computer Vision (ICCV), December 2015.

[27] Atsushi Nitanda and Taiji Suzuki, “Functional gradient boosting based

on residual network perception,” in International Conference on Machine

Learning. PMLR, 2018, pp. 3819–3828.

[28] Furong Huang, Jordan Ash, John Langford, and Robert Schapire, “Learn-

ing deep resnet blocks sequentially using boosting theory,” in International

Conference on Machine Learning. PMLR, 2018, pp. 2058–2067.

[29] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual

learning for image recognition,” in Proceedings of the IEEE conference on

computer vision and pattern recognition, 2016, pp. 770–778.

[30] Llew Mason, Jonathan Baxter, Peter Bartlett, and Marcus Frean, “Boosting

algorithms as gradient descent,” Advances in neural information process-

ing systems, vol. 12, 1999.

[31] Yoshua Bengio, Nicolas Roux, Pascal Vincent, Olivier Delalleau, and

Patrice Marcotte, “Convex neural networks,” Advances in neural infor-

mation processing systems, vol. 18, 2005.

[32] Sashank J Reddi, Satyen Kale, and Sanjiv Kumar, “On the convergence of

adam and beyond,” arXiv preprint arXiv:1904.09237, 2019.

[33] Aaron Defazio and Samy Jelassi, “Adaptivity without compromise: A

momentumized, adaptive, dual averaged gradient method for stochastic

optimization,” arXiv preprint arXiv:2101.11075, 2021.

[34] Jinghui Chen, Dongruo Zhou, Yiqi Tang, Ziyan Yang, Yuan Cao, and

Quanquan Gu, “Closing the generalization gap of adaptive gradient meth-

ods in training deep neural networks,” arXiv preprint arXiv:1806.06763,

2018.

[35] L. Breiman, “Bias, variance, and arcing classiﬁers,” Tech. Rep. 460,

Statistics Department, University of California, 1996.

[36] M. Lichman, “UCI machine learning repository,” 2013.

[37] Li Deng, “The mnist database of handwritten digit images for machine

learning research [best of the web],” IEEE Signal Processing Magazine,

vol. 29, no. 6, pp. 141–142, 2012.

[38] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton, "Cifar-10 (canadian institute for advanced research)," URL http://www.cs.toronto.edu/kriz/cifar.html, vol. 5, 2010.


[39] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel,

M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Pas-

sos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-

learn: Machine learning in Python,” Journal of Machine Learning Re-

search, vol. 12, pp. 2825–2830, 2011.

[40] Francois Chollet et al., “Keras,” 2015.

[41] Janez Demšar, “Statistical comparisons of classiﬁers over multiple data

sets,” Journal of Machine Learning Research, vol. 7, pp. 1–30, 2006.

[42] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and

Zbigniew Wojna, “Rethinking the inception architecture for computer

vision,” in Proceedings of the IEEE conference on computer vision and

pattern recognition, 2016, pp. 2818–2826.

[43] Karen Simonyan and Andrew Zisserman, “Very deep convolutional net-

works for large-scale image recognition,” arXiv preprint arXiv:1409.1556,

2014.

[44] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei,

“Imagenet: A large-scale hierarchical image database,” in 2009 IEEE

conference on computer vision and pattern recognition. Ieee, 2009, pp.

248–255.

SEYEDSAMAN EMAMI holds a Bachelor of

Science degree in Industrial Engineering (2012)

and a Master of Science degree in Financial En-

gineering (2018), both from Islamic Azad Univer-

sity. He is currently pursuing a Ph.D. degree in

Computer and Telecommunications Engineering

with a specialization in Information Engineering

at Universidad Autónoma de Madrid (UAM). The

author’s research interests are focused on ma-

chine learning topics, such as ensemble learning,

Bayesian Networks, deep learning frameworks, and the application of ma-

chine learning in the domains of science, media, and education. Mr. Emami

also gained valuable practical experience through collaborations with indus-

try partners and participation in research projects. He has demonstrated strong problem-solving skills, attention to detail, and the ability to work collaboratively with diverse teams. Overall, he is a dedicated and accomplished researcher with a strong track record in machine learning, and his work promises to make significant contributions to the field.

GONZALO MARTíNEZ-MUÑOZ received the

university degree in Physics (1995) and Ph.D. de-

gree in Computer Science (2006) from the Universidad Autónoma de Madrid (UAM). From

1996 to 2002, he worked in industry. Until 2008 he

was an interim assistant professor in the Computer

Science Department of the UAM. During 2008/09,

he worked as a Fulbright post-doc researcher at

Oregon State University in the group of Professor

Thomas G. Dietterich, a world reference in the field

of Machine Learning. He is currently an Assistant Professor in the Computer Science Department at UAM, as well as coordinator of the master's program

on Bioinformatics and Computational Biology. In addition, he is the principal investigator of several research projects related to AI. His research interests include

machine learning, decision trees, and ensemble learning, and applications

of machine learning to science, sports and education. Mr. Martínez-Muñoz

has published his research in some of the journals with highest impact in

machine learning.
