BOOSTING THE PERFORMANCE OF DEEP LEARNING: A GRADIENT BOOSTING
APPROACH TO TRAINING CONVOLUTIONAL AND DEEP NEURAL NETWORKS
Seyedsaman Emami and Gonzalo Martínez-Muñoz
Escuela Politécnica Superior, Universidad Autónoma de Madrid, Madrid, Spain
ABSTRACT
Deep learning has revolutionized the computer vision and image classification domains. In this context, Convolutional Neural Network (CNN) based architectures are the most widely applied models. In this article, we introduce two procedures for training Convolutional Neural Networks (CNNs) and Deep Neural Networks based on Gradient Boosting (GB), namely GB-CNN and GB-DNN. These models are trained to fit the gradient of the loss function, or pseudo-residuals, of the previous models. At each iteration, the proposed method adds one dense layer to an exact copy of the previous deep NN model. The weights of the dense layers trained in previous iterations are frozen to prevent overfitting, permitting the model to fit the new dense layer as well as to fine-tune the convolutional layers (for GB-CNN) while still utilizing the information already learned. Through extensive experimentation on different 2D-image classification and tabular datasets, the presented models show superior performance in terms of classification accuracy with respect to a standard CNN and Deep-NN with the same architecture.
Index Terms— Convolutional Neural Network, Deep
Neural Network, Gradient Boosting Machine
1. INTRODUCTION
Deep learning designs predictive frameworks using Artificial Neural Networks (ANNs). The concept of deep learning gained even more attention after the introduction of the AlexNet model [1]. A wide variety of architectures and topologies of deep network models can be constructed by combining different layer types in the model. Likewise, Convolutional Neural Networks (CNNs) have been widely adopted for diverse computer vision tasks, including image classification [2], object detection [3], anomaly detection [4] and segmentation [5], where they have demonstrated impressive performance. Convolutional models have proven successful in image classification and object recognition, as demonstrated in studies such as ResNet [6], Very Deep Convolutional Networks [7], and Understanding Convolutional Networks [8]. These studies show that by increasing the number of convolutional layers, the model becomes more robust in the feature extraction process [7]; however, deeper models present numerical difficulties during training [6, 9]. The non-linearity of the network at each layer reduces the gradients, which can lead to very slow training [9, 10]. One solution is to use batch normalization [9] as an intermediate layer in the network, which helps to stabilize training and reduce the number of training epochs. Another solution is given by Residual Neural Networks (ResNet) [6], which can stack even hundreds of layers, skipping the non-linearity by passing the information of previous layers directly. It is worth noting that ResNet also uses batch normalization layers.
(The authors acknowledge financial support from PID2019-106827GB-I00 / AEI / 10.13039/501100011033.)
In another line of work, Gradient Boosting Machine (GBM) decision tree ensembles [11-14] have become the state of the art for solving tabular classification and regression tasks [15, 16]. GBMs work by training a sequence of regressor models that sequentially learn the information not captured by previous models. This is done by computing the gradients of the loss on the training data with respect to the output of the previous iteration and fitting the following model to those gradient values, or pseudo-residuals. The final model combines all generated models in an additive manner. These ideas have also been applied to sequentially train Neural Networks [17-20]. In [19], a gradient boosting based approach that uses a weight estimation model to classify image labels is proposed. The model is designed to mimic the ResNet deep neural network architecture and uses boosting functional gradient minimization [12]. In addition, it also involves formulating linear classifiers and feature extraction, where the feature extraction produces input for the linear classifiers, and the resulting approximated values are stacked in a ResNet layer.
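As an illustration of the generic gradient boosting recipe just described (and not of the method proposed in this paper), the following sketch fits regression trees to the pseudo-residuals of a squared loss and combines them additively with a shrinkage factor. The use of scikit-learn's DecisionTreeRegressor as the base learner and the depth, number of stages and learning rate are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbm_fit(X, y, n_stages=100, lr=0.1):
    """Generic GBM: each stage fits the pseudo-residuals (negative gradient
    of the loss) left by the additive model built so far.  With squared
    loss the pseudo-residuals are simply y - F(x)."""
    f0 = float(np.mean(y))                    # initial constant model
    F = np.full(len(y), f0)
    stages = []
    for _ in range(n_stages):
        residuals = y - F                     # pseudo-residuals
        h = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
        F += lr * h.predict(X)                # additive update with shrinkage lr
        stages.append(h)
    return f0, stages

def gbm_predict(f0, stages, X, lr=0.1):
    """The final model is the additive combination of all fitted stages."""
    F = np.full(X.shape[0], f0)
    for h in stages:
        F += lr * h.predict(X)
    return F
```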
In this paper, we propose a novel deep learning training framework based on gradient boosting, which includes two structures: GB-CNN and GB-DNN. GB-CNN, or Gradient Boosted Convolutional Neural Network, is a CNN training architecture based on convolutional layers. On the other hand, GB-DNN is a simpler architecture based only on dense layers, more suited to tabular data. Both architectures explicitly use the gradient boosting procedure to build a set of embedded NNs in depth. The approach adds one dense layer at each iteration to a copy of the previous network. Then, the previous dense layers are frozen and the weights of the newly added dense layer are trained on the residuals of the previous iteration. In the case of GB-CNN, the previously trained convolutional layers are also fine-tuned at each boosting iteration. Freezing the weights of the previous dense layers leaves the untrained weights of the new dense layer free to fit the residuals.
The rest of the paper is organized as follows: in Section 2, we review related works in the field; in Section 3, we provide an overview of the proposed approach; in Section 4, we describe the experimental setup; Section 5 presents the results of the experiments; finally, Section 6 concludes the paper.
2. RELATED WORKS
Several studies have used the ideas of gradient boosting optimization to build Neural Networks. For instance, in [20] a convex optimization model is proposed for training a shallow Neural Network that can reach the global optimum. This is done by adding one hidden neuron at a time to the network, and re-optimizing the whole network including an L1 regularization on the top layer. This top layer serves as a regularizer that effectively removes neurons. However, the model is computationally feasible only for a very small number of input attributes; in fact, the method is tested experimentally only on 2D datasets. In another line of work, a shallow neural network is sequentially trained as an additive expansion using gradient boosting [21]. The weights of the trained models are stored to form a final neural network. The idea is to build a single network using a sequential approach, avoiding an ensemble of networks. That work is specific to tabular multi-output regression problems, whereas the method proposed in this article is a deep architecture valid for both tabular and structured datasets, such as image datasets.
Furthermore, a development of ResNet in the context of boosting theory was proposed in [18]. This model, called BoostResNet, uses residual blocks that are trained during boosting iterations based on [22]. BoostResNet builds an ensemble of shallow blocks. In a similar proposal, a deep ResNet-like model (ResFGB) is developed in depth by using a linear classifier and gradient-boosting loss minimization [19]. The proposed method differs from these studies in several aspects. In contrast to the work presented in [18, 19], the proposed method is based on gradient boosting [11]. Contrary to [19], the underlying architecture of our work can be any standard deep architecture (with or without convolutional layers) rather than an ad-hoc specific network block like ResNet. When trained with convolutional layers, the proposed method includes a series of dense layers trained sequentially while jointly fine-tuning previously fitted convolutional layers. In addition, the proposed method reduces the non-linearity complexity by freezing the previously trained dense layers. On the other hand, ResFGB uses a combination of a linear classifier and a feature extractor, which is updated by stacking a ResNet-type layer at each iteration. The study in [18] built a boosted ResNet (BoostResNet) layer by layer over features; however, it is based on a different boosting framework [22]. The advantage of BoostResNet over standard ResNet is its lower computational complexity, although the reported performance is not consistently better than that of ResNet.
3. METHODOLOGY
In this paper we propose a methodology to train a set of Deep Neural Network models as an additive expansion trained on the residuals of a given loss function. The procedure works by sequentially adding a new dense layer to a copy of the previously trained deep NN, with all dense layers from previous iterations frozen. The motivation behind freezing the trained dense layers is that the model is less likely to adapt to the noise in the data, thus avoiding overfitting. Based on this idea, two deep architectures are proposed: one using convolutional layers (GB-CNN) and one with only dense layers (GB-DNN). In the case of GB-CNN, at each iteration, the model learns from the errors made by the previous dense layers while also fine-tuning the parameters of the convolutional layers. In the GB-DNN architecture, only the newly inserted dense layer is trained at each iteration while all previously trained dense layers remain frozen. In the first subsection, we define the mathematical framework of GB-CNN and GB-DNN, followed by a description of the CNN structure used in the GB-CNN method.
3.1. Gradient Boosted Convolutional and Deep Neural Network
For the GB-CNN architecture, we assume training and test instances D = {(X_i, y_i)}_{i=1}^N with inputs X_i ∈ R^{B × P_H × P_W × Ch}, a 4-dimensional tensor, where the first dimension B corresponds to the batch size, the second and third dimensions P_H × P_W represent the height and width of each image respectively, and the fourth dimension Ch represents the number of color channels. The labels y_i are characterized by a one-hot encoding vector y_i = [y_{i,1}, y_{i,2}, ..., y_{i,K}], where K is the number of classes, y_{i,j} = 1 if the i-th data point belongs to class j, and y_{i,j} = 0 otherwise. For the GB-DNN model, we assume training and test instances D = {(X_i, y_i)}_{i=1}^N with inputs X_i ∈ R^{F_e} with F_e features and response variables y_i ∈ [1, K] with K attributes, respectively.
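For concreteness, the following small NumPy sketch (with arbitrary example dimensions) builds an input tensor with the B × P_H × P_W × Ch shape assumed above and the corresponding one-hot label matrix.

```python
import numpy as np

B, P_H, P_W, Ch, K = 128, 32, 32, 3, 10          # example dimensions only
X_batch = np.random.rand(B, P_H, P_W, Ch).astype("float32")  # 4-D image tensor

classes = np.random.randint(0, K, size=B)        # integer class of each instance
Y = np.eye(K, dtype="float32")[classes]          # one-hot: Y[i, j] = 1 iff instance i belongs to class j
assert X_batch.shape == (B, P_H, P_W, Ch) and Y.shape == (B, K)
```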
The idea is to learn and update the trainable parameters of the model to accurately estimate the class of unseen data by minimizing the cross-entropy loss function ℓ(y, p)

\ell(y_i, P_i) = -\sum_{k=1}^{K} y_{i,k} \log p_{i,k},    (1)

where P = {{p_{i,k}}_{k=1}^{K}}_{i=1}^{N} is a probability matrix and p_{i,k} is the estimated probability that the i-th instance belongs to class k. These probabilities are obtained from the raw outputs of the trained regression networks, F_k, by applying a softmax function to the outputs of the additive model

p_k(\cdot) = \frac{\exp(F_k(\cdot))}{\sum_{l=1}^{K} \exp(F_l(\cdot))}.    (2)

The final model is built as an additive model of the outputs

F_t(X_i) = F_{t-1}(X_i) + \rho_t S_t(X_i),    (3)

where ρ_t is a vector of weights for each class of the t-th additive model S_t(X_i). This additive model is built in a stepwise manner. In GB-CNN, first, a series of layers, including convolution, activation, pooling, batch normalization, dense and flattening layers, is defined as the function C

C(X_i; \Omega) = H(X_i),    (4)

where Ω is the set of trainable parameters of the model, which includes the weights and biases of all layers, and H is the sequence of operations performed by the various layers, which will be explained in the following section. In GB-DNN, instead, a first dense layer is initialized with random weights. The model construction proceeds by adding a new linear transformation s_t with an activation function, representing the t-th dense layer (boosting iteration)

S_t(M_i; W_t) = \mathrm{ReLU}(W_t M_i + b_t),    (5)

where W_t is the weight matrix and b_t is the bias vector of the t-th dense layer, M_i is the feature-mapped output of the previous dense layer for the i-th instance, and ReLU(x) = max(0, x) is the activation function of the t-th dense layer.

Hence, we can define the additive model of the t-th boosting iteration to train the parameters of the proposed model

S_t = S_t(S_{t-1}(C(X_i; \Omega_t)); W_t)    (6)

by minimizing the loss function ℓ (Eq. 1) using the following objective function

(\rho_t, S_t) = \operatorname*{argmin}_{(\rho_t, S_t)} \sum_{i=1}^{N} \ell\left(y_i, P_{t-1}(X_i) + \rho_t S_t\right),    (7)

where the trainable parameters of s_{t-1} were trained in the (t-1)-th iteration and are now frozen. The optimization process is the same for GB-CNN and GB-DNN, although GB-CNN fine-tunes the convolutional layers at each iteration whereas GB-DNN only trains the newly added layer.

The optimization problem (Eq. 7) proceeds by training the new model on the pseudo-residuals of the previous iteration, r_{i,t-1}, which are parallel to the gradient of the loss,

r_{i,t-1} = -\left[ \frac{\partial \ell(y_i, F(X_i))}{\partial F(X_i)} \right]_{F(X_i) = F_{t-1}(X_i)},    (8)

and updating ρ_t for the K class labels with a line search optimization

f(\rho_t) = \sum_{i=1}^{N} \sum_{j=1}^{K} \ln\left( p_{i,j} + \rho_{t,j} \hat{P}_{i,j} \right).    (9)

Given the ρ vector for all boosting iterations, the trainable parameters Ω of the model C (Eq. 4) are updated using the estimated weights ρ

\Omega_t = \rho_t \Omega'_t, \quad t \in [1, T],    (10)

where Ω'_t is the previous layer output vector.

Finally, we apply a regularization term to the proposed models, namely the shrinkage rate, a constant value ν ∈ (0, 1] that sets the contribution of each additive model in the training procedure (ν × ρ_t S_t) and prevents overfitting [11]. A schematic description of the proposed method is shown in Algorithm 1.
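The quantities in Eqs. (1), (2), (3) and (8) can be made concrete with the short NumPy sketch below; this is an illustration of the notation rather than the authors' implementation, and S_t and rho_t are random stand-ins. For the softmax/cross-entropy pair, the negative gradient in Eq. (8) reduces to the difference between the one-hot labels and the predicted probabilities.

```python
import numpy as np

def softmax(F):
    """Eq. (2): class probabilities from the raw additive outputs F (N x K)."""
    e = np.exp(F - F.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(Y, P):
    """Eq. (1): multi-class cross-entropy for one-hot labels Y (N x K)."""
    return -np.mean(np.sum(Y * np.log(P + 1e-12), axis=1))

def pseudo_residuals(Y, F):
    """Eq. (8): negative gradient of the loss w.r.t. the raw outputs,
    which for softmax + cross-entropy is Y - softmax(F)."""
    return Y - softmax(F)

rng = np.random.default_rng(0)
N, K, nu = 8, 3, 0.1                              # nu is the shrinkage rate
Y = np.eye(K)[rng.integers(0, K, N)]              # one-hot labels
F_prev = rng.normal(size=(N, K))                  # raw outputs of the previous model
r = pseudo_residuals(Y, F_prev)                   # regression targets for the next model
S_t = rng.normal(size=(N, K))                     # stand-in for the new model's raw outputs
rho_t = np.ones(K)                                # stand-in for the line-searched class weights
F_t = F_prev + nu * rho_t * S_t                   # Eq. (3) with shrinkage
print(cross_entropy(Y, softmax(F_prev)), cross_entropy(Y, softmax(F_t)))
```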
3.2. Convolutional layers design
Despite the fact that the proposed method could be applied
to different CNN architectures, in the following, we describe
the design of convolutional and dense layers for the model
applied in the experiments.
The applied CNN architecture consists of a sequence of three blocks. Each block has two 2D-convolutional layers, a batch normalization layer, a max pooling layer and a dropout layer. The dense layers are then connected to the last of these three blocks. The configuration of the blocks is the following. The two 2D-convolutional layers of the first block have 32 filters and use ReLU as the activation function. The subsequent block has two more 2D-convolutional layers with 64 filters, and the two 2D-convolutional layers of the final block have 128 filters. All 2D-convolutional layers use filters of size 3x3. After the second convolutional layer of each block, batch normalization is applied to stabilize the distribution of activations. Then a max pooling layer of size 2x2 is applied to the output to reduce the spatial size of the feature maps and increase the invariance to small translations, and finally a dropout layer is included to prevent overfitting, with rates 0.2, 0.3 and 0.4 for the three blocks respectively. Finally, the output is flattened and connected to the fully connected dense layers.
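As an example, the convolutional backbone described above can be written with the Keras functional API as follows. This is a sketch under the stated configuration; the padding and the input size are assumptions, since the text does not fix them, and it is not necessarily the authors' exact implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_backbone(input_shape=(32, 32, 3)):
    """Three blocks of two 3x3 convolutions (32, 64 and 128 filters, ReLU),
    batch normalization, 2x2 max pooling and dropout (0.2, 0.3, 0.4),
    followed by flattening; the dense layers are attached on top of this."""
    inputs = layers.Input(shape=input_shape)
    x = inputs
    for filters, drop_rate in [(32, 0.2), (64, 0.3), (128, 0.4)]:
        x = layers.Conv2D(filters, (3, 3), padding="same", activation="relu")(x)
        x = layers.Conv2D(filters, (3, 3), padding="same", activation="relu")(x)
        x = layers.BatchNormalization()(x)
        x = layers.MaxPooling2D((2, 2))(x)
        x = layers.Dropout(drop_rate)(x)
    x = layers.Flatten()(x)
    return models.Model(inputs, x, name="conv_backbone")
```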
The dense layers are added and trained iteratively while freezing the previously added dense layers. This model is compared to the same architecture trained jointly as a standard CNN model. The second proposed architecture considers only the iterative training of a network composed exclusively of dense layers.
Algorithm 1 Training procedure of GB-CNN.
Input: Image data and related labels D = {x_i, y_i}_{i=1}^N.
Input: Number of boosting iterations T.
Input: Number of training epochs on mini-batches E.
Input: Gradient boosting loss function ℓ (Eq. 1).
Image batch generator G(D) = {(x_i, y_i)}_{i=1}^N.
for e = 0 to E do
    Fit the GB-CNN with one dense layer to the image residuals.
    if the training additive loss converges then
        break
    end if
end for
Update the trainable parameters Ω_0.
Freeze the added dense layer's parameters ω_0.
for t = 1 to T - 1 do
    Add a new dense layer.
    for e = 0 to E do
        Fit the GB-CNN to the image residuals.
        if the training additive loss converges then
            break
        end if
    end for
    Update the trainable parameters Ω_t.
    Freeze the added dense layer's parameters ω_t.
    if the training gradient boosting loss converges then
        break
    end if
end for
Update the CNN's weights using Eq. 10.
return A fully trained and fine-tuned GB-CNN network.
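A minimal Keras-style sketch of the loop body of Algorithm 1 is given below. It is an assumption-laden illustration: the helper names, the choice of Adam as optimizer, the mean-squared-error fit to the pseudo-residuals and the way the new dense layer is attached are simplifications of the procedure described above, not the authors' code.

```python
import tensorflow as tf
from tensorflow.keras import layers, models, optimizers

def add_boosting_layer(model, n_classes, units=20):
    """Copy the current network, freeze its dense layers (convolutional
    layers stay trainable, so they are fine-tuned in GB-CNN) and append
    one new trainable dense layer plus a fresh raw-output layer."""
    new_model = models.clone_model(model)
    new_model.set_weights(model.get_weights())
    for layer in new_model.layers:
        if isinstance(layer, layers.Dense):
            layer.trainable = False
    x = new_model.layers[-2].output          # assumes the old output layer is last
    x = layers.Dense(units, activation="relu")(x)
    out = layers.Dense(n_classes)(x)         # raw outputs regressed on the pseudo-residuals
    return models.Model(new_model.input, out)

def fit_on_residuals(model, X, residuals, epochs=100, batch_size=128, lr=1e-3):
    """Inner loop of Algorithm 1: regress the expanded network on the
    pseudo-residuals of the current additive model (Eq. 8)."""
    model.compile(optimizer=optimizers.Adam(learning_rate=lr), loss="mse")
    early = tf.keras.callbacks.EarlyStopping(monitor="loss", patience=3)
    model.fit(X, residuals, epochs=epochs, batch_size=batch_size,
              callbacks=[early], verbose=0)
    return model
```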
This process is illustrated in Fig. 1, in which only the dense layers are shown (note that the figure does not represent the actual size of the dense layers used). In the first iteration, a single dense layer (plus the output layer) is trained (dark gray units in the left diagram of Fig. 1). This layer is the first fully connected layer. After fitting this model, the model is copied and a second dense layer is added (iteration 1, second diagram in Fig. 1), freezing the weights of the first dense layer (shown in Fig. 1 with light gray neurons). This second step fine-tunes the parameters of the convolutional layers (if present), skips the training of the previous dense layer, and trains the newly added dense layer (dark gray units). Each new model fits the last dense layer and the convolutional blocks to the corresponding pseudo-residuals. The training procedure continues until convergence.
Fig. 1: A snippet of the iterative model training procedure. Light gray indicates that a layer is frozen, dark gray that a layer is being trained, and white indicates the output layer (also being trained).
4. EXPERIMENTS
In order to evaluate the structure of the proposed methods for Convolutional (GB-CNN) and Deep Neural Networks (GB-DNN), several supervised 2D-image classification tasks and tabular datasets were considered. On the image datasets, the proposed model (GB-CNN) is compared with a CNN model that uses the same architecture and configuration, although with more dense layers (more details below). On the tabular datasets, GB-DNN is compared with a deep neural network with the same settings. The objective of our experiments was to evaluate the performance of the tested methods in terms of accuracy and to determine their usefulness for solving 2D-image and tabular classification problems. The code of the proposed models is made available on GitHub (github.com/GAAUAM/GBCNN).
Regarding the 2D-image classification problem, seven 2D-image datasets with various areas of application, numbers of class labels, instances, pixel resolutions, and color channels are used in this study for the convolutional models, as described in Table 1.
Table 1: Dataset properties for the image classification tasks.

Name                       train/test      P_H × P_W   K    Ch
MNIST [23]                 60,000/10,000   28 × 28     10   1
CIFAR-10 [24]              50,000/10,000   32 × 32     10   3
Rice varieties [25]        56,250/18,750   32 × 32     5    3
Fashion-MNIST [26]         60,000/10,000   28 × 28     10   1
Kuzushiji-MNIST [27]       60,000/10,000   28 × 28     10   1
MNIST-Corrupted [23]       60,000/10,000   28 × 28     10   1
Rock-Paper-Scissors [28]   2,520/370       32 × 32     3    3
In these experiments, a data generator was used to generate batches of training data on the fly during the training process. This allowed us to train the models on large datasets without having to load the entire dataset into GPU memory. Moreover, the generator shuffled the data, applied data augmentation techniques (including rescaling the pixel values), and yielded batches of size 128. The common hyperparameters and settings of GB-CNN and CNN were the same, including the structure of the convolutional layers (described above): the number of hidden neurons in each dense layer was set to 20, the learning rate to 0.001, and training ran for 200 epochs with early stopping on the validation score to prevent overfitting. The shrinkage rate of GB-CNN, used to combine the different generated CNNs, was set to 0.1. As previously described, the proposed method adds one dense layer at each iteration. The maximum number of dense layers/iterations was set to 10, although GB-CNN generally converges after 2 or 3 iterations. Notwithstanding, the number of dense layers of the CNN was left at 10. Furthermore, we conducted an additional experiment in which three image datasets (MNIST, Fashion-MNIST, and CIFAR-10) were analyzed using two dense layers, each with a size of 128 and otherwise identical settings, in order to investigate the impact of the dense layer size on the experimental outcomes.
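A minimal sketch of the on-the-fly batch generation described above is shown below, assuming Keras' ImageDataGenerator and CIFAR-10 as an example dataset; only pixel rescaling and shuffling are shown, since the exact augmentation settings are not detailed in the text.

```python
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
y_train = tf.keras.utils.to_categorical(y_train, 10)
y_test = tf.keras.utils.to_categorical(y_test, 10)

# Generator yielding shuffled, rescaled batches of 128 images on the fly.
datagen = ImageDataGenerator(rescale=1.0 / 255.0)
train_flow = datagen.flow(x_train, y_train, batch_size=128, shuffle=True)

# Early stopping on the validation score, as described above.
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                              restore_best_weights=True)
# model.fit(train_flow, epochs=200,
#           validation_data=(x_test / 255.0, y_test), callbacks=[early_stop])
```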
Moreover, tabular classification datasets from different sources [29, 30] were also tested. The characteristics of these datasets are shown in Table 2. The classification tasks are of different types, ranging from radar data to handwritten digits and vowel pronunciation. The hyperparameter values of the proposed GB-DNN and the standard Deep-NN are identical. For these datasets, the training was performed using 10-fold cross-validation. In order to determine the optimal hyperparameter values for the models, a within-train grid search was applied. The hyperparameters explored were the learning rate of both models (ranging from 0.1 to 0.001) and the shrinkage rate ν of the GB-DNN model (ranging from 0.1 to 1.0). Both models are composed of three dense layers, each with 100 neurons and ReLU as the activation function.
Table 2: Dataset properties for the tabular classification tasks.

Name              Instances   Features   Class labels
Digits [29]       1,797       64         10
Ionosphere [29]   351         34         2
Letter-26 [29]    20,000      16         26
Sonar             208         60         2
USPS [30]         9,298       256        10
Vowel [29]        990         10         11
Waveform [29]     5,000       21         3
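The within-train grid search and 10-fold cross-validation described above can be sketched as follows; `build_model` is a placeholder for a factory that builds a GB-DNN (or the baseline Deep-NN) exposing fit/score methods, and the specific grid points are illustrative, since only the ranges are reported.

```python
import itertools
import numpy as np
from sklearn.model_selection import StratifiedKFold

def grid_search_cv(build_model, X, y, param_grid, n_splits=10, seed=0):
    """Select the hyper-parameter combination with the best cross-validated
    accuracy; intended to be run on the training partition of each outer fold."""
    keys = sorted(param_grid)
    best_acc, best_params = -np.inf, None
    for values in itertools.product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
        accs = []
        for tr, va in cv.split(X, y):
            model = build_model(**params)        # e.g. a GB-DNN wrapper (hypothetical)
            model.fit(X[tr], y[tr])
            accs.append(model.score(X[va], y[va]))
        if np.mean(accs) > best_acc:
            best_acc, best_params = float(np.mean(accs)), params
    return best_params, best_acc

# Example grid (illustrative values within the reported ranges):
# grid_search_cv(make_gbdnn, X, y,
#                {"learning_rate": [0.1, 0.01, 0.001], "nu": [0.1, 0.5, 1.0]})
```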
5. RESULTS

Tables 3, 4 and 5 show the generalization accuracy results estimated on the test subsets for the analyzed datasets and models. Table 3 shows the results for the convolutional models (GB-CNN and CNN) using 20 neurons in the dense hidden layers. Table 4 includes the results for selected image datasets with larger dense layers of 128 neurons. Finally, Table 5 reports the results for the deep models (GB-DNN and Deep-NN) on the tabular datasets. The best results in each table are highlighted with a gray background.

As can be observed in Table 3, the generalization accuracy of the proposed convolutional model (GB-CNN) is higher than that of the baseline CNN model across all tested datasets. The difference is especially large on Rock-Paper-Scissors: the proposed model achieved an accuracy of 87.37%, which is 19.36 points higher than the accuracy of CNN (68.01%). On the other datasets the differences are smaller but consistently in favor of the proposed architecture. For instance, on CIFAR-10, the proposed model achieved an accuracy of 87.65% with respect to the 86.71% achieved by the CNN. The MNIST dataset is well known for its simplicity and high attainable accuracy, and GB-CNN was able to achieve a remarkable accuracy of 99.61%, which is higher than the accuracy of CNN by a very small margin (0.06 points).
Table 3: Accuracy of the GB-CNN and CNN models in the experiment with 20-neuron dense layers. The best results are highlighted in gray.

2D-Image dataset      GB-CNN    CNN
MNIST                 99.61%    99.55%
CIFAR-10              87.65%    86.71%
Rice varieties        99.82%    98.37%
Fashion-MNIST         94.34%    93.61%
Kuzushiji-MNIST       98.40%    96.70%
MNIST-Corrupted       99.58%    99.35%
Rock-Paper-Scissors   87.37%    68.01%

In relation to the experiment involving a dense layer with a size of 128, the accuracy results for the GB-CNN and CNN models are presented in Table 4. These models were trained and evaluated on three distinct datasets: MNIST, CIFAR-10, and Fashion-MNIST. The findings show that the GB-CNN model outperformed the CNN model across all three datasets. Nonetheless, closer examination reveals that the results obtained using 128-neuron dense layers (as shown in Table 4) are slightly inferior to those obtained using 20-neuron layers (as shown in Table 3). This implies that, for these particular datasets, employing a greater number of smaller layers may be more effective in capturing the intended tasks.
Table 4: Accuracy of the GB-CNN and CNN models in the experiment with 128-neuron dense layers. The best results are highlighted in gray.

2D-Image dataset   GB-CNN    CNN
MNIST              99.66%    99.40%
CIFAR-10           86.72%    86.30%
Fashion-MNIST      94.02%    92.38%
The results for the tabular datasets are summarized in Table 5 for GB-DNN and Deep-NN. The results show that GB-DNN outperforms Deep-NN on most of the datasets. In particular, GB-DNN achieved the highest accuracy on the Digits, Ionosphere, Letter-26, USPS, Vowel, and Waveform datasets. Deep-NN, on the other hand, achieved the highest accuracy on the Sonar dataset. As in previous results, the differences are in general small, with the largest differences in Waveform in favor of GB-DNN (2.72-point difference) and in Sonar in favor of Deep-NN (1.45-point difference).
Table 5: Accuracy of the GB-DNN and Deep-NN models. The best results are highlighted in gray.

Tabular dataset   GB-DNN          Deep-NN
Digits            98.22% ±1.24    97.39% ±0.40
Ionosphere        94.87% ±3.08    94.60% ±3.66
Letter-26         95.50% ±0.49    95.29% ±0.63
Sonar             86.55% ±9.03    88.00% ±3.78
USPS              97.55% ±1.06    97.47% ±2.08
Vowel             98.08% ±1.59    96.95% ±0.62
Waveform          85.20% ±1.89    82.38% ±2.37
Fig. 2: GB-CNN accuracy across boosting iterations (train and test data points).
Fig. 3: GB-CNN loss across boosting iterations (train and test data points).
In addition to evaluating the accuracy metric, we analyzed the training loss and accuracy of the compared models (Figs. 2-5 and Figs. 6-7). We further monitored the loss function of the various additive models during the training of GB-CNN. The experiment was conducted on the CIFAR-10 dataset, with both training and test data points. The models were configured with the same hyperparameters, including identical convolutional layers as outlined in Section 3.2, and 10 dense layers with a size of 20. The learning rate was set to 0.001, with a batch size of 128 and 100 training epochs for both models. In the case of GB-CNN, the shrinkage was set to 0.01.

The performance evolution of GB-CNN is shown in Figs. 2-5. Fig. 2 illustrates the train and test accuracy of the model with respect to the number of boosting iterations. In Fig. 3, the average cross-entropy loss across the boosting iterations is shown. Additionally, Figs. 4 and 5 provide an overview of the evolution of the performance of the additive models with respect to the boosting iterations in terms of the mean squared error loss. As can be observed from these plots, the model converges after a few boosting iterations. Specifically, after three boosting iterations the model has converged and the training process could be stopped. In contrast, Fig. 2 illustrates that the GB-CNN model continues to learn and classify more accurately, with a decreasing tendency of the cross-entropy loss (Fig. 3). Overall, these findings provide valuable insights into the performance of the GB-CNN model and its potential for use in various applications.

On the other hand, Figs. 6 and 7 present the performance of the CNN model with respect to the training epochs. Specifically, the train and test accuracy and the MSE loss of the model are depicted in Figs. 6 and 7, respectively. Notably, the CNN model's accuracy, as displayed in Fig. 6, is below 60% during the early stages of training, contrasting with the performance of the GB-CNN model. This outcome reveals the model's inaccuracy in its preliminary training iterations.
Fig. 4: Additive model loss (train data points).
Fig. 6: CNN accuracy.
Fig. 5: Additive model loss (test data points).
Fig. 7: CNN loss.
6. CONCLUSION
This paper presents a novel approach for training a set of deep neural networks based on convolutional (CNN) and dense (DNN) architectures, called GB-CNN and GB-DNN respectively. The proposed training procedures are based on the gradient boosting algorithm, which iteratively trains a set of models in order to learn the information not captured in previous iterations. Additionally, both proposed models employ an inner dense-layer freezing approach to reduce model complexity and non-linearity.

To evaluate the effectiveness of the proposed models, we conducted experiments on various 2D-image and tabular datasets, ranging from radar data to handwritten digits, fashion, and agriculture. Our results demonstrate that the proposed GB-CNN model outperforms traditional CNN models using the same architecture in terms of accuracy on all the analyzed image datasets. Moreover, the GB-DNN model outperforms DNN models in terms of accuracy on most of the studied tabular datasets. The training loss curves also indicate that the proposed models exhibit a lower training loss for the additive models and a faster convergence rate than traditional CNN models. This shows that the proposed GB-CNN and GB-DNN models represent a robust and effective solution for 2D-image and tabular classification tasks. We recommend these models for further research and practical applications, as they offer a promising direction for developing more accurate and efficient CNNs and DNNs in the future.
7. REFERENCES
[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds., pp. 1097-1105. Curran Associates, Inc., 2012.
[2] Dongmei Han, Qigang Liu, and Weiguo Fan, "A new image classification method using CNN transfer learning and web data augmentation," Expert Systems with Applications, vol. 95, pp. 43-56, 2018.
[3] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," Advances in Neural Information Processing Systems, vol. 28, 2015.
[4] Taejoon Kim, Sang C. Suh, Hyunjoo Kim, Jonghyun Kim, and Jinoh Kim, "An encoding technique for CNN-based network anomaly detection," in 2018 IEEE International Conference on Big Data (Big Data). IEEE, 2018, pp. 2960-2965.
[5] Konstantinos Kamnitsas, Christian Ledig, Virginia F. J. Newcombe, Joanna P. Simpson, Andrew D. Kane, David K. Menon, Daniel Rueckert, and Ben Glocker, "Efficient multi-scale 3D CNN with fully connected CRF for accurate brain lesion segmentation," Medical Image Analysis, vol. 36, pp. 61-78, 2017.
[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770-778.
[7] Karen Simonyan and Andrew Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[8] Matthew D. Zeiler and Rob Fergus, "Visualizing and understanding convolutional networks," in European Conference on Computer Vision. Springer, 2014, pp. 818-833.
[9] Sergey Ioffe and Christian Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in International Conference on Machine Learning. PMLR, 2015, pp. 448-456.
[10] Xavier Glorot and Yoshua Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings, 2010, pp. 249-256.
[11] Jerome H. Friedman, "Greedy function approximation: A gradient boosting machine," Annals of Statistics, pp. 1189-1232, 2001.
[12] Llew Mason, Jonathan Baxter, Peter Bartlett, and Marcus Frean, "Boosting algorithms as gradient descent," Advances in Neural Information Processing Systems, vol. 12, 1999.
[13] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu, "LightGBM: A highly efficient gradient boosting decision tree," Advances in Neural Information Processing Systems, vol. 30, 2017.
[14] Tianqi Chen, Tong He, Michael Benesty, Vadim Khotilovich, Yuan Tang, Hyunsu Cho, Kailong Chen, et al., "Xgboost: Extreme gradient boosting," R package version 0.4-2, vol. 1, no. 4, pp. 1-4, 2015.
[15] Ravid Shwartz-Ziv and Amitai Armon, "Tabular data: Deep learning is not all you need," Information Fusion, vol. 81, pp. 84-90, 2022.
[16] Candice Bentéjac, Anna Csörgő, and Gonzalo Martínez-Muñoz, "A comparative analysis of gradient boosting algorithms," Artificial Intelligence Review, vol. 54, pp. 1937-1967, 2021.
[17] Seyedsaman Emami and Gonzalo Martínez-Muñoz, "Multioutput regression neural network training via gradient boosting," 2022, pp. 145-150.
[18] Furong Huang, Jordan Ash, John Langford, and Robert Schapire, "Learning deep ResNet blocks sequentially using boosting theory," in International Conference on Machine Learning. PMLR, 2018, pp. 2058-2067.
[19] Atsushi Nitanda and Taiji Suzuki, "Functional gradient boosting based on residual network perception," in International Conference on Machine Learning. PMLR, 2018, pp. 3819-3828.
[20] Yoshua Bengio, Nicolas Roux, Pascal Vincent, Olivier Delalleau, and Patrice Marcotte, "Convex neural networks," Advances in Neural Information Processing Systems, vol. 18, 2005.
[21] Seyedsaman Emami and Gonzalo Martínez-Muñoz, "Sequential training of neural networks with gradient boosting," arXiv preprint arXiv:1909.12098, 2019.
[22] Yoav Freund and Robert E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119-139, 1997.
[23] Norman Mu and Justin Gilmer, "MNIST-C: A robustness benchmark for computer vision," arXiv preprint arXiv:1906.02337, 2019.
[24] Alex Krizhevsky, "Learning multiple layers of features from tiny images," Tech. Rep., 2009.
[25] Murat Koklu, Ilkay Cinar, and Yavuz Selim Taspinar, "Classification of rice varieties with deep learning methods," Computers and Electronics in Agriculture, vol. 187, pp. 106285, 2021.
[26] Han Xiao, Kashif Rasul, and Roland Vollgraf, "Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms," arXiv preprint arXiv:1708.07747, 2017.
[27] Tarin Clanuwat, Mikel Bober-Irizar, Asanobu Kitamoto, Alex Lamb, Kazuaki Yamamoto, and David Ha, "Deep learning for classical Japanese literature," arXiv preprint arXiv:1812.01718, 2018.
[28] Laurence Moroney, "Rock, paper, scissors dataset," Feb. 2019.
[29] M. Lichman, "UCI machine learning repository," 2013.
[30] Jonathan J. Hull, "A database for handwritten text recognition research," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 16, no. 5, pp. 550-554, 1994.