BOOSTING THE PERFORMANCE OF DEEP LEARNING: A GRADIENT BOOSTING
APPROACH TO TRAINING CONVOLUTIONAL AND DEEP NEURAL NETWORKS
Seyedsaman Emami and Gonzalo Martínez-Muñoz
Escuela Politécnica Superior, Universidad Autónoma de Madrid, Madrid, Spain
ABSTRACT
Deep learning has revolutionized the computer vision and image classification domains. In this context, Convolutional Neural Network (CNN) based architectures are the most widely applied models. In this article, we introduce two procedures for training Convolutional Neural Networks (CNNs) and Deep Neural Networks based on Gradient Boosting (GB), namely GB-CNN and GB-DNN. These models are trained to fit the gradient of the loss function, or pseudo-residuals, of the previous models. At each iteration, the proposed method adds one dense layer to an exact copy of the previous deep NN model. The weights of the dense layers trained in previous iterations are frozen to prevent over-fitting, permitting the model to fit the new dense layer as well as to fine-tune the convolutional layers (for GB-CNN) while still utilizing the information already learned. Through extensive experimentation on different 2D-image classification and tabular datasets, the presented models show superior performance in terms of classification accuracy with respect to standard CNN and Deep-NN models with the same architecture.
Index Terms: Convolutional Neural Network, Deep Neural Network, Gradient Boosting Machine
1. INTRODUCTION
Deep learning designs its frameworks using Artificial Neural Networks (ANNs). The concept of deep learning gained even more attention after the introduction of the AlexNet [1] model. A wide variety of architectures and topologies of deep network models can be constructed by combining different layer types in the model.
Likewise, Convolutional Neural Networks (CNNs) have been
widely adopted for diverse computer vision tasks, including
image classification [2], object detection [3], anomaly detec-
tion [4] and segmentation [5]. These models have demon-
strated impressive performance. Convolutional models have
proven successful in image classification and object recog-
nition, as demonstrated in recent studies such as ResNet [6],
Very Deep Convolutional Networks [7], and Understanding
Convolutional Networks [8]. These studies show that increasing the number of convolutional layers makes the model more robust in the feature extraction process [7]; however, deeper models present numerical difficulties during training [6, 9]. The non-linearity of the network at each layer reduces the gradients, which can lead to very slow training [9, 10]. One solution is to use batch normalization [9] as an intermediate layer of the network, which helps to stabilize training and reduce the number of training epochs. Another solution is given by the Residual Neural Network (ResNet) [6], which can stack even hundreds of layers, skipping the non-linearity by passing the information of previous layers directly forward. It is worth noting that ResNet also uses batch normalization layers.
The authors acknowledge financial support from PID2019-106827GBI00 / AEI / 10.13039/501100011033.
In another line of work, the Gradient Boosting Machines
(GBMs) decision tree ensembles [11–14] have become the
state of the art for solving tabular classification and regression
tasks [15,16]. GBMs work by training a sequence of regressor
models that sequentially learn the information not learnt by
previous models. This is done by computing, on the training data, the gradient of the loss with respect to the output of the previous iteration, and by fitting the following model to those gradient values or pseudo-residuals. The final model combines all the generated models in an additive manner. These ideas have also been ap-
plied to sequentially train Neural Networks [17–20]. In [19],
a gradient boosting based approach that uses a weight estima-
tion model to classify image labels is proposed. The model is
designed to mimic the ResNet deep neural network architec-
ture and uses boosting functional gradient minimization [12].
In addition, it also involves formulating linear classifiers and
feature extraction, where the feature extraction produces in-
put for the linear classifiers, and the resulting approximated
values are stacked in a ResNet layer.
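As a concrete illustration of this generic boosting procedure (our own sketch, not code from any of the cited works; all function names are ours), the following Python snippet implements gradient boosting for squared-error regression, where the pseudo-residuals are simply the current residuals and each new weak learner (here a small decision tree) is fit to them before being added with a shrinkage factor.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_stages=100, lr=0.1):
    """Generic gradient boosting sketch for squared loss.

    For L(y, F) = 0.5 * (y - F)**2 the negative gradient (pseudo-residual)
    is simply y - F, so each stage fits a small tree to the current residuals.
    """
    F = np.full(len(y), y.mean())           # initial constant model
    stages = []
    for _ in range(n_stages):
        residuals = y - F                   # pseudo-residuals = -dL/dF
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
        F += lr * tree.predict(X)           # additive update with shrinkage lr
        stages.append(tree)
    return y.mean(), stages

def gradient_boost_predict(init, stages, X, lr=0.1):
    F = np.full(X.shape[0], init)
    for tree in stages:
        F += lr * tree.predict(X)
    return F
```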
In this paper, we propose a novel deep learning training framework based on gradient boosting, which includes two structures: GB-CNN and GB-DNN. GB-CNN, or Gradient Boosted Convolutional Neural Network, is a CNN training architecture based on convolutional layers. On the other hand, GB-DNN is a simpler architecture based only on dense layers, more suited to tabular data. Both architectures
explicitly use the gradient boosting procedure to build a set of
embedded NNs in depth. The approach adds one dense layer
at each iteration to a copy of the previous network. Then, the
previous dense layers are frozen and the weights of the new
added dense layer are trained on the residuals of the previous
iteration. In the case of GB-CNN, the previously trained convolutional layers are also fine-tuned at each boosting iteration. The weights of the previous dense layers are frozen so that training concentrates on fitting the untrained weights of the newly added dense layer.
The rest of the paper is organized as follows: In Section
2, we review related works in the field. In Section 3, we pro-
vide an overview of our proposed approach. In Section 4, we describe our experimental setup, and in Section 5 we present the results of our experiments. Finally, Section 6 concludes the paper.
2. RELATED WORKS
Several studies have used the ideas of gradient boosting opti-
mization to build Neural Networks. For instance, in [20] they
propose a convex optimization model for training a shallow
Neural Network that could reach the global optimum. This
is done by adding one hidden neuron at a time to the net-
work, and re-optimizing the whole network by including an L1 regularization on the top layer. This top layer serves as a regularizer that effectively removes neurons. However, the model is computationally feasible only for a very small number of input attributes. In fact, the method is tested experimentally only on 2D datasets. In another line of work, a shallow neural network
is sequentially trained as an additive expansion using gradient
boosting [21]. The weights of the trained models are stored
to form a final neural network. Their idea is to build a sin-
gle network using a sequential approach, thus avoiding an ensemble of networks. That work is specific to tabular multi-output regression problems, whereas the method proposed in this article is a deep architecture valid for both tabular and structured datasets, such as image datasets.
Furthermore, a development of ResNet in the context of
boosting theory was proposed in [18]. This model, called BoostResNet, uses residual blocks that are trained during boosting iterations based on [22]. BoostResNet builds an ensemble of shallow blocks. In a similar proposal, a deep
ResNet-like model (ResFGB) is developed in depth by using a
linear classifier and gradient-boosting loss minimization [19].
The proposed method differs from these studies in several aspects. In contrast to the work presented in [18, 19], the
proposed method is based on gradient boosting [11]. Con-
trary to [19], our approach can be used with any standard deep architecture (with or without convolutional layers) rather than an ad-hoc specific network block
like ResNet. When trained with convolutional layers, the pro-
posed method includes a series of dense layers trained se-
quentially while jointly fine-tuning previously fitted convo-
lutional layers. In addition, the proposed method reduces the
non-linearity complexity by freezing the previously trained
dense layers. On the other hand, ResFGB uses a combination of linear classifiers and a feature extractor, which is updated by stacking a ResNet-type layer at each iteration. The study in [18] builds a ResNet layer by layer through boosting (BoostResNet) over features; however, it is based on a different boosting framework [22]. The advantage of BoostResNet over standard ResNet is its lower computational complexity, although the reported performance is not consistently better than that of ResNet.
3. METHODOLOGY
In this paper we propose a methodology to train a set of Deep
Neural Network models as an additive expansion trained on the residuals of a given loss function. The procedure works by
adding a new dense layer sequentially to a copy of the pre-
viously trained deep NN with all dense layers from previous
iterations frozen. The motivation behind freezing the trained dense layers is that the model is less likely to adapt to the noise in the data, which helps avoid overfitting. Based on this idea two
deep architectures are proposed: one using convolutional lay-
ers (GB-CNN) and one with only dense layers (GB-DNN). In
the case of GB-CNN, at each iteration, the model learns from
the errors made by the previous dense layers, while also fine-tuning the parameters of the convolutional layers. In the GB-DNN archi-
tecture, only the newly inserted dense layer is trained at each
iteration while simultaneously freezing all previously trained
dense layers. In this section, we describe the backbone of the proposed methods alongside their mathematical framework, followed by the structure used for the convolutional layers: the first subsection defines the mathematical framework of GB-CNN and GB-DNN, and the second describes the CNN structure used in the GB-CNN method.
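To illustrate the layer-freezing and model-growing mechanics described above, the following is a hypothetical Keras sketch (the function name and the details of how the copy is made are our assumptions, not the authors' released code) that reuses the layers of the previously trained network, freezes its dense layers, and appends one new trainable dense layer together with a fresh output layer.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def extend_with_frozen_dense(prev_model, n_hidden=20, n_classes=10):
    """Reuse the previous network, freeze its dense layers, add a new dense layer.

    The layers of `prev_model` (except its output layer) are called on a new
    input tensor, so their weights are shared with the previous iteration;
    freezing them leaves the already-learned dense mappings intact, while the
    convolutional layers remain trainable for fine-tuning.
    """
    inputs = tf.keras.Input(shape=prev_model.input_shape[1:])
    x = inputs
    for layer in prev_model.layers[1:-1]:   # skip the InputLayer and old output
        if isinstance(layer, layers.Dense):
            layer.trainable = False         # freeze previously trained dense layers
        x = layer(x)
    x = layers.Dense(n_hidden, activation="relu")(x)           # new trainable dense layer
    outputs = layers.Dense(n_classes, activation="linear")(x)  # raw scores F_k
    return models.Model(inputs, outputs)
```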
3.1. Gradient Boosted Convolutional and Deep Neural Network
For the GB-CNN architecture, we assume training and test instances D = (X_i, y_i)^N, where each input X_i \in \mathbb{R}^{B \times P_H \times P_W \times Ch} is a 4-dimensional tensor: the first dimension, B, corresponds to the batch size; the second and third dimensions, P_H \times P_W, represent the height and width of each image respectively; and the fourth dimension, Ch, represents the number of color channels. The labels y_i are characterized by a one-hot encoding vector y_i = [y_{i,1}, y_{i,2}, \dots, y_{i,K}], where K is the number of classes, y_{i,j} = 1 if the i-th data point belongs to class j, and y_{i,j} = 0 otherwise. For the GB-DNN model, we assume training and test instances D = (X_i, y_i)^N with inputs X_i \in \mathbb{R}^{F_e} with F_e features, and response variables y_i with K attributes, y_i \in [1, K].
The idea is to learn and update the trainable parameters of the model to accurately estimate the class of unseen data by minimizing the cross-entropy loss function \ell(y, p)

\ell(y_i, P_i) = -\sum_{k=1}^{K} y_{i,k} \log p_{i,k},    (1)

where P = \{\{p_{i,k}\}_{k=1}^{K}\}_{i=1}^{N} is a probability matrix in which p_{i,k} is the estimated probability of the i-th instance belonging to class k. These probabilities are obtained from the raw outputs of the trained regression networks, F_k, by applying a softmax function to the outputs of the additive model

p_k(\cdot) = \frac{\exp(F_k(\cdot))}{\sum_{l=1}^{K} \exp(F_l(\cdot))}.    (2)

The final model is built as an additive model of the outputs

F_t(X_i) = F_{t-1}(X_i) + \rho_t S_t(X_i),    (3)

where \rho_t is a vector of weights for each class of the t-th additive model S_t(X_i). This additive model is built in a step-wise manner. In GB-CNN, first, a series of layers, including convolution, activation, pooling, batch normalization, dense and flattening layers, is defined as the function C

C(X_i; \Theta) = H(X_i),    (4)

where \Theta is the set of trainable parameters of the model, which includes the weights and biases of all layers, and H is the sequence of operations performed by the various layers, which will be explained in the following section. In GB-DNN, instead, a first dense layer is initialized with random weights. The model construction is followed by adding a new linear transformation s_t with an activation function, representing the t-th dense layer (boosting iteration)

S_t(M_i; W_t) = \mathrm{ReLU}_t(W_t M_i + b_t),    (5)

where W_t is the weight matrix and b_t the bias vector of the t-th dense layer, M_i is the feature-mapped output of the previous dense layer for the i-th instance, and \mathrm{ReLU}(x) = \max(0, x) is the activation function of the t-th dense layer.

Hence, we can define the additive model of the t-th boosting iteration, whose parameters are trained as

S_t = S_t(S_{t-1}(C(X_i; \Theta_t)); W_t),    (6)

by minimizing the loss function (Eq. 1) using the following objective function

(\rho_t, S_t) = \arg\min_{(\rho_t, S_t)} \sum_{i=1}^{N} \ell\big(y_i, F_{t-1}(X_i) + \rho_t S_t\big),    (7)

where the trainable parameters of s_{t-1} were trained in the (t-1)-th iteration and are now frozen. The optimization process is the same for GB-CNN and GB-DNN, although GB-CNN fine-tunes the convolutional layers at each iteration whereas GB-DNN only trains the newly added layer.

The optimization problem (Eq. 7) continues by training the new model on the pseudo-residuals of the previous iteration, r_{i,t-1}, which are parallel to the negative gradient of the loss,

r_{i,t-1} = -\left[\frac{\partial \ell(y_i, F(X_i))}{\partial F(X_i)}\right]_{F(X_i) = F_{t-1}(X_i)},    (8)

and by updating \rho_t for the K class labels with a line search optimization of

f(\rho_t) = -\sum_{i=1}^{N} \sum_{j=1}^{K} y_{i,j} \ln\big(p_{i,j} + \rho_{t,j}\, \hat{P}_{i,j}\big).    (9)

Once the \rho vector has been estimated for all boosting iterations, the trainable parameters (\Theta) of the model C (Eq. 4) are updated using the estimated weights (\rho)

\omega_t = \rho_t\, \omega_t, \qquad t \in [1, T],    (10)

where \omega_t are the weights applied to the output vector of the previous layer.

Finally, we apply a regularization term to the proposed models, namely the shrinkage rate, a constant value \nu \in (0, 1] that sets the contribution of each additive model in the training procedure (\nu \times \rho_t S_t) and prevents overfitting [11].

The schematic description of the proposed method is shown in Algorithm 1.
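As a numerical illustration of Eqs. (2) and (8), the following sketch (our own, not code from the paper) computes the class probabilities with a softmax over the accumulated raw outputs F and derives the pseudo-residuals used as regression targets for the next additive model; for the softmax cross-entropy loss the negative gradient with respect to F reduces to the familiar y - p.

```python
import numpy as np

def softmax(F):
    """Eq. (2): class probabilities from raw additive outputs F of shape (N, K)."""
    z = F - F.max(axis=1, keepdims=True)      # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def pseudo_residuals(y_onehot, F):
    """Eq. (8): negative gradient of the cross-entropy loss w.r.t. F.

    For softmax + cross-entropy, -dL/dF = y - p, so the next additive model
    is fit to these residuals as regression targets.
    """
    return y_onehot - softmax(F)

# toy usage: 5 instances, 3 classes
rng = np.random.default_rng(0)
F_prev = rng.normal(size=(5, 3))              # accumulated outputs F_{t-1}(X)
y = np.eye(3)[rng.integers(0, 3, size=5)]     # one-hot labels
r = pseudo_residuals(y, F_prev)               # targets for the new dense layer
```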
3.2. Convolutional layers design
Although the proposed method could be applied to different CNN architectures, in the following we describe the design of the convolutional and dense layers of the model applied in the experiments.

The applied CNN architecture consists of a sequence of three blocks. Each block has two 2D-convolutional layers, a batch normalization, a max pooling and a dropout layer. The dense layers are then connected to the last of these three blocks. The configuration of the blocks is the following. The two 2D-convolutional layers of the first block consist of 32 filters each, with ReLU as the activation function. The second block has two more 2D-convolutional layers with 64 filters, and the final block two 2D-convolutional layers with 128 filters. All 2D-convolutional layers use filters of size 3 × 3. After the second convolutional layer of each block, batch normalization is applied to stabilize the distribution of the activations. Then a max pooling layer of size (2, 2) is applied to the output to reduce the spatial dimension of the feature maps and increase the invariance to small translations, and finally a dropout layer is included to prevent overfitting, with rates 0.2, 0.3 and 0.4 for the three blocks respectively. Finally, the output is flattened and connected to the fully connected dense layers.
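A hedged Keras sketch of this convolutional backbone (our reading of the description above, not the authors' released implementation) could look as follows; the input shape is a placeholder and the use of "same" padding is our assumption, since the paper does not state it.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters, dropout_rate):
    """Two 3x3 conv layers, batch normalization, 2x2 max pooling and dropout."""
    x = layers.Conv2D(filters, (3, 3), padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, (3, 3), padding="same", activation="relu")(x)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling2D(pool_size=(2, 2))(x)
    return layers.Dropout(dropout_rate)(x)

def build_backbone(input_shape=(32, 32, 3)):
    """Three convolutional blocks (32, 64, 128 filters) followed by flattening."""
    inputs = tf.keras.Input(shape=input_shape)
    x = conv_block(inputs, 32, 0.2)
    x = conv_block(x, 64, 0.3)
    x = conv_block(x, 128, 0.4)
    outputs = layers.Flatten()(x)
    return tf.keras.Model(inputs, outputs, name="conv_backbone")
```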
The dense layers are added and trained iteratively while freez-
ing previous dense layers. This model is compared to the
same architecture trained jointly as a standard CNN model.
The second proposed architecture considers only the iterative
training of a network composed only of dense layers.
This process is illustrated in Fig. 1, in which only the dense layers are shown (note that the figure does not represent the actual size of the dense layers used). In the first iteration, one single dense (and output) layer is trained (dark gray units in the left diagram of Fig. 1). This layer is the first fully connected layer. After fitting this model, the model is copied and a second dense layer is added (iteration 1, second diagram in Fig. 1), freezing the weights of the first dense layer (shown in Fig. 1 with light gray neurons). This second step fine-tunes the parameters of the convolutional layers (if present), skips the training of the previous dense layer, and trains the newly added dense layer (dark gray units). Each new model fits the last dense layer and the convolutional blocks to the corresponding pseudo-residuals. The training procedure continues until convergence.

Fig. 1: A snippet of the iterative model training procedure. Light gray indicates that the layer is frozen, dark gray that the layer is being trained, and white indicates the output layer (also being trained).

Algorithm 1 Training procedure of GB-CNN.
Input: image data and related labels D = {x_i, y_i}^N.
Input: number of boosting iterations T.
Input: training epochs on mini-batches E.
Input: Gradient Boosting loss function (Eq. 1).
Image batch generator G(D) = {(x_i, y_i)}^N.
for e = 0 to E do
    Fit the GB-CNN with one dense layer to the image residuals.
    if the training additive loss converges then
        break
    end if
end for
Update the trainable parameters \Theta_0.
Freeze the added dense layer's parameters \omega_0.
for t = 1 to T - 1 do
    Add a new dense layer.
    for e = 0 to E do
        Fit the GB-CNN to the image residuals.
        if the training additive loss converges then
            break
        end if
    end for
    Update the trainable parameters \Theta_t.
    Freeze the added dense layer's parameters \omega_t.
    if the training Gradient Boosting loss converges then
        break
    end if
end for
Update the CNN's weights using Eq. 10.
return A fully trained and fine-tuned GB-CNN network.
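Putting Algorithm 1 together, here is a condensed, hypothetical TensorFlow/Keras sketch of one way the boosting loop could be organized (all names are ours; the per-class \rho_t line search of Eq. (9) and early stopping are omitted, the shrinkage \nu simply scales each model's raw outputs, and the residuals are fit with an MSE regression objective as an assumption). Each iteration rebuilds the network with one more dense layer, copies the previously trained weights, freezes the earlier dense layers, refits the network to the current pseudo-residuals, and accumulates the scaled outputs into F.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

def softmax(F):
    e = np.exp(F - F.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def build_model(input_shape, n_classes, n_dense):
    """Small conv backbone + n_dense hidden dense layers + linear output (raw scores)."""
    inputs = tf.keras.Input(shape=input_shape)
    x = layers.Conv2D(32, (3, 3), padding="same", activation="relu", name="conv_1")(inputs)
    x = layers.MaxPooling2D((2, 2), name="pool_1")(x)
    x = layers.Flatten(name="flatten")(x)
    for i in range(n_dense):
        x = layers.Dense(20, activation="relu", name=f"gb_dense_{i}")(x)
    outputs = layers.Dense(n_classes, activation="linear", name="gb_output")(x)
    return tf.keras.Model(inputs, outputs)

def train_gb_cnn(X, y_onehot, input_shape, n_classes, T=3, nu=0.1, epochs=5):
    F = np.zeros((len(X), n_classes))            # accumulated raw outputs, F_0 = 0 for simplicity
    model = None
    for t in range(T):
        new_model = build_model(input_shape, n_classes, n_dense=t + 1)
        if model is not None:
            # copy previously trained weights; conv layers keep training,
            # earlier dense layers are frozen before compiling
            for layer in model.layers:
                try:
                    new_model.get_layer(layer.name).set_weights(layer.get_weights())
                except ValueError:
                    pass                         # layer only exists in one of the models
            for i in range(t):
                new_model.get_layer(f"gb_dense_{i}").trainable = False
        residuals = y_onehot - softmax(F)        # pseudo-residuals, Eq. (8)
        new_model.compile(optimizer="adam", loss="mse")
        new_model.fit(X, residuals, epochs=epochs, batch_size=128, verbose=0)
        F += nu * new_model.predict(X, verbose=0)   # shrunken additive update, Eq. (3)
        model = new_model
    return model, F

# usage sketch (e.g. CIFAR-10 shaped data):
# model, F = train_gb_cnn(X_train, y_train_onehot, (32, 32, 3), 10)
```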
4. EXPERIMENTS
In order to evaluate the proposed training structures for Convolutional (GB-CNN) and Deep Neural Networks (GB-DNN), several supervised 2D-image classification tasks and tabular datasets were considered. On the image datasets, the proposed model (GB-CNN) is compared with a CNN model that uses the same architecture and configuration, although with more dense layers (more details below). On the tabular datasets, GB-DNN is compared with a deep neural network with the same settings. The objective of our experiments was to evaluate the performance of the tested methods in terms of accuracy and to determine their usefulness for solving 2D-image and tabular classification problems. The code of the proposed models is made available on GitHub as GB-CNN¹.
¹ github.com/GAA-UAM/GB-CNN

Regarding the 2D-image classification problem, seven 2D-image datasets from various areas of application, with different class labels, numbers of instances, pixel resolutions, and color channels, are used in this study for the convolutional models, as described in Table 1. In this experiment, a data generator was used to generate batches of training data on-the-fly during the training process. This allowed us to train the model on a large dataset without having to load the entire dataset into GPU memory. Moreover, the generator shuffled the data, applied data augmentation techniques (including rescaling the pixel values), and yielded batches with a size of 128.

Table 1: Dataset properties for the image classification tasks
Name                      train/test      P_H × P_W   K    Ch
MNIST [23]                60,000/10,000   28 × 28     10   1
CIFAR-10 [24]             50,000/10,000   32 × 32     10   3
Rice varieties [25]       56,250/18,750   32 × 32     5    3
Fashion-MNIST [26]        60,000/10,000   28 × 28     10   1
Kuzushiji-MNIST [27]      60,000/10,000   28 × 28     10   1
MNIST-Corrupted [27]      60,000/10,000   28 × 28     10   1
Rock-Paper-Scissors [28]  2,520/370       32 × 32     3    3

The common hyperparameters and settings of GB-CNN and CNN were the same, including the structure of the convolutional layers (described before): the number of hidden neurons in each dense layer was set to 20, the learning rate to 0.001, and training ran for 200 epochs with early stopping on the validation score to prevent overfitting. The shrinkage rate of GB-CNN, used to combine the different generated CNNs, was set to 0.1. As previously described, the proposed method adds one dense layer at each iteration. The maximum number of dense layers/iterations was set to 10, although GB-CNN generally converges after 2 or 3 iterations. Notwithstanding, the number of dense layers for the CNN was left at 10. Furthermore, we conducted an additional experiment wherein three image datasets (MNIST, Fashion-MNIST, and CIFAR-10) were analyzed using two dense layers, each with a size of 128 and otherwise identical settings, in order to investigate the impact of the dense layer size on the experimental outcomes.

Moreover, in this work, tabular classification datasets from different sources [29, 30] were also tested. The characteristics of these datasets are shown in Table 2. The classification tasks are of different types, ranging from radar data to handwritten digits and vowel pronunciation. The hyper-parameter values of the proposed GB-DNN and the standard Deep-NN are identical. For these datasets, the training was performed using 10-fold cross-validation. In order to determine the optimal values of the hyper-parameters of the models, a within-train grid search process was applied. The tested values for the hyper-parameters are: the learning rate of both models (ranging from 0.1 to 0.001) and the shrinkage rate ν of the GB-DNN model (ranging from 0.1 to 1.0). Both models are composed of three dense layers, each with a size of 100 neurons and ReLU as the activation function.

Table 2: Dataset properties for the tabular classification tasks
Name             Instances   Features   Class labels
Digits [29]      1,797       64         10
Ionosphere [29]  351         34         2
Letter-26 [29]   20,000      16         26
Sonar            208         60         2
USPS [30]        9,298       256        10
Vowel [29]       990         10         11
Waveform [29]    5,000       21         3

5. RESULTS

Tables 3, 4 and 5 show the generalization accuracy results estimated on the test subsets for the analyzed datasets and models. Table 3 shows the results for the convolutional models (GB-CNN and CNN) using 20 neurons in the dense hidden layers. Table 4 includes the results for selected image datasets with larger dense layers of 128 neurons. Finally, Table 5 portrays the results for the deep models (GB-DNN and Deep-NN) on the tabular datasets. The best results in the tables are highlighted with a gray background.

As can be observed in Table 3, the generalization accuracy of the proposed convolutional model (GB-CNN) is higher than that of the baseline CNN model across all tested datasets. The difference is especially large on Rock-Paper-Scissors: the proposed model achieved an accuracy of 87.37%, which is 19.36 points higher than the accuracy of CNN (68.01%). On the other datasets the differences are smaller but consistently in favor of the proposed architecture. For instance, on CIFAR-10, the proposed model achieved an accuracy of 87.65% with respect to the 86.71% achieved by the CNN. The MNIST dataset is well known for its simplicity and high accuracy, and GB-CNN was able to achieve a remarkable accuracy of 99.61%, which is higher than the accuracy of CNN by a very small margin (0.06 points).

Table 3: Accuracy of GB-CNN and CNN models on the experiment with 20-neuron dense layers. The best results are highlighted in gray.
2D-Image dataset       GB-CNN    CNN
MNIST                  99.61%    99.55%
CIFAR-10               87.65%    86.71%
Rice varieties         99.82%    98.37%
Fashion-MNIST          94.34%    93.61%
Kuzushiji-MNIST        98.40%    96.70%
MNIST-Corrupted        99.58%    99.35%
Rock-Paper-Scissors    87.37%    68.01%

In relation to the experiment involving dense layers with a size of 128, the accuracy results for the GB-CNN and CNN models are presented in Table 4. These models were trained and evaluated on three distinct datasets: MNIST, CIFAR-10, and Fashion-MNIST. The findings demonstrate that the GB-CNN model outperformed the CNN model across all three datasets. Nonetheless, a closer examination of the tables reveals that the results obtained using the 128-neuron dense layers (as shown in Table 4) are slightly inferior to those obtained with the 20-neuron layers
(as shown in Table 3). This implies that, for these particular datasets, employing a greater number of smaller layers may be more effective in capturing the intended tasks.

Table 4: Accuracy of GB-CNN and CNN models on the experiment with 128-neuron dense layers. The best results are highlighted in gray.
2D-Image dataset   GB-CNN    CNN
MNIST              99.66%    99.40%
CIFAR-10           86.72%    86.30%
Fashion-MNIST      94.02%    92.38%

The results for the tabular datasets are summarized in Table 5 for GB-DNN and Deep-NN. The results demonstrate that GB-DNN outperforms Deep-NN on most of the datasets. In particular, GB-DNN achieved the highest accuracy on the Digits, Ionosphere, Letter-26, USPS, Vowel, and Waveform datasets. Deep-NN, on the other hand, achieved the highest accuracy on the Sonar dataset. As in previous results, the differences are in general small, with the largest difference in Waveform in favor of GB-DNN (2.72-point difference) and in Sonar in favor of Deep-NN (1.45-point difference).

Table 5: Accuracy of GB-DNN and Deep-NN models. The best results are highlighted in gray.
Tabular dataset   GB-DNN          Deep-NN
Digits            98.22% ±1.24    98.08% ±1.59
Ionosphere        94.87% ±3.08    94.60% ±3.66
Letter-26         95.50% ±0.49    95.29% ±0.63
Sonar             86.55% ±9.03    88.00% ±3.78
USPS              97.55% ±1.06    97.39% ±0.40
Vowel             97.47% ±2.08    96.95% ±0.62
Waveform          85.20% ±1.89    82.38% ±2.37

Fig. 2: GB-CNN accuracy (train and test data points versus boosting iterations).
Fig. 3: GB-CNN cross-entropy loss (train and test data points versus boosting iterations).
In addition to evaluating the accuracy metric, we analyzed
the training loss and accuracy of the models under comparison
(Figs. 2–5 and 6–7). We further monitored the loss function
across various additive models during training for GB-CNN.
The experiment was conducted using the CIFAR-10 dataset,
with both training and test data points. The models were
configured with the same hyperparameters, including iden-
tical convolutional layers as outlined in Section 3.2, and 10
dense layers with a size of 20. The learning rate was set to
0.001, with a batch size of 128, and 100 training epochs for
both models. In the case of GB-CNN, the shrinkage was set
to 0.01.
The evolution of the performance of GB-CNN is shown in Figs. 2–5. Fig. 2 illustrates the train and test accuracy of the model with respect to the number of boosting iterations. In Fig. 3, the average cross-entropy loss across the boosting iterations is shown. Additionally, Figs. 4 and 5 provide an overview of the evolution of the performance of the additive models with respect to the boosting iterations in terms of mean squared error loss. As can be observed from these plots, the model converges after a few boosting iterations. Specifically, after three boosting iterations the model has converged and the training process could be stopped. In contrast, Fig. 2 illustrates that the GB-CNN model continues to learn and classify more accurately, with a decreasing tendency of the cross-entropy loss (Fig. 3). Overall, these findings provide valuable insights into the performance of the GB-CNN model and its potential for use in various applications.
On the other hand, Figs. 6 and 7 present the performance
of the CNN model with respect to the training epochs. Specif-
ically, the train and test accuracy and MSE loss of the model
are depicted in Figs. 6 and 7, respectively. Notably, the
CNN model’s accuracy, as displayed in Fig. 6, exhibits an
initial accuracy below 60% during the early stages of train-
ing, contrasting with the performance of the GB-CNN model.
This outcome reveals the model’s inaccuracy in its prelimi-
nary training iterations.
Fig. 4: Additive model loss (MSE) - train data points.
Fig. 5: Additive model loss (MSE) - test data points.
Fig. 6: CNN accuracy (train and test data points versus training epochs).
Fig. 7: CNN loss (train and test data points versus training epochs).
6. CONCLUSION
This paper presents a novel approach for training a set of deep neural networks based on convolutional (CNN) and deep (DNN) architectures, called GB-CNN and GB-DNN respectively. The proposed training procedures are based on the gradient boosting algorithm, which iteratively trains a set of models in order to learn the information not captured in previous iterations. Additionally, both proposed models employ an inner dense-layer freezing approach to reduce model complexity and non-linearity.

To evaluate the effectiveness of the proposed models, we conducted experiments on various 2D-image and tabular datasets, ranging from radar data to handwritten digits, fashion, and agriculture. Our results demonstrate that the proposed GB-CNN model outperforms traditional CNN models with the same architecture in terms of accuracy on all the analyzed image datasets. Moreover, the GB-DNN model outperforms DNN models in terms of accuracy on most of the studied tabular datasets. The training loss curves also indicate that the proposed models exhibit lower training loss values for the additive models and a faster convergence rate than traditional CNN models. This shows that the proposed GB-CNN and GB-DNN models represent a robust and effective solution for 2D-image and tabular classification tasks. We recommend these models for further research and practical applications, as they offer a promising direction for developing more accurate and efficient CNNs and DNNs in the future.
7. REFERENCES
[1]
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hin-
ton, “Imagenet classification with deep convolutional
neural networks,” in Advances in Neural Information
Processing Systems 25, F. Pereira, C. J. C. Burges,
L. Bottou, and K. Q. Weinberger, Eds., pp. 1097–1105.
Curran Associates, Inc., 2012.
Additive:
0
Additive:
1
Additive:
2
Additive:
3
Additive:
4
Additive:
5
Additive:
6
Additive:
7
Additive:
8
Additive:
9
Tes
t
data
p
oints
Train data
p
oints
Tes
t
data
p
oints
Train data
p
oints
MSE
loss
MSE
loss
Cross-Entropy loss
Accuracy
[2]
Dongmei Han, Qigang Liu, and Weiguo Fan, “A new
image classification method using cnn transfer learning
and web data augmentation,” Expert Systems with Ap-
plications, vol. 95, pp. 43–56, 2018.
[3]
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian
Sun, “Faster r-cnn: Towards real-time object detection
with region proposal networks,” Advances in neural in-
formation processing systems, vol. 28, 2015.
[4]
Taejoon Kim, Sang C Suh, Hyunjoo Kim, Jonghyun
Kim, and Jinoh Kim, “An encoding technique for cnn-
based network anomaly detection,” in 2018 IEEE In-
ternational Conference on Big Data (Big Data). IEEE,
2018, pp. 2960–2965.
[5]
Konstantinos Kamnitsas, Christian Ledig, Virginia FJ
Newcombe, Joanna P Simpson, Andrew D Kane,
David K Menon, Daniel Rueckert, and Ben Glocker,
“Efficient multi-scale 3d cnn with fully connected crf
for accurate brain lesion segmentation,” Medical image
analysis, vol. 36, pp. 61–78, 2017.
[6]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian
Sun, “Deep residual learning for image recognition,” in
Proceedings of the IEEE conference on computer vision
and pattern recognition, 2016, pp. 770–778.
[7]
Karen Simonyan and Andrew Zisserman, “Very deep
convolutional networks for large-scale image recogni-
tion,” arXiv preprint arXiv:1409.1556, 2014.
[8]
Matthew D Zeiler and Rob Fergus, “Visualizing and un-
derstanding convolutional networks,” in European con-
ference on computer vision. Springer, 2014, pp. 818–
833.
[9]
Sergey Ioffe and Christian Szegedy, “Batch normaliza-
tion: Accelerating deep network training by reducing
internal covariate shift,” in International conference on
machine learning. PMLR, 2015, pp. 448–456.
[10]
Xavier Glorot and Yoshua Bengio, “Understanding
the difficulty of training deep feedforward neural net-
works,” in Proceedings of the thirteenth interna-
tional conference on artificial intelligence and statistics.
JMLR Workshop and Conference Proceedings, 2010,
pp. 249–256.
[11]
Jerome H Friedman, “Greedy function approximation:
a gradient boosting machine,” Annals of statistics, pp.
1189–1232, 2001.
[12]
Llew Mason, Jonathan Baxter, Peter Bartlett, and Mar-
cus Frean, “Boosting algorithms as gradient descent,”
Advances in neural information processing systems, vol.
12, 1999.
[13]
Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang,
Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu,
“Lightgbm: A highly efficient gradient boosting deci-
sion tree,” Advances in neural information processing
systems, vol. 30, 2017.
[14]
Tianqi Chen, Tong He, Michael Benesty, Vadim
Khotilovich, Yuan Tang, Hyunsu Cho, Kailong Chen,
et al., “Xgboost: extreme gradient boosting,” R package
version 0.4-2, vol. 1, no. 4, pp. 1–4, 2015.
[15]
Ravid Shwartz-Ziv and Amitai Armon, “Tabular data:
Deep learning is not all you need,” Information Fusion,
vol. 81, pp. 84–90, 2022.
[16]
Candice Bentéjac, Anna Csörgő, and Gonzalo Martínez-Muñoz, "A comparative analysis of gradient boosting
algorithms,” Artificial Intelligence Review, vol. 54, pp.
1937–1967, 2021.
[17]
Seyedsaman Emami and Gonzalo Mart´ınez-Mun˜oz,
“Multioutput regression neural network training via gra-
dient boosting,” 01 2022, pp. 145–150.
[18]
Furong Huang, Jordan Ash, John Langford, and Robert
Schapire, “Learning deep resnet blocks sequentially us-
ing boosting theory,” in International Conference on
Machine Learning. PMLR, 2018, pp. 2058–2067.
[19]
Atsushi Nitanda and Taiji Suzuki, “Functional gradi-
ent boosting based on residual network perception,” in
International Conference on Machine Learning. PMLR,
2018, pp. 3819–3828.
[20]
Yoshua Bengio, Nicolas Roux, Pascal Vincent, Olivier
Delalleau, and Patrice Marcotte, “Convex neural net-
works,” Advances in neural information processing sys-
tems, vol. 18, 2005.
[21]
Seyedsaman Emami and Gonzalo Mart´ınez-Mun˜oz,
“Sequential training of neural networks with gradient
boosting,” arXiv preprint arXiv:1909.12098, 2019.
[22]
Yoav Freund and Robert E Schapire, “A decision-
theoretic generalization of on-line learning and an ap-
plication to boosting,” Journal of computer and system
sciences, vol. 55, no. 1, pp. 119–139, 1997.
[23]
Norman Mu and Justin Gilmer, “Mnist-c: A robust-
ness benchmark for computer vision,” arXiv preprint
arXiv:1906.02337, 2019.
[24]
Alex Krizhevsky, “Learning multiple layers of features
from tiny images,” Tech. Rep., 2009.
[25]
Murat Koklu, Ilkay Cinar, and Yavuz Selim Taspinar,
“Classification of rice varieties with deep learning meth-
ods,” Computers and electronics in agriculture, vol.
187, pp. 106285, 2021.
[26]
Han Xiao, Kashif Rasul, and Roland Vollgraf, “Fashion-
mnist: a novel image dataset for benchmarking machine
learning algorithms,” arXiv preprint arXiv:1708.07747,
2017.
[27]
Tarin Clanuwat, Mikel Bober-Irizar, Asanobu Kitamoto,
Alex Lamb, Kazuaki Yamamoto, and David Ha, “Deep
learning for classical japanese literature,” arXiv preprint
arXiv:1812.01718, 2018.
[28]
Laurence Moroney, "Rock, paper, scissors dataset," Feb. 2019.
[29]
M. Lichman, “UCI machine learning repository,” 2013.
[30]
Jonathan J. Hull, “A database for handwritten text
recognition research,” IEEE Transactions on pattern
analysis and machine intelligence, vol. 16, no. 5, pp.
550–554, 1994.