Toward Optimal Learning Rate Schedule in Scene Classification Network

IEEE GEOSCIENCE AND REMOTE SENSING LETTERS
Towards optimal learning rate schedule in scene
classification network
Minglong Xue, Jian Li, Qingli Luo*
Abstract—Stochastic Gradient Descent (SGD) and adaptive methods, including ADAM, RMSProp, AdaDelta and AdaGrad, are the two dominant families of optimization algorithms for training Convolutional Neural Networks (CNNs) in scene classification tasks. Recent work reveals that these adaptive methods lead to degraded performance in image classification tasks compared with SGD. In this letter, a learning rate schedule named switching from constant to step decay (SCTSD) is proposed to further improve the classification accuracy of SGD. SCTSD begins with a constant learning rate and switches to step decay learning rates at a chosen switch point. Theoretical evidence is provided on the superiority of SCTSD compared with other manually tuned schedules. Comparison experiments have been conducted among adaptive methods, SCTSD and other manual schedules with three CNN architectures. The experimental results on various scene classification datasets show that SCTSD achieves the highest accuracy on the test set and is state-of-the-art. Finally, some suggestions on hyper-parameter selection for SCTSD are given for scene classification.
Index Terms—stochastic gradient descent (SGD), convolutional neural network (CNN), scene classification tasks, learning rate schedule, switching from constant to step decay schedule (SCTSD).
I. INTRODUCTION
Recent advances in scene classification rely mainly on the Convolutional Neural Network (CNN) because of its high performance. A CNN can extract features automatically and accept various inputs, including SAR, Lidar, hyperspectral and optical images. Meanwhile, a CNN has a rather low operating time once it has been trained. CNNs have achieved great success in automatic target recognition (ATR), marine monitoring and forest monitoring.
In the field of CNN optimization, Stochastic Gradient Descent (SGD) is a dominant algorithm based on the first-order derivative of the loss function. Due to the large number of parameters in a CNN and the sophisticated loss function, traditional optimization algorithms based on second-order derivatives do not work for training CNNs. SGD takes steps to find the minimum of the loss function, and decaying learning rate schedules help it approach the optimal solution. Classical SGD uses manually tuned schedules, such as the polynomial[1] and step decay schedules[2][3][4], to find a solution for a CNN. Some variants of SGD that use adaptive learning rates and gradients are more popular than SGD because of their fast convergence rate in the training process of CNNs. These adaptive methods leverage past learning rates and gradients.
Minglong Xue, Jian Li and Qingli Luo* are all with the State Key Laboratory of Precision Measuring Technology and Instruments, Tianjin University, Tianjin 300072 (e-mail: xueminglong@tju.edu.cn, luoqingli@tju.edu.cn).
Qingli Luo* is also with Binhai International Advanced Structural Integrity Research Centre, Tianjin 300072, China (e-mail: luoqingli@tju.edu.cn).
This work is supported by the National Natural Science Foundation of China (grant No. 41601446) and the Tianjin Natural Science Foundation (grant No. 16JCQNJC01200). (Corresponding author: Qingli Luo.)
Although adaptive methods have achieved great success, recent works reveal that they cannot find an optimal solution in some settings[5][6]. Extreme gradients exist in the process of optimization; they occur rarely but are informative. Due to the moving average strategy of some adaptive methods, the information in these large gradients is lost, which leads to poor convergence[5]. Further, adaptive methods may converge to "sharp" minima of the loss function[6]. A CNN trained by adaptive methods tends to perform worse on unseen data than one trained by SGD.
Several state-of-the-art works choose SGD with step decay schedules as the CNN optimizer. VGGNet[2] used two 3x3 filters instead of one large filter and won first place in the 2014 ImageNet[7] recognition challenge. ResNet achieved an improvement through residual shortcut connections[3].
In order to further improve performance, the switching from constant to step decay (SCTSD) schedule is proposed for SGD. SCTSD begins with a constant learning rate and then switches to step decay learning rates at a specific switch point. The constant learning rate at the initial stage helps reduce the search space rapidly. A more precise solution can then be found in this space through the step decay schedule. The key contributions of this work are as follows:
Firstly, we provide theoretical evidence that SCTSD is superior to the polynomial decay and step decay schedules. Our work is based on the proof of Ge et al.[8] that the step decay schedule outperforms the polynomial decay schedule for training CNN. We take a step further and prove the superiority of SCTSD compared with the step decay schedule.
Secondly, comparison experiments have been conducted among the constant learning rate, adaptive methods, SCTSD, polynomial decay and step decay schedules on three popular scene classification datasets. Three widely used CNN architectures are used in our experiments. The results show that a CNN trained by SCTSD has higher accuracy than one trained with the other schedules. Finally, parameter selection for SCTSD is discussed in different scenarios.
II. SCTSD SCHEDULE
We provide theoretical evidence of the superiority of SCTSD compared with the step and polynomial decay schedules, based on the work of Ge et al.[8]. A one-layer CNN without the activation unit can be viewed as a linear regression model. By solving the regression problem, we prove that SCTSD leads to lower test error than the step decay schedule.
A. Prior knowledge
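Let us start with a linear regression problem. The loss function to be minimized is
$$\min L(\beta) = \min_{\beta \in \mathbb{R}^d} \mathbb{E}(X\beta - Y)^2 \tag{1}$$
Beginning with initial weights $\beta_0$ and a sequence of learning rates $\eta_0, \eta_1, \ldots, \eta_t$, the weights $\beta_t$ of the regression model are updated by SGD through the following iteration
$$\beta_{t+1} = \beta_t - \eta_t \nabla L(\beta_t) \tag{2}$$
$$\nabla L(\beta_t) = x_t^T \cdot (x_t \cdot \beta_t - y_t) \tag{3}$$
where $x_t, y_t$ is the training data at time $t$ and $\eta_t$ is the learning rate at time $t$. $y_t$ can be denoted as $y_t = x_t \beta^* + e_t$, where $\beta^*$ is the theoretical optimal solution that SGD can find for $L(\beta)$. The Hessian of $L(\beta)$ is $H = \mathbb{E}(\nabla^2 L(\beta_t))$. The noise $e$ in the data satisfies $\mathbb{E}[e] = 0$ and $\mathrm{var}(e) = \mathbb{E}(e^T e) = \theta^2$.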
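The SGD iteration of Eq. 2 and Eq. 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy data, the noise level, the constant learning rate of 0.05 and the function name `sgd_linear_regression` are hypothetical choices, not taken from the experiments in this letter.

```python
import random

def sgd_linear_regression(samples, beta0, learning_rates):
    """Apply beta <- beta - eta_t * x_t^T (x_t . beta - y_t) once per sample."""
    beta = list(beta0)
    for (x, y), eta in zip(samples, learning_rates):
        residual = sum(xi * bi for xi, bi in zip(x, beta)) - y  # x_t . beta_t - y_t
        beta = [bi - eta * xi * residual for xi, bi in zip(x, beta)]
    return beta

# Hypothetical toy data: y = x . beta_star + e, with zero-mean Gaussian noise e.
rng = random.Random(0)
beta_star = [2.0, -1.0]
samples = []
for _ in range(2000):
    x = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
    y = sum(xi * bi for xi, bi in zip(x, beta_star)) + rng.gauss(0.0, 0.1)
    samples.append((x, y))

beta_hat = sgd_linear_regression(samples, [0.0, 0.0], [0.05] * len(samples))
print(beta_hat)  # should land close to beta_star
```

Passing a different list of learning rates to the same function is enough to compare schedules on this toy problem.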
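According to Eq. 1, the loss error $\epsilon_t$ between $L(\beta_t)$ and $L(\beta^*)$ is denoted as
$$\epsilon_t = \mathbb{E}[L(\beta_t)] - L(\beta^*) = \mathbb{E}[(\beta_t - \beta^*)^T H (\beta_t - \beta^*)] \tag{4}$$
Suppose that the eigenvalues of $H$ are $\lambda^{(i)}, i = 1, 2, \ldots, d$. The condition number $\kappa$ measures the noise sensitivity of the model and is defined as $\kappa = \lambda_{\max}/\lambda_{\min}$.
The variance $v_t^{(i)}$ of the solution $\beta_t$ is defined as
$$v_t^{(i)} = \mathbb{E}[(\beta_t^{(i)} - \beta^{*(i)})^2] \tag{5}$$
Then the error $\epsilon_t$ can be rewritten as
$$\epsilon_t = \sum_{i=1}^{d} \lambda^{(i)} v_t^{(i)} \tag{6}$$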
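According to Eq. 2, Eq. 3 and Eq. 5, the variance $v_t^{(i)}$ of the solutions $\beta_t$ found by SGD can be computed through the following iteration
$$\begin{aligned} v_t^{(i)} &= (1 - \eta_t \lambda^{(i)})^2 v_{t-1}^{(i)} + \lambda^{(i)} \theta^2 \eta_t^2 \\ v_T^{(i)} &= \prod_{t=1}^{T} (1 - \eta_t \lambda^{(i)})^2 v_0^{(i)} + \lambda^{(i)} \theta^2 \sum_{t=1}^{T} \eta_t^2 \prod_{s=t+1}^{T} (1 - \eta_s \lambda^{(i)})^2 \end{aligned} \tag{7}$$
where the second line unrolls the recursion from $t = 1$ to $T$.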
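As a sanity check on Eq. 7, the one-step recursion and its unrolled form can be compared numerically. The sketch below is illustrative only; the function names and the sample values of the eigenvalue, noise variance and learning rates are hypothetical.

```python
def variance_recursive(v0, lam, theta2, learning_rates):
    """Iterate v_t = (1 - eta_t*lam)^2 * v_{t-1} + lam * theta^2 * eta_t^2."""
    v = v0
    for eta in learning_rates:
        v = (1.0 - eta * lam) ** 2 * v + lam * theta2 * eta ** 2
    return v

def variance_unrolled(v0, lam, theta2, learning_rates):
    """Evaluate the product/sum form obtained by unrolling the recursion."""
    T = len(learning_rates)
    decay = 1.0
    for eta in learning_rates:
        decay *= (1.0 - eta * lam) ** 2
    noise = 0.0
    for t in range(T):
        tail = 1.0  # contraction applied by the steps after step t
        for s in range(t + 1, T):
            tail *= (1.0 - learning_rates[s] * lam) ** 2
        noise += learning_rates[t] ** 2 * tail
    return decay * v0 + lam * theta2 * noise

etas = [0.1, 0.05, 0.02, 0.01]  # hypothetical decaying learning rates
a = variance_recursive(1.0, 0.5, 0.04, etas)
b = variance_unrolled(1.0, 0.5, 0.04, etas)
print(a, b)  # the two forms agree up to floating-point rounding
```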
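According to [8], the error $\mathbb{E}[L(\beta_t)] - L(\beta^*)$ is lower bounded by
$$\lim_{t \to \infty} \frac{\mathbb{E}[L(\beta_t)] - L(\beta^*)}{\theta^2 d / t} \geq 1 \tag{8}$$
where $\theta^2 d/t$ is the statistical minimal error that SGD could reach.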
Fig. 1: Learning Rate Schedules
B. Step decay schedule is superior to polynomial decay schedule
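The polynomial and step decay schedules are modeled as $\eta_t = \frac{\eta_0}{1 + b \cdot t}$ and $\eta_t = \eta_0 \cdot \exp(-b \cdot t)$ respectively by Ge et al.[8]. They proved that the error $\epsilon_T = \mathbb{E}[L(\beta_T)] - L(\beta^*)$ of the polynomial decay schedule is lower bounded by
$$\epsilon_T \geq \exp\left(-\frac{T}{\kappa \log T}\right) \epsilon_0 + \frac{\theta^2 d}{T} \cdot \frac{\kappa}{64} \tag{9}$$
The error of the step decay schedule is upper bounded by
$$\epsilon_T \leq 2 \cdot \exp\left(-\frac{T}{2\kappa \log T}\right) \epsilon_0 + 4 \cdot \log T \cdot \frac{\theta^2 d}{T} \tag{10}$$
For any $T < \kappa^2$, the solution found by the step decay schedule is clearly much better than that of the polynomial decay schedule, because it can reach an error rate of $\frac{\theta^2 d}{T} \cdot \log T \cdot \log \kappa$, which is lower than $\frac{\theta^2 d}{T} \cdot \kappa$.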
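The two schedule models compared in this subsection can be written directly as functions of the step $t$. In the sketch below, $\eta_0 = 0.01$ and $b = 0.1$ follow the polynomial decay setting listed in Tab. I; reusing the same $b$ for the exponential model is an assumption made only for illustration.

```python
import math

def polynomial_decay(eta0, b, t):
    """eta_t = eta0 / (1 + b * t)."""
    return eta0 / (1.0 + b * t)

def exponential_decay(eta0, b, t):
    """eta_t = eta0 * exp(-b * t), the continuous model of step decay."""
    return eta0 * math.exp(-b * t)

eta0, b = 0.01, 0.1
for t in (0, 10, 50, 100):
    print(t, polynomial_decay(eta0, b, t), exponential_decay(eta0, b, t))
```

For every $t > 0$ the exponential model yields the smaller learning rate, which is why the two schedules behave so differently late in training.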
C. SCTSD is superior to step decay schedule
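The diagram of our schedule is in Fig. 1. The main idea of SCTSD is that a constant learning rate is employed at first to reduce the search scope, then a more precise solution can be found through the step decay schedule. Details about SCTSD are shown in Algorithm 1. The error rate of SCTSD can be reduced to $\frac{\theta^2 d}{T} \cdot \log \kappa$, which will be proved by induction.
At the first stage, when $t < T_0$ and $\eta_t = \eta_0 = \frac{1}{L^2}$, we prove that $v_t \leq (1 - \lambda^{(i)}/L^2)^{2t} v_0 + \theta^2/L^2$. According to the iterative variance in Eq. 7, it is clearly true at time $t = 0$.
When $0 < t < T_0$, suppose that it is true at time $t - 1$, then
$$\begin{aligned} v_t^{(i)} &= \left(1 - \frac{\lambda^{(i)}}{L^2}\right)^2 v_{t-1}^{(i)} + \frac{\lambda^{(i)} \theta^2}{(L^2)^2} \\ &\leq \left(1 - \frac{\lambda^{(i)}}{L^2}\right)^{2t} v_0^{(i)} + \frac{\theta^2}{L^2} \end{aligned} \tag{11}$$
After the first stage of SCTSD, the variance is
$$v_{T_0} \leq \left(1 - \frac{\lambda}{L^2}\right)^{2T_0} v_0 + \frac{\theta^2}{L^2} \leq \frac{v_0}{T_0^3} + \frac{\theta^2}{L^2} \tag{12}$$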
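The first-stage bound of Eq. 11 can be checked numerically by iterating the variance recursion with the constant learning rate $1/L^2$. This is a minimal sketch, assuming $0 < \lambda \leq L^2$; the function names and the sample values are hypothetical.

```python
def first_stage_variances(v0, lam, theta2, L2, steps):
    """Iterate the variance recursion with the constant rate eta = 1 / L^2."""
    eta = 1.0 / L2
    v, history = v0, [v0]
    for _ in range(steps):
        v = (1.0 - eta * lam) ** 2 * v + lam * theta2 * eta ** 2
        history.append(v)
    return history

def first_stage_bound(v0, lam, theta2, L2, t):
    """Right-hand side of the bound: (1 - lam/L^2)^(2t) * v0 + theta^2 / L^2."""
    return (1.0 - lam / L2) ** (2 * t) * v0 + theta2 / L2

# Hypothetical values with lam <= L2.
v0, lam, theta2, L2 = 1.0, 0.5, 0.04, 2.0
history = first_stage_variances(v0, lam, theta2, L2, 50)
holds = all(v <= first_stage_bound(v0, lam, theta2, L2, t) + 1e-12
            for t, v in enumerate(history))
print(holds)
```

The accumulated noise term is a geometric series that stays below $\theta^2/L^2$ whenever $\lambda \leq L^2$, which is the reason the bound holds for every $t$ in this check.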
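Input: number of initial epochs $T_0$; number of total epochs $T$; initial learning rate $\eta_0$; initial weights $\beta_0$
Output: model weights $\beta_t$
$\eta_l = \eta_0$;
for $t \leftarrow 1$ to $T_0$ do
    $\beta_t \leftarrow \beta_{t-1} - \eta_l \cdot \nabla L(\beta_{t-1})$;
end
$\hat{T} = T - T_0$;
$\hat{T}_0 = \hat{T} / \log \hat{T}$;
for $l \leftarrow 1$ to $\log \hat{T}$ do
    $\eta_l \leftarrow \eta_0 / 2^l$;
    for $t \leftarrow T_0 + (l-1) \cdot \hat{T}_0$ to $T_0 + l \cdot \hat{T}_0$ do
        $\beta_t \leftarrow \beta_{t-1} - \eta_l \cdot \nabla L(\beta_{t-1})$;
    end
end
Algorithm 1: SCTSD schedule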
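Algorithm 1 can be turned into a per-epoch learning rate function as sketched below. Several details are assumptions, since the algorithm does not fix them: the logarithm is taken in base 2, the number of stages is rounded down to an integer, the rate is halved once per stage (following the "reduced by a factor of 2" description), and the function name `sctsd_learning_rate` is hypothetical.

```python
import math

def sctsd_learning_rate(epoch, eta0, switch_point, total_epochs):
    """Learning rate for a 1-indexed epoch under a sketch of the SCTSD schedule."""
    remaining = total_epochs - switch_point          # T_hat = T - T0
    if epoch <= switch_point or remaining < 1:
        return eta0                                  # first stage: constant rate
    num_stages = max(1, int(math.log2(remaining)))   # assumed rounding of log(T_hat)
    stage_length = remaining / num_stages            # epochs spent at each rate
    stage = int((epoch - switch_point - 1) // stage_length) + 1
    stage = min(stage, num_stages)
    return eta0 / 2 ** stage                         # halve once per stage

# Settings used in the experiments: eta0 = 0.01, T = 200, switch point 100.
schedule = [sctsd_learning_rate(e, 0.01, 100, 200) for e in range(1, 201)]
print(schedule[0], schedule[100], schedule[-1])
```

With these settings the rate stays at 0.01 for the first 100 epochs and is then halved over six stages of roughly 17 epochs each.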
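At the second stage, when $T_0 < t < T$, the learning rate $\eta_t$ is reduced by a factor of 2 at time $t = T_0 + (l-1) \cdot \frac{\hat{T}}{\log \hat{T}}$, where $l = 1, 2, \ldots, \log(T - T_0)$. We prove by induction that $v_{T_0 + l\hat{T}_0} \leq v_{T_0} \exp(-3l) + 2\hat{T}_0 \eta_l^2 \lambda \theta^2$, where $\hat{T}_0 = \frac{\hat{T}}{\log \hat{T}}$ and $\hat{T} = T - T_0$.
In the base case, it is clearly true when $t = T_0$, $l = 0$, since $v_{T_0} \leq v_{T_0}$.
When $t > T_0$, $l > 0$, suppose that it is true for $l - 1$, then
$$\begin{aligned} v_{T_0 + l\hat{T}_0} &\leq v_{T_0 + (l-1)\hat{T}_0} \cdot \exp(-2\eta_l \cdot \lambda \hat{T}_0) + \hat{T}_0 \eta_l^2 \cdot \lambda \theta^2 \\ &\leq v_{T_0} \cdot \exp(-3l) + (2\exp(-3l) + 1)\, \hat{T}_0 \eta_l^2 \cdot \lambda \theta^2 \\ &\leq v_{T_0} \cdot \exp(-3l) + 2\hat{T}_0 \eta_l^2 \cdot \lambda \theta^2 \end{aligned} \tag{13}$$
Further, according to Eq. 12, we have
$$v_T \leq \frac{v_0}{T_0^3 \cdot (T - T_0)^3} + \frac{2 \log \kappa}{\lambda \cdot T} \cdot \theta^2 \tag{14}$$
Therefore, according to Eq. 6, Eq. 11 and Eq. 14, for any $\kappa < T < \kappa^2$, our schedule has an error rate of $\frac{\theta^2 d}{T} \cdot \log \kappa$, which is much lower than the $\frac{\theta^2 d}{T} \cdot \log T \cdot \log \kappa$ of the step decay schedule.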
D. The best switch point $T_0^* \approx \frac{T}{2}$
According to Eq. 14, once $T$ has been fixed, the variance $v_T$ depends on the first term $\frac{v_0}{T_0^3 \cdot (T - T_0)^3}$. Since $T_0 + (T - T_0) = T$, we get the lowest $v_T$ when $T_0 \approx T - T_0 \approx \frac{T}{2}$. Hence, the best switch point is located near $T_0^* \approx \frac{T}{2}$.
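The location of the best switch point can be confirmed with a brute-force search over integer switch points. The sketch assumes that the first term of Eq. 14 reads $v_0 / (T_0^3 (T - T_0)^3)$; the helper name `first_term` is hypothetical.

```python
def first_term(v0, T0, T):
    """First term of the variance bound, assumed to be v0 / (T0^3 * (T - T0)^3)."""
    return v0 / (T0 ** 3 * (T - T0) ** 3)

T = 200  # total number of epochs used in the experiments
best_T0 = min(range(1, T), key=lambda T0: first_term(1.0, T0, T))
print(best_T0)  # 100, i.e. T / 2
```

Minimizing this term is the same as maximizing the product $T_0 (T - T_0)$, which peaks when the two factors are equal.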
III. EXPERIMENT
In this section, experiments are carried out on three scene classification datasets. Details about the datasets and the experimental environment are given in the first part. Then, comparison experiments among different schedules and parameter settings are presented.
A. Experimental Setup
Our experiments are conducted on three benchmark datasets that are widely used for scene classification: UC Merced[12], AID[13] and NWPU-RESISC[14]. The UC Merced Land Use dataset provides land use images in 21 classes, such as agricultural, baseball diamond and beach. Each class contains 100 images. Each image is 256×256 pixels and the spatial resolution is about 1 foot. AID is an aerial scene classification dataset proposed by Wuhan University. It contains 10,000 images in 30 classes, collected from Google Earth. Each image in AID measures 600×600 pixels. NWPU-RESISC contains 31,500 images in 45 classes. Each image is 256×256 pixels.
Three CNN architectures, ResNet[3], VGGNet[2] and DenseNet[4], are used for the classification tasks in our experiments. Random cropping from images padded with 4 pixels and random horizontal flipping are performed for data augmentation. $T$ is the total number of epochs during training. The setting of $T$ depends on the convergence time of the applied networks. In our experiments, all three networks converge within 200 epochs. Hence, $T$ is set to 200. Some results use a model pretrained on ImageNet[7] in order to improve the accuracy on the test set.
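The augmentation described above (random cropping from an image padded with 4 pixels, followed by a random horizontal flip) can be sketched on a plain 2-D list. Zero padding and a flip probability of 0.5 are assumptions, since the text does not specify them, and the function name `pad_crop_flip` is hypothetical.

```python
import random

def pad_crop_flip(image, pad=4, rng=random):
    """Pad with zeros, take a random crop of the original size, then maybe flip."""
    h, w = len(image), len(image[0])
    padded = [[0] * (w + 2 * pad) for _ in range(h + 2 * pad)]
    for r in range(h):
        padded[r + pad][pad:pad + w] = image[r]    # copy the image into the centre
    top = rng.randint(0, 2 * pad)                  # random crop offset (inclusive)
    left = rng.randint(0, 2 * pad)
    crop = [row[left:left + w] for row in padded[top:top + h]]
    if rng.random() < 0.5:                         # assumed flip probability
        crop = [row[::-1] for row in crop]
    return crop

image = [[r * 10 + c for c in range(5)] for r in range(5)]  # hypothetical 5x5 image
augmented = pad_crop_flip(image, pad=4, rng=random.Random(1))
```

The output always has the same height and width as the input, so it can be fed to the network without resizing.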
Average accuracy is adopted as the evaluation metric. Average accuracy computes the accuracy of each class separately and then averages the accuracies over the classes. It equals the overall accuracy when the dataset is balanced. In our case, the datasets are all balanced, so the overall accuracy is equal to the average accuracy.
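The difference between the two metrics can be illustrated with a short sketch. The labels below are hypothetical and only show that the two values coincide on a balanced set and can diverge on an imbalanced one.

```python
from collections import defaultdict

def overall_accuracy(y_true, y_pred):
    """Fraction of all samples that are classified correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def average_accuracy(y_true, y_pred):
    """Mean of the per-class accuracies."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)

# Balanced example: two samples per class, so both metrics coincide.
balanced_true = [0, 0, 1, 1, 2, 2]
balanced_pred = [0, 1, 1, 1, 2, 0]

# Imbalanced example: the metrics differ.
skewed_true = [0, 0, 0, 0, 1]
skewed_pred = [0, 0, 0, 0, 0]

print(overall_accuracy(balanced_true, balanced_pred), average_accuracy(balanced_true, balanced_pred))
print(overall_accuracy(skewed_true, skewed_pred), average_accuracy(skewed_true, skewed_pred))
```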
B. Experiment Results
Comparison experiments are conducted with the VGGNet, ResNet and DenseNet networks, and the results are presented in Fig. 2 from left to right. For each network and dataset, the applied strategies include ADAM, constant learning rate, polynomial decay, step decay and SCTSD schedules. The test accuracy of these strategies is reported in Fig. 2(a)-(i). The test accuracy of SCTSD is plotted as a purple line, and in each panel it is higher than that of the other schedules.
Numerical results are listed in Tab. I. The accuracy is estimated for each model, which is trained on 80% of the samples and tested on the other 20%. SCTSD has lower test error than the other schedules and is state-of-the-art. In UC Merced, DenseNet trained by SCTSD with switch point 100 and $\eta_0 = 0.01$ has the highest test accuracy (98.57%). It is better than that of the Siamese Convolutional Network[9] (94.29%), the Deep Feature Fusion Network[10] (97.42%) and the Fusing Local and Global Features Model[11] (98.49%). In AID, the test accuracy of SCTSD (97.40%) is higher than that of the Deep Feature Fusion Network (91.82%) and the Fusing Local and Global Features Model (89.76%). Similarly, in NWPU, the performance of SCTSD (97.19%) is superior to that of the Siamese Convolutional Network (95.95%). Hence, the three networks trained by SCTSD perform better than those trained by the other schedules on all three datasets.
Fig. 2: Classification accuracy on test set. From left to right, the results are from VGGNet, ResNet and DenseNet CNN architectures. From top to bottom, the results are from UC Merced, AID and NWPU datasets. Panels: (a) VGGNet16 on UCMerced; (b) ResNet50 on UCMerced; (c) DenseNet121 on UCMerced; (d) VGGNet16 on AID; (e) ResNet50 on AID; (f) DenseNet121 on AID; (g) VGGNet16 on NWPU; (h) ResNet50 on NWPU; (i) DenseNet121 on NWPU.
Comparison experiments have also been conducted between networks with and without ImageNet pretraining. Taking the results of DenseNet121 as an example, pretraining significantly improves the classification accuracy on UC Merced (by 5.57% to 14.29%), AID (by 5.76% to 12.86%) and NWPU (by 2.04% to 6.62%).
The impact of different ratios between the training and test sets on classification accuracy has also been evaluated. Comparison experiments are conducted with training percentages of 20%, 50% and 80% on the three datasets using DenseNet121, and the results are listed in Tab. II. The SCTSD schedule performs better than the other schedules regardless of the training percentage.
ADAM is selected as the representative of adaptive schedules in both Tab. I and Tab. II. One reason is that ADAM has better performance than the other adaptive schedules, as shown in Tab. III.
Comparison experiments are conducted on DenseNet121 with our strategy using different switch points. The results are listed in Tab. I. According to the theoretical derivation in Section II, the best switch point is set as $T_0^* \approx \frac{1}{2}T = 100$. In UC Merced, the best result is obtained when $T_0 = 100$. In the AID and NWPU datasets, we get the best accuracy when $T_0 = 120$ and $T_0 = 80$, respectively, rather than $\frac{T}{2} = 100$. $\frac{T}{2}$ is the recommended value for the switch point. In all three cases, the best switch point $T_0^*$ is located at or near $\frac{T}{2}$, which is consistent with our theory ($T_0^* \approx \frac{1}{2}T$).
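The best switch point per dataset can be read off the pretrained DenseNet121 rows of Tab. I with a few lines of code. The accuracy values are copied from that table; the dictionary layout and the helper name `best_switch_points` are illustrative choices.

```python
# Pretrained DenseNet121 accuracies (%) copied from Tab. I, keyed by switch point.
accuracy = {
    80: {"UC Merced": 98.52, "AID": 97.14, "NWPU": 97.19},
    100: {"UC Merced": 98.57, "AID": 96.97, "NWPU": 97.14},
    120: {"UC Merced": 98.38, "AID": 97.40, "NWPU": 97.17},
}

def best_switch_points(table):
    """Return, for each dataset, the switch point with the highest accuracy."""
    datasets = next(iter(table.values())).keys()
    return {name: max(table, key=lambda sp: table[sp][name]) for name in datasets}

best = best_switch_points(accuracy)
print(best)  # {'UC Merced': 100, 'AID': 120, 'NWPU': 80}
```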
IV. CONCLUSION
In this letter, we propose an SCTSD schedule that further improves the performance of CNNs in scene classification tasks. Theoretical evidence for the lower error rate of SCTSD is provided by solving a linear regression problem. Comparison experiments have been conducted among manually tuned schedules and adaptive methods. The experimental results show that SCTSD performs better than the other schedules on various datasets. In addition, hyper-parameter settings of the switch point have been discussed and recommended values are provided. In the future, more intelligent approaches will be designed to select the best switch point for SCTSD.
TABLE I: Average test accuracy (%) of the last 5 epochs
| Network | Layers | Training Schedule | Pretrained | UC Merced | AID | NWPU |
|---|---|---|---|---|---|---|
| VGGNet | 16 | Adam, $\eta = 0.0001$ | Yes | 96.66 | 92.31 | 93.00 |
| VGGNet | 16 | Constant Learning Rate, $\eta_0 = 0.01$ | Yes | 96.62 | 92.76 | 90.48 |
| VGGNet | 16 | Polynomial Decay, $\eta_0 = 0.01$, $b = 0.1$ | Yes | 95.48 | 95.01 | 94.46 |
| VGGNet | 16 | Step Decay, $\eta_0 = 0.01$ | Yes | 96.67 | 95.30 | 94.73 |
| VGGNet | 11 | SCTSD Schedule, $\eta_0 = 0.01$ | Yes | 97.14 | 95.27 | 93.90 |
| VGGNet | 16 | SCTSD Schedule, $\eta_0 = 0.01$ | Yes | 97.38 | 95.49 | 94.80 |
| ResNet | 50 | Adam, $\eta = 0.0001$ | Yes | 97.71 | 95.34 | 95.65 |
| ResNet | 50 | Constant Learning Rate, $\eta_0 = 0.01$ | Yes | 97.57 | 96.55 | 92.71 |
| ResNet | 50 | Polynomial Decay, $\eta_0 = 0.01$, $b = 0.1$ | Yes | 97.48 | 96.12 | 96.86 |
| ResNet | 50 | Step Decay, $\eta_0 = 0.01$ | Yes | 98.05 | 96.73 | 97.17 |
| ResNet | 18 | SCTSD Schedule, $\eta_0 = 0.01$ | Yes | 97.81 | 96.48 | 96.43 |
| ResNet | 50 | SCTSD Schedule, $\eta_0 = 0.01$ | Yes | 98.29 | 97.04 | 97.15 |
| DenseNet | 121 | Adam, $\eta = 0.0001$ | No | 89.29 | 87.55 | 89.83 |
| DenseNet | 121 | Constant Learning Rate, $\eta_0 = 0.01$ | No | 83.95 | 82.86 | 90.65 |
| DenseNet | 121 | Polynomial Decay, $\eta_0 = 0.01$, $b = 0.1$ | No | 89.24 | 86.04 | 91.06 |
| DenseNet | 121 | Step Decay, $\eta_0 = 0.01$ | No | 92.00 | 88.99 | 93.59 |
| DenseNet | 121 | SCTSD Schedule, Switch Point = 80, $\eta_0 = 0.01$ | No | 93.00 | 91.35 | 94.35 |
| DenseNet | 121 | SCTSD Schedule, Switch Point = 100, $\eta_0 = 0.01$ | No | 92.76 | 90.98 | 95.10 |
| DenseNet | 121 | SCTSD Schedule, Switch Point = 120, $\eta_0 = 0.01$ | No | 92.00 | 91.56 | 94.91 |
| DenseNet | 121 | Adam, $\eta = 0.0001$ | Yes | 98.43 | 96.72 | 96.45 |
| DenseNet | 121 | Constant Learning Rate, $\eta_0 = 0.01$ | Yes | 98.24 | 95.72 | 94.08 |
| DenseNet | 121 | Polynomial Decay, $\eta_0 = 0.01$, $b = 0.1$ | Yes | 98.05 | 96.36 | 96.68 |
| DenseNet | 121 | Step Decay, $\eta_0 = 0.01$ | Yes | 98.14 | 97.10 | 96.77 |
| DenseNet | 121 | SCTSD Schedule, Switch Point = 80, $\eta_0 = 0.01$ | Yes | 98.52 | 97.14 | 97.19 |
| DenseNet | 121 | SCTSD Schedule, Switch Point = 100, $\eta_0 = 0.01$ | Yes | 98.57 | 96.97 | 97.14 |
| DenseNet | 121 | SCTSD Schedule, Switch Point = 120, $\eta_0 = 0.01$ | Yes | 98.38 | 97.40 | 97.17 |
| Siamese Convolutional Network[9] | * | step decay, $\eta_0 = 0.001$ | - | 94.29 | - | 95.95 |
| Deep Feature Fusion Network[10] | * | * | - | 97.42 | 91.87 | - |
| Fusing Local and Global Features Model[11] | * | * | - | 98.49 | 89.76 | - |
*The learning rate schedule or layers is unknown to us.
-The author did not conduct experiments in this dataset.
TABLE II: Average test accuracy (%) of the last 5 epochs of DenseNet121 with different training percentages
| Training Schedule | 20%: UC Merced | 20%: AID | 20%: NWPU | 50%: UC Merced | 50%: AID | 50%: NWPU | 80%: UC Merced | 80%: AID | 80%: NWPU |
|---|---|---|---|---|---|---|---|---|---|
| Adam | 90.93 | 92.06 | 92.77 | 96.78 | 95.39 | 95.14 | 98.43 | 96.72 | 96.45 |
| Constant Learning Rate | 92.57 | 94.00 | 93.34 | 96.84 | 95.17 | 93.53 | 98.14 | 97.10 | 96.77 |
| Polynomial Decay | 92.17 | 93.49 | 92.89 | 97.56 | 95.66 | 95.40 | 97.10 | 96.77 | 96.68 |
| Step Decay | 92.68 | 93.86 | 93.15 | 97.54 | 95.90 | 95.70 | 98.14 | 97.10 | 96.77 |
| SCTSD Schedule, Switch Point = 80 | 93.08 | 94.15 | 93.61 | 98.11 | 96.06 | 95.98 | 98.52 | 97.14 | 97.19 |
| SCTSD Schedule, Switch Point = 100 | 93.43 | 94.24 | 93.63 | 98.17 | 96.16 | 95.97 | 98.57 | 96.97 | 97.14 |
| SCTSD Schedule, Switch Point = 120 | 92.93 | 94.02 | 93.82 | 98.10 | 96.00 | 95.97 | 98.39 | 97.40 | 97.17 |
TABLE III: Average test accuracy (%) of the last 5 epochs on ResNet18 with different adaptive schedules
| Network | Training Schedule | UC Merced | AID | NWPU |
|---|---|---|---|---|
| ResNet18 | RMSProp | 95.00 | 82.28 | 82.85 |
| ResNet18 | AdaDelta | 94.81 | 93.94 | 95.32 |
| ResNet18 | Adagrad | 89.43 | 93.57 | 94.61 |
| ResNet18 | Adam | 96.24 | 92.58 | 92.69 |
REFERENCES
[1] B. T. Polyak and A. B. Juditsky, “Acceleration of stochastic approximation by averaging,” SIAM Journal on Control and Optimization, vol. 30, no. 4, pp. 838–855, 1992.
[2] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in International Conference on Learning Representations, 2015.
[3] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[4] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
[5] S. J. Reddi, S. Kale, and S. Kumar, “On the convergence of Adam and beyond,” arXiv preprint arXiv:1904.09237, 2019.
[6] A. C. Wilson, R. Roelofs, M. Stern, N. Srebro, and B. Recht, “The marginal value of adaptive gradient methods in machine learning,” in Advances in Neural Information Processing Systems, 2017, pp. 4148–4158.
[7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 248–255.
[8] R. Ge, S. M. Kakade, R. Kidambi, and P. Netrapalli, “The step decay schedule: A near optimal, geometrically decaying learning rate procedure for least squares,” in Advances in Neural Information Processing Systems, 2019, pp. 14951–14962.
[9] X. Liu, Y. Zhou, J. Zhao, R. Yao, B. Liu, and Y. Zheng, “Siamese convolutional neural networks for remote sensing scene classification,” IEEE Geoscience and Remote Sensing Letters, 2019.
[10] S. Chaib, H. Liu, Y. Gu, and H. Yao, “Deep feature fusion for VHR remote sensing scene classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 8, pp. 4775–4784, 2017.
[11] X. Bian, C. Chen, L. Tian, and Q. Du, “Fusing local and global features for high-resolution scene classification,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 10, no. 6, pp. 2889–2901, 2017.
[12] Y. Yang and S. Newsam, “Bag-of-visual-words and spatial extensions for land-use classification,” in Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems. ACM, 2010, pp. 270–279.
[13] G.-S. Xia, J. Hu, F. Hu, B. Shi, X. Bai, Y. Zhong, L. Zhang, and X. Lu, “AID: A benchmark data set for performance evaluation of aerial scene classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 7, pp. 3965–3981, 2017.
[14] G. Cheng, J. Han, and X. Lu, “Remote sensing image scene classification: Benchmark and state of the art,” Proceedings of the IEEE, vol. 105, no. 10, pp. 1865–1883, 2017.
... The introduction of additional parameters is given in Eq. (26). The designing of a hybrid learning schedule with transitions is as follows: Decay function selection is based on the optimization issue, model architecture, and dataset properties [23]. Different characteristics of the learning rate behaviour, including quick decay, gradual annealing, or adaptive modifications, should be captured by each decay function [24]. ...
Article
Full-text available
The learning rate is one of the most crucial hyper-parameters to regulate during the training of the Deep Learning (DL) models and optimizers. Adaptive learning rate algorithms try to automate the time-consuming process of manually setting a suitable learning rate, which is still exhausting. This research uses the learn rate schedule mechanism for training DL models. The learn rate schedule mechanism updates the learning rate for each step or iteration in DL models and optimizers for problem-solving. This paper implements a learn rate schedule mechanism and hybrid learn rate schedule mechanism like piecewise, exponential decay, polynomial time, reciprocal time and cosine annealing decay as adaptive learning rate mechanisms for DL models and optimizers like Adadelta, Adam, RMSprop and Stochastic Gradient Descent with Momentum (SGDM) to improve the accuracy of Received Signal Strength Indicator (RSSI)-based localization in LoRaWAN (Long Range Wide Area Networks) based Internet of Things (IoT) networks. These techniques aim to automate the process of determining suitable learning rates that dynamically update the learning rate for each step or iteration for optimizers and deep learning models. This technique improves the model’s performance by introducing adaptability into the learning process and departing from conventional set learning rates. The mathematical model of the learning rate schedule is derived and formulated with adaptive deep learning rate models to map with the LoRaWAN RSSI-based localization datasets for accessing the performance parameters. The learn rate schedule for different types of localization datasets is also analyzed. The results were compared for all the learning rate schedule mechanisms with the default parameter settings of DL models, and it gives a better accuracy of 98.98%, which is higher than the existing models.
... A methodology known as Switching from Constant to Step Decay (SCTSD) has been introduced to elevate model performance, particularly regarding accuracy. The outcomes of experiments affirm that SCTSD outperforms alternative scheduling techniques across diverse datasets [18]. Several research regarding with improvement in learning rate also was done in previous research [19][20][21]. ...
Article
Full-text available
Indonesia is an agricultural country where most people work as farmers. As an agricultural country, Indonesia produces staple foods, such as rice, corn, sago, and fruits. This research uses the Convolutional Neural Network (CNN), one of the popular algorithms in Deep Learning to classify two varieties of fruit using Stochastic Gradient Descent (SGD) optimizer. The data used in this research is the primary data collected using a smartphone camera. The data are 400 images of two fruit varieties, Mango, and Avocado. The main research objective is to obtain the highest accuracy by modifying the classifier model and learning rate. The model modification, this research uses an ensemble system while in learning rate using an exponential scheduled learning rate. The result shows that the accuracy of the ensemble system is 0.99, the scheduled learning rate is 0.97, while without modifications is 0.53, respectively. However, when using the SGD optimizer to train CNN, it is advised to use a predefined learning rate. A shorter training period with sufficient model accuracy and practicality supports the scheduled learning rate.
... We discovered that the learning rate parameter directly affects the accuracy of the CNN model. Hence, we have seen existing research focused on tuning the learning rate parameter [28], [35], [36]. However, the learning rate parameter was not the primary priority in our experiment because the dropCyclic method was proposed to change the maximum of the learning rate in each cycle. ...
Article
Full-text available
The ensemble learning method is a necessary process that provides robustness and is more accurate than the single model. The snapshot ensemble convolutional neural network (CNN) has been successful and widely used in many domains, such as image classification, fault diagnosis, and plant image classification. The advantage of the snapshot ensemble CNN is that it combines the cyclic learning rate schedule in the algorithm to snap the best model in each cycle. In this research, we proposed the dropCyclic learning rate schedule, which is a step decay to decrease the learning rate value in every learning epoch. The dropCyclic can reduce the learning rate and find the new local minimum in the subsequent cycle. We evaluated the snapshot ensemble CNN method based on three learning rate schedules: cyclic cosine annealing, max-min cyclic cosine learning rate scheduler, and dropCyclic then using three backbone CNN architectures: MobileNetV2, VGG16, and VGG19. The snapshot ensemble CNN methods were tested on three aerial image datasets: UCM, AID, and EcoCropsAID. The proposed dropCyclic learning rate schedule outperformed the other learning rate schedules on the UCM dataset and obtained high accuracy on the AID and EcoCropsAID datasets. We also compared the proposed dropCyclic learning rate schedule with other existing methods. The results show that the dropCyclic method achieved higher classification accuracy compared with other existing methods.
Article
The rapid and accurate identification of the initial stage of surface oil spills is crucial for reducing economic losses and environmental pollution. However, because the initial stage of a surface oil spill covers a small area and often integrates with soil and lakes, achieving rapid and accurate identification using common methods, such as satellite remote sensing, is challenging. To address this issue, unmanned aerial vehicle (UAV)-based surface oil spill rapid identification technology is used. This article proposes a Res-SUnet network model based on small samples from UAV multispectral images, utilizing the reflectance spectral feature information of oil under sunlight for fast and accurate identification and segmentation of small surface oil spill targets. By preprocessing 54 sets of multispectral images ( 54×654\times 6 spectral channel images) collected from the surface oil spill area of the Daqing oilfield and dividing them into training, test, and validation sets in a ratio of 3:2:2, the results show an improvement of 1.17% and 1.39% over traditional U-net and improved S-UNet models, respectively. This method is characterized by its lightweight, rapid, and accurate identification, making it especially suitable for the rapid and precise identification of UAV surface oil spills at the initial stage, providing a valuable tool for enterprises to reduce economic losses and environmental pollution.
Conference Paper
Full-text available
Recent work has shown that convolutional networks can be substantially deeper, more accurate, and efficient to train if they contain shorter connections between layers close to the input and those close to the output. In this paper, we embrace this observation and introduce the Dense Convolutional Network (DenseNet), which connects each layer to every other layer in a feed-forward fashion. Whereas traditional convolutional networks with L layers have L connections - one between each layer and its subsequent layer - our network has L(L+1)/2 direct connections. For each layer, the feature-maps of all preceding layers are used as inputs, and its own feature-maps are used as inputs into all subsequent layers. DenseNets have several compelling advantages: they alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters. We evaluate our proposed architecture on four highly competitive object recognition benchmark tasks (CIFAR-10, CIFAR-100, SVHN, and ImageNet). DenseNets obtain significant improvements over the state-of-the-art on most of them, whilst requiring less memory and computation to achieve high performance. Code and models are available at https://github.com/liuzhuang13/DenseNet.
Article
Full-text available
Aerial scene classification, which aims to automatically label an aerial image with a specific semantic category, is a fundamental problem for understanding high-resolution remote sensing imagery. In recent years, it has become an active task in remote sensing area and numerous algorithms have been proposed for this task, including many machine learning and data-driven approaches. However, the existing datasets for aerial scene classification like UC-Merced dataset and WHU-RS19 are with relatively small sizes, and the results on them are already saturated. This largely limits the development of scene classification algorithms. This paper describes the Aerial Image Dataset (AID): a large-scale dataset for aerial scene classification. The goal of AID is to advance the state-of-the-arts in scene classification of remote sensing images. For creating AID, we collect and annotate more than ten thousands aerial scene images. In addition, a comprehensive review of the existing aerial scene classification techniques as well as recent widely-used deep learning methods is given. Finally, we provide a performance analysis of typical aerial scene classification and deep learning approaches on AID, which can be served as the baseline results on this benchmark.
Convolutional neural networks (CNNs) have shown powerful feature representation capability, which provides novel avenues to improve scene classification of remote sensing imagery. Although we can acquire large collections of satellite images, the lack of rich label information is still a major concern in the remote sensing field. In addition, remote sensing data sets have their own limitations, such as the small scale of scene classes and lack of image diversity. To mitigate the impact of these problems, a Siamese CNN, which combines the identification and verification models of CNNs, is proposed in this letter. A metric learning regularization term is explicitly imposed on the features learned through CNNs, which makes the Siamese networks more robust. We carried out experiments on three widely used remote sensing data sets for performance evaluation. Experimental results show that our proposed method outperforms the existing methods.
The rapid development of remote sensing technology allows us to obtain images with high and very high resolution (VHR). VHR imagery scene classification has become an important and challenging problem. In this paper, we introduce a framework for VHR scene understanding. First, the pretrained visual geometry group network (VGG-Net) model is used as a deep feature extractor to extract informative features from the original VHR images. Second, we select the fully connected layers constructed by VGG-Net, treating each layer as a separate feature descriptor, and then combine them to construct the final representation of the VHR image scenes. Third, discriminant correlation analysis (DCA) is adopted as the feature fusion strategy to further refine the original features extracted from VGG-Net, which allows a more efficient fusion approach at a smaller cost than the traditional feature fusion strategies. We apply our approach to three challenging data sets: 1) the UC MERCED data set, which contains 21 different aerial scene categories with submeter resolution; 2) the WHU-RS data set, which contains 19 challenging scene categories with various resolutions; and 3) the Aerial Image data set, which has 10 000 images within 30 challenging scene categories with various resolutions. The experimental results demonstrate that our proposed method outperforms the state-of-the-art approaches. Using the feature fusion technique achieves a higher accuracy than solely using the raw deep features. Moreover, the proposed method based on DCA fusion produces informative features that describe the image scenes with much lower dimension.
Adaptive optimization methods, which perform local optimization with a metric constructed from the history of iterates, are becoming increasingly popular for training deep neural networks. Examples include AdaGrad, RMSProp, and Adam. We show that for simple overparameterized problems, adaptive methods often find drastically different solutions than gradient descent (GD) or stochastic gradient descent (SGD). We construct an illustrative binary classification problem where the data is linearly separable, GD and SGD achieve zero test error, and AdaGrad, Adam, and RMSProp attain test errors arbitrarily close to half. We additionally study the empirical generalization capability of adaptive methods on several state-of-the-art deep learning models. We observe that the solutions found by adaptive methods generalize worse (often significantly worse) than SGD, even when these solutions have better training performance. These results suggest that practitioners should reconsider the use of adaptive methods to train neural networks.
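To make concrete the phrase "a metric constructed from the history of iterates," the following minimal sketch contrasts one plain SGD step with one AdaGrad step on a two-parameter toy problem. It uses the textbook AdaGrad rule (per-coordinate accumulation of squared gradients); the learning rate, epsilon, and gradient values are illustrative assumptions and do not come from the cited paper's experiments.

```python
import math


def sgd_step(w, grad, lr=0.1):
    """Plain SGD: every coordinate shares the same learning rate."""
    return [wi - lr * gi for wi, gi in zip(w, grad)]


def adagrad_step(w, grad, accum, lr=0.1, eps=1e-8):
    """Textbook AdaGrad: each coordinate is rescaled by the square root
    of its own accumulated squared gradients."""
    new_accum = [ai + gi * gi for ai, gi in zip(accum, grad)]
    new_w = [
        wi - lr * gi / (math.sqrt(ai) + eps)
        for wi, gi, ai in zip(w, grad, new_accum)
    ]
    return new_w, new_accum


if __name__ == "__main__":
    w0 = [0.0, 0.0]
    grad = [10.0, 0.1]  # hypothetical gradient with very different scales
    print("SGD:    ", sgd_step(w0, grad))
    print("AdaGrad:", adagrad_step(w0, grad, accum=[0.0, 0.0])[0])
```

On the first step, SGD moves each coordinate in proportion to its gradient, whereas AdaGrad moves both coordinates by nearly the same amount. This per-coordinate rescaling is one way the two families can end up at different solutions.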
In this paper, a fused global saliency-based multiscale multiresolution multistructure local binary pattern (salM³LBP) feature and local codebookless model (CLM) feature is proposed for high-resolution image scene classification. First, two different but complementary types of descriptors (pixel intensities and differences) are developed to extract global features, characterizing the dominant spatial features in a multiple-scale, multiple-resolution, and multiple-structure manner. The micro/macrostructure information and rotation invariance are guaranteed in the global feature extraction process. For dense local feature extraction, CLM is utilized to model local enrichment scale invariant feature transform descriptor, and dimension reduction is conducted via joint low-rank learning with support vector machine. Finally, a fused feature representation of salM³LBP and CLM is presented as the scene descriptor to train a kernel-based extreme learning machine for scene classification. The proposed approach is extensively evaluated on three challenging benchmark scene datasets (the 21-class land-use scene, 19-class satellite scene, and a newly available 30-class aerial scene), and the experimental results show that the proposed approach leads to superior classification performance compared with the state-of-the-art classification methods.
Remote sensing image scene classification plays an important role in a wide range of applications and hence has been receiving remarkable attention. During the past years, significant efforts have been made to develop various datasets or present a variety of approaches for scene classification from remote sensing images. However, a systematic review of the literature concerning datasets and methods for scene classification is still lacking. In addition, almost all existing datasets have a number of limitations, including the small scale of scene classes and the image numbers, the lack of image variations and diversity, and the saturation of accuracy. These limitations severely limit the development of new approaches especially deep learning-based methods. This paper first provides a comprehensive review of the recent progress. Then, we propose a large-scale dataset, termed "NWPU-RESISC45", which is a publicly available benchmark for REmote Sensing Image Scene Classification (RESISC), created by Northwestern Polytechnical University (NWPU). This dataset contains 31,500 images, covering 45 scene classes with 700 images in each class. The proposed NWPU-RESISC45 (i) is large-scale on the scene classes and the total image number, (ii) holds big variations in translation, spatial resolution, viewpoint, object pose, illumination, background, and occlusion, and (iii) has high within-class diversity and between-class similarity. The creation of this dataset will enable the community to develop and evaluate various data-driven algorithms. Finally, several representative methods are evaluated using the proposed dataset and the results are reported as a useful baseline for future research.
In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively.