Towards optimal learning rate schedule in scene
classification network
Minglong Xue, Jian Li, Qingli Luo*
Abstract—Stochastic Gradient Descent (SGD) and adaptive methods, including ADAM, RMSProp, AdaDelta and AdaGrad, are the two dominant families of optimization algorithms for training Convolutional Neural Networks (CNNs) in scene classification tasks. Recent work reveals that the adaptive methods lead to degraded performance in image classification tasks compared with SGD. In this letter, a learning rate schedule named switching from constant to step decay (SCTSD) is proposed to further improve the classification accuracy of SGD. SCTSD begins with a constant learning rate and switches to step decay learning rates at an appropriate point. Theoretical evidence is provided for the superiority of SCTSD over other manually tuned schedules. Comparison experiments have been conducted among adaptive methods, SCTSD and other manual schedules with three CNN architectures. The experimental results on several scene classification datasets show that SCTSD achieves the highest accuracy on the test set and is state-of-the-art. Finally, some suggestions on the hyper-parameter selection of SCTSD are given for scene classification.
Index Terms—stochastic gradient descent (SGD), convolutional neural network (CNN), scene classification tasks, learning rate schedule, switching from constant to step decay schedule (SCTSD).
I. INTRODUCTION
RECENT advances in scene classification rely mainly on Convolutional Neural Networks (CNNs) because of their high performance in such tasks. A CNN extracts features automatically and accepts various inputs, including SAR, Lidar, hyperspectral and optical images. Meanwhile, a CNN has a rather low inference time once it has been trained. CNNs have achieved great success in automatic target recognition (ATR), marine monitoring and forest monitoring.
In the field of CNN optimization, Stochastic Gradient Descent (SGD) is a dominant algorithm based on the first-order derivative of the loss function. Because of the large number of parameters in a CNN and its sophisticated loss function, traditional second-derivative-based optimization algorithms do not work for training CNNs. SGD takes steps towards the minimum of the loss function, and decaying learning rate schedules help it approach the optimal solution. Classical SGD uses manually tuned schedules, such as the polynomial [1] and step decay schedules [2][3][4], to find solutions for CNNs.
Minglong Xue, Jian Li and Qingli Luo are with the State Key Laboratory of Precision Measuring Technology and Instruments, Tianjin University, Tianjin 300072, China (e-mail: xueminglong@tju.edu.cn; luoqingli@tju.edu.cn).
Qingli Luo is also with the Binhai International Advanced Structural Integrity Research Centre, Tianjin 300072, China (e-mail: luoqingli@tju.edu.cn).
This work is supported by the National Natural Science Foundation of China (Grant No. 41601446) and the Tianjin Natural Science Foundation (Grant No. 16JCQNJC01200). (Corresponding author: Qingli Luo.)
Some variants of SGD, which use adaptive learning rates and gradients, are more popular than SGD because of their fast convergence during CNN training. These adaptive methods leverage past learning rates and gradients.
Although adaptive methods have achieved great success, recent works reveal that they cannot find an optimal solution in some settings [5][6]. Extreme gradients arise in the optimization process; they occur rarely but are informative. Because of the moving-average strategy of some adaptive methods, the information carried by these large gradients is lost, which leads to poor convergence [5]. Furthermore, adaptive methods may converge to "sharp" minima of the loss function [6]. A CNN trained by adaptive methods thus tends to perform worse on unseen data than one trained by SGD.
Several state-of-the-art works choose SGD with step decay schedules as the CNN optimizer. VGGNet [2] used stacks of 3x3 filters in place of larger filters and ranked among the winners of the 2014 ImageNet [7] recognition challenge. ResNet achieved a further improvement through residual shortcut connections [3].
To further improve performance, the switching from constant to step decay (SCTSD) schedule is proposed for SGD. SCTSD begins with a constant learning rate and then switches to step decay learning rates at a specific switch point. The constant learning rate in the initial stage helps reduce the search space rapidly; a more precise solution can then be found in this space through the step decay schedule. The key contributions of this work are as follows:
• Firstly, we provide theoretical evidence that SCTSD is superior to the polynomial decay and step decay schedules. Our work builds on the proof of Ge et al. [8] that the step decay schedule outperforms the polynomial decay schedule for training CNNs. We take a step further and prove the superiority of SCTSD over the step decay schedule.
• Secondly, comparison experiments have been conducted among a constant learning rate, adaptive methods, and the SCTSD, polynomial decay and step decay schedules on three popular scene classification datasets, using three widely used CNN architectures. The results show that CNNs trained with SCTSD achieve higher accuracy than those trained with the other schedules. Finally, the parameter selection of SCTSD is discussed for different scenarios.
II. SCTSD SCHEDULE
We provide theoretical evidence of the superiority of SCTSD over the step and polynomial decay schedules based on the work of Ge et al. [8]. A one-layer CNN without an activation unit can be viewed as a linear regression model. By solving this regression problem, we prove that SCTSD leads to a lower test error than the step decay schedule.
A. Prior knowledge
Let us start with a linear regression problem. The loss function to be minimized is
    min_{β∈R^d} L(β) = min_{β∈R^d} E[(Xβ − Y)²]    (1)
Beginning with initial weights β_0 and a sequence of learning rates η_0, η_1, ..., η_t, the weights β_t of the regression model are updated by SGD through the following iteration
    β_{t+1} = β_t − η_t ∇L(β_t)    (2)
    ∇L(β_t) = x_t^T · (x_t β_t − y_t)    (3)
where x_t, y_t are the training data at time t and η_t is the learning rate at time t. y_t can be written as y_t = x_t β* + e_t, where β* is the theoretical optimal solution that SGD can find for L(β). The Hessian of L(β) is H = E[∇²L(β_t)]. The noise e in the data satisfies E[e] = 0 and var(e) = E[e^T e] = θ².
According to Eq. 1, the loss error ε_t between L(β_t) and L(β*) is denoted as
    ε_t = E[L(β_t)] − L(β*) = E[(β_t − β*)^T H (β_t − β*)]    (4)
Suppose that the eigenvalues of H are λ^(i), i = 1, 2, ..., d. The condition number κ, which measures the noise sensitivity of the model, is defined as κ = λ_max/λ_min.
The variance v_t^(i) of the solution β_t is defined as
    v_t^(i) = E[(β_t^(i) − β*^(i))²]    (5)
Then the error ε_t can be rewritten as
    ε_t = Σ_{i=1}^{d} λ^(i) v_t^(i)    (6)
According to Eq. 2, Eq. 3 and Eq. 5, the variance v_t^(i) of the solution β_t found by SGD satisfies the following recursion
    v_t^(i) = (1 − η_t λ^(i))² v_{t−1}^(i) + λ^(i) θ² η_t²
            = Π_{t=1}^{T} (1 − η_t λ^(i))² v_0^(i) + λ^(i) θ² Σ_{t=1}^{T} η_t² Π_{i=t+1}^{T} (1 − η_i λ^(i))²    (7)
According to [8], ε_t is lower bounded by
    lim_{t→∞} (E[L(β_t)] − L(β*)) / (θ²d/t) ≥ 1    (8)
where θ²d/t is the statistical minimal error that SGD can reach.
Fig. 1: Learning Rate Schedules
B. Step decay schedule is superior to polynomial decay schedule
The polynomial and step decay schedules are modeled by Ge et al. [8] as η_t = η_0/(1 + b·t) and η_t = η_0 · exp(−b·t), respectively. They proved that the error of the polynomial decay schedule is lower bounded by
    ε_T ≥ exp(−(T/κ) log T) · ε_0 + (θ²d/T) · (κ/64)    (9)
The error of the step decay schedule is upper bounded by
    ε_T ≤ 2 · exp(−(T/(2κ)) log T) · ε_0 + 4 · log T · (θ²d/T)    (10)
For any T < κ², the solution found by the step decay schedule is clearly much better than that of the polynomial decay schedule, since it reaches an error rate of ε_T^* · log T · log κ, which is lower than ε_T^* · κ, where ε_T^* = θ²d/T denotes the statistical minimal error of Eq. 8.
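As a concrete illustration of the two baseline schedules (not part of the derivation in [8]; the function names and the example values of η_0 and b below are our own), they can be written as:

import math

def polynomial_decay(eta0, b, t):
    # eta_t = eta0 / (1 + b * t), the polynomial decay form modeled in [8]
    return eta0 / (1.0 + b * t)

def step_decay(eta0, b, t):
    # eta_t = eta0 * exp(-b * t), the geometrically decaying (step decay) form in [8]
    return eta0 * math.exp(-b * t)

# Example with eta0 = 0.01 and b = 0.1:
# polynomial_decay(0.01, 0.1, 100) is roughly 9.1e-4,
# while step_decay(0.01, 0.1, 100) is roughly 4.5e-7.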
C. SCTSD is superior to step decay schedule
The diagram of our schedule is shown in Fig. 1. The main idea of SCTSD is that a constant learning rate is employed first to reduce the search scope, and a more precise solution is then found through the step decay schedule. Details of SCTSD are shown in Algorithm 1. We now prove by induction that the error rate of SCTSD can be reduced to ε_T^* · log κ.
In the first stage, when t < T_0 and η_t = η_0 = 1/L², we prove that v_t ≤ (1 − λ^(i)/L²)^{2t} v_0 + θ²/L². According to the iterative variance in Eq. 7, this clearly holds at time t = 0.
When 0 < t < T_0, suppose that it holds at time t − 1; then
    v_t^(i) = (1 − λ^(i)/L²)² v_{t−1}^(i) + λ^(i) θ²/(L²)²
            ≤ (1 − λ^(i)/L²)^{2t} v_0^(i) + θ²/L²    (11)
After the first stage of SCTSD, the variance is
    v_{T_0} ≤ (1 − λ/L²)^{2T_0} v_0 + θ²/L² ≤ v_0/T_0³ + θ²/L²    (12)
Input: number of initial epochs T_0;
       number of total epochs T;
       initial learning rate η_0;
       initial weights β_0
Output: model weights β_t
η_l = η_0;
for t ← 1 to T_0 do
    β_t ← β_{t−1} − η_l · ∇L(β_{t−1});
end
T̂ = T − T_0;
T̂_0 = T̂ / log T̂;
for l ← 1 to log T̂ do
    η_l ← η_l / 2;
    for t ← T_0 + (l−1) · T̂_0 to T_0 + l · T̂_0 do
        β_t ← β_{t−1} − η_l · ∇L(β_{t−1});
    end
end
Algorithm 1: SCTSD schedule
In the second stage, when T_0 < t < T, the learning rate η_t is reduced by a factor of 2 at the times t = T_0 + (l−1) · T̂/log T̂, where l = 1, 2, ..., log(T − T_0). We prove by induction that v_{T_0 + l·T̂_0} ≤ v_{T_0} exp(−3l) + 2 T̂_0 η_l² λ θ², where T̂_0 = T̂/log T̂ and T̂ = T − T_0.
In the base case, t = T_0 and l = 0, the claim clearly holds since v_{T_0} ≤ v_{T_0}.
When t > T_0 and l > 0, suppose that the claim holds for l − 1; then
    v_{T_0 + l·T̂_0} ≤ v_{T_0 + (l−1)·T̂_0} · exp(−2 η_l λ T̂_0) + T̂_0 η_l² λ θ²
                   ≤ v_{T_0} · exp(−3l) + (2 exp(−3l) + 1) · T̂_0 η_l² λ θ²
                   ≤ v_{T_0} · exp(−3l) + 2 T̂_0 η_l² λ θ²    (13)
Further, combining this with Eq. 12, we have
    v_T ≤ v_0 / (T_0³ · (T − T_0)³) + (2 log κ / (λ · T)) · θ²    (14)
Therefore, according to Eq. 6, Eq. 11 and Eq. 14, for any κ < T < κ², our schedule has an error rate of ε_T^* · log κ, which is much lower than the ε_T^* · log T · log κ of the step decay schedule.
D. The best switch point T_0^* ≈ T/2
According to Eq. 14, once T has been set, the variance v_T is governed by the first term v_0 / (T_0³ · (T − T_0)³). Since T_0 + (T − T_0) = T, the product T_0³ (T − T_0)³ is maximized, and v_T is therefore lowest, when T_0 ≈ T − T_0 ≈ T/2. Hence, the best switch point is located near T_0^* ≈ T/2.
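To make the schedule concrete, the following Python sketch returns the SCTSD learning rate for a given epoch under our reading of Algorithm 1; the function name, its arguments and the default switch point T_0 = T/2 are illustrative choices rather than the exact code of our training pipeline.

import math

def sctsd_learning_rate(epoch, total_epochs=200, switch_point=None, eta0=0.01):
    # Stage 1: constant learning rate eta0 for the first `switch_point` epochs.
    # Stage 2: step decay, halving the rate every T_hat / log(T_hat) epochs,
    # where T_hat = total_epochs - switch_point (cf. Algorithm 1).
    if switch_point is None:
        switch_point = total_epochs // 2  # recommended switch point T_0* ~ T/2
    if epoch < switch_point:
        return eta0
    t_hat = total_epochs - switch_point
    stage_length = max(1, round(t_hat / math.log(max(t_hat, 2))))
    halvings = (epoch - switch_point) // stage_length + 1
    return eta0 / (2 ** halvings)

# Example: the full schedule for T = 200, T_0 = 100 and eta0 = 0.01
# schedule = [sctsd_learning_rate(e, 200, 100, 0.01) for e in range(200)]

In a training loop, this function would be queried at the start of each epoch to set the SGD learning rate.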
III. EXPERIMENT
In this section, experiments are carried out on three scene classification datasets. Details about the datasets and the experimental environment are given in the first part. Then, comparison experiments among different schedules and parameter settings are conducted.
A. Experimental Setup
Our experiments are conducted on three benchmark datasets that are widely used for scene classification: UC Merced [12], AID [13], and NWPU-RESISC [14]. The UC Merced Land Use dataset provides land-use images of 21 classes, including agricultural, baseball diamond and beach. Each class contains 100 images; each image is 256x256 pixels with a spatial resolution of about 1 foot. AID is an aerial scene classification dataset proposed by Wuhan University. It contains 10,000 images categorized into 30 classes, collected from Google Earth, and each image measures 600x600 pixels. NWPU-RESISC contains 31,500 images in 45 classes; each image is 256x256 pixels.
Three CNN architectures, ResNet [3], VGGNet [2] and DenseNet [4], are used for the classification tasks in our experiments. Random cropping from images padded with 4 pixels and random horizontal flipping are performed for data augmentation; a minimal sketch of this pipeline is given below. T is the total number of epochs during training, and its setting depends on the convergence time of the applied networks. In our experiments, all three networks converge within 200 epochs, so T is set to 200. Some results use models pretrained on ImageNet [7] in order to improve the accuracy on the test set.
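The exact implementation of this pipeline is framework dependent; assuming a PyTorch/torchvision setup, a minimal sketch of the augmentation described above and of driving SGD with the SCTSD learning rate could look as follows (the crop size of 256 and the momentum value are illustrative assumptions, not settings reported in this letter).

import torch
from torchvision import transforms

# Random crop from images padded with 4 pixels plus random horizontal flip,
# as described above; ToTensor converts images to CHW float tensors.
train_transform = transforms.Compose([
    transforms.RandomCrop(256, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# SGD whose learning rate is reset every epoch by the SCTSD schedule
# (see the sctsd_learning_rate sketch in Section II); `model` stands for
# any of the three CNN architectures.
# optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# for epoch in range(200):
#     for group in optimizer.param_groups:
#         group["lr"] = sctsd_learning_rate(epoch, total_epochs=200, switch_point=100)
#     ...  # one pass over the training set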
Average accuracy is adopted as the evaluation metric. It computes the accuracy of each class separately and then averages the per-class accuracies. It equals the overall accuracy when the dataset is balanced. In our case, the datasets are all well balanced, so the overall accuracy equals the average accuracy.
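For clarity, a minimal NumPy sketch of this metric (illustrative code, not taken from our evaluation scripts) is:

import numpy as np

def average_accuracy(y_true, y_pred):
    # Accuracy of each class computed separately, then averaged over classes;
    # equals the overall accuracy when the test set is balanced.
    classes = np.unique(y_true)
    per_class = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(per_class))

# Example: average_accuracy(np.array([0, 0, 1, 1]), np.array([0, 1, 1, 1])) returns 0.75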
B. Experiment Results
Comparison experiments are conducted with VGGNet, ResNet and DenseNet, and the results are presented in Fig. 2 from left to right. For each network and dataset, the applied strategies include ADAM, a constant learning rate, and the polynomial decay, step decay and SCTSD schedules. The test accuracy of these strategies is reported in Fig. 2(a)-(i). The test accuracy of SCTSD is plotted as the purple line; in each subfigure it is higher than that of the other schedules.
Numerical results are listed in Tab. I. The accuracy is estimated for each model, which is trained on 80% of the samples and tested on the remaining 20%. SCTSD has a lower test error than the other schedules and is state-of-the-art. On UC Merced, DenseNet trained with SCTSD, with switch point 100 and η_0 = 0.01, has the highest test accuracy (98.57%). It is better than the Siamese Convolutional Network [9] (94.29%), the Deep Feature Fusion Network [10] (97.42%) and the Fusing Local and Global Features Model [11] (98.49%). On AID, the test accuracy of SCTSD (97.40%) is higher than that of the Deep Feature Fusion Network (91.82%) and the Fusing Local and Global Features Model (89.76%). Similarly, on NWPU, the performance of SCTSD (97.19%) is superior to the Siamese Convolutional Network (95.95%). Hence, the three networks trained with SCTSD perform better than those trained with the other schedules on all three datasets.
(a) VGGNet16 on UCMerced (b) ResNet50 on UCMerced (c) DenseNet121 on UCMerced
(d) VGGNet16 on AID (e) ResNet50 on AID (f) DenseNet121 on AID
(g) VGGNet16 on NWPU (h) ResNet50 on NWPU (i) DenseNet121 on NWPU
Fig. 2: Classification accuracy on test set. From left to right, the results are from VGGNet, ResNet and DenseNet CNN
architectures. From top to bottom, the results are from UC Merced, AID and NWPU datasets.
Comparison experiments have also been conducted between networks with and without ImageNet pretraining. Taking the results of DenseNet121 as an example, the pretrained strategy significantly improves the classification accuracy on UC Merced (by 5.57% ∼ 14.29%), AID (by 5.76% ∼ 12.86%) and NWPU (by 2.04% ∼ 6.62%).
The impact of different ratios between training and test sets on classification accuracy has also been evaluated. Comparison experiments are conducted with training percentages of 20%, 50% and 80% on the three datasets using DenseNet121, and the results are listed in Tab. II. The SCTSD schedule performs better than the other schedules regardless of which training percentage is applied.
ADAM is selected as the representative of the adaptive schedules in both Tab. I and Tab. II, because ADAM performs better than the other adaptive schedules, as shown in Tab. III.
Comparison experiments are also conducted on DenseNet121 using our strategy with different switch points; the results are listed in Tab. I. According to the theoretical derivation in Section II, the best switch point is set as T_0^* ≈ T/2 = 100. On UC Merced, the best result is obtained when T_0 = 100. On the AID and NWPU datasets, the best accuracy is obtained with T_0 = 120 and T_0 = 80, respectively, rather than T/2 = 100, although T/2 remains the recommended value for setting the switch point. In all three cases, the best switch point T_0^* is located at or near T/2, which is consistent with our theory (T_0^* ≈ T/2).
IV. CONCLUSION
In this letter, we propose the SCTSD schedule, which further improves the performance of CNNs in scene classification tasks. Theoretical evidence for the low error rate of SCTSD is provided by solving a linear regression problem. Comparison experiments have been conducted among manually tuned schedules and adaptive methods. The experimental results show that SCTSD performs better than the other schedules on various datasets. In addition, the setting of the switch point hyper-parameter has been discussed and recommended values are provided. In the future, more intelligent approaches will be designed to select the best switch point for SCTSD.
TABLE I: Average test accuracy of the last 5 epochs

Network | Layers | Training Schedule | Pretrained | UC Merced | AID | NWPU
VGGNet | 16 | Adam, η = 0.0001 | Yes | 96.66 | 92.31 | 93.00
VGGNet | 16 | Constant Learning Rate, η0 = 0.01 | Yes | 96.62 | 92.76 | 90.48
VGGNet | 16 | Polynomial Decay, η0 = 0.01, b = 0.1 | Yes | 95.48 | 95.01 | 94.46
VGGNet | 16 | Step Decay, η0 = 0.01 | Yes | 96.67 | 95.30 | 94.73
VGGNet | 11 | SCTSD Schedule, η0 = 0.01 | Yes | 97.14 | 95.27 | 93.90
VGGNet | 16 | SCTSD Schedule, η0 = 0.01 | Yes | 97.38 | 95.49 | 94.80
ResNet | 50 | Adam, η = 0.0001 | Yes | 97.71 | 95.34 | 95.65
ResNet | 50 | Constant Learning Rate, η0 = 0.01 | Yes | 97.57 | 96.55 | 92.71
ResNet | 50 | Polynomial Decay, η0 = 0.01, b = 0.1 | Yes | 97.48 | 96.12 | 96.86
ResNet | 50 | Step Decay, η0 = 0.01 | Yes | 98.05 | 96.73 | 97.17
ResNet | 18 | SCTSD Schedule, η0 = 0.01 | Yes | 97.81 | 96.48 | 96.43
ResNet | 50 | SCTSD Schedule, η0 = 0.01 | Yes | 98.29 | 97.04 | 97.15
DenseNet | 121 | Adam, η = 0.0001 | No | 89.29 | 87.55 | 89.83
DenseNet | 121 | Constant Learning Rate, η0 = 0.01 | No | 83.95 | 82.86 | 90.65
DenseNet | 121 | Polynomial Decay, η0 = 0.01, b = 0.1 | No | 89.24 | 86.04 | 91.06
DenseNet | 121 | Step Decay, η0 = 0.01 | No | 92.00 | 88.99 | 93.59
DenseNet | 121 | SCTSD Schedule, Switch Point = 80, η0 = 0.01 | No | 93.00 | 91.35 | 94.35
DenseNet | 121 | SCTSD Schedule, Switch Point = 100, η0 = 0.01 | No | 92.76 | 90.98 | 95.10
DenseNet | 121 | SCTSD Schedule, Switch Point = 120, η0 = 0.01 | No | 92.00 | 91.56 | 94.91
DenseNet | 121 | Adam, η = 0.0001 | Yes | 98.43 | 96.72 | 96.45
DenseNet | 121 | Constant Learning Rate, η0 = 0.01 | Yes | 98.24 | 95.72 | 94.08
DenseNet | 121 | Polynomial Decay, η0 = 0.01, b = 0.1 | Yes | 98.05 | 96.36 | 96.68
DenseNet | 121 | Step Decay, η0 = 0.01 | Yes | 98.14 | 97.10 | 96.77
DenseNet | 121 | SCTSD Schedule, Switch Point = 80, η0 = 0.01 | Yes | 98.52 | 97.14 | 97.19
DenseNet | 121 | SCTSD Schedule, Switch Point = 100, η0 = 0.01 | Yes | 98.57 | 96.97 | 97.14
DenseNet | 121 | SCTSD Schedule, Switch Point = 120, η0 = 0.01 | Yes | 98.38 | 97.40 | 97.17
Siamese Convolutional Network [9] | * | step decay, η0 = 0.001 | - | 94.29 | - | 95.95
Deep Feature Fusion Network [10] | * | * | - | 97.42 | 91.87 | -
Fusing Local and Global Features Model [11] | * | * | - | 98.49 | 89.76 | -
* The learning rate schedule or the number of layers is unknown to us.
- The authors did not conduct experiments on this dataset.
TABLE II: Average test accuracy of the last 5 epochs of DenseNet121 with different training percentages

Training Schedule | 20% UC Merced | 20% AID | 20% NWPU | 50% UC Merced | 50% AID | 50% NWPU | 80% UC Merced | 80% AID | 80% NWPU
Adam | 90.93 | 92.06 | 92.77 | 96.78 | 95.39 | 95.14 | 98.43 | 96.72 | 96.45
Constant Learning Rate | 92.57 | 94.00 | 93.34 | 96.84 | 95.17 | 93.53 | 98.14 | 97.10 | 96.77
Polynomial Decay | 92.17 | 93.49 | 92.89 | 97.56 | 95.66 | 95.40 | 97.10 | 96.77 | 96.68
Step Decay | 92.68 | 93.86 | 93.15 | 97.54 | 95.90 | 95.70 | 98.14 | 97.10 | 96.77
SCTSD Schedule, Switch Point = 80 | 93.08 | 94.15 | 93.61 | 98.11 | 96.06 | 95.98 | 98.52 | 97.14 | 97.19
SCTSD Schedule, Switch Point = 100 | 93.43 | 94.24 | 93.63 | 98.17 | 96.16 | 95.97 | 98.57 | 96.97 | 97.14
SCTSD Schedule, Switch Point = 120 | 92.93 | 94.02 | 93.82 | 98.10 | 96.00 | 95.97 | 98.39 | 97.40 | 97.17
TABLE III: Average test accuracy of the last 5 epochs on ResNet18 with different adaptive schedules

Network | Training Schedule | UC Merced | AID | NWPU
ResNet18 | RMSProp | 95.00 | 82.28 | 82.85
ResNet18 | AdaDelta | 94.81 | 93.94 | 95.32
ResNet18 | Adagrad | 89.43 | 93.57 | 94.61
ResNet18 | Adam | 96.24 | 92.58 | 92.69
REFERENCES
[1] B. T. Polyak and A. B. Juditsky, “Acceleration of stochastic approxima-
tion by averaging,” SIAM Journal on Control and Optimization, vol. 30,
no. 4, pp. 838–855, 1992.
[2] K. Simonyan and A. Zisserman, “Very deep convolutional networks for
large-scale image recognition,” in International Conference on Learning
Representations, 2015.
[3] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in Proceedings of the IEEE conference on computer vision
and pattern recognition, 2016, pp. 770–778.
[4] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely
connected convolutional networks,” in Proceedings of the IEEE confer-
ence on computer vision and pattern recognition, 2017, pp. 4700–4708.
[5] S. J. Reddi, S. Kale, and S. Kumar, “On the convergence of adam and
beyond,” arXiv preprint arXiv:1904.09237, 2019.
[6] A. C. Wilson, R. Roelofs, M. Stern, N. Srebro, and B. Recht, “The
marginal value of adaptive gradient methods in machine learning,” in
Advances in Neural Information Processing Systems, 2017, pp. 4148–
4158.
[7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet:
A large-scale hierarchical image database,” in 2009 IEEE conference on
computer vision and pattern recognition. Ieee, 2009, pp. 248–255.
[8] R. Ge, S. M. Kakade, R. Kidambi, and P. Netrapalli, “The step decay
schedule: A near optimal, geometrically decaying learning rate proce-
dure for least squares,” in Advances in Neural Information Processing
Systems, 2019, pp. 14 951–14 962.
[9] X. Liu, Y. Zhou, J. Zhao, R. Yao, B. Liu, and Y. Zheng, “Siamese
convolutional neural networks for remote sensing scene classification,”
IEEE Geoscience and Remote Sensing Letters, 2019.
[10] S. Chaib, H. Liu, Y. Gu, and H. Yao, “Deep feature fusion for vhr
remote sensing scene classification,” IEEE Transactions on Geoscience
and Remote Sensing, vol. 55, no. 8, pp. 4775–4784, 2017.
[11] X. Bian, C. Chen, L. Tian, and Q. Du, “Fusing local and global features
for high-resolution scene classification,” IEEE Journal of Selected Topics
in Applied Earth Observations and Remote Sensing, vol. 10, no. 6, pp.
2889–2901, 2017.
[12] Y. Yang and S. Newsam, “Bag-of-visual-words and spatial extensions
for land-use classification,” in Proceedings of the 18th SIGSPATIAL in-
ternational conference on advances in geographic information systems.
ACM, 2010, pp. 270–279.
[13] G.-S. Xia, J. Hu, F. Hu, B. Shi, X. Bai, Y. Zhong, L. Zhang, and X. Lu,
“Aid: A benchmark data set for performance evaluation of aerial scene
classification,” IEEE Transactions on Geoscience and Remote Sensing,
vol. 55, no. 7, pp. 3965–3981, 2017.
[14] G. Cheng, J. Han, and X. Lu, “Remote sensing image scene classifi-
cation: Benchmark and state of the art,” Proceedings of the IEEE, vol.
105, no. 10, pp. 1865–1883, 2017.