ENHANCING LARGE BATCH SIZE TRAINING OF DEEP
MODELS FOR REMOTE SENSING APPLICATIONS
Rocco Sedona1,2, Gabriele Cavallaro1, Morris Riedel1,2, and Matthias Book2
1Jülich Supercomputing Centre, Forschungszentrum Jülich, Germany
2School of Engineering and Natural Sciences, University of Iceland, Iceland
ABSTRACT
A wide variety of Remote Sensing (RS) missions are
continuously acquiring a large volume of data every day.
The availability of large datasets has also propelled Deep Learning (DL) methods in the RS domain. Convolutional Neural Networks (CNNs) have become the state of the art for image classification; however, their training process is time consuming. In this work we exploit the Layer-wise Adaptive Moments optimizer for Batch training (LAMB) to enable large batch size training on High-Performance Computing (HPC) systems. Using LAMB combined with learning rate scheduling and warm-up strategies, the experimental results on RS data classification demonstrate that a ResNet50 can be trained faster with batch sizes up to 32K.
Index Terms— Distributed deep learning, high performance computing, residual neural network, convolutional neural network, classification, DeepSat
1. INTRODUCTION
Deep Learning (DL) is emerging as the leading Artificial In-
telligence (AI) technique owing to the current convergence of
scalable computing capability (i.e., HPC and Cloud comput-
ing), easy access to large volumes of data, and the emergence
of new algorithms enabling robust training of large-scale deep
CNNs [1].
Recent HPC architectures and parallel programming have
been influenced by the rapid advancement of DL and hard-
ware accelerators (e.g., GPUs). The classical workloads that
run on HPC systems (e.g., numerical methods based on phys-
ical laws in various scientific fields) are becoming more het-
erogeneous. They are being transformed by DL algorithms
that require higher memory, storage, and networking capabil-
ities, as well as optimized software and libraries, to deliver
the required performance [2].
HPC systems are an effective solution for dealing with the challenges posed by big RS data. Modern Earth Observation (EO) programs (e.g., ESA's Copernicus) provide continuous streams of massive volumes of multi-sensor RS data on a daily basis 1.
The results of this research were achieved through the support of HELMHOLTZ AI CONSULTANTS @ FZJ, https://www.helmholtz.ai/themenmenue/our-research/consultant-teams/helmholtz-ai-consultants-fzj/index.html
While DL has provided numerous breakthroughs for many RS applications, some challenges are still unsolved. A deep model can have a very large number of tunable parameters (i.e., millions for a ResNet50 architecture [3]), which makes its training time consuming.
To achieve high scalability over a large number of GPUs, the main approach is to increase the effective batch size (i.e., the batch size per worker multiplied by the number of workers) [4]. However, it has been noted that using the popular Stochastic Gradient Descent (SGD) optimizer with batch sizes larger than 8K can lead to a substantial degradation of performance, e.g., classification accuracy, if no additional countermeasures are taken [5].
One mechanism to avoid this difficulty is tuning the learning rate schedule: using warm-up phases before the training, scaling the learning rate with the number of distributed workers, and reducing the rate by a fixed factor after a fixed number of epochs [6]. More sophisticated strategies for very large batch sizes use adaptive learning rates that are tuned depending on the layer depth, the value of the computed gradients, and the progress of the training. In [7] the authors showed that a learning rate with linear scaling w.r.t. the number of GPUs, step decay, and warm-up allowed training DL models on an RS dataset with batch sizes up to 8K.
In this study, we propose to use the recently presented LAMB optimizer [8] with a multifold strategy consisting of a learning rate scheduler with polynomial decay that calculates the initial learning rate with a non-linear rule. We also utilized a warm-up phase at the beginning of the training, with a length proportional to the batch size [4, 5, 8]. We demonstrate that this training strategy and the adoption of the LAMB optimizer can scale the training of a ResNet50 for the classification of two RS datasets, using batch sizes up to 32K without a significant degradation of accuracy.
1https://sentinels.copernicus.eu/web/sentinel/news/-/article/2018-sentinel-data-access-annual-report
2. PROBLEM FORMULATION
Distributed computing frameworks such as TensorFlow's native mirrored and parameter server strategies, PyTorch Distributed, and Horovod [9] have gained visibility lately, enabling faster training of deep neural networks on large datasets [4]. There are two approaches for distributed training: model distribution and data distribution (i.e., data parallelism) [10]. In this work we used data distribution and ran the experiments on one of the HPC systems hosted at the Jülich Supercomputing Centre (JSC). Data distributed frameworks are more straightforward to implement and require less hand-tuning. In data parallelism the DL model is replicated on each worker and the data are divided into different chunks among the workers. The training of the models is then executed in parallel, where each replica performs backpropagation on different data. At the end of each iteration the models exchange their local parameters with each other in a synchronized way. In this work, we adopted the Horovod library due to its flexible API that can be used on top of the most popular DL libraries such as TensorFlow, Keras, PyTorch and MXNet. Horovod relies on the Message Passing Interface (MPI) and the NVIDIA Collective Communication Library (NCCL) for the synchronization of the model parameters among the different workers, which is performed using a decentralized ring-allreduce algorithm [9].
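As an illustration of this data-parallel setup, the following minimal Python sketch shows how a Keras model could be trained with Horovod. It is a sketch under assumptions rather than the authors' training script: the model, the learning rate value, and the (omitted) input pipeline are placeholders.

# Minimal sketch of Horovod data parallelism with TensorFlow/Keras.
# Launch with e.g. `horovodrun -np 4 python train.py` (one process per GPU).
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Pin each process to a single GPU.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

# Placeholder model (6 output classes, e.g. for SAT-6).
model = tf.keras.applications.ResNet50(weights=None, classes=6)

# Scale the base learning rate by the number of workers (linear rule shown
# here only for brevity; the paper relies on LAMB with a square-root rule).
opt = tf.keras.optimizers.SGD(learning_rate=0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)  # averages gradients via ring-allreduce

model.compile(optimizer=opt,
              loss="categorical_crossentropy",
              metrics=["accuracy"])

callbacks = [
    # Broadcast initial weights from rank 0 so all replicas start identically.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]

# `train_ds` would be a tf.data.Dataset sharded per worker (not shown):
# model.fit(train_ds, epochs=100, callbacks=callbacks,
#           verbose=1 if hvd.rank() == 0 else 0)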
3. METHODOLOGY
3.1. ResNet50
ResNet50 [3] was presented in 2015 and is still among the most widely used CNNs for solving various computer vision tasks. Although stacking a large number of layers to create a deep neural network would intuitively provide very powerful and expressive models, in practice the training becomes more difficult due to the so-called vanishing gradient problem. It was noted that in deep neural networks the gradient becomes small as a function of the depth, preventing the model from updating the weights [11]. ResNet50 aims at overcoming this issue by adopting skip connections: instead of directly fitting the underlying mapping H(x), the residual mapping F(x) := H(x) - x is learned [3]. By implementing the skip connections as identity mappings (i.e., computing F(x) + x), [3] creates a deep CNN that overcomes the vanishing gradient problem.
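To make the skip connection concrete, the minimal Keras sketch below builds a single identity residual block computing F(x) + x. It is simplified for illustration: the actual ResNet50 uses bottleneck blocks with batch normalization and, where tensor shapes change, a projection on the shortcut.

# Minimal sketch of an identity residual block F(x) + x.
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    shortcut = x                                       # identity skip connection
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)   # residual mapping F(x)
    y = layers.Add()([y, shortcut])                    # F(x) + x
    return layers.Activation("relu")(y)

inputs = tf.keras.Input(shape=(28, 28, 4))             # SAT-4/SAT-6 patch shape
x = layers.Conv2D(64, 3, padding="same", activation="relu")(inputs)
x = residual_block(x, 64)
model = tf.keras.Model(inputs, x)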
3.2. LAMB optimizer
With the data distribution parallel strategy, the effective batch size is the per-worker batch size multiplied by the number of workers. The SGD optimizer was shown to handle optimization problems with batch sizes up to 8K when combined with a strategy that computes the initial learning rate according to a linear scaling rule and a warm-up phase [4]. However, above the threshold of 8K, this solution is not sufficient to train a model without a degradation of the results during testing. In [5], the authors found that if the ratio between the L2-norms of the weights and of the gradients is high, the training can become unstable. The LAMB optimizer has been specifically proposed to improve training stability and generalization performance [8]. LAMB is based on the popular ADAM optimization algorithm [12]. In contrast to SGD, LAMB is a layer-wise adaptive algorithm that adopts a per-dimension normalization with respect to the square root of the second moment and a layer-wise normalization. The general rule for updating the parameters with iterative algorithms such as ADAM and SGD is:
$$x_{t+1} = x_t + \eta_t u_t, \qquad (1)$$
where $x$ are the parameters of the model, $\eta_t$ is the learning rate and $u_t$ is the update of the parameters. For the layer-wise adaptive strategies the formula becomes:
$$x^{i}_{t+1} = x^{i}_{t} - \eta_t \, \frac{\Phi(\|x^{i}_{t}\|)}{\|g^{i}_{t}\|} \, g^{i}_{t}, \qquad (2)$$
where $x^{i}$ are the parameters at layer $i$, $g^{i}_{t}$ is the gradient at layer $i$ at step $t$, $\eta_t$ is the learning rate at step $t$, and $\Phi$ is a scaling function. Comparing the classical rule for the update of the weights (Eq. 1) with the formula of the layer-wise adaptive strategy (Eq. 2), we can observe the following two changes: (i) the update is scaled to unit $\ell_2$-norm and (ii) an additional scaling $\Phi$ is applied [8].
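The following NumPy sketch illustrates the layer-wise scaling of Eq. (2) applied to a raw gradient. It is only a didactic reduction: the full LAMB algorithm in [8] computes a per-dimension ADAM-style update (first and second moments) before the layer-wise normalization and also adds weight decay, all of which are omitted here, and the choice of Φ as a clipping function is an assumption for illustration.

# Didactic sketch of the layer-wise adaptive update of Eq. (2).
import numpy as np

def phi(z, lower=0.0, upper=10.0):
    # Example scaling function: clip the weight norm to a trust range.
    return np.clip(z, lower, upper)

def layerwise_update(weights, grads, lr):
    # Apply Eq. (2) independently to each layer's parameter tensor.
    updated = []
    for w, g in zip(weights, grads):
        w_norm = np.linalg.norm(w)
        g_norm = np.linalg.norm(g)
        trust = phi(w_norm) / g_norm if g_norm > 0 else 1.0
        updated.append(w - lr * trust * g)
    return updated

# Toy usage with two "layers".
weights = [np.random.randn(4, 4), np.random.randn(4)]
grads = [np.random.randn(4, 4), np.random.randn(4)]
weights = layerwise_update(weights, grads, lr=0.02)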
4. EXPERIMENTAL RESULTS
4.1. Dataset
The experiments were carried out on the SAT-4 and SAT-6 airborne datasets [13]. The patches were created using the National Agriculture Imagery Program (NAIP) dataset, which consists of 330,000 scenes covering the Continental United States. The size of each patch is 28 × 28 × 4. Each patch has 4 channels (i.e., RGB and near infrared) with 1 m spatial resolution and is associated with one class. The SAT-4 dataset contains 500,000 patches and includes annotations of four land cover classes, which are barren land, trees, grassland, and a class that groups together everything not belonging to the three aforementioned classes. The dataset was split into a training set of 400,000 patches and a test set of 100,000 patches. Similarly to SAT-4, SAT-6 contains patches of size 28 × 28 × 4, but the total number of image patches is 405,000 and it is annotated with six land cover classes: barren land, trees, grassland, roads, buildings and water bodies. The training set consists of 324,000 patches and the test set of 81,000 patches.
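For reference, the SAT archives are distributed as MATLAB .mat files (see Sec. 4.2), which can be read with SciPy roughly as in the sketch below. The file name and key names ('train_x', 'train_y', 'test_x', 'test_y') are assumptions based on the DeepSat distribution and should be checked against the actual file, e.g. via data.keys().

# Sketch of reading a SAT .mat archive with SciPy (key names assumed).
import scipy.io as sio

data = sio.loadmat("sat-6-full.mat")                  # illustrative file name
x_train, y_train = data["train_x"], data["train_y"]   # patches and one-hot labels
x_test, y_test = data["test_x"], data["test_y"]
print(x_train.shape, y_train.shape)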
4.2. Experimental Setup
We used the Dynamical Exascale Entry Platform (DEEP), a European pre-exascale platform which incorporates heterogeneous HPC systems. DEEP is being developed by the European project Dynamical Exascale Entry Platform – Extreme Scale Technologies (DEEP-EST) 2. The Extreme Scale Booster (ESB) partition hosts 75 nodes, each equipped with 1 NVIDIA V100 Tesla Graphics Processing Unit (GPU) with 32 GB of memory. To test whether the data distributed algorithm with large batch sizes can scale, we used up to 32 GPUs. We used Python 3.8.5 and the following libraries for DL and the data distributed framework: TensorFlow 2.3.1, Horovod 0.20.3, Scikit-learn 0.20.3 and Scipy 1.5.2. SAT-4 and SAT-6 are saved as MATLAB .mat files and were read using the Scipy library. We trained the models with the Keras API and built the input pipeline with the TensorFlow data API. We trained ResNet-50 models from scratch, i.e., without loading pre-trained weights, on the SAT-4 and SAT-6 datasets [13]. Each patch of the dataset is associated with one of the classes, making this a patch-based multi-class classification problem. Thus, we stacked a fully connected layer (with 6 neurons for SAT-6 and 4 neurons for SAT-4) activated with the softmax function on top of the model. We trained the models for 100 epochs. The initial learning rate was set using a heuristic that computes the learning rate proportionally to the square root of the effective batch size. We also adopted a polynomial scheduler of order 2 for the learning rate, as shown in [8], as well as a warm-up that gradually ramps up the value of the learning rate at the beginning of the training. The combination of these techniques helps to solve the instability problem that can cause exploding gradients (see the schedule sketch after Tab. 1). The hyper-parameters were selected based on [8] and are shown in Tab. 1. As a baseline for comparison we also used the ADAM optimizer with a fixed learning rate set to 0.001, as in [12], and a batch size equal to 64. We also tested the SGD optimizer with hyper-parameters as explained in [4], but the training did not converge in 100 epochs; thus, a further exploration of the hyper-parameter space should be performed and results could be reported in future work. We implemented a simple data augmentation with random flips and rotations of the patches, which helps to reduce overfitting.

Batch size     8K     16K    32K    65K
Learning rate  0.02   0.028  0.04   0.05
Warm-up        5      10     20     40
Table 1. Hyper-parameters of the LAMB optimizer, as in the experimental setting of [8].
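The sketch below shows one way the learning rate strategy described above could be implemented as a Keras callback: an initial rate proportional to the square root of the effective batch size, a linear warm-up, and a polynomial decay of order 2. The scaling constant k and the epoch-level granularity are illustrative assumptions; Tab. 1 lists the values actually used per batch size.

# Sketch of square-root scaling + linear warm-up + polynomial (order 2) decay.
import tensorflow as tf

def make_lr_schedule(batch_size, epochs=100, warmup_epochs=5, k=2.2e-4):
    peak_lr = k * (batch_size ** 0.5)              # square-root scaling rule

    def schedule(epoch, lr):
        if epoch < warmup_epochs:                  # linear warm-up
            return peak_lr * (epoch + 1) / warmup_epochs
        # Polynomial decay of order 2 over the remaining epochs.
        progress = (epoch - warmup_epochs) / max(1, epochs - warmup_epochs)
        return peak_lr * (1.0 - progress) ** 2

    return tf.keras.callbacks.LearningRateScheduler(schedule)

# Example: effective batch size 8K with a 5-epoch warm-up (cf. Tab. 1):
lr_callback = make_lr_schedule(batch_size=8192, warmup_epochs=5)
# model.fit(train_ds, epochs=100, callbacks=[lr_callback, ...])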
4.3. Evaluation
The accuracy and loss metrics shown in Tab. 2, 3 and 4 are the average of 3 runs for each set of hyper-parameters.
2https://www.deep-est.eu/
Batch size   N. GPUs   Accuracy   Loss   Time [s]
8K           4         0.99       0.02   34
16K          8         0.98       0.07   18
32K          16        0.96       0.11   9
65K          32        diverges          5
Table 2. Accuracy, test loss, and training time per epoch with the LAMB optimizer on the SAT-4 dataset.
Batch size   N. GPUs   Accuracy   Loss   Time [s]
8K           4         0.99       0.05   41
16K          8         0.98       0.11   22
32K          16        0.94       0.17   11
65K          32        diverges          6
Table 3. Accuracy, test loss, and training time per epoch with the LAMB optimizer on the SAT-6 dataset.
Using LAMB and batch sizes up to 32K, we obtained results that are comparable to those obtained using small batch sizes and consistent with state-of-the-art results [14]. In particular, we can see that the accuracy obtained by using LAMB with a batch size of 8K is very similar to that obtained using ADAM with a much smaller batch size of 64 on both datasets (shown in Tab. 2, 3 and 4). We did not test the ADAM optimizer with large batch sizes, since it is known that it tends not to generalize well on test data when large batch sizes are employed [15]. As stated above, results remain acceptable with batch sizes of 32K using the new LAMB approach, while they diverge with a batch size of 65K. We can observe that, as the batch size increases, the test loss tends to increase and the accuracy to decrease, up to the point where they diverge significantly from the baseline results. This happens even though we did not observe training difficulties such as exploding gradients, a behaviour that was observed using SGD with large batches [4]. As the batch size grows, the generalization gap becomes non-negligible [15] due to the fact that optimizers in the large batch size regime converge to sharp instead of flat minimizers [16].
The scaling in terms of time required to complete an epoch is slightly less than linear w.r.t. the number of GPUs employed for the training (Tab. 2 and 3). We hypothesize that the use of a large per-worker batch size is beneficial for scaling. In fact, with large batch sizes the communication time (time spent exchanging the gradients between the workers) remains smaller than the computation time (time spent propagating the batches back and forth through the CNN), reducing the idle time of the GPUs. A conclusive and thorough study of the possible set-ups could also be beneficial to other researchers in the field.
Dataset   Batch size   Accuracy   Loss   Time [s]
SAT-4     64           0.98       0.02   263
SAT-6     64           0.98       0.04   214
Table 4. Accuracy, test loss, and training time per epoch with the ADAM optimizer on the SAT-4 and SAT-6 datasets. These experiments were carried out on a single GPU.
5. CONCLUSIONS
In this work the LAMB optimizer was used to train a ResNet-50 model with large batch sizes up to 32K. The results obtained with the SAT-4 and SAT-6 RS datasets showed that the training performance remained unaffected and that a processing speed-up was achieved. Training the model with batch sizes above the threshold of 32K is still problematic, as shown by the results using a batch size equal to 65K. An additional consideration is that in the present work we used two RS datasets with a simple multi-class classification problem, and the question of whether this approach could be extended to more complex classification problems remains open. We are currently planning to work on a comparison with the Layer-wise Adaptive Rate Scaling (LARS) optimizer [5], which might be included in future publications. However, this work should be considered a preliminary assessment, and a systematic analysis that also includes other training strategies and algorithms for dealing with large batch sizes should be undertaken, such as the adoption of a cyclical learning rate scheduler [17] and the dynamic increase of the batch size during the training [18]. A quantitative analysis that takes into consideration the optimal configuration for distributed DL, such as the per-worker batch size or specific parameters of Horovod, is also lacking at the moment and is part of the authors' future plans. The repository with the Python code is publicly available 3.
6. REFERENCES
[1] G. Fox, J. A. Glazier, and et al., “Learning Every-
where: Pervasive Machine Learning for Effective High-
Performance Computation,” in 2019 IEEE International
Parallel and Distributed Processing Symposium Work-
shops (IPDPSW), 2019, pp. 422–429.
[2] T. Ben-Nun and T. Hoefler, “Demystifying Parallel and
Distributed Deep Learning: An in-Depth Concurrency
Analysis,” CoRR, vol. abs/1802.09941, 2018.
[3] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual
Learning for Image Recognition,” in Proceedings of the
IEEE Computer Society Conference on Computer Vision
and Pattern Recognition, 2016.
3https://gitlab.version.fz-juelich.de/sedona3/igarss2021_sat6.git
[4] P. Goyal, P. Dollár, and et al., “Accurate, Large Minibatch
SGD: Training ImageNet in 1 Hour,” arXiv:1706.02677.
[5] Y. You, I. Gitman, and B. Ginsburg, “Large Batch Train-
ing of Convolutional Networks,” 2017.
[6] M. Yamazaki, A. Kasagi, and et al., “Yet Another Accel-
erated SGD: ResNet-50 Training on ImageNet in 74.7
seconds,” arXiv preprint arXiv:1903.12650, 2019.
[7] R. Sedona, G. Cavallaro, J. Jitsev, A. Strube, M. Riedel,
and J.A. Benediktsson, “Remote Sensing Big Data
Classification with High Performance Distributed Deep
Learning,” Remote Sensing, vol. 11, no. 24, pp. 3056,
Dec 2019.
[8] Y. You, J. Li, and et al., “Large Batch Optimization for
Deep Learning: Training BERT in 76 minutes,” 2020.
[9] A. Sergeev and M. Del Balso, “Horovod: Fast and
Easy Distributed Deep Learning in TensorFlow,” arXiv
preprint arXiv:1802.05799, 2018.
[10] T. Ben-Nun and T. Hoefler, “Demystifying Parallel and
Distributed Deep Learning: An In-Depth Concurrency
Analysis,” ACM Computing Surveys, 2019.
[11] B. Hanin and D. Rolnick, “How to Start Training: The
Effect of Initialization and Architecture,” 2018.
[12] D. P. Kingma and J. Ba, “Adam: A Method for Stochas-
tic Optimization,” 2017.
[13] S. Basu, S. Ganguly, S. Mukhopadhyay, R. DiBiano,
M. Karki, and R. Nemani, “DeepSat - A Learning
Framework for Satellite Imagery,” 2015.
[14] M. A. Kadhim and M. H. Abed, Convolutional Neural
Network for Satellite Image Classification, p. 165–178,
Springer International Publishing, Mar 2019.
[15] E. Hoffer, I. Hubara, and D. Soudry, “Train Longer,
Generalize Better: Closing the Generalization Gap in
Large Batch Training of Neural Networks,” in Pro-
ceedings of the 31st International Conference on Neural
Information Processing Systems, Red Hook, NY, USA,
2017, NIPS’17, p. 1729–1739, Curran Associates Inc.
[16] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyan-
skiy, and P. T. P. Tang, “On Large-Batch Training for
Deep Learning: Generalization Gap and Sharp Min-
ima,” 2017.
[17] L. N. Smith, “Cyclical Learning Rates for Training Neu-
ral Networks,” 2017.
[18] S. L. Smith, P.J. Kindermans, C. Ying, and Q.V. Le,
“Don’t Decay the Learning Rate, Increase the Batch
Size,” 2018.