
ENHANCING LARGE BATCH SIZE TRAINING OF DEEP MODELS FOR REMOTE SENSING APPLICATIONS

Rocco Sedona1,2, Gabriele Cavallaro1, Morris Riedel1,2, and Matthias Book2

1Jülich Supercomputing Centre, Forschungszentrum Jülich, Germany

2School of Engineering and Natural Sciences, University of Iceland, Iceland

ABSTRACT

A wide variety of Remote Sensing (RS) missions are continuously acquiring a large volume of data every day. The availability of large datasets has propelled Deep Learning (DL) methods also in the RS domain. Convolutional Neural Networks (CNNs) have become the state of the art for image classification; however, the training process is time-consuming. In this work we exploit the Layer-wise Adaptive Moments optimizer for Batch training (LAMB) to enable large batch size training on High-Performance Computing (HPC) systems. With the use of LAMB combined with learning rate scheduling and warm-up strategies, our experimental results on RS data classification demonstrate that a ResNet50 can be trained faster with batch sizes up to 32K.

Index Terms— Distributed deep learning, high performance computing, residual neural network, convolutional neural network, classification, DeepSat

1. INTRODUCTION

Deep Learning (DL) is emerging as the leading Artificial Intelligence (AI) technique owing to the current convergence of scalable computing capability (i.e., HPC and Cloud computing), easy access to large volumes of data, and the emergence of new algorithms enabling robust training of large-scale deep CNNs [1].

Recent HPC architectures and parallel programming have been influenced by the rapid advancement of DL and hardware accelerators (e.g., GPUs). The classical workloads that run on HPC systems (e.g., numerical methods based on physical laws in various scientific fields) are becoming more heterogeneous. They are being transformed by DL algorithms that require higher memory, storage, and networking capabilities, as well as optimized software and libraries, to deliver the required performance [2].

HPC systems are an effective solution to the challenges posed by big RS data. Modern Earth Observation (EO) programs (e.g., ESA's Copernicus) provide continuous streams of massive volumes of multi-sensor RS data on a daily basis¹.

The results of this research were achieved through the support of HELMHOLTZ AI CONSULTANTS @ FZJ, https://www.helmholtz.ai/themenmenue/our-research/consultant-teams/helmholtz-ai-consultants-fzj/index.html

While DL has provided numerous breakthroughs for many RS applications, some challenges are still unsolved. The deployment of a deep model can produce a neural network architecture with a significant number of tunable parameters (i.e., millions for a ResNet50 architecture [3]), which requires a large amount of time to complete its training.

To achieve high scalability performance over a large number of GPUs, the main approach is to increase the effective batch size (i.e., the batch size per worker multiplied by the number of workers) [4]. However, it was noted that the use of the popular Stochastic Gradient Descent (SGD) optimizer in a setting with batch sizes larger than 8K can lead to substantial degradation of performance, e.g., classification accuracy, if used without any additional countermeasures [5].

One mechanism to avoid this difficulty is tuning the learning rate schedule: using warm-up phases before the training, scaling the learning rate with the number of distributed workers, and reducing the rate by a fixed factor after a fixed number of epochs [6]. More sophisticated strategies to deal with very large batch sizes use adaptive learning rates that are tuned depending on layer depth, the value of the computed gradients, and the progress of training. In [7] the authors showed that the use of a learning rate with linear scaling w.r.t. the number of GPUs, step decay, and warm-up allowed training DL models on an RS dataset with batch sizes up to 8K; a minimal sketch of such a schedule is given below.
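As an illustration, this recipe (linear scaling with the number of workers, gradual warm-up, step decay) can be sketched as follows; the base rate, warm-up length, and decay boundaries are assumptions made for the example, not the exact values of [6] or [7]:

```python
# Illustrative learning rate recipe: linear scaling, warm-up, step decay.
# All constants below are assumptions for the sake of the example.
def stepwise_lr(epoch: int, n_workers: int,
                base_lr: float = 0.1, warmup_epochs: int = 5) -> float:
    peak_lr = base_lr * n_workers              # linear scaling rule
    if epoch < warmup_epochs:
        # Gradual warm-up from ~0 to the peak rate.
        return peak_lr * (epoch + 1) / warmup_epochs
    # Step decay: divide the rate by 10 at fixed epoch boundaries.
    factor = 1.0
    for boundary in (30, 60, 80):
        if epoch >= boundary:
            factor *= 0.1
    return peak_lr * factor
```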

In this study, we propose to use the recently presented

LAMB [8] optimizer with a multifold strategy consisting of a

learning rate scheduler with polynomial decay that calculates

the initial learning rate with a non-linear rule. We utilized also

a warm-up phase at the beginning of the training with length

proportional to the batch size [4] [5] [8]. We demonstrate that

this training strategy and the adoption LAMB optimizer can

scale the training of a ResNet50 for the classiﬁcation of two

RS datasets, using batch sizes up to 32K without a signiﬁcant

degradation of the accuracy.

¹https://sentinels.copernicus.eu/web/sentinel/news/-/article/2018-sentinel-data-access-annual-report

2. PROBLEM FORMULATION

Distributed computing frameworks such as TensorFlow's native mirrored and parameter server strategies, PyTorch Distributed, and Horovod [9] have gained visibility lately, enabling faster training of deep neural networks on large datasets [4]. There are two approaches to distributed training: model distribution and data distribution (i.e., data parallelism) [10]. In this work we used data distribution and ran the experiments on one of the HPC systems hosted at the Jülich Supercomputing Centre (JSC). Data distributed frameworks are more straightforward to implement and require less hand-tuning. In data parallelism the DL model is replicated on each worker and the data are divided into different chunks among the workers. The training of the models is then executed in parallel, where each replica performs backpropagation on different data. At the end of each iteration the models exchange their local parameters with each other in a synchronized way. In this work, we adopted the Horovod library due to its flexible API that can be used on top of the most popular DL libraries such as TensorFlow, Keras, PyTorch and MXNet. Horovod relies on the Message Passing Interface (MPI) and NVIDIA Collective Communication Library (NCCL) libraries for the synchronization of the model parameters among the different workers, which is performed using a decentralized ring-allreduce algorithm [9].
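As an illustration, a minimal Horovod setup with the Keras API looks roughly as follows; the model, optimizer, and training call are placeholders, not the exact configuration used in this work:

```python
# Minimal sketch of data-parallel training with Horovod and tf.keras.
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # one process per GPU, launched e.g. with mpirun or srun

# Pin each worker to its local GPU.
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

# Placeholder model; any Keras model can be used here.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 4)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(6, activation='softmax'),
])

# The wrapper averages gradients across workers with ring-allreduce.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(0.001))
model.compile(optimizer=opt, loss='categorical_crossentropy',
              metrics=['accuracy'])

callbacks = [
    # Broadcast initial weights from rank 0 so all replicas start identically.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]
# model.fit(train_ds, epochs=100, callbacks=callbacks)
```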

3. METHODOLOGY

3.1. ResNet50

ResNet50 [3] was presented in 2015 and is still among the most widely used CNNs for solving various computer vision tasks. Although stacking a large number of layers to create a deep neural network would intuitively provide very powerful and expressive models, in practice the training becomes more difficult due to the so-called vanishing gradient problem. It was noted that in deep neural networks the gradient becomes small as a function of the depth, preventing the model from updating the weights [11]. ResNet50 aims at overcoming this issue by adopting skip connections: instead of directly fitting the underlying mapping $H(x)$, the residual mapping $F(x) := H(x) - x$ is learned [3]. Implementing the skip connections as identity mappings ($F(x) + x$), [3] creates a deep CNN that solves the vanishing gradient problem.
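As an illustration, a simplified residual block with an identity shortcut (two stacked 3×3 convolutions, not the exact bottleneck block of ResNet50) can be written with the Keras API as:

```python
# Simplified residual block: output is F(x) + x via an identity shortcut.
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters: int):
    shortcut = x                                     # identity mapping
    y = layers.Conv2D(filters, 3, padding='same')(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding='same')(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([y, shortcut])                  # the skip connection
    return layers.ReLU()(y)
```

Note that `filters` must match the channel dimension of `x` for the identity shortcut to be valid; ResNet50 uses projection shortcuts where the dimensions change.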

3.2. LAMB optimizer

With the data distribution parallel strategy, the effective batch size is the result of the multiplication of the per-worker batch size by the number of workers. The adoption of the SGD optimizer was shown to help tackle optimization problems with batch sizes up to 8K, when used in combination with a strategy that computes the initial learning rate according to a linear scaling rule and a warm-up phase [4]. However, above the threshold of 8K, this solution is not sufficient to train a model without degradation of the results during testing. In [5], the authors found that if the ratio of the L2-norm of weights and gradients is high, the training can become unstable. The LAMB optimizer has been specifically proposed to improve the training stability and generalization performance [8]. LAMB is based on the popular adaptive learning rate optimization algorithm ADAM [12]. In contrast to SGD, LAMB is a layer-wise adaptive algorithm that adopts a per-dimension normalization with respect to the square root of the second moment and a layer-wise normalization. The general rule for updating the parameters with iterative algorithms such as ADAM and SGD is:

$$x_{t+1} = x_t + \eta_t u_t, \quad (1)$$

where $x$ are the parameters of the model, $\eta$ is the learning rate, and $u$ is the update of the parameters. For the layer-wise adaptive strategies the formula becomes:

$$x^i_{t+1} = x^i_t - \eta_t \, \frac{\Phi(\|x^i_t\|)}{\|g^i_t\|} \, g^i_t, \quad (2)$$

where $x^i$ are the parameters at layer $i$, $g^i$ is the gradient at layer $i$, $\eta_t$ is the learning rate at step $t$, and $\Phi$ is a scaling function. Comparing the classical rule for the update of the weights (eq. 1) with the formula of the layer-wise adaptive strategy (eq. 2), we can observe two changes: (i) the update is scaled to unit $\ell_2$-norm and (ii) an additional scaling $\Phi$ is applied [8].
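For concreteness, a minimal sketch of the layer-wise update of eq. (2) for a single layer, assuming the identity as scaling function $\Phi$ and using the raw gradient as update direction (LAMB itself uses the ADAM moment estimates instead):

```python
# Sketch of one layer-wise adaptive update step, following eq. (2).
import numpy as np

def layerwise_adaptive_step(x, g, lr, phi=lambda n: n, eps=1e-6):
    # x: layer parameters; g: layer update direction; lr: learning rate eta_t.
    # phi: scaling function Phi (identity here, an assumption).
    # eps guards against division by zero (not part of eq. (2) itself).
    trust_ratio = phi(np.linalg.norm(x)) / (np.linalg.norm(g) + eps)
    return x - lr * trust_ratio * g

# Toy usage on one random "layer".
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8))
g = rng.normal(size=(8, 8))
x = layerwise_adaptive_step(x, g, lr=0.01)
```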

4. EXPERIMENTAL RESULTS

4.1. Dataset

The experiments were carried out on the SAT-4 and SAT-6 airborne datasets [13]. The patches were created using the National Agriculture Imagery Program (NAIP) dataset, which consists of 330,000 scenes covering the Continental United States. The size of each patch is 28 × 28 × 4. Each patch has 4 channels (i.e., RGB and near infrared) with 1 m spatial resolution and is associated with one class. The SAT-4 dataset contains 500,000 patches and includes annotations of four land cover classes: barren land, trees, grassland, and a class that groups together everything not covered by the three aforementioned. The dataset was split into a training set of 400,000 patches and a test set of 100,000 patches. Similarly to SAT-4, SAT-6 contains patches of size 28 × 28 × 4, but the total number of image patches is 405,000, annotated with six land cover classes: barren land, trees, grassland, roads, buildings, and water bodies. The training set consists of 324,000 patches and the test set of 81,000 patches.
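As a sketch, the .mat files can be read with Scipy roughly as follows; the key names and the array layout are assumptions about the distributed files and should be checked against the actual data:

```python
# Hypothetical loading sketch for the SAT-6 .mat file (key names assumed).
import numpy as np
from scipy.io import loadmat

data = loadmat('sat-6-full.mat')
# Assumed layout: images stored as 28x28x4xN arrays, labels one-hot encoded.
train_x = np.transpose(data['train_x'], (3, 0, 1, 2)).astype('float32') / 255.0
train_y = np.transpose(data['train_y'])   # assumed shape (N, 6)
test_x = np.transpose(data['test_x'], (3, 0, 1, 2)).astype('float32') / 255.0
test_y = np.transpose(data['test_y'])
```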

4.2. Experimental Setup

We used the Dynamical Exascale Entry Platform (DEEP), a European pre-exascale platform that incorporates heterogeneous HPC systems. DEEP is being developed by the European project Dynamical Exascale Entry Platform – Extreme Scale Technologies (DEEP-EST)². The Extreme Scale Booster (ESB) partition hosts 75 nodes, each equipped with one Nvidia V100 Tesla Graphics Processing Unit (GPU) with 32 GB of memory. To test whether the data distributed algorithm with large batch sizes can scale, we used up to 32 GPUs. We used Python 3.8.5 and the following libraries for DL and the data distributed framework: TensorFlow 2.3.1, Horovod 0.20.3, Scikit-learn 0.20.3, and Scipy 1.5.2. SAT-4 and SAT-6 are saved as MATLAB .mat files and were read using the Scipy library. We trained the models with the Keras API and built the input pipeline with the TensorFlow data API. We trained a ResNet-50 model from scratch, i.e., without loading pre-trained weights, on the SAT-4 and SAT-6 datasets [13]. Each patch of the dataset is associated with one of the classes, making this a patch-based multi-class classification problem. Thus, we stacked a fully connected layer (with 6 neurons for SAT-6 and 4 neurons for SAT-4) on top of the model, activated with the softmax function. We trained the models for 100 epochs. The initial learning rate was set using a heuristic that computes the learning rate proportionally to the square root of the effective batch size.

Batch size      8K     16K     32K    65K
Learning rate   0.02   0.028   0.04   0.05
Warm-up         5      10      20     40

Table 1. Hyper-parameters of the LAMB optimizer, as in the experimental setting of [8].

We also adopted a polynomial scheduler of order 2 for the learning rate as shown in [8], as well as a warm-up phase that gradually ramps up the value of the learning rate at the beginning of the training; a sketch of this schedule is given below. The combination of these techniques helps solve the instability problem that can cause exploding gradients. The hyper-parameters were selected based on [8] and are shown in Tab. 1. As a baseline for comparison we also used the ADAM optimizer with a fixed learning rate set to 0.001, as in [12], and a batch size equal to 64. We also tested the SGD optimizer with hyper-parameters as explained in [4], but the training did not converge in 100 epochs; thus a further exploration of the hyper-parameter space should be performed, and results could be reported in future works.
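A minimal sketch of this learning rate schedule, assuming a linear warm-up; the value of base_lr is an assumption tuned so that the resulting peak rates are close to those of Tab. 1:

```python
# Illustrative schedule: square-root scaling of the initial learning rate,
# linear warm-up, and polynomial decay of order 2. Constants are assumptions.
import math

def lamb_lr_schedule(epoch: int, effective_batch_size: int,
                     total_epochs: int = 100, warmup_epochs: int = 5,
                     base_lr: float = 2.5e-4) -> float:
    # Peak rate proportional to the square root of the effective batch size;
    # e.g. 8K gives ~0.023, close to the 0.02 of Tab. 1.
    peak_lr = base_lr * math.sqrt(effective_batch_size)
    if epoch < warmup_epochs:
        # Warm-up: ramp the rate linearly up to the peak.
        return peak_lr * (epoch + 1) / warmup_epochs
    # Polynomial decay of order 2 towards 0 at the last epoch.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return peak_lr * (1.0 - progress) ** 2

# Such a function can be plugged into training via
# tf.keras.callbacks.LearningRateScheduler(lambda e: lamb_lr_schedule(e, 8192)).
```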

We implemented a simple data augmentation with random flips and rotations of the patches, which helps reduce overfitting; a sketch is shown below.
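For illustration, such an augmentation step in a tf.data input pipeline could look as follows (a sketch; the surrounding pipeline is assumed):

```python
# Sketch: random flips and random 90-degree rotations of the patches.
import tensorflow as tf

def augment(patch, label):
    patch = tf.image.random_flip_left_right(patch)
    patch = tf.image.random_flip_up_down(patch)
    k = tf.random.uniform([], minval=0, maxval=4, dtype=tf.int32)
    patch = tf.image.rot90(patch, k)  # rotate by a random multiple of 90 deg
    return patch, label

# train_ds = train_ds.map(augment,
#                         num_parallel_calls=tf.data.experimental.AUTOTUNE)
```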

4.3. Evaluation

The accuracy and loss metrics shown in Tab. 2, 3 and 4 are the average of 3 runs for each set of hyper-parameters.

²https://www.deep-est.eu/

Batch size   N. GPUs   Accuracy   Loss   Time [s]
8K           4         0.99       0.02   34
16K          8         0.98       0.07   18
32K          16        0.96       0.11   9
65K          32        diverges          5

Table 2. Accuracy, test loss, and training time per epoch with the LAMB optimizer on the SAT-4 dataset.

Batch size   N. GPUs   Accuracy   Loss   Time [s]
8K           4         0.99       0.05   41
16K          8         0.98       0.11   22
32K          16        0.94       0.17   11
65K          32        diverges          6

Table 3. Accuracy, test loss, and training time per epoch with the LAMB optimizer on the SAT-6 dataset.

Using LAMB and batch sizes up to 32K, we could obtain results that are comparable to those obtained using small batch sizes and consistent with state-of-the-art results [14]. In particular, we can see that the accuracy obtained by using LAMB with a batch size of 8K is very similar to that obtained using ADAM with a much smaller batch size of 64 on both datasets (shown in Tab. 2, 3 and 4). We did not test the ADAM optimizer with large batch sizes, since it is known that it tends not to generalize well on test data when large batch sizes are employed [15]. As stated above, results remain acceptable with batch sizes of 32K using the new LAMB approach, while they diverge with a batch size equal to 65K. We can observe that as the batch size increases, the test loss tends to increase and the accuracy to decrease, up to the point where they significantly diverge from the baseline results. This happens even though we did not observe training difficulties such as exploding gradients, a behaviour that was observed using SGD with large batches [4]. As the batch size grows, the generalization gap becomes non-negligible [15], due to the fact that optimizers in the large batch size regime converge to sharp instead of flat minimizers [16].

The scaling in terms of time required to complete an epoch is slightly less than linear w.r.t. the number of GPUs employed for the training (Tab. 2 and 3). We hypothesize that the use of a large per-worker batch size is beneficial for scaling. In fact, with large batch sizes the communication time (time spent exchanging the gradients between the workers) remains smaller than the computation time (time spent propagating the batches back and forth through the CNN), reducing the GPUs' idle time. A conclusive and thorough study of the possible set-ups could be beneficial also to other researchers in the field.

Dataset   Batch size   Accuracy   Loss   Time [s]
SAT-4     64           0.98       0.02   263
SAT-6     64           0.98       0.04   214

Table 4. Accuracy, test loss, and training time per epoch with the ADAM optimizer on the SAT-4 and SAT-6 datasets. These experiments were carried out on a single GPU.

5. CONCLUSIONS

In this work the LAMB optimizer was used to train a ResNet-50 model with large batch sizes up to 32K. The results obtained with the SAT-4 and SAT-6 RS datasets showed that the training performance remained unaffected and that a processing speed-up was achieved. Training the model with batch sizes above the threshold of 32K is still problematic, as shown by the results obtained with a batch size equal to 65K. An additional consideration is that in the present work we used two RS datasets posing a simple multi-class classification problem; the question whether this approach could be extended to more complex classification problems remains open. We are currently planning to work on a comparison with the Layer-wise Adaptive Rate Scaling (LARS) optimizer [5], which might be included in future publications. However, this work should be considered a preliminary assessment, and a systematic analysis that also includes other training strategies and algorithms for dealing with large batch sizes should be undertaken, such as the adoption of a cyclical learning rate scheduler [17] and the dynamic increase of the batch size during the training [18]. A quantitative analysis that takes into consideration the optimal configuration for distributed DL, such as the per-worker batch size or specific parameters of Horovod, is also lacking at the moment and is in the future plans of the authors. The repository with the Python code is publicly available³.

³https://gitlab.version.fz-juelich.de/sedona3/igarss2021 sat6.git

6. REFERENCES

[1] G. Fox, J. A. Glazier, et al., “Learning Everywhere: Pervasive Machine Learning for Effective High-Performance Computation,” in 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2019, pp. 422–429.

[2] T. Ben-Nun and T. Hoefler, “Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis,” CoRR, vol. abs/1802.09941, 2018.

[3] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2016.

[4] P. Goyal, P. Dollár, et al., “Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour,” arXiv preprint arXiv:1706.02677, 2017.

[5] Y. You, I. Gitman, and B. Ginsburg, “Large Batch Training of Convolutional Networks,” 2017.

[6] M. Yamazaki, A. Kasagi, et al., “Yet Another Accelerated SGD: ResNet-50 Training on ImageNet in 74.7 seconds,” arXiv preprint arXiv:1903.12650, 2019.

[7] R. Sedona, G. Cavallaro, J. Jitsev, A. Strube, M. Riedel, and J. A. Benediktsson, “Remote Sensing Big Data Classification with High Performance Distributed Deep Learning,” Remote Sensing, vol. 11, no. 24, p. 3056, Dec 2019.

[8] Y. You, J. Li, et al., “Large Batch Optimization for Deep Learning: Training BERT in 76 minutes,” 2020.

[9] A. Sergeev and M. Del Balso, “Horovod: Fast and Easy Distributed Deep Learning in TensorFlow,” arXiv preprint arXiv:1802.05799, 2018.

[10] T. Ben-Nun and T. Hoefler, “Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis,” ACM Computing Surveys, 2019.

[11] B. Hanin and D. Rolnick, “How to Start Training: The Effect of Initialization and Architecture,” 2018.

[12] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” 2017.

[13] S. Basu, S. Ganguly, S. Mukhopadhyay, R. DiBiano, M. Karki, and R. Nemani, “DeepSat - A Learning Framework for Satellite Imagery,” 2015.

[14] M. A. Kadhim and M. H. Abed, Convolutional Neural Network for Satellite Image Classification, pp. 165–178, Springer International Publishing, Mar 2019.

[15] E. Hoffer, I. Hubara, and D. Soudry, “Train Longer, Generalize Better: Closing the Generalization Gap in Large Batch Training of Neural Networks,” in Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17), Red Hook, NY, USA, 2017, pp. 1729–1739, Curran Associates Inc.

[16] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang, “On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima,” 2017.

[17] L. N. Smith, “Cyclical Learning Rates for Training Neural Networks,” 2017.

[18] S. L. Smith, P. J. Kindermans, C. Ying, and Q. V. Le, “Don't Decay the Learning Rate, Increase the Batch Size,” 2018.