
# Enhancing Large Batch Size Training of Deep Models for Remote Sensing Applications

Authors:

## Abstract

A wide variety of Remote Sensing (RS) missions continuously acquire large volumes of data every day. The availability of large datasets has propelled Deep Learning (DL) methods in the RS domain as well. Convolutional Neural Networks (CNNs) have become the state of the art for image classification; however, training them is time-consuming. In this work we exploit the Layer-wise Adaptive Moments optimizer for Batch training (LAMB) to enable large batch size training on High-Performance Computing (HPC) systems. Using LAMB combined with learning rate scheduling and warm-up strategies, the experimental results on RS data classification demonstrate that a ResNet50 can be trained faster with batch sizes up to 32K.
Rocco Sedona¹,², Gabriele Cavallaro¹, Morris Riedel¹,², and Matthias Book²
¹ Jülich Supercomputing Centre, Forschungszentrum Jülich, Germany
² School of Engineering and Natural Sciences, University of Iceland, Iceland
Index Terms: distributed deep learning, high performance computing, residual neural network, convolutional neural network, classification, DeepSat
1. INTRODUCTION
Deep Learning (DL) is emerging as the leading Artificial Intelligence (AI) technique owing to the current convergence of scalable computing capability (i.e., HPC and Cloud computing), the availability of large datasets, and the emergence of new algorithms enabling robust training of large-scale deep CNNs [1].
Recent HPC architectures and parallel programming have been influenced by the rapid advancement of DL and hardware accelerators (e.g., GPUs). The classical workloads that run on HPC systems (e.g., numerical methods based on physical laws in various scientific fields) are becoming more heterogeneous. They are being transformed by DL algorithms that require higher memory, storage, and networking capabilities, as well as optimized software and libraries, to deliver the required performance [2].
HPC systems are an effective solution that deals with the challenges posed by big RS data. Modern Earth Observation (EO) programs (e.g., ESA’s Copernicus) provide continuous streams of massive volumes of multi-sensor RS data on a daily basis¹.

The results of this research were achieved through the support of HELMHOLTZ AI CONSULTANTS @ FZJ teams/helmholtz-ai-consultants-fzj/index.html
While DL has provided numerous breakthroughs for many RS applications, some challenges remain unsolved. A deep model can have a neural network architecture with a significant number of tunable parameters (i.e., millions for a ResNet50 architecture [3]), which requires a large amount of time to train.
To achieve high scalability over a large number of GPUs, the main approach is to increase the effective batch size (i.e., the batch size per worker multiplied by the number of workers) [4]. However, it was noted that using the popular Stochastic Gradient Descent (SGD) optimizer with batch sizes larger than 8K can lead to a substantial degradation of performance (e.g., classification accuracy) if used without any additional countermeasures [5].
One mechanism to avoid this difficulty is tuning the learning rate schedule: it uses a warm-up phase at the start of the training, scales the learning rate with the number of distributed workers, and reduces the rate by a fixed factor after a fixed number of epochs [6]. More sophisticated strategies for very large batch sizes use adaptive learning rates that are tuned depending on the layer depth, the value of the computed gradients and the progress of the training. In [7] the authors showed that a learning rate with linear scaling w.r.t. the number of GPUs, step decay, and warm-up allowed training DL models on an RS dataset with batch sizes up to 8K.
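The three ingredients named above (linear scaling with the number of workers, warm-up, and step decay) can be combined in a single schedule function. The following is a minimal sketch, not the implementation used in [6] or [7]: the function name `step_decay_lr` and all default values (`base_lr`, `num_workers`, `warmup_epochs`, `decay_factor`, `decay_every`) are assumptions chosen only for illustration, and the warm-up is assumed to ramp linearly from the single-worker rate to the scaled rate.

```python
def step_decay_lr(epoch, base_lr=0.1, num_workers=8, warmup_epochs=5,
                  decay_factor=0.1, decay_every=30):
    """Learning rate at a given (possibly fractional) epoch.

    All default values are illustrative assumptions, not values from the paper.
    """
    # Linear scaling rule: the peak rate grows with the number of workers.
    peak_lr = base_lr * num_workers
    # Warm-up: ramp linearly from the single-worker rate to the peak rate.
    if epoch < warmup_epochs:
        return base_lr + (peak_lr - base_lr) * epoch / warmup_epochs
    # Step decay: multiply by a fixed factor after every `decay_every` epochs.
    return peak_lr * decay_factor ** (epoch // decay_every)


# Learning rate at a few selected epochs.
schedule = {epoch: step_decay_lr(epoch) for epoch in (0, 2.5, 5, 30, 60)}
```

With these assumed defaults the rate starts at 0.1, reaches the scaled peak of 0.8 after the warm-up, and is divided by 10 at epochs 30 and 60.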
In this study, we propose to use the recently presented LAMB [8] optimizer with a multifold strategy consisting of a learning rate scheduler with polynomial decay and an initial learning rate calculated with a non-linear rule. We also used a warm-up phase at the beginning of the training with a length proportional to the batch size [4] [5] [8]. We demonstrate that this training strategy and the adoption of the LAMB optimizer can scale the training of a ResNet50 for the classification of two RS datasets, using batch sizes up to 32K without a significant degradation of the accuracy.

¹ https://sentinels.copernicus.eu/web/sentinel/news/-/article/2018-sentinel-data-access-annual-report
2. PROBLEM FORMULATION
Distributed computing frameworks such as the TensorFlow native mirrored and parameter server strategies, PyTorch Distributed, and Horovod [9] have gained visibility lately, enabling faster training of deep neural networks on large datasets [4]. There are two approaches to distributed training: model distribution and data distribution (i.e., data parallelism) [10]. In this work we used data distribution and ran the experiments on one of the HPC systems hosted at the Jülich Supercomputing Centre (JSC). Data distributed frameworks are more straightforward to implement and require less hand-tuning. In data parallelism the DL model is replicated on each worker and the data are divided into different chunks among the workers. The training of the models is then executed in parallel, where each replica performs backpropagation on different data. At the end of each iteration the models exchange their local parameters with each other in a synchronized way. In this work, we adopted the Horovod library due to its flexible API, which can be used on top of the most popular DL libraries such as TensorFlow, Keras, PyTorch and MXNet. Horovod relies on the Message Passing Interface (MPI) and the NVIDIA Collective Communication Library (NCCL) for the synchronization of the model parameters among the different workers, which is performed using a decentralized ring-allreduce algorithm [9].
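The synchronous data-parallel scheme described above can be illustrated without any distributed library. The sketch below is a single-process simulation under stated assumptions: the model is a hypothetical scalar linear model `y = w * x`, the quantity exchanged between replicas is assumed to be the local gradient, and `allreduce_mean` is only a stand-in for the averaging that a ring-allreduce would produce, not Horovod's API.

```python
def mse_gradient(w, batch):
    """Gradient of the mean squared error of the model y = w * x w.r.t. w."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def allreduce_mean(values):
    """Stand-in for an allreduce that averages one value per worker."""
    return sum(values) / len(values)


def data_parallel_step(w, batch, num_workers, lr):
    """One synchronous step: shard the batch, average the local gradients."""
    shard = len(batch) // num_workers
    shards = [batch[i * shard:(i + 1) * shard] for i in range(num_workers)]
    local_grads = [mse_gradient(w, s) for s in shards]  # one per replica
    return w - lr * allreduce_mean(local_grads)


# Effective batch of 8 samples of y = 3x, split over 4 simulated workers.
batch = [(float(x), 3.0 * x) for x in range(1, 9)]
w_parallel = data_parallel_step(0.0, batch, num_workers=4, lr=0.01)
w_single = 0.0 - 0.01 * mse_gradient(0.0, batch)
```

Because the shards have equal size, averaging the per-worker gradients gives the same update as a single worker processing the whole effective batch, which is why the effective batch size is the per-worker batch size multiplied by the number of workers.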
3. METHODOLOGY
3.1. ResNet50
ResNet50 [3] was presented in 2015 and is still among the most widely used CNNs for solving various computer vision tasks. Although stacking a large number of layers to create a deep neural network would intuitively provide very powerful and expressive models, in practice the training becomes more difficult due to the so-called vanishing gradient problem. It was noted that in deep neural networks the gradient becomes small as a function of the depth, preventing the model from updating the weights [11]. ResNet50 aims at overcoming this problem: instead of fitting the underlying mapping $H(x)$, the residual mapping $F(x) := H(x) - x$ is learned [3]. By implementing the skip connections as identity mappings ($F(x) + x$), [3] creates a deep CNN that addresses the vanishing gradient problem.
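A toy calculation can make the effect of the identity mapping concrete. The sketch below is a deliberately simplified scalar model, not a ResNet: it assumes that every layer has the same small local derivative `layer_derivative` (a hypothetical value), so the gradient through a plain stack is the product of these derivatives, while each residual block $F(x) + x$ contributes a factor $F'(x) + 1$.

```python
def chain_gradient(layer_derivative, depth, residual):
    """Gradient through `depth` stacked scalar layers (toy model).

    A plain layer contributes a factor F'(x); a residual block F(x) + x
    contributes F'(x) + 1 because of the identity skip connection.
    """
    per_layer = layer_derivative + 1.0 if residual else layer_derivative
    return per_layer ** depth


# Assumed local derivative of 0.01 per layer, 50 layers deep.
plain = chain_gradient(0.01, depth=50, residual=False)
with_skip = chain_gradient(0.01, depth=50, residual=True)
```

Under these assumptions the gradient of the plain stack is vanishingly small, whereas the residual stack keeps it on the order of one.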
3.2. LAMB optimizer
With the data distribution parallel strategy, the effective batch size is the per-worker batch size multiplied by the number of workers. The SGD optimizer was shown to help tackle optimization problems with batch sizes up to 8K when used in combination with a strategy that computes the initial learning rate according to a linear scaling rule and a warm-up phase [4]. However, above the threshold of 8K, this solution is not sufficient to train a model without degradation of the results during testing. In [5], the authors found that if the ratio of the L2-norm of weights and gradients is high, the training can become unstable. The LAMB optimizer has been specifically proposed to improve the training stability and generalization performance [8]. LAMB is based on ADAM [12], a popular adaptive learning rate optimization algorithm. In contrast to SGD, LAMB applies a per-dimension normalization with respect to the square root of the second moment and a layer-wise normalization. The general rule for updating the parameters with iterative algorithms such as ADAM and SGD is:

$$x_{t+1} = x_t + \eta_t u_t, \tag{1}$$

where $x$ are the parameters of the model, $\eta$ is the learning rate and $u$ is the update of the parameters. For the layer-wise adaptive strategy, the update rule becomes:

$$x^{i}_{t+1} = x^{i}_{t} - \eta_t \frac{\Phi(\lVert x^{i}_{t} \rVert)}{\lVert g^{i}_{t} \rVert}\, g^{i}_{t}, \tag{2}$$

where $x^{i}$ are the parameters at layer $i$, $g^{i}$ is the gradient at layer $i$, $\eta_t$ is the learning rate at step $t$ and $\Phi$ is a scaling function. Comparing the classical rule for the update of the weights (eq. 1) with the formula of the layer-wise adaptive strategy (eq. 2), we can observe two changes: (i) the update is scaled to unit L2-norm and (ii) an additional scaling $\Phi$ is applied [8].
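To make the layer-wise rule of eq. 2 concrete, the following pure-Python sketch applies one LAMB-style step to a list of "layers", each represented as a list of floats. It is an illustration under stated assumptions rather than the implementation used in the experiments: the scaling function $\Phi$ is taken to be the identity, the ADAM-style moment estimates use the common defaults `beta1=0.9` and `beta2=0.999`, and the names `lamb_step` and `l2_norm` are hypothetical helpers.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(v * v for v in vec))


def lamb_step(params, grads, m, v, step, lr=0.02, beta1=0.9, beta2=0.999,
              eps=1e-6, weight_decay=0.0):
    """Apply one LAMB-style update in place (Phi assumed to be the identity)."""
    for x, g, m_i, v_i in zip(params, grads, m, v):
        # ADAM-style first and second moment estimates, per dimension.
        for j in range(len(x)):
            m_i[j] = beta1 * m_i[j] + (1 - beta1) * g[j]
            v_i[j] = beta2 * v_i[j] + (1 - beta2) * g[j] * g[j]
        # Bias-corrected update, normalized by the root of the second moment.
        r = [(m_i[j] / (1 - beta1 ** step))
             / (math.sqrt(v_i[j] / (1 - beta2 ** step)) + eps)
             + weight_decay * x[j]
             for j in range(len(x))]
        # Layer-wise normalization: scale the update by ||x|| / ||r||.
        w_norm, r_norm = l2_norm(x), l2_norm(r)
        trust = w_norm / r_norm if w_norm > 0 and r_norm > 0 else 1.0
        for j in range(len(x)):
            x[j] -= lr * trust * r[j]


params = [[3.0, 4.0], [0.3, 0.4]]      # two "layers" with different norms
grads = [[1.0, 1.0], [100.0, 100.0]]   # very different gradient scales
m = [[0.0, 0.0], [0.0, 0.0]]
v = [[0.0, 0.0], [0.0, 0.0]]
before = [list(layer) for layer in params]
lamb_step(params, grads, m, v, step=1, lr=0.02)
relative_change = [
    l2_norm([a - b for a, b in zip(new, old)]) / l2_norm(old)
    for new, old in zip(params, before)
]
```

In this example both layers move by the same relative amount (the learning rate), even though their gradients differ by two orders of magnitude. This is the effect of scaling the update to unit norm and rescaling it by the norm of the layer's weights.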
4. EXPERIMENTAL RESULTS
4.1. Dataset
The experiments were carried out on the SAT-4 and SAT-6 airborne datasets [13]. The patches were created using the National Agriculture Imagery Program (NAIP) dataset, which consists of 330,000 scenes covering the Continental United States. The size of each patch is 28 × 28 × 4: each patch has 4 channels (i.e., RGB with near infrared) with 1 m spatial resolution. Each patch is associated with one class. The SAT-4 dataset contains 500,000 patches and includes annotations of four land cover classes, which are barren land, trees, grassland and a class that groups together everything that does not belong to the three aforementioned classes. The dataset was split into a training set of 400,000 patches and a test set of 100,000 patches. Similarly to SAT-4, SAT-6 contains patches of size 28 × 28 × 4, but the total number of image patches is 405,000 and they are annotated with six land cover classes: barren land, trees, grassland, roads, buildings and water bodies. The training set consists of 324,000 patches and the test set of 81,000 patches.
4.2. Experimental Setup
We used the Dynamical Exascale Entry Platform (DEEP), a European pre-exascale platform which incorporates heterogeneous HPC systems. DEEP is being developed by the European project Dynamical Exascale Entry Platform – Extreme Scale Technologies (DEEP-EST)². The Extreme Scale Booster (ESB) partition hosts 75 nodes, each equipped with 1 NVIDIA Tesla V100 Graphics Processing Unit (GPU) with 32 GB of memory. To test whether the data distributed algorithm with large batch sizes can scale, we used up to 32 GPUs. We used Python 3.8.5 and the following libraries for DL and the data distributed framework: TensorFlow 2.3.1, Horovod 0.20.3, Scikit-learn 0.20.3 and Scipy 1.5.2. SAT-4 and SAT-6 are saved as MATLAB .mat files and were read using the Scipy library. We trained the models with the Keras API and built the input pipeline with the TensorFlow data API.

| Batch size | 8K | 16K | 32K | 65K |
|---|---|---|---|---|
| Learning rate | 0.02 | 0.028 | 0.04 | 0.05 |
| Warm-up (epochs) | 5 | 10 | 20 | 40 |

Table 1. Hyper-parameters of the LAMB optimizer as in the experimental setting of [8].

We trained a ResNet-50 [...] weights, on the datasets SAT-4 and SAT-6 [13]. Each patch of the dataset is associated with one of the classes, making this a patch-based multi-class classification problem. Thus, we stacked a fully connected layer (with 6 neurons for SAT-6 and 4 neurons for SAT-4), activated with the softmax function, on top of the model. We trained the models for 100 epochs. The initial learning rate was set using a heuristic that computes the learning rate proportionally to the square root of the effective batch size. We also adopted a polynomial scheduler of order 2 for the learning rate as shown in [8], as well as a warm-up that gradually ramps up the value of the learning rate at the beginning of the training. The combination of these techniques helps solve the problem of instability that can cause exploding gradients. The hyper-parameters were selected based on [8] and are shown in Tab. 1. As a baseline for comparison we also used the ADAM optimizer with a fixed learning rate of 0.001 as in [12] and a batch size of 64. We also tested the SGD optimizer with hyper-parameters as explained in [4], but the training did not converge in 100 epochs; a further exploration of the hyper-parameter space is therefore needed, and results could be reported in future work. We implemented a simple data augmentation with random flips and rotations of the patches, which helps reduce overfitting.
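The learning rate strategy described in this section (square-root scaling of the initial rate, warm-up, and polynomial decay of order 2) can be sketched as two small functions. This is an illustration under stated assumptions, not the code of the experiments: the reference point (batch size 8192 with learning rate 0.02) is taken from the first column of Tab. 1 with "8K" read as 8192, the warm-up is assumed to ramp linearly from zero, and the decay is assumed to start after the warm-up and reach zero at the last epoch. The names `sqrt_scaled_lr` and `lamb_schedule` are hypothetical.

```python
import math


def sqrt_scaled_lr(batch_size, ref_batch_size=8192, ref_lr=0.02):
    """Initial learning rate proportional to the square root of the batch size."""
    return ref_lr * math.sqrt(batch_size / ref_batch_size)


def lamb_schedule(epoch, peak_lr, warmup_epochs, total_epochs=100, power=2.0):
    """Linear warm-up followed by polynomial decay of the given order."""
    if epoch < warmup_epochs:
        return peak_lr * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return peak_lr * (1.0 - progress) ** power


# Initial rates given by the square-root rule for the batch sizes of Tab. 1.
initial_rates = {bs: sqrt_scaled_lr(bs) for bs in (8192, 16384, 32768, 65536)}
# Schedule for the 8K setting (5 warm-up epochs) at a few selected epochs.
lr_8k = [lamb_schedule(e, 0.02, warmup_epochs=5) for e in (0, 5, 52.5, 100)]
```

Under these assumptions the rule gives roughly 0.028 and 0.04 for 16K and 32K, matching Tab. 1, and about 0.057 for 65K, where Tab. 1 lists 0.05.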
4.3. Evaluation
The accuracy and loss metrics shown in Tab. 2, 3 and 4 are the average of 3 runs for each set of hyper-parameters. Using LAMB and batch sizes up to 32K we could obtain results that are comparable to those obtained using small batch sizes and consistent with state-of-the-art results [14]. In particular, the accuracy obtained by using LAMB with a batch size of 8K is very similar to that obtained using ADAM with a much smaller batch size of 64 on both datasets (shown in Tab. 2, 3 and 4). We did not test the ADAM optimizer with large batch sizes, since it is known that it tends not to generalize well on test data in that regime [15]. As stated above, results remain acceptable with batch sizes of 32K using the LAMB approach, while they diverge with a batch size of 65K. We can observe that as the batch size increases, the test loss tends to increase and the accuracy to decrease, up to the point where they significantly diverge from the baseline results. This happens even though we did not observe training difficulties such as exploding gradients, a behaviour that was observed using SGD with large batches [4]. As the batch size grows, the generalization gap becomes non-negligible [15] due to the fact that optimizers in the large batch size regime converge to sharp instead of flat minimizers [16].

| Batch size | N. GPUs | Accuracy | Loss | Time per epoch [s] |
|---|---|---|---|---|
| 8K | 4 | 0.99 | 0.02 | 34 |
| 16K | 8 | 0.98 | 0.07 | 18 |
| 32K | 16 | 0.96 | 0.11 | 9 |
| 65K | 32 | diverges | diverges | 5 |

Table 2. Accuracy and test loss, and training time per epoch, with the LAMB optimizer on the SAT-4 dataset.

| Batch size | N. GPUs | Accuracy | Loss | Time per epoch [s] |
|---|---|---|---|---|
| 8K | 4 | 0.99 | 0.05 | 41 |
| 16K | 8 | 0.98 | 0.11 | 22 |
| 32K | 16 | 0.94 | 0.17 | 11 |
| 65K | 32 | diverges | diverges | 6 |

Table 3. Accuracy and test loss, and training time per epoch, with the LAMB optimizer on the SAT-6 dataset.

² https://www.deep-est.eu/
The scaling in terms of the time required to complete an epoch is slightly less than linear w.r.t. the number of GPUs employed for the training (Tab. 2 and 3). We hypothesize that the use of a large per-worker batch size is beneficial for scaling: with large batch sizes, the communication time (time spent exchanging the gradients between the workers) remains smaller than the computation time (time spent propagating the batches back and forth through the CNN), reducing the idle time of the GPUs. A conclusive and thorough study of the possible set-ups could also be beneficial to other researchers in the field.
| Dataset | Batch size | Accuracy | Loss | Time per epoch [s] |
|---|---|---|---|---|
| SAT-4 | 64 | 0.98 | 0.02 | 263 |
| SAT-6 | 64 | 0.98 | 0.04 | 214 |

Table 4. Accuracy and test loss, and training time per epoch, with the ADAM optimizer on the SAT-4 and SAT-6 datasets. These experiments were carried out on a single GPU.
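The "slightly less than linear" scaling reported for Tab. 2 and 3 can be quantified with a short calculation. The sketch below computes the scaling efficiency relative to the smallest configuration (4 GPUs), using the per-epoch times from the two tables; the function name `scaling_efficiency` and the choice of the 4-GPU run as the reference are assumptions made for illustration.

```python
def scaling_efficiency(times_by_gpus):
    """Efficiency relative to the smallest GPU count (1.0 means linear scaling)."""
    base_gpus = min(times_by_gpus)
    base_time = times_by_gpus[base_gpus]
    return {n: (base_time * base_gpus) / (t * n)
            for n, t in times_by_gpus.items()}


# Time per epoch in seconds, keyed by number of GPUs (from Tab. 2 and Tab. 3).
sat4_times = {4: 34, 8: 18, 16: 9, 32: 5}
sat6_times = {4: 41, 8: 22, 16: 11, 32: 6}

sat4_eff = scaling_efficiency(sat4_times)
sat6_eff = scaling_efficiency(sat6_times)
```

With these numbers the efficiency stays at roughly 0.93 to 0.94 for 8 and 16 GPUs and drops to about 0.85 for 32 GPUs on both datasets. Since the tabulated times are rounded to whole seconds, these values are only approximate.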
5. CONCLUSIONS
In this work the LAMB optimizer was used to train a ResNet-50 model with large batch sizes up to 32K. The results obtained with the SAT-4 and SAT-6 RS datasets showed that the classification performance was not significantly affected and that a processing speed-up was achieved. Training the model with batch sizes above the threshold of 32K is still problematic, as shown by the results obtained with a batch size of 65K. An additional consideration is that in the present work we used two RS datasets with a simple multi-class classification problem; whether this approach can be extended to more complex classification problems remains an open question. We are currently planning a comparison with the Layer-wise Adaptive Rate Scaling (LARS) optimizer [5], which might be included in future publications. This work should nevertheless be considered a preliminary assessment: a systematic analysis should be undertaken that also includes other training strategies and algorithms for dealing with large batch sizes, such as the adoption of a cyclical learning rate scheduler [17] and the dynamic increase of the batch size during the training [18]. A quantitative analysis of the optimal configuration for distributed DL, such as the per-worker batch size or specific parameters of Horovod, is also lacking at the moment and is in the future plans of the authors. The repository with the Python code is publicly available³.
6. REFERENCES
[1] G. Fox, J. A. Glazier, et al., “Learning Everywhere: Pervasive Machine Learning for Effective High-Performance Computation,” in 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2019, pp. 422–429.
[2] T. Ben-Nun and T. Hoefler, “Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis,” CoRR, vol. abs/1802.09941, 2018.
[3] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2016.
[4] P. Goyal, P. Dollár, et al., “Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour,” arXiv preprint arXiv:1706.02677.
[5] Y. You, I. Gitman, and B. Ginsburg, “Large Batch Training of Convolutional Networks,” 2017.
[6] M. Yamazaki, A. Kasagi, et al., “Yet Another Accelerated SGD: ResNet-50 Training on ImageNet in 74.7 seconds,” arXiv preprint arXiv:1903.12650, 2019.
[7] R. Sedona, G. Cavallaro, J. Jitsev, A. Strube, M. Riedel, and J. A. Benediktsson, “Remote Sensing Big Data Classification with High Performance Distributed Deep Learning,” Remote Sensing, vol. 11, no. 24, p. 3056, Dec 2019.
[8] Y. You, J. Li, et al., “Large Batch Optimization for Deep Learning: Training BERT in 76 minutes,” 2020.
[9] A. Sergeev and M. Del Balso, “Horovod: Fast and Easy Distributed Deep Learning in TensorFlow,” arXiv preprint arXiv:1802.05799, 2018.
[10] T. Ben-Nun and T. Hoefler, “Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis,” ACM Computing Surveys, 2019.
[11] B. Hanin and D. Rolnick, “How to Start Training: The Effect of Initialization and Architecture,” 2018.
[12] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” 2017.
[13] S. Basu, S. Ganguly, S. Mukhopadhyay, R. DiBiano, M. Karki, and R. Nemani, “DeepSat - A Learning Framework for Satellite Imagery,” 2015.
[14] M. A. Kadhim and M. H. Abed, “Convolutional Neural Network for Satellite Image Classification,” pp. 165–178, Springer International Publishing, Mar 2019.
[15] E. Hoffer, I. Hubara, and D. Soudry, “Train Longer, Generalize Better: Closing the Generalization Gap in Large Batch Training of Neural Networks,” in Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS’17), Red Hook, NY, USA, 2017, pp. 1729–1739, Curran Associates Inc.
[16] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang, “On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima,” 2017.
[17] L. N. Smith, “Cyclical Learning Rates for Training Neural Networks,” 2017.
[18] S. L. Smith, P. J. Kindermans, C. Ying, and Q. V. Le, “Don’t Decay the Learning Rate, Increase the Batch Size,” 2018.
Deep learning thrives with large neural networks and large datasets. However, larger networks and larger datasets result in longer training times that impede research and development progress. Distributed synchronous SGD offers a potential solution to this problem by dividing SGD minibatches over a pool of parallel workers. Yet to make this scheme efficient, the per-worker workload must be large, which implies nontrivial growth in the SGD minibatch size. In this paper, we empirically show that on the ImageNet dataset large minibatches cause optimization difficulties, but when these are addressed the trained networks exhibit good generalization. Specifically, we show no loss of accuracy when training with large minibatch sizes up to 8192 images. To achieve this result, we adopt a linear scaling rule for adjusting learning rates as a function of minibatch size and develop a new warmup scheme that overcomes optimization challenges early in training. With these simple techniques, our Caffe2-based system trains ResNet-50 with a minibatch size of 8192 on 256 GPUs in one hour, while matching small minibatch accuracy. Using commodity hardware, our implementation achieves ~90% scaling efficiency when moving from 8 to 256 GPUs. This system enables us to train visual recognition models on internet-scale data with high efficiency.