Speeding up DNNs using HPL based Fine-grained
Tiling for Distributed Multi-GPU Training
Kaustubh Shivdikar, Kaushal Paneri and David Kaeli
Department of Electrical and Computer Engineering
Email: firstname.lastname@example.org, email@example.com, firstname.lastname@example.org
Abstract—With the advent of hardware accelerators, it is
possible to carry out massive computations on petascale-class
problems using graphics processors. One of the challenges when
designing programs to exploit a GPU is the strong relationship
between performance and the mapping/utilization of GPU mem-
ory. This paper considers how to map a Deep Neural Network
(DNN) application to the GPU memory systems. We consider
an optimized version of a DNN model leveraging coarse-grained
The work presented includes two studies. First, we consider
the performance of the High Performance Linpack (HPL) benchmark, a
highly optimized implementation of Linpack. Second, using the
insights derived from the first study, we optimize the performance of
DNNs by altering their batch size. We consider two different types
of DNNs, presenting different forms of parallelism in terms of
data-level and model-level characteristics. The HPL study on
8 NVIDIA K80 GPUs yielded an optimal block size of 160,
which corresponds to a batch size of 64 images. Comparing
this method to a heuristic approach, a speedup of 1.46x for CNN
A modern-day system hosting multiple NVIDIA DGX
servers could easily have been in the top 100 systems on the
Top500 list of 2012. The raw compute capability
of today's GPU systems has spurred renewed interest by
industry and academia in deep learning applications. In this
work, we consider how best to map DNNs to a GPU’s memory
An important factor to consider when implementing
high-level software models on any cluster is how they map
onto the underlying hardware. Fatahalian et al. discuss
the challenges of programming a memory hierarchy using a
programming language they developed called Sequoia. They
further discuss the problems of increased latency that result
directly from a multi-level memory model.
The paper provides an example of mapping a 1024x1024
matrix-matrix multiplication to different caches. Choosing
the right tile size can impact performance significantly. If the
problem is written in a way that maps well to the memory
hierarchy, long memory latencies can be hidden.
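The tiling idea can be illustrated with a minimal sketch of a blocked matrix-matrix multiplication. This is not Sequoia code and is not taken from the cited paper; the function names and the `tile` parameter are hypothetical, and plain Python lists are used only to keep the example self-contained. Each tile of the operands is reused many times before the loops move on, which is the property that lets a real implementation keep the working set in a fast memory level.

```python
def matmul_naive(a, b, n):
    """Reference n-by-n multiplication with the usual triple loop."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            a_ik = a[i][k]
            for j in range(n):
                c[i][j] += a_ik * b[k][j]
    return c


def matmul_tiled(a, b, n, tile):
    """Multiply two n-by-n matrices one tile-by-tile block at a time.

    `tile` is a hypothetical tuning knob: it plays the role of the
    block size chosen to fit a given level of the memory hierarchy.
    """
    c = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile):              # tile rows of C
        for j0 in range(0, n, tile):          # tile columns of C
            for k0 in range(0, n, tile):      # tiles along the inner dimension
                for i in range(i0, min(i0 + tile, n)):
                    for k in range(k0, min(k0 + tile, n)):
                        a_ik = a[i][k]
                        for j in range(j0, min(j0 + tile, n)):
                            c[i][j] += a_ik * b[k][j]
    return c
```

Both functions compute the same product; only the order in which the operands are visited changes. The tile size does not need to divide `n` evenly, since the inner ranges are clipped at the matrix boundary.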
A typical approach to understanding the performance of a
memory system is to perform benchmarking. The High
Performance variant of the Linpack benchmark, named HPL and
commonly used to benchmark supercomputers, provides a
good starting point. Moreover, there is considerable similarity
between Deep Neural Nets and Linpack, since both involve
solving linear systems of equations.
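For readers unfamiliar with what Linpack measures, the following is a minimal sketch of solving a dense system Ax = b by Gaussian elimination with partial pivoting. HPL solves the same kind of problem with a blocked, distributed LU factorization; the unblocked pure-Python version below is only an illustration, and the function name `solve_linear_system` is hypothetical.

```python
def solve_linear_system(a, b):
    """Solve A x = b for a small dense system using partial pivoting."""
    n = len(a)
    # Work on an augmented copy [A | b] so the inputs are left untouched.
    m = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]

    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]

        # Eliminate the entries below the pivot.
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]

    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x
```

The elimination loop is dominated by multiply-add updates on matrix rows, the same kind of dense arithmetic that dominates DNN training.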
II. EXPERIMENTAL APPROACH
We start by developing a Convolutional Neural Network (CNN)
model trained on the MNIST handwriting dataset using Keras
with the TensorFlow-GPU backend and CUDA, cuDNN and
cuBLAS integrated. Categorical cross-entropy was used as the loss
function and the Adam optimizer was used for training. The
total number of trainable parameters was 1,192,042. The
colored digits in Fig. 1 indicate the number of nodes in each
Fig. 1. Single GPU CNN Model for MNIST dataset.
Our experiments were performed using multiple GPUs
and applying coarse-grained parallelism techniques (i.e.,
model-level and data-level parallelism) to the same
CNN model. In model-level parallelism, the first dense layer
is distributed across multiple GPUs. To achieve data parallelism,
we used a mini-batch RMSProp algorithm for distributed
learning and applied asynchronous weight updates for
robustness and ease of implementation. To evaluate
the speed of the DNNs, we record the epoch time.
Because the first epoch takes the most time to run, as it
randomly initializes the weights, we focus on the average time of
the first epoch collected across multiple runs.
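The measurement described above can be sketched as a small timing harness. This is an assumption about how such a measurement might be scripted, not the authors' code: `run_first_epoch` is a hypothetical callable that builds a freshly initialized model and trains it for one epoch.

```python
import time


def average_first_epoch_time(run_first_epoch, runs=5):
    """Time the first epoch over several independent runs.

    `run_first_epoch` is a hypothetical zero-argument callable that
    creates a new model (fresh random weights) and trains one epoch.
    Returns the mean wall-clock time in seconds and the raw samples.
    """
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        run_first_epoch()
        samples.append(time.perf_counter() - start)
    return sum(samples) / len(samples), samples
```

Keeping the raw samples alongside the mean makes it possible to check how much the first-epoch time varies between runs.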
The dimensions of an image in the MNIST database are
28x28, with each pixel stored in single precision (32 bits), resulting in an
image size of 25,088 bits. To map these onto a GPU memory
hierarchy, the output from the HPL library is used. HPL
divides the problem set into smaller matrices (called blocks),
each of dimension NB-by-NB. We start by tuning the tile
size parameter "NB" across different runs of the same version
of HPL, and identify the parameter configuration that obtains
the best speedup.
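The block decomposition and the NB sweep can be sketched as follows. This is an illustration under stated assumptions rather than HPL's implementation: `block_grid` and `best_block_size` are hypothetical helper names, and `measure_gflops` stands in for whatever procedure runs the benchmark at a given NB and reports its throughput.

```python
def block_grid(n, nb):
    """Split an n-by-n matrix into NB-by-NB blocks.

    Returns (row_start, row_end, col_start, col_end) tuples with
    exclusive end indices; edge blocks are smaller when nb does not
    divide n evenly.
    """
    blocks = []
    for r in range(0, n, nb):
        for c in range(0, n, nb):
            blocks.append((r, min(r + nb, n), c, min(c + nb, n)))
    return blocks


def best_block_size(candidates, measure_gflops):
    """Return the candidate NB with the highest measured throughput.

    `measure_gflops` is a hypothetical callable mapping a block size
    to a performance figure, e.g. one benchmark run per value.
    """
    return max(candidates, key=measure_gflops)
```

In practice each call to `measure_gflops` would correspond to one full benchmark run, so the candidate list is usually kept short.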
III. RESULTS
After running multiple iterations of HPL on a single node
comprising 8 NVIDIA Tesla K80 GPUs, we found the
fastest block size to be 160 (see Fig. 2). Thus, the minimum
memory requirement for this tile can be computed using Equation 1.
Fig. 2. Performance plot in 100 GFLOPS with varying block sizes of HPL.
\[ \text{HPL Tile Size} = 160 \times 160 \times 64\ (\text{double precision}) = 1{,}638{,}400 \tag{1} \]
As the number of bits in a single image is known (i.e.,
25,088 bits), the batch size (the number of images processed per
training iteration) leading to the best performance can be derived
from Equation 2, which gives approximately 65.3.
\[ \text{Batch Size} = \frac{\text{HPL Tile Size}}{\text{Image Size}} \tag{2} \]
Since the pooling layer in a CNN reduces the window size
in powers of two, we explore powers of two while selecting
the batch size. In this case, Equation 3 produces the best batch
size value of 64.
\[ \text{Batch Size}_{\text{Optimal}} = 2^{\operatorname{nint}\left(\log_2\left(\frac{\text{HPL Tile Size}}{\text{Image Size}}\right)\right)} \tag{3} \]
where nint() is the nearest-integer function.
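The arithmetic in Equations 1 to 3 can be checked with a short script. The function names are hypothetical; the numeric inputs (block size 160, 64-bit doubles, 28x28 images at 32 bits per pixel) are the values stated in the text.

```python
import math


def hpl_tile_bits(nb, bits_per_element=64):
    """Equation 1: memory footprint of one NB-by-NB tile, in bits."""
    return nb * nb * bits_per_element


def image_bits(height, width, bits_per_pixel=32):
    """Size of one single-precision image, in bits."""
    return height * width * bits_per_pixel


def nint(x):
    """Nearest-integer function (halves round up)."""
    return math.floor(x + 0.5)


def batch_size(tile_bits, img_bits):
    """Equation 2: how many images fit in one HPL tile."""
    return tile_bits / img_bits


def optimal_batch_size(tile_bits, img_bits):
    """Equation 3: Equation 2 rounded to a power of two in log space."""
    return 2 ** nint(math.log2(batch_size(tile_bits, img_bits)))


tile = hpl_tile_bits(160)       # 1,638,400 bits
image = image_bits(28, 28)      # 25,088 bits
ratio = batch_size(tile, image)
best = optimal_batch_size(tile, image)
```

With these inputs `ratio` is about 65.3 and `best` is 64, matching the values reported in the text.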
To confirm that this batch size was the best for our DNN, we
trained the same network with other batch sizes and
measured their epoch run times. The results are plotted in Figure 3
and Figure 4.
With a heuristic approach, the search for the batch size
would start with 32 images per batch and
gradually increase it to find the best value. Although a
batch size of 32 would produce a lower loss, the time taken
to complete one epoch would be considerably longer.
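For comparison, the heuristic baseline can be sketched as a simple search loop. The text only says that the batch size is gradually increased from 32; the doubling schedule, the stopping rule, and the names `heuristic_batch_search` and `epoch_time` below are assumptions made for illustration.

```python
def heuristic_batch_search(epoch_time, start=32, max_batch=1024):
    """Grow the batch size until the epoch time stops improving.

    `epoch_time` is a hypothetical callable that trains one epoch at
    the given batch size and returns the elapsed time in seconds.
    The doubling step is an assumption, not taken from the paper.
    """
    best_batch = start
    best_time = epoch_time(start)
    batch = start * 2
    while batch <= max_batch:
        elapsed = epoch_time(batch)
        if elapsed >= best_time:
            break  # no further improvement, stop searching
        best_batch, best_time = batch, elapsed
        batch *= 2
    return best_batch, best_time
```

Every step of this loop costs at least one full training epoch, which is the overhead that deriving the batch size from the HPL tile size avoids.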
Fig. 3. SingleGPU model on Tesla K80 GPU: Execution time(Seconds) Vs.
Fig. 4. SingleGPU model on Tesla K80 GPU: Categorical cross-entropy vs
The proposed method of speeding up DNNs using
HPL has been shown to be superior to manually tuning the
batch size through heuristics. By using
a batch size of 64, we obtain a 1.46x speedup. Since this
improvement applies to every epoch, the gain over a
thousand-epoch training run should be significant.
Tuning a large number of hyper-parameters for neural
networks has always been a time-consuming iterative process.
This work provides a method to speed up DNNs by selecting
the best batch size. We approach the selection from a hardware
perspective rather than by brute force. Since the focus of
our work here is to consider batch size as a function of the GPU's
memory model and architecture, we did not consider the
speed of convergence for a particular optimization problem.
Speedups for DNNs can also be achieved by simply allocating
more GPUs to the process. This is where convergence
invariance comes into consideration. Tallada
describes how doubling the number of GPUs and halving the
batch size does not guarantee a 2x speedup. The convergence
of a network is an area for future work. This work can also
be expanded to take into consideration the overall network
structure parameters, such as the number of nodes and number
 Wang E. et al., “High-Performance Computing on the Intel Xeon Phi”,
Springer 2014, ISBN 978-3-319-06486-4.
 Fatahalian K. et al., “Sequoia: programming the memory hierarchy”, SC
2006, Proceedings of the 2006 ACM/IEEE conference on Supercomput-
 Petitet A. et al., “HPL - A Portable Implementation of the High-
Performance Linpack Benchmark for Distributed-Memory Computers”,
Innovative Computing Laboratory, Feb 2016, University of Tennessee.
 Szegedy C. et al., “Going Deeper with Convolutions”, Computer Vision
and Pattern Recognition (CVPR), 2015 IEEE Conference.
 Kingma D. and Ba J., “ADAM: A Method for Stochastic Optimization”,
International Conference on Learning Representations (ICLR) 2015.
 Schaa D. and Kaeli D., “Exploring the multiple-GPU design space”,
Parallel & Distributed Processing, 2009, IPDPS.
 Dean J. et al., “Large Scale Distributed Deep Networks”, Neural Infor-
mation Processing Systems, NIPS 2012.
 Tallada M. “Coarse grain parallelization of deep neural networks” in
PPoPP ’16 Proceedings of the 21st ACM SIGPLAN Symposium on
Principles and Practice of Parallel Programming.
 Tieleman T. and Hinton G., “Lecture 6.5 RmsProp: Divide the gradient by a
running average of its recent magnitude”, COURSERA: Neural Networks
for Machine Learning.
 LeCun Y., Cortes C. and Burges C., “The MNIST Database of Handwrit-
ten Digits”, Google Labs New York and Microsoft Research, Redmond.