Exhaustive Study of Hierarchical AllReduce Patterns for Large Messages Between GPUs



Yuichiro Ueno
Department of Computer Science
Tokyo Institute of Technology
Tokyo, Japan
Rio Yokota
Global Scientific Information and Computing Center
Tokyo Institute of Technology
Tokyo, Japan
Abstract—Data-parallel distributed deep learning requires an
AllReduce operation between all GPUs with message sizes on the
order of hundreds of megabytes. The popular implementation
of AllReduce for deep learning is the Ring-AllReduce, but this
method suffers from latency when using thousands of GPUs.
There have been efforts to reduce this latency by combining
the ring with more latency-optimal hierarchical methods. In
the present work, we consider these hierarchical communication
methods as a general hierarchical Ring-AllReduce with a pure
Ring-AllReduce on one end and Rabenseifner’s algorithm on the
other end of the spectrum. We exhaustively test the various com-
binations of hierarchical partitioning of processes on the ABCI
system in Japan on up to 2048 GPUs. We develop a performance
model for this generalized hierarchical Ring-AllReduce and show
the lower-bound of the effective bandwidth achievable for the
hierarchical NCCL communication on thousands of GPUs. Our
measurements agree well with our performance model. We also
find that the optimal large-scale process hierarchy contains the
optimal small-scale process hierarchy so the search space for the
optimal communication will be reduced.
Index Terms—Hierarchical, AllReduce, Large Message, GPU,
InfiniBand, NVLink, NCCL, Deep Learning
Deep learning allows very large models to be trained on
very large datasets, so the computation time can quickly
become intractable. Furthermore, many training runs need to
be performed to find the optimal hyperparameters such as
the learning rate, momentum, dropout, and weight decay.
Distributed parallel computing is one solution to this large
demand in computational power. The simplest approach to
harness the power of a large cluster of computers is to run
the hyperparameter search in parallel. However, being able to
run a single training job on multiple nodes/GPUs could allow
faster convergence to an optimum set of hyperparameters due
to better feedback during the search [1].
Large-scale parallel deep learning is performed on the
largest supercomputers today. For example, the winners of
the ACM Gordon Bell Prize for 2018 used more than 25,000
NVIDIA Tesla V100 GPUs to train a deep neural network
to improve the predictive capability of climate models [2].
Other finalists of the prize for 2018 also used deep learning
for extracting data from electron microscopes [3].
Computational resource of AI Bridging Cloud Infrastructure (ABCI) was
awarded by "ABCI Grand Challenge" Program, National Institute of Advanced
Industrial Science and Technology (AIST). This work is supported by JST
CREST Grant Number JPMJCR1687, Japan.
A popular benchmark for large-scale distributed deep learn-
ing is the training of ResNet-50 on the ImageNet-1K dataset
on thousands of processes. Goyal et al. achieved a top-1
validation accuracy of 76.3% in 1 hour using 256 P100 GPUs
[4]. Akiba et al. achieved 74.9% in 15 minutes using 1,024
P100 GPUs [5]. Jia et al. achieved 75.8% in 6.6 minutes
using 2,048 P40 GPUs [6]. Mikami et al. achieved 75% in
3.7 minutes using 2,176 V100 GPUs [7]. Ying et al. achieved
76.3% in 2.2 minutes using 1,024 TPUs [8].
Training of deep neural networks on distributed systems
can be done either by distributing the data and replicating the
model or distributing the model and replicating the data among
the distributed processes. The former is called data-parallel
and typically requires a coarse-grained AllReduce collective
communication of the gradients among all processes. The
latter is called model-parallel and requires fine-grained
point-to-point communication of the activations/gradients between
layers. Data-parallelism has been the more popular approach
due to its scalability and ease of implementation. In this
work, we will focus on the collective communication used in
the data-parallel approach. Even though the large volume of
computation typically hides the latency in the communication
of deep learning, it has been reported that communication
could become a bottleneck past 1,024 GPUs [2].
As the number of processes participating in the collective
communication grows, the optimal algorithm changes. For example,
using the Ring-AllReduce on thousands of processes is
suboptimal, although it is known to be faster for sending large
data to a small number of processes. Combining the collectives
of the NVIDIA Collective Communications Library (NCCL)
[9] and MPI is known to improve performance [10]. Fur-
thermore, using the collectives of NCCL two-dimensionally
also works well [7]. In this way, grouping rings into a
few hierarchical collective operations seems to give a better
performance, but the optimal dimensions of this hierarchical
communication depend on the network topology, the number
of GPUs per node, the bandwidth within the node and between
the nodes, the number of processes, and the message size. We
aim to provide a strategy for choosing the optimal hierarchical
communication for deep learning workloads.
Our contributions are as follows:
1) We consider the hierarchical AllReduce communication
as a hybrid method with the Ring-AllReduce at one end
and Rabenseifner’s Algorithm at the other end. We show
that both of these algorithms at the far ends are suboptimal
for typical deep learning workloads.
2) We develop a model that predicts the performance of the
hierarchical AllReduce from the number of dimensions, the
number of processes, and the message size, and verify its
accuracy on an InfiniBand-connected system with multiple
GPUs per node.
3) We implement the hierarchical AllReduce on both
NVLink and InfiniBand using NCCL for the elementary
communication at every level of the hierarchical AllReduce.
We test this for a wide range of hierarchical topologies
and numbers of processes to show the effect of the low
latency of NCCL.
4) Through performance modeling and systematic
experimentation, we determine a strategy for choosing the
optimal hierarchical AllReduce for any network topology,
number of GPUs per node, bandwidth within the node and
between the nodes, number of processes, and message size.
In this section, we describe the general properties of the
communication and parallelization of deep neural network
training. For a thorough review of this topic, we refer the
reader to [11].
A. Parallelism of Deep Learning
Distributed parallel computers can accelerate deep learning
by exploiting the two types of parallelism available. One is to
parallelize over the model and the other is to parallelize over
the data. In this work, we focus on the data-parallelism.
In data-parallel distributed deep learning, the data is dis-
tributed and the parameters are replicated. Multiple processes
work on a different subset of the entire data set, and aggregate
this information at the end of each iteration to update the
shared parameters among the processes. Instead of reducing
the time it takes to process a single data sample, it processes
more samples in a given amount of time.
B. Distributed Stochastic Gradient Descent (SGD)
In data-parallel distributed deep learning, different processes
use different training data, and the different gradients that
result from this are used to update the parameters. There
are two well-known updating methods: the asynchronous
method (Asynchronous SGD) and the synchronous method
(Synchronous SGD).
1) Asynchronous SGD: Asynchronous SGD can have a
number of different implementations. A typical one would be
to use parameter servers, which are servers dedicated to
keeping track of the parameters. The worker processes, in charge of
the training, calculate the forward propagation and backward
propagation and send the resulting gradients to the parameter
server. The parameter servers will then update the parameters
according to the received gradient. The parameter server then
sends the updated parameters to the worker process. When the
worker process receives the updated parameters, the worker
uses them to calculate the forward and backward propagation.
2) Synchronous SGD: Synchronous SGD typically does
not use parameter servers. The dataset is distributed to all
processes and they all serve as workers, which do forward and
backward propagation on the distributed datasets. Then, at the
end of each iteration, an AllReduce collective communication
is performed to calculate the average of the gradients among
all processes. This gives the equivalent result to using a
large mini-batch on a single process, but the calculation will
be faster because the data is actually processed in parallel.
Under the assumption that the parameters on all processes are
initialized with the same value, this method allows the model
to remain consistent between the processes without sending
the parameters. One drawback of this approach is that the
effective mini-batch could become exceedingly large, which
is known to have a detrimental effect on the accuracy [12].
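The equivalence between averaging per-process gradients and using one large mini-batch can be checked with a minimal sketch. It assumes a one-parameter least-squares model and equal-sized shards; the helper names `grad` and `allreduce_mean` are hypothetical, and `allreduce_mean` only stands in for a real AllReduce.

```python
# Illustrative only: loss = mean((w * x - y)^2) for a single parameter w.
def grad(w, xs, ys):
    """Mean gradient of (w * x - y)^2 with respect to w over the given samples."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def allreduce_mean(values):
    """Stand-in for an AllReduce (sum) followed by division by the process count.
    Every process ends up with the same averaged value."""
    total = sum(values)
    return [total / len(values)] * len(values)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 16.2]
w = 0.5
num_procs = 4
shard = len(xs) // num_procs  # assumption: every process gets the same shard size

# Each process computes a gradient on its own shard of the data.
local_grads = [
    grad(w, xs[r * shard:(r + 1) * shard], ys[r * shard:(r + 1) * shard])
    for r in range(num_procs)
]
averaged = allreduce_mean(local_grads)   # what every process holds afterwards
full_batch = grad(w, xs, ys)             # single process, one large mini-batch
```

With unequal shard sizes the plain average would no longer match the full-batch gradient, which is why the sketch fixes the shard size.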
3) Characteristic of Communication in Synchronous SGD:
In the current work, we focus on Synchronous SGD. As
mentioned earlier, this requires us to perform an AllReduce
collective communication between all processes at every it-
eration. When the model size increases, the message size
of this AllReduce communication also increases. One of the
well-known convolutional neural networks is ResNet-50, and
this requires around 100MB communication in each iteration.
Many AllReduce implementations switch between different
algorithms depending on the number of processes and message
size. However, the range of message sizes they consider
usually extends only up to 8 MB [13], which is much smaller than
the typical message size in deep learning communications.
This motivates us to conduct a more in-depth investigation of
optimal AllReduce operations for deep learning workloads.
In this section, we discuss the details of the AllReduce
collective communication and its optimal implementation for
deep learning. An AllReduce communication performs a re-
duction operation on each element of an array on all processes
and the result of the reduction is stored on all processes.
A. Algorithms of AllReduce
There exist many algorithms for the AllReduce, but in the
present work we will focus on two of them that have optimal
bandwidth requirements: the Ring-AllReduce Algorithm [14] and
Rabenseifner’s Algorithm [13].
1) The Ring-AllReduce Algorithm: The Ring-AllReduce
algorithm forms a logical ring among all the processes involved
in the communication. Each process only communicates with
the two adjacent processes in the logical ring topology. If the
total number of processes is $p$, the array to be reduced is
partitioned into $p$ chunks. When the $i$-th process sends the
$i$-th chunk to the next $(i+1)$-th process, the same $i$-th process
will receive the $(i-1)$-th chunk from the $(i-1)$-th process.
After receiving the chunk it will perform a reduction with
its own corresponding chunk, and then pass it on to the next
$(i+1)$-th process. When $i = p-1$, $i+1 = p$ will wrap around
to $0$, and when $i = 0$, $i-1 = -1$ will wrap around to $p-1$.
When this step is done $p-1$ times, every process will have a
different chunk that has the contribution of all processes. At
this stage, the Reduce-Scatter operation is complete. In order
to complete the AllReduce operation, one must now perform
an AllGather operation, which can be done in a similar manner
using the ring in $p-1$ steps. Therefore, the total number of
communications is $2(p-1)$. If the total message size is $n$, each
chunk size will be $n/p$, so the total volume of communication
will be $2(p-1)n/p$.
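The steps above can be sketched as a small single-process simulation. This is an illustration under stated assumptions, not the NCCL implementation: the function name `ring_allreduce` is hypothetical, the reduction is a sum, and the array length is assumed to be divisible by the number of processes.

```python
def ring_allreduce(buffers):
    """Simulate a sum Ring-AllReduce.

    buffers[r] is the array held by process r. Returns the final arrays and
    the number of communication steps, which is 2 * (p - 1).
    """
    p = len(buffers)
    n = len(buffers[0])
    assert n % p == 0, "sketch assumes the array splits evenly into p chunks"
    c = n // p  # chunk length
    data = [list(b) for b in buffers]
    steps = 0

    def exchange(chunk_of, combine):
        # All processes send one chunk to their right neighbour simultaneously,
        # so the outgoing payloads are snapshotted before anything is updated.
        outgoing = []
        for r in range(p):
            idx = chunk_of(r)
            outgoing.append((idx, data[r][idx * c:(idx + 1) * c]))
        for r in range(p):
            idx, payload = outgoing[(r - 1) % p]  # receive from the left neighbour
            for j, v in enumerate(payload):
                data[r][idx * c + j] = combine(data[r][idx * c + j], v)

    # Reduce-Scatter: in step t, process r sends chunk (r - t) mod p and the
    # receiver adds it to its own copy of that chunk.
    for t in range(p - 1):
        exchange(lambda r: (r - t) % p, lambda mine, recv: mine + recv)
        steps += 1
    # Process r now holds the fully reduced chunk (r + 1) mod p.

    # AllGather: in step t, process r forwards chunk (r + 1 - t) mod p and the
    # receiver overwrites its copy.
    for t in range(p - 1):
        exchange(lambda r: (r + 1 - t) % p, lambda mine, recv: recv)
        steps += 1
    return data, steps
```

For example, `ring_allreduce([[r + 1] * 8 for r in range(4)])` runs four simulated processes and takes 2(p-1) = 6 steps.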
2) Rabenseifner’s Algorithm: Let us suppose the total
number of processes is $p = 2^k$. Rabenseifner’s Algorithm
performs the AllReduce by communicating with processes
that have a distance of $1, 2, 4, \cdots, 2^{k-1}$, therefore requiring
only $O(\log p)$ steps. In the first step, each process exchanges
data with its neighboring process. During this exchange,
it splits the array into two halves and performs a
Reduce-Scatter similar to a ring with two processes. In the next step,
pairs are formed among the processes that are two apart. The
array that was split during the first step is further split again,
and two halves are exchanged in a similar fashion. This kind
of recursive halving for the Reduce-Scatter can be combined
with a recursive doubling for the AllGather to compose an
AllReduce. The Reduce-Scatter is done with pairs of processes
with a distance of $1, 2, 4, \cdots, 2^{k-1}$, whereas the AllGather does
the reverse with pairs having a distance of $2^{k-1}, \cdots, 4, 2, 1$. The
total number of exchanges for this case is $2k$. The total volume
of communication is $2n\left(\sum_{i=1}^{k} \frac{1}{2^i}\right) = 2(p-1)n/p$.
3) Pipelining of AllReduce: Both the Ring-AllReduce and
Rabenseifner’s algorithm first perform a Reduce-Scatter, which
consists of receiving data, performing the reduction on it,
and sending it to the next process. The sending of one
chunk of data can be overlapped with the reduction of the
received chunk, so the cost of the reduction can be entirely
hidden. Because of the need to partition the chunks again to
overlap the computation with communication, the number of
communications increases and the latency does as well.
4) Comparison of AllReduce Algorithms: Let the total number
of processes be $p = 2^k$ and let the length of the array be
$n$. Then, the total amount of communication per process is
$2(p-1)n/p$ for both the Ring-AllReduce and Rabenseifner’s
Algorithm. The main difference is that the Ring-AllReduce
only communicates between adjacent processes but it does
so $2(p-1)$ times, whereas Rabenseifner’s Algorithm
communicates with processes that are exponentially far but only
$2\log_2 p$ times. Therefore, the Ring-AllReduce can avoid
contention on most network topologies [14]. We can use
a very simple performance model using the latency $\alpha$ and
inverse bandwidth $\beta$ of the network, where the point-to-point
communication time $T$ for a message of size $n$ can be written as [13]:
$$T(n) = \alpha + n\beta$$
Under this model, Rabenseifner’s algorithm is never slower
than the Ring-AllReduce, since the volume of communication
is the same but the latency term of Rabenseifner’s Algorithm is
smaller ($2\log_2 p$ messages instead of $2(p-1)$).
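The comparison can be made concrete with a short sketch of the $T(n) = \alpha + n\beta$ model. The function names are hypothetical, and pipelining and contention are deliberately ignored; the only inputs are the step counts and message sizes stated above.

```python
import math


def ring_time(n, p, alpha, beta):
    """Ring-AllReduce: 2(p-1) messages of n/p bytes each."""
    return 2 * (p - 1) * (alpha + (n / p) * beta)


def rabenseifner_time(n, p, alpha, beta):
    """Rabenseifner's Algorithm for p = 2**k: 2*log2(p) messages, with the
    message size halving at every Reduce-Scatter step (n/2, n/4, ..., n/p)."""
    k = int(math.log2(p))
    assert 2 ** k == p, "sketch assumes a power-of-two process count"
    volume = sum(n / 2 ** i for i in range(1, k + 1))  # equals (p - 1) * n / p
    return 2 * (k * alpha + volume * beta)
```

Setting `alpha` to zero makes the two functions agree, which mirrors the statement that only the latency term differs.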
B. Implementations of AllReduce
NVIDIA Collective Communications Library (NCCL) [9]
and CUDA-aware MPI [15] are the two most popular AllRe-
duce implementations between GPUs.
1) NVIDIA Collective Communications Library: NCCL
is designed for communication between GPUs in a multi-
GPU environment, where the GPUs are connected through
NVLink or PCI-Express internally, and are connected through
InfiniBand or Ethernet externally. The AllReduce collective
of NCCL is implemented using the Ring-AllReduce algorithm.
NCCL can detect the NVLink and InfiniBand connections, and
form multiple ring topologies to extract the full bandwidth of
the fabric automatically. Therefore, NCCL is used as a de facto
standard library for distributed deep learning.
2) CUDA-aware MPI: CUDA-aware MPI is an extension
of MPI, where it can handle the address of GPU memory
space. The user does not need to explicitly call the memory
copies between host and device, so it is easy to use. CUDA-
aware MPI can use GPUDirect RDMA, which enables DMA
between GPU and PCI-Express devices (e.g. InfiniBand HCA).
MPI has many different implementations for the AllReduce
so that the optimal one can be selected for a given number of
processes and message size. There has been some research to
bring its performance closer to that of NCCL [16].
In this section, we describe a hierarchical AllReduce algo-
rithm, which combines the best of both worlds from Raben-
seifner’s Algorithm and the Ring-AllReduce Algorithm.
A. Hierarchical Characteristic of Rabenseifner’s Algorithm
Rabenseifner’s Algorithm can be thought of as a hierarchical
version of the Ring-AllReduce Algorithm, where the Ring-
AllReduce is performed on pairs of two recursively. When the
number of processes is two, the two algorithms are identical.
Let us now consider the case of four processes with
ranks {0,1,2,3}. In this case, Rabenseifner’s
Algorithm will perform two steps of Reduce-Scatter and two
steps of AllGather, with a total of four steps to perform
an AllReduce. The first step of the Reduce-Scatter forms the
pairs (pair-1: {0,1}, pair-2: {2,3}). After the pairs exchange
the first half and second half of the buffer, ranks {0,2} will
receive the first half of the reduced array, while ranks {1,3}
will receive the second half of the reduced array. Next, we
perform the reduction between pairs, by doing the same
operation for pairs (pair-1: {0,2}, pair-2: {1,3}). This will
result in ranks {0,2,1,3} having the total reduced values of
the first, second, third, and fourth chunk, respectively. Then,
performing the AllGather with the pairing in the opposite order
will result in an efficient AllGather algorithm. The second
Reduce-Scatter step of Rabenseifner’s Algorithm for four
processes is identical to the Reduce-Scatter step of Rabenseifner’s
Algorithm for two processes, which means it is also identical
to a Ring-Reduce-Scatter for two processes. To summarize,
Rabenseifner’s Algorithm for four processes will perform the
operations below:
1) a Ring-Reduce-Scatter for pairs: {0,1},{2,3}
2) a Ring-Reduce-Scatter for pairs: {0,2},{1,3}
3) a Ring-AllGather for pairs: {0,2},{1,3}
4) a Ring-AllGather for pairs: {0,1},{2,3}
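The pairing schedule listed above generalizes to any power-of-two process count. The sketch below generates it, assuming the partner at distance $d$ is obtained by flipping one bit of the rank (`r ^ d`); the function name `rabenseifner_schedule` is hypothetical.

```python
def rabenseifner_schedule(p):
    """Return the (phase, pairs) steps of Rabenseifner's Algorithm for p = 2**k.

    The Reduce-Scatter uses pair distances 1, 2, 4, ..., p/2 and the AllGather
    uses the same distances in reverse order.
    """
    assert p > 0 and p & (p - 1) == 0, "sketch assumes a power-of-two process count"
    distances = []
    d = 1
    while d < p:
        distances.append(d)
        d *= 2

    def pairs(dist):
        # Each pair is listed once, with the lower rank first.
        return [(r, r ^ dist) for r in range(p) if r < (r ^ dist)]

    steps = [("Reduce-Scatter", pairs(dist)) for dist in distances]
    steps += [("AllGather", pairs(dist)) for dist in reversed(distances)]
    return steps
```

Calling `rabenseifner_schedule(4)` yields the four steps of the list above, in the same order.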
B. Extension to the Hierarchical Ring-AllReduce Algorithm
Rabenseifner’s Algorithm can be thought of as a hierarchical
Ring-AllReduce Algorithm that performs the Ring-AllReduce
on pairs of two in a multi-layer fashion. A generalization of
this is to have the rings at each layer have more than two
processes. This allows a solution to the following issues:
1) From the perspective of Rabenseifner’s Algorithm, it
can alleviate the contention that would otherwise occur
in the original Rabenseifner’s Algorithm on network
topologies that do not have full bisection bandwidth.
Also, the distance between pairs in Rabenseifner’s
Algorithm grows exponentially, but we can reduce this
exponential growth by using more processes per layer.
2) From the viewpoint of a Ring-AllReduce Algorithm, it
can reduce the number of communications and therefore
reduce the latency. The hierarchy keeps the ring at any
given layer from becoming too long and can reduce the
number of communications within a ring.
A schematic of the hierarchical Ring-AllReduce on 128
processes with a 4×8×4 hierarchical configuration and the 5
steps of communication are shown in Fig. 1. In this figure, we
assume that each node has four GPUs. Step 1 is the intra-node
Reduce-Scatter and Step 2 is the inter-node Reduce-Scatter,
where each GPU in the node participates in a separate
Ring-Reduce-Scatter between eight processes. The message
size of Step 2 is 1/4 of Step 1, since it uses the result of
the Reduce-Scatter at Step 1. Step 3 further performs an
AllReduce between 4 processes, where the message size is
now 1/4 × 1/8 = 1/32 of Step 1. Steps 4 and 5 perform an
AllGather between the same processes as the Reduce-Scatter
in Steps 2 and 1, respectively.
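The step sequence and message sizes described above can be enumerated for any layer configuration. This is a minimal sketch with a hypothetical function name; the fraction reported for each step is the size of the buffer that the collective operates on (its input for Reduce-Scatter and AllReduce, its output for AllGather), relative to the Step 1 message.

```python
from fractions import Fraction


def hierarchical_steps(dims):
    """List (operation, ring_size, message_fraction) for a hierarchical
    Ring-AllReduce with the given processes per layer, e.g. (4, 8, 4)."""
    steps = []
    frac = Fraction(1)
    # Reduce-Scatter on every layer except the innermost one; each layer
    # shrinks the message by its ring size.
    for d in dims[:-1]:
        steps.append(("Reduce-Scatter", d, frac))
        frac /= d
    # A plain AllReduce on the innermost layer.
    steps.append(("AllReduce", dims[-1], frac))
    # AllGather back out through the layers in reverse order.
    for d in reversed(dims[:-1]):
        frac *= d
        steps.append(("AllGather", d, frac))
    return steps
```

For `dims = (4, 8, 4)` this gives five steps, with the AllReduce operating on 1/32 of the original message.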
C. Analysis of the Hierarchical Ring-AllReduce Algorithm
In this section, we model the communication time $T$ of the
hierarchical Ring-AllReduce algorithm using $T(n) = \alpha + n\beta$
[13] where $n$ is the message size. Let $h$ be the number of
hierarchical layers, and $p$ be the total number of processes.
The number of processes per layer can be multiplied to yield
the total number of processes
$$p = p_1 \times p_2 \times \cdots \times p_h.$$
We consider the difference in the utilization of the bandwidth
at different layers and we denote the inverse bandwidth at the
$i$-th layer as $\beta_i$. Then, the point-to-point communication time
$T_i$ for a message size $n$ at that $i$-th layer is
$$T_i(n) = \alpha + n\beta_i.$$
By assuming that the time for the reduction can be hidden by
partitioning the message into $p_i \times s$ parts, the total communication
time $T_{i,\mathrm{all}}$ for the Ring-AllReduce of message size $n$ with
pipelining at that $i$-th layer would be
$$T_{i,\mathrm{all}}(n) = 2(p_i - 1)\, s\, T_i\!\left(\frac{n}{p_i s}\right) = 2s\alpha(p_i - 1) + 2n\beta_i \frac{p_i - 1}{p_i}.$$
The total communication time for all layers can then be
calculated as
$$T(n) = \sum_{i=1}^{h} T_{i,\mathrm{all}}\!\left(\frac{n}{\prod_{k=1}^{i-1} p_k}\right) = 2s\alpha\left(\sum_{i=1}^{h} p_i - h\right) + 2n \sum_{i=1}^{h} \beta_i \frac{p_i - 1}{\prod_{k=1}^{i} p_k}.$$
Let us consider the following decomposition of processes:
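$$p = p_{a_1} \times \cdots \times p_{b_1} \times p_{a_2} \times \cdots \times p_{b_2} \times \cdots \times p_{a_m} \times \cdots \times p_{b_m}.$$
We assume that for a given layer with processes
$p_{a_i} \times \cdots \times p_{b_i}$, the inverse bandwidth $\beta = B_i$ is constant
within that layer. Under these conditions, the interconnect
is decomposed into hierarchical layers, and the bandwidth
may differ between the layers (e.g. inter-node and intra-node
communication could be considered as different layers in the
hierarchical process decomposition).
The total communication time of the hierarchical
communication can be written as
$$T(n) = 2s\alpha\left(\sum_{i=1}^{h} p_i - h\right) + 2n \sum_{i=1}^{m} \sum_{j=a_i}^{b_i} B_i \frac{p_j - 1}{\prod_{k=1}^{j} p_k}$$
$$= 2s\alpha\left(\sum_{i=1}^{h} p_i - h\right) + 2n \sum_{i=1}^{m} B_i \left(\frac{1}{\prod_{k=1}^{a_i - 1} p_k} - \frac{1}{\prod_{k=1}^{b_i} p_k}\right).$$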
Fig. 1. A schematic of the hierarchical Ring-AllReduce on 128 processes
with 4×8×4 configuration. [Panels, in order of execution: Reduce-Scatter
on x axis, Reduce-Scatter on y axis, AllReduce on z axis, AllGather on
y axis, AllGather on x axis. Processes are labeled by coordinates (x, y, z).]
Fig. 2. A schematic of the ABCI supercomputer node configuration.
[Link labels: UPI 31.2 GT/s, PCIe 128 GT/s, NVLink 400 GT/s,
InfiniBand EDR 100 Gbps to other nodes.]
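Since $\prod_{k=1}^{b_0} p_k = 1$ (with $b_0 = 0$ and $a_i = b_{i-1} + 1$),
the total communication time can be rewritten as
$$T(n) = 2s\alpha\left(\sum_{i=1}^{h} p_i - h\right) + 2n \sum_{i=1}^{m} B_i \left(\frac{1}{\prod_{k=1}^{b_{i-1}} p_k} - \frac{1}{\prod_{k=1}^{b_i} p_k}\right)$$
$$= 2s\alpha\left(\sum_{i=1}^{h} p_i - h\right) + 2n \sum_{i=1}^{m} B_i \frac{P_i - 1}{\prod_{k=1}^{i} P_k}$$
where $p_{a_i} \times \cdots \times p_{b_i} = P_i$ and $\prod_{k=1}^{0} P_k = 1$. This is the
same as the hierarchical Ring-AllReduce with $m$ layers, except
for the difference in the $\alpha$ term. This shows that the hierarchical
decomposition within layers of the same bandwidth does not
affect the bandwidth term in the communication time. The
equivalence of the bandwidth term in the Ring-AllReduce and
Rabenseifner’s Algorithm comes from this.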
D. Hierarchical Ring-AllReduce Algorithm on a GPU-
accelerated InfiniBand Cluster
A schematic of the node configuration for the ABCI su-
percomputer at AIST is shown in Fig. 2. Each node has two
CPUs, four NVIDIA Tesla V100 GPUs, and two InfiniBand
EDR HCAs. The GPUs are connected via NVIDIA NVLink2
for high-bandwidth and low-latency communication.
Training a deep neural network on this kind of node
configuration will use each GPU for the forward and backward
propagation with an AllReduce between GPUs. We will con-
sider the effect of using the hierarchical AllReduce algorithm
for such cases.
Let us assume that the AllReduce happens on $P$ nodes
and $4 \times P$ processes. We consider the case where there are
two layers in the communication hierarchy: NVLink2 and
InfiniBand EDR. In this case, the total communication time
for the AllReduce is
$$T_{\mathrm{hier}}(n) = 2s\alpha(2 + P) + 2n\left(B_{\mathrm{nvlink}} \frac{3}{4} + B_{\mathrm{IB}/2} \frac{P - 1}{4P}\right)$$
where $B_{\mathrm{nvlink}}$ is the inverse bandwidth of NVLink2, and $B_{\mathrm{IB}/2}$
is the inverse bandwidth of half the InfiniBand EDR because
four GPUs share two InfiniBand EDR HCAs.
The theoretical bandwidth of NVLink2 is almost four times
that of InfiniBand EDR. There are six possible ring orderings
for the GPUs, and two patterns will share one link, so
intra-node communication is $4 \times 6/2 = 12$ times faster than
inter-node communication. Therefore, we can rewrite the equation as
$$T_{\mathrm{hier}}(n) = 2s\alpha(2 + P) + 2n\left(\frac{B_{\mathrm{IB}}}{12} \cdot \frac{3}{4} + 2 B_{\mathrm{IB}} \frac{P - 1}{4P}\right)$$
$$= 2s\alpha(2 + P) + 2nB_{\mathrm{IB}}\left(\frac{9}{16} - \frac{1}{2P}\right).$$
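A flat AllReduce without any hierarchy does not require all
GPUs to use InfiniBand, so the bandwidth improves to
$$T_{\mathrm{ring}}(n) = 2s\alpha(4P - 1) + 2nB_{\mathrm{IB} \times 2} \frac{4P - 1}{4P}$$
$$= 2s\alpha(4P - 1) + 2nB_{\mathrm{IB}}\left(\frac{1}{2} - \frac{1}{8P}\right)$$
where $B_{\mathrm{IB} \times 2} = B_{\mathrm{IB}}/2$ is the inverse bandwidth of the two
InfiniBand EDR HCAs combined.
Therefore, it can be seen that the hierarchical AllReduce
improves the $\alpha$ (latency) term, but for the $B$ (inverse bandwidth)
term the Ring-AllReduce is smaller for $P > 6$, so there is a trade-off.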
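The crossover at $P > 6$ can be checked with exact arithmetic. The sketch below assumes that intra-node communication is 12 times faster than a single InfiniBand EDR link and that the flat ring uses both HCAs (inverse bandwidth $B_{IB}/2$), which is the reading that reproduces the stated crossover; the function names are hypothetical.

```python
from fractions import Fraction


def hier_bandwidth_coeff(P):
    """Coefficient of 2*n*B_IB for the 4 x P hierarchy:
    (1/12)*(3/4) + 2*(P-1)/(4P) = 9/16 - 1/(2P)."""
    return Fraction(1, 12) * Fraction(3, 4) + 2 * Fraction(P - 1, 4 * P)


def flat_bandwidth_coeff(P):
    """Coefficient of 2*n*B_IB for the flat ring over 4P GPUs, assuming both
    HCAs are used: (1/2)*(4P-1)/(4P) = 1/2 - 1/(8P)."""
    return Fraction(1, 2) * Fraction(4 * P - 1, 4 * P)


def latency_factor(P):
    """Multipliers of 2*s*alpha for the hierarchical and flat cases."""
    return {"hier": 2 + P, "flat": 4 * P - 1}


# Node counts for which the flat ring has the smaller bandwidth term.
flat_wins = [P for P in range(1, 17) if flat_bandwidth_coeff(P) < hier_bandwidth_coeff(P)]
```

At $P = 6$ both coefficients equal 23/48, and for larger $P$ the flat ring has the smaller bandwidth coefficient while its latency factor $4P - 1$ grows about four times faster than $2 + P$.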
To evaluate the performance of our hierarchical Ring-
AllReduce algorithm for deep learning workloads, we perform
two experiments; a microbenchmark and actual training of
image classification, on the ABCI supercomputer at AIST.
Fig. 2 shows a schematic of a node on ABCI. Each rack
has 34 nodes, where the intra-rack network is full-bisection,
while the inter-rack network has 1/3 of the full-bisection bandwidth.
A. Implementation
Our implementation of the hierarchical Ring-AllReduce
uses Reduce-Scatter, AllGather, and AllReduce (for innermost
layer) collectives of NCCL, which is optimized for NVIDIA
GPUs, NVIDIA NVLink, and InfiniBand. The version of
compilers and libraries we used are shown in Table I.
B. Microbenchmark result
We use a message of 128 MiB of single-precision floating-point
numbers per process, which is similar to the message size in
ResNet-50 training, and we vary the number of processes from
4 to 2048 as well as the hierarchical decomposition. We assume
that the deep learning framework allocates input and output
arrays separately. Furthermore, we allow the input array to
be overwritten to store intermediate results. If the input array
needs to be preserved, the communication at each layer must
copy the output array back to the input array.
Table I. Versions of the compilers and libraries used.
Software    Version
GCC         4.8.5
CUDA        9.2
cuDNN       7.4
Open MPI    2.1
NCCL        2.3.5-2
Chainer     v6.0.0b3
1) Determining the Parameters of the Model: In order to
estimate the communication time, we determine the parameters:
the latency $\alpha$, the inverse bandwidth $\beta$, and the granularity
of pipelining $s$. $\alpha$ and $\beta$ are measured by osu_latency [17]
with GPUDirect RDMA-enabled Open MPI. $s$ is determined
from the implementation of NCCL. As a result, we found that
the values are $\alpha = 4.80$ µs, $B_{\mathrm{IB}}^{-1} = 9301.19$ MB/s, and $s = 2$.
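As a rough illustration of how these parameters enter the model, the sketch below evaluates the two-layer and flat-ring estimates for a 128 MiB message. It is not the authors' evaluation script: it assumes 1 MB = 10^6 bytes, and the closed forms with bandwidth coefficients 9/16 - 1/(2P) and 1/2 - 1/(8P) for $P$ nodes of four GPUs each; the constant and function names are hypothetical.

```python
ALPHA = 4.80e-6           # latency in seconds (4.80 microseconds)
B_IB = 1.0 / 9301.19e6    # inverse bandwidth in s/byte, assuming 1 MB = 10**6 bytes
S = 2                     # pipelining granularity
N = 128 * 2 ** 20         # message size in bytes (128 MiB)


def t_hier(P):
    """Two-layer (NVLink2 x InfiniBand) estimate for P nodes with 4 GPUs each:
    2*s*alpha*(2 + P) + 2*n*B_IB*(9/16 - 1/(2P))."""
    return 2 * S * ALPHA * (2 + P) + 2 * N * B_IB * (9 / 16 - 1 / (2 * P))


def t_flat(P):
    """Flat Ring-AllReduce estimate over 4P GPUs:
    2*s*alpha*(4P - 1) + 2*n*B_IB*(1/2 - 1/(8P))."""
    return 2 * S * ALPHA * (4 * P - 1) + 2 * N * B_IB * (1 / 2 - 1 / (8 * P))


# Estimated times in seconds for 32 to 2048 GPUs.
estimates = {4 * P: (t_hier(P), t_flat(P)) for P in (8, 32, 128, 512)}
```

Under these assumptions the latency term $2s\alpha(4P-1)$ of the flat ring exceeds its bandwidth term at 2048 GPUs, whereas the two-layer estimate remains bandwidth-dominated.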
2) Scalability with Respect to Processes: The communica-
tion time of the Ring-AllReduce with and without hierarchy is
shown in Fig. 3. The results show the best case for the different
number of layers in the hierarchy. To keep the communication
topology symmetric, we use only 32 out of the 34 nodes per rack.
a) Evaluation of the Performance Model: Fig. 3 shows
that the predictions of the performance model match the measured
communication time, except for the flat Ring-AllReduce results
for 8 to 256 processes. In this case, the measured communication
time is longer than the estimated time. This suggests that there is
room for improvement in NCCL’s ability to detect the network
topology, in particular on ABCI, which has two InfiniBand
HCAs on each node. Setting the environment variables correctly
could mitigate this problem.
b) Performance of the Flat Ring-AllReduce: The
four-process case is a special case since it can use the high
bandwidth of NVLink2, without depending on
the bandwidth of InfiniBand. Furthermore, when the number
of processes is larger than 256, the communication time is
proportional to the number of processes. The proportional term
of the performance model (2) is the latency term; therefore, this
slowdown is caused by the latency of communicating among many
processes.
c) Performance of the Hierarchical Ring-AllReduce:
From Fig. 3, we see that the flat Ring-AllReduce case is
bandwidth-bound for processes 8 to 128, so if both Infini-
Band HCAs on the node are utilized, its performance could
potentially exceed that of the hierarchical Ring-AllReduce.
However, beyond 256 processes the latency causes the per-
formance of the flat Ring-AllReduce to degrade. On the other
hand, the hierarchical communication does not suffer from
latency issues as can be seen from both the performance model
and the actual measurements. Therefore, we show that the
hierarchical communication is an appropriate large-message
AllReduce algorithm for many GPU processes.
3) Intra-rack Performance of the Hierarchical Ring-
AllReduce: The performance of different hierarchical topolo-
gies on up to 128 processes (a rack) is shown in Fig. 4. Table
II shows the effective bandwidth for the corresponding runs.
The fastest hierarchical topology was 4×8×4, which was
1.39 times faster than the flat Ring-AllReduce.
a) The optimal number of first layer processes: Cases
in which the number of first layer processes is four are faster
than the others due to the high-bandwidth and low-latency
performance of NVLink2. When the number of first layer
processes is two, the communication is executed over NVLink2
only, but it is slower than the cases with four. The
reason is that NVLink2 has a fully-connected topology, so if the
number is two, we cannot exploit all links of the topology.
When the number of first layer processes is 8 or 128, the
speed of the constructed ring is limited by InfiniBand. In
these cases, two InfiniBand HCAs are available but due to the
topology detection issue of NCCL, the performance is likely
b) The Performance of Deeper Hierarchy: In Fig. 4,
the second bar from the right is Rabenseifner’s Algorithm,
which is almost the slowest case. There are two probable
causes:
1) The deeper hierarchy.
This adds the latency of multiple kernel launches, and also the
creation of communication threads per NCCL communicator.
2) The smaller message sizes in the deeper layers.
The message sizes of the deeper layers are reduced by
the shallower layers.
The overhead from the first cause is likely to be small, because
the communication time of the second layer is almost equal
when the depth of the hierarchy differs (e.g. 4×4×8 and 4×
Fig. 5 shows the attainable bandwidth for pure NCCL com-
munication when the number of processes and message size
are varied. To isolate potential network problems, we used only
one GPU per node and only one ring for the communication
(NCCL_MAX_NRINGS=1). First of all, as the message size
of the AllReduce decreases, the performance decreases for all
numbers of processes. This indicates that NCCL is currently not
optimized for workloads like Rabenseifner’s Algorithm,
where the message size decreases exponentially at each stage.
This is why Rabenseifner’s Algorithm has the worst
performance, though it is algorithmically superior.
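To see how quickly the messages shrink, the following sketch lists the bytes exchanged at each Reduce-Scatter stage of Rabenseifner’s Algorithm. It assumes a power-of-two process count and a message size divisible by it; the function name is hypothetical.

```python
def rabenseifner_message_sizes(n_bytes, p):
    """Bytes each process sends at each Reduce-Scatter stage for p = 2**k.
    The message halves at every stage: n/2, n/4, ..., n/p."""
    assert p > 0 and p & (p - 1) == 0, "sketch assumes a power-of-two process count"
    sizes = []
    size = n_bytes
    d = 1
    while d < p:
        size //= 2
        sizes.append(size)
        d *= 2
    return sizes


MiB = 2 ** 20
sizes_128 = rabenseifner_message_sizes(128 * MiB, 128)  # seven stages for 128 processes
```

For a 128 MiB message on 128 processes, the last stage exchanges only 1 MiB, so the deeper stages operate on much smaller messages than the first.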
4) Inter-rack Performance of the Hierarchical Ring-
AllReduce: The results for the experiments on more than one
rack (over 128 processes) are shown in Fig. 6. ABCI has
full-bisection bandwidth within the rack, but 1/3 of the
full-bisection bandwidth across racks.
Table II. Effective bus bandwidth of each layer (NP = 128).
Topology          Bus Bandwidth [MB/s]
                  1st dim     2nd dim     3rd dim     4th dim
128               11547.97
4×32              83446.91    4348.85
4×4×8             85237.64    4229.91     4121.85
4×8×4             85956.47    4648.48     3766.94
4×16×2            81777.71    4140.13     2375.11
4×4×4×2           81196.07    4188.34     3417.55     3285.54
4×8×2×2           83191.72    4640.75     2705.13     3063.70
8×16              11343.85    4463.43
8×4×4             11239.52    4551.33     4011.04
8×4×2×2           11362.74    4366.68     2978.18     3195.68
Rabenseifner      30279.93    27978.80    3314.66     2998.27
4-Rabenseifner    92155.60    3330.65     3060.10     3014.57
Fig. 3. Communication time of the Ring-AllReduce with and without hierarchy, and its estimation time by the model.
Fig. 4. Communication time of the hierarchical Ring-AllReduce. The total
number of processes is 128. The different colors correspond to the Reduce-
Scatter, AllReduce, and AllGather. Each layer is split by the border.
Fig. 5. The attainable bandwidth for pure NCCL communication when the
number of processes and message size are varied. We used one GPU per node
and one ring for communication to isolate potential network problems.
a) Evaluation of 256 processes: For 256 processes, the
best hierarchical topology was 4×8×4×2, which is approximately
1.42 times faster than the flat Ring-AllReduce.
The first three layers, 4×8×4, are intra-rack and agree with
the best topology in the intra-rack AllReduce experiments.
The next best cases are 4×8×2×4 and 4×8×8, both of which
are topologies that do not clearly separate intra-rack and
inter-rack communication. A schematic of the communication
topology for the 4×8×8 case is shown in Fig. 7. The
communication in the third layer has both intra-rack and
inter-rack communications, which limits the number of
processes that communicate inter-rack. Therefore, it should
put less stress on the 1/3 full-bisection inter-rack bandwidth.
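The share of inter-rack traffic in a layer can be checked by counting ring links. The sketch below assumes 128 processes per rack, ranks numbered contiguously with the first dimension varying fastest, and rings that visit their members in rank order. These placement details and the helper names are assumptions made for illustration, not a description of the actual job layout on ABCI.

```python
from math import prod

RACK_SIZE = 128  # processes per rack (assumption: one rack holds 4x8x4 GPUs)


def ring_members(topology, layer, base_rank=0):
    """Ranks forming one ring in `layer` (0-based), first dimension fastest."""
    stride = prod(topology[:layer])
    return [base_rank + i * stride for i in range(topology[layer])]


def inter_rack_fraction(members, rack_size=RACK_SIZE):
    """Fraction of ring links (including the wrap-around) that cross racks."""
    n = len(members)
    crossing = sum(
        members[i] // rack_size != members[(i + 1) % n] // rack_size
        for i in range(n)
    )
    return crossing / n


topology = [4, 8, 8]  # 256 processes
second = inter_rack_fraction(ring_members(topology, 1))
third = inter_rack_fraction(ring_members(topology, 2))
fourth_of_4842 = inter_rack_fraction(ring_members([4, 8, 4, 2], 3))
print(second, third, fourth_of_4842)
```

With this placement the second layer of 4×8×8 stays inside one rack, a quarter of the third-layer links cross racks, and every link of the fourth layer of 4×8×4×2 crosses racks.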
(NP = 256)
Topology          Bus Bandwidth [MB/s]
                  1st dim     2nd dim     3rd dim     4th dim
4×8×8             84179.02    4680.55     2115.40
4×8×2×4           84346.88    4674.03     2922.87     2682.33
4×8×4×2           82676.35    4680.52     3448.45     2441.33
Fig. 6. Communication time of the hierarchical Ring-AllReduce. The total number of processes is 256, 512, 1024, and 2048. The different colors correspond
to the Reduce-Scatter, AllReduce, and AllGather. Each layer is split by the border.
Table III shows the effective bandwidth in these inter-node
experiments. We expected the fourth-layer bandwidth of
4×8×4×2 to be lower than the third-layer bandwidth of 4×8×8
because of its smaller message size and inter-rack
communication. In the measurements, however, the third layer
of 4×8×8 (2115.40 MB/s) is slower than the fourth layer of
4×8×4×2 (2441.33 MB/s). The interconnect of ABCI is implemented
via a tree with multi-layer switches, so the inter-rack
communication contends with the intra-rack communication. This
is known as the Hot-spot problem [18]. Therefore, care must
be taken when partitioning the layers so that each layer has a
mixture of inter-rack and intra-rack communications.
b) Evaluation of 512 processes: When the total number
of processes is 512, the fastest communication time is achieved
when a hierarchical topology of 4×8×4×4is used. The
communication time is 1.55 times faster than the flat Ring-
AllReduce in this case. This is an extension of the fastest
hierarchical topology for 256 processes, and also agrees with
the fastest case for 128 processes.
c) Evaluation of 1024 processes: For 1024 processes,
4×8×8×4was the best hierarchical topology, and was 1.89
times faster than the flat Ring-AllReduce. This topology is an
extension of the third best topology for 256 processes, which
does not separate inter-rack and intra-rack communications.
The second best topology for 1024 processes was 4×8×4×
8, and was 1.88 times faster than the flat Ring-AllReduce.
Similar to the 512 process case, this is an extension of the
topology where the intra-rack and inter-rack communications
are separated.
d) Evaluation of 2048 processes: For 2048 processes,
4×8×8×8was the best topology, and was 3.05 times faster
than the flat Ring-AllReduce. This is an extension of the best
topology for 1024 processes.
[Figure: Rack #1 and Rack #2 with Dim 2 and Dim 3 communication; 1/4 of the Dim 3 transfer is via the inter-rack IB link.]
Fig. 7. The relation between the second layer and the third layer for a 4×8×8
topology. The second layer is only intra-rack, but the third layer has 1/4
inter-rack and 3/4 intra-rack communication.
5) How to Determine the Best Hierarchical Topology: In
this section, we show how to determine the best hierarchical
topology from the experiments on a small number of pro-
cesses by identifying the universal behavior of the hierarchical
AllReduce. Among the different hierarchical topologies that
we tested, the top-5 are shown in Fig. 8.
We found that the optimal topology for a larger number of
processes is likely to be a simple extension of a smaller scale
topology either by:
1) Adding another layer of hierarchy.
2) Increasing the number of final layer processes.
For 128 processes the best topology was 4×8×4, and for
256 processes the best topology was 4×8×4×2. This is an
example of rule 1. The best topology for 512 processes was
4×8×4×4, which extends the final layer of the 256-process
optimum and is therefore an example of rule 2.
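The two rules can be turned into a small candidate generator. The sketch below assumes that the process count doubles at each step and that a newly added layer has size 2, which matches the 128, 256, and 512 process cases quoted above. The function name `extend_topology` is hypothetical.

```python
def extend_topology(topology, factor=2):
    """Candidate topologies for `factor` times as many processes.

    Rule 1: add another layer of hierarchy (of size `factor`).
    Rule 2: increase the number of processes in the final layer.
    """
    rule1 = tuple(topology) + (factor,)
    rule2 = tuple(topology[:-1]) + (topology[-1] * factor,)
    return {"rule 1": rule1, "rule 2": rule2}


best_128 = (4, 8, 4)
candidates_256 = extend_topology(best_128)
candidates_512 = extend_topology((4, 8, 4, 2))

print(candidates_256)  # rule 1 yields (4, 8, 4, 2), the measured best for 256
print(candidates_512)  # rule 2 yields (4, 8, 4, 4), the measured best for 512
```

Only these candidates would need to be measured at the larger scale, rather than every factorization of the process count.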
As Fig. 5 shows, the effective bandwidth decreases for
deeper hierarchies because the message size becomes very small
at the end of the hierarchical communication. This suggests
that shallower hierarchies with longer rings are preferable as
long as each layer is using the same interconnect (e.g. NVLink,
InfiniBand, etc.). Therefore, which rule to use depends on the
balance between the decrease in effective bandwidth for deeper
hierarchies, and the increase in latency of having longer rings
per layer. This behavior depends on the message size and the
number of processes.
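The latency side of this trade-off can be illustrated with a textbook latency-bandwidth (α-β) cost model. The sketch below is not the performance model developed in this paper. It assumes that a ring Reduce-Scatter or AllGather over p processes costs (p−1)α + ((p−1)/p)·M·β, that each layer performs one such pass in each direction on the message left over by the preceding layers, and that all layers share the same made-up α and β values.

```python
def hierarchical_ring_time(message_bytes, topology, alpha, beta):
    """Estimated AllReduce time in seconds under a simple alpha-beta model.

    alpha: per-step latency [s]; beta: transfer time per byte [s/byte].
    Each layer contributes a Reduce-Scatter and an AllGather (the last
    layer's AllReduce is modeled as the same pair), both on the message
    left over by the preceding layers.
    """
    total = 0.0
    size = float(message_bytes)
    for procs in topology:
        one_pass = (procs - 1) * alpha + (procs - 1) / procs * size * beta
        total += 2.0 * one_pass
        size /= procs
    return total


def latency_steps(topology):
    """Number of latency-bound ring steps, 2 * sum(p_i - 1)."""
    return 2 * sum(p - 1 for p in topology)


ALPHA = 20e-6       # 20 us per step (illustrative assumption)
BETA = 1.0 / 10e9   # 10 GB/s per link (illustrative assumption)
MESSAGE = 100e6     # 100 MB message (illustrative assumption)

for topo in [(2048,), (4, 8, 8, 8), (2,) * 11]:
    t = hierarchical_ring_time(MESSAGE, topo, ALPHA, BETA)
    print(topo, latency_steps(topo), f"{t * 1e3:.2f} ms")
```

In this simple model the bandwidth term sums to the same value for every factorization, so only the latency term distinguishes the topologies. The drop in effective bandwidth for small messages seen in Fig. 5 is not captured; representing it would require a message-size-dependent β.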
Which layer needs to be made longer depends on the case,
and we see this in the 1024 case where the best topology is 4×
8×8×4, where the third layer is extended (and not the fourth).
There is also the effect of InfiniBand congestion, which the
current performance model cannot predict accurately.
C. Image Classification result
To evaluate our hierarchical Ring-AllReduce in the actual
training, we profile one iteration of ResNet-50 training with
the ImageNet-1K dataset. We execute 300 iterations of training
and show the median of the profile. To avoid the effect of the
straggler process, we add MPI_Barrier before the AllReduce.
The topologies that we use are 4×8×4, 4×8×4×2, and
4×8×4×4 for NP = 128, 256, and 512, respectively.
Fig. 8. The top-5 hierarchical topologies for each number of processes.
Table IV shows that using the hierarchical Ring-AllReduce
improves the parallel efficiency, computed as the fraction of
the total iteration time spent outside the communication-related
time range (i.e. Pack, AllReduce and Unpack). Comparing the 256
and 512 process cases, the AllReduce time of the flat
Ring-AllReduce increases by about 40% (20.8 ms to 29.3 ms), but
that of the hierarchical Ring-AllReduce increases by only
14% (15.9 ms to 18.1 ms). Furthermore, the degradation of the
parallel efficiency is also smaller (0.843 to 0.827, compared
with 0.81 to 0.755 for the flat Ring-AllReduce).
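The efficiency row of Table IV can be reproduced from the other rows. The sketch below assumes that the efficiency is (Total − Pack − AllReduce − Unpack) / Total, which matches the tabulated values to three digits. The dictionary layout and function names are illustrative.

```python
# Per-iteration times in ms, copied from Table IV.
PROFILE = {
    (128, "Ring"):      {"pack": 1.2, "allreduce": 18.4, "unpack": 0.4, "total": 114.9},
    (128, "Hier-Ring"): {"pack": 1.2, "allreduce": 13.7, "unpack": 0.4, "total": 110.2},
    (256, "Ring"):      {"pack": 1.2, "allreduce": 20.8, "unpack": 0.4, "total": 117.6},
    (256, "Hier-Ring"): {"pack": 1.2, "allreduce": 15.9, "unpack": 0.4, "total": 111.7},
    (512, "Ring"):      {"pack": 1.2, "allreduce": 29.3, "unpack": 0.4, "total": 125.9},
    (512, "Hier-Ring"): {"pack": 1.2, "allreduce": 18.1, "unpack": 0.4, "total": 114.2},
}


def parallel_efficiency(row):
    """Share of the iteration time not spent in Pack, AllReduce, or Unpack."""
    comm = row["pack"] + row["allreduce"] + row["unpack"]
    return (row["total"] - comm) / row["total"]


def allreduce_slowdown(kind, small=256, large=512):
    """Relative increase of the AllReduce time from `small` to `large` processes."""
    return PROFILE[(large, kind)]["allreduce"] / PROFILE[(small, kind)]["allreduce"] - 1.0


for key, row in PROFILE.items():
    print(key, round(parallel_efficiency(row), 3))
print(round(allreduce_slowdown("Ring"), 3), round(allreduce_slowdown("Hier-Ring"), 3))
```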
There have been many attempts to reduce the latency of
AllReduce by using a hierarchical approach. Horovod [10]
uses NCCL and MPI in a hierarchical manner to improve the
scalability of AllReduce for deep learning workloads. It has
been used to scale deep learning applications to over 25,000
GPUs on Summit [2]. Summit has six GPUs per node, so they
first use intra-node NCCL collectives, and use four GPUs for
the inter-node MPI collectives. This number four matches the
number of InfiniBand HCAs. They use a highly optimized MPI
implementation provided by the vendor of Summit, which uses
a tree AllReduce to minimize the latency.
                  NP = 128            NP = 256            NP = 512
                  Ring    Hier-Ring   Ring    Hier-Ring   Ring    Hier-Ring
Forward           28.6    29.1        28.9    28.9        28.9    28.9
Backward          58.5    58.5        58.6    58.4        58.5    58.5
Pack              1.2     1.2         1.2     1.2         1.2     1.2
AllReduce         18.4    13.7        20.8    15.9        29.3    18.1
Unpack            0.4     0.4         0.4     0.4         0.4     0.4
Update            7.8     7.3         7.7     6.9         7.6     7.1
Total [ms]        114.9   110.2       117.6   111.7       125.9   114.2
Efficiency        0.826   0.861       0.81    0.843       0.755   0.827
With respect to related work on the ABCI supercomputer,
there has been similar work to optimize deep learning
communication [7]. Mikami et al. adaptively changed the
mini-batch size during training, and were able to train ImageNet
with very large mini-batch sizes without any degradation of
the validation accuracy. They used a two-level hierarchical
AllReduce to improve the communication time. However, they
did not investigate whether two levels were optimal.
There have also been efforts to train ImageNet at super-
computer scale using the third generation TPUs [8]. Ying et
al. used techniques such as Distributed Batch Normalization
and Input Pipeline Optimization to train ImageNet in 2.2
minutes without loss of accuracy. They also used a 2-D
Ring-AllReduce algorithm to effectively use the 2-D torus
interconnect between the TPUs. Their algorithm is optimal
for the particular interconnect of TPUs, but is not necessarily
optimal on other fat-tree-based supercomputers like Summit
and ABCI.
A. Limitations
NCCL was not originally designed for hierarchical
Ring-AllReduce communication, so we ended up launching
multiple NCCL collectives. This requires NCCL communication
of small messages and decreases the bandwidth.
There is room for improvement by modifying NCCL itself.
This would enable us to cache the registered memory for
InfiniBand communication between layers, share the
communication threads, and merge the CUDA kernels into one.
We have only run our experiments on the ABCI supercomputer.
Therefore, the effects of different fabrics and network
topologies are not obvious from our study. Future work will
require runs on multiple systems with different interconnects,
to investigate the effect of our hierarchical Ring-AllReduce
algorithm and the validity of our performance model.
B. Conclusion
In this work, we have implemented a hierarchical
Ring-AllReduce algorithm using NCCL, and evaluated it on the
ABCI supercomputer at AIST using various configurations of the
hierarchical topology. We observed a significant increase in
performance when using the optimal number of layers with
the optimal number of processes per layer. We performed an
exhaustive search on the hierarchical topology to find this
optimal configuration. We found that the optimal topologies on
a large number of processes contain the optimal topologies on
a small number of processes as a subgraph. This allowed us
to construct a performance model that predicts the optimal
topology for a large number of processes without actually
running on a large number of processes.
C. Future work
This work has the potential to be extended to workloads
other than deep learning. We would also like to provide
implementations that do not depend on NCCL. Also, more
explanation of the empirical optimal topologies can be provided.
We are grateful to Akira Naruse (NVIDIA) and Hitoshi Sato
(AIST) for useful discussions. Author’s internship project at
Preferred Networks, Inc. helped with the in-depth understand-
ing of collective communications.
[1] J. S. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl, “Algorithms for
Hyper-Parameter Optimization,” in Advances in Neural Information
Processing Systems 24, J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett,
F. Pereira, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2011,
pp. 2546–2554.
[2] T. Kurth, S. Treichler, J. Romero, M. Mudigonda, N. Luehr, E. Phillips,
A. Mahesh, M. Matheson, J. Deslippe, M. Fatica, Prabhat, and M. Hous-
ton, “Exascale Deep Learning for Climate Analytics,” in Proceedings
of the International Conference for High Performance Computing,
Networking, Storage, and Analysis, ser. SC ’18. Piscataway, NJ, USA:
IEEE Press, 2018, pp. 51:1–51:12.
[3] R. M. Patton, J. T. Johnston, S. R. Young, C. D. Schuman, D. D. March,
T. E. Potok, D. C. Rose, S.-H. Lim, T. P. Karnowski, M. A. Ziatdinov,
and S. V. Kalinin, “167-PFlops Deep Learning for Electron Microscopy:
From Learning Physics to Atomic Manipulation,” in Proceedings of the
International Conference for High Performance Computing, Networking,
Storage, and Analysis, ser. SC ’18. Piscataway, NJ, USA: IEEE Press,
2018, pp. 50:1–50:11.
[4] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola,
A. Tulloch, Y. Jia, and K. He, “Accurate, Large Minibatch SGD: Training
ImageNet in 1 Hour,” arXiv:1706.02677 [cs], Jun. 2017.
[5] T. Akiba, S. Suzuki, and K. Fukuda, “Extremely Large Minibatch SGD:
Training ResNet-50 on ImageNet in 15 Minutes,” arXiv:1711.04325
[cs], Nov. 2017.
[6] X. Jia, S. Song, W. He, Y. Wang, H. Rong, F. Zhou, L. Xie, Z. Guo,
Y. Yang, L. Yu, T. Chen, G. Hu, S. Shi, and X. Chu, “Highly
Scalable Deep Learning Training System with Mixed-Precision: Training
ImageNet in Four Minutes,” arXiv:1807.11205 [cs, stat], Jul. 2018.
[7] H. Mikami, H. Suganuma, P. U-chupala, Y. Tanaka, and Y. Kageyama,
“ImageNet/ResNet-50 Training in 224 Seconds,” Nov. 2018.
[8] C. Ying, S. Kumar, D. Chen, T. Wang, and Y. Cheng, “Image Clas-
sification at Supercomputer Scale,” arXiv:1811.06992 [cs, stat], Nov.
[9] “NVIDIA Collective Communications Library (NCCL),” https://devel-, May 2017.
[10] A. Sergeev and M. Del Balso, “Horovod: Fast and easy distributed deep
learning in TensorFlow,” arXiv:1802.05799 [cs, stat], Feb. 2018.
[11] T. Ben-Nun and T. Hoefler, “Demystifying Parallel and Distributed Deep
Learning: An In-Depth Concurrency Analysis,” arXiv:1802.09941 [cs],
Feb. 2018.
[12] Y. You, Z. Zhang, C.-J. Hsieh, J. Demmel, and K. Keutzer, “ImageNet
Training in Minutes,” in Proceedings of the 47th International Confer-
ence on Parallel Processing, ser. ICPP 2018. New York, NY, USA:
ACM, 2018, pp. 1:1–1:10.
[13] R. Thakur, R. Rabenseifner, and W. Gropp, “Optimization of Collective
Communication Operations in MPICH,” International Journal of High
Performance Computing Applications, vol. 19, no. 1, pp. 49–66, Feb.
[14] P. Patarasuk and X. Yuan, “Bandwidth Optimal All-reduce Algorithms
for Clusters of Workstations,” Journal of Parallel and Distributed
Computing, vol. 69, no. 2, pp. 117–124, Feb. 2009.
[15] “MPI Solutions for GPUs,”
gpus, May 2013.
[16] A. A. Awan, C.-H. Chu, H. Subramoni, and D. K. Panda, “Optimized
Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand
Clusters: MPI or NCCL?” in Proceedings of the 25th European MPI
Users’ Group Meeting, ser. EuroMPI’18. New York, NY, USA: ACM,
2018, pp. 2:1–2:9.
[17] “MVAPICH :: Benchmarks,” http://mvapich.cse.ohio-
[18] T. Hoefler, T. Schneider, and A. Lumsdaine, “Multistage switches are
not crossbars: Effects of static routing in high-performance networks,” in
2008 IEEE International Conference on Cluster Computing, Sep. 2008,
pp. 116–125.