TopoOpt: Optimizing the Network Topology for Distributed DNN Training

Weiyang Wang1, Moein Khazraee1, Zhizhen Zhong1, Zhihao Jia2,3, Dheevatsa Mudigere3, Ying Zhang3, Anthony Kewitsch4, Manya Ghobadi1
1Massachusetts Institute of Technology  2CMU  3Facebook  4Telescent
Abstract

We explore a novel approach for building DNN training clusters using commodity optical devices. Our proposal, called TopoOpt, co-optimizes the distributed training process across three dimensions: computation, communication, and network topology. TopoOpt uses a novel alternating optimization technique and a group theory-inspired algorithm to find the best network topology and routing plan, together with the parallelization strategy, for distributed DNN training. To motivate our proposal, we measure the communication patterns of distributed DNN workloads at a large online service provider. Experiments with a 12-node prototype demonstrate the feasibility of TopoOpt. Simulations on real distributed training models show that, compared to similar-cost Fat-tree interconnects, TopoOpt reduces DNN training time by up to 3×.
1 Introduction

Our society is rapidly becoming reliant on deep neural networks (DNNs). New datasets and models are invented frequently, increasing the memory and computational requirements for training. This explosive growth has created an urgent demand for efficient distributed DNN training systems. Today's DNN training systems are built on top of traditional datacenter clusters, with electrical packet switches arranged in a multi-tier Fat-tree topology [45]. Fat-tree topologies are traffic-oblivious fabrics, allowing uniform bandwidth and latency between server pairs. They are ideal when the workload is unpredictable and consists mostly of short transfers, two inherent properties of legacy datacenter workloads [47,48,51,63,64].

However, Fat-tree networks are becoming a bottleneck for distributed DNN training workloads [55,65,71,76,90,93,116]. Prior work has focused on addressing this challenge by reducing the size of parameters to transmit through the network [46,55,56,65,69,73-75,82,93,105,119] and developing techniques to discover faster parallelization strategies while considering the available network bandwidth [44,46,76,93,109]. These proposals co-optimize computation and communication as two important dimensions of distributed DNN training, but they do not consider the physical-layer topology as an optimization dimension.
Recently, SiP-ML [79] demonstrated the benefits of 8 Tbps silicon photonics-based networks for distributed training workloads. While encouraging, the silicon photonics technology is not yet commercially available, which begs the question: "Can we build an optimized network topology for DNN training clusters using today's commodity hardware?"

To answer this question, we analyze DNN training jobs from production clusters of a large-scale service provider with billions of users, which we call BIGNET for anonymity. We demonstrate that training workloads do not satisfy the standard assumptions about datacenter traffic that underlie the design of Fat-tree interconnects. Specifically, we show that (i) the communication overhead of large DNN training jobs increases dramatically as we increase the number of workers; and (ii) the traffic heatmap of DNN training jobs highly depends on their parallelization strategies and AllReduce collectives.
Motivated by these observations, we propose TopoOpt, a DNN training system that co-optimizes network topology and parallelization strategy. In this paper, we grapple with the algorithmic challenges of finding the best topology, such as how to navigate the large search space across the computation, communication, and topology dimensions, and also with various operational challenges, such as which optical switching technologies match well with the traffic patterns of various DNN models.

In particular, we cast the topology and parallelization strategy co-optimization problem as an off-line alternating optimization framework. Our optimization technique alternates between optimizing the parallelization strategy and optimizing the network topology. It searches over the parallelization strategy space assuming a fixed topology, and feeds the traffic demand to a TopologyFinder algorithm. The updated topology is then fed back into the parallelization strategy search algorithm. This alternating process repeats until the system converges to an optimized parallelization strategy and topology.

We demonstrate that finding an optimized network topology for DNNs with hybrid data & model parallelism is challenging because the ideal network topology needs to meet two goals simultaneously: (i) allocate most of the available bandwidth to AllReduce transfers; and (ii) ensure a small hop-count for Model Parallel transfers. To meet these goals, we propose a novel group theory-based technique, called TotientPerms. Our TotientPerms approach builds a series of AllReduce permutations that not only carry AllReduce transfers efficiently, but are also well-positioned to carry Model Parallel transfers and, hence, improve the overall training performance.
To demonstrate the feasibility of TopoOpt, we build a 12-server testbed with NVIDIA A100 GPUs [37] and 100 Gbps NICs. Our large-scale simulations with four representative DNN models (DLRM [20], CANDLE [3], BERT [114], VGG [107]) show that TopoOpt reduces the training iteration time by up to 3× compared to a similar-cost Fat-tree. Moreover, we demonstrate that TopoOpt is, on average, 3.4× cheaper than an ideal full bisection bandwidth
Fat-tree. Finally, we evaluate the impact of reconfiguration latency on performance and argue that today's reconfigurable optical switches are too slow for large-scale DNN workloads. TopoOpt is the first system with entirely commodity hardware that co-optimizes topology and parallelization strategy, and it is currently being evaluated for deployment at BIGNET.
2 Characterizing DNN Workloads

Data parallelism. Data parallelism is a popular parallelization strategy, whereby a batch of training samples is distributed across training accelerators. Each accelerator holds a replica of the DNN model and executes the forward and backpropagation steps locally. In data parallelism, all accelerators synchronize their model weights during each training iteration. This step is commonly referred to as AllReduce and can be performed using various techniques, such as broadcasting [121], parameter servers [81], ring-AllReduce [2,75,110], tree-reduce [101], or hierarchical ring-AllReduce [111,113].
Hybrid data & model parallelism. Prior work showed that pure data parallelism may be a suboptimal strategy for large training jobs because of the increasing cost of synchronizing model parameters across accelerators [20,72,76,92,94,106]. In BIGNET, we use a hybrid of data & model parallelism for training large DNNs, where different parts of a DNN and its dataset are processed on different accelerators in parallel. To keep each accelerator's utilization high, we use pipeline parallelism [93] together with model parallelism, where training samples across multiple iterations are processed in parallel with the partitioned model in a pipelined fashion. In this paper, we use model parallelism as a generic term that includes both model and pipeline parallelism.
Types of data dependencies in DNN training. Each training iteration includes two major types of data dependencies. Type (1) refers to activations and gradients computed during the forward and backpropagation steps; this data dependency is required for each input sample. Type (2) refers to synchronizing the model weights across accelerators through the AllReduce step once a batch of samples is processed. Depending on the parallelization strategy, these data dependencies may result in local memory accesses or cross-accelerator traffic. For instance, in a hybrid data & model parallelization strategy, both types (1) and (2) result in cross-accelerator traffic, depending on how the model is distributed across accelerators. Given that type (1) is related to model parallelism, we refer to the network traffic created by type (1) as MP transfers. Similarly, we refer to the network traffic created by type (2) as AllReduce transfers. Note that AllReduce transfers do not strictly mean data parallelism traffic, since model parallelism can also create AllReduce transfers across a subset of training nodes (§3.3).¹

¹ We only consider transfers related to training because our servers have dedicated NICs for storage and other non-training traffic.
Figure 1: Profiling distributed DNN training jobs in BIGNET. (a) CDF of the number of workers per job for object tracking, recommendation, natural language processing, and image recognition jobs; (b) CDF of training job duration in hours.
2.1 Production Measurements

We study traffic traces from hundreds of production DNN training jobs running on multiple clusters at BIGNET. We instrument each job to log its training duration, number of workers, and the total amount of data transferred across its workers during training.

Number of workers and job duration. Figure 1a shows the CDF of the number of workers for different models in our clusters. Most jobs are distributed across 32 to 700 workers, agreeing with recent announcements by other major players in the industry [43,92]. A worker can be a CPU or a GPU, depending on which cluster the job is executed on. Figure 1b demonstrates the CDF of total training job duration, showing that most jobs last over 10 hours. In fact, the top 10% of jobs take more than 96 hours (four days) to finish.
Network overhead. Figure 2 illustrates the percentage of network overhead as the number of GPUs is increased from 8 to 128 for six DNN jobs in production. We use RDMA to transmit packets between servers and measure the percentage of training time consumed by communication as network overhead. The figure shows that as the number of GPUs increases, the network quickly takes up a significant portion of the training iteration time. Similar observations have been made in prior work [56,71,79,93,105]. This is because our training servers are equipped with several NICs; hence, each server takes up several ports on its Top-of-Rack (ToR) switch, which limits the number of servers under the same rack. As a result, our network topology spans multiple switches and racks, which, in turn, increases the likelihood of network bottlenecks.
Traffic heatmaps. Figure 3 shows the heatmap of server-to-server traffic for four training jobs running in our production GPU clusters. The rows and columns indicate source and destination servers (each with eight GPUs), while the color encodes the amount of traffic between server pairs. The values on the colormap are not shown for confidentiality reasons. All heatmaps in the figure contain diagonal squares (in dark blue), indicating a ring communication pattern between servers. This is expected, since ring-AllReduce is the dominant AllReduce communication collective at BIGNET. However, the MP transfers (light blue and green squares) are model-dependent because they depend on the parallelization strategy and device placement of a training job. Moreover, we find that the traffic patterns of our training jobs do not change between iterations for the entire training duration, resulting in exactly the same heatmaps throughout the training time.
Figure 2: Network overhead measurements in BIGNET: network overhead (%) vs. number of GPUs (8 to 128) for six production DNNs.
Figure 3: Traffic heatmaps of production jobs in BIGNET: (a) Vision; (b) Image processing; (c) Object Tracking; (d) Speech Recognition. Axes show source and destination servers.
Figure 4: DLRM traffic heatmaps for three ring-AllReduce permutations (16×16 servers, color scale 0-4 GB).
Figure 5: Ring-AllReduce permutations (a), (b), and (c) over 16 servers.
Figure 6: CANDLE traffic heatmaps for three ring-AllReduce permutations (16×16 servers, color scale 0-2 GB).
Once a training job starts, the same parallelization strategy and synchronization method are used across training iterations, resulting in a periodic and predictable traffic pattern; similar observations have been made in prior work [120]. In particular, the traffic heatmap is identical across training iterations, although the traffic pattern does change within an iteration. Section 5 evaluates the impact of reconfiguring the physical topology within training iterations versus simply keeping the topology the same throughout the job.
2.2 Controlled Experiments

To better understand the impact of parallelization strategy on network traffic, we analyze the heatmaps of two large DNN models, the Deep Learning Recommendation Model (DLRM) and the CANcer Distributed Learning Environment (CANDLE), each distributed across 16 servers with one GPU each, interconnected with a full bisection bandwidth Fat-tree topology.
DLRM traffic pattern. DLRMs are a family of personalization and recommendation models based on embedding table lookups that capitalize on categorical user data [95]. DLRM models are typically large, up to tens of trillions of parameters, primarily due to their large embedding tables. Large embedding tables result in large AllReduce transfers. Moreover, the lookup time of embedding tables does not drop significantly as the batch size decreases. Hence, a common parallelization strategy for DLRMs is to use model parallelism to place each embedding table on one GPU and data parallelism for the rest of the model [90].
Consider a simplified DLRM architecture with four embedding tables E0, ..., E3, each with an embedding dimension of 512 columns and 10^7 rows (total size 20 GB), distributed across 16 servers S0, ..., S15. Following the parallelization strategy used in BIGNET, we place E0 on S0, E1 on S3, E2 on S8, and E3 on S13, and replicate the rest of the model on all servers. This parallelization strategy creates a mix of MP and AllReduce traffic, shown in Figure 4. Each heatmap in Figures 4a, 4b, and 4c corresponds to a different ring-AllReduce permutation, shown in Figures 5a, 5b, and 5c. Although all three heatmaps correspond to the exact same parallelization strategy and device placement, the blue diagonal lines appear at different parts of the heatmaps, depending on the order of servers in the ring-AllReduce permutation. In contrast, MP transfers (green vertical and horizontal lines in each heatmap) are dictated by the parallelization strategy and device placement, and therefore remain at exactly the same spot in all three heatmaps. Hence, AllReduce transfers are permutable but MP transfers are not. We leverage this unique property of DNN training jobs in our TopologyFinder algorithm (§3.3). Note that MP transfers in DLRM form one-to-many broadcast and many-to-one incast patterns to transfer the activations and gradients, because each server handling an embedding table needs to communicate with all other servers. The size of each AllReduce transfer in this example is 4 GB, whereas the size of each MP transfer is 32 MB.
CANDLE traffic pattern. CANDLE is a family of DNN architectures used to predict the response of cancerous tumors to drug treatments, based on molecular features of tumor cells and drug descriptors [3,6]. CANDLE models often contain several multilayer perceptrons (MLPs) for drug and cell features [3]. Consider a simplified CANDLE model with one drug MLP D0 and one cell MLP C0, each with a size of 4 GB. A common parallelization strategy is to distribute the model by replicating D0 on four servers (e.g., {S0, S1, S2, S3}) and C0 on another set of four servers (e.g., {S12, S13, S14, S15}). The rest of the model is replicated across all servers {S0, ..., S15}. This parallelization strategy creates mostly AllReduce traffic with a few MP transfers, as shown in Figure 6. Similar to the DLRM experiments, we permute the order of servers in the ring-AllReduce communication according to Figure 5 and plot three different heatmaps in Figures 6a, 6b, and 6c. Again, we confirm that the position of MP transfers remains fixed, while AllReduce transfers are permutable. We repeat the above experiment using tree-AllReduce and confirm that the same takeaways hold (see Appendix A).

Figure 7: Illustration of TopoOpt's interconnect.
3 TopoOpt System Design

This section describes TopoOpt, a novel system based on commodity optical devices that jointly optimizes DNN parallelization strategy and topology to accelerate today's training jobs.

3.1 TopoOpt Interconnect

A TopoOpt cluster is a shardable interconnect where each server has d interfaces connected to a core layer of d optical switches, as shown in Figure 7. The optical switches enable TopoOpt to partition the cluster into dedicated partitions for each training job. The size of each partition depends on the number of servers that the job requests. Given a DNN training job and a set of servers, TopoOpt first finds the best parallelization strategy and topology between the servers off-line. Then, it reconfigures the optical switches to realize the target topology for the job.

There is a wide range of optical switching technologies suitable for a TopoOpt cluster, including commercially available optical patch panels [41] and 3D-MEMS switches [16,39], as well as futuristic designs such as Mordia [99], MegaSwitch [53], and Sirius [50,57]. All of these technologies are valid choices for a TopoOpt cluster. Section 4 discusses the impact of optical switching technologies on scale and reconfiguration frequency.
Degree of each server. We denote the number of interfaces on each server (i.e., the degree of the server) by d. Typically, d is the same as the number of NICs installed on the server. In cases where the number of NICs is limited, the degree can be increased with NICs that support break-out cables or with the next generation of co-packaged optical NVLinks [10]. As an example, in our testbed, we use one 100 Gbps HP NIC [29] with 4×25 Gbps interfaces to build a system with degree four (d = 4). Section 5 evaluates the impact of d on performance.

Figure 8: TopoOpt searches for the best parallelization strategy jointly with routing and topology. The Comp.×Comm. plane (FlexFlow's MCMC parallelization strategy search) passes the parallelization strategy and device placement to the Comm.×Topo. plane (the TopologyFinder algorithm, §3.3), which finds AllReduce permutations and maximum weight matchings, uses coin-change and shortest-path routing, and returns a topology and routing.
Target workload. The target workload for TopoOpt is long-lasting DNN training jobs with hybrid data & model parallelism. Hence, we assume the set of servers assigned to each job remains the same throughout the lifetime of the job, and that GPUs are not shared across multiple jobs. Section 7 discusses potential approaches to enable dynamic scheduling and multi-tenancy in TopoOpt.

Storage and control plane traffic. BIGNET's training clusters consist of custom-designed servers, each with eight GPUs, eight dedicated NICs for training traffic (GPU NICs), and four additional NICs for storage and other traffic (CPU NICs). Other companies, such as Facebook and NVIDIA, have similar server architectures [9,90]. TopoOpt counts only GPU NICs toward the server degree and only partitions the network dedicated to training traffic. The CPU NICs are connected through a separate fabric that carries storage and other control plane traffic.
3.2 Co-optimizing Parallelization Strategy and Network Topology

The search space is too large. Finding the optimal parallelization strategy is an NP-complete problem, and adding network topology and routing makes the problem even harder [76]. One extreme solution is to jointly optimize the compute, communication, and topology dimensions using a cross-layer optimization formulation. Theoretically, this approach can find the optimal solution, but the search space quickly explodes, even for modest networks (e.g., six nodes [109]). This cross-layer optimization problem is computationally expensive, as it needs to evaluate a large number of operator placements and network configurations.

Naive approach. The other extreme is to optimize the network topology sequentially, after the parallelization strategy has been found. While this approach is able to reconfigure the network to better match its traffic demand, the eventual combination of topology and parallelization strategy may be sub-optimal in the global configuration space, because the parallelization strategy search algorithm needs to assume a generic static network topology and might miss opportunities to find a better strategy enabled by a different topology.
Our approach: alternating optimization. In TopoOpt, we seek to achieve the best of both worlds. To make the problem tractable, we divide the search space into two planes, Comp.×Comm. and Comm.×Topo., and use an alternating optimization technique to iteratively search in one plane while keeping the result of the other plane constant. Figure 8 illustrates TopoOpt's alternating optimization framework. The green box represents our Comp.×Comm. plane, which uses FlexFlow's MCMC (Markov Chain Monte Carlo) search algorithm [76] to find the best parallelization strategy for a given topology while considering the link bandwidths as communication cost. We feed the output of FlexFlow's parallelization strategy search to our Comm.×Topo. plane each time it finds a strategy with improved runtime, or after 50 iterations of MCMC search when it cannot find an improvement, to escape local optima. The yellow box in Figure 8 represents TopoOpt's Comm.×Topo. plane where, given a parallelization strategy and device placement as input, it finds the best network topology and routing to minimize the training iteration time using our TopologyFinder algorithm. The best discovered topology is fed back into the Comp.×Comm. plane, which further optimizes the parallelization strategy and device placement based on the updated topology. This optimization loop repeats until convergence or after k iterations, where k is a configurable hyper-parameter. The next subsection describes our TopologyFinder algorithm inside the Comm.×Topo. plane.
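To make the control flow concrete, the sketch below shows one way the alternating loop could be organized in Python. It is only an illustration of the framework described above: search_strategy and search_topology are hypothetical stand-ins for FlexFlow's MCMC search (Comp.×Comm. plane) and TopologyFinder (Comm.×Topo. plane), not the actual FlexNet code.

from typing import Any, Callable, Tuple

def alternating_optimization(
    search_strategy: Callable[[Any], Tuple[Any, float, Any]],  # Comp.xComm. plane
    search_topology: Callable[[Any], Any],                     # Comm.xTopo. plane
    initial_topology: Any,
    k_max: int = 10,
) -> Tuple[Any, Any, float]:
    """Sketch of the alternating loop in Section 3.2 (not the FlexNet code).

    search_strategy(topology) is assumed to return (strategy, iteration_time,
    traffic_demand); search_topology(traffic_demand) returns a new topology.
    """
    topology = initial_topology
    best_strategy, best_topology, best_time = None, topology, float("inf")
    for _ in range(k_max):
        # Comp.xComm. plane: search strategies on the current (fixed) topology.
        strategy, iter_time, traffic = search_strategy(topology)
        if iter_time < best_time:
            best_strategy, best_topology, best_time = strategy, topology, iter_time
        else:
            break  # no further improvement: treat as converged
        # Comm.xTopo. plane: refit the topology to the resulting traffic demand.
        topology = search_topology(traffic)
    return best_strategy, best_topology, best_time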
3.3 TopologyFinder Algorithm

Prior proposals are inefficient for DNN workloads. At first blush, finding a network topology seems straightforward: we just need to translate the parallelization strategy and device placement from the Comp.×Comm. plane into a traffic matrix and map the traffic matrix into circuit schedules. Several prior papers have addressed this problem for datacenter networks [53,60,64,68,79,83-85,99,117]. The conventional wisdom in prior work is to allocate as many direct parallel links as possible to elephant flows and leave mice flows to take multiple hops across the network. In principle, this approach works well for datacenters, but we argue it leads to sub-optimal topologies for distributed DNN training, because the size of AllReduce transfers is larger than that of MP transfers in BIGNET (Appendix B). Hence, the conventional approach creates parallel direct links for carrying AllReduce traffic and forces MP flows to take a large hop-count. But MP transfers are on the critical path of processing every batch, and a slight delay in their completion time negatively impacts the entire training iteration time. Consequently, a large hop-count for MP transfers degrades training performance.
TopoOpt's novel technique. In TopoOpt, we seek to meet two goals simultaneously: (i) allocate most of the available bandwidth to AllReduce transfers, since the bulk of the traffic belongs to them; but (ii) ensure a small hop-count for MP transfers. We meet both goals by leveraging a unique property of distributed DNN training traffic, namely that the AllReduce part of the traffic matrix is mutable and can be split across multiple permutations (§2.2). Intuitively, this is because MP traffic is composed of network flows among nodes that contain different parts of a DNN model, thus creating immutable data dependencies across these nodes, while AllReduce transfers contain network flows among nodes that handle the same part of the model, providing flexibility in the order of nodes participating in AllReduce. Consequently, if a group of servers is connected in a certain order, simply permuting the labeling of the servers gives another ordering that finishes the AllReduce operation with the same latency while potentially providing a smaller hop-count for MP transfers. Instead of selecting just one AllReduce order, we find multiple permutations for each AllReduce group that best satisfy MP transfers and overlap their corresponding sub-topologies. In doing so, we not only serve the AllReduce traffic but also decrease the hop-count for MP transfers.
TopologyFinder steps. Algorithm 1 presents the pseudocode of our TopologyFinder algorithm. The algorithm takes the following inputs: N dedicated servers for the training job, each with degree d, as well as a list of AllReduce and MP transfers (T_AllReduce and T_MP) based on the parallelization strategy and device placement obtained from the Comp.×Comm. plane. The algorithm then finds the best topology (G) and routing rules (R) and returns them to the Comp.×Comm. plane for the next round of alternating optimization. Our algorithm consists of the following four steps.
Step 1: Distribute the degree. The first step distributes the degree d between the AllReduce and MP sub-topologies proportionally, based on their share of the total traffic. We start with AllReduce transfers and allocate at least one degree to the AllReduce sub-topology to ensure the network topology remains connected (line 2). The remaining degrees, if any, are allocated to the MP sub-topology (line 3).
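As a concrete reading of lines 2-3 of Algorithm 1, the degree split can be computed as below. This is a minimal sketch, assuming the transfer lists are plain byte counts.

import math

def split_degree(d, allreduce_sizes, mp_sizes):
    """Step 1 of TopologyFinder (lines 2-3): split the server degree d between
    the AllReduce and MP sub-topologies in proportion to their traffic,
    always keeping at least one degree for AllReduce so the topology stays connected."""
    total = sum(allreduce_sizes) + sum(mp_sizes)
    d_allreduce = max(1, math.ceil(d * sum(allreduce_sizes) / total))
    return d_allreduce, d - d_allreduce

# Example: d = 8 with AllReduce-dominated traffic.
print(split_degree(8, [100, 100, 100, 100], [60]))  # -> (7, 1)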
Step 2: Construct the AllReduce sub-topology and routing. To find the AllReduce sub-topology, the algorithm iterates over every AllReduce group k and allocates a degree d_k to each group proportionally, based on the amount of traffic it requires (line 6). Note that in hybrid data & model parallelism strategies, the AllReduce step can be performed across a subset of servers when an operator is replicated across a few servers instead of all servers. For each AllReduce group k, TopologyFinder efficiently finds a set of permutations across the servers in k (line 8). It then selects the top d_k permutations that best satisfy MP traffic demands, using a module called TopPermutations (line 9). There are several possible metrics for measuring how well a permutation satisfies the MP demand; in our implementation, we use the sum of the sizes of MP transfers that receive a direct link from the AllReduce permutations. Selecting AllReduce permutations while considering the MP traffic demand is a key reason TopologyFinder looks for alternative AllReduce permutations in the first place. However, at large scales, finding the set of all possible AllReduce permutations is non-trivial, since the number of possible permutations is O(n!), where n is the number of servers in group k. Inspired by group theory, we develop a technique to address this challenge, called TotientPerms, described next.
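The selection metric described above can be sketched as follows: score each candidate ring permutation by the total size of MP transfers whose endpoints would receive a direct link under that permutation, then keep the top d_k rings. This is an illustrative reconstruction of the TopPermutations idea, not the FlexNet code.

def top_permutations(d_k, candidate_rings, mp_transfers):
    """Pick the d_k ring permutations whose direct links cover the most MP bytes.

    candidate_rings: list of rings, each a list of server ids in ring order.
    mp_transfers:    dict {(src, dst): bytes} of model-parallel transfers.
    """
    def ring_edges(ring):
        # Edges of the ring, e.g. [0, 3, 1, 2] -> {0-3, 3-1, 1-2, 2-0}.
        return {frozenset((ring[i], ring[(i + 1) % len(ring)]))
                for i in range(len(ring))}

    def score(ring):
        edges = ring_edges(ring)
        # Sum of MP transfer sizes that get a one-hop path on this ring.
        return sum(size for (s, t), size in mp_transfers.items()
                   if frozenset((s, t)) in edges)

    return sorted(candidate_rings, key=score, reverse=True)[:d_k]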
Using group theory to find AllReduce permutations. Given that ring-AllReduce is the dominant AllReduce collective in BIGNET, we describe our TotientPerms technique based on ring-AllReduce; Appendix C explains how to extend our algorithm to other AllReduce communication collectives.
Algorithm 1: TopologyFinder pseudocode

1:  procedure TOPOLOGYFINDER(N, d, T_AllReduce, T_MP)
    Input  N: number of dedicated training servers for the job.
    Input  d: degree of each server.
    Input  T_AllReduce: AllReduce transfers.
    Input  T_MP: MP transfers.
    Output G: topology to give back to the Comp.×Comm. plane.
    Output R: routing rules to give back to the Comp.×Comm. plane.
    // Distribute degree d between the AllReduce and MP sub-topologies
2:    d_AllReduce = max(1, ceil(d × sum(T_AllReduce) / (sum(T_AllReduce) + sum(T_MP))))
3:    d_MP = d − d_AllReduce
    // Construct the AllReduce sub-topology G_AllReduce
4:    G_AllReduce = {}
5:    for each AllReduce group k with set of transfers T_k do
        // Assign degree d_k to group k according to its total traffic
6:      d_k = ceil(d_AllReduce × sum(T_k) / sum(T_AllReduce))
7:      d_AllReduce = d_AllReduce − d_k
        // Find the ring permutations between servers in group k
8:      P_k = TotientPerms(N, k)
        // Select d_k permutations from P_k according to T_MP
9:      G_AllReduce = G_AllReduce ∪ TopPermutations(N, d_k, P_k, T_MP)
10:     if d_AllReduce == 0 then
11:       break
    // Compute routes on G_AllReduce using the coin change algorithm [49]
12:   R = CoinChangeMod(N, G_AllReduce)
    // Construct the MP sub-topology G_MP
13:   G_MP = {}
14:   for i = 1 to d_MP do
        // Find a maximum weight matching according to T_MP
15:     g = BlossomMaximumWeightMatching(T_MP)
16:     G_MP = G_MP ∪ g
        // Reduce the remaining demand for each link l in matching g
17:     for l ∈ g do
18:       T_MP[l] = T_MP[l] / 2
    // Combine the AllReduce and MP sub-topologies
19:   G = G_AllReduce ∪ G_MP
    // Compute routes on G_MP with shortest paths
20:   R += ShortestPath(G, T_MP)
For a ring-AllReduce group with n servers labeled S0, ..., S(n-1), a straightforward ring-AllReduce permutation is (S0 → S1 → S2 → ... → S(n-1) → S0). We denote this permutation by the ring generation rule Si → S((i+1) mod n). Since the servers form a ring, the index of the starting server does not matter; for instance, the rings (S0 → S1 → S2 → S3 → S0) and (S1 → S2 → S3 → S0 → S1) are equivalent. To reduce the search space of all possible permutations, we consider the ring generation rules of all regular rings, i.e., rings where the distance between the indices of consecutive servers is constant, so that server Si is connected to server S((i+p) mod n). In particular, we show that every integer p < n that is co-prime with n (i.e., gcd(p, n) = 1) represents a valid ring-AllReduce permutation (Appendix C). For instance, for n = 12 servers, the ring generation rules with p = 1, 5, 7, 11 lead to four distinct AllReduce permutations between the servers. In cases where n is extremely large, we restrict p to prime numbers, reducing the search space to roughly n/ln(n) candidates according to the Prime Number Theorem [62]. The eventual AllReduce sub-topology is the union of the top permutations selected in line 9.
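A minimal version of the TotientPerms idea for ring-AllReduce can be written directly from the co-primality rule above; the restriction to prime p for very large n is included as an option. This is a sketch of the rule only, not the full pseudocode in Appendix C.

from math import gcd

def totient_perms(n, primes_only=False):
    """Generate ring-AllReduce permutations for n servers using the rule
    S_i -> S_{(i+p) mod n}, which yields a single n-node ring iff gcd(p, n) == 1.
    With primes_only=True, p is restricted to primes (useful when n is very
    large, shrinking the candidate set to roughly n/ln(n) rings)."""
    def is_prime(x):
        return x > 1 and all(x % q for q in range(2, int(x ** 0.5) + 1))

    rings = []
    for p in range(1, n):
        if gcd(p, n) != 1:
            continue
        if primes_only and not is_prime(p):
            continue
        ring = [(i * p) % n for i in range(n)]  # visit order S_0, S_p, S_2p, ...
        rings.append((p, ring))
    return rings

# Example from the text: n = 12 gives p in {1, 5, 7, 11}, i.e., 4 distinct rings.
print([p for p, _ in totient_perms(12)])  # [1, 5, 7, 11]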
Figure 9: Example of TopoOpt's topology and traffic matrix for the DLRM workload: (a) TopoOpt topology over 16 servers; (b) TopoOpt traffic pattern (color scale 0-4 GB).

Coin-change routing. Consider servers Si and Sj that need to exchange AllReduce transfers but do not have a direct edge between them. We use a modified version of the classical coin change problem [49] to find an efficient routing path (line 12).
In classical coin change, the goal is to find the minimum number of coins that sum to a given total value. Our ring generation rules enable us to treat the routing problem similarly: the p values of the AllReduce permutations selected for the AllReduce sub-topology are the coin values, and the difference between the indices of servers i and j, (j − i) mod n, is the target total value that we want to achieve. The only difference is that our problem runs in modulo-n arithmetic, as server IDs wrap around in the ring structure. Appendix C lists the pseudocode of the TotientPerms, TopPermutations, and CoinChangeMod methods.
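Routing between indirectly connected servers can then be sketched as coin change in modulo-n arithmetic: the selected p values are the coin denominations, and (j − i) mod n is the amount to make up. The breadth-first dynamic program below is an illustrative reconstruction of this idea (not the CoinChangeMod pseudocode in Appendix C); it only follows "+p" steps of the selected rings.

def coin_change_route(i, j, n, selected_ps):
    """Find a short multi-hop path from server i to server j on a topology made
    of ring permutations with steps +p (mod n), for p in selected_ps.

    Returns the list of servers along the path, or None if unreachable."""
    target = (j - i) % n
    # dp[r] = minimal list of +p steps summing to r (mod n); classic coin change,
    # except that all arithmetic wraps around modulo n.
    dp = {0: []}
    frontier = [0]
    while frontier and target not in dp:
        nxt = []
        for r in frontier:
            for p in selected_ps:
                r2 = (r + p) % n
                if r2 not in dp:
                    dp[r2] = dp[r] + [p]
                    nxt.append(r2)
        frontier = nxt
    if target not in dp:
        return None
    # Convert the step sequence into the actual server ids along the path.
    path, cur = [i], i
    for p in dp[target]:
        cur = (cur + p) % n
        path.append(cur)
    return path

# Example: 16 servers with rings p = 1, 5, 7 selected; route from S2 to S14.
print(coin_change_route(2, 14, 16, [1, 5, 7]))  # -> [2, 7, 14]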
Step 3: Construct the MP sub-topology. Given that MP transfers are not permutable, we use the classical Blossom maximum weight matching algorithm [59] to find the best connectivity between servers with MP transfers (line 15). We repeat the matching algorithm until we run out of degrees. To increase the likelihood of more diverse connectivity across server pairs, we divide the magnitude of T_MP by two for pairs that already have an edge between them (line 18). In general, the division by two can be replaced by a more sophisticated function with diminishing returns; Appendix D elaborates on this point.

Step 4: Final topology and routing. Finally, we combine the MP and AllReduce sub-topologies and compute k-shortest-path routes for MP transfers (lines 19 and 20).
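Steps 3 and 4 can be sketched with networkx: repeatedly take a maximum weight matching over the remaining MP demand (halving the demand of matched pairs), union the matchings with the AllReduce sub-topology, and route MP transfers over shortest paths. This mirrors, but is not, the paper's implementation; networkx is assumed to be available.

import networkx as nx

def build_mp_subtopology(mp_demand, d_mp):
    """Steps 3-4 (sketch). mp_demand: dict {(u, v): bytes}; d_mp: MP degree."""
    demand = dict(mp_demand)
    mp_edges = set()
    for _ in range(d_mp):
        g = nx.Graph()
        for (u, v), w in demand.items():
            g.add_edge(u, v, weight=w)
        # Blossom-based maximum weight matching over the remaining MP demand.
        for u, v in nx.max_weight_matching(g):
            mp_edges.add(frozenset((u, v)))
            key = (u, v) if (u, v) in demand else (v, u)
            demand[key] /= 2  # halve matched pairs so later rounds favor new pairs
    return mp_edges

def route_mp(allreduce_edges, mp_edges, mp_demand):
    """Combine the sub-topologies and route MP transfers on shortest paths.
    Assumes the combined graph is connected (the AllReduce rings keep it so)."""
    g = nx.Graph()
    g.add_edges_from(tuple(e) for e in set(allreduce_edges) | mp_edges)
    return {(u, v): nx.shortest_path(g, u, v) for (u, v) in mp_demand}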
Example. As an example, consider the DLRM model of Figure 4 distributed across 16 servers, each with six NICs (d = 6). Instead of choosing one of the AllReduce permutations in Figure 5, TopoOpt combines the three ring-AllReduce permutations to load-balance the AllReduce transfers while providing a short hop-count for MP transfers. Figure 9 illustrates TopoOpt's topology and traffic matrix, demonstrating a more balanced traffic matrix than Figure 4.
4 Optical Switching Technologies

Once an optimized topology and parallelization strategy are found for a given job, we use the optical switches in TopoOpt to reconfigure the interconnection between the set of servers that participate in the job. Since our TopologyFinder algorithm takes the server degree d as input, we directly map the output of the algorithm to a physical topology.
There are many different optical switching technologies that we can use for TopoOpt [41,50,53,54,57,60,64,79,83,85,87,88,99,104]. Table 1 lists the key characteristics of these technologies.

Technology                 | Port count | Reconfig. latency | Insertion loss (dB) | Cost/port
Optical patch panels [41]  | 1008       | minutes           | 0.5                 | $100
3D MEMS [16,39]            | 384        | 10 ms             | 1.5-2.7             | $520
2D MEMS [53,99]            | 300        | 11.5 µs           | 10-20               | Not commercial
Silicon photonics [79,104] | 256        | 900 ns            | 3.7                 | Not commercial
Tunable lasers [50,57]     | 128        | 3.8 ns            | 7-13                | Not commercial
RotorNet [87,88]           | 64         | 10 µs             | 2                   | Not commercial

Table 1: Comparison of optical switching technologies.
In principle, TopoOpt's design is compatible with any of these technologies. However, most are not commercially available. For an immediate deployment in BIGNET, this section focuses on optical patch panels [41,102] and 3D MEMS circuit switches [16,39], the only two technologies that are commercially available today. In our simulations, we also evaluate the performance of fast reconfigurable switches to provide a perspective on future designs (§5).
Optical patch panels. Fiber optic patch panels are commonly used for cable management. Reconfigurable optical patch panels are a new class of software-controlled patch panels and are already commercialized at scale [102]. For instance, Telescent offers fully reconfigurable patch panels with 1008 duplex ports and insertion loss of less than 0.5 dB for $100K ($100/port) [41,102]. Reconfiguration is performed by a robotic arm that grabs a fiber on the transmit side and connects it to a fiber on the receive side [78]. However, the reconfiguration latency of optical patch panels is several minutes [41].
3D MEMS-based Optical Circuit Switches (OCSs). An OCS uses tiny mirrors to change the direction of light, thereby reconfiguring optical links. The largest optical circuit switch on the market has 384 duplex ports with a 10 ms reconfiguration latency and is available for $200K ($520/port) [39]. However, the optical loss of these switches is 1.5-2.7 dB [19]. Compared to patch panels, OCSs have the following disadvantages: (i) each port is five times more expensive; (ii) their insertion loss is higher; and (iii) their port count is three times lower. The main advantage of OCSs is that their reconfiguration latency is four orders of magnitude lower than that of patch panels.
Impact of reconfiguration latency. Patch panel and OCS technologies are both applicable to TopoOpt. The choice of which technology to use depends on several factors, including the scale of the cluster, the training iteration time of jobs, and the frequency of job arrivals. For instance, OCSs can be used to reconfigure the topology of a job within training iterations, whereas patch panels are only suitable when the topology remains intact throughout the entire training. Our evaluations demonstrate that the reconfiguration latency of today's OCSs is too high for some DNNs, leading to sub-optimal performance when the topology is reconfigured within iterations (§5).
Figure 10: Active and Look-ahead ports when the reconfiguration latency is too high.

Handling job arrivals. To start a job with k servers, we need to reconfigure the interconnection between these k servers before the job starts. This can be done quickly when OCSs are used, but with patch panels there could be several minutes of delay before the job can start. To address this challenge, we use a look-ahead approach to pre-provision the next topology while current jobs are running. More specifically,
we use a simple 1×2 mechanical optical switch [98] at each server interface to choose between an Active and a Look-ahead port. These 1×2 switches are inexpensive ($25) and have 0.73 dB of optical loss, as measured in our prototype. We then connect the two ends of each 1×2 switch to different patch panels, as shown in Figure 10. As a result, a TopoOpt cluster with n servers, each with d interfaces, has 2d patch panels, where each interface is split into two parts: Active and Look-ahead. At any point in time, only one end of each 1×2 switch participates in the active topology; the other end is pre-provisioning the topology for the next job. Once all the servers for the new job are ready, TopoOpt immediately flips to the new topology by reconfiguring the corresponding 1×2 switches.
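The look-ahead mechanism can be pictured as a tiny controller state machine: each interface has an Active and a Look-ahead leg; while a job runs on the Active legs, the next job's topology is wired on the Look-ahead patch panels, and starting the job is just flipping the 1×2 switches. The sketch below is purely illustrative; the port model and its methods are hypothetical, not a real switch-control API.

class LookAheadPort:
    """One server interface behind a 1x2 optical switch (illustrative model)."""
    def __init__(self):
        self.active_leg = "A"        # leg currently carrying traffic
        self.pending_ready = False   # next topology pre-wired on the other leg

    def preprovision(self):
        # The robotic patch panel wires the inactive leg for the next job; this
        # can take minutes but does not disturb the running job.
        self.pending_ready = True

    def flip(self):
        # Fast 1x2 switch toggle: the pre-wired leg becomes the active one.
        assert self.pending_ready, "next topology not provisioned yet"
        self.active_leg = "B" if self.active_leg == "A" else "A"
        self.pending_ready = False

# Starting a new job: pre-provision all of its ports, then flip them together.
ports = [LookAheadPort() for _ in range(4)]
for p in ports:
    p.preprovision()
for p in ports:
    p.flip()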
5 Large-Scale Simulations

This section evaluates the performance of a large-scale TopoOpt interconnect. First, we explain our simulation software and methodology (§5.1). Then, we provide a cost analysis of TopoOpt to inform our simulations when comparing different interconnects (§5.2). Next, we demonstrate the performance of TopoOpt when a cluster is dedicated to a single distributed DNN training job (§5.3). We extend this setting to a case where a training cluster is shared among multiple DNNs (§5.4). Finally, we demonstrate the impact of reconfiguration latency and server degree on TopoOpt's performance (§5.5).
5.1 Methodology & Setup

We implement two simulators to evaluate TopoOpt:

FlexNet simulator. We augment FlexFlow's simulator [27] to be network-aware and call it FlexNet. Given a DNN model architecture and a batch size, FlexFlow's simulator explores different parallelization strategies and device placements to find a strategy that minimizes per-iteration training time. The output of the FlexFlow simulator is a task graph describing the set of computation and communication tasks on each GPU and their dependencies. However, the current implementation of FlexFlow ignores the network topology entirely by assuming servers are connected in a full-mesh interconnect. Our FlexNet simulator extends the FlexFlow simulator and enables it to consider multiple networks, including Fat-trees, TopoOpt, and expander networks. Moreover, FlexNet implements our alternating optimization framework (§3) to find an optimized network topology and routing rules for TopoOpt.
Table 2: DNN models used in our simulations. The top group is used in §5.3 and the bottom group in §5.4.

Top (§5.3):
- VGG: batch size/GPU 64.
- BERT: batch size/GPU 16; 12 transformer blocks; hidden layer 1024; sequence length 64; 16 attention heads; embedding size 512.
- DLRM: batch size/GPU 128; 8 dense layers of size 2048; 16 dense feature layers of size 4096; embedding 128×10^7; 64 embedding tables.
- CANDLE: batch size/GPU 256; 8 dense layers of size 16384; 16 dense feature layers of size 16384.

Bottom (§5.4):
- VGG: batch size/GPU 64.
- BERT: batch size/GPU 16; 6 transformer blocks; hidden layer 768; sequence length 256; 6 attention heads; embedding size 512.
- DLRM: batch size/GPU 256; 8 dense layers of size 1024; 16 dense feature layers of size 2048; embedding 256×10^7; 16 embedding tables.
- CANDLE: batch size/GPU 256; 8 dense layers of size 4096; 16 dense feature layers of size 4096.
FlexNetPacket simulator. We find that FlexFlow's simulator often underestimates the training iteration time at large scales because it does not simulate packets traversing a network. Extending FlexNet to become a packet-level simulator is not computationally feasible, since FlexFlow generally requires thousands of MCMC rounds to converge. To faithfully simulate the per-packet behavior of network switches, buffers, and multiple jobs sharing the same fabric, we build a second event-based packet simulator, called FlexNetPacket, on top of htsim [5]. FlexNetPacket takes the output of FlexNet (the optimized parallelization strategy, the device placement of each operator, the optimized network topology, and the routing rules) and simulates several training iterations. The simulated training iteration times with FlexNetPacket match those we observe in BIGNET's clusters. The per-hop latency in FlexNetPacket is set to 1 µs to reflect the multi-hop latency between servers that are not directly connected. Together, the two simulators comprise about 10K lines of C++ code. We will release our codebase and all related data and scripts online.
Simulated network architectures. We simulate distributed training clusters with n servers, each equipped with four NVIDIA A100 GPUs [37]. We vary n in different experiments and simulate the following network architectures:

• TopoOpt-oneshot. A TopoOpt interconnect where each server is equipped with d NICs, each with bandwidth B, connected via a flat layer of optical devices. At the beginning of each job, the topology is reconfigured based on the output of our alternating optimization framework (§3) and remains unchanged throughout the entire training job. Both OCSs and patch panels are suitable for this architecture.

• TopoOpt-reconfig. To study the impact of changing the network topology within training iterations, we simulate a reconfigurable TopoOpt interconnect. We rely only on OCSs for this design and assume their reconfiguration latency is 10 ms. Given that FlexFlow's parallelization strategy search is not aware of dynamically reconfigurable networks, following prior work [79], we measure the traffic demand every 50 ms and adjust the circuits based on a heuristic algorithm that satisfies the current traffic demand as much as possible (Appendix D).
Link bandwidth | Transceiver ($) | NIC ($) | Electrical switch port ($) | Patch panel port ($) | OCS port ($) | 1×2 switch ($)
10 Gbps  | 20 [12]  | 180 [32] | 87 [21]  | 100 [41] | 520 [39] | 25 [98]
25 Gbps  | 39 [13]  | 185 [33] | 144 [23] | 100 [41] | 520 [39] | 25 [98]
40 Gbps  | 39 [14]  | 376 [17] | 144 [22] | 100 [41] | 520 [39] | 25 [98]
100 Gbps | 99 [11]  | 660 [34] | 225 [24] | 100 [41] | 520 [39] | 25 [98]
200 Gbps | 198 [11] | 790 [35] | 450 [24] | 100 [41] | 520 [39] | 25 [98]

Table 3: Cost of network components. 200 Gbps transceiver and switch port costs are estimated as 2× the 100 Gbps cost.

Figure 11: Interconnect cost comparison (interconnect cost in M$ vs. number of servers) for (a) d = 4, B = 100 Gbps and (b) d = 8, B = 200 Gbps.

• Ideal Fat-tree. An ideal full bisection bandwidth Fat-tree where each server has one NIC and the bandwidth of every link is d×B. This architecture provides an ideal bound on training iteration times in a cluster where each server has d×B of available network bandwidth.
• Cost-equivalent Fat-tree. To compare the performance of TopoOpt with a cost-equivalent architecture, we simulate a full bisection bandwidth Fat-tree where each server has one NIC and the bandwidth of each link is d×B′, where B′ is lower than B and is selected such that this Fat-tree has a similar cost to TopoOpt (§5.2).

• Oversub. Fat-tree. This is a 2:1 oversubscribed Fat-tree interconnect, similar to the one used in Opera [87], where the bandwidth of each link is d×B but half of the links at the ToR uplink layer are omitted.

• SiP-ML [79]. SiP-ML is a futuristic DNN training cluster with several Tbps of bandwidth per GPU. While having a Tbps network is certainly a plus, our goal is to compare the algorithmic contributions of TopoOpt and SiP-ML. Hence, to make an apples-to-apples comparison, we allocate d wavelengths, each with bandwidth B, to each SiP-ML GPU and follow its SiP-Ring algorithm to find a topology with a reconfiguration latency of 25 µs. Appendix E elaborates on our modifications to SiP-ML.

• Expander [108,115]. Finally, we simulate a fabric where each server has d NICs with bandwidth B, interconnected via an Expander topology.
DNN Workloads. We simulate four real-world DNN models: DLRM [20], CANDLE [3], BERT [58], and VGG [107]. Table 2 summarizes the model configurations and batch sizes used in our simulations; the top and bottom groups correspond to the models used in Sections 5.3 and 5.4, respectively. Batch sizes are selected based on the common batch sizes used in BIGNET. Each data point averages 5-20 simulation runs.

Parallelization strategy. We use FlexNet's topology-aware parallelization strategy search for the Ideal Fat-tree, Cost-equivalent Fat-tree, Oversub. Fat-tree, SiP-ML, and Expander networks.
Figure 12: Dedicated cluster of 128 servers (d = 4): training iteration time (s) vs. link bandwidth (Gbps) for (a) CANDLE, (b) BERT, and (c) DLRM, comparing TopoOpt-oneshot, TopoOpt-reconfig, Ideal Fat-tree, Cost-equivalent Fat-tree, SiP-ML, and Expander.

Figure 13: Dedicated cluster of 128 servers (d = 8), same setting as Figure 12.
For TopoOpt, we use FlexNet's alternating optimization framework to find the best parallelization strategy jointly with the topology. We use ring-AllReduce and distributed parameter server [81] as the default AllReduce communication collectives between servers and within servers, respectively.
5.2 Cost Analysis

We begin our evaluations by comparing the cost of different network architectures. Table 3 lists the costs of the network components used in this section. The costs of transceivers, NICs, and electrical switch ports are based on the lowest prices available on official retailer websites. We obtain the costs of patch panels, OCSs, and 1×2 optical switches directly from their suppliers. Note that the cost of optical components stays constant as link bandwidth increases, an inherent advantage of optics. Following prior work, we estimate the cost of fiber optic cables at 30 cents per meter [64] and select each fiber's length from a uniform distribution between 0 and 1000 meters [126].
Figure 11 compares the interconnect cost across the various network architectures as the number of servers increases. We calculate the cost of TopoOpt-oneshot based on 2d patch panels and a 1×2 switch at each link to support its look-ahead design (§4). TopoOpt-reconfig's cost is based on d OCSs connected to all servers in a flat topology. We make the following observations. First, using OCSs for TopoOpt is more expensive (1.33×, on average) than using patch panels; note that OCSs can be used in both the TopoOpt-oneshot and TopoOpt-reconfig interconnects. Second, the cost of TopoOpt-oneshot (blue curve) overlaps with the Cost-equivalent Fat-tree (yellow curve). This is intentional, since having a cost-equivalent architecture enables us to compare the performance of TopoOpt to a cluster at the same price point. Third, TopoOpt-oneshot is, on average, 3.4× more cost-effective than its Ideal Fat-tree counterpart. Finally, the most and least expensive fabrics are SiP-ML and Expander, respectively, and this section shows that both perform worse than TopoOpt.

We acknowledge that estimating the cost of networking hardware is challenging because prices are subject to significant discounts with bulk orders. However, assuming all components in this analysis are subject to the same bulk-order discounts, the relative comparison across architectures remains valid.
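At a back-of-the-envelope level, the per-server interconnect cost of TopoOpt-oneshot can be estimated from Table 3. The accounting below (one NIC-equivalent, one transceiver, and one 1×2 switch per interface, plus two patch-panel ports per interface for the Active and Look-ahead legs, and an assumed 500 m average fiber length) is our simplified reading of the design, not the paper's exact cost script.

# Rough per-server interconnect cost for TopoOpt-oneshot at 100 Gbps (Table 3).
d = 8                       # server degree (interfaces)
nic, transceiver = 660, 99  # $ per 100 Gbps NIC(-equivalent) and transceiver
patch_panel_port = 100      # $ per patch-panel port
one_by_two_switch = 25      # $ per 1x2 mechanical optical switch
fiber_per_meter = 0.30      # $ per meter; average length assumed to be ~500 m

per_server = d * (nic + transceiver + one_by_two_switch
                  + 2 * patch_panel_port       # Active + Look-ahead panel ports
                  + fiber_per_meter * 500)
print(f"~${per_server:,.0f} per server")       # prints ~$9,072 with these assumptions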
5.3 Performance Comparison for Dedicated Clusters

Figure 12a demonstrates the training iteration time of CANDLE distributed on a dedicated cluster of 128 servers with four A100 GPUs each, where d = 4. We vary the link bandwidth (B) on the x-axis. There are three takeaways from this figure. First, the Ideal Fat-tree, TopoOpt-oneshot, TopoOpt-reconfig, and SiP-ML architectures all achieve similar performance for CANDLE. This is because the best parallelization strategy for CANDLE at this scale is mostly data parallel, with a few MP transfers; hence, the network topology matters less. Recall that TopoOpt-oneshot has the lowest cost among these architectures. Second, the Cost-equivalent Fat-tree architecture has, on average, 2.8× higher training iteration time than these four architectures. Third, the Expander architecture has the worst performance, since it is not optimized for DNN workloads.³,⁴
The differences between those overlapping architectures start to matter for BERT, shown in Figure 12b. In particular, this time only three architectures overlap: TopoOpt-oneshot, Ideal Fat-tree, and SiP-ML, because BERT's parallelization strategy includes more MP transfers than CANDLE's; hence, the impact of network topology on training iteration time is more pronounced. As a result, TopoOpt-reconfig's performance starts to suffer, since the reconfiguration latency of OCSs is long compared to the training iteration time of BERT at this scale.
DLRM’s case is even more interesting, as it has a lot more
MP transfers than the other two DNNs. As shown in Figure 12c,
TOP OOPT-oneshot’s performance remains close to the Ideal
Fat-tree but both SiP-ML and TO POOP T-reconfig perform
poorly and despite increasing the link bandwidth, their training
iteration time stays flat. This happens because DLRM has a
lot of one-to-many and many-to-one broadcast and incast MP
transfers which require several circuit reconfigurations to meet
the traffic demand, consequently hurting the performance of
both SiP-ML and TO POOPT-reconfig. In particular, TOP OOPT-
reconfig is performing two orders of magnitude worse than
SiP-ML because its reconfiguration latency is two orders of
magnitude higher (10 ms vs. 25
µ
s). To verify this conclusion,
we run a series of simulations without reconfiguring TOPOOPT-
reconfig and SiP-ML and observe that their performance
3
We note that it might be possible to improve the performance of the Ex-
pander fabric by augmenting Blink’s approach [116] to a cluster-level solution.
4VGG’s results are similar to CANDLE (figures omitted).
9
Figure 14: Shared cluster of 432 servers (d = 8, B = 100 Gbps): (a) average iteration time (s) and (b) 99th-percentile iteration time (s) vs. load, for TopoOpt-oneshot, Cost-equivalent Fat-tree, Ideal Fat-tree, and Oversub. Fat-tree.
Figure 13 shows the same setting as Figure 12, except that each server now has eight NICs. The results show a similar trend: even though the per-server bandwidth has increased, the behavior of the different network architectures remains consistent. In summary, across all data points in Figures 12 and 13, TopoOpt-oneshot has 2.2× better training iteration time than its Cost-equivalent Fat-tree counterpart.
5.4 Performance Comparison for Shared Clusters

We now compare the performance of different network architectures when the cluster is shared across multiple DNN jobs. Following prior work [86,100], we run a series of simulations where 40% of the jobs are DLRM, 30% are BERT, 20% are CANDLE, and 10% are VGG16. We change the number of active jobs to represent the load on the cluster. Figure 14 compares the average and 99th-percentile iteration times at different loads for a cluster with 432 servers, where d = 8 and B = 100 Gbps. SiP-ML does not support multiple jobs, hence we omit it in this experiment. Moreover, we omit the TopoOpt-reconfig and Expander networks, since they both perform poorly in this setting. Instead, we add the Oversub. Fat-tree interconnect to demonstrate the impact of congestion on Fat-tree topologies. Figure 14a shows that TopoOpt-oneshot improves the average iteration time by 1.7× and 1.16×, compared to the Cost-equivalent Fat-tree and Oversub. Fat-tree architectures, respectively. Moreover, TopoOpt improves on the training time of the Ideal Fat-tree fabric by 1.07× on average! Initially, we were surprised by this result, since we expected the Ideal Fat-tree to have the lowest possible iteration time. However, we find two reasons why TopoOpt outperforms the Ideal Fat-tree: (i) the workload consists of large incast transfers causing congestion in the network, which Fat-tree interconnects are notoriously vulnerable to; and (ii) TopoOpt provides better latency properties, as most servers are at most three hops away, whereas for Fat-trees the maximum hop count is six. We observe a similar trend for the tail iteration completion times, depicted in Figure 14b. Averaging across all load values on the x-axis, TopoOpt improves the tail training iteration time by 3×, 1.4×, and 1.12×, compared to the Cost-equivalent Fat-tree, Oversub. Fat-tree, and Ideal Fat-tree architectures, respectively.
Figure 15: Impact of reconfiguration latency. Training iteration time (s) vs. OCS reconfiguration latency (µs) for (a) BERT and (b) DLRM, comparing TopoOpt-reconfig and TopoOpt-oneshot.
Figure 16: Impact of server degree (d) on performance. Training iteration time (s) for DLRM, CANDLE, and BERT with d = 4, 6, 8, and 10, at (a) B = 40 Gbps and (b) B = 100 Gbps.
5.5 Sensitivity Analysis
Impact of reconfiguration latency. The results presented in Figures 12 and 13 indicate that reconfiguring the topology within training iterations can lead to poor performance in some cases. One way to address this issue is to keep the topology fixed throughout the entire training run (similar to our TOPOOPT-oneshot design). While our experiments show that keeping the topology intact achieves training iteration times as good as the Ideal Fat-tree fabric most of the time, it is important to understand whether there is a suitable reconfiguration latency for DNN training clusters. Figure 15 shows the training iteration time of BERT and DLRM in the same setting as Figure 12 while sweeping the reconfiguration latency of the OCSs in TOPOOPT-reconfig from 1 µs to 10 ms.
The horizontal blue line corresponds to TOPOOPT-oneshot's iteration time, which remains constant as it does not reconfigure the network topology. The figure shows that when the reconfiguration latency is lower than 1 µs, the iteration time of TOPOOPT-reconfig matches that of TOPOOPT-oneshot. Fast reconfigurable switches will be essential in elastic scenarios where the cluster is shared across multiple jobs and servers join and leave jobs unexpectedly. We believe futuristic fast reconfigurable switches, such as Sirius [50], are well-suited for this setting, but finding a parallelization algorithm that is aware of reconfigurability in the network topology is a challenging and exciting open problem; we leave the design of a joint topology optimization, cluster scheduling, and parallelization strategy to future work.
Impact of server degree. We next study the impact of the server degree d on TOPOOPT's performance. Specifically, we vary the degree of each server in TOPOOPT for two link bandwidths: 40 Gbps and 100 Gbps. Figure 16 shows the trend for different DNN models. Both DLRM and CANDLE are network-heavy; therefore, they benefit more from the additional bandwidth obtained by increasing d. CANDLE's improvement is almost linear as the degree goes up, since its strategy is close to data parallel and the amount of bandwidth available to the AllReduce operation increases linearly as well. In the case of DLRM, we observe super-linear scaling when B = 100 Gbps. This is because DLRM has one-to-many and many-to-one MP transfers which require a low hop count in the topology. As we increase d, TOPOLOGYFINDER is able to find network topologies with a much lower diameter, benefiting performance by both increasing bandwidth and reducing the hop count of MP transfers. Finally, BERT is mostly compute-bound at higher bandwidths; hence, increasing the server degree and bandwidth per node has only a marginal impact on its iteration time.
Figure 17: Photo of our testbed, showing the A100 GPU servers, the patch panel, and the 1x2 optical switches.
6 Prototype
Testbed setup. We build a prototype to demonstrate the feasibility of TOPOOPT. Our prototype includes 12 ASUS ESC4000A-E10 servers and a G4 NMT patch panel [41]. Each server is equipped with one Nvidia A100 GPU [37] (40 GB of HBM2 memory), one 100 Gbps HP NIC [29], and one 100 Gbps Mellanox ConnectX-5 NIC. Our HP NICs are capable of supporting 4x25 Gbps interfaces using a PSM4 transceiver with four breakout fibers [7], enabling us to build a TOPOOPT system with degree d = 4 and B = 25 Gbps. We enable DCB [18] and PFC on these interfaces to support a lossless fabric for RDMA. To compare TOPOOPT's training performance with an ideal baseline, we connect the Mellanox NICs on each server to a 100 Gbps Juniper MX480 switch [30]. We build a fully functional TOPOOPT-oneshot prototype with our patch panel, including 1x2 optical switches [98] to flip between the active and look-ahead topologies. Figure 17 shows our prototype. Given that our simulation results suggest the reconfiguration latency of OCSs is too long for TOPOOPT-reconfig, we focus on TOPOOPT-oneshot in our prototype.
Distributed training framework. We use FlexFlow's training engine [26], based on Legion's parallel programming system [31], to train three DNN models: ResNet50 [70], BERT [58], and CANDLE [3]. Since our prototype is an order of magnitude smaller in scale than our simulation setup, we use smaller model sizes and batch sizes. Table 4 lists the details of each model. We ensure all GPUs are fully utilized.
ResNet50 - batch size/GPU: 20; dataset: CIFAR10.
BERT - batch size/GPU: 2; #transformer blocks: 4; hidden layer size: 768; sequence length: 64; #attention heads: 16; embedding size: 512.
CANDLE - batch size/GPU: 10; #dense layers: 4; dense layer size: 4096; #dense feature layers: 8; feature layer size: 4096.
Table 4: DNN models used in our testbed.
Figure 18: Training throughput (samples/second) for BERT, CANDLE, and ResNet, comparing TopoOpt-oneshot with the ideal baseline.
Figure 19: Time-to-accuracy of ResNet50 (accuracy vs. time in seconds), comparing TopoOpt-oneshot with the ideal baseline.
Modifications to NCCL. By default, the NCCL communication library [36] assumes all network interfaces are routable from all other interfaces. This assumption is not ideal for TOPOOPT because we have a specific routing strategy to optimize training time. We modify NCCL to understand TOPOOPT's topology and respect its routing preferences. Moreover, we integrate our TotientPerms AllReduce permutations into NCCL and enable it to load-balance parameter synchronization across multiple ring-AllReduce permutations.
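To illustrate this load-balancing idea, the sketch below (a conceptual Python rendering, not the actual NCCL patch; the names totient_rings and shard_gradient are ours) generates one ring per generator coprime with the group size and splits the gradient buffer evenly across those rings:

```python
# Illustrative sketch only: load-balancing AllReduce shards across several
# ring permutations, in the spirit of TotientPerms. Function and variable
# names here are ours and do not correspond to NCCL internals.
from math import gcd

def totient_rings(num_nodes: int):
    """Return one ring ordering per generator p coprime with num_nodes."""
    rings = []
    for p in range(1, num_nodes):
        if gcd(p, num_nodes) == 1:
            rings.append([(i * p) % num_nodes for i in range(num_nodes)])
    return rings

def shard_gradient(grad_bytes: int, rings):
    """Split a gradient buffer evenly across the available ring permutations."""
    shard = grad_bytes // len(rings)
    return [(ring, shard) for ring in rings]

if __name__ == "__main__":
    rings = totient_rings(8)          # phi(8) = 4 distinct rings
    for ring, nbytes in shard_gradient(400_000_000, rings):
        print(f"ring {ring} carries {nbytes/1e6:.0f} MB of the AllReduce")
```

Each ring carries an equal share of the parameter synchronization traffic, so no single permutation becomes the bottleneck.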
RDMA indirect forwarding. To support a multi-hop TOPOOPT interconnect, we enable RDMA RoCEv2 indirect forwarding on all our HP NICs. This is challenging because packet processing and memory access in the RDMA protocol are offloaded to the NIC. If a packet's destination IP address does not match the NIC's IP address, the RDMA engine silently drops the packet; hence, by default, RDMA does not support host-level indirect forwarding where one host acts as a relay for other hosts. To address this issue, we collaborated with engineers from Marvell, the provider of the ASIC on our HP NICs, to adjust the NIC firmware and enable indirect forwarding. Our approach does not require proprietary software/firmware and is applicable to commodity NICs with the same ASIC. We will release our scripts publicly.
At a high level, we use a feature called NPAR (network partitioning), which allows us to split each 25 Gbps physical interface into two logical interfaces at the hardware level: if1 and if2. Each logical interface has a different MAC address, but only if1 has an IP address. RDMA is enabled on if1 but disabled on if2; hence, packets arriving at if2 are delivered to the host networking stack. We then establish a set of iproute, arp, and tc flower rules in Linux to guarantee that a packet is routed to the if1 logical interface if its destination IP address matches if1's IP address. Otherwise, the packet is handled by if2, allowing the NIC to hand the packet to the Linux kernel for further processing. Compared to pure point-to-point RDMA, this approach incurs a small performance penalty, but our experiments show the overhead is negligible. A more performant production implementation could use XDP (eXpress Data Path) [25] to increase efficiency. Moreover, the next generation of Marvell ASICs will support hardware offloading of tc flower [28]; this will further reduce the performance penalty of our approach, since the NIC itself can handle indirectly forwarded packets.
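For illustration only, the sketch below shows the general shape of such host-side rules driven from Python; the interface names (if1, if2) and addresses are hypothetical, and this is not the firmware-assisted setup or the exact scripts described above:

```python
# Illustrative sketch of host-side relay rules (hypothetical interface names
# and addresses); the production setup additionally relies on NPAR and vendor
# firmware, which this sketch does not capture. Requires root to actually run.
import subprocess

IF1, IF2 = "if1", "if2"              # hypothetical logical interface names
RELAY_ROUTES = {"10.0.0.7/32": IF1}  # hypothetical destinations relayed via if1

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

def install_forwarding_rules():
    # Let the kernel forward packets that arrive on if2 but are not ours.
    run(["sysctl", "-w", "net.ipv4.ip_forward=1"])
    # Attach an ingress qdisc so tc flower filters can match arriving packets.
    run(["tc", "qdisc", "add", "dev", IF2, "ingress"])
    for dst, out_if in RELAY_ROUTES.items():
        # Route the relayed destination out of the RDMA-enabled interface...
        run(["ip", "route", "replace", dst, "dev", out_if])
        # ...and redirect transit packets arriving on if2 toward it.
        run(["tc", "filter", "add", "dev", IF2, "ingress", "protocol", "ip",
             "flower", "dst_ip", dst,
             "action", "mirred", "egress", "redirect", "dev", out_if])

if __name__ == "__main__":
    install_forwarding_rules()
```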
Training performance. Figure 18 demonstrates that TOPOOPT achieves a training throughput (samples/second) similar to that of our ideal baseline. Moreover, Figure 19 shows that our prototype and the baseline have similar time-to-accuracy curves when training ResNet50 on the CIFAR10 dataset [80]. The small differences between the time-to-accuracy curves are due to random seed selection.
7 Discussion
Handling scale. A flat TOPOOPT cluster with OCSs can scale to 384 servers, and a TOPOOPT cluster with patch panels can scale to 1,000 servers. Assuming each server has 8 GPUs, these clusters can host 3,072 and 8,000 GPUs, respectively. Given that our DNN jobs run on fewer than 1,000 workers (Figure 1a), there is no immediate need to create a hierarchy of switches. To further scale a TOPOOPT cluster, we can create a hierarchical interconnect by placing the servers under ToR switches and connecting the ToR switches to the optical switch layer, similar to prior work [50, 67, 68, 88]. Another option is to build a Clos topology using a hierarchy of optical switches and patch panels. We leave exploring these options to future work.
Supporting dynamic scheduling and elasticity. Prior work has demonstrated the benefits of dynamically choosing the training servers for elastic training jobs [86, 100]. Our target use case in BIGNET is to leverage TOPOOPT for the vast number of long-lasting training jobs that do not change dynamically. In cases where elasticity is required, instead of using patch panels, we use OCSs (or other fast reconfigurable optical switches) to quickly change the servers participating in a job. Note that dynamically changing the set of servers participating in a job while keeping both the topology and the parallelization strategy optimal requires augmenting the optimization space with an additional dimension, making the problem even more challenging; we leave this to future work.
Handling failures.
Unlike SiP-ML’s single ring topol-
ogy [79], our TOPOLOGYFINDER’s technique spreads the
available degree across servers to create topologies with a di-
verse set of AllReduce permutations, which, in turn, increases
the failure resiliency of a TOPOOPT interconnect. In particular,
a TOPOOPT topology does not have a single point of failure.
Supporting multi-tenancy.
To support multi-tenancy [122,
123], TOPOOPT can leverage NVIDIA’s MIG [38] to treat one
physical server as multiple logical servers in its topology.
TotientPerms in Fat-trees.
Although our TotientPerms
technique is well-suited for reconfigurable optical intercon-
nects, it may be of independent interest for Fat-tree intercon-
nects as well since load-balancing the AllReduce traffic across
multiple permutations can help with network congestion.
8 Related Work
Optimizing DNN training.
To address the increasing compu-
tation and network bandwidth requirements of large training
jobs, a plethora of frameworks have been proposed [4, 44, 55, 65, 71, 73, 76, 77, 93, 96, 97, 105, 109, 116, 125]. These frameworks distribute the dataset and/or DNN model across accelerators while considering the available network bandwidth, but unlike TOPOOPT, they do not consider the physical-layer topology as an optimization dimension. Specifically, Blink [116] builds fast
collectives for distributed ML, but it needs a physical topology
to generate its spanning trees. Moreover, several methods have
been proposed to quantize and compress the gradients to reduce
the amount of communication data across servers [46,52,124].
While all these approaches are effective, they are designed for
data parallel strategies and do not consider the large amount
of data transfers caused by model parallel training. Wang et
al. [118] compared the performance of Fat-trees and BCube
topologies for distributed training workloads and highlighted
several inefficiencies in Fat-trees. However, unlike TOPOOPT,
their proposed approach does not co-optimize topology and
parallelization strategy.
DNN parallelization strategies.
Data and model paral-
lelism have been widely used by today’s DNN frameworks
(e.g., TensorFlow [42], PyTorch [40], MXNet [15]) to par-
allelize training across multiple devices. Recent work has
also proposed automated frameworks (e.g., FlexFlow [76],
ColocRL [89]) that find efficient parallelization strategies by
searching over a comprehensive space of potential strategies.
These frameworks rely on and are optimized for the conven-
tional Fat-tree interconnects. TOPOOPT proposes a new ap-
proach to building DNN training systems by jointly optimizing
network topology and parallelization strategy.
DNN training infrastructures and schedulers.
Several
training infrastructures have been proposed recently, including
NVIDIA DGX SuperPOD [9], TPU cluster [8], and super-
computers [1]. All these systems assume non-reconfigurable
network topologies, such as Fat-tree, Torus, and other traffic
oblivious interconnects. TOPOOPT is the first DNN system
that uses commodity reconfigurable interconnects to accel-
erate DNN jobs. Gandiva [120], Themis [86], Tiresias [66],
BytePS [77,97], and Pollux [100] seek to improve the utiliza-
tion of GPU clusters through scheduling algorithms. These
approaches are complementary to ours, and many of their tech-
niques can be applied to a TOPOOPT cluster.
Optical Interconnects.
Several papers demonstrated the
benefits of optically reconfigurable interconnects for datacen-
ters [50, 53, 57, 60, 64, 83-85, 87, 88, 99]. As mentioned in
Section 3.3, these designs lead to sub-optimal topologies for
distributed DNN traffic. Similarly, traffic oblivious intercon-
nects, such as RotorNet [87,88], are a great fit for datacenter
workloads, but they are not suitable for DNN training jobs
characterized by repetitive AllReduce and MP traffic demands.
Hybrid electrical/optical datacenter proposals [60,117] can be
used to route AllReduce traffic through the optical fabric and
MP flows through a standard electrical Fat-tree network. But
hybrid clusters are not cost-effective and suffer from many prob-
lems, including TCP ramp-up inefficiencies [91], segregated
routing issues [61], and uncertainty in terms of how to divide
the cluster between electrical and optical fabrics [64,68].
9 Conclusion
We present TOPOOPT, a novel network interconnect for building DNN training clusters. We design an alternating optimization algorithm that explores the large space of Computation x Communication x Topology strategies for a DNN workload, and demonstrate that TOPOOPT obtains up to 3x faster training iteration time than a cost-equivalent Fat-tree.
References
[1]
Summit Supercomputer, 2014.
https:
//www.olcf.ornl.gov/summit/.
[2]
Baidu, 2017.
https://github.com/baidu-
research/baidu-allreduce.
[3]
CANDLE Uno: Predicting Tumor Dose Re-
sponse across Multiple Data Sources, 2017.
https://github.com/ECP-CANDLE/Benchmarks/
tree/master/Pilot1/Uno.
[4]
Meet Horovod: Uber’s Open Source Distributed Deep
Learning Framework for TensorFlow, 2017.
https:
//eng.uber.com/horovod.
[5]
htsim packet simulator, 2018.
https://github.com/
nets-cs- pub-ro/NDP/wiki/NDP-Simulator.
[6]
CANcer Distributed Learning Environment (CAN-
DLE), 2019.
https://datascience.cancer.gov/
collaborations/joint-design- advanced-
computing/candle.
[7]
AOI 100G PSM4 Transceiver, 2020.
https:
//www.ebay.com/itm/234092018446?hash=
item3680f8bb0e:g:WoMAAOSwLFJg8dKF.
[8]
Google TPU, 2020.
https://cloud.google.com/
tpu.
[9]
Nvidia DGX SuperPOD, 2020.
https:
//www.nvidia.com/en-us/data- center/dgx-
superpod/.
[10]
NVIDIA is Preparing Co-Packaged Pho-
tonics for NVLink, Dec. 2020.
https:
//www.techpowerup.com/276139/nvidia-is-
preparing-co- packaged-photonics-for-
nvlink.
[11]
100GBASE-SR4 QSFP28 850nm 100m DOM
MTP/MPO MMF Optical Transceiver Module, 2021.
https://www.fs.com/products/48354.html.
[12]
10GBASE-SR SFP+ 850nm 300m DOM LC MMF
Transceiver Module, 2021.
https://www.fs.com/
products/11552.html.
[13]
25GBASE-SR SFP28 850nm 100m DOM LC MMF Op-
tical TransceiverModule, 2021.
https://www.fs.com/
products/67991.html.
[14]
40GBASE-SR4 QSFP+ 850nm 150m DOM MTP/MPO
MMF Optical Transceiver Module, 2021.
https://
www.fs.com/products/36143.html.
[15]
Apache MXNet, 2021.
https://mxnet.apache.org/
.
[16]
Calient Optical Circuit Switch, 2021.
https:
//www.calient.net/products/edge640-optical-
circuit-switch/.
[17]
Chelsio T540-LP-CR 4-Port 10 Gigabit Ether-
net Adapter Card - Part ID: T540-LP-CR, 2021.
https://www.colfaxdirect.com/store/pc/
viewPrd.asp?idproduct=3526&idcategory=6.
[18]
Data Center Bridging eXchange (DCBX), 2021.
https://man7.org/linux/man-pages/man8/dcb-
dcbx.8.html.
[19]
Datasheet for Single Mode Network Op-
tical Switch up to 384x384 ports, 2021.
https://www.hubersuhner.com/en/documents-
repository/technologies/pdf/data-sheets-
optical-switches/polatis- series-7000n.
[20]
Deep Learning Recommendation Model for Personal-
ization and Recommendation Systems, 2021.
https:
//github.com/facebookresearch/dlrm.
[21]
Edgecore AS5610-52X 48-Port 10GbE Bare Metal
Switch with ONIE - Part ID: 5610-52X-O-AC-F-US,
2021.
https://www.colfaxdirect.com/store/pc/
viewPrd.asp?idproduct=3025&idcategory=7.
[22]
Edgecore AS6712-32X 32-Port 40GbE Bare Metal
Switch with ONIE - Part ID: 6712-32X-O-12V-F,
2021.
https://www.colfaxdirect.com/store/pc/
viewPrd.asp?idproduct=3602&idcategory=7.
[23]
Edgecore AS7312-54XS 48-Port 25GbE + 6-
Port 100GbE Bare Metal Switch with ONIE
- Part ID: 7312-54XS-O-AC-F-US, 2021.
https://www.colfaxdirect.com/store/pc/
viewPrd.asp?idproduct=3598&idcategory=7.
[24]
Edgecore AS7816-64X 64-Port 100GbE Bare Metal
Switch with ONIE - Part ID: 7816-64X-O-AC-B-US,
2021.
https://www.colfaxdirect.com/store/pc/
viewPrd.asp?idproduct=3483&idcategory=7.
[25]
eXpress Data Path, 2021.
https://xdp-
project.net/.
[26]
Flex Flow’s Training Engine, 2021.
https://
flexflow.ai/.
[27]
FlexFlow source code, 2021.
https://github.com/
flexflow/FlexFlow.
[28]
Flow based traffic control filter, 2021.
https:
//man7.org/linux/man-pages/man8/tc-
flower.8.html.
[29]
HPE Ethernet 4x25Gb 1-port 620QSFP28 Adapter,
2021.
https://support.hpe.com/hpesc/public/
docDisplay?docId=emr_na-c05220334.
[30]
Juniper MX480 switch, 2021.
https:
//www.juniper.net/us/en/products/routers/
mx-series/mx480- universal-routing-
platform/specs.html.
[31]
Legion Programming System, 2021.
https://
legion.stanford.edu/overview/.
[32]
Mellanox ConnectX-4 Single Port 10 Gi-
gabit Ethernet Adapter Card, PCIe 3.0
x8 - Part ID: MCX4111A-XCAT, 2021.
https://www.colfaxdirect.com/store/pc/
viewPrd.asp?idproduct=2812&idcategory=6.
[33]
Mellanox ConnectX-4 Single Port 25 Gi-
gabit Ethernet Adapter Card, PCIe 3.0
x8 - Part ID: MCX4111A-ACAT, 2021.
https://www.colfaxdirect.com/store/pc/
viewPrd.asp?idproduct=2814&idcategory=6.
[34]
Mellanox ConnectX-5 EN Single Port 100
Gigabit Ethernet Adapter Card, PCIe 3.0
x16 - Part ID: MCX515A-CCAT, 2021.
https://www.colfaxdirect.com/store/pc/
viewPrd.asp?idproduct=3150&idcategory=6.
[35]
Mellanox ConnectX-6 VPI Single Port HDR
200Gb/s InfiniBand & Ethernet Adapter Card,
PCIe 3.0/4.0 x16 - Part ID: MCX653105A-HDAT,
2021.
https://www.colfaxdirect.com/store/pc/
viewPrd.asp?idproduct=3669&idcategory=6.
[36]
NCCL, 2021.
https://github.com/NVIDIA/nccl-
tests.
[37]
NVIDIA A100 Tensor Core GPU, 2021.
https://
www.nvidia.com/en-us/data- center/a100/.
[38]
NVIDIA MULTI-INSTANCE GPU, 2021.
https://www.nvidia.com/en-us/technologies/
multi-instance- gpu/.
[39]
Polatis Optical Circuit Switch, 2021.
https:
//www.polatis.com/series-7000- 384x384-
port-software- controlled-optical-circuit-
switch-sdn- enabled.asp.
[40] PyTorch, 2021. https://pytorch.org.
[41]
Telescent G4 Network Topology Manager, 2021.
https://www.telescent.com/products.
[42]
Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng
Chen, Andy Davis, Jeffrey Dean, Matthieu Devin,
Sanjay Ghemawat, Geoffrey Irving, Michael Isard,
Manjunath Kudlur, Josh Levenberg, Rajat Monga,
Sherry Moore, Derek G. Murray, Benoit Steiner,
Paul Tucker, Vijay Vasudevan, Pete Warden, Martin
Wicke, Yuan Yu, and Xiaoqiang Zheng. Tensorflow:
A system for large-scale machine learning. In 12th
USENIX Symposium on Operating Systems Design and
Implementation (OSDI 16), pages 265–283, Savannah,
GA, November 2016. USENIX Association.
[43]
Bilge Acun, Matthew Murphy, Xiaodong Wang,
Jade Nie, Carole-Jean Wu, and Kim Hazelwood.
Understanding training efficiency of deep learning
recommendation models at scale, 2020.
[44]
Ravichandra Addanki, Shaileshh Bojja Venkatakrish-
nan, Shreyan Gupta, Hongzi Mao, and Mohammad
Alizadeh. Learning generalizable device placement al-
gorithms for distributed machine learning. In Advances
in Neural Information Processing Systems, volume 32,
pages 3981–3991. Curran Associates, Inc., 2019.
[45]
Mohammad Al-Fares, Alexander Loukissas, and Amin
Vahdat. A scalable, commodity data center network
architecture. SIGCOMM Comput. Commun. Rev.,
38(4):63–74, August 2008.
[46]
Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka,
and Milan Vojnovic. Qsgd: Communication-efficient
sgd via gradient quantization and encoding. In I. Guyon,
U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus,
S. Vishwanathan, and R. Garnett, editors, Advances in
Neural Information Processing Systems, volume 30,
pages 1709–1720. Curran Associates, Inc., 2017.
[47]
Mohammad Alizadeh, Albert Greenberg, David A.
Maltz, Jitendra Padhye, Parveen Patel, Balaji Prabhakar,
Sudipta Sengupta, and Murari Sridharan. Data Center
TCP (DCTCP). In Proceedings of the ACM SIGCOMM
2010 Conference, SIGCOMM ’10, pages 63–74, New
York, NY, USA, 2010. ACM.
[48]
Mohammad Alizadeh, Shuang Yang, Milad Sharif,
Sachin Katti, Nick McKeown, Balaji Prabhakar, and
Scott Shenker. pfabric: Minimal near-optimal datacen-
ter transport. In Proceedings of the ACM SIGCOMM
2013 Conference on SIGCOMM, SIGCOMM ’13,
pages 435–446, New York, NY, USA, 2013. ACM.
[49]
Javed A. Aslam. Dynamic Programming So-
lution to the Coin Changing Problem, 2004.
https://www.ccs.neu.edu/home/jaa/CSG713.04F/
Information/Handouts/dyn_prog.pdf.
[50]
Hitesh Ballani, Paolo Costa, Raphael Behrendt, Daniel
Cletheroe, Istvan Haller, Krzysztof Jozwik, Fotini
Karinou, Sophie Lange, Kai Shi, Benn Thomsen, and
Hugh Williams. Sirius: A flat datacenter network with
nanosecond optical switching. In Proceedings of the An-
nual Conference of the ACM Special Interest Group on
Data Communication on the Applications, Technologies,
Architectures, and Protocols for Computer Communi-
cation, SIGCOMM ’20, page 782–797, New York, NY,
USA, 2020. Association for Computing Machinery.
[51]
Theophilus Benson, Aditya Akella, and David A. Maltz.
Network traffic characteristics of data centers in the
wild. In Proceedings of the 10th ACM SIGCOMM
Conference on Internet Measurement, IMC ’10, pages
267–280, New York, NY, USA, 2010. ACM.
[52]
Chia-Yu Chen, Jungwook Choi, Daniel Brand, Ankur
Agrawal, Wei Zhang, and Kailash Gopalakrishnan.
Adacomp : Adaptive residual gradient compression for
data-parallel distributed training. Thirty-Second AAAI
Conference on Artificial Intelligence, 2018.
[53]
Li Chen, Kai Chen, Zhonghua Zhu, Minlan Yu, George
Porter, Chunming Qiao, and Shan Zhong. Enabling
wide-spread communications on optical fabric with
megaswitch. In 14th USENIX Symposium on Networked
Systems Design and Implementation (NSDI 17), pages
577–593, Boston, MA, 2017. USENIX Association.
[54]
Qixiang Cheng, Richard Dai, Meisam Bahadori, Nathan
C. Abrams, Padraic E Morrissey, Madeleine Glick, Peter
O’Brien, and Keren Bergman. Si/sin microring-based
optical router in switch-and-select topology. 09 2018.
[55]
Minsik Cho, Ulrich Finkler, David Kung, and Hillery
Hunter. Blueconnect: Decomposing all-reduce for deep
learning on heterogeneous network hierarchy. SysML
Conference, 2019.
[56]
Eric Chung, Jeremy Fowers, Kalin Ovtcharov, Michael
Papamichael, Adrian Caulfield, Todd Massengil, Ming
Liu, Daniel Lo, Shlomi Alkalay, and Michael Haselman.
Accelerating persistent neural networks at datacenter
scale. In Hot Chips, volume 29, 2017.
[57]
K. Clark, H. Ballani, P. Bayvel, D. Cletheroe,
T. Gerard, I. Haller, K. Jozwik, K. Shi, B. Thomsen,
P. Watts, H. Williams, G. Zervas, P. Costa, and
Z. Liu. Sub-nanosecond clock and data recovery in
an optically-switched data centre network. In 2018
European Conference on Optical Communication
(ECOC), pages 1–3, 2018.
[58]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. BERT: pre-training of deep
bidirectional transformers for language understanding.
CoRR, abs/1810.04805, 2018.
[59]
Jack Edmonds. Paths, trees, and flowers. Canadian
Journal of Mathematics, 17:449–467, 1965.
[60]
Nathan Farrington, George Porter, Sivasankar Rad-
hakrishnan, Hamid Hajabdolali Bazzaz, Vikram
Subramanya, Yeshaiahu Fainman, George Papen, and
Amin Vahdat. Helios: A hybrid electrical/optical switch
architecture for modular data centers. SIGCOMM’10,
pages 339–350.
[61]
K. Foerster, M. Ghobadi, and S. Schmid. Characterizing
the algorithmic complexity of reconfigurable data center
architectures. In Proc. ANCS ’18, pages 89–96, 2018.
[62]
G. Everest and T. Ward. An introduction to number theory, 2005.
[63]
Peter X. Gao, Akshay Narayan, Sagar Karandikar,
Joao Carreira, Sangjin Han, Rachit Agarwal, Sylvia
Ratnasamy, and Scott Shenker. Network requirements
for resource disaggregation. In Proceedings of the 12th
USENIX Conference on Operating Systems Design and
Implementation, OSDI’16, pages 249–264, Berkeley,
CA, USA, 2016. USENIX Association.
[64]
Monia Ghobadi, Ratul Mahajan, Amar Phanishayee,
Nikhil Devanur, Janardhan Kulkarni, Gireeja Ranade,
Pierre-Alexandre Blanche, Houman Rastegarfar,
Madeleine Glick, and Daniel Kilper. Projector: Agile
reconfigurable data center interconnect. In Proceedings
of the 2016 ACM SIGCOMM Conference, SIGCOMM
’16, pages 216–229, New York, NY, USA, 2016.
Association for Computing Machinery.
[65]
Priya Goyal, Piotr Dollár, Ross B. Girshick, Pieter
Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew
Tulloch, Yangqing Jia, and Kaiming He. Accurate,
large minibatch SGD: training imagenet in 1 hour.
CoRR, abs/1706.02677, 2017.
[66]
Juncheng Gu, Mosharaf Chowdhury, Kang G. Shin,
Yibo Zhu, Myeongjae Jeon, Junjie Qian, Hongqiang
Liu, and Chuanxiong Guo. Tiresias: A GPU cluster
manager for distributed deep learning. In 16th
USENIX Symposium on Networked Systems Design and
Implementation (NSDI 19), pages 485–500, Boston,
MA, February 2019. USENIX Association.
[67]
Chuanxiong Guo, Guohan Lu, Dan Li, Haitao Wu,
Xuan Zhang, Yunfeng Shi, Chen Tian, Yongguang
Zhang, and Songwu Lu. Bcube: A high performance,
server-centric network architecture for modular data
centers. In Proceedings of the ACM SIGCOMM 2009
Conference on Data Communication, SIGCOMM ’09,
page 63–74, New York, NY, USA, 2009. Association
for Computing Machinery.
[68]
Navid Hamedazimi, Zafar Qazi, Himanshu Gupta,
Vyas Sekar, Samir R. Das, Jon P. Longtin, Himanshu
Shah, and Ashish Tanwer. Firefly: A reconfigurable
wireless data center fabric using free-space optics.
SIGCOMM’14, pages 319–330.
[69]
Sayed Hadi Hashemi, Sangeetha Abdu Jyothi, and
Roy H. Campbell. Communication scheduling as
a first-class citizen in distributed machine learning
systems. CoRR, abs/1803.03288, 2018.
[70]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian
Sun. Deep residual learning for image recognition.
In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 770–778, 2016.
[71]
Yanping Huang, Yonglong Cheng, Dehao Chen,
HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, and
Zhifeng Chen. Gpipe: Efficient training of giant neural
networks using pipeline parallelism. NeurIPS, 2019.
[72]
Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan
Firat, Mia Xu Chen, Dehao Chen, HyoukJoong Lee,
Jiquan Ngiam, Quoc V. Le, Yonghui Wu, and Zhifeng
Chen. Gpipe: Efficient training of giant neural networks
using pipeline parallelism, 2019.
[73]
Forrest N. Iandola, Khalid Ashraf, Matthew W.
Moskewicz, and Kurt Keutzer. Firecaffe: near-linear
acceleration of deep neural network training on
compute clusters. CoRR, abs/1511.00175, 2015.
[74]
Anand Jayarajan, Jinliang Wei, Garth Gibson,
Alexandra Fedorova, and Gennady Pekhimenko.
Priority-based parameter propagation for distributed
DNN training. CoRR, abs/1905.03960, 2019.
[75]
Xianyan Jia, Shutao Song, Wei He, Yangzihao Wang,
Haidong Rong, Feihu Zhou, Liqiang Xie, Zhenyu
Guo, Yuanzhou Yang, Liwei Yu, Tiegang Chen,
Guangxiao Hu, Shaohuai Shi, and Xiaowen Chu.
Highly scalable deep learning training system with
mixed-precision: Training imagenet in four minutes.
CoRR, abs/1807.11205, 2018.
[76]
Zhihao Jia, Matei Zaharia, and Alex Aiken. Beyond
data and model parallelism for deep neural networks.
SysML, 2019.
[77]
Yimin Jiang, Yibo Zhu, Chang Lan, Bairen Yi, Yong
Cui, and Chuanxiong Guo. A unified architecture for
accelerating distributed DNN training in heterogeneous
gpu/cpu clusters. In 14th USENIX Symposium on Oper-
ating Systems Design and Implementation (OSDI 20),
pages 463–479. USENIX Association, November 2020.
[78]
A. S. Kewitsch. Large scale, all-fiber optical cross-
connect switches for automated patch-panels. Journal
of Lightwave Technology, 27(15):3107–3115, 2009.
[79]
Mehrdad Khani, Manya Ghobadi, Mohammad Al-
izadeh, Ziyi Zhu, Madeleine Glick, Keren Bergman,
Amin Vahdat, Benjamin Klenk, and Eiman Ebrahimi.
Sip-ml: High-bandwidth optical network interconnects
for machine learning training. In Proceedings of the
2021 ACM SIGCOMM 2021 Conference, SIGCOMM
’21, pages 657–675, New York, NY, USA, 2021.
Association for Computing Machinery.
[80]
Alex Krizhevsky, Geoffrey Hinton, et al. Learning
multiple layers of features from tiny images. 2009.
[81]
Mu Li, David G. Andersen, Jun Woo Park, Alexander J.
Smola, Amr Ahmed, Vanja Josifovski, James Long, Eu-
gene J. Shekita, and Bor-Yiing Su. Scaling distributed
machine learning with the parameter server. OSDI’14,
pages 583–598. USENIX Association, 2014.
[82]
Yujun Lin, Song Han, Huizi Mao, Yu Wang, and
William J Dally. Deep gradient compression: Reducing
the communication bandwidth for distributed training.
arXiv preprint arXiv:1712.01887, 2017.
[83]
He Liu, Feng Lu, Alex Forencich, Rishi Kapoor,
Malveeka Tewari, Geoffrey M. Voelker, George Papen,
Alex C. Snoeren, and George Porter. Circuit switching
under the radar with REACToR. NSDI’14, pages 1–15.
[84]
He Liu, Matthew K. Mukerjee, Conglong Li, Nicolas
Feltman, George Papen, Stefan Savage, Srinivasan
Seshan, Geoffrey M. Voelker, David G. Andersen,
Michael Kaminsky,George Porter, and Alex C. Snoeren.
Scheduling techniques for hybrid circuit/packet net-
works. In Proceedings of the 11th ACM Conference on
Emerging Networking Experiments and Technologies,
CoNEXT ’15, New York, NY, USA, 2015. Association
for Computing Machinery.
[85]
Yunpeng James Liu, Peter Xiang Gao, Bernard Wong,
and Srinivasan Keshav. Quartz: A new design element
for low-latency dcns. SIGCOMM’14, pages 283–294.
[86]
Kshiteej Mahajan, Arjun Balasubramanian, Arjun
Singhvi, Shivaram Venkataraman, Aditya Akella,
Amar Phanishayee, and Shuchi Chawla. Themis:
Fair and efficient GPU cluster scheduling. In 17th
USENIX Symposium on Networked Systems Design
and Implementation (NSDI 20), pages 289–304, Santa
Clara, CA, February 2020. USENIX Association.
[87]
William M. Mellette, Rajdeep Das, Yibo Guo, Rob
McGuinness, Alex C. Snoeren, and George Porter.
Expanding across time to deliver bandwidth efficiency
and low latency. NSDI’20, 2020.
[88]
William M. Mellette, Rob McGuinness, Arjun Roy,
Alex Forencich, George Papen, Alex C. Snoeren, and
George Porter. Rotornet: A scalable, low-complexity,
optical datacenter network. SIGCOMM ’17, pages
267–280, 2017.
[89]
Azalia Mirhoseini, Hieu Pham, Quoc V. Le, Benoit
Steiner, Rasmus Larsen, Yuefeng Zhou, Naveen Kumar,
Mohammad Norouzi, Samy Bengio, and Jeff Dean.
Device placement optimization with reinforcement
learning. In Doina Precup and Yee Whye Teh, editors,
Proceedings of the 34th International Conference
on Machine Learning, volume 70 of Proceedings
of Machine Learning Research, pages 2430–2439,
International Convention Centre, Sydney, Australia,
06–11 Aug 2017. PMLR.
[90]
Dheevatsa Mudigere, Yuchen Hao, Jianyu Huang,
Zhihao Jia, Andrew Tulloch, Srinivas Sridharan, Xing
Liu, Mustafa Ozdal, Jade Nie, Jongsoo Park, Liang
Luo, Jie Amy Yang, Leon Gao, Dmytro Ivchenko, Aarti
Basant, Yuxi Hu, Jiyan Yang, Ehsan K. Ardestani, Xi-
aodong Wang, Rakesh Komuravelli, Ching-Hsiang Chu,
Serhat Yilmaz, Huayu Li, Jiyuan Qian, Zhuobo Feng,
Yinbin Ma, Junjie Yang, Ellie Wen, Hong Li, Lin Yang,
Chonglin Sun, Whitney Zhao, Dimitry Melts, Krishna
Dhulipala, KR Kishore, Tyler Graf, Assaf Eisenman,
Kiran Kumar Matam, Adi Gangidi, Guoqiang Jerry
Chen, Manoj Krishnan, Avinash Nayak, Krishnakumar
Nair, Bharath Muthiah, Mahmoud khorashadi, Pallab
Bhattacharya, Petr Lapukhov, Maxim Naumov, Lin
Qiao, Mikhail Smelyanskiy, Bill Jia, and Vijay Rao.
Software-hardware co-design for fast and scalable train-
ing of deep learning recommendation models, 2021.
[91]
Matthew K. Mukerjee, Christopher Canel, Weiyang
Wang, Daehyeok Kim, Srinivasan Seshan, and Alex C.
Snoeren. Adapting TCP for reconfigurable datacenter
networks. In 17th USENIX Symposium on Networked
Systems Design and Implementation (NSDI 20), pages
651–666, Santa Clara, CA, February 2020. USENIX
Association.
[92]
Shar Narasimhan. NVIDIA Clocks World’s Fastest
BERT Training Time and Largest Transformer Based
Model, Paving Path For Advanced Conversational
AI, Aug. 2019.
https://devblogs.nvidia.com/
training-bert- with-gpus/.
[93]
Deepak Narayanan, Aaron Harlap, Amar Phanishayee,
Vivek Seshadri, Nikhil R. Devanur, Gregory R. Ganger,
Phillip B. Gibbons, and Matei Zaharia. Pipedream:
Generalized pipeline parallelism for dnn training. In
Proceedings of the 27th ACM Symposium on Operating
Systems Principles, SOSP’19, pages 1–15, New York,
NY, USA, 2019. Association for Computing Machinery.
[94]
Maxim Naumov, John Kim, Dheevatsa Mudigere,
Srinivas Sridharan, Xiaodong Wang, Whitney Zhao,
Serhat Yilmaz, Changkyu Kim, Hector Yuen, Mustafa
Ozdal, Krishnakumar Nair, Isabel Gao, Bor-Yiing Su,
Jiyan Yang, and Mikhail Smelyanskiy. Deep learning
training in facebook data centers: Design of scale-up
and scale-out systems, 2020.
[95]
Maxim Naumov, Dheevatsa Mudigere, Hao-
Jun Michael Shi, Jianyu Huang, Narayanan Sun-
daraman, Jongsoo Park, Xiaodong Wang, Udit Gupta,
Carole-Jean Wu, Alisson G. Azzolini, Dmytro Dzhul-
gakov, Andrey Mallevich, Ilia Cherniavskii, Yinghai Lu,
Raghuraman Krishnamoorthi, Ansha Yu, Volodymyr
Kondratenko, Stephanie Pereira, Xianjie Chen, Wenlin
Chen, Vijay Rao, Bill Jia, Liang Xiong, and Misha
Smelyanskiy. Deep learning recommendation model
for personalization and recommendation systems, 2019.
[96]
T. T. Nguyen, M. Wahib, and R. Takano. Topology-
aware sparse allreduce for large-scale deep learning. In
2019 IEEE 38th International Performance Computing
and Communications Conference (IPCCC), pages 1–8,
2019.
[97]
Yanghua Peng, Yibo Zhu, Yangrui Chen, Yixin Bao,
Bairen Yi, Chang Lan, Chuan Wu, and Chuanxiong Guo.
A generic communication scheduler for distributed
dnn training acceleration. In Proceedings of the 27th
ACM Symposium on Operating Systems Principles,
SOSP ’19, page 16–29, New York, NY, USA, 2019.
Association for Computing Machinery.
[98]
Genzhi Photonics. 1x2 Mechanical Optical Switch,
2021.
https://www.gezhiphotonics.com/1x2-
optical-switch.html.
[99]
George Porter, Richard Strong, Nathan Farrington, Alex
Forencich, Pang Chen-Sun, Tajana Rosing, Yeshaiahu
Fainman, George Papen, and Amin Vahdat. Integrating
microsecond circuit switching into the data center.
SIGCOMM’13, pages 447–458.
[100]
Aurick Qiao, Sang Keun Choe, Suhas Jayaram
Subramanya, Willie Neiswanger, Qirong Ho, Hao
Zhang, Gregory R. Ganger, and Eric P. Xing. Pollux:
Co-adaptive cluster scheduling for goodput-optimized
deep learning. In 15th USENIX Symposium on
Operating Systems Design and Implementation (OSDI
21), pages 1–18. USENIX Association, July 2021.
[101]
J. R. Quinlan. Induction of decision trees. Mach.
Learn., 1(1):81–106, March 1986.
[102]
Leslie Reid. MOX Announces New Teles-
cent Automation Technology on Its Latest
Hillsboro to Portland Fiber Route, Sept. 2020.
https://www.businesswire.com/news/home/
20200915005391/en/MOX-Announces- New-
Telescent-Automation- Technology-on-Its-
Latest-Hillsboro- to-Portland-Fiber- Route.
[103]
Peter Sanders, Jochen Speck, and Jesper Larsson Träff.
Two-tree algorithms for full bandwidth broadcast, re-
duction and scan. Parallel Computing, 35(12):581–594,
2009.
[104]
Tae Joon Seok, Niels Quack, Sangyoon Han, Richard S.
Muller, and Ming C. Wu. Large-scale broadband
digital silicon photonic switches with vertical adiabatic
couplers. Optica, 3(1):64–70, Jan 2016.
[105]
Alexander Sergeev and Mike Del Balso. Horovod: fast
and easy distributed deep learning in tensorflow. CoRR,
abs/1802.05799, 2018.
[106]
Mohammad Shoeybi, Mostofa Patwary, Raul Puri,
Patrick LeGresley, Jared Casper, and Bryan Catanzaro.
Megatron-lm: Training multi-billion parameter
language models using model parallelism, 2020.
[107]
Karen Simonyan and Andrew Zisserman. Very
deep convolutional networks for large-scale image
recognition, 2015.
[108]
Ankit Singla, Chi-Yao Hong, Lucian Popa, and
P. Brighten Godfrey. Jellyfish: Networking data
centers randomly. In Proceedings of the 9th USENIX
Conference on Networked Systems Design and
Implementation, NSDI’12, pages 17–17, Berkeley, CA,
USA, 2012. USENIX Association.
[109]
Jakub Tarnawski, Amar Phanishayee, Nikhil R.
Devanur, Divya Mahajan, and Fanny Nina Paravecino.
Efficient algorithms for device placement of DNN
graph operators. In Hugo Larochelle, Marc’Aurelio
Ranzato, Raia Hadsell, Maria-Florina Balcan, and
Hsuan-Tien Lin, editors, Advances in Neural Informa-
tion Processing Systems 33: Annual Conference on
Neural Information Processing Systems 2020, NeurIPS
2020, December 6-12, 2020, virtual, 2020.
[110]
Rajeev Thakur, Rolf Rabenseifner, and William Gropp.
Optimization of collective communication operations
in mpich. Int. J. High Perform. Comput. Appl.,
19(1):49–66, February 2005.
[111]
Rajeev Thakur, Rolf Rabenseifner, and William Gropp.
Optimization of collective communication operations
in mpich. Int. J. High Perform. Comput. Appl.,
19(1):49–66, February 2005.
[112]
Rajeev Thakur, Rolf Rabenseifner, and William Gropp.
Optimization of collective communication operations in
mpich. The International Journal of High Performance
Computing Applications, 19(1):49–66, 2005.
[113]
Yuichiro Ueno and Rio Yokota. Exhaustive study
of hierarchical allreduce patterns for large messages
between gpus. In 2019 19th IEEE/ACM International
Symposium on Cluster, Cloud and Grid Computing
(CCGRID), pages 430–439, 2019.
[114]
Jakob Uszkoreit. Transformer: A Novel Neural
Network Architecture for Language Understanding,
Aug. 2017.
https://ai.googleblog.com/2017/08/
transformer-novel- neural-network.html.
[115]
Asaf Valadarsky, Gal Shahaf, Michael Dinitz, and
Michael Schapira. Xpander: Towards optimal-
performance datacenters. In Proceedings of the 12th
International on Conference on Emerging Networking
EXperiments and Technologies, CoNEXT ’16, pages
205–219, New York, NY, USA, 2016. ACM.
[116]
Guanhua Wang, Shivaram Venkataraman, Amar
Phanishayee, Jorgen Thelin, Nikhil Devanur, and
Ion Stoica. Blink: Fast and generic collectives for
distributed ml. In Conference on Machine Learning
and Systems (MLSys 2020), March 2020.
[117]
Guohui Wang, David G. Andersen, Michael Kaminsky,
Konstantina Papagiannaki, T.S. Eugene Ng, Michael
Kozuch, and Michael Ryan. c-Through: Part-time
optics in data centers. SIGCOMM’10, pages 327–338.
[118]
S. Wang, D. Li, J. Geng, Y. Gu, and Y. Cheng. Impact
of Network Topology on the Performance of DML:
Theoretical Analysis and Practical Factors. In IEEE
INFOCOM 2019 - IEEE Conference on Computer
Communications, pages 1729–1737, 2019.
[119]
Pijika Watcharapichat, Victoria Lopez Morales,
Raul Castro Fernandez, and Peter Pietzuch. Ako:
Decentralised deep learning with partial gradient
exchange. SoCC ’16, 2016.
[120]
Wencong Xiao, Romil Bhardwaj, Ramachandran
Ramjee, Muthian Sivathanu, Nipun Kwatra, Zhenhua
Han, Pratyush Patel, Xuan Peng, Hanyu Zhao, Quanlu
Zhang, Fan Yang, and Lidong Zhou. Gandiva: Intro-
spective cluster scheduling for deep learning. In 13th
USENIX Symposium on Operating Systems Design and
Implementation (OSDI 18), pages 595–610, Carlsbad,
CA, October 2018. USENIX Association.
[121]
Pengtao Xie, Jin Kyu Kim, Yi Zhou, Qirong Ho,
Abhimanu Kumar, Yaoliang Yu, and Eric Xing.
Lighter-communication distributed machine learning
via sufficient factor broadcasting. In Proceedings of the
Thirty-Second Conference on Uncertainty in Artificial
Intelligence, pages 795–804, Arlington, Virginia, USA,
2016. AUAI Press.
[122]
Peifeng Yu and Mosharaf Chowdhury. Fine-grained
GPU sharing primitives for deep learning applications.
In Inderjit S. Dhillon, Dimitris S. Papailiopoulos,
and Vivienne Sze, editors, Proceedings of Machine
Learning and Systems 2020, MLSys 2020, Austin, TX,
USA, March 2-4, 2020. mlsys.org, 2020.
[123]
Peifeng Yu, Jiachen Liu, and Mosharaf Chowdhury.
Fluid: Resource-aware hyperparameter tuning engine.
In A. Smola, A. Dimakis, and I. Stoica, editors, Pro-
ceedings of Machine Learning and Systems, volume 3,
pages 502–516, 2021.
[124]
Yue Yu, Jiaxiang Wu, and Longbo Huang. Double
quantization for communication-efficient distributed
optimization. In Advances in Neural Information
Processing Systems, volume 32, pages 4438–4449.
Curran Associates, Inc., 2019.
[125]
H. Zhao and J. Canny. Kylix: A sparse allreduce for
commodity clusters. In 2014 43rd International Con-
ference on Parallel Processing, pages 273–282, 2014.
[126]
Danyang Zhuo, Monia Ghobadi, Ratul Mahajan, Amar
Phanishayee, Xuan Kelvin Zou, Hang Guan, Arvind Kr-
ishnamurthy, and Thomas Anderson. RAIL: A case for
redundant arrays of inexpensive links in data center net-
works. In 14th USENIX Symposium on Networked Sys-
tems Design and Implementation (NSDI 17), pages 561–
576, Boston, MA, March 2017. USENIX Association.
Figure 20: DLRM traffic heatmaps with DBT ((a)-(c): Traffic Heatmaps 1-3; 16x16 server matrices, color scale 0-4 GB).
Figure 21: Double binary tree (DBT) permutations ((a)-(c): permutations 1-3 over 16 nodes).
Figure 22: CANDLE traffic heatmaps with DBT ((a)-(c): Traffic Heatmaps 1-3; 16x16 server matrices, color scale 0-2 GB).
A Tree-AllReduce and other AllReduce permutations
Section 2 established that we can manipulate the traffic of a ring-AllReduce collective by permuting the labeling of servers in the AllReduce group. Here, we illustrate how to use the same technique on another AllReduce algorithm, called tree-AllReduce.
In the tree-AllReduce algorithm, the servers are connected
logically to form a tree topology. The AllReduce operation
happens by first running a reduce operation to the root node
with recursive halving, followed by a broadcast to the rest of
the cluster with recursive doubling [112].
A common instantiation of tree-AllReduce is the
double binary tree
(DBT) algorithm described in [103]. In this
algorithm, the first step is to create a balanced binary tree for
the nodes. The properties of balanced binary trees guarantee
that one half of the nodes will be leaf-nodes, and the other
half will be in-tree; thus, a second binary tree is constructed
by flipping the labeling of the leaf and in-tree nodes. This way,
each node (except the root in both trees) has the same amount
of communication requirement for the AllReduce operation described in the previous paragraph, and bandwidth optimality is achieved. Figure 21a shows an example where, in the first binary tree, the in-tree nodes are even and the leaf nodes are odd, while the second tree flips the labeling.
Essentially, the DBT itself is an example of permuting
the node labeling to achieve an AllReduce operation with
balanced communication load. We also note that we can
permute the labeling for the entire set of nodes for a pair
of DBT to create a new pair of trees that can perform the
AllReduce operation at the same speed. Figures 21b and 21c
illustrate two other possible double binary trees, and their
corresponding traffic demand matrix for the DLRM and
CANDLE example shown in Section §2in Figures 20 and 22.
Arbitrary permutations can be used, and to limit the cases, we
could simply consider the cyclic permutations in the modular
space as described in TotientPerms.
In general, any AllReduce operation can be described as a directed graph G = (V, E), where V is the set of nodes in the cluster and E denotes the data dependencies. The permutable property says that every graph G' = (V, E') that is isomorphic to G can perform the AllReduce operation equally well, where the isomorphism between G and G' is described by an element of the symmetric group on V (generally denoted Sym(V) in group theory).
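As a small illustration of this permutable property (our sketch, not part of TOPOOPT's code), relabeling the edge set of a ring-AllReduce graph under any permutation in Sym(V) yields an isomorphic graph that runs the collective equally well:

```python
# Illustrative sketch: relabel an AllReduce dependency graph G = (V, E)
# under a permutation sigma in Sym(V); the relabeled graph is isomorphic
# to G and performs the same collective.
def relabel(edges, sigma):
    """Apply the node permutation sigma (a dict) to every directed edge."""
    return {(sigma[u], sigma[v]) for (u, v) in edges}

if __name__ == "__main__":
    n = 4
    ring = {(i, (i + 1) % n) for i in range(n)}   # a ring-AllReduce graph
    sigma = {0: 0, 1: 3, 2: 2, 3: 1}              # one element of Sym(V)
    print(sorted(relabel(ring, sigma)))           # an isomorphic ring
```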
B Size of AllReduce and MP transfers
In most workloads observed in BI GNET , the size of AllReduce
transfers is larger than the size of MP transfers for each
iteration. This is because in most cases it would not be
worthwhile if MP transfers are as large as AllReduce transfers.
Consider the DLRM example in Section 2.2 with 20 GB embedding tables with double-precision floating-point parameters. If we were to distribute these embedding tables using data parallelism, each server would need to send and receive 37.5 GB of data for the AllReduce operation. On a 100 Gbps fabric, this alone would take 3 seconds, whereas if we place the table on one server, each server only needs to transfer 32 MB of MP traffic (assuming a per-server batch size of 8192, the MP traffic is 16 servers x 8192 samples/server x 512 activations per sample x 8 bytes per activation / 16 servers = 32 MB). We note that adding pipeline parallelism can increase
the amount of MP traffic as it overlaps forward and backward
passes. Efficient ways to pipeline batches remains an active
research area [71,93] especially when hybrid parallelism is
employed. Pure model parallelism creates another type of
sparse traffic pattern where only accelerators with inter-layer
dependencies need to communicate. Our TOPOLOGYFINDER
algorithm can support such communication patterns.
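The back-of-the-envelope numbers above can be reproduced with the following sketch, which assumes the stated batch size and activation width and the standard 2(N-1)/N per-server cost of ring-AllReduce:

```python
# Reproduce the example's per-server traffic volumes (a sketch; numbers are
# taken from the example above, not new measurements).
N = 16                                     # servers
table_bytes = 20e9                         # 20 GB embedding table
allreduce_per_server = 2 * (N - 1) / N * table_bytes    # ring-AllReduce volume
seconds_at_100g = allreduce_per_server * 8 / 100e9      # ~3 s on a 100 Gbps link

batch, act_dim, bytes_per_act = 8192, 512, 8
mp_per_server = N * batch * act_dim * bytes_per_act / N  # = 32 MB (2**20 bytes/MB)
print(f"AllReduce: {allreduce_per_server / 1e9:.1f} GB per server "
      f"(~{seconds_at_100g:.1f} s at 100 Gbps); MP: {mp_per_server / 2**20:.0f} MB")
```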
On the other hand, conceptually, when the network
bandwidth goes to infinity, other overheads in the system (e.g.
CUDA kernel launch) will dominate the latency. In such cases,
it might be beneficial to choose model parallelism instead of
data parallelism, to reduce the amount of system overheads.
In particular, prior work showed 10 Tbps Silicon Photonics
links enable more aggressive model parallelism where the
size of MP traffic is significant [79]. TOPOOPT’s approach
to distribute the degree between the MP and AllReduce
sub-topologies enables us to accommodate this case as well.
Algorithm 3 CoinChangeMod pseudocode
1: procedure COINCHANGEMOD(N, G)
   Input N: Total number of nodes
   Input G: Network topology
   Output R: Routings
   ⊲ R is the routing result
2:   R = {}
   ⊲ Acquire the set of "coins" from the topology, which are the choices of Algorithm 4
3:   C = GetCoins(G)
4:   for i ∈ [1, N-1] do
     ⊲ curr_dist denotes the "distance" of a value (node distance), counted by number of "coins"
5:     curr_dist[i] = ∞
     ⊲ curr_bt records a back-trace of "coins" used to reach a value (node distance)
6:     curr_bt[i] = ∅
7:   for c ∈ C do
8:     curr_dist[c] = 0
9:     curr_bt[c] = c
10:  while curr_dist has at least one ∞ in it do
11:    for i ∈ [1, N-1] do
12:      new_dist[i] = curr_dist[i]
13:      new_bt[i] = curr_bt[i]
14:      for c ∈ C do
15:        if curr_dist[(i - c) mod N] < new_dist[i] then
16:          new_dist[i] = curr_dist[(i - c) mod N] + 1
17:          new_bt[i] = c
18:    curr_dist = new_dist
19:    curr_bt = new_bt
   ⊲ Construct the routing for each node distance from the back-trace
20:  R = GetRouteSeq(curr_bt)
21:  return R
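For concreteness, the following Python sketch captures the same idea as Algorithm 3 under our reading of the pseudocode (a multi-source shortest-path search over residues modulo N, using the available hop lengths as coins); it is a simplification, not a literal transcription:

```python
# Runnable sketch of the CoinChangeMod idea: given the hop lengths ("coins")
# present in the topology, compute for every node offset a shortest sequence
# of coins that reaches it modulo n.
from collections import deque

def coin_change_mod(n, coins):
    dist = {c % n: 1 for c in coins}
    back = {c % n: [c] for c in coins}
    queue = deque(dist)
    while queue:
        offset = queue.popleft()
        for c in coins:
            nxt = (offset + c) % n
            if nxt not in dist or dist[offset] + 1 < dist[nxt]:
                dist[nxt] = dist[offset] + 1
                back[nxt] = back[offset] + [c]
                queue.append(nxt)
    return back                        # offset -> list of coin hops

if __name__ == "__main__":
    routes = coin_change_mod(16, coins=[1, 3, 5])
    print(routes[7])                   # a shortest hop sequence summing to 7 mod 16
```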
Algorithm 4 TopPermutations pseudocode
1: procedure TOPPERMUTATIONS(N, d_k, P_k, T_MP)
   Input N: Total number of nodes
   Input d_k: Degree allocated to this AllReduce group of size k
   Input P_k: Candidate permutations for this AllReduce group of size k
   Input T_MP: Traffic matrix for MP traffic
   Output G_k: Parameter synchronization topology, as a topology matrix
   ⊲ Initially, G_k is empty
2:   G_k = []
3:   for k_0 ∈ P_k do
     ⊲ Pick d_k candidate permutations evenly, starting from k_0
4:     ProposedConns = Pick_dk(P_k, k_0)
     ⊲ Assess how much MP traffic this choice can satisfy. We want the set of candidates that maximizes the demand satisfied for the MP traffic. The metric of "satisfied MP" can have many definitions.
5:     SatisfiedMP = MPSatisfied(T_MP, ProposedConns)
6:     G_k += (the ProposedConns that maximizes SatisfiedMP)
7:   return G_k
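A compact sketch of the selection loop in Algorithm 4 (our simplification; mp_satisfied is a stand-in for whichever satisfied-MP metric is used):

```python
# Sketch of TopPermutations' selection step: try each starting offset k0,
# pick d_k evenly spaced candidate permutations, and keep the set that
# covers the most MP demand. mp_satisfied is a placeholder metric.
def mp_satisfied(mp_demand, chosen_rings):
    links = {(r[i], r[(i + 1) % len(r)]) for r in chosen_rings for i in range(len(r))}
    return sum(v for (s, d), v in mp_demand.items() if (s, d) in links)

def top_permutations(d_k, candidates, mp_demand):
    best, best_score = None, -1
    for k0 in range(len(candidates)):
        step = max(1, len(candidates) // d_k)
        proposed = [candidates[(k0 + j * step) % len(candidates)] for j in range(d_k)]
        score = mp_satisfied(mp_demand, proposed)
        if score > best_score:
            best, best_score = proposed, score
    return best

if __name__ == "__main__":
    rings = [[0, 1, 2, 3], [0, 3, 2, 1]]            # toy candidate permutations
    mp = {(0, 2): 5.0, (1, 2): 2.0}                 # toy MP demand matrix
    print(top_permutations(d_k=1, candidates=rings, mp_demand=mp))
```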
Algorithm 2 TotientPerms pseudocode
1: procedure TOTIENTPERMS(N, k)
   Input N: Total number of nodes
   Input k: AllReduce group size
   Output P_k: Set of permutations for AllReduce group of size k
   ⊲ Initially, P_k is empty
2:   P_k = {}
   ⊲ This loop runs φ(k) times, where φ is the Euler totient function, φ(n) = |{m < n : gcd(m, n) = 1}|; one can also restrict p to be prime only
3:   for p ≤ k with gcd(p, k) == 1 do
4:     one_perm = []
5:     for i in 0 to N/k do
6:       one_perm += [i + j x p for j in 0 to k]
7:     P_k += one_perm
8:   return P_k
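The following Python sketch is written in the spirit of Algorithm 2 rather than as a literal transcription: for each p coprime with the group size k, it emits the ring ordering generated by p, replicated across the N/k groups of the cluster.

```python
# Sketch in the spirit of TotientPerms (not a literal transcription of the
# pseudocode): for each p coprime with the group size k, emit the ring
# ordering generated by p, replicated across the N/k groups of the cluster.
from math import gcd

def totient_perms(n, k):
    perms = []
    for p in range(1, k):
        if gcd(p, k) != 1:
            continue
        ring = [(j * p) % k for j in range(k)]            # ring inside one group
        perm = [g * k + r for g in range(n // k) for r in ring]
        perms.append(perm)
    return perms

if __name__ == "__main__":
    for perm in totient_perms(n=8, k=4):
        print(perm)   # two groups of 4 servers, each ordered by the generator p
```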
C TOPOLOGYFINDER Details
We first provide the mathematical foundation of the ring
permutation rule.
Theorem 1 (Ring Generation). For a cluster of N nodes V = {S_0, S_1, ..., S_{N-1}}, every integer p < N that is co-prime with N (i.e., gcd(p, N) = 1) represents a unique ring-AllReduce permutation rule.

Proof. Consider the integers modulo N under addition, Z+_N = {0, 1, ..., N-1}. Z+_N is a cyclic group. By the fundamental theorem of cyclic groups, p is a generator of Z+_N if and only if gcd(p, N) = 1. Hence we can cover the entire Z+_N by repeatedly adding p to itself.

Now consider the graph G_{Z+_N, p} = (V_{Z+_N}, E_p), where the set of vertices is V_{Z+_N} = Z+_N and E_p = {(a x p, (a+1) x p) ∈ V_{Z+_N}^2 : a ∈ Z+_N}. The set E_p forms a cycle on G_{Z+_N, p}. Now denote our cluster as G = (V, E), where V is defined as above and E represents a set of directed links. Then G_{Z+_N, p} is isomorphic to G; hence, following the rule in E_p, we can define a valid ring in G. Furthermore, since p_i ≠ p_j guarantees that (0, p_i) ∈ E_{p_i} and (0, p_j) ∉ E_{p_i}, each p_i is guaranteed to describe a unique ring.
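The theorem is easy to sanity-check numerically; the sketch below verifies that the stride-p walk visits every node exactly once precisely when gcd(p, N) = 1:

```python
# Numerical sanity check of Theorem 1: the stride-p walk 0, p, 2p, ... (mod N)
# visits every node exactly once precisely when gcd(p, N) = 1.
from math import gcd

def is_full_ring(n, p):
    return len({(a * p) % n for a in range(n)}) == n

if __name__ == "__main__":
    N = 12
    for p in range(1, N):
        assert is_full_ring(N, p) == (gcd(p, N) == 1)
    print("generators of Z_%d: %s" % (N, [p for p in range(1, N) if gcd(p, N) == 1]))
```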
Algorithms 2, 3, and 4 list the detailed pseudocode of the sub-modules in Algorithm 1, namely TotientPerms, CoinChangeMod, and TopPermutations.
To extend our approach to other AllReduce algorithms, one way is to generalize TotientPerms (Algorithm 2) so that the E_p described in Theorem 1 simply represents a permutation which we apply to the original node labeling, while keeping the edge relations, to create an isomorphic graph that describes the new AllReduce topology.
Algorithm 5 TOPOOPT-reconfig pseudocode
1: procedure TOPOOPT-RECONFIG(V, T, d, L)
   Input V: Nodes in the network
   Input T: Unsatisfied traffic demand matrix
   Input d: Node degree limit
   Input L: Number of links between each ordered node pair, initially zero
   Output E: Allocated links, initially empty
   ⊲ Initially, E is empty
2:   E = {}
   ⊲ Initially, each node has d available tx and rx interfaces
3:   for v ∈ V do
4:     available_tx[v] = d
5:     available_rx[v] = d
   ⊲ Create new links according to the demand list
6:   while ∃ i, j < |V| : i ≠ j, available_tx[v_i] > 0, available_rx[v_j] > 0 do
     ⊲ Allocate a direct connection for the highest-demand pair
7:     (v_1, v_2) = node pair with the highest demand in T
8:     e = NewLink(v_1, v_2)
9:     E = E ∪ {e}
     ⊲ Increment the number of parallel links from v_1 to v_2
10:    L(v_1, v_2) += 1
     ⊲ Scale the demand down by the number of links
11:    T(v_1, v_2) x= 1/2
     ⊲ Update available interfaces
12:    for v ∈ (v_1, v_2) do
13:      available_tx[v_1] -= 1
14:      available_rx[v_2] -= 1
     ⊲ Stop considering nodes with zero available interfaces
15:    if available_tx[v_1] == 0 then
16:      for u ∈ V do
17:        Remove (v_1, u)'s entry from T
18:    if available_rx[v_2] == 0 then
19:      for u ∈ V do
20:        Remove (u, v_2)'s entry from T
21:   return E
D TOPOOPT-reconfig Heuristic
Algorithm 5 describes the heuristic we use for TOPOOPT-reconfig. As mentioned in Section 3.3, our goals are to (i) provide enough bandwidth for large transfer demands, while (ii) minimizing the latency of indirect routing for nodes that do not have a direct link between them.
To achieve this goal in a reconfigurable interconnect, we propose a utility function that balances the two goals by maximizing the number of parallel links between high-demand node pairs, but with diminishing returns. More formally, assume a network topology is represented by a graph G = (V, E) and each node has degree d. We define L(i, j) to be the number of parallel links between node pair (i, j), and T(i, j) to be the amount of unsatisfied traffic demand. We define topology G's utility function as follows:

Utility(G) = Σ_{{i,j}∈E} T(i, j) x Discount(L(i, j))    (1)
The Discount function can be defined in different ways; in Algorithm 5, as well as in Algorithm 1's MP construction, we use

Discount(l) = Σ_{x=1..l} 2^(-x)    (2)

to reduce the utility of each additional link exponentially. One can also explore other discount scalings, such as linear or factorial functions.
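The sketch below renders Equation (2) and the greedy step of Algorithm 5 in Python (our simplification: it halves the remaining demand of a pair each time it receives a link, which realizes the same diminishing return as the Discount function, and omits the explicit utility bookkeeping):

```python
# Sketch of the TopoOpt-reconfig heuristic: repeatedly give a direct link to
# the node pair with the highest remaining demand, subject to per-node degree
# limits. Halving the remaining demand mirrors the exponential Discount of
# Equation (2), under which each extra parallel link is worth half the last.
def discount(num_links):
    """Equation (2): cumulative utility weight of num_links parallel links."""
    return sum(2.0 ** -x for x in range(1, num_links + 1))

def reconfig(demand, nodes, degree):
    """Greedy link allocation in the spirit of Algorithm 5."""
    tx = {v: degree for v in nodes}
    rx = {v: degree for v in nodes}
    remaining = dict(demand)              # (src, dst) -> unsatisfied bytes
    links = []
    while True:
        feasible = {p: d for p, d in remaining.items()
                    if p[0] != p[1] and tx[p[0]] > 0 and rx[p[1]] > 0 and d > 0}
        if not feasible:
            return links
        src, dst = max(feasible, key=feasible.get)
        links.append((src, dst))
        remaining[(src, dst)] *= 0.5      # diminishing return on parallel links
        tx[src] -= 1
        rx[dst] -= 1

if __name__ == "__main__":
    nodes = [0, 1, 2, 3]
    demand = {(0, 1): 100, (1, 0): 100, (2, 3): 40, (0, 3): 10}
    print(reconfig(demand, nodes, degree=2))
```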
When the fabric is reconfigurable (as in TOPOOPT-reconfig), we collect the unsatisfied traffic demand every 50 ms and run Algorithm 5 to decide the new network topology. After the new topology is computed, we pause all flows for 10 ms, representing the reconfiguration delay of the OCS, apply the new topology, and then resume the flows that have one or more corresponding physical links between their source and destination.
E Modifications to SiP-ML
Since SiP-ML’s SiP-Ring proposal is based on a physical ring
topology, its reconfiguration algorithm has several constraints
about wavelength allocation for adjacent nodes. Given that
TOP OOPTs physical topology is not a ring, directly applying
SiP-Ring’s optimization using their original C++ code have
resulted SiP-ML to perform extremely poorly in our setup. To
give SiP-ML a leg up, we observe that its formulation tries to
optimize a utility function very similar to Equation 1without
the
Discount
part (i.e.
Discount =1
), but with an ILP. While an
ILP gives the optimal solution, its runtime makes it prohibitive
for the amount of simulation parameters we explore. Therefore,
we substitute the ILP with Algorithm 5with
Discount =1
which is a heuristic that tries to achieve a similar goal.
Note that the SiP-ML paper has another design called SiP-OCS, which is architecturally more similar to TOPOOPT. In the SiP-ML paper, SiP-OCS is proposed as a one-shot reconfiguration approach due to the long reconfiguration latency of 3D-MEMS-based OCSs.