Scalable Deadlock-Free Deterministic
Minimal-Path Routing Engine for
InfiniBand-Based Dragonfly Networks
German Maglione-Mathey, Member, IEEE, Pedro Yebenes, Member, IEEE,
Jesus Escudero-Sahuquillo, Pedro Javier Garcia, Francisco J. Quiles, Member, IEEE, and Eitan Zahavi
Abstract—Dragonfly topologies are gathering great interest nowadays as one of the most promising interconnect options for
High-Performance Computing (HPC) systems. However, Dragonflies contain physical cycles that may lead to traffic deadlocks unless
the routing algorithm prevents them properly. In general, existing deadlock-free routing algorithms, either deterministic or adaptive,
proposed for Dragonflies, use Virtual Channels (VCs) to prevent cyclic dependencies. However, these topology-aware algorithms are
difficult to implement, or even unfeasible, in systems based on the InfiniBand (IB) architecture, which is nowadays the most widely used
network technology in HPC systems. This is due to some limitations in the IB specification, specifically regarding the way Virtual Lanes
(VLs), which are considered as similar to VCs, can be assigned to traffic flows. Indeed, none of the routing engines currently available
in the official releases of the IB control software has been specifically proposed for Dragonflies. In this paper, we present a new
deterministic, minimal-path routing for Dragonfly that prevents deadlocks using VLs according to the IB specification, so that it can be
straightforwardly implemented in IB-based networks. We have called this proposal D3R (Deterministic Deadlock-free Dragonfly
Routing). Specifically, D3R maps each route to a single, specific VL depending on the destination group, and according to a specific
order, so that cyclic dependencies (so deadlocks) are prevented. D3R is scalable as it requires only 2 VLs to prevent deadlocks
regardless of network size, i.e., fewer VLs than those required by the deadlock-free routing engines available in IB that are suitable for
Dragonflies. Alternatively, D3R achieves higher throughput if an additional VL is used to reduce internal contention in the Dragonfly
groups. We have implemented D3R as a new routing engine in OpenSM, the control software including the subnet manager in IB. We
have evaluated D3R by means of simulation and by experiments performed in a real IB-based cluster, the results showing that, in
general, D3R outperforms other routing engines.
Index Terms—High-performance interconnection networks, InfiniBand, Dragonfly topology, routing algorithms, deadlock-freedom
1 MOTIVATION
High-Performance Computing (HPC) systems are typi-
cally associated with low-latency and high-bandwidth
networks which lead to fast messaging between commu-
nicating processes. However, as systems become larger
and computation continues to become cheaper, the net-
work is increasingly becoming a scarce resource. Dragon-
fly topologies (see Section 2.2) are hierarchical network
designs that are currently considered as some of the most
promising interconnection patterns for HPC systems,
since they offer a low diameter, path diversity and low
network cost [1].
Dragonfly topologies, among others, can be implemented
[2] based on the InfiniBand (IB) architecture [3], [4]. Accord-
ing to the last Top500 list [5] IB is nowadays the most widely
used network technology in HPC systems. In addition to
offering flexibility for implementing different network
topologies, IB provides mechanisms for implementing suit-
able routing engines. In network topologies that contain
physical cycles, an essential feature of the routing engine is
guaranteeing deadlock freedom, i.e., guaranteeing that cir-
cular dependencies among buffers and network resources
(“credit loops” in the IB jargon) cannot appear [3], [6], [7].
For instance, Dragonflies in general offer a rich path-
diversity at the cost of containing physical cycles, which
may lead to deadlocks if no countermeasures are taken by
the routing engine.
Most of the deadlock-free routing engines available in IB
are based on the management of Virtual Lanes, which are
separate logical communication links that share the band-
width of every physical link. A separate flow control for
each VL is implemented at both sides of every link, so that,
in IB, buffering is channeled through VLs at the switch ports
and at the Host Channel Adapters (HCAs) that connect
endnodes to the network. In that sense, VLs are similar to
Virtual Channels (VCs) [8], and indeed VLs can be used,
like VCs, to prevent cyclic dependencies, so deadlocks.
G. Maglione-Mathey, P. Yebenes, J. Escudero-Sahuquillo, P. J. Garcia
and F. J. Quiles are with the Computer Systems Department, Univer-
sity of Castilla-La Mancha, Campus Universitario, Albacete s/n 02071,
Spain. E-mail: german.maglione@dsi.uclm.es, {pedro.yebenes, jesus.
escudero, pedrojavier.garcia, francisco.quiles}@uclm.es.
Eitan Zahavi is with Mellanox Technologies, Yokneam 20692, Israel.
E-mail: eitan@mellanox.com.
Manuscript received 18 Aug. 2016; revised 16 June 2017; accepted 7 Aug.
2017. Date of publication 21 Aug. 2017; date of current version 8 Dec. 2017.
(Corresponding author: German Maglione-Mathey.)
Recommended for acceptance by J. L. Träff.
For information on obtaining reprints of this article, please send e-mail to:
reprints@ieee.org, and reference the Digital Object Identifier below.
Digital Object Identifier no. 10.1109/TPDS.2017.2742503
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 29, NO. 1, JANUARY 2018 183
1045-9219 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
However, not all the routing algorithms that prevent dead-
locks based on VCs can be directly implemented through
VLs. In particular, VC-based deadlock-free routing algo-
rithms, either deterministic or adaptive, specially proposed
for Dragonflies, are difficult to implement (or even unfeasible)
in IB-based systems due to some limitations in the IB specifi-
cation (see Sections 2.3 and 3). Indeed, the official releases
of the IB control software do not provide any deadlock-free
routing engine specially tailored to Dragonflies, but generic
ones such as LASH [9] or DFSSSP [10] (see Section 2.3). How-
ever, these topology-agnostic routing engines do not scale, in
terms of network resources, when applied to Dragonflies, as
the number of VLs that they require to prevent deadlocks
increases with network size.
In this paper we propose a new routing algorithm specially
tailored to Dragonfly networks which is deadlock-free, deter-
ministic and minimal-path, besides being scalable and easily
implementable in IB. This proposal, that we have called D3R
(Deterministic Deadlock-free Dragonfly Routing) is valid for
any connection pattern in the intra-group network (see
Section 2.2), but requires that the inter-group one is fully connected¹ (see Section 4.1). D3R leverages the properties of these
Dragonfly configurations to map each route exclusively to
one VL, according to a specific order, so that deadlocks are
prevented without need of changing the VL of a packet along
any route, which eases significantly its implementation in IB.
Moreover, the number of VLs required by D3R is very low
regardless of network size (as low as 2 if no more than 1 VL
is required to prevent deadlocks in the intra-group net-
work). Indeed, we have implemented D3R in the IB Subnet
Manager (OpenSM v3.3.19 [13]) as a new routing engine
(see Section 4.5), and this has allowed us to test D3R in a real
IB-based cluster (see Section 5.4). The results from this cluster, in addition to others obtained from simulations (see Sections 5.2 and 5.3), confirm that D3R prevents deadlocks using
fewer VLs than those required by the routing engines available
in OpenSM that are suitable for Dragonflies, regardless of
network size. These results confirm also that D3R achieves
an even more efficient behavior, outperforming other rout-
ing engines, if it is configured with one additional VL to
reduce internal contention in the Dragonfly groups.
In detail, the main contributions of this paper are:
We analyze the limitations of IB regarding the imple-
mentation of VC-based deadlock-free routing algo-
rithms proposed for Dragonflies.
We propose D3R as a deterministic, minimal-path
deadlock-free routing algorithm for Dragonflies that
can be easily implemented in IB-based networks,
using a few of the available VLs.
We prove through theoretical foundations that D3R
is completely deadlock-free, requiring 2 VLs to pre-
vent deadlocks regardless of network size (provided
that deadlocks can be prevented in the intra-group
network without need of more than 1 VL).
In addition, we propose using D3R configured with
an additional VL to provide not only deadlock free-
dom, but also to reduce intra-group contention.
We show, by means of simulation experiments, that
D3R is scalable as it can be applied to large Dragonflies without requiring more VLs or losing network performance.
We describe how we have implemented D3R in
OpenSM as a new routing engine that can be used in
IB-based Dragonflies.
We show the results of the evaluation of D3R in a
real IB-based cluster configured with different Drag-
onfly topologies, under real HPC applications.
The rest of the paper is organized as follows: Section 2
overviews the IB specification, summarizes the basics of
Dragonfly topologies, and discusses the pros and cons of
several deadlock-free routings that can be applied in Drag-
onflies. Section 3 analyzes the problems addressed by D3R.
Section 4 describes D3R in depth and provides theoretical
foundations to prove that it is deadlock free. The evaluation
results are provided and analyzed in Section 5. Finally, in
Section 6, some conclusions are drawn.
2 BACKGROUND
2.1 The InfiniBand Specification
Interconnection networks based on the InfiniBand (IB) spec-
ification are made of physical devices, such as switches,
HCAs and links, together with control software entities,
such as the subnet manager (SM) that discovers and config-
ures the IB-based network. IB-based switches contain a set
of ports, while HCAs contain one (or several) ports to con-
nect endnodes to the network. Every port at switches or
HCAs contains a buffer to store packets to be forwarded.
Fig. 1 shows an example of an IB-based switch configured
with k input and k output ports. Note that, for the sake of
clarity, we show unidirectional input and output ports,
even though IB assumes bidirectional ports at switches.
A switch forwards incoming packets from input ports to
output ports, based on their destination endnode and on the
routing information contained at the linear forwarding table
(LFT). Basically, for a given packet destination, a query to
the LFT returns the output port of that packet at the current
switch. Buffers at switch and HCAs ports are channeled
through Virtual Lanes in order to split the physical link into
logical entities that share the total link bandwidth. The IB
specification defines at most 15 data VLs, although commer-
cial IB-based hardware (such as that supplied by Mellanox)
only offers 8 data VLs. In order to prevent packet loss due
to buffer overflow, the IB specification also defines a flow-
control scheme based on credits at VL-level, so that VLs are
Fig. 1. The InfiniBand-based switch architecture.
1. Note that the usual configuration for the inter-group network in
real Dragonflies is fully connected [11], [12].
allowed to forward a packet only if there are available cred-
its for the VL assigned to that packet at the next switch
(or HCA) port.
The IB specification defines that VLs are assigned to packets based mainly on their Service Level (SL), which is set at the HCAs prior to their injection into the network. The
number of SLs available in IB-based networks is limited to
16, and the SL of a packet cannot be changed once that packet
is injected. Every switch (and HCA) has an SL-to-VL mapping table per output port (see Fig. 1), used to assign a specific VL to the packets requesting that output port, based on those packets' SLs and on their arrival port (in the case of switches). For that purpose, each entry of the SL-to-VL table
associates an input port and SL with a VL. Hence, a packet
can be stored at different VLs along its route, depending on
its SL and on the information of the SL-to-VL table at each
HCA or switch output port crossed by this packet.
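The lookup just described can be modeled as a table keyed on the arrival port and the SL. The following Python sketch (class and method names are ours, not taken from the IB specification) illustrates how packets with the same SL may be assigned different VLs depending on their input port:

```python
# Hypothetical model of the SL-to-VL mapping table at one switch output
# port: each entry maps (input port, Service Level) -> Virtual Lane.
class SL2VLTable:
    def __init__(self):
        self.entries = {}  # (input_port, sl) -> vl

    def set_entry(self, input_port, sl, vl):
        self.entries[(input_port, sl)] = vl

    def lookup(self, input_port, sl):
        # The SL is fixed at injection, but the VL assigned may differ
        # per hop, depending on this table at each output port crossed.
        return self.entries[(input_port, sl)]

table = SL2VLTable()
table.set_entry(input_port=0, sl=0, vl=0)
table.set_entry(input_port=1, sl=0, vl=1)  # same SL, different VL by arrival port
```

The example highlights the point made above: since the SL cannot change in flight, the only per-hop degree of freedom is the (input port, SL) entry of each output-port table.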
As mentioned above, according to the IB specification,
IB-based networks (or subnets) are configured by the Sub-
net Manager. For instance, the SM implements methods to
discover the network topologies and applies the routing
algorithms (hereafter referred to as routing engines) in these
topologies. The SM also populates the LFTs at switches and
the SL-to-VL tables both at HCAs and switch ports. The
OpenSM [13] is the open-source implementation of the SM
included in the OpenFabrics Software (OFS) [14], the most
commonly used control-software distribution for IB. OFS
is supported by the OpenFabrics Alliance, which gathers
several companies promoting the IB technology.
2.2 Dragonfly Topology Basics
Fig. 2 shows a generic Dragonfly topology. Notice that
switches are organized in two hierarchical levels. In the first
level, switches and endnodes of the same group are con-
nected via local channels, forming the intra-group network.
In the second level, the groups are connected by means of
global channels that form the inter-group network. A Drag-
onfly network can be defined by three parameters [1]:
a: number of switches in each group.
p: number of endnodes connected to each switch.
h: number of channels at each switch used to connect
to other groups, i.e., the global channels.
Thus, a group contains a switches which are interconnected via local channels, and can be considered as a virtual high-radix switch. Besides, each group is interconnected to the other groups by means of a × h global channels.
Both the intra- and inter-group network topologies are
not tied to any particular connection pattern. For instance, a
fully-connected Dragonfly network (see Fig. 4) assumes a
direct link between any pair of switches within the same
group (i.e., the intra-group level), and a direct link between
any pair of groups (i.e., the inter-group level). This pattern
is the one recommended in the original paper where the
Dragonfly topology was proposed [1].
Another proposal for the intra-group connection pattern
is the Hamming graph (also known as Generalized Hyper-
cube [15] or Flattened Butterfly [16]), which has been implemented, in particular with a 2D-rectangular 16 × 6 layout, in Cray Cascade systems [12]. In mathematical terms, the Hamming graph H(d, q) is the result of the graph Cartesian product of d complete graphs K_q. Fig. 3 shows a 2D-square
Hamming-based intra-group network, where every switch
is fully-connected along each dimension to the other
switches in that dimension. Deadlocks can be avoided in
this connection pattern using, for instance, Dimension-
Order Routing (DOR) [6] inside each group.
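As an illustration of how DOR avoids intra-group cycles in a Hamming graph, the following Python sketch (the coordinate labeling of switches is our assumption) corrects one coordinate per hop in a fixed dimension order, which takes a single hop per dimension because each dimension is a complete graph:

```python
def dor_route(src, dst):
    """Dimension-Order Routing inside a Hamming-graph group.

    Switches are labeled by coordinate tuples; every switch is directly
    connected to all switches differing in exactly one coordinate, so
    DOR corrects each mismatching dimension with one hop, always in the
    same dimension order, which rules out cyclic channel dependencies.
    """
    path = [src]
    cur = list(src)
    for dim in range(len(src)):      # fixed dimension order
        if cur[dim] != dst[dim]:
            cur[dim] = dst[dim]      # one hop: the dimension is fully connected
            path.append(tuple(cur))
    return path
```

For instance, in the 3 × 3 group of Fig. 3, a route from switch (0, 2) to switch (1, 0) first corrects the first dimension and then the second.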
Both fully-connected and Hamming Dragonflies can
interconnect N = ap(ah + 1) endnodes by using g = ah + 1 groups. In order to balance channel load on load-balanced traffic, fully-connected Dragonflies should be built so that the values of parameters a, p, and h fulfill the equation a = 2p = 2h. Similarly, Hamming Dragonflies require an
equal length for every dimension (i.e., square), and, in the
case of square two-dimensional Hamming Dragonflies,
Fig. 2. Generic Dragonfly connection pattern.
Fig. 4. A deadlock situation in a 36-node fully-connected Dragonfly.
Fig. 3. Dragonfly intra-group network based on a 2D-square (3 × 3) Hamming graph.
p = √a endnodes can be connected per switch without unbalancing channel load under uniform traffic [17].
2.3 Deadlock-Free Routings for Dragonfly
A deadlock-free minimal-path routing algorithm for Dragon-
flies was proposed in the original paper where this topology
was defined [1]. This algorithm requires 2 Virtual Channels
(VCs) [8] to prevent deadlocks: basically, cyclic dependencies
are avoided by shifting the VC of the packets when they are
going to traverse a local (i.e., intra-group) channel after tra-
versing a global (i.e., inter-group) one. In general, the use of
minimal-path routing makes it possible to achieve high network
throughput for load-balanced traffic, as reported in several
works [1], [18], [19], [20], [21], since in this scenario all the links
are used evenly. Note that, even in the case that the applica-
tions do not generate “naturally” a uniform pattern of traffic
in the system, the use of task-to-node mapping strategies can
provide traffic balance, as proposed in [22], [23], [24]. Thus, in
any of these scenarios of well-balanced traffic load, using min-
imal-path routing makes sense as it provides good perfor-
mance without need of more complex approaches. On the
contrary, for adversarial-like traffic patterns, the global links
between groups of a Dragonfly may become saturated when
multiple endnodes in a group send inter-group traffic
addressed to the same destination group [1]. A similar situa-
tion may appear inside a group when multiple endnodes
attached to a switch send intra-group traffic addressed to
endnodes in the same neighboring switch. In both scenarios,
minimal-path routing may lead indeed to network perfor-
mance degradation.
As an alternative to minimal-path routing, the original
Dragonfly paper also proposes an adaptation of the non-
minimal Valiant’s routing algorithm [25] to Dragonflies, so
that packets are routed first to a randomly-chosen interme-
diate group, then to their actual destination group. In this
way, a uniform random distribution of traffic emerges even
if the original traffic pattern is adversarial-like, balancing
traffic load on local and global links so that the performance
degradation that this pattern may generate is prevented [1].
However, this is achieved at the expense of longer paths,
which may increase packet latency, and at the cost of requir-
ing 3 VCs to prevent deadlocks (packet VC must be shifted
after every inter-group hop). Moreover, Valiant’s routing
algorithm leads to approximately a doubling of the traffic
load with respect to the minimal-path routing [7], [22], and
this may subsequently lead to an unnecessary performance
degradation for uniform-like traffic patterns.
As both minimal and non-minimal routing may lead to
performance degradation under different traffic patterns,
the use of adaptive routing is also suggested in the original
Dragonfly paper, specifically the Universal Globally-Adaptive
Load-Balanced (UGAL) algorithm [26]. UGAL balances traffic
among minimal and non-minimal paths on a packet-by-
packet basis, the choice being made by using queue length
and hop count to choose the path with minimum delay. For
the Dragonfly topology, two versions of UGAL were initially
proposed: UGAL-L, that uses local queue information at the
current node, and UGAL-G, that uses queue information
from all the global channels (although the latter is considered
ideal and unfeasible in practice [1]). Both versions of UGAL
require 3 VCs to provide deadlock freedom in Dragonflies.
Note that these versions of UGAL consider link loads to
make routing decisions but not the traffic pattern, hence they
may still lead to network performance degradation under
certain traffic patterns, as reported in [18], [19], [27], [28].
These works propose alternative criteria to make routing
decisions with the aim of improving the performance of
UGAL, although in some cases at the cost of requiring an
additional VC (i.e., a total of 4 VCs) to prevent deadlocks. By
contrast, there are proposals of deadlock-free adaptive rout-
ing algorithms for Dragonflies that do not need VCs to pre-
vent deadlocks [29], [30], although in practice they require
other additional network resources. In general, deadlock-
free adaptive routings for Dragonflies introduce some
degree of network complexity and demand a higher number
of network resources with respect to deterministic minimal
routing. Moreover, adaptive routing may require packet
reordering at the destination endnodes upon out-of-order
delivery, thus introducing additional latency. As mentioned
above, all these problems of adaptive routing would be intro-
duced unnecessarily in scenarios where minimal routing suf-
fices to achieve good performance. Moreover, even in the
case that adaptive routing is required, deadlock-free mini-
mal routing may be still necessary as one of the routing
options to choose from (as in UGAL).
However, despite the fact that InfiniBand (IB) Virtual
Lanes can be considered as similar to VCs, most of the VC-
based proposals discussed above are difficult to implement,
or even unfeasible, in IB-based systems, due to some limita-
tions in the IB specification. Specifically, many of these pro-
posals require shifting the VC of the packets upon moving
between groups [1], [18], [19], [27], [28], but shifting VLs as
required may be very complicated in IB, basically because
the required configuration of the SL-to-VL tables may end
up being too complicated or even unfeasible (See Section 3
for more details). Different or similar limitations prevent, in
practice, the implementation in IB of other deadlock-free
routings proposed for Dragonflies [29], [30]. Indeed, as far
as we know, none of the deadlock-free routing algorithms
specially designed for Dragonflies has been implemented in
the official releases of OpenSM.
Nevertheless, deadlocks can be prevented in real IB-based
Dragonflies through some suitable, although topology-
agnostic, routing schemes. For instance, UPDN [31] is based
on building the spanning tree of the interconnection net-
work, then removing (i.e., restricting) paths in order to
break cyclic dependencies. However, UPDN provides non-
minimal routes when applied to Dragonfly topologies. Other
topology-agnostic routing algorithms, such as LASH [9] and
DFSSSP [10] are able to prevent deadlocks by means of the
VLs available in IB, and they are available as routing engines
in the official release of OpenSM. These routing engines pro-
vide minimal routes, and avoid cyclic dependencies by map-
ping each different route completely to a single VL (what is
usually referred to as layered routing). On the contrary, the
recently proposed, topology-agnostic routing algorithm DF-
DN [32] assigns different parts of a route to different VLs
(what is usually referred to as VL-hopping). DF-DN provides
minimal routes, but requires a “non-uniform” configuration
of the SL-to-VL tables at the output ports to shift the VL of
the packets as required, which may end up being difficult
to compute and implement (similarly to the VC-based
topology-aware algorithms discussed above). Moreover,
note that, as topology-agnostic algorithms, LASH, DFSSSP
and DF-DN are able to route on very different (even irregu-
lar) topologies, but at the cost of requiring a number of VLs
that increases with network size. Therefore, they may con-
sume VLs that otherwise could be used to provide other fea-
tures such as QoS or Congestion Management. In the worst
cases, these non-scalable routing engines may be impossible
to apply in IB-based Dragonfly topologies (especially large
ones), as they may require more VLs than those available in IB hardware (see Section 5).
The deterministic minimal-path routing algorithm for IB-
based Dragonflies proposed in this paper (D3R), like the
layered-routing algorithms mentioned above, also prevents
deadlocks by mapping each different route to a single VL
(i.e., it is not necessary to shift the VL of the packets along a
route). However, as D3R is a topology-aware routing algo-
rithm, it takes advantage of the Dragonfly properties to pre-
vent deadlocks using a very low number of VLs regardless
of network size (see Section 4), also reducing the network
setup time (see Section 5). D3R can be directly applied in IB-
based systems with Dragonfly topology, either as a stand-
alone minimal-path routing engine (as it is analyzed and
evaluated in this paper) or as part of an adaptive routing
engine (which is out of the scope of this paper).
3 PROBLEM STATEMENT
The lossless nature of IB-based networks, along with the
topology connection pattern and the routing algorithm,
may result in a deadlock situation, if cyclic dependencies
(“credit loops” in IB) exist among network resources. In
contrast with network topologies that are “naturally”
deadlock-free (like fat-trees), Dragonfly topologies contain
physical cycles that will lead to cyclic dependencies if not
prevented by the routing algorithm. Fig. 4 shows a deadlock
situation created by several minimal-path routes in a fully-
connected Dragonfly topology connecting 36 nodes.
Fig. 5 shows a portion of the channel dependency graph
(CDG) [8] corresponding to the deadlock situation shown in
Fig. 4.
Indeed, the channel dependency graph is a powerful tool
for detecting cyclic dependencies, so possible deadlock sit-
uations. Specifically, the nodes in the CDG depict the chan-
nels of the network, while the edges show the dependencies
among channels given by the routing algorithm, taking into
account that a dependency exists between channels A and B
(in that direction) if B can be requested after using A. For
instance, in Fig. 4, ch1 is a local channel connecting switches 00 and 01 at G0, and ch2 is a global channel connecting G0 and G2; thus there is a direct dependency between ch1 and ch2 (shown as an arrow in Fig. 5), as the (basic) minimal-
path routing algorithm states that both channels will be
crossed consecutively by traffic flows going from switch 00
to any switch in G2. Similar dependencies exist among other
channels, so that there is a cycle in the CDG involving channels ch1, ch2, ch3, ch4, ch5 and ch6. Therefore, it is necessary
to remove at least one dependency in that cycle to prevent
deadlocks, either by introducing restrictions in the routing
algorithm or by using VCs as escape ways [8].
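The CDG check described above amounts to cycle detection in a directed graph. The following Python sketch (our own illustration, not OpenSM code) encodes the six-channel cycle of Fig. 5 and detects it with a depth-first search:

```python
def cdg_has_cycle(deps):
    """Cycle detection in a channel dependency graph (CDG) [8].

    `deps` maps each channel to the set of channels that can be
    requested after it; a cycle in this graph means the routing
    algorithm may deadlock.
    """
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}

    def dfs(ch):
        color[ch] = GRAY                       # channel is on the current path
        for nxt in deps.get(ch, ()):
            if color.get(nxt, WHITE) == GRAY:
                return True                    # back edge: cyclic dependency
            if color.get(nxt, WHITE) == WHITE and dfs(nxt):
                return True
        color[ch] = BLACK                      # fully explored, no cycle through it
        return False

    return any(color.get(ch, WHITE) == WHITE and dfs(ch) for ch in deps)

# The cycle of Fig. 5: ch1 -> ch2 -> ch3 -> ch4 -> ch5 -> ch6 -> ch1
fig5_cycle = {f"ch{i}": {f"ch{i % 6 + 1}"} for i in range(1, 7)}
```

Removing any one of the six dependencies (e.g., by routing restrictions or by moving one flow to another VL) makes the check fail, which is exactly the condition D3R enforces.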
In IB, VLs can be used to prevent deadlocks instead of
VCs, although there are some restrictions derived from the
IB specification. For instance, a naive approach to avoid
deadlocks in IB-based Dragonflies using minimal-path rout-
ing could consist of assigning a different VL to each destina-
tion group, so that packets addressed to different groups
would never share VLs, thereby preventing any cyclic
dependency. However, according to the IB specification the
number of VLs is limited to 15 for data traffic, and many
commercial IB-based devices only implement 8 VLs for data
traffic, hence this approach is unfeasible if the number of
groups in a Dragonfly is higher than the available VLs.
An alternative approach would be to adapt to IB VLs the deadlock-free routings proposed for Dragonflies that are based on shifting the packets' VC to prevent dependencies (discussed in Section 2.3). In Dragonflies, in general, the VC
(VL) should be shifted when a packet reaching a switch
through a global channel subsequently requests a local chan-
nel for its next hop. For instance, in Fig. 4, a packet reaching
G2 through ch2 and requesting channel ch3 would shift the VL before being routed through ch3, thus breaking the dependency between ch2 and ch3, and so the cycle, in the CDG.
However, in general, shifting the packets VL may be very
complicated (or even unfeasible) in some Dragonfly topolo-
gies, due to limitations of the IB specification regarding the
way VLs can be assigned to packets (see Section 2.1). Note
that, due to the limited number of VLs and SLs, packets reach-
ing a switch through a global channel may have the same SL
as local packets, but the former should shift VL if they con-
tinue traveling, while the latter should not. As the number of
hops performed by a packet is not tracked, the only way to
distinguish between global and local incoming packets is their
input port. Hence, the required selective VL-shifting would
require configuring the entries in the SL-to-VL tables in a non-
uniform way, i.e., depending onthe input-output port combi-
nation for each entry. Moreover, this non-uniform configura-
tion of the SL-to-VL tables would vary depending on the
specific Dragonfly topology, as the number of global and local
ports may vary, as well as the switch radix. Hence, configur-
ing the SL-to-VL tables may end up being too complicated
(especially if network size grows), or at least too dependent
on the specific topology. In the worst cases there may be no
way to configure the SL-to-VL tables to achieve the required
VL-shifting, especially if VLs must be shifted several times
along a route, or for non-fully-connected intra-group net-
works, where global packets may need to perform several
local hops inside the destination group.
As a consequence of the problems mentioned above, there
are no topology-aware routings for Dragonflies implemented
as routing engines in the official releases of OpenSM, but
topology-agnostic ones such as LASH and DFSSSP, that do
Fig. 5. A portion of the CDG in the deadlock situation shown in Fig. 4.
not scale with network size in terms of the required VLs (as
also discussed in Section 2.3). The next section describes our
proposal to overcome all the problems discussed above, in
order to provide a deadlock-free routing algorithm that is scalable and suitable for IB-based Dragonflies.
4 D3R DESCRIPTION
This section presents the new routing engine called Deter-
ministic Deadlock-free Dragonfly Routing (D3R), specially
tailored to IB-based Dragonfly topologies. In the following
we describe the D3R basic approach to prevent deadlocks,
as well as an enhanced version to reduce the intra-group
contention. We also provide the theoretical foundations
behind D3R. Finally, we show some implementation details
of D3R in the IB control software.
4.1 D3R Basics
D3R is a topology-aware, minimal-path, deterministic, deadlock-free and scalable routing algorithm, suitable for Dragon-
flies using a fully-connected pattern in the inter-group
network, and any connection pattern in the intra-group net-
work. However, D3R is more efficient in terms of required
VLs if deadlock freedom can be achieved in the intra-group
network without need of more than 1 VL (see below).
D3R assigns each Dragonfly group an identifier G_i in a strictly-increasing monotonic order (see Section 4.5 for details on how to achieve this in IB). After identifying groups, D3R defines minimal-path routes, so that every path can traverse at most one global channel. In other words, the hop distance between the source and destination groups must be zero or one. For instance, Fig. 6 shows a route from a source node H_s attached to switch S_s in group G_s, to a destination node H_d attached to switch S_d in group G_d. Note that the minimal-path route traverses only one global channel (gc_sd). The pseudo-code shown in Fig. 7 reflects this minimal-path routing.
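The per-hop decision of this minimal-path routing can be sketched as follows (a minimal Python illustration consistent with the description above; the function and helper names are ours, not those of the actual routing engine, and a fully-connected intra-group network is assumed):

```python
def next_hop(cur_group, cur_switch, dst_group, dst_switch, exit_switch_for):
    """One routing decision of the minimal-path scheme: at most one local hop
    in the source group, one global hop, and one local hop in the destination
    group. exit_switch_for(g, d) gives the switch of group g owning the
    global link towards group d (assumed to be known from discovery)."""
    if cur_group == dst_group:
        if cur_switch == dst_switch:
            return ('deliver',)              # destination node attached here
        return ('local', dst_switch)         # direct intra-group hop
    exit_sw = exit_switch_for(cur_group, dst_group)
    if cur_switch == exit_sw:
        return ('global', dst_group)         # the single global hop
    return ('local', exit_sw)                # move towards the exit switch
```
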
Given this minimal-path routing, D3R establishes a mapping of routes to different VLs² in order to prevent cyclic dependencies in the CDG. Specifically, if the destination node of the route is within a group G_d that is a successor of the source group G_s (i.e., G_d > G_s, where s identifies the source group and d the destination group), the route is completely mapped to a single VL (i.e., all the hops of this route are performed through that VL). On the contrary, if G_s > G_d according to the group ordering defined previously, the route is completely mapped to a second, different VL. Routes between endnodes in the same group are mapped to one of those two VLs. In this way, cyclic dependencies (and so deadlocks) are prevented in the CDG, as shown in Section 4.2. Fig. 8 summarizes this VL-mapping stated by D3R.
It is worth mentioning that achieving this mapping of routes to VLs in IB is very simple. Indeed, it is only necessary to assign packets (prior to their injection) SL0 or SL1, depending on their source and destination groups, and to configure the SL-to-VL tables so that packets with SL0 are always mapped to VL0, and packets with SL1 are always mapped to VL1. Note that all the SL-to-VL tables would have exactly the same configuration, regardless of network size and the specific topology.
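For illustration, the SL selection just described can be sketched as below (our own sketch with illustrative names; in OpenSM the SL is actually computed by the path_sl function described in Section 4.5):

```python
def sl_for_flow(src_group, dst_group):
    """SL chosen by D3R with 2 VLs: SL1 (mapped to VL1) towards
    higher-numbered groups, SL0 (mapped to VL0) otherwise, intra-group
    flows included (as in the example of Fig. 9)."""
    return 1 if dst_group > src_group else 0

# Every SL-to-VL table then holds the same identity mapping:
SL2VL = {0: 0, 1: 1}
```
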
Fig. 9 shows an example of how D3R maps routes to VLs in a 36-node Dragonfly. Two types of arrows (i.e., bold and dashed) show the direction of the traffic flows through the global links connecting the 4 groups of this topology. For instance, if a traffic flow goes from group G_2 to group G_1, it is assigned SL0 (so mapped to VL0), since G_1 < G_2. By contrast, if a traffic flow goes from group G_2 to group G_3, it is assigned SL1 (so mapped to VL1), as G_3 > G_2. In this case, we assume VL0 for communications between nodes within the same group.
Note that, according to the VL-mapping stated above, D3R requires only 2 VLs to prevent deadlocks, regardless of network size, provided that the intra-group network can be made deadlock-free with no more than 1 VL. Otherwise, if the routing algorithm for the intra-group network created new cyclic channel dependencies, we would need additional VLs to break these dependencies. In extremely complex intra-group network configurations, a number of VLs greater than those available in IB hardware (usually 8, as in Mellanox devices) could be necessary to remove cyclic channel dependencies in the intra-group network. Nevertheless, such intra-group network configurations would be so complex that they would not be feasible in practice due to wiring and space constraints, or to the available switch radix.
Moreover, D3R can also reduce intra-group contention by using 1 additional VL (see Section 4.4).
Fig. 6. Minimal-path in a Dragonfly topology.
Fig. 7. Pseudo-code of the minimal-path routing for Dragonflies.
Fig. 8. D3R VL mapping using 2 VLs.
Fig. 9. Example of D3R in a 36-node Dragonfly (a = 3, h = 3, p = 3).
2. Note that we use "VL" here interchangeably for VC and VL, or whatever similar feature is available in network technologies other than IB.
188 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 29, NO. 1, JANUARY 2018
4.2 Solving Cyclic Dependencies
As mentioned above, D3R is able to break the cycles appearing in the CDG by requiring only 2 VLs. Fig. 10 shows the CDG resulting from applying the VL-mapping of D3R to the Dragonfly topology and traffic scenario shown in Fig. 4. VL0 stores packets routed through channels belonging to groups ordered in a strictly-decreasing monotonic order (i.e., G_3, G_2, G_1, G_0), while VL1 stores packets routed through channels belonging to groups ordered in a strictly-increasing monotonic order (i.e., G_0, G_1, G_2, G_3). As we can see, VL0 stores packets routed through the global channels ch_4 and ch_6, which correspond to the communications from G_2 to G_1 and from G_1 to G_0 (i.e., G_s > G_d), respectively. By contrast, VL1 stores flows using the global channel ch_2, used for communications from G_0 to G_2 (i.e., G_d > G_s). D3R thus breaks the cyclic dependency shown in the CDG in Fig. 5, thereby preventing deadlocks.
In general, as the number of possible groups is not constrained by the use of the monotonic ordering, D3R actually guarantees deadlock freedom using just 2 VLs, regardless of network size. This is proven in the next section.
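This claim can also be checked mechanically on a toy Dragonfly. The sketch below (our own illustration, not the paper's code) builds the channel dependency graph induced by minimal-path routes on a 4-group Dragonfly with 3 switches per group and one global link per switch, under an assumed simple wiring, and tests it for cycles with a single VL and with D3R's 2-VL mapping:

```python
from itertools import product

G, A = 4, 3  # 4 fully-connected groups, 3 switches per group, h = 1 (g = a*h + 1)

def owner(g, peer):
    """Switch of group g owning the global link to group peer (assumed wiring)."""
    return sorted(set(range(G)) - {g}).index(peer)

def route(gs, ss, gd, sd, vl_of):
    """Channel sequence of the minimal path from switch ss in group gs to
    switch sd in group gd; vl_of picks the VL from the group pair."""
    vl = vl_of(gs, gd)
    if gs == gd:
        return [('L', gs, ss, sd, vl)] if ss != sd else []
    sx, se = owner(gs, gd), owner(gd, gs)       # exit and entry switches
    hops = [('L', gs, ss, sx, vl)] if ss != sx else []
    hops.append(('G', gs, gd, vl))
    if se != sd:
        hops.append(('L', gd, se, sd, vl))
    return hops

def build_cdg(vl_of):
    """Record, for every switch pair, the dependencies between consecutive
    channels of its route (Definition-6 style CDG over virtual channels)."""
    deps = {}
    for gs, ss, gd, sd in product(range(G), range(A), range(G), range(A)):
        hops = route(gs, ss, gd, sd, vl_of)
        for c in hops:
            deps.setdefault(c, set())
        for c1, c2 in zip(hops, hops[1:]):
            deps[c1].add(c2)
    return deps

def has_cycle(deps):
    state = {}                                  # 1 = on DFS stack, 2 = done
    def dfs(v):
        state[v] = 1
        for w in deps[v]:
            if state.get(w) == 1 or (w not in state and dfs(w)):
                return True
        state[v] = 2
        return False
    return any(v not in state and dfs(v) for v in deps)

print(has_cycle(build_cdg(lambda gs, gd: 0)))             # single VL: cyclic
print(has_cycle(build_cdg(lambda gs, gd: int(gd > gs))))  # D3R, 2 VLs: acyclic
```
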
4.3 Theoretical Foundations
This section presents the theoretical foundations on which the D3R proposal is based.
Definition 1. An interconnection network I is a strongly connected directed multigraph, denoted by I = (N, C), where N is the vertex set representing the nodes, and C is the arc set representing the physical channels.
Definition 2. Assuming that the network technology supports virtual channels (VCs), let VC be the set of all virtual channels available in I. Each physical channel c_p ∈ C is a subset of VC (i.e., c_p ⊆ VC) consisting of k virtual channels, c_p = {vc_0p, ..., vc_(k-1)p} [7]. Hence, VC = {vc_xp : 0 ≤ x < k ∧ 0 ≤ p < |C|}.
Now we can consider that the nodes of the interconnection network I are connected through virtual channels of VC, i.e., I = (N, VC).
Definition 3. A deterministic routing function R of the form R : N × N → VC for the interconnection network I returns the output virtual channel vc ∈ VC to route a packet from the current node n_c to the destination node n_d, where n_c, n_d ∈ N.
Definition 4. Given the source node n_s and the destination node n_d of a packet, where n_s, n_d ∈ N, a route rt(n_s, n_d) from n_s to n_d is an ordered sequence of virtual channels vc ∈ VC that allows that packet to reach n_d from n_s. A valid route is defined by applying R at each router between n_s and n_d. Let RT be the set of all the valid routes in the network according to R.
Definition 5. Given an interconnection network I, a routing function R, and a pair of virtual channels vc_xi, vc_yj ∈ VC returned by R, there is a direct dependency from vc_xi to vc_yj if, according to R, packets can use vc_yj immediately after vc_xi in at least one valid route.
Definition 6. The channel dependency graph of a routing function R defined on an interconnection network I is a directed graph D = (VC, A), where the vertex set VC corresponds to the set of virtual channels in I, and the arc set A consists of the pairs of virtual channels {(vc_xi, vc_yj) : vc_xi, vc_yj ∈ VC} such that, according to R, there is a direct dependency from vc_xi to vc_yj.
Definition 7. A deadlock situation for a given interconnection network I and a routing function R occurs when a valid configuration exists, according to R, such that there is a set P = {p_0, ..., p_(v-1)} of packets, where each packet p_i holds at least one virtual channel, p_i has not reached its destination, and p_i is unable to proceed towards its destination because the output virtual channel requested by p_i is unavailable, as it is held by another packet in P. In other words, a deadlock forms a cycle where each packet p_i ∈ P is blocked and must wait for an unavailable virtual channel held by another packet in P.
Note that the above definitions are just a rephrasing of well-known concepts presented by Duato [33], [34] and Dally & Seitz [8]. We have included them for the sake of completeness of these theoretical foundations.
Theorem 1. A deterministic routing function R for an interconnection network I is deadlock-free iff there are no cycles in the channel dependency graph D.
Note that Theorem 1 comes from Dally & Seitz [8], where a formal proof can be found. In what follows we provide specific definitions for the Dragonfly and the D3R routing algorithm, where specific subscripts are defined.
Fig. 10. CDG of D3R using 2 VLs. Global channels are circles in white and local channels are shaded circles.
Definition 8. A Dragonfly network I_d is an interconnection network consisting of a fully-connected inter-group network interconnecting a totally ordered set of disjoint groups. Let {G_0, ..., G_(g-1)}, for any g > 0, where each G_w is a subgraph of I_d, be the set of disjoint groups, each of them extended with the corresponding output virtual channels belonging to the inter-group network: G_w = (N_w, VC_w) : N_w ⊆ N ∧ VC_w ⊆ VC.
Definition 9. Let VC_w be the set of virtual channels of the Dragonfly group G_w, for any w such that 0 ≤ w < g. Let k = 2, i.e., each physical channel c_p consists of 2 virtual channels: vc_0p and vc_1p. As c_p can be either an intra- or inter-group physical channel, VC_w can be divided into 4 disjoint ordered subsets: E_0a0, E_0e0, E_1a1 and E_1e1, where vc_0p ∈ E_0a0 and vc_1p ∈ E_1a1 for any intra-group physical channel c_p, and vc_0p ∈ E_0e0 and vc_1p ∈ E_1e1 for any inter-group physical channel c_p. The second subscripts of the subsets can be arranged so that a0 = (g - w - 1) × 2, e0 = a0 + 1, a1 = w × 2, and e1 = a1 + 1. As a result, a total ordering of the subsets E_xi is possible according to their two subscripts: E_00 < ... < E_0(2g-1) < E_10 < ... < E_1(2g-1). Now, any virtual channel vc ∈ E_xi can be labeled as vc_xiz, where E_xi = {vc_xiz : 0 ≤ x ≤ 1 ∧ 0 ≤ i < 2g ∧ 0 ≤ z < |C_w|}, and C_w is the set of all physical channels of the Dragonfly group G_w.
Lemma 1. If each virtual channel is labeled as in Definition 9, then a total ordering of virtual channels is possible according to their subscripts: vc_000 < vc_001 < ... < vc_1(2g-1)(|C_w|-1).
Definition 10 characterizes the D3R routing algorithm, which requires only two VCs to guarantee deadlock freedom regardless of network size.
Definition 10. Let D3R be a deterministic routing algorithm of the form D3R : N × N → RT, on a Dragonfly network I_d consisting of any number of groups g, such that for any pair of nodes n_s, n_d ∈ N belonging respectively to the Dragonfly groups G_n and G_m, for 0 ≤ n < g ∧ 0 ≤ m < g, it returns a route rt(n_s, n_d) ∈ RT by applying a deterministic routing function R that limits a packet to traverse at most one global channel to reach its destination and, at each router, returns one of 2 possible virtual channels, labeled as in Definition 9, according to the following criteria:

    R → vc_0iz if n ≥ m
    R → vc_1iz if n < m

Note that, if virtual channels are labeled as in Definition 2, the deterministic routing algorithm D3R provides routes restricted to vc_0p from a higher-numbered group to an equal- or lower-numbered group, and routes restricted to vc_1p from a lower-numbered group to a higher-numbered group, for any physical channel c_p in any route.
Theorem 2. The overall CDG of a Dragonfly interconnection network I_d induced by the deterministic routing algorithm D3R is acyclic if the CDGs of the intra-group networks in I_d induced by D3R are acyclic.
Proof. By contradiction, suppose that a cycle exists in the CDG of I_d. Since we require that the CDGs of the intra-group networks are acyclic for D3R, such a cycle must contain virtual channels belonging to different groups. Let us consider the virtual channels vc_xiu and vc_yjv from different groups, with i < j, and such that the virtual channel vc_yjv is the vertex just before vc_xiu. Hence, an arc exists from vc_yjv to vc_xiu. Therefore, vc_xiu ∈ VC_n and vc_yjv ∈ VC_m with n < m, as i < j. As we have assumed a deterministic routing function that restricts a route to vc_1 from a lower-numbered group to a higher-numbered group, x = y = 1. Then (vc_1jv, vc_1iu) is an arc from a virtual channel of G_m to a virtual channel of G_n with m > n, which is a contradiction. □
Corollary 1. The deterministic routing algorithm D3R is deadlock-free for a Dragonfly interconnection network I_d if the intra-group networks in I_d are deadlock-free for D3R.
Proof. By Theorem 1, the intra-group networks in I_d have no cycles in their CDGs for D3R. By Theorem 2, the CDG of I_d induced by D3R is acyclic and, as a consequence of Theorem 1, D3R is deadlock-free for I_d. □
4.4 Reducing Intra-Group Contention
Although D3R using 2 VLs prevents the deadlocks appearing in the assumed Dragonfly connection patterns, it is possible to further improve its performance if we pay attention to the intra-group traffic dynamics. Basically, three types of traffic flows may be present in a group: flows coming from outside the group, flows going out of the group, and flows making an intra-group trip. As mentioned above, the intra-group traffic (i.e., G_s = G_d) is always mapped to VL0 (see Fig. 8). Hence, intra-group traffic is mapped to the same VL as traffic flows coming from groups labeled with a higher identifier.
A very simple solution to improve the performance of D3R is to separate intra-group traffic from the other two types of traffic by using an additional VL. Specifically, the third VL is used only for intra-group communication, the remaining two VLs being used as before. Fig. 11 shows the D3R VL-mapping extended to use 3 VLs. Note that the main difference with respect to the previous version of D3R is the use of VL2 when a packet's source and destination groups are equal (i.e., G_d = G_s).
As we show in the evaluation (see Section 5), D3R using 3 VLs significantly reduces contention inside the group, because the intra-group traffic is set aside in a VL different from that of the other types of traffic flows, removing their interaction.
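The extended mapping amounts to one extra case in the SL selection; a sketch (our own illustrative names, assuming identity SL-to-VL tables as before):

```python
def sl_for_flow_3vl(src_group, dst_group):
    """SL chosen by D3R with 3 VLs: SL2 (mapped to VL2) isolates
    intra-group traffic; inter-group flows keep the 2-VL rule."""
    if src_group == dst_group:
        return 2
    return 1 if dst_group > src_group else 0
```
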
4.5 Implementation Overview
This section describes some implementation details of D3R in OpenSM v3.3.19 [13]. Basically, OpenSM combines the functionality of the subnet manager and the subnet agent (SA) server. It implements different routing algorithms by means of routing engines, which compute the routing tables and load this information onto the switches.
Fig. 11. D3R VL mapping using 3 VLs.
The SM discovers the network topology, assigns LIDs to network devices, and populates the routing tables and the SL-to-VL tables, based on the selected routing algorithm. Once the SM finishes the setup stage it becomes idle, but it may re-invoke its algorithms if network reconfiguration is required. The information computed by the SM is then given to the SA, which acts as a database receiving queries from IB devices. An example of the information obtained from the SA is the service level that must be assigned to a specific traffic flow. Based on that SL, the traffic flow is then mapped to a specific VL. The mapping of SLs to VLs is defined by the SL-to-VL tables, placed at the ports of IB devices.
In order to implement D3R in OpenSM, we have created a new routing engine called dfly. First, OpenSM assigns LIDs to all the IB devices (i.e., endnodes and switches). Next, the dfly routing engine discovers the Dragonfly network topology, composed of switches belonging to different groups. In order to identify the switches within a group, we compute the closed neighborhood³ of the first switch detected by OpenSM. Then, we repeat this operation with each neighboring switch (including those that are actually in other groups) in order to detect the switches with more common neighbors, which are then considered as members of the same group. We repeat this step for each switch that has not yet been included in a group. The first discovered group is assigned identifier 0, the next discovered one is assigned identifier 1, and so on. After this step, the dfly routing engine has a logical structure of the network topology identifying endnodes, groups, and switches within those groups. This logical structure is used to populate the routing tables using the general algorithm shown in Fig. 7, and the particular intra-group routing (1-hop for fully-connected intra-group networks and DOR for Hamming-based ones).
Then, the routing tables are loaded onto the network switches. Moreover, in order to answer the SL queries received by the SA, we have implemented in OpenSM the function path_sl, containing the algorithms shown in Figs. 8 and 11. These actions enable the dfly routing engine to compute the SL dynamically. The D3R implementation falls back to the DFSSSP routing engine if it does not detect the structure of a supported Dragonfly.
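The neighborhood-based grouping can be sketched as follows (a toy reconstruction of the idea, not the dfly code; the membership threshold below is tuned for small configurations with a = 3 switches per group, and a real implementation would derive it from the detected switch radix):

```python
def discover_groups(adj):
    """Partition switches into Dragonfly groups by the overlap of closed
    neighborhoods. adj maps each switch to the set of its neighbor switches.
    Sketch only: assumes fully-connected intra-group networks."""
    groups, assigned = [], set()
    for s in adj:
        if s in assigned:
            continue
        closed = adj[s] | {s}                    # closed neighborhood of s
        # switches sharing more than 2 members of N[s] become group mates
        group = {t for t in closed if len((adj[t] | {t}) & closed) > 2}
        groups.append(sorted(group))
        assigned |= group
    return groups
```
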
5 EVALUATION
This section describes the evaluation of D3R. We explain the applied methodology and the simulation model. Next, we analyze the results obtained through simulations and through real-traffic workloads run in a cluster.
5.1 Methodology
We have performed simulations and experiments with real IB-based hardware, using a framework which integrates IB control software, IB-based hardware and OMNeT++-based simulators (see Fig. 12). We have extended previously proposed tools [2], [35], [36].
We use several modules of the OpenFabrics Software [14], such as ibsim, which simulates the control traffic in the IB fabric, the IB tools, used to test and verify the configuration of the IB fabric, and OpenSM. Moreover, we have developed the RAAP HPC (RHPC) tools [35]. RHPC provides utilities to obtain information about the network, such as the topology or the routing paths, which can be used to feed the OMNeT++-based simulator described in the next section. This framework allows us to test new routing engines implemented in OpenSM before running them on real hardware, or to simulate them in large IB-based networks.
5.2 Simulation Model
An accurate simulator is required to evaluate D3R in large networks, since we could not afford to build such large systems. We have used the IB simulator contributed by Mellanox Technologies to the OMNeT++ community in 2008 [37]. Since then, the model has been used to predict the performance of large IB networks in various publications [10], [36], [38], [39], [40], [41].
At the crux of this simulator there is a network that sends and receives small IB packet fragments of 64 bytes, over links that impose credit-based link-level flow control according to the IB specification. This model, shown in Fig. 13, utilizes a hierarchical design which re-uses the same blocks to build IB switches and HCAs. The basic blocks are: a generator, a sink, an input buffer, a VL arbiter and an output buffer.
The generator (gen) is responsible for breaking incoming application-level messages into packets of the Maximum Transfer Unit (MTU) size and, furthermore, each packet into a set of 64 B segments denoted flits. The generation of flits is controlled by a self-timed push event which emulates the PCIe output bandwidth. The sink block assembles incoming flits back into the original messages. The self-timed event pop imitates the performance of the PCIe bus that delivers the IB messages into the host memory. Operating-system jitter, also known as OS noise, is modeled by the ON/OFF behavior of the sink.
The input-buffer (ibuf) block implements a Virtual Output Queues scheme. As such, it is responsible for holding a large set of queues, one for each VL and output-port pair, which store incoming flits. The ibuf limits the growth of any of these queues to prevent a situation where a specific VL could end up hogging the full buffer space [34]. Hence, a minimum buffer space is guaranteed to every VL at the ports of both host channel adapters (HCAs) and switches. The head of each of these queues is sent to the vlarb module in response to receiving a sent message from the particular output port and VL.
Fig. 12. Overview of the evaluation framework.
3. The closed neighborhood of a vertex v in a graph G is the subgraph containing the vertex v itself and all vertices adjacent to v.
However, a parameter of the input buffer controls the number of supported concurrent packet reads to different output ports. Consequently, when there are too many concurrent reads, the sent event is ignored and the head of the queue is not sent to the vlarb. On sending the last flit of a packet, the ibuf sends a done message to all the connected vlarb blocks, to enable them to re-send the sent message. The input buffer is also responsible for receiving the IB flow-control messages and for sending updates of the buffer credits available at the other side of the link to the vlarb via TxCred messages. The ibuf also reports its available space to the other side of the link by sending an RxCred message to the obuf. The VL arbiter (vlarb) is responsible for arbitrating between the input-port traffic and the VLs. The implemented arbitration follows the principles provided in iSLIP [42]. Following these principles, an arbitration may fail if the selected input buffer has already reached the maximum number of concurrent packet transfers or the output buffer is full. The output buffer (obuf) is responsible for sending the flits at line rate. It is also responsible for sending flow-control credit updates at a rate that is not slower than a self-timed event named min time, following the IB definition.
5.3 Simulation Experiments
This section evaluates D3R compared to other routing engines, using the framework described above. Regarding the simulator configuration, we assume PCIe 3.0 8X interfacing HCAs to the traffic generator (gen module), so that there is no bottleneck in traffic generation. The simulator models a hiccup in the PCIe interface every 0.1 ms lasting for 0.01 ms. We assume QDR 4x links (i.e., 40 Gb/s of theoretical bandwidth, reduced to 32 Gb/s due to the 8b/10b encoding protocol). We model copper cables with a length of 7 m (i.e., the maximum length for an active copper QDR cable shipped by Mellanox). The latency introduced by these cables is 6.1 ns/m, so that the propagation delay for each packet is 43 ns. Regarding switches, the simulator models groups of ports according to the simulated Dragonfly scenarios, depending on the values a, p and h. Input buffers (ibuf) store 128 flits per VL, and each flit consumes 64 bytes (i.e., 1,024 flits in total, or 64 KB per buffer). We use an MTU size of 2 KB.
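The configuration numbers above follow from simple arithmetic, which can be double-checked as follows (our own sanity computation, not part of the simulator):

```python
# QDR 4x link: 40 Gb/s raw; 8b/10b encoding leaves 8/10 of it for data
data_gbps = 40 * 8 / 10                  # 32.0 Gb/s of effective bandwidth

# 7 m copper cable at 6.1 ns/m of propagation latency
prop_delay_ns = 7 * 6.1                  # 42.7 ns, rounded to ~43 ns

# 128 flits per VL, 8 data VLs, 64-byte flits
buffer_kib = 128 * 8 * 64 / 1024         # 64.0 KB per input buffer

print(data_gbps, prop_delay_ns, buffer_kib)
```
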
We have run experiments for the following routing engines available in OpenSM⁴:
MIN-1VL is a minimal-path routing using only 1 VL, so that deadlocks are not prevented. It shows the worst case, without deadlock prevention.
DFSSSP is a topology-agnostic, deadlock-free, minimal-path routing engine [10] available in OpenSM. As this routing is not optimized for Dragonflies, it ends up using many VLs to break cycles in large networks.
LASH is a unicast topology-agnostic routing engine for IB [9]. It uses SLs to provide deadlock-free shortest-path routing while distributing paths among VLs. Depending on the network size, the number of VLs varies.
D3R-2VL is our approach using just 2 VLs, without traffic balancing. Although it is deadlock-free, it may suffer from network contention, as it always uses only 2 VLs independently of the network size.
D3R-3VL is our approach using 3 VLs and traffic balancing. The additional VL reduces the network contention that degrades the performance of D3R-2VL.
Table 1 shows the different Dragonfly configurations used in the simulations, as well as the setup time spent by OpenSM running over ibsim for the LASH, DFSSSP and D3R routing engines. All Dragonfly network configurations assume a fully-connected inter-group network. DFF configurations use a fully-connected intra-group network, while DHF configurations use a Hamming-based intra-group network. We assume Dragonflies with p nodes connected to each switch, a switches per group and h global links per switch (see Section 2.2). We verified that the D3R routing engine (and also the other ones) builds correctly for the selected Dragonfly topologies, and that it is able to break cycles by always using 2 VLs, regardless of network size. Table 1 also shows that D3R does not spend a significant setup time in ibsim to configure the fabric, regardless of network size. However, as LASH and DFSSSP require more than 8 VLs (i.e., those available in IB-based devices) to break cycles in some configurations, OpenSM fails in the setup stage because it cannot configure more VLs than the available ones. Therefore, we modified the code of LASH and DFSSSP in OpenSM to avoid this failure and to compute the required number of VLs. In these cases we have discarded the setup time ("-" symbol in the table).
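As a side check (ours, not the paper's code), the node counts listed in Table 1 follow from the Dragonfly sizing formula, assuming the maximum number of groups g = a*h + 1 for a fully-connected inter-group network:

```python
def dragonfly_nodes(a, h, p):
    """Endnodes of a Dragonfly with a switches per group, h global links
    per switch, p nodes per switch, and the maximum g = a*h + 1 groups."""
    return p * a * (a * h + 1)

# Matches the sizes listed in Table 1, e.g.:
print(dragonfly_nodes(4, 2, 2))    # 72 (configuration #1)
print(dragonfly_nodes(16, 8, 8))   # 16512 (configuration #6)
```
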
The synthetic traffic pattern used for the simulations follows a uniform (random) distribution of destinations, where all the sources generate traffic addressed to a random destination. This traffic communication pattern is commonly used for evaluating routing algorithms. The traffic generation rate is increased from 0 to 100 percent of the maximum actual link bandwidth. Each load point represents results for 1 millisecond of simulated time. For each simulation point we obtain the average offered load, normalized against the maximum theoretical bandwidth of the network. As we only evaluate routing-algorithm efficiency and deadlock prevention, congestion scenarios are out of the scope of this paper. In such scenarios, our routing scheme should be combined with queuing schemes using additional VLs to alleviate the negative effects of congestion.
The performance metric evaluated in the experiments is the bandwidth normalized against the maximum theoretical efficiency of the network. Fig. 14 shows simulation results for balanced and oversubscribed (i.e., a = p) DFF configurations, and also for DHF configurations. Note that all these network configurations are the ones from Table 1.
Fig. 13. The ib_flit_sim simulator internal structure and events.
4. We have not considered other routing engines available in OpenSM because they do not offer deadlock prevention (e.g., MINHOP) or minimal paths (e.g., UPDN) in Dragonfly topologies. We have not considered further routing engines either, because they are not available in the official OpenSM release (e.g., DF-DN).
Figs. 14a, 14b and 14c depict performance results for the balanced (i.e., a = 2h = 2p) DFF network configurations #1, #2 and #4 of Table 1, respectively. In particular, Fig. 14a (72 endnodes) shows that DFSSSP slightly outperforms D3R-3VL, but DFSSSP uses 8 VLs (3 VLs minimum for deadlock prevention, plus 5 additional VLs for reducing intra-group contention), while D3R-3VL uses only 2 VLs for deadlock prevention plus 1 VL for reducing intra-group contention. By contrast, LASH using 2 VLs is able to avoid deadlocks, but it suffers a slight performance degradation compared to D3R-2VL, due to intra-group contention. Moreover, the number of VLs required by LASH increases with the network size, compared to D3R. Regarding MIN-1VL, the simulator detects deadlocks when the traffic load is around 65 percent, so network performance drops to near zero in this case. D3R-2VL does not trigger the deadlock detection, but it suffers
TABLE 1
Number of VLs Required for Deadlock-Freedom and Setup Time (in ibsim), in Different Fully-Connected (DFF) and Hamming (DHF) Dragonfly Intra-Group Networks

                                     LASH                DFSSSP              D3R
 #  Network Size                #VLs  Setup Time    #VLs  Setup Time    #VLs  Setup Time
DFF
 1  72 Nodes (a4 h2 p2)            2  <1 s             3  <1 s             2  <1 s
 2  342 Nodes (a6 h3 p3)           2  1 s              4  1 s              2  <1 s
 3  1,056 Nodes (a8 h4 p4)         3  30 s             5  10 s             2  2 s
 4  2,550 Nodes (a10 h5 p5)        3  6 m 6 s          6  1 m 04 s         2  3 s
 5  6,162 Nodes (a13 h6 p6)        5  1 h 42 m         6  8 m 26 s         2  4 s
 6  16,512 Nodes (a16 h8 p8)       5  33 h 22 m 35 s   7  1 h 49 m 14 s    2  23 s
DHF
 7  342 Nodes (a9 h2 p2)           7  10 s            10  -                2  <1 s
 8  1,332 Nodes (a9 h4 p4)        10  -               14  -                2  3 s
 9  2,112 Nodes (a16 h2 p4)       14  -               18  -                2  4 s
10  6,240 Nodes (a16 h4 p6)       14  -               25  -                2  15 s
Fig. 14. Simulation results for balanced and oversubscribed fully-connected (DFF) and Hamming-based (DHF) Dragonflies.
performance degradation at high traffic loads due to intra-group contention. Figs. 14b and 14c show network efficiency results for balanced DFFs connecting 342 and 2,550 endnodes. D3R-2VL is able to prevent deadlocks, but suffers from intra-group contention at high traffic loads. However, D3R-3VL reduces the effects of intra-group contention with just 3 VLs, significantly outperforming DFSSSP and LASH. Again, MIN-1VL obtains the worst performance due to the appearance of deadlocks.
Figs. 14d, 14e and 14f depict performance results for oversubscribed (i.e., a = p) DFFs based on network configurations #1, #2 and #4 of Table 1, respectively, but with the number of endnodes doubled with respect to the previous scenario. Obviously, as the number of endnodes per group doubles, the intra-group contention increases as well. Regarding the performance results, we can observe that DFSSSP-8VL, LASH-2VL and D3R-2VL, though deadlock-free, suffer a performance degradation compared to D3R-3VL, due to the intra-group contention generated by the oversubscribed Dragonfly configuration. Note that MIN-1VL is not able to deal with deadlocks. Again, D3R-3VL achieves good results even in this scenario, regardless of network size.
Figs. 14g, 14h and 14i depict performance results for the DHF network configurations #7, #8 and #10 of Table 1, respectively. In this case, we show results only for MIN-1VL, D3R-2VL and D3R-3VL, because DFSSSP and LASH are unable to route these network configurations (although LASH is valid for the DHF with 342 nodes). The performance of MIN-1VL drops to near zero as a consequence of deadlocks. D3R-2VL suffers from the effects of contention, aggravated by the limited bandwidth offered by the unbalanced Hamming-based intra-group configurations (see Section 2.2). Even in these corner-case scenarios, D3R-3VL can reduce the effect of intra-group contention by separating the local traffic from the global traffic.
5.4 Experiments in a Real Cluster
This section shows experimental results for D3R, DFSSSP and LASH obtained under real traffic workloads in the CELLIA (Cluster for the Evaluation of Low-Latency Architectures) facility, built with IB-based hardware⁵ (see Fig. 15). CELLIA allows us to test the correctness of the implementation of a routing engine, comparing simulation results with real-workload executions. Each server node in CELLIA is an HP ProLiant DL120 Gen9 with an 8-core Intel Xeon E5-2630v3 processor at 1.80 GHz and 16 GB of RAM. We installed CentOS 7 with kernel version 3.10.0-327.4.4.el7.x86_64. Each node has a dual-port Mellanox ConnectX-3 MCX353A-QCBT HCA working at QDR speed (i.e., 40 Gb/s throttled to an actual 32 Gb/s due to the 8b/10b encoding protocol). The HCAs are plugged into a x16 PCIe 3.0 interface. The HCA drivers and firmware are supplied by Mellanox in the MLNX_OFED 3.2 [44] (the HCA firmware is v2.36.5000). Specifically, we built a 36-node fully-connected Dragonfly topology in CELLIA (see Fig. 9). We used 12 8-port Mellanox IS5022 switches to build a Dragonfly with 4 groups and 3 switches per group (i.e., a = 3, h = 3, p = 3). The switch ports also work at QDR speed. The cables are Mellanox QSFP, suitable for QDR speed. Both HCAs and switches offer 9 virtual lanes per port: 8 VLs for data and 1 VL for management. We run a modified version of OpenSM v3.3.19 [13] including the D3R routing engine (see Section 4.5). We have used the following benchmarks to compare D3R, DFSSSP and LASH:
Fig. 15. Performance results of MPI-based real workloads in the 36-node Dragonfly configured in CELLIA.
5. CELLIA belongs to the RAAP research group [43], from the Albacete Research Institute of Informatics at the University of Castilla-La Mancha, Spain.
to compare D3R, DFSSSP, and LASH:
Netgauge is a benchmark used to evaluate network performance [45] under several communication patterns, such as effective bisection bandwidth (ebb), one-to-many (1toN), or many-to-one (Nto1).
We have run 36 MPI tasks in CELLIA (one per node).
Graph500 [46] measures the lookup speed in graphs spread across memory, involving the interconnection network. The performance metric is the number of traversed edges per second (TEPS). We have run the graph500_mpi_simple and graph500_mpi_replicated tests, with 36 MPI tasks in CELLIA (one per node).
HPCG [47] is a software package performing operations over a grid with double-precision floating-point values, measuring performance in GFLOP/s during the benchmark execution time. We defined the problem size to be 104 × 104 × 104 dimensions per MPI task. We executed 36 MPI tasks (one per node).
HPCC [48] is a suite of benchmarks including performance tests for interconnection networks, stressing the network moderately: PTRANS and best-effort tests (i.e., RandomlyOrderedRingBandwidth and NaturallyOrderedRingBandwidth). HPCC performance metrics are the communication bandwidth (GB/s) and latency (ms). We have run HPCC with 36 MPI tasks (one per node). At each node, one MPI task consumes 8 GB of memory with a block size of NB = 192, i.e., a problem size of N = 163,584. The rest of the parameters have been configured using the form at [49].
NAMD is an application for the simulation of large biomolecular systems modeled as atomic structures [50]. Among the NAMD-based tests, we have chosen apoa1 and f1atpase. We measure the execution time (i.e., wallclock) of both applications, configured according to their default input parameters. We
executed 36 MPI tasks (one per node).
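As a quick sanity check of the HPCC/HPL sizing quoted above (our own arithmetic, not taken from the paper or from the tuning form at [49]): HPL factorizes a single N × N double-precision matrix, so its footprint is N² × 8 bytes, and N is kept a multiple of the block size NB.

```python
# Sanity-check sketch (our own arithmetic, not from the paper) for the
# HPCC/HPL problem size: N = 163,584 with block size NB = 192 on 36 tasks
# of 8 GB each.

NB = 192
N = 163_584
tasks = 36
mem_per_task_gb = 8

assert N % NB == 0                       # N must be a multiple of NB
blocks = N // NB                         # blocks per matrix dimension

matrix_bytes = N * N * 8                 # double precision: 8 bytes/element
total_mem_bytes = tasks * mem_per_task_gb * 10**9
fraction = matrix_bytes / total_mem_bytes

print(blocks)                            # 852
print(round(fraction, 2))                # 0.74: matrix fills ~74% of the 288 GB
```

So the chosen N keeps the HPL matrix within roughly three quarters of the aggregate memory devoted to the benchmark, a common sizing practice that leaves headroom for the OS and MPI buffers.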
Regarding the mapping of MPI tasks to endnodes, we have tested two different mappings: linear and random. The linear mapping assigns MPI task i to endnode i. The random mapping performs a random assignment of tasks to endnodes.
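The two mappings can be sketched as follows (our own illustration; the function names are hypothetical and not taken from the paper's scripts):

```python
# Sketch of the two MPI task-to-endnode mappings used in the experiments.
# Linear: task i runs on endnode i. Random: a random one-to-one assignment.
import random

def linear_mapping(num_tasks):
    """Task i -> endnode i."""
    return list(range(num_tasks))

def random_mapping(num_tasks, seed=None):
    """Random permutation of endnodes, one task per endnode."""
    nodes = list(range(num_tasks))
    random.Random(seed).shuffle(nodes)
    return nodes

# 36 tasks, one per node, as in the CELLIA experiments.
lin = linear_mapping(36)
rnd = random_mapping(36, seed=0)
assert lin[10] == 10               # task 10 runs on endnode 10
assert sorted(rnd) == lin          # random mapping is still one-to-one
```

The random mapping tends to spread communicating tasks across groups, so it exercises the global links differently than the linear one.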
In order to validate our implementation of D3R, and
the simulation results, we have run a single experiment in
CELLIA for each combination of the considered routings
(D3R-2VL, D3R-3VL, DFSSSP-8VL, or LASH-2VL), bench-
marks (Netgauge, Graph500, HPCG, HPCC, or NAMD) and
task mappings (Linear or Random). Fig. 15 shows the perfor-
mance results of these experiments. For instance, HPCC tests
Ping-Pong and Ordered-Ring generate an adversarial-like
traffic pattern (achieving between 1 and 4 GB/s, approxi-
mately). In these cases, the D3R behavior is virtually identi-
cal to that of both DFSSSP and LASH. Other tests, such as
PTRANS or Netgauge-ebb generate a many-to-many traffic
pattern. In these scenarios, D3R performance results are sim-
ilar to those of DFSSSP and LASH. In general, there are small
variations among all the evaluated routing engines, because
of the small size of the CELLIA network (i.e., a 36-node Dragonfly with 12 switches). Note that the simulation in small networks (see Fig. 14a) and the experiments in CELLIA show
qualitatively similar performance results.
In summary, experiments under real and simulated sce-
narios confirm D3R as a new routing engine for IB-based
Dragonflies, offering deadlock-freedom by using a small
number of VLs, leaving the remaining VLs free to be used
for other purposes like reducing the impact of congestion
[34], [51], and/or providing Quality of Service (QoS) for dif-
ferentiated services [52].
6CONCLUSIONS AND FUTURE WORK
In this paper we propose a new scalable, deadlock-free and deterministic minimal-path routing engine for Dragonfly topologies, implemented in OpenSM v3.3.19, which is
ready to be used in IB-based systems. This new routing
engine, called D3R, requires only 2 Virtual Lanes to prevent
deadlocks regardless of network size, although it can be
configured with an additional VL to improve network per-
formance. The results obtained from simulations and from
experiments performed in a real IB-based cluster confirm
that D3R not only prevents deadlocks, but it is also able to
outperform (up to 88 percent in some network configura-
tions) other routing engines available in OpenSM that are
suitable for Dragonflies, while requiring fewer VLs.
As future work we plan to enhance this OpenSM
implementation to detect and support other inter- and
intra-group interconnection patterns. We also intend to
expand our simulation and test bed environments to
incorporate traffic loads other than those based on MPI,
such as distributed I/O, machine/deep learning and
others commonly used in data centers and HPC systems.
Moreover, we aim to incorporate more elaborate fault-tolerance support, including capabilities to route by itself (i.e., without DFSSSP) in a degraded Dragonfly. In addition, we plan to analyze possible ways to provide other
features, apart from deadlock freedom, using the VLs that
D3R leaves free, for instance developing a topology-aware
queuing scheme that balances the buffer usage to reduce
the effects of congestion.
ACKNOWLEDGMENTS
The authors thank Professor Jose Duato, from the Technical University of Valencia, and Jose Carlos Valverde, from the Mathematics Department of UCLM, for their invaluable support regarding the theoretical foundations of D3R. The authors also thank Raul Galindo for his constant technical support
regarding the cluster CELLIA. This work has been jointly
supported by the Spanish MINECO and European
Commission (FEDER funds) under the projects TIN2012-
38341-C04, UNCM13-1E-2456, TIN2015-66972-C5-2-R and
the FPI grant BES-2013-063681, and by Junta de Comuni-
dades de Castilla-La Mancha under the projects POII10-
0289-3724 and PEII-2014-028-P. Jesus Escudero-Sahuquillo
has been funded by the University of Castilla-La Mancha
(UCLM) and the European Commission (FSE funds), with a
contract for accessing the Spanish System of Science, Tech-
nology and Innovation, for the implementation of the UCLM
research program (UCLM resolution date 31/07/2014).
REFERENCES
[1] J. Kim, W. J. Dally, S. Scott, and D. Abts, “Technology-driven,
highly-scalable dragonfly topology,” in Proc. 35th Annu. Int. Symp.
Comput. Archit., 2008, pp. 77–88.
[2] G. M. Mathey, P. Yebenes, P. J. García, F. J. Quiles, and
J. Escudero-Sahuquillo, “Analyzing available routing engines for
InfiniBand-based clusters with dragonfly topology,” in Proc. Int.
Conf. High Performance Comput. Simulation, 2015, pp. 168–171.
[3] InfiniBand Trade Association, InfiniBand Architecture Specifica-
tion, vol. 1, Mar. 2015, http://www.infinibandta.org
[4] IBTA, “The infiniband trade association.” [Online]. Available:
www.infinibandta.org
[5] TOP500.org, “Top 500 the list.” [Online]. Available: www.top500.
org
[6] J. Duato, S. Yalamanchili, and N. Lionel, Interconnection Networks:
An Engineering Approach. Burlington, MA, USA: Morgan Kauf-
mann Publishers, 2002.
[7] W. Dally and B. Towles, Principles and practices of interconnection
networks. Burlington, MA, USA: Morgan Kaufmann Publishers
Inc., 2004.
[8] W. Dally and C. Seitz, “Deadlock-free message routing in multi-
processor interconnection networks,” IEEE Trans. Comput.,
vol. C-36, no. 5, pp. 547–553, May 1987.
[9] T. Skeie, O. Lysne, and I. Theiss, “Layered shortest path (lash)
routing in irregular system area networks,” in Proc. Int. Parallel
Distrib. Process. Symp., 2002, Art. no. 8.
[10] J. Domke, T. Hoefler, and W. E. Nagel, “Deadlock-free oblivious
routing for arbitrary topologies,” in Proc. IEEE Int. Parallel Distrib.
Process. Symp., 2011, pp. 616–627.
[11] B. Arimilli, et al., “The PERCS high-performance interconnect,” in
Proc. 18th IEEE Symp. High Performance Interconnects, Aug. 2010,
pp. 75–82.
[12] G. Faanes, et al., “Cray cascade: A scalable HPC system based on a
dragonfly network,” in Proc. Int. Conf. High Performance Comput.
Netw. Storage Anal., Nov. 2012, pp. 1–9.
[13] OpenSM, “Infiniband subnet manager.” [Online]. Available:
http://git.openfabrics.org/~halr/opensm.git/
[14] The open fabrics alliance. [Online]. Available: www.openfabrics.
org
[15] L. N. Bhuyan and D. P. Agrawal, “Generalized hypercube and
hyperbus structures for a computer network,” IEEE Trans. Com-
put., vol. C-33, no. 4, pp. 323–333, Apr. 1984.
[16] J. Kim, J. Balfour, and W. Dally, “Flattened butterfly topology for
on-chip networks,” in Proc. 40th Annu. IEEE/ACM Int. Symp.
Microarchitecture, 2007, pp. 172–182.
[17] C. Camarero, E. Vallejo, and R. Beivide, “Topological characteri-
zation of hamming and dragonfly networks and its implications
on routing,” ACM Trans. Archit. Code Optim., vol. 11, no. 4,
pp. 39:1–39:25, 2014.
[18] M. García, E. Vallejo, R. Beivide, M. Odriozola, and M. Valero,
“Efficient routing mechanisms for dragonfly networks,” in Proc.
42nd Int. Conf. Parallel Process., 2013, pp. 582–592.
[19] P. Faizian, M. S. Rahman, M. A. Mollah, X. Yuan, S. Pakin, and
M. Lang, “Traffic pattern-based adaptive routing for intra-group
communication in dragonfly networks,” in Proc. IEEE 24th Annu.
Symp. High-Performance Interconnects, 2016. pp. 19–26.
[20] C. Gomez, F. Gilabert, M. E. Gomez, P. Lopez, and J. Duato,
“Deterministic versus adaptive routing in fat-trees,” in Proc. IEEE
Int. Parallel Distrib. Process. Symp., 2007, pp. 1–8.
[21] E. Zahavi, “Fat-trees routing and node ordering providing conten-
tion free traffic for MPI global collectives,” in Proc. IEEE Int. Symp.
Parallel Distrib. Process. Workshops Phd Forum, May 2011, pp. 761–
770.
[22] B. Prisacari, G. Rodriguez,P. Heidelberger, D. Chen, C. Minkenberg,
and T. Hoefler, “Efficient task placement and routing of nearest
neighbor exchanges in dragonfly networks,” in Proc. 23rd Int. Symp.
High-Performance Parallel Distrib. Comput., 2014, pp. 129–140.
[23] K. T. Pedretti and T. Hoefler, “A comparison of task mapping
strategies on two generations of cray systems,” in Proc. SIAM
Conf. Parallel Process. Scientific Comput., Feb. 2014, https://cfwebprod.sandia.gov/cfdocs/CompResearch/docs/pedretti_siampp14.pdf
[24] A. Bhatele, “Task mapping on complex computer network topolo-
gies for improved performance,” Lawrence Livermore Nat. Labora-
tory, LDRD Final Report LLNL-TR-678732, Oct. 2015.
[25] L. G. Valiant and G. J. Brebner, “Universal schemes for parallel
communication,” in Proc. 13th Annu. ACM Symp. Theory Comput.,
1981, pp. 263–277.
[26] A. Singh, “Load-balanced routing in interconnection networks,” PhD dissertation, Stanford Univ., Mar. 2005.
[27] J. Won, G. Kim, J. Kim, T. Jiang, M. Parker, and S. Scott,
“Overcoming far-end congestion in large-scale networks,” in Proc.
IEEE 21st Int. Symp. High Performance Comput. Archit., Feb. 2015,
pp. 415–427.
[28] N. Jiang, J. Kim, and W. J. Dally, “Indirect adaptive routing on
large scale interconnection networks,” in Proc. 36th Annu. Int.
Symp. Comput. Archit., 2009, pp. 220–231.
[29] M. García, et al., “On-the-fly adaptive routing in high-radix hierarchical networks,” in Proc. 41st Int. Conf. Parallel Process., 2012,
pp. 279–288.
[30] D. Xiang and X. Liu, “Deadlock-free broadcast routing in dragon-
fly networks without virtual channels,” IEEE Trans. Parallel Dis-
trib. Syst., vol. 27, no. 9, pp. 2520–2532, Sep. 2016.
[31] J. Sancho, A. Robles, and J. Duato, “Effective strategy to compute
forwarding tables for InfiniBand networks,” in Proc. Int. Conf. Par-
allel Process., Sep. 2001, pp. 48–57.
[32] T. Schneider, O. Bibartiu, and T. Hoefler, “Ensuring deadlock-
freedom in low-diameter InfiniBand networks,” in Proc. 24th
Annu. Symp. High-Performance Interconnects, 2016, pp. 1–8.
[33] J. Duato, “A new theory of deadlock-free adaptive routing in
wormhole networks,” IEEE Trans. Parallel Distrib. Syst., vol. 4,
no. 12, pp. 1320–1331, Dec. 1993.
[34] J. Escudero-Sahuquillo, et al., “A new proposal to deal with con-
gestion in InfiniBand-based fat-trees,” J. Parallel Distrib. Comput.,
vol. 74, no. 1, pp. 1802–1819, 2014.
[35] G. Maglione-Mathey, P. Yebenes, J. Escudero-Sahuquillo,
P. J. Garcia, and F. J. Quiles, “Combining openfabrics software
and simulation tools for modeling InfiniBand-based interconnec-
tion networks,” in Proc. 2nd IEEE Int. Workshop High-Performance
Interconnection Netw. Exascale Big-Data Era, Mar. 2016, pp. 55–58.
[36] J. Domke, T. Hoefler, and S. Matsuoka, “Fail-in-place network
design: Interaction between topology, routing algorithm and fail-
ures,” in Proc. Int. Conf. High Performance Comput., Netw. Storage
Anal., 2014, pp. 597–608.
[37] Mellanox, “Omnet++ InfiniBand flit level simulation model,” 2008.
[Online]. Available: http://www.mellanox.com/page/omnet
[38] E. Zahavi, I. Keslassy, and A. Kolodny, “Distributed adaptive
routing for big-data applications running on data center
networks,” in Proc. 8th ACM/IEEE Symp. Archit. Netw. Commun.
Syst., 2012, pp. 99–110.
[39] E. Zahavi, “Fat-tree routing and node ordering providing conten-
tion free traffic for MPI global collectives,” J. Parallel Distrib Com-
put., vol. 72, no. 11, pp. 1423–1432.
[40] E. Zahavi, G. Johnson, D. J. Kerbyson, and M. Lang, “Optimized
InfiniBandTM fat-tree routing for shift all-to-all communication
patterns,” Concurrency Comput.: Practice Experience, vol. 22, no. 2,
pp. 217–231, 2010.
[41] E. Zahavi, I. Keslassy, and A. Kolodny, “Quasi fat trees for HPC
clouds and their fault-resilient closed-form routing,” in Proc. IEEE
22nd Annu. Symp. High-Performance Interconnects, 2014, pp. 41–48.
[42] N. McKeown, “The iSLIP scheduling algorithm for input-queued
switches,” IEEE/ACM Trans. Netw., vol. 7, no. 2, pp. 188–201,
Apr. 1999.
[43] RAAP. High-performance networks and architectures group.
[Online]. Available: http://www.i3a.info/raap
[44] Mellanox Technologies, “Mellanox OpenFabrics enterprise distri-
bution for linux (MLNX_OFED),” 2013. [Online]. Available:
http://www.mellanox.com
[45] T. Hoefler, T. Mehlan, A. Lumsdaine, and W. Rehm, “Netgauge: A
network performance measurement framework,” in Proc. 3rd Int.
Conf. High Performance Comput. Commun., 2007, pp. 659–671.
[46] The graph500 benchmark 2015. [Online]. Available: www.graph500.
org
[47] J. Dongarra, M. A. Heroux, and P. Luszczek, “HPCG benchmark:
A new metric for ranking high performance computing systems,”
EE and CS Dept., Knoxville, TN, USA, UT-EECS-15-736, 2015.
[48] The HPCC benchmark. 2015. [Online]. Available: icl.cs.utk.edu/
hpcc
[49] Advanced Clustering Technologies. 2015. [Online]. Available:
www.advancedclustering.com/act-kb/tune-hpl-dat-file
[50] J. C. Phillips, et al., “Scalable molecular dynamics with NAMD,”
J. Comput. Chemistry, vol. 26, no. 16, pp. 1781–1802, 2005.
[51] P. Yebenes, J. Escudero-Sahuquillo, P. J. García, and F. J. Quiles,
“Efficient queuing schemes for HoL-blocking reduction in drag-
onfly topologies with minimal-path routing,” in Proc. IEEE Int.
Conf. Cluster Comput., 2015, pp. 817–824.
[52] P. Yebenes, et al., “Combining HoL-blocking avoidance and dif-
ferentiated services in high-speed interconnects,” in Proc. 21st Int.
Conf. High Performance Comput., 2014, pp. 1–10.
German Maglione-Mathey received the BS
and MS degrees in Computer Science from the
University of Castilla-La Mancha (UCLM), Spain,
in 2015 and 2016 respectively. He began his
research career in 2015 as a PhD Student at the
University of Castilla-La Mancha in Spain, when
he was recruited by the Computer Architecture
Department of that University. His research inter-
ests include High Performance Computing inter-
connects and Data Center Networks and all the
strategies related to improve them, especially
network topologies, routing algorithms and congestion management.
Pedro Yebenes is PhD Student from the Univer-
sity of Castilla-La Mancha in Spain. He began his
research career in 2011 when he was recruited
by the Computer Architecture Department of that
University. He has developed a simulation tool for
modeling HPC networks using the OMNeT++
framework. Since then, he has used and updated
this tool to propose and test new techniques for
HPC networks focused on congestion control,
quality of service, and non-minimal adaptive rout-
ing algorithms. He has published his results in
more than ten international conferences and journals.
Jesus Escudero-Sahuquillo received the MS
and PhD degrees in Computer Science from the
University of Castilla-La Mancha (UCLM), Spain,
in 2008 and 2011, respectively. His research
interests include high-performance computing
and Big-Data, interconnection networks and all
the strategies related to improve them, such as
network topologies, routing algorithms, conges-
tion management, and power saving. He has
published more than 30 papers in national and
international peer-reviewed conferences and
journals. In 2006 he joined the Computer Systems Department (DSI),
UCLM, Spain. He performed several pre- and post-doc research stays in
Simula Research Labs (Norway) and Heidelberg University (Germany).
In 2014 he moved to the industry and worked for Oracle Corporation
(Norway), as a PhD Senior Engineer. In 2015 he moved to the Technical
University of Valencia (Spain), as a PostDoc research assistant funded by the national-competitive grant “Juan de La Cierva”. In 2016 he rejoined the DSI, UCLM (Spain), with a 5-year PostDoc position funded by
the UCLM and the European Commission (FSE funds). He has partici-
pated in several research projects funded by the European Commission
and the Spanish Government. He has served as program committee
and reviewer in several conferences and journals. He is co-organizer of
the IEEE International Workshop on High-Performance Interconnection
Networks in the Exascale and Big-Data Era (HiPINIEB).
Pedro Javier Garcia received a degree in com-
munication engineering from the Technical Uni-
versity of Valencia, Spain, in 1996, and the PhD
degree in computer science from the University
of Castilla-La Mancha (UCLM), Spain, in 2006. In
1999, he joined the Computing Systems Depart-
ment (DSI), UCLM, Spain, where he is currently
an assistant professor of computer architecture
and technology. His main research interests are
the design and implementation of strategies to
improve several aspects of high-performance
interconnection networks, especially congestion management schemes
and routing algorithms. He has published more than 50 refereed papers
in ranked journals and conferences. He has guided one doctoral thesis
and is guiding currently three more. He has been the coordinator of two
research projects supported respectively by the Spanish Government
and by the Government of Castilla-La Mancha, as well as the coordinator
of two Research & Development Agreements with different companies.
In addition, he has participated in other (more than 30) research proj-
ects, supported by the European Commission and the Spanish
Government. He has served as organizer committee member and
program committee member in several international conferences and
workshops, such as ICPP, HotI, CCGrid, ISC, HiPINEB. He has been
also a guest editor of several journals.
Francisco J. Quiles is a Full Professor of Com-
puter Architecture and Technology at the Comput-
ing Systems Department of UCLM. His research
interests include: high-performance interconnec-
tion networks for multiprocessor systems and
clusters, parallel algorithms for video compression
and video transmission. He has served as Pro-
gram Committee member in several conferences.
He has published over 200 papers in international
journals and conferences and participated in
68 research projects supported by the NFS, Euro-
pean Commission, the Spanish Government and Research & Develop-
ment Agreements with different companies. Also, he has guided 9
doctoral theses.
Eitan Zahavi manages the Mellanox end-to-end performance architecture group, which focuses on features that improve overall system performance for both Ethernet and InfiniBand, lossy and lossless, and also studies Optical Data Center networks. Example fields of research are application performance, congestion control, adaptive routing, tenant isolation, and topologies. The group employs large-system simulation and lab experiments to validate hypotheses and test implementations of new features.
"For more information on this or any other computing topic,
please visit our Digital Library at www.computer.org/publications/dlib.
MAGLIONE-MATHEY ET AL.: SCALAB LE DEADLOCK-FREE DETERMINISTIC MINIMAL-PATH ROUTING ENGINE FOR INFINIBAND-BASED DRAGONFLY... 197
... The available routing engines suitable for Dragonflies that follow the second approach (i.e. layered routing) are LASH [8], DFSSSP [17] and D3R [18], the former two being actually topology agnostic algorithms while the latter having been specially designed for Dragonflies. Regarding the third approach, the classical, topologyagnostic Up/Down algorithm is also available as a routing engine (UPDN) in OpenSM [19], which can be also used in Dragonflies. ...
... In that sense, in [20] the re-quirements of some of these routing engines are analyzed in terms of routing time and number of required VLs, but no performance measurements are provided. In [18], both a comparison of the requirements (in terms of number of required VLs) and a performance comparison (based on both simulation experiments and results from a real InfiniBand-based cluster) are provided, but not all the routing engines available for Dragonflies are considered. ...
... Hence, this approach has been traditionally the preferred one to design routing engines. Current routing engines available in IB that are suitable for Dragonfly networks and use Layered Routing, are DFSSSP [17], LASH [8] and D3R [18]. ...
Preprint
Full-text available
The Dragonfly topology is currently one of the most popular network topologies in high-performance parallel systems. The interconnection networks of many of these systems are built from components based on the InfiniBand specification. However, due to some constraints in this specification, the available versions of the InfiniBand network controller (OpenSM) do not include routing engines based on some popular deadlock-free routing algorithms proposed theoretically for Dragonflies, such as the one proposed by Kim and Dally based on Virtual-Channel shifting. In this paper we propose a straightforward method to integrate this routing algorithm in OpenSM as a routing engine, explaining in detail the configuration required to support it. We also provide experiment results, obtained both from a real InfiniBand-based cluster and from simulation, to validate the new routing engine and to compare its performance and requirements against other routing engines currently available in OpenSM.
... Considerable research efforts have been devoted to investigating a range of efficient network topologies for HPC systems, including Flattened Butterfly [9][10][11], Dragonfly [12][13][14], HyperX [15], Skywalk [16], and SlimFly [17]. These structures are capable of delivering low diameters for HPC systems while also ensuring scalability through the port numbers (radixes) of the construction blocks (routers) [18]. ...
Article
Full-text available
The design of interconnection networks is a fundamental aspect of high-performance computing (HPC) systems. Among the available topologies, the Galaxyfly network stands out as a low-diameter and flexible-radix network for HPC applications. Given the paramount importance of collective communication in HPC performance, in this paper, we present two different all-to-all broadcast algorithms for the Galaxyfly network, which adhere to the supernode-first rule and the router-first rule, respectively. Our performance evaluation validates their effectiveness and shows that the first algorithm has a higher degree of utilization of network channels, and that the second algorithm can significantly reduce the average time for routers to collect packets from the supernode.
... A lot of research has been devoted to exploring a variety of efficient topologies, e.g., Flattened Butterfly [7], Dragonfly [4], [5], HyperX [8], Skywalk [9], and SlimFly [10]. Their scalability is guaranteed through the port number (radix) of the construction blocks (routers) [26]. ...
... A common example is a DF with a flattened butterfly intra-group scheme and a pruned Hamming graph inter-group scheme. Real DFs have diameters ranging from 2 to 5. Routing DFs requires special attention in order to guarantee deadlock freedom [32,57,56]. Furthermore, DFs provide low performance for adversarial inter-group traffic patterns unless either fine-tuned non-minimal 3 adaptive routing techniques [48,44,67] or groupspreading job placement policies (RDR or RRR in [43]) are used. ...
Thesis
Building efficient supercomputers requires optimising communications, and their exaflopic scale causes an unavoidable risk of relatively frequent failures.For a cluster with given networking capabilities and applications, performance is achieved by providing a good route for every message while minimising resource access conflicts between messages.This thesis focuses on the fat-tree family of networks, for which we define several overarching properties so as to efficiently take into account a realistic superset of this topology, while keeping a significant edge over agnostic methods.Additionally, a partially novel static congestion risk evaluation method is used to compare algorithms.A generic optimisation is presented for some applications on clusters with heterogeneous equipment.The proposed algorithms use distinct approaches to improve centralised static routing by combining computation speed, fault-resilience, and minimal congestion risk.
Article
According to the latest TOP500 list, InfiniBand (IB) is the most widely used network architecture in the top 10 supercomputers. IB relies on Credit-based Flow Control (CBFC) to provide a lossless network and InfiniBand congestion control (IB CC) to relieve congestion, however, this can lead to the problem of victim flow since messages are mixed in the same queue and long-lived congestion spreading due to slow convergence. To deal with these problems, in this paper, we propose FlowStar, a fast convergence per-flow state accurate congestion control for InfiniBand. FlowStar includes two core mechanisms: 1) optimized per-flow CBFC mechanism provides flow state control to detect real congestion; and 2) rate adjustment rules make up for the mismatch between the original IB CC rate regulation and the per-hop CBFC to alleviate congestion spreading. FlowStar implements a per-flow congestion state on switches and can obtain in-flight packet information without additional parameter settings to ensure a lossless network. Evaluations show that FlowStar improves average and tail message complete time under different workloads.
Article
Based on the most recent TOP500 rankings, Infiniband (IB) stands out as the dominant network architecture among the top 10 supercomputers. Yet, it primarily employs deterministic routing, which tends to be suboptimal in network traffic balance. While deterministic routing invariably opts for the same forwarding path, adaptive routing offers flexibility by permitting packets to traverse varied paths for every source-destination pair. Contemporary adaptive routing methods in HPC networks typically determine path selection rooted in the switch queue's occupancy. While the queue length provides a glimpse into local congestion, it's challenging to consolidate such fragmented information to portray the full path accurately. In this paper, we introduce Alarm, an adaptive routing system that uses probabilistic path selection grounded in one-way delay metrics. The one-way delay not only offers a more holistic view of congestion, spanning from source to destination, but also captures the intricacies of network flows. Alarm gleans the one-way delay from each pathway via data packets, eliminating the need for separate delay detection packets and clock synchronization. The probabilistic selection hinges on weights determined by the one-way delay, ensuring the prevention of bottleneck links during congestion updates. Notably, routing decisions under Alarm are made per-flowlet. Guided by delay cues, the gap between flowlets is dynamically adjusted to match the maximum delay variation across diverse paths, thereby preventing the occurrence of packet out-of-order. The simulation results show that Alarm can achieve 2.0X and 1.7X better average and p99 FCT slowdown than existing adaptive routing.
Article
With the increasing need for graph analysis, massive Concurrent iterative Graph Processing (CGP) jobs are usually performed on the common large-scale real-world graph. Although several solutions have been proposed, these CGP jobs are not coordinated with the consideration of the inherent dependencies in graph data driven by graph topology. As a result, they suffer from redundant and fragmented accesses of the same underlying graph dispersed over distributed platform, because the same graph is typically irregularly traversed by these jobs along different paths at the same time. In this work, we develop GraphTune , which can be integrated into existing distributed graph processing systems, such as D-Galois, Gemini, PowerGraph, and Chaos, to efficiently perform CGP jobs and enhance system throughput. The key component of GraphTune is a dependency-aware synchronous execution engine in conjunction with several optimization strategies based on the constructed cross-iteration dependency graph of chunks. Specifically, GraphTune transparently regularizes the processing behavior of the CGP jobs in a novel synchronous way and assigns the chunks of graph data to be handled by them based on the topological order of the dependency graph so as to maximize the performance. In this way, it can transform the irregular accesses of the chunks into more regular ones so that as many CGP jobs as possible can fully share the data accesses to the common graph. Meanwhile, it also efficiently synchronizes the communications launched by different CGP jobs based on the dependency graph to minimize the communication cost. We integrate it into four cutting-edge distributed graph processing systems and a popular out-of-core graph processing system to demonstrate the efficiency of GraphTune. Experimental results show that GraphTune improves the throughput of CGP jobs by 3.1∼6.2, 3.8∼8.5, 3.5∼10.8, 4.3∼12.4, and 3.8∼6.9 times over D-Galois, Gemini, PowerGraph, Chaos, and GraphChi, respectively.
Article
Dragonfly networks have significant advantages in data exchange due to the small network diameter, low cost and modularization. A graph G is vertex-pancyclic if for any vertex uV(G)u\in V(G), there exist cycles through u of every length \ell with 3V(G)3\leq \ell \leq |V(G)|. A graph G is Hamiltonian-connected if there exists a Hamiltonian path between any two distinct vertices u,vV(G)u,v\in V(G). In this paper, we mainly research the pancyclic and Hamiltonian properties of the dragonfly network D(n,h), and find that it is Hamiltonian with n1,h2n\geq 1,\,\,h\geq 2, pancyclic, vertex-pancyclic and Hamiltonian-connected with n4,h2n\geq 4,\,\,h\geq 2.
Article
BCube is a promising structure of data center network, as it can significantly improve the performance of typical applications. With the expansion of network scale and increasement of complexity, reliability and stability of networks have become more essential. In this paper, we study the fault-tolerant routings in BCube. Firstly, we design a fault-tolerant routing algorithm based on node disjoint multi-paths. The proposed multi-path routing has stronger fault tolerance, since each path has no other common nodes except the source node and the destination node. Secondly, we investigate an effective fault-tolerant routing based on routing capabilities algorithm for BCube. The proposed algorithm has higher fault tolerance and success rate of finding feasible routes, since it does not limit the faults number. Thirdly, we present an adaptive path finding algorithm for establishing virtual links between any two nodes in BCube, which can shorten the diameter of BCube. Extensive simulation results show that the proposed routing scheme outperforms the existing popular algorithms. Compared with the state-of-the-art fault-tolerant routing algorithms, the proposed algorithm has a 21.5% to 25.3% improvement on both throughput and packet arrival rate. Meanwhile, it reduces the average latency of 18.6% and the maximum latency of 23.7% in networks.
Chapter
Interconnection networks fall into two classes: (1) low-radix networks and (2) high-radix networks. High-radix networks mainly include fat-tree networks and dragonfly-related networks. Dragonfly-related networks include 1D dragonfly networks, 2D Slingshot networks, dragonfly+ networks and others. A 1D dragonfly network consists of completely connected router groups, where each pair of router groups is connected by one or more global optical links, and each pair of routers in the same group by a single local link. Slingshot networks replace the router group with a flattened-butterfly 2D connected group, where every two groups can be connected by one or more global links. The dragonfly+ network is an enhanced 1D dragonfly network in which each router group contains two sub-groups of switches: leaf switches and spine switches. The spine switches are directly connected to the spine switches of the other router groups, while the leaf switches are connected to the spine switches in the same group and to the servers. The speaker presents efficient routing algorithms, network connection schemes, and collective communication operations in different high-radix networks.
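The 1D dragonfly wiring rule described above (fully connected groups, one global link per group pair) can be sketched as an adjacency builder. This is an illustrative construction, not any specific vendor's cabling scheme; the round-robin choice of global-link endpoints is an assumption made for concreteness:

```python
def build_dragonfly(num_groups, routers_per_group):
    """Sketch of a 1D dragonfly: routers within a group form a complete
    graph (local links); every pair of groups gets one global link, with
    endpoint routers chosen round-robin inside each group."""
    adj = {(g, r): set()
           for g in range(num_groups) for r in range(routers_per_group)}
    # Local links: complete graph inside each group.
    for g in range(num_groups):
        for r1 in range(routers_per_group):
            for r2 in range(r1 + 1, routers_per_group):
                adj[(g, r1)].add((g, r2))
                adj[(g, r2)].add((g, r1))
    # Global links: one per group pair, endpoints spread round-robin.
    next_port = [0] * num_groups
    for g1 in range(num_groups):
        for g2 in range(g1 + 1, num_groups):
            a = (g1, next_port[g1] % routers_per_group)
            b = (g2, next_port[g2] % routers_per_group)
            adj[a].add(b)
            adj[b].add(a)
            next_port[g1] += 1
            next_port[g2] += 1
    return adj

net = build_dragonfly(num_groups=4, routers_per_group=3)
print(len(net))  # 12 routers
```

With this structure, a minimal path needs at most one global hop: an optional local hop in the source group, the global link, and an optional local hop in the destination group.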
Conference Paper
Full-text available
As the size of High-Performance Computing clusters grows, the increasing probability of interconnect hot spots degrades the latency and effective bandwidth the network provides. This paper presents a solution to this scalability problem for real-life constant-bisectional-bandwidth fat-tree topologies. It is shown that maximal bandwidth and cut-through latency can be achieved for MPI global collective traffic. To form such a congestion-free configuration, MPI programs should use collective communication, the MPI node order should be topology-aware, and the packet routing should match the MPI communication patterns. First, we show that MPI collectives can be classified into unidirectional and bidirectional shifts. Using this property, we propose a scheme for congestion-free routing of global collectives in fully and partially populated fat trees running a single job. Simulation results of the proposed routing, MPI node order and communication patterns show a 40% throughput improvement over previously published results for all-to-all collectives.
Article
Full-text available
We present the HPCG benchmark (High Performance Conjugate Gradients), which aims to provide a more application-oriented measure of system performance than the High Performance LINPACK benchmark. We show the model partial differential equation and its discretization, as well as the algorithm for solving it iteratively. The performance results show how HPCG ranks large supercomputing installations and delivers a richer view of important system characteristics.
Article
Full-text available
A new deadlock-free unicast-based broadcast scheme is proposed, built on a new routing scheme called minus-first routing. Minus-first routing is a partially adaptive routing scheme for dragonfly networks that requires no virtual channels. The main goals of the broadcast schemes are to minimize the total delivery time while ensuring that no router receives a message more than once and that no channel competition is introduced. Two different broadcast schemes are proposed: (1) group-first and (2) router-first. It is shown that unicast-based broadcast schemes are necessary to avoid deadlocks at the consumption channels. The group-first broadcast scheme delivers a message to all groups as early as possible, while the router-first scheme minimizes the number of unicast steps that traverse global links. To our knowledge, this is the first work on collective communication for dragonfly networks in the literature. Simulation results are presented to evaluate the proposed unicast-based broadcast schemes.
Article
Current high-performance platforms such as datacenters or High-Performance Computing systems rely on high-speed interconnection networks able to cope with the ever-increasing communication requirements of modern applications. In particular, in high-performance systems that must offer differentiated services to applications involving traffic prioritization, it is almost mandatory that the interconnection network provide some type of Quality-of-Service (QoS) and congestion-management mechanism in order to achieve the required network performance. Most current QoS and congestion-management mechanisms for high-speed interconnects use the same kind of resources, but with different criteria, resulting in disjoint mechanisms. By contrast, we propose in this paper a novel, straightforward solution that leverages the resources already available in InfiniBand components (basically Service Levels and Virtual Lanes) to provide both QoS and congestion management at the same time. This proposal is called CHADS (Combined HoL-blocking Avoidance and Differentiated Services), and it can be applied to any network topology. From the results shown in this paper for networks configured with the novel, cost-efficient KNS hybrid topology, we conclude that CHADS is more efficient than other schemes at reducing interference among packet flows that have the same or different priorities.