Leveraging InfiniBand Controller to Configure
Deadlock-Free Routing Engines for Dragonflies
German Maglione-Mathey, Jesus Escudero-Sahuquillo, Pedro Javier Garcia, Francisco J. Quiles (1)
Departamento de Sistemas Informáticos, Universidad de Castilla-La Mancha, Spain
Eitan Zahavi (2)
Mellanox Technologies, Israel
Abstract
The Dragonfly topology is currently one of the most popular network topologies in high-performance parallel systems. The interconnection networks of many of these systems are built from components based on the InfiniBand specification. However, due to some constraints in this specification, the available versions of the InfiniBand network controller (OpenSM) do not include routing engines based on some popular deadlock-free routing algorithms proposed theoretically for Dragonflies, such as the one proposed by Kim and Dally based on Virtual-Channel shifting. In this paper we propose a straightforward method to integrate this routing algorithm in OpenSM as a routing engine, explaining in detail the configuration required to support it. We also provide experiment results, obtained both from a real InfiniBand-based cluster and from simulation, to validate the new routing engine and to compare its performance and requirements against other routing engines currently available in OpenSM.
Keywords: Dragonfly, InfiniBand, Routing, Deadlock freedom
(1) german.maglione@dsi.uclm.es, jesus.escudero@uclm.es, pedrojavier.garcia@uclm.es, francisco.quiles@uclm.es
(2) eitan@mellanox.com
Preprint submitted to Journal on Parallel and Distributed Computing December 29, 2021
1. Motivation
High-Performance Computing (HPC) systems, as well as datacenters, consist of a high number of computing and storage endnodes interconnected by means of a network. Given the ever-growing communication requirements of the services and applications supported by these systems, the performance demanded of the interconnection network (mainly high throughput and low latency) grows accordingly. Hence, it is essential to optimize the design and configuration of the interconnection network in order to achieve the performance required by the system services and applications.
One of the most important design aspects of any interconnection network is its topology, i.e. the pattern followed to interconnect the endnodes of the system. In that sense, Dragonfly topologies [1] have become very popular in recent years because of their low diameter, path diversity, high bisection bandwidth, and good performance/cost ratio. In addition, they offer good scalability thanks to their hierarchical structure. Thanks to these advantages, Dragonfly topologies have been deployed in several real systems such as the IBM PERCS [2], the Cray XC series [3] or Niagara (Canada's fastest supercomputer) [4], which implements a Dragonfly+ architecture [5].
However, the interconnection pattern of Dragonfly topologies contains physical cycles that may lead to cyclic channel dependencies and thus to traffic deadlocks. When a packet is allowed by the routing algorithm to traverse two channels cn and cm, in that order, to reach its destination, this creates a direct dependency from cn to cm. The set of all direct dependencies that packets create, from all sources to all destinations, forms a channel dependency graph [6], where each vertex represents a channel and each edge represents a dependency. Cyclic dependencies are formed between two or more channels that depend on each other, either directly or indirectly. This situation may lead to a cycle of packets waiting for one another to release resources (i.e. channels) in a deadlock configuration [7], so that they remain blocked indefinitely. Hence, it is mandatory that the routing algorithms used in Dragonflies are designed to guarantee deadlock freedom. For that purpose, the routing algorithms, either deterministic or adaptive, usually considered suitable for Dragonflies prevent potential cyclic dependencies among channels [6] by following one of three approaches. The first approach uses Virtual Channels (VCs) [6] as escape ways, i.e. the VC assigned to a packet may vary along its route so that no cyclic channel dependencies appear in the network for a given VC. In the second approach, the routes that packets follow in the network are grouped into disjoint sets, and each set is mapped to an exclusive VC, according to some algorithm that guarantees that not all the routes forming any possible cycle belong to the same set (i.e. that not all of the routes forming a cycle share the same VC). This second approach is generally known as "layered routing" [8]. The third approach restricts routes so that the allowed ones never form a cycle.
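To make the channel dependency graph concrete, the sketch below (in Python; the channel names are illustrative) builds the graph induced by a set of routes, where each route is the ordered list of channels a packet traverses, and detects cyclic dependencies with a depth-first search. Under this model, a routing function is deadlock-free only if no cycle is found.

    from collections import defaultdict

    def dependency_graph(routes):
        """Build the channel dependency graph: for every route, each
        consecutive pair of channels (cn, cm) adds the edge cn -> cm."""
        graph = defaultdict(set)
        for route in routes:                      # route = ordered channel list
            for cn, cm in zip(route, route[1:]):
                graph[cn].add(cm)
        return graph

    def has_cycle(graph):
        """Detect a cyclic dependency with an iterative three-color DFS."""
        WHITE, GRAY, BLACK = 0, 1, 2
        color = defaultdict(int)                  # every channel starts WHITE
        for start in list(graph):
            if color[start] != WHITE:
                continue
            color[start] = GRAY
            stack = [(start, iter(graph[start]))]
            while stack:
                node, children = stack[-1]
                for child in children:
                    if color[child] == GRAY:      # back edge: cycle found
                        return True
                    if color[child] == WHITE:
                        color[child] = GRAY
                        stack.append((child, iter(graph[child])))
                        break
                else:                             # all children explored
                    color[node] = BLACK
                    stack.pop()
        return False

    # Example: two routes that close a cycle c0 -> c1 -> c2 -> c0
    print(has_cycle(dependency_graph([["c0", "c1", "c2"], ["c2", "c0"]])))  # True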
The use in Dragonflies of routing algorithms based on the first approach was proposed together with the topology itself by Kim and Dally [1]. Specifically, a deterministic minimal routing algorithm that requires 2 VCs to provide deadlock freedom was proposed, as well as an oblivious non-minimal routing algorithm that requires 3 VCs, the latter being an adaptation of Valiant's routing algorithm [9]. In addition, the combined use of both minimal and non-minimal routes was also suggested, according to the adaptive routing algorithm proposed in [1, 10], which requires 3 VCs to provide deadlock freedom. However, these algorithms, similarly to others providing deadlock freedom through the same approach (e.g. the ones proposed in [11, 12, 13]), are not easy to implement in real systems due to the requirement of "shifting" the VC assigned to some packets. In particular, this "VC shifting" is difficult to achieve in networks built from components based on the InfiniBand architecture [14] (see Section 2.3), which is currently one of the most widely used network technologies in HPC systems [15]. In that sense, although VCs can be emulated through the InfiniBand Virtual Lanes (VLs), some restrictions in the InfiniBand specification [14] (see Section 3) make it complex to shift the VL assigned to a packet along its route, as required by the routing algorithms that follow the first approach for deadlock prevention.
For that reason, the deadlock-free routing algorithms that have been implemented as routing engines in the InfiniBand control software (OpenSM [16]) mainly follow the second or third approaches mentioned above. The available routing engines suitable for Dragonflies that follow the second approach (i.e. layered routing) are LASH [8], DFSSSP [17] and D3R [18], the former two actually being topology-agnostic algorithms, while the latter has been specially designed for Dragonflies. Regarding the third approach, the classical, topology-agnostic Up/Down algorithm is also available as a routing engine (UPDN) in OpenSM [19], which can also be used in Dragonflies. As far as we know, the only deadlock-free routing algorithm following the first approach (i.e. requiring VL shifting) and suitable for Dragonflies that has been implemented as an IB routing engine is DF-DN [20]. This routing engine was actually proposed in general for low-diameter topologies, and it is based on a non-uniform configuration of the tables that map the packets' Service Level (SL, see Section 2.3) to VLs, which indeed introduces some degree of complexity in the network configuration.
It is worth clarifying that all the routing engines mentioned in the previous paragraph are deterministic, since implementing adaptive routing algorithms in InfiniBand presents very difficult, or even insurmountable, problems. It is also worth clarifying that the DFSSSP and LASH routing engines are strictly minimal-path, while UPDN may provide non-minimal routes. Nevertheless, D3R is minimal if we consider a Dragonfly group (see Section 2.1) as a virtual switch; therefore D3R minimizes the number of traversed global channels (see Section 2.1.1). This is also true for the minimal routing algorithm for Dragonflies [1] (see Section 2.1.1).
In summary, when it comes to configuring a Dragonfly topology in a real InfiniBand-based cluster, the administrator basically has to choose among the few available deadlock-free, deterministic routing engines. However, there are not many studies comparing the performance of these routing engines that could help network administrators choose the most suitable one for the specific Dragonfly network under configuration. In that sense, in [20] the requirements of some of these routing engines are analyzed in terms of routing time and number of required VLs, but no performance measurements are provided. In [18], both a comparison of the requirements (in terms of the number of required VLs) and a performance comparison (based on both simulation experiments and results from a real InfiniBand-based cluster) are provided, but not all the routing engines available for Dragonflies are considered.
In order to ease the optimal configuration of InfiniBand-based Dragonfly networks, in this paper we first enlarge the set of deadlock-free routing engines available for Dragonflies by proposing a method to implement in OpenSM the deterministic, minimal routing algorithm proposed originally in [1], which, as far as we know, had not been implemented as a routing engine before. Second, we analyze the requirements, pros and cons of all the available routing engines in this set, including the new one, considering the impact of several switch features and constraints on their implementation and performance. Third, we provide a thorough performance comparison of these routing engines, based on both simulation experiments and results from a real InfiniBand-based cluster. For this performance comparison we consider several traffic scenarios, including real applications, the results showing that in many scenarios the newly implemented routing engine outperforms the other ones while requiring fewer VLs.
The rest of the paper is organized as follows. Section 2 describes the background on Dragonfly topologies, InfiniBand-based networks and routing engines for InfiniBand-based clusters. Section 3 points out the implementation problems of the routing algorithms that are the basis of our proposal. Section 4 covers the implementation details of the mentioned routing algorithm as a routing engine in the InfiniBand network controller. In Section 5 the deadlock-free routing engines available for Dragonflies are analyzed and evaluated based on experiment results. Finally, in Section 6 some conclusions are drawn.
2. Background
2.1. Dragonfly Topology
The Dragonfly [1] is one of the most popular topologies nowadays, due to its interesting properties [21]. Figure 1 shows a generic Dragonfly topology, where switches are organized in two hierarchical levels. In the first level, switches and endnodes in the same group are connected via local channels, forming the intra-group network. In the second level, the groups are connected by means of global channels that form the inter-group network. A Dragonfly network can be defined by three parameters:
• a: switches in each group.
• p: endnodes connected to each switch.
• h: links at each switch used to connect to other groups (i.e. the global channels).
Figure 1: Generic Dragonfly connection pattern.
Based on these parameters, a group contains a switches which are interconnected via local channels, and altogether can be considered as a virtual high-radix switch. Each group is connected to the other groups by means of a × h global channels. Neither the intra- nor the inter-group topology is tied to any particular connection pattern. For instance, a fully-connected Dragonfly network assumes a direct link between any pair of switches within the same group (i.e. the intra-group level), and a direct link between any pair of groups (i.e. the inter-group level). Fully-connected Dragonflies can interconnect N = ap(ah + 1) endnodes by using g = ah + 1 groups. In order to balance channel load under load-balanced traffic, fully-connected Dragonflies should be built so that the values of parameters a, h, and p fulfill the equation a = 2h = 2p. This pattern is the one recommended in the original paper where the Dragonfly topology was proposed [1].
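For illustration, a short Python helper (ours, not part of OpenSM) derives the size of a fully-connected Dragonfly from these parameters and checks the balance condition:

    def dragonfly_size(a, h, p):
        """Return (groups, endnodes) of a fully-connected Dragonfly:
        g = a*h + 1 groups and N = a*p*(a*h + 1) endnodes."""
        g = a * h + 1
        return g, a * p * g

    def is_balanced(a, h, p):
        """Balanced fully-connected Dragonflies satisfy a = 2h = 2p."""
        return a == 2 * h == 2 * p

    # The 72-node network used later in the evaluation: a=4, h=2, p=2
    print(dragonfly_size(4, 2, 2), is_balanced(4, 2, 2))  # (9, 72) True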
Dragonfly networks are cost-effective, since the number of elements required to interconnect large networks is lower than that required by other topologies. Besides, among the interesting properties of Dragonfly networks, it is worth mentioning their low diameter (i.e. the maximum number of hops required to route packets from a source endnode to a destination endnode) and path diversity (i.e. the different routes that packets can follow when communicating two endnodes).
2.1.1. Dragonfly Routing
A deadlock-free minimal routing algorithm for Dragonflies was proposed in the original paper where this topology was defined [1]. Figure 3 shows a minimal route in a Dragonfly, from source node Hs, attached to switch Ss in group Gs, to destination node Hd, attached to switch Sd in group Gd, that traverses a single global channel gcsd. The pseudo-code defining this routing algorithm can be seen in Figure 4. This algorithm, hereinafter referred to as DLA, requires 2 Virtual Channels (VCs) [6] to prevent deadlocks. Basically, cyclic dependencies are avoided by shifting the VC of a packet when it is going to traverse a local (i.e. intra-group) channel after traversing a global (i.e. inter-group) one.
As mentioned earlier, in order to balance channel load under uniform traffic, a fully-connected Dragonfly should be built using a = 2h = 2p. Analyzing the relationship that a, h and p must satisfy to balance channel load requires computing the number of flows (i.e. source-destination pairs) traversing each channel, assuming unidirectional channels. A fully-connected Dragonfly has three types of channels: terminal channels (tc), local channels (lc) and global channels (gc) (see Figure 1). Assuming minimal routing, the number of flows traversing a tc is:

ft = 1 × (N − 1) = ap(ah + 1) − 1    (1)

A gc is used only to communicate one pair of groups, thus the number of flows traversing a gc is:

fg = (ap)²    (2)

Three different types of flows traverse an lc of a group:
1. Flows starting and ending in that group: p².
2. Flows starting in that group and ending in a different group: ahp².
3. Flows starting in a different group and ending in that group: ahp².

Therefore, the total number of flows traversing an lc is:

fl = p² + ahp² + ahp²    (3)

For p ≥ 2, the numbers of flows of the different types of channels satisfy fg < fl < ft.
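As a concrete check of Equations 1 to 3, the following Python sketch (the helper name is ours) computes the per-channel flow counts for the balanced 72-node network (a = 4, h = 2, p = 2) also used in Section 5:

    def flows(a, h, p):
        n = a * p * (a * h + 1)
        ft = n - 1                       # Equation 1: flows per terminal channel
        fg = (a * p) ** 2                # Equation 2: flows per global channel
        fl = p**2 + 2 * a * h * p**2     # Equation 3: flows per local channel
        return ft, fg, fl

    ft, fg, fl = flows(4, 2, 2)
    print(ft, fg, fl)   # 71, 64, 68 -> fg < fl < ft holds for p >= 2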
Regarding the load balance of VCs and channels, let us first consider the balance between the two VCs used by the routing algorithm. Among the different types of flows traversing an lc (enumerated above), the first two types are mapped to the same VC, but the last (third) type of flow is mapped to a different VC. Hence, the imbalance between the numbers of flows mapped to these two VCs (p² + ahp² and ahp², respectively) is driven by the term p². However, this difference decreases relative to the number of flows mapped to each VC when the network size increases, since a increases twice as fast as p. Similarly, the imbalance between the numbers of flows traversing the channel types gc and lc is driven by the term p². This can be seen by rearranging Equation 3 as fl = 2ahp² + p² and replacing 2h = a (as we assume fully-connected Dragonflies), which results in fl = (ap)² + p². Figure 2 shows how the ratio fg/fl approaches 1 when p increases, and hence with the network size.
Figure 2: The ratio fg/fl is close to 1 when the network size increases (plot: ratio vs. number of endnodes per switch, p).
This minimal routing works well for load-balanced traffic, but results in very
poor performance under adversarial traffic patterns [1].
Figure 3: Inter-group path in a Dragonfly topology.
if Gs ≠ Gd and Ss is not directly connected to Gd then
    route to Si        { Si has a connection (gcsd) to Gd }
else if Gs ≠ Gd then
    route to Sk        { Sk is a switch in group Gd }
else if Sk ≠ Sd then
    route to Sd
end if
Figure 4: Pseudo-code of the minimal routing for Dragonflies (DLA).
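A direct transcription of this pseudo-code into Python could look as follows; this is a sketch, and the 'topo' helper methods (group_of, has_global_link, etc.) are assumptions for illustration, not OpenSM API:

    def dla_next_hop(current, dst, topo):
        """Minimal Dragonfly routing (DLA), following Figure 4."""
        gs, gd = topo.group_of(current), topo.group_of(dst)
        if gs != gd and not topo.has_global_link(current, gd):
            # Intra-group hop towards the switch Si owning the global
            # channel gcsd that reaches the destination group Gd.
            return topo.local_switch_connected_to(gs, gd)
        if gs != gd:
            # Take the global channel into the destination group Gd.
            return topo.neighbor_in_group(current, gd)
        if current != dst:
            # Final intra-group hop to the destination switch Sd.
            return dst
        return None  # already at the destination switch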
As an alternative to minimal routing, the original Dragonfly paper also proposes an adaptation of the non-minimal Valiant's routing algorithm [9] to Dragonflies, so that packets are routed first to a randomly-chosen intermediate group, and then to their actual destination group. In this way, a uniform, random distribution of traffic emerges even if the original traffic pattern is adversarial-like, balancing traffic load on local and global links so that the performance degradation that such a pattern may generate is prevented [1]. However, this is achieved at the expense of longer paths, which may increase packet latency, and at the cost of requiring 3 VCs to prevent deadlocks (the packet VC must be shifted after every inter-group hop). Moreover, the longer routes required by Valiant's routing algorithm approximately double the traffic load traversing the network with respect to minimal-path routing [7, 22], which may in turn cause unnecessary performance degradation for uniform-like traffic patterns.
As both minimal and non-minimal routing may produce performance degradation under different traffic patterns, the use of adaptive routing is also suggested in the original Dragonfly paper, specifically the Universal Globally-Adaptive Load-Balanced (UGAL) algorithm [10]. UGAL balances traffic among minimal and non-minimal paths on a packet-by-packet basis, using queue occupancy levels and hop count information to select the path with minimum delay. For the Dragonfly topology, two versions of UGAL were initially proposed: UGAL-L, which uses local queue occupancy information at the current node, and UGAL-G, which uses that information from all the global channels (although the latter is considered ideal and unfeasible in practice [1]). Both versions of UGAL require 3 VCs to provide deadlock freedom in Dragonflies. By contrast, there are proposals of deadlock-free adaptive routing algorithms for Dragonflies that do not need VCs to prevent deadlocks [23, 24], although in practice they require other additional network resources. In general, deadlock-free adaptive routing for Dragonflies introduces some degree of network complexity and demands a higher number of network resources with respect to deterministic minimal routing. Moreover, adaptive routing may require packet reordering at the destination endnodes upon out-of-order delivery, thus introducing additional latency. As mentioned above, all these problems of adaptive routing would be introduced unnecessarily in scenarios where minimal routing suffices to achieve good performance. Moreover, even in cases where adaptive routing is required, deadlock-free minimal routing may still be necessary, since it is needed to select among minimal and non-minimal paths, as happens when UGAL is used. Note that all the routing algorithms proposed in the original Dragonfly paper [1] require shifting VCs, which makes their implementation in IB-based networks complex (further details can be found in Section 3). For that reason, other algorithms have traditionally been used to provide deadlock-free routing in IB-based networks, as explained in Section 2.3.
2.2. InfiniBand Architecture Basics
The InfiniBand Architecture (IB) specification [14] describes a virtual cut-through (VCT) interconnect technology for communicating processing and/or storage nodes (i.e. endnodes). Emerging applications requiring intensive computing and massive storage have imposed an increased burden on high-performance computing (HPC) clusters, datacenters and hyperscalers, demanding from the interconnection network greater capacity to move data between endnodes, and pushing more functionality down to the network elements. IB-based networks offer higher bandwidth and lower latency compared to other network technologies. The last generation of IB-based products provides HDR speed (i.e. 200 Gb/s) and 0.6 µs end-to-end latency. These features are expected to improve in the future [25] in order to meet the demands of future applications.
IB-based networks are composed of several elements: network interfaces or host channel adapters (HCAs), switches, routers and the cables that interconnect them. The subnet manager (SM), also part of an IB network, is a control software entity that discovers and configures the network. HCAs are in charge of injecting into the network the packets generated by the applications. These packets travel throughout the network, reaching switches and routers via their ports. Switches and routers have several ports, so that IB-based networks can be configured with several network topologies, such as Meshes, Tori, Fat-trees or Dragonflies, among others. Cables connect HCA ports to switch/router ports. Moreover, network elements in IB networks have to be identified (addressed) so that routing algorithms can define routes in the topology. Addressing in IB networks is defined by means of local identifiers (LIDs) that are assigned to HCA and switch/router ports during the network-configuration stage by the SM. Then, these ports can be identified unequivocally within the network when the topology is discovered, as well as the network elements (i.e. HCAs, switches and routers) they belong to.
As mentioned above, in IB-based networks the configuration stage is orchestrated by the SM, an agent that may be implemented in software or hardware, which discovers all the network elements that form the topology. More precisely, the SM broadcasts management datagrams (MADs) that are received by the network elements (several MADs can be received by the same network element through several paths in the network). These elements then notify the SM of their existence, and the SM sends them their corresponding LID. Obviously, the network topology must offer full connectivity among these network elements so that they can be discovered. While identifying the network elements, the SM also discovers the network topology by keeping track of the paths followed by the MADs. After discovering the topology, the SM applies the selected routing algorithm by populating the Linear Forwarding Tables (LFTs) at the switches. At the end of this stage, the SM is able to provide the network description, containing the LIDs of the network elements, the network topology and the routing information stored in the switch routing tables.
In the IB specification, the buffering at HCA and switch ports is organized into Virtual Lanes (VLs). Basically, each physical link interconnecting buffers between two ports at different network elements is split into a set of logical VLs sharing the total link bandwidth. The IB specification defines a credit-based flow-control scheme at VL level, so that a VL is allowed to forward a packet only if there are available credits for the VL assigned to that packet at the next switch (or HCA) port. Hence, VLs also represent a fraction of the total buffer space at each port. In IB-based networks, VLs are assigned to packets based on their Service Level (SL), a numerical identifier that is set for packets at the HCAs, prior to their injection into the network. The SL value cannot be changed once the packet has been injected. The maximum number of VLs per HCA or switch port defined in the IB specification is 15, although several manufacturers such as Mellanox limit this number to 9 VLs (1 of them reserved for management purposes). The number of SLs available in IB-based networks is limited to 16. Figure 5 shows a diagram of an IB-based k-port switch.
Figure 5: Diagram of the architecture of an IB-based k-port switch.
As can be seen in Figure 5, every switch (and HCA) has an SL-to-VL mapping table (SL2VL) per output port, which is used to assign a specific VL to the packets requesting that output port, based on those packets' SLs and on their arrival port (in the case of switches). For that purpose, each entry of the SL2VL table associates an input port and an SL with a VL. Hence, a packet can be assigned to different VLs along its route, depending on its SL and on the information in the SL2VL table at each HCA or switch output port crossed by the packet. In addition to the LFTs, the SM also populates the SL2VL tables at both HCA and switch ports.
OpenSM [16] is the open-source implementation of the SM included in the OpenFabrics Software (OFS) [26], a commonly used open-source control-software distribution for IB-based networks. OFS is supported by the OpenFabrics Alliance [26], which gathers several companies promoting IB technology. Basically, OpenSM implements the functionality of the SM, including the implementation of different routing algorithms, referred to as routing engines. In particular, the routing engines are in charge of computing the routing tables and sending this information to the switches.
2.3. Deadlock-free Routing in InfiniBand-based Dragonflies
As mentioned above, the interconnection pattern of Dragonfly topologies contains physical cycles that may lead to deadlocks [7]; thus the routing algorithms used in Dragonflies must be designed to prevent such deadlocks from emerging. Indeed, a deadlock situation ("credit loops" in the InfiniBand (IB) jargon) can emerge in an IB-based network, because cyclic dependencies among channels may appear due to the backpressure on the buffers associated with these channels.
In general, there are three different approaches to designing a deadlock-free routing algorithm. The first approach uses Virtual Channels (VCs) [6] as escape ways, where the VC assigned to a packet may vary along its route. This approach performs a "shifting" of the VC assigned to a packet, so that no cyclic dependencies appear in the network for a given VC. As mentioned above, shifting VCs is difficult to implement in IB-based networks (see Section 3 for more details), hence this approach has traditionally been avoided in real IB-based systems. Indeed, as far as we know, besides the implementation of DLA that we propose in this paper, DF-DN [20] is currently the only IB routing engine that uses VC shifting (VL shifting in IB terms), and it is restricted to low-diameter IB networks. It is worth clarifying that DF-DN leaves the calculation of the Linear Forwarding Tables (LFTs) to routing engines that do not prevent deadlocks, such as SSSP [27] and MINHOP [16], while it configures the SL2VL tables so that deadlocks are avoided.
In the second approach, known as layered routing, the routing engine analyzes the paths that may form a cycle in the channel dependency graph [7], and maps some of those paths to different VLs in order to break cyclic channel dependencies. Note that in this approach VLs are assigned to entire paths, so that packets cannot change VL in the middle of their route (i.e. there is no VL shifting). Hence, this approach has traditionally been the preferred one for designing routing engines. The current routing engines available in IB that are suitable for Dragonfly networks and use layered routing are DFSSSP [17], LASH [8] and D3R [18].
The last approach to designing deadlock-free routing algorithms is based on restricting routes so that the allowed ones never form a cycle. OpenSM provides two similar algorithms that follow this approach and can be used in a Dragonfly network: Up*/Down* [28] (UPDN [19]) and Down*/Up* [28] (DNUP [19]).
In summary, the available routing engines that are able to produce a deadlock-free routing configuration for a Dragonfly network are:
1. DLA: Our own implementation, proposed in this paper, of the minimal topology-aware routing algorithm for Dragonflies [1]. It is explained in detail in Section 4.
2. D3R: A topology-aware, minimal, deterministic, deadlock-free and scalable routing algorithm, suitable for Dragonfly topologies that use a fully-connected pattern in the inter-group network and any connection pattern in the intra-group network. D3R maps each route to a single, specific VL depending on the destination group, and according to a specific order, so that deadlocks are prevented [18].
3. UPDN: A topology-agnostic, deterministic and deadlock-free routing engine [19] that builds a spanning tree of the interconnection network, then removes (i.e. restricts) paths in order to break cyclic dependencies. However, UPDN provides non-minimal routes when applied to Dragonfly topologies. DNUP is similar to UPDN.
4. LASH: A unicast, topology-agnostic routing engine for IB [8]. It provides deadlock-free minimal-path routing while distributing paths among VLs (layers). The number of required VLs varies depending on the network size.
5. DF-DN: It imposes a total order on the virtual channels to which packets are mapped along their routes, and increments the VL at every hop (known as VL hopping); thus the diameter of the network is an upper bound on the required number of VLs [20]. Note that this requires configuring the SL2VL tables so that the VL of packets moving between two specific switches is increased. However, this approach presents some limitations (see Section 3). DF-DN leaves the calculation of the LFTs to SSSP and MINHOP, two minimal-path, topology-agnostic routing engines that optimize link utilization by balancing the number of routes per link (but which do not by themselves prevent deadlocks).
6. DFSSSP: A topology-agnostic, deadlock-free, minimal-path routing engine [17] available in OpenSM. DFSSSP globally balances the number of routes per link in order to optimize link utilization. As this routing engine is not optimized for Dragonflies, it ends up using too many VLs to break cycles in large networks.
3. Problem Statement
As explained in previous sections, different types of routing algorithms proposed for Dragonfly networks, either minimal or non-minimal, avoid cyclic channel dependencies by shifting the VC of the packets along their path. A generalized form of this approach is shown in Figure 6, where the Virtual Channel (VC) of a packet changes at every hop (VC hopping) [29], so that cyclic channel dependencies are avoided. Note, however, that some routing algorithms, such as the minimal one proposed originally for Dragonflies (DLA), require shifting VLs only once. Indeed, changing a packet's VC at every hop may be a waste of resources that could otherwise be used for other purposes.
Figure 6: VC-hopping approach to avoid cyclic dependencies.
Nevertheless, regardless of the number of times packets must shift their VC, the problems involved in implementing VC shifting in IB-based networks are basically the same. Indeed, despite the fact that IB Virtual Lanes (VLs) can be thought of as VCs, this approach cannot be used directly in IB-based networks. First of all, note that the routing algorithms based on VC shifting require doing it at specific hops along the route, i.e. the VC that must be assigned to a packet depends on which hop the packet is performing. However, there is no "hop" information in IB packets. Moreover, note that the VL assignment is managed through the SL2VL table at each output port, which only uses the input port and Service Level (SL) of a packet to assign the next VL (see Section 2.2). Regarding the SL, it cannot in general help to identify the hop that the packet is performing, because it cannot be changed once the packet is injected, and also because the number of available SLs is very small. On the other hand, the input port may help in specific cases to identify the number of hops performed by a packet. For instance, a packet coming from a port connected to a terminal channel is clearly performing its first hop. Similarly, a packet coming from a port connected to a global channel has just performed a hop between two Dragonfly groups. However, in the case of packets coming from ports connected to local channels, we cannot distinguish between packets originated in a different group and packets whose source is in the current group. An example of this problem is shown in Figure 7.
Specifically, Figure 7 shows a traffic situation in a Dragonfly group with a non-fully-connected intra-group pattern. In this case, "external" packet flows (the blue ones) need to traverse at least two switches (switches A and B) before they reach their destination switch (not shown in the figure). "Local" packet flows (the red ones) follow the same path.
Figure 7: External flows (blue) and local flows (red) inside a Dragonfly group.
As can be seen, at output port opa the information in the SL2VL table is enough to split into different VLs the flows coming from global channels (gc0 to gch−1) and the flows coming from terminal channels (tc0 to tcp−1). However, at output port opb it is not possible to distinguish between local (red) and external (blue) flows, because all of them share the same input port (connected to a local channel). This prevents the correct VL assignment, as external and local flows should be assigned different VLs to prevent deadlocks.
Taking this problem into account, our IB-based DLA implementation is limited to Dragonfly networks with a fully-connected pattern in both the inter-group and the intra-group networks, because in this way an external packet flow needs to traverse just one local channel (lc) inside its destination group before reaching its destination switch. Therefore, an external flow and a local flow only meet at the first output port that both cross in the group, where it is possible to distinguish between both flows as they come from local and terminal channels, respectively. In summary, due to this restriction, the input port information is enough to change the VL of the external packet flows as required by the routing algorithm.
4. DLA Implementation Overview
This section describes the implementation of the deadlock-free minimal routing for fully-connected Dragonfly networks (DLA), described in Section 2.1 and summarized in Figure 4, as a new routing engine in OpenSM.
According to the routing algorithm, the DLA routing engine uses VLs as escape ways, i.e. cyclic dependencies are avoided by shifting the VL of the packets when they are going to traverse a local (i.e. intra-group) channel after traversing a global (i.e. inter-group) one.
Like any other routing engine, DLA operates during the IB-based network configuration stage. Specifically, once OpenSM assigns LIDs to all the IB-based devices (i.e. endnodes and switches), the DLA routing engine discovers the Dragonfly network topology, composed of switches belonging to different groups. Afterwards, it populates the routing tables (LFTs) and the SL2VL tables. These basic functionalities are performed in three stages (a sketch of the first stage follows the list):
1. Dragonfly group discovery. For this stage we use the calculation of the closed neighborhood of each switch [18]. The closed neighborhood of a vertex v in a graph G is the subgraph containing vertex v itself and all vertices adjacent to v.
2. Once switches and endnodes are assigned to a Dragonfly group, each switch is visited and, for each target LID, a decision is made as to which port should be used to reach that LID, according to the algorithm defined in Figure 4. This information is used to populate the routing tables (LFTs).
3. The last stage involves the SL2VL table calculation for each switch, as explained in the following paragraphs.
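As an illustration of the first stage, the following Python sketch groups switches by comparing closed neighborhoods, under the assumption (valid for the fully-connected Dragonflies targeted by DLA) that two same-group switches share at least the a switches of their group in their closed neighborhoods, while switches of different groups share far fewer; the helper names are ours, not OpenSM code:

    from itertools import combinations

    def discover_groups(adj, a):
        """Sketch of Dragonfly group discovery via closed neighborhoods.
        'adj' maps each switch to the set of its neighboring switches;
        'a' is the number of switches per group. N[v] = adj[v] | {v}."""
        closed = {v: adj[v] | {v} for v in adj}
        parent = {v: v for v in adj}

        def find(v):
            while parent[v] != v:
                parent[v] = parent[parent[v]]   # path halving
                v = parent[v]
            return v

        for u, v in combinations(adj, 2):
            if len(closed[u] & closed[v]) >= a:
                parent[find(u)] = find(v)       # union: same group

        groups = {}
        for v in adj:
            groups.setdefault(find(v), set()).add(v)
        return list(groups.values())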
For the sake of simplicity, we can view all the SL2VL tables at an InfiniBand (IB) switch as one single table. This table is indexed by the output port (op), input port (ip) and Service Level (SL), and returns the corresponding Virtual Lane (VL). It can thus be thought of as a function that maps a set of input values (op, ip, sl) to an output value (vl).
The function shown in Figure 8 is the one applied by the routing engine to configure the SL2VL tables. Specifically, for any SL, the function returns VL 1 when a packet traverses a local (i.e. intra-group) channel (lc) after traversing a global (i.e. inter-group) channel (gc). Otherwise, the function returns VL 0.
sl2vl(op, ip, sl) =  VL 1   if op ∈ {lc0 ... lca−2} ∧ ip ∈ {gc0 ... gch−1}
                     VL 0   otherwise
Figure 8: Function used by the DLA routing engine to map packets to VLs.
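In code form, the mapping of Figure 8 can be sketched as follows (a minimal Python rendition; the port-classification predicates are assumptions for illustration, since OpenSM works directly on port numbers):

    def dla_sl2vl(op, ip, sl, is_local, is_global):
        """Python rendition of the Figure 8 mapping. 'is_local(port)' and
        'is_global(port)' are assumed predicates telling whether a port
        attaches a local (lc) or a global (gc) channel; the SL is ignored."""
        if is_local(op) and is_global(ip):
            return 1   # gc -> lc hop: shift to the escape VL
        return 0       # any other hop stays on VL 0

    # Example for one switch whose ports 1-3 are local and 4-5 global:
    local, glob = {1, 2, 3}, {4, 5}
    assert dla_sl2vl(2, 4, 0, local.__contains__, glob.__contains__) == 1
    assert dla_sl2vl(4, 2, 0, local.__contains__, glob.__contains__) == 0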
By using this function, the DLA routing engine behaves as required by the routing algorithm, hence it is able to avoid deadlocks using only 2 VLs. Note, however, that this mapping of packets to VLs is independent of the packet's SL, so that, optionally, a single SL (any of them) could be used for all the packets for the sake of simplicity.
Figure 9 shows a schematic of an IB switch with populated SL2VL tables, using the sl2vl function shown in Figure 8. Notice that only packets going from global to local channels shift from VL 0 to VL 1. Indeed, an "asymmetric" configuration of the SL2VL tables results from the function, i.e. different ports get different tables.
The sets of local and global channels in the expression shown in Figure 8 correspond to a Dragonfly network with a fully-connected pattern in both the inter-group and intra-group networks, as our implementation of DLA is restricted to that pattern, as explained in Section 3.
Figure 9: Schematic of an IB-based switch with SL2VL tables configured according to the
DLA routing engine.
5. Routing Engines Evaluation
In this section we analyze and evaluate all the deadlock-free routing engines available for Dragonfly networks (see Section 2.3), including our implementation of DLA as a routing engine. Specifically, we first study the pros and cons of these routing engines, taking into account the impact on their performance of several switch features and constraints on their implementation. Next, we evaluate the scalability of the routing engines based on their requirements and on simulation results. Finally, we analyze the results obtained from real-traffic workloads run in a real IB-based cluster.
In more detail, in order to obtain results for this evaluation we have performed both simulations and experiments with real IB hardware, using a framework which integrates IB control software, IB-based hardware and OMNeT++-based simulators [30]. Basically, in the context of this study we have extended a previously proposed methodology [18, 31], adding support for managing and using SL2VL tables both to the flit-level IB simulator contributed by Mellanox Technologies™ [18] and to an in-house, packet-level, technology-agnostic interconnection network simulator [32].
5.1. Impact of Switch Features on Routing Engines
In this section we analyze the impact of different VL buffer sizes and of the use of Virtual Output Queuing (VOQ) on the performance of the analyzed routing engines. In that sense, the InfiniBand (IB) specification requires providing separate buffering resources at switch and HCA ports, enough space in each Virtual Lane (VL) for at least one packet, and a separate flow-control plane for data VLs. However, the maximum VL buffer size and the implementation of physical VOQ [33] (i.e. input ports implementing separate queues for each output port) are left to the manufacturer's criteria [14]; hence the performance of IB-based fabrics (and so of networks) may vary, depending on the VOQ implementation.
Indeed, the input buffer size has a considerable influence on network robustness, as more packets can be stored if large buffers are provided, especially when traffic bursts or congestion occur. Also, it is well known that packets in an input-queued switch (such as IB-based switches) suffer from poor performance due to Head-Of-Line (HOL) blocking [34]. The VOQ scheme makes it possible to forward packets at an input-port VL that are stored behind packets blocked due to insufficient space in their VL at the next hop, reducing the effects of HOL blocking.
In order to evaluate the impact of both the buffer size and the implementation (or not) of VOQ, we have run simulation experiments in a Dragonfly topology of fixed size (72 nodes, a = 4, h = 2, p = 2), varying the buffer size and modeling switches with or without VOQ. For that purpose, we have used an in-house developed, technology-agnostic, packet-level simulator that allows us to test different interconnection network configurations and switch features [32]. The simulator is configured to use credit-based flow control at VL level, and static buffer management where each VL has a separate, fixed buffer space. This method prevents VLs from growing beyond a specified size, thereby avoiding buffer hogging [35]. We assume 8 data VLs and QDR 4x links (i.e. 40 Gbps of theoretical bandwidth, reduced to 32 Gbps due to the 8b/10b encoding protocol). We use an MTU size of 4 KB. The synthetic traffic pattern used for the simulations follows a random uniform distribution of destinations. Figures 10 and 11 show the results obtained, specifically the normalized accepted traffic as a function of the traffic load. In that sense, the traffic generation rate has been increased from 0% to 100% of the maximum link bandwidth (32 Gbps) in 10% steps. For each load point, the results have been computed as the average accepted traffic measured during 5 ms of simulated time, then normalized against the maximum link bandwidth.
The results for DLA, D3R, LASH, UPDN, DFSSSP, and DF-DN (i.e. SSSP-DF and MINHOP-DF) for different buffer sizes (1, 2, 4, 8, 16, and 32 packets per VL) and for switch architectures without VOQ are shown in Figure 10. Figure 11 shows the results for switch architectures with VOQ. It is worth mentioning that DFSSSP uses 8 VLs; SSSP-DF and MINHOP-DF 3 VLs; DLA, D3R and LASH 2 VLs; and UPDN just one VL (see Table 2). Therefore, as each VL has a separate, fixed buffer space, DFSSSP requires much more overall buffer space than the rest of the routing engines.
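To put this difference in numbers: under the static per-VL buffering described above, the input-buffer space needed per port grows linearly with the number of VLs, as the following sketch (an illustrative helper of ours, using the 4 KB MTU of this setup) shows:

    def port_buffer_kb(num_vls, pkts_per_vl, mtu_kb=4):
        """Per-port input buffer space when each VL gets a separate,
        fixed buffer of 'pkts_per_vl' maximum-size (MTU) packets."""
        return num_vls * pkts_per_vl * mtu_kb

    # With 16 packets per VL: DFSSSP (8 VLs) vs DLA (2 VLs)
    print(port_buffer_kb(8, 16))  # 512 KB per port
    print(port_buffer_kb(2, 16))  # 128 KB per port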
Figure 10: 72-node Dragonfly without VOQ and different input buffer sizes for dla (1 SL, 2 VLs), d3r and lash (2 SLs, 2 VLs), updn (1 SL, 1 VL), dfsssp (8 SLs, 8 VLs), sssp-df (3 SLs, 3 VLs), and minhop-df (4 SLs, 3 VLs). Panels (a)-(f) plot throughput vs. load for buffer sizes of 1, 2, 4, 8, 16 and 32 packets per VL.
Figure 11: 72-node Dragonfly with VOQ and different input buffer sizes for dla (1 SL, 2 VLs), d3r and lash (2 SLs, 2 VLs), updn (1 SL, 1 VL), dfsssp (8 SLs, 8 VLs), sssp-df (3 SLs, 3 VLs), and minhop-df (4 SLs, 3 VLs). Panels (a)-(f) plot throughput vs. load for buffer sizes of 1, 2, 4, 8, 16 and 32 packets per VL.
As can be seen, the results show how an increase in the buffer size leads to an increase in performance (at different ratios), whether VOQ is used (see Figure 11) or not (see Figure 10). For instance, D3R and UPDN are more responsive to changes in buffer size, especially regarding how much traffic can be absorbed until the network reaches the saturation point. Regarding how much it is worth increasing the size of the input buffers, Figures 12a and 12b show the improvement factor of each routing engine as the input buffer size is increased from 1 packet per VL up to 32 packets per VL, either without VOQ (see Figure 12a) or with VOQ (see Figure 12b). Note that the overall improvement factor does not grow significantly beyond 16 packets per VL.
On the other hand, all the routing engines show a considerable performance improvement when VOQ is used. It is worth noting that DFSSSP shows a stable performance even in the absence of VOQ, because it uses many more VLs, thus removing through VLs a greater fraction of the HOL blocking [36] than the other routing engines. Indeed, when VOQ is used, DFSSSP performance increases only by a factor of 1.16 in the best case (see Figure 12c). By contrast, the use of VOQ boosts the performance of both DLA and D3R (which use 2 VLs), so that they outperform DFSSSP. This is a direct consequence of how these algorithms take advantage of the implementation of VOQ.
To make this issue clearer, a small subset of a Dragonfly intra-group network with different types of packet flows is shown in Figure 13. Specifically, "external" flows (i.e. flows coming from other groups) are shown in blue, and "local" flows are shown in red. In this example, the external flows have their destination at switch B, while the local flows have their destination either at switch B or at a group connected to switch B. Therefore, all the flows have to cross the local channel (lc) connecting switches A and B.
In the situation shown in Figure 13, and according to DLA, local and external flows are assigned different VLs, so that they do not compete for the same buffer space and contention is reduced. Furthermore, the use of VOQ allows forwarding packets belonging to a local (red) flow that request access to the global ports gc0 ... gch−1 at switch B, even if they share the VL with local flows that are blocked because they request access to terminal ports that have been granted to external flows. Overall, this reduces contention on both global and local channels.
By contrast, topology-agnostic routing engines do not distinguish Dragonfly groups (therefore they cannot distinguish between external and local flows), so they mix in the same VL locally-generated flows with externally-generated flows, each flow facing different degrees of contention. In addition, an external flow coming from a global channel gci (0 ≤ i < h) at switch A can be forwarded to any port of that switch, including global channels (except the same port gci). Moreover, if the ingress flow is forwarded through a local channel (lc), it may later be forwarded to either terminal or global channels at switch B. In the case of non-minimal routing engines (such as UPDN), they could keep forwarding the external flow through local channels until the destination is reached. All this increases the number of flows requesting the same output port from different input ports. Therefore, if an output port becomes saturated, even temporarily, the number of affected flows and input ports is higher for topology-agnostic routing engines than if DLA is used in VOQ architectures.
The improvement factor when using VOQ for the same buffer size is shown in Figure 12c. That figure shows a considerable increase in the performance of all the routing engines when using VOQ, ranging from between 1.63 and 2.60 for D3R, down to between 1.1 and 1.16 for DFSSSP. The minimum, median and maximum improvement factors for each routing engine are shown in Table 1.
In summary, the DLA routing engine is suitable for IB-based Dragonflies from the performance point of view, provided that there is enough buffer space and a VOQ-like switch architecture is used (3). On the other hand, DFSSSP shows a stable performance (in contrast to the other routing engines) regardless of the buffer space and of whether VOQ is used or not. However, as DFSSSP uses all the VLs available in commercial components, it cannot be further improved by implementing queuing schemes [37, 38] or Quality of Service (QoS) support.
(3) Note that commercial IB-based components, such as Mellanox™ products, do implement VOQ.
Figure 12: Impact of different buffer sizes and of the use of VOQ for dla (1 SL, 2 VLs), d3r and lash (2 SLs, 2 VLs), updn (1 SL, 1 VL), dfsssp (8 SLs, 8 VLs), sssp-df (3 SLs, 3 VLs), and minhop-df (4 SLs, 3 VLs). Panels plot the improvement factor vs. packets per VL: (a) increasing buffer size without VOQ; (b) increasing buffer size with VOQ; (c) improvement of VOQ over non-VOQ for the same buffer size.
Figure 13: Flows crossing a local channel (intra-group network).
Table 1: Minimum, median and maximum improvement factors when using VOQ for each routing engine.
Routing engine Min. Med. Max.
DLA 1.356 1.428 1.448
D3R 1.628 2.373 2.596
LASH 1.179 1.288 1.420
UPDN 1.347 1.412 1.471
DFSSSP 1.059 1.126 1.165
SSSP-DF 1.283 1.298 1.324
MINHOP-DF 1.242 1.251 1.275
5.2. Scalability Analysis
In this section, DLA, D3R, LASH, UPDN, DFSSSP, and DF-DN (i.e. SSSP-DF and MINHOP-DF) are evaluated through simulation for IB-based Dragonfly configurations of different sizes. For that purpose, we have extended the IB-specific flit-level simulator contributed by Mellanox™ [18], implementing the Service Level to Virtual Lane (SL2VL) tables. This simulator uses static buffer management for VLs and implements Virtual Output Queuing (VOQ). The simulator is configured as follows: 8 data VLs, QDR 4x links, an MTU size of 4 KB, and a buffer size of 16 packets per VL. Traffic is injected according to the following synthetic patterns:
1. Random uniform: destinations are randomly chosen, uniformly among all the endnodes. We sample the network throughput during 5 ms of simulated time (after 1 ms of warm-up).
2. 6-point 3D stencil: it reflects the nearest-neighbor communication pattern of real-world applications. We sample the network throughput during 5 ms of simulated time (after 1 ms of warm-up).
3. Hot-spot: a percentage of the endnodes generate traffic directed to a small fraction of the endnodes, while the rest of the endnodes generate random uniform traffic. We randomly select hs = ⌊0.06 × N⌋ endnodes (i.e. 6% of the total N endnodes) to generate hot-spot traffic at 100% of the link bandwidth, addressed to hd = ⌊log2(hs)⌋ randomly selected destinations (see the sketch after this list). We sample the network throughput during 1 ms while the congested traffic is being transmitted, after 1 ms of warm-up, only at those endnodes not receiving hot-spot traffic.
For the first two traffic patterns, the traffic generation rate has been increased from 0% to 100% of the maximum effective link bandwidth for QDR 4x (i.e. 32 Gbps) in 10% steps. This is also the case for the endnodes of the third (hot-spot) communication pattern that do not generate hot-spot traffic.
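As a quick reference, the hot-spot parameters scale with the network size N as in the following Python sketch (the helper name is ours):

    import math

    def hotspot_params(n):
        """hs = floor(0.06 * N) hot-spot sources; hd = floor(log2(hs))
        randomly selected hot destinations, as defined above."""
        hs = int(0.06 * n)          # int() truncates, i.e. floor for n > 0
        hd = int(math.log2(hs))
        return hs, hd

    print(hotspot_params(72))    # (4, 2)   -- smallest balanced network
    print(hotspot_params(2550))  # (153, 7) -- largest balanced network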
Table 2 shows the different balanced Dragonfly configurations (see Section 2.1) used in the experiments, as well as the resources (i.e. SLs and VLs) required by each routing engine. All these network configurations assume fully-connected intra- and inter-group networks.
Figures 14, 15 and 16 depict the performance results obtained for the routing engines when applied to the Dragonfly networks of Table 2 under, respectively, the uniform, 6-point 3D stencil, and hot-spot traffic patterns. Specifically, these figures show the accepted traffic (throughput) normalized against the maximum link bandwidth, as a function of the traffic load. Figures 17, 18 and 19 show simulation results for oversubscribed (i.e. not balanced, a = 2h = p) Dragonflies, where the number of endnodes doubles with respect to the networks of Table 2.
Table 2: Number of SLs and VLs required in fully-connected balanced Dragonfly networks of different sizes.

# of Nodes        72          342         1056        2550
Routing engine    SLs  VLs    SLs  VLs    SLs  VLs    SLs  VLs
DLA               1    2      1    2      1    2      1    2
D3R               2    2      2    2      2    2      2    2
LASH              2    2      2    2      3    3      3    3
UPDN              1    1      1    1      1    1      1    1
DFSSSP            8    8      8    8      8    8      8    8
SSSP-DF           3    3      4    3      5    3      6    3
MINHOP-DF         4    3      5    3      6    3      6    3
Figure 14: Balanced fully-connected Dragonflies under the random uniform traffic pattern (VOQ is used). Panels plot throughput vs. load for (a) a 72-node Dragonfly (a4h2p2), (b) a 342-node Dragonfly (a6h3p3), (c) a 1056-node Dragonfly (a8h4p4), and (d) a 2550-node Dragonfly (a10h5p5).
Figure 15: Balanced fully-connected Dragonflies under the 6-point 3D stencil traffic pattern (VOQ is used). Panels plot throughput vs. load for the same four network sizes as Figure 14.
Figure 16: Balanced fully-connected Dragonflies under the hot-spot traffic pattern (VOQ is used). Panels plot throughput vs. load for the same four network sizes as Figure 14.
Figure 17: Oversubscribed fully-connected Dragonflies under the random uniform traffic pattern (VOQ is used). Panels plot throughput vs. load for (a) a 144-node Dragonfly (a4h2p4), (b) a 684-node Dragonfly (a6h3p6), (c) a 2112-node Dragonfly (a8h4p8), and (d) a 5100-node Dragonfly (a10h5p10).
[Figure 18: Oversubscribed fully-connected Dragonflies under 6-point 3D stencil traffic pattern (VOQ is used). Plots of normalized throughput vs. offered load for dla, d3r, lash, updn, dfsssp, sssp-df and minhop-df. Panels: (a) 144-node Dragonfly (a4h2p4); (b) 684-node Dragonfly (a6h3p6); (c) 2112-node Dragonfly (a8h4p8); (d) 5100-node Dragonfly (a10h5p10).]
[Figure 19: Oversubscribed fully-connected Dragonflies under hot-spot traffic pattern (VOQ is used). Plots of normalized throughput vs. offered load for dla, d3r, lash, updn, dfsssp, sssp-df and minhop-df. Panels: (a) 144-node Dragonfly (a4h2p4); (b) 684-node Dragonfly (a6h3p6); (c) 2112-node Dragonfly (a8h4p8); (d) 5100-node Dragonfly (a10h5p10).]
In general, all the routing engines show a stable performance for the different network sizes under the random uniform and 3D stencil traffic patterns, which is consistent with the results shown in Section 5.1, regardless of whether the Dragonflies are balanced or not. The only exceptions are D3R and LASH, which show unstable performance when used in oversubscribed Dragonflies, because of the intra-group contention generated by the uniform traffic pattern. By contrast, under hot-spot traffic all the routing engines exhibit some degree of variability in their performance results as the network size increases, since Dragonflies are sensitive to this type of traffic pattern [39]. Even so, DLA is able to provide a throughput comparable to that of routing engines requiring a larger number of VLs.
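For reference, the synthetic patterns used in these experiments differ only in how each source selects its destinations. The sketch below illustrates the random uniform and hot-spot cases (illustrative only: the actual generators belong to the simulator described in Section 5.1, and the hot-spot mix shown is an assumption):

    #include <stdio.h>
    #include <stdlib.h>

    /* Random uniform: any destination among the n endnodes, except the
     * source itself. */
    static int dest_uniform(int src, int n)
    {
        int d;
        do { d = rand() % n; } while (d == src);
        return d;
    }

    /* Hot-spot: a fraction 'frac' of the traffic targets the hot endnode;
     * the rest behaves as random uniform (the exact mix is an assumption). */
    static int dest_hotspot(int src, int n, int hot, double frac)
    {
        if (rand() < frac * RAND_MAX)
            return hot;
        return dest_uniform(src, n);
    }

    int main(void)
    {
        printf("%d %d\n", dest_uniform(0, 72), dest_hotspot(0, 72, 5, 0.25));
        return 0;
    }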
Overall, DLA exhibits the best performance results in most scenarios, since the simulator is configured with enough buffer space and VOQ (see Section 5.1). On the other hand, UPDN always achieves the worst performance, because it configures longer paths and favors the appearance of bottlenecks at the root of the directed graph it is based on.
5.3. Experiments in a real cluster
This section shows experiment results for DLA, D3R, LASH, UPDN, DFSSSP, and DF-DN (SSSP-DF and MINHOP-DF) performed under real traffic workloads in the Cluster for the Evaluation of Low-Latency Architectures (CELLIA), built from InfiniBand-based hardware⁴. CELLIA allows us to test the correctness of the implementation of a routing engine by comparing simulation results against real-workload executions. Each server node in CELLIA is an HP ProLiant DL120 Gen9 with an 8-core Intel Xeon E5-2630v3 processor running at 1.80 GHz and 32 GB of RAM. We installed Ubuntu 16.04.3 LTS (Xenial) with kernel version 4.4.0 (x86_64). Each node has a dual-port Mellanox ConnectX-3 MCX353A-QCBT HCA working at QDR speed (i.e. 40 Gb/s, reduced to an effective 32 Gb/s by the 8b/10b encoding). HCAs are attached through an x16 PCIe 3.0 interface. The HCA drivers and firmware are supplied by Mellanox (HCA firmware v2.42.5000).

⁴CELLIA belongs to the RAAP research group [40], from the Albacete Research Institute of Informatics at the University of Castilla-La Mancha, Spain.
Specifically, for this evaluation we built a 42-node (a = 3, h = 3, p = 2) fully-connected Dragonfly topology in CELLIA. We used 21 8-port Mellanox IS5022 switches to build a Dragonfly with 7 groups and 3 switches per group. Switch ports also work at QDR speed, and the cables are Mellanox QSFP, suitable for that speed. Both HCAs and switches offer 9 Virtual Lanes (VLs) per port: 8 data VLs and 1 management VL. We ran a modified version of OpenSM v3.3.19 [16] that includes all the routing engines explained and analyzed in the previous sections.
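In each test, the routing engine is selected through the standard OpenSM configuration mechanism. A minimal opensm.conf sketch could look as follows (routing_engine and qos are standard OpenSM options; the dla and d3r engine names exist only in our modified OpenSM, and the names for the DF-DN variants are illustrative):

    # opensm.conf (fragment): select the routing engine used to compute
    # the forwarding tables. Standard engines include minhop, updn,
    # lash and dfsssp; dla and d3r are added by our modified OpenSM.
    routing_engine dla

    # Enable QoS so that the SL2VL tables computed by the routing engine
    # are actually distributed to switches and HCAs (required, for
    # instance, by dfsssp and dla).
    qos TRUE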
In order to validate our implementation of DLA, as well as the simulation results shown in Sections 5.1 and 5.2, we have run a single experiment in CELLIA for each combination of routing engine (DLA, D3R, LASH, UPDN, DFSSSP, SSSP-DF and MINHOP-DF), benchmark (Netgauge [41], HPCG [42], HPCC [43], Graph500 [44] and NAMD [45]) and task mapping (Random and Linear, i.e. MPI task i is assigned to endnode i). Basically, this is the same methodology used to validate our previous routing engine proposal [18]. Figure 20 shows the performance results of these experiments.
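Generating the two task mappings is straightforward; the sketch below (a hypothetical helper, not the actual scripts used in our experiments) builds the task-to-endnode assignment for n tasks:

    #include <stdio.h>
    #include <stdlib.h>

    /* map[i] = endnode hosting MPI task i. */
    static void linear_mapping(int *map, int n)
    {
        for (int i = 0; i < n; i++)
            map[i] = i;                    /* task i runs on endnode i */
    }

    static void random_mapping(int *map, int n)
    {
        linear_mapping(map, n);
        for (int i = n - 1; i > 0; i--) {  /* Fisher-Yates shuffle */
            int j = rand() % (i + 1);
            int t = map[i]; map[i] = map[j]; map[j] = t;
        }
    }

    int main(void)
    {
        int map[42];                       /* 42 endnodes, as in CELLIA */
        random_mapping(map, 42);
        for (int i = 0; i < 42; i++)
            printf("task %d -> endnode %d\n", i, map[i]);
        return 0;
    }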
Note that the real applications generate one out of four traffic patterns, or a combination of them. Specifically, the HPCC tests Ping-Pong and Ordered-Ring generate an adversarial traffic pattern. Other tests, such as PTRANS or Netgauge-ebb, generate a many-to-many traffic pattern, while Netgauge Nto1 and Netgauge 1toN generate a hot-spot (i.e. many-to-one) and a one-to-many pattern, respectively. The remaining tests generate a combination of the above (although not necessarily of all the patterns), and we use them to extend the consistency of the evaluation to other real workloads.
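For instance, the many-to-one (hot-spot) behavior exercised by Netgauge Nto1 essentially corresponds to the following MPI sketch (illustrative only; Netgauge's actual measurement loop is more elaborate):

    #include <mpi.h>
    #include <stdlib.h>

    #define MSG_SIZE (1 << 20)  /* 1 MiB per message (illustrative) */

    int main(int argc, char **argv)
    {
        int rank, size;
        char *buf = malloc(MSG_SIZE);

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            /* The hot spot: receive one message from every other task. */
            for (int src = 1; src < size; src++)
                MPI_Recv(buf, MSG_SIZE, MPI_CHAR, src, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        free(buf);
        return 0;
    }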
In general, there are small variations among all the evaluated routing engines, because of CELLIA's small network size (i.e. a 42-node Dragonfly with 21 switches). Note, however, that the simulations in a small network (see Figure 14a) and the experiments in CELLIA show qualitatively similar performance results.
[Figure 20: Performance results of MPI-based real workloads in the 42-node Dragonfly configured in the real IB-based cluster CELLIA, for dla, d3r, lash, updn, dfsssp, sssp-df and minhop-df under linear and random task mappings. Panels: (a) Netgauge (Nto1, 1toN and EBB), throughput in Gb/s; (b) Graph500 (Simple and Replicated), harmonic mean TEPS; (c) HPCG, GFLOPs and execution time (s); (d) HPCC Ping-Pong, min/avg/max throughput in GB/s; (e) HPCC Ordered Ring (natural and random) and PTRANS, throughput in Gb/s; (f) HPCC MPI Random Access (plain and LCG), GUPS (milli); (g) HPCC Ping-Pong, min/avg/max latency in us; (h) HPCC Ordered Ring (natural and random), latency in us; (i) NAMD (apoa1 and f1atpase), wall-clock time (s).]
For instance, under many-to-many traffic patterns, UPDN shows a performance degradation because of the longer routes it configures and because the root switch becomes a bottleneck. For DLA, the performance results are similar, in many cases, to those of D3R, DFSSSP and LASH. The same applies to SSSP-DF and MINHOP-DF, except for the HPCC MPI Random Access test, where the performance of both falls below that of the other routing engines.
In summary, the experiments under real and simulated scenarios confirm that DLA is a viable routing engine for IB-based Dragonflies, offering similar or better performance than other routing engines under different applications, while requiring a very reduced number of resources (1 SL and 2 VLs).
6. Conclusions
The overall objective of the work described in this paper is to provide networks based on the InfiniBand architecture (IB) with a suitable implementation of the deadlock-free minimal routing algorithm (DLA) proposed by Kim and Dally for Dragonflies. Prior to this work, this algorithm had not been implemented as a routing engine in the IB control software (OpenSM), mainly due to the restrictions in the IB specification regarding the shifting of Virtual Lanes (VLs), which DLA requires to guarantee deadlock freedom. In order to fill this gap, we have proposed a straightforward method to overcome these restrictions, so that DLA can finally be used as the routing engine in real IB-based clusters configured with a Dragonfly topology, instead of the generic (i.e. topology-agnostic) routing engines previously available in OpenSM.
In more detail, our implementation of DLA is restricted to Dragonflies with fully-connected inter- and intra-group networks (see Section 3). In addition, an "asymmetric" computation of the Service-Level-to-Virtual-Lane (SL2VL) tables is required, which may introduce some degree of complexity in the network configuration (the sketch after this paragraph illustrates the idea). The results obtained both from simulations and from experiments performed in a real IB-based cluster confirm that our implementation of DLA is able to outperform other routing engines available in OpenSM in networks based on switches implementing Virtual Output Queuing (VOQ).
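The gist of this asymmetry can be sketched as follows (simplified, with illustrative naming; the actual OpenSM data structures differ). Since minimal paths in these Dragonflies take at most one global hop, an input port attached to a global (inter-group) link only receives packets that have already crossed their global channel, so the SL2VL tables of those ports map the single SL used by DLA to the second VL:

    #include <stdio.h>

    /* Simplified sketch of DLA's asymmetric SL2VL idea: packets travel
     * on VL0 until they cross a global link and on VL1 afterwards,
     * which breaks the cyclic channel dependencies. */
    static int dla_sl2vl(int in_port_is_global)
    {
        return in_port_is_global ? 1 : 0;  /* VL assigned to SL 0 */
    }

    int main(void)
    {
        printf("local in-port -> VL%d, global in-port -> VL%d\n",
               dla_sl2vl(0), dla_sl2vl(1));
        return 0;
    }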
Moreover, we have analyzed the requirements of the different routing engines in terms of the number of required SLs and VLs, and we can conclude that DLA requires fewer resources than most of the other routing engines already included in OpenSM.
Acknowledgment
This work has been jointly supported by the Spanish Ministry of Science, Innovation & Universities under the project RTI2018-098156-B-C52, by the Spanish MINECO under the project UNCM13-1E-2456, and by the Junta de Comunidades de Castilla-La Mancha under the projects POII10-0289-3724, PEII-2014-028-P and SBPLY/17/180501/000498. German Maglione-Mathey is funded by the Universidad de Castilla-La Mancha (UCLM) with the pre-doctoral contract PREDUCLM16/29. Jesus Escudero-Sahuquillo is funded by the Universidad de Castilla-La Mancha (UCLM) and the European Commission (FSE funds), with a contract for accessing the Spanish System of Science, Technology and Innovation, for the implementation of the UCLM research program (UCLM resolution date: 31/07/2014).
References

[1] J. Kim, W. J. Dally, S. Scott, D. Abts, Technology-driven, highly-scalable Dragonfly topology, in: 2008 International Symposium on Computer Architecture (ISCA), 2008, pp. 77–88. doi:10.1109/ISCA.2008.19.

[2] B. Arimilli, R. Arimilli, V. Chung, S. Clark, W. Denzel, B. Drerup, T. Hoefler, J. Joyner, J. Lewis, J. Li, N. Ni, R. Rajamony, The PERCS high-performance interconnect, in: 2010 18th IEEE Symposium on High Performance Interconnects, 2010, pp. 75–82. doi:10.1109/HOTI.2010.16.

[3] G. Faanes, A. Bataineh, D. Roweth, T. Court, E. Froese, B. Alverson, T. Johnson, J. Kopnick, M. Higgins, J. Reinhard, Cray Cascade: A scalable HPC system based on a Dragonfly network, in: High Performance Computing, Networking, Storage and Analysis (SC), 2012 International Conference for, 2012, pp. 1–9. doi:10.1109/SC.2012.39.

[4] The Niagara supercomputer (2018). URL https://www.scinethpc.ca/niagara/

[5] A. Shpiner, Z. Haramaty, S. Eliad, V. Zdornov, B. Gafni, E. Zahavi, Dragonfly+: Low cost topology for scaling datacenters, in: 2017 IEEE 3rd International Workshop on High-Performance Interconnection Networks in the Exascale and Big-Data Era (HiPINEB), 2017, pp. 1–8. doi:10.1109/HiPINEB.2017.11.

[6] W. J. Dally, C. L. Seitz, Deadlock-free message routing in multiprocessor interconnection networks, IEEE Transactions on Computers C-36 (5) (1987) 547–553. doi:10.1109/TC.1987.1676939.

[7] W. Dally, B. Towles, Principles and Practices of Interconnection Networks, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2003.

[8] T. Skeie, O. Lysne, I. Theiss, Layered shortest path (LASH) routing in irregular system area networks, in: Proceedings 16th International Parallel and Distributed Processing Symposium, 2002, 8 pp. doi:10.1109/IPDPS.2002.1016559.

[9] L. G. Valiant, G. J. Brebner, Universal schemes for parallel communication, in: Proceedings of the Thirteenth Annual ACM Symposium on Theory of Computing, STOC '81, ACM, New York, NY, USA, 1981, pp. 263–277. doi:10.1145/800076.802479.

[10] A. Singh, Load-balanced routing in interconnection networks, Ph.D. thesis, Stanford University, Stanford, CA, USA (March 2005).

[11] M. García, E. Vallejo, R. Beivide, M. Odriozola, M. Valero, Efficient routing mechanisms for Dragonfly networks, in: 2013 42nd International Conference on Parallel Processing, 2013, pp. 582–592. doi:10.1109/ICPP.2013.72.

[12] P. Faizian, M. S. Rahman, M. A. Mollah, X. Yuan, S. Pakin, M. Lang, Traffic pattern-based adaptive routing for intra-group communication in Dragonfly networks, in: 2016 IEEE 24th Annual Symposium on High-Performance Interconnects (HOTI), 2016, pp. 19–26. doi:10.1109/HOTI.2016.017.

[13] J. Won, G. Kim, J. Kim, T. Jiang, M. Parker, S. Scott, Overcoming far-end congestion in large-scale networks, in: 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), 2015, pp. 415–427. doi:10.1109/HPCA.2015.7056051.

[14] The InfiniBand Architecture Specification, Volume 1, Release 1.3, InfiniBand Trade Association (March 2015). URL http://www.infinibandta.org

[15] TOP500.org, Top 500, the list. URL www.top500.org

[16] OpenSM, InfiniBand subnet manager. URL http://git.openfabrics.org/~halr/opensm.git/

[17] J. Domke, T. Hoefler, W. E. Nagel, Deadlock-free oblivious routing for arbitrary topologies, in: 2011 IEEE International Parallel & Distributed Processing Symposium, 2011, pp. 616–627. doi:10.1109/IPDPS.2011.65.

[18] G. Maglione-Mathey, P. Yebenes, J. Escudero-Sahuquillo, P. J. Garcia, F. J. Quiles, E. Zahavi, Scalable deadlock-free deterministic minimal-path routing engine for InfiniBand-based Dragonfly networks, IEEE Transactions on Parallel and Distributed Systems 29 (1) (2018) 183–197. doi:10.1109/TPDS.2017.2742503.

[19] J. C. Sancho, A. Robles, J. Duato, Effective strategy to compute forwarding tables for InfiniBand networks, in: International Conference on Parallel Processing, 2001, pp. 48–57. doi:10.1109/ICPP.2001.952046.

[20] T. Schneider, O. Bibartiu, T. Hoefler, Ensuring deadlock-freedom in low-diameter InfiniBand networks, in: 2016 IEEE 24th Annual Symposium on High-Performance Interconnects (HOTI), 2016, pp. 1–8. doi:10.1109/HOTI.2016.015.

[21] C. Camarero, E. Vallejo, R. Beivide, Topological characterization of Hamming and Dragonfly networks and its implications on routing, ACM Trans. Archit. Code Optim. 11 (4) (2014) 39:1–39:25. doi:10.1145/2677038.

[22] B. Prisacari, G. Rodriguez, P. Heidelberger, D. Chen, C. Minkenberg, T. Hoefler, Efficient task placement and routing of nearest neighbor exchanges in Dragonfly networks, in: Proceedings of the 23rd International Symposium on High-performance Parallel and Distributed Computing, HPDC '14, ACM, New York, NY, USA, 2014, pp. 129–140. doi:10.1145/2600212.2600225.

[23] M. García, E. Vallejo, R. Beivide, M. Odriozola, C. Camarero, M. Valero, G. Rodríguez, J. Labarta, C. Minkenberg, On-the-fly adaptive routing in high-radix hierarchical networks, in: 2012 41st International Conference on Parallel Processing, 2012, pp. 279–288. doi:10.1109/ICPP.2012.46.

[24] D. Xiang, X. Liu, Deadlock-free broadcast routing in Dragonfly networks without virtual channels, IEEE Transactions on Parallel and Distributed Systems 27 (9) (2016) 2520–2532. doi:10.1109/TPDS.2015.2503746.

[25] InfiniBand Roadmap: IBTA, InfiniBand Trade Association. URL http://www.infinibandta.org/content/pages.php?pg=technology_overview

[26] The OpenFabrics Alliance. URL www.openfabrics.org

[27] T. Hoefler, T. Schneider, A. Lumsdaine, Optimized routing for large-scale InfiniBand networks, in: 2009 17th IEEE Symposium on High Performance Interconnects, 2009, pp. 103–111. doi:10.1109/HOTI.2009.9.

[28] M. D. Schroeder, A. D. Birrell, M. Burrows, H. Murray, R. M. Needham, T. L. Rodeheffer, E. H. Satterthwaite, C. P. Thacker, Autonet: a high-speed, self-configuring local area network using point-to-point links, IEEE Journal on Selected Areas in Communications 9 (8) (1991) 1318–1335. doi:10.1109/49.105178.

[29] I. D. Scherson, A. S. Youssef (Eds.), Interconnection Networks for High-performance Parallel Computers, IEEE Computer Society Press, Los Alamitos, CA, USA, 1994.

[30] A. Varga, OMNeT++ User Manual, OpenSim Ltd. URL http://omnetpp.org/doc/omnetpp/manual/usman.html

[31] G. Maglione-Mathey, P. Yebenes, J. Escudero-Sahuquillo, P. Garcia, F. Quiles, Combining OpenFabrics software and simulation tools for modeling InfiniBand-based interconnection networks, in: Proceedings of the 2nd IEEE International Workshop on High-Performance Interconnection Networks in the Exascale and Big-Data Era (HiPINEB 2016), 2016, pp. 55–58. doi:10.1109/HIPINEB.2016.7.

[32] P. Yebenes, J. Escudero-Sahuquillo, P. Garcia, F. Quiles, Towards modeling interconnection networks of exascale systems with OMNeT++, in: Parallel, Distributed and Network-Based Processing (PDP), 2013 21st Euromicro International Conference on, 2013, pp. 203–207. doi:10.1109/PDP.2013.36.

[33] Y. Tamir, G. L. Frazier, High performance multi-queue buffers for VLSI communication switches, in: Proc. 15th Annual Symposium on Computer Architecture, 1988, pp. 343–354.

[34] M. Karol, M. Hluchyj, S. Morgan, Input versus output queueing on a space-division packet switch, IEEE Transactions on Communications 35 (12) (1987) 1347–1356. doi:10.1109/TCOM.1987.1096719.

[35] K. Yoshigoe, Threshold-based exhaustive round-robin for the CICQ switch with virtual crosspoint queues, in: 2007 IEEE International Conference on Communications, 2007, pp. 6325–6329. doi:10.1109/ICC.2007.1047.

[36] M. E. Gomez, J. Flich, A. Robles, P. Lopez, J. Duato, VOQsw: a methodology to reduce HOL blocking in InfiniBand networks, in: Proceedings International Parallel and Distributed Processing Symposium, 2003, 10 pp. doi:10.1109/IPDPS.2003.1213134.

[37] J. Duato, J. Flich, T. Nachiondo, A cost-effective technique to reduce HOL blocking in single-stage and multistage switch fabrics, in: 12th Euromicro Conference on Parallel, Distributed and Network-Based Processing, 2004. Proceedings, 2004, pp. 48–53. doi:10.1109/EMPDP.2004.1271426.

[38] P. Yébenes, J. Escudero-Sahuquillo, P. J. García, F. J. Quiles, Efficient queuing schemes for HoL-blocking reduction in Dragonfly topologies with minimal-path routing, in: 2015 IEEE International Conference on Cluster Computing, 2015, pp. 817–824. doi:10.1109/CLUSTER.2015.138.

[39] A. Bhatele, W. D. Gropp, N. Jain, L. V. Kale, Avoiding hot-spots on two-level direct networks, in: SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, 2011, pp. 1–11. doi:10.1145/2063384.2063486.

[40] RAAP, High-performance networks and architectures group. URL http://www.i3a.info/raap

[41] T. Hoefler, T. Mehlan, A. Lumsdaine, W. Rehm, Netgauge: A network performance measurement framework, in: Proceedings of High Performance Computing and Communications (HPCC'07), Vol. 4782, Springer, 2007, pp. 659–671.

[42] J. Dongarra, M. A. Heroux, P. Luszczek, HPCG benchmark: a new metric for ranking high performance computing systems, Tech. Rep. UT-EECS-15-736, Electrical Engineering and Computer Science Department, Knoxville, Tennessee (November 2015).

[43] The HPCC benchmark (2015). URL icl.cs.utk.edu/hpcc

[44] The Graph500 benchmark (2015). URL www.graph500.org

[45] J. C. Phillips, R. Braun, W. Wang, J. Gumbart, E. Tajkhorshid, E. Villa, C. Chipot, R. D. Skeel, L. Kalé, K. Schulten, Scalable molecular dynamics with NAMD, Journal of Computational Chemistry 26 (16) (2005) 1781–1802.