High-Performance Routing with Multipathing and
Path Diversity in Ethernet and HPC Networks
Maciej Besta¹, Jens Domke², Marcel Schneider¹, Marek Konieczny³,
Salvatore Di Girolamo¹, Timo Schneider¹, Ankit Singla¹, Torsten Hoefler¹
¹Department of Computer Science, ETH Zurich; ²RIKEN Center for Computational Science (R-CCS);
³Faculty of Computer Science, Electronics and Telecommunications, AGH-UST
Abstract—The recent line of research into topology design focuses on lowering network diameter. Many low-diameter topologies such
as Slim Fly or Jellyfish that substantially reduce cost, power consumption, and latency have been proposed. A key challenge in
realizing the benefits of these topologies is routing. On one hand, these networks provide shorter path lengths than established
topologies such as Clos or torus, leading to performance improvements. On the other hand, the number of shortest paths between
each pair of endpoints is much smaller than in Clos, but there is a large number of non-minimal paths between router pairs. This
hampers or even makes it impossible to use established multipath routing schemes such as ECMP. In this work, to facilitate
high-performance routing in modern networks, we analyze existing routing protocols and architectures, focusing on how well they
exploit the diversity of minimal and non-minimal paths. We first develop a taxonomy of different forms of support for multipathing and
overall path diversity. Then, we analyze how existing routing schemes support this diversity. Among others, we consider multipathing
with both shortest and non-shortest paths, support for disjoint paths, or enabling adaptivity. To address the ongoing convergence of
HPC and “Big Data” domains, we consider routing protocols developed for both HPC systems and for data centers as well as general
clusters. Thus, we cover architectures and protocols based on Ethernet, InfiniBand, and other HPC networks such as Myrinet. Our
review will foster developing future high-performance multipathing routing protocols in supercomputers and data centers.
This is an extended version of a paper published at
IEEE TPDS 2021 under the same title
1 INTRODUCTION AND MOTIVATION
Fat tree [141] and related networks such as Clos [59] are the
most commonly deployed topologies in data centers and
supercomputers today, dominating the landscape of Ether-
net clusters [166], [100], [219]. However, many low-diameter
topologies such as Slim Fly or Jellyfish that substantially
reduce cost, power consumption, and latency have been
proposed. These networks improve the cost-performance
tradeoff compared to fat trees. For instance, Slim Fly is
2× more cost- and power-efficient at scale than fat trees,
simultaneously delivering 25% lower latency [35].
A key challenge in realizing the benefits of these topolo-
gies is routing. On one hand, due to their lower diameters,
these networks provide shorter path lengths than fat trees
and other traditional topologies such as torus. However, as
illustrated by our recent research efforts [43], the number of
shortest paths between each pair of endpoints is much smaller
than in fat trees. Selected results are illustrated in Figure 1.
In this figure, we compare established three-level fat trees
(FT3) with representative modern low-diameter networks:
Slim Fly (SF) [35], [33] (a variant with diameter 2), Dragonfly
(DF) [137] (the “balanced” variant with diameter 3), Jelly-
fish (JF) [203] (with diameter 3), Xpander (XP) [219] (with
diameter 3), and HyperX (Hamming graph) (HX) [4] that
generalizes Flattened Butterflies (FBF) [136] with diameter 3.
As observed [43], “in DF and SF, most routers are connected
with one minimal path. In XP, more than 30% of routers are
connected with one minimal path.” In the corresponding JF
Fig. 1: Distributions of lengths and counts of shortest paths in low-diameter
topologies and in fat trees. When analyzing counts of minimal paths between a
router pair, we consider disjoint paths (no shared links). An equivalent Jellyfish
network is constructed using the same number of identical routers as in the
corresponding non-random topology (a plot taken from our past work [43]).
networks (i.e., random Jellyfish networks constructed using
the same number of identical routers as in the corresponding
non-random topology), “the results are more leveled out, but
pairs of routers with one shortest part in-between still form large
fractions. FT3 and HX show the highest diversity.” We conclude
that in all the considered low-diameter topologies, shortest
paths fall short: at least a large fraction of router pairs are
connected with only one shortest path.
Simultaneously, these low-diameter topologies offer
high diversity of non-minimal paths [43]. They provide at
least three disjoint “almost”-minimal paths (i.e., paths that
are one hop longer than their corresponding shortest paths)
per router pair (for the majority of pairs). For example, in
Slim Fly (which has diameter 2), 99% of router pairs are
connected with multiple non-minimal paths of length 3 [43].
The above properties of low-diameter networks pose unprecedented design challenges for performance-conscious routing protocols. First, as shortest paths fall short, one must resort to non-minimal routing, which is usually more complex than minimal routing. Moreover,
as topologies lower their diameter, their link count is also
reduced. Thus, even if they do indeed offer more than one
non-minimal path between pairs of routers, the correspond-
ing routing protocol must carefully use these paths in order
not to congest the network (i.e., the path diversity is still
a scarce resource demanding careful examination and use).
Third, a shortage of shortest paths means that one cannot
use established multipath routing¹ schemes such as Equal-
Cost Multi-Path (ECMP) [106], which usually assume that
different paths between communicating entities are minimal and
have equal lengths. Restricting traffic to these paths does not
utilize the path diversity of low-diameter networks.
In this work, to facilitate overcoming these challenges
and to propel designing high-performance routing for mod-
ern interconnects, we develop a taxonomy of different forms
of support for path diversity by a routing design. These
forms of support include (1) enabling multipathing using
both (2) shortest and (3) non-shortest paths, (4) explicit
consideration of disjoint paths, (5) support for adaptive
load balancing across these paths, and (6) genericness (i.e.,
being applicable to different topologies). We also discuss
additional aspects, for example whether a given design uses
multipathing to enhance its resilience, performance, or both.
Then, we use this taxonomy to categorize and analyze a
wide selection of existing routing designs. Here, we consider
two fundamental classes of routing designs: simple routing
building blocks (e.g., ECMP [106] or Network Address Alias-
ing (NAA)) and routing architectures (e.g., PortLand [166]
or PARX [73]). While analyzing respective routing archi-
tectures, we include and investigate the architectural and
technological details of these designs, for example whether
a given scheme is based on the simple Ethernet architecture,
the full TCP/IP stack, the InfiniBand (IB) stack, or other
HPC designs. This enables network architects and protocol
designers to gain insights into supporting path diversity in
the presence of different technological constraints.
We consider protocols and architectures that originated
in both the HPC and data center as well as general net-
working communities. This is because all these environ-
¹ Multipath routing indicates a routing protocol that uses more than one path in the network, for at least one pair of communicating endpoints. We consider both multipathing within a single flow/message (e.g., as in spraying single packets across multiple paths, cf. § 4.6) and multipathing across flows/messages (e.g., as in standard ECMP, where different flows follow different paths, cf. § 4.4). Path diversity indicates whether a given network topology offers multiple paths between different routers (i.e., has potential for speedups from multipath routing).
ments are important in today’s large-scale networking land-
scape. While the most powerful Top500 systems use vendor-specific or InfiniBand (IB) interconnects, more than half of the Top500 machines [74] (e.g., in the June 2019 and November 2019 list issues) are based on Ethernet, see Figure 2. We
observe similar numbers for the Green500 list. The impor-
tance of Ethernet is increased by the “convergence of HPC and
Big Data”, with cloud providers and data center operators
aggressively aiming for high-bandwidth and low-latency
fabrics [219], [100], [223]. Another example is Mellanox, with its Ethernet sales for the 3rd quarter of 2017 being higher than those for IB [181]. Similar trends are observed in more recent numbers: “Sales of Ethernet adapter products increased 112% year-over-year (...) we are shipping 400 Gbps Ethernet switches” [233]. At the same time, IB’s sales have been growing by 27% year-over-year [233], “led by strong demand for the HDR 200 gigabit solutions” [233]. Thus, our analysis can facilitate developing multipath routing both in IB-based supercomputers and in the broad landscape of cloud computing infrastructure such as data centers.
Fig. 2: The share of different interconnect technologies in the Top500 systems (a
plot taken from our past work [43]).
In general, we provide the following contributions:
• We provide the first taxonomy of networking architectures and the associated routing protocols, focusing on the offered support for path diversity and multipathing.
• We use our taxonomy to categorize a wide selection of routing designs for data centers and supercomputers.
• We investigate the relationships between support for path diversity and architectural and technological details of different routing protocols.
• We discuss in detail the design of representative protocols.
• We are the first to analyze multipathing schemes related both to supercomputers and the High-Performance Computing (HPC) community (e.g., the InfiniBand stack) and to data centers (e.g., the TCP/IP stack).
Complementary Analyses. There exist surveys on mul-
tipathing [183], [5], [243], [15], [217], [142], [202]. Yet, none
focuses on multipathing and path diversity offered by
routing in data centers or supercomputers. For example,
Lee and Choi describe multipathing in the general Internet and telecommunication networks [139].
Fig. 3: Illustration of network topologies related to the routing protocols and schemes considered in this work. Red color indicates an example shortest path
between routers. Green color indicates example alternative non-minimal paths. Blue color illustrates grouping of routers.
Li et al. [142] also
focus on the general Internet, covering aspects of multi-
path transmission related to all TCP/IP stack layers. Singh
et al. [202] cover only a few multipath routing schemes
used in data centers, focusing on a broad Internet setting.
Moreover, some works are dedicated to performance evalu-
ations of a few schemes for multipathing [3], [109]. Next,
different works are dedicated to multipathing in sensor
networks [15], [5], [183], [243]. Finally, there are analyses of
other aspects of data center networking, for example energy
efficiency [200], [26], optical interconnects [124], network
virtualization [117], [25], overall routing [56], general data
center networking with focus on traffic control in TCP [168],
low-latency data centers [145], the TCP incast problem [188],
bandwidth allocation [57], transport control [237], general
data center networking for clouds [226], congestion manage-
ment [120], reconfigurable data center networks [81], and
transport protocols [207], [182]. We complement all these
works, focusing solely on multipath routing in supercomputers,
data centers, and small clusters. As opposed to other works
with broad focus, we specifically target the performance aspects
of multipathing and path diversity. Our survey is the first to
deliver a taxonomy of the path diversity features of routing
schemes, to categorize existing routing protocols based on
this taxonomy, and to consider both traditional TCP/IP and
Ethernet designs, but also protocols and concepts tradi-
tionally associated with HPC, for example multipathing in
InfiniBand [179], [87].
2 FUNDAMENTAL NOTIONS
We first outline fundamental notions: network topologies,
network stacks, and associated routing concepts and designs.
While we do not conduct any theoretical investigation,
we state for clarity a network model used implicitly
in this work. We model an interconnection network as an undirected graph G = (V, E); V and E are sets of routers, also referred to as nodes (|V| = N_r), and of full-duplex inter-router physical links. Endpoints (also referred to as servers or compute nodes) are not modeled explicitly.
2.1 Network Topologies
We consider routing in different network topologies. The
most important associated topologies are in Figure 3. We
only briefly describe the structural properties that routing architectures use to enable multipathing (a detailed analysis of the respective topologies in terms of their path diversity is
available elsewhere [43]). In most networks, routers form
groups that are intra-connected with the same pattern of
cables. We indicate such groups with the blue color.
Many routing designs are related to fat trees (FT) [141]
and Clos (CL) [59]. In these networks (broadly referred
to as “multistage topologies (MS)”), a certain fraction of
routers is attached to endpoints while the remaining routers
are only dedicated to forwarding traffic. A common real-
ization of these networks consists of three stages (layers) of
routers: edge (leaf ) routers, aggregation (access) routers, and
core (spine, border) routers. Edge and aggregation routers are
additionally grouped into pods, to facilitate physical layout
(cf. Fig. 3). Only edge routers connect to endpoints. Aggre-
gation and core routers only forward traffic; they enable
multipathing. The exact form of multipathing depends on
the topology variant. Consider a pair of communicating
edge routers (located in different pods/groups). In fat trees,
multipathing is enabled by selecting different core routers
and different aggregation routers to forward traffic between
the same communicating pair of edge routers. Importantly,
after fixing the core router, there is a unique path between
the communicating edge routers. In Clos, in addition to such
multipathing enabled by selecting different core routers, one
can also use different paths between a specific edge and core router.
Finally, simple trees are similar to fat trees in that fixing
different core routers enables multipathing; still, one cannot
multipath by using different aggregation routers.
The most important modern low-diameter networks
are Slim Fly (SF) [35], Dragonfly (DF) [137], Jellyfish
(JF) [203], Xpander (XP) [219], and HyperX (Hamming
graph) (HX) [4]. Other proposed topologies in this family in-
clude Flexfly [227], Galaxyfly [140], Megafly [77], projective
topologies [52], HHS [22], and others [184], [128], [172]. All
these networks have different structure and thus different
potential for multipathing [43]; in Figure 3, we illustrate
example paths between a pair of routers. Importantly, in
most of these networks, unlike in fat trees, different paths
between two endpoints usually have different lengths [43].
Finally, many routing designs can be used with any
topology, including traditional ones such as meshes.
2.2 Routing Concepts and Related
We often refer to three interrelated sub-problems for routing: (P) path selection, (R) routing itself, and (L) load balancing. Path selection (P) determines which paths can be used for sending a given packet. Routing itself (R) answers the question of how the packet finds a way to its destination. Load balancing (L) determines which path (out of the identified alternatives) should be used for sending a packet to maximize performance and minimize congestion.
2.3 Routing Schemes
We consider routing schemes (designs) that can be loosely
grouped into specific protocols (e.g., OSPF [162]), architec-
tures (e.g., PortLand [166]), and general strategies and tech-
niques (e.g., ECMP [106] or spanning trees [174]). Overall,
a protocol or a strategy often addresses a specific network-
ing problem, rarely more than one. Contrarily, a routing
architecture usually delivers a complete routing solution and
it often addresses more than one, and often all, of the above-
described problems. All these designs are almost always
developed in the context of a specific network stack, also
referred to as network architecture, that we describe next.
2.4 Network Stacks
We focus on data centers and high-performance systems.
Thus, we target Ethernet & TCP/IP, and traditional HPC
networks (InfiniBand, Myrinet, OmniPath, and others).
2.4.1 Ethernet & TCP/IP
In the TCP/IP protocol stack, two layers of addressing are
used. On Layer 2 (L2), Ethernet (MAC) addresses are used to uniquely identify endpoints, while on Layer 3 (L3), IP
addresses are assigned to endpoints. Historically, the Ether-
net layer is not supposed to be routable: MAC addresses are
only used within a bus-like topology where no routing is
required. In contrast, the IP layer is designed to be routable,
with a hierarchical structure that allows scalable routing
over a worldwide network (the Internet). More recently,
vendors started to provide routing abilities on the Ethernet
layer for pragmatic reasons: since the Ethernet layer is
effectively transparent to the software running on the end-
points, such solutions are easy to deploy. Additionally, the
Ethernet interconnect of a cluster can usually be considered
homogeneous, while the IP layer is used to route between
networks and needs to be highly interoperable.
Since Ethernet was not designed to be routable, there are
several restrictions on routing protocols for Ethernet: First,
the network cannot modify any fields in the packets (control-
data plane separation is key in self-configuring Ethernet
devices). There is no mechanism like the TTL field in the IP
header that allows the network to detect cyclic routing. Sec-
ond, Ethernet devices come with pre-configured, effectively
random addresses. This implies that there is no structure
in the addresses that would allow for a scalable routing
implementation: Each switch needs to keep a lookup table
with entries for each endpoint in the network. Third, since
the network is expected to self-configure, Ethernet routing
schemes must be robust to the addition and removal of
links. These restrictions shape many routing schemes for
Ethernet: Spanning trees are commonly used to guarantee
loop-freedom under any circumstances, and more advanced
schemes often rely on wrapping Ethernet frames into a
format more suitable for routing at the edge switches [166].
Another intricacy of the TCP/IP stack is that flow control
is only implemented in Layer 4 (L4), the transport layer. This
means that the network is not supposed to be aware of and
responsible for load balancing and resource sharing; rather,
it should deliver packets to the destination on a best-effort
basis. In practice, most advanced routing schemes violate
this separation and are aware of TCP flows, even though
flow control is still left to the endpoint software [100].
Many practical problems are caused by the interaction of
TCP flow control with decisions in the routing layer, and
such problems are often discussed together with routing
schemes, even though they are completely independent of
the network topology (e.g., the TCP incast problem).
Traditional Ethernet is lossy: when packet buffers are
full, packets are dropped. Priority Flow Control (PFC) [66]
addresses this by allowing a switch to notify another (up-
stream) switch with special “pause” frames to stop sending
frames until further notice, if the buffer occupancy in the
first switch is above a certain threshold. Another extension
of Ethernet towards technologies traditionally associated
with HPC is the incorporation of Remote Direct Memory
Access (RDMA) using the RDMA over Converged Ethernet
(RoCE) [112] protocol, which enriches the Ethernet with the
RDMA communication semantics.
2.4.2 InfiniBand
The InfiniBand (IB) architecture is a switched fabric design
and is intended for high-performance and system area
network (SAN) deployment scales. Up to 49,151 endpoints
(physical or virtual), addressed by a 16-bit local identifier (LID), can be arranged in a so-called subnet, while the
remaining address space is reserved for multicast operations
within a subnet. Similar to the modern datacenter Ethernet
(L2) solutions, these IB subnets are routable to a limited
extent with switches supporting unicast and multicast for-
warding tables, flow control, and other features which do
not require modification of in-flight packet headers. Theo-
retically, multiple subnets can be connected by IB routers
performing the address translation between the subnets
to create larger SANs (effectively L3 domains), but this
impedes performance due to the additionally required global
routing header (GRH) and is rarely used in practice.
IB natively supports RDMA and atomic operations. The
necessary (for high performance) lossless packet forwarding
within IB subnets is realized through link-level, credit-based
flow control [61]. Software-based and latency impeding
solutions to achieve reliable transmissions, as for example in
TCP, are therefore not required. While switches have the ca-
pability to drop deadlocked packets that reside for extended
time periods in their buffers, they cannot identify livelocks,
such as looping unicast or multicast packets induced by
cyclic routing. Hence, the correct and acyclic routing config-
uration is offloaded to a centralized controller, called subnet
manager, which configures connected IB devices, calculates
the forwarding tables with implemented topology-agnostic
or topology-aware routing algorithms, and monitors the net-
work for failures. Therefore, most routing algorithms either
focus on minimal path length to guarantee loop-freedom, or
are derivatives of the Up*/Down* routing protocol [152],
[78] which can be viewed as a generalization of the spanning
tree protocol of Ethernet networks. Besides this oblivious,
destination-based routing approach, IB also supports source-
based routing, but unfortunately only for a limited traffic
class reserved for certain management packets.
The subnet manager can configure the InfiniBand net-
work with a few flow control features, such as quality-of-
service to prioritize traffic classes over others or congestion
control mechanism to throttle ingest traffic. However, ad-
hering to the correct service levels or actually throttling the
packet generation is left to the discretion of the endpoints.
Similarly, in exchange for the lowest latency and highest bandwidth, IB switches have limited support for common capa-
bilities found in Ethernet, for example VLANs, firewalling,
or other security-relevant functionality. Consequently, some
of these have been implemented in software at the end-
points on top of the IB transport protocol, e.g., TCP/IP via
IPoIB, whenever the HPC community deemed it necessary.
2.4.3 Other HPC Network Designs
Cray’s Aries [14] is a packet-switched interconnect designed
for high performance and deployed on the Cray XC systems.
Aries adopts a dragonfly topology, where nodes within
groups are interconnected with a two-dimensional all-to-
all structure (i.e., routers in one dragonfly group effectively
form a flattened butterfly, cf. Figure 3). Being designed for
high-performance systems, it allows nodes to communicate
with RDMA operations (i.e., put, get, and atomic opera-
tions). The routing is destination-based, and the network
addresses are tuples composed of a node identifier (18-bit,
max 262,144 nodes), the memory domain handle (12-bit)
that identifies a memory segment in the remote node, and
an offset (40-bit) within this segment. The Aries switches
employ wormhole routing [62] to minimize the per-switch
required resources. Aries does not support VLANs or QoS
mechanisms, and its stack design does not match that of
Ethernet. Thus, we define the (software) layer at which the
Aries routing operates as proprietary.
Slingshot [1] is the next-generation Cray network. It
implements a DF topology with fully-connected groups.
Slingshot can switch two types of traffic: RoCE (using L3)
and proprietary. Being able to manage RoCE traffic, a Sling-
shot system can be interfaced directly to data centers, while
the proprietary traffic (similar to Aries, i.e., RDMA-based
and small-packet headers) can be generated from within
the system, preserving high performance. Cray Slingshot
supports VLANs, QoS, and endpoint congestion mitigation.
IBM’s PERCS [19] is a two-level direct interconnection
network designed to achieve high bisection bandwidth
and avoid external switches. Groups of 32 compute nodes
(made of four IBM POWER7 chips) are fully connected
and organized in supernodes. Each supernode has 512 links
connecting it to other supernodes. Depending on the system
size (max 512 supernodes), each supernode pair can be
connected with one or multiple links. The PERCS Hub
Chip connects the POWER7 chips within a compute node
between themselves and with the rest of the network. The
Hub Chip participates in the cache coherence protocol and
is able to fetch/inject data directly from the processors’
L3 caches. PERCS supports RDMA, hardware-accelerated
collective operations, direct-cache (L3) network access, and
enables applications to switch between different routing
modes. Similarly to Aries, PERCS routing operates on a
proprietary stack.
We also summarize other HPC-oriented proprietary
interconnects. Some of them are no longer manufactured;
we include them for the completeness of our discussion of
path diversity. Myricom’s Myrinet [47] is a local area mas-
sively parallel processor network, designed to connect thou-
sands of small compute nodes. A more recent development,
Myrinet Express (MX) [86], provides more functionalities
in its network interface cards (NICs). Open-MX [92] is a
communication layer that offers the MX API on top of the
Ethernet hardware. Quadrics’ QsNet [178], [177] integrates
local memories of compute nodes into a single global virtual
address space. Moreover, Intel introduced OmniPath [46],
an architecture for a tight integration of CPU, memory,
and storage units. Other HPC interconnects are Atos’ Bull
eXascale Interconnect (BXI) [67] and EXTOLL’s intercon-
nect [165]. Many of these architectures feature some form of
programmable NICs [47], [177]. Finally, there exist routing
protocols for specific low-diameter topologies, for example
for SF [234] or DF [153]. However, they usually do not
support multipathing or non-minimal routing.
2.5 Focus of This Work
In our investigation, we focus on routing. Thus, in the Ether-
net and TCP/IP landscape, we focus on designs associated
with Layer 2 (L2, Data Link Layer) and Layer 3 (L3, Internet
Layer), cf. § 2.4.1. As most congestion control and load balancing mechanisms are related to higher layers, we only describe such schemes whenever they are part of the associated L2 or L3
designs. In the InfiniBand landscape, we focus on the subnet
and L3 related schemes, cf. § 2.4.2.
3 TAXONOMY OF ROUTING SCHEMES
We first identify criteria for categorizing the considered
routing designs. We focus on how well these designs utilize
path diversity. These criteria are used in Tables 1–2. Specif-
ically, we analyze whether a given scheme enables using
(1) arbitrary shortest paths and (2) arbitrary non-minimal
paths. Moreover, we consider whether a studied scheme
enables (3) multipathing (between two hosts) and whether
these paths can be (4) disjoint. Finally, we investigate (5) the
support for adaptive load balancing across exposed paths
between router pairs and (6) compatibility with an arbi-
trary topology. In addition, we also indicate the location
of each routing scheme in the networking stack². We also
indicate whether a given multipathing scheme focuses on
performance or resilience (i.e., to provide backup paths in
the event of failures). Next, we identify whether supported
paths come with certain restrictions, e.g., whether they
are offered only within a spanning tree. Finally, we also
broadly categorize the analyzed routing schemes into basic
and complex ones. The former are usually specific protocols
or classes of protocols, used as building blocks of the latter.
4 SIMPLE ROUTING BUILDING BLOCKS
We now present simple routing schemes, summarized in
Table 1, that are usually used as building blocks for more
complex routing designs. For each described scheme, we
indicate which aspects of routing (as described in § 2.2) this scheme focuses on: (P) path selection, (R) routing itself, or (L) load balancing. We consider both general classes of schemes (e.g., overall destination-based routing) and also specific protocols (e.g., Valiant routing [220]).

² We consider protocols in both the Data Link (L2) and Network (L3) layers. However, we abstract away hardware details and use the term “router” for both L2 switches and L3 routers, unless describing a specific switching protocol (to avoid confusion).
Note that, in addition to schemes focusing on multi-
pathing, we also describe designs that do not explicitly
enable it. This is because these designs are often used as key
building blocks of architectures that provide multipathing.
An example is a simple spanning tree mechanism, that on
its own does not enable any form of multipathing, but is a
basis of numerous designs that enable it [16], [209].
4.1 Destination-Based Routing Protocols
The most common approach to routing (R) is destination-based routing. Each router holds a routing table that
maps any destination address to a next-hop output port. No
information apart from the destination address is used, and
the packet does not need to be modified in transit. In this
setup, it is important to differentiate the physical network
topology (typically modeled as an undirected graph, since
all practically used network technologies use full-duplex
links, cf. § 2.1) from the routing graph, which is naturally
directed in destination-based schemes. In the routing graph, there is an edge from node a to node b iff there is a routing table entry at a indicating b as the next-hop destination.
Typically, the lookup table is implemented using longest-
prefix matching, which allows entries with an identical
address prefix and identical output port to be compressed
into one table slot. This method is especially well suited to
hierarchically organized networks. In general, longest-prefix
matching is not required: it is feasible and common to keep
uncompressed routing tables, e.g., in Ethernet routing.
Simple destination-based routing protocols can only pro-
vide a single path between any source and destination, but
this path can be non-minimal. For non-minimal paths, special
care must be taken to not cause cyclic routing: this can
happen when the routing tables of different routers are not
consistent, cf. property preserving network updates [70]. In a
configuration without routing cycles, the routing graph for
a fixed destination node is a tree rooted in the destination.
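To make the table lookup concrete, the following Python sketch (with a hypothetical, hard-coded table; not any particular router implementation) performs destination-based forwarding with longest-prefix matching:

```python
import ipaddress

# Hypothetical routing table: destination prefix -> output port.
# Entries sharing a prefix and output port can be compressed into one
# table slot, which is exactly what longest-prefix matching exploits.
ROUTING_TABLE = {
    ipaddress.ip_network("10.0.0.0/8"): 1,
    ipaddress.ip_network("10.1.0.0/16"): 2,
    ipaddress.ip_network("10.1.2.0/24"): 3,
}

def next_hop_port(dst: str) -> int:
    """Return the output port of the most specific (longest) matching prefix."""
    addr = ipaddress.ip_address(dst)
    matches = [(net.prefixlen, port)
               for net, port in ROUTING_TABLE.items() if addr in net]
    if not matches:
        raise KeyError(f"no route to {dst}")
    return max(matches)[1]

print(next_hop_port("10.1.2.7"))   # -> 3 (the /24 entry wins over /16 and /8)
print(next_hop_port("10.9.0.1"))   # -> 1 (only the /8 entry matches)
```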
4.2 Source Routing (SR)
Another routing (R) scheme is source routing (SR). Here, the
route from source to destination is computed at the source,
and then attached to the packet before it is injected into
the network. Each switch then reads (and possibly removes)
the next hop entry from the route, and forwards the packet
there. Compared to destination based routing, this allows
for far more flexible path selection [121]. Yet, now the
endpoints need to be aware of the network topology to
make viable routing choices.
Source routing is rarely deployed in practice. Still, it
could enable superior routing decisions (compared to des-
tination based routing) in terms of utilizing path diversity,
as endpoints know the physical topology. There are recent
proposals on how to deploy source routing in practice, for
example with the help of OpenFlow [121], or with packet
encapsulation (IP-in-IP or MAC-in-MAC) [95], [111], [94].
Source routing can also be achieved to some degree with
Multiprotocol Label Switching (MPLS) [190], a technique in
which a router forwards packets based on path labels instead
of network addresses (i.e., the MPLS label assigned to a packet
can represent a path to be chosen [190], [232]).
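A minimal sketch of the pop-and-forward mechanics is shown below; the packet format is purely illustrative, whereas real deployments encode the route, e.g., as MPLS labels or encapsulation headers:

```python
from collections import deque

def build_packet(payload, port_list):
    """The source computes the whole route and attaches it to the packet."""
    return {"route": deque(port_list), "payload": payload}

def forward(packet):
    """Each switch reads and removes its next-hop entry, then forwards there."""
    return packet["route"].popleft()

pkt = build_packet("data", [4, 1, 7])    # ports to take at the 1st, 2nd, 3rd switch
print([forward(pkt) for _ in range(3)])  # -> [4, 1, 7]
```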
Routing Scheme (Name, Abbreviation, Reference) | Related concepts (§ 2.2) | Stack Layer (§ 2.4) | Features of schemes: SP NP MP DP ALB AT | Additional remarks and clarifications
General routing building blocks (classes of routing schemes)
Simple Destination-based routing
R
L2, L3 -ééé -Care must be taken not to cause cyclic dependencies
Simple Source-based routing (SR)
R
L2, L3 - - é-Source routing is difficult to deploy in practice, but it is more flexible than
destination-based routing. As endpoints know the physical topology,
multipathing should be easier to realize than in destination routing.
Simple Minimal routing
P
L2, L3 -éééé -Easy to deploy, numerous designs fall in this category
Specific routing building blocks (concrete protocols or concrete protocol families)
Equal-Cost Multipathing (ECMP) [106]
R
L
L3 -é-é é -In ECMP, all routing decisions are local to each switch.
Spanning Trees (ST) [174]
P
L2 ééé -The ST protocol offers shortest paths but only within one spanning tree.
Packet Spraying (PR) [69]
L
L2, L3 -é-é é -One selects output ports with round-robin [69] or randomization [195].
Virtual LANs (VLANs)
P
L2 ééé -VLANs by itself does not focus on multipathing, and it inherits
spanning tree limitations, but it is a key part of multipathing architectures.
IP Routing Protocols
R
L2, L3 -éééé -Examples are OSPF [162], IS-IS [170], EIGRP [173].
Location–Identification Separation (LIS)
R
L2, L3 --ééé-LIS by itself does not focus on multipathing and path diversity,
but it may facilitate developing a multipathing architecture.
Valiant load balancing (VLB) [220]
R
P
L
L2, L3 é-ééé -
UGAL [137]
R
P
L
L2, L3 - - é- - UGAL means Universal Globally-Adaptive Load balanced routing.
Network Address Aliasing (NAA)
L
L3, subnet ------NAA is based on IP aliasing in Ethernet networks [180] and
virtual ports via LID mask control (LMC) in InfiniBand [113, Sec. 7.11.1].
Depending on how a derived scheme implements it.
Multi-Railing
P
L2, L3, subn. --- ---Depending on how a derived scheme implements it.
Multi-Planes
P
L2, L3, subn. ------Depending on how a derived scheme implements it.
TABLE 1: Comparison of simple routing building blocks (often used as parts of more complex routing schemes in Table 2). Rows are sorted chronologically. We focus on how well the compared schemes utilize path diversity. “Related concepts” indicates the associated routing concepts described in § 2.2. “Stack Layer” indicates the location of each routing scheme in the TCP/IP or InfiniBand stack (cf. § 2.4). SP, NP, MP, DP, ALB, and AT illustrate whether a given routing scheme supports various aspects of path diversity. Specifically: SP: A given scheme enables using arbitrary shortest paths. NP: A given scheme enables using arbitrary non-minimal paths. MP: A given scheme enables multipathing (between two hosts). DP: A given scheme considers disjoint paths. ALB: A given scheme offers adaptive load balancing. AT: A given scheme works with an arbitrary topology. -: A given scheme does offer a given feature. : A given scheme offers a given feature in a limited way. é: A given scheme does not offer a given feature. Explanations in remarks.
4.3 Minimal Routing Protocols
A common approach to path selection (P) is to only use minimal paths: paths that are no longer than the shortest path between their endpoints. Minimal paths are preferable
for routing because they minimize network resources con-
sumed for a given volume of traffic, which is crucial to
achieve good performance at high load.
An additional advantage of minimal paths is that they
guarantee loop-free routing in destination-based routing
schemes. For a known, fixed topology, the routing tables can
be configured to always send packets along shortest paths.
Since every hop along any shortest path will decrease the
shortest-path distance to the destination by one, the packet
always reaches its destination in a finite number of steps.
To construct shortest-path routing tables, a variation of
the Floyd-Warshall all-pairs shortest path algorithm [80] can
be used. Here, besides the shortest-path distance for all
router pairs, one also records the out-edge at a given router
(i.e., the output port) for the first step of a shortest path
to any other router. Other schemes are also applicable, for
example an algorithm by Suurballe and Tarjan for finding
shortest pairs of edge-disjoint paths [212].
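A sketch of such a table construction is shown below (assuming unit link weights and a small router count, since the algorithm is cubic in the number of routers):

```python
def shortest_path_tables(n, links):
    """Floyd-Warshall variant that, besides all-pairs distances, records the
    neighbor (i.e., the output port) to use at each router for the first hop
    of one shortest path towards every destination."""
    INF = float("inf")
    dist = [[0 if i == j else INF for j in range(n)] for i in range(n)]
    first_hop = [[None] * n for _ in range(n)]
    for u, v in links:                       # undirected, unit-weight links
        dist[u][v] = dist[v][u] = 1
        first_hop[u][v], first_hop[v][u] = v, u
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
                    first_hop[i][j] = first_hop[i][k]   # go towards k first
    return dist, first_hop

# 4-router ring 0-1-2-3-0: router 0 reaches router 2 in 2 hops (via 1 or 3).
dist, hop = shortest_path_tables(4, [(0, 1), (1, 2), (2, 3), (3, 0)])
print(dist[0][2], hop[0][2])   # -> 2 1 (hop 3 would be an equally short choice)
```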
Basic minimal routing does not consider multipathing.
However, schemes such as Equal-Cost Multipathing
(ECMP) extend minimal routing to multipathing 4.4).
4.4 Equal-Cost Multipathing (ECMP)
Equal-Cost Multipathing (ECMP) [106] is an extension of simple destination-based routing (R) that specifically exploits
the properties of minimal paths. Instead of having only one
entry per destination in the routing tables, multiple next-
hop options are stored. In practice, ECMP is used with
minimal paths, because using non-minimal ones may lead
to routing loops. Now, any router can make an arbitrary
choice among these next-hop options. The resulting routing
will still be loop-free and only use minimal paths.
ECMP allows using a greater variety of paths compared to simple destination-based routing. Since now there may be multiple possible paths between any pair of nodes, a mechanism for load balancing (L) is needed. Typically, ECMP
is used with a simple, oblivious scheme similar to packet
spraying 4.6), but on a per-flow level to prevent packet
reordering [58]: each switch chooses a pseudo-random next
hop port among the shortest paths based on a hash com-
puted from the flow parameters, aiming to obtain an even
distribution of load over all minimal paths (some variations
of such simple per-flow scheme were proposed, for exam-
ple Table-based Hashing [208] or FastSwitching [244]). Yet,
random assignments do not imply uniform load balancing
in general, and more advanced schemes such as Weighted
Cost Multipathing (WCMP) [238], [241] aim to improve this.
In addition, ECMP natively does not support adaptive load
balancing. This is addressed by many network architectures
described in Section 5 and by direct extensions of ECMP,
such as Congestion-Triggered Multipathing (CTMP) [205] or
Table-based Hashing with Reassignments (THR) [58].
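For concreteness, a per-flow hash-based selection could look as follows; this is a simplified sketch (actual switches use hardware hash functions and per-prefix next-hop groups rather than Python):

```python
import hashlib

def ecmp_next_hop(flow_5tuple, equal_cost_ports):
    """Per-flow ECMP: all packets of a flow hash to the same next-hop port
    (avoiding reordering), while different flows spread over all ports that
    lead onto minimal paths towards the destination."""
    key = "|".join(map(str, flow_5tuple)).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(equal_cost_ports)
    return equal_cost_ports[index]

# A flow identified by (src IP, dst IP, src port, dst port, protocol):
flow = ("10.0.0.1", "10.0.1.9", 49152, 80, "tcp")
print(ecmp_next_hop(flow, equal_cost_ports=[3, 5, 7]))
```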
4.5 Spanning Trees (ST)
Another approach to path selection (P) is to restrict the topology to a spanning tree. Then, the routing graph becomes a
tree of bi-directional edges which guarantees the absence of
cycles as long as no router forwards packets back on the link
that the packet arrived on. This can be easily enforced by
each router without any global coordination. Spanning tree
based solutions are popular for auto-configuring protocols
on changing topologies. However, simple spanning tree-
based routing can leave some links completely unused if
the network topology is not a tree. Moreover, shortest paths
within a spanning tree are not necessarily shortest when
considering the whole topology. Spanning tree based solu-
tions are an alternative to minimal routing to ensure loop-
free routing in destination-based routing systems. They
allow for non-minimal paths at the cost of not using network
resources efficiently and have been used as a building block
in schemes like SPAIN [163]. A single spanning tree does
not enable multipathing between two endpoints. However,
as we discuss in Section 5, different network architectures
use spanning trees to enable multipathing [209].
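The sketch below illustrates the basic mechanism, using a BFS tree as a stand-in for the tree that a spanning tree protocol would elect: forwarding only along tree edges leaves the remaining links unused, and tree paths may be longer than shortest paths in the full topology.

```python
from collections import deque

def spanning_tree(adj, root=0):
    """Build one spanning tree (here via BFS from `root`); forwarding is then
    restricted to tree edges, so all non-tree links stay unused."""
    parent = {root: None}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return parent   # each node keeps a single upstream (tree) link

# Triangle 0-1-2: the tree drops one of the three links, so the direct
# 1-2 link may stay unused even though it is the shortest 1->2 path.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
print(spanning_tree(adj))   # -> {0: None, 1: 0, 2: 0}
```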
4.6 Packet Spraying
A fundamental concept for load balancing (L) is per-packet load balancing. In the basic variant, random packet spray-
ing [69], each packet is sent over a randomly chosen path
selected from a (static) set of possible paths. The key differ-
ence from ECMP is that modern ECMP spreads flows, not
packets. Typically, packet spraying is applied to multistage
networks, where many equal length paths are available and
a random path among these can be chosen by selecting a
random upstream port at each router. Thus, simple packet
spraying natively considers, enables, and uses multipathing.
In TCP/IP architectures, per-packet load balancing is
often not considered due to the negative effects of packet
reordering on TCP flow control; but these effects can still
be reduced in various ways [69], [100], for example by
spraying not single packets but series of packets, such as
flowlets [223] or flowcells [102]. Moreover, basic random
packet spraying is an oblivious load balancing method, as
it does not use any information about network congestion.
However, in some topologies, for example in fat trees, it can
still guarantee optimal performance as long as it is used for
all flows. Unfortunately, this is no longer true as soon as the
topology loses its symmetry due to link failures [241].
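A sketch of flowlet-level spraying is shown below; the gap threshold and the state layout are illustrative assumptions, not a specific product's design:

```python
import random
import time

class FlowletSprayer:
    """Spray at the granularity of flowlets: a flow keeps its current uplink
    as long as its packets arrive back-to-back; after an idle gap longer than
    `gap_s`, the next burst may take a newly chosen random uplink, which
    limits reordering compared to per-packet spraying."""
    def __init__(self, uplink_ports, gap_s=0.0005):
        self.ports = uplink_ports
        self.gap_s = gap_s
        self.state = {}   # flow_id -> (time of last packet, current port)

    def next_hop(self, flow_id, now=None):
        now = time.monotonic() if now is None else now
        last, port = self.state.get(flow_id, (None, None))
        if last is None or now - last > self.gap_s:
            port = random.choice(self.ports)   # start a new flowlet
        self.state[flow_id] = (now, port)
        return port

sprayer = FlowletSprayer(uplink_ports=[1, 2, 3, 4])
print([sprayer.next_hop("flowA") for _ in range(3)])  # typically one port per burst
```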
4.7 Virtual LANs (VLANs)
Virtual LANs (VLANs) [147] were originally used for iso-
lating Ethernet broadcast domains. They have recently been
used to implement multipathing. Specifically, once a VLAN
is assigned to a given spanning tree, changing the VLAN
tag in a frame results in sending this frame over a different
path, associated with a different spanning tree (imposed on
the same physical topology). Thus, VLANs in the context
of multipathing primarily address path selection (P).
4.8 Simple IP Routing
We explicitly distinguish a class of established IP routing
protocols (R), such as OSPF [162] or IS-IS [170]. They are
often used as parts of network architectures. Despite being
generic (i.e., they can be used with any topology), they do
not natively support multipathing.
4.9 Location–Identification Separation (LIS)
In Location–Identification Separation (LIS), used in some
architectures, a routing scheme (R) separates the physical
location of a given endpoint from its logical identifier. In this
approach, the logical identifier of a given endpoint (e.g., its
IP address used in an application) does not necessarily indi-
cate the physical location of this endpoint in the network. A
mapping between identifiers and addresses can be stored in
a distributed hashtable (DHT) maintained by switches [135]
or hosts, or it can be provided by a directory service (e.g.,
using DNS) [94]. This approach enables more scalable rout-
ing [76]. Importantly, it may facilitate multipathing, for example by maintaining multiple virtual topologies defined by different mappings in DHTs [118].
4.10 Valiant Load Balancing (VLB)
To facilitate non-minimal routing (R), additional information
apart from the destination address can be incorporated
into a destination-based routing protocol. An established
and common approach is Valiant routing [220], where this
additional information is an arbitrary intermediate router R
that can be selected at the source endpoint. The routing is
divided into two parts: first, the packet is minimally routed
to R; then, it is minimally routed to the actual destination.
VLB has aspects of source routing, namely the choice of R
and the modification of the packet in flight, while most of
the routing work is done in a destination-based way. As
such, VLB natively does not consider multipathing. VLB
also incorporates a specific path selection (P) policy (by selecting the intermediate node randomly). This also provides simple, oblivious load balancing (L).
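A minimal sketch of the two-phase path construction follows, assuming a helper `minimal_path(a, b)` that returns one shortest path as a list of routers:

```python
import random

def valiant_path(src, dst, routers, minimal_path):
    """Valiant load balancing: pick a random intermediate router R, then
    concatenate two minimal segments src -> R and R -> dst. The random
    choice of R spreads load obliviously, at the cost of longer paths."""
    r = random.choice([x for x in routers if x not in (src, dst)])
    return minimal_path(src, r) + minimal_path(r, dst)[1:]   # drop duplicate R
```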
4.11 Universal Globally-Adaptive Load Balanced (UGAL)
Universal Globally-Adaptive Load balanced (UGAL) [137]
is an extension of VLB that enables more advantageous
routing decisions (R, P) in the context of load balancing (L). Specifically, when a packet is to be routed, UGAL either selects a path determined by VLB, or a minimal one. The
decision usually depends on the congestion in the network.
Consequently, UGAL considers multipathing in its design:
consecutive packets may be routed using different paths.
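A common formulation of this decision, sketched below, compares the estimated delays of the two candidates (local output-queue occupancy times hop count); the exact congestion signal varies between UGAL variants, so this is an assumed, simplified rule:

```python
def ugal_choice(min_path, vlb_path, out_queue_len):
    """Pick the minimal path unless its estimated delay (occupancy of the
    local output queue towards the first hop, times hop count) exceeds
    that of the Valiant candidate; ties favor the minimal path."""
    hops = lambda path: len(path) - 1
    cost_min = out_queue_len(min_path[1]) * hops(min_path)   # path = [src, ..., dst]
    cost_vlb = out_queue_len(vlb_path[1]) * hops(vlb_path)
    return min_path if cost_min <= cost_vlb else vlb_path
```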
4.12 Network Address Aliasing (NAA)
Network Address Aliasing (NAA) is a building block to
support multipathing, especially in InfiniBand-based net-
works. Network Address Aliasing, also known as IP aliasing
in Ethernet networks [180] or port virtualization via LID
mask control (LMC) in InfiniBand [113, Sec. 7.11.1], is a
technique that assigns multiple identifiers to the same network
endpoint. This allows the routing protocols to increase the
path diversity between two endpoints, and it was used
both as a fail-over mechanism (enhancing resilience) [225] and for load balancing the traffic (enhancing performance) [73].
particular, due to the destination-based routing where a
path is only defined by the given destination address; as
mandated by the InfiniBand standard [113] this address
aliasing is the only standard-conform and software-based
solution to enable multiple disjoint paths between an IB
source and a destination port.
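For illustration, the set of addresses exposed by a port under a given LMC value can be computed as below (per the InfiniBand specification, a port answers to 2^LMC consecutive LIDs starting at its base LID; the routing engine may then install a different path per alias):

```python
def lids_for_port(base_lid, lmc):
    """Return all LIDs assigned to a port with the given base LID and LMC.
    Each alias can be routed independently, e.g., over a disjoint path."""
    return [base_lid + offset for offset in range(2 ** lmc)]

print([hex(lid) for lid in lids_for_port(0x40, lmc=2)])
# -> ['0x40', '0x41', '0x42', '0x43'] (4 aliases -> up to 4 distinct paths)
```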
4.13 Multi-Railing and Multi-Planes
Various HPC systems employ multi-railing: using multiple
injection ports per node into a single topology [97], [229].
Another common scheme is multi-plane topologies, where
nodes are connected to a set of disjoint topologies, either
similar [96] or different [156]. This is used to increase path
diversity and available throughput. However, this increased
level of complexity also comes with additional challenges
for the routing protocols to utilize the hardware efficiently.
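As a toy illustration of the load-distribution decision that multi-railing introduces, the sketch below spreads messages round-robin over a node's injection ports; real stacks may instead bind rails per process, per destination, or per message size:

```python
import itertools

class RailSelector:
    """Round-robin selection among a node's injection ports (rails)."""
    def __init__(self, num_rails):
        self._rails = itertools.cycle(range(num_rails))

    def next_rail(self):
        return next(self._rails)

selector = RailSelector(num_rails=2)
print([selector.next_rail() for _ in range(4)])   # -> [0, 1, 0, 1]
```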
5 ROUTING PROTOCOLS AND ARCHITECTURES
We now describe representative networking architectures,
focusing on their support for path diversity and multipathing³,
according to the taxonomy described in Section 3. Table 2
illustrates the considered architectures and the associated
protocols. Symbols “-”, “”, and “é” indicate that a given design offers a given feature, offers a given feature in a limited way, and does not offer a given feature, respectively.

³ We encourage participation in this survey. In case the reader possesses additional information relevant to the contents, the authors welcome the input. We also encourage the reader to send us any other information that they deem important, e.g., architectures not mentioned in the current survey version.
We broadly group the considered designs into three
classes. First 5.1), we describe schemes that belong to
the Ethernet and TCP/IP landscape and were introduced
for the Internet or for small clusters, most often for the
purpose of increasing resilience, with performance being
only secondary target. Despite the fact that these schemes
originally did not target data centers, we include them as
many of these designs were incorporated or used in some
way in the data center context. Second, we incorporate
Ethernet and TCP/IP related designs that are specifically
targeted at data centers or supercomputers 5.2). The last
class is dedicated to designs related to InfiniBand 5.3).
5.1 Ethernet & TCP/IP (Clusters, General Networks)
In the first part of Table 2, we illustrate the Ethernet and
TCP/IP schemes that are associated with small clusters and
general networks. Chronologically, the considered schemes
were proposed between 1999 and 2010 (with VIRO from
2011 and MLAG from 2014 being exceptions).
Multiple Spanning Trees (MSTP) [16], [65] extends the
STP protocol and it enables creating and managing multiple
spanning trees over the same physical network. This is done
by assigning different VLANs to different spanning trees,
and thus frames/packets belonging to different VLANs can
traverse different paths in the network. There exist Cisco’s
implementations of MSTP, for example Per-VLAN spanning
tree (PVST) and Multiple-VLAN Spanning Tree (MVST).
Table-based Hashing with Reassignments (THR) [58] ex-
tends ECMP to a simple form of load balancing: it selectively
reassigns some active flows based on load sharing statistics.
Global Open Ethernet (GOE) [114], [115] provides virtual
private network (VPN) services in metro-area networks
(MANs) using Ethernet. Its routing protocol, per-destination
multiple rapid spanning tree protocol (PD-MRSTP), com-
bines MSTP [16] (for using multiple spanning trees for
different VLANs) and RSTP [17] (for quick failure recovery).
Viking [197] is very similar to GOE. It also relies on MSTP
to explicitly seek faster failure recovery and more through-
put by using a VLAN per spanning tree, which enables
redundant switching paths between endpoints. TeXCP [125]
is a Traffic Engineering (TE) distributed protocol for bal-
ancing traffic in intra-domains of ISP operations. It focuses
on algorithms for path selection and load balancing, and
briefly discusses a suggested implementation that relies on
protocols such as RSVP-TE [21] to deploy paths in routers.
TeXCP is similar to another protocol called MATE [75].
TRansparent Interconnection of Lots of Links (TRILL) [214]
and Shortest Path Bridging (SPB) [13] are similar schemes
that both rely on link state routing to, among others, enable
multipathing based on multiple trees and ECMP. Ether-
net on Air [191] uses the approach introduced by SEAT-
TLE [135] to eliminate flooding in the switched network.
They both rely on LIS and distributed hashtables (DHTs),
implemented in switches, to map endpoints to the switches
connecting these endpoints to the network. Here, Ethernet
on Air uses its DHT to construct a routing substrate in the
form of a Directed Acyclic Graph (DAG) between switches.
Different paths in this DAG can be used for multipathing.
VIRO [118] is similar in relying on the DHT-style routing.
It mentions multipathing as a possible feature enabled by
multiple virtual topologies built on top of a single physical
network. Finally, MLAG [210] and MC-LAG [210] enable
multipathing through link aggregation.
First, many of these designs enable shortest paths, but
a non-negligible number is limited in this respect by the
used spanning tree protocol (i.e., the used shortest paths
are not shortest with respect to the underlying physical
topology). A large number of protocols alleviates this with
different strategies. For example, SEATTLE, Ethernet on Air,
and VIRO use DHTs that virtualize the physical topology,
enabling shortest paths. Other schemes, such as Smart-
Bridge [189] or RBridges [175], directly enhance the span-
ning tree protocol (in various ways) to enable shortest paths.
Second, many protocols also support multipathing. The two most common mechanisms for this are either ECMP (e.g., in AMP or THR) or multiple spanning trees combined with VLAN tagging (e.g., in MSTP or GOE). However, almost no schemes explicitly support non-minimal paths⁴, disjoint paths, or adap-
tive load balancing. Yet, they all work on arbitrary topologies.
All these features are mainly dictated by the purpose and
origin of these architectures and protocols. Specifically, most
of them were developed with the main goal of being resilient to failures rather than achieving higher performance. This explains, for example, the near-absence of support for adaptive load balancing
in response to network congestion. Moreover, they are all re-
stricted by the technological constraints in general Ethernet
and TCP/IP related equipment and protocols, which are
historically designed for the general Internet setting. Thus,
they have to support any network topology. Simultaneously,
many such protocols were based on spanning trees. This dic-
tates the nature of multipathing support in these protocols,
often using some form of multiple spanning trees (MSTP,
GOE, Viking) or “shortcutting” spanning trees (VIRO).
5.2 Ethernet & TCP/IP (Data Centers, Supercomputers)
The designs associated with data centers and supercomput-
ers are listed in the second part of Table 2.
⁴ While schemes based on spanning trees strictly speaking enable non-minimal paths, this is not a mechanism for path diversity per se, but a limitation dictated by the fact that the used spanning trees often do not enable shortest paths.
Routing scheme | Stack layer | Features: SP NP MP DP ALB AT | Scheme used | Additional remarks and clarifications

Related to Ethernet and TCP/IP (small clusters and general networks):
OSPF-OMP (OMP) [224] | L3 | - é - é é - | OSPF | Cisco's enhancement of OSPF to the multipathing setting. Packets from the same flow are forwarded using the same path.
MPA [164] | L3 | é é - | | MPA only focuses on algorithms for generating routing paths.
SmartBridge [189] | L2 | - é é é é - | ST | SmartBridge improves ST; packets are sent between hosts using the shortest possible path in the network.
MSTP [16], [65] | L2 | é - é é - | ST+VLAN | Shortest paths are offered only within spanning trees.
STAR [150] | L2 | - é é é é - | ST | STAR improves ST; frames are forwarded over alternate paths that are shorter than their corresponding ST path.
LSOM [84] | L2 | - é é é é - | | LSOM supports mesh networks also in MAN. LSA manages the state of links.
AMP [93] | L3 | - é - é - - | ECMP, OMP | AMP extends ECMP and OSPF-OMP.
RBridges [175] | L2 | - é é é é - | |
THR [58] | L3 | - é - é - | ECMP | Table-based Hashing with Reassignments (THR) extends ECMP; it selectively reassigns some active flows based on load sharing statistics.
GOE [114] | L2 | é - é é - | ST+VLAN | Shortest paths are offered only within spanning trees. One spanning tree per VLAN is used. Focus on resilience.
Viking [197] | L2 | é - é é∗∗ - | ST+VLAN | Shortest paths are offered only within spanning trees. One spanning tree per VLAN is used. ∗∗Viking uses elaborate load balancing, but it is static.
TeXCP [125] | L3 | - - - - | | Routing in ISPs; paths are computed offline, load balancing selects paths based on congestion and failures.
CTMP [205] | L3 | - é - - - - | ECMP | The scheme focuses on generating paths and on adaptive load balancing. It extends ECMP. Path generation is agnostic to the layer.
SEATTLE [135] | L2 | - é é é é - | LIS (DHT) | Packets traverse the shortest paths.
SPB [13], TRILL [214] | L2 | - é - é é - | |
Ethernet on Air [191] | L2 | - é é é - | LIS (DHT) | Multipathing is used only for resilience.
VIRO [118] | L2–L3 | - - é é - | LIS (DHT) | Multipathing could be enabled by using multiple virtual networks over the same underlying physical topology.
MLAG [210], MC-LAG [210] | L2 | é∗∗ é é - | | Not all shortest paths are enabled; ∗∗multipathing only for resilience.

Related to Ethernet and TCP/IP (data centers, supercomputers):
DCell [99] | L2–L3 | é - é é é é (RL) | | DCell comes with a specific topology that consists of layers of routers.
Monsoon [95] | L2, L3 | é∗∗ é é é (CL) | VLB, SR, ECMP | VLB is used in groups of edge routers. ∗∗ECMP is used only between border and access routers.
Work by Al-Fares et al. [7] | L3 | - é - - - é (FT) | |
PortLand [166] | L2 | - é - é é é (FT) | ECMP |
MOOSE [194] | L2 | - é é é - | OSPF-OMP∗∗, LIS | Only a brief discussion on augmenting the frame format for multipathing. ∗∗Only mentioned as a possible mechanism for multipathing in MOOSE.
BCube [98] | L2–L3 | - é - - é é (RL) | | BCube comes with a specific topology that consists of layers of routers.
VL2 [94] | L3 | - é - é∗∗ é (CL) | LIS, VLB, ECMP | L3 is used but L2 semantics are offered. ∗∗TCP congestion control.
SPAIN [163] | L2 | - - é - | ST+VLAN | SPAIN uses one ST per VLAN. Path diversity is limited by the number of VLANs supported in L2 switches.
Work by van der Linden et al. [222] | L3 | - é - - | ECMP | These aspects are only mentioned. The whole design extends ECMP.
Work by Suchara et al. [211] | L3 | - - - ∗∗ - | | Support is implicit. ∗∗Paths are precomputed based on predicted traffic. The design focuses on fault tolerance but also considers performance.
PAST [209] | L2 | ∗∗ - é - | ST+VLAN, VLB | PAST enables either shortest or non-minimal paths. ∗∗Limited or no multipathing.
Shadow MACs [2] | L2 | - é - é é - | | Non-minimal paths are mentioned only in the context of resilience.
WCMP for DC [241] | L3 | - é - é é é (MS)∗∗ | ECMP | WCMP uses OpenFlow [157]. WCMP extends ECMP with hashing of flows based on link capacity. ∗∗Applicable to simple 2-stage networks.
Flexible fabric [121] | L3 | - é∗∗ é é | SR | Non-minimal paths are considered for resilience only. ∗∗Only mentioned. Main focus is placed on leaf-spine and fat trees.
XPath [107] | L3 | - - - ∗∗ - | | Unclear scaling behavior. ∗∗XPath relies on default congestion control.
Adaptive load balancing | L3 | - é - é - é (MS) | PR | Examples are DRILL [90] or DRB [53].
ECMP-VLB [127] | L3 | - - - é - - (XP) | ECMP, VLB | Focus on the Xpander network.
FatPaths [43] | L2–L3 | - - - - - -∗∗ | PR∗∗∗, ECMP, VLAN | Simultaneous use of shortest and non-minimal paths. ∗∗Generally applicable but the main focus is on low-diameter topologies. ∗∗∗FatPaths sprays packets grouped in flowlets. Only briefly described.

Related to InfiniBand and other traditionally HPC-related designs (data centers, supercomputers):
Shortest path schemes | subnet | é é ∗∗ é é | D-free | They include Min-Hop [158], (DF-)SSSP [105], [72], and Nue [71]. ∗∗Only when combined with NAA.
MUD [152], [78] | subnet | é é é | D-free | Original proposals disregarded IB's destination-based routing criteria; hence, applicability is limited without NAA.
LASH-TOR [204] | subnet | é é é é é | D-free | Original proposals disregarded IB's destination-based routing criteria; hence, applicability is limited without NAA.
Multi-Routing [167] | subnet | é - - ∗∗ | | Depends on #{network planes} and/or the selected routing schemes. ∗∗Must be implemented in an upper layer protocol, like MPI.
Adaptive Routing [159] | subnet | - - | | Proprietary Mellanox extensions are outside of the InfiniBand specification.
SAR [70] | subnet | é é é é é (FT) | NAA, D-free | Theoretically in Phase 2 & 4 of 'Property Preserving Network Update'.
PARX [73] | subnet | - é é (HX) | NAA, D-free | Implemented via an upper layer protocol, e.g., a modified MPI library.
Cray's Aries [14] | propr. | - - - é - é (DF) | UGAL, D-free | Link congestion information is propagated through the network and used to decide between minimal and non-minimal paths.
Cray's Slingshot [1] | L3 or propr. | - - - é - é (DF) | UGAL, D-free | Similar to Aries, adds endpoint congestion mitigation.
Myricom's Myrinet [47] | propr. | - - ? ? ? - | SR, D-free |
Atos' BXI [67] | propr. | - ? ? ? - - | D-free |
EXTOLL's architecture [165] | propr. | - ? ? ? - - | |
Intel's OmniPath [46] | propr. | - - - - | D-free | No built-in support for enforcing packet ordering across different paths.
Quadrics' QsNet [178], [177] | propr. | - - - é (FT) | SR | Unclear details on how to use multipathing in practice.
IBM's PERCS [19] | propr. | - - - é é (DF) | UGAL, D-free | Routing modes can be set on a per-packet basis.
TABLE 2: Routing architectures. Rows are sorted chronologically and then by topology/multipathing support. "Scheme used" indicates incorporated building blocks from Table 1. "Stack Layer" indicates the location of a given scheme in the TCP/IP or InfiniBand stack (cf. § 2.4). SP, NP, MP, DP, ALB, and AT illustrate whether a given routing scheme supports various aspects of path diversity. Specifically: SP: A given scheme enables using arbitrary shortest paths. NP: A given scheme enables using arbitrary non-minimal paths. MP: A given scheme enables multipathing (between two hosts). DP: A given scheme considers disjoint (no shared links) paths. ALB: A given scheme offers adaptive load balancing. AT: A given scheme works with an arbitrary topology. "-": A given scheme does offer a given feature. "": A given scheme offers a given feature in a limited way. "é": A given scheme does not offer a given feature. Explanations in remarks. MS, FT, CL, XP, and HX are symbols of topologies described in § 2.1. RL is a specific type of network called a "recursive layered" design, described in § 5.2.3. "?": Unknown. "D-free": deadlock-free.
5.2.1 Multistage (Fat Tree, Clos, Leaf-Spine) Designs
One distinctive group of architectures targets multistage topologies. A common key feature of all these designs
is multipathing based on multiple paths of equal lengths
leading via core routers (cf. § 2.1). Common building blocks
are ECMP, VLB, and PR; however, details (of how these
blocks are exactly deployed) may vary depending on, for
example, the specific targeted topology (e.g., fat tree vs. leaf-
spine), the targeted stack (e.g., bare L2 Ethernet vs. the L3
IP setting), or whether a given design uses off-the-shelf
equipment or rather proposes some HW modifications. Im-
portantly, these designs focus on multipathing with shortest
paths because multistage networks offer a rich supply of
such paths. They often offer some form of load balancing.
Monsoon [95] provides a hybrid L2–L3 Clos design in
which all endpoints in a datacenter form a large single L2
domain. L2 switches may form multiple layers, but the last
two layers (access and border) consist of L3 routers. ECMP
is used for multipathing between access and border routers.
All L2 layers use multipathing based on selecting a random
intermediate switch in the uppermost L2 layer (with VLB).
To implement this, Monsoon relies on switches that support
MAC-in-MAC tunneling (encapsulation) [111] so that one may
forward a frame via an intermediate switch.
PortLand [166] uses fat trees and provides a complete L2
design; it simply assumes standard ECMP for multipathing.
Al-Fares et al. [7] also focus on fat trees. They provide
a complete design based on L3 routing. While they only
briefly mention multipathing, they use an interesting solu-
tion for spreading traffic over core routers. Specifically, they
propose that each router maintains a two-level routing table.
Now, a destination address in a packet may be matched
based on its prefix (“level 1”); this matching takes place
when a packet is sent to an endpoint in the same pod. If
a packet goes to a different pod, the address hits a special
entry leading to routing table “level 2”. In this level, match-
ing uses the address suffix (“right-hand” matching). The key
observation is that, while simple prefix matching would
force packets (sent to the same subnet) to use the same
core router, suffix matching enables selecting different core
routers. The authors propose to implement such routing
tables with ternary content-addressable memories (TCAM).
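
The following minimal sketch (our own Python illustration with hypothetical prefixes and port names, not the original implementation) captures this two-level lookup: intra-pod destinations match a standard prefix, while inter-pod destinations fall through to a suffix table whose host-ID match selects the uplink, and thus the core router:

    import ipaddress

    # Hypothetical tables of one aggregation switch in a fat tree pod (illustration only).
    level1_prefixes = {                     # "level 1": longest-prefix match
        "10.0.0.0/24": "pod_port0",         # destinations inside the same pod
        "10.0.1.0/24": "pod_port1",
    }
    level2_suffixes = {                     # "level 2": right-hand (suffix) match on the host ID
        0: "core_port0", 1: "core_port1",
        2: "core_port0", 3: "core_port1",
    }

    def lookup(dst: str) -> str:
        addr = ipaddress.ip_address(dst)
        for prefix, port in level1_prefixes.items():
            if addr in ipaddress.ip_network(prefix):
                return port                 # intra-pod traffic: ordinary prefix routing
        # Inter-pod traffic falls through to the suffix table: the host ID (last octet)
        # selects the uplink, so hosts in the same remote subnet use different core routers.
        return level2_suffixes[int(dst.split(".")[-1]) % len(level2_suffixes)]

    print(lookup("10.0.1.7"))   # same pod: pod_port1
    print(lookup("10.2.0.3"))   # different pod: core_port1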
VL2 [94] targets Clos and provides a design in which
the infrastructure uses L3 but the services are offered L2
semantics. VL2 combines ECMP and VLB for multipathing.
To send a packet, a random core router is selected (VLB); ECMP is then used to further spread load across available
redundant paths. Using an intermediate core router in VLB
is implemented with IP-in-IP encapsulation.
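
A simplified sketch of this VLB-plus-ECMP combination follows (our own illustration; the intermediate addresses, uplink names, and hash function are placeholders rather than VL2's actual mechanisms, which rely on IP-in-IP encapsulation towards the selected intermediate):

    import random, zlib

    core_routers = ["10.3.0.1", "10.3.0.2", "10.3.0.3", "10.3.0.4"]   # hypothetical intermediates
    uplinks = ["uplink0", "uplink1"]                                   # equal-cost ports upwards

    def five_tuple_hash(src, dst, sport, dport, proto="tcp") -> int:
        return zlib.crc32(f"{src}-{dst}-{sport}-{dport}-{proto}".encode())

    def vlb_plus_ecmp(flow):
        # VLB step: pick a random intermediate (core) router; the packet would be
        # tunneled (IP-in-IP) towards it and only then towards the real destination.
        intermediate = random.choice(core_routers)
        # ECMP step: hash the flow onto one of the equal-cost uplinks, so that all
        # packets of a flow follow the same path and stay in order.
        uplink = uplinks[five_tuple_hash(*flow) % len(uplinks)]
        return intermediate, uplink

    print(vlb_plus_ecmp(("10.0.0.5", "10.1.0.9", 40000, 80)))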
There is a large number of load balancing schemes for
multistage networks. The majority focus on the transport
layer details and are outside the scope of this work; we
outline them in Section 6 and coarsely summarize them in
Table 2. An example design, DRB [53], offers round-robin
packet spraying and it also discusses how to route such
packets in Clos via core routers using IP-in-IP encapsulation.
5.2.2 General Network Designs
There are also architectures that focus on general topologies;
some of them are tuned for certain classes of networks but
may in principle work on any topology [43]. In contrast to
architectures for multistage networks, designs for general
networks rarely consider ECMP because it is difficult to use ECMP in the context of a general topology without the guarantee of a large number of redundant shortest paths,
common in Clos or in a fat tree. Instead, they often resort
to some combination of ST and VLANs.
SPAIN [163] is an L2 architecture that focuses on using
commodity off-the-shelf switches. To enable multipathing
in an arbitrary network, SPAIN (1) precomputes a set of re-
dundant paths for different endpoint pairs, (2) merges these
paths into trees, and (3) maps each such tree into a separate
VLAN. Different VLANs may be used for multipathing
between endpoint pairs, assuming the used switches support VLANs. While SPAIN relies on TCP congestion control for
reacting to failures, it does not offer any specific scheme for
load balancing for more performance.
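
The following toy sketch (in Python) illustrates SPAIN's offline stage under strong simplifications: it derives a few edge-disjoint paths per endpoint pair by repeated shortest-path search and maps each to a VLAN; SPAIN's actual path-computation and tree-merging algorithms are considerably more elaborate [163]:

    from collections import deque

    # Toy switch-level topology given as an adjacency list (an arbitrary, non-tree graph).
    graph = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"], "D": ["B", "C"]}

    def shortest_path(adj, src, dst):
        parent, seen, queue = {}, {src}, deque([src])
        while queue:
            u = queue.popleft()
            if u == dst:
                path = [u]
                while path[-1] != src:
                    path.append(parent[path[-1]])
                return path[::-1]
            for v in adj[u]:
                if v not in seen:
                    seen.add(v); parent[v] = u; queue.append(v)
        return None

    def disjoint_paths(adj, src, dst, k=2):
        # Compute up to k edge-disjoint paths by removing the edges of each path found.
        adj = {u: list(vs) for u, vs in adj.items()}
        paths = []
        for _ in range(k):
            p = shortest_path(adj, src, dst)
            if p is None:
                break
            paths.append(p)
            for u, v in zip(p, p[1:]):
                adj[u].remove(v); adj[v].remove(u)
        return paths

    # Offline stage: each path would be merged with others into per-VLAN trees; here we
    # simply map the i-th path of a pair to VLAN i to convey the idea.
    vlan_of_path = {i + 1: p for i, p in enumerate(disjoint_paths(graph, "A", "D"))}
    print(vlan_of_path)   # {1: ['A', 'B', 'D'], 2: ['A', 'C', 'D']}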
MOOSE [194] addresses the limited scalability of Ether-
net; it simply relies on orthogonal designs such as OSPF-
OMP for multipathing.
PAST [209] is a complete L2 architecture for general
networks. Its key idea is to use a single spanning tree per
endpoint. As such, it does not explicitly focus on ensuring
multipathing between pairs of endpoints, instead focusing
on providing path diversity at the granularity of a destination endpoint by enabling the computation of different spanning trees, depending on bandwidth requirements, the considered topology, etc. It enables shortest paths, but also supports
VLB by offering algorithms for deriving spanning trees
where paths to the root of a tree are not necessarily minimal.
PAST relies on ST and VLAN for implementation.
There are also works that focus on encoding a diver-
sity of paths available in different networks. For example,
Jyothi et al. [121] discuss encoding arbitrary paths in a data
center with OpenFlow to enable flexible fabric, XPath [107]
compresses the information of paths in a data center so
that they can be aggregated into a practical number of
routing entries, and van der Linden et al. [222] discuss
how to effectively enable source routing by appropriately
transforming selected fields of packet headers to ensure that
the ECMP hashing will result in the desired path selection.
Some recent architectures focus on high-performance
routing in low-diameter networks. ECMP-VLB is a simple
routing scheme suggested for Xpander topologies [127] that,
as the name suggests, combines the advantages of ECMP
and VLB. Finally, FatPaths [43] targets general low-diameter
networks. It (1) divides physical links into layers that form acyclic directed graphs and (2) uses paths in different layers for multipathing. Packets are sprayed over such layers using
flowlets. FatPaths discusses an implementation based on
address space partitioning, VLANs, or ECMP.
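
A minimal illustration of the layer idea follows (our own sketch; FatPaths constructs layers so that they remain usable for routing, whereas the random link subsets below only convey the concept, and the flowlet spraying is shown as simple round-robin rather than the adaptive, load-aware variant [43]):

    import random

    links = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("A", "D"), ("B", "C")]

    def build_layers(links, n_layers=3, keep_fraction=0.7, seed=0):
        # Each layer keeps a (random) subset of links; per-layer routing then exposes
        # different, often non-minimal, paths between the same endpoint pair.
        rng = random.Random(seed)
        return [[l for l in links if rng.random() < keep_fraction] for _ in range(n_layers)]

    def spray_flowlets(flowlets, n_layers):
        # Consecutive flowlets of one flow are sent over different layers (round-robin here).
        return [(flowlet, i % n_layers) for i, flowlet in enumerate(flowlets)]

    layers = build_layers(links)
    print(spray_flowlets(["flowlet0", "flowlet1", "flowlet2"], len(layers)))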
5.2.3 Recursive Networks
Some architectures, besides routing, also come with novel
“recursive” topologies [99], [98]. The key design choice in
these architectures to obtain path diversity is to use multiple
NICs per server and connect servers to one another.
5.3 InfiniBand
We now describe the IB landscape. We omit a line of com-
mon routing protocols based on shortest paths, as they are
not directly related to multipathing, but their implementa-
tions in the IB fabric manager natively support NAA; these
routing algorithms are MinHop [158], SSSP [105], Deadlock-Free SSSP
(DFSSSP) [72], and a DFSSSP variant called Nue [71].
5.3.1 Multi-Up*/Down* (MUD) routing
Numerous variations of Multi-Up*/Down* routing have
been proposed, e.g., [152], [78], to overcome the bottlenecks
and limitations of Up*/Down*. The idea is to utilize a set of
Up*/Down* spanning trees—each starting from a different
root node—and choose a path depending on certain criteria.
For example, Flich et al. [78] proposed to select two roots which give either the highest number of non-minimal or the highest number of minimal paths, and then randomly select
from those two trees for each source-destination pair. Sim-
ilarly, Lysne et al. [152] proposed to identify multiple root
nodes (by maximizing the minimal distance between them),
and load-balance the traffic across the resulting spanning
trees to avoid the usual bottleneck near a single root. Both
approaches require NAA to work with InfiniBand.
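
The root-selection idea of Lysne et al. can be sketched as follows (our own simplified Python illustration; the per-tree Up*/Down* path computation and the InfiniBand-specific NAA handling are omitted):

    from collections import deque

    # Toy switch topology as an adjacency list.
    adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}

    def bfs_distances(adj, src):
        dist, queue = {src: 0}, deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        return dist

    def pick_roots(adj, k=2):
        # Greedily pick spanning-tree roots that maximize the minimal distance to the
        # roots chosen so far, so the resulting Up*/Down* trees avoid a shared bottleneck.
        roots = [next(iter(adj))]
        while len(roots) < k:
            dists = [bfs_distances(adj, r) for r in roots]
            roots.append(max(adj, key=lambda v: min(d[v] for d in dists)))
        return roots

    print(pick_roots(adj))   # e.g., [0, 4]; traffic is then balanced across the two trees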
5.3.2 LASH-Transition Oriented Routing (LASH-TOR)
The goal of LASH-TOR [204] is not directly path diversity; rather, path diversity is a byproduct of how the routing tries to ensure deadlock-freedom (an essential feature in lossless networks) under resource constraints. LASH-TOR uses the LAyered Shortest Path routing for the majority of source-destination pairs, and Up*/Down* as a fallback when LASH would exceed the available virtual channels. Hence, assuming NAA is used to separate the LASH (minimal) paths from the Up*/Down* (potentially non-minimal) paths, one can gain limited path diversity in InfiniBand.
5.3.3 Multi-Routing
Multi-routing can be viewed as an extension of the multi-
plane designs outlined in § 2.1. In preliminary experiments,
researchers have investigated whether the use of different routing algorithms on similar network planes can yield an observable performance gain [167]. Theoretically, in addition to the increased, non-overlapping path diversity resulting from the
multi-plane design, utilizing different routing algorithms
within each plane can yield benefits for certain traffic pat-
terns and load balancing schemes, which would otherwise
be hidden when the same routing is used everywhere.
5.3.4 Adaptive Routing (AR)
For completeness, we list Mellanox’s adaptive routing im-
plementation for InfiniBand as well, since it (theoretically)
increases path diversity and offers load balancing within
the more recent Mellanox-based InfiniBand networks [159].
However, to date, their technology is proprietary and outside of the IB specifications. Furthermore, Mellanox's AR only supports a limited set of topologies (tori-like, Clos-like, and their Dragonfly variations).
5.3.5 Scheduling-Aware Routing (SAR)
Similar to LASH-TOR, the path diversity offered by SAR was not intended as a multipathing or load balancing feature [70]. Using NAA with LMC = 1, SAR employs a
primary set of shortest paths, calculated with a modified DF-
SSSP routing [72], and a secondary set of paths, calculated
with the Up*/Down* routing algorithm. Whenever SAR re-
routes the network to adapt to the currently running HPC
applications, the network traffic must temporarily switch
to the fixed secondary paths to avoid potential deadlocks
during the deployment of the new primary forwarding
rules. Hence, during each deployment, there is a short time frame where multipathing is intended; yet (theoretically) the message passing layer could also utilize both the primary and secondary paths simultaneously outside of the deployment window, without breaking SAR's validity.
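
In InfiniBand, the LID Mask Control (LMC) value assigns 2^LMC consecutive LIDs to each port, and the forwarding tables may map these additional destination LIDs to different output ports; this is the NAA mechanism that SAR relies on. The sketch below (our own illustration with hypothetical LIDs and ports) shows how an upper layer could address either the primary (DFSSSP) or the secondary (Up*/Down*) path by choosing the LID offset:

    LMC = 1                                    # each port owns 2**LMC consecutive LIDs
    BASE_LID = {"nodeA": 16, "nodeB": 32}      # hypothetical base LIDs from the subnet manager

    # Hypothetical forwarding entries of one switch: destination LID -> output port.
    # Offset 0 follows the primary (DFSSSP) path, offset 1 the secondary (Up*/Down*) path.
    forwarding_table = {16: "port1", 17: "port3", 32: "port2", 33: "port4"}

    def dlid(dest: str, use_secondary: bool) -> int:
        return BASE_LID[dest] + (1 if use_secondary else 0)

    def output_port(dest: str, use_secondary: bool = False) -> str:
        return forwarding_table[dlid(dest, use_secondary)]

    print(output_port("nodeA"))                       # primary path
    print(output_port("nodeA", use_secondary=True))   # secondary path (used during re-deployment)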
5.3.6 Pattern-Aware Routing for HyperX (PARX)
PARX is the only known, and practically demonstrated,
routing for InfiniBand which intentionally enforces the gen-
eration of minimal and non-minimal paths, and mixes the
usage of both for load-balancing reasons [73], while still ad-
hering to the IB specifications. The idea of this routing is an
emulation of AR capabilities with non-AR techniques/tech-
nologies to overcome the bottlenecks on the shortest paths between IB switches located in the same dimension of the HyperX topology. PARX for a 2D HyperX, with NAA and
LMC = 2, offers between 2 and 4 disjoint paths, and adap-
tively selects minimal or non-minimal routes depending on
the message size to optimize for either message latency
(with short payloads) or throughput for large messages.
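
The message-size-based selection can be sketched as follows (our own illustration; the threshold value and path handles are hypothetical, and in PARX the corresponding logic resides in a modified MPI library [73]):

    import random

    SIZE_THRESHOLD = 16 * 1024      # hypothetical cut-off between "short" and "large" messages

    def select_path(msg_size, minimal_path, nonminimal_paths):
        # Small messages take the latency-optimal minimal route; large messages are
        # spread over all available (disjoint) routes to improve throughput.
        if msg_size <= SIZE_THRESHOLD:
            return minimal_path
        return random.choice([minimal_path] + nonminimal_paths)

    print(select_path(4096, "minimal", ["nonminimal0", "nonminimal1"]))
    print(select_path(1 << 20, "minimal", ["nonminimal0", "nonminimal1"]))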
5.4 Other HPC Network Designs
Cray’s Aries and Slingshot adopt the adaptive UGAL rout-
ing to distribute the load across the network. When using
minimal paths, the packets are sent directly to the dragonfly
destination group. With non-minimal paths, instead, pack-
ets are first minimally routed to an intermediate group, then
minimally routed to the destination group. Within a group,
packets are always minimally routed. Routing decisions are taken on a per-packet basis. They consist of selecting a number of minimal and non-minimal paths, evaluating the
load on these paths, and finally selecting one. The load is es-
timated by using link load information propagated through
the network [137]. Applications can select different “biasing
levels” for the adaptive routing (e.g., bias towards minimal
routing), or disable the adaptive routing and always use
minimal or non-minimal paths.
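
The following sketch (our own simplified illustration, not Cray's implementation) conveys the UGAL-style decision: candidate minimal and non-minimal paths are compared by a load-weighted cost, optionally biased towards minimal routes:

    def ugal_choice(minimal, nonminimal, bias=1.0):
        # Each candidate is (path_id, hop_count, estimated_queue_occupancy);
        # the load-weighted cost approximates the delay along the path.
        def cost(c):
            return c[1] * c[2]
        best_min = min(minimal, key=cost)
        best_non = min(nonminimal, key=cost)
        # bias < 1 favors minimal paths, bias > 1 favors non-minimal (Valiant-style) detours.
        return best_min if cost(best_min) * bias <= cost(best_non) else best_non

    minimal_candidates = [("min0", 3, 12), ("min1", 3, 20)]
    nonminimal_candidates = [("non0", 5, 4), ("non1", 5, 9)]
    print(ugal_choice(minimal_candidates, nonminimal_candidates))   # -> ('non0', 5, 4)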
In IBM's PERCS, shortest path lengths vary between
one and three hops (i.e., route within the source supernode;
reach the destination supernode; route within the desti-
nation supernode). Non-minimal paths can be derived by
minimally-routing packets towards an intermediate supern-
ode. The maximum non-minimal path length is five hops.
As pairs of supernodes can be connected by more than one
link, multiple shortest paths can exist. PERCS provides three
routing modes that can be selected by applications on a per-
packet basis: non-minimal, with the applications defining
the intermediate supernode; round-robin, with the hard-
ware selecting among the multiple routes in a round-robin
manner; randomized (only for non-minimal paths), where
the hardware randomly chooses an intermediate supernode.
Quadrics’ QsNet [178], [177] is a source routed inter-
connect that enables, to some extent, multipathing between
two endpoints, and comes with adaptivity in switches.
13
Specifically, a single routing table (deployed in a QsNet NIC
called “Elan”) translates a processor ID to a specification
of a path in the network. Now, as QsNet enables loading
several routing tables, one could encode different paths
in different routing tables. Finally, QsNet offers hardware
support for broadcasts, and for multicasts to physically
contiguous QsNet endpoints.
Intel’s OmniPath [46] offers two mechanisms for mul-
tipathing between any two endpoints: different paths in
the fabric or different virtual lanes within the same phys-
ical route. However, the OmniPath architecture itself does
not prescribe specific mechanisms to select a specific path.
Moreover, it does not provide any scheme for ensuring
packet ordering. Thus, when such ordering is needed, the packets must use the same path, or the user must provide another scheme for maintaining the right ordering.
Finally, the specifications of Myricom’s Myrinet [47]
or Open-MX [92], Atos’ BXI [67], and EXTOLL’s inter-
connect [165] do not disclose details on their support for
multipathing. Myrinet does use source routing and works
on arbitrary topologies. Both the BXI and EXTOLL designs offer adaptive routing to mitigate congestion, but it is unclear whether multipathing is used.
6 RELATED ASPECTS OF NETWORKING
Congestion control & load balancing are strongly related to
the transport layer (L4). This area has been extensively covered in surveys on overall networking [155], [51], [187],
[151], [116], [228], mobile or ad hoc environments [146],
[201], [196], [79], [240], [63], [88], [154], and more recent
cloud and data center networks [230], [199], [213], [239],
[231], [20], [89], [198], [168], [188], [237], [226], [120], [81],
[207], [182]. Thus, we do not focus on these aspects of
networking and we only mention them whenever necessary.
However, as they are related to many considered routing
schemes, we cite respective works as a reference point for
the reader. Many schemes for load balancing and congestion
control were proposed in recent years [149], [29], [171],
[169], [176], [144], [160], [54], [11], [103], [242], [100], [185],
[23], [12], [221], [148], [110], [161], [119], [24], [9], [49], [50],
[108], [129], [85], [236]. Such adaptive load balancing can be
implemented using flows [60], [186], [195], [218], [30], [241],
[8], [123], [106], flowcells (fixed-sized packet series) [102],
flowlets (variable-size packet series) [131], [10], [223], [130],
[126], and single packets [235], [100], [69], [53], [176], [235],
[185], [90]. In data centers, load balancing most often focuses
on flow- and flowlet-based adaptivity. This is because the targeted stack is often based on TCP, which suffers performance degradation whenever packets become reordered. In
contrast, HPC networks usually use packet level adaptivity,
and research focuses on choosing good congestion signals,
often with hardware modifications [82], [83].
Similarly to congestion control, we exclude flow control
from our focus, as it is also usually implemented within L4.
Some works analyze various properties of low-diameter
topologies, for example path length, throughput, and band-
width [219], [128], [122], [203], [127], [33], [143], [134], [101],
[132], [216], [77], [133], [22], [215], [6]. Such works could
be used with our multipathing analysis when developing
routing protocols and architectures that take advantage of
different properties of a given topology.
7 CHALLENGES
There are many challenges related to multipathing and path
diversity support in HPC systems and data centers.
First, we predict a rich line of future routing protocols
and networking architectures targeting recent low-diameter
topologies. Some of the first examples are the FatPaths
architecture [43] or the PARX routing [73]. However, more
research is required to understand how to fully use the po-
tential behind such networks, especially considering more
effective congestion control and different technological con-
straints in existing networking stacks.
Moreover, little research exists into routing schemes
suited specifically for particular types of workloads, for ex-
ample deep learning [27], linear algebra computations [39],
[40], [206], [138], graph processing [36], [33], [42], [45],
[38], [91], [32], [41], and other distributed workloads [37],
[34], [87] and algorithms [193], [192]. For example, as some
workloads (e.g., in deep learning [28]) have more predictable
communication patterns, one could try to gain speedups
with multipath routing based on the structural network
properties that are static or change slowly. Conversely, when
routing data-driven workloads such as graph computing,
one could bias more aggressively towards adaptive multi-
pathing, for example with flowlets [43], [223].
Finally, we expect the growing importance of various
schemes enabling programmable routing and transport [18],
[55]. Here, one line of research will probably heavily depend
on OpenFlow [157] and, especially, P4 [48]. It is also inter-
esting to investigate how to use FPGAs [31], [44], [64] or
“smart NICs” [68], [104], [55] in the context of multipathing.
8 CONCLUSION
Developing high-performance routing protocols and net-
working architectures in HPC systems and data centers is an
important research area. Multipathing and overall support
for path diversity is an important part of such designs,
and specifically one of the enablers for high performance.
The importance of routing is increased by the prevalence
of communication-intensive workloads that put pressure on
the interconnect, such as graph analytics or deep learning.
Many networking architectures and routing protocols
have been developed. They offer different forms of support
for multipathing, they are related to different parts of vari-
ous networking stacks, and they are based on miscellaneous
classes of simple routing building blocks or design princi-
ples. To propel research into future developments in the area
of high-performance routing, we present the first analysis
and taxonomy of the rich landscape of multipathing and
path diversity support in the routing designs in supercom-
puters and data centers. We identify basic building blocks,
we crystallize fundamental concepts, we list and categorize
existing architectures and protocols, and we discuss key
design choices, focusing on the support for different forms
of multipathing and path diversity. Our analysis can be used
by network architects, system developers, and routing pro-
tocol designers who want to understand how to maximize
the performance of their developments in the context of bare
Ethernet, full TCP/IP, or InfiniBand and other HPC stacks.
Acknowledgment The work was supported by JSPS KAKENHI
Grant Number JP19H04119.
REFERENCES
[1] Slingshot: The Interconnect for the Exascale Era -
Cray Inc. https://www.cray.com/sites/default/files/
Slingshot-The-Interconnect-for-the-Exascale- Era.pdf.
[2] K. Agarwal, C. Dixon, E. Rozner, and J. B. Carter. Shadow macs:
scalable label-switching for commodity ethernet. In HotSDN’14,
pages 157–162, 2014.
[3] S. Aggarwal and P. Mittal. Performance evaluation of single path
and multipath regarding bandwidth and delay. Intl. J. Comp. App.,
145(9), 2016.
[4] J. H. Ahn, N. Binkert, A. Davis, M. McLaren, and R. S. Schreiber.
HyperX: topology, routing, and packaging of efficient large-scale
networks. In ACM/IEEE Supercomputing, page 41, 2009.
[5] H. D. E. Al-Ariki and M. S. Swamy. A survey and analysis
of multipath routing protocols in wireless multimedia sensor
networks. Wireless Networks, 23(6), 2017.
[6] F. Al Faisal, M. H. Rahman, and Y. Inoguchi. A new power
efficient high performance interconnection network for many-
core processors. Journal of Parallel and Distributed Computing,
101:92–102, 2017.
[7] M. Al-Fares, A. Loukissas, and A. Vahdat. A scalable, commodity
data center network architecture. In ACM SIGCOMM, pages 63–
74, 2008.
[8] M. Al-Fares, S. Radhakrishnan, B. Raghavan, N. Huang, and
A. Vahdat. Hedera: Dynamic flow scheduling for data center
networks. In NSDI, volume 10, pages 19–19, 2010.
[9] M. Alasmar, G. Parisis, and J. Crowcroft. Polyraptor: embracing
path and data redundancy in data centres for efficient data
transport. In Proceedings of the ACM SIGCOMM 2018 Conference
on Posters and Demos, pages 69–71. ACM, 2018.
[10] M. Alizadeh, T. Edsall, S. Dharmapurikar, R. Vaidyanathan,
K. Chu, A. Fingerhut, F. Matus, R. Pan, N. Yadav, G. Varghese,
et al. CONGA: Distributed congestion-aware load balancing
for datacenters. In Proceedings of the 2014 ACM conference on
SIGCOMM, pages 503–514. ACM, 2014.
[11] M. Alizadeh, A. Greenberg, D. A. Maltz, J. Padhye, P. Patel,
B. Prabhakar, S. Sengupta, and M. Sridharan. Data center
TCP (DCTCP). ACM SIGCOMM computer communication review,
41(4):63–74, 2011.
[12] M. Alizadeh, S. Yang, M. Sharif, S. Katti, N. McKeown, B. Prab-
hakar, and S. Shenker. pFabric: Minimal near-optimal datacenter
transport. ACM SIGCOMM Computer Communication Review,
43(4):435–446, 2013.
[13] D. Allan, P. Ashwood-Smith, N. Bragg, J. Farkas, D. Fedyk,
M. Ouellete, M. Seaman, and P. Unbehagen. Shortest path
bridging: Efficient control of larger ethernet networks. IEEE
Communications Magazine, 48(10), 2010.
[14] B. Alverson, E. Froese, L. Kaplan, and D. Roweth. Cray XC series
network. Cray Inc., White Paper WP-Aries 01-1112, 2012.
[15] A. A. Anasane and R. A. Satao. A survey on various multipath
routing protocols in wireless sensor networks. Procedia Computer
Science, 79:610–615, 2016.
[16] ANSI/IEEE. Amendment 3 to 802.1q virtual bridged local area
networks: Multiple spanning trees. ANSI/IEEE Draft Standard
P802.1s/D11.2, 2001.
[17] ANSI/IEEE. Virtual bridged local area networks amendment 4:
Provider bridges. ANSI/IEEE Draft Standard P802.1ad/D1, 2003.
[18] M. T. Arashloo, A. Lavrov, M. Ghobadi, J. Rexford, D. Walker,
and D. Wentzlaff. Enabling programmable transport protocols in
high-speed nics. In NSDI, 2020.
[19] B. Arimilli, R. Arimilli, V. Chung, S. Clark, W. Denzel, B. Drerup,
T. Hoefler, J. Joyner, J. Lewis, J. Li, N. Ni, and R. Rajamony. The
PERCS High-Performance Interconnect. In Hot Interconnects 2010. IEEE, Aug. 2010.
[20] M. Aruna, D. Bhanu, and R. Punithagowri. A survey on load
balancing algorithms in cloud environment. International Journal
of Computer Applications, 82(16), 2013.
[21] D. Awduche, L. Berger, D. Gan, T. Li, V. Srinivasan, and G. Swal-
low. Rsvp-te: extensions to rsvp for lsp tunnels, 2001.
[22] S. Azizi, N. Hashemi, and A. Khonsari. Hhs: an efficient network
topology for large-scale data centers. The Journal of Supercomput-
ing, 72(3):874–899, 2016.
[23] W. Bai, L. Chen, K. Chen, D. Han, C. Tian, and W. Sun. PIAS:
practical information-agnostic flow scheduling for data center
networks. In Proceedings of the 13th ACM Workshop on Hot Topics
in Networks, HotNets-XIII, pages 25:1–25:7, 2014.
[24] B. G. Banavalikar, C. M. DeCusatis, M. Gusat, K. G. Kamble,
and R. J. Recio. Credit-based flow control in lossless Ethernet
networks, Jan. 12 2016. US Patent 9,237,111.
[25] M. F. Bari, R. Boutaba, R. Esteves, L. Z. Granville, M. Podlesny,
M. G. Rabbani, Q. Zhang, and M. F. Zhani. Data center network
virtualization: A survey. IEEE Communications Surveys & Tutorials,
15(2):909–928, 2012.
[26] A. Beloglazov, R. Buyya, Y. C. Lee, and A. Zomaya. A taxonomy
and survey of energy-efficient data centers and cloud computing
systems. In Advances in computers, volume 82, pages 47–111.
Elsevier, 2011.
[27] T. Ben-Nun, M. Besta, S. Huber, A. N. Ziogas, D. Peter, and
T. Hoefler. A modular benchmarking infrastructure for high-
performance and reproducible deep learning. arXiv preprint
arXiv:1901.10183, 2019.
[28] T. Ben-Nun and T. Hoefler. Demystifying parallel and distributed
deep learning: An in-depth concurrency analysis. ACM CSUR,
2019.
[29] C. H. Benet, A. J. Kassler, T. Benson, and G. Pongracz. Mp-hula:
Multipath transport aware load balancing using programmable
data planes. In Proceedings of the 2018 Morning Workshop on In-
Network Computing, pages 7–13. ACM, 2018.
[30] T. Benson, A. Anand, A. Akella, and M. Zhang. Microte: Fine
grained traffic engineering for data centers. In Proceedings of
the Seventh COnference on emerging Networking EXperiments and
Technologies, page 8. ACM, 2011.
[31] M. Besta, M. Fischer, T. Ben-Nun, J. De Fine Licht, and T. Hoefler.
Substream-centric maximum matchings on fpga. In ACM/SIGDA
FPGA, pages 152–161, 2019.
[32] M. Besta, M. Fischer, V. Kalavri, M. Kapralov, and T. Hoefler.
Practice of streaming and dynamic graphs: Concepts, models,
systems, and parallelism. arXiv preprint arXiv:1912.12740, 2019.
[33] M. Besta, S. M. Hassan, S. Yalamanchili, R. Ausavarungnirun,
O. Mutlu, and T. Hoefler. Slim noc: A low-diameter on-chip
network topology for high energy efficiency and scalability. In
ACM SIGPLAN Notices, 2018.
[34] M. Besta and T. Hoefler. Fault tolerance for remote memory
access programming models. In ACM HPDC, pages 37–48, 2014.
[35] M. Besta and T. Hoefler. Slim Fly: A Cost Effective Low-Diameter
Network Topology. Nov. 2014. ACM/IEEE Supercomputing.
[36] M. Besta and T. Hoefler. Accelerating irregular computations
with hardware transactional memory and active messages. In
ACM HPDC, 2015.
[37] M. Besta and T. Hoefler. Active access: A mechanism for high-
performance distributed data-centric computations. In ACM ICS,
2015.
[38] M. Besta and T. Hoefler. Survey and taxonomy of lossless graph
compression and space-efficient graph representations. arXiv
preprint arXiv:1806.01799, 2018.
[39] M. Besta, R. Kanakagiri, H. Mustafa, M. Karasikov, G. Rätsch,
T. Hoefler, and E. Solomonik. Communication-efficient jaccard
similarity for high-performance distributed genome compar-
isons. IEEE IPDPS, 2020.
[40] M. Besta, F. Marending, E. Solomonik, and T. Hoefler. Slimsell:
A vectorizable graph representation for breadth-first search. In
IEEE IPDPS, pages 32–41, 2017.
[41] M. Besta, E. Peter, R. Gerstenberger, M. Fischer, M. Podstawski,
C. Barthels, G. Alonso, and T. Hoefler. Demystifying graph
databases: Analysis and taxonomy of data organization, system
designs, and graph queries. arXiv preprint arXiv:1910.09017, 2019.
[42] M. Besta, M. Podstawski, L. Groner, E. Solomonik, and T. Hoefler.
To push or to pull: On reducing communication and synchroniza-
tion in graph computations. In ACM HPDC, 2017.
[43] M. Besta, M. Schneider, K. Cynk, M. Konieczny, E. Henriksson,
S. Di Girolamo, A. Singla, and T. Hoefler. Fatpaths: Routing in
supercomputers and data centers when shortest paths fall short.
ACM/IEEE Supercomputing, 2020.
[44] M. Besta, D. Stanojevic, J. D. F. Licht, T. Ben-Nun, and T. Hoefler.
Graph processing on fpgas: Taxonomy, survey, challenges. arXiv
preprint arXiv:1903.06697, 2019.
[45] M. Besta, D. Stanojevic, T. Zivic, J. Singh, M. Hoerold, and
T. Hoefler. Log (graph) a near-optimal high-performance graph
representation. In ACM PACT, pages 1–13, 2018.
[46] M. S. Birrittella et al. Intel® omni-path architecture: Enabling
scalable, high performance fabrics. In IEEE HOTI, 2015.
[47] N. J. Boden et al. Myrinet: A gigabit-per-second local area
network. IEEE micro, 1995.
[48] P. Bosshart, D. Daly, G. Gibb, M. Izzard, N. McKeown, J. Rexford,
C. Schlesinger, D. Talayco, A. Vahdat, G. Varghese, et al. P4:
Programming protocol-independent packet processors. ACM
SIGCOMM Computer Communication Review, 44(3):87–95, 2014.
[49] M. Bredel, Z. Bozakov, A. Barczyk, and H. Newman. Flow-based
load balancing in multipathed layer-2 networks using openflow
and multipath-tcp. In Hot topics in software defined networking,
pages 213–214. ACM, 2014.
[50] M. Caesar, M. Casado, T. Koponen, J. Rexford, and S. Shenker.
Dynamic route recomputation considered harmful. ACM SIG-
COMM Computer Communication Review, 40(2):66–71, 2010.
[51] C. Callegari, S. Giordano, M. Pagano, and T. Pepe. A survey
of congestion control mechanisms in linux tcp. In International
Conference on Distributed Computer and Communication Networks,
pages 28–42. Springer, 2013.
[52] C. Camarero, C. Martínez, E. Vallejo, and R. Beivide. Projective
networks: Topologies for large parallel computer systems. IEEE
TPDS, 28(7):2003–2016, 2016.
[53] J. Cao, R. Xia, P. Yang, C. Guo, G. Lu, L. Yuan, Y. Zheng, H. Wu,
Y. Xiong, and D. Maltz. Per-packet load-balanced, low-latency
routing for clos-based data center networks. In ACM CoNEXT,
pages 49–60, 2013.
[54] N. Cardwell, Y. Cheng, C. S. Gunn, S. H. Yeganeh, and V. Ja-
cobson. BBR: congestion-based congestion control. ACM Queue,
14(5):20–53, 2016.
[55] A. Caulfield, P. Costa, and M. Ghobadi. Beyond smartnics:
Towards a fully programmable cloud. In IEEE HPSR, pages 1–6.
IEEE, 2018.
[56] K. Chen, C. Hu, X. Zhang, K. Zheng, Y. Chen, and A. V. Vasilakos.
Survey on routing in data centers: insights and future directions.
IEEE network, 25(4):6–10, 2011.
[57] L. Chen, B. Li, and B. Li. Allocating bandwidth in datacenter
networks: A survey. Journal of Computer Science and Technology,
29(5):910–917, 2014.
[58] T. W. Chim and K. L. Yeung. Traffic distribution over equal-cost-
multi-paths. In IEEE International Conference on Communications,
volume 2, pages 1207–1211, 2004.
[59] C. Clos. A study of non-blocking switching networks. Bell Labs
Technical Journal, 32(2):406–424, 1953.
[60] A. R. Curtis, W. Kim, and P. Yalagandula. Mahout: Low-overhead
datacenter traffic management using end-host-based elephant
detection. In INFOCOM, 2011 Proceedings IEEE, pages 1629–1637.
IEEE, 2011.
[61] W. Dally and B. Towles. Principles and Practices of Interconnection
Networks. Morgan Kaufmann Publishers Inc., San Francisco, CA,
USA, 2003.
[62] W. J. Dally and B. P. Towles. Principles and practices of interconnec-
tion networks. Elsevier, 2004.
[63] E. Dashkova and A. Gurtov. Survey on congestion control
mechanisms for wireless sensor networks. In Internet of things,
smart spaces, and next generation networking, pages 75–85. Springer,
2012.
[64] J. de Fine Licht et al. Transformations of high-level synthesis
codes for high-performance computing. arXiv:1805.08288, 2018.
[65] A. F. De Sousa. Improving load balance and resilience of eth-
ernet carrier networks with ieee 802.1 s multiple spanning tree
protocol. In IEEE ICN/ICONS/MCL, 2006.
[66] C. DeCusatis. Handbook of fiber optic data communication: a practical
guide to optical networking. Academic Press, 2013.
[67] S. Derradji et al. The bxi interconnect architecture. In IEEE HOTI,
2015.
[68] S. Di Girolamo, K. Taranov, A. Kurth, M. Schaffner, T. Schneider,
J. Beránek, M. Besta, L. Benini, D. Roweth, and T. Hoefler.
Network-accelerated non-contiguous memory transfers. arXiv
preprint arXiv:1908.08590, 2019.
[69] A. Dixit, P. Prakash, Y. C. Hu, and R. R. Kompella. On the impact
of packet spraying in data center networks. In INFOCOM, 2013
Proceedings IEEE, pages 2130–2138. IEEE, 2013.
[70] J. Domke and T. Hoefler. Scheduling-Aware Routing for Super-
computers. In ACM/IEEE Supercomputing, 2016.
[71] J. Domke, T. Hoefler, and S. Matsuoka. Routing on the De-
pendency Graph: A New Approach to Deadlock-Free High-
Performance Routing. In ACM HPDC, 2016.
[72] J. Domke, T. Hoefler, and W. Nagel. Deadlock-Free Oblivious
Routing for Arbitrary Topologies. In IEEE IPDPS, 2011.
[73] J. Domke, S. Matsuoka, I. R. Ivanov, Y. Tsushima, T. Yuki, A. No-
mura, S. Miura, N. McDonald, D. L. Floyd, and N. Dubé. HyperX
Topology: First At-Scale Implementation and Comparison to the
Fat-Tree. In ACM/IEEE Supercomputing, 2019.
[74] J. J. Dongarra, H. W. Meuer, E. Strohmaier, et al. Top500 super-
computer sites. Supercomputer, 13:89–111, 1997.
[75] A. Elwalid, C. Jin, S. Low, and I. Widjaja. Mate: Mpls adaptive
traffic engineering. In IEEE INFOCOM, 2001.
[76] D. Farinacci, D. Lewis, D. Meyer, and V. Fuller. The locator/id
separation protocol (lisp) rfc 6830. 2013.
[77] M. Flajslik et al. Megafly: A topology for exascale systems.
In International Conference on High Performance Computing, pages
289–310. Springer, 2018.
[78] J. Flich, P. López, J. C. Sancho, A. Robles, and J. Duato. Improving
InfiniBand Routing Through Multiple Virtual Networks. In
ISHPC, 2002.
[79] D. J. Flora, V. Kavitha, and M. Muthuselvi. A survey on conges-
tion control techniques in wireless sensor networks. In ICETECT,
pages 1146–1149. IEEE, 2011.
[80] R. W. Floyd. Algorithm 97: shortest path. Communications of the
ACM, 5(6):345, 1962.
[81] K.-T. Foerster and S. Schmid. Survey of reconfigurable data center
networks: Enablers, algorithms, complexity. ACM SIGACT News,
50(2):62–79, 2019.
[82] M. Garcia, E. Vallejo, R. Beivide, M. Odriozola, C. Camarero,
M. Valero, G. Rodriguez, J. Labarta, and C. Minkenberg. On-the-
fly adaptive routing in high-radix hierarchical networks. In 41st
International Conference on Parallel Processing, ICPP, pages 279–288,
2012.
[83] M. Garcia, E. Vallejo, R. Beivide, M. Odriozola, and M. Valero.
Efficient routing mechanisms for dragonfly networks. In 42nd
International Conference on Parallel Processing, ICPP, pages 582–
592, 2013.
[84] R. Garcia et al. Lsom: A link state protocol over mac addresses
for metropolitan backbones using optical ethernet switches. In
IEEE NCA, 2003.
[85] Y. Geng, V. Jeyakumar, A. Kabbani, and M. Alizadeh. Juggler:
a practical reordering resilient network stack for datacenters.
In Proceedings of the Eleventh European Conference on Computer
Systems, pages 1–16, 2016.
[86] P. Geoffray. Myrinet express (MX): Is your interconnect smart? In
IEEE HPC Asia, 2004.
[87] R. Gerstenberger et al. Enabling Highly-scalable Remote Mem-
ory Access Programming with MPI-3 One Sided. In Proc. of
ACM/IEEE Supercomputing, 2013.
[88] A. Ghaffari. Congestion control mechanisms in wireless sensor
networks: A survey. Journal of network and computer applications,
52:101–115, 2015.
[89] E. J. Ghomi, A. M. Rahmani, and N. N. Qader. Load-balancing
algorithms in cloud computing: A survey. Journal of Network and
Computer Applications, 88:50–71, 2017.
[90] S. Ghorbani, Z. Yang, P. Godfrey, Y. Ganjali, and A. Firoozshahian.
Drill: Micro load balancing for low-latency data center networks.
In ACM SIGCOMM, 2017.
[91] L. Gianinazzi, P. Kalvoda, A. De Palma, M. Besta, and T. Hoefler.
Communication-avoiding parallel minimum cuts and connected
components. In ACM SIGPLAN Notices, volume 53, pages 219–
232. ACM, 2018.
[92] B. Goglin. Design and implementation of Open-MX: High-
performance message passing over generic Ethernet hardware.
In IEEE IPDPS, 2008.
[93] I. Gojmerac, T. Ziegler, and P. Reichl. Adaptive multipath routing
based on local distribution of link load information. In Springer
QofIS, 2003.
[94] A. Greenberg, J. R. Hamilton, N. Jain, S. Kandula, C. Kim,
P. Lahiri, D. A. Maltz, P. Patel, and S. Sengupta. VL2: a scalable
and flexible data center network. ACM SIGCOMM computer
communication review, 39(4):51–62, 2009.
[95] A. Greenberg, P. Lahiri, D. A. Maltz, P. Patel, and S. Sengupta.
Towards a next generation data center architecture: scalability
and commoditization. In ACM PRESTO, 2008.
[96] GSIC, Tokyo Institute of Technology. TSUBAME2.5 Hardware
and Software, Nov. 2013.
[97] GSIC, Tokyo Institute of Technology. TSUBAME3.0 Hardware
and Software Specifications, July 2017.
[98] C. Guo, G. Lu, D. Li, H. Wu, X. Zhang, Y. Shi, C. Tian, Y. Zhang,
and S. Lu. Bcube: a high performance, server-centric network
architecture for modular data centers. ACM SIGCOMM CCR,
39(4):63–74, 2009.
16
[99] C. Guo, H. Wu, K. Tan, L. Shi, Y. Zhang, and S. Lu. Dcell: a
scalable and fault-tolerant network structure for data centers.
In ACM SIGCOMM Computer Communication Review, volume 38,
pages 75–86. ACM, 2008.
[100] M. Handley, C. Raiciu, A. Agache, A. Voinescu, A. W. Moore,
G. Antichi, and M. Wojcik. Re-architecting datacenter networks
and stacks for low latency and high performance. In ACM
SIGCOMM, 2017.
[101] V. Harsh, S. A. Jyothi, I. Singh, and P. Godfrey. Expander data-
centers: From theory to practice. arXiv preprint arXiv:1811.00212,
2018.
[102] K. He, E. Rozner, K. Agarwal, W. Felter, J. B. Carter, and A. Akella.
Presto: Edge-based load balancing for fast datacenter networks.
In ACM SIGCOMM, 2015.
[103] K. He, E. Rozner, K. Agarwal, Y. J. Gu, W. Felter, J. B. Carter, and
A. Akella. AC/DC TCP: virtual congestion control enforcement
for datacenter networks. In Proceedings of the 2016 conference on
ACM SIGCOMM, pages 244–257, 2016.
[104] T. Hoefler, S. Di Girolamo, K. Taranov, R. E. Grant, and
R. Brightwell. spin: High-performance streaming processing in
the network. In ACM/IEEE Supercomputing, 2017.
[105] T. Hoefler, T. Schneider, and A. Lumsdaine. Optimized Routing
for Large-Scale InfiniBand Networks. In IEEE HOTI, 2009.
[106] C. Hopps. RFC 2992: Analysis of an Equal-Cost Multi-Path
Algorithm, 2000.
[107] S. Hu, K. Chen, H. Wu, W. Bai, C. Lan, H. Wang, H. Zhao,
and C. Guo. Explicit path control in commodity data centers:
Design and applications. IEEE/ACM Transactions on Networking,
24(5):2768–2781, 2016.
[108] J. Huang et al. Tuning high flow concurrency for mptcp in data
center networks. Journal of Cloud Computing, 9(1):1–15, 2020.
[109] X. Huang and Y. Fang. Performance study of node-disjoint mul-
tipath routing in vehicular ad hoc networks. IEEE Transactions on
Vehicular Technology, 2009.
[110] J. Hwang, J. Yoo, and N. Choi. Deadline and incast aware tcp for
cloud data center networks. Computer Networks, 68:20–34, 2014.
[111] IEEE. IEEE 802.1ah standard. http://www.ieee802.org/1/
pages/802.1ah.html, 2008.
[112] Infiniband Trade Association and others. Rocev2, 2014.
[113] InfiniBand® Trade Association. InfiniBandTM Architecture Spec-
ification Volume 1 Release 1.3 (General Specifications), Mar. 2015.
[114] A. Iwata, Y. Hidaka, M. Umayabashi, N. Enomoto, and A. Aru-
taki. Global open ethernet (goe) system and its performance
evaluation. IEEE Journal on Selected Areas in Communications,
22(8):1432–1442, 2004.
[115] A. Iwata. Global optical ethernet architecture as cost-effective
scalable vpn solution. In NFOEC, 2002.
[116] R. Jain. Congestion control and traffic management in atm
networks: Recent advances and a survey. Computer Networks and
ISDN systems, 28(13):1723–1738, 1996.
[117] R. Jain and S. Paul. Network virtualization and software defined
networking for cloud computing: a survey. IEEE Communications
Magazine, 51(11):24–31, 2013.
[118] S. Jain et al. Viro: A scalable, robust and namespace independent
virtual id routing for future networks. In IEEE INFOCOM, 2011.
[119] J. Jiang, R. Jain, and C. So-In. An explicit rate control framework
for lossless ethernet operation. In Communications, 2008. ICC’08.
IEEE International Conference on, pages 5914–5918. IEEE, 2008.
[120] R. P. Joglekar and P. Game. Managing congestion in data center
network using congestion notification algorithms. IRJET, 2016.
[121] S. A. Jyothi, M. Dong, and P. Godfrey. Towards a flexible data
center fabric with source routing. In ACM SOSR, 2015.
[122] S. A. Jyothi, A. Singla, P. B. Godfrey, and A. Kolla. Measuring and
understanding throughput of network topologies. In ACM/IEEE
Supercomputing, 2016.
[123] A. Kabbani, B. Vamanan, J. Hasan, and F. Duchene. FlowBen-
der: Flow-level Adaptive Routing for Improved Latency and
Throughput in Datacenter Networks. In Proceedings of the 10th
ACM International on Conference on emerging Networking Experi-
ments and Technologies, pages 149–160. ACM, 2014.
[124] C. Kachris and I. Tomkos. A survey on optical interconnects for
data centers. IEEE Communications Surveys & Tutorials, 14(4):1021–
1036, 2012.
[125] S. Kandula, D. Katabi, B. Davie, and A. Charny. Walking the
tightrope: Responsive yet stable traffic engineering. In ACM
SIGCOMM CCR, volume 35, pages 253–264. ACM, 2005.
[126] S. Kandula, D. Katabi, S. Sinha, and A. Berger. Dynamic load
balancing without packet reordering. ACM SIGCOMM Computer
Communication Review, 37(2):51–62, 2007.
[127] S. Kassing, A. Valadarsky, G. Shahaf, M. Schapira, and A. Singla.
Beyond fat-trees without antennae, mirrors, and disco-balls. In
ACM SIGCOMM, pages 281–294, 2017.
[128] G. Kathareios, C. Minkenberg, B. Prisacari, G. Rodriguez, and
T. Hoefler. Cost-effective diameter-two topologies: Analysis and
evaluation. In ACM/IEEE Supercomputing, 2015.
[129] N. Katta, A. Ghag, M. Hira, I. Keslassy, A. Bergman, C. Kim, and
J. Rexford. Clove: Congestion-aware load balancing at the virtual
edge. In Proceedings of the 13th International Conference on emerging
Networking EXperiments and Technologies, pages 323–335, 2017.
[130] N. Katta, M. Hira, C. Kim, A. Sivaraman, and J. Rexford. Hula:
Scalable load balancing using programmable data planes. In
Proceedings of the Symposium on SDN Research, page 10. ACM,
2016.
[131] N. P. Katta, M. Hira, A. Ghag, C. Kim, I. Keslassy, and J. Rexford.
CLOVE: how I learned to stop worrying about the core and love
the edge. In Proceedings of the 15th ACM Workshop on Hot Topics in
Networks, HotNets, pages 155–161, 2016.
[132] R. Kawano, H. Nakahara, I. Fujiwara, H. Matsutani, M. Koibuchi,
and H. Amano. Loren: A scalable routing method for layout-
conscious random topologies. In 2016 Fourth International Sympo-
sium on Computing and Networking (CANDAR), pages 9–18. IEEE,
2016.
[133] R. Kawano, H. Nakahara, I. Fujiwara, H. Matsutani, M. Koibuchi,
and H. Amano. A layout-oriented routing method for low-
latency hpc networks. IEICE TRANSACTIONS on Information and
Systems, 100(12):2796–2807, 2017.
[134] R. Kawano, R. Yasudo, H. Matsutani, and H. Amano. k-
optimized path routing for high-throughput data center net-
works. In 2018 Sixth International Symposium on Computing and
Networking (CANDAR), pages 99–105. IEEE, 2018.
[135] C. Kim, M. Caesar, and J. Rexford. Floodless in seattle: a scalable
ethernet architecture for large enterprises. In ACM SIGCOMM,
pages 3–14, 2008.
[136] J. Kim, W. J. Dally, and D. Abts. Flattened butterfly: a cost-
efficient topology for high-radix networks. In ACM SIGARCH
Comp. Arch. News, 2007.
[137] J. Kim, W. J. Dally, S. Scott, and D. Abts. Technology-driven,
highly-scalable dragonfly topology. In 2008 International Sympo-
sium on Computer Architecture, pages 77–88. IEEE, 2008.
[138] G. Kwasniewski, M. Kabić, M. Besta, J. VandeVondele, R. Solcà,
and T. Hoefler. Red-blue pebbling revisited: near optimal par-
allel matrix-matrix multiplication. In ACM/IEEE Supercomputing,
page 24. ACM, 2019.
[139] G. M. Lee and J. Choi. A survey of multipath routing for traffic
engineering. Information and Communications University, Korea,
2002.
[140] F. Lei, D. Dong, X. Liao, X. Su, and C. Li. Galaxyfly: A novel
family of flexible-radix low-diameter topologies for large-scales
interconnection networks. In ACM ICS, 2016.
[141] C. E. Leiserson, Z. S. Abuhamdeh, D. C. Douglas, C. R. Feynman,
M. N. Ganmukhi, J. V. Hill, W. D. Hillis, B. C. Kuszmaul, M. A. S.
Pierre, D. S. Wells, M. C. Wong-Chan, S. Yang, and R. Zak. The
network architecture of the connection machine CM-5. J. Parallel
Distrib. Comput., 33(2):145–158, 1996.
[142] M. Li, A. Lukyanenko, Z. Ou, A. Ylä-Jääski, S. Tarkoma,
M. Coudron, and S. Secci. Multipath transmission for the
internet: A survey. IEEE Communications Surveys & Tutorials,
18(4):2887–2925, 2016.
[143] S. Li, P.-C. Huang, and B. Jacob. Exascale interconnect topology
characterization and parameter exploration. In HPCC/SmartCi-
ty/DSS, pages 810–819. IEEE, 2018.
[144] Y. Li and D. Pan. Openflow based load balancing for fat-tree
networks with multipath support. In Proc. 12th IEEE International
Conference on Communications (ICC’13), Budapest, Hungary, pages
1–5, 2013.
[145] S. Liu, H. Xu, and Z. Cai. Low latency datacenter networking: A
short survey. arXiv preprint arXiv:1312.3455, 2013.
[146] C. Lochert, B. Scheuermann, and M. Mauve. A survey on conges-
tion control for mobile ad hoc networks. Wireless communications
and mobile computing, 7(5):655–676, 2007.
[147] LS Committee and others. IEEE standard for local and metropoli-
tan area networks—virtual bridged local area networks. IEEE
Std, 802, 2006.
[148] Y. Lu. Sed: An sdn-based explicit-deadline-aware tcp for cloud
data center networks. Tsinghua Science and Technology, 21(5):491–
499, 2016.
[149] Y. Lu, G. Chen, B. Li, K. Tan, Y. Xiong, P. Cheng, J. Zhang,
E. Chen, and T. Moscibroda. Multi-path transport for RDMA
in datacenters. In 15th USENIX Symposium on Networked Systems
Design and Implementation (NSDI 18), pages 357–371, 2018.
[150] K.-S. Lui, W. C. Lee, and K. Nahrstedt. Star: a transparent
spanning tree bridge protocol with alternate routing. ACM
SIGCOMM CCR, 32(3):33–46, 2002.
[151] W.-M. Luo, C. Lin, and B.-P. Yan. A survey of congestion control
in the internet. CHINESE JOURNAL OF COMPUTERS-CHINESE
EDITION-, 24(1):1–18, 2001.
[152] O. Lysne and T. Skeie. Load Balancing of Irregular System Area
Networks Through Multiple Roots. In CIC. CSREA Press, 2001.
[153] G. Maglione-Mathey et al. Scalable deadlock-free deterministic
minimal-path routing engine for infiniband-based dragonfly net-
works. IEEE TPDS, 2017.
[154] G. Maheshwari, M. Gour, and U. K. Chourasia. A survey on
congestion control in manet. International Journal of Computer
Science and Information Technologies (IJCSIT), 5(2):998–1001, 2014.
[155] A. Matrawy and I. Lambadaris. A survey of congestion control
schemes for multicast video applications. IEEE Communications
Surveys & Tutorials, 5(2):22–31, 2003.
[156] S. Matsuoka. A64fx and Fugaku: A Game Changing, HPC / AI
Optimized Arm CPU for Exascale, Sept. 2019.
[157] N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Pe-
terson, J. Rexford, S. Shenker, and J. Turner. OpenFlow: enabling
innovation in campus networks. ACM SIGCOMM Computer
Communication Review, 38(2):69–74, 2008.
[158] Mellanox Technologies. Mellanox OFED for Linux User Manual
Rev. 2.0-3.0.0, Aug. 2013.
[159] Mellanox Technologies. How To Configure Adaptive Routing
and SHIELD (New), Nov. 2019.
[160] R. Mittal, V. T. Lam, N. Dukkipati, E. R. Blem, H. M. G. Wassel,
M. Ghobadi, A. Vahdat, Y. Wang, D. Wetherall, and D. Zats.
TIMELY: rtt-based congestion control for the datacenter. In
Proceedings of the 2015 ACM Conference on Special Interest Group
on Data Communication, SIGCOMM, pages 537–550, 2015.
[161] B. Montazeri, Y. Li, M. Alizadeh, and J. Ousterhout. Homa:
A receiver-driven low-latency transport protocol using network
priorities. arXiv preprint arXiv:1803.09615, 2018.
[162] J. Moy. Ospf version 2. Technical report, 1997.
[163] J. Mudigonda, P. Yalagandula, M. Al-Fares, and J. C. Mogul.
SPAIN: COTS Data-Center Ethernet for Multipathing over Ar-
bitrary Topologies. In NSDI, pages 265–280, 2010.
[164] P. Narvaez, K.-Y. Siu, and H.-Y. Tzeng. Efficient algorithms for
multi-path link-state routing. 1999.
[165] S. Neuwirth et al. Scalable communication architecture for
network-attached accelerators. In IEEE HPCA, 2015.
[166] R. Niranjan Mysore, A. Pamboris, N. Farrington, N. Huang,
P. Miri, S. Radhakrishnan, V. Subramanya, and A. Vahdat. Port-
land: a scalable fault-tolerant layer 2 data center network fabric.
ACM SIGCOMM CCR, 39(4):39–50, 2009.
[167] A. Nomura et al. Performance Evaluation of Multi-rail InfiniBand
Network in TSUBAME2.0 (in Japanese). IPSJ SIG Technical Report,
2012, 2012.
[168] M. Noormohammadpour and C. S. Raghavendra. Datacenter
traffic control: Understanding techniques and tradeoffs. IEEE
Comm. Surveys & Tutorials, 2017.
[169] V. Olteanu and C. Raiciu. Datacenter scale load balancing for
multipath transport. In Proceedings of the 2016 workshop on Hot
topics in Middleboxes and Network Function Virtualization, pages
20–25, 2016.
[170] D. Oran. Osi is-is intra-domain routing protocol. Technical report,
1990.
[171] M. Park, S. Sohn, K. Kwon, and T. T. Kwon. Maxpass: Credit-
based multipath transmission for load balancing in data centers.
Journal of Communications and Networks, 2019.
[172] R. Peñaranda et al. The k-ary n-direct s-indirect family of
topologies for large-scale interconnection networks. The Journal
of Supercomputing, 2016.
[173] I. Pepelnjak. EIGRP network design solutions. Cisco press, 1999.
[174] R. Perlman. An algorithm for distributed computation of a
spanning tree in an extended LAN. In ACM SIGCOMM CCR,
volume 15, pages 44–53. ACM, 1985.
[175] R. Perlman. Rbridges: transparent routing. In IEEE INFOCOM,
2004.
[176] J. Perry, A. Ousterhout, H. Balakrishnan, D. Shah, and H. Fugal.
Fastpass: A centralized zero-queue datacenter network. ACM
SIGCOMM Computer Communication Review, 44(4):307–318, 2015.
[177] F. Petrini et al. The quadrics network: High-performance cluster-
ing technology. IEEE Micro, 2002.
[178] F. Petrini et al. Performance evaluation of the quadrics intercon-
nection network. Cluster Computing, 2003.
[179] G. F. Pfister. An introduction to the infiniband architecture. High
Performance Mass Storage and Parallel I/O, 42:617–632, 2001.
[180] J.-M. Pittet. IP and ARP over HIPPI-6400 (GSN). RFC 2835, May
2000.
[181] The Next Platform. The tug of war between InfiniBand and Ethernet.
https://www.nextplatform.com/2017/10/30/tug-war-infiniband-ethernet/.
[182] M. Polese, F. Chiariotti, E. Bonetto, F. Rigotto, A. Zanella, and
M. Zorzi. A survey on recent advances in transport layer
protocols. IEEE Communications Surveys & Tutorials, 21(4):3584–
3608, 2019.
[183] M. Radi, B. Dezfouli, K. A. Bakar, and M. Lee. Multipath routing
in wireless sensor networks: survey and research challenges.
Sensors, 12(1):650–685, 2012.
[184] M. S. Rahman, M. A. Mollah, P. Faizian, and X. Yuan. Load-
balanced slim fly networks. In Proceedings of the 47th International
Conference on Parallel Processing, pages 1–10, 2018.
[185] C. Raiciu, S. Barre, C. Pluntke, A. Greenhalgh, D. Wischik, and
M. Handley. Improving datacenter performance and robustness
with multipath TCP. In Proceedings of the ACM SIGCOMM 2011
Conference on Applications, Technologies, Architectures, and Protocols
for Computer Communications, pages 266–277, 2011.
[186] J. Rasley, B. Stephens, C. Dixon, E. Rozner, W. Felter, K. Agarwal,
J. Carter, and R. Fonseca. Planck: Millisecond-scale monitoring
and control for commodity networks. In ACM SIGCOMM Com-
puter Communication Review, volume 44, pages 407–418. ACM,
2014.
[187] K. S. Reddy and L. C. Reddy. A survey on congestion control
mechanisms in high speed networks. IJCSNS-International Journal
of Computer Science and Network Security, 8(1):187–195, 2008.
[188] Y. Ren, Y. Zhao, P. Liu, K. Dou, and J. Li. A survey on tcp incast
in data center networks. International Journal of Communication
Systems, 27(8):1160–1172, 2014.
[189] T. L. Rodeheffer, C. A. Thekkath, and D. C. Anderson. Smart-
bridge: A scalable bridge architecture. ACM SIGCOMM CCR,
30(4):205–216, 2000.
[190] E. Rosen, A. Viswanathan, R. Callon, et al. Multiprotocol label
switching architecture. RFC 3031, January 2001.
[191] D. Sampath, S. Agarwal, and J. Garcia-Luna-Aceves. ’ethernet on
air’: Scalable routing in very large ethernet-based networks. In
IEEE ICDCS, 2010.
[192] P. Schmid, M. Besta, and T. Hoefler. High-performance dis-
tributed RMA locks. In ACM HPDC, pages 19–30, 2016.
[193] H. Schweizer, M. Besta, and T. Hoefler. Evaluating the cost of
atomic operations on modern architectures. In IEEE PACT, pages
445–456, 2015.
[194] M. Scott, A. Moore, and J. Crowcroft. Addressing the scalability
of ethernet with moose. In Proc. DC CAVES Workshop, 2009.
[195] S. Sen, D. Shue, S. Ihm, and M. J. Freedman. Scalable, optimal
flow routing in datacenters via local link balancing. In CoNEXT,
2013.
[196] C. Sergiou, P. Antoniou, and V. Vassiliou. A comprehensive sur-
vey of congestion control protocols in wireless sensor networks.
IEEE Communications Surveys & Tutorials, 16(4):1839–1859, 2014.
[197] S. Sharma, K. Gopalan, S. Nanda, and T.-c. Chiueh. Viking: A
multi-spanning-tree ethernet architecture for metropolitan area
and cluster networks. In IEEE INFOCOM, 2004.
[198] S. B. Shaw and A. Singh. A survey on scheduling and load
balancing techniques in cloud computing environment. In 2014
international conference on computer and communication technology
(ICCCT), pages 87–95. IEEE, 2014.
[199] H. Shoja, H. Nahid, and R. Azizi. A comparative survey on load
balancing algorithms in cloud computing. In Fifth International
Conference on Computing, Communications and Networking Technolo-
gies (ICCCNT), pages 1–5. IEEE, 2014.
[200] J. Shuja, K. Bilal, S. A. Madani, M. Othman, R. Ranjan, P. Balaji,
and S. U. Khan. Survey of techniques and architectures for
designing energy-efficient data centers. IEEE Systems Journal,
10(2):507–519, 2014.
[201] A. P. Silva, S. Burleigh, C. M. Hirata, and K. Obraczka. A survey
on congestion control for delay and disruption tolerant networks.
Ad Hoc Networks, 25:480–494, 2015.
[202] S. K. Singh, T. Das, and A. Jukan. A survey on internet multi-
path routing and provisioning. IEEE Communications Surveys &
Tutorials, 17(4):2157–2175, 2015.
[203] A. Singla, C.-Y. Hong, L. Popa, and P. B. Godfrey. Jellyfish:
Networking data centers randomly. 9th USENIX Symposium on
Networked Systems Design and Implementation (NSDI), 2012.
[204] T. Skeie, O. Lysne, J. Flich, P. López, A. Robles, and J. Duato.
LASH-TOR: A Generic Transition-Oriented Routing Algorithm.
In ICPADS, page 595. IEEE Computer Society, 2004.
[205] S. Sohn, B. L. Mark, and J. T. Brassil. Congestion-triggered
multipath routing based on shortest path information. In IEEE
ICCCN, 2006.
[206] E. Solomonik, M. Besta, F. Vella, and T. Hoefler. Scaling between-
ness centrality using communication-efficient sparse matrix mul-
tiplication. In ACM/IEEE Supercomputing, page 47, 2017.
[207] P. Sreekumari and J.-i. Jung. Transport protocols for data center
networks: a survey of issues, solutions and challenges. Photonic
Network Comm., 31(1), 2016.
[208] A. Sridharan et al. Achieving near-optimal traffic engineering
solutions for current ospf/is-is networks. IEEE/ACM TON,
13(2):234–247, 2005.
[209] B. Stephens, A. Cox, W. Felter, C. Dixon, and J. Carter. PAST:
Scalable Ethernet for data centers. In ACM CoNEXT, 2012.
[210] K. Subramanian. Multi-chassis link aggregation on network
devices, June 24 2014. US Patent 8,761,005.
[211] M. Suchara, D. Xu, R. Doverspike, D. Johnson, and J. Rexford.
Network architecture for joint failure recovery and traffic engi-
neering. In ACM SIGMETRICS, 2011.
[212] J. W. Suurballe and R. E. Tarjan. A quick method for finding
shortest pairs of disjoint paths. Networks, 14(2):325–336, 1984.
[213] A. Thakur and M. S. Goraya. A taxonomic survey on load
balancing in cloud. Journal of Network and Computer Applications,
98:43–57, 2017.
[214] J. Touch and R. Perlman. Transparent interconnection of lots of
links (TRILL): Problem and applicability statement. Technical
report, 2009.
[215] N. T. Truong, I. Fujiwara, M. Koibuchi, and K.-V. Nguyen. Dis-
tributed shortcut networks: Low-latency low-degree non-random
topologies targeting the diameter and cable length trade-off. IEEE
Transactions on Parallel and Distributed Systems, 28(4):989–1001,
2016.
[216] T.-N. Truong, K.-V. Nguyen, I. Fujiwara, and M. Koibuchi.
Layout-conscious expandable topology for low-degree intercon-
nection networks. IEICE TRANSACTIONS on Information and
Systems, 99(5):1275–1284, 2016.
[217] J. Tsai and T. Moors. A review of multipath routing protocols:
From wireless ad hoc to mesh networks. In ACoRN workshop on
wireless multihop networking, 2006.
[218] F. P. Tso, G. Hamilton, R. Weber, C. Perkins, and D. P. Pezaros.
Longer is better: Exploiting path diversity in data center net-
works. In IEEE 33rd International Conference on Distributed Com-
puting Systems, ICDCS, pages 430–439, 2013.
[219] A. Valadarsky, M. Dinitz, and M. Schapira. Xpander: Unveiling
the secrets of high-performance datacenters. In ACM HotNets,
2015.
[220] L. Valiant. A scheme for fast parallel communication. SIAM
journal on computing, 11(2):350–361, 1982.
[221] B. Vamanan, J. Hasan, and T. Vijaykumar. Deadline-aware data-
center TCP (D2TCP). ACM SIGCOMM Computer Communication
Review, 42(4):115–126, 2012.
[222] S. Van der Linden, G. Detal, and O. Bonaventure. Revisiting next-
hop selection in multipath networks. In ACM SIGCOMM CCR,
volume 41, 2011.
[223] E. Vanini, R. Pan, M. Alizadeh, P. Taheri, and T. Edsall. Let it flow:
Resilient asymmetric load balancing with flowlet switching. In
NSDI, pages 407–420, 2017.
[224] C. Villamizar. OSPF optimized multipath (OSPF-OMP). 1999.
[225] A. Vishnu, A. Mamidala, S. Narravula, and D. Panda. Automatic
Path Migration over InfiniBand: Early Experiences. In IEEE
IPDPS, 2007.
[226] B. Wang, Z. Qi, R. Ma, H. Guan, and A. V. Vasilakos. A survey on
data center networking for cloud computing. Computer Networks,
91:528–547, 2015.
[227] K. Wen, P. Samadi, S. Rumley, C. P. Chen, Y. Shen, M. Bahadori,
K. Bergman, and J. Wilke. Flexfly: Enabling a reconfigurable
dragonfly through silicon photonics. In ACM/IEEE Supercomput-
ing, 2016.
[228] J. Widmer, R. Denda, and M. Mauve. A survey on tcp-friendly
congestion control. IEEE network, 15(3):28–37, 2001.
[229] N. Wolfe, M. Mubarak, N. Jain, J. Domke, A. Bhatele, C. D.
Carothers, and R. B. Ross. Preliminary Performance Analysis
of Multi-rail Fat-tree Networks. In IEEE/ACM CCGrid, 2017.
[230] C. Xu, J. Zhao, and G.-M. Muntean. Congestion control design
for multipath transport protocols: A survey. IEEE communications
surveys & tutorials, 2016.
[231] M. Xu, W. Tian, and R. Buyya. A survey on load balancing
algorithms for virtual machines placement in cloud computing.
Concurrency and Computation: Practice and Experience, 29(12):e4123,
2017.
[232] X. Xu et al. Unified source routing instructions using mpls label
stack. IETF MPLS Working Group draft, Internet Eng. Task Force,
2017.
[233] Yahoo Finance. Mellanox Delivers Record First Quarter
2020 Financial Results. https://finance.yahoo.com/news/
mellanox-delivers-record-first-quarter-200500726.htm.
[234] P. Yébenes et al. Improving non-minimal and adaptive routing
algorithms in Slim Fly networks. In IEEE HOTI, 2017.
[235] D. Zats, T. Das, P. Mohan, D. Borthakur, and R. H. Katz. Detail:
reducing the flow completion time tail in datacenter networks. In
ACM SIGCOMM, pages 139–150, 2012.
[236] H. Zhang, J. Zhang, W. Bai, K. Chen, and M. Chowdhury.
Resilient datacenter load balancing in the wild. In Proceedings
of the Conference of the ACM Special Interest Group on Data Commu-
nication, pages 253–266, 2017.
[237] J. Zhang, F. Ren, and C. Lin. Survey on transport control in data
center networks. IEEE Network, 27(4):22–26, 2013.
[238] J. Zhang, K. Xi, L. Zhang, and H. J. Chao. Optimizing network
performance using weighted multipath routing. In IEEE ICCCN,
2012.
[239] J. Zhang, F. R. Yu, S. Wang, T. Huang, Z. Liu, and Y. Liu. Load bal-
ancing in data center networks: A survey. IEEE Communications
Surveys & Tutorials, 20(3):2324–2352, 2018.
[240] J. Zhao, L. Wang, S. Li, X. Liu, Z. Yuan, and Z. Gao. A survey of
congestion control mechanisms in wireless sensor networks. In
2010 Sixth International Conference on Intelligent Information Hiding
and Multimedia Signal Processing, pages 719–722. IEEE, 2010.
[241] J. Zhou, M. Tewari, M. Zhu, A. Kabbani, L. Poutievski, A. Singh,
and A. Vahdat. Wcmp: Weighted cost multipathing for improved
fairness in data centers. In ACM EuroSys, 2014.
[242] D. Zhuo, Q. Zhang, V. Liu, A. Krishnamurthy, and T. Anderson.
Rack-level congestion control. In Proceedings of the 15th ACM
Workshop on Hot Topics in Networks, pages 148–154. ACM, 2016.
[243] S. M. Zin, N. B. Anuar, M. L. M. Kiah, and I. Ahmedy. Survey of
secure multipath routing protocols for wsns. Journal of Network
and Computer Applications, 55:123–153, 2015.
[244] A. Zinin and I. Cisco. Routing: Packet forwarding and intra-
domain routing protocols, 2002.