Fig 5 - uploaded by Eitan Zahavi
A portion of the CDG in the deadlock situation shown in Fig. 4.

Source publication
Article
Dragonfly topologies are gathering great interest as one of the most promising interconnect options for High-Performance Computing systems. Dragonflies contain physical cycles that may lead to traffic deadlocks unless the routing algorithm prevents them properly. Previous topology-aware algorithms are difficult to implement, or even unfeasible, in...

Contexts in source publication

Context 1
... contrast with network topologies that are "naturally" deadlock-free (like fat-trees), Dragonfly topologies contain physical cycles that will lead to cyclic dependencies if not prevented by the routing algorithm. Fig. 4 shows a deadlock situation created by several minimal-path routes in a fully-connected Dragonfly topology connecting 36 nodes. Fig. 5 shows a portion of the channel dependency graph (CDG) [8] corresponding to the deadlock situation shown in Fig. ...
Context 2
... algorithm, taking into account that a dependency exists between channels A and B (in that direction) if B can be requested after using A. For instance, in Fig. 4 ch1 is a local channel connecting switches 00 and 01 at G0, and ch2 is a global channel connecting G0 and G2, thus there is a direct dependency between ch1 and ch2 (shown as an arrow in Fig. 5), as the (basic) minimal-path routing algorithm states that both channels will be crossed consecutively by traffic flows going from switch 00 to any switch in G2. Similar dependencies exist among other channels, so that there is a cycle in the CDG involving channels ch1, ch2, ch3, ch4, ch5 and ch6. Therefore, it is necessary to ...
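The dependency rule quoted in Context 2 can be illustrated with a minimal Python sketch. It adds an edge A -> B whenever some route crosses B immediately after A, then searches the resulting graph for a cycle. The three routes in `ROUTES` are an assumption made for illustration, since Fig. 4 is not reproduced here: only the roles of ch1, ch2, ch4 and ch6 come from the excerpts, while treating ch3 and ch5 as local channels in G2 and G1 is a guess. The function names `build_cdg` and `find_cycle` are hypothetical.

```python
from collections import defaultdict

def build_cdg(routes):
    """Return the channel dependency graph as {channel: set of next channels}."""
    cdg = defaultdict(set)
    for route in routes:
        # B depends on A when B is requested right after A on some route.
        for a, b in zip(route, route[1:]):
            cdg[a].add(b)
    return cdg

def find_cycle(cdg):
    """Return one cycle as a list of channels, or None if the graph is acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    colour = defaultdict(int)
    stack = []

    def visit(node):
        colour[node] = GREY
        stack.append(node)
        for nxt in sorted(cdg.get(node, ())):
            if colour[nxt] == GREY:
                # Back edge: the cycle is the path from nxt to the current node.
                return stack[stack.index(nxt):]
            if colour[nxt] == WHITE:
                found = visit(nxt)
                if found:
                    return found
        stack.pop()
        colour[node] = BLACK
        return None

    for start in sorted(cdg):
        if colour[start] == WHITE:
            found = visit(start)
            if found:
                return found
    return None

# Assumed minimal-path routes (local, global, local); not taken from Fig. 4.
ROUTES = [
    ["ch1", "ch2", "ch3"],  # G0 -> G2
    ["ch3", "ch4", "ch5"],  # G2 -> G1
    ["ch5", "ch6", "ch1"],  # G1 -> G0
]

if __name__ == "__main__":
    print(find_cycle(build_cdg(ROUTES)))
```

Under these assumed routes the search reports a cycle through all six channels, which matches the situation the excerpt describes; a single route on its own yields no cycle.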
Context 3
... stores packets routed through global channels ch4 and ch6, which corresponds to the communications from G2 to G1 and from G1 to G0 (i.e., G_s > G_d), respectively. By contrast, VL1 stores flows using the global channel ch2, used for communications from G0 to G2 (i.e., G_d > G_s). D3R is able to break the cyclic dependency shown in the CDG in Fig. 5, thereby preventing ...
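The group-ordering rule in Context 3 can be sketched in Python as follows. This is not the D3R implementation, only an illustration of the idea that flows with G_d > G_s and flows with G_s > G_d use different virtual lanes. Several things are assumptions: the excerpt is truncated before naming the lane used when G_s > G_d, so it is called VL0 here; intra-group traffic is also sent to VL0; each flow is assumed to keep one lane along its whole route; and the three routes are illustrative, not read from Fig. 4. The names `select_vl`, `has_cycle` and `FLOWS` are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter  # Python 3.9+

def select_vl(src_group, dst_group):
    """Pick a virtual lane from the group indices of a flow.

    VL1 for dst > src follows the excerpt; naming the other lane VL0 and
    placing intra-group traffic on it are assumptions.
    """
    return 1 if dst_group > src_group else 0

def has_cycle(flows, use_vls):
    """Check the dependency graph over (channel, VL) pairs for a cycle."""
    deps = {}
    for src, dst, route in flows:
        vl = select_vl(src, dst) if use_vls else 0
        hops = [(ch, vl) for ch in route]
        for a, b in zip(hops, hops[1:]):
            deps.setdefault(a, set()).add(b)
    try:
        # static_order() raises CycleError when the graph contains a cycle.
        tuple(TopologicalSorter(deps).static_order())
    except CycleError:
        return True
    return False

# (source group, destination group, assumed route); routes are illustrative.
FLOWS = [
    (0, 2, ["ch1", "ch2", "ch3"]),  # G_d > G_s, so VL1
    (2, 1, ["ch3", "ch4", "ch5"]),  # G_s > G_d, so VL0 (assumed name)
    (1, 0, ["ch5", "ch6", "ch1"]),  # G_s > G_d, so VL0 (assumed name)
]

if __name__ == "__main__":
    print("single lane cyclic:", has_cycle(FLOWS, use_vls=False))
    print("two lanes cyclic:", has_cycle(FLOWS, use_vls=True))
```

With a single lane the three assumed flows close the six-channel cycle; once the G0 to G2 flow moves to its own lane, the remaining dependencies form a chain and the cycle disappears.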

Similar publications

Preprint
System noise can negatively impact the performance of HPC systems, and the interconnection network is one of the main factors contributing to this problem. To mitigate this effect, adaptive routing sends packets on non-minimal paths if they are less congested. However, while this may mitigate interference caused by congestion, it also generates mor...

Citations

... The available routing engines suitable for Dragonflies that follow the second approach (i.e. layered routing) are LASH [8], DFSSSP [17] and D3R [18], the former two being actually topology agnostic algorithms while the latter having been specially designed for Dragonflies. Regarding the third approach, the classical, topology-agnostic Up/Down algorithm is also available as a routing engine (UPDN) in OpenSM [19], which can be also used in Dragonflies. ...
... In that sense, in [20] the requirements of some of these routing engines are analyzed in terms of routing time and number of required VLs, but no performance measurements are provided. In [18], both a comparison of the requirements (in terms of number of required VLs) and a performance comparison (based on both simulation experiments and results from a real InfiniBand-based cluster) are provided, but not all the routing engines available for Dragonflies are considered. ...
... Hence, this approach has been traditionally the preferred one to design routing engines. Current routing engines available in IB that are suitable for Dragonfly networks and use Layered Routing, are DFSSSP [17], LASH [8] and D3R [18]. ...
Preprint
The Dragonfly topology is currently one of the most popular network topologies in high-performance parallel systems. The interconnection networks of many of these systems are built from components based on the InfiniBand specification. However, due to some constraints in this specification, the available versions of the InfiniBand network controller (OpenSM) do not include routing engines based on some popular deadlock-free routing algorithms proposed theoretically for Dragonflies, such as the one proposed by Kim and Dally based on Virtual-Channel shifting. In this paper we propose a straightforward method to integrate this routing algorithm in OpenSM as a routing engine, explaining in detail the configuration required to support it. We also provide experiment results, obtained both from a real InfiniBand-based cluster and from simulation, to validate the new routing engine and to compare its performance and requirements against other routing engines currently available in OpenSM.
... Considerable research efforts have been devoted to investigating a range of efficient network topologies for HPC systems, including Flattened Butterfly [9][10][11], Dragonfly [12][13][14], HyperX [15], Skywalk [16], and SlimFly [17]. These structures are capable of delivering low diameters for HPC systems while also ensuring scalability through the port numbers (radixes) of the construction blocks (routers) [18]. ...
Article
The design of interconnection networks is a fundamental aspect of high-performance computing (HPC) systems. Among the available topologies, the Galaxyfly network stands out as a low-diameter and flexible-radix network for HPC applications. Given the paramount importance of collective communication in HPC performance, in this paper, we present two different all-to-all broadcast algorithms for the Galaxyfly network, which adhere to the supernode-first rule and the router-first rule, respectively. Our performance evaluation validates their effectiveness and shows that the first algorithm has a higher degree of utilization of network channels, and that the second algorithm can significantly reduce the average time for routers to collect packets from the supernode.
... A lot of research has been devoted to exploring a variety of efficient topologies, e.g., Flattened Butterfly [7], Dragonfly [4], [5], HyperX [8], Skywalk [9], and SlimFly [10]. Their scalability is guaranteed through the port number (radix) of the construction blocks (routers) [26]. ...
... A common example is a DF with a flattened butterfly intra-group scheme and a pruned Hamming graph inter-group scheme. Real DFs have diameters ranging from 2 to 5. Routing DFs requires special attention in order to guarantee deadlock freedom [32,57,56]. Furthermore, DFs provide low performance for adversarial inter-group traffic patterns unless either fine-tuned non-minimal adaptive routing techniques [48,44,67] or group-spreading job placement policies (RDR or RRR in [43]) are used. ...
Thesis
Building efficient supercomputers requires optimising communications, and their exaflopic scale causes an unavoidable risk of relatively frequent failures. For a cluster with given networking capabilities and applications, performance is achieved by providing a good route for every message while minimising resource access conflicts between messages. This thesis focuses on the fat-tree family of networks, for which we define several overarching properties so as to efficiently take into account a realistic superset of this topology, while keeping a significant edge over agnostic methods. Additionally, a partially novel static congestion risk evaluation method is used to compare algorithms. A generic optimisation is presented for some applications on clusters with heterogeneous equipment. The proposed algorithms use distinct approaches to improve centralised static routing by combining computation speed, fault-resilience, and minimal congestion risk.
... Many of these architectures feature some form of programmable NICs [47], [177]. Finally, there exist routing protocols for specific low-diameter topologies, for example for SF [234] or DF [153]. However, they usually do not support multipathing or non-minimal routing. ...
Article
The recent line of research into topology design focuses on lowering network diameter. Many low-diameter topologies such as Slim Fly or Jellyfish that substantially reduce cost, power consumption, and latency have been proposed. A key challenge in realizing the benefits of these topologies is routing. On one hand, these networks provide shorter path lengths than established topologies such as Clos or torus, leading to performance improvements. On the other hand, the number of shortest paths between each pair of endpoints is much smaller than in Clos, but there is a large number of non-minimal paths between router pairs. This hampers or even makes it impossible to use established multipath routing schemes such as ECMP. In this article, to facilitate high-performance routing in modern networks, we analyze existing routing protocols and architectures, focusing on how well they exploit the diversity of minimal and non-minimal paths. We first develop a taxonomy of different forms of support for multipathing and overall path diversity. Then, we analyze how existing routing schemes support this diversity. Among others, we consider multipathing with both shortest and non-shortest paths, support for disjoint paths, or enabling adaptivity. To address the ongoing convergence of HPC and “Big Data” domains, we consider routing protocols developed for both HPC systems and for data centers as well as general clusters. Thus, we cover architectures and protocols based on Ethernet, InfiniBand, and other HPC networks such as Myrinet. Our review will foster developing future high-performance multipathing routing protocols in supercomputers and data centers.
... The available routing engines suitable for Dragonflies that follow the second approach (i.e. layered routing) are LASH [31], DFSSSP [6] and D3R [20], the former two being actually topology agnostic algorithms while the latter having been specially designed for Dragonflies. Regarding the third approach, the classical, topology-agnostic Up/Down algorithm is also available as a routing engine (UPDN) in OpenSM [25], which can be also used in Dragonflies. ...
... In that sense, in [27] the requirements of some of these routing engines are analyzed in terms of routing time and number of required VLs, but no performance measurements are provided. In [20], both a comparison of the requirements (in terms of number of required VLs) and a performance comparison (based on both simulation experiments and results from a real InfiniBand-based cluster) are provided, but not all the routing engines available for Dragonflies are considered. ...
... Hence, this approach has been traditionally the preferred one to design routing engines. Current routing engines available in IB that are suitable for Dragonfly networks and use Layered Routing, are DFSSSP [6], LASH [31] and D3R [20]. ...
Article
The Dragonfly topology is currently one of the most popular network topologies in high-performance parallel systems. The interconnection networks of many of these systems are built from components based on the InfiniBand specification. However, due to some constraints in this specification, the available versions of the InfiniBand network controller (OpenSM) do not include routing engines based on some popular deadlock-free routing algorithms proposed theoretically for Dragonflies, such as the one proposed by Kim and Dally based on Virtual-Channel shifting. In this paper we propose a straightforward method to integrate this routing algorithm in OpenSM as a routing engine, explaining in detail the configuration required to support it. We also provide experiment results, obtained both from a real InfiniBand-based cluster and from simulation, to validate the new routing engine and to compare its performance and requirements against other routing engines currently available in OpenSM.
... Many of these architectures feature some form of programmable NICs [47], [177]. Finally, there exist routing protocols for specific low-diameter topologies, for example for SF [234] or DF [153]. However, they usually do not support multipathing or non-minimal routing. ...
Preprint
The recent line of research into topology design focuses on lowering network diameter. Many low-diameter topologies such as Slim Fly or Jellyfish that substantially reduce cost, power consumption, and latency have been proposed. A key challenge in realizing the benefits of these topologies is routing. On one hand, these networks provide shorter path lengths than established topologies such as Clos or torus, leading to performance improvements. On the other hand, the number of shortest paths between each pair of endpoints is much smaller than in Clos, but there is a large number of non-minimal paths between router pairs. This hampers or even makes it impossible to use established multipath routing schemes such as ECMP. In this work, to facilitate high-performance routing in modern networks, we analyze existing routing protocols and architectures, focusing on how well they exploit the diversity of minimal and non-minimal paths. We first develop a taxonomy of different forms of support for multipathing and overall path diversity. Then, we analyze how existing routing schemes support this diversity. Among others, we consider multipathing with both shortest and non-shortest paths, support for disjoint paths, or enabling adaptivity. To address the ongoing convergence of HPC and "Big Data" domains, we consider routing protocols developed for both traditional HPC systems and supercomputers, and for data centers and general clusters. Thus, we cover architectures and protocols based on Ethernet, InfiniBand, and other HPC networks such as Myrinet. Our review will foster developing future high-performance multipathing routing protocols in supercomputers and data centers.
... At scale from 378 nodes (two cabinets or a "group") and above, the miniapps share optical interconnect between the groups with all other processes in the system. Dragonfly network contention was recently analysed in [30]. At 1000 nodes and above the coarray global barrier strategy shows slightly higher efficiency than MPI. ...
Article
Fortran coarrays are an attractive alternative to MPI due to a familiar Fortran syntax, single sided communications and implementation in the compiler. Scaling of coarrays is compared in this work to MPI, using cellular automata (CA) 3D Ising magnetisation miniapps, built with the CASUP CA library, https://cgpack.sourceforge.io, developed by the authors. Ising energy and magnetisation were calculated with MPI_ALLREDUCE and Fortran 2018 co_sum collectives. The work was done on ARCHER (Cray XC30) up to the full machine capacity: 109,056 cores. Ping-pong latency and bandwidth results are very similar with MPI and with coarrays for message sizes from 1B to several MB. MPI halo exchange (HX) scaled better than coarray HX, which is surprising because both algorithms use pair-wise communications: MPI IRECV/ISEND/WAITALL vs Fortran sync images. Adding OpenMP to MPI or to coarrays resulted in worse L2 cache hit ratio, and lower performance in all cases, even though the NUMA effects were ruled out. This is likely because the CA algorithm is network bound at scale. This is further evidenced by the fact that very aggressive cache and inter-procedural optimisations lead to no performance gain. The sampling and tracing analysis shows good load balancing in compute in all miniapps, but imbalance in communication, indicating that the difference in performance between MPI and coarrays is likely due to parallel libraries (MPICH2 vs libpgas) and the Cray hardware specific libraries (uGNI vs DMAPP). Overall, the results look promising for coarray use beyond 100k cores. However, further coarray optimisation is needed to narrow the performance gap between coarrays and MPI.
Article
According to the latest TOP500 list, InfiniBand (IB) is the most widely used network architecture in the top 10 supercomputers. IB relies on Credit-based Flow Control (CBFC) to provide a lossless network and InfiniBand congestion control (IB CC) to relieve congestion, however, this can lead to the problem of victim flow since messages are mixed in the same queue and long-lived congestion spreading due to slow convergence. To deal with these problems, in this paper, we propose FlowStar, a fast convergence per-flow state accurate congestion control for InfiniBand. FlowStar includes two core mechanisms: 1) optimized per-flow CBFC mechanism provides flow state control to detect real congestion; and 2) rate adjustment rules make up for the mismatch between the original IB CC rate regulation and the per-hop CBFC to alleviate congestion spreading. FlowStar implements a per-flow congestion state on switches and can obtain in-flight packet information without additional parameter settings to ensure a lossless network. Evaluations show that FlowStar improves average and tail message complete time under different workloads.
Article
Based on the most recent TOP500 rankings, Infiniband (IB) stands out as the dominant network architecture among the top 10 supercomputers. Yet, it primarily employs deterministic routing, which tends to be suboptimal in network traffic balance. While deterministic routing invariably opts for the same forwarding path, adaptive routing offers flexibility by permitting packets to traverse varied paths for every source-destination pair. Contemporary adaptive routing methods in HPC networks typically determine path selection rooted in the switch queue's occupancy. While the queue length provides a glimpse into local congestion, it's challenging to consolidate such fragmented information to portray the full path accurately. In this paper, we introduce Alarm, an adaptive routing system that uses probabilistic path selection grounded in one-way delay metrics. The one-way delay not only offers a more holistic view of congestion, spanning from source to destination, but also captures the intricacies of network flows. Alarm gleans the one-way delay from each pathway via data packets, eliminating the need for separate delay detection packets and clock synchronization. The probabilistic selection hinges on weights determined by the one-way delay, ensuring the prevention of bottleneck links during congestion updates. Notably, routing decisions under Alarm are made per-flowlet. Guided by delay cues, the gap between flowlets is dynamically adjusted to match the maximum delay variation across diverse paths, thereby preventing the occurrence of packet out-of-order. The simulation results show that Alarm can achieve 2.0X and 1.7X better average and p99 FCT slowdown than existing adaptive routing.