Fig. 1. The InfiniBand-based switch architecture.

Source publication
Article
Dragonfly topologies are gathering great interest as one of the most promising interconnect options for High-Performance Computing systems. Dragonflies contain physical cycles that may lead to traffic deadlocks unless the routing algorithm prevents them properly. Previous topology-aware algorithms are difficult to implement, or even unfeasible, in...

Contexts in source publication

Context 1
... HCAs and links, together with control software entities, such as the subnet manager (SM) that discovers and configures the IB-based network. IB-based switches contain a set of ports, while HCAs contain one (or several) ports to connect endnodes to the network. Every port at switches or HCAs contains a buffer to store packets to be forwarded. Fig. 1 shows an example of an IB-based switch configured with k input and k output ports. Note that, for the sake of clarity, we show unidirectional input and output ports, even though IB assumes bidirectional ports at ...
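The switch model described in this excerpt can be made concrete with a small data structure. The following is a minimal sketch, not taken from the paper: the type names (ib_port_t, ib_switch_t) and the buffer capacity are illustrative assumptions.

#include <stdlib.h>

#define MAX_BUF_PKTS 64              /* assumed per-port buffer capacity */

typedef struct {
    int pkts[MAX_BUF_PKTS];          /* queued packet ids (placeholder) */
    int count;                       /* packets currently buffered */
} ib_port_t;

typedef struct {
    int        k;                    /* radix: k input and k output ports */
    ib_port_t *in;                   /* k input-port buffers */
    ib_port_t *out;                  /* k output-port buffers */
} ib_switch_t;

static ib_switch_t *ib_switch_create(int k)
{
    ib_switch_t *sw = calloc(1, sizeof *sw);
    if (!sw)
        return NULL;
    sw->k   = k;
    sw->in  = calloc((size_t)k, sizeof *sw->in);
    sw->out = calloc((size_t)k, sizeof *sw->out);
    if (!sw->in || !sw->out) {
        free(sw->in);
        free(sw->out);
        free(sw);
        return NULL;
    }
    return sw;
}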
Context 2
... are assigned to packets based mainly on their Service Level (SL), which is set at the HCAs prior to their injection into the network. The number of SLs available in IB-based networks is limited to 16, and the SL of a packet cannot be changed once that packet is injected. Every switch (and HCA) has an SL-to-VL mapping table per output port (see Fig. 1), used to assign a specific VL to packets requesting that output port, based on those packets' SLs and on their arrival port (in the case of switches). For that purpose, each entry of the SL-to-VL table associates an input port and an SL with a VL. Hence, a packet can be stored at different VLs along its route, depending on its SL and on ...
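The per-output-port SL-to-VL table described here is essentially a two-dimensional lookup keyed by input port and SL. A hedged sketch follows; the table dimensions (MAX_PORTS) are an assumption, while the 16-SL limit comes from the IB specification as quoted above.

#include <stdint.h>

#define MAX_PORTS 36   /* assumed switch radix */
#define NUM_SLS   16   /* fixed by the IB specification */

typedef struct {
    /* sl2vl[in_port][sl] = VL assigned at this output port */
    uint8_t sl2vl[MAX_PORTS][NUM_SLS];
} sl2vl_table_t;

/* VL lookup as a switch would perform it for a packet that arrived on
 * in_port with service level sl and requests this output port. */
static inline uint8_t vl_for_packet(const sl2vl_table_t *t,
                                    int in_port, int sl)
{
    return t->sl2vl[in_port][sl];
}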
Context 3
... mentioned above, D3R is able to break the cycles appearing in the CDG while requiring only 2 VLs. Fig. 10 shows the CDG resulting from the application of the VL-mapping of D3R to the Dragonfly topology and traffic scenario shown in Fig. 4. VL0 stores packets routed through channels belonging to groups ordered in a strictly-decreasing monotonic order (i.e., G3, G2, G1, G0), while VL1 stores packets routed through channels belonging to ...
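The excerpt is truncated, but its symmetry suggests that VL1 holds the strictly-increasing case. Under that assumption, the 2-VL assignment for a hop between groups reduces to a single comparison of group identifiers; the sketch below is illustrative, not the paper's code.

/* Hedged sketch of the 2-VL D3R mapping: a channel toward a
 * lower-numbered group uses VL0 (decreasing order), and a channel
 * toward a higher-numbered group is assumed to use VL1. */
static inline int d3r_vl_2vl(int g_cur, int g_next)
{
    return (g_next < g_cur) ? 0 : 1;   /* VL0 : VL1 */
}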
Context 4
... very simple solution to improve the performance of D3R is to separate intra-group traffic from the other two types of traffic by using an additional VL. Specifically, the third VL is used only for intra-group communication, the remaining two VLs being used as usual. Fig. 11 shows the D3R VL-mapping extended in order to use 3 VLs. Note that the main difference with respect to the previous version of D3R is the use of VL2 when a packet's source and destination groups are equal ...
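Under the same assumptions as the previous sketch, the 3-VL extension adds only one check before the 2-VL rule: intra-group packets (equal source and destination groups) are isolated on VL2.

/* Hedged sketch of the 3-VL extension; self-contained, with the
 * 2-VL comparison inlined. */
static inline int d3r_vl_3vl(int g_src, int g_dst, int g_cur, int g_next)
{
    if (g_src == g_dst)
        return 2;                      /* VL2: intra-group traffic only */
    return (g_next < g_cur) ? 0 : 1;   /* VL0/VL1 exactly as before */
}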
Context 5
... shown in Fig. 7, and the particular intra-group routing (1-hop for fully-connected intra-group networks and DOR for Hamming-based ones). Then, the routing tables are loaded onto network switches. Moreover, in order to answer the SL queries received by the SA (Subnet Administration), we have implemented in OpenSM the function path_sl containing the algorithms shown in Figs. 8 and 11. These actions enable the dfly routing engine to compute the SL dynamically. The D3R implementation falls back to the DFSSSP routing engine if it does not detect the structure of a supported ...
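How path_sl might look can be sketched as follows. This is not OpenSM's actual routing-engine API: the signature and the helper functions (topology_is_supported_dragonfly, group_of) are hypothetical stand-ins, and the SL values simply mirror the VL classes sketched above.

/* Hypothetical stubs standing in for real topology discovery. */
static int topology_is_supported_dragonfly(void) { return 1; }
static int group_of(int lid) { return lid / 18; } /* assumed LIDs per group */

/* Hedged sketch of an SL computation for D3R-style layered routing:
 * one SL per VL class, so the SL-to-VL tables can map it (together
 * with the arrival port) to VL0/VL1/VL2. Returns -1 when the caller
 * should fall back to another engine (e.g., DFSSSP). */
static int path_sl(int src_lid, int dst_lid)
{
    if (!topology_is_supported_dragonfly())
        return -1;

    int g_src = group_of(src_lid);
    int g_dst = group_of(dst_lid);

    if (g_src == g_dst)
        return 2;                      /* intra-group traffic */
    return (g_dst < g_src) ? 0 : 1;    /* decreasing / increasing */
}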
Context 6
... have performed simulations and experiments with real IB-based hardware, using a framework which integrates IB control software, IB-based hardware and OMNeT++-based simulators (see Fig. 12). We have extended previously proposed tools [2], [35], ...
Context 7
... the crux of this simulator, there is a network that sends and receives small IB packet fragments of 64 bytes, over links that impose credit-based link-level flow control according to the IB specification. This model, shown in Fig. 13, utilizes a hierarchical design which re-uses the same blocks to build IB switches and HCAs. The basic blocks are: a generator, a sink, an input buffer, a VL arbiter and an output ...
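The credit-based link-level flow control mentioned here can be summarized in a few lines: a sender may emit a 64-byte fragment only while it holds credits, and the receiver returns a credit each time it drains a fragment from its buffer. The sketch below illustrates the mechanism; it is not the simulator's code.

#include <stdbool.h>

#define FRAG_BYTES 64        /* fragment size used by the simulator */

typedef struct {
    int credits;             /* fragments the sender may still emit */
} link_fc_t;

/* Try to send one fragment; consumes a credit on success. */
static bool link_send_fragment(link_fc_t *l)
{
    if (l->credits == 0)
        return false;        /* stalled until a credit comes back */
    l->credits--;
    return true;             /* fragment placed on the wire */
}

/* The receiver freed one buffer slot and returns a credit. */
static void link_return_credit(link_fc_t *l)
{
    l->credits++;
}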
Context 8
... performance metric evaluated in the experiments is the normalized bandwidth against the maximum theoretical efficiency of the network. Fig. 14 shows simulation results for balanced and oversubscribed (i.e., a = p) DFF configurations, and also for DHF configurations. Note that all these network configurations are the ones from Table 1. We have not considered other routing engines available in OpenSM because they do not offer deadlock prevention (e.g., MINHOP) or minimal ...
Context 9
... 14a, 14b and 14c depict performance results for balanced (i.e., a = 2h = 2p) DFF network configurations #1, #2 and #4 of Table 1, respectively. In particular, Fig. 14a (72 endnodes) shows that DFSSSP slightly outperforms D3R-3VL, but DFSSSP uses 8 VLs (3 VLs minimum for deadlock prevention, plus 5 additional VLs for reducing intra-group contention), while D3R-3VL only uses 2 VLs for deadlock prevention plus 1 VL for reducing intra-group ...
Context 10
... VLs increase with the network size, compared to D3R. Regarding MIN-1VL, the simulator detects deadlocks when the traffic load is around 65 percent, so network performance drops to near zero in this case. D3R-2VL does not trigger the deadlock detection, but it suffers performance degradation at high traffic loads due to intra-group contention. Figs. 14b and 14c show network efficiency results for balanced DFF configurations connecting 342 and 2550 endnodes. D3R-2VL is able to prevent deadlocks, but suffers from intra-group contention at high traffic loads. However, D3R-3VL reduces the effects of intra-group contention with just 3 VLs, significantly outperforming DFSSSP and LASH. Again, MIN-1VL obtains the ...
Context 11
... section shows experiment results for D3R, DFSSSP, and LASH performed under real traffic workloads in the CELLIA (Cluster for the Evaluation of Low-Latency Architectures) facility built with IB-based hardware (see Fig. 15). CELLIA allows us to test the correctness of the implementation of a routing engine by comparing simulation results with real-workload executions. Each server node in CELLIA is an HP ProLiant DL120 Gen9 with an 8-core Intel Xeon E5-2630v3 processor at 1.80 GHz and 16 GB of RAM. We installed CentOS 7 with kernel version 3.10. ...
Context 12
... order to validate our implementation of D3R, and the simulation results, we have run a single experiment in CELLIA for each combination of the considered routings (D3R-2VL, D3R-3VL, DFSSSP-8VL, or LASH-2VL), benchmarks (Netgauge, Graph500, HPCG, HPCC, or NAMD) and task mappings (Linear or Random). Fig. 15 shows the performance results of these experiments. For instance, the HPCC tests Ping-Pong and Ordered-Ring generate an adversarial-like traffic pattern (achieving between 1 and 4 GB/s, approximately). In these cases, the D3R behavior is virtually identical to that of both DFSSSP and LASH. Other tests, such as PTRANS or Netgauge-ebb ...
Context 13
... a many-to-many traffic pattern. In these scenarios, D3R performance results are similar to those of DFSSSP and LASH. In general, there are small variations among all the evaluated routing engines, because of the small size of the CELLIA network (i.e., a 38-node Dragonfly with 12 switches). Note that the simulation in small networks (see Fig. 14a) and the experiments in CELLIA show qualitatively similar performance ...

Similar publications

Preprint
System noise can negatively impact the performance of HPC systems, and the interconnection network is one of the main factors contributing to this problem. To mitigate this effect, adaptive routing sends packets on non-minimal paths if they are less congested. However, while this may mitigate interference caused by congestion, it also generates mor...

Citations

... The available routing engines suitable for Dragonflies that follow the second approach (i.e., layered routing) are LASH [8], DFSSSP [17] and D3R [18], the former two actually being topology-agnostic algorithms, while the latter has been specially designed for Dragonflies. Regarding the third approach, the classical, topology-agnostic Up/Down algorithm is also available as a routing engine (UPDN) in OpenSM [19], which can also be used in Dragonflies. ...
... In that sense, in [20] the requirements of some of these routing engines are analyzed in terms of routing time and number of required VLs, but no performance measurements are provided. In [18], both a comparison of the requirements (in terms of number of required VLs) and a performance comparison (based on both simulation experiments and results from a real InfiniBand-based cluster) are provided, but not all the routing engines available for Dragonflies are considered. ...
... Hence, this approach has traditionally been the preferred one for designing routing engines. The routing engines currently available in IB that are suitable for Dragonfly networks and use Layered Routing are DFSSSP [17], LASH [8] and D3R [18]. ...
Preprint
The Dragonfly topology is currently one of the most popular network topologies in high-performance parallel systems. The interconnection networks of many of these systems are built from components based on the InfiniBand specification. However, due to some constraints in this specification, the available versions of the InfiniBand network controller (OpenSM) do not include routing engines based on some popular deadlock-free routing algorithms proposed theoretically for Dragonflies, such as the one proposed by Kim and Dally based on Virtual-Channel shifting. In this paper we propose a straightforward method to integrate this routing algorithm in OpenSM as a routing engine, explaining in detail the configuration required to support it. We also provide experiment results, obtained both from a real InfiniBand-based cluster and from simulation, to validate the new routing engine and to compare its performance and requirements against other routing engines currently available in OpenSM.
... Considerable research efforts have been devoted to investigating a range of efficient network topologies for HPC systems, including Flattened Butterfly [9][10][11], Dragonfly [12][13][14], HyperX [15], Skywalk [16], and SlimFly [17]. These structures are capable of delivering low diameters for HPC systems while also ensuring scalability through the port numbers (radixes) of the construction blocks (routers) [18]. ...
Article
The design of interconnection networks is a fundamental aspect of high-performance computing (HPC) systems. Among the available topologies, the Galaxyfly network stands out as a low-diameter and flexible-radix network for HPC applications. Given the paramount importance of collective communication in HPC performance, in this paper, we present two different all-to-all broadcast algorithms for the Galaxyfly network, which adhere to the supernode-first rule and the router-first rule, respectively. Our performance evaluation validates their effectiveness and shows that the first algorithm has a higher degree of utilization of network channels, and that the second algorithm can significantly reduce the average time for routers to collect packets from the supernode.
... A lot of research has been devoted to exploring a variety of efficient topologies, e.g., Flattened Butterfly [7], Dragonfly [4], [5], HyperX [8], Skywalk [9], and SlimFly [10]. Their scalability is guaranteed through the port number (radix) of the construction blocks (routers) [26]. ...
... A common example is a DF with a flattened butterfly intra-group scheme and a pruned Hamming graph inter-group scheme. Real DFs have diameters ranging from 2 to 5. Routing DFs requires special attention in order to guarantee deadlock freedom [32,57,56]. Furthermore, DFs provide low performance for adversarial inter-group traffic patterns unless either fine-tuned non-minimal adaptive routing techniques [48,44,67] or group-spreading job placement policies (RDR or RRR in [43]) are used. ...
Thesis
Building efficient supercomputers requires optimising communications, and their exaflopic scale causes an unavoidable risk of relatively frequent failures. For a cluster with given networking capabilities and applications, performance is achieved by providing a good route for every message while minimising resource access conflicts between messages. This thesis focuses on the fat-tree family of networks, for which we define several overarching properties so as to efficiently take into account a realistic superset of this topology, while keeping a significant edge over agnostic methods. Additionally, a partially novel static congestion risk evaluation method is used to compare algorithms. A generic optimisation is presented for some applications on clusters with heterogeneous equipment. The proposed algorithms use distinct approaches to improve centralised static routing by combining computation speed, fault-resilience, and minimal congestion risk.
... Many of these architectures feature some form of programmable NICs [47], [177]. Finally, there exist routing protocols for specific low-diameter topologies, for example for SF [234] or DF [153]. However, they usually do not support multipathing or non-minimal routing. ...
Article
The recent line of research into topology design focuses on lowering network diameter. Many low-diameter topologies such as Slim Fly or Jellyfish that substantially reduce cost, power consumption, and latency have been proposed. A key challenge in realizing the benefits of these topologies is routing. On one hand, these networks provide shorter path lengths than established topologies such as Clos or torus, leading to performance improvements. On the other hand, the number of shortest paths between each pair of endpoints is much smaller than in Clos, but there is a large number of non-minimal paths between router pairs. This hampers or even makes it impossible to use established multipath routing schemes such as ECMP. In this article, to facilitate high-performance routing in modern networks, we analyze existing routing protocols and architectures, focusing on how well they exploit the diversity of minimal and non-minimal paths. We first develop a taxonomy of different forms of support for multipathing and overall path diversity. Then, we analyze how existing routing schemes support this diversity. Among others, we consider multipathing with both shortest and non-shortest paths, support for disjoint paths, or enabling adaptivity. To address the ongoing convergence of HPC and “Big Data” domains, we consider routing protocols developed for both HPC systems and for data centers as well as general clusters. Thus, we cover architectures and protocols based on Ethernet, InfiniBand, and other HPC networks such as Myrinet. Our review will foster developing future high-performance multipathing routing protocols in supercomputers and data centers.
... At scale from 378 nodes (two cabinets or a "group") and above, the miniapps share the optical interconnect between the groups with all other processes in the system. Dragonfly network contention was recently analysed in [30]. At 1000 nodes and above the coarray global barrier strategy shows slightly higher efficiency than MPI. ...
Article
Fortran coarrays are an attractive alternative to MPI due to a familiar Fortran syntax, single-sided communications and implementation in the compiler. Scaling of coarrays is compared in this work to MPI, using cellular automata (CA) 3D Ising magnetisation miniapps, built with the CASUP CA library, https://cgpack.sourceforge.io, developed by the authors. Ising energy and magnetisation were calculated with MPI_ALLREDUCE and Fortran 2018 co_sum collectives. The work was done on ARCHER (Cray XC30) up to the full machine capacity: 109,056 cores. Ping-pong latency and bandwidth results are very similar with MPI and with coarrays for message sizes from 1B to several MB. MPI halo exchange (HX) scaled better than coarray HX, which is surprising because both algorithms use pair-wise communications: MPI IRECV/ISEND/WAITALL vs Fortran sync images. Adding OpenMP to MPI or to coarrays resulted in a worse L2 cache hit ratio, and lower performance in all cases, even though the NUMA effects were ruled out. This is likely because the CA algorithm is network bound at scale. This is further evidenced by the fact that very aggressive cache and inter-procedural optimisations lead to no performance gain. The sampling and tracing analysis shows good load balancing in compute in all miniapps, but imbalance in communication, indicating that the difference in performance between MPI and coarrays is likely due to parallel libraries (MPICH2 vs libpgas) and the Cray hardware-specific libraries (uGNI vs DMAPP). Overall, the results look promising for coarray use beyond 100k cores. However, further coarray optimisation is needed to narrow the performance gap between coarrays and MPI.
Article
According to the latest TOP500 list, InfiniBand (IB) is the most widely used network architecture in the top 10 supercomputers. IB relies on Credit-based Flow Control (CBFC) to provide a lossless network and on InfiniBand congestion control (IB CC) to relieve congestion; however, this can lead to the victim-flow problem, since messages are mixed in the same queue, and to long-lived congestion spreading due to slow convergence. To deal with these problems, in this paper we propose FlowStar, a fast-convergence, per-flow-state accurate congestion control for InfiniBand. FlowStar includes two core mechanisms: 1) an optimized per-flow CBFC mechanism provides flow-state control to detect real congestion; and 2) rate adjustment rules make up for the mismatch between the original IB CC rate regulation and the per-hop CBFC to alleviate congestion spreading. FlowStar implements a per-flow congestion state on switches and can obtain in-flight packet information without additional parameter settings to ensure a lossless network. Evaluations show that FlowStar improves average and tail message completion time under different workloads.
Article
Based on the most recent TOP500 rankings, InfiniBand (IB) stands out as the dominant network architecture among the top 10 supercomputers. Yet, it primarily employs deterministic routing, which tends to be suboptimal in network traffic balance. While deterministic routing invariably opts for the same forwarding path, adaptive routing offers flexibility by permitting packets to traverse varied paths for every source-destination pair. Contemporary adaptive routing methods in HPC networks typically determine path selection rooted in the switch queue's occupancy. While the queue length provides a glimpse into local congestion, it is challenging to consolidate such fragmented information to portray the full path accurately. In this paper, we introduce Alarm, an adaptive routing system that uses probabilistic path selection grounded in one-way delay metrics. The one-way delay not only offers a more holistic view of congestion, spanning from source to destination, but also captures the intricacies of network flows. Alarm gleans the one-way delay from each pathway via data packets, eliminating the need for separate delay-detection packets and clock synchronization. The probabilistic selection hinges on weights determined by the one-way delay, ensuring the prevention of bottleneck links during congestion updates. Notably, routing decisions under Alarm are made per-flowlet. Guided by delay cues, the gap between flowlets is dynamically adjusted to match the maximum delay variation across diverse paths, thereby preventing packet reordering. The simulation results show that Alarm can achieve 2.0X and 1.7X better average and p99 FCT slowdown than existing adaptive routing.
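The core selection rule in the abstract above (probabilistic path choice weighted by one-way delay) can be sketched in a few lines. This is not Alarm's code; the path count and the inverse-delay weighting are assumptions used purely for illustration.

#include <stdlib.h>

#define NUM_PATHS 4              /* assumed number of candidate paths */

/* Pick a path index, sampling proportionally to 1/delay so that paths
 * with lower measured one-way delay are chosen more often.  Delays
 * are assumed positive (e.g., nanoseconds). */
static int pick_path(const double delay[NUM_PATHS])
{
    double w[NUM_PATHS], total = 0.0;
    for (int i = 0; i < NUM_PATHS; i++) {
        w[i] = 1.0 / delay[i];   /* lower delay => higher weight */
        total += w[i];
    }
    double r = ((double)rand() / RAND_MAX) * total;
    for (int i = 0; i < NUM_PATHS; i++) {
        if (r < w[i])
            return i;
        r -= w[i];
    }
    return NUM_PATHS - 1;        /* guard against rounding */
}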