Conference Paper

Analyzing available routing engines for InfiniBand-based clusters with Dragonfly topology



... Dragonfly topologies, among others, can be implemented [2] based on the InfiniBand (IB) architecture [3], [4]. According to the last Top500 list [5] IB is nowadays the most widely used network technology in HPC systems. ...
... We have performed simulations and experiments with real IB-based hardware, using a framework which integrates IB control software, IB-based hardware and OMNeT++-based simulators (see Fig. 12). We have extended previously proposed tools [2], [35], [36]. ...
Article
Dragonfly topologies are gathering great interest as one of the most promising interconnect options for High-Performance Computing systems. Dragonflies contain physical cycles that may lead to traffic deadlocks unless the routing algorithm prevents them properly. Previous topology-aware algorithms are difficult to implement, or even unfeasible, in systems based on the InfiniBand (IB) architecture, which is the most widely used network technology in HPC systems. In this paper, we present a new deterministic, minimal-path routing for Dragonfly that prevents deadlocks using VLs according to the IB specification, so that it can be straightforwardly implemented in IB-based networks. We have called this proposal D3R (Deterministic Deadlock-free Dragonfly Routing). D3R is scalable as it requires only 2 VLs to prevent deadlocks regardless of network size, i.e. fewer VLs than those required by the deadlock-free routing engines available in IB that are suitable for Dragonflies. Alternatively, D3R achieves higher throughput if an additional VL is used to reduce internal contention in the Dragonfly groups. We have implemented D3R as a new routing engine in OpenSM, the control software including the subnet manager in IB. We have evaluated D3R by means of simulation and by experiments performed in a real IB-based cluster, with the results showing that, in general, D3R outperforms other routing engines.
... A common example is a DF with a flattened butterfly intra-group scheme and a pruned Hamming graph inter-group scheme. Real DFs have diameters ranging from 2 to 5. Routing DFs requires special attention in order to guarantee deadlock freedom [32,57,56]. Furthermore, DFs provide low performance for adversarial inter-group traffic patterns unless either fine-tuned non-minimal adaptive routing techniques [48,44,67] or group-spreading job placement policies (RDR or RRR in [43]) are used. ...
Thesis
Building efficient supercomputers requires optimising communications, and their exaflopic scale causes an unavoidable risk of relatively frequent failures. For a cluster with given networking capabilities and applications, performance is achieved by providing a good route for every message while minimising resource access conflicts between messages. This thesis focuses on the fat-tree family of networks, for which we define several overarching properties so as to efficiently take into account a realistic superset of this topology, while keeping a significant edge over agnostic methods. Additionally, a partially novel static congestion risk evaluation method is used to compare algorithms. A generic optimisation is presented for some applications on clusters with heterogeneous equipment. The proposed algorithms use distinct approaches to improve centralised static routing by combining computation speed, fault-resilience, and minimal congestion risk.
Article
Point-to-point metrics, such as latency and bandwidth, are often used to characterize network performance with the consequent assumption that optimizing for these metrics is sufficient to improve parallel application performance. However, these metrics can only provide limited insight into application behavior because they do not fully account for effects, such as network congestion, that significantly influence overall network performance. Because many high-performance networks use deterministic oblivious routing, one such effect is the choice of routing algorithm. In this paper, we analyze and compare practical and theoretical aspects of different routing algorithms that are used in today's large-scale networks. We show that widely-used theoretical metrics, such as edge-forwarding index or bisection bandwidth, are not accurate predictors for average network bandwidth. Instead, we introduce an intuitive metric, which we call "effective bisection bandwidth" to characterize quality of different routing algorithms. We present a simple algorithm that globally balances routes and therefore improves the effective bandwidth of the network. Compared to the best algorithm in use today, our new algorithm shows an improvement in effective bisection bandwidth of 40% on a 724-endpoint InfiniBand cluster.
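One of the theoretical metrics named in the abstract above, the edge-forwarding index, can be illustrated with a minimal Python sketch. It assumes the usual graph-theoretic definition (the largest number of routes that share any single directed link under a given routing); the function names and the three-switch example are hypothetical and are not taken from the paper.

```python
from collections import Counter


def edge_loads(routes):
    """Count how many routes traverse each directed link.

    routes maps a (source, destination) pair to the list of nodes
    visited along the route, endpoints included.
    """
    loads = Counter()
    for path in routes.values():
        for edge in zip(path, path[1:]):
            loads[edge] += 1
    return loads


def edge_forwarding_index(routes):
    """Return the highest per-link route count (0 if there are no links)."""
    return max(edge_loads(routes).values(), default=0)


# Hypothetical example: three switches in a line, 0 - 1 - 2,
# with one route per ordered pair of switches.
line_routes = {
    (0, 1): [0, 1],
    (0, 2): [0, 1, 2],
    (1, 0): [1, 0],
    (1, 2): [1, 2],
    (2, 0): [2, 1, 0],
    (2, 1): [2, 1],
}
```

For this example every directed link carries two routes, so `edge_forwarding_index(line_routes)` returns 2. The abstract's point is that such a static count does not by itself predict average delivered bandwidth.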
Article
The growing system size of high performance computers results in a steady decrease of the mean time between failures. Exchanging network components often requires whole system downtime, which increases the cost of failures. In this work, we study a fail-in-place strategy where broken network elements remain untouched. We show that a fail-in-place strategy is feasible for today's networks and that the degradation is manageable, and we provide guidelines for the design. Our network failure simulation tool chain allows system designers to extrapolate the performance degradation based on expected failure rates, and it can be used to evaluate the current state of a system. In a case study of real-world HPC systems, we analyze the performance degradation throughout the systems' lifetime under the assumption that faulty network components are not repaired, which results in a recommendation to change the routing algorithm in use to improve the network performance as well as the fail-in-place characteristic.
Conference Paper
One of the objectives of the decade for High-Performance Computing systems is to reach the exascale level of computing power before 2018, hence this will require strong efforts in their design. In that sense, high-speed low-latency interconnection networks are essential elements for exascale HPC systems. Indeed, the performance of the whole system depends on that of the interconnection network. In order to develop and test new techniques suited to exascale HPC systems, software-based network simulators are commonly used. As developing a network simulator from scratch is a difficult task, several platforms help developers, OMNeT++ being one of the most popular. In this paper, we propose a new generic network simulator, exploiting the features of the OMNeT++ framework. The proposed tool is the first step to model high-performance interconnection networks of exascale HPC systems: the message switching layer, routing and arbitration algorithms and buffer organizations have been modeled according to the current and expected characteristics of these systems. In addition, the tool has been designed so that it is possible to simulate networks of large size. Simulation results, validated against real systems, show the accuracy of the model.
Article
A brief description of the DnUp Infiniband Routing Algorithm.
Conference Paper
Efficient deadlock-free routing strategies are crucial to the performance of large-scale computing systems. There are many methods, but it remains a challenge to achieve lowest latency and highest bandwidth for irregular or unstructured high-performance networks. We investigate a novel routing strategy based on the single-source-shortest-path routing algorithm and extend it to use virtual channels to guarantee deadlock-freedom. We show that this algorithm achieves minimal latency and high bandwidth with only a low number of virtual channels and can be implemented in practice. We demonstrate that the problem of finding the minimal number of virtual channels needed to route a general network deadlock-free is NP-complete and we propose different heuristics to solve the problem. We implement all proposed algorithms in the Open Subnet Manager of InfiniBand and compare the number of needed virtual channels and the bandwidths of multiple real and artificial network topologies which are established in practice. Our approach allows the existing virtual channels to be used more effectively to guarantee deadlock-freedom and increases the effective bandwidth by up to a factor of two. Application benchmarks show an improvement of up to 95%. Our routing scheme is not limited to InfiniBand but can be deployed on existing InfiniBand installations to increase network performance transparently without modifications to the user applications.
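The idea of using virtual channels to guarantee deadlock-freedom, as described in the abstract above, can be sketched in Python. This is a simple greedy illustration, not one of the heuristics proposed in the paper: it assumes the standard channel-dependency-graph criterion (a set of routes is deadlock-free if the graph of "this link is followed by that link" dependencies has no cycle) and places each route on the first virtual layer where that graph stays acyclic. All function names are hypothetical.

```python
def has_cycle(dep):
    """Detect a cycle in a directed graph given as {node: set(successors)}."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}

    def visit(u):
        color[u] = GRAY
        for v in dep.get(u, ()):
            state = color.get(v, WHITE)
            if state == GRAY:
                return True  # back edge: a cycle exists
            if state == WHITE and visit(v):
                return True
        color[u] = BLACK
        return False

    return any(color.get(u, WHITE) == WHITE and visit(u) for u in list(dep))


def path_dependencies(path):
    """Return (channel, next_channel) pairs for consecutive links of a route."""
    channels = list(zip(path, path[1:]))
    return list(zip(channels, channels[1:]))


def assign_layers(paths):
    """Greedily assign each route to the first layer that stays acyclic.

    Returns a list with one layer index per route, in input order.
    """
    layers = []      # one channel dependency graph per virtual layer
    assignment = []
    for path in paths:
        deps = path_dependencies(path)
        for idx, layer in enumerate(layers):
            trial = {c: set(s) for c, s in layer.items()}
            for c1, c2 in deps:
                trial.setdefault(c1, set()).add(c2)
            if not has_cycle(trial):
                layers[idx] = trial
                assignment.append(idx)
                break
        else:
            # No existing layer can take this route: open a new one.
            fresh = {}
            for c1, c2 in deps:
                fresh.setdefault(c1, set()).add(c2)
            layers.append(fresh)
            assignment.append(len(layers) - 1)
    return assignment


# Hypothetical example: two-hop routes around a 4-node unidirectional ring.
ring_paths = [[0, 1, 2], [1, 2, 3], [2, 3, 0], [3, 0, 1]]
```

On the ring example the first three routes share layer 0, while the fourth would close a dependency cycle and is moved to layer 1, so `assign_layers(ring_paths)` returns `[0, 0, 0, 1]`. The sketch assumes each route is a simple path (no repeated links).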
Article
The author presents a new class of universal routing networks, called fat-trees, which might be used to interconnect the processors of a general-purpose parallel supercomputer. A fat-tree routing network is parameterized not only in the number of processors, but also in the amount of simultaneous communication it can support. Since communication can be scaled independently from the number of processors, substantial hardware can be saved for such applications as finite-element analysis without resorting to a special-purpose architecture. It is proved that a fat-tree of a given size is nearly the best routing network of that size. This universality theorem is established using a three-dimensional VLSI model that incorporates wiring as a direct cost. In this model, hardware size is measured as physical volume. It is proved that for any given amount of communications hardware, a fat-tree built from that amount of hardware can simulate every other network built from the same amount of hardware, using only slightly more time (a polylogarithmic factor greater).
Conference Paper
Evolving technology and increasing pin-bandwidth motivate the use of high-radix routers to reduce the diameter, latency, and cost of interconnection networks. High-radix networks, however, require longer cables than their low-radix counterparts. Because cables dominate network cost, the number of cables, and particularly the number of long, global cables should be minimized to realize an efficient network. In this paper, we introduce the dragonfly topology which uses a group of high-radix routers as a virtual router to increase the effective radix of the network. With this organization, each minimally routed packet traverses at most one global channel. By reducing global channels, a dragonfly reduces cost by 20% compared to a flattened butterfly and by 52% compared to a folded Clos network in configurations with ≥ 16K nodes. We also introduce two new variants of global adaptive routing that enable load-balanced routing in the dragonfly. Each router in a dragonfly must make an adaptive routing decision based on the state of a global channel connected to a different router. Because of the indirect nature of this routing decision, conventional adaptive routing algorithms give degraded performance. We introduce the use of selective virtual-channel discrimination and the use of credit round-trip latency to both sense and signal channel congestion. The combination of these two methods gives throughput and latency that approaches that of an ideal adaptive routing algorithm.
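The claim that each minimally routed packet traverses at most one global channel can be checked on a small model. The Python sketch below rests on assumptions that are not stated in the abstract: groups are fully connected internally, each of the `a` routers in a group has `h` global ports, there are `g = a * h + 1` groups with exactly one global link between every pair of groups, and global port `k` of group `i` is wired to group `(i + k + 1) mod g`. The names `build_dragonfly` and `direct_route` are hypothetical.

```python
def build_dragonfly(a, h):
    """Build a small dragonfly; return (g, links).

    links maps frozenset({u, v}) to "local" or "global", where each
    router is identified by a (group, index) tuple.
    """
    g = a * h + 1
    links = {}
    for grp in range(g):
        # Local links: every pair of routers inside a group.
        for r1 in range(a):
            for r2 in range(r1 + 1, a):
                links[frozenset({(grp, r1), (grp, r2)})] = "local"
        # Global links: port k of this group reaches group grp + k + 1,
        # landing on that group's port a*h - 1 - k.
        for k in range(a * h):
            peer = (grp + k + 1) % g
            peer_port = a * h - 1 - k
            u, v = (grp, k // h), (peer, peer_port // h)
            links[frozenset({u, v})] = "global"
    return g, links


def direct_route(src, dst, a, h):
    """Route via the single global link that joins the two groups."""
    g = a * h + 1
    (gs, _), (gd, _) = src, dst
    path = [src]
    if gs != gd:
        k = (gd - gs - 1) % g
        exit_router = (gs, k // h)
        entry_router = (gd, (a * h - 1 - k) // h)
        if path[-1] != exit_router:
            path.append(exit_router)   # local hop to the exit router
        path.append(entry_router)      # the one global hop
    if path[-1] != dst:
        path.append(dst)               # local hop inside the destination group
    return path
```

With `a = 4` and `h = 2` this yields 9 groups, and every route produced by `direct_route` has at most three hops (local, global, local), of which at most one is global. Note that a route with fewer hops but two global links can exist between some router pairs; the sketch follows the convention of always using the direct inter-group link.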
Report on exascale computing
  • A S C A Committee
A. S. C. A. Committee, "Report on exascale computing," in The Opportunities and Challenges of Exascale Computing. Washington, DC, USA: U.S. Department of Energy, Office of Science, 2010.
Infiniband trade association, infiniband architecture specification
  • T Hamada
  • N Nakasato
T. Hamada and N. Nakasato, "Infiniband trade association, infiniband architecture specification, volume 1, release 1.0, http://www.infinibandta.com," in International Conference on Field Programmable Logic and Applications, 2005, 2005, pp. 366-373.
Flattened butterfly: A cost-efficient topology for high-radix networks
  • J Kim
  • W J Dally
  • D Abts
J. Kim, W. J. Dally, and D. Abts, "Flattened butterfly: A cost-efficient topology for high-radix networks," SIGARCH Comput. Archit. News, vol. 35, no. 2, pp. 126-137, Jun. 2007. [Online]. Available: http://doi.acm.org/10.1145/1273440.1250679