Conference Paper

Deadlock-Free Oblivious Routing for Arbitrary Topologies

Authors:

Abstract

Efficient deadlock-free routing strategies are crucial to the performance of large-scale computing systems. Many methods exist, but it remains a challenge to achieve the lowest latency and highest bandwidth for irregular or unstructured high-performance networks. We investigate a novel routing strategy based on the single-source shortest-path routing algorithm and extend it to use virtual channels to guarantee deadlock-freedom. We show that this algorithm achieves minimal latency and high bandwidth with only a low number of virtual channels and can be implemented in practice. We demonstrate that the problem of finding the minimal number of virtual channels needed to route a general network deadlock-free is NP-complete, and we propose different heuristics to solve the problem. We implement all proposed algorithms in the Open Subnet Manager of InfiniBand and compare the number of needed virtual channels and the bandwidths of multiple real and artificial network topologies that are used in practice. Our approach allows the existing virtual channels to be used more effectively to guarantee deadlock-freedom and increases the effective bandwidth by up to a factor of two. Application benchmarks show an improvement of up to 95%. Our routing scheme is not limited to InfiniBand and can be deployed on existing InfiniBand installations to increase network performance transparently, without modifications to the user applications.
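As a rough illustration of the virtual-channel assignment problem described in the abstract (placing routes into virtual layers so that no layer's channel dependency graph contains a cycle), the following Python sketch shows a simple greedy heuristic. It is not the paper's algorithm or implementation; the networkx-based graph representation and the route format are assumptions made here for illustration only.

```python
# Greedy assignment of routes to virtual layers so that each layer's channel
# dependency graph (CDG) stays acyclic; a sketch of the general idea behind
# layered deadlock-free routing, not the paper's algorithm.
import networkx as nx

def path_dependencies(path):
    """A route is a list of directed channels (u, v); consecutive channels
    along the route form the dependencies that must stay acyclic per layer."""
    return list(zip(path, path[1:]))

def assign_virtual_layers(routes, max_layers=8):
    layers = [nx.DiGraph()]            # one CDG per virtual layer
    assignment = {}
    for idx, path in enumerate(routes):
        deps = path_dependencies(path)
        for vl, cdg in enumerate(layers):
            trial = cdg.copy()
            trial.add_edges_from(deps)
            if nx.is_directed_acyclic_graph(trial):
                layers[vl] = trial
                assignment[idx] = vl
                break
        else:                          # no existing layer fits: open a new one
            if len(layers) >= max_layers:
                raise RuntimeError("virtual channels exhausted")
            fresh = nx.DiGraph()
            fresh.add_edges_from(deps)
            layers.append(fresh)
            assignment[idx] = len(layers) - 1
    return assignment

# Four two-hop routes around a 4-switch ring; together their dependencies
# form the cycle AB->BC->CD->DA->AB, so one route lands in a second layer.
ring_routes = [
    [("A", "B"), ("B", "C")],
    [("B", "C"), ("C", "D")],
    [("C", "D"), ("D", "A")],
    [("D", "A"), ("A", "B")],
]
print(assign_virtual_layers(ring_routes))   # e.g. {0: 0, 1: 0, 2: 0, 3: 1}
```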

... Shortest path schemes (table excerpt): they include Min-Hop [158], (DF-)SSSP [105], [72], and Nue [71]; marked as deadlock-free only when combined with NAA. ...
... We now describe the IB landscape. We omit a line of common routing protocols based on shortest paths, as they are not directly related to multipathing, but their implementations in the IB fabric manager natively support NAA; these routings are MinHop [158], SSSP [105], Deadlock-Free SSSP (DFSSSP) [72], and a DFSSSP variant called Nue [71]. ...
... Similar to LASH-TOR, the path diversity offered by SAR was not intended as a multipathing or load-balancing feature [70]. Using NAA with LMC = 1, SAR employs a primary set of shortest paths, calculated with a modified DF-SSSP routing [72], and a secondary set of paths, calculated with the Up*/Down* routing algorithm. Whenever SAR reroutes the network to adapt to the currently running HPC applications, the network traffic must temporarily switch to the fixed secondary paths to avoid potential deadlocks during the deployment of the new primary forwarding rules. ...
Article
Full-text available
The recent line of research into topology design focuses on lowering network diameter. Many low-diameter topologies such as Slim Fly or Jellyfish that substantially reduce cost, power consumption, and latency have been proposed. A key challenge in realizing the benefits of these topologies is routing. On one hand, these networks provide shorter path lengths than established topologies such as Clos or torus, leading to performance improvements. On the other hand, the number of shortest paths between each pair of endpoints is much smaller than in Clos, but there is a large number of non-minimal paths between router pairs. This hampers or even makes it impossible to use established multipath routing schemes such as ECMP. In this article, to facilitate high-performance routing in modern networks, we analyze existing routing protocols and architectures, focusing on how well they exploit the diversity of minimal and non-minimal paths. We first develop a taxonomy of different forms of support for multipathing and overall path diversity. Then, we analyze how existing routing schemes support this diversity. Among others, we consider multipathing with both shortest and non-shortest paths, support for disjoint paths, or enabling adaptivity. To address the ongoing convergence of HPC and “Big Data” domains, we consider routing protocols developed for both HPC systems and for data centers as well as general clusters. Thus, we cover architectures and protocols based on Ethernet, InfiniBand, and other HPC networks such as Myrinet. Our review will foster developing future high-performance multipathing routing protocols in supercomputers and data centers.
... Shortest path schemes (table excerpt): they include Min-Hop [158], (DF-)SSSP [105], [72], and Nue [71]; marked as deadlock-free only when combined with NAA. ...
... We now describe the IB landscape. We omit a line of common routing protocols based on shortest paths, as they are not directly related to multipathing, but their implementations in the IB fabric manager natively support NAA; these routings are MinHop [158], SSSP [105], Deadlock-Free SSSP (DFSSSP) [72], and a DFSSSP variant called Nue [71]. ...
... Similar to LASH-TOR, the path diversity offered by SAR was not intended as a multipathing or load-balancing feature [70]. Using NAA with LMC = 1, SAR employs a primary set of shortest paths, calculated with a modified DF-SSSP routing [72], and a secondary set of paths, calculated with the Up*/Down* routing algorithm. Whenever SAR reroutes the network to adapt to the currently running HPC applications, the network traffic must temporarily switch to the fixed secondary paths to avoid potential deadlocks during the deployment of the new primary forwarding rules. ...
Preprint
Full-text available
The recent line of research into topology design focuses on lowering network diameter. Many low-diameter topologies such as Slim Fly or Jellyfish that substantially reduce cost, power consumption, and latency have been proposed. A key challenge in realizing the benefits of these topologies is routing. On one hand, these networks provide shorter path lengths than established topologies such as Clos or torus, leading to performance improvements. On the other hand, the number of shortest paths between each pair of endpoints is much smaller than in Clos, but there is a large number of non-minimal paths between router pairs. This hampers or even makes it impossible to use established multipath routing schemes such as ECMP. In this work, to facilitate high-performance routing in modern networks, we analyze existing routing protocols and architectures, focusing on how well they exploit the diversity of minimal and non-minimal paths. We first develop a taxonomy of different forms of support for multipathing and overall path diversity. Then, we analyze how existing routing schemes support this diversity. Among others, we consider multipathing with both shortest and non-shortest paths, support for disjoint paths, or enabling adaptivity. To address the ongoing convergence of HPC and "Big Data" domains, we consider routing protocols developed for both traditional HPC systems and supercomputers, and for data centers and general clusters. Thus, we cover architectures and protocols based on Ethernet, InfiniBand, and other HPC networks such as Myrinet. Our review will foster developing future high-performance multipathing routing protocols in supercomputers and data centers.
... In particular, VC-based deadlock-free routing algorithms, either deterministic or adaptive, specially proposed for Dragonflies, are difficult to implement (or even unfeasible) in IB-based systems due to some limitations in the IB specification (see Sections 2.3 and 3). Indeed, the official releases of the IB control software do not provide any deadlock-free routing engine specially tailored to Dragonflies, but generic ones such as LASH [9] or DFSSSP [10] (see Section 2.3). However, these topology-agnostic routing engines do not scale, in terms of network resources, when applied to Dragonflies, as the number of VLs that they require to prevent deadlocks increases with network size. ...
... However, UPDN provides non-minimal routes when applied to Dragonfly topologies. Other topology-agnostic routing algorithms, such as LASH [9] and DFSSSP [10], are able to prevent deadlocks by means of the VLs available in IB, and they are available as routing engines in the official release of OpenSM. These routing engines provide minimal routes, and avoid cyclic dependencies by mapping each different route completely to a single VL (which is usually referred to as layered routing). ...
... We have used the IB simulator contributed by Mellanox Technologies to the OMNeT++ community during 2008 [37]. Since then, the model has been used in various publications to predict the performance of large IB networks [10], [36], [38], [39], [40], [41]. ...
Article
Full-text available
Dragonfly topologies are gathering great interest as one of the most promising interconnect options for High-Performance Computing systems. Dragonflies contain physical cycles that may lead to traffic deadlocks unless the routing algorithm prevents them properly. Previous topology-aware algorithms are difficult to implement, or even unfeasible, in systems based on the InfiniBand (IB) architecture, which is the most widely used network technology in HPC systems. In this paper, we present a new deterministic, minimal-path routing for Dragonfly that prevents deadlocks using VLs according to the IB specification, so that it can be straightforwardly implemented in IB-based networks. We have called this proposal D3R (Deterministic Deadlock-free Dragonfly Routing). D3R is scalable as it requires only 2 VLs to prevent deadlocks regardless of network size, i.e. fewer VLs than required by the deadlock-free routing engines available in IB that are suitable for Dragonflies. Alternatively, D3R achieves higher throughput if an additional VL is used to reduce internal contention in the Dragonfly groups. We have implemented D3R as a new routing engine in OpenSM, the control software including the subnet manager in IB. We have evaluated D3R by means of simulation and by experiments performed in a real IB-based cluster, the results showing that, in general, D3R outperforms other routing engines.
... In this work, we cannot use minimal adaptive routing with escape paths [20] to support deadlock-freedom since routing tables are assumed to be given. Alternatively, multiple Virtual Channels (VCs) are exploited to break cyclic channel dependencies, as used in the LASH [6], [21] and LASH-TOR [7] routings. Both of these conventional routing techniques assign each path between source and destination nodes to Virtual Layers (VLs), each of which is constructed from one of the VCs. ...
... Since a cyclic dependency search has a time complexity of O(|C| + |E|), the minimum time complexity per VL becomes approximately O(|N|³). In the recent improvement on LASH [21], only one cyclic dependency check per VL is needed. This can be done by initially adding all paths to a CDG and checking the cyclic dependencies for all edges. ...
... The original algorithm for the VC assignment in LASH [6] is accelerated by a recent implementation [21]. In this implementation, the time complexity and the memory complexity of the VC assignment algorithm are as follows: ...
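The single per-layer cycle check quoted above (build the channel dependency graph from all paths of a layer, then test it once for cycles in O(|C| + |E|)) can be sketched as follows; the dictionary-based CDG representation is an assumption of this example, not the LASH or ACRO source code.

```python
# One cycle check over a whole layer's channel dependency graph in
# O(|C| + |E|) time, using an iterative depth-first search with the usual
# white/grey/black colouring; a sketch of the per-layer check, nothing more.
def has_cyclic_dependency(cdg):
    """cdg: dict mapping each channel to the set of channels it depends on."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {c: WHITE for c in cdg}
    for start in cdg:
        if color[start] != WHITE:
            continue
        color[start] = GREY
        stack = [(start, iter(cdg[start]))]
        while stack:
            node, it = stack[-1]
            advanced = False
            for nxt in it:
                if color.get(nxt, WHITE) == GREY:
                    return True            # back edge found, so there is a cycle
                if color.get(nxt, WHITE) == WHITE:
                    color[nxt] = GREY
                    stack.append((nxt, iter(cdg.get(nxt, ()))))
                    advanced = True
                    break
            if not advanced:               # all successors explored
                color[node] = BLACK
                stack.pop()
    return False

# A 3-channel cycle: c0 -> c1 -> c2 -> c0
print(has_cyclic_dependency({"c0": {"c1"}, "c1": {"c2"}, "c2": {"c0"}}))  # True
```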
Article
Inter-switch networks for HPC systems and data-centers can be improved by applying random shortcut topologies with a reduced number of hops. With minimal routing in such networks, however, deadlock-freedom is not guaranteed. Multiple Virtual Channels (VCs) are efficiently used to avoid this problem. However, previous works do not provide good trade-offs between the number of required VCs and the time and memory complexities of an algorithm. In this work, a novel and fast algorithm, named ACRO, is proposed to endow arbitrary routing functions with deadlock-freedom while consuming a small number of VCs. A heuristic approach to reduce VCs is achieved with a hash table, which improves the scalability of the algorithm compared with our previous work. Moreover, experimental results show that ACRO can reduce the average number of VCs by up to 63% when compared with a conventional algorithm that has the same time complexity. Furthermore, ACRO reduces the time complexity by a factor of O(|N|·log|N|) when compared with another conventional algorithm that requires almost the same number of VCs.
... Some deadlock-free routing methods exploit multiple Virtual Channels (VCs) to break cyclic channel dependencies. One of them is LASH routing [8], [9], in which each VC belongs to one Virtual Layer (VL). Each source-and-destination pair is assigned to one of the VLs, as shown in Fig. 1a. ...
... Since a cyclic dependency search has a time complexity of O(|C| + |E|), the minimum time complexity per VL becomes approximately O(|N|³). In the recent improvement on LASH [9], only one cyclic dependency check per VL is needed. This can be done by initially adding all paths to a CDG and checking cyclic dependencies for all edges. ...
... The original algorithm for the VC assignment in LASH [8] is accelerated by the recent implementation [9]. In this implementation, the time complexity and the memory complexity of the VC assignment algorithm are as follows. ...
... The available routing engines suitable for Dragonflies that follow the second approach (i.e. layered routing) are LASH [8], DFSSSP [17] and D3R [18], the former two being topology-agnostic algorithms, while the latter has been specially designed for Dragonflies. Regarding the third approach, the classical, topology-agnostic Up/Down algorithm is also available as a routing engine (UPDN) in OpenSM [19], which can also be used in Dragonflies. ...
... Hence, this approach has been traditionally the preferred one to design routing engines. Current routing engines available in IB that are suitable for Dragonfly networks and use Layered Routing are DFSSSP [17], LASH [8] and D3R [18]. ...
Preprint
Full-text available
The Dragonfly topology is currently one of the most popular network topologies in high-performance parallel systems. The interconnection networks of many of these systems are built from components based on the InfiniBand specification. However, due to some constraints in this specification, the available versions of the InfiniBand network controller (OpenSM) do not include routing engines based on some popular deadlock-free routing algorithms proposed theoretically for Dragonflies, such as the one proposed by Kim and Dally based on Virtual-Channel shifting. In this paper we propose a straightforward method to integrate this routing algorithm in OpenSM as a routing engine, explaining in detail the configuration required to support it. We also provide experimental results, obtained both from a real InfiniBand-based cluster and from simulation, to validate the new routing engine and to compare its performance and requirements against other routing engines currently available in OpenSM.
... Some techniques are studied in general approaches, such as approximation of effective CBB based on common usage scenarios [40]. Other techniques are studied under specific use cases (for individual target clusters, with Deimos [19] being one example among many others) either by experimentation with sample applications (such as specific benchmarks like the NAS Parallel Benchmark suite in that same cited work, once again with numerous other examples) or by simulation of synthetic traffic (with hotspot traffic [70] being a common example to study congestion) or communication traces from real application runs [9]. The goal is either to directly compare application run times or to estimate effective network metrics potentially transferable to other use cases. ...
... Greedy algorithms are difficult to distribute or parallelise while also keeping guarantees about global results. Examples include Ftree [97], SSSP [41] (and its slower but deadlock-free counterpart DFSSSP [19]), and Nue [21]. ...
Thesis
Building efficient supercomputers requires optimising communications, and their exaflopic scale causes an unavoidable risk of relatively frequent failures. For a cluster with given networking capabilities and applications, performance is achieved by providing a good route for every message while minimising resource access conflicts between messages. This thesis focuses on the fat-tree family of networks, for which we define several overarching properties so as to efficiently take into account a realistic superset of this topology, while keeping a significant edge over agnostic methods. Additionally, a partially novel static congestion risk evaluation method is used to compare algorithms. A generic optimisation is presented for some applications on clusters with heterogeneous equipment. The proposed algorithms use distinct approaches to improve centralised static routing by combining computation speed, fault-resilience, and minimal congestion risk.
... The SSSP-based algorithms might introduce cyclic dependencies, which might lead to network deadlocks. For that reason, the second part of an algorithm called DFSSSP [25] places all obtained paths into a channel dependency graph and then breaks cycles in the graph by moving paths to virtual graph layers; each layer is assigned to a virtual channel. ...
... Algorithm 4 builds single-source shortest paths from a source to a given set N_dst. An initial edge weight of the routing graph is |N|², to force minimal shortest paths, following T. Hoefler in [25]. Only |N|·(|N|−1) paths are considered, each with w_e^i < |N|²; from this it follows that the shortest-path algorithm with the edge initialization never chooses a detour. ...
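The edge-weight initialization quoted above can be illustrated with a small Dijkstra-based sketch: every edge starts at weight |N|², and the load added afterwards always stays below |N|², so an extra hop can never become cheaper than any accumulated traffic. The +1-per-path load update and the data structures are illustrative assumptions, not the cited implementation.

```python
# Single-source shortest-path routing with edge weights initialized to |N|^2.
# Because the total load later added to any edge stays below |N|^2, a detour
# (one extra hop) always costs more than any traffic imbalance, so all
# returned routes stay minimal while still spreading load.
import heapq

def sssp_parents(adj, src, weights):
    """Dijkstra from src. adj: node -> iterable of neighbours;
    weights: (u, v) -> current edge weight."""
    dist, parent, heap = {src: 0}, {}, [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue
        for v in adj[u]:
            nd = d + weights[(u, v)]
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                parent[v] = u
                heapq.heappush(heap, (nd, v))
    return parent

def route_all(adj):
    nodes = list(adj)
    base = len(nodes) ** 2                      # |N|^2 initial edge weight
    weights = {(u, v): base for u in adj for v in adj[u]}
    paths = {}
    for src in nodes:
        parent = sssp_parents(adj, src, weights)
        for dst in nodes:
            if dst == src or dst not in parent:
                continue
            hop, path = dst, []
            while hop != src:
                path.append((parent[hop], hop))
                hop = parent[hop]
            paths[(src, dst)] = path[::-1]
            for e in paths[(src, dst)]:
                weights[e] += 1                 # added load stays far below |N|^2
    return paths

# Example: a bidirectional 4-node ring; every returned route is minimal.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
print(route_all(ring)[(0, 2)])                  # e.g. [(0, 1), (1, 2)]
```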
Preprint
JSC NICEVT has developed the Angara high-speed interconnect with a 4D torus topology. The Angara interconnect router implements deterministic routing based on bubble flow control, direction order routing (DOR) and direction bits rules. The router chip also supports non-standard First Step / Last Step rules for bypassing failed nodes and links; these steps can violate the DOR rule. In previous work we proposed an algorithm for the generation and analysis of routing tables that guarantees no deadlocks in the Angara interconnect. It is based on a breadth-first search algorithm in a graph and takes practically no account of communication channel load. We had also never evaluated the influence of the routing table generation algorithm on the performance of a real-world Angara-based cluster. In this paper we present a routing graph notation that provides the possibility to build routes in the torus topology of the Angara interconnect. We propose a deadlock-free routing algorithm based on a fast single-source shortest-path algorithm for the deterministic Angara routing with a single virtual channel. We evaluated the considered routing algorithms on a 32-node Desmos cluster system; the proposed algorithm yields a performance improvement of 11.1% for the Alltoall communication pattern and of more than 5% for the FT and IS application kernels.
... We compare SF to other topologies using three different resiliency metrics. To prevent deadlocks in case of link failures, one may utilize Deadlock-Free Single Source Shortest Path (DFSSSP) routing [26] (see Section IV-D for details). ...
... To avoid deadlocks in minimum routing one can also use a generic deadlock-avoidance technique based on automatic VC assignment to break cycles in the channel dependency graph [30]. We tested the DFSSSP scheme implemented in the Open Fabrics Enterprise Edition (OFED) [26] which is available for generic InfiniBand networks. OFED DFSSSP consistently needed three VCs to route all SF networks. ...
Preprint
We introduce a high-performance cost-effective network topology called Slim Fly that approaches the theoretically optimal network diameter. Slim Fly is based on graphs that approximate the solution to the degree-diameter problem. We analyze Slim Fly and compare it to both traditional and state-of-the-art networks. Our analysis shows that Slim Fly has significant advantages over other topologies in latency, bandwidth, resiliency, cost, and power consumption. Finally, we propose deadlock-free routing schemes and physical layouts for large computing centers as well as a detailed cost and power model. Slim Fly enables constructing cost effective and highly resilient datacenter and HPC networks that offer low latency and high bandwidth under different HPC workloads such as stencil or graph computations.
... For other topologies, like Hypercubes and Meshes, it was shown that dimension-ordered routing is credit-loop free [3], and more generic methodologies for credit-loop-free routing were presented in [11]. A totally different approach suggests virtual channels, which are mapped to isolated buffering resources within the switches, to enforce channel ordering [3,15,4]. A methodology named LASH [10] treats buffer pools as different layers and is able to handle arbitrary topologies. ...
... These CLCMs carry a {Switch,Port}-Unique IDentifier (SPUID). The SPUID is randomized from a large range in order to minimize the probability of identifier collisions. If several ports on the switch are deadlock-suspected, the switch sends CLCMs with a different unique SPUID on each of them (unless the SPUID was replaced, as explained later). ...
Conference Paper
Full-text available
The recently emerging Converged Enhanced Ethernet (CEE) data center networks rely on layer-2 flow control in order to support packet-loss-sensitive transport protocols, such as RDMA and FCoE. Although lossless networks were proven to improve end-to-end network performance, without careful design and operation they might suffer from in-network deadlocks caused by cyclic buffer dependencies. These dependencies are called credit loops. Although existing credit loops rarely deadlock, when they do they can block large parts of the network. Naive solutions recover from credit-loop deadlock by draining buffers and dropping packets. Previous works suggested credit-loop avoidance by central routing algorithms, but these assume specific topologies and are slow to react to failures. In this paper we present a distributed algorithm to detect credit-loop deadlocks, assure traffic progress, and recover from them for arbitrary network topologies and routing protocols. The algorithm can be implemented over commodity switch hardware, requires negligible additional control bandwidth, and avoids packet loss after the deadlock occurs. We introduce two flavors of the algorithm and discuss their trade-offs. We define a simple scenario that guarantees a credit-loop deadlock occurs and use it to test and analyze the algorithm. In addition, we provide simulation results over a 3-level fat-tree network. Last, we describe our prototype implementation over a commodity data center switch.
... When many LID addresses are in use, more communication paths have to be computed by the SM and more Subnet Management Packets (SMPs) have to be sent to the switches in order to update their Linear Forwarding Tables (LFTs). In particular, the computation of the communication paths might take several minutes in large networks [28]. Moreover, as each VM, physical node, and switch occupies one LID each, the number of physical nodes and switches in the network limits the number of active VMs, and vice versa. ...
... The computational complexity of the path computation grows polynomially with the size of the subnet, and the path computation time PC_t is in the order of several minutes on large subnets [28]. ...
Conference Paper
Full-text available
To meet the demands of the Exascale era and facilitate Big Data analytics in the cloud while maintaining flexibility, cloud providers will have to offer efficient virtualized High Performance Computing clusters in a pay-as-you-go model. As a consequence, high performance network interconnect solutions, like InfiniBand (IB), will be beneficial. Currently, the only way to provide IB connectivity on Virtual Machines (VMs) is by utilizing direct device assignment. At the same time to be scalable, Single-Root I/O Virtualization (SR-IOV) is used. However, the current SR-IOV model employed by IB adapters is a Shared Port implementation with limited flexibility, as it does not allow transparent virtualization and live-migration of VMs. In this paper, we explore an alternative SR-IOV model for IB, the virtual switch (vSwitch), and propose and analyze two vSwitch implementations with different scalability characteristics. Furthermore, as network reconfiguration time is critical to make live-migration a practical option, we accompany our proposed architecture with a scalable and topology agnostic dynamic reconfiguration method, implemented and tested using OpenSM. Our results show that we are able to significantly reduce the reconfiguration time as route recalculations are no longer needed, and in large IB subnets, for certain scenarios, the number of reconfiguration subnet management packets (SMPs) sent is reduced from several hundred thousand down to a single one.
... The IBA OFS comprises open-source implementations of the SM, the InfiniBand drivers, and the libraries necessary for the communication types that IBA supports, such as RDMA and MPI. The Open Subnet Manager (OpenSM), provided by OFS, is also open source and offers different routing algorithms or routing engines, such as updn [19], ftree [20], DOR [21], LASH [22], torus-2QoS [19], SSSP [23], DFSSSP [24], and minhop [19]. The OpenSM architecture permits us to implement other routing algorithms that are better suited to other topologies. ...
Article
Full-text available
The InfiniBand (IB) interconnection technology is widely used in the networks of modern supercomputers and data centers. Among other advantages, the IB-based network devices allow for building multiple network topologies, and the IB control software (subnet manager) supports several routing engines suitable for the most common topologies. However, the implementation of some novel topologies in IB-based networks may be difficult if suitable routing algorithms are not supported, or if the IB switch or NIC architectures are not directly applicable for that topology. This work describes the implementation of the network topology known as KNS in a real HPC cluster using an IB network. As far as we know, this is the first implementation of this topology in an IB-based system. In more detail, we have implemented the KNS routing algorithm in the OpenSM software distribution of the subnet manager, and we have adapted the available IB-based switches to the particular structure of this topology. We have evaluated the correctness of our implementation through experiments in the real cluster, using well-known benchmarks. The obtained results, which match the expected performance for the KNS topology, show that this topology can be implemented in IB-based clusters as an alternative to other interconnection patterns.
... The IBA OFS comprises open-source implementations for the SM, the InfiniBand drivers, and the libraries necessary for communication types supported by IBA, such as RDMA and MPI. The Open Subnet Manager (OpenSM), provided by OFS, is also open source and it offers different routing algorithms or routing engines, such as updn [9], ftree [10], dor [11], lash [12], torus-2QoS [9], SSSP [13], DFSSSP [14], and minhop [9]. The OpenSM architecture permits us to implement other routing algorithms that are better suited to other topologies. ...
Preprint
Full-text available
InfiniBand networking technology is widely utilized in modern high-performance systems. This work describes the implementation of the hybrid network topology known as KNS in a real HPC cluster using an InfiniBand interconnection network. We have used the cluster CELLIA (Cluster for the Evaluation of Low-Latency Architectures), which consists of 50 compute and storage nodes equipped with InfiniBand network cards, and up to 50 8-port InfiniBand switches, which allow us to build several topologies. We have implemented the KNS routing algorithm in OpenSM, the subnet manager provided by the OpenFabrics Software (OFS). We also evaluate the performance of the KNS topology using well-known benchmarks, such as HPCC, HPCG, Graph500, and Netgauge. The obtained results show that the low-diameter KNS topology is an efficient and cost-effective alternative to interconnect the computing and storage nodes in HPC clusters. As far as we know, no known InfiniBand system has implemented this topology before.
... Restricted routing. The most common solutions for deadlock avoidance are to restrict routing paths and avoid the formation of CBD [14,44,49,41,9,13]. However, routing restrictions not only waste link bandwidth and reduce throughput [22], but are also incompatible with some topologies [11,38] and with routing protocols such as OSPF and BGP [45,46]. ...
Preprint
Recent data center applications rely on lossless networks to achieve high network performance. Lossless networks, however, can suffer from in-network deadlocks induced by hop-by-hop flow control protocols like PFC. Once deadlocks occur, large parts of the network could be blocked. Existing solutions mainly center on a deadlock avoidance strategy; unfortunately, they are not foolproof. Thus, deadlock detection is a necessary last resort. In this paper, we propose DCFIT, a new mechanism performed entirely in the data plane to detect and solve deadlocks for arbitrary network topologies and routing protocols. Unique to DCFIT is the use of deadlock initial triggers, which contribute to efficient deadlock detection and deadlock recurrence prevention. Preliminary results indicate that DCFIT can detect deadlocks quickly with minimal overhead and mitigate the recurrence of the same deadlocks effectively. This work does not raise any ethical issues.
... The available routing engines suitable for Dragonflies that follow the second approach (i.e. layered routing) are LASH [8], DFSSSP [17] and D3R [18], the former two being topology-agnostic algorithms, while the latter has been specially designed for Dragonflies. Regarding the third approach, the classical, topology-agnostic Up/Down algorithm is also available as a routing engine (UPDN) in OpenSM [19], which can also be used in Dragonflies. ...
Article
Full-text available
The Dragonfly topology is currently one of the most popular network topologies in high-performance parallel systems. The interconnection networks of many of these systems are built from components based on the InfiniBand specification. However, due to some constraints in this specification, the available versions of the InfiniBand network controller (OpenSM) do not include routing engines based on some popular deadlock-free routing algorithms proposed theoretically for Dragonflies, such as the one proposed by Kim and Dally based on Virtual-Channel shifting. In this paper we propose a straightforward method to integrate this routing algorithm in OpenSM as a routing engine, explaining in detail the configuration required to support it. We also provide experimental results, obtained both from a real InfiniBand-based cluster and from simulation, to validate the new routing engine and to compare its performance and requirements against other routing engines currently available in OpenSM.
... The routing information used by the SMI communication kernels can be uploaded dynamically at runtime, allowing it to be specialized to the interconnect, and even to the application. We use static routing to determine the optimal paths for routing packets between any pair of FPGAs: before the application starts, the paths between FPGAs are computed using a deadlock-free routing scheme [8], according to the target FPGA interconnection topology. If the interconnection topology changes, or the programs run on a different number of FPGAs, the bitstream does not need to be rebuilt, as the routing scheme merely needs to be recomputed and uploaded to each device. ...
Conference Paper
Full-text available
Distributed memory programming is the established paradigm used in high-performance computing (HPC) systems, requiring explicit communication between nodes and devices. When FPGAs are deployed in distributed settings, communication is typically handled either by going through the host machine, sacrificing performance, or by streaming across fixed device-to-device connections, sacrificing flexibility. We present Streaming Message Interface (SMI), a communication model and API that unifies explicit message passing with a hardware-oriented programming model, facilitating minimal-overhead, flexible, and productive inter-FPGA communication. Instead of bulk transmission, messages are streamed across the network during computation, allowing communication to be seamlessly integrated into pipelined designs. We present a high-level synthesis implementation of SMI targeting a dedicated FPGA interconnect, exposing runtime-configurable routing with support for arbitrary network topologies, and implement a set of distributed memory benchmarks. Using SMI, programmers can implement distributed, scalable HPC programs on reconfigurable hardware, without deviating from best practices for hardware design.
... The routing information used by the SMI communication kernels can be uploaded dynamically at runtime, allowing it to be specialized to the interconnect, and even to the application. We use static routing to determine the optimal paths for routing packets between any pair of FPGAs: before the application starts, the paths between FPGAs are computed using a deadlock-free routing scheme [8], according to the target FPGA interconnection topology. If the interconnection topology changes, or the programs run on a different number of FPGAs, the bitstream does not need to be rebuilt, as the routing scheme merely needs to be recomputed and uploaded to each device. ...
Preprint
Full-text available
Distributed memory programming is the established paradigm used in high-performance computing (HPC) systems, requiring explicit communication between nodes and devices. When FPGAs are deployed in distributed settings, communication is typically handled either by going through the host machine, sacrificing performance, or by streaming across fixed device-to-device connections, sacrificing flexibility. We present Streaming Message Interface (SMI), a communication model and API that unifies explicit message passing with a hardware-oriented programming model, facilitating minimal-overhead, flexible, and productive inter-FPGA communication. Instead of bulk transmission, messages are streamed across the network during computation, allowing communication to be seamlessly integrated into pipelined designs. We present a high-level synthesis implementation of SMI targeting a dedicated FPGA interconnect, exposing runtime-configurable routing with support for arbitrary network topologies, and implement a set of distributed memory benchmarks. Using SMI, programmers can implement distributed, scalable HPC programs on reconfigurable hardware, without deviating from best practices for hardware design.
... Since LaaS guarantees tenant isolation, tenant performance should be independent of the number of other tenants that run on the same network. To demonstrate LaaS tenant isolation, we simulate a large cluster using a well-known InfiniBand flit-level simulator, used for instance by [20]. Fig. 13 presents the relative performance of single and multiple tenants running Stencil scientific-computing applications on a cloud of 1,728 hosts, under either Unconstrained or LaaS, normalized by the performance of a single tenant placed without constraints. ...
Article
Full-text available
The most demanding tenants of shared clouds require complete isolation from their neighbors, in order to guarantee that their application performance is not affected by other tenants. Unfortunately, while shared clouds can offer an option, whereby tenants obtain dedicated servers, they do not offer any network provisioning service, which would shield these tenants from network interference. In this paper, we introduce links as a service (LaaS), a new abstraction for cloud service that provides isolation of network links. Each tenant gets an exclusive set of links forming a virtual fat-tree, and is guaranteed to receive the exact same bandwidth and delay as if it were alone in the shared cloud. Consequently, each tenant can use the forwarding method that best fits its application. Under simple assumptions, using bipartite graph properties and pigeonhole-based analysis, we derive theoretical conditions for enabling the LaaS without capacity over-provisioning in fat-trees. New tenants are only admitted in the network, when they can be allocated hosts and links that maintain these conditions. We also provide new results on the numbers of tenants and hosts that can fit while guaranteeing network isolation. The LaaS is implementable with common network gear, tested to scale to large networks, and provides full tenant isolation at the cost of a limited reduction in the cloud utilization.
... For example, social network analytics, such as PageRank [11] and HITS [54], rely heavily on graph models to represent relationships. Analysis tasks such as bug finding and network routing have used graph-based algorithms to improve their performance and accuracy [15,57]. However, graph processing, especially for large-scale graphs, is computationally expensive. ...
Conference Paper
This paper proposes Gswitch, a pattern-based algorithmic auto-tuning system that dynamically switches between optimization variants with negligible overhead. Its novelty lies in a small set of algorithmic patterns that allow for the configurable assembly of variants of the algorithm. The fast transition of Gswitch is based on a machine learning model trained using 644 real graphs. Moreover, Gswitch provides a simple programming interface that conceals low-level tuning details from the user. We evaluate Gswitch on typical graph algorithms (BFS, CC, PR, SSSP, and BC) using Nvidia Kepler and Pascal GPUs. The results show that Gswitch runs up to 10× faster than the best configuration of the state-of-the-art programmable GPU-based graph processing libraries on 10 representative graphs. Gswitch outperforms Gunrock on 92.4% of the 644 graphs, which is the largest dataset evaluation reported to date.
... Regarding the second group of algorithms, we can further differentiate between those which decouple the creation of paths from the deadlock-free assignment to VCs and those which perform both actions at the same time. DFSSSP [10] and LASH [32] belong to the first group and work in a similar way in terms of breaking cycles: they search for them in the CDG and move individual paths to other virtual layers. As both techniques can suffer from a limited number of available virtual layers, LASH was improved in LASH-TOR [31] by using Up*/Down* routing in the last VC when unresolvable cycles appear. ...
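The Up*/Down* fallback mentioned for LASH-TOR relies on the classic up/down legality rule; a minimal sketch of that rule is given below, assuming a BFS levelling from an arbitrarily chosen root (real implementations also break level ties by switch ID).

```python
# The classic Up*/Down* legality rule: links toward the (BFS-)root are "up",
# links away from it are "down", and a path is deadlock-free if it never
# turns from a down link back onto an up link.  Root choice and BFS levels
# are illustrative assumptions.
from collections import deque

def bfs_levels(adj, root):
    level, queue = {root: 0}, deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in level:
                level[v] = level[u] + 1
                queue.append(v)
    return level

def is_up_down_legal(path, level):
    """path: list of switches; legal iff no 'up' hop follows a 'down' hop.
    Equal levels count as 'down' here; real engines break ties by switch ID."""
    gone_down = False
    for u, v in zip(path, path[1:]):
        going_up = level[v] < level[u]       # hop toward the root
        if going_up and gone_down:
            return False
        if not going_up:
            gone_down = True
    return True

# 4-switch ring rooted at 0: the route 1 -> 2 -> 3 -> 0 goes down and then
# up again, so it is rejected and must instead be routed via the root.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
lvl = bfs_levels(ring, 0)                    # {0: 0, 1: 1, 3: 1, 2: 2}
print(is_up_down_legal([1, 2, 3, 0], lvl))   # False
print(is_up_down_legal([1, 0, 3], lvl))      # True
```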
Conference Paper
Recently, the use of graph-based network topologies has been proposed as an alternative to traditional networks such as tori or fat-trees due to their very good topological characteristics. However, they pose practical implementation challenges such as the lack of deadlock avoidance strategies. Previous proposals either lack flexibility, underutilise network resources or are exceedingly complex. We propose, and prove formally, three generic, low-complexity deadlock avoidance mechanisms that only require local information. Our methods are topology- and routing-independent and their virtual channel count is bounded by the length of the longest path. We evaluate our algorithms through an extensive simulation study to measure the impact on performance using both synthetic and realistic traffic. First we compare against a well-known HPC mechanism for Dragonfly and achieve a similar performance level. Then we move to graph-based networks and show that our mechanisms can greatly outperform traditional, spanning-tree based mechanisms, even if these use a much larger number of virtual channels. Overall, our proposal provides a simple, flexible and high-performance deadlock-avoidance solution.
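One classic way to keep the virtual-channel count bounded by the longest path length, as in the bound stated above, is to advance the virtual channel at every hop so that channel dependencies only ever point "forward". The sketch below illustrates that idea; it is not the mechanism proposed in the paper.

```python
# Hop-indexed VC assignment: every hop of a route uses the VC equal to its
# hop index, so dependencies go only from VC i to VC i+1 and can never close
# a cycle.  The number of VCs needed equals the longest path length.
def hop_indexed_vcs(path):
    """path: list of directed channels (u, v) from source to destination."""
    return [(channel, hop) for hop, channel in enumerate(path)]

route = [("A", "B"), ("B", "C"), ("C", "A")]      # wraps around a 3-cycle
print(hop_indexed_vcs(route))
# [(('A', 'B'), 0), (('B', 'C'), 1), (('C', 'A'), 2)]
```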
... For example, InfiniBand switches select the output VC (denoted Virtual Lane, VL) based on the input and output ports and the packet service level (which does not change along the path). For this reason, most routing mechanisms in InfiniBand (such as LASH [33], SSSP [34] and DF-SSSP [35]) assign a single VC to a complete path from source to destination. These routing protocols typically calculate sets of paths with a reduced number of cyclic dependencies, so that the result of the VL assignment phase fits in a small number of VLs. ...
... Since LaaS guarantees tenant isolation, tenant performance should be independent of the number of other tenants that run on the same network. To demonstrate LaaS tenant isolation, we simulate a large cluster using a well-known InfiniBand flit-level simulator used by [19,23,57]. Fig. 11 presents the relative performance of single and multiple tenants running Stencil scientific-computing applications on a cloud of 1,728 hosts, under either Unconstrained or LaaS, normalized by the performance of a single tenant placed without constraints. ...
Conference Paper
Full-text available
The most demanding tenants of shared clouds require complete isolation from their neighbors, in order to guarantee that their application performance is not affected by other tenants. Unfortunately, while shared clouds can offer an option whereby tenants obtain dedicated servers, they do not offer any network provisioning service, which would shield these tenants from network interference. In this paper, we introduce Links as a Service (LaaS), a new abstraction for cloud service that provides isolation of network links. Each tenant gets an exclusive set of links forming a virtual fat-tree, and is guaranteed to receive the exact same bandwidth and delay as if it were alone in the shared cloud. Consequently, each tenant can use the forwarding method that best fits its application. Under simple assumptions, we derive theoretical conditions for enabling LaaS without capacity over-provisioning in fat-trees. New tenants are only admitted in the network when they can be allocated hosts and links that maintain these conditions. LaaS is implementable with common network gear, tested to scale to large networks and provides full tenant isolation at the worst cost of a 10% reduction in the cloud utilization.
... Relocated paths can create new cycles in the channel dependency graph of the next layer. Hence, all layers have to be processed in the same manner until each channel dependency graph is acyclic [11]. In summary, DFSSSP is a topology-agnostic routing engine which provides minimal routing paths and deadlock freedom by using special policies for mapping traffic flows to the VLs, so that some of these VLs are used as escape paths [12] for deadlock freedom. ...
Article
According to the latest TOP500 list, InfiniBand (IB) is the most widely used network architecture in the top 10 supercomputers. IB relies on Credit-based Flow Control (CBFC) to provide a lossless network and on InfiniBand congestion control (IB CC) to relieve congestion; however, this can lead to the victim-flow problem, since messages are mixed in the same queue, and to long-lived congestion spreading due to slow convergence. To deal with these problems, in this paper we propose FlowStar, a fast-convergence, per-flow-state-accurate congestion control for InfiniBand. FlowStar includes two core mechanisms: 1) an optimized per-flow CBFC mechanism provides flow-state control to detect real congestion; and 2) rate adjustment rules make up for the mismatch between the original IB CC rate regulation and the per-hop CBFC to alleviate congestion spreading. FlowStar implements a per-flow congestion state on switches and can obtain in-flight packet information without additional parameter settings to ensure a lossless network. Evaluations show that FlowStar improves average and tail message completion time under different workloads.
Article
Based on the most recent TOP500 rankings, InfiniBand (IB) stands out as the dominant network architecture among the top 10 supercomputers. Yet, it primarily employs deterministic routing, which tends to be suboptimal in balancing network traffic. While deterministic routing invariably opts for the same forwarding path, adaptive routing offers flexibility by permitting packets to traverse varied paths for every source-destination pair. Contemporary adaptive routing methods in HPC networks typically base path selection on the occupancy of switch queues. While the queue length provides a glimpse into local congestion, it is challenging to consolidate such fragmented information to portray the full path accurately. In this paper, we introduce Alarm, an adaptive routing system that uses probabilistic path selection grounded in one-way delay metrics. The one-way delay not only offers a more holistic view of congestion, spanning from source to destination, but also captures the intricacies of network flows. Alarm gleans the one-way delay of each path from data packets, eliminating the need for separate delay-detection packets and clock synchronization. The probabilistic selection hinges on weights determined by the one-way delay, ensuring that bottleneck links are avoided as congestion information is updated. Notably, routing decisions under Alarm are made per-flowlet. Guided by delay cues, the gap between flowlets is dynamically adjusted to match the maximum delay variation across diverse paths, thereby preventing packet reordering. The simulation results show that Alarm can achieve 2.0X and 1.7X better average and p99 FCT slowdown than existing adaptive routing.
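The delay-weighted probabilistic path selection described above might look roughly like the following sketch; the inverse-delay weighting and the flowlet-gap rule are illustrative assumptions, not Alarm's exact formulas.

```python
# Probabilistic per-flowlet path selection weighted by measured one-way
# delays: lower delay means a higher selection probability, and the gap
# between flowlets must exceed the worst delay difference so that switching
# paths cannot reorder packets.  Both rules are illustrative assumptions.
import random

def pick_path(delays_us):
    """delays_us: one-way delay per candidate path (microseconds)."""
    weights = [1.0 / d for d in delays_us]
    return random.choices(range(len(delays_us)), weights=weights, k=1)[0]

def min_flowlet_gap(delays_us):
    """Gap needed between flowlets so a path switch cannot reorder packets."""
    return max(delays_us) - min(delays_us)

delays = [120.0, 80.0, 300.0]          # per candidate path
print(pick_path(delays))               # path 1 is chosen most often
print(min_flowlet_gap(delays), "us minimum inter-flowlet gap")
```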
Conference Paper
The de-facto standard topology for modern HPC systems and data centers is the Folded Clos network, commonly known as the Fat-Tree. The number of network endpoints in these systems is steadily increasing. The switch radix increase is not keeping up, forcing an increased path length in these multi-level trees that will limit gains for latency-sensitive applications. Additionally, today's Fat-Trees force the extensive use of active optical cables, which carries a prohibitive cost structure at scale. To tackle these issues, researchers proposed various low-diameter topologies, such as Dragonfly. Another novel, but only theoretically studied, option is the HyperX. We built the world's first 3 Pflop/s supercomputer with two separate networks, a 3-level Fat-Tree and a 12×8 HyperX. This dual-plane system allows us to perform a side-by-side comparison using a broad set of benchmarks. We show that the HyperX, together with our novel communication-pattern-aware routing, can challenge the performance of, or even outperform, traditional Fat-Trees.
Conference Paper
Many applications in distributed systems rely on underlying lossless networks to achieve the required performance. Existing lossless network solutions propose different hop-by-hop flow controls to guarantee zero packet loss. However, another crucial problem called network deadlock occurs concomitantly. Once the system is trapped in a deadlock, a large part of the network is disabled. Existing deadlock avoidance solutions focus all their attention on breaking the cyclic buffer dependency to eliminate circular wait (one necessary condition of deadlock). These solutions, however, impose many restrictions on network configurations and side effects on performance. In this work, we explore a brand-new perspective to solve network deadlock: avoiding the hold-and-wait situation (another necessary condition). Experimental observations tell us that frequent pausing of upstream ports driven by existing flow control schemes is the root cause of hold and wait. We propose Gentle Flow Control (GFC) to manipulate the port rate at a fine granularity, so all ports can keep packets flowing even when cyclic buffer dependencies exist, and prove that GFC can eliminate deadlock theoretically. We also present how to implement GFC in mainstream lossless networks (Converged Enhanced Ethernet and InfiniBand) with moderate modifications. Furthermore, testbed experiments and packet-level simulations validate that GFC can efficiently avoid deadlock while introducing less than 0.5% of bandwidth occupation.
Article
Remote direct memory access over converged Ethernet deployments is vulnerable to deadlocks induced by priority flow control. Prior solutions for deadlock prevention either require significant changes to routing protocols or require excessive buffers in the switches. In this paper, we propose Tagger, a scheme for deadlock prevention. It does not require any changes to the routing protocol and needs only modest buffers. Tagger is based on the insight that given a set of expected lossless routes, a simple tagging scheme can be developed to ensure that no deadlock will occur under any failure conditions. Packets that do not travel on these lossless routes may be dropped under extreme conditions. We design such a scheme, prove that it prevents deadlock, and implement it efficiently on commodity hardware.
Conference Paper
Remote Direct Memory Access over Converged Ethernet (RoCE) deployments are vulnerable to deadlocks induced by Priority Flow Control (PFC). Prior solutions for deadlock prevention either require significant changes to routing protocols, or require excessive buffers in the switches. In this paper, we propose Tagger, a scheme for deadlock prevention. It does not require any changes to the routing protocol, and needs only modest buffers. Tagger is based on the insight that given a set of expected lossless routes, a simple tagging scheme can be developed to ensure that no deadlock will occur under any failure conditions. Packets that do not travel on these lossless routes may be dropped under extreme conditions. We design such a scheme, prove that it prevents deadlock and implement it efficiently on commodity hardware.
Article
The interconnection network architecture is crucial for High-Performance Computing (HPC) clusters, since it must meet the increasing computing demands of applications. Current trends in the design of these networks are based on increasing link speed, while reducing latency and the number of components in order to lower the cost. The InfiniBand Architecture (IBA) is an example of a powerful interconnect technology, delivering huge amounts of information in a few microseconds. The IBA-based hardware is able to deliver EDR and HDR speed (i.e. 100 and 200 Gb/s, respectively). Unfortunately, congestion situations and their derived problems (i.e. Head-of-Line blocking and buffer hogging) are a serious threat for the performance of both the interconnection network and the entire HPC cluster. In this paper, we propose a new approach to provide IBA-based networks with techniques for reducing the congestion problems. We propose Flow2SL-ITh, a technique that combines a static queuing scheme (SQS) with the closed-loop congestion control mechanism included in IBA-based hardware (a.k.a. injection throttling, ITh). Flow2SL-ITh separates traffic flows by storing them in different virtual lanes (VLs), in order to reduce HoL blocking, while the injection rate of congested flows is throttled. While congested traffic vanishes, there is no buffer sharing among traffic flows stored in different VLs, which reduces the negative effects of congestion. We have implemented Flow2SL-ITh in OpenSM, the open-source implementation of the IBA subnet manager (SM). Experimental results obtained by running simulations and real workloads in a small IBA cluster show that Flow2SL-ITh outperforms existing techniques by up to 44% under some traffic scenarios.
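A static queuing scheme in the spirit of the one described above can be sketched in a few lines: flows are spread over the available Service Levels (and hence VLs) by a fixed rule so that flows heading to different hot spots do not share a queue. The destination-LID-modulo rule used here is an illustrative assumption, not Flow2SL-ITh's actual mapping.

```python
# Minimal static queuing scheme: map each flow to a Service Level by a fixed
# function of its destination LID, so that traffic to different destinations
# tends to land in different VLs and HoL blocking between them is reduced.
def flow_to_sl(dst_lid, num_sls=8):
    """Deterministic mapping: destination LID modulo the number of SLs."""
    return dst_lid % num_sls

flows = [(1, 25), (2, 25), (3, 13), (4, 30)]      # (src LID, dst LID) pairs
for src, dst in flows:
    print(f"LID {src} -> LID {dst}: SL {flow_to_sl(dst)}")
```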
Article
This tutorial presents the details of the interconnection network utilized in many High Performance Computing (HPC) systems today. “InfiniBand” is the hardware interconnect utilized by over 35% of the top 500 supercomputers in the world as of June, 2017. “Verbs” is the term used for both the semantic description of the interface in the InfiniBand Architecture Specifications, and the name used for the functions defined in the widely used OpenFabrics Alliance (OFA) implementation of the software interface to InfiniBand. “Message Passing Interface” (MPI) is the primary software library by which HPC applications portably pass messages between processes across a wide range of interconnects including InfiniBand. Our goal is to explain how these three components are designed and how they interact to provide a powerful, efficient interconnect for HPC applications. We provide a succinct look into the inner technical workings of each component that should be instructive to both novices to HPC applications as well as to those who may be familiar with one component, but not necessarily the others, in the design and functioning of the total interconnect. A supercomputer interconnect is not a monolithic structure, and this tutorial aims to give non-experts a “big-picture” overview of its substructure with an appreciation of how and why features in one component influence those in others. We believe this is one of the first tutorials to discuss these three major components as one integrated whole. In addition, we give detailed examples of practical experience and typical algorithms used within each component in order to give insights into what issues and trade-offs are important.
Conference Paper
Lossless interconnection networks are omnipresent in high-performance computing systems, data centers and network-on-chip architectures. Such networks require efficient and deadlock-free routing functions to utilize the available hardware. Topology-aware routing functions become increasingly inapplicable due to irregular topologies, which are irregular either by design or as a result of hardware failures. Existing topology-agnostic routing methods either suffer from poor load balancing or are not bounded in the number of virtual channels needed to resolve deadlocks in the routing tables. We propose a novel topology-agnostic routing approach which implicitly avoids deadlocks during the path calculation instead of solving both problems separately. We present a model implementation, called Nue, of a destination-based and oblivious routing function. Nue routing heuristically optimizes the load balancing while enforcing deadlock-freedom without exceeding a given number of virtual channels, which we demonstrate based on the InfiniBand architecture.
Article
Multi-tenancy promises high utilization of available system resources and helps maintain cost-effective operations for service providers. However, multi-tenant high-performance computing (HPC) infrastructures, like dynamic HPC clouds, bring unique challenges, both associated with providing performance isolation to the tenants and with achieving efficient load-balancing across the network fabric. Each tenant should experience predictable network performance, unaffected by the workload of other tenants. At the same time, it is equally important that the network links are balanced, avoiding network saturation. Network saturation can lead to unpredictable application performance and a potential loss of profit for the cloud service providers.
Article
Full-text available
A problem of increasing importance in the design of large multiprogramming systems is the so-called deadlock or deadly-embrace problem. In this article we survey the work that has been done on the treatment of deadlocks, from both the theoretical and practical points of view.
Article
Full-text available
Point-to-point metrics, such as latency and bandwidth, are often used to characterize network performance, with the consequent assumption that optimizing for these metrics is sufficient to improve parallel application performance. However, these metrics can only provide limited insight into application behavior because they do not fully account for effects, such as network congestion, that significantly influence overall network performance. Because many high-performance networks use deterministic oblivious routing, one such effect is the choice of routing algorithm. In this paper, we analyze and compare practical and theoretical aspects of different routing algorithms that are used in today's large-scale networks. We show that widely used theoretical metrics, such as the edge-forwarding index or bisection bandwidth, are not accurate predictors of average network bandwidth. Instead, we introduce an intuitive metric, which we call "effective bisection bandwidth", to characterize the quality of different routing algorithms. We present a simple algorithm that globally balances routes and therefore improves the effective bandwidth of the network. Compared to the best algorithm in use today, our new algorithm shows an improvement in effective bisection bandwidth of 40% on a 724-endpoint InfiniBand cluster.
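The "effective bisection bandwidth" metric can be approximated with a simple Monte-Carlo procedure consistent with the description above: sample random bisection traffic patterns, route them over the precomputed static paths, and record the bandwidth share each flow would get on the most loaded link. The exact definition used in the paper may differ; this is only an illustrative sketch.

```python
# Monte-Carlo estimate of effective bisection bandwidth for a fixed routing:
# repeatedly pair up endpoints across a random bisection, count how many of
# the routed flows cross each link, and take the fair share on the most
# loaded link as the bandwidth each flow achieves in that pattern.
import random
from collections import Counter

def effective_bb(endpoints, routes, samples=1000, rng=random.Random(0)):
    """routes: dict (src, dst) -> list of links used by that static route;
    assumed to contain an entry for every ordered endpoint pair."""
    total = 0.0
    for _ in range(samples):
        shuffled = endpoints[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        pairs = zip(shuffled[:half], shuffled[half:])
        load = Counter()
        for s, d in pairs:
            for link in routes[(s, d)]:
                load[link] += 1
        total += 1.0 / max(load.values())     # share of line rate per flow
    return total / samples                    # 1.0 means full bisection bandwidth
```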
Article
Full-text available
The increasing gap between processor performance and memory access time warrants the re-examination of data movement in iterative linear solver algorithms. For this reason, we explore and establish the feasibility of modifying a standard iterative linear solver algorithm in a manner that reduces the movement of data through memory. In particular, we present an alternative to the restarted GMRES algorithm for solving a single right-hand side linear system Ax = b based on solving the block linear system AX = B. Algorithm performance, i.e. time to solution, is improved by using the matrix A in operations on groups of vectors. Experimental results demonstrate the importance of implementation choices on data movement as well as the effectiveness of the new method on a variety of problems from different application areas.
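The data-movement argument can be illustrated with a toy comparison: streaming the matrix through memory once while applying it to a block of vectors, versus once per vector. This is only an illustrative sketch of the idea, not the authors' solver; the variable names are made up for the example.

```python
# Illustrative only: apply a matrix to p vectors one at a time versus as a block.
# Both compute the same products; the block form reads A from memory once per
# block instead of once per vector, which is the data-movement saving the
# abstract alludes to (the actual block-GMRES algorithm is more involved).
import numpy as np

rng = np.random.default_rng(0)
n, p = 2000, 8
A = rng.standard_normal((n, n))
X = rng.standard_normal((n, p))

Y_loop = np.column_stack([A @ X[:, j] for j in range(p)])  # p passes over A
Y_block = A @ X                                            # one pass over A
assert np.allclose(Y_loop, Y_block)
```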
Article
Full-text available
The PERCS system was designed by IBM in response to a DARPA challenge that called for a high-productivity high-performance computing system. A major innovation in the PERCS design is the network that is built using Hub chips that are integrated into the compute nodes. Each Hub chip is about 580 mm² in size, has over 3700 signal I/Os, and is packaged in a module that also contains LGA-attached optical electronic devices. The Hub module implements five types of high-bandwidth interconnects with multiple links that are fully connected with a high-performance internal crossbar switch. These links provide over 9 Tbits/second of raw bandwidth and are used to construct a two-level direct-connect topology spanning up to tens of thousands of POWER7 chips with high bisection bandwidth and low latency. The Blue Waters System, which is being constructed at NCSA, is an exemplar large-scale PERCS installation. Blue Waters is expected to deliver sustained Petascale performance over a wide range of applications. The Hub chip supports several high-performance computing protocols (e.g., MPI, RDMA, IP) and also provides a non-coherent system-wide global address space. Collective communication operations such as barriers, reductions, and multicast are supported directly in hardware. Multiple routing modes, including deterministic as well as hardware-directed random routing, are also supported. Finally, the Hub module is capable of operating in the presence of many types of hardware faults and gracefully degrades performance in the presence of lane failures.
Conference Paper
Full-text available
The performance of sparse iterative solvers is typically limited by sparse matrix-vector multiplication, which is itself limited by memory system and network performance. As the gap between computation and communication speed continues to widen, these traditional sparse methods will suffer. In this paper we focus on an alternative building block for sparse iterative solvers, the "matrix powers kernel" [x, Ax, A^2x, ..., A^kx], and show that by organizing computations around this kernel, we can achieve near-minimal communication costs. We consider communication very broadly as both network communication in parallel code and memory hierarchy access in sequential code. In particular, we introduce a parallel algorithm for which the number of messages (total latency cost) is independent of the power k, and a sequential algorithm that reduces both the number and volume of accesses, so that it is independent of k in both latency and bandwidth costs. This is part of a larger project to develop "communication-avoiding Krylov subspace methods," which also addresses the numerical issues associated with these methods. Our algorithms work for general sparse matrices that "partition well". We introduce parallel performance models of matrices arising from 2D and 3D problems and show predicted speedups over a conventional algorithm of up to 7× on a petaflop-scale machine and up to 22× on computation across the grid. Analogous sequential performance models of the same problems predict speedups over a conventional algorithm of up to 10× on an out-of-core implementation, and up to 2.5× when we use our ideas to reduce off-chip latency and bandwidth to DRAM. Finally, we validate the model on an out-of-core sequential implementation and measured a speedup of over 3×, which is close to the predicted speedup.
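As a point of reference, the straightforward (non-communication-avoiding) form of the matrix powers kernel simply builds the basis by repeated multiplication. The sketch below shows that baseline under assumed names, without the blocking and ghost-zone machinery that gives the cited algorithm its communication savings.

```python
# Baseline matrix powers kernel [x, Ax, A^2 x, ..., A^k x] by repeated SpMV.
# This naive version communicates (or re-reads A) once per power; the cited
# work reorganizes the computation so latency cost is independent of k.
import numpy as np
import scipy.sparse as sp

def matrix_powers(A, x, k):
    V = np.empty((A.shape[0], k + 1))
    V[:, 0] = x
    for j in range(1, k + 1):
        V[:, j] = A @ V[:, j - 1]   # one sparse matrix-vector product per power
    return V

if __name__ == "__main__":
    n, k = 1000, 4
    A = sp.random(n, n, density=0.01, format="csr", random_state=0) + sp.eye(n)
    x = np.ones(n)
    print(matrix_powers(A, x, k).shape)   # (1000, 5)
```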
Conference Paper
Full-text available
Fat-tree based system area networks have been widely adopted in high performance computing clusters. In such systems, the routing is often deterministic and the traffic demand is usually uncertain and changing. In this paper, we study routing performance on fat-tree based system area networks with deterministic routing under the assumption that the traffic demand is uncertain. The performance of a routing algorithm under uncertain traffic demands is characterized by the oblivious performance ratio that bounds the relative performance of the routing algorithm and the optimal routing algorithm for any given traffic demand. We consider both single path routing where the traffic between each source-destination pair follows one path, and multi-path routing where multiple paths can be used for the traffic between a source-destination pair. We derive lower bounds of the oblivious performance ratio of any single path routing scheme for fat-tree topologies and develop single path oblivious routing schemes that achieve the optimal oblivious performance ratio for commonly used fat-tree topologies. These oblivious routing schemes provide the best performance guarantees among all single path routing algorithms under uncertain traffic demands. For multi-path routing, we show that it is possible to obtain a scheme that is optimal for any traffic demand (an oblivious performance ratio of 1) on the fat-tree topology. These results quantitatively demonstrate that single path routing cannot guarantee high routing performance while multi-path routing is very effective in balancing network loads on the fat-tree topology.
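For readers unfamiliar with the metric, the oblivious performance ratio used here can be written (in our notation, which may differ from the authors') as the worst case, over all traffic demands, of the routing's congestion relative to the best achievable congestion for that demand:

```latex
% Oblivious performance ratio of a routing f (notation ours):
% CONG(f, D) is the maximum link load when demand matrix D is routed by f,
% and OPT(D) is the minimum achievable congestion for D.
\mathrm{ratio}(f) \;=\; \max_{D \neq 0} \; \frac{\mathrm{CONG}(f, D)}{\mathrm{OPT}(D)}
```

A ratio of 1, as claimed for multi-path routing on fat trees, means the routing is simultaneously optimal for every demand.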
Conference Paper
Full-text available
In a (randomized) oblivious routing scheme the path chosen for a request between a source s and a target t is independent of the current traffic in the network. Hence, such a scheme consists of probability distributions over s-t paths for every source-target pair s,t in the network. In a recent result [11] it was shown that for any undirected network there is an oblivious routing scheme that achieves a polylogarithmic competitive ratio with respect to congestion. Subsequently, Azar et al. [4] gave a polynomial time algorithm that for a given network constructs the best oblivious routing scheme, i.e. the scheme that guarantees the best possible competitive ratio. Unfortunately, the latter result is based on the Ellipsoid algorithm; hence it is impractical for large networks. In this paper we present a combinatorial algorithm for constructing an oblivious routing scheme that guarantees a competitive ratio of O(log^4 n) for undirected networks. Furthermore, our approach yields a proof for the existence of an oblivious routing scheme with competitive ratio O(log^3 n), which is much simpler than the original proof from [11].
Conference Paper
Full-text available
Multistage interconnection networks based on central switches are ubiquitous in high-performance computing. Applications and communication libraries typically make use of such networks without consideration of the actual internal characteristics of the switch. However, application performance of these networks, particularly with respect to bisection bandwidth, does depend on communication paths through the switch. In this paper we discuss the limitations of the hardware definition of bisection bandwidth (capacity-based) and introduce a new metric: effective bisection bandwidth. We assess the effective bisection bandwidth of several large-scale production clusters by simulating artificial communication patterns on them. Networks with full bisection bandwidth typically provided effective bisection bandwidth in the range of 55-60%. Simulations with application-based patterns showed that the difference between effective and rated bisection bandwidth could impact overall application performance by up to 12%.
Conference Paper
Full-text available
This paper introduces Netgauge, an extensible open-source framework for implementing network benchmarks. The structure of Netgauge abstracts and explicitly separates communication patterns from communication modules. As a result of this separation of concerns, new benchmark types and new network protocols can be added independently to Netgauge. We describe the rich set of pre-defined communication patterns and communication modules that are available in the current distribution. Benchmark results demonstrate the applicability of the current Netgauge distribution to different networks. An assortment of use-cases is used to investigate the implementation quality of selected protocols and protocol layers.
Conference Paper
Full-text available
The past few years have seen a rise in popularity of massively parallel architectures that use fat-trees as their interconnection networks. In this paper we study the communication performance of a parametric family of fat-trees, the k-ary n-trees, built with constant arity switches interconnected in a regular topology. Through simulation on a 4-ary 4-tree with 256 nodes, we analyze some variants of an adaptive algorithm that utilize wormhole routing with one, two and four virtual channels. The experimental results show that the uniform, bit reversal and transpose traffic patterns are very sensitive to the flow control strategy. In all these cases, the saturation points are between 35-40% of the network capacity with one virtual channel, 55-60% with two virtual channels and around 75% with four virtual channels. The complement traffic, a representative of the class of the congestion-free communication patterns, reaches an optimal performance, with a saturation point at 97% of the capacity for all flow control strategies
Article
Full-text available
We study oblivious routing in fat-tree-based system area networks with deterministic routing under the assumption that the traffic demand is uncertain. The performance of a routing algorithm under uncertain traffic demands is characterized by the oblivious performance ratio that bounds the relative performance of the routing algorithm with respect to the optimal algorithm for any given traffic demand. We consider both single-path routing, where only one path is used to carry the traffic between each source-destination pair, and multipath routing, where multiple paths are allowed. For single-path routing, we derive lower bounds of the oblivious performance ratio for different fat-trees and develop routing schemes that achieve the optimal oblivious performance ratios for commonly used topologies. Our evaluation results indicate that the proposed oblivious routing schemes not only provide the optimal worst-case performance guarantees but also outperform existing schemes in average cases. For multipath routing, we show that it is possible to obtain an optimal scheme for all traffic demands (an oblivious performance ratio of 1). These results quantitatively demonstrate the performance difference between single-path routing and multipath routing in fat-trees.
Conference Paper
Full-text available
Cluster networks are seen as the future access networks for multimedia streaming, e-commerce, network storage, etc. For these applications, performance and high availability are particularly crucial. Regular topologies are preferred when performance is the primary concern. However, due to spatial constraints or fault-related issues, the network structure may become irregular, which makes it more difficult to find deadlock-free minimal paths. In recent years, several solutions have been proposed. One of them is the LASH routing, which enables minimal routing by assigning paths to different virtual layers. In this paper, we propose an extension of LASH that reduces the number of required virtual layers by allowing transitions between virtual layers. Evaluation results show that the new routing scheme (LASH-TOR) is able to obtain full minimal routing with a reduced number of virtual channels. For torus and mesh networks, with only two virtual channels, LASH throughput is increased by an average improvement factor of 3.30 for large networks. For regular networks with some unconnected (faulty) links, equal performance improvements are achieved. Even for highly irregular networks of up to 128 switches, the new routing scheme only needs three virtual channels to guarantee minimal routing. In addition, LASH-TOR performs well compared to dimension-order routing for mesh and torus networks.
Article
Full-text available
Freedom from deadlock is a key issue in cut-through, wormhole, and store and forward networks, and such freedom is usually obtained through careful design of the routing algorithm. Most existing deadlock-free routing methods for irregular topologies do, however, impose severe limitations on the available routing paths. We present a method called layered routing, which gives rise to a series of routing algorithms, some of which perform considerably better than previous ones. Our method groups virtual channels into network layers and to each layer it assigns a limited set of source/destination address pairs. This separation of traffic yields a significant increase in routing efficiency. We show how the method can be used to improve the performance of irregular networks, both through load balancing and by guaranteeing shortest-path routing. The method is simple to implement, and its application does not require any features in the switches other than the existence of a modest number of virtual channels. The performance of the approach is evaluated through extensive experiments within three classes of technologies. These experiments reveal a need for virtual channels as well as an improvement in throughput for each technology class.
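The core mechanics of such layered schemes can be sketched as a greedy assignment: compute shortest paths, then place each path into the first virtual layer whose channel dependency graph stays acyclic, opening a new layer when none fits. The code below is a simplified illustration under that assumption, not the exact algorithm of the cited paper; the helper names and the greedy order are ours.

```python
# Simplified sketch of layered routing: assign each shortest path to the first
# virtual layer whose channel dependency graph (CDG) remains acyclic.
# Not the cited algorithm itself; helper names and the greedy order are ours.
import networkx as nx

def channel_deps(path):
    """Consecutive channel pairs used by a path: (u,v) -> (v,w)."""
    chans = list(zip(path, path[1:]))
    return list(zip(chans, chans[1:]))

def assign_layers(paths):
    layers = []          # one CDG per virtual layer
    assignment = {}      # path index -> layer index
    for idx, path in enumerate(paths):
        deps = channel_deps(path)
        for li, cdg in enumerate(layers):
            trial = cdg.copy()
            trial.add_edges_from(deps)
            if nx.is_directed_acyclic_graph(trial):
                layers[li] = trial
                assignment[idx] = li
                break
        else:                               # no existing layer works: open a new one
            cdg = nx.DiGraph()
            cdg.add_edges_from(deps)
            assignment[idx] = len(layers)
            layers.append(cdg)
    return assignment, len(layers)

if __name__ == "__main__":
    g = nx.cycle_graph(6)                   # a ring needs more than one layer
    paths = [nx.shortest_path(g, s, d) for s in g for d in g if s != d]
    _, nlayers = assign_layers(paths)
    print("virtual layers used:", nlayers)
```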
Article
Full-text available
A great deal of work has been done recently on developing techniques for proving deadlock freedom for wormhole routing algorithms. One approach has been to restrict the class of routing algorithms for which the proof technique applies. The other approach is to provide a generic method that can be applied to all routing algorithms. Although this latter approach offers clear advantages, a general technique must deal with many complications. Foremost among these is the issue of irreducible cyclic dependencies that cannot result in deadlock. Such dependencies have been referred to alternatively as unreachable configurations and false resource cycles. In this paper, we apply the notion of unreachable cyclic configurations to oblivious routing algorithms. An oblivious routing algorithm is thus constructed that is deadlock-free, even though there are cycles in the channel dependency graph. The idea of unreachable configurations is then further developed to show various restrictions on when such configurations can exist with oblivious routing algorithms. Finally, the example is generalized to allow the construction of larger networks with unreachable cycles. One benefit of characterizing when unreachable cyclic configurations can occur is that proving deadlock freedom is simplified for networks in which unreachable cycles cannot exist. Another contribution of this work is a first step toward a more formal model of unreachable cycles in wormhole-routed networks.
Article
Full-text available
We describe a set of implementations of the NAS Parallel Benchmarks based on Fortran 77 and the MPI message passing standard. These implementations, which are intended to be run with little or no tuning, approximate the performance a typical user can expect for a portable parallel program on a distributed memory computer. They complement rather than replace the original NAS Parallel Benchmarks. We also present two additions to the original pencil and paper specification. First, we define "class C" sizes of the benchmarks to better suit the current and next generation of supercomputers. Second, we introduce changes to the reporting requirements for NAS Parallel Benchmark results.
Conference Paper
A principal task in parallel and distributed systems is to reduce the communication load in the interconnection network, as this is usually the major bottleneck for the performance of distributed applications. In this paper we introduce a framework for solving on-line problems that aim to minimize the congestion (i.e. the maximum load of a network link) in general topology networks. We apply this framework to the problem of on-line routing of virtual circuits and to a dynamic data management problem. For both scenarios we achieve a competitive ratio of O(log^3 n) with respect to the congestion of the network links. Our on-line algorithm for the routing problem has the remarkable property that it is oblivious, i.e., the path chosen for a virtual circuit is independent of the current network load. Oblivious routing strategies can easily be implemented in distributed environments and have therefore been intensively studied for certain network topologies such as meshes, tori and hypercubic networks. This is the first oblivious path selection algorithm that achieves a polylogarithmic competitive ratio in general networks.
Article
The BenchIT kernels generate a large number of measurement results, depending on the number of functional arguments. Using the web interface, the user can display selected results from different measurement programs in a single coordinate system. Often there are several reasons that can cause characteristic minima, maxima, or a particular shape in a graph. It is necessary to collect additional information about the tested system to explain such effects on the basis of well-known system properties and physical values of the realization. The BenchIT project provides such an evaluation platform by offering a variety of measurement kernels, as well as an easily accessible plotting engine, thus enabling an easy way to measure performance on a specific system and to compare the result, which is a full graph instead of just a number, to results contributed by other users. Further development of the BenchIT project will take place on all module layers. A GUI for the configuration of the measurements is under development. It will provide an easier way to handle the measurements by partially replacing the shell scripts that have run the measurements up to this point. Furthermore, an additional way to plot the data on the website using Java applets and Java graphing tools is planned.
Article
A controller for a packet switching network is an algorithm to control the flow of packets through the network. A local controller is a controller executed independently by each node in the network, using only local information available to these nodes. A controller is deadlock- and livelock-free if it guarantees that every packet in the network reaches its destination within a finite amount of time. We present a local controller which is proved to be deadlock- and livelock-free.
Article
A recent seminal result of Räcke is that for any undirected network there is an oblivious routing algorithm with a polylogarithmic competitive ratio with respect to congestion. Unfortunately, Räcke's construction is not polynomial time. We give a polynomial time construction that guarantees Räcke's bounds, and more generally gives the true optimal ratio for any (undirected or directed) network.
Conference Paper
In this paper we isolate a combinatorial problem that, we believe, lies at the heart of this question and provide some encouragingly positive solutions to it. We show that there exists an N-processor realistic computer that can simulate arbitrary idealistic N-processor parallel computations with only a factor of O(log N) loss of runtime efficiency. The main innovation is an O(log N) time randomized routing algorithm. Previous approaches were based on sorting or permutation networks, and implied loss factors of order at least (log N)^2.
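The randomized routing idea usually associated with this result is two-phase: send each packet first to a uniformly random intermediate node, then on to its destination. A minimal sketch of that path-selection rule is given below; the hypercube topology and the function names are assumptions made for the example, not details taken from the abstract.

```python
# Two-phase randomized ("Valiant-style") path selection, sketched on a hypercube:
# phase 1 routes to a random intermediate node, phase 2 routes to the destination.
# Topology choice and names are illustrative; only the two-phase rule is the point.
import random

def dimension_order_path(src, dst, dims):
    """Greedy bit-fixing path on a 2^dims-node hypercube."""
    path, cur = [src], src
    for b in range(dims):
        if (cur ^ dst) & (1 << b):
            cur ^= (1 << b)
            path.append(cur)
    return path

def valiant_path(src, dst, dims, rng=random):
    mid = rng.randrange(1 << dims)            # uniformly random intermediate node
    first = dimension_order_path(src, mid, dims)
    second = dimension_order_path(mid, dst, dims)
    return first + second[1:]                 # splice, dropping the repeated node

if __name__ == "__main__":
    random.seed(0)
    print(valiant_path(0b0000, 0b1111, dims=4))
```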
Conference Paper
NOWs are arranged as switch-based networks, which allow the layout of both regular and irregular topologies. However, the irregular interconnection pattern makes routing and deadlock avoidance quite complicated. Current proposals use the up*/down* routing algorithm to remove cyclic dependencies between channels and avoid deadlock. We recently proposed a simple and effective methodology to compute up*/down* routing tables. The resulting routing algorithm is very effective in irregular topologies. However, its behavior is very poor in regular networks with orthogonal dimensions. Therefore, we propose a more flexible routing scheme that is effective in both regular and irregular topologies. Unlike up*/down* routing algorithms, the proposed routing algorithm breaks cycles at different nodes for each direction in the cycle, thus providing better traffic balancing than that provided by up*/down* routing algorithms. Evaluation results modeling a Myrinet network show that the new routing algorithm increases throughput with respect to the original up*/down* routing algorithm by a factor of up to 3.5 for regular networks, while also maintaining the performance of the improved up*/down* routing scheme proposed in [7] when applied to irregular networks.
Conference Paper
Today's data centers may contain tens of thousands of computers with significant aggregate bandwidth requirements. The network architecture typically consists of a tree of routing and switching elements with progressively more specialized and expensive equipment moving up the network hierarchy. Unfortunately, even when deploying the highest-end IP switches/routers, resulting topologies may only support 50% of the aggregate bandwidth available at the edge of the network, while still incurring tremendous cost. Non-uniform bandwidth among data center nodes complicates application design and limits overall system performance. In this paper, we show how to leverage largely commodity Ethernet switches to support the full aggregate bandwidth of clusters consisting of tens of thousands of elements. Similar to how clusters of commodity computers have largely replaced more specialized SMPs and MPPs, we argue that appropriately architected and interconnected commodity switches may deliver more performance at less cost than available from today's higher-end solutions. Our approach requires no modifications to the end host network interface, operating system, or applications; critically, it is fully backward compatible with Ethernet, IP, and TCP.
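A commonly quoted property of this style of design, stated here as background rather than quoted from the abstract, is that a three-tier fat tree built entirely from identical k-port switches supports k^3/4 hosts; the snippet below simply evaluates that sizing.

```python
# Sizing of a three-tier fat tree built from identical k-port switches
# (commonly associated with this line of work; background, not quoted from the
# abstract): k pods, (k/2)^2 core switches, k^3/4 hosts.
def fat_tree_sizes(k):
    assert k % 2 == 0
    hosts = k ** 3 // 4
    edge = agg = k * (k // 2)      # k pods, each with k/2 edge and k/2 aggregation switches
    core = (k // 2) ** 2
    return {"hosts": hosts, "edge": edge, "aggregation": agg, "core": core}

for k in (4, 24, 48):
    print(k, fat_tree_sizes(k))    # k=48 gives 27,648 hosts from commodity switches
```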
Conference Paper
The InfiniBand architecture is a high-performance network technology for the interconnection of processor nodes and I/O devices using a point-to-point switch-based fabric. The InfiniBand specification defines a basic management infrastructure that is responsible for subnet configuration, activation, and fault tolerance. Subnet management entities and functions are described, but the specifications do not impose any particular implementation. We present and analyze a complete subnet management mechanism for this architecture, which also allows us to anticipate future directions for obtaining efficient management protocols.
Conference Paper
InfiniBand is very likely to become the de facto standard for communication between processing nodes and I/O devices as well as for interprocessor communication. The InfiniBand Architecture (IBA) supports up to 15 data virtual lanes per physical link, primarily intended for traffic prioritization, deadlock avoidance, and quality of service. However, virtual lanes may also contribute to improved performance by reducing the influence of the head-of-line blocking effect on input physical ports. On the other hand, when virtual lanes are used, crossbar complexity may increase. The main goal of this paper is to show to what extent the use of virtual lanes may contribute to improved network performance in an InfiniBand environment, obtaining the trade-off between the number of virtual lanes and the performance improvement. Different configurations (crossbar organization, crossbar bandwidth, and link bandwidth) are used. Evaluation results using up*/down* routing show that two virtual lanes are often enough to achieve most of the improvement in performance, allowing the use of the remaining virtual lanes for other purposes. Additionally, by increasing the crossbar bandwidth, a lower-complexity crossbar configuration can be used.
Conference Paper
We introduce and analyze a new family of multiprocessor interconnection networks, called generalized fat trees, which include as special cases the fat trees used for the connection machine architecture CM-5, pruned butterflies, and various other fat trees proposed in the literature. The generalized fat trees provide a formal unifying concept to design and analyse a fat-tree-based architecture. The extended generalized fat tree network XGFT(h; m_1, ..., m_h; w_1, ..., w_h) of height h has ∏_{i=1}^{h} m_i leaf processors, and the inner nodes serve only as switches or routers. Each non-leaf node in level i has m_i children and each non-root has w_{i+1} parent nodes. The generalized fat trees provide regularity, symmetry, recursive scalability, maximal fault-tolerance, logarithmic diameter, bisection scalability, and permit simple algorithms for fault-tolerant self-routing and broadcasting. These networks are also versatile, since they can efficiently embed rings, meshes and tori, trees, pyramids and hypercubes.
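A quick way to read the XGFT parameters is to evaluate the level sizes directly. The short sketch below computes the number of leaves, ∏ m_i, and the node count per level for a small example, under the usual reading of the definition in the abstract; the code and names are ours.

```python
# Level sizes of an extended generalized fat tree XGFT(h; m_1..m_h; w_1..w_h),
# under the usual reading of the definition: level 0 holds the leaves, and
# level i contains (w_1*...*w_i) * (m_{i+1}*...*m_h) nodes.  Code is ours.
from math import prod

def xgft_level_sizes(m, w):
    h = len(m)
    assert len(w) == h
    return [prod(w[:i]) * prod(m[i:]) for i in range(h + 1)]

# Example: a height-2 XGFT with 4 children per switch and 2 parents per node
# has 16 leaves, 8 level-1 switches, and 4 top-level switches.
print(xgft_level_sizes([4, 4], [2, 2]))   # -> [16, 8, 4]
```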
Conference Paper
Many proposed distributed hash table (DHT) schemes for peer-to-peer networks are based on traditional parallel interconnection topologies. In this paper, we show that the Kautz graph is a very good static topology on which to construct DHT schemes. We demonstrate the optimal diameter and optimal fault tolerance properties of the Kautz graph and prove that the Kautz graph is (1+o(1))-congestion-free when using the long path routing algorithm. We then propose FissionE, a novel DHT scheme based on the Kautz graph. FissionE has constant degree, O(log N) diameter, and is (1+o(1))-congestion-free. FissionE shows that a DHT scheme with constant degree and constant congestion can achieve O(log N) diameter, which is better than the lower bound Ω(N^{1/d}) conjectured before.
Article
An assertion that Dijkstra's algorithm for shortest paths (adapted to allow arcs of negative weight) runs in O(n^3) steps is disproved by showing a set of networks which take O(n 2^n) steps.
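Since the routing work surrounding this listing builds on single-source shortest paths, a standard Dijkstra implementation is sketched below for reference; it is correct only for non-negative edge weights, precisely the restriction this note concerns, and it is our illustration rather than code from the cited article.

```python
# Textbook Dijkstra single-source shortest paths with a binary heap.
# Correct only for non-negative edge weights; with negative arcs the greedy
# settling of nodes breaks down, which is the regime the cited note addresses.
import heapq

def dijkstra(adj, source):
    """adj: {node: [(neighbor, weight), ...]} with weight >= 0."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                      # stale heap entry
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

if __name__ == "__main__":
    adj = {"a": [("b", 1), ("c", 4)], "b": [("c", 2)], "c": []}
    print(dijkstra(adj, "a"))   # {'a': 0, 'b': 1, 'c': 3}
```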
Article
Autonet is a self-configuring local area network composed of switches interconnected by 100 Mb/s, full-duplex, point-to-point links. The switches contain 12 ports that are internally connected by a full crossbar. Switches use cut-through to achieve a packet forwarding latency as low as 2 μs per switch. Any switch port can be cabled to any other switch port or to a host network controller. A processor in each switch monitors the network's physical configuration. A distributed algorithm running on the switch processor computes the routes packets are to follow and fills in the packet forwarding table in each switch. With Autonet, distinct paths through the set of network links can carry packets in parallel, allowing many pairs of hosts to communicate simultaneously at full link bandwidth. A 30-switch network with more than 100 hosts has been the service network for Digital's Systems Research Center since February 1990.
Article
A deadlock-free routing algorithm can be generated for arbitrary interconnection networks using the concept of virtual channels. A necessary and sufficient condition for deadlock-free routing is the absence of cycles in the channel dependency graph. Given an arbitrary network and a routing function, the cycles of the channel dependency graph can be removed by splitting physical channels into groups of virtual channels. This method is used to develop deadlock-free routing algorithms for k-ary n-cubes, for cube-connected cycles, and for shuffle-exchange networks. (This is a revised version of 5206-tr-86)
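The condition in this abstract can be checked mechanically: build the channel dependency graph induced by the routing function and test it for cycles. The sketch below does this for a unidirectional ring, first with a single channel per link (cyclic, hence a deadlock risk) and then with a dateline-style split into two virtual channels (acyclic). The concrete virtual-channel rule used is a common textbook rendering, stated here as an assumption rather than a quotation of the paper.

```python
# Channel dependency graph (CDG) check for a unidirectional n-node ring.
# One channel per link yields a cyclic CDG; splitting each link into two
# virtual channels with a dateline-style rule makes it acyclic.  The concrete
# rule ("high" VC while dest > current node, "low" otherwise) is a common
# textbook rendering, used here only to illustrate the acyclicity test.
import networkx as nx

def ring_routes(n):
    """All source->destination node sequences, always travelling clockwise."""
    for s in range(n):
        for d in range(n):
            if s != d:
                hops, cur = [s], s
                while cur != d:
                    cur = (cur + 1) % n
                    hops.append(cur)
                yield hops

def cdg(n, channel_of):
    """CDG whose vertices are the channels chosen by channel_of(current_node, dest)."""
    g = nx.DiGraph()
    for path in ring_routes(n):
        dest = path[-1]
        chans = [channel_of(u, dest) for u in path[:-1]]
        g.add_edges_from(zip(chans, chans[1:]))
    return g

n = 6
single = cdg(n, lambda u, dest: (u,))                          # one channel per link
split = cdg(n, lambda u, dest: (u, "H" if dest > u else "L"))  # two VCs per link
print("single channel acyclic:", nx.is_directed_acyclic_graph(single))       # False
print("two virtual channels acyclic:", nx.is_directed_acyclic_graph(split))  # True
```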
Article
A principal task in parallel and distributed systems is to reduce the communication load in the interconnection network, as this is usually the major bottleneck for the performance of distributed applications. In this paper we introduce a framework for solving on-line problems that aim to minimize the congestion (i.e. the maximum load of a network link) in general topology networks. We apply this framework to the problem of on-line routing of virtual circuits and to a dynamic data management problem. For both scenarios we achieve a competitive ratio of O(log^3 n) with respect to the congestion of the network links. Our on-line algorithm for the routing problem has the remarkable property that it is oblivious, i.e., the path chosen for a virtual circuit is independent of the current network load. Oblivious routing strategies can easily be implemented in distributed environments and have therefore been intensively studied for certain network topologies such as meshes, tori and hypercubic networks. This is the first oblivious path selection algorithm that achieves a polylogarithmic competitive ratio in general networks.