Thesis

New routing algorithms for heterogeneous exaflopic supercomputers

Abstract

Building efficient supercomputers requires optimising communications, and their exaflopic scale causes an unavoidable risk of relatively frequent failures. For a cluster with given networking capabilities and applications, performance is achieved by providing a good route for every message while minimising resource access conflicts between messages. This thesis focuses on the fat-tree family of networks, for which we define several overarching properties so as to efficiently take into account a realistic superset of this topology, while keeping a significant edge over agnostic methods. Additionally, a partially novel static congestion risk evaluation method is used to compare algorithms. A generic optimisation is presented for some applications on clusters with heterogeneous equipment. The proposed algorithms use distinct approaches to improve centralised static routing by combining computation speed, fault-resilience, and minimal congestion risk.

Conference Paper
We introduce FatPaths: a simple, generic, and robust routing architecture that enables state-of-the-art low-diameter topologies such as Slim Fly to achieve unprecedented performance. FatPaths targets Ethernet stacks in both HPC supercomputers as well as cloud data centers and clusters. FatPaths exposes and exploits the rich ("fat") diversity of both minimal and non-minimal paths for high-performance multi-pathing. Moreover, FatPaths features a redesigned "purified" transport layer that removes virtually all TCP performance issues (e.g., the slow start), and uses flowlet switching, a technique used to prevent packet reordering in TCP networks, to enable very simple and effective load balancing. Our design enables recent low-diameter topologies to outperform powerful Clos designs, achieving 15% higher net throughput at 2× lower latency for comparable cost. FatPaths will significantly accelerate Ethernet clusters that form more than 50% of the Top500 list and it may become a standard routing scheme for modern topologies.
Article
Coupling regular topologies with optimized routing algorithms is key in pushing the performance of interconnection networks of supercomputers. In this paper we present Dmodc, a fast deterministic routing algorithm for Parallel Generalized Fat-Trees (PGFTs) which minimizes congestion risk even under massive topology degradation caused by equipment failure. Dmodc computes forwarding tables with a closed-form arithmetic formula by relying on a fast preprocessing phase. This allows complete re-routing of topologies with tens of thousands of nodes in less than a second. In turn, this greatly helps centralized fabric management react to faults with high-quality routing tables and no impact to running applications in current and future very large-scale HPC clusters.
Article
Dragonfly topologies are gathering great interest as one of the most promising interconnect options for High-Performance Computing systems. Dragonflies contain physical cycles that may lead to traffic deadlocks unless the routing algorithm prevents them properly. Previous topology-aware algorithms are difficult to implement, or even unfeasible, in systems based on the InfiniBand (IB) architecture, which is the most widely used network technology in HPC systems. In this paper, we present a new deterministic, minimal-path routing for Dragonfly that prevents deadlocks using VLs according to the IB specification, so that it can be straightforwardly implemented in IB-based networks. We have called this proposal D3R (Deterministic Deadlock-free Dragonfly Routing). D3R is scalable as it requires only 2 VLs to prevent deadlocks regardless of network size, i.e. fewer VLs than required by the deadlock-free routing engines available in IB that are suitable for Dragonflies. Alternatively, D3R achieves higher throughput if an additional VL is used to reduce internal contention in the Dragonfly groups. We have implemented D3R as a new routing engine in OpenSM, the control software including the subnet manager in IB. We have evaluated D3R by means of simulation and by experiments performed in a real IB-based cluster, the results showing that, in general, D3R outperforms other routing engines.
Conference Paper
The Dragonfly topology was introduced by Kim et al. [1] aiming to decrease the cost and diameter of the network. The topology divides routers into groups connected by long links. Each group strives to implement a high-radix virtual router, connected by a completely-connected topology. In this paper, we propose an extended Dragonfly+ network in which routers inside the group are connected in a Clos-like topology. Dragonfly+ is superior to conventional Dragonfly due to the significantly larger number of hosts which it is able to support. In addition, Dragonfly+ supports similar or better bisection bandwidth for various traffic patterns, and requires a smaller number of buffers to avoid credit loop deadlocks in lossless networks. Moreover, we introduce a novel Fully Progressive Adaptive Routing algorithm with remote congestion notifications. To support our proposal we present analysis and simulations.
Article
The performance of interconnection networks is a challenging issue for High-Performance Computing (HPC) systems, which becomes even more important when the number of interconnected endnodes grows. In that sense, Dragonfly interconnection patterns are a very popular option to configure the network topology, especially for large systems, as they are able to achieve high scalability relying on high-radix switches. This kind of hierarchical topology has two levels of interconnection (i.e., connections within the elements of a group and connections among groups) and each one can be interconnected using different patterns. However, regardless of the Dragonfly interconnection pattern, the Head-of-Line (HoL) blocking effect derived from congestion situations may jeopardize Dragonfly performance. This paper analyzes the dynamics of congestion in different Dragonfly fully-connected interconnection patterns. Also, we describe a queuing scheme called Hierarchical Two-Level Queuing (H2LQ), designed specially to reduce HoL blocking in any fully-connected Dragonfly network that uses minimal-path routing. Finally, we present experimental results which show that this scheme significantly boosts Dragonfly performance, regardless of the interconnection pattern, especially when congestion arises, while requiring fewer network resources than other techniques oriented to deal with the effects of congestion.
Conference Paper
As the size of high-performance computing systems grows, the number of events requiring a network reconfiguration, as well as the complexity of each reconfiguration, is likely to increase. In large systems, the probability of component failure is high. At the same time, with more network components, ensuring high utilization of network resources becomes challenging. Reconfiguration in interconnection networks, like InfiniBand (IB), typically involves computation and distribution of a new set of routes in order to maintain connectivity and performance. In general, current routing algorithms do not consider the existing routes in a network when calculating new ones. Such configuration-oblivious routing might result in substantial modifications to the existing paths, and the reconfiguration becomes costly as it potentially involves a large number of source-destination pairs. In this paper, we propose a novel routing algorithm for IB-based fat-tree topologies, SlimUpdate. SlimUpdate employs techniques to preserve existing forwarding entries in switches to ensure a minimal routing update, without any performance penalty, and with minimal computational overhead. We present an implementation of SlimUpdate in OpenSM, and compare it with the current de facto fat-tree routing algorithm. Our experiments and simulations show a decrease of up to 80% in the number of total path modifications when using SlimUpdate routing, while achieving similar or even better performance than the fat-tree routing in most reconfiguration scenarios.
Conference Paper
BXI, Bull eXascale Interconnect, is the new interconnection network developed by Atos for High Performance Computing. It has been designed to meet the requirements of exascale supercomputers. At such scale, faults have to be expected and dealt with transparently so that applications remain unaffected by them. BXI features various mechanisms for this purpose, one of which is the BXI routing component presented in this paper. The BXI routing module computes the full routing tables for a 64k-node fat-tree in a few minutes, but with partial re-computation it can withstand numerous inter-router link failures without any noticeable impact on running applications.
Conference Paper
High-Performance Computing (HPC) Clusters and Data Center Networks often rely on fat-tree topologies. However, fat trees and their known variants are not designed for concurrent small jobs. As a result, in recent years, HPC designers have introduced ad-hoc topologies to offer better performance for these concurrent small jobs. In this paper, we present and formally define these topologies, which we call Quasi Fat Trees (QFTs). Specifically, we formulate the graph structure of these new topologies, and show that they perform better for concurrent small jobs. Furthermore, we derive a closed-form and fault-resilient contention-free routing algorithm for all global shift permutations. This routing optimizes the run-time of large computing jobs that utilize MPI collectives. Finally, we verify the algorithm by running its implementation as an OpenSM routing engine on various sizes of QFT topologies, and show that it exhibits good performance.
Conference Paper
InfiniBand (IB) has become a popular network interconnect for high-performance computing (HPC) systems. Many of the large IB-based HPC systems use some variant of the fat-tree topology to take advantage of the useful properties fat-trees offer. The fat-tree routing algorithm is one of the most efficient deterministic routing algorithms for fat-tree topologies. The algorithm ensures that the number of routes assigned to each link is balanced across the fabric. However, one problem with its load-balancing technique is that it assumes uniform traffic distribution in the network. When routes towards nodes that mainly consume large amounts of data are assigned to share links in the fabric while alternative links are underutilized, sub-optimal network throughput is obtained. Also, as the fat-tree algorithm routes nodes according to the indexing order, the performance may differ for two systems cabled in the exact same way. In this paper, we propose wFatTree, a novel fat-tree routing algorithm, which considers node traffic characteristics to balance load across the network links more evenly, and with predictable network performance. Our experiments and simulations show an improvement of up to 60% in total network throughput on large fat-tree installations when using wFatTree routing. Furthermore, wFatTree can also be used to prioritize traffic flowing towards the critical nodes in the network.
Conference Paper
Dragonfly topologies are recent network designs that are considered one of the most promising interconnect options for Exascale systems. They offer a low diameter and low network cost, but do so at the expense of path diversity, which makes them vulnerable to certain adversarial traffic patterns. Indirect routing approaches can alleviate the performance degradation that these workloads experience. However, there are limits to the improvements that can be achieved using the indirect routing approach that is popular today, limits that are inherent to the Dragonfly topological structure. In this work, we explore these limits by providing a theoretical justification to why adversarial traffic patterns routed indirectly with an algorithm that perfectly distributes load across inter-Dragonfly-group links can still induce significant bottlenecks in the intra-group links. We equally provide estimations of the performance impact of these imbalances, as well as present a set of simulation based benchmarks that confirm the theoretical predictions for practical Dragonfly systems.
Article
In the context of eXtended Generalized Fat Tree (XGFT) topologies, widely used in HPC and datacenter network designs, we propose a generic method, based on Integer Linear Programming (ILP), to efficiently determine optimal routes for arbitrary workloads. We propose a novel approach that combines ILP with dynamic programming, effectively reducing the time to solution. Specifically, we divide the network into smaller subdomains optimized using a custom ILP formulation that ensures global optimality of local solutions. Local solutions are then combined into an optimal global solution using dynamic programming. Finally, we demonstrate through a series of extensive benchmarks that our approach scales in practice to networks interconnecting several thousands of nodes, using a single-threaded, freely available linear programming solver on commodity hardware, with the potential for higher scalability by means of commercial, parallel solvers.
Conference Paper
Interconnection network performance is a key factor when constructing parallel computers. The choice of an interconnection network used in a parallel computer depends on a large number of performance factors which are very often application dependent. We give the outline of a performance evaluation and comparison methodology using what we think of as the most important parameters to be considered when solving such a problem. This methodology is applied on a new interconnection network called MCRB network and on Omega network.
Conference Paper
The fat-tree topology has become a popular choice for InfiniBand enterprise systems due to its deadlock freedom, fault-tolerance and full bisection bandwidth. In the HPC domain, InfiniBand fabric is used in almost 42% of the systems on the latest Top 500 list, and many of those systems are based on the fat-tree topology. Despite the popularity of the fat-tree topology, little research has been done to compare the behavior of InfiniBand routing algorithms on degraded fat-tree topologies. In this paper, we identify the weaknesses of the current fat-tree routing and propose enhancements that liberalize the restrictions imposed on the routed fabric. Furthermore, we present a thorough analysis of non-proprietary routing algorithms that are implemented in the InfiniBand Open Subnet Manager. Our results show that even though the performance of a fat-tree routed network deteriorates predictably with the number of failed links, the fat-tree routing algorithm is still the best choice for severely degraded fat-tree fabrics.
Conference Paper
High-radix hierarchical networks are cost-effective topologies for large scale computers. In such networks, routers are organized in super nodes, with local and global interconnections. These networks, known as Dragonflies, outperform traditional topologies such as multi-trees or tori, in cost and scalability. However, depending on the traffic pattern, network congestion can lead to degraded performance. Misrouting (non-minimal routing) can be employed to avoid saturated global or local links. Nevertheless, with the current deadlock avoidance mechanisms used for these networks, supporting misrouting implies routers with a larger number of virtual channels. This exacerbates the buffer memory requirements that constitute one of the main constraints in high-radix switches. In this paper we introduce two novel deadlock-free routing mechanisms for Dragonfly networks that support on-the-fly adaptive routing. Using these schemes both global and local misrouting are allowed employing the same number of virtual channels as in previous proposals. Opportunistic Local Misrouting obtains the best performance by providing the highest routing freedom, and relying on a deadlock-free escape path to the destination for every packet. However, it requires Virtual Cut-Through flow-control. By contrast, Restricted Local Misrouting prevents the appearance of cycles thanks to a restriction of the possible routes within super nodes. This makes this mechanism suitable for both Virtual Cut-Through and Wormhole networks. Evaluations show that the proposed deadlock-free routing mechanisms prevent the most frequent pathological issues of Dragonfly networks. As a result, they provide higher performance than previous schemes, while requiring the same area devoted to router buffers.
Article
A deadlock-free routing algorithm can be generated for arbitrary interconnection networks using the concept of virtual channels. A necessary and sufficient condition for deadlock-free routing is the absence of cycles in a channel dependency graph. Given an arbitrary network and a routing function, the cycles of the channel dependency graph can be removed by splitting physical channels into groups of virtual channels. This method is used to develop deadlock-free routing algorithms for k-ary n-cubes, for cube-connected cycles, and for shuffle-exchange networks.
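The acyclicity condition stated in this abstract can be illustrated on a small unidirectional ring. The following is a minimal sketch, not the paper's own construction: the function names (`ring_dependencies`, `single_vc`, `two_vc`) and the particular rule for choosing between two virtual channels are assumptions made for illustration. It builds the channel dependency graph by walking every source-destination route, then checks it for cycles.

```python
from collections import defaultdict

def has_cycle(graph):
    """Depth-first search cycle check on a directed graph {node: set(successors)}."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = defaultdict(int)  # every node starts WHITE

    def visit(u):
        colour[u] = GREY
        for v in graph.get(u, ()):
            if colour[v] == GREY or (colour[v] == WHITE and visit(v)):
                return True
        colour[u] = BLACK
        return False

    return any(colour[u] == WHITE and visit(u) for u in list(graph))

def ring_dependencies(n, channel_of):
    """Channel dependency graph of an n-node unidirectional ring.

    channel_of(node, dst) names the channel a packet uses to leave `node`
    towards `dst` (hypothetical helper signature). An edge a -> b means some
    route holds channel a while requesting channel b.
    """
    deps = defaultdict(set)
    for src in range(n):
        for dst in range(n):
            node, prev = src, None
            while node != dst:
                channel = channel_of(node, dst)
                deps[channel]  # register the channel even if it has no successor
                if prev is not None:
                    deps[prev].add(channel)
                prev, node = channel, (node + 1) % n
    return deps

def single_vc(node, dst):
    # One channel per physical link: dependencies wrap around the ring.
    return node

def two_vc(node, dst):
    # Assumed split: "high" channel while node < dst, "low" channel otherwise.
    return (node, 1 if node < dst else 0)

print(has_cycle(ring_dependencies(4, single_vc)))  # cyclic: deadlock is possible
print(has_cycle(ring_dependencies(4, two_vc)))     # acyclic with two virtual channels
```

With a single channel per link, the dependencies follow the ring all the way around and form a cycle. With the assumed two-channel split, packets that wrap past the last node move from the low chain onto the high chain and never return, so the graph is acyclic.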
Conference Paper
New static source routing algorithms for High Performance Computing (HPC) are presented in this work. The target parallel architectures are based on the commonly used fat-tree networks and their slimmed versions. The evaluation of such proposals and their comparison against currently used routing mechanisms have been driven by realistic traffic generated by HPC applications. Our experimental framework is based on the integration of two existing simulators, one replaying an MPI application and another simulating the network details. The resulting simulation platform has been fed with traces from real executions. We have obtained several interesting findings: (i) contrary to the widely accepted belief, random static routing in k-ary n-trees (which is the default option for InfiniBand and Myrinet technologies) is not a good solution for HPC applications; (ii) some existing oblivious routing techniques can be very good for certain communication patterns present on applications, but clearly fail for some others and (iii) one of the proposed pattern-aware routing algorithms could be used to better utilize network resources and thus achieve higher performance, particularly for the case of cost-effective networks.
Conference Paper
A family of oblivious routing schemes for fat trees and their slimmed versions is presented in this work. First, two popular oblivious routing algorithms, which we refer to as S-mod-k and D-mod-k, are analyzed in detail. S-mod-k is the default routing algorithm given as an example in the first works formally describing fat tree networks. D-mod-k has been independently proposed and investigated by several authors, who conclude in their evaluations that it achieves better performance than a random or adaptive routing approach. First, we identify the reasons why these algorithms perform well. Using this insight we extend these algorithms, originally intended for full bisection networks, to slimmed networks. Based on the lessons learned we propose a new generalized family of algorithms that provides a better oblivious solution than the existing ones for this class of networks. Moreover, this family extends the previous work from k-ary n-trees to the more general class of extended generalized fat trees.
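To make the difference between S-mod-k and D-mod-k concrete, here is a minimal sketch using one commonly quoted formulation for a k-ary n-tree: at switch level l (counted from 0 at the leaf switches), the up-port is floor(id / k^l) mod k, where id is the source for S-mod-k and the destination for D-mod-k. The exact definition in the cited paper may differ; the host numbering, level indexing, and function name `up_ports` are assumptions.

```python
def up_ports(src, dst, k, n, scheme="D"):
    """Up-ports taken on the ascending half of a route in a k-ary n-tree.

    Assumptions: hosts are numbered 0 .. k**n - 1, host h hangs off leaf
    switch h // k, and switch levels are counted from 0 at the leaves.
    scheme "S" keys the choice on the source, "D" on the destination.
    """
    if not (0 <= src < k**n and 0 <= dst < k**n):
        raise ValueError("host id out of range")
    key = src if scheme == "S" else dst
    # Climb until src and dst fall under the same switch subtree.
    hops = 0
    while src // k ** (hops + 1) != dst // k ** (hops + 1):
        hops += 1
    return [(key // k**level) % k for level in range(hops)]

# 2-ary 3-tree (8 hosts): routes towards host 7 from two different sources.
print(up_ports(0, 7, k=2, n=3, scheme="D"))  # [1, 1]
print(up_ports(2, 7, k=2, n=3, scheme="D"))  # [1, 1]
print(up_ports(0, 7, k=2, n=3, scheme="S"))  # [0, 0]
```

In this sketch, D-mod-k sends every route towards a given destination through the same up-ports regardless of the source, whereas S-mod-k fixes the up-ports per source. The descending half of the route is determined by the destination in both cases.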
Conference Paper
Multistage interconnection networks based on central switches are ubiquitous in high-performance computing. Applications and communication libraries typically make use of such networks without consideration of the actual internal characteristics of the switch. However, application performance of these networks, particularly with respect to bisection bandwidth, does depend on communication paths through the switch. In this paper we discuss the limitations of the hardware definition of bisection bandwidth (capacity-based) and introduce a new metric: effective bisection bandwidth. We assess the effective bisection bandwidth of several large-scale production clusters by simulating artificial communication patterns on them. Networks with full bisection bandwidth typically provided effective bisection bandwidth in the range of 55-60%. Simulations with application-based patterns showed that the difference between effective and rated bisection bandwidth could impact overall application performance by up to 12%.
Article
A host-switch graph was originally proposed as a graph that represents a network topology of a computer system with 1-port host computers and r-port switches. It has been studied from both theoretical and practical aspects in terms of the diameter, the average shortest path length, and the performance of real applications. In recent high-performance computing systems, however, a host computer is connected to multiple switches by using InfiniBand, NVSwitch, or Omni-Path, and consequently they provide high bandwidths. Since a host-switch graph cannot represent such systems, this article extends a host-switch graph so that it can represent such systems. As a result, a host-switch graph can include multi-ported hosts. Furthermore, we propose to use multi-port hosts for reducing the diameter. We show that the diameter minimization is equivalent to solving the degree diameter problem for bipartite graphs of diameter three. Our experimental results show that we can drastically reduce the diameter while increasing the bandwidth, and improve the performance of MPI applications by up to 162% compared with networks with single-ported hosts.
Article
The interconnection network is a key element in High-Performance Computing (HPC) and Datacenter (DC) systems whose performance depends on several design parameters, such as the topology, the switch architecture, and the routing algorithm. Among the most common topologies in HPC systems, the Fat-Tree offers several shortest-path routes between any pair of end-nodes, which allows multi-path routing schemes to balance traffic flows among the available links, thus reducing congestion probability. However, traffic balance cannot solve by itself some congestion situations that may still degrade network performance. Another approach to reduce congestion is queue-based flow separation, but our previous work shows that multi-path routing may spread congested flows across several queues, thus being counterproductive. In this paper, we propose a set of restrictions to improve alternative routes selection for multi-path routing algorithms in Fat-Tree networks, so that they can be positively combined with queuing schemes.
Article
This paper aims at establishing a method for designing high-performance network topologies to bridge a gap between theoretical and practical studies. To this end, we present a novel graph called a host-switch graph, which consists of host vertices and switch vertices with maximum degree 1 and r, respectively. This graph represents a network topology of a practical parallel/distributed computer system with host computers connected by r-port switches. We discuss important metrics for designing high-performance interconnection networks: the host-to-host average shortest path length (h-ASPL) and the bisection width (BiW). In particular, we explore a method for constructing host-switch graphs with low h-ASPL and high BiW that connect the fixed number of hosts via any number of r-port switches. We demonstrate that the number of switches that provides the minimum h-ASPL can mathematically be approximated, and the minimum number of switches that provides a certain BiW can experimentally be approximated. On the basis of the approximations, we propose a randomized algorithm for searching host-switch graphs. We then apply the graphs to interconnection networks and compare them with typical network topologies. As compared with the torus, the dragonfly, and the fat-tree, our networks attain higher performance and smaller power and costs.
Conference Paper
Data center networks, and especially drop-free RoCEv2 networks require efficient congestion control protocols. DCQCN (ECN-based) and TIMELY (delay-based) are two recent proposals for this purpose. In this paper, we analyze DCQCN and TIMELY using fluid models and simulations, for stability, convergence, fairness and flow completion time. We uncover several surprising behaviors of these protocols. For example, we show that DCQCN exhibits non-monotonic stability behavior, and that TIMELY can converge to stable regime with arbitrary unfairness. We propose simple fixes and tuning for ensuring that both protocols converge to and are stable at the fair share point. Finally, using lessons learnt from the analysis, we address the broader question: are there fundamental reasons to prefer either ECN or delay for end-to-end congestion control in data center networks? We argue that ECN is a better congestion signal, due to the way modern switches mark packets, and due to a fundamental limitation of end-to-end delay-based protocols, that we derive.
Conference Paper
HPC network topology design is currently shifting from high-performance, higher-cost Fat-Trees to more cost-effective architectures. Three diameter-two designs, the Slim Fly, Multi-Layer Full-Mesh, and Two-Level Orthogonal Fat-Tree excel in this, exhibiting a cost per endpoint of only 2 links and 3 router ports with lower end-to-end latency and higher scalability than traditional networks of the same total cost. However, other than for the Slim Fly, there is currently no clear understanding of the performance and routing of these emerging topologies. For each network, we discuss minimal, indirect random, and adaptive routing algorithms along with deadlock-avoidance mechanisms. Using these, we evaluate the performance of a series of representative workloads, from global uniform and worst-case traffic to the all-to-all and near-neighbor exchange patterns prevalent in HPC applications. We show that while all three topologies have similar performance, OFTs scale to twice as many endpoints at the same cost as the others.
Conference Paper
Lossless interconnection networks are omnipresent in high performance computing systems, data centers and network-on-chip architectures. Such networks require efficient and deadlock-free routing functions to utilize the available hardware. Topology-aware routing functions become increasingly inapplicable, due to irregular topologies, which either are irregular by design or as a result of hardware failures. Existing topology-agnostic routing methods either suffer from poor load balancing or are not bounded in the number of virtual channels needed to resolve deadlocks in the routing tables. We propose a novel topology-agnostic routing approach which implicitly avoids deadlocks during the path calculation instead of solving both problems separately. We present a model implementation, called Nue, of a destination-based and oblivious routing function. Nue routing heuristically optimizes the load balancing while enforcing deadlock-freedom without exceeding a given number of virtual channels, which we demonstrate based on the InfiniBand architecture.
Article
The performance of k-ary n-cube interconnection networks is analyzed under the assumption of constant wire bisection. It is shown that low-dimensional k-ary n-cube networks (e.g., tori) have lower latency and higher hot-spot throughput than high-dimensional networks (e.g., binary n-cubes) with the same bisection width.
Conference Paper
Head-of-Line (HoL) blocking is a well-known phenomenon that may dramatically degrade the performance of the modern high-performance interconnection networks. Many techniques have been proposed to solve this problem, most of them based on separating traffic flows into different queues at switch ports. However, the efficiency of these proposals may vary depending on the network topology or routing algorithm, as many of them are not aware of any specific network configuration. By contrast, other schemes are tailored to specific topologies like fat-trees, achieving a greater efficiency than "topology-agnostic" schemes. In this paper we propose a straightforward queuing scheme intended to be used in an efficient, recently-proposed hybrid topology. Our proposal significantly boosts network performance with respect to other queuing schemes while requiring similar or fewer resources. Moreover, the implementation of this scheme in InfiniBand-based networks is elementary thanks to the mapping of Service-Levels to Virtual-Lanes supported by this specification.
Conference Paper
For clusters where the topology consists of a fat-tree or more than one fat-tree combined into one subnet, there are several properties that the routing algorithms should support, beyond what exists today. One of the missing properties is that the current fat-tree routing algorithm does not guarantee that each port on a multi-homed node is routed through redundant spines, even if these ports are connected to redundant leaves. As a consequence, in case of a spine failure, there is a small window where the node is unreachable until the subnet manager has rerouted to another spine. In this paper, we discuss the need for independent routes for multi-homed nodes in fat-trees by providing real-life examples where a single point of failure leads to complete outage of a multi-port node. We present and implement methods that may be used to alleviate this problem and perform simulations that demonstrate improvements in performance, scalability, availability and predictability of InfiniBand fat-tree topologies. We show that our methods not only increase the performance by up to 52.6%, but also, and more importantly, that there is no downtime associated with spine switch failure.
Conference Paper
One of the objectives of the decade for High-Performance Computing systems is to reach the exascale level of computing power before 2018, which will require strong efforts in their design. In that sense, high-speed low-latency interconnection networks are essential elements for exascale HPC systems. Indeed, the performance of the whole system depends on that of the interconnection network. In order to develop and test new techniques suited to exascale HPC systems, software-based network simulators are commonly used. As developing a network simulator from scratch is a difficult task, several platforms help the developers, OMNeT++ being one of the most popular. In this paper, we propose a new generic network simulator, exploiting the features of the OMNeT++ framework. The proposed tool is the first step to model high-performance interconnection networks of exascale HPC systems: the message switching layer, routing and arbitration algorithms and buffer organizations have been modeled according to the current and expected characteristics of these systems. In addition, the tool has been designed so that it is possible to simulate networks of large size. Simulation results, validated against real systems, show the accuracy of the model.
Article
This paper describes a method of designing arrays of crosspoints for use in telephone switching systems in which it will always be possible to establish a connection from an idle inlet to an idle outlet regardless of the number of calls served by the system.
Article
Clustered systems have become a dominant architecture of scalable high-performance supercomputers. In these large-scale computers, network performance and scalability are as critical as compute-node speed. InfiniBand™ has become a commodity networking solution supporting the stringent latency, bandwidth and scalability requirements of these clusters. The network performance is also affected by its topology, packet routing and the communication patterns the distributed application exercises. Fat-trees are the topology structures used for constructing most large clusters as they are scalable, maintain cross-bisectional-bandwidth (CBB), and are practical to build using fixed-arity switches. In this paper, we propose a fat-tree routing algorithm that provides a congestion-free, all-to-all shift pattern leveraging the InfiniBand™ static routing capability. The algorithm supports partially populated fat-trees built with switches of arbitrary number of ports and CBB ratios. To evaluate the proposed algorithm, detailed switch and host simulation models were developed and multiple fabric topologies were run. The results of these simulations as well as measurements on real clusters show an improvement in all-to-all delay by avoiding congestion on the fabric. Copyright © 2009 John Wiley & Sons, Ltd.
Conference Paper
Networks of workstations (NOWs) are being considered as a cost-effective alternative to parallel computers. Many NOWs are arranged as a switch-based network with irregular topology, which makes routing and deadlock avoidance quite complicated. Current proposals use the up*/down* routing algorithm to remove cyclic dependencies between channels and avoid deadlock. We recently proposed a simple and effective methodology to compute up*/down* routing tables. The resulting up*/down* routing scheme makes use of a different link direction assignment to compute routing tables. Assignment of link direction is based on generating an underlying acyclic connected graph from the network graph. In this paper, we propose and evaluate new heuristic rules to compute the underlying graph. Moreover, we propose a traffic balancing algorithm to obtain more efficient up*/down* routing tables when source routing is used. Evaluation results show that the routing algorithm based on the new methodology increases throughput by a factor of up to 2.8 in large networks, also reducing latency significantly.
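As a rough illustration of the up*/down* rule mentioned in this abstract, the sketch below uses the classic breadth-first link direction assignment, not the alternative assignment or heuristics the paper proposes. The function names (`bfs_levels`, `is_up`, `is_legal`) and the small example graph are hypothetical. A link traversal counts as "up" when it moves to a switch closer to the root, with ties broken by the lower identifier, and a path is legal only if it never goes up after having gone down.

```python
from collections import deque

def bfs_levels(adj, root):
    """Breadth-first distance of every switch from the chosen root."""
    level = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in level:
                level[v] = level[u] + 1
                queue.append(v)
    return level

def is_up(level, a, b):
    """True if traversing a -> b moves towards the root (ties: lower id)."""
    return (level[b], b) < (level[a], a)

def is_legal(level, path):
    """A legal path uses zero or more up links, then zero or more down links."""
    gone_down = False
    for a, b in zip(path, path[1:]):
        if is_up(level, a, b):
            if gone_down:
                return False  # a down -> up turn is forbidden
        else:
            gone_down = True
    return True

# Hypothetical 4-switch ring: 0-1, 0-2, 1-3, 2-3, rooted at switch 0.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
level = bfs_levels(adj, root=0)
print(is_legal(level, [1, 0, 2]))  # up then down
print(is_legal(level, [1, 3, 2]))  # down then up
```

Forbidding every down-to-up turn is what removes cyclic channel dependencies, at the cost of ruling out some shortest paths such as 1-3-2 in this ring.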
Article
Designing highly dependable systems requires a good understanding of failure characteristics. Unfortunately, little raw data on failures in large IT installations are publicly available. This paper analyzes failure data collected at two large high-performance computing sites. The first data set has been collected over the past nine years at Los Alamos National Laboratory (LANL) and has recently been made publicly available. It covers 23,000 failures recorded on more than 20 different systems at LANL, mostly large clusters of SMP and NUMA nodes. The second data set has been collected over the period of one year on one large supercomputing system comprising 20 nodes and more than 10,000 processors. We study the statistics of the data, including the root cause of failures, the mean time between failures, and the mean time to repair. We find, for example, that average failure rates differ wildly across systems, ranging from 20 to 1000 failures per year, and that time between failures is modeled well by a Weibull distribution with decreasing hazard rate. From one system to another, mean repair time varies from less than an hour to more than a day, and repair times are well modeled by a lognormal distribution.
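To make "Weibull distribution with decreasing hazard rate" concrete, here is a minimal sketch of the standard Weibull hazard function h(t) = (k/λ)(t/λ)^(k-1), which decreases over time exactly when the shape parameter k is below 1. The function name `weibull_hazard` and the parameter values are illustrative assumptions, not figures taken from the paper.

```python
def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate of a Weibull(shape, scale) at time t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)

# Illustrative parameters only: shape < 1 gives a decreasing hazard rate.
shape, scale = 0.7, 100.0
for t in (1.0, 10.0, 100.0, 1000.0):
    print(t, weibull_hazard(t, shape, scale))
```

With a shape below 1, the longer a system has run since its last failure, the lower its instantaneous failure rate. A shape of exactly 1 reduces to the constant hazard of an exponential distribution.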
Article
A network (G,R) consists of a given undirected graph G of order n and a routing R, that is a collection of n(n-1) simple paths connecting every ordered pair of vertices of G. Chung, Coffman, Reiman and Simon defined the forwarding index ξ(G,R) of a network (G,R) as the maximum number of paths of R passing through any vertex of G. Similarly we define the edge-forwarding index of a network (G,R) as the maximum number of paths of R passing through any edge of G. These parameters might be of interest in different applications concerning communication networks. The forwarding (resp. edge-forwarding) index corresponds to the maximum amount of forwarding done by any node (resp. edge). The edge-forwarding index also corresponds to the maximum load of the network. Therefore it is of interest, for a given graph, to find routings minimizing these indices and we shall define the forwarding (edge-forwarding) index of a graph as the minimum taken over all possible indices of the possible networks. In this paper we give bounds on these forwarding indices, in particular as a function of the connectivity of the graph, and calculate them for products of graphs and for some specific graphs.
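The two indices defined in this abstract can be computed directly from a routing. The sketch below is a minimal illustration under two assumptions: a vertex is counted only for paths in which it is an interior vertex (endpoints do not forward), and an undirected edge is counted for paths in either direction. The names `forwarding_indices` and `star_routing` are hypothetical.

```python
from collections import Counter

def forwarding_indices(routing):
    """Return (vertex forwarding index, edge-forwarding index) of a routing.

    routing maps each ordered pair (u, v) to its path as a list of vertices.
    """
    vertex_load = Counter()
    edge_load = Counter()
    for path in routing.values():
        # Assumption: only interior vertices forward, endpoints do not.
        for v in path[1:-1]:
            vertex_load[v] += 1
        # Undirected edges: both traversal directions load the same edge.
        for a, b in zip(path, path[1:]):
            edge_load[frozenset((a, b))] += 1
    return (max(vertex_load.values(), default=0),
            max(edge_load.values(), default=0))

def star_routing(n):
    """Only possible routing on a star with centre 0 and leaves 1..n-1."""
    routing = {}
    for u in range(n):
        for v in range(n):
            if u != v:
                routing[(u, v)] = [u, v] if 0 in (u, v) else [u, 0, v]
    return routing

print(forwarding_indices(star_routing(4)))  # (6, 6)
```

For a star on n vertices the centre forwards all (n-1)(n-2) leaf-to-leaf paths and each edge carries 2(n-1) paths, which gives (6, 6) for n = 4.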
Conference Paper
In this paper we isolate a combinatorial problem that, we believe, lies at the heart of this question and provide some encouragingly positive solutions to it. We show that there exists an N-processor realistic computer that can simulate arbitrary idealistic N-processor parallel computations with only a factor of O(log N) loss of runtime efficiency. The main innovation is an O(log N) time randomized routing algorithm. Previous approaches were based on sorting or permutation networks, and implied loss factors of order at least (log N)^2.
Conference Paper
The fat-tree topology has become a popular choice for InfiniBand fabrics due to its inherent deadlock freedom, fault-tolerance and full bisection bandwidth. InfiniBand is used by more than 40% of the systems on the latest Top 500 list, and many of these systems are based on a fat-tree topology. However, the current InfiniBand fat-tree routing algorithm suffers from flaws that reduce its scalability and flexibility. Counter-intuitively, the achievable throughput per node deteriorates both when the number of nodes in a tree decreases and when the node distribution among leaves is nonuniform. In this paper, we identify the weaknesses of the current enhanced fat-tree routing algorithm in Open Fabrics Enterprise Distribution and we propose extensions to it that alleviate all performance problems related to node distribution. The new algorithm is implemented in OpenSM for real world evaluation and for future contribution to the Open Fabrics community. We demonstrate that our solution makes it possible to achieve a predictable high throughput regardless of the number of nodes and their distribution. Furthermore, the simulations show that our extensions improve throughput by up to 30% depending on topology size and node distribution.