Conference Paper

All-to-All Routing Algorithm for Galaxyfly Networks

Article
With the increasing scale of parallel computer interconnection networks, the possibility of processor failure or link failure between processors in the network is also increasing. In the design of supercomputers, not only link overhead and communication delay should be taken into account, but also fault-tolerant performance of networks should be emphasized. Locally exchanged twisted cube (LeTQ) is a newly proposed interconnection network with lower link overhead and shorter diameter. With the increasing scale of supercomputers, fault-tolerant routing is indispensable. In this paper, we propose a new load balancing fault-tolerant routing algorithm based on node contraction for LeTQ networks. The proposed algorithm uses the node shrinkage method to evaluate the priority of nodes. The sending node adaptively adjusts the probability of forwarding packets to the neighbor node according to the priority of the neighbor node and the state of the network. The path can be adapted to the load state of the network. The simulation results show that the fault-tolerant routing algorithm has good performance in throughput and delay.
Article
Dragonfly topologies are gathering great interest as one of the most promising interconnect options for High-Performance Computing systems. Dragonflies contain physical cycles that may lead to traffic deadlocks unless the routing algorithm prevents them properly. Previous topology-aware algorithms are difficult to implement, or even unfeasible, in systems based on the InfiniBand (IB) architecture, which is the most widely used network technology in HPC systems. In this paper, we present a new deterministic, minimal-path routing for Dragonfly that prevents deadlocks using VLs according to the IB specification, so that it can be straightforwardly implemented in IB-based networks. We have called this proposal D3R (Deterministic Deadlock-free Dragonfly Routing). D3R is scalable as it requires only 2 VLs to prevent deadlocks regardless of network size, i.e. fewer VLs than those required by the deadlock-free routing engines available in IB that are suitable for Dragonflies. Alternatively, D3R achieves higher throughput if an additional VL is used to reduce internal contention in the Dragonfly groups. We have implemented D3R as a new routing engine in OpenSM, the control software including the subnet manager in IB. We have evaluated D3R by means of simulation and by experiments performed in a real IB-based cluster, the results showing that, in general, D3R outperforms other routing engines.
Article
Dragonfly networks have been widely used in current high-performance computers and high-end servers. Fault-tolerant routing in dragonfly networks is essential. The rich interconnects provide good fault-tolerance ability for the network. A new deadlock-free adaptive fault-tolerant routing algorithm, based on a new two-layer safety information model, is proposed by mapping routers in a group, and groups of the dragonfly network, into two separate hypercubes. The new fault-tolerant routing algorithm tolerates static and dynamic faults. Our method can determine whether a packet can reach the destination at the source by using the new safety information model, which avoids dead-ends and aimless misrouting. Extensive simulation results show that the proposed fault-tolerant routing algorithm even outperforms the previous minimal routing algorithm in fault-free networks in many cases.
Article
A 3D stacked network-on-chip (NOC) promises the integration of a large number of cores in a many-core system-on-chip (SOC). The NOC can be used to test the embedded cores in such SOCs, whereby the added cost of dedicated test-access hardware can be avoided. However, a potential problem associated with 3D NOC-based test access is the emergence of hotspots due to stacking and the high toggle rates associated with structural test patterns used for manufacturing test. High temperatures and hotspots can lead to the failure of good parts, resulting in yield loss. We describe a unicast-based multicast approach and a thermal-driven test scheduling method to avoid hotspots, whereby the full NOC bandwidth is used to deliver test packets. Test delivery is carried out using a new unicast-based multicast scheme. Experimental results highlight the effectiveness of the proposed method in reducing test time under thermal constraints.
Article
A new deadlock-free unicast-based broadcast scheme is proposed based on a new routing scheme called minus-first routing. Minus-first routing is a partially adaptive routing scheme in dragonfly networks without any virtual channels. The main goals of the broadcast schemes are to minimize the total delivery time and to ensure that no router receives any message more than once. No channel competition is introduced. Two different broadcast schemes are proposed: (1) the group-first, and (2) the router-first. It is shown that unicast-based broadcast schemes are necessary to avoid deadlocks at the consumption channels. The group-first broadcast scheme delivers a message to all groups as early as possible, and the router-first scheme minimizes the number of unicast steps to traverse global links. To our knowledge, the method in this paper is the first collective communication work for dragonfly networks in the literature. Simulation results are presented to evaluate the proposed unicast-based broadcast schemes.
Article
We introduce a high-performance cost-effective network topology called Slim Fly that approaches the theoretically optimal network diameter. Slim Fly is based on graphs that approximate the solution to the degree-diameter problem. We analyze Slim Fly and compare it to both traditional and state-of-the-art networks. Our analysis shows that Slim Fly has significant advantages over other topologies in latency, bandwidth, resiliency, cost, and power consumption. Finally, we propose deadlock-free routing schemes and physical layouts for large computing centres as well as a detailed cost and power model. Slim Fly enables constructing cost-effective and highly resilient data center and HPC networks that offer low latency and high bandwidth under different HPC workloads such as stencil or graph computations.
Article
Multicast communication, in which the same message is delivered from a source node to an arbitrary number of destination nodes, is being increasingly demanded in parallel computing. System-supported multicast services can potentially offer improved performance, increased functionality, and simplified programming, and may in turn be used to support various higher-level operations for data movement and global process control. This paper presents efficient algorithms to implement multicast communication in wormhole-routed direct networks, in the absence of hardware multicast support, by exploiting the properties of the switching technology. Minimum-time multicast algorithms are presented for n-dimensional meshes and hypercubes that use deterministic, dimension-ordered routing of unicast messages. Both algorithms can deliver a multicast message to m-1 destinations in ⌈log₂ m⌉ message passing steps, while avoiding contention among the constituent unicast messages. Performance results of implementations on a 64-node nCUBE-2 hypercube and a 168-node Symult 2010 2-D mesh are given.
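The ⌈log₂ m⌉ step count quoted in this abstract can be illustrated with a recursive-halving schedule, in which every node that already holds the message hands half of its remaining destinations to a new holder in each step. The following Python sketch is only an illustration under stated assumptions: the function name `multicast_schedule` and the list-based node ordering are hypothetical, and the sketch does not model the dimension-ordered node ordering that the paper relies on to avoid channel contention.

```python
def multicast_schedule(nodes):
    """Return a list of steps; each step is a list of (sender, receiver) pairs.

    nodes[0] is the source and nodes[1:] are the destinations (assumed order).
    Each holder owns a contiguous segment [lo, hi) of the list and sits at
    index lo; in every step it sends to the first node of the upper half.
    """
    steps = []
    segments = [(0, len(nodes))]
    while any(hi - lo > 1 for lo, hi in segments):
        step = []
        next_segments = []
        for lo, hi in segments:
            if hi - lo <= 1:
                # Nothing left to deliver in this segment.
                next_segments.append((lo, hi))
                continue
            # The sender keeps the larger-or-equal lower half.
            mid = lo + (hi - lo + 1) // 2
            step.append((nodes[lo], nodes[mid]))
            next_segments.append((lo, mid))
            next_segments.append((mid, hi))
        steps.append(step)
        segments = next_segments
    return steps
```

For example, `multicast_schedule(["s", "a", "b", "c"])` returns `[[("s", "b")], [("s", "a"), ("b", "c")]]`: three destinations are reached in two steps. Because the largest segment shrinks from size k to ⌈k/2⌉ in each step, m nodes need ⌈log₂ m⌉ steps.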
Conference Paper
In the push to achieve exascale performance, systems will grow to over 100,000 sockets, as growing cores-per-socket and improved single-core performance provide only part of the speedup needed. These systems will need affordable interconnect structures that scale to this level. To meet the need, we consider an extension of the hypercube and flattened butterfly topologies, the HyperX, and give an adaptive routing algorithm, DAL. HyperX takes advantage of high-radix switch components that integrated photonics will make available. Our main contributions include a formal descriptive framework, enabling a search method that finds optimal HyperX configurations; DAL; and a low cost packaging strategy for an exascale HyperX. Simulations show that HyperX can provide performance as good as a folded Clos, with fewer switches. We also describe a HyperX packaging scheme that reduces system cost. Our analysis of efficiency, performance, and packaging demonstrates that the HyperX is a strong competitor for exascale networks.
Article
Algorithms for performing gossiping on one- and higher-dimensional meshes are presented. As a routing model, the practically important wormhole routing is assumed. We especially focus on the trade-off between the start-up time and the transmission time. For one-dimensional arrays and rings, we give a novel lower bound and an asymptotically optimal gossiping algorithm for all choices of the parameters involved. For two-dimensional meshes and tori, a simple algorithm composed of one-dimensional phases is presented. For an important range of packet and mesh sizes, it gives clear improvements upon previously developed algorithms. The algorithm is analyzed theoretically and the achieved improvements are also convincingly demonstrated by simulations, as well as an implementation on the Paragon. On the Paragon, our algorithm even outperforms the gossiping routine provided in the NX message-passing library. For higher-dimensional meshes, we give algorithms which are based on an interesting generalization of the notion of a diagonal. These algorithms are analyzed theoretically, as well as by simulation.
Article
We show that deadlocks due to dependencies on consumption channels are a fundamental problem in wormhole multicast routing. This type of resource deadlock has not been addressed in many previously proposed wormhole multicast algorithms. We also show that deadlocks on consumption channels can be avoided by using multiple classes of consumption channels and restricting the use of consumption channels by multicast messages. We provide upper bounds for the number of consumption channels required to avoid deadlocks. In addition, we present a new multicast routing algorithm, column-path, which is based on the well-known dimension-order routing used in many multicomputers and multiprocessors. Therefore, this algorithm could be implemented in existing multicomputers with simple changes to the hardware. Using simulations, we compare the performance of the proposed column-path algorithm with the previously proposed Hamiltonian-path-based multipath and e-cube-based multicast routing algorithms. Our results show that for multicast traffic, the column-path routing offers higher throughputs, while the multipath algorithm offers lower message latencies. Another result of our study is that the commonly implemented simplistic scheme of sending one copy of a multicast message to each of its destinations exhibits good performance provided the number of destinations is small.
Chapter
New deadlock-free unicast-based all-to-all broadcast algorithms are proposed for dragonfly networks. An all-to-all broadcast delivers a message from each router to all routers. Two different all-to-all broadcast algorithms, GFA2A and RFA2A, using the previous group-first and router-first one-to-all broadcast schemes are presented. A new all-to-all broadcast algorithm named A2A is presented by collecting all messages from all routers in the same group to a single router first and combining them, which are forwarded to all routers in the same group. Each router forwards messages to all other routers in the same group after receiving all messages from other groups. The proposed algorithms can be implemented with the unicast hardware, that is, each input port is assigned two indistinguishable buffers.
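The collect, combine, and forward idea described in this abstract can be loosely sketched as a three-phase simulation over sets of messages. This is one possible reading of the abstract, not the published A2A algorithm: the function name `all_to_all_broadcast`, the choice of router 0 as the collecting router in each group, and the direct exchange between collecting routers are all assumptions made for illustration, and the sketch ignores the actual dragonfly links, routing, and buffers.

```python
def all_to_all_broadcast(num_groups, routers_per_group):
    """Simulate a gather / exchange / distribute all-to-all broadcast.

    Routers are identified as (group, index). Returns a dict mapping each
    router to the set of message origins it holds at the end.
    """
    # Each router initially holds only its own message.
    held = {(g, r): {(g, r)}
            for g in range(num_groups)
            for r in range(routers_per_group)}

    def leader(g):
        # Hypothetical choice: router 0 collects for its group.
        return (g, 0)

    # Phase 1: every router sends its message to the group's collecting router.
    for g in range(num_groups):
        for r in range(1, routers_per_group):
            held[leader(g)] |= held[(g, r)]

    # Phase 2: collecting routers exchange their combined group messages.
    combined = {g: set(held[leader(g)]) for g in range(num_groups)}
    for g in range(num_groups):
        for other in range(num_groups):
            if other != g:
                held[leader(g)] |= combined[other]

    # Phase 3: each collecting router forwards everything within its group.
    for g in range(num_groups):
        for r in range(1, routers_per_group):
            held[(g, r)] |= held[leader(g)]
    return held
```

Calling `all_to_all_broadcast(3, 4)` leaves each of the 12 routers holding all 12 messages. Combining messages before the inter-group exchange means each pair of groups exchanges one combined message per direction in this model, rather than one message per router.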
Article
Interconnection networks play an essential role in the architecture of high-performance computing (HPC) systems. In this article, we explore the Galaxyfly family to build flexible-scale interconnection networks. Galaxyfly is guaranteed to retain a small constant diameter while achieving a flexible tradeoff between network scale and bisection bandwidth. Galaxyfly not only supports small-scale interconnection networks with smaller diameter but also lowers the demands for high-radix routers and is able to utilize routers with moderate radix to build exascale interconnection networks. We analyze the constructible configuration of Galaxyfly and evaluate the properties of Galaxyfly. We conduct extensive simulations and analysis to evaluate the performance, cost, and power consumption of Galaxyfly on physical layout against state-of-the-art topologies. The results show that our design achieves better performance than most existing topologies under typical HPC workloads, and is cost-effective to deploy for exascale HPC systems.
Conference Paper
Future pervasive applications, like mobile augmented reality, have huge bandwidth and computation demands and very stringent delay constraints. Edge computing has been proposed to cope with such challenging requirements, since it shortens significantly the distance between the end users and the servers. On the other hand, serverless computing is emerging among cloud technologies to respond to the need of highly scalable event-driven execution of stateless tasks. In this paper, we investigate the convergence of the two to enable very low-latency execution of short-lived stateless tasks whose computation is offloaded from the user terminal to servers hosted by or close to edge devices in mobile pervasive environments. We realized a proof-of-concept implementation to delve into the specific issue of efficient dispatching of tasks in a distributed manner to achieve high scalability. We evaluated our proposed algorithm with experiments in a large-scale emulated network environment, showing that our solution achieves similar or better delay performance than a centralized solution, with far less network utilization.
Article
High-radix routers with lower latency and higher bandwidth play an increasingly important role in constructing large-scale interconnection networks such as those used in supercomputers and datacenters. The tile-based crossbar approach partitions a single large crossbar into many small tiles and can considerably reduce the complexity of arbitration while providing higher throughput than the conventional switch implementation. However, it is not scalable due to power consumption, placement, and routing problems. Inspired by non-saturated throughput theory, this paper proposes a scalable router microarchitecture, termed Multiport Binding Tile-based Router (MBTR). By aggregating multiple physical ports into a single tile, a high-radix router can be flexibly organized into different tile arrays, thus the number of tiles and hardware overhead can be considerably reduced. For a radix-64 router, MBTR achieves up to a 50∼75% reduction in memory consumption as well as wire area compared with a hierarchical switch. We theoretically deduce the sufficient and necessary conditions for the asymmetrical crossbar to achieve un-saturated relative 100% throughput. Based on this observation we analyze the MBTR throughput and derive the condition that should be satisfied by the MBTR design parameters to yield 100% throughput. We further discuss how to make a trade-off between MBTR parameters based on the constraints of performance, power and area. The simulation results demonstrate MBTR is indistinguishable from the YARC router in terms of throughput and delay, and can even outperform it by reducing potential contention for output ports. We have fabricated a 36-port MBTR chip at 28nm, providing 100Gb/s bidirectional bandwidth per port, with a fall-through latency of just 30ns. Internally it runs at 9.6Tb/s, thus offering a speedup of 1.34×.
Article
Clouds offer flexible and economically attractive compute and storage solutions for enterprises. However, the effectiveness of cloud computing for high-performance computing (HPC) systems still remains questionable. When clouds are deployed on lossless interconnection networks, like InfiniBand (IB), challenges related to load-balancing, low-overhead virtualization, and performance isolation hinder full potential utilization of the underlying interconnect. Moreover, cloud data centers incorporate a highly dynamic environment rendering static network reconfigurations, typically used in IB systems, infeasible. In this paper, we present a framework for a self-adaptive network architecture for HPC clouds based on lossless interconnection networks, demonstrated by means of our implemented IB prototype. Our solution, based on a feedback control and optimization loop, enables the lossless HPC network to dynamically adapt to the varying traffic patterns, current resource availability, workload distributions, and also in accordance with the service provider-defined policies. Furthermore, we present IBAdapt, a simplified rule-based language for the service providers to specify adaptation strategies used by the framework. Our developed self-adaptive IB network prototype is demonstrated using state-of-the-art industry software. The results obtained on a test cluster demonstrate the feasibility and effectiveness of the framework when it comes to improving Quality-of-Service compliance in HPC clouds.
Conference Paper
With low-delay switches on the horizon, end-to-end latency in large-scale High Performance Computing (HPC) interconnects will be dominated by cable delays. In this context we define a new network topology, Skywalk, for deploying low-latency interconnects in upcoming HPC systems. Skywalk uses randomness to achieve low latency, but does so in a way that accounts for the physical layout of the topology so as to lead to further cable length and thus latency reductions. Via graph analysis and discrete-event simulation we show that Skywalk compares favorably (in terms of latency, cable length, and throughput) to traditional low-degree torus and moderate-degree hypercube topologies, to high-degree fully-connected Dragonfly topologies, to the HyperX topology, and to recently proposed fully random topologies.
Conference Paper
Higher global bandwidth requirement for many applications and lower network cost have motivated the use of the Dragonfly network topology for high performance computing systems. In this paper we present the architecture of the Cray Cascade system, a distributed memory system based on the Dragonfly [1] network topology. We describe the structure of the system, its Dragonfly network and the routing algorithms. We describe a set of advanced features supporting both mainstream high performance computing applications and emerging global address space programming models. We present a combination of performance results from prototype systems and simulation data for large systems. We demonstrate the value of the Dragonfly topology and the benefits obtained through extensive use of adaptive routing.
Article
McKay, Miller and Širáň used a voltage graph construction to introduce three families of graphs of order 2q^2, where q is a prime power. These graphs have diameter 2 and some of the largest known graphs of diameter 2 and given degree come from these families. We provide an alternative description of these graphs as modified incidence graphs of an affine plane. This leads to a complete determination of their automorphism groups.
Article
This paper proposes multidestination message passing on wormhole k-ary n-cube networks using a new base-routing-conformed-path (BRCP) model. This model allows both unicast (single-destination) and multidestination messages to co-exist in a given network without leading to deadlock. The model is illustrated with several common routing schemes (deterministic, as well as adaptive), and the associated deadlock-freedom properties are analyzed. Using this model, a set of new algorithms for popular collective communication operations, broadcast and multicast, are proposed and evaluated. It is shown that the proposed algorithms can considerably reduce the latency of these operations compared to the Umesh (unicast-based multicast) and the Hamiltonian path-based schemes. A very interesting result that is presented shows that a multicast can be implemented with reduced or near-constant latency as the number of processors participating in the multicast increases beyond a certain number. It is also shown that the BRCP model can take advantage of adaptivity in routing schemes to further reduce the latency of these operations. The multidestination mechanism and the BRCP model establish a new foundation to provide fast and scalable collective communication support on wormhole-routed systems.
Article
Multicast communication services, in which the same message is delivered from a source node to an arbitrary number of destination nodes, are being provided in new-generation multicomputers. Broadcast is a special case of multicast in which a message is delivered to all nodes in the network. The nCUBE-2, a wormhole-routed hypercube multicomputer, provides hardware support for broadcast and a restricted form of multicast in which the destinations form a subcube. However, the broadcast routing algorithm adopted in the nCUBE-2 is not deadlock-free. In this paper, four multicast wormhole routing strategies for 2-D mesh multicomputers are proposed and studied. All of the algorithms are shown to be deadlock-free. These are the first deadlock-free multicast wormhole routing algorithms ever proposed. A simulation study has been conducted that compares the performance of these multicast algorithms under dynamic network traffic conditions in a 2-D mesh. The results indicate that a dual-path routing algorithm offers performance advantages over tree-based, multipath, and fixed-path algorithms.
Article
Efficient routing of messages is a key to the performance of multicomputers. Multicast communication refers to the delivery of the same message from a source node to an arbitrary number of destination nodes. While multicast communication is highly demanded in many applications, most of the existing multicomputers do not directly support this service; rather it is indirectly supported by multiple one-to-one or broadcast communications, which result in more network traffic and a waste of system resources. The authors study routing evaluation criteria for multicast communication under different switching technologies. Multicast communication in multicomputers is formulated as a graph theoretical problem. Depending on the evaluation criteria and switching technologies, they study three optimal multicast communication problems, which are equivalent to the finding of the following three subgraphs: optimal multicast path, optimal multicast cycle, and minimal Steiner tree, where the interconnection of a multicomputer defines a host graph. They show that all these optimization problems are NP-complete for the popular 2D-mesh and hypercube host graphs. Heuristic multicast algorithms for these routing problems are proposed.
N. Jiang, J. Kim, and W.J. Dally, "Gossiping on meshes and tori," IEEE Trans. Parallel Distrib. Syst., vol. 9, no. 6, pp. 513-525, 1998.