Article

A new proposal to deal with congestion in InfiniBand-based fat-trees

Author affiliations:
  • University of Castilla-La Mancha, Albacete, Spain
  • Simula Metropolitan Center for Digital Engineering
... Some static schemes, such as VOQnet [2], VOQsw [29], DAMQs [29], DBBM [23], or DSBM [24], are topology- and routing-agnostic. By contrast, other static schemes, like vFtree [10] and Flow2SL [5], are tailored to specific routing algorithms and network topologies, so they reduce HoL blocking more efficiently than agnostic schemes while requiring similar resources. Indeed, the latter schemes take advantage of topology and routing knowledge to separate flows as much as possible with the available queues. ...
... As Q is small, several source-destination pairs are mapped to the same queue. Unfortunately, vFtree is effective only in Fat-Trees with fewer than three stages, so Flow2SL was proposed to overcome the vFtree flaws [5]. Flow2SL defines as many groups of consecutive destination end-nodes as the number of queues available at each port (Q). ...
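As a rough illustration of how such a static queuing scheme can map flows to queues, the following self-contained C++ sketch splits destinations into Q groups of consecutive end-nodes and combines the source and destination groups into a queue (VL) index. N, Q, and the combination rule are our illustrative assumptions, not the published Flow2SL algorithm.

    #include <cstdio>

    // Hedged sketch of a Flow2SL-style static mapping: destinations are
    // split into Q groups of consecutive end-nodes, and the source and
    // destination groups are combined into a queue (VL) index.
    constexpr int N = 16;  // end-nodes (assumption)
    constexpr int Q = 4;   // queues (VLs) per port (assumption)

    int group(int node) { return node / (N / Q); }

    int queueFor(int src, int dst) {
        // Flows from one source group to different destination groups
        // land in different queues, spreading them over the VLs.
        return (group(dst) - group(src) + Q) % Q;
    }

    int main() {
        for (int src = 0; src < N; src += N / Q)
            for (int dst = 0; dst < N; dst += N / Q)
                printf("src %2d -> dst %2d : queue %d\n", src, dst, queueFor(src, dst));
    }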
... Fig. 6 shows a NEtwork Description (NED) diagram of the assumed switch model. Specifically, the port module models the credit-based flow control and the Virtual Channels (VCs), and it keeps track of the incoming and outgoing packets to the switch. (Note that the OMNeT++ framework uses the NED language to define the simple and compound modules, and the connections among them using ''gates''. Each of these modules is implemented in C++.) ...
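Since the port module above models credit-based flow control per VC, a minimal sketch of that bookkeeping may help. The class and method names are hypothetical, not the simulator's actual code.

    #include <cstdio>

    // Minimal credit-based flow-control bookkeeping for one output
    // port, with one credit counter per Virtual Channel (VC). A flit
    // may be forwarded on a VC only while the downstream buffer still
    // advertises credits for it.
    constexpr int VCS = 2;

    struct OutPort {
        int credits[VCS];                       // free slots downstream, per VC

        bool canSend(int vc) const  { return credits[vc] > 0; }
        void onFlitSent(int vc)     { --credits[vc]; }  // consume a credit
        void onCreditReturn(int vc) { ++credits[vc]; }  // neighbor freed a slot
    };

    int main() {
        OutPort p{{2, 2}};
        for (int i = 0; i < 3; ++i) {
            if (p.canSend(0)) { p.onFlitSent(0); printf("sent flit %d on VC0\n", i); }
            else              printf("VC0 stalled: waiting for credits\n");
        }
        p.onCreditReturn(0);
        printf("after credit return, canSend(VC0) = %d\n", p.canSend(0));
    }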
Article
The interconnection network is a key element in High-Performance Computing (HPC) and Datacenter (DC) systems whose performance depends on several design parameters, such as the topology, the switch architecture, and the routing algorithm. Among the most common topologies in HPC systems, the Fat-Tree offers several shortest-path routes between any pair of end-nodes, which allows multi-path routing schemes to balance traffic flows among the available links, thus reducing the probability of congestion. However, traffic balancing cannot by itself solve some congestion situations that may still degrade network performance. Another approach to reducing congestion is queue-based flow separation, but our previous work shows that multi-path routing may spread congested flows across several queues, thus being counterproductive. In this paper, we propose a set of restrictions to improve the selection of alternative routes for multi-path routing algorithms in Fat-Tree networks, so that they can be positively combined with queuing schemes.
... A handful of SQSs are tailored to fat trees, such as vFtree [6] and Flow2SL [7]. In the original Flow2SL paper, a set of metrics is defined to measure the quality of a given mapping policy: if Φ = 0 for all the VLs, any flow destination is assigned to only one VL at that port (i.e. ...
... The main limitation of Flow2SL is that the number of available VLs must be a divisor of the number of fat-tree pods [7]. For example, the number of available VLs for the fat tree of Fig. ... being assigned to different VLs at third-stage switches (i.e. ...
... We quantify the Path2SL performance by comparing it with Flow2SL [7] configured with several VLs, and with D-mod-K routing [2] using 1 VL as a baseline. We have obtained the evaluation results through simulation of 216-node (6-pod) and 1000-node (10-pod) fat trees, and through experiments in a real IB-based cluster (configured as a 45-node fat tree) under several benchmarks. ...
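D-mod-K routing deterministically balances destinations over the up-links of a fat-tree. Below is a hedged sketch of one common formulation; the stage-indexing convention is ours and may differ from the cited paper's definition.

    #include <cstdio>

    // Hedged sketch of destination-modulo-k ("D-mod-K") up-port
    // selection in a k-ary tree: on the upward phase, stage s picks
    // up-link (dst / k^s) % k, so consecutive destinations spread
    // over the available links.
    constexpr int K = 3;       // switch arity (illustration)
    constexpr int STAGES = 2;  // upward hops (illustration)

    int upPort(int dst, int stage) {
        int div = 1;
        for (int s = 0; s < stage; ++s) div *= K;
        return (dst / div) % K;
    }

    int main() {
        for (int dst = 0; dst < K * K; ++dst) {
            printf("dst %2d -> up-ports:", dst);
            for (int s = 0; s < STAGES; ++s) printf(" %d", upPort(dst, s));
            printf("\n");
        }
    }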
Article
The number of endnodes in high-performance computing (HPC) and Datacenter (DC) systems is constantly increasing. Hence, it is crucial to minimize the impact of network congestion to guarantee suitable network performance. InfiniBand is a prominent interconnect technology that allows implementing efficient topologies and routing algorithms, as well as queuing schemes that reduce the Head-of-Line (HoL) blocking effect derived from congestion situations. Here we explain and thoroughly evaluate a queuing scheme called Path2SL, which optimizes the use of the InfiniBand Virtual Lanes (VLs) to reduce HoL blocking in fat-tree network topologies.
... Note that the above definitions are just a rephrasing of well-known concepts presented by Duato [33], [34] and Dally & Seitz [8]. We have included them for the sake of completeness of these theoretical foundations. ...
... As such, it is responsible for holding a large set of queues storing incoming flits, one for each VL and output port pair. The ibuf limits the growth of any of these queues to prevent a specific VL from hogging the full buffer space [34]. Hence, a minimum buffer space is guaranteed to any VL at both host channel adapter (HCA) and switch ports. ...
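A minimal sketch of the per-VL growth limit just described, assuming a single shared input buffer and a naive cap that reserves one slot per other VL; the actual policy in [34] is more elaborate.

    #include <cstdio>

    // Hedged sketch: a shared input buffer where no single VL may
    // occupy more than a cap, so a congested VL cannot hog space
    // that the other VLs are guaranteed.
    constexpr int VLS = 4;
    constexpr int BUF = 64;               // total flit slots (assumption)
    constexpr int CAP = BUF - (VLS - 1);  // per-VL cap (assumption)

    struct IBuf {
        int used[VLS] = {0};
        int total = 0;

        bool accept(int vl) {
            if (total >= BUF || used[vl] >= CAP) return false;
            ++used[vl]; ++total;
            return true;
        }
    };

    int main() {
        IBuf b;
        int accepted = 0;
        while (b.accept(0)) ++accepted;   // a congested VL0 tries to fill it
        printf("VL0 capped at %d of %d slots\n", accepted, BUF);
        printf("VL1 still accepted: %d\n", b.accept(1));  // guaranteed space
    }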
... In summary, experiments under real and simulated scenarios confirm D3R as a new routing engine for IB-based Dragonflies, offering deadlock freedom by using a small number of VLs and leaving the remaining VLs free to be used for other purposes, such as reducing the impact of congestion [34], [51] and/or providing Quality of Service (QoS) for differentiated services [52]. ...
Article
Full-text available
Dragonfly topologies are gathering great interest as one of the most promising interconnect options for High-Performance Computing systems. Dragonflies contain physical cycles that may lead to traffic deadlocks unless the routing algorithm prevents them properly. Previous topology-aware algorithms are difficult to implement, or even unfeasible, in systems based on the InfiniBand (IB) architecture, which is the most widely used network technology in HPC systems. In this paper, we present a new deterministic, minimal-path routing for Dragonfly that prevents deadlocks using VLs according to the IB specification, so that it can be straightforwardly implemented in IB-based networks. We have called this proposal D3R (Deterministic Deadlock-free Dragonfly Routing). D3R is scalable as it requires only 2 VLs to prevent deadlocks regardless of network size, i.e. fewer VLs than required by the deadlock-free routing engines available in IB that are suitable for Dragonflies. Alternatively, D3R achieves higher throughput if an additional VL is used to reduce internal contention in the Dragonfly groups. We have implemented D3R as a new routing engine in OpenSM, the control software including the subnet manager in IB. We have evaluated D3R by means of simulation and by experiments performed in a real IB-based cluster, the results showing that, in general, D3R outperforms other routing engines.
... Therefore, switches implementing VOQs and VCs at the same time potentially achieve higher performance than switches without this functionality. Indeed, the availability of both VOQs and VCs in current switch architectures is leading to new proposals that leverage these features to enhance overall system performance [8]. ...
... This means that Virtual Output Queues (VOQs) are implemented [10]. At the same time, these switches require a policy to mitigate buffer hogging [8]; thus, Virtual Channels (VCs) are needed. We detail these features in the next sections. ...
... This effect arises when packet flows in the same VOQ, but in different VCs, consume the whole buffer, preventing other packets from arriving at the input port, even if they are addressed to free output ports. Nevertheless, it has been reported in [8] that having VOQs and VCs in a switch can enhance switch performance when properly managed. ...
... Moreover, congestion spreading also amplifies the negative effects of congestion (i.e., HoL blocking and buffer hogging), since they appear more often as congestion spreads. Congestion spreading also spoils the performance of techniques that try to reduce HoL blocking by statically separating flows into different queues (or virtual channels, VCs) [28][29][30], as the spreading congested flows are likely to be present in more queues (see section 2.3). ...
... For instance, vFtree [29] takes into account the source and destination leaf switch of each packet, and shuffles the packets addressed to consecutive leaf switches among the available queues in a switch buffer. Another scheme is Flow2SL [30], which defines as many groups of consecutive destinations as the number of available queues, and maps packets to the different queues or VLs depending on their source and destination group. On the other hand, there are other proposals agnostic to the network topology and routing, such as Destination-Based Buffer Management (DBBM) [28]. ...
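Of the schemes just listed, DBBM has the simplest mapping: the queue is chosen from the destination identifier alone. A minimal sketch, with the number of queues Q chosen arbitrarily for illustration.

    #include <cstdio>

    // Hedged sketch of DBBM-style selection: the queue is chosen from
    // the low-order bits of the destination ID alone (dst mod Q), with
    // no topology or routing knowledge required.
    constexpr int Q = 4;

    int dbbmQueue(int dst) { return dst % Q; }

    int main() {
        // Destinations sharing low-order bits share a queue, so one
        // congested destination blocks at most the flows in its queue.
        for (int dst = 0; dst < 8; ++dst)
            printf("dst %d -> queue %d\n", dst, dbbmQueue(dst));
    }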
Preprint
Full-text available
The interconnection network is a crucial subsystem in High-Performance Computing clusters and Data-centers, guaranteeing high bandwidth and low latency to the applications' communication operations. Unfortunately, congestion situations may spoil network performance unless the network design applies specific countermeasures. Adaptive routing algorithms are a traditional approach to dealing with congestion since they provide traffic flows with alternative routes that bypass congested areas. However, adaptive routing decisions at switches are typically based on local information without a global network traffic perspective, leading to congestion spreading throughout the network beyond the original congested areas. In this paper, we propose a new efficient congestion management strategy that leverages the adaptive routing notifications currently available in some interconnect technologies and efficiently isolates the congesting flows in reserved spaces at switch buffers. The experimental results, based on simulations of realistic traffic scenarios, show that our proposal removes the impact of congestion.
... Some static schemes, such as VOQnet [17], VOQsw [25], DAMQs [25], DBBM [26], or DSBM [27], are topology- and routing-agnostic. By contrast, other static schemes, like vFtree [28] and Flow2SL [29], are tailored to specific routing algorithms and network topologies, so they reduce HoL blocking more efficiently than agnostic schemes while requiring similar resources. Indeed, the latter schemes take advantage of topology and routing knowledge to separate flows as much as possible with the available queues. ...
... As Q is small, several source-destination pairs are mapped to the same queue. Unfortunately, vFtree is effective only in Fat-Trees with fewer than three stages, so Flow2SL was proposed to overcome the vFtree flaws [29]. Flow2SL defines as many groups of consecutive destination end-nodes as the number of queues available at each port (Q). ...
Preprint
Full-text available
The interconnection network is a key element in High-Performance Computing (HPC) and Datacenter (DC) systems whose performance depends on several design parameters, such as the topology, the switch architecture, and the routing algorithm. Among the most common topologies in HPC systems, the Fat-Tree offers several shortest-path routes between any pair of end-nodes, which allows multi-path routing schemes to balance traffic flows among the available links, thus reducing the probability of congestion. However, traffic balancing cannot by itself solve some congestion situations that may still degrade network performance. Another approach to reducing congestion is queue-based flow separation, but our previous work shows that multi-path routing may spread congested flows across several queues, thus being counterproductive. In this paper, we propose a set of restrictions to improve the selection of alternative routes for multi-path routing algorithms in Fat-Tree networks, so that they can be positively combined with queuing schemes.
... Moreover, congestion spreading also amplifies the negative effects of congestion (i.e., HoL blocking and buffer hogging), since they appear more often as congestion spreads. Congestion spreading also spoils the performance of techniques that try to reduce HoL blocking by statically separating flows into different queues (or virtual channels, VCs) [28][29][30], as the spreading congested flows are likely to be present in more queues (see Sect. 2.3). ...
... For instance, vFtree [29] takes into account the source and destination leaf switch of each packet, and shuffles the packets addressed to consecutive leaf switches among the available queues in a switch buffer. Another scheme is Flow2SL [30], which defines as many groups of consecutive destinations as the number of available queues, and maps packets to the different queues or VLs depending on their source and destination group. On the other hand, there are other proposals agnostic to the network topology and routing, such as Destination-Based Buffer Management (DBBM) [28]. ...
Article
Full-text available
The interconnection network is a crucial subsystem in High-Performance Computing clusters and Data-centers, guaranteeing high bandwidth and low latency to the applications' communication operations. Unfortunately, congestion situations may spoil network performance unless the network design applies specific countermeasures. Adaptive routing algorithms are a traditional approach to dealing with congestion since they provide traffic flows with alternative routes that bypass congested areas. However, adaptive routing decisions at switches are typically based on local information without a global network traffic perspective, leading to congestion spreading throughout the network beyond the original congested areas. In this paper, we propose a new efficient congestion management strategy that leverages the adaptive routing notifications currently available in some interconnect technologies and efficiently isolates the congesting flows in reserved spaces at switch buffers. The experimental results, based on simulations of realistic traffic scenarios, show that our proposal removes the impact of congestion.
... Mathematically, reliability R(t) is the probability of proper working of a system in the time interval from 0 to t. Thus, reliability is always significant for the performance analysis of most network systems, like lifeline networks [24][25][26], wireless mobile ad hoc networks (MANETs) [27][28][29], wireless mesh networks [30][31][32][33][34], wireless sensor networks [35][36][37][38], social networks [39], stochastic-flow manufacturing networks (SMNs) [40], and interconnection networks (INs) [41][42][43][44][45][46][47][48][49][50][51][52][53][54]. ...
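In standard notation, the reliability definition quoted above can be written as follows, where F is the distribution of the time to failure T and λ is the hazard rate (the exponential form assumes the usual non-repairable model).

    % Reliability: probability that the system works throughout [0, t]
    R(t) = \Pr\{T > t\} = 1 - F(t),
    \qquad
    R(t) = \exp\!\left(-\int_0^t \lambda(u)\,\mathrm{d}u\right)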
... In general, since the communication resources such as switching elements and links are limited and shared in an interconnection network, contention and deadlocks arise. When several packet flows concurrently request access to the same output port from different input ports, contention happens inside a switch, and this contention is the origin of congestion [51]. In these cases, only one packet can cross at a given moment (this packet is selected randomly in this study), while the other packets contending for the output port should have chosen other paths to reach their destination. ...
Chapter
This chapter gives helpful information about interconnection networks. First, the important role of interconnection networks in multiprocessor systems will be explained. Then, a classification of interconnection networks will be provided in Sect. 2. In the following sections, we will examine the different interconnection topologies used for connecting processors and memory modules. Overall, this chapter will introduce two principal types of interconnection networks: static interconnection networks and dynamic interconnection networks. In addition, various types of these two main structures will be discussed.
... Mathematically, reliability R(t) is the probability of proper working of a system in the time interval from 0 to t. Thus, reliability is always significant for the performance analysis of most network systems, like lifeline networks [24][25][26], wireless mobile ad hoc networks (MANETs) [27][28][29], wireless mesh networks [30][31][32][33][34], wireless sensor networks [35][36][37][38], social networks [39], stochastic-flow manufacturing networks (SMNs) [40], and interconnection networks (INs) [41][42][43][44][45][46][47][48][49][50][51][52][53][54]. ...
... In general, since the communication resources such as switching elements and links are limited and shared in an interconnection network, contention and deadlocks arise. When several packet flows concurrently request access to the same output port from different input ports, contention happens inside a switch, and this contention is the origin of congestion [51]. In these cases, only one packet can cross at a given moment (this packet is selected randomly in this study), while the other packets contending for the output port should have chosen other paths to reach their destination. ...
Chapter
Network topologies in which the blocking problem is reduced to a satisfactory level can be designed by improving the fault tolerance of multistage interconnection networks (MINs). Researchers are therefore interested in efficient methods to improve the fault tolerance of these networks. This chapter investigates some significant approaches to improving fault tolerance in multistage interconnection networks, including increasing the number of stages, using several improved MINs in parallel, and using replicated networks.
... Prior work on InfiniBand congestion control includes simulation and experimental studies [1,3,6,7], recommendations for setting CC parameters [4,8], and new methods to combat the effects of congestion [9][10][11][12][13][14]. ...
... The VOQsw methodology [10], vFtree [13], and Flow2SL [14] share the objective of offering a solution that does not require switch modifications, and they leverage InfiniBand's service level (SL) and virtual lane (VL) features. Our DCMS solution is complementary to these methodologies, as it would handle the intra-VL hogging problem. ...
Article
While the InfiniBand link-by-link flow control helps avoid packet loss, it unfortunately causes the effects of congestion to spread through a network. Even flows that do not pass through congested ports can suffer from reduced throughput. We propose a Dynamic Congestion Management System (DCMS) to address this problem. Without per-flow information, the DCMS leverages performance counters of switch ports to detect the onset of congestion and determines whether or not victim flows are present. The DCMS then takes actions to cause an aggressive reduction in the sending rates of congestion-causing (contributor) flows, if victim flows are present. On the other hand, if there are no victim flows, the DCMS allows the contributors to maintain high sending rates and finish as quickly as possible. The value of dynamic management of a switch congestion-control parameter called Marking Rate, which is responsible for how quickly contributor flows can be throttled, is evaluated in an experimental testbed. Our results show that dynamic congestion management can enable a network to serve both contributor flows and victim flows effectively. The DCMS solution operates within the constraints of the InfiniBand Standard.
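The decision logic described in this abstract can be pictured with a small sketch: poll port counters, infer congestion and victims, and choose a Marking Rate. Counter names, thresholds, and rate values here are our assumptions, not the DCMS implementation.

    #include <cstdio>

    // Hedged sketch of a DCMS-style decision: per-port counters drive
    // congestion detection, and the Marking Rate is picked depending
    // on whether victim flows are present.
    struct PortCounters {
        long xmitWait;   // cycles spent waiting to transmit (assumed)
        long xmitData;   // data transmitted (assumed)
    };

    int chooseMarkingRate(const PortCounters& hot, bool victimsPresent) {
        bool congested = hot.xmitWait > 1000;  // assumed onset threshold
        if (!congested)     return 0;          // no throttling needed
        if (victimsPresent) return 100;        // throttle contributors hard
        return 5;                              // let contributors finish fast
    }

    int main() {
        PortCounters hot{5000, 100000};
        printf("victims present -> marking rate %d\n", chooseMarkingRate(hot, true));
        printf("no victims      -> marking rate %d\n", chooseMarkingRate(hot, false));
    }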
... Note that this requires as many queues per port as there are ports in the switch. Actually, this strategy can be implemented in different switch architectures, either by providing input buffers with as many read ports as there are output ports (as in non-blocking switches [13]), or by dividing input buffers into as many logical queues as output ports, with these queues sharing a single read port (as in [4] and [31]). Note that the latter approach can be based on Virtual Channels to implement the queues, which allows VC-level flow control to be exploited. ...
... By contrast, other solutions are specially designed to be aware of these aspects so that HoL blocking is reduced by using fewer resources. For instance, queuing schemes such as Output-Based Queue Assignment (OBQA) [12,13] and vFTree [17] have been devised for fat-tree topologies and the routing algorithms proposed in [16] and [34], respectively. Similarly, Band-Based Queuing (BBQ) [33] is tailored to KNS topologies [27] with Hybrid-DOR routing algorithm [27]. ...
Article
Full-text available
The performance of interconnection networks is a challenging issue for High-Performance Computing (HPC) systems, which becomes even more important when the number of interconnected endnodes grows. In that sense, Dragonfly interconnection patterns are a very popular option to configure the network topology, especially for large systems, as they are able to achieve high scalability relying on high-radix switches. This kind of hierarchical topology has two levels of interconnection (i.e., connections within the elements of a group and connections among groups), and each one can be interconnected using different patterns. However, regardless of the Dragonfly interconnection pattern, the Head-of-Line (HoL) blocking effect derived from congestion situations may jeopardize the Dragonfly performance. This paper analyzes the dynamics of congestion in different Dragonfly fully-connected interconnection patterns. Also, we describe a queuing scheme called Hierarchical Two-Level Queuing (H2LQ), designed specially to reduce HoL blocking in any fully-connected Dragonfly network that uses minimal-path routing. Finally, we present experimental results which show that this scheme significantly boosts Dragonfly performance, regardless of the interconnection pattern, especially when congestion arises, while requiring fewer network resources than other techniques oriented to dealing with the effects of congestion.
... Fig. 2 shows in orange color the main components of this efficient resource management. For congestion management, distinct approaches are considered (collective communication primitive optimization, injection throttling [22][23][24], static queuing schemes (SQS) [25,26], etc.) in order to deal more efficiently with the different types of congestion. ...
... Suffering from high power cost, long network latency, and low network throughput [2,6,7], existing NoC systems deliver undesirable overall performance. Many approaches have been proposed to improve NoC performance, including changing the NoC structure [7,8], dynamically changing the packet injection rate [9,10], and optimizing routing algorithms [5,11]. Routing algorithm optimization is one of the lower-cost solutions [12][13][14][15][16][17][18]. ...
Article
Full-text available
Routing algorithms are a key factor determining the performance of NoC (Networks-on-Chip) systems. Regional congestion awareness routing algorithms have shown great potential in improving the performance of NoCs. However, existing regional congestion awareness routing algorithms incur a significant queuing latency when making routing decisions, thus degrading NoC performance. In this paper, we propose an efficient area-partition-based congestion-aware routing algorithm, ParRouting, which aims at increasing the throughput and reducing the latency of NoC systems. First, ParRouting partitions the network into two areas (i.e., an edge area and a central area) based on node priorities. Then, for the edge area, ParRouting selects the output node based on different priorities for higher throughput; for the central area, ParRouting selects the node in the low-congestion direction as the output node for lower queuing latency. Our experimental results indicate that ParRouting achieves a 53.4% reduction in average packet latency for the SPLASH-2 ocean application and improves the saturated throughput by up to 38.81% under a synthetic traffic pattern for an NoC system, compared to existing routing algorithms.
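As a toy illustration of the area partition described above, the following sketch splits a mesh into an edge ring and a central region and branches the policy on the area. The one-node-wide boundary is our assumption, not necessarily ParRouting's actual partition.

    #include <cstdio>

    // Hedged sketch of an area-partitioned mesh: outer-ring nodes form
    // the "edge area", the rest the "central area", and the routing
    // policy branches on the area.
    constexpr int W = 8, H = 8;

    bool isEdge(int x, int y) {
        return x == 0 || y == 0 || x == W - 1 || y == H - 1;
    }

    const char* areaPolicy(int x, int y) {
        // Edge area: priority-based output selection for throughput;
        // central area: least-congested-neighbor selection for latency.
        return isEdge(x, y) ? "priority-based" : "congestion-aware";
    }

    int main() {
        for (int y = 0; y < H; ++y) {
            for (int x = 0; x < W; ++x) printf("%c", isEdge(x, y) ? 'E' : 'C');
            printf("\n");
        }
        printf("(0,0): %s, (4,4): %s\n", areaPolicy(0, 0), areaPolicy(4, 4));
    }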
... As a consequence, the reduced set of queues per port is not always efficiently leveraged to reduce HoL blocking, and the performance of these techniques may drop in specific topologies when HoL blocking appears. By contrast, other queuing schemes take into account the network configuration, such as Flow2SL [12] and vFTree [18], which are specially designed for fat-tree topologies using deterministic routing algorithms [16,35], to exploit their characteristics and reduce HoL blocking more effectively. Similarly, we devised a static queuing scheme specially tailored to KNS topologies using the hybrid-DOR (deterministic) routing algorithm, called Band-based Queuing (BBQ) [34]. ...
Article
Hybrid and direct topologies are cost-efficient and scalable options to interconnect thousands of end nodes in high-performance computing (HPC) systems. They offer a rich path diversity, high bisection bandwidth, and a reduced diameter guaranteeing low latency. In these topologies, efficient deterministic routing algorithms can be used to smartly balance the traffic flows among the available routes. Unfortunately, congestion leads these networks to saturation, where the HoL blocking effect degrades their performance dramatically. Among the proposed solutions to deal with HoL blocking, the routing algorithms that select alternative routes, such as adaptive and oblivious ones, can mitigate the effects of congestion. Other techniques use queues to separate congested flows from non-congested ones, thus reducing the HoL blocking. In this article, we propose a new approach that reduces HoL blocking in hybrid and direct topologies using source-adaptive and oblivious routing. This approach also guarantees deadlock-freedom as it uses virtual networks to break potential cycles generated by the routing policy in the topology. Specifically, we propose two techniques, called Source-Adaptive Solution for Head-of-Line Blocking Avoidance (SASHA) and Oblivious Solution for Head-of-Line Blocking Avoidance (OSHA). Experimental results, obtained through simulations under different traffic scenarios, show that SASHA and OSHA can significantly reduce the HoL blocking.
... Network congestion in InfiniBand, and especially the HOL blocking, has captured the attention of researchers and engineers for years. Techniques like Virtual Output Queue (VoQ) [8], Destination Based Buffer Management (DBBM) [16], Output Based Queue Assignment (OBQA) [19], Traffic Flow to Service Level (Flow2SL) [20], Set Aside Queues (SAQs) techniques [4,5,15], and proposals such as XCP [14], DCQCN [21], etc. in Ethernet, are trying to reduce congestion from different perspectives. However, none of the existing solutions/optimizations for InfiniBand congestion are based on the existing IB CC mechanism. ...
... On the other hand, in NoCs and High Performance Computing (HPC), the system cannot usually bear this delay. Hence, other techniques must be employed in such systems [9]. ...
Article
Full-text available
This paper presents a novel methodology for improving the efficiency and power consumption of networks-on-chip (NoCs). The proposed approach applies queue-length considerations from a modified version of the RED algorithm. Moreover, a stochastic learning-automata-based algorithm has been used to optimize the threshold values required by the RED algorithm. Furthermore, a new architecture has been provided for dynamic flow control of virtual channels. The proposed method contributes to reducing queue blockages and power consumption, in addition to determining an appropriate size for virtual channels. The proposed algorithm was evaluated under various synthetic traffic patterns at different injection rates and with the trace-driven SPLASH-2 benchmark suite. The experimental results demonstrate that the algorithm reduces latency and power consumption by 23% and 52%, respectively, compared to a conventional NoC. Further, compared to the Express Virtual Channels (EVC) scheme, it showed 13% and 36% improvements in latency and power consumption, respectively.
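For readers unfamiliar with RED, the following sketch shows the classic queue-length test the paper builds on: an exponentially weighted average compared against two thresholds, with the marking probability ramping linearly between them. The weight and threshold values are placeholders; the paper's contribution is precisely to learn the thresholds with stochastic learning automata.

    #include <cstdio>

    // Hedged sketch of the classic RED test. All constants are
    // illustrative assumptions, not the paper's learned values.
    struct Red {
        double avg = 0.0;                  // EWMA of the queue length
        double w = 0.1;                    // EWMA weight (assumption)
        double minTh = 4, maxTh = 12;      // thresholds (assumptions)
        double maxP = 0.2;                 // max marking probability

        double onArrival(int qlen) {       // returns marking probability
            avg = (1 - w) * avg + w * qlen;
            if (avg < minTh) return 0.0;
            if (avg >= maxTh) return 1.0;
            return maxP * (avg - minTh) / (maxTh - minTh);
        }
    };

    int main() {
        Red red;
        int samples[] = {2, 6, 10, 14, 16, 16};
        for (int q : samples) {
            double p = red.onArrival(q);
            printf("qlen %2d -> avg %5.2f, p(mark) %.2f\n", q, red.avg, p);
        }
    }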
... Interconnection networks play an essential role in many parallel computing systems, especially in HPC systems. Indeed, the overall performance of many HPC applications depends largely on the performance of the interconnection network: if the network does not offer high performance, it may become the system bottleneck and end nodes may have to wait for packets to arrive to continue processing [4]. ...
Article
The main purpose of this paper is to propose a hybrid congestion control algorithm to prevent congestion in 2-D broadcast-based multiprocessor architectures with multiple input queues. Our algorithm utilizes both a node's input queue and output channel parameters to detect and prevent congestion. The intermediate node selection procedure and the bypass operation have also been developed as part of the proposed algorithm. The performance of the algorithm is tested with several synthetic traffic patterns on the 2-D simultaneous optical multiprocessor exchange bus. The performance of the algorithm is compared with that of algorithms which use only input or only output parameters, and it is shown that the proposed congestion control algorithm using hybrid parameters performs better than the other algorithms. The proposed algorithm is able to decrease the average network response time by 33.63%, the average input waiting time by 29.13%, and increase average processor utilization by 7.57% on average.
... Generally, contention and deadlocks arise because the communication resources, such as switching elements and links, are limited and shared in an interconnection network. The origin of congestion is contention [24], which happens inside a switch when several packet flows concurrently request access to the same output port from different input ports. In these cases, only one packet can cross at a given moment (this packet is selected randomly in this study), while the other packets contending for the output port should have chosen other paths to reach their destination. ...
Article
RDMA (Remote Direct Memory Access) networks require efficient congestion control to maintain their high throughput and low latency characteristics. However, congestion control protocols deployed at the software layer suffer from slow response times due to the communication overhead between host hardware and software. This limitation has hindered their ability to meet the demands of high-speed networks and applications. Harnessing the capabilities of rapidly advancing Network Interface Card (NIC) can drive progress in congestion control. Some simple congestion control protocols have been offloaded to RDMA NIC to enable faster detection and processing of congestion. However, offloading congestion control to the RDMA NIC faces a significant challenge in integrating the RDMA transport protocol with advanced congestion control protocols that involve complex mechanisms. We have observed that reservation-based proactive congestion control protocols share strong similarities with RDMA transport protocols, allowing them to integrate seamlessly and combine the functionalities of the transport layer and network layer. In this paper, we present COER, an RDMA NIC architecture that leverages the functional components of RDMA to perform reservations and completes the scheduling of congestion control during the scheduling process of the RDMA protocol. COER facilitates the streamlined development of offload strategies for congestion control techniques, specifically proactive congestion control, on RDMA NIC. We use COER to design offloading schemes for eleven congestion control protocols, which we implement and evaluate using a network emulator with a cycle-accurate RDMA NIC model that can load MPI programs. The evaluation results demonstrate that the architecture of COER does not compromise the original characteristics of the congestion control protocols. Compared to a layered protocol stack approach, COER enables the performance of RDMA networks to reach new heights.
Article
In recent years, congestion in Networks-on-Chip (NoC) has emerged as an important research topic due to the increasing number of processing cores. All the congestion-solving methods that have been proposed require a congestion criterion to detect whether a node is congested or not, and all the criteria developed so far behave identically for all nodes in the network. In this paper, for the first time, a heterogeneous congestion criterion is proposed for a two-dimensional mesh network, determined for each node based on its betweenness centrality. This criterion can easily be generalized to other topologies, such as the torus. It is calculated before the network starts up and has no overhead at run time. Using this criterion will reduce the average latency of any congestion-aware method, such as congestion-aware routing algorithms. The evaluation section shows that using this criterion in three well-known routing algorithms reduces the average latency by up to 48% (21% on average over all algorithms and traffic patterns) under both real and synthetic traffic. In addition, this criterion reduces power consumption in all simulation conditions, because it reduces the average latency and has no overhead. It is also shown at the end of the evaluation section that an increase in the network size results in better performance of this criterion.
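As a hedged illustration of a routing-based betweenness measure: with deterministic XY routing on a small mesh, one can count before network start-up how many source-destination routes traverse each node. The mesh size and the endpoint-counting convention are our choices; the paper's exact centrality definition may differ.

    #include <cstdio>
    #include <vector>

    // Hedged sketch: count, for every node of a W x H mesh, how many
    // XY-routed source-destination paths pass through it. Nodes with
    // larger counts would get a laxer congestion threshold; that
    // mapping is not shown here.
    constexpr int W = 6, H = 6;

    int main() {
        std::vector<long> load(W * H, 0);
        for (int sx = 0; sx < W; ++sx) for (int sy = 0; sy < H; ++sy)
        for (int dx = 0; dx < W; ++dx) for (int dy = 0; dy < H; ++dy) {
            if (sx == dx && sy == dy) continue;
            int x = sx, y = sy;            // walk the XY route, X first
            while (x != dx) { x += (dx > x) ? 1 : -1; ++load[y * W + x]; }
            while (y != dy) { y += (dy > y) ? 1 : -1; ++load[y * W + x]; }
        }
        // Print the per-node route counts (a route's destination is
        // counted on arrival; endpoint conventions vary by paper).
        for (int y = 0; y < H; ++y) {
            for (int x = 0; x < W; ++x) printf("%6ld", load[y * W + x]);
            printf("\n");
        }
    }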
Thesis
In recent years, energy has become one of the most important factors for designing and operating large scale computing systems. This is particularly true in high-performance computing, where systems often consist of thousands of nodes. Especially after the end of Dennard scaling, the demand for energy proportionality in components, where energy depends linearly on utilization, increases continuously. As the main contributor to the overall power consumption, processors have received the main attention so far. The increasing energy proportionality of processors, however, shifts the focus to other components such as interconnection networks. Their share of the overall power consumption is expected to increase to 20% or more while other components further increase their efficiency in the near future. Hence, it is crucial to improve energy proportionality in interconnection networks likewise to reduce overall power and energy consumption. To facilitate these attempts, this work provides comprehensive studies about energy saving in interconnection networks at different levels. First, interconnection networks differ fundamentally from other components in their underlying technology. To gain a deeper understanding of these differences and to identify targets for energy savings, this work provides a detailed power analysis of current network hardware. Furthermore, various applications at different scales are analyzed regarding their communication patterns and locality properties. The findings show that communication makes up only a small fraction of the execution time and networks are actually idling most of the time. Another observation is that point-to-point communication often only occurs within various small subsets of all participants, which indicates that a coordinated mapping could further decrease network traffic. Based on these studies, three different energy-saving policies are designed, which all differ in their implementation and focus. These policies are then evaluated in an event-based, power-aware network simulator. While the two policies that operate completely locally at link level enable significant energy savings of more than 90% in most analyses, the hybrid one does not provide further benefits despite significant additional design effort. Additionally, these studies include network design parameters, such as the transition time between different link configurations, as well as the three most common topologies in supercomputing systems. The final part of this work addresses the interactions of congestion management and energy-saving policies. Although both network management strategies aim for different goals and use opposite approaches, they complement each other and can increase energy efficiency in all studies as well as improve the performance overhead as opposed to plain energy saving.
Article
Increasing the number of processing cores in networks-on-chip in recent years has made congestion one of the most important challenges in this field. One of the best ways to address this problem, with greater effectiveness and less overhead, is the use of a congestion-aware routing algorithm. In this algorithm, when a packet is generated, a route is selected from the minimal routes based on the betweenness centrality, the history of previous packets' routes, and the adaptivity degree. The packet tries to move as far as possible along the selected route, and in the event of extreme congestion, it can change its route a limited number of times, again selecting a new route according to the same parameters. In more detail, this algorithm is a combination of deterministic and adaptive routing. In order to reduce overhead, the adaptive routing algorithm uses only neighboring data. The proposed routing algorithm has been compared with five different algorithms in terms of average packet latency, power consumption, and variance of switch activities under real and synthetic traffic, and it shows better performance in simulation experiments.
Chapter
Communication performance plays a crucial role in both the scalability and the time-to-solution of parallel applications. The sharing of links in modern high-performance computer networks inevitably introduces contention for communications involving multiple point-to-point messages, thus hindering their performance. Passive contention reduction, such as the networks' congestion control, can mitigate network contention but at extra protocol cost, while application-level active contention reduction, such as topology mapping techniques, can only reduce contention for applications with static communication patterns. In this paper, we explore a different approach that actively reduces network contention through a congestion-avoiding message scheduling algorithm, namely CAMS. CAMS determines how to inject the messages in groups to reduce contention just in time before injecting them into the network; thus it is useful in applications with dynamic communication patterns. Experiments with a 2D halo-exchange benchmark on the Tianhe-2A supercomputer show that it can improve communication performance by up to 27% when messages get large. The proposed approach can be used in conjunction with topology mapping to further improve communication performance.
Article
With the rapid development of data center networks, traditional traffic scheduling methods can easily cause problems such as link congestion and load imbalance. Therefore, this paper proposes a novel dynamic flow scheduling algorithm, GA-ACO (Genetic Algorithm and Ant COlony algorithms). GA-ACO obtains a global view of the network under the SDN (Software-Defined Networking) architecture. It then calculates the globally optimal path for the elephant flows on congested links and reroutes them. Extensive experiments have been executed to evaluate the performance of the proposed GA-ACO algorithm. The simulation results show that, in comparison with the ECMP and ACO-SDN algorithms, GA-ACO can not only reduce the maximum link utilization but also improve bandwidth effectively.
Article
The interconnection network architecture is crucial for High-Performance Computing (HPC) clusters, since it must meet the increasing computing demands of applications. Current trends in the design of these networks are based on increasing link speed, while reducing latency and the number of components in order to lower the cost. The InfiniBand Architecture (IBA) is an example of a powerful interconnect technology, delivering huge amounts of information in a few microseconds. IBA-based hardware is able to deliver EDR and HDR speeds (i.e. 100 and 200 Gb/s, respectively). Unfortunately, congestion situations and their derived problems (i.e. Head-of-Line blocking and buffer hogging) are a serious threat to the performance of both the interconnection network and the entire HPC cluster. In this paper, we propose a new approach to provide IBA-based networks with techniques for reducing congestion problems. We propose Flow2SL-ITh, a technique that combines a static queuing scheme (SQS) with the closed-loop congestion control mechanism included in IBA-based hardware (a.k.a. injection throttling, ITh). Flow2SL-ITh separates traffic flows by storing them in different virtual lanes (VLs), in order to reduce HoL blocking, while the injection rate of congested flows is throttled. While the congested traffic vanishes, there is no buffer sharing among traffic flows stored in different VLs, which reduces the negative effects of congestion. We have implemented Flow2SL-ITh in OpenSM, the open-source implementation of the IBA subnet manager (SM). Experimental results obtained by running simulations and real workloads in a small IBA cluster show that Flow2SL-ITh outperforms existing techniques by up to 44% under some traffic scenarios.
Article
This tutorial presents the details of the interconnection network utilized in many High Performance Computing (HPC) systems today. “InfiniBand” is the hardware interconnect utilized by over 35% of the top 500 supercomputers in the world as of June, 2017. “Verbs” is the term used for both the semantic description of the interface in the InfiniBand Architecture Specifications, and the name used for the functions defined in the widely used OpenFabrics Alliance (OFA) implementation of the software interface to InfiniBand. “Message Passing Interface” (MPI) is the primary software library by which HPC applications portably pass messages between processes across a wide range of interconnects including InfiniBand. Our goal is to explain how these three components are designed and how they interact to provide a powerful, efficient interconnect for HPC applications. We provide a succinct look into the inner technical workings of each component that should be instructive to both novices to HPC applications as well as to those who may be familiar with one component, but not necessarily the others, in the design and functioning of the total interconnect. A supercomputer interconnect is not a monolithic structure, and this tutorial aims to give non-experts a “big-picture” overview of its substructure with an appreciation of how and why features in one component influence those in others. We believe this is one of the first tutorials to discuss these three major components as one integrated whole. In addition, we give detailed examples of practical experience and typical algorithms used within each component in order to give insights into what issues and trade-offs are important.
Article
The number of endnodes in high-performance computing systems has grown significantly in the last years. Hence, the interconnection network has become an essential issue as it may end up being the system bottleneck if it is not properly designed. In that sense, the Dragonfly topology has become very popular for interconnecting high-performance computing systems in the last years because it offers high performance at an affordable cost. However, when using deterministic minimal-path routing, this topology is not able to offer a high performance under certain traffic conditions. This problem can be solved by using oblivious or adaptive routing. However, there are no congestion management techniques specially tailored to Dragonfly topologies using oblivious or adaptive routing. Note that in congestion situations, the Dragonfly performance may drop because of the head-of-line blocking effect. This effect could be even more dangerous in systems where several applications with different priorities coexist. In this work we propose several techniques especially designed for providing differentiated services and congestion management in Dragonfly networks using oblivious or adaptive routing. First, we propose the hierarchical 3-level queuing scheme, which configures several virtual channels distributed into 3 virtual networks to reduce head-of-line blocking while deadlocks derived from the routing algorithm are prevented. Second, we extend hierarchical 3-level queuing to provide differentiated services through 2 different solutions. Finally, some experiments are performed to show the benefits obtained by using the proposed techniques.
Conference Paper
The InfiniBand Congestion Control (CC) mechanism is able to reduce congestion and improve performance in many situations. In this paper we study the characteristics of congestion in InfiniBand by monitoring and analyzing the CC mechanism with a hardware analyzer. To the best of our knowledge, this is the first paper that presents experience with, and analysis of, the InfiniBand CC with such a tool. We found that there can be more than one “root of congestion”, as defined by the IBTA specification, existing at the same time in the congestion tree, and a “root of congestion” can be converted to a “victim of congestion” as its nature changes. We also observed that even with constant traffic flows, the “root of congestion” will shift from one place to another within the congestion tree, with corresponding consequences for packets from various traffic sources: traffic that might have been negatively impacted by tree spreading and might not have been contributing to the “root of congestion” before will be treated as a congestion contributor and then be throttled by the CC mechanism.
Article
Interconnection networks have a great impact on the performance of parallel systems. These networks provide the communication mechanism and framework needed by parallel applications. One such important network is the fat-tree. Selection functions were shown to have a great impact on the performance of fat-trees, and they perform differently under certain traffic patterns. The stage and destination priority (SADP) selection function was shown to perform better in the case of uniform traffic, while the stage and origin priority (SAOP) selection function was shown to perform better in the case of hot-spot traffic. In this paper, we propose a cost-efficient congestion management mechanism for fat-trees that chooses a certain selection function for a given traffic pattern. The mechanism has the ability to detect the current traffic pattern and switch to the selection function that is proved to give better performance under the detected pattern. This directly decreases the congestion in the network. First, we analyze the hot-spot traffic in fat-trees when the SADP selection function is used, derive a condition for the existence of hot-spot traffic under SADP, and give an implementation for detecting this condition. Once this condition is detected, the network is forced to switch to the SAOP selection function. Then, we use the analysis of SAOP to derive a condition to detect that non-hot-spot traffic exists in the fat-tree, give an implementation for detecting this condition, and in turn switch back to the SADP selection function. We use synthetic workloads to show the accuracy of the proposed mechanism for detecting hot-spot traffic in the network. We show that the proposed mechanism incurs a constant number of bits per physical link as overhead. Finally, we compare the proposed mechanism with other techniques.
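The switching mechanism can be pictured as a small state machine: a hot-spot indicator flips the network from SADP to SAOP and back. The detection rule and thresholds below are our illustrative assumptions; the paper derives its detection conditions analytically from the SADP/SAOP analyses.

    #include <cstdio>

    // Hedged sketch of a pattern-driven switch between selection
    // functions. The indicator (share of requests hitting the single
    // most-requested output) and the thresholds are assumptions.
    enum Selection { SADP, SAOP };

    Selection update(Selection cur, int maxPortSharePct) {
        if (cur == SADP && maxPortSharePct > 50) return SAOP; // hot-spot
        if (cur == SAOP && maxPortSharePct < 20) return SADP; // uniform again
        return cur;  // hysteresis between the two thresholds
    }

    int main() {
        Selection s = SADP;
        int shares[] = {10, 30, 60, 55, 15};
        for (int share : shares) {
            s = update(s, share);
            printf("share %2d%% -> %s\n", share, s == SADP ? "SADP" : "SAOP");
        }
    }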
Conference Paper
Full-text available
As the size of High Performance Computing clusters grows, the increasing probability of interconnect hot spots degrades the latency and effective bandwidth the network provides. This paper presents a solution to this scalability problem for real-life constant bisectional-bandwidth fat-tree topologies. It is shown that maximal bandwidth and cut-through latency can be achieved for MPI global collective traffic. To form such a congestion-free configuration, MPI programs should utilize collective communication, the MPI node order should be topology-aware, and the packet routing should match the MPI communication patterns. First, we show that MPI collectives can be classified into unidirectional and bidirectional shifts. Using this property, we propose a scheme for congestion-free routing of the global collectives in fully and partially populated fat trees running a single job. Simulation results of the proposed routing, MPI node order, and communication patterns show a 40% throughput improvement over previously published results for all-to-all collectives.
Conference Paper
Full-text available
The NAS Parallel Benchmarks (NPB) are a well-known suite of benchmarks that proxy scientific computing applications. They specify several problem sizes that represent how such applications may run on different sizes of HPC systems. However, even the largest problem (class F) is still far too small to exercise properly a petascale supercomputer. Our work shows how one may scale the Block Tridiagonal (BT) NPB from today's published size to petascale and exascale computing systems. In this paper we discuss the pros and cons of various ways of scaling. We discuss how scaling BT would impact computation, memory access, and communications, and highlight the expected bottleneck, which turns out to be not memory or communication bandwidth, but network latency. Two complementary ways are presented to overcome latency obstacles. We also describe a practical method to gather approximate performance data for BT at exascale on actual hardware, without requiring an exascale system.
Article
Full-text available
Since at least 1985 [1] it has been known that certain traffic patterns in multistage interconnection networks, hot spots, can cause catastrophic congestion and loss of throughput. No practical technique has, until now, been demonstrated to alleviate this problem, which becomes increasingly severe as network size increases and networks are driven closer to saturation. The congestion control architecture (CCA) proposed for the InfiniBand™ Architecture was alleged to be a solution, but when it was defined it lacked both guidance for setting its parameters and demonstration of its effectiveness. At its adoption, it had not even been demonstrated that there were any parameter settings that would work at all, avoiding instability or oscillations. This paper reports on an extensive evaluation of IBA CCA, under different scenarios and congestion control parameters, that (a) delivers the first guidance for setting CCA parameters for IBA, and (b) demonstrates that this is the first effective solution to hot-spot contention published in 20 years. This result is expected to be significant for standards such as IBA and IEEE 802.1/3, particularly as virtualized networks become more common.
Conference Paper
Full-text available
In a lossless interconnection network, network congestion needs to be detected and resolved to ensure high performance and good utilization of network resources at high network load. If no countermeasure is taken, congestion at a node in the network will stimulate the growth of a congestion tree that not only affects contributors to congestion, but also other traffic flows in the network. Left untouched, the congestion tree will block traffic flows, lead to underutilization of network resources and result in a severe drop in network performance. The InfiniBand standard specifies a congestion control (CC) mechanism to detect and resolve congestion before a congestion tree is able to grow and, by that, hamper the network performance. The InfiniBand CC mechanism includes a rich set of parameters that can be tuned in order to achieve effective CC. Even though it has been shown that the CC mechanism, properly tuned, is able to improve both throughput and fairness in an interconnection network, it has been questioned whether the mechanism is fast enough to keep up with dynamic network traffic, and whether a given set of parameter values for a topology is robust when it comes to different traffic patterns, or if the parameters need to be tuned depending on the applications in use. In this paper we address both of these questions. Using the three-stage fat-tree topology from the Sun Datacenter InfiniBand Switch 648 as a basis, and a simulator tuned against CC-capable InfiniBand hardware, we conduct a systematic study of the efficiency of the InfiniBand CC mechanism as the network traffic becomes increasingly dynamic. Our studies show that InfiniBand CC, even when using a single set of parameter values, performs very well as the traffic patterns become increasingly dynamic, outperforming a network without CC in all cases. Our results show throughput increases varying from a few percent to seventeen-fold.
Conference Paper
Full-text available
Interconnection networks-on-chip (NOCs) are rapidly replacing other forms of interconnect in chip multiprocessors and system-on-chip designs. Existing interconnection networks use either oblivious or adaptive routing algorithms to determine the route taken by a packet to its destination. Despite somewhat higher implementation complexity, adaptive routing enjoys better fault tolerance characteristics, increases network throughput, and decreases latency compared to oblivious policies when faced with non-uniform or bursty traffic. However, adaptive routing can hurt performance by disturbing any inherent global load balance through greedy local decisions. To improve load balance in adapting routing, we propose Regional Congestion Awareness (RCA), a lightweight technique to improve global network balance. Instead of relying solely on local congestion information, RCA informs the routing policy of congestion in parts of the network beyond adjacent routers. Our experiments show that RCA matches or exceeds the performance of conventional adaptive routing across all workloads examined, with a 16% average and 71% maximum latency reduction on SPLASH-2 benchmarks running on a 49-core CMP. Compared to a baseline adaptive router, RCA incurs a negligible logic and modest wiring overhead.
Conference Paper
Full-text available
In lossless interconnection networks congestion control (CC) can be an effective mechanism to achieve high performance and good utilization of network resources. Without CC, congestion in one node may grow into a congestion tree that can degrade the performance severely. This degradation can affect not only contributors to the congestion, but also throttles innocent traffic flows in the network. The InfiniBand standard describes CC functionality for detecting and resolving congestion. The InfiniBand CC concept is rich in the way that it specifies a set of parameters that can be tuned in order to achieve effective CC. There is, however, limited experience with the InfiniBand CC mechanism. To the best of our knowledge, only a few simulation studies exist. Recently, InfiniBand CC has been implemented in hardware, and in this paper we present the first experiences with such equipment. We show that the implemented InfiniBand CC mechanism effectively resolves congestion and improves fairness by solving the parking lot problem, if the CC parameters are appropriately set. By conducting extensive testing on a selection of the CC parameters, we have explored the parameter space and found a subset of parameter values that leads to efficient CC for our test scenarios. Furthermore, we show that the InfiniBand CC increases the performance of the well known HPC Challenge benchmark in a congested network.
Conference Paper
Full-text available
We have developed a new method to uniformly balance communication traffic over the interconnection network called Distributed Routing Balancing (DRB) that is based on limited and load-controlled path expansion in order to maintain a low message latency. DRB defines how to create alternative paths to expand single paths (expanded path definition) and when to use them depending on traffic load (expanded path selection carried out by DRB Routing). The alternative path definition offers a broad range of alternatives to choose from and the DRB Routing is designed with the goal of minimising monitoring and decision overhead. Evaluation in terms of latency and bandwidth is presented. Some conclusions from the experimentation and comparisons with existing methods are given. It is demonstrated that DRB is a method to effectively balance network traffic.
Conference Paper
Full-text available
To avoid head of line blocking in switches, virtual output queues (VOQs) are commonly used. However, the number of VOQs grows quadratically with the number of ports, making this approach impractical for large switches. In this paper, we propose dynamic switch buffer management (DSBM) to tackle this problem. Similar to DBBM, it saves memory by reducing the number of buffers. Our scheme significantly improves the performance by dynamically assigning the incoming cells to the least occupied buffers.
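The core of the dynamic assignment is simply choosing the least-occupied buffer for incoming cells. A minimal sketch follows (our simplification; DSBM also keeps a flow consistently mapped once it has been assigned).

    #include <cstdio>

    // Hedged sketch of dynamic buffer assignment: an incoming cell's
    // flow is mapped to the currently least-occupied buffer.
    constexpr int Q = 4;

    int leastOccupied(const int (&occ)[Q]) {
        int best = 0;
        for (int i = 1; i < Q; ++i)
            if (occ[i] < occ[best]) best = i;
        return best;
    }

    int main() {
        int occ[Q] = {7, 2, 5, 2};   // current per-buffer occupancy
        int q = leastOccupied(occ);
        printf("cell assigned to buffer %d (occupancy %d)\n", q, occ[q]);
    }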
Conference Paper
Full-text available
We propose a distributed congestion management scheme for non-blocking, 3-stage Clos networks, comprising plain buffered crossbar switches. VOQ requests are routed using multipath routing to the switching elements of the 3rd stage, and grants travel back to the linecards the other way around. The fabric elements contain independent single-resource schedulers that serve requests and grants in a pipeline. As any other network with limited capacity, this scheduling network may suffer from oversubscribed links, hotspot contention, etc., which we identify and tackle. We also reduce the cost of internal buffers by reducing the data RTT and by allowing sub-RTT crosspoint buffers. Performance simulations demonstrate that, with almost all outputs congested, packets destined to non-congested outputs experience very low delays (flow isolation). For applications requiring very low communication delays, we propose a second, parallel operation mode, wherein linecards can each forward a few packets eagerly, bypassing the request-grant latency overhead.
Conference Paper
Full-text available
The past few years have seen a rise in popularity of massively parallel architectures that use fat-trees as their interconnection networks. In this paper we study the communication performance of a parametric family of fat-trees, the k-ary n-trees, built with constant arity switches interconnected in a regular topology. Through simulation on a 4-ary 4-tree with 256 nodes, we analyze some variants of an adaptive algorithm that utilize wormhole routing with one, two and four virtual channels. The experimental results show that the uniform, bit reversal and transpose traffic patterns are very sensitive to the flow control strategy. In all these cases, the saturation points are between 35-40% of the network capacity with one virtual channel, 55-60% with two virtual channels and around 75% with four virtual channels. The complement traffic, a representative of the class of the congestion-free communication patterns, reaches an optimal performance, with a saturation point at 97% of the capacity for all flow control strategies
Conference Paper
Full-text available
It is a well known fact that multiple virtual lanes can improve performance in interconnection networks, but this knowledge has had little impact on real clusters. Currently, a large number of clusters using InfiniBand are based on fat-tree topologies that can be routed deadlock-free using only one virtual lane. Consequently, all the remaining virtual lanes are left unused. In this paper we suggest an enhancement to the fat-tree algorithm that utilizes virtual lanes to improve performance when hot-spots are present. Even though the bisection bandwidth in a fat-tree is constant, hot-spots are still possible and they will degrade performance for flows not contributing to them due to head-of-line blocking. Such a situation may be alleviated through adaptive routing or congestion control, however, these methods are not yet readily available in InfiniBand technology. To remedy this problem, we have implemented an enhanced fat-tree algorithm in OpenSM that distributes traffic across all available virtual lanes without any configuration needed. We evaluated the performance of the algorithm on a small cluster and did a large-scale evaluation through simulations. In a congested environment, results show that we are able to achieve throughput increases up to 38% on a small cluster and from 221% to 757% depending on the hot-spot scenario for a 648-port simulated cluster.
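The essence of the enhancement can be pictured as a destination-to-VL spreading function, so that destinations sharing a link are distributed over the available virtual lanes and one hot destination cannot block the rest. The modulo policy below is an illustrative stand-in, not the exact OpenSM implementation.

#include <cstdint>

// Sketch of VL spreading for a fat-tree: map each destination LID to
// one of the available virtual lanes (illustrative policy only).
uint8_t slForDestination(uint16_t dest_lid, uint8_t num_vls) {
    return static_cast<uint8_t>(dest_lid % num_vls);
}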
Conference Paper
Full-text available
Clusters of PCs have become very popular to build high performance computers. These machines use commodity PCs linked by a high speed interconnect. Routing is one of the most important design issues of interconnection networks. Adaptive routing usually better balances network traffic, thus allowing the network to obtain a higher throughput. However, adaptive routing introduces out-of-order packet delivery, which is unacceptable for some applications. Concerning topology, most of the commercially available interconnects are based on fat-trees. Fat-trees offer a rich connectivity among nodes, making it possible to obtain paths between all source-destination pairs that do not share any link. We exploit this idea to propose a deterministic routing algorithm for fat-trees, comparing it with adaptive routing in several workloads. The results show that deterministic routing can achieve a similar, and in some scenarios higher, level of performance than adaptive routing, while providing in-order packet delivery.
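A destination-based rule of this kind can be sketched as follows; it is shown only as an example of deterministic uplink selection in a k-ary n-tree (a d-mod-k-style mapping), not necessarily the paper's exact function.

#include <cstdint>

// Deterministic uplink choice for a k-ary n-tree (sketch): in the
// ascending phase, the uplink taken at a given stage is a fixed
// function of the destination, so every destination is reached
// through a single deterministic path and link usage is balanced.
uint32_t uplinkAtStage(uint32_t dest, uint32_t k, uint32_t stage) {
    uint32_t div = 1;
    for (uint32_t i = 0; i < stage; ++i) div *= k;
    return (dest / div) % k;  // one radix-k digit of the destination
}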
Conference Paper
Full-text available
Several Congestion Management Mechanisms (CMMs) have been proposed for Multistage Interconnection Networks (MINs) in order to avoid the degradation of network performance when congestion appears. Most of them are based on Explicit Congestion Notification (ECN). For this purpose, switches detect congestion and, depending on the applied mechanism, some flags are marked to warn the source hosts. In response, source hosts apply corrective actions to adjust their packet injection rate. These mechanisms have been evaluated by analyzing whether they are able to manage a congestion situation, but there is no comparative study among them. Moreover, marking effects are not analyzed separately from corrective actions. In this paper, we analyze the current proposals for CMMs, showing the impact of the applied packet marking techniques as well as the corrective actions they apply.
Conference Paper
Full-text available
As the number of components in cluster-based systems increases, cost and power consumption also increase. One way to reduce both problems is using smaller networks with adequate congestion management mechanisms. Recent successful proposals (RECN) eliminate the negative effect of congestion, the Head-of-Line (HOL) blocking, leaving congestion harmless. RECN relies on source-based network architectures, where the entire route is placed at packet headers before injection. Unfortunately, distributed table-based routing is also common in cluster-based networks, InfiniBand being the most prominent example. We propose a novel congestion management technique for distributed table-based routing. The mechanism relies on additional congestion information located at routing tables. With this information, HOL blocking is minimized by smartly using switch queues. The detailed memory organization and the way congestion information is updated/propagated are described. Preliminary results indicate that, with modest resource requirements, maximum network performance is kept regardless of congestion.
Article
Full-text available
Network throughput can be increased by dividing the buffer storage associated with each network channel into several virtual channels [DalSei]. Each physical channel is associated with several small queues, virtual channels, rather than a single deep queue. The virtual channels associated with one physical channel are allocated independently but compete with each other for physical bandwidth. Virtual channels decouple buffer resources from transmission resources. This decoupling allows active messages to pass blocked messages using network bandwidth that would otherwise be left idle. Simulation studies show that, given a fixed amount of buffer storage per link, virtual-channel flow control increases throughput by a factor of 3.5, approaching the capacity of the network.
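A minimal sketch of the mechanism: several small per-VC queues share one physical channel, and each cycle an arbiter forwards a flit from any VC that has both data and a downstream buffer credit, so a blocked message parked in one VC does not idle the link. Round-robin arbitration is an assumption here; the original proposal leaves the allocation policy open.

#include <array>
#include <deque>
#include <optional>
#include <cstddef>

struct Flit { int payload; };

// One physical channel multiplexing V virtual channels (sketch).
template <std::size_t V>
struct VirtualChannelLink {
    std::array<std::deque<Flit>, V> vc;  // per-VC flit queues
    std::array<int, V> credits{};        // downstream buffer space per VC
    std::size_t rr = 0;                  // round-robin pointer

    std::optional<Flit> arbitrate() {
        for (std::size_t i = 0; i < V; ++i) {
            std::size_t c = (rr + i) % V;
            if (!vc[c].empty() && credits[c] > 0) {
                Flit f = vc[c].front();
                vc[c].pop_front();
                --credits[c];
                rr = (c + 1) % V;
                return f;  // an active message uses otherwise idle bandwidth
            }
        }
        return std::nullopt;  // link idles only if no VC can advance
    }
};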
Conference Paper
Full-text available
As the number of computing and storage nodes keeps increasing, the interconnection network is becoming a key element of many computing and communication systems, where the overall performance directly depends on network performance. This performance may dramatically drop during congestion situations. Although congestion may be avoided by overdimensioning the network, the current trend is to reduce overall cost and power consumption by reducing the number of network components. Thus, the network will be prone to congestion, making the use of congestion management techniques mandatory. In that sense, the technique known as Regional Explicit Congestion Notification (RECN) completely eliminates the Head-of-Line (HOL) blocking produced by congested packets, turning congestion harmless. However, RECN has been designed for switches with queues at input and output ports (CIOQ switches), so it cannot be directly applied to other types of switches. Additionally, the method RECN uses for detecting congestion requires several detection queues that increase the memory requirements and thus switch cost. Thus, we completely redefine the RECN mechanism in order to achieve different goals. First, we adapt RECN to a switch organization with queues only at input ports (IQ switches). These switches are simpler and cheaper to produce than CIOQ ones. Second, we propose a new method for detecting congestion that does not require several detection queues, thereby reducing RECN memory requirements. These improvements lead to a cost-effective switch organization that derives maximum performance even in the presence of congestion. Also, we present in detail a realistic switch architecture supporting the new mechanism. Results demonstrate that the new RECN version in an IQ switch achieves maximum network performance in all the analyzed situations. These results have been obtained with a reduction factor of 5 in data memory requirements with respect to the previous RECN mechanism in CIOQ switches.
Conference Paper
Full-text available
Driving computer interconnection networks closer to saturation minimizes cost/performance and power consumption, but requires efficient congestion control to prevent catastrophic performance degradation during traffic peaks or "hot spot" traffic patterns. The InfiniBand™ Architecture provides such congestion control, but lacks guidance for setting its parameters. At its adoption, it was unproven that there were any settings that would work at all and avoid instability or oscillations. This paper reports on a simulation-driven exploration of that parameter space which verifies that the architected scheme can, in fact, work properly despite inherent delays in its feedback mechanism.
Conference Paper
Full-text available
This paper presents flit-reservation flow control, in which control flits traverse the network in advance of data flits, reserving buffers and channel bandwidth. Flit-reservation flow control requires control flits to precede data flits, which can be realized through fast on-chip control wires or the pipelining of control flits one or more cycles ahead of data flits. Scheduling ahead of data arrival enables buffers to be held only during actual buffer usage, unlike existing flow control methods. It also eliminates data latency due to routing and arbitration decisions. Simulations with fast control wires show that flit-reservation flow control extends the 63% throughput attained by virtual-channel flow control with 8 flit buffers per input to 77%, an improvement of 20% with equal storage and bandwidth overheads. Its throughput with 6 buffers (77%) approaches that of virtual-channel flow control using 16 buffers (80%), reflecting the significant buffer savings as a result of efficient buffer utilization. Data latency is also reduced by 15.6% as compared to virtual-channel flow control. The improvement in throughput is similarly realized by the pipelining of each control flit a cycle ahead of its data flits, using control and data networks with the same propagation delay of 1 cycle.
Conference Paper
Full-text available
Multiprocessing (MP) on networks of workstations (NOW) is a high-performance computing architecture of growing importance. In traditional MPs, wormhole routing interconnection networks use fixed-size flits and backpressure. In NOWs, ATM, one of the major contending interconnection technologies, uses fixed-size cells, while backpressure can be added to it. We argue that ATM with backpressure has interesting similarities with wormhole routing. We are implementing ATLAS I, a single-chip gigabit ATM switch, which includes credit flow control (backpressure), according to a protocol resembling Quantum Flow Control (QFC). We show by simulation that this protocol performs better than the traditional multi-lane wormhole protocol: high throughput and low latency are provided with less buffer space. Also, ATLAS I demonstrates little sensitivity to bursty traffic, and, unlike wormhole, it is fair in terms of latency in hot-spot configurations. We use detailed switch models, operating at clock-cycle granularity.
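The credit (backpressure) protocol at the heart of the comparison reduces to a counter per link; this sketch assumes one credit per cell and omits QFC's credit-batching details.

// Credit-based (backpressure) link flow control, QFC-style sketch:
// the sender may transmit only while it holds credits, so the
// receiver's buffer can never overflow and no cell is ever dropped.
struct CreditLink {
    int credits;  // initialized to the receiver's buffer size in cells

    bool trySend() {
        if (credits == 0) return false;  // receiver full: stall, don't drop
        --credits;                       // one remote buffer slot now in use
        return true;
    }
    void creditReturned() { ++credits; } // receiver freed a slot
};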
Article
Full-text available
We introduce a new method of adaptive routing on k-ary n-cubes, Globally Adaptive Load-Balance (GAL). GAL makes global routing decisions using global information. In contrast, most previous adaptive routing algorithms make local routing decisions using local information (typically channel queue depth). GAL senses global congestion using segmented injection queues to decide the directions to route in each dimension. It further load balances the network by routing in the selected directions adaptively. Using global information, GAL achieves the performance (latency and throughput) of minimal adaptive routing on benign traffic patterns and performs as well as the best obliviously load-balanced routing algorithm (GOAL) on adversarial traffic.
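The direction decision can be sketched per dimension: compare the backlog of the segmented injection queues for the two directions and inject toward the shorter one. The full quadrant/segment bookkeeping of GAL is omitted; this is only the comparison at its core.

#include <cstddef>

// GAL-style direction choice (sketch): the injection queue is split
// per dimension and direction, and the sensed backlog of each segment
// approximates global congestion along that direction.
int chooseDirection(std::size_t queuePlus, std::size_t queueMinus) {
    return queuePlus <= queueMinus ? +1 : -1;  // route toward less backlog
}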
Article
Full-text available
One type of interconnection network for a medium to large-scale parallel processing system (i.e., a system with 2^6 to 2^16 processors) is a buffered packet-switched multistage interconnection network (MIN). It has been shown that the performance of these networks is satisfactory for uniform network traffic. More recently, several studies have indicated that the performance of MINs is degraded significantly when there is hot spot traffic, that is, when a large fraction of the messages are routed to one particular destination. A multipath MIN is a MIN with two or more paths between all source and destination pairs. This research investigates how the Extra Stage Cube multipath MIN can reduce the detrimental effects of tree saturation caused by hot spots. Simulation is used to evaluate the performance of the proposed approaches. The objective of this evaluation is to show that, under certain conditions, the performance of the network with the usual routing scheme is severely degraded by the presence of hot spots. With the proposed approaches, although the delay time of hot spot traffic may be increased, the performance of the background traffic, which constitutes the majority of the network traffic, can be significantly improved.
Article
Full-text available
The use of adaptive routing in a multicomputer interconnection network improves network performance by using all available paths and provides fault tolerance by allowing messages to be routed around failed channels and nodes. Two deadlock-free adaptive routing algorithms are described. Both algorithms allocate virtual channels using a count of the number of dimension reversals a packet has performed to eliminate cycles in resource dependency graphs. The static algorithm eliminates cycles in the network channel dependency graph. The dynamic algorithm improves virtual channel utilization by permitting dependency cycles and instead eliminating cycles in the packet wait-for graph. It is proved that these algorithms are deadlock-free. Experimental measurements of their performance are presented
Article
Full-text available
In a multi-user production cluster there is no control over the intra-cluster communication patterns, which can cause unanticipated hot spots to occur in the cluster interconnect. In a multistage interconnect a common side effect of such a hot spot is the roll-over of the saturation to other areas in the interconnect that were otherwise not in the direct path of the primary congested element. This paper investigates the effects of tree saturation in the interconnect of the AC3 Velocity cluster, which is a multistage interconnect constructed out of 40 GigaNet switches. The main congestion control mechanism employed at the GigaNet switches is a direct feedback to the traffic source, allowing for fast control over the source of the congestion and avoiding the spread from the congestion area. The experiments reported are designed to examine the effects of the congestion control in detail.
Article
Clustered systems have become a dominant architecture of scalable high-performance supercomputers. In these large-scale computers, the network performance and scalability is as critical as the compute-node speed. InfiniBand™ has become a commodity networking solution supporting the stringent latency, bandwidth and scalability requirements of these clusters. The network performance is also affected by its topology, packet routing and the communication patterns the distributed application exercises. Fat-trees are the topology structures used for constructing most large clusters, as they are scalable, maintain cross-bisectional bandwidth (CBB), and are practical to build using fixed-arity switches. In this paper, we propose a fat-tree routing algorithm that provides a congestion-free, all-to-all shift pattern leveraging the InfiniBand™ static routing capability. The algorithm supports partially populated fat-trees built with switches of an arbitrary number of ports and CBB ratios. To evaluate the proposed algorithm, detailed switch and host simulation models were developed and multiple fabric topologies were run. The results of these simulations, as well as measurements on real clusters, show an improvement in all-to-all delay by avoiding congestion on the fabric. Copyright © 2009 John Wiley & Sons, Ltd. The paper was presented in the International Super Computer 2007 conference in Dresden, Germany.
Article
As parallel computing systems increase in size, the interconnection network is becoming a critical subsystem. The current trend in network design is to use as few components as possible to interconnect the end nodes, thereby reducing cost and power consumption. However, this increases the probability of congestion appearing in the network. As congestion may severely degrade network performance, the use of a congestion management mechanism is becoming mandatory in modern interconnects. One of the most cost-effective proposals to deal with the problems derived from congestion situations is the Regional Explicit Congestion Notification (RECN) strategy, based on using special queues to totally isolate the packet flows which contribute to congestion, thereby preventing the Head-of-Line (HoL) blocking effect that these flows may cause to others. Unfortunately, RECN requires the use of source-based routing, thus not being suitable for interconnects with distributed routing, like InfiniBand. Although some RECN-like mechanisms have been proposed for distributed-routing networks, they are not scalable due to the huge amount of control memory that they require in medium-size or large networks. In this paper, we propose Distributed-Routing-Based Congestion Management (DRBCM), a new scalable technique which, following the RECN principles, totally prevents congestion from producing HoL-blocking in multistage interconnection networks (MINs) using tag-based distributed routing. Simulation results indicate that, regardless of network size, DRBCM presents small resource requirements to keep network performance at maximum level even in scenarios of heavy congestion, where it utterly outperforms (with a gain up to 70 percent) current solutions for distributed-routing networks, like the InfiniBand congestion-control mechanism based on injection throttling. Thus, DRBCM is an efficient, cost-effective, and scalable solution for congestion management.
Article
Interconnection networks are essential elements in current computing systems. For this reason, achieving the best network performance, even in congestion situations, has been a primary goal in recent years. In that sense, there exist several techniques focused on eliminating the main negative effect of congestion: the Head of Line (HOL) blocking. One of the most successful HOL blocking elimination techniques is RECN, which can be applied in source routing networks. FBICM follows the same approach as RECN, but it has been developed for networks with distributed deterministic routing. Although FBICM effectively eliminates HOL blocking, it requires too many resources to be implemented. In this paper we present a new FBICM version, based on a new organization of switch memory resources, that significantly reduces the required silicon area, complexity, and cost. Moreover, we present new results about FBICM in network topologies not yet analyzed. From the experimental results we can conclude that a far less complex and feasible FBICM implementation can be achieved by using the proposed improvements, while not losing efficiency.
Conference Paper
Several techniques to prevent congestion in multiprocessor interconnection networks have been recently proposed. Unfortunately, they either suffer from a lack of robustness or detect congestion relying on global information that wastes a lot of transmission resources. This paper presents a new mechanism that uses only local information to avoid network saturation in wormhole networks. It is robust and works properly in different conditions. It first applies preventive measures of different intensity depending on the estimated traffic level; if necessary, it then uses message throttling during predefined time intervals that are extended if congestion is repeatedly detected. Evaluation results for different network loads and topologies show that the proposed mechanism avoids network performance degradation and, most importantly, does so without introducing any penalty for low and medium network loads.
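A compact sketch of the graded response just described, with invented thresholds and delay values: preventive measures scale with the locally estimated load, and a throttling interval is extended whenever congestion is detected again.

// Graded, local-information injection control (sketch; thresholds and
// delays are illustrative assumptions, not the paper's tuning).
struct LocalThrottle {
    int throttleCyclesLeft = 0;

    // busyFraction: locally estimated load, e.g. busy VCs / total VCs.
    void update(double busyFraction) {
        if (busyFraction > 0.8)
            throttleCyclesLeft += 100;  // congestion (re)detected: extend interval
    }
    // Delay imposed on new injections, growing with the estimated load.
    int injectionDelay(double busyFraction) const {
        if (throttleCyclesLeft > 0) return 8;  // hard throttling in effect
        if (busyFraction > 0.6)     return 2;  // mild preventive measure
        return 0;                              // light load: no penalty
    }
    void tick() { if (throttleCyclesLeft > 0) --throttleCyclesLeft; }
};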
Article
Two simple models of queueing on an N × N space-division packet switch are examined. The switch operates synchronously with fixed-length packets; during each time slot, packets may arrive on any inputs addressed to any outputs. Because packet arrivals to the switch are unscheduled, more than one packet may arrive for the same output during the same time slot, making queueing unavoidable. Mean queue lengths are always greater for queueing on inputs than for queueing on outputs, and the output queues saturate only as the utilization approaches unity. Input queues, on the other hand, saturate at a utilization that depends on N, but is approximately (2 - √2) ≈ 0.586 when N is large. If output trunk utilization is the primary consideration, it is possible to slightly increase utilization of the output trunks, up to (1 - 1/e) ≈ 0.632 as N → ∞, by dropping interfering packets at the end of each time slot, rather than storing them in the input queues. This improvement is possible, however, only when the utilization of the input trunks exceeds a second critical threshold, approximately ln(1 + √2) ≈ 0.881 for large N.
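The three constants are easy to check numerically:

#include <cmath>
#include <cstdio>

int main() {
    // Saturation thresholds from the queueing analysis (large N):
    std::printf("input queueing:    2 - sqrt(2)   = %.3f\n",
                2.0 - std::sqrt(2.0));                 // 0.586
    std::printf("drop interfering:  1 - 1/e       = %.3f\n",
                1.0 - std::exp(-1.0));                 // 0.632
    std::printf("input threshold:   ln(1+sqrt(2)) = %.3f\n",
                std::log(1.0 + std::sqrt(2.0)));       // 0.881
    return 0;
}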
Conference Paper
Existing congestion control mechanisms in interconnects can be divided into two general approaches. One is to throttle traffic injection at the sources that contribute to congestion, and the other is to isolate the congested traffic in specially designated resources. These two approaches have different, but non-overlapping weaknesses. In this paper we present in detail a method that combines injection throttling and congested-flow isolation. Through simulation studies we first demonstrate the respective flaws of the injection throttling and of flow isolation. Thereafter we show that our combined method extracts the best of both approaches in the sense that it gives fast reaction to congestion, it is scalable and it has good fairness properties with respect to the congested flows.
Conference Paper
High performance, freedom from deadlocks, and freedom from livelocks are desirable properties of interconnection networks. Unfortunately, these can be conflicting goals because networks may either devote or under-utilize resources to avoid deadlocks and livelocks. These resources could otherwise be used to improve performance. For example, a minimal adaptive routing algorithm may forgo some routing options to ensure livelock-freedom but this hurts performance at high loads. In contrast, chaotic routing achieves higher performance as it allows full-routing flexibility including misroutes (hops that take a packet farther from its destination) and it is deadlock-free. Unfortunately, Chaotic routing only provides probabilistic guarantees of livelock-freedom. In this paper we propose a new routing algorithm called BLAM (bypass buffers with limited adaptive lazy misroutes) which achieves Chaos-like performance, but guarantees freedom from both deadlocks and livelocks. BLAM achieves Chaos-like performance by allowing packets to be "lazily" misrouted outside the minimal rectangle. Lazy misrouting is critical to BLAM's performance because eager misrouting can misroute unnecessarily, thereby degrading performance. To avoid deadlocks, BLAM uses a logically separate deadlock-free network (like minimal, adaptive routing), virtual cut-through, and the packet exchange protocol (like Chaos). To avoid livelocks, unlike Chaos, BLAM limits the number of times a packet is misrouted to a predefined threshold. Beyond the threshold, stalled packets are routed by the deadlock-free network to their destinations. Simulations show that our BLAM implementation sustains high throughput at heavy loads for a variety of network configurations and communication patterns.
Conference Paper
We introduce and analyze a new family of multiprocessor interconnection networks, called generalized fat trees, which include as special cases the fat trees used for the Connection Machine architecture CM-5, pruned butterflies, and various other fat trees proposed in the literature. The generalized fat trees provide a formal unifying concept to design and analyse a fat tree based architecture. The extended generalized fat tree network XGFT(h; m_1, ..., m_h; w_1, ..., w_h) of height h has ∏_{i=1}^{h} m_i leaf processors, and the inner nodes serve only as switches or routers. Each non-leaf node in level i has m_i children, and each non-root has w_{i+1} parent nodes. The generalized fat trees provide regularity, symmetry, recursive scalability, maximal fault-tolerance, logarithmic diameter, bisection scalability, and permit simple algorithms for fault-tolerant self-routing and broadcasting. These networks are also versatile, since they can efficiently embed rings, meshes and tori, trees, pyramids and hypercubes.
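The size bookkeeping follows directly from the definition. The leaf count is the product of the m_i, as in the abstract; the per-level node count below follows the standard XGFT construction (level 0 = leaves, level h = roots) and should be read as a sketch.

#include <vector>
#include <cstdint>
#include <cstddef>

// Number of leaf processors of XGFT(h; m_1..m_h; w_1..w_h).
uint64_t leafCount(const std::vector<uint32_t>& m) {
    uint64_t n = 1;
    for (uint32_t mi : m) n *= mi;
    return n;
}

// Nodes at a given level: level i has (m_{i+1}*...*m_h)*(w_1*...*w_i)
// nodes, so level 0 yields the leaf count and level h the root count.
uint64_t nodesAtLevel(const std::vector<uint32_t>& m,
                      const std::vector<uint32_t>& w, std::size_t level) {
    uint64_t n = 1;
    for (std::size_t j = level; j < m.size(); ++j) n *= m[j];  // m_{level+1}..m_h
    for (std::size_t j = 0; j < level; ++j)        n *= w[j];  // w_1..w_level
    return n;
}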
Conference Paper
The fat-tree is one of the most common topologies for the interconnection networks of PC Clusters which are currently used for high-performance parallel computing. Among other advantages, fat-trees allow the use of simple but very efficient routing schemes. One of them is a deterministic routing algorithm that has been recently proposed, offering similar (or better) performance than Adaptive Routing while reducing complexity and guaranteeing in-order packet delivery. However, as other deterministic routing proposals, this deterministic routing algorithm cannot react when high traffic loads or hot-spot traffic scenarios produce severe contention for the use of network resources, leading to the appearance of Head-Of-Line (HOL) blocking, which spoils network performance. In that sense, we present in this paper a simple, efficient strategy for dealing with the HOL blocking that may appear in fat-trees with the aforementioned deterministic routing algorithm. From the results presented in the paper, we can conclude that, in the mentioned environment, our proposal considerably reduces HOL blocking without significantly increasing switch complexity and required silicon area.
Article
When a large number of processors try to access a common variable, referred to as hot-spot accesses in [6], not only can the resulting memory contention seriously degrade performance, but it can also cause tree saturation in the interconnection network, which blocks both hot and regular requests alike. It is shown in [6] that even if only a small percentage of all requests are to a hot-spot, these requests can cause very serious performance problems, and networks that do the necessary combining of requests are suggested to keep the interconnection network and memory contention from becoming a bottleneck.
Article
The author presents a new class of universal routing networks, called fat-trees, which might be used to interconnect the processors of a general-purpose parallel supercomputer. A fat-tree routing network is parameterized not only in the number of processors, but also in the amount of simultaneous communication it can support. Since communication can be scaled independently from the number of processors, substantial hardware can be saved for such applications as finite-element analysis without resorting to a special-purpose architecture. It is proved that a fat-tree of a given size is nearly the best routing network of that size. This universality theorem is established using a three-dimensional VLSI model that incorporates wiring as a direct cost. In this model, hardware size is measured as physical volume. It is proved that for any given amount of communications hardware, a fat-tree built from that amount of hardware can simulate every other network built from the same amount of hardware, using only slightly more time (a polylogarithmic factor greater).
Article
Congestion management is likely to become a critical issue in interconnection networks, as increasing power consumption and cost concerns lead to improvements in the efficiency of network resources. In previous configurations, networks were usually oversized and underutilized. In a smaller network, however, contention is more likely to occur and blocked packets cause head-of-line (HoL) blocking among the rest of the packets, spreading congestion quickly. The best-known solution to HoL blocking is Virtual Output Queues (VOQs). However, the cost of implementing VOQs increases quadratically with the number of output ports in the network, making it impractical. The situation is aggravated when several priorities and/or Quality of Service (QoS) levels must be supported. Therefore, a more scalable and cost-effective solution is required to reduce or eliminate HoL blocking. In this paper, we present a family of methodologies, referred to as Destination-Based Buffer Management (DBBM), to reduce/eliminate the HoL blocking effect on interconnection networks. DBBM efficiently uses the resources (mainly memory queues) of the network. These methodologies are comprehensively evaluated in terms of throughput, scalability, and fairness. Results show that using the DBBM strategy, with a reduced number of queues at each switch, it is possible to achieve roughly the same throughput as the VOQ mechanism. Moreover, all of the proposed strategies are designed in such a way that they can be used in any switch architecture. We compare DBBM with RECN, a sophisticated mechanism that eliminates HoL blocking in congestion situations. Our mechanism is able to achieve almost the same performance with very low logic requirements (in contrast with RECN).
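The basic DBBM mapping itself is a one-liner: with Q queues per port, the queue is selected by the destination identifier modulo Q, so all packets toward one destination share a queue and a hot destination can only block the fraction of traffic that maps onto its queue.

#include <cstdint>

// DBBM queue selection (sketch of the modulo variant): the low-order
// bits of the destination ID choose among the Q available queues.
uint32_t dbbmQueue(uint32_t destination, uint32_t numQueues) {
    return destination % numQueues;  // for Q a power of two: dest & (Q - 1)
}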
Conference Paper
In this paper, we propose a new congestion management strategy for lossless multistage interconnection networks that scales as network size and/or link bandwidth increase. Instead of eliminating congestion, our strategy avoids performance degradation beyond the saturation point by eliminating the HOL blocking produced by congestion trees. This is achieved in a scalable manner by using separate queues for congested flows. These are dynamically allocated only when congestion arises, and deallocated when congestion subsides. Performance evaluation results show that our strategy responds to congestion immediately and completely eliminates the performance degradation produced by HOL blocking while using only a small number of additional queues.
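The allocate-on-congestion, deallocate-on-drain life cycle can be sketched as follows; congestion detection itself and the interaction with the normal queues are deliberately left out, and the container choice is an assumption for illustration.

#include <unordered_map>
#include <deque>
#include <cstdint>

struct Packet { uint32_t dest; };

// Dynamic set-aside queues (sketch): packets heading to a destination
// flagged as congested are diverted to a queue allocated on demand,
// so they stop HOL-blocking the common queue; the queue is released
// once the congestion subsides and it drains.
struct CongestionQueues {
    std::unordered_map<uint32_t, std::deque<Packet>> setAside;

    void onCongestionDetected(uint32_t dest) { setAside[dest]; }  // allocate
    bool divert(const Packet& p) {
        auto it = setAside.find(p.dest);
        if (it == setAside.end()) return false;  // not congested: normal path
        it->second.push_back(p);                 // isolated path
        return true;
    }
    void onDrained(uint32_t dest) { setAside.erase(dest); }       // deallocate
};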
Conference Paper
InfiniBand system area networks (SANs) which use link-level flow control experience congestion spreading, where one bottleneck link causes traffic to block throughout the network. In this paper, we propose an end-to-end congestion control scheme that avoids congestion spreading, delivers high throughput, and prevents flow starvation. It couples a simple switch-based ECN packet marking mechanism appropriate for typical SAN switches with small input buffers, together with a source response mechanism that uses rate control combined with a window limit. The classic fairness convergence requirement for source response functions assumes network feedback is synchronous. We relax the classic requirement by exploiting the asynchronous behavior of packet marking. Our experimental results demonstrate that compared to conventional approaches, our proposed marking mechanism improves fairness. Moreover, rate increase functions possible under the relaxed requirement reclaim available bandwidth aggressively and improve throughput in both static and dynamic traffic scenarios.
Conference Paper
Network performance in tightly-coupled multiprocessors typically degrades rapidly beyond network saturation. Consequently, designers must keep a network below its saturation point by reducing the load on the network. Congestion control via source throttling, a common technique to reduce the network load, prevents new packets from entering the network in the presence of congestion. Unfortunately, prior schemes to implement source throttling either lack vital global information about the network to make the correct decision (whether to throttle or not) or depend on specific network parameters, network topology or communication patterns. This paper presents a global-knowledge-based, self-tuned, congestion control technique that prevents saturation at high loads across different network configurations and communication patterns. Our design is composed of two key components. First, we use global information about a network to obtain a timely estimate of network congestion. We compare this estimate to a threshold value to determine when to throttle packet injection. The second component is a self-tuning mechanism that automatically determines appropriate threshold values based on throughput feedback. A combination of these two techniques provides high performance under heavy load, does not penalize performance under light load, and gracefully adapts to changes in communication patterns.
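Both components fit in a small sketch: injection is gated by comparing a global congestion estimate against a threshold, and the threshold is hill-climbed using throughput feedback. The step sizes and bounds below are invented for illustration.

// Self-tuned source throttling (sketch). Component 1 gates injection
// on a globally aggregated congestion estimate; component 2 nudges
// the threshold toward whatever value maximizes delivered throughput.
struct SelfTunedThrottle {
    double threshold = 0.5;       // congestion level above which we throttle
    double lastThroughput = 0.0;
    double step = 0.05;           // hill-climbing step (illustrative)

    bool allowInjection(double globalCongestion) const {
        return globalCongestion < threshold;
    }
    // Called periodically with the measured delivered throughput.
    void retune(double throughput) {
        if (throughput < lastThroughput) step = -step;  // got worse: reverse
        threshold += step;                              // keep climbing
        if (threshold < 0.1) threshold = 0.1;
        if (threshold > 0.9) threshold = 0.9;
        lastThroughput = throughput;
    }
};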
Conference Paper
It is well known that head-of-line (HOL) blocking limits the throughput of an input-queued switch with FIFO queues. Under certain conditions, the throughput can be shown to be limited to approximately 58%. It is also known that if non-FIFO queueing policies are used, the throughput can be increased. However it has not been previously shown that if a suitable queueing policy and scheduling algorithm are used then it is possible to achieve 100% throughput for all independent arrival processes. In this paper we prove this to be the case using a simple linear programming argument and quadratic Lyapunov function. In particular we assume that each input maintains a separate FIFO queue for each output and that the switch is scheduled using a maximum weight bipartite matching algorithm
Article
An increasing number of high performance internetworking protocol routers, LAN and asynchronous transfer mode (ATM) switches use a switched backplane based on a crossbar switch. Most often, these systems use input queues to hold packets waiting to traverse the switching fabric. It is well known that if simple first in first out (FIFO) input queues are used to hold packets then, even under benign conditions, head-of-line (HOL) blocking limits the achievable bandwidth to approximately 58.6% of the maximum. HOL blocking can be overcome by the use of virtual output queueing, which is described in this paper. A scheduling algorithm is used to configure the crossbar switch, deciding the order in which packets will be served. Previous results have shown that with a suitable scheduling algorithm, 100% throughput can be achieved. In this paper, we present a scheduling algorithm called iSLIP. An iterative, round-robin algorithm, iSLIP can achieve 100% throughput for uniform traffic, yet is simple to implement in hardware. Iterative and noniterative versions of the algorithms are presented, along with modified versions for prioritized traffic. Simulation results are presented to indicate the performance of iSLIP under benign and bursty traffic conditions. Prototype and commercial implementations of iSLIP exist in systems with aggregate bandwidths ranging from 50 to 500 Gb/s. When the traffic is nonuniform, iSLIP quickly adapts to a fair scheduling policy that is guaranteed never to starve an input queue. Finally, we describe the implementation complexity of iSLIP. Based on a two-dimensional (2-D) array of priority encoders, single-chip schedulers have been built supporting up to 32 ports, and making approximately 100 million scheduling decisions per second
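One grant/accept iteration of iSLIP can be sketched directly from the description; note that both round-robin pointers advance only when a grant is accepted, which is what desynchronizes the inputs and outputs and yields 100% throughput under uniform traffic.

#include <vector>

// One iSLIP iteration for an NxN crossbar (sketch). request[i][j] is
// true when input i holds a cell for output j.
struct Islip {
    int N;
    std::vector<int> grantPtr, acceptPtr;  // per-output / per-input pointers

    explicit Islip(int n) : N(n), grantPtr(n, 0), acceptPtr(n, 0) {}

    // Returns match[i] = output granted to and accepted by input i, or -1.
    std::vector<int> iterate(const std::vector<std::vector<bool>>& request) {
        std::vector<int> grant(N, -1);  // output -> input it grants
        for (int out = 0; out < N; ++out)
            for (int k = 0; k < N; ++k) {
                int in = (grantPtr[out] + k) % N;  // round-robin from pointer
                if (request[in][out]) { grant[out] = in; break; }
            }
        std::vector<int> match(N, -1);  // input -> accepted output
        for (int in = 0; in < N; ++in)
            for (int k = 0; k < N; ++k) {
                int out = (acceptPtr[in] + k) % N;
                if (grant[out] == in) {
                    match[in] = out;
                    acceptPtr[in] = (out + 1) % N;  // advance only on accept,
                    grantPtr[out] = (in + 1) % N;   // at both sides
                    break;
                }
            }
        return match;
    }
};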
Article
The theoretical background for the design of deadlock-free adaptive routing algorithms for wormhole networks is developed. The author proposes some basic definitions and two theorems. These create the conditions to verify that an adaptive algorithm is deadlock-free, even when there are cycles in the channel dependency graph. Two design methodologies are also proposed. The first supplies algorithms with a high degree of freedom, without increasing the number of physical channels. The second methodology is intended for the design of fault-tolerant algorithms. Some examples are given to show the application of the methodologies. Simulations show the performance improvement that can be achieved by designing the routing algorithms with the new theory
Article
Compared to the overdimensioned designs of the past, current interconnection networks operate closer to the point of saturation and run a higher risk of congestion. Among proposed strategies for congestion management, only the regional explicit congestion notification (RECN) mechanism achieves both the required efficiency and the scalability that emerging systems demand
Article
It is well known that head-of-line blocking limits the throughput of an input-queued switch with first-in-first-out (FIFO) queues. Under certain conditions, the throughput can be shown to be limited to approximately 58.6%. It is also known that if non-FIFO queueing policies are used, the throughput can be increased. However, it has not been previously shown that if a suitable queueing policy and scheduling algorithm are used, then it is possible to achieve 100% throughput for all independent arrival processes. In this paper we prove this to be the case using a simple linear programming argument and quadratic Lyapunov function. In particular, we assume that each input maintains a separate FIFO queue for each output and that the switch is scheduled using a maximum weight bipartite matching algorithm. We introduce two maximum weight matching algorithms: longest queue first (LQF) and oldest cell first (OCF). Both algorithms achieve 100% throughput for all independent arrival processes. LQF favors queues with larger occupancy, ensuring that larger queues will eventually be served. However, we find that LQF can lead to the permanent starvation of short queues. OCF overcomes this limitation by favoring cells with large waiting times
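The LQF preference can be illustrated with a greedy pass over the VOQ occupancies. Note this greedy matching is only an approximation for illustration; the theorem in the abstract assumes an exact maximum weight bipartite matching.

#include <vector>

// Greedy sketch of LQF-style scheduling: repeatedly match the longest
// remaining VOQ whose input and output are both still free.
std::vector<int> lqfGreedy(const std::vector<std::vector<int>>& occupancy) {
    int N = static_cast<int>(occupancy.size());
    std::vector<int> match(N, -1);       // input -> chosen output
    std::vector<bool> outUsed(N, false);
    for (int picked = 0; picked < N; ++picked) {
        int bi = -1, bo = -1, best = 0;
        for (int i = 0; i < N; ++i)
            if (match[i] == -1)
                for (int o = 0; o < N; ++o)
                    if (!outUsed[o] && occupancy[i][o] > best) {
                        best = occupancy[i][o]; bi = i; bo = o;
                    }
        if (bi < 0) break;               // no compatible non-empty VOQ left
        match[bi] = bo; outUsed[bo] = true;
    }
    return match;
}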
Article
Small n × n switches are key components of interconnection networks used in multiprocessors and multicomputers. The architecture of these n × n switches, particularly their internal buffers, is critical for achieving high-throughput low-latency communication with cost-effective implementations. Several buffer structures are discussed and compared in terms of implementation complexity, inter-switch handshaking requirements, and their ability to deal with variations in traffic patterns and message lengths. A design for buffers that provide non-FIFO message handling and efficient storage allocation for variable size packets using linked lists managed by a simple on-chip controller is presented. The new buffer design is evaluated by comparing it to several alternative designs in the context of a multistage interconnection network. The modeling and simulation show that the new buffer outperforms alternative buffers and can thus be used to improve the performance of a wide variety of systems currently using less efficient buffers
Article
Current technology trends make it possible to build communication networks that can support high performance distributed computing. This paper describes issues in the design of a prototype switch for an arbitrary topology point-to-point network with link speeds of up to one gigabit per second. The switch deals in fixed-length ATM-style cells, which it can process at a rate of 37 million cells per second. It provides high bandwidth and low latency for datagram traffic. In addition, it supports real-time traffic by providing bandwidth reservations with guaranteed latency bounds. The key to the switch's operation is a technique called parallel iterative matching, which can quickly identify a set of conflict-free cells for transmission in a time slot. Bandwidth reservations are accommodated in the switch by building a fixed schedule for transporting cells from reserved flows across the switch; parallel iterative matching can fill unused slots with datagram traffic. Finally, we no...
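A first iteration of parallel iterative matching, starting from an unmatched state, can be sketched as below; unlike round-robin schemes, both the grant and the accept choices are uniformly random, and repeating a few iterations over the still-unmatched ports converges to a maximal match with high probability.

#include <vector>
#include <random>

// One parallel-iterative-matching step (sketch): every output grants
// a random requesting input, then every input accepts a random grant.
std::vector<int> pimIteration(const std::vector<std::vector<bool>>& request,
                              std::mt19937& rng) {
    int N = static_cast<int>(request.size());
    std::vector<int> grant(N, -1), match(N, -1);
    for (int out = 0; out < N; ++out) {             // grant phase
        std::vector<int> req;
        for (int in = 0; in < N; ++in)
            if (request[in][out]) req.push_back(in);
        if (!req.empty())
            grant[out] = req[std::uniform_int_distribution<int>(
                0, static_cast<int>(req.size()) - 1)(rng)];
    }
    for (int in = 0; in < N; ++in) {                // accept phase
        std::vector<int> grants;
        for (int out = 0; out < N; ++out)
            if (grant[out] == in) grants.push_back(out);
        if (!grants.empty())
            match[in] = grants[std::uniform_int_distribution<int>(
                0, static_cast<int>(grants.size()) - 1)(rng)];
    }
    return match;
}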
Article
The past few years have seen a rise in popularity of massively parallel architectures that use fat-trees as their interconnection networks. In this paper we formalize a parametric family of fat-trees, the k-ary n-trees, built with constant arity switches interconnected in a regular topology. A simple adaptive routing algorithm for k-ary n-trees sends each message to one of the nearest common ancestors (NCA) of both source and destination, choosing the least loaded physical channels, and then reaches the destination following the unique available path. Through simulation on a 4-ary 4-tree with 256 nodes, we analyze some variants of the adaptive algorithm that utilize wormhole routing with 1, 2 and 4 virtual channels. The experimental results show that the uniform, bit reversal and transpose traffic patterns are very sensitive to the flow control strategy. In all these cases, the saturation points are between 35-40% of the network capacity with 1 virtual channel, 55-60% with 2 virtual chan...