Preprint
Preprints and early-stage research may not have been peer reviewed yet.
To read the file of this research, you can request a copy directly from the authors.

Abstract

Component reliability and performance pose a great challenge for interconnection networks. Future technology scaling such as transistor integration capacity in VLSI design will result in higher device degradation and manufacture variability. As a consequence, changes in the network arise, often rendering irregular topologies. This paper proposes a topology-agnostic distributed segment-based algorithm able to handle switch discovery in any topology while guaranteeing connectivity among switches. The proposal, known as Transparent Distributed Segment-Based Routing (TDSR), has been applied to meshes with defective link configurations.

No file available

Request Full-text Paper PDF

To read the file of this research,
you can request a copy directly from the authors.

ResearchGate has not been able to resolve any citations for this publication.
Article
Full-text available
The Horizon 2020 MANGO project aims at exploring deeply heterogeneous accelerators for use in High-Performance Computing systems running multiple applications with different Quality of Service (QoS) levels. The main goal of the project is to exploit customization to adapt computing resources to reach the desired QoS. For this purpose, it explores different but interrelated mechanisms across the architecture and system software. In particular, in this paper we focus on the runtime resource management, the thermal management, and support provided for parallel programming, as well as introducing three applications on which the project foreground will be validated.
Article
Full-text available
Most standard cluster interconnect technologies are flexible with respect to network topology. This has spawned a substantial amount of research on topology agnostic routing algo- rithms, which make no assumption about the network structure, thus providing the flexibility needed to route on irregular net- works. Actually, such an irregularity should be often interpreted as minor modifications of some regular interconnection pattern, such as those induced by faults. In fact, topology agnostic routing algorithms are also becoming increasingly useful for networks on-chip (NoC), where faults may make the preferred 2D mesh topology irregular. Existing topology agnostic routing algorithms were developed for varying purposes, giving them different and not always comparable properties. Details are scattered among many papers, each with distinct conditions, making comparison difficult. This paper presents a comprehensive overview of the known topology agnostic routing algorithms. We classify these algorithms by their most important properties, and evaluate them consis- tently. This provides significant insight into the algorithms and their appropriateness for different on and off chip environments.
Article
Full-text available
We describe portable software to simulate universal quantum computers on massive parallel computers. We illustrate the use of the simulation software by running various quantum algorithms on different computer architectures, such as a IBM BlueGene/L, a IBM Regatta p690+, a Hitachi SR11000/J1, a Cray X1E, a SGI Altix 3700 and clusters of PCs running Windows XP. We study the performance of the software by simulating quantum computers containing up to 36 qubits, using up to 4096 processors and up to 1 TB of memory. Our results demonstrate that the simulator exhibits nearly ideal scaling as a function of the number of processors and suggest that the simulation software described in this paper may also serve as benchmark for testing high-end parallel computers.
Conference Paper
Full-text available
Network-based parallel processing using commodity personal computers has been widely developed. Since such systems require high degree of flexibility and scalability of wiring, a high-speed network with an irregular topology is often needed. In traditional routing algorithms for irregular networks, available paths are considerably restricted in order to avoid deadlocks. In this paper we propose a novel routing algorithm called left-up-first turn routing (L-turn routing), which makes a better traffic balancing in irregular networks by building a specific spanning tree. Result of simulations shows that L-turn routing achieves better performance than traditional ones with each topology.
Conference Paper
Full-text available
In recent years we have seen a growing interest in irregular network topologies for cluster interconnects. One problem related to such topologies is that the combination of shortest path and deadlock free routing is difficult. As a result of this the existing solutions for routing in irregular networks either guarantee shortest paths relative to some constraint (like up*/down*), or have to resort to deadlock recovery through non-minimal escape channels. In this paper we propose a method that guarantees shortest path routing and in-order delivery, and that uses virtual channels for deadlock avoidance. We present a theoretical upper bound on the number of virtual channels needed, and through extensive empirical testing we demonstrate that the actual number of virtual channels is very low even for large networks.
Conference Paper
Full-text available
Computers get faster every year, but the demand for computing resources seems to grow at an even faster rate. Depending on the problem domain, this demand for more power can be satisfied by either, massively parallel computers, or clusters of computers. Common for both approaches is the dependence on high performance interconnect networks such as Myrinet, Infiniband, or 10 Gigabit Ethernet. While high throughput and low latency are key features of interconnection networks, the issue of fault-tolerance is now becoming increasingly important. As the number of network components grows so does the probability for failure, thus it becomes important to also consider the fault-tolerance mechanism of interconnection networks. The main challenge then lies in combining performance and fault-tolerance, while still keeping cost and complexity low. This paper proposes a new deterministic routing methodology for tori and meshes, which achieves high performance without the use of virtual channels. Furthermore, it is topology agnostic in nature, meaning it can handle any topology derived from any combination of faults when combined with static reconfiguration. The algorithm, referred to as segment-based routing (SR), works by partitioning a topology into subnets, and subnets into segments. This allows us to place bidirectional turn restrictions locally within a segment. As segments are independent, we gain the freedom to place turn restrictions within a segment independently from other segments. This results in a larger degree of freedom when placing turn restrictions compared to other routing strategies. In this paper a way to compute segment-based routing tables is presented and applied to meshes and tori. Evaluation results show that SR increases performance by a factor of 1.8 over FX and up*/down* routing.
Conference Paper
Full-text available
Since 2005, processor designers have increased core counts to exploit Moore's Law scaling, rather than focusing on single-core performance. The failure of Dennard scaling, to which the shift to multicore parts is partially a response, may soon limit multicore scaling just as single-core scaling has been curtailed. This paper models multicore scaling limits by combining device scaling, single-core scaling, and multicore scaling to measure the speedup potential for a set of parallel workloads for the next five technology generations. For device scaling, we use both the ITRS projections and a set of more conservative device scaling parameters. To model single-core scaling, we combine measurements from over 150 processors to derive Pareto-optimal frontiers for area/performance and power/performance. Finally, to model multicore scaling, we build a detailed performance model of upper-bound performance and lower-bound core power. The multicore designs we study include single-threaded CPU-like and massively threaded GPU-like multicore chip organizations with symmetric, asymmetric, dynamic, and composed topologies. The study shows that regardless of chip organization and topology, multicore scaling is power limited to a degree not widely appreciated by the computing community. Even at 22 nm (just one year from now), 21% of a fixed-size chip must be powered off, and at 8 nm, this number grows to more than 50%. Through 2024, only 7.9x average speedup is possible across commonly used parallel workloads, leaving a nearly 24-fold gap from a target of doubled performance per generation.
Article
Full-text available
This paper considers the question of identifying the parameters governing the behavior of fundamental global network problems. Many papers on distributed network algorithms consider the task of optimizing the running time successful when an O(n) bound is achieved on an n-vertex network. We propose that a more sensitive parameter is the network's diameter \Diam. This is demonstrated in the paper by providing a distributed minimum-weight spanning tree algorithm whose time complexity is sublinear in n, but linear in \Diam (specifically, O(\Diam + n^\varepsilon \cdot \log^* n) for ε=ln3ln6=0.6131...\varepsilon = \frac{\ln 3}{\ln 6} = 0.6131...). Our result is achieved through the application of graph decomposition and edge-elimination-by-pipelining techniques that may be of independent interest.
Conference Paper
Full-text available
Cluster networks are seen as the future access networks for multimedia streaming, e-commerce, network storage, etc. For these applications, performance and high availability are particularly crucial. Regular topologies are preferred when performance is the primary concern. However, due to spatial constraints or fault-related issues, the network structure may become irregular, which makes more difficult to find deadlock-free minimal paths. Over the recent years, several solutions have been proposed. One of them is the LASH routing, which enables minimal routing by assigning paths to different virtual layers. In this paper, we propose an extension of LASH in order to reduce the number of required virtual layers by allowing transitions between virtual layers. Evaluation results show that the new routing scheme (LASH-TOR) is able to obtain full minimal routing with a reduced number of virtual channels. For torus and mesh networks, with only two virtual channels, LASH throughput is increased by an average factor of improvement of 3.30 for large networks. For regular networks with some unconnected (faulty) links, equal performance improvements are achieved. Even for highly irregular networks of size up to 128 switches the new routing scheme only needs three virtual channels for guaranteeing minimal routing. Besides, LASH-TOR performs well compared to dimension order routing for mesh and torus networks.
Conference Paper
Full-text available
The InfiniBand Architecture (IBA) defines a switch-based network with point-to-point links whose topology is arbitrarily established by the customer. We propose a simple and effective methodology for designing deadlock-free routing strategies that are able to route packets through minimal paths in InfiniBand networks. This methodology can meet the trade-off between network performance and the number of resources dedicated to deadlock avoidance. Evaluation results show that the resulting routing strategies significantly outperform up*/down* routing. In particular, throughput improvement ranges, on average, from 1.33 for small networks to 4.05 for large networks. Also, it is shown that just two virtual lanes and three service levels are enough to achieve more than 80% of the throughput improvement achieved by the best proposed routing strategy (the one that always provides minimal paths without limiting the number of resources).
Article
Full-text available
IMesh, the tile processor architecture's on-chip interconnection network, connects the multicore processor's tiles with five 2D mesh networks, each specialized for a different use. taking advantage of the five networks, the C-based ILIB interconnection library efficiently maps program communication across the on-chip interconnect. the tile processor's first implementation, the tile64, contains 64 cores and can execute 192 billion 32-bit operations per second at 1 Ghz.
Article
Full-text available
The TRIPS chip prototypes two networks on chip to demonstrate the viability of a routed interconnection fabric for memory and operand traffic. In a 170-million-transistor custom ASIC chip, these NoCs provide system performance within 28 percent of ideal noncontended networks at a cost of 20 percent of the die area. our experience shows that NoCs are area- and complexity-efficient means of providing high-bandwidth, low-latency on-chip communication.
Article
Full-text available
Component failures and planned component replacements cause changes in the topology and routing paths supplied by the interconnection network of a parallel processor system over time. Such changes may require the network to be reconfigured such that the existing routing function is replaced by one that enables packets to reach their intended destinations amid the changes. Efficient reconfiguration methods are desired which allow the network to function uninterruptedly over the course of the reconfiguration process while remaining free from deadlocking behavior. In this paper, we propose, evaluate, and prove the deadlock freedom of a new network reconfiguration protocol that overlaps various phases of "static" reconfiguration processes traditionally used in commercial and research systems to provide performance efficiency on par with that of recently proposed "dynamic" reconfiguration processes but without their complexity. Simulation results show that the proposed Overlapping Static Reconfiguration protocol can reduce reconfiguration time by up to 50 percent, reduce packet latency by several orders of magnitude, reduce packet dropping by an order of magnitude, and provide unhalted packet injection as compared to traditional static reconfiguration while allowing network throughput similar to dynamic reconfiguration.
Article
This article describes the architecture of Knights Landing, the second-generation Intel Xeon Phi product family, which targets high-performance computing and other highly parallel workloads. It provides a significant increase in scalar and vector performance and a big boost in memory bandwidth compared to the prior generation, called Knights Corner. Knights Landing is a self-booting, standard CPU that is completely binary compatible with prior Intel Xeon processors and is capable of running all legacy workloads unmodified. Its innovations include a core optimized for power efficiency, a 512-bit vector instruction set, a memory architecture comprising two types of memory for high bandwidth and large capacity, a high-bandwidth on-die interconnect, and an integrated on-package network fabric. These features enable the Knights Landing processor to provide significant performance improvement for computationally intensive and bandwidth-bound workloads while still providing good performance on unoptimized legacy workloads, without requiring any special way of programming other than the standard CPU programming model.
Conference Paper
The expected low reliability of the silicon substrate at upcoming technology nodes presents a key challenge for digital system designers. Networks-on-chip (NoCs) are especially concerning because they are often the only communication infrastructure for the chips in which they are deployed. Recently, routing reconfiguration solutions have been proposed to address this problem. However, they come at a high silicon cost, and often require suspending the normal network activity while executing a centralized, resource-hungry reconfiguration algorithm. This paper proposes a novel, fast and minimalistic routing reconfiguration algorithm, called BLINC. BLINC utilizes pre-computed routing metadata to quickly evaluate localized detours upon each fault manifestation. We showcase the efficacy of our algorithm by deploying it in a novel NoC fault detection and reconfiguration solution, where BLINC enables uninterrupted NoC operation during aggressive online testing. If a fault seems likely to occur, we circumvent it in advance with the aid of our BLINC reconfiguration algorithm. Experimental results show an 80% reduction in the average number of routers affected by a reconfiguration event, compared to state-of-the-art techniques. BLINC enables negligible performance degradation in our detection and reconfiguration solution, while solutions based on current techniques suffer a 17-fold latency increase.
Article
In this paper, we present DiSR, a distributed approach to topology discovery and defect mapping in a self-assembled nano network-on-chip. The main aim is to achieve the already-proven properties of segment-based deadlock freedom requiring neither a topology graph as input, nor a centralized algorithm to configure network paths. After introducing the conceptual elements and the execution model of DiSR, we show how the open-source Nanoxim platform has been used to evaluate the proposed approach in the process of discovering irregular network topology while establishing network segments. Comparison against a tree-based approach shows how DiSR still preserves some important properties (coverage, defect tolerance, scalability) while avoiding resource hungry solutions such as virtual channels and hardware redundancy. Finally, we propose a gate-level hardware implementation of the required control logic and storage for DiSR, demonstrating a relatively acceptable impact ranging from 10 to about 20% of the budget of transistors available for each node.
Article
An increasing number of statistical software packages offer exploratory data displays and summaries. For one of these, the graphical technique known as the boxplot, a selective survey of popular software packages revealed several definitions. These alternative constructions arise from different choices in computing quartiles and the fences that determine whether an observation is “outside” and thus plotted individually. We examine these alternatives and their consequences, discuss related background for boxplots (such as the probability that a sample contains one or more outside observations and the average proportion of outside observations in a sample), and offer recommendations that lead to a single standard form of the boxplot.
Article
While most of the electronics industry is dependent on the ever-decreasing size of lithographic transistors, this scaling cannot continue indefinitely. Nanoelectronics (circuits built with components on the scale of 10 nm) seem to be the most promising successor to lithographic based ICs. Molecular-scale devices including diodes, bistable switches, carbon nanotubes, and nanowires have been fabricated and characterized in chemistry labs. Techniques for self-assembling these devices into different architectures have also been demonstrated and used to build small-scale prototypes. While these devices and assembly techniques will lead to nanoscale electronics, they also have the drawback of being prone to defects and transient faults. Fault-tolerance techniques will be crucial to the use of nanoelectronics. Lastly, changes to the software tools that support the fabrication and use of ICs will be needed to extend them to support nanoelectronics. This paper introduces nanoelectronics and reviews the current progress made in research in the areas of technologies, architectures, fault tolerance, and software tools.
Article
During the past 10 years the fields of artificial neural networks (ANNs) and massively parallel computing have been evolving rapidly. In this paper we study the attempts to make ANN algorithms run on massively parallel computers as well as designs of new parallel systems tuned for ANN computing. Following a brief survey of the most commonly used models, the different dimensions of parallelism in ANN computing are identified, and the possibilities for mapping onto the structures of different parallel architectures are analyzed. Different classes of parallel architectures used or designed for ANN are identified. Reported implementations are reviewed and discussed. It is concluded that the regularity of ANN computations suits SIMD architectures perfectly and that broadcast or ring communication can be very efficiently utilized. Bit-serial processing is very interesting for ANN, but hardware support for multiplication should be included. Future artificial neural systems for real-time applications will require flexible processing modules that can be put together to form MIMSIMD systems.
Article
Borůvka presented in 1926 the first solution of the Minimum Spanning Tree Problem (MST) which is generally regarded as a cornerstone of Combinatorial Optimization. In this paper we present the first English translation of both of his pioneering works. This is followed by the survey of development related to the MST problem and by remarks and historical perspective. Out of many available algorithms to solve MST the Borůvka's algorithm is the basis of the fastest known algorithms.
Conference Paper
NOWs are arranged as a switch-based network which allows the layout of both regular and irregular topologies. However, the irregular pattern interconnect makes routing and deadlock avoidance quite complicated. Current proposals use the up*/down* routing algorithm to remove cyclic dependencies between channels and avoid deadlock. Recently, a simple and effective methodology to compute up*/down* routing tables has been proposed by us. The resulting routing algorithm is very effective in irregular topologies. However, its behavior is very poor in regular networks with orthogonal dimensions. Therefore we propose a more flexible routing scheme that is effective in both regular and irregular topologies. Unlike up*/down* routing algorithms, the proposed routing algorithm breaks cycles at different nodes for each direction in the cycle, thus providing better traffic balancing than that provided by up*/down* routing algorithms. Evaluation results modeling a Myrinet network show that the new routing algorithm increases throughput with respect to the original up*/down* routing algorithm by a factor of up to 3:5 for regular networks, also maintaining the performance of the improved up*/down* routing scheme proposed in [7] when applied to irregular networks.
Conference Paper
Process scaling has given designers billions of transistors to work with. As feature sizes near the atomic scale, extensive variation and wearout inevitably make margining uneconomical or impossible. The ElastIC project seeks to address this by creating a large-scale chip-multiprocessor that can self-diagnose, adapt, and heal. Creating large, flexible designs in this environment naturally lends itself to the repetitive nature of network-on-chip (NoC), but the loss of a single link or router will result in complete network failure. In this work we present Vicis, an ElastIC-style NoC that can tolerate the loss of many network components due to wearout induced hard faults. Vicis uses the inherent redundancy in the network and its routers in order to maintain correct operation while incurring a much lower area overhead than previously proposed N-modular redundancy (NMR) based solutions. Each router has a built-in-self-test (BIST) that diagnoses the locations of hard fault and runs a number of algorithms to best use ECC, port swapping, and a crossbar bypass bus to mitigate them. The routers work together to run distributed algorithms to solve network-wide problems as well, protecting the networking against critical failures in individual routers. In this work we show that with stuck-at fault rates as high as 1 in 2000 gates, Vicis will continue to operate with approximately half of its routers still functional and communicating.
Conference Paper
Extreme transistor technology scaling is causing increasing concerns in device reliability: the expected lifetime of individual transistors in complex chips is quickly decreasing, and the problem is expected to worsen at future technology nodes. With complex designs increasingly relying on Networks-on-Chip (NoCs) for on-chip data transfers, a NoC must continue to operate even in the face of many transistor failures. Specifically, it must be able to reconfigure and reroute packets around faults to enable continued operation, i.e., generate new routing paths to replace the old ones upon a failure. In addition to these reliability requirements, NoCs must maintain low latency and high throughput at very low area budget. In this work, we propose a distributed reconfiguration solution named Ariadne, targeting large, aggressively scaled, unreliable NoCs. Ariadne utilizes up*/down* for fast routing at high bandwidth, and upon any number of concurrent network failures in any location, it reconfigures to discover new resilient paths to connect the surviving nodes. Experimental results show that Ariadne provides a 40%-140% latency improvement (when subject to 50 faults in a 64-node NoC) over other on-chip state-of-the-art fault tolerant solutions, while meeting the low area budget of on-chip routers with an overhead of just 1.97%.
Article
A distributed algorithm is presented that constructs the minimum-weight spanning tree in a connected undirected graph with distinct edge weights. A processor exists at each node of the graph, knowing initially only the weights of the adjacent edges. The processors obey the same algorithm and exchange messages with neighbors until the tree is constructed. The total number of messages required for a graph of N nodes and E edges is at most 5N log2N + 2E, and a message contains at most one edge weight plus log28N bits. The algorithm can be initiated spontaneously at any node or at any subset of nodes.
Article
In this paper, we propose a general turn model, called a Tree-turn model, for tree-based routing algorithms on irregular topologies. In the Tree-turn model, links are classified as either a tree link or a cross link and six directions are associated with the channels of links. Then we can prohibit some of the turns formed by these six directions such that an efficient deadlock-free routing algorithm, Tree-turn routing, can be derived. There are three phases to develop the Tree-turn routing. First, a coordinated tree for a given topology is created. Second, a communication graph is constructed based on the topology and the corresponding coordinated tree. Third, the forwarding table is set up by using all-pairs shortest path algorithm according to the prohibited turns in the Tree-turn model and the directions of the channels in the communication graph. To evaluate the performance of the proposed Tree-turn routing, we develop a simulator and implement Tree-turn routing along with up*/down* routing, L-turn routing, and up*/down* routing with DFS methodology. The simulation results show that Tree-turn routing outperforms other routing algorithms for all the test cases.
Conference Paper
We study parallel algorithms for the minimum spanning tree problem, based on the sequential algorithm of O. Boruvka (1926). The target architectures for our algorithm are asynchronous, distributed-memory machines. Analysis of our parallel algorithm on a simple model that is reminiscent of the LogP model, shows that in principle a speedup proportional to the number of processors can be achieved, but that communication costs can be significant. To reduce these costs, we develop a new randomized linear work pointer jumping scheme that performs better than previous linear work algorithms. We also consider empirically the effects of data imbalance on the running time. For the graphs used in our experiments, load balancing schemes result in little improvement in running times. Our implementations on sparse graphs with 64,000 vertices on Thinking Machine's CM-5 achieve a speedup factor of about 4 on 16 processors. On this environment, packaging of messages turns out to be the most effective way to reduce communication costs
Article
Autonet is a self-configuring local area network composed of switches interconnected by 100 Mb/s, full-duplex, point-to-point links. The switches contain 12 ports that are internally connected by a full crossbar. Switches use cut-through to achieve a packet forwarding latency as low as 2 ms/switch. Any switch port can be cabled to any other switch port or to a host network controller. A processor in each switch monitors the network's physical configuration. A distributed algorithm running on the switch processor computes the routes packets are to follow and fills in the packet forwarding table in each switch. With Autonet, distinct paths through the set of network links can carry packets in parallel, allowing many pairs of hosts to communicate simultaneously at full link bandwidth. A 30-switch network with more than 100 hosts has been the service network for Digital's Systems Research Center since February 1990
Release 1.3. The InfiniBand Trade Association
The InfiniBand R Architecture Specification, Volume 1, Release 1.3. The InfiniBand Trade Association, March 2015. [Online]. Available: http://www.infinibandta.org
An 80-tile 1.28 tflops network-on-chip in 65nm cmos
  • S Vangal
  • J Howard
  • G Ruhl
  • S Dighe
  • H Wilson
  • J Tschanz
  • D Finan
  • P Iyer
  • A Singh
  • T Jacob
S. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, P. Iyer, A. Singh, T. Jacob et al., "An 80-tile 1.28 tflops network-on-chip in 65nm cmos," in 2007 IEEE International Solid-State Circuits Conference. Digest of Technical Papers. IEEE, 2007, pp. 98-589.
Apparatus and method for multi-die interconnection
  • J.-P Fricker
  • P Ferolito
J.-P. Fricker and P. Ferolito, "Apparatus and method for multi-die interconnection," Jul. 30 2019, uS Patent App. 10/366,967.
On the shortest spanning subtree of a graph and the traveling salesman problem
J. B. Kruskal, "On the shortest spanning subtree of a graph and the traveling salesman problem," Proceedings of the American Mathematical society, vol. 7, no. 1, pp. 48-50, 1956.