Conference Paper

Express virtual channels: towards the ideal interconnection fabric.

DOI: 10.1145/1250662.1250681 Conference: 34th International Symposium on Computer Architecture (ISCA 2007), June 9-13, 2007, San Diego, California, USA
Source: DBLP

ABSTRACT Due to wire delay scalability and bandwidth limitations inherent in shared buses and dedicated links, packet-switched on-chip interconnection networks are fast emerging as the pervasive communication fabric to connect different processing elements in many-core chips. However, current state-of-the-art packet-switched networks rely on complex routers, which increase the communication overhead and energy consumption compared to the ideal interconnection fabric. In this paper, we try to close the gap between the state-of-the-art packet-switched network and the ideal interconnect by proposing express virtual channels (EVCs), a novel flow control mechanism which allows packets to virtually bypass intermediate routers along their path in a completely non-speculative fashion, thereby lowering the energy/delay towards that of a dedicated wire while simultaneously approaching ideal throughput with a practical design suitable for on-chip networks. Our evaluation results using a detailed cycle-accurate simulator on a range of synthetic traffic and SPLASH benchmark traces show up to 84% reduction in packet latency and up to 23% improvement in throughput while reducing the average router energy consumption by up to 38% over an existing state-of-the-art packet-switched design. When compared to the ideal interconnect, EVCs add just two cycles to the no-load latency, and are within 14% of the ideal throughput. Moreover, we show that the proposed design incurs a minimal hardware overhead while exhibiting excellent scalability with increasing network sizes.

  • ABSTRACT: Packet-based networks-on-chip (NoC) are considered among the most viable candidates for the on-chip interconnection network of many-core chips. Unrelenting increases in the number of processing elements on a single chip die necessitate a scalable and efficient communication fabric. The resulting enlargement of the on-chip network size has been accompanied by an equivalent widening of the physical inter-router channels. However, the growing link bandwidth is not fully utilized, because the packet size is not always a multiple of the channel width. While slicing of the physical channel enhances link utilization, it incurs additional delay, because the number of flits per packet also increases. This paper proposes a novel router micro-architecture that employs fine-grained bandwidth “sharding” (i.e., partitioning) and stealing in order to mitigate the elevation in the zero-load latency caused by slicing. Consequently, the zero-load latency of the Sharded Router becomes identical to that of a conventional router, whereas its throughput is markedly improved by fully utilizing all available bandwidth. Detailed experiments using a full-system simulation framework indicate that the proposed router reduces the average network latency by up to 19% and the execution time of real multi-threaded workloads by up to 43%. Finally, hardware synthesis analysis verifies the modest area overhead of the Sharded Router over a conventional design.
    Parallel Computing 09/2013; 39(9):372–388. · 1.21 Impact Factor
  • ABSTRACT: In recent years, chip multiprocessors (CMPs) have seen an increasing asymmetry between the number of cores and the number of memory access points on a single die, prompting a new study of network topologies that efficiently connect many nodes to few nodes. In this paper, we evaluate the latency and power efficiency of the tapered-fat tree (TFT) topology on a cycle-accurate simulator intended to model the TILE64 multiprocessor. We replace the original mesh network with two TFT networks (one for memory requests, one for memory responses) and run four synthetic benchmarks modeled after those in the PARSEC suite. Because several connections in the TFT network require global wires, we also modeled the multi-cycle latencies using a wire-delay model. Our simulator keeps track of activity factors in each of the routers, which we combine with the Orion power models to determine the switching power of the routers. We determined that for applications with large amounts of sharing and little off-chip traffic, the TFT topology offers negligible advantage over the mesh. However, the benchmarks that exhibited large amounts of off-chip traffic
  • ABSTRACT: As the size of FPGA devices grows following Moore's law, it becomes possible to put a complete manycore system onto a single FPGA chip. The centralized memory hierarchy of typical embedded systems, in which both data and instructions are stored in the off-chip global memory, introduces a bus contention problem as the number of processing cores increases. In this work, we present our exploration into how distributed multi-tiered memory hierarchies can affect the scalability of manycore systems. We use the Xilinx Virtex FPGA devices as the testing platforms and buses as the interconnect. Several variants of the centralized memory hierarchy and the distributed memory hierarchy are compared by running various benchmarks, including matrix multiplication, IDEA encryption and 3D FFT. The results demonstrate the good scalability of the distributed memory hierarchy for systems of up to 32 MicroBlaze processors, a limit set by the FPGA resources on the Virtex-6LX240T device.

