Conference Paper

Express virtual channels: towards the ideal interconnection fabric.

DOI: 10.1145/1250662.1250681 Conference: 34th International Symposium on Computer Architecture (ISCA 2007), June 9-13, 2007, San Diego, California, USA
Source: DBLP

ABSTRACT Due to wire delay scalability and bandwidth limitations inherent in shared buses and dedicated links, packet-switched on-chip interconnection networks are fast emerging as the pervasive communication fabric to connect different processing elements in many-core chips. However, current state-of-the-art packet-switched networks rely on complex routers, which increase the communication overhead and energy consumption as compared to the ideal interconnection fabric. In this paper, we try to close the gap between the state-of-the-art packet-switched network and the ideal interconnect by proposing express virtual channels (EVCs), a novel flow control mechanism which allows packets to virtually bypass intermediate routers along their path in a completely non-speculative fashion, thereby lowering the energy/delay towards that of a dedicated wire while simultaneously approaching ideal throughput with a practical design suitable for on-chip networks. Our evaluation results using a detailed cycle-accurate simulator on a range of synthetic traffic and SPLASH benchmark traces show up to 84% reduction in packet latency and up to 23% improvement in throughput while reducing the average router energy consumption by up to 38% over an existing state-of-the-art packet-switched design. When compared to the ideal interconnect, EVCs add just two cycles to the no-load latency, and are within 14% of the ideal throughput. Moreover, we show that the proposed design incurs a minimal hardware overhead while exhibiting excellent scalability with increasing network sizes.

  • ABSTRACT: Network-on-chip (NoC) has rapidly become a promising alternative for complex system-on-chip architectures, including recent multicore architectures. Optimizing NoC architectures with respect to the design objectives that suit a particular application domain is crucial for achieving high-performance and energy-efficient customized solutions. Although many studies have provided solutions for different aspects of NoC design, a comprehensive NoC system solution has not yet emerged. This paper presents a novel methodology for complex on-chip communication problems that reduces power, latency and area overhead. Our proposed NoC communication architecture is based on setting up virtual source–destination paths between selected pairs of NoC cores so that packets traveling between distant nodes in the network can bypass intermediate routers along these virtual paths. In this scheme, the paths are constructed for an application at design time, based on its task graph. A run-time scheduling mechanism is then applied to improve the buffer management, virtual channel and switch allocation schemes, so that the constructed paths are optimized dynamically. Moreover, our design reduces the router complexity and its overheads. The suggested router has been implemented on the Xilinx Virtex-5 FPGA family. Evaluation results obtained with the SPLASH-2 benchmark suite show that, in comparison with the conventional NoC router, the proposed router achieves 25% and 53% reductions in latency and energy, respectively, at the cost of a 3.5% area overhead.
    Microelectronics Journal 04/2014 · 0.91 Impact Factor
  • ABSTRACT: Despite the higher scalability and parallelism integration offered by the Network-on-Chip (NoC) over traditional shared-bus based systems, it is still not an ideal solution for future large-scale Systems-on-Chip (SoCs), due to limitations such as high power consumption, high communication cost, and low throughput. Recently, extending NoC into the third dimension (3D-NoC) has been proposed to deal with those problems, as it offers lower power consumption and higher speed. One of the most important design choices for a 3D-NoC implementation is the routing algorithm, as it controls the path that a flit follows when traveling through the network, which has a direct impact on the overall system performance. In this context, we previously developed a 3D-NoC named OASIS, which is a 4x4x4 mesh topology design using wormhole switching and a Stall-and-Go flow control scheme. In this paper, we describe the different components of the 3D-OASIS-NoC (3D-ONoC) architecture, including our proposed Look-ahead-XYZ routing scheme (LA-XYZ), which aims to optimize the router pipeline design. Evaluation results using JPEG encoder and Matrix applications showed that 3D-OASIS-NoC reduces the number of hops by 41% and the average stall count to 74%, and reduces the execution time by up to 40% when compared with 2D-OASIS-NoC, while achieving a 16% reduction in dynamic power.
  • ABSTRACT: As the size of FPGA devices grows following Moore's law, it becomes possible to put a complete manycore system onto a single FPGA chip. The centralized memory hierarchy of typical embedded systems, in which both data and instructions are stored in off-chip global memory, introduces a bus contention problem as the number of processing cores increases. In this work, we present our exploration into how distributed multi-tiered memory hierarchies can affect the scalability of manycore systems. We use Xilinx Virtex FPGA devices as the testing platforms and buses as the interconnect. Several variants of the centralized memory hierarchy and the distributed memory hierarchy are compared by running various benchmarks, including matrix multiplication, IDEA encryption and 3D FFT. The results demonstrate the good scalability of the distributed memory hierarchy for systems of up to 32 MicroBlaze processors, a limit imposed by the FPGA resources on the Virtex-6LX240T device.

