[Show abstract][Hide abstract] ABSTRACT: Our toolchain for accelerating application called Courier-FPGA, is designed
for utilize the processing power of CPU-FPGA platforms for software programmers
and non-expert users. It automatically gathers runtime information of library
functions from a running target binary, and constructs the function call graph
including input-output data. Then, it uses corresponding predefined hardware
modules if these are ready for FPGA and prepares software functions on CPU by
using Pipeline Generator. The Pipeline Generator builds a pipeline control
program by using Intel Threading Building Block to run both hardware modules
and software functions in parallel. Finally, Courier-FPGA dynamically replaces
the original functions in the binary and accelerates it by using the built
pipeline. Courier-FPGA performs these acceleration processes without user
intervention, source code tweaks or re-compilations of the binary. We describe
the technical details of this mixed software hardware pipeline on CPU-FPGA
platforms in this paper. In our case study, Courier-FPGA was used to accelerate
a corner detection using the Harris-Stephens method application binary on the
Zynq platform. A series of functions were off-loaded, and speed up 15.36 times
was achieved by using the built pipeline.
[Show abstract][Hide abstract] ABSTRACT: A 32-bit CPU which operates with the lowest energy of 13.4 pJ/cycle at 0.35V and 14MHz, operates at 0.22V to 1.2V and with 0.14µA sleep current is demonstrated. The low power performance is attained by Reverse-Body-Bias-Assisted 65nm SOTB CMOS (Silicon On Thin Buried oxide) technology. The CPU can operate more than 100 years with 610mAH Li battery.
[Show abstract][Hide abstract] ABSTRACT: We have concluded that with a router using MMTH the power consumption is associated with the bit change rate of the data, and when NAS parallel benchmarks work on NoC, it is reduced by 42.4% on average at 2GHz compared with a traditional FIFO implementation. The performance degradation caused by the delay of the reading time can be mostly saved by the look-ahead technique in the router.
[Show abstract][Hide abstract] ABSTRACT: We show a task level pipelining on multiple accelerators with PEACH2. PEACH2, which is implemented on FPGA, enables ultra low latency direct communication among multiple accelerators over computational nodes. By installing PEACH2, typical high performance computation nodes are tightly coupled. In this environment, application can be accelerated by exploiting not only data level parallelism, but also task level parallelism. Furthermore, we can process multiple task on multiple accelerators in a pipelined manner. In our evaluation, pipelined
application which is implemented in a task level pipelined manner achieves 52% speed up compared to a single GPU.
[Show abstract][Hide abstract] ABSTRACT: A wireless 3D NoC architecture is described for building-block SiPs, in which the number of hardware components (or chips) in a package can be changed after chips have been fabricated. The architecture uses inductive-coupling links that can connect more than two examined dies without wire connections. Each chip has data transceivers for the uplink and downlink in order to communicate with its neighboring chips in the package. These chips form a vertical unidirectional ring network so as to fully exploit the flexibility of the wireless approach that enables us to add, remove, and swap the chips in the ring. To avoid protocol and structural deadlocks in the ring, we use bubble flow control, which does not rely on the conventional VC-based deadlock avoidance mechanism. In addition, we propose a bidirectional communication scheme to form a bidirectional ring network by using the inductive-coupling transceivers that can dynamically change the communication modes, such as TX, RX, and Idle modes. This paper illustrates the inductive-coupling transceiver circuits, which can carry high data transfer rates of up to 8 Gbps per channel, for the wireless 3D NoC. It also illustrates an implementation of a wireless 3D NoC that has on-chip routers and transceivers implemented with a 65 nm process in order to show the feasibility of our proposal. The vertical bubble flow control and conventional VC-based approach on the uni- and bidirectional ring networks are compared with the vertical broadcast bus in terms of throughput, hardware amount, and application performance using a full system multiprocessor simulator. The results show that the proposed bidirectional communication scheme efficiently improves application performance without adding any inductive-coupling transceivers. In addition, the proposed vertical bubble flow network outperforms the conventional VC-based approach by 7.9-12.5 percent with a 33.5 percent smaller router area for building-block SiPs connecting up to eight- chips.
IEEE Transactions on Computers 01/2014; 63(3):748-763. · 1.38 Impact Factor
[Show abstract][Hide abstract] ABSTRACT: In this paper, we demonstrate that we can reduce the communication latency significantly by inserting a fraction of randomness into a wireless 3D NoC (where CMOS wireless links are used for vertical inter-chip communication) when considering the physical constraints of the 3D design space. Towards this end, we consider two cases, namely 1) replacing existing horizontal 2D links in a wireless 3D NoC with randomized shortcut NoC links and 2) enabling full connectivity by adding a randomized NoC layer to a wireless 3D platform with partial or no horizontal connectivity. Consequently, the packet routing is optimized by exploiting both the existing and the newly added random NoC. At the same time, by adding randomly wired shortcut NoCs to a wireless 3D platform, a good balance can be established between the modularity of the design and the minimum randomness needed to achieve low latency, and experimental results show that by adding a random NoC chip to wireless 3D CMPs without built-in horizontal connectivity, the communication latency can be reduced by as much as 26.2% when compared to adding a 2D mesh NoC. Also, the application execution time and average flit transfer energy can be improved accordingly.
[Show abstract][Hide abstract] ABSTRACT: Power-performance efficiency is still remaining a primary concern for microprocessor designers. One of the sources of power inefficiency for recent LSI chips is increasing leakage power consumption. Power-gating is a well known technique to reduce leakage power consumption by switching off the power supply to idle logic blocks. Recently, fine-grained power-gating is emerged as a technique to minimize leakage current during the active processor cycles by switching on and off a logic blocks in much finer temporal/spatial granularity. Though fine-grained power-gating is useful, a comprehensive evaluation and analysis has not been conducted on a real LSI chips. In this paper, we evaluate fine-grained run-time power-gating for microprocessors' functional units using a real embedded microprocessor. We also introduce an architecture and compiler co-operative power-gating scheme which mitigates negative power reduction caused by the energy overhead associated with finegrained power-gating. The experimental results with a fabricated core shows that a hardware-based scheme saves power consumption of functional units by 44% and hardware compiler co-operative scheme further improves power efficiency by 5.9% when core temperature is 25 °C.
[Show abstract][Hide abstract] ABSTRACT: Leakage power is a serious problem especially for accerelators which use a large size Processing Element (PE) array. Here, a low power reconfigurable accelerator called Cool Mega Array (CMA) with back-gate bias control (CMA-bb) is implemented and evaluated. In CMA-bb, the back-gate bias of the microcontroller and PE array can be controlled independently. In the idle mode, reverse bias is given to the both parts to suppress the leakage current. When high performance is required, forward bias is used to increase the clock frequency. For simple applications, the operational power can be suppressed by using reverse bias only in the PE array. The real chip is implemented with a 65nm experimental process for low leakage applications. The evaluation results show that the leakage current can be suppressed to 300μA by using the reverse bias. The operational frequency is increased from 39MHz to 50MHz with up to 21% increase of operational power by using the forward bias. For simple applications, 8% to 9.4% of operational power is saved by giving reverse bias only to the PE array.
2013 International Conference on Field-Programmable Technology (FPT); 12/2013
[Show abstract][Hide abstract] ABSTRACT: Fast Aerodynamics Routines (FaSTAR) is one of the most recent fluid dynamics software package. The problem of FaSTAR is hard to be executed in parallel machines because of its irregular and unpredictable data structure. Exploiting reconfigurable hardware with their advantages to make up for the inadequacy of the existing high performance computers had gradually become the solutions. However, a single FPGA is not enough for the FaSTAR package because the whole module is very large. Instead of using many FPGAs, partially reconfigurable hardware available in recent FPGAs is explored for this application. Advection term computation module in FaSTAR is chosen as a target subroutine. We proposed a reconfigurable flux calculation scheme using partial reconfiguration technique to save hardware resources to fit in a single FPGA. We developed flux computational module and five flux calculation schemes are implemented as reconfigurable modules. This implementation has advantages of up to 62.75% resource saving and enhancing the configuration speed by 6.28 times. Performance evaluation also shows that 2.65 times acceleration is achieved compared to Intel Core 2 Duo at 2.4 GHz.
2013 International Conference on Field-Programmable Technology (FPT); 12/2013
[Show abstract][Hide abstract] ABSTRACT: Photon mapping is a kind of rendering techniques which enables depicting complicated light concentrations for 3D graphics. Searching kd-tree of photons with k-near neighbor search (k-NN) requires a large amount of computations. As k-NN search includes high degree of parallelism, the operation can be accelerated by GPU and recent multi-core microprocessors. However, memory access bottleneck will limit their computation speed. Here, as an alternative approach, an FPGA implementation of k-NN search operation in kd-tree is proposed. In the proposed design, we maximized the effective throughput of the block RAM by connecting multiple Query Modules to both ports of RAM. Furthermore, an implementation of the discovery process of the max distance which is not depending on the number of Estimate-Photons is proposed. Through the implementation on Spartan6, Virtex6 and Virtex7, it appears that 26 fundamental modules can be mounted on Virtex7. As a result, the proposed module achieved the throughput of approximately 282 times as that of software execution at maximum.
Proceedings of the 9th international conference on Reconfigurable Computing: architectures, tools, and applications; 03/2013
[Show abstract][Hide abstract] ABSTRACT: The authors developed a scalable heterogeneous multicore processor. 3D heterogeneous chip stacking of a general-purpose CPU and reconfigurable multicore accelerators enables various trade-offs between performance and energy consumption. The stacked chips interconnect through a scalable 3D network on a chip (NoC). By simply changing the number of stacked accelerator chips, processor parallelism can be widely scaled. No design change is needed, and hence, no additional nonrecurring engineering (NRE) cost is required. An inductive-coupling ThruChip Interface (TCI) is applied to stacked-chip communications, forming a low-cost and robust high-speed 3D NoC. The authors developed a prototype system called Cube-1 with 65-nm CMOS test chips, and confirmed successful system operations, including 10 hours of continuous Linux OS operation. Simple filters and a streaming application were implemented on Cube-1 and performance acceleration up to about three times was achieved.
[Show abstract][Hide abstract] ABSTRACT: Inductive-coupling is yet another 3D integration technique that can be used to stack more than three known-good-dies in a SiP without wire connections. We present a topology-agnostic 3D CMP architecture using inductive-coupling that offers great flexibility in customizing the number of processor chips, SRAM chips, and DRAM chips in a SiP after chips have been fabricated. In this paper, first, we propose a routing protocol that exchanges the network information between all chips in a given SiP to establish efficient deadlock-free routing paths. Second, we propose its optimization technique that analyzes the application traffic patterns and selects different spanning tree roots so as to minimize the average hop counts and improve the application performance.
Design Automation Conference (ASP-DAC), 2013 18th Asia and South Pacific; 01/2013
[Show abstract][Hide abstract] ABSTRACT: A scalable heterogeneous multi-core processor is developed. 3D heterogeneous chip stacking of a general-purpose CPU and reconfigurable multi-core accelerators improves computational energy efficiency by proper task assignment and massive parallel computing. The stacked chips interconnect through a scalable 3D Network on Chip (NoC). By simply changing the number of stacked accelerator chips, processor parallelism can be widely scaled. In combination with Dynamic Voltage and Frequency Scaling (DVFS), the energy efficiency can be optimized for various performance requirements. No design change is needed, and hence no additional Non-Recurring Engineering (NRE) cost. An inductive-coupling ThruChip Interface (TCI) is applied to stacked-chip communications, forming a low-cost and robust high-speed 3D NoC. A prototype demonstration system has been developed with 65nm CMOS test chips. Successful system operations including 10-hours continuous Linux OS operation are confirmed for the first time.
[Show abstract][Hide abstract] ABSTRACT: Cube-1 is a heterogeneous multi-core processor which can achieve the required performance with the least energy consumption as possible. It can control the performance and energy with two levels: (1) the number of accelerators can be easily changed by increasing or decreasing the number of stacked chips after fabrication, as they are connected with inductive coupling links. (2) The supply voltage for PE array of the accelerator can be controlled by the host CPU so that the required performance can be obtained with a minimum supply voltage.
Field Programmable Logic and Applications (FPL), 2013 23rd International Conference on; 01/2013
[Show abstract][Hide abstract] ABSTRACT: A contact-less approach that connects chips in vertical dimension has a great potential to customize components in 3-D chip multiprocessors (CMPs), assuming card-style components inserted to a single cartridge communicate each other wirelessly using inductive-coupling technology. To simplify the vertical communication interfaces, static Time Division Multiple Access (TDMA) is used for the vertical broadcast buses, while arbitrary or customized topologies can be used for intra-chip networks. In this paper, we propose the Headfirst sliding routing scheme to overcome the simple static TDMA-based vertical buses. Each vertical bus grants a communication time-slot for different chips at the same time periodically, which means these buses work with different phases. Depending on the current time, packets are routed toward the best vertical bus (elevator) just before the elevator acquires its communication time-slot. Network simulations show that Headfirst sliding routing reduces the communication latency by up to 32.7%, and full-system CMP simulations show that it reduces application execution time by 9.9%. Synthesis results show that the area and critical path delay overheads are modest.
Networks on Chip (NoCS), 2013 Seventh IEEE/ACM International Symposium on; 01/2013
[Show abstract][Hide abstract] ABSTRACT: Network-on-Chips (NoCs) with wireless inductive coupling have been utilized in real heterogeneous multicore systems. Although the inductive-coupling itself is energy-efficient (e.g., 0.14pJ per bit ), inductors continuously consume a certain amount of power, regardless of packet transfers. That is, inductors waste significant power especially when the utilization of vertical links (i.e., inductors) is low, which is a typical use case of 3-D ICs that the most communications are within a chip while the communications between chips are infrequent. Such power can be reduced by shutting down the link by controlling bias voltage of transistors used in the transmitter and receiver. Here, we propose generalized link on-off techniques for wireless NoCs with irregular network topologies. The simulation shows that the proposed low-power techniques reduce the power consumption by 43.8%-55.0%.
[Show abstract][Hide abstract] ABSTRACT: This paper presents a design of an FPGA-based Blokus Duo solver. It searches a game tree by using the miniMax algorithm with alpha-beta pruning and move ordering. In addition, HLS tool called CyberWorkBench (CWB) is used to implement hardware. By making the use of functions in CWB, parallel fully pipelined design is generated. The implemented solver works at 100MHz with Xilinx Spartan-6 XC6SLX45 FPGA on the Digilent Atlys board. It can search states after three moves in most cases.
Field-Programmable Technology (FPT), 2013 International Conference on; 01/2013
[Show abstract][Hide abstract] ABSTRACT: We demonstrate task level pipelining on multiple accelerators with PEACH2. PEACH2 is implmented on FPGA, and enables ultra low latency direct communication among multiple accelerators over computational nodes. By installing PEACH2, typical high performance computation nodes are tightly coupled. In this environment, application can be accelerated by exploiting not only data level parallelism, but also task level pipelined operation. Furthermore, we can processe multiple task on multiple accelerators in a pipelined manner. In our demonstration, application achieves 44% speed up compared to a single GPU.
Field-Programmable Technology (FPT), 2013 International Conference on; 01/2013
[Show abstract][Hide abstract] ABSTRACT: As the scales of parallel applications and platforms increase the negative impact of communication latencies on performance becomes large. Fortunately, modern High Performance Computing (HPC) systems can exploit low-latency topologies of high-radix switches. In this context, we propose the use of random shortcut topologies, which are generated by augmenting classical topologies with random links. Using graph analysis we find that these topologies, when compared to non-random topologies of the same degree, lead to drastically reduced diameter and average shortest path length. The best results are obtained when adding random links to a ring topology, meaning that good random shortcut topologies can easily be generated for arbitrary numbers of switches. Using flit-level discrete event simulation we find that random shortcut topologies achieve throughput comparable to and latency lower than that of existing non-random topologies such as hypercubes and tori. Finally, we discuss and quantify practical challenges for random shortcut topologies, including routing scalability and larger physical cable lengths.
[Show abstract][Hide abstract] ABSTRACT: In aerospace application, computational fluid dynamics (CFD) is recognized as a cost effective design tool. UPACS, a package for CFD is convenient for users, which has various solvers to support large scale of complexity. The problem is its computation speed which is hard to be enhanced by using clusters due to its complex memory access patterns. As an economical solution, accelerators using FPGAs are hopeful candidates. However, the total scale of UPACS is too large to be implemented on small numbers of FPGAs. For cost efficient implementation, partial reconfiguration which can dynamically reconfigure only required functions is proposed in this paper. MUSCL scheme, a main function used frequently in UPACS is selected as a target. Partial reconfiguration is applied to the flux limiter functions in MUSCL. Two reconfigurable partitions are created for Turbulence MUSCL and Convection MUSCL. All limiter functions are developed independently and synthesized separately from the top MUSCL module. Required limiter functions for both MUSCLs are loaded dynamically without interfering each other operation. This implementation has successfully reduced the resource utilization by 60%. Total power consumption is also reduced by 29%. Configuration speed is improved by 15-times faster as compared to fully reconfiguration method. All implemented functions achieved at least 17 times speed-up compared with the software implementation.
2012 7th International Workshop on Reconfigurable and Communication-Centric Systems-on-Chip (ReCoSoC); 07/2012