Hideharu Amano

Keio University, Tōkyō, Japan

Publications (298) · 25.31 Total Impact

  • Takaaki Miyajima, David Thomas, Hideharu Amano
    ABSTRACT: Courier-FPGA, our toolchain for accelerating applications, is designed to let software programmers and non-expert users exploit the processing power of CPU-FPGA platforms. It automatically gathers runtime information about library functions from a running target binary and constructs the function call graph, including input and output data. It then uses corresponding predefined hardware modules, when these are available for the FPGA, and prepares software functions on the CPU using the Pipeline Generator. The Pipeline Generator builds a pipeline control program with Intel Threading Building Blocks so that hardware modules and software functions run in parallel. Finally, Courier-FPGA dynamically replaces the original functions in the binary and accelerates it with the built pipeline. Courier-FPGA performs this acceleration without user intervention, source code changes, or recompilation of the binary. This paper describes the technical details of this mixed software/hardware pipeline on CPU-FPGA platforms. In our case study, Courier-FPGA was used to accelerate a corner-detection application binary based on the Harris-Stephens method on the Zynq platform. A series of functions was off-loaded, and a speed-up of 15.36 times was achieved with the built pipeline. (An illustrative pipeline sketch follows this entry.)
    08/2014;
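    The abstract above describes a control program that keeps CPU functions and FPGA modules busy at the same time by pipelining them. The actual tool builds this with Intel Threading Building Blocks in C++; the Python sketch below only illustrates the idea of overlapping software and hardware stages with one thread per stage, and every stage name is a hypothetical stand-in.

```python
# Illustrative sketch only (not Courier-FPGA code): a three-stage pipeline in
# which each stage runs in its own thread, mimicking how a pipeline control
# program can overlap software stages with calls into hardware modules.
import threading
import queue

def run_stage(func, inq, outq):
    """Pull items from inq, apply func, push results to outq until poisoned."""
    while True:
        item = inq.get()
        if item is None:            # poison pill: propagate and stop
            if outq is not None:
                outq.put(None)
            break
        result = func(item)
        if outq is not None:
            outq.put(result)

def software_preprocess(frame):     # would run on the CPU
    return frame

def hardware_corner_detect(frame):  # would be a call into an FPGA module
    return frame

def software_postprocess(frame):    # would run on the CPU
    return frame

def run_pipeline(frames):
    q1, q2, q3 = queue.Queue(), queue.Queue(), queue.Queue()
    stages = [
        threading.Thread(target=run_stage, args=(software_preprocess, q1, q2)),
        threading.Thread(target=run_stage, args=(hardware_corner_detect, q2, q3)),
        threading.Thread(target=run_stage, args=(software_postprocess, q3, None)),
    ]
    for t in stages:
        t.start()
    for f in frames:
        q1.put(f)
    q1.put(None)                    # signal end of input
    for t in stages:
        t.join()

if __name__ == "__main__":
    run_pipeline(range(8))
```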
  • ABSTRACT: A 32-bit CPU that achieves a minimum energy of 13.4 pJ/cycle at 0.35 V and 14 MHz, operates from 0.22 V to 1.2 V, and draws a sleep current of 0.14 µA is demonstrated. The low-power performance is attained by Reverse-Body-Bias-Assisted 65 nm SOTB (Silicon On Thin Buried oxide) CMOS technology. The CPU can operate for more than 100 years on a 610 mAh Li battery. (A back-of-the-envelope check of that figure follows this entry.)
    COOL Chips XVII, Yokohama, Japan; 04/2014
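    The 100-year figure only makes sense if the CPU spends nearly all of its time in the 0.14 µA sleep state. The sketch below redoes the arithmetic under an assumed duty cycle; the duty cycle, and the simplification that charge is drawn directly at the core voltage with no regulator losses, are our assumptions, not the paper's.

```python
# Back-of-the-envelope check of the battery-life claim (assumptions are ours):
# active power at the minimum-energy point plus the quoted 0.14 uA sleep
# current, combined with an assumed active duty cycle. Battery voltage and
# regulator efficiency are ignored; this is a crude charge-based estimate.
ENERGY_PER_CYCLE = 13.4e-12      # J/cycle at 0.35 V (from the abstract)
FREQ = 14e6                      # Hz at the minimum-energy point
V_ACTIVE = 0.35                  # V
SLEEP_CURRENT = 0.14e-6          # A (from the abstract)
BATTERY = 0.610                  # Ah (610 mAh Li battery)

active_power = ENERGY_PER_CYCLE * FREQ      # ~188 uW
active_current = active_power / V_ACTIVE    # ~536 uA

def battery_life_years(duty_cycle):
    """Average current for a given active duty cycle -> battery life in years."""
    avg_current = duty_cycle * active_current + (1 - duty_cycle) * SLEEP_CURRENT
    hours = BATTERY / avg_current
    return hours / (24 * 365)

# Sleep-dominated operation is what makes a >100-year figure plausible:
print(f"always active : {battery_life_years(1.0):7.1f} years")    # ~0.1
print(f"0.1% duty     : {battery_life_years(0.001):7.1f} years")  # ~100
print(f"pure sleep    : {battery_life_years(0.0):7.1f} years")    # ~500
```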
  • ABSTRACT: A wireless 3D NoC architecture is described for building-block SiPs, in which the number of hardware components (or chips) in a package can be changed after the chips have been fabricated. The architecture uses inductive-coupling links that can connect more than two known-good dies without wire connections. Each chip has data transceivers for the uplink and downlink in order to communicate with its neighboring chips in the package. These chips form a vertical unidirectional ring network so as to fully exploit the flexibility of the wireless approach, which enables us to add, remove, and swap chips in the ring. To avoid protocol and structural deadlocks in the ring, we use bubble flow control, which does not rely on the conventional VC-based deadlock-avoidance mechanism. In addition, we propose a bidirectional communication scheme that forms a bidirectional ring network using inductive-coupling transceivers that can dynamically change their communication modes, such as TX, RX, and Idle. This paper illustrates the inductive-coupling transceiver circuits, which sustain data transfer rates of up to 8 Gbps per channel, for the wireless 3D NoC. It also illustrates an implementation of a wireless 3D NoC with on-chip routers and transceivers in a 65 nm process to show the feasibility of our proposal. The vertical bubble flow control and the conventional VC-based approach on the uni- and bidirectional ring networks are compared with a vertical broadcast bus in terms of throughput, hardware amount, and application performance using a full-system multiprocessor simulator. The results show that the proposed bidirectional communication scheme efficiently improves application performance without adding any inductive-coupling transceivers. In addition, the proposed vertical bubble flow network outperforms the conventional VC-based approach by 7.9 to 12.5 percent with a 33.5 percent smaller router area for building-block SiPs connecting up to eight chips. (An illustrative sketch of the bubble injection rule follows this entry.)
    IEEE Transactions on Computers 01/2014; 63(3):748-763. · 1.38 Impact Factor
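    Bubble flow control, mentioned above, avoids deadlock on a ring without virtual channels by never letting a newly injected packet consume the last free buffer slot. The sketch below shows that injection rule in isolation; the buffer capacity and class structure are our own illustrative choices, not the paper's RTL.

```python
# Illustrative sketch (not the paper's hardware): the injection rule of bubble
# flow control on a unidirectional ring. A new packet may enter a ring buffer
# only if at least two slots are free, so one free slot (the "bubble") always
# remains and packets already on the ring can keep moving -> no deadlock.
from collections import deque

class RingNode:
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.buffer = deque()

    def free_slots(self):
        return self.capacity - len(self.buffer)

    def can_inject(self):
        # Bubble condition: injection needs >= 2 free slots.
        return self.free_slots() >= 2

    def inject(self, packet):
        if not self.can_inject():
            return False
        self.buffer.append(packet)
        return True

    def forward(self, downstream):
        # Packets already travelling on the ring only need 1 free slot.
        if self.buffer and downstream.free_slots() >= 1:
            downstream.buffer.append(self.buffer.popleft())
            return True
        return False

# Example: with capacity 4, a node stops accepting new packets at 3 occupants,
# but can still forward in-flight packets into a neighbour with 1 free slot.
a, b = RingNode(), RingNode()
for p in range(5):
    print(p, a.inject(p))      # the 4th injection (p=3) is refused
a.forward(b)
```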
  • ABSTRACT: Photon mapping is a rendering technique that can depict complicated light concentrations in 3D graphics. Searching the kd-tree of photons with a k-nearest neighbor (k-NN) search requires a large amount of computation. As k-NN search includes a high degree of parallelism, the operation can be accelerated by GPUs and recent multi-core microprocessors; however, the memory access bottleneck limits their computation speed. Here, as an alternative approach, an FPGA implementation of the k-NN search operation in a kd-tree is proposed. In the proposed design, we maximize the effective throughput of the block RAM by connecting multiple Query Modules to both ports of the RAM. Furthermore, an implementation of the maximum-distance discovery process that does not depend on the number of Estimate-Photons is proposed. Through implementations on Spartan-6, Virtex-6, and Virtex-7, it appears that 26 fundamental modules can be mounted on the Virtex-7. As a result, the proposed module achieves a throughput of approximately 282 times that of software execution at maximum. (A software sketch of the k-NN search follows this entry.)
    Proceedings of the 9th international conference on Reconfigurable Computing: architectures, tools, and applications; 03/2013
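    For readers unfamiliar with the operation being accelerated, the sketch below is a plain software version of k-NN search over a kd-tree of photon positions, using a bounded max-heap whose current k-th distance prunes subtrees. It is our own illustrative code, not the paper's hardware design.

```python
# Illustrative sketch (software, not the FPGA design): k-nearest-neighbour
# search over a kd-tree, as used for photon gathering. Names and structure
# are ours, chosen for clarity.
import heapq

class KDNode:
    def __init__(self, point, axis, left=None, right=None):
        self.point, self.axis, self.left, self.right = point, axis, left, right

def build_kdtree(points, depth=0):
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return KDNode(points[mid], axis,
                  build_kdtree(points[:mid], depth + 1),
                  build_kdtree(points[mid + 1:], depth + 1))

def knn(root, query, k):
    heap = []  # max-heap of (-dist2, point), at most k entries

    def visit(node):
        if node is None:
            return
        d2 = sum((a - b) ** 2 for a, b in zip(node.point, query))
        if len(heap) < k:
            heapq.heappush(heap, (-d2, node.point))
        elif d2 < -heap[0][0]:
            heapq.heapreplace(heap, (-d2, node.point))
        diff = query[node.axis] - node.point[node.axis]
        near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
        visit(near)
        # Only cross the splitting plane if it can still hold a closer photon.
        if len(heap) < k or diff * diff < -heap[0][0]:
            visit(far)

    visit(root)
    return sorted((-nd2, p) for nd2, p in heap)

photons = [(0.1, 0.2, 0.0), (0.5, 0.5, 0.5), (0.2, 0.1, 0.1), (0.9, 0.9, 0.9)]
tree = build_kdtree(photons)
print(knn(tree, (0.0, 0.0, 0.0), k=2))
```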
  • ABSTRACT: The authors developed a scalable heterogeneous multicore processor. 3D heterogeneous chip stacking of a general-purpose CPU and reconfigurable multicore accelerators enables various trade-offs between performance and energy consumption. The stacked chips interconnect through a scalable 3D network-on-chip (NoC). By simply changing the number of stacked accelerator chips, processor parallelism can be widely scaled. No design change is needed, and hence no additional nonrecurring engineering (NRE) cost is required. An inductive-coupling ThruChip Interface (TCI) is applied to stacked-chip communication, forming a low-cost, robust, high-speed 3D NoC. The authors developed a prototype system called Cube-1 with 65-nm CMOS test chips and confirmed successful system operation, including 10 hours of continuous Linux OS operation. Simple filters and a streaming application were implemented on Cube-1, and performance acceleration of up to about three times was achieved.
    IEEE Micro 01/2013; 33(6):6-15. · 2.39 Impact Factor
  • ABSTRACT: Inductive coupling is yet another 3D integration technique that can be used to stack more than three known-good dies in a SiP without wire connections. We present a topology-agnostic 3D CMP architecture using inductive coupling that offers great flexibility in customizing the number of processor chips, SRAM chips, and DRAM chips in a SiP after the chips have been fabricated. In this paper, we first propose a routing protocol in which all chips in a given SiP exchange network information to establish efficient deadlock-free routing paths. Second, we propose an optimization technique that analyzes application traffic patterns and selects different spanning-tree roots so as to minimize the average hop count and improve application performance. (An illustrative sketch of the root-selection idea follows this entry.)
    18th Asia and South Pacific Design Automation Conference (ASP-DAC), 2013; 01/2013
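    The second contribution above amounts to choosing, per application, the spanning-tree root that minimizes traffic-weighted hop count when routes are restricted to tree paths. The sketch below shows that selection on a toy topology; the graph, traffic matrix, and exhaustive root search are our own illustrative simplifications, not the paper's protocol.

```python
# Illustrative sketch (not the paper's protocol): pick a spanning-tree root
# that minimises traffic-weighted hop count when all routes must follow tree
# paths (a simple deadlock-free strategy). Topology and traffic are toy data.
from collections import deque

def bfs_parents(adj, root):
    """Parent pointers of a BFS spanning tree rooted at `root`."""
    parent = {root: None}
    dq = deque([root])
    while dq:
        u = dq.popleft()
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                dq.append(v)
    return parent

def tree_hops(parent, src, dst):
    """Hop count between src and dst when packets may only follow tree edges."""
    def path_to_root(n):
        path = []
        while n is not None:
            path.append(n)
            n = parent[n]
        return path
    up, down = path_to_root(src), path_to_root(dst)
    common = set(up) & set(down)
    lca = next(n for n in up if n in common)   # lowest common ancestor
    return up.index(lca) + down.index(lca)

def best_root(adj, traffic):
    """Try every node as root and return the one minimising weighted hops."""
    def cost(root):
        parent = bfs_parents(adj, root)
        return sum(w * tree_hops(parent, s, d) for (s, d), w in traffic.items())
    return min(adj, key=cost)

# Toy example: 4 chips in a ring with one diagonal, and a skewed traffic pattern.
adj = {0: [1, 3], 1: [0, 2, 3], 2: [1, 3], 3: [0, 1, 2]}
traffic = {(0, 2): 10, (1, 3): 1, (2, 0): 10}
print("best spanning-tree root:", best_root(adj, traffic))
```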
  • ABSTRACT: A scalable heterogeneous multi-core processor is developed. 3D heterogeneous chip stacking of a general-purpose CPU and reconfigurable multi-core accelerators improves computational energy efficiency through proper task assignment and massively parallel computing. The stacked chips interconnect through a scalable 3D Network-on-Chip (NoC). By simply changing the number of stacked accelerator chips, processor parallelism can be widely scaled. In combination with Dynamic Voltage and Frequency Scaling (DVFS), the energy efficiency can be optimized for various performance requirements. No design change is needed, and hence no additional Non-Recurring Engineering (NRE) cost is incurred. An inductive-coupling ThruChip Interface (TCI) is applied to stacked-chip communication, forming a low-cost, robust, high-speed 3D NoC. A prototype demonstration system has been developed with 65 nm CMOS test chips. Successful system operation, including 10 hours of continuous Linux OS operation, is confirmed for the first time.
    IEEE Cool Chips XVI (COOL Chips), 2013; 01/2013
  • ABSTRACT: Cube-1 is a heterogeneous multi-core processor that achieves the required performance with the least possible energy consumption. It controls performance and energy at two levels: (1) the number of accelerators can easily be changed by increasing or decreasing the number of stacked chips after fabrication, as they are connected with inductive-coupling links; (2) the supply voltage for the PE array of the accelerator can be controlled by the host CPU so that the required performance is obtained with the minimum supply voltage.
    23rd International Conference on Field Programmable Logic and Applications (FPL), 2013; 01/2013
  • ABSTRACT: A contact-less approach that connects chips in the vertical dimension has great potential for customizing components in 3-D chip multiprocessors (CMPs), assuming card-style components inserted into a single cartridge communicate with each other wirelessly using inductive-coupling technology. To simplify the vertical communication interfaces, static Time Division Multiple Access (TDMA) is used for the vertical broadcast buses, while arbitrary or customized topologies can be used for the intra-chip networks. In this paper, we propose the Headfirst sliding routing scheme to overcome the limitations of simple static TDMA-based vertical buses. Each vertical bus periodically grants its communication time-slot to a different chip, which means the buses operate with different phases. Depending on the current time, packets are routed toward the best vertical bus (elevator), arriving just before that elevator acquires its communication time-slot. Network simulations show that Headfirst sliding routing reduces communication latency by up to 32.7%, and full-system CMP simulations show that it reduces application execution time by 9.9%. Synthesis results show that the area and critical path delay overheads are modest. (An illustrative sketch of the elevator-selection idea follows this entry.)
    Seventh IEEE/ACM International Symposium on Networks on Chip (NoCS), 2013; 01/2013
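    The core decision above is timing-aware: among the vertical buses a packet could walk to, pick the one whose TDMA slot for the packet's current chip comes up just as the packet arrives. The sketch below is our own simplification of that choice; the slot length, per-hop delay, and bus phases are made-up parameters, not values from the paper.

```python
# Illustrative sketch (our simplification, not the paper's router logic): with
# static TDMA vertical buses operating in different phases, a packet heading
# to another chip picks the "elevator" it can reach just before that elevator
# grants a slot to the packet's current chip.
NUM_CHIPS = 4          # chips stacked in the cartridge
SLOT_CYCLES = 8        # cycles per TDMA slot
HOP_CYCLES = 2         # cycles to move one hop inside a chip

def slot_owner(bus_phase, cycle):
    """Chip that owns the vertical bus with the given phase at `cycle`."""
    return ((cycle // SLOT_CYCLES) + bus_phase) % NUM_CHIPS

def wait_until_owned(bus_phase, my_chip, arrival_cycle):
    """Cycles from `arrival_cycle` until this bus next grants a slot to my_chip."""
    c = arrival_cycle
    while slot_owner(bus_phase, c) != my_chip:
        c = (c // SLOT_CYCLES + 1) * SLOT_CYCLES   # jump to the next slot start
    return c - arrival_cycle

def pick_elevator(buses, my_chip, now):
    """buses: list of (bus_id, phase, hops_to_reach). Pick the bus minimising
    time-to-departure = time to walk to it + wait for our slot once there."""
    def departure_delay(bus):
        _, phase, hops = bus
        arrival = now + hops * HOP_CYCLES
        return hops * HOP_CYCLES + wait_until_owned(phase, my_chip, arrival)
    return min(buses, key=departure_delay)

# Example: three vertical buses with staggered phases at different distances.
buses = [("north", 0, 3), ("centre", 1, 1), ("south", 2, 4)]
print("chosen elevator:", pick_elevator(buses, my_chip=2, now=5)[0])
```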
  • ABSTRACT: Network-on-Chips (NoCs) with wireless inductive coupling have been used in real heterogeneous multicore systems. Although inductive coupling itself is energy-efficient (e.g., 0.14 pJ per bit [1]), the inductors continuously consume a certain amount of power regardless of packet transfers. That is, the inductors waste significant power especially when the utilization of the vertical links (i.e., the inductors) is low, which is a typical use case of 3-D ICs in which most communication is within a chip while communication between chips is infrequent. Such power can be reduced by shutting down a link by controlling the bias voltage of the transistors used in the transmitter and receiver. Here, we propose generalized link on/off techniques for wireless NoCs with irregular network topologies. Simulation shows that the proposed low-power techniques reduce power consumption by 43.8%-55.0%. (An illustrative on/off policy sketch follows this entry.)
    IEEE Cool Chips XVI (COOL Chips), 2013; 01/2013
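    A simple way to picture the link on/off idea above is a timeout policy: power a vertical link down after it has been idle for a while, and pay a fixed wake-up penalty when traffic returns. The sketch below is our own policy-level illustration; the threshold, wake-up latency, and traffic pattern are invented, and the real control operates on transceiver bias voltages rather than Python objects.

```python
# Illustrative sketch (policy only, ours, not the paper's circuit control):
# turn a wireless vertical link off after a threshold of idle cycles, and
# charge a fixed wake-up latency when a packet arrives while it is off.
IDLE_THRESHOLD = 32    # cycles of inactivity before shutting the link down
WAKEUP_LATENCY = 4     # cycles needed to re-bias the transceiver

class VerticalLink:
    def __init__(self):
        self.on = True
        self.idle_cycles = 0
        self.energy_units = 0   # abstract count of cycles the inductor is powered

    def step(self, has_packet):
        """Advance one cycle; return extra latency charged to this packet."""
        extra = 0
        if has_packet:
            if not self.on:
                self.on = True
                extra = WAKEUP_LATENCY
            self.idle_cycles = 0
        else:
            self.idle_cycles += 1
            if self.on and self.idle_cycles >= IDLE_THRESHOLD:
                self.on = False       # shut down transmitter/receiver bias
        if self.on:
            self.energy_units += 1    # inductors burn power whenever on
        return extra

# Example: sparse inter-chip traffic keeps the link off most of the time.
link = VerticalLink()
traffic = [c % 200 == 0 for c in range(2000)]
penalty = sum(link.step(p) for p in traffic)
print("cycles powered:", link.energy_units, "wake-up penalty:", penalty)
```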
  • ABSTRACT: This paper presents the design of an FPGA-based Blokus Duo solver. It searches the game tree using the minimax algorithm with alpha-beta pruning and move ordering. In addition, an HLS tool called CyberWorkBench (CWB) is used to implement the hardware; by making use of functions in CWB, a fully pipelined parallel design is generated. The implemented solver works at 100 MHz on a Xilinx Spartan-6 XC6SLX45 FPGA on the Digilent Atlys board, and can search the states three moves ahead in most cases. (A software sketch of the search follows this entry.)
    International Conference on Field-Programmable Technology (FPT), 2013; 01/2013
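    The search the solver implements in hardware is classic minimax with alpha-beta pruning, where ordering promising moves first makes the pruning more effective. The sketch below is a plain software version of that skeleton; the toy game, its evaluation function, and the move-ordering heuristic are our own stand-ins, not Blokus Duo logic.

```python
# Illustrative sketch (software, not the HLS design): minimax with alpha-beta
# pruning and simple move ordering over a hypothetical game interface.
import math

def alphabeta(state, depth, alpha, beta, maximizing, game):
    if depth == 0 or game.is_terminal(state):
        return game.evaluate(state)
    # Move ordering: examining promising moves first tightens alpha/beta
    # earlier and prunes more of the tree.
    moves = sorted(game.legal_moves(state),
                   key=lambda m: game.score_move(state, m),
                   reverse=maximizing)
    if maximizing:
        best = -math.inf
        for m in moves:
            best = max(best, alphabeta(game.apply(state, m), depth - 1,
                                       alpha, beta, False, game))
            alpha = max(alpha, best)
            if alpha >= beta:
                break              # beta cut-off
        return best
    else:
        best = math.inf
        for m in moves:
            best = min(best, alphabeta(game.apply(state, m), depth - 1,
                                       alpha, beta, True, game))
            beta = min(beta, best)
            if alpha >= beta:
                break              # alpha cut-off
        return best

class ToyGame:
    """A trivial stand-in game: a state is a number, moves add or subtract."""
    def legal_moves(self, state): return [+1, +2, -1]
    def apply(self, state, move): return state + move
    def is_terminal(self, state): return abs(state) >= 5
    def evaluate(self, state):    return state
    def score_move(self, state, move): return move   # crude ordering heuristic

if __name__ == "__main__":
    print(alphabeta(0, 3, -math.inf, math.inf, True, ToyGame()))
```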
  • ABSTRACT: We demonstrate task-level pipelining on multiple accelerators with PEACH2. PEACH2 is implemented on an FPGA and enables ultra-low-latency direct communication among multiple accelerators across computation nodes. By installing PEACH2, typical high-performance computation nodes are tightly coupled. In this environment, an application can be accelerated by exploiting not only data-level parallelism but also task-level pipelined operation. Furthermore, multiple tasks can be processed on multiple accelerators in a pipelined manner. In our demonstration, the application achieves a 44% speed-up compared to a single GPU.
    International Conference on Field-Programmable Technology (FPT), 2013; 01/2013
  • ABSTRACT: We have developed a high-throughput, compact network switch (the RHiNET-2/SW) for a distributed parallel computing system. Eight pairs of 800-Mbit/s×12-channel optical interconnection modules and a CMOS ASIC switch are integrated on a compact circuit board. To realize a high-throughput (64 Gbit/s) and low-latency network, the SW-LSI has a customized high-speed LVDS I/O interface and a high-speed internal SRAM in a 784-pin BGA one-chip package. We have also developed device implementation technologies to overcome the electrical problems (loss and crosstalk) caused by such high integration. The RHiNET-2/SW system enables high-performance parallel processing in a distributed computing environment.
    New Generation Computing 04/2012; 18(2):187-197. · 0.80 Impact Factor
  • ABSTRACT: Computational Fluid Dynamics (CFD) is a common design tool in the aerospace industry. UPACS, a CFD package, is convenient for users, since a customized simulator can be built just by selecting the required functions. The problem is its computation speed, which is hard to enhance on clusters due to its complex memory access patterns. As an economical solution, accelerators using FPGAs are promising candidates; however, the total scale of UPACS is too large to be implemented on a small number of FPGAs. For cost-efficient implementation, partial reconfiguration, which dynamically reconfigures only the required functions, is proposed in this paper. The MUSCL algorithm, used frequently in UPACS, is selected as the target, and partial reconfiguration is applied to the flux limiter functions (FLFs) in MUSCL. Four FLFs are implemented for Turbulence MUSCL (TMUSCL) and eight for Convection MUSCL (CMUSCL). All FLFs are developed independently and separated from the top MUSCL module. At start-up, only the required FLFs are selected and deployed to the system without interfering with the other modules. This implementation successfully reduces resource utilization by 44% to 63%, and total power consumption is reduced by 33%. Configuration is 34 times faster than full reconfiguration, and all implemented functions achieve at least a 17-times speed-up compared with the software implementation. (An example of the kind of flux limiter being swapped follows this entry.)
    Proceedings of the 8th international conference on Reconfigurable Computing: architectures, tools and applications; 03/2012
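    The abstract does not list the individual flux limiter functions, so the sketch below simply shows two classic limiters used in MUSCL-type schemes (minmod and van Leer) behind a common interface. The point is that each limiter is a small, independent kernel with an identical signature, which is the property that makes FLFs natural targets for partial reconfiguration; the code and the particular formulas are our illustrative choice, not taken from UPACS.

```python
# Illustrative sketch: two classic flux limiters behind one interface, plus a
# second-order MUSCL face reconstruction that works with either of them.
def minmod(r):
    """Minmod limiter: phi(r) = max(0, min(1, r))."""
    return max(0.0, min(1.0, r))

def van_leer(r):
    """Van Leer limiter: phi(r) = (r + |r|) / (1 + |r|)."""
    return (r + abs(r)) / (1.0 + abs(r))

def muscl_face_value(u_left, u_center, u_right, limiter):
    """Limited extrapolation of u to the right cell face."""
    num = u_right - u_center
    den = u_center - u_left
    if den == 0.0:
        return u_center
    r = num / den
    return u_center + 0.5 * limiter(r) * den

# The same reconstruction code works with any limiter plugged in, mirroring
# how one FLF module can be swapped for another at start-up.
cells = [1.0, 1.0, 0.8, 0.2, 0.0]
for phi in (minmod, van_leer):
    print(phi.__name__,
          [round(muscl_face_value(*cells[i - 1:i + 2], phi), 3)
           for i in range(1, len(cells) - 1)])
```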
  • ABSTRACT: We propose a multi-voltage (multi-Vdd) variable pipeline router to reduce the power consumption of Networks-on-Chip (NoCs) designed for chip multiprocessors (CMPs). Our multi-Vdd variable pipeline router adjusts its pipeline depth (i.e., communication latency) and supply voltage level in response to the applied workload. Unlike dynamic voltage and frequency scaling (DVFS) routers, the operating frequency is the same for all routers throughout the CMP; thus, there is no need to synchronize neighboring routers working at different frequencies. In this paper, we implement the multi-Vdd variable pipeline router, which selects between two supply voltage levels and pipeline modes, in a 65 nm CMOS process and evaluate it with a full-system CMP simulator. Evaluation results show that although application performance degrades by 1.0% to 2.1%, the standby power of the NoC is reduced by 10.4% to 44.4%. (An illustrative mode-selection sketch follows this entry.)
    01/2012;
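    The mechanism above keeps the clock frequency fixed and instead trades pipeline depth against supply voltage per router. The sketch below shows one plausible control policy with two modes and utilization hysteresis; the thresholds, window size, and mode labels are our assumptions, not the controller described in the paper.

```python
# Illustrative sketch (ours, not the paper's controller): a per-router policy
# that watches recent link utilisation and switches between a high-Vdd
# shallow-pipeline mode (low latency) and a low-Vdd deeper-pipeline mode
# (lower power, one extra stage) while the clock frequency stays fixed.
HIGH_VDD = ("high-Vdd / 2-stage pipeline", 2)   # (label, pipeline depth)
LOW_VDD  = ("low-Vdd / 3-stage pipeline", 3)

UP_THRESHOLD = 0.30     # switch to high Vdd above this utilisation
DOWN_THRESHOLD = 0.10   # switch to low Vdd below this utilisation
WINDOW = 1000           # cycles per measurement window

class VariablePipelineRouter:
    def __init__(self):
        self.mode = LOW_VDD
        self.flits_in_window = 0
        self.cycle = 0

    def on_cycle(self, flit_arrived):
        self.cycle += 1
        self.flits_in_window += int(flit_arrived)
        if self.cycle % WINDOW == 0:
            util = self.flits_in_window / WINDOW
            # Hysteresis: two thresholds avoid flapping between modes.
            if util > UP_THRESHOLD:
                self.mode = HIGH_VDD
            elif util < DOWN_THRESHOLD:
                self.mode = LOW_VDD
            self.flits_in_window = 0
        return self.mode

# Example: a burst of traffic pushes the router into the fast mode,
# and an idle period lets it drop back to the low-voltage mode.
r = VariablePipelineRouter()
for c in range(3000):
    busy = 1000 <= c < 2000     # traffic only in the middle window
    mode = r.on_cycle(busy)
print("final mode:", mode[0])
```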
  • ABSTRACT: High-speed power gating (PG) techniques are useful for reducing the leakage power of functional units in a CPU core. This paper analyzes the trade-offs of functional units in a MIPS R3000-based processor with three fine-grained PG methods: cell-based, row-based, and ring-based. Compared with the cell-based PG technique, which was used in our previous Geyser-1 processor, the row-based and ring-based PG techniques achieve much smaller area and lower implementation cost at the expense of a certain additional wake-up latency. Simulation results with benchmark programs show that all three methods can reduce leakage power by 28-54% at 25°C.
    01/2012;
  • ABSTRACT: As the scale of parallel applications and platforms increases, the negative impact of communication latency on performance grows. Fortunately, modern High Performance Computing (HPC) systems can exploit low-latency topologies of high-radix switches. In this context, we propose the use of random shortcut topologies, which are generated by augmenting classical topologies with random links. Using graph analysis, we find that these topologies, when compared to non-random topologies of the same degree, lead to drastically reduced diameter and average shortest path length. The best results are obtained when adding random links to a ring topology, meaning that good random shortcut topologies can easily be generated for arbitrary numbers of switches. Using flit-level discrete event simulation, we find that random shortcut topologies achieve throughput comparable to, and latency lower than, that of existing non-random topologies such as hypercubes and tori. Finally, we discuss and quantify practical challenges for random shortcut topologies, including routing scalability and larger physical cable lengths. (A toy version of the graph analysis follows this entry.)
    01/2012;
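    The graph analysis above is easy to reproduce at toy scale: take a ring, add random shortcut links, and watch the diameter and average shortest path length collapse. The sketch below does exactly that with plain BFS; the network size, number of shortcuts, and seed are arbitrary choices of ours.

```python
# Illustrative sketch of the graph analysis described above (toy scale): start
# from a ring, add random shortcut links, and measure diameter and average
# shortest path length with plain BFS, no external graph library.
import random
from collections import deque

def ring(n):
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}

def add_random_shortcuts(adj, links, seed=0):
    rng = random.Random(seed)
    nodes = list(adj)
    added = 0
    while added < links:
        u, v = rng.sample(nodes, 2)
        if v not in adj[u]:
            adj[u].add(v)
            adj[v].add(u)
            added += 1
    return adj

def path_stats(adj):
    """(diameter, average shortest path length) over all node pairs, via BFS."""
    total, count, diameter = 0, 0, 0
    for src in adj:
        dist = {src: 0}
        dq = deque([src])
        while dq:
            u = dq.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    dq.append(v)
        for node, d in dist.items():
            if node != src:
                total += d
                count += 1
                diameter = max(diameter, d)
    return diameter, total / count

n = 64
print("ring            :", path_stats(ring(n)))
print("ring + shortcuts:", path_stats(add_random_shortcuts(ring(n), links=64)))
```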
  • ABSTRACT: FaSTAR, developed by JAXA, is a leading-edge CFD (Computational Fluid Dynamics) program package that supports various solvers based on unstructured grids. Computation on an unstructured grid causes many pipeline stalls from RAW (Read After Write) hazards when reconfigurable accelerators are implemented in FPGAs. To cope with this problem, an OoO (Out-of-Order) mechanism generator is proposed. By setting parameters depending on the target computation, an OoO mechanism with an appropriate structure for the execution unit and waiting buffer is generated. The OoO mechanisms are applied to five subroutines in FaSTAR and achieve 2.6 times the performance of in-order execution and 2.9 times that of software executed on an Intel Core 2 Duo processor, with a reasonable amount of overhead. (An illustrative RAW-hazard sketch follows this entry.)
    22nd International Conference on Field Programmable Logic and Applications (FPL), 2012; 01/2012
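    To picture why the waiting buffer helps, the sketch below models each grid update as an operation that reads one cell and writes another with a fixed latency: in-order issue stalls on every RAW dependence, while a small out-of-order window slips independent operations into those stall cycles. The operation mix, latency, and scoreboard-style checks are our own simplification, not the generated hardware.

```python
# Illustrative sketch (ours): RAW stalls from indirect (unstructured) accesses,
# and how out-of-order issue from a small waiting buffer hides them.
LATENCY = 4   # cycles an issued operation keeps its destination cell busy

def can_issue(op, idx, waiting, in_flight):
    """Check RAW/WAW against in-flight writes, and RAW/WAW/WAR against older waiting ops."""
    rd, wr = op
    if rd in in_flight or wr in in_flight:
        return False
    for erd, ewr in waiting[:idx]:
        if ewr == rd or ewr == wr or erd == wr:
            return False
    return True

def simulate(ops, window):
    """ops: list of (read_cell, write_cell). window=1 models in-order issue;
    a larger window lets any ready op among the oldest `window` entries issue."""
    waiting = list(ops)
    in_flight = {}               # write_cell -> remaining cycles
    cycle = 0
    while waiting or in_flight:
        # retire writes that complete this cycle, age the rest
        in_flight = {c: t - 1 for c, t in in_flight.items() if t > 1}
        # issue at most one ready operation per cycle
        for i, op in enumerate(waiting[:window]):
            if can_issue(op, i, waiting, in_flight):
                in_flight[op[1]] = LATENCY
                del waiting[i]
                break
        cycle += 1
    return cycle

# A dependent chain typical of indirect accesses, interleaved with independent
# work that an out-of-order window can slip into the stall cycles.
ops = [(0, 1), (1, 2), (2, 3), (10, 11), (12, 13), (14, 15), (16, 17)]
print("in-order cycles    :", simulate(ops, window=1))
print("out-of-order cycles:", simulate(ops, window=4))   # fewer cycles
```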
  • ABSTRACT: Cube-2 is a prototype of a building-block scalable reconfigurable accelerator using an inductive-coupling interconnect. It consists of an ultra-low-leakage embedded processor, Geyser, and coarse-grained reconfigurable accelerators, CMA (Cool Mega Array). A Geyser chip and multiple CMA chips are stacked, and a powerful network is formed using the inductive-coupling interconnect. The performance can be enhanced by increasing the number of CMA chips. A JPEG decoder is implemented through the cooperation of Geyser and the CMAs, and low-power execution by controlling the supply voltage of the CMAs is demonstrated.
    International Conference on Field-Programmable Technology (FPT), 2012; 01/2012
  • H. Amano, M. Kimura, N. Ozaki
    ABSTRACT: Although context memory (configuration cache) is a key mechanism for quick dynamic reconfiguration of a multi-context Dynamically Reconfigurable Processing Array (DRPA), it requires a large amount of area and energy. To save both, methods for removing the context memory from a multi-context DRPA are proposed. To keep a context without switching, Loop Separation for Keeping Datapath (LSKD) is introduced: by separating loops with the compiler and some additional hardware, the same context can be used without switching for a certain number of clock cycles. The background configuration-data loading time can be reduced by multicasting configuration data with a two-dimensional bit-map; for further reduction, differential loading and a spare register are proposed. With a combination of these techniques, the increase in execution time is at most 12-13% if the target application has no loop-carried dependencies. With this overhead on performance, the semiconductor area becomes 63% and the energy consumption is reduced to 40%, so the performance per cost or per unit energy is much improved.
    IEEE 6th International Symposium on Embedded Multicore SoCs (MCSoC), 2012; 01/2012

Publication Stats

1k Citations
25.31 Total Impact Points

Institutions

  • 1993–2014
    • Keio University
      • Graduate School of Science and Technology
      • Department of Information and Computer Science
      • Center for Computer Science
      • Faculty of Science and Technology
      Tōkyō, Japan
  • 2011
    • The University of Tokyo
      Tōkyō, Japan
  • 2009–2010
    • Shibaura Institute of Technology
      Tōkyō, Japan
  • 2005
    • Nagasaki University
      • Department of Computer and Information Science
      Nagasaki-shi, Nagasaki-ken, Japan
  • 2002
    • Toshiba Corporation
      Tōkyō, Japan
  • 1999
    • Nankai University
      Tianjin, China