Hideharu Amano

Keio University, Edo, Tōkyō, Japan

Publications (259) · 34.62 Total impact

  • IEICE Transactions on Electronics 01/2015; E98.C(7):559-568. DOI:10.1587/transele.E98.C.559 · 0.39 Impact Factor
  • IEEE Transactions on Very Large Scale Integration (VLSI) Systems 01/2015; DOI:10.1109/TVLSI.2015.2418216 · 1.14 Impact Factor
  • IEICE Transactions on Electronics 01/2015; E98.C(7):536-543. DOI:10.1587/transele.E98.C.536 · 0.39 Impact Factor
  • Takaaki Miyajima · David Thomas · Hideharu Amano
    ABSTRACT: This new toolchain for accelerating applications on CPU-FPGA platforms, called Courier-FPGA, extracts runtime information from a running target binary and reconstructs the function call graph, including input-output data. Then, it synthesizes hardware modules on the FPGA and prepares software functions on the CPU by using the Pipeline Generator. The Pipeline Generator also builds a pipeline control program by using Intel Threading Building Blocks (Intel TBB) to run both hardware modules and software functions in parallel. Finally, Courier-FPGA's Function Off-loader dynamically replaces and off-loads the original functions in the binary by using the built pipeline. Courier-FPGA performs the off-loading without user intervention, source code tweaks or re-compilations of the binary. In our case studies, Courier-FPGA was used to accelerate a histogram-of-gradients (HOG) feature detection program on the Zynq platform. A series of functions were off-loaded, and the program was sped up 3.98 times by using the built pipeline.
    Journal of Information Processing 01/2015; 23(2):153-162. DOI:10.2197/ipsjjip.23.153
  • Takaaki Miyajima · David Thomas · Hideharu Amano
    ABSTRACT: Our toolchain for accelerating applications, called Courier-FPGA, is designed to make the processing power of CPU-FPGA platforms available to software programmers and non-expert users. It automatically gathers runtime information of library functions from a running target binary, and constructs the function call graph, including input-output data. Then, it uses the corresponding predefined hardware modules if these are ready for the FPGA, and prepares software functions on the CPU by using the Pipeline Generator. The Pipeline Generator builds a pipeline control program by using Intel Threading Building Blocks to run both hardware modules and software functions in parallel. Finally, Courier-FPGA dynamically replaces the original functions in the binary and accelerates it by using the built pipeline. Courier-FPGA performs these acceleration processes without user intervention, source code tweaks or re-compilations of the binary. We describe the technical details of this mixed software-hardware pipeline on CPU-FPGA platforms in this paper. In our case study, Courier-FPGA was used to accelerate the binary of a corner detection application that uses the Harris-Stephens method on the Zynq platform. A series of functions were off-loaded, and a speedup of 15.36 times was achieved by using the built pipeline.
  • ABSTRACT: Computational fluid dynamics (CFD) is an important tool for designing aircraft components. FaSTAR (Fast Aerodynamics Routines) is one of the most recent CFD packages and has various subroutines. However, its irregular and complicated data structure makes it difficult to execute FaSTAR on parallel machines due to memory access problems. The use of a reconfigurable platform based on field programmable gate arrays (FPGAs) is a promising approach to accelerating memory-bottlenecked applications like FaSTAR. However, even with hardware execution, a large number of pipeline stalls can occur due to read-after-write (RAW) data hazards. Moreover, it is difficult to predict when such stalls will occur because of the unstructured mesh used in FaSTAR. To eliminate this problem, we developed an out-of-order mechanism for permuting the data order so as to prevent RAW hazards. It uses an execution monitor and a wait buffer. The former identifies the state of the computation units, and the latter temporarily stores data to be processed in the computation units. This out-of-order mechanism can be applied to various types of computations with data dependency by changing the number of execution monitors and wait buffers in accordance with the equations used in the target computation. An out-of-order system can be reconfigured by automatically changing the parameters. Application of the proposed mechanism to five subroutines in FaSTAR showed that it reduces the number of stalls to less than 1% of the number without the mechanism. Execution was sped up 2.6-fold over in-order execution and 2.9-fold over software execution on an Intel Core 2 Duo processor, with a reasonable amount of overhead.
    IEICE Transactions on Information and Systems 05/2014; E97.D(5):1225-1234. DOI:10.1587/transinf.E97.D.1225 · 0.19 Impact Factor
  • ABSTRACT: A 32-bit CPU is demonstrated that achieves its lowest energy of 13.4 pJ/cycle at 0.35 V and 14 MHz, operates from 0.22 V to 1.2 V, and has a sleep current of 0.14 µA. The low-power performance is attained by Reverse-Body-Bias-Assisted 65 nm SOTB (Silicon On Thin Buried oxide) CMOS technology. The CPU can operate for more than 100 years with a 610 mAh Li battery.
    COOL Chips XVII, Yokohama, Japan; 04/2014
  • ABSTRACT: We have concluded that, with a router using MMTH, the power consumption is associated with the bit change rate of the data, and that when the NAS Parallel Benchmarks run on the NoC, it is reduced by 42.4% on average at 2 GHz compared with a traditional FIFO implementation. The performance degradation caused by the delay of the reading time can be mostly recovered by the look-ahead technique in the router.
    2014 IEEE COOL Chips XVII (COOL Chips); 04/2014
  • ABSTRACT: We show task-level pipelining on multiple accelerators with PEACH2. PEACH2, which is implemented on an FPGA, enables ultra-low-latency direct communication among multiple accelerators over computational nodes. By installing PEACH2, typical high-performance computation nodes are tightly coupled. In this environment, applications can be accelerated by exploiting not only data-level parallelism, but also task-level parallelism. Furthermore, we can process multiple tasks on multiple accelerators in a pipelined manner. In our evaluation, an application implemented in a task-level pipelined manner achieves a 52% speedup compared to a single GPU.
  • ABSTRACT: A wireless 3D NoC architecture is described for building-block SiPs, in which the number of hardware components (or chips) in a package can be changed after chips have been fabricated. The architecture uses inductive-coupling links that can connect more than two examined dies without wire connections. Each chip has data transceivers for the uplink and downlink in order to communicate with its neighboring chips in the package. These chips form a vertical unidirectional ring network so as to fully exploit the flexibility of the wireless approach that enables us to add, remove, and swap the chips in the ring. To avoid protocol and structural deadlocks in the ring, we use bubble flow control, which does not rely on the conventional VC-based deadlock avoidance mechanism. In addition, we propose a bidirectional communication scheme to form a bidirectional ring network by using the inductive-coupling transceivers that can dynamically change the communication modes, such as TX, RX, and Idle modes. This paper illustrates the inductive-coupling transceiver circuits, which can carry high data transfer rates of up to 8 Gbps per channel, for the wireless 3D NoC. It also illustrates an implementation of a wireless 3D NoC that has on-chip routers and transceivers implemented with a 65 nm process in order to show the feasibility of our proposal. The vertical bubble flow control and conventional VC-based approach on the uni- and bidirectional ring networks are compared with the vertical broadcast bus in terms of throughput, hardware amount, and application performance using a full system multiprocessor simulator. The results show that the proposed bidirectional communication scheme efficiently improves application performance without adding any inductive-coupling transceivers. In addition, the proposed vertical bubble flow network outperforms the conventional VC-based approach by 7.9-12.5 percent with a 33.5 percent smaller router area for building-block SiPs connecting up to eight chips.
    IEEE Transactions on Computers 03/2014; 63(3):748-763. DOI:10.1109/TC.2012.249 · 1.47 Impact Factor
  • ABSTRACT: Power-performance efficiency remains a primary concern for microprocessor designers. One of the sources of power inefficiency in recent LSI chips is increasing leakage power consumption. Power-gating is a well-known technique to reduce leakage power consumption by switching off the power supply to idle logic blocks. Recently, fine-grained power-gating has emerged as a technique to minimize leakage current during active processor cycles by switching logic blocks on and off at a much finer temporal/spatial granularity. Though fine-grained power-gating is useful, a comprehensive evaluation and analysis has not been conducted on real LSI chips. In this paper, we evaluate fine-grained run-time power-gating for microprocessors' functional units using a real embedded microprocessor. We also introduce an architecture and compiler co-operative power-gating scheme which mitigates the negative power reduction caused by the energy overhead associated with fine-grained power-gating. The experimental results with a fabricated core show that a hardware-based scheme reduces the power consumption of functional units by 44%, and the hardware-compiler co-operative scheme further improves power efficiency by 5.9%, when the core temperature is 25 °C.
    Design Automation and Test in Europe; 01/2014
  • IPSJ Transactions on System LSI Design Methodology 01/2014; 7:27-36. DOI:10.2197/ipsjtsldm.7.27
  • ABSTRACT: This paper presents a design and control scheme of a microprocessor whose internal function units are power gated on an instruction-by-instruction basis. Enabling/disabling the power gating is adaptively controlled with the support of on-chip leakage monitors and the operating system to minimize the energy overhead due to sleep-in and wakeup. Measured results of the chip fabricated in 65 nm CMOS technology demonstrated that our approach reduces energy to 21-35% in the range of 25-85°C as compared to the non-power-gated case. Energy dissipation was reduced by up to 15% as compared to the conventional fine-grain power gating technique in the same temperature range.
    2014 19th Asia and South Pacific Design Automation Conference (ASP-DAC); 01/2014
  • ABSTRACT: In this paper, we demonstrate that we can reduce the communication latency significantly by inserting a fraction of randomness into a wireless 3D NoC (where CMOS wireless links are used for vertical inter-chip communication) when considering the physical constraints of the 3D design space. Towards this end, we consider two cases, namely 1) replacing existing horizontal 2D links in a wireless 3D NoC with randomized shortcut NoC links and 2) enabling full connectivity by adding a randomized NoC layer to a wireless 3D platform with partial or no horizontal connectivity. Consequently, the packet routing is optimized by exploiting both the existing and the newly added random NoC. At the same time, by adding randomly wired shortcut NoCs to a wireless 3D platform, a good balance can be established between the modularity of the design and the minimum randomness needed to achieve low latency. Experimental results show that by adding a random NoC chip to wireless 3D CMPs without built-in horizontal connectivity, the communication latency can be reduced by as much as 26.2% when compared to adding a 2D mesh NoC. Also, the application execution time and average flit transfer energy can be improved accordingly.
    Design Automation and Test in Europe; 01/2014
  • Hongliang Su · Weihan Wang · Kuniaki Kitamori · Hideharu Amano
    ABSTRACT: Leakage power is a serious problem, especially for accelerators which use a large Processing Element (PE) array. Here, a low-power reconfigurable accelerator called Cool Mega Array (CMA) with back-gate bias control (CMA-bb) is implemented and evaluated. In CMA-bb, the back-gate bias of the microcontroller and PE array can be controlled independently. In the idle mode, reverse bias is applied to both parts to suppress the leakage current. When high performance is required, forward bias is used to increase the clock frequency. For simple applications, the operational power can be suppressed by using reverse bias only in the PE array. The real chip is implemented with a 65 nm experimental process for low-leakage applications. The evaluation results show that the leakage current can be suppressed to 300 μA by using the reverse bias. The operational frequency is increased from 39 MHz to 50 MHz, with up to a 21% increase in operational power, by using the forward bias. For simple applications, 8% to 9.4% of operational power is saved by applying reverse bias only to the PE array.
    2013 International Conference on Field-Programmable Technology (FPT); 12/2013
  • ABSTRACT: Fast Aerodynamics Routines (FaSTAR) is one of the most recent fluid dynamics software packages. FaSTAR is hard to execute on parallel machines because of its irregular and unpredictable data structure. Exploiting the advantages of reconfigurable hardware to make up for the inadequacy of existing high-performance computers has gradually become a solution. However, a single FPGA is not enough for the FaSTAR package because the whole module is very large. Instead of using many FPGAs, the partially reconfigurable hardware available in recent FPGAs is explored for this application. The advection term computation module in FaSTAR is chosen as the target subroutine. We propose a reconfigurable flux calculation scheme that uses the partial reconfiguration technique to save hardware resources so as to fit in a single FPGA. We developed a flux computation module, and five flux calculation schemes are implemented as reconfigurable modules. This implementation has the advantages of up to 62.75% resource saving and a 6.28 times faster configuration speed. Performance evaluation also shows that a 2.65 times acceleration is achieved compared to an Intel Core 2 Duo at 2.4 GHz.
    2013 International Conference on Field-Programmable Technology (FPT); 12/2013
  • ABSTRACT: The authors developed a scalable heterogeneous multicore processor. 3D heterogeneous chip stacking of a general-purpose CPU and reconfigurable multicore accelerators enables various trade-offs between performance and energy consumption. The stacked chips interconnect through a scalable 3D network on a chip (NoC). By simply changing the number of stacked accelerator chips, processor parallelism can be widely scaled. No design change is needed, and hence, no additional nonrecurring engineering (NRE) cost is required. An inductive-coupling ThruChip Interface (TCI) is applied to stacked-chip communications, forming a low-cost and robust high-speed 3D NoC. The authors developed a prototype system called Cube-1 with 65-nm CMOS test chips, and confirmed successful system operations, including 10 hours of continuous Linux OS operation. Simple filters and a streaming application were implemented on Cube-1 and performance acceleration up to about three times was achieved.
    IEEE Micro 11/2013; 33(6):6-15. DOI:10.1109/MM.2013.112 · 1.81 Impact Factor
  • ABSTRACT: Photon mapping is a rendering technique which enables depicting complicated light concentrations in 3D graphics. Searching a kd-tree of photons with k-nearest neighbor (k-NN) search requires a large amount of computation. As k-NN search has a high degree of parallelism, the operation can be accelerated by GPUs and recent multi-core microprocessors. However, the memory access bottleneck will limit their computation speed. Here, as an alternative approach, an FPGA implementation of the k-NN search operation in a kd-tree is proposed. In the proposed design, we maximized the effective throughput of the block RAM by connecting multiple Query Modules to both ports of the RAM. Furthermore, an implementation of the discovery process of the max distance that does not depend on the number of Estimate-Photons is proposed. Implementation on Spartan6, Virtex6 and Virtex7 shows that 26 fundamental modules can be mounted on Virtex7. As a result, the proposed module achieved a throughput of up to approximately 282 times that of software execution.
    Proceedings of the 9th international conference on Reconfigurable Computing: architectures, tools, and applications; 03/2013
  • IEICE Transactions on Electronics 01/2013; E96.C(4):404-412. DOI:10.1587/transele.E96.C.404 · 0.39 Impact Factor
  • Kugami Daiki · Takaaki Miyajima · Hideharu Amano
    ABSTRACT: High-Level Synthesis has been researched and developed for the past 20 years. The development environment has improved not only for ASICs, but also for reconfigurable devices, especially Field Programmable Gate Arrays (FPGAs). Various types of large algorithms have also been implemented on FPGAs in order to shorten their processing time, especially in the field of Computational Fluid Dynamics (CFD). However, for such acceleration, FPGAs have some limitations when programmers try to implement a large algorithm. Area is one of the largest constraints for an FPGA, so programmers have to divide one large algorithm into several small parts. The number of arithmetic units also constrains the size of the algorithm and the degree of speed-up. Here, we try to divide a large algorithm into several small functions and implement them on several FPGAs by using a High-Level Synthesis (HLS) tool. Since trial and error is easy with an HLS tool, we propose a technique for exploring the division points of a large algorithm by using the HLS tool CWB (Cyber Work Bench).
    Advanced Information Networking and Applications Workshops (WAINA), 2013 27th International Conference on; 01/2013

Publication Stats

1k Citations
34.62 Total Impact Points


  • 1985–2014
    • Keio University
      • Department of Information and Computer Science
      • Faculty of Science and Technology
      • Center for Computer Science
      • Department of Electronics and Electrical Engineering
      Edo, Tōkyō, Japan
  • 2013
    • Japan Aerospace Exploration Agency
      Chōfu, Tōkyō, Japan
  • 2009
    • Shibaura Institute of Technology
      Tōkyō, Japan
  • 2005
    • Nagasaki University
      • Department of Computer and Information Science
      Nagasaki, Nagasaki, Japan
  • 1997
    • Tokyo Denki University
      • Division of Electrical and Electronic Engineering
      Tokyo, Tokyo-to, Japan