Hideharu Amano

Keio University, Tokyo, Japan

Publications (300) · 49.38 Total Impact Points

  • ABSTRACT: A 32-bit CPU that can operate for more than 15 years on a 220 mAh Li battery, or run indefinitely from an indoor-light energy harvester, is presented. The CPU was fabricated in a 65 nm SOTB (Silicon on Thin Buried Oxide) CMOS technology with a gate length of 60 nm and a BOX layer thickness of 10 nm. The threshold voltage was designed to be as low as 0.19 V so that the CPU operates above threshold even at supply voltages down to 0.22 V. A large reverse body bias of up to -2.5 V can be applied to the bodies of the SOTB devices, without increasing gate-induced drain leakage current, to reduce the sleep current of the CPU. The CPU operated at 14 MHz and 0.35 V with a lowest energy of 13.4 pJ/cycle, and a sleep current of 0.14 μA was obtained at 0.35 V with a body bias of -2.5 V. These characteristics suit new applications such as energy-harvesting sensor network systems and long-lasting wearable computers. Copyright © 2015 The Institute of Electronics, Information and Communication Engineers.
    Article · Jul 2015 · IEICE Transactions on Electronics
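    As a rough sanity check on the lifetime claim above, the quoted energy per cycle, clock frequency, supply voltage, sleep current, and battery capacity can be combined in a back-of-envelope calculation. The duty cycle below is an assumed workload, not a figure from the paper.

```python
# Back-of-envelope lifetime estimate for a duty-cycled CPU, using the figures
# quoted in the abstract above. The 0.25% duty cycle is an assumed workload,
# not a number from the paper.

ENERGY_PER_CYCLE = 13.4e-12   # J/cycle at 0.35 V (from the abstract)
CLOCK_HZ         = 14e6       # active clock frequency
VDD              = 0.35       # supply voltage in volts
I_SLEEP          = 0.14e-6    # sleep current in amperes
BATTERY_AH       = 0.220      # 220 mAh Li battery

def lifetime_years(duty_cycle: float) -> float:
    """Battery lifetime assuming the CPU is active for `duty_cycle` of the time."""
    p_active = ENERGY_PER_CYCLE * CLOCK_HZ          # ~188 uW while running
    i_active = p_active / VDD                       # ~536 uA at 0.35 V
    i_avg = duty_cycle * i_active + (1 - duty_cycle) * I_SLEEP
    hours = BATTERY_AH / i_avg
    return hours / (24 * 365.25)

print(f"{lifetime_years(0.0025):.1f} years at a 0.25% duty cycle")   # about 17 years
```

    Under this assumed duty cycle the average drain is about 1.5 µA, which is consistent with a multi-year lifetime of the order claimed in the abstract; battery self-discharge is ignored here.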
  • ABSTRACT: The authors have been researching ways to reduce the power consumption of microprocessors and developed a low-power processor called "Geyser" by applying a power-gating (PG) function to the individual functional units of the processor. The PG function on Geyser reduces the power consumption of functional units by shutting off the supply voltage of idle units. However, the energy overhead of switching the supply voltage of units on and off can increase power instead. The amount of this energy overhead varies with the behavior of each functional unit, which is influenced by the running application, and also with the core temperature. It is therefore necessary to switch the PG function itself on or off according to the state of the processor at runtime to reduce power consumption more effectively. In this paper, the authors propose a PG control method in which the operating system (OS) takes the power overhead into account. To achieve a larger power reduction, the OS periodically calculates the power consumption of each functional unit and inhibits the PG function of any unit whose energy overhead is judged too high. The method was implemented in the Linux process scheduler and evaluated. The results show that the average power consumption of the functional units is reduced by up to 17.2%.
    Article · Jul 2015 · IEICE Transactions on Electronics
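    The OS-level decision described in this abstract boils down to a break-even test: power gating a unit pays off only if the leakage energy saved over its idle time exceeds the energy spent switching the unit off and on. A minimal sketch of that test follows; the unit names and energy figures are illustrative placeholders, not values measured on Geyser.

```python
# Minimal sketch of the break-even test behind OS-directed power-gating control.
# The energy numbers and unit names are illustrative placeholders, not values
# from the Geyser processor.

UNITS = {
    # unit: (leakage_power_watts, switch_energy_joules) -- both temperature dependent in practice
    "ALU":  (50e-6, 40e-9),
    "FPU":  (120e-6, 90e-9),
    "MULT": (80e-6, 60e-9),
}

def allow_power_gating(unit: str, expected_idle_s: float) -> bool:
    """Enable PG for a unit only if the leakage saved over the expected idle
    period exceeds the energy overhead of one sleep/wake transition pair."""
    leak_w, switch_j = UNITS[unit]
    saved = leak_w * expected_idle_s
    return saved > switch_j

# The OS would run a check like this periodically and inhibit PG for units
# whose idle periods are too short to pay back the switching overhead.
for u in UNITS:
    print(u, "PG enabled" if allow_power_gating(u, expected_idle_s=0.8e-3) else "PG inhibited")
```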
  • ABSTRACT: Wireless 3-D networks-on-chip (NoCs) with inductive-coupling ThruChip interfaces provide a large degree of flexibility for customizing the number of arbitrary chips in a package after the chips have been fabricated. To simplify the vertical communication interfaces, static time division multiple access (TDMA) is used for the vertical broadcast buses, while arbitrary or customized topologies can be used for the intrachip network. This paper proposes two techniques that go beyond simple static TDMA-based vertical buses while maintaining a simple communication interface. The first technique is headfirst sliding (HS) routing, which reduces the waiting time for acquiring a communication time slot. HS routing selects the best vertical bus based on the current time, taking advantage of the static TDMA schedule. The second technique extends carrier sense multiple access with collision detection (CSMA/CD) to the vertical broadcast buses. We introduce a packet collision detection technique for inductive-coupling buses and propose two retransmission strategies to reduce the waiting time for packet retransmissions caused by collisions. Network simulation results show that HS routing reduces the communication latency by 39.1% compared with a conventional static TDMA bus-based 3-D NoC that uses shortest-path routing. The proposed CSMA/CD bus also improves the latency by 52.5% and the throughput by 34.1%. Full-system simulation results show that HS routing and the proposed CSMA/CD technique reduce application execution time accordingly while keeping the average flit transfer energy overhead modest.
    Article · Apr 2015 · IEEE Transactions on Very Large Scale Integration (VLSI) Systems
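    The headfirst sliding idea above exploits the fact that, with static TDMA, a router can compute exactly when its transmission slot on each vertical bus comes up and pick the bus with the shortest wait. The sketch below illustrates that slot-aware selection; the constants and frame structure are invented for the example, not taken from the paper.

```python
# Illustrative sketch of slot-aware vertical-bus selection in a statically
# scheduled TDMA broadcast bus (the idea behind "headfirst sliding" routing).
# Slot assignments and timing are invented for the example.

NUM_CHIPS = 4          # chips stacked in the package; each owns one slot per bus frame
SLOT_CYCLES = 4        # cycles per TDMA slot

def wait_for_slot(bus_offset: int, my_chip: int, now: int) -> int:
    """Cycles until chip `my_chip` may drive a vertical bus whose frame is
    shifted by `bus_offset` slots relative to cycle 0."""
    frame = NUM_CHIPS * SLOT_CYCLES
    slot_start = ((my_chip + bus_offset) % NUM_CHIPS) * SLOT_CYCLES
    return (slot_start - now) % frame

def pick_bus(candidate_buses: list[int], my_chip: int, now: int) -> int:
    """Choose the reachable vertical bus with the shortest wait at time `now`."""
    return min(candidate_buses, key=lambda b: wait_for_slot(b, my_chip, now))

# A router on chip 2 choosing between two vertical buses at cycle 11:
print(pick_bus([0, 1], my_chip=2, now=11))
```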
  • ABSTRACT: Heterogeneous clusters using accelerators are widely used for high-performance computing systems. In such systems, inter-node communication among accelerators becomes a bottleneck due to the data transfer between the accelerator and the host. To eliminate this overhead, we have been developing a novel communication system that realizes direct communication among accelerators across computation nodes under the HA-PACS (Highly Accelerated Parallel Advanced system for Computational Sciences) project. We are also investigating a high-level parallel programming language and several practical application programs based on this concept, as well as studying enhancements of TCA and developing the system software stack in the CREST project.
    Chapter · Apr 2015
  • Takaaki Miyajima · David Thomas · Hideharu Amano
    ABSTRACT: This new toolchain for accelerating applications on CPU-FPGA platforms, called Courier-FPGA, extracts runtime information from a running target binary and reconstructs the function call graph, including input-output data. It then synthesizes hardware modules on the FPGA and prepares software functions on the CPU using the Pipeline Generator. The Pipeline Generator also builds a pipeline control program using Intel Threading Building Blocks (Intel TBB) to run the hardware modules and software functions in parallel. Finally, Courier-FPGA's Function Off-loader dynamically replaces and off-loads the original functions in the binary using the built pipeline. Courier-FPGA performs the off-loading without user intervention, source code tweaks, or re-compilation of the binary. In our case studies, Courier-FPGA was used to accelerate a histogram-of-gradients (HOG) feature detection program on the Zynq platform. A series of functions was off-loaded, and the program was sped up 3.98 times by the built pipeline.
    Article · Jan 2015 · Journal of Information Processing
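    Courier-FPGA's generated control program overlaps hardware modules and CPU functions in a pipeline using Intel TBB. The following Python sketch illustrates the same producer-consumer pipelining idea with threads and queues; it is a language-neutral analogy, not the toolchain's actual TBB-based code, and the stage names are invented.

```python
# Language-neutral sketch of the pipelined execution Courier-FPGA sets up with
# Intel TBB: each stage (an FPGA module or a CPU function) runs in its own
# thread and stages overlap on successive frames. This illustrates the idea,
# not the toolchain's actual control program.

import queue, threading

def stage(worker, inbox, outbox):
    while (item := inbox.get()) is not None:
        outbox.put(worker(item))
    outbox.put(None)                      # propagate end-of-stream

# Placeholder stage bodies; in Courier-FPGA these would be an off-loaded
# hardware module and the remaining software functions of the call graph.
resize   = lambda frame: f"resized({frame})"
hog      = lambda frame: f"hog({frame})"      # e.g. the off-loaded HOG kernel
classify = lambda frame: f"classified({frame})"

q0, q1, q2, q3 = (queue.Queue() for _ in range(4))
for fn, src, dst in [(resize, q0, q1), (hog, q1, q2), (classify, q2, q3)]:
    threading.Thread(target=stage, args=(fn, src, dst), daemon=True).start()

for frame in ["frame0", "frame1", "frame2"]:
    q0.put(frame)
q0.put(None)

while (result := q3.get()) is not None:
    print(result)
```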
  • Mai Izawa · Nobuaki Ozaki · Yusuke Koizumi · Rie Uno · Hideharu Amano
    ABSTRACT: Cool Mega Array (CMA) is an energy-efficient reconfigurable accelerator consisting of a large PE array built from combinational circuits and a small microcontroller. In order to enhance the energy efficiency of the total system, a co-processor design of CMA called CMA-Geyser is proposed. By partly replacing the programmable microcontroller with the host processor Geyser and a dedicated hardware controller, the setup of the CMA and the data transfer can be done efficiently. The design, in a 65 nm CMOS process, is compared with an off-loading-style multicore system, Cube-1. By eliminating the data memory required in Cube-1, CMA-Geyser reduced the semiconductor area by 21.3%. It also achieved about 2.7 times the performance of Cube-1 through efficient data communication between the host and the accelerator.
    Article · Jan 2015
  • Takaaki Miyajima · David Thomas · Hideharu Amano
    ABSTRACT: Computationally intensive applications using open-source libraries such as OpenCV, BLAS, or FFT are widely used in research and industry. Although optimized code for such libraries is available for accelerators, off-loading is difficult for non-expert users, especially when only the application binary can be accessed. This paper presents a new toolchain for application acceleration called Courier. It requires only an executable binary of the target application and a corresponding function code for an accelerator; it requires neither the source code of the application nor re-compilation of the binary. The workflow of Courier is simple and intended for non-expert users. Courier extracts runtime information from the running binary, generates a task graph, and then replaces the original functions with corresponding accelerator functions. Many steps of the acceleration process are executed automatically, and users can inspect the acceleration results and modify the task graph if needed. In our case studies, Courier was used to accelerate three applications: image processing, matrix multiplication, and spectrum analysis. Functions were off-loaded to a GPU without any modification to the original source code, and the applications were sped up 8.89, 8.16, and 1.23 times, respectively.
    Article · Jan 2015 · IPSJ Transactions on System LSI Design Methodology
  • Takaaki Miyajima · David Thomas · Hideharu Amano
    ABSTRACT: Our toolchain for application acceleration, called Courier-FPGA, is designed to let software programmers and non-expert users exploit the processing power of CPU-FPGA platforms. It automatically gathers runtime information about library functions from a running target binary and constructs the function call graph, including input-output data. It then uses corresponding predefined hardware modules, if they are available for the FPGA, and prepares software functions on the CPU using the Pipeline Generator. The Pipeline Generator builds a pipeline control program using Intel Threading Building Blocks to run the hardware modules and software functions in parallel. Finally, Courier-FPGA dynamically replaces the original functions in the binary and accelerates it using the built pipeline. Courier-FPGA performs these acceleration steps without user intervention, source code tweaks, or re-compilation of the binary. We describe the technical details of this mixed software-hardware pipeline on CPU-FPGA platforms in this paper. In our case study, Courier-FPGA was used to accelerate a corner-detection application binary using the Harris-Stephens method on the Zynq platform. A series of functions was off-loaded, and a speed-up of 15.36 times was achieved using the built pipeline.
    Article · Aug 2014
  • ABSTRACT: Ultralow-voltage (ULV) operation of CMOS circuits is effective for significantly reducing the power consumption of the circuits. Although operation at the minimum energy point (MEP) is effective, its slow operating speed has been an obstacle. The silicon-on-thin-buried-oxide (SOTB) CMOS is a strong candidate for ultralow-power (ULP) electronics because of its small variability and back-bias control. These advantages of SOTB CMOS enable power and performance optimization with adaptive Vth control at ULV and can achieve ULP operation with acceptably high speed and low leakage. In this paper, we describe our recent results on the ULV operation of the CPU, SRAM, ring oscillator, and other logic circuits. Our 32-bit RISC CPU chip, named 'Perpetuum Mobile,' has a record low energy consumption of 13.4 pJ/cycle when operating at 0.35 V and 14 MHz. Perpetuum Mobile microcontrollers are expected to be a core building block in a huge number of electronic devices in the internet-of-things (IoT) era.
    Conference Paper · Jun 2014
  • ABSTRACT: Computational fluid dynamics (CFD) is an important tool for designing aircraft components. FaSTAR (Fast Aerodynamics Routines) is one of the most recent CFD packages and has various subroutines. However, its irregular and complicated data structure makes it difficult to execute FaSTAR on parallel machines because of memory access problems. The use of a reconfigurable platform based on field-programmable gate arrays (FPGAs) is a promising approach to accelerating memory-bottlenecked applications like FaSTAR. However, even with hardware execution, a large number of pipeline stalls can occur due to read-after-write (RAW) data hazards, and it is difficult to predict when such stalls will occur because of the unstructured mesh used in FaSTAR. To eliminate this problem, we developed an out-of-order mechanism that permutes the data order so as to prevent RAW hazards. It uses an execution monitor and a wait buffer: the former tracks the state of the computation units, and the latter temporarily stores data waiting to be processed. This out-of-order mechanism can be applied to various types of computations with data dependencies by changing the number of execution monitors and wait buffers according to the equations used in the target computation, and the system can be reconfigured by changing these parameters automatically. Applying the proposed mechanism to five subroutines in FaSTAR reduced the number of stalls to less than 1% of that without the mechanism. Execution was 2.6 times faster than in-order hardware execution and 2.9 times faster than software execution on an Intel Core 2 Duo processor, with a reasonable amount of overhead.
    Article · May 2014 · IEICE Transactions on Information and Systems
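    The execution monitor and wait buffer described above amount to parking any datum whose producer is still in the pipeline and issuing independent data in the meantime. The sketch below models that reordering at a very high level; the item identifiers, dependency encoding, and pipeline depth are invented for illustration and do not reflect the actual hardware.

```python
# Simplified model of the stall-avoidance idea in the abstract above: data whose
# inputs are still being produced (a RAW hazard) are parked in a wait buffer,
# while independent data are issued immediately. Names and data layout are
# invented for illustration.

from collections import deque

def issue_order(work_items, pipeline_depth=3):
    """work_items: list of (item_id, depends_on) where depends_on is an item_id
    still in flight, or None. Returns the order in which items are issued."""
    in_flight = deque()          # "execution monitor": items inside the pipeline
    wait_buffer = []             # items parked because of a RAW hazard
    issued = []

    for item_id, depends_on in work_items:
        # Re-check parked items first: issue any whose producer has retired.
        still_waiting = []
        for wid, wdep in wait_buffer:
            if wdep in in_flight:
                still_waiting.append((wid, wdep))
            else:
                issued.append(wid); in_flight.append(wid)
        wait_buffer = still_waiting

        if depends_on is not None and depends_on in in_flight:
            wait_buffer.append((item_id, depends_on))   # would stall; park it
        else:
            issued.append(item_id); in_flight.append(item_id)

        if len(in_flight) >= pipeline_depth:
            in_flight.popleft()                         # oldest item retires

    # Drain anything still parked once the pipeline empties.
    issued.extend(wid for wid, _ in wait_buffer)
    return issued

print(issue_order([("a", None), ("b", "a"), ("c", None), ("d", None), ("e", "b")]))
```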
  • ABSTRACT: A 32-bit CPU that operates with a lowest energy of 13.4 pJ/cycle at 0.35 V and 14 MHz, operates from 0.22 V to 1.2 V, and has a sleep current of 0.14 µA is demonstrated. The low-power performance is attained by reverse-body-bias-assisted 65 nm SOTB CMOS (Silicon On Thin Buried oxide) technology. The CPU can operate for more than 100 years on a 610 mAh Li battery.
    Conference Paper · Apr 2014
  • ABSTRACT: The power consumption of the network-on-chip (NoC) is becoming more important in many-core processors, and the input buffers used in routers consume a significant part of the total NoC power. In order to reduce this power consumption, a novel power-efficient memory called Marching Memory Through type (MMTH) is introduced. By connecting transparent latches in tandem, MMTH achieves high-speed operation with low power consumption. MMTH, however, incurs a certain overhead on read operations, and hence we propose a latency reduction scheme based on look-ahead routing. The proposed router was designed in a Renesas 40 nm process and compared with a standard router using conventional register-based FIFOs in terms of network performance, application performance, and power consumption. The evaluation results show that the proposed router reduces power consumption by 42.4% on average at 2 GHz, at the expense of only a 0.5-2.0% performance overhead.
    Conference Paper · Apr 2014
  • ABSTRACT: We show task-level pipelining on multiple accelerators with PEACH2. PEACH2, which is implemented on an FPGA, enables ultra-low-latency direct communication among multiple accelerators across computation nodes. By installing PEACH2, typical high-performance computation nodes are tightly coupled. In this environment, applications can be accelerated by exploiting not only data-level parallelism but also task-level parallelism, and multiple tasks can be processed on multiple accelerators in a pipelined manner. In our evaluation, an application implemented in a task-level pipelined manner achieves a 52% speed-up compared to a single GPU.
    Article · Mar 2014
  • ABSTRACT: A wireless 3D NoC architecture is described for building-block SiPs, in which the number of hardware components (or chips) in a package can be changed after the chips have been fabricated. The architecture uses inductive-coupling links that can connect more than two known-good dies without wire connections. Each chip has data transceivers for the uplink and downlink in order to communicate with its neighboring chips in the package. These chips form a vertical unidirectional ring network so as to fully exploit the flexibility of the wireless approach, which enables us to add, remove, and swap chips in the ring. To avoid protocol and structural deadlocks in the ring, we use bubble flow control, which does not rely on the conventional VC-based deadlock avoidance mechanism. In addition, we propose a bidirectional communication scheme that forms a bidirectional ring network by using inductive-coupling transceivers that can dynamically change communication modes, such as TX, RX, and Idle. This paper describes the inductive-coupling transceiver circuits, which support high data transfer rates of up to 8 Gbps per channel, for the wireless 3D NoC. It also presents an implementation of a wireless 3D NoC with on-chip routers and transceivers in a 65 nm process in order to show the feasibility of our proposal. The vertical bubble flow control and the conventional VC-based approach on the uni- and bidirectional ring networks are compared with a vertical broadcast bus in terms of throughput, hardware amount, and application performance using a full-system multiprocessor simulator. The results show that the proposed bidirectional communication scheme efficiently improves application performance without adding any inductive-coupling transceivers. In addition, the proposed vertical bubble flow network outperforms the conventional VC-based approach by 7.9-12.5 percent with a 33.5 percent smaller router area for building-block SiPs connecting up to eight chips.
    Article · Mar 2014 · IEEE Transactions on Computers
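    Bubble flow control, mentioned above as the ring's deadlock-avoidance mechanism, admits a new packet onto the ring only if a free slot (a "bubble") remains after injection, while packets already on the ring can always advance. A minimal sketch of that injection rule follows, with an arbitrary buffer size.

```python
# Minimal sketch of bubble flow control on a ring: a new packet may be injected
# into a ring buffer only if at least two slots are free (one for the packet,
# one left over as the "bubble"), while packets already travelling on the ring
# only need one free slot downstream. Buffer sizes are arbitrary here.

RING_BUFFER_SLOTS = 4

def can_inject(free_slots: int) -> bool:
    """Injection from the local port: keep one bubble free after injecting."""
    return free_slots >= 2

def can_forward(free_slots: int) -> bool:
    """In-transit traffic on the ring only needs one free slot downstream."""
    return free_slots >= 1

for free in range(RING_BUFFER_SLOTS + 1):
    print(f"{free} free slots: inject={can_inject(free)}, forward={can_forward(free)}")
```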
  • ABSTRACT: Bioinformatics is one of the fields in which FPGAs are most frequently applied. Some applications in this field can be efficiently implemented by systolic arrays, which are intrinsically suited to FPGA implementation. Others can be expressed as numerical computations that can be parallelized through pipelining, instruction-level, and data-level parallelism. This chapter covers two sample applications encountered in bioinformatics, namely homology searches and biochemical molecular simulations, and shows how FPGAs can be effectively harnessed to achieve higher performance than off-the-shelf microprocessor technologies. © 2013 Springer Science+Business Media, LLC. All rights reserved.
    Article · Mar 2014
  • ABSTRACT: Inductive coupling is yet another 3D integration technique that can be used to stack more than three known-good dies in a SiP without wire connections. The power consumed for communication over the inductive-coupling links, however, is one of its big problems. A dynamic on/off link control scheme for a topology-agnostic 3D NoC (network-on-chip) architecture using inductive coupling is proposed. The proposed low-power techniques stop the transistors by cutting off the bias voltage in the transmitters of the wireless vertical links only when link utilization is higher than a threshold, whereas the whole wireless vertical link is shut down when utilization is lower than the threshold, in order to reduce the power consumption of wireless 3D NoCs. Full-system many-core simulations using power parameters derived from a real chip implementation show that the proposed low-power techniques reduce the power consumption by 43.8-55.0%, while the average performance overhead is 1.4% in a wireless topology-agnostic 3D NoC.
    Article · Feb 2014 · IPSJ Transactions on System LSI Design Methodology
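    The control policy described above chooses between a light sleep (cutting the transmitter bias between transfers) and shutting a vertical link down entirely, based on measured link utilization. The sketch below illustrates that threshold test; the threshold value and state names are invented for the example.

```python
# Illustrative sketch of utilization-driven on/off control for a wireless
# vertical link. The threshold and the power states mirror the policy
# described above, but the numbers are invented for the example.

UTILIZATION_THRESHOLD = 0.10   # fraction of cycles the link carries flits

def link_power_state(utilization: float) -> str:
    if utilization > UTILIZATION_THRESHOLD:
        # Link stays up; the transmitter bias is only cut off between transfers.
        return "active (per-transfer bias cutoff)"
    # Rarely used link: shut the whole vertical link down and reroute traffic.
    return "link shut down"

for u in (0.35, 0.12, 0.04, 0.0):
    print(f"utilization {u:.2f} -> {link_power_state(u)}")
```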
  • ABSTRACT: In this paper, we demonstrate that we can reduce the communication latency significantly by inserting a fraction of randomness into a wireless 3D NoC (where CMOS wireless links are used for vertical inter-chip communication) while considering the physical constraints of the 3D design space. Towards this end, we consider two cases: 1) replacing existing horizontal 2D links in a wireless 3D NoC with randomized shortcut links, and 2) enabling full connectivity by adding a randomized NoC layer to a wireless 3D platform with partial or no horizontal connectivity. Packet routing is then optimized by exploiting both the existing and the newly added random links. By adding randomly wired shortcut NoCs to a wireless 3D platform, a good balance can be established between design modularity and the minimum randomness needed to achieve low latency. Experimental results show that adding a random NoC chip to wireless 3D CMPs without built-in horizontal connectivity reduces the communication latency by as much as 26.2% compared to adding a 2D mesh NoC, and the application execution time and average flit transfer energy improve accordingly.
    Conference Paper · Jan 2014
  • ABSTRACT: Power-performance efficiency remains a primary concern for microprocessor designers. One source of power inefficiency in recent LSI chips is increasing leakage power consumption. Power gating is a well-known technique for reducing leakage power by switching off the power supply to idle logic blocks. Recently, fine-grained power gating has emerged as a technique to minimize leakage current during active processor cycles by switching logic blocks on and off at much finer temporal and spatial granularity. Although fine-grained power gating is useful, a comprehensive evaluation and analysis had not been conducted on a real LSI chip. In this paper, we evaluate fine-grained run-time power gating of a microprocessor's functional units using a real embedded microprocessor. We also introduce an architecture-compiler cooperative power-gating scheme that mitigates the negative power reduction caused by the energy overhead associated with fine-grained power gating. The experimental results with a fabricated core show that the hardware-based scheme reduces the power consumption of the functional units by 44%, and the hardware-compiler cooperative scheme further improves power efficiency by 5.9% when the core temperature is 25 °C.
    Conference Paper · Jan 2014
  • ABSTRACT: This paper presents the design and control scheme of a microprocessor whose internal functional units are power-gated on an instruction-by-instruction basis. Enabling and disabling the power gating is adaptively controlled with the support of on-chip leakage monitors and the operating system to minimize the energy overhead due to sleep-in and wake-up. Measured results of the chip fabricated in a 65 nm CMOS technology demonstrate that our approach reduces energy to 21-35% of the non-power-gated case over the temperature range of 25-85 °C. Energy dissipation was reduced by up to 15% compared to the conventional fine-grained power-gating technique in the same temperature range.
    Conference Paper · Jan 2014
  • Remi Chaintreuil · Rie Uno · Hideharu Amano
    ABSTRACT: The Cool Mega-Array is a highly power-efficient coarse-grained reconfigurable accelerator, aimed particularly at multimedia applications executed on battery-driven devices. It consists of a large processing element (PE) array without any memory elements, a simple micro-controller for data management, and the data memory. The power consumption of the PE array itself is very low and can be reduced further by dynamically scaling its supply voltage to match the desired computation speed. A modular version of this design is proposed, which provides the ability to reconfigure the PE array structure and adapt its size to the application. This allows applications with different complexities and degrees of parallelism to be executed on a relatively smaller chip than simply using a large PE array would require, depending on the implementation choices and the set of applications (4 times fewer processing elements in the implementation example). Leakage power can also be reduced by using coarse-grained power gating.
    Conference Paper · Dec 2013

Publication Stats

2k Citations
49.38 Total Impact Points

Institutions

  • 1985-2015
    • Keio University
      • Department of Information and Computer Science
      • Faculty of Science and Technology
      • Graduate School of Science and Technology
      • Center for Computer Science
      • Department of Electronics and Electrical Engineering
      Tokyo, Japan
  • 2013
    • Japan Aerospace Exploration Agency
      Chōfu, Tōkyō, Japan
  • 2009
    • Shibaura Institute of Technology
      Tōkyō, Japan
  • 2005
    • Nagasaki University
      • Department of Computer and Information Science
      Nagasaki, Nagasaki, Japan
  • 2002
    • Toshiba Corporation
      Tōkyō, Japan
  • 2001
    • Hitachi, Ltd.
      • Central Research Laboratory
      Tōkyō, Japan
  • 1997
    • Tokyo Denki University
      • Division of Electrical and Electronic Engineering
      Tokyo, Tokyo-to, Japan