D. Blaauw

Synopsys, Mountain View, California, United States

Publications (417) · 166.59 Total impact

  •
    ABSTRACT: Voltage scaling is widely used to improve SRAM energy efficiency [1-2], particularly in mobile systems with tight power budgets. The resulting energy benefits are limited by the minimum voltage ensuring error-free operation, Vmin, which has stagnated due to growing process variation in advanced technology nodes [3]. Error-tolerant applications and systems (e.g., multimedia) allow more aggressive voltage scaling by operating below Vmin, which is acceptable if errors due to bitcell write/read failures do not perceptibly reduce application quality (e.g., image quality). Unfortunately, in traditional SRAMs bit error rate degrades rapidly for VDD < Vmin [4], limiting energy gains. Under a given quality target, further energy reduction is possible through application-specific methods that exploit the features of data stored in a given application [4-5]. However, these approaches are not reusable across applications, and further the energy-quality trade-off is fixed at design time, which degrades energy savings in applications with lower quality targets and in chips near typical corner
    Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2014 IEEE International, San Francisco, CA; 02/2014
  •
    ABSTRACT: The majority of the power consumption of a high-throughput LDPC decoder is spent on memory. Unlike in a general-purpose processor, the memory access in an LDPC decoder is deterministic and the access window is short. We take advantage of the unique memory access characteristic to design a non-refresh eDRAM that holds data for the necessary access window, and further improve its access time by trading off the excess retention time. The resulting 3T eDRAM cell is designed to balance wordline coupling to reliably retain data for a fast access. We integrate 32 5×210 non-refresh eDRAM arrays in a row-parallel LDPC decoder suitable for the IEEE 802.11ad standard. Memory refresh is eliminated and random access is replaced with a simple sequential addressing. With row merging and dual-frame processing, the 1.6 mm² 65 nm LDPC decoder chip achieves a peak throughput of 9 Gb/s at 89.5 pJ/b, of which only 21% is spent on eDRAMs. With voltage and frequency scaling, the power consumption of the LDPC decoder is reduced to 37.7 mW for a 1.5 Gb/s throughput at 35.6 pJ/b.
    IEEE Journal of Solid-State Circuits 01/2014; 49(3):783-794. · 3.06 Impact Factor
  •
    ABSTRACT: This paper proposes a power-efficient speeded-up robust features (SURF) extraction accelerator targeted primarily for micro air vehicles (MAVs) with autonomous navigation (Fig. 9.7.1). Typical object recognition SoCs [4-6] employ an application-specific algorithm to choose specific regions of interest (ROIs) to reduce computation by focusing on a small portion of the image. However, this approach is not feasible in applications where the whole image must be analyzed, such as visual navigation that requires the extraction of general features to determine location or movement. In addition, multicore architectures need to run at high clock frequencies to meet high peak performance requirements and the power consumption of inter-core communication becomes prohibitive. Since feature extraction algorithms require significant memory accesses across a large area, parallelization in a multicore system requires costly high-bandwidth memories for massive intermediate data.
    Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2013 IEEE International; 01/2013
  •
    ABSTRACT: The theoretical lower limit of subthreshold swing in MOSFETs (60 mV/decade) significantly restricts low-voltage operation since it results in a low ON-to-OFF current ratio at low supply voltages. This paper investigates extremely low-power circuits based on new Si/SiGe heterojunction tunneling transistors (HETTs) that have a subthreshold swing of . Device characteristics, as determined through technology computer aided design tools, are used to develop a Verilog-A device model to simulate and evaluate a range of HETT-based circuits. We show that an HETT-based ring oscillator (RO) shows a 9-19 times reduction in dynamic power compared to a CMOS RO. We also explore two key differences between HETTs and traditional MOSFETs, namely, asymmetric current flow and increased Miller capacitance, analyze their effect on circuit behavior, and propose methods to address them. HETT characteristics have the most dramatic impact on static random access memory (SRAM) operation and we propose a novel seven-transistor HETT-based SRAM cell topology to overcome, and take advantage of, the asymmetric current flow. This new HETT SRAM design achieves 7-37 times reduction in leakage power compared to CMOS.
    IEEE Transactions on Very Large Scale Integration (VLSI) Systems 01/2013; 21(9):1632-1643. · 1.22 Impact Factor
  • Cheng Zhuo, D. Sylvester, D. Blaauw
    ABSTRACT: Oxide breakdown has become an increasingly pressing reliability issue in modern very large scale integration design with ultrathin oxides. The conventional guard-band methodology assumes uniformly thin oxide thickness, resulting in overly pessimistic reliability estimation that severely degrades system performance. In this paper, we present the use of limited post-fabrication measurements of oxide thicknesses from on-chip sensors to aid in the chip-level oxide breakdown reliability management. A key challenge, which is the focus of this paper, is precisely predicting and managing the reliability condition of each chip with a limited number of measurements and quantifying the tradeoff between reliability margin and system performance. Given the post-fabrication measurements, chip oxide breakdown reliability can be formulated as a conditional distribution that allows one to achieve a significantly more accurate chip lifetime estimation. The estimation is then used to individually tune the supply voltage of each chip for performance maximization while maintaining or improving the reliability. Experimental results show that, by using 25 measurements, the proposed method can achieve an average of 19% performance improvement, and a 27% maximum for a design with up to 50 million devices, with an average operation time of approximately 0.4 s per chip.
    IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 01/2013; 32(4):630-643. · 1.09 Impact Factor
  •
    ABSTRACT: We propose Bubble Razor, an architecturally independent approach to timing error detection and correction that avoids hold-time issues and enables large timing speculation windows. A local stalling technique that can be automatically inserted into any design allows the system to scale to larger processors. We implemented Bubble Razor on an ARM Cortex-M3 microprocessor in 45 nm CMOS without detailed knowledge of its internal architecture to demonstrate the technique's automated capability. The flip-flop based design was converted to two-phase latch timing using commercial retiming tools; Bubble Razor was then inserted using automatic scripts. This system marks the first published implementation of a Razor-style scheme on a complete, commercial processor. It provides an energy efficiency improvement of 60% or a throughput gain of up to 100% compared to operating with worst case timing margins.
    IEEE Journal of Solid-State Circuits 01/2013; 48(1):66-81. · 3.06 Impact Factor
  •
    ABSTRACT: Data communication between local system blocks through on-chip global interconnects presents significant design challenges in scaled VLSI systems. The goal of this research is to reduce the energy consumed per bit transmitted, while achieving Gb/s data rates over interconnect lengths up to 10mm. Voltage-mode signaling with capacitive boosting [1-2] has been proposed for low-power on-chip interconnects. To increase the data rate over RC-limited interconnect, aggressive equalization schemes should be used in receivers [1-3] and transmitters [1-2] at the cost of significant power consumption. As an alternative to voltage-mode signaling, current-mode signaling has been considered. It was originally used for fast bitline sensing in memory [4-5] to take inherent advantage of a reduced RC time constant. However, prior work on current-mode transceivers for on-chip interconnect shows worse energy efficiency than their voltage-mode counterparts due to large static power dissipation by current-sensing circuit [6-7]. This paper presents a 95fJ/b current-mode transceiver for on-chip global interconnect. The transceiver is implemented in 65nm CMOS and achieves a data rate of up to 4Gb/s over a 10mm link with a BER of less than 10⁻¹².
    Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2013 IEEE International; 01/2013
  •
    ABSTRACT: Fast boosting of supply rails is critical for near-threshold computing to overcome serial code bottlenecks. A novel supply boosting technique, called Shortstop, boosts a 3nF core in 26ns while maintaining acceptable supply voltage droops. The innate parasitic inductance of a dedicated dirty supply rail is used as a boost-converter and combined with an on-chip boost capacitor. Shortstop boosts a core up to 1.8× faster than a header-based approach, while reducing supply droop by 2-7×.
    VLSI Circuits (VLSIC), 2013 Symposium on; 01/2013
  •
    ABSTRACT: In this paper, we explore the challenges in scaling on-chip networks towards kilo-core processors. Current low-radix topologies optimize for fast local communication, but do not scale well to kilo-core systems because of the large number of routers required. These increase both power and hop count. In contrast, symmetric high-radix topologies optimize for global communication with fewer hop counts, but degrade local communication with their large, slow routers.
    High Performance Computer Architecture (HPCA2013), 2013 IEEE 19th International Symposium on; 01/2013
  •
    ABSTRACT: Novel ultra low-leakage ESD power clamp designs for wireless sensor applications are proposed and implemented in 0.18μm CMOS. Using new biasing structures to limit both subthreshold leakage and GIDL, the proposed designs consume as little as 43pW at 25°C and 119nW at 125°C with 4500V HBM level and 400V MM level protection, marking an 18-139× leakage reduction over conventional ESD clamps.
    Custom Integrated Circuits Conference (CICC), 2013 IEEE; 01/2013
  •
    ABSTRACT: Emerging demands on ultra-low-power wireless sensor platform have presented challenges for nano-watt design of various circuit components. Clock management unit, as an essential block, is one of the most actively researched blocks. It is required to distribute various frequency ranges for energy-optimal operation, e.g., Hz for internal timer [1], kHz for global clock [2], and MHz for fast data transmission or intensive signal processing [3]. However, free-running oscillators are seriously affected by process variations and should be readjusted by post-fabrication trimming. Though a crystal gives a stable frequency, the use of multiple crystals is generally not allowed by limited form-factor and increased cost. Instead, frequency multiplication from one clean reference is more effective way for higher frequency generation. Considering high-frequency clock is only intermittently used in sensor applications, the clock multiplier should provide a fast settling when turned on as well as low-power dissipation. This paper presents a 423nW, 3.2 MHz all-digital multiplying DLL (MDLL) with a digitally controlled leakage-based oscillator (DCLO) and a fast frequency relocking scheme adaptive to the amount of frequency drift during sleep state, which is required for intermittent operation of sensor node platforms.
    Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2013 IEEE International; 01/2013
  •
    ABSTRACT: We propose a maximum power point tracking (MPPT) circuit for micro-scale sensor systems that measures ripple voltages in a switched capacitor energy harvester. Compared to conventional current mirror type MPPT circuits, this design incurs no voltage drop and does not require high bandwidth amplifiers. Using correlated double sampling, high accuracy is achieved with a power overhead of 5%, even at low harvested currents of 1.4uA based on measured results in 180nm CMOS.
    VLSI Circuits (VLSIC), 2013 Symposium on; 01/2013
  •
    ABSTRACT: A 346μm² reference-free, asynchronous VCO-based sensor interface circuit is demonstrated in 28nm LP CMOS. This design does not require high accuracy current sources, voltage sources, or low jitter timing references. It achieves wide resolution and voltage scalability, and consumes only ~1/100th the area of prior approaches. Resolution can be scaled from 2.8 to 11.7 bits and VDD from 500mV to 1.0V.
    Solid-State Circuits Conference (A-SSCC), 2013 IEEE Asian; 01/2013
  •
    ABSTRACT: We propose a temperature sensor using a novel process-invariant temperature sensing element and voltage to current converter for battery-operated ultra-low power micro systems. By introducing a new temperature-to-voltage sensing element that outputs only 75mV, the sensor achieves ultra-low power. The sensor was implemented in 180nm CMOS process and uses 0.09mm² of area. Measurements from test chips show 65nW power consumption, the lowest reported to date, with an inaccuracy of +1.3°C/-1.4°C across 0°C to 100°C after 2-point calibration.
    Custom Integrated Circuits Conference (CICC), 2013 IEEE; 01/2013
  •
    ABSTRACT: We present a self-adapting power management unit (PMU) for ultra-low power wireless sensor nodes. The PMU uses 1.03nF of on-chip MIM capacitance in a reconfigurable switched-capacitor network (SCN) that automatically adapts to different battery voltages for down-conversion and different harvesting sources/harvesting conditions for up-conversion. The PMU achieves 63.8% / 60.7% down-conversion efficiency at 17.9μW active mode / 12.8nW sleep mode power loading. With the adaptive down-conversion ratio, load power range is improved by 3.76× and 5.48× in sleep and active mode, respectively. We show how the proposed adaptation method enables harvesting with solar, microbial fuel cell, and thermal energy sources, increases harvesting efficiency by 1.92× and achieves the peak extraction efficiency of 99.8% for solar cell.
    Circuits and Systems (ISCAS), 2013 IEEE International Symposium on; 01/2013
  •
    ABSTRACT: Supply-voltage scaling has stagnated in recent technology nodes, leading to so-called dark silicon. To increase overall chip multiprocessor (CMP) performance, it is necessary to improve the energy efficiency of individual tasks so that more tasks can be executed simultaneously within thermal limits. In this article, the authors investigate the limit of voltage scaling together with task parallelization to maintain task completion latency while reducing energy consumption. Additionally, they examine improvements in energy efficiency and parallelism when serial portions of code can be overcome through quickly boosting a core's operating voltage. When accounting for parallelization overheads, minimum task energy is obtained at near-threshold supply voltages across six commercial technology nodes and provides 4× improvement in overall CMP performance. Boosting is most effective when the task is modestly parallelizable but not highly parallelizable.
    IEEE Micro 01/2013; 33(5):30-37. · 2.39 Impact Factor
  •
    ABSTRACT: Visual monitoring with CMOS image sensors opens up a variety of new applications for wireless sensor nodes, ranging from military surveillance to in vivo molecular imaging. In particular, the ability to detect motion can enable more intelligent power management through on-demand duty cycling and reduced data-retention requirements. Conventional imager designs focus on achieving higher resolution, frame rate [1], or dynamic range [2], resulting in power consumption levels that are unsuitable for battery-powered wireless sensor nodes [3].
    Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2013 IEEE International; 01/2013
  •
    ABSTRACT: Centip3De uses the synergy between 3D integration and near-threshold computing to create a reconfigurable system that provides both energy-efficient operation and techniques to address single-thread performance bottlenecks. The original Centip3De design is a seven-layer 3D stacked design with 128 cores and 256 Mbytes of DRAM. Silicon results show a two-layer, 64-core system in 130-nm technology, which achieved an energy efficiency of 3,930 DMIPS/W.
    IEEE Micro 01/2013; 33(2):8-16. · 2.39 Impact Factor
  •
    ABSTRACT: Ultra-low power microsystems are gaining more popularity due to their applicability in critical areas of societal need. Power management in these microsystems is a major challenge as a relatively high battery voltage (e.g., 4V) must be down-converted to several low supplies, such as 0.6V for near-threshold digital circuits and 1.2V for analog circuits [1]. Furthermore, the small form factors of such systems rule out the use of external inductors, making switched-capacitor (SC) DC-DC converters the favored topology [2-4].
    Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2013 IEEE International; 01/2013
  •
    ABSTRACT: We present Centip3De, a large-scale 3D CMP with a cluster-based near-threshold computing (NTC) architecture. Centip3De uses a 3D stacking technology in conjunction with 130 nm CMOS. Measured results for a two-layer, 64-core system are discussed, with the system achieving 3930 DMIPS/W energy efficiency, which is a >3x improvement over traditional operation at full supply voltage. This project demonstrates the feasibility of large-scale 3D design, a synergy between 3D and NTC architectures, a unique cluster-based NTC cache design, and how to maximize performance in a thermally-constrained design.
    IEEE Journal of Solid-State Circuits 01/2013; 48(1):104-117. · 3.06 Impact Factor

Publication Stats

6k Citations
166.59 Total Impact Points


  • 2011
    • Synopsys
      Mountain View, California, United States
  • 2001–2011
    • University of Michigan
      • Department of Electrical Engineering and Computer Science (EECS)
      Ann Arbor, MI, United States
  • 2004–2010
    • Intel
      Santa Clara, California, United States
  • 2002–2010
    • Concordia University–Ann Arbor
      Ann Arbor, Michigan, United States
    • ARM Ltd
      Cambridge, England, United Kingdom
  • 2005–2006
    • Indian Institute of Technology Kanpur
      Cawnpore, Uttar Pradesh, India
    • Texas A&M University
      • Department of Electrical and Computer Engineering
      College Station, TX, United States
    • Arizona State University
      • School of Electrical, Computer and Energy Engineering
      Mesa, AZ, United States
  • 2002–2004
    • Sun Pharma USA
      Philadelphia, Pennsylvania, United States
  • 2003
    • The University of Arizona
      • Department of Electrical and Computer Engineering
      Tucson, AZ, United States
  • 1989–2003
    • University of Illinois, Urbana-Champaign
      • Coordinated Science Laboratory
      • Department of Electrical and Computer Engineering
      Urbana, Illinois, United States