A. Raychowdhury

Georgia Institute of Technology, Atlanta, Georgia, United States

Publications (97) · 53.65 Total impact

  •
    ABSTRACT: Advanced human-machine interfaces require improved embedded sensors that can seamlessly interact with the user. Voice-based communication has emerged as a promising interface for next generation mobile, automotive and hands-free devices. Presented here is such an audio front-end with Voice Activity Detection (VAD) hardware targeted for low-power embedded SoCs, featuring a 512 pt FFT, programmable filters, noise floor estimator and a decision engine which has been fabricated in 32 nm CMOS. The dual-VCC, dual-frequency design allows the core datapath to scale to near-threshold voltage (NTV), where power consumption is less than 50 uW. At peak energy efficiency, the core can process audio data at 2.3 nJ/frame - a 9.4X improvement over nominal voltage conditions.
    IEEE Journal of Solid-State Circuits 01/2013; 48(8):1963-1969. · 3.06 Impact Factor
  • A. Raychowdhury
    ABSTRACT: Ever larger on-die memory arrays for future processors in CMOS logic technology drive the need for dense and scalable embedded memory alternatives beyond SRAM and eDRAM. Recent advances in non-volatile STT-RAM technology, which stores data by the spin orientation of a soft ferromagnetic material and shows current induced switching, have created interest for its use as embedded memory [1-3]. When a spin-polarized current passes through a mono-domain ferromagnet, it attempts to polarize the current in its preferred direction of magnetic moment. As the ferromagnet absorbs some of the angular momentum of the electrons, it creates a torque that causes a flip in the direction of magnetization in the ferromagnet. This is used in magnetic tunneling junction (MTJ) based spin torque transfer (STT) RAM cells where a thin insulator (MgO) is sandwiched between a fixed ferromagnetic layer (polarizer) and the free layer (storage node). This can be integrated in the metal stack (Fig. 1) and hence provide high memory density. Depending on the direction of the current flow (perpendicular to these layers in our study), the magnetization of the free layer is switched to a parallel (P: low resistance state) or anti-parallel (AP: high resistance state) state. The minimum size cell (mincell) contains an access transistor (Tx) of width 2F (WTX=2F, F: half-pitch of the process node) and a planar storage node of dimensions 2F×F. The area of the mincell is 3F×2F=6F². In this paper, we examine the design space for key magnetic material properties and access transistor needed for embedded on-die memory with adequate scalability, density, read/write performance and robustness against various intrinsic variabilities and disturbances. New models and simulation methodologies, calibrated to existing measurements [1], for read, write and disturbance mechanisms are developed. Different storage node structures and materials are evaluated to reveal the most promising scaling options.
    Computer-Aided Design (ICCAD), 2013 IEEE/ACM International Conference on; 01/2013
  •
    ABSTRACT: Domain Wall Memory (DWM) is a recently developed spin-based memory technology in which several bits of data are densely packed into the domains of a ferromagnetic wire. DWM has shown great promise in enabling non-volatile memory with unprecedented density and high energy efficiency. In this work, we propose TapeCache, a first attempt to employ DWMs as last-level caches in general purpose computing platforms. DWMs enable much higher density compared to SRAM, DRAM, and other spin-based memory technologies such as STT-MRAM. However, they also pose unique challenges such as serial access to the bits stored in a DWM cell, leading to variable access latencies. We propose a novel circuit-architecture co-design for TapeCache, consisting of (i) a multi-port DWM macro-cell optimized for read operations considering the asymmetry in applications' read/write characteristics, and (ii) a new cache organization and suitable management policies that mitigate the performance penalty arising from serial access to bits in a macro-cell. Over a wide range of SPEC 2006 benchmarks, TapeCache achieves 7.8X improvement in area, an average energy improvement of 7.3X, and an average performance improvement of 1.2% compared to an iso-capacity SRAM cache. Compared to an iso-capacity STT-MRAM cache, TapeCache obtains 2.3X improvement in area and 1.4X average energy savings with virtually identical performance.
    07/2012;
  • A. Raychowdhury, D. Somasekhar, J. Tschanz, V. De
    ABSTRACT: A fully-digital phase-locked low dropout regulator (LDO) has been designed in 32nm CMOS for fine-grained power delivery to multi-Vcc digital circuits. Measurements across a wide range of input voltages and currents exhibit that the LDO offers excellent load regulation and efficiency close to 97% of ideal efficiency at nominal load current conditions (ILOAD=3mA).
    VLSI Circuits (VLSIC), 2012 Symposium on; 01/2012
  •
    ABSTRACT: This session brings together specialists from the DfT, DfY and DfR domains that will address key problems together with their solutions for the 14nm node and beyond, dealing with extremely complex chips affected by high defect levels, unpredictable and heterogeneous timing behavior, circuit degradation over time, including extreme situations related with the ultimate CMOS nodes, where all processor nodes, routers and links of single-chip massively parallel tera-device processors could comprise timing faults (such as delay faults or clock skews); a large percentage of these parts are affected by catastrophic failures; all parts experience significant performance degradations over time; and new catastrophic failures occur at low MTBF.
    01/2012;
  •
    ABSTRACT: Infrequent dynamic events like VCC droops and temperature changes result in the use of a static VCC guardband in 8T SRAM arrays. This paper proposes the use of tunable replica bits (TRBs) as a potential solution to mitigating a part of the VCC guardband. Measured data on a 16 KB 8T array featuring tunable replica bits illustrate 9% reduction of the operating minimum VCC (VMIN) and correspondingly a 7.5% reduction in array power.
    IEEE Journal of Solid-State Circuits 05/2011; · 3.06 Impact Factor
  •
    ABSTRACT: A 45 nm microprocessor core integrates resilient error-detection and recovery circuits to mitigate the clock frequency (FCLK) guardbands for dynamic parameter variations to improve throughput and energy efficiency. The core supports two distinct error-detection designs, allowing a direct comparison of the relative trade-offs. The first design embeds error-detection sequential (EDS) circuits in critical paths to detect late timing transitions. In addition to reducing the FCLK guardbands for dynamic variations, the embedded EDS design can exploit path-activation rates to operate the microprocessor faster than infrequently-activated critical paths. The second error-detection design offers a less-intrusive approach for dynamic timing-error detection by placing a tunable replica circuit (TRC) per pipeline stage to monitor worst-case delays. Although the TRCs require a delay guardband to ensure the TRC delay is always slower than critical-path delays, the TRC design captures most of the benefits from the embedded EDS design with less implementation overhead. Furthermore, while core min-delay constraints limit the potential benefits of the embedded EDS design, a salient advantage of the TRC design is the ability to detect a wider range of dynamic delay variation, as demonstrated through low supply voltage (VCC) measurements. Both error-detection designs interface with error-recovery techniques, enabling the detection and correction of timing errors from fast-changing variations such as high-frequency VCC droops. The microprocessor core also supports two separate error-recovery techniques to guarantee correct execution even if dynamic variations persist. The first technique requires clock control to replay errant instructions at 1/2 FCLK. In comparison, the second technique is a new multiple-issue instruction replay design that corrects errant instructions with a lower performance penalty and without requiring clock control. Silicon measurements demonstrate that resilient circuits enable a 41% throughput gain at equal energy or a 22% energy reduction at equal throughput, as compared to a conventional design when executing a benchmark program with a 10% VCC droop. In addition, the microprocessor includes a new adaptive clock control circuit that interfaces with the resilient circuits and a phase-locked loop (PLL) to track recovery cycles and adapt to persistent errors by dynamically changing FCLK for maximum efficiency.
    IEEE Journal of Solid-State Circuits 02/2011; · 3.06 Impact Factor
  •
    ABSTRACT: A 45 nm microprocessor integrates an all-digital dynamic variation monitor (DVM) to continuously measure the impact of dynamic parameter variations on circuit-level performance to enhance silicon debug and adaptive clock control. The DVM consists of a tunable replica circuit, a time-to-digital converter, and multiplexers to measure circuit delay or frequency changes with less than a 1% measured resolution error while capturing clock-to-data correlations. In validating the DVM with microprocessor maximum clock frequency (FMAX) measurements, an on-die noise injector circuit induces a supply voltage (VCC) droop at a particular cycle in the test program. The FMAX measurement is then repeated for over a thousand iterations while shifting the droop placement to a different cycle per iteration. Silicon measurements demonstrate the DVM capability of tracking the worst case FMAX reduction to within 1% for a wide range of VCC droop profiles. Furthermore, silicon measurements reveal that FMAX is highly sensitive to the placement and magnitude of a high-frequency VCC droop during program execution, thus highlighting the value of the DVM for silicon debug. In addition, the DVM interfaces with an adaptive clock control circuit to dynamically adjust the clock frequency by changing the divide ratio in the phase-locked loop in response to persistent variations, enabling the microprocessor to adapt to the operating environment for maximum efficiency.
    IEEE Trans. on Circuits and Systems. 01/2011; 58-I:2017-2025.
  •
    ABSTRACT: A novel MTJ stack employing a bidirectional switching mechanism has been proposed. By using a pulsed bidirectional current, an MTJ with a SAF free layer and AP fixed layers can be switched through a metastable ferromagnetic state. Numerical simulations suggest high readability (like single barrier structures) and writability (like double barrier structures) of the proposed device.
    01/2011;
  •
    ABSTRACT: Built-in resiliency features enable a microprocessor to detect and correct errors due to fast dynamic voltage droop events as well as other types of dynamic variations. Timing errors in the microprocessor core as well as read (RD) and write (WR) errors in the 8T SRAM based cache can be detected. As a result, guardbands added for these variations are reduced or eliminated, improving performance and reducing power consumption. Measurements on a 45 nm research microprocessor core demonstrate 41% improvement in throughput or 22% reduction in energy at 0.8 V. Measurements on the cache demonstrate reduction of the minimum operating Vcc (VMIN) by 9% thereby resulting in a 7.5% reduction of net operating power.
    IEEE Journal on Emerging and Selected Topics in Circuits and Systems - JETCAS. 01/2011; 1(3):208-217.
  •
    ABSTRACT: A 45 nm microprocessor integrates an all-digital dynamic variation monitor (DVM) to continuously measure the impact of dynamic parameter variations on circuit-level performance to enhance silicon debug and adaptive clock control. The DVM consists of a tunable replica circuit, a time-to-digital converter, and multiplexers to measure circuit delay or frequency changes with less than a 1% measured resolution error while capturing clock-to-data correlations. In validating the DVM with microprocessor maximum clock frequency (FMAX) measurements, an on-die noise injector circuit induces a supply voltage (VCC) droop at a particular cycle in the test program. The FMAX measurement is then repeated for over a thousand iterations while shifting the droop placement to a different cycle per iteration. Silicon measurements demonstrate the DVM capability of tracking the worst case FMAX reduction to within 1% for a wide range of VCC droop profiles. Furthermore, silicon measurements reveal that FMAX is highly sensitive to the placement and magnitude of a high-frequency VCC droop during program execution, thus highlighting the value of the DVM for silicon debug. In addition, the DVM interfaces with an adaptive clock control circuit to dynamically adjust the clock frequency by changing the divide ratio in the phase-locked loop in response to persistent variations, enabling the microprocessor to adapt to the operating environment for maximum efficiency.
    Circuits and Systems I: Regular Papers, IEEE Transactions on 01/2011; 58(9):2017-2025. · 2.24 Impact Factor
  •
    ABSTRACT: High leakage power in sub-100-nm memory technology nodes drives the need for nonvolatile memory devices to reduce power consumption and enhance battery life. Spin-transfer torque magnetic tunneling junction (STT-MTJ) is a promising nonvolatile memory device with comparable read and write performances as SRAM and eDRAM with almost zero standby power. In this paper, we present a simulation framework that can solve transport (using nonequilibrium Green's function (NEGF) formalism) and magnet dynamics (using Landau-Lifshitz-Gilbert (LLG) equation) self-consistently to study the read and write performances of STT-MTJ. Due to process variations, thermal disturbances, and stray fields, the performance of STT-MTJ degrades and results in one transistor-one STT-MTJ (1T-1STTMTJ) memory failures. A thorough memory design space investigation can help us to reduce such failures. Hence, we present a design space exploration framework for 1T-1STTMTJ memory, which consists of magnetic materials with different RAPA products, different genres of MTJ stacks, and a transistor. A comprehensive study based on critical memory performance metrics such as tunneling magnetoresistance, JC, and write cycle shows the relative merits and demerits of each MTJ stack for embedded memory applications. Finally, the benefits of synthetic antiferromagnet free layer in providing immunity against stray fields are shown illustrating the need for coupled free-layer stacks in scaled technology nodes. Index Terms: Dual-barrier, dual-free-layer, failures, magnetic memory, magnetic tunneling junction (MTJ), non-equilibrium Green's function (NEGF), spin-transfer torque (STT), variability, yield.
    IEEE Transactions on Electron Devices 01/2011; 58(12):4333-4343. · 2.06 Impact Factor
  •
    ABSTRACT: This paper presents numerical analysis of domain wall propagation for dense embedded memory applications. Self-consistent simulation framework based on Four Component Spin Transport Model and Landau-Lifshitz-Gilbert equation is able to capture domain wall motion in terms of critical current density requirement, domain wall velocity, and power dissipation. Effect of patterned notches on memory stability, domain wall velocity and nanostrip resistance are also presented. Finally, the proposed simulation framework is used to investigate performance, scalability and organization of the domain wall motion based memory structure.
    Electron Devices Meeting (IEDM), 2011 IEEE International; 01/2011
  •
    ABSTRACT: A 45nm microprocessor integrates an all-digital dynamic variation monitor (DVM), consisting of a tunable replica circuit with a time-to-digital converter, to measure the impact of dynamic variations on path-level delay or frequency. Measurements reveal the high sensitivity of the microprocessor maximum clock frequency (FMAX) to the placement and magnitude of a high-frequency supply voltage (VCC) droop and demonstrate the DVM capability of tracking FMAX changes to within 1% for a wide range of VCC droop profiles. Furthermore, the DVM interfaces with an adaptive clock control circuit to dynamically change the clock frequency in response to dynamic variations, enabling the microprocessor to operate at maximum efficiency.
    Custom Integrated Circuits Conference (CICC), 2010 IEEE; 10/2010
  • A. Raychowdhury, D. Somasekhar, T. Karnik, V.K. De
    ABSTRACT: The paper presents a RD disturb model study of STT-MTJ memory bits. It shows that high-current short-pulsed RD may cause failure under hammer conditions. Analytical models for such have been developed and validated against numerical simulations.
    Device Research Conference (DRC), 2010; 07/2010
  •
    ABSTRACT: A 45 nm 1.3 GHz microprocessor core employs error-detection circuits, tunable replica circuits, and error-recovery circuits, to mitigate dynamic variation guardbands for maximum throughput. An adaptive clock controller adjusts the frequency based on error statistics to optimize efficiency. Silicon measurements show resilient operation as well as throughput gains of 12 to 16% at 1.0 V and 22 to 23% at 0.8 V.
    Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2010 IEEE International; 03/2010
  •
    ABSTRACT: A 16 KB 8T register-file macro in a 45 nm CMOS process uses on-die PVT-adaptive boosting of read- and write-wordline for minimizing VMIN while reducing boosting overhead for maximum power benefit. Measurements of 1 MB 8T arrays in a single-VCC µP core indicate 6 to 27% lower power for arrays access variations of 10% (75 pF) to 30% (1 nF).
    Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2010 IEEE International; 03/2010
  • S.K. Gupta, A. Raychowdhury, K. Roy
    ABSTRACT: Ultralow-power dissipation can be achieved by operating digital circuits with scaled supply voltages, albeit with degradation in speed and increased susceptibility to parameter variations. However, operating digital logic and memory circuits in the subthreshold region (supply voltage less than the transistor threshold voltage) for ultralow-power operations requires device, circuit as well as architectural design optimizations, different from the conventional superthreshold design. This paper analyzes such optimizations from energy dissipation point of view and shows that it is feasible to achieve robust operation of ultralow-voltage systems. Operation with power supply as low as 60 mV is demonstrated. Techniques to reduce the impact of process variations on subthreshold circuits are also discussed. In addition, it is shown that subthreshold leakage current can be useful for other applications like thermal sensors.
    Proceedings of the IEEE 03/2010; · 6.91 Impact Factor
  • A. Raychowdhury, D. Somasekhar, T. Karnik, V. De
    ABSTRACT: This paper presents modeling and analysis of 1T-1MTJ STT RAM memory arrays under process variations and thermal disturbances. Bounds on the magnetic material design space for embedded applications are illustrated. Impact of relaxed timing/area and the effect of scaling for 1T-1MTJ bitcells have been evaluated.
    Electron Devices Meeting (IEDM), 2009 IEEE International; 01/2010
  • IEEE Custom Integrated Circuits Conference, CICC 2010, San Jose, California, USA, 19-22 September, 2010, Proceedings; 01/2010

Publication Stats

1k Citations
53.65 Total Impact Points

Institutions

  • 2013
    • Georgia Institute of Technology
      • School of Electrical & Computer Engineering
      Atlanta, Georgia, United States
  • 2010
    • Intel
      Santa Clara, California, United States
  • 2003–2010
    • Purdue University
      • School of Electrical and Computer Engineering
      West Lafayette, Indiana, United States
  • 2008
    • Texas A&M University
      • Department of Computer Science and Engineering
      College Station, Texas, United States