M. Anders

Intel, Santa Clara, California, United States

Publications (62) · 44.1 Total impact

  • ABSTRACT: Energy-efficient SIMD permutation operations are key for maximizing high-performance microprocessor vector datapath utilization in multimedia, graphics, and signal processing workloads [1–3]. A wide SIMD vector permutation engine is required to achieve high-throughput data rearrangement operations on large data sets, with scaled supply voltages to deliver high energy efficiency. An ultra-low-voltage reconfigurable 4-way to 32-way SIMD vector permutation engine consisting of a 32-entry × 256b 3-read/1-write ported register file with a 256b byte-wise any-to-any permute crossbar for 2-dimensional shuffle is fabricated in 22nm CMOS. The register file integrates a vertical shuffle across multiple entries into read/write operations, and includes clockless static reads with shared P/N dual-ended transmission gate (DETG) writes, improving register file VMIN by 250mV across PVT variations with a wide dynamic operating range of 280mV-1.1V. The permute crossbar implements an interleaved folded byte-wise multiplexer layout forming an any-to-any fully-connected tree to perform a horizontal shuffle with permute accumulate circuits, and includes vector flip-flops, stacked min-delay buffers, shared gates to average min-sized transistor variation, and ultra-low-voltage split-output (ULVS) level shifters improving logic VMIN by 150mV, while enabling peak energy efficiency of 585GOPS/W measured at 260mV, 50°C. The permutation engine occupies a dense layout of 0.048mm² (Fig. 10.1.7) while achieving: (i) nominal register file performance of 1.8GHz, 106mW measured at 0.9V, 50°C; (ii) robust register file functionality measured down to 280mV (subthreshold) with peak energy efficiency of 154GOPS/W; (iii) scalable permute crossbar performance of 2.9GHz, 69mW measured at 1.1V, 50°C with deep sub-threshold operation at 240mV, 10MHz consuming 19μW; and (iv) a 64b 4×4 matrix transpose algorithm with 53% energy savings and 42% improved peak throughput of 263Gbps measured at 1.8GHz, 0.9V.
    IEEE Journal of Solid-State Circuits 01/2012; 48(1):178-180. · 3.06 Impact Factor
  • ABSTRACT: This paper describes an all-digital PVT-variation tolerant true-random number generator (TRNG), fabricated in 45 nm high-k/metal-gate CMOS, targeted for on-die entropy generation in high-performance microprocessors. The TRNG harvests differential thermal-noise at the diffusion nodes of a pre-charged cross-coupled inverter pair to resolve out of metastability, generating one random bit/cycle. A self-calibrating 2-step tuning mechanism using coarse-grained configurable inverters and fine-grained programmable clock delay generators, along with an entropy-tracking feedback loop, provides tolerance to 20% PVT variation-induced device mismatches, enabling lowest-reported energy-consumption of 2.9 pJ/bit with a dense layout occupying 4004 μm², while achieving: (i) 2.4 Gbps random bit throughput, 7 mW total power consumption with 0.7 mW leakage power component, measured at 1.1 V, 50°C, (ii) random bitstreams that pass all NIST RNG tests with raw entropy/bit measured up to 0.9999999993, (iii) good distribution of 1's with 4-bit entropy of 3.97996 and high-entropy pattern probability of 0.066, (iv) wide operating supply voltage range with robust sub-threshold voltage performance of 14 Mbps, 5.6 μW, measured at 280 mV, 50°C, (v) 12 fine-grained high-entropy settings for the TRNG to dither in during steady-state operation, (vi) <3% error while using an analytical ergodic Markov chain model for predicting pattern probabilities, and (vii) 200x higher throughput and 9x higher energy-efficiency than previously reported implementations. Design modifications for robust operation in 22 nm high-volume manufacturing in the presence of 3σ process variations demonstrate scalability of the all-digital design to future technologies.
    IEEE Journal of Solid-State Circuits 01/2012; 47(11):2807-2821. · 3.06 Impact Factor
  • ABSTRACT: A 128-entry × 152b 3-read/2-write ported multi-precision floating-point register file/shuffler with measured 2.8GHz operation is fabricated in 1.05V, 32nm CMOS. Single-precision (24b-mantissa), 2-way 12b or 4-way 6b reduced mantissa precision modes, certainty tracking bits, mode-dependent gating, area-efficient windowing using 1R/1W cells, and ultra-low-voltage read/write circuits enable 350mV-1.2V wide dynamic voltage range with measured peak energy-efficiency of 751GOPS/W at 400mV, 4-way 6b-mode (22.3× higher than 1.05V single-precision mode) and 19% area reduction over single-precision 3R/2W implementations.
    VLSI Circuits (VLSIC), 2012 Symposium on; 01/2012
  • ABSTRACT: High-throughput floating-point computations are key building blocks of 3D graphics, signal processing and high-performance computing workloads [1,2]. Higher floating-point precisions offer improved accuracy at the expense of performance and energy efficiency, with variable-precision floating-point circuits providing run-time precision selection [3]. Real-time certainty tracking enables variable-precision circuits not only to operate at the higher energy efficiency of low-precision datapaths, but also to preserve high-precision accuracy. A variable-precision floating-point unit that performs fused multiply-adds (FMA) with single-cycle throughput while supporting operation in either 1-way single-precision (24b mantissa), 2-way 12b precision or 4-way 6b precision modes is fabricated in 32nm High-k/Metal-gate CMOS [4]. Simultaneous floating-point certainty tracking, preshifted addends, a combined rounding and negation incrementer, efficient reuse of mantissa datapath for multiple parallel lower precision calculations, robust ultra-low voltage circuits, and fine-grained clock gating enable nominal energy efficiency of 52GFLOPS/W (IEEE 32b single-precision, measured at 1.45GHz, 1.05V, 25°C) with a dense layout occupying 0.045mm² (Fig. 10.3.7) while achieving: (i) scalable performance up to 3.6GFLOPS (single-precision), 96mW measured at 1.2V; (ii) up to 4× higher throughput of 14.4GFLOPS with variable-precision, while maintaining single-precision accuracy; (iii) fast single-cycle precision reconfigurability; (iv) precision mode-dependent power consumption for up to 40% clock power reduction; (v) near-threshold single-precision operation measured at 300mV, 1.75MHz, 11μW; and, (vi) peak energy efficiency of 321GFLOPS/W (single-precision) and 1.2TFLOPS/W (6b precision) at 325mV, 25°C.
    Digest of Technical Papers - IEEE International Solid-State Circuits Conference 01/2012;
  • ABSTRACT: Advanced lighting computation is the key ingredient for rendering realistic images in high-throughput 3D graphics pipelines. It is the most performance and power-critical operation in programmable vertex and pixel shaders due to the large number of complex floating-point (FP) multiplications and exponentiations [1]. Performance and energy-efficiency of geometry rendering can be significantly improved by hardware acceleration of lighting computations, which is leveraged by vertex/pixel shader programs residing in the memory of a programmable 3D graphics engine [2] (Fig. 10.4.1). A single-cycle throughput lighting accelerator targeted for on-die acceleration of 3D graphics vertex and pixel shading in high-performance processors and mobile SoCs is fabricated in 32nm high-k metal-gate CMOS [3] (Fig. 10.4.1). Ambient, diffuse, and specular components of the Phong Illumination (PI) equation [4] are computed in parallel in the log domain with 4-cycle latency and 560mV-to-1.2V operation. A high-accuracy 5-segment piecewise linear (PWL) approximation-based log circuit (FPWL-L) with low Hamming weight coefficients, a 32×32b signed truncated specular multiplier, and a high-precision 4-segment PWL approximation-based anti-log circuit (FPWL-AL) enable accurate fixed-point log-domain computation of PI lighting. Five FP multiplications and one FP exponentiation are transformed to five fixed-point additions and one fixed-point multiplication, respectively, resulting in single-cycle lighting throughput of 2.05GVertices/s (measured at 1.05V, 25°C) in a compact area of 0.064mm² (Fig. 10.4.7) while achieving: (i) 47% reduction in critical path logic stages, (ii) 0.56% mean vertex lighting error compared to a single-precision FP computation, (iii) 354μW active leakage power measured at 1.05V, 25°C, (iv) scalable performance up to 2.22GHz, 232mW measured at 1.2V, and (v) peak energy efficiency of 56GVertices/s/W, measured at 560mV, 25°C.
    IEEE Journal of Solid-State Circuits 01/2012; 48(1):184-186. · 3.06 Impact Factor
  • ABSTRACT: Moore's Law will continue to provide an abundance of transistors for integration, limited only by energy consumption. Near-threshold voltage (NTV) operation has the potential to improve energy efficiency by an order of magnitude. We discuss design techniques necessary for reliable operation over a wide range of supply voltages, from nominal down to the subthreshold region. A system designed for NTV can dynamically select modes of operation, from high performance, to high energy efficiency, to the lowest power.
    01/2012;
  • ABSTRACT: A 4-way to 32-way reconfigurable 256b vector shifter with measured 2.3GHz operation consuming 41mW is fabricated in 0.9V, 22nm tri-gate CMOS. Byte-wise any-to-any permute-assisted skip for coarse-grained byte shifts, rotate-back shifter, reconfigurable mask bit generation/decoder and ultra-low voltage circuits enable 38% area reduction and 240mV-1.1V wide dynamic voltage range with a measured 8.2x higher energy efficiency at 260mV compared to nominal 0.9V operation.
    ESSCIRC (ESSCIRC), 2012 Proceedings of the; 01/2012
  • ABSTRACT: A 128-entry × 128b content addressable memory (CAM) design enables 145ps search operation in 1.0V, 32nm high-k metal-gate CMOS technology. With a high-speed 16b-wide dynamic AND match-line combined with a fully static search-line and swapped XOR CAM cell, simulations show a 49% reduction of search energy at iso-search delay of 145ps over an optimized high-performance conventional NOR-type CAM design, enabling 1.07fJ/bit/search operation. Scaling the supply voltage of the proposed CAM enables 0.3fJ/bit/search with 1.07ns search delay at 0.5V.
    ESSCIRC (ESSCIRC), 2011 Proceedings of the; 10/2011
  • ABSTRACT: This paper describes an on-die, reconfigurable AES encrypt/decrypt hardware accelerator fabricated in 45 nm CMOS, targeted for content-protection in high-performance microprocessors. 100% round computation in native GF(2⁴)² composite-field arithmetic, unified reconfigurable datapath for encrypt/decrypt, optimized ground & composite-field polynomials, integrated affine/bypass multiplexer circuits, fused Mix/InvMixColumn circuits and a folded ShiftRow datapath enable peak 2.2 Tbps/Watt AES-128 energy efficiency with a dense 2-round layout occupying 0.052 mm², while achieving: (i) 53/44/38 Gbps AES-128/192/256 performance, 125 mW, measured at 1.1 V, 50 °C, (ii) scalable AES-128 performance up to 66 Gbps, measured at 1.35 V, 50 °C, (iii) wide operating supply voltage range with robust subthreshold voltage performance of 800 Mbps, 409 μW, measured at 320 mV, 50 °C, (iv) 37% Sbox delay reduction and 25% area reduction with a compact Sbox layout occupying 759 μm², (v) 67% reduction in worst-case interconnect length and 33% reduction in ShiftRow wiring tracks, and (vi) 43% reduction in Mix/InvMixColumn area with no performance penalty.
    IEEE Journal of Solid-State Circuits 05/2011; · 3.06 Impact Factor
  • ABSTRACT: An all-digital True Random Number Generator is fabricated in 45nm CMOS with 2.4Gbps random bit throughput and total power consumption of 7mW. Two-step coarse/fine-grained tuning with a self-calibrating feedback loop enables robust operation in the presence of 20% process variation while providing immunity to run-time voltage and temperature fluctuations. The 100% digital design enables a compact layout occupying 4004μm² with measured entropy of 0.999965, and scalable operation down to 280mV, while passing all NIST RNG tests.
    VLSI Circuits (VLSIC), 2010 IEEE Symposium on; 07/2010
  • ABSTRACT: An on-die multi-core circuit-switched network achieves 2.64Tb/s throughput for an 8×8 2D mesh, consuming 4.73W in 45nm CMOS at 1.1V and 50°C. Pipelined circuit-switched transmission, circuit channel queue circuits and dual supplies enable up to 1.51Tb/s/W energy efficiency, with scalable streaming performance of 6.43Tb/s.
    Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2010 IEEE International; 03/2010
  • ABSTRACT: A 32 nm on-die fine-grained reconfigurable fabric for DSP/media accelerators is fabricated and occupies a 0.076 mm² die. The optimized hybrid arithmetic configurable logic blocks with self-decoded look-up tables, ultra-low voltage PVT-tolerant register file circuits and dual-supply operation help enable a 2.4 GHz nominal performance at 1.0 V and 320 mV-to-1.2 V dynamic voltage range. The peak energy efficiency is 2.6TOPS/W when measured at 340 mV and 50°C.
    Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2010 IEEE International; 03/2010
  • ABSTRACT: This paper describes a reconfigurable 4-way SIMD engine fabricated in 45 nm high-k/metal-gate CMOS, targeted for on-die acceleration of vector processing in power-constrained mobile microprocessors. The SIMD accelerator is reconfigured to perform 4-way 16b × 16b multiplies, 32b × 32b multiply, 4-way 16b additions, 2-way 32b additions or 72b addition with single-cycle throughput and wide supply voltage range of operation (1.3 V-230 mV). A reconfigurable 2 × 2 tile of signed 2's complement 16b multipliers, with conditional carry gating in the 72b sparse tree adder, dual-supplies for voltage hopping, and fine-grained power-gating enables peak energy efficiency of 494GOPS/W (measured at 300 mV, 50°C) with a dense layout occupying 0.081 mm² while achieving: (i) scalable performance up to 2.8 GHz, 278 mW measured at 1.3 V; (ii) fast single-cycle switching between any operating/idle mode; (iii) configuration-dependent power reduction of up to 41% in total power and 6.5× in active leakage power; (iv) 10× standby leakage reduction during idle mode; (v) deep subthreshold operation measured at 230 mV, 8.8 MHz, 87 μW; and (vi) compensation for up to 3× performance variation in ultra-low voltage mode.
    IEEE Journal of Solid-State Circuits 02/2010; · 3.06 Impact Factor
  • ABSTRACT: This paper describes a 64-entry × 32b 1-read, 1-write ported register file with measured 8.3GHz operation consuming 83mW, fabricated in 1.0V 32nm CMOS. Contention-free shared keeper circuits combined with variation tolerant dual-ended transmission gate write memory cells enable 300mV Vcc-min reduction and measured scalable near-threshold voltage operation to 340mV with energy efficiency of 550GOPS/W.
    01/2010;
  • ABSTRACT: A multi-mode Secure Hashing Algorithm (SHA) accelerator is fabricated in 45nm CMOS and occupies 0.0625mm² with 18Gbps throughput and total power consumption of 50mW. The reconfigurable hardware accelerator computes SHA-1/224/256/384/512 message-digest using unified SHA bit-slices and configurable compression circuits resulting in 40% area reduction and
    01/2010;
  • ABSTRACT: A Karatsuba-based 64b Galois field multiplier for on-die acceleration of public-key encryption is fabricated in 1.1V, 45nm CMOS and occupies 0.021mm². 2-level Karatsuba design using interleaved 32b multipliers and folded datapath organization results in single-cycle latency at 3GHz operation with total power consumption of 74mW and 32% area reduction over conventional multipliers, resulting in 3.2x speedup of Diffie-Hellman key exchange workloads.
    01/2010;
  • ABSTRACT: High-throughput parallel SIMD vector computations are the most performance and power-critical operations in multimedia, graphics and signal processing workloads. An array of SIMD vector processing engines delivers high-throughput short bit-width arithmetic operations on large data sets with orders of magnitude higher energy efficiencies vs. general-purpose cores. A reconfigurable 4-way SIMD engine targeted for on-die acceleration of vector processing in power-constrained mobile microprocessors is fabricated in 45 nm high-K/metal-gate CMOS.
    J. Solid-State Circuits. 01/2010; 45:95-102.
  • ABSTRACT: An on-die, reconfigurable AES encrypt/decrypt hardware accelerator is fabricated in 45nm CMOS, targeted for content-protection in high-performance microprocessors. Compared to conventional AES implementations, this design computes the entire AES round in native GF(2⁴)² composite-field with one-time GF(2⁸)-to-GF(2⁴)² mapping cost amortized over multiple AES iterations. This approach along with a fused Mix/InvMixColumns circuit and folded ShiftRow datapath results in 20% area savings and 67% reduction in worst-case interconnect length, enabling AES-128/192/256 ECB block throughput of 53/44/38Gbps, 125mW power measured at 1.1V, 50°C.
    01/2010;
  • ABSTRACT: This paper describes a motion estimation engine fabricated in 65 nm CMOS, targeted for special-purpose on-die acceleration of sum of absolute difference (SAD) computation in real-time video encoding workloads on power-constrained mobile microprocessors. Four-way speculative difference computation using dual 4:2 compressors, optimal reuse of sum XOR min-terms in static 4:2 compressor carry gates, distributed accumulation of input carries for efficient negation and robust ultra-low voltage optimized circuits enable peak SAD efficiency of 12.8 macro-block SADs/nJ within a dense layout occupying 0.089 mm² while achieving: (i) scalable performance up to 2.4 GHz, 82 mW measured at 1.4 V, 50°C, (ii) deep subthreshold operation measured at 230 mV while operating down to 4.3 MHz and consuming 14.4 μW, (iii) maximum energy efficiency of 411 GOPS/Watt by operating at 320 mV, 23 MHz and consuming 56 μW (9.6x higher efficiency than nominal 1.2 V operation), (iv) 20% higher energy efficiency for up-conversion of ultra-low voltage signals using a two-stage cascaded split-output level shifter, and (v) tolerance of up to ±2x process and temperature induced performance variation using supply voltage compensation of ±50 mV.
    IEEE Journal of Solid-State Circuits 02/2009; · 3.06 Impact Factor
  • IEEE International Solid-State Circuits Conference, ISSCC 2009, Digest of Technical Papers, San Francisco, CA, USA, 8-12 February, 2009; 01/2009
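
Illustration for the Karatsuba-based Galois field multiplier entry above: the abstract names a 2-level Karatsuba organization but gives no circuit detail, so the following is only a minimal software sketch of the general idea. It multiplies two 64b polynomials over GF(2) (carry-less multiplication), replacing each full-width product with three half-width products. The function names, the recursive split down to 16b leaf products, and the omission of any reduction polynomial are assumptions made for illustration, not the fabricated design.

```python
def clmul(a: int, b: int) -> int:
    """Schoolbook carry-less multiply: polynomial product over GF(2)."""
    result = 0
    while b:
        if b & 1:
            result ^= a  # addition in GF(2) is XOR
        a <<= 1
        b >>= 1
    return result


def karatsuba_clmul(a: int, b: int, width: int = 64, levels: int = 2) -> int:
    """Carry-less multiply using `levels` levels of Karatsuba splitting.

    Hypothetical helper: `width` is the operand width in bits and
    `levels` is how many times the operands are halved before falling
    back to the schoolbook multiplier.
    """
    if levels == 0 or width <= 1:
        return clmul(a, b)
    half = width // 2
    mask = (1 << half) - 1
    a_lo, a_hi = a & mask, a >> half
    b_lo, b_hi = b & mask, b >> half
    # Three half-width products instead of four.
    lo = karatsuba_clmul(a_lo, b_lo, half, levels - 1)
    hi = karatsuba_clmul(a_hi, b_hi, half, levels - 1)
    mid = karatsuba_clmul(a_lo ^ a_hi, b_lo ^ b_hi, half, levels - 1)
    # Over GF(2), subtraction is also XOR, so the cross term is mid ^ lo ^ hi.
    return (hi << (2 * half)) ^ ((mid ^ lo ^ hi) << half) ^ lo


if __name__ == "__main__":
    x = 0xFEDCBA9876543210
    y = 0x0123456789ABCDEF
    # The Karatsuba result should match the schoolbook reference.
    print(karatsuba_clmul(x, y) == clmul(x, y))
```

Because addition over GF(2) has no carries, the sums a_lo ^ a_hi and b_lo ^ b_hi stay within half the operand width, which is one reason the Karatsuba split maps well onto binary-field multipliers. A field multiplier would additionally reduce the double-width product modulo an irreducible polynomial; that step is left out here because the abstract does not specify one.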

Publication Stats

645 Citations
44.10 Total Impact Points

Institutions

  • 2006–2010
    • Intel
      Santa Clara, California, United States
  • 2008
    • University of Michigan
      • Department of Electrical Engineering and Computer Science (EECS)
      Ann Arbor, MI, United States
  • 2004
    • Linköping University
      • Department of Electrical Engineering (ISY)
      Linköping, Östergötland, Sweden