M. Anders

Intel, Santa Clara, California, United States

Publications (65) · 49.5 Total impact

  •
    ABSTRACT: This paper describes an on-die lightweight nanoAES hardware accelerator, fabricated in 22 nm tri-gate high-k/metal-gate CMOS, targeted for ultra-low power symmetric-key encryption and decryption on mobile SOCs. Compared to conventional 128 bit AES implementations, this design uses a single 8 bit Sbox circuit along with ShiftRows byte-order data processing to compute all AES rounds in native composite-field. This approach, along with a serial-accumulating MixColumns circuit, area-optimized encrypt and decrypt Galois-field polynomials and an integrated on-the-fly key generation circuit, results in a compact encrypt/decrypt layout occupying 2200/2736 μm² and lowest-reported gate count of 1947/2090 respectively, while achieving: (i) maximum operating frequency of 1.133 GHz and total power consumption of 13 mW with leakage component of 500 μW, measured at 0.9 V, 25 °C, (ii) nominal AES-128 encrypt/decrypt throughput of 432/671 Mbps respectively, with peak energy-efficiency of 289 Gbps/W measured at near-threshold operation of 430 mV (11× higher than previously reported implementations), (iii) encrypt/decrypt latencies of 336/216 cycles and total energy consumption of 3.9/2.5 nJ respectively, (iv) wide operating supply voltage range with robust sub-threshold voltage performance of 45 Mbps, 170 μW, measured at 340 mV, 25 °C and (v) first-reported Galois-field polynomial-based micro-architectural co-optimization, resulting in distinct area-optimized encrypt and decrypt polynomials with up to 9% area reduction at iso-performance.
    IEEE Journal of Solid-State Circuits 04/2015; 50(4):1-11. DOI:10.1109/JSSC.2014.2384039 · 3.01 Impact Factor
  •
    ABSTRACT: Energy-efficient networks-on-chip (NoCs) are key enablers for exa-scale computation by shifting power budget from communication toward computation. As core counts scale into the 100s, on-chip interconnect fabrics must support increasing heterogeneity and voltage/clock domains. Synchronous NoCs require either a single clock distributed globally or clock-crossing data FIFOs between clock domains [1]. A global clock requires costly full-chip margining and significant power and area for clock distribution, while synchronizing data FIFOs add power, performance, and area overhead per clock crossing. Source-synchronous NoCs mitigate these penalties by forwarding a local clock along with each packet, but still suffer from high data storage power due to packet switching. Circuit switching removes intra-route data storage, but suffers from low network utilization due to serialized channel setup and data transfer [2]. Hybrid packet/circuit switching parallelizes these operations for higher network utilization. A 16×16 mesh, 112b data, 256 voltage/clock domain NoC with source-synchronous operation, hybrid packet/circuit-switched flow control, and ultra-low-voltage optimizations is fabricated in 22nm tri-gate CMOS [3] to enable: i) 20.2Tb/s total throughput at 0.9V, 25°C, ii) a 2.7× increase in bisection bandwidth to 2.8Tb/s and 93% reduction in circuit-switched latency at 407ps/hop through source-synchronous operation, iii) a 62% latency improvement and 55% increase in energy efficiency to 7.0Tb/s/W through circuit switching, iv) a peak energy efficiency of 18.3Tb/s/W for near-threshold operation at 430mV, 25°C, and v) ultra-low-voltage operation down to 340mV with router power scaling to 363μW.
    2014 IEEE International Solid-State Circuits Conference (ISSCC); 02/2014
  •
    ABSTRACT: Physically unclonable function (PUF) circuits are low-cost cryptographic primitives used for generation of unique, stable and secure keys or chip IDs for device authentication and data security in high-performance microprocessors [1][2][3][7]. The volatile nature of PUFs provides a high level of security and tamper resistance against invasive probing attacks compared to conventional fuse-based key storage technologies [4]. A process-voltage-temperature (PVT) variation-tolerant all-digital PUF array targeted for on-die generation of 100% stable, device-specific, high-entropy keys is fabricated in 22nm tri-gate high-κ metal-gate CMOS technology [5], featuring: i) a hybrid delay/cross-coupled PUF circuit where interaction of 16 minimum-sized, variation-impacted transistors determines resolution dynamics, ii) a temporal majority voting (TMV) circuit to stabilize occasionally unstable bits, resulting in 53% reduction in instability, iii) burn-in hardening to reinforce manufacturing-time PUF bias, resulting in 22% reduction in bit-errors, iv) soft dark bits for run-time identification and sequestration of highly unstable bits during field operation, resulting in 78% lower bit-errors, v) 19× separation between inter- and intra-PUF Hamming distance, enabling die-specific keys, vi) autocorrelation factor≈0 and entropy=0.9997, while passing NIST randomness tests, vii) high tolerance to voltage and temperature variation with 82% reduction in average Hamming-distance using a 100-cycle dark bit window, viii) in-situ PUF hardening by leveraging directed NBTI aging to improve stability during field operation, and ix) ultra-low energy consumption of 0.19pJ/b with compact bitcell layout of 4.66μm² (Fig. 16.2.7a).
    2014 IEEE International Solid-State Circuits Conference (ISSCC); 02/2014
  •
    ABSTRACT: This paper describes an all-digital PVT-variation tolerant true-random number generator (TRNG), fabricated in 45 nm high-k/metal-gate CMOS, targeted for on-die entropy generation in high-performance microprocessors. The TRNG harvests differential thermal-noise at the diffusion nodes of a pre-charged cross-coupled inverter pair to resolve out of metastability, generating one random bit/cycle. A self-calibrating 2-step tuning mechanism using coarse-grained configurable inverters and fine-grained programmable clock delay generators, along with an entropy-tracking feedback loop, provides tolerance to 20% PVT variation-induced device mismatches, enabling lowest-reported energy-consumption of 2.9 pJ/bit with a dense layout occupying 4004 μm², while achieving: (i) 2.4 Gbps random bit throughput, 7 mW total power consumption with 0.7 mW leakage power component, measured at 1.1 V, 50°C, (ii) random bitstreams that pass all NIST RNG tests with raw entropy/bit measured up to 0.9999999993, (iii) good distribution of 1's with 4-bit entropy of 3.97996 and high-entropy pattern probability of 0.066, (iv) wide operating supply voltage range with robust sub-threshold voltage performance of 14 Mbps, 5.6 μW, measured at 280 mV, 50°C, (v) 12 fine-grained high-entropy settings for the TRNG to dither in during steady-state operation, (vi) <3% error while using an analytical ergodic Markov chain model for predicting pattern probabilities and (vii) 200x higher throughput and 9x higher energy-efficiency than previously reported implementations. Design modifications for robust operation in 22 nm high-volume manufacturing in the presence of 3σ process variations demonstrate scalability of the all-digital design to future technologies.
    IEEE Journal of Solid-State Circuits 11/2012; 47(11):2807-2821. DOI:10.1109/JSSC.2012.2217631 · 3.01 Impact Factor
  •
    ABSTRACT: Energy-efficient SIMD permutation operations are key for maximizing high-performance microprocessor vector datapath utilization in multimedia, graphics, and signal processing workloads [1–3]. A wide SIMD vector permutation engine is required to achieve high-throughput data rearrangement operations on large data sets, with scaled supply voltages to deliver high energy efficiency. An ultra-low-voltage reconfigurable 4-way to 32-way SIMD vector permutation engine consisting of a 32-entry × 256b 3-read/1-write ported register file with a 256b byte-wise any-to-any permute crossbar for 2-dimensional shuffle is fabricated in 22nm CMOS. The register file integrates a vertical shuffle across multiple entries into read/write operations, and includes clockless static reads with shared P/N dual-ended transmission gate (DETG) writes, improving register file VMIN by 250mV across PVT variations with a wide dynamic operating range of 280mV-1.1V. The permute crossbar implements an interleaved folded byte-wise multiplexer layout forming an any-to-any fully-connected tree to perform a horizontal shuffle with permute accumulate circuits, and includes vector flip-flops, stacked min-delay buffers, shared gates to average min-sized transistor variation, and ultra-low-voltage split-output (ULVS) level shifters improving logic VMIN by 150mV, while enabling peak energy efficiency of 585GOPS/W measured at 260mV, 50°C. The permutation engine occupies a dense layout of 0.048mm² (Fig. 10.1.7) while achieving: (i) nominal register file performance of 1.8GHz, 106mW measured at 0.9V, 50°C; (ii) robust register file functionality measured down to 280mV (subthreshold) with peak energy efficiency of 154GOPS/W; (iii) scalable permute crossbar performance of 2.9GHz, 69mW measured at 1.1V, 50°C with deep sub-threshold operation at 240mV, 10MHz consuming 19μW; and (iv) a 64b 4×4 matrix transpose algorithm with 53% energy savings and 42% improved peak throughput of 263Gbps measured at 1.8GHz, 0.9V.
    IEEE Journal of Solid-State Circuits 02/2012; 48(1):178-180. DOI:10.1109/ISSCC.2012.6176966 · 3.01 Impact Factor
  •
    ABSTRACT: High-throughput floating-point computations are key building blocks of 3D graphics, signal processing and high-performance computing workloads [1,2]. Higher floating-point precisions offer improved accuracy at the expense of performance and energy efficiency, with variable-precision floating-point circuits providing run-time precision selection [3]. Real-time certainty tracking enables variable-precision circuits not only to operate at the higher energy efficiency of low-precision datapaths, but also to preserve high-precision accuracy. A variable-precision floating-point unit that performs fused multiply-adds (FMA) with single-cycle throughput while supporting operation in either 1-way single-precision (24b mantissa), 2-way 12b precision or 4-way 6b precision modes is fabricated in 32nm High-k/Metal-gate CMOS [4]. Simultaneous floating-point certainty tracking, preshifted addends, a combined rounding and negation incrementer, efficient reuse of mantissa datapath for multiple parallel lower precision calculations, robust ultra-low voltage circuits, and fine-grained clock gating enable nominal energy efficiency of 52GFLOPS/W (IEEE 32b single-precision, measured at 1.45GHz, 1.05V, 25°C) with a dense layout occupying 0.045mm² (Fig. 10.3.7) while achieving: (i) scalable performance up to 3.6GFLOPS (single-precision), 96mW measured at 1.2V; (ii) up to 4× higher throughput of 14.4GFLOPS with variable-precision, while maintaining single-precision accuracy; (iii) fast single-cycle precision reconfigurability; (iv) precision mode-dependent power consumption for up to 40% clock power reduction; (v) near-threshold single-precision operation measured at 300mV, 1.75MHz, 11μW; and, (vi) peak energy efficiency of 321GFLOPS/W (single-precision) and 1.2TFLOPS/W (6b precision) at 325mV, 25°C.
    Digest of Technical Papers - IEEE International Solid-State Circuits Conference 01/2012; 55:182-184. DOI:10.1109/ISSCC.2012.6176987
  •
    ABSTRACT: Moore's Law will continue to provide an abundance of transistors for integration, limited only by energy consumption. Near-threshold voltage (NTV) operation has the potential to improve energy efficiency by an order of magnitude. We discuss design techniques necessary for reliable operation over a wide range of supply voltages, from nominal down to the subthreshold region. A system designed for NTV can dynamically select modes of operation, from high performance, to high energy efficiency, to the lowest power.
    01/2012; DOI:10.1145/2228360.2228572
  •
    ABSTRACT: A 128-entry × 152b 3-read/2-write ported multi-precision floating-point register file/shuffler with measured 2.8GHz operation is fabricated in 1.05V, 32nm CMOS. Single-precision (24b-mantissa), 2-way 12b or 4-way 6b reduced mantissa precision modes, certainty tracking bits, mode-dependent gating, area-efficient windowing using 1R/1W cells, and ultra-low-voltage read/write circuits enable 350mV-1.2V wide dynamic voltage range with measured peak energy-efficiency of 751GOPS/W at 400mV, 4-way 6b-mode (22.3× higher than 1.05V single-precision mode) and 19% area reduction over single-precision 3R/2W implementations.
    VLSI Circuits (VLSIC), 2012 Symposium on; 01/2012
  • A. Agarwal · S. Hsu · S. Mathew · M. Anders · H. Kaul · F. Sheikh · R. Krishnamurthy ·
    ABSTRACT: A 4-way to 32-way reconfigurable 256b vector shifter with measured 2.3GHz operation consuming 41mW is fabricated in 0.9V, 22nm tri-gate CMOS. Byte-wise any-to-any permute-assisted skip for coarse-grained byte shifts, rotate-back shifter, reconfigurable mask bit generation/decoder and ultra-low voltage circuits enable 38% area reduction and 240mV-1.1V wide dynamic voltage range with a measured 8.2x higher energy efficiency at 260mV compared to nominal 0.9V operation.
    ESSCIRC (ESSCIRC), 2012 Proceedings of the; 01/2012
  •
    ABSTRACT: Advanced lighting computation is the key ingredient for rendering realistic images in high-throughput 3D graphics pipelines. It is the most performance and power-critical operation in programmable vertex and pixel shaders due to the large number of complex floating-point (FP) multiplications and exponentiations [1]. Performance and energy-efficiency of geometry rendering can be significantly improved by hardware acceleration of lighting computations, which is leveraged by vertex/pixel shader programs residing in the memory of a programmable 3D graphics engine [2] (Fig. 10.4.1). A single-cycle throughput lighting accelerator targeted for on-die acceleration of 3D graphics vertex and pixel shading in high-performance processors and mobile SoCs is fabricated in 32nm high-k metal-gate CMOS [3] (Fig. 10.4.1). Ambient, diffuse, and specular components of the Phong Illumination (PI) equation [4] are computed in parallel in the log domain with 4-cycle latency and 560mV-to-1.2V operation. A high-accuracy 5-segment piecewise linear (PWL) approximation-based log circuit (FPWL-L) with low Hamming weight coefficients, a 32×32b signed truncated specular multiplier, and a high-precision 4-segment PWL approximation-based anti-log circuit (FPWL-AL) enable accurate fixed-point log-domain computation of PI lighting. Five FP multiplications and one FP exponentiation are transformed to five fixed-point additions and one fixed-point multiplication, respectively, resulting in single-cycle lighting throughput of 2.05GVertices/s (measured at 1.05V, 25°C) in a compact area of 0.064mm² (Fig. 10.4.7) while achieving: (i) 47% reduction in critical path logic stages, (ii) 0.56% mean vertex lighting error compared to a single-precision FP computation, (iii) 354μW active leakage power measured at 1.05V, 25°C, (iv) scalable performance up to 2.22GHz, 232mW measured at 1.2V, and (v) peak energy efficiency of 56GVertices/s/W, measured at 560mV, 25°C.
    IEEE Journal of Solid-State Circuits 01/2012; 48(1):184-186. DOI:10.1109/ISSCC.2012.6176967 · 3.01 Impact Factor
  •
    ABSTRACT: A 128-entry × 128b content addressable memory (CAM) design enables 145ps search operation in 1.0V, 32nm high-k metal-gate CMOS technology. A high-speed 16b wide dynamic AND match-line, combined with a fully static search-line and swapped XOR CAM cell simulations show a 49% reduction of search energy at iso-search delay of 145ps over an optimized high-performance conventional NOR-type CAM design, enabling 1.07fJ/bit/search operation. Scaling the supply voltage of the proposed CAM enables 0.3fJ/bit/search with 1.07ns search delay at 0.5V.
    ESSCIRC (ESSCIRC), 2011 Proceedings of the; 10/2011
  •
    ABSTRACT: This paper describes an on-die, reconfigurable AES encrypt/decrypt hardware accelerator fabricated in 45 nm CMOS, targeted for content-protection in high-performance microprocessors. 100% round computation in native GF(2⁴)² composite-field arithmetic, unified reconfigurable datapath for encrypt/decrypt, optimized ground & composite-field polynomials, integrated affine/bypass multiplexer circuits, fused Mix/InvMixColumn circuits and a folded ShiftRow datapath enable peak 2.2 Tbps/Watt AES-128 energy efficiency with a dense 2-round layout occupying 0.052 mm², while achieving: (i) 53/44/38 Gbps AES-128/192/256 performance, 125 mW, measured at 1.1 V, 50 °C, (ii) scalable AES-128 performance up to 66 Gbps, measured at 1.35 V, 50 °C, (iii) wide operating supply voltage range with robust subthreshold voltage performance of 800 Mbps, 409 μW, measured at 320 mV, 50 °C, (iv) 37% Sbox delay reduction and 25% area reduction with a compact Sbox layout occupying 759 μm², (v) 67% reduction in worst-case interconnect length and 33% reduction in ShiftRow wiring tracks and (vi) 43% reduction in Mix/InvMixColumn area with no performance penalty.
    IEEE Journal of Solid-State Circuits 05/2011; 46(4):767-776. DOI:10.1109/JSSC.2011.2108131 · 3.01 Impact Factor
  •
    ABSTRACT: A network-on-chip (NoC) is an interconnect fabric that connects sub-system blocks on a chip. The NoC should provide high bandwidth and low latency, consume little energy, and be compact. However, these requirements are at odds with one another and require tradeoffs at all levels. In this chapter, we discuss issues and challenges for future NoCs with demands for high bandwidth and low energy. Next, we present details of how coupling packet-switched arbitration with circuit-switched data transfer can achieve these goals. In this hybrid network, packet-switched arbitration is used to reserve future circuit-switched channels for the data transfer, eliminating the performance bottlenecks associated with pure circuit-switched networks while maintaining their power advantage. Furthermore, proximity-based data streaming increases network throughput and improves energy efficiency. Measurements of this NoC in 45 nm CMOS are described to analyze design tradeoffs.
  •
    ABSTRACT: An all-digital True Random Number Generator is fabricated in 45nm CMOS with 2.4Gbps random bit throughput and total power consumption of 7mW. Two-step coarse/fine-grained tuning with a self-calibrating feedback loop enables robust operation in the presence of 20% process variation while providing immunity to run-time voltage and temperature fluctuations. The 100% digital design enables a compact layout occupying 4004μm² with measured entropy of 0.999965, and scalable operation down to 280mV, while passing all NIST RNG tests.
    VLSI Circuits (VLSIC), 2010 IEEE Symposium on; 07/2010
  •
    ABSTRACT: Interconnect networks, for high-bandwidth energy-efficient core-to-core communication, are key to enabling future tera-scale multi-core processors. Packet-switched 2D mesh networks provide efficient interconnect utilization, low latencies and high throughputs, but suffer from low energy efficiency due to data storage during routing [1-2]. Circuit-switched data transfer achieves both high bandwidth and energy efficiency by eliminating intra-route data storage [3]. An 8x8 mesh circuit-switched network-on-chip, consisting of arbitration logic for 512b data width with 1b data interconnect, is fabricated in 45nm high-κ metal-gate CMOS [4]. Scaling data width measurements to 512b, the circuits achieve 560Gb/s/W energy efficiency, 4.1Tb/s bisection bandwidth, and 11ns diagonal corner-to-corner fall-through latency. Reconfigurable router circuits allow dynamic optimization of both circuit-switched channel-queue depth and the ratio of arbitration vs. data transfer rates based on traffic patterns. Pipelined arbitration phases with packet-switched channel allocation circuits, dual-supply optimization of data transfer power and proximity-based streaming circuits enable: i) 2.64Tb/s maximum throughput for random 512b transmissions measured at 1.1V, 50°C, ii) 87% increased throughput from channel queuing, iii) 6.43Tb/s scalable performance with streaming traffic at energy efficiency of 0.91Tb/s/W, iv) 4.73W total network power at 74mW per router with <17% arbitration overhead, v) traffic-dependent network power consumption scalable down to 1.35W at 21mW per router, vi) 28% power savings through dual-supply optimization at iso-throughput, and vii) low-voltage energy efficiency of 1.51Tb/s/W measured at 550mV, 50°C.
    Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2010 IEEE International; 03/2010
  •
    ABSTRACT: A 32 nm on-die fine-grained reconfigurable fabric for DSP/media accelerators is fabricated and occupies a 0.076 mm² die. The optimized hybrid arithmetic configurable logic blocks with self-decoded look-up tables, ultra-low voltage PVT-tolerant register file circuits and dual-supply operation help enable a 2.4 GHz nominal performance at 1.0 V and 320 mV-to-1.2 V dynamic voltage range. The peak energy efficiency is 2.6TOPS/W when measured at 340 mV and 50°C.
    Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2010 IEEE International; 03/2010
  •
    ABSTRACT: This paper describes a reconfigurable 4-way SIMD engine fabricated in 45 nm high-k/metal-gate CMOS, targeted for on-die acceleration of vector processing in power-constrained mobile microprocessors. The SIMD accelerator is reconfigured to perform 4-way 16b × 16b multiplies, 32b × 32b multiply, 4-way 16b additions, 2-way 32b additions or 72b addition with single-cycle throughput and wide supply voltage range of operation (1.3 V-230 mV). A reconfigurable 2 × 2 tile of signed 2's complement 16b multipliers, with conditional carry gating in the 72b sparse tree adder, dual-supplies for voltage hopping, and fine-grained power-gating enables peak energy efficiency of 494GOPS/W (measured at 300 mV, 50°C) with a dense layout occupying 0.081 mm² while achieving: (i) scalable performance up to 2.8 GHz, 278 mW measured at 1.3 V; (ii) fast single-cycle switching between any operating/idle mode; (iii) configuration-dependent power reduction of up to 41% in total power and 6.5× in active leakage power; (iv) 10× standby leakage reduction during idle mode; (v) deep subthreshold operation measured at 230 mV, 8.8 MHz, 87 μW; and (vi) compensation for up to 3× performance variation in ultra-low voltage mode.
    IEEE Journal of Solid-State Circuits 02/2010; 45(1):95-102. DOI:10.1109/JSSC.2009.2031813 · 3.01 Impact Factor
  •
    ABSTRACT: An on-die, reconfigurable AES encrypt/decrypt hardware accelerator is fabricated in 45nm CMOS, targeted for content-protection in high-performance microprocessors. Compared to conventional AES implementations, this design computes the entire AES round in native GF(2⁴)² composite-field with one-time GF(2⁸)-to-GF(2⁴)² mapping cost amortized over multiple AES iterations. This approach along with a fused Mix/InvMixColumns circuit and folded ShiftRow datapath results in 20% area savings and 67% reduction in worst-case interconnect length, enabling AES-128/192/256 ECB block throughput of 53/44/38Gbps, 125mW power measured at 1.1V, 50°C.
  •
    ABSTRACT: High-throughput parallel SIMD vector computations are the most performance and power-critical operations in multimedia, graphics and signal processing workloads. An array of SIMD vector processing engines delivers high-throughput short bit-width arithmetic operations on large data sets with orders of magnitude higher energy efficiencies vs. general-purpose cores. A reconfigurable 4-way SIMD engine targeted for on-die acceleration of vector processing in power-constrained mobile microprocessors is fabricated in 45 nm high-K/metal-gate CMOS.
  •
    ABSTRACT: A Karatsuba-based 64b Galois field multiplier for on-die acceleration of public-key encryption is fabricated in 1.1V, 45nm CMOS and occupies 0.021mm². A 2-level Karatsuba design using interleaved 32b multipliers and folded datapath organization results in single-cycle latency at 3GHz operation with total power consumption of 74mW and 32% area reduction over conventional multipliers, resulting in 3.2x speedup of Diffie-Hellman key exchange workloads.
    01/2010; DOI:10.1109/ESSCIRC.2010.5619895

Publication Stats

913 Citations
49.50 Total Impact Points


  • 2004-2014
    • Intel
      Santa Clara, California, United States
  • 2008
    • University of Michigan
      • Department of Electrical Engineering and Computer Science (EECS)
      Ann Arbor, Michigan, United States
  • 2005
    • University of Massachusetts Amherst
      Amherst Center, Massachusetts, United States