Ken Mai's research while affiliated with Carnegie Mellon University and other places

Publications (86)

Article
Hardware security is a growing concern given the current globalized nature of the integrated circuit (IC) ecosystem. With actors spread across different entities (companies, countries, etc.), it becomes challenging to guarantee that the fabricated hardware was not copied, reverse engineered, maliciously modified, or overproduced. The untrusted foun...
Article
The configuration bitstream is a persistent source of vulnerability in FPGA designs, and thus FPGA vendors have implemented bitstream encryption. A number of attacks on these countermeasures have been demonstrated including direct probing of the configuration storage cells, side-channel attacks on the decryption blocks, and attacks on the scan chai...
Article
SRAM-based FPGAs are widely deployed in space and high-radiation environments, but they exhibit vulnerability to radiation effects. Designs can be hardened against radiation effects with design-side countermeasures such as redundancy, scrubbing, and partitioning. Through neutron tests, we investigate the impact of these design-side countermeasures...
Preprint
This paper summarizes our work on experimentally analyzing, exploiting, and addressing vulnerabilities in multi-level cell NAND flash memory programming, which was published in the industrial session of HPCA 2017, and examines the work's significance and future potential. Modern NAND flash memory chips use multi-level cells (MLC), which store two b...
Preprint
This paper summarizes our work on experimentally characterizing, mitigating, and recovering read disturb errors in multi-level cell (MLC) NAND flash memory, which was published in DSN 2015, and examines the work's significance and future potential. NAND flash memory reliability continues to degrade as the memory is scaled down and more bits are pro...
Preprint
This paper summarizes our work on experimentally characterizing, mitigating, and recovering data retention errors in multi-level cell (MLC) NAND flash memory, which was published in HPCA 2015, and examines the work's significance and future potential. Retention errors, caused by charge leakage over time, are the dominant source of flash memory erro...
Conference Paper
The crossbar is a popular topology for on-chip networks that offers non-blocking connectivity and uniform latency. However, as the number of nodes increases, crossbars typically scale poorly in area, power, and latency/throughput. To better understand the design space, we have developed an on-chip crossbar modeling tool based on analytical models c...
Article
Hardware true random number generators are an essential functional block in many secure systems. Current designs that use bi-stable elements balanced in the metastable region are capable of both high randomness and high bitrate. However, these designs require extensive support circuits to maintain balance in the metastable region, complex built-in...
Article
Differential power analysis (DPA) has been shown to be a highly effective and easy-to-mount side-channel attack. One effective method of increasing DPA resistance is to use three-phase dual-rail pre-charge logic (TDPL), but this type of logic is vulnerable to manipulation of the clock generation/distribution hardware. If an attacker can slow down t...
Article
Computer architects are increasingly interested in evaluating their ideas at the register-transfer level (RTL) to gain more precise insights on the key characteristics (frequency, area, power) of a micro/architectural design proposal. However, the RTL synthesis process is notoriously tedious, slow, and error-prone, and is often outside the area of ex...
Article
Most existing packet-based on-chip networks assume routers have buffers to buffer packets at times of contention. Recently, deflection-based bufferless routing algorithms have been proposed as an alternative design to reduce the area, power, and complexity disadvantages associated with buffering in routers. While bufferless routing shows significan...
Article
Retention errors, caused by charge leakage over time, are the dominant source of flash memory errors. Understanding, characterizing, and reducing retention errors can significantly improve NAND flash memory reliability and endurance. In this paper, we first characterize, with real 2y-nm MLC NAND flash chips, how the threshold voltage distribution o...
Article
Despite best efforts, integrated systems are 'born' (manufactured) with a unique 'personality' that stems from our inability to precisely fabricate their underlying circuits, and create software a priori for controlling the resulting uncertainty. It is possible to use sophisticated test methods to identify the best-performing systems but this would...
Article
Continued scaling of NAND flash memory to smaller process technology nodes decreases its reliability, necessitating more sophisticated mechanisms to correctly read stored data values. To distinguish between different potential stored values, conventional techniques to read data from flash memory employ a single set of reference voltage values, whic...
Article
Continued scaling of NAND flash memory to smaller process technology nodes decreases its reliability, necessitating more sophisticated mechanisms to correctly read stored data values. To distinguish between different potential stored values, conventional techniques to read data from flash memory employ a single set of reference voltage values, whic...
Conference Paper
Physical unclonable functions (PUFs) are primitives that generate high-entropy, tamper resistant bits for use in secure systems. For applications such as cryptographic key generation, the PUF response bits must be highly reliable, consistent across multiple evaluations under voltage and temperature variations. Conventionally, error correcting codes...
Conference Paper
The embedded memory hierarchy of microprocessors and systems-on-a-chip plays a critical role in the overall system performance, area, power, resilience, and yield. However, as process technologies scale down to nanometer-regime geometries, the design and implementation of the embedded memory system are becoming increasingly difficult due to a numbe...
Conference Paper
As NAND flash memory continues to scale down to smaller process technology nodes, its reliability and endurance are degrading. One important source of reduced reliability is the phenomenon of program interference: when a flash cell is programmed to a value, the programming operation affects the threshold voltage of not only that cell, but also the...
Conference Paper
Physically Unclonable Functions (PUFs) are structures with many applications, including device authentication, identification, and cryptographic key generation. In this paper we propose a new PUF, called SCAN-PUF, based on scan-chain power-up states. We argue that scan chains have multiple characteristics that make them uniquely suited as a low-cos...
Conference Paper
Achieving high reliability across environmental variations and over aging in physical unclonable functions (PUFs) remains a challenge for PUF designers. The conventional method to improve PUF reliability is to use powerful error correction codes (ECC) to correct the errors in the raw response from the PUF core. Unfortunately, these ECC blocks gener...
Conference Paper
With continued scaling of NAND flash memory process technology and multiple bits programmed per cell, NAND flash reliability and endurance are degrading. Understanding, characterizing, and modeling the distribution of the threshold voltages across different cells in a modern multi-level cell (MLC) flash memory can enable the design of more effectiv...
Chapter
Side-channel attacks bypass the theoretical strength of cryptographic algorithms by exploiting weaknesses in the cryptographic system hardware implementation via nonprimary, side-channel inputs and outputs. Commonly exploited side-channel outputs include: power consumption, electromagnetic (EM) emissions, light, timing, and sound (Fig. 8.1). Common...
Conference Paper
While SRAM and DRAM are often assumed to have very small data retention times (bits are lost immediately at power-down) and no data remanence (stored bits leave no traces even after a prolonged storage period), under some conditions these assumptions do not hold. Both retention and remanence have been exploited by malicious attackers to compromise...
Conference Paper
Physical Unclonable Functions (PUFs) are security primitives used in a number of security applications like authentication, identification, and secure key generation. PUF implementations are evaluated on their security characteristics (uniqueness, randomness, and reliability), as well as conventional VLSI design metrics (area, power, and performanc...
Conference Paper
With the continued scaling of NAND flash and multi-level cell technology, flash-based storage has gained widespread use in systems ranging from mobile platforms to enterprise servers. However, the robustness of NAND flash cells is an increasing concern, especially at nanometer-regime process geometries. NAND flash memory bit error rate increases ex...
Article
We demonstrate the efficacy and associated costs of three reliability-enhancing techniques for bi-stable PUF designs (SRAM and sense amplifier-based): directed accelerated aging, multiple evaluations, and activation control. Measured results from a 65 nm bulk CMOS full-custom PUF testchip demonstrate that these techniques are able to reduce the perc...
Article
As NAND flash memory manufacturers scale down to smaller process technology nodes and store more bits per cell, reliability and endurance of flash memory reduce. Wear-leveling and error correction coding can improve both reliability and endurance, but finding effective algorithms requires a strong understanding of flash memory error patterns. To en...
Conference Paper
The CoRAM memory architecture for FPGA-based computing augments traditional reconfigurable fabric with a natural and effective way for applications to interact with off-chip memory and I/O. The two central tenets of the CoRAM memory architecture are (1) the deliberate separation of concerns between computation versus data marshalling and (2) the us...
Conference Paper
FPGAs have been used in many applications to achieve orders-of-magnitude improvement in absolute performance and energy efficiency relative to conventional microprocessors. Despite their promise in both processing performance and efficiency, FPGAs have not yet gained widespread acceptance as mainstream computing devices. A fundamental obstacle to F...
Conference Paper
NAND flash memory has been widely used for data storage due to its high density, high throughput, low cost, and low power. However, as flash memory manufacturers scale to smaller process technologies and store more bits per cell, the reliability and endurance of flash memory are decreasing. Wear-leveling and error correction coding can significantl...
Article
We present an analysis of the extendability of traditional perpendicular recording based on micromagnetic recording simulation and channel cost/performance analysis. The analysis indicates that head field gradient is the single most important factor in realizing densities above 1.0 Tb/in². The conclusion of this analysis is that 1.5 Tb/in² may be p...
Conference Paper
NAND Flash memory has been widely used for data storage due to its high density, high throughput, low cost, and low power. However, as the storage cells become smaller and with more bits programmed per cell, they are expected to suffer from reduced reliability and limited endurance. Wear-leveling and signal processing can significantly improve both...
Article
Functional full-system simulators are powerful and versatile research tools for accelerating architectural exploration and advanced software development. Their main shortcoming is limited throughput when simulating large multiprocessor systems with hundreds or thousands of processors or when instrumentation is introduced. We propose the PROTOFLEX s...
Conference Paper
To extend the exponential performance scaling of future chip multiprocessors, improving energy efficiency has become a first-class priority. Single-chip heterogeneous computing has the potential to achieve greater energy efficiency by combining traditional processors with unconventional cores (U-cores) such as custom logic, FPGAs, or GPGPUs. Althou...
Conference Paper
In the AES algorithm, the Substitution Box (S-Box) often dominates the area and delay of implementations. The S-Box performs a byte-wise substitution on the data based on an established code book, and most AES algorithm implementations use a large complex logic block consisting mainly of XORs to implement the S-Box. Direct implementation of the S-B...
Conference Paper
Physically Unclonable Functions (PUFs) implement die specific random functions that offer a promising mechanism in various security applications. Stability or reliability of a PUF response is a key concern, especially when the IC containing the PUF is subjected to severe environmental variations. In cryptographic applications, errors in response bi...
Conference Paper
Power analysis attacks are a common and effective method of defeating cryptographic systems. Many power-analysis-resistant digital circuit techniques have been previously proposed, leaving the circuit designer a myriad of choices without a simple way to compare and contrast the strengths and weaknesses of each technique. In this paper, we compare f...
Conference Paper
SRAM design in scaled technologies requires knowledge of phenomena at the process, circuit, and architecture levels. Decisions made at various levels of the design hierarchy affect the global figures of merit (FoMs) of an SRAM, such as performance, power, area, and yield. However, the lack of a quick mechanism to understand the impact of changes at...
Conference Paper
A designer's intent and knowledge about the critical issues and trade-offs underlying a custom circuit design are implicit in the simulations she sets up for design creation and verification. However, this knowledge is tightly conjoined with technology-specific features and decoupled from the final schematic in traditional design flows. As a result...
Article
Low-density parity-check (LDPC) codes offer a promising error correction approach for high-density magnetic recording systems due to their near-Shannon limit error-correcting performance. However, evaluation of LDPC codes at the extremely low bit error rates (BER) required by hard disk drive systems, typically around 10^-12 to 10^-15...
Conference Paper
Device variability in modern processes has become a major concern in SRAM design, degrading performance, yield, power, and reliability. While low-swing bitlines can reduce power consumption and increase performance, offset in the sense amplifiers due to device variability hinders the scalability of this technique. A promising method for decreasing...
Article
The emergence of multi-core architectures—driven by continued technology scaling—has led to concerns about increasing soft- and hard-error rates in commodity designs. Because modern chip designs consist of multiple high-speed clock domains, conventional lockstepped redundant execution is no longer practical. Recent work suggests an asynchronous app...
Article
Functional full-system simulators are powerful and versatile research tools for accelerating architectural exploration and advanced software development. Their main shortcoming is limited throughput when simulating large multiprocessor systems with hundreds or thousands of processors or when instrumentation is introduced. We propose the ProtoFlex s...
Conference Paper
A configurable replica bitline (cRBL) technique for controlling sense-amplifier enable (SAE) timing for small-swing bitline SRAMs is described. Post-silicon selection of a subset of replica bitline driver cells from a statistically designed pool of cells facilitates precise SAE timing. An exponential reduction in timing variation is enabled by stat...
Conference Paper
Transistor sizing to control random mismatch is investigated. Input offset voltages of 65 nm bulk CMOS SRAM sense amplifiers are measured to analyze NMOS and PMOS threshold voltage (Vtn, Vtp) variation effects and compare them with statistical models and Pelgrom model predictions. A linear statistical response surface model (RSM) relating input offs...
Conference Paper
Differential power analysis (DPA) has been shown to be an effective attack on cryptographic systems capable of revealing secret data by measuring power consumption. DPA-resistant circuits currently incur severe penalties in terms of performance, area, and power, as much as 4× in each. Additionally, most are dual-rail logic families, which can...
Article
Well-designed circuits are one key "insulating" layer between the increasingly unruly behavior of scaled complementary metal-oxide-semiconductor devices and the systems we seek to construct from them. As we move forward into the nanoscale regime, circuit design is burdened to "hide" more of the problems intrinsic to deeply scaled de...
Conference Paper
In deep sub-micron ICs, growing amounts of on-die memory and scaling effects make embedded memories increasingly vulnerable to reliability and yield problems. As scaling progresses, soft and hard errors in the memory system will increase and single error events are more likely to cause large-scale multi-bit errors. However, conventional memory pr...
Article
Functional full-system simulators are powerful and versatile research tools for accelerating architectural exploration and advanced software development. Their main shortcoming is limited throughput when simulating systems with hundreds of processors or more. To overcome this bottleneck, we propose the PROTOFLEX simulation architecture, which uses...
Conference Paper
PROTOFLEX is an FPGA-accelerated hybrid simulation/emulation platform designed to support large-scale multiprocessor hardware and software research. Unlike prior attempts at FPGA multiprocessor system emulators, PROTOFLEX emulates full-system fidelity, i.e., it runs stock commercial operating systems with I/O support. This is accomplished without undue...
Article
After years of sustained focus on uniprocessor performance, the "power wall" has started the microprocessor hardware and software industry down the multiprocessing (MP) path. This transition raises questions on how to study and design future MP architectures that could incorporate hundreds or even thousands of processors. To compoun...
Article
In deep sub-micron ICs, growing amounts of on-die memory and scaling effects make embedded memories increasingly vulnerable to reliability and yield problems. As scaling progresses, soft and hard errors in the memory system will increase and single error events are more likely to cause large-scale multi-bit errors. However, conventional memory pr...
Article
This paper presents the architecture and circuit techniques for a reconfigurable SRAM building block. The memory block can emulate many memory structures including a cache tag or data array, a FIFO, and a simple scratchpad memory. We choose the block size based on the optimal partition size for large SRAM structures, use self-resetting and replica...
Article
With single processor systems running into instruction-level parallelism (ILP) limits and fundamental VLSI constraints, multiprocessor chips provide a realistic path towards scalable performance by allowing one to take advantage of thread-level (TLP) and data-level parallelism (DLP) in emerging applications. Nevertheless, parallel architectures are...
Article
This paper describes the architecture and circuits of a reconfigurable memory block, called a mat, for use in such a memory system, whose architecture is more fully described in [1]. The primary design challenge of this system is to implement considerable flexibility in the memories while minimizing the hardware overhead. Carefully choosing the rec...
Conference Paper
A 2 kB reconfigurable SRAM block, using self-timed, pulse-mode circuits and capable of emulating a portion of a cache or a streaming FIFO, is realized in a 1.8 V, 0.18 μm CMOS process and operates at 1.1 GHz (10 FO4 cycle). The additional logic needed for reconfigurability consumes 26% of the total power and 32% of the total area.
Conference Paper
We present circuits for a high-efficiency low-swing interconnect scheme suitable for the Smart Memories reconfigurable architecture. By using a separate supply, global clocking, and differential signaling, we reduce design complexity; and by using overdrive circuits, equalization techniques, and sense-amplifiers we retain high performance. A testch...
Conference Paper
We update prior wire scaling studies with data from the 2001 and 2002 ITRS roadmaps, extending out to the 13 nm node. Combining this data with more sophisticated wire models, over nine generations we see both local and global wires degrading relative to gates, by one and three orders of magnitude respectively. However, using repeaters for global wi...
Article
We present circuits for a high-efficiency low-swing interconnect scheme suitable for the Smart Memories reconfigurable architecture. By using a separate supply, global clocking, and differential signaling, we reduce design complexity; and by using overdrive circuits, equalization techniques, and sense-amplifiers we retain high performance. A testchi...
Article
Trends in VLSI technology scaling demand that future computing devices be narrowly focused to achieve high performance and high efficiency, yet also target the high volumes and low costs of widely applicable general purpose designs. To address these conflicting requirements, we propose a modular reconfigurable architecture called Smart Memories, ta...
Conference Paper
Trends in VLSI technology scaling demand that future computing devices be narrowly focused to achieve high performance and high efficiency, yet also target the high volumes and low costs of widely applicable general purpose designs. To address these conflicting requirements, we propose a modular reconfigurable architecture called Smart Memories, ta...
Conference Paper
Interconnect scaling to deep submicron processes presents many challenges to today's CAD flows. A recent analysis by D. Sylvester and K. Keutzer (1998) examined the behavior of average length wires under scaling, and controversially concluded that current CAD tools are adequate for future module-level designs. We show that average length wire scali...
Conference Paper
Displaying the real-time behavior of critical signals on VLSI chips is difficult and can require expensive test equipment. We present a simple sampling technique to display the analog waveforms of high bandwidth on-chip signals on a laboratory oscilloscope. It is based on the subsampling of periodic signals. This circuit was used to verify the oper...