ABSTRACT: We prove the correctness of a recently-proposed cache coherence protocol,
Tardis, which is simple, yet scalable to high processor counts, because it only
requires O(log N) storage per cacheline for an N-processor system. We prove that
Tardis follows the sequential consistency model and is both deadlock- and
livelock-free. Our proof is based on simple and intuitive invariants of the
system and thus applies to any system scale and many variants of Tardis.
ABSTRACT: Oblivious RAM (ORAM) is a cryptographic primitive that hides memory access patterns as seen by untrusted storage. Recently, ORAM has been architected into secure processors. A big challenge for hardware ORAM schemes is how to efficiently manage the Position Map (PosMap), a central component in modern ORAM algorithms. Implemented naively, the PosMap causes ORAM to be fundamentally unscalable in terms of on-chip area. On the other hand, a technique called Recursive ORAM fixes the area problem yet significantly increases ORAM's performance overhead.
To address this challenge, we propose three new mechanisms. We propose a new ORAM structure called the PosMap Lookaside Buffer (PLB) and PosMap compression techniques to reduce the performance overhead from Recursive ORAM empirically (the latter also improves the construction asymptotically). Through simulation, we show that these techniques reduce the memory bandwidth overhead needed to support recursion by 95%, reduce overall ORAM bandwidth by 37% and improve overall SPEC benchmark performance by 1.27x. We then show how our PosMap compression techniques further facilitate an extremely efficient integrity verification scheme for ORAM which we call PosMap MAC (PMMAC). For a practical parameterization, PMMAC reduces the amount of hashing needed for integrity checking by >= 68x relative to prior schemes and introduces only 7% performance overhead.
We prototype our mechanisms in hardware and report area and clock frequency for a complete ORAM design post-synthesis and post-layout using an ASIC flow in a 32 nm commercial process. With 2 DRAM channels, the design post-layout runs at 1 GHz and has a total area of 0.47 mm². Depending on PLB-specific parameters, the PLB accounts for 10% to 26% area. PMMAC costs 12% of total design area. Our work is the first to prototype Recursive ORAM or ORAM with any integrity scheme in hardware.
ABSTRACT: We build and evaluate Tiny ORAM, an Oblivious RAM prototype on FPGA. Oblivious RAM is a cryptographic primitive that completely obfuscates an application’s data, access pattern, and read/write behavior to/from external memory (such as DRAM or disk). Tiny ORAM makes two main contributions. First, by removing an algorithmic bottleneck in prior work, Tiny ORAM is the first hardware ORAM design to support arbitrary block sizes (e.g., 64 Bytes to 4096 Bytes). With a 64 Byte block size, Tiny ORAM can finish an access in 1.4 µs, over 40X faster than the prior-art implementation. Second, through novel algorithmic and engineering-level optimizations, Tiny ORAM reduces the number of symmetric encryption operations by ~3X compared to prior work. Tiny ORAM is also the first design to implement and report real numbers for the cost of symmetric encryption in hardware ORAM constructions. Putting it together, Tiny ORAM requires 18381 (5%) LUTs and 146 (13%) Block RAM on a Xilinx XC7VX485T FPGA, including the cost of encryption
ABSTRACT: A new memory coherence protocol, TARDIS, is proposed. TARDIS uses timestamp
counters representing logical as opposed to physical time to order memory
operations and enforce memory consistency models in any type of shared memory
system. Compared to the widely-adopted directory coherence protocol, TARDIS is
simpler, only requires O(log N) storage per cache block for an N-core system
rather than the O(N) sharer information required by conventional directory
protocols, and integrates better with some system optimizations. On average,
TARDIS achieves similar performance to directory protocols on a wide range of
ABSTRACT: This paper describes the use of physical unclonable functions (PUFs) in low-cost authentication and key generation applications. First, it motivates the use of PUFs versus conventional secure nonvolatile memories and defines the two primary PUF types: “strong PUFs” and “weak PUFs.” It describes strong PUF implementations and their use for low-cost authentication. After this description, the paper covers both attacks and protocols to address errors. Next, the paper covers weak PUF implementations and their use in key generation applications. It covers error-correction schemes such as pattern matching and index-based coding. Finally, this paper reviews several emerging concepts in PUF technologies such as public model PUFs and new PUF implementation technologies.
Proceedings of the IEEE 08/2014; 102(8):1126-1141. DOI:10.1109/JPROC.2014.2320516 · 5.47 Impact Factor
ABSTRACT: This paper presents the author retrospective on the analytical cache modeling work published in the 2001 International Conference on Supercomputing (ICS). We summarize the history of the work, revisit primary observations and lessons that we learned from the modeling effort, and also briefly describe follow-up work to show how the research direction evolved over time. Original Paper: http://dx.doi.org/10.1145/377792.377797
ABSTRACT: This paper details the design and application of a new ultra-high speed real-time simulation for Hardware-in-the-Loop (HiL) testing and design of high-power power electronics systems. Our real-time hardware emulation system for HiL is based on a custom, heterogeneous, reconfigurable, multicore processor design that emulates power electronics, and includes a circuit compiler that translates graphic system models into processor executable machine code. We present digital processor architecture details, and describe the process of power electronic circuit compilation. This approach to real-time emulation yields real-time execution on the order of a 1 µs simulation time step (including input/output latency) for a broad class of power electronics converters. In addition, we present HiL simulation experimental results for three representative systems: a variable speed induction motor drive, a utility grid connected photovoltaic converter system, and a hybrid electric vehicle motor drive.
ABSTRACT: Physical Unclonable Functions (PUFs) allow a silicon device to be authenticated based on its manufacturing variations using challenge/response evaluations. Popular realizations use linear additive functions as building blocks. Security is scaled up using non-linear mixing (e.g., adding XORs). Because the responses are physically derived and thus noisy, the resulting explosion in noise impacts both the adversary (which is desirable) as well as the verifier (which is undesirable). We present the first architecture for linear additive physical functions where the noise seen by the adversary and the noise seen by the verifier are bifurcated by using a randomized decimation technique and a novel response recovery method at an authentication verification server. We allow the adversary's noise ηa → 0.50 while keeping the verifier's noise ηv constant, using a parameter-based authentication modality that does not require explicit challenge/response pair storage at the server. We present supporting data using 28nm FPGA PUF noise results as well as machine learning attack results. We demonstrate that our architecture can also withstand recent side-channel attacks that filter the noise (to clean up training challenge/response labels) prior to machine learning.
2014 IEEE International Symposium on Hardware-Oriented Security and Trust (HOST); 05/2014
ABSTRACT: This paper describes recent improvements to the Graphite simulator designed to help explore current and emerging research topics. With these improvements, Graphite is ideally suited to explore both power and performance in future multicore and manycore processors, especially those incorporating dynamic runtime monitoring and adaptation. Separate validation of Graphite has shown performance results within about 6% on average (18% worst case) of a cycle-level simulator and normalized power trends are predicted to within 10%. This makes Graphite accurate enough for medium- to long-term studies while maintaining very high performance. Graphite is freely available for anyone to use: http://graphite.csail.mit.edu.
International Symposium on Performance Analysis of Systems and Software (ISPASS), Monterey, CA; 03/2014
ABSTRACT: This paper proposes novel robust and low-overhead physical unclonable function (PUF) authentication and key exchange protocols that are resilient against reverse-engineering attacks. The protocols are executed between a party with access to a physical PUF (prover) and a trusted party who has access to the PUF compact model (verifier). The proposed protocols do not follow the classic paradigm of exposing the full PUF responses or a transformation of them. Instead, random subsets of the PUF response strings are sent to the verifier so the exact position of the subset is obfuscated for the third-party channel observers. Authentication of the responses at the verifier side is done by matching the substring to the available full response string; the index of the matching point is the actual obfuscated secret (or key) and not the response substring itself. We perform a thorough analysis of resiliency of the protocols against various adversarial acts, including machine learning and statistical attacks. The attack analysis guides us in tuning the parameters of the protocol for an efficient and secure implementation. The low overhead and practicality of the protocols are evaluated and confirmed by hardware implementation.
IEEE Transactions on Emerging Topics in Computing 03/2014; 2(1):37-49. DOI:10.1109/TETC.2014.2300635
ABSTRACT: Oblivious RAM (ORAM) is an established cryptographic technique to hide a program's address pattern to an untrusted storage system. More recently, ORAM schemes have been proposed to replace conventional memory controllers in secure processor settings to protect against information leakage in external memory and the processor I/O bus. A serious problem in current secure processor ORAM proposals is that they don't obfuscate when ORAM accesses are made, or do so in a very conservative manner. Since secure processors make ORAM accesses on last-level cache misses, ORAM access timing strongly correlates to program access pattern (e.g., locality). This brings ORAM's purpose in secure processors into question. This paper makes two contributions. First, we show how a secure processor can bound ORAM timing channel leakage to a user-controllable leakage limit. The secure processor is allowed to dynamically optimize ORAM access rate for power/performance, subject to the constraint that the leakage limit is not violated. Second, we show how changing the leakage limit impacts program efficiency. We present a dynamic scheme that leaks at most 32 bits through the ORAM timing channel and introduces only 20% performance overhead and 12% power overhead relative to a baseline ORAM that has no timing channel protection. By reducing leakage to 16 bits, our scheme degrades in performance by 5% but gains in power efficiency by 3%. We show that a static (zero leakage) scheme imposes a 34% power overhead for equivalent performance (or a 30% performance overhead for equivalent power) relative to our dynamic scheme.
2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA); 02/2014
ABSTRACT: Next generation multicores will process massive data with varying degree of locality. Harnessing on-chip data locality to optimize the utilization of cache and network resources is of fundamental importance. We propose a locality-aware selective data replication protocol for the last-level cache (LLC). Our goal is to lower memory access latency and energy by replicating only high locality cache lines in the LLC slice of the requesting core, while simultaneously keeping the off-chip miss rate low. Our approach relies on low overhead yet highly accurate in-hardware run-time classification of data locality at the cache line granularity, and only allows replication for cache lines with high reuse. Furthermore, our classifier captures the LLC pressure at the existing replica locations and adapts its replication decision accordingly. The locality tracking mechanism is decoupled from the sharer tracking structures that cause scalability concerns in traditional coherence protocols. Moreover, the complexity of our protocol is low since no additional coherence states are created. On a set of parallel benchmarks, our protocol reduces the overall energy by 16%, 14%, 13% and 21% and the completion time by 4%, 9%, 6% and 13% when compared to the previously proposed Victim Replication, Adaptive Selective Replication, Reactive-NUCA and Static-NUCA LLC management schemes.
2014 IEEE 20th International Symposium on High-Performance Computer Architecture (HPCA); 02/2014
ABSTRACT: Chip-multiprocessors (CMPs) have become the mainstream parallel architecture in recent years; for scalability reasons, designs with high core counts tend towards tiled CMPs with physically distributed shared caches. This naturally leads to a Non-Uniform Cache Access (NUCA) design, where on-chip access latencies depend on the physical distances between requesting cores and home cores where the data is cached. Improving data locality is thus key to performance, and several studies have addressed this problem using data replication and data migration. In this paper, we consider another mechanism, hardware-level thread migration. This approach, we argue, can better exploit shared data locality for NUCA designs by effectively replacing multiple round-trip remote cache accesses with a smaller number of migrations. High migration costs, however, make it crucial to use thread migrations judiciously; we therefore propose a novel, on-line prediction scheme which decides whether to perform a remote access (as in traditional NUCA designs) or to perform a thread migration at the instruction level. For a set of parallel benchmarks, our thread migration predictor improves the performance by 24% on average over the shared-NUCA design that only uses remote accesses.
ABSTRACT: A major security concern with outsourcing data storage to third-party providers is authenticating the integrity and freshness of data. State-of-the-art software-based approaches require clients to maintain state and cannot immediately detect forking attacks, while approaches that introduce limited trusted hardware (e.g., a monotonic counter) at the storage server achieve low throughput. This paper proposes a new design for authenticating data storage using a small piece of high-performance trusted hardware attached to an untrusted server. The proposed design achieves significantly higher throughput than previous designs. The server-side trusted hardware allows clients to authenticate data integrity and freshness without keeping any mutable client-side state. Our design achieves high performance by parallelizing server-side authentication operations and permitting the untrusted server to maintain caches and schedule disk writes, while enforcing precise crash recovery and write access control.
Proceedings of the 2013 ACM workshop on Cloud computing security workshop; 11/2013
ABSTRACT: This paper investigates secure ways to interact with tamper-resistant hardware leaking a strictly bounded amount of information. Architectural support for the interaction mechanisms is studied and performance implications are evaluated.
The interaction mechanisms are built on top of a recently-proposed secure processor Ascend[ascend-stc12]. Ascend is chosen because unlike other tamper-resistant hardware systems, Ascend completely obfuscates pin traffic through the use of Oblivious RAM (ORAM) and periodic ORAM accesses. However, the original Ascend proposal, with the exception of main memory, can only communicate with the outside world at the beginning or end of program execution; no intermediate information transfer is allowed.
Our system, Stream-Ascend, is an extension of Ascend that enables intermediate interaction with the outside world. Stream-Ascend significantly improves the generality and efficiency of Ascend in supporting many applications that fit into a streaming model, while maintaining the same security level. Simulation results show that with smart scheduling algorithms, the performance overhead of Stream-Ascend relative to an insecure and idealized baseline processor is only 24.5%, 0.7%, and 3.9% for a set of streaming benchmarks in a large dataset processing application. Stream-Ascend is able to achieve a very high security level with small overheads for a large class of applications.
Proceedings of the 2013 ACM workshop on Cloud computing security workshop; 11/2013
ABSTRACT: We discuss numerical modeling attacks on several proposed strong physical unclonable functions (PUFs). Given a set of challenge-response pairs (CRPs) of a Strong PUF, the goal of our attacks is to construct a computer algorithm which behaves indistinguishably from the original PUF on almost all CRPs. If successful, this algorithm can subsequently impersonate the Strong PUF, and can be cloned and distributed arbitrarily. It breaks the security of any applications that rest on the Strong PUF's unpredictability and physical unclonability. Our method is less relevant for other PUF types such as Weak PUFs. The Strong PUFs that we could attack successfully include standard Arbiter PUFs of essentially arbitrary sizes, and XOR Arbiter PUFs, Lightweight Secure PUFs, and Feed-Forward Arbiter PUFs up to certain sizes and complexities. We also investigate the hardness of certain Ring Oscillator PUF architectures in typical Strong PUF applications. Our attacks are based upon various machine learning techniques, including a specially tailored variant of logistic regression and evolution strategies. Our results are mostly obtained on CRPs from numerical simulations that use established digital models of the respective PUFs. For a subset of the considered PUFs (namely standard Arbiter PUFs and XOR Arbiter PUFs), we also provide proofs of concept on silicon data from both FPGAs and ASICs. Over four million silicon CRPs are used in this process. The performance on silicon CRPs is very close to simulated CRPs, confirming a conjecture from earlier versions of this work. Our findings lead to new design requirements for secure electrical Strong PUFs, and will be useful to PUF designers and attackers alike.
IEEE Transactions on Information Forensics and Security 11/2013; 8(11):1876-1891. DOI:10.1109/TIFS.2013.2279798 · 2.07 Impact Factor
ABSTRACT: This paper presents a light-weight dynamic optimization framework for homogeneous multicores. Our system profiles applications at runtime to detect hot program paths, and offloads the optimization of these paths to a Partner core. Our work contributes two insights: (1) that the dynamic optimization process is highly insensitive to runtime factors in homogeneous multicores and (2) that the Partner core's view of application hot paths can be noisy, allowing the entire optimization process to be implemented with very little dedicated hardware in a multicore.
2013 IFIP/IEEE 21st International Conference on Very Large Scale Integration (VLSI-SoC); 10/2013
ABSTRACT: With exascale multicores, the question of how to efficiently support a shared memory model is of paramount importance. As programmers demand the convenience of coherent shared memory, ever-growing core counts place higher demands on memory subsystems, and increasing on-chip distances mean that interconnect delays exert a significant effect on memory access latencies.