[Show abstract][Hide abstract] ABSTRACT: This paper describes the use of physical unclonable functions (PUFs) in low-cost authentication and key generation applications. First, it motivates the use of PUFs versus conventional secure nonvolatile memories and defines the two primary PUF types: “strong PUFs” and “weak PUFs.” It describes strong PUF implementations and their use for low-cost authentication. After this description, the paper covers both attacks and protocols to address errors. Next, the paper covers weak PUF implementations and their use in key generation applications. It covers error-correction schemes such as pattern matching and index-based coding. Finally, this paper reviews several emerging concepts in PUF technologies such as public model PUFs and new PUF implementation technologies.
Proceedings of the IEEE 01/2014; 102(8):1126-1141. · 6.91 Impact Factor
[Show abstract][Hide abstract] ABSTRACT: This paper proposes novel robust and low-overhead physical unclonable function (PUF) authentication and key exchange protocols that are resilient against reverse-engineering attacks. The protocols are executed between a party with access to a physical PUF (prover) and a trusted party who has access to the PUF compact model (verifier). The proposed protocols do not follow the classic paradigm of exposing the full PUF responses or a transformation of them. Instead, random subsets of the PUF response strings are sent to the verifier so the exact position of the subset is obfuscated for the third-party channel observers. Authentication of the responses at the verifier side is done by matching the substring to the available full response string; the index of the matching point is the actual obfuscated secret (or key) and not the response substring itself. We perform a thorough analysis of resiliency of the protocols against various adversarial acts, including machine learning and statistical attacks. The attack analysis guides us in tuning the parameters of the protocol for an efficient and secure implementation. The low overhead and practicality of the protocols are evaluated and confirmed by hardware implementation.
Emerging Topics in Computing, IEEE Transactions on. 01/2014; 2(1):37-49.
[Show abstract][Hide abstract] ABSTRACT: This paper investigates secure ways to interact with tamper-resistant hardware leaking a strictly bounded amount of information. Architectural support for the interaction mechanisms is studied and performance implications are evaluated. The interaction mechanisms are built on top of a recently-proposed secure processor Ascend[ascend-stc12]. Ascend is chosen because unlike other tamper-resistant hardware systems, Ascend completely obfuscates pin traffic through the use of Oblivious RAM (ORAM) and periodic ORAM accesses. However, the original Ascend proposal, with the exception of main memory, can only communicate with the outside world at the beginning or end of program execution; no intermediate information transfer is allowed. Our system, Stream-Ascend, is an extension of Ascend that enables intermediate interaction with the outside world. Stream-Ascend significantly improves the generality and efficiency of Ascend in supporting many applications that fit into a streaming model, while maintaining the same security level.Simulation results show that with smart scheduling algorithms, the performance overhead of Stream-Ascend relative to an insecure and idealized baseline processor is only 24.5%, 0.7%, and 3.9% for a set of streaming benchmarks in a large dataset processing application. Stream-Ascend is able to achieve a very high security level with small overheads for a large class of applications.
Proceedings of the 2013 ACM workshop on Cloud computing security workshop; 11/2013
[Show abstract][Hide abstract] ABSTRACT: A major security concern with outsourcing data storage to third-party providers is authenticating the integrity and freshness of data. State-of-the-art software-based approaches require clients to maintain state and cannot immediately detect forking attacks, while approaches that introduce limited trusted hardware (e.g., a monotonic counter) at the storage server achieve low throughput. This paper proposes a new design for authenticating data storage using a small piece of high-performance trusted hardware attached to an untrusted server. The proposed design achieves significantly higher throughput than previous designs. The server-side trusted hardware allows clients to authenticate data integrity and freshness without keeping any mutable client-side state. Our design achieves high performance by parallelizing server-side authentication operations and permitting the untrusted server to maintain caches and schedule disk writes, while enforcing precise crash recovery and write access control.
Proceedings of the 2013 ACM workshop on Cloud computing security workshop; 11/2013
[Show abstract][Hide abstract] ABSTRACT: Next generation multicore applications will process massive amounts of data with significant sharing. Data movement and management impacts memory access latency and consumes power. Therefore, harnessing data locality is of fundamental importance in future processors. We propose a scalable, efficient shared memory cache coherence protocol that enables seamless adaptation between private and logically shared caching of on-chip data at the fine granularity of cache lines. Our data-centric approach relies on in-hardware yet low-overhead runtime profiling of the locality of each cache line and only allows private caching for data blocks with high spatio-temporal locality. This allows us to better exploit the private caches and enable low-latency, low-energy memory access, while retaining the convenience of shared memory. On a set of parallel benchmarks, our low-overhead locality-aware mechanisms reduce the overall energy by 25% and completion time by 15% in an NoC-based multicore with the Reactive-NUCA on-chip cache organization and the ACKwise limited directory-based coherence protocol.
Proceedings of the 40th Annual International Symposium on Computer Architecture; 07/2013
[Show abstract][Hide abstract] ABSTRACT: Keeping user data private is a huge problem both in cloud computing and computation outsourcing. One paradigm to achieve data privacy is to use tamper-resistant processors, inside which users' private data is decrypted and computed upon. These processors need to interact with untrusted external memory. Even if we encrypt all data that leaves the trusted processor, however, the address sequence that goes off-chip may still leak information. To prevent this address leakage, the security community has proposed ORAM (Oblivious RAM). ORAM has mainly been explored in server/file settings which assume a vastly different computation model than secure processors. Not surprisingly, naïvely applying ORAM to a secure processor setting incurs large performance overheads. In this paper, a recent proposal called Path ORAM is studied. We demonstrate techniques to make Path ORAM practical in a secure processor setting. We introduce background eviction schemes to prevent Path ORAM failure and allow for a performance-driven design space exploration. We propose a concept called super blocks to further improve Path ORAM's performance, and also show an efficient integrity verification scheme for Path ORAM. With our optimizations, Path ORAM overhead drops by 41.8%, and SPEC benchmark execution time improves by 52.4% in relation to a baseline configuration. Our work can be used to improve the security level of previous secure processors.
Proceedings of the 40th Annual International Symposium on Computer Architecture; 06/2013
[Show abstract][Hide abstract] ABSTRACT: Engineering complex systems inevitably requires a designer to balance many conflicting design requirements including performance, cost, power, and design time. In many cases, FPGAs enable engineers to balance these design requirements in ways not possible ...
Proceedings of the ACM/SIGDA international symposium on Field programmable gate arrays; 02/2013
[Show abstract][Hide abstract] ABSTRACT: We discuss numerical modeling attacks on several proposed strong physical unclonable functions (PUFs). Given a set of challenge-response pairs (CRPs) of a Strong PUF, the goal of our attacks is to construct a computer algorithm which behaves indistinguishably from the original PUF on almost all CRPs. If successful, this algorithm can subsequently impersonate the Strong PUF, and can be cloned and distributed arbitrarily. It breaks the security of any applications that rest on the Strong PUF's unpredictability and physical unclonability. Our method is less relevant for other PUF types such as Weak PUFs. The Strong PUFs that we could attack successfully include standard Arbiter PUFs of essentially arbitrary sizes, and XOR Arbiter PUFs, Lightweight Secure PUFs, and Feed-Forward Arbiter PUFs up to certain sizes and complexities. We also investigate the hardness of certain Ring Oscillator PUF architectures in typical Strong PUF applications. Our attacks are based upon various machine learning techniques, including a specially tailored variant of logistic regression and evolution strategies. Our results are mostly obtained on CRPs from numerical simulations that use established digital models of the respective PUFs. For a subset of the considered PUFs-namely standard Arbiter PUFs and XOR Arbiter PUFs-we also lead proofs of concept on silicon data from both FPGAs and ASICs. Over four million silicon CRPs are used in this process. The performance on silicon CRPs is very close to simulated CRPs, confirming a conjecture from earlier versions of this work. Our findings lead to new design requirements for secure electrical Strong PUFs, and will be useful to PUF designers and attackers alike.
IEEE Transactions on Information Forensics and Security 01/2013; 8(11):1876-1891. · 1.90 Impact Factor
[Show abstract][Hide abstract] ABSTRACT: Conventional oblivious routing algorithms do not take into account resource requirements (e.g., bandwidth, latency) of various flows in a given application. As they are not aware of flow demands that are specific to the application, network resources can be poorly utilized and cause serious local congestion. Also, flows, or packets, may share virtual channels in an undetermined way; the effects of head-of-line blocking may result in throughput degradation. In this paper, we present a framework for application-aware routing that assures deadlock freedom under one or more virtual channels by forcing routes to conform to an acyclic channel dependence graph. In addition, we present methods to statically and efficiently allocate virtual channels to flows or packets, under oblivious routing, when there are two or more virtual channels per link. Using the application-aware routing framework, we develop and evaluate a bandwidth-sensitive oblivious routing scheme that statically determines routes considering an application's communication characteristics. Given bandwidth estimates for flows, we present a mixed integer-linear programming (MILP) approach and a heuristic approach for producing deadlock-free routes that minimize maximum channel load. Our framework can be used to produce application-aware routes that target the minimization of latency, number of flows through a link, bandwidth, or any combination thereof. Our results show that it is possible to achieve better performance than traditional deterministic and oblivious routing schemes on popular synthetic benchmarks using our bandwidth-sensitive approach. We also show that, when oblivious routing is used and there are more flows than virtual channels per link, the static assignment of virtual channels to flows can help mitigate the effects of head-of-line blocking, which may impede packets that are dynamically competing for virtual channels. We experimentally explore the performance tradeoffs of static and dynamic - irtual channel allocation on bandwidth-sensitive and traditional oblivious routing methods.
IEEE Transactions on Computers 01/2013; 62(1):59-73. · 1.38 Impact Factor
[Show abstract][Hide abstract] ABSTRACT: Oblivious-RAMs (ORAM) are used to hide memory access patterns. Path ORAM has gained popularity due to its efficiency and simplicity. In this paper, we propose an efficient integrity verification layer for Path ORAM, which only imposes 17% latency overhead. We also show that integrity verification is vital to maintaining privacy for recursive Path ORAMs under active adversaries.
High Performance Extreme Computing Conference (HPEC), 2013 IEEE; 01/2013
[Show abstract][Hide abstract] ABSTRACT: The paper states that people are trusting the cloud more and more to perform sensitive operations. Demanding more trust in software systems is a recipe for disaster. Suppose the people only trust hardware manufacturers and cryptographers, and not system software developers, application programmers, or other software vendors. It will be the hardware manufacturer's job to produce a piece of hardware that provides some security properties. These properties will correspond to cryptographic operations being implemented correctly in the hardware and adding a modicum of physical security. The beauty of hardware is that its functionality is fixed. If we design our systems to only depend on hardware properties, then we need not worry about software changes or patches introducing new security holes-inevitable in current systems. How can it ensure privacy of data despite the practically infinite number of malicious programs out there? The Ascend processor attempts to achieve these goals; the only entity that the client has to trust is the processor itself.
[Show abstract][Hide abstract] ABSTRACT: This paper presents a novel Multicore Architecture for Real-Time Hybrid Applications (MARTHA) with time-predictable execution, low computational latency, and high performance that meets the requirements for control, emulation and estimation of next-generation power electronics and smart grid systems. Generic general-purpose architectures running real-time operating systems (RTOS) or quality of service (QoS) schedulers have not been able to meet the hard real-time constraints required by these applications. We present a framework based on switched hybrid automata for modeling power electronics applications. Our approach allows a large class of power electronics circuits to be expressed as switched hybrid models which can be executed on a single hardware platform.
Design, Automation & Test in Europe Conference & Exhibition (DATE), 2013; 01/2013
[Show abstract][Hide abstract] ABSTRACT: As transistor technology continues to scale, the architecture community has experienced exponential growth in design complexity and significantly increasing implementation and verification costs. Moreover, Moore's law has led to a ubiquitous trend of an increasing number of cores on a single chip. Often, these large-core-count chips provide a shared memory abstraction via directories and coherence protocols, which have become notoriously error-prone and difficult to verify because of subtle data races and state space explosion. Although a very simple hardware shared memory implementation can be achieved by simply not allowing ad-hoc data replication and relying on remote accesses for remotely cached data (i.e., requiring no directories or coherence protocols), such remote-access-based directoryless architectures cannot take advantage of any data locality, and therefore suffer in both performance and energy. Our recently taped-out 110-core shared-memory processor, the Execution Migration Machine (EM2), establishes a new design point. On the one hand, EM2 supports shared memory but does not automatically replicate data, and thus preserves the simplicity of directoryless architectures. On the other hand, it significantly improves performance and energy over remote-access-only designs by exploiting data locality at remote cores via fast hardware-level thread migration. In this paper, we describe the design choices made in the EM2 chip as well as our choice of design methodology, and discuss how they combine to achieve design simplicity and verification efficiency. Even though EM2 is a fairly large design-110 cores using a total of 357 million transistors-the entire chip design and implementation process (RTL, verification, physical design, tapeout) took only 18 man-months.
Computer Design (ICCD), 2013 IEEE 31st International Conference on; 01/2013
[Show abstract][Hide abstract] ABSTRACT: With exascale multicores, the question of how to efficiently support a shared memory model is of paramount importance. As programmers demand the convenience of coherent shared memory, ever-growing core counts place higher demands on memory subsystems, and increasing on-chip distances mean that interconnect delays exert a significant effect on memory access latencies.
[Show abstract][Hide abstract] ABSTRACT: Fully homomorphic encryption (FHE) techniques are capable of performing encrypted computation on Boolean circuits, i.e., the user specifies encrypted inputs to the program, and the server computes on the encrypted inputs. Applying these techniques to general programs with recursive procedures and data-dependent loops has not been a focus of attention. In this paper, we take a first step toward building an interpreter that, given programs with complex control flow, schedules efficient code suitable for the application of FHE schemes. We first describe how programs written in a small Turing-complete instruction set can be executed with encrypted data and point out inefficiencies in this methodology. We then provide examples of scheduling (a) the greatest common divisor (GCD) problem using Euclid's algorithm and (b) the 3-Satisfiability (3SAT) problem using a recursive backtracking algorithm into path-levelized FHE computations. We describe how path levelization reduces control flow ambiguity and improves encrypted computation efficiency. Using these techniques and data-dependent loops as a starting point, we then build support for hierarchical programs made up of phases, where each phase corresponds to a fixed point computation that can be used to further improve the efficiency of encrypted computation. In our setting, the adversary learns an estimate of the number of steps required to complete the computation, which we show is the least amount of leakage possible.
Proceedings of the 2012 ACM Workshop on Cloud computing security workshop; 10/2012
[Show abstract][Hide abstract] ABSTRACT: This paper considers encrypted computation where the user specifies encrypted inputs to an untrusted program, and the server computes on those encrypted inputs. To this end we propose a secure processor architecture, called Ascend, that guarantees privacy of data when arbitrary programs use the data running in a cloud-like environment (e.g., an untrusted server running an untrusted software stack). The key idea to guarantee privacy is obfuscated instruction execution; Ascend does not disclose what instruction is being run at any given time, be it an arithmetic instruction or a memory instruction. Periodic accesses to external instruction and data memory are performed through an Oblivious RAM (ORAM) interface to prevent leakage through memory access patterns. We evaluate the processor architecture on SPEC benchmarks running on encrypted data and quantify overheads.
Proceedings of the seventh ACM workshop on Scalable trusted computing; 10/2012
[Show abstract][Hide abstract] ABSTRACT: This paper argues for a "less is more" design philosophy when integrating dynamic optimization into a multicore system. The primary insight is that dynamic optimization is inherently loosely-coupled and can therefore be supported on multicores with very low-overhead by using a Partner core. We exploit this property by designing a dynamic optimizer composed of a two-core partnership that requires a minimal amount of dedicated hardware and is resilient to (a) reducing the Partner core's clock frequency, (b) changing the Partner core's placement on the multicore die and (c) varying the latency of dynamic optimization operations.
Proceedings of the 21st international conference on Parallel architectures and compilation techniques; 09/2012
[Show abstract][Hide abstract] ABSTRACT: The development of algorithms for designing artificial RNA sequences that fold into specific secondary structures has many potential biomedical and synthetic biology applications. To date, this problem remains computationally difficult, and current strategies to address it resort to heuristics and stochastic search techniques. The most popular methods consist of two steps: First a random seed sequence is generated; next, this seed is progressively modified (i.e. mutated) to adopt the desired folding properties. Although computationally inexpensive, this approach raises several questions such as (i) the influence of the seed; and (ii) the efficiency of single-path directed searches that may be affected by energy barriers in the mutational landscape. In this article, we present RNA-ensign, a novel paradigm for RNA design. Instead of taking a progressive adaptive walk driven by local search criteria, we use an efficient global sampling algorithm to examine large regions of the mutational landscape under structural and thermodynamical constraints until a solution is found. When considering the influence of the seeds and the target secondary structures, our results show that, compared to single-path directed searches, our approach is more robust, succeeds more often and generates more thermodynamically stable sequences. An ensemble approach to RNA design is thus well worth pursuing as a complement to existing approaches. RNA-ensign is available at http://csb.cs.mcgill.ca/RNAensign.
Nucleic Acids Research 08/2012; · 8.81 Impact Factor
[Show abstract][Hide abstract] ABSTRACT: We present Path ORAM, an extremely simple Oblivious RAM protocol with a small
amount of client storage. Partly due to its simplicity, Path ORAM is the most
practical ORAM scheme known to date. We formally prove that Path ORAM requires
O(log^2 N / k) bandwidth overhead for block size B = k * log N. For block sizes
bigger than O(log^2 N) bits, Path ORAM is asymptotically better than the best
known ORAM scheme with small client storage. Due to its practicality, Path ORAM
has been adopted in the design of secure processors since its proposal.
[Show abstract][Hide abstract] ABSTRACT: Addressing the challenges of extreme scale computing requires holistic design of new programming models and systems that support those models. This paper discusses the Angstrom processor, which is designed to support a new Self-aware Computing (SEEC) model. In SEEC, applications explicitly state goals, while other systems components provide actions that the SEEC runtime system can use to meet those goals. Angstrom supports this model by exposing sensors and adaptations that traditionally would be managed independently by hardware. This exposure allows SEEC to coordinate hardware actions with actions specified by other parts of the system, and allows the SEEC runtime system to meet application goals while reducing costs (e.g., power consumption).