Srinivas Devadas

Massachusetts Institute of Technology, Cambridge, Massachusetts, United States

Publications (389) · 152.45 Total impact

  • Farrukh Hijaz · Qingchuan Shi · George Kurian · Srinivas Devadas · Omer Khan
    ABSTRACT: Next generation large single-chip multicores will process massive data with varying degrees of locality. Harnessing on-chip data locality to optimize the utilization of on-chip cache and network resources is of fundamental importance. We propose a locality-aware selective data replication protocol for the last-level cache (LLC). The goal is to lower memory access latency and energy by only replicating cache lines with high reuse in the LLC slice of the requesting core, while simultaneously keeping the off-chip miss rate low. The approach relies on a low-overhead yet highly accurate in-hardware runtime cache line level classifier that only allows replication of cache lines with high reuse. Furthermore, the classifier captures the LLC pressure at the existing replica locations and adapts its replication decision accordingly. On a set of parallel benchmarks, the proposed protocol reduces overall energy by 14.7%, 10.7%, 10.5%, and 16.7% and completion time by 2.5%, 6.5%, 4.5%, and 9.5% when compared to the previously proposed Victim Replication, Adaptive Selective Replication, Reactive-NUCA, and Static-NUCA LLC management schemes, respectively. An efficient classifier implementation is evaluated with an overhead of 5.44 KB, which translates to only 1.58% on top of the Static-NUCA baseline’s cache related per-core storage.
    No preview · Article · Feb 2016 · The Journal of Supercomputing
  • Xiangyao Yu · Srinivas Devadas
    ABSTRACT: The scalability of cache coherence protocols is a significant challenge in multicore and other distributed shared memory systems. Traditional snoopy and directory-based coherence protocols are difficult to scale up to many-core systems because of the overhead of broadcasting and storing sharers for each cacheline. Tardis, a recently proposed coherence protocol, shows potential in solving the scalability problem, since it only requires O(logN) storage per cacheline for an N-core system and needs no broadcasting support. The original Tardis protocol, however, only supports the sequential consistency memory model. This limits its applicability in real systems since most processors today implement relaxed consistency models like Total Store Order (TSO). Tardis also incurs large network traffic overhead on some benchmarks due to an excessive number of renew messages. Furthermore, the original Tardis protocol has suboptimal performance when the program uses spinning to communicate between threads. In this paper, we address these downsides of the Tardis protocol and make it significantly more practical. Specifically, we discuss the architectural, memory system and protocol changes required in order to implement the TSO consistency model on Tardis, and prove that the modified protocol satisfies TSO. We also propose optimizations for better leasing policies and to handle program spinning. Evaluated on 20 benchmarks, optimized Tardis at 64 (256) cores can achieve an average performance improvement of 15.8% (8.4%) compared to the baseline Tardis and 1% (3.4%) compared to the baseline directory protocol. Our optimizations also reduce the average network traffic by 4.3% (6.1%) compared to the baseline directory protocol. On this set of benchmarks, optimized Tardis improves on a fullmap directory protocol in the metrics of energy, performance and storage, while being simpler to implement.
    No preview · Article · Nov 2015
  • Keun Sup Shim · Mieszko Lis · Omer Khan · Srinivas Devadas
    ABSTRACT: For certain applications involving chip multiprocessors with more than 16 cores, a directoryless architecture with fine-grained and partial-context thread migration can outperform directory-based coherence, providing lighter on-chip traffic and reduced verification complexity.
    No preview · Article · Sep 2015 · Computer
  • Article: Riffle
    Albert Kwon · David Lazar · Srinivas Devadas · Bryan Ford
    ABSTRACT: Existing anonymity systems sacrifice anonymity for efficient communication or vice versa. Onion-routing achieves low latency, high bandwidth, and scalable anonymous communication, but is susceptible to traffic analysis attacks. Designs based on DC-Nets, on the other hand, protect the users against traffic analysis attacks, but sacrifice bandwidth. Verifiable mixnets maintain strong anonymity with low bandwidth overhead, but suffer from high computation overhead instead. In this paper, we present Riffle, a bandwidth and computation efficient communication system with strong anonymity. Riffle consists of a small set of anonymity servers and a large number of users, and guarantees anonymity among all honest clients as long as there exists at least one honest server. Riffle uses a new hybrid verifiable shuffle technique and private information retrieval for bandwidth- and computation-efficient anonymous communication. Our evaluation of Riffle in file sharing and microblogging applications shows that Riffle can achieve a bandwidth of over 100 KB/s per user in an anonymity set of 200 users in the case of file sharing, and handle over 100,000 users with less than 10 second latency in the case of microblogging.
    Preview · Article · Aug 2015
  • M.-D.M. Yu · Matthias Hiller · Srinivas Devadas
    ABSTRACT: We present a PUF key generation scheme that uses the provably optimal method of maximum-likelihood (ML) detection on symbols derived from PUF response bits. Each device forms a noisy, device-specific symbol constellation, based on manufacturing variation. Each detected symbol is a letter in a codeword of an error correction code, resulting in non-binary codewords. We present a three-pronged validation strategy: i. mathematical (deriving an optimal symbol decoder), ii. simulation (comparing against prior approaches), and iii. empirical (using implementation data). We present simulation results demonstrating that for a given PUF noise level and block size (an estimate of helper data size), our new symbol-based ML approach can have orders of magnitude better bit error rates compared to prior schemes such as block coding, repetition coding, and threshold-based pattern matching, especially under high levels of noise due to extreme environmental variation. We demonstrate environmental reliability of an ML symbol-based soft-decision error correction approach in 28 nm FPGA silicon, covering -65°C to 105°C ambient (and including 125°C junction), and with 128-bit key regeneration error probability ≤ 1 ppm.
    Full-text · Article · Jun 2015
  • Article: PrORAM
    No preview · Article · Jun 2015 · ACM SIGARCH Computer Architecture News
  • Xiangyao Yu · Muralidaran Vijayaraghavan · Srinivas Devadas
    ABSTRACT: We prove the correctness of a recently-proposed cache coherence protocol, Tardis, which is simple, yet scalable to high processor counts, because it only requires O(logN) storage per cacheline for an N-processor system. We prove that Tardis follows the sequential consistency model and is both deadlock- and livelock-free. Our proof is based on simple and intuitive invariants of the system and thus applies to any system scale and many variants of Tardis.
    Preview · Article · May 2015
  • ABSTRACT: Oblivious RAM (ORAM) is a cryptographic primitive that hides memory access patterns as seen by untrusted storage. Recently, ORAM has been architected into secure processors. A big challenge for hardware ORAM schemes is how to efficiently manage the Position Map (PosMap), a central component in modern ORAM algorithms. Implemented naively, the PosMap causes ORAM to be fundamentally unscalable in terms of on-chip area. On the other hand, a technique called Recursive ORAM fixes the area problem yet significantly increases ORAM's performance overhead. To address this challenge, we propose three new mechanisms. We propose a new ORAM structure called the PosMap Lookaside Buffer (PLB) and PosMap compression techniques to reduce the performance overhead from Recursive ORAM empirically (the latter also improves the construction asymptotically). Through simulation, we show that these techniques reduce the memory bandwidth overhead needed to support recursion by 95%, reduce overall ORAM bandwidth by 37% and improve overall SPEC benchmark performance by 1.27x. We then show how our PosMap compression techniques further facilitate an extremely efficient integrity verification scheme for ORAM which we call PosMap MAC (PMMAC). For a practical parameterization, PMMAC reduces the amount of hashing needed for integrity checking by ≥ 68x relative to prior schemes and introduces only 7% performance overhead. We prototype our mechanisms in hardware and report area and clock frequency for a complete ORAM design post-synthesis and post-layout using an ASIC flow in a 32 nm commercial process. With 2 DRAM channels, the design post-layout runs at 1 GHz and has a total area of 0.47 mm². Depending on PLB-specific parameters, the PLB accounts for 10% to 26% area. PMMAC costs 12% of total design area. Our work is the first to prototype Recursive ORAM or ORAM with any integrity scheme in hardware.
    Full-text · Article · May 2015 · ACM SIGPLAN Notices
  • ABSTRACT: We build and evaluate Tiny ORAM, an Oblivious RAM prototype on FPGA. Oblivious RAM is a cryptographic primitive that completely obfuscates an application’s data, access pattern, and read/write behavior to/from external memory (such as DRAM or disk). Tiny ORAM makes two main contributions. First, by removing an algorithmic bottleneck in prior work, Tiny ORAM is the first hardware ORAM design to support arbitrary block sizes (e.g., 64 Bytes to 4096 Bytes). With a 64 Byte block size, Tiny ORAM can finish an access in 1.4 µs, over 40X faster than the prior-art implementation. Second, through novel algorithmic and engineering-level optimizations, Tiny ORAM reduces the number of symmetric encryption operations by ~3X compared to prior work. Tiny ORAM is also the first design to implement and report real numbers for the cost of symmetric encryption in hardware ORAM constructions. Putting it together, Tiny ORAM requires 18381 (5%) LUTs and 146 (13%) Block RAM on a Xilinx XC7VX485T FPGA, including the cost of encryption.
    Full-text · Article · Apr 2015
  • No preview · Article · Mar 2015 · ACM SIGARCH Computer Architecture News
  • Michel A. Kinsy · Srinivas Devadas
    ABSTRACT: In this paper we present an Integer Linear Programming (ILP) formulation and two non-iterative heuristics for scheduling a task-based application onto a heterogeneous many-core architecture. Our ILP formulation is able to handle different application performance targets, e.g., low execution time, low memory miss rate, and different architectural features, e.g., cache sizes. For large problems where the ILP convergence time may be too long, we propose a simple mapping algorithm which tries to spread tasks onto as many processing units as possible, and a more elaborate heuristic that shows good mapping performance when compared to the ILP formulation. We use two realistic power electronics applications to evaluate our mapping techniques on full RTL many-core systems consisting of eight different types of processor cores.
    Preview · Article · Feb 2015
  • Michel A. Kinsy · Srinivas Devadas
    ABSTRACT: The increasing complexity of embedded systems is accelerating the use of multicore processors in these systems. This trend gives rise to new problems such as the sharing of on-chip network resources among hard real-time and normal best effort data traffic. We propose a network-on-chip router that provides predictable and deterministic communication latency for hard real-time data traffic while maintaining high concurrency and throughput for best-effort/general-purpose traffic with minimal hardware overhead. The proposed router requires less area than non-interfering networks, and provides better Quality of Service (QoS) in terms of predictability and determinism to hard real-time traffic than priority-based routers. We present a deadlock-free algorithm for decoupled routing of the two types of traffic. We compare the area and power estimates of three different router architectures with various QoS schemes using the IBM 45-nm SOI CMOS technology cell library. Performance evaluations are done using three realistic benchmark applications: a hybrid electric vehicle application, a utility grid connected photovoltaic converter system, and a variable speed induction motor drive application.
    Preview · Article · Feb 2015
  • Xiangyao Yu · Srinivas Devadas
    ABSTRACT: A new memory coherence protocol, TARDIS, is proposed. TARDIS uses timestamp counters representing logical as opposed to physical time to order memory operations and enforce memory consistency models in any type of shared memory system. Compared to the widely-adopted directory coherence protocol, TARDIS is simpler, only requires O(log N) storage per cache block for an N-core system rather than the O(N) sharer information required by conventional directory protocols, and integrates better with some system optimizations. On average, TARDIS achieves similar performance to directory protocols on a wide range of benchmarks.
    Preview · Article · Jan 2015
  • ABSTRACT: Computer architectures are moving towards an era dominated by many-core machines with dozens or even hundreds of cores on a single chip. This unprecedented level of on-chip parallelism introduces a new dimension to scalability that current database management systems (DBMSs) were not designed for. In particular, as the number of cores increases, the problem of concurrency control becomes extremely challenging. With hundreds of threads running in parallel, the complexity of coordinating competing accesses to data will likely diminish the gains from increased core counts. To better understand just how unprepared current DBMSs are for future CPU architectures, we performed an evaluation of concurrency control for on-line transaction processing (OLTP) workloads on many-core chips. We implemented seven concurrency control algorithms on a main-memory DBMS and using computer simulations scaled our system to 1024 cores. Our analysis shows that all algorithms fail to scale to this magnitude but for different reasons. In each case, we identify fundamental bottlenecks that are independent of the particular database implementation and argue that even state-of-the-art DBMSs suffer from these limitations. We conclude that rather than pursuing incremental solutions, many-core chips may require a completely redesigned DBMS architecture that is built from the ground up and is tightly coupled with the hardware.
    Preview · Article · Nov 2014 · Proceedings of the VLDB Endowment
  • No preview · Article · Nov 2014
  • ABSTRACT: This paper describes the use of physical unclonable functions (PUFs) in low-cost authentication and key generation applications. First, it motivates the use of PUFs versus conventional secure nonvolatile memories and defines the two primary PUF types: “strong PUFs” and “weak PUFs.” It describes strong PUF implementations and their use for low-cost authentication. After this description, the paper covers both attacks and protocols to address errors. Next, the paper covers weak PUF implementations and their use in key generation applications. It covers error-correction schemes such as pattern matching and index-based coding. Finally, this paper reviews several emerging concepts in PUF technologies such as public model PUFs and new PUF implementation technologies.
    Full-text · Article · Aug 2014 · Proceedings of the IEEE
  • G. Edward Suh · George Kurian · Srinivas Devadas · Larry Rudolph
    ABSTRACT: This paper presents the author retrospective on the analytical cache modeling work published in the 2001 International Conference on Supercomputing (ICS). We summarize the history of the work, revisit primary observations and lessons that we learned from the modeling effort, and also briefly describe follow-up work to show how the research direction evolved over time. Original Paper:
    No preview · Article · Jun 2014
  • ABSTRACT: This paper details the design and application of a new ultra-high speed real-time simulation for Hardware-in-the-Loop (HiL) testing and design of high-power power electronics systems. Our real-time hardware emulation for HiL system is based on a custom, heterogeneous, reconfigurable, multicore processor design that emulates power electronics, and includes a circuit compiler that translates graphic system models into processor executable machine code. We present digital processor architecture details, and describe the process of power electronic circuit compilation. This approach to real-time emulation yields real-time execution on the order of 1 µs simulation time step (including input/output latency) for a broad class of power electronics converters. In addition, we present HiL simulation experimental results for three representative systems: namely, a variable speed induction motor drive, a utility grid connected photovoltaic converter system, and a hybrid electric vehicle motor drive.
    Preview · Article · May 2014
  • Meng-Day Yu · David M'Raihi · Ingrid Verbauwhede · Srinivas Devadas
    ABSTRACT: Physical Unclonable Functions (PUFs) allow a silicon device to be authenticated based on its manufacturing variations using challenge/response evaluations. Popular realizations use linear additive functions as building blocks. Security is scaled up using non-linear mixing (e.g., adding XORs). Because the responses are physically derived and thus noisy, the resulting explosion in noise impacts both the adversary (which is desirable) as well as the verifier (which is undesirable). We present the first architecture for linear additive physical functions where the noise seen by the adversary and the noise seen by the verifier are bifurcated by using a randomized decimation technique and a novel response recovery method at an authentication verification server. We allow the adversary's noise ηa → 0.50 while keeping the verifier's noise ηv constant, using a parameter-based authentication modality that does not require explicit challenge/response pair storage at the server. We present supporting data using 28nm FPGA PUF noise results as well as machine learning attack results. We demonstrate that our architecture can also withstand recent side-channel attacks that filter the noise (to clean up training challenge/response labels) prior to machine learning.
    Full-text · Conference Paper · May 2014
  • ABSTRACT: This paper described recent improvements to the Graphite simulator designed to help explore current and emerging research topics. With these improvements, Graphite is ideally suited to explore both power and performance in future multicore and manycore processors, especially those incorporating dynamic runtime monitoring and adaptation. Separate validation of Graphite has shown performance results within about 6% on average (18% worst case) of a cycle-level simulator and normalized power trends are predicted to within 10%. This makes Graphite accurate enough for medium- to long-term studies while maintaining very high performance. Graphite is freely available for anyone to use:
    No preview · Conference Paper · Mar 2014

Publication Stats

13k Citations
152.45 Total Impact Points


  • 1988-2014
    • Massachusetts Institute of Technology
      • Laboratory for Computer Science
      • Computer Science and Artificial Intelligence Laboratory
      • Department of Electrical Engineering and Computer Science
      Cambridge, Massachusetts, United States
  • 2013
    • Idenix Pharmaceuticals, Inc.
      Cambridge, Massachusetts, United States
  • 2012
    • RSA Laboratories
      Cambridge, Massachusetts, United States
    • McGill University
      • McGill Centre for Bioinformatics
      Montréal, Quebec, Canada
  • 2004-2012
    • Distributed Artificial Intelligence Laboratory
      Berlin, Berlin, Germany
  • 2007
    • Cornell University
      • Department of Electrical and Computer Engineering
      Ithaca, New York, United States
  • 1986-2001
    • University of California, Berkeley
      • Department of Electrical Engineering and Computer Sciences
      Berkeley, California, United States
  • 1998
    • MIT Portugal
      Porto Salvo, Lisbon, Portugal
    • The University of Arizona
      Tucson, Arizona, United States
  • 1997
    • Institute for Systems and Computer Engineering of Porto (INESC Porto)
      Oporto, Porto, Portugal
  • 1993-1997
    • Princeton University
      • Department of Electrical Engineering
      Princeton, New Jersey, United States
  • 1992
    • Synopsys
      Mountain View, California, United States