Milos Prvulovic

Georgia Institute of Technology, Atlanta, Georgia, United States

Publications (52) · 4.76 Total Impact Points

  • A. Zajic, M. Prvulovic
    ABSTRACT: This paper shows that electromagnetic (EM) information leakage from modern laptops and desktops (with no peripherals attached) is indeed possible and is relatively easy to achieve. The experiments are performed on three laptop systems and one desktop system with different processors (Intel Centrino, Core 2, Core i7, and AMD Turion), and show that both active (program deliberately tries to cause emanations at a particular frequency) and passive (emanations at different frequencies happen as a result of system activity) EM side-channel attacks are possible on all the systems we tested. Furthermore, this paper shows that EM information leakage can reliably be received at distances that vary from tens of centimeters to several meters including the signals that have propagated through cubicle or structural walls. Finally, this paper shows how activity levels and data values used in accessing different parts of the memory subsystem (off-chip memory and each level of on-chip caches) affect the transmission distance.
    IEEE Transactions on Electromagnetic Compatibility 01/2014; 56(4):885-893. · 1.33 Impact Factor
  • Jungju Oh, A. Zajic, M. Prvulovic
    ABSTRACT: Growth in core count creates an increasing demand for interconnect bandwidth, driving a change from shared buses to packet-switched on-chip interconnects. However, this increases the latency between cores separated by many links and switches. In this paper, we show that a low-latency unswitched interconnect built with transmission lines can be synergistically used with a high-throughput switched interconnect. First, we design a broadcast ring as a chain of unidirectional transmission line structures with very low latency but limited throughput. Then, we create a new adaptive packet steering policy that judiciously uses the limited throughput of this ring by balancing expected latency benefit and ring utilization. Although the ring uses 1.3% of the on-chip metal area, our experimental results show that, in combination with our steering, it provides an execution time reduction of 12.4% over a mesh-only baseline.
    Parallel Architectures and Compilation Techniques (PACT), 2013 22nd International Conference on; 01/2013
  •
    ABSTRACT: Programs written in languages allowing direct access to memory through pointers often contain memory-related faults, which cause nondeterministic failures and security vulnerabilities. We present a new dynamic tainting technique to detect illegal memory accesses. When memory is allocated, at runtime, we taint both the memory and the corresponding pointer using the same taint mark. Taint marks are then propagated and checked every time a memory address m is accessed through a pointer p; if the associated taint marks differ, an illegal access is reported. To allow always-on checking using a low overhead, hardware-assisted implementation, we make several key technical decisions. We use a configurable, low number of reusable taint marks instead of a unique mark for each allocated area of memory, reducing the performance overhead without losing the ability to target most memory-related faults. We also define the technique at the binary level, which helps handle applications using third-party libraries whose source code is unavailable. We created a software-only prototype of our technique and simulated a hardware-assisted implementation. Our results show that 1) it identifies a large class of memory-related faults, even when using only two unique taint marks, and 2) a hardware-assisted implementation can achieve performance overheads in single-digit percentages.
    IEEE Transactions on Computers 02/2012; · 1.38 Impact Factor
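The taint-mark check this abstract describes can be sketched in a few lines. The sketch below is an illustrative software model with invented names (`TaintedHeap`, `NUM_MARKS`), not the paper's hardware-assisted implementation: allocation gives a memory region and the pointer into it the same reusable mark, and every access compares the pointer's mark against the region's.

```python
NUM_MARKS = 2  # the paper reports even two reusable marks catch many faults

class TaintedHeap:
    def __init__(self):
        self.regions = {}     # (lo, hi) address range -> taint mark
        self.next_mark = 0

    def malloc(self, base, size):
        mark = self.next_mark % NUM_MARKS   # reuse marks from a small pool
        self.next_mark += 1
        self.regions[(base, base + size)] = mark
        return base, mark                   # the "pointer" carries its mark

    def access(self, addr, ptr_mark):
        for (lo, hi), mem_mark in self.regions.items():
            if lo <= addr < hi:
                if mem_mark != ptr_mark:
                    raise MemoryError("taint marks differ: illegal access")
                return True
        raise MemoryError("access to unallocated memory")

heap = TaintedHeap()
p, pm = heap.malloc(0x1000, 64)
q, qm = heap.malloc(0x2000, 64)
heap.access(p + 8, pm)       # legal: pointer and region share a mark
try:
    heap.access(q, pm)       # overflow from p's buffer into q's region
    caught = False
except MemoryError:
    caught = True
assert caught
```

Note how an overflow into an adjacent allocation is caught as long as the two regions happen to hold different marks, which is why even a small mark pool targets most such faults.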
  • Ioannis Doudalis, Milos Prvulovic
    ABSTRACT: Bidirectional debugging and error recovery have different goals (programmer productivity and system reliability, respectively), yet they both require the ability to roll the program or the system back to a past state. This rollback functionality is typically implemented using checkpoints that can restore the system/application to a specific point in time. There are several types of checkpoints, and bidirectional debugging and error recovery use them in different ways. This paper presents Euripus, a flexible hardware accelerator for memory checkpointing which can create the different combinations of checkpoints needed for bidirectional debugging, error recovery, or both. In particular, Euripus is the first hardware technique to provide consolidation-friendly undo-logs (for bidirectional debugging), to allow simultaneous construction of both undo and redo logs, and to support multi-level checkpointing for the needs of error recovery. Euripus incurs low performance overheads, and supports rapid multi-level error recovery that allows >95% system efficiency even with very high error rates.
    01/2012;
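The two log types the abstract contrasts can be sketched as follows (an illustrative software model, not the Euripus hardware): an undo log records each overwritten old value so a checkpoint can be rolled back for bidirectional debugging, while a redo log records new values so execution can be replayed forward for recovery.

```python
class CheckpointedMemory:
    def __init__(self, data):
        self.data = dict(data)
        self.undo_log = []   # (addr, old value) entries, in write order
        self.redo_log = []   # (addr, new value) entries, in write order

    def write(self, addr, value):
        self.undo_log.append((addr, self.data[addr]))  # old value for undo
        self.redo_log.append((addr, value))            # new value for redo
        self.data[addr] = value

    def rollback(self):
        # undo entries are applied newest-first to restore the checkpoint
        for addr, old in reversed(self.undo_log):
            self.data[addr] = old
        self.undo_log.clear()

mem = CheckpointedMemory({0x10: 1, 0x20: 2})
mem.write(0x10, 9)
mem.write(0x10, 7)
mem.rollback()
assert mem.data == {0x10: 1, 0x20: 2}
```

Building both logs in one pass through `write` is the "simultaneous construction" the abstract highlights as novel for hardware.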
  • Jungju Oh, Milos Prvulovic, Alenka G. Zajic
    ABSTRACT: As the number of cores on a single-chip grows, scalable barrier synchronization becomes increasingly difficult to implement. In software implementations, such as the tournament barrier, a larger number of cores results in a longer latency for each round and a larger number of rounds. Hardware barrier implementations require significant dedicated wiring, e.g., using a reduction (arrival) tree and a notification (release) tree, and multiple instances of this wiring are needed to support multiple barriers (e.g., when concurrently executing multiple parallel applications). This paper presents TLSync, a novel hardware barrier implementation that uses the high-frequency part of the spectrum in a transmission-line broadcast network, thus leaving the transmission line network free for non-modulated (baseband) data transmission. In contrast to other implementations of hardware barriers, TLSync allows multiple thread groups to each have its own barrier. This is accomplished by allocating different bands in the radio-frequency spectrum to different groups. Our circuit-level and electromagnetic models show that the worst-case latency for a TLSync barrier is 4ns to 10ns, depending on the size of the frequency band allocated to each group, and our cycle-accurate architectural simulations show that low-latency TLSync barriers provide significant performance and scalability benefits to barrier-intensive applications.
    38th International Symposium on Computer Architecture (ISCA 2011), June 4-8, 2011, San Jose, CA, USA; 01/2011
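For context on the software baseline the abstract argues against, here is a minimal centralized sense-reversing barrier (an illustrative sketch, unrelated to TLSync's transmission-line hardware): its single shared counter is exactly the kind of contention point that limits scalability as core counts grow.

```python
import threading, time

class SenseBarrier:
    def __init__(self, n):
        self.n, self.count, self.sense = n, n, False
        self.lock = threading.Lock()

    def wait(self, local_sense):
        with self.lock:
            self.count -= 1
            if self.count == 0:           # last arriver releases everyone
                self.count = self.n
                self.sense = local_sense
                return
        while self.sense != local_sense:  # all other threads spin
            time.sleep(0)

N, EPISODES = 4, 3
bar, done = SenseBarrier(N), []

def worker():
    sense = False
    for _ in range(EPISODES):
        sense = not sense    # flip the local sense each barrier episode
        bar.wait(sense)
    done.append(True)

threads = [threading.Thread(target=worker) for _ in range(N)]
for t in threads: t.start()
for t in threads: t.join()
assert len(done) == N
```

Every arrival serializes on one lock, so latency grows with the thread count; tournament barriers reduce contention but pay log2(N) rounds instead, which is the trade-off TLSync's broadcast medium avoids.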
  •
    ABSTRACT: As computing becomes increasingly dispersed across mobile devices, distributed systems, and the cloud, there is a growing threat of adversaries obtaining physical access to some of the computer systems through theft or security breaches. With such an untrusted computing node, a key challenge is how to provide a secure computing environment that ensures privacy and integrity for the data and code of the application. We propose SecureME, a hardware-software mechanism that provides such a secure computing environment. SecureME protects an application from hardware attacks by using a secure processor substrate, and also from the Operating System (OS) through memory cloaking, permission paging, and system call protection. Memory cloaking hides data from the OS but allows the OS to perform regular virtual memory management functions, such as page initialization, copying, and swapping. Permission paging extends the OS paging mechanism to provide a secure way for two applications to establish shared pages for inter-process communication. Finally, system call protection applies spatio-temporal protection for arguments that are passed between the application and the OS. Based on our performance evaluation using microbenchmarks, single-program workloads, and multiprogrammed workloads, we found that SecureME only adds a small execution time overhead compared to a fully unprotected system. Roughly half of the overheads are contributed by the secure processor substrate. SecureME also incurs a negligible additional storage overhead over the secure processor substrate.
    Proceedings of the 25th International Conference on Supercomputing, 2011, Tucson, AZ, USA, May 31 - June 04, 2011; 01/2011
  •
    ABSTRACT: While multicore processors promise large performance benefits for parallel applications, writing these applications is notoriously difficult. Tuning a parallel application to achieve good performance, also known as performance debugging, is often more challenging than debugging the application for correctness. Parallel programs have many performance-related issues that are not seen in sequential programs. An increase in cache misses is one of the biggest challenges that programmers face. To minimize these misses, programmers must not only identify the source of the extra misses, but also perform the tricky task of determining if the misses are caused by interthread communication (i.e., coherence misses) and if so, whether they are caused by true or false sharing (since the solutions for these two are quite different). In this article, we propose a new programmer-centric definition of false sharing misses and describe our novel algorithm to perform coherence miss classification. We contrast our approach with existing data-centric definitions of false sharing. A straightforward implementation of our algorithm is too expensive to be incorporated in real hardware. Therefore, we explore the design space for low-cost hardware support that can classify coherence misses on-the-fly into true and false sharing misses, allowing existing performance counters and profiling tools to expose and attribute them. We find that our approximate schemes achieve good accuracy at only a fraction of the cost of the ideal scheme. Additionally, we demonstrate the usefulness of our work in a case study involving a real application.
    ACM Transactions on Architecture and Code Optimization 01/2011; 8:8. · 0.68 Impact Factor
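The true-vs-false-sharing distinction the abstract draws can be sketched simply (a hedged illustration, not the paper's classification algorithm or its hardware approximations): a coherence miss is true sharing if the accessing thread uses a word another thread actually wrote, and false sharing if only other words in the same cache line were written.

```python
LINE_WORDS = 8  # assumed words per cache line, for illustration

def classify_miss(accessed_word, words_written_by_others):
    """Classify an access given the set of words other threads wrote."""
    line = accessed_word // LINE_WORDS
    same_line = {w for w in words_written_by_others
                 if w // LINE_WORDS == line}
    if not same_line:
        return "not a coherence miss"   # no other-thread write to this line
    if accessed_word in same_line:
        return "true sharing"           # the accessed word itself was written
    return "false sharing"              # only neighboring words were written

assert classify_miss(3, {3, 20}) == "true sharing"
assert classify_miss(3, {5})     == "false sharing"
assert classify_miss(3, {20})    == "not a coherence miss"
```

The remedies differ, as the abstract notes: true sharing calls for restructuring communication, while false sharing is usually fixed by padding or realigning data so unrelated words no longer share a line.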
  •
    ABSTRACT: With the ubiquity of multi-core processors, software must make effective use of multiple cores to obtain good performance on modern hardware. One of the biggest roadblocks to this is load imbalance, or the uneven distribution of work across cores. We propose LIME, a framework for analyzing parallel programs and reporting the cause of load imbalance in application source code. This framework uses statistical techniques to pinpoint load imbalance problems stemming from both control flow issues (e.g., unequal iteration counts) and interactions between the application and hardware (e.g., unequal cache miss counts). We evaluate LIME on applications from widely used parallel benchmark suites, and show that LIME accurately reports the causes of load imbalance, their nature and origin in the code, and their relative importance.
    Proceedings of the 33rd International Conference on Software Engineering, ICSE 2011, Waikiki, Honolulu, HI, USA, May 21-28, 2011; 01/2011
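A load-imbalance measurement in the spirit of the abstract can be sketched as below; the specific metric (max over mean, minus one) is an assumed convention for illustration, not necessarily the statistic LIME reports.

```python
def load_imbalance(per_core_work):
    """Fraction of extra time the busiest core adds over a perfect balance.

    0.0 means perfectly balanced; 0.6 means the busiest core does 60% more
    work than the average, so the parallel region runs ~60% longer than ideal.
    """
    mean = sum(per_core_work) / len(per_core_work)
    return max(per_core_work) / mean - 1.0

balanced   = [100, 100, 100, 100]   # e.g., equal iteration counts
imbalanced = [160, 80, 80, 80]      # one core does twice the others' work

assert load_imbalance(balanced) == 0.0
assert abs(load_imbalance(imbalanced) - 0.6) < 1e-9
```

The same formula applies whether `per_core_work` holds iteration counts (a control-flow cause) or cache miss counts (a hardware-interaction cause), mirroring the two sources of imbalance the abstract distinguishes.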
  • I. Doudalis, M. Prvulovic
    ABSTRACT: Bidirectional execution is a powerful debugging technique that allows program execution to proceed both forward and in reverse. Many software-only techniques and tools have emerged that use checkpointing and replay to provide the effect of reverse execution, although with considerable performance overheads in both forward and reverse execution. Recent hardware proposals for checkpointing and execution replay minimize these performance overheads, but in a way that prevents checkpoint consolidation, a key technique for reducing memory use while retaining the ability to reverse long periods of execution. This paper presents HARE, a hardware technique that efficiently supports both checkpointing and consolidation. Our experiments show that on average HARE incurs <3% performance overheads even when creating tens of checkpoints per second, provides reverse execution times similar to forward execution times, and reduces the total space used by checkpoints by a factor of 36 on average (this factor gets better for longer runs) relative to prior consolidation-less hardware checkpointing schemes.
    High Performance Computer Architecture (HPCA), 2010 IEEE 16th International Symposium on; 02/2010
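Checkpoint consolidation, the key capability the abstract highlights, can be sketched as merging two adjacent undo logs (an illustrative model, not HARE's mechanism): per address, only the oldest recorded value matters for rolling back past both checkpoints, so the merged log is often far smaller than the two originals.

```python
def consolidate(older_log, newer_log):
    """Merge two adjacent undo logs (dicts of addr -> old value).

    Rolling back past both checkpoints must restore the OLDEST value seen
    for each address, so entries from the older log win on conflicts.
    """
    merged = dict(newer_log)   # start with the newer checkpoint's entries...
    merged.update(older_log)   # ...and let older values override per address
    return merged

older = {0x10: 1, 0x20: 2}     # undo entries from the earlier interval
newer = {0x10: 5, 0x30: 7}     # undo entries from the later interval

# 0x10 keeps its oldest value (1); three entries instead of four.
assert consolidate(older, newer) == {0x10: 1, 0x20: 2, 0x30: 7}
```

Repeatedly consolidating older checkpoints this way is what lets a fixed memory budget cover ever-longer spans of reversible execution.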
  •
    ABSTRACT: Easing the programmer's burden does not compromise system performance or increase the complexity of hardware implementation.
    Commun. ACM. 01/2009; 52:58-65.
  •
    ABSTRACT: In today's digital world, computer security issues have become increasingly important. In particular, researchers have proposed designs for secure processors that utilize hardware-based memory encryption and integrity verification to protect the privacy and integrity of computation even from sophisticated physical attacks. However, currently proposed schemes remain hampered by problems that make them impractical for use in today's computer systems: lack of virtual memory and Inter-Process Communication support as well as excessive storage and performance overheads. In this article, we propose (1) address independent seed encryption (AISE), a counter-mode-based memory encryption scheme using a novel seed composition, and (2) bonsai Merkle trees (BMT), a novel Merkle tree-based memory integrity verification technique, to eliminate these system and performance issues associated with prior counter-mode memory encryption and Merkle tree integrity verification schemes. We present both a qualitative discussion and a quantitative analysis to illustrate the advantages of our techniques over previously proposed approaches in terms of complexity, feasibility, performance, and storage. Our results show that AISE+BMT reduces the overhead of prior memory encryption and integrity verification schemes from 12% to 2% on average for single-threaded benchmarks on uniprocessor systems, and from 15% to 4% for coscheduled benchmarks on multicore systems while eliminating critical system-level problems.
    TACO. 01/2009; 5.
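The counter-mode idea underlying AISE can be illustrated with a toy sketch (the stand-in "cipher" below is a truncated hash, and the seed composition is an assumption; the real scheme is in the paper): a per-block seed is encrypted to produce a one-time pad, and data is simply XORed with the pad, which is why pad generation can overlap the memory fetch.

```python
import hashlib

def pad(key, seed, nbytes):
    # Stand-in for a block cipher: hash(key || seed), truncated. Do NOT use
    # this construction for real cryptography; it only shows the data flow.
    return hashlib.sha256(key + seed.to_bytes(16, "big")).digest()[:nbytes]

def encrypt(key, seed, data):
    # Counter mode: ciphertext = plaintext XOR pad(seed). Decryption is the
    # same operation, since XOR with the pad is its own inverse.
    return bytes(d ^ p for d, p in zip(data, pad(key, seed, len(data))))

key = b"secret-key"
seed = 42          # e.g., composed from a logical id and a per-block counter
ct = encrypt(key, seed, b"deadbeef")
assert ct != b"deadbeef"
assert encrypt(key, seed, ct) == b"deadbeef"
```

AISE's contribution is making the seed independent of the physical address, which is what lets pages move (virtual memory, IPC sharing) without re-encryption; the sketch above only shows why counter mode hides decryption latency.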
  •
    ABSTRACT: Memory bugs are a broad class of bugs that is becoming increasingly common with increasing software complexity, and many of these bugs are also security vulnerabilities. Existing software and hardware approaches for finding and identifying memory bugs have a number of drawbacks, including considerable performance overheads, coverage limited to a specific type of bug, high implementation cost, and inefficient use of computational resources. This article describes MemTracker, a new hardware support mechanism that can be configured to perform different kinds of memory access monitoring tasks. MemTracker associates each word of data in memory with a few bits of state, and uses a programmable state transition table to react to different events that can affect this state. The number of state bits per word, the events to which MemTracker reacts, and the transition table are all fully programmable. MemTracker's rich set of states, events, and transitions can be used to implement different monitoring and debugging checkers with minimal performance overheads, even when frequent state updates are needed. To evaluate MemTracker, we map three different checkers onto it, as well as a checker that combines all three. For the most demanding (combined) checker with 8 bits of state per memory word, we observe performance overheads of only around 3%, on average, and 14.5% worst-case across different benchmark suites. Such low overheads allow continuous (always-on) use of MemTracker-enabled checkers, even in production runs.
    TACO. 01/2009; 6.
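A MemTracker-style checker can be sketched as a programmable (state, event) table; the particular states and events below are invented for illustration and are not the paper's encoding. Each memory word carries a few state bits, and the table maps every (state, event) pair to a new state plus an error flag.

```python
# Example 2-bit states for a heap checker (illustrative names).
UNALLOC, UNINIT, INIT = 0, 1, 2

# Programmable transition table: (state, event) -> (new state, error?).
TABLE = {
    (UNALLOC, "alloc"): (UNINIT, False),
    (UNALLOC, "read"):  (UNALLOC, True),   # read of unallocated memory
    (UNALLOC, "write"): (UNALLOC, True),   # write to unallocated memory
    (UNINIT,  "read"):  (UNINIT,  True),   # read of uninitialized word
    (UNINIT,  "write"): (INIT,    False),
    (UNINIT,  "free"):  (UNALLOC, False),
    (INIT,    "read"):  (INIT,    False),
    (INIT,    "write"): (INIT,    False),
    (INIT,    "free"):  (UNALLOC, False),
}

def step(state, event):
    """One checker step: look up the transition for this word's state."""
    return TABLE[(state, event)]

s, err = step(UNALLOC, "alloc"); assert not err
s, err = step(s, "read");        assert err           # caught: uninit read
s, err = step(s, "write");       assert not err and s == INIT
```

Because the table itself is programmable, the same hardware implements different checkers (leak detection, bounds-adjacent checks, etc.) just by loading different states, events, and transitions.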
  •
    ABSTRACT: This paper presents FlexiTaint, a hardware accelerator for dynamic taint propagation. FlexiTaint is implemented as an in-order addition to the back-end of the processor pipeline, and the taints for memory locations are stored as a packed array in regular memory. The taint propagation scheme is specified via a software handler that, given the operation and the sources' taints, computes the new taint for the result. To keep performance overheads low, FlexiTaint caches recent taint propagation lookups and uses a filter to avoid lookups for simple common-case behavior. We also describe how to implement consistent taint propagation in a multi-core environment. Our experiments show that FlexiTaint incurs average performance overheads of only 1% for SPEC2000 benchmarks and 3.7% for Splash-2 benchmarks, even when simultaneously following two different taint propagation policies.
    High Performance Computer Architecture, 2008. HPCA 2008. IEEE 14th International Symposium on; 03/2008
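The software-handler-plus-lookup-cache structure the abstract describes can be sketched as below; the propagation policy and the use of `lru_cache` as a stand-in for the hardware lookup cache are illustrative assumptions, not FlexiTaint's actual design details.

```python
from functools import lru_cache

@lru_cache(maxsize=256)          # stands in for caching recent lookups
def propagate(op, taint_a, taint_b):
    """Software handler: compute the result taint from op and source taints.

    Example policy: taint flows through ordinary operations as the union
    (bitwise OR) of source taints, but an idiom like x ^ x always produces
    a constant, so its result is untainted regardless of inputs.
    """
    if op == "xor_same_reg":
        return 0
    return taint_a | taint_b

assert propagate("add", 1, 0) == 1            # taint flows to the result
assert propagate("xor_same_reg", 1, 1) == 0   # zeroing idiom clears taint
```

Keeping the policy in a software handler while caching its results is what lets one accelerator follow multiple, even simultaneous, propagation policies at low overhead.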
  • S. Subramaniam, M. Prvulovic, G.H. Loh
    ABSTRACT: Simultaneous multithreading (SMT) attempts to keep a dynamically scheduled processor's resources busy with work from multiple independent threads. Threads with long-latency stalls, however, can lead to a reduction in overall throughput because they occupy many of the critical processor resources. In this work, we first study the interaction between stalls caused by ambiguous memory dependences and SMT processing. We then propose the technique of proactive exclusion (PE) where the SMT fetch unit stops fetching from a thread when a memory dependence is predicted to exist. However, after the dependence has been resolved, the thread is delayed waiting for new instructions to be fetched and delivered down the front-end pipeline. So we introduce an early parole (EP) mechanism that exploits the predictability of dependence-resolution delays to restart fetch of an excluded thread so that the instructions reach the execution core just as the original dependence resolves. We show that combining these two techniques (PEEP) yields a 16.9% throughput improvement on a 4-way SMT processor that supports speculative memory disambiguation. These strong results indicate that a fetch policy that is cognizant of future stalls considerably improves the throughput of an SMT machine.
    High Performance Computer Architecture, 2008. HPCA 2008. IEEE 14th International Symposium on; 03/2008
  •
    ABSTRACT: Multiprocessor computer systems are currently widely used in commercial settings to run critical applications. These applications often operate on sensitive data such as customer records, credit card numbers, and financial data. As a result, these systems are the frequent targets of attacks because of the potentially significant gain an attacker could obtain from stealing or tampering with such data. This provides strong motivation to protect the confidentiality and integrity of data in commercial multiprocessor systems through architectural support. Architectural support is able to protect against software-based attacks, and is necessary to protect against hardware-based attacks. In this work, we propose architectural mechanisms to ensure data confidentiality and integrity in Distributed Shared Memory multiprocessors which utilize a point-to-point based interconnection network. Our approach improves upon previous work in this area, mainly in the fact that our approach reduces performance overheads by significantly reducing the amount of cryptographic operations required. Evaluation results show that our approach can protect data confidentiality and integrity in a 16-processor DSM system with an average overhead of 1.6% and a maximum of only 7% across all SPLASH-2 applications.
    14th International Conference on High-Performance Computer Architecture (HPCA-14 2008), 16-20 February 2008, Salt Lake City, UT, USA; 01/2008
  •
    ABSTRACT: In today's digital world, computer security issues have become increasingly important. In particular, researchers have proposed designs for secure processors which utilize hardware-based memory encryption and integrity verification to protect the privacy and integrity of computation even from sophisticated physical attacks. However, currently proposed schemes remain hampered by problems that make them impractical for use in today's computer systems: lack of virtual memory and Inter-Process Communication support as well as excessive storage and performance overheads. In this paper, we propose 1) Address Independent Seed Encryption (AISE), a counter-mode based memory encryption scheme using a novel seed composition, and 2) Bonsai Merkle Trees (BMT), a novel Merkle Tree-based memory integrity verification technique, to eliminate these system and performance issues associated with prior counter-mode memory encryption and Merkle Tree integrity verification schemes. We present both a qualitative discussion and a quantitative analysis to illustrate the advantages of our techniques over previously proposed approaches in terms of complexity, feasibility, performance, and storage. Our results show that AISE+BMT reduces the overhead of prior memory encryption and integrity verification schemes from 12% to 2% on average, while eliminating critical system-level problems.
    Microarchitecture, 2007. MICRO 2007. 40th Annual IEEE/ACM International Symposium on; 01/2008
  •
    ABSTRACT: Programs written in languages that provide direct access to memory through pointers often contain memory-related faults, which may cause non-deterministic failures and even security vulnerabilities. In this paper, we present a new technique based on dynamic tainting for protecting programs from illegal memory accesses. When memory is allocated, at runtime, our technique taints both the memory and the corresponding pointer using the same taint mark. Taint marks are then suitably propagated while the program executes and are checked every time a memory address m is accessed through a pointer p; if the taint marks associated with m and p differ, the execution is stopped and the illegal access is reported. To allow for a low-overhead, hardware-assisted implementation of the approach, we make several key technical and engineering decisions in the definition of our technique. In particular, we use a configurable, low number of reusable taint marks instead of a unique mark for each area of memory allocated, which reduces the overhead of the approach without limiting its flexibility and ability to target most memory-related faults and attacks known to date. We also define the technique at the binary level, which lets us handle the (very) common case of applications that use third-party libraries whose source code is unavailable. To investigate the effectiveness and practicality of our approach, we implemented it for heap-allocated memory and performed a preliminary empirical study on a set of programs. Our results show that (1) our technique can identify a large class of memory-related faults, even when using only two unique taint marks, and (2) a hardware-assisted implementation of the technique could achieve overhead in the single digits.
    22nd IEEE/ACM International Conference on Automated Software Engineering (ASE 2007), November 5-9, 2007, Atlanta, Georgia, USA; 01/2007
  •
    ABSTRACT: Memory bugs are a broad class of bugs that is becoming increasingly common with increasing software complexity, and many of these bugs are also security vulnerabilities. Unfortunately, existing software and even hardware approaches for finding and identifying memory bugs have considerable performance overheads, target only a narrow class of bugs, are costly to implement, or use computational resources inefficiently. This paper describes MemTracker, a new hardware support mechanism that can be configured to perform different kinds of memory access monitoring tasks. MemTracker associates each word of data in memory with a few bits of state, and uses a programmable state transition table to react to different events that can affect this state. The number of state bits per word, the events to which MemTracker reacts, and the transition table are all fully programmable. MemTracker's rich set of states, events, and transitions can be used to implement different monitoring and debugging checkers with minimal performance overheads, even when frequent state updates are needed. To evaluate MemTracker, we map three different checkers onto it, as well as a checker that combines all three. For the most demanding (combined) checker, we observe performance overheads of only 2.7% on average and 4.8% worst-case on SPEC 2000 applications. Such low overheads allow continuous (always-on) use of MemTracker-enabled checkers even in production runs.
    13th International Conference on High-Performance Computer Architecture (HPCA-13 2007), 10-14 February 2007, Phoenix, Arizona, USA; 01/2007
  •
    ABSTRACT: The ability to detect and pinpoint memory-related bugs in production runs is important because in-house testing may miss bugs. This paper presents HeapMon, a heap memory bug-detection scheme that has a very low performance overhead, is automatic, and is easy to deploy. HeapMon relies on two new techniques. First, it decouples application execution from bug monitoring, which executes as a helper thread on a separate core in a chip multiprocessor system. Second, it associates a filter bit with each cached word to safely and significantly reduce bug checking frequency—by 95% on average. We test the effectiveness of these techniques using existing and injected memory bugs in SPEC®2000 applications and show that HeapMon effectively detects and identifies most forms of heap memory bugs. Our results also indicate that the HeapMon performance overhead is only 5%, on average—orders of magnitude less than existing tools. Its overhead is also modest: 3.1% of the cache size and a 32-KB victim cache for on-chip filter bits and 6.2% of the allocated heap memory size for state bits, which are maintained by the helper thread as a software data structure.
    Ibm Journal of Research and Development 04/2006; · 0.69 Impact Factor
  • M. Prvulovic
    ABSTRACT: Chip-multiprocessors are becoming the dominant vehicle for general-purpose processing, and parallel software will be needed to effectively utilize them. This parallel software is notoriously prone to synchronization bugs, which are often difficult to detect and repeat for debugging. While data race detection and order-recording for deterministic replay are useful in debugging such problems, only order-recording schemes are lightweight, whereas data race detection support scales poorly and degrades performance significantly. This paper presents our CORD (cost-effective order-recording and data race detection) mechanism. It is similar in cost to prior order-recording mechanisms, but costs considerably less than prior schemes for data race detection. CORD also has a negligible performance overhead (0.4% on average) and detects most dynamic manifestations of synchronization problems (77% on average). Overall, CORD is fast enough to run always (even in performance-sensitive production runs) and provides the support programmers need to deal with the complexities of writing, debugging, and maintaining parallel software for future multi-threaded and multi-core machines.
    High-Performance Computer Architecture, 2006. The Twelfth International Symposium on; 03/2006

Publication Stats

1k Citations
4.76 Total Impact Points

Institutions

  • 2003–2013
    • Georgia Institute of Technology
      • School of Electrical & Computer Engineering
      Atlanta, Georgia, United States
    • University of Zaragoza
      Zaragoza, Aragon, Spain
  • 2008
    • Georgia State University
      Atlanta, Georgia, United States
  • 2007–2008
    • North Carolina State University
      • Department of Electrical and Computer Engineering
      Raleigh, NC, United States
  • 2001–2003
    • University of Illinois, Urbana-Champaign
      • Department of Computer Science
      Urbana, IL, United States
    • Hewlett-Packard
      Palo Alto, California, United States
  • 2002
    • Cornell University
      Ithaca, New York, United States