Murali Annavaram

University of Southern California, Los Angeles, California, United States

Publications (64) · 19.56 total impact

  • Daniel Wong, Murali Annavaram
    ABSTRACT: Cluster-level packing techniques have long been used to improve the energy proportionality of server clusters by masking the poor energy proportionality of individual servers. With the emergence of high energy proportional servers, we revisit whether cluster-level packing techniques are still the most effective way to achieve high cluster-wide energy proportionality. Our findings indicate that cluster-level packing techniques can eventually limit cluster-wide energy proportionality and it may be more beneficial to depend solely on server-level low power techniques. Server-level low power techniques generally require a high latency slack to be effective due to diminishing idle periods as server core count increases. In order for server-level low power techniques to be a viable alternative, the latency slack required for these techniques must be lowered. We found that server-level active low power modes offer the lowest latency slack, independent of server core count, and propose low power mode switching policies to meet the best-case latency slack under realistic conditions. By overcoming these major issues, we show that server-level low power modes can be a viable alternative to cluster-level packing techniques in providing high cluster-wide energy proportionality.
    2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA); 02/2014
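A minimal sketch of the kind of server-level low power mode switching policy the abstract above argues for, assuming a hypothetical set of active low power modes and an invented latency-slack check; none of the constants or the cost model come from the paper:

```python
# Illustrative sketch only: pick the lowest-power active mode that still
# meets the latency slack. Modes, overheads, and the delay model are
# hypothetical assumptions, not values from the paper.

def choose_power_mode(current_load, latency_slack_ms, switch_overhead_ms=5.0):
    """Pick an active low-power mode that still meets the latency slack."""
    # Modes: (name, relative performance, relative power) -- invented numbers.
    modes = [("full", 1.0, 1.0), ("half", 0.5, 0.4), ("quarter", 0.25, 0.2)]
    best = modes[0]
    for name, perf, power in modes:
        # Estimated extra delay if we run slower than full speed.
        est_delay_ms = switch_overhead_ms + current_load / perf
        if est_delay_ms <= latency_slack_ms and power < best[2]:
            best = (name, perf, power)
    return best[0]
```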
  • Waleed Dweik, Murali Annavaram, Michel Dubois
    ABSTRACT: In future technology nodes, reliability is expected to become a first-order design constraint. Faults encountered in a chip can be classified into three categories: transient, intermittent, and permanent. Fault classification allows a chip to take the appropriate corrective action. Mechanisms have been proposed to distinguish transient from non-transient faults where all non-transient faults are handled as permanent. Intermittent faults induced by wearout phenomena have become the dominant reliability concern in nanoscale technology, yet there is no mechanism that provides finer classification of non-transient faults into intermittent and permanent faults. In this paper, we present a new class of exceptions called Reliability-Aware Exceptions (RAEs) which provide the ability to distinguish intermittent faults in microprocessor array structures. The RAE handlers have the ability to manipulate microprocessor array structures to recover from all three categories of faults. Using RAEs, we demonstrate that the reliability of two representative microarchitecture structures, load/store queue and reorder buffer in an out-of-order processor, is improved by average factors of 1.3 and 1.95, respectively.
    Design Automation and Test in Europe; 01/2014
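The fault taxonomy above lends itself to a small illustration. The sketch below classifies faults by how often they recur at the same array entry within a time window; the window and thresholds are invented assumptions, not the RAE mechanism itself:

```python
# Recurrence-based fault classification sketch: one-off faults are treated
# as transient, faults repeating within a window as intermittent, and
# persistently repeating faults as permanent. Thresholds are illustrative.

from collections import defaultdict

class FaultClassifier:
    def __init__(self, intermittent_window=1_000_000, permanent_threshold=10):
        self.window = intermittent_window        # cycles
        self.threshold = permanent_threshold     # repeat count
        self.history = defaultdict(list)         # entry -> fault cycle times

    def classify(self, entry_id, cycle):
        hits = self.history[entry_id]
        hits.append(cycle)
        recent = [c for c in hits if cycle - c <= self.window]
        if len(recent) >= self.threshold:
            return "permanent"      # persistent failures: retire the entry
        if len(recent) > 1:
            return "intermittent"   # recurring in a window: rest the entry
        return "transient"          # one-off: correct and continue
```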
  • M. Manoochehri, M. Annavaram, M. Dubois
    ABSTRACT: Due to shrinking feature sizes, processors are becoming more vulnerable to soft errors. One of the most vulnerable components of a processor is its write-back cache. This paper proposes a new reliable write-back cache called Correctable Parity Protected Cache (CPPC), which adds correction capability to parity protection. In CPPC, parity bits detect faults and the XOR of all data written into the cache is kept to recover from detected faults. The added correction scheme provides a high degree of reliability and corrects both single and spatial multi-bit faults in exchange for very small performance and power overheads. CPPC is compared to competitive schemes. Our simulation data show that CPPC improves reliability significantly while its overheads are very small, especially in the L2 cache.
    IEEE Transactions on Computers 01/2014; 63(10):2431-2444. · 1.38 Impact Factor
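The detection-plus-recovery idea in CPPC can be illustrated in miniature: parity flags a corrupted word, and the XOR of all cached words reconstructs it. This toy model over a dictionary is an assumption-laden paraphrase of the hardware scheme, not the paper's design:

```python
# Toy CPPC-style cache: per-word parity detects a fault; a running XOR of
# all cached words recovers the faulty one (assuming a single faulty word
# and that writes themselves were uncorrupted).

class ToyCPPC:
    def __init__(self):
        self.data = {}       # address -> word
        self.parity = {}     # address -> parity bit
        self.xor_sum = 0     # XOR of all currently cached words

    def write(self, addr, word):
        if addr in self.data:
            self.xor_sum ^= self.data[addr]   # remove old value
        self.data[addr] = word
        self.parity[addr] = bin(word).count("1") & 1
        self.xor_sum ^= word                  # add new value

    def read(self, addr):
        word = self.data[addr]
        if (bin(word).count("1") & 1) != self.parity[addr]:
            # Parity mismatch: rebuild the word by XOR-ing out all others.
            recovered = self.xor_sum
            for a, w in self.data.items():
                if a != addr:
                    recovered ^= w
            self.data[addr] = recovered
            return recovered
        return word
```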
  • Mohammad Abdel-Majeed, Daniel Wong, Murali Annavaram
    ABSTRACT: With the widespread adoption of GPGPUs in varied application domains, new opportunities open up to improve GPGPU energy efficiency. Due to inherent application-level inefficiencies, GPGPU execution units experience significant idle time. In this work we propose to power gate idle execution units to eliminate leakage power, which is becoming a significant concern with technology scaling. We show that GPGPU execution units are idle for short windows of time, and conventional microprocessor power gating techniques cannot fully exploit these idle windows due to power gating overhead. Current warp schedulers greedily intersperse integer and floating point instructions, which limits power gating opportunities for any given execution unit type. In order to improve power gating opportunities in GPGPU execution units, we propose a Gating Aware Two-level warp scheduler (GATES) that issues clusters of instructions of the same type before switching to another instruction type. We also propose a new power gating scheme, called Blackout, that forces a power gated execution unit to sleep for at least the break-even time necessary to overcome the power gating overhead before returning to the active state. The combination of GATES and Blackout, which we call Warped Gates, can save 31.6% and 46.5% of integer and floating point unit static energy, respectively. The proposed solutions incur less than 1% performance and area overhead.
    Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture; 12/2013
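A sketch of the Blackout policy described above, as one might model it in a simulator: once gated, the unit ignores work until the break-even time has elapsed. The idle-detect and break-even cycle counts are placeholders, not the paper's values:

```python
# Per-unit Blackout state machine sketch: gate after a few idle cycles,
# then enforce a minimum (break-even) sleep before waking. Cycle counts
# are made-up assumptions.

class BlackoutUnit:
    def __init__(self, idle_detect=5, break_even=14):
        self.idle_detect = idle_detect   # idle cycles before gating
        self.break_even = break_even     # minimum cycles to stay gated
        self.idle = 0
        self.gated_for = None            # None = awake

    def tick(self, has_work):
        if self.gated_for is not None:
            self.gated_for += 1
            # Forced sleep: ignore work until break-even time has passed.
            if self.gated_for >= self.break_even and has_work:
                self.gated_for = None    # wake up
                self.idle = 0
            return self.gated_for is None  # True if the unit can execute
        if has_work:
            self.idle = 0
            return True
        self.idle += 1
        if self.idle >= self.idle_detect:
            self.gated_for = 0           # enter the power-gated state
        return False
```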
  • K. Patel, M. Annavaram, M. Pedram
    ABSTRACT: Due to the prohibitive cost of data center setup and maintenance, many small-scale businesses rely on hosting centers to provide the cloud infrastructure to run their workloads. Hosting centers host services on behalf of their clients and guarantee quality of service as defined by service level agreements (SLAs). To reduce energy consumption and maximize profit, it is critical to optimally allocate resources to meet client SLAs. Optimal allocation is a nontrivial task due to 1) resource heterogeneity, where the energy consumption of a client task varies depending on the allocated resources, and 2) lack of energy proportionality, where the energy cost of a task varies with server utilization. In this paper, we introduce a generalized Network Flow-based Resource Allocation framework, called NFRA, for energy minimization and profit maximization. NFRA provides a unified framework to model profit maximization under a wide range of SLAs. We demonstrate the simplicity of this unified framework by deriving optimal resource allocations for three different SLAs. We derive workload demands and server energy consumption data from SPECweb2009 benchmark results to demonstrate the efficiency of the NFRA framework.
    IEEE Transactions on Computers 01/2013; 62(9):1772-1785. · 1.38 Impact Factor
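One way to read the NFRA formulation is as a min-cost max-flow problem from clients to servers with energy costs on the edges. The sketch below uses networkx's max_flow_min_cost on an invented bipartite graph; the paper's actual SLA models are richer than this:

```python
# Hedged sketch: clients -> servers as a min-cost max-flow problem, in the
# spirit of a network-flow resource-allocation framework. The graph, costs,
# and capacities are invented. Use integer costs/capacities -- networkx's
# solver expects them.

import networkx as nx

def allocate(clients, servers, energy_cost):
    """clients: {name: demand}; servers: {name: capacity};
    energy_cost: {(client, server): integer cost per unit of demand}."""
    G = nx.DiGraph()
    for c, demand in clients.items():
        G.add_edge("src", c, capacity=demand, weight=0)
    for (c, s), cost in energy_cost.items():
        G.add_edge(c, s, capacity=clients[c], weight=cost)
    for s, cap in servers.items():
        G.add_edge(s, "sink", capacity=cap, weight=0)
    flow = nx.max_flow_min_cost(G, "src", "sink")
    # Keep only the nonzero client->server assignments.
    return {(c, s): f for c in clients for s, f in flow[c].items() if f}
```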
  • Jinho Suh, M. Annavaram, M. Dubois
    ABSTRACT: In this paper, we introduce PHYS (Profiled-HYbrid Sampling), a sampling framework for soft-error benchmarking of caches. Reliability simulations of caches are much more complex than performance simulations and therefore exhibit large simulation slowdowns (two orders of magnitude) over performance simulations. The major problem is that the reliability lifetime of every accessed block must be tracked from beginning to end, on top of simulating the benchmark, in order to track the total number of vulnerability cycles (VCs) between two accesses to the block. Because of the need to track SDCs (silent data corruptions) and to distinguish between true and false DUEs (detected but unrecoverable errors), vulnerability cycles cannot be truncated when data is written back from cache to main memory. Vulnerability cycles must be maintained even during a block's sojourn in main memory to track whether corrupted values in a block are used by the processor, until program termination. PHYS solves this problem by sampling intervals between accesses to each memory block, instead of sampling the execution of the processor in a time interval as is classically done in performance simulations. First, a statistical profiling phase captures the distribution of VCs for every block. This profiling step provides a statistical guarantee of the minimum sampling rate of access intervals needed to meet a desired FIT error target with a given confidence interval. Then, per cache-set sampling rates are dynamically adjusted to sample VCs with higher merit. We compare PHYS with many other possible sampling methods, some of which are widely used to accelerate performance-centric simulations but have also been applied in the past to track reliability lifetime. We demonstrate the superiority of PHYS in the context of reliability benchmarking through exhaustive evaluations of various sampling techniques.
    Dependable Systems and Networks (DSN), 2013 43rd Annual IEEE/IFIP International Conference on; 01/2013
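The bookkeeping PHYS accelerates can be caricatured as follows: a vulnerability cycle count is the interval between consecutive accesses to a block, and sampling keeps only a fraction of those intervals. The fixed sampling rate below stands in for the paper's profiled, per-set rates:

```python
# Sketch of sampled vulnerability-cycle (VC) tracking: record only a random
# fraction of inter-access intervals, then scale up. The fixed rate is a
# simplification of PHYS's profiled, dynamically adjusted rates.

import random

class SampledVCTracker:
    def __init__(self, sample_rate=0.05, seed=0):
        self.rate = sample_rate
        self.rng = random.Random(seed)
        self.last_access = {}    # block -> cycle of previous access
        self.sampled_vc = 0      # sum of sampled vulnerability cycles

    def access(self, block, cycle):
        if block in self.last_access and self.rng.random() < self.rate:
            self.sampled_vc += cycle - self.last_access[block]
        self.last_access[block] = cycle

    def estimated_total_vc(self):
        return self.sampled_vc / self.rate   # scale up by the sampling rate
```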
  • Mohammad Abdel-Majeed, Murali Annavaram
    ABSTRACT: General purpose graphics processing units (GPGPUs) have the ability to execute hundreds of concurrent threads. To support massive parallelism GPGPUs provide a very large register file, even larger than a cache, to hold the state of each thread. As technology scales, the leakage power consumption of SRAM cells is getting worse, making register file static power consumption a major concern. As supply voltage scaling slows, the dynamic power consumption of the register file is not shrinking either. These concerns are particularly acute in GPGPUs due to their large register file size. This paper presents two techniques to reduce GPGPU register file power consumption. By exploiting the unique software execution model of GPGPUs, we propose a tri-modal register access control unit to reduce leakage power. This unit first turns off any unallocated register, and places all allocated registers into a drowsy state immediately after each access. The average inter-access distance to a register is 789 cycles in GPGPUs. Hence, aggressively moving a register into the drowsy state immediately after each access results in a 90% reduction in leakage power with negligible performance impact. To reduce dynamic power, this paper proposes an active mask aware activity gating unit that avoids charging bit lines and wordlines of registers associated with inactive threads within a warp. Due to insufficient parallelism and branch divergence, warps have many inactive threads. Hence, registers associated with inactive threads can be identified precisely using the active mask. By combining the two techniques we show that the power consumption of the register file can be reduced by 69% on average.
    High Performance Computer Architecture (HPCA2013), 2013 IEEE 19th International Symposium on; 01/2013
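A toy state machine for the tri-modal register control described above: OFF when unallocated, ON only during an access, DROWSY immediately afterward. The one-cycle wakeup latency is an assumption:

```python
# Tri-modal register sketch: off (unallocated), drowsy (state-retentive low
# voltage), on (full voltage during an access). Wakeup latency is assumed.

OFF, DROWSY, ON = "off", "drowsy", "on"

class TriModalRegister:
    WAKEUP_CYCLES = 1          # assumed drowsy-to-on latency

    def __init__(self):
        self.state = OFF       # unallocated registers stay off (no leakage)

    def allocate(self):
        self.state = DROWSY

    def access(self):
        penalty = self.WAKEUP_CYCLES if self.state == DROWSY else 0
        self.state = ON        # full voltage only for the access itself
        self.state = DROWSY    # drop back to drowsy right after the access
        return penalty         # extra cycles this access costs

    def deallocate(self):
        self.state = OFF
```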
  • D. Wong, M. Annavaram
    ABSTRACT: Measuring energy proportionality accurately and understanding the reasons for disproportionality are critical first steps in designing future energy-efficient servers. This article presents two metrics, linear deviation and proportionality gap, that let system designers analyze and understand server energy consumption at various utilization levels. An analysis of published SPECpower results shows that energy proportionality improvements are not uniform across various server utilization levels. Even highly energy proportional servers suffer significantly at nonzero but low utilizations. To address the lack of energy proportionality at low utilization, the authors present KnightShift, a server-level heterogeneous server providing an active low-power mode. KnightShift is tightly coupled with a low-power compute node called Knight. Knight responds to low-utilization requests whereas the primary server responds only to high-utilization requests, enabling two energy-efficient operating regions. The authors evaluate KnightShift against a variety of real-world datacenter workloads using a combination of prototyping and simulation.
    IEEE Micro 01/2013; 33(3):28-37. · 2.39 Impact Factor
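Under one plausible reading of the two metrics (the paper's exact definitions are authoritative), the proportionality gap compares measured power against a linear ideal at each utilization, and linear deviation measures distance from the idle-to-peak line. A sketch with assumed normalizations:

```python
# Assumption-laden paraphrase of the two metrics: the ideal curve is linear
# from zero to peak power; the gap at utilization u is (actual - ideal)
# normalized to peak; linear deviation is mean distance from the
# idle-to-peak line. Normalizations here are assumptions.

def proportionality_gap(power, peak):
    """power: {utilization in [0,1]: measured watts}."""
    return {u: (p - u * peak) / peak for u, p in power.items()}

def linear_deviation(power, idle, peak):
    """Mean absolute deviation from the idle-to-peak straight line."""
    line = {u: idle + u * (peak - idle) for u in power}
    return sum(abs(power[u] - line[u]) for u in power) / (len(power) * peak)
```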
  •
    ABSTRACT: KNOWME Networks is a wireless body area network with 2 triaxial accelerometers, a heart rate monitor, and a mobile phone that acts as the data collection hub. One function of KNOWME Networks is to detect physical activity (PA) in overweight Hispanic youth. The purpose of this study was to evaluate the in-laboratory recognition accuracy of KNOWME. Twenty overweight Hispanic participants (10 males; age 14.6 ± 1.8 years) underwent 4 data collection sessions consisting of 9 activities per session: lying down, sitting, sitting fidgeting, standing, standing fidgeting, standing playing an active video game, slow walking, brisk walking, and running. Data were used to train activity recognition models. The accuracy of personalized and generalized models is reported. Overall accuracy for personalized models was 84%. The most accurately detected activity was running (96%). The models had difficulty distinguishing between the static and fidgeting categories of sitting and standing. When static and fidgeting activity categories were collapsed, the overall accuracy improved to 94%. Personalized models demonstrated higher accuracy than generalized models. KNOWME Networks can accurately detect a range of activities. KNOWME has the ability to collect and process data in real-time, building the foundation for tailored, real-time interventions to increase PA or decrease sedentary time.
    Journal of Physical Activity and Health 03/2012; 9(3):432-41. · 1.95 Impact Factor
  •
    ABSTRACT: Traffic monitoring using probe vehicles with GPS receivers promises significant improvements in cost, coverage, and accuracy over dedicated infrastructure systems. Current approaches, however, raise privacy concerns because they require participants to reveal their positions to an external traffic monitoring server. To address this challenge, we describe a system based on virtual trip lines and an associated cloaking technique, followed by another system design in which we relax the privacy requirements to maximize the accuracy of real-time traffic estimation. We introduce virtual trip lines, which are geographic markers that indicate where vehicles should provide speed updates. These markers are placed to avoid specific privacy-sensitive locations. They also allow aggregating and cloaking several location updates based on trip line identifiers, without knowledge of the actual geographic locations of these trip lines. Thus, they facilitate the design of a distributed architecture in which no single entity has complete knowledge of probe identities and fine-grained location information. We have implemented the system with GPS smartphone clients and conducted a controlled experiment with 100 phone-equipped drivers circling a highway segment, which was later extended into a year-long public deployment.
    IEEE Transactions on Mobile Computing 01/2012; 11:849-864. · 2.40 Impact Factor
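A hedged sketch of the client-side trip-line behavior described above: updates fire only when the vehicle crosses a pre-placed marker, and carry the marker's identifier rather than coordinates. The segment-crossing test below is a geometric simplification (it ignores the segment's extent):

```python
# Virtual trip line client sketch: detect a crossing of a marker segment
# and emit an update keyed by the line id, not by raw position. The
# crossing test and update format are illustrative simplifications.

def crossed(prev_pos, cur_pos, trip_line):
    """trip_line: ((x1, y1), (x2, y2)) segment; positions: (x, y)."""
    (x1, y1), (x2, y2) = trip_line
    def side(p):
        return (x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1)
    return side(prev_pos) * side(cur_pos) < 0   # opposite sides => crossed

def make_update(trip_line_id, speed_mps):
    # No coordinates leave the phone -- only the marker id and the speed.
    return {"line_id": trip_line_id, "speed": round(speed_mps, 1)}
```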
  • Sangwon Lee, M. Annavaram
    ABSTRACT: Wireless Body Area Networks (WBANs) promise to revolutionize health care in the near future. By integrating bio-sensors with a mobile phone it is possible to monitor an individual's health and related behaviors. Monitoring is done by analyzing the sensor data either on a mobile phone or on a remote server by relaying this information over a wireless network. However, the "wireless" aspect of WBAN is limited by the battery life of the mobile phone. A WBAN designer has a range of options to trade off limited battery life against many other important metrics. From the choice of programming languages to dynamically choosing between computation and communication under varying signal strengths, there are several non-obvious choices that can have dramatic impact on battery life. In this research we use an in-field deployed WBAN called KNOWME to present a comprehensive quantification of a mobile phone's energy consumption. We quantify the energy impact of different programming paradigms, sensing modalities, data storage, and conflicting computation and communication demands. Based on the knowledge gained from the measurement studies, we propose an Active Energy Profiling strategy that uses short profiling periods to automatically determine the most energy-efficient choices for running a WBAN.
    Workload Characterization (IISWC), 2012 IEEE International Symposium on; 01/2012
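The Active Energy Profiling strategy can be sketched generically: run each candidate configuration for a short trial, measure the energy, and commit to the cheapest. The measurement hooks below are stand-ins; real phones expose energy data through platform-specific interfaces:

```python
# Generic profile-then-commit sketch. run_briefly and measure_joules are
# caller-supplied stand-ins for platform-specific mechanisms; the trial
# length is an assumption.

def pick_configuration(configs, run_briefly, measure_joules, trial_seconds=30):
    """configs: list of callables, e.g. process-on-phone vs. send-to-server."""
    costs = {}
    for cfg in configs:
        run_briefly(cfg, trial_seconds)            # short profiling period
        costs[cfg] = measure_joules()              # energy used in the trial
    best = min(costs, key=costs.get)               # commit to the cheapest
    return best, costs
```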
  • J. Suh, M. Annavaram, Michel Dubois
    ABSTRACT: Due to the growing trend that a Single Event Upset (SEU) can cause spatial Multi-Bit Upsets (MBUs), the effects of spatial MBUs have recently become an important yet very challenging issue, especially in large last-level caches (LLCs) protected by error protection codes. In the presence of spatial MBUs, the strength of the protection codes becomes a critical design issue. Developing a reliability model that includes the cumulative effects of overlapping SBUs, temporal MBUs, and spatial MBUs is a very challenging problem, especially when protection codes are active. In this paper, we introduce a new framework called MACAU. MACAU is based on a Markov chain model and can compute the intrinsic MTTFs of scrubbed caches as well as benchmark caches protected by various codes. MACAU is the first framework that quantifies the failure rates of caches due to the combined effects of SBUs, temporal MBUs, and spatial MBUs.
    High Performance Computer Architecture (HPCA), 2012 IEEE 18th International Symposium on; 01/2012
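The Markov-chain machinery MACAU builds on can be shown on a toy absorbing chain: with transient states for correctable error patterns and an absorbing failure state, MTTF is the expected time to absorption, computed via the fundamental matrix. The two transient states and per-cycle probabilities below are invented for illustration:

```python
# Toy absorbing Markov chain: transient states = [clean, one correctable
# error], plus an implicit absorbing failure state reached from the error
# state. All probabilities are invented.

import numpy as np

p_err, p_scrub, p_fail = 1e-9, 1e-6, 1e-9
Q = np.array([
    [1 - p_err, p_err],                   # clean: stay clean / take an error
    [p_scrub,   1 - p_scrub - p_fail],    # error: scrubbed back / stay
])
N = np.linalg.inv(np.eye(2) - Q)          # fundamental matrix
mttf_cycles = N.sum(axis=1)[0]            # expected cycles to failure from clean
print(f"MTTF ~ {mttf_cycles:.3e} cycles")
```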
  •
    ABSTRACT: Power gating is an increasingly important actuation knob in chip-level dynamic power management. In a multi-core setting, a key design issue in this context is determining the right balance of gating at the unit level (within a core) and at the core level. Another issue is how to architect the predictive control associated with such gating, in order to ensure maximal power savings at minimal performance loss. We use an abstract, analytical modeling framework to understand and discuss the fundamental tradeoffs in such a design. We consider plausible ranges of software/hardware control latencies and workload characteristics to understand when and where it makes sense to disable one or both of the gating mechanisms (i.e., intra- and inter-core). The overall goal of this research is to devise predictive power gating algorithms in a multi-core setting, with built-in "guard" mechanisms to prevent negative outcomes: e.g., a net increase in power consumption or an unacceptable level of performance loss.
    WEED 2010 - Workshop on Energy-Efficient Design. 01/2012;
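One concrete form of the "guard" mechanism motivated above: gate only when the predicted idle period clears the break-even time, and disable gating after a streak of mispredictions. All constants are illustrative assumptions:

```python
# Guarded predictive gating sketch: a decision loses energy when the actual
# idle period falls short of break-even; a streak of such losses trips the
# guard and disables gating. Thresholds are invented.

class GuardedGating:
    def __init__(self, break_even=20, max_bad_streak=3):
        self.break_even = break_even
        self.bad_streak = 0
        self.max_bad = max_bad_streak

    def should_gate(self, predicted_idle):
        if self.bad_streak >= self.max_bad:
            return False                     # guard tripped: stop gating
        return predicted_idle > self.break_even

    def observe(self, actual_idle):
        # A gating decision that didn't reach break-even cost net energy.
        if actual_idle < self.break_even:
            self.bad_streak += 1
        else:
            self.bad_streak = 0
```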
  • M. Demertzi, B. Zandian, R. Rojas, M. Annavaram
    ABSTRACT: Silicon scaling has led to accelerated wearout, which results in an increased number of intermittent errors. Of the various factors that can lead to intermittent errors, device utilization is a major contributor to wearout. Every device on a chip is activated in response to either control signals or data movement resulting from instruction execution. This research proposes a systematic methodology for benchmarking the vulnerability of an instruction set architecture (ISA) to intermittent errors. By following each instruction during its execution through the processor pipeline, we quantify how many devices each instruction activates during its execution. We propose Vulnerability to Intermittent Failures (VIF) as a metric to quantify the stress imposed on circuits by an instruction. We show how VIF varies from instruction to instruction and how different inputs can affect VIF.
    Workload Characterization (IISWC), 2012 IEEE International Symposium on; 01/2012
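A VIF-style score could be computed by summing the devices an instruction activates at each pipeline stage, scaled by input-dependent switching. The stage list and activation counts below are placeholders, not the paper's measured values:

```python
# Sketch of a per-instruction device-activation score. The per-stage counts
# and the toggle-factor model are invented placeholders.

ACTIVATIONS = {
    "add":  {"fetch": 120, "decode": 80, "alu": 300, "writeback": 60},
    "load": {"fetch": 120, "decode": 80, "agen": 250, "lsq": 400, "writeback": 60},
}

def vif(instr_class, input_toggle_factor=1.0):
    """Higher toggle factors model data inputs that switch more devices."""
    stages = ACTIVATIONS[instr_class]
    return sum(stages.values()) * input_toggle_factor

print(vif("load"), vif("add", input_toggle_factor=0.7))
```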
  •
    ABSTRACT: Wireless body area sensing networks have the potential to revolutionize health care in the near term. The coupling of biosensors with a wireless infrastructure enables the real-time monitoring of an individual's health and related behaviors continuously, as well as the provision of real-time feedback with nimble, adaptive, and personalized interventions. The KNOWME platform is reviewed, and lessons learned from system integration, optimization, and in-field deployment are provided. KNOWME is an end-to-end body area sensing system that integrates off-the-shelf sensors with a Nokia N95 mobile phone to continuously monitor and analyze the biometric signals of a subject. KNOWME development by an interdisciplinary team and in-laboratory as well as in-field deployment studies, employing pediatric obesity as a case study condition to monitor and evaluate physical activity, have revealed four major challenges: (1) achieving robustness to highly varying operating environments due to subject-induced variability such as mobility or sensor placement, (2) balancing the tension between acquiring high-fidelity data and minimizing network energy consumption, (3) enabling accurate physical activity detection using a modest number of sensors, and (4) designing WBANs to determine physiological quantities of interest such as energy expenditure. The KNOWME platform described in this article directly addresses these challenges.
    IEEE Communications Magazine 01/2012; 50(5):116-125. · 3.66 Impact Factor
  • Sabyasachi Ghosh, Mark Redekopp, Murali Annavaram
    ABSTRACT: Data center energy costs are a growing concern. Many data centers use a direct-attached-storage architecture where data is distributed across disks attached to several servers. In this organization, even if a server is not utilized it cannot be turned off, since each server carries a fraction of the permanent state needed to complete a request. Operating servers at low utilization is very inefficient due to the lack of energy proportionality. In this research we propose to use the out-of-band management processor, typically used for remotely managing a server, to satisfy I/O requests from a remote server. By handling requests with limited processing needs, the management processor takes the load off the primary server, thereby allowing the primary server to sleep when not actively being used; we call this approach KnightShift. We describe how existing management processors can be modified to handle the KnightShift responsibility. We use several production data center traces to evaluate the energy impact of KnightShift and show that energy consumption can be reduced by 2.6X by allowing management processors to handle only those requests that demand less than 5
    WEED 2010 - Workshop on Energy-Efficient Design. 01/2012;
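The routing decision at the heart of this KnightShift proposal reduces to a demand threshold; the sketch below uses an invented threshold and ignores queueing effects:

```python
# Request-routing sketch: small requests go to the management processor so
# the primary server can sleep. The threshold and demand units are assumed.

def route(request_demand, primary_asleep, threshold=0.05):
    """request_demand: fraction of a primary-server core the request needs."""
    if request_demand <= threshold:
        return "management_processor"   # serve in place, keep primary asleep
    if primary_asleep:
        return "wake_primary_then_serve"
    return "primary_server"
```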
  • Mohammad Abdel-Majeed, Mike Chen, Murali Annavaram
    ABSTRACT: In order to build high performance real-time sensing systems, every building block in the system should be built with a technology that allows it to achieve its best performance. Technologies like BJT and BiCMOS are better suited for building basic analog blocks like input buffers and power amplifiers, while CMOS is the best choice for digital data processing. To build mixed-technology systems, system-in-package (SiP) techniques are traditionally used. SiP integration uses bonding wires or flip chip instead of on-chip integration. In this paper we study the feasibility of using 3D stacking to integrate heterogeneous blocks built using different technologies within a real-time sensing system. Several of the previous studies on 3D stacking focused on integrating multiple digital blocks and using through-silicon vias (TSVs) to transfer digital signals between the layers in a stack. In this paper we study the behavior of analog signals traversing through TSVs and measure how well 3D stacking can enhance or limit the performance of analog and digital blocks. In order to quantify the power and performance characteristics, we modeled bonding wire, flip chip, and through-silicon via (TSV) interfaces. Using these models we show that 3D stacking of analog and analog/digital components can double the bandwidth, increase sampling frequency by nearly two orders of magnitude, and improve signal integrity by 3 dB compared to bond wires.
    01/2012;
  • Yi Wang, B. Krishnamachari, M. Annavaram
    ABSTRACT: User/environmental context detection on mobile devices benefits end-users by providing information support to various kinds of applications. A pervasive question, however, is how the sensors on the mobile device should be sampled energy-efficiently without sacrificing too much detection accuracy. In this paper, we formulate the user state sensing problem as the intermittent sampling of a semi-Markov process, a model that provides general and flexible capturing of realistic data with any type of state sojourn distribution. We propose (a) a semi-Markov state estimation mechanism that selects the most likely user state while observations are missing, and (b) a semi-Markov optimal sensing policy us* which minimizes the expected state estimation error while maintaining a given energy budget. Both are shown to significantly outperform Markov algorithms on simulated two-state processes and real user state traces pertaining to different types of state distributions. Finally, in order to evaluate the performance of us*, we implement a client-server based basic human activity recognition system on N95 smartphones and desktops which automatically computes a user-specific optimal sensing policy based on historically collected data. We show that us* improves the estimation accuracy by 27.8% and 48.6%, respectively, over the Markov-optimal policy and uniform sampling through a set of experiments.
    Sensor, Mesh and Ad Hoc Communications and Networks (SECON), 2012 9th Annual IEEE Communications Society Conference on; 01/2012
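A toy version of the semi-Markov estimation step: given the last observed state and the time elapsed since, weigh staying in that state (via the sojourn survival function) against having jumped. The two-state setup, exponential sojourns, and parameters are modeling assumptions; real semi-Markov sojourns need not be exponential:

```python
# Semi-Markov state estimation sketch with missing observations. All
# distributions and parameters below are invented assumptions.

import math

TRANS = {"sit": {"walk": 1.0}, "walk": {"sit": 1.0}}  # two-state chain
MEAN_SOJOURN = {"sit": 600.0, "walk": 120.0}          # seconds (assumed)

def p_still_in(state, elapsed):
    # Exponential sojourn survival function -- a modeling convenience here;
    # semi-Markov models allow arbitrary sojourn distributions.
    return math.exp(-elapsed / MEAN_SOJOURN[state])

def most_likely_state(last_state, elapsed):
    stay = p_still_in(last_state, elapsed)
    scores = {last_state: stay}
    for nxt, p in TRANS[last_state].items():
        scores[nxt] = (1 - stay) * p      # one-jump approximation
    return max(scores, key=scores.get)
```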
  • D. Wong, M. Annavaram
    ABSTRACT: Server energy proportionality has been improving over the past several years. Many components in a system, such as CPU, memory, and disk, have been achieving good energy proportionality behavior. Using a wide range of server power data from published SPECpower results, we show that overall system energy proportionality has reached 80%. We present two novel metrics, linear deviation and proportionality gap, that provide insights into accurately quantifying energy proportionality. Using these metrics we show that energy proportionality improvements are not uniform across various server utilization levels. In particular, the energy proportionality of even a highly proportional server suffers significantly at non-zero but low utilizations. We propose to tackle the lack of energy proportionality at low utilization using server-level heterogeneity. We present KnightShift, a server-level heterogeneous server architecture that introduces an active low power mode through the addition of a tightly coupled compute node called the Knight, enabling two energy-efficient operating regions. We evaluated KnightShift against a variety of real-world data center workloads using a combination of prototyping and simulation, showing up to 75% energy savings with tail latency bounded by the latency of the Knight and up to 14% improvement in performance per TCO dollar spent.
    Microarchitecture (MICRO), 2012 45th Annual IEEE/ACM International Symposium on; 01/2012
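The two-region operation described above boils down to a mode switch with hysteresis; the thresholds here are invented for illustration:

```python
# Two-region operating policy sketch: serve from the low-power Knight at
# low utilization, switch to the primary server when load rises. The
# capacity and hysteresis values are assumptions.

def next_mode(mode, utilization, knight_capacity=0.15, hysteresis=0.05):
    if mode == "knight" and utilization > knight_capacity:
        return "primary"    # Knight saturated: wake the primary server
    if mode == "primary" and utilization < knight_capacity - hysteresis:
        return "knight"     # low load again: shift back, sleep the primary
    return mode
```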
  • Hyeran Jeon, M. Annavaram
    ABSTRACT: General purpose graphics processing units (GPGPUs) are feature-rich GPUs that provide general purpose computing ability with a massive number of parallel threads. The massive parallelism combined with programmability made GPGPUs the most attractive choice in supercomputing centers. Unsurprisingly, most GPGPU-based studies have focused on performance improvement, leveraging the GPGPU's high degree of parallelism. However, for many scientific applications that commonly run on supercomputers, program correctness is as important as performance. A few soft or hard errors can lead to corrupted results and can potentially waste days or even months of computing effort. In this research we exploit unique architectural characteristics of GPGPUs to propose a lightweight error detection method, called Warped Dual Modular Redundancy (Warped-DMR). Warped-DMR detects errors in computation by relying on opportunistic spatial and temporal dual-modular execution of code. Warped-DMR is lightweight because it exploits the underutilized parallelism in GPGPU computing for error detection. Error detection spans both within a warp and between warps, called intra-warp and inter-warp DMR, respectively. Warped-DMR achieves 96% error coverage while incurring a worst-case 16% performance overhead, without extra execution units or programmer effort.
    Microarchitecture (MICRO), 2012 45th Annual IEEE/ACM International Symposium on; 01/2012
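Intra-warp DMR can be sketched as pairing lanes idled by branch divergence with active lanes and comparing the two results. The warp model below is a drastic simplification of the hardware mechanism:

```python
# Intra-warp DMR sketch: idle lanes (divergence) redundantly re-execute an
# active lane's operation; mismatching results flag an error. The pairing
# rule and warp model are simplified assumptions.

def intra_warp_dmr(active_mask, operands, op):
    """active_mask: per-lane bools; operands: per-lane inputs; op: the
    operation under check (e.g. lambda x: x * 2)."""
    active = [i for i, a in enumerate(active_mask) if a]
    idle = [i for i, a in enumerate(active_mask) if not a]
    results = {i: op(operands[i]) for i in active}
    flagged = []
    # Each idle (spare) lane shadows one active lane's operation.
    for spare_lane, checked_lane in zip(idle, active):
        if op(operands[checked_lane]) != results[checked_lane]:
            flagged.append(checked_lane)  # mismatch => error detected
    return results, flagged
```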