Murali Annavaram

University of California, Los Angeles, Los Angeles, California, United States


Publications (76) · 25.85 Total Impact

  •
    ABSTRACT: New and emerging mobile technologies are providing unprecedented possibilities for understanding and intervening on obesity-related behaviors in real time. However, the mobile health (mHealth) field has yet to catch up with the fast-paced development of technology. Current mHealth efforts in weight management still tend to focus mainly on short message service (SMS) interventions, rather than taking advantage of real-time sensing to develop just-in-time adaptive interventions (JITAIs). This paper gives an overview of the current technology landscape for sensing and intervening on three behaviors central to weight management: diet, physical activity, and sleep. We then showcase five studies that explore the possibilities these new technologies afford, and conclude with a discussion of the hurdles that mHealth obesity research has yet to overcome and directions for future work.
    Article · Sep 2015
  • Sangpil Lee · Keunsoo Kim · Gunjae Koo · Hyeran Jeon · Won Woo Ro · Murali Annavaram

    Article · Jun 2015 · ACM SIGARCH Computer Architecture News
  •
    ABSTRACT: In mobile-based traffic monitoring applications, each user provides real-time updates on their location and speed while driving. This data is collected by a centralized server and aggregated to provide participants with current traffic conditions. Successful participation in traffic monitoring applications utilizing participatory sensing depends on two factors: the information utility of the estimated traffic condition, and the amount of private information (position and speed) each participant reveals to the server. We assume each user prefers to reveal as little private information as possible, but if everyone withholds information, the quality of traffic estimation will deteriorate. In this paper, we model these opposing requirements by considering each user to have a utility function that combines the benefit of high quality traffic estimates and the cost of privacy loss. Using a novel Markovian model, we mathematically derive a policy that takes into account the mean, variance and correlation of traffic on a given stretch of road and yields the optimal granularity of information revelation to maximize user utility. We validate the effectiveness of this policy through real-world empirical traces collected during the Mobile Century experiment in Northern California. The validation shows that the derived policy yields utilities that are very close to what could be obtained by an oracle scheme with full knowledge of the ground truth.
    Article · Dec 2014 · Pervasive and Mobile Computing
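The utility trade-off described above can be sketched in a few lines. The weights and functional forms below are illustrative assumptions, not the paper's Markovian policy:

```python
import math

def user_utility(granularity, alpha=1.0, beta=0.5):
    # Toy utility: benefit of finer-grained location reporting (better
    # traffic estimates) with diminishing returns, minus a linear
    # privacy cost. alpha and beta are invented weights for the sketch.
    benefit = alpha * math.log(1 + granularity)
    privacy_cost = beta * granularity
    return benefit - privacy_cost

# Choose the revelation granularity that maximizes net user utility.
best = max(range(11), key=user_utility)
```

With these particular weights the optimum lands at a coarse, partial revelation level, matching the paper's intuition that each user should reveal only as much as the traffic-quality benefit justifies.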
  • Mehrtash Manoochehri · Murali Annavaram · Michel Dubois
    ABSTRACT: Due to shrinking feature sizes, processors are becoming more vulnerable to soft errors. One of the most vulnerable components of a processor is its write-back cache. This paper proposes a new reliable write-back cache called Correctable Parity Protected Cache (CPPC), which adds correction capability to parity protection. In CPPC, parity bits detect faults and the XOR of all data written into the cache is kept to recover from detected faults. The added correction scheme provides a high degree of reliability and corrects both single and spatial multi-bit faults in exchange for very small performance and power overheads. CPPC is compared to competitive schemes. Our simulation data show that CPPC improves reliability significantly while its overheads are very small, especially in the L2 cache.
    Article · Oct 2014 · IEEE Transactions on Computers
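The parity-plus-XOR recovery idea behind CPPC can be illustrated with a toy sketch; the word values and the single-fault scenario are invented for the example:

```python
from functools import reduce

def xor_all(words):
    # Running XOR of every word written into the cache (CPPC keeps a
    # checksum like this up to date on each write).
    return reduce(lambda a, b: a ^ b, words, 0)

cache = [0b1010, 0b0110, 0b1111]   # dirty words held in the cache
checksum = xor_all(cache)

# Parity detects a fault in cache[1]; the word is rebuilt by XORing
# the checksum with every other (healthy) word.
recovered = checksum ^ xor_all(cache[:1] + cache[2:])
# recovered == 0b0110
```

The appeal of the scheme is that detection stays cheap (one parity bit per word) while the single shared checksum supplies the correction capability.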
  • Waleed Dweik · Mohammad Abdel Majeed · Murali Annavaram
    ABSTRACT: Graphics processing units (GPUs) are rapidly becoming the parallel accelerators of choice for running general purpose applications; GPUs that run such applications are termed GPGPUs. Many mission-critical and long-running scientific applications are being ported to run on GPGPUs, and these applications demand strong computational integrity. GPGPUs, like many other digital components, face imminent reliability threats due to technology scaling. Of particular concern are in-field hard faults, which are persistent and irreversible. GPGPUs comprise dozens of streaming processors, where each streaming processor employs tens of execution units organized as single instruction multiple thread (SIMT) lanes to deliver massive parallel computational power. In this paper we exploit the massive replication of SIMT lanes to tolerate in-field hard faults. First, we introduce thread shuffling to reroute threads, originally mapped to faulty SIMT lanes, to idle healthy lanes. Thread shuffling is insufficient when the number of healthy SIMT lanes is fewer than the number of active threads. To broaden the reach of thread shuffling, we propose dynamic warp deformation to split the warp into multiple sub-warps; each sub-warp uses fewer SIMT lanes, thereby providing more opportunities to avoid using a faulty SIMT lane. Finally, we propose warp shuffling, which exploits the non-uniform degradation of different streaming processors by scheduling a warp to a streaming processor that requires fewer warp splits. Hence, warp shuffling helps reduce the performance overhead associated with dynamic warp deformation. By deploying the proposed techniques, we can tolerate the worst case of up to three hard faults per four-SIMT-lane cluster with at most 36% performance degradation.
    Conference Paper · Jun 2014
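A simplified model of thread shuffling and dynamic warp deformation; the lane counts and the greedy chunking policy are invented for illustration:

```python
def shuffle_threads(active_threads, healthy_lanes):
    # Remap active threads onto healthy SIMT lanes. When healthy lanes
    # are scarcer than active threads, split the warp into sub-warps
    # that issue over successive cycles (dynamic warp deformation).
    step = len(healthy_lanes)
    return [dict(zip(active_threads[i:i + step], healthy_lanes))
            for i in range(0, len(active_threads), step)]

# 8-lane cluster with lanes 2 and 5 faulty: six active threads still
# fit in a single sub-warp, so no deformation is needed.
subwarps = shuffle_threads(list(range(6)), [0, 1, 3, 4, 6, 7])

# With only three healthy lanes left, the same warp splits in two.
deformed = shuffle_threads(list(range(6)), [0, 1, 3])
```

Warp shuffling then amounts to preferring, among the available streaming processors, the one whose healthy-lane count yields the fewest sub-warps for a given warp.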
  • Daniel Wong · Murali Annavaram
    ABSTRACT: Cluster-level packing techniques have long been used to improve the energy proportionality of server clusters by masking the poor energy proportionality of individual servers. With the emergence of high energy proportional servers, we revisit whether cluster-level packing techniques are still the most effective way to achieve high cluster-wide energy proportionality. Our findings indicate that cluster-level packing techniques can eventually limit cluster-wide energy proportionality and it may be more beneficial to depend solely on server-level low power techniques. Server-level low power techniques generally require a high latency slack to be effective due to diminishing idle periods as server core count increases. In order for server-level low power techniques to be a viable alternative, the latency slack required for these techniques must be lowered. We found that server-level active low power modes offer the lowest latency slack, independent of server core count, and propose low power mode switching policies to meet the best-case latency slack under realistic conditions. By overcoming these major issues, we show that server-level low power modes can be a viable alternative to cluster-level packing techniques in providing high cluster-wide energy proportionality.
    Conference Paper · Feb 2014
  • Waleed Dweik · Murali Annavaram · Michel Dubois
    ABSTRACT: In future technology nodes, reliability is expected to become a first-order design constraint. Faults encountered in a chip can be classified into three categories: transient, intermittent, and permanent. Fault classification allows a chip to take the appropriate corrective action. Mechanisms have been proposed to distinguish transient from non-transient faults where all non-transient faults are handled as permanent. Intermittent faults induced by wearout phenomena have become the dominant reliability concern in nanoscale technology, yet there is no mechanism that provides finer classification of non-transient faults into intermittent and permanent faults. In this paper, we present a new class of exceptions called Reliability-Aware Exceptions (RAEs) which provide the ability to distinguish intermittent faults in microprocessor array structures. The RAE handlers have the ability to manipulate microprocessor array structures to recover from all three categories of faults. Using RAEs, we demonstrate that the reliability of two representative microarchitecture structures, load/store queue and reorder buffer in an out-of-order processor, is improved by average factors of 1.3 and 1.95, respectively.
    Conference Paper · Jan 2014
  • Mohammad Abdel-Majeed · Daniel Wong · Murali Annavaram
    ABSTRACT: With the widespread adoption of GPGPUs in varied application domains, new opportunities open up to improve GPGPU energy efficiency. Due to inherent application-level inefficiencies, GPGPU execution units experience significant idle time. In this work we propose to power gate idle execution units to eliminate leakage power, which is becoming a significant concern with technology scaling. We show that GPGPU execution units are idle for short windows of time, and conventional microprocessor power gating techniques cannot fully exploit these idle windows due to the power gating overhead. Current warp schedulers greedily intersperse integer and floating point instructions, which limits power gating opportunities for any given execution unit type. To improve power gating opportunities in GPGPU execution units, we propose a Gating Aware Two-level warp scheduler (GATES) that issues clusters of instructions of the same type before switching to another instruction type. We also propose a new power gating scheme, called Blackout, that forces a power gated execution unit to sleep for at least the break-even time necessary to overcome the power gating overhead before returning to the active state. The combination of GATES and Blackout, which we call Warped Gates, can save 31.6% and 46.5% of integer and floating point unit static energy, respectively. The proposed solutions incur less than 1% performance and area overhead.
    Conference Paper · Dec 2013
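The break-even reasoning behind Blackout can be sketched numerically; the energy figures below are made up for illustration:

```python
import math

def break_even_cycles(leakage_energy_per_cycle, gating_overhead_energy):
    # Cycles a unit must stay power gated before the saved leakage
    # repays the energy cost of toggling the sleep transistors.
    return math.ceil(gating_overhead_energy / leakage_energy_per_cycle)

BE = break_even_cycles(leakage_energy_per_cycle=1.0,
                       gating_overhead_energy=14.0)
# Blackout forces a gated unit to sleep for at least BE cycles, even
# if an instruction of its type arrives meanwhile, so every gating
# decision is guaranteed to yield net energy savings.
```

GATES complements this by clustering same-type instructions, stretching each idle window so that more of them exceed the break-even threshold in the first place.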
  • Kimish Patel · Murali Annavaram · Massoud Pedram
    ABSTRACT: Due to the prohibitive cost of data center setup and maintenance, many small-scale businesses rely on hosting centers to provide the cloud infrastructure to run their workloads. Hosting centers host services on the clients' behalf and guarantee quality of service as defined by service level agreements (SLAs). To reduce energy consumption and maximize profit, it is critical to optimally allocate resources to meet client SLAs. Optimal allocation is a nontrivial task due to 1) resource heterogeneity, where the energy consumption of a client task varies depending on the allocated resources, and 2) lack of energy proportionality, where the energy cost of a task varies with server utilization. In this paper, we introduce a generalized Network Flow-based Resource Allocation framework, called NFRA, for energy minimization and profit maximization. NFRA provides a unified framework to model profit maximization under a wide range of SLAs. We demonstrate the simplicity of this unified framework by deriving optimal resource allocations for three different SLAs, and we derive workload demands and server energy consumption data from SPECWeb2009 benchmark results to demonstrate the efficiency of the NFRA framework.
    Article · Sep 2013 · IEEE Transactions on Computers
  • Jinho Suh · Murali Annavaram · Michel Dubois
    ABSTRACT: In this paper, we introduce PHYS (Profiled-HYbrid Sampling), a sampling framework for soft-error benchmarking of caches. Reliability simulations of caches are much more complex than performance simulations and therefore exhibit large simulation slowdowns (two orders of magnitude) over performance simulations. The major problem is that the reliability lifetime of every accessed block must be tracked from beginning to end, on top of simulating the benchmark, in order to track the total number of vulnerability cycles (VCs) between two accesses to the block. Because of the need to track SDCs (silent data corruptions) and to distinguish between true and false DUEs (detected but unrecoverable errors), vulnerability cycles cannot be truncated when data is written back from cache to main memory. Vulnerability cycles must be maintained even during a block's sojourn in main memory to track whether corrupted values in a block are used by the processor, until program termination. PHYS solves this problem by sampling intervals between accesses to each memory block, instead of sampling the execution of the processor in a time interval as is classically done in performance simulations. First, a statistical profiling phase captures the distribution of VCs for every block. This profiling step provides a statistical guarantee of the minimum sampling rate of access intervals needed to meet a desired FIT error target with a given confidence interval. Then, per cache-set sampling rates are dynamically adjusted to sample VCs with higher merit. We compare PHYS with many other possible sampling methods, some of which are widely used to accelerate performance-centric simulations but have also been applied in the past to track reliability lifetime. We demonstrate the superiority of PHYS in the context of reliability benchmarking through exhaustive evaluations of various sampling techniques.
    Conference Paper · Jun 2013
  • Daniel Wong · Murali Annavaram
    ABSTRACT: Measuring energy proportionality accurately and understanding the reasons for disproportionality are critical first steps in designing future energy-efficient servers. This article presents two metrics, linear deviation and proportionality gap, that let system designers analyze and understand server energy consumption at various utilization levels. An analysis of published SPECpower results shows that energy proportionality improvements are not uniform across various server utilization levels. Even highly energy proportional servers suffer significantly at nonzero but low utilizations. To address the lack of energy proportionality at low utilization, the authors present KnightShift, a server-level heterogeneous server providing an active low-power mode. KnightShift is tightly coupled with a low-power compute node called Knight. Knight responds to low-utilization requests whereas the primary server responds only to high-utilization requests, enabling two energy-efficient operating regions. The authors evaluate KnightShift against a variety of real-world datacenter workloads using a combination of prototyping and simulation.
    Article · May 2013 · IEEE Micro
  • Mohammad Abdel-Majeed · Murali Annavaram
    ABSTRACT: General purpose graphics processing units (GPGPUs) have the ability to execute hundreds of concurrent threads. To support massive parallelism, GPGPUs provide a very large register file, even larger than a cache, to hold the state of each thread. As technology scales, the leakage power consumption of the SRAM cells worsens, making the register file's static power consumption a major concern. And as supply voltage scaling slows, the dynamic power consumption of the register file is not decreasing either. These concerns are particularly acute in GPGPUs due to their large register file size. This paper presents two techniques to reduce GPGPU register file power consumption. By exploiting the unique software execution model of GPGPUs, we propose a tri-modal register access control unit to reduce leakage power. This unit first turns off any unallocated register, and places all allocated registers into a drowsy state immediately after each access. The average inter-access distance to a register is 789 cycles in GPGPUs; hence, aggressively moving a register into the drowsy state immediately after each access results in a 90% reduction in leakage power with negligible performance impact. To reduce dynamic power, this paper proposes an active-mask-aware activity gating unit that avoids charging the bit lines and word lines of registers associated with inactive threads within a warp. Due to insufficient parallelism and branch divergence, warps have many inactive threads; hence, registers associated with inactive threads can be identified precisely using the active mask. By combining the two techniques, we show that the power consumption of the register file can be reduced by 69% on average.
    Conference Paper · Jan 2013
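Using the 789-cycle average inter-access distance reported above, a back-of-the-envelope estimate shows why immediate drowsy transitions pay off; the drowsy-to-active leakage ratio and the wake-up latency are assumed values, not the paper's:

```python
def drowsy_leakage_savings(inter_access=789, drowsy_ratio=0.1, wake_cycles=1):
    # Fraction of leakage energy saved when a register drops into the
    # drowsy state right after each access. drowsy_ratio (drowsy vs.
    # full leakage) and wake_cycles are illustrative assumptions.
    full = wake_cycles                      # cycles at full leakage
    drowsy = inter_access - wake_cycles     # cycles at reduced leakage
    return 1 - (full + drowsy * drowsy_ratio) / inter_access

savings = drowsy_leakage_savings()
# Under these assumptions the register leaks at full power for only
# ~1 of every 789 cycles, so the savings land near 90%.
```

The long inter-access distance is what makes the aggressive policy safe: the one-cycle wake penalty is amortized over hundreds of drowsy cycles.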
  • Daniel Wong · Murali Annavaram
    ABSTRACT: Server energy proportionality has been improving over the past several years. Many components in a system, such as CPU, memory and disk, have been achieving good energy proportionality behavior. Using a wide range of server power data from the published SPECpower results we show that overall system energy proportionality has reached 80%. We present two novel metrics, linear deviation and proportionality gap, that provide insights into accurately quantifying energy proportionality. Using these metrics we show that energy proportionality improvements are not uniform across various server utilization levels. In particular, the energy proportionality of even a highly proportional server suffers significantly at non-zero but low utilizations. We propose to tackle the lack of energy proportionality at low utilization using server-level heterogeneity. We present KnightShift, a server-level heterogeneous server architecture that introduces an active low power mode through the addition of a tightly-coupled compute node called the Knight, enabling two energy-efficient operating regions. We evaluated KnightShift against a variety of real-world data center workloads using a combination of prototyping and simulation, showing up to 75% energy savings with tail latency bounded by the latency of the Knight and up to 14% improvement to performance per TCO dollar spent.
    Conference Paper · Dec 2012
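One plausible formulation of the proportionality-gap metric; the exact definition is in the paper, and the power numbers here are illustrative:

```python
def proportionality_gap(p_actual, p_peak, util):
    # An ideally proportional server draws power that scales linearly
    # with utilization; the gap is the excess actual power at a given
    # utilization, normalized to peak power.
    p_ideal = p_peak * util
    return (p_actual - p_ideal) / p_peak

# A 200 W-peak server drawing 120 W at 20% utilization:
gap = proportionality_gap(p_actual=120, p_peak=200, util=0.2)
# gap == (120 - 40) / 200 == 0.4
```

Evaluating this at every utilization level exposes exactly where a server is disproportional, which is how the non-uniformity at low utilization shows up.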
  • Hyeran Jeon · Murali Annavaram
    ABSTRACT: General purpose graphics processing units (GPGPUs) are feature-rich GPUs that provide general purpose computing ability with a massive number of parallel threads. The massive parallelism combined with programmability has made GPGPUs the most attractive choice in supercomputing centers. Unsurprisingly, most GPGPU-based studies have focused on performance improvement, leveraging the GPGPU's high degree of parallelism. However, for many scientific applications that commonly run on supercomputers, program correctness is as important as performance. A few soft or hard errors can lead to corrupted results and can potentially waste days or even months of computing effort. In this research we exploit unique architectural characteristics of GPGPUs to propose a lightweight error detection method, called Warped Dual Modular Redundancy (Warped-DMR). Warped-DMR detects errors in computation by relying on opportunistic spatial and temporal dual-modular execution of code. Warped-DMR is lightweight because it exploits the underutilized parallelism in GPGPU computing for error detection. Error detection spans both within a warp and between warps, called intra-warp and inter-warp DMR, respectively. Warped-DMR achieves 96% error coverage while incurring a worst-case 16% performance overhead without extra execution units or programmer effort.
    Conference Paper · Dec 2012
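The dual-modular comparison at the heart of Warped-DMR can be sketched as follows; the fault-injection hook is an invented stand-in for a misbehaving spare lane:

```python
def dmr_check(op, x, fault=None):
    # Execute op on the primary lane and again on a spare lane; a
    # mismatch signals an error. `fault` optionally corrupts the spare
    # result to model a hardware fault on that lane.
    primary = op(x)
    shadow = op(x) if fault is None else fault(op(x))
    return primary, primary == shadow

square = lambda x: x * x
_, ok = dmr_check(square, 7)                          # fault-free lanes agree
_, bad = dmr_check(square, 7, fault=lambda r: r ^ 1)  # bit flip is caught
```

In hardware the "spare lane" is simply an inactive SIMT lane borrowed opportunistically, which is why the scheme adds no extra execution units.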
  • Melina Demertzi · Bardia Zandian · Ricardo Rojas · Murali Annavaram
    ABSTRACT: Silicon scaling has led to accelerated wearout, which results in an increased number of intermittent errors. Of the various factors that can lead to intermittent errors, device utilization is a major contributor to wearout. Every device on a chip is activated in response to either control signals or data movement resulting from instruction execution. This research proposes a systematic methodology for benchmarking the vulnerability of an instruction set architecture (ISA) toward intermittent errors. By following each instruction during its execution through the processor pipeline, we quantify how many devices each instruction activates during its execution. We propose Vulnerability to Intermittent Failures (VIF) as a metric to quantify the stress imposed on circuits by an instruction. We show how VIF varies from instruction to instruction and how different inputs can affect VIF.
    Conference Paper · Nov 2012
  •
    ABSTRACT: Wireless body area sensing networks have the potential to revolutionize health care in the near term. The coupling of biosensors with a wireless infrastructure enables the real-time monitoring of an individual's health and related behaviors continuously, as well as the provision of real-time feedback with nimble, adaptive, and personalized interventions. The KNOWME platform is reviewed, and lessons learned from system integration, optimization, and in-field deployment are provided. KNOWME is an end-to-end body area sensing system that integrates off-the-shelf sensors with a Nokia N95 mobile phone to continuously monitor and analyze the biometric signals of a subject. KNOWME development by an interdisciplinary team, along with in-laboratory and in-field deployment studies employing pediatric obesity as a case study condition to monitor and evaluate physical activity, has revealed four major challenges: (1) achieving robustness to highly varying operating environments due to subject-induced variability such as mobility or sensor placement, (2) balancing the tension between acquiring high fidelity data and minimizing network energy consumption, (3) enabling accurate physical activity detection using a modest number of sensors, and (4) designing WBANs to determine physiological quantities of interest such as energy expenditure. The KNOWME platform described in this article directly addresses these challenges.
    Article · May 2012 · IEEE Communications Magazine
  •
    ABSTRACT: Traffic monitoring using probe vehicles with GPS receivers promises significant improvements in cost, coverage, and accuracy over dedicated infrastructure systems. Current approaches, however, raise privacy concerns because they require participants to reveal their positions to an external traffic monitoring server. To address this challenge, we describe a system based on virtual trip lines and an associated cloaking technique, followed by another system design in which we relax the privacy requirements to maximize the accuracy of real-time traffic estimation. We introduce virtual trip lines which are geographic markers that indicate where vehicles should provide speed updates. These markers are placed to avoid specific privacy sensitive locations. They also allow aggregating and cloaking several location updates based on trip line identifiers, without knowing the actual geographic locations of these trip lines. Thus, they facilitate the design of a distributed architecture, in which no single entity has a complete knowledge of probe identities and fine-grained location information. We have implemented the system with GPS smartphone clients and conducted a controlled experiment with 100 phone-equipped drivers circling a highway segment, which was later extended into a year-long public deployment.
    Article · May 2012 · IEEE Transactions on Mobile Computing
  •
    ABSTRACT: KNOWME Networks is a wireless body area network with 2 triaxial accelerometers, a heart rate monitor, and a mobile phone that acts as the data collection hub. One function of KNOWME Networks is to detect physical activity (PA) in overweight Hispanic youth. The purpose of this study was to evaluate the in-laboratory recognition accuracy of KNOWME. Twenty overweight Hispanic participants (10 males; age 14.6 ± 1.8 years) underwent 4 data collection sessions consisting of 9 activities/session: lying down, sitting, sitting fidgeting, standing, standing fidgeting, standing playing an active video game, slow walking, brisk walking, and running. Data were used to train activity recognition models, and the accuracy of personalized and generalized models is reported. Overall accuracy for personalized models was 84%. The most accurately detected activity was running (96%). The models had difficulty distinguishing between the static and fidgeting categories of sitting and standing. When static and fidgeting activity categories were collapsed, the overall accuracy improved to 94%. Personalized models demonstrated higher accuracy than generalized models. KNOWME Networks can accurately detect a range of activities. KNOWME has the ability to collect and process data in real-time, building the foundation for tailored, real-time interventions to increase PA or decrease sedentary time.
    Article · Mar 2012 · Journal of Physical Activity and Health
  • Mohammad Abdel-Majeed · Mike Chen · Murali Annavaram
    ABSTRACT: To build high performance real-time sensing systems, every building block in the system should be built with a technology that allows it to achieve its best performance. Technologies like BJT and BiCMOS are better suited for building basic analog blocks like input buffers and power amplifiers, while CMOS is the best choice for digital data processing. Mixed-technology systems are traditionally built using system-in-package (SiP) techniques; SiP integration uses bonding wires or flip chip instead of on-chip integration. In this paper we study the feasibility of using 3D stacking to integrate heterogeneous blocks built using different technologies within a real-time sensing system. Several previous studies on 3D stacking focused on integrating multiple digital blocks and using through-silicon vias (TSVs) to transfer digital signals between the layers in a stack. In this paper we study the behavior of analog signals traversing TSVs and measure how well 3D stacking can enhance or limit the performance of analog and mixed analog/digital stacking. To quantify the power and performance characteristics, we modeled bonding wire, flip chip, and TSV interfaces. Using these models we show that 3D stacking of analog and analog/digital components can double the bandwidth, increase sampling frequency by nearly two orders of magnitude, and improve signal integrity by 3 dB compared to bond wires.
    Article · Mar 2012
  • Jinho Suh · Murali Annavaram · Michel Dubois
    ABSTRACT: Due to the growing trend that a Single Event Upset (SEU) can cause spatial Multi-Bit Upsets (MBUs), the effects of spatial MBUs have recently become an important yet very challenging issue, especially in large, last-level caches (LLCs) protected by error protection codes. In the presence of spatial MBUs, the strength of the protection codes becomes a critical design issue. Developing a reliability model that includes the cumulative effects of overlapping SBUs, temporal MBUs, and spatial MBUs is a very challenging problem, especially when protection codes are active. In this paper, we introduce a new framework called MACAU. MACAU is based on a Markov chain model and can compute the intrinsic MTTFs of scrubbed caches as well as benchmark caches protected by various codes. MACAU is the first framework that quantifies the failure rates of caches due to the combined effects of SBUs, temporal MBUs, and spatial MBUs.
    Conference Paper · Feb 2012
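The Markov-chain machinery behind a framework like MACAU can be sketched with a two-state absorbing chain; the transition probabilities below are invented for illustration, and the MTTF follows from the fundamental matrix N = (I - Q)^-1:

```python
def expected_steps_to_failure(Q):
    # Q holds transient-to-transient transition probabilities of a
    # 2-state chain (state 0 = clean, state 1 = single-bit upset);
    # absorption models an uncorrectable failure. The row sums of
    # N = (I - Q)^-1 give the expected steps to absorption, solved
    # in closed form here to stay dependency-free.
    a, b = Q[0]
    c, d = Q[1]
    det = (1 - a) * (1 - d) - b * c
    from_clean = ((1 - d) + b) / det
    from_upset = (c + (1 - a)) / det
    return from_clean, from_upset

# Per scrub interval: upset probability 1e-3; a scrub repairs the
# upset with probability 0.9; a further upset (failure) absorbs
# with probability 1e-3.
Q = [[1 - 1e-3, 1e-3],
     [0.9, 1 - 0.9 - 1e-3]]
mttf_clean, _ = expected_steps_to_failure(Q)
# mttf_clean is roughly 9e5 scrub intervals under these assumptions.
```

The same structure scales to the larger state spaces needed to capture overlapping SBUs and temporal/spatial MBUs, which is where a dedicated framework earns its keep.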

Publication Stats

2k Citations
25.85 Total Impact Points

Institutions

  • 2011-2014
    • University of California, Los Angeles
      • Department of Electrical Engineering
      Los Angeles, California, United States
  • 2008-2014
    • University of Southern California
      • Department of Electrical Engineering
      Los Angeles, California, United States
  • 2012
    • Raytheon BBN Technologies
      Cambridge, Massachusetts, United States
  • 2004-2007
    • Intel
      Santa Clara, California, United States
  • 2005
    • Mission College
      Santa Clara, California, United States
  • 2001
    • University of Michigan
      • Department of Electrical Engineering and Computer Science (EECS)
      Ann Arbor, Michigan, United States
  • 2000
    • Concordia University–Ann Arbor
      Ann Arbor, Michigan, United States