Figure 1: Power viruses widely used in the industry

Source publication
Conference Paper
Full-text available
To effectively design a computer system for the worst-case power consumption scenario, system architects often use hand-crafted maximum-power-consuming benchmarks at the assembly language level. These stressmarks, also called power viruses, are very tedious to generate and require significant domain knowledge. In this paper, we propose SYMPO, an au...

Context in source publication

Context 1
... of a system, recent trends have shown that the power consumption of the DRAM subsystem is significantly high [19] [17] and is predicted to increase in the future. Thus, it is important to characterize the power consumption of the entire system, rather than just the processor, while constructing max-power viruses. Our metrics include the burstiness of accesses to DRAM, captured by characterizing memory-level parallelism. The granularity of the instruction mix in the generated synthetic is very important for producing good power viruses, and our synthetic benchmark generation is more fine-grained in terms of the number of instruction types generated than Joshi et al.'s work [15].

The significant contributions of this paper are: i) we propose the usage of SYMPO, an abstract workload generator integrated with an industry-grade Genetic Algorithm (GA) tool set, the IBM SNAP [2] [26], for generating system-level power viruses in the SPARC, Alpha, and x86 ISAs; ii) we validate the efficacy of the generated power viruses on a full-system simulator by comparison with the popular stressmark MPrime (torture test), showing that our methodology results in power viruses that consume 16-40% more power than MPrime; iii) we validate the generated power virus on real hardware using an instrumented AMD quad-core Phenom II system, showing that our power virus has a power consumption higher than most of the other state-of-the-art hand-crafted power viruses; iv) we compare the efficacy of the power viruses generated in the Alpha ISA with the previous approach by Joshi et al. [15], showing that SYMPO results in the consumption of 9-24% more power than the previous approach.

It is to be noted that the worst-case power of a system is not simply the sum of the maximum power of each component. Due to underutilization of resources and contention for shared resources, such as caches or memory ports, the aggregate worst case is significantly less than the sum. The aim of this work is to find a reasonable worst-case power virus that is present in the real-world workload space.

The rest of the paper is organized as follows. In Section 2, we provide a survey of the various industry-standard power viruses along with their power consumption on real hardware. Section 3 introduces SYMPO, our power virus generation framework, and Section 4 elaborates on the experimental setup using full-system and processor simulators, with results showing the effectiveness of the generated power viruses in comparison to the MPrime based torture test. Section 5 provides a microarchitecture-independent characterization of the industry-grade power viruses. We provide the related work in Section 6 and summarize in Section 7.

There have been many industry efforts towards writing power viruses and stress benchmarks. Among them, MPrime [5], CPUburn-in [6], and CPUburn [4] are the most popular benchmarks. We first give a brief description of these power viruses and then characterize them based on microarchitecture-independent metrics. MPrime [5] is a BSD software application that searches for Mersenne prime numbers using an efficient Fast Fourier Transform (FFT) algorithm. For the past few years, MPrime has been popularly called the torture test and has been used for testing the stability of a computer system by overclockers, PC enthusiasts, and the processor design industry. This is because the program is designed to subject the processor and memory to an incredibly intense workload that quickly exposes errors.
The amount of time a processor remains successfully stable while executing this workload is used by a typical overclocker as a measure of that system's stability. MPrime has been used to test the CPU, memory, L1 and L2 caches, CPU cooling, and case cooling efficiency.

CPUburn-in [6] is advertised as an ultimate stability testing tool, written by Michal Mienik, and is also aimed at overclockers. This program attempts to heat up any x86 processor to the maximum possible operating temperature. It allows the user to adjust the CPU frequency to the practical maximum while still being sure that stability is achieved even under the most stressful conditions. The program continuously monitors for erroneous calculations, ensuring the CPU does not generate errors under load. It employs FPU-intensive functions to heat up the CPU.

CPUburn [4] is a power virus suite written in assembly language, copyrighted but freely licensed under the GNU Public License by Robert Redelmeier. The purpose of these programs is also to heat up x86 CPUs as much as possible. Unlike CPUburn-in, they are specifically optimized for different processors. FPU and ALU instructions are coded at the assembly level into an infinite loop. The goal is to maximize CPU temperature, stressing the cooling system, motherboard, and power supply. The programs are BurnP5, BurnP6, BurnK6, BurnK7, and BurnMMX. A description of each of these power viruses is given in Figure 1.

To see the effectiveness of these power viruses on real hardware, we measure their power and thermal characteristics on an AMD Phenom II X4 (K10) Processor Model 945 system. Figure 2 shows the configuration of this system. The CPU core power of this system is measured using in-system instrumentation. A specialized AMD-designed system board is used, which provides fine-grained power instrumentation for all power rails, including the CPU core. Each high-power rail, such as the CPU core, contains a Hall-effect current sensor connected at its origin. The sensor provides a 0-5 V signal that is linearly proportional to the power flowing into the rail. The voltage signal is measured by a National Instruments PCI-6255 data logger, which attaches to the current sensor through a small twisted-pair conductor. The data logger samples the current and voltage applied to each rail at a rate of 10 kHz. Since the voltage cannot be assumed to be constant due to droops, spikes, and drifts, we measure both voltage and current to calculate power. Using the data logs, application power is calculated off-line with post-processing software.

The measured power of the various power viruses is shown in Figure 3. Four copies of these benchmarks were run on the quad-core hardware until the power consumption reached a stable state, which took around 200 seconds. It can be noted that BurnK7 consumes the maximum power on this hardware, 72.1 Watts, after reaching a steady state. Even the two highest power-consuming SPEC CPU2006 workloads, 416.gamess and 453.povray, consume only 63.1 and 59.6 Watts respectively. That BurnK7 consumes the maximum power on this hardware can be attributed to the fact that the machine configurations of the AMD Phenom II (K10) and the K7 are somewhat similar. It can be observed that the power viruses generated for other machines, such as BurnP5 and BurnP6, do not consume as much power as BurnK7, again showing the importance of developing a specialized power virus for each microarchitecture.
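The off-line post-processing described above amounts to multiplying each logged voltage and current sample and averaging the result over the steady-state window. A minimal sketch, assuming a hypothetical CSV log format (the column names and file layout are illustrative, not the actual AMD/National Instruments tooling):

```python
import csv

SAMPLE_RATE_HZ = 10_000        # the data logger samples each rail at 10 kHz
STEADY_STATE_START_S = 200.0   # power stabilized roughly 200 s into each run

def average_rail_power(csv_path):
    """Average steady-state power of one rail from logged (voltage, current) samples."""
    powers = []
    with open(csv_path, newline="") as f:
        for idx, row in enumerate(csv.DictReader(f)):
            if idx / SAMPLE_RATE_HZ < STEADY_STATE_START_S:
                continue                              # skip warm-up samples
            v = float(row["cpu_core_voltage_v"])      # assumed column name
            i = float(row["cpu_core_current_a"])      # assumed column name
            powers.append(v * i)                      # both logged because V is not constant
    return sum(powers) / len(powers) if powers else 0.0

# Hypothetical usage:
# print(f"steady-state CPU core power: {average_rail_power('burnk7_log.csv'):.1f} W")
```

Averaging only the post-warm-up samples mirrors the reported methodology of letting the four benchmark copies run until power stabilizes before comparing the viruses.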
In the next two sections, we describe our power virus generation framework and its validation by comparison with the industry-grade MPrime torture test. The overall power virus generation methodology is given in Figure 4. Our framework consists of three important components, as described below.

Our workload space consists of a set of 17 dimensions falling under the categories of control flow predictability, instruction mix, instruction-level parallelism, data locality, and memory-level parallelism, as shown in Figure 5. An abstract workload synthesizer was developed to generate a C file with embedded assembly instructions following the characteristics of a given workload specification. These embedded assembly instructions are packed into a loop, and this loop is iterated until the performance characteristics converge. The workload specification consists of the following. Instruction mix: specifies the frequency of each type of instruction in the program. Each instruction type in the abstract workload model has a weight associated with it, ranging from 0 to 4. The proportion of this instruction type in the generated synthetic is governed not only by this weight but also by the weights associated with the remaining instruction types, as they are correlated with each other. As different instruction types have different latencies and power consumption, the instruction mix has a major effect on the overall power consumption of a workload. Since ...
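As a small illustration of how such weight-based instruction-mix selection could work, the sketch below normalizes per-type weights (0-4) into proportions of a fixed-size loop body, so that each type's share depends on all the other weights as well. The instruction type names and loop size are illustrative assumptions, not SYMPO's actual set or implementation.

```python
def instruction_mix(weights, instructions_per_loop=1000):
    """Turn per-instruction-type weights (0-4) into instruction counts for one loop body."""
    total = sum(weights.values())
    if total == 0:
        return {k: 0 for k in weights}
    # Each type's share is its weight relative to the sum of all weights.
    return {k: round(instructions_per_loop * w / total) for k, w in weights.items()}

# Hypothetical weight vector for one candidate workload:
weights = {"int_add": 4, "int_mul": 2, "fp_mul": 3, "load": 2, "store": 1, "branch": 1}
print(instruction_mix(weights))
# e.g. {'int_add': 308, 'int_mul': 154, 'fp_mul': 231, 'load': 154, 'store': 77, 'branch': 77}
```

Raising one weight therefore shifts the mix away from every other type, which is the correlation between instruction types that the excerpt describes.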

Citations

... Previous studies used GA-based synthesis tools to find the viruses that trigger voltage droops [4], [31], [52], [53], increase processor power [5], [24], [25], [38], [79] or temperature [32], [38]. However, the vast majority of these studies implemented models that search for the worst-case instruction sequences only, ignoring data patterns. ...
Conference Paper
Full-text available
Failures become inevitable in DRAM devices, which is a major obstacle for scaling down the density of cells in future DRAM technologies. These failures can be detected by specific DRAM tests that implement the data and memory access patterns having a strong impact on DRAM reliability. However, the design of such tests is very challenging, especially for testing DRAM devices in operation, due to an extremely large number of possible cell-to-cell interference effects and combinations of patterns inducing these effects. In this paper, we present a new framework for the synthesis of DRAM reliability stress viruses, DStress. This framework automatically searches for the data and memory access patterns that induce the worst-case DRAM error behavior regardless of the internal DRAM design. The search engine of our framework is based on Genetic Algorithms (GA) and a programming tool that we use to specify the patterns examined by the GA. To evaluate the effect of program viruses on DRAM reliability, we integrate DStress with an experimental server where 72 DRAM chips can operate under various operating parameters and temperatures. We present the results of our 7-month experimental study on the search for DRAM reliability stress viruses. We show that DStress finds the worst-case data pattern virus and the worst-case memory access virus with probabilities of 1 − 4 × 10⁻⁷ and 0.95, respectively. We demonstrate that the discovered patterns induce at least 45% more errors than the traditional data pattern micro-benchmarks used in previous studies. We show that DStress enables us to detect the marginal DRAM operating parameters, reducing DRAM power by 17.7% on average without compromising reliability. Overall, our framework facilitates the exploration of new data patterns and memory access scenarios increasing the probability of DRAM errors, which is essential for improving state-of-the-art DRAM testing mechanisms.
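Both the source publication (SYMPO) and DStress drive their search with a genetic algorithm over an abstract parameter space. A rough, generic sketch of that search structure is shown below; the parameter names and the estimate_power() fitness function are hypothetical stand-ins (in SYMPO, fitness would come from synthesizing a benchmark and measuring its power on a full-system simulator; in DStress, from error counts on the experimental DRAM server), and this is not either tool's actual implementation, nor IBM SNAP.

```python
import random

# Assumed knobs; SYMPO's actual space has 17 dimensions (see Figure 5 of the paper).
PARAM_RANGES = {
    "int_alu_weight":        (0, 4),
    "fp_mul_weight":         (0, 4),
    "mem_load_weight":       (0, 4),
    "branch_predictability": (0.5, 1.0),
    "mlp_burstiness":        (1, 16),
}

def random_candidate():
    return {k: random.uniform(lo, hi) for k, (lo, hi) in PARAM_RANGES.items()}

def mutate(cand, rate=0.2):
    child = dict(cand)
    for k, (lo, hi) in PARAM_RANGES.items():
        if random.random() < rate:
            child[k] = random.uniform(lo, hi)   # re-draw a subset of knobs
    return child

def estimate_power(cand):
    # Placeholder fitness only. A real flow would generate a benchmark from the
    # candidate and obtain its power from a simulator or instrumented hardware.
    return sum(cand.values())

def search(generations=50, pop_size=20, elite=4):
    population = [random_candidate() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=estimate_power, reverse=True)   # rank by fitness
        parents = population[:elite]                        # keep the best candidates
        population = parents + [mutate(random.choice(parents))
                                for _ in range(pop_size - elite)]
    return max(population, key=estimate_power)

print("highest-power candidate found:", search())
```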
... AR is a term used in power/performance modeling to quantify the computational intensity of a workload [34]. AR describes the switching rate of a processor component (e.g., CPU core, graphics engine, IO) for a workload when compared to the highest possible power, P_peak, that can be consumed by the most computationally-intensive workload (i.e., also known as the power-virus workload [31,77,88]). AR and P_peak can be estimated 1) offline using power modeling tools such as McPAT [77], SYMPO [31] or Intel's Blizzard [9], and 2) at runtime using activity sensors implemented in the processor components [7,10,19,30,78,102,110,126]. ...
... AR describes the switching rate of a processor component (e.g., CPU core, graphics engine, IO) for a workload when compared to the highest possible power, P_peak, that can be consumed by the most computationally-intensive workload (i.e., also known as the power-virus workload [31,77,88]). AR and P_peak can be estimated 1) offline using power modeling tools such as McPAT [77], SYMPO [31] or Intel's Blizzard [9], and 2) at runtime using activity sensors implemented in the processor components [7,10,19,30,78,102,110,126]. Load-line. ...
... From this equation, we can see that the voltage at the load input (V_cc) decreases when the current of the load (I_cc) increases (e.g., when running a workload with a high AR). Therefore, to keep the voltage at the load (V_cc) above a minimum functional voltage under even the most computationally-intensive workload (i.e., power-virus [31,77,88], for which AR=1), the input voltage (V_IN) is set to a level that provides enough guardband. ...
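The load-line equation that this excerpt refers to is not reproduced in the snippet; a standard form, stated here only as an assumption about what the citing work uses, is:

```latex
% Assumed standard load-line relation (not reproduced in the excerpt): the voltage
% seen by the load drops as the load current rises across the effective
% load-line resistance R_{LL}.
\[ V_{cc} = V_{IN} - I_{cc} \cdot R_{LL} \]
% Guardband: V_{IN} is chosen so that V_{cc} \geq V_{min} holds even at the
% maximum current drawn by a power virus (AR = 1).
```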
Preprint
Full-text available
Modern client processors typically use one of three commonly-used power delivery network (PDN) architectures: 1) motherboard voltage regulators (MBVR), 2) integrated voltage regulators (IVR), and 3) low dropout voltage regulators (LDO). We observe that the energy-efficiency of each of these PDNs varies with the processor power (e.g., thermal design power (TDP) and dynamic power-state) and workload characteristics (e.g., workload type and computational intensity). This leads to energy-inefficiency and performance loss, as modern client processors operate across a wide spectrum of power consumption and execute a wide variety of workloads. To address this inefficiency, we propose FlexWatts, a hybrid adaptive PDN for modern client processors whose goal is to provide high energy-efficiency across the processor's wide range of power consumption and workloads. FlexWatts provides high energy-efficiency by intelligently and dynamically allocating PDNs to processor domains depending on the processor's power consumption and workload. FlexWatts is based on three key ideas. First, FlexWatts combines IVRs and LDOs in a novel way to share multiple on-chip and off-chip resources and thus reduce cost, as well as board and die area overheads. This hybrid PDN is allocated for processor domains with a wide power consumption range (e.g., CPU cores and graphics engines) and it dynamically switches between two modes: IVR-Mode and LDO-Mode, depending on the power consumption. Second, for all other processor domains (that have a low and narrow power range, e.g., the IO domain), FlexWatts statically allocates off-chip VRs, which have high energy-efficiency for low and narrow power ranges. Third, FlexWatts introduces a novel prediction algorithm that automatically switches the hybrid PDN to the mode (IVR-Mode or LDO-Mode) that is the most beneficial based on processor power consumption and workload characteristics. To evaluate the tradeoffs of PDNs, we develop and open-source PDNspot, the first validated architectural PDN model that enables quantitative analysis of PDN metrics. Using PDNspot, we evaluate FlexWatts on a wide variety of SPEC CPU2006, graphics (3DMark06), and battery life (e.g., video playback) workloads against IVR, the state-of-the-art PDN in modern client processors. For a 4W thermal design power (TDP) processor, FlexWatts improves the average performance of the SPEC CPU2006 and 3DMark06 workloads by 22% and 25%, respectively. For battery life workloads, FlexWatts reduces the average power consumption of video playback by 11% across all tested TDPs (4W-50W). FlexWatts has comparable cost and area overhead to IVR. We conclude that FlexWatts provides high energy-efficiency across a modern client processor's wide range of power consumption and wide variety of workloads, with minimal overhead.
... Benchmark suites are usually built to represent the nominal behavior of real-world applications, not to mimic worst-case scenarios. However, worst-case scenarios in terms of microarchitectural activity, heat dissipation, power consumption and voltage noise [4], [8], [9], [14], [17], [18] are critical to understand the limits and sensitivities of current-generation processors, so that future systems can migrate to the most promising regions of the microarchitectural design space. These worst-case scenarios are closely tied to the microarchitecture, and must be created accordingly. ...
... 1) Generation Model: As highlighted in prior work [10], there are two prominent design models for stress-test generation: a) based on an abstract workload model and b) based on instruction-level primitives. In the abstract model [8], [9], [14], the stress-test generation process involves tuning a vector of workload generation parameters/knobs such as instruction mix, register dependency distance, memory footprint/stride patterns, and branch transition patterns. The vector is then used to generate the assembly (or high-level language) code. On the other hand, for the instruction-level frameworks [10], [17], [18], the tuning is performed directly on the instruction assembly, with per-instruction control. ...
... Inspired by adaptive learning rate based gradient methods [19], the tuning mechanism's step sizes are larger in earlier epochs and gradually become smaller, allowing for rapid convergence early on and slower but surer convergence later. To add robustness to the convergence and help avoid local minima, a random set of knobs is skipped in each tuning iteration, with the skipping probability decreasing over epochs. ...
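As a rough sketch of the schedule this excerpt describes (not MicroGrad's actual gradient-based mechanism or interface), the loop below shrinks both the per-knob update step and the per-knob skip probability across epochs, using a simple coordinate search against a caller-supplied evaluate() function:

```python
import random

def tune(knobs, evaluate, epochs=20, step0=0.5, skip0=0.5):
    """Coordinate search with decaying step size and decaying knob-skip probability."""
    for epoch in range(epochs):
        step = step0 / (1 + epoch)        # larger updates early, smaller later
        skip_prob = skip0 / (1 + epoch)   # skip fewer knobs as tuning progresses
        for name in knobs:
            if random.random() < skip_prob:
                continue                  # randomly skip this knob in this epoch
            best_score, best_val = evaluate(knobs), knobs[name]
            for delta in (+step, -step):  # try a move in each direction
                knobs[name] = best_val + delta
                score = evaluate(knobs)
                if score > best_score:
                    best_score, best_val = score, knobs[name]
            knobs[name] = best_val        # keep the best of {stay, +step, -step}
    return knobs

# Toy usage: maximize -(x-1)^2 - (y+2)^2, optimum near x=1, y=-2.
print(tune({"x": 0.0, "y": 0.0},
            lambda k: -(k["x"] - 1) ** 2 - (k["y"] + 2) ** 2))
```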
Preprint
We present MicroGrad, a centralized automated framework that is able to efficiently analyze the capabilities, limits and sensitivities of complex modern processors in the face of constantly evolving application domains. MicroGrad uses Microprobe, a flexible code generation framework, as its back-end and a Gradient Descent based tuning mechanism to efficiently enable the evolution of the test cases to suit tasks such as Workload Cloning and Stress Testing. MicroGrad can interface with a variety of execution infrastructure such as performance and power simulators as well as native hardware. Further, the modular 'abstract workload model' approach to building MicroGrad allows it to be easily extended for further use. In this paper, we evaluate MicroGrad over different use cases and architectures and show that MicroGrad can achieve greater than 99% accuracy across different tasks within a few tuning epochs and with low resource requirements. We also observe that MicroGrad's accuracy is 25 to 30% higher than that of competing techniques. At the same time, it is 1.5x to 2.5x faster or consumes 35 to 60% less compute resources (depending on implementation) than alternate mechanisms. Overall, MicroGrad's fast, resource-efficient and accurate test case generation capability allows it to perform rapid evaluation of complex processors.
... Existing benchmark suites come with limitations, as discussed in Section 2. ParaDnn is the first parameterized benchmark suite for deep learning in the literature. In the same spirit as parameterized benchmarks, synthetic benchmarks have commonly been used, such as BenchMaker [31] and SYMPO [16], which construct benchmarks with hardware-independent characteristics. Some try to match the statistical characteristics of real applications [55,33]. ...
Preprint
Full-text available
Training deep learning models is compute-intensive and there is an industry-wide trend towards hardware specialization to improve performance. To systematically benchmark deep learning platforms, we introduce ParaDnn, a parameterized benchmark suite for deep learning that generates end-to-end models for fully connected (FC), convolutional (CNN), and recurrent (RNN) neural networks. Along with six real-world models, we benchmark Google's Cloud TPU v2/v3, NVIDIA's V100 GPU, and an Intel Skylake CPU platform. We take a deep dive into TPU architecture, reveal its bottlenecks, and highlight valuable lessons learned for future specialized system design. We also provide a thorough comparison of the platforms and find that each has unique strengths for some types of models. Finally, we quantify the rapid performance improvements that specialized software stacks provide for the TPU and GPU platforms.
... A relatively large body of work [1, 2, 4, 9, 10, 19, 28-30, 32, 33, 35, 36] uses microbenchmarks to infer properties of the memory hierarchy. Another line of work [5,13,14,25] uses automatically generated microbenchmarks to characterize the energy consumption of microprocessors. Comparably little work [8,12,17,34] is targeted at instruction characterizations. ...
Conference Paper
Modern microarchitectures are some of the world's most complex man-made systems. As a consequence, it is increasingly difficult to predict, explain, let alone optimize the performance of software running on such microarchitectures. As a basis for performance predictions and optimizations, we would need faithful models of their behavior, which are, unfortunately, seldom available. In this paper, we present the design and implementation of a tool to construct faithful models of the latency, throughput, and port usage of x86 instructions. To this end, we first discuss common notions of instruction throughput and port usage, and introduce a more precise definition of latency that, in contrast to previous definitions, considers dependencies between different pairs of input and output operands. We then develop novel algorithms to infer the latency, throughput, and port usage based on automatically-generated microbenchmarks that are more accurate and precise than existing work. To facilitate the rapid construction of optimizing compilers and tools for performance prediction, the output of our tool is provided in a machine-readable format. We provide experimental results for processors of all generations of Intel's Core architecture, i.e., from Nehalem to Coffee Lake, and discuss various cases where the output of our tool differs considerably from prior work.
... Facebook recently reported that it prevented 18 potential power outages within six months in 2016 [41]. The situation would have been worse if malicious adversaries had intentionally dropped power viruses to launch power attacks [16], [17]. The consequence of a power outage could be devastating: e.g., Delta Airlines encountered a shutdown of a power source in its data center in August 2016, which caused large-scale delays and cancellations of flights [8]. ...
... Existing power attacks maximize power consumption by customizing power-intensive workloads, denoted as power viruses. For example, Ganesan et al. [16], [17] leveraged genetic algorithms to automatically generate power viruses that consume more power than normal stress benchmarks. However, launching a power attack from scratch, or while agnostic of the surrounding environment, wastes attack resources unnecessarily. ...
Article
Full-text available
Container technology provides a lightweight operating-system-level virtual hosting environment. Its emergence profoundly changes the development and deployment paradigms of multi-tier distributed applications. However, due to the incomplete implementation of system resource isolation mechanisms in the Linux kernel, some security concerns still exist for multiple containers sharing an operating system kernel on a multi-tenancy container-based cloud service. In this paper, we first present the information leakage channels we discovered that are accessible within containers. Such channels expose a spectrum of system-wide host information to containers without proper resource partitioning. By exploiting such leaked host information, it becomes much easier for malicious adversaries (acting as tenants in a container cloud) to launch attacks that might impact the reliability of cloud services. We demonstrate that the information leakage channels could be exploited to infer private data, detect and verify co-residence, build covert channels, and launch more advanced cloud-based attacks. We discuss the root causes of the containers' information leakage and propose a two-stage defense approach. As demonstrated in the evaluation, our defense is effective and incurs trivial performance overhead.
... A relatively large body of work [3,4,6,11,12,19,28,29,30,32,33,34,35] uses microbenchmarks to infer properties of the memory hierarchy. Another line of work [7,15,16,25] uses automatically generated microbenchmarks to characterize the energy consumption of microprocessors. Comparably little work [2,10,14,17] is targeted at instruction characterizations. ...
Preprint
Modern microarchitectures are some of the world's most complex man-made systems. As a consequence, it is increasingly difficult to predict, explain, let alone optimize the performance of software running on such microarchitectures. As a basis for performance predictions and optimizations, we would need faithful models of their behavior, which are, unfortunately, seldomly available. In this paper, we present the design and implementation of a tool to construct faithful models of the latency, throughput, and port usage of x86 instructions. To this end, we first discuss common notions of instruction throughput and port usage, and introduce a more precise definition of latency that, in contrast to previous definitions, considers dependencies between different pairs of input and output operands. We then develop novel algorithms to infer the latency, throughput, and port usage based on automatically-generated microbenchmarks that are more accurate and precise than existing work. To facilitate the rapid construction of optimizing compilers and tools for performance prediction, the output of our tool is provided in a machine-readable format. We provide experimental results for processors of all generations of Intel's Core architecture, i.e., from Nehalem to Coffee Lake, and discuss various cases where the output of our tool differs considerably from prior work.
... Most prior workload cloning proposals [10,11,19,20,27,36,40,41,45,50,52,53,63] exploit some form of temporal and/or spatial locality to model memory access behavior. Locality models are also useful for synthesizing stressmarks [26,28,34], to model program resource demands [16], to utilize multiple granularity architectures [31], to estimate performance of emerging memory architectures [30,55] and to optimize simulations [23]. In this section, we will discuss the state-of-the-art workload cloning proposals and their challenges. ...
Conference Paper
The growing complexity of applications poses new challenges to memory system design due to their data-intensive nature, complex access patterns, larger footprints, etc. The slow nature of full-system simulators, the difficulty of running the deep software stacks of many emerging workloads in simulation, the proprietary nature of software, etc., pose challenges to fast and accurate microarchitectural exploration of future memory hierarchies. One technique to mitigate this problem is to create spatio-temporal models of access streams and use them to explore memory system tradeoffs. However, existing memory stream models have weaknesses: they either model only temporal locality behavior or model spatio-temporal locality using global stride transitions, resulting in high storage/metadata overhead. In this paper, we propose HALO, a Hierarchical memory Access LOcality modeling technique that identifies patterns by isolating global memory references into localized streams and further zooming into each local stream, capturing multi-granularity spatial locality patterns. HALO also models the interleaving degree between localized stream accesses, leveraging coarse-grained reuse locality. We evaluate HALO's effectiveness in replicating original application performance using over 20K different memory system configurations and show that HALO achieves over 98.3%, 95.6%, 99.3% and 96% accuracy in replicating the performance of prefetcher-enabled L1 & L2 caches, TLB and DRAM respectively. HALO outperforms the state-of-the-art memory cloning schemes, WEST and STM, while using ~39X less metadata storage than STM.
... There have been many efforts towards writing power viruses and stress benchmarks. For example, SYMPO [11], an automatic system-level max-power virus generation framework that maximizes the power consumption of the CPU and the memory system, MAMPO [12], MPrime [13], and stress-ng [14] are the most popular benchmarks; they aim to increase the power consumption of the microprocessor by torturing it and have been used for testing the stability of the microprocessor during overclocking. However, power viruses are not capable of revealing pessimistic voltage margins. ...
Conference Paper
Full-text available
In this paper, we propose the employment of fast targeted programs (diagnostic micro-viruses) that aim to individually stress the main hardware components of a multicore CPU architecture that most likely determine the limits of voltage scaling, i.e., safe Vmin values. We describe in detail the complex development process for the diagnostic micro-viruses and their comprehensive validation on modern multicore CPU hardware. The combined execution of the micro-viruses takes a very short time compared to the execution of regular programs, and can quickly reveal the voltage limits of the cores and chips at voltage levels below nominal. The micro-virus-based characterization flow requires orders of magnitude less time while delivering virtually identical: (a) Vmin values for the different CPU chips, and (b) Vmin values for the different cores within a CPU chip. We evaluate our micro-virus-based characterization flow (and compare it to the SPEC-based flow) on three different chips (a nominal-grade part and two corner parts) of Applied Micro's X-Gene 2 micro-server family (with 8-core ARMv8-based CPUs manufactured in 28nm). We report detailed validation and evaluation results that prove the effectiveness of the micro-viruses for the fast and accurate identification of the voltage-margin variability among the chips and the cores of a multicore CPU.