Conference Paper

SPECTR: Formal Supervisory Control and Coordination for Many-core Systems Resource Management


Abstract

Resource management strategies for many-core systems need to enable sharing of resources such as power, processing cores, and memory bandwidth while coordinating the priority and significance of system- and application-level objectives at runtime in a scalable and robust manner. State-of-the-art approaches use heuristics or machine learning for resource management, but unfortunately lack formalism in providing robustness against unexpected corner cases. While recent efforts deploy classical control-theoretic approaches with some guarantees and formalism, they lack scalability and autonomy to meet changing runtime goals. We present SPECTR, a new resource management approach for many-core systems that leverages formal supervisory control theory (SCT) to combine the strengths of classical control theory with state-of-the-art heuristic approaches to efficiently meet changing runtime goals. SPECTR is a scalable and robust control architecture and a systematic design flow for hierarchical control of many-core systems. SPECTR leverages SCT techniques such as gain scheduling to allow autonomy for individual controllers. It facilitates automatic synthesis of the high-level supervisory controller and its property verification. We implement SPECTR on an Exynos platform containing ARM's big.LITTLE-based heterogeneous multi-processor (HMP) and demonstrate that SPECTR's use of SCT is key to managing multiple interacting resources (e.g., chip power and processing cores) in the presence of competing objectives (e.g., satisfying QoS vs. power capping). The principles of SPECTR are easily applicable to any resource type and objective as long as the management problem can be modeled using dynamical systems theory (e.g., difference equations), discrete-event dynamic systems, or fuzzy dynamics.
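To make the hierarchy described above more concrete, here is a minimal sketch (in Python, with entirely hypothetical controllers, gains, and numbers, not SPECTR's actual design) of a supervisory layer that re-prioritizes two low-level trackers, one for QoS and one for the power cap, by adjusting their reference values when the chip nears its power budget.

    # Minimal illustrative sketch, not SPECTR's implementation: a supervisor
    # adjusts the references handed to two low-level controllers when the
    # power-cap and QoS objectives start to conflict. All values hypothetical.
    class PIController:
        def __init__(self, kp, ki):
            self.kp, self.ki, self.acc = kp, ki, 0.0

        def step(self, reference, measurement):
            error = reference - measurement
            self.acc += error
            return self.kp * error + self.ki * self.acc   # actuation delta

    class Supervisor:
        def __init__(self, power_cap, qos_target):
            self.power_cap, self.qos_target = power_cap, qos_target

        def references(self, measured_power):
            # Near the cap: relax the QoS reference so the power loop can throttle.
            if measured_power > 0.9 * self.power_cap:
                return 0.8 * self.qos_target, self.power_cap
            return self.qos_target, self.power_cap

    qos_ctrl, pwr_ctrl = PIController(0.5, 0.1), PIController(0.4, 0.05)
    sup = Supervisor(power_cap=6.0, qos_target=30.0)     # watts, frames/s (made up)
    qos_ref, pwr_ref = sup.references(measured_power=5.8)
    print(qos_ctrl.step(qos_ref, 24.0), pwr_ctrl.step(pwr_ref, 5.8))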


... Rahmani et al. [39] use Supervisory Control Theory to address the issue of dynamic goals by changing the priorities of low-level controllers adaptively. However, designing a controller requires a stable (sub)system model to be identified. ...
... However, Yukta, like classical controllers, lacks self-adaptivity (6), which enables rapid responses to abrupt runtime changes. In SPECTR [39], Rahmani et al. also solve the scalability issue via hierarchy. SPECTR uses Supervisory Control Theory at the top of its hierarchy, which additionally provides self-adaptivity (6) in conjunction with classical controllers by coordinating their reference values and updating priorities dynamically. ...
... [Table 1 in the citing work, "Major on-chip resource management approaches and the key challenges they address (* = uniquely addressed by SOSA)", compares estimation-/model-based heuristics [10,11,13,17,24], classical control theory [22,26,35,40,41], machine learning [4,16,18], hierarchical control [36,39], and hybrid control + machine learning [SOSA] against key challenges such as robustness.] ... is updated using reinforcement learning at runtime; however, the learning is done off-device and requires communication with a remote server. Continuously updating a statistical model on device was applied by Kasture et al. in [25] to control DVFS in datacenters for latency-critical workloads. ...
Conference Paper
Resource management strategies for many-core systems dictate the sharing of resources among applications such as power, processing cores, and memory bandwidth in order to achieve system goals. System goals require consideration of both system constraints (e.g., power envelope) and user demands (e.g., response time, energy-efficiency). Existing approaches use heuristics, control theory, and machine learning for resource management. They all depend on static system models, requiring a priori knowledge of system dynamics, and are therefore too rigid to adapt to emerging workloads or changing system dynamics. We present SOSA, a cross-layer hardware/software hierarchical resource manager. Low-level controllers optimize knob configurations to meet potentially conflicting objectives (e.g., maximize throughput and minimize energy). SOSA accomplishes this for many-core systems and unpredictable dynamic workloads by using rule-based reinforcement learning to build subsystem models from scratch at runtime. SOSA employs a high-level supervisor to respond to changing system goals due to operating condition, e.g., switch from maximizing performance to minimizing power due to a thermal event. SOSA's supervisor translates the system goal into low-level objectives (e.g., core instructions-per-second (IPS)) in order to control subsystems by coordinating numerous knobs (e.g., core operating frequency, task distribution) towards achieving the goal. The software supervisor allows for flexibility, while the hardware learners allow quick and efficient optimization. We evaluate a simulation-based implementation of SOSA and demonstrate SOSA's ability to manage multiple interacting resources in the presence of conflicting objectives, its efficiency in configuring knobs, and adaptability in the face of unpredictable workloads. Executing a combination of machine-learning kernels and microbenchmarks on a multicore system-on-a-chip, SOSA achieves target performance with less than 1% error starting with an untrained model, maintains the performance in the face of workload disturbance, and automatically adapts to changing constraints at runtime. We also demonstrate the resource manager with a hardware implementation on an FPGA.
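As a rough picture of the goal-translation step the abstract mentions (the supervisor turning a system goal into per-subsystem objectives such as core IPS), here is a tiny sketch; the proportional split and the weights are invented for illustration and are not SOSA's actual policy.

    # Hypothetical sketch of translating a chip-level goal into per-core IPS
    # targets that low-level learners could then pursue. Not SOSA's policy.
    def translate_goal(system_ips_goal, core_weights):
        total = sum(core_weights)
        return [system_ips_goal * w / total for w in core_weights]

    # Example: 4 big cores weighted 2.0 and 4 LITTLE cores weighted 1.0.
    targets = translate_goal(system_ips_goal=8.0e9, core_weights=[2.0] * 4 + [1.0] * 4)
    print(["%.2e" % t for t in targets])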
... Where the systems diverge is in target application/hardware (embedded vs. enterprise) and choice of QoSM (feedback control vs. machine learning). In the past couple of years, control-theoretic approaches to system power management have been proposed, including those based on SISO control [38] and supervisory control theory (SCT) [39]. We agree that the formalism of control theory has its place in power management; however, we believe the very stability and rigor obtained by control theory can limit the optimizations available when interactions between systems become more complex. ...
... The ODROID XU4 is built around a Samsung Exynos5422, an ARM big.LITTLE processor. The XU3 and XU4 are very similar and have been used extensively as a test platform in previous research into intelligent power management [39, 43-45, 53, 65]. The ARM big.LITTLE heterogeneous multiprocessing architecture (HMP) consists of two clusters of cores: four higher-performance, higher-power Cortex-A15 cores and four lower-performance, lower-power Cortex-A7 cores. ...
Article
Full-text available
With the computational systems of even embedded devices becoming ever more powerful, there is a need for more effective and pro-active methods of dynamic power management. The work presented in this paper demonstrates the effectiveness of a reinforcement-learning based dynamic power manager placed in a software framework. This combination of Q-learning for determining policy and the software abstractions provide many of the benefits of co-design, namely, good performance, responsiveness and application guidance, with the flexibility of easily changing policies or platforms. The Q-learning based Quality of Service Manager (2QoSM) is implemented on an autonomous robot built on a complex, powerful embedded single-board computer (SBC) and a high-resolution path-planning algorithm. We find that the 2QoSM reduces power consumption up to 42% compared to the Linux on-demand governor and 10.2% over a state-of-the-art situation aware governor. Moreover, the performance as measured by path error is improved by up to 6.1%, all while saving power.
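For readers unfamiliar with the mechanics behind such a governor, the following is a generic tabular Q-learning loop for picking a frequency level; the state encoding, reward, and frequency table are invented here and do not reproduce the 2QoSM formulation.

    import random

    # Generic tabular Q-learning sketch for a DVFS policy (illustrative only).
    FREQS = [600, 1000, 1400, 1800]           # MHz levels (hypothetical)
    ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1
    Q = {}                                     # (state, action) -> value

    def choose(state):
        if random.random() < EPSILON:
            return random.randrange(len(FREQS))
        return max(range(len(FREQS)), key=lambda a: Q.get((state, a), 0.0))

    def update(state, action, reward, next_state):
        best_next = max(Q.get((next_state, a), 0.0) for a in range(len(FREQS)))
        old = Q.get((state, action), 0.0)
        Q[(state, action)] = old + ALPHA * (reward + GAMMA * best_next - old)

    # One toy step: state buckets (utilization, path error); reward favors low
    # power while the robot stays on its planned path.
    s = (2, 0)
    a = choose(s)
    update(s, a, reward=1.0 - 0.2 * a, next_state=(1, 0))
    print(FREQS[a], Q)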
... It is known that machine learning-based solution space exploration cannot guarantee finding stable behaviors at all times. Hence, we deploy a reflective supervisory control layer (e.g., SPECTR [8]) to monitor and, when necessary, guide the emergent behaviors by applying tighter constraints to the LCT leaf controllers. An archive-based backup policy provides the ability to adhere to constraints. ...
... Research in intelligent management of computer systems has established hierarchy as an effective way to provide coordination and adaptivity to controllers in a scalable manner [8]. SOSA [16] specifically implements a resource management hierarchy with a supervisor that coordinates distributed LCTs to achieve a global goal. ...
Conference Paper
Full-text available
MPSoCs increasingly depend on adaptive resource management strategies at runtime for efficient utilization of resources when executing complex application workloads. In particular, conflicting demands for adequate computation performance and power-/energy-efficiency constraints make desired application goals hard to achieve. We present a hierarchical, cross-layer hardware/software resource manager capable of adapting to changing workloads and system dynamics with zero initial knowledge. The manager uses rule-based reinforcement learning classifier tables (LCTs) with an archive-based backup policy as leaf controllers. The LCTs directly manipulate and enforce MPSoC building block operation parameters in order to explore and optimize potentially conflicting system requirements (e.g., meeting a performance target while staying within the power constraint). A supervisor translates system requirements and application goals into per-LCT objective functions (e.g., core instructions-per-second (IPS). Thus, the supervisor manages the possibly emergent behavior of the low-level LCT controllers in response to 1) switching between operation strategies (e.g., maximize performance vs. minimize power; and 2) changing application requirements. This hierarchical manager leverages the dual benefits of a software supervisor (enabling flexibility), together with hardware learners (allowing quick and efficient optimization). Experiments on an FPGA prototype confirmed the ability of our approach to identify optimized MPSoC operation parameters at runtime while strictly obeying given power constraints. Index Terms-Backup-based reinforcement machine learning, MPSoC runtime management, hierarchical reflective control
... The applications periodically issue heartbeats and inform the system about their performance. To evaluate our proposed approach, we use a set of synthetic micro-benchmarks with attributes that reflect the interactive and I/O-dependent nature of applications [4], [14]. These micro-benchmarks have configurable active and idle periods to emulate a wide range of workload distribution patterns. ...
... Resource Management Several works on run-time resource management targeted meeting performance requirements while optimizing power consumption [1], [2], [4], [5], [14]. The HPM scheduler presented in [2] deployed several proportional-integral-derivative (PID) controllers for resource sharing, DVFS and task migration decisions. ...
Conference Paper
Full-text available
Run-time resource allocation of heterogeneous multi-core systems is challenging with varying workloads and limited power and energy budgets. User interaction within these systems changes the performance requirements, often conflicting with concurrent applications' objectives and system constraints. Current resource allocation approaches focus on optimizing a fixed objective, ignoring the variation in system and application objectives at run-time. For efficient resource allocation, the system has to operate autonomously by formulating a hierarchy of goals. We present goal-driven autonomy (GDA) for on-chip resource allocation decisions, which allows systems to generate and prioritize goals in response to workload and system dynamics. We implemented a proof-of-concept resource management framework that integrates the proposed goal management control to meet power, performance, and user requirements simultaneously. Experimental results on an Exynos platform containing ARM's big.LITTLE-based heterogeneous multi-processor (HMP) show the effectiveness of GDA for efficient resource allocation in comparison with existing fixed-objective policies.
... Supervisory control provides the opportunity to benefit from both classical control-theoretic methods and heuristics in a robust fashion. The SCT hierarchy in Fig. 6.6 is successfully used to manage quality-of-service (QoS) goals within a power budget on an HMP in [18]. ...
Chapter
Full-text available
In this chapter, we explore adaptive resource management techniques for cyber-physical systems-on-chip that employ principles of computational self-awareness to varying degrees, specifically reflection. By supporting various self-X properties, systems gain the ability to reason about runtime configuration decisions by considering the significance of competing objectives, user requirements, and operating conditions, while executing unpredictable workloads.
... With such a large number of possible configurations, mobile platforms require intelligent runtime management in order to achieve system goals for complex workloads [14,15]. In this part of the tutorial, we will present lines of research [16,17,18,19] addressing key challenges for achieving computational self-awareness that can make the design, maintenance, and operation of complex, heterogeneous systems adaptive, autonomous, and highly efficient. ...
Conference Paper
The overlap of the two established fields of cyber-physical systems and self-aware computing systems constitutes a challenging class of systems that require autonomy and must satisfy multiple, possibly conflicting constraints (e.g., performance, timeliness, energy, reliability). Self-aware cyber-physical systems are situated in dynamic physical environments and constrained in their resources; they understand their own state and that of their environment. Based on that understanding, they are able to make appropriate decisions autonomously at runtime with high efficiency. In this tutorial, we will review the state of the art of this exciting domain.
... These techniques apply PID-like control to various parts of the system, and provide empirical demonstrations of overall system behavior rather than guarantees. Recent work on optimal resource allocation focuses on designing sophisticated control systems, for example using linear quadratic Gaussian control [13,21,22]. Unlike SLAMBooster, they do not exploit application-specific properties to perform control. ...
Preprint
Simultaneous Localization and Mapping (SLAM) is the problem of constructing a map of an agent's environment while localizing or tracking the mobile agent's position and orientation within the map. Algorithms for SLAM have high computational requirements, which has hindered their use on embedded devices. Approximation can be used to reduce the time and energy requirements of SLAM implementations as long as the approximations do not prevent the agent from navigating correctly through the environment. Previous studies of approximation in SLAM have assumed that the entire trajectory of the agent is known before the agent starts to move, and they have focused on offline controllers that use features of the trajectory to set approximation knobs at the start of the trajectory. In practice, the trajectory is not usually known ahead of time, and allowing knob settings to change dynamically opens up more opportunities for reducing computation time and energy. We describe SLAMBooster, an application-aware online control system for SLAM that adaptively controls approximation knobs during the motion of the agent. SLAMBooster is based on a control technique called hierarchical proportional control, but our experiments showed this application-agnostic control led to an unacceptable reduction in the quality of localization. To address this problem, SLAMBooster exploits domain knowledge: it uses features extracted from input frames and from the estimated motion of the agent in its algorithm for controlling approximation. We implemented SLAMBooster in the open-source SLAMBench framework. Our experiments show that SLAMBooster reduces the computation time and energy consumption by around half on average on an embedded platform, while maintaining the accuracy of the localization within reasonable bounds. These improvements make it feasible to deploy SLAM on a wider range of devices.
... The middleware implements closed-loop resource management policies and embeds hierarchical system models that use sensory data to predict how the system state may change given new actuation actions. The middleware offers scalable and autonomous resource management by leveraging formal supervisory control theory, combining the strengths of classical control theory with state-of-the-art heuristic approaches to efficiently achieve changing run-time goals [19]. It also provides robustness and stability guarantees to the power management unit by using an adaptive control-theoretic technique called gain scheduling, which decomposes the entire nonlinear operating region of the DVFS controller into linear sub-regions and adaptively switches between static linear feedback controllers designed for each operating sub-region [20]. ...
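The gain-scheduling idea mentioned in the excerpt above can be pictured with a small sketch: different static feedback gains are selected depending on the operating sub-region the system is currently in. The regions and gain values below are invented for illustration and are not taken from the cited middleware.

    # Illustrative gain-scheduled DVFS step: pick static PI gains for the current
    # operating sub-region (here, a utilization range), then apply linear feedback.
    GAIN_SCHEDULE = [
        # (util_low, util_high, kp, ki) -- hypothetical sub-regions and gains
        (0.0, 0.3, 0.2, 0.02),
        (0.3, 0.7, 0.5, 0.05),
        (0.7, 1.01, 0.9, 0.10),
    ]

    def select_gains(utilization):
        for lo, hi, kp, ki in GAIN_SCHEDULE:
            if lo <= utilization < hi:
                return kp, ki
        raise ValueError("utilization out of range")

    integral = 0.0
    def dvfs_step(freq_mhz, target_perf, measured_perf, utilization):
        global integral
        kp, ki = select_gains(utilization)
        error = target_perf - measured_perf
        integral += error
        return freq_mhz + kp * error + ki * integral   # next frequency command

    print(dvfs_step(freq_mhz=1200.0, target_perf=30.0, measured_perf=26.0, utilization=0.8))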
Chapter
The increasing complexity and unpredictability of emerging applications makes it challenging for multi-processor system-on-chips to satisfy their performance requirements while keeping power consumption within bounds. In order to tackle this problem, the research community has focused on developing dynamic resource managers that aim to optimize runtime parameters, such as clock frequency, voltage, and task mapping. There is a large diversity in the approaches proposed in this context, but a class of resource managers that has gained traction recently is that of reinforcement learning-based controllers. In this paper, we propose CoLeCTs, a resource manager that enhances the state-of-the-art resource manager SOSA by employing a joint reward assignment function and enabling collaborative information exchange among multiple learning agents. In this manner we tackle the suboptimal determination of local performance targets for heterogeneous applications and allow cooperative decision making for the learning agents. We evaluate and quantify the benefits of our approach via trace-based simulations. Keywords: MPSoCs, resource management, DVFS, reinforcement learning, LCTs, cooperation.
Article
Computing is taking a central role in advancing science, technology, and society, facilitated by increasingly capable systems. Computers are expected to perform a variety of tasks, including life-critical functions, while the resources they require (such as storage and energy) are becoming increasingly limited. To meet expectations, computers use control algorithms that monitor the requirements of the applications they run and reconfigure themselves in response.
Article
A challenge related to the design of many-core systems in recent technology nodes is the power density, which may induce violations of safe temperature limits and reliability degradation. Dynamic Thermal Management (DTM) techniques rely on accurate temperature estimation obtained from thermal sensors or analytical temperature models. This work proposes a Temperature Estimation Accelerator (TEA) to estimate the temperature of processing elements at runtime in a fast and energy-efficient way compared to a software approach. Thermal monitoring based on TEA provides fine-grained temperature data and can enable DTMs to deal with unpredictable events, such as network congestion, by actuating in large many-core systems at runtime.
Book
Full-text available
This Open Access book celebrates Professor Peter Marwedel's outstanding achievements in compilers, embedded systems, and cyber-physical systems. The contributions in the book summarize the content of invited lectures given at the workshop “Embedded Systems” held at the Technical University Dortmund in early July 2019 in honor of Professor Marwedel's seventieth birthday.
• Provides a comprehensive view from leading researchers with respect to the past, present, and future of the design of embedded and cyber-physical systems;
• Discusses challenges and (potential) solutions from theoreticians and practitioners on modeling, design, analysis, and optimization for embedded and cyber-physical systems;
• Includes coverage of model verification, communication, software runtime systems, operating systems, and real-time computing.
Article
Large-scale datacenters often host latency-sensitive services that have stringent Quality-of-Service (QoS) requirements and experience diurnal load patterns. Co-locating best-effort applications that have no QoS requirement with the latency-sensitive services has been widely used to improve the resource utilization of datacenters with careful shared resource management. However, existing co-location techniques tend to cause power overload problems on power-constrained servers because they ignore power consumption. To this end, we propose Sturgeon, a runtime system that proactively manages resources between co-located applications in a power-constrained environment, to ensure the QoS of latency-sensitive services while maximizing the throughput of best-effort applications. Our investigation shows that, at a given load, there are multiple feasible resource configurations that meet both the QoS requirement and the power budget, while one of them yields the maximum throughput of best-effort applications. To find such a configuration, we establish models to accurately predict the performance and power consumption of the co-located applications. Sturgeon monitors the QoS of the services periodically in order to eliminate potential QoS violations caused by unpredictable interference. Besides, when the datacenter hosts different types of applications to perform co-location, Sturgeon places applications with their preferred candidates to improve the overall throughput. The experimental results show that at the server level Sturgeon improves the throughput of the best-effort application by 25.43 percent compared to the state-of-the-art technique, while guaranteeing the 95%-ile latency within the QoS target; at the cluster level, Sturgeon improves the overall throughput of best-effort applications by 13.74 percent compared to the baseline.
Article
Embodied self-aware computing systems are embedded in a physical environment with a rich set of sensors and actuators to interact both with their environment and with their own embodiment. Through this interaction, they learn about their situation, their own state, and their performance. Although they are application-specific like traditional embedded systems (ESs), they are significantly more flexible, robust, and autonomous; they can adapt to a wide range of environmental variation and can cope with deterioration and shortcomings of their own performance. As such, embodied self-aware computing systems are an evolution of traditional embedded and cyber-physical systems in the direction of more autonomy, robustness, and flexibility. Whereas traditional ESs operate in a changing world while demanding unchanging and fully characterized computing resources, embodied self-aware computing systems adapt to a changing world and to changing computing resources. This article surveys the methods and methodologies used for embodied self-aware computing systems, structured along the faculties of: 1) sensory observation and abstraction; 2) self-aware assessment; and 3) hierarchical goals and control. The discussion is exemplified by application cases in the areas of systems-on-chip, control systems, health monitoring, and condition monitoring in industrial production systems.
Article
This paper describes the development of a software architectural framework for implementing compute-aware control systems, where the term "compute-aware" describes controllers that can modify existing low-level computing platform power managers in response to the needs of the physical system controller. This level of interaction means that high-level decisions can be made as to when to operate the computing platform in a power-savings mode or a high-performance mode in response to situation awareness of the physical system. The framework is demonstrated experimentally on a mobile robot platform. In this example, a situation-aware governor is developed that adjusts the speed of the processor based on the physical performance of the robot as it traverses a path through obstacles. The results show that the situation-aware governor achieves overall power savings of up to 38.9% with 1.3% degradation in performance compared to the static high-power strategy.
Conference Paper
Resource control in heterogeneous computers built with subsystems from different vendors is challenging. There is a tension between the need to quickly generate local decisions in each subsystem and the desire to coordinate the different subsystems for global optimization. In practice, global coordination among subsystems is considered hard, and current commercial systems use centralized controllers. The result is high response time and high design cost due to lack of modularity. To control emerging heterogeneous computers effectively, we propose a new control framework called Tangram that is fast, globally coordinated, and modular. Tangram introduces a new formal controller that combines multiple engines for optimization and safety, and has a standard interface. Building the controller for a subsystem requires knowing only about that subsystem. As a heterogeneous computer is assembled, the controllers in the different subsystems are connected hierarchically, exchanging standard coordination signals. To demonstrate Tangram, we prototype it in a heterogeneous server that we assemble using components from multiple vendors. Compared to state-of-the-art control, Tangram reduces, on average, the execution time of heterogeneous applications by 31% and their energy-delay product by 39%.
Article
As computing platforms increasingly embrace heterogeneity, runtime resource managers need to efficiently, dynamically, and robustly manage shared resources (e.g., cores, power budgets, memory bandwidth). To address the complexities in heterogeneous systems, state-of-the-art techniques that use heuristics or machine learning have been proposed. On the other hand, conventional control theory can be used for formal guarantees, but may face unmanageable complexity for modeling system dynamics of complex heterogeneous systems. We address this challenge through HESSLE-FREE (Heterogeneous Systems Leveraging Fuzzy Control for Runtime Resource Management): an approach leveraging fuzzy control theory that combines the strengths of classical control theory together with heuristics to form a light-weight, agile, and efficient runtime resource manager for heterogeneous systems. We demonstrate the efficacy of HESSLE-FREE executing on a NVIDIA Jetson TX2 platform (containing a heterogeneous multi-processor with a GPU) to show that HESSLE-FREE: 1) provides opportunity for optimization in the controller and stability analysis to enhance the confidence in the reliability of the system; 2) coordinates heterogeneous compute units to achieve desired objectives (e.g., QoS, optimal power references, FPS) efficiently and with lower complexity, and 3) eases the burden of system specification.
Article
In modern heterogeneous MPSoCs, the management of shared memory resources is crucial in delivering end-to-end QoS. Previous frameworks have either focused on singular QoS targets or on the allocation of partitionable resources among CPU applications at relatively slow timescales. However, heterogeneous MPSoCs typically require instant response from the memory system, where most resources cannot be partitioned. Moreover, the health of different cores in a heterogeneous MPSoC is often measured by diverse performance objectives. In this work, we propose the Self-Aware Resource Allocation framework for heterogeneous MPSoCs. Priority-based adaptation allows cores to use different performance targets and self-monitor their own intrinsic health. In response, the system allocates non-partitionable resources based on priorities. The proposed framework meets a diverse range of QoS demands from heterogeneous cores. Moreover, we present a runtime scheme to configure priority-based adaptation so that the distinct sensitivities of heterogeneous QoS targets with respect to memory allocation can be accommodated. In addition, the priority of best-effort cores can also be regulated.
Conference Paper
Mobile apps have become indispensable in our daily lives, but many apps are not designed to be energy-aware and may consume the constrained resources on mobile devices in a wasteful manner. Blindly throttling heavy resource usage, while it helps reduce energy consumption, prohibits apps from taking advantage of the resources to do useful work. We argue that addressing this issue requires the mobile OS to continuously assess whether a resource is still truly needed even after it is granted to an app. This paper proposes that lease, a mechanism commonly used in distributed systems, is a well-suited abstraction in resource-constrained mobile devices to mitigate app energy misbehavior. We design a lease-based, utilitarian resource management mechanism, LeaseOS, that analyzes the utility of a resource to an app at each lease term, and then makes lease decisions based on the utility. We implement LeaseOS on top of the latest Android OS and evaluate it with 20 real-world apps with energy bugs. LeaseOS reduces wasted power by 92% on average and significantly outperforms the state-of-the-art Android Doze and DefDroid. It also did not cause any usability disruption to the evaluated apps.
Article
Two critical quality factors for mobile devices are battery life and user-perceived performance of User Interface (UI) events. Unfortunately, state-of-the-art solutions have at least one of the following three limitations: 1) they cannot efficiently handle concurrent UI events of multiple apps (either foreground or background) and so may lead to performance imbalance and high energy consumption, 2) they try to regulate UI event performance periodically and thus may not efficiently handle the aperiodicity of interactive events, which can result in poor responsiveness or high overheads, and 3) they rely mainly on CPU frequency/voltage scaling and so may have limited energy savings. In this paper, we present SURF, Supervisory control of User-perceived peRFormance, which is designed to overcome the three limitations. First, it dynamically allocates resources to concurrent UI events for balanced performance. Second, SURF uses supervisory control theory to handle the aperiodicity of user action events. Third, it optimizes the allocation of UI events to CPU cores for additional energy savings. We test SURF on several mobile device models with real-world open-source apps and show that, without causing perceivable performance degradation, it can reduce the CPU energy consumption by 28-98% compared to state-of-the-art solutions.
Article
Full-text available
The energy consumption of DRAM is a critical concern in modern computing systems. Improvements in manufacturing process technology have allowed DRAM vendors to lower the DRAM supply voltage conservatively, which reduces some of the DRAM energy consumption. We would like to reduce the DRAM supply voltage more aggressively, to further reduce energy. Aggressive supply voltage reduction requires a thorough understanding of the effect voltage scaling has on DRAM access latency and DRAM reliability. In this paper, we take a comprehensive approach to understanding and exploiting the latency and reliability characteristics of modern DRAM when the supply voltage is lowered below the nominal voltage level specified by DRAM standards. Using an FPGA-based testing platform, we perform an experimental study of 124 real DDR3L (low-voltage) DRAM chips manufactured recently by three major DRAM vendors. We find that reducing the supply voltage below a certain point introduces bit errors in the data, and we comprehensively characterize the behavior of these errors. We discover that these errors can be avoided by increasing the latency of three major DRAM operations (activation, restoration, and precharge). We perform detailed DRAM circuit simulations to validate and explain our experimental findings. We also characterize the various relationships between reduced supply voltage and error locations, stored data patterns, DRAM temperature, and data retention. Based on our observations, we propose a new DRAM energy reduction mechanism, called Voltron. The key idea of Voltron is to use a performance model to determine by how much we can reduce the supply voltage without introducing errors and without exceeding a user-specified threshold for performance loss. Our evaluations show that Voltron reduces the average DRAM and system energy consumption by 10.5% and 7.3%, respectively, while limiting the average system performance loss to only 1.8%, for a variety of memory-intensive quad-core workloads. We also show that Voltron significantly outperforms prior dynamic voltage and frequency scaling mechanisms for DRAM.
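A much-simplified way to picture the Voltron-style selection step: scan candidate supply voltages from low to high and keep the lowest one whose predicted performance loss stays under the user threshold. The loss model below is a made-up stand-in, not the model from the paper.

    # Simplified sketch of model-guided voltage selection (illustrative only).
    def predicted_perf_loss(voltage, nominal=1.35):
        # Hypothetical: lower voltage forces longer DRAM timings, costing performance.
        return max(0.0, (nominal - voltage) * 0.12)    # fraction of performance lost

    def pick_voltage(candidates, max_loss):
        for v in sorted(candidates):                   # try the lowest voltage first
            if predicted_perf_loss(v) <= max_loss:
                return v
        return max(candidates)                         # fall back to nominal voltage

    print(pick_voltage([1.15, 1.20, 1.25, 1.30, 1.35], max_loss=0.018))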
Conference Paper
Full-text available
To meet the performance and energy efficiency demands of emerging complex and variable workloads, heterogeneous many-core architectures are increasingly being deployed, necessitating operating systems support for adaptive task allocation to efficiently exploit this heterogeneity in the face of unpredictable workloads. We present SPARTA, a throughput-aware runtime task allocation approach for Heterogeneous Many-core Platforms (HMPs) to achieve energy efficiency. SPARTA collects sensor data to characterize tasks at runtime and uses this information to prioritize tasks when performing allocation in order to maximize energy-efficiency (instructions-per-Joule) without sacrificing performance. Our experimental results on heterogeneous many-core architectures executing mixes of MiBench and PARSEC benchmarks demonstrate energy reductions of up to 23% when compared to state-of-the-art alternatives. SPARTA is also scalable with low overhead, enabling energy savings in large-scale architectures with up to hundreds of cores.
Conference Paper
Full-text available
Power capping techniques are used to restrict the power consumption of computer systems to a thermally safe limit. Current many-core systems employ dynamic voltage and frequency scaling (DVFS), power gating (PG), and scheduling methods as actuators for power capping. These knobs are oriented towards power actuation, while the need for performance and energy savings is increasing in the dark silicon era. To address this, we propose approximation (APPX) as another knob for closed-loop power management, lending performance and energy efficiency to existing power capping techniques. We use approximation in a proactive way for long-term performance-energy objectives, complementing the short-term reactive power objectives. We implement an approximation-enabled power management framework, APPEND, that dynamically chooses an application with an appropriate level of approximation from a set of variable-accuracy implementations. Subject to the system dynamics, our power manager chooses an effective combination of knobs (APPX, DVFS, and PG) in a hierarchical way to ensure power capping with performance and energy gains. Our proposed approach yields 1.5x higher throughput, improves latency by up to 5x, and provides better performance per energy and dark silicon mitigation compared to state-of-the-art power management techniques over a set of applications ranging from high to no error resilience.
Conference Paper
Full-text available
A number of techniques have been proposed to provide runtime performance guarantees while minimizing power consumption. One drawback of existing approaches is that they work only on a fixed set of components (or actuators) that must be specified at design time. If new components become available, these management systems must be redesigned and reimplemented. In this paper, we propose PTRADE, a novel performance management framework that is general with respect to the components it manages. PTRADE can be deployed to work on a new system with different components without redesign and reimplementation. PTRADE's generality is demonstrated through the management of performance goals for a variety of benchmarks on two different Linux/x86 systems and a simulated 128-core system, each with different components governing power and performance tradeoffs. Our experimental results show that PTRADE provides generality while meeting performance goals with low error and close to optimal power consumption.
Article
Many-core systems are highly complex and require thorough orchestration of different goals across the computing abstraction stack to satisfy embedded system constraints. Contemporary resource management approaches typically focus on a fixed objective, while neglecting the need for replanning (i.e., updating the objective function). This trend is particularly observable in existing resource allocation and application mapping approaches that allocate a task to a tile to maximize a fixed objective (e.g., the cores’ and network’s performance), while minimizing others (e.g., latency and power consumption). However, embedded system goals typically vary over time, and also over abstraction levels, requiring a new approach to orchestrate these varying goals. We motivate the problem by showcasing conflicts resulting from state-of-the-art fixed-objective resource allocation approaches, and highlight the need to incorporate dynamic goal management from the very early stages of design. We then present the concept of a Hierarchical Dynamic Goal Manager (HDGM) that considers the priority, significance, and constraints of each application, while holistically coupling the overlapping and/or contradicting goals of different applications to satisfy embedded system constraints.
Article
Competitive graphics performance is crucial for the success of state-of-the-art mobile processors. High graphics performance comes at the cost of higher power consumption, which elevates the temperature due to limited cooling solutions. To avoid thermal violations, the system needs to operate within a static or dynamically changing power budget. Since the power budget is a shared resource, there is a strong demand for effective dynamic power budgeting techniques. This paper presents a novel technique to efficiently distribute the power budget among the CPU and GPU cores, while maximizing performance. The proposed technique is evaluated using a state-of-the-art mobile platform using industrial benchmarks, and an in-house simulator. The experiments on the mobile platform show an average increase of 15% in frame rate, when compared to default power allocation algorithms.
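One way to picture the budget-distribution problem described above is a search over candidate CPU/GPU splits of a fixed budget, keeping the split with the highest predicted frame rate; the frame-rate model and numbers below are invented for illustration and are not the technique from the paper.

    # Toy CPU/GPU power-budget split (illustrative only): evaluate candidate
    # splits of a fixed budget with a made-up frame-rate model.
    def predicted_fps(cpu_w, gpu_w):
        # Hypothetical diminishing-returns model for a graphics-heavy workload.
        return 10.0 * (cpu_w ** 0.3) * (gpu_w ** 0.6)

    def best_split(total_budget_w, step=0.5):
        candidates = []
        cpu = step
        while cpu < total_budget_w:
            candidates.append((predicted_fps(cpu, total_budget_w - cpu), cpu))
            cpu += step
        fps, cpu_w = max(candidates)
        return cpu_w, total_budget_w - cpu_w, fps

    print(best_split(total_budget_w=6.0))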
Article
Aggressive technology scaling has enabled the fabrication of many-core architectures while triggering challenges such as limited power budgets and increased reliability issues, such as aging phenomena. Dynamic power management and runtime mapping strategies can be utilized in such systems to achieve optimal performance while satisfying power constraints. However, lifetime reliability is generally neglected. We propose a novel lifetime reliability/performance-aware resource co-management approach for many-core architectures in the dark silicon era. The approach is based on a two-layered architecture, composed of a long-term runtime reliability controller and a short-term runtime mapping and resource management unit. The former evaluates the cores' aging status w.r.t. a target reference specified by the designer, and performs recovery actions on highly stressed cores by means of power capping. The aging status is utilized in runtime application mapping to maximize system performance while fulfilling reliability requirements and honoring the power budget. Experimental evaluation demonstrates the effectiveness of the proposed strategy, showing that it outperforms the most recent state-of-the-art contributions.
Conference Paper
We propose a hierarchical Model Predictive Control (MPC) strategy for energy management in plug-in hybrid electric vehicles. An inner feedback loop addresses the problem of optimally tracking a given reference trajectory for the battery state of charge over a short future horizon using knowledge of the predicted driving cycle. The associated receding-horizon optimization problem is solved using a projected Newton method. The controller is compared with existing approaches based on Pontryagin's Minimum Principle, and the effects of imprecise knowledge of the future driving cycle are discussed. An outer feedback loop generates the state-of-charge reference trajectory by solving approximately the optimal control problem for the entire driving cycle. By considering averages of the driver demand over longer time intervals, the required number of prediction steps is reduced such that the outer-loop problem can also be efficiently solved using the proposed Newton method. Advantages over approaches that assume a linearly decreasing state-of-charge reference trajectory are discussed.
Conference Paper
Computational sprinting is a class of mechanisms that boost performance but dissipate additional power. We describe a sprinting architecture in which many, independent chip multiprocessors share a power supply and sprints are constrained by the chips' thermal limits and the rack's power limits. Moreover, we present the computational sprinting game, a multi-agent perspective on managing sprints. Strategic agents decide whether to sprint based on application phases and system conditions. The game produces an equilibrium that improves task throughput for data analytics workloads by 4-6× over prior greedy heuristics and performs within 90% of an upper bound on throughput from a globally optimized policy.
Conference Paper
Many modern mobile and desktop applications involve real-time interactions with users. For these interactive applications, tasks must complete in a reasonable amount of time in order to provide a responsive user experience. Conversely, completing a task faster than the limits of human perception does not improve the user experience. Thus, for energy efficiency, tasks should be run just fast enough to meet the response-time requirement instead of wasting energy by running faster. In this paper, we present a predictive DVFS controller that predicts the execution time of a job before it executes in order to appropriately set the DVFS level to just meet user response-time deadlines. Our results show 56% energy savings compared to running tasks at the maximum frequency with almost no deadline misses. This is 27% more energy savings than the default Linux interactive power governor, which also shows 2% deadline misses on average.
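The "just fast enough" policy described above boils down to choosing the lowest frequency whose predicted execution time still meets the deadline; a toy version is sketched below with made-up frequency levels.

    # Sketch of deadline-driven frequency selection (illustrative values only).
    FREQS_HZ = [0.6e9, 1.0e9, 1.4e9, 1.8e9]

    def pick_frequency(predicted_cycles, deadline_s):
        for f in FREQS_HZ:                        # try the lowest frequency first
            if predicted_cycles / f <= deadline_s:
                return f
        return FREQS_HZ[-1]                       # cannot meet the deadline: run flat out

    print(pick_frequency(predicted_cycles=80e6, deadline_s=0.1))   # picks 1.0 GHz here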
Conference Paper
Approximate computing trades off accuracy of results for resources such as energy or computing time. There is a large and rapidly growing literature on approximate computing that has focused mostly on showing the benefits of approximate computing. However, we know relatively little about how to control approximation in a disciplined way. In this paper, we address the problem of controlling approximation for non-streaming programs that have a set of "knobs" that can be dialed up or down to control the level of approximation of different components in the program. We formulate this control problem as a constrained optimization problem, and describe a system called Capri that uses machine learning to learn cost and error models for the program, and uses these models to determine, for a desired level of approximation, knob settings that optimize metrics such as running time or energy usage. Experimental results with complex benchmarks from different problem domains demonstrate the effectiveness of this approach.
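The control problem stated above can be read as: minimize a cost model over knob settings subject to an error-model bound. A brute-force toy version over a tiny knob space is sketched below; the knobs and the cost/error functions are invented, whereas Capri learns such models with machine learning.

    from itertools import product

    # Toy constrained knob selection (illustrative): minimize cost subject to an
    # error bound, by exhaustive search over a tiny, hypothetical knob space.
    KNOBS = {"iterations": [2, 4, 8], "precision": [0, 1, 2]}

    def cost(cfg):    # stand-in for a learned running-time/energy model
        return cfg["iterations"] * (1 + cfg["precision"])

    def error(cfg):   # stand-in for a learned output-error model
        return 1.0 / (cfg["iterations"] * (1 + cfg["precision"]))

    def best_config(max_error):
        configs = [dict(zip(KNOBS, vals)) for vals in product(*KNOBS.values())]
        feasible = [c for c in configs if error(c) <= max_error]
        return min(feasible, key=cost) if feasible else None

    print(best_config(max_error=0.2))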
Conference Paper
Efficiently allocating shared resources in computer systems is critical to optimizing execution. Recently, a number of market-based solutions have been proposed to attack this problem. Some of them provide provable theoretical bounds to efficiency and/or fairness losses under market equilibrium. However, they are limited to markets with potentially important constraints, such as enforcing equal budget for all players, or curve-fitting players' utility into a specific function type. Moreover, they do not generally provide an intuitive "knob" to control efficiency vs. fairness. In this paper, we introduce two new metrics, Market Utility Range (MUR) and Market Budget Range (MBR), through which we provide for the first time theoretical bounds on efficiency and fairness of market equilibria under arbitrary budget assignments. We leverage this result and propose ReBudget, an iterative budget re-assignment algorithm that can be used to control efficiency vs. fairness at run-time. We apply our algorithm to a multi-resource allocation problem in multicore chips. Our evaluation using detailed execution-driven simulations shows that our budget re-assignment technique is intuitive, effective, and efficient.
Conference Paper
In a multi-core system, interference at shared resources (such as caches and main memory) slows down applications running on different cores. Accurately estimating the slowdown of each application has several benefits: e.g., it can enable shared resource allocation in a manner that avoids unfair application slowdowns or provides slowdown guarantees. Unfortunately, prior works on estimating slowdowns either lead to inaccurate estimates, do not take into account shared caches, or rely on a priori application knowledge. This severely limits their applicability. In this work, we propose the Application Slowdown Model (ASM), a new technique that accurately estimates application slowdowns due to interference at both the shared cache and main memory, in the absence of a priori application knowledge. ASM is based on the observation that the performance of each application is strongly correlated with the rate at which the application accesses the shared cache. Thus, ASM reduces the problem of estimating slowdown to that of estimating the shared cache access rate of the application had it been run alone on the system. To estimate this for each application, ASM periodically 1) minimizes interference for the application at the main memory, 2) quantifies the interference the application receives at the shared cache, in an aggregate manner for a large set of requests. Our evaluations across 100 workloads show that ASM has an average slowdown estimation error of only 9.9%, a 2.97x improvement over the best previous mechanism. We present several use cases of ASM that leverage its slowdown estimates to improve fairness, performance and provide slowdown guarantees. We provide detailed evaluations of three such use cases: slowdown-aware cache partitioning, slowdown-aware memory bandwidth partitioning and an example scheme to provide soft slowdown guarantees. Our evaluations show that these new schemes perform significantly better than state-of-the-art cache partitioning and memory scheduling schemes.
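The key observation can be summarized in one estimate: slowdown is approximated by the ratio of the application's shared-cache access rate when run alone to its access rate when co-running; a toy computation with invented numbers follows.

    # Toy illustration of the slowdown estimate underlying ASM (numbers invented).
    def estimated_slowdown(cache_access_rate_alone, cache_access_rate_shared):
        return cache_access_rate_alone / cache_access_rate_shared

    # E.g., 50M shared-cache accesses/s alone vs. 35M/s under interference.
    print(round(estimated_slowdown(50e6, 35e6), 2))   # ~1.43x estimated slowdown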
Conference Paper
Integrated GPUs have become an indispensable component of mobile processors due to the increasing popularity of graphics applications. The GPU frequency is a key factor both in application throughput and mobile processor power consumption under graphics workloads. Therefore, dynamic power management algorithms have to assess the performance sensitivity to the GPU frequency accurately. Since the impact of the GPU frequency on performance varies rapidly over time, there is a need for online performance models that can adapt to varying workloads. This paper presents a light-weight adaptive runtime performance model that predicts the frame processing time. We use this model to estimate the frame time sensitivity to the GPU frequency. Our experiments on a mobile platform running common GPU benchmarks show that the mean absolute percentage error in frame time and frame time sensitivity prediction are 3.8% and 3.9%, respectively.
Conference Paper
Power and thermal dissipation constrain multicore performance scaling. Modern processors are built such that they could sustain damaging levels of power dissipation, creating a need for systems that can implement processor power caps. A particular challenge is developing systems that can maximize performance within a power cap, and approaches have been proposed in both software and hardware. Software approaches are flexible, allowing multiple hardware resources to be coordinated for maximum performance, but software is slow, requiring a long time to converge to the power target. In contrast, hardware power capping quickly converges to the power cap, but only manages voltage and frequency, limiting its potential performance. In this work we propose PUPiL, a hybrid software/hardware power capping system. Unlike previous approaches, PUPiL combines hardware's fast reaction time with software's flexibility. We implement PUPiL on a real Linux/x86 platform and compare it to Intel's commercial hardware power capping system for both single- and multi-application workloads. We find PUPiL provides the same reaction time as Intel's hardware with significantly higher performance. On average, PUPiL outperforms hardware by 1.18-2.4x depending on workload and power target. Thus, PUPiL provides a promising way to enforce power caps with greater performance than current state-of-the-art hardware-only approaches.
Article
Data center power is a scarce resource that often goes underutilized due to conservative planning. This is because the penalty for overloading the data center power delivery hierarchy and tripping a circuit breaker is very high, potentially causing long service outages. Recently, dynamic server power capping, which limits the amount of power consumed by a server, has been proposed and studied as a way to reduce this penalty, enabling more aggressive utilization of provisioned data center power. However, no real at-scale solution for data center-wide power monitoring and control has been presented in the literature. In this paper, we describe Dynamo -- a data center-wide power management system that monitors the entire power hierarchy and makes coordinated control decisions to safely and efficiently use provisioned data center power. Dynamo has been developed and deployed across all of Facebook's data centers for the past three years. Our key insight is that in real-world data centers, different power and performance constraints at different levels in the power hierarchy necessitate coordinated data center-wide power management. We make three main contributions. First, to understand the design space of Dynamo, we provide a characterization of power variation in data centers running a diverse set of modern workloads. This characterization uses fine-grained power samples from tens of thousands of servers and spanning a period of over six months. Second, we present the detailed design of Dynamo. Our design addresses several key issues not addressed by previous simulation-based studies. Third, the proposed techniques and design have been deployed and evaluated in large scale data centers serving billions of users. We present production results showing that Dynamo has prevented 18 potential power outages in the past 6 months due to unexpected power surges; that Dynamo enables optimizations leading to a 13% performance boost for a production Hadoop cluster and a nearly 40% performance increase for a search cluster; and that Dynamo has already enabled an 8% increase in the power capacity utilization of one of our data centers with more aggressive power subscription measures underway.
Article
Conventionally, an approximate accelerator replaces every invocation of a frequently executed region of code without considering the final quality degradation. However, there is a vast decision space in which each invocation can either be delegated to the accelerator (improving performance and efficiency) or run on the precise core (maintaining quality). In this paper we introduce Mithra, a co-designed hardware-software solution that navigates these tradeoffs to deliver high performance and efficiency while lowering the final quality loss. Mithra seeks to identify whether each individual accelerator invocation will lead to an undesirable quality loss and, if so, directs the processor to run the original precise code. This identification is cast as a binary classification task that requires a cohesive co-design of hardware and software. The hardware component performs the classification at runtime and exposes a knob to the software mechanism to control quality tradeoffs. The software tunes this knob by solving a statistical optimization problem that maximizes benefits from approximation while providing statistical guarantees that the final quality level will be met with high confidence. The software uses this knob to tune and train the hardware classifiers. We devise two distinct hardware classifiers, one table-based and one neural network based. To understand the efficacy of these mechanisms, we compare them with an ideal, but infeasible design, the oracle. Results show that, with 95% confidence, the table-based design can restrict the final output quality loss to 5% for 90% of unseen input sets while providing 2.5× speedup and 2.6× energy efficiency. The neural design shows similar speedup; however, it improves efficiency by 13%. Compared to the table-based design, the oracle improves speedup by 26% and efficiency by 36%. These results show that Mithra performs within a close range of the oracle and can effectively navigate the quality tradeoffs in approximate acceleration.
Article
As processors seek more resource efficiency, they increasingly need to target multiple goals at the same time, such as a level of performance, power consumption, and average utilization. Robust control solutions cannot come from heuristic-based controllers or even from formal approaches that combine multiple single-parameter controllers. Such controllers may end up working against each other. What is needed is control-theoretical MIMO (multiple input, multiple output) controllers, which actuate on multiple inputs and control multiple outputs in a coordinated manner. In this paper, we use MIMO control-theory techniques to develop controllers to dynamically tune architectural parameters in processors. To our knowledge, this is the first work in this area. We discuss three ways in which a MIMO controller can be used. We develop an example of a MIMO controller and show that it is substantially more effective than controllers based on heuristics or built by combining single-parameter formal controllers. The general approach discussed here is likely to be increasingly relevant as future processors become more resource-constrained and adaptive.
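To make the contrast with single-parameter controllers concrete, here is a minimal MIMO feedback step in which two outputs (performance and power) jointly drive two knobs through one gain matrix, so the actuators move in a coordinated way; the gain values and knob choices are illustrative, not from the paper.

    import numpy as np

    # Minimal MIMO feedback sketch (illustrative gains): two tracked outputs
    # drive two knobs jointly, instead of two independent SISO loops.
    K = np.array([[0.8, -0.3],    # frequency step: tracks perf, backs off on power overshoot
                  [0.2,  0.6]])   # cache-way step: mostly reacts to power headroom

    targets = np.array([30.0, 5.0])      # [performance (e.g., FPS), power (W)]
    measured = np.array([26.0, 5.6])
    u = K @ (targets - measured)         # coordinated actuation: [d_freq, d_ways]
    print(u)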
Article
Approximate computing trades off accuracy of results for resources such as energy or computing time. There is a large and rapidly growing literature on approximate computing that has focused mostly on showing the benefits of approximate computing. However, we know relatively little about how to control approximation in a disciplined way. In this paper, we address the problem of controlling approximation for non-streaming programs that have a set of "knobs" that can be dialed up or down to control the level of approximation of different components in the program. We formulate this control problem as a constrained optimization problem, and describe a system called Capri that uses machine learning to learn cost and error models for the program, and uses these models to determine, for a desired level of approximation, knob settings that optimize metrics such as running time or energy usage. Experimental results with complex benchmarks from different problem domains demonstrate the effectiveness of this approach.
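The constrained-optimization view can be sketched concretely. The function names and the toy cost/error models below are assumptions standing in for Capri's learned models, which are trained from profiling data.

```python
from itertools import product

# Hypothetical sketch of Capri-style knob selection: given learned cost and error
# models (plain callables standing in for trained models), return the cheapest knob
# setting whose predicted error fits within the approximation budget.
def choose_knobs(knob_ranges, cost_model, error_model, error_budget):
    best, best_cost = None, float("inf")
    for setting in product(*knob_ranges.values()):
        knobs = dict(zip(knob_ranges, setting))
        if error_model(knobs) <= error_budget:       # constraint: quality requirement
            cost = cost_model(knobs)                  # objective: running time or energy
            if cost < best_cost:
                best, best_cost = knobs, cost
    return best

# Toy models; Capri itself learns cost and error models from profiling data.
best = choose_knobs(
    {"lod": range(1, 5), "iters": range(10, 60, 10)},
    cost_model=lambda k: k["lod"] * k["iters"],
    error_model=lambda k: 1.0 / (k["lod"] * k["iters"]),
    error_budget=0.05,
)
```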
Article
Computational sprinting is a class of mechanisms that boost performance but dissipate additional power. We describe a sprinting architecture in which many, independent chip multiprocessors share a power supply and sprints are constrained by the chips' thermal limits and the rack's power limits. Moreover, we present the computational sprinting game, a multi-agent perspective on managing sprints. Strategic agents decide whether to sprint based on application phases and system conditions. The game produces an equilibrium that improves task throughput for data analytics workloads by 4-6× over prior greedy heuristics and performs within 90% of an upper bound on throughput from a globally optimized policy.
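A simplified, illustrative reading of the strategic sprint decision is sketched below; the utility model and threshold rule are assumptions for intuition, not the paper's actual game or equilibrium computation.

```python
import random

# Simplified sketch (not the paper's exact game): a chip sprints only when the value
# of its current phase exceeds a threshold that grows as more chips are expected to
# sprint, since oversubscribing the shared supply risks tripping the breaker and
# paying a large recovery penalty.
def decide_sprint(phase_value, expected_sprinters, sprint_capacity, trip_penalty):
    overload = max(0.0, (expected_sprinters - sprint_capacity) / sprint_capacity)
    threshold = trip_penalty * overload      # costlier to sprint when overload is likely
    return phase_value > threshold

decisions = [decide_sprint(random.uniform(0, 10), expected_sprinters=60,
                           sprint_capacity=50, trip_penalty=8.0)
             for _ in range(100)]
```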
Article
Power management of networked many-core systems with runtime application mapping becomes more challenging in the dark silicon era. It necessitates considering network characteristics at runtime to achieve better performance while honoring the peak power upper bound. On the other hand, power management has a direct effect on chip temperature, which is the main driver of aging effects. Therefore, alongside performance fulfillment, the control mechanism must also consider the cores' current reliability when manipulating its actuators, so as to enhance the overall system lifetime in the long term. In this paper, we propose a multiobjective dynamic power management technique that uses current power consumption and other network characteristics, including the reliability of the cores, as feedback, while utilizing fine-grained voltage and frequency scaling and per-core power gating as actuators. In addition, a disturbance rejector and a reliability balancer are designed to help the controller smooth power consumption in the short term and balance reliability in the long term, respectively. Simulations of dynamic workloads and mixed-criticality application profiles show that our method is not only effective in honoring the power budget while considerably boosting system throughput, but also increases the overall system lifetime by minimizing aging effects through power consumption balancing.
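A toy per-epoch loop illustrates how a power budget and a reliability objective can be folded into the same actuator decision; the gains and the age-based weighting are made-up placeholders, not the paper's controller.

```python
# Illustrative per-epoch power-capping loop (not the paper's controller): core
# frequencies are scaled to respect a chip power budget, with aged cores throttled
# more aggressively so that long-term reliability is balanced.
def dpm_epoch(core_power, core_age, power_budget, freqs, f_min=0.2, f_max=1.0, gain=0.05):
    power_error = sum(core_power) - power_budget        # positive -> over budget
    new_freqs = []
    for power, age, freq in zip(core_power, core_age, freqs):
        weight = 1.0 + age                               # older cores give up more frequency
        freq -= gain * power_error * weight / len(core_power)
        new_freqs.append(min(f_max, max(f_min, freq)))
    return new_freqs

freqs = dpm_epoch(core_power=[3.1, 2.8, 3.5, 2.9], core_age=[0.1, 0.4, 0.0, 0.2],
                  power_budget=11.0, freqs=[1.0, 1.0, 1.0, 1.0])
```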
Article
Efficiently allocating shared resources in computer systems is critical to optimizing execution. Recently, a number of market-based solutions have been proposed to attack this problem. Some of them provide provable theoretical bounds to efficiency and/or fairness losses under market equilibrium. However, they are limited to markets with potentially important constraints, such as enforcing equal budget for all players, or curve-fitting players' utility into a specific function type. Moreover, they do not generally provide an intuitive "knob" to control efficiency vs. fairness. In this paper, we introduce two new metrics, Market Utility Range (MUR) and Market Budget Range (MBR), through which we provide for the first time theoretical bounds on efficiency and fairness of market equilibria under arbitrary budget assignments. We leverage this result and propose ReBudget, an iterative budget re-assignment algorithm that can be used to control efficiency vs. fairness at run-time. We apply our algorithm to a multi-resource allocation problem in multicore chips. Our evaluation using detailed execution-driven simulations shows that our budget re-assignment technique is intuitive, effective, and efficient.
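The iterative budget re-assignment idea can be sketched as a simple update rule; the fairness/efficiency pulls below are illustrative assumptions and do not reproduce the paper's MUR/MBR analysis.

```python
# Hypothetical sketch of iterative budget re-assignment in the spirit of ReBudget:
# budgets are nudged toward equality when fairness is prioritized and toward the
# highest-utility player when efficiency is.
def rebudget(budgets, utilities, knob, step=0.05):
    """knob in [0, 1]: 1 favors fairness, 0 favors efficiency."""
    n = len(budgets)
    mean_budget = sum(budgets) / n
    best = max(range(n), key=lambda i: utilities[i])
    new = []
    for i, b in enumerate(budgets):
        toward_fair = mean_budget - b                          # pull toward equal budgets
        toward_eff = 1.0 if i == best else -1.0 / (n - 1)      # pull toward the top utility
        new.append(b + step * (knob * toward_fair + (1 - knob) * toward_eff))
    scale = sum(budgets) / sum(new)                            # keep the total budget fixed
    return [b * scale for b in new]

budgets = rebudget([1.0, 1.0, 1.0], utilities=[3.0, 1.0, 2.0], knob=0.3)
```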
Article
Cores in a chip-multiprocessor (CMP) system share multiple hardware resources in the memory subsystem. If resource sharing is unfair, some applications can be delayed significantly while others are unfairly prioritized. Previous research proposed separate fairness mechanisms in each individual resource. Such resource-based fairness mechanisms implemented independently in each resource can make contradictory decisions, leading to low fairness and loss of performance. Therefore, a coordinated mechanism that provides fairness in the entire shared memory system is desirable. This paper proposes a new approach that provides fairness in the entire shared memory system, thereby eliminating the need for and complexity of developing fairness mechanisms for each individual resource. Our technique, Fairness via Source Throttling (FST), estimates the unfairness in the entire shared memory system. If the estimated unfairness is above a threshold set by system software, FST throttles down cores causing unfairness by limiting the number of requests they can inject into the system and the frequency at which they do so. As such, our source-based fairness control ensures fairness decisions are made in tandem across the entire memory system. FST also enforces thread priorities/weights, and enables system software to enforce different fairness objectives and fairness-performance tradeoffs in the memory system. Our evaluations show that FST provides the best system fairness and performance compared to four systems with no fairness control and with state-of-the-art fairness mechanisms implemented in both shared caches and memory controllers.
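The throttling loop can be sketched in a few lines; the unfairness estimate and culprit selection below are simplified stand-ins for FST's actual mechanisms.

```python
# Simplified sketch of source throttling: estimate system unfairness each interval
# and, when it exceeds the software-set threshold, cap the request-injection rate of
# the core identified as interfering most.
def fst_interval(slowdowns, injection_caps, unfairness_threshold, step=0.1):
    unfairness = max(slowdowns) / min(slowdowns)     # most-slowed over least-slowed app
    if unfairness > unfairness_threshold:
        culprit = slowdowns.index(min(slowdowns))    # least-slowed core is throttled down
        injection_caps[culprit] = max(0.1, injection_caps[culprit] - step)
    else:
        injection_caps = [min(1.0, cap + step / 2) for cap in injection_caps]  # relax slowly
    return injection_caps

caps = fst_interval(slowdowns=[1.2, 2.9, 1.4, 1.1], injection_caps=[1.0] * 4,
                    unfairness_threshold=1.4)
```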
Article
This paper presents a control-theoretic approach to optimizing the energy consumption of integrated CPU and GPU subsystems for graphics applications. It achieves this via dynamic management of the CPU and GPU frequencies. To this end, we first model the interaction between the GPU and CPU as a queuing system. Second, we formulate a multi-input, multi-output (MIMO) state-space closed-loop controller to ensure robustness and stability. We evaluated this controller on an Intel Baytrail-based Android platform. Experimental evaluations show energy savings of 17.4% in the CPU-GPU subsystem with a low performance impact of 0.9%.
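The coordination idea can be illustrated with a tiny sketch in which the frame queue between producer and consumer couples the two subsystems; the gains and signs below are assumptions, not the paper's identified model.

```python
# Illustrative sketch: the frame queue between the CPU (producer) and GPU (consumer)
# couples the two subsystems, so both frequencies are adjusted from the same
# observations each control period.
def cpu_gpu_step(queue_error, frame_time_error, cpu_f, gpu_f, f_min=0.3, f_max=1.0):
    # queue_error > 0: queue filling up -> the CPU is running ahead of the GPU.
    # frame_time_error > 0: frames are finishing too slowly -> speed both sides up.
    cpu_f += -0.10 * queue_error + 0.30 * frame_time_error
    gpu_f += +0.20 * queue_error + 0.40 * frame_time_error
    clamp = lambda f: min(f_max, max(f_min, f))
    return clamp(cpu_f), clamp(gpu_f)

cpu_f, gpu_f = cpu_gpu_step(queue_error=1.0, frame_time_error=0.4, cpu_f=0.8, gpu_f=0.8)
```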
Article
Performance, power, and energy (PPE) are critical aspects of modern computing. It is challenging to accurately predict, in real time, the effect of dynamic voltage and frequency scaling (DVFS) on PPE across a wide range of voltages and frequencies. This results in the use of reactive, iterative, and inefficient algorithms for dynamically finding good DVFS states. We propose PPEP, an online PPE prediction framework that proactively and rapidly searches the DVFS space. PPEP uses hardware events to implement both a cycles-per-instruction (CPI) model as well as a per-core power model in order to predict PPE across all DVFS states. We verify on modern AMD CPUs that the PPEP power model achieves an average error of 4.6% (2.8% standard deviation) on 152 benchmark combinations at 5 distinct voltage-frequency states. Predicting average chip power across different DVFS states achieves an average error of 4.2% with a 3.6% standard deviation. Further, we demonstrate the usage of PPEP by creating and evaluating a highly responsive power capping mechanism that can meet power targets in a single step. PPEP also provides insights for future development of DVFS technologies. For example, we find that it is important to carefully consider background workloads for DVFS policies and that enabling north bridge DVFS can offer up to 20% additional energy saving or a 1.4x performance improvement.
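The single-step capping use case can be sketched as follows; the per-state predictors here are toy stand-ins for PPEP's event-driven CPI and power models.

```python
# Hypothetical sketch of proactive, single-step power capping: per-state performance
# and power predictors let the manager jump directly to the fastest DVFS state that
# fits under the cap, instead of iterating reactively.
def pick_dvfs_state(states, predict_perf, predict_power, power_cap):
    feasible = [s for s in states if predict_power(s) <= power_cap]
    return max(feasible, key=predict_perf) if feasible else min(states, key=predict_power)

# Toy (frequency GHz, voltage V) states and models; PPEP derives its models online
# from hardware event counts.
states = [(0.8, 0.9), (1.2, 1.0), (1.6, 1.1), (2.0, 1.2)]
best = pick_dvfs_state(states,
                       predict_perf=lambda s: s[0],                    # perf ~ frequency
                       predict_power=lambda s: 6.0 * s[0] * s[1] ** 2, # ~ C * f * V^2 shape
                       power_cap=15.0)
```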
Conference Paper
Traditional Networks-on-Chip (NoCs) employ simple arbitration strategies, such as round-robin or oldest-first, to decide which packets should be prioritized in the network. This is counter-intuitive since different packets can have very different effects on system performance due to, e.g., different levels of memory-level parallelism (MLP) across applications. Certain packets may be performance-critical because they cause the processor to stall, whereas others may be delayed for a number of cycles with no effect on application-level performance as their latencies are hidden by other outstanding packets' latencies. In this paper, we define slack as a key measure that characterizes the relative importance of a packet. Specifically, the slack of a packet is the number of cycles the packet can be delayed in the network with no effect on execution time. This paper proposes new router prioritization policies that exploit the available slack of interfering packets in order to accelerate performance-critical packets and thus improve overall system performance. When two packets interfere with each other in a router, the packet with the lower slack value is prioritized. We describe mechanisms to estimate slack, prevent starvation, and combine slack-based prioritization with other recently proposed application-aware prioritization mechanisms. We evaluate slack-based prioritization policies on a 64-core CMP with an 8x8 mesh NoC using a suite of 35 diverse applications. For a representative set of case studies, our proposed policy increases average system throughput by 21.0% over the commonly used round-robin policy. Averaged over 56 randomly-generated multiprogrammed workload mixes, the proposed policy improves system throughput by 10.3%, while also reducing application-level unfairness by 30.8%.
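The arbitration rule itself is simple to sketch; the slack estimate below is a toy stand-in for the hardware mechanisms described in the paper.

```python
# Simplified sketch of slack-aware arbitration at a router output port: among the
# packets competing in a cycle, grant the one with the least estimated slack.
def estimate_slack(packet):
    # Cycles this packet could still be delayed while its latency remains hidden by
    # the core's other outstanding request, minus the hops it still has to travel.
    return packet["covering_miss_latency"] - packet["remaining_hops"]

def arbitrate(competing_packets):
    return min(competing_packets, key=estimate_slack)

winner = arbitrate([
    {"id": 1, "covering_miss_latency": 40, "remaining_hops": 6},
    {"id": 2, "covering_miss_latency": 8,  "remaining_hops": 5},   # low slack -> critical
])
```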
Article
Recent empirical studies have shown that multicore scaling is fast becoming power limited, and consequently, an increasing fraction of a multicore processor has to be underclocked or powered off. Therefore, in addition to fundamental innovations in architecture, compilers, and parallelization of application programs, there is a need to develop practical and effective dynamic energy management (DEM) techniques for multicore processors. Existing DEM techniques mainly target reducing processor power consumption and temperature, and only a few of them have addressed improving energy efficiency for multicore systems. With energy efficiency taking center stage in all aspects of computing, the focus of DEM needs to be on finding practical methods to maximize processor efficiency. Towards this, this article presents STEAM -- an optimal closed-loop DEM controller designed for multicore processors. The objective is to maximize energy efficiency by dynamic voltage and frequency scaling (DVFS). Energy efficiency is defined as the ratio of performance to power consumption, or performance-per-watt (PPW). This is the same as the number of instructions executed per Joule. The PPW metric is replaced by P^αPW (performance^α-per-watt), which allows the importance of performance versus power consumption to be controlled by varying α. The proposed controller was implemented on a Linux system and tested with an Intel Sandy Bridge processor. There are three power management schemes, called governors, available on Intel platforms. They are referred to as (1) Powersave (lowest power consumption), (2) Performance (achieves highest performance), and (3) Ondemand. Our simple and lightweight controller, when executing SPEC CPU2006, PARSEC, and MiBench benchmarks, achieves an average 18% improvement in energy efficiency (MIPS/Watt) over these ACPI policies. Moreover, STEAM demonstrated excellent prediction of core temperatures and power consumption, and the ability to control the core temperatures within 3°C of the specified maximum. Finally, the overhead of the STEAM implementation (in terms of CPU resources) is less than 0.25%.
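Reading the abstract, the controller's objective can be summarized with a small formula; the exact optimization STEAM solves may differ, so this is only a schematic restatement:

\[
\mathrm{P^{\alpha}PW}(f) \;=\; \frac{\mathrm{Performance}(f)^{\alpha}}{\mathrm{Power}(f)},
\qquad
f^{*} \;=\; \arg\max_{f \in \mathcal{F}} \mathrm{P^{\alpha}PW}(f),
\]

where \(\mathcal{F}\) is the set of available DVFS states; α = 1 recovers plain performance-per-watt, and larger α weights performance more heavily relative to power.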
Article
Efficiently utilizing off-chip DRAM bandwidth is a critical issue in designing cost-effective, high-performance chip multiprocessors (CMPs). Conventional memory controllers deliver relatively low performance in part because they often employ fixed, rigid access scheduling policies designed for average-case application behavior. As a result, they cannot learn and optimize the long-term performance impact of their scheduling decisions, and cannot adapt their scheduling policies to dynamic workload behavior. We propose a new, self-optimizing memory controller design that operates using the principles of reinforcement learning (RL) to overcome these limitations. Our RL-based memory controller observes the system state and estimates the long-term performance impact of each action it can take. In this way, the controller learns to optimize its scheduling policy on the fly to maximize long-term performance. Our results show that an RL-based memory controller improves the performance of a set of parallel applications run on a 4-core CMP by 19% on average (up to 33%), and it improves DRAM bandwidth utilization by 22% compared to a state-of-the-art controller.
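A toy Q-learning loop shows the general shape of such a self-optimizing scheduler; the state encoding, action set, and reward below are illustrative placeholders rather than the controller's actual design.

```python
import random
from collections import defaultdict

# Toy Q-learning sketch of a self-optimizing scheduler. The reward could be, e.g.,
# +1 whenever the data bus is utilized, so the policy is tuned toward long-term
# throughput rather than the immediate effect of a single DRAM command.
Q = defaultdict(float)
ACTIONS = ("precharge", "activate", "read", "write", "nop")
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1

def choose_action(state):
    if random.random() < EPSILON:                       # explore occasionally
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])    # otherwise act greedily

def update(state, action, reward, next_state):
    # Standard one-step Q-learning update on the (state, action) value estimate.
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
```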
Conference Paper
Cores in chip-multiprocessors (CMPs) share multiple memory subsystem resources. If resource sharing is unfair, some applications can be delayed significantly while others are unfairly prioritized. Previous research proposed separate fairness mechanisms for each resource. Such resource-based fairness mechanisms implemented independently in each resource can make contradictory decisions, leading to low fairness and performance loss. Therefore, a coordinated mechanism that provides fairness in the entire shared memory system is desirable. This article proposes a new approach that provides fairness in the entire shared memory system, thereby eliminating the need for and complexity of developing fairness mechanisms for each resource. Our technique, Fairness via Source Throttling (FST), estimates unfairness in the entire memory system. If unfairness is above a system-software-set threshold, FST throttles down cores causing unfairness by limiting the number of requests they create and the frequency at which they do so. As such, our source-based fairness control ensures fairness decisions are made in tandem across the entire memory system. FST enforces thread priorities/weights, and enables system software to enforce different fairness objectives in the memory system. Our evaluations show that FST provides the best system fairness and performance compared to three systems with state-of-the-art fairness mechanisms implemented in both shared caches and memory controllers.
Conference Paper
In computer architecture, caches have primarily been viewed as a means to hide memory latency from the CPU. Cache policies have focused on anticipating the CPU's data needs, and are mostly oblivious to the main memory. In this paper, we demonstrate that the era of many-core architectures has created new main memory bottlenecks, and mandates a new approach: coordination of cache policy with main memory characteristics. Using the cache for memory optimization purposes, we propose a Virtual Write Queue which dramatically expands the memory controller's visibility of processor behavior, at low implementation overhead. Through memory-centric modification of existing policies, such as scheduled writebacks, this paper demonstrates that performance limiting effects of highly-threaded architectures can be overcome. We show that through awareness of the physical main memory layout and by focusing on writes, both read and write average latency can be shortened, memory power reduced, and overall system performance improved. Through full-system cycle-accurate simulations of SPEC cpu2006, we demonstrate that the proposed Virtual Write Queue achieves an average 10.9% system-level throughput improvement on memory-intensive workloads, along with an overall reduction of 8.7% in memory power across the whole suite.
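The memory-aware writeback idea can be sketched briefly; the high-water-mark policy and data layout below are assumptions for illustration, not the paper's exact Virtual Write Queue design.

```python
# Illustrative sketch of memory-aware writeback scheduling: once dirty lines
# accumulate past a high-water mark, drain first those that map to currently open
# DRAM rows, so writes piggyback on row activations instead of forcing extra ones.
def schedule_writebacks(dirty_lines, open_rows, high_watermark):
    if len(dirty_lines) < high_watermark:
        return []                                         # no pressure, keep caching writes
    row_hits = [l for l in dirty_lines if (l["bank"], l["row"]) in open_rows]
    return row_hits or dirty_lines[: len(dirty_lines) - high_watermark + 1]

drain = schedule_writebacks(
    dirty_lines=[{"bank": 0, "row": 7}, {"bank": 1, "row": 3}, {"bank": 0, "row": 9}],
    open_rows={(0, 7), (1, 5)},
    high_watermark=3,
)
```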
Book
Introduces theoretical and practical aspects of adaptive control. Starting with a broad overview, the text explores real-time estimation, self-tuning regulators and model-reference adaptive systems, stochastic adaptive control, and automatic tuning of regulators. Additional topics include gain scheduling, robust high-gain control and self-oscillating controllers, and suggestions for implementing adaptive controllers. Concluding chapters feature a summary of applications and a brief review of additional areas closely related to adaptive control.
Article
Within the Control and Automation Laboratory at Chalmers, a software suite has been developed to facilitate the manipulation of state automata and Petri nets for supervisor calculation (among other things). This suite of software tools includes a graphical automata/Petri net drawing tool, a command-line-based automata/Petri net manipulation tool, and a graphical visualisation tool. The first two tools (jointly named Desco, for Discrete Event Systems Controller), consisting of the N’gin (that is, the mathematical manipulation engine) and the GUI (the graphical user interface), have been developed at the Control and Automation Laboratory. The third tool is a general graph-drawing package, GraphViz, from AT&T Research (see http://www.research.att.com/sw/tools/graphviz/), which is closely integrated with Desco.
Conference Paper
Heterogeneous multi-cores that integrate cores with different power-performance characteristics are promising alternatives to homogeneous systems in energy- and thermally-constrained environments. However, the heterogeneity imposes significant challenges to power-aware scheduling. We present a price theory-based dynamic power management framework for heterogeneous multi-cores that coordinates various energy-saving opportunities, such as dynamic voltage/frequency scaling, load balancing, and task migration, in tandem to achieve the best power-performance characteristics. Unlike existing centralized power management frameworks, ours is distributed and hence scalable with minimal runtime overhead. We design and implement the framework within the Linux operating system on an ARM big.LITTLE heterogeneous multi-core platform. Experimental evaluation confirms the advantages of our approach compared to state-of-the-art techniques for power management in heterogeneous multi-cores.
Conference Paper
Cloud computing promises flexibility and high performance for users and high cost-efficiency for operators. Nevertheless, most cloud facilities operate at very low utilization, hurting both cost effectiveness and future scalability. We present Quasar, a cluster management system that increases resource utilization while providing consistently high application performance. Quasar employs three techniques. First, it does not rely on resource reservations, which lead to underutilization as users do not necessarily understand workload dynamics and physical resource requirements of complex codebases. Instead, users express performance constraints for each workload, letting Quasar determine the right amount of resources to meet these constraints at any point. Second, Quasar uses classification techniques to quickly and accurately determine the impact of the amount of resources (scale-out and scale-up), type of resources, and interference on performance for each workload and dataset. Third, it uses the classification results to jointly perform resource allocation and assignment, quickly exploring the large space of options for an efficient way to pack workloads on available resources. Quasar monitors workload performance and adjusts resource allocation and assignment when needed. We evaluate Quasar over a wide range of workload scenarios, including combinations of distributed analytics frameworks and low-latency, stateful services, both on a local cluster and a cluster of dedicated EC2 servers. At steady state, Quasar improves resource utilization by 47% in the 200-server EC2 cluster, while meeting performance constraints for workloads of all types.
Article
Adaptive microarchitectures are a promising solution for designing high-performance, power-efficient microprocessors. They offer the ability to tailor computational resources to the specific requirements of different programs or program phases. They have the potential to adapt the hardware cost-effectively at runtime to any application’s needs. However, one of the key challenges is how to dynamically determine the best architecture configuration at any given time, for any new workload. This article proposes a novel control mechanism based on a predictive model for microarchitectural adaptivity control. This model is able to efficiently control adaptivity by monitoring the behaviour of an application’s different phases at runtime. We show that by using this model on SPEC 2000, we double the energy/performance efficiency of the processor when compared to the best static configuration tuned for the whole benchmark suite. This represents 74% of the improvement available if we know the best microarchitecture for each program phase ahead of time. In addition, we present an extended analysis of the best configurations found and show that the overheads associated with the implementation of our scheme have a negligible impact on performance and power.
Conference Paper
Dynamic power management features are now an integral part of processor chip and system design. Dynamic voltage and frequency scaling (DVFS), core folding and per-core power gating (PCPG) are power control actuators (or "knobs") that are available in modern multi-core systems. However, figuring out the actuation protocol for such knobs in order to achieve maximum efficiency has so far remained an open research problem. In the context of specific system utilization dynamics, the desirable order of applying these knobs is not easy to determine. For complexity-effective algorithm development, DVFS, core folding and PCPG control methods have evolved in a somewhat decoupled manner. However, as we show in this paper, independent actuation of these techniques can lead to conflicting decisions that jeopardize the system in terms of power-performance efficiency. Therefore, a more robust coordination protocol is necessary in orchestrating the power management functions. Heuristics for achieving such coordinated control are already becoming available in server systems. It remains an open research problem to optimally adjust power and performance management options at run-time for a wide range of time-varying workload applications, environmental conditions, and power constraints. This research paper contributes a novel approach for a systematically architected, robust, multi-knob power management protocol, which we empirically analyze on live server systems. We use a latest generation POWER7+ multi-core system to demonstrate the benefits of our proposed new coordinated power management algorithm (called PAMPA). We report measurement-based analysis to show that PAMPA achieves comparable power-performance efficiencies (relative to a baseline decoupled control system) while achieving conflict-free actuation and robust operation.
Conference Paper
Future microprocessors may become so power constrained that not all transistors will be able to be powered on at once. These systems will be required to nimbly adapt to changes in the chip power that is allocated to general-purpose cores and to specialized accelerators. This paper presents Flicker, a general-purpose multicore architecture that dynamically adapts to varying and potentially stringent limits on allocated power. The Flicker core microarchitecture includes deconfigurable lanes--horizontal slices through the pipeline--that permit tailoring an individual core to the running application with lower overhead than microarchitecture-level adaptation, and greater flexibility than core-level power gating. To exploit Flicker's flexible pipeline architecture, a new online multicore optimization algorithm combines reduced sampling techniques, application of response surface models to online optimization, and heuristic online search. The approach efficiently finds a near-global-optimum configuration of lanes without requiring offline training, microarchitecture state, or foreknowledge of the workload. At high power allocations, core-level gating is highly effective, and slightly outperforms Flicker overall. However, under stringent power constraints, Flicker significantly outperforms core-level gating, achieving an average 27% performance improvement.
Conference Paper
Recent work has introduced memory system dynamic voltage and frequency scaling (DVFS), and has suggested that balanced scaling of both CPU and the memory system is the most promising approach for conserving energy in server systems. In this paper, we first demonstrate that CPU and memory system DVFS often conflict when performed independently by separate controllers. In response, we propose CoScale, the first method for effectively coordinating these mechanisms under performance constraints. CoScale relies on execution profiling of each core via (existing and new) performance counters, and models of core and memory performance and power consumption. CoScale explores the set of possible frequency settings in such a way that it efficiently minimizes the full-system energy consumption within the performance bound. Our results demonstrate that, by effectively coordinating CPU and memory power management, CoScale conserves a significant amount of system energy compared to existing approaches, while consistently remaining within the prescribed performance bounds. The results also show that CoScale conserves almost as much system energy as an offline, idealized approach.
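The coordinated selection problem can be sketched as a constrained search over the joint frequency space; CoScale itself uses counter-driven models and a much more efficient search, so the exhaustive loop and toy models below are only illustrative.

```python
from itertools import product

# Hypothetical sketch of coordinated CPU/memory DVFS selection: pick the joint
# frequency setting with minimum predicted full-system energy that still meets the
# performance bound.
def coordinated_pick(cpu_freqs, mem_freqs, predict_energy, predict_slowdown, max_slowdown):
    best, best_energy = None, float("inf")
    for fc, fm in product(cpu_freqs, mem_freqs):
        if predict_slowdown(fc, fm) <= max_slowdown:      # stay within the performance bound
            energy = predict_energy(fc, fm)
            if energy < best_energy:
                best, best_energy = (fc, fm), energy
    return best

setting = coordinated_pick(
    cpu_freqs=[1.2, 1.6, 2.0], mem_freqs=[0.8, 1.066, 1.333],
    predict_energy=lambda fc, fm: 2.0 * fc ** 2 + 1.5 * fm ** 2,           # toy models
    predict_slowdown=lambda fc, fm: (2.0 / fc) * 0.6 + (1.333 / fm) * 0.4,
    max_slowdown=1.10,
)
```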
Conference Paper
Asymmetric multi-core architectures integrating cores with diverse power-performance characteristics is emerging as a promising alternative in the dark silicon era where only a fraction of the cores on chip can be powered on due to thermal limits. We introduce a hierarchical power management framework for asymmetric multi-cores that builds on control theory and coordinates multiple controllers in a synergistic manner to achieve optimal power-performance efficiency while respecting the thermal design power budget. We integrate our framework within Linux and implement/evaluate it on real ARM big.LITTLE asymmetric multi-core platform.
Conference Paper
Applications running concurrently on a multicore system interfere with each other at the main memory. This interference can slow down different applications differently. Accurately estimating the slowdown of each application in such a system can enable mechanisms that enforce quality-of-service. While much prior work has focused on mitigating the performance degradation due to inter-application interference, there is little work on estimating the slowdown of individual applications in a multi-programmed environment. Our goal in this work is to build such an estimation scheme. To this end, we present our simple Memory-Interference-induced Slowdown Estimation (MISE) model that estimates slowdowns caused by memory interference. We build our model based on two observations. First, the performance of a memory-bound application is roughly proportional to the rate at which its memory requests are served, suggesting that request-service-rate can be used as a proxy for performance. Second, when an application's requests are prioritized over all other applications' requests, the application experiences very little interference from other applications. This provides a means for estimating the uninterfered request-service-rate of an application while it is run alongside other applications. Using the above observations, our model estimates the slowdown of an application as the ratio of its uninterfered and interfered request service rates. We propose simple changes to the above model to estimate the slowdown of non-memory-bound applications. We demonstrate the effectiveness of our model by developing two new memory scheduling schemes: 1) one that provides soft quality-of-service guarantees and 2) another that explicitly attempts to minimize maximum slowdown (i.e., unfairness) in the system. Evaluations show that our techniques perform significantly better than state-of-the-art memory scheduling approaches in addressing the above problems.
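Per the abstract, the core estimate can be stated compactly:

\[
\text{Slowdown} \;\approx\; \frac{\text{request service rate}_{\text{alone}}}{\text{request service rate}_{\text{shared}}},
\]

where the alone (uninterfered) rate is approximated online by periodically giving the application's requests the highest priority at the memory controller, and the shared (interfered) rate is what the application actually experiences when run alongside the other applications.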