The problem of programming scalable multicore processors has renewed interest in message-passing languages and frameworks. Such languages and frameworks are typically actor-oriented, implementing some variant of the standard Actor semantics. This paper analyzes some of the more significant efforts to build actor-oriented frameworks for the JVM platform. It compares the frameworks in terms of their execution semantics, the communication and synchronization abstractions provided, and the representations used in the implementations. It analyzes the performance of actor-oriented frameworks to determine the costs of supporting different actor properties on the JVM. The analysis suggests that with suitable optimizations, standard Actor semantics and some useful communication and synchronization abstractions may be supported with reasonable efficiency on the JVM platform.
Additive AND/OR graphs are defined as AND/OR graphs without circuits, which can be considered as folded AND/OR trees; i.e., the cost of a common subproblem is added to the cost as many times as the subproblem occurs, but it is computed only once. Additive ...
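As an illustration of this cost semantics, the following sketch evaluates an additive AND/OR graph with memoization: the shared subproblem is computed once, yet its cost contributes to every parent that references it. The graph, node names, and costs are made up for the example; Python is used only as notation.

```python
# Minimal sketch of additive cost evaluation on an acyclic AND/OR graph.
# The graph and its costs are illustrative, not taken from the paper.
from functools import lru_cache

# node -> ("AND" | "OR", [(child, edge_cost), ...]) or ("LEAF", leaf_cost)
graph = {
    "root": ("AND", [("a", 1), ("b", 2)]),
    "a":    ("OR",  [("c", 3), ("d", 1)]),
    "b":    ("OR",  [("c", 2)]),          # "c" is a shared subproblem
    "c":    ("LEAF", 5),
    "d":    ("LEAF", 9),
}

@lru_cache(maxsize=None)               # each subproblem is solved only once
def cost(node):
    kind, data = graph[node]
    if kind == "LEAF":
        return data
    child_costs = [edge + cost(child) for child, edge in data]
    # AND: all children must be solved, so their costs add up;
    # OR: the cheapest alternative is chosen.
    return sum(child_costs) if kind == "AND" else min(child_costs)

print(cost("root"))   # "c" contributes to both "a" and "b" but is computed once
```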
Virtual machine migration is a useful and widely used workload management technique. However, the overhead of moving gigabytes of data across machines, racks, or even data centers limits its applicability. According to a recent study by IBM [7], the number of distinct servers visited by a migrating VM is small; often just two. By storing a checkpoint on each server, a subsequent incoming migration of the same VM must transfer less data over the network.
Our analysis shows that for short migration intervals of 2 hours, on average 50% to 70% of the checkpoint can be reused. For longer migration intervals of up to 24 hours, between 20% and 50% can still be reused. In addition, we compared different methods to reduce migration traffic. We find that content-based redundancy elimination consistently achieves better results than relying on dirty page tracking alone; sometimes the difference is only a few percent, but it can reach 50% and more. Our empirical measurements with a QEMU-based prototype confirm the reduction in migration traffic and time.
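The intuition behind content-based redundancy elimination can be illustrated with a small sketch, assuming fixed-size pages and a checkpoint already stored at the destination: only pages whose content hash is absent from the checkpoint need to cross the network, whereas dirty-page tracking would resend every page written since the checkpoint regardless of its content. The page size, helper names, and toy memory image are assumptions, not the prototype's implementation.

```python
# Sketch: estimating migration traffic saved by content-based redundancy
# elimination against a stored checkpoint.
import hashlib

PAGE_SIZE = 4096

def page_digests(memory: bytes):
    """Hash each fixed-size page of a VM memory image."""
    return [hashlib.sha1(memory[i:i + PAGE_SIZE]).digest()
            for i in range(0, len(memory), PAGE_SIZE)]

def pages_to_transfer(current: bytes, checkpoint: bytes):
    """Return indices of pages whose content is absent from the checkpoint."""
    known = set(page_digests(checkpoint))
    return [idx for idx, digest in enumerate(page_digests(current))
            if digest not in known]

# Toy example: a 4-page image, one page modified since the checkpoint was taken.
checkpoint = bytes(4 * PAGE_SIZE)
current = bytearray(checkpoint)
current[0:4] = b"dirt"                    # modify the first page
todo = pages_to_transfer(bytes(current), checkpoint)
print(f"pages to send: {todo}")           # only page 0 must cross the network
```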
Power estimation of software processes provides critical indicators to drive scheduling or power capping heuristics. State-of-the-art solutions, however, only provide coarse-grained power estimation. In this paper, we therefore propose a middleware for assembling and configuring software-defined power meters, which provide real-time and accurate power estimation of software processes. In particular, our solution automatically learns an application-agnostic power model, which can be used to estimate the power consumption of applications.
Our approach, named BitWatts, builds on a distributed actor middleware to collect process usage and infer fine-grained power consumption without imposing any hardware investment (e.g., power meters).
With the trend toward ever-growing data centers and increasing core counts, simple programming models for efficient distributed and concurrent programming are required. One of the successful principles for scalable computing is the actor model, which is based on message passing. Actors are objects that hold local state that can only be modified by the exchange of messages. To avoid typical concurrency hazards, each actor processes messages sequentially. However, this limits the scalability of the model. We have shown in earlier work that concurrent message processing can be implemented with the help of transactional memory, which ensures sequential processing when required. This approach is advantageous in low-contention phases but does not scale for high-contention phases. In this paper we introduce a combination of dynamic resource allocation and non-transactional message processing to overcome this limitation. This allows for efficient resource utilization, as these two mechanisms can be handled in parallel. We show that we can substantially reduce the execution time of high-contention workloads in a micro-benchmark as well as in a real-world application.
Reliability is an essential concern for processor designers due to increasing transient and permanent fault rates. Executing instruction streams redundantly in chip multiprocessors (CMPs) provides high reliability since it can detect both transient and permanent faults. Additionally, it also minimizes the Silent Data Corruption rate. However, comparing the results of the instruction streams, checkpointing the entire system, and recovering from detected errors might lead to substantial performance degradation. In this study we propose FaulTM, an error detection and recovery scheme utilizing Hardware Transactional Memory (HTM) in order to reduce this performance degradation. We show how a minimally modified HTM that features lazy conflict detection and lazy data versioning can provide low-cost reliability in addition to HTM's intended purpose of supporting optimistic concurrency. Compared with lockstepping, FaulTM reduces the performance degradation by 2.5X for the SPEC2006 benchmarks.
Dynamic Virtual Machine (VM) consolidation dynamically adapts VM resource allocation to resource demands, promising significant cost benefits for highly variable enterprise workloads. In this work, we analyze large enterprise workloads with the goal of understanding how effective VM consolidation variants are in the real world. We observe that burstiness in memory demand is much lower than burstiness in CPU demand. Further, memory is the more constrained resource in virtualized servers, significantly reducing the potential gains of dynamic consolidation. We study consolidation planning in four very large data centers and observe that the savings in facilities cost due to dynamic consolidation over static consolidation are not as large as estimated by past studies. Further, the savings over intelligent semi-static consolidation are surprisingly modest in most cases, putting a question mark over the applicability of dynamic consolidation in the real world.
Processors supporting a wide range of supply voltages are necessary to achieve high performance at nominal supply voltage and to reduce power consumption at low supply voltage. However, when the supply voltage is lowered below the safe margin (especially close to the threshold voltage level), the memory cell failure rate increases drastically. Thus, it is essential to provide reliability solutions for memory structures. This paper proposes a novel, reliable L1 cache design, Flexicache, which automatically configures itself for different supply voltages in order to tolerate different fault rates. Flexicache is a circuit-driven solution achieving in-cache replication with no increase in access latency and a minimal increase in energy consumption. It defines three operating modes: Single Version Mode, Double Version Mode, and Triple Version Mode. Compared to the best previous proposal, Flexicache provides 34% higher energy reduction for L1 caches with 2× higher error correction capability in the low-voltage mode.
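The three operating modes can be pictured with a simplified software model, assuming replicas of a cache word are kept in-cache and compared on a read: Double Version Mode can only detect a mismatch, while Triple Version Mode corrects it by majority vote. This is a behavioral sketch, not the circuit-level mechanism.

```python
# Simplified behavioral model of in-cache replication under the three modes.
# Names and behavior are illustrative assumptions, not the actual circuit design.
from collections import Counter

def read_word(replicas):
    """Return (value, status) for a cache word stored as 1, 2, or 3 replicas."""
    if len(replicas) == 1:                        # Single Version Mode (nominal voltage)
        return replicas[0], "unprotected"
    if len(replicas) == 2:                        # Double Version Mode: detection only
        if replicas[0] == replicas[1]:
            return replicas[0], "ok"
        return None, "error detected"
    value, votes = Counter(replicas).most_common(1)[0]   # Triple Version Mode
    if votes >= 2:
        return value, ("ok" if votes == 3 else "corrected")
    return None, "uncorrectable"

print(read_word([0xAB]))                 # (171, 'unprotected')
print(read_word([0xAB, 0xAD]))           # (None, 'error detected')
print(read_word([0xAB, 0xAD, 0xAB]))     # (171, 'corrected'): majority vote wins
```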
A thorough thermal analysis of integrated circuits (ICs) is essential to prevent temperature-driven reliability issues, which might cause the failure of microelectronic devices. The classical analysis approach is based on finite element methods (FEM). However, in recent decades other computational methodologies have been developed with the aim of obtaining results more quickly and at reasonable accuracy. In this paper, a transient fast thermal model (TFTM) methodology for 3D ICs based on 3D convolution and the fast Fourier transform is presented. This methodology quickly and accurately predicts the temporal evolution of the chip temperature distribution in all tiers of the 3D package, for power dissipation that can be non-uniform in both time and space. In the first part of the paper the computational methodology is derived and described. Next, results are presented and validated against conventional FEM simulations, showing good accuracy and reduced computational time. A realistic case, wherein different load-switching scenarios are compared for a commercial floor-plan, is analyzed as an example of the applicability of the presented methodology. The speed of this 3D-convolution-based algorithm is compared with that of previous work based on 2D convolution and subsequent time superposition.
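The core computational step can be sketched as follows, assuming a precomputed thermal impulse response sampled on the chip grid: the temperature rise is the convolution of the power-density map with that response, which the FFT evaluates far faster than direct summation. The Gaussian kernel and power map below are placeholders, not calibrated data.

```python
# Sketch of the FFT-based convolution at the core of a fast thermal model:
# temperature rise = power-density map (*) thermal impulse response.
import numpy as np
from scipy.signal import fftconvolve

N = 64                                              # grid points per side of one tier
x = np.arange(N) - N // 2
xx, yy = np.meshgrid(x, x)
impulse_response = np.exp(-(xx**2 + yy**2) / 50.0)  # assumed thermal kernel [K per W]

power_map = np.zeros((N, N))                        # power dissipation of one tier [W]
power_map[20:28, 20:28] = 1.5                       # a hot functional block

# FFT-based convolution: O(N^2 log N) instead of O(N^4) for direct summation.
temperature_rise = fftconvolve(power_map, impulse_response, mode="same")
print(f"peak temperature rise: {temperature_rise.max():.2f} K")
```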
Power estimation of software processes provides critical indicators to drive scheduling or power capping heuristics. State-of-the-art solutions can perform coarse-grained power estimation in virtualized environments, typically treating virtual machines (VMs) as black boxes. Yet, VM-based systems are nowadays commonly used to host multiple applications for cost savings and better use of energy by sharing common resources and assets.
In this paper, we propose a fine-grained monitoring middleware providing real-time and accurate power estimation of software processes running at any level of virtualization in a system. In particular, our solution automatically learns an application-agnostic power model, which can be used to estimate the power consumption of applications.
Our middleware implementation, named BitWatts, builds on a distributed actor implementation to collect process usage and infer fine-grained power consumption without imposing any hardware investment (e.g., power meters). BitWatts instances use high-throughput communication channels to spread the power consumption across the VM levels and between machines. Our experiments, based on CPU- and memory-intensive benchmarks running on different hardware setups, demonstrate that BitWatts scales both in number of monitored processes and virtualization levels. This non-invasive monitoring solution therefore paves the way for scalable energy accounting that takes into account the dynamic nature of virtualized environments.
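The idea of an application-agnostic power model can be sketched as follows: regress machine-level power measurements on utilization counters, then attribute the dynamic part to each process in proportion to its share of those counters. The features, sample values, and attribution rule are illustrative assumptions, not BitWatts internals.

```python
# Sketch of an application-agnostic power model: fit machine-level power against
# utilization counters, then attribute power to processes by their counter share.
import numpy as np

# Training samples: (CPU utilization, memory bandwidth share) -> measured watts.
features = np.array([[0.10, 0.05], [0.40, 0.20], [0.80, 0.30], [1.00, 0.60]])
watts    = np.array([35.0, 55.0, 85.0, 110.0])

# Least-squares fit of P = idle + a*cpu + b*mem.
X = np.hstack([np.ones((len(features), 1)), features])
coeffs, *_ = np.linalg.lstsq(X, watts, rcond=None)
idle, a, b = coeffs

def process_power(cpu_share, mem_share):
    """Dynamic power attributed to one process from its resource shares."""
    return a * cpu_share + b * mem_share

print(f"idle power ~ {idle:.1f} W")
print(f"process at 25% CPU, 10% mem ~ {process_power(0.25, 0.10):.1f} W")
```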
This paper proposes DESSERT (DESign Space ExploRation Tool at System-Level), a novel simulation-based tool for platforms built on heterogeneous multi-core processors. The tool supports power/energy estimation, comprehensive architectural exploration, and optimization of the given embedded applications for multi-core processor architectures. The development of DESSERT consists of three steps. First, we developed generic functional-level power models for different parts of the multi-core system to estimate power/energy, which are integrated into the system-level simulation environment. Second, we built a SystemC-based virtual platform prototype of the processor architecture to accurately extract the functional activities needed by the power models. Third, we designed a runtime task-dependency management and optimization technique (workload or dynamic slack reclamation) based on programming models that support both the OpenMP and Pthread APIs for multi-core execution, covering both data-level and thread-level parallelism. The combination of these three steps leads to a novel Design Space Exploration (DSE) methodology. Power and energy estimates are validated against real board measurements: DESSERT's estimates show less than 5% error and offer reliable power/energy-based DSE for the given applications.
Consistent hash based storage systems are used in many real-world applications for which energy is one of the main cost factors. However, these systems are typically designed and deployed without any mechanisms to save energy at times of low demand. We present an energy-conserving implementation of a consistent hashing based key-value store, called PowerCass, based on Apache's Cassandra. In PowerCass, nodes are divided into three groups: active, dormant, and sleepy. Nodes in the active group cover all the data and run continuously. Dormant nodes are only powered during peak activity time and for replica synchronization. Sleepy nodes are offline almost all the time except for replica synchronization and exceptional peak loads. With this simple and elegant approach we are able to reduce energy consumption by up to 66% compared to the unmodified key-value store Cassandra.
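A minimal sketch of the underlying mechanism, assuming a plain consistent-hash ring where a key's replicas are the next distinct nodes clockwise: off-peak requests are routed only to replicas in the active group, while dormant and sleepy replicas catch up during synchronization windows. Node names and grouping below are hypothetical.

```python
# Sketch of a consistent-hash ring whose nodes belong to active, dormant, or
# sleepy groups. Off-peak requests are served only by powered (active) nodes.
import hashlib
from bisect import bisect_right

def h(value: str) -> int:
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")

nodes = {"n1": "active", "n2": "dormant", "n3": "active", "n4": "sleepy"}
ring = sorted((h(name), name) for name in nodes)          # positions on the ring

def replicas(key: str, count: int = 3):
    """The `count` distinct nodes clockwise from the key's position."""
    start = bisect_right(ring, (h(key), ""))
    picked = []
    for i in range(len(ring)):
        node = ring[(start + i) % len(ring)][1]
        if node not in picked:
            picked.append(node)
        if len(picked) == count:
            break
    return picked

def read_targets(key: str, peak_load: bool):
    """During off-peak periods only active replicas serve requests."""
    reps = replicas(key)
    return reps if peak_load else [n for n in reps if nodes[n] == "active"]

print(replicas("user:42"))                        # full replica set
print(read_targets("user:42", peak_load=False))   # powered subset only
```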
Large synchronization and communication overhead will become a major concern in future extreme-scale machines (e.g., HPC systems, supercomputers). These systems will push performance limits upward by adopting chips equipped with an order of magnitude more cores than today. Alternative execution models can be explored in order to exploit the high parallelism offered by future massive many-core chips. This paper proposes the integration of standard cores with dedicated co-processing units that enable the system to support a fine-grain data-flow execution model developed within the TERAFLUX project. An instruction set architecture extension for supporting fine-grain thread scheduling and execution is proposed. This extension is supported by the co-processor, which provides hardware units for accelerating thread scheduling and distribution among the available cores. Two fundamental aspects underlie the proposed system: programmers can adopt their preferred programming model, and the compilation tools can produce a large set of threads mainly communicating in a producer-consumer fashion, thus enabling data-flow execution. Experimental results demonstrate the feasibility of the proposed approach and its capability of scaling with an increasing number of cores.
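The scheduling principle behind such a fine-grain data-flow model can be sketched in software: each thread carries a count of inputs it is still waiting for, producers decrement the counts of their consumers, and a thread becomes ready only when its count reaches zero. This is an illustrative analogy, not the TERAFLUX ISA extension or its hardware units.

```python
# Minimal software sketch of fine-grain data-flow scheduling: a thread becomes
# ready only when all of its inputs have been produced.
from collections import deque

class DFThread:
    def __init__(self, name, work, n_inputs):
        self.name, self.work = name, work
        self.missing = n_inputs        # synchronization count of pending inputs
        self.inputs = {}

ready = deque()

def write_input(thread, slot, value):
    """Producer side: deliver a value and decrement the consumer's count."""
    thread.inputs[slot] = value
    thread.missing -= 1
    if thread.missing == 0:
        ready.append(thread)           # all inputs present: schedule the thread

# Toy graph: two producers feed one consumer in producer-consumer fashion.
sink = DFThread("sum", lambda ins: print("sum =", ins["a"] + ins["b"]), n_inputs=2)
src_a = DFThread("a", lambda ins: write_input(sink, "a", 3), n_inputs=0)
src_b = DFThread("b", lambda ins: write_input(sink, "b", 4), n_inputs=0)

ready.extend([src_a, src_b])           # threads with no inputs start ready
while ready:
    t = ready.popleft()
    t.work(t.inputs)                   # threads execute to completion, never block
```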
Today's cloud offerings, while promising flexibility, fail to deliver this flexibility to lower-end services with frequent, minute-long idle times. We present DreamServer, an architecture and combination of technologies to deploy virtualized services just-in-time: virtualized web applications are suspended when idle and resurrected only when the next request arrives. We demonstrate that stateful VM resume can be accomplished in less than one second for select applications.
We have implemented DreamServer by customizing well-known open source software projects, the Apache web server and the virtual machine emulator QEMU. Our evaluation of how fast idle services can be reactivated covers different storage technologies, local and networked storage, and multiple VM resume strategies. The empirical results show that just-in-time deployment of virtualized services is possible with minimal additional delay. This brings cloud flexibility closer, especially for customers with sporadic resource usage.
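The just-in-time pattern can be sketched as a front end that holds the first incoming request, resumes the suspended VM, and forwards traffic once the backend answers. The resume command and health check below are hypothetical placeholders for the Apache/QEMU integration, not DreamServer's actual code.

```python
# Simplified sketch of just-in-time service activation: hold the request,
# resume the suspended VM, then proxy once the backend answers.
import socket
import subprocess
import time

BACKEND = ("192.0.2.10", 8080)          # documentation-range address (assumed)

def resume_vm():
    # Placeholder: resuming could go through the QEMU monitor or libvirt.
    subprocess.run(["virsh", "resume", "dreamserver-vm"], check=False)

def backend_ready(timeout=0.2):
    try:
        with socket.create_connection(BACKEND, timeout=timeout):
            return True
    except OSError:
        return False

def handle_request():
    """Called when a request arrives for an idle (suspended) service."""
    if not backend_ready():
        start = time.monotonic()
        resume_vm()
        while not backend_ready():      # wait for the resumed VM to come up
            time.sleep(0.05)
        print(f"resume latency: {time.monotonic() - start:.2f}s")
    # ... forward the held request to BACKEND here ...
```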
The dramatic environmental and economic impact of the ever-increasing power and energy consumption of modern computing devices in data centers is now a critical challenge. On the one hand, designers use technology scaling as one of the methods to face the phenomenon called dark silicon (only segments of a chip can function concurrently due to power restrictions). On the other hand, designers use extreme-scale systems such as teradevices to meet the performance needs of their applications, which in turn increases the power consumption of the platform. In order to overcome these challenges, we need novel computing paradigms that address energy efficiency. One of the promising solutions is to incorporate parallel distributed methodologies at different abstraction levels.
Distributed data centers for multi-cloud environments usually do not consist of homogeneous hardware, as they are not built at the same time by the same owner. Assigning workloads to the most appropriate processing units is therefore a challenging task. In this paper we show how, in the context of heterogeneous data centers, power consumption can be used as a metric to drive scheduling. We study the performance and energy efficiency of a set of heterogeneous architectures for multiple micro-benchmarks (stressing CPU, memory, and disk) and for a real-world cloud application. We observe from our results that some architectures are more energy efficient for disk-intensive workloads, whereas others are better for CPU-intensive workloads. This study provides the basis for workload characterization and cross-cloud scheduling under energy-efficiency constraints.
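A minimal sketch of how such measurements could drive cross-cloud scheduling, assuming a per-architecture energy-efficiency table for each workload class: each job is placed on the architecture expected to spend the least energy. The table values are made up, not measurements from the study.

```python
# Sketch of power-aware placement across heterogeneous architectures: pick the
# architecture with the best measured energy efficiency for the workload class.
energy_per_unit = {                      # joules per unit of work (assumed values)
    "cpu-intense":  {"xeon": 40.0, "atom": 95.0, "arm": 70.0},
    "disk-intense": {"xeon": 60.0, "atom": 35.0, "arm": 30.0},
    "mem-intense":  {"xeon": 50.0, "atom": 65.0, "arm": 55.0},
}

def place(workload_class: str) -> str:
    """Return the architecture expected to spend the least energy."""
    table = energy_per_unit[workload_class]
    return min(table, key=table.get)

for wl in energy_per_unit:
    print(wl, "->", place(wl))
```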
While cloud computing is excellent at supporting elastic services that scale up to tens or hundreds of servers, its support for small-scale applications that only sporadically require one VM is lacking. To better support this sporadic usage model, we employ Software Defined Networking (SDN) technology to expose events related to network activity. Specifically, we rely on notifications when switch flow entries are removed or missing to determine resource (in)activity. Our prototype, Sloth, activates virtual machines based on incoming network traffic. Conversely, idle VMs are suspended to conserve resources. We present the design and architecture of our SDN-enabled on-demand resource deployment solution. Our empirical evaluation shows that VMs can be reactivated in less than one second, triggered by SDN events. This on-demand resource activation opens up novel applications for Cloud providers, allowing them to transparently deactivate idle VMs while maintaining connectivity at the same time.
Efficient processing of fine-pitched Through-Silicon Vias, micro-bumps, and back-side redistribution layers enables face-to-back or face-to-face integration of heterogeneous ICs using 3D stacking and/or silicon interposers. While these technology features are extremely compelling, they considerably stress the existing design practices and EDA tool flows typically conceived for 2D systems. With all the system, technology, and implementation options brought by these features, the design space grows to an extent where traditional 2D tools can no longer be used for efficient exploration. Therefore, the cost-effective design of future 3D IC products will require new planning and co-optimisation techniques and tools that are fast and accurate enough to cope with these challenges. In this paper we present a design methodology and a practical EDA tool chain that covers different aspects of the design flow and is specific to the efficient design of 3D ICs. Flow features include fast synthesis and 3D design partitioning at the gate level, TSV/micro-bump array planning, 3D floorplanning, placement and routing, congestion analysis, fast thermal and mechanical modeling, easy technology-versus-implementation trade-off analysis, 3D device model generation, and Design-for-Test (DfT). The application of the tool chain is illustrated using a concrete example of a real-world design, showing not only the applicability of the tool chain but also the benefits of heterogeneous 2.5D and 3D integration technologies.
The power envelope has become a major issue for the design of computer systems. One way of reducing energy consumption is to downscale the voltage of microprocessors. However, this does not come without costs: by decreasing the voltage, the likelihood of failures increases drastically, and without mechanisms for reliability the systems would no longer operate. Reliability requires (1) error detection and (2) error recovery mechanisms. In this paper we provide a first study investigating the combination of different error detection mechanisms with transactional memory, with the objective of improving energy efficiency. According to our evaluation, using reliability schemes combined with transactional memory for error recovery reduces energy by 54% while providing a reliability level of 100%.
The actor model has been successfully used for scalable computing in distributed systems. Actors are objects with a local state, which can only be modified by the exchange of messages. One of the fundamental principles of actor models is to guarantee sequential message processing, which avoids typical concurrency hazards but limits the achievable message throughput. Preserving the sequential semantics of the actor model is, however, necessary for program correctness. In this paper, we propose to add support for speculative concurrent execution in actors using transactional memory (TM). Our approach is designed to operate with message passing and shared memory, and can thus take advantage of parallelism available on distributed and multi-core systems. The processing of each message is wrapped in a transaction executed atomically and in isolation, but concurrently with other messages. This allows us (1) to scale while keeping the dependability guarantees ensured by sequential message processing, and (2) to further increase the robustness of the actor model against threats, thanks to the rollback ability that comes for free with transactional processing of messages. We validate our design within the Scala programming language and the Akka framework. We show that the overhead of using transactions is hidden by the improved message processing throughput, thus leading to an overall performance gain.
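The speculation scheme can be illustrated with a small optimistic-concurrency sketch: each message is processed against a snapshot of the actor state and committed only if no other message committed in the meantime; otherwise the work is rolled back and retried, so observable behavior matches sequential processing. This is a Python analogy; the actual design uses Scala, Akka, and transactional memory.

```python
# Minimal sketch of speculative message processing with optimistic commit:
# messages run concurrently against a snapshot and retry on conflict.
import threading
from concurrent.futures import ThreadPoolExecutor
from copy import deepcopy

class SpeculativeActor:
    def __init__(self, state):
        self._state = state
        self._version = 0
        self._lock = threading.Lock()     # guards only the short snapshot/commit steps

    def process(self, handler, message):
        while True:                       # transaction: snapshot, compute, commit
            with self._lock:
                version, snapshot = self._version, deepcopy(self._state)
            new_state = handler(snapshot, message)     # runs outside the lock
            with self._lock:
                if self._version == version:           # no concurrent commit happened
                    self._state, self._version = new_state, version + 1
                    return
            # conflict detected: discard new_state (rollback) and retry

def deposit(state, amount):
    state["balance"] += amount
    return state

account = SpeculativeActor({"balance": 0})
with ThreadPoolExecutor(max_workers=4) as pool:
    for _ in range(100):
        pool.submit(account.process, deposit, 1)
print(account._state)                     # {'balance': 100}, as if processed sequentially
```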
PETS: Power and energy estimation tool at system-level