Article

Capelin: Data-Driven Compute Capacity Procurement for Cloud Datacenters Using Portfolios of Scenarios


Abstract

Cloud datacenters provide a backbone to our digital society. Inaccurate capacity procurement for cloud datacenters can lead to significant performance degradation, denser targets for failure, and unsustainable energy consumption. Although this activity is core to improving cloud infrastructure, relatively few comprehensive approaches and support tools exist for mid-tier operators, leaving many planners with merely rule-of-thumb judgement. We derive requirements from a unique survey of experts in charge of diverse datacenters in several countries. We propose Capelin, a data-driven, scenario-based capacity planning system for mid-tier cloud datacenters. Capelin introduces the notion of portfolios of scenarios, which it leverages when probing for alternative capacity plans. At the core of the system, a trace-based, discrete-event simulator enables the exploration of different possible topologies, with support for scaling the volume, variety, and velocity of resources, and for horizontal (scale-out) and vertical (scale-up) scaling. Capelin compares alternative topologies and gives detailed quantitative operational information for each, which can facilitate human capacity-planning decisions. We implement and open-source Capelin, and show through comprehensive trace-based experiments that it can aid practitioners. The results give evidence that reasonable choices can be worse than the best by a factor of 1.5-2.0, in terms of performance degradation or energy consumption.
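To make the portfolio idea concrete, here is a minimal Python sketch of scenario-portfolio evaluation. The class names, metrics, and the closed-form "simulator" stand-in are illustrative assumptions, not the authors' open-sourced implementation.

```python
# Minimal sketch of portfolio-based capacity planning. Hypothetical names;
# a toy closed-form placeholder replaces the real trace-based simulator.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Scenario:
    name: str            # e.g. "baseline", "workload x2", "node failures"
    workload_scale: float

@dataclass
class Topology:
    name: str
    hosts: int
    cores_per_host: int

def simulate(topology: Topology, scenario: Scenario) -> Dict[str, float]:
    """Stand-in for one trace-based discrete-event simulation run."""
    capacity = topology.hosts * topology.cores_per_host
    demand = 1000 * scenario.workload_scale
    overload = max(0.0, demand - capacity) / demand
    return {"perf_degradation": overload,
            "energy_kwh": topology.hosts * 0.35 * 24}

def evaluate_portfolio(topologies: List[Topology],
                       portfolio: List[Scenario]) -> Dict[str, float]:
    """Score each candidate topology across the whole scenario portfolio."""
    scores = {}
    for topo in topologies:
        results = [simulate(topo, s) for s in portfolio]
        # Aggregate conservatively: worst-case degradation over scenarios.
        scores[topo.name] = max(r["perf_degradation"] for r in results)
    return scores
```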


Preprint
Full-text available
Amidst the climate crisis, the massive introduction of renewable energy sources has brought tremendous challenges to both the power grid and its surrounding markets. As datacenters have become ever larger and more powerful, they play an increasingly significant role in the energy arena. With their unique characteristics, datacenters have proved well-suited for regulating the power grid, yet currently provide little, if any, such active response. This is due to issues such as the unsuitability of the market design, the high complexity of currently proposed solutions, and the potential risks thereof. This work aims to provide individual datacenters with insights into the feasibility and profitability of directly participating in the energy market. By modelling the power system of datacenters, and by conducting simulations on real-world datacenter traces, we demonstrate the substantial financial incentive for individual datacenters to directly participate in both the day-ahead and the balancing markets. In turn, we suggest a new short-term, direct scheme of market participation for individual datacenters in place of the current long-term, inactive participation. Furthermore, we develop a novel proactive DVFS scheduling algorithm that can both reduce energy consumption and save energy costs during the market participation of datacenters. In developing this scheduler, we propose an innovative combination of machine learning methods and DVFS technology that can provide the power grid with indirect demand response (DR). Our experimental results strongly support that individual datacenters can and should directly participate in the energy market, both to save energy costs and to curb energy consumption, whilst providing the power grid with indirect DR.
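As a hedged illustration of proactive DVFS driven by a load forecast, consider the toy rule below: pick the lowest CPU frequency whose capacity still covers the predicted load plus a margin. The moving-average predictor, the P-state list, and the margin are assumptions, not the paper's scheduler.

```python
# Toy proactive DVFS: lowest sufficient frequency saves energy.
FREQS_GHZ = [1.2, 1.8, 2.4, 3.0]   # available P-states (assumed)

def forecast_load(history, window=12):
    """Naive moving-average forecast of normalized CPU load in [0, 1]."""
    recent = history[-window:]
    return sum(recent) / len(recent)

def pick_frequency(history, margin=0.1):
    predicted = forecast_load(history)
    f_max = FREQS_GHZ[-1]
    for f in FREQS_GHZ:
        # Frequency f delivers roughly f/f_max of peak capacity.
        if (f / f_max) >= predicted + margin:
            return f
    return f_max

print(pick_frequency([0.4, 0.45, 0.5, 0.48]))  # -> 1.8
```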
Conference Paper
Full-text available
Datacenters act as cloud-infrastructure to stakeholders across industry, government, and academia. To meet growing demand yet operate efficiently, datacenter operators employ increasingly more sophisticated scheduling systems, mechanisms, and policies. Although many scheduling techniques already exist, relatively little research has gone into the abstraction of the scheduling process itself, hampering design, tuning, and comparison of existing techniques. In this work, we propose a reference architecture for datacenter schedulers. The architecture follows five design principles: components with clearly distinct responsibilities, grouping of related components where possible, separation of mechanism from policy, scheduling as complex workflow, and hierarchical multi-scheduler structure. To demonstrate the validity of the reference architecture, we map to it state-of-the-art datacenter schedulers. We find that scheduler stages are commonly underspecified in peer-reviewed publications. Through trace-based simulation and real-world experiments, we show that underspecification of scheduler stages can lead to significant variations in performance.
Article
Full-text available
In only a decade, cloud computing has emerged from a pursuit for a service-driven information and communication technology (ICT), becoming a significant fraction of the ICT market. Responding to the growth of the market, many alternative cloud services and their underlying systems are currently vying for the attention of cloud users and providers. To make informed choices between competing cloud service providers, permit the cost-benefit analysis of cloud-based systems, and enable system DevOps to evaluate and tune the performance of these complex ecosystems, appropriate performance metrics, benchmarks, tools, and methodologies are necessary. This requires re-examining old system properties and considering new system properties, possibly leading to the re-design of classic benchmarking metrics such as expressing performance as throughput and latency (response time). In this work, we address these requirements by focusing on four system properties: (i) elasticity of the cloud service, to accommodate large variations in the amount of service requested, (ii) performance isolation between the tenants of shared cloud systems and resulting performance variability, (iii) availability of cloud services and systems, and (iv) the operational risk of running a production system in a cloud environment. Focusing on key metrics for each of these properties, we review the state-of-the-art, then select or propose new metrics together with measurement approaches. We see the presented metrics as a foundation toward upcoming, future industry-standard cloud benchmarks.
Article
Full-text available
Our society is digital: industry, science, governance, and individuals depend, often transparently, on the inter-operation of large numbers of distributed computer systems. Although society takes them almost for granted, these computer ecosystems are not available for all, may not be affordable for long, and raise numerous other research challenges. Inspired by these challenges and by our experience with distributed computer systems, we envision Massivizing Computer Systems, a domain of computer science focusing on successfully understanding, controlling, and evolving such ecosystems. Beyond establishing and growing a body of knowledge about computer ecosystems and their constituent systems, the community in this domain should also aim to educate many about design and engineering for this domain, and all people about its principles. This is a call to the entire community: there is much to discover and achieve.
Conference Paper
Full-text available
The REliable CApacity Provisioning and enhanced remediation for distributed cloud applications (RECAP) project aims to advance cloud and edge computing technology, to develop mechanisms for reliable capacity provisioning, and to make application placement, infrastructure management, and capacity provisioning autonomous, predictable and optimized. This paper presents the RECAP vision for an integrated edge-cloud architecture, discusses the scientific foundation of the project, and outlines plans for toolsets for continuous data collection, application performance modeling, application and component auto-scaling and remediation, and deployment optimization. The paper also presents four use cases from complementing fields that will be used to showcase the advancements of RECAP.
Article
Full-text available
Business-critical workloads - web servers, mail servers, app servers, etc. - are increasingly hosted in virtualized datacenters acting as Infrastructure-as-a-Service clouds (cloud datacenters). Understanding how business-critical workloads demand and use resources is key in capacity sizing, in infrastructure operation and testing, and in application performance management. However, relatively little is currently known about these workloads, because the information is complex - large-scale, heterogeneous, shared clusters - and because datacenter operators remain reluctant to share such information. Moreover, the few operators that have shared data (e.g., Google and several supercomputing centers) have enabled studies in business intelligence (MapReduce), search, and scientific computing (HPC), but not in business-critical workloads. To alleviate this situation, in this work we conduct a comprehensive study of business-critical workloads hosted in cloud datacenters. We collect two large-scale and long-term workload traces corresponding to requested and actually used resources in a distributed datacenter servicing business-critical workloads, and perform an in-depth analysis of these traces. Our study sheds light on the workloads of cloud datacenters hosting business-critical applications. The results can serve as a basis for developing efficient resource management mechanisms for datacenters, and the traces we release can be used for workload verification and modelling, and for evaluating resource scheduling policies.
Article
Full-text available
As real systems become larger and more complex, the use of simulation frameworks grows in our research community. By leveraging them, users can focus on the major aspects of their algorithm, run in-silico experiments (i.e., simulations), and thoroughly analyze results, even for a large-scale environment, without facing the complexity of conducting in-vivo studies (i.e., on real testbeds). Since virtual machine (VM) technology has become a fundamental building block of distributed computing environments, in particular in cloud infrastructures, our community needs a full-fledged simulation framework that enables us to investigate large-scale virtualized environments through accurate simulations. To be adopted, such a framework should provide easy-to-use APIs as well as accurate simulation results. In this paper, we present a highly-scalable and versatile simulation framework supporting VM environments. By leveraging SimGrid, a widely-used open-source simulation toolkit, our simulation framework allows users to launch hundreds of thousands of VMs in their simulation programs and control VMs in the same manner as in the real world (e.g., suspend/resume and migrate). Users can execute computation and communication tasks on physical machines (PMs) and VMs through the same SimGrid API, which provides a seamless migration path to IaaS simulations for hundreds of SimGrid users. Moreover, SimGrid VM includes a live migration model implementing the precopy migration algorithm. This model correctly calculates the migration time as well as the migration traffic, taking into account resource contention caused by other computations and data exchanges within the whole system. This allows users to obtain accurate results for dynamic virtualized systems. We confirmed the accuracy of both the VM and the live migration models by conducting several micro-benchmarks under various conditions. Finally, we conclude the article by presenting a first use case of a consolidation algorithm dealing with a significant number of VMs/PMs. In addition to confirming the accuracy and scalability of our framework, this first scenario illustrates the main interest of SimGrid VM: investigating, through in-silico experiments, the pros and cons of new algorithms in order to limit expensive in-vivo experiments to only the most promising ones.
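The precopy algorithm mentioned above admits a simple back-of-the-envelope model: each round retransmits the pages dirtied during the previous round, until the residue is small enough for a final stop-and-copy. The sketch below is an independent toy with assumed parameters, not the SimGrid VM model, and it ignores the resource contention the paper accounts for.

```python
# Back-of-the-envelope precopy live-migration estimate (toy model).
def precopy_estimate(mem_mb, bw_mb_s, dirty_mb_s,
                     stop_mb=8.0, max_rounds=30):
    """Return (total migration time [s], total traffic [MB])."""
    total_time, total_traffic = 0.0, 0.0
    to_copy = mem_mb                       # round 0 copies all of RAM
    for _ in range(max_rounds):
        t = to_copy / bw_mb_s              # duration of this copy round
        total_time += t
        total_traffic += to_copy
        to_copy = dirty_mb_s * t           # pages dirtied meanwhile
        if to_copy <= stop_mb:
            break
    # Final stop-and-copy round for the remaining dirty pages.
    total_time += to_copy / bw_mb_s
    total_traffic += to_copy
    return total_time, total_traffic

print(precopy_estimate(mem_mb=4096, bw_mb_s=1000, dirty_mb_s=200))
```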
Article
Full-text available
Elastic resource scaling lets cloud systems meet application service level objectives (SLOs) with minimum resource provisioning costs. In this paper, we present CloudScale, a system that automates fine-grained elastic resource scaling for multi-tenant cloud computing infrastructures. CloudScale employs online resource demand prediction and prediction error handling to achieve adaptive resource allocation without assuming any prior knowledge about the applications running inside the cloud. CloudScale can resolve scaling conflicts between applications using migration, and integrates dynamic CPU voltage/frequency scaling to achieve energy savings with minimal effect on application SLOs. We have implemented CloudScale on top of Xen and conducted extensive experiments using a set of CPU and memory intensive applications (RUBiS, Hadoop, IBM System S). The results show that CloudScale can achieve significantly higher SLO conformance than other alternatives with low resource and energy cost. CloudScale is non-intrusive and light-weight, and imposes negligible overhead.
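A minimal sketch of the prediction-plus-padding pattern the abstract describes: the last-value predictor and the worst-recent-error padding below are placeholder assumptions, not CloudScale's actual online algorithms.

```python
# Sketch of elastic scaling with prediction-error padding (toy rule).
def predict_demand(history):
    """Last-value predictor; a placeholder for an online predictor."""
    return history[-1]

def allocate(history, recent_errors, cap=1.0):
    """Allocation = prediction + padding derived from recent errors."""
    padding = max(recent_errors, default=0.0)  # worst recent underestimate
    return min(cap, predict_demand(history) + padding)

hist = [0.30, 0.35, 0.40]        # normalized CPU demand samples
errors = [0.02, 0.05]            # past (actual - predicted) underestimates
print(allocate(hist, errors))    # -> 0.45
```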
Article
Full-text available
Simulation techniques have become a powerful tool for deciding the best starting conditions in pay-as-you-go scenarios. This is the case for public cloud infrastructures, where a given number and type of virtual machines (VMs, in short) are instantiated for a specified time, which is reflected in the final budget. With this in mind, this paper introduces and validates iCanCloud, a novel simulator of cloud infrastructures with remarkable features such as flexibility, scalability, performance, and usability. Furthermore, the iCanCloud simulator has been built on the following design principles: (1) it is targeted at conducting large experiments, as opposed to other simulators in the literature; (2) it provides a flexible and fully customizable global hypervisor for integrating any cloud brokering policy; (3) it reproduces the instance types provided by a given cloud infrastructure; and finally, (4) it contains a user-friendly GUI for configuring and launching simulations, ranging from a single VM to large cloud computing systems composed of thousands of machines.
Conference Paper
Full-text available
Virtualization is an essential technology in modern datacenters. Despite advantages such as security isolation, fault isolation, and environment isolation, current virtualization techniques do not provide effective performance isolation between virtual machines (VMs). Specifically, hidden contention for physical resources impacts performance differently in different workload configurations, causing significant variance in observed system throughput. To this end, characterizing workloads that generate performance interference is important in order to maximize overall utility. In this paper, we study the effects of performance interference by looking at system-level workload characteristics. In a physical host, we allocate two VMs, each of which runs a sample application chosen from a wide range of benchmark and real-world workloads. For each combination, we collect performance metrics and runtime characteristics using an instrumented Xen hypervisor. Through subsequent analysis of collected data, we identify clusters of applications that generate certain types of performance interference. Furthermore, we develop mathematical models to predict the performance of a new application from its workload characteristics. Our evaluation shows our techniques were able to predict performance with an average error of approximately 5%.
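A least-squares fit is one simple way to realize such a prediction model; the feature set and data below are fabricated for illustration and are not the paper's models.

```python
# Toy interference predictor: fit normalized performance against
# system-level workload features with ordinary least squares.
import numpy as np

# Rows: co-located workload pairs; cols: [cpu_util, mem_bw, disk_iops].
X = np.array([[0.9, 0.7, 0.2],
              [0.4, 0.3, 0.8],
              [0.8, 0.8, 0.7],
              [0.2, 0.1, 0.1]])
y = np.array([0.65, 0.80, 0.55, 0.98])   # observed normalized throughput

# Add an intercept column and solve the least-squares fit.
A = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

new = np.array([0.6, 0.5, 0.4, 1.0])     # features of a new pairing
print(float(new @ coef))                  # predicted performance
```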
Conference Paper
Full-text available
Resource pools are computing environments that offer virtualized access to shared resources. When used effectively they can align the use of capacity with business needs (flexibility), lower infrastructure costs (via resource sharing), and lower operating costs (via automation). This paper describes the Quartermaster capacity manager service for managing such pools. It implements a trace-based technique that models workload (e.g., application) resource demands, their corresponding resource allocations, and resource access quality of service. The primary advantages of the technique are its accuracy, generality, support for resource access qualities of service, and optimizing search method. We pose general capacity management questions for resource pools and explain how the capacity manager helps to address them in an automated manner. A case study demonstrates and validates the method on empirical data from an enterprise application. We show that the technique exploits much of the resource savings to be achieved from resource sharing and is significantly more accurate at estimating per-server required capacity than a benchmark method used in practice to manage a resource pool. Finally, we explain how the problems relate to other practices regarding enterprise capacity management and software performance engineering.
Conference Paper
Full-text available
As the complexity of IT systems increases, performance management and capacity planning become the largest and most difficult expenses to control. New methodologies and modeling techniques that explain large-system behavior and help predict their future performance are now needed to effectively tackle the emerging performance issues. With the multi-tier architecture paradigm becoming an industry standard for developing scalable client-server applications, it is important to design effective and accurate performance prediction models of multi-tier applications under an enterprise production environment and a real workload mix. To accurately answer performance questions for an existing production system with a real workload mix, we design and implement a new capacity planning and anomaly detection tool, called R-Capriccio, that is based on the following three components: i) a Workload Profiler that exploits locality in existing enterprise web workloads and extracts a small set of most popular, core client transactions responsible for the majority of client requests in the system; ii) a Regression-based Solver that is used for deriving the CPU demand of each core transaction on a given hardware; and iii) an Analytical Model that is based on a network of queues that models a multi-tier system. To validate R-Capriccio, we conduct a detailed case study using the access logs from two heterogeneous production servers that represent customized client accesses to a popular and actively used HP OpenView Service Desk application.
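The Regression-based Solver step can be illustrated in a few lines: per-interval CPU usage is explained as a linear combination of per-transaction-type counts, yielding an estimated CPU demand per core transaction. The data below are invented.

```python
# Regression step of the R-Capriccio idea, sketched with numpy.
import numpy as np

# N[i, j] = count of transaction type j in monitoring interval i (toy).
N = np.array([[120,  30,  5],
              [ 80,  60, 10],
              [200,  10,  2],
              [ 50,  90, 20]])
cpu = np.array([29.5, 27.0, 42.0, 29.0])  # CPU seconds per interval (toy)

demand, *_ = np.linalg.lstsq(N, cpu, rcond=None)
print(demand)  # estimated CPU seconds per transaction of each type
```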
Conference Paper
Full-text available
Resilience against unexpected server failures is a key desirable function of quality-assured service systems. A good capacity planning decision should cost-effectively allocate spare capacity for exploiting failure resilience mechanisms. In this paper, we propose an optimal capacity planning algorithm for server-cluster based service systems, particularly the ones that provision composite services via several servers. The algorithm takes into account two commonly used failure resilience mechanisms: intra-cluster load-controlling and inter-cluster failover. The goal is to minimize the resource cost while assuring service levels on the end-to-end throughput and response time of provisioned composite services under normal conditions and server failure conditions. We illustrate that the stated goal can be formalized as a capacity planning optimization problem and can be solved mathematically via convex analysis and linear optimization techniques. We also quantitatively demonstrate that the proposed algorithm can find the min-cost capacity planning solution that assures the end-to-end performance of managed composite services for both the non-failure case and the common server failure cases in a three-tier web-based service system with multiple server clusters. To the best of our knowledge, this paper presents the first research effort in optimizing the cost of supporting failure resilience for quality-assured composite services.
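A stripped-down version of the min-cost question can be posed as a linear program; the sketch below, with invented loads, costs, and a single-cluster-failover constraint, only gestures at the paper's richer formulation.

```python
# Min-cost capacity with single-failure resilience as an LP (toy).
from scipy.optimize import linprog

cost = [1.0, 1.0, 1.2]          # unit capacity cost per cluster (assumed)
load = [40, 40, 40]             # nominal load per cluster
total = sum(load)

# If cluster k fails, the surviving clusters must absorb the total load:
# sum_{i != k} c_i >= total, written as -sum_{i != k} c_i <= -total.
A_ub, b_ub = [], []
for k in range(3):
    A_ub.append([-1.0 if i != k else 0.0 for i in range(3)])
    b_ub.append(-total)

res = linprog(cost, A_ub=A_ub, b_ub=b_ub,
              bounds=[(l, None) for l in load])
print(res.x)   # min-cost capacities that survive any single failure
```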
Conference Paper
Full-text available
Distributed systems such as grids, peer-to-peer systems, and even Internet DNS servers have grown significantly in size and complexity in the last decade. This rapid growth has allowed distributed systems to serve a large and increasing number of users, but has also made resource and system failures inevitable. Moreover, perhaps as a result of system complexity, in distributed systems a single failure can trigger within a short time span several more failures, forming a group of time-correlated failures. To eliminate or alleviate the significant effects of failures on performance and functionality, the techniques for dealing with failures require good failure models. However, not many such models are available, and the available models are valid for only a few, or even a single, distributed system. In contrast, in this work we propose a model that considers groups of time-correlated failures and is valid for many types of distributed systems. Our model includes three components: the group size, the group inter-arrival time, and the resource downtime caused by the group. To validate this model, we use failure traces corresponding to fifteen distributed systems. We find that space-correlated failures are dominant in terms of resource downtime in seven of the fifteen studied systems. For each of these seven systems, we provide a set of model parameters that can be used in research studies or for tuning distributed systems. Last, as a result of our work, six of the studied traces have been made available through the Failure Trace Archive (http://fta.inria.fr).
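A generative reading of the three-component model looks roughly as follows; the exponential placeholders are assumptions, not the distributions the authors fit.

```python
# Generate correlated failure groups from the three model components:
# group inter-arrival time, group size, and per-group downtime.
import random

def failure_groups(horizon_h, mean_gap_h=24.0, mean_size=3, mean_down_h=2.0):
    t, groups = 0.0, []
    while True:
        t += random.expovariate(1.0 / mean_gap_h)   # group inter-arrival
        if t > horizon_h:
            return groups
        size = 1 + int(random.expovariate(1.0 / (mean_size - 1)))
        downtime = random.expovariate(1.0 / mean_down_h)
        groups.append((t, size, downtime))

for g in failure_groups(7 * 24):                    # one simulated week
    print("t=%.1fh size=%d downtime=%.2fh" % g)
```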
Article
Full-text available
Large scale distributed systems such as Grids are difficult to study from theoretical models and simulators only. Most Grids deployed at large scale are production platforms that are inappropriate research tools because of their limited reconfiguration, control and monitoring capabilities. In this paper, we present Grid'5000, a 5000 CPU nation-wide infrastructure for research in Grid computing. Grid'5000 is designed to provide a scientific tool for computer scientists similar to the large-scale instruments used by physicists, astronomers, and biologists. We describe the motivations, design considerations, architecture, control, and monitoring infrastructure of this experimental platform. We present configuration examples and performance results for the reconfiguration subsystem.
Article
Full-text available
The data centers used to create cloud services represent a significant investment in capital outlay and ongoing costs. Accordingly, we first examine the costs of cloud service data centers today. The cost breakdown reveals the importance of optimizing work completed per dollar invested. Unfortunately, the resources inside the data centers often operate at low utilization due to resource stranding and fragmentation. To attack this first problem, we propose (1) increasing network agility, and (2) providing appropriate incentives to shape resource consumption. Second, we note that cloud service providers are building out geo-distributed networks of data centers. Geo-diversity lowers latency to users and increases reliability in the presence of an outage taking out an entire site. However, without appropriate design and management, these geo-diverse data center networks can raise the cost of providing service. Moreover, leveraging geo-diversity requires services be designed to benefit from it. To attack this problem, we propose (1) joint optimization of network and data center resources, and (2) new systems and mechanisms for geo-distributing state.
Book
Tackling the questions that systems designers care about, this book brings queueing theory decisively back to computer science. The book is written with computer scientists and engineers in mind and is full of examples from computer systems, as well as manufacturing and operations research. Fun and readable, the book is highly approachable, even for undergraduates, while still being thoroughly rigorous and also covering a much wider span of topics than many queueing books. Readers benefit from a lively mix of motivation and intuition, with illustrations, examples and more than 300 exercises – all while acquiring the skills needed to model, analyze and design large-scale systems with good performance and low cost. The exercises are an important feature, teaching research-level counterintuitive lessons in the design of computer systems. The goal is to train readers not only to customize existing analyses but also to invent their own.
Conference Paper
Data center networks evolve as they serve customer traffic. When applying network changes, operators risk impacting customer traffic because the network operates at reduced capacity and is more vulnerable to failures and traffic variations. The impact on customer traffic ultimately translates to operator cost (e.g., refunds to customers). However, planning a network change while minimizing the risks is challenging, as we need to adapt to a variety of traffic dynamics and cost functions while scaling to large networks and large changes. Today, operators often use plans that maximize the residual capacity (MRC), which often incurs a high cost under different traffic dynamics. Instead, we propose Janus, which searches the large planning space by leveraging the high degree of symmetry in data center networks. Our evaluation on large Clos networks and Facebook traffic traces shows that Janus generates plans in real-time, needing only 33-71% of the cost of MRC planners while adapting to a variety of settings.
Article
This book describes warehouse-scale computers (WSCs), the computing platforms that power cloud computing and all the great web services we use every day. It discusses how these new systems treat the datacenter itself as one massive computer designed at warehouse scale, with hardware and software working in concert to deliver good levels of internet service performance. The book details the architecture of WSCs and covers the main factors influencing their design, operation, and cost structure, and the characteristics of their software base. Each chapter contains multiple real-world examples, including detailed case studies and previously unpublished details of the infrastructure used to power Google's online services. Targeted at the architects and programmers of today's WSCs, this book provides a great foundation for those looking to innovate in this fascinating and important area, but the material will also be broadly interesting to those who just want to understand the infrastructure powering the internet. The third edition reflects four years of advancements since the previous edition and nearly doubles the number of pictures and figures. New topics range from additional workloads like video streaming, machine learning, and public cloud to specialized silicon accelerators, storage and network building blocks, and a revised discussion of data center power and cooling, and uptime. Further discussions of emerging trends and opportunities ensure that this revised edition will remain an essential resource for educators and professionals working on the next generation of WSCs.
Article
In this article, the authors provide their views on whether organizations should scale up or scale out their graph computations. This question was explored in a previous installment of this column by Jimmy Lin, where he made a case for scale-up through several examples. In response, the authors discuss three cases for scale-out.
Article
Today's cloud services extensively rely on replication techniques to ensure availability and reliability. In complex datacenter network architectures, however, seemingly independent replica servers may inadvertently share deep dependencies (e.g., aggregation switches). Such unexpected common dependencies may potentially result in correlated failures across the entire replication deployment, invalidating the effort. Although existing cloud management and diagnosis tools have been able to offer post-failure forensics, they nevertheless typically lead to quite prolonged failure recovery times in cloud-scale systems. In this paper, we propose a novel language framework, named RepAudit, that manages to prevent correlated failure risks before service outages occur, by allowing cloud administrators to proactively audit the replication deployments of interest. In particular, RepAudit consists of three new components: 1) a declarative domain-specific language, RAL, for cloud administrators to write auditing programs expressing diverse auditing tasks; 2) a high-performance RAL auditing engine that generates the auditing results by accurately and efficiently analyzing the underlying structures of the target replication deployments; and 3) an RAL-code generator that can automatically produce complex RAL programs based on easily written specifications. Our evaluation shows that RepAudit uses 80x fewer lines of code than state-of-the-art efforts in expressing the auditing task of determining the top-20 critical correlated-failure root causes. To the best of our knowledge, RepAudit is the first effort capable of simultaneously offering expressive, accurate, and efficient correlated-failure auditing to cloud-scale replication systems.
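The core auditing question, whether nominally independent replicas share an upstream dependency, reduces to intersecting dependency chains in the topology graph. The sketch below uses an invented topology encoding and does not emulate the RAL language.

```python
# Tiny correlated-failure audit: find network dependencies shared by
# a set of replicas. Topology is a toy child -> parent mapping.
parents = {
    "srv1": "tor1", "srv2": "tor1",     # both under the same ToR switch!
    "srv3": "tor2",
    "tor1": "agg1", "tor2": "agg1",
    "agg1": "core",
}

def dependency_chain(node):
    chain = []
    while node in parents:
        node = parents[node]
        chain.append(node)
    return chain

def shared_dependencies(replicas):
    chains = [set(dependency_chain(r)) for r in replicas]
    return set.intersection(*chains)

print(shared_dependencies(["srv1", "srv2"]))   # {'tor1', 'agg1', 'core'}
print(shared_dependencies(["srv1", "srv3"]))   # {'agg1', 'core'}
```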
Conference Paper
Cloud research to date has lacked data on the characteristics of the production virtual machine (VM) workloads of large cloud providers. A thorough understanding of these characteristics can inform the providers' resource management systems, e.g. VM scheduler, power manager, server health manager. In this paper, we first introduce an extensive characterization of Microsoft Azure's VM workload, including distributions of the VMs' lifetime, deployment size, and resource consumption. We then show that certain VM behaviors are fairly consistent over multiple lifetimes, i.e. history is an accurate predictor of future behavior. Based on this observation, we next introduce Resource Central (RC), a system that collects VM telemetry, learns these behaviors offline, and provides predictions online to various resource managers via a general client-side library. As an example of RC's online use, we modify Azure's VM scheduler to leverage predictions in oversubscribing servers (with oversubscribable VM types), while retaining high VM performance. Using real VM traces, we then show that the prediction-informed schedules increase utilization and prevent physical resource exhaustion. We conclude that providers can exploit their workloads' characteristics and machine learning to improve resource management substantially.
Article
Infrastructure as a Service (IaaS) cloud providers typically offer multiple service classes to satisfy users with different requirements and budgets. Cloud providers are faced with the challenge of estimating the minimum resource capacity required to meet Service Level Objectives (SLOs) defined for all service classes. This paper proposes a capacity planning method that is combined with an admission control mechanism to address this challenge. The capacity planning method uses analytical models to estimate the output of a quota-based admission control mechanism and find the minimum capacity required to meet availability SLOs and admission rate targets for all classes. An evaluation using trace-driven simulations shows that our method estimates the best cloud capacity with a mean relative error of 2.5% with respect to the simulation, compared to a 36% relative error achieved by a single-class baseline method that does not consider admission control mechanisms. Moreover, our method exhibited a high SLO fulfillment for both availability and admission rates, and obtained mean CPU utilization over 91%, while the single-class baseline method had values not greater than 78%.
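The "minimum capacity meeting the SLOs" search can be sketched as a loop over candidate capacities, with a toy quota-based admission check standing in for the paper's analytical models; all numbers are made up.

```python
# Find the smallest capacity whose quota-based admission control still
# meets an admission-rate target (toy stand-in for analytical models).
def admission_rate(capacity, arrivals, quota):
    admitted = used = 0
    for demand in arrivals:
        if used + demand <= capacity * quota:   # quota caps admissions
            used += demand
            admitted += 1
    return admitted / len(arrivals)

def min_capacity(arrivals, quota=0.9, admit_target=0.95):
    cap = 1
    while admission_rate(cap, arrivals, quota) < admit_target:
        cap += 1
    return cap

print(min_capacity([4, 2, 8, 1, 3, 2, 5]))   # smallest sufficient capacity
```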
Article
New pricing policies are emerging where cloud providers charge resource provisioning based on the allocated CPU frequencies. As a result, resources are offered to users as combinations of different performance levels and prices which can be configured at runtime. With such new pricing schemes and the increasing energy costs in data centres, balancing energy savings with performance and revenue losses is a challenging problem for cloud providers. CPU frequency scaling can be used to reduce power dissipation, but also impacts VM performance and therefore revenue. In this paper, we firstly propose a non-linear power model that estimates power dissipation of a multi-core CPU physical machine (PM) and secondly a pricing model that adjusts the pricing based on the VM’s CPU-boundedness characteristics. Finally, we present a cloud controller that uses these models to allocate VMs and scale CPU frequencies of the PMs to achieve energy cost savings that exceed service revenue losses. We evaluate the proposed approach using simulations with realistic VM workloads, electricity price and temperature traces and estimate energy savings of up to 14.57%.
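A non-linear power model of this general shape, idle power plus a dynamic term growing roughly with frequency cubed, can be written down directly; the coefficients below are illustrative, not the paper's fitted values.

```python
# Simple non-linear PM power model: P = P_idle + P_dyn * u * (f/f_max)^3.
def power_watts(freq_ghz, util, f_max=3.0,
                p_idle=70.0, p_dyn_max=130.0):
    """Estimate PM power from CPU frequency and utilization."""
    dynamic = p_dyn_max * util * (freq_ghz / f_max) ** 3
    return p_idle + dynamic

for f in (1.5, 2.0, 3.0):
    print(f, power_watts(f, util=0.8))   # power rises steeply with f
```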
Conference Paper
In today's commercial data centers, the computation density grows continuously as the number of hardware components and workloads in units of virtual machines increase. The service availability guaranteed by data centers heavily depends on the reliability of the physical and virtual servers. In this study, we conduct an analysis on 10K virtual and physical machines hosted on five commercial data centers over an observation period of one year. Our objective is to establish a sound understanding of the differences and similarities between failures of physical and virtual machines. We first capture their failure patterns, i.e., the failure rates, the distributions of times between failures and of repair times, as well as, the time and space dependency of failures. Moreover, we correlate failures with the resource capacity and run-time usage to identify the characteristics of failing servers. Finally, we discuss how virtual machine management actions, i.e., consolidation and on/off frequency, impact virtual machine failures.
Article
The Mnemos resource management and scheduling architecture uses portfolio scheduling, topology-aware virtual-resource management, and state information to self-adapt to significant workload changes and to analyze risks. Simulations with real-world workload traces reveal the potential for significant cost savings.
Article
From an enterprise perspective, one key motivation to transform the traditional IT management into Cloud is the cost reduction of the hosted services. In an Infrastructure-as-a-Service (IaaS) Cloud, virtual machine (VM) instances share the physical machines (PMs) in the provider's data center. With a large number of PMs, providers can maintain a low cost of service downtime at the expense of higher infrastructure and other operational costs (e.g., power consumption and cooling costs). Hence, determining the optimal PM capacity requirements that minimize the overall cost is of interest. In this paper, we show how a cost analysis and optimization framework can be developed using stochastic availability and performance models of an IaaS Cloud. Specifically, we study two cost minimization problems to address the capacity planning in an IaaS Cloud: (1) what is the optimal number of PMs that minimizes the total cost of ownership for a given downtime requirement set by service level agreements? and, (2) is it more economical to use cheaper but less reliable PMs, or to use costlier but more reliable PMs, for ensuring the same availability characteristics? We use simulated annealing, a well-known stochastic search algorithm, to solve these optimization problems. Results from our analysis show that the optimal solutions are found within reasonable time.
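A skeleton of simulated annealing over the number of PMs might look as follows, with a stand-in cost function replacing the stochastic availability and performance models; all constants are invented.

```python
# Simulated annealing over PM count against a toy cost function:
# infrastructure cost grows with n, downtime penalty shrinks with n.
import math, random

def total_cost(n_pms):
    infra = 500.0 * n_pms                      # infrastructure + power
    downtime_penalty = 1e6 / (n_pms ** 2)      # fewer PMs -> more downtime
    return infra + downtime_penalty

def anneal(n0=10, t0=100.0, cooling=0.95, steps=500):
    n, best, t = n0, n0, t0
    for _ in range(steps):
        cand = max(1, n + random.choice([-1, 1]))   # neighbor move
        delta = total_cost(cand) - total_cost(n)
        if delta < 0 or random.random() < math.exp(-delta / t):
            n = cand                                # accept (maybe uphill)
            if total_cost(n) < total_cost(best):
                best = n
        t *= cooling                                # cool the temperature
    return best

print(anneal())   # toy optimum: (2e6/500)**(1/3) ~ 16 PMs
```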
Article
In the cloud context, pricing and capacity planning are two important factors to the profit of the Infrastructure-as-a-Service (IaaS) providers. This paper investigates the problem of joint pricing and capacity planning in the IaaS provider market with a set of Software-as-a-Service (SaaS) providers, where each SaaS provider leases the virtual machines (VMs) from the IaaS providers to provide cloud-based application services to its end-users. We study two market models, one with a monopoly IaaS provider market, the other with multiple-IaaS-provider market. For the monopoly IaaS provider market, we first study the SaaS providers’ optimal decisions in terms of the amount of end-user requests to admit and the number of VMs to lease, given the resource price charged by the IaaS provider. Based on the best responses of the SaaS providers, we then derive the optimal solution to the problem of joint pricing and capacity planning to maximize the IaaS provider’s profit. Next, for the market with multiple IaaS providers, we formulate the pricing and capacity planning competition among the IaaS providers as a three-stage Stackelberg game. We explore the existence and uniqueness of Nash equilibrium, and derive the conditions under which there exists a unique Nash equilibrium. Finally, we develop an iterative algorithm to achieve the Nash equilibrium.
Article
Qualitative research design can be complicated depending upon the level of experience a researcher may have with a particular type of methodology. As researchers, many aspire to grow and expand their knowledge and experience with qualitative design in order to better utilize diversified research paradigms for future investigations. One of the more popular areas of interest in qualitative research design is the interview protocol. Interviews provide in-depth information pertaining to participants' experiences and viewpoints on a particular topic. Often, interviews are coupled with other forms of data collection in order to provide the researcher with a well-rounded collection of information for analysis. This paper explores effective ways to conduct in-depth, qualitative interviews for novice investigators, by employing a step-by-step process for implementation and by expanding upon the practical components of each interview design.
Article
We present CCM (Cloud Capacity Manager) - a prototype system and its methods for dynamically multiplexing the compute capacity of virtualized datacenters at scales of thousands of machines, for diverse workloads with variable demands. Extending prior studies primarily concerned with accurate capacity allocation and ensuring acceptable application performance, CCM also sheds light on the tradeoffs due to two unavoidable issues in large scale commodity datacenters: (i) maintaining low operational overhead given variable cost of performing management operations necessary to allocate resources, and (ii) coping with the increased incidences of these operations' failures. CCM is implemented in an industry-strength cloud infrastructure built on top of the VMware vSphere virtualization platform and is currently deployed in a 700 physical host datacenter. Its experimental evaluation uses production workload traces and a suite of representative cloud applications to generate dynamic scenarios. Results indicate that the pragmatic cloud-wide nature of CCM provides up to 25% more resources for workloads and improves datacenter utilization by up to 20%, compared to the common alternative approach of multiplexing capacity within multiple independent smaller datacenter partitions.
Article
Studies on data center capacity planning, maintenance, and reorganization have been of interest to all its stakeholders since data centers were first instituted. A recent study shows that data center costs contribute to nearly 25% of all information technology budgets in a company. Several methodologies have been adopted for strategic data center capacity reduction, such as dynamic shutdown, virtualization, and logical partitions. The greatest challenge around data center capacity reduction is an approach that captures all data center variables and allows for strategic reduction in capacity while minimizing risks. This paper uses a causal Bayesian Belief Network to represent the data center capacity planning decision process. It encapsulates three areas that influence data center demand: market conditions, development process, and internal business decisions. The approach uses sensitivity analysis to narrow down the factors that influence the decision process the most, while providing an opportunity, if one exists, to also reduce unused data center capacity. An iterative approach was applied to develop a causal Bayesian Belief Network, to carry out decisions at each stage, and to collect sensitivity values. Training data was simulated using Geometric Brownian motion generated through Monte-Carlo simulation. The Bayesian Belief Network itself was designed using Netica.
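Geometric Brownian motion is straightforward to simulate for generating such training data; the drift and volatility values below are arbitrary.

```python
# Monte-Carlo ensemble of Geometric Brownian motion paths:
# S(t+dt) = S(t) * exp((mu - sigma^2/2) dt + sigma sqrt(dt) Z).
import math, random

def gbm_path(s0=100.0, mu=0.05, sigma=0.2, steps=52, dt=1 / 52):
    s, path = s0, [s0]
    for _ in range(steps):
        z = random.gauss(0.0, 1.0)
        s *= math.exp((mu - 0.5 * sigma ** 2) * dt
                      + sigma * math.sqrt(dt) * z)
        path.append(s)
    return path

paths = [gbm_path() for _ in range(1000)]       # Monte-Carlo ensemble
print(sum(p[-1] for p in paths) / len(paths))   # ~ s0 * exp(mu) ~ 105
```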
Article
With the increasing presence, scale, and complexity of distributed systems, resource failures are becoming an important and practical topic of computer science research. While numerous failure models and failure-aware algorithms exist, their comparison has been hampered by the lack of public failure data sets and data processing tools. To facilitate the design, validation, and comparison of fault-tolerant models and algorithms, we have created the Failure Trace Archive (FTA)—an online, public repository of failure traces collected from diverse parallel and distributed systems. In this work, we first describe the design of the archive, in particular of the standard FTA data format, and the design of a toolbox that facilitates automated analysis of trace data sets. We also discuss the use of the FTA for various current and future purposes. Second, after applying the toolbox to nine failure traces collected from distributed systems used in various application domains (e.g., HPC, Internet operation, and various online applications), we present a comparative analysis of failures in various distributed systems. Our analysis presents various statistical insights and typical statistical modeling results for the availability of individual resources in various distributed systems. The analysis results underline the need for public availability of trace data from different distributed systems. Last, we show how different interpretations of the meaning of failure data can result in different conclusions for failure modeling and job scheduling in distributed systems. Our results for different interpretations show evidence that there may be a need for further revisiting existing failure-aware algorithms, when applied for general rather than for domain-specific distributed systems.
Article
Managing storage growth is painful [1]. When a system exhausts available storage, it is not only an operational inconvenience but also a budgeting nightmare. Many system administrators already have historical data for their systems and thus can predict full capacity events in advance. EMC has developed a capacity forecasting tool for Data Domain systems which has been in production since January 2011. This tool analyses historical data from over 10,000 back-up systems daily, forecasts the future date for full capacity, and sends proactive notifications. This paper describes the architecture of the tool, the predictive model it employs, and the results of the implementation.
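The essence of such a forecast is fitting a trend to historical usage and extrapolating to the capacity line; a straight-line fit stands in here for whatever predictive model the production tool employs, and the telemetry is synthetic.

```python
# Full-capacity forecast: fit a linear trend to usage, extrapolate to
# the date usage crosses the capacity line (synthetic data).
import numpy as np

days = np.arange(30)                                        # day index
used_tb = 40 + 0.8 * days + np.random.normal(0, 0.5, 30)    # toy telemetry
capacity_tb = 100.0

slope, intercept = np.polyfit(days, used_tb, 1)
days_to_full = (capacity_tb - intercept) / slope
print("full in ~%.0f days" % days_to_full)                  # ~75 days
```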
Conference Paper
The management of power consumption in datacenters has become an important problem. This needs a systematic evaluation of the as-is scenario to identify potential areas for improvement and quantify the impact of any strategy. We present a measurement study of a production datacenter from a joint perspective of power and performance at the individual server level. Our observations help correlate power consumption of production servers with their activity, and identify easily implementable improvements. We find that production servers are underutilized from an activity perspective; are overrated from a power perspective; execute temporally similar workloads over a granularity of weeks; do not idle efficiently; and have power consumptions that are well tracked by their CPU utilizations. Our measurements suggest the following steps for improvement: staggering periodic activities on servers; enabling deeper sleep states; and provisioning based on measurement.
Article
Software process improvement (SPI) aims to understand the software process as it is used within an organisation and thus drive the implementation of changes to that process to achieve specific goals such as increasing development speed, achieving higher product quality or reducing costs. Accordingly, SPI researchers must be equipped with the methodologies and tools to enable them to look within organisations and understand the state of practice with respect to software process and process improvement initiatives, in addition to investigating the relevant literature. Having examined a number of potentially suitable research methodologies, we have chosen Grounded Theory as a suitable approach to determine what was happening in actual practice in relation to software process and SPI, using the indigenous Irish software product industry as a test-bed. The outcome of this study is a theory, grounded in the field data, that explains when and why SPI is undertaken by the software industry. The objective of this paper is to describe both the selection and usage of grounded theory in this study and evaluate its effectiveness as a research methodology for software process researchers. Accordingly, this paper will focus on the selection and usage of grounded theory, rather than results of the SPI study itself.
Article
While large grids are currently supporting the work of thousands of scientists, very little is known about their actual use. Due to strict organizational permissions, there are few or no traces of grid workloads available to the grid researcher and practitioner. To address this problem, in this work we present the Grid Workloads Archive (GWA), which is at the same time a workload data exchange and a meeting point for the grid community. We define the requirements for building a workload archive, and describe the approach taken to meet these requirements with the GWA. We introduce a format for sharing grid workload information, and tools associated with this format. Using these tools, we collect and analyze data from nine well-known grid environments, with a total content of more than 2000 users submitting more than 7 million jobs over a period of over 13 operational years, and with working environments spanning over 130 sites comprising 10 000 resources. We show evidence that grid workloads are very different from those encountered in other large-scale environments, and in particular from the workloads of parallel production environments: they comprise almost exclusively single-node jobs, and jobs arrive in “bags-of-tasks”. Finally, we present the immediate applications of the GWA and of its content in several critical grid research and practical areas: research in grid resource management, and grid design, operation, and maintenance.
Conference Paper
Traditionally, capacity planning problems are modeled with deterministic workloads, by considering the peak workload for resource allocation. In the context of businesses using cloud services, the cloud provider could allocate resources for the peak workload, which could lead to under-utilization of resources and to charging users for unused yet provisioned resources. Hence, we propose a capacity planning algorithm that ensures we plan for peak usage but do not provision for it. In our approach, we model the problem as a stochastic optimization problem with the objective of minimizing the number of servers, considering two important constraints: (a) the stochastic nature of workloads, and (b) minimizing application SLA violations. We implement the model using a genetic algorithm and, to address the stochastic nature of workloads, we reserve a free pool of resources in each server, sized by our algorithm. We evaluate the solution with real server utilization data from a datacenter seeking consolidation, and compare the number of servers suggested by our solution against peak-workload-based solutions for various service levels. Our results illustrate that reserving a certain amount of resources in servers to absorb workload variability requires fewer servers than packing resources based on peak workloads, for the same service levels.
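The free-pool idea can be illustrated with a first-fit-decreasing packer that reserves headroom on every server; the 15% reservation and the demand list are assumptions, and a genetic algorithm, as the paper uses, would search over such packings rather than build one greedily.

```python
# First-fit-decreasing packing with a reserved free pool per server.
def pack(workloads, server_cap=100.0, reserve_frac=0.15):
    usable = server_cap * (1.0 - reserve_frac)   # keep a free pool
    servers = []                                  # current load per server
    for w in sorted(workloads, reverse=True):
        for i, load in enumerate(servers):
            if load + w <= usable:
                servers[i] += w                   # fits on an open server
                break
        else:
            servers.append(w)                     # open a new server
    return len(servers)

demands = [60, 55, 50, 45, 30, 25]
print(pack(demands))                  # 4 servers with 15% headroom
print(pack(demands, reserve_frac=0))  # 3 servers packed to the brim
```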
Article
Cloud computing is a recent advancement wherein IT infrastructure and applications are provided as ‘services’ to end-users under a usage-based payment model. It can leverage virtualized services even on the fly based on requirements (workload patterns and QoS) varying with time. The application services hosted under Cloud computing model have complex provisioning, composition, configuration, and deployment requirements. Evaluating the performance of Cloud provisioning policies, application workload models, and resources performance models in a repeatable manner under varying system and user configurations and requirements is difficult to achieve. To overcome this challenge, we propose CloudSim: an extensible simulation toolkit that enables modeling and simulation of Cloud computing systems and application provisioning environments. The CloudSim toolkit supports both system and behavior modeling of Cloud system components such as data centers, virtual machines (VMs) and resource provisioning policies. It implements generic application provisioning techniques that can be extended with ease and limited effort. Currently, it supports modeling and simulation of Cloud computing environments consisting of both single and inter-networked clouds (federation of clouds). Moreover, it exposes custom interfaces for implementing policies and provisioning techniques for allocation of VMs under inter-networked Cloud computing scenarios. Several researchers from organizations, such as HP Labs in U.S.A., are using CloudSim in their investigation on Cloud resource provisioning and energy-efficient management of data center resources. The usefulness of CloudSim is demonstrated by a case study involving dynamic provisioning of application services in the hybrid federated clouds environment. The result of this case study proves that the federated Cloud computing model significantly improves the application QoS requirements under fluctuating resource and service demand patterns.
Capelin: Data-driven capacity procurement for cloud datacenters using portfolios of scenarios – Extended technical report
  • G Andreadis
  • F Mastenbroek
  • A Iosup
The design and operation of CloudLab
  • D Duplyakin
Learning from failure across multiple clusters
  • N El-Sayed
  • H Zhu
  • B Schroeder