Article

Abstract

Next-generation cloud data centers are based on software-defined data center infrastructures that promote flexibility, automation, optimization, and scalability. The Redfish standard and the Intel Rack Scale Design technology enable software-defined infrastructure and disaggregate bare-metal compute, storage, and networking resources into virtual pools, allowing resources to be dynamically composed into virtual performance-optimized data centers (vPODs) tailored to workload-specific demands. This article proposes four chassis design configurations, based on the Distributed Management Task Force's Redfish industry standard, for composing vPOD systems: a fully shared design, a partially shared homogeneous design, a partially shared heterogeneous design, and a not shared design; they differ mainly in the level of hardware disaggregation employed. Furthermore, we propose models that combine reliability block diagram and stochastic Petri net modeling approaches to capture the complex relationship between the pool of disaggregated hardware resources and their power and cooling sources in a vPOD. The four proposed design configurations were analyzed and compared in terms of availability and component sensitivity indices by scaling them across different data center infrastructure sizes. From the obtained results, we can state that, in general, increasing hardware disaggregation improves availability. However, beyond a given point, the availability of the fully shared, partially shared homogeneous, and partially shared heterogeneous configurations remains almost equal, while the not shared configuration is still able to improve its availability.
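As a rough illustration of the reliability-block-diagram layer of such models (the article also uses stochastic Petri nets for the power and cooling dependencies, which this sketch does not capture), the following Python snippet computes the steady-state availability of a small series-parallel composition of compute, storage, and network pools. All MTTF/MTTR figures and the composition structure are assumptions for illustration only, not the paper's vPOD configurations.

```python
# Minimal sketch (not the paper's actual model): steady-state availability of a
# series-parallel reliability block diagram, with per-block availability derived
# from assumed MTTF/MTTR values.

def block_availability(mttf_h, mttr_h):
    """Steady-state availability of a single block: MTTF / (MTTF + MTTR)."""
    return mttf_h / (mttf_h + mttr_h)

def series(avails):
    """All blocks required: availabilities multiply."""
    a = 1.0
    for x in avails:
        a *= x
    return a

def parallel(avails):
    """Any block suffices (active redundancy): 1 - product of unavailabilities."""
    u = 1.0
    for x in avails:
        u *= (1.0 - x)
    return 1.0 - u

# Assumed (illustrative) parameters for compute, storage, and network pools.
compute = block_availability(mttf_h=8760.0, mttr_h=8.0)
storage = block_availability(mttf_h=17520.0, mttr_h=12.0)
network = block_availability(mttf_h=26280.0, mttr_h=4.0)

# One possible composition: two independent replicas of each pool in parallel,
# and the three pools in series.
vpod = series([parallel([compute, compute]),
               parallel([storage, storage]),
               parallel([network, network])])
print(f"illustrative vPOD availability: {vpod:.6f}")
```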

... 1. Monitoring, which monitors the EA resources and notifies other HAC components (e.g., resource management) about any problems. supporting utilities such as UPS, power distribution unit (PDU) [48], cloud operating systems [49], and backup infrastructure redundancy by multiple sites, and redundant data centre equipment, such as UPS and fault tolerance for components, HA for the individual data centre components [29,50] no ...
Preprint
Full-text available
The delivery of key services in domains ranging from finance and manufacturing to healthcare and transportation is underpinned by a rapidly growing number of mission-critical enterprise applications. Ensuring the continuity of these complex applications requires the use of software-managed infrastructures called high-availability clusters (HACs). HACs employ sophisticated techniques to monitor the health of key enterprise application layers and of the resources they use, and to seamlessly restart or relocate application components after failures. In this paper, we first describe the manifold uses of HACs to protect essential layers of a critical application and present the architecture of high availability clusters. We then propose a taxonomy that covers all key aspects of HACs -- deployment patterns, application areas, types of cluster, topology, cluster management, failure detection and recovery, consistency and integrity, and data synchronisation; and we use this taxonomy to provide a comprehensive survey of the end-to-end software solutions available for the HAC deployment of enterprise applications. Finally, we discuss the limitations and challenges of existing HAC solutions, and we identify opportunities for future research in the area.
... In this section, a dynamic method for modeling cloud infrastructure [11,43,44,45] is presented. The presented method applies an algorithm that uses the hierarchical modeling approach proposed in Section 4.1. ...
Article
Full-text available
In cloud computing services, high availability is one of the quality of service requirements which is necessary to maintain customer confidence. High availability systems can be built by applying redundant nodes and multiple clusters in order to cope with software and hardware failures. Due to cloud computing complexity, dependability analysis of the cloud may require combining state‐based and nonstate‐based modeling techniques. This article proposes a hierarchical model combining reliability block diagrams and continuous time Markov chains to evaluate the availability of OpenStack private clouds, by considering different scenarios. The steady‐state availability, downtime, and cost are used as measures to compare different scenarios studied in the article. The heterogeneous workloads are considered in the proposed models by varying the number of CPUs requested by each customer. Both hardware and software failure rates of OpenStack components used in the model are collected via setting up a real OpenStack environment applying redundancy techniques. Results obtained from the proposed models emphasize the positive impact of redundancy on availability and downtime. Considering the tradeoff between availability and cost, system providers can choose an appropriate scenario for a specific purpose.
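A minimal sketch of the hierarchical idea described in this abstract: solve a small continuous-time Markov chain for one subsystem and feed its steady-state availability into a top-level series RBD. The three-state chain, all rates, and the companion availabilities are assumptions, not the OpenStack model from the paper.

```python
# Hedged sketch: CTMC submodel + RBD top level. All rates are assumed (per hour).
import numpy as np

# 3-state CTMC for a controller with a standby: 0 = both up, 1 = one up, 2 = down.
lam, mu = 1 / 2000.0, 1 / 8.0
Q = np.array([[-2 * lam, 2 * lam, 0.0],
              [mu, -(mu + lam), lam],
              [0.0, mu, -mu]])

# Steady-state vector pi solves pi @ Q = 0 with sum(pi) = 1.
A = np.vstack([Q.T, np.ones(3)])
b = np.array([0.0, 0.0, 0.0, 1.0])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
controller_avail = pi[0] + pi[1]        # states 0 and 1 still provide service

# Top-level RBD: controller in series with compute and storage (assumed numbers).
system_avail = controller_avail * 0.9995 * 0.9999
downtime_min_per_year = (1 - system_avail) * 365 * 24 * 60
print(f"controller availability: {controller_avail:.6f}")
print(f"system availability: {system_avail:.6f}, downtime: {downtime_min_per_year:.1f} min/yr")
```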
Article
By overcoming the server box barrier, resource disaggregation in data centers significantly improves resource utilization, and it also provides a more cost-efficient approach for resource upgrade and expansion. The advantages of disaggregation have been explored in earlier research to improve resource efficiency. This paper investigates the potential benefits of disaggregation from the aspect of reliability, which has not been considered before. Resource disaggregation brings a new failure pattern. For example, in a conventional server, the failure of one type of resource leads to the failure of the entire server, so that other types of resources in the same server also become unavailable. After disaggregating, the failures of different types of resources become more isolated, so that the other resources remain available. In this paper, we model the reliability of a resource allocation request in a server-based or disaggregated DC based on whether the request is allocated with only working resources, or also provisioned with backup. We then consider a resource allocation problem with the objective of maximizing the number of requests accepted with guaranteed reliability. We provide an integer linear programming formulation and a heuristic approach for this. Numerical studies demonstrate that resource disaggregation can improve service reliability.
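The following back-of-the-envelope comparison illustrates the failure-isolation argument (it is not the paper's ILP formulation): a request needing one CPU, one memory, and one storage unit either lives entirely in one server, or draws from independent disaggregated pools with one spare per pool. All availability figures are assumed.

```python
# Illustrative comparison with assumed numbers, not the paper's model.
a_cpu, a_mem, a_sto = 0.999, 0.9995, 0.998   # assumed per-resource availabilities
a_fabric = 0.9999                             # shared enclosure/fabric availability

# Server-based: the server is a series system of its internal resources.
server_based = a_cpu * a_mem * a_sto

def one_of_two(a):
    """Resource pool with one spare unit: 1-out-of-2 redundancy."""
    return 1 - (1 - a) ** 2

# Disaggregated with one spare per pool, all pools behind a shared fabric.
disaggregated = a_fabric * one_of_two(a_cpu) * one_of_two(a_mem) * one_of_two(a_sto)

print(f"server-based request reliability : {server_based:.6f}")
print(f"disaggregated with backup        : {disaggregated:.6f}")
```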
Article
The delivery of key services in domains ranging from finance and manufacturing to healthcare and transportation is underpinned by a rapidly growing number of mission-critical enterprise applications. Ensuring the continuity of these complex applications requires the use of software-managed infrastructures called high-availability clusters (HACs). HACs employ sophisticated techniques to monitor the health of key enterprise application layers and of the resources they use, and to seamlessly restart or relocate application components after failures. In this paper, we first describe the manifold uses of HACs to protect essential layers of a critical application and present the architecture of high availability clusters. We then propose a taxonomy that covers all key aspects of HACs—deployment patterns, application areas, types of cluster, topology, cluster management, failure detection and recovery, consistency and integrity, and data synchronisation; and we use this taxonomy to provide a comprehensive survey of the end-to-end software solutions available for the HAC deployment of enterprise applications. Finally, we discuss the limitations and challenges of existing HAC solutions, and we identify opportunities for future research in the area.
Article
Full-text available
Cloud data center providers benefit from software-defined infrastructure because it promotes flexibility, automation, and scalability. The new paradigm of software-defined infrastructure helps address the management challenges of large-scale infrastructures and guarantee service level agreements with established availability levels. Assessing the availability of a data center remains a complex task, as it requires gathering information about a complex infrastructure and generating accurate models to estimate its availability. This paper covers this gap by proposing a methodology to automatically acquire the data center hardware configuration and assess, through models, its availability. The proposed methodology leverages the emerging standardized Redfish API and relevant modeling frameworks. Through such an approach, we analyzed the availability benefits of migrating from a conventional data center infrastructure (named Performance Optimized Data center (POD), with redundant servers) to a next-generation virtual Performance Optimized Data center (named virtual POD (vPOD), composed of a pool of disaggregated hardware resources). Results show that vPOD improves availability compared to conventional data center configurations.
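A hedged sketch of the kind of inventory acquisition such a methodology builds on: walking the standard Redfish Systems collection and reading a few fields defined by the Redfish schema. The host, credentials, and the choice of fields are placeholders; a real collector would add session handling, TLS certificate verification, and error handling.

```python
# Sketch of harvesting hardware inventory over the Redfish API.
# BASE and AUTH are placeholders, not values from the paper.
import requests

BASE = "https://bmc.example.com"          # hypothetical management endpoint
AUTH = ("admin", "password")              # placeholder credentials

def get(path):
    r = requests.get(BASE + path, auth=AUTH, verify=False, timeout=10)
    r.raise_for_status()
    return r.json()

# Walk the standard Systems collection defined by the Redfish schema.
systems = get("/redfish/v1/Systems")
for member in systems.get("Members", []):
    system = get(member["@odata.id"])
    print(system.get("Id"),
          system.get("Model"),
          system.get("Status", {}).get("Health"))
```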
Article
Full-text available
Purpose: In this paper, a comprehensive fault tree analysis (FTA) on the critical components of industrial robots is conducted. This analysis is integrated with the reliability block diagram (RBD) approach in order to investigate the robot system reliability.
Design: For practical implementation, a particular autonomous guided vehicle (AGV) system is first modeled. Then, FTA is adopted to model the causes of failures, enabling the probability of success to be determined. In addition, RBD is employed to simplify the complex system of the AGV for reliability evaluation purposes.
Findings: Finally, a hazard decision tree (HDT) is configured to compute the hazard of each component and of the whole AGV robot system. Through this research, a promising technical approach is established, allowing decision makers to identify the critical components of AGVs along with their crucial hazard phases at the design stage.
Originality: As complex systems have become global and essential in today's society, their reliable design and the determination of their availability have turned into a very important task for managers and engineers. Industrial robots are examples of these complex systems that are being increasingly used for intelligent transportation, production, and distribution of materials in warehouses and automated production lines.
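To make the gate arithmetic behind such a fault tree concrete, here is a tiny evaluation with assumed basic-event probabilities and an assumed tree structure (this is not the AGV model from the paper).

```python
# Illustrative fault tree: the top event fires if the drive subsystem fails OR
# both redundant controllers fail (an AND gate under an OR gate).
p_motor, p_encoder = 0.02, 0.01          # assumed basic-event probabilities
p_ctrl_a, p_ctrl_b = 0.05, 0.05

def gate_or(*p):
    q = 1.0
    for x in p:
        q *= (1.0 - x)
    return 1.0 - q

def gate_and(*p):
    q = 1.0
    for x in p:
        q *= x
    return q

p_drive = gate_or(p_motor, p_encoder)    # either drive part failing is enough
p_control = gate_and(p_ctrl_a, p_ctrl_b) # both controllers must fail
p_top = gate_or(p_drive, p_control)      # top event: mission failure
print(f"P(top event) = {p_top:.4f}")
```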
Article
Full-text available
In modern corporate environments, having a disaster recovery solution is no longer a luxury but a business necessity. The adoption of a disaster recovery solution, however, can be costly and often only affordable by large enterprises. Disaster-Recovery-as-a-Service (DRaaS) is a cloud-based solution that small and medium-sized businesses have been adopting to guarantee availability even in catastrophic situations. It offers lower acquisition and operational costs than traditional solutions. This work presents availability models for evaluating a DRaaS solution taking into account crucial disaster recovery metrics, such as downtime, costs, recovery time objective, and transaction loss. Furthermore, we performed sensitivity analysis on the DRaaS model to determine the parameters that cause the greatest impact on the system availability. Based on these analyses, disaster recovery coordinators can determine the optimum point to recover the damaged system by balancing the cost of system downtime against the cost of resources required for restoring the system. Our numerical results show the effectiveness of a DRaaS solution in terms of downtime, cost, and transaction loss.
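A simplified cost/downtime tradeoff in the spirit of this analysis, with entirely assumed fees, recovery times, disaster rates, and business-impact figures:

```python
# Pick the recovery option whose fee plus expected downtime cost is lowest.
# All figures below are assumptions for illustration only.
options = {
    # name: (monthly_fee_usd, recovery_time_h, expected_disasters_per_year)
    "backup-restore": (200.0, 24.0, 0.2),
    "warm-standby":   (800.0, 2.0, 0.2),
    "hot-standby":    (2500.0, 0.1, 0.2),
}
downtime_cost_per_hour = 1000.0   # assumed business impact

for name, (fee, rto_h, disasters_per_year) in options.items():
    annual = fee * 12 + downtime_cost_per_hour * rto_h * disasters_per_year
    print(f"{name:15s} expected annual cost: ${annual:,.0f}")
```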
Article
Full-text available
Cloud computing infrastructures are designed to be accessible anywhere and anytime. This requires various fault tolerance mechanisms for coping with software and hardware failures. Hierarchical modeling approaches are often used to evaluate the availability of such systems, leveraging the representation of complex failure and repair events in distinct parts of the system. This paper presents an availability evaluation for redundant private clouds, represented by RBDs and Markov chains, hierarchically assembled. These private clouds follow the basic architecture of Eucalyptus-based environments, but employ warm-standby redundant hosts for some of their main components. Closed-form equations for the steady-state availability are presented, allowing direct analytical solution for large systems. The availability equations are symbolically differentiated, allowing parametric sensitivity analysis. The results from the sensitivity analysis enable system planning for improving the steady-state availability. The sensitivity indices show that failure of the Eucalyptus Cloud Manager subsystem and the respective repair activities deserve priority for maximizing the system availability.
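The symbolic-sensitivity step can be sketched as follows with SymPy; the availability expression and the rates below are assumptions standing in for the paper's closed-form Eucalyptus equations.

```python
# Differentiate a steady-state availability expression with respect to each
# rate to rank parameters. Structure and rates are assumed, not the paper's.
import sympy as sp

lam_h, mu_h, lam_c, mu_c = sp.symbols("lambda_h mu_h lambda_c mu_c", positive=True)

# Host modeled with a standby (1-out-of-2), cloud manager as a single unit.
A_host_single = mu_h / (lam_h + mu_h)
A_host = 1 - (1 - A_host_single) ** 2
A_cm = mu_c / (lam_c + mu_c)
A_sys = sp.simplify(A_host * A_cm)

# Scaled sensitivity S_x = (x / A) * dA/dx, evaluated at assumed rates (per hour).
point = {lam_h: 1 / 1000, mu_h: 1 / 4, lam_c: 1 / 2000, mu_c: 1 / 2}
for x in (lam_h, mu_h, lam_c, mu_c):
    S = sp.diff(A_sys, x) * x / A_sys
    print(x, float(S.subs(point)))
```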
Conference Paper
Full-text available
Critical properties of software systems, such as reliability, should be considered early in the development, when they can govern crucial architectural design decisions. A number of design-time reliability-analysis methods have been developed to support this task. However, the methods are often based on very low-level formalisms, and the connection to different architectural aspects (e.g., the system usage profile) is either hidden in the constructs of a formal model (e.g., transition probabilities of a Markov chain), or even neglected (e.g., resource availability). This strongly limits the applicability of the methods to effectively support architectural design. Our approach, based on the Palladio Component Model (PCM), integrates the reliability-relevant architectural aspects in a highly parameterized UML-like model, which allows for transparent evaluation of architectural design options. It covers the propagation of the system usage profile throughout the architecture, and the impact of the execution environment, which are neglected in most of the existing approaches. Before analysis, the model is automatically transformed into a formal Markov model so that effective analytical techniques can be employed. The approach has been validated against a reliability simulation of a distributed Business Reporting System.
Book
Do you need to know what technique to use to evaluate the reliability of an engineered system? This self-contained guide provides comprehensive coverage of all the analytical and modeling techniques currently in use, from classical non-state and state space approaches, to newer and more advanced methods such as binary decision diagrams, dynamic fault trees, Bayesian belief networks, stochastic Petri nets, non-homogeneous Markov chains, semi-Markov processes, and phase type expansions. Readers will quickly understand the relative pros and cons of each technique, as well as how to combine different models together to address complex, real-world modeling scenarios. Numerous examples, case studies and problems provided throughout help readers put knowledge into practice, and a solutions manual and Powerpoint slides for instructors accompany the book online. This is the ideal self-study guide for students, researchers and practitioners in engineering and computer science.
Article
Traditional data center infrastructure suffers from a lack of standard and ubiquitous management solutions. Despite the achieved contributions, existing tools lack interoperability and are hardware dependent. Vendors are already actively participating in the specification and design of new standard software and hardware interfaces within different forums. Nevertheless, the complexity and variety of data center infrastructure components that includes servers, cooling, networking, and power hardware, coupled with the introduction of the software defined data center paradigm, led to the parallel development of a myriad of standardization efforts. In an attempt to shed light on recent works, we survey and discuss the main standardization efforts for traditional data center infrastructure management.
Article
To assess the availability of different data center configurations, understand the main root causes of data center failures, and represent low-level details such as subsystem behavior and interconnections, we have proposed, in previous works, a set of stochastic models to represent different data center architectures (considering three subsystems: power, cooling, and IT) based on the TIA-942 standard. In this paper, we propose Data Center Availability (DCAV), a web-based software system that allows data center operators to evaluate the availability of their data center infrastructure through a friendly interface, without needing to understand the technical details of the stochastic models. DCAV offers an easy step-by-step interface to create and configure a data center model. The main goal of the DCAV system is to abstract low-level details and modeling complexities, making data center availability analysis a simpler and less time-consuming task.
Article
Large data centers are complex systems that depend on several generations of hardware and software components, ranging from legacy mainframes and rack-based appliances to modular blade servers and modern rack scale design solutions. To cope with this heterogeneity, the data center manager must coordinate a multitude of tools, protocols, and standards. Currently, data center managers, standardization bodies, and hardware/software manufacturers are joining efforts to develop and promote Redfish as the main hardware management standard for data centers, and even beyond the data center. The authors hope that this article can be used as a starting point to understand how Redfish and its extensions are being targeted as the main management standard for next-generation data centers. This article describes Redfish and the recent collaborations to leverage this standard.
Article
Emergency call services are expected to be highly available in order to minimize the loss of urgent calls and, as a consequence, minimize loss of life due to lack of timely medical response. This service availability depends heavily on the cloud data center on which it is hosted. However, availability information alone cannot provide sufficient understanding of how failures impact the service and users' perception. In this paper, we evaluate the impact of failures on an emergency call system, considering service-level metrics such as the number of affected calls per failure and the time an emergency service takes to recover from a failure. We analyze a real data set from an emergency call center for a large Brazilian city. Using stochastic models that represent a cloud data center, we evaluate different data center architectures to observe the impact of failures on the emergency call service. Results show that changing the data center's architecture to improve availability from two to three nines does not decrease the average number of affected calls per failure. On the other hand, it can decrease the probability of affecting a considerable number of calls at the same time.
Article
Because of the dependence on Internet-based services, many efforts have been conceived to mitigate the impact of disasters on service provision. In this context, cloud computing has become an interesting alternative for implementing disaster tolerant services due to its resource on-demand and pay-as-you-go models. This paper proposes a sensitivity analysis approach to assess the parameters that most impact the availability of cloud data centers, taking into account disaster occurrence, hardware and software failures, and disaster recovery mechanisms for cloud systems. The analysis adopts continuous-time Markov chains, and the results indicate that disaster issues should not be neglected. Hardware failure rate and time for migration of virtual machines (VMs) are the critical factors pointed out for the system modeled in our analysis. Moreover, the location where data centers are placed has a significant impact on system availability, due to the time for migrating VMs from a backup server.
Article
Different challenges face the adoption of cloud-based applications, including high availability (HA), energy, and other performance demands. Therefore, an integrated solution that addresses these issues is critical for cloud services. Cloud providers promise the HA of their infrastructure, while cloud tenants are encouraged to deploy their applications across multiple availability zones with different reliability levels. Moreover, the environmental and cost impacts of running the applications in the cloud are an integral part of corporate responsibility, which both the cloud providers and tenants intend to reduce. Hence, a formal and analytical stochastic model is needed for both the tenants and providers to quantify the expected availability offered by an application deployment. If multiple deployment options can satisfy the HA requirement, the question remains: how can we choose the deployment that also satisfies the other provider and tenant requirements? For instance, choosing data centers with low carbon emissions can both reduce the environmental footprint and potentially earn carbon tax credits that lessen the operational cost. Therefore, this paper proposes a cloud scoring system and integrates it with a Stochastic Petri Net model. While the Petri Net model evaluates the availability of cloud application deployments, the scoring system selects the optimal HA-aware deployment in terms of energy, operational expenditure (OPEX), and other norms. We illustrate our approach with a use case that shows how the various deployment options in the cloud can satisfy both cloud tenant and provider needs.
Article
Cost-aware exploration of enhanced fault tolerance is an important service-quality issue for cloud platforms. To approach this goal with a greener design, a novel server backup strategy is adopted with two types of standby servers: warm-standby and cold-standby configurations. Under this two-level standby scheme, the cost is explored in terms of the deployment ratio between warm and cold standbys. Cold standbys provide a greener power solution than conventional warm standbys. An optimal cost policy is proposed to maintain a regulated quality of service for cloud customers. For the qualitative study, a Petri net is developed and designed to visualize the whole system's operational flow. For quantitative decision support, finite-source queueing theory is applied and a comprehensive mathematical analysis of the cost pattern is carried out in detail. Simulations have been conducted to validate the proposed cost optimization model as well. Regarding the green contribution, the power savings from switching warm standbys into cold standbys are estimated, which corresponds to a reduction in CO2 emissions. Hence, the proposed approach provides a feasible standby architecture that meets cloud economic requirements with a greener deployment.
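A back-of-the-envelope view of the warm/cold deployment-ratio question, with assumed idle power draws and an assumed grid emission factor (this is not the paper's queueing-based cost model):

```python
# Estimate annual energy and CO2 for different warm/cold standby mixes.
# Power figures and the emission factor are assumptions for illustration.
n_standby = 20
warm_power_w, cold_power_w = 150.0, 5.0       # assumed per-server idle draw
emission_kg_per_kwh = 0.4                     # assumed grid emission factor

for warm_ratio in (1.0, 0.5, 0.25, 0.0):
    n_warm = int(n_standby * warm_ratio)
    n_cold = n_standby - n_warm
    kwh_year = (n_warm * warm_power_w + n_cold * cold_power_w) * 24 * 365 / 1000
    print(f"warm ratio {warm_ratio:4.2f}: {kwh_year:8.0f} kWh/yr, "
          f"{kwh_year * emission_kg_per_kwh:8.0f} kg CO2/yr")
```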
Article
The rapid growth of cloud computing, both in terms of the spectrum and the volume of cloud workloads, necessitates revisiting the traditional datacenter design based on rack-mountable servers. Next-generation datacenters need to offer enhanced support for: (i) fast-changing system configuration requirements due to workload constraints, (ii) timely adoption of emerging hardware technologies, and (iii) maximal sharing of systems and subsystems in order to lower costs. Disaggregated datacenters, constructed as a collection of individual resources such as CPU, memory, and disks, and composed into workload execution units on demand, are an interesting new trend that can address the above challenges. In this paper, we demonstrate the feasibility of composable systems by building a rack-scale composable system prototype using a PCIe switch. Through empirical approaches, we assess the opportunities and challenges of leveraging the composable architecture for rack-scale cloud datacenters, with a focus on big data and NoSQL workloads. In particular, we compare and contrast the programming models that can be used to access the composable resources, and we derive the implications for network and resource provisioning and management in the rack-scale architecture.
Chapter
Model checking is an automatic verification technique for hardware and software systems that are finite state or have finite state abstractions. It has been used successfully to verify computer hardware, and it is beginning to be used to verify computer software as well. As the number of state variables in the system increases, the size of the system state space grows exponentially. This is called the “state explosion problem”. Much of the research in model checking over the past 30 years has involved developing techniques for dealing with this problem. In these lecture notes, we will explain how the basic model checking algorithms work and describe some recent approaches to the state explosion problem, with an emphasis on Bounded Model Checking.
Article
This paper studies a control problem for optimally switching on and off a cloud computing service modeled by an M/M/1 queue with holding, running, and switching costs. The main result is that an average-optimal policy either always runs the system or is an (M, N)-policy defined by two thresholds M and N, such that the system is switched on upon an arrival epoch when the system size accumulates to N and switched off upon a departure epoch when the system size decreases to M. We compare the optimal (M, N)-policy with the classical (0, N)-policy and show its non-optimality.
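A Monte Carlo sketch of the (M, N)-policy described above, with assumed arrival/service rates and assumed holding, running, and switching costs; the paper derives the optimal policy analytically, whereas this only estimates the average cost of a given pair of thresholds.

```python
# Simulate an M/M/1 queue whose server is switched ON when the queue reaches N
# (at an arrival) and OFF when it drops to M (at a departure). All parameters
# below are illustrative assumptions.
import random

def simulate(lam, mu, M, N, hold_cost, run_cost, switch_cost,
             horizon=200_000, seed=1):
    rng = random.Random(seed)
    t, n, on = 0.0, 0, False
    total = 0.0
    while t < horizon:
        rates = [lam] + ([mu] if (on and n > 0) else [])
        dt = rng.expovariate(sum(rates))
        total += dt * (n * hold_cost + (run_cost if on else 0.0))
        t += dt
        if rng.random() < lam / sum(rates):       # arrival
            n += 1
            if not on and n >= N:
                on = True
                total += switch_cost
        else:                                      # departure
            n -= 1
            if on and n <= M:
                on = False
                total += switch_cost
    return total / t

print("(M=0, N=5) avg cost:", simulate(0.7, 1.0, 0, 5, 1.0, 2.0, 10.0))
print("(M=2, N=5) avg cost:", simulate(0.7, 1.0, 2, 5, 1.0, 2.0, 10.0))
```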
Article
Mobile cloud computing is a new paradigm that uses cloud computing resources to overcome the limitations of mobile computing. Due to its complexity, dependability and performance studies of mobile clouds may require composite modeling techniques, using distinct models for each subsystem and combining state-based and non-state-based formalisms. This paper uses hierarchical modeling and four different sensitivity analysis techniques to determine the parameters that cause the greatest impact on the availability of a mobile cloud. The results show that distinct approaches provide similar results regarding the sensitivity ranking, with specific exceptions. A combined evaluation indicates that system availability may be improved effectively by focusing on a reduced set of factors that produce large variation on the measure of interest. The time needed to replace a fully discharged battery in the mobile device is a parameter with high impact on steady-state availability, as well as the coverage factor for the failures of some cloud servers. This paper also shows that a sensitivity analysis through partial derivatives may not capture the real level of impact for some parameters in a discrete domain, such as the number of active servers. The analysis through percentage differences, or the factorial design of experiments, fulfills such a gap.
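The percentage-difference index mentioned above can be sketched as follows for an assumed series-system availability function; the components, MTTF/MTTR values, and sweep ranges are illustrative only.

```python
# S_k = (max A - min A) / max A while parameter k sweeps over its range and the
# other parameters stay at their baseline values. All numbers are assumed.
def availability(params):
    a = 1.0
    for mttf, mttr in params.values():
        a *= mttf / (mttf + mttr)
    return a

base = {"battery": (72.0, 1.0), "wifi": (800.0, 0.5), "cloud": (4000.0, 2.0)}

for name, (mttf, mttr) in base.items():
    values = []
    for factor in (0.5, 0.75, 1.0, 1.25, 1.5):      # assumed sweep of the MTTF
        p = dict(base)
        p[name] = (mttf * factor, mttr)
        values.append(availability(p))
    s = (max(values) - min(values)) / max(values)
    print(f"{name}: percentage-difference index = {s:.5f}")
```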
Article
Mathematical models are utilized to approximate various highly complex engineering, physical, environmental, social, and economic phenomena. Model parameters exerting the most influence on model results are identified through a 'sensitivity analysis'. A comprehensive review is presented of more than a dozen sensitivity analysis methods. This review is intended for those not intimately familiar with statistics or the techniques utilized for sensitivity analysis of computer models. The most fundamental of sensitivity techniques utilizes partial differentiation whereas the simplest approach requires varying parameter values one-at-a-time. Correlation analysis is used to determine relationships between independent and dependent variables. Regression analysis provides the most comprehensive sensitivity measure and is commonly utilized to build response surfaces that approximate complex models.
Article
The successful development and marketing of commercial high-availability systems requires the ability to evaluate the availability of systems. Specifically, one should be able to demonstrate that projected customer requirements are met, to identify availability bottlenecks, to evaluate and compare different configurations, and to evaluate and compare different designs. For evaluation approaches based on analytic modeling, these systems are often sufficiently complex so that state-space methods are not effective due to the large number of states, whereas combinatorial methods are inadequate for capturing all significant dependencies. The two-level hierarchical decomposition proposed here is suitable for the availability modeling of blade server systems such as IBM BladeCenter®, a commercial, high-availability multicomponent system comprising up to 14 separate blade servers and contained within a chassis that provides shared subsystems such as power and cooling. This approach is based on an availability model that combines a high-level fault tree model with a number of lower-level Markov models. It is used to determine component-level contributions to downtime as well as steady-state availability for both standalone and clustered blade servers. Sensitivity of the results to input parameters is examined, extensions to the models are described, and availability bottlenecks and possible solutions are identified.
Solutions for Intel Rack Scale Design standards
  • MegaRAC
Techniques and Research Directions
  • P Maciel
  • K Trivedi
  • R Matias
  • D Kim