Conference Paper

Optimizing Resource Availability in Composable Data Center Infrastructures


Abstract

To meet service level agreement (SLA) requirements, the majority of enterprise IT infrastructure is typically overprovisioned, underutilized, non-compliant, and lacking in required agility, resulting in significant inefficiencies. As enterprises introduce and migrate to next-generation applications designed to be horizontally scalable, they require infrastructure that can manage the duality of legacy and next-generation application requirements. To address this, composable data center infrastructure disaggregates and refactors compute, storage, network and other infrastructure resources into shared resource pools that can be "composed" and allocated on demand. In this paper, we model the allocation of resources in a composable data center infrastructure as a bounded multidimensional knapsack and then apply multi-objective optimization algorithms, the Non-dominated Sorting Genetic Algorithm (NSGA-II) and Generalized Differential Evolution (GDE3), to allocate resources efficiently. The main goal is to maximize resource availability for the application owner, while meeting minimum requirements (in terms of CPU, memory, network, and storage) within budget constraints. We consider two different scenarios to analyze heterogeneity and variability aspects when allocating resources on composable data center infrastructure.
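To make the formulation concrete, the following sketch (not the authors' implementation) encodes a toy bounded multidimensional knapsack in Python: candidate allocations drawn from small CPU, memory, network, and storage pools are filtered by minimum requirements and a budget, and the survivors are reduced to a Pareto front over (availability, cost), the non-dominated criterion that NSGA-II and GDE3 optimize at scale. All pool sizes, prices, and availability figures are invented for illustration.

```python
import itertools

# Hypothetical resource pools: each option is (capacity, cost, availability).
# All figures are invented for illustration, not taken from the paper.
CPU_OPTS     = [(8, 200, 0.999), (16, 380, 0.9995), (32, 740, 0.9999)]   # cores
MEM_OPTS     = [(64, 150, 0.999), (128, 290, 0.9995)]                    # GB
NET_OPTS     = [(10, 80, 0.995), (25, 150, 0.999)]                       # Gbps
STORAGE_OPTS = [(1, 60, 0.999), (4, 200, 0.9995)]                        # TB

MIN_REQ = (16, 64, 10, 1)   # minimum cores, GB, Gbps, TB required by the owner
BUDGET  = 1200              # monetary units

def availability(alloc):
    # The composed system needs every selected module up (series combination).
    a = 1.0
    for _, _, avail in alloc:
        a *= avail
    return a

def cost(alloc):
    return sum(c for _, c, _ in alloc)

def feasible(alloc):
    meets_min = all(cap >= req for (cap, _, _), req in zip(alloc, MIN_REQ))
    return meets_min and cost(alloc) <= BUDGET

def dominates(a, b):
    """a dominates b: no worse in both objectives, strictly better in at least one."""
    return (availability(a) >= availability(b) and cost(a) <= cost(b)
            and (availability(a) > availability(b) or cost(a) < cost(b)))

def pareto_front(allocs):
    return [a for a in allocs if not any(dominates(b, a) for b in allocs)]

candidates = [a for a in itertools.product(CPU_OPTS, MEM_OPTS, NET_OPTS, STORAGE_OPTS)
              if feasible(a)]
for alloc in pareto_front(candidates):
    print(f"availability={availability(alloc):.6f}  cost={cost(alloc)}")
```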


... In this work, we propose the use of Stochastic Petri Nets (SPN) to model the performance of Hyperledger Fabric blockchains under different parameterizations. SPNs are known for their high degree of expressiveness, being more intuitive than conventional options, such as Markov chains, for representing concurrency, parallelism, and synchronization in systems [Pinheiro et al. 2019, Rodrigues et al. 2019, Ferreira et al. 2019, Silva et al. 2022]. The related works in Section 2 proposed models to analyze availability and deployment costs [Melo et al. 2022, Melo et al. 2021], to identify bottlenecks [Xu et al. 2021, Sukhwani et al. 2018, Yuan et al. 2020], and to study network behavior under attack [Shahriar et al. 2020]. ...
Conference Paper
Hyperledger Fabric is a platform for permissioned blockchain networks that allows secure and auditable distributed storage of and access to data for enterprise applications. There is growing interest in applications of this platform, but its use requires configuring a blockchain with different request-processing stages. The many possible configurations impact the platform's non-functional qualities, especially performance and cost. This paper proposes a Stochastic Petri Net (SPN) model to capture the performance of requests on the Hyperledger Fabric platform under varied parameterizations of the blockchain, machine capacity, and request rates. We also present a usage study of the model that serves as an example to help administrators of permissioned blockchain networks tune their configurations and find the best performance for their applications. The model made it possible, for example, to identify the block size that leads to an excessively high mean response time (ranging from 1 to 25 seconds) caused by heavy request queuing.
... This section presents the reference architecture for the blockchain system and the SPN model, with details on the execution flow and its base components. The SPN model was proposed to apply a simulation that integrates the formal description, proof of correctness, and performance evaluation of the proposed context [15,16,22–30]. Figure 1 illustrates the reference architecture representing the Hyperledger Cello. The environment used to host Hyperledger Cello consists of two nodes, the master node and the worker node, each responsible for running a series of services. ...
Article
Full-text available
Blockchain has become an important processing paradigm in recent years. The blockchain supports financial transactions and validates contracts, documents and data. However, the evolution of blockchain has made it viable for many applications. The servers' availability and reliability (dependability) are required in the data processing. The contract will only be signed if there are enough components to form the blockchain blocks. This paper analyses the dependency between project components that use blockchain. We present a model based on stochastic Petri net (SPN) for evaluating the dependency of the blockchain architecture. The Design of Experiments (DoE) method was used to analyse this model's factors, seeking to know which ones had the highest impact on the system. The sensitivity analysis showed that the MongoDB component has the greatest impact on system dependability, indicating the need to upgrade such a component. Also, for reliability, making component improvements is unnecessary if the system has fewer than 36,000 h of runtime.
... Stochastic modeling is appropriate for both deterministic and nondeterministic events. Stochastic Petri nets (SPNs) are special cases of stochastic models [22–33]. SPNs enable setting up state equations, algebraic equations, and other mathematical models governing the behavior of systems. ...
Article
Full-text available
Cloud computing has been widely adopted over the years by practitioners and companies with a variety of requirements. With a strong economic appeal, cloud computing makes possible the idea of computing as a utility, in which computing resources can be consumed and paid for with the same convenience as electricity. One of the main characteristics of the cloud as a service is elasticity supported by auto-scaling capabilities. The auto-scaling cloud mechanism allows adjusting resources to meet multiple demands dynamically. The elasticity service is best represented in critical web trading and transaction systems that must satisfy a certain service level agreement (SLA), such as maximum response time limits for different types of inbound requests. Nevertheless, existing cloud infrastructures maintained by different cloud enterprises often offer different cloud service costs for equivalent SLAs, depending on several factors. The factors might be contract types, VM types, auto-scaling configuration parameters, and incoming workload demand. Identifying a combination of parameters that results in SLA compliance directly in the system is often sophisticated, while manual analysis is prone to errors due to the huge number of possibilities. This paper proposes the modeling of auto-scaling mechanisms in a typical cloud infrastructure using a stochastic Petri net (SPN) and the employment of a well-established adaptive search metaheuristic (GRASP) to discover critical trade-offs between performance and cost in cloud services. The proposed SPN models enable cloud designers to estimate the metrics of cloud services in accordance with each required SLA, such as the best configuration, cost, system response time, and throughput. The auto-scaling SPN model was extensively validated with 95% confidence against a real test-bed scenario with 18,000 samples. A case study of cloud services was used to investigate the viability of this method and to evaluate the adaptability of the proposed auto-scaling model in practice. On the other hand, the proposed optimization algorithm enables the identification of economic system configuration and parameterization to satisfy the required SLA and budget constraints. The adoption of the metaheuristic GRASP approach and the modeling of auto-scaling mechanisms in this work can help search for optimized-quality solutions and support operational management of cloud services in practice.
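As a flavour of how a GRASP-style search explores such a configuration space, the sketch below repeatedly performs a greedy-randomized construction of an auto-scaling configuration (VM size, scale-out threshold, maximum replicas) followed by local search against a stand-in cost/response-time objective. The `estimate_response_time` function is a crude placeholder for the SPN evaluation described in the paper, and all prices, rates, and SLA values are assumptions.

```python
import random

# Candidate configuration values (illustrative, not from the paper).
VM_SIZES   = [1, 2, 4, 8]          # vCPUs per VM
THRESHOLDS = [50, 60, 70, 80, 90]  # CPU % that triggers a scale-out
MAX_VMS    = [2, 4, 8, 16]

SLA_RT   = 0.5     # seconds, maximum allowed mean response time (assumed)
WORKLOAD = 400.0   # requests/s of incoming demand (assumed)

def estimate_response_time(cfg):
    """Crude stand-in for the SPN evaluation of a configuration."""
    vcpus, _threshold, max_vms = cfg
    capacity = vcpus * max_vms * 60.0            # requests/s the pool can absorb
    utilization = min(WORKLOAD / capacity, 0.99)
    return 0.05 / (1.0 - utilization)            # grows sharply near saturation

def monthly_cost(cfg):
    vcpus, _, max_vms = cfg
    return vcpus * max_vms * 20.0                # invented price per vCPU

def objective(cfg):
    penalty = 1e4 * max(0.0, estimate_response_time(cfg) - SLA_RT)
    return monthly_cost(cfg) + penalty           # cost plus SLA-violation penalty

FIELDS = (VM_SIZES, THRESHOLDS, MAX_VMS)

def greedy_randomized_construction(alpha=0.5):
    """Semi-greedy construction: pick each field from a restricted candidate list."""
    cfg = [random.choice(opts) for opts in FIELDS]
    for i, opts in enumerate(FIELDS):
        scored = sorted(opts, key=lambda v: objective(tuple(cfg[:i] + [v] + cfg[i+1:])))
        rcl = scored[: max(1, int(len(scored) * alpha))]
        cfg[i] = random.choice(rcl)
    return tuple(cfg)

def local_search(cfg):
    """First-improvement neighborhood search over one field at a time."""
    best, improved = cfg, True
    while improved:
        improved = False
        for i, opts in enumerate(FIELDS):
            for v in opts:
                cand = tuple(list(best[:i]) + [v] + list(best[i+1:]))
                if objective(cand) < objective(best):
                    best, improved = cand, True
    return best

best = min((local_search(greedy_randomized_construction()) for _ in range(30)),
           key=objective)
print("best configuration (vCPUs, threshold %, max VMs):", best)
print("objective value:", round(objective(best), 2))
```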
... This 2019 paper [16], focusing on optimizing resource availability, took a holistic hardware and software approach to resource allocation while meeting service level agreements (SLA). The authors model the problem of resource allocation in a data center and apply optimization algorithms to determine solutions. ...
... Sensitivity analysis measures the effect of a given input on the output, aiming to outline the weak links of computer systems and, from there, to adopt a set of techniques that improve these systems in different scenarios [45]. Some works use sensitivity analysis to provide the necessary assurances and to inform the planning of system administrators [46,47]. In this work, we have applied a sensitivity analysis with DoE. ...
Article
Full-text available
Smart buildings in big cities are now equipped with an internet of things (IoT) infrastructure to constantly monitor different aspects of people’s daily lives via IoT devices and sensor networks. The malfunction and low quality of service (QoS) of such devices and networks can severely cause property damage and perhaps loss of life. Therefore, it is important to quantify different metrics related to the operational performance of the systems that make up such computational architecture even in advance of the building construction. Previous studies used analytical models considering different aspects to assess the performance of building monitoring systems. However, some critical points are still missing in the literature, such as (i) analyzing the capacity of computational resources adequate to the data demand, (ii) representing the number of cores per machine, and (iii) the clustering of sensors by location. This work proposes a queuing network based message exchange architecture to evaluate the performance of an intelligent building infrastructure associated with multiple processing layers: edge and fog. We consider an architecture of a building that has several floors and several rooms in each of them, where all rooms are equipped with sensors and an edge device. A comprehensive sensitivity analysis of the model was performed using the Design of Experiments (DoE) method to identify bottlenecks in the proposal. A series of case studies were conducted based on the DoE results. The DoE results allowed us to conclude, for example, that the number of cores can have more impact on the response time than the number of nodes. Simulations of scenarios defined through DoE allow observing the behavior of the following metrics: average response time, resource utilization rate, flow rate, discard rate, and the number of messages in the system. Three scenarios were explored: (i) scenario A (varying the number of cores), (ii) scenario B (varying the number of fog nodes), and (iii) scenario C (varying the nodes and cores simultaneously). Depending on the number of resources (nodes or cores), the system can become so overloaded that no new requests are supported. The queuing network based message exchange architecture and the analyses carried out can help system designers optimize their computational architectures before building construction.
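The observation that adding cores can matter more than adding nodes can be illustrated with a basic M/M/c approximation of a single processing layer. The sketch below is not the paper's queuing network; it is a hedged back-of-the-envelope comparison with invented arrival and service rates, contrasting one pooled multi-core node with several single-core nodes.

```python
import math

def erlang_c(c, offered_load):
    """Probability that an arriving message must wait in an M/M/c queue.

    offered_load = lambda / mu (in Erlangs); requires offered_load < c.
    """
    summation = sum(offered_load**k / math.factorial(k) for k in range(c))
    top = offered_load**c / (math.factorial(c) * (1 - offered_load / c))
    return top / (summation + top)

def mean_response_time(arrival_rate, service_rate, servers):
    offered_load = arrival_rate / service_rate
    wait = erlang_c(servers, offered_load) / (servers * service_rate - arrival_rate)
    return wait + 1.0 / service_rate

ARRIVALS = 180.0   # messages/s arriving from the sensors (assumed)
MU_CORE  = 50.0    # messages/s a single core can process (assumed)

# Scenario A: one fog node with 4 cores behind a shared queue (M/M/4)...
print("one 4-core node  :", round(mean_response_time(ARRIVALS, MU_CORE, 4), 4), "s")
# ...versus Scenario B: four 1-core nodes, each receiving a quarter of the
# traffic under ideal load splitting (four independent M/M/1 queues).
print("four 1-core nodes:", round(mean_response_time(ARRIVALS / 4, MU_CORE, 1), 4), "s")
```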
... SPNs [32–38] can be described as a type of directed graph divided into two parts, populated by three types of objects. These objects are places, transitions, and directed arcs that connect places to transitions and transitions to places. ...
Article
Full-text available
Applications in the Internet of Things (IoT) context continuously generate large amounts of data. The data must be processed and monitored to allow rapid decision making. However, the wireless connection that links such devices to remote servers can lead to data loss. Thus, new forms of connection must be explored to ensure the system's availability and reliability as a whole. Unmanned aerial vehicles (UAVs) are becoming increasingly empowered in terms of processing power and autonomy. UAVs can be used as a bridge between IoT devices and remote servers, such as edge or cloud computing. UAVs can collect data from mobile devices and process them, if possible. If there is no processing power in the UAV, the data are sent and processed on servers at the edge or in the cloud. Data offloading through UAVs is a reality today, but one with many challenges, mainly due to unavailability constraints. This work proposes stochastic Petri net (SPN) models and reliability block diagrams (RBDs) to evaluate a distributed architecture with UAVs, focusing on the system's availability and reliability. Among the various existing methodologies, stochastic Petri nets (SPN) provide models that represent complex systems with different characteristics. UAVs are used to route data from IoT devices to the edge or the cloud through a base station. The base station receives data from UAVs and retransmits them to the cloud. The data are processed in the cloud, and the responses are returned to the IoT devices. A sensitivity analysis through Design of Experiments (DoE) showed key points of improvement for the base model, which was enhanced. A numerical analysis indicated the components with the most significant impact on availability. For example, the cloud proved to be a very relevant component for the availability of the architecture. The final results could prove the effectiveness of improving the base model. The present work can help system architects develop distributed architectures with more optimized UAVs and low evaluation costs.
... Stochastic Petri nets (SPN) [12,13] are analytical models capable of representing concurrency, synchronization, and parallelism in complex systems. Among other metrics, SPNs are well suited to evaluating performance and availability [14–16]. SPNs have already been applied successfully in the context of MCC in previous works [17–19]. ...
Article
Full-text available
Mobile Edge Computing (MEC) has emerged as a promising network computing paradigm that brings cloud/edge computing resources close to mobile devices in local areas to diminish network latency. In that context, MEC solutions are required to dynamically allocate mobile requests as close as possible to their computing resources. Moreover, the computing power and resource capacity of MEC server machines can directly impact the performance and operational availability of mobile apps and services. Systems practitioners must understand the trade-off between performance and availability at the design stage, and analytical models are well suited to this objective. Therefore, this paper proposes Stochastic Petri Net (SPN) models to evaluate both the performance and availability of MEC environments. Unlike previous work, our proposal includes unique metrics such as discard probability and a sensitivity analysis that guides the evaluation decisions. The models are highly flexible, with fourteen transitions in the base model and twenty-five transitions in the extended model. The performance model was validated with a real experiment, the result of which indicated equality between experiment and model with a p-value of 0.684 by t-test. Regarding availability, the results of the extended model, unlike those of the base model, always remain above 99%, since it adds redundancy to the components that were limiting availability in the base model. A numerical analysis is performed in a comprehensive manner, and the output results of this study can serve as a practical guide in designing MEC computing system architectures by making it possible to evaluate the trade-off between Mean Response Time (MRT) and resource utilization.
... Petri nets are a tool for the formal modeling of quantitative properties of concurrent and synchronized systems. Petri nets with random firing delays applied to transitions are considered stochastic Petri nets (SPNs) [11,30–33]. Over the last decade, SPNs have been attracting researchers' attention in the modeling and performance analysis of discrete event systems. ...
Article
Full-text available
Nowadays, the Internet of Things (IoT) allows monitoring and automation in diverse contexts, such as hospitals, homes, or even smart cities, just to name a few examples. IoT data processing may occur at the edge of the network or in the cloud, but frequently the processing must be divided between the two layers. Aiming to guarantee that IoT systems work efficiently, it is essential to evaluate the system even in the initial design stages. However, evaluating hybrid systems composed of multiple layers is not an easy task, as a myriad of parameters is involved in the process. Thus, this paper presents two SPN models (a base and an extended one) that can represent an abstract distributed system composed of IoT, edge and cloud layers. The models are highly configurable to be used in diverse simulation scenarios. In addition, a sensitivity analysis identified the most impactful components in the studied architecture and made it possible to optimize the base SPN model. Finally, a case study explores multiple metrics of interest concurrently and serves as a guide to the model's utilization. Ultimately, the proposed approach can assist system designers in avoiding unnecessary investment in original equipment.
Article
Hardware disaggregation decouples resources (e.g., processors and memory) from monolithic servers, potentially improving service reliability. However, from another perspective, directly exposing resource modules to a shared network may adversely affect service reliability. In this paper, we study a reliable resource allocation problem in disaggregated DCs (DDCs), considering network impact and different disaggregation scales. We provide a mixed-integer linear programming formulation and a resource allocation framework named Radar for this problem. Numerical results demonstrate that the benefits of hardware disaggregation may be adversely affected by an imperfect network. It also shows that both the hardware backup and a proposed migration-based restoration can be applied to overcome this potential adverse effect.
Article
Full-text available
Vehicle ad hoc networks (VANETs) have emerged to make traffic more efficient and intelligent. Road side units (RSUs) can act as sensors and as providers of route information for vehicles. RSUs have processing, storage, and communication capabilities. However, RSUs can suffer from peak requests, non-functional data demands and unavailability. To overcome this deficiency, cloud computing can act as an additional resource, processing part of the requests, in what is named vehicular cloud computing (VCC). This paper uses stochastic Petri nets (SPNs) and reliability block diagrams (RBD) to assess a VCC architecture's availability and reliability with multiple RSUs. Two sensitivity analyses were performed, which identified the model's components with the most significant impact. In addition to a base model, extended models with greater redundancy were also proposed. The base model obtained A = 97.68%, and the extended model obtained A = 99.19%. Therefore, the models aim to help network administrators plan more optimised VANET architectures, reducing failures.
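The jump from roughly 97.7% to 99.2% availability is characteristic of adding parallel redundancy in a reliability block diagram. The short sketch below shows the standard series/parallel combination with assumed per-component availabilities, not the values from the paper's SPN/RBD models.

```python
def series(*avail):
    """All components must be up (e.g., RSU hardware, RSU software, cloud link)."""
    result = 1.0
    for a in avail:
        result *= a
    return result

def parallel(*avail):
    """At least one replica must be up: 1 minus the product of unavailabilities."""
    unavail = 1.0
    for a in avail:
        unavail *= (1.0 - a)
    return 1.0 - unavail

# Illustrative component availabilities (assumed, not from the paper).
RSU_HW, RSU_SW, CLOUD_LINK = 0.995, 0.992, 0.998

single_rsu = series(RSU_HW, RSU_SW, CLOUD_LINK)            # base architecture
redundant  = series(parallel(series(RSU_HW, RSU_SW),       # two RSUs in parallel
                             series(RSU_HW, RSU_SW)),
                    CLOUD_LINK)                            # shared cloud link

print(f"single RSU : {single_rsu:.4%}")
print(f"redundant  : {redundant:.4%}")
```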
Article
Full-text available
Surveillance monitoring systems are highly necessary, aiming to prevent many social problems in smart cities. The internet of things (IoT) nowadays offers a variety of technologies to capture and process massive and heterogeneous data. Given that (i) advanced analyses of video streams are performed on powerful recording devices; (ii) surveillance monitoring services require high availability levels, in the sense that the service must remain connected, for example, to a connection network that offers higher speed than conventional connections; and (iii) the trustworthy dependability of a surveillance system depends on various factors, it is not easy to identify which components/devices in a system architecture have the most impact on the dependability of a specific surveillance system in smart cities. In this paper, we developed stochastic Petri net models for a surveillance monitoring system with regard to varying several parameters to obtain the highest dependability. Two main metrics of interest in the dependability of a surveillance system, reliability and availability, were analyzed in a comprehensive manner. The analysis results show that varying the number of long-term evolution (LTE)-based stations contributes to an increase in the number of nines (#9s) of availability. The obtained results also show that varying the mean time to failure (MTTF) of surveillance cameras has a high impact on the reliability of the system. The findings of this work have the potential to assist system architects in planning more optimized systems in this field based on the proposed models.
Chapter
Full-text available
Much of the research on measuring the business value of cloud computing examines cloud computing from the perspective of a centralised commodity-based aggregated conceptualisation of cloud computing, largely based on the NIST reference architecture. Advances in new processor architectures and virtualisation combined with the rise of the Internet of Things are not only changing cloud computing but introducing new computing paradigms from the cloud to the edge. These new paradigms present both opportunities and challenges, not least managing complexity several orders of magnitude greater than today. Yet, academic research on measuring the business value of cloud computing is lagging practice and remains far removed from these innovations. New research is required that explores the relationship between investments in new cloud computing paradigms and business value, and the measurement thereof. This chapter explores a selection of these new paradigms, which may provide fruitful research pathways in the future.
Article
Full-text available
Cloud data center providers benefit from software-defined infrastructure as it promotes flexibility, automation, and scalability. The new paradigm of software-defined infrastructure helps in facing the current management challenges of a large-scale infrastructure and in guaranteeing service level agreements with established availability levels. Assessing the availability of a data center remains a complex task, as it requires gathering information about a complex infrastructure and generating accurate models to estimate its availability. This paper covers this gap by proposing a methodology to automatically acquire the data center hardware configuration and assess, through models, its availability. The proposed methodology leverages the emerging standardized Redfish API and relevant modeling frameworks. Through such an approach, we analyzed the availability benefits of migrating from a conventional data center infrastructure (named Performance Optimized Data center (POD), with redundant servers) to a next-generation virtual Performance Optimized Data center (named virtual POD (vPOD), composed of a pool of disaggregated hardware resources). Results show that vPOD improves availability compared to conventional data center configurations.
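To give a flavour of the acquisition step, the sketch below walks a Redfish service root and collects a minimal hardware inventory. The endpoints follow the standardized DMTF Redfish layout (`/redfish/v1/Systems`), while the BMC address, credentials, and the hand-off to availability models are placeholders.

```python
import requests

BASE = "https://bmc.example.com"     # placeholder management-controller address
AUTH = ("admin", "password")         # placeholder credentials

def get(path):
    # verify=False only because this is a lab-style sketch with a self-signed cert.
    resp = requests.get(BASE + path, auth=AUTH, verify=False, timeout=10)
    resp.raise_for_status()
    return resp.json()

# The Redfish Systems collection lists the managed computer systems.
systems = get("/redfish/v1/Systems")
inventory = []
for member in systems.get("Members", []):
    system = get(member["@odata.id"])
    inventory.append({
        "name":   system.get("Name"),
        "cpus":   system.get("ProcessorSummary", {}).get("Count"),
        "memory": system.get("MemorySummary", {}).get("TotalSystemMemoryGiB"),
        "state":  system.get("Status", {}).get("State"),
    })

# The resulting inventory would then feed the availability models (e.g., RBD/SPN).
for item in inventory:
    print(item)
```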
Conference Paper
Full-text available
This paper evaluates the optimal scale of datacentre (DC) resource disaggregation for composable DC infrastructures and investigates the impact of present day silicon photonics technologies on the energy efficiency of different composable DC infrastructures. We formulated a mixed integer linear programming (MILP) model to this end. Our results show that present day silicon photonics technologies enable better network energy efficiency for rack-scale composable DCs compared to pod-scale composable DCs despite reported similarities in CPU and memory resource power consumption.
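A drastically reduced flavour of such a MILP is sketched below, assuming the PuLP solver library is available: a handful of workloads is placed on either a rack-scale or a pod-scale resource pool, minimizing total CPU plus network power subject to capacity constraints. All demands, capacities, and power coefficients are invented and are not the paper's model.

```python
import pulp

# Invented data: workloads with (cpu cores, GB memory) demands.
WORKLOADS = {"w1": (8, 32), "w2": (16, 64), "w3": (4, 16)}
# Two composable pools with (cpu, mem) capacity and per-workload network power;
# the pod-scale pool traverses more optical hops, hence higher network power.
POOLS = {"rack": {"cpu": 32, "mem": 128, "net_w": 5.0},
         "pod":  {"cpu": 64, "mem": 256, "net_w": 9.0}}
CPU_W = 2.5   # Watts per allocated CPU core (assumed)

prob = pulp.LpProblem("disaggregation_scale", pulp.LpMinimize)
x = {(w, p): pulp.LpVariable(f"x_{w}_{p}", cat="Binary")
     for w in WORKLOADS for p in POOLS}

# Each workload is placed on exactly one pool.
for w in WORKLOADS:
    prob += pulp.lpSum(x[w, p] for p in POOLS) == 1

# Pool capacity constraints.
for p, cap in POOLS.items():
    prob += pulp.lpSum(WORKLOADS[w][0] * x[w, p] for w in WORKLOADS) <= cap["cpu"]
    prob += pulp.lpSum(WORKLOADS[w][1] * x[w, p] for w in WORKLOADS) <= cap["mem"]

# Objective: CPU power plus disaggregation-scale-dependent network power.
prob += pulp.lpSum((WORKLOADS[w][0] * CPU_W + POOLS[p]["net_w"]) * x[w, p]
                   for w in WORKLOADS for p in POOLS)

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for (w, p), var in x.items():
    if var.value() == 1:
        print(f"{w} -> {p}")
print("total power (W):", pulp.value(prob.objective))
```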
Chapter
Full-text available
New use scenarios, workloads, and increased heterogeneity combined with rapid growth in adoption are increasing the management complexity of cloud computing at all levels. High performance computing (HPC) is a particular segment of the IT market that provides significant technical challenges for cloud service providers and exemplifies many of the challenges facing cloud service providers as they conceptualise the next generation of cloud architectures. This chapter introduces cloud computing, HPC, and the challenges of supporting HPC in the cloud. It discusses how heterogeneous computing and the concepts of self-organisation, self-management, and separation of concerns can be used to inform novel cloud architecture designs and support HPC in the cloud at hyperscale. Three illustrative application scenarios for HPC in the cloud—(i) oil and gas exploration, (ii) ray tracing, and (iii) genomics—are discussed.
Article
Full-text available
Purpose: In this paper, a comprehensive fault tree analysis (FTA) of the critical components of industrial robots is conducted. This analysis is integrated with the reliability block diagram (RBD) approach in order to investigate the robot system reliability. Design: For practical implementation, a particular autonomous guided vehicle (AGV) system is first modeled. Then, FTA is adopted to model the causes of failures, enabling the probability of success to be determined. In addition, RBD is employed to simplify the complex system of the AGV for reliability evaluation purposes. Findings: Finally, a hazard decision tree (HDT) is configured to compute the hazard of each component and of the whole AGV robot system. Through this research, a promising technical approach is established, allowing decision makers to identify the critical components of AGVs along with their crucial hazard phases at the design stage. Originality: As complex systems have become global and essential in today's society, their reliable design and the determination of their availability have turned into a very important task for managers and engineers. Industrial robots are examples of these complex systems that are being increasingly used for intelligent transportation, production and distribution of materials in warehouses and automated production lines.
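A minimal sketch of the fault-tree arithmetic behind such an analysis is shown below, with made-up basic-event probabilities for AGV components rather than the paper's data: OR gates combine independent failure causes, AND gates model redundancy, and the top event gives the mission unreliability.

```python
def or_gate(*p_fail):
    """Failure of any input fails the gate (independent basic events)."""
    ok = 1.0
    for p in p_fail:
        ok *= (1.0 - p)
    return 1.0 - ok

def and_gate(*p_fail):
    """The gate fails only if every input fails (e.g., redundant channels)."""
    result = 1.0
    for p in p_fail:
        result *= p
    return result

# Illustrative basic-event failure probabilities for an AGV mission (assumed).
P_MOTOR, P_CONTROLLER, P_BATTERY = 0.010, 0.004, 0.006
P_LIDAR, P_CAMERA = 0.020, 0.030          # redundant navigation sensors

navigation_fails = and_gate(P_LIDAR, P_CAMERA)               # both sensors down
top_event = or_gate(P_MOTOR, P_CONTROLLER, P_BATTERY, navigation_fails)

print(f"P(AGV mission failure) = {top_event:.4f}")
print(f"Mission reliability    = {1 - top_event:.4f}")
```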
Article
Full-text available
With the popularity of cloud computing, it has become crucial to provide on-demand services dynamically according to the user's requirements. Reliability and energy efficiency are two key challenges in cloud computing systems (CCS) that need careful attention and investigation. The recent survey articles are either focused on the reliability techniques or energy efficiency methods in cloud computing. This paper presents a thorough review of existing techniques for reliability and energy efficiency and their trade-off in cloud computing. We also discuss the classifications on resource failures, fault tolerance mechanisms and energy management mechanisms in cloud systems. Moreover, various challenges and research gaps in trade-off between reliability and energy efficiency are identified for future research and developments.
Conference Paper
Full-text available
Cloud computing is a new paradigm that provides services through the Internet. Such paradigm has the influence of the previous available technologies (e.g., cluster, peer-to-peer and grid computing) and has been adopted to reduce costs, to provide flexibility and to make management easier. Companies like Google, Amazon, Microsoft, IBM, HP, Yahoo, Oracle, and EMC have conducted significant investments on cloud infrastructure to provide services with high availability levels. The advantages of cloud computing allowed the construction of digital libraries that represent collections of information. This system demands high reliability and studies regarding analysis of availability are important due to the relevance of conservation and dissemination of the scientific and literature information. This paper proposes an approach to model and evaluate the availability of a digital library. A case study is conducted to show the applicability of the proposed approach. The obtained results are useful for the design of this system since missing data can lead to various errors and incalculable losses.
Article
Full-text available
Multiobjective evolutionary algorithms (MOEAs) have been widely used in real-world applications. However, most MOEAs based on Pareto dominance handle many-objective problems (MaOPs) poorly due to a high proportion of incomparable and thus mutually nondominated solutions. Recently, a number of many-objective evolutionary algorithms (MaOEAs) have been proposed to deal with this scalability issue. In this article, a survey of MaOEAs is reported. According to the key ideas used, MaOEAs are categorized into seven classes: relaxed dominance based, diversity-based, aggregation-based, indicator-based, reference set based, preference-based, and dimensionality reduction approaches. Several future research directions in this field are also discussed.
Article
Full-text available
Programmable Logic Controllers (PLC) are widely used in industry. The reliability of the PLC is vital to many critical applications. This paper presents a novel approach to the symbolic analysis of PLC systems. The approach includes, (1) calculating the uncertainty characterization of the PLC system, (2) abstracting the PLC system as a Hidden Markov Model, (3) solving the Hidden Markov Model with domain knowledge, (4) combining the solved Hidden Markov Model and the uncertainty characterization to form a regular Markov model, and (5) utilizing probabilistic model checking to analyze properties of the Markov model. This framework provides automated analysis of both uncertainty calculations and performance measurements, without the need for expensive simulations. A case study of an industrial, automated PLC system demonstrates the effectiveness of our work.
Article
Full-text available
The implementation of cloud computing has made computing as a utility attractive and enables pervasive applications from the scientific, consumer and business domains. However, this implementation faces concerns over tremendous energy consumption, carbon dioxide emissions and associated costs. With energy consumption becoming a key issue for the operation and maintenance of cloud data centers, cloud computing providers are becoming profoundly concerned. In this paper, we present formulations and solutions for Green Cloud Environments (GCE) to minimize their environmental impact and energy consumption under new models that consider the static and dynamic portions of cloud components. Our proposed methodology captures cloud computing data centers and presents a generic model for them. To implement this objective, an in-depth knowledge of energy consumption patterns in the cloud environment is necessary. We investigate energy consumption patterns and show that, by applying suitable optimization policies directed through our energy consumption models, it is possible to save 20% of the energy consumption in cloud data centers. Our research results can be integrated into cloud computing systems to monitor energy consumption and support static and dynamic system-level optimization.
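A common way to capture the static and dynamic portions mentioned above is a linear utilization model, P(u) = P_idle + (P_peak - P_idle) * u. The sketch below uses that generic model with assumed server figures to show how consolidation reduces energy; it is not the paper's GCE formulation.

```python
P_IDLE, P_PEAK = 120.0, 300.0      # Watts per server (assumed figures)

def server_power(utilization):
    """Static portion plus a dynamic portion linear in CPU utilization."""
    return P_IDLE + (P_PEAK - P_IDLE) * utilization

def cluster_energy(total_load, active_servers, hours):
    """Spread a total load (in 'fully busy server' equivalents) over active servers."""
    per_server_util = min(total_load / active_servers, 1.0)
    return active_servers * server_power(per_server_util) * hours / 1000.0  # kWh

LOAD, HOURS = 6.0, 24.0            # load equivalent to 6 fully busy servers, per day

for n in (20, 12, 8):              # consolidation scenarios: fewer active servers
    print(f"{n:2d} active servers -> {cluster_energy(LOAD, n, HOURS):6.1f} kWh/day")
```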
Conference Paper
Full-text available
In this paper, we investigate the dependability modeling of computer networks with redundancy mechanism. We use Stochastic Petri Net as an enabling modeling approach for analytical evaluation of complex scenarios. We apply our proposed modeling approach in a case study to evaluate the availability of computer networks in four different architectures. Reliability Importance is used to analyze the system availability according to the most important components.
Conference Paper
Full-text available
Critical properties of software systems, such as reliability, should be considered early in the development, when they can govern crucial architectural design decisions. A number of design-time reliability-analysis methods have been developed to support this task. However, the methods are often based on very low-level formalisms, and the connection to different architectural aspects (e.g., the system usage profile) is either hidden in the constructs of a formal model (e.g., transition probabilities of a Markov chain), or even neglected (e.g., resource availability). This strongly limits the applicability of the methods to effectively support architectural design. Our approach, based on the Palladio Component Model (PCM), integrates the reliability-relevant architectural aspects in a highly parameterized UML-like model, which allows for transparent evaluation of architectural design options. It covers the propagation of the system usage profile throughout the architecture, and the impact of the execution environment, which are neglected in most of the existing approaches. Before analysis, the model is automatically transformed into a formal Markov model in order to support effective analytical techniques to be employed. The approach has been validated against a reliability simulation of a distributed Business Reporting System.
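To illustrate the kind of formal model such an architecture is transformed into, the sketch below solves a tiny absorbing discrete-time Markov chain for the probability that a request traverses three components and completes successfully. The transition probabilities are invented stand-ins for what a PCM-style analysis would derive from the usage profile.

```python
import numpy as np

# States: 0=component A, 1=component B, 2=component C, 3=SUCCESS, 4=FAILURE.
# Entries are per-visit transition probabilities (invented for illustration).
P = np.array([
    [0.00, 0.97, 0.00, 0.00, 0.03],   # A calls B, or fails
    [0.00, 0.00, 0.95, 0.02, 0.03],   # B calls C, or occasionally returns directly
    [0.00, 0.00, 0.00, 0.99, 0.01],   # C completes the request, or fails
    [0.00, 0.00, 0.00, 1.00, 0.00],   # SUCCESS (absorbing)
    [0.00, 0.00, 0.00, 0.00, 1.00],   # FAILURE (absorbing)
])

transient = [0, 1, 2]
absorbing = [3, 4]
Q = P[np.ix_(transient, transient)]     # transient-to-transient block
R = P[np.ix_(transient, absorbing)]     # transient-to-absorbing block

# Fundamental matrix N = (I - Q)^-1; absorption probabilities B = N R.
N = np.linalg.inv(np.eye(len(transient)) - Q)
B = N @ R

print("P(success | start at A):", B[0, 0])   # system reliability for this profile
print("P(failure | start at A):", B[0, 1])
```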
Book
Full-text available
Every aspect of human life is crucially determined by the result of decisions. Whereas private decisions may be based on emotions or personal taste, the complex professional environment of the 21st century requires a decision process which can be formalized and validated independently from the involved individuals. Therefore, a quantitative formulation of all factors influencing a decision and also of the result of the decision process is sought.
Article
Full-text available
Energy-proportional designs would enable large energy savings in servers, potentially doubling their efficiency in real-life use. Achieving energy proportionality will require significant improvements in the energy usage profile of every system component, particularly the memory and disk subsystems. Energy efficiency, a new focus for general-purpose computing, has been a major technology driver in the mobile and embedded areas for some time. Earlier work emphasized extending battery life, but it has since expanded to include peak power reduction because thermal constraints began to limit further CPU performance improvements. Energy management has now become a key issue for servers and data center operations, focusing on the reduction of all energy-related costs, including capital, operating expenses, and environmental impacts.
Conference Paper
The recent advancement of technology in both software and hardware enables us to revisit the concept of the composable architecture in the system design. The composable system design provides flexibility to serve a variety of workloads. The system offers a dynamic co-design platform that allows experiments and measurements in a controlled environment. This speeds up the system design and software evolution. It also decouples the lifecycles of components. The design consideration includes adopting available technology with the understanding of application characteristics. With the flexibility, we show the design has the potential to be the infrastructure of both cloud computing and HPC architecture serving a variety of workloads.
Article
Recent research trends exhibit a growing imbalance between the demands of tenants' software applications and the provisioning of hardware resources. Misalignment of demand and supply gradually hinders workloads from being efficiently mapped to fixed-sized server nodes in traditional data centers. The incurred resource holes not only lower infrastructure utilization but also cripple the capability of a data center for hosting large-sized workloads. This deficiency motivates the development of a new rack-wide architecture referred to as the composable system. The composable system transforms traditional server racks of static capacity into a dynamic compute platform. Specifically, this novel architecture aims to link up all compute components that are traditionally distributed on traditional server boards, such as central processing unit (CPU), random access memory (RAM), storage devices, and other application-specific processors. By doing so, a logically giant compute platform is created and this platform is more resistant against the variety of workload demands by breaking the resource boundaries among traditional server boards. In this paper, we introduce the concepts of this reconfigurable architecture and design a framework of the composable system for cloud data centers. We then develop mathematical models to describe the resource usage patterns on this platform and enumerate some types of workloads that commonly appear in data centers. From the simulations, we show that the composable system sustains nearly up to 1.6 times stronger workload intensity than that of traditional systems and it is insensitive to the distribution of workload demands. This demonstrates that this composable system is indeed an effective solution to support cloud data center services.
Article
With the introduction of network function virtualization technology, migrating entire enterprise data centers into the cloud has become a possibility. However, for a cloud service provider (CSP) to offer such services, several research problems still need to be addressed. In previous work, we have introduced a platform, called network function center (NFC), to study research issues related to virtualized network functions (VNFs). In an NFC, we assume VNFs to be implemented on virtual machines that can be deployed in any server in the CSP network. We have proposed a resource allocation algorithm for VNFs based on genetic algorithms (GAs). In this paper, we present a comprehensive analysis of two GA-based resource allocation algorithms for: 1) the initial placement of VNFs and 2) the scaling of VNFs to support traffic changes. We compare the performance of the proposed algorithms with a traditional integer linear programming resource allocation technique. We then combine data from previous empirical analyses to generate realistic VNF chains and traffic patterns, and evaluate the resource allocation decision-making algorithms. We assume different architectures for the data center, implement different fitness functions with GA, and compare their performance when scaling over time.
Article
Many-objective problems refer to optimization problems containing more than three conflicting objectives. Obtaining a representative set of well-distributed non-dominated solutions close to the Pareto front in the objective space remains a challenging problem. Many papers have proposed different Multi-Objective Evolutionary Algorithms to address the lack of convergence and diversity in many-objective problems. One of the more promising approaches uses a set of reference points to discriminate the solutions and guide the search process. However, this approach has been incorporated mainly in Multi-Objective Evolutionary Algorithms, and there are only a few promising adaptations of Particle Swarm Optimization approaches that effectively tackle many-objective problems regarding convergence and diversity. Thus, this paper proposes a practical and efficient Many-Objective Particle Swarm Optimization algorithm for solving many-objective problems. Our proposal uses a set of reference points dynamically determined according to the search process, allowing the algorithm to converge to the Pareto front while maintaining the diversity of the Pareto front. Our experimental results demonstrate superior or similar performance when compared to other state-of-the-art algorithms.
Article
The rapid growth of cloud computing, both in terms of the spectrum and volume of cloud workloads, necessitates revisiting the traditional datacenter design based on rack-mountable servers. Next-generation datacenters need to offer enhanced support for: (i) fast-changing system configuration requirements due to workload constraints, (ii) timely adoption of emerging hardware technologies, and (iii) maximal sharing of systems and subsystems in order to lower costs. Disaggregated datacenters, constructed as a collection of individual resources such as CPU, memory, disks etc., and composed into workload execution units on demand, are an interesting new trend that can address the above challenges. In this paper, we demonstrate the feasibility of composable systems by building a rack-scale composable system prototype using a PCIe switch. Through empirical approaches, we develop an assessment of the opportunities and challenges of leveraging the composable architecture for rack-scale cloud datacenters, with a focus on big data and NoSQL workloads. In particular, we compare and contrast the programming models that can be used to access the composable resources, and develop the implications for network and resource provisioning and management for the rack-scale architecture.
Article
Recent expansions of Internet-of-Things (IoT) applying cloud computing have been growing at a phenomenal rate. As one of the developments, heterogeneous cloud computing has enabled a variety of cloud-based infrastructure solutions, such as multimedia big data. Numerous prior researches have explored the optimizations of on-premise heterogeneous memories. However, the heterogeneous cloud memories are facing constraints due to the performance limitations and cost concerns caused by the hardware distributions and manipulative mechanisms. Assigning data tasks to distributed memories with various capacities is a combinatorial NP-hard problem. This paper focuses on this issue and proposes a novel approach, Cost-Aware Heterogeneous Cloud Memory Model (CAHCM), aiming to provision a high performance cloud-based heterogeneous memory service offerings. The main algorithm supporting CAHCM is Dynamic Data Allocation Advance (2DA) Algorithm that uses genetic programming to determine the data allocations on the cloud-based memories. In our proposed approach, we consider a set of crucial factors impacting the performance of the cloud memories, such as communication costs, data move operating costs, energy performance, and time constraints. Finally, we implement experimental evaluations to examine our proposed model. The experimental results have shown that our approach is adoptable and feasible for being a cost-aware cloud-based solution.
Article
Elasticity is a key feature in cloud computing, which distinguishes this paradigm from other ones, such as cluster and grid computing. On the other hand, dynamic resource reallocation is one of the most important and complex issues in cloud scenarios, which can be expressed as a multi-objective optimization problem with the opposing objectives of maximizing demand satisfaction and minimizing costs and resource consumption. In this paper, we propose a meta-heuristic approach for cloud resource allocation based on the bio-inspired coral-reefs optimization paradigm to model cloud elasticity in a cloud data center, and on classic Game Theory to optimize the resource reallocation schema with respect to the cloud provider's optimization objectives, as well as customer requirements, expressed through Service Level Agreements formalized using a fuzzy linguistic method.
Article
Web service composition combines available services to provide new functionality. The various available services have different quality-of-service (QoS) attributes. Building a QoS-optimal web service composition is a multi-criteria NP-hard problem. Most of the existing approaches reduce this problem to a single-criterion problem by aggregating different criteria into a unique global score (scalarization). However, scalarization has some significant drawbacks: the end user is supposed to have a complete a priori knowledge of its preferences/constraints about the desired solutions and there is no guarantee that the aggregated results matches it. Moreover, non-convex parts of the Pareto set cannot be reached by optimizing a convex weighted sum. An alternative is to use Pareto-based approaches that enable a more accurate selection of the end-user solution. However, so far, only few solutions based on these approaches have been proposed and there exists no comparative study published to date. This motivated us to perform an analysis of several state-of-the-art multi-objective evolutionary algorithms. Multiple scenarios with different complexities are considered. Performance metrics are used to compare several evolutionary algorithms. Results indicate that GDE3 algorithm yields the best performances on this problem, also with the lowest time complexity.
Article
Cloud computing developers face multiple challenges in adapting systems and applications for increasingly heterogeneous datacenter architectures. A major appeal of cloud computing is that it abstracts hardware architecture from both end users and programmers. This abstraction allows underlying infrastructure to be scaled up or improved-for example, by adding datacenter servers or upgrading to newer hardware-without forcing changes in applications. The long-dominant x86 processor architecture, along with high-level, portable languages such as Java, PHP, Python, and SQL, has helped assure the continued viability of such abstraction. Meanwhile, exponential growth in microprocessor capability, mirroring Moore's law, has helped to improve performance for most applications that execute on general-purpose processors, including those deployed on clouds.
Article
A system with n independent components which has a k-out-of-n: G structure operates if at least k components operate. Parallel systems are 1-out-of-n: G systems, that is, the system goes out of service when all of its components fail. This paper investigates the mean residual life function of systems with independent and nonidentically distributed components. Some examples related to some lifetime distribution functions are given. We present a numerical example for evaluating the relationship between the mean residual life of the k-out-of-n: G system and that of its components.
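For the special case of independent and identically distributed components with survival probability p at mission time t, the k-out-of-n:G reliability reduces to a binomial tail sum; the sketch below computes it for a few structures. The exponential lifetime and the numbers are illustrative only, and the mean residual life studied in the paper would follow by integrating this survival function.

```python
from math import comb, exp

def k_out_of_n_reliability(k, n, p):
    """P(at least k of n i.i.d. components survive) = binomial tail sum."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Example: exponential component lifetimes with rate 1/1000 h, mission t = 200 h.
t, rate = 200.0, 1.0 / 1000.0
p = exp(-rate * t)                      # component survival probability at time t

for k, n in [(1, 3), (2, 3), (3, 3)]:   # parallel, 2-out-of-3, series structures
    print(f"{k}-out-of-{n}: R(t) = {k_out_of_n_reliability(k, n, p):.4f}")
```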
Article
What do you do when you can no longer enforce Moore's law?
Article
Cloud computing has emerged as a highly cost-effective computation paradigm for IT enterprise applications, scientific computing, and personal data management. Because cloud services are provided by machines of various capabilities, performance, power, and thermal characteristics, it is challenging for providers to understand their cost effectiveness when deploying their systems. This article analyzes a parallelizable task in a heterogeneous cloud infrastructure with mathematical models to evaluate the energy and performance trade-off. As the authors show, to achieve the optimal performance per utility, the slowest node's response time should be no more than three times that of the fastest node. The theoretical analysis presented can be used to guide allocation, deployment, and upgrades of computing nodes for optimizing utility effectiveness in cloud computing services.
Article
Many-objective optimization refers to optimization problems containing a large number of objectives, typically more than four. Non-dominance is an inadequate strategy for convergence to the Pareto front for such problems, as almost all solutions in the population become non-dominated, resulting in loss of convergence pressure. However, for some problems, it may be possible to generate the Pareto front using only a few of the objectives, rendering the rest of the objectives redundant. Such problems may be reducible to a manageable number of relevant objectives, which can be optimized using conventional multiobjective evolutionary algorithms (MOEAs). For dimensionality reduction, most proposals in the literature rely on the analysis of a representative set of solutions obtained by running a conventional MOEA for a large number of generations, which is computationally overbearing. A novel algorithm, the Pareto corner search evolutionary algorithm (PCSEA), is introduced in this paper, which searches for the corners of the Pareto front instead of searching for the complete Pareto front. The solutions obtained using PCSEA are then used for dimensionality reduction to identify the relevant objectives. The potential of the proposed approach is demonstrated by studying its performance on a set of benchmark test problems and two engineering examples. While the preliminary results obtained using PCSEA are promising, there are a number of areas that need further investigation. This paper provides a number of useful insights into dimensionality reduction and, in particular, highlights some of the roadblocks that need to be cleared for future development of algorithms attempting to use a few selected solutions for identifying relevant objectives.
Conference Paper
One of the challenges of Infrastructure-as-a-Service Clouds is how to dynamically allocate resources to virtual machines such that quality of service constraints are satisfied and operating costs are minimized. The tradeoff between these two conflicting goals can be expressed by a utility function. In this paper, a two-tier resource management approach based on adequate utility functions is presented, consisting of local controllers that dynamically allocate CPU shares to virtual machines to maximize a local node utility function and a global controller that initiates live migrations of virtual machines to other physical nodes to maximize a global system utility function. Experimental results show the benefits of the proposed approach in Cloud computing environments.
Article
The successful development and marketing of commercial high-availability systems requires the ability to evaluate the availability of systems. Specifically, one should be able to demonstrate that projected customer requirements are met, to identify availability bottlenecks, to evaluate and compare different configurations, and to evaluate and compare different designs. For evaluation approaches based on analytic modeling, these systems are often sufficiently complex that state-space methods are not effective due to the large number of states, whereas combinatorial methods are inadequate for capturing all significant dependencies. The two-level hierarchical decomposition proposed here is suitable for the availability modeling of blade server systems such as IBM BladeCenter®, a commercial, high-availability multicomponent system comprising up to 14 separate blade servers and contained within a chassis that provides shared subsystems such as power and cooling. This approach is based on an availability model that combines a high-level fault tree model with a number of lower-level Markov models. It is used to determine component-level contributions to downtime as well as steady-state availability for both standalone and clustered blade servers. Sensitivity of the results to input parameters is examined, extensions to the models are described, and availability bottlenecks and possible solutions are identified.
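A heavily simplified sketch of the two-level idea is given below, with assumed MTTF/MTTR values rather than the BladeCenter data: each subsystem's steady-state availability comes from a two-state (up/down) Markov model, A = MTTF / (MTTF + MTTR), and the top level combines the subsystems like a fault tree/RBD, with a clustered blade pair modelled in parallel.

```python
def steady_state_availability(mttf_h, mttr_h):
    """Two-state up/down Markov model: A = MTTF / (MTTF + MTTR)."""
    return mttf_h / (mttf_h + mttr_h)

def parallel(*avail):
    """At least one redundant unit must be up."""
    unavail = 1.0
    for a in avail:
        unavail *= (1.0 - a)
    return 1.0 - unavail

# Lower level: per-subsystem Markov models (MTTF/MTTR in hours, assumed values).
chassis = steady_state_availability(mttf_h=200_000, mttr_h=8)
power   = steady_state_availability(mttf_h=100_000, mttr_h=4)
cooling = steady_state_availability(mttf_h=150_000, mttr_h=4)
blade   = steady_state_availability(mttf_h=50_000,  mttr_h=12)

# Upper level: chassis, power and cooling in series; two clustered blades in parallel.
system = chassis * power * cooling * parallel(blade, blade)

downtime_min_per_year = (1.0 - system) * 365 * 24 * 60
print(f"system availability: {system:.6f}")
print(f"expected downtime  : {downtime_min_per_year:.1f} min/year")
```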
Article
During the past few years, cloud computing has become a key IT buzzword. Although the definition of cloud computing is still "cloudy", the trade press and bloggers label many vendors as cloud computing vendors, and report on their services and issues. Cloud computing is in its infancy in terms of market adoption. However, it is a key IT megatrend that will take root. This article reviews its definition and status, adoption issues, and provides a glimpse of its future and discusses technical issues that are expected to be addressed.
Conference Paper
The high energy costs of running a data center have led to a rethinking towards energy-efficient data center operation. While data centers are designed to support the expected peak traffic load, the goal of providers such as Amazon or Google is now to dynamically adapt the number of offered resources to the current traffic load. In this paper, we present a queuing-theoretical model to evaluate the trade-off between waiting time and power consumption when only a subset of servers is active all the time and the remaining servers are enabled on demand. We develop a queuing model with thresholds to turn on reserve servers when needed. Furthermore, the resulting system behavior under varying parameters and the requirements for Pareto optimality are studied.
Article
Peer Reviewed http://deepblue.lib.umich.edu/bitstream/2027.42/46947/1/10994_2005_Article_422926.pdf