The Journal of Supercomputing
https://doi.org/10.1007/s11227-019-02852-3
A methodology toassess theavailability ofnext‑generation
data centers
DanielRosendo1· DemisGomes1· GutoLeoniSantos1· GlaucoGoncalves2·
AndreMoreira1· LeylaneFerreira1· PatriciaTakakoEndo3 · JudithKelner1·
DjamelSadok1· AmardeepMehta4· MattiasWildeman4
© Springer Science+Business Media, LLC, part of Springer Nature 2019
Abstract
Cloud data center providers benefit from software-defined infrastructure because it promotes flexibility, automation, and scalability. The new paradigm of software-defined infrastructure helps address the current management challenges of large-scale infrastructures and guarantee service level agreements with established availability levels. Assessing the availability of a data center remains a complex task, as it requires gathering information about a complex infrastructure and generating accurate models to estimate its availability. This paper addresses this gap by proposing a methodology to automatically acquire the data center hardware configuration and assess its availability through models. The proposed methodology leverages the emerging standardized Redfish API and relevant modeling frameworks. Using this approach, we analyzed the availability benefits of migrating from a conventional data center infrastructure (named Performance Optimized Data center (POD), with redundant servers) to a next-generation virtual Performance Optimized Data center (named virtual POD (vPOD), composed of a pool of disaggregated hardware resources). Results show that vPOD improves availability compared to conventional data center configurations.
Keywords Next-generation data center · Redfish standard · Hardware disaggregation · vPOD · Availability · Sensitivity analysis
1 Introduction
Current business applications require different computational resources and services, and cloud providers have gained huge popularity due to the pay-as-you-go business model, and also because cloud computing allows applications to be deployed with guarantees of high availability, scalability, and security. However, even short outages of cloud data centers may cause massive problems, ranging from economic losses to impacts on human life [1]. Recently, an information technology (IT) failure in the British Airways data center resulted in 600 flights being canceled, affecting about 75,000 passengers at a cost of $112 million [2].

* Patricia Takako Endo
patriciaendo@gmail.com; patricia.endo@upe.br
Extended author information available on the last page of the article

As a result, a cloud data center must rely on some level of redundancy to mitigate possible failures, and also have flexible mechanisms, such as virtualization [3, 4], to manage this dynamic and complex system. A new concept called next-generation data centers has emerged in order to provide flexibility, automation, optimization, and scalability [5]. Next-generation data centers are moving toward hyperscale architectures and software-defined control and management, making them composable data centers. This approach breaks conventional data centers down into logical resource pools (compute, memory, I/O, and networking), based on hardware resources interconnected via fast optical fiber links and managed via software. Furthermore, it is now possible to allocate only the required hardware resources and, in the same way, reallocate them easily via software when necessary. This feature reduces idle resources, energy consumption, and maintenance time. According to [6], this design approach “offers the potential advantage of enabling continuous peak workload performance while minimizing resource fragmentation for fast evolving heterogeneous workloads”.
Several industry initiatives have emerged to turn this paradigm into a reality. For instance, Intel offers the Intel Rack Scale Design (Intel RSD), an architecture for disaggregated composable data center infrastructure. Many partners have developed solutions based on Intel RSD, such as Ericsson (Hyperscale Datacenter System 8000), American Megatrends (MegaRAC), Dell EMC (DSS 9000), GIGABYTE (GIGABYTE’s Server Management), Inspur (InCloudRack), Quanta QCT (Rackgo X-RSD), Inventec (inRSD), Wiwynn (ST300), and Supermicro (Supermicro RSD).
Despite the benefits of the next-generation data center, cloud operators still have to face large-scale infrastructure management challenges. According to [7], performance management is a big challenge because it is fundamental for auditing service quality objectives and policies. As the infrastructure scales, resource management (such as scheduling tasks to satisfy applications’ requirements while also efficiently utilizing system resources [8, 9]) also grows in complexity, becoming time-consuming and expensive. It is in this context that computing and mathematical models are used to manage different data center facets, such as availability, maintainability, and reliability.
Some earlier works designed a set of models [10–17] to examine and compare the availability of a wide range of cloud data center infrastructures. To the best of our knowledge, none of the existing works considers automating the availability assessment of next-generation data center infrastructures. Consequently, this research is the first to address the following question: how can information about a next-generation data center infrastructure be acquired automatically and its availability assessed?
To answer this question, we propose a methodology to automatically acquire information about the (conventional and next-generation) data center infrastructure configuration through standard APIs and then generate models to estimate data center availability. Moreover, we leverage our proposed models to identify and suggest useful improvements based on the analysis of our results.
The remainder of this paper is organized as follows. Section 2 presents basic concepts needed to better understand our proposal. Section 3 describes our methodology to acquire the data center information and create models to estimate its availability. Sections 4 and 5 present the methodology of the evaluation scenario and the availability and sensitivity analysis results, respectively. Section 6 discusses our results and Sect. 7 presents related work. Finally, Sect. 8 concludes this paper and outlines future work.
2 Background
This section presents the main concepts and standards applied in this work. First, Sect. 2.1 describes the architectural differences between conventional (POD with redundant servers) and next-generation (vPOD) data centers. Next, Sect. 2.2 presents the information provided by the Redfish schema regarding conventional and next-generation data centers. Lastly, Sect. 2.3 presents the modeling technique used to analyze the availability of both data center configurations, while Sect. 2.4 presents the sensitivity analysis technique applied.
2.1 Conventional versusnext‑generation data centers
As the IT world is transitioning to a software-defined paradigm, there is a need to
address this transition in the context of next-generation data center infrastructure
management [18]. Data centers have evolved, shifting away from commodity servers
architecture toward a more flexible and software-based architecture with a focus on
automation and scalability.
The concept of Performance Optimized Datacenter (POD), see Fig. 1a, consists of dividing the data center infrastructure into modules. Each POD has a set of switches, storage devices, racks, and servers, tightly connected via a local network [19].
Grouping servers into PODs eases their monitoring and management, and partitions the data center space according to specific demands and application requirements. For instance, one may dedicate one POD to high-performance computing (HPC), another to memory-intensive applications, and so forth [20].
However, reconfiguring PODs is not a trivial task and presents some pitfalls that a data center manager needs to be aware of. Allocating too many hardware resources (switches, storage, servers, etc.) to a POD may leave some of them idle (see the gray bar in the graphics on the right side of Fig. 1a), leading to suboptimal resource allocation, higher energy costs, etc.
Next-generation data centers introduce the concept of virtual POD (vPOD) by using software-defined infrastructure stacks and the Intel Rack Scale Design (RSD) technology [6], see Fig. 1b. Intel RSD enables software-defined infrastructure and disaggregates bare-metal compute, storage, and networking resources into virtual pools, namely pools of network, storage, memory, and processor, interconnected via an optical backplane that allows them to operate at the speed of a single logical system [21], see the bottom of Fig. 1b. Intel RSD simplifies resource management and provides the ability to dynamically compose resources based on workload-specific demands [22]. Furthermore, it allows the composition of vPOD systems from these disaggregated pools of resources distributed within a rack or across multiple ones. In the example of Fig. 1b, “vPOD A” (in light wine color) is composed of pools of hardware from Rack 1 and part of Rack 2, while “vPOD B” (shown in light blue color) consists of pools of hardware from Rack 3.
These pools of hardware resources available across multiple racks were optimally combined to compose the “vPOD A” and “vPOD B” systems to handle a particular workload, see the short gray bar in the graphics on the right side of Fig. 1b. Besides, these vPOD systems may be reallocated easily on demand (via software), reducing idle resources, energy consumption, and maintenance time [19].

Fig. 1 Differences between (a) POD and (b) vPOD

Comparing the two configurations, vPOD provides a number of benefits over POD, including greater agility, higher utilization, simplified management, and higher availability [23]. vPOD technology accelerates large-scale data center deployments by elevating provisioning and management to the rack level. In vPOD systems, the hardware is tailored to a particular workload, and all storage, compute, and networking infrastructure is allocated as needed from different physical machines or across multiple racks. Furthermore, due to their software-defined infrastructure, vPOD systems are simple to scale or modify as needed to size them correctly for a given workload (reallocated for different needs and different environments), which maximizes resource utilization by disaggregating compute, network, and storage resources. This flexibility avoids the idle hardware resources found in the less flexible POD-based cloud configuration.
Regarding availability, in a POD, a failure in a single hardware component of a server (a processor, for example) makes the other components of that server unavailable as well, because they all depend on one another (i.e., the whole server fails). Besides, replacing or upgrading failed components may halt some data center activities, because the entire server stays down during maintenance. In vPOD technology, on the other hand, hardware disaggregation avoids the failure dependency problem that occurs in POD. Furthermore, the replacement of failed components in a vPOD system is managed entirely via software, making it less error-prone, easier, and faster than in the POD configuration.
2.2 Redfish
Redfish is an open industry standard specification and schema for system management as defined by the Distributed Management Task Force (DMTF). Redfish aims to simplify and assure the management of scalable (hyperscale) data center hardware.
The Redfish Schema defines properties and actions related to individual resources (systems, racks, enclosures, chassis, and others), as well as redundancy information (number of fans and power supplies per chassis) and the relationships between resources and services. These schemas are defined according to the OData conventions and translated to a JSON representation for network transport [24].
Through the Redfish RESTful API, customers can integrate their solutions with Redfish’s tool chain in order to consume Redfish services. The Redfish API allows data center operators to manage data center resources, control long-lived tasks (power on/off/reboot and gather utilization data), discover devices (racks, chassis, and servers), and handle events (resource status changes, resource alerts, and events notifying that a resource has been added, updated, or removed) [24].
The Service Root refers to a particular resource accessed through the service entry
point (Fig.2). This resource serves as the starting point for locating and accessing
other services and resources.
D.Rosendo et al.
1 3
Accessing the Service Root through a GET request to /redfish/v1, data center
operators can access the remaining services (Systems, Chassis, Composition, etc)
to obtain further and more specific information. The Chassis Service (/redfish/v1/
Chassis), for example, provides a physical view of systems such as thermal sensors,
power supplies, cooling sources, and redundancy of components.
The System Service (/redfish/v1/Systems) provides a logical view of the hardware components in a system, such as processors, memories, storage, and network. For instance, a GET request to /redfish/v1/Systems/<ids>/Processors/<idp> returns information about the number of cores and threads, model, and architecture of the processor identified by <idp> that belongs to the system identified by <ids>.
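The traversal from the Service Root down to individual processors can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this work: the helper names (http_fetch, members, list_processors) are hypothetical, the endpoint is assumed to be unauthenticated (as with a local emulator), and the property names (Members, @odata.id, TotalCores, TotalThreads, Model, ProcessorArchitecture) follow our reading of the Redfish schema and should be checked against the service being queried.

```python
import json
from typing import Callable, Dict, List
from urllib.request import urlopen

Fetch = Callable[[str], dict]


def http_fetch(base_url: str) -> Fetch:
    """Return a function that GETs a Redfish path and decodes the JSON body.

    Assumes an unauthenticated endpoint (e.g. a local emulator); a real
    service would normally also need credentials and TLS settings.
    """
    def fetch(path: str) -> dict:
        with urlopen(base_url + path) as response:
            return json.load(response)
    return fetch


def members(fetch: Fetch, collection_path: str) -> List[str]:
    """Return the @odata.id of every member of a Redfish collection."""
    return [m["@odata.id"] for m in fetch(collection_path).get("Members", [])]


def list_processors(fetch: Fetch) -> Dict[str, List[dict]]:
    """Map each system path to a short summary of its processors."""
    root = fetch("/redfish/v1")
    wanted = ("Model", "ProcessorArchitecture", "TotalCores", "TotalThreads")
    result = {}
    for system_path in members(fetch, root["Systems"]["@odata.id"]):
        system = fetch(system_path)
        cpus = []
        for cpu_path in members(fetch, system["Processors"]["@odata.id"]):
            cpu = fetch(cpu_path)
            cpus.append({key: cpu.get(key) for key in wanted})
        result[system_path] = cpus
    return result
```

Passing the fetch function as a parameter keeps the traversal independent of the transport, so the same code can run against a live service (via http_fetch) or against JSON documents saved from an emulator.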
Another system management feature provided by Redfish is the Composition
Service, which can be accessed through /redfish/v1/CompositionService. It refers to
the context of a pool of disaggregated hardware that could be put together to create
a composed logical system. The Composition Service consists of Resource Blocks.
A Resource Block may contain a pool of disaggregated hardware resources such as
processor, memory, storage, network interface, among others [18].
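As a rough sketch of how the pooled hardware could be inventoried, the function below walks the Resource Blocks exposed by the Composition Service and counts the resources of each kind. It is an illustration under assumptions: pool_inventory is a hypothetical helper, fetch is any function that returns the decoded JSON for a Redfish path, and the property names (ResourceBlocks, Members, and the link arrays Processors, Memory, Storage, EthernetInterfaces) reflect our understanding of the Redfish ResourceBlock schema rather than anything stated in this paper.

```python
from collections import Counter
from typing import Callable

# Link arrays that, to our understanding, a Redfish Resource Block may expose.
RESOURCE_ARRAYS = ("Processors", "Memory", "Storage", "EthernetInterfaces")


def pool_inventory(fetch: Callable[[str], dict]) -> Counter:
    """Count disaggregated resources by kind across all Resource Blocks."""
    service = fetch("/redfish/v1/CompositionService")
    blocks = fetch(service["ResourceBlocks"]["@odata.id"])
    inventory = Counter()
    for member in blocks.get("Members", []):
        block = fetch(member["@odata.id"])
        for kind in RESOURCE_ARRAYS:
            # A block that lacks a given array simply contributes zero.
            inventory[kind] += len(block.get(kind, []))
    return inventory
```

The resulting counts play the role of n (the amount of available hardware of each type) when a vPOD is later modeled.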
2.3 Reliability block diagram
Reliability block diagram (RBD) is a well-established formal method applied to evaluate dependability metrics (e.g., availability, downtime, reliability, among others) of a complex system. This method uses graphical diagrams that show the interdependence of the system’s components. An RBD model represents the functioning state (i.e., failure or success) of a system based on the functioning states of its components [25].

Fig. 2 Redfish schema: services

Blocks in an RBD can be arranged in series, in parallel, or following k-out-of-n logic, as shown in Fig. 3. Under a series configuration (Fig. 3a), if one component fails, the system as a whole becomes unavailable. On the other hand, in a parallel configuration (Fig. 3b), the system fails only if all parallel components become unavailable [26]. In a k-out-of-n configuration (Fig. 3c), the system is available if at least k of its n components are working [27].
To calculate the availability $A_i$ of each system component $i$, we use Eq. 1. The mean time to failure (MTTF) represents the mean time during which the component remains available, whereas the mean time to repair (MTTR) represents the mean time during which it remains under repair, that is, unavailable. In this work, we consider that both the times to failure and the times to repair follow exponential distributions. However, this limitation can be overcome, since non-exponential distributions can be approximated by semi-Markov models following the approximating transition rate technique presented by Kitchin [28] or by phase-type approximation methods [29].
Thus, Eqs.2, 3, and 4 calculate the availability of the systems with series, parallel,
and k-out-of-n configurations, respectively [27].
(1)
A
i=
MTTF
i
MTTF
i
+MTTR
i
(2)
As=
N
i=0
A
i
(3)
A
p=1
N
i=0
(1Ai
)
Fig. 3 Block configurations in RBD: a series; b parallel; c k-out-of-n
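Equations 1 to 3 translate directly into code. The following minimal sketch evaluates the availability of a single component and of series and parallel arrangements; the function names and the MTTF and MTTR values are illustrative assumptions, not values used in this work.

```python
from math import prod
from typing import Iterable


def availability(mttf: float, mttr: float) -> float:
    """Steady-state availability of one component (Eq. 1)."""
    return mttf / (mttf + mttr)


def series(avails: Iterable[float]) -> float:
    """Series blocks: the system is up only if every block is up (Eq. 2)."""
    return prod(avails)


def parallel(avails: Iterable[float]) -> float:
    """Parallel blocks: the system is down only if every block is down (Eq. 3)."""
    return 1.0 - prod(1.0 - a for a in avails)


# Hypothetical values in hours, chosen only for illustration.
server = availability(mttf=1000.0, mttr=10.0)
print(f"one server:      {server:.6f}")
print(f"two in series:   {series([server, server]):.6f}")
print(f"two in parallel: {parallel([server, server]):.6f}")
```

Placing identical blocks in series lowers the availability below that of a single block, whereas placing them in parallel raises it, which is the effect exploited by redundant servers.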
D.Rosendo et al.
1 3
Considering that the reliability at the instant t of a system is defined by Eq.5, the
reliability of a parallel system (
Rp(t)
), a series system (
Rs(t)
), and a
k_out_of _n
sys-
tem is defined by Eqs.7, 6, and 8, respectively.
where
depicts the failure rate.
Once the reliability metric is calculated, it is possible to obtain the MTTF of the sys-
tem until a given instant t, using the following Eq.[30]:
Finally, the MTTR of the overall system can be calculated by placing estimated
availability and MTTF in Eq.1.
2.4 Sensitivity analysis
Numerical models are used in many fields to predict the response of complex
systems. However, as computer power increases, the complexity of the models
increases too. Generally, as the complexity of models increases, the models output
uncertainty increases too due to the randomness present in input parameters [31].
Therefore, sensitivity analysis is a technique used to investigate how the variation of
input parameters can influence the output of numerical models [32].
The sensitivity analysis techniques can be divided into two main categories: qual-
itative and quantitative [32]. Qualitative methods provide a visual inspection of the
model predictions, for example through plots or representations of posterior distri-
butions of input parameters. On the other hand, sensitivity analysis via quantitative
methods associates input factors to a reproducible and numerical evaluation of its
relative influence, through a set of sensitivity indices.
(4)
A
koutof n=
n
i=k
(n
i)Ak
i(1Ai)n
k
(5)
Rx
(t)=e
𝜆
x
t
(6)
R
s(t)=
N
x=0
Rx(t)=
N
x=1
e𝜆x
t
(7)
R
p(t)=1
N
x=0
(Rx(t)) = 1
N
x=1
(e𝜆xt
)
(8)
R
koutof n(t)=
n
i=k
(n
i)Rk
i(t)(1Ri(t))n
k
(9)
MTTF
=
0
R(t)dt
1 3
A methodology toassess theavailability ofnext-generation…
There are many different quantitative methods, such as differential sensitivity analysis, factorial design, importance factors, and percentage difference [33]. Differential analysis can be used with several analytical methods, such as Markov reward models, stochastic Petri nets, and queuing networks. However, this method may not properly evaluate the sensitivity in noncontinuous domains or for methods for which closed-form equations cannot be derived. In these scenarios, the percentage difference approach can overcome these problems [34].
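As a rough illustration of the percentage difference approach, the sketch below sweeps one input parameter over a set of sampled values while holding the others fixed, and reports the spread of the model output relative to its maximum, i.e., (max Y − min Y) / max Y. The function names, the single-component availability model, and all numeric values are assumptions made for the example; sampling a finite list of values is also a simplification of varying the parameter over its whole range.

```python
from typing import Callable, Sequence


def sensitivity_index(model: Callable[[float], float], values: Sequence[float]) -> float:
    """Percentage-difference index: (max Y - min Y) / max Y over the swept values."""
    outputs = [model(v) for v in values]
    return (max(outputs) - min(outputs)) / max(outputs)


def availability(mttf: float, mttr: float) -> float:
    """Availability of a single component from its MTTF and MTTR."""
    return mttf / (mttf + mttr)


# Hypothetical baseline (hours); each parameter is swept while the other is fixed.
MTTF, MTTR = 1000.0, 10.0
mttf_range = [500.0, 750.0, 1000.0, 1250.0, 1500.0]
mttr_range = [5.0, 7.5, 10.0, 12.5, 15.0]

s_mttf = sensitivity_index(lambda x: availability(x, MTTR), mttf_range)
s_mttr = sensitivity_index(lambda x: availability(MTTF, x), mttr_range)
print(f"S_MTTF = {s_mttf:.5f}, S_MTTR = {s_mttr:.5f}")
```

Ranking the parameters by their index indicates which component characteristics deserve attention first when trying to improve the availability of the modeled system.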
The percentage difference technique calculates the sensitivity index of each input parameter of a model by varying it from its minimum value to its maximum value, covering the entire range of possible values. Equation 10 shows the calculation of the index $S_\theta(Y)$, where $\min\{Y(\theta)\}$ and $\max\{Y(\theta)\}$ are, respectively, the minimum and maximum output values computed when varying the input parameter $\theta$ over the range of values of interest [34].

$$S_\theta(Y) = \frac{\max\{Y(\theta)\} - \min\{Y(\theta)\}}{\max\{Y(\theta)\}} \tag{10}$$

3 Automating the data center availability estimation

The availability analysis of physical and virtual resources in large and complex data center infrastructures poses a considerable challenge. In this work, we propose a methodology to deal with this issue by first automating the acquisition of information regarding data center resources and then generating models to estimate the overall data center availability, including a set of servers in a POD, as well as the hardware pools (processor, memory, storage, and network) that compose a vPOD.

Figure 4 depicts the methodology proposed in this paper. In summary, the methodology uses the Redfish standard to collect information about the data center infrastructure (processor, memory, network, storage, power supply, and cooling source) and then generates stochastic models dynamically. Recall that we consider both conventional (POD with redundant servers) and next-generation data center infrastructures (vPOD composed of a pool of disaggregated hardware resources). Each step is described in detail in the next subsections.

3.1 Acquiring data center infrastructure information

The acquisition of information that represents a data center is the first step. At this stage, industry standards and protocols for data center software and hardware management can be used and combined to deliver state information about the data center architecture, subsystems (power, cooling, and IT), redundancy of components (processor, memory, network, storage, power supply, and cooling source), and the way those components are interconnected.

Many organizations have defined data center-related standard management interfaces and protocols, including the Institute of Electrical and Electronics Engineers (IEEE), the Internet Engineering Task Force (IETF), and the Distributed Management Task Force (DMTF). Significant contributions include DMTF Redfish, Storage Networking Industry Association (SNIA) Swordfish, IETF Simple Network Management Protocol (SNMP), Intelligent Platform Management Interface (IPMI), System Management BIOS (SMBIOS), and Systems Management Architecture for Server Hardware (SMASH), among others.
In our proposed methodology, we chose the Redfish standard because it currently provides the richest and most detailed set of information describing data center infrastructure among the standards listed above. Moreover, Redfish allows the management of hybrid and complex infrastructures (many multi-node servers) and offers data center management (platforms and devices) through a single interface [35].
Through the Redfish REST API, our proposed methodology obtains the Redfish Schema [36] with information about the hardware components in the data center. To generate our models, we access information on processor, memory, network, storage, power supply, and cooling source. Besides, we also use computer system data, such as servers (in a POD) and vPOD systems.

Fig. 4 Methodology to generate models based on information gathered through Redfish

3.2 Modeling the data center infrastructure
In this subsection, we describe the process to automatically generate the availability models based on the information gathered from Redfish. The main idea is to identify the components of each computer system (servers in a POD and vPOD systems) and their relationships in order to generate the respective model to calculate the availability.
Hence, to generate availability models, we require two inputs: (a) the information
about the data center infrastructure (in JSON format) provided by the Redfish API
(described previously), and (b) the MTTF and MTTR values of all components of
the data center. The MTTF and MTTR values must be provided previously by the
data center operator, and they are stored in an input configuration file.
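The paper does not specify the format of the input configuration file. Purely as an assumption for illustration, the sketch below stores the MTTF and MTTR of each component type in a JSON document and loads it into a small typed structure; the file layout, the names ComponentTimes and load_times, and the numbers are all hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ComponentTimes:
    mttf: float  # mean time to failure, in hours
    mttr: float  # mean time to repair, in hours


def load_times(text: str) -> Dict[str, ComponentTimes]:
    """Parse a JSON document mapping component type -> {"mttf": ..., "mttr": ...}."""
    raw = json.loads(text)
    times = {}
    for name, entry in raw.items():
        mttf, mttr = float(entry["mttf"]), float(entry["mttr"])
        if mttf <= 0 or mttr < 0:
            raise ValueError(f"invalid MTTF/MTTR for {name!r}")
        times[name] = ComponentTimes(mttf, mttr)
    return times


# Hypothetical file content; the numbers are placeholders, not measured values.
EXAMPLE = """
{
  "processor": {"mttf": 20000, "mttr": 2},
  "memory":    {"mttf": 10000, "mttr": 2}
}
"""
times = load_times(EXAMPLE)
print(times["memory"])
```

Keeping one entry per component type matches the simplification adopted in this work that all components of the same type share the same MTTF and MTTR values.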
To achieve this, mathematical modeling techniques could be used to represent
the data center infrastructure and estimate its availability. Among these techniques,
RBD, Markov chains, product-form queuing networks, performability models, and
stochastic petri net models are widely adopted in the literature. The model should
cover the interactions between components and their relationships [37], including
the composition, interconnection, and behavior of the data center architecture.
We chose RBD to model the data center infrastructure because this technique makes it possible to evaluate dependability metrics based on a logical system whose components are represented as blocks [38]. With RBD, it is possible to derive closed-form equations to calculate dependability metrics; thus, results are usually obtained faster than with other methods [39].
The mapping of the data center infrastructure information (gathered using Redfish) to its respective RBD model for a conventional data center (POD with redundant servers) is presented in Fig. 5. Similar information is depicted in Fig. 6 for a vPOD system (pool of hardware resources). For the sake of simplicity in demonstrating the process of generating the models, we consider an example with only two servers, A and B, in a data center infrastructure, while for vPOD we consider a simple vPOD composed of two resource blocks.
First, our proposed methodology organizes the JSON data previously acquired from the Redfish composition service schema into a Java class structure to allow the manipulation of data. As we use the Mercury tool [39] to solve our RBD models, we need to translate the JSON data into the Mercury scripting language (for more details about the Mercury user manual and the Mercury scripting language, see [40, 41], respectively).

The process of generating the stochastic models based on the Redfish data is illustrated by Algorithm 1. Redundant components of a computer system are modeled using k-out-of-n blocks that take the application requirements into account, where $k$ is the number of components necessary to run the application properly and make it available, and $n$ is the number of available components. Then, the k-out-of-n blocks are arranged in series to represent the dependence between the different hardware types (processor, memory, network, etc.).
The application requirement refers to the minimum number of hardware components (processors, memory modules, network interfaces, among others) that servers (in a POD) or vPOD systems must have to run the application properly and make it available. We highlight that, in this work, we abstract away the hardware configuration, as well as differences in hardware specifications (for example, a processor “A” with 8 cores and a processor “B” with 4 cores, each from a different manufacturer and with different MTTF and MTTR values). Our methodology may be extended to cover these cases. In this work, however, we consider that all hardware components of the same type are equal and have the same MTTF and MTTR values.

Fig. 5 RBD generation process for a POD with redundant servers

The algorithm takes as input the application requirements and the MTTF and MTTR values of each component that composes the computer system. Line 2 of Algorithm 1 presents the function that gathers the data center information using the Redfish API. Then, for each computer system in the data center infrastructure (line 5), an RBD model is generated to represent the components of the computer system (e.g., processors, power supplies, memories, etc.), as shown in line 8. As mentioned previously, the blocks of this RBD model are k-out-of-n, where the k of each component is determined by the application requirement, and the MTTF and MTTR of each component are used in the RBD blocks, as illustrated in the function of line 10.
For conventional data centers, consider the example of Fig. 5 with the following components: one processor, two disks, four memory modules, and one network interface. Considering that the application requires four memory modules to be available, a k-out-of-n block is generated where, in this case, $k$ and $n$ are both equal to four.

On the other hand, the vPOD example in Fig. 6 has the following hardware resources: two network interfaces, eight memory modules, four disks, and two processors. Considering the same memory requirement as in the previous example, the k-out-of-n block is generated with $k$ equal to four and $n$ equal to eight.
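The effect of the two memory configurations above can be checked numerically with the k-out-of-n formula for identical components. This is a small sketch under assumptions: the helper name k_out_of_n and the per-module availability of 0.99 are hypothetical and serve only to compare the two cases.

```python
from math import comb


def k_out_of_n(k: int, n: int, a: float) -> float:
    """Availability of a k-out-of-n block of identical components,
    each with availability a: at least k of the n must be working."""
    return sum(comb(n, i) * a**i * (1.0 - a) ** (n - i) for i in range(k, n + 1))


a_mem = 0.99  # hypothetical availability of one memory module
pod = k_out_of_n(4, 4, a_mem)   # server in a POD: all four modules must work
vpod = k_out_of_n(4, 8, a_mem)  # vPOD: any four of the eight pooled modules
print(f"4-out-of-4: {pod:.8f}")
print(f"4-out-of-8: {vpod:.8f}")
```

With k equal to n the block degenerates into a series arrangement, so a single module failure takes the block down; the pooled configuration tolerates up to four failed modules and therefore yields a higher block availability for the same requirement.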
Fig. 6 RBD generation process for a vPOD composed by two resource blocks
D.Rosendo et al.
1 3
Algorithm 1 Methodology to generate availability models based on Redfish
Require: applicationRequirement, MTTF and MTTR of components (MTTFRComponents)
Ensure: DC RBD model, DC availability
1: # Read data center information from Redfish API
2: DCinfrastructure ← redfishAPI()
3: # Create an empty list of RBD models of computer systems
4: computerSystemModelsList ← list
5: for all computerSystemInfo ∈ DCinfrastructure do
6:     # Create an empty list of RBD models of components
7:     componentsList ← list
8:     for all components ∈ computerSystemInfo do
9:         # Generate RBD k-out-of-n model of computer system components, taking into account the application requirement
10:        blocks ← generateKooNRBD(components, applicationRequirement, MTTFRComponents)
11:        componentsList.add(blocks)
12:    end for
13:    # Arrange the KooN RBD blocks in series
14:    computerSystemRBD ← generateSeriesRBD(componentsList)
15:    computerSystemModelsList.add(computerSystemRBD)
16: end for
17: if computerSystemModelsList.length == 1 then
18:    model ← first(computerSystemModelsList)
19: else
20:    # Create an empty list of blocks regarding each computer system in DC
21:    dcRBD ← list
22:    for all computerSystemModel ∈ computerSystemModelsList do
23:        # Calculate the MTTF and MTTR values of each RBD model
24:        mttf, mttr ← solve(computerSystemModel)
25:        dcRBD.add(block(mttf, mttr))
26:    end for
27:    # Arrange all computer system models in parallel
28:    model ← generateParallelRBD(dcRBD)
29: end if
30: av ← solve(model)
31: return (model, av)
Afterward, all the RBD blocks that represent computer system components generated previously are arranged in series, as shown in line 14. That way, we compose the RBD model for each computer system in a data center.

After generating the RBD model for each computer system, in line 17, the algorithm checks the length of the list. If it has a single element, the algorithm takes this element and goes to line 30 to solve the model, and lastly returns the availability.
Otherwise, it enters the loop in line 22 and solves each model in order to obtain the
respective MTTF and MTTR values for each server, as shown in line 24. This step is
necessary in cases where we have to integrate redundant servers (as shown in Fig. 5)
and then calculate the overall availability. The MTTF and MTTR values are used as input
to a secondary RBD model with blocks in parallel (as in line 28), where each block in
the RBD represents a server. We calculate the MTTF and MTTR values because, if we change
the modeling technique of the secondary model to a more complex one, such as stochastic
Petri nets or Markov chains, the failure and repair times will be needed. We can use
these modeling techniques to represent more complex behavior, such as redundancy
mechanisms and live migration of the application. In other words, our methodology can
be extended to cover these modeling techniques. Finally, in line 30, the algorithm
solves the final model, calculating the availability of the data center, and then
returns the result in line 31. Figure 7 presents a pipeline of the algorithm.
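The composition performed in lines 10 to 30 of Algorithm 1 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: units are taken to be independent with steady-state availability MTTF / (MTTF + MTTR), every function and argument name below is hypothetical, and the sketch passes availabilities between the two levels instead of the MTTF and MTTR pair returned by solve() in line 24.

```python
from math import comb


def block_availability(mttf, mttr):
    """Steady-state availability of one repairable unit (assumed formula)."""
    return mttf / (mttf + mttr)


def koon_availability(k, n, a):
    """k-out-of-n block: at least k of n identical, independent units are up."""
    return sum(comb(n, i) * a**i * (1 - a) ** (n - i) for i in range(k, n + 1))


def series_availability(blocks):
    """Series arrangement: every block must be up."""
    result = 1.0
    for a in blocks:
        result *= a
    return result


def parallel_availability(blocks):
    """Parallel arrangement: the system is down only if every block is down."""
    all_down = 1.0
    for a in blocks:
        all_down *= 1.0 - a
    return 1.0 - all_down


def data_center_availability(computer_systems, requirement, mttf_mttr):
    """computer_systems: list of {component: installed units} dicts.
    requirement: {component: units the application needs}.
    mttf_mttr: {component: (mttf_hours, mttr_hours)}.
    """
    system_models = []
    for installed in computer_systems:
        # One k-out-of-n block per component type (cf. line 10).
        blocks = [
            koon_availability(needed, installed[name], block_availability(*mttf_mttr[name]))
            for name, needed in requirement.items()
        ]
        system_models.append(series_availability(blocks))  # cf. line 14
    if len(system_models) == 1:  # cf. line 17
        return system_models[0]
    return parallel_availability(system_models)  # cf. line 28
```

With a single entry in computer_systems the function returns the series model directly, and with several entries it combines them in parallel, which mirrors the branch in lines 17 to 29.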
The complexity of this algorithm depends on the number of computerSystemInfo entries (N)
and the number of components (M) present in the data center. It also depends on the
complexity of the functions generateKooNRBD() (line 10), generateSeriesRBD() (line 14),
generateParallelRBD() (line 28), and solve() (lines 24 and 30). The functions that
generate RBDs are simpler than the solve() function, and the time to process solve()
for the RBD of a computerSystemModel (line 24) and for the data center parallel RBD
model (line 30) is similar. This way, the computational complexity of Algorithm 1 is
given by $O(N \times M)$, which corresponds to the intensive task of getting the MTTF
and MTTR of each individual component.
4 Methodology
In this paper, our interest is to analyze and compare the availability of a conventional
data center (named POD) against a next-generation data center (named vPOD). We scaled
both of them equally in the amount of installed hardware. We use the Redfish Emulator
(see footnote 2) to create the evaluation scenarios (through Python scripts), which
represent the hardware architectures of vPODs and of servers in a POD. For that, we
implemented the methodology proposed in Sect. 3 to automatically acquire the
infrastructure configuration of vPOD systems and servers in a POD (Fig. 8) and generate
the respective models to assess their availability. Our implementation embeds the
Mercury tool [39] to solve the models and obtain the availability results.
Fig. 7 Pipeline of algorithm to generate availability models based on Redfish
2 https://github.com/DMTF/Redfish-Interface-Emulator
D.Rosendo et al.
1 3
We consider an application requirement that refers to the minimum hardware
configuration that the vPOD and the servers (in a POD) must meet, but we do not model
the application per se. We made the following two assumptions: (i) we consider a cloud
application that will be hosted in the data center. This application requires one
network interface, four memory units, two disks, and one processor to be available;
(ii) we also consider that a single power supply and a single cooling source are needed
to complete the hardware requirements.
In our evaluation, we increase the number of power and cooling components, as well as
the hardware resources (processor, memory, storage, and network) allocated to the
application, in order to observe how availability changes in both the POD and vPOD
cases. We start with the pack of hardware, defined as the minimum set of hardware
resources needed to meet the application requirements.
Table 1 presents the MTTF and MTTR of the hardware resources and of the power and
cooling components that compose the pack of hardware. These values were obtained from
the literature because they are difficult to measure in a real-world environment. We
highlight that the MTTF and MTTR of the power supply and cooling source refer to a
tier 1 power subsystem and a tier 1 cooling subsystem, respectively.
The hardware resources allocated to the infrastructure (POD and vPOD) are discretely
scaled up from 2 to 5 packs of hardware. This way, we provide a fairer evaluation,
since it compares POD and vPOD infrastructures whose amount of hardware resources is
the same, differing only in the way the hardware is arranged.
Fig. 8 Evaluation scenario: POD with redundant servers and vPOD
Considering the POD configuration, we improve its availability through redundant
servers, varying from 2 (N+1) to 5 (N+4) servers, each one configured with the pack of
hardware described previously (see the middle layer of Fig. 8). We highlight that we
have the same application requirement for each server, and each server has just the
minimum set of hardware resources, power supply, and cooling source to handle the
application.
Therefore, a failure in one of a server's hardware components, power supply, or cooling
source is considered a failure of the whole server, but it is not considered a failure
of the application, since the application is supported by another server. That way, a
POD with five servers will be unavailable only if all five servers fail.
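The redundancy rule above can be sketched numerically. This is a minimal sketch, assuming independent components, a steady-state availability of MTTF / (MTTF + MTTR) for each unit, and the MTTF and MTTR values reported in Table 1. The names TABLE_1, PACK, server_availability, and pod_availability are hypothetical, and the paper's own results come from the Mercury tool, not from this code.

```python
# MTTF and MTTR in hours, copied from Table 1.
TABLE_1 = {
    "processor": (292_000.00, 6.0),
    "memory": (480_000.00, 2.5),
    "storage": (200_000.00, 2.5),
    "network": (120_000.00, 2.5),
    "power": (259_122.90, 142.41),
    "cooling": (4_182.07, 23.90),
}

# Units in one pack of hardware (the application requirement of this section).
PACK = {"processor": 1, "memory": 4, "storage": 2, "network": 1, "power": 1, "cooling": 1}


def server_availability():
    """A server is up only if every unit of its pack is up (series arrangement)."""
    availability = 1.0
    for name, count in PACK.items():
        mttf, mttr = TABLE_1[name]
        availability *= (mttf / (mttf + mttr)) ** count
    return availability


def pod_availability(servers):
    """The POD is down only if all of its redundant servers are down."""
    return 1.0 - (1.0 - server_availability()) ** servers


if __name__ == "__main__":
    for n in range(2, 6):
        print(f"{n} servers: {100 * pod_availability(n):.10f} %")
```

Under these assumptions the two-server case lands close to the 2x POD row of Table 2; small differences in the trailing digits are to be expected because the solver is different.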
On the other hand, we scaled the vPOD configuration from 2 packs of hardware (a
hardware pool with two network interfaces, eight memories, four disks, and two
processors, in addition to two independent power supplies and two independent cooling
sources) to 5 packs of hardware (five network interfaces, twenty memories, ten disks,
and five processors, as well as five independent power supplies and five independent
cooling sources); see the bottom of Fig. 8.
We highlight that, unlike the POD configuration (an application requirement for each
server), a vPOD has a single application requirement for a pool of hardware resources.
Therefore, a vPOD with 5 packs of hardware will be unavailable if one of the following
conditions occurs: five out of five network interfaces fail, 17 out of 20 memories
fail, 9 out of 10 disks fail, five out of five processors fail, five out of five power
supplies fail, or five out of five cooling sources fail.
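The failure conditions listed for the vPOD amount to one k-out-of-n requirement per resource pool, arranged in series. The sketch below illustrates this under the same assumptions as before (independent units, steady-state availability MTTF / (MTTF + MTTR), values taken from Table 1). The names TABLE_1, REQUIRED, pool_unavailability, and vpod_availability are hypothetical and are not part of the authors' implementation.

```python
from math import comb

# MTTF and MTTR in hours, copied from Table 1.
TABLE_1 = {
    "processor": (292_000.00, 6.0),
    "memory": (480_000.00, 2.5),
    "storage": (200_000.00, 2.5),
    "network": (120_000.00, 2.5),
    "power": (259_122.90, 142.41),
    "cooling": (4_182.07, 23.90),
}

# Units the application needs; one pack of hardware holds exactly this many.
REQUIRED = {"processor": 1, "memory": 4, "storage": 2, "network": 1, "power": 1, "cooling": 1}


def pool_unavailability(required, installed, a):
    """Probability that fewer than `required` of `installed` units are up."""
    return sum(
        comb(installed, up) * a**up * (1 - a) ** (installed - up)
        for up in range(required)
    )


def vpod_availability(packs):
    """The vPOD is up only if every resource pool still meets the requirement."""
    availability = 1.0
    for name, required in REQUIRED.items():
        mttf, mttr = TABLE_1[name]
        a = mttf / (mttf + mttr)
        availability *= 1.0 - pool_unavailability(required, packs * required, a)
    return availability


if __name__ == "__main__":
    for packs in range(2, 6):
        print(f"{packs} packs: {100 * vpod_availability(packs):.10f} %")
```

With 5 packs, the memory pool has 20 units and the sketch counts it as failed only when fewer than 4 remain up, that is, when at least 17 have failed, matching the condition stated in the text. The resulting values land near the vPOD rows of Table 2 without reproducing every digit.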
5 Evaluation results
This section presents the availability results obtained from the proposed models for
the POD and vPOD configurations. Furthermore, it presents the sensitivity analysis
results, which identify the components that most impact availability as we increase
the number of redundant components.
Table 1 MTTF and MTTR values of hardware resources, power supply, and cooling source [27, 42–44]

Component   MTTF (h)      MTTR (h)
Processor   292,000.00    6.0
Memory      480,000.00    2.5
Storage     200,000.00    2.5
Network     120,000.00    2.5
Power       259,122.90    142.41
Cooling     4,182.07      23.90
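As a reading aid for Table 1, the availability of a single component can be worked out from its MTTF and MTTR. The relation below is the standard steady-state formula and is assumed here to be the one applied to each block; the rounding to five decimals is only illustrative.

```latex
A = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}}, \qquad
A_{\text{cooling}} = \frac{4182.07}{4182.07 + 23.90} \approx 0.99432, \qquad
A_{\text{power}} = \frac{259122.90}{259122.90 + 142.41} \approx 0.99945
```

Under this relation, the cooling source is the least available single component in the table, followed by the power supply.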
D.Rosendo et al.
1 3
5.1 Availability analysis
Table 2 presents the availability results. The vPOD configuration presents higher
availability than the classical POD configuration with redundant servers, even when
we change both configurations from 2 to 5 packs of hardware. Furthermore, the
difference in availability (see the number of 9’s3) between vPOD and POD infra-
structures increases as we increase the amount of hardware available, ranging from
Table 2 Availability results: POD (redundant servers) and vPOD (pool of hardware)
Packs of
hardware
Configuration Availability (%) Number of nines Downtime (min/year)
2x 2 servers (
N+1
) 99.9960117205 4.39 20.4
vPOD 99.9967432521 4.48 16.8
3x 3 servers (
N+2
) 99.9999748128 6.59
1.32 ×10
1
vPOD 99.9999816583 6.73
9.6 ×102
4x 4 servers (
N+3
) 99.9999998409 8.79
7.8 ×10
4
vPOD 99.9999998959 8.98
5.46 ×10
4
5x 5 servers (
N+4
) 99.9999999989 10.99
5.22 ×10
6
vPOD 99.9999999994 11.22
3.06 ×106
Fig. 9 Number of nines comparison between POD (redundant servers) and vPOD (pool of hardware)
3 The system is considered high available if it presents five 9’s of availability, meaning that its downtime
is about only 5.26min per year.
1 3
A methodology toassess theavailability ofnext-generation…
a difference of 0.09 nines (for 2 packs of hardware) to 0.23 nines (for 5 packs of
hardware). The Fig.9 presents this difference.
In a scenario with 2 packs of hardware, the POD obtains 4.39 nines of availability with
a downtime of 20.4 min per year, against 4.48 nines and 16.8 min of downtime when using
the vPOD configuration. Increasing to 3 and 4 packs of hardware, the POD achieves 6.59
nines and 8.79 nines, against 6.73 nines and 8.98 nines in the vPOD configuration,
respectively. Lastly, considering 5 packs of hardware, the 5 servers (N+4 redundancy)
in the POD obtained 10.99 nines against 11.22 nines in the vPOD, and they present a
downtime of $5.22 \times 10^{-6}$ and $3.06 \times 10^{-6}$ min per year, respectively.
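The two derived columns of Table 2 follow from the availability value alone. The sketch below shows the usual conversions, assuming a 365-day year and the common definition of the number of nines as the negative base-10 logarithm of the unavailability. The function names are hypothetical.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def number_of_nines(availability):
    """Number of nines, taken as -log10 of the unavailability."""
    return -math.log10(1.0 - availability)


def downtime_minutes_per_year(availability):
    """Expected downtime per year for a steady-state availability in [0, 1]."""
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    five_nines = 0.99999
    print(f"{number_of_nines(five_nines):.2f} nines")
    print(f"{downtime_minutes_per_year(five_nines):.2f} min/year")
```

For five nines this gives the roughly 5.26 min per year quoted in footnote 3. Applied to the availabilities in Table 2, the last digits may differ slightly from the printed columns, which can happen when intermediate values are rounded or truncated.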
5.2 Sensitivity analysis
Table 3 shows the sensitivity analysis results. We applied the sensitivity analysis
using the percentage difference technique (Sect. 2) with a
±10%
variation of the
default MTTF and MTTR values (Table1). For each scenario, the table presents the
top-three components that most impact on the availability (most sensible indexes),
comprising the MTTF and MTTR of components, such as cooling source (indi-
cated by CO, in this case
CO_MTTF
and
CO_MTTR
, respectively) and power supply
(PW). In this analysis, indexes related to the hardware resources (processor, mem-
ory, storage, and network) do not appeared in top-three list.
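The ±10% procedure can be illustrated on the two-server POD model. This is a minimal sketch under stated assumptions: the sensitivity index is taken to be (max − min) / max of the availability over the varied range, which is how the percentage difference technique is commonly stated, and the POD is modeled with independent components as in Sect. 4. All names below are hypothetical.

```python
import copy

# MTTF and MTTR in hours, copied from Table 1.
TABLE_1 = {
    "processor": {"MTTF": 292_000.00, "MTTR": 6.0},
    "memory": {"MTTF": 480_000.00, "MTTR": 2.5},
    "storage": {"MTTF": 200_000.00, "MTTR": 2.5},
    "network": {"MTTF": 120_000.00, "MTTR": 2.5},
    "power": {"MTTF": 259_122.90, "MTTR": 142.41},
    "cooling": {"MTTF": 4_182.07, "MTTR": 23.90},
}

# Units in one pack of hardware (one server of the POD).
PACK = {"processor": 1, "memory": 4, "storage": 2, "network": 1, "power": 1, "cooling": 1}


def pod_availability(table, servers=2):
    """Series model per server, parallel model across redundant servers."""
    server = 1.0
    for name, count in PACK.items():
        mttf, mttr = table[name]["MTTF"], table[name]["MTTR"]
        server *= (mttf / (mttf + mttr)) ** count
    return 1.0 - (1.0 - server) ** servers


def sensitivity_index(component, parameter, variation=0.10, servers=2):
    """Assumed index: (max - min) / max of availability over the varied range."""
    outputs = []
    for factor in (1.0 - variation, 1.0, 1.0 + variation):
        table = copy.deepcopy(TABLE_1)  # vary one parameter, keep the rest fixed
        table[component][parameter] *= factor
        outputs.append(pod_availability(table, servers))
    return (max(outputs) - min(outputs)) / max(outputs)


if __name__ == "__main__":
    for component, parameter in [("cooling", "MTTF"), ("cooling", "MTTR"), ("power", "MTTF")]:
        print(component, parameter, f"{sensitivity_index(component, parameter):.3e}")
```

Under these assumptions the sketch ranks cooling MTTF first, then cooling MTTR, then power MTTF, which is the same order reported for the POD in Table 3.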
Regarding the POD configuration, the top-three indexes obtained when scaling from 2 to
5 servers are cooling MTTF, cooling MTTR, and power MTTF, in this order. Therefore, the
cooling source and the power supply have the greatest impact on the availability of
each server. Besides the top-three indexes listed in Table 3, the MTTR of the power
supply occupied the fourth position, followed by storage MTTF and MTTR. Lastly,
processor (MTTF and MTTR) and network (MTTR) obtained the lowest indexes.
In the vPOD configuration, we also have the same top-three components as in the POD
configuration, but with lower indexes. In the resource pool that composes the vPOD, the
processor and network interface, as well as the power supply and cooling source, have
fewer components than the memory and storage devices. Unlike the POD configuration with
N+1 redundant servers, which fails when one out of four memories fails in both servers,
a vPOD remains available even if four out of eight memories fail. Due to the greater
number of components (reaching 20 when using 5 packs of hardware), the memory presented
the lowest sensitivity index in all cases. The components with fewer hardware resources
had the higher indexes, namely the processor (MTTF and MTTR) and storage (MTTF and
MTTR), in this sequence, in the seventh to tenth positions of the ranking.
Table 3 Sensitivity analysis results: POD (redundant servers) and vPOD (pool of hardware)

Configuration  Parameter  2x packs                3x packs                 4x packs                 5x packs
POD            CO_MTTF    $1.45 \times 10^{-5}$   $1.39 \times 10^{-7}$    $1.18 \times 10^{-9}$    $9.54 \times 10^{-12}$
POD            CO_MTTR    $1.42 \times 10^{-5}$   $1.35 \times 10^{-7}$    $1.14 \times 10^{-9}$    $9.11 \times 10^{-12}$
POD            PW_MTTF    $1.40 \times 10^{-6}$   $1.32 \times 10^{-8}$    $1.11 \times 10^{-10}$   $8.84 \times 10^{-13}$
vPOD           CO_MTTF    $1.30 \times 10^{-5}$   $1.13 \times 10^{-7}$    $8.70 \times 10^{-10}$   $6.30 \times 10^{-12}$
vPOD           CO_MTTR    $1.28 \times 10^{-5}$   $1.09 \times 10^{-7}$    $8.35 \times 10^{-10}$   $5.99 \times 10^{-12}$
vPOD           PW_MTTF    $1.24 \times 10^{-7}$   $1.04 \times 10^{-10}$   $7.79 \times 10^{-14}$   $5.48 \times 10^{-17}$

6 Discussion
One of the main goals of this paper was to analyze both POD and vPOD approaches in
terms of availability. As presented in Sect. 5.1, the vPOD approach presented a higher
availability, which in turn reduced the data center interruption time and the downtime
costs. Such interruptions are expensive, reaching about $336,000 per hour in lost
revenue (from Amazon and Microsoft) [45]. As another example, one hour of downtime of a
credit-card authorization service (transactions that cannot be completed) costs
credit-card companies about $2.6 million [45]. In addition to financial losses,
failures can result in severe service interruptions, decreased productivity, and
damaged business reputations [46].
From the availability results (presented in Sect. 5.1) and considering the
aforementioned downtime cost of US$ 336,000 per hour [45], using 2 packs of hardware in
a POD configuration with 2 servers (N+1 redundancy), the downtime cost reaches
US$ 114,240 per year due to its four-nines availability. On the other hand, when using
the vPOD infrastructure, the downtime cost reaches US$ 94,080 per year, representing a
downtime cost reduction of US$ 20,160 per year.
When using 3 packs of hardware, the POD with 3 servers (N+2 redundancy) and the vPOD
configuration present an annual downtime cost of US$ 739.20 and US$ 537.60,
respectively. For 4 packs of hardware, the POD approach (4 servers, N+3 redundancy)
resulted in an annual downtime cost of US$ 4.37, against US$ 3.06 for vPOD.
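The annual cost figures above are consistent with multiplying the yearly downtime of Table 2, converted to hours, by the hourly cost cited from [45]. The short sketch below reproduces that arithmetic; the function and constant names are hypothetical.

```python
COST_PER_HOUR = 336_000  # US$ per hour of downtime, the figure cited from [45]


def annual_downtime_cost(downtime_min_per_year):
    """Convert yearly downtime in minutes to hours and price it."""
    return downtime_min_per_year / 60.0 * COST_PER_HOUR


if __name__ == "__main__":
    # Downtime values (min/year) for 2 packs of hardware, from Table 2.
    pod, vpod = annual_downtime_cost(20.4), annual_downtime_cost(16.8)
    print(f"POD:  US$ {pod:,.2f}")
    print(f"vPOD: US$ {vpod:,.2f}")
    print(f"Reduction: US$ {pod - vpod:,.2f}")
```

For example, 20.4 min per year is 0.34 h, and 0.34 h at US$ 336,000 per hour gives the US$ 114,240 quoted for the two-server POD.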
We also highlight that the availability results presented in Table 2 show a growing
difference between the vPOD and POD infrastructures as the redundancy grows: with 2
packs of hardware, the difference was 0.09 nines; with 4 packs of hardware, it reached
0.19 nines; and with 5 packs of hardware, 0.23 nines. Therefore, higher levels of
resource disaggregation in vPODs improve availability even more than adding redundant
servers to an infrastructure.
The higher level of hardware disaggregation, and consequently the lower failure
dependence between the hardware components, provided by the vPOD technology improves
its availability. Furthermore, the higher availability of the vPOD, compared to the POD
configuration, comes from better flexibility to manage the hardware resources. In
physical servers, for example, a network interface failure leads to a server failure,
making the server unavailable even when its other components are operating in perfect
condition. In a vPOD, on the other hand, this does not occur, since the hardware
resources that compose a vPOD system are disaggregated pools of hardware.
Therefore, any improvements in critical cloud infrastructures to reduce the impact of
such interruptions are crucial. Furthermore, together with the higher availability
provided by the vPOD technology, other aspects, such as greater agility, higher
utilization, and simplified management, also justify the technology change.
Regarding the sensitivity analysis, the POD and vPOD configurations presented the same
top-three parameters that most impact availability. The MTTF of the cooling source
(4,182.07 h, see Table 1), the lowest among all components, explains the highest impact
of CO_MTTF on availability in both configurations, as presented in Table 3.
7 Related work
There are several works that model data center infrastructure, such as [10–14].
However, none of them considers the vPOD data center design architecture.
D.Rosendo et al.
1 3
In [14], the authors presented models to evaluate the impact, cost, and dependability
of data center cooling and power infrastructures. They focused on the sustainability
impact and on the availability and reliability of the power system.
In [35], the authors present a tool called Redfish Conformance Test Tool (RCTT). This
tool provides a test environment for the Redfish standard that allows customers to test
their services automatically, understand the specifications, and identify the
compatibility between these specifications and the client requirements. Although RCTT
can be used to test Redfish in terms of implementation and interoperability between
heterogeneous multi-vendor systems, it does not provide models to estimate data center
availability.
The work in [47] presents a tool associated with availability analysis models.
This approach is able to transform an input of the data center model into a Petri Net
model and then to perform availability analysis. The results feed a scoring selec-
tion tool for obtaining the best features such as energy efficiency and high avail-
ability. However, this work does not consider vPODs when assessing the data center
availability.
In this way, our proposal differs from the literature in that it collects information
from the data center in an automatic way and models the vPOD, a next-generation data
center infrastructure.
8 Conclusions andfuture works
From the perspective of the data center provider, availability is an essential metric
for maximizing profits and guaranteeing service level agreement (SLA). However,
most providers lack a methodology for gathering hardware resource information and
providing the availability of a cloud data center. Moreover, the next-generation cloud
data center will increasingly be dominated by disaggregated hardware resources in
order to improve management and efficiency. The question we raised was whether
this new strategy also increases the availability or not.
This work covers both concerns by proposing a methodology to estimate the data center
availability based on hardware resource information provided via the Redfish API. The
evaluation results showed that the vPOD configuration increases availability when
compared to conventional cloud data center strategies, such as a POD with redundant
servers. Although not considered in our models, vPOD technology allows real-time
monitoring at the hardware level and the allocation of hardware resources by management
software, which decreases the repair time in comparison to conventional strategies.
Therefore, the availability increase, coupled with the other vPOD benefits, is
certainly worth considering.
As future work, we plan to improve our methodology by using other powerful
modeling techniques, such as Markov chains or SPNs to represent more complex
aspects of data centers, such as different redundancy strategies, live migration of
the services, and software components such as virtual machines, containers, and
applications. We also plan to use optimization algorithms in our methodology. Such
algorithms should improve cloud data center availability by offering better hardware
composition (processor, memory, storage, and network) to allocate and to configure
vPODs while considering constraints such as available hardware resources, hard-
ware costs, and availability level requirements.
Acknowledgements This work was supported by the Research, Development and Innovation Center,
Ericsson Telecomunicações S.A., Brazil. The authors would like to thank Carolina Cani for her support
with our images.
References
1. Trivedi KS, Bobbio A (2017) Reliability and availability engineering: modeling, analysis, and applications. Cambridge University Press, Cambridge
2. British air data center outage feeds outrage at airline cost cuts (2017). http://www.datacenterknowledge.com. Accessed Nov 2018
3. Al-Yatama A, Ahmad I, Al-Dabbous N (2017) Memory allocation algorithm for cloud services. J Supercomput 73(11):5006–5033
4. Fard SYZ, Ahmadi MR, Adabi S (2017) A dynamic VM consolidation technique for QOS and energy consumption in cloud environment. J Supercomput 73(10):4347–4368
5. Han S, Egi N, Panda A, Ratnasamy S, Shi G, Shenker S (2013) Network support for resource disaggregation in next-generation datacenters. In: Proceedings of the Twelfth ACM Workshop on Hot Topics in Networks, p 10. ACM
6. Li CS, Franke H, Parris C, Abali B, Kesavan M, Chang V (2017) Composable architecture for rack scale big data computing. Future Gener Comput Syst 67:180–193
7. Fareghzadeh N, Seyyedi MA, Mohsenzadeh M (2019) Toward holistic performance management in clouds: taxonomy, challenges and opportunities. J Supercomput 75(1):272–313
8. Chen H, Zhu J, Zhang Z, Ma M, Shen X (2017) Real-time workflows oriented online scheduling in uncertain cloud environment. J Supercomput 73(11):4906–4922
9. Li C, Zhu L, Liu Y, Luo Y (2017) Resource scheduling approach for multimedia cloud content management. J Supercomput 73(12):5150–5172
10. Addabbo T, Fort A, Mugnaini M, Vignoli V, Simoni E, Mancini M (2016) Availability and reliability modeling of multicore controlled ups for datacenter applications. Reliab Eng Syst Saf 149:56–62. https://doi.org/10.1016/j.ress.2015.12.010
11. Alissa HA, Nemati K, Sammakia BG, Seymour MJ, Tipton R, Mendo D, Demetriou DW, Schneebeli K (2016) Chip to chiller experimental cooling failure analysis of data centers: the interaction between it and facility. IEEE Trans Compon Packag Manuf Technol 6(9):1361–1378. https://doi.org/10.1109/TCPMT.2016.2599025
12. Callou G, Maciel P, Tutsch D, Araújo J (2012) Models for dependability and sustainability analysis of data center cooling architectures. In: IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012), pp 1–6. https://doi.org/10.1109/DSNW.2012.6264697
13. Liu Z, Chen Y, Bash C, Wierman A, Gmach D, Wang Z, Marwah M, Hyser C (2012) Renewable and cooling aware workload management for sustainable data centers. In: Proceedings of the 12th ACM SIGMETRICS/Performance Joint International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS '12, pp 175–186. ACM, New York, NY, USA. https://doi.org/10.1145/2254756.2254779
14. Callou G, Maciel P, Tutsch D, Ferreira J, Araújo J, Souza R (2013) Estimating sustainability impact of high dependable data centers: a comparative study between brazilian and US energy mixes. Computing 95(12):1137–1170. https://doi.org/10.1007/s00607-013-0328-y
15. Gomes D, Endo P, Gonçalves G, Rosendo D, Santos G, Kelner J, Sadok D, Mahloo M (2017) Evaluating the cooling subsystem availability on a cloud data center. In: IEEE Symposium on Computers and Communications. IEEE
16. Santos G, Endo P, Gonçalves G, Rosendo D, Gomes D, Kelner J, Sadok D, Mahloo M (2017) Analyzing the it subsystem failure impact on availability of cloud services. In: IEEE Symposium on Computers and Communications. IEEE
17. Rosendo D, Santos G, Gomes D, Moreira A, Gonçalves G, Endo P, Kelner J, Sadok D, Mahloo M (2017) How to improve cloud services availability? Investigating the impact of power and it subsystems failures. In: HICSS Hawaii International Conference on System Sciences. HICSS
D.Rosendo et al.
1 3
18. Redfish composability white paper (2017). https://www.dmtf.org/sites/default/files/standards/documents/DSP2050_1.0.0.pdf. Accessed Apr 2018
19. Cheng J, Grinnemo KJ (2017) Telco distributed DC with transport protocol enhancement for 5G mobile networks: a survey. Karlstads universitet
20. Intel rack scale design architecture specification (2018) Software v2.3.3
21. Intel rack scale design architecture (2019). https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/rack-scale-design-architecture-white-paper.pdf. Accessed Mar 2019
22. Megarac solutions for intel rack scale design standards (2019). https://ami.com/ami_downloads/MegaRAC_Solutions_for_Intel_Rack_Scale_Design_Data_Sheet.pdf. Accessed Mar 2019
23. Supermicro rack scale design (rsd) solution overview (2019). https://www.supermicro.com/solutions/SRSD.cfm. Accessed Mar 2019
24. Redfish scalable platforms management api specification (2018) DMTF Redfish DSP0266
25. Fazlollahtabar H, Akhavan Niaki ST (2017) Integration of fault tree analysis, reliability block diagram and hazard decision tree for industrial robot reliability evaluation. Ind Robot Int J 44(6):754–764
26. Maciel P, Trivedi K, Matias R, Kim D (2010) Dependability modeling. In: Performance and dependability in service computing: Concepts, Techniques and Research Directions. IGI Global, Hershey, Pennsylvania, USA, 13
27. Araujo J, Maciel P, Torquato M, Callou G, Andrade E (2014) Availability evaluation of digital library cloud services. In: 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp 666–671. IEEE
28. Kitchin JF (1988) Practical Markov modeling for reliability analysis. In: 1988 Proceedings of the Annual Reliability and Maintainability Symposium, pp 290–296. IEEE
29. Malhotra M, Reibman A (1993) Selecting and implementing phase approximations for semi-markov models. Stoch Models 9(4):473–506
30. Høyland A, Rausand M (2009) System reliability theory: models and statistical methods, vol 420. Wiley, New York
31. Vu-Bac N, Lahmer T, Zhuang X, Nguyen-Thoi T, Rabczuk T (2016) A software framework for probabilistic sensitivity analysis for computationally expensive models. Adv Eng Softw 100:19–31
32. Pianosi F, Beven K, Freer J, Hall JW, Rougier J, Stephenson DB, Wagener T (2016) Sensitivity analysis of environmental models: a systematic review with practical workflow. Environ Model Softw 79:214–232
33. Hamby D (1994) A review of techniques for parameter sensitivity analysis of environmental models. Environ Monit Assess 32(2):135–154
34. Andrade E, Nogueira B, Matos R, Callou G, Maciel P (2017) Availability modeling and analysis of a disaster-recovery-as-a-service solution. Computing 99:1–26
35. Kumari P, Saleem F, Sill A, Chen Y (2017) Validation of redfish: the scalable platform management standard. In: Companion Proceedings of the 10th International Conference on Utility and Cloud Computing, pp 113–117. ACM
36. Redfish resource and schema guide (2017) DSP2046 DMTF Redfish
37. Cassandras CG, Lafortune S (2009) Introduction to discrete event systems. Springer, Berlin
38. Verma AK, Ajit S, Karanki DR (2010) Reliability and safety engineering, vol 43. Springer, Berlin
39. Maciel P, Matos R, Silva B, Figueiredo J, Oliveira D, Fé I, Maciel R, Dantas J (2017) Mercury: Performance and dependability evaluation of systems with exponential, expolynomial, and general distributions. In: 2017 IEEE 22nd Pacific Rim International Symposium on Dependable Computing (PRDC), pp 50–57. IEEE
40. Mercury tool manual v4.7.0 (2019). http://www.modcs.org/wp-content/uploads/tools/Mercury_Tool_Manual_v4.7.0.pdf. Accessed Mar 2019
41. Oliveira D (2019) The mercury scripting language cookbook. Available at: http://www.modcs.org/?page_id=1703. Accessed Apr 2019
42. Smith WE, Trivedi KS, Tomek LA, Ackaret J (2008) Availability analysis of blade server systems. IBM Syst J 47(4):621–640
43. Brosch F, Koziolek H, Buhnova B, Reussner R (2010) Parameterized reliability prediction for component-based software architectures. In: International Conference on the Quality of Software Architectures, pp 36–51. Springer
44. Gomes D, Santos GL, Rosendo D, Gonçalves G, Moreira A, Kelner J, Sadok D, Endo PT (2019) Measuring the impact of data center failures on a cloud-based emergency medical call system. Concurr Comput Pract Exper. https://doi.org/10.1002/cpe.5156
45. Cérin C, Coti C, Delort P, Diaz F, Gagnaire M, Gaumer Q, Guillaume N, Lous J, Lubiarz S, Raffaelli J et al (2013) Downtime statistics of current cloud solutions. International Working Group on Cloud Computing Resiliency. Technical Report
46. Endo PT, Santos GL, Rosendo D, Gomes DM, Moreira A, Kelner J, Sadok D, Gonçalves GE, Mahloo M (2017) Minimizing and managing cloud failures. Computer 50(11):86–90
47. Jammal M, Kanso A, Heidari P, Shami A (2017) Evaluating high availability-aware deployments using stochastic petri net model and cloud scoring selection tool. IEEE Trans Serv Comput PP:1
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published
maps and institutional affiliations.
Aliations
DanielRosendo1· DemisGomes1· GutoLeoniSantos1· GlaucoGoncalves2·
AndreMoreira1· LeylaneFerreira1· PatriciaTakakoEndo3 · JudithKelner1·
DjamelSadok1· AmardeepMehta4· MattiasWildeman4
Daniel Rosendo
daniel.rosendo@gprt.ufpe.br
Demis Gomes
demis.gomes@gprt.ufpe.br
Guto Leoni Santos
guto.leoni@gprt.ufpe.br
Glauco Goncalves
glauco.goncalves@ufrpe.br
Andre Moreira
andre@gprt.ufpe.br
Leylane Ferreira
leylane.silva@gprt.ufpe.br
Judith Kelner
jk@gprt.ufpe.br
Djamel Sadok
jamel@gprt.ufpe.br
Amardeep Mehta
amardeep.mehta@ericsson.com
Mattias Wildeman
mattias.wildeman@ericsson.com
1 Universidade Federal de Pernambuco (UFPE), Recife, Brazil
2 Universidade Federal Rural de Pernambuco (UFRPE), Recife, Brazil
3 Universidade de Pernambuco (UPE), Recife, Brazil
4 Ericsson Research, Stockholm, Sweden
... A way to reach this objective is to arrange the data center infrastructure in modules called Performance Optimized Data Center (POD) (Figure 1.a). PODs can be defined as a collection of racks, tightly connected through a fast local network, sharing common resources, such as monitoring management, aisle containment, and Power Distribution Unit [19], [20]. With this configuration, data center operators can split their infrastructure to allocate different workloads to PODs with specific configurations, decoupling the data center space into different demands and facilitating monitoring and management [20]. ...
... PODs can be defined as a collection of racks, tightly connected through a fast local network, sharing common resources, such as monitoring management, aisle containment, and Power Distribution Unit [19], [20]. With this configuration, data center operators can split their infrastructure to allocate different workloads to PODs with specific configurations, decoupling the data center space into different demands and facilitating monitoring and management [20]. [20]. ...
... With this configuration, data center operators can split their infrastructure to allocate different workloads to PODs with specific configurations, decoupling the data center space into different demands and facilitating monitoring and management [20]. [20]. ...
Conference Paper
To meet service level agreement (SLA) requirements, the majority of enterprise IT infrastructure is typically overpro-visioned, underutilized, non-compliant and lacking in required agility resulting in significant inefficiencies. As enterprises introduce and migrate to next-generation applications designed to be horizontally scalable, they require infrastructure that can manage the duality of legacy and next generation application requirements. To address this, composable data center infrastructure disaggregates and refactors compute, storage, network and other infrastructure resources in to shared resources pools that can be "composed" and allocated on-demand. In this paper, we model the allocation of resources in a composable data center infrastructure as a bounded multidimensional knapsack and then apply multi-objective optimization algorithms, Non-dominated Sorting Genetic Algorithm (NSGA-II) and Generalized Differential Evolution (GDE3), to allocate resources efficiently. The main goal is to maximize resource availability for the application owner, while meeting minimum requirements (in terms of CPU, memory, network, and storage) within budget constraints. We consider two different scenarios to analyze heterogeneity and variability aspects when allocating resources on composable data center infrastructure.
... Despite the opportunities and benefits, the challenges and risks are still outnumbered 8,9,10 . As an important challenge, short downtime of IaaS cloud services has negative effects on different subjects ranging from economic losses to impact on human life 11 . It is reported that the data center components outage can cost the providers between $8000 and $16,000 per minute 12 . ...
... In this section, a dynamic method for modeling cloud infrastructure 11,43,44,45 is presented. The presented method applies an algorithm to use a hierarchical modeling approach proposed in Section 4.1. ...
Article
Full-text available
In cloud computing services, high availability is one of the quality of service requirements which is necessary to maintain customer confidence. High availability systems can be built by applying redundant nodes and multiple clusters in order to cope with software and hardware failures. Due to cloud computing complexity, dependability analysis of the cloud may require combining state‐based and nonstate‐based modeling techniques. This article proposes a hierarchical model combining reliability block diagrams and continuous time Markov chains to evaluate the availability of OpenStack private clouds, by considering different scenarios. The steady‐state availability, downtime, and cost are used as measures to compare different scenarios studied in the article. The heterogeneous workloads are considered in the proposed models by varying the number of CPUs requested by each customer. Both hardware and software failure rates of OpenStack components used in the model are collected via setting up a real OpenStack environment applying redundancy techniques. Results obtained from the proposed models emphasize the positive impact of redundancy on availability and downtime. Considering the tradeoff between availability and cost, system providers can choose an appropriate scenario for a specific purpose.
... A similar analytical approach has been explained in [158], where the reliability of the typical Alternating Current (AC) distribution system is compared with the Direct Current (DC) power distribution system in data centers using RBD. Only the failure of the power distribution system is considered in [158], [159], without the failures of the IT loads, while the availability of the IPCS considering the failure probability of the IT loads, including PSUs, is presented in [8]. Depending on the voltage level in the IPCS, the reliability of different IPCS structures is evaluated using RBD in [160], [161]. ...
Article
Enhancing the efficiency and the reliability of the data center are the technical challenges for maintaining the quality of services for the end-users in the data center operation. The energy consumption models of the data center components are pivotal for ensuring the optimal design of the internal facilities and limiting the energy consumption of the data center. The reliability modeling of the data center is also important since the end-user’s satisfaction depends on the availability of the data center services. In this review, the state of the art and the research gaps of data center energy consumption and reliability modeling are identified, which could be beneficial for future research on data center design, planning, and operation. The energy consumption models of the data center components in the major load sections, i.e., information technology (IT), internal power conditioning system (IPCS), and cooling load section, are systematically reviewed and classified, which reveals the advantages and disadvantages of the models for different applications. Based on this analysis and related findings, it is concluded that the availability of the model parameters and variables is more important than the accuracy, and that the energy consumption models are often necessary for data center reliability studies. Additionally, a lack of research on IPCS consumption modeling is identified, although the IPCS power losses could cause reliability issues and should be given due weight when designing the data center. The absence of a review on data center reliability analysis is also identified, which leads this paper to review the data center reliability assessment aspects; such a review is needed for ensuring the adoption of new technologies and equipment in the data center. The state of the art of the reliability indices, reliability models, and methodologies is systematically reviewed in this paper for the first time, where the methodologies are divided into two groups, i.e., analytical and simulation-based approaches. There is a lack of research on the data center cooling section reliability analysis and on the data center components’ failure data, which are identified as research gaps. In addition, the dependency between the different load sections in the reliability analysis of the data center is also covered, which shows that the service reliability of the data center is impacted by the IPCS and the cooling section.
... Regarding the reliability indices for maintaining SLA, quality of service (QoS), and capacity management, the key quality indicator (KQI) and key performance indicator (KPI) are described in [45], [46]. A further application of reliability assessment is proposed in [47] that uses "service availability" of the servers in a performance optimized DC. The reliability indices "service availability" and "service reliability" are also used in [48], [49] as a function of up-time and down-time of DC components. ...
Article
The energy demand of data centers is increasing globally with the increasing demand for computational resources to ensure the quality of services. It is important to quantify the required resources to comply with the computational workloads at the rack-level. In this paper, a novel reliability index called loss of workload probability is presented to quantify the rack-level computational resource adequacy. The index defines the right-sizing of the rack-level computational resources that comply with the computational workloads, and the desired reliability level of the data center investor. The outage probability of the power supply units and the workload duration curve of servers are analyzed to define the loss of workload probability. The workload duration curve of the rack, and hence the power consumption of the servers, is modeled as a function of server workloads. The server workloads are taken from a publicly available data set published by Google. The power consumption models of the major components of the internal power supply system are also presented, which show that the power loss of the power distribution unit is the highest compared to the other components in the internal power supply system. The proposed reliability index and the power loss analysis could be used for rack-level computational resources expansion planning and to ensure energy-efficient operation of the data center.
... Redfish Schema: services. Adapted from [4]. ...
Article
Large data centers are complex systems that depend on several generations of hardware and software components, ranging from legacy mainframes and rack-based appliances to modular blade servers and modern rack scale design solutions. To cope with this heterogeneity, the data center manager must coordinate a multitude of tools, protocols, and standards. Currently, data center managers, standardization bodies, and hardware/software manufacturers are joining efforts to develop and promote Redfish as the main hardware management standard for data centers, and even beyond the data center. The authors hope that this article can be used as a starting point to understand how Redfish and its extensions are being targeted as the main management standard for next-generation data centers. This article describes Redfish and the recent collaborations to leverage this standard.
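As a rough illustration of how a client walks a Redfish resource tree, the sketch below follows the `Systems` link from the service root at `/redfish/v1/` and lists the members of that collection. To stay self-contained it reads hand-written, hypothetical JSON payloads instead of issuing HTTPS requests; the `fetch` and `list_collection` helpers are names invented for this example, and a real service returns many more properties and requires authentication.

```python
import json

# Hand-written, hypothetical payloads standing in for HTTPS responses
# from a Redfish service; a real service returns many more properties.
FAKE_SERVICE = {
    "/redfish/v1/": json.dumps({
        "@odata.id": "/redfish/v1/",
        "Systems": {"@odata.id": "/redfish/v1/Systems"},
        "Chassis": {"@odata.id": "/redfish/v1/Chassis"},
    }),
    "/redfish/v1/Systems": json.dumps({
        "@odata.id": "/redfish/v1/Systems",
        "Members@odata.count": 2,
        "Members": [
            {"@odata.id": "/redfish/v1/Systems/1"},
            {"@odata.id": "/redfish/v1/Systems/2"},
        ],
    }),
}


def fetch(path: str) -> dict:
    """Stand-in for an authenticated HTTPS GET against the service."""
    return json.loads(FAKE_SERVICE[path])


def list_collection(root_path: str, collection: str) -> list[str]:
    """Follow a link from the service root and return the member URIs."""
    root = fetch(root_path)
    collection_path = root[collection]["@odata.id"]
    members = fetch(collection_path)["Members"]
    return [member["@odata.id"] for member in members]


system_uris = list_collection("/redfish/v1/", "Systems")
print(system_uris)
```

Replacing `fetch` with a real HTTPS client would leave `list_collection` unchanged, because the navigation relies only on the `@odata.id` links carried in each response.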
Article
The number of data centers and the energy demand are increasing globally with the development of information and communication technology (ICT). The data center operators are facing challenges to limit the internal power losses and the unexpected outages of the computational resources or servers. The power losses of the internal power supply system (IPSS) increase with the number of servers, which causes a power supply capacity shortage for the devices in the IPSS. The aim of this paper is to address the outage probability of the computational resources or servers due to the power supply capacity shortage of the power distribution units (PDUs) in the IPSS. The server outage probability at rack-level defines the service availability of the data center, since the servers are its main computational resource. The overall availability of the IPSS and the power consumption models of the IPSS devices are also presented in this paper. Quantitative studies are performed to show the impacts of the power losses on the service availability and the overall availability of the IPSS for two different IPSS architectures, which are equivalent to the Tier I and Tier IV models of the data center.
Article
Next‐generation cloud data centers are based on software‐defined data center infrastructures that promote flexibility, automation, optimization, and scalability. The Redfish standard and the Intel Rack Scale Design technology enable software‐defined infrastructure and disaggregate bare‐metal compute, storage, and networking resources into virtual pools to dynamically compose resources and create virtual performance‐optimized data centers (vPODs) tailored to workload‐specific demands. This article proposes four chassis design configurations based on Distributed Management Task Force's Redfish industry standard applied to compose vPOD systems, namely, a fully shared design, partially shared homogeneous design, partially shared heterogeneous design, and not shared design; their main difference is based on the used hardware disaggregation level. Furthermore, we propose models that combine reliability block diagram and stochastic Petri net modeling approaches to represent the complexity of the relationship between the pool of disaggregated hardware resources and their power and cooling sources in a vPOD. These four proposed design configurations were analyzed and compared in terms of availability and component's sensitivity indexes by scaling their configurations considering different data center infrastructure. From the obtained results, we can state that, in general, when one increases the hardware disaggregation, availability is improved. However, after a given point, the availability level of the fully shared, partially shared homogeneous, and partially shared heterogeneous configurations remain almost equal, while the not shared configuration is still able to improve its availability.
Article
The ever-growing data traffic volume inside data centers caused by the popularization of cloud services and edge computing demands scalable and cost-efficient network infrastructures. With this premise, optical interconnects have recently gained more and more research attention as a key building block to ensure end-to-end energy efficient solutions, offering high throughput, low latency and reduced energy consumption compared to current networks based on active optical cables. An efficient way of implementing such optical interconnects is to make use of multi-core fibers (MCFs), which enable the multiplexing of several spatial channels, each using a different core inside the same fiber cladding. Moreover, non-orthogonal multiple access combined with multi-band carrierless amplitude and phase modulation (NOMA-CAP) has been recently proposed as a potential candidate to increase the network capacity and to enable efficient and flexible resource management. In this paper, using direct detection we experimentally demonstrate the transmission of NOMA-CAP signals through a 2 km MCF with 7 spatial channels for high capacity optical interconnect applications. The results show negligible transmission penalty for different total aggregated traffics ranging from 350 Gb/s to 630 Gb/s.
Article
Cloud computing is an evolving paradigm with tremendous momentum. Performance is a major challenge in providing cloud services, and performance management is a prerequisite to meet quality objectives in clouds. Although many studies have addressed this issue, there is a lack of analysis on management dimensions, challenges and opportunities. As an attempt toward compensating for this shortage, this work first gives a review of performance management dimensions in clouds. Moreover, a taxonomic scheme has been devised to classify the recent literature, help to standardize the problem and highlight commonalities and deviations. Afterward, an autonomic and integrated performance management framework is proposed. The proposed framework enables cloud providers to realize optimization schemes without major changes. The practicality and effectiveness of the proposed framework have been demonstrated by a prototype implementation on top of CloudSim. Experiments present promising results in terms of performance improvement and management. Finally, open issues, opportunities and suggestions are presented.
Article
Memory allocation has a major influence on multiuser systems, cloud-based services, virtual machines, and other computer systems. Memory allocation is a process that assigns physical or virtual memory space to programs and services as efficiently and quickly as possible. Economical memory allocation management needs allocation strategies with minimum wastage. In this paper, we introduce a new memory allocation algorithm based on sequential fits and zoning for on-demand (online) cloud services. The memory is divided into multiple zones, where a subgroup of relative request sizes compete in reverse order. We use simulation to compare our new mechanism with existing memory allocation methods that have been deployed using Amazon Elastic Compute Cloud as a test bed. The proposed algorithm is more efficient, and the average saving for the normalized revenue loss is about 7% better than best-fit and 15% better than first-fit memory allocation. In addition, we show that the proposed algorithm is robust and faster and has a fairness index that is superior to that of existing techniques.
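The two baselines named in the abstract, first-fit and best-fit, can be sketched as follows. This shows only the classic sequential-fit strategies over a list of free block sizes, not the zoning algorithm proposed in the paper; the block sizes and request stream are hypothetical.

```python
from typing import Optional


def first_fit(free_blocks: list[int], request: int) -> Optional[int]:
    """Return the index of the first free block large enough, else None."""
    for index, size in enumerate(free_blocks):
        if size >= request:
            return index
    return None


def best_fit(free_blocks: list[int], request: int) -> Optional[int]:
    """Return the index of the smallest free block large enough, else None."""
    candidates = [(size, index) for index, size in enumerate(free_blocks) if size >= request]
    if not candidates:
        return None
    # min() compares sizes first, so the tightest block wins;
    # ties are broken by the lower index.
    return min(candidates)[1]


def allocate(free_blocks: list[int], request: int, strategy) -> bool:
    """Carve `request` units out of the chosen block, in place."""
    index = strategy(free_blocks, request)
    if index is None:
        return False
    free_blocks[index] -= request
    return True


# Hypothetical free-block sizes and request stream, for illustration only.
requests = [200, 400, 450]
for strategy in (first_fit, best_fit):
    blocks = [500, 200, 450]
    served = [allocate(blocks, r, strategy) for r in requests]
    print(strategy.__name__, served, blocks)
```

With these particular numbers, first-fit splits the 500-unit block for the first request and later cannot serve the 450-unit request, whereas best-fit serves all three. Other request streams can favor first-fit, so the example illustrates the mechanism rather than a general ranking.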
Article
Emergency call services are expected to be highly available in order to minimize the loss of urgent calls and, as a consequence, minimize loss of life due to lack of timely medical response. This service availability depends heavily on the cloud data center on which it is hosted. However, availability information alone cannot provide sufficient understanding of how failures impact the service and users' perception. In this paper, we evaluate the impact of failures on an emergency call system, considering service‐level metrics such as the number of affected calls per failure and the time an emergency service takes until it recovers from a failure. We analyze a real data set from an emergency call center for a large Brazilian city. From stochastic models that represent a cloud data center, we evaluate different data center architectures to observe the impact of failures on the emergency call service. Results show that changing data center's architecture in order to improve availability from two to three nines cannot decrease the average number of affected calls per failure. On the other hand, it can decrease the probability to affect a considerable number of calls at the same time.
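To make the "two to three nines" comparison concrete, the following sketch converts a number of nines into steady-state availability and expected downtime per year. It assumes a 365-day year and is plain arithmetic, not a reproduction of the stochastic models used in the paper.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def availability_from_nines(nines: int) -> float:
    """Two nines -> 0.99, three nines -> 0.999, and so on."""
    return 1.0 - 10.0 ** (-nines)


def downtime_minutes_per_year(nines: int) -> float:
    """Expected unavailable minutes per year at the given number of nines."""
    return MINUTES_PER_YEAR * 10.0 ** (-nines)


for nines in (2, 3):
    minutes = downtime_minutes_per_year(nines)
    print(f"{nines} nines: A={availability_from_nines(nines):.3f}, "
          f"downtime={minutes:.1f} min/yr ({minutes / 60:.2f} h/yr)")
```

Two nines correspond to about 87.6 hours of downtime per year and three nines to about 8.76 hours. These figures say nothing about how the downtime is split into individual failures, which is why the abstract reports that the number of affected calls per failure need not decrease.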
Article
Different challenges are facing the adoption of cloud-based applications, including high availability (HA), energy, and other performance demands. Therefore, an integrated solution that addresses these issues is critical for cloud services. Cloud providers promise the HA of their infrastructure while cloud tenants are encouraged to deploy their applications across multiple availability zones with different reliability levels. Moreover, the environmental and cost impacts of running the applications in the cloud, which both the cloud providers and tenants intend to reduce, are an integral part of incorporated responsibility. Hence, a formal and analytical stochastic model is needed for both the tenants and providers to quantify the expected availability offered by an application deployment. If multiple deployment options can satisfy the HA requirement, the question remains: how can we choose the deployment that satisfies the other provider and tenant requirements? For instance, choosing data centers with low carbon emissions can both reduce the environmental footprint and potentially earn carbon tax credits that lessen the operational cost. Therefore, this paper proposes a cloud scoring system and integrates it with a Stochastic Petri Net model. While the Petri Net model evaluates the availability of cloud application deployments, the scoring system selects the optimal HA-aware deployment in terms of energy, operational expenditure (OPEX), and other norms. We illustrate our approach with a use case that shows how we can use the various deployment options in the cloud to satisfy both the cloud tenant and provider needs.
Conference Paper
Data centers are at the core of modern software technology and play a crucial role in defining the capabilities and overall success of an organization. The massive growth in size and scale of data and computing leads to an enormous growth in the size and complexity of clusters and data centers. Therefore, traditional management methods and standards like Intelligent Platform Management Interface (IPMI) are no longer sufficient to manage these modern scalable data centers. Redfish is a new standard for managing hardware in modern data centers and is anticipated to meet the expectations of end users to provide simple, effective, and secure management of scalable platform hardware. It is essential to validate Redfish's capability regarding the performance, scalability and security aspects as defined in the Redfish Specification. To validate Redfish services, we have designed a Redfish Conformance Test Tool (RCTT), which tests compliance and reinstates faith in the viability of Redfish for meeting customer expectations.
Article
Guaranteeing high levels of availability is a huge challenge for cloud providers. The authors look at the causes of cloud failures and recommend ways to prevent them and to minimize their effects when they occur.