Quantifying Cloud Performance and Dependability:
Taxonomy, Metric Design, and Emerging Challenges
NIKOLAS HERBST, ANDRÉ BAUER, and SAMUEL KOUNEV, Universität Würzburg, Germany
GIORGOS OIKONOMOU and ERWIN VAN EYK, TU Delft, the Netherlands
TIM BRECHT, University of Waterloo, Canada
CRISTINA L. ABAD, Escuela Superior Politécnica del Litoral, Ecuador
ALEXANDRU IOSUP, Vrije Universiteit Amsterdam and TU Delft, the Netherlands
In only a decade, cloud computing has emerged from a pursuit for a service-driven information and
communication technology (ICT), becoming a significant fraction of the ICT market. Responding to the growth of the
market, many alternative cloud services and their underlying systems are currently vying for the attention of
cloud users and providers. To make informed choices between competing cloud service providers, permit the
cost-benefit analysis of cloud-based systems, and enable system DevOps to evaluate and tune the performance
of these complex ecosystems, appropriate performance metrics, benchmarks, tools, and methodologies are
necessary. This requires re-examining old and considering new system properties, possibly leading to the
re-design of classic benchmarking metrics such as expressing performance as throughput and latency (response
time). In this work, we address these requirements by focusing on four system properties: (i) elasticity of the
cloud service, to accommodate large variations in the amount of service requested, (ii) performance isolation
between the tenants of shared cloud systems and resulting performance variability, (iii) availability of cloud
services and systems, and (iv) the operational risk of running a production system in a cloud environment.
Focusing on key metrics for each of these properties, we review the state-of-the-art, then select or propose
new metrics together with measurement approaches. We see the presented metrics as a foundation towards
upcoming, future industry-standard cloud benchmarks.
CCS Concepts: • General and reference → Cross-computing tools and techniques; • Cloud computing;
• Computer systems organization → Self-organizing autonomic computing; • Software and its engineering → Virtual machines;
Additional Key Words and Phrases: Metrics, Cloud, Benchmarking, Elasticity, Performance Isolation,
Performance Variability, Availability, Operational Risk
This work was co-funded by the German Research Foundation (DFG) under grant No. KO 3445/11-1, and by the Netherlands
NWO grants MagnaData and COMMIT; it has been supported by the CloudPerfect project and received funding from the
EU H2020 research and innovation programme, topic ICT-06-2016: Cloud Computing, under grant agreement No 73225.
This research has been supported by the Research Group of the Standard Performance Evaluation Corporation (SPEC).
The authors would like to thank Kai Sachs (SAP), Klaus-Dieter Lange (HPE), and Manoj K. Nambiar (Tata
Consulting) for reviewing.
Authors’ addresses: Nikolas Herbst; André Bauer; Samuel Kounev, Universität Würzburg, Germany; Giorgos Oikonomou;
Erwin van Eyk, TU Delft, the Netherlands; George Kousiouris; Athanasia Evangelinou, Nat’l. Tech. U. of Athens, Greece;
Rouven Krebs, SAP SE, Germany; Tim Brecht, University of Waterloo, Canada; Cristina L. Abad, Escuela Superior Politécnica
del Litoral, Ecuador; Alexandru Iosup, Vrije Universiteit Amsterdam and TU Delft, the Netherlands.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the
full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from
©2017 Copyright held by the owner/author(s). Publication rights licensed to the Association for Computing Machinery.
2376-3639/0000/0-ART1 $15.00
ACM Transactions on Modeling and Performance Evaluation of Computing Systems, Vol. 1, No. 1, Article 1. Publication
date: 0000.
Preprint for Personal Use only
ACM Reference Format:
Nikolas Herbst, André Bauer, Samuel Kounev, Giorgos Oikonomou, Erwin van Eyk, George Kousiouris,
Athanasia Evangelinou, Rouven Krebs, Tim Brecht, Cristina L. Abad, and Alexandru Iosup. 0000. Quantifying
Cloud Performance and Dependability: Taxonomy, Metric Design, and Emerging Challenges. ACM Trans.
Model. Perform. Eval. Comput. Syst. 1, 1, Article 1 ( 0000), 35 pages.
Cloud computing is a paradigm under which ICT services are offered on demand “as a service”,
where resources providing the service are dynamically adjusted to meet the needs of a varying
workload. Over the last decade, cloud computing has become increasingly important for the
information and communication technology (ICT) industry. Cloud applications already represent
a significant share of the entire ICT market in Europe [ ] (with similar fractions expected for
North America, Middle East, and Asia). By 2025, over three-quarters of global business and personal
data may reside in the cloud, according to a recent IDC report [ ]. This promising growth trend
makes clouds an interesting new target for benchmarking, with the goal of comparing, tuning,
and improving the increasingly large set of cloud-based systems and applications, and the cloud
fabric itself. However, traditional approaches to benchmarking may not be well suited to cloud
computing environments. In classical benchmarking, system performance metrics are measured on
systems-under-test (SUTs) that are well-defined, well-behaved, and often operating on a fixed or at
least pre-defined set of resources. In contrast, cloud-based systems add new, different challenges to
benchmarking, because they can be built using a rich yet volatile combination of infrastructure,
platforms, and entire software stacks, which recursively can be built out of cloud systems and
offered as cloud services. For example, to allow its subscribers to browse the offered videos and then to
watch the selected ones on TVs, smart-phones, and other devices, Netflix utilizes a cloud service
that provides its front-end services, operates its own cloud services to generate different bit-rate
and device-specific encodings, and leverages many other cloud services from monitoring to payment.
A key to benchmarking the rich service and resource tapestry that characterizes many cloud
services and their underlying ecosystems is the re-definition of traditional benchmarking metrics
for cloud settings, and the definition of new metrics that are unique to cloud computing. This is the
focus of our work, and the main contribution of this article.
Academic studies, concerned public reports, and even company white papers indicate that a
variety of new operational and user-centric properties of system quality (i.e., non-functional properties)
are important in cloud settings. We consider four such properties. First, cloud systems are expected
to deliver an illusion of infinite capacity and capability, yet to appear perfectly elastic so as to offer
important economies of scale. Second, cloud services and systems have been shown to exhibit high
performance variability [ ], against which modern cloud users now expect protection (performance
isolation). Third, also because the recursive nature of cloud services can lead to cascading failures
across multiple clouds when even one fails, increasingly more demanding users expect that the
availability of cloud services is nearly perfect, and that even a few unavailability events will
cause significant reputation and pecuniary damage to a cloud provider. Fourth, as the risks of not
meeting implicit user-expectations and explicit service contracts (service level agreements, SLAs)
are increasing with the scale of cloud operations, cloud providers have become increasingly more
interested in quantifying, reducing, and possibly reporting their operational risk.
With the market growing and maturing, many cloud services are now competing to retain
existing and to attract new customers. Consequently, being able to benchmark, quantify, and
compare the capabilities of competing systems is increasingly important. We first examine the
research question: For the four properties of cloud services we consider, can existing metrics be applied
to cloud computing environments and used to compare services? Responding to this question, our
survey of the state-of-the-art (see Section 9) indicates that the existing body of work on (cloud)
performance measurement and assessment, albeit valuable, does not satisfactorily address the
question, and in particular the existing metrics leave important conceptual and practical gaps in
quantifying elasticity, performance isolation and variability, availability, and operational risk for
cloud services. Therefore, we propose the main research question investigated in this work:
Q: Which new metrics can be useful to measure, examine, and compare cloud-based systems, for
the four properties we consider in this work?
Both designing new metrics and proving their usefulness (possibly even in comparison with
previous metrics) are non-trivial. Generally, metrics are designed to quantify a specific property of
interest, and become useful only when shared by a diverse group of stakeholders. Metrics must
address their design goals, which include domain-specific aspects, such as providing unique insights
into the performance of cloud settings, and general aspects, such as the characteristics expected of
a good metric that we explain in Section 7. As with many other designed artifacts, proving that one
metric is correct, or that one metric is better than another, is not merely a deductive or quantitative
process. Design often satisfies a need, rather than solving exactly (i.e., correctly) a problem.
Different metrics are designed for different aspects and use cases; for example, of the commonly used
temperature metrics Celsius (interval scale) and Kelvin (ratio scale), neither can be considered
better than the other. Even when quantifying the same coarse-grained aspect, metrics can
differ in some essential, albeit fine-grained, detail; for example, even the traditional utilization
metric can offer different facets when considering heterogeneous resources and services.
Although elasticity, performance isolation and variability, availability, and operational risk are
already perceived as important aspects in academia and by the industry, they have not yet been
thoroughly defined and surveyed. As we show in this work, their meaning may be different for
different stakeholders, and in some cases existing definitions are inconsistent or even contrary
to each other. Towards answering the main research question, our main contribution is four-fold.
Each contribution focuses on defining the foundations of benchmarking one property of system
quality in cloud settings, and on showing exemplary applications to realistic cloud settings:
(1) Elasticity, addressed in Section 3. Elasticity offers the opportunity to automatically adapt
the resource supply to a changing demand. Observing high response times during resource
congestion or a low resource utilization during excess use allows only for an indirect, narrow
understanding of the accuracy and timeliness of elastic adaptations. We present an extended
set of metrics and methods for combining them to capture the accuracy and timing aspects of
elastic platforms explicitly. We refine the elasticity metrics from previous work [ ] and
introduce a well-defined uniform instability metric as well as the elastic deviation.
(2) Performance isolation & variability, addressed in Section 4. The underlying cloud infrastructure
should isolate different customers sharing the same hardware from each other with regard to
the performance they observe. We summarize how the impact of disruptive workloads on
specified performance guarantees connected with a cloud service offering can be quantified, as
discussed in previous work [ ]. In addition, we propose a new metric capturing performance
variability seen as an effect of imperfect isolation.
(3) Availability, addressed in Section 5. We analyze the availability definitions used by various
cloud providers, quantify the availability of their business-critical cloud applications, and
compare them for different contexts. We then define a metric of SLA adherence that enables
direct comparisons between providers with otherwise different definitions of availability.
(4) Operational risk, addressed in Section 6. On a more general level than the other
features, we also focus in this work on estimating different types of operational risks that are
connected with running software in the cloud. We define here various relevant metrics, and
a measurement methodology that addresses them.
Because the practicality and utility of new metrics will be determined by whether or not they
are adopted and how widely they are used, with this work we aim not only to answer the main
research question, but also to: (i) initiate discussions and provide a foundation regarding possible
performance metrics, evaluation methodologies, and challenges that arise in cloud computing
environments; (ii) examine methods for comparing the performance, dependability, and other
non-functional properties of competing cloud offerings and technologies; (iii) influence emerging
standardization efforts; and (iv) encourage providers of cloud computing services to support and
make available monitoring of such non-functional properties, as a critical piece of infrastructure
that is often lacking in current practice. Toward these goals (i) - (iv) and in addition to the four
main contributions (1) - (4) mentioned above, we propose (I) in Section 2, a taxonomy of useful
cloud metrics, and (II) in Section 8, an in-depth discussion of open challenges that arise mainly in the
context of containerization and micro-services as new technologies in the cloud computing domain.
Last, (III) we review related work on cloud metrics, in Section 9.
In this section, we propose a cloud-metric hierarchy that reflects four different levels of abstraction
in cloud performance measurement and assessment. Whereas in mathematics, the term metric
is explicitly distinguished from the term measure (the former referring to a distance function), in
computer science and engineering, the terms metric and measure overlap in meaning and are often
used interchangeably. One way to distinguish between them is to look at metrics as values that
can be derived from some fundamental measurements comprising one or more measures. In cloud
settings, only a subset of the metrics can be measured directly at runtime by an end-user. Other
metrics require detailed knowledge and controlled experiments, and thus need to be reported and
measured using standards shared by cloud providers and infrastructure customers. Because there
can be many or even endless derivations, and because such standards do not yet exist, it is useful
to design a taxonomy of useful cloud metrics. We design such a taxonomy and depict it in Figure 1.
We now address each layer, in turn.
The rst layer includes Traditional Metrics for performance and dependability, such as end-to-
end response time, average resource utilization, and throughput rates, are key instruments in
the performance evaluation of cloud computing services. As we discuss in Section 8.1, to ensure
Operational risk(C), To ta l cost of ownership(E),
Metrics for
SLO Violation rates(E),
service costs(E), ...
Policy Metrics
Performance variability(E),
resource availability(E),
Cloud Infrastructure
Throughput rates(E),
end-to-end response
times(E) ,
Tra diti onal Per form ance Me tric s
Aggregate metrics(C),
e.g., unit-free scores,
speedup ratios, …
Performance isolation(P),
elasticity & scalability(C),
energy efficiency(P),
Resource utilization
averages(P), latency(P),
congestion times(P),
Metrics to be measured
by provider(P) or IaaS customer(C) Metrics measurable for end-user(E)
Fig. 1. A hierarchical taxonomy of cloud metrics.
ACM Transactions on Modeling and Performance Evaluation of Computing Systems, Vol. 1, No. 1, Article 1. Publication
date: 0000.
Preprint for Personal Use only
antifying Cloud Performance and Dependability 1:5
well-dened and reproducible measurements, addressing the challenge of transparent placement of
measurement sensors for these metrics has become more important in a cloud computing context.
We propose to group on the next-higher level the set of Cloud Infrastructure Metrics that cover
various aspects of non-functional quality with a focus on the operation of the infrastructure,
such as energy efficiency, or the capability of the infrastructure to isolate the performance of hosted
virtual machines and provide a certain degree of performance stability. This layer provides metrics
useful in defining service level objectives, such as performance isolation and elasticity, efficiency,
and operational risk. These non-functional properties of a cloud environment or cloud service
deployment are derived measures based on the Traditional Metrics. For example, Traditional
Metrics characterizing a system using a response time distribution can be used to derive Cloud
Infrastructure Metrics that inform about the quality of isolation between tenants, or whether the system’s
resource management follows the demand for resources well enough for efficient usage.
The group of Policy Metrics, the third layer, targets quantities relevant to an end-user. Examples
include assessing the fulfillment of SLOs as experienced by users; and ranking and selecting cloud
providers using unit-free scores aggregating multiple Cloud Infrastructure Metrics.
The topmost layer is comprised of Metrics for Managerial Decisions, which uses deep aggregates
to quantify for example the risk and cost of running arbitrary cloud services in arbitrary cloud
Elasticity was originally defined in physics as a material property capturing the capability
of returning to its original state after a deformation. In economics, elasticity captures the effect
of change in one variable on another dependent variable. In both cases, elasticity is an intuitive
concept and can be precisely described using mathematical formulas.
The concept of elasticity has been transferred to the context of clouds, and is commonly
considered a central attribute of the cloud paradigm [ ]. The term is heavily used in providers’
advertisements and even in the naming of specific products or services. Even though tremendous
efforts are invested to enable cloud systems to behave in an elastic manner, no common and precise
understanding of this term in the context of cloud computing has been established so far, and no
industry-standard ways have been proposed to quantify and compare elastic behavior.
3.1 Prerequisites
The scalability of a system, including all hardware, virtualization, and software layers within its
boundaries, is a prerequisite for speaking of elasticity. Scalability is the ability of a system to sustain
increasing workloads with adequate performance, provided that hardware resources are added. In
the context of distributed systems, it has been defined by Jogalekar and Woodside [42], as well as
in the works of Duboc et al. [19], where a measurement methodology is also proposed.
The existence of at least one adaptation process is typically assumed. The process is normally
automated, but it could contain manual steps. Without a defined adaptation process, a scalable
system cannot scale in an elastic manner, as scalability on its own does not include temporal aspects.
When evaluating elasticity, the following points need to be checked beforehand:
Automated Scaling: What adaptation process is used for automated scaling?
Elasticity Dimensions: What is the set of resource types scaled as part of the adaptation process?
Resource Scaling Units: For each resource type, in what unit is the amount of allocated
resources varied?
Scalability Bounds: For each resource type, what is the upper bound on the amount of resources
that can be allocated?
3.2 Definition
Elasticity is the degree to which a system is able to adapt to workload changes by provisioning
and de-provisioning resources in an autonomic manner, such that at each point in time the
available resources match the current demand as closely as possible.
Dimensions and Core Aspects. Any given adaptation process is defined in the context of at least
one or possibly multiple types of resources that can be scaled up or down as part of the adaptation.
Each resource type can be seen as a separate dimension of the adaptation process with its own
elasticity properties. If a resource type consists of other resource types, as in the case of a virtual
machine having assigned CPU cores and memory, elasticity can be considered at multiple levels.
Normally, resources of a given resource type can only be provisioned in discrete units like CPU
cores, VMs, or physical nodes. For each dimension of the adaptation process with respect to a
specific resource type, elasticity captures the following core aspects of the adaptation:
The timing aspect is captured by the percentages of time a system is in an under-provisioned,
over-provisioned or perfect state and by the oscillation of adaptations.
The accuracy of scaling is defined as the average relative deviation of the current amount
of allocated resources from the actual resource demand.
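The two core aspects above can be illustrated with a minimal Python sketch. It assumes that demand and supply have already been sampled at equally spaced points in time for a single resource type; the function name, the dictionary keys, and the use of the number of supply changes as a proxy for oscillation are illustrative assumptions, not the metric definitions given later in the paper.

```python
from typing import Sequence


def elasticity_core_aspects(demand: Sequence[float], supply: Sequence[float]) -> dict:
    """Summarize timing and accuracy for equally spaced samples of one resource type.

    demand[i] and supply[i] are the demanded and allocated resource units
    (e.g., VMs) in sampling interval i. All names here are hypothetical.
    """
    if len(demand) != len(supply) or not demand:
        raise ValueError("demand and supply must be non-empty and equally long")
    if any(d <= 0 for d in demand):
        raise ValueError("this sketch assumes a positive demand in every interval")

    n = len(demand)
    # Timing aspect: share of intervals spent in each provisioning state.
    under = sum(1 for d, s in zip(demand, supply) if s < d)
    over = sum(1 for d, s in zip(demand, supply) if s > d)
    perfect = n - under - over

    # Accuracy aspect: average relative deviation of supply from demand.
    deviation = sum(abs(s - d) / d for d, s in zip(demand, supply)) / n

    # Oscillation proxy (assumption): how often the allocated amount changes.
    adaptations = sum(1 for a, b in zip(supply, supply[1:]) if a != b)

    return {
        "under_share": under / n,
        "over_share": over / n,
        "perfect_share": perfect / n,
        "avg_relative_deviation": deviation,
        "adaptations": adaptations,
    }


result = elasticity_core_aspects(demand=[2, 2, 4, 4], supply=[2, 3, 3, 4])
print(result["avg_relative_deviation"])  # 0.1875
```

In this toy run, the system is over-provisioned in one interval and under-provisioned in another, so each of these states accounts for a quarter of the observed time.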
A direct comparison between two systems in terms of elasticity is only possible if the same
resource types (measured in identical units) are scaled. To evaluate the actual elasticity in a given
scenario, one must define the criterion through which the amount of provisioned resources is
considered to match the actual demand needed to satisfy the system’s given performance
requirements. Based on such a matching criterion, specific metrics that quantify the above mentioned
core aspects, as discussed in more detail in Section 3.5, can be defined to quantify the practically
achieved elasticity in comparison to the hypothetical optimal elasticity. The latter corresponds to the
hypothetical case where the system is scalable with respect to all considered elasticity dimensions
without any upper bounds on the amount of resources that can be provisioned and where resources
are provisioned and de-provisioned immediately as they are needed, exactly matching the actual
demand at any point in time. Optimal elasticity, as defined here, would only be limited by the
granularity of resource scaling units.
Dierentiation. This paragraph discusses the conceptual dierences between elasticity and the
related terms scalability and eciency.
is a prerequisite for elasticity, but it does not consider temporal aspects of how fast,
how often, and at what granularity scaling actions can be performed. Scalability is the ability
of the system to sustain increasing workloads by making use of additional resources, and
therefore, in contrast to elasticity, it is not directly related to how well the actual resource
demands are matched by the provisioned resources at any point in time.
expresses the amount of resources consumed for processing a given amount of work.
In contrast to elasticity, eciency is directly linked to resource types that are scaled as part of
the system’s adaptation mechanisms. Normally, better elasticity results in higher eciency.
This implication does not apply in the other direction, as eciency can be inuenced by
other factors (e.g., dierent implementations of the same operation).
3.3 Related Elasticity Metrics and Measurement Methodologies
In this section, we group existing elasticity metrics and benchmarking approaches according to
their perspective and discuss their shortcomings.
3.3.1 Related Elasticity Metrics. Several metrics for elasticity have been proposed so far in the literature:
The “scaling latency” metrics [ ] or the “provisioning interval” [ ] capture the pure time
to bring up or drop a resource. This duration is a technical property of an elastic environment
independent of the timeliness and accuracy of demand changes, and thus independent of the elasticity
mechanism itself that decides when to trigger a reconfiguration. We consider these metrics
insufficient to fully characterize the elasticity of a platform.
The “reaction time” metric that we proposed at an early stage of our research, in a 2011
report [ ], can only be computed if a unique mapping between resource demand changes and
supply changes exists. This assumption does not hold, especially for proactive elasticity mechanisms
or for mechanisms that have unstable (alternating) states.
The “elastic speedup” metric proposed by SPEC OSG in their report [ ] relates the processing
capability of a system at different scaling levels. This metric - contrary to intuition - does not
capture the dynamic aspects of elasticity and is regarded as a scalability metric.
The integral-based “agility” metric also proposed by SPEC OSG [ ] compares the demand and
supply over time normalized by the average demand. They state that the metric becomes invalid in
cases where service level objectives (SLOs) are not met. This “agility” metric has not been included
as part of the SPEC Cloud IaaS 2016 benchmark¹. This metric resembles an early version of our
proposed “precision” metric [ ]. We propose a refined version normalized by time to capture the
accuracy aspect of elastic adaptations, considering also situations when SLOs are not met.
The approaches in the work of Binnig et al., Cooper et al., Almeida et al. and Dory et al. [ ]
characterize elasticity indirectly by analyzing response times for significant changes
or for SLO compliance. In theory, perfect elasticity would result in constant response times for
varying arrival rates. In practice, detailed reasoning about the quality of platform adaptations based
on response times alone is hampered by the lack of relevant information about the platform
behavior, e.g., the information about the amount of provisioned surplus resources.
Becker et al. [ ] introduced in 2015 the “mean-time-to-repair” in the context of elasticity as the
time the system needs on average to step out of an imperfectly provisioned state. This
“mean-time-to-repair” is estimated indirectly based on analyzing for how long the request response
times violate a given SLO or a system runs below a target utilization - in other words, without
knowing the exact demand for resources. Furthermore, the “mean-time-to-repair” metric comes
with the assumption that for each defect a unique repair is given. In practice, based on experimental
experience [ ], this assumption does not hold. It is a common case that within one repair
phase multiple further defects and/or repair actions can occur, as illustrated in Figure 2. In these
cases, this metric is not completely defined, with the result that metric values might be computed
based on different interpretations and thus can be misleading.
Numerous cost-based elasticity metrics have been proposed so far [ ]. They
quantify the impact of elasticity either by comparing the resulting provisioning costs to the costs
for a peak-load static resource assignment or the costs of a hypothetical perfect elastic platform. In
both cases, the resulting metrics strongly depend on the underlying cost model, as well as on the
assumed penalty for under-provisioning, and thus do not support fair cross-platform comparisons.
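The dependence on the cost model can be made concrete with a small Python sketch. It assumes a hypothetical linear cost model (a fixed price per allocated resource unit and interval, plus a fixed penalty per missing unit and interval); the function names, parameters, and numbers are illustrative only and do not reproduce any of the cited cost-based metrics.

```python
from typing import Sequence


def provisioning_cost(demand: Sequence[float], supply: Sequence[float],
                      unit_price: float, penalty: float) -> float:
    """Cost of one run under a hypothetical linear cost model.

    Every allocated unit costs unit_price per interval; every missing unit
    (demand above supply) costs an additional penalty per interval.
    """
    paid = sum(supply) * unit_price
    missing = sum(max(d - s, 0) for d, s in zip(demand, supply))
    return paid + missing * penalty


def cost_ratio_vs_static_peak(demand: Sequence[float], supply: Sequence[float],
                              unit_price: float, penalty: float) -> float:
    """Ratio of the elastic run's cost to a static assignment sized for peak demand."""
    static_supply = [max(demand)] * len(demand)
    static_cost = provisioning_cost(demand, static_supply, unit_price, penalty)
    return provisioning_cost(demand, supply, unit_price, penalty) / static_cost


demand = [1, 2, 4, 2]
platform_a = [1, 2, 3, 2]  # misses one unit at the peak
platform_b = [2, 3, 4, 3]  # never misses demand, but holds surplus units

for penalty in (0.0, 10.0):
    ratio_a = cost_ratio_vs_static_peak(demand, platform_a, 1.0, penalty)
    ratio_b = cost_ratio_vs_static_peak(demand, platform_b, 1.0, penalty)
    print(penalty, ratio_a, ratio_b)
```

Without a penalty, platform A appears cheaper (0.5 versus 0.75 of the static peak cost); with a penalty of 10 per missing unit, the order reverses (1.125 versus 0.75), although the observed behavior of both platforms is unchanged.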
We propose a set of elasticity metrics explicitly covering the timeliness, accuracy and stability
aspects of a deployed elasticity mechanism (a.k.a. auto-scaler) together with three different ways
to aggregate them in a possibly weighted manner. This set of core elasticity metrics can of course be
complemented by user-oriented measures like response-time distributions, percentage of SLO
violations, a cost measure by applying a cost model, or accounted and charged resource instance
¹ SPEC Cloud IaaS 2016:
3.3.2 Elasticity Measurement Approaches. As a high number of auto-scaling mechanisms have
been proposed in the literature over the course of the last decades, many of them come with individual
and tailored evaluation approaches. We observe that in most cases an experimental comparison
against other state-of-the-art mechanisms is omitted and the mechanism’s effectiveness is justified
by show-cases of improvements in service level compliance. The field of resource management
algorithms has also been flooded by approaches only evaluated in simulation frameworks. A broader
simulated analysis of auto-scalers can be found in the work of Papadopoulos et al. [ ]. A number
of approaches for elasticity evaluation - in some cases only on an abstract level - can be found in the
literature [3,15,18,26,40,63,66,68,73]:
In the work of Folkerts et al. [ ], the requirements for a cloud benchmark are described on
an abstract level. The work of Suleiman [ ], Weinman [ ], Shawky and Ali [ ] as well as the
work of Islam et al. [ ] have in common their end-user perspective on elasticity and quantify
costs or service level compliance. These are important end-user metrics, but we argue that good
results in these metrics are indirectly affected by elasticity mechanisms. Timeliness, accuracy and
stability of resource adaptations are not considered although those are the core aspects of elastic
behavior. Furthermore, these approaches account neither for differences in the performance of the
underlying physical resources, nor for the possibly non-linear scalability of the evaluated platform.
As a consequence, elasticity evaluation is not performed in isolation from these related platform
attributes. In contrast, the approach proposed in this work uses the results from an initial systematic
scalability and performance analysis to adapt the generated load profile for evaluating elasticity, in
such a way that differences in the platform performance and scalability are factored out.
Another major limitation of existing elasticity evaluation approaches is that systems are subjected to load intensity variations that are not representative of real-life workload scenarios. For example, in the work of Dory et al. [ ] and of Shawky and Ali [ ], the direction of scaling the workload downwards is completely omitted. In the work of Islam et al. [ ], sinus-like load profiles with plateaus are employed. Real-world load profiles exhibit a mixture of seasonal patterns, trends, bursts and noise. We account for the generic benchmark requirement "representative" [ ] by employing the load profile modeling formalism DLIM as presented in the article [71].
3.4 Scalability Analysis as Preprocessing Step
To capture the curve of demanded resource units over an intensity-varying workload, the scalability of the system needs to be analyzed as a preprocessing step. As implemented in the BUNGEE Cloud Elasticity Benchmark [ ], a scalability analysis results in a discrete function that assigns to each workload intensity the minimal amount of resources satisfying the given SLOs. The workload intensity can be specified either as the number of workload units (e.g., user requests) present at the system at the same time (concurrency level), or as the number of workload units that arrive per unit of time (arrival rate). For the derivation of the mapping step function, it is proposed to conduct a binary search for finding the maximum sustainable load intensity per configuration. The characterization of the scalability per scaling dimension limits the evaluation of elasticity to one dimension at a time.
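The binary search described above can be sketched as follows. This is a minimal illustration, not the BUNGEE implementation: the function names, the integer intensity range, and the toy capacity table are all hypothetical, and the sketch assumes that SLO compliance is monotonic in the load intensity (once an intensity violates the SLO, every higher intensity does too).

```python
from typing import Callable, Dict


def max_sustainable_intensity(meets_slo: Callable[[int], bool],
                              low: int, high: int) -> int:
    """Binary search for the largest intensity in [low, high] that still
    satisfies the SLO. Returns low - 1 if no intensity in the range does."""
    best = low - 1
    while low <= high:
        mid = (low + high) // 2
        if meets_slo(mid):
            best = mid       # mid is sustainable, try higher intensities
            low = mid + 1
        else:
            high = mid - 1   # mid violates the SLO, try lower intensities
    return best


def scalability_mapping(check: Callable[[int, int], bool],
                        max_resources: int, max_intensity: int) -> Dict[int, int]:
    """For each resource amount n, find the maximum sustainable intensity."""
    return {
        n: max_sustainable_intensity(lambda x, n=n: check(n, x), 1, max_intensity)
        for n in range(1, max_resources + 1)
    }


def demanded_units(mapping: Dict[int, int], intensity: int) -> int:
    """Minimal resource amount whose sustainable intensity covers `intensity`."""
    for n in sorted(mapping):
        if mapping[n] >= intensity:
            return n
    raise ValueError("intensity exceeds the analyzed range")


# Hypothetical SUT: made-up capacities (requests/s) with sub-linear scaling.
TOY_CAPACITY = {1: 100, 2: 190, 3: 270}


def toy_check(n: int, intensity: int) -> bool:
    # Stand-in for a real measurement run that checks SLO compliance.
    return intensity <= TOY_CAPACITY[n]


mapping = scalability_mapping(toy_check, max_resources=3, max_intensity=1000)
print(mapping, demanded_units(mapping, 150))
```

In a real analysis, `toy_check` would be replaced by an actual load test against the system under test, which is why the number of search steps matters.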
3.5 Proposed Elasticity Metrics
The demanded resource units of a certain load intensity is the minimal amount of resources required
for fulfilling a given performance-related SLO. The demanded resource units mapping for different
load intensity levels can be derived with the methodology proposed in Section 3.4. The following metrics are designed to characterize two core aspects of elasticity: accuracy and timing (see footnote 2). As all metrics calculate their observed elasticity aspect based on the supply and the demanded resource units, each metric describes how the supply behavior differs from the demanded resource units; i.e., all metrics can be seen as deviations from the perfect behavior. Hence, the optimal value is zero and defines the perfectly elastic system; the higher the metric value, the worse the observed elasticity aspect.
For a valid comparison of elasticity based on the proposed set of metrics, the platforms must (i) provide an autonomic adaptation process (e.g., an auto-scaler), (ii) scale the same resource type, e.g., CPU cores or VMs, and (iii) scale within the same ranges, e.g., 1 to 20 resource units.
The proposed metrics evaluate the elastic behavior and are not designed for distinct descriptions of the underlying hardware, the virtualization technology, the cloud management software or the elasticity strategy and its configuration in isolation. As a consequence, the metrics and the measurement methodology are applicable when not all influencing factors are known. The metrics require two discrete curves as input: (i) the curve of demanded resource units, which defines how the demanded resource units vary during the measurement period, where $d_t$ is the demanded resource units at time $t \in [0, T]$; and (ii) the supply curve, which defines how the amount of used resources varies during the measurement, where $s_t$ is the supplied resource units at time $t \in [0, T]$.
In Section 3.5.1, we describe the accuracy metrics and in Section 3.5.2 we present a set of metrics for the quantification of timing aspects. In Section 3.6, we outline two approaches for the aggregation of the proposed elasticity metrics, which enable computing a consistent ranking between multiple elastic cloud environments and configurations. Afterwards, we discuss challenges of elasticity in multi-tier applications. Finally, we conclude with a scaling behavior example and its associated metrics.
3.5.1 Accuracy. In contrast to the definition in the previous publication [ ], the provisioning accuracy is extended so that it describes the relative amount of resources that are wrongly provisioned. Furthermore, we propose two different ways to calculate this metric: (i) the provisioning accuracy related to the current demanded resource units or (ii) the provisioning accuracy normalized by the maximum number of resources that are available for the experiment.
Using the first approach and taking Figure 2 into account, the under-provisioning accuracy $\theta_U$ is the sum of the areas $U_t$ divided by the current demanded resource units $d_t$ and, finally, normalized by the measurement period $T$. Similarly, the over-provisioning accuracy $\theta_O$ is the relative amount of resources that are over-provisioned during the measurement interval:
$$\theta_U[\%] := \frac{100}{T} \sum_{t=1}^{T} \frac{\max(d_t - s_t, 0)}{d_t} \Delta t, \qquad \theta_O[\%] := \frac{100}{T} \sum_{t=1}^{T} \frac{\max(s_t - d_t, 0)}{d_t} \Delta t \quad (1)$$
Using the second approach and considering Figure 2, $\theta_U$ is the sum of the areas $U_t$ normalized by the measurement period $T$ and the maximum available resource units $R$. Similarly, $\theta_O$ is the sum of the areas $O_t$ that is normalized by the time and the maximum resource units:
$$\theta_U[\%] := \frac{100}{T \cdot R} \sum_{t=1}^{T} \max(d_t - s_t, 0) \Delta t, \qquad \theta_O[\%] := \frac{100}{T \cdot R} \sum_{t=1}^{T} \max(s_t - d_t, 0) \Delta t \quad (2)$$
The advantage of the first approach is that the metric describes the relative deviations between the allocated resource units and their respective demanded resource units. However, the range of this metric is $[0, \infty)$, while the second approach gives values between 0 and 100 (as it describes the deviation related to the maximum available resource units). Herein, we use the term provisioning accuracy for both approaches, as they are interchangeable for the aggregated elasticity metrics.
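To make the two normalizations concrete, the following sketch computes both variants of the provisioning accuracy from equally spaced samples of the demand and supply curves. The function name, the sampling assumption (one sample per interval of length `dt`), and the example curves are ours and purely illustrative; the first approach additionally assumes that the demand is strictly positive at every sample.

```python
from typing import Optional, Sequence, Tuple


def provisioning_accuracy(demand: Sequence[float], supply: Sequence[float],
                          dt: float = 1.0,
                          max_resources: Optional[float] = None) -> Tuple[float, float]:
    """Return (under-, over-provisioning accuracy) in percent.

    With max_resources=None the deviations are related to the current demand
    (first approach); otherwise they are normalized by max_resources
    (second approach)."""
    if len(demand) != len(supply) or not demand:
        raise ValueError("demand and supply must be non-empty and equally long")
    total_time = len(demand) * dt
    under = over = 0.0
    for d, s in zip(demand, supply):
        # First approach divides by d (must be > 0); second by max_resources.
        norm = d if max_resources is None else max_resources
        under += max(d - s, 0) / norm * dt
        over += max(s - d, 0) / norm * dt
    return 100.0 * under / total_time, 100.0 * over / total_time


# Made-up curves: one under-provisioned and one over-provisioned interval.
demand = [2, 2, 4, 4]
supply = [2, 1, 4, 6]
theta_u, theta_o = provisioning_accuracy(demand, supply)
theta_u_r, theta_o_r = provisioning_accuracy(demand, supply, max_resources=10)
print(theta_u, theta_o, theta_u_r, theta_o_r)
```

With the first approach both deviations weigh equally here (one missing unit out of two demanded, two surplus units out of four demanded), whereas the second approach reports the absolute surplus as twice the absolute shortage.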
2 In the work of Herbst et al. [32], these aspects are referred to as precision and speed.
Fig. 2. Illustration for the definition of accuracy and provisioning timeshare metrics. [Plot of the resource demand and resource supply curves, with area labels such as U3 and O3 and interval labels A1 to A3 and B1 to B3.]
3.5.2 Timing. We highlight the timing aspect of elasticity from two viewpoints: the wrong provisioning timeshare and the instability, which accounts for the degree of oscillation.
Wrong Provisioning Timeshare. The accuracy metrics do not reveal whether the average amount of under-/over-provisioned resource units results from a few big deviations between demanded resource units and supply, or from a constant small deviation. The following two metrics address this by giving more insight into the ratio of time in which under- or over-provisioning occurs. The metrics capture the percentage of time in which the system is under-/over-provisioned during the experiment. In other words, the under-provisioning timeshare $\tau_U$ and the over-provisioning timeshare $\tau_O$ are computed by summing up the total amount of time spent in an under- or over-provisioned state, normalized by the duration of the measurement period (see Fig. 2). Thus, $\tau_U$ and $\tau_O$ measure the overall timeshare spent in under- or over-provisioned states:
$$\tau_U[\%] := \frac{100}{T} \sum_{t=1}^{T} \max(\mathrm{sgn}(d_t - s_t), 0) \Delta t, \qquad \tau_O[\%] := \frac{100}{T} \sum_{t=1}^{T} \max(\mathrm{sgn}(s_t - d_t), 0) \Delta t \quad (3)$$
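The difference between accuracy and timeshare can be illustrated with a small sketch. The function name and the two supply curves are hypothetical, and the curves are assumed to be sampled at equal intervals of length `dt`. Both supply curves below miss the same total amount of resources relative to the demand (an under-provisioning accuracy of 12.5% under the demand-related normalization), yet they spend very different shares of time in the under-provisioned state.

```python
from typing import Sequence, Tuple


def wrong_provisioning_timeshare(demand: Sequence[float], supply: Sequence[float],
                                 dt: float = 1.0) -> Tuple[float, float]:
    """Return (under-, over-provisioning timeshare) in percent."""
    if len(demand) != len(supply) or not demand:
        raise ValueError("demand and supply must be non-empty and equally long")
    total_time = len(demand) * dt
    under_time = sum(dt for d, s in zip(demand, supply) if s < d)
    over_time = sum(dt for d, s in zip(demand, supply) if s > d)
    return 100.0 * under_time / total_time, 100.0 * over_time / total_time


demand = [8] * 8
one_big_gap = [8] * 7 + [0]   # a single interval with all resources missing
constant_gap = [7] * 8        # always one resource unit short

print(wrong_provisioning_timeshare(demand, one_big_gap))
print(wrong_provisioning_timeshare(demand, constant_gap))
```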
Instability. Although the accuracy and timeshare metrics measure important aspects of elasticity, platforms can behave differently while producing the same values for these metrics. An example is shown in Figure 3: Platforms A and B exhibit the same accuracy and spend the same amount of time in the under-provisioned and over-provisioned states. However, B triggers three unnecessary resource supply adaptations whereas A triggers seven. We propose to capture this with a further metric called instability to support reasoning for instance-time-based pricing models as well as for the operator's view on estimating resource adaptation overheads.
Fig. 3. Platforms with different elastic behaviors that produce equal results for accuracy and timeshare: (a) Platform A, (b) Platform B. [Each panel plots resource demand and resource supply; panel (a) also marks the provisioning time and the deprovisioning time.]
Instability $\upsilon$ is the percentage of time in which the curves of supplied and demanded resource units are not parallel. As a requirement for the calculation of this metric, the average provisioning and deprovisioning time has to be determined experimentally. The step functions of demanded resource units and supply are transformed to ramps based on the average deprovisioning and provisioning time as depicted in Figure 3(a). Without this transformation, the resulting value would depend on how the sampling interval $\Delta t$ is chosen. In summary, $\upsilon$ measures the fraction of time in which the demanded resource units and the supplied units do not change in the same direction:
$$\upsilon[\%] = \frac{100}{T - t_1} \sum_{t=2}^{T} \min(|\mathrm{sgn}(\Delta s_t) - \mathrm{sgn}(\Delta d_t)|, 1) \Delta t \quad (4)$$
Here, $\Delta s_t$ and $\Delta d_t$ denote the change of the supplied and demanded resource units at time $t$. An instability value close to zero indicates that the platform adapts closely to a change in demand, like in Figure 3(a). A value close to 100 means that the platform oscillates heavily and does not converge to the demand, like in Figure 3(b). In contrast to the accuracy and timeshare metrics, an instability value of zero is a necessary but not sufficient requirement for a perfect elastic system.
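A minimal sketch of the instability computation on sampled curves is given below. It assumes that the transformation from step functions to ramps has already been applied and that both curves are sampled at equal intervals; the helper names and the example data are hypothetical.

```python
from typing import Sequence


def sign(x: float) -> int:
    """Return -1, 0 or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def instability(demand: Sequence[float], supply: Sequence[float],
                dt: float = 1.0) -> float:
    """Percentage of time in which supply and demand change in different
    directions (including one changing while the other stays constant)."""
    if len(demand) != len(supply) or len(demand) < 2:
        raise ValueError("need two equally long curves with at least 2 samples")
    mismatched = 0.0
    for i in range(1, len(demand)):
        delta_s = supply[i] - supply[i - 1]
        delta_d = demand[i] - demand[i - 1]
        mismatched += min(abs(sign(delta_s) - sign(delta_d)), 1) * dt
    return 100.0 * mismatched / ((len(demand) - 1) * dt)


demand = [1, 2, 3, 3, 3]
overshooting_supply = [1, 2, 3, 4, 3]   # follows demand, then overshoots once
print(instability(demand, overshooting_supply))
print(instability(demand, demand))
```

The second call shows why a value of zero is necessary for a perfectly elastic system, while a constant offset between the curves would also yield zero, which is why it is not sufficient.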
3.6 Metric Aggregation
For a direct comparison of platforms, we propose two ways to aggregate the set of elasticity metrics and to build a consistent and fair ranking: (i) the elastic deviation and (ii) the elastic speedup score.
Elastic Deviation. To quantify the elastic deviation $\sigma$ of an elasticity measurement $k$, we propose to calculate the deviation of the observed behavior from the theoretically optimal behavior (i.e., the curves of the supplied and demanded resource units are identical). A system with optimal elastic behavior in theory is assumed to have a value of 0 for this metric and, the higher the value, the worse the behavior. For the calculation of the deviation, we propose to compute the Minkowski distance $d_p$. As the theoretically optimal elastic behavior is assumed to have the value 0 for each individual metric, the $d_p$-metric is reduced to the $L_p$-norm and is defined as:
$$\sigma[\%] := \lVert x \rVert_p = \left(\theta_U^p + \theta_O^p + \tau_U^p + \tau_O^p + \upsilon^p\right)^{1/p} \quad \text{with } x = (\theta_U, \theta_O, \tau_U, \tau_O, \upsilon) \quad (5)$$
Please note that user-defined weights could easily be added to set a preference for either over- or under-provisioning.
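As a quick numerical illustration of the $L_p$-norm aggregation, the following sketch computes the elastic deviation from the five individual metrics. The exponent $p$ is left as a parameter because the text above does not fix it; the choice $p = 5$ below is our assumption, made because it approximately reproduces the $\sigma$ values listed in Table 1 (Section 3.8) from the other rows of that table.

```python
from typing import Sequence


def elastic_deviation(metrics: Sequence[float], p: float) -> float:
    """L_p-norm of the metric vector (theta_U, theta_O, tau_U, tau_O, upsilon)."""
    return sum(m ** p for m in metrics) ** (1.0 / p)


# Metric values (in percent) taken from Table 1.
input_1 = [6.49, 13.54, 18.87, 58.28, 24.34]
input_2 = [15.14, 7.05, 45.62, 31.72, 20.78]

# p = 5 is an assumption, see the note above.
sigma_1 = elastic_deviation(input_1, p=5)
sigma_2 = elastic_deviation(input_2, p=5)
print(round(sigma_1, 2), round(sigma_2, 2))
```

For larger $p$ the result is dominated by the worst individual metric, which is visible here: both results lie only slightly above the largest timeshare value of the respective measurement.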
Elastic Speedup Score. The elastic speedup score $\epsilon$ is computed similarly to the aggregation and ranking of results in established benchmarks, e.g., SPEC CPU2006 (see footnote 3). Here, the use of the geometric mean to aggregate speedups in relation to a defined baseline scenario is a common approach. The geometric mean produces consistent rankings and is suitable for normalized measurements [ ]. The resulting elastic speedup score allows comparing elasticity performance without having to compare each elasticity metric separately, and adding a new result to the ranking at a later point in time (in contrast to a closed set in, e.g., a pair-wise competition).
A drawback of the elastic speedup score is its high sensitivity to values close to zero; it becomes undefined if one or more of the metrics are zero. To minimize the probability of zero-valued metrics, we propose to aggregate the accuracy and timeshare sub-metrics into a weighted accuracy and a weighted wrong provisioning timeshare metric, respectively:
$$\theta[\%] = w_{\theta_U} \cdot \theta_U + w_{\theta_O} \cdot \theta_O$$
$$\tau[\%] = w_{\tau_U} \cdot \tau_U + w_{\tau_O} \cdot \tau_O \quad \text{with } w_{ij} \in [0,1],\ w_{\theta_U} + w_{\theta_O} = 1 \text{ and } w_{\tau_U} + w_{\tau_O} = 1 \quad (6)$$
3SPEC CPU2006:
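This way, $\theta$ and $\tau$ become zero only for the theoretical optimal auto-scaler. Thus, we compute the elastic speedup score $\epsilon_k$ for an elasticity measurement $k$ relative to the metrics of a baseline platform $base$ as follows:
$$\epsilon_k = \left(\frac{\theta_{base}}{\theta_k}\right)^{w_\theta} \cdot \left(\frac{\tau_{base}}{\tau_k}\right)^{w_\tau} \cdot \left(\frac{\upsilon_{base}}{\upsilon_k}\right)^{w_\upsilon} \quad \text{with } w_\theta, w_\tau, w_\upsilon \in [0,1],\ w_\theta + w_\tau + w_\upsilon = 1 \quad (7)$$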
The weights can be used a) to realize user-defined preferences, e.g., for putting more weight on either under- or over-provisioning, and b) to adjust the impact of accuracy and timeshare metrics compared to instability if desired.
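The weighted geometric mean behind the elastic speedup score can be sketched as follows. All numbers are made up for illustration, the function names are ours, and equal weights are an assumption rather than a recommendation.

```python
from typing import Sequence


def weighted_metric(under: float, over: float, w_under: float = 0.5) -> float:
    """Weighted sum of an under- and an over-provisioning sub-metric."""
    return w_under * under + (1.0 - w_under) * over


def elastic_speedup(base: Sequence[float], k: Sequence[float],
                    weights: Sequence[float] = (1 / 3, 1 / 3, 1 / 3)) -> float:
    """Weighted geometric mean of the ratios base/k for (theta, tau, upsilon).

    Raises ZeroDivisionError if a metric of measurement k is zero, which
    mirrors the drawback discussed in the text."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    score = 1.0
    for b, m, w in zip(base, k, weights):
        score *= (b / m) ** w
    return score


# Hypothetical baseline and measurement: (theta, tau, upsilon) in percent.
base = (weighted_metric(30.0, 10.0), weighted_metric(50.0, 30.0), 20.0)
k = (weighted_metric(15.0, 5.0), weighted_metric(25.0, 15.0), 10.0)

score = elastic_speedup(base, k)
print(score)
```

Because every metric of the hypothetical measurement is half of the baseline value, the score is about 2, and a measurement compared against itself yields 1; values above 1 therefore indicate better elasticity than the baseline.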
3.7 Elasticity Measurement Approach
This section sketches an elasticity benchmarking concept that we proposed in the paper [ ], together with its implementation called BUNGEE (see footnote 4). Generic and cloud-specific benchmark requirements as stated in the work of Huppler et al. [ ] and Folkerts et al. [ ] are considered in this approach. The four main steps in the measurement process can be summarized as follows:
(1) Platform Analysis: The benchmark analyzes the system under test (SUT) with respect to the performance of its underlying resources and its scaling behavior.
(2) Benchmark Calibration: The results of the analysis are used to adjust the load intensity profile injected on the SUT in a way that it induces the same resource demand on all platforms.
(3) Measurement: The load generator exposes the SUT to a varying workload according to the adjusted load profile. The benchmark extracts the actual induced resource demand and monitors resource supply changes on the SUT.
(4) Elasticity Evaluation: Elasticity metrics are computed and used to compare the resource demand and resource supply curves with respect to different elasticity aspects.
3.8 Exemplary Experimental Elasticity Metric Results
| Metric | Input I | Input II |
|---|---|---|
| $\theta_U$ | 6.49% | 15.14% |
| $\theta_O$ | 13.54% | 7.05% |
| $\tau_U$ | 18.87% | 45.62% |
| $\tau_O$ | 58.28% | 31.72% |
| $\upsilon$ | 24.34% | 20.78% |
| $\sigma$ | 58.47% | 47.21% |
| $\epsilon_k$ | 2.61 | 2.64 |
Table 1. Elasticity metrics.
Fig. 4. Illustration of an auto-scaling example. [Panel "Scaling Behavior of a Standard Reactive Auto-scaler with different Input": amount of VMs over time in minutes, showing demanded VMs, supplied VMs based on Input I, and supplied VMs based on Input II. Panel "Average Utilization for each VM": average load per VM and average CPU load per VM over time in minutes.]
This section presents an extract of the evaluation from the work of A. Bauer et al. [ ]. In this work, we measure the elasticity of a state-of-the-art auto-scaler depending on different inputs (Input I: estimated CPU utilization based on the service demand law, Input II: measured CPU utilization using the TOP command). The results of this measurement are depicted in Figure 4 and listed in Table 1, where each row lists an elasticity metric and each column represents a measurement. In the figure, the black curve represents the demanded VMs, which were determined by BUNGEE (cf. Section 3.7), the blue dashed curve visualizes the auto-scaling behavior with Input I, and the red dotted curve shows the scaling behavior based on Input II. While observing the two different scaling behaviors, it is hard to decide which input results in a better performance, as the different inputs lead
4BUNGEE IaaS Cloud Elasticity Benchmark:
to contrary performance: (i) While the auto-scaler with Input I almost fits the demand during the increasing load in the first 25 minutes, the system is in an over-provisioned state during the remaining time. (ii) While the auto-scaler with Input II under-provisions the system until minute 35, it almost fits the demand during the remaining time. This is reflected in the elasticity metrics: The auto-scaler with Input I has significantly better values for under-provisioning accuracy and timeshare, but has significantly worse values for the over-provisioning accuracy and timeshare. As the difference is quite similar for each compared metric, the elastic speedup score is almost the same. A difference in performance can be seen in the elastic deviation $\sigma$. Here, the auto-scaling behavior based on Input II achieves a better value in the context of elasticity. Note that elasticity describes the system behavior and does not take user experience or efficient resource usage directly into account.
3.9 Discussion
Besides the publication mentioned in Section 3.8, the elasticity metrics are applied to describe auto-scaler behavior [ ] or are used to measure the elasticity of a system [ ]. However, the proposed metrics are designed to quantify the elasticity of systems that host single-tier applications running on one group of resources. Hence, with the investigation of multi-tier applications and multiple auto-scalers in place for sub-groups of resources, or one multi-scaler, some challenges occur: (i) While the search for the demand-intensity curve of a single-tier application takes a number of steps that depends on $n$, the number of resource units, the number of steps for a naive search for a multi-tier application is $n_1 \cdot \ldots \cdot n_m$, where $m$ is the number of tiers and $n_i$ the number of resources of tier $i$. (ii) While the demanded resource units for a given intensity are distinct for a single-tier application, the demand derivation for multi-tier applications becomes more complex because for a constant load intensity there could be more than one optimal resource configuration. (iii) While investigating each tier separately, the proposed metrics remain fully applicable; however, while investigating the metrics along all tiers, the proposed metrics are not completely sufficient. For instance, in the first tier there is one resource too few and in the second tier there is one resource too many. In sum, the number of resources is equal to the number of demanded resource units. Hence, the proposed elasticity measurement methodology and the metrics need extensions to handle such scenarios.
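Two of the challenges above can be illustrated numerically. The sketch below first computes the size of the naive multi-tier search space $n_1 \cdot \ldots \cdot n_m$ and then shows how an aggregated view across tiers can hide per-tier mismatches. The tier sizes and the demand and supply vectors are made up for illustration.

```python
from math import prod
from typing import List, Sequence, Tuple


def naive_search_space(resources_per_tier: Sequence[int]) -> int:
    """Number of resource configurations a naive multi-tier search visits."""
    return prod(resources_per_tier)


def aggregate_vs_per_tier(demand: Sequence[int],
                          supply: Sequence[int]) -> Tuple[int, List[int]]:
    """Return the summed supply-demand difference and the per-tier differences."""
    per_tier = [s - d for d, s in zip(demand, supply)]
    return sum(supply) - sum(demand), per_tier


# Three tiers with up to 20 resource units each (hypothetical).
print(naive_search_space([20, 20, 20]))

# Tier 1 is one unit short, tier 2 has one unit too many.
total_diff, per_tier_diff = aggregate_vs_per_tier(demand=[3, 2], supply=[2, 3])
print(total_diff, per_tier_diff)
```

The summed difference is zero although neither tier is provisioned correctly, which is the situation that a metric computed along all tiers would fail to detect.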
Cloud resources are shared among several customers on various layers like IaaS, PaaS or SaaS. Thus, performance isolation of customers poses a major challenge. In this section, we sketch a way for quantifying performance isolation that has previously been proposed by Krebs et al. [ ] and propose a new metric to quantify the performance variability of a cloud service offering.
Performance isolation and stable performance are important aspects for various stakeholders. When a platform developer has to develop a mechanism to ensure performance isolation between customers, he needs to validate the effectiveness of his approach to ensure the quality of the product. Furthermore, to improve an existing mechanism, he needs an isolation metric to compare different variants of the solution. When a system owner has to decide for one particular deployment in a virtual environment, his decision might also be strongly influenced by the quality of the non-functional performance isolation and stability properties.
The degree of performance isolation is perceived as a static quality of a given platform or environment, whereas performance variability is a dynamic property describing fluctuations in performance caused by imperfect isolation. To quantify performance isolation, full control of the cloud environment is required to distinguish between and physically collocate abiding and disruptive cloud tenants in a given experiment. Having an industry-standard benchmark for performance isolation, cloud providers could conduct trusted experiments and advertise their results. In contrast,
performance variability can be quantified in a black-box manner by conducting a number of small performance measurements. Repeating performance measurements in randomized time intervals would enable the reporting of the recently experienced variability, as this quantity may change over time and depend on the time of the day.
A system is considered performance-isolated if, for customers working within their quotas, the performance is not negatively affected when other customers sharing the physical hardware exceed their quotas. A decreasing performance for the customers exceeding their quotas is accepted. A quota might be implicitly given by the specification of a service offering in combination with the current overbooking factor. Thus, a cloud customer does not need to be explicitly aware of his quota.
The isolation metrics [ ] distinguish between groups of disruptive and abiding customers. The latter work within their given quota (e.g., a defined number of requests/s), whereas the former exceed their quota. Isolation metrics are based on the influence of the disruptive customers on the abiding customers. Thus, we have two groups and observe the performance of one group as a function of the workload of the other group (cf. Figure 5).
Fig. 5. Influence of the disruptive tenant onto the abiding. [Two panels over time; the recoverable axis label is the response time.]
4.1 Isolation Metrics based on QoS impact
For the definition of the isolation metric, a set of symbols is specified in Table 2.
| Symbol | Meaning |
|---|---|
| $t$ | A customer in the system. |
| $D$ | Set of disruptive customers exceeding their quotas (e.g., contains customers inducing more than the allowed requests per second). $\lvert D \rvert > 0$ |
| $A$ | Set of abiding customers not exceeding their quotas (e.g., contains customers inducing less than the allowed requests per second). $\lvert A \rvert > 0$ |
| $w_t$ | Workload caused by customer $t$, represented as a numeric value. The workload is considered to increase with higher values (e.g., request rate and job size). $w_t \in W$ |
| $W$ | The total system workload as a set of the workloads induced by all individual customers. Thus, the load of the disruptive and abiding ones. |
| $z_t(W)$ | A numeric value describing the QoS provided to customer $t$. The individual QoS a customer observes depends on the composed workload of all customers $W$. We consider QoS metrics where lower values of $z_t(W)$ correspond to better qualities (e.g., response time) and $z_t(W) \in \mathbb{R}^+$. |
| $I$ | The degree of isolation provided by the system. An index is added to distinguish different types of isolation metrics. The various indices are introduced later. |
Table 2. Overview of variables and symbols.
These metrics depend on at least two measurements. First, the observed QoS results for every $t \in A$ at a reference workload $W_{ref}$ where all customers stress the system at their quotas (or slightly less). Second, the results for every $t \in A$ at a workload $W_{disr}$ when a subset of the customers have increased their
load to challenge the system's isolation mechanisms. As previously defined, $W_{ref}$ and $W_{disr}$ are composed of the workload of the same set of customers, which is the union of $A$ and $D$. At $W_{disr}$, the workload of the disruptive customers is increased.
We consider the relative difference of the QoS ($\Delta z_A$) for abiding customers at the reference workload compared to the disruptive workload. Additionally, we consider the relative difference of the load induced by the two workloads ($\Delta w$):
$$\Delta z_A = \frac{\sum_{t \in A} \left[z_t(W_{disr}) - z_t(W_{ref})\right]}{\sum_{t \in A} z_t(W_{ref})}, \qquad \Delta w = \frac{\sum_{w_t \in W_{disr}} w_t - \sum_{w_t \in W_{ref}} w_t}{\sum_{w_t \in W_{ref}} w_t}$$
Based on these two differences, the influence of the increased workload on the QoS of the abiding tenants is expressed as follows:
$$I_{QoS} = \frac{\Delta z_A}{\Delta w}$$
A low value of this metric represents a good isolation, as the difference of the QoS in relation to the increased workload is low. Accordingly, a high value expresses a bad isolation of the system.
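The computation of the QoS-based isolation metric can be sketched as follows. The function names and the first example (three abiding customers and one disruptive customer) are hypothetical. The second example reuses the averaged response times reported for the pinned scenario in Tables 3 and 4 (1104 ms at the reference workload, 1317 ms under disruption) together with the tabulated relative load difference of 4.00.

```python
from typing import Sequence


def relative_qos_difference(z_ref: Sequence[float], z_disr: Sequence[float]) -> float:
    """Relative QoS difference of the abiding customers between the reference
    and the disruptive workload (lower QoS values mean better quality)."""
    return (sum(z_disr) - sum(z_ref)) / sum(z_ref)


def relative_load_difference(w_ref: Sequence[float], w_disr: Sequence[float]) -> float:
    """Relative difference of the total load induced by the two workloads."""
    return (sum(w_disr) - sum(w_ref)) / sum(w_ref)


def isolation_qos(delta_z: float, delta_w: float) -> float:
    """QoS-impact-based isolation: lower values indicate better isolation."""
    return delta_z / delta_w


# Hypothetical example: response times in ms, workloads in requests/s.
dz = relative_qos_difference(z_ref=[100, 100, 100], z_disr=[110, 110, 110])
dw = relative_load_difference(w_ref=[50, 50, 50, 50], w_disr=[50, 50, 50, 150])
i_qos = isolation_qos(dz, dw)

# Pinned scenario: response times from Tables 3 and 4, load difference from Table 4.
dz_pinned = relative_qos_difference([1104], [1317])
i_qos_pinned = isolation_qos(dz_pinned, 4.00)
print(round(i_qos, 3), round(dz_pinned, 2), round(i_qos_pinned, 3))
```

The pinned example yields a relative QoS difference of about 0.19 and an isolation value slightly below 0.05, in line with the corresponding row of Table 4.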
Further isolation metrics can be found in the work of Krebs et al. [ ], e.g., considering a measurement approach where the QoS of the abiding customer is kept constant by synthetically reducing its load. The amount of required reduction is taken as a basis to compute isolation metrics.
4.2 Exemplary Experimental Isolation Metric Results
This section presents results based on the work of Krebs et al. [ ] assessing different Xen hypervisor configurations (CPU pinning vs. credit scheduler) and their effects on the performance isolation when executing the TPC-W benchmark for transactional web servers. Two different scenarios are investigated in this case study. In the pinned scenario, the server with eight physical cores hosts four guest systems (dom1, dom2, dom3, dom4) and dom0. Every domU has a fixed memory allocation of 3096MB and hosts a MySQL 5.0 database and a Tomcat web server. The various domains are exclusively pinned to the existing cores. Thus, no competition for the same CPU resources is possible. Based on this run-time environment, four separate instances of the TPC-W bookshop application are deployed. In the unpinned scenario, all domU and the dom0 are not pinned to a specific CPU and are free to use any existing hardware resource. Xen's credit scheduler is chosen to allocate the domains to the various resources.
Table 3 shows the values used to define the reference and disruptive workloads, where EBs stands for the number of emulated browsers in the workload driver.
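| Scenario | EBs per domU | Total throughput | Throughput per domU | Avg. response time | Max. load disruptive |
|---|---|---|---|---|---|
| Pinned | 3000 | 1195 req/s | 299 req/s | 1104 ms | 15000 |
| Unpinned | 1500 | 721 req/s | 180 req/s | 842 ms | 13500 |
Table 3. Performance results for the scenario setup and configuration.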
The highest difference in throughput for one domain compared to the mean is around 4.5% and the highest difference of the response times is around 6.5% in the pinned scenario. In the unpinned case, we observed a 2.2% difference for the throughput. The difference of the response times was 8.2%. As a consequence of these observations, the maximum load of the disruptive domain is set to 15000 for the pinned scenario and to 13500 for the unpinned one. It is worth mentioning that in the unpinned scenario this maximum load is very close to nine times the load of the maximum throughput for one domain. Table 4 contains the values of $I_{QoS}$ for the two scenarios. The first column of Table 4 identifies the scenario, the second the amount of
users for the disruptive domain, the third column the average response time of all abiding domains, followed by the results for $\Delta w$, $\Delta z_A$ and $I_{QoS}$. To ensure a level playing field of comparison, we select measurements where $\Delta w$ is the same in the relevant scenarios.
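| Scenario | Disruptive load | Response time | $\Delta w$ | $\Delta z_A$ | $I_{QoS}$ |
|---|---|---|---|---|---|
| Pinned | 15000 | 1317 ms | 4.00 | 0.19 | 0.05 |
| Unpinned | 3200 | 927 ms | 0.33 | 0.10 | 0.30 |
| Unpinned | 4800 | 942 ms | 0.60 | 0.12 | 0.20 |
| Unpinned | 7500 | 914 ms | 1.05 | 0.09 | 0.08 |
| Unpinned | 10000 | 1173 ms | 1.47 | 0.39 | 0.27 |
Table 4. Results of $I_{QoS}$ in the various scenarios.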
The pinned scenario presents a nearly perfect isolation throughout the whole range. The $I_{QoS}$ presented in Table 4 at a disruptive load of 15000 users was below 0.05.
For the isolation metrics based on QoS impact, we determine the isolation at various disruptive workloads shown in Table 4. We observe two significant characteristics. The first is the increasing response time when the disruptive load is set to 3200 users. The second is the increasing response time at 10000 users. Accordingly, the isolation becomes better between 3200 and 10000 users. This is because the response times remain largely constant while the load increases, which changes the ratio of $\Delta z_A$ and $\Delta w$.
4.3 Performance Variability of Virtual Cores
Typically, the performance of virtualized computing offerings in the IaaS domain is indirectly expressed through the use of sizing parameters (such as RAM size, number of cores, etc.), with some providers using their own metrics (e.g., Amazon Compute Units). Nominal sizing parameters fail to capture the actual performance of the resources when executing a given task, which may also depend on other parameters (e.g., sharing of some level of cache with other processors, bus speed, etc.). Furthermore, performance may fluctuate due to a number of other factors (such as collocation of virtual resources in multi-tenant infrastructures, the noisy neighbor effect, etc.). Thus, a more accurate representation of these computational capabilities could be based on a number of benchmark tests (potentially diverse, to capture different application types and ways to utilize the underlying hardware resources). It is not the purpose of this work to go into details on which benchmarks these should be. The main goal is to indicate the fluctuation observed in the benchmark scores, which is due to the dynamic and on-demand nature of cloud service usage. This fluctuation should not be reported as absolute values, but as percentages of deviation from the average values, according to the "performance of virtual core" metrics given in Table 5. From the two we consider mainly the PVC (case 1), described below, which is a combination of the generic mean absolute deviation and the mean absolute percentage error formulas, if the error in this case is considered to be the deviation from the mean value of the measurements. The metric (we denote it as PDVC, Percentage Deviation of Virtual Cores) calculates the average percentage deviation of the results from the mean value for the same benchmark, the same workload and the same size of VM. This metric, if also guaranteed by the respective provider, may indicate a promised limit across all measurements, where $n$ denotes the observation set and $C$ the percentage guaranteed by the provider (if any).
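Since the formula itself is not shown above, the following sketch implements one plausible reading of the prose description: the mean absolute deviation of the benchmark scores from their mean, expressed as a percentage of that mean. This reading, the function names, the example scores, and the guaranteed limit are our assumptions and should not be taken as the authoritative definition of PDVC.

```python
from typing import Sequence


def pdvc(scores: Sequence[float]) -> float:
    """Average percentage deviation of the scores from their mean value.

    Assumes repeated runs of the same benchmark, the same workload and the
    same VM size, and a non-zero mean score."""
    if not scores:
        raise ValueError("at least one observation is required")
    mean = sum(scores) / len(scores)
    return 100.0 * sum(abs(x - mean) for x in scores) / (len(scores) * mean)


def within_guarantee(scores: Sequence[float], limit_percent: float) -> bool:
    """Check the observed variability against a provider-guaranteed limit."""
    return pdvc(scores) <= limit_percent


# Hypothetical benchmark scores of three runs on the same instance type.
runs = [90.0, 100.0, 110.0]
print(round(pdvc(runs), 2), within_guarantee(runs, limit_percent=10.0))
```

Because the deviation is expressed relative to the mean, the result is comparable across benchmarks whose absolute scores differ by orders of magnitude.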
Applied to the measurements of Evangelinou et al. [ ], the specific metric portrays the following ratings for the various service types on the specific examined benchmarks.
| Cloud Provider | Instance Type | PDVC (%) | Avg. Bench. Score | Serv. E. Score |
|---|---|---|---|---|
| Flexiant | 2Gb-2CPU | 11.2 | 4423.2 | 0.226 |
| Flexiant | 4Gb-3CPU | 31.4 | 10237.0 | 0.169 |
| Flexiant | 4Gb-4CPU | 15.2 | 3052.6 | 0.172 |
| Microsoft Azure | A2 | 10.0 | 5501.3 | 0.191 |
| Microsoft Azure | A1 | 7.9 | 9974.3 | 0.213 |
| Amazon EC2 | m1.large | 20.0 | 5070.7 | 0.173 |
| Amazon EC2 | m1.medium | 11.1 | 6692.2 | 0.217 |
| Amazon EC2 | m1.small | 3.7 | 9381.3 | 0.364 |
Table 5. Performance of virtual cores metrics.
| Ranks by PDVC | Ranks by Avg. Bench. Score | Ranks by Serv. E. Score |
|---|---|---|
| Amazon EC2 m1.small (best) | Flexiant 4Gb-4CPU (best) | Amazon EC2 m1.small (best) |
| Microsoft Azure A1 | Flexiant 2Gb-2CPU | Flexiant 2Gb-2CPU |
| Microsoft Azure A2 | Amazon EC2 m1.large | Amazon EC2 m1.medium |
| Amazon EC2 m1.medium | Microsoft Azure A2 | Microsoft Azure A1 |
| Flexiant 2Gb-2CPU | Amazon EC2 m1.medium | Microsoft Azure A2 |
| Flexiant 4Gb-4CPU | Amazon EC2 m1.small | Amazon EC2 m1.large |
| Amazon EC2 m1.large | Microsoft Azure A1 | Flexiant 4Gb-4CPU |
| Flexiant 4Gb-3CPU (worst) | Flexiant 4Gb-3CPU (worst) | Flexiant 4Gb-3CPU (worst) |
Table 6. Ranking based on performance of virtual cores metrics.
Based on this rating, and compared to the original average benchmark values obtained, a clear difference may be observed in the values obtained for the PDVC metric, indicating an enhanced stability of the Amazon small offering, which is ranked sixth in the absolute comparison based on the average benchmark score. Compared to the service efficiency column (based on the service efficiency metric, a weighted sum of performance and cost as defined and used in the work of Kousiouris et al. [ ]), which is more balanced since it includes an aspect of the offering's cost, the top-rated one is again EC2 small; however, there are considerable differences in the following ones, with Azure offerings being evaluated with lower SE, while indicated as more stable by the PDVC. Flexiant 2Gb-2CPU, on the other hand, appears as one of the best offerings in terms of absolute or relative (to cost) performance, while in the PDVC case it is ranked in a lower position.
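The rankings discussed above can be reproduced from the values in Table 5 with a few lines of code. The sort directions are our reading of Table 6 rather than something stated explicitly: lower values are treated as better for the PDVC and for the average benchmark score, and higher values as better for the service efficiency score.

```python
from typing import List, Sequence, Tuple

# (offering, PDVC in %, avg. benchmark score, service efficiency score), Table 5.
OFFERINGS: List[Tuple[str, float, float, float]] = [
    ("Flexiant 2Gb-2CPU", 11.2, 4423.2, 0.226),
    ("Flexiant 4Gb-3CPU", 31.4, 10237.0, 0.169),
    ("Flexiant 4Gb-4CPU", 15.2, 3052.6, 0.172),
    ("Microsoft Azure A2", 10.0, 5501.3, 0.191),
    ("Microsoft Azure A1", 7.9, 9974.3, 0.213),
    ("Amazon EC2 m1.large", 20.0, 5070.7, 0.173),
    ("Amazon EC2 m1.medium", 11.1, 6692.2, 0.217),
    ("Amazon EC2 m1.small", 3.7, 9381.3, 0.364),
]


def rank(rows: Sequence[Tuple[str, float, float, float]], column: int,
         higher_is_better: bool = False) -> List[str]:
    """Return the offering names ordered from best to worst for one column."""
    ordered = sorted(rows, key=lambda r: r[column], reverse=higher_is_better)
    return [r[0] for r in ordered]


by_pdvc = rank(OFFERINGS, column=1)
by_score = rank(OFFERINGS, column=2)
by_efficiency = rank(OFFERINGS, column=3, higher_is_better=True)

# 1-based position of the Amazon small offering in each ranking.
for name, ranking in (("PDVC", by_pdvc), ("score", by_score), ("SE", by_efficiency)):
    print(name, ranking.index("Amazon EC2 m1.small") + 1)
```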
We study critical aspects of measuring availability of cloud services, existing metrics and their relation to the de facto standards, and conditions that apply to the public cloud market. Relevance to cloud environments is high, given that availability is one of the strong arguments for using cloud services, together with the elastic nature of the resources and adaptable utilization. Furthermore, availability is a key performance indicator (KPI) included in most public cloud service level agreements (SLAs).
The goal is to identify metrics that can be used for provider and service comparisons, or for incorporation into trust and reputation mechanisms. Furthermore, we intend to highlight aspects that are needed for an availability benchmark, which, in contrast to most cases of benchmarking, does not refer to the creation of an elementary computational pattern that may be created to measure a system's performance. Instead, we refer mainly to a daemon-like monitoring client for auditing provider SLAs and extracting the relevant statistics.
ACM Transactions on Modeling and Performance Evaluation of Computing Systems, Vol. 1, No. 1, Article 1. Publication
date: 0000.
Preprint for Personal Use only
1:18 N. Herbst et al.
5.1 Prerequisites and Context
SLAs are contracts between a service provider and a potential client that govern or guarantee a
specific aspect of the offered service. Relevant metrics may include availability, performance
aspects, QoS, etc., defined as service level objectives (SLOs). Research efforts have
investigated advanced aspects like direct SLA negotiation and runtime renegotiation [ ], and
real-time QoS aspects [ ]; however, in the main public cloud environments SLAs appear as static
agreements [ ], prepared in advance by the respective provider and not available for adaptation
to each specific user. Thus, a cloud user has no choice but to accept de facto the terms and
conditions offered.
5.2 Relevant Metrics Definition
Operational Availability. For real-time monitoring of service availability, one very interesting
approach is provided by CloudSleuth [ ], which has an extensive network of deployed applications
in numerous providers and locations and continuously monitors their performance with regard to
response times and availability metrics. CloudSleuth’s measurement method is not adapted to each
provider’s SLA definition, so it cannot be used to claim compensation; it relates mainly to the
definition of operational availability. Furthermore, it checks the response of a web server
(status 200 return type on a GET request). Thus, it cannot distinguish between a failure due to
an unavailable VM (provider liability), an application server crash (customer liability in case
of an IaaS deployment, provider liability in case of PaaS), or a pure application fault (customer
liability). On the other hand, it follows a normal availability definition that makes it feasible
to compare services from different providers. Such a comparison cannot be performed while
following the providers’ specific SLA definitions, since they differ in the way they define
availability. Equation 11 contains CloudSleuth’s formula for availability $A$, with $s$ as the
overall number of samples and $s_f$ the overall number of failed samples:
$$A = \frac{s - s_f}{s} \qquad (11)$$
Availability has also been defined using the Mean Time Between Failures [ ], to avoid cases where
a small uptime is combined with a small downtime (and thus results in a high availability
measure). While this is reasonable, commercial cloud providers tend to consider as uptime the
entire month rather than only the time the services were actually used by the end user. Aceto et
al. [ ] conducted a thorough analysis of monitoring tools. While numerous tools exist that focus
on service availability (and other non-functional properties), it is questionable whether the way
they calculate the specific metrics complies with the relevant definitions in commercial clouds
and their respective SLAs. In the work of Li et al. [ ], an investigation of new metrics and
relevant measurement tools spans different areas such as network, computation, memory, and
storage. However, availability is again not considered against the actual SLA definitions.
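The sample-based notion of operational availability described above can be illustrated with a minimal sketch. It assumes that each probe yields an HTTP status code (or `None` on timeout) and that only status 200 counts as success; the function name and the sample data are hypothetical and not part of CloudSleuth.

```python
def operational_availability(status_codes):
    """Share of probes that succeeded; a probe succeeds only on HTTP 200."""
    total = len(status_codes)
    if total == 0:
        raise ValueError("no samples collected")
    failed = sum(1 for code in status_codes if code != 200)
    return (total - failed) / total


# Hypothetical probe results: 200 = success, 503 = server error, None = timeout.
samples = [200, 200, 503, 200, None, 200, 200, 200, 200, 200]
print(operational_availability(samples))
```

Because the probe only sees the HTTP response, the two failed samples are counted identically whether the VM, the application server, or the application itself was at fault.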
De Facto Industrial SLAs Examination. Commercial providers provide their own definition of
availability in their respective SLAs. In many cases this implies a different calculation based
on the same samples and a set of preconditions that are necessary for a specific service
deployment to be included under the offered SLA. To investigate the ability to provide an
abstraction layer for this type of services and thus create a generic and abstracted benchmarking
client, one needs to identify common concepts in the way the SLAs are defined for the various
providers (we considered Amazon EC2, Google Compute and Windows Azure). The first and foremost
point of attention is the generic definition of availability $A$, which is included in Equation
12. Here, $T_i$ is each elementary defined time interval (minutes, hours, etc., depending on the
examined services) and $DP(x_j)$ is a failed examined (downtime) period. This may seem the same
as the one defined in Equation 11; however, the providers typically define the downtime period as
a step function of its length $x$ (in minutes). $C_{min}$ is currently defined as 1 minute for
the public cloud providers Google, Amazon and Microsoft.
$$A = \frac{\sum_i T_i - \sum_j DP(x_j)}{\sum_i T_i}, \qquad DP(x) = \begin{cases} 0 & \text{if } x < C_{min} \\ x & \text{if } x > C_{min} \end{cases} \qquad (12)$$
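The effect of the downtime quantum in Equation 12 can be sketched as follows. This is an illustration under stated assumptions: availability is taken as the share of the observation window that is not counted as downtime, an outage lasting exactly the quantum is treated as not counted (the text leaves that boundary case open), and the outage lengths are invented for the example.

```python
C_MIN = 1.0  # downtime quantum in minutes, as reported for the three providers


def counted_downtime(x, c_min=C_MIN):
    """Step function DP(x): outages not longer than the quantum are ignored."""
    return x if x > c_min else 0.0


def sla_availability(total_minutes, outage_minutes, c_min=C_MIN):
    """Availability over a window, counting only outages accepted by the SLA."""
    counted = sum(counted_downtime(x, c_min) for x in outage_minutes)
    return (total_minutes - counted) / total_minutes


month = 30 * 24 * 60               # one 30-day calculation cycle in minutes
outages = [0.5, 0.8, 12.0, 3.0]    # hypothetical outage lengths in minutes

raw = (month - sum(outages)) / month
sla = sla_availability(month, outages)
print(f"all outages counted: {raw:.6f}")
print(f"SLA definition:      {sla:.6f}")
```

The two short outages disappear from the SLA figure, which is why the provider-defined value is never lower than the value obtained when every outage is counted.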
Furthermore, another key point is the existence of preconditions needed for an SLA to be
applicable to a specific group of services used (e.g., the need to have multiple VMs deployed in
different availability zones). Provider definitions vary in this respect, regarding where the
VMs are deployed or whether further action is necessary after realizing that a virtual resource
is unavailable (e.g., a restart of the resource). In a nutshell, the similarities and
differences of the three examined providers are the following:
Common concepts: The common concepts include the quantum of downtime period, i.e., the fact that
providers do not accept a downtime period as valid unless it is longer than a specific quantum
of time. Furthermore, a discount on the monthly bill is the common form of compensation for
downtime, and the calculation cycle is typically set to 1 month. Also, typically more than one
VM needs to be used for the SLA to apply (distributed across at least two availability zones),
and the VMs need to be simultaneously unavailable for an overall sample to be considered
unavailable.
Differences: Differences include the quantum size (minor, since the format is the same and
because these differences tend to disappear, a typical example being Google Compute changing
from 5 minutes to 1 minute at the end of 2016). The number of discount levels is also different
(again minor, since the format is the same). For the Azure Compute case, it seems that more than
one instance of the same template ID must be deployed. However, it also seems that the overall
time interval refers only to the time a VM was actually active. Other functional differences
include the restart from the template, which is necessary in Google App Engine before validating
an unavailable sample.
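The common rule that all monitored VMs must be unavailable at the same time before a sample counts as downtime can be sketched as below. The per-minute sampling, the function name, and the data are assumptions made for illustration only.

```python
from itertools import groupby


def outage_lengths(per_vm_up):
    """Return lengths (in samples) of runs in which every VM is down.

    per_vm_up: one list of booleans per VM, one entry per minute
    (True = reachable). All lists are assumed to have the same length.
    """
    # A minute counts as down only if no monitored VM is reachable.
    all_down = [not any(minute) for minute in zip(*per_vm_up)]
    return [sum(1 for _ in run) for down, run in groupby(all_down) if down]


# Hypothetical samples for two VMs placed in different availability zones.
vm_a = [True, False, False, False, True, False]
vm_b = [True, True, False, False, True, False]

lengths = outage_lengths([vm_a, vm_b])
print(lengths)                          # lengths of simultaneous outages
print([x for x in lengths if x > 1])    # outages exceeding a 1-minute quantum
```

Minute 2, in which only one VM is down, does not contribute to any outage, and the final single-minute outage is dropped once the quantum is applied.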
5.3 Abstracted and Comparable Metrics
It is not feasible to directly compare provider-defined availability metrics between providers,
as these differ in definition. It is more meaningful to follow a more generic definition (as in
[ ]), or to abstract to the more generic concept of SLA adherence level, which can be defined as
the ratio of violated SLAs over the overall SLAs. Since the SLA period is set to monthly cycles,
this may be the minimum granularity of observation.
SLA Adherence Levels. Special attention must be given to cases in which sampling is not
continuous, indicating that the client did not have running services for a given period
applicable to an SLA. These cases must be removed from the ratio, especially when no violations
are observed in the limited sampling period, given that no actual testing has been performed. If
a violation is observed even for a limited period, then this case may be included. Furthermore,
SLA adherence may be grouped according to the complexity of the examined service.
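A minimal sketch of the adherence ratio with the filtering rule for incomplete sampling follows. The record structure and field names are hypothetical; the rule implemented is the one stated above: partially sampled months are dropped unless a violation was observed in them.

```python
from dataclasses import dataclass


@dataclass
class MonthlyRecord:
    violated: bool        # the SLA was violated in this monthly cycle
    fully_sampled: bool   # monitoring covered the whole applicable period


def adherence_ratio(records):
    """Ratio of violated SLA periods over all usable SLA periods."""
    # Keep fully sampled months, plus partial months that still show a violation.
    usable = [r for r in records if r.fully_sampled or r.violated]
    if not usable:
        return None  # nothing was actually tested
    return sum(r.violated for r in usable) / len(usable)


records = [
    MonthlyRecord(violated=False, fully_sampled=True),
    MonthlyRecord(violated=True, fully_sampled=True),
    MonthlyRecord(violated=False, fully_sampled=False),  # dropped
    MonthlyRecord(violated=True, fully_sampled=False),   # kept
]
print(adherence_ratio(records))
```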
SLA Strictness Levels. Besides SLA adherence, other metrics may be defined to depict the
strictness of an SLA. As a prerequisite, we assume that the metric must follow a “higher is
stricter” approach. Stricter implies that it is more difficult for a provider to maintain this
SLA.

Provider/Service | t | q             | q’               | p                 | x | S   | S’
Google Compute   | 0 | 1 (norm: 0.1) | 1 (norm: 0.0167) | 99.95 (norm: 0.5) | 0 | 1.4 | 1.4833
Amazon EC2       | 0 | 1 (norm: 0.1) | 1 (norm: 0.0167) | 99.95 (norm: 0.5) | 0 | 1.4 | 1.4833
Microsoft Azure  | 1 | 1 (norm: 0.1) | 1 (norm: 0.0167) | 99.95 (norm: 0.5) | 0 | 2.4 | 2.4833
Table 7. Example application of the SLA strictness metric on existing public cloud SLAs.

To define such a metric, one needs to identify the critical factors that may affect strictness
levels. Factors should be normalized to a specific interval (e.g., 0 to 1) and appropriate
levels for them may be defined. Indicative factors may include:
(1) Size of the minimum continuous downtime period (quantum of downtime period q). A larger size
means that the SLA is more relaxed, giving the provider the ability to hide outages that are
shorter than the defined interval. The effect of such a factor may be linear (e.g., 1 − q). The
necessary edges of the original interval (before the normalization process) may be defined per
case, based, e.g., on existing examples of SLAs.
(2) Ability to use the running time of the services and not the overall monthly time, denoted by
a Boolean variable t (0 false, 1 true). This would be stricter in the sense that we do not
consider the time the service is not running as available.
(3) Percentage of availability that is guaranteed (p). Again, this may be linear, with logical
intervals defined by the examined SLAs.
(4) Existence of performance metrics (e.g., response timing constraints). This may be a Boolean
feature x, whose true value may be set to a higher level (0 or 5). The importance of this is
explained below.
Such a metric may be useful for deploying applications with potentially different
characteristics and requirements. For example, soft real-time applications would imply that we
need feature 4. Other less demanding applications may be accommodated by services whose SLAs are
less strict. Thus, suitable value intervals may be adjusted for each feature. If we use a value
of 5 for the true case of feature 4, and all the other features are linked in such a manner that
their cumulative score is not higher than 5, then indicating a necessary strictness level of 5
implies on a numerical level that feature 4 definitely needs to be present in the selected cloud
service.
Depending on the application types and their requirements and based on the metric definition,
one can define categories of strictness based on the metric values that correspond to the
respective levels (e.g., medium strictness needs a score from 2 to 3). It is evident that such a
metric is based only on the SLA analysis and is static as long as the SLA definition is not
changed. The indicative formula for the case of equal importance of all parameters appears in
Equation 13.
$$S = t + (1 - s_1 \cdot q) + s_2 \cdot p + x \qquad (13)$$
where $s_i$ is a normalization factor for the continuous variables, so that
$(s_1 \cdot q) \in [0,1]$ and $(s_2 \cdot p) \in [0,1]$, $t \in \{0,1\}$, $x \in \{0,1\}$.
For the normalization intervals, for p we have used 99% and 100% as the edges, given that these
were the ranges encountered in the examined SLAs. For q we have used 0 and 10 minutes as the
edges: 0 indicates the case where no minimum interval is defined (thus abiding by the formula in
Equation 11) and 10 is the maximum interval in the examined compute-level SLAs. However, there
are larger intervals (e.g., 60 minutes) in SLAs of other layers (Azure Storage). The limit of 60
has been tried out in the q’ case that is included in Table 7, along with the values of the
other factors and the overall values of the SLA strictness metric for the 3 examined public
SLAs.
For this parameter, while differences existed in the past (Google had a 5-minute value), the
values are currently uniform at 1 minute. An example of x not being 0 would be the Azure Storage
SLA, where unavailability is also determined by response time limits for a variety of service
calls.
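The strictness values of Table 7 can be reproduced with a short sketch. It assumes that the score is the equal-weight sum of t, the inverted normalized quantum (1 − q), the normalized guaranteed availability p, and x. The normalized value of p (0.5) is taken directly from Table 7 rather than recomputed; function and variable names are hypothetical.

```python
def normalize(value, low, high):
    """Linear normalization of a value to [0, 1] between two edges."""
    return (value - low) / (high - low)


def strictness(t, q_norm, p_norm, x):
    """Equal-weight SLA strictness: higher means stricter."""
    return t + (1 - q_norm) + p_norm + x


# t values per provider as listed in Table 7; q = 1 minute and x = 0 for all.
providers = {"Google Compute": 0, "Amazon EC2": 0, "Microsoft Azure": 1}
q_norm = normalize(1, 0, 10)        # edges 0..10 minutes (compute-level SLAs)
q_norm_alt = normalize(1, 0, 60)    # edges 0..60 minutes (the q' variant)
p_norm = 0.5                        # normalized value of 99.95% from Table 7

for name, t in providers.items():
    s = strictness(t, q_norm, p_norm, 0)
    s_alt = strictness(t, q_norm_alt, p_norm, 0)
    print(f"{name}: S={s:.4f}  S'={s_alt:.4f}")
```

Dropping the t term, as suggested for continuously running deployments, leaves the three providers with identical scores.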
One interesting aspect in this case is the usage model of cloud services, which may influence
one or more parameters of the selection. As an example, if an entity uses cloud services as its
main and continuous operation platform, then the question of whether the actual usage time or
the overall monthly time is used as the calculation interval becomes irrelevant (and thus the t
parameter may be omitted, resulting in equivalent rankings for the providers of Table 7). If the
usage model considers bursting scenarios (temporary usage of external cloud services to meet
peaks in demand), the t parameter gains significant importance in the decision process.
5.4 Measurement Methodology
To create an availability benchmark for the aforementioned SLAs, the methodology must cover the
following requirements/steps:
(1) Complete alignment to each provider definition to achieve non-repudiation, including:
(a) Checking of the necessary preconditions in terms of number and type
(b) Availability calculation formula
(c) Dynamic consideration of user actions (e.g., starting/stopping of a VM) that may influence
SLA applicability
(d) Downtime due to maintenance
(2) General assurance mechanisms in terms of faults that may be attributed to 3rd party services:
(a) For example, testing of general Internet connectivity on the client side
(b) Front-end API availability of the provider
(c) Monitoring daemon not running for an interval at the client side. This can be covered by
appropriate assumptions, e.g., if logs are not available for an interval then the services are
considered as available
(3) Logging mechanisms that persist the necessary range and type of information
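How an auditing client might apply these requirements to a single monitoring sample can be sketched as below. The sample schema (`logged`, `client_online`, `vm_running`, `service_up`) and the verdict names are hypothetical, and the order of the checks is one possible design, not a prescribed one.

```python
from enum import Enum


class Verdict(Enum):
    AVAILABLE = "available"
    PROVIDER_FAULT = "provider fault"
    EXCLUDED = "excluded (not attributable to the provider)"


def classify(sample):
    """Classify one monitoring sample (a dict with a hypothetical schema)."""
    if not sample["logged"]:
        # Daemon was not running: no logs, so the service is assumed available.
        return Verdict.AVAILABLE
    if not sample["client_online"]:
        # The client itself lost Internet connectivity.
        return Verdict.EXCLUDED
    if not sample["vm_running"]:
        # The user stopped the VM, so the SLA does not apply to this sample.
        return Verdict.EXCLUDED
    return Verdict.AVAILABLE if sample["service_up"] else Verdict.PROVIDER_FAULT


probe = {"logged": True, "client_online": True, "vm_running": True, "service_up": False}
print(classify(probe).value)
```

Only samples that pass all client-side and applicability checks and still fail are attributed to the provider, which is what makes the resulting log usable as evidence.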
System Setup. The system setup should include a running cloud service. Furthermore, it should
include the benchmark/auditing code (implemented according to the aforementioned requirements),
which typically runs externally to the cloud infrastructure to cover the case in which
connectivity exists internally in the provider but not towards the outside world.
Workload. Given that this is a daemon-like benchmark, the concept of workload is not directly
applicable. The only applicable aspect of workload is the number of cloud service instances to
be monitored, and the only constraint is that these satisfy the preconditions of the SLA.
However, an interesting consideration in this case may be the differentiation based on the
complexity of the observed service (in terms of the number of virtual resources used), given
that this influences the probability of not having a violation.
If we consider the case of a typical cloud deployment at the IaaS level, we may use $M$
availability zones (AZ), in each of which $N$ virtual machines are deployed. An availability
zone is typically defined as a part of the data center that shares the same networking and/or
power supply. Thus, the usage of multiple AZs eliminates the existence of a single point of
failure. For simplicity, we assume that the number of VMs is the same across all AZs. The
probability of a technical failure in the AZ, $P_T$, depends on the probability of power supply
failure $P_P$ and the probability of network failure $P_N$. The probability that a VM is
unavailable, $P_U$, depends on $P_H$, the risk of failure of the physical host on which the VM
is running, and $P_V$, the risk of the VM itself to fail. Assuming that these events are
mutually exclusive and depend on different factors, the overall probability of failure for a
deployment in one AZ, $P_{AZ}$, is given by Equation 14. The service is deemed unavailable in
one specific AZ if power or network connectivity is lost across the AZ or if all VMs in that AZ
are unavailable at the same time.
$$P_{AZ} = P_T + P_U^{N}, \quad \text{with } P_T = P_P + P_N \text{ and } P_U = P_H + P_V \qquad (14)$$
If $N$ VMs are deployed in each one of the $M$ AZs, then the overall failure probability $P_O$
is given by Equation 15, assuming that the various AZs have similar power and network failure
probabilities. Given that we are not aware of the affinity of VM placement across physical
nodes, we assume that a different physical host is used for each VM.
$$P_O = P_{AZ}^{M} = \left(P_T + P_U^{N}\right)^{M} \qquad (15)$$
The significant factors that indicate the complexity ($M$, $N$) can be used as a generic metric
of “workload”. In addition, they can be used to classify results according to service
complexity.
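The influence of deployment complexity on the failure probability can be explored numerically with the following sketch. It rests on assumptions that should be read as such: an AZ fails if it suffers a technical failure or if all of its VMs fail together (sum of P_T and P_U raised to the number of VMs), and the whole service fails only if every AZ fails independently. The probability values are invented.

```python
def az_failure_probability(p_power, p_network, p_host, p_vm, vms_per_az):
    """Failure probability of a deployment in a single availability zone."""
    p_t = p_power + p_network        # technical failure of the whole AZ
    p_u = p_host + p_vm              # a single VM is unavailable
    return p_t + p_u ** vms_per_az   # AZ-wide failure, or all VMs down together


def overall_failure_probability(p_az, zones):
    """Failure probability of the service when every AZ must fail."""
    return p_az ** zones


p_az = az_failure_probability(0.001, 0.001, 0.01, 0.01, vms_per_az=2)
print(f"one AZ:  {p_az:.6f}")
print(f"two AZs: {overall_failure_probability(p_az, 2):.9f}")
```

Under these assumptions, adding VMs per zone quickly stops helping because the AZ-wide term dominates, whereas adding zones keeps reducing the overall probability.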
5.5 Discussion
While availability has been defined in the literature in various ways, existing mainstream
public clouds such as Amazon, Google and Azure have separate definitions, which may be similar
but are not identical even to each other. Thus, providers cannot be compared directly based on
these metrics, and especially cannot be benchmarked against the guarantees they advertise in
their respective SLAs. In this section, we analyzed the similarities of provider definitions and
how these can lead to guidelines for implementing benchmarking clients that identify whether
providers fulfill the service level agreements issued to their customers. Furthermore, we
defined a simple yet directly comparable (between providers) metric of SLA adherence. Classes of
workload can be identified based on the size and deployment characteristics of the measured
service, thus further refining the aforementioned comparison.
In this section, we present cloud metrics located at the top level of the pyramid in Figure 1.
These metrics may provide comparable information about the overall status of cloud systems and
can further be used for managerial decisions by operators and stakeholders. We define the
Operational Risk (OR), a family of metrics determining the risk of production systems running in
cloud environments, and provide metrics and measurement approaches to quantify it.
6.1 Goal and Relevance
Risk management is defined as a decision paradigm in grids [ ] and used for the selection of an
infrastructure to host an application. Similarly in clouds, we use risk as a quality aspect
reflecting the impact of running an application in cloud infrastructures.
The operational risk derives from the risk aspect and depicts the performance impact on services
when running in cloud systems. Operational risk evaluates whether cloud services run in the
system as expected and, if not, the severity of the deviations. Given that cloud users want
their applications to run with minimum performance degradation, providers may use metrics such
as the operational risk to publicly present the capabilities of their systems.
In addition, operational risk helps managers summarize and report system status in management
analysis. For instance, cloud architects need metrics to depict the overall system status and,
if possible, the future impact of cloud solutions on system performance. On the other hand,
operators monitoring the cloud infrastructures use different metrics for in-depth performance
analysis.
The operational risk provides an overview of the system and contributes to a higher level
approach towards the performance evaluation of complex systems.
6.2 Prerequisites and Context
In this section, we describe the operational risk of cloud services deriving from their
performance levels when running in cloud infrastructures. The term risk is also used in the
literature to describe security issues of clouds, but the notion of security is out of scope of
the presented metric.
The service performance in clouds refers to multiple kinds of performance that can be measured
in cloud services. A possible distinction among these performance types is the level of service
(e.g., IaaS, PaaS, SaaS) in the cloud system. The service level determines the service
performance for that level and also the suitable metrics for measuring it.
The definition of operational risk refers to the expected performance levels at which cloud
services should run to avoid degradation. The expected performance varies depending on needs.
We describe the most common benchmark objectives from which the expected performance levels
derive.
First, the performance can be compared to an optimal level that theoretically bounds the
measurements. The implied expected performance against which results are compared is the optimal
level. An example is the total availability of cloud infrastructures with no downtime, as
described in Section 5.
A second benchmark result can be the comparison between the performance level of a service
running alone in a dedicated environment and the performance of the same service running in an
environment where multiple services share the resources. This is a more realistic approach to
benchmarking and comparing performance, since optimal performance may never be reached.
A third result in cloud benchmarks can be the success of the system in achieving the guaranteed
performance levels described in SLOs. Providers may use benchmarking to test systems against
possible SLO violations and create new SLA levels according to the results. In this case, we set
as expected performance levels of services those defined in SLOs. In this way, we can test and
tune systems according to required performance bounds. Having SLA levels as a benchmark goal
requires the definition of SLOs that are general and applicable to many cloud-based systems.
Given the many variations of performance levels set as benchmark goals, we can map these levels
to the expected performance levels in operational risk. In this way, operational risk can
evaluate the severity of the deviations between monitored results and given expected levels.
6.3 Proposed Operational Risk Metrics
The proposed operational risk metric adopts the notion of expected levels from the second
comparison type described in Section 6.2.
The metric quantifies the variation between the performance of a service running in a dedicated
environment and its performance when running in another, non-dedicated environment. The former
provides an isolated environment where resources are always available to the service under test.
In contrast, due to concurrent operations of services in a non-dedicated environment, services
share the resources, resulting in performance interference among the services [ ]. This
interference affects the performance delivered to services. Hence, this performance may deviate
from the service performance in the dedicated environment, implying degradation issues.
Since we use the notion of expected levels for testing performance, by risk we mean the
likelihood that the service performance in the cloud will deviate from the demanded performance.
The expected service performance is reflected by the performance in the dedicated system.
We focus on the service performance at the IaaS level; thus, the operational risk utilizes
IaaS-level metrics and refers to performance levels that reflect the resource utilization by
cloud services.
6.3.1 Related Metrics. The risk metric reveals an additional aspect of the service performance
in the cloud, namely the degree to which service performance is degraded. Figure 6 shows the
three metrics referring to a resource type of a cloud service: the current usage ($U$) of the
resource, the demanded ($D$) amount the service requires, and the provisioned ($P$) amount of
the resource, which is the upper limit the service is able to consume and is given by the
resource manager.
As Figure 6 shows, the $D$ line can be higher than the $U$ line, as a service may not receive
what it requires due to contention issues in the system, e.g., resource overbooking. We consider
that $U$ cannot be higher than the demanded amount ($U \leq D$), because the real usage of a
resource cannot exceed the resource requirements of the service. Similarly, $U \leq P$, as the
actual resource consumption can only reach up to the resource limit set by the resource manager.
Relative Levels of Metrics. Considering the three mentioned metrics, we show the possible cases
for their relative levels. In an ideally auto-scaling and isolated environment, the service is
provisioned with and consumes the demanded amount of the resource ($P = D = U$), as illustrated
in Figure 6. The resource manager provides the demanded resources ($P = D$) and the service
utilizes all the demanded resources ($D = U$) without interfering with other co-hosted services.
Fig. 6. Metrics composing the Operational Risk metric for cloud service.
However, the case of the three coinciding lines is not applicable in real environments. An
auto-scaling mechanism is not perfectly accurate and creates mismatches between the provisioned
and the demanded resources. When a service demands fewer resources than it has been provisioned
($D < P$), the system hosts an over-provisioned service. The system has under-provisioned
resources to the service when $D > P$. Both cases are depicted in Figure 6.
Finally, the real usage $U$ cannot be higher than the demanded usage $D$; however, the lines of
$D$ and $U$ do not always coincide due to contention issues in the system. We call the deviation
$D - U$, which is always non-negative ($D - U \geq 0$), the contention-metric; it represents the
gap between the demanded amount of a resource and the real consumption of the resource at a
specific time. The higher the value of the contention-metric, the more severe the resource
contention experienced by the service, and hence the greater the degradation of the service
performance. For instance, in Figure 6, the service experiences less contention in its resources
during period $T_1$ than during the period in which the gap between the resource lines $D$ and
$U$ is bigger.
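The relative levels of the three resource metrics can be checked per sample with a small sketch. The function names and the sample values are hypothetical; the rules are those stated above (over-provisioning when D < P, under-provisioning when D > P, and a contention-metric D − U that cannot be negative).

```python
def provisioning_state(provisioned, demanded):
    """Compare the provisioned amount P with the demanded amount D."""
    if demanded < provisioned:
        return "over-provisioned"
    if demanded > provisioned:
        return "under-provisioned"
    return "accurate"


def contention_metric(demanded, used):
    """Gap D - U between demanded and actually used resources."""
    if used > demanded:
        raise ValueError("usage cannot exceed demand")
    return demanded - used


# Hypothetical (P, D, U) samples for one resource type, e.g. CPU cores.
for p, d, u in [(4, 2, 2), (4, 4, 3), (4, 8, 4)]:
    print(provisioning_state(p, d), contention_metric(d, u))
```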
6.3.2 Definition of Metric. The operational risk metric presented in this section incorporates
the three resource metrics mentioned before and is split into two partial metrics, the provision
risk ($r_p$) and the contention risk ($r_c$). We split the risk metric into two so that each
metric evaluates a different risk aspect of degraded service performance. The provision risk
evaluates the severity of performance degradation caused by inaccurately provisioned resources,
and the contention risk evaluates the severity caused by resource contention when resources are
shared among services. The combination of the two metrics lets cloud providers have a more
comprehensive view of the system performance and address performance issues according to their
needs.
Provision Risk. We define the provision risk ($r_p$) as the degree to which the demanded
resources meet the provisioned resources. $P_t$ and $D_t$ are the provisioned and the demanded
resources, respectively, at time $t$, and $T$ is the time period in which we measure the two
metrics. The integral value of $r_p$ measures the relative squashed area enclosed by the
resource values of $P_t$ and $D_t$ for the time period $T$. The value of $r_p$ ranges between
$-1$ and $1$, indicating an under-provisioning situation when $r_p \in [-1, 0)$ and an
over-provisioning case when $r_p \in (0, 1]$. The zero value indicates an accurate provision of
resources according to the demanded resources. The closer the value of $r_p$ is to zero, the
less risk is indicated for the service.
Contention Risk. Similarly, we define the contention risk ($r_c$). This metric utilizes the
resource values of demanded and used resources over the time period $T$. The metrics $D_t$ and
$U_t$ indicate the values of the respective metrics at time $t$. The $r_c$ value is
non-negative, as the amount of used resources cannot exceed the amount of demanded resources.
The higher the value of $r_c$, the higher the estimated risk that a service does not receive the
demanded performance due to resource contention.
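A discrete-time sketch of the two partial risks is given below. It is an illustration under explicit assumptions, not the definition used in the text: the integrals are approximated by the mean over equally spaced samples, the provision term is normalized by the larger of the provisioned and demanded values so that the result stays within [−1, 1], and the contention term is normalized by the demanded value so that the result stays within [0, 1].

```python
def _mean(values):
    values = list(values)
    return sum(values) / len(values)


def provision_risk(provisioned, demanded):
    """Mean relative gap between provisioned and demanded resources.

    Assumed normalization: divide by max(P_t, D_t). Negative values mean
    under-provisioning, positive values mean over-provisioning.
    """
    return _mean(
        (p - d) / max(p, d) if max(p, d) > 0 else 0.0
        for p, d in zip(provisioned, demanded)
    )


def contention_risk(demanded, used):
    """Mean relative gap between demanded and used resources (assumed: / D_t)."""
    return _mean((d - u) / d if d > 0 else 0.0 for d, u in zip(demanded, used))


# Hypothetical equally spaced samples of P, D and U for one resource type.
P = [4, 4, 4, 4]
D = [2, 4, 8, 8]
U = [2, 3, 4, 4]
print(provision_risk(P, D), contention_risk(D, U))
```

With these samples the service is under-provisioned on balance (negative provision risk) and loses roughly a third of its demanded resources to contention.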
Service Risk. The risk for a cloud service should combine the two aforementioned risk metrics to
give a better overview of the service status regarding risk levels. We define the risk of a
service ($r_e$) as the degree to which a cloud service performs as expected, which derives from
the performance of the provisioning capabilities of the system and from the system’s capacity to
keep resource contention among hosted services low. Both expected performance levels contribute
to evaluating whether a system supplies the service with enough resources.
$$r_e = w_p \cdot |r_p| + w_c \cdot r_c, \quad \text{with } w_p, w_c \in [0,1],\; w_p + w_c = 1 \qquad (18)$$
For the service risk metric $r_e$ in Equation 18, the difference between $P_t$ and $D_t$ enters
as an absolute value, because we do not focus on the type of provisioning risk (over- or
under-provisioning) but only on the level of severity that the difference between the two
metrics reflects. The factors $w_p, w_c \in [0,1]$ are used to weight the operational risk value
according to user needs.
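The weighted combination in Equation 18 can be written as a short function. The input values in the example are hypothetical, and the second weight is derived from the first so that the two always sum to 1.

```python
def service_risk(r_p, r_c, w_p=0.5):
    """Weighted service risk: w_p * |r_p| + w_c * r_c with w_p + w_c = 1."""
    if not 0.0 <= w_p <= 1.0:
        raise ValueError("w_p must lie in [0, 1]")
    w_c = 1.0 - w_p
    return w_p * abs(r_p) + w_c * r_c


# Hypothetical partial risks: slight under-provisioning, moderate contention.
print(service_risk(-0.125, 0.3125))            # equal weights
print(service_risk(-0.125, 0.3125, w_p=0.0))   # only contention matters
```

Because the provision risk enters as an absolute value, an over-provisioned and an under-provisioned service with the same magnitude receive the same score.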
System Risk. The operational risk of a cloud system should be an overview of the risk values of
the constituent services in the system. The aggregation method of service risks, which
calculates the system risk value, varies according to user needs. The method results in a
variability metric that combines values according to different purposes. For example, providers
who want to test the risk levels of the system may use quantile values to depict the risk level
of a proportion of services. When cloud environments are assessed for stable risk levels, a
variability metric like the inter-quartile range (IQR) shows the dispersion of service risks in
the system and thus the performance variability in the system.
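Two of the aggregation options mentioned above, quantiles and the inter-quartile range, can be sketched with the Python standard library. The list of service risks is hypothetical, and the choice of the inclusive quantile method is an assumption made for this example.

```python
import statistics


def system_risk_summary(service_risks):
    """Summarize per-service risk values by quartiles and their spread."""
    q1, median, q3 = statistics.quantiles(service_risks, n=4, method="inclusive")
    return {"median": median, "q3": q3, "iqr": q3 - q1}


# Hypothetical risk values of five services hosted in the same system.
risks = [0.05, 0.10, 0.20, 0.40, 0.80]
print(system_risk_summary(risks))
```

A provider interested in the worst-affected share of services would report an upper quantile, while a small IQR indicates that the services experience similar risk levels.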
6.4 Measurement Methodology
To measure the operational risk of cloud systems effectively, we have to determine which
monitored data correspond to the metrics that operational risk is built upon. Additionally, the
resource type has to be decided so that the appropriate resource metrics can be declared. We
propose the measurement of multiple resource types, since resource management objectives of
cloud operators need to consider the joint management of different resources [ ]. We focus on
the most common resource types that are currently available in clouds: CPU, memory, network and
storage.
Metrics P and U. The metric definitions of provision ($P$) and usage ($U$) are straightforward
and one can easily monitor the respective values. Metric $P$ refers to the capacity that the
resource manager has provisioned to the service, and metric $U$ is the actual resource capacity
that the service uses. For any resource type, the corresponding metrics $P$ and $U$ can be
readily monitored.
Metric D. Although it may not be obvious how the demanded amount of the aforementioned resource
types is estimated, there is considerable research on that topic. For the CPU resource type,
prediction models have been introduced in [ ], and contemporary cloud monitors provide metrics
about the demanded amount of CPU. For the memory resource, the demanded amount can also be
estimated as in [ ] and used as the memory capacity that the service may need at a specific
time. For the resource types of network and storage, the metric $D$ is simpler to calculate
because the demanded capacities are either the size or the number of requests for that resource
received by the resource manager. The resource manager, after collecting the requests, handles a
subset of these requests (i.e., metric $U$) due to the system load.
The values of the two weights represent the importance of the provision risk and the contention
risk, respectively. One has to consider the purpose of benchmarking the cloud system to
define the weights accordingly. The contention risk may be more important for testing the impact of
co-location of cloud services ( ), whereas the provision risk better represents the operational
risk when selecting the most promising elastic policy for a system.
6.5 Measurement Approach
The approach of evaluating the operational risk from Section 6.3 can be summarized as follows:
(1) Measurement: The benchmark exposes the system-under-test to a given workload for a
specific time period T. Given the workload resource demands, the resource allocations for the
workload by the system, and the contention levels in the system, the benchmark estimates
the current resource usage U and the provision amount P across period T.
(2) Risk Evaluation: Operational risk metrics are computed and used to compare the risk in
different time periods or different cloud environments.
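The two steps above can be sketched in code. The formulas below are simplifying assumptions chosen only for illustration and are not the definitions of Section 6.3: the provision risk is taken as the mean signed mismatch between P and the demand, normalized to [-1, 1], the contention risk as the mean share of demand that was not served, in [0, 1], and the service risk as a weighted sum of both. All function and parameter names are hypothetical.

```python
def provision_risk(provisioned, demanded):
    """Mean signed provisioning mismatch in [-1, 1] (assumed, illustrative form)."""
    terms = []
    for p, d in zip(provisioned, demanded):
        largest = max(p, d)
        terms.append(0.0 if largest == 0 else (p - d) / largest)
    return sum(terms) / len(terms)


def contention_risk(used, demanded):
    """Mean share of the demand that was not served, in [0, 1] (assumed form)."""
    terms = []
    for u, d in zip(used, demanded):
        terms.append(0.0 if d == 0 else max(0.0, d - u) / d)
    return sum(terms) / len(terms)


def service_risk(provisioned, used, demanded, w_p=0.0, w_c=1.0):
    """Weighted combination of both risks; the weights are assumed to sum to 1."""
    if abs(w_p + w_c - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return (w_p * abs(provision_risk(provisioned, demanded))
            + w_c * contention_risk(used, demanded))


if __name__ == "__main__":
    # Hypothetical CPU samples over a period T of four time steps.
    P = [4, 4, 4, 4]  # provisioned capacity
    D = [2, 4, 8, 4]  # demanded capacity
    U = [2, 4, 4, 4]  # used capacity
    print(service_risk(P, U, D, w_p=0.0, w_c=1.0))
```

With the contention weight set to 1, the service risk of this sketch reduces to the contention risk alone, which mirrors the weighting used in the simulated example.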
We present an example of measuring operational risk using a simulation of cloud services running in
a multi-datacenter, multi-cluster system. To simplify the process, we focus on the contention risk of
a service (w_c = 1) and on one resource type (CPU). The workload trace spans a long time period of
three months because short-term performance degradations of services do not give a representative
overview of their severity. The need to monitor and test long workload traces is the main reason
we rely on simulation results for our measurements. The computed risk metrics are shown in Table 8.
Each table row represents an experimental configuration, with the second service load stressing
the system more heavily. The different operational risk values in the configurations reflect the different
1800 11 0.05 0.001
3400 61 0.09 0.007
Table 8. Metric results for a simulated cloud environment (w_c = 1).
impact of the CPU workloads in the system. By measuring the risk, providers may focus on the
most contended services, i.e., services with risk values higher than the system risk (r_e > r_s).
6.6 Related Work for Operational Risk Metrics
Risk management and resource contention are not new subjects in cloud research. Risk management
has been defined as a decision paradigm in Grids [ ], where selection among multiple
infrastructures should be taken into account to host an application. Historical records about SLO
violations can be used to assess the system risk for an incoming application to fulfill the agreed
objectives [ ]. The risk levels are also evaluated according to the provisioning mismatches
of the elasticity mechanism. Our approach to operational risk considers the elastic aspect of the
system and combines it with the risk of performance interference between services.
Persico et al. [ ] studied the intra-network performance in AWS EC2 and Microsoft Azure,
suggesting ways to properly characterize the maximum achievable throughput; studies like these
are necessary to correctly define the baseline level prior to a performance degradation analysis.
Tang et al. [ ] investigated the interference of services in memory. The presented results on
performance degradation use as baseline the performance of a service running alone in
the system under test. Zhuravlev et al. [ ] used a similar approach to measure the performance
degradation, with a more explicit definition of the degradation. The authors define the relative
performance degradation of memory components with respect to the performance level of the service
running alone. We extend this related work by considering the impact of inaccurately provisioned
resources on the hosted services. The degradation of performance due to resource contention is
affected by the elastic mechanisms in clouds, hence the need to combine the two notions.
6.7 Discussion
In this section, we present the notion of operational risk in clouds and introduce a representative
metric that evaluates the risk that cloud services and systems do not perform as expected.
The notion of risk differs from the elasticity feature in clouds because elasticity takes into
account the provisioned and the demanded values of a service resource to cope with the service load. In
contrast, both performance isolation and availability utilize demand and usage metrics, for different
reasons. Performance isolation concerns the contention that a service may experience and
evaluates the severity of the interference among services, while availability evaluates the periods
in which demand is present but the usage of resources cannot be achieved.
The operational risk incorporates the three resource-level metrics and complements the other
metrics described in previous sections to assess the severity of performance degradation regarding
the mentioned cloud features.
Usability. The operational risk metric can be used to evaluate cloud services and systems and to
assess whether the performance guarantees are met.
The measurement of service performance in the cloud is important for both the customer and the cloud
provider. The customer wants to maximize profit by delivering good QoS to clients. Therefore, the
customer uses service performance to check the progress of the service at runtime as well as
to compare and select the cloud environment that best meets the service needs.
Cloud providers are interested in the levels of service performance because they are
burdened with financial fees when SLO violations occur. Moreover, there is a reputation cost for
cloud providers when they do not deliver good performance to customers; thus, providers also
utilize service-performance metrics to remain competitive in the cloud market.
In the previous sections, we defined a number of cloud-relevant metrics for different non-functional
quality aspects, including ways to aggregate them. In Table 9, we provide an overview of the presented
metrics including their value ranges, optimal points, and references to show cases. In general, it is
impossible to prove correctness or superiority of a metric; a metric is a common agreement on how
to quantify a given property. One can discuss to what degree metrics fulfill the characteristics of a good,
well-defined, and intuitive metric, and additionally demonstrate their usefulness by meaningful
results in series of experiments or adoption in different communities. In the following, let us go
step by step over a list of metric characteristics:
Definition. A metric should come along with a precise, clear mathematical (symbolic) expression
and a defined unit of measure, to assure consistent application and interpretation. We are compliant
with this requirement, as all of our proposed metrics come with a mathematical expression and a unit,
or are simply time-based ratios.
Interpretation. A metric should be intuitively understandable. We address this by keeping the
metrics simple and describing the meaning of each in a compact sentence. Furthermore, it is
important to specify (i) if a metric has a physical unit or is unit-free, (ii) if it is normalized and, if
yes, how, (iii) if it is directly time-dependent or can only be computed ex-post after a completed
measurement, and (iv) clear information on the value range and the optimal point. Aggregate
metrics should keep generality and fairness, combined with a way to customize by agreement on a
weight vector.
Reliability. A metric is considered reliable if it ranks experiment results consistently with respect
to the property that is subject of evaluation. In other words, if System A performs better than
System B with respect to the property under evaluation, then the values of the metric for the two
systems should consistently indicate this (e.g., higher value meaning better score). Ideally,
a metric is also linear, i.e., its value is linearly proportional to the degree to
which the system under test exhibits the property under evaluation. For example, if a performance
metric is linear, then a value twice as high would indicate twice as good performance.
Linear metrics are intuitively appealing since humans typically tend to think in linear terms. For
the aggregate metrics, linearity might not be given. In the long run, the metrics’ distributions in
practice should be analyzed to improve the reliability of their aggregation. For our proposed metrics,
we demonstrate consistent rankings in series of experiments.
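The reliability criterion can be checked mechanically. The sketch below, a simple illustration rather than the procedure used in our experiments, tests whether a metric orders every pair of systems the same way in all repeated experiments. The function name and the sample values are hypothetical.

```python
from itertools import combinations


def ranks_consistently(runs):
    """Check whether a metric orders all system pairs identically in every run.

    `runs` is a list of dicts, one per experiment, mapping a system name to
    the metric value measured for that system.
    """
    systems = sorted(runs[0])
    for a, b in combinations(systems, 2):
        # -1, 0, or +1 depending on which system scores higher in a run.
        orderings = {(run[a] > run[b]) - (run[a] < run[b]) for run in runs}
        if len(orderings) > 1:
            return False
    return True


if __name__ == "__main__":
    consistent = [{"A": 0.9, "B": 0.7}, {"A": 0.8, "B": 0.6}]
    inconsistent = [{"A": 0.9, "B": 0.7}, {"A": 0.5, "B": 0.6}]
    print(ranks_consistently(consistent), ranks_consistently(inconsistent))
```

A metric that fails this check for systems whose ground-truth ordering is known would not be considered reliable in the sense described above.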
Further quality aspects of metrics in combination with a standardized measurement. A transparently
defined and consistently applied measurement procedure is important for reliable measurements
of a metric. For example, it is important to state where the sensors need to be placed (to have
an unambiguous view on the respective resource), the frequency of sampling idle/busy counters,
and the intervals for reporting averaged percentages. The easier a metric is to measure, the more
likely it is that it will be used in practice and that its value will be correctly determined. For
all proposed metrics, we define or refer to a measurement methodology (a.k.a. benchmarking
approach) and demonstrate their applicability. Supporting the development of standardized
measurement approaches that render the metrics independent and repeatably measurable is out of
scope of this work and part of our future agenda. A metric’s measurement approach is independent
| Quality Attribute | Metric | Value Range | Unit | How to Measure | Show Case |
| --- | --- | --- | --- | --- | --- |
| | θ_U | [0; ∞), opt: 0 | % | required | Sec. 3.8 |
| | θ_O | [0; ∞), opt: 0 | % | | |
| | U | [0; 100], opt: 0 | % | | |
| | O | [0; 100], opt: 0 | % | | |
| | Timeshare τ_U | [0; ∞), opt: 0 | % | | |
| | τ_O | [0; ∞), opt: 0 | % | | |
| | Instability υ | [0; 100], opt: 0 | % | | |
| | Deviation σ | [0; ∞), opt: 0 | % | expost | |
| | Speedup ϵ_k | [0; ∞), opt: | None | expost | |
| Isolation | QoS I_QoS | [0; ∞], opt: 0 | None | expost | Sec. 4.2 [45] |
| Variability | Deviation PVDC | [0; 100], opt: 0 | % | expost | Sec. 4.3 [21] |
| Availability | Adherence S_a | [0; 100], opt: 100 | % | expost | Sec. 5.3, appl. on pub. clouds |
| | Strictness S_s | [0; ∞], opt: | % | der. from SLO def. | |
| Operational Risk | Provision r_p | [-1; 1], opt: 0 | None | expost | Sec. 6.5 |
| | Contention r_c | [0; 1], opt: 0 | None | | |
| | Service r_e | [0; 1], opt: 0 | None | | |
| | System r_s | [0; 1], opt: 0 | None | | |

Table 9. Metric summary.
if its definition is complete and its behavior cannot be influenced by proprietary interests of different
vendors or manufacturers aiming to gain competitive advantage by measuring the metric in a
way that favors their products or services. As our proposed metrics come with a mathematical
definition and a measurement methodology, we claim, at this point of research, that it should be
possible to verify that a measurement was conducted according to given run rules. Still, there
could be ways to manipulate the results that we are not yet aware of. Repeatability of metric
measurements implies that if the metric is measured multiple times using the same procedure, the
same value is measured. In practice, small differences are usually acceptable; however, ideally, a
metric’s measurement approach should be deterministic when measured multiple times. In this
work, we can only partially demonstrate that repeatability is possible in a controlled experiment
environment and to a certain degree in the public cloud domain.
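One simple way to quantify how small the differences between repeated measurements are is the coefficient of variation. The sketch below is an illustration under assumptions: the helper names and the 5% tolerance are hypothetical choices for this example and are not thresholds proposed in this work.

```python
import statistics


def coefficient_of_variation(measurements):
    """Sample standard deviation relative to the absolute mean of the measurements."""
    mean = statistics.mean(measurements)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return statistics.stdev(measurements) / abs(mean)


def is_repeatable(measurements, tolerance=0.05):
    """Treat a metric as repeatable if repeated values vary by at most `tolerance`.

    The default tolerance of 5% is an assumed, illustrative threshold.
    """
    return coefficient_of_variation(measurements) <= tolerance


if __name__ == "__main__":
    print(is_repeatable([10.0, 10.1, 9.9]))   # values from a controlled setup
    print(is_repeatable([10.0, 15.0, 5.0]))   # values from a noisy environment
```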
In this section, we first discuss general challenges of conducting performance experiments in
cloud environments, before we focus on the emerging challenges for metrics and measurement
methodologies in the context of cloud microservice technologies.
8.1 On Challenges of Performance Experiments in Clouds
Obtaining performance metrics from cloud computing environments requires running experiments.
Unfortunately, measured performance can vary significantly during the execution of an individual
experiment and from one execution to the next [ ]. These variations may be due to
the use of different physical machines with different performance characteristics (e.g., faster
processors) or to interference caused by other VMs that may be using the same shared resources
(e.g., disks, processors, or networks). As a result, significant care is required to obtain valid and
meaningful measures of performance. Previous work [1] has shown that commonly used approaches
to conducting experiments in clouds can lead to unfair comparisons and invalid conclusions. The authors
propose a methodology called randomized multiple interleaved trials (RMIT), in which the
competing alternatives being compared are interleaved in a round-robin fashion. For
example, the first round might run alternatives A, B, and then C; the order is then re-randomized
for each subsequent round, so the second round might run B, C, and then A. Clearly, because of
variability, multiple executions are required to obtain metrics, and at least some basic statistical
analysis, such as computing and reporting means and confidence intervals, is required.
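The interleaving and the basic statistics described above can be sketched as follows. This is a minimal illustration, not the reference implementation of RMIT: the function names are hypothetical, every round (including the first) is shuffled, and the confidence interval uses a normal approximation with z = 1.96, whereas a Student t quantile would be more appropriate for small sample counts.

```python
import random
import statistics


def rmit_schedule(alternatives, rounds, seed=None):
    """Return one randomly ordered pass over all alternatives per round."""
    rng = random.Random(seed)
    schedule = []
    for _ in range(rounds):
        order = list(alternatives)
        rng.shuffle(order)  # re-randomize the order for every round
        schedule.append(order)
    return schedule


def mean_with_confidence_interval(samples, z=1.96):
    """Mean and an approximate 95% confidence interval (normal approximation)."""
    mean = statistics.mean(samples)
    half_width = z * statistics.stdev(samples) / len(samples) ** 0.5
    return mean, mean - half_width, mean + half_width


if __name__ == "__main__":
    for round_order in rmit_schedule(["A", "B", "C"], rounds=3, seed=42):
        print(round_order)
    # Hypothetical response times (ms) of one alternative across four rounds.
    print(mean_with_confidence_interval([10.0, 12.0, 14.0, 16.0]))
```

Because each round contains every alternative exactly once, slow drifts in the environment affect all alternatives to a similar degree instead of biasing whichever alternative happens to run last.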
The reliability of the resulting measures, when based on system-level metrics like average resource
utilization, poses another challenge for conducting performance experiments. The dynamic mapping
of resource activities between the virtual and the physical layers makes the profiling
of system-level metrics difficult. This is exacerbated by cross-resource utilization relationships
(e.g., due to I/O scheduling at the hypervisor level) and resource multiplexing across different VMs
running on the same, possibly overbooked hardware [ ]. For comparisons, the position and type
of measurement sensors need to be well-defined and consistent across experiments.
8.2 Emerging Challenges
In recent years, the industry has increasingly been moving towards microservice and
serverless or Function-as-a-Service (FaaS) architectures, aiming to take advantage of the flexibility,
scalability, and time-to-market in cloud environments [ ]. The growing popularity of
microservice and FaaS architectures is introducing new challenges and opportunities to the field of
non-functional performance evaluation in cloud computing environments. In this section, potential
extensions, assumptions, and changes in applicability of the cloud metrics are discussed to address
the concerns and challenges specific to these new types of architectures.
Adaptations for Resource Nesting. The increasing popularity of microservices coincides with
technological advances in containerization. In contrast to traditional VM-based virtualization,
containerization technology, such as LXC or Docker, offers reduced performance overhead by
executing applications as isolated processes, commonly referred to as containers, directly on the
host operating system [ ]. As a result, systems are increasingly deployed as a set of smaller
microservices in a container cloud. However, due to the security concerns associated with this
relatively immature technology, it is currently a common practice to deploy microservice containers
on top of a VM. Moreover, to take advantage of the container-based, environment-agnostic cloud
infrastructure, FaaS functions are typically deployed on top of containers [ ]. This resource nesting,
consisting of FaaS functions, containers, VMs, and physical resources, introduces challenges for
many existing metrics. Whereas metrics for the evaluation of cloud environments previously focused
on a single virtual layer (VMs) on top of physical hardware, adaptations will need to be considered to
ensure that quality aspects cover multiple virtual layers. One implication of this increasing resource
nesting concerns elasticity: elasticity metrics typically assume that underlying resources are
available and can be requested with a consistent provisioning delay. However, this assumption
no longer holds true: scaling higher-level resources depends on the scaling capabilities of the
underlying, lower-level resources. To scale up a FaaS function, in an optimistic scenario, the
underlying VM and container are already provisioned, leading to a very short provisioning time.
However, in alternative scenarios, the VM and the container might need to be scaled first before
sufficient resources are available for the function to scale. Therefore, performance metrics of systems
composed of these nested resources should account not only for their own performance, but also
for that of the underlying resources.
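The dependence of the provisioning time on the underlying layers can be made concrete with a small sketch. It assumes, for illustration only, that layers are provisioned sequentially from the bottom up and that each layer has a fixed delay; the function name and all delay values are hypothetical.

```python
def nested_provisioning_delay(layers):
    """Sum the provisioning delays of all layers that are not yet provisioned.

    `layers` is ordered bottom-up (e.g., VM, container, function); each item
    is a tuple (name, delay_seconds, is_ready). Sequential provisioning is an
    assumption of this sketch.
    """
    return sum(delay for _name, delay, is_ready in layers if not is_ready)


if __name__ == "__main__":
    # Optimistic scenario: VM and container already exist.
    optimistic = [("vm", 60.0, True), ("container", 5.0, True), ("function", 0.2, False)]
    # Alternative scenario: every layer has to be scaled first.
    cold = [("vm", 60.0, False), ("container", 5.0, False), ("function", 0.2, False)]
    print(nested_provisioning_delay(optimistic), nested_provisioning_delay(cold))
```

The two scenarios differ by orders of magnitude, which is why an elasticity metric that assumes a single, consistent provisioning delay can be misleading for nested resources.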
Adaptations for Resource Prioritization. One of the important advantages of cloud computing is
the notion of "on-demand" resources. This powerful notion has allowed cloud users to avoid large
upfront costs and static maintenance costs by only paying the cloud provider for the resources they use.
By effectively managing and multiplexing its resources, a cloud provider can make a profit while
reducing the costs for the cloud users. However, offering the same priority for every application
might not be the most effective approach. A background application collecting log data would most likely
not have the same urgency or priority as a business-critical web server. Resource cost, rather than
absolute, immediate performance, is the most important factor for a background application,
while for the business-critical web server the performance is required at any expense. This
is especially true for microservices, where the diverse set of microservices all have
different performance/cost trade-offs. Large providers are experimenting with the notion of spot
markets, where cloud users can get resources with fewer performance guarantees at a reduced
price. For example, Amazon Spot Instances allow users to bid on spare capacity in Amazon's data centers
and in return allow Amazon to pre-empt the resources when the capacity is needed [ ]. On the
other hand, Microsoft Azure and Google Cloud offer a substantial, fixed discount on pre-emptible
VMs compared to regular VMs. Yet current metrics focus on optimizing performance, elasticity,
and similar properties without regard for the costs. Though metrics for absolute, functional
characteristics, such as performance, will continue to be valuable, adaptations will be needed to reflect
how well systems can match the expected functional and non-functional objectives. Others have
started to investigate this performance/cost trade-off [ ], but more research is needed on how to
adapt existing metrics and methodologies to take the costs of performance into account.
Adaptations for Resource Fragmentation. The increased adoption of the cloud by various industries
comes with a wide variety of different types of workloads, which also introduce more, and more specific,
demand for various types of resources [ ]. High-performance resources, such as VMs with large
memory allocations and CPU shares, or specific resources, such as machines with GPUs, generally
cost more than less performant machines. This diverse demand requires providers to be flexible
in the resource options that they offer. However, this diverse demand also leads to physical resources
being sub-optimally utilized, as it causes resource fragmentation. For example, a single application
might take up most of the CPU while under-utilizing the memory. Due to the inability to change
resource characteristics on demand, this prevents any other application from making use of
the resource. In the context of microservices, this resource fragmentation would increase even more,
as microservices are generally optimized to focus on a small set of related responsibilities, leading
to each microservice being a CPU-heavy, I/O-heavy, or network-heavy resource consumer.
Metrics for elasticity and performance typically consider only a single resource. When they do
consider multiple resources, they fail to take into account over-provisioning in multiple resource
dimensions. Therefore, adaptations should become part of performance evaluations, to put more
emphasis on how effectively multiple resource dimensions are taken into account.
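To illustrate what taking multiple resource dimensions into account could look like, the sketch below reports the over-provisioned share per dimension and a simple imbalance indicator. Both indicators and all names are assumptions made for this example; they are not metrics proposed in this work.

```python
def utilization(capacity, usage):
    """Utilized fraction of each resource dimension (usage defaults to 0)."""
    return {dim: usage.get(dim, 0.0) / capacity[dim] for dim in capacity}


def over_provisioning(capacity, usage):
    """Unused fraction of each resource dimension."""
    return {dim: 1.0 - share for dim, share in utilization(capacity, usage).items()}


def utilization_imbalance(capacity, usage):
    """Gap between the most and the least utilized dimension (illustrative indicator)."""
    shares = utilization(capacity, usage).values()
    return max(shares) - min(shares)


if __name__ == "__main__":
    # Hypothetical machine hosting a CPU-heavy application.
    capacity = {"cpu_cores": 8.0, "memory_gb": 32.0}
    usage = {"cpu_cores": 7.0, "memory_gb": 8.0}
    print(over_provisioning(capacity, usage))
    print(utilization_imbalance(capacity, usage))
```

In this example, 75% of the memory remains unused although the CPU is almost exhausted, which is the fragmentation effect that a single-resource metric would not reveal.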
Adaptations for Dynamic Workloads. Current approaches in performance evaluation of cloud
applications assume a relatively static workload. However, with the rising popularity of DevOps
practices, the automated, frequent deployment of new versions of services, referred to as
Continuous Deployment (CD) [ ], is increasingly becoming an industry standard. According
to the 2015 State of DevOps Report [27], a significant portion of organizations already conduct
multiple software deployments per day. For example, the retailer Otto conducts more than 500
deployments per week [ ]. These rapid changes in the behaviour and resource consumption of
these services cause variations in the workload that affect the entire system under test, making it
more difficult to predict workloads based on historical data. Accordingly, the methodology will
need to be adapted to focus more on evaluating the initial "warm-up time" of mechanisms. The
warm-up time is the time needed to obtain the workload characterization that a mechanism or
policy requires to perform close to optimally. For example, in auto-scaling research, many policies make use
of historical data to evaluate scaling decisions using time series analysis [ ]. These policies need
historical data to function optimally. With the frequent deployments and changing configurations
of services, the size of the workload that needs to be recorded for a policy to function
near-optimally becomes a concern. Yet this concern is not represented in existing cloud metrics.
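A warm-up time in the sense described above could be extracted from a recorded quality score of a policy. The sketch below is a hypothetical illustration: it assumes a score where higher is better, a known optimal score, and a tolerance of 10%, and it reports the first timestamp from which the score stays within that tolerance until the end of the measurement.

```python
def warm_up_time(timestamps, scores, optimal, tolerance=0.1):
    """First timestamp from which the score stays within `tolerance` of the optimum.

    Returns None if the policy never settles close to the optimum. Score
    semantics (higher is better) and the tolerance are assumptions of this sketch.
    """
    threshold = optimal * (1.0 - tolerance)
    settled_since = None
    for t, score in zip(timestamps, scores):
        if score >= threshold:
            if settled_since is None:
                settled_since = t
        else:
            settled_since = None  # a later drop resets the warm-up
    return settled_since


if __name__ == "__main__":
    # Hypothetical score trace of an auto-scaling policy after a new deployment.
    timestamps = [0, 10, 20, 30, 40]
    scores = [0.2, 0.95, 0.5, 0.92, 0.97]
    print(warm_up_time(timestamps, scores, optimal=1.0))
```

Reporting such a value alongside steady-state metrics would show how long a history-based policy needs after each deployment before its results become representative.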
8.3 Discussion
The industry is rapidly transforming to an increasingly dynamic, on-demand model. This trend
raises a number of challenges regarding the current cloud metrics and methodology. First, adaptations
to existing metrics will need to be considered to address increasing resource nesting. Second, there
is an increased focus on applications being cost-effective, meaning that the operational costs
should match the desired performance. Besides focusing on the performance of applications, research
is needed on adapting the metrics to consider the performance/cost ratio. Third,
adaptations to metrics regarding resource fragmentation should be considered. Finally, these
dynamic systems are accompanied by dynamic workloads, i.e., workloads whose characteristics
change significantly over time. The methodology for evaluating policies, which may depend
on long-running static workloads for profiling purposes, will need to be reevaluated.
We identify two distinct and large bodies of related work: one focusing on the four categories
of metrics we address in Sections 3 through 6, the other on general cloud performance
measurement and assessment. For studies focusing on the four categories of metrics addressed in
this work, we have already discussed the similarities and differences with our metrics in the
corresponding sections: elasticity (Section 3), performance isolation (Section 4), availability (Section 5),
and operational risk (Section 6). Overall, this work summarizes and extends the body of knowledge
available for these metrics.
The general work includes much innovative work during the early years, followed since around
2010 by synoptic material. In the early work, innovation focused on metrics for quantifying
overheads [ ], basic operational characteristics [ ], specific functionality such as
scheduling [ ] and (an emerging topic in clouds) auto-scaling [ ], and frameworks for
comparing clouds across multiple metrics [ ]. In the following, we review some of the important
results from the synoptic material, which our study complements with new metrics, new synoptic
material, and guidelines on how to measure specific metrics in practice.
Iosup et al. [ ] propose a framework and the tools to compare clouds for HPC and many-task
workloads. The framework considers metrics for basic operational characteristics, overheads,
traditional performance, scalability, and variability, and has been applied to four commercial clouds
over a long period of time. Our study greatly extends the set of earlier metrics.
Garg et al. [29] propose a framework to monitor and rank cloud computing services based on a
set of given metrics. Their goal is to allow customers to evaluate and rank cloud offerings based on
the provider’s ability to meet the user’s essential and non-essential requirements, based on metrics
for QoS attributes in the SMI framework of the now defunct Cloud Service Measurement Index
Consortium (CSMIC) [ ]. They frame the problem using multiple criteria decision making (MCDM)
and propose a ranking approach based on an Analytic Hierarchy Process (AHP), which reduces bias
in decision making. The problem of ranking providers based on multiple metrics is orthogonal to
the choice of metrics to use for each non-functional quality aspect of the environment.
Li et al. [ ] collect metrics used in prior cloud projects and construct a metrics catalogue. We
revise the compiled list and contrast it with our proposed metrics in Sections 3 through 6.
Becker et al. [8] use the goal question metric (GQM) method to derive metrics for scalability,
elasticity, and efficiency, with the goal of using these metrics for requirements/SLO engineering. In
contrast, our goal is to make various cloud offerings and technologies comparable to each other,
and to provide a common understanding among all cloud stakeholders.
Aceto et al. and Fatema et al. [2, 22] survey existing cloud monitoring tools, identifying
requirements, strengths, and weaknesses of these tools. Our study extends this direction by analyzing
how to measure specific metrics, thus providing the conceptual tools to assess whether the metrics
reported by existing tools and benchmarks are appropriate and complete. Moreover, the metrics we
propose in this work could be added to the existing tools in the future.
Because cloud computing services, and the systems underlying them, already account for a large
fraction of the information and communication technology (ICT) market, understanding their
non-functional properties is increasingly important. Building standardized cloud benchmarks
for these tasks could lead to better system design, tuning opportunities, and eased cloud service
selection. Responding to this need, we highlight the relevance of non-functional system properties
emerging in the context of cloud computing, namely elasticity, performance isolation, availability
and operational risk. We discuss these properties in depth and describe existing or new metrics that
can quantify these properties. Thus, for these four properties, we lay a foundation for benchmarking
cloud computing settings. Furthermore, we propose a hierarchical taxonomy for cloud-relevant
metrics and discuss emerging challenges.
As future activities, we plan to conduct comprehensive real-world experiments that underline
the applicability and usefulness of the proposed metrics, while also refining and supporting the
standardization of the corresponding measurement approaches. As a next step, we are working on an
extensive review of existing cloud-relevant benchmarks and connected domains like big data, web
services, and graph processing.
[1] A. Abedi and T. Brecht. Conducting repeatable experiments in highly variable cloud computing environments. In
ACM/SPEC ICPE 2017, 2017.
[2] G. Aceto et al. Cloud monitoring: A survey. Computer Networks, 57(9), 2013.
[3] R. F. Almeida, F. R. Sousa, S. Lifschitz, and J. C. Machado. On Defining Metrics for Elasticity of Cloud Databases. In
Brazilian Symposium on Databases, 2013.
[4] Amazon. EC2 Compute SLA, 2017.
[5] A. Balalaie, A. Heydarnoori, and P. Jamshidi. Microservices architecture enables devops: migration to a cloud-native
architecture. IEEE Software, 33(3):42–52, 2016.
[6] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. L. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield. Xen and the art
of virtualization. In SOSP, pages 164–177, 2003.
[7] A. Bauer, J. Grohmann, N. Herbst, and S. Kounev. On the Value of Service Demand Estimation for Auto-Scaling. In
GI/ITG MMB 2018. Springer, February 2018.
[8] M. Becker et al. Systematically deriving quality metrics for cloud computing systems. In ACM/SPEC ICPE, 2015.
[9] D. Bernstein. Containers and cloud: From lxc to docker to kubernetes. IEEE Cloud Computing, 1(3):81–84, 2014.
[10] C. Binnig, D. Kossmann, T. Kraska, and S. Loesing. How is the Weather Tomorrow?: Towards a Benchmark for the
Cloud. In DBTest 2009, DBTest ’09, pages 9:1–9:6, New York, NY, USA, 2009. ACM.
[11] M. Boniface, B. Nasser, et al. Platform-as-a-service architecture for real-time quality of service management in
clouds. In ICIW 2010, pages 155–160, May 2010.
[12] D. Chandler et al. Report on Cloud Computing to the OSG Steering Committee. Technical report, Apr. 2012.
[13] B. Clark, T. Deshane, E. Dow, S. Evanchik, M. Finlayson, J. Herne, and J. N. Matthews. Xen and the art of repeated
research. In USENIX ATC, pages 135–144, 2004.
[14] CloudSleuth. Cloudsleuth monitoring network. sleuth/, 2017.
[15] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears. Benchmarking cloud serving systems with YCSB. In ACM
SoCC 2010, pages 143–154, New York, NY, USA, 2010. ACM.
M. D. de Assuncao, A. di Costanzo, and R. Buyya. Evaluating the cost-benet of using cloud computing to extend the
capacity of clusters. In HPDC, pages 141–150, New York, NY, USA, 2009. ACM.
[17] K. Djemame and more. Introducing Risk Management into the Grid. IEEE eScience 2006, 2006.
[18] T. Dory and more. Measuring Elasticity for Cloud Databases. In ACM/IEEE CCGrid 2011, 2011.
L. Duboc, D. Rosenblum, and T. Wicks. A Framework for Characterization and Analysis of Software System Scalability.
In ACM SIGSOFT ESEC-FSE 2007, pages 375–384. ACM, 2007.
European Commission. Uptake of cloud in europe. Final Report. Digital Agenda for Europe report. Publications Oce
of the European Union, Luxembourg, 2014.
A. Evangelinou, M. Ciavotta, D. Ardagna, A. Kopaneli, G. Kousiouris, and T. Varvarigou. Enterprise applications cloud
rightsizing through a joint benchmarking and optimization approach. Elsevier FGCS, 2016.
K. Fatema and more. A survey of cloud monitoring tools: Taxonomy, capabilities and objectives. Journal of Parallel
and Distributed Computing, 74(10), 2014.
[23] W. Felter, A. Ferreira, R. Rajamony, and J. Rubio. An updated performance comparison of virtual machines and Linux
containers. In IEEE ISPASS 2015, pages 171–172. IEEE, 2015.
[24] A. J. Ferrer, F. Hernández, and more. OPTIMIS: A holistic approach to cloud service provisioning. Elsevier FGCS, 2012.
[25] P. J. Fleming and J. J. Wallace. How Not to Lie with Statistics: The Correct Way to Summarize Benchmark Results.
Commun. ACM, 29(3):218–221, Mar. 1986.
[26] E. Folkerts, A. Alexandrov, K. Sachs, A. Iosup, V. Markl, and C. Tosun. Benchmarking in the Cloud: What It Should,
Can, and Cannot Be. In Selected Topics in Performance Evaluation and Benchmarking, volume 7755 of LNCS. 2012.
[27] N. Forsgren Velasquez and more. State of devops report 2015. Puppet Labs and IT Revolution, 2015.
[28] M. Fowler. Continuous delivery, 2013. Accessed: 2017-05-17.
[29] S. K. Garg and more. A framework for ranking of cloud computing services. Elsevier FGCS, 29(4), 2013.
[30] Google. Compute Level SLA.
[31] W. Hasselbring and G. Steinacker. Microservice architectures for scalability, agility and reliability in e-commerce. In
IEEE ICSA 2018 Workshops, pages 243–246. IEEE, April 2017.
[32] N. Herbst, S. Kounev, and R. Reussner. Elasticity in Cloud Computing: What it is, and What it is Not. In USENIX ICAC
2013. USENIX, June 2013.
[33] N. Herbst, S. Kounev, A. Weber, and H. Groenda. Bungee: An elasticity benchmark for self-adaptive IaaS cloud
environments. In SEAMS 2015, pages 46–56, Piscataway, NJ, USA, 2015. IEEE Press.
[34] K. Huppler. Performance Evaluation and Benchmarking. chapter The Art of Building a Good Benchmark, pages 18–30.
Springer-Verlag, Berlin, Heidelberg, 2009.
[35] K. Huppler. Benchmarking with Your Head in the Cloud. In R. Nambiar and M. Poess, editors, Topics in Performance
Evaluation, Measurement and Characterization, volume 7144 of Lecture Notes in Computer Science, pages 97–110.
Springer Berlin Heidelberg, 2012.
[36] IDC. Worldwide and regional public IT cloud services: 2016-2020 forecast. IDC Tech Report. [Online] Available:, 2016.
[37] A. Ilyushkin and more. An experimental performance evaluation of autoscaling policies for complex workflows. In
ACM/SPEC ICPE 2017, pages 75–86. ACM.
[38] A. Iosup, S. Ostermann, N. Yigitbasi, R. Prodan, T. Fahringer, and D. H. J. Epema. Performance analysis of cloud
computing services for many-tasks scientific computing. IEEE TPDS, 22(6):931–945, 2011.
[39] C. Isci, J. Hanson, I. Whalley, M. Steinder, and J. Kephart. Runtime demand estimation for effective dynamic resource
management. pages 381–388, April 2010.
[40] S. Islam, K. Lee, A. Fekete, and A. Liu. How a Consumer Can Measure Elasticity for Cloud Platforms. In ACM/SPEC
ICPE 2012, ICPE ’12, pages 85–96, New York, NY, USA, 2012. ACM.
[41] B. Jennings and R. Stadler. Resource Management in Clouds: Survey and Research Challenges. Journal of Network and
Systems Management, pages 567–619, 2015.
[42] P. Jogalekar and M. Woodside. Evaluating the scalability of distributed systems. IEEE TPDS, 11:589–603, 2000.
[43] G. Kousiouris, D. Kyriazis, S. Gogouvitis, A. Menychtas, K. Konstanteli, and T. Varvarigou. Translation of
application-level terms to resource-level attributes across the cloud stack layers. In IEEE ISCC 2011, pages 153–160, June 2011.
[44] G. Kousiouris and more. A multi-cloud framework for measuring and describing performance aspects of cloud services
across different application types. In MultiCloud, 2014.
[45] R. Krebs, C. Momm, and S. Kounev. Metrics and Techniques for Quantifying Performance Isolation in Cloud
Environments. Elsevier SciCo, Vol. 90, Part B:116–134, 2014.
[46] M. Kuperberg and more. Defining and Quantifying Elasticity of Resources in Cloud Computing and Scalable Platforms.
Technical report, KIT, Germany, 2011.
[47] P. Leitner and J. Cito. Patterns in the Chaos - A Study of Performance Variation and Predictability in Public IaaS
Clouds. ACM Trans. Internet Technol., 16(3), 2016.
[48] V. Lesch, A. Bauer, N. Herbst, and S. Kounev. FOX: Cost-Awareness for Autonomic Resource Management in Public
Clouds. In ACM/SPEC ICPE 2018, New York, NY, USA, April 2018. ACM.
[49] A. Li, X. Yang, S. Kandula, and M. Zhang. CloudCmp: Comparing Public Cloud Providers. In ACM SIGCOMM IMC
2010, pages 1–14, New York, NY, USA, 2010. ACM.
[50] Z. Li, L. O’Brien, H. Zhang, and R. Cai. On a Catalogue of Metrics for Evaluating Commercial Cloud Services. In
ACM/IEEE CCGrid 2012, pages 164–173, Sept 2012.
[51] T. Lorido-Botran, J. Miguel-Alonso, and J. A. Lozano. A Review of Auto-scaling Techniques for Elastic Applications in
Cloud Environments. Journal of Grid Computing, 12(4):559–592, 2014.
[52] L. Lu and more. Untangling mixed information to calibrate resource utilization in virtual machines. In ACM ICAC
2001, pages 151–160, New York, NY, USA. ACM.
[53] M. Mao, J. Li, and M. Humphrey. Cloud auto-scaling with deadline and budget constraints. In ACM/IEEE CCGrid, 2010.
[54] A. Menon, J. R. Santos, Y. Turner, G. J. Janakiraman, and W. Zwaenepoel. Diagnosing performance overheads in the
Xen virtual machine environment. In VEE, pages 13–23, 2005.
[55] Microsoft. Azure compute level SLA. us/support/legal/sla/virtual-machines/v1_2/, 2017.
[56] M. R. Palankar, A. Iamnitchi, M. Ripeanu, and S. Garfinkel. Amazon S3 for science grids: a viable solution? In DADC,
pages 55–64, 2008.
[57] A. V. Papadopoulos and more. Peas: A performance evaluation framework for auto-scaling strategies in cloud
applications. ACM Trans. Model. Perform. Eval. Comput. Syst., 1(4):15:1–15:31, Aug. 2016.
[58] V. Persico, P. Marchetta, A. Botta, and A. Pescapé. Measuring network throughput in the cloud: The case of Amazon
EC2. Computer Networks, 93, 2015.
[59] V. Persico, P. Marchetta, A. Botta, and A. Pescapé. On network throughput variability in Microsoft Azure cloud. In
IEEE Global Communications Conference (GLOBECOM), 2015.
[60] D. C. Plummer and more. Study: Five Refining Attributes of Public and Private Cloud Computing. Technical report,
Gartner, 2009.
[61] M. Roberts. Serverless architectures., 2016. Accessed: 2017-05-17.
[62] J. Schad, J. Dittrich, and J.-A. Quiané-Ruiz. Runtime Measurements in the Cloud: Observing, Analyzing, and Reducing
Variance. Proc. VLDB Endow., 2010.
[63] D. Shawky and A. Ali. Defining a Measure of Cloud Computing Elasticity. In ICSCS 2012, pages 1–5, Aug 2012.
[64] S. Shen, A. Iosup, and more. An availability-on-demand mechanism for datacenters. In IEEE/ACM CCGrid, pages
495–504, 2015.
[65] J. Siegel and J. Perdue. Cloud services measures for global use: The service measurement index (SMI). In 2012 Annual
SRII Global Conference, 2012.
[66] B. Suleiman. Elasticity Economics of Cloud-Based Applications. In IEEE SCC 2012, pages 694–695, Washington, DC,
USA, 2012. IEEE Computer Society.
[67] L. Tang, J. Mars, N. Vachharajani, R. Hundt, and M. L. Soffa. The impact of memory subsystem resource sharing on
datacenter applications. ACM SIGARCH Computer Architecture News, 2011.
[68] C. Tinnefeld, D. Taschik, and H. Plattner. Quantifying the Elasticity of a Database Management System. In DBKDA
2014, pages 125–131, 2014.
[69] E. van Eyk, A. Iosup, S. Seif, and M. Thömmes. The SPEC cloud group’s research vision on FaaS and serverless
architectures. In International Workshop on Serverless Computing, pages 1–4. ACM, 2017.
[70] D. Villegas, A. Antoniou, S. M. Sadjadi, and A. Iosup. An analysis of provisioning and allocation policies for
infrastructure-as-a-service clouds. In ACM/IEEE CCGrid 2012, pages 612–619, 2012.
[71] J. von Kistowski, N. Herbst, S. Kounev, H. Groenda, C. Stier, and S. Lehrig. Modeling and Extracting Load Intensity
Profiles. ACM TAAS, 11(4):23:1–23:28, January 2017.
[72] L. Wang, J. Zhan, W. Shi, Y. Liang, and L. Yuan. In cloud, do MTC or HTC service providers benefit from the economies of
scale? In MTAGS, SC Workshops, 2009.
[73] J. Weinman. Time is Money: The Value of “On-Demand”.
Time_Is_Money.pdf, 2011. (accessed July, 2017).
[74] N. Yigitbasi, A. Iosup, D. H. J. Epema, and S. Ostermann. C-meter: A framework for performance analysis of computing
clouds. In ACM/IEEE CCGrid, pages 472–477, 2009.
[75] Q. Zhang, Q. Zhu, and R. Boutaba. Dynamic resource allocation for spot markets in cloud computing environments. In
IEEE UCC 2011, pages 178–185. IEEE, 2011.
[76] P. Zhou, V. Pandey, J. Sundaresan, A. Raghuraman, Y. Zhou, and S. Kumar. Dynamic tracking of page miss ratio curve
for memory management. ACM SIGOPS Operating Systems Review, 2004.
[77] S. Zhuravlev, S. Blagodurov, and A. Fedorova. Addressing shared resource contention in multicore processors via
scheduling. ASPLOS 2010, 45, 2010.
The metrics presented in this chapter are applicable for use in performance benchmarks that measure the performance without requiring internal knowledge. They are preferable in situations where different request sources use the functions of a shared system with a similar call probability and demand per request but with a different load intensity. These characteristics are typical for multi-tenant applications but can also occur in other shared resource systems. This chapter introduces the metrics and provides a case study showing how they can be used in a real-life environment.
In this chapter, we present a set of intuitively understandable metrics for characterizing the elasticity of a cloud platform including ways to aggregate them. The focus is on IaaS clouds; however, the presented approach can also be applied in the context of other types of cloud platforms. The metrics support evaluating both the accuracy and the timing aspects of elastic behavior. We discuss how the metrics can be aggregated and used to compare the elasticity of cloud platforms.
Conference Paper
Full-text available
Nowadays, to keep track with the fast changing requirements of internet applications, auto-scaling is an essential mechanism for adapting the number of provisioned resources to the resource demand. In the context of public clouds, there exist different natures of cost-models for charging resources. However, the accounted resource units and charged resource units may differ significantly due to the applied cost model. This can lead to a significant increase of charged costs when using an auto-scaler as it tries to match the demand of the application as close as possible. In the literature, several auto-scalers exist that support cost-aware scaling decisions but they introduce inherent drawbacks. In this work, this lack of existing cost-aware mechanisms is addressed by introducing a mediator between an application and the auto-scaler. This cost-aware mechanism is called FOX. It leverages knowledge of the charging model of the public cloud and reviews the scaling decisions found by the auto-scaler to reduce the charged costs to a minimum. More precisely, FOX delays or omits releases of resources to avoid additional charging costs if the resource is required in the future. Hereby, FOX is not restricted to use one specific auto-scaler but offers interfaces to use any auto-scaler. For an evalation under controlled conditions, FOX scales a multi-tier application deployed in a private cloud that is stressed with two real world workloads: BibSonomy and IBM CICS. As FOX provides an interface for auto-scalers, we evaluate the cost-aware mechanism with three state of the art auto-scalers: React, Adapt, and Reg. The experiments show that FOX is able to reduce the charged costs by 34% at maximum for the Amazon EC2 charging model. According to the cost model, FOX provisions more resources than required. This results in a decreased SLO violation rate from 28% to 2% at maximum. The accounted instance time increases at max. by 30%.
Conference Paper
Full-text available
In the context of performance models, service demands are key model parameters capturing the average time individual requests of different workload classes are actively processed. In a system under load, due to measurement interference, service demands normally cannot be measured directly, however, a number of estimation approaches exist based on high-level performance metrics. In this paper, we show that service demands provide significant benefits for implementing modern auto-scalers. Auto-scaling describes the process of dynamically adjusting the number of allocated virtual resources (e.g., virtual machines) in a data center according to the incoming workload. We demonstrate that even a simple auto-scaler that leverages information about service demands significantly outperforms auto-scalers solely based on CPU utilization measurements. This is shown by testing two approaches in three different scenarios. Our results show that the service demand-based autoscaler outperforms the CPU utilization-based one in all scenarios. Our results encourage further research on the application of service demand estimates for resource management in data centers.
Conference Paper
Full-text available
Cloud computing enables an entire ecosystem of developing, composing, and providing IT services. An emerging class of cloud-based software architectures, serverless, focuses on providing software architects the ability to execute arbitrary functions with small overhead in server management, as Function-as-a-service (FaaS). However useful, serverless and FaaS suffer from a community problem that faces every emerging technology, which has indeed also hampered cloud computing a decade ago: lack of clear terminology, and scattered vision about the field. In this work, we address this community problem. We clarify the term serverless, by reducing it to cloud functions as programming units, and a model of executing simple and complex (e.g., workflows of) functions with operations managed primarily by the cloud provider. We propose a research vision, where 4 key directions (perspectives) present 17 technical opportunities and challenges.
Conference Paper
Previous work has shown that benchmark and application performance in public cloud computing environments can be highly variable. Utilizing Amazon EC2 traces that include measurements affected by CPU, memory, disk, and network performance, we study commonly used methodologies for comparing performance measurements in cloud computing environments. The results show considerable flaws in these methodologies that may lead to incorrect conclusions. For instance, these methodologies falsely report that the performance of two identical systems differs by 38% using a confidence level of 95%. We then study the efficacy of the Randomized Multiple Interleaved Trials (RMIT) methodology using the same traces. We demonstrate that RMIT can be used to conduct repeatable experiments that enable fair comparisons in this cloud computing environment, even though changing conditions beyond the user's control make comparing competing alternatives highly challenging.
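As a rough illustration of how an RMIT-style experiment can be laid out, the sketch below builds a run schedule in which every trial executes each alternative once and the order is re-randomized per trial. It is an assumption-laden outline rather than the authors' tooling: the function name `rmit_schedule`, the `seed` parameter, and the alternative labels are hypothetical.

```python
import random


def rmit_schedule(alternatives, trials, seed=None):
    """Build a schedule of randomized, interleaved trials.

    Each of the `trials` rounds contains every alternative exactly once,
    in an order shuffled independently per round, so that slowly changing
    environmental conditions affect all alternatives similarly.
    """
    if trials < 1:
        raise ValueError("trials must be at least 1")
    rng = random.Random(seed)  # local generator keeps runs reproducible
    schedule = []
    for _ in range(trials):
        order = list(alternatives)
        rng.shuffle(order)
        schedule.append(order)
    return schedule


if __name__ == "__main__":
    # Hypothetical alternatives under comparison.
    for round_no, order in enumerate(rmit_schedule(["A", "B", "C"], 5, seed=1)):
        print(round_no, order)
```

Measurements collected this way can then be grouped per alternative across rounds before any statistical comparison is made.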
Virtual Machine (VM) environments (e.g., VMware and Xen) are experiencing a resurgence of interest for diverse uses including server consolidation and shared hosting. An application's performance in a virtual machine environment can differ markedly from its performance in a non-virtualized environment because of interactions with the underlying virtual machine monitor and other virtual machines. However, few tools are currently available to help debug performance problems in virtual machine environments. In this paper, we present Xenoprof, a system-wide statistical profiling toolkit implemented for the Xen virtual machine environment. The toolkit enables coordinated profiling of multiple VMs in a system to obtain the distribution of hardware events such as clock cycles and cache and TLB misses. We use our toolkit to analyze performance overheads incurred by networking applications running in Xen VMs. We focus on networking applications since virtualizing network I/O devices is relatively expensive. Our experimental results quantify Xen's performance overheads for network I/O device virtualization in uni- and multi-processor systems. Our results identify the main sources of this overhead which should be the focus of Xen optimization efforts. We also show how our profiling toolkit was used to uncover and resolve performance bugs that we encountered in our experiments which caused unexpected application behavior.