Capelin: Data-Driven Capacity Procurement for
Cloud Datacenters using Portfolios of Scenarios
[Technical Report on the TPDS homonym article]
Georgios Andreadis, Fabian Mastenbroek, Vincent van Beek, and Alexandru Iosup
Abstract—Cloud datacenters provide a backbone to our digital society. Inaccurate capacity procurement for cloud datacenters can
lead to significant performance degradation, denser targets for failure, and unsustainable energy consumption. Although this activity is
core to improving cloud infrastructure, relatively few comprehensive approaches and support tools exist for mid-tier operators, leaving
many planners with merely rule-of-thumb judgement. We derive requirements from a unique survey of experts in charge of diverse
datacenters in several countries. We propose Capelin, a data-driven, scenario-based capacity planning system for mid-tier cloud
datacenters. Capelin introduces the notion of portfolios of scenarios, which it leverages in its probing for alternative capacity-plans. At
the core of the system, a trace-based, discrete-event simulator enables the exploration of different possible topologies, with support for
scaling the volume, variety, and velocity of resources, and for horizontal (scale-out) and vertical (scale-up) scaling. Capelin compares
alternative topologies and for each gives detailed quantitative operational information, which could facilitate human decisions of
capacity planning. We implement and open-source Capelin, and show through comprehensive trace-based experiments it can aid
practitioners. The results give evidence that reasonable choices can be worse by a factor of 1.5-2.0 than the best, in terms of
performance degradation or energy consumption.
Index Terms—Cloud, procurement, capacity planning, datacenter, practitioner survey, simulation
1 INTRODUCTION
Cloud datacenters are critical for today’s increasingly
digital society [20, 22, 23]. Users have come to expect
near-perfect availability and high quality of service, at low
cost and high scalability. Planning the capacity of cloud in-
frastructure is a critical yet non-trivial optimization problem
that could lead to significant service improvements, cost
savings, and environmental sustainability [4]. This activity
includes short-term capacity planning, which is the
process of provisioning and allocating resources from the
capacity already installed in the datacenter, and long-term
capacity planning, which is the process of procuring machines
that form the datacenter capacity. This work focuses on
the latter, which is a process involving large amounts of
resources and decisions that are difficult to reverse.
Although many approaches to the long-term capacity-
planning problem have been published [13, 53, 66], com-
panies use much rule-of-thumb reasoning for procurement
decisions. To minimize operational risks, many such indus-
try approaches currently lead to significant overprovision-
ing [25], or miscalculate the balance between underprovi-
sioning and overprovisioning [49]. In this work, as Figure 1
depicts, we approach the problem of capacity planning for
mid-tier cloud datacenters with a semi-automated, special-
ized, data-driven tool for decision making.
We focus in this work mainly on mid-tier providers of
cloud infrastructure that operate at the low- to mid-level
tiers of the service architecture, ranging from IaaS to PaaS.
G. Andreadis, F. Mastenbroek, V. van Beek, and A. Iosup are with Electri-
cal Engineering, Mathematics & Computer Science, Delft University of
Technology, 2628 CD Delft, Netherlands.
V. van Beek is also with Solvinity, 1100 ED Amsterdam, Netherlands.
A. Iosup is also with Computer Science, Vrije Universiteit Amsterdam,
1081 HV Amsterdam, Netherlands.
Figure 1. Capelin, a new, data-based capacity planning process for
datacenters, compared against the current approach.
Compared to the extreme-scale operators Google, Facebook,
and others in the exclusive GAFAM-BAT group, the mid-tier
operators are small-scale. However, they are both numerous and responsible for much of the datacenter capacity
in modern, service-based and knowledge-driven economies.
This work addresses four main capacity planning challenges
for mid-tier cloud providers. First, the lack of published
knowledge about the current practice of long-term cloud
capacity planning. For a problem of such importance and
long-lasting effects, it is surprising that the only studies
of how practitioners make and take long-term capacity-
planning decisions are either over three decades old [44]
or focus on non-experts deciding how to externally procure
capacity for IT services [6]. A survey of expert capacity
planners could reveal new requirements.
Second, we observe the need for a flexible instrument
for long-term capacity planning, one that can address var-
ious operational scenarios. State-of-the-art tools [30, 34, 62]
and techniques [13, 24, 57] for capacity-planning operate
on abstractions that match only one vendor or focus on
simplistic problems. Although single-vendor tools, such as
VMware’s Capacity Planner [62] and IBM’s Z Performance
and Capacity Analytics tool [34], can provide good advice
for the cloud datacenters equipped by that vendor, they do
not support real-world cloud datacenters that are heteroge-
neous in both software [2][4, §2.4.1] and hardware [10, 18][4,
§3]. Yet, to avoid vendor lock-in and licensing costs, cloud
datacenters acquire heterogeneous hardware and software
from multiple sources and could, for example, combine
VMware’s, Microsoft’s, and open-source OpenStack+KVM
virtualization management technology, and complement
it with container technologies. Although linear program-
ming [63], game theory [57], stochastic search [24], and
other optimization techniques work well on simplistic
capacity-planning problems, they do not address the multi-
disciplinary, multi-dimensional nature of the problem. As
Figure 1 (left) depicts, without adequate capacity planning
tools and techniques, practitioners need to rely on rules-
of-thumb calibrated with casual visual interpretation of the
complex data provided by datacenter monitoring. This state-
of-practice likely results in overprovisioning of cloud data-
centers, to avoid operational risks [26]. Even then, evolving
customers and workloads could make the planned capacity
insufficient, leading to risks of not meeting Service Level
Agreements [1, 7], inability to absorb catastrophic failures [4,
p.37], and even unwillingness to accept new users.
Third, we identify the need for comprehensive evalu-
ations of long-term capacity-planning approaches, based
on real-world data and scenarios. Existing tools and tech-
niques have rarely been tested with real-world scenarios,
and even more rarely with real-world operational traces that
capture the detailed arrival and execution of user requests.
Furthermore, for the few thus tested, the results are only
rarely peer-reviewed [1, 53]. We advocate comprehensive
experiments with real-world operational traces and diverse
scaling scenarios to test capacity planning approaches.
Fourth and last, we observe the need for publicly
available, comprehensive tools for long-term capacity
planning. However, and in stark contrast with the many
available tools for short-term capacity planning, few pro-
curement tools are publicly available, and even fewer are
open-source. Of the available tools, none can model all the aspects needed to analyze the cloud datacenters described in §2.
We propose in this work Capelin, a data-driven,
scenario-based alternative to current capacity planning ap-
proaches. Figure 1 visualizes our approach (right column
of the figure) and compares it to current practice (left
column). Both approaches start with inputs such as work-
loads, current topology, and large volumes of monitoring
data (step 1 in the figure). From this point on, the two
approaches diverge, ultimately resulting in qualitatively
different solutions. The current practice expects a committee
of various stakeholders to extract meaning from all the input
data (step 3), which is severely hampered by the lack of decision
support tools. Without a detailed understanding of the
implications of various decisions, the final decision is taken
by committee, and it is typically an overprovisioned and
conservative approach (step 4). In contrast, Capelin adds and
semi-automates a data-driven approach to data analysis and
decision support (step 2), and enables capacity planners to take
fine-grained decisions based on curated and greatly reduced
data (step 3). With such support, even a single capacity plan-
ner can make a tailored, fine-grained decision on topology
changes to the cloud datacenter (step 4). More than a purely
technical solution, this approach can change organizational
processes. Overall, our main contributions are:
1) We design, conduct, and analyze community interviews
on capacity planning in different cloud settings (Sec-
tion 3). We use broad, semi-structured interviews, from
which we identify new, real-world requirements.
2) We design Capelin, a semi-automated, data-driven ap-
proach for long-term capacity planning in cloud datacen-
ters (Section 4). At the core of Capelin is an abstraction,
the capacity planning portfolio, which expresses sets of
“what-if” scenarios. Using simulation, Capelin estimates
the consequences of alternative decisions.
3) We demonstrate Capelin’s ability to support capacity
planners through experiments based on real-world op-
erational traces and scenarios (Section 5). We implement
a prototype of Capelin as an extension to OpenDC,
an open-source platform for datacenter simulation [36].
We conduct diverse trace-based experiments. Our ex-
periments cover four different scaling dimensions, and
workloads from both private and public clouds. They also
consider different operational factors such as the sched-
uler allocation policy, and phenomena such as correlated
failures and performance interference [42, 60, 64].
4) We release our prototype of Capelin, consisting of ex-
tensions to OpenDC 2.0 [46], as Free and Open-Source
Software (FOSS), for practitioners to use. Capelin is engi-
neered with professional, modern software development
standards and produces reproducible results.
2 A SYSTEM MODEL FOR DC OPERATIONS
In this work we assume the generic model of cloud infras-
tructure and its operation depicted by Figure 2 (next page).
Workload: The workload consists of applications execut-
ing in Virtual Machines (VMs) and containers. The emphasis
of this study is on business-critical workloads, which are
long-running, typically user-facing, and back-end enterprise
services at the core of an enterprise’s business [55, 56]. Their
downtime, or even just low Quality of Service (QoS), can
incur significant and long-lasting damage to the business.
We also consider virtual public cloud workloads in this model,
submitted by a wider user base.
The business-critical workloads we consider also include
virtualized High Performance Computing (HPC) parts.
Figure 2. Generic model for datacenter operation.
These are primarily comprised of conveniently (embarrass-
ingly) parallel tasks, e.g., Monte Carlo simulations, forming
batch bags-of-tasks. Larger HPC workloads, such as scientific
workloads from healthcare, also fit in our model.
Our system model also considers app managers, such
as the big data frameworks Spark and Apache Flink, and
machine learning frameworks such as TensorFlow, which
orchestrate virtualized workflows and dataflows.
Infrastructure: The workloads described earlier run on
physical datacenter infrastructure. Our model views data-
center infrastructure as a set of physical clusters of possibly
heterogeneous hosts (machines), each host being a node in a
datacenter rack. A host can execute multiple VM or con-
tainer workloads, managed by a hypervisor. The hypervisor
allocates computational time on the CPU between the work-
loads that request it, through time-sharing (if on the same
cores) or space-sharing (if on different cores).
We model the CPU usage of applications for discretized
time slices. Per slice, all workloads report requested CPU
time to the hypervisor and receive the granted CPU time
that the resources allow. We assume a generic memory
model, with memory allocation constant over the runtime of
a VM. As is common in industry, we allow overcommission
of CPU resources [5], but not of memory resources [55].
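To make the per-slice request/grant mechanism concrete, the sketch below shows one way a hypervisor could distribute host CPU capacity among its VMs under a fair-share policy; the function name, types, and the max-min formulation are illustrative assumptions, not the interface used by OpenDC or Capelin.

```kotlin
// Illustrative sketch (assumed names/types): grant each VM at most its demand
// and at most its fair share of the remaining host capacity for one time slice.
fun grantCpuForSlice(requestedMhz: Map<String, Double>, hostCapacityMhz: Double): Map<String, Double> {
    val granted = mutableMapOf<String, Double>()
    var remaining = hostCapacityMhz
    var contenders = requestedMhz.size
    // Visit VMs in order of increasing demand: small requests are satisfied
    // fully and free up capacity, the remaining VMs split whatever is left.
    for ((vm, demand) in requestedMhz.entries.sortedBy { it.value }) {
        val fairShare = if (contenders > 0) remaining / contenders else 0.0
        val grant = minOf(demand, fairShare)
        granted[vm] = grant
        remaining -= grant
        contenders -= 1
    }
    return granted
}
```

In such a scheme, overcommission shows up as the difference between requested and granted CPU time, summed per slice, which is the key quantity examined in the experiments of Section 5.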
Infrastructure phenomena: Cloud datacenters are com-
plex hardware and software ecosystems, in which com-
plex phenomena emerge. We consider in this work two
well-known operational phenomena, performance variabil-
ity caused by performance interference between collocated
VMs [42, 43, 60] and correlated cluster failures [8, 19, 21].
Live Platform Management (RM&S in Figure 2): We
model a workload and resource manager that performs
management and control of all clusters and hosts, and is re-
sponsible for the lifecycle of submitted VMs, including their
placement onto the available resources [3]. The resource
manager is configurable and supports various allocation poli-
cies, defining the distribution of workloads over resources.
The devops team monitors the system and responds to
incidents that the resource management system cannot self-
manage [7].
Capacity Planning: Closely related with infrastructure
and live platform management is the activity of capacity
planning. This activity is conducted periodically and/or
at certain events by a capacity planner (or committee).
The activity typically consists of first modeling the current
state of the system (including its workload and infrastruc-
ture) [47], forecasting future demand [14], deriving a capacity
decision [65], and finally calibrating and validating the deci-
sion [40]. The latter is done for QoS, possibly expressed as
detailed Service Level Agreements (SLAs) and Service Level
Objectives (SLOs). In Section 3 we analyze the current state
of practice and in Section 8 we discuss existing approaches
in literature.
Which cloud datacenters are relevant for this model?
We focus in this work on capacity planning for mid-tier
cloud infrastructures, characterized by relatively small-scale
capacity, temporary overloads being common, and a lack
of in-house tools or teams large enough to develop them
quickly. In Section 3 we analyze the current state of the
capacity planning practice in this context and in Section 8
we discuss existing approaches in related literature.
Which tools support this model? We are not aware of
analytical tools that can cope with these complex aspects.
Although tools for VM simulation exist [12, 31, 50], few sup-
port CPU over-commissioning and none outputs detailed
VM-level metrics; the same happens for infrastructure phe-
nomena. From the few industry-grade procurement tools
that have published details about their operation, none supports
the diverse workloads and phenomena considered here.
3 REAL-WORLD EXPERIENCES WITH CAPACITY
PLANNING IN CLOUD INFRASTRUCTURES
Real-world practice can deviate significantly from published
theories and strategies. In this section, we conduct and
analyze interviews with 8 practitioners from a wide range of
backgrounds and multiple countries, to assess whether this
is the case in the field of capacity planning.
3.1 Method
Our goal is to collect real-world experiences from practition-
ers systematically and without bias, yet also leave room for
flexible, personalized lines of investigation.
3.1.1 Interview type
The choice of interview type is guided by the trade-off
between the systematic and flexible requirements. A text
survey, for example, is highly suited for a systematic study,
but generally does not allow for low-barrier individual
follow-up questions or even conversations. An in-person
interview without pre-defined questions allows full flexi-
bility, but can result in unsystematic results. We use the
general interview guide approach [58], a semi-structured type
of interview that ensures certain key topics are covered but
permits deviations from the script. We conduct in-person
interviews with a prepared script of ranked questions, and
allow the interviewer the choice of which scripted questions
to use and when to ask additional questions.
Table 1
Summary of interviews. (Notation: TTD = Time to Deploy, CP = Cloud Provider, DC = Datacenter, M = Monitoring, m/y = month/year, NIT = National IT Infrastructure Provider, SA = Spreadsheet Analysis.)
Int. | Role(s) | Backgr. | Scale | Scope | Tooling | Workload Comb. | Frequency | TTD
1 | Researcher | CP | rack | multi-DC | M | combined | 3m, ad-hoc | ?
2 | Board Member | NIT | iteration | multi-DC | - | combined | 4–5y | 12–18m
3 | Manager, Platform Eng. | CP | rack | multi-DC | M | combined | ad-hoc | 4–5m
4 | Manager | NIT | iteration | per DC | M | benchmark | 6–7y | 18m
5 | Hardware Eng. | NIT | iteration | per DC | M | benchmark | 6y | 18m
6 | Researcher | NIT | rack | multi-DC | M | separate | 6m | 12m
7 | Manager | NIT | iteration | multi-DC | M, SA | combined | 5y | 3.5–4y
3.1.2 Data collection
Our data collection process involves three steps. Firstly, we
selected and contacted a broad set of prospective interviewees
representing various kinds of datacenters, with diverse roles
in the process of capacity planning, and with diverse re-
sponsibility in the decisions.
Secondly, we conducted and recorded the interviews. Each
interview is conducted in person and digitally recorded
with the consent of the interlocutor. Interviews last be-
tween 30 and 60 minutes, depending on availability of
the interlocutors and complexity of the discussion. To help
the interviewer select questions and fit in the time-limits
imposed by each interviewee, we rank questions by their
importance and group questions broadly into 5 categories:
(1) introduction, (2) process, (3) inside factors, (4) outside
factors, and (5) summary and followup. The choice between
questions is then dynamically adjusted to give precedence to
higher-priority questions and to ensure each category is cov-
ered at least briefly. The script itself is listed in Appendix A.
Thirdly, the recordings are manually transcribed into a
full transcript to facilitate easy analysis. Because matters
discussed in these interviews may reveal sensitive opera-
tional details about the organisations of our interviewees,
all interview materials are handled confidentially. No infor-
mation that could reveal the identity of the interlocutor or
that could be confidential to an organization’s operations is
shared without the explicit consent of the interlocutor. In
addition, all raw records will be destroyed directly after this
study.
3.1.3 Analysis of Interviews
Due to the unstructured nature of the chosen interview
approach, we combine a question-based aggregated anal-
ysis with incidental findings. Our approach is inspired by
the Grounded Theory strategy set forth by Coleman and
O’Connor [15], and has two steps. First, for each transcript,
we annotate each statement made based on which questions
it is relevant to. This may be a sub-sentence remark or an
entire paragraph of text, frequently overlapping between
different questions. We augment this systematic analysis
with more general findings, including comments on unan-
ticipated topics.
Second, for each question, we traverse all transcripts and form aggregate observations. Appendix B details the full findings. From these, we
synthesize Capelin requirements (§4.1).
3.2 Observations from the Interviews
Table 1 summarizes the results of the interviews. In total,
we transcribed over 35,000 words of raw interview data, in 3 languages. We conducted 7 in-
terviews with practitioners from commercial and academic
datacenters, with roles ranging from capacity planners, to
datacenter engineers, to managers. We summarize here our
main observations:
O1: A majority of practitioners find that the process in-
volves a significant amount of guesswork and human inter-
pretation (see detailed finding (IF16) in App. B). Interlocu-
tors managing commercial infrastructures emphasize multi-
disciplinary challenges such as lease and support contracts,
and personnel considerations (IF19, IF18).
O2: In all interviews, we notice the absence of any dedicated
tooling for the capacity planning process (IF44). Instead, the
surveyed practitioners rely on visual inspection of data,
through monitoring dashboards (IF45). We observe two
main reasons for not using dedicated tooling: (1) tools
tend to under-represent the complexity of the real situation,
and (2) have high costs with many additional, unwanted
features (IF47).
O3: The organizations using these capacity planning ap-
proaches provide a range of digital services, ranging from
general IT services to specialist hardware hosting (IF2).
They run VM workloads, in both commercial and scientific
settings, and batch and HPC workloads, mainly in scientific
settings (IF3).
O4: A large variety of factors are taken into account when
planning capacity (IF34). The three named in a majority
of interviews are (1) the use of historical monitoring data,
(2) financial concerns, and (3) the lifetime and aging of
hardware (IF35).
O5: Success and failure in capacity planning are underspeci-
fied. Definitions of success differ: two interviewees see the
use of new technologies as a success (IF6), and one interprets
the absence of total failure events as a success (IF5). Chal-
lenges include chronic underutilization (IF8), increasing
complexity (IF9), and small workloads (IF10). Failures in-
clude decisions taking long (IF12), misprediction (IF13), and
new technology having unforeseen consequences (IF11).
O6: The frequency of capacity planning processes seems correlated
with the duration of core activities using it: commercial clouds
deploy within 4-5 months from the start of capacity plan-
ning, whereas scientific clouds take 1–1.5 years (IF39, IF40).
O7: We found three financial and technical factors that play a
role in capacity planning: (1) funding concerns, (2) special
hardware requests, and (3) the cost of new hardware (IF74).
In two interviews, interlocutors state that financial consid-
erations take precedence over the choice of technology, such as the
vendor and model (IF78).
O8: The human aspect of datacenter operations is emphasized
in 5 of the 7 interviews (IF85). The datacenter administrators need training (IF81), and wrong decisions in capacity planning lead to stress within the operational teams (IF83). Users also need training, to leverage heterogeneous or new resources (IF81).
Figure 3. An overview of the architecture of Capelin. Capelin is provided information on the current state of the infrastructure and assists the capacity planner in making capacity planning decisions. Labels indicate the order of traversal by the capacity planner (e.g., the first step is to use component A, the scenario portfolio builder).
O9: We observe a wide range of requirements and wishes
expressed by interlocutors about custom tools for the process.
Fundamentally, the tool should help manage the increasing
complexity faced by capacity planners (IF97). A key require-
ment for any tool is interactivity: practitioners want to be
able to interact with the metrics they see and ask questions
from the tool during capacity planning meetings (IF95).
The tool should be affordable and usable without needing
the entire toolset of the vendor (IF96). One interviewee
asks for support for infrastructure heterogeneity, to support
scientific computing (IF98).
O10: Two interviewees detail “what-if” scenarios they would
like to explore with a tool, using several dimensions (IF101):
(1) the topology, in the form of the computational and mem-
ory capacity needed, or new hardware arriving; (2) the work-
load, and especially emerging kinds; and (3) the operational
phenomena, such as failures and the live management of the
platform (e.g., scheduling and fail-over scenarios).
4 DESIGN OF CAPELIN: A CAPACITY PLANNING
SYSTEM FOR CLOUD INFRASTRUCTURE
In this section, we synthesize requirements and design
around them a capacity planning approach for cloud in-
frastructure. We propose Capelin, a scenario-based capacity
planning system that helps practitioners understand the im-
pact of alternatives. Underpinning this process, we propose
as core abstraction the portfolio of capacity planning scenarios.
4.1 Requirements Analysis
In this section, from the results of Section 3, we synthesize
the core functional requirements addressed by Capelin. In-
stead of aiming for full automation – a future objective that
is likely far off for the field of capacity planning – the em-
phasis here is on human-in-the-loop decision support [37,
P2].
(FR1) Model a cloud datacenter environment (see O2, O3,
O7, and O9): The system should enable the user to
model the datacenter topology and virtualized work-
loads introduced in Section 2.
(FR2) Enable expression of what-if scenarios (see O2,
O10): Users can express what-if scenarios with diverse
topologies, failures, and workloads. The system should
then execute the what-if scenario(s), and produce and
justify a set of user-selected QoS metrics.
(FR3) Enable expression of QoS requirements, in the form
of SLAs, consisting of several SLOs (see O2, O5, O9).
These requirements are formulated as thresholds or
ranges of acceptable values for user-selected metrics.
(FR4) Suggest a portfolio of what-if scenarios, based on
user-submitted workload traces, given topology, and
specified QoS requirements (see O2, O10). This greatly
simplifies identifying meaningful scenarios.
(FR5) Provide and explain a capacity plan, optimizing for
minimal capacity within acceptable QoS levels, as spec-
ified by FR4 (see O2, O9). The system should explain
and visualize the data sources it used to make the plan.
4.2 Overview of the Capelin Architecture
On the previous page, Figure 3 depicts an overview of the
Capelin architecture. Capelin extends OpenDC, an open-
source, discrete event simulator with multiple years of de-
velopment and operation [36]. We now discuss each main
component of the Capelin architecture, taking the perspec-
tive of a capacity planner. We outline the abstraction under-
pinning this architecture, the capacity planning portfolios,
in §4.3.
4.2.1 The Capelin Process
The frontend and backend of Capelin are embedded in
OpenDC. This enables Capelin to leverage the simulator’s
existing platform for datacenter modeling and allows for
inter-operability with other tools as they become part of the
simulator’s ecosystem. The capacity planner interacts with
the frontend of Capelin, starting with the Scenario Portfolio
Builder (component A in Figure 3), addressing FR2. This
component enables the planner to construct scenarios, using
pre-built components from the Library of Components (B).
The library contains workload, topology, and operational
building blocks, facilitating fast composition of scenarios. If
the (human) planner wants to modify historical workload
behavior or anticipate future trends, the Workload Mod-
eler (C) can model workloads and synthesize custom loads.
The planner might not always be aware of the full range
of possible scenarios. The Scenario Portfolio Generator (D)
suggests customized scenarios extending the given base-
scenario (FR4). The portfolios built in the builder can be ex-
plored and evaluated in the Scenario Portfolio Evaluator (E).
Finally, based on the results from this evaluation, the Capac-
ity Plan Generator (F) suggests plans to the planner (FR5).
4.2.2 The Datacenter Simulator
In Figure 3, the Frontend (G) acts as a portal, through which
infrastructure stakeholders interact with its models and ex-
periments. The Backend (H) responds to frontend requests,
acting as intermediary and business-logic between frontend,
and database and simulator. The Database (I) manages the
state, including topology models, historical data, simulation
configurations, and simulation results. It receives inputs
from the real-world topology and monitoring services, in
the form of workload traces. The Simulator (J) evaluates
the configurations stored in the database and reports the
simulation results back to the database.
OpenDC [36, 46] is the simulation platform backing
Capelin, enabling the capacity planner to model (FR1) and
experiment (FR5) with the cloud infrastructure, interac-
tively. The software stack of this platform is composed of
a web app frontend, a web server backend, a database, and
a discrete-event simulator. This kind of simulator offers a
good trade-off between accuracy and performance, even
at the scale of mid-tier datacenters and with long-term
workloads.
4.2.3 Infrastructure
The cloud infrastructure is at the foundation of this archi-
tecture, forming the system to be managed and planned.
We consider three components within this infrastructure:
The workload (K) submitted by users, the (logical or phys-
ical) resource topology (L), and a monitoring service (M).
The infrastructure follows the system model described in
Section 2.
4.3 A Portfolio Abstraction for Cap. Planning
In this section, we propose a new abstraction, which orga-
nizes multiple scenarios into a portfolio (see Figure 4). Each
portfolio includes a base scenario, a set of candidate scenar-
ios given by the user and/or suggested by Capelin, and a set
of targets to compare scenarios. In contrast, most capacity
planning approaches in published literature are tailored
towards a single scenario—a single potential hardware ex-
pansion, a single workload type, one type of service quality
metrics. This approach does not cover the complexities that
capacity planners are facing (see Section 3.2). Our portfolio
reflects the multi-disciplinary and multi-dimensional nature
of capacity planning by including multiple scenarios and a
set of targets. We describe them, in turn.
4.3.1 Scenarios
A scenario represents a point in the capacity planning (data-
center design) space to explore. It consists of a combination
of workload, topology, and a set of operational phenomena.
Phenomena can include correlated failures, performance
variability, security breaches, etc., allowing the scenarios
to more accurately capture the real-world operations. Such
phenomena are often hard to predict intuitively during capacity planning, due to emergent behavior that can arise at scale.
Figure 4. Abstraction of a capacity planning portfolio, consisting of a base scenario, a number of candidate scenarios, and comparison targets.
The baseline for comparison in a portfolio is the base
scenario. It represents the status quo of the infrastructure
or, when planning infrastructure from scratch, it consists of
very simple base workloads and topologies.
The other scenarios in a portfolio, called candidate scenar-
ios, represent changes to the configuration that the capacity
planner could be interested in. Changes can be effected in
one of the following four dimensions: (1) Variety: qualitative
changes to the workload or topology (e.g., different arrival
patterns, or resources with more capacity); (2) Volume: quan-
titative changes to the workload or topology (e.g., more
workloads or more resources); (3) Velocity: speed-related
changes to workload or topology (e.g., faster resources); and
(4) Vicissitude combines (1)-(3) over time.
This approach to derive candidate scenarios is system-
atic, and although abstract it allows approaching many of
the practical problems discussed by capacity planners. For
example, an ongoing discussion is horizontal scaling (scale-
out) vs. vertical (scale-up) [54]. Horizontal scaling, which
is done by adding clusters and commodity machines, con-
trasts to vertical scaling, which is done by acquiring more
expensive, “beefy” machines. Horizontal scaling is typically
cheaper for the same performance, and offers a broader
failure-target (except for cluster-level failures). Yet, vertical
scaling could lower operational costs, due to fewer per-
machine licenses, fewer switch-ports for networking, and
smaller floor-space due to fewer racks. Experiment 5.2 ex-
plores this dichotomy.
4.3.2 Targets
A portfolio also has a set of targets that prescribe on what
grounds the different scenarios should be compared. Targets
include the metrics that the practitioner is interested in and
their desired granularity, along with relevant SLOs (FR3).
Following the taxonomy defined by the performance orga-
nization SPEC [29], we support both system-provider metrics
(such as operational risk and resource utilization) and or-
ganization metrics (such as SLO violation rates and perfor-
mance variability). The targets also include a time range
over which these metrics should be recorded and com-
pared.
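To illustrate, the portfolio abstraction of this section can be captured with a handful of plain data types. The sketch below is written in Kotlin, the language of the Capelin prototype, but the type and field names are hypothetical and do not mirror the actual Capelin API.

```kotlin
// Hypothetical data types sketching the portfolio abstraction of Section 4.3.
data class Scenario(
    val name: String,
    val workload: String,        // reference to a (sampled) workload trace
    val topology: String,        // reference to a candidate datacenter topology
    val phenomena: Set<String>   // e.g., correlated failures, performance interference
)

data class Slo(val metric: String, val threshold: Double)   // acceptable bound for one metric

data class Targets(
    val metrics: List<String>,   // e.g., overcommitted CPU cycles, power consumption
    val slos: List<Slo>,
    val horizonDays: Int         // time range over which the metrics are compared
)

data class Portfolio(
    val base: Scenario,              // the status quo, or a simple from-scratch baseline
    val candidates: List<Scenario>,  // variations along variety, volume, velocity, vicissitude
    val targets: Targets
)
```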
Table 2
Experiment configurations. A legend of the topology dimensions is provided below the table. (Notation: PI = Performance Interference, pub = public cloud trace, pri = private cloud trace; ✓ = enabled, ✗ = disabled.)
Sec. | Focus | Trace | Loads | Failures | PI | Alloc. Policy
§5.2 | Hor. vs. Ver. | pri | sampled | ✓ | ✓ | active-servers
§5.3 | Velocity | pri | sampled | ✓ | ✓ | active-servers
§5.4 | Op. Phen. | pri | original | ✗/✓ | ✗/✓ | all
§5.5 | Workloads | pri / pub | sampled | ✓ | ✗ | active-servers
Legend of topology dimensions: Mode = {replace, expand}; Quality = {volume, velocity}; Direction = {horizontal, vertical}; Variance = {homogeneous, heterogeneous}.
5 EXPERIMENTS WITH CAPELIN
In this section, we explore how Capelin can be used to
answer capacity planning questions. We conduct extensive
experiments using Capelin and data derived from opera-
tional traces collected long-term from private and public
cloud datacenters.
5.1 Experiment Setup
We implement a prototype of Capelin (§5.1.1), and verify the
reproducibility of its results and that it can be run within the
expected duration of a capacity planning session (§5.1.2). All
experiments use long-term, real-world traces as input.
Our experiment design, which Table 2 summarizes, is
comprehensive and addresses key questions such as: Which
input workload (§5.1.3)? Which datacenter topologies to
consider (§5.1.4)? Which operational phenomena (§5.1.6)?
Which allocation policy (§5.1.5)? Which user- and operator-
level performance metrics to use, to compare the scenarios
proposed by the capacity planner (§5.1.7)?
The most important decision for our experiments is
which scenarios to explore. Each experiment takes in a
capacity planning portfolio (see Section 4.3), starts from a
base scenario, and aims to extend the portfolio with new
candidate scenarios and its results. The baseline is given by
expert datacenter engineers, and has been validated with hardware
vendor teams. Capelin creates new candidates by modifying
the base scenario along dimensions such as variety, volume,
and velocity of any of the scenario-components. In the
following, we experiment systematically with each of these.
5.1.1 Software prototype
We extend the open-source OpenDC simulation plat-
form [36] with capabilities for modeling and simulating
the virtualized workloads prevalent in modern clouds. We
model the CPU and memory usage of each VM along
with hypervisors deployed on each managed node. Each
hypervisor implements a fair-share scheduling model for
VMs, granting each VM at least a fair share of the available
CPU capacity, but also allowing them to claim idle capacity
of other VMs. The scheduler permits overprovisioning of
CPU resources, but not of memory resources, as is common in
industry practice. We also model a workload and resource
manager that controls the deployed hypervisors and de-
cides based on configurable allocation policies (described
in §5.1.5) to which hypervisor to allocate a submitted VM.
Our experiments and workload samples are orchestrated by
Capelin, which is written in Kotlin (a modern JVM-based
language), and processed and analyzed by a suite of tools
based on Python and Apache Spark. More detail about the
software implementation is given in Appendix C.
We release our extensions of the open-source OpenDC code-
base and the analysis software artifacts on GitHub1, as part of
release 2.0 [46]. We conduct thorough validation and tests
of both the core OpenDC and our additions, as detailed in
Section 6.
5.1.2 Execution and Evaluation
Our results are fully reproducible, regardless of the physical
host running them. All setups are repeated 32 times. The re-
sults, in files amounting to hundreds of GB in size due to the
large workload traces involved, are evaluated statistically
and verified independently. Factors of randomness (e.g.,
random sampling, policy decision making if applicable, and
performance interference modeling) are seeded with the
current repetition to ensure deterministic outcomes, and for
fairness are kept consistent across scenarios.
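A minimal sketch of this seeding scheme, with an assumed helper name:

```kotlin
import kotlin.random.Random

// Each of the 32 repetitions uses its index as the seed, so random sampling,
// policy tie-breaking, and interference draws are deterministic and identical
// across the scenarios of a portfolio. (Illustrative helper, not Capelin code.)
fun rngForRepetition(repetition: Int): Random = Random(seed = repetition)
```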
Capelin could be used during capacity planning meet-
ings. A single evaluation takes 1–2 minutes to complete,
enabled by many technical optimizations we added to the
simulator. The full set of experiments is conveniently paral-
lel and takes around 1 hour and 45 minutes to complete, on
a “beefy” but standard machine with 64 cores and 128GB
RAM; parallelization across multiple machines would re-
duce this to minutes.
5.1.3 Workload
We experiment with a business-critical workload trace from
Solvinity, a private cloud provider. The anonymized version of
this trace has been published in a public trace archive [35].
We were provided with the full, deanonymized data arti-
facts of this trace, which consists of more than 1,500 VMs
along with information on which physical resources where
used to run the trace and which VMs were allocated to
which resources. We cannot release these full traces due to
confidentiality, but release the summarized results.
The full trace includes a range of VM resource-usage mea-
surements, aggregated over 5-minute-intervals over three
months. It consumes 3,063 PFLOPs (exascale), with the mean
CPU utilization on this topology of 5.6%. This low utiliza-
tion is in line with industry, where utilization levels below 15% are common [61], and reduce the risk of not meeting SLAs.
1. https://github.com/atlarge-research/opendc
Table 3
Aggregate statistics for both workloads used in this study. (Notation: AP = Solvinity.)
Characterization | Statistic | AP | Azure
VM submissions per hour | Mean (×10³) | 31.836 | 4.547
VM submissions per hour | CoV | 134.605 | 17.188
VM duration [days] | Mean | 20.204 | 2.495
VM duration [days] | CoV | 0.378 | 3.072
CPU load [TFLOPs] | Mean (×10²) | 9.826 | 64.046
CPU load [TFLOPs] | CoV | 2.992 | 4.654
For all experiments, we consider the full trace, and
further generate three other kinds of workloads as sam-
ples (fractions) of the original workload. These workloads
are sampled from the full trace, resulting, in turn, to 306
PFLOPs (0.1 of the full trace), 766 (0.25), and 1,532 (0.5).
To sample, Capelin takes randomly VMs from the full trace
and adds their entire load, until the resulting workload
has enough load. We illustrate this in pseudocode, in Al-
gorithm 1.
For the §5.5 experiment, we further experiment with
a public cloud trace from Azure [16]. We use the most
recent release of the trace. The formats of the Azure and
the Solvinity traces are very similar, indicating a de facto
standard has emerged across the private and public cloud
communities. One difference in the level of anonymity of
the trace requires an additional assumption. Whereas the
Solvinity trace expresses CPU load as a frequency (MHz),
the Azure trace expresses it as a utilization metric ranging
from 0 to the number of cores of that VM. Thus, for the
Azure trace, in line with Azure VM types on offer we
assume a maximum frequency of 3 GHz and scale each
utilization measurement by this value. The Azure trace is
also shorter than Solvinity’s full trace, so we shorten the
latter to Azure’s length of 1 month.
We combine for the §5.5 experiment the two traces and
investigate possible phenomena arising from their interac-
tion. We disable here performance interference, because we
can only derive it for the Solvinity trace (see §5.1.6). To
combine the two traces, we first take a random sample
of 1% from the (very large) Azure trace, which results in
26,901 VMs running for one month. We then further sample
this 1%-sample, using the same method as for Solvinity’s
full trace. The full procedure is listed in Algorithm 2.
5.1.4 Datacenter topology
As explained at the start of §5.1, for all experiments we set
the topology that ran Solvinity’s original workload (the full
trace in §5.1.3) as the base scenario’s topology. This topology
is typical of industry practice. It is a subset of the complete topology of Solvinity when the full trace was
collected, but we cannot release the exact topology or the
entire workload of Solvinity due to confidentiality.
From the base scenario, Capelin derives candidate sce-
narios as follows. First, it creates a temporary topology
by choosing half of the clusters in the topology, consisting
of average-sized clusters and machines, compared to the
overall topology. Second, it varies the temporary topology,
in four dimensions: (1) the mode of operation: replacement
(removing the original half and replacing it with the mod-
ified version) and expansion (adding the modified half
to the topology and keeping the original version intact);
(2) the modified quality: volume (number of machines/cores)
and velocity (clock speed of the cores); (3) the direction of
modification: horizontal (more machines with fewer cores
each) and vertical (fewer machines with more cores each);
and (4) the kind of variance: homogeneous (all clusters in the
topology-half modified in the same way) and heterogeneous
(two thirds in the topology-half being modified in the des-
ignated way, the remaining third in the opposite way, on the
dimension being investigated in the experiment).
Each dimension is varied to ensure cores and machine
counts multiply to (at least) the same total core count as
before the change, in the modified part of the topology.
For volume changes, we differentiate between a horizontal
mode, where machines are given 28 cores (a standard size
for machines in current deployments), and a vertical mode,
where machines are given 128 cores (the largest CPU models
we see being commonly deployed in industry). For velocity
changes, we differentiate between the clock speed of the
base topology and a clock speed that is roughly 25% higher.
Because we do not investigate memory-related effects, the
total memory capacity is preserved.
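As an example of the sizing rule above, the following sketch keeps at least the same total core count while switching machine sizes; the 28-core and 128-core figures follow the text, and the function itself is illustrative.

```kotlin
import kotlin.math.ceil

// How many machines a modified topology-half needs so that its total core
// count is at least preserved, given the per-machine core count of the
// chosen direction (28 cores for horizontal, 128 for vertical).
fun machinesNeeded(totalCores: Int, coresPerMachine: Int): Int =
    ceil(totalCores.toDouble() / coresPerMachine).toInt()

// Example: 1,792 cores can be kept with 64 horizontal (28-core) machines
// or 14 vertical (128-core) machines.
```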
Last, due to confidentiality, we can describe the base and
derived topologies only in relative terms.
5.1.5 Allocation policies
We consider several policies for the placement of VMs on
hypervisors: (1) prioritizing by available memory (mem),
(2) by available memory per CPU core (core-mem), (3) by
number of active VMs (active-servers), (4) mimicking
the original placement data (replay), and (5) randomly
placing VMs on hosts (random). Policies 1-3 are actively
used in production datacenters [59].
For each policy we use two variants, following the
Worst-Fit strategy (selecting the host with the most available resource according to that policy) and the Best-Fit strategy (the inverse, so selecting the least available, labeled with the
postfix -inv in §5.4).
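The sketch below illustrates how such scored policies and their Worst-Fit/Best-Fit variants could be expressed. The types, names, and the direction of the active-servers score are assumptions made for illustration, not the OpenDC policy interfaces.

```kotlin
// Hypothetical host view and policy scoring; Worst-Fit picks the host with
// the highest score (most available), the "-inv" Best-Fit variant the lowest.
data class HostView(val id: String, val availableMemGiB: Double, val cores: Int, val activeVms: Int)

fun selectHost(hosts: List<HostView>, score: (HostView) -> Double, bestFit: Boolean = false): HostView? =
    if (bestFit) hosts.minByOrNull(score) else hosts.maxByOrNull(score)

// Example scoring functions corresponding to policies (1)-(3) above.
val memScore = { h: HostView -> h.availableMemGiB }                  // mem
val coreMemScore = { h: HostView -> h.availableMemGiB / h.cores }    // core-mem
val activeServersScore = { h: HostView -> -h.activeVms.toDouble() }  // active-servers (fewer VMs scores higher here)
```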
5.1.6 Operational phenomena
Each capacity planning scenario can include operational
phenomena. In these experiments, we consider two such
phenomena, (1) performance variability caused by perfor-
mance interference between collocated VMs, and (2) cor-
related cluster failures. Both are enabled, unless otherwise
mentioned.
We assume a common model [42, 60] of performance
interference, with a score from 0 to 1 for a given set of collo-
cated workloads, with 0 indicating full interference between
VMs contending for the same CPU, and 1 indicating non-
interfering VMs. We derive the value from the CPU Ready
fraction of a VM time-slice: the fraction of time a VM is
ready to use the CPU but is not able to, due to other VMs
occupying it. We mine the placement data of all VMs run-
ning on the base topology and collect the set of collocated
workloads along with their mean score, defined as the mean
CPU ready time fraction subtracted from 1, conditioned by
the total host CPU load at that time, rounded to one decimal.
Algorithm 1 Sampling procedure for the VMs in the Solvinity trace (as described in §5.1.3).
procedure SAMPLETRACE(vms, fraction, totalLoad)
    selected ← ∅                        ▷ The set of selected VMs
    load ← 0                            ▷ Current total load (FLOP)
    while |vms| > 0 do
        vm ← randomly removed element from vms
        vmLoad ← total load of vm
        if (load + vmLoad) / totalLoad > fraction then
            return selected
        end if
        load ← load + vmLoad
        selected ← selected ∪ {vm}
    end while
    return selected
end procedure
Algorithm 2 Sampling procedure for combining the private and public traces (as described in §5.1.3).
procedure SAMPLEMULTIPLETRACES(vmsPri, fractionPri, vmsPub, fractionPub)
    Ensure the VMs in vmsPri and vmsPub have the same length
    vmsPub ← randomly sample 0.01 of all VMs in vmsPub
    totalLoad ← total CPU load of the private trace
    vmsPriSelected ← SAMPLETRACE(vmsPri, fractionPri, totalLoad)
    vmsPubSelected ← SAMPLETRACE(vmsPub, fractionPub, totalLoad)
    return vmsPriSelected ∪ vmsPubSelected
end procedure
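For readers who prefer executable code over pseudocode, a direct Kotlin rendering of Algorithm 1 could look as follows; the VM type and load accessor are placeholders.

```kotlin
// Illustrative Kotlin version of Algorithm 1: randomly draw VMs until the
// selected load would exceed the requested fraction of the total load.
fun <V> sampleTrace(vms: MutableList<V>, fraction: Double, totalLoad: Double, loadOf: (V) -> Double): Set<V> {
    val selected = mutableSetOf<V>()
    var load = 0.0
    while (vms.isNotEmpty()) {
        val vm = vms.removeAt(vms.indices.random())  // randomly remove one VM
        val vmLoad = loadOf(vm)
        if ((load + vmLoad) / totalLoad > fraction) return selected
        load += vmLoad
        selected += vm
    }
    return selected
}
```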
Table 4
Parameters for the lognormal failure model we use in experiments. We use the natural logarithm of each value.
Parameter [Unit] | Scale | Shape
Inter-arrival time [hour] | 24 × 7 | 2.801
Duration [minute] | 60 | 60 × 8
Group size [machine-count] | 2 | 1
At simulation time, this score is activated if a VM is collocated with at least one of the others in the recorded set and the total load level on the system is at least the recorded load. The score is then applied to each collocated VM with probability 1/N, where N is the number of collocated VMs, by multiplying its requested CPU cycles by the score and granting it this (potentially lower) amount of CPU time.
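The following sketch shows how the mined score could be applied per time slice, following the description above (penalize each collocated VM with probability 1/N by scaling its requested cycles with the score). It is an illustration under these assumptions, not the Capelin implementation.

```kotlin
import kotlin.random.Random

// requestedCycles: requested CPU cycles per collocated VM for this slice;
// score: mined interference score in [0, 1] for this set and load level.
fun applyInterference(
    requestedCycles: Map<String, Double>,
    score: Double,
    rng: Random = Random.Default
): Map<String, Double> {
    val n = requestedCycles.size
    return requestedCycles.mapValues { (_, requested) ->
        // Penalize with probability 1/N; a score of 1.0 means no interference.
        if (n > 0 && rng.nextDouble() < 1.0 / n) requested * score else requested
    }
}
```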
The second phenomenon we model is cluster failures,
which are based on a common model for space-correlated
failures [21] where a failure may trigger more failures within
a short time span; these failures form a group. We consider in
this work only hardware failures that crash machines (full-
stop failures), with subsequent recovery after some dura-
tion. We use a lognormal model with parameters for failure
inter-arrival time, group size, and duration, as listed in
Table 4. The failure duration is further restricted by a min-
imum of 15 minutes, since faster recoveries and reboots at
the physical level are rare. The choice of parameter values is
inspired by GRID’5000 [21] (public trace also available [38])
and Microsoft Philly [39], scaled to Solvinity’s topology.
5.1.7 Metrics
In our article, we use the following metrics:
(1) the total requested CPU cycles (in MFLOPs) of all VMs,
(2) the total granted CPU cycles (in MFLOPs) of all VMs,
(3) the total overcommitted CPU cycles (in MFLOPs) of
all VMs, defined as the sum of CPU cycles that were
requested but not granted,
(4) the total interfered CPU cycles (in MFLOPs) of all VMs,
defined as the sum of CPU cycles that were requested
but could not be granted due to performance interfer-
ence,
(5) the total power consumption (in Wh) of all machines
using a linear model based on machine load [9], with
an idle baseline of 200 W and a maximum power draw
of 350 W,
(6) the number of time slices a VM is in a failed state,
summed across all VMs.
(7) the mean CPU usage (in MHz), defined as the mean
number of granted cycles per second per machine,
averaged across machines,
(8) the mean CPU demand (in MHz), defined as the mean
number of requested cycles per second per machine,
averaged across machines,
(9) the mean number of deployed VM images per host,
(10) the maximum number of deployed VM images per
host,
(11) the total number of submitted VMs,
(12) the maximum number of queued VMs in the system at
any point in time,
(13) the total number of finished VMs,
(14) the total number of failed VMs.
Note on the model for power consumption: The current
model, i.e., linear in the server load with offsets, is based
on a peer-reviewed model, is also used by other simulators common in practice, such as CloudSim and GridSim,
and produces in general reasonable results for CPU power
consumption. More accurate energy models appear for ex-
ample in GreenCloud and in CloudNetSim++, which model
the dynamic energy-performance trade-off when using the
DVFS technique, and in iCanCloud’s E-mc2 extension and
in DISSECT-CF, which model every power state of each
resource.
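For reference, the linear power model behind metric (5), with the stated idle and maximum draw, can be written as a one-liner; this is illustrative, with utilization taken as the machine's CPU load in [0, 1].

```kotlin
// P(u) = P_idle + (P_max - P_idle) * u, with 200 W idle and 350 W maximum draw.
fun powerDrawWatts(utilization: Double, idleW: Double = 200.0, maxW: Double = 350.0): Double =
    idleW + (maxW - idleW) * utilization.coerceIn(0.0, 1.0)
```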
5.1.8 Listing of Full Results
In the subsections below, we highlight a small selection
of the key metrics. For full transparency, we present the
entire set of metrics for each experiment in the appendices.
Appendix D visualizes the full results for all metrics and
Appendix E lists the full results for the two most important
metrics in tabular form.
5.2 Horizontal vs. Vertical Resource Scaling
Our main findings from this experiment are:
MF1: Capelin enables the exploration of a complex trade-off
portfolio of multiple metrics and capacity dimensions.
MF2: Vertically scaled topologies can improve power con-
sumption (median lower by 1.47x-2.04x) but can lead
to significant performance penalties (median higher
by 1.53x-2.00x) and increased chance of VM failure
(median higher by 2.00x-2.71x, which is a high risk!)
MF3: Capelin reveals how correlated failures impact vari-
ous topologies. Here, 147k–361k VM-slices fail.
The scale-up vs. scale-out decision has historically been
a challenge across the field [54][28, §1.2]. We investi-
gate this decision in a portfolio of scenarios centered
around horizontally (symbol ) vs. vertically ( ) scaled re-
sources (see §5.1.4). We also vary: (1) the decision mode, by
replacing the existing infrastructure ( ) vs. expanding it ( ),
and (2) the kind of variance, homogeneous resources ( )
vs. heterogeneous ( ). On these three dimensions, Capelin
creates candidate topologies by increasing the volume ( )
and compares their performance using four workload inten-
sities, two of which are shown in this analysis. We consider
three metrics for each scenario: Figure 5 (top) depicts the
overcommitted CPU cycles, Figure 5 (middle) depicts the
power consumption, and Figure 5 (bottom) depicts the
number of failed VM time slices.
Our key performance indicator is overcommitted CPU
cycles, that is, the count of CPU cycles requested by VMs
but not granted, either due to collocated VMs requesting
too many resources at once, or due to performance interfer-
ence effects taking place. We observe in Figure 5 (top) that
vertically scaled topologies (symbol ) have significantly
higher overcommission (lower performance) than their hor-
izontally scaled counterparts ( , the other three symbols
identical). The median value is higher for vertical than for
horizontal scaling, for both replaced ( ) and expanded ( )
topologies, by a factor of 1.53x–2.00x (calculated as the ratio
between medians of different scenarios at full load). This
is a large factor, suggesting that vertically scaled topologies
are more susceptible to overcommission, and thus lead to
higher risk of performance degradation. The decrease in
performance observed in this metric is mirrored by the
granted CPU cycles metric in Figure 16b (Appendix D),
which decreases for vertically scaled topologies. Among
replaced topologies (all combinations including ), the
horizontally scaled, homogeneous topology ( ) yields
the best performance, and in particular the lowest me-
dian overcommitted CPU. We also observe that expanded
topologies ( ) have lower overcommission than the base
topology, so adding machines is worthwhile. We observe all these effects strongly for the full trace (3,063 PFLOPs), but less pronounced for the lower workload intensity (1,531 PFLOPs).
Figure 5. Results for a portfolio of candidate topologies and different workloads (§5.2): (top) overcommitted CPU cycles, (middle) total power consumption, (bottom) total number of time slices in which a VM is in a failed state. Table 2 describes the symbols used to encode the topology.
But performance is not the only criterion for capacity
planning. We turn to power consumption, as a proxy for
cost analysis and environmental concerns. We see here that
vertically scaled topologies ( ) drastically improve power
consumption, for median values by a factor of 1.47x–2.04x,
contrasting their worse performance compared to horizontal
scaling ( ). As expected, all expanded topologies ( ), which
have more machines, incur higher power-consumption than
replaced topologies ( ). Higher workload intensity (i.e., for
the 3,063 PFLOPs results) incurs higher power consumption,
although less pronounced than earlier.
We also consider the amount of failed VM time-slices. Each
failure here is full-stop (§5.1.6), which typically escalates an
alarm to engineers. Thus, this metric should be minimized.
We observe significant differences here: the median failure
time of a homogeneous vertically scaled topology ( ) is
between 2.00x and 2.71x higher than for the base topology. This
metric shows similarities qualitatively with the overcom-
mitted CPU cycles. Vertical scaling is correlated not only
with worse performance, but also with higher failure counts.
We see that vertical scaling leads to a significant increase
in the maximum number of deployed images per physical
host (Figure 17d), which leads to larger failure domains
and thus potentially higher failure counts. The effect is less pronounced for heterogeneous than for homogeneous procurement.
Our findings show that Capelin gives practitioners the
possibility to explore a complex trade-off portfolio of dimensions
such as power consumption, performance, failures, work-
load intensity, etc. Optimization questions surrounding hor-
izontal and vertical scaling can therefore be approached with
a data-driven approach. We find that decisions including heterogeneous resources can provide meaningful compromises compared to more generic, homogeneous resources; they also
lead to different decisions related to personnel training (not
shown here). We show significant differences between can-
didate topologies in all metrics, translating to very different
power costs, long-term. We conclude that Capelin can help
test intuitions and support complex decision making.
5.3 Expansion: Velocity
Our main findings from this experiment are:
MF4: Capelin enables exploring a range of resource di-
mensions frequently considered in practice, such as
component velocity.
MF5: Increasing velocity can reduce overcommitted CPU
cycles by 3.3%.
MF6: Expanding a topology by velocity can improve per-
formance by 1.54x, compared to expansion by volume.
In both vertical and horizontal scaling, practitioners are also faced
with the decision of which qualities to scale. This experi-
ment varies the velocity of resources both homogeneously
and heterogeneously, while replacing or expanding the
existing topology. Figure 6 depicts the explored scenarios
and their performance, in the form of overcommitted CPU
cycles.
We find that in-place, homogeneous vertical scaling of
machines with higher velocity leads to slightly better per-
formance, by a percentage of 3.3% (compared to the base
scenario, by median). In this dimension, performance varies
only slightly between homogeneously and heterogeneously
scaled topologies, for all metrics (see also Appendix D).
Expanding the topology homogeneously with a set
of machines with higher CPU frequency helps reduce over-
commission more drastically, also improving it beyond the
lowest overcommission reached by homogeneous vertical
expansion in the previous experiment, in Figure 5. When expanding, this cross-experiment comparison shows a performance improvement by a factor of 1.54x.
Figure 6. Overcommitted CPU time for a portfolio of candidate topologies and different workloads, for Experiment 5.3.
Figure 7. Overcommitted CPU cycles for a portfolio of operational phenomena (the “none” through “all” sub-plots), and allocation policies (legend), for Experiment 5.4.
5.4 Impact of Operational Phenomena
Our main findings from this experiment are:
MF7: Capelin enables the exploration of diverse allocation
policies and operational phenomena, both of which
lead to important differences in capacity planning.
MF8: Modeling performance interference can explain
80.6%–94.5% of the overcommitted CPU cycles.
MF9: Different allocation policies lead to different perfor-
mance interference intensities, and to median overcom-
mitted CPU cycles different by factors between 1.56x
and 30.3x compared to the best policy, a high risk!
This experiment addresses operational factors in the
capacity planning process. We explore the impact of bet-
ter handling of physical machine failures, the impact of
(smarter) scheduler allocation policies, and the impact of
(the absence of) performance interference on overall perfor-
mance. Figure 7 shows the impact of different operational
phenomena on performance, for different allocation poli-
cies. We observe that performance interference has a strong
impact on overcommission, dominating it compared to the “failures” sub-plot, where only failures are considered, or to the “none” sub-plot, where no failures or interference are considered. Depending on the allocation policy, interference represents between 80.6% and 94.5% of the overcommission recorded in simulation for the “all” sub-plot, where both failures and interference are considered. This is visualized in more detail in Figure 28d (§D), which plots the interference itself, separately. We also see the large impact that live resource management (in this case, the allocation policy) can have on Quality of Service. Median ratios vary between 1.56x and 30.3x vs. the best policy, with active-servers (see §5.1.5) generally performing best. Finally, we observe that enabling failures increases the colocation ratio of VMs (see Figure 29c, §D).
Figure 8. Total power consumption for a portfolio of candidate topologies (legend), subject to different workloads (the “all-pri” to “all-pub” sub-plots), for Experiment 5.5.
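As a minimal sketch, under the assumption that the share attributed to interference is derived from median overcommission with and without interference modeled (the exact computation may differ, and all values below are hypothetical placeholders):

import statistics

def interference_share(overcommit_all, overcommit_failures_only):
    # Fraction of the "all" scenario's median overcommission that disappears
    # when performance interference is not modeled.
    med_all = statistics.median(overcommit_all)
    med_no_interference = statistics.median(overcommit_failures_only)
    return (med_all - med_no_interference) / med_all

share = interference_share([1.0e10, 1.1e10, 0.9e10], [1.5e9, 1.6e9, 1.4e9])
print(f"interference accounts for {share:.1%} of overcommitted CPU cycles")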
We conclude Capelin can help model aspects that are impor-
tant but typically not considered for capacity planning.
5.5 Impact of a New Workload
Our main findings from this experiment are:
MF10: Capelin enables exploring what-if scenarios that in-
clude new workloads as they become available.
MF11: Power consumption can vary significantly more in
all-private vs. all-public cloud scenarios, with the range
higher by 4.79x–5.45x.
This experiment explores the impact that a new work-
load type can have if added to an existing workload, an
exercise capacity planners have to consider often, e.g., for
new customers. We combine here the 1-month Solvinity and
Azure traces (see §5.1.3).
Figure 8 shows the power consumption for different
combinations of both workloads and different topologies.
We observe the unbiased variance of results [17, p. 32] is
positively correlated with the fraction of the workload taken
from the public cloud (Azure). Depending on topology,
the variance increase with this fraction ranges from 4.78x
to 5.45x. Expanding the volume horizontally leads
to the lowest increase in variance. The workload statis-
tics listed in Table 3 show that the Azure trace has far
fewer VMs, with higher load per VM and shorter duration,
thus explaining the increased variance. Last, all candidate
topologies have a higher power consumption than the base
topology.
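As a minimal illustrative sketch (not the authors' analysis code), the unbiased variance referenced above, s^2 = sum((x - mean)^2) / (n - 1), can be computed per topology and workload mix as follows; the input arrays are hypothetical per-repetition power-consumption values:

import statistics

def unbiased_variance(samples):
    # statistics.variance divides by n - 1 (Bessel's correction).
    return statistics.variance(samples)

all_private = [1.02e8, 1.03e8, 1.01e8, 1.02e8]  # total power consumption [Wh]
all_public = [0.90e8, 1.10e8, 0.85e8, 1.15e8]
print(f"variance increase: {unbiased_variance(all_public) / unbiased_variance(all_private):.2f}x")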
We also observe performance degrading with increasing
public workload fraction (see Figure 34c, §D), calling for a
different topology or a more sophisticated provisioning policy
to address the differing needs of this new workload. We
see that horizontal volume expansion provides the
best performance in the majority of workload transition
scenarios.
We conclude Capelin can support new workloads as they
appear, that is, before they are deployed.
6 VALIDATION OF THE SIMULATOR
We discuss in this section the validity of the outputs of
the (extensions to the) simulator. Capelin uses datacenter-
level simulation using real-world traces to evaluate port-
folios of capacity planning scenarios. Although real-world
experimentation would provide more realistic outputs, eval-
uating the vast amount of scenarios generated by Capelin
on physical infrastructure is prohibitively expensive, hard
to reproduce, and cannot capture the scale of modern
datacenter infrastructure, notwithstanding environmental
concerns. Alternatively, we can use mathematical analysis,
where datacenter resources are represented as mathematical
models (e.g., hierarchical and queuing models). However,
this approach is limited because its accuracy relies on pre-
existing data from which the models are derived. When we further consider the complexity and responsibilities of modern
datacenters, this approach becomes infeasible.
Given that the effectiveness of Capelin depends heavily
on (the correctness of) simulator outputs, we have worked
very carefully and systematically to ensure the validity of
the simulator. For this validity, we consider
three main aspects: (1) validity of results, (2) soundness of
results, and (3) reliability of results. Below, we discuss for
each of these aspects our approach and results.
T1. How to ensure simulator outputs are valid?
We consider simulator outputs valid if a realistic base model
(e.g., the datacenter topology) with the addition of a work-
load and other assumptions (e.g., operational phenomena)
can realistically reflect real-world scenarios based on the
same assumptions.
We ensure validity of simulator outputs by tracking
a wide variety of metrics (see Section 5.1.7) during the
execution of simulations in order to validate the behavior
of the system. This selection comprises metrics of
interest which we analyze in our experiments, but also fail-
safe metrics (e.g., total requested burst) that we can verify
against known values.
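As an illustrative sketch of such a fail-safe check (this is not the simulator's actual test code; the input and output shapes are assumptions):

def check_total_requested_burst(trace_rows, simulator_total, tolerance=1e-6):
    # trace_rows: iterable of (vm_id, requested_cpu_cycles) tuples taken from
    # the input trace; simulator_total: the metric reported by the simulation.
    expected = sum(cycles for _, cycles in trace_rows)
    assert abs(expected - simulator_total) <= tolerance * expected, (
        f"fail-safe metric mismatch: expected {expected}, got {simulator_total}")

check_total_requested_burst([("vm-1", 3.0e9), ("vm-2", 1.5e9)], simulator_total=4.5e9)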
Moreover, we employ step-by-step inspection using the
various tools offered by the Java ecosystem (e.g., Java De-
bugger, Java Flight Recorder, and VisualVM) to verify the
state of individual components on a per-cycle basis.
T2. How to ensure simulator outputs are sound?
While the simulator may produce valid outputs, for them to
be useful, these outputs must also be realistic and applicable
to users of Capelin. That is, the assumptions that support
the datacenter model must hold in the real world, for
the simulator outputs to be sound and in turn be useful.
Figure 9. Validation with a replay policy, copying the exact cluster assignment of the original deployment (overcommitted CPU cycles for the replay, active-servers, and calibrated active-servers allocation policies, on the base topology). For a legend of topologies, see Table 2.
Concretely, a particular choice of scheduling policy might
produce valid results, yet may not reflect reality.
To address this, we have created “replay experiments”
that replicate the resource management decisions made by
the original infrastructure of the traces, based on placement
data from that time. We do not support live migration of
VMs that occurs in the placement data, since VM placements
are currently fixed over time in OpenDC. However, the
majority of VMs do not migrate at all. Capacity issues due
to not supporting live migration are resolved by scheduling
VMs on other hosts in the cluster based on the mem policy.
The “replay experiments” are run in an identical setup
to the experiments in Section 5 and their results are compared
to the active-servers allocation policy. We visualize
both raw results and calibrated results, obtained through
only linear transformations (shifting and scaling values) to
account for possible constant discrepancy factors. We find
that:
1) The total overcommitted burst shows distributions
that are similar in shape but differ in scale, for
both policies. This can be explained by the fact that the active-servers policy is not as effective as the manual placements on the original infrastructure, in addition to the influence of performance interference
(Figure 9).
2) Other metrics show very similar distributions. Small
differences may be attributed to the number of VMs being slightly smaller in the “replay experiments” due
to missing placement data (Figure 10).
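To illustrate the calibration mentioned above, the following minimal sketch fits a scale and a shift that align the simulated distribution with the replay baseline; the exact fitting procedure used for Figures 9 and 10 may differ, and the values below are placeholders:

import statistics

def linear_calibration(simulated, baseline):
    # Return (scale, shift) aligning the spread and median of the two samples.
    scale = statistics.stdev(baseline) / statistics.stdev(simulated)
    shift = statistics.median(baseline) - scale * statistics.median(simulated)
    return scale, shift

def apply_calibration(simulated, scale, shift):
    return [scale * x + shift for x in simulated]

scale, shift = linear_calibration([5.0, 5.5, 6.0], [2.0, 2.2, 2.4])
print(apply_calibration([5.0, 5.5, 6.0], scale, shift))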
Furthermore, we have had several meetings with both
industry and domain experts to discuss the simulator out-
puts in depth, validate our models and assumptions, and
spot inconsistencies. Moreover, we have had proactive com-
munication with the experts about possible issues with the
simulator that arose during development, such as unclear
observations.
T3. How to ensure no regression in subsequent simulator
versions?
Although we may at one point trust the simulator to
produce correct outputs, the addition or modification of
functionality in subsequent versions of the simulator may
inadvertently affect the output compared to previous versions.
Figure 10. Validation with a replay policy, copying the exact cluster assignment of the original deployment (total power consumption for the replay, active-servers, and calibrated active-servers allocation policies, on the base topology). For a legend of topologies, see Table 2.
We safeguard against such issues by means of snapshot
testing. With snapshot testing, we capture a snapshot of
the system outputs and compare it against the outputs
produced by subsequent simulator versions. For this test,
we consider a downsized variant of the experiments run
in this work, capturing the same metrics. These tests
execute after every change and ensure that the validity of
the simulator outputs is not affected. In case some output
changes are intentional, the test failures serve as a double
check.
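A minimal sketch of such a snapshot test, assuming a hypothetical run_experiment() entry point and a JSON snapshot file (this is illustrative, not the project's actual test code):

import json
from pathlib import Path

SNAPSHOT = Path("snapshots/downsized_experiment.json")

def test_simulator_outputs_unchanged(run_experiment):
    # Compare the metrics of a downsized experiment against the stored snapshot.
    metrics = run_experiment()  # dict: metric name -> value
    expected = json.loads(SNAPSHOT.read_text())
    for name, value in expected.items():
        assert metrics[name] == value, f"regression in metric {name!r}"

def refresh_snapshot(run_experiment):
    # Called only when an output change is intentional.
    SNAPSHOT.parent.mkdir(parents=True, exist_ok=True)
    SNAPSHOT.write_text(json.dumps(run_experiment(), indent=2, sort_keys=True))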
Furthermore, we use assertions in various parts of the
simulator to validate internal assumptions. This includes
verifying that messages in simulation are not delivered out-
of-order and validating that simulated machines do not
reach invalid states.
Finally, we employ industry-standard development
practices. Every change to the simulator or its extensions
requires an independent code review before inclusion in
the main code base. In addition, for each change we automatically run static code analysis tools (e.g., linting) to spot
common mistakes.
7 OTHER THREATS TO VALIDITY
In this section, we list and address threats to the validity of
our work that go beyond the validity of the simulator.
7.1 Interview Study
Confidentiality limits us from sharing the source transcripts
of our analysis. Inherent in such a study is the threat to
validity caused by this limitation. To minimize this threat,
the process used is meticulously described and the full
findings presented. We also point to the resonance that
many of the results find with observations in other work.
The limited sample size of our study presents another
threat to the validity of our interview findings. This is
difficult to address, due to the labor-intensive transcription
and analysis conducted already in this study. Follow-up
studies should further address this concern by conducting
a textual survey with a wider user base, requiring less time
investment per interlocutor.
7.2 Experimental Study
We discuss three threats related to the experimental study.
7.2.1 Diversity of Modeled Resources
Building topologies in practice requires consideration of
many different kinds of resources. In our study, we only
actively explore the CPU resource dimension in the capacity
planning process, to restrict the scope. This could be seen as a threat to validity. Adding or removing CPUs to/from a machine, however, can mean that different types of memory or network become applicable or necessary. This can have impacts on costs and
energy consumption, altering the decision support provided
in this study. Nevertheless, the performance should suffer
only minimal impact from this, since CPU consumption can
be regarded as the critical factor in these considerations. In
addition, Capelin and its core abstraction of portfolios of scenarios offer a broader framework, and future extensions to OpenDC will directly become available to planners using
Capelin.
7.2.2 Public Data Artifacts
A second threat to validity could be perceived in the ab-
sence of public experiment data artifacts. The confidentiality
of the trace and topology we use in simulation prohibits
the release of detailed artifacts and results. However, an
anonymized version of the trace is available in a public trace
archive, which can be used to explore a restricted set of the
workload. The Azure traces used in the experiment in §5.5
are public, along with our sampling logic for their use, and
can therefore be locally used along with the codebase.
Last, a threat to validity could be seen in the validity of
the outputs of the (extensions to the) simulator itself. We cover
this threat extensively in Section 6.
7.2.3 Allocation Policies
We discuss in this section the relevance of the chosen alloca-
tion policies in this work and how they relate to allocation
policies used in popular resource management tools such as
OpenStack, Kubernetes, and VMware vSphere.
The allocation policies used in this work use a ranking
mechanism which orders candidate hosts based on some
criterion (e.g., available memory or number of active VMs)
and selects either the lowest or highest ranking host.
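As an illustrative sketch of this ranking mechanism (the host fields and criteria below are hypothetical and not claimed to reproduce the exact policies evaluated in this work):

from dataclasses import dataclass

@dataclass
class Host:
    name: str
    available_memory: int  # MB
    active_vms: int

def select_host(hosts, criterion, pick_highest=True):
    # Rank candidate hosts by the given criterion and return the best one.
    ranked = sorted(hosts, key=criterion, reverse=pick_highest)
    return ranked[0] if ranked else None

hosts = [Host("h1", 64_000, 10), Host("h2", 32_000, 4)]
# A memory-based criterion: prefer the host with the most available memory.
print(select_host(hosts, criterion=lambda h: h.available_memory).name)
# A VM-count-based criterion: prefer the host with the fewest active VMs.
print(select_host(hosts, criterion=lambda h: h.active_vms, pick_highest=False).name)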
OpenStack uses by default the Filter Scheduler2 for
placement of VMs onto hosts. For this, it uses a two step
process, consisting of filtering and weighing. During the filter-
ing phase, the scheduler filters the available hosts based on
a set of user-configured policies (e.g., based on the number
of available vCPUs). In the weighing phase, the scheduler
uses a selection of policies to assign weights to the hosts
that survived the filtering phase, and select the host with
the highest weight. How the weights are determined can
be configured by the user, but by default the scheduler
will spread VMs across all hosts evenly based on the available
RAM3, similar to the available-mem policy in this work.
2. https://docs.openstack.org/nova/latest/user/filter-scheduler.html
3. https://docs.openstack.org/nova/latest/admin/configuration/schedulers.html#id18
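The following minimal sketch illustrates such a two-step filter-and-weigh placement conceptually; it is not OpenStack code, and the filter, weigher, and data shapes are hypothetical:

def schedule(vm, hosts, filters, weighers):
    # Filtering phase: drop hosts that cannot host the VM.
    candidates = [h for h in hosts if all(f(vm, h) for f in filters)]
    if not candidates:
        return None
    # Weighing phase: sum the weights and pick the highest-weighted host.
    return max(candidates, key=lambda h: sum(w(vm, h) for w in weighers))

def enough_vcpus(vm, host):
    return host["free_vcpus"] >= vm["vcpus"]

def free_ram_weight(vm, host):
    return host["free_ram"]

chosen = schedule(
    {"vcpus": 4},
    [{"free_vcpus": 8, "free_ram": 64}, {"free_vcpus": 2, "free_ram": 128}],
    filters=[enough_vcpus],
    weighers=[free_ram_weight])
print(chosen)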
Kubernetes conceptually uses almost exactly the same
process as OpenStack4, but by default uses more extensive
weighing policies to ensure the workloads are balanced over
the hosts, also taking into account dynamic information
such as resource utilization. A key difference with Open-
Stack is that Kubernetes does not consider the memory
requirements of workloads when weighing the hosts.
VMware vSphere offers DRS (Distributed Resource
Scheduler) which automatically balances workloads across
hosts in a cluster based on memory requirements of the
workloads.
8 RE LATED WORK
We summarize in this section the most closely related work,
which we identified through a survey of the field that
yielded over 75 relevant references. Overall, our work is
the first to: (1) conduct community interviews with capac-
ity planning practitioners managing cloud infrastructures,
which resulted in unique insights and requirements, (2)
design and evaluate a data-driven, comprehensive approach
to cloud capacity planning, which models real-world op-
erational phenomena and provides, through simulation,
multiple VM-level metrics as support to capacity planning
decisions.
8.1 Community Interviews
Related to (1), we see two works as closely related to our
interview study of practitioners. In the late-1980s, Lam and
Chan conducted a written questionnaire survey [44] and,
mid-2010s, Bauer and Bellamy conducted semi-structured
interviews [6]. The target group of these studies differs
from ours, however, since both focus on practitioners from
different industries planning the resources used by their IT
department. We summarize both related works below.
Lam and Chan (1987) conduct a written survey with
388 participants [44, p. 142]. The survey consists of scaled
questions where practitioners indicate how frequently they
use certain strategies in different stages of the capacity
planning process [44, p. 143]. Their results indicate that
very few respondents believe that they use “sophisticated”
forecasting techniques for their capacity planning activities,
with visual trending being the most popular strategy at
that time. They find that “many companies still rely on
the simplistic, rules-of-thumb, or judgmental approach” to
capacity planning [44, p. 8]. More importantly even, the
authors believe that there is a “significant gap between
theory and practice as to the usability of the scientific and
the more sophisticated techniques”. The conclusions Lam
and Chan draw from their survey and the relations we
observe in their results are resonant with the findings of our
study. This stresses the need for a usable and comprehensive
capacity planning system for today’s computer systems.
Bauer and Bellamy (2017) conduct 12 in-person inter-
views with “IT capacity-management practitioners” [6] in
six different industries. Similar to our interviewing style,
the interviews were “semi-structured”, guided by questions
prepared in advance. The questions range from capacity
4. https://kubernetes.io/docs/concepts/scheduling-eviction/kube-scheduler/#kube-scheduler-implementation
planning process questions to more managerial questions
around organizational structure. After manual evaluation of
the interview transcripts, the authors find that practitioners
often state that the number of capacity planning roles in
organizations is decreasing, while the discipline is still very
much relevant. The practitioners also find that “vendor-
relationship management and contract management” are
playing an increasing role in the capacity planning pro-
cess, as well as redundancy and multi-cloud considerations.
These results, even if for a different target group, resonate
with our findings in two ways: (1) they underline our call
for the need to focus on the capacity planning process as
an essential part of resource management, and (2) empha-
size the multi-disciplinary, complex nature of the decisions
needing to be taken.
8.2 Capacity Planning Approaches
Related to (2), our work extends the body of related work
in three key areas: (1) process models for capacity planning,
(2) works related to capacity planning, and (3) system-level
simulators.
8.2.1 Process Models for Capacity Planning
Firstly, we survey process models for capacity planning
published in literature. To enable their comparison, we unify
the terminology and the stages proposed by these models,
and create the super-set of systems-related stages summa-
rized in Table 5. We observe that the first stages (assessment
and characterization) have the broadest support among
models. However, we also find significant differences in
the comprehensiveness of models. We observe that the later
stages (deployment and calibration) tend to receive more
attention only in more recent publications. From a systems
perspective, Capelin proposes the first comprehensive pro-
cess.
8.2.2 Works Related to Capacity Planning
Secondly, we survey systematically the main scientific
repositories and collect 56 works related to capacity plan-
ning. While we plan to release the full survey at a later
stage, we share key insights here. We find that the majority
of studies only consider one resource dimension, and four
inputs or less for their capacity planning model. Few are
simulation-based [1, 13, 48, 51, 52, 53], with the rest using
primarily analytical models. We highlight three of these
works below and position them in relation to this work.
Rolia et al. propose the first trace-based approach to the problem [53]. Their “Quartermaster” capacity manager service motivates the use of what-if questions to optimize SLOs, with the help of trace-based analysis and an optimizing search for capacity plan suggestions. Its underlying simulation is restricted to replay, with no additional modelling of phenomena or policies. This severely limits the scope and coverage of the exploration, which regards only one dimension (the quantity of CPUs). The work also does not formally specify what-if scenarios, even though it mentions the wide variety of scenarios (questions) that can be formulated.
Carvalho et al. use queuing-theory models to optimize the computational capacity of datacenters [13]. Their models are built from high-level workload characteristics derived from traces and include admission control policies. The simplifying assumptions made in constructing these simulation models restrict the realism of their output. In addition, while this work emphasizes the role that trade-offs play in the decision-making process, the trade-offs themselves are only evaluated on a single-metric scale (combining multiple metrics into one), leaving practitioners with a single output plan to accept or reject.
Table 5
Comparison of process models for capacity planning. Sources: Lam and Chan [44, p. 92], Howard [33] (referenced by Browning [11, p. 7]), Menascé and Almeida [47, p. 179], Gunther [27, p. 22], and Kejariwal and Allspaw [40, p. 4].
Stage [44] [11] [47] [27] [40] Capelin
Assessing current cap. X X X X X
Identifying all workloads X X
Characterize workloads X X X X X
Aggregate workloads X X
Validate workload char. X X
Determine resource req. X X
Predict workload X X X X
Characterize perf. X X X
Validate perf. char. X X X
Predict perf. X X X
Characterize cost X X
Predict cost X X
Analyze cost and perf. X X
Examine what-if scen. X X
Design system X X
Iterate and calibrate X X
The notable Janus [1] presents a real-time risk-based
planning approach for datacenter networks. The scope of
this study differs from our scope, in that it addresses net-
works and aims to assist in real-time, operational changes.
However, we share a focus on operational risks and in-
volved costs, and Janus also is evaluated with the help of
real-world traces.
Our scope of long-term planning (procurement) ex-
cludes more dynamic, short-term processes such as Google’s
Auxon [32] or the Cloud Capacity Manager [41], which
address the live management of capacity already procured;
explained differently, Capelin (this work) helps decide on
long-term capacity procurement, whereas Auxon and others
like it focus on the different problem of what to do with
that capacity, short-term, once it is already there. Other
work investigates the dynamic management of physical
components, such as CPU frequency scaling [45].
8.2.3 System-Level Simulators
Thirdly, we survey system-level simulators, and study 10
of the best-known in the large-scale distributed systems
community. Among the simulators that support VMs al-
ready [12, 31, 50] and could thus be useful for simulating
cloud datacenters, few have been tested with traces at the
scale of this study, few support CPU overcommissioning,
none supports both operational phenomena used in this
work, and none can output detailed VM-level metrics.
9 CONCLUSION AND FUTURE WORK
Accurately planning cloud datacenter capacity is key to
meeting the needs of the 2020s society whilst saving costs
and ensuring environmental sustainability. Although capac-
ity planning is crucial, the current practice has not been
analyzed in decades and publicly available tools to support
practitioners are scarce. Capelin, a data-driven, scenario-
based alternative to current planning approaches, addresses
these problems.
In this work, we have designed, implemented, and
evaluated Capelin. We have conducted guided interviews
with diverse practitioners from a variety of backgrounds,
whose results led us to synthesize five functional require-
ments. We have designed Capelin to meet them, including
the ability to model datacenter topologies and virtualized
workloads, to express what-if scenarios and QoS require-
ments, to suggest scenarios to evaluate, and to evaluate and
explain capacity plans. Capelin uses a novel abstraction, the
capacity planning portfolio, to represent, explore, and com-
pare scenarios. Experiments based on real-world workload
traces collected from private and public clouds demonstrate
Capelin’s capabilities. Results show that Capelin can sup-
port capacity planning processes, exploring changes from a
baseline scenario along four dimensions. We found that
capacity plans common in practice could potentially lead
to significant performance degradation, e.g., 1.5x–2.7x. We
also gave evidence of the important, but often discounted,
role that operational choices (e.g., the allocation policy)
and operational phenomena (e.g., performance interference)
play in capacity planning.
We have released Capelin as FOSS for capacity planners
to use. We will continue to support it and, in future work, we
plan to deepen and engineer Capelin. We are investigating
the use of Machine Learning and conventional Artificial In-
telligence search techniques to make the Capacity Plan Gen-
erator component more capable of exploring the enormous
design-space. We intend to conduct a structured survey, in
the form of a textual questionnaire, to reach a larger base
of capacity planning practitioners and augment the initial
findings made in our interview study. We see opportunities
for research into cloud user behavior when emerging re-
sources are deployed, a factor especially relevant in scientific
clouds. We also plan to include more workload types, such
as virtualized Function as a Service (FaaS) workloads.
REFERENCES
[1] Omid Alipourfard, Jiaqi Gao, Jérémie Koenig, Chris
Harshaw, Amin Vahdat, and Minlan Yu. Risk based
planning of network changes in evolving data centers.
In SOSP, 2019.
[2] George Amvrosiadis, Jun Woo Park, Gregory R.
Ganger, Garth A. Gibson, Elisabeth Baseman, and
Nathan DeBardeleben. On the diversity of cluster
workloads and its impact on research results. In ATC,
2018.
[3] Georgios Andreadis, Laurens Versluis, Fabian Masten-
broek, and Alexandru Iosup. A reference architecture
for datacenter scheduling: Design, validation, experi-
ments. SC, 2018.
[4] Luiz André Barroso, Urs Hölzle, and Parthasarathy
Ranganathan. The Datacenter as a Computer: Designing
Warehouse-Scale Machines. Synthesis lectures on comp.
arch. Morgan and Claypool, 2018. 3rd ed.
[5] Salman A Baset, Long Wang, and Chunqiang Tang. To-
wards an understanding of oversubscription in cloud.
USENIX HOT-ICE, 2012.
[6] Joe Bauer and Al Bellamy. Latent effects of cloud com-
puting on IT capacity management structures. IJCCE, 6
(2), 2017.
[7] Betsy Beyer, Chris Jones, Jennifer Petoff, and
Niall Richard Murphy. Site Reliability Engineering: How
Google Runs Production Systems. O’Reilly Media, 2016.
[8] Robert Birke, Ioana Giurgiu, Lydia Y Chen, Dorothea
Wiesmann, and Ton Engbersen. Failure analysis of
virtual and physical machines: Patterns, causes, char-
acteristics. IFIP, 2014.
[9] Mark Blackburn. Five ways to reduce data center server
power consumption. The Green Grid, 42:12, 2008.
[10] Raphael Bolze, Franck Cappello, Eddy Caron, Michel J.
Daydé, Frédéric Desprez, Emmanuel Jeannot, Yvon Jégou, Stéphane Lanteri, Julien Leduc, Nouredine Melab, Guillaume Mornet, Raymond Namyst, Pascale Primet, Benjamin Quétier, Olivier Richard, El-Ghazali Talbi, and Iréa Touche. Grid’5000: A large scale and
highly reconfigurable experimental grid testbed. IJH-
PCA, 20(4):481–494, 2006.
[11] Tim Browning. Capacity Planning for Computer Systems.
Academic Press, 1994.
[12] Rodrigo N Calheiros, Rajiv Ranjan, Anton Beloglazov,
César AF De Rose, and Rajkumar Buyya. Cloudsim:
a toolkit for modeling and simulation of cloud com-
puting environments and evaluation of resource provi-
sioning algorithms. Softw. Pract. Exp., 41(1), 2011.
[13] Marcus Carvalho, Daniel A. Menascé, and Francisco
Brasileiro. Capacity planning for IaaS cloud providers
offering multiple service classes. FGCS, 77, 2017.
[14] Mark Chamness. Capacity forecasting in a backup
storage environment. USENIX LISA, 2011.
[15] Gerry Coleman and Rory O’Connor. Using grounded
theory to understand software process improvement:
A study of Irish software product companies. Inf. and
Sw. Tech., 49(6), 2007.
[16] Eli Cortez, Anand Bonde, Alexandre Muzio, Mark
Russinovich, Marcus Fontoura, and Ricardo Bianchini.
Resource central: Understanding and predicting work-
loads for improved resource management in large
cloud platforms. In SOSP, 2017.
[17] Jay L Devore. Probability and Statistics for Engineering
and the Sciences, 7Ed. Brooks Cole Cengage Learning,
2009.
[18] Dmitry Duplyakin, Robert Ricci, Aleksander Maricq,
Gary Wong, Jonathon Duerig, Eric Eide, Leigh Stoller,
Mike Hibler, David Johnson, Kirk Webb, Aditya Akella,
Kuang-Ching Wang, Glenn Ricart, Larry Landwe-
ber, Chip Elliott, Michael Zink, Emmanuel Cecchet,
Snigdhaswin Kar, and Prabodh Mishra. The design and
operation of CloudLab. ATC, 2019.
[19] Nosayba El-Sayed, Hongyu Zhu, and Bianca Schroeder.
Learning from failure across multiple clusters. In
ICDCS, 2017.
[20] Flexera. State of the Cloud Report, Sep 2020.
[21] Matthieu Gallet, Nezih Yigitbasi, Bahman Javadi, Der-
rick Kondo, Alexandru Iosup, and Dick H. J. Epema.
A model for space-correlated failures in large-scale
distributed systems. In Euro-Par, 2010.
[22] Gartner Inc. Gartner Forecasts Worldwide Public
Cloud Revenue to Grow 17% in 2020, Sep 2019.
[23] Frank Gens. Worldwide and Regional Public IT Cloud
Services 2019–2023 Forecast. Tech. Rep. by IDC, Doc.
#US44202119, Aug 2019.
[24] Rahul Ghosh, Francesco Longo, Ruofan Xia, Vijay K.
Naik, and Kishor S. Trivedi. Stochastic model driven
capacity planning for an infrastructure-as-a-service
cloud. IEEE Transactions on Services Computing, 7(4),
2014.
[25] James Glanz. Data centers waste vast amounts of
energy, belying industry image. N.Y. Times, 2012.
[26] Albert G. Greenberg, James R. Hamilton, David A.
Maltz, and Parveen Patel. The cost of a cloud: research
problems in data center networks. ACM CCR, 39(1),
2009.
[27] Gunther. Guerrilla Capacity Planning. Springer, 2007.
[28] Mor Harchol-Balter. Performance Modeling and Design
of Computer Systems: Queueing Theory in Action. Cam-
bridge University Press, 2013.
[29] Nikolas Herbst, André Bauer, Samuel Kounev, Gior-
gos Oikonomou, Erwin Van Eyk, George Kousiouris,
Athanasia Evangelinou, Rouven Krebs, Tim Brecht,
Cristina L. Abad, and Alexandru Iosup. Quantifying
cloud performance and dependability: Taxonomy, met-
ric design, and emerging challenges. ACM TOMPECS,
3(4), 2018.
[30] Hewlett-Packard Development Company. HP Capacity
Advisor Version 7.4, 2014.
[31] Takahiro Hirofuchi, Adrien Lebre, and Laurent Pouil-
loux. Simgrid VM: virtual machine support for a
simulation framework of distributed systems. IEEE
Trans. Cloud Computing, 6(1), 2018.
[32] Hixson et al. Capacity Planning. USENIX ;login,
February, 2015.
[33] Phillip C Howard. IS Capacity Management Handbook
Series–Volume 1–Capacity Planning. Institute for Com-
puter Capacity Management, 1992.
[34] International Business Machines Corporation. IBM Z
Performance and Capacity Analytics tool, 2019.
[35] Alexandru Iosup, Hui Li, Mathieu Jan, Shanny Anoep,
Catalin Dumitrescu, Lex Wolters, and Dick H J Epema.
The Grid Workloads Archive. FGCS, 24(7), 2008.
[36] Alexandru Iosup, Georgios Andreadis, Vincent
Van Beek, Matthijs Bijman, Erwin Van Eyk, Mihai
Neacsu, Leon Overweel, Sacheendra Talluri, Laurens
Versluis, and Maaike Visser. The OpenDC vision:
Towards collaborative datacenter simulation and
exploration for everybody. ISPDC, 2017.
[37] Alexandru Iosup, Alexandru Uta, Laurens Versluis,
Georgios Andreadis, Erwin Van Eyk, Tim Hegeman,
Sacheendra Talluri, Vincent Van Beek, and Lucian
Toader. Massivizing computer systems: A vision to
understand, design, and engineer computer ecosys-
tems through and beyond modern distributed systems.
ICDCS, 2018.
[38] Bahman Javadi, Derrick Kondo, Alexandru Iosup, and
Dick Epema. The Failure Trace Archive: Enabling the
comparison of failure measurements and models of
distributed systems. JPDC, 73(8), 2013.
[39] Myeongjae Jeon, Shivaram Venkataraman, Amar Phan-
ishayee, Junjie Qian, Wencong Xiao, and Fan Yang.
Analysis of large-scale multi-tenant GPU clusters for
DNN training workloads. In ATC, 2019.
[40] Arun Kejariwal and John Allspaw. The Art of Capacity
Planning: Scaling Web Resources in the Cloud. O’Reilly,
2017.
[41] Mukil Kesavan et al. Practical compute capacity man-
agement for virtualized datacenters. IEEE TCC, 1(1):
1–1, 2013.
[42] Younggyun Koh, Rob Knauerhase, Paul Brett, Mic
Bowman, Zhihua Wen, and Calton Pu. An analysis
of performance interference effects in virtual environ-
ments. In ISPASS, 2007.
[43] Rouven Krebs, Christof Momm, and Samuel Kounev.
Metrics and techniques for quantifying performance
isolation in cloud environments. Science of Computer
Programming, 2014.
[44] Shui F Lam and K Hung Chan. Computer capacity
planning: theory and practice. Academic Press, 1987.
[45] Drazen Lucanin et al. Performance-based pricing in
multi-core geo-distributed cloud computing. IEEE
TCC, 2016.
[46] Fabian Mastenbroek, Georgios Andreadis, Soufiane
Jounaid, Wenchen Lai, Jacob Burley, Jaro Bosch, Er-
win van Eyk, Laurens Versluis, Vincent van Beek, and
Alexandru Iosup. OpenDC 2.0: Convenient modeling
and simulation of emerging technologies in cloud dat-
acenters. In CCGRID, 2021.
[47] Daniel A. Menascé and Virgílio A.F. Almeida. Capacity
Planning for Web Services: metrics, models, and methods.
Prentice Hall, 2001.
[48] Swarna Mylavarapu, Vijay Sukthankar, and Pradipta
Banerjee. An optimized capacity planning approach for
virtual infrastructure exhibiting stochastic workload.
SAC, 2010.
[49] Derek L. Nazareth and Jae Choi. Capacity management
for cloud computing: a system dynamics approach.
AMCIS, 2017.
[50] Alberto Núñez, Jose L Vázquez-Poletti, Agustin C Caminero, Gabriel G Castañé, Jesus Carretero, and
Ignacio M Llorente. iCanCloud: A flexible and scalable
cloud infrastructure simulator. J. Grid Comput., 10(1),
2012.
[51] Per Olov Ostberg, James Byrne, Paolo Casari, Philip
Eardley, Antonio Fernandez Anta, Johan Forsman, John
Kennedy, Thang Le Duc, Manuel Noya Marino, Rad-
hika Loomba, Miguel Angel Lopez Pena, Jose Lopez
Veiga, Theo Lynn, Vincenzo Mancuso, Sergej Svorobej,
Anders Torneus, Stefan Wesner, Peter Willis, and Jorg
Domaschka. Reliable capacity provisioning for dis-
tributed cloud/edge/fog computing applications. Eu-
CNC, 2017.
[52] Jayneel Patel, Shahram Sarkani, and Thomas Mazzuchi.
Knowledge based data center capacity reduction using
sensitivity analysis on causal Bayesian belief network.
IKSM, 12(2), 2013.
[53] Jerry Rolia, Ludmila Cherkasova, Martin Arlitt, and
Artur Andrzejak. A capacity management service for
resource pools. WOSP, 2005.
[54] Semih Salihoglu and M Tamer Özsu. Response to ”scale
up or scale out for graph processing”. IEEE Internet
Comput., 22(5), 2018.
[55] Siqi Shen, Vincent Van Beek, and Alexandru Iosup. Sta-
tistical characterization of business-critical workloads
hosted in cloud datacenters. CCGrid, 2015.
[56] Zhiming Shen, Sethuraman Subbiah, Xiaohui Gu, and
John Wilkes. CloudScale: Elastic resource scaling for
multi-tenant cloud systems. SoCC, 2011.
[57] Ling Tang and Hao Chen. Joint pricing and capacity
planning in the IaaS cloud market. IEEE TCC, 5(1),
2017.
[58] Daniel W Turner. Qualitative Interview Design: A
Practical Guide for Novice Investigators. Qualitative
Report, 15(3):7, 2010.
[59] Vincent van Beek, Jesse Donkervliet, Tim Hegeman,
Stefan Hugtenburg, and Alexandru Iosup. Self-
Expressive Management of Business-Critical Work-
loads in Virtualized Datacenters. Computer, 48(7):46–54,
2015.
[60] Vincent van Beek, Giorgos Oikonomou, and Alexandru
Iosup. A CPU contention predictor for business-critical
workloads in cloud datacenters. In HotCloudPerf, 2019.
[61] Arunchandar Vasan, Anand Sivasubramaniam, Vikrant
Shimpi, T. Sivabalan, and Rajesh Subbiah. Worth
their watts? - an empirical study of datacenter servers.
HPCA, 2010.
[62] VMware. VMware Capacity Planner, 2009.
[63] Jie Xu and Chenbo Zhu. Optimal pricing and capacity
planning of a new economy cloud computing service
class. ICCAC, 2015.
[64] Ennan Zhai, Ruzica Piskac, Ronghui Gu, Xun Lao,
and Xi Wang. An auditing language for preventing
correlated failures in the cloud. PACMPL, 2017.
[65] Chun Zhang, Rong N. Chang, Chang Shing Perng, Ed-
ward So, Chungqiang Tang, and Tao Tao. An optimal
capacity planning algorithm for provisioning cluster-
based failure-resilient composite services. SCC, 2009.
[66] Qi Zhang, Ludmila Cherkasova, Guy Mathews, Wayne
Greene, and Evgenia Smirni. R-capriccio: A capacity
planning and anomaly detection tool for enterprise
services with live workloads. Middleware, 2007.
APPENDIX A
CAPACITY PLANNING INTERVIEW SCRIPT
In this appendix, we list the interview script used for the interviews described in Section 3. To encode instructions to the interviewer, we use the following notation. The font determines the type of instruction: questions are written in a standard font, emphasis indicates instructions to be read to the interlocutor by the interviewer (not necessarily verbatim), and mono-space font represents instructions for the interviewer. The questions are numbered for cross-reference and divided into 5 categories. Each question is assigned one out of three priority levels, indicated by asterisks (*, **, and ***), with three stars indicating the highest priority.
Each category (section) of questions is allocated a number
of minutes, listed between parentheses.
A.1 Part 1: Overview (15’)
Thanks for agreeing to meet with me. To get the most out of this
conversation, would you allow me to record the conversation for
note-taking purposes? I will not share the recording with anyone
else but you, if you want a copy, and I will delete the recording at
the end of my thesis project.
If you have any concerns about sharing this kind of informa-
tion with me, let us talk quickly and openly about it. We hope you
will share openly. Rest assured, we care more about learning about the issues around capacity planning than about publishing on them.
I will transcribe the recording for myself, and for any use of a
snippet of your words, I will ask you specifically for approval to
release, with due reference, unless you want me to keep the author
anonymous, of course.
We are interested in learning more about how businesses
think about having IT infrastructure and services always available
and plentiful. We call this capacity planning, and know we are
referring here only to IT and the IT team, and not to other types of
“capacity”. We want to learn and share with you what processes
are used, what challenges exist, and how can we help solve them.
(Q1) *** How important is it to have IT infrastructure
and services always available and plentiful in
your business?
(Q2) *** What kind of services do you provide? How
important is it for your different services?
(Q3) *** Can you give us an example of a success in
capacity planning? Share with us a good idea,
a good process, some situation when all worked
well?
(Q4) *** Can you give us an example of an insuccess in
capacity planning? Share with us a mistake, an
erroneous process, some situation when many
things failed to work well or took much longer
to get through than expected?
(Q5) *** What does the typical process for capacity plan-
ning look like at your company? You can start
with an overview, or even from a concrete exam-
ple, like how to get a new cluster in operation.
(Q6) ** A few yes or no questions:
1) Do you have “what if” scenarios?
2) Do you consider hybrid or public clouds to be
part of your capacity planning process?
3) Would you be willing to share historical data
on capacity planning?
4) Do you consider human personnel (availabil-
ity, experience, training) when planning for
new capacity?
A.2 Part 2: The Process (15’)
(Q7) *** Who are the stakeholders? Who gets to take
the decision? Who gets to give input? Is this a
board-level decision? Is it left to operations?
(Q8) *** On what time and infrastructure scale are your
typical decisions?
(Q9) *** What factors do you take into account when
making capacity planning decisions? Does this
differ per stakeholder; if so, how?
(Q10) ** What is the margin for error in the decision
making process?
(Q11) *How frequently are capacity planning decisions
made? Also, how long does a decision take?
(Q12) *What kind of data sources do you consult in
your capacity planning process?
(Q13) *What kind of tools do you use in your capac-
ity planning process? For planning, recording,
sharing information at different levels in the
organization, etc.
(Q14) ** How are errors or issues about capacity plan-
ning preserved? How frequent/severe are the
errors that are made? How do people learn from
these issues?
A.3 Part 3: Inside Factors (15’)
(Q15) *** What kinds of IT services and infrastructure are
part of your capacity planning processes?
(Q16) ** I will ask the same question about four kinds of
workloads.
What are your capacity planning processes for
business-critical workloads?
What are your capacity planning processes for
big data workloads?
What are your capacity planning processes for
serverless workloads?
What are your capacity planning processes for
high performance computing workloads?
(Q17) *** How do you try to combine multiple workloads
in the same capacity planning process? Shared
infrastructure? Shared services? What role do
hybrid or public cloud offerings play in your
capacity planning process?
(Q18) *** Because serverless workloads are so new, I’d like
to ask a couple more questions about them. With
such fine-granularity and variable workloads,
how do you reason about needs?
Do you reason differently about them than
about other (more traditional) workloads?
How do you reason about workload isolation
(performance, availability, security, etc.)?
(Q19) *What are some typical “what if” scenarios?
A.4 Part 4: Outside Factors (10’)
(Q20) *** What regulatory constraints (laws, standards;
e.g. General Data Protection Regulation (GDPR),
concerning where you get the capacity) play a
role in the decision process?
(Q21) *** What financial aspects (costs of resources, per-
sonnel, etc.) or technical aspects (new genera-
tion of hardware/software/IT paradigms) play
a role in the decision process?
(Q22) *** How and which human factors are involved
in your decision making on resource capacity
planning?
(Q23) ** Do you make capacity planning decisions on a
multi-datacenter level, or on a local level?
(Q24) *** Do you do capacity planning specifically for
production, development, and/or test?
(Q25) *What are some typical “what if” scenarios?
A.5 Part 5: Summary and Follow-Up (5’)
(Q26) *** What would your ideal process for capacity
planning look like? Something that is not al-
ready there?
(Q27) ** Which other processes do you see capacity
planning linked with? For example, manag-
ing change, evolution of business requirements,
etc.?
(Q28) ** What other aspects would you like to share?
Follow-Up Points
Explain what I will do with the information.
Ask if they want a summary or report related
to my thesis project.
If they answered "yes" to sharing historical
data, follow up here.
APPENDIX B
DETAILED INTERVIEW RESULTS
In this appendix, we present our detailed analysis of the
interlocutors’ statements, organized around the questions
listed in the interview script in Appendix A. Findings are
numbered and prefixed with IF (Interview Finding).
1 - Q1: Importance of Availability
IF1: We observe that availability appears to be critical in
commercial cloud environments. In scientific cloud infras-
tructure, availability generally appears to be perceived as
less important than in commercial cloud environments.
1 - Q2: Services
IF2: The organizations of the interlocutors provide a wide
range of services, from general IT services to specialist
hardware hosting. The interlocutors themselves are mainly
concerned with compute cloud services. These can be di-
vided into virtualized offerings (VM hosting), batch work-
load services, and specialized HPC hosting.
IF3: Batch workloads and HPC hosting are only seen in the
scientific infrastructure surveyed. VM hosting is dominant
in the commercial infrastructure space and can only be
found in half of the surveyed scientific infrastructures.
1 - Q3: Success in Capacity Planning
IF4: The success stories being told vary from large installa-
tions with significant upfront effort to more flexible, iterative
installations. Flexibility is still often valued even in large
scale designs, in the form of strategies leaving room for
adaptations later-on.
IF5: One interview characterizes the absence of total failure
scenarios in the past as a success story for the capacity
planning team.
IF6: The utilization of new hardware with beneficial fea-
tures, such as competitive pricing or an increase in paral-
lelism, is a success story recounted in two of the interviews.
1 - Q4: Insuccess in Capacity Planning
IF7: A first observation is the abundance of failure stories,
especially compared to the number of success stories. A
possible explanation is that the process could be largely
taken for granted. The practice is mainly revisited when
suboptimal situations arise.
Challenges Capacity Planners Face
We find that capacity planning practitioners face many
challenges in the process.
IF8: Most interlocutors see and are discontent with a perva-
sive under-utilization of their system. This under-utilization
can be caused by operational risk minimization taking
precedence, newly installed resources not being directly
used, and delays in the process.
IF9: We observe the challenge of increasing complexity,
both in the managed resources and the tooling needed
to correctly monitor them. The heterogeneity of hardware,
especially in HPC domains, is also mentioned in accounts of
this.
IF10: Some remark that supporting the small to medium
workload deployments is much more difficult than planning
for the larger deployments. While the larger units each have
a larger financial impact, the small units tend to be neglected and left to depend on sufficient leftover capacity being available.
Failure Stories
We summarize the most common failure types below.
IF11: In some cases, the adoption of new technologies can
have unforeseen negative consequences. A failure story of
the adoption of an (unnamed) new processor architecture
ends in an entire rollback of the installation, due to users
having difficulties properly utilizing the new hardware, and
due to high power consumption.
IF12: Some capacity planning decisions take so long that
technological perceptions have changed due to the rapidly
changing nature of the field. This leads to hardware choices
needing to be implemented that, at the time of actual instal-
lation, are considered suboptimal.
IF13: A notorious challenge seems to be the prediction of
future ratios between different resource types (mainly num-
ber of cores, memory units, storage capacity). We observe
a number of failure stories surrounding the misprediction
of how different resource types might relate in the future,
resulting in significant parts being underutilized or certain
capacity dimensions running out of capacity far faster than
others. This last consequence can lead to reduced QoS.
IF14: We observe cases where capacity planning is only
seen as an afterthought. A representative example is the fast
onboarding of a new client where available capacity or time
to acquire new capacity is judged too optimistically.
1 - Q5: Typical Process
IF15: The typical processes we see have two shapes: a
periodic process, typically centered around the lifecycle of
topology resources, and a more ad-hoc process, triggered
by less predictable events such as the arrival of new users.
The former is dominant in most surveyed scientific clouds,
while the latter is more common in commercial clouds. One
scientific cloud in the set has a combination of both.
IF16: A sentiment expressed in the majority of interviews is
that guessing and human interpretation of monitoring data
are a big part of the process. The tooling in the area seems
underutilized, further discussed in Q13.
IF17: Next to computational performance of resources, elec-
tricity and cooling also play a significant role in the equa-
tion. This both impacts and is impacted by the choice of
hardware.
Commercial Infrastructures
We observe a difference in the typical process and challenges
involved in this process between commercial and scientific
infrastructures. Below, we outline the findings for commer-
cial infrastructures.
IF18: We see all interlocutors from commercial backgrounds
facing a dilemma between combining purchases and spread-
ing them. The former can lead to significant cost savings
but periods of more intense effort for employees, while the
latter has the opposite advantages and drawbacks. A pos-
sible generalization of this is the competition and interplay
between financial and human factors, both impacting the
capacity planning process in different ways.
IF19: We also see some interlocutors with commercial cloud
backgrounds describing lease and support contracts as be-
ing especially important in the set of factors taken into
account in a typical capacity planning process. The timing
and duration of these conditions can have significant impact
on the decision taken.
Scientific Infrastructures
We now summarize the main findings for the typical process
scientific infrastructures.
IF20: Most scientific clouds seem to follow a typical public
competitive dialogue process for their resource selection. In
at least half of the surveyed scientific infrastructures, this
includes a benchmark-driven process. This entails providing
hardware contractors with a set of benchmark applications
that need to be optimized by the contractor. Results for
applications in this benchmark are often weighted by im-
portance or frequency of deployment.
IF21: We observe that most scientific clouds take their
biggest users as the main indicator of the needs of a
platform. Half of scientific clouds also take into account a
broader user feedback survey, reaching users regardless of
size or importance.
IF22: Almost all scientific cloud interlocutors perceive their
process as fundamentally different from the commercial
capacity planning process. The main perceived difference
is the mode of operation, which they believe to be budget-
centered rather than demand-centered. These interlocutors
also consider their budget-centered approach to involve less capacity planning effort than commercial operations. Whether this as-
sessment is accurate is difficult to objectively judge, al-
though analysis from other questions seems to indicate that
there are more aspects of “traditional” capacity planning in
their process than commonly perceived.
1 - Q6: Yes/No Questions
IF23: What-if scenarios do not seem to be established prac-
tice currently. Some unstructured examples fitting the for-
mat (e.g. fail-over scenarios, clients arriving) are mentioned
informally, but not in a structured form.
IF24: Most of the scientific clouds have exploratory projects
running where they investigate the possibility of offloading
demands to a public cloud. This indicates increased interest
in hybrid infrastructure offerings.
IF25: A significant portion of interlocutors is willing to
share historical data with the interviewer. This could signal
interest in academia and industry for more research being
conducted in topics surrounding their capacity planning
decision making.
IF26: The majority of interlocutors considers human person-
nel in some form in the capacity planning process. This topic
is further analysed in Q22.
2 - Q7: Stakeholders
IF27: All processes surveyed seem to have an executive
board at the head of the process. This board seeks out advice
from experts in the domains relevant to their decision. These
can include technical advisors or scientific experts.
IF28: For all surveyed instances, the final decision seems
to be at board level. While the input of domain experts is
sought out, the final decision is made by the management.
IF29: The interlocutors with commercial background also
have an engineer in charge of the capacity planning pro-
cess, monitoring the current situation and coordinating the
lifecycle-based planning process. The process in this case
also includes input from the hardware contract administration. Occasionally, the sales department is
involved, if the decision affects the shaping or pricing of
services provided to customers.
IF30: In scientific environments, an important part of the set
of stakeholders tend to be scientific partners and (govern-
mental) funding agencies. Half of the surveyed processes
here also take into account user input through a user ques-
tionnaire.
2 - Q8: Time and Infrastructure Scale
IF31: Unlike the frequency or trigger of capacity planning
processes, the time scale of the decisions made seems to be
roughly uniform across interlocutors. We observe that the
aging and thus deterioration of hardware is seen as the
most important factor here, with a mean of 5 years until
the next decision for a specific machine/rack. Commercial
environments seem to tend towards faster replacement (3-
5 years), while scientific environments seem to replace less
quickly (4-7 years).
IF32: The infrastructure scale of decisions for commercial
environments tends to be a single rack. Making multi-rack
decisions is desired, due to potential cost savings, but not
always possible.
IF33: In scientific environments, the infrastructure scale of
decisions seems to be larger, with most surveyed infras-
tructures working on scales of entire cluster/site iterations.
One infrastructure works on a smaller scale, making single-
machine or single-rack decisions.
2 - Q9: Factors
IF34: The number of factors is remarkable, with more than
25 distinct factors being named in the full set of interviews.
IF35: Nevertheless, the factors that span across a majority
of interviews are few. Only three factors are named in more
than half of the interviews: the use of historical monitor-
ing data, financial concerns (such as budget size), and the
lifetime of hardware. These are followed up by a set of four
factors mentioned in slightly less than half of the interviews:
user demand, new technologies, incoming projects, and the
benchmark performance of different solutions.
IF36: We observe a number of factors particular to scien-
tific infrastructures but not being mentioned in the com-
mercial set. The most important here are the benchmark
performance of solutions (which is often required by pub-
lic competitive acquisition processes) and user demands.
One surveyed infrastructure optimizes for throughput here,
meaning the number of times the benchmark can be run in
a certain time frame.
IF37: Similarly, we observe a number of factors unique to
commercial infrastructures. The most prominent are lease
contracts, current offerings that the provider has, and per-
sonnel capacities.
2 - Q10: Margin for Error
IF38: The margin for error is difficult to objectively measure,
due to the multi-faceted nature of this process. Two main
consequences of errors are mentioned by interlocutors. First,
financial losses can occur due to overestimation of the
demand of a certain resource, such as specific accelerators or
storage capacity, or due to underestimation, as can happen if
the ratio of resource types is mispredicted. Second, personnel can come under pressure when available capacity turns out to be smaller than expected, triggering a search for spare capacity across the managed clusters.
2 - Q11: Frequency and Time to Deployment
IF39: While interlocutors from commercial backgrounds report a planning frequency of at least once per three months (plus an ad-hoc component), counterparts from scientific infrastructures generally report planning cycles of four years or longer. There is one notable exception to this rule: one of the scientific clouds takes a decision twice a year.
In general, we observe a separation between fast-paced
commercial planning cycles and longer cycles in scientific
clouds.
IF40: Similar to the frequency of planning events, the time
from start of the event to deployment is determined largely
by the background of the infrastructure. Commercial clouds
tend to finish deployment within 4–5 months, while scien-
tific clouds tend to take 1–1.5 years to deploy. We see a positive correlation with the frequency of planning instances, meaning that a higher frequency tends to be paired with a shorter time to deployment.
IF41: In some scientific clusters, we see a part of the topol-
ogy containing specialized hardware, such as accelerators,
getting a special process with more rapid cycles than the
rest of the architecture. This could be due to the faster pace
of evolution that these kinds of technologies experience.
2 - Q12: Data Sources
IF42: With the exception of one infrastructure, historical utilization data from monitoring agents is universally reported to be used in the process.
IF43: In addition to historical utilization data, we see operational data such as lease contracts and maintenance periods being involved in the process. We also observe some interlocutors explicitly mention taking global market developments into account.
2 - Q13: Tooling
IF44: The main observation here is that none of the surveyed infrastructures have dedicated tooling for capacity planning. They use monitoring tools (with dashboards) and/or spreadsheets, combined with human interpretation of the results. In one infrastructure, decisions are preserved in meeting minutes and e-mails.
IF45: We observe that planners typically consume the data
they receive from monitoring in visual formats, in plots
over time. Being able to visually investigate and interpret
developments plays an important role here.
IF46: The most commonly used tool for monitoring seems to
be Grafana, which allows teams to build custom dashboards
with the information they see as relevant. NetApp monitor-
ing tools are mentioned as being used by one commercial
party. One scientific infrastructure reports basing their re-
sults on custom SQL queries of monitoring data. Another
scientific infrastructure uses spreadsheets as the primary
medium for analysis.
IF47: We identify two key issues raised to explain the absence of dedicated tooling. First, tools tend to be too platform-specific or to work in only one layer of the hierarchy, and thus return misleading results. We see the issue as a mismatch between the complexity of the reality on the ground and the complexity that these tools assume of the
topology. Second, tools tend to have high cost and carry a
number of additional features that planners reportedly do
not find useful, meaning that the high price is not justified
by the value the planners receive out of these tools.
2 - Q14: Errors
IF48: We observe several occurrences of failures being men-
tioned, although the perceived frequency varies. One in-
terlocutor believes that (slightly) erroneous plans are made
constantly, since it is not possible to predict accurately what
will be needed in the future, while another interlocutor
claims the errors made are not very frequent. On average, the frequency is perceived as low, which stands in contrast to the failures mentioned throughout the rest of the interviews.
IF49: The severity of an error is hard to measure objectively if it is not actively monitored. The (subjective) descriptions of how severe errors are range from losing potential income, to having underutilized hardware, to hitting storage limits.
This raises a different point, surrounding the definition
of errors or failures in the field of capacity planning. An
underutilized new cluster may be seen as a minor error,
since service is typically not affected and the only cost seems
to be additional power usage and environmental footprint.
IF50: We did not observe any structured approach to record-
ing errors in the process. Whether they only remain tacit
team knowledge or are still recorded somewhere is not clear,
although our interpretation indicates the former.
IF51: While most interlocutors seem to describe negative
capacity planning incidents as being infrequent and having
low severity, the examples being given in response to other
questions tend to be from the most recent (if not one of
the last) iterations. This is partly explainable by more recent memories being more readily accessible, but might also indicate a more structural underappreciation of the possibility of failures or suboptimal choices in the process.
3 - Q15: Services and Infrastructure Part of Process
IF52: We observe that the majority of interlocutors considers
only one type of service as part of their capacity planning
process and a minority considers two or more.
IF53: In the commercial settings we survey, we see that VMs holding business-critical workloads are most commonly considered as part of the process. One interlocutor
mentions a new container platform as also being part of
the process, although it is internally approximated as a VM
while planning.
IF54: In the scientific settings we survey, we see that batch
workloads, HPC workloads, VM workloads, and baremetal
hosting services are equally popular. One provider also mentions shared IT services (more general IT functionality) as part of the process.
3 - Q16: Processes for Specific Workloads
IF55: The instances running business critical workloads
report two special aspects that they consider for these
workloads: special redundancy requirements and live man-
agement concerns (primarily migration and offloading).
IF56: We do not observe any special processes being men-
tioned for Big Data workloads.
IF57: The processes for serverless workloads are still very much in their infancy, as most interlocutors with container or FaaS solutions host them only as experimental pilot projects. In terms of capacity planning, one interlocutor points out that for the container platform they are currently building, they simply approximate containers with VMs in their reasoning. However, they acknowledge that the density and characteristics of this new workload might be very different and that they may need a special process for this in the future.
IF58: Two of the interlocutors reporting that HPC is a part of their process state that capacity planning for HPC workloads is even more challenging than for conventional workloads, due to the increased heterogeneity of the hardware platforms needed in this domain.
3 - Q17: Combining Workloads
IF59: All interlocutors, with one exception, consider all workloads combined in one process. The interlocutor forming the exception states that certain workload types in their cloud are hosted on different infrastructure and separated entirely, with no synchronization occurring between the different efforts.
IF60: A popular approach in scientific infrastructures seems
to be to combine workloads through a weighted benchmark
suite. This scores topologies by running important represen-
tatives from each workload type and combining the scores
into a single score.
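To illustrate this kind of aggregation, the following minimal Kotlin sketch combines per-workload benchmark scores into a single topology score using a weighted geometric mean. The workload types, weights, and example scores are hypothetical and not taken from any surveyed process; the surveyed infrastructures may well use a different combination rule.

```kotlin
import kotlin.math.exp
import kotlin.math.ln

// Hypothetical illustration of a weighted benchmark suite score.
// Workload types, weights, and example scores are made up for this sketch.
data class BenchmarkResult(val workloadType: String, val score: Double, val weight: Double)

// Combines per-workload scores into one topology score via a weighted geometric mean:
// exp( sum(w_i * ln(s_i)) / sum(w_i) ).
fun combinedScore(results: List<BenchmarkResult>): Double {
    val totalWeight = results.sumOf { it.weight }
    val logSum = results.sumOf { it.weight * ln(it.score) }
    return exp(logSum / totalWeight)
}

fun main() {
    // Scores are relative speedups of a candidate topology versus a reference topology.
    val candidateTopology = listOf(
        BenchmarkResult("HPC", score = 1.8, weight = 0.5),
        BenchmarkResult("VM", score = 1.2, weight = 0.3),
        BenchmarkResult("batch", score = 1.5, weight = 0.2),
    )
    println("Combined score: %.2f".format(combinedScore(candidateTopology)))
}
```

A geometric mean is used here only because it prevents a single workload type from dominating the combined score; a weighted arithmetic mean would be an equally plausible choice.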
IF61: One interlocutor with a commercial background points out that there is a trade-off between combining processes, thus gaining efficiency but also increasing the risk of failure, and keeping processes separate, thus losing efficiency but also reducing the risk.
3 - Q18: Serverless Workloads
IF62: We observe that, with one exception, all interviews describe the introduction of (pilot) serverless offerings in their services. One interlocutor sees serverless as a fast-growing business, but another interlocutor contrasts this with the observation that the demand for it is still limited.
IF63: Interlocutors see a number of differences with tra-
ditional workloads. They observe differing usage patterns
with finer granularity of execution units. This leads to
higher fluctuation of the load and faster deployment pat-
terns.
IF64: Three of the interviews detail expectations on how
their capacity planning will change with serverless work-
loads becoming more prevalent. They expect impacts on the
resource needs, such as the CPU to memory ratio and the
allowed overcommission ratio. One also states that guaran-
teeing workload isolation is likely to become significantly
more difficult.
IF65: Currently, none of the interlocutors state having a
special subprocess for serverless in their capacity planning
approach. They agree, however, that this might need to
change in the future, as serverless workloads increase in
popularity.
IF66: We observe two key issues hindering the specialization of (parts of) the capacity planning process towards new workloads such as serverless. First, not enough information is yet available on this new workload type and its behavior. This makes reasoning about its capacity needs more difficult, at least with conventional capacity planning methods. Second, interlocutors report a lack of personnel to dedicate to research into effective and efficient hosting of this new workload type.
3 - Q19: What-If Scenarios
IF67: This question was asked infrequently due to time
constraints. One interlocutor answered that scenarios they look at indirectly are customer-based scenarios (whether a certain customer needs to be hosted) and new hardware releases and acquisitions (with new specifications and properties).
See also IF101 for requested what-if scenarios in tooling.
4 - Q20: Regulatory Constraints
IF68: We gather a number of laws and standards relevant to
the capacity planning process. We conclude that regulatory
constraints can definitely play a role in capacity planning.
IF69: Financial institutions tend to have strict standards for
the capacity they acquire, such as a guaranteed amount of
fail-over capacity. This requires a capacity planning process
(and recorded trail of that process) that meets these stan-
dards.
IF70: We observe that privacy regulations such as the GDPR
are only of limited concern in the capacity planning pro-
cess. One interlocutor managing a scientific infrastructure
states that GDPR only affects the placement and planning
of privacy-critical applications on their platform, in the
form of preventing public cloud offloads for these specific
applications. Another interlocutor mentions that the storage of logs could be affected by the GDPR, as its introduction reduces log storage demands and thus the storage capacity needed for that purpose.
IF71: In scientific infrastructures, we observe the competitive dialogue procedure playing a major role in shaping the process. Publicly funded institutions need to shape their acquisition processes around a public tender with competitive dialogue, which limits how and which hardware components can be selected.
IF72: Security standards can also steer the choice of certain technologies in the capacity planning process. We observe one case where reported exploits in a container platform prevent a quick deployment of that technology.
IF73: Special regulations that hold in the country of origin of a certain hardware vendor can also play a role. We hear of one example where a supercomputer manufacturer prohibits its hardware from being used by personnel and users with the nationality of a certain set of countries designated by the manufacturer's government.
4 - Q21: Financial and Technical Aspects
IF74: Overall, three financial and technical aspects are cited most frequently across interviews. The first is
funding concerns, looking at the source of funds for future
expansions and maintenance. The second consists of special
hardware requests from users. The third is the cost of new
hardware, in line with global cost developments on the
market.
IF75: For commercial infrastructures, a wide range of similarly frequent financial and technical factors is taken into consideration. Noteworthy are the timing and
costs of lease contracts, which receive special attention, and
historical sale and usage models for existing services.
IF76: For scientific infrastructures, the size of publicly
funded grants is a factor that was mentioned in all inter-
views. Special hardware requests are the second most fre-
quent factor here, followed up by the total cost of ownership
(including electricity and cooling costs) and new hardware
developments.
IF77: One interlocutor with a commercial background makes an observation that we believe resonates with statements by other interlocutors as well: the variables in the equation
that are most relevant are the factors around the hardware,
not the cost of the hardware itself. Support contracts, lease
contracts, personnel cost – all these factors play a significant
role.
IF78: In two interviews, we observe the point being made that their choice of technology is subordinate to the financial considerations they make. This underlines the importance of financial aspects in the process.
IF79: One interview raises an interesting relation between financial investment and the energy consumed by the resources that can be acquired with this investment. It observes a trend in which a constant investment can lead to increasing energy costs: because the cost of computational resources falls faster than their energy usage, the same budget buys, and must power, ever more hardware.
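As a back-of-the-envelope illustration of this trend, the following minimal Kotlin sketch keeps the hardware budget of a procurement cycle fixed while the price per server falls. All prices, power draws, and energy tariffs below are hypothetical and are not taken from the interviews.

```kotlin
// Hypothetical numbers illustrating the trend in IF79: a constant budget buys more
// servers as unit prices fall, so the powered capacity and the energy bill grow.
fun main() {
    val budgetPerCycle = 1_000_000.0        // fixed investment per procurement cycle (EUR)
    val powerPerServerKw = 0.3              // assumed average draw per server (kW)
    val hoursPerYear = 24 * 365
    val energyPriceEurPerKwh = 0.10         // assumed electricity price (EUR/kWh)

    // Assumed falling server prices over three procurement cycles.
    for ((year, serverPrice) in listOf(2016 to 10_000.0, 2019 to 7_000.0, 2022 to 5_000.0)) {
        val servers = (budgetPerCycle / serverPrice).toInt()
        val yearlyEnergyCost = servers * powerPerServerKw * hoursPerYear * energyPriceEurPerKwh
        println("$year: $servers servers, yearly energy cost ~%.0f EUR".format(yearlyEnergyCost))
    }
}
```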
4 - Q22: Human Factors
IF80: Overall, we observe two human factors being men-
tioned most frequently: the need to account for personnel ca-
pacity and time to install and set up hardware and software,
and the usage patterns that users exhibit when using the
infrastructures (each of these is mentioned in 3 interviews).
IF81: Strongly present in the commercial sphere is an aware-
ness of the need for training personnel, especially when
switching technologies. In scientific infrastructures, the fo-
cus seems to rest more on end-users. Their usage demands
and their abilities (and training) are most frequently raised
as factors in this category.
IF82: Listening to users and their demands has its limits,
however, as one interlocutor points out. They state that if
administrators ask users if they would like more computing
power, the answer will likely often be “yes”.
IF83: One interlocutor points out that improper capacity
planning can lead to stress in the team, because it can lead
to short term remedial actions becoming necessary, such as
gathering left-over capacity from the entire resource pool.
IF84: One interlocutor observes that personnel does not
grow proportionally to the size of the managed resource
pool, but that specialization and having specialized staff
for certain technologies is the deciding factor. We see this
sentiment being shared in many of the interviews.
IF85: We conclude that the human factor plays a significant
role in the process, for most surveyed infrastructures. 5 out
of the 7 interviews place special emphasis on this. Hiring costs and personnel hours in particular can add up, as one interlocutor points out.
4 - Q23: Multi-Datacenter or Local Level
IF86: We observe that the majority of surveyed clouds takes
decisions on a multi-datacenter level. The interlocutors that
report single-datacenter decision making cite differences in
architecture and requirements between different sites as the
main cause.
IF87: One interlocutor points out that while the scope of
decision making is multi-datacenter, the scope of single
decisions still focuses on single racks.
4 - Q24: Production vs. Development
IF88: 3 out of the 7 interviews mention planning capacity differently for development and testing resources than for production resources. For one of these interlocutors, this
involves setting aside a dedicated set of testing nodes per
cluster. For another, this means splitting workloads into
different datacenters. For a third, this involves having lower
redundancy on certain development machines.
IF89: Two other interviews claim that every resource in their topology is considered production, even if it is in fact used as a development or testing machine.
4 - Q25: What-If Scenarios
IF90: This question was asked infrequently due to time
constraints. One interlocutor names scenarios where costs
are conditioned against certain amounts of cooling, with
different cooling types. Another interlocutor points out that
their process currently does not contain any what-if scenar-
ios, but believes that they should in the future. This would
create a better understanding of possible future outcomes,
using factors such as the timing of deployments or new,
unknown workload patterns.
5 - Q26: Ideal Process
IF91: An aspect shared by 5 out of the 7 interviews is the
call for a more flexible, fast-paced process. Planners across
academia and industry believe the current process they fol-
low is not always able to adequately keep up with the latest
hardware trends, with special mention of accelerators. Most
of them also see another issue arising from this: procured
hardware is often idle for a (relatively long) time before it
is utilized. An ideal process would address these two issues
by adding hardware more flexibly, i.e. in smaller batches.
This is not straightforward, due to the economies of scale
that sometimes only come into effect at larger batch sizes.
IF92: One interlocutor mentions that their ideal process
would also require less time than it currently does. The years
of analysis and discussions should be reduced to months.
IF93: We observe two interlocutors mentioning a preference
for smaller-scale decision-making, with more attention to
detail. One interlocutor mentions this with respect to topolo-
gies, having per-rack decisions replace multi-datacenter de-
cisions. Another interlocutor mentions this with respect to the application domain, separating different domains into different sub-processes due to the difficulty of capacity planning for large heterogeneous environments.
IF94: One interlocutor expresses the desire for an increased
focus in the process on training of users in order to exploit
the full potential of new hardware once it arrives.
Tooling
Especially from interlocutors with commercial backgrounds,
we hear a wide variety of requests for better tooling for their
activities. We list them below, grouped into categories.
IF95: In 3 of the 7 interviews, we observe a demand for ca-
pacity planning tools, helping the practitioners rely less on
their intuition. A key request is interactivity, with answers
wanted within a maximum of two weeks. Getting immedi-
ate answers during a meeting would be even better. One
interlocutor describes this as an “interactive dashboard”.
One interlocutor also states that having the tool answer questions at different levels of accuracy (increasing over time) would be beneficial, to facilitate quick estimates upfront and more detailed analysis over a longer period, e.g., between two meetings.
IF96: We observe the requirement that tools should be affordable. Interlocutors state that the high price of an existing tooling platform that might help with this activity is mainly due to the many other features packaged along with the capacity planning functionality they actually need.
IF97: Interlocutors also express the need for tools that help
in addressing complexities in the process. Being able to track
the details of current capacity and being able to predict
needed capacity would be a first step in this direction.
IF98: One interlocutor managing a scientific infrastructure points out that any tool for this activity should support making heterogeneous decisions, which are more difficult to make but still necessary, especially in the academic domain. The request for heterogeneous capabilities is repeated by interlocutors from commercial backgrounds.
IF99: Interlocutors managing commercial infrastructure call
for tools that are aware of multi-disciplinary aspects in the
process. This includes lifecycle processes (such as aging and
maintenance) and lease contracts.
IF100: One interlocutor also expresses the wish for workload
trend analysis capabilities in any tool for this activity.
IF101: Two interlocutors list a number of what-if scenarios
that they would like to explore with a capacity planning
tool. We list the questions underlying these scenarios here,
in no particular order.
1) Deciding how much more capacity is needed after certain decisions are taken, given projected CPU usage, memory commission, and overbooking ratios (a minimal sketch of this calculation follows the list).
2) Seeing the impact different new kinds of workloads
have before they become common.
3) Deciding when to buy new hardware.
4) Modeling fail-over scenarios.
5) Deciding whether special user requests can be granted
before responding to users.
6) Choosing the best lease duration.
7) Deciding in which cluster to place new workloads.
8) Choosing the best overlap duration between acquiring
new hardware and decommissioning old hardware.
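To make the first of these questions concrete, the following minimal Kotlin sketch estimates how many additional hosts a projected vCPU demand would require under a given CPU overbooking (overcommission) ratio. All capacities, demands, and ratios are hypothetical and not drawn from the interviews; a real answer would also need to consider memory commission and other resources.

```kotlin
import kotlin.math.ceil

// Hypothetical back-of-the-envelope for what-if scenario 1: how many extra hosts
// are needed for a projected vCPU demand, given an allowed CPU overbooking ratio?
fun additionalHostsNeeded(
    projectedVcpuDemand: Int,   // projected total vCPUs requested by workloads
    currentHosts: Int,          // hosts already in the cluster
    coresPerHost: Int,          // physical cores per host
    overbookingRatio: Double,   // allowed vCPU-to-core ratio, e.g. 4.0
): Int {
    val vcpuCapacityPerHost = coresPerHost * overbookingRatio
    val hostsRequired = ceil(projectedVcpuDemand / vcpuCapacityPerHost).toInt()
    return (hostsRequired - currentHosts).coerceAtLeast(0)
}

fun main() {
    // Example: 12,000 projected vCPUs, 40 hosts of 64 cores, 4:1 overbooking.
    val extra = additionalHostsNeeded(
        projectedVcpuDemand = 12_000,
        currentHosts = 40,
        coresPerHost = 64,
        overbookingRatio = 4.0,
    )
    println("Additional hosts needed: $extra")  // 12,000 / 256 = 46.9 -> 47 hosts, so 7 extra
}
```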
5 - Q27: Other Processes Linked With Capacity Planning
IF102: One interlocutor sees no direct link between capacity planning and other processes, although an indirect link is present. Capacity planning, according to the interlocutor, tends to sit at the end of the pipeline: only after acquiring new projects is the challenge of finding capacity for these projects considered.
IF103: One interlocutor sees a close relationship between the
process and resource management strategies and research.
The live management of the infrastructure can have sig-
nificant impact on the needed capacity, just as the capac-
ity can have consequences for the management strategies
that should be employed. Migration and consolidation ap-
proaches need special attention here.
5 - Q28: Other Aspects Shared
IF104: One interlocutor states that, no matter what one plans, the future always looks (slightly) different. This does not eliminate the need for planning, but underlines the need to be flexible and to plan for unforeseen changes down the road.
IF105: One interlocutor (from a scientific background) points out the difference in speed of capacity planning processes between Europe and the United States. They believe that infrastructures in the U.S. have a faster pace of capacity planning than comparable infrastructures in Europe. They find this disadvantageous, because new hardware improvements are slower to arrive.
IF106: Two interlocutors observe that it is easier to obtain grants for an infrastructure proposal addressing one coherent need or project. The surveyed scientific infrastructures, however, tend to serve a far more heterogeneous set of use cases, which makes acquiring sufficient funds more difficult.
APPENDIX C
SOFTWARE IMPLEMENTATION
We depict in the figures below the program structure and
dependencies of the OpenDC simulation model and Capelin
extension in the form of class diagrams. Furthermore, we
highlight the interaction between components in the simu-
lator using sequence diagrams.
[Class diagrams: the OpenDC/Capelin experiment model, including the experiment descriptor hierarchy (ExperimentDescriptor, Experiment, Portfolio, Scenario, Run, TrialExperimentDescriptor, ContainerExperimentDescriptor) with fields such as name, id, seed, repetitions, topology, workload, allocationPolicy, and operationalPhenomena; the portfolio implementations (TestPortfolio, ReplayPortfolio, HorVerPortfolio, MoreVelocityPortfolio, CompositeWorkloadPortfolio, OperationalPhenomenaPortfolio); the experiment driver and reporting components (ExperimentCli, ExperimentRunner, ConsoleExperimentReporter); the trace-handling components (Sc20ParquetTraceReader, Sc20StreamingParquetTraceReader, Sc20RawParquetTraceReaderKt, Sc20TraceConverterKt, WorkloadSamplerKt, PerformanceInterferenceModelReader, SelectedVmFilter, VmInfo); and the reported event records (Event, RunEvent, HostEvent, VmEvent) with metrics such as requestedBurst, grantedBurst, overcommissionedBurst, interferedBurst, cpuUsage, cpuDemand, powerDraw, and vmCount.]
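As a reading aid for the diagrams, the following simplified Kotlin sketch outlines the experiment descriptor hierarchy they depict (Experiment, Portfolio, Scenario, Run). The field selection and signatures are simplified for illustration and do not reproduce the actual OpenDC/Capelin source.

```kotlin
// Simplified sketch of the experiment descriptor hierarchy shown in the class diagrams
// above (Experiment -> Portfolio -> Scenario -> Run). Field names follow the diagrams,
// but the types and structure are simplified and do not match the actual source.
abstract class ExperimentDescriptor {
    abstract val parent: ExperimentDescriptor?
    val root: Boolean get() = parent == null
}

class Experiment(val name: String) : ExperimentDescriptor() {
    override val parent: ExperimentDescriptor? = null
    val children = mutableListOf<Portfolio>()
}

class Portfolio(
    override val parent: Experiment,
    val id: Int,
    val name: String,
) : ExperimentDescriptor() {
    val children = mutableListOf<Scenario>()
}

class Scenario(
    override val parent: Portfolio,
    val id: Int,
    val topology: String,          // simplified: identifier of the candidate topology
    val workload: String,          // simplified: identifier of the workload trace
    val allocationPolicy: String,
    val repetitions: Int,
) : ExperimentDescriptor() {
    // Each scenario is repeated as several runs with different seeds.
    fun runs(): List<Run> = (0 until repetitions).map { Run(parent = this, id = it, seed = it) }
}

class Run(
    override val parent: Scenario,
    val id: Int,
    val seed: Int,
) : ExperimentDescriptor()
```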