A Holistic Machine Learning-Based Autoscaling Approach for Microservice Applications

Alireza Goli¹, Nima Mahmoudi¹, Hamzeh Khazaei² and Omid Ardakanian¹
¹ University of Alberta, Edmonton, AB, Canada
² York University, Toronto, ON, Canada
{goli, nmahmoud, oardakan},
Keywords: Autoscaling, Microservices, Performance, Machine Learning
Abstract: Microservice architecture is the mainstream pattern for developing large-scale cloud applications as it allows
for scaling application components on demand and independently. By designing and utilizing autoscalers for
microservice applications, it is possible to improve their availability and reduce the cost when the traffic load
is low. In this paper, we propose a novel predictive autoscaling approach for microservice applications which
leverages machine learning models to predict the number of required replicas for each microservice and the
effect of scaling a microservice on other microservices under a given workload. Our experimental results show
that the proposed approach in this work offers better performance in terms of response time and throughput
than HPA, the state-of-the-art autoscaler in the industry, and it takes fewer actions to maintain a desirable
performance and quality of service level for the target application.
The microservice architecture is the most promising approach to
developing modern large-scale cloud software systems
(Dragoni et al., 2017). It has emerged through
the common patterns adopted by big tech companies
to address similar problems, such as scalability and
changeability, and to meet business objectives such as
reducing time to market and introducing new features
and products at a faster pace (Nadareishvili et al.,
2016). Traditional software architectures, such as
monolithic architecture, are not capable of accommo-
dating these needs efficiently (Dragoni et al., 2017).
Companies like SoundCloud, LinkedIn, Netflix, and
Spotify have adopted the microservice architecture in
their organization in recent years and reported success
stories of using it to meet their non-functional requirements
(Calçado, 2014; Ihde and Parikh, 2015; Mauro,
2015; Nadareishvili et al., 2016).
In the microservice paradigm, the application
is divided into a set of small and loosely-coupled
services that communicate with each other through
a message-based protocol. Microservices are au-
tonomous components which can be deployed and
scaled independently.
One of the key features of the microservice archi-
tecture is autoscaling. It enables the application to
handle an unexpected demand growth and continue
working under pressure by increasing the system ca-
pacity. While different approaches have been pro-
posed in the literature for autoscaling of cloud ap-
plications (Kubernetes, 2020; Fernandez et al., 2014;
Kwan et al., 2019; Lorido-Botran et al., 2014; Qu
et al., 2018), most related work is not tailored for the
microservice architecture (Qu et al., 2018). This is be-
cause a holistic view of the microservice application
is not incorporated in most related work; hence each
service in the application is scaled separately with-
out considering the impact this scaling could have on
other services. To remedy the shortcoming of existing
solutions, a more effective and intelligent autoscaler
can be designed for microservice applications, a di-
rection we pursue in this paper.
We introduce Waterfall autoscaling (hereafter re-
ferred to as Waterfall for short), a novel approach
to autoscaling microservice applications. Waterfall
takes advantage of machine learning techniques to
model the behaviour of each microservice under dif-
ferent load intensities and the effect of services on
one another. Specifically, it predicts the number of re-
quired replicas for each service to handle a given load
and the potential impact of scaling a service on other
services. This way, Waterfall avoids shifting load or
possible bottlenecks to other services and takes fewer
actions to maintain the application performance and
quality of service metrics at a satisfactory level. The
main contributions of our work are as follows:
• We introduce data-driven performance models for
  describing the behaviour of microservices and
  their mutual impacts in microservice applications.
• Using these models, we design Waterfall, which is
  a novel autoscaler for microservice applications.
• We evaluate the efficacy of the proposed autoscaling
  approach using Teastore, a reference
  microservice application, and compare it with a
  state-of-the-art autoscaler used in the industry.
The rest of this paper is organized as follows.
Section 2 reviews related work on autoscaling and
Section 3 provides a motivating scenario. Section 4
presents the proposed machine learning-based perfor-
mance models for microservice applications. Sec-
tion 5 describes the design of Waterfall autoscaler.
Section 6 evaluates the proposed autoscaling tech-
nique, and Section 7 concludes the paper.
Autoscaling is a widely used and well-known con-
cept in cloud computing, mainly due to the elas-
ticity and pay-as-you-go cost model of cloud ser-
vices. With the shift in the runtime environment of
microservice applications from bare-metal servers to
more fine-grained environments, such as virtual ma-
chines and containers in the cloud, autoscaling has
become an indispensable part of microservice appli-
cations. The autoscalers can be categorized based
on different aspects from the underlying technique to
the decision making paradigm (e.g., proactive or re-
active) and the scaling method (e.g., horizontal, ver-
tical, or hybrid) (Qu et al., 2018). Based on the un-
derlying technique, autoscalers can be classified into
five categories: rule-based methods, application pro-
filing methods, analytical modelling methods, search-
based optimization methods, and machine learning-
based methods.
Rule-based autoscalers act on a set of predefined
rules to scale the application and estimate the amount of
resources to provision. This type of
autoscaler is common in the industry and usually
serves as the baseline (Qu et al., 2018). Products
such as Amazon AWS Autoscaling service (Amazon,
2020b) and Kubernetes Horizontal Pod Autoscaler
(HPA) (Kubernetes, 2020) fall into this group. Wong
et al. (Kwan et al., 2019) proposed two rule-based
autoscalers similar to Kubernetes HPA for microservices,
namely HyScale_CPU and HyScale_CPU+Mem.
HyScale_CPU uses both horizontal and vertical scaling
to scale each microservice in the target application
separately based on CPU utilization. It gives priority
to vertical scaling and applies horizontal scaling
only if the required amount of resources cannot be
acquired using vertical scaling. HyScale_CPU+Mem
operates similarly except that it considers memory
utilization in addition to CPU utilization for making the
scaling decision. Although rule-based autoscalers are
easy to implement, they typically need expert knowl-
edge about the underlying application for tuning the
thresholds and defining the scaling policies (Qu et al., 2018).
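To make the rule-based category concrete, the sketch below applies the replica rule given in the Kubernetes HPA documentation: the desired replica count is the current count scaled by the ratio of the observed metric to its target, rounded up, with no action when that ratio is within a tolerance of 1. The function name, the replica bounds, and the example numbers are illustrative assumptions and are not taken from this paper.

```python
import math


def hpa_desired_replicas(current_replicas, current_cpu, target_cpu,
                         tolerance=0.1, min_replicas=1, max_replicas=10):
    """Rule-based replica count in the style of Kubernetes HPA."""
    ratio = current_cpu / target_cpu
    # Inside the tolerance band the rule takes no scaling action.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds (illustrative defaults).
    return max(min_replicas, min(max_replicas, desired))


# Two replicas at 90% average CPU with a 50% target.
scaled_out = hpa_desired_replicas(2, 90.0, 50.0)
```

Note that the rule looks only at the service being scaled, which is the limitation that the holistic approach in this paper targets.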
Application profiling methods measure the appli-
cation capacity with a variety of configurations and
workloads and use this knowledge to determine the
suitable scaling plan for a given workload and con-
figuration. For instance, Fernandez et al. (Fernandez
et al., 2014) proposed a cost-effective autoscaling ap-
proach for single-tier web applications using hetero-
geneous Spot instances (Amazon, 2020a). They used
application profiling to measure the processing capac-
ity of the target application on different types of Spot
instances for generating economical scaling policies
with a combination of on-demand and Spot instances.
In autoscalers with analytical modelling, a mathe-
matical model of the system is used for resource esti-
mation. Queuing models are the most common ana-
lytical models used for performance modelling of ap-
plications in the cloud. In applications with more than
one component, such as microservice applications, a
network of queues is usually considered to model the
system. Gias et al. (Gias et al., 2019) proposed a hy-
brid (horizontal+vertical) autoscaler for microservice
applications based on a layered queueing network
model (LQN) named ATOM. ATOM uses a genetic
algorithm in a time-bounded search to find the opti-
mal scaling strategy. The downside of modelling mi-
croservice applications with queuing network models
is that finding the optimal solution for scaling is com-
putationally expensive. Moreover, in queueing mod-
els, measuring the parameters such as service time
and request mix is non-trivial and demands a complex
monitoring system (Qu et al., 2018).
Search-based optimization methods use a meta-
heuristic algorithm to search the state space of system
configuration for finding the optimal scaling decision.
Chen et al. (Chen and Bahsoon, 2015) leveraged a
multi-objective ant colony optimization algorithm to
optimize the scaling decision for a single-tier cloud
application with respect to multiple objectives.
Machine learning-based autoscalers leverage ma-
chine learning models to predict the application per-
formance and estimate the required resources for dif-
ferent workloads. Machine learning techniques can
be divided into regression and reinforcement learning
methods. Regression-based methods usually find the
relationship between a set of input variables and an
output variable such as resource demand or a performance
metric. Wajahat et al. (Wajahat et al., 2019)
proposed a regression-based autoscaler for
single-tier applications. They considered a set
of monitored metrics to predict the response time of
the application, and based on predictions, they in-
creased or decreased the number of virtual machines
assigned to the application on OpenStack. Jindal et
al. (Jindal et al., 2019) used a regression model to es-
timate the microservice capacity (MSC) for each ser-
vice in a microservice application. MSC is the max-
imum number of requests that a microservice with a
certain number of replicas can serve per second with-
out violating the service level objective (SLO). They
obtained this value by sandboxing and stress-testing
each service for several configuration deployments
and then fitting a regression model to the collected
data. In reinforcement learning approaches, an agent
tries to find the optimal scaling policy for each state
of the system (without assuming prior knowledge)
through interaction with the system. Iqbal et al. (Iqbal
et al., 2015) leveraged reinforcement learning to learn
autoscaling policies for a multi-tier web application
under different workloads. They identified the work-
load pattern from access logs and learned the appro-
priate resource allocation policy for a specific work-
load pattern so that SLO is satisfied and resource uti-
lization is minimized. The drawback of reinforce-
ment learning methods is the poor performance of au-
toscalers at the early stages of deployment because it
takes some time for the reinforcement learning model
to learn the optimal policy. Machine learning has also
been used for workload prediction in proactive
autoscaling. These methods use time series forecasting
models to predict the future workload and provision
resources ahead of time based on that prediction.
Coulson et al. (Coulson
et al., 2020) used a stacked LSTM (Hochreiter and
Schmidhuber, 1997) model to predict the composition
of the next requests and scale each service in the application
accordingly. Abdullah et al. (Abdullah et al.,
2020) introduced a proactive autoscaling method for
microservices in fog computing micro data centers.
They predict the incoming workload with a regression
model using different window sizes and identify the
number of containers required for each microservice
separately. The main problem with these methods is
that they can lead to dramatic overprovisioning or underprovisioning
of resources (Qu et al., 2018) owing
to the uncertainty of workload arrivals, especially in
news feed and social network applications.

Figure 1: Interaction of services in an example microservice application.

A microservice application usually consists of multi-
ple services interacting with each other to accomplish
their job. The rate at which a service sends requests to
the downstream services depends on the rate at which
it receives requests and the amount of resources avail-
able for processing these requests. Thus, scaling a ser-
vice that may invoke a group of other services might
subsequently change the load on those services. Con-
sider the interaction between three services in an ex-
ample microservice application depicted in Figure 1.
Service 1 calls Service 2 and Service 3 to complete
some tasks. If Service 1 is under heavy load (R1),
scaling Service 1 would cause an increase in the load
observed by Service 2 (R2) and Service 3 (R3). If
we predict how scaling Service 1 degrades the per-
formance of Service 2 and Service 3, we can avoid
the shift in the load and a possible bottleneck from
Service 1 to Service 2 and Service 3 by scaling Ser-
vice 2 and Service 3 proactively at the same time as
Service 1.
To further examine the cascading effect of scal-
ing a service in a microservice application on other
services, we conducted an experiment using an example
microservice application called Teastore (von
Kistowski et al., 2018). Teastore is an emulated online
store for tea and tea-related products. It is a
reference microservice application developed by the
performance engineering community to provide re-
searchers with a standard microservice application
that can be used for testing and evaluating research in
different areas such as performance modelling, cloud
resource management, and energy efficiency analy-
sis (von Kistowski et al., 2018). Figure 2 shows ser-
vices in the Teastore application and the relationships
between them. The solid lines show the dependencies
between services, and dashed lines indicate that the
service call happens only once at startup time. Teastore
includes five primary services: Webui, Auth,
Persistence, Recommender, and Image. Webui is the
front-end service that users interact with and is
responsible for rendering the user interface. Auth stands
for authentication; it verifies the user’s credentials and
session data. The Persistence service interacts with
the database and performs create, read, update, and
delete (CRUD) operations. The Recommender service
predicts the user preference for different products
and recommends appropriate products to users using
a collaborative filtering algorithm. The Image service
provides images of products in different sizes. In
addition to the main services, Teastore has another
component named Registry, which is responsible for
service registration and discovery.

Figure 2: Architecture of the Teastore application.

As can be seen in Figure 2, depending on the
request type, the Webui service may invoke Image,
Persistence, Auth, and Recommender services. We
generated a workload comprising different types of
requests so that the Webui service calls all four of these
services. Keeping the workload intensity fixed, we
increased the number of replicas for the Webui service
from 1 to 5 and monitored the request rate of Webui
as well as its downstream rate to the other services,
each of which had one replica. For
two services m and n, we define the request rate
of service m, denoted by RR(m), as the number of
requests it receives per second, and the downstream rate
of service m to service n, denoted by DR(m,n), as the
number of requests service m sends to service n per
second. For instance, in Figure 1, RR(Service 1) is equal
to R1 and DR(Service 1, Service 2) is equal to R2.
Figure 3 shows the results of our experiment. The
left plot and right plot show the request rate and total
downstream rate of the Webui service for different
numbers of replicas, respectively. Error bars indicate
the 95% confidence interval. Table 1 shows the request
rate of Webui and its downstream rate to each
service for each number of replicas. As can
be seen, scaling the Webui service leads to an increase
in its request rate, which in turn increases the downstream
rate of the Webui service to other services.
Therefore, under heavy load, scaling the Webui service
increases the load on the other four services.

Figure 3: Request rate and total downstream rate of Webui under the same load intensity for different numbers of replicas.

Table 1: Request Rate (RR) and Downstream Rate (DR) of the Webui service to each service, in requests per second, by number of Webui replicas.

Monitored Metric           1      2      3      4      5
RR(Webui)                480    936   1334   1438   1445
DR(Webui,Persistence)   1083   2108   3005   3239   3253
DR(Webui,Auth)           482    938   1337   1441   1447
DR(Webui,Image)          562   1094   1559   1680   1688
DR(Webui,Recommender)    121    235    334    360    362

The cascading effect of microservices on each
other motivates the idea of having an autoscaler that
takes this effect into account and takes action accord-
ingly. Autoscalers that consider and scale different
services in an application independently are unaware
of this relationship, thereby making premature de-
cisions that could lead to extra scaling actions and
degradation in the quality of service of the applica-
tion. In this work, we introduce a novel autoscaler to
address the deficiencies in these autoscalers.
This section presents the machine learning models
adopted for performance modelling of microservice
applications. These models are at the core of our
autoscaler: they predict the performance of each service
and the variation in its performance that results from
scaling another service. We use two
machine learning models for each microservice, which
are described in the following subsections.
Figure 4: Input features and the predicted value of the CPU model.

Figure 5: Input features and the predicted value of the request model.

4.1 Predictive Model for CPU
The CPU Model captures the performance behaviour
of each microservice in a microservice application
in terms of CPU utilization. CPU utilization is a
good proxy for estimating the workload of a microser-
vice (Gotin et al., 2018). Therefore, we use the aver-
age CPU utilization of the microservice replicas as the
performance metric for scaling decisions. Depending
on the target performance objective, this metric can
be replaced with other metrics, such as response time
and message queue metrics.
As Figure 4 demonstrates, the CPU Model takes
the number of service replicas and the request rate of
the service as input features and predicts the service’s
average CPU utilization. In other words, this model
tells us what the average CPU utilization of the
service would be under a specific load.
4.2 Predictive Model for Request Rate
The Request Model predicts the new request rate of
a microservice after scaling, that is, after the number
of service replicas changes. As shown in Figure 5, we feed
the current number of service replicas, the current
average CPU utilization of the service, the current request
rate of the service, and the new number of service replicas
as input features to the Request Model to predict the
new request rate for the service. The current replica count,
current CPU utilization, and current request rate
describe the state of the service before scaling. The new
replica count and new request rate reflect the state of the
service after scaling. We use the output of the Request
Model for a given service to calculate the new downstream
rate of that service to other services. Thus, the
Request Model helps us predict the effect of scaling a
service on other services.
As we discussed in Section 3, any changes in the
request rate of a service in a microservice application
might lead to changes in the downstream rate
of that service to other services. However, we observed
that under the same workload intensity, when
we scale a service, the downstream rate of that service
to another service changes linearly with respect to its
request rate. For instance, we used the results from
Section 3 in Table 1 and divided the downstream rate
of Webui service to other services by its request rate
and got the values in Table 2. Consequently, when we
scale a service, if we have the new request rate after
scaling, we can calculate its new downstream rate to
other services. We achieve this goal through Request
Model. For example, according to Table 1, for one
replica RR(Webui) = 480 and DR(Webui,Persistence)
= 1083. Moreover, from Table 2 we know that for all
replica counts, DR(Webui,Persistence) / RR(Webui)
= 2.25. Therefore, if we scale out the Webui service to
two replicas and have the new value for RR(Webui) as
936, we can estimate the new DR(Webui,Persistence)
by multiplying the new RR(Webui) by 2.25, which will
be 936 * 2.25 = 2106. The reason for the difference
between the calculated value (2106) and the real
value (2108) for the new DR(Webui,Persistence) is
that numbers in Table 1 and Table 2 are rounded due
to lack of space.

Table 2: The ratio of Downstream Rate (DR) values of Webui service to its Request Rate (RR) for different numbers of replicas under the same workload intensity.

DR/RR                                 1      2      3      4      5
DR(Webui,Persistence) / RR(Webui)   2.25   2.25   2.25   2.25   2.25
DR(Webui,Auth) / RR(Webui)          1.00   1.00   1.00   1.00   1.00
DR(Webui,Image) / RR(Webui)         1.17   1.17   1.17   1.17   1.17
DR(Webui,Recommender) / RR(Webui)   0.25   0.25   0.25   0.25   0.25

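The arithmetic above can be reproduced in a few lines. This is a minimal sketch: the ratios are the rounded values of Table 2, assumed here to be listed in the same service order as the DR rows of Table 1, and the function name is hypothetical.

```python
# Rounded DR(Webui, n) / RR(Webui) ratios from Table 2 (assumed row order:
# Persistence, Auth, Image, Recommender, as in Table 1).
DR_TO_RR_RATIO = {
    "Persistence": 2.25,
    "Auth": 1.00,
    "Image": 1.17,
    "Recommender": 0.25,
}


def estimate_downstream_rates(new_request_rate, ratios):
    """Estimate DR(m, n) for every downstream service n from the new RR(m)."""
    return {service: new_request_rate * ratio for service, ratio in ratios.items()}


# RR(Webui) with two replicas, taken from Table 1.
estimates = estimate_downstream_rates(936, DR_TO_RR_RATIO)
print(estimates["Persistence"])  # 2106.0, versus the measured 2108 in Table 1
```

In the autoscaler the new request rate would come from the Request Model rather than from a measurement, so the quality of these estimates depends on that prediction.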
4.3 Data Collection
To train the CPU Model and Request Model for each
microservice, we collected two datasets per microservice.
The data collection for each microservice
is performed independently of other services. We
deploy a sufficient number of replicas of the other services
to avoid any limitations imposed by them on
the target service during data collection.
The dataset for CPU Model includes three metrics:
the number of replicas, the request rate per second,
and the average CPU utilization of replicas. Each
data point results from applying a workload with a
fixed number of threads for 12 minutes to the front-end
service. At the end of each run, we collect each
metric’s values during this period and use their mean
as the value of the metric for that data point. Note that
we ignore data values for the first and last minute of
each run to exclude the warm-up and cool-down periods.
We consider different numbers of replicas for
the target service, and for each number of replicas,
we change the number of threads to increase the number
of requests until we reach the saturation point for
that specific number of replicas. For instance, for one
replica of an example service, we apply the workload
with 1, 2, 3, 4, and 5 threads, resulting in five data points.
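The aggregation of one run into one CPU Model data point can be sketched as follows. The sample layout (seconds since the start of the run, request rate, CPU utilization), the function name, and the sample values are assumptions made for illustration; only the 12-minute run length and the one-minute trimming come from the text.

```python
from statistics import mean


def run_to_data_point(samples, replicas, run_seconds=720, trim_seconds=60):
    """Average the metrics of one run, ignoring warm-up and cool-down.

    samples: iterable of (t_seconds, request_rate, cpu_utilization),
    where t_seconds is measured from the start of the run.
    """
    kept = [(rate, cpu) for t, rate, cpu in samples
            if trim_seconds <= t < run_seconds - trim_seconds]
    if not kept:
        raise ValueError("no samples left after trimming")
    return {
        "replicas": replicas,
        "request_rate": mean(rate for rate, _ in kept),
        "cpu_utilization": mean(cpu for _, cpu in kept),
    }


# Made-up run sampled once per minute; the first and last samples are outliers.
samples = [(t, 10.0 if t in (0, 660) else 100.0, 5.0 if t in (0, 660) else 50.0)
           for t in range(0, 720, 60)]
point = run_to_data_point(samples, replicas=2)
```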
The Request Model dataset contains five metrics:
the current number of replicas, the current request rate
per second, the current average CPU utilization of
replicas, the new number of replicas, and the new re-
quest rate per second. Each data point for this dataset
results from the merging of two runs with the same
number of threads but a different number of replicas.
For example, we merge the result for the run with one
replica and five threads with the result for two repli-
cas and five threads to generate a data point for the
Request Model dataset. More specifically, we get the
current replica, current CPU utilization, and current
request rate from the first run and the new replica and
new request rate from the second run.
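The merging step can be sketched as below. The record field names and the measurements are made up for illustration, and pairing runs in both directions (scale out and scale in) is an assumption; the text only states that two runs with the same number of threads and different replica counts are merged.

```python
from itertools import permutations


def build_request_dataset(cpu_rows):
    """Pair runs that share a thread count but differ in replica count."""
    by_threads = {}
    for row in cpu_rows:
        by_threads.setdefault(row["threads"], []).append(row)

    dataset = []
    for rows in by_threads.values():
        # Ordered pairs: the first run is the state before scaling,
        # the second run is the state after scaling.
        for old, new in permutations(rows, 2):
            if old["replicas"] == new["replicas"]:
                continue
            dataset.append({
                "current_replicas": old["replicas"],
                "current_request_rate": old["request_rate"],
                "current_cpu": old["cpu_utilization"],
                "new_replicas": new["replicas"],
                "new_request_rate": new["request_rate"],  # prediction target
            })
    return dataset


cpu_rows = [  # made-up measurements
    {"threads": 5, "replicas": 1, "request_rate": 400.0, "cpu_utilization": 92.0},
    {"threads": 5, "replicas": 2, "request_rate": 760.0, "cpu_utilization": 88.0},
    {"threads": 4, "replicas": 1, "request_rate": 350.0, "cpu_utilization": 81.0},
]
request_rows = build_request_dataset(cpu_rows)
```

With these three runs, only the two five-thread runs can be paired, which yields one scale-out row and one scale-in row.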
Figure 6 shows an example data point for CPU
Model and Request Model datasets. The data point
for Request Model is a combination of two runs that
have n threads with x and x′ replicas, respectively.
4.4 Model Training Results
We trained CPU Model and Request Model for all
microservices in the Teastore application using datasets
created from collected data. Each dataset was split
into training and validation sets. The training sets
and validation sets contain 80% and 20% of data,
respectively. We used Linear Regression, Random Forest,
and Support Vector Regressor algorithms for the
training process and compared them in terms of mean
absolute error (MAE), mean squared error (MSE),
root mean squared error (RMSE), and R² score. Table 3
and Table 4 show the results for CPU Model and
Request Model of each microservice, respectively. As
can be seen from the results, Support Vector Regressor
and Random Forest provide lower MAE, MSE,
RMSE, and higher R² score for CPU Model and Request
Model compared to Linear Regression. Currently,
we use offline learning to train machine learning
models, but our approach can be adapted to leverage
online learning as well.
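For reference, the four reported metrics and an 80/20 split can be computed as in the sketch below. The helper names are hypothetical, and the random shuffle before splitting is an assumption, since the text does not say how the split was drawn.

```python
import math
import random


def train_validation_split(rows, validation_fraction=0.2, seed=0):
    """Shuffle and split rows into (training, validation) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE, and the R² score of a set of predictions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_total = sum((t - mean_true) ** 2 for t in y_true)
    ss_residual = sum(e * e for e in errors)
    return {
        "MAE": mae,
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "R2": 1.0 - ss_residual / ss_total,
    }


train, validation = train_validation_split(range(10))
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
```

RMSE is the square root of MSE, which is a quick consistency check for the values reported in Table 3 and Table 4.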
In this section, we present the autoscaler we designed
using the performance models described in Section 4.
We first outline the architecture of Waterfall and discuss
its approach to abstracting the target microservice
application. We then elaborate on the algorithm
that Waterfall uses to obtain the scaling strategy.
5.1 Architecture and Abstraction
Figure 7 shows the architecture of Waterfall, which
is based on the MAPE-K control loop (Brun et al.,
2009; Kephart et al., 2003; Kephart and Chess, 2003)
with five elements, namely monitor, analysis, plan,
execute, and a shared knowledge base.
Waterfall abstracts the target microservice application
as a directed graph, hereafter called the microservice
graph. In the microservice graph, vertices
represent services, and edges show the dependencies
between services. The direction of an edge
determines which service sends requests to the other
one. For instance, consider the following vertex (V)
and edge (E) sets for an example microservice graph:

$$V = \{A, B, C\}, \qquad E = \{(A, B), (A, C)\}$$

This microservice graph contains three services and
two edges. A, B, and C are three different services.
The edges (A, B) and (A, C) show that service A calls
services B and C, respectively. In addition, we assign
the following three weights to each directed edge
(m, n) between two microservices m and n:

• DR(m, n), which is defined in Section 3.
• Request Rate Ratio(m, n), which is defined for
  two services m and n as:

$$\text{Request Rate Ratio}(m, n) = \frac{DR(m, n)}{RR(n)}$$

• Downstream Rate Ratio(m, n), which is defined
  for two services m and n as:

$$\text{Downstream Rate Ratio}(m, n) = \frac{DR(m, n)}{RR(m)}$$

We calculate these weights for each edge and populate
the graph using the monitoring data. Figure 8
shows the microservice graph for the Teastore application.
The microservice graph for small applications
can be derived manually according to service dependencies.
There are also tools (Ma et al., 2018) for
extracting the microservice graph automatically.
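The weighted graph can be sketched as a small data structure. The sketch assumes that Request Rate Ratio(m, n) normalises DR(m, n) by RR(n) and that Downstream Rate Ratio(m, n) normalises it by RR(m), which matches how the two ratios are used by the scaling algorithm and the DR/RR ratios of Table 2. The class name, the extra clients node, the extra edge (C, B), and all rates are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class MicroserviceGraph:
    """Directed graph whose edges (m, n) carry the monitored DR(m, n)."""
    dr: dict = field(default_factory=dict)  # (m, n) -> requests per second

    def request_rate(self, n):
        """RR(n): total rate arriving at n over all incoming edges."""
        return sum(rate for (_, target), rate in self.dr.items() if target == n)

    def request_rate_ratio(self, m, n):
        """Share of the request rate of n that comes from m."""
        return self.dr[(m, n)] / self.request_rate(n)

    def downstream_rate_ratio(self, m, n):
        """Requests m sends to n per request that m receives."""
        return self.dr[(m, n)] / self.request_rate(m)


# Hypothetical rates: clients call A, A calls B and C, and C also calls B.
graph = MicroserviceGraph(dr={
    ("clients", "A"): 100.0,
    ("A", "B"): 200.0,
    ("A", "C"): 50.0,
    ("C", "B"): 50.0,
})
```

Here B receives 250 requests per second in total, so 80% of its load comes from A, while A sends two requests to B for every request it receives.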
5.2 Scaling Algorithm
Our proposed algorithm for autoscaling of microservices
leverages machine learning models to predict
the number of required replicas for each service and
the impact of scaling a service on the load of other
services. This way, we provide a more responsive
autoscaler that takes fewer actions to keep the application
at the desired performance.

Figure 6: The construction of datasets for CPU Model and Request Model. Request Model dataset is built by merging data
points from the CPU Model dataset.

Table 3: The accuracy and R² score of CPU Model for different services using Linear Regression (LR), Random Forest (RF),
and Support Vector Regressor (SVR).

            | Linear Regression (LR)      | Random Forest (RF)          | SVR
Service     |  MAE     MSE   RMSE  R² (%) |  MAE     MSE   RMSE  R² (%) |  MAE     MSE   RMSE  R² (%)
Webui       | 4.97   45.32   6.73   92.21 | 3.67   18.57   4.31   96.81 | 1.43    3.07   1.75   99.47
Persistence | 4.12   27.55   5.25   94.03 | 3.26   17.02   4.13   96.31 | 0.88    1.91   1.38   99.59
Auth        | 4.40   37.39   6.11   94.82 | 4.26   34.45   5.87   95.23 | 1.73    6.45   2.54   99.11
Recommender | 2.62   12.42   3.52   92.94 | 1.39    4.23   2.06   97.60 | 1.38    5.00   2.23   97.16
Image       | 3.81   20.12   4.49   96.87 | 3.61   21.09   4.59   96.72 | 1.54    3.45   1.86   99.50

Figure 7: Architecture of Waterfall autoscaler.

Figure 8: Teastore microservice graph. Each edge, (c, w), (w, p), (w, a), (w, i), (w, r), and (a, p), is labelled with its Request Rate Ratio and Downstream Rate Ratio.

At the end of each monitoring interval, Waterfall
initializes the microservice graph weights using
monitoring data and runs the scaling algorithm to
find the new scaling configuration. The steps in the
Waterfall scaling algorithm are summarized in Algorithm 1.
The algorithm takes the microservice graph,
start node, and monitoring data as input and provides
the new scaling configuration as the output. In the
beginning, it initializes New Config with the current
configuration of the system using monitoring data and
starts finding the new configuration.
It traverses the microservice graph using the
Breadth-First Search (BFS) algorithm and starts the
search from the start node. The start node is usually
the front-end service, which is the users’ interaction
point with the application. At each node, the algo-
rithm checks whether the CPU utilization of the ser-
vice is above or below the target threshold.
If the CPU utilization is higher than the
threshold, it calls the scaleOut function. This function
increases the number of service replicas and predicts the new
request rate of the service using Request Model. After
predicting the new request rate, it uses CPU Model to
predict the new CPU utilization with the new number
of replicas and the new request rate. If the new predicted
CPU utilization is below the threshold, it takes
the new replica count as the new configuration for the
service. Afterwards, it updates the microservice request
rate using the updateReqRate function. As Algorithm 3
indicates, function updateReqRate updates
the DR value on all edges ending at this microservice
based on the Request Rate Ratio value on each edge.
If the CPU utilization is less than the threshold, it
calls the scaleIn function. This function reduces the
number of service replicas and predicts the new request
rate of the service using Request Model. It then
feeds the new request rate and new replica count to CPU
Model to predict the new CPU utilization. If the new
CPU utilization is still below the threshold, it takes
the new replica count as the new configuration for the
service and updates the microservice request rate using
the updateReqRate function. Otherwise, it keeps the
current replica count as the configuration of the service.
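A minimal sketch of the two functions follows, under stated assumptions. The text does not say whether replicas change one at a time or in a loop, nor how the search is bounded; this sketch steps one replica at a time within assumed bounds. The two toy models are hypothetical stand-ins for the trained Request Model and CPU Model, with made-up capacity and load.

```python
def scale_out(replicas, rate, cpu, request_model, cpu_model, threshold, max_replicas):
    """Add replicas until the predicted CPU utilization drops below the threshold."""
    best = (replicas, rate)
    for new_replicas in range(replicas + 1, max_replicas + 1):
        new_rate = request_model(replicas, cpu, rate, new_replicas)
        best = (new_replicas, new_rate)
        if cpu_model(new_replicas, new_rate) < threshold:
            break
    return best


def scale_in(replicas, rate, cpu, request_model, cpu_model, threshold, min_replicas=1):
    """Remove replicas while the predicted CPU utilization stays below the threshold."""
    best = (replicas, rate)
    for new_replicas in range(replicas - 1, min_replicas - 1, -1):
        new_rate = request_model(replicas, cpu, rate, new_replicas)
        if cpu_model(new_replicas, new_rate) >= threshold:
            break  # keep the last configuration that met the threshold
        best = (new_replicas, new_rate)
    return best


# Hypothetical stand-ins: each replica saturates at 500 requests/s and
# clients offer 1440 requests/s in total.
OFFERED_LOAD = 1440.0


def toy_request_model(current_replicas, current_cpu, current_rate, new_replicas):
    return min(OFFERED_LOAD, 500.0 * new_replicas)


def toy_cpu_model(replicas, rate):
    return 100.0 * rate / (500.0 * replicas)


out_result = scale_out(1, 480.0, 96.0, toy_request_model, toy_cpu_model, 50.0, 10)
in_result = scale_in(8, 1440.0, 36.0, toy_request_model, toy_cpu_model, 50.0)
```

With these stand-ins both directions settle on six replicas, the smallest count whose predicted CPU utilization (48%) is below the 50% threshold.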
Table 4: The accuracy and R² score of Request Model for different services using Linear Regression (LR), Random Forest
(RF), and Support Vector Regressor (SVR).

            | Linear Regression (LR)          | Random Forest (RF)              | SVR
Service     |    MAE       MSE    RMSE  R² (%) |    MAE       MSE    RMSE  R² (%) |    MAE       MSE    RMSE  R² (%)
Webui       |  50.01   3568.55   59.74   97.83 |  25.67   1596.37   39.95   99.02 |  32.01   2134.50   46.20   98.70
Persistence |  71.21   9708.55   98.53   99.50 |  34.94   2717.49   52.13   99.86 |  39.36   3041.56   55.15   99.84
Auth        |  79.23  11158.89  105.64   96.35 |  47.34   3857.84   62.11   98.74 |  39.57   3611.02   60.09   98.82
Recommender |  31.22   1258.56   35.48   94.26 |  24.49    911.24   30.19   95.84 |  20.27    620.22   24.90   97.17
Image       |  71.45   8137.23   90.21   98.72 |  72.48   7328.20   85.60   98.85 |  42.99   3642.93   60.36   99.43

If the node being processed has any children, the algorithm
goes to the next step, which applies the effect of the change in
the service replica count to downstream services by calling the
updateDownstreamRate function. As Algorithm 3 shows, this
function updates the DR value on all edges starting from the
current node and ending at child nodes based on the Downstream
Rate Ratio value on each edge.
After this step, the algorithm continues the BFS search with
the next node and repeats the steps mentioned above. After
searching the whole graph and inferring the new configuration
for each service, the search is over and the algorithm returns
the new scaling configuration.
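The graph bookkeeping described above can be illustrated with a small Python sketch. It assumes the microservice graph is stored as a dictionary keyed by `(source, destination)` edges, with the attribute names `DR`, `ReqRateRatio`, and `DownstreamRateRatio` chosen for illustration. Multiplying the new request rate by the ratio stored on each edge is our reading of "based on the ... Ratio value on each edge"; the example graph and its numbers are hypothetical.

```python
from typing import Dict, Tuple

Edge = Tuple[str, str]
Graph = Dict[Edge, Dict[str, float]]


def get_req_rate(graph: Graph, service: str) -> float:
    """Request rate of a service: sum of DR over all edges ending at it."""
    return sum(attrs["DR"] for (_, dst), attrs in graph.items() if dst == service)


def update_req_rate(graph: Graph, service: str, new_req_rate: float) -> None:
    """Split a new request rate over the incoming edges of a service."""
    for (_, dst), attrs in graph.items():
        if dst == service:
            attrs["DR"] = new_req_rate * attrs["ReqRateRatio"]


def update_downstream_rate(graph: Graph, service: str, new_req_rate: float) -> None:
    """Propagate a new request rate to the outgoing edges of a service."""
    for (src, _), attrs in graph.items():
        if src == service:
            attrs["DR"] = new_req_rate * attrs["DownstreamRateRatio"]


# Hypothetical graph: Webui calls Persistence and Auth, Auth calls Persistence.
graph: Graph = {
    ("webui", "persistence"): {"DR": 150.0, "ReqRateRatio": 0.75, "DownstreamRateRatio": 1.5},
    ("auth", "persistence"): {"DR": 50.0, "ReqRateRatio": 0.25, "DownstreamRateRatio": 1.0},
    ("webui", "auth"): {"DR": 50.0, "ReqRateRatio": 1.0, "DownstreamRateRatio": 0.5},
}

before = get_req_rate(graph, "persistence")
update_req_rate(graph, "persistence", 240.0)
after = get_req_rate(graph, "persistence")
print(before, after)
```

Keeping the rates on the edges rather than on the nodes means that a change at one service is visible to every neighbour the next time its request rate is summed.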
As lines 8-11 of Algorithm 1 show, if the request rate of the
service in the current node has been changed in the graph in
previous steps, the CPU utilization in the monitoring data is no
longer valid, and we should estimate the new CPU utilization
using CPU Model. The getReqRate function calculates the request
rate of a node by summing the DR value on all edges ending at
this node.
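To make the overall traversal concrete, the following Python sketch walks a toy service graph in breadth-first order and re-estimates the CPU utilization whenever an upstream decision has changed a service's request rate. It is a simplified illustration, not the authors' implementation: the scaling decision is abstracted into a hypothetical `decide` callable, `cpu_model` is a toy stand-in for CPU Model, the `seen` set (which visits each service once) is an added assumption, and all numbers are invented.

```python
from collections import deque

THRESH = 50.0  # assumed CPU utilization threshold in percent


def get_req_rate(edges, service):
    """Request rate of a service: sum of DR over all edges ending at it."""
    return sum(attrs["DR"] for (_, dst), attrs in edges.items() if dst == service)


def autoscale(edges, start, monitoring, config, cpu_model, decide):
    """Breadth-first pass over the service graph; returns a new configuration."""
    new_config = dict(config)
    queue, seen = deque([start]), {start}
    while queue:
        service = queue.popleft()
        req_rate = get_req_rate(edges, service)
        if monitoring[service]["req_rate"] == req_rate:
            cpu_util = monitoring[service]["cpu_util"]  # monitoring data still valid
        else:
            # An upstream decision changed the rate, so re-estimate the utilization.
            cpu_util = cpu_model(service, new_config[service], req_rate)
        new_replica, pred_req_rate = decide(new_config[service], cpu_util, req_rate)
        changed = new_replica != new_config[service]
        new_config[service] = new_replica
        for (src, dst), attrs in edges.items():
            if changed and dst == service:
                attrs["DR"] = pred_req_rate * attrs["ReqRateRatio"]
            if src == service:
                if changed:
                    attrs["DR"] = pred_req_rate * attrs["DownstreamRateRatio"]
                if dst not in seen:
                    seen.add(dst)
                    queue.append(dst)
    return new_config


def toy_cpu_model(service, replicas, req_rate):
    return 0.6 * req_rate / replicas


def toy_decide(replicas, cpu_util, req_rate):
    """Toy rule: add one replica (admitting 20 more requests/s) above THRESH."""
    if cpu_util > THRESH:
        return replicas + 1, req_rate + 20.0
    return replicas, req_rate


def make_edges():
    return {
        ("users", "webui"): {"DR": 100.0, "ReqRateRatio": 1.0, "DownstreamRateRatio": 1.0},
        ("webui", "persistence"): {"DR": 100.0, "ReqRateRatio": 1.0, "DownstreamRateRatio": 1.0},
    }


monitoring = {"webui": {"req_rate": 100.0, "cpu_util": 80.0},
              "persistence": {"req_rate": 100.0, "cpu_util": 40.0}}
edges = make_edges()
new_config = autoscale(edges, "webui", monitoring,
                       {"webui": 1, "persistence": 1}, toy_cpu_model, toy_decide)
print(new_config)
```

In this toy run the monitored CPU utilization of the downstream service is below the threshold, yet it is scaled in the same pass because the higher request rate propagated from the front-end leads to a predicted utilization above the threshold.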
In this section, we evaluate the performance of the Waterfall
autoscaler by comparing it with HPA, which is the de facto
standard for autoscaling in the industry. First, we elaborate on
the details of our experimental setup. After that, we present
and discuss our experimental results for the comparison of
Waterfall and HPA in terms of different metrics.
6.1 Experimental Setup
6.1.1 Microservice Application Deployment
We created a Kubernetes² cluster as the container
orchestration system with one master node and
four worker nodes in the Compute Canada Arbutus
Algorithm 1: Autoscaling Algorithm
Input: Microservice Graph G, Start Node S, Monitoring Data M
Output: New Scaling Configuration new_config
1  new_config ← initialize with current config
2  queue ← []
3  queue.append(S)
4  while queue is not empty do
5      service ← queue.pop(0)
6      req_rate_updated ← False
7      req_rate ← getReqRate(G, service)
8      if M[service]['Req Rate'] == req_rate then
9          cpu_util ← M[service]['CPU Util']
10     else
11         cpu_util ← CPU Model(service, new_config[service], req_rate)
12     curr_req_rate ← req_rate
13     curr_cpu_util ← cpu_util
14     curr_replica ← new_config[service]
15     if cpu_util >= THRESH then
16         (new_replica, pred_req_rate) ← scaleOut(curr_replica, curr_cpu_util, curr_req_rate)
17         updateReqRate(G, service, pred_req_rate)
18         new_config[service] ← new_replica
19         req_rate_updated ← True
20     else if cpu_util < THRESH and curr_replica > 1 then
21         (new_replica, pred_req_rate) ← scaleIn(curr_replica, curr_cpu_util, curr_req_rate)
22         if new_replica ≠ curr_replica then
23             updateReqRate(G, service, pred_req_rate)
24             new_config[service] ← new_replica
25             req_rate_updated ← True
26     if G[service].hasChild() and req_rate_updated then
27         updateDownstreamRate(G, service, pred_req_rate)
28     for each v ∈ G[service].adjacent() do
29         queue.append(v)
Cloud³. Each node is a virtual machine with 16 vCPUs
and 60 GB of memory running Ubuntu 18.04 as the
operating system. We deployed each microservice
in the Teastore application as a Kubernetes deploy-
ment exposed by a Kubernetes service. The incoming
traffic is distributed in a round-robin fashion between
pods that belong to a deployment. We imposed con-
³ Compute Canada Cloud:
Algorithm 2: Scale Out and Scale In Functions
Function scaleOut(curr_replica, curr_cpu_util, curr_req_rate):
    new_replica ← curr_replica
    pred_cpu_util ← curr_cpu_util
    while pred_cpu_util > THRESH do
        new_replica ← new_replica + 1
        pred_req_rate ← Request Model(service, curr_replica, curr_cpu_util, curr_req_rate, new_replica)
        pred_cpu_util ← CPU Model(service, new_replica, pred_req_rate)
    return (new_replica, pred_req_rate)
Function scaleIn(curr_replica, curr_cpu_util, curr_req_rate):
    new_replica ← curr_replica
    pred_cpu_util ← curr_cpu_util
    while pred_cpu_util < THRESH do
        new_replica ← new_replica − 1
        pred_req_rate ← Request Model(service, curr_replica, curr_cpu_util, curr_req_rate, new_replica)
        pred_cpu_util ← CPU Model(service, new_replica, pred_req_rate)
        if pred_cpu_util < THRESH then
            new_req_rate ← pred_req_rate
    return (new_replica + 1, new_req_rate)
straints on the amount of resources available to each
pod using the resource request and limit mechanism
in Kubernetes. The resource request is the amount of
resources guaranteed for a pod, and the resource limit
is the maximum amount of resources that a pod can
have in the cluster. We used the same value for both
resource request and limit to decrease the variability
in pods’ performance. Table 5 shows the details of the
CPU and memory configuration for each pod. We
configured the startup, readiness, and liveness probes
for each pod to measure the exact number of ready
pods in the system at any time and to have a recovery
mechanism in place for unhealthy pods. We used
the Kubernetes API to query or change the number of
pods in a deployment.
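As an illustration of the request and limit configuration described above, the following Python sketch builds the `resources` fragment of a container specification from the CPU values in Table 5, using the same quantities for requests and limits. The helper name is hypothetical, and writing the 512MB memory value as `512Mi` is an assumption about the intended unit. In Kubernetes, a pod whose containers all have equal CPU and memory requests and limits is assigned the Guaranteed QoS class, which is consistent with the goal of reducing performance variability.

```python
# CPU values from Table 5, in millicores.
CPU_MILLICORES = {
    "webui": 1200,
    "persistence": 900,
    "auth": 900,
    "recommender": 800,
    "image": 1100,
}
MEMORY = "512Mi"  # assumption: Table 5's "512MB" written with a binary suffix


def container_resources(service: str) -> dict:
    """Return a `resources` fragment whose requests and limits are identical."""
    quantities = {"cpu": f"{CPU_MILLICORES[service]}m", "memory": MEMORY}
    return {"requests": dict(quantities), "limits": dict(quantities)}


webui = container_resources("webui")
print(webui)
```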
6.1.2 Load Generation
We used Jmeter⁴, an open-source tool for load testing
of web applications, to generate an increasing
workload with a length of 25 minutes for the Teastore
application. This workload is a common browsing
workload that represents the behaviour of most users
when visiting an online shopping store. It follows a
closed workload model and includes actions like visiting
the home page, logging in, and adding a product to the cart.
Algorithm 3: Microservice Graph Helper
Function getReqRate(Microservice Graph G, Node service):
    req_rate ← 0
    for each (m, n) ∈ G do
        if n == service then
            req_rate ← req_rate + G[m][n]['DR']
    return req_rate
Function updateReqRate(Microservice Graph G, Node service, new_req_rate):
    for each (m, n) ∈ G do
        if n == service then
            G[m][n]['DR'] ← new_req_rate × G[m][n]['ReqRateRatio']
Function updateDownstreamRate(Microservice Graph G, Node service, new_req_rate):
    for each (m, n) ∈ G do
        if m == service then
            G[m][n]['DR'] ← new_req_rate
Table 5: Resource request and limit of Teastore services.
Service Name CPU Memory
Webui 1200mCore 512MB
Persistence 900mCore 512MB
Auth 900mCore 512MB
Recommender 800mCore 512MB
Image 1100mCore 512MB
Jmeter acts like users’ browsers and sends requests
sequentially to the Teastore front-end service using a set
of threads. The number of threads controls the rate at
which Jmeter sends requests to the front-end service.
We deployed Jmeter on a stand-alone virtual machine
with 16 vCPUs and 60 GB of memory running Ubuntu
18.04 as the operating system.
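The relation between the number of threads and the request rate in a closed workload can be illustrated with the interactive response time law, under which N threads that each wait for a response (R seconds) and then pause (Z seconds) before the next request produce a throughput of N / (R + Z). The sketch below is only an illustration of that general law; the function name and the numbers are hypothetical and are not measurements from the experiments.

```python
def closed_loop_throughput(threads: int, response_time_s: float,
                           think_time_s: float = 0.0) -> float:
    """Requests per second generated by `threads` sequential clients."""
    cycle = response_time_s + think_time_s
    if cycle <= 0:
        raise ValueError("response time plus think time must be positive")
    return threads / cycle


# Hypothetical numbers: doubling the threads doubles the offered rate as long
# as the response time does not change.
rate_100 = closed_loop_throughput(100, 0.25, 0.25)
rate_200 = closed_loop_throughput(200, 0.25, 0.25)
print(rate_100, rate_200)
```

In practice the response time grows as the application saturates, so the achieved rate rises more slowly than the thread count, which is one reason timely scaling affects the measured throughput.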
6.2 Results and Discussion
To compare the behaviour and effectiveness of the Waterfall
autoscaler with HPA, we applied the increasing workload
described in the previous section to the front-end service of
the Teastore application for 25 minutes. Figures 9-13 show the
average CPU utilization and replica count for each service in
the Teastore application throughout the experiment. The red
dashed line in the CPU utilization plots denotes the CPU
utilization threshold that both autoscalers use as the scaling
threshold. The green dashed line in each service’s replica count
plot shows the ideal replica count for that service at each
moment of the experiment. The ideal replica count is the minimum
number of replicas for the service that is enough to handle the
incoming load and keep the CPU utilization of the service below
the threshold. According to Figures 9-13, HPA scales a service
whenever the service’s average CPU utilization goes above the
scaling threshold. However, Waterfall scales a service in two
different situations: 1) the CPU utilization of the service goes
beyond the scaling threshold; 2) the predicted CPU utilization
for the service exceeds the threshold due to the scaling of
another service. Therefore, when Waterfall scales a service
while its CPU utilization is below the threshold, it must be due
to the predicted performance degradation of the service as a
result of the scaling of one or more other services.
Figure 9: The CPU utilization and number of replicas for
the Webui service.
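For contrast with the predictive behaviour described above, the reactive rule that HPA applies to each service in isolation can be sketched as follows. The formula is the one given in the Kubernetes documentation, desired = ceil(current replicas × current metric / target metric), with a default tolerance of 0.1 around the target; the minimum and maximum replica bounds that HPA also enforces are omitted, and the utilization values are hypothetical.

```python
import math


def hpa_desired_replicas(current_replicas: int, current_util: float,
                         target_util: float, tolerance: float = 0.1) -> int:
    """Replica count suggested by the documented HPA scaling rule."""
    ratio = current_util / target_util
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to the target: no action
    return math.ceil(current_replicas * ratio)


# Hypothetical readings against a 50% CPU utilization target.
scaled_out = hpa_desired_replicas(2, 90.0, 50.0)
unchanged = hpa_desired_replicas(3, 52.0, 50.0)
scaled_in = hpa_desired_replicas(4, 20.0, 50.0)
print(scaled_out, unchanged, scaled_in)
```

Because the rule only looks at the service's own current utilization, a downstream service is scaled only after the extra load has already reached it.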
As Figure 9 shows, for the Webui service, both autoscalers
increase the replica count when the CPU utilization is above the
threshold, with some delay compared to the ideal state.
According to Figure 8, as Webui is the front-end service and no
other internal services depend on it, scaling of other services
does not compromise the performance of the Webui service. Hence,
all of Waterfall’s scaling actions for the Webui service can be
attributed to CPU utilization.
As can be seen in Figure 10, Waterfall scales the Persistence
service around the 6th minute, although the CPU utilization is
below the threshold. We attribute this scaling action to the
decision to scale the Webui service in the same monitoring
interval, which leads to an increase in the CPU utilization of
the Persistence service because the Webui service depends on the
Persistence service. In contrast, as we can see in Figure 10,
HPA does not scale the Persistence service at the 6th minute.
Consequently, a short while after the 6th minute, when the
second replica of the Webui service completes the startup
process and is ready to accept traffic, the CPU utilization of
the Persistence service increases and goes above the threshold.
The other scaling action of Waterfall for the Persistence
service after the 15th minute is based on CPU utilization.
Figure 10: The CPU utilization and number of replicas for
the Persistence service.
Results for the Auth service shown in Figure 11 suggest that the
increase in the replica count of Auth around the 6th minute is
based on the prediction for the impact of scaling the Webui
service, as the CPU utilization of Auth is below the threshold
during this time. On the other hand, we can see that at the 6th
minute, HPA does not increase the replica count for the Auth
service. Therefore, after the second replica of Webui is added,
the CPU utilization of Auth reaches the threshold. The other
scaling action of Waterfall for Auth after the 20th minute is
based on the CPU utilization.
Figure 11: CPU utilization and number of replicas for Auth
service.
According to the Image service results in Figure 12, Waterfall
scales the Image service around the 11th minute. This scaling
action is due to scaling the Webui service, which depends on the
Image service, from two to three replicas in the same monitoring
interval. However, HPA does not scale the Image service
simultaneously with Webui, causing an increase in the CPU
utilization of the Image service. For Waterfall, as Figure 12
shows, there is a sudden increase in the CPU utilization of the
Image service right before the second replica of the Image
service is ready to accept traffic. This sudden increase is
caused by the difference between the times at which the Webui
and Image services complete the startup process and reach the
ready state. During the interval between these two incidents,
the Webui service has three replicas; therefore, its downstream
rate to the Image service increases while the second replica of
the Image service is not ready yet.
Figure 12: The CPU utilization and number of replicas for
the Image service.
Figure 13: The CPU utilization and number of replicas for
the Recommender service.
For the Recommender service, as Figure 13 illustrates, the CPU
utilization is below the threshold during the whole experiment.
Consequently, neither autoscaler takes any scaling action.
Putting the results of all services together, we can see that
the Waterfall autoscaler predicts the effect of scaling a
service on downstream services and scales them proactively in
one shot if necessary. Therefore, it takes fewer actions to
maintain the CPU utilization of the application below the
threshold. For example, around the 6th minute, we can see from
Figures 9, 10, and 11 that the Waterfall autoscaler scales the
Persistence and Auth services along with Webui in the same
monitoring interval. However, HPA scales these services
separately in different monitoring intervals.
To quantify the effectiveness of Waterfall com-
pared to HPA, we evaluate both autoscalers in terms
of several metrics. Figure 14 shows the total number
of transactions executed per second (TPS) for Water-
fall and HPA throughout the experiment. It can be
seen that Waterfall has a higher cumulative TPS than
HPA thanks to timely scaling of services.
Figure 14: Cumulative Transactions Per Second (TPS) of
Waterfall and HPA autoscalers.
Table 6: Comparison of Waterfall and HPA autoscalers in
terms of performance metrics.
Metric HPA Waterfall
Total Requests 727270.0 ± 12369.95 796867.4 ± 4594.77
TPS 484.55 ± 8.23 530.93 ± 3.06
Response Time 20.47 ± 0.36 18.67 ± 0.11
We repeated the same experiment five times and calculated the
average of the total number of served requests, TPS, and
response time for both autoscalers over these runs. Table 6
shows the results along with the 95% confidence interval. It can
be seen that TPS (and the total number of served requests) is
9.57% higher for Waterfall than for HPA. The response time for
Waterfall is also 8.79% lower than that of HPA.
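The reported percentages follow directly from the mean values in Table 6. The short Python check below recomputes them as relative changes with HPA as the baseline; the helper name is ours and only the means from Table 6 are used.

```python
def relative_change(baseline: float, value: float) -> float:
    """Percentage change of `value` relative to `baseline`."""
    return (value - baseline) / baseline * 100.0


# Mean values from Table 6 (HPA first, Waterfall second).
tps_change = relative_change(484.55, 530.93)
requests_change = relative_change(727270.0, 796867.4)
response_time_change = relative_change(20.47, 18.67)

print(f"TPS: {tps_change:+.2f}%")
print(f"Total requests: {requests_change:+.2f}%")
print(f"Response time: {response_time_change:+.2f}%")
```

The TPS and total request figures agree, as expected, because both runs have the same duration.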
Additionally, we have calculated the following metrics for both
autoscalers and presented them in Table 7:
• CPU>Threshold time: The percentage of time that the CPU
utilization of the service is above the threshold.
• Underprovision time: The percentage of time that the number
of service replicas is less than the ideal state.
• Overprovision time: The percentage of time that the number
of service replicas is more than the ideal state.
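These three metrics can be computed from monitoring traces as in the following Python sketch. It assumes equally spaced samples of CPU utilization, actual replica count, and ideal replica count for one service, so that a percentage of samples equals a percentage of time; the function name and the sample values are hypothetical.

```python
from typing import Dict, Sequence


def provisioning_metrics(cpu_util: Sequence[float], replicas: Sequence[int],
                         ideal_replicas: Sequence[int],
                         threshold: float) -> Dict[str, float]:
    """Percentage of samples above the CPU threshold, under- and overprovisioned."""
    n = len(cpu_util)
    if n == 0 or not (n == len(replicas) == len(ideal_replicas)):
        raise ValueError("traces must be non-empty and of equal length")

    def pct(count: int) -> float:
        return 100.0 * count / n

    pairs = list(zip(replicas, ideal_replicas))
    return {
        "cpu_above_threshold": pct(sum(u > threshold for u in cpu_util)),
        "underprovision": pct(sum(r < i for r, i in pairs)),
        "overprovision": pct(sum(r > i for r, i in pairs)),
    }


# Hypothetical four-sample trace for a single service.
metrics = provisioning_metrics(
    cpu_util=[40.0, 60.0, 70.0, 45.0],
    replicas=[1, 1, 2, 3],
    ideal_replicas=[1, 2, 2, 2],
    threshold=50.0,
)
print(metrics)
```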
It can be seen that for all services except the Recommender
service, both autoscalers have a nonzero value for CPU>T.
However, CPU>T is lower for Waterfall in each of these services.
Moreover, Waterfall yields a lower underprovision time for these
services and zero overprovision time for all services. Despite
the overprovisioning of HPA for two services, we observe that
Waterfall still provides a higher TPS and better response time;
we attribute this to the timely and effective scaling of
services by the Waterfall autoscaler.
Table 7: Comparison of Waterfall and HPA in terms of CPU>Threshold (T), overprovision, and underprovision time.
Service CPU>T Underprovision Overprovision
HPA Waterfall HPA Waterfall HPA Waterfall
Webui 31% 16% 54% 15.33% 0% 0%
Persistence 16% 4% 28.66% 7.33% 0% 0%
Auth 6.33% 0.33% 32% 8% 26% 0%
Image 13.33% 0.33% 28% 6% 24% 0%
Recommender 0% 0% 0% 0% 0% 0%
We introduced Waterfall, a machine learning-based autoscaler for
microservice applications. While numerous autoscalers consider
different microservices in an application independently of each
other, Waterfall takes into account that scaling a service might
have an impact on other services and can even shift the
bottleneck from the current service to downstream services.
Predicting this impact and taking the proper action in a timely
manner could improve the application performance, as we
corroborated in this study. Our evaluation results show the
efficacy and applicability of our approach. In future work, we
plan to explore the feasibility of adding vertical scaling to
the Waterfall autoscaling approach.
Abdullah, M., Iqbal, W., Mahmood, A., Bukhari, F., and
Erradi, A. (2020). Predictive autoscaling of microser-
vices hosted in fog microdata center. IEEE Systems
Amazon (2020a). Amazon ec2 spot instances. https://aws. Accessed: 2020-10-25.
Amazon (2020b). Aws auto scaling.
com/autoscaling/. Accessed: 2020-10-25.
Brun, Y., Serugendo, G. D. M., Gacek, C., Giese, H.,
Kienle, H., Litoiu, M., M¨
uller, H., Pezz`
e, M., and
Shaw, M. (2009). Engineering self-adaptive systems
through feedback loops. In Software engineering for
self-adaptive systems, pages 48–70. Springer.
Calc¸ado, P. (2014). Building products at soundcloud—part
i: Dealing with the monolith. Retrieved from:
https://developers. soundcloud. com/blog/building-
monolith. Accessed: 2020-10-25.
Chen, T. and Bahsoon, R. (2015). Self-adaptive trade-
off decision making for autoscaling cloud-based ser-
vices. IEEE Transactions on Services Computing,
Coulson, N. C., Sotiriadis, S., and Bessis, N. (2020).
Adaptive microservice scaling for elastic applications.
IEEE Internet of Things Journal, 7(5):4195–4202.
Dragoni, N., Giallorenzo, S., Lafuente, A. L., Mazzara,
M., Montesi, F., Mustafin, R., and Safina, L. (2017).
Microservices: yesterday, today, and tomorrow. In
Present and ulterior software engineering, pages 195–
216. Springer.
Fernandez, H., Pierre, G., and Kielmann, T. (2014). Au-
toscaling web applications in heterogeneous cloud in-
frastructures. In 2014 IEEE International Conference
on Cloud Engineering, pages 195–204. IEEE.
Gias, A. U., Casale, G., and Woodside, M. (2019).
Atom: Model-driven autoscaling for microservices.
In 2019 IEEE 39th International Conference on Dis-
tributed Computing Systems (ICDCS), pages 1994–
2004. IEEE.
Gotin, M., L¨
osch, F., Heinrich, R., and Reussner, R.
(2018). Investigating performance metrics for scaling
microservices in cloudiot-environments. In Proceed-
ings of the 2018 ACM/SPEC International Conference
on Performance Engineering, pages 157–167.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term
memory. Neural computation, 9(8):1735–1780.
Ihde, S. and Parikh, K. (2015). From a mono-
lith to microservices + rest: the evolution of
linkedin’s service architecture. Retrieved from:
microservices-urn/. Accessed: 2020-10-25.
Iqbal, W., Dailey, M. N., and Carrera, D. (2015). Unsuper-
vised learning of dynamic resource provisioning poli-
cies for cloud-hosted multitier web applications. IEEE
Systems Journal, 10(4):1435–1446.
Jindal, A., Podolskiy, V., and Gerndt, M. (2019). Perfor-
mance modeling for cloud microservice applications.
In Proceedings of the 2019 ACM/SPEC International
Conference on Performance Engineering, pages 25–
Kephart, J., Kephart, J., Chess, D., Boutilier, C., Das, R.,
Kephart, J. O., and Walsh, W. E. (2003). An architec-
tural blueprint for autonomic computing. IBM White
paper, pages 2–10.
Kephart, J. O. and Chess, D. M. (2003). The vision of auto-
nomic computing. Computer, 36(1):41–50.
Kubernetes (2020). Kubernetes hpa.
Accessed: 2020-10-25.
Kwan, A., Wong, J., Jacobsen, H.-A., and Muthusamy, V.
(2019). Hyscale: Hybrid and network scaling of dock-
erized microservices in cloud data centres. In 2019
IEEE 39th International Conference on Distributed
Computing Systems (ICDCS), pages 80–90. IEEE.
Lorido-Botran, T., Miguel-Alonso, J., and Lozano, J. A.
(2014). A review of auto-scaling techniques for elas-
tic applications in cloud environments. Journal of grid
computing, 12(4):559–592.
Ma, S.-P., Fan, C.-Y., Chuang, Y., Lee, W.-T., Lee, S.-J., and
Hsueh, N.-L. (2018). Using service dependency graph
to analyze and test microservices. In 2018 IEEE 42nd
Annual Computer Software and Applications Confer-
ence (COMPSAC), volume 2, pages 81–86. IEEE.
Mauro, T. (2015). Adopting microservices at net-
flix: Lessons for architectural design. Retrieved
from https://www. nginx. com/blog/microservices-at-
netflix-architectural-best-practices. Accessed: 2020-
Nadareishvili, I., Mitra, R., McLarty, M., and Amundsen,
M. (2016). Microservice architecture: aligning prin-
ciples, practices, and culture. ” O’Reilly Media, Inc.”.
Qu, C., Calheiros, R. N., and Buyya, R. (2018). Auto-
scaling web applications in clouds: A taxonomy and
survey. ACM Computing Surveys (CSUR), 51(4):1–
von Kistowski, J., Eismann, S., Schmitt, N., Bauer, A.,
Grohmann, J., and Kounev, S. (2018). Teastore: A
micro-service reference application for benchmark-
ing, modeling and resource management research. In
2018 IEEE 26th International Symposium on Mod-
eling, Analysis, and Simulation of Computer and
Telecommunication Systems (MASCOTS), pages 223–
236. IEEE.
Wajahat, M., Karve, A., Kochut, A., and Gandhi, A. (2019).
Mlscale: A machine learning based application-
agnostic autoscaler. Sustainable Computing: Infor-
matics and Systems, 22:287–299.