Adaptive Prediction Models for Data Center
Resources Utilization Estimation
Shuja-ur-Rehman Baig, Waheed Iqbal, Josep Lluis Berral, Abdelkarim Erradi, David Carrera
Abstract—Accurate estimation of data center resource utiliza-
tion is a challenging task due to multi-tenant co-hosted applic-
ations having dynamic and time-varying workloads. Accurate
estimation of future resources utilization helps in better job
scheduling, workload placement, capacity planning, proactive
auto-scaling, and load balancing. The inaccurate estimation leads
to either under or over-provisioning of data center resources.
Most existing estimation methods are based on a single model
that often does not appropriately estimate different workload
scenarios. To address these problems, we propose a novel method
to adaptively and automatically identify the most appropriate
model to accurately estimate data center resources utilization.
The proposed approach trains a classifier based on statistical
features of historical resources usage to decide the appropriate
prediction model to use for given resource utilization observations
collected during a specific time interval. We evaluated our
approach on real datasets and compared the results with multiple
baseline methods. The experimental evaluation shows that the
proposed approach outperforms the state-of-the-art approaches
and delivers 6% to 27% improved resource utilization estimation
accuracy compared to baseline methods.
Index Terms—Data center, Resource management, Data clas-
sification, Modeling and prediction, Dynamic prediction model,
Feature Extraction
I. INTRODUCTION
TECHNOLOGICAL advances in server virtualization and
cloud computing allow cost-effective hosting of multiple
applications in a secure, customizable, and isolated computing
environment managed by modern data centers. This yields higher resource utilization with reduced costs.
Additionally, cloud consumers can acquire compute, storage, and networking resources on-demand from Infrastructure-as-a-Service (IaaS) providers on a pay-per-use basis. IaaS users can control the leased resources and scale them to optimize their usage according to their needs. To ensure better Quality of
Service, IaaS providers distribute data center resources across
multiple geographical locations to enhance proximity to the
application users. Also, virtualization and holistic data center
management enable providers to maximize the data center
utilization while minimizing their operational cost [1].
Efficient methods for estimating resource utilization in data
centers can significantly ease self-management and usage op-
timization for both users and providers. Users can dynamically
S. Baig, J. Berral, and D. Carrera are with Barcelona Supercomputing
Center (BSC) and Universitat Politècnica de Catalunya (UPC), Barcelona
Spain. Emails: {shuja.baig, josep.berral, david.carrera}@bsc.es.
W.Iqbal is with Punjab University College of Information Technology, Uni-
versity of the Punjab, Lahore, Pakistan. Email: waheed.iqbal@pucit.edu.pk.
A. Erradi is with the Department of Computer Science and Engineer-
ing, College of Engineering, Qatar University, Doha, Qatar. Email: er-
radi@qu.edu.qa
adjust the leased resources to minimize costs for hosting their
applications while maintaining the desired performance and
service quality [2]. Further, accurate estimates of resource utilization enable the providers to efficiently allocate virtual machines (VMs) and other virtual resources to workloads, migrate VMs to consolidate or balance resource usage [3], [4], plan resource capacities in advance [5], [6], and anticipate the energy requirements of expected workloads and users [7], [8].
Accurate estimation of future resource utilization for data centers is challenging due to multi-tenant co-hosted applications having dynamic and time-varying workloads. While there are several estimation methods for cloud resource utilization using time-series learning or deep-learning networks [9]–[11], all use a single model that often does not accurately capture the workload dynamics. To address these problems, we propose a novel method to adaptively and automatically identify the most appropriate model to accurately estimate data center resource utilization. Our adaptive multi-method approach considers different scenarios encountered in a production data center and enables selecting the predictive method that learns best. Our approach focuses on training estimation models using different methods and then selecting the one that will yield the best prediction given the current scenario and the previous batch of collected data.
To test and validate our selective multi-method approach, we have conducted experiments using the Alibaba [12] and Bitbrains [13] data center utilization datasets. We compared the results of our experiments with existing baseline approaches that use a single model such as Linear Regression (LR), Support Vector Machines (SVM), Gradient Boosting Tree (GBT), and Gaussian Process, also known as Kriging (KR). Figure 1 shows the motivation to adaptively select an appropriate method to effectively estimate resource utilization for different scenarios. The figure shows CPU utilization estimations using different methods for four different machines from the Alibaba data set. Each estimation method is trained on a 55-minute interval of data and estimates the utilization for the next 5 minutes. We observed that different predictors yield better resource utilization estimation for different scenarios. Therefore, it adds value to build a system that identifies the best predictor for forecasting resource utilization at each time interval.
In this paper, we focus on a classical machine learning approach and do not use deep learning, as its learning process is considered a black box [14] and the reasoning behind the model's prediction behavior is not apparent. Deep learning works well with a large amount of high-dimensional data, and it is also computationally expensive, often requiring specialized hardware.
Figure 1: CPU estimation using different methods and scenarios for the Alibaba data set. Different predictors yield better estimation, each for different scenarios.
The resource utilization of data centers is low-dimensional data,
and traditional machine learning methods can be effectively used for estimation. Moreover, deep learning performs quite well once trained for a particular problem; however, the model often fails when applied to other, similar problems and must be retrained. For these reasons, we selected a traditional machine learning approach and propose a novel adaptive model selector method to dynamically identify the best prediction method for estimating the resource utilization of data centers, from a bag of trained methods with different characteristics and accuracy over different data center behaviors. Data center telemetry exhibits bursty behavior, with sudden spikes and peaks of resource utilization. In general, it is challenging to predict this bursty behavior, and we address this issue through the adaptive selection of an appropriate prediction method at every estimation step. After some experiments and model selection, we chose Random Decision Forests (RDF) as the best mechanism for learning the expected accuracy of each candidate predictor.
Our proposed method trains on the statistical features of
historical resource utilization and predictor correctness for
sliding windows of a specific size, to identify which predictor
will produce the best forecast given the current resource utiliz-
ation. We evaluate our method by comparing its decision and
forecasting capabilities with baseline methods, using datasets
from Alibaba and Bitbrains monitored data centers. Results
show that the proposed method outperforms the baseline
methods for both of the datasets. Notice that in this work we
focus on CPU resource consumption as the primary resource
on high-performance computing data centers, but our solution
is generic and can be used to predict utilization of all system
resources. The main contributions of this work are:
• A novel method to dynamically select the best prediction model for estimating and forecasting cloud resource utilization for a given recent time window of observed resource utilization.
• The use of RDF for choosing an appropriate prediction model to be used for estimating data center resource utilization.
• A comparison of different baseline models, currently used in the state of the art, as candidate models for resource utilization estimation, alongside validation of the presented approach.
• An analysis of the impact of different window sizes on the proposed resource estimation system.
The rest of the paper is organized as follows. Related work
is presented in Section II. Our proposed resource estimation
system is explained in Section III. Prediction methods and
Adaptive Model Selector (AMS) are explained in Section IV.
Feature extraction and selection are discussed in Section V. We provide details about the experimental evaluation in Section VI. The experimental results are presented in Section VII. Finally, the conclusion and future work are discussed in Section VIII.
II. RELATED WORK
Data center resource utilization and workload prediction
is an active research area. Recently, there have been several
attempts to use machine learning methods for predictions of
data center resources. For example, recent work by Kim et
al. [15] proposed an ensemble approach which uses mul-
tiple predictors together to produce an output. The proposed
ensemble technique uses Linear Regression, SVM, ARMA,
and ARIMA together to predict future workload for the
data centers by dynamically determining the weight of each
predictor using the regression method. Another recent work
by Rahmanian et al. [16] also proposed an ensemble-based
approach to predict the CPU utilization of applications running on
VMs. The proposed approach uses automata theory to adjust
the weight for each predictor in the ensemble method to predict
the CPU usage. Subirats et al. [17] proposed an ensemble-
based prediction strategy which forecasts the infrastructure
energy requirement by predicting the future CPU utilization
of VMs. Their ensemble-based approach uses the moving
average, exponential smoothing, linear regression, and double
exponential smoothing methods. Chen et al. [18] propose
an ensemble model based on the fuzzy neural network to
predict the resource demand. They use the second moving
average (SMA), exponential moving method (EMA), autore-
gression model (ARM), and trend seasonality model (TSM)
as base predictors. Cetinski et al. [19] combine statistical
and machine learning methods to predict application specific
workload volume. Tseng et al. [20] used a multi-objective
genetic algorithm to forecast resource utilization and energy
consumption in data centers. Jiang et al. [21] proposed an ensemble prediction mechanism to predict cloud workloads for capacity planning in data centers. They used five prediction algorithms, namely moving average, autoregression, artificial neural network, support vector machine, and gene expression programming, to produce future workload estimates.
There have been several efforts to use typical time series solutions to predict data center resource utilization. For example, Rodrigo et al. [22] used the autoregressive integrated
moving average (ARIMA) method to predict the arrival rate
for the applications hosted on the cloud. Liao et al. [23] use
typical time series prediction methods namely autoregressive
moving-average, moving average, and auto-regressive together
as an ensemble approach to predict CPU usage of VMs. The
proposed method combines the output of time series prediction
techniques as input to another linear prediction model to
predict CPU utilization of VMs. Vazquez et al. [24] used
various time series prediction models to forecast the number of
requests which helps in the dynamic scaling of cloud resources
proactively. For this purpose, they evaluated the autoregressive
model (AR), moving average model (MA), simple exponential
smoothing, double exponential smoothing, automated ARIMA
method, and neural network autoregression method. Dmytro
et al. [25] use ARIMA to forecast load on the cluster which
helps in scheduling the data center resources by migrating the
VMs. Fang et al. [26] used ARIMA to predict the future CPU
utilization and the number of requests for the applications hosted in the cloud.
There have been several efforts to employ deep learning
methods for predicting data center resource utilization. For
example, Zhang et al. [9] use autoencoders to predict the CPU
utilization of VMs. The authors used tensor rank decompos-
ition technique to reduce the training time by compressing
the input parameters. Feng Qiu [27] used a deep belief
network using multiple-layered restricted Boltzmann machines
(RBMs) and a regression layer to predict the CPU usage
of VMs. The RBMs are used to extract high-level features,
and the regression layer is used to predict CPU utilization.
Zhang et al. [11] also use RBMs to predict CPU and RAM
utilization in data centers. They use backpropagation as global
supervised learning to minimize the loss function. Mason [10]
predict the CPU consumption of the host by using evolu-
tionary Neural Networks (NN). To train the network weights
of neural networks, they used Particle Swarm Optimization
(PSO), Differential Evolution (DE), and Covariance Matrix
Adaptation Evolutionary Strategy (CMA-ES). Song et al. [28]
use a long short-term memory (LSTM) model to predict the
host load. To train the recurrent networks, the authors used
truncated back-propagation through time technique. Duggan
et al. [29] predict host CPU utilization by using Recurrent
Neural networks. They also use the back-propagation through
time (BPTT) technique to train the network.
Recurrent Neural Networks (RNNs) are a hot topic in many modeling scenarios, including resource management for data centers. However, RNNs imply a set of trade-offs to take into consideration in scenarios where data streams must be constantly modeled or evaluated. Moreover, the selection of hyper-parameters for RNNs and their different variants (LSTMs, GRUs, etc.) implies extra decisions to be searched and tuned, while simpler methods can provide similar accuracy (or lower but good-enough accuracy) with less computational and human-tuning effort. One problem of time-series algorithms like RNNs, time-series related NNs like CRBMs, or filters like period-adaptive Kalman filters is that most rely on a delay or memory hyper-parameter in their design, in a scenario where behavior regimes and their lengths may not be known a priori. In addition, interpretability is a requirement for knowledge discovery, to help data center architects improve the DC infrastructure. In conclusion, advances have been made on RNNs for DC management, but in this work we advocate for simpler (in terms of training and operability) and more readable models.
The work in this area most relevant to ours [30] adaptively picks either Linear Regression (LR) or Support Vector Machine (SVM) predictors to estimate the CPU utilization of VMs. The proposed method dynamically selects LR for slow-changing workloads and SVM for rapidly changing workloads. Moreover, most of the existing works use ensemble-based approaches in which multiple estimation methods are collectively used to produce the final output, whereas in our proposed solution the final output is produced using only a single machine learning predictor, which is dynamically identified using the recent resource utilization observations. Our approach uses four different estimators and dynamically identifies the estimator to use through a machine learning approach and time series features. To the best of our knowledge, there is no existing work that uses time series features to adaptively identify and use the best prediction method to minimize the estimation error of cloud resource utilization. Table I presents the comparison and explains how the proposed solution differs from existing state-of-the-art work.
III. PROPOSED SYSTEM OVERVIEW
The overall proposed system is illustrated in Figure 2. Dif-
ferent steps are numbered and labeled to explain the working
flow of the system. The system works in the following steps:
• Historical resource utilization logs of the data center are divided into sliding windows of a fixed size consisting of the last k intervals. Each sliding window is then used to fit different prediction models, including Linear Regression (LR), Support Vector Machine (SVM), Kriging (KR), and Gradient Boosting Tree (GBT), to predict the next interval's resource utilization. The system selects the prediction method that yields the minimum prediction error for the given sliding window data (see the sketch after this list).
• For each sliding window, the system extracts a specific set of features, as explained in Section V.
• The selected features and identified prediction methods are logged as training data. For each historical sliding window, the training data set contains the corresponding feature vector and the best prediction method.
• Once the training data is prepared, the system builds a classifier using a Random Decision Forest (RDF) to predict the best model for a given sliding window. We call this classifier the “Adaptive Model Selector” and explain it in Section IV-B.
• Once the Adaptive Model Selector is trained, the system predicts the data center resource utilization in real time. For the current time interval t, the system selects the last k observations to extract features and then uses the Adaptive Model Selector to identify the best prediction method for the t+1 time interval.
• The selected prediction method is used to train a regression model on the last k intervals' observed resource usage data to estimate the resource utilization for the t+1 future time interval.
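To make the workflow above concrete, the following Python sketch outlines how the AMS training set could be assembled from the historical windows. It is a minimal sketch under our own assumptions: the helper names, the use of a plain time index as the regression input, and the default hyper-parameters are illustrative choices, not the authors' exact implementation.

```python
# Minimal sketch of the AMS training pipeline (assumed names and encoding).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import GradientBoostingRegressor, RandomForestClassifier
from sklearn.gaussian_process import GaussianProcessRegressor

# The four candidate prediction methods (LR, SVM, KR, GBT).
CANDIDATES = {
    "LR": LinearRegression,
    "SVM": SVR,
    "KR": GaussianProcessRegressor,
    "GBT": GradientBoostingRegressor,
}

def best_method_for_window(window, next_value):
    """Fit every candidate on the window and return the name of the one
    with the smallest absolute error when predicting the next interval."""
    # Simplification: the window is encoded with a plain time index; the
    # paper derives regression inputs from the window's statistical features.
    X = np.arange(len(window)).reshape(-1, 1)
    errors = {}
    for name, cls in CANDIDATES.items():
        model = cls().fit(X, window)
        prediction = model.predict([[len(window)]])[0]
        errors[name] = abs(prediction - next_value)
    return min(errors, key=errors.get)

def build_ams_training_set(series, k, feature_fn):
    """Slide a window of k intervals over the series and label each window
    with the best candidate regressor for the interval that follows it."""
    features, labels = [], []
    for t in range(len(series) - k):
        window = series[t:t + k]
        features.append(feature_fn(window))          # e.g. TSFRESH statistics
        labels.append(best_method_for_window(window, series[t + k]))
    return np.array(features), np.array(labels)

# Training the Adaptive Model Selector on the labelled windows:
# ams = RandomForestClassifier(n_estimators=50).fit(features, labels)
```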
Table I: Comparison of related work with the proposed solution.
Reference | Observed Metric | Feature-Based Classification | Adaptive Predictor Selection | Ensemble | Methodology
Kim [15] | job arrival rate/time | No | No | Yes | Four predictors collectively determine the output of future workload by assigning a weight to each predictor using multiclass regression.
Rahmanian [16] | cpu | No | No | Yes | Predicts the future by combining the prediction values of all constituent models and uses automata theory to adjust the weight of each predictor.
Subirats [17] | cpu | No | No | Yes | Selects the forecast provided by the predictor with the smallest mean average absolute error among four predictors (MA, ES, LR, DES), calculated for every time interval.
Chen [18] | network traffic | No | No | Yes | Uses fuzzy neural networks to predict the future value, taking input from multiple base predictors as well as the raw input.
Tseng [20] | cpu, memory, energy | No | No | No | Uses a multi-objective genetic algorithm to forecast resource utilization and energy consumption.
Jiang [21] | vm request | No | No | Yes | Uses five prediction methods to predict the future value; the relative error between actual and predicted values is used to update the weight of each base predictor.
Rodrigo [22] | arrival rate | No | No | No | Uses ARIMA to predict the future data point.
Liao [23] | cpu | No | No | Yes | Uses the predictions of multiple time series predictors as input to another linear prediction model to predict the final future value.
Vazquez [24] | number of requests | No | No | No | Uses various time series prediction models independently to forecast the number of requests and compares the results.
Dmytro [25] | vm migration | No | No | No | Uses ARIMA to predict the future data point.
Liu [30] | cpu | No | Yes | No | Uses either LR or SVM according to slow- or fast-changing workload.
Zhang [9] | cpu | No | No | No | Uses autoencoders to predict the future CPU utilization of VMs.
Alex [31] | - | Yes | No | No | Only provides time series classification based on a feature vector.
Fulcher [32] | - | Yes | No | No | Only provides time series classification based on a feature vector.
Proposed | cpu | Yes | Yes | No | Forecasts the future resource utilization using the best predictor obtained from the classifier trained on time series features. The best predictor is identified adaptively at every estimation time step.
Figure 2: Proposed system overview, covering the adaptive prediction method learning phase and the resource utilization prediction phase, used to learn the adaptive model selector and to estimate the data center resource utilization.
IV. MACHINE LEARNING METHODS
In this work we use Machine Learning (ML) techniques for
two main purposes: first, predict future workload behaviors
and traces; second, from a set of ML methods and a context,
choose one that predicts the workload better. The ensemble
presented here focuses on different algorithms for regression
used to predict the workload, while a trained decision maker
selects at each time a regression model that is expected to
produce the most accurate prediction.
In this section, we introduce the different algorithms used
for the prediction and decision-making processes.
A. Workload Prediction Methods
To predict workload, we explore a diversity of Machine Learning techniques commonly used in the literature, some more complex than others and each with different properties. The learned regression models predict our target variable, the next data point in the time series, from known input features [33] such as skewness, standard deviation, kurtosis, autocorrelation for different lags, the absolute sum of changes, etc. As the data we are dealing with is in the form of a time series, the evaluation of predictions must be based not just on accuracy but also on the significance of the results, which is often a difficult problem in regression analysis.
Our presented methodology shows a multi-model approach,
where different models are trained, each one with a different
set of strong and weak properties. The models are applied to
a dynamic window to predict future interval workloads. The
studied models for workload prediction are: Linear Regression,
Support Vector Machines for regression, Gradient Boosting,
and Gaussian Process Regression.
Linear Regression: Linear Regression (LR) is one of the simplest but most effective approaches in machine learning modeling and prediction, specifically when a linear relation exists among variables. LR assumes there is a linear relation between the output variable $Y$ and the input variables $X = \{x_1 \ldots x_n\}$, and attempts to find a vector $W^T = \{w_1 \ldots w_n\}$ and a scalar $b$ such that $\tilde{Y} = X \cdot W + b$ while minimizing the error $\epsilon = |Y - \tilde{Y}|$. Minimization is usually performed using the Least Squares Error approach, although other approaches using the deviation or a specific cost function exist. LR variants include Polynomial and Multinomial Regression, where variable relations are assumed to be more complex, so the learning algorithms also become more complex.
Support Vector Regression: Support Vector Machine (SVM) methods are common for classification, although they can also be used for regression as Support Vector Regression machines (SVR) [34]. The advantage of SVMs is that non-linear functions can be learned as linear ones thanks to a transformation of the data known as the kernel trick. SVMs allow learning non-linear functions by mapping them into a higher-dimensional feature space, using a defined kernel function. The input $X$ is mapped into an $h$-dimensional feature space using a predefined non-linear kernel function to produce a linear model. Similar to LR, we can express SVMs as $\tilde{Y} = k(X) \cdot W + b$, where $k$ is the function making the space for $X$ linear. SVM error minimization consists of building two margin functions (support vectors) $X \cdot W + b \pm \epsilon$, where the final error $\xi$ is computed for those elements outside the margins. As a disadvantage, the margin $\epsilon$ can become a hyper-parameter.
Gradient Boosting: Gradient Boosting is the combination of Gradient Descent optimization and Boosting techniques [35], [36]. As with any other boosting technique, the learned model is the composition of weaker models focusing on subsets of data, forming a stronger model when combined. Usually, decision and regression trees are used in Gradient Boosting techniques, but any other modeling technique can be used for boosting. In Gradient Boosting, a model is fitted as $\tilde{Y} = f(X)$, minimizing $\epsilon = |Y - \tilde{Y}|$. Then the function $f$ can be fine-tuned using another function $h$ fitted to $\epsilon$, learning and correcting the errors of the first function, and so on recursively. This recursion can continue until we are satisfied with the resulting aggregation of models.
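As a toy illustration of that residual-correction recursion, the sketch below performs only two stages and uses shallow regression trees as the weak learner; the tree depth and learning rate are arbitrary choices of ours, not values from the paper.

```python
from sklearn.tree import DecisionTreeRegressor

def two_stage_boost(X, y, X_new, learning_rate=0.1):
    # First weak model f approximates y; second weak model h fits the
    # residual errors of f, so their sum corrects the first approximation.
    f = DecisionTreeRegressor(max_depth=2).fit(X, y)
    residuals = y - f.predict(X)
    h = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    return f.predict(X_new) + learning_rate * h.predict(X_new)
```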
Gaussian Process Regression: Gaussian Process Regres-
sion (also known as Kriging) [37] is a non-parametric regres-
sion method, where the modeled function is trained after a
Gaussian process using the covariances of previous examples.
This process is used mainly for interpolation which requires
some example observation points. Kriging method predicts by
computing the weighted average of the values for neighbors
from the known examples. Kriging models can model non-
linear as well as linear behavior. Typical regression meth-
ods are extended by statistical models based on stochastic
processes. However, Kriging also estimates the associated
statistical variations using the distribution and correlation of
observed data. Recently, Kriging has been used for self-adaptive
provisioning of resources in cloud-hosted applications [38].
B. Adaptive Model Selector (AMS)
On multi-model methodologies, different regression models
produce predictions altogether, and a trained expert system
decides which prediction is followed, or how they are ag-
gregated into a final prediction. Such a trained expert can be
a machine learning model, like in Boosting methods. In our
proposed solution, before producing workload predictions, we
use a trained decision maker to choose the best predictor to
be used. The decision maker will classify each scenario into
the best-expected predictor for it.
Our decision maker input will be features [33] such as
skewness, standard deviation, kurtosis, autocorrelation for
different lags, the absolute sum of changes, etc., and it will
output the regression method which is expected to be the
best. At each time step, the decision maker predicts the best
regression model and then produces the workload prediction
using the predicted regression model. Here we present the
different classification models studied in this work.
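For illustration, the per-interval decision loop could then look like the following sketch. It assumes a trained `ams` classifier, the `CANDIDATES` mapping and simplified window encoding from the sketch in Section III, and a `feature_fn` that computes the window's statistical features; all of these names are ours.

```python
import numpy as np

def predict_next_interval(window, ams, feature_fn):
    """Ask the decision maker which regressor to trust for this window,
    fit that regressor on the window, and forecast the next interval."""
    feats = np.asarray(feature_fn(window)).reshape(1, -1)
    chosen = ams.predict(feats)[0]                 # e.g. "GBT"
    X = np.arange(len(window)).reshape(-1, 1)      # simplified encoding
    model = CANDIDATES[chosen]().fit(X, window)
    return chosen, model.predict([[len(window)]])[0]
```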
K-Nearest Neighbors: The k-Nearest Neighbors (k-NN)
algorithm memorizes a set of characteristic examples and classifies new data instances by finding the $k$ nearest neighbors and returning the class of the majority (or the probabilities per class in those $k$ examples). The nearest neighbors are those examples with the minimum distance, often the Euclidean, Hamming, or Manhattan distance. Here we select k-NN as one of the tentative classifiers, as it is one of the easiest models to train (it memorizes the training set), in exchange for a not-so-easy search process when predicting a new data instance.
Naïve Bayes: The Naïve Bayes algorithm is a classifier
based on computing the likelihood of a feature given each class, then using the Bayes theorem to compute the conditional probability of a class given that feature. The method extracts from the data the probability of each feature value $P(Feature = X)$, of each class $P(Class = C)$, and the likelihood of each feature per class $P(Feature = X \mid Class = C)$. This method assumes independence among features, in contrast to Bayesian Networks. The probabilities per class are the product of their probabilities per feature, and the algorithm returns the class with the highest probability (or the ranking of classes by probability).
We selected Naïve Bayes as one of the classifiers for its low
complexity, as training implies keeping the count of element
occurrences, then probabilities can be computed on demand at
prediction time.
Multilayer Perceptron: Multilayer Perceptron (MLP) is
a kind of Artificial Neural Network (ANN) used for both
classification and regression problems for non-linear systems.
The most commonly used ANN for classification problems is
“one-hidden layer” Feed-Forward ANN, where the ANN is
composed of a single layer of perceptrons (neuron units) and
an output layer.
Data passes through the hidden layer to the output, producing a value for each class; the class with the highest value is then chosen. Neurons aggregate input data, usually through a linear function $X_o = X_i \cdot W + bias$, then pass the outputs $X_o$ to the next layer (here the output layer). Output neurons also pass their produced aggregation through sigmoid functions to approximate their outputs to 0 or 1, $Y = sigm(X_o)$. Fitting those functions is done by passing the data repeatedly and comparing the network output with the real output, then updating the neuron weights $W$ and biases using Gradient Descent techniques.
Neural networks can be complicated to fine-tune, as their architecture must be treated as a hyper-parameter, deciding how many neuron units are in the hidden layer, how many times the data must be passed for training, etc. We used the Keras [39] sequential model to implement the MLP for classification. We evaluated the MLP with different numbers of hidden layers and found that three hidden layers yield the best results. Since we are using four machine learning predictors for evaluation purposes, the output layer contains 4 neurons. We use "relu" as the activation function for the hidden layers and "softmax" for the output layer. We use "adam" as the optimizer and "categorical crossentropy" as the loss function. We train for 1000 epochs with a batch size of 2000.
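The configuration described above maps to a Keras definition along the following lines; the hidden-layer width is our assumption, since it is not reported in the text.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

def build_mlp_classifier(num_features, num_classes=4, hidden_units=64):
    # Three relu hidden layers and a softmax output over the four predictors;
    # hidden_units is an assumed width.
    model = Sequential([
        Dense(hidden_units, activation="relu", input_shape=(num_features,)),
        Dense(hidden_units, activation="relu"),
        Dense(hidden_units, activation="relu"),
        Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# model = build_mlp_classifier(num_features=106)
# model.fit(X_train, y_train_onehot, epochs=1000, batch_size=2000)
```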
Random Decision Forest: Random Decision Forests (usually referred to as Random Forests) are an ensemble method for classification and regression, based on the aggregation of specialized decision trees [40]. The ensemble builds a set of decision trees, trained from different data subsets; predicted data is then classified as the most voted class across all decision trees (the trend). The main reason to use random forests is to prevent the over-fitting of single decision tree models and to get a more accurate and stable prediction.
Random Forests are known to produce decent results for
classification and regression problems, without the need for
much tuning or hyper-parameters. For our experiment, we tune
the number of trees and set it to 50.
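In scikit-learn terms (our choice of implementation; the paper does not name the library used for the RDF), the selector reduces to:

```python
from sklearn.ensemble import RandomForestClassifier

# 50 trees, as tuned above; remaining hyper-parameters left at their defaults.
ams = RandomForestClassifier(n_estimators=50)
ams.fit(X_train, y_train)          # feature vectors -> best-method labels
chosen_methods = ams.predict(X_test)
```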
Gradient Boosting: Gradient Boosting can also be used
for both regression and classification problems. For methods
where the boosted algorithm is already a classification prob-
lem, the most voted class from all partial models is selected.
For boosted regression algorithms, we can turn the outputs into binary values using approaches similar to those in SVMs, considering each value as a class and its value between $-1$ and $1$ as its score.
V. FEATURE EXTRACTION AND SELECTION
Appropriate features can play an important role in improving the prediction accuracy of machine learning models. In our data set, the resource utilization of data centers is available as time series data. We explored multiple ways to extract time series features from the given data set, including manual extraction and automatic extraction with the help of open-source libraries such as Cesium [42] and TSFRESH [41]. We selected TSFRESH as it provides the most useful and comprehensive set of time series features, which is not available in any other library. Time Series Feature extraction based on scalable hypothesis tests (TSFRESH) [41], [43] is an open-source Python library for extracting features from given time series data. In our proposed system, we used TSFRESH to extract features from the data center resource utilization data available as time series. TSFRESH automatically calculates a large number of time series characteristics based on scalable hypothesis tests.
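A minimal TSFRESH call for windowed CPU data could look like the following; the DataFrame layout (one id per window, a time column for ordering) is our assumption about how the windows are organized.

```python
import pandas as pd
from tsfresh import extract_features

# Two toy windows of three CPU observations each; 'id' groups a window
# and 'time' orders the observations inside it.
df = pd.DataFrame({
    "id":   [0, 0, 0, 1, 1, 1],
    "time": [1, 2, 3, 1, 2, 3],
    "cpu":  [26.7, 15.5, 18.0, 12.0, 19.3, 20.0],
})
features = extract_features(df, column_id="id", column_sort="time")
# 'features' holds one row per window with hundreds of candidate features.
```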
Figure 3: Example of time series features extracted with the TSFRESH [41] library. These features consist of statistical and time series features such as minimum, maximum, variance, standard deviation, number of peaks, autocorrelation at different lag intervals, entropy, kurtosis, skewness, Fourier transformation, Mexican hat wavelet transformation, etc.
Figure 3 shows some of the features that TSFRESH extracts for the given time series data. It provides hundreds of statistical and time series features, including minimum, maximum, variance, mean, standard deviation, sum of values, autocorrelation at the specified lags, a measure of non-linearity in the time series, the Mexican hat wavelet, the first and last locations of the minimum and maximum, the number of peaks, quantiles, and sample entropy. However, not all of these features are necessary, and appropriate features should be identified to improve the performance of the machine learning methods [44], [45].
The proposed system filters the features obtained from TSFRESH using another open-source library for feature selection [46]. We selected this library because it includes a comprehensive set of functions for filtering features using different approaches to identify the most appropriate features for time series classification. The library provides five different methods to filter features: missing values, single unique values, collinear features, zero importance features, and low importance features. However, in our proposed system, we only used three of these methods to filter the features obtained through TSFRESH. First, we apply the single unique value method, which removes the features whose values are identical across all samples. Second, we apply the identify collinear method, which removes features that are highly correlated with one another; we used a 98% correlation threshold to ensure that only features correlated at 98% or above are removed. Finally, we apply the zero importance method, which uses a Gradient Boosting Machine (GBM) learning model to identify the features that have zero importance for the given set of features. After applying these methods, we obtain one hundred and six features in total, which include standard deviation, kurtosis, skewness, the absolute sum of changes, autocorrelation at different lags, partial autocorrelation at different lags, the first location of the minimum, linear least-squares regression [47], and many others.
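The paper applies these filters through the external feature-selection library [46]; the sketch below reproduces the same three steps directly with pandas and scikit-learn as an approximation, not the library's exact API.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

def filter_features(features: pd.DataFrame, labels, corr_threshold=0.98):
    # 1) Remove features with a single unique value.
    features = features.loc[:, features.nunique() > 1]

    # 2) Remove one of every pair of features correlated above the threshold.
    corr = features.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    drop = [c for c in upper.columns if (upper[c] > corr_threshold).any()]
    features = features.drop(columns=drop)

    # 3) Remove features reported as zero-importance by a GBM classifier.
    gbm = GradientBoostingClassifier().fit(features, labels)
    return features.loc[:, gbm.feature_importances_ > 0]
```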
Figure 4: Box plot of CPU utilization for randomly selected 100 machines from Alibaba data set.
Figure 5: Box plot for Bitbrain data set of 20 randomly
selected VMs for one-day data.
Figure 6: Box plot for CPU utilization of selected four
machines with different characteristics from the Alibaba data
set. M1=high load, M2=low load, M3=high variation, and
M4=low variation.
VI. EXPERIMENTAL EVALUATION
In this section, we explain the datasets used to evaluate our proposed method, the details of the experiments used to validate it, the baseline methods used for comparison, and the evaluation metrics used.
A. Datasets
1) Alibaba data set: The first data set we use is the publicly available Alibaba cluster log [12], containing performance traces of 1,313 machines over a duration of 12 hours. The Alibaba monitored cluster provides interactive services and batch processing workloads. The metrics represented are CPU, memory, and disk utilization for all machines, aggregated as 5-minute averages. For simplification purposes, we focus our experiments on the CPU time series. The average CPU utilization in the Alibaba data set is 26.46%, with a standard deviation of 10.66% CPU. Figure 4 shows a CPU utilization sample for 100 randomly selected machines from the data set.
2) BitBrains data set: The second data set we use is the publicly accessible Bitbrains data set [13], containing performance logs of 1,750 VMs over 30 days. The Bitbrains monitored cluster provides interactive services and batch processing workloads. The metrics represented are CPU, memory, network, and disk utilization for all the virtual machines, aggregated as 5-minute averages. From this data set, we randomly selected 20 VMs with average CPU utilization greater than 30%, as most of the VMs with low usage do not show critical metric patterns or their utilization tends to be constant at the low end of the demand spectrum. Figure 5 shows the box plot of one day of data of the average CPU utilization for the selected machines.
3) Google data set: The Google cluster traces [48] are publicly available traces published by Google. To create the CPU and memory utilization, the tasks of each job were aggregated by summing their CPU and memory consumption every five minutes over a period of 24 hours. The dataset was extracted over the first ten-day period by filtering the CPU and memory utilization between 5 and 90 percent, resulting in a total of 1,600 VMs [49]. We randomly selected 500 VMs from this data set for the experiments; the average CPU utilization in the selected data set is 21.89%, with a standard deviation of 3.63% CPU.
B. Methodology
For the current experiments, we use the Alibaba data set to show a comprehensive evaluation of the proposed solution, whereas the BitBrains data set is used for testing, to show that the proposed methodology does not over-fit to the main data set (Alibaba).
To train and validate the machine learning models in the AMS classifier, we use a random split of 80% of the data for training and the remaining 20% for validating the models. The test data also includes the four machines discussed in the following paragraph.
As applications running on data-centers can have different
profiles, we selected four machines from the Alibaba data
set with very distinct CPU demands, to test the resource
estimation on different demand behaviors. Figure 6 shows the
box-plot of CPU utilization of the four selected machines: Ma-
chine M1 serves a workload demanding high CPU resources;
machine M2 serves a workload requiring low CPU resources;
machine M3 serves a workload requiring CPU resources with
highly fluctuating demand; and finally machine M4 serves
a workload requiring CPU resources with low fluctuating
demand.
Our proposed solution is then compared with the aforementioned baseline method proposed by Liu et al. [30], which uses Linear Regression (LR) and Support Vector Machine (SVM) methods to adaptively estimate the CPU utilization of VMs. The combination of the two methods, LR for slow-changing workloads and SVM for fast-changing ones, is here labeled the “Liu” method. In addition, we also include LR, SVM, Kriging (KR), and Gradient Boosting Tree (GBT) individually for comparison with our approach.
C. Experimental Details
1) Adaptive Model Selector Evaluation: The Adaptive Model Selector (AMS) is in charge of estimating which of the available ML algorithms will provide better modeling for the current data being monitored. We performed a set of experiments to evaluate different methods for making this estimation by comparing different classifiers, namely Random Decision Forest (RDF), Gradient Boosting Tree (GBT), Multi-layer Perceptron (MLP), K-Nearest Neighbors (k-NN), Gaussian Naive Bayes (NB), and Support Vector Machine (SVM) with a linear kernel. These classifiers are trained and validated using the Alibaba data set. We trained all classifiers on 80% of the entire Alibaba data set and then tested them on the remaining 20% to compare and identify the best classifier to use in the AMS.
Training and validation data will be structured in time
windows, as explained in Section VI-C3. The classifiers are
evaluated through True/False Positive Rates (TPR and FPR),
accuracy, recall, f-measure, and precision. We also consider
the performance of the AMS by measuring the training time,
prediction time, and the size of the model on disk.
Table II: AMS evaluation results using different classifiers for the Alibaba data set.
Classifier | TPR | FPR | TNR | FNR | Precision | Recall | F-measure | Accuracy
KNN | 0.62 | 0.11 | 0.88 | 0.37 | 0.65 | 0.65 | 0.65 | 0.65
MLP | 0.64 | 0.11 | 0.88 | 0.35 | 0.66 | 0.67 | 0.66 | 0.67
NB | 0.33 | 0.22 | 0.77 | 0.66 | 0.38 | 0.31 | 0.29 | 0.31
RDF | 0.65 | 0.10 | 0.89 | 0.34 | 0.68 | 0.68 | 0.68 | 0.68
GBT | 0.48 | 0.16 | 0.83 | 0.51 | 0.55 | 0.53 | 0.51 | 0.53
2) Resource Estimation Evaluation: Finally, when integrat-
ing the different techniques of model selection and resource
modeling, we perform a set of experiments to evaluate the
resource estimation using our proposed adaptive ensemble.
The final goal is to identify, on-line, the best regression
method that will build a prediction model for estimating the
resource utilization of the next future interval, given the current
monitored data.
As this is a regression problem, we evaluate the complete mechanism using the Root Mean Square Error (RMSE), as shown in Equation 1, to show how our method deviates from the truth, and the Mean Absolute Error (MAE), as shown in Equation 2, to show the absolute magnitude of the produced error. Here $a_t$ is the true CPU utilization, $p_t$ is the estimated CPU utilization at time interval $t$, and $n$ is the number of performed estimations.

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{t=1}^{n} (a_t - p_t)^2}{n}} \qquad (1)$$

$$\mathrm{MAE} = \frac{1}{n} \sum_{t=1}^{n} |a_t - p_t| \qquad (2)$$
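For reference, the two metrics computed directly with NumPy over the actual values $a_t$ and the estimates $p_t$:

```python
import numpy as np

def rmse(actual, predicted):
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return np.sqrt(np.mean((actual - predicted) ** 2))

def mae(actual, predicted):
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return np.mean(np.abs(actual - predicted))
```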
Again, training and validation data will be structured in time
windows (Section VI-C3), and then for each sliding window,
we use the AMS to identify the best regression method to
estimate the resources for the following intervals.
3) Window Size Sensitivity: A specific observation window size is required to train the AMS. In this experiment, we evaluate the effect of different window sizes on the proposed solution by quantifying the estimation error using the Alibaba dataset. We tested window sizes of 20, 40, 60, 80, and 90 minutes of data to train and validate the proposed solution for resource estimation. To segment the data set into training/validation sets, we performed a random 80%/20% split. We organized the training data into windows of the aforementioned sizes and evaluated the models and ensemble using the RMSE and MAE metrics to quantify the effect of each window size.
VII. EXPERIMENTAL RESULTS
A. AMS Evaluation
The Adaptive Model Selection method is evaluated using the aforementioned quality metrics for the different classifiers; we check not only accuracy but also the performance requirements of each, such as the time for training and predicting and the size of the resulting model.
Table II shows the evaluation results of the AMS using the selected features of the raw data set to identify the best prediction method for CPU resource utilization estimation using the Alibaba data set.
Table III: Time and space efficiency of the AMS using different classifiers for the Alibaba data set.
Classifier | Training Time (sec) | Prediction Time (sec) | Prediction Time per Request (ms) | Size (KB)
KNN | 3.23 | 593.61 | 17.017 | 255283.2
Multi-layer Perceptron | 728.13 | 0.34 | 0.010 | 180.7
Naive Bayes (Gaussian) | 0.59 | 0.13 | 0.004 | 7.5
RDF | 57.43 | 0.51 | 0.015 | 201523.2
GBT | 186.45 | 0.28 | 0.008 | 140.9
Table IV: RMSE and MAE for resource estimation using the proposed system for the Alibaba data set.
Method | RMSE | MAE
GBT | 4.57 | 3.43
LR | 5.12 | 3.87
SVM | 5.63 | 4.23
Kriging | 5.26 | 3.99
Liu [30] | 5.34 | 3.94
Proposed | 3.32 | 2.29
The table shows the true positive rate (TPR), false positive rate (FPR), true negative rate (TNR), false negative rate (FNR), precision, recall, f-measure, and accuracy when using kNN, Multi-layer Perceptron, Naive Bayes, RDF, and GBT as the classification method in the AMS to identify the prediction method that can be used to estimate the CPU resources with high accuracy. The RDF outperforms all other classifiers. We observed that kNN, the second-best classification method in the AMS, provides results comparable and closest to RDF.
To profile the time and space efficiency of different classifiers for the AMS using the Alibaba data set, we measure the training time, testing time, and the size of the trained model on disk. Table III shows the time and space efficiency of the AMS using different classification methods. We observed that the Naive Bayes classifier is the most efficient, consuming the least time to train and test the AMS, whereas its classification performance is significantly lower than RDF, specifically for precision, recall, f-measure, and accuracy.
Although kNN classification performance is comparable to RDF, the training time, testing time, and disk size of the AMS using kNN are the worst compared to the other classification methods. The RDF training and test times are reasonably good, and it outperforms the other classification methods on all evaluation metrics. Therefore, we chose the RDF classifier for our proposed AMS.
Figure 7 shows the Receiver Operating Characteristic (ROC) curves using RDF with the AMS for different classes. The ROC curves
for all the classes are better than the random classifier. We
observed that the proposed AMS with RDF efficiently classi-
fies the test data for all the classes. The area under the ROC
curves is 0.84, 0.89, 0.90, and 0.90 for SVM, LR, GBT, and
KR labels respectively.
Overall, we observed that using RDF in the AMS performs excellently at identifying the appropriate prediction method to use adaptively for the given data for resource estimation.
B. Resource Utilization Estimation
Table IV shows the RMSE and MAE for CPU utilization estimation on the test data of the Alibaba data set for the proposed and baseline methods.
Figure 7: ROC curves using RDF with AMS for different
classes.
Figure 8: Comparison of normalized RMSE for baseline
methods with the proposed method using Alibaba data set.
The proposed method outperforms all baseline methods by yielding the minimum RMSE and MAE. To compare the proposed method with the baseline methods, we normalized the RMSE relative to the proposed solution, as shown in Figure 8. We observed 27%, 35%, 37%, 38%, and 41% less estimation error compared to the GBT, LR, KR, Liu, and SVM baseline methods, respectively.
Figure 9 shows the box plot of the absolute error computed for each estimated CPU utilization using the Alibaba data set for the proposed and baseline methods. We observed that the proposed method outperforms the baseline methods in minimizing the absolute error.
Figure 11 shows the recommendations proposed by the AMS as a function of time for the four selected machines. The proposed method dynamically selects the most appropriate prediction model based on the time series features of the recent window.
Figure 10 shows the comparison of actual and estimated CPU resources using the baseline methods and the proposed system for the four selected machines. The CPU utilization estimated by the proposed method is significantly closer to the actual resource utilization for all of the machines, which serve significantly different types of workloads.
Figure 9: Box plot of absolute error computed for each
estimation using baseline and proposed methods for Alibaba
data set.
Moreover, it is hard to forecast in the presence of bursts. For example, in Figure 10 for M3 we observe a burst between 460 and 530 seconds, and the proposed solution tries to minimize the estimation error using different estimators, as reflected in Figure 11. The proposed solution dynamically switches between different estimators to yield predictions with better accuracy.
To quantify and visualize the error for each estimation, we show the absolute error frequency computed for machines M1 and M3 using the baseline and proposed methods in Figure 12 and Figure 13, respectively, where M1 serves a workload consistently demanding high CPU resources and M3 serves a workload requiring CPU resources with fluctuating demand. We observed that the proposed method always yields the minimum error when estimating the CPU resource utilization for different types of workloads. We also observed that the proposed method yields the minimum absolute error for each estimation compared to the baseline methods for both the M1 and M3 machines.
C. Window Size Sensitivity Analysis
Figure 14 shows the RMSE and MAE for different window sizes with the proposed system for CPU resource estimation. We observe that increasing the window size reduces the estimation error until a window size of 60 minutes; after that, the error starts rising. The 20-minute window size only contains four observations to fit the prediction models for an estimation, which yields the maximum error. This experiment identifies that a 60-minute window size is optimal for use with the proposed system to minimize the estimation error. Therefore, in all of our experiments, we used a 60-minute window size with the proposed and baseline methods.
D. Evaluation Using BitBrains Data set
Table V shows the RMSE and MAE for estimating CPU utilization on test data using the Bitbrains data set for the proposed and baseline methods. The proposed method outperforms all baseline methods by yielding the minimum RMSE and MAE.
To compare the proposed method with the baseline methods, we normalized the RMSE relative to the proposed solution.
Table V: RMSE and MAE for resource estimation using the proposed system for the Bitbrains data set.
Method | RMSE | MAE
GBT | 9.74 | 2.85
LR | 15.01 | 6.03
SVM | 19.94 | 7.19
Kriging | 15.80 | 6.05
Liu [30] | 19.80 | 7.09
Proposed | 9.13 | 2.57
Table VI: RMSE and MAE for resource estimation using the proposed system for the Google Cluster data set.
Method | RMSE | MAE
GBT | 2.31 | 1.24
LR | 2.40 | 1.32
SVM | 2.35 | 1.28
Kriging | 2.28 | 1.24
Liu | 2.26 | 1.24
Proposed | 2.22 | 1.14
Figure 15 shows the comparison of the baseline methods with the proposed solution by plotting the normalized RMSE for the Bitbrains data set. We observed 6%, 39%, 42%, 54%, and 54% less estimation error compared to the GBT, LR, KR, SVM, and Liu baseline methods, respectively.
Figure 16 shows the box plot for absolute error computed
for each estimated CPU utilization using Bitbrains data set
for baseline and proposed methods. We observed the proposed
method produces less absolute error compared to the baseline
methods.
E. Evaluation Using Google Data set
After performing additional experiments using the Google dataset, the same one used by Liu [30], we observed that this dataset presents a behavior with less variance than Bitbrains (more than 80% of the machines report standard deviations below 4 on a range of 0 to 100). All methods behave with similarly good accuracy, and both Liu's method and ours are better than the individual machine learning algorithms. However, for the Bitbrains and Alibaba datasets, which have higher variance and more extreme behavior, Liu's method does not adapt that well, whereas our method still does and improves on the individual algorithms. Table VI shows the RMSE and MAE for estimating CPU utilization on test data using the Google data set for the proposed and baseline methods. The proposed method outperforms all baseline methods by yielding the minimum RMSE and MAE.
VIII. CONCLUSION
Building new methods for estimating resource utilization in data centers is an active and challenging problem, as most of the state-of-the-art techniques are based on specific machine learning methods that are able to adjust to particular scenarios but are ineffective in extremely diverse environments. Therefore, we present a novel approach to adaptively and automatically identify the most appropriate machine learning method to be used for predicting future resource utilization, given recent observations of such resources.
Figure 10: Actual vs proposed method CPU prediction for Alibaba data set for four selected machines. M1 = Heavy workload,
M2 = Low workload, M3 = High variation, M4= Low variation. The window size used to train the prediction model is 60
minutes.
Figure 11: Model selection of Adaptive Model Selector (AMS)
for Alibaba data set for four selected machines. M1 = Heavy
workload, M2 = Low workload, M3 = High variation, M4=
Low variation.
In our proposed methodology, we use Random Decision
Forest classifiers to determine, from a set of available ma-
chine learning techniques, which one is most appropriate for
predicting resources on a next time interval, having monitored
the previous one. The RDF is trained on the statistical features
extracted from historical observations and samples of the best
method identified for each time window.
Figure 12: Absolute error frequency of CPU utilization estim-
ation for machine M1 (High Load).
Our selected available methods include several techniques used in the current state of the art, such as regression methods, neural networks, statistical learning, and Bayesian approaches.
The proposed method is evaluated on real traces collected
from Alibaba and Bitbrains data-center monitoring datasets,
and our proposed approach can improve prediction accuracy
from 6% to 27% over current methodologies. We also fo-
cused on the importance of monitoring time window sizes
when modeling and predicting and evaluated different sizes.
Figure 13: Absolute error frequency of CPU utilization estim-
ation for machine M3 (High Variation).
Figure 14: RMSE and MAE using different window sizes with
the proposed system for resource utilization estimation.
Figure 15: Comparison of normalized RMSE for baseline
methods with the proposed method using Bitbrains data set.
We found that 60 minutes of historical resource utilization observations can effectively be used to build the prediction model for estimating the future resource utilization.
Figure 16: Box plot of absolute error computed for CPU
utilization estimation using baseline and proposed methods for
Bitbrain data set.
We conclude that our methodology can help to identify the appropriate machine learning method for each specific scenario over time, and future work will focus on investigating adaptive window sizes for modeling and predicting data center resource utilization. We also plan to extend the proposed system to retrain online automatically in order to adapt to changing characteristics. Moreover, we also intend to investigate prediction for $t+n$ intervals.
ACKNOWLEDGEMENT
This work is partially supported by the European Research
Council (ERC) under the EU Horizon 2020 programme (GA
639595), the Spanish Ministry of Economy, Industry and
Competitiveness (TIN2015-65316-P and IJCI2016-27485), the
Generalitat de Catalunya (2014-SGR-1051), and NPRP grant
# NPRP9-224-1-049 from the Qatar National Research Fund
(a member of Qatar Foundation). The statements made herein
are solely the responsibility of the authors.
REFERENCES
[1] A. Erradi, W. Iqbal, A. Mahmood, and A. Bouguettaya, “Web application
resource requirements estimation based on the workload latent features,”
IEEE Transactions on Services Computing, 2019.
[2] W. Iqbal, M. N. Dailey, D. Carrera, and P. Janecek, “Adaptive resource
provisioning for read intensive multi-tier applications in the cloud,
Future Generation Computer Systems, vol. 27, no. 6, pp. 871–879, Jun.
2011.
[3] Y. Xia, M. Tsugawa, J. A. Fortes, and S. Chen, “Large-scale vm
placement with disk anti-colocation constraints using hierarchical de-
composition and mixed integer programming,” IEEE Transactions on
Parallel and Distributed Systems, vol. 28, no. 5, pp. 1361–1374, 2017.
[4] T. H. Duong-Ba, T. Nguyen, B. Bose, and T. T. Tran, “A dynamic
virtual machine placement and migration scheme for data centers,” IEEE
Transactions on Services Computing, 2018.
[5] L. Tang and H. Chen, “Joint pricing and capacity planning in the iaas
cloud market,” IEEE Transactions on Cloud Computing, vol. 5, no. 1,
pp. 57–70, 2017.
[6] M. Carvalho, D. A. Menascé, and F. Brasileiro, “Capacity planning
for iaas cloud providers offering multiple service classes,Future
Generation Computer Systems, vol. 77, pp. 97–111, 2017.
[7] A. Paya and D. C. Marinescu, “Energy-aware load balancing and
application scaling for the cloud ecosystem,” IEEE Transactions on
Cloud Computing, vol. 5, no. 1, pp. 15–27, 2017.
[8] E. K. Lee, H. Viswanathan, and D. Pompili, “Proactive thermal-aware
resource management in virtualized hpc cloud datacenters,” IEEE Trans-
actions on Cloud Computing, vol. 5, no. 2, pp. 234–248, 2017.
13
[9] Q. Zhang, L. T. Yang, Z. Yan, Z. Chen, and P. Li, “An efficient deep
learning model to predict cloud workload for industry informatics,” IEEE
Transactions on Industrial Informatics, 2018.
[10] K. Mason, M. Duggan, E. Barrett, J. Duggan, and E. Howley, “Predicting
host cpu utilization in the cloud using evolutionary neural networks,
Future Generation Computer Systems, vol. 86, pp. 162–173, 2018.
[11] W. Zhang, P. Duan, L. T. Yang, F. Xia, Z. Li, Q. Lu, W. Gong, and
S. Yang, “Resource requests prediction in the cloud computing environ-
ment with a deep belief network,” Software: Practice and Experience,
vol. 47, no. 3, pp. 473–488, 2017.
[12] “Alibaba cluster log.” [Online]. Available: https://github.com/alibaba/
clusterdata
[13] “Bitbrains cluster log.” [Online]. Available: http://gwa.ewi.tudelft.nl/
datasets/gwa-t- 12-bitbrains
[14] D. Erhan, A. Courville, and Y. Bengio, “Understanding representations
learned in deep architectures,” Department dInformatique et Recherche
Operationnelle, University of Montreal, QC, Canada, Tech. Rep, vol.
1355, p. 1, 2010.
[15] I. K. Kim, W. Wang, Y. Qi, and M. Humphrey, “Cloudinsight: Utilizing
a council of experts to predict future cloud application workloads,” in
IEEE International Conference on Cloud Computing, 2018.
[16] A. A. Rahmanian, M. Ghobaei-Arani, and S. Tofighy, “A learning
automata-based ensemble resource usage prediction algorithm for cloud
computing environment,Future Generation Computer Systems, vol. 79,
pp. 54–71, 2018.
[17] J. Subirats and J. Guitart, “Assessing and forecasting energy efficiency
on cloud computing platforms,” Future Generation Computer Systems,
vol. 45, pp. 70–94, 2015.
[18] Z. Chen, Y. Zhu, Y. Di, and S. Feng, “Self-adaptive prediction of
cloud resource demands using ensemble model and subtractive-fuzzy
clustering based fuzzy neural network,” Computational intelligence and
neuroscience, vol. 2015, p. 17, 2015.
[19] K. Cetinski and M. B. Juric, “AME-WPC: advanced model for efficient
workload prediction in the cloud,” Journal of Network and Computer
Applications, vol. 55, pp. 191–201, 2015.
[20] F.-H. Tseng, X. Wang, L.-D. Chou, H.-C. Chao, and V. C. Leung,
“Dynamic resource prediction and allocation for cloud data center using
the multiobjective genetic algorithm,IEEE Systems Journal, vol. 12,
no. 2, pp. 1688–1699, 2018.
[21] Y. Jiang, C.-S. Perng, T. Li, and R. N. Chang, “Cloud analytics for
capacity planning and instant vm provisioning,” IEEE Transactions on
Network and Service Management, vol. 10, no. 3, pp. 312–325, 2013.
[22] R. N. Calheiros, E. Masoumi, R. Ranjan, and R. Buyya, “Workload
prediction using ARIMA model and its impact on cloud applications’
qos,” IEEE Transactions on Cloud Computing, vol. 3, no. 4, pp. 449–
458, 2015.
[23] S. Liao, H. Zhang, G. Shu, and J. Li, “Adaptive resource prediction
in the cloud using linear stacking model,” in Advanced Cloud and Big
Data (CBD), 2017 Fifth International Conference on. IEEE, 2017, pp.
33–38.
[24] C. Vazquez, R. Krishnan, and E. John, “Time series forecasting of cloud
data center workloads for dynamic resource provisioning.JoWUA,
vol. 6, no. 3, pp. 87–110, 2015.
[25] K. Dmytro, T. Sergii, and P. Andiy, “Arima forecast models for schedul-
ing usage of resources in it-infrastructure,” in Computer Sciences and
Information Technologies (CSIT), 2017 12th International Scientific and
Technical Conference on, vol. 1. IEEE, 2017, pp. 356–360.
[26] W. Fang, Z. Lu, J. Wu, and Z. Cao, “Rpps: a novel resource prediction
and provisioning scheme in cloud data center,” in Services Computing
(SCC), 2012 IEEE Ninth International Conference on. IEEE, 2012, pp.
609–616.
[27] F. Qiu, B. Zhang, and J. Guo, “A deep learning approach for vm
workload prediction in the cloud,” in 2016 17th IEEE/ACIS International
Conference on Software Engineering, Artificial Intelligence, Networking
and Parallel/Distributed Computing (SNPD). IEEE, 2016, pp. 319–324.
[28] B. Song, Y. Yu, Y. Zhou, Z. Wang, and S. Du, “Host load prediction
with long short-term memory in cloud computing,” The Journal of
Supercomputing, pp. 1–15, 2017.
[29] M. Duggan, K. Mason, J. Duggan, E. Howley, and E. Barrett, “Predicting
host cpu utilization in cloud computing using recurrent neural networks,”
in Internet Technology and Secured Transactions (ICITST), 2017 12th
International Conference for. IEEE, 2017, pp. 67–72.
[30] C. Liu, C. Liu, Y. Shang, S. Chen, B. Cheng, and J. Chen, “An
adaptive prediction approach based on workload pattern discrimination
in the cloud,” Journal of Network and Computer Applications, vol. 80,
pp. 35 – 44, 2017. [Online]. Available: http://www.sciencedirect.com/
science/article/pii/S1084804516303198
[31] A. Nanopoulos, R. Alcock, and Y. Manolopoulos, “Feature-based
classification of time-series data,” International Journal of Computer
Research, vol. 10, no. 3, pp. 49–61, 2001.
[32] B. D. Fulcher and N. S. Jones, “Highly comparative feature-based
time-series classification,” IEEE Transactions on Knowledge and Data
Engineering, vol. 26, no. 12, pp. 3026–3037, 2014.
[33] “Tsfresh features.” [Online]. Available: https://tsfresh.readthedocs.io/en/
latest/text/list_of_features.html
[34] H. Drucker, C. J. C. Burges, L. Kaufman, A. J. Smola, and V. Vapnik,
“Support vector regression machines,” in NIPS. MIT Press, 1996, pp.
155–161.
[35] L. Mason, J. Baxter, P. Bartlett, and M. Frean, “Boosting algorithms
as gradient descent,” in In Advances in Neural Information Processing
Systems 12. MIT Press, 2000, pp. 512–518.
[36] J. H. Friedman, “Greedy function approximation: A gradient boosting
machine,” Annals of Statistics, vol. 29, pp. 1189–1232, 2000.
[37] D. R. Jones, M. Schonlau, and W. J. Welch, “Efficient global optimiza-
tion of expensive black-box functions,Journal of Global optimization,
vol. 13, no. 4, pp. 455–492, 1998.
[38] A. Gambi, M. Pezze, and G. Toffetti, “Kriging-based self-adaptive cloud
controllers,” IEEE Transactions on Services Computing, vol. 9, no. 3,
pp. 368–381, 2016.
[39] F. Chollet, Deep learning with python. Manning Publications Co., 2017.
[40] T. K. Ho, “Random decision forests,” in Proceedings of the Third
International Conference on Document Analysis and Recognition
(Volume 1) - Volume 1, ser. ICDAR ’95. Washington, DC,
USA: IEEE Computer Society, 1995, pp. 278–. [Online]. Available:
http://dl.acm.org/citation.cfm?id=844379.844681
[41] M. Christ, N. Braun, J. Neuffer, and A. W. Kempa-Liehr, “Time series
feature extraction on basis of scalable hypothesis tests (tsfresh–a python
package),” Neurocomputing, 2018.
[42] “Cesium,” 2018. [Online]. Available: http://cesium-ml.org/
[43] L. Ge and L.-J. Ge, “Feature extraction of time series classification based
on multi-method integration,” Optik-International Journal for Light and
Electron Optics, vol. 127, no. 23, pp. 11070–11 074, 2016.
[44] B. Gregorutti, B. Michel, and P. Saint-Pierre, “Correlation and variable
importance in random forests,” Statistics and Computing, vol. 27, no. 3,
pp. 659–678, 2017.
[45] G. Chandrashekar and F. Sahin, “A survey on feature selection methods,”
Computers & Electrical Engineering, vol. 40, no. 1, pp. 16–28, 2014.
[46] “Feature selector.” [Online]. Available: https://github.com/WillKoehrsen/
feature-selector
[47] “linear trend feature.” [Online]. Available: https:
//tsfresh.readthedocs.io/en/latest/api/tsfresh.feature_extraction.html#
tsfresh.feature_extraction.feature_calculators.agg_linear_trend
[48] “Google cluster log.” [Online]. Available: https://github.com/google/
cluster-data
[49] T. H. Nguyen, M. Di Francesco, and A. Yla-Jaaski, “Virtual machine
consolidation with multiple usage prediction for energy-efficient cloud
data centers,” IEEE Transactions on Services Computing, 2017.
Shuja-ur-Rehman Baig is currently pursuing the Ph.D. degree with the Computer Architecture Department-UPC. He is also working as a collaborator with the Data-Centric Computing group, Barcelona Supercomputing Center, and holds a position as Lecturer at Punjab University College of Information Technology, University of the Punjab, Lahore, Pakistan. Shuja received the M.Sc. degree in computer science from Lahore University of Management Sciences (LUMS), Lahore, Pakistan.
Waheed Iqbal is an Assistant Professor at Punjab
University College of Information Technology, Uni-
versity of the Punjab, Lahore, Pakistan. He worked
as a Postdoc researcher with the Department of
Computer Science and Engineering, Qatar Univer-
sity during 2017–2018. His research interests lie
in cloud computing, distributed systems, machine
learning, and large-scale system performance evaluation. Waheed received his Ph.D. degree from the
Asian Institute of Technology, Thailand.
Josep Lluis Berral received the degree in informatics in 2007, the M.Sc. degree in computer architecture in 2008, and the Ph.D. degree in computer science from BarcelonaTech-UPC in 2013. He is a Data Scientist at the Barcelona Supercomputing Center, working on applications of data mining and machine learning in data-centric computing scenarios. He was with the High Performance Computing Group, Computer Architecture Department-UPC, and the Relational Algorithms, Complexity and Learning Group, Computer Science Department-UPC.
Abdelkarim Erradi is an Associate Professor in
the Computer Science and Engineering Department
at Qatar University. His research and development
activities and interests focus on autonomic com-
puting, self-managing systems and cybersecurity.
He leads several funded research projects in these
areas. He has authored several scientific papers in
international conferences and journals. He received
his Ph.D. in computer science from the University
of New South Wales, Sydney, Australia.
David Carrera received the M.S. and Ph.D. de-
grees from BarcelonaTech-UPC in 2002 and 2008,
respectively. He is an Associate Professor with the
Computer Architecture Department, UPC. He is an
Associate Researcher with the Data-Centric Computing group, Barcelona Supercomputing Center. His research
interests are focused on the performance manage-
ment of data center workloads. He has been involved
in several EU and industrial research projects. In
2015, he was awarded an ERC Starting Grant for
the project HiEST. He was a recipient of the IBM
Faculty Award in 2010.