Sustainability-aware Resource Provisioning in Data Centers
Jingzhe Wang
School of Computing and Information
University of Pittsburgh
Pittsburgh, USA
jiw148@pitt.edu
Balaji Palanisamy
School of Computing and Information
University of Pittsburgh
Pittsburgh, USA
bpalan@pitt.edu
Jinlai Xu
School of Computing and Information
University of Pittsburgh
Pittsburgh, USA
jinlai.xu@pitt.edu
Abstract—In the big data era, cloud computing provides an effective usage model for providing computing services to handle diverse data-intensive workloads. Data center capacity planning and resource provisioning policies play a vital role in the long-term life cycle management of data centers. Effective design and management of data center infrastructures while ensuring good performance is critical to minimizing the carbon footprint of the data center. Traditional solutions have primarily focused on optimizing data center operational phase impacts, including reducing energy cost during the resource management phase. In this paper, we propose a two-phase sustainability-aware resource allocation and management framework for data center life-cycle management that jointly optimizes the data center manufacturing phase and operational phase impact without impacting the performance and service quality for the jobs. Phase 1 of the proposed approach minimizes the data center building phase carbon footprint through a novel manufacturing cost-aware server provisioning plan. In phase 2, the approach minimizes the operational phase carbon footprint using a server lifetime-aware resource allocation scheme and a manufacturing cost-aware replacement plan. The proposed techniques are evaluated through extensive experiments using realistic workloads generated in a data center. The evaluation results show that the proposed framework significantly reduces the carbon footprint in the data center without impacting the performance of the jobs in the workload.
Index Terms—Sustainable Cloud Computing, Resource Management, Scheduling, Life Cycle Assessment
I. INTRODUCTION
In the big data era, cloud computing provides an effective
usage model for providing computing services to handle
diverse data-intensive workloads. With increasing trends in
workload demand and diversity of workloads, cloud com-
puting plays a vital role in providing scalable and highly
available services to cloud consumers. A typical cloud data
center consists of tens of thousands of heterogeneous servers
that may be partitioned into different clusters for handling
different workloads [1]. In addition, the data center infrastruc-
ture may include cooling systems and on-site/off-site energy
systems configured to sufficiently support normal operations
and management. Large-scale IT equipment and auxiliary
power systems may lead to several environmental issues [2].
Designing sustainable data centers is a challenging problem.
In this paper, we develop a sustainability-aware data center design that jointly optimizes both manufacturing phase and operational phase impacts while taking into account the impacts due to server replacement during the operational phase.
The sustainability objective in the data center manufacturing
phase aims at provisioning server capacity in the data center
for handling long-term workload demand while minimizing
the total manufacturing carbon footprint. In our work, we
specifically focus on minimizing the carbon footprint incurred
due to manufacturing and operating servers. Each type of
server has a different manufacturing carbon footprint. The
manufacturing phase carbon footprint of a data center con-
sidered in our work refers to the total carbon contribution of
manufacturing all the servers included in the data center. In
addition to the initial provisioning of servers in the data center
capacity planning, servers may need to be replaced during
the operational phase. With dynamic scheduling of workloads
on the servers, the lifetime and hardware reliability of the
servers are impacted by repeated on-off thermal cycles, wear-
and-tear, and temperature rise [3]–[7]. After a server wears
out sufficiently, the server cannot serve any workload job
requests and the server may need to be replaced to serve
future workloads. Provisioning new replacement servers will
result in additional manufacturing carbon footprint to the op-
timized server provisioning plan. Therefore, designing server
lifetime-aware resource management techniques that minimize
the manufacturing impacts due to server replacement in the
operational phase of the data center is critical to minimizing
the total carbon footprint. Besides this goal, the key objective
in the operational phase of the data center is to minimize the
carbon emission footprint when operating the data center and
serving the workload requests. It is critical that the data center
employs an efficient online job scheduler that schedules the
workload requests while minimizing the total carbon emission
from an energy efficiency perspective.
This paper presents a new framework, SusOpt, a two-phase
sustainability-aware resource provisioning and management
framework that jointly optimizes the data center manufactur-
ing phase and operational phase impacts without affecting
the performance and service quality for the jobs. We first
formulate the minimal manufacturing carbon footprint server
capacity provisioning problem as an offline Integer Linear
Programming Problem. We then propose an efficient heuristic
algorithm to solve the problem. We propose a new online
resource allocation and scheduling mechanism that jointly
minimizes the carbon emission due to server power consump-
tion and server replacement based on the optimal data center
capacity plan. The proposed techniques are evaluated through
extensive experiments using realistic workloads generated in
a data center. Compared with the baseline solutions that
independently minimize manufacturing phase carbon footprint
and operational phase carbon footprint, the evaluation results
show that the proposed framework significantly reduces the
total carbon footprint in the data center without impacting
the performance of the jobs in the workload. The rest of
this paper is organized as follows. We begin by discussing
the background and motivation in Section II. In Section III,
we define the objectives of the proposed framework and
present the proposed heuristic mechanisms. In Section IV, we
evaluate the techniques through simulations using a real-world
workload data trace. We discuss related work in Section V and
conclude in Section VI.
II. BACKGROUND AND MOTIVATION
In this section, we first present an overview of sustainable
data center infrastructure design problem by illustrating both
manufacturing phase and use phase impacts.
A data center consists of composite IT infrastructures that
include racks of servers, network switches and power genera-
tion equipment and cooling systems. The manufacturing phase
of a data center refers to the time when data center designers
build a new data center by choosing a set of equipment for
the data center. Data center designers need to decide the type
and number of servers needed in the data center to satisfy
the expected workload demand. While existing research [8]–
[15] has focused on optimizing manufacturing phase impacts
from economic and energy management perspectives, such
techniques do not consider sustainability impacts from the data
center capacity planning perspective.
After a data center infrastructure is designed, the data center
goes into operation. Data center operators need to operate and manage the data center effectively to minimize the use-phase impact incurred during the operational phase.
Therefore, designing server lifetime-aware resource manage-
ment techniques that minimize the manufacturing impact due
to server replacement in the operational phase of the data
center is critical to minimizing the total carbon footprint.
Data center Life Cycle Assessment (LCA) refers to as-
sessing environmental impacts associated with all the stages
of the life-cycle of a data center. It includes the process of
building clusters of servers and associated equipment and
effectively managing the resource allocation for the incoming
workload and maintaining and upgrading data center equip-
ment as needed. We assume that the data center life cycle
consists of two phases namely the manufacturing phase and
the operational phase. A few past research efforts have focused on this direction, including designing new evaluation metrics for data center LCA. Zhang et al. [16] proposed a new data center network topology and complexity metric to optimize data center LCA from a network perspective. Chang et al. [10] proposed a new metric called 'exergy' to develop a more efficient evaluation framework for the data center life cycle.
efforts have focused on data center equipment upgrading. For
instance, Gao et al. [17] designed a yearly server upgrading
plan for geo-distributed data centers to minimize the carbon
emission footprint. We note that these existing solutions do
not jointly consider the two phases together and instead focus
on manufacturing phase optimization and operational phase
management independent of each other.
III. SUSOPT FRAMEWORK DESIGN
The objective of the SusOpt framework is to minimize the
total carbon footprint in the data center life cycle by jointly
minimizing the manufacturing phase carbon footprint and the
carbon emission footprint during the data center operational
phase.
A. SusOpt Phase-1: Minimal Manufacturing Carbon Footprint Data Center Server Capacity Planning
In phase-1 (Fig. 1) of the proposed framework, the objective
is to determine the type and number of servers needed to pro-
vision in the data center to satisfy expected workloads while
minimizing the total manufacturing phase carbon footprint.
We model the data center as consisting of heterogeneous types of servers. We define $S_k$ as the type-$k$ server; each type-$k$ server is characterized by several parameters, including CPU capacity, memory capacity, and server product carbon footprint. The set of all server types is defined as $\mathcal{K} = \{S_1, S_2, ..., S_k\}$. In general, we use $R^{cpu}$ and $R^{mem}$ to denote server CPU capacity and memory capacity respectively. The product carbon footprint of a type-$k$ server is denoted as $M_k$, and all servers of a given type have the same $M_k$. Therefore, in our model, we characterize a type-$k$ server as $S_k = \{R^{cpu}_k, R^{mem}_k, M_k\}$. We assume that for each type $k$ there are $n_k$ identical servers, where $n_k \geq 0$, and we use $\mathcal{N}_k$ to denote the type-$k$ server cluster. We assume a discrete time model with an equal-length time slot collection $\mathcal{T} = \{t_1, t_2, ..., t_T\}$, and for each time slot $t$ we have $t \in \mathcal{T}$.
We characterize the historical workload in the data center by applying a modified k-means algorithm [18] (step 1; the circled step numbers refer to the workflow steps in Fig. 1). We use $\mathcal{C} = \{c_1, c_2, ..., c_C\}$ to define the job types and we define $\mathcal{J}_c$ ($c \in \mathcal{C}$) as the type-$c$ job collection. After characterizing the historical workload, we obtain the workload type distribution, denoted as $\vartheta$ (step 3). Then, we apply the ARIMA model [19] on the historical workload trace (step 2) to predict the workload requests for future time slots (step 4). We define $\lambda(t)$ as the workload requests in future time slot $t$, and all the workload in the future $\mathcal{T}$ window is denoted as $\lambda = \{\lambda(1), ..., \lambda(t)\}$, where $t \in \mathcal{T}$. At each time slot $t$, there are several types of jobs. We assume that the resource requirements of each type-$c$ job follow the $\vartheta$ distribution (step 6). We define $\mathcal{J}_c(t)$ as the collection of type-$c$ jobs in time slot $t$, where $\mathcal{J}_c(t) \subseteq \lambda(t)$.
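To make the characterization and prediction steps concrete, the following is a minimal sketch (not the authors' implementation) of clustering a historical trace into job types with k-means and forecasting per-slot request counts with an ARIMA model; the trace file, column names, number of clusters, and ARIMA order are illustrative assumptions.

```python
# Sketch of workload characterization (k-means) and per-slot demand
# forecasting (ARIMA). The trace file, column names, k, and the ARIMA
# order are illustrative assumptions, not values from the paper.
import pandas as pd
from sklearn.cluster import KMeans
from statsmodels.tsa.arima.model import ARIMA

trace = pd.read_csv("historical_trace.csv")         # hypothetical trace file

# Steps 1/3: cluster jobs into job types by their resource profile.
features = trace[["cpu_request", "mem_request", "runtime"]]
kmeans = KMeans(n_clusters=6, random_state=0).fit(features)
trace["job_type"] = kmeans.labels_
type_distribution = trace["job_type"].value_counts(normalize=True)   # plays the role of the ϑ distribution

# Steps 2/4: forecast the number of job requests per future time slot.
requests_per_slot = trace.groupby("time_slot").size()
model = ARIMA(requests_per_slot, order=(2, 1, 2)).fit()
future_lambda = model.forecast(steps=24)             # plays the role of λ(t) for the next 24 slots

print(type_distribution)
print(future_lambda.round().astype(int))
```

In this reading, the cluster-level distribution stands in for $\vartheta$ and the forecast stands in for $\lambda(t)$ in the formulation that follows.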
Fig. 1. SusOpt Framework: Phase 1 (data center building phase) performs offline server capacity planning from the characterized and predicted workload; Phase 2 (data center operational phase) performs online resource management with a central dispatcher, per-cluster local schedulers and job queues, resource, carbon-efficiency, and server-lifetime monitors, and server replacement.
For simplicity, we assume that the jobs of a given job type have the same resource demands and execution time (the average values from the $\vartheta$ distribution), denoted as $Res^{cpu}_c(J^j_c)$, $Res^{mem}_c(J^j_c)$, and $\tau_c$ respectively, where $J^j_c$ denotes the $j$-th job in $\mathcal{J}_c(t)$. Each job also has an expected deadline $d^j_c$ for completing its execution. Here, we assume that $d^j_c = \alpha \times \tau^j_c$. We use $\Delta^j_c$ to denote the provisioning delay of job $J^j_c$, which could be zero if the job is placed immediately at time $t$.
We consider that the job types are mapped to server types to indicate which servers are capable of executing a given job type. We define a binary variable $Y^c_k$ to indicate the mapping constraint between a type-$c$ job and a type-$k$ server: $Y^c_k = 1$ indicates that $R^{cpu}_k \geq Res^{cpu}_c(J^j_c)$ and $R^{mem}_k \geq Res^{mem}_c(J^j_c)$, and $Y^c_k = 0$ otherwise. We assume that each type-$c$ job has a fixed degree of parallelism, and its average running time on each eligible type-$k$ server is denoted as $r^k_c$, which is estimated through Amdahl's Law [20], [21].
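As an illustration of how the per-server-type running time $r^k_c$ could be derived from Amdahl's Law, the sketch below scales a job's single-server baseline runtime by the achievable speedup on each candidate server's core count; the parallel fraction and baseline runtime are illustrative assumptions, and only the core counts come from Table I.

```python
# Sketch: estimating a job type's running time on each server type via
# Amdahl's Law. The parallel fraction f and the baseline runtime are
# illustrative placeholders; core counts are taken from Table I.
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Speedup = 1 / ((1 - f) + f / n)."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)

def runtime_on_server(base_runtime: float, parallel_fraction: float, cores: int) -> float:
    """Estimated running time r_c^k of a type-c job on a type-k server."""
    return base_runtime / amdahl_speedup(parallel_fraction, cores)

server_cores = {"R240": 6, "R330": 4, "R420": 16, "R540": 16}
for name, cores in server_cores.items():
    est = runtime_on_server(base_runtime=120.0, parallel_fraction=0.8, cores=cores)
    print(name, round(est, 1))
```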
We formulate the minimal manufacturing carbon footprint server capacity provisioning problem as an offline problem. The objective is to find the optimal heterogeneous server collections that minimize the data center manufacturing carbon footprint, denoted as $F$, while satisfying the workload resource demand. We denote the problem as MCSCP (Minimal Manufacturing Carbon Footprint Server Capacity Planning). In MCSCP, we consider an offline optimization problem by assuming that all the job information, including the resource demands, is known a priori. We define $y_{mk}$ to indicate whether the $m$-th server of the type-$k$ cluster is occupied, and we use $x_{jmk}$ to indicate whether the $j$-th job of the type-$c$ job collection is placed on the $m$-th server of the type-$k$ server cluster. For example, if the $j$-th job is scheduled on the $m$-th server in $\mathcal{N}_k$, then $y_{mk} = 1$ and $x_{jmk} = 1$.
We formulate MCSCP as an Integer Linear Programming problem:

$$\min \; F = \sum_{t \in \mathcal{T}} \sum_{k \in \mathcal{K}} M_k \Big( \sum_{m \in \mathcal{N}_k} y_{mk} \Big) \quad (1)$$

$$\text{s.t.} \quad \sum_{k \in \mathcal{K}} \sum_{m \in \mathcal{N}_k} Y^c_k \cdot x_{jmk} = 1, \quad \forall j \in \mathcal{J}_c(t), \; \forall c \in \mathcal{C} \quad (2)$$

$$\sum_{c \in \mathcal{C}} \sum_{j \in \mathcal{J}_c(t)} Y^c_k \cdot x_{jmk} \cdot Res^{cpu}_c(J^j_c) \leq y_{mk} \cdot R^{cpu}_k, \quad \forall m \in \mathcal{N}_k, \; \forall k \in \mathcal{K}, \; \forall t \in \mathcal{T} \quad (3)$$

$$\sum_{c \in \mathcal{C}} \sum_{j \in \mathcal{J}_c(t)} Y^c_k \cdot x_{jmk} \cdot Res^{mem}_c(J^j_c) \leq y_{mk} \cdot R^{mem}_k, \quad \forall m \in \mathcal{N}_k, \; \forall k \in \mathcal{K}, \; \forall t \in \mathcal{T} \quad (4)$$

$$t + \Delta^j_c + (\tau^j_c)_k \leq d^j_c, \quad \forall j \in \mathcal{J}_c(t), \; \forall c \in \mathcal{C}, \; \forall k \in \mathcal{K} \quad (5)$$

$$x_{jmk}, \; y_{mk}, \; Y^c_k \in \{0, 1\}, \quad \forall j \in \mathcal{J}_c(t), \; \forall m \in \mathcal{N}_k, \; \forall k \in \mathcal{K}, \; \forall c \in \mathcal{C} \quad (6)$$

where (1) is the objective of our problem, which minimizes the total manufacturing phase carbon footprint. Constraint (2) ensures that each type-$c$ job is assigned to exactly one server at each time slot, and constraints (3)-(4) ensure that the jobs assigned to each type-$k$ server cannot exceed the server capacity limits at each time slot, where $R^{cpu}_k$ and $R^{mem}_k$ refer to the type-$k$ server capacity limits. Constraint (5) denotes the expected deadline constraint for each job.
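For concreteness, the following is a minimal, single-time-slot sketch of the MCSCP integer program using the PuLP solver. It covers objective (1) and constraints (2)-(4) and (6), fixes an upper bound on candidate servers per type, and omits the type-mapping variable $Y^c_k$ and deadline constraint (5) for brevity; the server and job data are illustrative.

```python
# Minimal single-slot sketch of the MCSCP ILP: objective (1) with
# constraints (2)-(4), (6). The time dimension, type-mapping, and
# deadline constraint are omitted; all data values are illustrative.
import pulp

# Server types: (cpu capacity, mem capacity, manufacturing footprint M_k)
servers = {"R330": (4, 16, 1146), "R540": (16, 32, 1234)}
MAX_PER_TYPE = 5                                  # assumed cap on servers per type
# Jobs: (cpu demand, mem demand)
jobs = {"j1": (2, 4), "j2": (8, 16), "j3": (4, 8)}

prob = pulp.LpProblem("MCSCP", pulp.LpMinimize)
y = {(k, m): pulp.LpVariable(f"y_{k}_{m}", cat="Binary")
     for k in servers for m in range(MAX_PER_TYPE)}
x = {(j, k, m): pulp.LpVariable(f"x_{j}_{k}_{m}", cat="Binary")
     for j in jobs for k in servers for m in range(MAX_PER_TYPE)}

# Objective (1): total manufacturing carbon footprint of provisioned servers.
prob += pulp.lpSum(servers[k][2] * y[k, m] for k in servers for m in range(MAX_PER_TYPE))

for j in jobs:  # constraint (2): each job placed on exactly one server
    prob += pulp.lpSum(x[j, k, m] for k in servers for m in range(MAX_PER_TYPE)) == 1
for k in servers:
    for m in range(MAX_PER_TYPE):  # constraints (3)-(4): capacity limits
        prob += pulp.lpSum(jobs[j][0] * x[j, k, m] for j in jobs) <= servers[k][0] * y[k, m]
        prob += pulp.lpSum(jobs[j][1] * x[j, k, m] for j in jobs) <= servers[k][1] * y[k, m]

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print("manufacturing footprint:", pulp.value(prob.objective))
```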
Theorem 1. The MCSCP problem is NP-Hard.
Proof: From our optimization problem formulation, we can reduce the MCSCP problem to a Two-Dimensional Variable-Size Vector Bin Packing Problem (2DVSVBPP) as described in [22]. In our formulation, after the reduction, each job in a type-$c$ job collection can be seen as an item with vector constraints on its resource demand (CPU, memory). The heterogeneous servers are equivalent to multiple types of bins, the server capacity corresponds to the two-dimensional size of a bin, and the manufacturing carbon footprint of each type-$k$ server can be considered as the cost of the selected bins. The 2DVSVBPP problem is NP-Hard [22]. Therefore, the MCSCP problem is also NP-Hard.
We propose our heuristic solution, the Deadline-aware Minimal Manufacturing Carbon Footprint Capacity Plan (MCCADL), based on the First Fit by Ordered Deviation (FFOD) heuristic [22], to solve the MCSCP problem. We define $\widehat{cpu}^{km}_{cj}$ and $\widehat{mem}^{km}_{cj}$ as the normalized resource demands [23] of the $j$-th type-$c$ job on server $S^m_k$.
Algorithm 1: MCCADL-Main
Input : heterogeneous server candidates K = {S_1, ..., S_k}; job requests λ = {λ(1), ..., λ(t)}
Output: selected servers N = {N_1, ..., N_k}; total manufacturing carbon footprint F
1  Initialization: N ← {∅}, F ← 0; occupied server set OK ← {∅}
2  Function CapacityPlan(T, λ, K)
3    if t == 1 then
4      λ(1) ← ddlAscSort(λ(1))                    // ascending deadline order
5      N_k ← FindNewMinOpportunityCostServer(J_c^1, K, 1)
6      N ← N_k; OK ← N_k
7      foreach J_c^j ∈ λ(1) \ J_c^1 do
8        η*_cj ← FindNewMinOpportunityCostServer(J_c^j, K, t)
9        η_cj ← FindOccupiedMinOpportunityCostServer(J_c^j, OK, t)
10       ServerSelectionAndUpdate(η*_cj, η_cj, N, OK, t)
11     end
12   end
13   WQ ← {∅}                                      // waiting queue for future assignment
14   for t = 2 to LastDeadlineSlot do
15     release finished job resources if needed
16     if WQ(t) ≠ ∅ then
17       each job executes FindOccupiedMinOpportunityCostServer
18     end
19     λ(t) ← ddlAscSort(λ(t))
20     foreach J_c^j ∈ λ(t) do
21       if FindOccupiedMinOpportunityCostServer(J_c^j, OK, t) then
22         put J_c^j into the server and update its resources
23       else
24         if ResourceReleaseByDDL(OK, J_c^j) then WQ(t).enqueue(J_c^j)
25         else
26           assign a new server using FindNewMinOpportunityCostServer(J_c^j, K, t)
27         end
28       end
29     end
30   end
31   return N, F
Now, we discuss the capacity planning procedure in detail, starting with the helper functions in Algorithm 2. The JobAdjustedResource function (lines 25-31) calculates the dominant resource dimension of $J^j_c$ on $S^m_k$ based on the balance of the job's resource dimensions and the resources occupied in each dimension of $S^m_k$; $\delta^{km}_{cj}$ refers to the adjusted resource demand of the $j$-th type-$c$ job on $S^m_k$. We define $\eta^{km}_{cj}$ as the opportunity carbon cost of assigning $J^j_c$ to $S^m_k$, where $\eta^{km}_{cj} = \delta^{km}_{cj} \cdot M_k$. We define two minimal server cost selection functions, namely FindNewMinOpportunityCostServer (lines 4-11), which finds a new server of type $k$, and FindOccupiedMinOpportunityCostServer (lines 12-24), which finds the $m$-th server among the occupied type-$k$ servers. Based on the $\eta$ values calculated by these two functions, we decide to either use a new server or assign the job to an available occupied server using the ServerSelectionAndUpdate function.
Algorithm 2: MCCADL-Helper
// R^cpu_km(t), R^mem_km(t): occupied (normalized) CPU/memory fractions of the m-th type-k server;
// cpu^km_cj, mem^km_cj: normalized CPU/memory demands of the j-th type-c job on that server.
1  Function ServerSelectionAndUpdate(η_n, η_o, N, OK, t)
2    if η_o > η_n then N ← N_k, OK ← N_k, update resources
3    if η_o < η_n then update resources
4  Function FindNewMinOpportunityCostServer(J_c^j, K, t)
5    K_c ← all k for which JobServerConstraint() returns TRUE
6    η*_cj ← {∅}                                   // initial cost vector for all new k ∈ K_c
7    foreach k in K_c do
8      δ^k_cj ← JobAdjustedResource(J_c^j, 0, 0);  η^k_cj ← M_k × δ^k_cj
9      η*_cj.add(η^k_cj)
10   end
11   η^k_cj ← min_{k ∈ K_c}(η*_cj);  return η^k_cj and the server
12 Function FindOccupiedMinOpportunityCostServer(J_c^j, OK, t)
13   η_cj ← {∅}                                    // initial cost vector for occupied servers in OK
14   if OK ≠ ∅ then
15     foreach m ∈ OK do
16       R^cpu_km(t) ← Σ_{j ∈ J_km(t)} cpu^km_cj
17       R^mem_km(t) ← Σ_{j ∈ J_km(t)} mem^km_cj
18       if (1 − R^cpu_km(t)) ≥ cpu^km_cj and (1 − R^mem_km(t)) ≥ mem^km_cj then
19         δ^km_cj ← JobAdjustedResource(J_c^j, R^cpu_km(t), R^mem_km(t))
20         η^km_cj ← M_k × δ^km_cj;  η_cj.add(η^km_cj)
21       end
22       η^m_cj ← min_{m ∈ OK}(η^km_cj)
23     end
24   return η^m_cj and the server
25 Function JobAdjustedResource(J_c^j, R^cpu_km(t), R^mem_km(t))
26   if R^cpu_km(t) ≥ R^mem_km(t) then
27     δ^km_cj ← max{cpu^km_cj, mem^km_cj − (R^cpu_km(t) − R^mem_km(t))}
28   else
29     δ^km_cj ← max{mem^km_cj, cpu^km_cj − (R^mem_km(t) − R^cpu_km(t))}
30   end
31   return δ^km_cj
32 Function JobServerConstraint(c, K)
33   return Y^c_k == 1
34 Function ResourceReleaseByDDL(OK, J_c^j)
35   if there are resources for J_c^j before d_c^j then
36     return TRUE
In the CapacityPlan function of Algorithm 1, at the first time slot (lines 3-12), we sort the jobs in λ(1) in ascending order of deadline, find the first server, and update the server set (lines 4-6). For the other jobs in λ(1), we compare the opportunity cost of using a new server against assigning the job to an occupied server, select the minimal cost server, and update the server set. For each subsequent time slot, we first check the resource release (line 15). Here, we define a job waiting queue WQ that holds the jobs that do not have an available server at time $t$ but can be assigned in an upcoming time slot when occupied resources are released. Therefore, at each time slot, we check the server resource release and assign the jobs in WQ(t) using the FindOccupiedMinOpportunityCostServer function (lines 16-18). For newly arriving jobs, we sort them in ascending deadline order (line 19) and find a server to assign using FindOccupiedMinOpportunityCostServer. If the current server pool cannot serve a job, we have two options. The first option is to use the ResourceReleaseByDDL function to estimate whether enough resources will be released in the coming slots to serve the job before its deadline; in that case we place the job in WQ(t) (line 24). Otherwise, we open a new server using FindNewMinOpportunityCostServer (line 26) and update the server set. After all jobs are assigned, we calculate the total number of servers and the manufacturing carbon footprint.
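The sketch below illustrates the opportunity-cost computation at the core of MCCADL, following the prose definition above: the adjusted demand δ balances a job's normalized CPU and memory demands against a server's current occupancy, and the opportunity carbon cost is δ scaled by the server type's manufacturing footprint M_k. It is a simplified reading of Algorithm 2 with illustrative values.

```python
# Sketch of the MCCADL opportunity-cost computation (a simplified
# reading of Algorithm 2). Demands are assumed to be normalized to the
# server's capacity; all numeric values are illustrative.
def job_adjusted_resource(cpu_dem, mem_dem, occ_cpu, occ_mem):
    """Adjusted demand delta of a job on a server, given the server's
    currently occupied CPU/memory fractions (0 for a new server)."""
    if occ_cpu >= occ_mem:          # CPU is the more occupied dimension
        return max(cpu_dem, mem_dem - (occ_cpu - occ_mem))
    return max(mem_dem, cpu_dem - (occ_mem - occ_cpu))

def opportunity_cost(cpu_dem, mem_dem, occ_cpu, occ_mem, m_k):
    """eta = delta * M_k: opportunity carbon cost of the placement."""
    return job_adjusted_resource(cpu_dem, mem_dem, occ_cpu, occ_mem) * m_k

# Compare opening a new server vs. reusing a partly occupied one.
job = (0.25, 0.50)                              # normalized (cpu, mem) demand
eta_new = opportunity_cost(*job, 0.0, 0.0, m_k=1146)
eta_occupied = opportunity_cost(*job, 0.40, 0.30, m_k=1146)
print("new server:", eta_new, "occupied server:", eta_occupied)
# The job is placed on whichever option has the smaller opportunity cost.
```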
MCCADL is an offline provisioning scheme, as we assume the expected job resource requirements are known a priori and we assign new servers to support the future expected workload. After the phase-1 server provisioning procedure is complete, we use $\mathcal{N}$ as the manufacturing phase carbon-optimal server cluster and as the infrastructure for the phase-2 optimization.
B. SusOpt Phase-2: Carbon Emission-Aware Resource Management and Server Replacement
In this section, we introduce the second phase (Fig. 1)
of the proposed sustainability-aware resource management
framework that focuses on minimizing the carbon emission
footprint occurring during data center operation and server
replacement during the data center life cycle.
To achieve the above-mentioned objective, we design a job
scheduling technique that is aware of online carbon emission
during data center operation and closely considers server
lifetime and replacement impact while satisfying job deadline
constraints. We use $\mathcal{N}_k$ to denote the type-$k$ cluster that consists of $n_k$ homogeneous type-$k$ servers, and $S^m_k$ denotes the $m$-th server in the type-$k$ cluster. In contrast to phase-1, where we assumed that all jobs of the same type $c$ have the same characteristics, in phase-2 we randomly sample the jobs' characteristics from the distribution calculated in the characterization step of phase-1. We use $\pi(t)$ to denote the job requests at time slot $t$. To differentiate from the symbols used in phase-1, we define $J^j_c$ as the $j$-th type-$c$ job and $D_c(t)$ as the type-$c$ job collection in time slot $t$, where $D_c(t) \subseteq \pi(t)$. We use $V^{cpu}(J^j_c)$ and $V^{mem}(J^j_c)$ to denote the online job resource demands, and $(\upsilon^j_c)_k$ to represent the actual execution duration on a type-$k$ server. $\Delta^j_c$ represents the scheduling delay and $s^j_c$ represents the deadline of an online job, which is calculated as $s^j_c = \alpha \times (\upsilon^j_c)_k$.
We apply the server power consumption and energy model from [24]. We assume that at time slot $t$ the utilization of $S^m_k$ is denoted as $u^m_k(t) \in [0, 1]$, which is dominantly determined by the CPU utilization as described in [24]. We then define $P_{S^m_k}(u^m_k(t))$ as the dynamic power of $S^m_k$, where $P_{S^m_k}(u^m_k(t)) = (P^{max}_{S^m_k} - P^{idle}_{S^m_k})(2u^m_k(t) - (u^m_k(t))^r)$, and $P^{max}_{S^m_k}$ and $P^{idle}_{S^m_k}$ represent the maximum power and idle power of server $S^m_k$. Here $r$ is a tunable parameter that minimizes the squared error. Based on the power consumption model, we define $E^m_k(u^m_k(t))$ as the energy consumption of an actively powered-on server when jobs run, where $E^m_k(u^m_k(t)) = P_{S^m_k}(u^m_k(t)) \times (\tau^j_c)_k$ and $(\tau^j_c)_k$ denotes the job execution duration. We use $\rho$ to denote the data center carbon emission rate from [25], in units of kg/kWh. Therefore, we can calculate the carbon emission as $\rho \times E^m_k$.
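As a concrete reading of the power and emission model above, the sketch below computes the dynamic power at a given utilization, the energy over a job's execution, and the resulting emission; the power parameters, exponent $r$, and emission rate are illustrative placeholders rather than values from [24], [25], or [29].

```python
# Sketch of the server power/energy/carbon model described above.
# P_max, P_idle, r, and the emission rate rho are illustrative values.
def dynamic_power(u, p_max, p_idle, r):
    """Dynamic power (W) at utilization u in [0, 1]:
    (P_max - P_idle) * (2u - u^r)."""
    return (p_max - p_idle) * (2 * u - u ** r)

def job_energy_kwh(u, duration_hours, p_max, p_idle, r):
    """Energy (kWh) consumed while a job runs for duration_hours."""
    return dynamic_power(u, p_max, p_idle, r) / 1000.0 * duration_hours

RHO = 0.43                        # assumed emission rate, kg CO2 per kWh
energy = job_energy_kwh(u=0.6, duration_hours=2.0, p_max=350.0, p_idle=150.0, r=1.4)
print("energy (kWh):", round(energy, 3), "emission (kg CO2):", round(RHO * energy, 3))
```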
In phase-2, the key objective is to dynamically monitor and manage the servers in the data center. For simplicity, we assume that the expected lifetime of the servers provisioned in phase-1 follows a lifetime distribution that can be obtained by analyzing historical data from data centers. We use $L^m_k$ to denote the lifetime of server $S^m_k$. In our framework, the dynamic utilization of servers caused by running jobs impacts the remaining server lifetime [26], [6]. In our work, we consider that the attenuation of a server's lifetime is influenced by the rise in temperature when jobs are executing. We define $\Delta L^m_k$ as the lifetime attenuation of server $S^m_k$, where $\Delta L^m_k = \theta(\Delta u^m_k) \times L^m_k$. Here $\theta(\Delta u^m_k)$ is a function that describes the relationship between the acceleration factor of server lifetime attenuation and the change in utilization before and after scheduling jobs.
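The paper leaves the acceleration function $\theta(\Delta u^m_k)$ abstract, so the sketch below only illustrates the bookkeeping a lifetime monitor could perform, using a simple linear placeholder for $\theta$; the initial lifetime and the coefficient are assumptions, not values from the paper.

```python
# Sketch of the server-lifetime bookkeeping. The linear theta() is a
# placeholder; the paper does not specify the acceleration function.
def theta(delta_u, alpha=0.001):
    """Placeholder acceleration factor for a utilization increase delta_u."""
    return alpha * max(delta_u, 0.0)

def apply_attenuation(remaining_lifetime, u_before, u_after):
    """Reduce the remaining lifetime by delta_L = theta(delta_u) * L."""
    delta_l = theta(u_after - u_before) * remaining_lifetime
    return remaining_lifetime - delta_l

lifetime_hours = 4.0 * 365 * 24        # assumed ~4-year remaining lifetime, in hours
lifetime_hours = apply_attenuation(lifetime_hours, u_before=0.2, u_after=0.7)
print("remaining lifetime (hours):", round(lifetime_hours, 1))
```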
We define $Q_k(t)$ as the local job queue of the type-$k$ cluster at time $t$, where $Q_k(t) \subseteq \pi(t)$. We design a central dispatcher in phase-2 to dispatch jobs to the clusters (step 10; the circled step numbers again refer to the workflow steps in Fig. 1). The dispatcher receives the job requests (step 8) and the cluster status information (step 9) to dispatch the jobs. In each type-$k$ cluster, we define a binary variable $Z^{km}_{cj}(t)$ to represent whether $J^j_c$ is placed on $S^m_k$ at $t$. In each type of cluster, we design a local job scheduler that receives information from the cluster monitor (step 12) and the local job requests (step 11) to schedule the time-sensitive jobs. For monitoring server lifetime in each cluster, we design a server lifetime monitor (step 14) and server replacement (step 15) to update the servers. For simplicity, we define $W^m_k(t)$ to indicate whether we should replace server $S^m_k$ based on its remaining lifetime.
We have two optimization objectives in phase-2. The first is to minimize the total carbon emission footprint when running jobs, defined as

$$E = \sum_{t \in \mathcal{T}} \sum_{k \in \mathcal{K}} \sum_{j \in Q_k(t)} \sum_{m \in \mathcal{N}_k} \rho \cdot Z^{km}_{cj}(t) \cdot E^m_k(J^j_c, t) \quad (7)$$

and the second objective is to minimize the total server replacement carbon cost, defined as

$$R = \sum_{t \in \mathcal{T}} \sum_{k \in \mathcal{K}} \sum_{m \in \mathcal{N}_k} W^m_k(t) \cdot M_k \quad (8)$$

Therefore, we formulate the phase-2 optimization problem, MSC (Minimal Carbon Footprint Scheduler with Server Replacement), as follows:

$$\min \; E + R \quad (9)$$

$$\text{s.t.} \quad \sum_{m \in \mathcal{N}_k} Z^{km}_{cj} \leq 1, \quad \forall k \in \mathcal{K}, \; \forall t \in \mathcal{T}, \; \forall j \in Q_k(t) \quad (10)$$

$$\sum_{j \in Q_k(t)} Z^{km}_{cj}(t) \cdot V^{cpu}(J^j_c) \leq R^{cpu}_k, \quad \forall k \in \mathcal{K}, \; \forall m \in \mathcal{N}_k, \; \forall t \in \mathcal{T} \quad (11)$$

$$\sum_{j \in Q_k(t)} Z^{km}_{cj}(t) \cdot V^{mem}(J^j_c) \leq R^{mem}_k, \quad \forall k \in \mathcal{K}, \; \forall m \in \mathcal{N}_k, \; \forall t \in \mathcal{T} \quad (12)$$

$$t + (\upsilon^j_c)_k + \Delta^j_c \leq s^j_c, \quad \forall j \in Q_k(t), \; \forall k \in \mathcal{K}, \; \forall t \in \mathcal{T} \quad (13)$$

$$Z^{km}_{cj}, \; W^m_k \in \{0, 1\}, \quad \rho > 0 \quad (14)$$

where (9) is our objective to minimize the sum of the operational phase carbon emission footprint and the server replacement manufacturing carbon footprint. We use $c$ to indicate the type of job $J^j_c$, and constraint (10) ensures that at each scheduling time slot each job can be scheduled by at most one server. Constraints (11)-(12) ensure that the resource demands of all jobs on each server cannot exceed the server resource capacity. Constraint (13) captures the deadline of each job. Finally, constraint (14) bounds the range of the variables.
Theorem 2: The MSC problem is NP-Hard.
Proof. At each time slot, the MSC can be reduced to a Multiple
Multidimensional Knapsack Problem (MMKP). In our formu-
lation, in each cluster, we have a number of homogeneous
servers that can be seen as a knapsack with two-dimensional
capacity bound. All the jobs in local job queue can be seen
as items with two-dimensional resource demand and deadline
constraints. The cost of each item is calculated through an
amortized carbon footprint. The key objective is to pack all
items (jobs) into the homogeneous knapsacks (servers) to
minimize the total carbon cost while satisfying the job deadline
and server capacity requirement. The MMKP is NP-Hard [27].
Therefore, the MSC problem is also NP-Hard.
Algorithm 3: MCS
Input : data center server clusters N = {N_1, ..., N_k}; online job requests π = {π(1), ..., π(t)}
Output: schedule results, carbon emission E, replacement manufacturing carbon footprint R
1  Function Main(N, π, T)
2    foreach t ∈ T do
3      CentralDispatch(N, π(t))
4      foreach k ∈ K do
5        if ServerLifetimeMonitor(N_k, t) then
6          R ← R + ServerReplace(S_k^m, t)
7        end
8        E ← E + LocalSchedule(Q_k(t), N_k)
9      end
10     cluster status update
11   end
12   return E + R
13 Function CentralDispatch(N, π(t))
14   A ← {A_1, ..., A_k},  (A_k) ← AscSort(A)
15   π(t) ← ddlAscSort(π(t))
16   foreach job ∈ π(t) do
17     dispatch the job to the first-fit cluster in (A_k)
18   end
19 Function LocalSchedule(Q_k(t), N_k)
20   A_k^m(t) ← {A_k^1, ..., A_k^m},  A_k^m(t) ← AscSort(A_k^m(t))
21   Q_k(t) ← SRJFSort(Q_k(t))
22   foreach job ∈ Q_k(t) do
23     put the job in the first-fit server in A_k^m(t)
24     update the server lifetime based on θ(Δu_k(t)) and A_k^m(t)
25   end
26   calculate E based on the finished jobs, return E
27 Function ServerLifetimeMonitor(N_k, t)
28   return W_k^m(t) == 1
29 Function ServerReplace(S_k^m, t)
30   return M_k
We use $A^m_k(t)$ to capture the amortized carbon footprint of server $S^m_k$ at $t$, which refers to the average carbon footprint of the server per unit time slot. Here, $A^m_k(t) = (\frac{M_k}{L^m_k} + P^m_k(u^m_k(t))) \cdot t$, and we define $A_k(t)$ as the average amortized carbon footprint of the cluster, where $A_k(t) = \sum_{m \in \mathcal{N}_k} A^m_k(t)$. These two metrics drive the design of the proposed heuristic algorithm.
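The following sketch computes the per-server amortized carbon footprint $A^m_k(t)$ as defined above and the cluster-level aggregate the dispatcher and local scheduler sort on; all parameter values are illustrative.

```python
# Sketch of the amortized carbon footprint used by the dispatcher and
# local scheduler. Server parameters are illustrative placeholders.
def amortized_footprint(m_k, remaining_lifetime, power_kw, t):
    """A_k^m(t) = (M_k / L_k^m + P(u(t))) * t, per the definition above."""
    return (m_k / remaining_lifetime + power_kw) * t

cluster = [
    {"m_k": 1146, "life": 30000.0, "power_kw": 0.14},   # server 1 of a type-k cluster
    {"m_k": 1146, "life": 12000.0, "power_kw": 0.20},   # server 2, older and busier
]
per_server = [amortized_footprint(s["m_k"], s["life"], s["power_kw"], t=1.0) for s in cluster]
cluster_amortized = sum(per_server)                      # cluster-level aggregate A_k(t)
print([round(a, 4) for a in per_server], round(cluster_amortized, 4))
# The local scheduler favors the server with the smaller A_k^m(t).
```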
TABLE I
SERVER PRODUCTION CARBON FOOTPRINT
Server Type           CPU cores   Memory (GB)   Carbon Footprint (kg CO2)
Dell PowerEdge R240   6           16            1167
Dell PowerEdge R330   4           16            1146
Dell PowerEdge R420   16          16            1165
Dell PowerEdge R540   16          32            1234

In Algorithm 3, at each scheduling time slot, our framework first dispatches the jobs in the global queue (line 3). In the CentralDispatch function (line 13), we first sort the cluster average amortized carbon footprints and the job deadlines in increasing order (lines 14-15). For each job, we find the first-fit cluster that satisfies its resource requirements and type constraints and dispatch it there (lines 16-18). After dispatching, for each cluster, we first check the lifetime of the servers (lines 5-7) using the ServerLifetimeMonitor function (lines 27-28); if some servers have expired, the algorithm replaces them (line 6). Then, we schedule the jobs in the local job queue (line 8). In the LocalSchedule function, we first sort the amortized cost of each server $S^m_k$ and the execution durations of the jobs in increasing order (lines 20-21). We then schedule the shortest-duration job on the server with the lowest amortized carbon footprint at $t$ and update the server lifetime and amortized cost values (lines 23-24). We calculate the carbon emission $E$ based on the server energy consumption (line 26). After all clusters update their status, the algorithm moves to the next time slot. Finally, the algorithm calculates the total carbon emission footprint and server replacement manufacturing carbon footprint ($E + R$) (line 12).
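To make the MCS heuristic concrete, the sketch below mirrors the structure of Algorithm 3 at a single time slot: jobs are dispatched in deadline order to the cheapest (lowest average amortized footprint) cluster that fits them, and a local scheduler would then place each cluster's queue shortest-job-first on its cheapest servers. It is a simplified, single-slot reading with illustrative data, not the authors' implementation.

```python
# Simplified single-slot sketch of the MCS heuristic (Algorithm 3):
# deadline-ordered dispatch to the cheapest fitting cluster, then
# shortest-job-first placement inside each cluster. Data is illustrative.
def first_fit(candidates, cpu, mem):
    """Return the first candidate (already sorted by amortized cost) that fits."""
    for cand in candidates:
        if cand["free_cpu"] >= cpu and cand["free_mem"] >= mem:
            return cand
    return None

clusters = [
    {"name": "type-1", "amortized": 0.9, "free_cpu": 8, "free_mem": 24},
    {"name": "type-2", "amortized": 1.4, "free_cpu": 16, "free_mem": 64},
]
jobs = [
    {"id": "j1", "cpu": 4, "mem": 8, "runtime": 2, "deadline": 5},
    {"id": "j2", "cpu": 8, "mem": 32, "runtime": 1, "deadline": 3},
]

# Central dispatch: clusters by ascending amortized footprint, jobs by deadline.
clusters.sort(key=lambda c: c["amortized"])
for job in sorted(jobs, key=lambda j: j["deadline"]):
    target = first_fit(clusters, job["cpu"], job["mem"])
    if target is not None:
        target.setdefault("queue", []).append(job)
        target["free_cpu"] -= job["cpu"]
        target["free_mem"] -= job["mem"]
        print(job["id"], "->", target["name"])
# A local scheduler would then order each cluster's queue shortest-job-first
# and place jobs on the servers with the lowest per-server amortized cost.
```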
IV. PERFORMANCE EVALUATION
In this section, we present the experimental evaluations of
the proposed SusOpt framework. We begin by introducing our
experiment setup.
A. Experiment Setup
Data Center Server Configuration: In our experimental setup, we use 4 different types of server configurations. Table I
describes the characteristics of the server candidates. We
extract the configuration (CPU cores, memory capacity) of the
servers and the manufacturing carbon footprint based on the
real-world data in [28]. The manufacturing carbon footprint
value is based on the specific configuration of each type of
server. The energy consumption value of each configuration
is based on the SPEC architecture performance report [29].
We use the data center server lifetime distribution from the
Google Data Center [6] in the simulation.
Workload Trace: In our experiment, we use a real world data
center trace [30] to model the characteristic of the historical
workload. In the capacity planning phase, we first apply the k-
means [31] algorithm to characterize the workload demand in
terms of CPU and memory capacity. We generate the historical
workload distribution based on the characterization. We then
predict and generate the future workload requests using the
Auto-ARIMA model [19].
Synthetic Job Profile: We synthesize the set of jobs based on the workload type distribution from the output of the k-means clustering of the historical workload trace. The CPU requirement, memory requirement, and expected running time of each job are generated by randomly sampling from the k-means cluster that the job belongs to. In the synthesized job set, each job includes 6 attributes, namely job ID, CPU requirement, memory requirement, arrival time, start time, and the expected running time on each type of server. We use Amdahl's Law [21], [20] to estimate the running time on all 4 types of servers.

TABLE II
ALGORITHMS
Algorithms       Manufacturing Aware   Online Emission Aware   Replacement Aware   Deadline Aware
MCCADL-MCS       yes                   yes                     yes                 yes
MCCADL-MCSWR     yes                   yes                     no                  yes
MCCADL-NC        yes                   no                      no                  yes
NC-MCSWR         no                    yes                     no                  yes
NC-NC            no                    no                      no                  yes

Fig. 2. Impact of Job Deadline Slack Percentage: (a) Life Cycle Carbon Footprint, (b) Job Success Rate, (c) Job Response Time, (d) Effective Server Utilization.

Fig. 3. Impact of Job Runtime Estimation Error Percentage: (a) Life Cycle Carbon Footprint, (b) Job Success Rate, (c) Job Response Time, (d) Effective Server Utilization.
Metrics: We use four metrics to compare the life-cycle carbon
footprint and performance of the proposed framework: (1)
normalized life-cycle carbon footprint: this metric includes
the manufacturing carbon footprint in the capacity planning
phase and the carbon emission footprint and server replace-
ment carbon footprint in the operational phase. Resource
management algorithms that require lower carbon footprint are
more sustainable. (2) job success rate: this metric captures
the percentage of jobs that successfully meet their deadline
requirements. (3) job response time: this metric captures the
sum of queuing delay and execution time of the job. Low job
response time indicates good performance. (4) effective server
utilization: this metric captures the cluster utilization when
using different capacity planning and scheduling techniques.
Good capacity planning leads to higher effective server uti-
lization.
Algorithms and baseline: In our evaluation, we employ five
different heuristics for comparing the performance. We use two
algorithms in phase-1: (1) MCCADL: Our proposed deadline-
aware minimal manufacturing impact capacity planning. (2)
NC: Deadline-aware and space aware first-fit capacity planning
[32] that does not consider manufacturing carbon footprint.
We have three algorithms in phase-2: (1) MCS: Our proposed
deadline-aware and carbon emission-aware scheduler that also
minimizes server replacement cost. (2) MCSWR: Deadline-
aware and carbon emission aware scheduler [33] that does not
minimize server replacement cost. (3) NC: Earliest deadline
first scheduler that does not consider carbon emission and
server replacement. For our evaluation, we use five combinations of these techniques: MCCADL-MCS, MCCADL-MCSWR, MCCADL-NC, NC-MCSWR, and NC-NC. MCCADL-MCS is our optimal solution that jointly minimizes manufac-
turing phase carbon footprint and operational carbon emission
footprint while considering server replacement. MCCADL-
MCSWR, MCCADL-NC, and NC-MCSWR are baseline solutions that independently reduce the manufacturing phase carbon footprint and the operational phase carbon emission footprint without considering the server replacement impact. We also note that NC-NC does
not consider carbon footprint in both manufacturing phase and
operational phase.
B. Experimental Results
Impact of Job Deadline: The first set of experiments evaluates
the effect of job deadlines on different heuristic combinations.
Fig. 4. Impact of Workload Prediction Error: (a) Life Cycle Carbon Footprint, (b) Job Success Rate, (c) Job Response Time, (d) Effective Server Utilization.
Fig. 5. Impact of Workload Intensity Factor: (a) Life Cycle Carbon Footprint, (b) Job Success Rate, (c) Job Response Time, (d) Effective Server Utilization.
In the simulation, the deadline slack is computed as (deadline − submission time − runtime) / runtime × 100 (i.e., a slack of 60%
indicates that the scheduler has a window 60% longer than the
running time of the job) [34]. We compare the normalized life-
cycle carbon footprint in Fig-2(a). In comparison with other
techniques, the proposed approach MCCADL-MCS achieves
the lowest life-cycle carbon footprint under different deadline
slack settings. The trend also indicates that the proposed
capacity planning algorithm provisions fewer servers
when the deadline slack is higher. This contributes to the
lower life-cycle carbon footprint in the data center. From
Fig-2(b)(c)(d), we find that the MCCADL-MCS approach
achieves similar performance for job success rate and job
response time compared to the other approaches. In summary,
the proposed approach, MCCADL-MCS achieves the lowest
life-cycle carbon footprint impact while maintaining similar
runtime performance as the other techniques.
Impact of Job Runtime Estimation Error: This set of exper-
iments evaluates the proposed framework under different job
runtime estimation errors. In this set of experiments, we set
the deadline slack as 0.6. The estimation error model follows the approach in [34]. The x-axis in Fig. 3 is the shift x in the job runtime distribution N(µ = runtime·(1 + x), δ = runtime·x). From Fig-3(a), we observe that the proposed solution, MCCADL-MCS, achieves the lowest life-cycle carbon footprint. The number of servers increases as the runtime error shift becomes larger, since the error introduces uncertainty that the capacity planning and job scheduling need to handle. In addition, the runtime distribution error shift
also leads to poor scheduler performance. In Fig-3(b)(c)(d),
we observe that the job success rate becomes lower and the
job response time becomes higher when the estimation error
increases. Similarly, the effective server utilization decreases
when the error increases.
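As an illustration of this error-injection model, the short sketch below draws perturbed runtime estimates from N(µ = runtime·(1 + x), δ = runtime·x) for a given shift x; it is a minimal reading of the setup in [34], not the exact evaluation harness.

```python
# Minimal sketch of the runtime-estimate error injection used in this
# experiment: estimates drawn from N(mean=runtime*(1+x), std=runtime*x).
import numpy as np

def noisy_runtime_estimates(true_runtime, x, n=5, seed=0):
    rng = np.random.default_rng(seed)
    return rng.normal(loc=true_runtime * (1 + x), scale=true_runtime * x, size=n)

print(noisy_runtime_estimates(true_runtime=100.0, x=0.2).round(1))
```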
Impact of Workload Prediction Error: This set of experiments
evaluates the performance under different levels of error in
workload prediction. We use the approach described in [35]
to inject the error in the workload prediction process. From
Fig-4, we can see that the workload prediction error does not
have significant impact on the performance as it only impacts
the workload characterization. From Fig-4(a), we can see that
the proposed solution, MCCADL-MCS has the lowest life-
cycle carbon footprint value under different prediction error
percentage settings. From Fig-4(b)(c)(d), we observe that the
proposed solution maintains similar performance as the other
four techniques.
Impact of Workload Intensity: The final set of experiments
evaluates the performance under different levels of workload
intensity. In this experiment, we use the settings from [33].
Based on the workload intensity factor, we scale the origi-
nal workload requests at each time slot. From Fig-5(a), we
observe that the proposed solution, MCCADL-MCS has the
lowest life-cycle carbon footprint value compared to the other
techniques under different workload intensity factor settings.
From Fig-5(b)(c)(d), we observe that the proposed solution
maintains similar performance metrics compared to the other
four techniques.
V. RELATED WORK
Several existing works have focused on sustainability-aware
resource provisioning in data centers. Berral et al. [13] pro-
posed an optimization framework for designing green data
centers for follow-the-renewable HPC cloud services. Enreal
[36] is an energy-aware resource allocation framework for sci-
entific workflow executions. Beloglazov et al. [37] developed
an energy efficient virtual machine allocation algorithm to
minimize power consumption through consolidation. Liu et
al. [38] integrated renewable energy supply with IT workload
planning to reduce data center electricity cost and environ-
mental impact. Zhang et al. [18] proposed a control-theoretic
solution to dynamically provision servers to minimize total
energy cost while meeting the task delay. Unlike our proposed
framework, these works do not consider the sustainability
impact in the data center manufacturing phase.
Sustainability-aware scheduling techniques in data center
were also investigated in the past. Tang et al. [39] proposed
a thermal-aware task scheduling algorithm for homogeneous
data center to minimize the inlet temperatures. Berral et al.
[40] proposed a machine learning based scheduling framework
to deal with uncertain information while maximizing the
performance in data centers. Goiri et al. [41] have developed
an energy-aware dynamically job scheduling framework to
minimize power consumption. Khosravi et al. [33] proposed a
virtual machine scheduling algorithm to minimize the carbon
emission footprint in geo-distributed data centers. Unlike our
work, these solutions only focused on the operational phase
optimization without considering the impacts due to server
replacement.
There are several previous works that focused on the life
cycle assessment for data centers. Marwah et al. [8] proposed
an exergy-based life cycle assessment approach to quantify the
sustainability impact for data centers. Chang et al. [10] pro-
posed a methodology for data center server life cycle analysis
from a thermodynamic viewpoint and discussed the sustain-
ability bottlenecks. However, these techniques only focused
on sustainability-aware server design and did not consider the
data center design and operations. Zhang et al. [16] designed a
new data center network topology to minimize the complexity
in data center life cycle from a network perspective. Unlike
our proposed framework, this technique focused on optimizing
data center complexity and did not optimize for environmental
metrics. To the best of our knowledge, the work presented
in this paper is the first significant effort to jointly optimize
manufacturing phase and operational phase impacts while
closely considering impacts due to server replacement during
the operational phase.
VI. CONCLUSION
In this paper, we proposed a two-phase sustainability-aware
resource allocation and management framework for data center
life-cycle management that jointly optimizes the data center
manufacturing phase and operational phase impact without
impacting the performance and service quality for the jobs.
Phase 1 of the proposed approach minimizes data center
building phase carbon footprint through a novel manufacturing
cost-aware server provisioning plan. In phase 2, the proposed
approach minimizes the operational phase carbon footprint
using a server lifetime-aware resource allocation scheme and
a manufacturing cost-aware replacement plan. We evaluated
the performance of the proposed framework through extensive
simulation-based experiments using real world workload trace.
The evaluation results show that the proposed approach signif-
icantly reduces the carbon footprint in the data center without
impacting the performance of the jobs in the workload.
VII. ACKNOWLEDGEMENT
We acknowledge the support for this work through a grant
from the Central Research Development Fund (CRDF) at the
University of Pittsburgh.
REFERENCES
[1] A. K. Mishra, J. L. Hellerstein, W. Cirne, and C. R. Das, “Towards
characterizing cloud backend workloads: insights from google compute
clusters,” ACM SIGMETRICS Performance Evaluation Review, vol. 37,
no. 4, pp. 34–41, 2010.
[2] S. S. Gill and R. Buyya, “A taxonomy and future directions for sus-
tainable cloud computing: 360 degree view,” ACM Computing Surveys
(CSUR), vol. 51, no. 5, pp. 1–33, 2018.
[3] F. J. Mesa-Martinez, E. K. Ardestani, and J. Renau, “Characterizing
processor thermal behavior,” ACM SIGARCH Computer Architecture
News, vol. 38, no. 1, pp. 193–204, 2010.
[4] J. Srinivasan, S. V. Adve, P. Bose, and J. A. Rivers, “Lifetime reliability:
Toward an architectural solution,” IEEE Micro, vol. 25, no. 3, pp. 70–80,
2005.
[5] N. El-Sayed, I. A. Stefanovici, G. Amvrosiadis, A. A. Hwang, and
B. Schroeder, “Temperature management in data centers: why some
(might) like it hot,” in Proceedings of the 12th ACM SIGMET-
RICS/PERFORMANCE joint international conference on Measurement
and Modeling of Computer Systems, 2012, pp. 163–174.
[6] K. V. Vishwanath and N. Nagappan, “Characterizing cloud computing
hardware reliability,” in Proceedings of the 1st ACM symposium on
Cloud computing. ACM, 2010, pp. 193–204.
[7] P. Hale, “Acceleration and time to fail,” Quality and Reliability Engi-
neering International, vol. 2, no. 4, pp. 259–262, 1986.
[8] M. Marwah, P. Maciel, A. Shah, R. Sharma, T. Christian, V. Almeida,
C. Araújo, E. Souza, G. Callou, B. Silva et al., “Quantifying the
sustainability impact of data center availability,” ACM SIGMETRICS
Performance Evaluation Review, vol. 37, no. 4, pp. 64–68, 2010.
[9] J. Chang, J. Meza, P. Ranganathan, C. Bash, and A. Shah, “Green server
design: Beyond operational energy to sustainability,” Memory, vol. 4,
no. 10, p. 50.
[10] J. Chang, J. Meza, P. Ranganathan, A. Shah, R. Shih, and C. Bash, “To-
tally green: evaluating and designing servers for lifecycle environmental
impact,” ACM SIGARCH Computer Architecture News, vol. 40, no. 1,
pp. 25–36, 2012.
[11] M. Stutz, S. O’Connell, and J. Pflueger, “Carbon footprint of a dell rack
server,” in Electronics Goes Green 2012+(EGG), 2012. IEEE, 2012,
pp. 1–5.
[12] C. Ren, D. Wang, B. Urgaonkar, and A. Sivasubramaniam, “Carbon-
aware energy capacity planning for datacenters,” in 2012 IEEE 20th
International Symposium on Modeling, Analysis and Simulation of
Computer and Telecommunication Systems. IEEE, 2012, pp. 391–400.
[13] J. L. Berral, Í. Goiri, T. D. Nguyen, R. Gavalda, J. Torres, and
R. Bianchini, “Building green cloud services at low cost,” in 2014
IEEE 34th International Conference on Distributed Computing Systems.
IEEE, 2014, pp. 449–460.
[14] I. Goiri, K. Le, J. Guitart, J. Torres, and R. Bianchini, “Intelligent place-
ment of datacenters for internet services,” in 2011 31st International
Conference on Distributed Computing Systems. IEEE, 2011, pp. 131–
142.
[15] Y. Gao, Z. Zeng, X. Liu, and P. Kumar, “The answer is blowing in the
wind: Analysis of powering internet data centers with wind energy,” in
2013 Proceedings IEEE INFOCOM. IEEE, 2013, pp. 520–524.
[16] M. Zhang, R. N. Mysore, S. Supittayapornpong, and R. Govindan,
“Understanding lifecycle management complexity of datacenter
topologies,” in 16th USENIX Symposium on Networked Systems
Design and Implementation (NSDI 19). Boston, MA: USENIX
Association, Feb. 2019, pp. 235–254. [Online]. Available:
https://www.usenix.org/conference/nsdi19/presentation/zhang
[17] P. X. Gao, A. R. Curtis, B. Wong, and S. Keshav, “It’s not easy being
green,” in Proceedings of the ACM SIGCOMM 2012 conference on
Applications, technologies, architectures, and protocols for computer
communication. ACM, 2012, pp. 211–222.
[18] Q. Zhang, M. F. Zhani, S. Zhang, Q. Zhu, R. Boutaba, and J. L.
Hellerstein, “Dynamic energy-aware capacity provisioning for cloud
computing environments,” in Proceedings of the 9th international con-
ference on Autonomic computing. ACM, 2012, pp. 145–154.
[19] A. Khan, X. Yan, S. Tao, and N. Anerousis, “Workload characterization
and prediction in the cloud: A multiple time series approach,” in Network
Operations and Management Symposium (NOMS), 2012 IEEE. IEEE,
2012, pp. 1287–1294.
[20] G. M. Amdahl, “Validity of the single processor approach to achieving
large scale computing capabilities,” in Proceedings of the April 18-20,
1967, spring joint computer conference, 1967, pp. 483–485.
[21] S. M. Zahedi, Q. Llull, and B. C. Lee, “Amdahl’s law in the datacenter
era: A market for fair processor allocation,” in 2018 IEEE Interna-
tional Symposium on High Performance Computer Architecture (HPCA).
IEEE, 2018, pp. 1–14.
[22] B. T. Han, G. Diehr, and J. S. Cook, “Multiple-type, two-dimensional bin
packing problems: Applications and algorithms,” Annals of Operations
Research, vol. 50, no. 1, pp. 239–261, 1994.
[23] A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and
I. Stoica, “Dominant resource fairness: Fair allocation of multiple
resource types.” in Nsdi, vol. 11, no. 2011, 2011, pp. 24–24.
[24] X. Fan, W.-D. Weber, and L. A. Barroso, “Power provisioning for a
warehouse-sized computer,” in ACM SIGARCH computer architecture
news, vol. 35, no. 2. ACM, 2007, pp. 13–23.
[25] [Online]. Available: https://www.epa.gov/energy/greenhouse-gases-equivalencies-calculator-calculations-and-references
[26] G. Wang, L. Zhang, and W. Xu, “What can we learn from four years
of data center hardware failures?” in 2017 47th Annual IEEE/IFIP
International Conference on Dependable Systems and Networks (DSN).
IEEE, 2017, pp. 25–36.
[27] Y. Song, C. Zhang, and Y. Fang, “Multiple multidimensional knapsack
problem and its applications in cognitive radio networks,” in MILCOM
2008-2008 IEEE Military Communications Conference. IEEE, 2008,
pp. 1–7.
[28] [Online]. Available: https://www.dell.com/learn/us/en/uscorp1/corp-comm/environment_carbon_footprint_products
[29] [Online]. Available: https://www.spec.org/contents.html
[30] [Online]. Available: https://github.com/alibaba/clusterdata/blob/master/
cluster-trace-v2018/trace 2018.md
[31] C. Jiang, G. Han, J. Lin, G. Jia, W. Shi, and J. Wan, “Characteristics
of co-allocated online services and batch jobs in internet data centers: a
case study from alibaba cloud,” IEEE Access, vol. 7, pp. 22495–22508,
2019.
[32] H. I. Christensen, A. Khan, S. Pokutta, and P. Tetali, “Multidimensional
bin packing and other related problems: A survey.”
[33] A. Khosravi, S. K. Garg, and R. Buyya, “Energy and carbon-efficient
placement of virtual machines in distributed cloud data centers,” in
European Conference on Parallel Processing. Springer, 2013, pp. 317–
328.
[34] J. W. Park, A. Tumanov, A. Jiang, M. A. Kozuch, and G. R. Ganger,
“3sigma: distribution-based cluster scheduling for runtime uncertainty,”
in Proceedings of the Thirteenth EuroSys Conference, 2018, pp. 1–17.
[35] Y. Wu, K. Hwang, Y. Yuan, and W. Zheng, “Adaptive workload predic-
tion of grid performance in confidence windows,” IEEE Transactions on
Parallel and Distributed Systems, vol. 21, no. 7, pp. 925–938, 2009.
[36] X. Xu, W. Dou, X. Zhang, and J. Chen, “Enreal: An energy-aware
resource allocation method for scientific workflow executions in cloud
environment,” IEEE Transactions on Cloud Computing, vol. 4, no. 2,
pp. 166–179, 2015.
[37] A. Beloglazov and R. Buyya, “Energy efficient allocation of virtual
machines in cloud data centers,” in 2010 10th IEEE/ACM International
Conference on Cluster, Cloud and Grid Computing. IEEE, 2010, pp.
577–578.
[38] Z. Liu, Y. Chen, C. Bash, A. Wierman, D. Gmach, Z. Wang, M. Marwah,
and C. Hyser, “Renewable and cooling aware workload management for
sustainable data centers,” in ACM SIGMETRICS Performance Evalua-
tion Review, vol. 40, no. 1. ACM, 2012, pp. 175–186.
[39] Q. Tang, S. K. Gupta, and G. Varsamopoulos, “Thermal-aware task
scheduling for data centers through minimizing heat recirculation,” in
2007 IEEE International Conference on Cluster Computing. IEEE,
2007, pp. 129–138.
[40] J. L. Berral, Í. Goiri, R. Nou, F. Julià, J. Guitart, R. Gavaldà, and
J. Torres, “Towards energy-aware scheduling in data centers using
machine learning,” in Proceedings of the 1st International Conference
on energy-Efficient Computing and Networking, 2010, pp. 215–224.
[41] I. Goiri, F. Julia, R. Nou, J. L. Berral, J. Guitart, and J. Torres, “Energy-
aware scheduling in virtualized datacenters,” in 2010 IEEE International
Conference on Cluster Computing. IEEE, 2010, pp. 58–67.
Article
Full-text available
In order to reduce power and energy costs, giant cloud providers now mix online and batch jobs on the same cluster. Although the co-allocation of such jobs improves machine utilization, it challenges the data center scheduler and workload assignment in terms of quality of service, fault tolerance, and failure recovery, especially for latency critical online services. In this paper, we explore various characteristics of co-allocated online services and batch jobs from a production cluster containing 1.3k servers in Alibaba Cloud. From the trace data we find the following:(1) For batch jobs with multiple tasks and instances, 50.8% failed tasks wait and halted after a very long time interval when their first and the only one instance fails. This wastes much time and resources as the remaining instances are running for an impossible successful termination. (2) For online services jobs, they are clustered in 25 categories according to their requested CPU, memory, and disk resources. Such clustering can help co-allocation of online services jobs with batch jobs. (3) Servers are clustered into 7 groups by CPU utilization, memory utilization, and their correlations. Machines with strong correlation between CPU and memory utilization provides opportunity for job co-allocation and resource utilization estimation. (4) The MTBF (mean time between failures) of instances are in the interval [400, 800] seconds while the average completion time of the 99th percentile is 1003 seconds. We also compare the cumulative distribution functions of jobs and servers and explain the differences and opportunities for workload assignment between them. Our findings and insights presented in this paper can help the community and data center operators better understand the workload characteristics, improve resource utilization, and failure recovery design.
Article
Full-text available
The cloud computing paradigm offers on-demand services over the Internet and supports a wide variety of applications. With the recent growth of Internet of Things (IoT) based applications the usage of cloud services is increasing exponentially. The next generation of cloud computing must be energy-efficient and sustainable to fulfil the end-user requirements which are changing dynamically. Presently, cloud providers are facing challenges to ensure the energy efficiency and sustainability of their services. The usage of large number of cloud datacenters increases cost as well as carbon footprints, which further effects the sustainability of cloud services. In this paper, we propose a comprehensive taxonomy of sustainable cloud computing. The taxonomy is used to investigate the existing techniques for sustainability that need careful attention and investigation as proposed by several academic and industry groups. Further, the current research on sustainable cloud computing is organized into several categories: application design, sustainability metrics, capacity planning, energy management, virtualization, thermal-aware scheduling, cooling management, renewable energy and waste heat utilization. The existing techniques have been compared and categorized based on the common characteristics and properties. A conceptual model for sustainable cloud computing has been proposed along with discussion on future research directions.
Conference Paper
The 3Sigma cluster scheduling system uses job runtime histories in a new way. Knowing how long each job will execute enables a scheduler to more effectively pack jobs with diverse time concerns (e.g., deadline vs. the-sooner-the-better) and placement preferences on heterogeneous cluster resources. But, existing schedulers use single-point estimates (e.g., mean or median of a relevant subset of historical runtimes), and we show that they are fragile in the face of real-world estimate error profiles. In particular, analysis of job traces from three different large-scale cluster environments shows that, while the runtimes of many jobs can be predicted well, even state-of-the-art predictors have wide error profiles with 8--23% of predictions off by a factor of two or more. Instead of reducing relevant history to a single point, 3Sigma schedules jobs based on full distributions of relevant runtime histories and explicitly creates plans that mitigate the effects of anticipated runtime uncertainty. Experiments with workloads derived from the same traces show that 3Sigma greatly outperforms a state-of-the-art scheduler that uses point estimates from a state-of-the-art predictor; in fact, the performance of 3Sigma approaches the end-to-end performance of a scheduler based on a hypothetical, perfect runtime predictor. 3Sigma reduces SLO miss rate, increases cluster goodput, and improves or matches latency for best effort jobs.
Article
The environmental impact of servers and datacenters is an important future challenge. System architects have traditionally focused on operational energy as a proxy for designing green servers, but this ignores important environmental implications from server production (materials, manufacturing, etc.). In contrast, this paper argues for a lifecycle focus on the environmental impact of future server designs, to include both operation and production. We present a new methodology to quantify the total environmental impact of system design decisions. Our approach uses the thermodynamic metric of exergy consumption, adapted and validated for use by system architects. Using this methodology, we evaluate the lifecycle impact of several example system designs with environment-friendly optimizations. Our results show that environmental impact from production can be important (around 20% on current servers and growing) and system design choices can reduce this component (by 30--40%). Our results also highlight several, sometimes unexpected, cross-interactions between the environmental impact of production and operation that further motivate a total lifecycle emphasis for future green server designs.
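A quick arithmetic check of the headline numbers in this abstract (the ~20% production share and the 30-40% production-side reduction), written as a small Python sketch; the 35% midpoint is our assumption.

operation = 0.80          # operational share of lifecycle impact
production = 0.20         # production share (~20% on current servers, per the abstract)
prod_reduction = 0.35     # assumed midpoint of the cited 30-40% reduction

new_total = operation + production * (1 - prod_reduction)
print(f"lifecycle impact falls to {new_total:.2f} of baseline "
      f"({(1 - new_total) * 100:.0f}% saved from production-side choices alone)")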
Article
Scientific workflows are often deployed across multiple cloud computing platforms because of their large scale. Technically, this can be achieved by expanding a cloud platform. However, it remains a challenge to execute scientific workflows in an energy-aware fashion across cloud platforms, or even inside a single platform, because platform expansion makes energy consumption a major concern. In this paper, we propose an Energy-aware Resource Allocation method, named EnReal, to address this challenge. Basically, we leverage the dynamic deployment of virtual machines for scientific workflow execution. Specifically, an energy consumption model is presented for applications deployed across cloud computing platforms, and a corresponding energy-aware resource allocation algorithm is proposed for virtual machine scheduling to accomplish scientific workflow executions. Experimental evaluation demonstrates that the proposed method is both effective and efficient.
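In the same spirit, a minimal Python sketch of energy-aware VM-to-host assignment: place each VM on the host where its marginal energy is lowest under a linear utilisation-to-power model. The hosts, power figures, and greedy rule are assumptions for illustration, not the EnReal algorithm itself.

# Hypothetical hosts: idle/peak power (W), total cores, cores already in use.
hosts = [
    {"idle": 100.0, "peak": 250.0, "cores": 32, "used": 8},
    {"idle": 120.0, "peak": 300.0, "cores": 64, "used": 40},
]

def power(h, used):
    # Linear utilisation-to-power model (a common simplification).
    return h["idle"] + (h["peak"] - h["idle"]) * (used / h["cores"])

def marginal_energy(h, vm_cores, runtime_s):
    # Extra energy (joules) the host would draw if this VM ran on it.
    return (power(h, h["used"] + vm_cores) - power(h, h["used"])) * runtime_s

def place(vm_cores, runtime_s):
    feasible = [h for h in hosts if h["used"] + vm_cores <= h["cores"]]
    best = min(feasible, key=lambda h: marginal_energy(h, vm_cores, runtime_s))
    best["used"] += vm_cores
    return best

print(place(vm_cores=4, runtime_s=3600))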
Article
The advent of cloud computing promises highly available, efficient, and flexible computing services for applications such as web search, email, voice over IP, and web search alerts. Our experience at Google is that realizing the promises of cloud computing requires an extremely scalable backend consisting of many large compute clusters that are shared by application tasks with diverse service level requirements for throughput, latency, and jitter. These considerations impact (a) capacity planning to determine which machine resources must grow and by how much and (b) task scheduling to achieve high machine utilization and to meet service level objectives. Both capacity planning and task scheduling require a good understanding of task resource consumption (e.g., CPU and memory usage). This in turn demands simple and accurate approaches to workload classification: determining how to form groups of tasks (workloads) with similar resource demands. One approach to workload classification is to make each task its own workload. However, this approach scales poorly since tens of thousands of tasks execute daily on Google compute clusters. Another approach to workload classification is to view all tasks as belonging to a single workload. Unfortunately, applying such a coarse-grain workload classification to the diversity of tasks running on Google compute clusters results in large variances in predicted resource consumptions. This paper describes an approach to workload classification and its application to the Google Cloud Backend, arguably the largest cloud backend on the planet. Our methodology for workload classification consists of: (1) identifying the workload dimensions; (2) constructing task classes using an off-the-shelf algorithm such as k-means; (3) determining the break points for qualitative coordinates within the workload dimensions; and (4) merging adjacent task classes to reduce the number of workloads. We use the foregoing, especially the notion of qualitative coordinates, to glean several insights about the Google Cloud Backend: (a) the duration of task executions is bimodal in that tasks either have a short duration or a long duration; (b) most tasks have short durations; and (c) most resources are consumed by a few tasks with long duration that have large demands for CPU and memory.
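A minimal Python sketch of steps (1) and (2) of the methodology described above: choose resource dimensions and construct task classes with an off-the-shelf k-means. The synthetic (CPU, memory) demand vectors and scikit-learn are our choices for illustration.

import numpy as np
from sklearn.cluster import KMeans

# Step 1: workload dimensions -- normalised (CPU, memory) demand per task.
rng = np.random.default_rng(0)
tasks = np.vstack([
    rng.normal([0.1, 0.1], 0.02, size=(500, 2)),  # many small, short tasks
    rng.normal([0.7, 0.6], 0.05, size=(50, 2)),   # a few large, long-running tasks
])

# Step 2: construct task classes with an off-the-shelf algorithm (k-means).
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(tasks)
for c in range(2):
    members = tasks[km.labels_ == c]
    print(f"class {c}: {len(members)} tasks, "
          f"mean CPU={members[:, 0].mean():.2f}, mean mem={members[:, 1].mean():.2f}")

# Steps (3) and (4) -- qualitative break points and merging adjacent classes --
# would follow by inspecting and coarsening these centroids.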
Conference Paper
Interest in powering data centers at least partially using on-site renewable sources, e.g., solar or wind, has been growing. In fact, researchers have studied distributed services comprising networks of such "green" data centers, and load distribution approaches that "follow the renewables" to maximize their use. However, prior works have not considered where to site such a network for efficient production of renewable energy while minimizing both data center and renewable plant building costs. Moreover, researchers have not built real load management systems for follow-the-renewables services. Thus, in this paper, we propose a framework, optimization problem, and solution approach for siting and provisioning green data centers for a follow-the-renewables HPC cloud service. We illustrate the location-selection tradeoffs by quantifying the minimum cost of achieving different amounts of renewable energy. Finally, we design and implement a system capable of migrating virtual machines across the green data centers to follow the renewables. Among other interesting results, we demonstrate that one can build green HPC cloud services at a relatively low additional cost compared to existing services.
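A toy Python sketch of the siting trade-off described above: brute-force the cheapest set of candidate sites that covers demand while meeting a renewable-energy target. All site names, costs, and energy figures are made up for illustration; the paper's actual formulation and solver are more sophisticated.

from itertools import combinations

# Hypothetical candidates: (build cost $M, annual renewable MWh, annual MWh served)
sites = {
    "solar_az": (12.0, 40_000, 50_000),
    "wind_tx":  (14.0, 55_000, 50_000),
    "hydro_wa": (18.0, 70_000, 50_000),
}

def best_plan(total_demand_mwh, renewable_target):
    best = None
    for k in range(1, len(sites) + 1):
        for combo in combinations(sites, k):
            cost = sum(sites[s][0] for s in combo)
            green = sum(sites[s][1] for s in combo)
            capacity = sum(sites[s][2] for s in combo)
            if capacity >= total_demand_mwh and green >= renewable_target * total_demand_mwh:
                if best is None or cost < best[0]:
                    best = (cost, combo)
    return best

print(best_plan(total_demand_mwh=90_000, renewable_target=0.8))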
Conference Paper
Due to the increasing use of Cloud computing services and the amount of energy used by data centers, there is growing interest in reducing the energy consumption and carbon footprint of data centers. Cloud data centers use virtualization technology to host multiple virtual machines (VMs) on a single physical server. By applying efficient VM placement algorithms, Cloud providers are able to enhance energy efficiency and reduce carbon footprint. Previous works have focused on reducing the energy used within a single or multiple data centers without considering their energy sources and Power Usage Effectiveness (PUE). In contrast, this paper proposes a novel VM placement algorithm that increases environmental sustainability by taking into account distributed data centers with different carbon footprint rates and PUEs. Simulation results show that the proposed algorithm reduces CO2 emissions and power consumption while maintaining the same level of quality of service as other competitive algorithms.
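As a concrete illustration of combining carbon intensity with PUE when ranking sites, here is a minimal Python sketch; the data center names, grid intensities, and per-VM power figure are assumptions, not values from the paper.

# Hypothetical distributed data centers: (gCO2 per kWh of grid mix, PUE)
datacenters = {
    "dc_coal_heavy": (820.0, 1.7),
    "dc_mixed":      (450.0, 1.4),
    "dc_hydro":      (24.0,  1.2),
}

VM_POWER_KW = 0.15  # assumed average server power attributable to one VM

def co2_per_hour(dc_name):
    intensity, pue = datacenters[dc_name]
    # Facility overhead (PUE) scales the IT power before applying grid intensity.
    return VM_POWER_KW * pue * intensity  # grams CO2 per VM-hour

# Carbon-aware placement heuristic: prefer the cleanest feasible site.
for dc in sorted(datacenters, key=co2_per_hour):
    print(f"{dc}: {co2_per_hour(dc):.0f} gCO2 per VM-hour")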