Effective Dynamic Voltage Scaling
through CPU-Boundedness Detection∗
Chung-Hsing Hsu and Wu-chun Feng
{chunghsu,feng}@lanl.gov
Advanced Computing Laboratory
Los Alamos National Laboratory
Los Alamos, NM 87545
Keywords: Power-aware computing, dynamic voltage
scaling, interval-based voltage scheduling, performance
modeling, power-performance tradeoff.
Abstract
Dynamic voltage scaling (DVS) allows a program to ex-
ecute at a non-peak CPU frequency in order to reduce
CPU power, and hence, energy consumption; however,
it is done at the cost of performance degradation. For
a program whose execution time is bounded by the
performance of peripherals rather than by the CPU speed,
applying DVS results in a negligible performance
penalty. Unfortunately, existing DVS-based
power-management algorithms are conservative in the
sense that they exaggerate the impact that the
CPU speed has on the execution time, e.g., they assume
that the execution time will double if the CPU speed is
halved. Based on a new single-coefficient performance
model, we propose a DVS algorithm that detects the
CPU-boundedness of a program on the fly (via a re-
gression method on the past MIPS rate) and then ad-
justs the CPU frequency accordingly. To illustrate its
effectiveness, we compare our algorithm with other DVS
algorithms on real systems via physical measurements.
1 Introduction
Dynamic voltage and frequency scaling (DVS) is a
mechanism whereby software can dynamically adjust
CPU voltage and frequency. This mechanism allows
systems to address the problem of ever-increasing CPU
power dissipation and energy consumption, as they are
both quadratically proportional to the CPU voltage.
However, reducing the CPU voltage may also require
the CPU frequency to be reduced and results in degraded
CPU performance with respect to execution
time. In other words, DVS trades off performance for
power and energy reduction.
∗This work was supported by the DOE ASC Program through
Los Alamos National Laboratory contract W-7405-ENG-36.
The performance loss due to running at a lower CPU
frequency raises several issues. First, a user who pays to
upgrade his/her computer system does not want to see
performance degradation. Second, running programs
at a low CPU frequency may end up increasing total
system energy usage [1, 2, 3, 4]. In order to control
(or constrain) the performance loss effectively, a model
that relates performance to the CPU frequency is essential
for any DVS-based power-management algorithm
(shortened as DVS algorithm hereafter).
A typical model used by many DVS algorithms predicts
that the execution time will double if the CPU
speed is cut in half. Unfortunately, this model
exaggerates the impact that the CPU speed has on the
execution time. It is only in the worst case that the
execution time doubles when the CPU speed is halved; in
general, the actual execution time is less than double.
For example, in programs with a high cache miss ratio,
performance can be limited by memory bandwidth
rather than CPU speed. Since memory performance
is not affected by a change in CPU speed, increasing
or decreasing the CPU frequency will have little effect
on the performance of these programs. We call this
phenomenon sublinear performance slowdown. Consequently,
researchers have been trying to exploit this
program behavior in order to achieve better power and
energy reduction [5, 6, 7, 8]. One common technique
decomposes program workload into regions based on their
CPU-boundedness. The decomposition can be done
statically using profiling information [5, 6] or dynamically
through an auxiliary circuit [9, 10, 11] or through
a built-in performance monitoring unit (PMU) [7, 8].
In this paper, we propose a new PMU-assisted, on-line
DVS algorithm called beta that provides fine-grained,
tight control over performance loss and also takes
advantage of the sublinear performance slowdown. The
new beta algorithm is based on an extension of the
theoretical work developed by Yao et al. [12] and by
Ishihara and Yasuura [13]. Via physical measurements, we
will demonstrate the effectiveness of the beta algorithm
when compared to several existing DVS algorithms for
a number of applications.
PACS 2004: The Fourth Workshop on Power-Aware Computer Systems.
Portland, Oregon, December 2004, LA-UR 04-7195.
The rest of the paper is organized as follows: Section 2
characterizes how current DVS algorithms relate perfor-
mance to CPU frequency. With this characterization as
a backdrop, we present a new DVS algorithm (Section 3)
along with its theoretical foundation (Section 4). Then,
Section 5 describes the experimental set-up, the imple-
mented DVS algorithms, and the experimental results.
Finally, Section 6 concludes and presents some future
directions.
2 Related Work
CPU utilization is often used to relate performance to
the CPU frequency. While it is generally defined as the
fraction of time that the CPU spends non-idle, CPU
utilization may also be interpreted as the normalized
workload (e.g., [14, 15, 16]). This particular interpreta-
tion has a nice property that there is a one-to-one corre-
spondence between the desired normalized CPU speed
and CPU idle time. Thus, if CPU utilization is 0.5
on a 2-GHz machine, then setting the CPU frequency
to 1 GHz is predicted to eliminate all CPU idle time.
Clearly, CPU utilization (or the normalized workload)
follows the assumption that the execution time doubles
when the CPU speed is halved. This type of model is
popular because the metric is easy to derive at run time
and it does not require application-specific information.
However, CPU utilization by itself does not provide
enough information about system timing requirements,
and DVS algorithms based on such information can only
provide loose control over performance loss [17, 18, 19].
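The utilization-based model described above can be sketched in a few lines. This is only an illustration of the linear assumption; the function names are ours and do not come from any of the cited systems.

```python
def utilization_based_frequency(utilization, f_current_ghz):
    """Frequency predicted to remove all idle time under the linear model."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must lie in [0, 1]")
    return utilization * f_current_ghz


def predicted_time_ratio(f_old_ghz, f_new_ghz):
    """Linear model: execution time scales inversely with CPU frequency."""
    return f_old_ghz / f_new_ghz


# The example from the text: utilization 0.5 on a 2-GHz machine.
target = utilization_based_frequency(0.5, 2.0)
print(target)                             # 1.0 (GHz)
print(predicted_time_ratio(2.0, target))  # 2.0, i.e. the time is assumed to double
```

The second function makes the hidden assumption explicit: the model always predicts a doubling of execution time when the frequency is halved, regardless of how memory-bound the program is.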
Thus, DVS algorithms with application-specific infor-
mation have been proposed in order to provide tighter
control over performance loss. For example, an applica-
tion (or task or thread) can be associated with a dead-
line, in terms of seconds, as well as a CPU work require-
ment, in terms of CPU cycles. In this setting, perfor-
mance is usually formulated as a linear function of the
CPU speed. This type of performance model predicts
that the execution time doubles when the CPU speed
is halved. Other approaches use a target IPC (instruc-
tions per cycle) rate as the system timing requirements
[20, 21]. Their performance model falls into the same
category too.
There have been some attempts to exploit the sublin-
ear performance slowdown to achieve more power and
energy reduction. For example, Marculescu [9] proposed
to set the CPU to a low speed whenever an L2 cache
miss occurs. Li et al. [11] improved the algorithm by
taking into account the transition overhead and scaling
the CPU frequency and voltage according to the level of
parallelism between the CPU and the memory subsys-
tem. Stanley-Marbell et al. [10] designed an auxiliary
hardware unit to detect loop-based, memory-bound ex-
ecution phases.
The PMU-assisted “process cruise control” developed
by Weissel and Bellosa [5] relies on a pre-computed ta-
ble of optimal CPU speeds to direct the CPU speed
change. The table is indexed by the run-time instruc-
tion counts per cycle and memory requests per cycle.
Although the algorithm requires neither source code nor
compiler support, it is inflexible in the sense that the ta-
ble is obtained through extensive experiments of micro-
benchmarks for a given performance loss (e.g., 10% in
[5]). In other words, the algorithm does not allow for
dynamic, application-specific control over performance
loss.
Hsu and Kremer [6] use off-line profiling to identify
memory-bound program regions, coupled with compiler
transformations, to facilitate the setting of the CPU
frequency. However, the need for source code and compiler
support makes this approach more difficult to implement
in practice. In general, compiler-directed DVS
algorithms have the benefit of only requiring the host
processor to export a DVS interface; they do not require
support from the OS scheduler. They also allow
DVS scheduling decisions to be made in a global manner
and to be combined with performance-oriented
optimization. On the other hand, savings are limited because
speed-set instructions are inserted statically and thus
apply to all executions of a specific memory reference,
cache misses as well as hits [22]. Moreover, input
data sets may change program behavior, which makes
profile-based DVS algorithms less attractive.
Our work is closest to Choi et al.’s recent work
[7, 8]. Both use a regression method and PMU sup-
port to perform on-line DVS scheduling through CPU-
boundedness detection. However, the two works differ
in their definition of CPU-boundedness, and thus, the
detection mechanism. Choi et al.’s work is based on the
ratio of the off-chip access time to the on-chip
computation time. In contrast, our algorithm defines
CPU-boundedness as the fraction of program workload that
is CPU-bound. Because of the different definitions, the
set of events monitored by the PMU for each algorithm
is different. In Section 5.5, we argue that our DVS al-
gorithm is equally effective but has a simpler imple-
mentation. Moreover, in contrast to [7, 8], we provide
a theoretical foundation of why our DVS algorithm is
effective in achieving energy optimality. The same the-
oretical result can be applied to their work as well.
In general, PMU-assisted DVS algorithms will encounter
a couple of challenges. The PMU is notorious
for its incomplete set of countable events, its inconsistency
across generations of the CPU, and counters that do not
function as advertised. For example, Choi et al. presented
two platform-dependent implementations [7, 8] of
the same DVS algorithm because the PMUs of these two
platforms count different sets of events. In addition, the
correlation of event counts to power and performance is
not yet clear and has been an ongoing research focus
(e.g., [23, 24]).
3 A New DVS Algorithm
Here we describe a new interval-based, PMU-assisted
DVS algorithm that provides fine-grained, tight control
over performance loss and exploits the sublinear
performance scaling in memory-bound and I/O-bound
programs. This heuristic algorithm
is based on an extension of the theoretical work developed
by Yao et al. [12] and by Ishihara and Yasuura [13]
(details in Section 4):
If the CPU power dissipation is a convex func-
tion of the CPU frequency, then for any pro-
gram whose performance is an affine function
of the CPU frequency, running at a constant
CPU speed and meeting the deadline just in
time will minimize the energy usage of execut-
ing the program. If the desired CPU frequency
is not directly supported by the system, the
two immediately-neighboring CPU frequencies
can be used to emulate the desired CPU fre-
quency and result in an energy-optimal DVS
schedule.
To account for the sublinear performance slowdown,
the following model that relates performance to the
CPU frequency is often used [25, 7, 8]:
\[ T(f) = W_{cpu} \cdot \frac{1}{f} + T_{mem} \tag{1} \]
The total execution time T(f) at frequency f is decom-
posed into two parts. The first part models on-chip
workload in terms of CPU cycles. Its value is affected
by the CPU speed change. The second part models the
time due to off-chip accesses and is invariant to changes
in the CPU speed. Note that this breakdown of the total
execution time is inexact when the target processor sup-
ports out-of-order execution because on-chip execution
may overlap with off-chip accesses [26, 22]. However, in
practice, the error tends to be quite small [7, 8].
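To make the sublinear slowdown concrete, the following sketch evaluates Equation (1) for a hypothetical memory-bound program. The workload numbers are invented for illustration and are not measurements from this paper.

```python
def execution_time(w_cpu_cycles, t_mem_seconds, f_hz):
    """Equation (1): on-chip cycles scale with 1/f, off-chip time does not."""
    return w_cpu_cycles / f_hz + t_mem_seconds


# Hypothetical program: 1e9 on-chip cycles plus 1.5 s of off-chip access time.
f_max = 2.0e9
t_full = execution_time(1.0e9, 1.5, f_max)        # 0.5 s + 1.5 s
t_half = execution_time(1.0e9, 1.5, f_max / 2.0)  # 1.0 s + 1.5 s
print(t_half / t_full)  # 1.25: halving the frequency costs 25%, not 100%
```

Under the linear model the same frequency change would be predicted to double the execution time, which is the overestimate the text refers to.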
The model T(f) treats program performance as an
affine function of the CPU frequency f and thus allows
us to apply the aforementioned theoretical result. We
simply execute a program at CPU frequency f∗ such
that D = T(f∗) where D is the deadline of the program.
However, there are two challenges in using the theorem
this way. First, in many cases there is no consensus on
how to assign a deadline to a program, e.g., scientific
computation. Second, in order to use the model T(f),
we need to know the values of the coefficients, Wcpu and
Tmem. These coefficients are oftentimes determined by
the hardware platform, program source code, and data
input. Thus, calculating these coefficients statically is
very difficult.
We address these challenges by defining a deadline as
the relative performance slowdown and by estimating
the model’s coefficients on the fly (without any off-line
profiling or compiler support). The relative performance
slowdown δ is defined as
\[ \delta = \frac{T(f)}{T(f_{max})} - 1 \tag{2} \]
where fmax is the peak CPU frequency; it has been
used in previous work [26, 7]. It is widely accepted for
programs to which it is difficult to assign deadlines in terms
of absolute execution time. It also carries more timing
requirement information than CPU utilization and IPC
rate. Providing this user-tunable parameter δ in our
DVS algorithm allows fine-grained, tight control over
performance loss.
To estimate the coefficients, we first re-formulate
the original two-coefficient model in Equation (1) as a
single-coefficient model:
\[ \frac{T(f)}{T(f_{max})} = \beta \cdot \frac{f_{max}}{f} + (1 - \beta) \tag{3} \]
with
\[ \beta = \frac{W_{cpu}}{W_{cpu} + T_{mem} \cdot f_{max}} \tag{4} \]
The coefficient β is, by definition, a value between 0 and
1. It was introduced by one of the authors in [6] to quantify
the CPU-boundedness of a program and the impact of a
CPU speed change on its performance. The metric
represents the fraction of the program workload that
scales linearly with the CPU frequency. If a program
has β = 1, the execution time of the program
will double when the CPU speed is halved. In contrast,
memory-bound and I/O-bound programs have β
values close to zero, indicating that their execution time
will remain nearly the same even when running at the
slowest CPU speed. Using the single-coefficient model instead
of the original two-coefficient model allows the coefficient
value to be calculated efficiently.
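For completeness, the step from the two-coefficient model to the single-coefficient model can be written out as follows. This is our own expansion of the algebra, using only the definitions in Equations (1) and (4).

```latex
\begin{align*}
\frac{T(f)}{T(f_{max})}
  &= \frac{W_{cpu}/f + T_{mem}}{W_{cpu}/f_{max} + T_{mem}}
   = \frac{W_{cpu} \cdot f_{max}/f + T_{mem} \cdot f_{max}}{W_{cpu} + T_{mem} \cdot f_{max}} \\
  &= \underbrace{\frac{W_{cpu}}{W_{cpu} + T_{mem} \cdot f_{max}}}_{\beta} \cdot \frac{f_{max}}{f}
   + \underbrace{\frac{T_{mem} \cdot f_{max}}{W_{cpu} + T_{mem} \cdot f_{max}}}_{1 - \beta}
\end{align*}
```

As a hypothetical numeric check, a program with W_cpu = 10^9 cycles, T_mem = 1.5 s, and f_max = 2 GHz has β = 10^9 / (10^9 + 3·10^9) = 0.25, so halving the frequency gives a predicted ratio of 0.25 · 2 + 0.75 = 1.25.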
Every I seconds, do the following:
1. Use Equation (5) to compute β.
2. Compute the ideal frequency f∗:
\[ f^* = \begin{cases} f_{min} & \text{if } \beta \le \delta \\ f_{max}/(1 + \delta/\beta) & \text{otherwise} \end{cases} \]
3. Find fj and fj+1 such that fj ≤ f∗ < fj+1.
4. Compute the ratio r:
\[ r = \frac{(1 + \delta/\beta)/f_{max} - 1/f_{j+1}}{1/f_j - 1/f_{j+1}} \]
5. Run r · I seconds at frequency fj.
6. Run (1 − r) · I seconds at frequency fj+1.
7. Update mips(fj) and mips(fj+1).
Figure 1: Algorithm beta. Parameter δ is the relative
performance slowdown and parameter I is the
length of an interval in seconds.
The coefficient β is computed at run time using a
regression method on the past MIPS rates reported by
the PMU. Specifically, our DVS algorithm keeps track of
the average MIPS rate for each executed CPU frequency
and applies least-squares fitting at each interval to
dynamically re-compute the new β value:
\[ \beta = \frac{\sum_i \left( \frac{f_{max}}{f_i} - 1 \right) \left( \frac{mips(f_{max})}{mips(f_i)} - 1 \right)}{\sum_i \left( \frac{f_{max}}{f_i} - 1 \right)^2} \tag{5} \]
where mips(f) is the average MIPS rate for CPU fre-
quency f. Note that our mechanism assumes a constant
number of total instructions in a program, regardless of
the running CPU frequency. This assumption has been
verified through extensive experiments. In practice, the
value of β converges very quickly for the benchmarks
we tested.
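The regression in Equation (5) is a least-squares fit of a line through the origin. A minimal sketch is given below; the dictionary-based bookkeeping, the clamping of β to [0, 1], and the fallback when only fmax has been observed are our assumptions, not details taken from the implementation evaluated in this paper.

```python
def estimate_beta(mips_by_freq, f_max):
    """Least-squares estimate of beta from per-frequency average MIPS rates.

    mips_by_freq maps each executed CPU frequency to its average MIPS rate
    and must contain an entry for f_max.
    """
    mips_max = mips_by_freq[f_max]
    num = 0.0
    den = 0.0
    for f, mips in mips_by_freq.items():
        x = f_max / f - 1.0        # relative frequency reduction term
        y = mips_max / mips - 1.0  # observed relative slowdown
        num += x * y
        den += x * x
    if den == 0.0:
        # Assumption: with no data below f_max, treat the program as CPU-bound.
        return 1.0
    # Assumption: clamp to the range that beta has by definition.
    return min(1.0, max(0.0, num / den))


# Synthetic MIPS rates generated from Equation (3) with beta = 0.25.
true_beta, f_max = 0.25, 2.0
samples = {f: 1000.0 / (true_beta * f_max / f + 1.0 - true_beta)
           for f in (1.0, 1.5, 2.0)}
beta = estimate_beta(samples, f_max)
print(round(beta, 6))  # 0.25
```

Because the total instruction count is assumed constant, the ratio mips(fmax)/mips(f) stands in for T(f)/T(fmax), which is why MIPS rates alone suffice.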
The rest of the algorithm simply applies the theoretical
result to compute the desired CPU frequency
f∗ for each interval, once the coefficient β is updated,
plus some bookkeeping on mips(f). The expression for
f∗ is derived by equating Equation (2) with Equation (3),
which gives δ = β · (fmax/f∗ − 1) and hence
f∗ = fmax/(1 + δ/β).
Figure 1 outlines the entire algorithm.
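Steps 2 through 6 of Figure 1 can be sketched as follows. The function names, the extra clamp to fmin, and the handling of a target outside the supported range are our assumptions; the ratio r is computed from 1/f∗, which equals (1 + δ/β)/fmax whenever the second case of step 2 applies.

```python
import bisect


def ideal_frequency(beta, delta, f_min, f_max):
    """Step 2 of Figure 1, plus a clamp to f_min (our assumption)."""
    if beta <= delta:
        return f_min
    return max(f_min, f_max / (1.0 + delta / beta))


def split_interval(f_star, freqs, interval):
    """Steps 3-6: return (f_lo, f_hi, seconds_at_f_lo, seconds_at_f_hi)."""
    freqs = sorted(freqs)
    if f_star <= freqs[0]:
        return freqs[0], freqs[0], interval, 0.0
    if f_star >= freqs[-1]:
        return freqs[-1], freqs[-1], interval, 0.0
    k = bisect.bisect_right(freqs, f_star)  # freqs[k-1] <= f_star < freqs[k]
    f_lo, f_hi = freqs[k - 1], freqs[k]
    r = (1.0 / f_star - 1.0 / f_hi) / (1.0 / f_lo - 1.0 / f_hi)
    return f_lo, f_hi, r * interval, (1.0 - r) * interval


# Hypothetical settings in GHz; beta = 0.25 and a 5% slowdown budget.
f_star = ideal_frequency(0.25, 0.05, 0.8, 2.0)
f_lo, f_hi, t_lo, t_hi = split_interval(f_star, [0.8, 1.2, 1.6, 2.0], 1.0)
print(f_lo, f_hi)  # 1.6 2.0
```

In this example f∗ is about 1.67 GHz, so the interval is split between the 1.6-GHz and 2.0-GHz settings in roughly an 80/20 proportion.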
Finally, we note that Choi et al.’s recent work on DVS
algorithms [7, 8] is based on the on-line calculation of
ratios αf, one for each frequency f, that are also derived
from Equation (1). There, αf is defined as the ratio of
off-chip access time to on-chip computation time:
\[ \alpha_f = \frac{f \cdot T_{mem}}{W_{cpu}} \tag{6} \]
Using this αf, the desired CPU frequency for the next
interval can be computed. The detailed comparison of
both works is presented in Section 5.5.
4 Theoretical Foundation
In the previous section, we claim a theoretical result for
energy-optimal DVS scheduling which extends both Yao
et al.’s work in [12] and Ishihara and Yasuura’s work in
[13]. In this section we provide evidence to support our
claim. However, due to the limit of paper length, all the
proofs are left in the appendix.
The energy-optimal DVS scheduling problem consid-
ered here is taken from [6]. That previous work only
provides a problem formulation. In this paper we pro-
vide two new theorems that characterize the energy-
optimal DVS schedule for the problem. The two theo-
rems are also closely related to some previous work such
as Miyoshi et al.’s “critical power slope” [3].
A DVS system is assumed to export n settings
{(fi, Pi)}, where Pi is the CPU power dissipation (in
watts) at CPU frequency fi. Without loss of generality,
we assume 0 < f1 < ··· < fn. We also denote the total
execution time of a program running at setting i as Ti.
Finally, to facilitate discussion, we define Ei = Pi · Ti.
The DVS scheduling problem is formulated as follows:
given a program and a deadline D (in seconds), find a
DVS schedule $(t_1^*, \cdots, t_n^*)$ such that if the program is
executed for $t_i^*$ seconds at setting i, the total energy
usage E is minimized, the deadline D is met, and the
required work is completed. Mathematically speaking,
\[ \min E = \sum_i P_i \cdot t_i \tag{7} \]
subject to
\[ \sum_i t_i \le D \tag{8} \]
\[ \sum_i t_i / T_i = 1 \tag{9} \]
\[ t_i \ge 0 \tag{10} \]
To simplify the discussion of the main theorems, we
handle a few corner cases first. The condition
$D \ge \min_i T_i$ has to be satisfied so that the problem is
feasible. If $D \ge \max_i T_i$, the problem becomes
the classical fractional Knapsack problem [27] because
Equation (8) can be removed. In this case, the energy-optimal
DVS schedule will execute the entire program
at setting i∗ where $i^* = \arg\min_i \{E_i\}$. Similarly, for the
case of T1 = ··· = Tn, the above DVS schedule is also
energy-optimal. In the following, we will focus on cases
where T1 ≠ ··· ≠ Tn and $\min_i T_i < D < \max_i T_i$.
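Theorem 1 If
\[ T_1 > T_2 > \cdots > T_n \]
and
\[ 0 \ge \frac{E_2 - E_1}{T_2 - T_1} \ge \frac{E_3 - E_2}{T_3 - T_2} \ge \cdots \ge \frac{E_n - E_{n-1}}{T_n - T_{n-1}} \]
then
\[ t_i^* = \begin{cases} \frac{D - T_{j+1}}{T_j - T_{j+1}} \cdot T_j & i = j \\ D - t_j^* & i = j + 1 \\ 0 & \text{otherwise} \end{cases} \]
where
\[ T_{j+1} < D \le T_j \]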
Theorem 1 says that if the piecewise-linear function that
connects points {(Ti,Ei)} is convex and non-increasing
on [Tn,T1], then running at a CPU frequency that fin-
ishes the execution right at the deadline is the most
energy-efficient. If the desired CPU frequency is not
directly supported, it can be emulated by the two
immediately-neighboring CPU frequencies and result in
the energy-optimal DVS schedule.
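The schedule given by Theorem 1 is straightforward to compute. The sketch below is illustrative only: it assumes the slope condition of the theorem holds and does not verify it, and the function name and example values are hypothetical.

```python
def theorem1_schedule(times, deadline):
    """Seconds to run at each setting so that the work finishes exactly at D.

    times[i] is the full-program execution time at setting i+1 and must be
    strictly decreasing, as required by Theorem 1.
    """
    n = len(times)
    if any(times[i] <= times[i + 1] for i in range(n - 1)):
        raise ValueError("execution times must be strictly decreasing")
    if not times[-1] < deadline <= times[0]:
        raise ValueError("deadline must satisfy T_n < D <= T_1")
    schedule = [0.0] * n
    for j in range(n - 1):
        if times[j + 1] < deadline <= times[j]:
            t_j = (deadline - times[j + 1]) / (times[j] - times[j + 1]) * times[j]
            schedule[j] = t_j
            schedule[j + 1] = deadline - t_j
            break
    return schedule


# Hypothetical three-setting system: T = (10 s, 8 s, 6 s), deadline D = 9 s.
times = [10.0, 8.0, 6.0]
schedule = theorem1_schedule(times, 9.0)
print(schedule)  # [5.0, 4.0, 0.0]
```

The result satisfies both constraints with equality: the times sum to the deadline (5 + 4 = 9 s), and the fractions of work sum to one (5/10 + 4/8 = 1).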
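Theorem 2 If
\[ T_i = T(f_i) = \frac{c_1}{f_i} + c_0, \quad c_1 \ne 0 \]
and
\[ \frac{P_1 - 0}{f_1 - 0} \le \frac{P_2 - P_1}{f_2 - f_1} \le \cdots \le \frac{P_n - P_{n-1}}{f_n - f_{n-1}} \]
then
\[ 0 \ge \frac{E_2 - E_1}{T_2 - T_1} \ge \frac{E_3 - E_2}{T_3 - T_2} \ge \cdots \ge \frac{E_n - E_{n-1}}{T_n - T_{n-1}} \]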
Theorem 2 says that, for any program whose execution
time is an affine function of 1/f (the reciprocal of the
CPU frequency), if the DVS settings in a CPU are
well-designed, then we can apply Theorem 1 to derive the
energy-optimal DVS schedule. Theorem 2 thus provides the
bridge needed to apply Theorem 1.
The DVS settings are considered well-designed if each
setting has a power dissipation no higher than that of
the best possible combination of all other settings
that emulates its frequency [19]. Equivalently, the
DVS settings are well-designed if the CPU power
dissipation is a convex function of the CPU frequency on
[0, fmax] (in contrast to convex on [fmin, fmax]). This
is why Miyoshi et al. [3] found that in a few realistic
CPUs, completing a task far before its deadline
and putting the CPU into sleep mode is more energy-efficient
than running the task as slowly as possible to
barely make the deadline. In these realistic CPUs, the
CPU frequency f1 can be emulated by the combination
of CPU frequency 0 (i.e., the CPU in sleep mode) and
a higher frequency with a lower power dissipation.
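Whether a set of DVS settings is well-designed in this sense can be checked directly from the slope chain in Theorem 2. The sketch below is illustrative: the (frequency, power) values are invented, and the sleep state is modeled as the point (0, 0), matching the first term of the chain.

```python
def is_well_designed(settings):
    """Check the Theorem 2 condition: slopes of power versus frequency,
    starting from the sleep point (0, 0), must be non-decreasing."""
    points = [(0.0, 0.0)] + sorted(settings)
    slopes = [(p1 - p0) / (f1 - f0)
              for (f0, p0), (f1, p1) in zip(points, points[1:])]
    return all(a <= b for a, b in zip(slopes, slopes[1:]))


# Hypothetical (GHz, watts) settings.
convex = [(1.0, 5.0), (1.5, 10.0), (2.0, 20.0)]       # slopes 5, 10, 20
costly_low = [(1.0, 12.0), (1.5, 14.0), (2.0, 20.0)]  # slopes 12, 4, 12
print(is_well_designed(convex))      # True
print(is_well_designed(costly_low))  # False
```

The second set is convex on [fmin, fmax] but not on [0, fmax]: its lowest setting draws so much power that a mix of sleep mode and a higher setting would emulate it more cheaply, which is the situation the text attributes to Miyoshi et al.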
Finally, Theorem 2 extends the work presented by
Yao et al. [12] and by Ishihara and Yasuura [13]. First,
both works assume that c0 = 0. Second, Ishihara and
Yasuura’s work assumes a fixed relationship between f
and V in a DVS setting; namely,
\[ f = k \cdot (V - V_T)^{\alpha} / V \]
where k, VT, and α are positive constants. Unfortunately,
today’s DVS processors may not be able to support such
an assumption. This is because these processors only
provide a discrete set of CPU frequencies and voltages,
whereas the above equation requires a continuous range
of CPU frequencies to be supported for a discrete set
of voltages. Theorem 2 loosens these assumptions to
facilitate DVS algorithms on realistic processors.
[Figure 2: The experimental setup. Components shown: profiling computer, tested computer, digital power meter, wall power outlet, power strip, AC adapter.]
5 Experiments
In this section, we describe our experimental environ-
ment in which we evaluate and compare algorithm beta
with several other DVS algorithms. We also discuss in
depth the experimental results.
5.1 Experimental Setup
In order to obtain high-fidelity experimental data, we
set up our experiments using physical measurements, as
shown in Figure 2. The experimental results were collected
through a Yokogawa WT210 digital power meter
[28]. The power meter continuously samples the
instantaneous wattage every 20 µs. The computer runs the
Linux 2.4.18 kernel. All the benchmarks were compiled
by GNU compilers with optimization level -O2. All the
benchmarks were run to completion; each run took over
a minute.
The benchmarks are taken from SPEC’s CPU95 bench-
mark suites. The SPEC benchmarks [29] emphasize the
performance of the CPU and memory, but not other
computer components such as I/O (disk drives), net-
working or graphics. We chose to use SPEC benchmarks