Optimizing Pipelines for Power and Performance
Viji Srinivasan, David Brooks, Michael Gschwind, Pradip Bose,
Victor Zyuban, Philip N. Strenski, Philip G. Emma
IBM T.J. Watson Research Center
Yorktown Heights, NY 10598
Abstract
During the concept phase and definition of next gen-
eration high-end processors, power and performance will
need to be weighted appropriately to deliver competitive
cost/performance. It is not enough to adopt a CPI-centric
view alone in early-stage definition studies. One of the fun-
damental issues confronting the architect at this stage is
the choice of pipeline depth and target frequency. In this
paper we present an optimization methodology that starts
with an analytical power-performance model to derive op-
timal pipeline depth for a superscalar processor. The results
are validated and further refined using detailed simulation
based analysis. As part of the power-modeling methodol-
ogy, we have developed equations that model the variation
of energy as a function of pipeline depth. Our results using
a set of SPEC2000 applications show that when both power
and performance are considered for optimization, the op-
timal clock period is around 18 FO4. We also provide a
detailed sensitivity analysis of the optimal pipeline depth
against key assumptions of these energy models.
1 Introduction
Current generation high-end, server-class processors are
performance-driven designs. These chips are still somewhat
below the power and power density limits afforded by the
package/cooling solution of choice in server markets tar-
geted by such processors. In designing future processors,
however, energy efficiency has become one of
the primary design constraints [7, 1].
In this paper, we analyze key constraints in choosing the
“optimal” pipeline depth (which directly influences the fre-
quency target) of a microprocessor. The choice of pipeline
depth is one of the fundamental issues confronting the ar-
chitect/designer during the very early stage microarchitec-
ture definition phase of high performance, power-efficient
processors. Even from a performance-only viewpoint, this
issue has been important, if only to understand the limits
to which pipelining can scale in the context of real work-
loads [15, 5, 9, 11, 21]. In certain segments of the mar-
ket (typically desktops and low-end servers), there is often
a market-driven tendency to equate delivered end perfor-
mance with the frequency of the processor. Enlightened
customers do understand the value of net system performance;
nonetheless, the instinctive urge to go primarily for the
highest-frequency processor in a given technology generation
is a known weakness even among savvy end users, and therefore
among processor design teams. Recent studies [9, 11, 21]
suggest that there is still room to grow in the pipeline depth
game, with performance optima in the range of 8-11 FO4 inverter
delays per stage (consisting of 6-8 FO4 logic delay and 2-3 FO4
latch delay) for current out-of-order superscalar design
paradigms. Fan-out-of-four (FO4) delay is defined as the delay
of one inverter driving four copies of an equally sized inverter;
the amount of logic and latch overhead per pipeline stage is
often measured in FO4 delays, which implies that deeper
pipelines have a smaller FO4 delay per stage. How-
ever, even in these performance-centric analysis papers, the
authors do point out the practical difficulties of design com-
plexity, verification and power that must be solved in attain-
ing these idealized limits. Our goal in this paper is to exam-
ine the practical, achievable limits when power dissipation
constraints are also factored in. We believe that such analy-
sis is needed to realistically bound the scalability limit in the
next few technology generations. In particular, power dissi-
pation must be carefully minimized to avoid design points
which on paper promise ever higher performance, yet un-
der normal operating conditions, with commodity packag-
ing and air cooling, only deliver a fraction of the theoretical
peak performance.
In this paper, we first develop an analytical model to
understand the power and performance tradeoffs for super-
scalar pipelines. From this model, we derive the optimal
pipeline depth as a function of both power and performance.
Subsequently, the results are validated and further refined
using a detailed cycle-accurate simulator of a current gener-
ation superscalar processor. The energy model for the core
pipeline is based on circuit-extracted power analysis for
structures in a current, high performance PowerPC proces-
sor. We then derive a methodology for scaling these energy
models to deeper andshallower pipelines. A key component
of this methodology is the scaling of latch count growth as
a function of pipeline depth. With these performance and
power models we attempt to determine the optimal pipeline
1
Fan-out-of-four (FO4) delay is defined as the delay of one inverter
driving four copies of an equally sized inverter. The amount of logic and
latch overhead per pipeline stage is often measured in terms of FO4 delay
which implies that deeper pipelines have smaller FO4.
depth for a particular power-performance metric.
Our results based on an analysis of the TPC-C transac-
tion processing benchmark and a large set of SPEC2000
programs indicate that a power-performance optimum is
achieved at much shallower pipeline depths than a purely
performance-focused evaluation would suggest. In addi-
tion, we find there is a range of pipeline depths for which
performance increases can be achieved with a modest sac-
rifice in power-performance efficiency. Pipelining beyond
that range leads to drastic reduction in power-performance
efficiency.
The contributions of this paper are (1) energy mod-
els for both dynamic and leakage power that capture the
scaling of different power components as a function of
pipeline depth; (2) an analytical performance model that
can predict optimal pipeline depth as well as the shift in
the optimal point, when combined with the energy models;
(3) cycle-accurate, detailed power-performance simulation
with a thorough sensitivity analysis of the optimal pipeline
depth against key energy model parameters.
This paper is structured as follows: We discuss the prior,
related work in Section 2. In Section 3, we describe the
proposed analytical model to study pipeline depth effects in
superscalar processors. Section 4 presents the simulation-
based validation methodology. In Section 5 we present re-
sults using both the analytical model and a detailed simula-
tor. In Section 6 we present a detailed sensitivity analysis to
understand the effect of variations in key parameters of the
derived power models on the optimal pipeline depth. We
conclude in Section 7, with pointers to future work.
2 Related Work
Previous work has studied the issue of “optimal” pipeline
depth exclusively under the constraint of maximizing the
performance delivered by the microprocessor.
An initial study of optimal pipeline depths was per-
formed by Kunkel and Smith in the context of supercom-
puters [15]. The machine modeled in that study was based
on a Cray-1S, with delays being expressed as ECL gate lev-
els. The authors studied the achievable performance from
scalar and vector codes as a function of gate levels per
pipeline stage for the Livermore kernels. The study demon-
strated that vector codes can achieve optimum performance
by deep pipelining, while scalar (floating-point) workloads
reach an optimum at shallower pipelines.
Subsequently, Dubey and Flynn [5] revisited the topic
of optimal pipelining in a more general analytical frame-
work. The authors showed the impact of various workload
and design parameters. In particular, the optimal number of
pipeline stages is shown to decrease with increasing over-
head of partitioning logic into pipeline stages (i.e., clock
skew, jitter, and latch delay). In this model, the authors con-
sidered only stalls due to branch mispredictions and did not
consider data dependent stalls due to memory or register
dependencies.
More recently, several authors have reexamined this
topic in the context of modern superscalar processor mi-
croarchitectures. Hartstein and Puzak [9] treat this prob-
lem analytically and verify based on detailed simulation of
a variety of benchmarks for a 4-issue out-of-order machine
with a memory-execute pipeline. Simulation is also used to
determine the values of several parameters of their mathe-
matical model, since these cannot be formulated axiomati-
cally. They report optimal logic delay per pipeline stage to
be 7.7 FO4 for SPEC2000 and 5.5 FO4 for traditional and
Java/C++ workloads. Assuming a latch insertion delay of
3 FO4, this would result in a total delay of about 10.7 FO4
and 8.5 FO4 per pipeline stage, respectively.
Hrishikesh et al. [11] treat the question of logic depth
per pipeline stage empirically based on simulation of the
SPEC2000 benchmarks for an Alpha 21264-like machine.
Based on their assumed latch insertion delay of 1.8 FO4,
they demonstrate that a performance-optimal point is at
logic delay of 6.0 FO4. This would result in a total pipeline
delay of about 8 FO4.
Sprangle and Carmean [21] extrapolate from the current
performance of the Pentium 4 using IPC degradation factors
for adding a cycle to critical processor loops, such as ALU,
L1 and L2 cache latencies, and branch miss penalty for a
variety of application types. The authors compute an opti-
mal branch misprediction pipeline depth of 52 stages, cor-
responding to a pipeline stage total delay of 9.9 FO4 (based
on a logic depth of 6.3 FO4 and a latch insertion delay of 3.6 FO4,
of which 3 FO4 are due to latch delay and 0.6 FO4 represent
skew and jitter overhead).
All of the above studies (as well as ours) assume that mi-
croarchitectural structures can be pipelined without limita-
tion. Several authors have evaluated limits on the scalability
and pipelining of these structures [6, 19, 22, 4, 11].
Collectively, the cited works on optimal pipelining have
made a significant contribution to the understanding of
workloads and their interaction with pipeline structures by
studying the theoretical limits of deep pipelining. How-
ever, prior work does not address scalability with respect to
increased power dissipation that is associated with deeper
pipelines. In this work, we aim to build on this foundation
by extending the existing analytical models and by propos-
ing a power modeling methodology that allows us to es-
timate optimal pipeline depth as a function of both power
and performance.
3 Analytical Pipeline Model
In the concept phase definition studies, the exact organi-
zation and parameters of the target processor are not known.
As such, a custom, cycle-accurate power-performance sim-
ulator for the full machine is often not available or rel-
evant. Therefore, the use of analytical reasoning mod-
els supplemented by workload characterization and limit
studies (obtained from prior generation simulators or trace
analysis programs) is common in real design groups. We
present such an analytical model to understand the power-
performance optima and tradeoffs during the pre-simulation
phase of a design project.
Figure 1 shows a high-level block diagram of the pipeline
model used in our analysis. Our primary goal is to derive the
optimum pipeline depth for the various execution units by
estimating the various types of stalls in these pipes while
using a perfect front-end for the processor. Although Fig-
ure 1 shows only one pipe for each unit (fixed point, floating
point, load/store, and branch), the model can be used for a
design with multiple pipes per unit as well.






!"
!#$%
Figure 1. Pipeline Model
In Figure 1, let t_i be the latch-free logic time to complete
an operation in pipe i, and s_i be the number of pipeline
stages of pipe i. Assuming the same clock frequency for all
the pipes, t_i/s_i = t_j/s_j for all i, j.
If c_i is the latch overhead per stage for pipe i, the total
time per stage of pipe i is T_i = (t_i/s_i) + c_i for all i. As
derived by Dubey and Flynn [5], and by Larson and Davidson
(cited in Chapter 2 of [14]), the throughput of the above
machine in the absence of stalls is given by
$$G = \sum_i \frac{1}{T_i}.$$
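As a concrete illustration of the stall-free throughput expression, the short Python sketch below evaluates T_i = t_i/s_i + c_i and G = Σ_i 1/T_i for a hypothetical four-pipe machine; the logic times, overheads, and stage counts are invented values chosen only to satisfy the equal-stage-time constraint, not parameters taken from the paper.

```python
# Stall-free throughput of the pipeline model of Figure 1 (illustrative sketch).
# t[p]: latch-free logic time of pipe p (FO4); s[p]: number of stages;
# c[p]: latch + skew overhead per stage (FO4). All values are hypothetical.
t = {"fxu": 48.0, "fpu": 96.0, "lsu": 48.0, "bru": 32.0}
c = {p: 3.0 for p in t}
s = {"fxu": 3, "fpu": 6, "lsu": 3, "bru": 2}   # chosen so t[p]/s[p] = 16 FO4 for every pipe

T = {p: t[p] / s[p] + c[p] for p in t}   # time per stage of pipe p (19 FO4 each here)
G = sum(1.0 / T[p] for p in T)           # throughput in operations per FO4 of time

print({p: round(T[p], 1) for p in T}, round(G, 4))
```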
We now extend this baseline model to include the effect
of data-dependent stalls. Workload analysis can be used to
derive first-cut estimates of the probability that an instruction
n depends on another instruction j, for all instructions n, and
can be used to estimate the frequency of pipeline stalls. This
is illustrated in the example below, where an FXU instruction
(i+1) depends on the immediately preceding instruction i and
will be stalled for (s_1 - 1) stages, assuming a register file
bypass to forward the results. Similarly, another FXU
instruction (j+2) depends on instruction j and will be stalled
for (s_1 - 2) stages. Note that in the above workload analysis,
if the source operands of an instruction i are produced by more
than one instruction, the largest of the possible stalls is
assigned to i.

inst (i)   add r1 = r2, r3
inst (i+1) and r4 = r1, r5  -- stalled for (s1 - 1) stages

inst (j)   add r1 = r2, r3
inst (j+1) or  r6 = r7, r8
inst (j+2) and r4 = r1, r5  -- stalled for (s1 - 2) stages
$$T_{fxu} = T_1 + Stall_{fxu,fxu}\,T_1 + Stall_{fxu,fpu}\,T_2 + Stall_{fxu,lsu}\,T_3 + Stall_{fxu,bru}\,T_4,$$
$$\text{where}\quad Stall_{fxu,fxu} = f_1 (s_1 - 1) + f_2 (s_1 - 2) + \ldots$$
The above equation represents the time to complete an
FXU operation in the presence of stalls; f_i is the conditional
probability that an FXU instruction m depends on another FXU
instruction (m - i), for all FXU instructions m, provided that
instruction (m - i) is the producer with the largest stall for
instruction m. Similar expressions can be derived for T_fpu,
T_lsu, and T_bru, the completion times of an FPU, LSU, and BRU
operation, respectively. To account for superscalar (> 1) issue
widths, the workload analysis assumes a given issue width along
with the number of execution pipes of various types (FXU, FPU,
BRU, LSU), and issues independent instructions as an instruction
bundle such that the bundle width ≤ issue width. Thus, the distance
between dependent instructions is the number of instruction
bundles issued between them. To account for the dependent
instruction stalls due to L1 data cache misses we use a func-
tional cache simulator to determine cache hits and misses.
In addition, we split the load/store pipe into two, namely,
load-hit and load-miss pipes, thereby steering all data
references that miss in the data cache to the load-miss
pipeline, which results in longer stall times for the dependent instruc-
tions. Since the workload analysis is independent of the ma-
chine architecture details and uses only the superscalar issue
width to determine the different stalls, it suffices to analyze
each application once to derive the stalls.
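To make the stall accounting concrete, the sketch below computes Stall_fxu,fxu = Σ_k f_k·(s_1 - k) and folds it into T_fxu; the dependence-distance frequencies and stage count are hypothetical placeholders (reusing the values of the earlier sketch), not measured workload statistics.

```python
# Illustrative computation of the FXU completion time in the presence of
# data-dependent stalls (Section 3). All numbers are hypothetical.
s1 = 3                     # number of FXU pipeline stages (from the sketch above)
T1 = 19.0                  # time per FXU stage in FO4 (from the sketch above)

# f[k]: conditional probability that an FXU instruction's critical producer is
# the FXU instruction k bundles earlier. A made-up workload profile.
f = {1: 0.08, 2: 0.04, 3: 0.02}

# Expected FXU-after-FXU stall in cycles: sum_k f[k] * (s1 - k); producers far
# enough ahead of the consumer cause no stall, hence the max(..., 0).
stall_fxu_fxu = sum(p * max(s1 - k, 0) for k, p in f.items())

# Contributions from producers in the FPU/LSU/BRU pipes would be added in the
# same way, each weighted by that pipe's stage time; they are omitted here.
T_fxu = T1 + stall_fxu_fxu * T1
print(round(stall_fxu_fxu, 2), round(T_fxu, 2))   # 0.2 cycles, 22.8 FO4
```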
The stalls modeled so far include only hazards in the ex-
ecution stages of the different pipes. However, these pipes
could also be waiting for instructions to arrive from the
front-end of the machine. If u_i represents the fraction of
time pipe i has instructions arriving from the front-end of
the machine, the equation below gives the throughput of the
pipeline in the presence of stalls. Note that u_i = 0 for
unutilized pipelines and u_i = 1 for fully utilized pipelines:
$$G = \frac{u_1}{T_{fxu}} + \frac{u_2}{T_{fpu}} + \frac{u_3}{T_{lsu}} + \frac{u_4}{T_{bru}}$$
Thus far, we focused on deriving the throughput of a
pipeline as a function of the number of pipeline stages. In
order to optimize the pipeline for both power and perfor-
mance we use circuit-extracted power models described in
Section 4.2.
If P_{s_i} is the total power of pipe i derived from the power
models, the energy-delay product for pipe i is given by
ED = P_{s_i}/G². Hence, the optimal value of s_i that minimizes
the energy-delay product can be obtained from d(ED)/ds_i = 0.
Note that depending on the class of processors, the desired
metric for optimization could instead be d((BIPS)^γ/W)/ds_i = 0,
where γ ≥ 0 is an exponent whose value can be fixed in specific
circuit tradeoff contexts [26]; BIPS is billions of instructions
per second.
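The optimum can also be located numerically instead of solving d(ED)/ds_i = 0 in closed form. The sketch below sweeps the stage count of a single representative pipe, rebuilding throughput from the stall model above and using a crude placeholder power function (the circuit-extracted models of Section 4.2 are what this paper actually uses); every parameter value here is invented for illustration.

```python
# Numerically locating performance-optimal and power-performance-optimal
# pipeline depths for one pipe of the analytical model. All values hypothetical.
t_logic = 48.0                       # total latch-free logic time of the pipe (FO4)
c_latch = 3.0                        # latch + clock skew overhead per stage (FO4)
f = {1: 0.08, 2: 0.04, 3: 0.02}      # dependence-distance profile (hypothetical)
gamma = 3                            # gamma = 3 gives a BIPS^3/W-style metric

def throughput(s):
    """Operations per unit time for an s-stage pipe with data-dependent stalls."""
    t_stage = t_logic / s + c_latch
    stall = sum(p * max(s - k, 0) for k, p in f.items())
    return 1.0 / (t_stage * (1.0 + stall))

def power(s):
    """Placeholder power model (stand-in for Section 4.2): clock frequency
    times a latch-count term that grows linearly with the number of stages."""
    return (1.0 / (t_logic / s + c_latch)) * (1.0 + 0.2 * s)

best_perf = max(range(2, 16), key=throughput)
best_ppw = max(range(2, 16), key=lambda s: throughput(s) ** gamma / power(s))
print("performance-optimal stages:", best_perf)
print("power-performance-optimal stages:", best_ppw)
```

As in the paper's results, the power-aware optimum lands at a noticeably shallower pipeline than the performance-only optimum, although the specific numbers printed here depend entirely on the placeholder parameters.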
4 Performance and Power Methodology
In this section, we describe the performance simulator
used in this study as well as the details of our power mod-
eling toolkit and the methodology that we use to estimate
changes in power dissipation as we vary the pipeline depth
of the machine.
Fetch latencies (STD/INF): NFA Predictor 1/0; L2 ICache 11/0; L3 (Instruction) 85/0; I-TLB Miss 10/0; L2 I-TLB Miss 50/0
Decode latencies (STD/INF): Multiple Decode 2/0; Millicode Decode 2/0; Expand String 2/0; Mispredict Cycles 3/0; Register Read 1/1
Execution pipe latencies (STD/INF): Fix Execute 1/1; Float Execute 4/4; Branch Execute 1/1; Float Divide 12/12; Integer Multiply 7/7; Integer Divide 35/35; Retire Delay 2/2
Load/store latencies (STD/INF): L1 D-Load 3/3; L2 D-Load 9/9; L3 (Data) 77/0; Load Float 2/2; D-TLB Miss 7/0; L2 D-TLB Miss 50/0; StoreQ Forward 4/0
Table 1. Latencies (in processor cycles, STD/INF) for the 19 FO4 Design Point
4.1 Performance Simulation Methodology
Figure 2. Modeled Processor Organization (fetch/decode/rename front end with L1 I-cache, I-TLBs, and NFA/branch predictor; separate issue queues, issue logic, register read, and execution units for integer, load/store, floating-point, and branch; L1 D-cache and D-TLBs backed by an L2 cache and main memory; retirement, load/store reorder, store, miss, and cast-out queues)
We utilize a generic, parameterized, out-of-order 8-way
superscalar processor model called Turandot [16, 17] with
32KB I and D-caches and a 2MB L2 cache. The overall
pipeline structure (as reported in [16]), is repeated here in
Figure 2. The modeled baseline microarchitecture is simi-
lar to a current generation microprocessor. As described in
[16], this research simulator was calibrated against a pre-
RTL, detailed, latch-accurate processor model. Turandot
supports a large number of parameters, including the config-
urable pipeline latencies discussed below.
Table 1 details the latency values in processor cycles for
the 19 FO4 base design point of this study. We assume a 2
FO4 latch overhead and 1 FO4 clock skew and jitter over-
head. The 19 FO4 latency values are then scaled with the
FO4-depth (after accounting for latch and clock skew over-
head). Each latency in Table 1 has two values: the first la-
beled STD, is for our detailed simulation model, and the
second labeled INF, assumes infinite I-Cache, I-TLB, D-
TLB, and a perfect front-end. The INF simulator model
is used for validating the analytical model described in Sec-
tion 3.
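One plausible reading of the latency-scaling rule described above is sketched below: a latency expressed in 19 FO4 cycles is converted into an amount of logic work (using the 16 FO4 of logic left per cycle after the 3 FO4 latch and skew overhead) and then re-divided by the logic time available per cycle at the new design point. The rounding rule and helper function are our assumptions, not the paper's exact procedure.

```python
import math

# Rescale a latency given in cycles at the 19 FO4 design point to another
# FO4 design point. Assumes 2 FO4 latch + 1 FO4 skew/jitter overhead per stage,
# so a 19 FO4 cycle contains 16 FO4 of useful logic. Illustrative only.
LATCH_SKEW_FO4 = 3.0

def rescale_latency(cycles_at_19fo4, fo4_pipeline):
    logic_fo4_base = 19.0 - LATCH_SKEW_FO4          # 16 FO4 of logic per base cycle
    logic_fo4_new = fo4_pipeline - LATCH_SKEW_FO4   # logic per cycle at the new depth
    work = cycles_at_19fo4 * logic_fo4_base         # total logic work in FO4
    return math.ceil(work / logic_fo4_new)

# Example: the 9-cycle L2 D-Load at 19 FO4 becomes 21 cycles at 10 FO4
# under these assumptions.
print(rescale_latency(9, 10))
```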
4.2 Power Simulation Methodology
To estimate power dissipation, we use the PowerTimer
toolset developed at IBM T.J. Watson Research Center [3, 1]
as the starting point for the simulator used in this work.
PowerTimer is similar to power-performance simulators de-
veloped in academia [2, 23, 24], except for the methodology
to build energy models.
Figure 3. PowerTimer Energy Models (each macro's power is modeled as Power = C_i × SF + HoldPower; the macro equations within a sub-unit, i.e., a microarchitecture-level structure, are summed and driven by switching-factor data to produce the power estimate)
Figure 3 above depicts the derivation of the energy mod-
els in PowerTimer. The energy models are based on circuit-
level power analysis that has been performed on structures
in a current, high performance PowerPC processor. The
power analysis has been performed at the macro level using
a circuit-level power analysis tool [18]. Generally, multi-
ple macros combine to form one micro-architectural level
structure which we will call a sub-unit. For example, the
fixed-point issue queue (one sub-unit) might contain sep-
arate macros for storage memory, comparison logic, and
control. Power analysis has been performed on each macro
to determine the macro’s unconstrained (no clock gating)
power as a function of the input switching factor. In addi-
tion, the hold power, or power when no switching is occur-
ring (SF = 0%), is also determined. Hold power primarily
consists of power dissipated in latches, local clock buffers,
the global clock network, and data-independent fraction of
the arrays. The switching power, which is primarily combi-
natorial logic and data-dependent array power dissipation,
is the additional power that is dissipated when switching
factors are applied to the macro’s primary inputs. These
two pieces of data allow us to form simple linear equations
for each macro’s power. The energy model for a sub-unit is
determined by summing the linear equations for each macro
within that sub-unit. We have generated these power models
for all microarchitecture-level structures (sub-units) mod-
eled in our research simulator [16, 17]. PowerTimer models
over 60 microarchitectural structures which are defined by
over 400 macro-level power equations.
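The macro-level energy models lend themselves to a very simple implementation; the sketch below shows the linear per-macro form and the summation into a sub-unit model, with made-up coefficients standing in for the circuit-extracted values.

```python
# Sketch of PowerTimer-style energy models: each macro is a linear function of
# its input switching factor (SF), and a sub-unit is the sum of its macros.
# The coefficients below are invented, not circuit-extracted values.

class Macro:
    def __init__(self, hold_power, sf_coeff):
        self.hold_power = hold_power    # power at SF = 0 (latches, clocks, arrays)
        self.sf_coeff = sf_coeff        # incremental power per unit switching factor

    def power(self, sf):
        return self.hold_power + self.sf_coeff * sf

# Fixed-point issue queue modeled as three macros (storage, compare, control).
fxq_macros = [Macro(0.40, 0.90), Macro(0.15, 0.50), Macro(0.10, 0.20)]

def subunit_power(macros, sf):
    """Unconstrained (no clock gating) power of a sub-unit at switching factor sf."""
    return sum(m.power(sf) for m in macros)

print(round(subunit_power(fxq_macros, 0.25), 3))   # e.g. 0.25 average input SF
```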
PowerTimer uses microarchitectural activity information
from the Turandot model to scale down the unconstrained
hold and switching power on a per-cycle basis under a vari-
ety of clock gating assumptions. In this study, we use a real-
istic form of clock gating which considers the applicability
of clock gating on a per-macro basis to scale down either
the hold power or the combined hold and switching power
depending on the microarchitectural event counts. We de-
termine which macros can be clock gated in a fine-grained
manner (per-entry or per-stage clock gating) and which can
be clock gated in a coarse-grained manner (the entire unit
must be idle to be clock gated). For some macros (in par-
ticular control logic) we do not apply any clock gating; this
corresponds to about 20-25% of the unconstrained power
dissipation. The overall savings due to clock gating relative
to the unconstrained power is roughly 40-45%.
In order to quantify the power-performance efficiency of
pipelines of a given FO4-depth, and to scale the power dissi-
pation from the power models of our base FO4 design point
across a range of FO4-depths, we have extended the Power-
Timer methodology as discussed below.
Power dissipated by a processor consists of dynamic and
leakage components, P = P_dynamic + P_leakage. The dynamic
power data measured by PowerTimer (for a particular design
point) can be expressed as
$$P^{base}_{dynamic} = C\,V^2 f\,(\alpha + \beta)\,CGF,$$
where α is the average "true" switching factor in circuits;
i.e., α represents transitions required for the functionality
of the circuit and is measured as the switching factor by an
RTL-level simulator run under the zero-delay mode. In contrast,
β is the average glitching factor that accounts for spurious
transitions in circuits due to race conditions. Thus, (α + β)
is the average number of transitions actually seen inside
circuits. Both α and β are averaged over the whole processor
over non-gated cycles with appropriate energy weights (the
higher the capacitance at a particular node, the higher the
corresponding energy weight). CGF is the clock gating factor,
defined as the fraction of cycles where the microarchitectural
structures are not clock gated. The CGF is measured from our
PowerTimer runs at each FO4 design point as described above.
The remaining terms C, V, and f are the effective switching
capacitance, chip supply voltage, and clock frequency,
respectively.
Next we analyze how each of these factors scales with FO4
pipeline depth. To facilitate the explanation we define the
following variables: FO4_logic, FO4_latch, and FO4_pipeline,
designating the depth of the critical path through logic in one
pipeline stage, the latch insertion delay including clock skew
and jitter, and the sum of the two quantities, respectively,
i.e., FO4_pipeline = FO4_logic + FO4_latch. We use FO4 and
FO4_pipeline interchangeably. In the remainder of the paper,
the qualifier 'base' in all quantities designates the value of
the quantities measured for the base 19 FO4 design.
Frequency: FreqScale is the scaling factor used to account
for the change in clock frequency with pipeline depth. This
factor applies to both hold power and switching power:
$$FreqScale = \frac{FO4^{base}_{pipeline}}{FO4_{pipeline}}$$
Latch: With fixed logic hardware for given logic functions,
the primary change in the chip effective switching capacitance
C with pipeline depth is due to changes in the latch count with
the depth of the pipeline. LatchScale is a factor that
appropriately adjusts the hold power dissipation, but does not
affect the switching power dissipation:
$$LatchScale = LatchRatio \cdot \left(\frac{FO4^{base}_{logic}}{FO4_{logic}}\right)^{LGF}$$
where LatchRatio defines the ratio of hold power to the total
power and the LatchGrowthFactor (LGF) captures the growth of
latch count due to the logic shape functions. The amount of
additional power that is spent in latches and clock in a more
deeply pipelined design depends on the logic shape functions of
the structures that are being pipelined. Logic shape functions
describe the number of latches that would need to be inserted
at any cut point in a piece of combinatorial logic if it were
to be pipelined. Values of LGF > 1 recognize the fact that for
certain hardware structures the logic shape functions are not
flat, and hence the number of latches in the more deeply
pipelined design increases super-linearly with pipeline depth.
In our baseline model, we assume an LGF of 1.1 and study the
sensitivity of the optimal pipeline depth to this parameter in
Section 6.1.
Clock Gate Factor: In general, CGF decreases with
deeper pipelines because the amount of clock gating poten-
tial increases with deeper pipes. This increased clock gating
potential is primarily due to the increased number of cycles
where pipeline stages are in stall conditions. This in turn
leads to an increase in the clock gating potential on a per
cycle basis. CGF is workload dependent and is measured
directly from simulator runs.
Glitch: The final two factors that must be considered
for dynamic power dissipation when migrating to deeper
pipelines are α and β, the chip-wide activity and glitching
factors.
The “true” switching factor α does not depend on the
pipeline depth, since it is determined by the functionality
of the circuits. The glitching factor at any net, on the other
hand, is determined by the difference in delay of paths from
the output of a latch that feeds the circuit to the gate that
drives that net. Once a glitch is generated at some net, there
is a high probability that it will propagate down the circuit
until the input of the next latch down the pipeline. Further-
more, the farther the distance from the latch output to the
inputs of a gate the higher the probability of the existence
of non-equal paths from the output of the latch to the inputs
of this gate. Therefore the average number of spurious
transitions grows with FO4_logic: the higher the FO4, the
higher the average glitching factor. Experimental data,
collected by running a dynamic circuit-level simulator
(PowerMill) on post-layout extracted netlists of sample
functional units, show that the average glitching factor β can
be modeled as being linearly dependent on the logic depth:
$$\beta = \beta_{base}\,\frac{FO4_{logic}}{FO4^{base}_{logic}}$$
To account for the effect of the dependence of β on pipeline
depth, we introduce the following factor, which applies only to
the switching power:
$$GlitchScale = (1 - LatchRatio)\,\frac{\alpha + \beta}{\alpha + \beta_{base}} = \frac{1 - LatchRatio}{1 + \beta_{base}/\alpha}\left(1 + \frac{\beta_{base}}{\alpha}\cdot\frac{FO4_{logic}}{FO4^{base}_{logic}}\right)$$
In this formula β_base is the actual glitching factor averaged
over the baseline microprocessor for the base FO4 design point.
Notice that β_base appears in the formula only in the ratio
β_base/α. This is consistent with our experimental results
showing that the glitching factor β is roughly proportional to
the "true" switching factor α for the range 0 < α < 0.3 (for
higher values of α the growth of β typically saturates). For
the set of six sample units that we simulated, with the logic
depth ranging from 6 FO4 to 20 FO4, the ratio β/α was found to
be roughly proportional to the logic depth of the simulated
units, FO4_logic, with the coefficient equal to
0.3/FO4^base_logic. Based on these simulation results we set
β_base/α = 0.3 for the whole microprocessor in the remainder of
this section, and study the sensitivity of the results to
variations in the glitching factor in Section 6.4.
Leakage Currents: As the technology feature size scales down
and the power supply and transistor threshold voltages scale
accordingly, the leakage power component becomes more and more
significant. Since the magnitude of the leakage power component
is affected by the pipeline depth, it is essential to include
the effect of the leakage power in the analysis of the optimal
pipeline depth. Assuming that the leakage power is proportional
to the total channel width of all transistors in the
microprocessor, we model the dependence of the leakage power on
the depth of the pipeline as follows:
$$P^{FO4}_{leakage} = P^{base}_{leakage}\left(1 + \frac{w_{latch}}{w_{total}}\left[\left(\frac{FO4^{base}_{logic}}{FO4_{logic}}\right)^{LGF} - 1\right]\right)$$
where LGF is the LatchGrowthFactor defined earlier, and
w_latch/w_total is the ratio of the total channel width of
transistors in all pipeline latches (including local clock
distribution circuitry) to the total transistor channel width
in the base (19 FO4) microprocessor (excluding all low-leakage
transistors that might be used in caches or other on-chip
memories). If the technology supports multiple thresholds, or
any of the recently introduced leakage reduction techniques are
used on a unit-by-unit basis, such as MTCMOS, back biasing,
power-down, or transistor stacking, then the above formula for
the leakage power component needs to be modified accordingly;
we leave the detailed study of these effects for future work.
For the remainder of the study we set w_latch/w_total to 0.2
for the base 19 FO4 pipeline.
Also, rather than giving the absolute value for the leakage
current P^base_leakage in the base microprocessor, we count it
as a fraction of the dynamic power of the base design,
P^base_leakage = LeakageFactor^base · P^base_dynamic. We set
the LeakageFactor^base to the value of 0.1, typically quoted
for state-of-the-art microprocessors, and analyze the
sensitivity of the results to the LeakageFactor in Section 6.5.
Total Power: The following equation expresses the relationship
between the dynamic power of the base FO4 design,
P^base_dynamic, the leakage power, and the scaled power for
designs with different depths of the pipeline, P^FO4_total,
considering all of the factors above:
$$P^{FO4}_{total} = CGF \cdot FS \cdot (LS + GS) \cdot P^{base}_{dynamic} + P^{FO4}_{leakage}$$
where FS is FreqScale, LS is LatchScale, and GS is GlitchScale.
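Putting the pieces together, the sketch below evaluates the total-power expression above over a range of pipeline depths, plugging in the baseline parameter values stated in the text (19 FO4 base with 3 FO4 latch and skew overhead, LGF = 1.1, LatchRatio = 0.7, β_base/α = 0.3, w_latch/w_total = 0.2, LeakageFactor = 0.1). The clock gating factor CGF is workload-dependent and measured from simulation, so a flat placeholder of 1.0 is used here; treat the output as a shape illustration, not as the data behind Figure 4.

```python
# Scaling the base-design power to other pipeline depths (Section 4.2 model).
FO4_BASE, LATCH_FO4 = 19.0, 3.0             # base pipeline depth, latch + skew overhead
LGF, LATCH_RATIO = 1.1, 0.7                 # latch growth factor, hold-power fraction
BETA_OVER_ALPHA = 0.3                       # glitching-to-switching ratio at the base point
W_LATCH_RATIO, LEAKAGE_FACTOR = 0.2, 0.1    # latch channel-width and leakage fractions

def total_power_ratio(fo4_pipeline, cgf=1.0, p_dyn_base=1.0):
    """P_total(FO4) / P_dyn_base per the equations in the text.
    cgf is a workload-dependent clock gating factor (placeholder default)."""
    logic = fo4_pipeline - LATCH_FO4
    logic_base = FO4_BASE - LATCH_FO4
    freq_scale = FO4_BASE / fo4_pipeline
    latch_scale = LATCH_RATIO * (logic_base / logic) ** LGF
    glitch_scale = ((1.0 - LATCH_RATIO) / (1.0 + BETA_OVER_ALPHA)
                    * (1.0 + BETA_OVER_ALPHA * logic / logic_base))
    p_dynamic = cgf * freq_scale * (latch_scale + glitch_scale) * p_dyn_base
    p_leakage = (LEAKAGE_FACTOR * p_dyn_base
                 * (1.0 + W_LATCH_RATIO * ((logic_base / logic) ** LGF - 1.0)))
    return p_dynamic + p_leakage

for fo4 in (28, 23, 19, 14, 10, 7):
    print(fo4, round(total_power_ratio(fo4), 3))
```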






 !"
#$%& 
Figure 4. Power Growth Breakdown
Figure 4 shows contributions of different factors in the
above formula, depending on the pipeline depth of the de-
sign FO4
pipeline
. The 19 FO4 design was chosen as a base
pipeline.
The line labeled “combined” shows the cumulative in-
crease or decrease in power dissipation. The line labeled
“only clock gate” quantifies the amount of additional clock
gating power savings for deeper pipelines. The relative ef-
fect of scaling in clock gating is fairly minor with slightly
more than 10% additional power reduction when going
from the 19 FO4 to 7 FO4 design points. There are several
reasons why the effect of clock gating is not larger. First,
the fraction of power dissipation that is not eligible to be
clock gated becomes larger with more clock gating leading
to diminishing returns. Second, some of the structures are
clock gated in a coarse-grained fashion and while the aver-
age utilization of the structure may decrease it must become
idle in all stages before any additional savings can be real-
ized. Finally, we observe that clock gating is more difficult
in deeper pipelined machines, because it is harder to deliver
cycle-accurate gating signals at lower FO4.
The two lines labeled “only freq” and “only hold” show
the power factors due to only frequency and hold power
scaling, respectively. (Although these two factors increase
linearly with clock frequency, we are plotting against
FO4-depth, which is proportional to 1/frequency.) Overall,
dynamic power increases more than quadratically with increased
pipeline depth.
Figure 4 shows that the leakage component grows much
less rapidly than the dynamic component with the increas-
ing pipeline depth. There are two primary reasons for this.
First, the leakage power does not scale with frequency.
Second, the leakage power growth is proportional to the
fraction of channel width of transistors in pipeline latches,
whereas the latch dynamic hold power growth is propor-
tional to the fraction of the dynamic power dissipated in
pipeline latches. Obviously, the former quantity is much
smaller than the latter.
4.3 Workloads and Metrics Used in the Study
In this paper, we report experimental results based on
PowerPC traces of a set of 21 SPEC2000 benchmarks,
namely, ammp, applu, apsi, art, bzip2, crafty, equake, fac-
erec, gap, gcc, gzip, lucas, mcf, mesa, mgrid, perl, six-
track, swim, twolf, vpr, and wupwise. We have also used
a 172M instruction trace of the TPC-C transaction process-
ing benchmark. The SPEC2000 traces were generated using
the tracing facility called Aria within the MET toolkit [17].
The particular SPEC2000 trace repository used in this study
was created by using the full reference input set. However,
sampling was used to reduce the total trace length to 100
million instructions per benchmark program. A systematic
validation study to compare the sampled traces against the
full traces was done in finalizing the choice of exact sam-
pling parameters [12].
We use BIPS³/W (the inverse of energy × delay²) as a basic
energy-efficiency metric for comparing different FO4 designs in
the power-performance space. The choice of this metric is based
on the observation that dynamic power is roughly proportional
to the square of the supply voltage (V) multiplied by clock
frequency, and clock frequency is roughly proportional to V.
Hence, power is roughly proportional to V³, assuming a fixed
logic/circuit design. Thus, delay cubed multiplied by power
provides a voltage-invariant power-performance characterization
metric, which we feel is most appropriate for server-class
microprocessors (see the discussion in [1, 7]). In fact, it was
formally shown in [26] that optimizing performance subject to a
constant power constraint leads to the BIPS³/W metric in
processors operating at a supply voltage near the maximum
allowed value in state-of-the-art CMOS technologies. As a
comparative measure, we also consider BIPS/W (energy) and
BIPS²/W (energy-delay [8]) in the baseline power-performance
study.
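The voltage-invariance argument can be checked numerically: if BIPS scales roughly with V and power with V³, then BIPS³/W is unchanged under voltage scaling while BIPS/W and BIPS²/W are not. The sketch below demonstrates this with arbitrary starting numbers.

```python
# Why BIPS^3/W is (roughly) voltage-invariant: BIPS ~ V and W ~ V^3 under the
# assumptions in the text, so the V terms cancel only for the cubed metric.
bips, watts = 1.5, 60.0          # arbitrary starting design point

for v_scale in (0.8, 1.0, 1.2):  # scale supply voltage by +/- 20%
    b = bips * v_scale           # performance ~ frequency ~ V
    w = watts * v_scale ** 3     # power ~ V^2 * f ~ V^3
    print(v_scale,
          round(b / w, 4),        # BIPS/W   : changes with V
          round(b ** 2 / w, 4),   # BIPS^2/W : changes with V
          round(b ** 3 / w, 4))   # BIPS^3/W : constant across V
```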
5 Analytical Model and Simulation Results
5.1 Analytical Model Validation










Figure 5. BIPS: Analytical Model vs. Simulator
We now present the performance results using the sim-
ple analytical model (see Section 3) and compare it with
the results using our detailed cycle-accurate simulator. The
results in this section are presented for the average of the
SPEC2000 benchmarks described in Section 4.3. For fair
comparison, we have modified the simulator to include a
perfect front-end of the machine. The model and the sim-
ulator use latencies shown in the column labeled “INF” in
Table 1.
Figure 5 shows BIPS as a function of the FO4 delay per
stage of the pipeline. The BIPS for the analytical model
was computed after determining the stalls for the different
pipelines using independent workload analysis as explained
in Section 3.
From Figure 5 we observe that the performance optimal
pipeline depth is roughly 10 FO4 delay per pipeline stage
for both the model and the simulator. The BIPS estimated
using the model correlates well with the BIPS estimated
using the simulator except for very large and very small
FO4-depth machines. Since the stalls are determined in the
model independent of the pipeline and only once for a work-
load, it is possible that for shallow pipelines with large FO4
delay per stage, the model underestimates the total stall cy-
cles, and hence the BIPS computed by the model is higher
than the BIPS obtained from the simulator. The analytical
model currently only uses data dependent stalls to derive
the optimal pipeline depth. Hence, for the purpose of the
validation experiment, the model and the simulator assume
a perfect front-end. However, to estimate the effect of re-
source stalls we modeled a large but finite register renamer,
instruction buffer, and miss queue in the simulator keeping
the rest of the front-end resources infinite and perfect. For
deeper pipelines ( 10 FO4), we observe from the simula-
tor that the stalls due to the resources become appreciable
and the analytical model begins to overestimate the BIPS
relative to the simulator.




!"
Figure 6. BIPS
3
/W: Model vs. Simulator
Figure 6 shows that the optimal FO4-depth that maximizes
BIPS³/W is around 19-24 FO4 for the simulator.
We observe that the analytical model tracks the simulator
results more closely while optimizing performance alone as
seen in Figure 5. Since the analytical model overestimates
the performance for shallow pipelines, the cubing of BIPS
in Figure 6 compounds these errors.
5.2 Detailed Power-Performance Simulation







 
 
 
 
 
 
 
 
 
 
Figure 7. Simulation Results for SPEC2000
In the remainder of this work we consider the power and
performance results using the detailed performance simu-
lator with parameters corresponding to the STD column in
Table 1. Figure 7 shows the results for five metrics: BIPS,
IPC, BIPS/W, BIPS²/W, and BIPS³/W.
Figure 7 shows that the optimal FO4-depth for performance
(defined by BIPS) is 10 FO4, although pipelines of 8 FO4 to
15 FO4 are within 5% of the optimal. Because of the
super-linear increase in power dissipation and sub-linear
increases in overall performance, BIPS/W always decreases with
deeper pipelines. BIPS³/W shows an optimum point at 18 FO4.
BIPS³/W decreases sharply after the optimum, and at the
performance-optimal pipeline depth of 10 FO4 the BIPS³/W metric
is reduced by 50% relative to the 18 FO4 depth. For metrics
that place less emphasis on performance, such as BIPS²/W and
BIPS/W, the optimal point shifts towards shallower pipelines,
as expected. For BIPS²/W, the optimal pipeline depth is
achieved at 23 FO4.






 
 
 
 
 
 
 
 
 
 
Figure 8. Simulation Results for TPC-C
Figure 8 presents similar results for the TPC-C trace.
The optimal BIPS for TPC-C is very flat from 10-14 FO4
(within 1-2% for all 5 design points which is much flat-
ter than SPEC2000). Using BIPS³/W, the optimal pipeline
depth shifts over to 25-28 FO4. The main reason that the
optimal point is shallower for TPC-C is that BIPS decreases
less dramatically with decrease in pipeline length (relative
to SPEC2000). This is slightly counterbalanced because
power increases at a slower rate for TPC-C with deeper
pipes, because the additional amount of clock gating is more
pronounced due to large increases in the number of stall cy-
cles relative to the SPEC2000 suite.
6 Sensitivity Analysis
The derived equations that model the dependence of
the power dissipation on the pipeline depth depend on
several parameters. Some of these parameters, although
accurately measured for the baseline microprocessor, are
likely to change from one design to another, whereas oth-
ers are difficult to measure accurately. In this section, we
perform sensitivity analysis of the optimal pipeline depth
to key parameters of the derived power models such as
LatchGrowthFactor, LatchRatio, latch insertion delay
(FO4_latch), GlitchRatio, and LeakageFactor.
6.1 Latch Growth Factor
LatchGrowthFactor (LGF) is determined by the intrinsic logic
shape functions of the structures that are being pipelined. We
have analyzed many of the major microarchitectural structures
to identify ones that are likely to have a LatchGrowthFactor
greater than 1. One structure that we
will highlight is the Booth recoder and Wallace tree which
is common in high-performance floating point multipliers
[13, 20], as shown in Figure 9. Figure 9 shows the exponen-
tial reduction in the number of result bits as the data passes
through the 3-2 and 4-2 compressors. We have estimated
the amount of logic that can be inserted between latch cut
points for 7, 10, 13, 16, and 19 FO4 designs by assuming
3-4 FO4 delay for 3-2 and 4-2 compressor blocks. As the 7
and 10 FO4 design points require latch insertions just after
the Booth multiplexor (where there are 27 partial products),
there would be a large increase in the number of latches re-
quired for these designs. We note that the 7 FO4 design also
requires a latch insertion after the Booth recode stage.
Figure 9. Wallace Tree Diagram and Latch Cut
points for 7/10/13/16/19 FO4
Figure 10 gives estimates for the cumulative number of
latches in the FPU design as a function of the FO4 depth
of the FPU. For example, the first stage of the 10 FO4 FPU
requires 3x as many latches as the first stage of the 19 FO4
FPU because the first latch cut point of the 19 FO4 FPU is
beyond the initial 9:2 compressor tree. Overall, the 10 FO4
FPU requires nearly 3x more latches than the 19 FO4 FPU.
Figure 10. Cumulative latch count for the FPU (x-axis: cumulative FO4 depth, logic + latch overhead; y-axis: cumulative number of latches; curves for the 19, 16, 13, and 10 FO4 designs)
There are many other areas of parallel logic that are
likely to see super-linear increases in the number of latches
such as structures with decoders, priority encoders, carry
look ahead logic, etc. Beyond the pipelining of logic struc-
tures, deeper pipelines may require more pre-decode infor-
mation to meet aggressive cycle times, which would require
more bits to be latched in intermediate stages. On the other
hand, the number of latches that comprise storage bits in
various on-chip memory arrays (such as register files and
queues) does not grow at all with the pipeline depth, mean-
ing that the LGF = 0 for those latches. Thus, designs with
overall LGF < 1 are also possible.






 !"
   
Figure 11. BIPS
3
/W varying LatchGrowthF actor
Figure 11 quantifies the dependence of the optimal pipeline
depth on LGF using the BIPS³/W metric. It shows that the
optimal pipeline FO4 tends to increase as LGF
that the optimal pipeline FO4 tends to increase as LGF
increases above the value of 1.1, assumed in the baseline
model. As a point of reference, our estimate for the latch
growth factor for the 10 FO4 vs. 19 FO4 Booth recoder and
Wallace tree is LGF = 1.9, while for the entire FPU LGF
is slightly less than 1.7.
6.2 Latch Power Ratio
Latch, clock, and array power are the primary com-
ponents of power dissipation in current generation CPUs.
This is especially true in high-performance, superscalar
processors with speculative execution which require the
CPU to maintain an enormous amount of architectural
and non-architectural state. One possible reason why the
LatchRatio could be smaller than the base value of 0.7
chosen in Section 4, is if more energy-efficient SRAM ar-
rays are used in high-power memory structures instead of
latches to reduce the data independent array power (which
we include as part of hold power).
















  
Figure 12. BIPS³/W varying LatchRatio
Figure 12 shows the optimal FO4 design point while
varying the LatchRatio of the machine from 80% to 40%.
We see that while the optimal FO4 point remains 18 FO4, it
is less prohibitive to move to deeper pipelines with smaller
latch-to-logic ratios. For example, with a LatchRatio of
0.4, the 13 FO4 design point is only 19% worse than the
optimal one while it is 27% worse than optimal with a
LatchRatio of 0.6.
6.3 Latch Insertion Delay
Latch FO4 2 3 4 5
Relative Latch Energy 1.0 0.53 0.36 0.29
Relative Clocking Energy 1.0 0.62 0.49 0.43
Table 2. Latch Insertion Delay (excluding skew and
jitter) vs. Relative Latch Energy
With the large amount of power spent in latches and
clocking, designers may consider the tradeoff between latch
delay and power dissipation as a means to design more
energy-efficient CPUs. Researchers have investigated latch
power vs. delay tradeoff curves both within a given latch
family and across latch families [25, 10]. Table 2, derived
from [25], shows latch FO4-delay vs. latch energy across
several latch styles. The first row of Table 2 shows the latch
insertion delay, excluding clock skew and jitter overhead
which is assumed to be constant for all latches. The second
row shows the relative latch energy, excluding energy dis-
sipated in the clock distribution. The third row shows the
relative energy of the clock system, including both energy
of latches (70% of the total) and clock distribution system
(30%). It is assumed that the clock distribution energy can-
not be completely scaled with reduced latch load. There is
significant overhead simply from driving the wires neces-
sary to distribute the clock over a large area.









Figure 13. BIPS varying latch power-delay
Replacing the baseline fast 2 FO4 latches with slower,
lower power latches increases the latch insertion delay
overhead, which impacts both the performance-optimal
and power-performance-optimal pipeline depth. Figure 13
shows the processor performance versus pipeline depth for
four latches from Table 2. We see that the performance
maxima shift towards shallower pipelines as the latch in-
sertion delay increases. For example, with a 3 FO4 latch
the performance-optimal FO4-depth is 11 FO4 and with a 4
FO4 latch it becomes 16 FO4.







  ! " #
   
Figure 14. BIPS
3
/W varying latch power-delay
Figure 14 shows the impact of the latch insertion delay on
the BIPS³/W rating of the processor for the same
range of pipeline depths. In this figure, all of the data points
are shown relative to the 10 FO4 design with the base 5
FO4 latch. Unlike curves on all previous sensitivity graphs,
curves in Figure 14 do not intersect at the base design point,
because different curves represent different designs, with
different power and performance levels.
Figure 14 shows that using the fastest 2 FO4 latch results
in the best BIPS³/W rating for processors with stages of less
than 14 FO4. For processors with pipelines ranging from 15 FO4
to 24 FO4, a lower power 3 FO4 latch is the most energy
efficient, whereas for shallower pipelines (25 FO4 or more) the
highest BIPS³/W is achieved with even slower 4 FO4 latches.
The use of the 3 FO4 latch, combined with the choice of a
pipeline depth in the range from 15 FO4 to 24 FO4, improves
the BIPS³/W rating of the processor by more than 10% compared
to the base case of 2 FO4 latches. The graph also shows that
the optimal BIPS³/W design point shifts towards shallower
pipelines as high-performance latches are replaced with lower
power ones. For example, the 18 FO4 design point is optimal for
a processor using 2 FO4 latches, whereas the 19 FO4, 20 FO4,
and 21 FO4 design points are optimal for processors using
3 FO4, 4 FO4, and 5 FO4 latches, respectively.
6.4 Glitch Factor
In this subsection we quantify the sensitivity of the optimal
pipeline depth to the glitching factor. There are no practical
means for accurately measuring the actual value of β_base/α
averaged over the whole microprocessor. Instead we measured the
glitching factor for a selected set of functional units and
used the averaged value of β_base/α = 0.3 throughout the
analysis. Here we analyze the sensitivity of the optimal
pipeline depth to the value of β_base/α.
Figure 15 shows the dependence of the BIPS³/W rating of the
processor on the pipeline depth for three values of β_base/α.
From this figure we see that higher glitching factors favor
deeper pipelines. However, in the base design the decrease in
power dissipation related to reduced glitching in deeper
pipelines did not have a substantial impact on the optimal
pipeline depth, primarily because of the relatively small
fraction of power dissipated in combinatorial switching. For
designs with smaller LatchRatio values, this effect could be
more significant.

Figure 15. BIPS³/W varying β_base/α
6.5 Leakage Factor
As explained earlier, the leakage power component
grows more slowly with the pipeline depth than the dy-
namic component. Therefore, the optimum pipeline depth
depends on the LeakageFactor. Throughout the analysis we
assumed the LeakageFactor (P^base_leakage / P^base_dynamic) of
the base 19 FO4 microprocessor to be 0.1. However, as the
technology feature size scales down and the power supply and
transistor threshold voltages scale accordingly, the leakage
power component becomes more and more significant. To study the
effect of this growing leakage power fraction, we measured the
sensitivity of the optimal pipeline depth to the value of the
LeakageFactor.









 !"
 
Figure 16. BIPS
3
/W varying LeakageF actor
Figure 16 shows the BIPS³/W rating of the processor versus
pipeline depth for four values of the LeakageFactor: a value of
0 that represents older CMOS technologies, a value of 0.1,
assumed in the current model, and values of 0.5 and 1.0,
projected for future generation CMOS technologies (arguably,
extreme values). The results in Figure 16 show that unless
leakage reduction techniques become standard practice in the
design of high-end microprocessors, the high values of the
LeakageFactor projected for future generations of CMOS
technologies may tend to shift the optimum pipeline depth
towards slightly deeper pipelines. For current generation
technologies the result for the optimal pipeline depth is
sufficiently stable with respect to reasonable variations in
the LeakageFactor.
Summary of the Sensitivity Analysis: In this section we
considered the sensitivity of the optimal pipeline length to
five key parameters in the power models, using the BIPS³/W
metric. We did not observe a strong dependence of the results
on the assumptions and choices of any of these parameters,
which demonstrates the stability of the model and its
applicability to a wide range of designs. To summarize the
results: higher values of the LatchGrowthFactor favor shallower
pipelines, lower values of the LatchRatio favor deeper
pipelines, the use of lower-power latches favors shallower
pipelines, higher values of the GlitchFactor favor deeper
pipelines, and, finally, higher leakage currents favor deeper
pipelines.
7 Conclusions
In this paper, we have demonstrated that it is impor-
tant to consider both power and performance while opti-
mizing pipelines. For this purpose, we derived detailed
energy models using circuit-extracted power analysis for
microarchitectural structures. We also developed detailed
equations for how the energy functions scale with pipeline
depth. Based on the combination of power and perfor-
mance modeling performed, our results show that a purely
performance-driven, power-unaware design may lead to the
selection of an overly deep pipelined microprocessor oper-
ating at an inherently power-inefficient design point.
As this work is the first quantitative evaluation of power
and performance optimal pipelines, we also performed a
detailed sensitivity analysis of the optimal pipeline depth
against key parameters such as latch growth factor, latch ra-
tio, latch insertion delay, glitch, and leakage currents. Our
analysis shows that there is a range of pipeline depth for
which performance increases can be achieved at a mod-
est sacrifice in power-performance efficiency. Pipelining
beyond that range leads to drastic reduction in power-
performance efficiency with little or no further performance
improvement.
Our results show that for a current generation, out-of-order
superscalar processor, the optimal delay per stage is about
18 FO4 (consisting of a logic delay of 15 FO4 and a 3 FO4 latch
insertion delay) when the objective function is a
power-performance efficiency metric like BIPS³/W; this is in
contrast to an optimal delay of 10 FO4 per stage when
considering the BIPS metric alone. We used a broad suite of
SPEC2000 benchmarks to arrive at this conclusion.
The optimal pipeline depth depends on a number of pa-
rameters in the power models which we have derived from
current state-of-the-art microprocessor design methodolo-
gies. Also, as already established through recent prior work,
such optimal design points generally depend on the in-
put workload characteristics. Our simulation-based experi-
ments on a typical commercial application (TPC-C) shows
that although the optimal pipeline depth is around 10-14
FO4 for performance-only optimization, it increases to 24-
28 FO4 when we consider power and performance opti-
mizations.
In future work, we would like to consider architectural
options other than those available in the base model, for
example, single- versus multi-threading, wide- versus
narrow-issue machines, and in-order versus out-of-order
designs. In addition, we would also like to consider more
circuit-level power-performance tradeoff opportunities and
factor them into the analysis of optimal pipelines. Finally,
there are numerous circuit and technology techniques which
affect leakage and therefore power-performance optimality; we
hope to consider all of these factors in our future work.
References
[1] D. Brooks et al. Power-aware Microarchitecture: Design
and Modeling Challenges for the next-generation micropro-
cessors. IEEE Micro, 20(6):26–44, Nov./Dec. 2000.
[2] D. Brooks, V. Tiwari, and M. Martonosi. Wattch: A frame-
work for architectural-level power analysis and optimiza-
tions. In Proceedings of the 27th Annual International Sym-
posium on Computer Architecture (ISCA-27), June 2000.
[3] D. Brooks, J.-D. Wellman, P. Bose, and M. Martonosi.
Power-Performance Modeling and Tradeoff Analysis for a
High-End Microprocessor. In Power Aware Computing Sys-
tems Workshop at ASPLOS-IX, Nov. 2000.
[4] M. Brown, J. Stark, and Y. Patt. Select-free instruction
scheduling logic. In Proceedings of the 34th International
Symposium on Microarchitecture (MICRO-34), pages 204–
213, December 2001.
[5] P. Dubey and M. Flynn. Optimal pipelining. J. Parallel and
Distributed Computing, 8:10–19, 1990.
[6] P. G. Emma and E. S. Davidson. Characterization of
branch and data dependencies in programs for evaluating
pipeline performance. IEEE Transactions on Computers, C-
36(7):859–875, 1987.
[7] M. J. Flynn, P. Hung, and K. Rudd. Deep-Submicron
Microprocessor Design Issues. IEEE Micro, 19(4):11–22,
July/Aug. 1999.
[8] R. Gonzalez and M. Horowitz. Energy dissipation in gen-
eral purpose microprocessors. IEEE Journal of Solid-State
Circuits, 31(9):1277–84, Sept. 1996.
[9] A. Hartstein and T. R. Puzak. The optimum pipeline depth
for a microprocessor. In Proceedings of the 29th Inter-
national Symposium on Computer Architecture (ISCA-29),
May 2002.
[10] S. Heo, R. Krashinsky, and K. Asanovic. Activity-sensitive
flip-flop and latch selection for reduced energy. In 19th Con-
ference on Advanced Research in VLSI, March 2001.
[11] M. Hrishikesh, K. Farkas, N. Jouppi, D. Burger, S. Keckler,
and P. Sivakumar. The optimal logic depth per pipeline stage
is 6 to 8 FO4 inverter delays. In Proceedings of the 29th
International Symposium on Computer Architecture (ISCA-
29), pages 14–24, May 2002.
[12] V. Iyengar, L. H. Trevillyan, and P. Bose. Representative
traces for processor models with infinite cache. In Proc.
2nd Symposium on High Performance Computer Architec-
ture (HPCA-2), Feb. 1996.
[13] R. Jessani and C. Olson. The floating-point unit of the Pow-
erPC 603e microprocessor. IBM J. of Research and Devel-
opment, 40(5):559–566, Sept. 1996.
[14] P. Kogge. The Architecture of Pipelined Computers. Hemi-
sphere Publishing Corporation, 1981.
[15] S. R. Kunkel and J. E. Smith. Optimal pipelining in super-
computers. In Proceedings of the 13th International Sympo-
sium on Computer Architecture (ISCA-13), pages 404–411,
June 1986.
[16] M. Moudgill, P. Bose, and J. Moreno. Validation of Tu-
randot, a fast processor model for microarchitecture ex-
ploration. In Proceedings of the IEEE International Per-
formance, Computing, and Communications Conference
(IPCCC), pages 451–457, Feb. 1999.
[17] M. Moudgill, J. Wellman, and J. Moreno. Environment
for PowerPC microarchitecture exploration. IEEE Micro,
19(3):9–14, May/June 1999.
[18] J. S. Neely, H. H. Chen, S. G. Walker, J. Venuto, and
T. Bucelot. CPAM: A common power analysis methodol-
ogy for high-performance VLSI design. In Proc. of the 9th
Topical Meeting on the Electrical Performance of Electronic
Packaging, pages 303–306, 2000.
[19] S. Palacharla, N. Jouppi, and J. Smith. Complexity-Effective
Superscalar Processors. In Proceedings of the 24th Inter-
national Symposium on Computer Architecture (ISCA-24),
1997.
[20] P. Song and G. D. Micheli. Circuit and architecture trade-
offs for high-speed multiplication. IEEE Journal of Solid-
State Circuits, 26(9):1184–1198, Sept. 1991.
[21] E. Sprangle and D. Carmean. Increasing processor perfor-
mance by implementing deeper pipelines. In Proceedings of
the 29th International Symposium on Computer Architecture
(ISCA-29), May 2002.
[22] J. Stark, M. Brown, and Y. Patt. On pipelining dynamic
instruction scheduling logic. In Proceedings of the 33rd In-
ternational Symposium on Microarchitecture (MICRO-33),
pages 57–66, Dec. 2000.
[23] N. Vijaykrishnan, M. Kandemir, M. Irwin, H. Kim, and
W. Ye. Energy-driven integrated hardware-software opti-
mizations using SimplePower. In Proceedings of the 27th
Annual International Symposium on Computer Architecture,
June 2000.
[24] V. Zyuban. Inherently Lower Power High Performance Su-
perscalar Architectures. PhD thesis, University of Notre
Dame, March 2000.
[25] V. Zyuban and D. Meltzer. Clocking strategies and
scannable latches for low power applications. In Proc.
of Int’l Symposium on Low-Power Electronics and Design,
2001.
[26] V. Zyuban and P. Strenski. Unified Methodology for Resolv-
ing Power-Performance Tradeoffs of the Microarchitectural
and Circuit Levels. In Proc. of Int’l Symposium on Low-
Power Electronics and Design, pages 166–171, 2002.