Optimizing Pipelines for Power and Performance
Viji Srinivasan, David Brooks, Michael Gschwind, Pradip Bose,
Victor Zyuban, Philip N. Strenski, Philip G. Emma
IBM T.J. Watson Research Center
Yorktown Heights, NY 10598
Abstract
During the concept phase and definition of next gen-
eration high-end processors, power and performance will
need to be weighted appropriately to deliver competitive
cost/performance. It is not enough to adopt a CPI-centric
view alone in early-stage definition studies. One of the fun-
damental issues confronting the architect at this stage is
the choice of pipeline depth and target frequency. In this
paper we present an optimization methodology that starts
with an analytical power-performance model to derive op-
timal pipeline depth for a superscalar processor. The results
are validated and further refined using detailed simulation
based analysis. As part of the power-modeling methodol-
ogy, we have developed equations that model the variation
of energy as a function of pipeline depth. Our results using
a set of SPEC2000 applications show that when both power
and performance are considered for optimization, the op-
timal clock period is around 18 FO4. We also provide a
detailed sensitivity analysis of the optimal pipeline depth
against key assumptions of these energy models.
1 Introduction
Current generation high-end, server-class processors are
performance-driven designs. These chips are still somewhat
below the power and power density limits afforded by the
package/cooling solution of choice in server markets tar-
geted by such processors. In designing future processors,
however, energy efficiency has become one of
the primary design constraints [7, 1].
In this paper, we analyze key constraints in choosing the
“optimal” pipeline depth (which directly influences the fre-
quency target) of a microprocessor. The choice of pipeline
depth is one of the fundamental issues confronting the ar-
chitect/designer during the very early stage microarchitec-
ture definition phase of high performance, power-efficient
processors. Even from a performance-only viewpoint, this
issue has been important, if only to understand the limits
to which pipelining can scale in the context of real work-
loads [15, 5, 9, 11, 21]. In certain segments of the mar-
ket (typically desktops and low-end servers), there is often
a market-driven tendency to equate delivered end perfor-
mance with the frequency of the processor. Enlightened
customers do understand the value of net system performance;
nonetheless, the instinctive urge to go primarily for the
highest-frequency processor in a given technology generation
is a known weakness even among savvy end users, and therefore
among processor design teams. Recent studies [9, 11, 21]
suggest that there is still room to grow in the pipeline depth
game, with performance optima in the range of 8-11 FO4 inverter
delays per stage (consisting of 6-8 FO4 logic delay and 2-3 FO4
latch delay) for current out-of-order superscalar design
paradigms. Fan-out-of-four (FO4) delay is defined as the delay
of one inverter driving four copies of an equally sized inverter;
the amount of logic and latch overhead per pipeline stage is
often measured in FO4 delays, which implies that deeper
pipelines have a smaller FO4 delay per stage. How-
ever, even in these performance-centric analysis papers, the
authors do point out the practical difficulties of design com-
plexity, verification and power that must be solved in attain-
ing these idealized limits. Our goal in this paper is to exam-
ine the practical, achievable limits when power dissipation
constraints are also factored in. We believe that such analy-
sis is needed to realistically bound the scalability limit in the
next few technology generations. In particular, power dissi-
pation must be carefully minimized to avoid design points
which on paper promise ever higher performance, yet un-
der normal operating conditions, with commodity packag-
ing and air cooling, only deliver a fraction of the theoretical
peak performance.
In this paper, we first develop an analytical model to
understand the power and performance tradeoffs for super-
scalar pipelines. From this model, we derive the optimal
pipeline depth as a function of both power and performance.
Subsequently, the results are validated and further refined
using a detailed cycle-accurate simulator of a current gener-
ation superscalar processor. The energy model for the core
pipeline is based on circuit-extracted power analysis for
structures in a current, high performance PowerPC proces-
sor. We then derive a methodology for scaling these energy
models to deeper andshallower pipelines. A key component
of this methodology is the scaling of latch count growth as
a function of pipeline depth. With these performance and
power models we attempt to determine the optimal pipeline
1
Fan-out-of-four (FO4) delay is defined as the delay of one inverter
driving four copies of an equally sized inverter. The amount of logic and
latch overhead per pipeline stage is often measured in terms of FO4 delay
which implies that deeper pipelines have smaller FO4.
depth for a particular power-performance metric.
Our results based on an analysis of the TPC-C transac-
tion processing benchmark and a large set of SPEC2000
programs indicate that a power-performance optimum is
achieved at much shallower pipeline depths than a purely
performance-focused evaluation would suggest. In addi-
tion, we find there is a range of pipeline depths for which
performance increases can be achieved with a modest sac-
rifice in power-performance efficiency. Pipelining beyond
that range leads to drastic reduction in power-performance
efficiency.
The contributions of this paper are (1) energy mod-
els for both dynamic and leakage power that capture the
scaling of different power components as a function of
pipeline depth; (2) an analytical performance model that
can predict optimal pipeline depth as well as the shift in
the optimal point, when combined with the energy models;
(3) cycle-accurate, detailed power-performance simulation
with a thorough sensitivity analysis of the optimal pipeline
depth against key energy model parameters.
This paper is structured as follows: We discuss the prior,
related work in Section 2. In Section 3, we describe the
proposed analytical model to study pipeline depth effects in
superscalar processors. Section 4 presents the simulation-
based validation methodology. In Section 5 we present re-
sults using both the analytical model and a detailed simula-
tor. In Section 6 we present a detailed sensitivity analysis to
understand the effect of variations in key parameters of the
derived power models on the optimal pipeline depth. We
conclude in Section 7, with pointers to future work.
2 Related Work
Previous work has studied the issue of “optimal” pipeline
depth exclusively under the constraint of maximizing the
performance delivered by the microprocessor.
An initial study of optimal pipeline depths was per-
formed by Kunkel and Smith in the context of supercom-
puters [15]. The machine modeled in that study was based
on a Cray-1S, with delays being expressed as ECL gate lev-
els. The authors studied the achievable performance from
scalar and vector codes as a function of gate levels per
pipeline stage for the Livermore kernels. The study demon-
strated that vector codes can achieve optimum performance
by deep pipelining, while scalar (floating-point) workloads
reach an optimum at shallower pipelines.
Subsequently, Dubey and Flynn [5] revisited the topic
of optimal pipelining in a more general analytical frame-
work. The authors showed the impact of various workload
and design parameters. In particular, the optimal number of
pipeline stages is shown to decrease with increasing over-
head of partitioning logic into pipeline stages (i.e., clock
skew, jitter, and latch delay). In this model, the authors con-
sidered only stalls due to branch mispredictions and did not
consider data dependent stalls due to memory or register
dependencies.
More recently, several authors have reexamined this
topic in the context of modern superscalar processor mi-
croarchitectures. Hartstein and Puzak [9] treat this prob-
lem analytically and verify based on detailed simulation of
a variety of benchmarks for a 4-issue out-of-order machine
with a memory-execute pipeline. Simulation is also used to
determine the values of several parameters of their mathe-
matical model, since these cannot be formulated axiomati-
cally. They report optimal logic delay per pipeline stage to
be 7.7 FO4 for SPEC2000 and 5.5 FO4 for traditional and
Java/C++ workloads. Assuming a latch insertion delay of
3 FO4, this would result in a total delay of about 10.7 FO4
and 8.5 FO4 per pipeline stage, respectively.
Hrishikesh et al. [11] treat the question of logic depth
per pipeline stage empirically based on simulation of the
SPEC2000 benchmarks for an Alpha 21264-like machine.
Based on their assumed latch insertion delay of 1.8 FO4,
they demonstrate that a performance-optimal point is at
logic delay of 6.0 FO4. This would result in a total pipeline
delay of about 8 FO4.
Sprangle and Carmean [21] extrapolate from the current
performance of the Pentium 4 using IPC degradation factors
for adding a cycle to critical processor loops, such as ALU,
L1 and L2 cache latencies, and branch miss penalty for a
variety of application types. The authors compute an opti-
mal branch misprediction pipeline depth of 52 stages, cor-
responding to a pipeline stage total delay of 9.9 FO4 (based
on a logic depth of 6.3 FO4 and a latch insertion delay of 3.6 FO4,
of which 3 FO4 are due to latch delay and 0.6 FO4 represent
skew and jitter overhead).
All of the above studies (as well as ours) assume that mi-
croarchitectural structures can be pipelined without limita-
tion. Several authors have evaluated limits on the scalability
and pipelining of these structures [6, 19, 22, 4, 11].
Collectively, the cited works on optimal pipelining have
made a significant contribution to the understanding of
workloads and their interaction with pipeline structures by
studying the theoretical limits of deep pipelining. How-
ever, prior work does not address scalability with respect to
increased power dissipation that is associated with deeper
pipelines. In this work, we aim to build on this foundation
by extending the existing analytical models and by propos-
ing a power modeling methodology that allows us to es-
timate optimal pipeline depth as a function of both power
and performance.
3 Analytical Pipeline Model
In the concept phase definition studies, the exact organi-
zation and parameters of the target processor are not known.
As such, a custom, cycle-accurate power-performance sim-
ulator for the full machine is often not available or rel-
evant. Therefore, the use of analytical reasoning mod-
els supplemented by workload characterization and limit
studies (obtained from prior generation simulators or trace
analysis programs) is common in real design groups. We
present such an analytical model to understand the power-
performance optima and tradeoffs during the pre-simulation
phase of a design project.
Figure 1 shows a high-level block diagram of the pipeline
model used in our analysis. Our primary goal is to derive the
optimum pipeline depth for the various execution units by
estimating the various types of stalls in these pipes while
using a perfect front-end for the processor. Although Fig-
ure 1 shows only one pipe for each unit (fixed point, floating
point, load/store, and branch), the model can be used for a
design with multiple pipes per unit as well.






!"
!#$%
Figure 1. Pipeline Model
In Figure 1, let t_i be the latch-free logic time to complete
an operation in pipe i, and s_i be the number of pipeline
stages of pipe i. Assuming the same clock frequency for all
the pipes, t_i/s_i = t_j/s_j for all i, j.
If c_i is the latch overhead per stage for pipe i, the total
time per stage of pipe i is T_i = (t_i/s_i) + c_i for all i. As
derived by Dubey and Flynn [5], and by Larson and Davidson
(cited in Chapter 2 of [14]), the throughput of the above
machine in the absence of stalls is given by
$$G = \sum_i \frac{1}{T_i}.$$
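As a concrete illustration of the stall-free throughput expression, the short Python sketch below evaluates T_i = t_i/s_i + c_i and G = Σ_i 1/T_i for a hypothetical four-pipe machine; the logic times, overheads, and stage counts are invented values chosen only to satisfy the equal-stage-time constraint, not parameters taken from the paper.

```python
# Stall-free throughput of the pipeline model of Figure 1 (illustrative sketch).
# t[p]: latch-free logic time of pipe p (FO4); s[p]: number of stages;
# c[p]: latch + skew overhead per stage (FO4). All values are hypothetical.
t = {"fxu": 48.0, "fpu": 96.0, "lsu": 48.0, "bru": 32.0}
c = {p: 3.0 for p in t}
s = {"fxu": 3, "fpu": 6, "lsu": 3, "bru": 2}   # chosen so t[p]/s[p] = 16 FO4 for every pipe

T = {p: t[p] / s[p] + c[p] for p in t}   # time per stage of pipe p (19 FO4 each here)
G = sum(1.0 / T[p] for p in T)           # throughput in operations per FO4 of time

print({p: round(T[p], 1) for p in T}, round(G, 4))
```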
We now extend this baseline model to include the effect
of data-dependent stalls. Workload analysis can be used to
derive first-cut estimates of the probability that an instruction
n depends on another instruction j, for all instructions n, and
can be used to estimate the frequency of pipeline stalls. This
is illustrated in the example below, where an FXU instruction
(i+1) depends on the immediately preceding instruction i and
will be stalled for (s_1 - 1) stages, assuming a register file
bypass to forward the results. Similarly, another FXU
instruction (j+2) depends on instruction j and will be stalled
for (s_1 - 2) stages. Note that in the above workload analysis,
if the source operands of an instruction i are produced by more
than one instruction, the largest of the possible stalls is
assigned to i.

inst (i)   add r1 = r2, r3
inst (i+1) and r4 = r1, r5  -- stalled for (s1 - 1) stages

inst (j)   add r1 = r2, r3
inst (j+1) or  r6 = r7, r8
inst (j+2) and r4 = r1, r5  -- stalled for (s1 - 2) stages
$$T_{fxu} = T_1 + Stall_{fxu,fxu}\,T_1 + Stall_{fxu,fpu}\,T_2 + Stall_{fxu,lsu}\,T_3 + Stall_{fxu,bru}\,T_4,$$
$$\text{where}\quad Stall_{fxu,fxu} = f_1 (s_1 - 1) + f_2 (s_1 - 2) + \ldots$$
The above equation represents the time to complete an
FXU operation in the presence of stalls; f_i is the conditional
probability that an FXU instruction m depends on another FXU
instruction (m - i), for all FXU instructions m, provided that
instruction (m - i) is the producer with the largest stall for
instruction m. Similar expressions can be derived for T_fpu,
T_lsu, and T_bru, the completion times of an FPU, LSU, and BRU
operation, respectively. To account for superscalar (> 1) issue
widths, the workload analysis assumes a given issue width along
with the number of execution pipes of various types (FXU, FPU,
BRU, LSU), and issues independent instructions as an instruction
bundle such that the bundle width ≤ issue width. Thus, the distance
between dependent instructions is the number of instruction
bundles issued between them. To account for the dependent
instruction stalls due to L1 data cache misses we use a func-
tional cache simulator to determine cache hits and misses.
In addition, we split the load/store pipe into two, namely,
load-hit and load-miss pipes, thereby steering all data
references that miss in the data cache to the load-miss
pipeline, which results in longer stall times for the dependent instruc-
tions. Since the workload analysis is independent of the ma-
chine architecture details and uses only the superscalar issue
width to determine the different stalls, it suffices to analyze
each application once to derive the stalls.
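To make the stall accounting concrete, the sketch below computes Stall_fxu,fxu = Σ_k f_k·(s_1 - k) and folds it into T_fxu; the dependence-distance frequencies and stage count are hypothetical placeholders (reusing the values of the earlier sketch), not measured workload statistics.

```python
# Illustrative computation of the FXU completion time in the presence of
# data-dependent stalls (Section 3). All numbers are hypothetical.
s1 = 3                     # number of FXU pipeline stages (from the sketch above)
T1 = 19.0                  # time per FXU stage in FO4 (from the sketch above)

# f[k]: conditional probability that an FXU instruction's critical producer is
# the FXU instruction k bundles earlier. A made-up workload profile.
f = {1: 0.08, 2: 0.04, 3: 0.02}

# Expected FXU-after-FXU stall in cycles: sum_k f[k] * (s1 - k); producers far
# enough ahead of the consumer cause no stall, hence the max(..., 0).
stall_fxu_fxu = sum(p * max(s1 - k, 0) for k, p in f.items())

# Contributions from producers in the FPU/LSU/BRU pipes would be added in the
# same way, each weighted by that pipe's stage time; they are omitted here.
T_fxu = T1 + stall_fxu_fxu * T1
print(round(stall_fxu_fxu, 2), round(T_fxu, 2))   # 0.2 cycles, 22.8 FO4
```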
The stalls modeled so far include only hazards in the ex-
ecution stages of the different pipes. However, these pipes
could also be waiting for instructions to arrive from the
front-end of the machine. If u_i represents the fraction of
time pipe i has instructions arriving from the front-end of
the machine, the equation below gives the throughput of the
pipeline in the presence of stalls. Note that u_i = 0 for
unutilized pipelines and u_i = 1 for fully utilized pipelines:
$$G = \frac{u_1}{T_{fxu}} + \frac{u_2}{T_{fpu}} + \frac{u_3}{T_{lsu}} + \frac{u_4}{T_{bru}}$$
Thus far, we focused on deriving the throughput of a
pipeline as a function of the number of pipeline stages. In
order to optimize the pipeline for both power and perfor-
mance we use circuit-extracted power models described in
Section 4.2.
If P_{s_i} is the total power of pipe i derived from the power
models, the energy-delay product for pipe i is given by
ED = P_{s_i}/G². Hence, the optimal value of s_i that minimizes
the energy-delay product can be obtained from d(ED)/ds_i = 0.
Note that depending on the class of processors, the desired
metric for optimization could instead be d((BIPS)^γ/W)/ds_i = 0,
where γ ≥ 0 is an exponent whose value can be fixed in specific
circuit tradeoff contexts [26]; BIPS is billions of instructions
per second.
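The optimum can also be located numerically instead of solving d(ED)/ds_i = 0 in closed form. The sketch below sweeps the stage count of a single representative pipe, rebuilding throughput from the stall model above and using a crude placeholder power function (the circuit-extracted models of Section 4.2 are what this paper actually uses); every parameter value here is invented for illustration.

```python
# Numerically locating performance-optimal and power-performance-optimal
# pipeline depths for one pipe of the analytical model. All values hypothetical.
t_logic = 48.0                       # total latch-free logic time of the pipe (FO4)
c_latch = 3.0                        # latch + clock skew overhead per stage (FO4)
f = {1: 0.08, 2: 0.04, 3: 0.02}      # dependence-distance profile (hypothetical)
gamma = 3                            # gamma = 3 gives a BIPS^3/W-style metric

def throughput(s):
    """Operations per unit time for an s-stage pipe with data-dependent stalls."""
    t_stage = t_logic / s + c_latch
    stall = sum(p * max(s - k, 0) for k, p in f.items())
    return 1.0 / (t_stage * (1.0 + stall))

def power(s):
    """Placeholder power model (stand-in for Section 4.2): clock frequency
    times a latch-count term that grows linearly with the number of stages."""
    return (1.0 / (t_logic / s + c_latch)) * (1.0 + 0.2 * s)

best_perf = max(range(2, 16), key=throughput)
best_ppw = max(range(2, 16), key=lambda s: throughput(s) ** gamma / power(s))
print("performance-optimal stages:", best_perf)
print("power-performance-optimal stages:", best_ppw)
```

As in the paper's results, the power-aware optimum lands at a noticeably shallower pipeline than the performance-only optimum, although the specific numbers printed here depend entirely on the placeholder parameters.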
4 Performance and Power Methodology
In this section, we describe the performance simulator
used in this study as well as the details of our power mod-
eling toolkit and the methodology that we use to estimate
changes in power dissipation as we vary the pipeline depth
of the machine.
Fetch latencies (STD/INF): NFA Predictor 1/0; L2 ICache 11/0; L3 (Instruction) 85/0; I-TLB Miss 10/0; L2 I-TLB Miss 50/0
Decode latencies (STD/INF): Multiple Decode 2/0; Millicode Decode 2/0; Expand String 2/0; Mispredict Cycles 3/0; Register Read 1/1
Execution pipe latencies (STD/INF): Fix Execute 1/1; Float Execute 4/4; Branch Execute 1/1; Float Divide 12/12; Integer Multiply 7/7; Integer Divide 35/35; Retire Delay 2/2
Load/store latencies (STD/INF): L1 D-Load 3/3; L2 D-Load 9/9; L3 (Data) 77/0; Load Float 2/2; D-TLB Miss 7/0; L2 D-TLB Miss 50/0; StoreQ Forward 4/0
Table 1. Latencies (in processor cycles, STD/INF) for the 19 FO4 Design Point
4.1 Performance Simulation Methodology
Figure 2. Modeled Processor Organization (fetch/decode/rename front end with L1 I-cache, I-TLBs, and NFA/branch predictor; separate issue queues, issue logic, register read, and execution units for integer, load/store, floating-point, and branch; L1 D-cache and D-TLBs backed by an L2 cache and main memory; retirement, load/store reorder, store, miss, and cast-out queues)
We utilize a generic, parameterized, out-of-order 8-way
superscalar processor model called Turandot [16, 17] with
32KB I and D-caches and a 2MB L2 cache. The overall
pipeline structure (as reported in [16]), is repeated here in
Figure 2. The modeled baseline microarchitecture is simi-
lar to a current generation microprocessor. As described in
[16], this research simulator was calibrated against a pre-
RTL, detailed, latch-accurate processor model. Turandot
supports a large number of parameters, including the config-
urable pipeline latencies discussed below.
Table 1 details the latency values in processor cycles for
the 19 FO4 base design point of this study. We assume a 2
FO4 latch overhead and 1 FO4 clock skew and jitter over-
head. The 19 FO4 latency values are then scaled with the
FO4-depth (after accounting for latch and clock skew over-
head). Each latency in Table 1 has two values: the first la-
beled STD, is for our detailed simulation model, and the
second labeled INF, assumes infinite I-Cache, I-TLB, D-
TLB, and a perfect front-end. The INF simulator model
is used for validating the analytical model described in Sec-
tion 3.
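One plausible reading of the latency-scaling rule described above is sketched below: a latency expressed in 19 FO4 cycles is converted into an amount of logic work (using the 16 FO4 of logic left per cycle after the 3 FO4 latch and skew overhead) and then re-divided by the logic time available per cycle at the new design point. The rounding rule and helper function are our assumptions, not the paper's exact procedure.

```python
import math

# Rescale a latency given in cycles at the 19 FO4 design point to another
# FO4 design point. Assumes 2 FO4 latch + 1 FO4 skew/jitter overhead per stage,
# so a 19 FO4 cycle contains 16 FO4 of useful logic. Illustrative only.
LATCH_SKEW_FO4 = 3.0

def rescale_latency(cycles_at_19fo4, fo4_pipeline):
    logic_fo4_base = 19.0 - LATCH_SKEW_FO4          # 16 FO4 of logic per base cycle
    logic_fo4_new = fo4_pipeline - LATCH_SKEW_FO4   # logic per cycle at the new depth
    work = cycles_at_19fo4 * logic_fo4_base         # total logic work in FO4
    return math.ceil(work / logic_fo4_new)

# Example: the 9-cycle L2 D-Load at 19 FO4 becomes 21 cycles at 10 FO4
# under these assumptions.
print(rescale_latency(9, 10))
```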
4.2 Power Simulation Methodology
To estimate power dissipation, we use the PowerTimer
toolset developed at IBM T.J. Watson Research Center [3, 1]
as the starting point for the simulator used in this work.
PowerTimer is similar to power-performance simulators de-
veloped in academia [2, 23, 24], except for the methodology
to build energy models.
Figure 3. PowerTimer Energy Models (each macro's power is modeled as Power = C_i × SF + HoldPower; the macro equations within a sub-unit, i.e., a microarchitecture-level structure, are summed and driven by switching-factor data to produce the power estimate)
Figure 3 above depicts the derivation of the energy mod-
els in PowerTimer. The energy models are based on circuit-
level power analysis that has been performed on structures
in a current, high performance PowerPC processor. The
power analysis has been performed at the macro level using
a circuit-level power analysis tool [18]. Generally, multi-
ple macros combine to form one micro-architectural level
structure which we will call a sub-unit. For example, the
fixed-point issue queue (one sub-unit) might contain sep-
arate macros for storage memory, comparison logic, and
control. Power analysis has been performed on each macro
to determine the macro’s unconstrained (no clock gating)
power as a function of the input switching factor. In addi-
tion, the hold power, or power when no switching is occur-
ring (SF = 0%), is also determined. Hold power primarily
consists of power dissipated in latches, local clock buffers,
the global clock network, and data-independent fraction of
the arrays. The switching power, which is primarily combi-
natorial logic and data-dependent array power dissipation,
is the additional power that is dissipated when switching
factors are applied to the macro’s primary inputs. These
two pieces of data allow us to form simple linear equations
for each macro’s power. The energy model for a sub-unit is
determined by summing the linear equations for each macro
within that sub-unit. We have generated these power models
for all microarchitecture-level structures (sub-units) mod-
eled in our research simulator [16, 17]. PowerTimer models
over 60 microarchitectural structures which are defined by
over 400 macro-level power equations.
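The macro-level energy models lend themselves to a very simple implementation; the sketch below shows the linear per-macro form and the summation into a sub-unit model, with made-up coefficients standing in for the circuit-extracted values.

```python
# Sketch of PowerTimer-style energy models: each macro is a linear function of
# its input switching factor (SF), and a sub-unit is the sum of its macros.
# The coefficients below are invented, not circuit-extracted values.

class Macro:
    def __init__(self, hold_power, sf_coeff):
        self.hold_power = hold_power    # power at SF = 0 (latches, clocks, arrays)
        self.sf_coeff = sf_coeff        # incremental power per unit switching factor

    def power(self, sf):
        return self.hold_power + self.sf_coeff * sf

# Fixed-point issue queue modeled as three macros (storage, compare, control).
fxq_macros = [Macro(0.40, 0.90), Macro(0.15, 0.50), Macro(0.10, 0.20)]

def subunit_power(macros, sf):
    """Unconstrained (no clock gating) power of a sub-unit at switching factor sf."""
    return sum(m.power(sf) for m in macros)

print(round(subunit_power(fxq_macros, 0.25), 3))   # e.g. 0.25 average input SF
```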
PowerTimer uses microarchitectural activity information
from the Turandot model to scale down the unconstrained
hold and switching power on a per-cycle basis under a vari-
ety of clock gating assumptions. In this study, we use a real-
istic form of clock gating which considers the applicability
of clock gating on a per-macro basis to scale down either
the hold power or the combined hold and switching power
depending on the microarchitectural event counts. We de-
termine which macros can be clock gated in a fine-grained
manner (per-entry or per-stage clock gating) and which can
be clock gated in a coarse-grained manner (the entire unit
must be idle to be clock gated). For some macros (in par-
ticular control logic) we do not apply any clock gating; this
corresponds to about 20-25% of the unconstrained power
dissipation. The overall savings due to clock gating relative
to the unconstrained power is roughly 40-45%.
In order to quantify the power-performance efficiency of
pipelines of a given FO4-depth, and to scale the power dissi-
pation from the power models of our base FO4 design point
across a range of FO4-depths, we have extended the Power-
Timer methodology as discussed below.
Power dissipated by a processor consists of dynamic and
leakage components, P = P_dynamic + P_leakage. The dynamic
power data measured by PowerTimer (for a particular design
point) can be expressed as
$$P^{base}_{dynamic} = C\,V^2 f\,(\alpha + \beta)\,CGF,$$
where α is the average "true" switching factor in circuits;
i.e., α represents transitions required for the functionality
of the circuit and is measured as the switching factor by an
RTL-level simulator run under the zero-delay mode. In contrast,
β is the average glitching factor that accounts for spurious
transitions in circuits due to race conditions. Thus, (α + β)
is the average number of transitions actually seen inside
circuits. Both α and β are averaged over the whole processor
over non-gated cycles with appropriate energy weights (the
higher the capacitance at a particular node, the higher the
corresponding energy weight). CGF is the clock gating factor,
defined as the fraction of cycles where the microarchitectural
structures are not clock gated. The CGF is measured from our
PowerTimer runs at each FO4 design point as described above.
The remaining terms C, V, and f are the effective switching
capacitance, chip supply voltage, and clock frequency,
respectively.
Next we analyze how each of these factors scales with FO4
pipeline depth. To facilitate the explanation we define the
following variables: FO4_logic, FO4_latch, and FO4_pipeline,
designating the depth of the critical path through logic in one
pipeline stage, the latch insertion delay including clock skew
and jitter, and the sum of the two quantities, respectively,
i.e., FO4_pipeline = FO4_logic + FO4_latch. We use FO4 and
FO4_pipeline interchangeably. In the remainder of the paper,
the qualifier 'base' in all quantities designates the value of
the quantities measured for the base 19 FO4 design.
Frequency: FreqScale is the scaling factor used to account
for the change in clock frequency with pipeline depth. This
factor applies to both hold power and switching power:
$$FreqScale = \frac{FO4^{base}_{pipeline}}{FO4_{pipeline}}$$
Latch: With fixed logic hardware for given logic functions,
the primary change in the chip effective switching capacitance
C with pipeline depth is due to changes in the latch count with
the depth of the pipeline. LatchScale is a factor that
appropriately adjusts the hold power dissipation, but does not
affect the switching power dissipation:
$$LatchScale = LatchRatio \cdot \left(\frac{FO4^{base}_{logic}}{FO4_{logic}}\right)^{LGF}$$
where LatchRatio defines the ratio of hold power to the total
power and the LatchGrowthFactor (LGF) captures the growth of
latch count due to the logic shape functions. The amount of
additional power that is spent in latches and clock in a more
deeply pipelined design depends on the logic shape functions of
the structures that are being pipelined. Logic shape functions
describe the number of latches that would need to be inserted
at any cut point in a piece of combinatorial logic if it were
to be pipelined. Values of LGF > 1 recognize the fact that for
certain hardware structures the logic shape functions are not
flat, and hence the number of latches in the more deeply
pipelined design increases super-linearly with pipeline depth.
In our baseline model, we assume an LGF of 1.1 and study the
sensitivity of the optimal pipeline depth to this parameter in
Section 6.1.
Clock Gate Factor: In general, CGF decreases with
deeper pipelines because the amount of clock gating poten-
tial increases with deeper pipes. This increased clock gating
potential is primarily due to the increased number of cycles
where pipeline stages are in stall conditions. This in turn
leads to an increase in the clock gating potential on a per
cycle basis. CGF is workload dependent and is measured
directly from simulator runs.
Glitch: The final two factors that must be considered
for dynamic power dissipation when migrating to deeper
pipelines are α and β, the chip-wide activity and glitching
factors.
The “true” switching factor α does not depend on the
pipeline depth, since it is determined by the functionality
of the circuits. The glitching factor at any net, on the other
hand, is determined by the difference in delay of paths from
the output of a latch that feeds the circuit to the gate that
drives that net. Once a glitch is generated at some net, there
is a high probability that it will propagate down the circuit
until the input of the next latch down the pipeline. Further-
more, the farther the distance from the latch output to the
inputs of a gate the higher the probability of the existence
of non-equal paths from the output of the latch to the inputs
of this gate. Therefore the average number of spurious
transitions grows with FO4_logic: the higher the FO4, the
higher the average glitching factor. Experimental data,
collected by running a dynamic circuit-level simulator
(PowerMill) on post-layout extracted netlists of sample
functional units, show that the average glitching factor β can
be modeled as being linearly dependent on the logic depth:
$$\beta = \beta_{base}\,\frac{FO4_{logic}}{FO4^{base}_{logic}}$$
To account for the effect of the dependence of β on pipeline
depth, we introduce the following factor, which applies only to
the switching power:
$$GlitchScale = (1 - LatchRatio)\,\frac{\alpha + \beta}{\alpha + \beta_{base}} = \frac{1 - LatchRatio}{1 + \beta_{base}/\alpha}\left(1 + \frac{\beta_{base}}{\alpha}\cdot\frac{FO4_{logic}}{FO4^{base}_{logic}}\right)$$
In this formula β_base is the actual glitching factor averaged
over the baseline microprocessor for the base FO4 design point.
Notice that β_base appears in the formula only in the ratio
β_base/α. This is consistent with our experimental results
showing that the glitching factor β is roughly proportional to
the "true" switching factor α for the range 0 < α < 0.3 (for
higher values of α the growth of β typically saturates). For
the set of six sample units that we simulated, with the logic
depth ranging from 6 FO4 to 20 FO4, the ratio β/α was found to
be roughly proportional to the logic depth of the simulated
units, FO4_logic, with the coefficient equal to
0.3/FO4^base_logic. Based on these simulation results we set
β_base/α = 0.3 for the whole microprocessor in the remainder of
this section, and study the sensitivity of the results to
variations in the glitching factor in Section 6.4.
Leakage Currents: As the technology feature size scales down
and the power supply and transistor threshold voltages scale
accordingly, the leakage power component becomes more and more
significant. Since the magnitude of the leakage power component
is affected by the pipeline depth, it is essential to include
the effect of the leakage power in the analysis of the optimal
pipeline depth. Assuming that the leakage power is proportional
to the total channel width of all transistors in the
microprocessor, we model the dependence of the leakage power on
the depth of the pipeline as follows:
$$P^{FO4}_{leakage} = P^{base}_{leakage}\left(1 + \frac{w_{latch}}{w_{total}}\left[\left(\frac{FO4^{base}_{logic}}{FO4_{logic}}\right)^{LGF} - 1\right]\right)$$
where LGF is the LatchGrowthFactor defined earlier, and
w_latch/w_total is the ratio of the total channel width of
transistors in all pipeline latches (including local clock
distribution circuitry) to the total transistor channel width
in the base (19 FO4) microprocessor (excluding all low-leakage
transistors that might be used in caches or other on-chip
memories). If the technology supports multiple thresholds, or
any of the recently introduced leakage reduction techniques are
used on a unit-by-unit basis, such as MTCMOS, back biasing,
power-down, or transistor stacking, then the above formula for
the leakage power component needs to be modified accordingly;
we leave the detailed study of these effects for future work.
For the remainder of the study we set w_latch/w_total to 0.2
for the base 19 FO4 pipeline.
Also, rather than giving the absolute value for the leakage
current P^base_leakage in the base microprocessor, we count it
as a fraction of the dynamic power of the base design,
P^base_leakage = LeakageFactor^base · P^base_dynamic. We set
the LeakageFactor^base to the value of 0.1, typically quoted
for state-of-the-art microprocessors, and analyze the
sensitivity of the results to the LeakageFactor in Section 6.5.
Total Power: The following equation expresses the relationship
between the dynamic power of the base FO4 design,
P^base_dynamic, the leakage power, and the scaled power for
designs with different depths of the pipeline, P^FO4_total,
considering all of the factors above:
$$P^{FO4}_{total} = CGF \cdot FS \cdot (LS + GS) \cdot P^{base}_{dynamic} + P^{FO4}_{leakage}$$
where FS is FreqScale, LS is LatchScale, and GS is GlitchScale.
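Putting the pieces together, the sketch below evaluates the total-power expression above over a range of pipeline depths, plugging in the baseline parameter values stated in the text (19 FO4 base with 3 FO4 latch and skew overhead, LGF = 1.1, LatchRatio = 0.7, β_base/α = 0.3, w_latch/w_total = 0.2, LeakageFactor = 0.1). The clock gating factor CGF is workload-dependent and measured from simulation, so a flat placeholder of 1.0 is used here; treat the output as a shape illustration, not as the data behind Figure 4.

```python
# Scaling the base-design power to other pipeline depths (Section 4.2 model).
FO4_BASE, LATCH_FO4 = 19.0, 3.0             # base pipeline depth, latch + skew overhead
LGF, LATCH_RATIO = 1.1, 0.7                 # latch growth factor, hold-power fraction
BETA_OVER_ALPHA = 0.3                       # glitching-to-switching ratio at the base point
W_LATCH_RATIO, LEAKAGE_FACTOR = 0.2, 0.1    # latch channel-width and leakage fractions

def total_power_ratio(fo4_pipeline, cgf=1.0, p_dyn_base=1.0):
    """P_total(FO4) / P_dyn_base per the equations in the text.
    cgf is a workload-dependent clock gating factor (placeholder default)."""
    logic = fo4_pipeline - LATCH_FO4
    logic_base = FO4_BASE - LATCH_FO4
    freq_scale = FO4_BASE / fo4_pipeline
    latch_scale = LATCH_RATIO * (logic_base / logic) ** LGF
    glitch_scale = ((1.0 - LATCH_RATIO) / (1.0 + BETA_OVER_ALPHA)
                    * (1.0 + BETA_OVER_ALPHA * logic / logic_base))
    p_dynamic = cgf * freq_scale * (latch_scale + glitch_scale) * p_dyn_base
    p_leakage = (LEAKAGE_FACTOR * p_dyn_base
                 * (1.0 + W_LATCH_RATIO * ((logic_base / logic) ** LGF - 1.0)))
    return p_dynamic + p_leakage

for fo4 in (28, 23, 19, 14, 10, 7):
    print(fo4, round(total_power_ratio(fo4), 3))
```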






 !"
#$%& 
Figure 4. Power Growth Breakdown
Figure 4 shows contributions of different factors in the
above formula, depending on the pipeline depth of the de-
sign FO4
pipeline
. The 19 FO4 design was chosen as a base
pipeline.
The line labeled “combined” shows the cumulative in-
crease or decrease in power dissipation. The line labeled
“only clock gate” quantifies the amount of additional clock
gating power savings for deeper pipelines. The relative ef-
fect of scaling in clock gating is fairly minor with slightly
more than 10% additional power reduction when going
from the 19 FO4 to 7 FO4 design points. There are several
reasons why the effect of clock gating is not larger. First,
the fraction of power dissipation that is not eligible to be
clock gated becomes larger with more clock gating leading
to diminishing returns. Second, some of the structures are
clock gated in a coarse-grained fashion and while the aver-
age utilization of the structure may decrease it must become
idle in all stages before any additional savings can be real-
ized. Finally, we observe that clock gating is more difficult
in deeper pipelined machines, because it is harder to deliver
cycle-accurate gating signals at lower FO4.
The two lines labeled “only freq” and “only hold” show
the power factors due to only frequency and hold power
scaling, respectively. (Although these two factors increase
linearly with clock frequency, we are plotting against
FO4-depth, which is proportional to 1/frequency.) Overall,
dynamic power increases more than quadratically with increased
pipeline depth.
Figure 4 shows that the leakage component grows much
less rapidly than the dynamic component with the increas-
ing pipeline depth. There are two primary reasons for this.
First, the leakage power does not scale with frequency.
Second, the leakage power growth is proportional to the
fraction of channel width of transistors in pipeline latches,
whereas the latch dynamic hold power growth is propor-
tional to the fraction of the dynamic power dissipated in
pipeline latches. Obviously, the former quantity is much
smaller than the latter.
4.3 Workloads and Metrics Used in the Study
In this paper, we report experimental results based on
PowerPC traces of a set of 21 SPEC2000 benchmarks,
namely, ammp, applu, apsi, art, bzip2, crafty, equake, fac-
erec, gap, gcc, gzip, lucas, mcf, mesa, mgrid, perl, six-
track, swim, twolf, vpr, and wupwise. We have also used
a 172M instruction trace of the TPC-C transaction process-
ing benchmark. The SPEC2000 traces were generated using
the tracing facility called Aria within the MET toolkit [17].
The particular SPEC2000 trace repository used in this study
was created by using the full reference input set. However,
sampling was used to reduce the total trace length to 100
million instructions per benchmark program. A systematic
validation study to compare the sampled traces against the
full traces was done in finalizing the choice of exact sam-
pling parameters [12].
We use BIPS³/W (the inverse of energy × delay²) as a basic
energy-efficiency metric for comparing different FO4 designs in
the power-performance space. The choice of this metric is based
on the observation that dynamic power is roughly proportional
to the square of the supply voltage (V) multiplied by clock
frequency, and clock frequency is roughly proportional to V.
Hence, power is roughly proportional to V³, assuming a fixed
logic/circuit design. Thus, delay cubed multiplied by power
provides a voltage-invariant power-performance characterization
metric, which we feel is most appropriate for server-class
microprocessors (see the discussion in [1, 7]). In fact, it was
formally shown in [26] that optimizing performance subject to a
constant power constraint leads to the BIPS³/W metric in
processors operating at a supply voltage near the maximum
allowed value in state-of-the-art CMOS technologies. As a
comparative measure, we also consider BIPS/W (energy) and
BIPS²/W (energy-delay [8]) in the baseline power-performance
study.
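The voltage-invariance argument can be checked numerically: if BIPS scales roughly with V and power with V³, then BIPS³/W is unchanged under voltage scaling while BIPS/W and BIPS²/W are not. The sketch below demonstrates this with arbitrary starting numbers.

```python
# Why BIPS^3/W is (roughly) voltage-invariant: BIPS ~ V and W ~ V^3 under the
# assumptions in the text, so the V terms cancel only for the cubed metric.
bips, watts = 1.5, 60.0          # arbitrary starting design point

for v_scale in (0.8, 1.0, 1.2):  # scale supply voltage by +/- 20%
    b = bips * v_scale           # performance ~ frequency ~ V
    w = watts * v_scale ** 3     # power ~ V^2 * f ~ V^3
    print(v_scale,
          round(b / w, 4),        # BIPS/W   : changes with V
          round(b ** 2 / w, 4),   # BIPS^2/W : changes with V
          round(b ** 3 / w, 4))   # BIPS^3/W : constant across V
```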
5 Analytical Model and Simulation Results
5.1 Analytical Model Validation










Figure 5. BIPS: Analytical Model vs. Simulator
We now present the performance results using the sim-
ple analytical model (see Section 3) and compare it with
the results using our detailed cycle-accurate simulator. The
results in this section are presented for the average of the
SPEC2000 benchmarks described in Section 4.3. For fair
comparison, we have modified the simulator to include a
perfect front-end of the machine. The model and the sim-
ulator use latencies shown in the column labeled “INF” in
Table 1.
Figure 5 shows BIPS as a function of the FO4 delay per
stage of the pipeline. The BIPS for the analytical model
was computed after determining the stalls for the different
pipelines using independent workload analysis as explained
in Section 3.
From Figure 5 we observe that the performance optimal
pipeline depth is roughly 10 FO4 delay per pipeline stage
for both the model and the simulator. The BIPS estimated
using the model correlates well with the BIPS estimated
using the simulator except for very large and very small
FO4-depth machines. Since the stalls are determined in the
model independent of the pipeline and only once for a work-
load, it is possible that for shallow pipelines with large FO4
delay per stage, the model underestimates the total stall cy-
cles, and hence the BIPS computed by the model is higher
than the BIPS obtained from the simulator. The analytical
model currently only uses data dependent stalls to derive
the optimal pipeline depth. Hence, for the purpose of the
validation experiment, the model and the simulator assume
a perfect front-end. However, to estimate the effect of re-
source stalls we modeled a large but finite register renamer,
instruction buffer, and miss queue in the simulator keeping
the rest of the front-end resources infinite and perfect. For
deeper pipelines ( 10 FO4), we observe from the simula-
tor that the stalls due to the resources become appreciable
and the analytical model begins to overestimate the BIPS
relative to the simulator.




!"
Figure 6. BIPS
3
/W: Model vs. Simulator
Figure 6 shows that the optimal FO4-depth that maximizes
BIPS³/W is around 19-24 FO4 for the simulator.
We observe that the analytical model tracks the simulator
results more closely while optimizing performance alone as
seen in Figure 5. Since the analytical model overestimates
the performance for shallow pipelines, the cubing of BIPS
in Figure 6 compounds these errors.
5.2 Detailed Power-Performance Simulation







 
 
 
 
 
 
 
 
 
 
Figure 7. Simulation Results for SPEC2000
In the remainder of this work we consider the power and
performance results using the detailed performance simu-
lator with parameters corresponding to the STD column in
Table 1. Figure 7 shows the results for five metrics: BIPS,
IPC, BIPS/W, BIPS²/W, and BIPS³/W.
Figure 7 shows that the optimal FO4-depth for performance
(defined by BIPS) is 10 FO4, although pipelines of 8 FO4 to
15 FO4 are within 5% of the optimal. Because of the
super-linear increase in power dissipation and sub-linear
increases in overall performance, BIPS/W always decreases with
deeper pipelines. BIPS³/W shows an optimum point at 18 FO4.
BIPS³/W decreases sharply after the optimum, and at the
performance-optimal pipeline depth of 10 FO4 the BIPS³/W metric
is reduced by 50% relative to the 18 FO4 depth. For metrics
that place less emphasis on performance, such as BIPS²/W and
BIPS/W, the optimal point shifts towards shallower pipelines,
as expected. For BIPS²/W, the optimal pipeline depth is
achieved at 23 FO4.






 
 
 
 
 
 
 
 
 
 
Figure 8. Simulation Results for TPC-C
Figure 8 presents similar results for the TPC-C trace.
The optimal BIPS for TPC-C is very flat from 10-14 FO4
(within 1-2% for all 5 design points which is much flat-
ter than SPEC2000). Using BIPS³/W, the optimal pipeline
depth shifts over to 25-28 FO4. The main reason that the
optimal point is shallower for TPC-C is that BIPS decreases
less dramatically with decrease in pipeline length (relative
to SPEC2000). This is slightly counterbalanced because
power increases at a slower rate for TPC-C with deeper
pipes, because the additional amount of clock gating is more
pronounced due to large increases in the number of stall cy-
cles relative to the SPEC2000 suite.
6 Sensitivity Analysis
The derived equations that model the dependence of
the power dissipation on the pipeline depth depend on
several parameters. Some of these parameters, although
accurately measured for the baseline microprocessor, are
likely to change from one design to another, whereas oth-
ers are difficult to measure accurately. In this section, we
perform sensitivity analysis of the optimal pipeline depth
to key parameters of the derived power models such as
LatchGrowthFactor, LatchRatio, latch insertion delay
(FO4_latch), GlitchRatio, and LeakageFactor.
6.1 Latch Growth Factor
LatchGrowthFactor (LGF) is determined by the intrinsic logic
shape functions of the structures that are being pipelined. We
have analyzed many of the major microarchitectural structures
to identify ones that are likely to have a LatchGrowthFactor
greater than 1. One structure that we
will highlight is the Booth recoder and Wallace tree which
is common in high-performance floating point multipliers
[13, 20], as shown in Figure 9. Figure 9 shows the exponen-
tial reduction in the number of result bits as the data passes
through the 3-2 and 4-2 compressors. We have estimated
the amount of logic that can be inserted between latch cut
points for 7, 10, 13, 16, and 19 FO4 designs by assuming
3-4 FO4 delay for 3-2 and 4-2 compressor blocks. As the 7
and 10 FO4 design points require latch insertions just after
the Booth multiplexor (where there are 27 partial products),
there would be a large increase in the number of latches re-
quired for these designs. We note that the 7 FO4 design also
requires a latch insertion after the Booth recode stage.
Figure 9. Wallace Tree Diagram and Latch Cut
points for 7/10/13/16/19 FO4
Figure 10 gives estimates for the cumulative number of
latches in the FPU design as a function of the FO4 depth
of the FPU. For example, the first stage of the 10 FO4 FPU
requires 3x as many latches as the first stage of the 19 FO4
FPU because the first latch cut point of the 19 FO4 FPU is
beyond the initial 9:2 compressor tree. Overall, the 10 FO4
FPU requires nearly 3x more latches than the 19 FO4 FPU.
Figure 10. Cumulative latch count for the FPU (x-axis: cumulative FO4 depth, logic + latch overhead; y-axis: cumulative number of latches; curves for the 19, 16, 13, and 10 FO4 designs)
There are many other areas of parallel logic that are
likely to see super-linear increases in the number of latches
such as structures with decoders, priority encoders, carry
look ahead logic, etc. Beyond the pipelining of logic struc-
tures, deeper pipelines may require more pre-decode infor-
mation to meet aggressive cycle times, which would require
more bits to be latched in intermediate stages. On the other
hand, the number of latches that comprise storage bits in
various on-chip memory arrays (such as register files and
queues) does not grow at all with the pipeline depth, mean-
ing that the LGF = 0 for those latches. Thus, designs with
overall LGF < 1 are also possible.






 !"
   
Figure 11. BIPS
3
/W varying LatchGrowthF actor
Figure 11 quantifies the dependence of the optimal pipeline
depth on LGF using the BIPS³/W metric. It shows that the
optimal pipeline FO4 tends to increase as LGF
that the optimal pipeline FO4 tends to increase as LGF
increases above the value of 1.1, assumed in the baseline
model. As a point of reference, our estimate for the latch
growth factor for the 10 FO4 vs. 19 FO4 Booth recoder and
Wallace tree is LGF = 1.9, while for the entire FPU LGF
is slightly less than 1.7.
6.2 Latch Power Ratio
Latch, clock, and array power are the primary com-
ponents of power dissipation in current generation CPUs.
This is especially true in high-performance, superscalar
processors with speculative execution which require the
CPU to maintain an enormous amount of architectural
and non-architectural state. One possible reason why the
LatchRatio could be smaller than the base value of 0.7
chosen in Section 4, is if more energy-efficient SRAM ar-
rays are used in high-power memory structures instead of
latches to reduce the data independent array power (which
we include as part of hold power).
















  
Figure 12. BIPS³/W varying LatchRatio
Figure 12 shows the optimal FO4 design point while
varying the LatchRatio of the machine from 80% to 40%.
We see that while the optimal FO4 point remains 18 FO4, it
is less prohibitive to move to deeper pipelines with smaller
latch-to-logic ratios. For example, with a LatchRatio of
0.4, the 13 FO4 design point is only 19% worse than the
optimal one while it is 27% worse than optimal with a
LatchRatio of 0.6.
6.3 Latch Insertion Delay
Latch FO4 2 3 4 5
Relative Latch Energy 1.0 0.53 0.36 0.29
Relative Clocking Energy 1.0 0.62 0.49 0.43
Table 2. Latch Insertion Delay (excluding skew and
jitter) vs. Relative Latch Energy
With the large amount of power spent in latches and
clocking, designers may consider the tradeoff between latch
delay and power dissipation as a means to design more
energy-efficient CPUs. Researchers have investigated latch
power vs. delay tradeoff curves both within a given latch
family and across latch families [25, 10]. Table 2, derived
from [25], shows latch FO4-delay vs. latch energy across
several latch styles. The first row of Table 2 shows the latch
insertion delay, excluding clock skew and jitter overhead
which is assumed to be constant for all latches. The second
row shows the relative latch energy, excluding energy dis-
sipated in the clock distribution. The third row shows the
relative energy of the clock system, including both energy
of latches (70% of the total) and clock distribution system
(30%). It is assumed that the clock distribution energy can-
not be completely scaled with reduced latch load. There is
significant overhead simply from driving the wires neces-
sary to distribute the clock over a large area.









Figure 13. BIPS varying latch power-delay
Replacing the baseline fast 2 FO4 latches with slower,
lower power latches increases the latch insertion delay
overhead, which impacts both the performance-optimal
and power-performance-optimal pipeline depth. Figure 13
shows the processor performance versus pipeline depth for
four latches from Table 2. We see that the performance
maxima shift towards shallower pipelines as the latch in-
sertion delay increases. For example, with a 3 FO4 latch
the performance-optimal FO4-depth is 11 FO4 and with a 4
FO4 latch it becomes 16 FO4.







  ! " #
   
Figure 14. BIPS
3
/W varying latch power-delay
Figure 14 shows the impact of the latch insertion delay on
the BIPS³/W rating of the processor for the same
range of pipeline depths. In this figure, all of the data points
are shown relative to the 10 FO4 design with the base 5
FO4 latch. Unlike curves on all previous sensitivity graphs,
curves in Figure 14 do not intersect at the base design point,
because different curves represent different designs, with
different power and performance levels.
Figure 14 shows that using the fastest 2 FO4 latch results
in the best BIPS³/W rating for processors with stages of less
than 14 FO4. For processors with pipelines ranging from 15 FO4
to 24 FO4, a lower power 3 FO4 latch is the most energy
efficient, whereas for shallower pipelines (25 FO4 or more) the
highest BIPS³/W is achieved with even slower 4 FO4 latches.
The use of the 3 FO4 latch, combined with the choice of a
pipeline depth in the range from 15 FO4 to 24 FO4, improves
the BIPS³/W rating of the processor by more than 10% compared
to the base case of 2 FO4 latches. The graph also shows that
the optimal BIPS³/W design point shifts towards shallower
pipelines as high-performance latches are replaced with lower
power ones. For example, the 18 FO4 design point is optimal for
a processor using 2 FO4 latches, whereas the 19 FO4, 20 FO4,
and 21 FO4 design points are optimal for processors using
3 FO4, 4 FO4, and 5 FO4 latches, respectively.
6.4 Glitch Factor
In this subsection we quantify the sensitivity of the optimal
pipeline depth to the glitching factor. There are no practical
means for accurately measuring the actual value of β_base/α
averaged over the whole microprocessor. Instead we measured the
glitching factor for a selected set of functional units and
used the averaged value of β_base/α = 0.3 throughout the
analysis. Here we analyze the sensitivity of the optimal
pipeline depth to the value of β_base/α.
Figure 15 shows the dependence of the BIPS³/W rating of the
processor on the pipeline depth for three values of β_base/α.
From this figure we see that higher glitching factors favor
deeper pipelines. However, in the base design the decrease in
power dissipation related to reduced glitching in deeper
pipelines did not have a substantial impact on the optimal
pipeline depth, primarily because of the relatively small
fraction of power dissipated in combinatorial switching. For
designs with smaller LatchRatio values, this effect could be
more significant.

Figure 15. BIPS³/W varying β_base/α
6.5 Leakage Factor
As explained earlier, the leakage power component
grows more slowly with the pipeline depth than the dy-
namic component. Therefore, the optimum pipeline depth
depends on the LeakageFactor. Throughout the analysis we
assumed the LeakageFactor (P^base_leakage / P^base_dynamic) of
the base 19 FO4 microprocessor to be 0.1. However, as the
technology feature size scales down and the power supply and
transistor threshold voltages scale accordingly, the leakage
power component becomes more and more significant. To study the
effect of this growing leakage power fraction, we measured the
sensitivity of the optimal pipeline depth to the value of the
LeakageFactor.









 !"
 
Figure 16. BIPS
3
/W varying LeakageF actor
Figure 16 shows the BIPS³/W rating of the processor versus
pipeline depth for four values of the LeakageFactor: a value of
0 that represents older CMOS technologies, a value of 0.1,
assumed in the current model, and values of 0.5 and 1.0,
projected for future generation CMOS technologies (arguably,
extreme values). The results in Figure 16 show that unless
leakage reduction techniques become standard practice in the
design of high-end microprocessors, the high values of the
LeakageFactor projected for future generations of CMOS
technologies may tend to shift the optimum pipeline depth
towards slightly deeper pipelines. For current generation
technologies the result for the optimal pipeline depth is
sufficiently stable with respect to reasonable variations in
the LeakageFactor.
Summary of the Sensitivity Analysis: In this section we
considered the sensitivity of the optimal pipeline length to
five key parameters in the power models, using the BIPS³/W
metric. We did not observe a strong dependence of the results
on the assumptions and choices of any of these parameters,
which demonstrates the stability of the model and its
applicability to a wide range of designs. To summarize the
results: higher values of the LatchGrowthFactor favor shallower
pipelines, lower values of the LatchRatio favor deeper
pipelines, the use of lower-power latches favors shallower
pipelines, higher values of the GlitchFactor favor deeper
pipelines, and, finally, higher leakage currents favor deeper
pipelines.
7 Conclusions
In this paper, we have demonstrated that it is impor-
tant to consider both power and performance while opti-
mizing pipelines. For this purpose, we derived detailed
energy models using circuit-extracted power analysis for
microarchitectural structures. We also developed detailed
equations for how the energy functions scale with pipeline
depth. Based on the combination of power and perfor-
mance modeling performed, our results show that a purely
performance-driven, power-unaware design may lead to the
selection of an overly deep pipelined microprocessor oper-
ating at an inherently power-inefficient design point.
As this work is the first quantitative evaluation of power
and performance optimal pipelines, we also performed a
detailed sensitivity analysis of the optimal pipeline depth
against key parameters such as latch growth factor, latch ra-
tio, latch insertion delay, glitch, and leakage currents. Our
analysis shows that there is a range of pipeline depth for
which performance increases can be achieved at a mod-
est sacrifice in power-performance efficiency. Pipelining
beyond that range leads to drastic reduction in power-
performance efficiency with little or no further performance
improvement.
Our results show that for a current generation, out-of-order
superscalar processor, the optimal delay per stage is about
18 FO4 (consisting of a logic delay of 15 FO4 and a 3 FO4 latch
insertion delay) when the objective function is a
power-performance efficiency metric like BIPS³/W; this is in
contrast to an optimal delay of 10 FO4 per stage when
considering the BIPS metric alone. We used a broad suite of
SPEC2000 benchmarks to arrive at this conclusion.
The optimal pipeline depth depends on a number of pa-
rameters in the power models which we have derived from
current state-of-the-art microprocessor design methodolo-
gies. Also, as already established through recent prior work,
such optimal design points generally depend on the in-
put workload characteristics. Our simulation-based experi-
ments on a typical commercial application (TPC-C) shows
that although the optimal pipeline depth is around 10-14
FO4 for performance-only optimization, it increases to 24-
28 FO4 when we consider power and performance opti-
mizations.
In future work, we would like to consider architectural
options other than those available in the base model, for
example, single- versus multi-threading, wide- versus
narrow-issue machines, and in-order versus out-of-order
designs. In addition, we would also like to consider more
circuit-level power-performance tradeoff opportunities and
factor them into the analysis of optimal pipelines. Finally,
there are numerous circuit and technology techniques which
affect leakage and therefore power-performance optimality; we
hope to consider all of these factors in our future work.
References
[1] D. Brooks et al. Power-aware Microarchitecture: Design
and Modeling Challenges for the next-generation micropro-
cessors. IEEE Micro, 20(6):26–44, Nov./Dec. 2000.
[2] D. Brooks, V. Tiwari, and M. Martonosi. Wattch: A frame-
work for architectural-level power analysis and optimiza-
tions. In Proceedings of the 27th Annual International Sym-
posium on Computer Architecture (ISCA-27), June 2000.
[3] D. Brooks, J.-D. Wellman, P. Bose, and M. Martonosi.
Power-Performance Modeling and Tradeoff Analysis for a
High-End Microprocessor. In Power Aware Computing Sys-
tems Workshop at ASPLOS-IX, Nov. 2000.
[4] M. Brown, J. Stark, and Y. Patt. Select-free instruction
scheduling logic. In Proceedings of the 34th International
Symposium on Microarchitecture (MICRO-34), pages 204–
213, December 2001.
[5] P. Dubey and M. Flynn. Optimal pipelining. J. Parallel and
Distributed Computing, 8:10–19, 1990.
[6] P. G. Emma and E. S. Davidson. Characterization of
branch and data dependencies in programs for evaluating
pipeline performance. IEEE Transactions on Computers, C-
36(7):859–875, 1987.
[7] M. J. Flynn, P. Hung, and K. Rudd. Deep-Submicron
Microprocessor Design Issues. IEEE Micro, 19(4):11–22,
July/Aug. 1999.
[8] R. Gonzalez and M. Horowitz. Energy dissipation in gen-
eral purpose microprocessors. IEEE Journal of Solid-State
Circuits, 31(9):1277–84, Sept. 1996.
[9] A. Hartstein and T. R. Puzak. The optimum pipeline depth
for a microprocessor. In Proceedings of the 29th Inter-
national Symposium on Computer Architecture (ISCA-29),
May 2002.
[10] S. Heo, R. Krashinsky, and K. Asanovic. Activity-sensitive
flip-flop and latch selection for reduced energy. In 19th Con-
ference on Advanced Research in VLSI, March 2001.
[11] M. Hrishikesh, K. Farkas, N. Jouppi, D. Burger, S. Keckler,
and P. Sivakumar. The optimal logic depth per pipeline stage
is 6 to 8 FO4 inverter delays. In Proceedings of the 29th
International Symposium on Computer Architecture (ISCA-
29), pages 14–24, May 2002.
[12] V. Iyengar, L. H. Trevillyan, and P. Bose. Representative
traces for processor models with infinite cache. In Proc.
2nd Symposium on High Performance Computer Architec-
ture (HPCA-2), Feb. 1996.
[13] R. Jessani and C. Olson. The floating-point unit of the Pow-
erPC 603e microprocessor. IBM J. of Research and Devel-
opment, 40(5):559–566, Sept. 1996.
[14] P. Kogge. The Architecture of Pipelined Computers. Hemi-
sphere Publishing Corporation, 1981.
[15] S. R. Kunkel and J. E. Smith. Optimal pipelining in super-
computers. In Proceedings of the 13th International Sympo-
sium on Computer Architecture (ISCA-13), pages 404–411,
June 1986.
[16] M. Moudgill, P. Bose, and J. Moreno. Validation of Tu-
randot, a fast processor model for microarchitecture ex-
ploration. In Proceedings of the IEEE International Per-
formance, Computing, and Communications Conference
(IPCCC), pages 451–457, Feb. 1999.
[17] M. Moudgill, J. Wellman, and J. Moreno. Environment
for PowerPC microarchitecture exploration. IEEE Micro,
19(3):9–14, May/June 1999.
[18] J. S. Neely, H. H. Chen, S. G. Walker, J. Venuto, and
T. Bucelot. CPAM: A common power analysis methodol-
ogy for high-performance VLSI design. In Proc. of the 9th
Topical Meeting on the Electrical Performance of Electronic
Packaging, pages 303–306, 2000.
[19] S. Palacharla, N. Jouppi, and J. Smith. Complexity-Effective
Superscalar Processors. In Proceedings of the 24th Inter-
national Symposium on Computer Architecture (ISCA-24),
1997.
[20] P. Song and G. D. Micheli. Circuit and architecture trade-
offs for high-speed multiplication. IEEE Journal of Solid-
State Circuits, 26(9):1184–1198, Sept. 1991.
[21] E. Sprangle and D. Carmean. Increasing processor perfor-
mance by implementing deeper pipelines. In Proceedings of
the 29th International Symposium on Computer Architecture
(ISCA-29), May 2002.
[22] J. Stark, M. Brown, and Y. Patt. On pipelining dynamic
instruction scheduling logic. In Proceedings of the 33rd In-
ternational Symposium on Microarchitecture (MICRO-33),
pages 57–66, Dec. 2000.
[23] N. Vijaykrishnan, M. Kandemir, M. Irwin, H. Kim, and
W. Ye. Energy-driven integrated hardware-software opti-
mizations using SimplePower. In Proceedings of the 27th
Annual International Symposium on Computer Architecture,
June 2000.
[24] V. Zyuban. Inherently Lower Power High Performance Su-
perscalar Architectures. PhD thesis, University of Notre
Dame, March 2000.
[25] V. Zyuban and D. Meltzer. Clocking strategies and
scannable latches for low power applications. In Proc.
of Int’l Symposium on Low-Power Electronics and Design,
2001.
[26] V. Zyuban and P. Strenski. Unified Methodology for Resolv-
ing Power-Performance Tradeoffs of the Microarchitectural
and Circuit Levels. In Proc. of Int’l Symposium on Low-
Power Electronics and Design, pages 166–171, 2002.