Total power optimization combining placement, sizing and multi-Vt through slack distribution management
ABSTRACT Power dissipation is quickly becoming one of the most important limiters in nanometer IC design, because leakage increases exponentially as technology scales down. However, power and timing are often conflicting objectives during optimization. In this paper, we propose a novel total power optimization flow under a performance constraint. Instead of using placement, gate sizing, and multiple-Vt assignment techniques independently, we combine them through the concept of slack distribution management to maximize the potential for power reduction. We propose linear programming (LP) based placement and geometric programming (GP) based gate sizing formulations to improve the slack distribution, which helps to maximize the total power reduction during the Vt-assignment stage. Our formulations include important practical design constraints, such as slew, noise and short circuit power, which were often ignored previously. We tested our algorithm on a set of industrial-strength, manually optimized circuits from a multi-GHz 65 nm microprocessor, and obtained very promising results. To the best of our knowledge, this is the first work that combines placement, gate sizing and Vt swapping systematically for total power (and in particular leakage) management.
Tao Luo, David Newmark†, and David Z. Pan
Department of ECE, University of Texas at Austin, Austin, TX
†Advanced Micro Devices, Austin, TX
tluo@ece.utexas.edu, david.newmark@amd.com, dpan@ece.utexas.edu
1. INTRODUCTION
For nanometer IC designs (90nm and below), power dissipation has become one of the most important limiting factors, since leakage increases exponentially as CMOS technology scales down. Both process and design technologies are being developed to overcome the leakage barrier. Among various design techniques, multiple-Vt assignment is very popular and effective. The idea is fairly straightforward. A design starts with all regular-Vt (Rvt) cells. Once the timing target is roughly met, one replaces non-critical cells with their high-Vt (Hvt) counterparts, as the sub-threshold leakage current of a gate depends exponentially on the threshold voltage. Meanwhile, one fixes the remaining failing paths by using a small number of low-Vt (Lvt) cells, since they are faster (but leak much more).
The effectiveness of Vt swapping relies on the slack distribution, which in turn depends heavily on how timing closure is done during physical synthesis, e.g., placement and gate sizing. As timing and power are often conflicting objectives during optimization, placement has traditionally been used mainly for timing optimization. There is no existing work in placement that considers leakage power reduction.
Gate sizing is used for both timing optimization and power reduction. Conventional gate sizing formulations either minimize the worst case delay or minimize the power under delay constraints [1, 2, 3, 4, 5, 6, 7]. However, gate sizing has never been used to help the Vt-swapping algorithm maximize the overall power reduction, although Vt-swapping is known to be much more effective for leakage reduction.
∗This work is supported in part by NSF, SRC, and IBM Faculty Award.
To maximize the power reduction, the power saving opportunities in the above physical design stages should be considered and utilized in a systematic manner. Since the leakage current is exponentially related to the threshold voltage (Table 1) but only linearly related to the cell size, multi-Vt assignment (i.e., using high-Vt cells as much as possible) is a much more effective technique for leakage power reduction than gate sizing. In other words, to reduce total power where leakage becomes prominent, it is more effective to use placement and gate sizing to promote more effective Vt swapping afterwards than to use them independently for local power reduction. For example, we may size up some cells so that fewer Lvt cells are needed in the end. In that case, the leakage saved could be much larger than the power added by the upsized cells.
Table 1: Normalized delay and leakage current for a cell with different threshold voltages in 65nm technology
Cell    Delay   Leakage current
Lvt     1       17.3
Rvt     1.1     2.4
Hvt     1.3     1
In this paper, we propose to use slack distribution management to “glue” the placement and gate sizing algorithms together to boost the Vt-swapping technique. The primary objective of our approach is to increase the sum of slacks on critical and near-critical paths, i.e., to push the slack distribution curve (not the worst slack) of the circuit away from critical, even at the cost of up-sizing some cells slightly. Fewer critical cells implies fewer Lvt cells and a higher percentage of Hvt cells being used eventually. In other words, we trade a small dynamic power increase for a large leakage power reduction. In addition, we reduce the power directly by sizing down cells on non-critical paths when possible. Our methodology formulates a linear-programming (LP) based placement and two geometric programming (GP) based gate sizing algorithms to change the slack distribution.
In the rest of the paper, section 2 motivates our proposed ap-
proach. The LP based placement stage is introduced in section 3.
The GP formulations are in section 4 and the Vt swapping algo-
rithm is described in section 5. Experimental results are reported in
section 6, and we conclude in section 7.
2. MOTIVATION & PROPOSED APPROACH
In a typical flow, a design starts with all regular-Vt (Rvt) cells. A few timing-violating paths that are very difficult to optimize in other ways can be fixed by swapping in Lvt cells. All Rvt cells on non-critical paths with large slack are swapped to Hvt cells to save power. As shown in Table 1, the leakage of an Hvt cell is significantly smaller than that of its Lvt version, by a factor of about 17 at the 65-nm technology node.
The result of Vt swapping is highly dependent on the slack distribution. If we can reduce the number of near-critical cells, we may use fewer Lvt cells and more Hvt cells. Figure 1 plots the cell slack histogram of a circuit before and after placement plus gate sizing. The circuit is Rvt based. The slack histogram after optimization is tightened around a specified mean with a reduced deviation. Fewer near-critical cells implies that fewer Lvt cells will be used later, and subsequently less leakage power.
[Figure 1: Slack distribution before and after optimization. Axes: num gates and slack; series: Original, Optimized.]
4C-2, 978-1-4244-1922-7/08/$25.00 ©2008 IEEE
2.1 The proposed flow
Our strategy is to use placement and gate sizing to optimize the slack distribution in order to promote Vt swapping. We formulate an LP program for placement and a GP program for gate sizing to maximize the sum of slacks on semi-critical paths. In addition, cells on non-critical paths may be oversized even if they are swapped to Hvt. We therefore formulate a second GP problem to reduce slack and power on non-critical cells directly.
Algorithm 1 The Overall Algorithm
The slack distribution management algorithm
Input: initial design (all Rvt cells)
while (fewer than max. iterations & improved) do
    Incremental placement optimization
    TimingAnalysis
    Cell sizing on critical path for slack
    TimingAnalysis
    Size down non-critical cells
    TimingAnalysis
The Vt-swapping algorithm (Algorithm 2)

Function: TimingAnalysis
    Pre-routing and timing analysis
    if (improved) accept solution, annotate the database
We use placement and gate sizing iteratively to improve the slack distribution. Algorithm 1 shows our proposed flow. Starting from an initial design, we perform placement and critical-cell gate sizing iteratively until there is no further improvement. Finally, we employ Vt swapping to fix the remaining critical paths with a few Lvt cells, and to replace as many Rvt cells with Hvt cells as possible. At the end of each stage in the flow, we run a fairly accurate timing analyzer. The timing tool pre-routes the circuit, extracts the parasitics, and runs the PrimeTime based timing analysis. The timing change from the previous stage is updated and annotated back into the design database, as the basis of the next stage.
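The iterate-and-accept structure of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the design database is reduced to a plain dictionary, and the `stages`, `evaluate`, and `vt_swap` callables are hypothetical stand-ins for the placement, sizing, timing analysis, and Vt-swapping tools.

```python
from typing import Callable, Dict, List

Design = Dict[str, float]  # hypothetical stand-in for the design database


def slack_distribution_flow(
    design: Design,
    stages: List[Callable[[Design], Design]],  # e.g. placement, size-up, size-down
    evaluate: Callable[[Design], float],       # stand-in for TimingAnalysis; higher is better
    vt_swap: Callable[[Design], Design],       # stand-in for Algorithm 2
    max_iter: int = 5,
) -> Design:
    """Run the optimization stages repeatedly, keeping only improving results."""
    best_score = evaluate(design)
    for _ in range(max_iter):
        improved = False
        for stage in stages:
            candidate = stage(dict(design))   # each stage works on a copy
            score = evaluate(candidate)       # re-time after every stage
            if score > best_score:            # accept and annotate only if improved
                design, best_score = candidate, score
                improved = True
        if not improved:                      # no stage helped: stop iterating
            break
    return vt_swap(design)                    # Vt swapping runs once, at the end
```

Accepting a stage only when the re-timed score improves mirrors the "if (improved) accept solution" step of the TimingAnalysis function.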
2.2 Practical design constraints
In the existing power optimization literature, important practical design constraints, such as slew, noise, and short-circuit power, are often not considered, which makes the optimization algorithms impractical for realistic designs. For example, short circuit power is usually assumed to be small and is ignored in most existing power reduction algorithms. However, short circuit power may rise significantly if it is not explicitly controlled in the optimization framework.
2.2.1 The slew and noise related constraints
[Figure 2: Slew rate distribution with and without explicit control. Axes: num gates (logarithmic, 1 to 1000) and slew; series: Slew constrained, Without slew.]
[Figure 3: A simple yet effective short circuit power constraint model. Two panels, Power and Delay, each plotted against input slew and output cap.]
Without restricting the maximum slew rate, cells on short paths will be over-downsized. Figure 2 plots the slew rate histogram of the gate sizing results with and without restricting the slew rate. In Figure 2, many instances violate the 50 ps slew limit if slew constraints are ignored. Our maximum slew rate constraints set an upper bound on the slew rate. Furthermore, cells have different sensitivities to slew for noise. Our model includes effective fan-out constraints, which are an effective way to reduce noise related issues.
2.2.2 Short circuit power constraint
Short circuit power is difficult to model and is ignored in most existing power optimization algorithms. The inputs for internal power are the input slew and the output capacitance, as shown in Figure 3. Imagine a scenario in which the input slew of a cell is large and the load can be fully charged quickly: the PMOS and NMOS will both be on for a longer period, and the Vdd-to-ground current will consume a lot of power. Short circuit power is often assumed to be very small. However, in our experiments, it is comparable to leakage if not properly handled. Figure 3 is a SPICE simulation based look-up table used to interpolate the short circuit power. Note that short circuit power can rise dramatically if the ratio of the input slew to the output capacitive load falls into a certain range, e.g., when the input slew is large while the cell is driving a comparably small load.
In later sections, we will show how to handle the above important design constraints in our proposed algorithm.
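Reading a value out of a slew/load look-up table such as the one Figure 3 describes is commonly done with bilinear interpolation. The Python sketch below assumes that scheme; the paper does not state which interpolation is used, and any grid values passed in are hypothetical, not SPICE data.

```python
from bisect import bisect_right
from typing import List, Sequence, Tuple


def _bracket(axis: Sequence[float], value: float) -> Tuple[int, int, float]:
    """Return the two grid indices around value and the blend factor in [0, 1]."""
    value = min(max(value, axis[0]), axis[-1])            # clamp to the table range
    hi = min(max(bisect_right(axis, value), 1), len(axis) - 1)
    lo = hi - 1
    t = (value - axis[lo]) / (axis[hi] - axis[lo])
    return lo, hi, t


def short_circuit_power(
    slews: Sequence[float],        # ascending input-slew grid points
    caps: Sequence[float],         # ascending output-capacitance grid points
    table: List[List[float]],      # table[i][j] = power at (slews[i], caps[j])
    slew: float,
    cap: float,
) -> float:
    """Bilinear interpolation of a 2-D look-up table (assumed scheme)."""
    i0, i1, ts = _bracket(slews, slew)
    j0, j1, tc = _bracket(caps, cap)
    low = table[i0][j0] * (1 - tc) + table[i0][j1] * tc   # blend along cap at slews[i0]
    high = table[i1][j0] * (1 - tc) + table[i1][j1] * tc  # blend along cap at slews[i1]
    return low * (1 - ts) + high * ts                     # blend along slew
```

Queries outside the grid are clamped to the nearest edge rather than extrapolated, which is a conservative choice made for this sketch.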
3. LP BASED PLACEMENT FOR POWER
The objective of our LP placement formulation is to incrementally reduce the power in the Vt-swapping stage. Therefore, instead of reducing the worst case delay, our LP based placement is formulated primarily to reduce the total number of critical and semi-critical cells, i.e., to push the slack curve away from the critical point, which helps the Vt-swapping tool with leakage reduction.
Linear programming is commonly used for incremental timing driven placement [5, 8, 9, 10, 11]. In LP based incremental placement, a few critical paths are selected by a sign-off timer, and the critical paths are optimized incrementally. Existing LP based timing driven placement algorithms use the half-perimeter wire length (HPWL) for wire estimation, as HPWL can be formulated exactly in an LP framework.
Chen et al. [5] proposed a simultaneous placement and gate sizing approach to optimize the delay. Because the unified placement and sizing GP formulation is not convex, the problem was formulated as a generic geometric program (GGP) and solved iteratively. However, the HPWL based wire load estimation is much less accurate than that available in a stand-alone gate sizing problem, where the wire load can be measured separately. A simultaneous formulation therefore makes the wire load estimation less accurate for gate sizing, which often results in a sub-optimal solution.
3.1 The LP formulations
We assume the following gate delay $DP_i$ and transition $SP_i$ models for cell $i$:
$$DP_i = dp_I + a_{1i}\cdot Slew_i + a_{2i}\cdot Cap_i \qquad (1)$$
$$SP_i = sp_I + u_{1i}\cdot Slew_i + u_{2i}\cdot Cap_i \qquad (2)$$
where $a_{1i}$, $a_{2i}$, $u_{1i}$, and $u_{2i}$ are the fitting coefficients, and $dp_I$ and $sp_I$ denote the intrinsic delay and slew of the corresponding pin of the cell. $Slew_i$ denotes the input slew. Let $HPWL_j$ denote the HPWL of net $j$, and let $Cap_j$ represent the capacitive load of the driver of net $j$. We have
$$Cap_j = c\cdot HPWL_j + Cpin_j$$
which is the sum of the wire capacitance $c\cdot HPWL_j$ and the total pin capacitance driven by net $j$, where $c$ is the unit capacitance.
The LP placement algorithm selects a few critical paths from the timing report, namely those with slack less than a threshold. The net delay sensitivity is computed for each critical net, and an LP program is formulated to minimize the weighted sum over the critical nets, which is an indirect method to increase the total slack on those critical paths. The net delay sensitivity $Sn_j$ is based on the delay propagation sensitivity computation in [11]:
$$Sn_j = c\cdot(a_{2i} + a_{1(i+1)}\cdot u_{2i})$$
Elmore delay [12] is used for wire delay modeling, and the symbols related to net delay are omitted from the formulation for simplicity.
Similar to [13], the critical paths are counted to compute the criticality of each selected critical net; the criticality score of net $j$ is denoted by $Sc_j$. The combined timing weight is therefore $wt_t = Sc_j\, Sn_j$. The dynamic power is a function of the load capacitance of the net. If cell $i$ drives net $j$, we have a power weight
$$wt_p = 0.5\,\alpha_i\cdot F\cdot V^2 \qquad (3)$$
where $\alpha_i$ denotes the switching rate, $F$ is the frequency, and $V$ is the voltage. A control parameter $\beta$ is used to adjust the ratio between the timing and power weights:
$$wt_j = \beta\, wt_p + (1-\beta)\, wt_t$$
$\beta$ is a value between 0 and 1. The primary objective of our LP placement is to reduce the leakage power; thus, $\beta$ is set to a relatively small value. An LP program is formulated to minimize the weighted sum over the critical nets, which indirectly increases the sum of pin slacks:
$$\min \sum_j wt_j L_j, \quad \forall j \in \text{selected critical nets}$$
The residual overlap created in this stage is carefully removed.
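The net weighting above can be made concrete with a short Python sketch. It only evaluates the weights and the weighted objective for fixed pin locations; it does not solve the LP. The function names and the pin-coordinate representation are assumptions made for illustration, and the net length is taken to be the HPWL.

```python
from typing import Iterable, Sequence, Tuple

Point = Tuple[float, float]


def hpwl(pins: Sequence[Point]) -> float:
    """Half-perimeter wire length: width plus height of the pins' bounding box."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def net_weight(sc: float, sn: float, alpha: float,
               freq: float, vdd: float, beta: float) -> float:
    """wt_j = beta * wt_p + (1 - beta) * wt_t."""
    wt_t = sc * sn                        # criticality score times delay sensitivity
    wt_p = 0.5 * alpha * freq * vdd ** 2  # power weight of equation (3)
    return beta * wt_p + (1.0 - beta) * wt_t


def weighted_wirelength(nets: Iterable[Tuple[float, Sequence[Point]]]) -> float:
    """Objective value sum_j wt_j * L_j for (weight, pins) pairs."""
    return sum(weight * hpwl(pins) for weight, pins in nets)
```

A small `beta` makes the timing term dominate, which matches the stated aim of using placement mainly to free up slack for Vt swapping.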
4. GP BASED GATE SIZING FOR POWER
Placement has a limited impact on slack distribution improvement if the cell sizes are not changed. To push the slack curve further, we use the effort based delay model and formulate a geometric programming based gate sizing problem. GP is a special type of non-linear optimization problem that has been used for gate sizing since the 1980s [14, 15, 16]. The standard GP problem has a posynomial objective and constraints of a special format. Over the last ten years, the solving efficiency of GP has approached that of linear programming. We refer the reader to a tutorial on geometric programming [16].
Conventional gate sizing formulations minimize the worst case delay in a circuit under power or area constraints [2, 16, 5], or minimize the power directly under delay constraints [7]. In contrast, our first GP formulation increases the sum of slacks on critical and near-critical outputs. Our second GP program is related to the conventional formulation and focuses on the non-critical part of the circuit to “absorb” large slacks. We therefore treat cells differently depending on their criticality.
4.1 Cell classification
Cells are classified into two sets, the non-critical set NC and the
critical set CRIT, based on the output pin slack. If the pin slack is
larger than a threshold, we add the cell into NC. Similarly, if the
slack is small enough, we add the cell into CRIT.
For the first GP program, we start from all outputs with slack less than δ, for example, δ = 70 ps. We traverse the circuit in reverse breadth-first order; the reverse BFS traversal proceeds only through cell outputs with slack smaller than δ+γ and stops at signal inputs, which are the inputs of the circuit or the outputs of sequential cells. Only cells with slack less than θ (θ < δ) are selected into CRIT. The sizes of the cells in CRIT are the variables of the GP program. As the arrival times of all outputs with slack less than δ are controlled in the GP program, δ−θ acts as a guard band to ensure that timing on other outputs is not disturbed too much.
For the second GP program, for non-critical cells, all cells with slack larger than a threshold are sizable cells in NC, and all outputs are included in the GP problem. In other words, all arrival times are controlled.
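The selection of CRIT for the first GP program can be sketched as a backward breadth-first search. The Python below is an illustration under assumptions: the netlist is represented as a hypothetical `fanin` map from each cell to its driving cells, `slack` holds one output-pin slack per cell, and signal inputs are modeled as cells with no fan-in entry.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple


def select_crit(
    fanin: Dict[str, List[str]],   # cell -> cells driving its inputs (hypothetical format)
    slack: Dict[str, float],       # output-pin slack of each cell
    outputs: Iterable[str],        # cells driving the outputs of the circuit
    delta: float,                  # start threshold, e.g. 70 ps
    gamma: float,                  # extra margin used while traversing
    theta: float,                  # CRIT threshold, theta < delta
) -> Tuple[Set[str], Set[str]]:
    """Return (CRIT, visited) from a reverse BFS starting at low-slack outputs."""
    queue = deque(o for o in outputs if slack[o] < delta)
    visited: Set[str] = set(queue)
    crit: Set[str] = set()
    while queue:
        cell = queue.popleft()
        if slack[cell] < theta:               # only these cells become GP variables
            crit.add(cell)
        for driver in fanin.get(cell, []):    # no fan-in: a signal input, stop here
            if driver not in visited and slack[driver] < delta + gamma:
                visited.add(driver)
                queue.append(driver)
    return crit, visited
```

Cells that are visited but not placed in CRIT correspond to the guard band: their timing is tracked, but their sizes stay fixed.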
4.2 The GP models
We model the gate as a resistor and a switch that drives an RC network. The gate delay and transition are functions of the gate size $W$ and the total capacitive load $Cap$. The cell equivalent impedance differs between the delay and transition equations, and the delay models differ for each pin of a gate and for the falling and rising transitions. We use the worst case models for a cell. The gate delay $Dg_i$ and slew $Sg_i$ are given by
$$Dg_i = dg_I + (h_i/W_i)\cdot Cap_i \qquad (4)$$
Slew is not propagated, but it is monitored and restricted by the following equation:
$$Sg_i = sg_I + (v_i/W_i)\cdot Cap_i \qquad (5)$$
$Cap$ is the sum of the capacitive load and the gate capacitance a cell drives. The pin capacitance of a cell $i$ is a linear function of the cell size $W_i$:
$$Cp_i = e_i + f_i W_i \qquad (6)$$
In the above equations, $dg_I$, $h_i$, $sg_I$, $v_i$, $e_i$, and $f_i$ are all coefficients fitted to the cell library.
Assume that cell $i$ drives $a$ sizable cells and $b$ non-sizable cells. The total capacitance that cell $i$ drives is
$$Cap_i = \sum_{k=1}^{a}(e_k + f_k W_k) + \sum_{l=1}^{b} Cp_l + Cap_{wire} \qquad (7)$$
We add the wire delay to our formulation. An accurate pre-routing tool is used to estimate routes. The pin-to-pin wire delay is computed by a static timer and treated as a constant in the gate sizing formulation.
Three major sources of power consumption, namely dynamic, short circuit, and leakage power, are considered in our approach. The dynamic power can be written as
$$P_i = 0.5\,\alpha\cdot F\cdot V^2\cdot Cap_i \qquad (8)$$
where $\alpha$, $F$, and $V$ are defined as in equation (3). The leakage power is assumed to be proportional to the gate size, and the parameter $leak$ is extracted from the SPICE simulation based power library. The following linear leakage model is sufficient for leakage estimation in the gate sizing stage:
$$L_i = leak_i\cdot W_i \qquad (9)$$
The short circuit power is modeled as constraints in the GP formulation.
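The delay, slew, capacitance, and power models of this section are simple enough to evaluate directly. The following Python sketch transcribes them as plain functions; the function and parameter names are chosen here for illustration, and all coefficient values used with them would be hypothetical rather than library data.

```python
from typing import Iterable, Tuple


def gate_delay(dg_int: float, h: float, w: float, cap: float) -> float:
    """Equation (4): intrinsic delay plus (h / W) times the load."""
    return dg_int + (h / w) * cap


def gate_slew(sg_int: float, v: float, w: float, cap: float) -> float:
    """Equation (5): intrinsic slew plus (v / W) times the load."""
    return sg_int + (v / w) * cap


def pin_cap(e: float, f: float, w: float) -> float:
    """Equation (6): pin capacitance is linear in the cell size."""
    return e + f * w


def load_cap(sizable: Iterable[Tuple[float, float, float]],
             fixed_pin_caps: Iterable[float], cap_wire: float) -> float:
    """Equation (7): sizable fan-out (e, f, W), fixed fan-out pins, and wire."""
    return sum(pin_cap(e, f, w) for e, f, w in sizable) + sum(fixed_pin_caps) + cap_wire


def dynamic_power(alpha: float, freq: float, vdd: float, cap: float) -> float:
    """Equation (8): 0.5 * alpha * F * V^2 * Cap."""
    return 0.5 * alpha * freq * vdd ** 2 * cap


def leakage_power(leak: float, w: float) -> float:
    """Equation (9): leakage proportional to the cell size."""
    return leak * w
```

Doubling a cell's size halves the load-dependent part of its delay in this model, while its leakage and the load it presents to its own drivers both grow linearly, which is the trade-off the sizing formulations balance.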
4.3 Gate sizing effectiveness analysis
Slack and power optimization are often contradictory objectives. Reducing the delay by sizing up cells increases the dynamic and leakage power. Whether and how to size a cell should also be determined by whether such a change could have a negative overall effect. We perform the following gate sizing effectiveness analysis to estimate a sizing range, i.e., we do not size a cell beyond a limit where sizing may have a negative effect. In the following, we derive the power and delay sensitivity to cell size.
[Figure 4: Gate sizing effectiveness analysis. Cell i with inputs 1..J and downstream cells 1..K.]
In Figure 4, the cell $i$ has $J$ inputs and drives $K$ downstream cells. If we change cell $i$ from Rvt to Lvt, the associated power will change by $\Delta Pv_i$, and the delay will change by $\Delta Dv_i$. We have
$$\Delta Pv_i = \Delta Leak_i + \alpha_i F V^2 \sum_j f_j\,\Delta Cp_i, \quad j = 1..J$$
$$\Delta Dv_i = \frac{h'_i - h_i}{W_i}\Big(\sum_k Cp_k + C_w\Big) + \max\Big(\frac{h_j}{W_j}\,\Delta Cp_i\Big), \quad k = 1..K$$
where $\Delta Cp_i$ is the pin capacitance change and $h'_i$ is the corresponding coefficient for the Lvt cell, as in equation (4). Similarly, if the cell size changes by $\Delta W_i$, the associated power change is denoted by $\Delta Pg_i$:
$$\Delta Pg_i = Leak_i\,\Delta W_i + \alpha_i F V^2 \sum_j h_j f_j\,\Delta W_i$$
The associated delay change is denoted by $\Delta Dg_i$:
$$\Delta Dg_i = \Big(\frac{h_i}{W_i+\Delta W_i} - \frac{h_i}{W_i}\Big)\sum_k (Cp_k + C_w) + \max\Big(\frac{h_j}{W_j}\,\Delta W_i\Big)$$
Solving the equation $\Delta Pv_i/\Delta Dv_i = \Delta Pg_i/\Delta Dg_i$ yields one zero and one non-zero solution for $\Delta W_i$. If the non-zero solution is negative, sizing up cell $i$ will increase both power and delay, and cell $i$ is not allowed to be sized up. If the non-zero solution is positive and $W_i + \Delta W_i < Wmax_i$, we set the maximum sizable range of cell $i$ to $Wmax_i = W_i + \Delta W_i$. Beyond this limit, gate sizing has a lower power and delay benefit than Vt swapping.
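The non-zero solution mentioned above has a closed form when the expressions for the power and delay change under sizing are taken at face value. The Python sketch below works under that assumption and introduces its own shorthand, which is not notation from the paper: `a_pow` is the power change per unit size, `c_load` is the summed downstream pin and wire capacitance, `m_up` is the largest h_j / W_j over the input drivers, and `r_vt` is the power-to-delay ratio of the Vt swap.

```python
from typing import Optional


def delay_change(w: float, h: float, c_load: float, m_up: float, dw: float) -> float:
    """Delay change from resizing: (h/(W+dW) - h/W) * c_load + m_up * dW."""
    return (h / (w + dw) - h / w) * c_load + m_up * dw


def nonzero_size_change(w: float, h: float, c_load: float,
                        m_up: float, a_pow: float, r_vt: float) -> Optional[float]:
    """Non-zero root dW of (a_pow * dW) / delay_change(dW) = r_vt.

    Factoring dW out of the delay change gives
        delay_change = dW * (m_up - h * c_load / (w * (w + dW))),
    so after dividing out the dW = 0 root the equation reduces to
        w + dW = h * c_load / (w * (m_up - a_pow / r_vt)).
    """
    denom = w * (m_up - a_pow / r_vt)
    if denom == 0:
        return None                      # no finite non-zero root exists
    return h * c_load / denom - w


def max_size(w: float, w_max: float, dw: Optional[float]) -> float:
    """Apply the rule in the text: a negative root forbids upsizing,
    a positive root below the library limit tightens Wmax."""
    if dw is None:
        return w_max
    if dw < 0:
        return w
    return min(w_max, w + dw)
```

For a swap to Lvt the ratio `r_vt` is typically negative, since power rises while delay falls; the sketch makes no attempt to validate that.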
4.4 GP for near-critical cells
In brief, our first GP program creates more slack for near-critical cells, which maximizes the sum of slacks on critical outputs. The GP formulation is given by
$$\begin{aligned}
\text{minimize}\quad & \sum_j AT_j + \sum_i wt_i\,W_i\,(\Delta Pg_i/\Delta W_i), \quad j \in PO \cap CRIT,\ i \in CRIT \\
\text{s.t.}\quad & Dg_i = d_I + \frac{b_i}{W_i}\,Cp_i \\
& AT_i \ge Dg_i + \max(AT^{p}_{i-1}), \quad p \in \text{input pins of } i \\
& AT_i = T_{start}, \quad \forall i \in PI \\
& W_i \ge Wmin_i,\quad W_i \le Wmax_i
\end{aligned} \qquad (10)$$
In the above, $AT_i$ is the arrival time at the output of cell $i$. $\Delta Pg_i/\Delta W_i$ is the power sensitivity to the cell size:
$$\Delta Pg_i/\Delta W_i = Leak_i + \alpha_i F V^2 \sum_j h_j f_j \qquad (11)$$
$T_{slack}$ is a slack threshold. In the above GP formulation, we optimize the sum of the arrival times of all critical and near-critical outputs. $Wmax_i$ and $Wmin_i$ are the sizing range for cell $i$. The wire delay, which is a constant computed by a static timer in conjunction with a pre-routing tool, is not shown for clarity.
Let $wt_i$ denote the power weight. Without the power objective $\sum_i wt_i W_i$, cells could be excessively upsized, which would cause an unnecessary increase in power. Before the optimization, the sum of the arrival times on critical outputs and the sum of the dynamic and leakage power of the cells in CRIT are evaluated. The power weight is computed to normalize the arrival time and power objectives, and it is associated with the power sensitivity of each cell. A parameter between 0 and 1 adjusts the ratio between the arrival time and power objectives.
4.5 GP for non-critical cells
The GP for non-critical cells optimizes the total power on high-slack cells such that the arrival times do not violate the timing constraints. The GP problem for non-critical cells can be written as
$$\begin{aligned}
\text{min.}\quad & \sum_i (\Delta Pg_i/\Delta W_i), \quad i \in NC \\
\text{s.t.}\quad & AT_i \le \max\big((T_{cycle} - T_{threshold}),\ AT^{orig}_i\big), \quad i \in PO
\end{aligned} \qquad (12)$$
where $\Delta Pg_i/\Delta W_i$ is from equation (11) and $T_{threshold}$ is the slack guard band. We consider swapping non-critical cells with slack larger than $T_{threshold}$ to Hvt cells. $AT^{orig}_i$ is the original arrival time of output $i$. Constraint (12) implies that, for each output $i$, the arrival time after the optimization may not exceed the larger of a delay threshold and its original arrival time. The constraints shared with the GP problem above are not shown here for simplicity.
4.6 Modeling important constraints
Besides delay and power, a few constraints are critical in industrial practice, for example the maximum slew constraint, the effective fan-out constraint for noise, and the short circuit power constraint, which were often ignored in previous studies. Our formulation considers these constraints and models them in the GP framework as follows.
4.6.1 The max slew constraint
Although adding the slew constraints significantly limits the amount of power that can be saved, we should not ignore them, because slew rate violations are unacceptable in real-world designs. The slew equation (5) is used to estimate the slew rate, and we use the following to transform the slew constraint into a sizing constraint in GP form:
$$S_i = s_I + \frac{v_i}{W_i}\,Cp_i, \qquad S_i \le Slew_{max} \qquad (13)$$
where $Slew_{max}$ is the maximum acceptable slew rate.
4.6.2 Effective fan-out constraint for noise tolerance
The concept of effective fan-out (Efo) is related to but different from the conventional fan-out concept. Efo is the effective capacitance a cell drives divided by the effective impedance ratio of the driver relative to a standard inverter. The effective impedance ratio, Ratio, is the hold resistance of a cell divided by that of a standard inverter at a certain voltage level. The Efo constraint is given by
$$Efo_i = \frac{Cp_i}{Ratio_i \times C_{inv1}} \le Efo_{limit} \qquad (14)$$
Applying an effective fan-out constraint to each cell avoids introducing a large number of noise issues during the optimization.
4.6.3 Short circuit power constraint
The short circuit power is non-trivial to handle, and was mostly ignored in previous power optimization work. Since the short circuit power is not large unless the ratio of the input slew to the output capacitive load falls into a certain range, as shown in Figure 3, we can specify a do-not-enter region by adding a linear constraint that restricts the ratio between the input slew and the output capacitance, and thereby avoid large short circuit power consumption:
$$Cap_i \ge p_i + q_i S_i \qquad (15)$$
where $Cap_i$ is the capacitive load driven by cell $i$, and $p_i$ and $q_i$ are the parameters of the linear function shown in Figure 3, which specifies the boundary of the do-not-enter zone. The above constraint ensures that the input slew of a cell is not much larger than its output slew.
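The three practical constraints can be checked for a single cell with a few lines of Python. This is a sketch with assumed names and a fixed load: inside the GP the load itself depends on the sizes of the driven cells, which is ignored here, and all numeric values would be hypothetical.

```python
def slew(s_int: float, v: float, w: float, cp: float) -> float:
    """Slew estimate used in constraint (13): s_I + (v / W) * Cp."""
    return s_int + (v / w) * cp


def min_width_for_slew(s_int: float, v: float, cp: float, slew_max: float) -> float:
    """Smallest W that meets the slew limit for a fixed load.

    Rearranging s_I + v * Cp / W <= Slew_max gives
    W >= v * Cp / (Slew_max - s_I), i.e. a lower bound on the size.
    """
    if slew_max <= s_int:
        raise ValueError("slew limit does not exceed the intrinsic slew")
    return v * cp / (slew_max - s_int)


def effective_fanout(cp: float, ratio: float, c_inv1: float) -> float:
    """Constraint (14): load divided by (impedance ratio * inverter cap)."""
    return cp / (ratio * c_inv1)


def short_circuit_ok(cap: float, slew_in: float, p: float, q: float) -> bool:
    """Constraint (15): the load must stay above the line p + q * S."""
    return cap >= p + q * slew_in
```

The rearrangement in `min_width_for_slew` shows why the slew limit acts as a sizing constraint: it caps how far a cell on a short path can be downsized.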
The number of possible sizes for a gate varies with the gate type. An inverter could have over 20 different sizes. Our algorithm assumes that the sizes are continuous. The solution of the GP solver is a set of continuous gate sizes, which are mapped to the closest discrete ones. The discrete size mapping stage may introduce errors of less than 5 percent.
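Mapping a continuous size to the closest discrete one is a nearest-neighbour lookup. The Python sketch below illustrates it with hypothetical library sizes; the relative-error helper is one possible way to quantify the mapping error and is an assumption, since the paper does not define how the error is measured.

```python
from typing import Sequence


def snap_to_library(w: float, sizes: Sequence[float]) -> float:
    """Return the available size closest to the continuous solution w."""
    return min(sizes, key=lambda s: abs(s - w))


def relative_error(w: float, snapped: float) -> float:
    """Size error introduced by the mapping, relative to the continuous size."""
    return abs(snapped - w) / w
```

For example, with sizes 1, 2, 3, and 4 available, a continuous solution of 2.3 maps to 2 and a solution of 2.6 maps to 3.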
Algorithm 2 The Vt-swapping algorithm
Input: the design after placement and sizing optimization
while (stopping criteria not met) do
    foreach (all cells)
        if (slack > High) swap to Hvt
        if (slack < Low) swap to Lvt
    end
    TimingAnalysis
    Sort cells on sensitivity (critical and non-critical lists)
    foreach (sorted cells)
        Swap to Lvt or Rvt
        Propagate timing and evaluate
    end
end
5. VT SWAPPING ALGORITHM
We use a multiple-pass, sensitivity based Vt swapping algorithm, shown in Algorithm 2, to swap cells. Cells with very large or very small slacks are processed first. The rest are sorted by their sensitivity score. In each swapping pass, two hashes are created, one for Rvt cells and the other for Lvt cells. The sensitivity of a cell is computed from the original slack of the cell and its up-cone and down-cone impact. The top cell is selected one at a time. The internal timer propagates the timing changes downstream and upstream to update the required times and the slacks. The process continues until the slack requirement is met. The swapping process is performed multiple times for different supply voltages and performance corners, and a solution that satisfies all corners is adopted.
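The first step of each pass, where cells with very large or very small slack are swapped outright, can be sketched in Python as follows. This is an illustration only: it covers the threshold step and not the sensitivity-sorted step, it does not re-time the design after a swap, and the `cells` dictionary format is an assumption. The leakage figures are the normalized values from Table 1.

```python
from typing import Dict, Tuple

# Normalized leakage current per threshold flavor, taken from Table 1.
LEAKAGE = {"Lvt": 17.3, "Rvt": 2.4, "Hvt": 1.0}

Cell = Tuple[str, float]  # (Vt flavor, slack) -- hypothetical representation


def threshold_pass(cells: Dict[str, Cell], high: float, low: float) -> Dict[str, Cell]:
    """Swap large-slack cells to Hvt and small-slack cells to Lvt."""
    swapped: Dict[str, Cell] = {}
    for name, (vt, slack) in cells.items():
        if slack > high:
            vt = "Hvt"      # plenty of slack: take the leakage saving
        elif slack < low:
            vt = "Lvt"      # failing or nearly failing: buy speed with leakage
        swapped[name] = (vt, slack)  # slack is stale until timing is re-run
    return swapped


def total_leakage(cells: Dict[str, Cell]) -> float:
    """Sum of normalized leakage over all cells."""
    return sum(LEAKAGE[vt] for vt, _ in cells.values())
```

With Table 1's numbers, a single Lvt cell leaks as much as about seventeen Hvt cells, which is why reducing the count of cells below the low threshold matters so much for the final leakage.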
6. EXPERIMENTAL RESULTS
The placement and gate sizing algorithms are implemented in C++, and the Vt swapping algorithm in Perl. We use the commercial tool MOSEK [17] as the GP solver. Several modules from a multi-GHz microprocessor in a 65nm process technology are used for the experiments. The numbers of cells and nets are shown in Table 2 and are typical of microprocessor designs. The circuits were initially manually placed and timing optimized, and were taped out in a test chip. Note that high performance microprocessor circuits have stringent timing targets and are very difficult to optimize for timing. Therefore, in most cases the multi-Vt swapping technique has to be used to repair the remaining failing paths. All experiments were run on a 2.4GHz 64-bit Opteron Linux server. We use an internal power evaluation tool to estimate the power consumption.
Table 2 shows the total power comparisons. Tables 3 and 4 report the comparisons of leakage power and dynamic power, respectively. In all tables, column Base shows the baseline optimization condition, where cells are mostly Rvt cells. Column VT shows the power after Vt swapping. PV shows the combined placement and Vt swapping flow, and column PGV stands for the combined placement, gate sizing and Vt swapping flow. We can see that Vt swapping is very effective in reducing leakage power. The combined LP based placement and GP based gate sizing algorithm provides additional improvement and the flexibility to trade off dynamic against static power by optimizing the slack distribution. We observe an additional 7.9% total power reduction, which is significant for manually optimized custom circuits. In the current configuration, the placement optimization is set up mostly to help leakage power. The combined placement, gate sizing and Vt swapping flow gives the best results and reduces leakage power by 63.8% and total power consumption by 32.9%.
Table 3: Leakage power comparison
        Base      VT        PGV      VT|Base %   PGV|Base %
ckt1    10.50     6.09      3.28     42.0        68.8
ckt2    11.49     4.79      3.67     58.3        68.1
ckt3    52.11     20.10     17.38    61.4        66.6
ckt4    45.76     30.42     28.06    33.5        38.7
ckt5    93.04     26.62     18.28    71.4        80.4
ckt6    99.29     24.78     19.64    75.0        80.2
ckt7    104.77    60.25     49.86    42.5        52.4
ckt8    215.24    108.46    96.28    49.6        55.3
Avg.                                 54.2        63.8
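The percentage columns of Table 3 are reductions relative to Base, and the final entry of each percentage column matches the mean of the eight per-circuit percentages. The Python sketch below recomputes both from the Base, VT, and PGV columns; the helper names are chosen here for illustration.

```python
from typing import List, Tuple

# (circuit, Base, VT, PGV, reported VT|Base %, reported PGV|Base %) from Table 3
TABLE3: List[Tuple[str, float, float, float, float, float]] = [
    ("ckt1", 10.50, 6.09, 3.28, 42.0, 68.8),
    ("ckt2", 11.49, 4.79, 3.67, 58.3, 68.1),
    ("ckt3", 52.11, 20.10, 17.38, 61.4, 66.6),
    ("ckt4", 45.76, 30.42, 28.06, 33.5, 38.7),
    ("ckt5", 93.04, 26.62, 18.28, 71.4, 80.4),
    ("ckt6", 99.29, 24.78, 19.64, 75.0, 80.2),
    ("ckt7", 104.77, 60.25, 49.86, 42.5, 52.4),
    ("ckt8", 215.24, 108.46, 96.28, 49.6, 55.3),
]


def reduction_pct(base: float, new: float) -> float:
    """Power saved relative to the baseline, in percent."""
    return 100.0 * (base - new) / base


def mean_reductions(rows) -> Tuple[float, float]:
    """Mean of the per-circuit VT and PGV reductions, in percent."""
    vt = [reduction_pct(base, vt_pow) for _, base, vt_pow, _, _, _ in rows]
    pgv = [reduction_pct(base, pgv_pow) for _, base, _, pgv_pow, _, _ in rows]
    return sum(vt) / len(vt), sum(pgv) / len(pgv)
```

Because the summary figure is a mean of percentages rather than a ratio of summed power, large and small circuits contribute equally to it.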
Table 4: Dynamic power comparison
        Base      VT        PGV      VT|Base %   PGV|Base %
ckt1    19.29     18.51     17.24    4.0         10.6
ckt2    18.77     17.62     16.02    6.1         14.7
ckt3    90.40     83.35     78.73    7.8         12.9
ckt4    65.10     63.15     58.32    3.0         10.4
ckt5    140.52    125.19    105.12   10.9        25.2
ckt6    141.98    131.11    120.79   7.7         14.9
ckt7    182.45    173.38    167.28   5.0         8.3
ckt8    283.91    268.88    258.30   5.3         9.0
Avg.                                 6.2         13.3
The breakdown of runtime is shown in Table 5. Column Timing reports the runtime of the static timing analysis flow. Our sophisticated timing analysis flow pre-routes the circuit, extracts parasitics, runs a PrimeTime engine to generate the timing report, and annotates the timing information into the design database. We run the timing analysis at the end of every optimization stage to update the timing information. Therefore, the multiple runs of the timing analysis flow consume a large part of the runtime.