
A New Paradigm for Low-power, Variation-Tolerant Circuit Synthesis Using Critical Path Isolation

Swaroop Ghosh, Swarup Bhunia*, and Kaushik Roy

School of Electrical and Computer Engineering, Purdue University, IN, USA

*Electrical Engineering and Computer Science, Case Western Reserve University, OH, USA

Abstract

Design considerations for robustness with respect to variations and low power operations typically impose contradictory design requirements. Low power design techniques such as voltage scaling and dual-Vth can have a large negative impact on parametric yield. In this paper, we propose a novel paradigm for low-power variation-tolerant circuit design, which allows aggressive voltage scaling. The principal idea is to (a) isolate and predict the set of possible paths that may become critical under process variations, (b) ensure that they are activated rarely, and (c) avoid possible delay failures in the critical paths by dynamically switching to two-cycle operation (assuming all standard operations are single cycle) when they are activated. This allows us to operate the circuit at reduced supply voltage while achieving the required yield. Simulation results on a set of benchmark circuits at 70nm process technology show average power reduction of 60% with less than 10% performance overhead and 18% overhead in die-area compared to conventional synthesis. Application of the proposed methodology to pipelined design is also investigated.

1. INTRODUCTION

It is well-known that process parameter variations (both systematic and random) may cause parametric failures in logic circuits, leading to yield loss. Conventional wisdom dictates a conservative design approach (e.g., scaling up the VDD or upsizing logic gates) to avoid a large number of chip failures. However, such techniques come at the cost of power and/or die area. Process tolerance and low power, therefore, represent contradictory design requirements. Over the past few years, the statistical design approach has been widely investigated as an effective method to ensure yield under process variations. Several gate-level sizing and/or Vth assignment techniques [1] have been proposed recently that address the minimization of total power while maintaining the timing yield. On the other end of the spectrum, design techniques (e.g., adaptive body biasing [2]) have been proposed for post-silicon process compensation and process adaptation to deal with process-related timing failures.

Due to the quadratic dependence of the dynamic power of a circuit on its operating voltage, supply voltage scaling has been extremely effective in reducing power dissipation. Researchers have investigated logic design approaches that are robust with respect to process variations and, at the same time, suitable for aggressive voltage scaling. One such technique [3] uses dynamic detection and correction of circuit timing errors to tune the processor supply voltage. Design optimization techniques that use gate sizing and dual-Vth assignment to improve power/area typically increase the number of critical paths in a circuit, giving rise to the so-called "wall effect" [4]. The uncertainty-aware design technique [4] describes an optimization process to reduce the wall effect. However, it does not address the problem of power dissipation.

In this paper, we present a novel design paradigm which achieves robustness with respect to timing failure and provides the opportunity for aggressive voltage scaling by critical path isolation. The term critical path isolation is used throughout this paper to indicate the confinement of the critical paths of a synthesized design to a known logic block (or cofactor, as we will see later). Such isolation leads to a design methodology for low power dissipation by making the critical paths predictable and rare under parametric variations. Any possible delay errors (that may occur under a single cycle operation) are predicted ahead of time and are avoided by two-cycle operations (assuming all standard operations are single cycle). This lets us scale the supply voltage aggressively for low power dissipation. In particular, the proposed technique:

• Isolates the critical paths and makes them predictable (by decoding a few primary inputs) under parametric variations so that, with reduced supply voltage, possible delay errors are deterministic and can be avoided by two-cycle operation.

• Restricts the occurrences of the above two-cycle operations by reducing the activation probability of critical paths.

• Increases the delay margin between critical and non-critical paths by both logic synthesis and proper gate sizing for improved yield, reliability of operations and low power by voltage scaling.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

ICCAD'06, November 5-9, 2006, San Jose, CA

Copyright 2006 ACM 1-59593-389-1/06/0011...$5.00

We also present an application of the proposed methodology in pipeline based design for low power operation. The circuit is re-designed to operate at a fixed low supply voltage with occasional two-cycle operations. The two-cycle operations are implemented by stalling the pipeline.

Some researchers have proposed techniques to correct variability-induced timing errors during operation by voltage scaling. The technique in [3], referred to as Razor, reduces or eliminates voltage margins by dynamic scaling of the supply voltage while monitoring the error rate. Razor allows the occurrence of errors at low voltage and then recovers. However, it does not modify the logic synthesis or gate sizing process and thus can perform poorly in the presence of a large number of critical paths. The technique proposed in this paper, on the other hand, synthesizes a circuit in a specific way to facilitate voltage scaling for power reduction as well as to improve yield by making the delay failures deterministic.

2. PRELIMINARY ANALYSIS

In this section, we first present the example of an adder to illustrate the proposed approach for low power robust circuit design. Next, we present the design flow followed by its analysis, which allows us to apply a similar approach to any random logic circuit.

2.1. Voltage scaling and two-cycle operations in a 4-bit adder

For the sake of simplicity, we choose a 4-bit ripple carry adder as shown in Fig. 1. Signals P0-P3 (G0-G3) are the propagate (generate) signals whereas Ci,0 (Co,1-Co,3) are carry-in (carry-out) signals [5]. As evident, the path from carry-in to carry-out is critical and determines the frequency of operation of the adder. However, note that the critical path is activated only when $C_{i,0} = 1$ and, at the same time, $P_0P_1P_2P_3 = 1$. Since the probability of such occurrences is very low (as $p(P_0P_1P_2P_3C_{i,0}=1) = p(P_0)p(P_1)p(P_2)p(P_3)p(C_{i,0})$ is very low), one can reduce the supply voltage such that all operations with $P_0P_1P_2P_3 = 0$ and/or $C_{i,0} = 0$ can still be performed in one cycle. However, when the critical path is activated, the correct results are obtained by evaluating the adder in two clock cycles (called two-cycle operation). The activation of the critical path can be predicted by pre-computation of $P_0P_1P_2P_3$. In a nutshell, by making the critical path predictable and utilizing the available slack between critical and non-critical paths, it is possible to operate the circuit at reduced supply voltage. Note that this approach incurs the penalty of an extra clock cycle when the critical path is activated. However, by ensuring low activation probability of critical paths, it may be possible to reduce the active and leakage power while only rarely paying the penalty of an extra clock cycle.
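The activation condition described above can be checked by brute force. The following is a minimal sketch that assumes independent, uniformly distributed operand bits and carry-in (an assumption made here for illustration); the function names are hypothetical and not taken from the paper.

```python
from itertools import product


def critical_path_active(a, b, c_in):
    """True when the carry ripples through every stage.

    a, b are tuples of operand bits; the propagate signal of stage i
    is P_i = a_i XOR b_i. The carry-in to carry-out path is exercised
    only when c_in = 1 and every P_i = 1.
    """
    return c_in == 1 and all(ai ^ bi for ai, bi in zip(a, b))


def activation_probability(n_bits=4):
    """Fraction of all input combinations that activate the critical path.

    Assumes every input bit is 0 or 1 with equal probability.
    """
    total = 0
    hits = 0
    for bits in product((0, 1), repeat=2 * n_bits + 1):
        a = bits[:n_bits]
        b = bits[n_bits:2 * n_bits]
        c_in = bits[-1]
        total += 1
        hits += critical_path_active(a, b, c_in)
    return hits / total


if __name__ == "__main__":
    print(activation_probability(4))
```

Under the uniform-input assumption, the result for four bits equals (1/2)^5, since four propagate signals and the carry-in must all be 1.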

To evaluate the feasibility of this idea, we simulated a 4-bit ripple carry adder with 1V supply in Hspice. We used BPTM [6] 70nm devices for simulation. The critical path delay was found to be 260ps and the average power consumption was 13.03µW. Assuming the clock period to be 260ps, we reduced the supply to 0.8V. The non-critical paths were still within the single-cycle delay bound; however, the critical path delay increased to 330ps and was evaluated in two cycles. The new power consumption was 7.32µW, leading to a 44% saving in total power.
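The reported saving follows directly from the two power figures. The sketch below reproduces that arithmetic and, as a separate illustration, estimates the average cycle count per operation. The 1/32 two-cycle probability used in the example is an assumption (uniformly random inputs), not a measurement from the paper, and the function names are hypothetical.

```python
def power_saving(p_nominal_uw, p_scaled_uw):
    """Fractional power saving when moving from nominal to scaled supply."""
    return 1.0 - p_scaled_uw / p_nominal_uw


def average_cycles(p_two_cycle):
    """Average cycles per operation when a fraction p_two_cycle of the
    operations needs one extra clock cycle."""
    return 1.0 + p_two_cycle


if __name__ == "__main__":
    saving = power_saving(13.03, 7.32)   # power values quoted in the text
    print(f"power saving: {100 * saving:.1f}%")
    # Assumed activation probability of the critical path: 1/32
    print(f"average cycles per operation: {average_cycles(1 / 32)}")
```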

2.2. Generalization to random logic

Earlier, we presented the idea of supply voltage scaling for an adder where the critical path was unique (assuming no process variation). However, a random logic circuit can have many critical paths with corresponding input conditions for activation. Further, the critical paths may vary from chip to chip due to parametric variations. In such situations, the overhead associated with pre-decoding logic can overshadow the power savings. To exercise a similar supply scaling technique on random logic circuits, we need to make sure that (a) the critical paths are confined to a predictable logic section; and (b) the non-critical paths remain non-critical under process variation by providing a safe timing slack. The timing slack between critical and non-critical paths will be the enabling factor for supply voltage scaling. An example of a possible path delay distribution (cartoon) is shown in Fig. 2.

To obtain the delay distribution shown in Fig. 2, the design needs to be partitioned and synthesized in such a way that the paths are divided into several logic blocks. The partitioning procedure should ensure that (a) these logic blocks can be active or remain idle based on the state of primary inputs; and (b) the probabilities of activation of the logic blocks containing critical paths (called critical blocks) are very low. It will then be possible to predict the activation of a logic block (and the corresponding paths) just by decoding the states of inputs. Next, gate sizing can be performed on the partitioned logic blocks to maximize the slack between critical and non-critical blocks, leading to further isolation of critical paths. Note that the suggested sizing approach is the opposite of conventional sizing because, in this case, the critical paths should be made slower while non-critical paths should be made faster. By performing the partitioning and sizing, a path delay distribution similar to the one shown in Fig. 2 can be achieved. Finally, supply voltage scaling can be done such that non-critical blocks meet the desired timing yield with respect to the one-cycle delay target whereas critical blocks meet the yield with respect to the two-cycle delay target. In other words, the critical blocks can operate in two cycles while the non-critical blocks can operate in a single cycle. Since the probability of activation of the critical block is low, the new design operating at a scaled voltage will have minimum impact on performance. The overall design strategy is shown in Fig. 3. The partitioning and sizing is more clearly illustrated in Fig. 4 where a circuit is partitioned into four functional logic blocks f1-f4. The outputs are fed to an OR network to generate the final outputs. Suppose that, by virtue of proper partitioning, f4 becomes the least activated functional block containing the critical paths. Then f4 can be downsized further while the other functional blocks can be upsized to maximize the slack and further isolate the critical paths, as shown by arrows in Fig. 4. In Section 3, we will describe a Shannon based partitioning technique which helps in isolating the critical paths.

2.3. Analysis of the proposed design methodology

Let us consider two different designs for the same combinational circuit, design-A and design-B, with timings as shown in Fig. 5. Design-A (design-B) is representative of the conventional design (proposed design). In design-A, the slack of the critical path is S1 with respect to the clock period Tc whereas in design-B, the critical path (shown by hatched lines in Fig. 5) does not meet the timing constraint and has a negative slack of S3. However, the non-critical paths (shown by the dotted block in Fig. 5) in design-B maintain a maximum slack of S2. We also assume that the activation condition of critical paths in design-B is known based on the states of a few inputs (say, N). An extra decoder is needed in design-B for pre-determining the occurrences of critical path activation. Design-B can function properly with two-cycle operations for critical paths and single-cycle operations for non-critical paths. Let us now compare the power consumption of design-A and design-B, where V0 is the voltage at which design-A meets the slack requirement S1, whereas design-B meets slack S2 for non-critical paths and S3 for critical paths. Since voltage is proportional to (delay)^-1, the scaled voltage ($V_A^{new}$) for design-A can be determined as follows,

$$V_0 \propto \frac{1}{T_c - S_1} \quad \text{and,} \quad V_A^{new} \propto \frac{1}{T_c} \;\Rightarrow\; V_A^{new} = V_0\left(1 - \frac{S_1}{T_c}\right) \qquad (1)$$

Fig. 1 Ripple carry adder [5] (four full adders, FA, with propagate/generate signals P0-P3, G0-G3, carry-in Ci,0 and carry-outs Co,0-Co,3)

Fig. 2 Path delay distribution needed for the proposed methodology (number of paths versus path delay; the paths beyond the one-cycle delay target are predictable and restricted to a logic section having low activation probability, and are separated from the remaining paths by a slack)

Input: Optimized netlist

1. Perform an input based partitioning of the netlist such that the activation probability of critical logic blocks is very small.

2. Perform gate sizing on logic blocks to create timing slack between critical and non-critical blocks.

3. Perform supply voltage scaling while meeting the yield for non-critical (critical) blocks in one-cycle (two-cycle).

Output: Sized netlist and new supply voltage

Fig. 3 Design methodology

Fig. 4 Steps 1 and 2 of proposed design methodology (original circuit partitioned into functional blocks f1-f4 whose outputs feed an OR network that drives the primary outputs, PO)

Fig. 5 Timing diagram of design-A and design-B (slacks S1, S2 and S3 relative to the clock period Tc)

Fig. 6 Plot of EDP ratio (EDPA/EDPB) of design-A and design-B versus N and $S_2^{norm}$ for k = 0.2, $C_{d0}^{norm} = 0.001$ and n = 4

Fig. 7 Shannon's expansion based partitioning (primary inputs feed cofactors CF1, CF2 and the shared cofactor sCF; a MUX controlled by xi selects the primary outputs)

Similarly,

$$V_B^{new} = V_0\left(1 - \frac{S_2}{T_c}\right); \quad \text{s.t.} \quad S_2 + S_3 \le T_c \qquad (2)$$

If the performance penalty due to two-cycle operation in design-B is p, then the effective clock cycle delay of design-B is $T_c + pT_c$. The energy-delay products (EDP) of the two designs are given by

$$EDP_A = C_A \left(V_A^{new}\right)^2 (T_c); \quad EDP_B = (C_B + C_d)\left(V_B^{new}\right)^2 (T_c + pT_c) \qquad (3)$$

where CA (CB) is the average switched capacitance of design-A (design-B) and Cd is the average switched capacitance of the decoding logic (for determination of critical path activation).

The EDP ratio (after putting the values of $V_A^{new}$ and $V_B^{new}$) is given by:

$$\frac{EDP_A}{EDP_B} = \frac{1}{\dfrac{C_B}{C_A} + \dfrac{C_d}{C_A}} \left(\frac{1 - \dfrac{S_1}{T_c}}{1 - \dfrac{S_2}{T_c}}\right)^2 \frac{1}{1 + p} = \frac{1}{C_B^{norm} + C_d^{norm}} \left(\frac{1 - S_1^{norm}}{1 - S_2^{norm}}\right)^2 \frac{1}{1 + p} \qquad (4)$$

where $C_B^{norm}$ ($C_d^{norm}$) is the average switched capacitance of design-B (decoder logic) normalized with respect to $C_A$, and $S_1^{norm}$ ($S_2^{norm}$) is the slack of design-A (design-B) normalized with respect to $T_c$.

From the expression shown in equation (4), it is possible to study the conditions under which it may be useful to opt for design-B rather than design-A. Design-B is better than design-A if EDPA/EDPB > 1. Since $(C_B^{norm} + C_d^{norm}) > 1$ (assuming CB ≥ CA due to design modifications) and $(1 + p) > 1$, a necessary condition for design-B to be better than design-A is,

$$S_2^{norm} > S_1^{norm}, \quad \text{i.e.,} \quad S_2 > S_1 \qquad (5)$$

Therefore, a larger value of S2 is better for power savings. However, the upper bound of S2 is determined by the constraint S2 + S3 ≤ Tc (equation (2)). Hence, S2 can be maximized by minimizing slack S3. Let us explore the design space for which design-B can be beneficial. For the sake of simplicity, we model the normalized capacitances and performance penalty (p) as follows,

$$C_B^{norm} = 1 + \frac{k}{1 - S_2^{norm}}; \quad C_d^{norm} = N\,C_{d0}^{norm}, \quad S_1^{norm} = 0.05 \quad \text{and,} \quad p = \frac{N}{2^n}$$

where k is a constant, N is the number of input vectors that should be decoded to determine if critical paths are activated, $C_{d0}^{norm}$ is the normalized average switched capacitance of decoding a single input vector and n is the total number of primary inputs of the circuit. The EDP ratio plotted for different values of N and $S_2^{norm}$ is shown in Fig. 6. From the EDP ratio profile shown in this figure, design-B is beneficial only if N is small (to minimize the switched capacitance of decoding logic). Also, the initial flat portion of the profile indicates that $S_2^{norm}$ should be greater than $S_1^{norm}$. Although the EDP curve increases with $S_2^{norm}$, a large value of $S_2^{norm}$ may increase the switched capacitance of the circuit (i.e. $C_B^{norm}$ if gate sizing is used) and offset the saving in power.

From the analysis presented above, it can be concluded that the power saving in the proposed method mainly comes from the quadratic dependency of power on voltage. Power reduces quadratically while the delay increases only linearly, letting us reduce the EDP.
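To make the trade-off in equation (4) concrete, the following sketch evaluates the EDP ratio for given normalized quantities. It takes the normalized capacitances, slacks and penalty directly as inputs rather than assuming any particular capacitance model; the function name and the sample values are illustrative assumptions only.

```python
def edp_ratio(c_b_norm, c_d_norm, s1_norm, s2_norm, p):
    """EDP_A / EDP_B following the structure of equation (4).

    c_b_norm, c_d_norm : switched capacitance of design-B and of the
                         decoder, normalized to C_A
    s1_norm, s2_norm   : slacks S1 and S2 normalized to T_c
    p                  : performance penalty of two-cycle operation
    """
    cap_term = 1.0 / (c_b_norm + c_d_norm)
    volt_term = ((1.0 - s1_norm) / (1.0 - s2_norm)) ** 2
    perf_term = 1.0 / (1.0 + p)
    return cap_term * volt_term * perf_term


if __name__ == "__main__":
    # Hypothetical operating points: a ratio above 1 favors design-B.
    print(edp_ratio(1.2, 0.05, 0.05, 0.5, 0.1))
    print(edp_ratio(1.2, 0.05, 0.05, 0.05, 0.1))
```

With equal slacks the capacitance and penalty overheads push the ratio below 1, which is consistent with the necessary condition that S2 exceed S1.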

3. DESIGN METHODOLOGY

Based on the analysis and the guidelines derived above, we describe the details of each step of the design flow (Fig. 3). This is followed by simulation results on a set of benchmark circuits.

3.1 Circuit partitioning and synthesis for critical path isolation

Let us first consider performing an input based partition of the circuit such that the critical paths are isolated and their activation probability is reduced. To achieve this, we used Shannon expansion based partitioning [7], which partitions a Boolean expression f into disjoint sub-expressions as shown below:

$$f(x_1,\ldots,x_i,\ldots,x_n) = x_i \cdot f(x_1,\ldots,x_i=1,\ldots,x_n) + \bar{x}_i \cdot f(x_1,\ldots,x_i=0,\ldots,x_n) = x_i \cdot CF_1 + \bar{x}_i \cdot CF_2$$

$$CF_1 = f(x_1,\ldots,x_i=1,\ldots,x_n); \quad CF_2 = f(x_1,\ldots,x_i=0,\ldots,x_n) \qquad (6)$$

where (x1…xn) are input literals, xi is the control variable, and CF1 and CF2 are called cofactors. If f contains sub-expressions independent of the control variable xi, then we may also have a Shared Cofactor (sCF) (Fig. 7). In this work, we have used Shannon expansion based partitioning mainly due to the following inherent properties: (a) the circuit partitioning is done based on inputs; (b) the activation probability of partitioned logic blocks can be easily reduced by performing multi-level hierarchical expansion; and (c) by properly choosing the control variables, it is possible to isolate the critical paths to the logic block having the least activation probability. In the following paragraphs, we first explain multi-level expansion for reduction of the activation probability of cofactors, followed by the control variable selection strategy for critical path isolation during partitioning.

Fig. 8 Original circuit (outputs f1 and f2; inputs x1-x9)

Fig. 9 Control variable is x4: (a) CF1; (b) CF2

Fig. 10 Control variable is x1: (a) CF1; (b) CF2

Fig. 11 Automated synthesis flow

Input: Original netlist, area constraint (Amax), delay constraint (Dmax)

1. Read netlist and create graph G, Level = 1
2. Initialize gList = {G}
3. Make expansion decision for the graphs in gList
4. currList = all graphs in gList
5. For each graph Gi in currList
6. Is Gi marked for expansion?
7. Choose a control variable and expand Gi into CF1, CF2, sCF; add CF1, CF2, and sCF to gList
8. Remove Gi from gList
9. All graphs of currList traversed?
10. Area < Amax and Delay < Dmax? (Level++ for the next round of expansion)

Output: gList
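The decomposition in equation (6) can be illustrated on a small function. This is a minimal sketch, assuming Boolean functions are represented as plain Python callables over 0/1 arguments; the helper names and the example function are hypothetical and not part of the authors' synthesis flow.

```python
from itertools import product


def cofactors(f, i):
    """Return (CF1, CF2): f with input i fixed to 1 and to 0.

    Both cofactors keep the original argument list so they can be called
    with the same inputs as f; the value passed at position i is ignored.
    """
    def cf1(*x):
        return f(*x[:i], 1, *x[i + 1:])

    def cf2(*x):
        return f(*x[:i], 0, *x[i + 1:])

    return cf1, cf2


def shannon_holds(f, n, i):
    """Check f == x_i * CF1 + x_i' * CF2 for every input combination."""
    cf1, cf2 = cofactors(f, i)
    for x in product((0, 1), repeat=n):
        recomposed = (x[i] & cf1(*x)) | ((1 - x[i]) & cf2(*x))
        if recomposed != f(*x):
            return False
    return True


def example_f(x1, x2, x3):
    """Hypothetical function: x1*x2 + x1'*x3."""
    return (x1 & x2) | ((1 - x1) & x3)


if __name__ == "__main__":
    print(all(shannon_holds(example_f, 3, i) for i in range(3)))
```

With x1 as the control variable, exactly one cofactor determines the output for any given input, which is what allows a block to be predicted as active or idle by decoding the control inputs alone.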

In equation (6), the activation probability of each cofactor is 50% (assuming 50% switching probability of inputs). By performing multi-level expansion, the activation probability of the resulting cofactors can be reduced further. For example, a 2nd level expansion of f (equation (7)) results in four cofactors, each with an activation probability of 25%.

$$f(x_1,\ldots,x_i,\ldots,x_n) = x_i x_j \cdot CF_1 + x_i \bar{x}_j \cdot CF_2 + \bar{x}_i x_j \cdot CF_3 + \bar{x}_i \bar{x}_j \cdot CF_4 \qquad (7)$$

Control variable selection plays a very important role in achieving the desired goals in Shannon's expansion based partitioning. In [8, 9], the most binate variable is chosen as the control variable to minimize the area overhead. However, this heuristic may not lead to the confinement of critical paths of the circuit after expansion. For example, consider the multiple-output two-level Boolean logic function with outputs f1 and f2 over the inputs x1 to x9. From the circuit realization shown in Fig. 8, it can be observed that f1 is the critical function (or critical output). If n(xi) is the total literal count of xi in f1 and f2 then n(x1)=4, n(x2)=1, n(x3)=2, n(x4)=4, n(x5)=n(x6)=n(x7)=n(x8)=n(x9)=1. Considering the most binate variable as the preferable choice, either x1 or x4 can be picked as the control variable. With x4 as the control variable, the resulting cofactors are shown in Fig. 9. It can be noticed that the critical paths are distributed between the cofactors. However, if x1 is chosen as the control variable, the critical path is confined to f1(CF2) (Fig. 10). Clearly, a strategy is needed to isolate the critical paths and limit them to a particular cofactor. If ai (bi) is the literal count of variable xi in true (complement) form in the critical function (or output), then the following criteria should be fulfilled: (i) the control variable should be present in the critical function (i.e. min(ai, bi) > 0); (ii) the difference of ai and bi should be large to ensure that the paths are isolated to one cofactor; and (iii) max(ai, bi) should be small to minimize the probability of the logic depth of isolated critical paths being reduced by logic optimization. The following metric can be used:

$$M_i = \frac{|a_i - b_i|}{\max(a_i, b_i)} \qquad (8)$$

A literal with the maximum value of Mi ensures that the critical path is isolated to a cofactor. Using this metric, we follow the steps described in [8] for choosing the control variable in our synthesis flow.
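The metric of equation (8) is straightforward to compute from a sum-of-products description of the critical function. The sketch below is an illustration under stated assumptions: each product term is a dictionary mapping a variable name to True (true form) or False (complement form), the example function is hypothetical, and the selection shown covers only criterion (i) and the metric itself, not the additional steps of [8].

```python
def literal_counts(terms):
    """Count true-form (a_i) and complement-form (b_i) literals per variable."""
    counts = {}
    for term in terms:
        for var, is_true in term.items():
            a, b = counts.get(var, (0, 0))
            counts[var] = (a + 1, b) if is_true else (a, b + 1)
    return counts


def isolation_metric(a, b):
    """M_i = |a_i - b_i| / max(a_i, b_i), as in equation (8)."""
    return abs(a - b) / max(a, b)


def choose_control_variable(terms):
    """Pick the variable with the largest M_i among those satisfying
    criterion (i), min(a_i, b_i) > 0. Returns None if no variable qualifies."""
    counts = literal_counts(terms)
    candidates = {
        var: isolation_metric(a, b)
        for var, (a, b) in counts.items()
        if min(a, b) > 0
    }
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


# Hypothetical critical function:
# x1*x2 + x1*x3' + x1'*x4 + x2*x3 + x1*x4'
critical_terms = [
    {"x1": True, "x2": True},
    {"x1": True, "x3": False},
    {"x1": False, "x4": True},
    {"x2": True, "x3": True},
    {"x1": True, "x4": False},
]

if __name__ == "__main__":
    print(literal_counts(critical_terms))
    print(choose_control_variable(critical_terms))
```

In this example x2 is excluded because it never appears in complement form, and x3 and x4 score zero because their true and complement counts are equal.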

To achieve the dual objectives of isolating the critical paths to a cofactor while reducing its activation probability during partitioning and synthesis, we adhere to the following approach: (a) we partition the circuit and determine the cofactor where the critical paths have been isolated (called the critical cofactor); (b) we mark this cofactor (i.e. the critical cofactor) for further expansion to reduce the activation probability of the critical paths. The above mentioned steps are repeated under a given area and delay constraint. Note that Synopsys Design Compiler [10] has been used for synthesizing the new cofactors. The overall synthesis flow is shown in Fig. 11. A complete example of hierarchical partitioning and synthesis is also illustrated in Fig. 12 where the original circuit is partitioned into four cofactors, CF20, CF32, CF53 and CF63. The critical paths have been isolated to CF53 (which is activated by 3 inputs i.e. x1x2'x3). Note that, in this example, we do not have the shared cofactor (sCF). Shared cofactors are important in avoiding logic duplication during partitioning. However, they are independent of the control variable. Therefore our synthesis flow (Fig. 11) automatically chooses the shared cofactor for further expansion (if critical paths are isolated to it).

3.2 Gate Sizing for further isolation

In the previous subsection, we presented a circuit partitioning method to isolate the critical paths to a cofactor with small activation probability. The next step is to size the resulting cofactors individually to (a) further isolate the critical paths and (b) create timing slack between critical and non-critical cofactors to allow lowering of the supply voltage. To achieve this goal, all gates of the critical cofactor are downsized to make the corresponding paths further critical. The gates belonging to the remaining cofactors are selectively upsized to make them more non-critical and increase the slack (S2, as discussed in Section 2.3). An example of the proposed sizing approach after partitioning is shown in Fig. 12. The cofactors with dashed (solid) lines indicate expanded (non-expanded) circuits and levels indicate the hierarchy. As shown in the figure, cofactor CF53 is downsized to make it further critical while other cofactors are upsized to make them more non-critical. Note that the proposed sizing approach is very different from conventional sizing because, in this case, the critical paths are made slower while non-critical paths are made faster.

Fig. 12 Hierarchical expansion and sizing of cofactors (original circuit expanded over LEVEL1-LEVEL3 into CF10, CF20, CF32, CF42, CF53 and CF63, combined by a MUX network; number of control variables: CF53, CF63: 3; CF32: 2; CF20: 1)

Fig. 13 Results for benchmark sct: (a) path delay distribution after partitioning and sizing; (b) cofactor-wise critical path delay distribution under Vt variation (VDD=1V), (c) VDD=0.7V (1000 simulations; Org: critical path delay distribution of original circuit; CF1-CF4: cofactor-wise critical path delay distribution of proposed circuit; number of control variables: CF1: 4; CF2: 3; CF3, CF4: 2)

TABLE-1

Procedure performSizing()

Input : target delay (Tc), yield (Y), list of cofactors (gList);

Output : sized netlist;

1. maxLevel = maximum hierarchy of the cofactors in gList;
2. run SSTA on Gi ∈ gList;
3. critCF = cofactor with critical paths at maxLevel hierarchy;
4. for each cofactors Gi ∈ gList
5.      calculate Gi→muxDelay;
6. end for
7. dTarget = αTc – critCF→muxDelay;
8. downSize(critCF, dTarget, Y);
9. critDelay = critCF→maxDelay + critCF→muxDelay;
10. for each cofactors Gi ∈ gList
11.      if Gi ≠ critCF
12.          dTarget = critDelay - Tc - Gi→muxDelay;
13.          upSize(Gi, dTarget, Y);
14. end for
15. Add mux's in Gi ∈ gList to create a complete graph G;
16. return G;

We follow the above mentioned sizing strategy in a Lagrangian Relaxation (LR) based gate sizing [12] as shown in Table 1. Given a delay target (Tc), it tries to meet the yield requirement with minimum area. The procedure takes gList (i.e., the list of cofactors) and determines the cofactor at the highest level of hierarchy, maxLevel, for downsizing. The target delay (dTarget) for sizing the critical cofactor candidate (i.e. critCF) is computed in Step 7 (with α=1.2, determined empirically to allow minimization of S3 as discussed in Section 2.3). The delay targets of non-critical cofactors are obtained by subtracting Tc and multiplexer delays from the overall critical path delay (Step 12). The non-critical cofactor candidates are now upsized while meeting the yield target (Step 13). A more detailed description of Table 1 is omitted for brevity.

3.3?Determination?of?supply?voltage?

After circuit partitioning and sizing, we obtain a path delay distribution similar to Fig. 2. We may now assign a lower supply voltage to reduce power dissipation while maintaining robustness. To achieve this, we start from the nominal supply and iteratively reduce it until one of three stopping criteria is met: (a) a delay violation in any of the non-critical cofactors (one-cycle delay target) for the given yield constraint; (b) a delay violation in the critical cofactor (two-cycle delay bound) for the target yield; or (c) the 3Vth limit for reliable super-threshold operation [5]. The new voltages for a set of MCNC benchmarks are shown in Section 3.4.
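The iterative voltage reduction described above can be sketched as a simple loop. This is a minimal illustration and not the authors' implementation: the paper re-evaluates delays statistically at each supply, whereas this sketch assumes the input delays are already taken at the required yield point for the nominal supply and rescales them with an alpha-power-law model. The function names, the exponent `a`, the 10 mV step and the default Vth are all assumptions made for the example.

```python
def scaled_delay(d_nom, vdd_mv, vdd_nom_mv=1000, vth_mv=200, a=1.3):
    """Scale a nominal-supply delay to vdd_mv.

    Assumed model (not from the paper): delay ~ V / (V - Vth)**a.
    """
    def k(v_mv):
        v, vth = v_mv / 1000.0, vth_mv / 1000.0
        return v / (v - vth) ** a

    return d_nom * k(vdd_mv) / k(vdd_nom_mv)


def lowest_supply_mv(noncrit_delays, crit_delay, tc,
                     vdd_nom_mv=1000, vth_mv=200, step_mv=10):
    """Lower the supply step by step until a stopping criterion fires.

    noncrit_delays: yield-point delays of the non-critical cofactors
    crit_delay:     yield-point delay of the critical cofactor
    tc:             clock period (one-cycle delay target)
    Returns the last supply (in mV) that satisfied every check.
    """
    vdd_mv = vdd_nom_mv
    while True:
        trial = vdd_mv - step_mv
        # Criterion (c): stay at or above 3*Vth.
        if trial < 3 * vth_mv:
            break
        # Criterion (a): every non-critical cofactor must fit in one cycle.
        if any(scaled_delay(d, trial, vdd_nom_mv, vth_mv) > tc
               for d in noncrit_delays):
            break
        # Criterion (b): the critical cofactor must fit in two cycles.
        if scaled_delay(crit_delay, trial, vdd_nom_mv, vth_mv) > 2 * tc:
            break
        vdd_mv = trial
    return vdd_mv


if __name__ == "__main__":
    # Illustrative delays in ps; not taken from the paper's benchmarks.
    print(lowest_supply_mv([60.0, 70.0], 120.0, 100.0), "mV")
```

Working in integer millivolts keeps the 3Vth comparison exact and avoids floating-point drift in the loop counter.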

3.4 Simulation results

In the previous sections, we presented a methodology to make the possible delay errors (that may occur under single-cycle operation) predictable and rare (using circuit partitioning and sizing). We also discussed the determination of the new supply voltage. In this section, we present simulation results on a set of MCNC benchmarks to demonstrate the feasibility of this methodology. In particular, we show (a) isolation of critical paths to a cofactor (having low activation probability); and (b) reduction of supply voltage for low power dissipation while maintaining robustness. In the following paragraphs we present the simulation setup, followed by the results and discussion.

For logic optimization in our synthesis flow, we have used Synopsys Design Compiler [10]. As a basis of comparison, the original benchmarks are also optimized for area in Synopsys. The mapping is done to a standard cell library. The circuit delays are computed using SSTA for BPTM 70nm technology. The parametric variations (L, Tox, doping, etc.) have been lumped into threshold voltage variation. The changes in Vth due to inter-die (∆Vtinter) and intra-die (∆Vtintra) process variations are modeled as Gaussian variables with zero mean and standard deviations of 80mV and 40mV, respectively. The total change in transistor Vth is given by the sum of ∆Vtinter and ∆Vtintra. The delay target (Tc) for the sizing procedure is chosen by plotting the area-delay curve of the circuit and selecting the delay at which the slope of the curve is -1. The area and delay constraints for Shannon-based partitioning are kept at 40% and 20% more than the original area and delay, respectively. The yield targets of the original circuit and the cofactors for gate sizing are set to 95%. The yield target of cofactors operating in one cycle (two cycles) after application of the reduced supply is fixed at 95% (100%). For power estimation, the circuits are simulated in Hspice by applying a set of 200 random input patterns with input switching probabilities of 0.2 as well as 0.5. The runtime of the entire methodology is found to be small (6.03s for the largest benchmark, cht, on a SUN Blade 1000 workstation).
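The lumped Vth variation model above can be illustrated with a short sampling sketch. It assumes, beyond what the text states, that the inter-die and intra-die components are independent, that the inter-die shift is shared by all transistors on a die, and that the intra-die shift is drawn separately per transistor. Under the independence assumption the total standard deviation for a single transistor is sqrt(80² + 40²), about 89 mV. The function names are hypothetical.

```python
import math
import random
import statistics

SIGMA_INTER_V = 0.080  # inter-die sigma from the setup above (80 mV)
SIGMA_INTRA_V = 0.040  # intra-die sigma from the setup above (40 mV)


def sample_die_vth_shifts(n_transistors, rng):
    """Return the Vth shift (in volts) of each transistor on one die.

    Assumption: one inter-die draw per die, one independent intra-die
    draw per transistor; the total shift is their sum.
    """
    inter = rng.gauss(0.0, SIGMA_INTER_V)
    return [inter + rng.gauss(0.0, SIGMA_INTRA_V)
            for _ in range(n_transistors)]


def total_sigma_v():
    """Sigma of the summed shift when the two components are independent."""
    return math.hypot(SIGMA_INTER_V, SIGMA_INTRA_V)


if __name__ == "__main__":
    rng = random.Random(0)
    # One transistor per die, many dies: estimates the total sigma.
    shifts = [sample_die_vth_shifts(1, rng)[0] for _ in range(20000)]
    print(f"analytic sigma: {total_sigma_v() * 1000:.1f} mV")
    print(f"sampled sigma:  {statistics.pstdev(shifts) * 1000:.1f} mV")
```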

To illustrate the isolation of critical paths to the critical cofactor, we have plotted the path delay distribution of an example MCNC benchmark circuit (sct) after partitioning and sizing (Fig. 13(a)). This figure clearly indicates that the critical paths of the re-synthesized design are limited to the critical cofactor. We also present its cofactor-wise critical path delay distribution under process variation (Vth variation, Fig. 13(b)). From this figure, note that (a) CF1 remains critical even under parametric variation while the other cofactors remain non-critical; and (b) there is a delay slack between CF1 and the other cofactors. Also, note that the critical cofactor CF1 is at the 4th hierarchical level (i.e., 4 control variables) to minimize its activation probability. The delay distribution at reduced supply is shown in Fig. 13(c). It shows that CF1 operates in two cycles while the rest of the cofactors operate in a single cycle.
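To see why placing the critical cofactor at a deeper hierarchical level lowers its activation probability, consider the following back-of-the-envelope sketch. It assumes, purely for illustration, that the control variables are statistically independent and that a two-cycle operation costs exactly one extra cycle; neither the function names nor these assumptions come from the paper, and real input statistics may differ.

```python
from math import prod


def activation_probability(control_probs):
    """Probability that every control variable takes its selecting value.

    control_probs[i] is the probability that control variable i has the
    value that selects the critical cofactor (independence assumed).
    """
    return prod(control_probs)


def expected_cycles(p_activate):
    """Average cycles per operation: one cycle normally, two if activated."""
    return (1.0 - p_activate) * 1 + p_activate * 2


def performance_penalty_percent(p_activate):
    """Extra cycles per operation relative to pure single-cycle operation."""
    return 100.0 * (expected_cycles(p_activate) - 1.0)


if __name__ == "__main__":
    # Four equiprobable control variables, as for CF1 at the 4th level.
    p = activation_probability([0.5] * 4)
    print(p, performance_penalty_percent(p))
```

With four equiprobable control variables the cofactor is activated for 1 in 16 input vectors, which under these assumptions corresponds to a 6.25% average cycle penalty.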

In Fig. 14, we show the area, power and new supply voltage for a set of MCNC benchmark circuits. It can be observed from Fig. 14 (a)

[Figure: bar charts over the benchmarks cht, sct, pcle, mux, decod, cm150a, x2, alu2 and count. (a) VDD [V]; (b) % improvement in power for input switching probabilities of 0.2 and 0.5; (c) Area (x10^3) for the original and proposed designs.]

Fig. 14 (a) Supply voltage of proposed design; (b) % improvement in power; and, (c) Area overhead

[Figure: pipeline between inputs and outputs built from the cht, mux and cm150a circuits, with decoding logic D1, D2, D3, a freeze signal and CLK; delays of 85ps, 80ps and 70ps are annotated.]

Fig. 15 Example of a pipeline design using proposed method

[Figure: performance penalty (%) plotted against k, the number of control variables for the critical cofactor, with curves for N=5 and N=10 (N increasing).]

Fig. 16 Performance penalty for (a) critical cofactor at k=4, (b) different values of k

TABLE-2
Procedure pipelineDesign()
Input  : yield (Y), list of circuits (dList), VDDL; /* VDDL < 1V */
Output : list of re-designed circuits (dList);
1. target delay (Tc) = max(stage delays);
2. for each design Di ∈ dList
3.      gList = performPartitioning(Di, VDDL); /*Fig. 11*/
4.      Di = performSizing(gList, Tc, Y, VDDL); /*Table 1*/
5. end for
6. return dList;
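The control flow of Table 2 can be sketched as follows. The partitioning and sizing steps are passed in as callables because their internals (Fig. 11 and Table 1) are not reproduced here; the stub functions, the data representation and the stage delays in the usage example are placeholders chosen for illustration only.

```python
def pipeline_design(stage_delays, designs, partition, size,
                    yield_target, vddl):
    """Re-design every pipeline stage against a common delay target.

    stage_delays: delay of each stage; the slowest one sets Tc
    designs:      the stage circuits (any representation)
    partition:    callable(design, vddl) -> list of cofactors
    size:         callable(g_list, tc, yield_target, vddl) -> design
    Returns (tc, list of re-designed circuits).
    """
    tc = max(stage_delays)
    redesigned = []
    for design in designs:
        g_list = partition(design, vddl)
        redesigned.append(size(g_list, tc, yield_target, vddl))
    return tc, redesigned


if __name__ == "__main__":
    # Placeholder stand-ins for the real partitioning and sizing steps.
    def stub_partition(design, vddl):
        return [f"{design}_cf0", f"{design}_cf1"]

    def stub_size(g_list, tc, yield_target, vddl):
        return {"cofactors": g_list, "tc": tc}

    tc, out = pipeline_design([85.0, 80.0, 70.0],
                              ["cht", "mux", "cm150a"],
                              stub_partition, stub_size,
                              yield_target=0.95, vddl=0.8)
    print(tc, len(out))
```

Because Tc is taken from the slowest stage, every stage is sized against the same clock period.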
