# A new paradigm for low-power, variation-tolerant circuit synthesis using critical path isolation.

**ABSTRACT** Design considerations for robustness with respect to variations and low power operations typically impose contradictory design requirements. Low power design techniques such as voltage scaling, dual-Vth, etc., can have a large negative impact on parametric yield. In this paper, we propose a novel paradigm for low-power variation-tolerant circuit design, which allows aggressive voltage scaling. The principal idea is to (a) isolate and predict the set of possible paths that may become critical under process variations, (b) ensure that they are activated rarely, and (c) avoid possible delay failures in the critical paths by dynamically switching to two-cycle operation (assuming all standard operations are single cycle) when they are activated. This allows us to operate the circuit at reduced supply voltage while achieving the required yield. Simulation results on a set of benchmark circuits at 70nm process technology show average power reduction of 60% with less than 10% performance overhead and 18% overhead in die-area compared to conventional synthesis. Application of the proposed methodology to pipelined design is also investigated.



Swaroop Ghosh, Swarup Bhunia*, and Kaushik Roy

School of Electrical and Computer Engineering, Purdue University, IN, USA

*Electrical Engineering and Computer Science, Case Western Reserve University, OH, USA


1. INTRODUCTION

It is well-known that process parameter variations (both systematic and random) may cause parametric failures in logic circuits, leading to yield loss. Conventional wisdom dictates a conservative design approach (e.g., scaling up the VDD or upsizing logic gates) to avoid a large number of chip failures. However, such techniques come at the cost of power and/or die area. Process tolerance and low power, therefore, represent contradictory design requirements. Over the past few years, statistical design approaches have been widely investigated as an effective method to ensure yield under process variations. Several gate-level sizing and/or Vth assignment techniques [1] have been proposed recently, addressing the minimization of total power while maintaining the timing yield. On the other end of the spectrum, design techniques (e.g., adaptive body biasing [2]) have been proposed for post-silicon process compensation and process adaptation to deal with process-related timing failures.

Due to the quadratic dependence of the dynamic power of a circuit on its operating voltage, supply voltage scaling has been extremely effective in reducing power dissipation. Researchers have investigated logic design approaches that are robust with respect to process variations and, at the same time, suitable for aggressive voltage scaling. One such technique [3] uses dynamic detection and correction of circuit timing errors to tune processor supply voltage. Design optimization techniques using gate sizing and dual-Vth assignment to improve power/area typically increase the number of critical paths in a circuit, giving rise to the so-called “wall effect” [4]. The uncertainty-aware design technique [4] describes an optimization process to reduce the wall effect. However, it does not address the problem of power dissipation.

In this paper, we present a novel design paradigm, which achieves robustness with respect to timing failure and provides the opportunity for aggressive voltage scaling by critical path isolation. The notion of critical path isolation is used throughout this paper to indicate the confinement of the critical paths of a synthesized design to a known logic block (or cofactor, as we will see later). Such isolation leads to a design methodology for low power dissipation by making the critical paths predictable and rare under parametric variations. Any possible delay errors (that may occur under a single-cycle operation) are predicted ahead of time and are avoided by two-cycle operations (assuming all standard operations are single cycle). This lets us scale the supply voltage aggressively for low power dissipation. In particular, the proposed technique:

• Isolates the critical paths and makes them predictable (by decoding a few primary inputs) under parametric variations so that, with reduced supply voltage, possible delay errors are deterministic and can be avoided by two-cycle operation.

• Restricts the occurrences of the above two-cycle operations by reducing the activation probability of critical paths.

• Increases the delay margin between critical and non-critical paths by both logic synthesis and proper gate sizing for improved yield, reliability of operations and low power by voltage scaling.

ICCAD'06, November 5-9, 2006, San Jose, CA. Copyright 2006 ACM 1-59593-389-1/06/0011...$5.00

We also present an application of the proposed methodology in pipeline-based design for low power operation. The circuit is re-designed to operate at a fixed low supply voltage with occasional two-cycle operations. The two-cycle operations are implemented by stalling the pipeline.

Some researchers have proposed techniques to correct variability-induced timing errors during operation by voltage scaling. The technique in [3], referred to as RAZOR, reduces or eliminates voltage margins by dynamic scaling of the supply voltage while monitoring the error rate. Razor allows the occurrence of errors at low voltage and then recovers. However, it does not modify the logic synthesis or gate sizing process and thus can perform poorly in the presence of a large number of critical paths. The technique proposed in this paper, on the other hand, synthesizes a circuit in a specific way to facilitate voltage scaling for power reduction as well as to improve yield by making the delay failures deterministic.

2. PRELIMINARY ANALYSIS

In this section, we first present the example of an adder to illustrate the proposed approach for low power robust circuit design. Next, we present the design flow followed by its analysis, which allows us to apply a similar approach to any random logic circuit.

2.1. Voltage scaling and two-cycle operations in a 4-bit adder

For the sake of simplicity, we choose a 4-bit ripple carry adder as shown in Fig. 1. Signals P0-P3 (G0-G3) are the propagate (generate) signals whereas Ci,0 (Co,1-Co,3) are carry-in (carry-out) signals [5]. As evident, the path from carry-in to carry-out is critical and determines the frequency of operation of the adder. However, note that the critical path is activated only when Ci,0 = 1 and, at the same time, P0P1P2P3 = 1. Since the probability of such occurrences is very low (as p(P0P1P2P3Ci,0=1) = p(P0)p(P1)p(P2)p(P3)p(Ci,0) is very low), one can reduce the supply voltage such that all operations with P0P1P2P3 = 0 and/or Ci,0 = 0 can still be performed in one cycle. However, when the critical path is activated, the correct results are obtained by evaluating the adder in two clock cycles (called two-cycle operation). The activation of the critical path can be predicted by pre-computation of P0P1P2P3. In a nutshell, by making the critical path predictable and utilizing the available slack between critical and non-critical paths, it is possible to operate the circuit at reduced supply voltage. Note that this approach incurs the penalty of an extra clock cycle when the critical path is activated. However, by ensuring low activation probability of critical paths, it may be possible to reduce the active and leakage power while only rarely paying the penalty of an extra clock cycle.

To evaluate the feasibility of this idea, we simulated a 4-bit ripple carry adder with 1V supply in Hspice. We used BPTM [6] 70nm devices for simulation. The critical path delay was found to be 260ps and the average power consumption was 13.03uW. Assuming the clock period to be 260ps, we reduced the supply to 0.8V. Now the non-critical paths were within the single-cycle delay bound; however, the critical path delay increased to 330ps and was evaluated with two cycles. The new power consumption was 7.32uW, leading to a 44% saving in total power.
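The adder argument can be checked numerically. The sketch below is illustrative only: it assumes uniformly random, independent inputs and the common definition P_i = A_i XOR B_i for the propagate signal (the text does not spell out either assumption), and the function names are hypothetical. It enumerates all input vectors of a 4-bit adder to estimate how often the carry chain is fully activated, and recomputes the power saving from the two reported power figures.

```python
from itertools import product


def critical_path_activation_probability(bits: int = 4) -> float:
    """Fraction of (A, B, Cin) vectors with all propagate signals and Cin equal to 1.

    Assumes P_i = A_i XOR B_i and uniformly distributed, independent inputs.
    """
    all_ones = 2 ** bits - 1
    activated = 0
    total = 0
    for a, b, cin in product(range(2 ** bits), range(2 ** bits), (0, 1)):
        total += 1
        # The carry ripples through every stage only if every P_i is 1 and Cin is 1.
        if (a ^ b) == all_ones and cin == 1:
            activated += 1
    return activated / total


def power_saving(p_nominal_uw: float, p_scaled_uw: float) -> float:
    """Relative power saving when moving from the nominal to the scaled supply."""
    return 1.0 - p_scaled_uw / p_nominal_uw


p_act = critical_path_activation_probability(4)
saving = power_saving(13.03, 7.32)      # power figures reported in the text
avg_cycles = 1.0 + p_act                # one extra cycle whenever the path is activated

print(f"activation probability: {p_act:.5f}")
print(f"power saving: {saving:.1%}")
print(f"average cycles per operation: {avg_cycles:.5f}")
```

Under these assumptions only one input vector in 32 triggers the two-cycle operation, which is why the extra-cycle penalty stays small compared with the reported power saving.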

2.2. Generalization to random logic

Earlier, we presented the idea of supply voltage scaling for an adder where the critical path was unique (assuming no process variation). However, a random logic circuit can have many critical paths with corresponding input conditions for activation. Further, the critical paths may vary from chip to chip due to parametric variations. In such situations, the overhead associated with pre-decoding logic can overshadow the power savings. To exercise a similar supply scaling technique on random logic circuits, we need to make sure that (a) the critical paths are confined to a predictable logic section; and (b) the non-critical paths remain non-critical under process variation by providing a safe timing slack. The timing slack between critical and non-critical paths will be the enabling factor for supply voltage scaling. An example of a possible path delay distribution (cartoon) is shown in Fig. 2.

To obtain the delay distribution shown in Fig. 2, the design needs to be partitioned and synthesized in such a way that the paths are divided into several logic blocks. The partitioning procedure should consider the fact that (a) these logic blocks can be active or remain idle based on the state of primary inputs; and (b) the probabilities of activation of the logic blocks containing critical paths (called critical blocks) are very low. Therefore, it will be possible to predict the activation of a logic block (and the corresponding paths) just by decoding the states of inputs. Next, gate sizing can be performed on the partitioned logic blocks to maximize the slack between critical and non-critical blocks, leading to further isolation of critical paths. Note that the suggested sizing approach will be the opposite of conventional sizing because in this case, the critical paths should be made slower while non-critical paths should be made faster. By performing the partitioning and sizing, a path delay distribution similar to the one shown in Fig. 2 can be achieved. Finally, supply voltage scaling can be done such that non-critical blocks meet the desired timing yield with respect to the one-cycle delay target whereas critical blocks meet the yield with respect to the two-cycle delay target. In other words, the critical blocks can operate in two cycles while the non-critical blocks can operate in a single cycle. Since the probability of activation of the critical block is low, the new design operating at a scaled voltage will have minimum impact on performance. The overall design strategy is shown in Fig. 3. The partitioning and sizing is more clearly illustrated in Fig. 4, where a circuit is partitioned into four functional logic blocks f1-f4. The outputs are fed to an OR network to generate the final outputs. Suppose that, by virtue of proper partitioning, f4 becomes the least activated functional block containing the critical paths. Then f4 can be downsized further while the other functional blocks can be upsized to maximize the slack and further isolate the critical paths, as shown by arrows in Fig. 4. In Section 3, we will describe a Shannon-based partitioning technique which helps in isolating the critical paths.

2.3. Analysis of the proposed design methodology

Let us consider two different designs for the same combinational circuit, design-A and design-B, with timings as shown in Fig. 5. Design-A (design-B) is representative of the conventional design (proposed design). In design-A, the slack of the critical path is S1 with respect to the clock period Tc whereas in design-B, the critical path (shown by hatched lines in Fig. 5) does not meet the timing constraint and has a negative slack of S3. However, the non-critical paths (shown by dotted block in Fig. 5) in design-B maintain a maximum slack of S2. We also assume that the activation condition of critical paths in design-B is known based on the states of a few inputs (say, N). An extra decoder is needed in design-B for pre-determining the occurrences of critical path activation. Obviously, design-B can function properly with two-cycle operations for critical paths and single-cycle operation for non-critical paths. Let us now compare the power consumption of design-A and design-B, where V0 is the voltage at which design-A meets the slack requirement S1, whereas design-B meets slack S2 for non-critical paths and S3 for critical paths. Since voltage is proportional to (delay)^-1, the scaled voltage ($V_A^{new}$) for design-A can be determined as follows,

$$\frac{1}{V_0} \propto T_c - S_1 \quad \text{and,} \quad \frac{1}{V_A^{new}} \propto T_c \;\;\Rightarrow\;\; V_A^{new} = V_0\left(1 - \frac{S_1}{T_c}\right) \qquad (1)$$

Fig. 1 Ripple carry adder [5]

Fig. 2 Path delay distribution needed for the proposed methodology (number of paths vs. path delay; annotations: one-cycle delay target, slack, "predictable and restricted to a logic section having low activation probability")

Fig. 3 Design methodology

Input: Optimized netlist

1. Perform an input based partitioning of the netlist such that the activation probability of critical logic blocks is very small.
2. Perform gate sizing on logic blocks to create timing slack between critical and non-critical blocks.
3. Perform supply voltage scaling while meeting the yield for non-critical (critical) blocks in one-cycle (two-cycle).

Output: Sized netlist and new supply voltage

Fig. 4 Steps 1 and 2 of proposed design methodology (original circuit; blocks f1-f4; inputs; OR network; PO)

Fig. 5 Timing diagram of design-A and design-B (slacks S1, S2, S3 relative to clock period Tc)

Fig. 6 Plot of EDP ratio of design-A and design-B for k = 0.2, $C_{d0}^{norm} = 0.001$ and n = 4 (axes: N, $S_2^{norm}$, EDPA/EDPB)

Fig. 7 Shannon's expansion based partitioning (blocks: CF1, CF2, sCF; MUX controlled by xi; primary inputs and primary outputs)

Similarly,

$$V_B^{new} = V_0\left(1 - \frac{S_2}{T_c}\right); \quad \text{s.t. } S_2 + S_3 \le T_c \qquad (2)$$
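As a small numerical sketch of the slack-to-voltage relation in equations (1) and (2), the snippet below evaluates V0(1 − S/Tc) under the same first-order assumption that delay is inversely proportional to supply voltage. The function names and the normalized example values are hypothetical and chosen only for illustration.

```python
def scaled_voltage(v0: float, slack: float, t_clk: float) -> float:
    """Supply voltage after the given slack is consumed, assuming delay ~ 1/V."""
    if not 0.0 <= slack < t_clk:
        raise ValueError("slack must satisfy 0 <= slack < t_clk")
    return v0 * (1.0 - slack / t_clk)


def design_b_voltage(v0: float, s2: float, s3: float, t_clk: float) -> float:
    """Scaled voltage for design-B, enforcing the constraint S2 + S3 <= Tc."""
    if s2 + s3 > t_clk:
        raise ValueError("constraint S2 + S3 <= Tc is violated")
    return scaled_voltage(v0, s2, t_clk)


# Illustrative values with Tc normalized to 1.0 and V0 = 1.0 V.
v_a_new = scaled_voltage(v0=1.0, slack=0.05, t_clk=1.0)
v_b_new = design_b_voltage(v0=1.0, s2=0.30, s3=0.20, t_clk=1.0)

print(f"design-A scaled voltage: {v_a_new:.2f} V")
print(f"design-B scaled voltage: {v_b_new:.2f} V")
```

With these example numbers, the larger slack S2 of the non-critical paths lets design-B run at a noticeably lower supply than design-A.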

If the performance penalty due to two-cycle operation in design-B is p, then the effective clock cycle delay of design-B is $T_c + pT_c$. The energy-delay products (EDP) of both designs are given by

$$EDP_A = C_A \left(V_A^{new}\right)^2 (T_c); \qquad EDP_B = (C_B + C_d)\left(V_B^{new}\right)^2 (T_c + pT_c) \qquad (3)$$

where CA (CB) is the average switched capacitance of design-A (design-B) and Cd is the average switched capacitance of the decoding logic (for determination of critical path activation).

The EDP ratio (after putting the values of $V_A^{new}$ and $V_B^{new}$) is given by:

$$\frac{EDP_A}{EDP_B} = \frac{\left(1 - \frac{S_1}{T_c}\right)^2}{\left(\frac{C_B}{C_A} + \frac{C_d}{C_A}\right)\left(1 - \frac{S_2}{T_c}\right)^2 (1 + p)} = \frac{\left(1 - S_1^{norm}\right)^2}{\left(C_B^{norm} + C_d^{norm}\right)\left(1 - S_2^{norm}\right)^2 (1 + p)} \qquad (4)$$

where $C_B^{norm}$ ($C_d^{norm}$) is the average switched capacitance of design-B (decoder logic) normalized with respect to $C_A$, and $S_1^{norm}$ ($S_2^{norm}$) is the slack of design-A (design-B) normalized with respect to $T_c$.

From the expression shown in equation (4), it is possible to study the conditions under which it may be useful to opt for design-B rather than design-A. It is obvious that design-B can be better than design-A if EDPA/EDPB > 1. Since $(C_B^{norm} + C_d^{norm}) > 1$ (assuming CB ≥ CA due to design modifications) and $(1 + p) > 1$, a necessary condition for design-B to be better than design-A is,

$$S_2^{norm} > S_1^{norm}, \quad \text{i.e., } S_2 > S_1 \qquad (5)$$

Therefore, a larger value of S2 is better for power savings. However, the upper bound of S2 is determined by the constraint S2 + S3 ≤ Tc (equation (2)). Hence, S2 can be maximized by minimizing slack S3. Let us explore the design space for which design-B can be beneficial. For the sake of simplicity, we model the normalized capacitances and performance penalty (p) as follows,

$$C_B^{norm} = 1 + \frac{k}{\left(1 - S_2^{norm}\right)}; \quad C_d^{norm} = N\,C_{d0}^{norm}, \quad S_1^{norm} = 0.05 \quad \text{and,} \quad p = \frac{N}{2^n}$$

where k is a constant, N is the number of input vectors that should be decoded to determine if critical paths are activated, $C_{d0}^{norm}$ is the normalized average switched capacitance of decoding a single input vector, and n is the total number of primary inputs of the circuit. The EDP ratio plotted for different values of N and $S_2^{norm}$ is shown in Fig. 6. From the EDP ratio profile shown in this figure, it is obvious that design-B is beneficial only if N is small (to minimize the switched capacitance of decoding logic). Also, the initial flat portion of the profile indicates that $S_2^{norm}$ should be greater than $S_1^{norm}$. Although the EDP curve increases with $S_2^{norm}$, a large value of $S_2^{norm}$ may increase the switched capacitance of the circuit (i.e. $C_B^{norm}$ if gate sizing is used) and offset the saving in power.

From the analysis presented above, it can be concluded that the power saving in the proposed method mainly comes from the quadratic dependency of power on voltage. Power reduces quadratically while the delay increases only linearly, letting us reduce the EDP.
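To make the trade-off concrete, the following sketch evaluates the EDP ratio of equation (4), i.e. (1 − S1)^2 / ((C_B + C_d)(1 − S2)^2 (1 + p)) with normalized quantities. The capacitance and penalty model used here, C_B = 1 + k/(1 − S2), C_d = N·C_d0 and p = N/2^n, is our reading of the partially garbled model in the extracted text and should be treated as an assumption; the function name and the chosen sample points are hypothetical. The defaults k = 0.2, C_d0 = 0.001 and n = 4 are the values quoted in the caption of Fig. 6.

```python
def edp_ratio(s1_norm: float, s2_norm: float, n_decoded: int, n_inputs: int,
              k: float = 0.2, c_d0_norm: float = 0.001) -> float:
    """EDP_A / EDP_B for normalized slacks; values above 1 favour design-B."""
    # Assumed model: switched capacitance of design-B grows as S2 is pushed up.
    c_b_norm = 1.0 + k / (1.0 - s2_norm)
    # Decoder capacitance grows linearly with the number of decoded vectors.
    c_d_norm = n_decoded * c_d0_norm
    # Fraction of input vectors that trigger a two-cycle operation.
    p = n_decoded / 2 ** n_inputs
    numerator = (1.0 - s1_norm) ** 2
    denominator = (c_b_norm + c_d_norm) * (1.0 - s2_norm) ** 2 * (1.0 + p)
    return numerator / denominator


ratio_large_slack = edp_ratio(0.05, 0.50, n_decoded=1, n_inputs=4)
ratio_equal_slack = edp_ratio(0.05, 0.05, n_decoded=1, n_inputs=4)
ratio_many_vectors = edp_ratio(0.05, 0.50, n_decoded=8, n_inputs=4)

print(f"S2 = 0.50, N = 1: {ratio_large_slack:.3f}")
print(f"S2 = 0.05, N = 1: {ratio_equal_slack:.3f}")
print(f"S2 = 0.50, N = 8: {ratio_many_vectors:.3f}")
```

Under this assumed model the ratio exceeds 1 only when S2 is clearly larger than S1, and it drops as more input vectors have to be decoded, which matches the qualitative conclusions drawn from Fig. 6.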

3. DESIGN METHODOLOGY

Based on the analysis and the guidelines derived above, we describe the details of each step of the design flow (Fig. 3). This is followed by simulation results on a set of benchmark circuits.

3.1 Circuit partitioning and synthesis for critical path isolation

Let us first consider performing an input-based partition of the circuit such that the critical paths are isolated and their activation probability is reduced. To achieve this, we used Shannon expansion based partitioning [7], which partitions a Boolean expression f into disjoint sub-expressions as shown below:

$$f(x_1,\ldots,x_i,\ldots,x_n) = x_i \cdot f(x_1,\ldots,x_i{=}1,\ldots,x_n) + \overline{x_i} \cdot f(x_1,\ldots,x_i{=}0,\ldots,x_n) = x_i \cdot CF_1 + \overline{x_i} \cdot CF_2$$

$$CF_1 = f(x_1,\ldots,x_i{=}1,\ldots,x_n); \quad CF_2 = f(x_1,\ldots,x_i{=}0,\ldots,x_n) \qquad (6)$$

where (x1…xn) are input literals, xi is the control variable, and CF1 and CF2 are called cofactors. If f contains sub-expressions independent of the control variable xi, then we may also have a Shared Cofactor (sCF) (Fig. 7). In this work, we have used Shannon expansion based partitioning mainly due to the following inherent properties: (a) the circuit partitioning is done based on inputs; (b) the activation probability of partitioned logic blocks can be easily reduced by performing multi-level hierarchical expansion; and (c) by properly choosing the control variables, it is possible to isolate the critical paths to a logic block having the least activation probability. In the following paragraphs, we first explain multi-level expansion for reduction of the activation probability of cofactors, followed by the control variable selection strategy for critical path isolation during partitioning.

Fig. 8 Original circuit (inputs x1-x9, outputs f1 and f2)

Fig. 9 Control variable is x4: (a) CF1; (b) CF2

Fig. 10 Control variable is x1: (a) CF1; (b) CF2

Fig. 11 Automated synthesis flow

Input: Original netlist, Area constraint (Amax), delay constraint (Dmax)

1. Read netlist and create graph G, Level = 1
2. Initialize gList = {G}
3. Make expansion decision for the graphs in gList
4. currList = All graphs in gList
5. For each graph Gi in currList
6. Is Gi marked for expansion?
7. Choose a control variable and expand Gi into CF1, CF2, sCF. Add CF1, CF2, and sCF to gList
8. Remove Gi from gList.
9. All graphs of currList traversed?
10. Area < Amax and Delay < Dmax? Level++

Output: gList
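As a minimal sketch of the Shannon expansion in equation (6), the snippet below represents a Boolean function as a Python callable over a tuple of 0/1 inputs, derives the two cofactors for a chosen control variable, and checks that x_i·CF1 + x_i'·CF2 reproduces the original function on every input vector. The helper names and the 3-input example function are hypothetical and are not taken from the paper; shared cofactors (sCF) are not modelled.

```python
from itertools import product
from typing import Callable, Tuple

BoolFn = Callable[[Tuple[int, ...]], int]


def cofactor(f: BoolFn, i: int, value: int) -> BoolFn:
    """Return f with input x_i fixed to `value` (0 or 1)."""
    def fixed_fn(x: Tuple[int, ...]) -> int:
        return f(x[:i] + (value,) + x[i + 1:])
    return fixed_fn


def shannon_expand(f: BoolFn, i: int) -> Tuple[BoolFn, BoolFn, BoolFn]:
    """Return (CF1, CF2, rebuilt) where rebuilt = x_i*CF1 + x_i'*CF2."""
    cf1 = cofactor(f, i, 1)   # active when x_i = 1
    cf2 = cofactor(f, i, 0)   # active when x_i = 0

    def rebuilt(x: Tuple[int, ...]) -> int:
        return (x[i] & cf1(x)) | ((1 - x[i]) & cf2(x))

    return cf1, cf2, rebuilt


# Hypothetical example: f = x0*x1 + x1'*x2, expanded around control variable x1.
def f(x: Tuple[int, ...]) -> int:
    return (x[0] & x[1]) | ((1 - x[1]) & x[2])


cf1, cf2, rebuilt = shannon_expand(f, 1)
equivalent = all(f(x) == rebuilt(x) for x in product((0, 1), repeat=3))
print("expansion matches original function:", equivalent)
```

Only one of the two cofactors determines the output for any given input vector, which is the property the partitioning relies on to predict block activation from the control variable alone.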

In equation (6), the activation probability of each cofactor is 50% (assuming 50% switching probability of inputs). By performing multi-level expansion, the activation probability of the resulting cofactors can be reduced further. For example, a 2nd level expansion of f (equation (7)) results in four cofactors, each with an activation probability of 25%.

$$f(x_1,\ldots,x_i,\ldots,x_n) = x_i x_j \cdot CF_1 + x_i \overline{x_j} \cdot CF_2 + \overline{x_i} x_j \cdot CF_3 + \overline{x_i}\,\overline{x_j} \cdot CF_4 \qquad (7)$$

Control variable selection plays a very important role in achieving the desired goals in Shannon's expansion based partitioning. In [8, 9], the most binate variable is chosen as the control variable to minimize the area overhead. However, this heuristic may not lead to the confinement of the critical paths of the circuit after expansion. For example, consider a multiple-output two-level Boolean logic function, f1 = … and f2 = … . From the circuit realization shown in Fig. 8, it can be observed that f1 is the critical function (or critical output). If n(xi) is the total literal count of xi in f1 and f2, then n(x1)=4, n(x2)=1, n(x3)=2, n(x4)=4, n(x5)=n(x6)=n(x7)=n(x8)=n(x9)=1. Considering the most binate variable as the preferable choice, either x1 or x4 can be picked as the control variable. With x4 as the control variable, the resulting cofactors are shown in Fig. 9. It can be noticed that the critical paths are distributed between the cofactors. However, if x1 is chosen as the control variable, the critical path is confined to f1(CF2) (Fig. 10). Clearly, a strategy is needed to isolate the critical paths and limit them to a particular cofactor. If ai (bi) is the literal count of variable xi in true (complement) form in the critical function (or output), then the following criteria should be fulfilled: (i) the control variable should be present in the critical function (i.e. min(ai, bi) > 0); (ii) the difference of ai and bi should be large to ensure that the paths are isolated to one cofactor; and (iii) max(ai, bi) should be small to minimize the probability of the logic depth of isolated critical paths being reduced by logic optimization. The following metric can be used:

$$M_i = \frac{|a_i - b_i|}{\max(a_i, b_i)} \qquad (8)$$

A literal with the maximum value of Mi ensures that the critical path is isolated to a cofactor. Using this metric, we follow the steps described in [8] for choosing the control variable in our synthesis flow.
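The metric of equation (8), M_i = |a_i − b_i| / max(a_i, b_i), can be sketched as a simple ranking over candidate variables. The snippet below is only an illustration: the literal counts are made up, the function names are hypothetical, and picking the plain maximum of M_i is a simplification, since the actual flow follows the selection steps of [8] and the criteria (i) to (iii) are not applied separately here.

```python
from typing import Dict, Tuple


def isolation_metric(a: int, b: int) -> float:
    """M_i = |a_i - b_i| / max(a_i, b_i); returns 0.0 if the variable is absent."""
    largest = max(a, b)
    if largest == 0:
        return 0.0
    return abs(a - b) / largest


def pick_control_variable(literal_counts: Dict[str, Tuple[int, int]]) -> str:
    """Return the variable with the largest M_i (first one wins on ties)."""
    return max(literal_counts, key=lambda name: isolation_metric(*literal_counts[name]))


# Hypothetical (true-form, complement-form) literal counts in the critical function.
counts = {"x1": (3, 1), "x2": (2, 2), "x3": (1, 1)}
metrics = {name: isolation_metric(a, b) for name, (a, b) in counts.items()}
best = pick_control_variable(counts)

print(metrics)
print("selected control variable:", best)
```

A variable that appears equally often in true and complement form scores 0, reflecting that expanding around it would split the critical paths evenly between the two cofactors.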

To achieve the dual objectives of isolating the critical paths to a cofactor while reducing its activation probability during partitioning and synthesis, we adhere to the following approach: (a) we partition the circuit and determine the cofactor where the critical paths have been isolated (called the critical cofactor); (b) we mark this cofactor (i.e. the critical cofactor) for further expansion to reduce the activation probability of the critical paths. The above-mentioned steps are repeated under a given area and delay constraint. Note that Synopsys Design Compiler [10] has been used for synthesizing the new cofactors. The overall synthesis flow is shown in Fig. 11. A complete example of hierarchical partitioning and synthesis is also illustrated in Fig. 12, where the original circuit is partitioned into four cofactors, CF20, CF32, CF53 and CF63. The critical paths have been isolated to CF53 (which is activated by 3 inputs, i.e. x1x2’x3). Note that, in this example, we do not have the shared cofactor (sCF). Shared cofactors are important in avoiding logic duplication during partitioning. However, they are independent of the control variable. Therefore, our synthesis flow (Fig. 11) automatically chooses the shared cofactor for further expansion (if critical paths are isolated to it).
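The effect of hierarchical expansion on activation probability can be illustrated with the cofactors of the Fig. 12 example. The sketch below assumes independent inputs with a 50% probability of being 1, so a cofactor selected by k control variables is active with probability 1/2^k; the control-variable counts (1 for CF20, 2 for CF32, 3 for CF53 and CF63) are the ones listed with Fig. 12, while the function name is hypothetical.

```python
from fractions import Fraction


def activation_probability(num_control_vars: int) -> Fraction:
    """Probability that a cofactor selected by k control variables is active."""
    return Fraction(1, 2 ** num_control_vars)


# Number of control variables per cofactor, as listed with Fig. 12.
control_vars = {"CF20": 1, "CF32": 2, "CF53": 3, "CF63": 3}
probs = {name: activation_probability(k) for name, k in control_vars.items()}

# The cofactors partition the input space, so their probabilities sum to 1.
total = sum(probs.values())

# CF53 holds the critical paths; each activation costs one extra clock cycle.
expected_cycles = 1 + probs["CF53"]

for name, prob in probs.items():
    print(f"{name}: {prob}")
print("sum of probabilities:", total)
print("expected cycles per operation:", expected_cycles)
```

Each additional expansion level of the critical cofactor halves how often the two-cycle operation is needed, at the cost of more decoding logic and area.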

3.2 Gate Sizing for further isolation

In the previous subsection, we presented a circuit partitioning method to isolate the critical paths to a cofactor with small activation probability. The next step is to size the resulting cofactors individually to (a) further isolate the critical paths and (b) create timing slack between critical and non-critical cofactors to allow lowering of the supply voltage. To achieve this goal, all gates of the critical cofactor are downsized to make the corresponding paths further critical. The gates belonging to the remaining cofactors are selectively upsized to make them more non-critical and increase the slack (S2, as discussed in Section 2.3). An example of the proposed sizing approach after

Fig. 12 Hierarchical expansion and sizing of cofactors (original circuit expanded over LEVEL1-LEVEL3 into cofactors CF10, CF20, CF32, CF42, CF53 and CF63, combined by a MUX network; number of control variables: CF20: 1; CF32: 2; CF53, CF63: 3)

Fig. 13 Results for benchmark sct: (a) path delay distribution after partitioning and sizing (number of paths vs. delay [s], VDD = 1V; critical paths in the new critical CF); (b) cofactor-wise critical path delay distribution under Vt variation (number of occurrences vs. delay [seconds], 1000 simulations, VDD = 1V; Org: critical path delay distribution of original circuit; CF1-CF4: cofactor-wise critical path delay distribution of proposed circuit; one-cycle delay target marked); (c) VDD = 0.7V (one-cycle and two-cycle delay targets marked). Number of control variables: CF1: 4; CF2: 3; CF3, CF4: 2.

TABLE-1
Procedure performSizing()
Input  : target delay (Tc), yield (Y), list of cofactors (gList);
Output : sized netlist;
1.  maxLevel = maximum hierarchy of the cofactors in gList;
2.  run SSTA on Gi ∈ gList;
3.  critCF = cofactor with critical paths at maxLevel hierarchy;
4.  for each cofactor Gi ∈ gList
5.       calculate Gi→muxDelay;
6.  end for
7.  dTarget = αTc - critCF→muxDelay;
8.  downSize(critCF, dTarget, Y);
9.  critDelay = critCF→maxDelay + critCF→muxDelay;
10. for each cofactor Gi ∈ gList
11.      if Gi ≠ critCF
12.          dTarget = critDelay - Tc - Gi→muxDelay;
13.          upSize(Gi, dTarget, Y);
14. end for
15. Add mux’s in Gi ∈ gList to create a complete graph G;
16. return G;


partitioning is shown in Fig. 12. The cofactors with dashed (solid) lines indicate expanded (non-expanded) circuits, and the levels indicate the hierarchy. As shown in the figure, cofactor CF53 is downsized to make it more critical while the other cofactors are upsized to make them less critical. Note that the proposed sizing approach is very different from conventional sizing because, in this case, the critical paths are made slower while the non-critical paths are made faster.

We follow the above-mentioned sizing strategy in a Lagrangian Relaxation (LR) based gate sizing [12], as shown in Table 1. Given a delay target (Tc), it tries to meet the yield requirement with minimum area. The procedure takes gList (i.e., the list of cofactors) and determines the cofactor at the highest level of hierarchy (maxLevel) for downsizing. The target delay (dTarget) for sizing the critical cofactor candidate (i.e., critCF) is computed in Step 7 (with α=1.2, determined empirically to allow minimization of S3 as discussed in Section 2.3). The delay targets of the non-critical cofactors are obtained by subtracting Tc and the multiplexer delays from the overall critical path delay (Step 12). The non-critical cofactor candidates are then upsized while meeting the yield target (Step 13). A detailed description of Table 1 is omitted for brevity.

3.3 Determination of supply voltage

After circuit partitioning and sizing, we obtain a path delay distribution similar to Fig. 2. We may now assign a lower supply voltage to reduce the power dissipation while maintaining robustness. To achieve this, we start from the nominal supply and iteratively reduce it with two stopping criteria: (a) delay violation of any of the non-critical cofactors (one-cycle delay target) for the given yield constraint; and (b) delay violation of the critical cofactor (two-cycle delay bound) for the target yield. A third stopping criterion is the 3Vth limit for reliable super-threshold operation [5]. The new voltages for a set of MCNC benchmarks are shown in Section 3.4.
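The iterative search described in this subsection can be sketched in Python as follows. This is only a minimal sketch under stated assumptions: the function name `lower_supply`, the fixed voltage step, and the two delay callables (stand-ins for a yield-aware timing analysis such as SSTA) are hypothetical and are not part of the original flow.

```python
from typing import Callable

EPS = 1e-12  # tolerance for floating-point comparisons


def lower_supply(
    v_nominal: float,
    vth: float,
    tc: float,
    noncrit_delay: Callable[[float], float],
    crit_delay: Callable[[float], float],
    step: float = 0.05,
) -> float:
    """Return the lowest supply voltage on the step grid that meets all limits.

    noncrit_delay(vdd): worst non-critical cofactor delay at its yield target.
    crit_delay(vdd):    critical cofactor delay at its yield target.
    Both callables are hypothetical stand-ins for a timing analysis.
    """
    vdd = v_nominal
    while True:
        trial = round(vdd - step, 10)
        if trial < 3 * vth - EPS:       # 3Vth limit for super-threshold operation
            break
        if noncrit_delay(trial) > tc:   # (a) one-cycle delay target violated
            break
        if crit_delay(trial) > 2 * tc:  # (b) two-cycle delay bound violated
            break
        vdd = trial                     # the trial voltage is still safe
    return vdd


if __name__ == "__main__":
    # Toy delay model (assumption): delay inversely proportional to VDD, in ps.
    vdd = lower_supply(1.0, 0.1, 300.0, lambda v: 200.0 / v, lambda v: 360.0 / v)
    print(f"selected VDD = {vdd:.2f} V")
```

The loop stops at the last voltage for which none of the three criteria is violated, so whichever constraint trips first determines the selected supply.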

3.4 Simulation results

In the previous sections, we presented a methodology to make the possible delay errors (that may occur under single-cycle operation) predictable and rare (using circuit partitioning and sizing). We also discussed the determination of the new supply voltage. In this section, we present simulation results on a set of MCNC benchmarks to demonstrate the feasibility of this methodology. In particular, we show (a) isolation of critical paths to a cofactor (having low activation probability); and (b) reduction of supply voltage for low power dissipation while maintaining robustness. In the following paragraphs, we present the simulation setup followed by the results and discussion.

For logic optimization in our synthesis flow, we have used Synopsys Design Compiler [10]. As a basis of comparison, the original benchmarks are also optimized for area in Synopsys. The mapping is done to a standard cell library. The circuit delays are computed using SSTA for BPTM 70nm technology. The parametric variations (L, Tox, doping, etc.) have been lumped into threshold voltage variation. The changes in Vth due to inter-die (∆Vtinter) and intra-die (∆Vtintra) process variations are modeled as Gaussian variables with zero mean and standard deviations of 80mV and 40mV, respectively. The total change in transistor Vth is given by the sum of ∆Vtinter and ∆Vtintra. The delay target (Tc) for the sizing procedure is chosen by plotting the area-delay curve of the circuit and selecting the delay at which the slope of the curve is -1. The area and delay constraints for Shannon-based partitioning are kept at 40% and 20% more than the original area and delay, respectively. The yield targets of the original circuit and the cofactors for gate sizing are set to 95%. The yield target of cofactors operating in one cycle (two cycles) after application of the reduced supply is fixed at 95% (100%). For power estimation, the circuits are simulated in Hspice by applying a set of 200 random input patterns with input switching probabilities of 0.2 as well as 0.5. The runtime of the entire methodology is found to be small (6.03s for the largest benchmark, cht, on a SUN Blade 1000 workstation).
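The Vth variation model described above can be illustrated with a short Monte Carlo sketch in Python. The function name `sample_die` and the sampling structure are assumptions made for illustration. The sketch further assumes that the inter-die and intra-die components are independent, in which case the total shift has a standard deviation of sqrt(80^2 + 40^2) ≈ 89.4mV.

```python
import random
import statistics

SIGMA_INTER = 0.080  # V, inter-die standard deviation quoted in the text
SIGMA_INTRA = 0.040  # V, intra-die standard deviation quoted in the text


def sample_die(num_transistors, rng):
    """Return (inter-die shift, list of total Vth shifts) for one die.

    Assumption: the inter-die shift is shared by every transistor on the die,
    while each transistor draws its own independent intra-die shift.
    """
    d_inter = rng.gauss(0.0, SIGMA_INTER)
    totals = [d_inter + rng.gauss(0.0, SIGMA_INTRA)
              for _ in range(num_transistors)]
    return d_inter, totals


if __name__ == "__main__":
    rng = random.Random(0)
    # One transistor per die, 1000 dies.
    shifts = [sample_die(1, rng)[1][0] for _ in range(1000)]
    print(f"std of total Vth shift: {statistics.pstdev(shifts) * 1e3:.1f} mV")
```

Sampling the inter-die component once per die is what makes all paths on a die shift together, whereas the intra-die component spreads the delays of individual paths.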

To illustrate the isolation of critical paths to the critical cofactor, we have plotted the path delay distribution of an example MCNC benchmark circuit (i.e., sct) after partitioning and sizing (Fig. 13(a)). This figure clearly indicates that the critical paths of the re-synthesized design are limited to the critical cofactor. We also present its cofactor-wise critical path delay distribution under process variation (Vth variation, Fig. 13(b)). From this figure, note that (a) CF1 remains critical even under parametric variation while the other cofactors remain non-critical; and (b) there is a delay slack between CF1 and the other cofactors. Also, note that the critical cofactor CF1 is at the 4th hierarchical level (i.e., 4 control variables) to minimize its activation probability. The delay distribution at reduced supply is shown in Fig. 13(c). It shows that CF1 operates in two cycles while the rest of the cofactors operate in a single cycle.
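To see why a deeper hierarchical level lowers the activation probability, consider the following small Python sketch. It assumes independent control inputs and an isolated, non-pipelined block; the function names are hypothetical. With four equiprobable control variables, the critical cofactor is activated with probability 1/16 = 6.25%, which gives an average of 1.0625 cycles per operation.

```python
import math


def activation_probability(literal_probs):
    """Probability that every control literal of a cofactor is true.

    literal_probs holds one probability per control literal and assumes
    that the control inputs are statistically independent.
    """
    return math.prod(literal_probs)


def average_cycles(p_critical):
    """Average cycles per operation when the critical cofactor needs two."""
    return (1.0 - p_critical) * 1 + p_critical * 2


if __name__ == "__main__":
    p = activation_probability([0.5] * 4)  # 4 control variables, as for CF1
    print(p, average_cycles(p))            # prints 0.0625 1.0625
```

Each additional control variable halves the activation probability under these assumptions, at the cost of a deeper multiplexer network.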

In Fig. 14, we show the area, power and new supply voltage for a set of MCNC benchmark circuits. It can be observed from Fig. 14 (a)

Fig. 14 (a) Supply voltage of proposed design; (b) % improvement in power; and, (c) Area overhead. [Figure: plots over the benchmarks cht, sct, pcle, mux, decod, cm150a, x2, alu2 and count: (a) VDD [V]; (b) % improvement in power for input switching probabilities of 0.2 and 0.5; (c) area (x10^3) of the original design and the proposed design.]

Fig. 15 Example of a pipeline design using proposed method. [Figure: pipeline stages cht, mux and cm150a between inputs and outputs, clocked by CLK, with delays labelled 85ps, 80ps and 70ps; D1, D2 and D3 are decoding logic, shown with a freeze signal.]

Fig. 16 Performance penalty for (a) critical cofactor at k=4, (b) different values of k. [Figure: performance penalty (%) plotted (a) as N increases and (b) against k, the # of control variables for the critical cofactor, for N=5 and N=10.]

TABLE-2
Procedure pipelineDesign()
Input  : yield (Y), list of circuits (dList), VDDL; /* VDDL < 1V */
Output : list of re-designed circuits (dList);
1. target delay (Tc) = max(stage delays);
2. for each design Di ∈ dList
3.      gList = performPartitioning(Di, VDDL); /*Fig. 11*/
4.      Di = performSizing(gList, Tc, Y, VDDL); /*Table 1*/
5. end for
6. return dList;
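The control flow of Table 2 can be read as the following Python sketch. It only mirrors the loop structure: the `partition` and `size` callables are hypothetical stand-ins for the partitioning flow of Fig. 11 and the sizing procedure of Table 1, and the stage names and delay values in the usage example are illustrative.

```python
def pipeline_design(stage_delays, y, vddl, partition, size):
    """Re-design every pipeline stage against the slowest stage's delay.

    stage_delays: dict mapping a stage name to its delay.
    partition(design, vddl): hypothetical stand-in for performPartitioning.
    size(g_list, tc, y, vddl): hypothetical stand-in for performSizing.
    """
    tc = max(stage_delays.values())                       # Step 1
    redesigned = {}
    for design in stage_delays:                           # Steps 2-5
        g_list = partition(design, vddl)                  # Step 3
        redesigned[design] = size(g_list, tc, y, vddl)    # Step 4
    return redesigned                                     # Step 6


if __name__ == "__main__":
    # Illustrative stages and delays in ps (assumed values).
    delays = {"s1": 85.0, "s2": 80.0, "s3": 70.0}
    result = pipeline_design(
        delays, 0.95, 0.8,
        partition=lambda design, vddl: [design + "_cf"],
        size=lambda g_list, tc, y, vddl: (g_list, tc),
    )
    print(result)
```

Every stage is sized against the same target delay, namely the delay of the slowest stage, which is the point of Step 1.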
