
Conditional Pre-Charge Techniques for Power-Efficient Dual-Edge Clocking

This work has been supported by SRC Research Grant No. 931.001, California MICROs 01-062,63, and Fujitsu Laboratories of America

Nikola Nedovic

Advanced Computer Systems

Engineering Laboratory

ECE Department, University of

California Davis, CA 95616

+1-530-752-6800

nikola@ece.ucdavis.edu

Marko Aleksic

Advanced Computer Systems

Engineering Laboratory

ECE Department, University of

California Davis, CA 95616

+1-530-752-6800

maleksic@ece.ucdavis.edu

Vojin G. Oklobdzija

Advanced Computer Systems

Engineering Laboratory

ECE Department, University of

California Davis, CA 95616

+1-510-486-8171

vojin@ece.ucdavis.edu

ABSTRACT

A new dual edge-triggered flip-flop that saves power by inhibiting

transitions of the nodes that are not used to change the state is

presented. The proposed flip-flop is 12% faster with 10% lower

Energy-Delay Product for 50% data activity, as compared to the

previously published dual edge-triggered storage elements. This

was confirmed by simulation using 0.18um process, 1.8V power

supply, and clock frequency of 250MHz. This flip-flop is

particularly suitable for low-power applications.

Categories and Subject Descriptors

B.6.1 [Logic Design]: Design Styles – sequential circuits.

General Terms

Performance, Design.

Keywords

Dual edge-triggered flip-flop, clocked storage elements, clocking,

clock distribution, power consumption.

1. INTRODUCTION

Performance improvement of high-end processors is achieved at the cost of exponential growth in power consumption. The power consumption of today’s processors is in the range of tens or hundreds of watts [1, 2], reaching the point where heat removal becomes critical. Due to the dramatic increase in the number of pipeline stages, the large number of storage elements on chip, and continued frequency scaling, the contribution of the clocking subsystem to the total power budget reaches 30-50%. Thus, reducing clock-related power is among the most important tasks for future high-performance designs. On the other hand, the key property of complex digital circuits in low-power applications is power-efficient computation. There is a continuous demand for low-power circuits and low-power clocking strategies.

One approach is to use dual-edge clocking [3, 4]. Dual-edge clocking requires Dual Edge-Triggered Storage Elements (DETSE), capable of capturing data on both the rising and the falling edge of the clock. The main advantage of DETSE is that they operate at half the clock frequency of conventional single-edge clocking while achieving the same data throughput. Consequently, the power consumption of the clock generation and distribution system is roughly halved for the same clock load. In addition, less aggressive clock subsystems can be built, which further reduces power consumption and clock uncertainties.
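The halving follows from the standard expression for dynamic switching power. As a brief sketch, with $C_{clk}$ (total switched clock capacitance) and $V_{DD}$ (supply voltage) introduced here as illustrative symbols that do not appear in the original text:

```latex
P_{clk} = C_{clk}\, V_{DD}^{2}\, f_{clk}
\quad\Longrightarrow\quad
\frac{P_{clk,\mathrm{dual}}}{P_{clk,\mathrm{single}}}
= \frac{C_{clk}\, V_{DD}^{2}\, (f/2)}{C_{clk}\, V_{DD}^{2}\, f}
= \frac{1}{2}
```

The ratio is exactly one half only under the stated assumption of equal clock load and supply voltage; a dual-edge storage element that presents a larger clock load reduces the saving accordingly.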

The most critical obstacle to extensive use of the dual-edge clocking strategy is the difficulty of precisely controlling the arrival of both clock edges. This control is essential to avoid the large timing penalty incurred by clock uncertainties. Even though this requirement imposes additional complexity, it can be satisfied with reasonably low hardware overhead [5]. In addition, the clock uncertainty due to variation of the duty cycle can be partially absorbed by the storage element [6]. Another disadvantage of DETSE, critical for high-performance applications, is their large delay. This is due to the increased complexity of DETSE and their longer and more heavily loaded critical paths.

2. PROPOSED DESIGN

We propose a new dual edge-triggered flip-flop, the Dual-edge Conditional Pre-charge Flip-Flop (DE-CPFF). Its operation is based on creating two narrow transparency windows during which the logic level of the input D can be transferred to the output. This flip-flop is a dual-edge version of the Conditional Pre-charge Flip-Flop (CPFF, [7]).

The schematic of DE-CPFF is shown in Fig. 1, and timing diagrams that describe its operation are given in Fig. 2. The transparency windows are defined by the propagation delay of inv1, inv2 and NAND after the rising edge of the clock, and of NAND, inv3 and inv4 after the falling edge of the clock. Internal node S evaluates (discharges) during these transparency windows if input D=1. Outside of the transparency windows, the path from node S to ground through transistors Mn2, Mn3, Mn4 is off, and the path through either Mp1 or Mp3, Mp4 is ‘on’. Thus, S takes the value of D NAND Q.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

ISLPED’02, August 12-14, 2002, Monterey, California, USA.

Copyright 2002 ACM 1-58113-475-4/02/0008…$5.00.

During the transparency windows, conditional evaluation of node

S takes place, based on the previous level of Q. The evaluation

proceeds as follows:

- If Q was ‘low’ in the previous clock half-cycle, node S was

pre-charged ‘high’. When the transparency window arrives, node

S switches to ‘low’ if D is ‘high’ (one of the paths Mn1-Mn2-

Mn3 and Mn1-Mn2-Mn4 is ‘on’). This sets Q to ‘high’ via

transistor Mp6. If the level of D is ‘low’, node S remains ‘high’

and Q remains ‘low’.

- If Q was ‘high’ in the previous clock half-cycle, the value of S was the inverted input D (Mp7, Mn1 and Mn9). When the transparency window arrives, a ‘high’ level of S causes Q to switch ‘low’ (Mn5, Mn6 and Mn7). Note that a ‘low’ level of S does not change Q, since it was already ‘high’.

[Fig. 1 schematic not reproduced; labeled elements: transistors Mn1-Mn10 and Mp1-Mp7, inverters inv1-inv4, the NAND gate, nodes S, Q and QB, input D, and clock signals CLK, CK and CKD.]

Once node S is ‘low’, it can return to the ‘high’ level only if input D is ‘low’. In other words, node S does not go through a pre-charge-evaluate sequence in each clock cycle. Therefore, the internal power that would be spent on redundant pre-charge in the case D=Q=1 is saved. Consequently, this flip-flop has the feature of conditional pre-charge and statistically reduces power consumption for low input activity.
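The evaluation rules above can be illustrated with a purely logic-level sketch. It assumes that D is stable around each transparency window and it ignores all timing, sizing and the pulse generator; the function names are hypothetical, and the unconditional baseline is an illustrative reference point, not a model of any circuit compared in this paper.

```python
def nand(a: int, b: int) -> int:
    """Two-input NAND on 0/1 values."""
    return 0 if (a and b) else 1


def simulate_de_cpff(d_at_edges, q0=0):
    """Behavioral sketch of DE-CPFF with one list entry per clock edge.

    Returns (q_trace, s_transitions); s_transitions counts every change
    of the internal node S.
    """
    q = q0
    s = nand(d_at_edges[0], q) if d_at_edges else 1
    s_transitions = 0
    q_trace = []
    for d in d_at_edges:
        # Outside the transparency window, S follows D NAND Q.
        s_outside = nand(d, q)
        if s_outside != s:
            s_transitions += 1
            s = s_outside
        # Inside the window, evaluation depends on the previous level of Q.
        if q == 0:
            if d == 1:
                # Pre-charged S discharges, which sets Q high.
                s = 0
                s_transitions += 1
                q = 1
        elif s == 1:
            # Q was high and S (inverted D) is high, so Q switches low.
            q = 0
        q_trace.append(q)
    return q_trace, s_transitions


def unconditional_precharge_transitions(d_at_edges):
    """Hypothetical baseline: S is pre-charged before every window and
    discharged whenever D is high (two transitions per such edge)."""
    return 2 * sum(1 for d in d_at_edges if d == 1)


data = [1, 1, 1, 1, 0, 0, 1, 1]
q_trace, conditional = simulate_de_cpff(data)
print(q_trace, conditional, unconditional_precharge_transitions(data))
```

In this sketch the output follows D at every edge, while node S changes only when the captured value changes; a run of consecutive ones costs a single discharge of S rather than one pre-charge and one discharge per edge.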

However, the conditional pre-charge feature does not come for free. There is a delay penalty if D switches to the low level just before the transparency window, so that the subsequent low-to-high transition of node S occurs within the transparency window. As opposed to most high-performance single edge-triggered flip-flops, this high-to-low transition of D introduces another critical path in the flip-flop (Mp7, Mn10-Mn5-Mn6 or Mp7, Mn10-Mn5-Mn7). Fortunately, transistor sizing can be used to equalize the delay of this path with the delay of the other critical path (Mn1-Mn2-Mn3, Mp6 or Mn1-Mn2-Mn4, Mp6). Thus, this delay penalty is not significant compared to the power saving that can be achieved, as shown later.

[Fig. 2 timing diagram not reproduced; signals shown: CLK, CKD, CK, Q, S, D.]

3. STORAGE ELEMENTS USED FOR COMPARISON

We compare the proposed DETSE to a set of high-performance and low-power dual edge- and single edge-triggered storage elements. We use three DETSEs: the Transmission Gate Latch-Mux (TGLM, [3]), the C2MOS Latch-Mux ([4]) and the Explicit-pulsed Dual Edge-Triggered Static Flip-Flop (ep-DSFF, [8]). We use four Single Edge-Triggered Storage Elements (SETSE): the Transmission-Gate Master-Slave Latch (TGMS, [9]), the Hybrid Latch Flip-Flop (HLFF, [10]), the Semi-Dynamic Flip-Flop (SDFF, [11]) and the Conditional Pre-charge Flip-Flop (CPFF, [7]). The first three are used in recent high-performance microprocessors, and CPFF is the single-edge version of the proposed DETSE.

4. SIMULATIONS

Simulations are performed using 0.18um models by Fujitsu (le=0.18um, wmin=0.36um), with a power supply voltage of 1.8V and a temperature of 25°C. The clock frequency is 250MHz for dual-edge and 500MHz for single-edge triggered storage elements, which keeps the data throughput the same. The simulation testbench is given in Fig. 3.

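[Fig. 3 testbench schematic not reproduced; labeled signals: Din, CLKin, D, CLK, Q, Qb; the numeric annotations 3, 1.5, 6, 33, 7 and 14 appear on the drawing.]

Fig. 1. Dual-Edge Conditional Pre-charge Flip-Flop, DE-CPFF

Fig. 2. DE-CPFF Timing Diagrams

Fig. 3. Simulation Testbench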


Timing metrics of dual-edge storage elements are explained in detail in [12]. A DETSE is characterized by the worse of its overheads in the two (‘high’ and ‘low’) half-cycles of the clock:

$$t_D = \max\left(t_{DC,f} + t_{CQ,r},\; t_{DC,r} + t_{CQ,f}\right)$$

In the above equation, $t_{CQ,r}$ and $t_{CQ,f}$ designate the clock-to-output delay for the rising and falling clock edge, respectively, and $t_{DC,r}$ and $t_{DC,f}$ are the data-to-clock delays for the rising and falling clock edge. The timing metric $t_D$ represents the worst-case time taken away from the clock half-cycle. The times $t_{DC,r}$ and $t_{DC,f}$ that correspond to the minimum of $t_D$ are referred to as the optimal set-up times, $t_{su,r}$ and $t_{su,f}$, respectively.
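As a minimal sketch of how this metric might be evaluated from simulation data, the snippet below computes the delay for a set of measurement points and picks the one with the smallest value. The function name, the tuple layout and all numeric values are hypothetical and are not taken from the paper.

```python
def detse_delay(t_dc_r, t_dc_f, t_cq_r, t_cq_f):
    """Worst half-cycle overhead of a dual edge-triggered storage element (ps)."""
    # Clock-to-output after the rising edge plus data-to-clock before the falling edge.
    overhead_a = t_dc_f + t_cq_r
    # Clock-to-output after the falling edge plus data-to-clock before the rising edge.
    overhead_b = t_dc_r + t_cq_f
    return max(overhead_a, overhead_b)


# Hypothetical sweep: (t_dc_r, t_dc_f, t_cq_r, t_cq_f), all in ps.
# Clock-to-output delay typically grows as data arrives closer to the edge.
sweep = [
    (80, 80, 170, 175),
    (40, 40, 190, 200),
    (10, 10, 260, 270),
]

best = min(sweep, key=lambda m: detse_delay(*m))
print("optimal set-up times (rise, fall):", best[0], best[1])
print("minimum delay:", detse_delay(*best))
```

The data-to-clock values of the minimizing point play the role of the optimal set-up times for the rising and falling edge.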

Single-edge triggered storage elements are characterized by their

minimum data-to-output delay at optimal set-up time ([13]).

The power consumption is measured for different data activity

factors. For DETSE, maximum data activity occurs when data

toggles at each clock edge, i.e. it has the same frequency as the

clock. Note that the power consumption measured at a fixed

frequency corresponds to energy per cycle.

Since delay can always be traded for power consumption, the

primary performance parameter we use is Power-Delay Product,

PDP at fixed clock frequency, which is equivalent to Energy-

Delay Product (EDP). The clock frequency of DETSE should be

half of that of SETSE for a fair comparison, since in this case the

data throughputs are the same.
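The equal-throughput condition and the power-to-energy relation can be checked with a few lines of arithmetic. This is an illustrative sketch: the helper names are hypothetical, and the 144.9 uW figure is the total power listed for DE-CPFF in Table 1.

```python
def data_throughput(clock_hz, edges_per_cycle):
    """Data captures per second for a given clock and number of active edges."""
    return clock_hz * edges_per_cycle


def energy_per_cycle(power_w, clock_hz):
    """Energy consumed in one clock period at a fixed clock frequency (J)."""
    return power_w / clock_hz


dual_edge = data_throughput(250e6, edges_per_cycle=2)
single_edge = data_throughput(500e6, edges_per_cycle=1)
print(dual_edge == single_edge)

# Total power of DE-CPFF from Table 1, converted to energy per 4 ns clock period.
e_cycle = energy_per_cycle(144.9e-6, 250e6)
print(f"{e_cycle * 1e12:.3f} pJ per clock cycle")
```

Because the frequency is fixed within each group, comparing power is the same as comparing energy per cycle, which is why the power-delay product can stand in for the energy-delay product.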

5. RESULTS AND COMPARISONS

Simulation results for 50% data activity are given in Table 1. Parameters $t_{su,r}$ and $t_{su,f}$ represent the optimum setup times for the rising and falling clock edge, respectively, and $t_D$ represents the delay of a storage element. Internal power consumption, $P_i$, includes the power dissipated for the transitions of internal nodes and for charging and discharging the output load. Data power, $P_d$, and clock power, $P_{clk}$, represent the dissipation of the data and clock drivers, respectively. Total power consumption, $P_{tot}$, is the sum of internal, data and clock power. The overall comparison parameter is the Energy-Delay Product, EDP, computed as the product of total power, data switching period (8ns at 50% activity) and delay.
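The EDP definition can be cross-checked against the total power and delay values listed in Table 1. The sketch below is illustrative; the helper name is hypothetical, and the assignment of the two unlabeled rows to TGLM and C2MOS follows the order in which the names are listed in the table and should be treated as an assumption.

```python
DATA_PERIOD_S = 8e-9  # data switching period at 50% activity

# (total power in uW, delay in ps), as listed in Table 1.
table1 = {
    "DE-CPFF": (144.9, 225),
    "ep-DSFF": (147.0, 257),
    "TGLM": (111.6, 322),    # row assignment assumed from listing order
    "C2MOS": (158.3, 268),   # row assignment assumed from listing order
}


def edp_in_1e23_js(p_tot_uw, t_d_ps):
    """Energy-delay product in units of 1e-23 Js."""
    energy_j = (p_tot_uw * 1e-6) * DATA_PERIOD_S
    return energy_j * (t_d_ps * 1e-12) / 1e-23


for name, (p_tot, t_d) in table1.items():
    print(f"{name}: {edp_in_1e23_js(p_tot, t_d):.1f}")
```

To one decimal place, the computed values agree with the EDP column of Table 1 for all four dual edge-triggered elements.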

[Fig. 4 bar chart not reproduced; axis from 0 to 350 (delay in ps); bars for DE-CPFF, ep-DSFF, TGLM, C2MOS, CPFF, HLFF, SDFF and TGMS.]

Fig. 4 presents the delay comparison. It shows that DE-CPFF exhibits a 12% improvement in delay compared to the fastest previously published DETSE (ep-DSFF). Due to the increased complexity of dual edge-triggered storage elements, their delay is typically larger than that of single edge-triggered structures. However, the delay of the proposed flip-flop is comparable to that of the fastest high-performance single edge-triggered designs: 33% worse than SDFF and 20% worse than HLFF (39% and 25% including the duty cycle penalty), as shown in Fig. 4. In terms of EDP, the proposed design is 10% better than previously published DETSEs, as shown in Fig. 5. The same figure shows that the EDP of the proposed design is comparable to that of single edge-triggered designs.
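The quoted percentages can be related to the Table 1 entries with a short calculation. This is a sketch only: the helper name is hypothetical, and the TGLM and C2MOS labels rely on the listing order of the extracted table.

```python
def improvement_pct(baseline, proposed):
    """Relative reduction of `proposed` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - proposed) / baseline


# Delay in ps from Table 1: ep-DSFF (fastest prior DETSE) versus DE-CPFF.
delay_gain = improvement_pct(257, 225)

# EDP in 1e-23 Js from Table 1, DE-CPFF (26.1) versus each prior DETSE.
prior_edp = {"ep-DSFF": 30.2, "TGLM": 28.7, "C2MOS": 33.9}
edp_gain = {name: improvement_pct(edp, 26.1) for name, edp in prior_edp.items()}

print(f"delay gain vs ep-DSFF: {delay_gain:.1f}%")
for name, gain in edp_gain.items():
    print(f"EDP gain vs {name}: {gain:.1f}%")
```

With these inputs the delay gain comes out slightly above 12%, and the smallest EDP gain (against the row read here as TGLM) comes out near 9%, which is close to the quoted figure of 10%; the gains against the other two elements are larger.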

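[Fig. 5 bar chart not reproduced; axis from 0 to 18 (EDP in 10^-23 Js); bars for DE-CPFF, ep-DSFF, TGLM and C2MOS at a clock frequency of 250MHz, and for CPFF, HLFF, SDFF and TGMS at a clock frequency of 500MHz.]

[Fig. 6 chart not reproduced; axis from 0 to 300 (power in uW); input activities 0% {D=0}, 0% {D=1}, 50% and 100%; series for DE-CPFF, ep-DSFF, TGLM and C2MOS.]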

Fig. 6 shows how total power consumption changes with input activity. The conditional pre-charge property of the proposed design is most evident for a quiet input (D=0 or D=1). If we neglect the leakage current, only the circuit that generates the transparency window (NAND, inv1-inv4 in Fig. 1) dissipates power in this case.

Fig. 7 shows the power consumption break-up for 50% data activity. The clock driver dissipates 23% less power when driving DE-CPFF than when driving the single-edge CPFF. This is achieved by halving the clock frequency, while the clock load is increased by a factor of less than two.
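The implied increase in clock load can be estimated from the two numbers given above. The sketch assumes that clock driver power scales as load times frequency at a fixed supply voltage, ignoring short-circuit and internal driver dissipation; the function name is hypothetical.

```python
def clock_load_ratio(power_ratio, freq_ratio):
    """Estimate C_dual / C_single from P proportional to C * f at equal supply voltage."""
    return power_ratio / freq_ratio


# 23% less clock power at half the clock frequency.
ratio = clock_load_ratio(power_ratio=1.0 - 0.23, freq_ratio=0.5)
print(f"estimated clock load increase: {ratio:.2f}x")
```

Under this assumption the clock load of DE-CPFF is roughly 1.5 times that of CPFF, which is consistent with the statement that the load grows by a factor of less than two.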

Fig. 4. Delay Comparison (delay in ps)

Fig. 5. EDP Comparison (EDP in Js*10^-23)

Fig. 6. Power Consumption for Different Input Activities (Clock frequency 250MHz, Power in uW)

[Fig. 7 bar chart not reproduced; axis from 0 to 250 (power in uW); components: internal power, clock power and data power, for DE-CPFF, ep-DSFF, TGLM, C2MOS, CPFF, HLFF, SDFF and TGMS.]

Overall power savings of dual-edge clocking over single-edge clocking at the system level depend on the clock distribution system and on the ratio of the total clock loads of the storage elements used. The switching power consumption of the clock distribution system is the sum of the power dissipated in the clock buffers and the power dissipated in the wires (e.g. the clock grid). We assume that the power dissipated in the clock buffers of the distribution network is proportional to the total clock load of the storage elements. Therefore, the saving in the dissipation for driving the clock buffers is proportional to the ratio of the clock powers (Fig. 7). In contrast, the wire load depends on the wire length, which is assumed to be approximately the same in clock distribution networks for dual edge-triggered and single edge-triggered systems. Thus, the power saving of dual-edge clocking on the wires of the clock distribution network depends mainly on the ratio of the clock frequencies (50% savings for the same data throughput).
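Under the two proportionality assumptions stated above, the overall saving in the clock distribution network can be sketched as a weighted sum. The parameter `wire_fraction` (share of single-edge clock distribution power dissipated in wires) is a hypothetical input that the paper does not quantify, and the function name is illustrative.

```python
def clock_network_saving(wire_fraction, buffer_power_ratio, freq_ratio=0.5):
    """Fractional power saving of dual-edge over single-edge clock distribution.

    wire_fraction      -- share of single-edge distribution power spent in wires
    buffer_power_ratio -- dual-edge to single-edge clock power ratio (from Fig. 7)
    freq_ratio         -- dual-edge to single-edge clock frequency ratio
    """
    buffer_fraction = 1.0 - wire_fraction
    remaining = wire_fraction * freq_ratio + buffer_fraction * buffer_power_ratio
    return 1.0 - remaining


# Clock power ratio of 0.77 corresponds to the 23% reduction for DE-CPFF vs CPFF.
for wire_fraction in (0.0, 0.5, 1.0):
    saving = clock_network_saving(wire_fraction, buffer_power_ratio=0.77)
    print(f"wire share {wire_fraction:.0%}: saving {saving:.1%}")
```

The saving moves between the buffer-dominated limit (23% for this pair of flip-flops) and the wire-dominated limit (50%), depending on how the distribution power is split.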

6. CONCLUSIONS

We presented a new design of a dual edge-triggered storage element that saves power by reducing the internal switching activity as well as the clock frequency. The transitions of the internal node are inhibited when they do not affect the state of the storage element. The simulations were performed in 0.18um technology, using a power supply of 1.8V and a clock frequency of 250MHz. The observed delay and EDP improvements over other dual edge-triggered storage elements are 12% and 10%, respectively. In addition, the performance of the proposed storage element is comparable to that of single edge-triggered storage elements used in recent high-performance processors. The reduced clock power of the proposed storage element allows for substantial savings in the clock distribution network.

7. REFERENCES

[1] P. Hofstee et al, “A 1-GHz Single-Issue 64b PowerPC

Processor”, ISSCC Digest of Technical Papers, p92-93,

February 2000

[2] A. Jain et al, “A 1.2GHz Alpha Microprocessor with

44.8GB/s Chip Pin Bandwidth”, ISSCC Digest of Technical

Papers, p240-241, February 2001

[3] R. P. Llopis, M. Sachdev, “Low power, testable dual edge

triggered flip-flops”, ISLPED Digest of Technical Papers,

p.341-345, 1996.

[4] A. Gago, R. Escano, J. A. Hidalgo, “Reduced

implementation of D-type DET flip-flops”, IEEE Journal of

Solid-State Circuits, vol.28, (no.3), p.400-402, March 1993.

[5] P. E. Gronowski et al, “A 433-MHz 64-b quad-issue RISC

microprocessor”, IEEE Journal of Solid-State Circuits,

vol.31 (no.11), p. 1687-1696, Nov. 1996.

[6] M. Saint-Laurent et al, “Optimal Sequencing Energy

Allocation for CMOS Integrated Systems”, Proceedings of

International Symposium on Quality Electronic Design,

p.94-99, March 2002.

[7] N. Nedovic, V. G. Oklobdzija, “Hybrid Latch Flip-Flop with

Improved Power Efficiency”, Proceedings of the Symposium

on Integrated Circuits and Systems Design, p.211-215,

September 2000.

[8] J. Tschanz et al, “Comparative Delay and Energy of Single

Edge-Triggered & Dual Edge-Triggered Pulsed Flip-Flops

for High Performance Microprocessors”, Proceedings of

ISLPED, p.147-152, August 2001.

[9] G. Gerosa et al, “A 2.2W, 80MHz superscalar RISC

microprocessor”, IEEE Journal of Solid State Circuits, vol.

29, pp. 1440-1452, December 1994.

[10] H. Partovi et al, “Flow-through latch and edge-triggered flip-flop hybrid elements”, ISSCC Digest of Technical Papers, San Francisco, CA, USA, February 1996.

[11] F. Klass, “Semi-Dynamic and Dynamic Flip-Flops with

Embedded Logic,” Symposium on VLSI Circuits, Digest of

Technical Papers, p.108-109, June 1998.

[12] N. Nedovic, M. Aleksic, V. G. Oklobdzija, “Timing

Characterization of Dual-Edge Triggered Flip-Flops”,

Proceedings of International Conference on Computer

Design, p.538-541, September 2001.

[13] V. Stojanovic, V. G. Oklobdzija, “Comparative Analysis of

Master-Slave Latches and Flip-Flops for High-Performance

and Low-Power Systems", IEEE Journal of Solid-State

Circuits, Vol.34, No.4, p.536-548, April 1999.

[14] V. G. Oklobdzija, “High-Performance System Design:

Circuits and Logic”, J. Wiley, July, 1999.

Table 1. Overall Simulation Results

Flip-Flop   tsu,r [ps]   tsu,f [ps]   tD [ps]   Pi [uW]   Pclk [uW]   Pd [uW]   Ptot [uW]   EDP [x10^-23 Js]
DE-CPFF     -62          -57          225       121.8     18.7        4.4       144.9       26.1
ep-DSFF     -65          -69          257       125.2     14.5        7.3       147         30.2
TGLM        120          115          322       83.3      20.3        8.3       111.6       28.7
C2MOS       58           52           268       122.9     27.3        8.1       158.3       33.9

Fig. 7. Power Consumption Break-up (Power in uW)