Analysis and optimization of pausible clocking based GALS design
ABSTRACT Pausible clocking based globally-asynchronous locally-synchronous (GALS) system design has been proven a promising approach to SoCs and NoCs. In this paper, we analyze the throughput reduction and synchronization failures introduced by the widely used pausible clocking scheme, and propose an optimized scheme for higher throughput and more reliable GALS design. The local clock generator is improved to minimize the acknowledge latency, and a novel input port is applied to maximize the safe timing region for the clock tree insertion. Simulation results using the IHP 0.13-µm standard CMOS process demonstrate that up to one-third increase in data throughput and an almost doubled safe timing region for clock tree distribution can be achieved in comparison to the traditional pausible clocking scheme.
Xin Fan, Miloš Krstić, and Eckhard Grass
IHP Microelectronics, Im Technologiepark 25, 15236 Frankfurt (Oder), Germany
{fan, krstic, grass}@ihp-microelectronics.com
I. INTRODUCTION
With the growing complexity of systems-on-chips (SoCs),
traditional synchronous digital circuits become increasingly
difficult to implement. A major challenge is the distribution
of a low skew global clock. The large number of required
buffer cells can lead to 40% of the total power dissipation
and occupy significant silicon area.
By eliminating the global clock, globally-asynchronous
locally-synchronous (GALS) design provides a promising
solution for SoCs. The most straightforward way to build GALS
systems is to insert synchronizer circuits between different
clock domains [1]. A synchronizer normally consists of two
or more cascaded flip-flops, which introduces latency in data
transfer. Another approach is the use of asynchronous FIFOs
[2, 3], which incurs overheads in both area and power. In
recent years, an alternative approach to GALS design, based
mainly on pausible local clocks, has been developed [4, 5, 6,
7, 8, 9, 10, 11, 12].
Communication between asynchronous modules is achieved
using a pair of request-acknowledge handshaking signals,
and the local clocks are paused and stretched, if necessary,
to avoid metastability in data transfer.
Until now, most of the silicon-validated pausible clocking
systems [13, 14, 15, 16] have been designed based on the scheme
proposed and improved in [5, 6, 8]. As a recent example, this
scheme is applied in [17] to implement a dynamic voltage
frequency scaling (DVFS) NoC. Fig. 1 depicts a point-to-point
GALS system based on this well-known scheme and
the waveforms of its handshake signals. Each synchronous
module is surrounded by an asynchronous wrapper, which
mainly consists of a local clock generator and several
asynchronous I/O ports. A four-phase bundled-data protocol
is used and a single latch is deployed in the input port to
load the data from the output port. In [13, 14] the design of
asynchronous wrappers was discussed in detail.
Fig. 1 A GALS system (a) and its handshake signals (b)
In this paper, we focus on the analysis and optimization of
this widely used pausible clocking scheme. Section 2 studies
the acknowledge latency of local clock generators, and
demonstrates its impacts on the system throughput and data
synchronization. Section 3 shows an improved scheme with
minimum latency from the clock generator and maximum
safe timing region for the clock tree insertion. The proposed
scheme is implemented and evaluated using the IHP 0.13-
µm CMOS process in Section 4. Finally, a brief conclusion
is given in Section 5.
II. ANALYSIS OF PAUSIBLE CLOCKING SCHEME
A typical local clock generator used in pausible clocking
schemes is depicted in Fig. 2 [8, 9, 17, 18]. A programmable
delay line is employed to generate the clock signal LClk. An
array of MUTEX elements is used to arbitrate between port
requests Reqx and the request clock signal RClk. If any Reqx
gets acknowledged, LClk will be paused.
Fig. 2 A typical local clock generator (with 2 ports)
978-1-4244-5028-2/09/$25.00 ©2009 IEEE
A. Acknowledge Latency of Clock Generator
For a MUTEX element, at any time only one of two
incoming events Reqx+ and RClk+ is allowed to pass on a
first come first serve basis. A Reqx+ arriving before RClk+
will be acknowledged immediately by the MUTEX, but a
Reqx+ arriving after RClk+ will not be acknowledged until
RClk- happens. If Reqx+ and RClk+ arrive simultaneously,
the MUTEX element will decide randomly which signal
should be acknowledged.
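The arbitration rule above can be sketched in a few lines of Python. This is only an idealised model with zero gate delay: the function name `mutex_ack_time` and its arguments are hypothetical, and the 50/50 choice on a tie is an assumption, since the text only states that the decision is random.

```python
import random

def mutex_ack_time(t_req, t_rclk_rise, t_rclk_fall, rng=random):
    """Time at which Reqx+ is acknowledged within one clock cycle.

    Idealised model (zero gate delay, no metastability resolution time).
    t_req       : arrival time of Reqx+
    t_rclk_rise : arrival time of RClk+
    t_rclk_fall : time of the following RClk-
    """
    if t_req < t_rclk_rise:
        # Request arrives first: acknowledged immediately.
        return t_req
    if t_req > t_rclk_rise:
        # RClk owns the MUTEX until RClk-; a request arriving after
        # RClk- is not delayed at all.
        return max(t_req, t_rclk_fall)
    # Simultaneous arrival: random winner (assumed 50/50 here).
    return t_req if rng.random() < 0.5 else t_rclk_fall

if __name__ == "__main__":
    print(mutex_ack_time(1.0, 5.0, 10.0))  # request before RClk+
    print(mutex_ack_time(6.0, 5.0, 10.0))  # request during the on-phase of RClk
```

The second call shows the case that matters for the rest of the analysis: a request that loses the arbitration waits until RClk- before it is acknowledged.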
For clarity in the following discussion, we define the
request acknowledge window (RAW) in a local clock
generator as the duration in each cycle of LClk when the
port requests can be acknowledged. For the clock generator
shown in Fig. 2, its RAW is the inactive phase of RClk,
which corresponds to the active phase of LClk as shown in
Fig. 3. Assuming a 50% duty cycle of LClk, the duration of
the RAW in this clock generator can be deduced as follows:
$$t_{RAW} = t_{RClk=0} \approx t_{LClk=1} \approx T_{LClk}/2. \qquad (1)$$
Fig. 3 Request acknowledged window
Any Reqx+ occurring outside of the RAW will lead to
increased acknowledge latency. The worst situation happens
if Reqx+ arrives concurrently with RClk+ and RClk is
acknowledged by the MUTEX. Then Reqx can’t be granted
until RClk goes low. Therefore, based on the RAW, we can
further derive the maximum acknowledge latency caused by
the above local clock generator in equation (2). In the
following, we will analyze the impacts of the acknowledge
latency on system throughput and data synchronization.
$$\max(Latency_{Ack}) = t_{RClk=1} = T_{LClk} - t_{RAW} \approx T_{LClk}/2. \qquad (2)$$
B. Throughput Reduction
For simplicity, a point to point GALS communication, as
shown in Fig. 1, is taken as an example. We firstly discuss
the data transfer from a demand-type output (D-OUT) port
to a poll-type input (P-IN) port, and similar analysis is then
applied to the other three communication channels used in
point to point GALS systems [8, 13].
1) D-OUT Port to P-IN Port Channel
For the receiver equipped with a P-IN port, the local clock
LClkRx will be paused after ReqRx+ occurs [8]. ReqRx will be
asserted after a ReqP+ is detected, which is generated by the
output port in the transmitter running at an independent
clock. Therefore, without loss of generality, the arrival time
of ReqP+, and then the arrival time of ReqRx+, can be
modelled as a uniformly distributed random variable within
a period of LClkRx. Since tRAW = TLCLKRx/2 in the above clock
generator, there is 50% probability that a ReqRx+ is extended
to be acknowledged in the next RAW. Moreover, since the
data is sampled in the receiver at the next rising edge of
LClkRx, DataRxS will be delayed for one cycle of LClkRx. As
an example, the latencies of acknowledge signal AckRx and
sampled data DataRxS are depicted in Fig. 4.
Fig. 4 Latencies in AckRx and sampled data DataRxS
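The 50% figure can be checked with a small exact calculation. The sketch below is an idealisation, not the authors' simulation setup: arrival phases are sampled on a uniform grid over one period of LClkRx, phase 0 is taken to be LClkRx+, gate delays are ignored, and the names `ack_extension` and `extension_stats` are hypothetical.

```python
from fractions import Fraction

def ack_extension(phase, raw_fraction=Fraction(1, 2)):
    """Extra wait, as a fraction of T_LClkRx, for ReqRx+ arriving at `phase`.

    phase 0 is LClkRx+; the RAW covers [0, raw_fraction) of the cycle.
    A request outside the RAW waits until the next LClkRx+ (RClk-).
    """
    if phase < raw_fraction:
        return Fraction(0)
    return 1 - phase

def extension_stats(steps=1000, raw_fraction=Fraction(1, 2)):
    """Probability of an extended acknowledge and the worst-case extension."""
    phases = [Fraction(k, steps) for k in range(steps)]
    ext = [ack_extension(p, raw_fraction) for p in phases]
    p_extended = Fraction(sum(1 for e in ext if e > 0), steps)
    return p_extended, max(ext)

if __name__ == "__main__":
    p, worst = extension_stats()
    print(p, worst)  # both come out as 1/2 for the traditional generator
```

With `raw_fraction = 1/2` the probability of an extended acknowledge and the worst-case extension both evaluate to one half, matching tRAW = TLClkRx/2 and equation (2).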
For the transmitter equipped with a D-OUT port, its local
clock LClkTx is paused before ReqP is asserted by the output
port, and LClkTx will not be released until ReqP gets
acknowledged [8]. Since ReqP will not be acknowledged by
the input port until ReqRx+ is acknowledged by the clock
generator on the receiver side, there is a latency of at most
TLClkRx/2 in acknowledging ReqP as well. Consequently, the
latency in the receiver is propagated into the transmitter. If
the period of LClkRx is much longer than that of LClkTx, this
latency will result in a multi-cycle suspension in LClkTx.
Because the data is processed synchronously to LClkTx in the
transmitter, the suspension in LClkTx will eventually result in
a delay in data transfer.
For instance, considering a D-OUT port to P-IN port
channel, where the periods of LClkTx and LClkRx satisfy the
following condition:
$$(N-1) \cdot T_{LClkTx} = \frac{3}{2} \cdot T_{LClkRx}, \qquad (3)$$
and the request in the transmitter, ReqTx, is asserted every N
cycles of LClkTx, as shown in equation (4):
$$T_{ReqTx} = N \cdot T_{LClkTx}. \qquad (4)$$
After ReqTx is asserted in the transmitter, a ReqP+ will be
generated by the output port controller. If a ReqP+ arrives in
the receiver in the on-phase of RClkRx, it will not be
acknowledged until RClkRx turns low. During that period
LClkTx is paused and stretched in its off-phase. After RClkRx-
happens, which corresponds to LClkRx+ occurring, ReqP will
be acknowledged and then LClkTx will be released. As a
result, the next rising edges of both LClkTx and LClkRx are
automatically synchronized to occur at almost the same time.
After LClkRx and LClkTx are synchronized, ReqP will be
asserted again after (N-1) cycles of LClkTx in the transmitter.
According to condition (3), RClkRx will also be asserted after
one and a half cycles of LClkRx in the receiver. Therefore,
the next ReqP+ and RClkRx+ will occur simultaneously again,
and the extension in ReqP+ and the suspension in LClkTx
will repeat as well.
The above analysis illustrates that, for systems satisfying (3)
and (4), once a ReqP+ happens to arrive in the active
phase of RClkRx, all following ReqP+ will be synchronized
to occur at the same time as RClkRx+. Consequently, each
ReqP will be extended by max(LatencyAck) = TLClkRx/2 before
being acknowledged, and LClkTx will be periodically suspended
for TLClkRx/2 as well. This suspension in LClkTx will cause a
delay in data transfer. As an example, Fig. 5 illustrates a
waveform fragment of data transfer under this circumstance,
where N = 7.
Fig. 5 Extension in ReqP and suspension in LClkTx
Every time LClkTx is suspended, its inactive phase will be
stretched for a period of (TLClkRx/2 – TLClkTx). Based on
conditions (3) and (4), the throughput reduction RTx caused
by the suspension of LClkTx is deduced in equation (5). We
see that the exact percentage is determined by the value of N.
As N increases, RTx approaches
1/3, as shown in equation (6). This means that a reduction of up to
one-third in data throughput could be introduced by the
acknowledge latency.
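$$R_{Tx} = \left(\tfrac{1}{2} \cdot T_{LClkRx} - T_{LClkTx}\right) / T_{ReqTx} = \frac{N-4}{3 \cdot N}. \qquad (5)$$
$$\lim_{N \to +\infty} R_{Tx} = \lim_{N \to +\infty} \frac{N-4}{3 \cdot N} = \frac{1}{3}. \qquad (6)$$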
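The step from conditions (3) and (4) to the closed form in (5) can be verified with exact rational arithmetic. The following is a minimal sketch; the function name `throughput_reduction` is hypothetical, and TLClkTx is normalised to 1 because it cancels out of the ratio.

```python
from fractions import Fraction

def throughput_reduction(n, t_tx=Fraction(1)):
    """R_Tx for a D-OUT to P-IN channel satisfying (3) and (4)."""
    t_rx = Fraction(2, 3) * (n - 1) * t_tx   # condition (3) solved for T_LClkRx
    t_req = n * t_tx                         # equation (4)
    stretch = t_rx / 2 - t_tx                # stretch of LClkTx per request
    return stretch / t_req

if __name__ == "__main__":
    for n in (7, 16, 100, 10_000):
        r = throughput_reduction(n)
        # The value derived from (3) and (4) equals the closed form in (5).
        assert r == Fraction(n - 4, 3 * n)
        print(n, r, float(r))
```

For N = 7, the case drawn in Fig. 5, the reduction is 1/7, and the value grows towards the limit of 1/3 in (6) as N increases.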
2) Other Point to Point Channels
A similar analysis can be easily applied on the other three
point to point communication channels. In Tab. 1 we present
the impacts of acknowledge latency on handshake signals
and local clocks for the four channels. It can be seen that the
only exception occurs when both input port and output port
are of demand type. No matter whether there is data ready to
be transferred or not, the clocks on both the transmitter side
and the receiver side will be paused as soon as the ports get
enabled. Although there is no extension in the handshake
signals or suspension in the clock signals caused by the
acknowledge latency of local clock generators, this channel
is prone to unnecessary long suspensions in both LClkRx and
LClkTx, and a huge drop in data throughput could happen.
Careful design is required for applying this type of channel
to reduce system power consumption.
Tab. 1 Impacts of acknowledge latency
Channel Type     Extended Signals   Suspended Clock   Maximum Suspension
D-OUT to P-IN    ReqP+              LClkTx            TLClkRx/2
P-OUT to D-IN    AckP+              LClkRx            TLClkTx/2
P-OUT to P-IN    ReqP+, AckP+       LClkRx            TLClkTx/2
D-OUT to D-IN    NO                 NO                0
C. Synchronization Failure
A significant benefit from GALS design is to simplify the
global clock distribution by a set of independent local clock
networks. For pausible clocking schemes, however, a crucial
issue is the synchronization failure caused by the local clock
tree insertion delay in receivers. Fig. 6 depicts a failure case
occurring in the traditional scheme (Fig. 1). As the clock
tree insertion delay is independent of the handshake signals’
propagation delay, LClkRxDly+ can arrive at the sampling
flip-flop FF at the same time as data is loaded into the input
port latch L. Metastability then occurs in FF.
Fig. 6 A synchronization failure case
1) ΔLClkRx < TLClkRx
Data synchronization issues in pausible clocking schemes
were first discussed in [7]. The author suggested integrating
a clock buffer network into the local ring oscillator, and
proposed a pipelined interface to hide the control overhead.
This method is only suitable for pipelined systems. Recent
work in [12, 19] reveals that, for clock delays satisfying
ΔLClkRx < TLClkRx, there are two timing regions in each cycle of
LClkRx, as shown for example in Fig. 7, where a negligible
synchronization failure probability can be expected [12].
In Fig. 7, Cycle 1 illustrates the situation in which the data is
safely sampled by FF before L becomes transparent. It
contributes the safe timing region S1 of ΔLClkRx as follows:
$$0 \le \Delta_{LClkRx} < d_{NOR} + d^{0}_{MUTEX} + d_{Port} - t_{hold}, \qquad (7)$$
where $d^{0}_{MUTEX}$ and dPort denote the MUTEX delay time from
ReqRx+ to AckRx+ without metastability and the asynchronous
port delay from AckRx+ to AckP+. On the contrary, Cycle 2
represents another situation in which data is safely sampled after
it is latched in L. We draw the pessimistic case that ReqRx+
happens concurrently with RClk+ and metastability occurs
in the MUTEX. This leads to the safe region S2:
$$T_{LClkRx}/2 + (d_{NOR} + d^{0}_{MUTEX} + \Delta d_{MUTEX} + d_{Port} + d_{Latch} + t_{setup}) \le \Delta_{LClkRx} < T_{LClkRx}, \qquad (8)$$
where ΔdMUTEX denotes the additional delay of the MUTEX
to resolve metastability, and dLatch is the delay of L from
asserting gate enable (AckP+) to data being stable.
Fig. 7 Safe regions for ΔLClkRx<TLClkRx
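To make the two regions concrete, the bounds can be evaluated numerically. The sketch below reads inequality (7) as 0 ≤ ΔLClkRx < dNOR + d0MUTEX + dPort − thold and inequality (8) as TLClkRx/2 + dNOR + d0MUTEX + ΔdMUTEX + dPort + dLatch + tsetup ≤ ΔLClkRx < TLClkRx. The function name and all delay values (in ns) are hypothetical and chosen only for illustration.

```python
def safe_regions(t_clk, d_nor, d_mutex0, dd_mutex, d_port, d_latch,
                 t_setup, t_hold):
    """Return (S1, S2) as (start, end) intervals of the clock tree delay.

    S2 is None when the MUTEX resolution and latch delays consume the
    whole second half of the cycle.
    """
    s1 = (0.0, d_nor + d_mutex0 + d_port - t_hold)
    s2_start = (t_clk / 2 + d_nor + d_mutex0 + dd_mutex
                + d_port + d_latch + t_setup)
    s2 = (s2_start, t_clk) if s2_start < t_clk else None
    return s1, s2

def width(region):
    """Width of an interval, 0 for an empty region."""
    return 0.0 if region is None else region[1] - region[0]

if __name__ == "__main__":
    # Illustrative delays in ns; these are assumptions, not measured values.
    delays = dict(d_nor=0.05, d_mutex0=0.15, dd_mutex=1.05, d_port=0.2,
                  d_latch=0.1, t_setup=0.05, t_hold=0.05)
    for t_clk in (2.0, 10.0):
        s1, s2 = safe_regions(t_clk, **delays)
        print(t_clk, s1, s2, width(s1) + width(s2))
```

With these assumed delays, a 2 ns clock leaves only S1, while a 10 ns clock adds an S2 region; in both cases the total safe width stays below half a period.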
Inequality (8) shows that the width of S2, WS2, depends on
the period of LClkRx. We analyze WS2 in two typical cases:
a) If $T_{LClkRx}/2 \approx d_{NOR} + d^{0}_{MUTEX} + \Delta d_{MUTEX} + d_{Port} + d_{Latch} + t_{setup}$, then
$W_{S2} \approx 0$. It means that the MUTEX needs to consume half of
a clock cycle to resolve metastability. As a result, only the
safe region S1 is valid, whose width is dominated by the
gate delays shown in (7).
b) If $T_{LClkRx}/2 \gg d_{NOR} + d^{0}_{MUTEX} + \Delta d_{MUTEX} + d_{Port} + d_{Latch} + t_{setup}$, then
$W_{S2} \approx T_{LClkRx}/2$. With the increase of TLClkRx, WS2 is widened,
but the hazard region is also extended. The ratio of WS2 to
TLClkRx reaches at most 50%.
Therefore, we see that the width of the safe regions falls
inside the range of $W_{S1} < W_{S1} + W_{S2} < T_{LClkRx}/2$. For small TLClkRx, there
is only a rather narrow region S1 in which to insert the clock tree. Even for
large TLClkRx, only half of a clock period is allowed.
2) ΔLClkRx ≥ TLClkRx
It should be noted that the safe regions within each
cycle of LClkRx, as discussed above, are always aligned
with LClkRx+. A stretch in LClkRx will lead to a delay in the
safe regions of the next cycle. If the clock tree delay meets
ΔLClkRx ≥ TLClkRx, this delay in safe regions also needs to be
considered for data synchronization. Take Fig. 8 for instance,
where $\Delta_{LClkRx} \approx 1.8\,T_{LClkRx}$. During Cycle 1, the rising edge of
the delayed clock LClkRxDly falls in the safe region of LClkRx,
and data is sampled correctly by FF. Then a stretching on
LClkRx happens, and the safe regions in Cycle 2 are delayed.
But there is another LClkRxDly+ scheduled in the clock tree
before the stretched clock, which arrives at FF without any
delay. This eventually leads to a sampling conflict.
Fig. 8 Synchronization failure when ΔLCLKRx=1.8TLCLKRx
Below, the stretching on LClkRx is analyzed according
to the type of input port used on the receiver side:
a) P-IN port
In this situation, a maximum TLClkTx/2 suspension on each
cycle of LClkRx could be introduced by the acknowledge
latency as shown in Tab. 1. So the stretching on LClkRx, and
hence the delay of safe regions, is up to (TLClkTx/2-TLClkRx).
Since TLClkTx is independent of TLClkRx, this delay could
be long enough to mismatch the safe regions of successive
cycles of LClkRx, as illustrated in Fig. 9. There is then no
common safe region for the clock tree insertion. Moreover,
if $\Delta_{LClkRx} > 2\,T_{LClkRx}$, more than one cycle of LClkRx could be
stretched within the clock tree delay, and an accumulated
delay in safe regions should be considered.
Fig. 9 Safe region mismatch due to the clock stretching
b) D-IN port
In this case, LClkRx is paused and stretched by the input
port controller until ReqP- is triggered by the output port
controller. The stretching on LClkRx as well as the delay in
safe regions is unpredictable. That means no common safe
region exists for the multi-cycle clock delay.
Now we can conclude that, for the clock tree insertion
delay exceeding one clock period, the uncertainty on clock
stretching must be taken into account, and no matter what
type of input port is utilized, there is no safe region in the
traditional scheme. In fact, in most of the reported pausible
clocking systems, the local clock trees were deliberately
distributed to satisfy ΔLClkRx < TLClkRx [13, 14, 15, 16].
For a multiple cycle clock tree delay, an asynchronous
FIFO was suggested in [20] to synchronize the input data
with the delayed clock, which leads to increased latency in
the datapath and additional overheads in area and power. An
interface circuit using partial handshake signals was shown
in [11] for high-speed systems with large clock delay, while
there is an unknown nonzero probability of failure in the
circuits. For the design of GALS systems insensitive to the
clock tree delay, a synchronizing scheme based on locally
delayed latching (LDL) was presented in [12, 19]. Since the
clocks can’t be paused in the LDL interface, it introduces
additional timing constraints on both the asynchronous input
port controller and the combinational logic following the
sampling register FF, which limits its application. Hence,
more stable and efficient synchronizing circuits are required
for inserting local clock trees with multi-cycle delay in the
pausible clocking based GALS systems.
III. OPTIMIZATION OF PAUSIBLE CLOCKING SCHEME
In this section, the pausible clocking scheme is optimized
in two respects. The local clock generator is first improved
to minimize the acknowledge latency, and then a novel input
port, including the data latching mechanism and the port
controller, is suggested to maximize the safe region for the
clock tree distribution.
A. Optimized Local Clock Generator
Behind the acknowledge latency is the fact that in Fig. 1
the local clocks on both the transmitter side and the receiver
side need to be paused for safe data transfer. To avoid the
acknowledge latency, we can deploy an asynchronous FIFO
as in [9] to decouple the local clocks, at the cost of overheads
in area and power. A simpler solution, however, is
to widen the RAW of the clock generator as shown in Fig. 10.
There are two delay lines, the programmable delay line D0
followed by the fixed delay line D1, used in the local ring
oscillator. The delay lengths of D0 and D1 are as below:
$$d_{D0} = T_{LClk}/2 - (d_{D1} + d_{C\text{-}ELE} + d_{NOR}), \qquad (9)$$
$$d_{D1} = d_{AND1} + (d^{0}_{MUTEX} + \Delta d_{MUTEX}) + d_{AND0}. \qquad (10)$$
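The sizing rule can be sketched as follows, reading (10) as dD1 = dAND0 + dMUTEX + dAND1, with dMUTEX the full resolution time, and (9) as dD0 = TLClk/2 − (dD1 + dC-ELE + dNOR). The gate delays in the example (in ps) are assumptions for illustration only, and the function names are hypothetical.

```python
def size_delay_lines(t_lclk, d_and0, d_and1, d_mutex, d_c_ele, d_nor):
    """Return (d_D0, d_D1) for the ring oscillator with two delay lines."""
    d_d1 = d_and0 + d_mutex + d_and1                 # equation (10)
    d_d0 = t_lclk / 2 - (d_d1 + d_c_ele + d_nor)     # equation (9)
    if d_d0 < 0:
        raise ValueError("clock period too short for the fixed delay D1")
    return d_d0, d_d1

def half_periods(d_d0, d_d1, d_and0, d_and1, d_mutex, d_c_ele, d_nor):
    """Delays of the LClk+ -> LClk- path and the LClk- -> LClk+ path."""
    high = d_nor + d_d0 + d_d1 + d_c_ele
    low = d_nor + d_d0 + d_and1 + d_mutex + d_and0 + d_c_ele
    return high, low

if __name__ == "__main__":
    # Assumed gate delays in ps for a 10 ns clock.
    gates = dict(d_and0=150, d_and1=150, d_mutex=1200, d_c_ele=100, d_nor=50)
    d_d0, d_d1 = size_delay_lines(10_000, **gates)
    print(d_d0, d_d1, half_periods(d_d0, d_d1, **gates))
```

Because D1 stands in for the AND and MUTEX delays, the two half-period paths come out equal, which is what keeps LClk at a 50% duty cycle.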
Fig. 10 An optimized local clock generator (with 2 ports)
The request clock RClk is now generated by an AND
operation between LClkB, being the inverted signal of LClk,
and L0, being the output signal of D0. It is asserted after
both LClkB and L0 are high and is de-asserted as soon as
LClkB turns low. The on-phase period of RClk in each cycle
of LClk is the sum of the delays of the following gates:
MUTEX -> AND0 -> C-ELEMENT -> NOR -> AND1.
If such a delay is shorter than the half period of LClk, the
RAW in this clock generator will be wider than that in Fig. 2.
For instance, if the period of the clock LClk is 10ns and the
summation of above delays is 1.5ns, the RAW in Fig. 10 is
8.5ns, while it is only 5ns in Fig. 2. Assuming a uniform
distribution of the arrival time of ReqRx+ in each cycle of
LClkRx, the probability drops from 50% to 15% for a ReqRx
to introduce one cycle latency in the receiver. Fig. 11
depicts a comparison in RAW, AckRx latency and DataRxS
latency between the two clock generators.
Fig. 11 Comparison in RAW, Ack and DataRxS
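The numbers in this example follow directly from the two RAW definitions. A minimal sketch, assuming a uniformly distributed arrival phase and ignoring gate delays outside the on-phase of RClk (function names are hypothetical):

```python
def raw_traditional(t_lclk):
    """RAW of the generator in Fig. 2: the active half of LClk."""
    return t_lclk / 2

def raw_optimized(t_lclk, t_rclk_on):
    """RAW of the generator in Fig. 10: the whole cycle minus the RClk on-phase."""
    return t_lclk - t_rclk_on

def p_one_cycle_latency(t_lclk, raw):
    """Probability that a uniformly arriving ReqRx+ misses the RAW."""
    return 1 - raw / t_lclk

if __name__ == "__main__":
    t_lclk, t_on = 10.0, 1.5   # ns, values from the example in the text
    for name, raw in (("traditional", raw_traditional(t_lclk)),
                      ("optimized", raw_optimized(t_lclk, t_on))):
        print(name, raw, round(p_one_cycle_latency(t_lclk, raw), 3))
```

For the 10 ns clock this reproduces the RAW values of 5 ns and 8.5 ns and the drop in probability from 50% to 15%.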
The fixed delay line D1 is employed in Fig. 10 to keep
LClk at a 50% duty cycle. The delay from LClk+ to LClk-
is $d_{NOR} + d_{D0} + d_{D1} + d_{C\text{-}ELE}$, and the delay from LClk- to LClk+
is $d_{NOR} + d_{D0} + d_{AND1} + (d^{0}_{MUTEX} + \Delta d_{MUTEX}) + d_{AND0} + d_{C\text{-}ELE}$. Since the
delay time of D1 is configured to match the total delay of
AND0, AND1 and the MUTEX as shown in (10), both of the
delay paths are balanced.
It is well-known that there is no upper bound on the
resolution time of the MUTEX elements [21]. A practical
solution is to estimate the resolution time based on the mean
time between failures (MTBF) according to (11). From [1,
19], 40 FO4 inverter delays are sufficient for metastability
resolution, i.e., for an MTBF of 10,000 years, which is long enough
for normal applications.
$$MTBF = \frac{e^{d_{MUTEX}/\tau}}{W \cdot F_C \cdot F_D}, \quad \text{where: } \tau = d_{FO4},\; W = 2 \cdot d_{FO4},\; F_C = F_D = 1/(100 \cdot d_{FO4}). \qquad (11)$$
Take the IHP 0.13-µm CMOS process as an example,
where $d_{FO4} \approx 30\,ps$. The resolution time is estimated to be
around $d_{MUTEX} = d^{0}_{MUTEX} + \Delta d_{MUTEX} \approx 40 \cdot d_{FO4} \approx 1.2\,ns$. Considering the
delays of AND0 and AND1 in (10), we can fix the delay
length of D1 at 1.5ns, which equals about 50dFO4. Based
on the delay length of D1, the active phase of RClk is shown
in equation (12), and furthermore, the duration of the RAW
and the maximum acknowledge latency in this optimized
clock generator are deduced as shown in equation (13). It
reveals that the RAW is determined by the period of LClk. If
TLClk>100dFO4, which typically represents the shortest clock
cycle for standard-cell based SoCs [19], the optimized local
clock generator provides a wider RAW than the traditional
one shown in Fig. 2.
$$t_{RClk=1} \approx d_{D1} + d_{C\text{-}ELE} + d_{NOR} \approx d_{D1}, \qquad (12)$$
$$t_{RAW} = t_{RClk=0} = T_{LClk} - t_{RClk=1} \approx T_{LClk} - d_{D1}, \quad \max(Latency_{Ack}) = t_{RClk=1} \approx d_{D1}. \qquad (13)$$
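The 100dFO4 threshold follows from comparing the two RAW expressions. As a short worked step, using only the approximations already stated ($d_{D1} \approx 50\,d_{FO4}$, $d_{FO4} \approx 30\,ps$) and with the superscripts opt and trad introduced here merely to distinguish the two generators:

$$t^{opt}_{RAW} > t^{trad}_{RAW} \;\Longleftrightarrow\; T_{LClk} - d_{D1} > T_{LClk}/2 \;\Longleftrightarrow\; T_{LClk} > 2\,d_{D1} \approx 100\,d_{FO4} \approx 3\,ns.$$

Under these assumptions, any local clock slower than roughly 333 MHz in this process benefits from the wider RAW, and the maximum acknowledge latency stays fixed at about $d_{D1} \approx 1.5\,ns$ regardless of the clock period.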
B. Optimized Input Port
In this section, a double latching mechanism is applied to
widen the safe region, and the port controller is improved to
minimize the uncertainty on clock stretching.
1) Double Latching Mechanism
To widen the safe regions for the clock tree delay meeting
ΔLCLKRx<TLCLKRx, a double latching mechanism, which is
based on the optimized clock generator, is proposed in Fig.
12. The first stage of latch L1 loads the data from the
transmitter, and then the second stage of latch L2 feeds the
data into the receiver. Since L1 and L2 are enabled by the
acknowledge signals of the MUTEX, there is only one latch
transparent at any time. Therefore, data is transferred by two
mutually exclusive coupling latches in this scheme, instead
of the single latch L in Fig. 1.
Fig. 12 Double latching mechanism
During the off-phase of RClk, RClkGrant remains low, and
DataRx is latched in L2 stably. Any LClkRxDly+ arriving at FF
in the inactive phase of RClk can sample DataRx safely. If
RClk turns high, RClkGrant+ is triggered, and L2 will get
enabled to load Data*Rx. Once ReqRx+ occurs simultaneously
with RClk+, RClkGrant will be asserted by the MUTEX in a
random resolution time. Consequently, any LClkRxDly+ falling
in the on-phase of RClk could conflict with loading Data*Rx
in L2. Therefore, the safe timing region for the clock tree
distribution in this double latching mechanism is the
off-phase period of RClk as shown in (14), which is exactly the
same as the RAW in the optimized clock generator:
$$W_S = t_{RAW} = t_{RClk=0} \approx T_{LClk} - d_{D1}. \qquad (14)$$
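As a rough illustration of how this compares with the single-latch scheme, the sketch below divides the safe width in (14) by TLClkRx/2, the upper bound on the total safe width derived for the traditional scheme in Section II. Because that bound is never actually reached, the ratio is a conservative estimate of the gain. The function names are hypothetical and the value dD1 = 1.5 ns is the one chosen above.

```python
def safe_width_optimized(t_lclk, d_d1):
    """Safe region width of the double latching scheme, equation (14)."""
    return t_lclk - d_d1

def gain_over_traditional_bound(t_lclk, d_d1):
    """Ratio of (14) to the T/2 upper bound of the traditional scheme."""
    return safe_width_optimized(t_lclk, d_d1) / (t_lclk / 2)

if __name__ == "__main__":
    d_d1 = 1.5  # ns
    for t_lclk in (5.0, 10.0, 20.0, 50.0):
        print(t_lclk, safe_width_optimized(t_lclk, d_d1),
              round(gain_over_traditional_bound(t_lclk, d_d1), 2))
```

The ratio approaches 2 as the clock period grows relative to dD1, which is consistent with the almost doubled safe region stated in the abstract.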