
# Asynchronous Design -- Part 1: Overview and Recent Advances

Authors:

## Abstract

Editor’s notes: An asynchronous design paradigm is capable of addressing the impact of increased process variability, power and thermal bottlenecks, high fault rates, aging, and scalability issues prevalent in emerging densely packed integrated circuits. The first part of the two-part article on asynchronous design presents a chronicle of past and recent commercial advances, as well as technical foundations, and highlights the enabling role of asynchronous design in two application areas: GALS systems and networks-on-chip.
Steven M. Nowick
Columbia University
Montek Singh
University of North Carolina at Chapel Hill
There has been a continuous growth of interest
in asynchronous design over the last two decades,
as engineers grapple with a host of challenging
trends in the current late-Moore era. As highlighted
in the International Technology Roadmap for Semi-
conductors (ITRS), these include dealing with the
impact of increased variability, power and thermal
bottlenecks, high fault rates (including due to soft
errors), aging, and scalability issues, as individual
chips head to the multibillion-transistor range and
many-core architectures are targeted.
While the synchronous, i.e., centralized clock,
asynchronous design (or the use of a hybrid mix
of asynchronous and synchronous components)
provides the potential for “object-oriented” distributed
hardware systems, which naturally support modular
and extensible composition,
on-demand operation with-
out extensive instrumented
power management, and
variability-tolerant design.
As highlighted by the ITRS
report, it is therefore increas-
ingly viewed as a critical
the above challenges.
vide both a short historical and technical overview
of asynchronous design, and also a snapshot of the
state of the art, with highlights of some recent
exciting technical advances and commercial in-
roads. It also covers some of the remaining
challenges, as well as opportunities, of the field.
Asynchronous design is not new: some of the
earliest processors used clockless techniques. Over-
all, its history can be divided into roughly four eras.
The early years, from the 1950s to the early 1970s,
included the development of classical theory
(Huffman [1], Unger [1], McCluskey, Muller [2]),
of leading commercial processors (Illiac, Illiac II,
Atlas, MU-5) and graphics systems (LDS-1). The
middle years, from the mid 1970s to early 1980s,
were largely an era of retrenchment, with reduced
nous VLSI era. The mid-1980s to late 1990s
represented a revival or ‘‘coming-of-age’’ era, with
the beginning of modern methodologies for asyn-
chronous controller and pipeline design, initial
computer-aided design (CAD) tools and optimiza-
tion techniques, the first academic microprocessors
Editor's notes by Partha Pratim Pande, Washington State University
2168-2356/15 © 2015 IEEE. May/June 2015. Copublished by the IEEE CEDA, IEEE CASS, IEEE SSCS, and TTTC.
Color versions of one or more of the figures in this paper are available online
at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/MDAT.2015.2413759
Date of publication: 16 March 2015; date of current version:
04 June 2015.
(Caltech, University of Manchester, Tokyo Institute of
Technology), and initial commercial uptake for use
in low-power consumer products (Philips Semicon-
ductors) and high-performance interconnection
networks (Myricom).
The modern era, from the early 2000s to present,
includes a surge of activity, with modernization of
design approaches, CAD tool development and
systematic optimization techniques, migration into
on-chip interconnection networks, several large-scale
demonstrations of cost benefits, industrial uptake at
leading companies (IBM, Intel) as well as startups, and
application to emerging technologies (sub-/near-
threshold circuits, sensor networks, energy harvesting,
cellular automata). The approaches in the modern era
bear little resemblance to some of the simple
asynchronous examples found in older textbooks.
with a chronicle of past and recent commercial
advances, and highlights the enabling role of
tion areas. Two promising application domains are
covered in more detail, namely GALS systems and
networks-on-chip, given their importance in facilitating the
integration of large-scale heterogeneous systems-on-chip.
Finally, several foundational techniques are
introduced: handshaking protocols and data encod-
ing, pipelining, and synchronization and arbitration.
Part 2 focuses on methodologies for the design of
asynchronous systems, including logic- and high-
level synthesis; tool flows for design, analysis,
verification and test; as well as examples of
asynchronous processors and architectures.
Applications
Asynchronous design has successfully migrated
into commercial products from leading companies
in recent years. In addition, there have been a
number of industry experiments using asynchro-
nous design that were quite successful, even though
products did not appear on the market. There are
also several exciting emerging application areas
where asynchronous design is expected to play a
key enabling role.
Commercialization
There have been several promising examples of
commercialization of asynchronous designs over
the last decade or two, with significant cost benefits
demonstrated.
Philips Semiconductors: low-power embedded
controllers. In the late 1990s through early 2000s,
Philips Semiconductors (now NXP) achieved much
commercial success with its asynchronous 80C51
microcontroller [3]. The chip was initially aimed for
use in pager chipsets, and the motivation was to
lower electromagnetic interference (EMI) noise
emissions so that the microcontroller could operate
harmoniously with the radio-frequency (RF) data
link, without the use of shielding. As a result,
encoding and decoding of RF data could now be
performed in software instead of requiring a custom
fixed-function circuit, allowing easier upgrades to
functionality, which was not possible with their
synchronous version. It also demonstrated a 4×
power reduction over a comparable synchronous
was later used in smart cards for public transport,
where the wide range of operating voltages allowed
the cards to be battery-free and contactless, powered
merely by the brief burst of charge induced when the
user’s hand waves it through the magnetic field of the
card reader. An enhanced version of the asynchro-
nous microcontroller (SmartMX) is now used in
more than 75 countries, including the European
Union and more recently the United States, for
biometric passports and IDs. At last count, the
number of copies sold has exceeded 700 million.
Intel/Fulcrum Microsystems: Ethernet switch chips.
In 2011, Intel acquired Fulcrum Microsystems, an
asynchronous startup producing high-speed net-
working chips, in a move industry analysts regarded
as a bid to compete with Cisco Systems. Intel’s
current FM5000/FM6000 family of switch chips,
which supports industry-leading 40 gigabit Ethernet,
includes a fully-asynchronous high-speed crossbar
switch that provides high bandwidth, low latency,
support for flexible link topologies, and high energy
efficiency. The crossbar bandwidth of over 1 terabit
per second (in a 130 nm process) is achieved
through fine-grain asynchronous pipelining, at the
granularity of individual gates, unencumbered by a
rigid clock period [4]. When operated at below
peak throughput, these chips are highly energy-
efficient, since the asynchronous logic provides
the benefit of automatic power-down of inactive
combination of features is considered unique in
the marketplace.
Achronix: high-performance FPGAs. Another example
of asynchronous commercialization is the
Speedster 22i family of FPGAs by Achronix Semiconductor
[5]. Manufactured in 22 nm, these chips
can operate at 1.5 GHz, and are currently claimed to be
the world’s fastest FPGAs, yet they incur a fraction of
the design cost and operating energy cost of leading
synchronous designs. Their key to achieving fast
operation is the use of asynchronous fine-grain bit-
level pipelines, thereby relaxing the constraints of
global synchronization. This asynchronous internal
implementation is transparent to the end user: the
applications mapped can be synchronous, just as on
any typical FPGA.
IBM: TrueNorth neuromorphic computer. There has
been much excitement recently about neuro-
morphic computing, which seeks to mimic the
functioning of the human brain by using massively-
parallel computer systems. In a departure from the
traditional von Neumann architecture, these sys-
tems employ a highly-distributed memory that is
tightly integrated with a large number of parallel
computational elements that model neurons. Due to
the spatially-distributed nature of computation and
communication, along with wide timing unpredict-
ability of data events, neuromorphic computing fits
well with the asynchronous paradigm. One interest-
ing example is TrueNorth [6], released in August
2014, which is the largest chip ever developed at
IBM, with 5.4 billion transistors. It integrates 4096
neurosynaptic cores on a single chip, modelling
1 million neurons and 256 million synapses. This
scale of integration poses a formidable physical
design challenge, which was successfully met using
fully-asynchronous processing elements and inter-
connection network. The event-driven asynchronous
operation facilitates extremely low power (70 mW
for real-time operation over the entire chip), which
would be extremely difficult to obtain with current
synchronous techniques. Other neuromorphic
computers (the University of Manchester’s SpiNNaker
machine, from the Furber group, and Stanford University’s
Neurogrid, from the Boahen group) also use
fully-asynchronous communication networks.
Other commercial designs. Several other industrial
applications have explored asynchronous benefits.
In the early 1990s, two commercial processors from
HaL Computer Systems used a self-timed floating-
point divider [7], which was reported as 2–3.5 times
faster than the leading commercial synchronous
dividers. At Sun Microsystems, now Oracle, high-
speed asynchronous pipelines were used commer-
cially in UltraSPARC IIIi computers for smoothing
out timing discrepancies in the interface to ultra-
fast memories. Theseus Logic has used a NULL
convention logic (NCL) methodology to develop
chips that are robust to extreme variations in
manufacturing and operating conditions [8]. Octa-
sic Inc. has recently developed a clockless DSP
technology that takes advantage of the highly-
variable data-dependent execution times of different
arithmetic operations to achieve a 3× throughput
increase over comparable synchronous implemen-
tations. Finally, Tiempo IC has developed low-power
robust microprocessors, and also has exploited
another intrinsic property of asynchronous circuits,
the lack of a coherent power or electromagnetic
emission signature, to develop chips with secure
cryptographic functionality that have increased resil-
ience to side-channel attacks.
Industry experiments
There have also been several industry experi-
ments with asynchronous design that, though quite
successful, did not appear in commercial products.
Intel RAPPID. In this experimental project at Intel
in the mid-1990s, an asynchronous implementation
of the IA32 instruction-length decoder was under-
taken because of severe performance bottlenecks
that could not be overcome in their commercial
version using synchronous techniques [9]. The
project focused on making the decoding of the
length of the most common CISC instructions fast by
exploiting concurrency at sub-clock-period granu-
larity, thereby significantly outperforming the exist-
ing synchronous Intel implementation: three times
higher throughput, comparable area, half the
latency, and half the power.
IBM FIR filter. At IBM Research, a project was
undertaken jointly with Columbia University to
develop a mixed synchronous-asynchronous imple-
mentation of a finite impulse response (FIR) filter for
use in the read channels of modern disk drives [10].
The goal was to reduce the filter’s latency over its
wide range of operating frequencies. In a synchro-
nous implementation that is deeply pipelined for
speed, the latency becomes poorer when the data
rate, and hence the clock recovered from it, slows
down. The hybrid synchronous/asynchronous imple-
mentation replaced the core of the filter with an
asynchronous pipelined unit featuring a fixed latency,
while the remaining circuitry was kept synchronous.
The resulting chip exhibited a 50% reduction in worst-
case latency, along with a 15% throughput im-
provement, over IBM’s leading commercial clocked
implementation in the same technology.
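The latency argument above can be made concrete with a small calculation. The following is only a sketch under assumed numbers: the stage counts, clock periods, and the fixed asynchronous core delay are hypothetical values chosen for illustration, not figures from the IBM design.

```python
def sync_pipeline_latency_ns(stages: int, clock_period_ns: float) -> float:
    """Latency of a fully clocked pipeline: every stage costs one clock period."""
    return stages * clock_period_ns


def hybrid_latency_ns(sync_stages: int, clock_period_ns: float,
                      async_core_ns: float) -> float:
    """Latency when the filter core is an asynchronous unit with a fixed delay
    and only a few boundary stages remain clocked."""
    return sync_stages * clock_period_ns + async_core_ns


if __name__ == "__main__":
    TOTAL_STAGES = 10       # hypothetical depth of the clocked pipeline
    BOUNDARY_STAGES = 2     # hypothetical clocked stages kept in the hybrid
    ASYNC_CORE_NS = 8.0     # hypothetical fixed latency of the async core
    for period_ns in (1.0, 2.0, 4.0):   # slower data rate -> longer period
        sync = sync_pipeline_latency_ns(TOTAL_STAGES, period_ns)
        hybrid = hybrid_latency_ns(BOUNDARY_STAGES, period_ns, ASYNC_CORE_NS)
        print(f"period={period_ns} ns  sync={sync} ns  hybrid={hybrid} ns")
```

In this model the clocked pipeline's latency grows in proportion to the clock period, while most of the hybrid's latency stays fixed, which is the effect the paragraph describes.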
Emerging application areas
Beyond more classical design targets, a number of
novel application areas have recently emerged where
asynchronous design is poised to make an impact.
Large-scale heterogeneous system integration. In
multi- and many-core processors and systems-on-chip
(SoCs), some level of asynchrony is inevitable
in the integration of heterogeneous components.
Typically, there are several distinct timing domains,
which are glued together using an asynchronous
communication fabric. There has been much recent
work on asynchronous and mixed synchronous-asynchronous
systems (see the “GALS Systems” and
“Networks-on-Chip” sidebars).
Ultra-low-energy systems and energy harvesting.
Asynchronous design is also playing a crucial role in
the design of systems that operate in regimes where
energy availability is extremely limited. In one ap-
plication, Liu and Rabaey demonstrated the benefits
of asynchrony for sub-threshold operation, with cir-
cuits consuming 32% lower energy than a sub-
threshold synchronous counterpart [11]. In another
application, a collaboration with Oticon Inc., a lead-
ing hearing-aid manufacturer, Nielsen and Sparsø
proposed an IFIR filter that achieves a 5× power
savings over a commercial synchronous counterpart
by dynamically adapting the numerical range of the
arithmetic circuitry to each individual sample, since
most audio samples are numerically small [12].
Such fine-grain adaptation, in which the datapath
latency can vary subtly for each input sample, is not
possible in a fixed-rate synchronous design. In a re-
cent in-depth case study by Chang et al., focusing on
ultra-low-energy 8051 microcontroller cores with
voltage scaling, it was shown that under extreme pro-
cess, voltage, and temperature (PVT) variations, a
synchronous core requires its delay margins to be
increased by a factor of 12, while a comparable
asynchronous core can operate at actual speed [13].
Finally, Christmann et al. have designed an energy
harvesting approach to implement an autonomous
sensing application, with a reported 40% power-
efficiency gain over synchronous approaches [14].
The use of asynchronous logic provided greater
energy efficiency not only due to its event-driven
nature, but also by allowing graceful adaptivity to the
highly-variable power availability.
Continuous-time digital signal processors (CT-DSPs).
Another intriguing direction is the development
of continuous-time digital signal processors,
where input samples are generated at irregular rates
by a level-crossing analog-to-digital converter, depend-
ing on the actual rate of change of the input’s wave-
form. An early specialized approach, using finely
discretized sampling, demonstrated a 10× power reduction
in a speech processing application [15]. The
first general-purpose continuous-time DSP was re-
cently proposed by Vezyrtzis et al. [16]; unlike syn-
chronous DSPs, it maintains its frequency response
intact over varying sample rates and can support
multiple input formats without any internal change,
eliminates all aliasing, and demonstrates a signal-to-
error ratio for certain inputs which exceeds that of
clocked systems. The fine-grain asynchronous pro-
cessing of irregular sampling is fundamental to the
operation of these systems, and would not be possible
to support by conventional synchronous techniques.
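To illustrate why level-crossing conversion yields irregular sample rates, here is a minimal software sketch. It is an assumption-laden model, not the circuit from [15] or [16]: the function name, the uniform level spacing `step`, and the fine time grid `dt` (used only to detect crossings in simulation) are all hypothetical.

```python
import math


def level_crossing_samples(signal, t_end, dt, step):
    """Emit a (time, level) pair each time `signal` crosses a quantization level.

    signal: callable mapping time to amplitude.
    dt:     simulation resolution, used only to detect crossings.
    step:   spacing between adjacent quantization levels.
    If the input jumps several levels within one dt, only one sample is
    recorded; this is a simplification of the model.
    """
    samples = []
    last_band = math.floor(signal(0.0) / step)
    for i in range(1, int(round(t_end / dt)) + 1):
        t = i * dt
        band = math.floor(signal(t) / step)
        if band != last_band:
            # The level crossed is the boundary between the two bands.
            samples.append((t, max(band, last_band) * step))
            last_band = band
    return samples


if __name__ == "__main__":
    slow = level_crossing_samples(lambda t: math.sin(2 * math.pi * 1 * t),
                                  t_end=1.0, dt=1e-4, step=0.25)
    fast = level_crossing_samples(lambda t: math.sin(2 * math.pi * 5 * t),
                                  t_end=1.0, dt=1e-4, step=0.25)
    print(len(slow), "samples for the slow input;", len(fast), "for the fast one")
```

A quiet or slowly varying input produces few samples and a rapidly changing one produces many, so the downstream processing sees events at data-dependent times rather than on a fixed clock.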
Handling extreme environments. Asynchronous
approaches have also been explored to handle ex-
treme environments, such as for space missions,
where temperatures can vary widely. For example,
an asynchronous 8-bit data transfer system has been
designed that is fully operational over a 400 °C
temperature range, from −175 °C to +225 °C, using
high-temperature SOI technology [17]. The imple-
mentation also shows good resilience to single-event
transients (SETs).
Alternative computing paradigms. Finally, there is
increasing interest in asynchronous circuits as the
organizing backbone of systems based on emerging
computing technologies, such as cellular nano-arrays
[18] and nanomagnetics [19], where highly-robust
asynchronous approaches are crucial to mitigating
timing irregularities, both layout-induced and those
resulting from the vagaries of quantum behavior.
Asynchronous approaches were also shown to be a
good match for flexible electronics, in the Seiko/
Epson ACT11 microprocessor [20], where the phys-
ical bending and manipulation of the material can
introduce unpredictable and large delay variations.
Summary
While the above applications are in a wide
variety of areas, they exhibit a few commonly-recur-
ring themes, representing beneficial opportunities
for asynchrony:
1) Extreme fine-grain pipelining: the ability to
implement and exploit extremely fine-grain
(even gate- and bit-level) pipelines, unconstrained
by the need to distribute a high-speed
fixed-rate clock [Intel/Fulcrum, Achronix, HaL,
Sun Microsystems];
2) Data-dependent completion times: to support and
micro-architect systems which can exploit subtle
and fine-grain differences in data-dependent
completion time, i.e., at a subcycle granularity
[Intel RAPPID, Oticon IFIR, Octasic];
3) Avoiding challenges due to the rigidity of global
[CT-DSPs, cellular nano-arrays, nanomagnetics,
flexible electronics], extreme micro-parallelism
ity [IBM FIR], and ease of large-scale system integ-
ration [GALS, NoCs, neuromorphic computing];
4) Robustness to voltage, temperature and process
variation: allowing flexible accommodation of dy-
namic timing variations [energy harvesting, sub-
threshold computing, space applications]; and
5) On-demand, i.e. event-driven, operation: highly
energy-proportional computing, without the
need for extensive instrumentation of clock-gating
at multiple design levels [nearly all applications].
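Theme 2 above, data-dependent completion time, can be illustrated with a textbook case that is not taken from any of the cited designs: a ripple-carry adder whose settling time tracks the longest carry chain actually exercised by its operands. The sketch below assumes that simple delay model; the function names and the operand width are hypothetical.

```python
import random


def longest_carry_chain(a: int, b: int, width: int) -> int:
    """Length of the longest run of bit positions a single carry ripples through."""
    longest = run = carry = 0
    for i in range(width):
        x = (a >> i) & 1
        y = (b >> i) & 1
        if carry and (x ^ y):
            run += 1            # incoming carry propagates through this bit
        elif x & y:
            run = 1             # a fresh carry is generated at this bit
        else:
            run = 0             # no carry leaves this bit
        longest = max(longest, run)
        carry = (x & y) | (carry & (x ^ y))
    return longest


def average_chain(width: int, trials: int, seed: int = 0) -> float:
    """Mean longest carry chain over uniformly random operand pairs."""
    rng = random.Random(seed)
    total = sum(longest_carry_chain(rng.getrandbits(width),
                                    rng.getrandbits(width), width)
                for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    width = 32
    print("worst-case chain:", width)
    print("average chain over random operands:", average_chain(width, 2000))
```

A clocked design must budget for the full-width chain on every operation, whereas a self-timed unit with completion detection can finish as soon as the chain for the current operands has settled.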
While the above themes capture promising oppor-
tunities, and the current industrial uptake indicates
increasing commercial viability and interest, there
remain open issues and challenges that asynchronous
design must overcome to gain wider industry adoption.
A brief discussion appears in the conclusion to Part 2.
Foundations
Asynchronous systems are typically organized as
a set of components which communicate and
synchronize using handshaking channels. These
channels are defined by two key parameters: com-
munication protocol and data encoding. Building on
these fundamentals, complex asynchronous systems
can be modularly constructed. In particular, we re-
view how these techniques can be used to construct
asynchronous pipelines, which are building blocks
for many high-performance systems. We then address
two basic issues in assembling asynchronous components
in larger systems: synchronization, which is
required when interfacing asynchronous and syn-
chronous domains; and arbitration, which is needed
to allow the safe competition of multiple asynchro-
nous components for a shared resource.
Classes of asynchronous circuits
Asynchronous circuits fall into several distinct
classes, depending on the degree of timing robustness
assumed in their operation. This spectrum typically
defines a robustness-performance space, where the
more robust circuits tend to have lower performance
(but not always!). Delay-insensitive circuits operate
correctly regardless of gate and wire delays. Quasi-
delay insensitive (QDI) circuits [4], [5], [21] operate
correctly assuming arbitrary gate delays, but all wires
at each fanout point must have roughly equal delays,
i.e., an isochronic fork assumption. Speed-independent
(SI) circuits [2] (an earlier term) assume arbitrary gate
delays, but all wire delays are assumed to be zero.
Other asynchronous circuits require additional timing
constraints, including hold time constraints (i.e.,
“fundamental mode” [1]) on controllers, one-sided
“bundled data” [3], [9], [22]–[24] timing constraints
on datapaths (see below), as well as two-sided constraints
(i.e., both short- and long-path) [25].
Protocols and data encoding
The basic structure of an asynchronous commu-
nication channel, between a sender and receiver, is
shown in Figure 1. Ignoring data transmission for now,
the channel is typically implemented by two wires:
req and ack.
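As a rough software sketch of how these two wires can carry one transaction, the model below steps a req/ack pair through the two widely used signalling disciplines: four-phase (return-to-zero), where both wires rise and then fall, and two-phase (transition signalling), where any toggle counts as an event. The function names and the dictionary representation of the wires are assumptions made for illustration only.

```python
def four_phase_transaction(wires):
    """Return-to-zero handshake: req+, ack+, req-, ack-. Wires start and end at 0."""
    if (wires["req"], wires["ack"]) != (0, 0):
        raise ValueError("four-phase channel must be idle at req=0, ack=0")
    trace = []
    for name, value in (("req", 1), ("ack", 1), ("req", 0), ("ack", 0)):
        wires[name] = value
        trace.append((name, value))
    return trace


def two_phase_transaction(wires):
    """Transition signalling: one toggle on req, then one toggle on ack."""
    if wires["req"] != wires["ack"]:
        raise ValueError("two-phase channel is idle only when req == ack")
    trace = []
    for name in ("req", "ack"):     # sender toggles first, receiver answers
        wires[name] ^= 1
        trace.append((name, wires[name]))
    return trace


if __name__ == "__main__":
    channel = {"req": 0, "ack": 0}
    print("four-phase:", four_phase_transaction(channel))
    print("two-phase: ", two_phase_transaction(channel))
    print("two-phase: ", two_phase_transaction(channel))
```

The model makes the basic trade-off visible: a four-phase transaction costs four wire transitions but always returns to the same idle state, while a two-phase transaction costs two transitions and leaves the wires in alternating idle states.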
Two common handshaking protocols are used to
define a single communication transaction, as
illustrated in Figure 2: 1) a four-phase protocol
(return-to-zero [RZ]), and 2) a two-phase protocol
(non-return-to-zero [NRZ], also known as transition-
signalling). In a four-phase protocol, req and ack
signals are initially at zero. The sender initiates
a transaction by asserting req, and the receiver
An alternative to constructing fully-asynchronous systems
is a hybrid approach, integrating synchronous components
(i.e. cores, memories, accelerators, I/O units, etc.) using an
asynchronous communication network, which together form a
globally-asynchronous locally-synchronous (GALS) system.1-3
For some applications, this approach provides the best of both
worlds: allowing the design reuse of synchronous intellectual
property (IP) blocks, while combining them with flexible
asynchronous interconnect as a global integrative medium. The
elimination of fixed-rate global clocking on the communication
network can provide a highly-scalable, low-power and robust
mechanism for assembling complex systems.
A GALS approach was first published in 1980 by Chuck
Seitz,4 in an influential overview of asynchronous design; the
approach was used even earlier by Evans & Sutherland Computer
Corp. in its first commercial graphics system, LDS-1
[1969]. The term was later introduced and formalized by
Chapiro.1 Synchronous components interact with the asynchronous
network using either asynchronous/synchronous interface circuits
or pausible clocks.2,3 Fig. A shows a simplified GALS system
with four synchronous cores. Each core can operate as a separate
voltage-frequency island (VFI), which is connected to a
switch (SW) of an asynchronous communication network through
an associated network interface (NI). A number
of successful GALS multicore architectures have been developed,2-3,5-7
including those supporting fine-grain
message passing8 and latency-insensitive communication.9
As a recent example, STMicroelectronics’ Platform 2012
(P2012)10 includes a fully-asynchronous network-on-chip,11
supporting a highly-reconfigurable accelerator-based many-core
GALS architecture which facilitates fine-grain power, reliability
and variability management. The first prototype delivered
80 GOPS performance with only 2 W power consumption.
It has evolved recently into the company’s STHORM Platform.
Interesting specialized applications which benefit from a
GALS architecture include large-scale neuromorphic systems
(see “Applications” section), as well as approaches to enhance
resilience to side-channel attacks.12 Another interesting
approach with a GALS-like flavor is “proximity communication,”
which aims to overcome the latency bottleneck of inter-chip
communication by exploiting capacitive coupling at their interfaces.13
Overall, there is a surge of interest and activity in GALS
design, in both industry and academia, as systems become
larger-scale and more heterogeneous, and variability and
timing unpredictability become critical factors.
As an addendum, it is worth noting that the term “GALS”
has at times been stretched to describe non-GALS systems.
In its original and widely-used sense,1,3,4 a GALS system includes
a fully-asynchronous interconnection network, including
handshaking channels, to integrate synchronous components.
Systems containing multiple synchronous cores,
operating at different clock rates, which are directly connected
using synchronizers, e.g. bi-synchronous FIFOs, are
more properly referred to as multi-synchronous systems.7
Fig. A. A multicore GALS system (from the “GALS Systems” sidebar).
A useful general classification scheme for mixed-timing
systems was proposed by Messerschmitt,3,14 based on the
relationship between different clock domains. In a mesochro-
nous system, synchronous components operate at exactly
the same frequency, but with unknown yet stable phase dif-
ference.15 In a plesiochronous system, synchronous compo-
nents operate at the same nominal frequency, but may have
a slight frequency mismatch, e.g. a few parts per million. Fi-
nally, in a heterochronous system, synchronous components
can operate at arbitrary unrelated frequencies.
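The three categories above can be expressed as a small decision procedure over two clock frequencies. This is only a sketch: the function name and especially the `ppm_tolerance` cutoff separating “slight mismatch” from “unrelated” are assumptions introduced here, and the model ignores phase, which in a mesochronous system is unknown but stable.

```python
def classify_clock_relationship(f_a_hz: float, f_b_hz: float,
                                ppm_tolerance: float = 100.0) -> str:
    """Classify two clock domains by their frequency relationship.

    ppm_tolerance is a hypothetical threshold (in parts per million) below
    which a mismatch is treated as 'slight'.
    """
    if f_a_hz <= 0 or f_b_hz <= 0:
        raise ValueError("frequencies must be positive")
    if f_a_hz == f_b_hz:
        return "mesochronous"       # exactly the same frequency
    mismatch_ppm = abs(f_a_hz - f_b_hz) / max(f_a_hz, f_b_hz) * 1e6
    if mismatch_ppm <= ppm_tolerance:
        return "plesiochronous"     # same nominal frequency, small drift
    return "heterochronous"         # arbitrary, unrelated frequencies


if __name__ == "__main__":
    print(classify_clock_relationship(1.0e9, 1.0e9))
    print(classify_clock_relationship(1.0e9, 1.0e9 * (1 + 5e-6)))
    print(classify_clock_relationship(1.0e9, 7.5e8))
```

In practice the category determines which interface circuit is appropriate, so a check of this kind would sit at design time rather than in running hardware.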
References
1. D. M. Chapiro, “Globally-asynchronous locally-synchronous
systems,” Ph.D. dissertation, Dept. Comput. Sci., Stanford
Univ., Stanford, CA, USA, 1984.
2. M. Krstic et al., “Globally asynchronous, locally synchro-
nous circuits: Overview and outlook,” IEEE Des. Test,
vol. 24, no. 5, pp. 430–441, 2007.
3. P. Teehan, M. Greenstreet, and G. Lemieux, “A survey
and taxonomy of GALS design styles,” IEEE Des. Test,
vol. 24, no. 5, pp. 418–428, 2007.
4. C. L. Seitz, “System timing,” in Introduction to VLSI Systems,
C. A. Mead and L. A. Conway, Eds. Addison-Wesley, 1980, pp. 218–262.
5. J. Muttersbach, T. Villiger and W. Fichtner, “Practical
design of globally-asynchronous locally-synchronous
systems,” in Proc. 6th IEEE Int. Symp. Adv. Res. ASYNC,
2000, pp. 52–59.
6. S. Moore et al., “Point to point GALS interconnect,” in
Proc. 8th IEEE Int. Symp. ASYNC, 2002, pp. 69–75.
7. A. Sheibanyrad, A. Greiner, and I. Miro-Panades, “Multisynchronous
and fully asynchronous NoCs for GALS
architectures,” IEEE Des. Test, vol. 25, no. 6, pp.
572–580, 2008.
8. N. J. Boden et al., “Myrinet: A gigabit-per-second local
area network,” IEEE Micro, vol. 15, no. 1, pp. 29–36,
Jan./Feb. 1995.
9. M. Singh and M. Theobald, “Generalized latency-insen-
sitive systems for single-clock and multi-clock architec-
tures,” in Proc. ACM/IEEE DATE, 2004, pp. 1008–1013.
10. L. Benini et al., “P2012: building an ecosystem for a
scalable, modular and high-efficiency embedded computing
accelerator,” in Proc. ACM/IEEE DATE, 2012,
pp. 983–987.
11. Y. Thonnart, P. Vivet, and F. Clermidy, “A fully-asynchronous
low-power framework for GALS NoC integration,” in Proc.
ACM/IEEE DATE, 2010, pp. 33–38.
12. R. Soares et al., “A robust architectural approach for
cryptographic algorithms using GALS pipelines,” IEEE
Des. Test, vol. 28, no. 5, pp. 62–71, 2011.
13. D. Hopkins et al., “Circuit techniques to enable 430 Gb/s
/mm2 proximity communication,” in Proc. IEEE ISSCC,
2007, pp. 368–369.
14. D. G. Messerschmitt, “Synchronization in digital system
design,” IEEE J. Sel. Areas Commun., vol. 8, no. 8,
pp. 1404–1419, Oct. 1990.
15. M. R. Greenstreet, “Implementing a STARI Chip,” in Proc.
ICCD, 1995, pp. 38–43.
Over the last decade, networks-on-chip (NoCs) have be-
come the de facto standard approach for structured on-chip
communication, both for low-power embedded systems and
high-performance chip multi-processors.1 These on-chip net-
munication with packet switching, and can be targeted to a
variety of cost functions (fault tolerance, power, latency, saturation
throughput, quality-of-service [QoS]) and parameters (network
topology, channel width, routing strategies).
Since the NoC approach separates the communication in-
frastructure, and its timing, from processing elements, it is a
natural match for an asynchronous paradigm. Asynchronous
interconnect eliminates the need for global clock management
across a large network, thereby providing better support for
scalability, timing robustness and low power, and avoids the
challenge of instrumenting complex clock-gating in a highly
distributed communication structure.
A number of asynchronous and GALS NoCs have been pro-
posed in the last decade or so. An early approach, Chain,2
used delay-insensitive codes on channels for crosstalk miti-
gation and ease of physical design, and was effectively ap-
plied to an ARM-based smart-card chip. Several approaches
to support QoS have been proposed, including combining
guaranteed service (GS) and best effort (BE) traffic,3 as well as
multiple service levels.4 A comprehensive asynchronous NoC
framework has been developed to provide dynamic voltage
and frequency scaling (DVFS) and fine-grain power management,5
while other approaches have targeted fault tolerance6
and arbitration for high-radix switches.7 Automated design
flows are also being developed,8,9 leveraging commercial
synchronous CAD tools, which use directives to meet asyn-
chronous path timing and latch constraints, as well as judi-
cious control of optimization modes to avoid the introduction
of hazards. A recent asynchronous time-division-multiplexed
(TDM) NoC demonstrates correct operation without any global
synchronization, while tolerating significant skews on different
network interfaces.10
Power and performance benefits of asynchronous NoCs
have been demonstrated for high-performance shared-memory
chip multi-processors11 and Ethernet switch chips,12 as well
as their facilitation of extreme fine-grain power management
and flexible integration of many-core GALS architectures (see
STHORM processor discussion in “GALS” sidebar). The end-to-end
latency benefits of asynchronous NoCs over synchronous
NoCs have also been demonstrated,8-9,11-12 due to the low
forward latency of individual asynchronous router nodes, and
the ability of packets to advance without continual alignment
to a global clock.
As a recent example, an asynchronous NoC switch architec-
ture,9 using single-rail bundled data and two-phase communi-
cation, obtained a 45% reduction in average energy-per-packet
and 71% area reduction compared to a highly-optimized syn-
chronous single-cycle NoC switch, xpipes Lite, in the same
40 nm technology. Additional latency benefits have been obtained
using low-overhead early arbitration techniques.13
One interesting emerging domain where asynchronous and
GALS NoCs have played a key role is the development of
neuromorphic chips (see “Applications” section). These rely on
the scalability and ease-of-integration of asynchronous inter-
connect, and the inherent event-driven operation for low power.
One recent example, IBM’s TrueNorth, integrates 4096 neuro-
synaptic cores on a single chip, which models 1 million neu-
rons and 256 million synapses, in the largest chip developed
to date by IBM (5.4 billion transistors), where the large-scale-
integration is facilitated by using a fully-asynchronous NoC.
Overall, the NoC area is a promising arena where the integrative
benefits of asynchronous design are making important
References
1. W. J. Dally and B. Towles, “Route packets, not wires: On-chip
interconnection networks,” in Proc. ACM/IEEE DAC, 2001,
pp. 684–689.
2. J. Bainbridge and S. Furber, ‘‘Chain: A delay-insensitive chip
area interconnect,’’ IEEE Micro, vol. 22, no. 5, pp. 16–23,
Sep./Oct. 2002.
3. T. Bjerregaard and J. Sparsø, “A router architecture for con-
nection-oriented service guarantees in the MANGO clock-
less network-on-chip,” in Proc. ACM/IEEE DATE, 2005, pp.
1226–1231.
4. R. Dobkin et al., “An asynchronous router for multiple ser-
vice levels networks on chip”, in Proc. 11th IEEE Int. Symp.
ASYNC, 2005, pp. 44–53.
5. E. Beigne et al., “Dynamic voltage and frequency scaling ar-
chitecture for units integration within a GALS NoC,” in Proc.
ACM NOCS, 2008, pp. 129–138.
6. M. Imai and T. Yoneda, “Improving dependability and perfor-
mance of fully asynchronous on-chip networks,” in Proc. 17th
IEEE Int. Symp. ASYNC, 2011, pp. 65–76.
7. S. R. Naqvi and A. Steininger, “A tree arbiter cell for high
speed resource sharing in asynchronous environments,” in
Proc. ACM/IEEE DATE, 2014.
8. Y. Thonnart, E. Beigne, and P. Vivet, “A Pseudo-synchronous
implementation flow for WCHB QDI asynchronous circuits,”
in Proc. 18th IEEE Int. Symp. ASYNC, 2012, pp. 73–80.
9. A. Ghiribaldi, D. Bertozzi, and S. M. Nowick, “A transition-
signaling bundled data NoC switch architecture for cost-ef-
fective GALS multicore systems,” in Proc. ACM/IEEE DATE,
2013, pp. 332–337.
10. E. Kasapaki and J. Sparsø, “Argo: A time-elastic time-division-multiplexed
NoC using asynchronous routers,” in Proc.
20th IEEE Int. Symp. ASYNC, 2014, pp. 45–52.
11. M. N. Horak et al., “A low-overhead asynchronous intercon-
nection network for GALS chip multiprocessors,” IEEE Trans.
Comput.-Aided Des. Integr. Circuits Syst., vol. 30, no. 4, pp.
494–507, 2011.
12. A. Lines, “Asynchronous interconnect for synchronous SoC
design,” IEEE Micro, vol. 24, no. 1, pp. 32–41, 2004.
13. W. Jiang et al., “A lightweight early arbitration method for low-
latency asynchronous 2D-mesh NoC’s,” in Proc. ACM/IEEE
DAC, 2015.
Networks-on-Chip
May/June 2015 11
responds by asserting ack, in the active or evaluate
phase. The two signals are then deasserted, in turn,
in the return-to-zero or reset phase. In contrast, in a
two-phase protocol, there is no return-to-zero phase:
a single toggle on req indicates a request, followed
by a toggle on ack to indicate an acknowledge.
Both two-phase and four-phase protocols are
widely used, with interesting tradeoffs between
them. A four-phase protocol has the benefit of return-
ing interfaces to a unique state, i.e., all-zero, which
typically simplifies hardware design. It is also a good
match for dynamic logic, where the RZ phase directly
corresponds to the precharge phase [4], [5], [7], [10].
However, the protocol requires two complete round-
trip channel communications per transaction, which
can result in lower throughput. A two-phase protocol
may involve more complex hardware design, but only
requires one round-trip communication per transaction,
which can provide higher throughput, and
sometimes still has quite low complexity [22], [23],
[24]. Alternative protocols using pulse-mode or single-
track handshaking have also been proposed.
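The difference between the two protocols can be made concrete by counting wire events per transaction. The following is only a minimal behavioral sketch: the function names (`four_phase`, `two_phase`, `run`) are hypothetical, and wire delays are ignored.

```python
def four_phase(req, ack):
    """Events for one four-phase (return-to-zero) transaction."""
    assert (req, ack) == (0, 0), "a four-phase channel idles at all-zero"
    return [("req", 1), ("ack", 1),   # active (evaluate) phase
            ("req", 0), ("ack", 0)]   # return-to-zero (reset) phase


def two_phase(req, ack):
    """Events for one two-phase transaction: a single toggle per wire."""
    return [("req", 1 - req), ("ack", 1 - ack)]


def run(protocol, transactions):
    """Apply `transactions` handshakes; return (event trace, final wire state)."""
    wires = {"req": 0, "ack": 0}
    trace = []
    for _ in range(transactions):
        for wire, value in protocol(wires["req"], wires["ack"]):
            wires[wire] = value
            trace.append((wire, value))
    return trace, (wires["req"], wires["ack"])
```

In this sketch a four-phase transaction always costs four events (two round trips) and leaves the channel at all-zero, while a two-phase transaction costs two events and leaves the wires in a state that depends on how many transactions have occurred.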
Once a communication protocol for a channel
has been defined, data communication is typically
needed. The data itself typically replaces the single
req wire in the above example. There are two
common data encoding schemes: 1) delay-insensitive
(DI) codes, and 2) single-rail bundled data.
Figure 3a illustrates delay-insensitive encoding on
a simple two-bit example. A common approach, dual-
rail encoding, is used for each bit, X and Y, which are
each encoded using two rails or wires (X1/X0 for X,
Y1/Y0 for Y). Assuming a four-phase protocol, all wires
are initially zero, and each bit is encoded as 00,
representing a NULL or spacer token (i.e., no valid
data). A one-hot encoding scheme is used: the
transmission of a 1 (0) value on X involves asserting
wire X1 (X0) high, and similarly for the transmission
on bit Y. Once the receiver has obtained a complete
valid codeword, it asserts ack. The reset phase then
occurs, where data and ack are deasserted in turn. A
completion detector (CD) is used by the receiver to
identify when a valid codeword has been received.
Dual-rail encoding is widely used [5], [7], [17],
and is one simple instance of a delay-insensitive
code. In particular, note that, regardless of the trans-
mission time and relative skew of the distinct bits,
the receiver can unambiguously identify when every
bit is valid, by checking for the arrival of a legal
codeword (01 or 10) on each bit. As a result, this ap-
proach provides great resilience to physical and
operating variability. Alternative DI codes have also
been widely explored, providing cost tradeoffs in
coding efficiency, dynamic power, and hardware
overhead, including 1-of-4, m-of-n [24], systematic,
level-encoded dual-rail (LEDR) and level-encoded
transition-signalling (LETS) [24] codes.
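As an illustration of dual-rail encoding with a four-phase protocol, the sketch below models each bit as a pair (X1, X0) and shows how a completion detector can classify a word as spacer, partially arrived, or valid. The helper names are hypothetical and the model is purely behavioral, not a circuit description.

```python
NULL = (0, 0)                      # spacer: no valid data on this bit
LEGAL = ((1, 0), (0, 1))           # one-hot codewords for 1 and 0


def encode_bit(value):
    """Return (rail1, rail0): a 1 asserts rail1, a 0 asserts rail0."""
    return (1, 0) if value else (0, 1)


def completion_detector(word):
    """Classify a list of dual-rail bits as 'null', 'partial', or 'valid'."""
    if any(pair == (1, 1) for pair in word):
        raise ValueError("11 is not a legal dual-rail codeword")
    if all(pair == NULL for pair in word):
        return "null"
    if all(pair in LEGAL for pair in word):
        return "valid"
    return "partial"               # some bits have arrived, others have not


def decode(word):
    """Recover the binary values once the whole word is valid."""
    if completion_detector(word) != "valid":
        raise ValueError("word is not yet complete")
    return [rail1 for rail1, _ in word]
```

For the two-bit example, the receiver would see `[NULL, NULL]`, then possibly a partial word if X arrives before Y, and only asserts ack once both bits carry a legal codeword, regardless of the relative skew.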
Figure 1. An asynchronous channel.
Figure 2. Asynchronous handshake protocols.
Figure 3b shows an alternative encoding approach,
single-rail bundled data. A standard synchronous-style
data channel is used, i.e., with binary encoding. One extra req
wire is added, serving as a ‘‘bundling signal’’ or local strobe,
which must arrive at the receiver after all data bits are stable and
valid. Both four-phase [3] and two-phase [22], [23] bundled
protocols are widely used. Interestingly, the bundled data
scheme allows arbitrary glitches on the data channel, as long as
data becomes stable and valid before the req signal is transmitted.
Typically, data must remain valid from before the req is
transmitted to after an ack is received. Therefore, the scheme
facilitates the use of synchronous-style computation blocks.
It also provides good coding efficiency, with only one extra
req wire added to the datapath.
However, unlike DI codes, a one-
sided timing constraint must be
enforced: the req delay must
always be longer than the worst-case data transmission. To support
this constraint, a small matched delay is typically added, such as
an inverter chain or a carefully replicated portion of the critical
path. Unlike in a clocked system,
though, this is a localized con-
straint: stages can be highly
unbalanced, each with its own
distinct matched delay. More-
over, the timing margins can be
made fairly tight because some parameters (e.g.,
process, voltage, temperature) tend to be locally
more uniform.
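The one-sided bundling constraint can be expressed as a simple check. The sketch below is illustrative only: the function names, the integer picosecond units, and the idea of sizing an inverter chain from a single per-inverter delay are all assumptions, not a sign-off methodology.

```python
def bundling_constraint_met(data_delays_ps, req_delay_ps, margin_ps=0):
    """True if req arrives strictly after the slowest data bit plus a margin."""
    return req_delay_ps > max(data_delays_ps) + margin_ps


def inverter_chain_length(worst_data_ps, inverter_ps, margin_ps=0):
    """Smallest even number of inverters whose total delay exceeds the target.

    Delays are integer picoseconds; an even count keeps the chain non-inverting.
    """
    target = worst_data_ps + margin_ps
    n = target // inverter_ps + 1      # smallest n with n * inverter_ps > target
    return n + (n % 2)                 # round up to an even count
```

Because the constraint is local, each stage would be checked against its own data delays and its own matched delay, rather than against a single global worst case.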
Finally, a hybrid scheme, called speculative
completion [26], uses bundled data, but also allows
variable-latency completion, including better than
worst-case, based on the actual data inputs. High-performance
adders (e.g., Kogge-Stone) have been demonstrated, operating
at faster rates than synchronous designs.
Pipelining
Pipelining is a fundamental technique to in-
crease concurrency and boost throughput in high-
performance digital systems. All modern high-speed
processors, multimedia and graphics units, and
signal processors are pipelined. In a typical pipe-
lined implementation, complex function blocks are
subdivided into smaller blocks, registers are inserted
to separate them, and a clock is applied to all re-
gisters. In an asynchronous system, no global clock
is used and, instead, the interaction of neighboring
stages is coordinated by a handshaking protocol.
Developing better pipeline protocols and their
efficient circuit-level implementation has been the
focus of many researchers over the past two to three
decades, resulting in a variety of pipeline styles, starting with the seminal work of Sutherland.
More details can be found in a recent survey [24].
Sutherland’s micropipeline. Figure 4 shows a
basic micropipeline [22], which uses a two-phase
handshaking protocol and single-rail bundled data.
Each interface between adjacent stages has single-rail
data and a bundling signal (req_i) going forward,
and an acknowledgment (ack_i) going backward. A
delay element is added to match the worst-case
delay of the corresponding logic block.
Figure 3. Asynchronous data encoding schemes. (a) Dual-rail encoding;
(b) single-rail bundled data.
The pipeline operates according to a so-called
capture-pass protocol. The protocol is implemented
using a simple control chain of Muller C-elements
[22], [24] (with inversions on the right inputs),
operating on a set of specialized capture-pass datapath
latches. (A C-element is a basic asynchronous storage element;
assuming inputs A and B, the output is 1 (0) if both inputs are
1 (0), otherwise it holds its prior value.) The latches are initially all normally
transparent, unlike synchronous pipelines, so the
entire pipeline forms a flow-through combinational
path. Locally, only after data advances through an
individual stage’s latches, the corresponding request
req_{i-1} causes a transition on the C (i.e., capture)
control input, which makes those latches opaque,
thereby storing and protecting the data. Once data
advances through the next stage’s latches, where the
data is safely stored, a transition on the P (i.e., pass)
control input, via ack_{i+1}, makes the current stage’s
latches transparent again, completing an entire cycle.
The latches indicate the completion of capture and
pass operations via Cd (capture done) and Pd (pass
done) outputs, respectively. Effectively, each data item
initiates a ‘‘wavefront,’’ which advances through the
pipeline and is protected by a series of latch-capture
operations. Predecessor stages, behind the wavefront,
are subsequently freed up through a series of pass
operations, once data has been safely copied to the
next stage. The old data can then be overwritten by
the next wavefront.
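The control chain of C-elements can be sketched behaviorally. The model below is a simplification under stated assumptions: it covers only the control path (the capture-pass latches, the Cd/Pd signals, and the matched delays are omitted), each stage's C-element output is taken to serve as both the request to the right and the acknowledgment to the left, and the names `c_element`, `settle`, and `occupied` are hypothetical.

```python
def c_element(a, b, prev):
    """Muller C-element: output follows the inputs when they agree, else holds."""
    return a if a == b else prev


def settle(stages, req_in, ack_out):
    """Re-evaluate every stage's C-element until no output changes.

    stages[i] is the output of stage i's C-element (two-phase signaling,
    so only transitions carry meaning, not levels).
    """
    changed = True
    while changed:
        changed = False
        for i in range(len(stages)):
            left = req_in if i == 0 else stages[i - 1]
            right = ack_out if i == len(stages) - 1 else stages[i + 1]
            new = c_element(left, 1 - right, stages[i])  # right input is inverted
            if new != stages[i]:
                stages[i] = new
                changed = True
    return stages


def occupied(stages, ack_out):
    """Stage i holds an unacknowledged item when its output differs from its right neighbor's."""
    right = stages[1:] + [ack_out]
    return [s != r for s, r in zip(stages, right)]
```

With three stages and an environment that does not acknowledge, three successive toggles on `req_in` fill the chain; a fourth toggle is blocked until `ack_out` toggles, at which point the stored items shift right and the pending item enters.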
Although micropipelines require specialized com-
ponents for implementation, they are remarkable in
the simplicity and elegance of their structure and
operation, and have inspired many subsequent approaches.
Their introduction by
Sutherland also provided deeper
insights into the nature of asyn-
chronous systems and triggered
a resurgence of research activity
in asynchronous design.
Mousetrap pipeline. We developed
Mousetrap at Columbia University
to be a high-performance
pipeline that supports the use of
an entirely standard cell method-
ology [23], [24]. Although its
two-phase capture-pass protocol
is based on that of micropipe-
lines, it has simpler control cir-
cuits and data latches, with much lower area and
delay overheads. Figure 5 shows a basic Mousetrap
pipeline. The local control for each stage is only a
single combinational exclusive-NOR (XNOR) gate,
and the storage for each stage is a single bank of level-
sensitive D-latches, both of which are available in
standard cell libraries.
Figure 4. Sutherland’s micropipeline.
Figure 5. Mousetrap pipeline.
Although the implementation is quite different,
the overall operation is similar to that of micropipelines.
Assume that all req_i and ack_i signals
are initially at 0, and all the data latches are therefore
transparent. As new data arrives into stage i from
the left, and passes through the latch, the corresponding
req_i bundling signal toggles. As a result, the stage’s
XNOR toggles from 1 to 0, thereby capturing data in
the latch. It also requests the next data item from its left
neighbor by toggling ack_i. Subsequently, when stage i
receives an acknowledgment ack_{i+1} from its right
neighbor, stage i’s XNOR toggles back to 1, making
stage i’s latch transparent, and completing the cycle.
The relatively lightweight control and storage structures
allow the pipeline to achieve high throughput:
2.4 giga items/s FIFOs (in 180 nm), and a greatest
common divisor (GCD) test chip at 2.1 GHz (in
130 nm technology). Mousetrap circuits have been
used in several recently-proposed asynchronous
NoC designs (for example, see Horak et al. and
Kasapaki/Sparsø in the ‘‘Networks-on-Chip’’ sidebar).
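The single-XNOR control described above can be mimicked with a small behavioral model of one stage. This is only a sketch: the class and method names are hypothetical, gate and latch delays are ignored, and the request bit is assumed to pass through the same latch bank as the data, so that the latched request doubles as the acknowledgment to the left neighbor.

```python
def xnor(a, b):
    return 1 if a == b else 0


class MousetrapStage:
    def __init__(self):
        self.req = 0      # latched request bit; also returned to the left as ack_i
        self.data = None  # contents of the data latch bank

    def enable(self, ack_next):
        """Latch enable: transparent (1) when req_i and ack_{i+1} agree."""
        return xnor(self.req, ack_next)

    def update(self, req_prev, data_prev, ack_next):
        """Pass inputs through if transparent; return the enable value afterwards."""
        if self.enable(ack_next):
            self.req, self.data = req_prev, data_prev
        return self.enable(ack_next)
```

A toggle on the incoming request while the stage is transparent makes the latched request disagree with `ack_next`, so the enable drops to 0 and the data is held; a later toggle on `ack_next` restores agreement and reopens the latch.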
GasP pipeline. GasP was developed at Sun Micro-
systems Laboratories to push the limit of achievable
performance by using an aggressive custom circuit
style for specialized applications [25]. A distinctive
feature is that, instead of the usual pair of request
and acknowledge wires, each control channel be-
tween adjacent stages consists of a single wire, i.e.,
a ‘‘single-track channel,’’ allowing bi-directional communication.
Handshaking is performed via carefully
generated pulses: a forward request transition sets the
state of the control channel, and a subsequent
reverse acknowledgment transition resets the chan-
nel state. Hence, GasP effectively combines the
benefits of both two-phase and four-phase protocols
on a single wire. Circuit designs are highly optimized
for delay, but the pulse-based protocol imposes two-
sided timing constraints (i.e., short and long path
requirements), requiring careful balancing of path
delays to ensure correct operation [24].
Dynamic logic pipelines. Dynamic logic datapaths
are common in high-performance systems, especially
in the core of ALUs in high-speed microprocessors
and ASICs, despite the
greater design and validation
effort required. Interestingly,
dynamic logic is an especially
good match for asynchronous
pipelines. In particular, local
handshaking obviates the need
for the complex and carefully
controlled multiphase clocking
that is typical of synchronous
dynamic circuits, and noise-in-
duced delay variations can be
robustly handled through the use
of DI encoding [24]. Further-
more, a unique feature of many asynchronous
dynamic pipelines is that they are entirely latchless,
storing data directly on logic block outputs with
keepers. As a result, dynamic logic pipelines have
been used in several recent high-performance asyn-
chronous commercial products [4], [5], [24].
We review the PS0 pipeline style by Williams and
Horowitz [7], [24], which was used in the design of
high-speed floating-point dividers at HaL Computers
in the 1990s, and was influential on much subse-
quent research.
Figure 6 shows the basic structure of a PS0 pipe-
line; each stage consists of a function block composed
of dynamic logic, and a completion detector (CD).
The datapath uses DI coding (in particular, dual-rail),
and there are no explicit registers between adjacent
stages. Each function block alternates between an
evaluate phase and a precharge phase. Initially, the
function block outputs are reset to 0, and the block is in the
evaluate phase, awaiting data inputs. In the evaluate
phase, each block computes after its data inputs
arrive. In the precharge phase, the function block is
reset, with all its outputs returning to 0. The CD iden-
tifies when the stage’s computation is complete, or
when its outputs have been reset to 0. The single input
control for each stage, Prech/Eval, is connected
from the output of the next stage’s CD. The interaction
between stages follows a simple protocol: a stage is
precharged whenever the next stage finishes evalua-
tion, and a stage is enabled to evaluate whenever the
next stage finishes its precharge. This protocol ensures
that two successive wavefronts of data are always
separated by a reset spacer. The use of fast dynamic
logic without latches yields purely combinational
execution times, even for iterative computations that
are implemented using self-timed rings.
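The PS0 stage interaction can be explored with a toy unit-delay simulation. Everything here is an assumption made for illustration: each stage, the source, and the sink's completion detector take one time step to react; function blocks are the identity; `None` stands for a reset (spacer) output; and the names `ps0_step` and `run_ps0` are hypothetical.

```python
def ps0_step(src, out, sink_valid, pending, received):
    """Advance every stage by one unit delay, using only the previous state."""
    n = len(out)
    new_out = []
    for i in range(n):
        next_valid = sink_valid if i == n - 1 else out[i + 1] is not None
        if next_valid:
            new_out.append(None)          # precharge: next stage finished evaluating
        else:
            inp = src if i == 0 else out[i - 1]
            if out[i] is None and inp is not None:
                new_out.append(inp)       # evaluate (identity function block)
            else:
                new_out.append(out[i])    # hold the output, or keep waiting
    # The source behaves like a stage to the left of stage 0.
    if out[0] is not None:
        new_src = None
    elif src is None and pending:
        new_src = pending.pop(0)
    else:
        new_src = src
    # The sink's completion detector mirrors the last stage after one delay.
    new_sink_valid = out[-1] is not None
    if new_sink_valid and not sink_valid:
        received.append(out[-1])
    return new_src, new_out, new_sink_valid


def run_ps0(items, n_stages=3, steps=40):
    """Push `items` through the pipeline; return (received items, output history)."""
    src, out, sink_valid = None, [None] * n_stages, False
    pending, received, history = list(items), [], []
    for _ in range(steps):
        src, out, sink_valid = ps0_step(src, out, sink_valid, pending, received)
        history.append(list(out))
    return received, history
```

In this model a data item spreads across at most two adjacent stages at a time, and a stage never holds a different item from its neighbor, which is the spacer-separation property the protocol is meant to provide.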
Figure 6. PS0 pipeline.
A number of other dynamic pipeline styles have
been proposed [24], with a range of tradeoffs in per-
formance, robustness and other cost metrics. These
include dynamic GasP by Ebergen et al. (Sun Micro-
systems) [25]; PCHB/PCFB by Lines; high-capacity
(HC) and lookahead pipelines (LP) by Singh and
Nowick; IPCMOS by Schuster et al. (IBM Research);
and single-track styles by Beerel et al. Asynchronous
pipelines have been used commercially in Sun’s
UltraSPARC IIIi computers for fast memory access; in
Achronix’s Speedster 22i FPGAs [5]; in the Ethernet
switch chips of Intel/Fulcrum Microsystems [4]; and
experimentally at IBM Research for a low-latency
finite-impulse response (FIR) filter chip [10].
Synchronization and arbitration
Two related capabilities are needed when han-
dling the continuous-time operation of an asynchro-
nous system: synchronization and arbitration.
Synchronization involves the interfacing of asyn-
chronous and synchronous systems, or two unrelat-
ed synchronous systems, where, at the boundary
crossing, an asynchronous signal must be safely
realigned to a clock domain.
A good overview of the topic
has been presented by Gino-
sar [27]. Any direct connec-
tion of asynchronous inputs
to synchronous registers can
cause setup time violations,
resulting in metastable oper-
ation and possible failure,
such as storing of intermedi-
ate voltage values or even
oscillatory behavior. The first
detailed published results
identifying and evaluating
metastability were presented in 1973 by Chaney
and Molnar (see [27]).
The classic solution for a single bit is to provide a
basic synchronizer: double or triple flip-flops in
series, to ensure sufficient stabilization time to pro-
duce a clean synchronous output, with extremely
high mean-time-between-failure (MTBF). Detailed
synchronizer performance analysis has been pro-
posed, which considers the impact of noise and
thermal effects, along with directions to improve
circuit design [28]. More general solutions have
been proposed for synchronization blocks which
support buffering and flow control [29], [30]. The
approach by Chakraborty and Greenstreet provides
an integrated study of synchronizing two clock
domains, ranging from mesochronous to hetero-
chronous communication [29].
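To see why adding a flip-flop in series raises MTBF so sharply, it helps to evaluate a commonly used first-order model, MTBF = exp(S / tau) / (T_W * F_C * F_D), where S is the settling time allowed, tau is the resolution time constant of the flip-flop, T_W is its metastability window, and F_C and F_D are the clock and data event rates. The model and the parameter names are assumptions brought in for illustration, not taken from this article, and any numerical values used with it would be hypothetical.

```python
import math


def synchronizer_mtbf(settle_time_s, tau_s, window_s, f_clock_hz, f_data_hz):
    """First-order MTBF estimate, in seconds, for a flip-flop synchronizer."""
    return math.exp(settle_time_s / tau_s) / (window_s * f_clock_hz * f_data_hz)
```

Under this model, each additional clock period of settling time multiplies the MTBF by exp(T / tau), which is why double or triple flip-flop chains are usually sufficient.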
Figure 7. Mixed-clock FIFO.
Figure 8. A two-way arbiter. (a) Block diagram, (b) timing, and (c) implementation.
Figure 7 illustrates an example of a mixed-clock
FIFO by Chelcea and Nowick [30], which can interface
two arbitrary clock domains, a sender (put
interface) and a receiver (get interface). The design
is one of a complete family of modular mixed-timing
interfaces, including other variants to support mixed
asynchronous/synchronous communication which
are needed in GALS systems. The FIFO is constructed
as a simple token ring, with pointers to head and tail
locations. Each interface operates independently at
its own clock rate, and data items do not move once
deposited. Full and empty detectors are used to avoid
overflow and underflow, respectively. A novel feature
is that only three synchronizers are required, regard-
less of the number of cells in the ring, hence it is
highly scalable. It also avoids synchronization perfor-
scenarios.
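The token-ring organization, in which data items stay where they were deposited and only the put and get tokens move, can be sketched in software. This is a purely functional model under simplifying assumptions: clocks, synchronizers, and the timing of the full and empty detectors are not modeled, and the class and method names are hypothetical.

```python
class TokenRingFifo:
    def __init__(self, n_cells):
        self.data = [None] * n_cells
        self.full_bits = [False] * n_cells   # one validity bit per cell
        self.put_token = 0                   # cell the sender writes next
        self.get_token = 0                   # cell the receiver reads next

    def is_full(self):
        return all(self.full_bits)

    def is_empty(self):
        return not any(self.full_bits)

    def put(self, item):
        """Deposit an item at the put token; return False if the ring is full."""
        if self.is_full():
            return False
        self.data[self.put_token] = item
        self.full_bits[self.put_token] = True
        self.put_token = (self.put_token + 1) % len(self.data)
        return True

    def get(self):
        """Return (True, item) from the get token, or (False, None) if empty."""
        if self.is_empty():
            return False, None
        item = self.data[self.get_token]
        self.full_bits[self.get_token] = False
        self.get_token = (self.get_token + 1) % len(self.data)
        return True, item
```

In the hardware design, the put and get sides would run in different clock domains and observe the full and empty conditions through synchronizers; the sketch only captures the ordering and overflow/underflow behavior.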
Arbitration involves the resolution of two or more
competing signals requesting a shared resource. In
synchronous design, it is a simple operation: at the
start of each clock cycle, existing requests are exam-
ined and one is selected as a winner. In asynchro-
nous design, however, inputs arrive in continuous
time, and resolution must be guaranteed to be clean
and safe, regardless of the signal arrival times.
The basic component to resolve a two-way
asynchronous arbitration is a mutual-exclusion ele-
ment (mutex, or ME), shown in Figure 8, due to Seitz.
This analog component guarantees a hazard-free
output. In principle, as two competing inputs arrive
closer together, its resolution time, i.e., latency, can
become arbitrarily long. In practice, though, only
extremely close spacing of samples (e.g., <1 ps)
will result in a relatively long delay. More complex
asynchronous arbiters typically use mutexes as
building blocks. Two- and four-way arbiters are
fundamental components in router nodes in asynchronous
NoCs. N-way asynchronous arbiters have
also been proposed, as well as priority arbiters.
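The externally visible behavior of a two-way mutex can be captured in a short behavioral model. This sketch deliberately abstracts away the analog resolution process and its unbounded latency: ties are broken with a pseudo-random choice, and the class name, method name, and `rng` parameter are hypothetical.

```python
import random


class Mutex:
    def __init__(self, rng=None):
        self.owner = None                      # index of the granted request, if any
        self.rng = rng or random.Random()

    def update(self, r1, r2):
        """Return (g1, g2) for the current request levels; never both True."""
        reqs = (r1, r2)
        if self.owner is not None and not reqs[self.owner]:
            self.owner = None                  # holder withdrew its request
        if self.owner is None:
            pending = [i for i in (0, 1) if reqs[i]]
            if pending:
                # Simultaneous requests are resolved arbitrarily but cleanly.
                self.owner = self.rng.choice(pending)
        return (self.owner == 0, self.owner == 1)
```

A request that arrives while the other side holds the grant simply waits, and receives the grant once the holder withdraws its request.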
IN THIS PART, we have presented an overview of
some key advances of asynchronous design, and
discussed emerging application areas where asyn-
chrony is poised to play a critical role. We have also
reviewed technical foundations, as well as highlight-
ed recent developments in GALS and NoC design.
and systems, including logic and high-level synthesis,
computer-aided design (CAD) tool flows for design
and test, and processors and architectures.
Acknowledgment
The authors appreciate the funding support of
the National Science Foundation under Grants CCF-
1219013, CCF-0964606 and OCI-1127361.
References
[1] S. H. Unger, Asynchronous Sequential Switching
Circuits. New York, NY, USA: Wiley, 1969.
[2] D. E. Muller and W. C. Bartky, A Theory of Asynchronous
Circuits. Cambridge, MA, USA: Annals of Computing
Laboratory of Harvard University, 1959, pp. 204–243.
[3] H. van Gageldonk et al., ‘‘An asynchronous low-power
80C51 microcontroller,’’ in Proc. Int. Symp. Adv.
Res. Asynch. Circuits Syst. (ASYNC 98), 1998,
pp. 96–107.
[4] M. Davies et al., ‘‘A 72-Port 10G Ethernet
switch/router using quasi-delay-insensitive
asynchronous design,’’ in Proc. Int. Symp. Asynch.
Circuits Syst. (ASYNC 14), 2014, pp. 103–104.
[5] J. Teifel and R. Manohar, ‘‘Highly pipelined
asynchronous FPGAs,’’ in Proc. ACM/SIGDA Int.
Symp. Field Programmable Gate Arrays (FPGA 04),
2004, pp. 133–142.
[6] P. Merolla et al., ‘‘A million spiking-neuron integrated
circuit with a scalable communication network and
interface,’’ Science, vol. 345, no. 6197, pp. 668–673,
2014.
[7] T. E. Williams and M. A. Horowitz, ‘‘A zero-overhead
self-timed 160 ns 54 b CMOS divider,’’ IEEE J.
Solid-State Circuits, vol. 26, no. 11, pp. 1651–1661,
1991.
[8] K. M. Fant, Logically Determined Design. New York,
NY, USA: Wiley, 2005.
[9] K. S. Stevens et al., ‘‘An asynchronous instruction
length decoder,’’ IEEE J. Solid-State Circuits, vol. 36,
no. 2, pp. 217–228, 2001.
[10] M. Singh et al., ‘‘An adaptively pipelined mixed
synchronous-asynchronous digital FIR filter chip
operating at 1.3 gigahertz,’’ IEEE Trans. Very
Large Scale Integr. (VLSI) Syst., vol. 18, no. 7,
pp. 1043–1056, 2010.
[11] T. Liu et al., ‘‘Asynchronous computing in sense
amplifier-based pass transistor logic,’’ IEEE Trans.
Very Large Scale Integr. (VLSI) Syst., vol. 17, no. 7,
pp. 883–892, 2009.
[12] L. S. Nielsen and J. Sparsø, ‘‘Designing asynchronous
circuits for low power: An IFIR filter bank for a digital
hearing aid,’’ Proc. IEEE, vol. 87, no. 2, pp. 268–281,
1999.
[13] K.-L. Chang et al., ‘‘Synchronous-logic and
asynchronous-logic 8051 microcontroller cores for
realizing the internet of things: A comparative study
on dynamic voltage scaling and variation effects,’’
IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 3, no. 1,
pp. 23–34, 2013.
[14] J. F. Christmann et al., ‘‘Bringing robustness and
power efficiency to autonomous energy harvesting
systems,’’ IEEE Design Test Comput., vol. 28, no. 5,
pp. 84–94, 2011.
[15] F. Aeschlimann et al., ‘‘Asynchronous FIR filters:
Towards a new digital processing chain,’’ in Proc.
Int. Symp. Asynch. Circuits Syst. (ASYNC-04), 2004,
pp. 198–206.
[16] C. Vezyrtzis, S. M. Nowick, and Y. Tsividis, ‘‘A flexible,
event-driven digital filter with frequency response
independent of input sample rate,’’ IEEE J. Solid-State
Circuits (JSSC), vol. 49, no. 10, pp. 2292–2304, 2014.
[17] P. Shepherd et al., ‘‘A robust, wide-temperature data
transmission system for space environments,’’ in
Proc. IEEE Aerospace Conf. (AERO 2013), 2013,
pp. 1812–1819.
[18] F. Peper et al., ‘‘Laying out circuits on asynchronous
cellular arrays: A step towards feasible nanocomputers?’’
Nanotechnology, vol. 14, no. 4, pp. 1651–1661, 2003.
[19] M. Vacca, M. Graziano, and M. Zamboni,
‘‘Asynchronous solutions for nanomagnetic logic
circuits,’’ ACM J. Emerg. Technol. Comput. Syst.
(JETC), vol. 7, no. 4, pp. 15:1–15:18, 2011.
[20] N. Karaki et al., ‘‘A flexible 8b asynchronous
microprocessor based on low-temperature
poly-silicon TFT technology,’’ in Proc. IEEE Int.
Solid-State Circuits Conf. (ISSCC-05), 2005,
pp. 272–273, pg. 598.
[21] M. Kishinevsky et al., Concurrent Hardware: The
Theory and Practice of Self-Timed Design. New York,
NY, USA: Wiley, 1994.
[22] I. E. Sutherland, ‘‘Micropipelines,’’ Commun. ACM,
vol. 32, no. 6, pp. 720–738, 1989.
[23] M. Singh and S. M. Nowick, ‘‘MOUSETRAP:
High-speed transition-signaling asynchronous
pipelines,’’ IEEE Trans. Very Large Scale Integr. (VLSI)
Syst., vol. 15, no. 6, pp. 684–698, 2007.
[24] S. M. Nowick and M. Singh, ‘‘High-performance
asynchronous pipelines: An overview,’’ IEEE Design &
Test, vol. 28, no. 5, pp. 8–22, 2011.
[25] I. Sutherland and S. Fairbanks, ‘‘GasP: A minimal FIFO
control,’’ in Proc. Int. Symp. Asynch. Circuits Syst.
(ASYNC 01), 2001, pp. 46–53.
[26] S. M. Nowick et al., ‘‘Speculative completion for the
design of high-performance asynchronous dynamic
adders,’’ in Proc. Int. Symp. Adv. Res. Asynch.
Circuits Syst. (ASYNC 97), 1997, pp. 210–223.
[27] R. Ginosar, ‘‘Metastability and synchronization:
A tutorial,’’ IEEE Design & Test, vol. 28, no. 5,
pp. 23–35, 2011.
[28] D. J. Kinniment, A. Bystrov, and A. V. Yakovlev,
‘‘Synchronization circuit performance,’’ IEEE
J. Solid-State Circuits, vol. 37, no. 2, pp. 202–209,
2002.
[29] A. Chakraborty and M. R. Greenstreet, ‘‘Efficient
self-timed interfaces for crossing clock domains,’’ in
Proc. 9th IEEE Int. Symp. Asynch. Circuits Syst.
(ASYNC 03), 2003, pp. 78–88.
[30] T. Chelcea and S. M. Nowick, ‘‘Robust interfaces
for mixed-timing systems,’’ IEEE Trans. Very Large
Scale Integr. (VLSI) Syst., vol. 12, no. 8, pp. 857–873,
2004.
Steven M. Nowick is a professor of computer
science at Columbia University, New York, NY, USA.
His research interests include the design and
optimization of asynchronous and mixed-timing
(i.e., GALS) digital systems, scalable and low-
latency on-chip interconnection networks for
shared-memory parallel processors and embedded
systems, extreme low-energy digital systems, neu-
romorphic computing, and variation-tolerant global
communication. He has a PhD degree in computer
science from Stanford University. He is a Fellow of
the IEEE.
Montek Singh is an associate professor of
computer science at the University of North Carolina
at Chapel Hill, NC, USA. His research interests
include asynchronous and mixed-timing circuits and
systems; CAD tools for design, analysis, and optimi-
zation; high-speed and low-power VLSI design; and
applications to emerging computing technologies,
energy-efficient graphics, and image sensing hard-
ware. He has a PhD degree in computer science
from Columbia University, New York, NY, USA.
to Steven M. Nowick, Department of Computer
Science, Columbia University, New York, NY 10027
USA; nowick@cs.columbia.edu; or to Montek Singh,
Department of Computer Science, University of North
Carolina, Chapel Hill, NC 27599 USA; montek@cs.
unc.edu.
... Compared with traditional synchronous circuit design methods, asynchronous circuits eliminate Yu global synchronization by clock-trees, introduce local handshaking process for data transfer, and bring potential advantages in power consumption, electromagnetic emission, as well as automatic speed and energy scaling under supply voltage variations [1][2][3] . In addition, the distributed closed-loop control structure of their asynchronous counterparts has better timing reliability than the inherent open-loop timing control of synchronous circuits [4,5] . ...
... Compared with previous overview papers on asynchronous pipeline circuit design [2,3,7] , an in-depth review of each selected asynchronous pipeline circuit, regarding its handshaking mechanism and data latching scheme behind the circuit implementations, as well as the analysis of its performance and timing constraints based on formal behavior models, is provided in this study. ...
Article
As VLSI technology enters the post-Moore era, there has been an increasing interest in asynchronous design because of its potential advantages in power consumption, electromagnetic emission, and automatic speed scaling capacity under supply voltage variations. In most practical asynchronous circuits, a pipeline forms the micro-architecture backbone, and its characteristics play a vital role in determining the overall circuit performance. In this paper, we investigate a series of typical asynchronous pipeline circuits based on bundled-data encoding, spanning different handshake signaling protocols such as 2-phase (micropipeline, Mousetrap, and Click), 4-phase (simple, semi-decoupled, and fully-decoupled), and single-track (GasP). An in-depth review of each selected circuit is conducted regarding the handshaking and data latching mechanisms behind the circuit implementations, as well as the analysis of its performance and timing constraints based on formal behavior models. Overall, this paper aims at providing a survey of asynchronous bundled-data pipeline circuits, and it will be a reference for designers interested in experimenting with asynchronous circuits.
... The synchronous circuits are pretty popular, and they often use a clock signal to control their operations [1]. These circuits have played a significant role and have dominated the semiconductor industry [2]. This industry has continuously reduced the feature size of transistors and wires for decades. ...
... For this condition, its switching expression also describes the cell function, as presented in Table II and explicitly mentioned by (1). To reduce the number of NMOS transistors used to generate the set block, we convert (1) to (2). Finally, we design the weak feedback inverter. ...
Article
The Null Convention Logic (NCL) based asynchronous circuits have eliminated the disadvantages of the synchronous circuits, including noise, glitches, clock skew, power, and electromagnetic interference. However, using NCL based asynchronous designs was not easy for students and researchers because of the lack of standard NCL cell libraries. This paper proposes a solution to design a semi-static NCL cell library used to synthesize NCL based asynchronous designs. This solution will help researchers save time and effort to approach a new method. In this work, NCL cells are designed based on the Process Design Kit 45nm technology. They are simulated at the different corners with the Ocean script and Electronic Design Automation (EDA) environment to extract the timing models and the power models. These models are used to generate a *.lib file, which is converted to a *.db file by the Design Compiler tool to form a complete library of 27 cells. In addition, we synthesize the NCL based full adders to illustrate the success of the proposed library and compare our synthesis results with the results of the other authors. The comparison results indicate that power and delay are improved significantly.
... Synchronous circuits have played a significant role and have dominated the semiconductor industry [1]. This industry has continuously diminished the wire and transistor dimension. ...
Article
The Null Convention Logic (NCL) based asynchronous design technique has interested researchers because this technique had overcome disadvantages of the synchronous technique, such as noise, glitches, clock skew and power. However, using the NCL-based asynchronous design method is difficult for university students and researchers because of the lack of standard NCL cell libraries. Therefore, in this paper, a novel flow is proposed to design NCL cell libraries. These libraries are used to synthesize NCL-based asynchronous designs. We chose the static NCL cell library to illustrate the proposed design solution because this library is one of the most basic NCL libraries. Static NCL cells in this library are designed based on the Process Design Kit 45nm technology and are implemented by the Virtuoso and the Design Compiler (DC) tool. In addition, the Ocean script and Electronic Design Automation (EDA) environment are used for supporting designs and simulations. A complete library of 27 NCL cells was designed to serve for study and research. We also implemented synthesis for NCL full adders using this library and compared our synthesis results with the results of other authors. The comparison results indicated that our results were a 20% improvement on power consumption.
... In a single-rail bit encoding, one wire represents a single bit [2], whereas a dual-rail encoding requires two wires to represent a single bit. Both datapaths require handshake protocols [7], either two-phase or four-phase [8] [9], comprising Request (Req) and Acknowledge (Ack) signals, to synchronize various stages. The single-rail encoding scheme necessitates the placement of a delay element, typically an inverter chain or a counter, in the Req line to slow down request signals to match the speed of the datapath. ...
Article
Asynchronous systems are native to a full custom domain. Their implementation using auto place-and-route tools requires dynamic calibration of interconnects delays in addition to the placement of predefined static delay elements. This paper presents a completion detector for a single-rail bit encoded datapath that, as an adaptive-delay element, eliminates the need to insert any predefined delay element and caters to routing delays dynamically. A programmable pulse-generator is also proposed that empowers the designers to generate clock signals based on the timing report obtained from the CAD tool to drive various synchronous subsystems and embedded resources like BRAMs in FPGAs. Employing these components, we present an asynchronous pipeline model with implicit control to expedite migration from the traditional synchronous pipelines to their asynchronous counterparts. A single-rail bit encoded datapath has been used to utilize chip area effectively instead of a delay-insensitive dual-rail datapath, and a two-phase handshake protocol has been adopted as opposed to a four-phase handshake protocol to lower handshaking overhead. A RISC processor validates the proposed asynchronous pipeline model, exhibiting a smooth functionality and power-delay parameter comparable to that of a synchronous pipeline, in addition to ease of routing and avoiding clock skews in a complex system-on-chip.
Chapter
This paper aims to design low-power circuits like an adder, buffer, AOI, and logic gates using asynchronous quasi delay insensitive (QDI) templates, essential for many arithmetic computations. The adder is a fundamental building block for applications like ALUs, microprocessors, and digital signal processors. The widely used parallel prefix adder KSA is modeled and verified with various asynchronous QDI templates. The prominent templates include pre-charged half buffer, autonomous signal validity half buffer, and sense amplifier half buffer in dual-rail encoding style. Due to clock circuitry, synchronous circuits exhibit more switching activity, which dissipates more power. This drawback is overcome through clock-less circuits, which dissipate less power by reducing the switching activity without degrading the functionality of a circuit. An asynchronous circuit has various timing approaches such as QDI, delay insensitive, and bundled data. Still, the QDI approach has significant advantages in power dissipation, delay, and energy savings. The major drawback of QDI templates is the completion detector block, which dissipates more power and incurs a large area overhead. To overcome this drawback, an advanced QDI template, the sense amplifier half buffer, is designed to provide low power and less delay with efficient energy use, owing to the utilization of a sub-threshold process and a controlled reset transistor in the evaluation block, and the absence of a completion detector circuit. The paper describes the performance aspects of a 32-bit KSA using various QDI templates in terms of multiple metrics like power, delay, and energy using the Mentor Graphics EDA tool.
Keywords: Clock-less circuits; QDI templates; Sense amplifier half buffer; Pre-charged half buffer; Autonomous signal validity half buffer; KSA
Article
In this paper, the architecture of an application-specific integrated circuit for adaptive metasurfaces is presented. The architecture allows scalable networking over large metasurfaces and reconfiguration of each unit-cell with unique complex impedance settings, for adjusting metasurface electromagnetic performance. The one-chip-design metasurface array, includes self-initialization/addressing, configuration-packet routing, and network adaptation for fault-tolerant dynamic metasurfaces. An asynchronous design is adopted for power efficiency and high scalability, whereas speed is enhanced through multilevel pipelining. The ability to dynamically form reconfigurable networks in conjunction with the clockless operation helps the metasurface to cover arbitrary physical shapes, implemented on both rigid or flexible substrates. The architecture is validated through transistor-level simulations of the complete design implemented in a commercially available process. Then, a variety of scalable and/or flexible networks are presented, exploiting the strengths offered by the architecture, to demonstrate its capabilities and estimated performance. The demonstrated results of the architecture and its networking capabilities, allow the realization of chip-to-chip programmability of metasurfaces with up to 218 ASICs. The individual ASICs can be reconfigured in $2 ~\mu \text{s}$ and consume only $342 ~\mu \text{W}$ of static power.
Article
This paper presents a clockless digital filter able to process inputs of different rates and formats, synchronous or asynchronous, with no adjustment needed to handle each input type. The filter is designed using a mix of asynchronous and real-time digital hardware, and for this reason relies on neither a clock nor the input data rate for setting its frequency response. The modular architecture of the filter, including delay segments with separated data and timing paths and a pipelined multi-way adder, allows easy extensions for different data widths. The filter was used as part of an ADC/DSP/DAC system which maintains its frequency response intact for varying sample rates without requiring any internal change. This property is not possible for any synchronous DSP system. The 16-tap, 8-bit FIR filter, integrated in a 130 nm CMOS process, includes on-chip automatic delay tuning, and for certain inputs, has signal-to-error ratio which exceeds that of clocked systems.
Article
Editor’s note: Pipelining is a key element of high-performance design. Distributed synchronization is at the same time one of the key strengths and one of the major difficulties of asynchronous pipelining. It automatically provides elasticity and on-demand power consumption. This tutorial provides an overview of the best-in-class asynchronous pipelining methods that can be used to fully exploit the advantages of this design style, covering both static and dynamic logic implementations.
Conference Paper
The design of a commercially-shipping 72-port 10G Ethernet switch router integrated circuit is presented. The 1.2 billion transistor chip consists of a core of > 1GHz asynchronous circuits surrounded by standard synchronous logic for external interfaces. It is manufactured in a TSMC 65nm process. The asynchronous circuitry includes 15MB of single-ported SRAM, 150KB of dual-ported SRAM, 100KB of TCAM, Tb bandwidth crossbars, and a fully pipelined programmable packet processor processing one billion packets per second. The design implementation relied heavily on a novel tool flow utilizing both commercial and proprietary EDA tools for automatic place-and-route of asynchronous layout.
Article
This seminal book presents a new logically determined design methodology for designing clockless circuit systems. The book presents the foundations, architectures and methodologies to implement such systems. Based on logical relationships, it concentrates on digital circuit system complexity and productivity to allow for more reliable, faster and cheaper products. Transcends shortcomings of Boolean logic. Presents theoretical foundations, architecture and analysis of clockless (asynchronous) circuit design. Contains examples and exercises making it ideal for those studying the area.
Article
A new asynchronous low-latency interconnection network is introduced for a 2D mesh topology. The network-on-chip, named AEoLiAN, contains a fast lightweight monitoring network to notify the routers of incoming traffic, thereby allowing arbitration and channel allocation to be initiated in advance. In contrast, several recent synchronous early arbitration methods require significant resource overhead, including use of hybrid networks, or wide monitoring channels and additional VCs. The proposed approach has much smaller overhead, allowing a finer-grain router-by-router early arbitration, with monitoring and data advancing independently at different speeds. The new router was implemented in 45nm technology using a standard cell library. It had 52% lower area than a similar lightweight synchronous switch, xpIPesLite, with no early arbitration capability. Network-level simulations were then performed on 6 diverse synthetic benchmarks in an 8×8 2D mesh network topology, and the performance of the new network was compared to an asynchronous baseline. Considerable improvements in system latency over all benchmarks for moderate traffic were obtained, ranging from 34.4-37.9%. Interestingly, the proposed acceleration technique also enabled throughput gains, ranging from 14.7-27.1% for the top 5 benchmarks. In addition, a zero-load end-to-end latency of only 4.9ns was observed, for the longest network path through 15 routers and 14 hops.
Conference Paper
In this paper we explore the use of asynchronous routers in a time-division-multiplexed (TDM) network-on-chip (NOC), Argo, that is being developed for a multi-processor platform for hard real-time systems. TDM inherently requires a common time reference, and existing TDM-based NOC designs are either synchronous or mesochronous. We use asynchronous routers to achieve a simpler, smaller and more robust, self-timed design. Our design exploits the fact that pipelined asynchronous circuits also behave as ripple FIFOs. Thus, it avoids the need for explicit synchronization FIFOs between the routers. Argo has interesting elastic timing properties that allow it to tolerate skew between the network interfaces (NIs). The paper presents the Argo NOC architecture and provides a quantitative analysis of its ability to absorb skew between the NIs. Using a signal transition graph model and realistic component delays derived from a 65 nm CMOS implementation, a worst-case analysis shows that a typical design can tolerate a skew of 1-5 cycles (depending on FIFO depths and NI clock frequency). Simulation results of a 2 × 2 NOC confirm this.
Conference Paper
We present a novel tree arbiter cell that allows pipelined processing of asynchronous requests. In this way it can achieve significantly lower delay in the critical case of frequent requests coming from different clients. We elaborate the necessary extension to facilitate a cascaded use of this cell in a tree-like fashion, and we show by theoretical analysis that in this configuration our cell provides better fairness than the standard approach. We implement our approach and quantitatively compare its performance properties with related work in a gate-level simulation. In our sample asynchronous Networks-on-Chip application our new cell proves to increase the throughput of three different designs available in the literature by approximately 61.28%, 69.24%, and 186.85% respectively.
Article
Inspired by the brain’s structure, we have developed an efficient, scalable, and flexible non–von Neumann architecture that leverages contemporary silicon technology. To demonstrate, we built a 5.4-billion-transistor chip with 4096 neurosynaptic cores interconnected via an intrachip network that integrates 1 million programmable spiking neurons and 256 million configurable synapses. Chips can be tiled in two dimensions via an interchip communication interface, seamlessly scaling the architecture to a cortexlike sheet of arbitrary size. The architecture is well suited to many applications that use complex neural networks in real time, for example, multiobject detection and classification. With 400-pixel-by-240-pixel video input at 30 frames per second, the chip consumes 63 milliwatts.
Conference Paper
The state of the art in rad-hard and wide-temperature IC fabrication is briefly reviewed, then extensions to the design range of an existing HTSOI process are described. NULL Convention Logic is described in relation to temperature and radiation insensitivity. Finally, an 8-bit data transfer system is described and simulated using NCL and the HTSOI process. The system shows good results over a 400 °C operating range.