Asynchronous Design, Part 1: Overview and Recent Advances
Steven M. Nowick
University of North Carolina at Chapel Hill
THERE HAS BEEN A continuous growth of interest
in asynchronous design over the last two decades,
as engineers grapple with a host of challenging
trends in the current late-Moore era. As highlighted
in the International Technology Roadmap for Semi-
conductors (ITRS), these include dealing with the
impact of increased variability, power and thermal
bottlenecks, high fault rates (including due to soft
errors), aging, and scalability issues, as individual
chips head to the multibillion-transistor range and
many-core architectures are targeted.
While the synchronous, i.e., centralized clock,
paradigm has prevailed in industry for several decades,
asynchronous design, or the use of a hybrid mix
of asynchronous and synchronous components,
provides the potential for ‘‘object-oriented’’ distributed
hardware systems, which naturally support modular
and extensible composition and on-demand operation
without extensive instrumented power management.
As highlighted by the ITRS
report, it is therefore increas-
ingly viewed as a critical
component for addressing
the above challenges.
This article aims to pro-
vide both a short historical and technical overview
of asynchronous design, and also a snapshot of the
state of the art, with highlights of some recent
exciting technical advances and commercial in-
roads. It also covers some of the remaining
challenges, as well as opportunities, of the field.
Asynchronous design is not new: some of the
earliest processors used clockless techniques. Over-
all, its history can be divided into roughly four eras.
The early years, from the 1950s to the early 1970s,
included the development of classical theory
(Huffman, Unger, McCluskey, Muller),
as well as use of asynchronous design in a number
of leading commercial processors (Illiac, Illiac II,
Atlas, MU-5) and graphics systems (LDS-1). The
middle years, from the mid-1970s to early 1980s,
were largely an era of retrenchment, with reduced
activity, corresponding to the advent of the synchronous
VLSI era. The mid-1980s to late 1990s
represented a revival or ‘‘coming-of-age’’ era, with
the beginning of modern methodologies for asyn-
chronous controller and pipeline design, initial
computer-aided design (CAD) tools and optimization
techniques, the first academic microprocessors
(Caltech, University of Manchester, Tokyo Institute of
Technology), and initial commercial uptake for use
in low-power consumer products (Philips Semiconductors)
and high-performance interconnection

An asynchronous design paradigm is capable of addressing the impact of
increased process variability, power and thermal bottlenecks, high fault
rates, aging, and scalability issues prevalent in emerging densely packed
integrated circuits. The first part of the two-part article on asynchronous
design presents a chronicle of past and recent commercial advances, as
well as technical foundations, and highlights the enabling role of asynchronous
design in two application areas: GALS systems and networks-on-chip.
Partha Pratim Pande, Washington State University

2168-2356/15 © 2015 IEEE. May/June 2015. Copublished by the IEEE CEDA, IEEE CASS, IEEE SSCS, and TTTC.
Color versions of one or more of the figures in this paper are available online.
Digital Object Identifier 10.1109/MDAT.2015.2413759
Date of publication: 16 March 2015; date of current version: 04 June 2015.

The modern era, from the early 2000s to present,
includes a surge of activity, with modernization of
design approaches, CAD tool development and
systematic optimization techniques, migration into
on-chip interconnection networks, several large-scale
demonstrations of cost benefits, industrial uptake at
leading companies (IBM, Intel) as well as startups, and
application to emerging technologies (sub-/near-
threshold circuits, sensor networks, energy harvesting,
cellular automata). The approaches in the modern era
bear little resemblance to some of the simple
asynchronous examples found in older textbooks.
This article is divided into two parts. Part 1 begins
with a chronicle of past and recent commercial
advances, and highlights the enabling role of
asynchronous design in several emerging application
areas. Two promising application domains are
covered in more detail, GALS systems and networks-on-chip,
given their importance in facilitating the
integration of large-scale heterogeneous systems-on-chip.
Finally, several foundational techniques are
introduced: handshaking protocols and data encod-
ing, pipelining, and synchronization and arbitration.
Part 2 focuses on methodologies for the design of
asynchronous systems, including logic- and high-
level synthesis; tool flows for design, analysis,
verification and test; as well as examples of
asynchronous processors and architectures.
Asynchronous design has successfully migrated
into commercial products from leading companies
in recent years. In addition, there have been a
number of industry experiments using asynchro-
nous design that were quite successful, even though
products did not appear on the market. There are
also several exciting emerging application areas
where asynchronous design is expected to play a
key enabling role.
There have been several promising examples of
commercialization of asynchronous designs over
the last decade or two, with significant cost benefits.
Philips Semiconductors: low-power embedded
controllers. In the late 1990s through early 2000s,
Philips Semiconductors (now NXP) achieved much
commercial success with its asynchronous 80C51
microcontroller. The chip was initially aimed for
use in pager chipsets, and the motivation was to
lower electromagnetic interference (EMI) noise
emissions so that the microcontroller could operate
harmoniously with the radio-frequency (RF) data
link, without the use of shielding. As a result,
encoding and decoding of RF data could now be
performed in software instead of requiring a custom
fixed-function circuit, allowing easier upgrades to
functionality, which was not possible with their
synchronous version. It also demonstrated a 4×
power reduction over a comparable synchronous
design in the same technology. The microcontroller
was later used in smart cards for public transport,
where the wide range of operating voltages allowed
the cards to be battery-free and contactless, powered
merely by the brief burst of charge induced when the
user’s hand waves it through the magnetic field of the
card reader. An enhanced version of the asynchro-
nous microcontroller (SmartMX) is now used in
more than 75 countries, including the European
Union and more recently the United States, for
biometric passports and IDs. At last count, the
number of copies sold has exceeded 700 million.
Intel/Fulcrum Microsystems: Ethernet switch chips.
In 2011, Intel acquired Fulcrum Microsystems, an
asynchronous startup producing high-speed net-
working chips, in a move industry analysts regarded
as a bid to compete with Cisco Systems. Intel’s
current FM5000/FM6000 family of switch chips,
which supports industry-leading 40 gigabit Ethernet,
includes a fully-asynchronous high-speed crossbar
switch that provides high bandwidth, low latency,
support for flexible link topologies, and high energy
efficiency. The crossbar bandwidth of over 1 terabit
per second (in a 130 nm process) is achieved through pipelining at the
granularity of individual gates, unencumbered by a
rigid clock period. When operated at below
peak throughput, these chips are highly energy-
efficient, since the asynchronous logic provides
the benefit of automatic power-down of inactive
circuitry without additional instrumentation. This
combination of features is considered unique in
IEEE Design & Test
Asynchronous Design, Part 1: Overview and Recent Advances
Achronix: high-performance FPGAs. Another example
of asynchronous commercialization is the
Speedster 22i family of FPGAs by Achronix Semiconductor.
Manufactured in 22 nm, these chips
can operate at 1.5 GHz, and are currently claimed as
the world’s fastest FPGAs, yet they incur a fraction of
the design cost and operating energy cost of leading
synchronous designs. Their key to achieving fast
operation is the use of asynchronous fine-grain bit-
level pipelines, thereby relaxing the constraints of
global synchronization. This asynchronous internal
implementation is transparent to the end user: the
applications mapped can be synchronous, just as on
any typical FPGA.
IBM: TrueNorth neuromorphic computer. There has
been much excitement recently about neuro-
morphic computing, which seeks to mimic the
functioning of the human brain by using massively-
parallel computer systems. In a departure from the
traditional von Neumann architecture, these sys-
tems employ a highly-distributed memory that is
tightly integrated with a large number of parallel
computational elements that model neurons. Due to
the spatially-distributed nature of computation and
communication, along with wide timing unpredict-
ability of data events, neuromorphic computing fits
well with the asynchronous paradigm. One interesting
example is TrueNorth, released in August
2014, which is the largest chip ever developed at
IBM, with 5.4 billion transistors. It integrates 4096
neurosynaptic cores on a single chip, modelling
1 million neurons and 256 million synapses. This
scale of integration poses a formidable physical
design challenge, which was successfully met using
fully-asynchronous processing elements and interconnection
network. The event-driven asynchronous
operation facilitates extremely low power (70 mW
for real-time operation over the entire chip), which
would be extremely difficult to obtain with current
synchronous techniques. Other neuromorphic
computers (University of Manchester’s SpiNNaker
machine from the Furber group and Stanford University’s
Neurogrid from the Boahen group) also use fully-asynchronous
communication networks.
Other commercial designs. Several other industrial
applications have explored asynchronous benefits.
In the early 1990s, two commercial processors from
HaL Computer Systems used a self-timed floating-
point divider, which was reported as 2–3.5 times
faster than the leading commercial synchronous
dividers. At Sun Microsystems, now Oracle, high-
speed asynchronous pipelines were used commer-
cially in UltraSPARC IIIi computers for smoothing
out timing discrepancies in the interface to ultra-
fast memories. Theseus Logic has used a NULL
convention logic (NCL) methodology to develop
chips that are robust to extreme variations in
manufacturing and operating conditions. Octasic
Inc. has recently developed a clockless DSP
technology that takes advantage of the highly-variable
data-dependent execution times of different
arithmetic operations to achieve a 3× throughput
increase over comparable synchronous implemen-
tations. Finally, Tiempo IC has developed low-power
robust microprocessors, and also has exploited
another intrinsic property of asynchronous circuits,
the lack of a coherent power or electromagnetic
emission signature, to develop chips with secure
cryptographic functionality that have increased resil-
ience to side-channel attacks.
There have also been several industry experi-
ments with asynchronous design that, though quite
successful, did not appear in commercial products.
Intel RAPPID. In this experimental project at Intel
in the mid-1990s, an asynchronous implementation
of the IA32 instruction-length decoder was undertaken
because of severe performance bottlenecks
that could not be overcome in their commercial
version using synchronous techniques. The
project focused on making the decoding of the
length of the most common CISC instructions fast by
exploiting concurrency at sub-clock-period granu-
larity, thereby significantly outperforming the exist-
ing synchronous Intel implementation: three times
higher throughput, comparable area, half the
latency, and half the power.
IBM FIR filter. At IBM Research, a project was
undertaken jointly with Columbia University to
develop a mixed synchronous-asynchronous imple-
mentation of a finite impulse response (FIR) filter for
use in the read channels of modern disk drives.
The goal was to reduce the filter’s latency over its
wide range of operating frequencies. In a synchro-
nous implementation that is deeply pipelined for
speed, the latency becomes poorer when the data
rate, and hence the clock recovered from it, slows
down. The hybrid synchronous/asynchronous imple-
mentation replaced the core of the filter with an
asynchronous pipelined unit featuring a fixed latency,
while the remaining circuitry was kept synchronous.
The resulting chip exhibited a 50% reduction in worst-
case latency, along with a 15% throughput im-
provement, over IBM’s leading commercial clocked
implementation in the same technology.
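The latency argument above can be illustrated with a small back-of-the-envelope calculation. The sketch below is not taken from the IBM design: the stage counts, clock rates, and the fixed latency of the asynchronous core are hypothetical values chosen only to show the trend, and the function names are illustrative. It assumes a clocked pipeline takes one clock period per stage, so its latency grows as the recovered clock slows, while an asynchronous core contributes a latency that does not depend on the clock.

```python
def sync_pipeline_latency_ns(stages: int, clock_mhz: float) -> float:
    """Latency of a clocked pipeline, assuming one clock period per stage."""
    period_ns = 1000.0 / clock_mhz
    return stages * period_ns


def hybrid_latency_ns(sync_stages: int, clock_mhz: float, async_core_ns: float) -> float:
    """Latency of a few clocked wrapper stages plus a fixed-latency asynchronous core."""
    return sync_pipeline_latency_ns(sync_stages, clock_mhz) + async_core_ns


# Hypothetical parameters, for illustration only (not measured values).
FULL_SYNC_STAGES = 10   # depth of an all-synchronous, deeply pipelined filter
WRAPPER_STAGES = 2      # clocked stages kept around the asynchronous core
ASYNC_CORE_NS = 8.0     # fixed latency assumed for the asynchronous core

for clock_mhz in (1000.0, 500.0, 250.0):
    full_sync = sync_pipeline_latency_ns(FULL_SYNC_STAGES, clock_mhz)
    hybrid = hybrid_latency_ns(WRAPPER_STAGES, clock_mhz, ASYNC_CORE_NS)
    print(f"{clock_mhz:7.1f} MHz: synchronous {full_sync:5.1f} ns, hybrid {hybrid:5.1f} ns")
```

Under these assumptions the gap between the two styles widens as the data rate drops, which is the worst-case regime the hybrid design targets.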
Emerging application areas
Beyond more classical design targets, a number of
novel application areas have recently emerged where
asynchronous design is poised to make an impact.
Large-scale heterogeneous system integration. In
multi- and many-core processors and systems-on-chip
(SoCs), some level of asynchrony is inevitable
in the integration of heterogeneous components.
Typically, there are several distinct timing domains,
which are glued together using an asynchronous
communication fabric. There has been much recent
work on asynchronous and mixed synchronous-
asynchronous systems (see ‘‘GALS Systems’’ and
‘‘Networks-on-Chip’’ sidebars).
Ultra-low-energy systems and energy harvesting.
Asynchronous design is also playing a crucial role in
the design of systems that operate in regimes where
energy availability is extremely limited. In one application,
Liu and Rabaey demonstrated the benefits
of asynchrony for sub-threshold operation, with circuits
consuming 32% lower energy than a sub-threshold
synchronous counterpart. In another
application, a collaboration with Oticon Inc., a leading
hearing-aid manufacturer, Nielsen and Sparsø
proposed an IFIR filter that achieves a 5× power
savings over a commercial synchronous counterpart
by dynamically adapting the numerical range of the
arithmetic circuitry to each individual sample, since
most audio samples are numerically small.
Such fine-grain adaptation, in which the datapath
latency can vary subtly for each input sample, is not
possible in a fixed-rate synchronous design. In a recent
in-depth case study by Chang et al., focusing on
ultra-low-energy 8051 microcontroller cores with
voltage scaling, it was shown that under extreme process,
voltage, and temperature (PVT) variations, a
synchronous core requires its delay margins to be
increased by a factor of 12, while a comparable
asynchronous core can operate at actual speed.
Finally, Christmann et al. have designed an energy
harvesting approach to implement an autonomous
sensing application, with a reported 40% power-efficiency
gain over synchronous approaches.
The use of asynchronous logic provided greater
energy efficiency not only due to its event-driven
nature, but also by allowing graceful adaptivity to the
highly-variable power availability.
Continuous-time digital signal processors (CT-DSPs).
Another intriguing direction is the development
of continuous-time digital signal processors,
where input samples are generated at irregular rates
by a level-crossing analog-to-digital converter, depending
on the actual rate of change of the input’s waveform.
An early specialized approach, using finely
discretized sampling, demonstrated a 10× power reduction
in a speech processing application. The
first general-purpose continuous-time DSP was recently
proposed by Vezyrtzis et al.; unlike synchronous
DSPs, it maintains its frequency response
intact over varying sample rates and can support
multiple input formats without any internal change,
eliminates all aliasing, and demonstrates a signal-to-
error ratio for certain inputs which exceeds that of
clocked systems. The fine-grain asynchronous pro-
cessing of irregular sampling is fundamental to the
operation of these systems, and would not be possible
to support by conventional synchronous techniques.
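The idea of level-crossing sampling can be sketched in a few lines of Python. This is a simplified software model, not the circuit of any system cited above: the function name, the level spacing `delta`, and the fine time grid `step` (used only to detect crossings in simulation) are assumptions made for illustration. A sample is emitted only when the input moves into a different quantization band, so the sample rate follows the rate of change of the waveform.

```python
import math


def level_crossing_samples(signal, t_end, step, delta):
    """Return (time, level) pairs, one per detected crossing of a quantization level.

    signal: callable mapping time to amplitude
    step:   fine simulation grid used to detect crossings (a modelling artifact)
    delta:  spacing between adjacent quantization levels
    Simplification: at most one sample is emitted per grid step.
    """
    samples = []
    n_steps = int(round(t_end / step))
    last_band = math.floor(signal(0.0) / delta)
    for i in range(1, n_steps + 1):
        t = i * step
        band = math.floor(signal(t) / delta)
        if band != last_band:
            # The level that was crossed is the boundary between the two bands.
            samples.append((t, max(band, last_band) * delta))
            last_band = band
    return samples


def slow_tone(t):
    return math.sin(2 * math.pi * 1.0 * t)   # hypothetical 1 Hz input


def fast_tone(t):
    return math.sin(2 * math.pi * 5.0 * t)   # hypothetical 5 Hz input


slow = level_crossing_samples(slow_tone, t_end=1.0, step=1e-4, delta=0.25)
fast = level_crossing_samples(fast_tone, t_end=1.0, step=1e-4, delta=0.25)
idle = level_crossing_samples(lambda t: 0.3, t_end=1.0, step=1e-4, delta=0.25)
print(len(slow), len(fast), len(idle))
```

In this model a constant input produces no samples at all, and a faster-changing input produces more samples over the same interval, which is why the downstream processing has to cope with irregular event timing.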
Handling extreme environments. Asynchronous
approaches have also been explored to handle ex-
treme environments, such as for space missions,
where temperatures can vary widely. For example,
an asynchronous 8-bit data transfer system has been
designed that is fully operational over a 400 °C
temperature range, from −175 °C to +225 °C, using
high-temperature SOI technology. The implementation
also shows good resilience to single-event
Alternative computing paradigms. Finally, there is
increasing interest in asynchronous circuits as the
organizing backbone of systems based on emerging
computing technologies, such as cellular nano-arrays
and nanomagnetics, where highly-robust
asynchronous approaches are crucial to mitigating
timing irregularities, both layout-induced and those
resulting from the vagaries of quantum behavior.
Asynchronous approaches were also shown to be a
good match for flexible electronics, in the Seiko/
Epson ACT11 microprocessor, where the physical
bending and manipulation of the material can
introduce unpredictable and large delay variations.
While the above applications are in a wide
variety of areas, they exhibit a few commonly-recurring
themes, representing beneficial opportunities:
1) Extreme fine-grain pipelining: the ability to
implement and exploit extremely fine-grain
(even gate- and bit-level) pipelines, unconstrained
by the need to distribute a high-speed
fixed-rate clock [Intel/Fulcrum, Achronix, HaL,
2) Data-dependent completion times: to support and
micro-architect systems which can exploit subtle
and fine-grain differences in data-dependent
completion time, i.e., at a subcycle granularity
[Intel RAPPID, Oticon IFIR, Octasic];
3) Avoiding challenges due to the rigidity of global
timing: supporting new computation paradigms
[CT-DSPs, cellular nano-arrays, nanomagnetics,
flexible electronics], extreme micro-parallelism
[Intel RAPPID], dynamic computational adaptiv-
ity [IBM FIR], and ease of large-scale system integ-
ration [GALS, NoCs, neuromorphic computing];
4) Robustness to voltage, temperature and process
variation: allowing flexible accommodation of dy-
namic timing variations [energy harvesting, sub-
threshold computing, space applications]; and
5) On-demand, i.e. event-driven, operation: highly
energy-proportional computing, without the
need for extensive instrumentation of clock-gating
at multiple design levels [nearly all applications].
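The second theme, data-dependent completion time, can be made concrete with a standard textbook illustration that does not come from any of the designs listed above: the carry chain of a ripple-carry adder. The sketch below assumes, as a deliberate simplification, that addition time is proportional to the longest run of bit positions a carry ripples through. The helper names, the 32-bit width, and the random operands are all illustrative assumptions. A clocked adder must budget for the full-width chain on every cycle, whereas a circuit that detects completion only waits for the chain that actually occurs.

```python
import random


def longest_carry_chain(a: int, b: int, width: int) -> int:
    """Longest run of consecutive bit positions that one carry ripples through."""
    longest = run = 0
    for i in range(width):
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        if ai & bi:            # generate: a new carry starts at this position
            run = 1
        elif (ai ^ bi) and run:  # propagate: an incoming carry passes through
            run += 1
        else:                  # kill, or nothing to propagate
            run = 0
        longest = max(longest, run)
    return longest


def mean_chain(width: int, trials: int, seed: int = 0) -> float:
    """Average longest carry chain over uniformly random operand pairs."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        total += longest_carry_chain(rng.getrandbits(width), rng.getrandbits(width), width)
    return total / trials


WIDTH = 32  # hypothetical datapath width
print(f"worst case: {WIDTH} positions, average over random inputs: {mean_chain(WIDTH, 10_000):.2f}")
```

For random operands the average chain is much shorter than the worst case, which is the kind of gap between typical and worst-case delay that the data-dependent designs above exploit at sub-cycle granularity.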
While the above themes capture promising oppor-
tunities, and the current industrial uptake indicates
increasing commercial viability and interest, there
remain open issues and challenges that asynchronous
design must overcome to gain wider industry adoption.
A brief discussion appears in the conclusion to Part 2.
Asynchronous systems are typically organized as
a set of components which communicate and
synchronize using handshaking channels. These
channels are defined by two key parameters: com-
munication protocol and data encoding. Building on
these fundamentals, complex asynchronous systems
can be modularly constructed. In particular, we re-
view how these techniques can be used to construct
asynchronous pipelines, which are building blocks
for many high-performance systems. We then address
two basic issues in assembling asynchronous components
in larger systems: synchronization, which is
required when interfacing asynchronous and syn-
chronous domains; and arbitration, which is needed
to allow the safe competition of multiple asynchro-
nous components for a shared resource.
Classes of asynchronous circuits
Asynchronous circuits fall into several distinct
classes, depending on the degree of timing robustness
assumed in their operation. This spectrum typically
defines a robustness-performance space, where the
more robust circuits tend to have lower performance
(but not always!). Delay-insensitive circuits operate
correctly regardless of gate and wire delays. Quasi-delay-insensitive
(QDI) circuits operate
correctly assuming arbitrary gate delays, but all wires
at each fanout point must have roughly equal delays,
i.e., an isochronic fork assumption. Speed-independent
(SI) circuits (an earlier term) assume arbitrary gate
delays, but all wire delays are assumed to be zero.
Other asynchronous circuits require additional timing
constraints, including hold time constraints (i.e.,
‘‘fundamental mode’’) on controllers, one-sided
‘‘bundled data’’ timing constraints
on datapaths (see below), as well as two-sided constraints
(i.e., both short- and long-path).
Protocols and data encoding
The basic structure of an asynchronous commu-
nication channel, between a sender and receiver, is
shown in Figure 1. Ignoring data transmission for now,
the channel is typically implemented by two wires:
req and ack.
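To make the role of the two wires concrete, the following Python sketch lists the wire events on req and ack under the two handshaking conventions discussed in this section: four-phase (return-to-zero) and two-phase (transition signalling). It is an abstract event trace rather than a circuit model; the function names are illustrative, and it assumes both wires start at zero and that the sender and receiver strictly alternate.

```python
def four_phase(n_transactions: int):
    """Wire events for n four-phase (return-to-zero) transactions.

    Each transaction takes four events: req rises, ack rises,
    req falls, ack falls, returning both wires to zero.
    """
    events = []
    for _ in range(n_transactions):
        events += [("req", 1), ("ack", 1), ("req", 0), ("ack", 0)]
    return events


def two_phase(n_transactions: int):
    """Wire events for n two-phase (transition-signalling) transactions.

    Each transaction takes two events: one toggle on req, then one toggle on ack.
    """
    req = ack = 0
    events = []
    for _ in range(n_transactions):
        req ^= 1
        events.append(("req", req))
        ack ^= 1
        events.append(("ack", ack))
    return events


print(four_phase(1))
print(two_phase(2))
```

The two printed traces are identical as wire activity, but the four-phase trace represents one transaction and the two-phase trace represents two: transition signalling spends half as many wire events per transaction, at the cost of logic that must respond to both rising and falling edges.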
Two common handshaking protocols are used to
define a single communication transaction, as
illustrated in Figure 2: 1) a four-phase protocol
(return-to-zero [RZ]), and 2) a two-phase protocol
(non-return-to-zero [NRZ], also known as transition-
signalling). In a four-phase protocol, req and ack
signals are initially at zero. The sender initiates
a transaction by asserting req, and the receiver
An alternative to constructing fully-asynchronous systems
is a hybrid approach, integrating synchronous components
(i.e. cores, memories, accelerators, I/O units, etc.) using an
asynchronous communication network, which together form a
globally-asynchronous locally-synchronous (GALS) system [1]-[3].
For some applications, this approach provides the best of both
worlds: allowing the design reuse of synchronous intellectual
property (IP) blocks, while combining them with flexible asynchronous
interconnect as a global integrative medium. The
elimination of fixed-rate global clocking on the communication
network can provide a highly-scalable, low-power and robust
mechanism for assembling complex systems.
A GALS approach was first published in 1980 by Chuck
Seitz [4], in an influential overview of asynchronous design; the
approach was used even earlier by Evans & Sutherland Computer
Corp. in its first commercial graphics system, LDS-1.
The term was later introduced and formalized by Chapiro [1].
Synchronous components interact with the asynchronous
network using either asynchronous/synchronous interface circuits
or pausible clocks [2], [3]. Fig. A shows a
simplified GALS system with four synchronous cores.
Each core can operate as a separate voltage-frequency
island (VFI), which is connected to a
switch (SW) of an asynchronous communication
network through an associated network interface (NI). A number
of successful GALS multicore architectures have been developed [2], [3], [5]-[7],
including those supporting fine-grain
message passing [8] and latency-insen-
As a recent example, STMicroelectronics’ Platform 2012
(P2012) [10] includes a fully-asynchronous network-on-chip [11],
supporting a highly-reconfigurable accelerator-based many-core
GALS architecture which facilitates fine-grain power, reliability
and variability management. The first prototype delivered
80 GOPS performance with only 2 W power consumption.
It has evolved recently into the company’s STHORM Platform.
Interesting specialized applications which benefit from a
GALS architecture include large-scale neuromorphic systems
(see “Applications” section), as well as approaches to enhance
resilience to side-channel attacks [12]. Another interesting
approach with a GALS-like flavor is “proximity communication,”
which aims to overcome the latency bottleneck of inter-chip
communication by exploiting capacitive coupling at their interfaces,
instead of using traditional wired interconnect [13].
Overall, there is a surge of interest and activity in GALS
design, in both industry and academia, as systems become
larger-scale and more heterogeneous, and variability and
timing unpredictability become critical factors.
As an addendum, it is worth noting that the term “GALS”
has at times been stretched to describe non-GALS systems.
In its original and widely-used sense [1], [3], [4], a GALS system includes
a fully-asynchronous interconnection network, including
handshaking channels, to integrate synchronous components.
Systems containing multiple synchronous cores,
operating at different clock rates, which are directly connected
using synchronizers, e.g. bi-synchronous FIFOs, are
more properly referred to as multi-synchronous systems [7].
Fig. A. A multicore GALS
A useful general classification scheme for mixed-timing
systems was proposed by Messerschmitt [3], [14], based on the
relationship between different clock domains. In a mesochronous
system, synchronous components operate at exactly
the same frequency, but with unknown yet stable phase difference [15].
In a plesiochronous system, synchronous components
operate at the same nominal frequency, but may have
a slight frequency mismatch, e.g. a few parts per million. Finally,
in a heterochronous system, synchronous components
can operate at arbitrary unrelated frequencies.
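The three categories can be sketched as a simple decision rule on the frequencies of two clock domains. The function below is only an illustration of the taxonomy: its name and the default 100 ppm tolerance are hypothetical choices, since the text says only that a plesiochronous mismatch is slight, e.g. a few parts per million. Frequencies alone also cannot establish the stable phase relationship that a mesochronous system requires, so identical frequencies are taken here as a stand-in for clocks derived from a common source.

```python
def classify_clock_relation(f1_hz: float, f2_hz: float, nominal_match_ppm: float = 100.0) -> str:
    """Classify two clock domains by the relationship between their frequencies.

    nominal_match_ppm is a hypothetical tolerance below which two clocks are
    treated as having the same nominal frequency.
    """
    if f1_hz <= 0 or f2_hz <= 0:
        raise ValueError("clock frequencies must be positive")
    if f1_hz == f2_hz:
        # Exactly the same frequency; the phase difference is unknown but assumed stable.
        return "mesochronous"
    mismatch_ppm = abs(f1_hz - f2_hz) / min(f1_hz, f2_hz) * 1e6
    if mismatch_ppm <= nominal_match_ppm:
        # Same nominal frequency with a slight mismatch.
        return "plesiochronous"
    # Arbitrary unrelated frequencies.
    return "heterochronous"


print(classify_clock_relation(1e9, 1e9))
print(classify_clock_relation(1e9, 1e9 + 5e3))
print(classify_clock_relation(1e9, 7.5e8))
```

In practice the category determines which interface is appropriate, since a stable phase offset, a slowly drifting one, and fully unrelated clocks place different demands on the synchronizing circuitry.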
1. D. M. Chapiro, “Globally-asynchronous locally-synchronous
systems,” Ph.D. dissertation, Dept. Comput. Sci., Stanford
Univ., Stanford, CA, USA, 1984.
2. M. Krstic et al., “Globally asynchronous, locally synchronous
circuits: Overview and outlook,” IEEE Des. Test,
vol. 24, no. 5, pp. 430–441, 2007.
3. P. Teehan, M. Greenstreet, and G. Lemieux, “A survey
and taxonomy of GALS design styles,” IEEE Des. Test,
vol. 24, no. 5, pp. 418–428, 2007.
4. C. L. Seitz, “System Timing,” in Introduction to VLSI Systems,
C. A. Mead and L. A. Conway, Eds. Addison-Wesley,
1980, pp. 218–262.
5. J. Muttersbach, T. Villiger, and W. Fichtner, “Practical
design of globally-asynchronous locally-synchronous
systems,” in Proc. 6th IEEE Int. Symp. Adv. Res. ASYNC,
2000, pp. 52–59.
6. S. Moore et al., “Point to point GALS interconnect,” in
Proc. 8th IEEE Int. Symp. ASYNC, 2002, pp. 69–75.
7. A. Sheibanyrad, A. Greiner, and I. Miro-Panades, “Multi-synchronous
and fully asynchronous NoCs for GALS
architectures,” IEEE Des. Test, vol. 25, no. 6, pp.
8. N. J. Boden et al., “Myrinet: A gigabit-per-second local
area network,” IEEE Micro, vol. 15, no. 1, pp. 29–36,
9. M. Singh and M. Theobald, “Generalized latency-insensitive
systems for single-clock and multi-clock architectures,”
in Proc. ACM/IEEE DATE, 2004, pp. 1008–1013.
10. L. Benini et al., “P2012: building an ecosystem for a
scalable, modular and high-efficiency embedded computing
accelerator,” in Proc. ACM/IEEE DATE, 2012,
11. Y. Thonnart, P. Vivet, and F. Clermidy, “A fully-asynchronous
low-power framework for GALS NoC integration,” in Proc.
ACM/IEEE DATE, 2010, pp. 33–38.
12. R. Soares et al., “A robust architectural approach for
cryptographic algorithms using GALS pipelines,” IEEE
Des. Test, vol. 28, no. 5, pp. 62–71, 2011.
13. D. Hopkins et al., “Circuit techniques to enable 430 Gb/s/mm²
proximity communication,” in Proc. IEEE ISSCC,
2007, pp. 368–369.
14. D. G. Messerschmitt, “Synchronization in digital system
design,” IEEE J. Sel. Areas Commun., vol. 8, no. 8,
pp. 1404–1419, Oct. 1990.
15. M. R. Greenstreet, “Implementing a STARI Chip,” in Proc.
ICCD, 1995, pp. 38–43.
Over the last decade, networks-on-chip (NoCs) have become
the de facto standard approach for structured on-chip
communication, both for low-power embedded systems and
high-performance chip multi-processors [1]. These on-chip networks
typically replace traditional ad hoc bus-based communication
with packet switching, and can be targeted to a
variety of cost functions (fault tolerance, power, latency, saturation
throughput, quality-of-service [QoS]) and parameters (network
topology, channel width, routing strategies).
Since the NoC approach separates the communication infrastructure,
and its timing, from processing elements, it is a
natural match for an asynchronous paradigm. Asynchronous
interconnect eliminates the need for global clock management
across a large network, thereby providing better support for
scalability, timing robustness and low power, and avoids the
challenge of instrumenting complex clock-gating in a highly
distributed communication structure.
A number of asynchronous and GALS NoCs have been proposed
in the last decade or so. An early approach, Chain [2],
used delay-insensitive codes on channels for crosstalk mitigation
and ease of physical design, and was effectively applied
to an ARM-based smart-card chip. Several approaches
to support QoS have been proposed, including combining
guaranteed service (GS) and best effort (BE) traffic [3], as well as
multiple service levels [4]. A comprehensive asynchronous NoC
framework has been developed to provide dynamic voltage
and frequency scaling (DVFS) and fine-grain power management [5],
while other approaches have targeted fault tolerance [6]
and arbitration for high-radix switches [7]. Automated design
flows are also being developed [8], [9], leveraging commercial
synchronous CAD tools, which use directives to meet asynchronous
path timing and latch constraints, as well as judicious
control of optimization modes to avoid the introduction
of hazards. A recent asynchronous time-division-multiplexed
(TDM) NoC demonstrates correct operation without any global
synchronization, while tolerating significant skews on different
Power and performance benefits of asynchronous NoCs
have been demonstrated for high-performance shared-memory
chip multi-processors [11] and Ethernet switch chips [12], as well
as their facilitation of extreme fine-grain power management
and flexible integration of many-core GALS architectures (see
STHORM processor discussion in “GALS” sidebar). The end-to-end
latency benefits of asynchronous NoCs over synchronous
NoCs have also been demonstrated [8], [9], [11], [12], due to the low
forward latency of individual asynchronous router nodes, and
the ability of packets to advance without continual alignment
to a global clock.
As a recent example, an asynchronous NoC switch architecture [9],
using single-rail bundled data and two-phase communication,
obtained a 45% reduction in average energy-per-packet
and 71% area reduction compared to a highly-optimized synchronous
single-cycle NoC switch, xpipes Lite, in the same
40 nm technology. Additional latency benefits have been obtained
using low-overhead early arbitration techniques [13].
One interesting emerging domain where asynchronous and
GALS NoCs have played a key role is the development of
neuromorphic chips (see “Applications” section). These rely on
the scalability and ease-of-integration of asynchronous interconnect,
and the inherent event-driven operation for low power.
One recent example, IBM’s TrueNorth, integrates 4096 neurosynaptic
cores on a single chip, which models 1 million neurons
and 256 million synapses, in the largest chip developed
to date by IBM (5.4 billion transistors), where the large-scale
integration is facilitated by using a fully-asynchronous NoC.
Overall, the NoC area is a promising arena where the integrative
benefits of asynchronous design are making important
1. W. J. Dally and B. Towles, “Route packets, not wires: On-chip
interconnection networks,” in Proc. ACM/IEEE DAC, 2001,
2. J. Bainbridge and S. Furber, ‘‘Chain: A delay-insensitive chip
area interconnect,’’ IEEE Micro, vol. 22, no. 5, pp. 16–23,
3. T. Bjerregaard and J. Sparsø, “A router architecture for con-
nection-oriented service guarantees in the MANGO clock-
less network-on-chip,” in Proc. ACM/IEEE DATE, 2005, pp.
4. R. Dobkin et al., “An asynchronous router for multiple ser-
vice levels networks on chip”, in Proc. 11th IEEE Int. Symp.
ASYNC, 2005, pp. 44–53.
5. E. Beigne et al., “Dynamic voltage and frequency scaling ar-
chitecture for units integration within a GALS NoC,” in Proc.
ACM NOCS, 2008, pp. 129–138.
6. M. Imai and T. Yoneda, “Improving dependability and perfor-
mance of fully asynchronous on-chip networks,” in Proc. 17th
IEEE Int. Symp. ASYNC, 2011, pp. 65–76.
7. S. R. Naqvi and A. Steininger, “A tree arbiter cell for high
speed resource sharing in asynchronous environments,” in
Proc. ACM/IEEE DATE, 2014.
8. Y. Thonnart, E. Beigne, and P. Vivet, “A Pseudo-synchronous
implementation flow for WCHB QDI asynchronous circuits,”
in Proc. 18th IEEE Int. Symp. ASYNC, 2012, pp. 73–80.
9. A. Ghiribaldi, D. Bertozzi, and S. M. Nowick, “A transition-
signaling bundled data NoC switch architecture for cost-ef-
fective GALS multicore systems,” in Proc. ACM/IEEE DATE,
2013, pp. 332–337.
10. E. Kasapaki and J. Sparsø, “Argo: A time-elastic
time-division-multiplexed NoC using asynchronous routers,” in Proc.
20th IEEE Int. Symp. ASYNC, 2014, pp. 45–52.
11. M. N. Horak et al., “A low-overhead asynchronous intercon-
nection network for GALS chip multiprocessors,” IEEE Trans.
Comput.-Aided Des. Integr. Circuits Syst., vol. 30, no. 4, pp.
12. A. Lines, “Asynchronous interconnect for synchronous SoC
design,” IEEE Micro, vol. 24, no. 1, pp. 32–41, 2004.
13. W. Jiang et al., “A lightweight early arbitration method for low-
latency asynchronous 2D-mesh NoC’s,” in Proc. ACM/IEEE
May/June 2015 11
responds by asserting ack, in the active or evaluate
phase. The two signals are then deasserted, in turn,
in the return-to-zero or reset phase. In contrast, in a
two-phase protocol, there is no return-to-zero phase:
a single toggle on req indicates a request, followed
by a toggle on ack to indicate an acknowledge.
Both two-phase and four-phase protocols are
widely used, with interesting tradeoffs between
them. A four-phase protocol has the benefit of return-
ing interfaces to a unique state, i.e., all-zero, which
typically simplifies hardware design. It is also a good
match for dynamic logic, where the return-to-zero (RZ) phase
directly corresponds to the precharge phase.
However, the protocol requires two complete round-trip
channel communications per transaction, which
can result in lower throughput. A two-phase protocol
may involve more complex hardware design, but only
requires one round-trip communication per transaction,
which can provide higher throughput, and
sometimes still has quite low complexity.
Alternative protocols using pulse-mode or single-track
handshaking have also been proposed.
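The difference in signaling cost between the two protocols can be made concrete with a small behavioral sketch. It is only an illustration under simplifying assumptions: wires are modeled as bits, delays are ignored, and the function names (`four_phase_transaction`, `two_phase_transaction`, `run`) are hypothetical, not taken from any tool or from this article.

```python
def four_phase_transaction(req, ack):
    """Return the (req, ack) states visited during one four-phase transaction."""
    assert (req, ack) == (0, 0), "a four-phase channel idles at all-zero"
    # Active phase: req+, ack+; return-to-zero phase: req-, ack-.
    return [(1, 0), (1, 1), (0, 1), (0, 0)]


def two_phase_transaction(req, ack):
    """Return the (req, ack) states visited during one two-phase transaction."""
    assert req == ack, "a two-phase channel idles when req and ack agree"
    req ^= 1                 # a single toggle on req is the request
    after_req = (req, ack)
    ack ^= 1                 # a single toggle on ack is the acknowledge
    return [after_req, (req, ack)]


def run(protocol, n_transactions):
    """Run n transactions; return (total wire transitions, final wire state)."""
    state, transitions = (0, 0), 0
    for _ in range(n_transactions):
        trace = protocol(*state)
        transitions += len(trace)   # each step changes exactly one wire
        state = trace[-1]
    return transitions, state
```

For three transactions, `run(four_phase_transaction, 3)` counts 12 transitions and always ends at the all-zero state, while `run(two_phase_transaction, 3)` counts 6 and ends with both wires at 1, which shows why two-phase interfaces do not return to a unique idle state.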
Once a communication protocol for a channel
has been defined, data communication is typically
needed. The data itself typically replaces the single
req wire in the above example. There are two
common data encoding schemes: 1) delay-insensitive
(DI) codes, and 2) single-rail bundled data.
Figure 3a illustrates delay-insensitive encoding on
a simple two-bit example. A common approach, dual-
rail encoding, is used for each bit, X and Y, which are
each encoded using two rails or wires (X1/X0 for X,
Y1/Y0 for Y). Assuming a four-phase protocol, all wires
are initially zero, and each bit is encoded as 00,
representing a NULL or spacer token (i.e., no valid
data). A one-hot encoding scheme is used: the
transmission of a 1 (0) value on X involves asserting
wire X1 (X0) high, and similarly for the transmission
on bit Y. Once the receiver has obtained a complete
valid codeword, it asserts ack. The reset phase then
occurs, where data and ack are deasserted in turn. A
completion detector (CD) is used by the receiver to
identify when a valid codeword has been received.
Dual-rail encoding is widely used,
and is one simple instance of a delay-insensitive
code. In particular, note that, regardless of the trans-
mission time and relative skew of the distinct bits,
the receiver can unambiguously identify when every
bit is valid, by checking for the arrival of a legal
codeword (01 or 10) on each bit. As a result, this ap-
proach provides great resilience to physical and
operating variability. Alternative DI codes have also
been widely explored, providing cost tradeoffs in
coding efficiency, dynamic power, and hardware
overhead, including 1-of-4, m-of-n, systematic,
level-encoded dual-rail (LEDR) and level-encoded
transition-signalling (LETS) codes.
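The dual-rail scheme and its completion detector can be sketched behaviorally as follows. This is a minimal model, assuming each bit is a pair (rail1, rail0) as in the X1/X0 naming above; the helper names are hypothetical and the code says nothing about circuit-level implementation.

```python
NULL = (0, 0)  # spacer: no valid data on this bit


def encode_bit(value):
    """Dual-rail encode one bit as (rail1, rail0): 1 -> (1, 0), 0 -> (0, 1)."""
    return (1, 0) if value else (0, 1)


def bit_valid(pair):
    """A bit is valid when exactly one of its two rails is high."""
    return pair in ((1, 0), (0, 1))


def completion_detect(word):
    """True once every bit of the word holds a legal codeword."""
    return all(bit_valid(pair) for pair in word)


def decode(word):
    """Recover the binary values from a complete dual-rail word."""
    if not completion_detect(word):
        raise ValueError("incomplete codeword")
    return [rail1 for rail1, _rail0 in word]


# Two-bit example: X = 1 arrives before Y = 0 (arbitrary skew between bits).
word = [NULL, NULL]
word[0] = encode_bit(1)
x_only_complete = completion_detect(word)   # still waiting for Y
word[1] = encode_bit(0)
both_complete = completion_detect(word)
```

However the two bits are skewed, the detector only reports completion after both have arrived, which is the delay-insensitive property described above.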
Figure 1. An asynchronous channel.
Figure 2. Asynchronous handshake protocols.
Figure 3b shows an alternative encoding approach,
single-rail bundled data. A standard synchronous-style
data channel is used, i.e., with binary encoding. One extra
req wire is added, serving as a ‘‘bundling signal’’ or local
strobe, which must arrive at the receiver after all data bits
are stable and valid. Both four-phase and two-phase bundled
protocols are widely used.
Interestingly, the bundled data scheme allows arbitrary
glitches on the data channel, as long as data becomes stable
and valid before the req signal is transmitted. Typically,
data must remain valid from before the req is transmitted
to after an ack is received. Therefore, the scheme facilitates
the use of synchronous-style computation blocks. It also
provides good coding efficiency, with only one extra
req wire added to the datapath.
However, unlike DI codes, a one-
sided timing constraint must be
enforced: the req delay must
always be longer than worst-
case data transmission. To sup-
port this constraint, a small
matched delay is added, either
an inverter chain or carefully
replicated portion of the critical
path. Unlike in a clocked system,
though, this is a localized con-
straint: stages can be highly
unbalanced, each with its own
distinct matched delay. More-
over, the timing margins can be
made fairly tight because some parameters (e.g.,
process, voltage, temperature) tend to be locally
Finally, a hybrid scheme, called speculative
completion, uses bundled data, but also allows
variable-latency completion, including better than
worst-case, based on the actual data inputs. High-
performance parallel prefix adders (Brent-Kung,
Kogge-Stone) have been demonstrated, operating
at faster rates than synchronous designs.
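The one-sided bundling constraint described earlier for single-rail bundled data (the req strobe must arrive after the slowest data bit) can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names and all delay values are hypothetical, and a real flow would take these numbers from static timing analysis.

```python
def bundling_margin(data_delays_ps, req_wire_delay_ps, matched_delay_ps):
    """Slack of the req strobe behind the slowest data bit (positive is safe)."""
    req_arrival = req_wire_delay_ps + matched_delay_ps
    return req_arrival - max(data_delays_ps)


def constraint_met(data_delays_ps, req_wire_delay_ps, matched_delay_ps,
                   required_margin_ps=0):
    """One-sided check: req must trail worst-case data by the required margin."""
    margin = bundling_margin(data_delays_ps, req_wire_delay_ps, matched_delay_ps)
    return margin > required_margin_ps


# Hypothetical per-bit data delays for one stage, in picoseconds.
stage_delays = [310, 420, 395]
safe = constraint_met(stage_delays, req_wire_delay_ps=100, matched_delay_ps=350)
unsafe = constraint_met(stage_delays, req_wire_delay_ps=100, matched_delay_ps=300)
```

Because the check is local to a stage, each stage can carry its own matched delay sized to its own worst-case path.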
Pipelining is a fundamental technique to in-
crease concurrency and boost throughput in high-
performance digital systems. All modern high-speed
processors, multimedia and graphics units, and
signal processors are pipelined. In a typical pipe-
lined implementation, complex function blocks are
subdivided into smaller blocks, registers are inserted
to separate them, and a clock is applied to all re-
gisters. In an asynchronous system, no global clock
is used and, instead, the interaction of neighboring
stages is coordinated by a handshaking protocol.
Developing better pipeline protocols and their
efficient circuit-level implementation has been the
focus of many researchers over the past two to three
decades. We review three leading representative
styles, starting with the seminal work of Sutherland.
More details can be found in a recent survey.
Sutherland’s micropipeline. Figure 4 shows a
basic micropipeline, which uses a two-phase
handshaking protocol and single-rail bundled data.
Each interface between adjacent stages has single-rail
data and a bundling signal (req_i) going forward,
and an acknowledgment (ack_i) going backward. A
delay element is added to match the worst-case
delay of the corresponding logic block.
The pipeline operates according to a so-called
capture-pass protocol. The protocol is implemented
using a simple control chain of Muller C-elements
(with inversions on the right inputs),
operating on a set of specialized capture-pass
datapath latches. The latches are initially all normally
transparent, unlike synchronous pipelines, so the
entire pipeline forms a flow-through combinational
path. Locally, only after data advances through an
individual stage’s latches, the corresponding request
req_{i-1} causes a transition on the C (i.e., capture)
control input, which makes those latches opaque,
thereby storing and protecting the data. Once data
advances through the next stage’s latches, where the
data is safely stored, a transition on the P (i.e., pass)
control input via ack_{i+1}, makes the current stage’s
latches transparent again, completing an entire cycle.
The latches indicate the completion of capture and
pass operations via Cd (capture done) and Pd (pass
done) outputs, respectively. Effectively, each data item
initiates a ‘‘wavefront,’’ which advances through the
pipeline and is protected by a series of latch-capture
operations. Predecessor stages, behind the wavefront,
are subsequently freed up through a series of pass
operations, once data has been safely copied to the
next stage. The old data can then be overwritten by
the next wavefront.
A C-element is a basic asynchronous storage element;
assuming inputs A and B, the output is 1 (0) if both inputs are
1 (0), otherwise it holds its prior value.
Figure 3. Asynchronous data encoding schemes. (a) Dual-rail encoding;
(b) single-rail bundled data.
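The C-element behavior used in the micropipeline control chain (output follows the inputs when they agree, and holds otherwise) can be written as a tiny state-holding model. This is a behavioral sketch only; the class name `CElement` is hypothetical and no gate-level structure is implied.

```python
class CElement:
    """Behavioral Muller C-element with two inputs and one stored output."""

    def __init__(self, initial=0):
        self.out = initial

    def update(self, a, b):
        """Set the output to the common input value, or hold it if inputs differ."""
        if a == b:
            self.out = a
        return self.out


c = CElement()
outputs = [
    c.update(1, 0),  # inputs disagree: hold 0
    c.update(1, 1),  # both 1: output rises
    c.update(0, 1),  # inputs disagree: hold 1
    c.update(0, 0),  # both 0: output falls
]
```

An inverted right input, as in the micropipeline control chain, can be modeled by passing the complement of the second signal to `update`.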
Although micropipelines require specialized com-
ponents for implementation, they are remarkable in
the simplicity and elegance of their structure and
operation, and have inspired
several more advanced ap-
proaches. Their introduction by
Sutherland also provided deeper
insights into the nature of asyn-
chronous systems and triggered
a resurgence of research activity
in asynchronous design.
Mousetrap pipeline. We developed Mousetrap at Columbia
University to be a high-performance
pipeline that supports the use of
an entirely standard cell methodology.
Although its
two-phase capture-pass protocol
is based on that of micropipe-
lines, it has simpler control cir-
cuits and data latches, with much lower area and
delay overheads. Figure 5 shows a basic Mousetrap
pipeline. The local control for each stage is only a
single combinational exclusive-NOR (XNOR) gate,
and the storage for each stage is a single bank of level-
sensitive D-latches, both of which are available in
standard cell libraries.
Figure 4. Sutherland’s micropipeline.
Figure 5. Mousetrap pipeline.
Although the implementation is quite different,
the overall operation is similar to that of micropipelines.
Assume that all req_i and ack_i signals
are initially at 0, and all the data latches are therefore
transparent. As new data arrives into stage i from
the left, and passes through the latch, the corresponding
req_i bundling signal toggles. As a result, the stage’s
XNOR toggles from 1 to 0, thereby capturing data in
the latch. It also requests the next data item from its left
neighbor by toggling ack_i. Subsequently, when stage i
receives an acknowledgment ack_{i+1} from its right
neighbor, stage i’s XNOR toggles back to 1, making
stage i’s latch transparent, and completing the cycle.
The relatively lightweight control and storage struc-
tures allow the pipeline to achieve high throughput:
2.4 giga items/s FIFOs (in 180 nm), and a greatest
common divisor (GCD) test chip at 2.1 GHz (in
130 nm technology). Mousetrap circuits have been
used in several recently-proposed asynchronous
NoC designs (for example, see Horak et al. and
Kasapaki/Sparsø in the ‘‘Networks-on-Chip’’ sidebar).
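The Mousetrap control behavior described above can be explored with a small untimed model of a FIFO (a pipeline with no logic between stages). The sketch makes several assumptions that should be read as modeling choices rather than as the published circuit: each stage's latched request bit (`done`) doubles as its request to the right and its acknowledgment to the left, delays are ignored, and the class and method names are hypothetical.

```python
class MousetrapFifo:
    """Untimed behavioral model of a Mousetrap-style FIFO."""

    def __init__(self, n_stages):
        self.done = [0] * n_stages    # latched req per stage (also ack to the left)
        self.data = [None] * n_stages
        self.req_in = 0               # sender's req, toggles once per item
        self.data_in = None
        self.ack_out = 0              # receiver's ack, toggles once per item

    def _enable(self, i):
        """Latch enable of stage i: XNOR of its own req and the ack from the right."""
        last = i + 1 == len(self.done)
        right_ack = self.ack_out if last else self.done[i + 1]
        return self.done[i] == right_ack

    def _settle(self):
        """Let data flow through transparent latches until nothing changes."""
        changed = True
        while changed:
            changed = False
            for i in range(len(self.done)):
                req = self.req_in if i == 0 else self.done[i - 1]
                dat = self.data_in if i == 0 else self.data[i - 1]
                if self._enable(i) and req != self.done[i]:
                    # New item passes the latch; the req toggle closes it.
                    self.done[i], self.data[i] = req, dat
                    changed = True

    def can_put(self):
        return self.req_in == self.done[0]   # previous request was acknowledged

    def put(self, item):
        assert self.can_put()
        self.data_in = item
        self.req_in ^= 1
        self._settle()

    def can_get(self):
        return self.done[-1] != self.ack_out  # an unacknowledged item is waiting

    def get(self):
        assert self.can_get()
        item = self.data[-1]
        self.ack_out ^= 1
        self._settle()
        return item


fifo = MousetrapFifo(3)
for item in "abc":
    fifo.put(item)
drained = [fifo.get() for _ in range(3)]
```

In this model the first item flows through all three transparent latches and is held at the last stage until the receiver acknowledges it, while later items queue up behind it in order.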
GasP pipeline. GasP was developed at Sun Micro-
systems Laboratories to push the limit of achievable
performance by using an aggressive custom circuit
style for specialized applications. A distinctive
feature is that, instead of the usual pair of request
and acknowledge wires, each control channel be-
tween adjacent stages consists of a single wire, i.e.,
‘‘single-track channel,’’ allowing bi-directional com-
munication. Handshaking is performed via carefully
generated pulses: a forward request transition sets the
state of the control channel, and a subsequent
reverse acknowledgment transition resets the chan-
nel state. Hence, GasP effectively combines the
benefits of both two-phase and four-phase protocols
on a single wire. Circuit designs are highly optimized
for delay, but the pulse-based protocol imposes two-
sided timing constraints (i.e., short and long path
requirements), requiring careful balancing of path
delays to ensure correct operation.
Dynamic logic pipelines. Dynamic logic datapaths
are common in high-performance systems, especially
in the core of ALUs in high-speed microprocessors
and ASICs, despite the
greater design and validation
effort required. Interestingly,
dynamic logic is an especially
good match for asynchronous
pipelines. In particular, local
handshaking obviates the need
for the complex and carefully
controlled multiphase clocking
that is typical of synchronous
dynamic circuits, and noise-in-
duced delay variations can be
robustly handled through the use
of DI encoding. Furthermore,
a unique feature of many asynchronous
dynamic pipelines is that they are entirely latchless,
storing data directly on logic block outputs with
keepers. As a result, dynamic logic pipelines have
been used in several recent high-performance
asynchronous commercial products.
We review the PS0 pipeline style by Williams and
Horowitz, which was used in the design of
high-speed floating-point dividers at HaL Computers
in the 1990s, and was influential on much subse-
Figure 6 shows the basic structure of a PS0 pipe-
line; each stage consists of a function block composed
of dynamic logic, and a completion detector (CD).
The datapath uses DI coding (in particular, dual-rail),
and there are no explicit registers between adjacent
stages. Each function block alternates between an
evaluate phase and a precharge phase. Initially, the
function block outputs are reset to 0, and the block is in the
evaluate phase, awaiting data inputs. In the evaluate
phase, each block computes after its data inputs
arrive. In the precharge phase, the function block is
reset, with all its outputs returning to 0. The CD iden-
tifies when the stage’s computation is complete, or
when its outputs have been reset to 0. The single input
control for each stage, Prech/Eval, is connected
from the output of the next stage’s CD. The interaction
between stages follows a simple protocol: a stage is
precharged whenever the next stage finishes evalua-
tion, and a stage is enabled to evaluate whenever the
next stage finishes its precharge. This protocol ensures
that two successive wavefronts of data are always
separated by a reset spacer. The use of fast dynamic
logic without latches yields purely combinational
execution times, even for iterative computations that
are implemented using self-timed rings.
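The PS0 stage interaction rule (precharge when the next stage has evaluated, evaluate when the next stage has precharged) can be exercised with a unit-delay simulation. The sketch below is a simplified model under stated assumptions: every stage and the environment update simultaneously from the previous state, function blocks are identity functions, `None` stands for the all-zero spacer, and the sender and receiver are assumed to follow the same discipline. The name `simulate_ps0` is hypothetical.

```python
def simulate_ps0(n_stages, items):
    """Push items through an n-stage PS0-style pipeline; return (received, trace)."""
    out = [None] * n_stages            # stage outputs; None models the reset spacer
    src, sink_busy = None, False       # sender output and receiver completion signal
    pending, received, trace = list(items), [], []
    while True:
        # Completion detector seen by stage i: has its right neighbor evaluated?
        done_right = [out[i + 1] is not None for i in range(n_stages - 1)]
        done_right.append(sink_busy)
        inputs = [src] + out[:-1]
        new_out = list(out)
        for i in range(n_stages):
            if out[i] is None and not done_right[i] and inputs[i] is not None:
                new_out[i] = inputs[i]     # evaluate
            elif out[i] is not None and done_right[i]:
                new_out[i] = None          # precharge
        new_src, new_busy = src, sink_busy
        if src is None and pending and out[0] is None:
            new_src = pending.pop(0)       # sender presents the next token
        elif src is not None and out[0] is not None:
            new_src = None                 # sender returns to the spacer
        if not sink_busy and out[-1] is not None:
            received.append(out[-1])       # receiver consumes a token
            new_busy = True
        elif sink_busy and out[-1] is None:
            new_busy = False
        if (new_out, new_src, new_busy) == (out, src, sink_busy):
            return received, trace         # nothing left to do
        out, src, sink_busy = new_out, new_src, new_busy
        trace.append(tuple(out))


received, trace = simulate_ps0(3, ["a", "b", "c"])
```

Inspecting `trace` shows that two different data tokens never occupy adjacent stages at the same time, which is the spacer-separation property stated above. A real PS0 pipeline relies on circuit timing rather than lockstep updates, so the trace is indicative only.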
Figure 6. PS0 pipeline.
A number of other dynamic pipeline styles have
been proposed, with a range of tradeoffs in
performance, robustness and other cost metrics. These
include dynamic GasP by Ebergen et al. (Sun
Microsystems); PCHB/PCFB by Lines; high-capacity
(HC) and lookahead pipelines (LP) by Singh and
Nowick; IPCMOS by Schuster et al. (IBM Research);
and single-track styles by Beerel et al. Asynchronous
pipelines have been used commercially in Sun’s
UltraSPARC IIIi computers for fast memory access; in
Achronix’s Speedster 22i FPGAs; in the Ethernet
switch chips of Intel/Fulcrum Microsystems; and
experimentally at IBM Research for a low-latency
finite-impulse response (FIR) filter chip.
Synchronization and arbitration
Two related capabilities are needed when han-
dling the continuous-time operation of an asynchro-
nous system: synchronization and arbitration.
Synchronization involves the interfacing of asyn-
chronous and synchronous systems, or two unrelat-
ed synchronous systems, where, at the boundary
crossing, an asynchronous signal must be safely
realigned to a clock domain.
A good overview of the topic has been presented by
Ginosar. Any direct connection of asynchronous inputs
to synchronous registers can cause setup time violations,
resulting in metastable operation and possible failure,
such as storing of intermediate voltage values or even
oscillatory behavior. The first detailed published results
identifying and evaluating metastability were presented
in 1973 by Chaney and Molnar.
The classic solution for a single bit is to provide a
basic synchronizer: double or triple flip-flops in
series, to ensure sufficient stabilization time to pro-
duce a clean synchronous output, with extremely
high mean-time-between-failure (MTBF). Detailed
synchronizer performance analysis has been pro-
posed, which considers the impact of noise and
thermal effects, along with directions to improve
circuit design. More general solutions have
been proposed for synchronization blocks which
support buffering and flow control. The
approach by Chakraborty and Greenstreet provides
an integrated study of synchronizing two clock
domains, ranging from mesochronous to
heterochronous communication.
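The MTBF of a flip-flop synchronizer is commonly estimated with the first-order model MTBF = exp(S / tau) / (T_W * F_C * F_D), where S is the allowed resolution time, tau the resolution time constant, T_W the metastability window, and F_C and F_D the clock and data rates. The formula is the standard textbook estimate rather than something stated in this article, and every numeric value below is a hypothetical placeholder, not measured data.

```python
import math


def synchronizer_mtbf(resolve_time_s, tau_s, window_s, f_clk_hz, f_data_hz):
    """First-order MTBF estimate, in seconds, for a flip-flop synchronizer."""
    return math.exp(resolve_time_s / tau_s) / (window_s * f_clk_hz * f_data_hz)


# Hypothetical technology and workload parameters.
TAU = 10e-12          # resolution time constant
WINDOW = 20e-12       # metastability window
F_CLK = 1e9           # receiving clock frequency
F_DATA = 100e6        # rate of asynchronous input events
PERIOD = 1 / F_CLK

two_flop = synchronizer_mtbf(1 * PERIOD, TAU, WINDOW, F_CLK, F_DATA)
three_flop = synchronizer_mtbf(2 * PERIOD, TAU, WINDOW, F_CLK, F_DATA)
```

Under this model each additional flip-flop adds roughly one clock period of resolution time, so the MTBF grows by a factor of exp(PERIOD / TAU), which is why double or triple flip-flops are usually sufficient.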
Figure 7. Mixed-clock FIFO.
Figure 8. A two-way arbiter. (a) Block diagram, (b) timing, and (c) implementation.
Figure 7 illustrates an example of a mixed-clock
FIFO by Chelcea and Nowick, which can interface
two arbitrary clock domains, a sender (put
interface) and a receiver (get interface). The design
is one of a complete family of modular mixed-timing
interfaces, including other variants to support mixed
asynchronous/synchronous communication which
are needed in GALS systems. The FIFO is constructed
as a simple token ring, with pointers to head and tail
locations. Each interface operates independently at
its own clock rate, and data items do not move once
deposited. Full and empty detectors are used to avoid
overflow and underflow, respectively. A novel feature
is that only three synchronizers are required, regard-
less of the number of cells in the ring, hence it is
highly scalable. It also avoids synchronization perfor-
mance penalties in steady-state communication
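The token-ring organization of the FIFO (data items stay where they are deposited, and put and get tokens circulate around the ring) can be sketched as a purely functional model. The sketch deliberately omits what makes the real design interesting, namely the two clock domains and the synchronizers on the full and empty detectors; the class name `TokenRingFifo` and its methods are hypothetical.

```python
class TokenRingFifo:
    """Functional model of a ring of cells with separate put and get tokens."""

    def __init__(self, n_cells):
        self.cells = [None] * n_cells
        self.valid = [False] * n_cells
        self.put_token = 0     # cell that accepts the next put (tail)
        self.get_token = 0     # cell that serves the next get (head)

    def full(self):
        return all(self.valid)

    def empty(self):
        return not any(self.valid)

    def put(self, item):
        """Deposit an item; return False instead of overflowing."""
        if self.full():
            return False
        self.cells[self.put_token] = item
        self.valid[self.put_token] = True
        self.put_token = (self.put_token + 1) % len(self.cells)
        return True

    def get(self):
        """Remove the oldest item; return None instead of underflowing."""
        if self.empty():
            return None
        item = self.cells[self.get_token]
        self.valid[self.get_token] = False
        self.get_token = (self.get_token + 1) % len(self.cells)
        return item


ring = TokenRingFifo(2)
accepted = [ring.put(1), ring.put(2), ring.put(3)]   # third put hits a full ring
first = ring.get()
```

Only the tokens move; an item is written once into its cell and read once from the same cell, which matches the description that data items do not move once deposited.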
Arbitration involves the resolution of two or more
competing signals requesting a shared resource. In
synchronous design, it is a simple operation: at the
start of each clock cycle, existing requests are exam-
ined and one is selected as a winner. In asynchro-
nous design, however, inputs arrive in continuous
time, and resolution must be guaranteed to be clean
and safe, regardless of the signal arrival times.
The basic component to resolve a two-way
asynchronous arbitration is a mutual-exclusion ele-
ment (mutex, or ME), shown in Figure 8, due to Seitz.
This analog component guarantees a hazard-free
output. In principle, as two competing inputs arrive
closer together, its resolution time, i.e., latency, can
become arbitrarily long. In practice, though, only
extremely close spacing of samples (e.g., <1 ps)
will result in a relatively long delay. More complex
asynchronous arbiters typically use mutexes as
building blocks. Two- and four-way arbiters are
fundamental components in router nodes in asyn-
chronous NoC’s. N-way asynchronous arbiters have
also been proposed, as well as priority arbiters.
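The functional contract of a two-way mutex (at most one grant at a time, with a waiting request served on release) can be captured in a short model. This sketch covers only the logical behavior: the analog resolution of near-simultaneous requests is not modeled, requests are assumed to arrive one call at a time, and the class name `Mutex` and its methods are hypothetical.

```python
class Mutex:
    """Functional model of a two-way mutual-exclusion element."""

    def __init__(self):
        self.requesting = [False, False]
        self.owner = None              # index of the granted requester, if any

    def grants(self):
        """Return the pair of grant outputs (g0, g1)."""
        return (self.owner == 0, self.owner == 1)

    def request(self, who):
        """Raise request line `who`; grant it only if the resource is free."""
        self.requesting[who] = True
        if self.owner is None:
            self.owner = who
        return self.grants()

    def release(self, who):
        """Lower request line `who`; hand over to a waiting requester, if any."""
        self.requesting[who] = False
        if self.owner == who:
            other = 1 - who
            self.owner = other if self.requesting[other] else None
        return self.grants()


m = Mutex()
g_after_r0 = m.request(0)      # requester 0 wins the free resource
g_after_r1 = m.request(1)      # requester 1 must wait
g_after_rel = m.release(0)     # grant passes to requester 1
```

Larger arbiters can be composed from such two-way elements, in line with the remark that more complex asynchronous arbiters typically use mutexes as building blocks.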
IN THIS PART, we have presented an overview of
some key advances of asynchronous design, and
discussed emerging application areas where asyn-
chrony is poised to play a critical role. We have also
reviewed technical foundations, as well as highlight-
ed recent developments in GALS and NoC design.
Part 2 of this article focuses on design methodologies
and systems, including logic and high-level synthe-
sis, computer-aided design (CAD) tool flows for de-
sign and test, and processors and architectures.
The authors appreciate the funding support of
the National Science Foundation under Grants CCF-
1219013, CCF-0964606 and OCI-1127361.
 S. H. Unger, Asynchronous Sequential Switching
Circuits. New York, NY, USA: Wiley, 1969.
 D. E. Muller and W. C. Bartky, A Theory of Asynchronous
Circuits. Cambridge, MA, USA: Annals of Computing
Laboratory of Harvard University, 1959, pp. 204–243.
 H. van Gageldonk et al., ‘‘An asynchronous low-power
80C51 microcontroller,’’ in Proc. Int. Symp. Adv.
Res. Asynch. Circuits Syst. (ASYNC 98), 1998,
switch/router using quasi-delay-insensitive
asynchronous design,’’ in Proc. Int. Symp. Asynch.
Circuits Syst. (ASYNC 14), 2014, pp. 103–104.
 J. Teifel and R. Manohar, ‘‘Highly pipelined
asynchronous FPGAs,’’ in Proc. ACM/SIGDA Int.
Symp. Field Programmable Gate Arrays (FPGA 04),
2004, pp. 133–142.
 P. Merolla et al., ‘‘A million spiking-neuron integrated
circuit with a scalable communication network and
interface,’’ Science, vol. 345, no. 6197, pp. 668–673,
self-timed 160 ns 54 b CMOS divider,’’ IEEE J.
Solid-State Circuits, vol. 26, no. 11, pp. 1651–1661,
 K. M. Fant, Logically Determined Design. New York,
 K. S. Stevens et al., ‘‘An asynchronous instruction
length decoder,’’ IEEE J. Solid-State Circuits, vol. 36,
no. 2, pp. 217–228, 2001.
 M. Singh et al., ‘‘An adaptively pipelined mixed
synchronous-asynchronous digital FIR filter chip
operating at 1.3 gigahertz,’’ IEEE Trans. Very
Large Scale Integr. (VLSI) Syst., vol. 18, no. 7,
pp. 1043–1056, 2010.
 T. Liu et al., ‘‘Asynchronous computing in sense
amplifier-based pass transistor logic,’’ IEEE Trans.
Very Large Scale Integr. (VLSI) Syst., vol. 17, no. 7,
pp. 883–892, 2009.
 L. S. Nielsen and J. Sparsø, ‘‘Designing asynchronous
circuits for low power: An IFIR filter bank for a digital
hearing aid,’’ Proc. IEEE, vol. 87, no. 2, pp. 268–281,
 K.-L. Chang et al., ‘‘Synchronous-logic and
asynchronous-logic 8051 microcontroller cores for
realizing the internet of things: A comparative study
on dynamic voltage scaling and variation effects,’’
IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 3, no. 1,
pp. 23–34, 2013.
 J. F. Christmann et al., ‘‘Bringing robustness and
power efficiency to autonomous energy harvesting
systems,’’ IEEE Design Test Comput., vol. 28, no. 5,
pp. 84–94, 2011.
 F. Aeschlimann et al., ‘‘Asynchronous FIR filters:
Towards a new digital processing chain,’’ in Proc.
Int. Symp. Asynch. Circuits Syst. (ASYNC-04), 2004,
 C. Vezyrtzis, S. M. Nowick, and Y. Tsividis, ‘‘A flexible,
event-driven digital filter with frequency response
independent of input sample rate,’’ IEEE J. Solid-State
Circuits (JSSC), vol. 49, no. 10, pp. 2292–2304, 2014.
 P. Shepherd et al., ‘‘A robust, wide-temperature data
transmission system for space environments,’’ in
Proc. IEEE Aerospace Conf. (AERO 2013), 2013,
 F. Peper et al., ‘‘Laying out circuits on asynchronous
cellular arrays: A step towards feasible nanocomputers?’’
Nanotechnology, vol. 14, no. 4, pp. 1651–1661, 2003.
 M. Vacca, M. Graziano, and M. Zamboni,
‘‘Asynchronous solutions for nanomagnetic logic
(JETC), vol. 7, no. 4, pp. 15:1–15:18, 2011.
 N. Karaki et al., ‘‘A flexible 8b asynchronous
microprocessor based on low-temperature
poly-silicon TFT technology,’’ in Proc. IEEE Int.
Solid-State Circuits Conf. (ISSCC-05), 2005,
pp. 272–273, pg. 598.
 M. Kishinevsky et al., Concurrent Hardware: The
Theory and Practice of Self-Timed Design. New York,
 I. E. Sutherland, ‘‘Micropipelines,’’ Commun. ACM,
vol. 32, no. 6, pp. 720–738, 1989.
 M. Singh and S. M. Nowick, ‘‘MOUSETRAP:
High-speed transition-signaling asynchronous
pipelines,’’ IEEE Trans. Very Large Scale Integr. (VLSI)
Syst., vol. 15, no. 6, pp. 684–698, 2007.
 S. M. Nowick and M. Singh, ‘‘High-performance
asynchronous pipelines: An overview,’’ IEEE Design &
Test, vol. 28, no. 5, pp. 8–22, 2011.
 I. Sutherland and S. Fairbanks, ‘‘GasP: A minimal FIFO
control,’’ in Proc. Int. Symp. Asynch. Circuits Syst.
(ASYNC 01), 2001, pp. 46–53.
 S. M. Nowick et al., ‘‘Speculative completion for the
design of high-performance asynchronous dynamic
adders,’’ in Proc. 3rd Int. Symp. Adv. Res. Asynch.
Circuits Syst. (ASYNC 97), 1997, pp. 210–223.
 R. Ginosar, ‘‘Metastability and synchronization:
A tutorial,’’ IEEE Design & Test, vol. 28, no. 5,
pp. 23–35, 2011.
 D. J. Kinniment, A. Bystrov, and A. V. Yakovlev,
‘‘Synchronization circuit performance,’’ IEEE
J. Solid-State Circuits, vol. 37, no. 2, pp. 202–209,
 A. Chakraborty and M. R. Greenstreet, ‘‘Efficient
self-timed interfaces for crossing clock domains,’’ in
Proc. 9th IEEE Int. Symp. Asynch. Circuits Syst.
(ASYNC 03), 2003, pp. 78–88.
 T. Chelcea and S. M. Nowick, ‘‘Robust interfaces
for mixed-timing systems,’’ IEEE Trans. Very Large
Scale Integr. (VLSI) Syst., vol. 12, no. 8, pp. 857–873,
Steven M. Nowick is a professor of computer
science at Columbia University, New York, NY, USA.
His research interests include the design and
optimization of asynchronous and mixed-timing
(i.e., GALS) digital systems, scalable and low-
latency on-chip interconnection networks for
shared-memory parallel processors and embedded
systems, extreme low-energy digital systems, neu-
romorphic computing, and variation-tolerant global
communication. He has a PhD degree in computer
science from Stanford University. He is a Fellow of
Montek Singh is an associate professor of
computer science at the University of North Carolina
at Chapel Hill, NC, USA. His research interests
include asynchronous and mixed-timing circuits and
systems; CAD tools for design, analysis, and optimi-
zation; high-speed and low-power VLSI design; and
applications to emerging computing technologies,
energy-efficient graphics, and image sensing hard-
ware. He has a PhD degree in computer science
from Columbia University, New York, NY, USA.
Direct questions and comments about this article
to Steven M. Nowick, Department of Computer
Science, Columbia University, New York, NY 10027
USA; firstname.lastname@example.org; or to Montek Singh,
Department of Computer Science, University of North
Carolina, Chapel Hill, NC 27599 USA; montek@cs.