Conference Paper

An Energy-Efficient Delay Insensitive Asynchronous Interface for Globally Asynchronous Locally Synchronous (GALS) System

Article
This thesis provides a new framework for the design of very high performance digital machines. The new theoretical results presented here have practical implications and lead to a better understanding of the possibilities and limitations in the design of computers, communication hardware and other digital machinery. The discussion centers on different organizations for globally-asynchronous, locally-synchronous systems, and covers the following issues: organizations for complex digital systems, metastability as a limitation for high performance, structures for two classes of non-conventional architectures, optimization, performance, reliability, and design techniques. We present new algorithms to compile the specifications of such machines onto efficient circuits, and to verify the correctness of the resulting machines. The models we developed to analyze the tradeoffs between the variables that affect the safety of operation of these systems show that the proposed organizations result in extremely fast and reliable digital machines. The proposed organizational schemes can be used within a wide range of architectures, and integrated circuits designed according to this methodology have been developed and tested.
Conference Paper
This paper presents a new fast and templatized family of fine-grain asynchronous pipeline stages based on the single-track protocol. No explicit control wires are required outside of the datapath, and the data is 1-of-N encoded. With a forward latency of 2 transitions and a cycle time of 6 transitions for most configurations, the new family can run at 1.6 GHz in the MOSIS TSMC 0.25 μm process. This is significantly faster than all known quasi-delay-insensitive templates, and the family makes fewer timing assumptions than the recently proposed ultra-high-speed GasP bundled-data circuits.
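As a rough reading of these figures, the sketch below converts the quoted 1.6 GHz rate and 6-transition cycle into an average per-transition delay. It assumes that every transition takes the same time, which the abstract does not state, and the function name is hypothetical.

```python
def per_transition_delay_ps(freq_ghz: float, transitions_per_cycle: int) -> float:
    """Average delay of one transition in picoseconds.

    Assumption (not from the abstract): all transitions in a cycle are equally long.
    """
    cycle_ps = 1000.0 / freq_ghz  # one cycle at freq_ghz, expressed in picoseconds
    return cycle_ps / transitions_per_cycle


# Figures quoted in the abstract: 1.6 GHz, 6 transitions per cycle, 2 forward.
delay_ps = per_transition_delay_ps(1.6, 6)
forward_latency_ps = 2 * delay_ps

print(f"per-transition delay: {delay_ps:.1f} ps")
print(f"forward latency:      {forward_latency_ps:.1f} ps")
```

Under that uniform-delay assumption, the cycle time is 625 ps, so each transition averages a little over 100 ps.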
Conference Paper
With continued advances in CMOS technology, parameter variations are emerging as a major design challenge. Irregularities during the fabrication of a microprocessor and variations of voltage and temperature during its operation widen worst-case timing margins of the design, degrading performance significantly. Because runtime variations like supply voltage droops and temperature fluctuations depend on the activity signature of the processor's workload, there are several opportunities to improve performance by dynamically adapting margins. This paper explores the power-performance efficiency gains that result from designing for typical conditions while dynamically tuning frequency and voltage to accommodate the runtime behavior of workloads. Such a design depends on a fail-safe mechanism that allows it to protect against margin violations during adaptation; we evaluate several such mechanisms, and we propose a local recovery scheme that exploits spatial variation among the units of the processor. While a processor designed for worst-case conditions might only be capable of a frequency that is 75% of an ideal processor with no parameter variations, we show that a fine-grained global frequency tuning mechanism improves power-performance efficiency (BIPS^3/W) by 40% while operating at 91% of an ideal processor's frequency. Moreover, a per-unit voltage tuning mechanism aims to reduce the effect of within-die spatial variations to provide a 55% increase in power-performance efficiency. The benefits reported are clearly substantial in light of the
Conference Paper
Reliable, low-latency channel communication between independent clock domains may be achieved using a combination of clock pausing techniques, self-calibrating delay lines and an asynchronous interconnect. Such a scheme can be used for point-to-point communication in a globally asynchronous locally synchronous (GALS) system, a possible methodology for managing the predicted increase in clock domains. We present interface wrapper circuits which permit communication between a locally synchronous producer and a locally synchronous consumer via an asynchronous interconnect. Such interfaces can also be used to mix asynchronous and synchronous modules. Clock pausing is used to guarantee that metastability will never result in failure. Arbitration between channel communication and the local clock is performed concurrently so that metastability resolution will rarely delay the clock. Simulation results show that the maximum performance of one data item per consumer clock cycle is achieved when the producer-to-consumer clock ratio is greater than or equal to one.
Conference Paper
The GasP family of asynchronous circuits provides controls for simple pipelines, for branching and joining pipelines, for round-robin scatter and gather, for data-dependent scatter and gather, and for join on demand through arbitration. The family is designed so that each stage operates at the speed of a three-inverter ring oscillator. Test chips in 0.35 micron technology exhibit throughput in excess of 1.5 giga data items per second (GDI/s). Between GasP pipeline stages a single wire carries both request and acknowledge messages, also recording the FULL or EMPTY state of each pipeline stage. GasP control circuits rely on careful choice of transistor widths to equalize the delay in logic gates. Assurance of uniform gate delays permits use of self-resetting logic forms that have very low logical effort.
Conference Paper
The demands of System-on-Chip (SoC) interconnect increasingly cannot be satisfied through the use of a shared bus. A common alternative, using unidirectional, point-to-point connections and multiplexers, results in much greater area requirements and still suffers from some of the same problems. This paper introduces a delay-insensitive, asynchronous approach to interconnect over long paths using 1-of-4 encoded channels switched through multiplexers. A reimplementation of the MARBLE SoC bus (as used in the AMULET3H chip) using this technique shows that it can provide a higher throughput than the simpler tristate bus while using a narrower datapath.
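To make the 1-of-4 encoding concrete, here is a minimal sketch in which each pair of data bits is carried on a group of four wires with exactly one wire asserted. The function names, the wire ordering within a group, and the least-significant-pair-first ordering are illustrative assumptions and are not taken from the paper.

```python
def encode_1of4(value: int, width_bits: int) -> list[tuple[int, ...]]:
    """Encode an unsigned integer as 1-of-4 groups, least significant bit pair first.

    Each group is a 4-tuple with exactly one wire set: wire k is set for 2-bit symbol k.
    """
    if width_bits <= 0 or width_bits % 2 != 0:
        raise ValueError("width_bits must be a positive even number")
    if not 0 <= value < (1 << width_bits):
        raise ValueError("value does not fit in width_bits")
    groups = []
    for shift in range(0, width_bits, 2):
        symbol = (value >> shift) & 0b11  # the 2-bit symbol carried by this group
        groups.append(tuple(1 if wire == symbol else 0 for wire in range(4)))
    return groups


def decode_1of4(groups: list[tuple[int, ...]]) -> int:
    """Invert encode_1of4, rejecting groups that are not valid 1-of-4 codewords."""
    value = 0
    for i, group in enumerate(groups):
        if len(group) != 4 or sum(group) != 1:
            raise ValueError("not a valid 1-of-4 codeword")
        value |= group.index(1) << (2 * i)
    return value


codeword = encode_1of4(0b0110, 4)
print(codeword)
print(decode_1of4(codeword))
```

Because every valid group has exactly one asserted wire, a receiver can tell that a symbol has arrived by observing the wires themselves, without a separate timing reference.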
Article
An asymptotically zero power charge recycling bus (CRB) architecture, featuring virtual stacking of the individual bus capacitances into a series configuration between supply voltage and ground, has been proposed. This CRB architecture makes it possible to reduce not only each bus swing but also the total equivalent bus capacitance of the ultramultibit buses running in parallel. The voltage swing of each bus is supplied by charge recycled from the upper adjacent bus capacitance instead of from the power line. The dramatic power reduction was verified by simulated and measured data. According to these data, an ultrahigh data rate of 25.6 Gb/s can be achieved while keeping the power dissipation below 100 mW, which is less than 10% of that of the previously reported 0.9 V suppressed bus-swing scheme, at Vcc = 3.6 V for a bus width of 512 b with a bus capacitance of 14 pF per bit operating at 50 MHz.
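The quoted data rate follows from the bus width and clock frequency, as the short check below shows. It assumes one bit per bus line per clock cycle, which is consistent with the quoted figures but not spelled out in the abstract. The energy-per-bit value is only an upper bound derived from the "less than 100 mW" figure, and the variable names are illustrative.

```python
bus_width_bits = 512      # bus width quoted in the abstract
clock_hz = 50e6           # operating frequency quoted in the abstract
power_bound_w = 0.100     # "less than 100 mW", so this is an upper bound

# Assumption: each bus line transfers one bit per clock cycle.
data_rate_bps = bus_width_bits * clock_hz
energy_per_bit_bound_pj = power_bound_w / data_rate_bps * 1e12

print(f"data rate: {data_rate_bps / 1e9:.1f} Gb/s")
print(f"energy per bit: below {energy_per_bit_bound_pj:.2f} pJ")
```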
Conference Paper
Many of the challenges of modern SoC design can be mitigated or eliminated with globally asynchronous, locally synchronous (GALS) design techniques. Partitioning a design into many synchronous islands introduces myriad asynchronous boundary crossings which typically incur high latency. We have designed a pausible bisynchronous FIFO that achieves low interface latency with a pausible clocking scheme. While traditional synchronizers have a non-zero probability of metastability and error, pausible clocking enables error-free operation by permitting infrequent slowdowns in the clock rate. Unlike prior pausible synchronizers, our circuit employs standard two-ported synchronous FIFOs, common circuit elements that integrate well with standard tool flows. The pausible bisynchronous FIFO achieves an average latency of 1.34 cycles across an asynchronous interface while using less energy and area than traditional synchronizers.
Conference Paper
Asynchronous across-chip communication is increasingly attractive as clock distribution becomes more difficult at smaller feature sizes. Several recently proposed globally asynchronous locally synchronous (GALS) architectures use handshaking protocols for across-chip communication. In this paper, we design two circuits that interface synchronous modules using pausible clocks to an asynchronous communication environment that uses single-track handshaking, a protocol shown to support high-speed asynchronous logic. The resulting circuits allow data transfer rates of up to 1.1 GigaDataItems/second, significantly higher than previous designs.
Conference Paper
The globally asynchronous locally synchronous (GALS) design style has evolved as a solution to the growing difficulty of distributing a high-frequency clock in deep-submicron (DSM) technology. Most wrapper designs proposed in the recent literature are based on bundled-data protocols and suffer from the same timing closure problem as synchronous designs. Delay insensitive (DI) protocols offer a solution to this problem. However, most work on DI schemes has so far been limited to asynchronous circuits. This is, to our knowledge, the first paper that presents a complete asynchronous wrapper architecture for GALS designs based on a DI protocol. It uses 1-of-4 data encoding with single-track handshaking. The resulting circuit shows a throughput of 1.66 Gbps, significantly higher than previous asynchronous DI templates.
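A behavioral sketch of single-track handshaking over a 1-of-4 channel may help here. In this model the sender raises exactly one of four wires to transmit a 2-bit symbol, and the receiver acknowledges by lowering that same wire, so each transfer costs two transitions on one wire. This is an abstract model under those assumptions, not the wrapper circuit from the paper, and the class and method names are hypothetical.

```python
class SingleTrack1of4Channel:
    """Abstract model of one 1-of-4 single-track channel (not a circuit description)."""

    def __init__(self) -> None:
        self.wires = [0, 0, 0, 0]  # all wires low means the channel is empty

    def send(self, symbol: int) -> None:
        """Sender side: raise the wire that corresponds to the 2-bit symbol."""
        if symbol not in range(4):
            raise ValueError("symbol must be in 0..3")
        if any(self.wires):
            raise RuntimeError("channel full: previous symbol not yet consumed")
        self.wires[symbol] = 1

    def receive(self) -> int:
        """Receiver side: read the raised wire, then lower it to acknowledge."""
        if sum(self.wires) != 1:
            raise RuntimeError("channel empty: no symbol to consume")
        symbol = self.wires.index(1)
        self.wires[symbol] = 0  # lowering the same wire is the acknowledge
        return symbol


channel = SingleTrack1of4Channel()
channel.send(2)
print(channel.wires)      # one wire high: data is valid
print(channel.receive())  # receiver consumes the symbol
print(channel.wires)      # all wires low again: channel is empty
```

The model makes the absence of separate request and acknowledge wires visible: the data wires themselves carry both directions of the handshake.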
Conference Paper
In this paper we describe a complete design methodology for a globally asynchronous on-chip communication network connecting both locally-synchronous and asynchronous modules. Synchronous modules are equipped with asynchronous wrappers which adapt their interfaces to the self-timed environment and prevent metastability. These wrappers are assembled from a concise library of predesigned technology-independent elements and provide high-speed data transfer. We confirmed the validity of our concept by applying it to an ASIC design implementing the Safer crypto-algorithm.
Conference Paper
This paper describes a novel communication scheme, which is guaranteed to be free of synchronization failures, amongst multiple synchronous modules operating independently. In this scheme, communication between every pair of modules is done through an asynchronous FIFO channel; communication between a module and the FIFO is done using request/acknowledge handshaking. Synchronization of handshaking signals to the local module clock is done in an unconventional way: the local clock, built out of a ring oscillator, is paused or stretched, if necessary, to ensure that the handshaking signal satisfies setup and hold time constraints with respect to the local clock. We constructed a test bed consisting of two synchronous modules with pausible clocking control and an asynchronous FIFO on a MOSIS 1.2 μm CMOS chip. The resulting system functions reliably up to a local clock frequency of 220 MHz (according to SPICE simulation); the maximum clock rate is limited by the ring oscillator, not the pausible clocking control. Preliminary test results indicate that the fabricated chips operate correctly as simulated.