Article

Globally-asynchronous locally-synchronous systems

Abstract

This thesis provides a new framework for the design of very high-performance digital machines. The new theoretical results presented here have practical implications and lead to a better understanding of the possibilities and limitations in the design of computers, communication hardware and other digital machinery. The discussion centers on different organizations for globally-asynchronous, locally-synchronous systems, and covers the following issues: organizations for complex digital systems, metastability as a limitation for high performance, structures for two classes of non-conventional architectures, optimization, performance, reliability, and design techniques. We present new algorithms to compile the specifications of such machines onto efficient circuits and to verify the correctness of the resulting machines. The models we developed to analyze the tradeoffs among the variables that affect the safety of operation of these systems show that the proposed organizations result in extremely fast and reliable digital machines. The proposed organizational schemes can be used within a wide range of architectures, and integrated circuits designed according to this methodology have been developed and tested.
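The abstract names metastability as a limitation on high performance. As an illustration only, the sketch below evaluates the widely used first-order synchronizer model MTBF = exp(t_r / τ) / (T_w · f_clk · f_data). It is not taken from the thesis, and the device parameters (τ, T_w and both frequencies) are hypothetical values chosen to show the trend.

```python
import math


def synchronizer_mtbf(t_resolve: float, tau: float, t_window: float,
                      f_clk: float, f_data: float) -> float:
    """Mean time between synchronizer failures, in seconds.

    First-order model: MTBF = exp(t_r / tau) / (T_w * f_clk * f_data), where
    t_r is the time allowed for metastability to resolve, tau the flip-flop's
    regeneration time constant and T_w its metastability window.
    """
    if min(tau, t_window, f_clk, f_data) <= 0:
        raise ValueError("tau, t_window and both frequencies must be positive")
    return math.exp(t_resolve / tau) / (t_window * f_clk * f_data)


# Hypothetical device and system parameters, for illustration only.
TAU = 20e-12     # 20 ps regeneration time constant
T_W = 20e-12     # 20 ps metastability window
F_CLK = 1e9      # 1 GHz receiving clock
F_DATA = 100e6   # 100 MHz rate of asynchronous input transitions

# Simplification: resolution time taken as a whole number of clock periods,
# ignoring setup and clock-to-output delays of the flip-flops.
for periods in (1, 2):
    t_r = periods / F_CLK
    print(periods, synchronizer_mtbf(t_r, TAU, T_W, F_CLK, F_DATA))
```

Because t_r sits in the exponent, each additional clock period of resolution time multiplies the MTBF by exp(T_clk / τ), which is why adding synchronizer stages trades latency for reliability.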
... Asynchronous Handshake Protocol: An asynchronous handshake protocol represents the communication agreement between two or more entities, allowing them to exchange data without the need for a common clock [20]. This is in contrast to synchronous protocols, which rely on the timing of a common clock to regulate communication. ...
Article
This paper presents the design and implementation of a self-locking domino logic pipeline controller for a RISC-V processor implemented on an FPGA. The emphasis is on asynchronous circuit design, which offers advantages such as greater resilience to supply voltage fluctuations, improved power efficiency, and the elimination of clock-related issues such as skew and single points of failure. By combining a Globally Asynchronous Locally Synchronous (GALS) organization with domino logic, the controller ensures hazard-free and race-free operation. The asynchronous approach, integrated into a 32-bit RISC-V processor, allows flexible and energy-efficient operation, demonstrating its potential for performance-critical applications. This paper highlights the contrasts between the asynchronous design and a traditional synchronous multicycle processor, demonstrating the benefits of asynchronous systems in terms of power consumption and performance. A significant contribution of this design is the pipeline’s completion detection mechanism, which ensures that each processing stage stays locked until valid results are obtained, thereby markedly enhancing system stability. Furthermore, the paper investigates the parallelization of domino gates and introduces an asynchronous Arithmetic Logic Unit (ALU), which further optimizes performance through self-locking mechanisms. The power, performance, and area (PPA) analysis of the design shows considerable improvements in throughput (up to 10%) and reduced latency per instruction compared with its synchronous counterpart, while maintaining moderate resource utilization on the FPGA. The results indicate that asynchronous domino logic pipelines may offer a promising approach for energy-efficient, high-performance processors in future computing architectures.
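The completion detection described above keeps a stage locked until its result is valid. One common way to build it, assumed here rather than taken from the paper, is dual-rail encoding, which dual-rail domino gates produce naturally: each bit uses two wires, (0, 0) means "no data yet", and a word is complete once every bit has exactly one wire high. The Python sketch below models only this logical check, not the circuit; the names `encode`, `is_complete` and `decode` are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

# Dual-rail code per bit: (1, 0) = logic 1, (0, 1) = logic 0,
# (0, 0) = spacer (no data yet), (1, 1) = illegal.
DualRail = Tuple[int, int]


def encode(bits: Sequence[int]) -> List[DualRail]:
    """Map plain bits onto dual-rail pairs."""
    return [(1, 0) if b else (0, 1) for b in bits]


def is_complete(word: Sequence[DualRail]) -> bool:
    """Completion detection: True only when every bit holds a valid code word."""
    for t, f in word:
        if t and f:
            raise ValueError("illegal dual-rail code (1, 1)")
    return all(t != f for t, f in word)


def decode(word: Sequence[DualRail]) -> Optional[List[int]]:
    """Return the data bits once complete; None means the stage stays locked."""
    if not is_complete(word):
        return None
    return [t for t, _ in word]


# A word whose second bit has not arrived yet is not released.
print(decode([(1, 0), (0, 0)]), decode(encode([1, 0, 1])))
```

In hardware the `all(...)` reduction corresponds to an OR per bit pair feeding a C-element or AND tree; the sketch only mirrors that logic.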
... An asynchronous handshake protocol is a communication agreement between two or more entities that allows them to exchange data without the need for a common clock (Chapiro 1984). In contrast to synchronous protocols, which rely on the timing of a common clock to regulate communication, asynchronous handshake protocols employ a pair of signals to regulate data transmission. ...
Article
This paper proposes an asynchronous RISC-V CPU design based on self-locking domino logic. The asynchronous approach offers advantages over traditional synchronous designs, including improved performance, lower power consumption, and greater modularity. The paper details the design and implementation of the asynchronous control unit using domino logic on an FPGA development board. The control unit is designed for a Turing-complete 32-bit RISC-V architecture. A significant aspect of the design is the self-locking mechanism, which ensures that the circuit unlocks only after all processing stages have completed. This eliminates the need for a global clock and simplifies hazard-free operation. Furthermore, the paper discusses the potential for parallelizing the ALU using domino logic to improve performance further. The implementation of the asynchronous CPU has been analyzed in terms of power, performance, and area using the Vivado Design Suite. The power analysis indicates that the asynchronous processor consumes considerably less power in the clock network than its synchronous counterpart, underscoring its energy efficiency. A performance analysis using the SPECint2000 benchmark suite demonstrates a 10% increase in performance while using only slightly more area. These findings illustrate the asynchronous processor’s potential for performance-critical applications while maintaining energy and area efficiency.
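The citation context above describes handshake protocols that use a pair of signals to regulate transfers. A minimal Python sketch of the four-phase (return-to-zero) request/acknowledge variant follows, assuming a bundled-data channel where the data is valid whenever `req` is high. It models only the order of events on one hypothetical `Channel`, not circuit timing, and is not drawn from either paper.

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple


@dataclass
class Channel:
    """Hypothetical point-to-point link: request wire, acknowledge wire, data bundle."""
    req: int = 0
    ack: int = 0
    data: Any = None
    trace: List[Tuple[int, int]] = field(default_factory=list)

    def record(self) -> None:
        # Log the (req, ack) pair after every wire transition.
        self.trace.append((self.req, self.ack))


def four_phase_transfer(ch: Channel, value: Any) -> Any:
    """Move one value across `ch` with a four-phase handshake.

    In hardware each phase is triggered by the previous transition rather
    than by a clock edge; here the phases simply run in that order.
    """
    if ch.req or ch.ack:
        raise RuntimeError("channel must be idle (req = ack = 0) before a transfer")
    # Phase 1 (sender): place data on the bundle, then raise req.
    ch.data = value
    ch.req = 1
    ch.record()
    # Phase 2 (receiver): seeing req high, latch the data and raise ack.
    received = ch.data
    ch.ack = 1
    ch.record()
    # Phase 3 (sender): seeing ack high, lower req (data may now change).
    ch.req = 0
    ch.record()
    # Phase 4 (receiver): seeing req low, lower ack; the channel is idle again.
    ch.ack = 0
    ch.record()
    return received


ch = Channel()
print(four_phase_transfer(ch, 0x2A), ch.trace)
```

A two-phase (transition-signalling) protocol would instead treat every toggle of `req` or `ack` as an event, halving the number of wire transitions per transfer at the cost of more complex latch control.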
... It is worth noting that, at the circuit level, many-core neuromorphic chips often adopt a globally asynchronous locally synchronous (GALS) style [6] in circuit design [1], [10], [26], which means the whole chip does not operate under a single clock signal. The term 'synchronization' in this work refers to the SNN computation level. ...
Preprint
Spiking Neural Networks (SNNs) are extensively used in brain-inspired computing and neuroscience research. To improve the speed and energy efficiency of SNNs, several many-core accelerators have been developed. However, maintaining the accuracy of SNNs often requires frequent explicit synchronization among all cores, which limits overall efficiency. In this paper, we propose an asynchronous architecture for SNNs that eliminates the need for inter-core synchronization, thus improving speed and energy efficiency. This approach leverages the dependencies between neuromorphic cores, which are determined in advance during compilation. Each core is equipped with a scheduler that monitors the status of its dependencies, allowing it to safely advance to the next timestep without waiting for other cores. This eliminates the need for global synchronization and minimizes core waiting time despite inherent workload imbalances. Comprehensive evaluations using five different SNN workloads show that our architecture achieves a 1.86x speedup and a 1.55x increase in energy efficiency compared to state-of-the-art synchronization architectures.
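The per-core scheduling idea above can be illustrated with a small timing model. In the sketch below, a simplification under stated assumptions rather than the paper's scheduler, core c may start timestep t as soon as it and every core in `deps[c]` have finished timestep t-1; with `barrier=True` every core instead waits for all cores, mimicking global synchronization. Spike buffers are assumed unbounded, and the workloads are hypothetical.

```python
from typing import Dict, List


def finish_times(deps: Dict[str, List[str]], cost: Dict[str, List[float]],
                 timesteps: int, barrier: bool = False) -> Dict[str, float]:
    """Time at which each core finishes its last timestep.

    deps[c]: cores whose spikes from step t-1 core c needs before running step t.
    cost[c][t]: hypothetical amount of work core c does in timestep t.
    barrier: if True, all cores wait for the slowest core after every step.
    """
    prev = {c: 0.0 for c in deps}  # finish time of the previous timestep
    for t in range(timesteps):
        if barrier:
            start = max(prev.values())
            prev = {c: start + cost[c][t] for c in deps}
        else:
            # Each core waits only for itself and its own dependencies.
            prev = {
                c: max([prev[c]] + [prev[d] for d in deps[c]]) + cost[c][t]
                for c in deps
            }
    return prev


# Core B consumes spikes from core A; their per-step loads alternate.
deps = {"A": [], "B": ["A"]}
cost = {"A": [3, 1, 3, 1], "B": [1, 3, 1, 3]}
print(max(finish_times(deps, cost, 4).values()),
      max(finish_times(deps, cost, 4, barrier=True).values()))
```

With these numbers the dependency-driven schedule finishes at time 10 and the barrier schedule at 12: A, which depends on no other core, runs ahead instead of waiting for B's heavy steps, so the two cores' imbalances overlap.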
... Although major theoretical contributions since the 1980s have provided a better understanding of these objects [27,11] from computational and behavioral standpoints, their sensitivity to (a)synchronism remains an open question, and any advance on it could have deep implications in computer science (around the themes of synchronous versus asynchronous computation and processing [2,3]) and in systems biology (around the temporal organization of genetic expression [13,10]). In this context, numerous studies have considered distinct settings of synchronism/asynchronism by defining update modes, which govern the way automata update their state over time. ...
Chapter
Among the fundamental questions in computer science is the impact of synchronism/asynchronism on computations, which has been addressed in various fields of the discipline: programming, networking, concurrency theory, machine learning, etc. In this paper, we tackle this question from a standpoint that mixes discrete dynamical systems theory and computational complexity, highlighting that the chosen way of performing local computations can drastically influence the global computation itself. To do so, we study how distinct update schedules may fundamentally change the asymptotic behaviors of finite dynamical systems, in particular by analyzing the maximal period of their limit cycles. To make the message general and impactful enough, we focus on a “simple” computational model which prevents underlying systems from having too many intrinsic degrees of freedom, namely elementary cellular automata. More precisely, for elementary cellular automata rules which are neither too simple nor too complex (the problem would be meaningless for both), we show that changes of update schedule can lead to significant jumps in computational complexity (from constant to superpolynomial) in terms of their temporal asymptotes.
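The effect described above can be reproduced on a toy scale. The Python sketch below assumes a ring (periodic boundary) and compares only the parallel schedule with one sequential left-to-right order, not the broader families of schedules studied in the chapter. It computes the limit-cycle period reached from a configuration and the maximal period over all configurations of a small ring.

```python
from itertools import product
from typing import Dict, Optional, Sequence, Tuple

Config = Tuple[int, ...]


def eca_step(config: Config, rule: int, order: Optional[Sequence[int]] = None) -> Config:
    """One update of elementary cellular automaton `rule` (0-255) on a ring.

    order=None: parallel schedule, every cell reads the previous configuration.
    order=[i0, i1, ...]: sequential schedule, cells update one at a time in that
    order and each reads values already updated earlier in the same step.
    """
    n = len(config)

    def local(cells: Sequence[int], i: int) -> int:
        # Neighborhood (left, center, right) read as a 3-bit index into the rule number.
        idx = (cells[(i - 1) % n] << 2) | (cells[i] << 1) | cells[(i + 1) % n]
        return (rule >> idx) & 1

    if order is None:
        return tuple(local(config, i) for i in range(n))
    cells = list(config)
    for i in order:
        cells[i] = local(cells, i)
    return tuple(cells)


def limit_cycle_period(config: Config, rule: int, order: Optional[Sequence[int]] = None) -> int:
    """Length of the cycle that the deterministic orbit of `config` eventually enters."""
    seen: Dict[Config, int] = {}
    t = 0
    while config not in seen:
        seen[config] = t
        config = eca_step(config, rule, order)
        t += 1
    return t - seen[config]


def max_period(n: int, rule: int, order: Optional[Sequence[int]] = None) -> int:
    """Maximal limit-cycle period over all 2**n configurations of a ring of n cells."""
    return max(limit_cycle_period(c, rule, order) for c in product((0, 1), repeat=n))


print(max_period(4, 170), max_period(4, 170, order=[0, 1, 2, 3]))
```

For rule 170 (each cell copies its right neighbor) on a ring of four cells, the parallel schedule is a rotation with maximal period 4, whereas the left-to-right sequential schedule has maximal period 3, so the schedule alone changes the asymptotic behavior.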
... To solve these problems, the Globally Asynchronous Locally Synchronous (GALS) approach was proposed in [1]. GALS systems are composed of several locally synchronous modules. ...
Article
In this paper, we propose a design support tool set for interface circuits between synchronous and asynchronous modules. To facilitate the design of such interface circuits, the proposed tool set generates interface circuits and design constraints based on a predefined communication scheme. In addition, the tool set performs timing verification and delay adjustment to guarantee correct operation of the generated interface circuits. In the experiment, we evaluated the latency and overhead of the generated interface circuits; both the latency and the handshake overhead depend on the cycle time of the receiver module. We also used the proposed tool set to design a system consisting of a synchronous RISC-V processor and an asynchronous multilayer perceptron (MLP) circuit. Its energy consumption was 34.0% lower than that of a system using a synchronous MLP circuit.
Article
Consider an arbitrary network of communicating modules on a chip, each requiring a local signal telling it when to execute a computational step. There are three common solutions to generating such a local clock signal: 1) by deriving it from a single, central clock source; 2) by local, free-running oscillators; or 3) by handshaking between neighboring modules. Conceptually, each of these solutions is the result of a perceived dichotomy in which (sub)systems are either clocked or asynchronous. We present a solution and its implementation that lies between these extremes. Based on a distributed gradient clock synchronization (GCS) algorithm, we show a novel design providing modules with local clocks, the frequency bounds of which are almost as good as those of free-running oscillators, yet neighboring modules are guaranteed to have a phase offset substantially smaller than one clock cycle. Concretely, parameters obtained from a 15-nm application specific integrated circuit (ASIC) simulation running at 2 GHz yield mathematical worst-case bounds of 20 ps on the phase offset for a 32 × 32 node grid network.
Preprint
Consider an arbitrary network of communicating modules on a chip, each requiring a local signal telling it when to execute a computational step. There are three common solutions to generating such a local clock signal: (i) by deriving it from a single, central clock source, (ii) by local, free-running oscillators, or (iii) by handshaking between neighboring modules. Conceptually, each of these solutions is the result of a perceived dichotomy in which (sub)systems are either clocked or asynchronous. We present a solution and its implementation that lies between these extremes. Based on a distributed gradient clock synchronization algorithm, we show a novel design providing modules with local clocks, the frequency bounds of which are almost as good as those of free-running oscillators, yet neighboring modules are guaranteed to have a phase offset substantially smaller than one clock cycle. Concretely, parameters obtained from a 15 nm ASIC simulation running at 2 GHz yield mathematical worst-case bounds of 20 ps on the phase offset for a 32 × 32 node grid network.
Article
Designing software systems underpinned by a formal model of computation (MoC) is crucial for safety-critical, real-time and industrial applications in general, as it allows formal analysis of those designs and supports correct-by-design systems. In this paper, we focus on Globally Asynchronous Locally Synchronous (GALS) software systems and present a Coloured Petri Net (CPN) based approach to formally model and analyse GALS software systems specified in the SystemJ GALS programming language. The approach translates SystemJ constructs into CPN modules and composes them into a CPN GALS model based on the control flow and concurrency specified in the SystemJ program. It preserves the GALS MoC by automatically integrating synchronizer modules, asynchronous channel interface modules, and scheduling modules, resulting in a CPN execution model equivalent to the SystemJ program. The created CPN GALS model allows system developers to verify properties of the design formally using Computation Tree Logic (CTL). An industrial automation example is provided as a use case.
Article
With the advent of the Internet of Things (IoT), the demand for hardware security has become pressing because of the risk of side-channel attacks by adversaries. The Advanced Encryption Standard (AES) is the de facto security standard for such applications, and its implementations must achieve low power, low area and moderate throughput in addition to providing high security to these devices. The substitution box (S-box), being the core component of AES, has always drawn the attention of the cryptographic community. A chronological account of S-box development over the 20 years since the inception of AES is presented. This paper provides the first comprehensive review of state-of-the-art S-box design techniques, identifying current advancements and analysing their impact on gate count, area, maximum operating frequency, throughput and power. The other goal of the survey is to study the countermeasures designed to protect AES against side-channel attacks. In particular, we consider power analysis attacks, and the countermeasures are investigated in terms of their security metrics and design overheads, such as area, power, and performance. The countermeasures are based on hiding or masking approaches, depending on their design principle. As with the S-box survey, a chronological development of the countermeasures since the discovery of power analysis attacks in 1999 is presented. Finally, we suggest some open research gaps and possible directions of research for S-box and countermeasure designs.
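Whatever circuit style a design uses, the S-box it implements is the same byte function defined in FIPS-197: the multiplicative inverse in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (with 0 mapped to 0), followed by an affine transformation. The Python reference model below computes that function directly. It is a sketch for checking hardware outputs, not a compact or side-channel-protected implementation, and the helper names are hypothetical.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11B."""
    acc = 0
    while b:
        if b & 1:
            acc ^= a
        a <<= 1
        if a & 0x100:  # reduce as soon as the degree reaches 8
            a ^= 0x11B
        b >>= 1
    return acc


def gf_inv(a: int) -> int:
    """Inverse via a**254 (the field has 255 non-zero elements); 0 maps to 0."""
    result, base, exp = 1, a, 254
    while exp:
        if exp & 1:
            result = gf_mul(result, base)
        base = gf_mul(base, base)
        exp >>= 1
    return result if a else 0


def rotl8(x: int, k: int) -> int:
    """Rotate a byte left by k bits (0 < k < 8)."""
    return ((x << k) | (x >> (8 - k))) & 0xFF


def sbox(x: int) -> int:
    """AES SubBytes for one byte: field inversion, then the affine map with constant 0x63."""
    b = gf_inv(x)
    return b ^ rotl8(b, 1) ^ rotl8(b, 2) ^ rotl8(b, 3) ^ rotl8(b, 4) ^ 0x63


SBOX = [sbox(x) for x in range(256)]
print(hex(SBOX[0x00]), hex(SBOX[0x53]))  # prints 0x63 0xed
```

Hardware S-box variants, for example look-up tables or composite-field circuits, can be checked exhaustively against `SBOX`, since all 256 inputs are cheap to enumerate.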