Article

Self-locking Domino Logic Pipelined Controller for RISC-V in FPGA

Authors:

Abstract

This paper proposes an asynchronous RISC-V CPU design based on self-locking domino logic. The asynchronous approach offers advantages over traditional synchronous designs, including improved performance, lower power consumption, and greater modularity. The paper details the design and implementation of the asynchronous control unit using domino logic on an FPGA development board. The control unit is designed for a Turing-complete 32-bit RISC-V architecture. A significant aspect of the design is the self-locking mechanism, which ensures that the circuit only unlocks after all processing stages have been completed. This eliminates the need for a global clock and simplifies hazard-free operation. Furthermore, the paper discusses the potential for parallelizing the ALU using domino logic to improve performance further. The implementation of the asynchronous CPU has been analyzed in terms of power, performance, and area using the Vivado Design Suite. The power analysis indicates that the asynchronous processor consumes considerably less power in the clock network compared to its synchronous counterpart, thereby underscoring its energy efficiency. A performance analysis using the SPECint2000 benchmark suite demonstrates a 10% increase in performance, while only using slightly more area. These findings illustrate the asynchronous processor’s potential for performance-critical applications while maintaining energy and area efficiency.
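The self-locking behaviour described in the abstract can be illustrated with a small behavioural model. This is only a sketch under stated assumptions, not the authors' circuit: the class name `SelfLockingController`, the integer stage delays, and the sequential (domino-style) stage ordering are all hypothetical choices made for illustration.

```python
class SelfLockingController:
    """Toy behavioural model of a self-locking controller (hypothetical).

    Assumption: stages evaluate one after another, each taking a fixed
    integer delay. The controller locks when a request is accepted and
    unlocks only after every stage has completed.
    """

    def __init__(self, stage_delays):
        self.stage_delays = list(stage_delays)
        self.locked = False
        self.remaining = []

    def request(self):
        """Try to start a new operation; refused while locked."""
        if self.locked:
            return False
        self.locked = True
        self.remaining = list(self.stage_delays)
        return True

    def advance(self, dt):
        """Let dt time units pass; return True if still locked."""
        while dt > 0 and self.remaining:
            step = min(dt, self.remaining[0])
            self.remaining[0] -= step
            dt -= step
            if self.remaining[0] == 0:
                self.remaining.pop(0)  # this stage has completed
        if not self.remaining:
            self.locked = False  # all stages done: unlock
        return self.locked


ctrl = SelfLockingController([2, 3, 1])  # hypothetical stage delays
first = ctrl.request()          # accepted, controller locks
second = ctrl.request()         # refused, still locked
still_locked = ctrl.advance(5)  # two of three stages finished
unlocked = not ctrl.advance(1)  # last stage finished
```

In this model no global clock decides when the next request may start; the completion of the last stage does, which is the property the abstract attributes to the self-locking mechanism.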


... This approach illustrates the potential of asynchronous design in reducing power consumption for IoT and neuromorphic applications, despite the challenges in commercial tool support. This work builds upon the findings presented in [16,17]. ...
... for the complementary pin F. To generate the duty cycle for the self-locking input pulse circuit, a self-resetting LUT structure can be employed, wherein the LUT is set to one and transitions to a high state on the positive edge of P. Subsequently, the system resets itself asynchronously after a duration of τ_Δ. In lieu of utilizing the fundamental component of an FDCE, specifically a D flip-flop with Clock Enable and Asynchronous Clear, as illustrated in, the corresponding LUT-based configuration from Figure 2 is employed, see 2 [16]. ...
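The self-resetting pulse described in this excerpt (output goes high on a positive edge of P, then clears itself after the reset delay, written τ_Δ here) can be sketched as a timing model. The function names, the integer time units, and the choice to ignore edges that arrive while a pulse is still high are assumptions made for illustration; the excerpt does not specify retrigger behaviour.

```python
def self_resetting_pulses(rising_edges, tau_delta):
    """Return (start, end) intervals during which the output is high.

    Assumption: a rising edge of P that arrives while the output is
    still high is ignored (non-retriggerable behaviour).
    """
    pulses = []
    busy_until = float("-inf")
    for t in sorted(rising_edges):
        if t < busy_until:
            continue  # output already high, edge has no effect
        busy_until = t + tau_delta  # asynchronous self-reset time
        pulses.append((t, busy_until))
    return pulses


def duty_cycle(period, tau_delta):
    """Fraction of a period the output is high for periodic edges of P."""
    return tau_delta / period


# Hypothetical edge times and delay, in arbitrary time units.
pulses = self_resetting_pulses([0, 10, 12, 20], tau_delta=5)
ratio = duty_cycle(period=20, tau_delta=5)
```

With these example values the edge at time 12 falls inside the pulse started at time 10 and is dropped, so three pulses of width 5 result.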
Article
This paper presents the design and implementation of a self-locking domino logic pipeline controller for a RISC-V processor implemented on an FPGA. The emphasis is on asynchronous circuit design, which offers advantages such as enhanced resilience to supply voltage fluctuations, optimized power efficiency, and the elimination of clock-related issues such as skew and single-point failures. By leveraging Globally Asynchronous Locally Synchronous (GALS) systems and domino logic, the controller ensures hazard-free operation while maintaining race-free processing. The asynchronous approach, integrated into a 32-bit RISC-V processor, allows for flexible and energy-efficient operation, thereby demonstrating its potential for performance-critical applications. This paper highlights the contrasts between the asynchronous design and the traditional synchronous multicycle processor, demonstrating the benefits of asynchronous systems in terms of power consumption and performance. A significant contribution of this design is the pipeline’s completion detection mechanism, which ensures that each processing stage locks until valid results are obtained, thereby markedly enhancing system stability. Furthermore, the paper investigates the parallelization of domino gates and introduces an asynchronous Arithmetic Logic Unit (ALU), which further optimizes performance through self-locking mechanisms. The power, performance, and area (PPA) analysis of the design demonstrates considerable improvements in throughput (up to 10%) and reduced latency per instruction in comparison to its synchronous counterpart, while maintaining moderate resource utilization on an FPGA. The results indicate that asynchronous domino logic pipelines may offer a promising approach for achieving energy-efficient and high-performance processors in future computing architectures.
Article
This paper presents a high-throughput and ultralow-power asynchronous domino logic pipeline design method, targeting latch-free and extremely fine-grain or gate-level design. The data paths are composed of a mixture of dual-rail and single-rail domino gates. Dual-rail domino gates are limited to constructing a stable critical data path. Based on this critical data path, the handshake circuits are greatly simplified, which gives the pipeline high throughput as well as low power consumption. Moreover, the stable critical data path enables the adoption of single-rail domino gates in the noncritical data paths. This saves further power by reducing the overhead of logic circuits. An 8 × 8 array-style multiplier is used for evaluating the proposed pipeline method. Compared with a bundled-data asynchronous domino logic pipeline, the proposed pipeline saves up to 60.2% and 24.5% of energy in the best case and the worst case, respectively, when processing different data patterns.
Article
The objective vividly defines a new low-power and high-speed logic family; named Self Resetting Logic with Gate Diffusion Input (SRLGDI). This logic family resolves the issues in dynamic circuits like charge sharing, charge leakage, short circuit power dissipation, monotonicity requirement and low output voltage. In the proposed design structure of SRLGDI, the pull down tree is implemented with Gate Diffusion Input (GDI) with level restoration which apparently eliminated the conductance overlap between nMOS and pMOS devices, thereby reducing the short circuit power dissipation and providing High Output Voltage VoH. The output stage of SRLGDI has been incorporated with an inverter to produce both true and complementary output function. The Resistance Capacitance (RC) delay model has been proposed to obtain the total delay of the circuit during precharge and evaluation phase. Using SRLGDI, the primitive cells and 3 different full adder circuits were designed and simulated in a 0.250μm Complementary Metal Oxide Semiconductor (CMOS) process technology. The simulated result demonstrates that the proposed SRLGDI logic family is superior in terms of speed and power consumption with respect to other logic families like Dynamic Logic (DY), CMOS, Self Resetting CMOS (SRCMOS) and GDI.
Article
This thesis provides a new framework for the design of very high performance digital machines. The new theoretical results which are presented have practical implications, and lead to a better understanding of possibilities and limitations in the design of computers, communication hardware and other digital machinery. The discussion centers on different organizations for globally-asynchronous, locally-synchronous systems, and covers the following issues: organizations for complex digital systems, metastability as a limitation for high performance, structures for two classes of non-conventional architectures, optimization, performance, reliability, and design techniques. We present new algorithms to compile the specifications of such machines onto efficient circuits, and to verify the correctness of the resulting machines. The models we developed for the analysis of the tradeoffs between different variables that affect the safety of operation of these systems, show that the proposed organizations result in extremely fast and reliable digital machines. The proposed organizational schemes can be used within a wide range of architectures, and integrated circuits designed according to this methodology have been developed and tested.
Conference Paper
We describe a high-performance clocking methodology for domino pipelines. Our technique maximizes the clock rate of the circular pipeline (“ring”) while maintaining the ring cycle time to be the worst-case combinational logic delay around the ring. It is relatively immune to global clock skew, incurs no latch overhead, allows up to 50% time borrowing, and offers a robust way of preventing race-through problems, adjusted for the worst-case time borrowing.
Conference Paper
We describe a method to clock the domino pipeline at the maximum rate by using soft synchronizers between pipeline stages and thus allowing “time borrowing”, i.e., allowing input signals to arrive at a pipe stage after the clock tick. We show a robust way of placing “roadblocks” (equivalent to slave latches) in each pipe stage to maintain the optimal clock rate. As explicit latches are not required at the pipe stage boundaries, the latch overhead is eliminated. We use the self-resetting scheme to circumvent the often performance-limiting precharge timing requirements. We also address several issues regarding the testability of self-resetting domino circuits, including scan register design and multiple stuck fault testing.
Article
Over the past decade, the design of low-power processors has become a primary requirement of emerging applications such as Internet of Things (IoT) and neuromorphic chips. Therefore, there has been renewed interest in asynchronous circuits for their low power consumption and robustness. However, one of the main obstacles is the lack of commercial EDA tool support, which makes asynchronous design time-consuming and poorly suited for industrial adoption. This paper proposes a new methodology for implementing asynchronous phase-decoupled click-based circuits with traditional EDA tools. To perform static timing analysis in both the control and data paths, we capture asynchronous event propagation via generated clocks. Moreover, we present an adaptive pipeline asynchronous RISC-V processor implemented on an FPGA, the Xilinx ZCU102 board. The implementation results show that the asynchronous RISC-V processor achieves a 3x dynamic power improvement over the synchronous one with similar resource usage.
Article
Analog and mixed signal (AMS) electronics becomes increasingly complex and needs to be digitally enhanced by its own control circuitry. The RTL synthesis flow routinely used for digital logic is however optimized for synchronous data processing and produces inefficient control for AMS. In this paper we demonstrate the evident benefits of asynchronous circuits in the context of AMS systems, and propose an asynchronous design for analog electronics (A4A) flow for their specification, synthesis, and formal verification. A library of specialized analog-to-asynchronous (A2A) components is developed for interfacing analog and asynchronous worlds. A4A flow is automated in the Workcraft framework and evaluated using a multiphase buck converter case study where A2A components are employed to sanitise analog sensor readings. Timing analysis of asynchronous buck control shows improved response time: 4x faster reaction to high-load and 7x to under-voltage condition, compared with a 333MHz clocked controller (to achieve a similar response time, a clocked controller would require 3GHz frequency). The simulation results of a 4-phase asynchronous buck demonstrate improved voltage ripple and peak current – 16% and 12% reduction, respectively. These benefits lead to the higher efficiency of power conversion, and can be traded off for the cost of analog components, e.g. coils. Moreover, the use of the proposed design flow and tools helps to improve design productivity and overall robustness of AMS circuits.
Article
Digital Design and Computer Architecture takes a unique and modern approach to digital design. Beginning with digital logic gates and progressing to the design of combinational and sequential circuits, Harris and Harris use these fundamental building blocks as the basis for what follows: the design of an actual MIPS processor. SystemVerilog and VHDL are integrated throughout the text in examples illustrating the methods and techniques for CAD-based circuit design. By the end of this book, readers will be able to build their own microprocessor and will have a top-to-bottom understanding of how it works. Harris and Harris have combined an engaging and humorous writing style with an updated and hands-on approach to digital design. This second edition has been updated with new content on I/O systems in the context of general purpose processors found in a PC as well as microcontrollers found almost everywhere. The new edition provides practical examples of how to interface with peripherals using RS232, SPI, motor control, interrupts, wireless, and analog-to-digital conversion. High-level descriptions of I/O interfaces found in PCs include USB, SDRAM, WiFi, PCI Express, and others. In addition to expanded and updated material throughout, SystemVerilog is now featured in the programming and code examples (replacing Verilog), alongside VHDL. This new edition also provides additional exercises and a new appendix on C programming to strengthen the connection between programming and processor architecture.
Article
This paper presents the first concrete evaluation of Quasi Delay Insensitive (QDI) asynchronous logic in terms of Electromagnetic Compatibility (EMC). The QDI logic is evaluated by analyzing the electromagnetic emissions and the conducted susceptibility of a Data Encryption Standard (DES) crypto processor using a GTEM cell. A synchronous DES crypto processor is used as a reference. The results obtained demonstrate the potential of QDI logic in the EMC domain. The electromagnetic emissions of the asynchronous version are 4 times lower than those of the synchronous version, and the power signal required to disturb the circuits has to be 10 times higher for the asynchronous circuit than for the synchronous circuit.
Conference Paper
Pass-transistors have been the key building block for field-programmable gate array (FPGA) circuitry for many years due to the very small switch they enable. However, pass-transistor performance and reliability have been degrading with technology scaling. Transmission gates are an alternative to pass-transistors; while larger, they are more robust. We develop a new FPGA circuit optimization flow and use it to investigate the area, delay and power impact of building FPGAs out of transmission gates instead of pass-transistors in a 22nm process. Our results show that transmission gate FPGAs are 15% larger than pass-transistor FPGAs but are 10-25% faster depending on the allowable level of “gate boosting”. Without gate boosting, transmission gate FPGAs are the better option with 14% lower area-delay product. If 200mV of gate boosting is possible however, pass-transistor FPGAs remain the slightly better choice with a 2% better area-delay product. We also show that transmission gates with a separate power supply for their gate terminal enable a low-voltage FPGA with 50% less power and good delay.
Article
The third edition of Hodges and Jackson's Analysis and Design of Digital Integrated Circuits has been thoroughly revised and updated by a new co-author, Resve Saleh of the University of British Columbia. The new edition combines the approachability and concise nature of the Hodges and Jackson classic with a complete overhaul to bring the book into the 21st century. The new edition has replaced the emphasis on Bipolar with an emphasis on CMOS. The book focuses on the latest CMOS technologies and uses standard deep submicron models throughout the book. The material on memory has been expanded and updated. As well the book now includes more on SPICE simulation and new problems that reflect recent technologies. The emphasis of the book is on design, but it does not neglect analysis and has as a goal to provide enough information so that a student can carry out analysis as well as be able to design a circuit. This book provides an excellent and balanced introduction to digital circuit design for both students and professionals. Table of contents 1 Deep Submicron Digital IC Design 2 MOS Transistors 3 Fabrication, Layout and Simulation 4 MOS Inverter Circuits 5 Static MOS Gate Circuits 6 High-Speed CMOS Logic Design 7 Transfer Gate and Dynamic Logic Design 8 Semiconductor Memory Design 9 Additional Topics in Memory Design 10 Interconnect Design 11 Power Grid and Clock Design Appendix A A Brief Introduction to Spice Appendix B Bipolar Transistors and Circuits
Conference Paper
This paper presents a new self-resetting CMOS design for an add-compare-select (ACS) unit, which is a key building block in a Viterbi decoder. Static CMOS and two-phase domino CMOS designs have also been implemented for comparison purposes. The simulation results show that, with the SRCMOS technique, the ACS units operate at a data rate of 568 Mbps in a 0.25 micron CMOS technology, as compared to 357 Mbps and 485 Mbps for static and domino CMOS implementations, respectively.
Article
A new family of self-reset logic (SRL) cells is presented in this paper. The single-ended basic structure proposed realizes an incomplete logic family, since it is incapable of inverting logic. Thus, a dual-rail SRL (DRSRL) implementation is also proposed. These cells maintain small delay variations for all input combinations, once minimum timing requirements on inputs are satisfied, and produce output pulses of fairly constant width for varying fanout, leaving enough headroom in the design to accommodate process, supply voltage, and temperature variations. These properties simplify the implementation of data-path and control circuits where the logic depth does not affect the stage output pulse width, eliminating the need for pulse-width controlling circuits required in previous works on SRL. In SRL, power is consumed only if new data are pumped through the logic. The clock grid is limited to the registers that launch and receive the signal path. The clocking overhead is thus reduced, compared with other dynamic designs, and it is especially suitable for wave pipelining. Case study examples and simulated characterization data are included to show the design methodology.
Article
Low threshold voltage (Vt) can be applied to domino logic to improve the performance in dual threshold voltage technology. Then, the keeper transistor should be up-sized to compensate for reduced noise margin due to the significant subthreshold current of low Vt transistor. However, a large keeper transistor degrades performance. To resolve the tradeoff between performance and noise margin, the authors propose a new domino logic which incorporates a dual keeper structure and delay logic gates. Detailed timing analysis of the proposed domino logic yields optimal timing conditions wherein a contention-free skew-tolerant window is maximized. A broad range of the skew-tolerant window connotes robustness against noise and design parameter variations, while reduced contention between keeper and evaluation NMOS transistors ensures high-speed switching. The authors show that the dual keeper structure increases noise tolerance and delay logic gates fortify signal skew tolerance. Simulation results verify that the proposed domino logic is robust to noise and signal skew while presenting high performance and power efficiency.
Self-locked asynchronous controller for RISC-V architecture on FPGA
  • F Deeg
  • S M Sattler
Deeg F, Sattler SM (2024) Self-locked asynchronous controller for RISC-V architecture on FPGA. In AmEC 2024 - Automotive meets Electronics & Control; 15. GMM-GMA-Symposium, 1-5.
Globally asynchronous, locally synchronous circuits: Overview and outlook
  • M Krstic
  • E Grass
  • F K Gürkaynak
  • P Vivet
Krstic M, Grass E, Gürkaynak FK, Vivet P (2007) Globally asynchronous, locally synchronous circuits: Overview and outlook. IEEE Design & Test of Computers 24(5): 430-441.
Design of the RISC-V instruction set architecture
  • A Waterman
Waterman A (2016) Design of the RISC-V instruction set architecture. Available at: https://people.eecs.berkeley.edu/~krste/papers/EECS-2016-1.pdf.