Article

Improving LUT count of FPGA-based sequential blocks

Authors:
  • The Jacob of Paradies University

Article
The review is devoted to methods of structural decomposition that are used for optimizing the characteristics of finite state machine (FSM) circuits. These methods are connected with an increase in the number of logic levels in the resulting FSM circuits. They can be viewed as an alternative to methods of functional decomposition. The roots of these methods are analysed. It is shown that the first methods of structural decomposition appeared in the 1950s, together with microprogram control units. The basic methods of structural decomposition are analysed: the replacement of FSM inputs, the encoding of collections of FSM outputs, and the encoding of terms. It is shown that these methods can be used for any element basis. Additionally, the joint application of different methods is shown. The changes in these methods related to the evolution of logic elements are analysed. The application of these methods for optimizing FPGA-based FSMs is shown. Such new methods as twofold state assignment and mixed encoding of outputs are analysed. Some methods are illustrated with examples of FSM synthesis. Additionally, some experimental results are presented. These results show that the methods of structural decomposition improve the characteristics of FSM circuits.
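As a rough illustration of one idea named in this abstract, the encoding of collections of FSM outputs, the sketch below assigns a binary code to each distinct set of outputs that appears in a transition table. The table `ROWS`, the state and signal names, and the function `encode_output_collections` are hypothetical and are not taken from the review; the sketch only shows why the number of code bits can be smaller than the number of outputs.

```python
# Hypothetical Mealy FSM rows: (state, input condition, next state, outputs).
ROWS = [
    ("a1", "x1", "a2", ("y1", "y2")),
    ("a1", "~x1", "a3", ("y3",)),
    ("a2", "x2", "a3", ("y1", "y2")),
    ("a2", "~x2", "a1", ()),
    ("a3", "1", "a1", ("y2", "y4")),
]


def encode_output_collections(rows):
    """Give every distinct collection of outputs a fixed-width binary code."""
    collections = []
    for _state, _cond, _next_state, outputs in rows:
        key = frozenset(outputs)
        if key not in collections:  # keep first-seen order
            collections.append(key)
    # Minimum number of bits needed to distinguish the collections.
    width = max(1, (len(collections) - 1).bit_length())
    codes = {c: format(i, f"0{width}b") for i, c in enumerate(collections)}
    return codes, width


if __name__ == "__main__":
    codes, width = encode_output_collections(ROWS)
    for collection, code in codes.items():
        print(sorted(collection), "->", code)
    print("code bits:", width)
```

In this toy table, four outputs (y1 to y4) form four distinct collections, so two code bits are enough; a separate block would then decode those bits back into the individual outputs.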
Article
A method is proposed targeting the implementation of FPGA-based Mealy finite state machines. The main goal of the method is to reduce the number of look-up table (LUT) elements and their levels in FSM logic circuits. To achieve this, it is necessary to eliminate the direct dependence of input memory functions and FSM output functions on FSM inputs and state variables. The method is based on encoding the terms corresponding to rows of direct structure tables. In such an approach, only terms depend on FSM inputs and state variables. Other functions depend on variables representing terms. The method belongs to the group of methods of structural decomposition. The set of terms is divided into classes such that each class corresponds to a single-level LUT-based circuit. An embedded memory block (EMB) generates the codes of both the classes and the terms as elements of these classes. The joint use of LUTs and an EMB reduces the chip area occupied by the FSM circuit (as compared to its LUT-based counterpart). A simple sequential algorithm is proposed for partitioning the set of terms into a determined number of classes. The method is based on the representation of an FSM by a state transition table. However, it can be used for any known form of FSM specification. An example of synthesis is shown. The efficiency of the proposed method was investigated using a library of standard benchmarks. We compared the proposed method with some other known design methods. The investigations show that the proposed method gives better results than the other discussed methods. It allows FSM circuits to be obtained with three levels of logic and regular interconnections.
Article
A logic synthesis method for finite-state machines (FSMs) aimed at programmable array logic (PAL)-based complex programmable logic devices is proposed here. This approach consists of the simultaneous synthesis of a transition function and an output function. The main contribution is the novel multilevel optimization of an FSM. In this process, a new form of graph is used, i.e., a graph of excitations and outputs. This is a generalization of the graph of outputs that has previously been used in the process of technology mapping of multi-output functions in PAL-based programmable structures. The main idea, the theoretical background, and a precise algorithm are illustrated by means of simple examples. The proposed algorithm was compared with other approaches by synthesizing the FSM benchmarks and mapping the solutions to k-term PAL-based logic blocks. The obtained results are compared on the basis of the area (number of logic blocks) and speed (number of logic levels). The proposed approach is especially effective for larger FSMs.
Article
A method is proposed targeting a decrease in the number of LUTs in circuits of FPGA-based Mealy FSMs. The method improves hardware consumption for Mealy FSMs with the encoding of collections of output variables. The approach is based on constructing a partition for the set of internal states. Each state has two codes. It diminishes the number of arguments in input memory functions. An example of synthesis is given, along with results of investigations. The method targets rather complex FSMs, having more than 15 states.
Article
The mathematical model for designing a complex digital system is a finite state machine (FSM). Applications such as digital signal processing (DSP) and built-in self-test (BIST) require specific operations to be performed only in particular instances. Hence, the optimal synthesis of such systems requires a reconfigurable FSM. The objective of this paper is to create a framework for a reconfigurable FSM with input multiplexing and state-based input selection (Reconfigurable FSMIM-S) architecture. The Reconfigurable FSMIM-S architecture is constructed by combining the conventional FSMIM-S architecture and an optimized multiplexer bank (which defines the mode of operation). For this, the descriptions of a set of FSMs are taken for a particular application. The problem of obtaining the required optimized multiplexer bank is transformed into a weighted bipartite graph matching problem where the objective is to iteratively match the descriptions of the FSMs in the set with minimal cost. As a solution, a Hungarian algorithm based on an iterative greedy heuristic is proposed. The experimental results from MCNC FSM benchmarks demonstrate a significant speed improvement of 30.43% as compared with a variation-based reconfigurable multiplexer bank (VRMUX) and of 9.14% in comparison with a combination-based reconfigurable multiplexer bank (CRMUX) during field programmable gate array (FPGA) implementation.
Article
One of the main aspects of logic synthesis dedicated to FPGA is the problem of technology mapping, which is directly associated with the logic decomposition technique. This paper focuses on using configurable properties of CLBs in the process of logic decomposition and technology mapping. A novel theory and a set of efficient techniques for logic decomposition based on a BDD are proposed. The paper shows that logic optimization can be efficiently carried out by using multiple decomposition. The essence of the proposed synthesis method is multiple cutting of a BDD. A new diagram form called an SMTBDD is proposed. Moreover, techniques that allow finding the best technology mapping oriented to configurability of CLBs are presented. In the experimental section, the presented method (MultiDec) is compared with academic and commercial tools. The experimental results show that the proposed technology mapping strategy leads to good results in terms of the number of CLBs.
Article
The hardware implementation of deep neural networks (DNNs) has recently received tremendous attention: many applications in fact require high-speed operations that suit a hardware implementation. However, numerous elements and complex interconnections are usually required, leading to a large area occupation and copious power consumption. Stochastic computing has shown promising results for low-power area-efficient hardware implementations, even though existing stochastic algorithms require long streams that cause long latencies. In this paper, we propose an integer form of stochastic computation and introduce some elementary circuits. We then propose an efficient implementation of a DNN based on integral stochastic computing. The proposed architecture uses integer stochastic streams and a modified Finite State Machine-based tanh function to perform computations and even reduce the latency compared to conventional stochastic computation. The proposed architecture has been implemented on a Virtex7 FPGA, resulting in 44.96% and 62.36% average reductions in area and latency compared to the best reported architecture in the literature. We also synthesize the circuits in a 65 nm CMOS technology and show that they can tolerate a fault rate of up to 20% on some computations when timing violations are allowed to occur, resulting in power savings. The fault-tolerance property of the proposed architectures makes them suitable for inherently unreliable advanced process technologies such as memristor technology.
Article
Most digital systems operate on a positional representation of data, such as binary radix. An alternative is to operate on random bit streams where the signal value is encoded by the probability of obtaining a one versus a zero. This representation is much less compact than binary radix. However, complex operations can be performed with very simple logic. Furthermore, since the representation is uniform, with all bits weighted equally, it is highly tolerant of soft errors (i.e., bit flips). Both combinational and sequential constructs have been proposed for operating on stochastic bit streams. Prior work has shown that combinational logic can implement multiplication and scaled addition effectively while linear finite-state machines (FSMs) can implement complex functions such as exponentiation and tanh effectively. Prior work on stochastic computation has largely been validated empirically. This paper provides a rigorous mathematical treatment of the stochastic implementation of complex functions such as exponentiation and tanh implemented using linear FSMs. It presents two new functions, an absolute value function and exponentiation based on an absolute value, motivated by specific applications. Experimental results show that the linear FSM-based constructs for these functions have smaller area-delay products than the corresponding deterministic constructs. They are also much more tolerant of soft errors.
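The combinational constructs mentioned in this abstract can be illustrated with a small software simulation. The sketch below assumes the unipolar format (a value in [0, 1] is the probability of a one), multiplies two independent streams with a bitwise AND, and performs scaled addition with a multiplexer whose select stream has probability 0.5. The function names, stream length, and seed are illustrative choices, not taken from the paper.

```python
import random


def to_stream(p, length, rng):
    """Unipolar encoding: each bit is 1 with probability p."""
    return [1 if rng.random() < p else 0 for _ in range(length)]


def stream_value(bits):
    """Decode a stream by measuring the fraction of ones."""
    return sum(bits) / len(bits)


def sc_multiply(a_bits, b_bits):
    """AND of two independent streams: P(a AND b) = P(a) * P(b)."""
    return [a & b for a, b in zip(a_bits, b_bits)]


def sc_scaled_add(a_bits, b_bits, select_bits):
    """2-to-1 multiplexer: with P(select) = 0.5 the output encodes (a + b) / 2."""
    return [a if s else b for a, b, s in zip(a_bits, b_bits, select_bits)]


def demo(length=20000, seed=1):
    rng = random.Random(seed)
    a = to_stream(0.6, length, rng)
    b = to_stream(0.5, length, rng)
    s = to_stream(0.5, length, rng)
    return stream_value(sc_multiply(a, b)), stream_value(sc_scaled_add(a, b, s))


if __name__ == "__main__":
    product, scaled_sum = demo()
    print("0.6 * 0.5 is estimated as", product)
    print("(0.6 + 0.5) / 2 is estimated as", scaled_sum)
```

The estimates are only approximate and improve with stream length, which reflects the compactness and latency cost that the abstract notes for this representation.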
Article
Synthesis method of high speed finite state machines
The paper is concerned with the problem of state assignment and logic optimization of high speed finite state machines. The method is designed for PAL-based CPLD implementations. The main problem at hand is determining the number of logic levels of the transition function before the state encoding process and keeping the constraints during the process. The number of coding bits, as well as the codes for the states, are adjusted to achieve a machine with a determined number of logic levels. Elements of two-level minimization are taken into consideration in the state assignment. The proposed optimization method is based on utilizing tri-state buffers, thus enabling the achievement of a one-logic-level output block.
Article
The paper concerns the problem of state assignment for finite state machines (FSMs) targeting PAL-based CPLD implementations. The approach presented in the paper is dedicated to the state encoding of fast automata. The main idea is to determine the number of logic levels of the transition function before the state encoding process, and to keep the constraints during the process. The number of implicants of every single transition function must be known while assigning states, so elements of two-level minimization based on Primary and Secondary Merging Conditions are implemented in the algorithm. The method is based on code length extraction if necessary. In one of the most basic stages of the logic synthesis of sequential devices, the elements referring to constraints of PAL-based CPLDs are taken into account. Key words: state assignment, finite state machines (FSM), programmable array logic (PAL), complex programmable logic devices (CPLD).
Article
Finite State Machines (FSMs) are a key element of integrated circuits. Hard-coded FSMs do not allow changes after ASIC production. While an embedded FPGA IP core provides flexibility, it is a complex circuit, requires difficult synthesis tools, and is expensive. This article presents and evaluates a novel architecture that is specifically optimized for implementing reconfigurable finite state machines: Transition-based Reconfigurable FSM (TR-FSM). The architecture shows a considerable reduction in area, delay, and power consumption compared to FPGA architectures, while offering (nearly) FPGA-like reconfigurability.
Article
Decomposition-based logic synthesis for PAL-based CPLDs
The paper presents one concept of decomposition methods dedicated to PAL-based CPLDs. The proposed approach is an alternative to the classical one, which is based on two-level minimization of separate single-output functions. The key idea of the algorithm is to search for free blocks that could be implemented in PAL-based logic blocks containing a limited number of product terms. In order to better exploit the number of product terms, two-stage decomposition and BDD-based decomposition are to be used. In BDD-based decomposition methods, functions are represented by Reduced Ordered Binary Decision Diagrams (ROBDDs). The results of experiments prove that the proposed solution is more effective, in terms of the usage of programmable device resources, compared with the classical ones.
Article
This paper presents several orthogonal improvements to the state-of-the-art lookup table (LUT)-based field-programmable gate array (FPGA) technology mapping. The improvements target the delay and area of technology mapping as well as the runtime and memory requirements. 1) Improved cut enumeration computes all K-feasible cuts, without pruning, for up to seven inputs for the largest Microelectronics Center of North Carolina benchmarks. A new technique for on-the-fly cut dropping reduces, by orders of magnitude, the memory needed to represent cuts for large designs. 2) The notion of cut factorization is introduced, in which one computes a subset of cuts for a node and generates other cuts from that subset as needed. Two cut factorization schemes are presented, and a new algorithm that uses cut factorization for delay-oriented mapping for FPGAs with large LUTs is proposed. 3) Improved area recovery leads to mappings with the area, on average, 6% smaller than the previous best work while preserving the delay optimality when starting from the same optimized netlists. 4) Lossless synthesis accumulates alternative circuit structures seen during logic optimization. Extending the mapper to use structural choices reduces the delay, on average, by 6% and the area by 12%, compared with the previous work, while increasing the runtime 1.6 times. Performing five iterations of mapping with choices reduces the delay by 10% and the area by 19% while increasing the runtime eight times. These improvements, on top of the state-of-the-art methods for LUT mapping, are available in the package ABC.
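For readers unfamiliar with the term, K-feasible cut enumeration can be sketched in a few lines. The code below is the textbook bottom-up procedure (merge the cuts of a node's fanins and keep merged cuts with at most K leaves), not the improved enumeration, cut dropping, or cut factorization described in the paper. The graph representation and the name `enumerate_cuts` are assumptions made for this illustration.

```python
from itertools import product


def enumerate_cuts(fanins, order, k):
    """Return, for each node, the set of its K-feasible cuts.

    fanins: dict mapping an internal node to a tuple of its fanin nodes
            (primary inputs have no entry).
    order:  all nodes in topological order, inputs first.
    k:      maximum number of leaves per cut (the LUT input count).
    """
    cuts = {}
    for node in order:
        node_cuts = {frozenset([node])}  # the trivial cut {node}
        ins = fanins.get(node, ())
        if ins:
            # Combine one cut from each fanin and keep the small unions.
            for combo in product(*(cuts[i] for i in ins)):
                merged = frozenset().union(*combo)
                if len(merged) <= k:
                    node_cuts.add(merged)
        cuts[node] = node_cuts
    return cuts


if __name__ == "__main__":
    # Toy network: x = f(a, b), y = g(c, d), z = h(x, y).
    fanins = {"x": ("a", "b"), "y": ("c", "d"), "z": ("x", "y")}
    order = ["a", "b", "c", "d", "x", "y", "z"]
    for cut in sorted(enumerate_cuts(fanins, order, k=3)["z"], key=sorted):
        print(sorted(cut))
```

With K = 3, node z has the cuts {z}, {x, y}, {a, b, y}, and {c, d, x}; the cut {a, b, c, d} appears only when K is raised to 4. The number of cuts grows quickly with K, which is the memory pressure the abstract's cut-dropping technique is meant to relieve.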
Book
This book focuses on control units, which are a vital part of modern digital systems, and responsible for the efficiency of controlled systems. The model of a finite state machine (FSM) is often used to represent the behavior of a control unit. As a rule, control units have irregular structures that make it impossible to design their logic circuits using the standard library cells. Design methods depend strongly on such factors as the FSM used, specific features of the logic elements implemented in the FSM logic circuit, and the characteristics of the control algorithm to be interpreted. This book discusses Moore and Mealy FSMs implemented with FPGA chips, including look-up table elements (LUT) and embedded memory blocks (EMB). It is crucial to minimize the number of LUTs and EMBs in an FSM logic circuit, as well as to make the interconnections between the logic elements more regular, and various methods of structural decompositions can be used to solve this problem. These methods are reduced to the presentation of an FSM circuit as a composition of different logic blocks, the majority of which implement systems of intermediate logic functions different (and much simpler) than input memory functions and FSM output functions. The structural decomposition results in multilevel FSM circuits having fewer logic elements than equivalent single-level circuits. The book describes well-known methods of structural decomposition and proposes new ones, examining their impact on the final amount of hardware in an FSM circuit. It is of interest to students and postgraduates in the area of Computer Science, as well as experts involved in designing digital systems with complex control units. The proposed models and design methods open new possibilities for creating logic circuits of control units with an optimal amount of hardware and regular interconnections.
Book
This book presents a new methodology for high level and logic design of complicated digital systems. This methodology is based on Algorithmic State Machine (ASM) transformations (composition, minimization, extraction, etc.), special algorithms for Data Path and Control Unit design and a very fast optimizing synthesis of FSMs as well as combinational circuits with hardly any constraints on their size, i.e., the number of inputs, outputs and states. Design tools supporting this methodology allow us to implement, check and estimate many possible design versions very fast, to find an optimized decision of a design problem and to simplify the verification problem for digital systems.
Article
Decomposition is a technology-independent process, in which a large complex function is broken into smaller, less complex functions. The costs of two-level or factored-form representations (cubes and literals) are used in most decomposition methods, as they have a high correlation with the area of cell-based designs. However, this correlation is weaker for field-programmable gate arrays (FPGAs) based on look-up tables. Furthermore, local optimizations have limited power due to the structural bias of the circuit descriptions. This paper tries to reduce the structural biasing by remapping the LUT network and decomposing the derived functions using the support as cost function. The proposed method improves the FPGA mapping results of a commercial tool for the 20 largest MCNC benchmarks, with gains of 28% in delay plus 18% in area when targeting delay, and a reduction of 28% in area plus 14% in delay with area as cost function. Results with 23% less area and 6% less delay are obtained after physical synthesis (post place-and-route). Moreover, 12 of the best known results for delay (and 3 for area) of the EPFL benchmarks are improved.
Book
The book is composed of two parts. The first part introduces the concepts of the design of digital systems using contemporary field-programmable gate arrays (FPGAs). Various design techniques are discussed and illustrated by examples. The operation and effectiveness of these techniques are demonstrated through experiments that use relatively cheap prototyping boards that are widely available. The book begins with easily understandable introductory sections, continues with commonly used digital circuits, and then gradually extends to more advanced topics. The advanced topics include novel techniques where parallelism is applied extensively. These techniques involve not only core reconfigurable logical elements, but also use embedded blocks such as memories and digital signal processing slices and interactions with general-purpose and application-specific computing systems. Fully synthesizable specifications are provided in a hardware-description language (VHDL) and are ready to be tested and incorporated in engineering designs. A number of practical applications are discussed from areas such as data processing and vector-based computations (e.g. Hamming weight counters/comparators). The second part of the book covers the more theoretical aspects of finite state machine synthesis with the main objective of reducing basic FPGA resources, minimizing delays and achieving greater optimization of circuits and systems.
Article
Deep neural networks (DNNs) have emerged as a powerful machine learning technique for various artificial intelligence applications. Owing to its advantages in speed, area, and power, dedicated hardware design has become a very attractive solution for the efficient deployment of DNNs. However, the huge resource cost of multipliers makes fully-parallel implementations of multiplication-intensive DNNs prohibitive in many real-time, resource-constrained embedded applications. This paper proposes a fully-parallel area-efficient stochastic DNN design. By leveraging the stochastic computing (SC) technique, the computations of the DNN are implemented using very simple stochastic logic, thereby enabling a low-complexity fully-parallel DNN design. In addition, to avoid the accuracy loss incurred by the approximation of SC, we propose an accuracy-aware DNN datapath architecture to retain the test accuracy of the stochastic DNN. Moreover, we propose a novel low-complexity architecture for the binary-to-stochastic (B-to-S) interface to drastically reduce the footprint of the peripheral B-to-S circuit. Experimental results show that the proposed stochastic DNN design achieves much better hardware performance than a non-stochastic design with negligible test accuracy loss.
Article
The paper presents the theoretical background of a new concept of logic synthesis for LUT-based FPGAs. The idea of describing a multi-output function in the form of a PMTBDD diagram is proposed. This form makes it possible to carry out a simple analysis of a multi-output function using algorithms that are dedicated to single-output functions. The essence of the logic synthesis is searching for suitable PMTBDD cuttings. The choice of the PMTBDD cuttings makes it possible to obtain an adequate decomposition path. As the result of cutting a BDD diagram, SMTBDD diagrams are created. These diagrams are a generalized form of SBDD and MTBDD diagrams. The idea of choosing a cutting line that matches the LUTs included in FPGAs is also proposed. The suggested method of searching for the best technology mapping is based on an analytical description of the efficiency of mapping. Experimental results that demonstrate the efficiency of the proposed methods are also presented.
Article
The paper develops a technology-independent optimization and post-mapping resynthesis for combinational logic networks, with emphasis on scalability and optimizing power. The proposed resynthesis (a) is capable of substantial logic restructuring, (b) is customizable to solve a variety of optimization tasks, and (c) has reasonable runtime on large industrial designs. The approach is based on several heterogeneous algorithms, which include structural analysis, random and constrained simulation, and manipulation of Boolean functions using a SAT solver. Structural methods include improved windowing, which focuses on reconvergent logic structures rich in functional flexibilities. It is shown how a mainstream SAT solver can be minimally modified by combining it with an interpolation package, which computes Boolean functions of nodes after resynthesis as a by-product of completed feasibility proofs. Experimental results focusing on the minimization of the number of 6-LUTs after high-effort iterative FPGA mapping with structural choices, demonstrate that the proposed resynthesis, applied to 15 benchmarks reduced area by 6.0% and delay by 2.3% on average. For 5 benchmarks derived from PLA descriptions, the reduction of 5x in area and 20% in depth was obtained, which speaks for the powerful nature of Boolean optimization employed in the proposed resynthesis.
Conference Paper
ABC is a public-domain system for logic synthesis and formal verification of binary logic circuits appearing in synchronous hardware designs. ABC combines scalable logic transformations based on And-Inverter Graphs (AIGs), with a variety of innovative algorithms. A focus on the synergy of sequential synthesis and sequential verification leads to improvements in both domains. This paper introduces ABC, motivates its development, and illustrates its use in formal verification.
Article
Field-Programmable Gate Arrays (FPGAs) have become one of the key digital circuit implementation media over the last decade. A crucial part of their creation lies in their architecture, which governs the nature of their programmable logic functionality and their programmable interconnect. FPGA architecture has a dramatic effect on the quality of the final device's speed performance, area efficiency, and power consumption. This survey reviews the historical development of programmable logic devices, the fundamental programming technologies that the programmability is built on, and then describes the basic understandings gleaned from research on architectures. We include a survey of the key elements of modern commercial FPGA architecture, and look toward future trends in the field.
Conference Paper
This paper attempts to quantify the optimality of FPGA technology mapping algorithms. The authors developed an algorithm, based on Boolean satisfiability (SAT), that is able to map a small subcircuit into the smallest possible number of lookup tables (LUTs) needed to realize its functionality. This technique was applied iteratively to small portions of circuits that have already been technology mapped by the best available mapping algorithms for FPGAs. In many cases, the optimal mapping of the subcircuit uses fewer LUTs than is obtained by the technology mapping algorithm. It is shown that for some circuits the total area improvement could be up to 67%.
Conference Paper
We present a new theory of non-disjoint serial decomposition. We also present our new decomposition tool DEMAIN. The decomposition approach implemented in DEMAIN relies on a partition-based representation of Boolean functions and an effective balanced decomposition strategy that switches between parallel and non-disjoint serial decomposition. As a result, we applied non-disjoint serial decomposition and parallel decomposition for the efficient synthesis of FPGA-based circuits directed towards area or delay optimisation.
Article
This paper examines a number of stochastic computational elements employed in artificial neural networks, several of which are introduced for the first time, together with an analysis of their operation. We briefly include multiplication, squaring, addition, subtraction, and division circuits in both unipolar and bipolar formats, the principles of which are well-known, at least for unipolar signals. We have introduced several modifications to improve the speed of the division operation. The primary contribution of this paper, however, is in introducing several state machine-based computational elements for performing sigmoid nonlinearity mappings, linear gain, and exponentiation functions. We also describe an efficient method for the generation of, and conversion between, stochastic and deterministic binary signals. The validity of the present approach is demonstrated in a companion paper through a sample application, the recognition of noisy optical characters using soft competitive learning. Network generalization capabilities of the stochastic network maintain a squared error within 10 percent of that of a floating-point implementation for a wide range of noise levels. While the accuracy of stochastic computation may not compare favorably with more conventional binary radix-based computation, the low circuit area, power, and speed characteristics may, in certain situations, make them attractive for VLSI implementation of artificial neural networks.
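A state machine-based sigmoid element of the kind mentioned in this abstract can be simulated with a saturating up/down counter. The sketch below follows one construction commonly described in the stochastic computing literature and is not claimed to match the paper's circuit: the input is a bipolar stream (value x in [-1, 1], probability of a one equal to (x + 1) / 2), the counter moves up on a one and down on a zero, and the output bit is one while the counter is in its upper half. For an N-state counter, the decoded output is approximately tanh(N x / 2). The name `fsm_tanh_stream` and all parameter defaults are assumptions.

```python
import math
import random


def fsm_tanh_stream(x, n_states=8, length=20000, seed=1):
    """Estimate tanh(n_states / 2 * x) with a saturating-counter FSM."""
    rng = random.Random(seed)
    p_one = (x + 1) / 2          # bipolar encoding of x
    state = n_states // 2        # start in the middle of the chain
    ones = 0
    for _ in range(length):
        if rng.random() < p_one:
            state = min(state + 1, n_states - 1)  # count up, saturate at top
        else:
            state = max(state - 1, 0)             # count down, saturate at 0
        ones += state >= n_states // 2            # output bit: upper half
    return 2 * ones / length - 1                  # decode bipolar output


if __name__ == "__main__":
    for x in (-0.5, -0.2, 0.2, 0.5):
        print(x, fsm_tanh_stream(x), math.tanh(4 * x))
```

The match to tanh is an approximation that is tightest for small |x|; the estimate also carries sampling noise that shrinks as the stream gets longer.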
I. Skliarova, V. Sklyarov, and A. Sudnitson, Design of FPGA-based circuits using Hierarchical Finite State Machines, Tallinn: TUT Press, 2012.
Intel, "Intel SoC FPGA Embedded Development Suite User Guide". [Online]. https://www.intel.com/content/www/us/en/programmable/documentation/lro1402536290550.html (accessed: May, 2020).
Xilinx, "Zynq UltraScale+ MPSoC". [Online]. https://www.xilinx.com/products/silicon-devices/soc/zynq-ultrascale-mpsoc.html#productTable (accessed: May, 2020).
G. Stringham, Hardware/firmware Interface Design: Best Practices for Improving Embedded Systems Development, Newnes, 2010.
I. Skliarova and V. Sklyarov, FPGA-BASED hardware accelerators, Springer, 2019.
T. Łuba, M. Rawski, and Z. Jachna, "Functional Decomposition as a universal method for logic synthesis of digital circuits", in Proceedings of IX International Conference MIXDES'02, 2002, pp. 285-290.
M. Arora, Embedded System Design, Introduction to SoC System Architecture, Learning Bytes Publishing, 2016.
O. Barkalov, L. Titarenko, and M. Mazurkiewicz, Foundations of Embedded Systems, Springer, 2019.
Xilinx, "Virtex-7 family overview". [Online]. https://www.xilinx.com/products/silicon-devices/fpga/virtex-7.html (accessed: May, 2020).
E. Sentovich, et al., "SIS: a system for sequential circuit synthesis", in Proc. of the Inter. Conf. of Computer Design (ICCD'92), 1992, pp. 328-333.
B. Lin, "Synthesis of multiple-level logic from symbolic high-level description languages", in IFIP International Conference on Very Large Scale Integration, 1989, pp. 187-196.