Conference Paper

Two-stage physical synthesis for FPGAs


Abstract

This paper presents an overview of an industrial physical synthesis CAD flow for FPGAs. The flow provides a performance speedup of 10%-15% for most circuits, and a significant number of circuits show a speedup of 20%-180%. We describe the algorithms used to achieve this result including: incremental retiming, BDD-based resynthesis, local rewiring, and logic replication. The effectiveness of these operations depends on the ability to accurately determine which portions of logic are timing critical at a stage of the CAD flow where there is still freedom to perform logic restructuring. We show how this problem can be effectively solved by inserting prediction and restructuring operations at multiple points of the FPGA CAD flow.


... Results from the Timing Analyzer complete the timing analysis step, which is then followed by generation of a programming bit stream. The above steps are depicted in Figure Physical Synthesis is a CAD flow that can evaluate its own effectiveness and adjust its algorithms to iteratively improve its results [Singh05]. This is done by using the results of timing analysis, power analysis and others, to improve the results produced by each stage of the CAD flow. ...
... Its benefits have mostly been shown [Singh05] on commercial FPGAs, because they facilitate a wider array of implementations for logic functions. These commercial FPGAs contain dedicated circuitry to improve the performance of arithmetic circuits as well as more complex LUT based structures that can further improve the implementation of common logic functions. ...
... The approach proposed by Lou et al. [Lou99] breaks this boundary by having the synthesis stage provide several mapping solutions for a subcircuit it considers to be good. The placer then chooses the mapping solution to improve the speed of the logic circuit, since speed is easier to estimate during placement and routing stages. In the context of commercial FPGAs, a complete physical synthesis flow has been implemented by Singh et al. [Singh05] for the Altera Stratix and Stratix II devices. The proposed flow follows the idea of Physical Synthesis and focuses on two areas of optimization: post-technology mapping and post-placement. ...
... To use this pre-placement delay prediction, we split physical synthesis into an early and late stage [15]. Early physical synthesis takes place before placement and late physical synthesis takes place after placement. ...
... Our modified version of the timing-driven physical synthesis CAD flow splits optimizations performed into two stages: early physical synthesis, which occurs prior to placement, and late physical synthesis, which occurs after placement [15]. Early physical synthesis performs timing-driven optimizations with a coarse estimate of delays. ...
Article
This paper studies the prediction of interconnect delay in an industrial setting. Industrial circuits and two industrial field-programmable gate-array (FPGA) architectures were used in this paper. We show that there is a large amount of inherent randomness in a state-of-the-art FPGA placement algorithm. Thus, it is impossible to predict interconnect delay with a high degree of accuracy. Furthermore, we show that a simple timing model can be used to predict some aspects of interconnect timing with just as much accuracy as predictions obtained by running the placement tool itself. Using this simple timing model in a two-phase timing driven physical synthesis flow can both improve quality of results and decrease runtime. Next, we present a metric for predicting the accuracy of our interconnect delay model and show how this metric can be used to reduce the runtime of a timing driven physical synthesis flow. Finally, we examine the benefits of using the simple timing model in a timing driven physical synthesis flow, and attempt to establish an upper bound on these possible gains, given the difficulty of interconnect delay prediction.
... In the domain of microprocessors, an integrated physical synthesis timing closure methodology was proposed in [9]. In [10], Singh et al. presented an FPGA-specific industrial physical synthesis CAD flow. The same authors described a retiming-based incremental optimization flow in [11]. ...
Conference Paper
State-of-the-art FPGA design has become a very complex process, primarily due to the aggressive timing requirements of the designs. Designers spend a significant amount of time and effort trying to close timing on their latest designs. In that timing closure methodology, physical synthesis plays a key role in boosting design performance. In traditional approaches, the user performs placement followed by physical synthesis. As design complexity increases, physical synthesis cannot perform all the optimization steps because of the physical constraints imposed by the placement operation. In this work, we propose an interactive methodology to perform physical synthesis in the pre-placement stage of the FPGA timing closure flow. The approach works in two iterations of the design flow. In the first iteration, the designer performs the regular post-placement physical synthesis operation on the design. That phase automatically writes a replayable file containing information about all the optimization actions. The file also contains all the attempted optimization moves that physical synthesis deemed beneficial from a QoR perspective but was not able to accept because of physical constraints. In the second iteration of the design flow, the designer performs all those physical synthesis optimizations by importing the replayable file in the pre-placement stage. In addition to replaying the physical synthesis flow's changes, this iteration also performs the optimizations that were not possible in the traditional physical synthesis flow. After these changes are made in the logical stage of the design flow, the crucial placement step can adapt to the optimized netlist structure. As a result, this approach will greatly help users reach their challenging timing closure goals. We have evaluated the effectiveness and performance of our proposed approach on a large set of industrial designs. All these designs were targeted towards the latest Xilinx UltraScale™ devices. Our experimental data indicates that the proposed approach improves design performance by 4% to 5% on average.
... The technique can be modified to prevent the selection of critical combinational nodes. We first perform a timing analysis step using a statistical delay model described in [12]. The expected slack (ES) of a cone of logic after mapping to memory can then be estimated using the minimum expected slack of all outputs: ...
Conference Paper
This work describes a new mapping technique, RAM-MAP, that identifies parts of circuits that can be efficiently mapped into the synchronous embedded memories found on field programmable gate arrays (FPGAs). Previous techniques developed for mapping into asynchronous embedded memories cannot be used because modern FPGAs do not have asynchronous embedded memories. After technology mapping, an area-prediction cost function is used to guide the selection of logic cones to be placed in embedded memories. Extra logic is added to compensate for missing asynchronous functionality on the synchronous memories. Experiments conducted on Altera's Stratix device family indicate that this embedded memory mapping technique can provide an average area reduction of 6.2% and up to 32.5% on a large set of industrial designs. A small architecture change that increases the size of the FPGA fabric by 0.05% can increase the average area reduction to 14.1% and up to 59.1% on the same design set.
... If we can predict interconnect delay with reasonable accuracy before placement has taken place, we can perform many of the physical synthesis transformations much earlier in the CAD flow and avoid the computational cost of performing placement legalization. Starting with v5.0 of the Quartus II software, physical synthesis is split into an early and late stage [15]. Early physical synthesis takes place before placement and late physical synthesis takes place after placement. ...
Conference Paper
This paper studies the difficulty of predicting interconnect delay in an industrial setting. Fifty industrial circuits, Altera's Quartus II CAD software, and Altera's Stratix and Stratix II FPGA architectures were used in the study. We show that there is a large amount of inherent randomness in a state-of-the-art FPGA placement algorithm. Thus, it is impossible to predict interconnect delay with a high degree of accuracy. Furthermore, we show that a simple timing model can be used to predict some aspects of interconnect timing with just as much accuracy as predictions obtained by running the placement tool itself. Finally, we examine the benefits of using the simple timing model in a timing driven physical synthesis flow, and attempt to establish an upper bound on these possible gains, given the difficulty of interconnect delay prediction. ... of interconnect delay even before the physical design steps (placement and routing) in an FPGA CAD flow are executed is a difficult yet important problem. The ability to predict interconnect delay early in the CAD flow offers two advantages. First, the timing driven restructuring operations carried out during the early CAD steps (synthesis and technology mapping) can be made much more effective if interconnect delay can be predicted with reasonable accuracy. Second, the delay predictions can be used to provide feed-
... In Manohararajah et al. [2006], prediction is used to mitigate the variability and long run times of commercial place and route tools for estimating interconnect delay. Other issues including timing [Xu and Kurdahi 1999], routability [Brown et al. 1993], interconnect planning [Singh and Marek-Sadowska 2002], and routing delay [Singh et al. 2005] are explored via prediction. Performance is also explored by modeling issues such as power [Degalahal and Tuan 2005] and wafer yield [Maidee and Bazargan 2006]. ...
Article
Full-text available
While the promise of achieving speedup and additional benefits such as high performance per watt with FPGAs continues to expand, chief among the challenges with the emerging paradigm of reconfigurable computing is the complexity in application design and implementation. Before a lengthy development effort is undertaken to map a given application to hardware, it is important that a high-level parallel algorithm crafted for that application first be analyzed relative to the target platform, so as to ascertain the likelihood of success in terms of potential speedup. This article presents the RC Amenability Test, or RAT, a methodology and model developed for this purpose, supporting rapid exploration and prediction of strategic design tradeoffs during the formulation stage of application development.
... The benefit of doing this is that delays can be modeled accurately after placement since cell positions are known. This incremental process is known as physically-driven synthesis, which we will refer to as physical synthesis [12]. [Figure panels: (a) Legal Placement, (b) Duplication, (c) Incremental Placement] The optimizations that occur during physical synthesis will usually lead to a placement that is illegal. ...
Conference Paper
While physically driven synthesis techniques have proven to be an effective method to meet tight timing constraints required by a design, the incremental placement step during physically driven synthesis has emerged as the primary bottleneck. As a solution, this paper introduces a scalable incremental placement algorithm based upon the well known transportation problem. This method has an average speedup of 2× and a 30% reduction in memory usage when compared against a commercial incremental placer without any impact on area or speed of the final placed circuit. Furthermore, this method is scalable for structured ASICs.
Conference Paper
This paper describes architectural enhancements in the Altera Stratix 10 HyperFlex FPGA architecture, fabricated in the Intel 14nm FinFET process. Stratix 10 includes ubiquitous flip-flops in the routing to enable a high degree of pipelining. In contrast to the earlier architectural exploration of pipelining in pass-transistor based architectures, the direct drive routing fabric in Stratix-style FPGAs enables an extremely low-cost pipeline register. The presence of ubiquitous flip-flops simplifies circuit retiming and improves performance. The availability of predictable retiming affects all stages of the cluster, place and route flow. Ubiquitous flip-flops require a low-cost clock network with sufficient flexibility to enable pipelining of dozens of clock domains. Different cost/performance tradeoffs in a pipelined fabric and use of a 14nm process lead to other modifications to the routing fabric and the logic element. User modification of the design enables even higher performance, averaging 2.3X faster in a small set of designs.
Article
Full-text available
Retiming relocates registers in a circuit to shorten the clock cycle time. In deep sub-micron era, conventional pre-layout retiming cannot work properly because of dominant interconnection delay that is not available before layout. Although some retiming algorithms incorporating interconnection delay have been proposed, layout information is still not utilized effectively nor efficiently. Retiming and layout is combined for the first time in this paper. We present heuristics for two key problems: interconnection delay estimation and post-retiming incremental placement. An efficient retiming algorithm incorporating interconnection delay is also proposed. Experimental results show that on the average we can improve the circuit speed by 5.4% targeted toward a 0.5 µm CMOS technology. Scaling down the technology to 0.1 µm, as much as 25.6% improvement have been achieved.
1 Introduction
Retiming is a sequential logic optimization technique proposed by Leiserson and Saxe [1]. It relocates registers to reduce the cycle time and/or area while preserving the functionality. Much effort has been made for retiming. Some applied retiming to reduce power [6], to improve testability [7], or for latch-based circuits [8]. Some effort made retiming practical by controlling the initial state of the circuit [9] or reducing the run time [4]. However, without accurate interconnection delay, above-mentioned approach cannot get the best circuit performance when the process technology is down to half-micron or below. This is because the interconnection delay has become the dominant part of the path delay and the interconnection delay is difficult to measure before placement and routing. Post-layout retiming is a possible approach to take interconnection delay into account. With layout information back-annotated, we can estimate the delay and retime the circuit accordingly. Incremental placement and routing techniques must also be developed in order to retain most P&R decision from the previous iteration.
Figure 1: Estimation for three kinds of interconnection delay: (a) adding a register, (b) deleting a register, and (c) an unchanged wire.
In this paper, we propose a post-layout retiming technique. Heuristics have been developed for interconnection delay estimation, retiming incorporating interconnection delay, and post-retiming placement. The rest of this paper is organized as follows. Three key problems and previous work are described in Section 2. Our system is introduced in Section 3. The retiming algorithm is proposed in Section 4. Heuristics for interconnection delay estimation and post-retiming placement are proposed in Section 5 and 6, respectively. Experimental results are described in Section 7. Finally, Section 8 concludes this paper.
2 Key Problems and Previous Work
We have to deal with three key problems: retiming incorporating interconnection delay, interconnection delay estimation, and post-retiming placement. Several retiming algorithms taking into account interconnection delay have been proposed [2][3]. If we can annotate a circuit with the interconnection delay value changed after retiming as shown in Figure 1, algorithms in [2][3] would give the optimal retiming solutions. If the path delay monotonicity constraint or the one-way extendable property cannot be satisfied, the algorithms used in [2] and [3] are time-consuming. Nevertheless, due to the placement variation, neither the path delay monotonicity constraint nor the one-way extendable property can be easily satisfied. Using faster retiming algorithms with sub-optimal solutions may be acceptable because, after post-retiming placement, the optimal retiming solution might no longer be optimal. How to estimate the interconnection delay for further retiming is more important. Three kinds of interconnection delay as shown in Figure 1 need to be estimated: adding a register to a wire, deleting a register from a wire, and an unchanged wire. The delay of an unchanged wire can simply be assumed as the original delay. The delay of a register-deleted wire is also easily estimated by generating a routing.
Conference Paper
Full-text available
This paper describes the Altera Stratix II™ logic and routing architecture. This architecture features a novel adaptive logic module (ALM) that is based on a 6-LUT, but can be partitioned into two smaller LUTs to efficiently implement circuits containing a range of LUT sizes that arises in conventional synthesis flows. This provides a performance increase of 15% in the Stratix II architecture while reducing area by 2%. The ALM also includes a more powerful arithmetic structure that can perform two bits of arithmetic per ALM, and perform a sum of up to three inputs. The routing fabric adds a new set of fast inputs to the routing multiplexers for another 3% improvement in performance, while other improvements in routing efficiency cause another 6% reduction in area. These changes in combination with other circuit and architecture changes in Stratix II contribute 27% of an overall 51% performance improvement (including architecture and process improvement). The architecture changes reduce area by 10% in the same process, and by 50% after including process migration.
Conference Paper
Full-text available
As feature sizes shrink to deep sub-micron, the performance of VLSI chips becomes dominated by the interconnect delay. In a traditional top-down design flow, logic synthesis algorithms optimize gate area or delay without accurate interconnect delay because of lack of physical design information. Thus, the effectiveness of the optimization techniques is limited. We integrate logic synthesis and physical design into an iterative procedure for performance optimization. The logic synthesis process can optimize circuit delay based on accurate interconnect delay information extracted from the physical design. The physical design tools can refine the layout incrementally with the engineering change information and changed netlist passed from the logic synthesis process. In this thesis, we integrate logic decomposition, gate sizing and buffer insertion to work together to improve the circuit speed. Experimental results on a set of benchmark circuits show that the techniques are indeed effective
Conference Paper
Full-text available
The authors present techniques to integrate interconnection optimization with logic restructuring and technology decomposition phases of logic synthesis. The approach is based on a point placement of a Boolean network which is used to guide the synthesis process by providing accurate estimates on wiring area and delay. The placement solution is incrementally updated as intermediate Boolean nodes are extracted or eliminated during the decomposition or elimination procedures. Combining these techniques with layout-driven technology mapping makes it possible to produce a synthesis solution and a `companion' placement solution for a given combinational logic circuit simultaneously. Using these techniques, circuits with smaller area and higher performance can be generated
Article
Full-text available
In this paper we present a new data structure for representing Boolean functions and an associated set of manipulation algorithms. Functions are represented by directed, acyclic graphs in a manner similar to the representations introduced by Lee [1] and Akers [2], but with further restrictions on the ordering of decision variables in the graph. Although a function requires, in the worst case, a graph of size exponential in the number of arguments, many of the functions encountered in typical applications have a more reasonable representation. Our algorithms have time complexity proportional to the sizes of the graphs being operated on, and hence are quite efficient as long as the graphs do not grow too large. We present experimental results from applying these algorithms to problems in logic design verification that demonstrate the practicality of our approach.
Article
A data structure is presented for representing Boolean functions and an associated set of manipulation algorithms. Functions are represented by directed, acyclic graphs in a manner similar to the representations introduced by C. Y. Lee (1959) and S. B. Akers (1978), but with further restrictions on the ordering of decision variables in the graph. Although, in the worst case, a function requires a graph where the number of vertices grows exponentially with the number of arguments, many of the functions encountered in typical applications have a more reasonable representation. The algorithms have time complexity proportional to the sizes of the graphs being operated on, and hence are quite efficient as long as the graphs do not grow too large. Experimental results are presented from applying these algorithms to problems in logic design verification that demonstrate the practicality of the approach.
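The two abstracts above describe Boolean functions stored as directed acyclic graphs with a fixed ordering of decision variables. A minimal Python sketch of that idea follows. It is an illustration only, not the papers' own algorithms: the class name `BDD`, the methods `mk`, `var`, and `apply`, and the use of a hash-based unique table (a common later formulation) are all assumptions made here. It shows why the ordering restriction matters: with reduction, two equal functions end up as the same node, so equivalence checking becomes an integer comparison.

```python
import operator


class BDD:
    """Tiny reduced, ordered BDD manager (hypothetical names, illustration only)."""

    def __init__(self, num_vars):
        self.num_vars = num_vars
        self.unique = {}   # (var, low, high) -> node id; shares identical subgraphs
        self.info = {}     # node id -> (var, low, high)
        self.next_id = 2   # ids 0 and 1 are reserved for the terminal nodes

    def mk(self, var, low, high):
        """Return the node for 'if var then high else low', applying both reductions."""
        if low == high:                # the test on var is redundant
            return low
        key = (var, low, high)
        if key not in self.unique:     # otherwise reuse the existing node
            self.unique[key] = self.next_id
            self.info[self.next_id] = key
            self.next_id += 1
        return self.unique[key]

    def var(self, i):
        """Node for the single variable x_i."""
        return self.mk(i, 0, 1)

    def apply(self, op, f, g, memo=None):
        """Combine two functions with a binary Boolean operator via Shannon expansion."""
        if memo is None:
            memo = {}
        if f < 2 and g < 2:            # both operands are terminals
            return int(op(bool(f), bool(g)))
        if (f, g) in memo:
            return memo[(f, g)]
        # Terminals sort after every variable in the ordering.
        fv = self.info[f][0] if f > 1 else self.num_vars
        gv = self.info[g][0] if g > 1 else self.num_vars
        top = min(fv, gv)
        f0, f1 = self.info[f][1:] if fv == top else (f, f)
        g0, g1 = self.info[g][1:] if gv == top else (g, g)
        result = self.mk(top,
                         self.apply(op, f0, g0, memo),
                         self.apply(op, f1, g1, memo))
        memo[(f, g)] = result
        return result


bdd = BDD(num_vars=2)
a, b = bdd.var(0), bdd.var(1)
a_and_b = bdd.apply(operator.and_, a, b)

# De Morgan: NOT(a AND b) == (NOT a) OR (NOT b); negation is XOR with constant 1.
lhs = bdd.apply(operator.xor, a_and_b, 1)
rhs = bdd.apply(operator.or_,
                bdd.apply(operator.xor, a, 1),
                bdd.apply(operator.xor, b, 1))
print(lhs == rhs)
```

Because every node is created through `mk`, the memoized `apply` visits each pair of operand nodes at most once, which is in the spirit of the abstracts' claim that running time tracks the sizes of the graphs being operated on.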
Conference Paper
This paper presents an algorithm to update the placement of logic elements when given an incremental netlist change. Specifically, these algorithms are targeted to incrementally place logic elements created by layout-driven circuit restructuring techniques. The incremental placement engine assumes that the restructuring algorithms provide a list of new logic elements along with preferred locations for each of these new elements. It then tries to shift non-critical logic elements in the original placement out of the way to satisfy the preferred location requests. Our algorithm considers modern FPGA architectures with clustered logic blocks that have numerous architectural constraints. Experiments indicate that our technique produces results of extremely high quality.
Article
Retiming a netlist is an often-studied problem that is not often implemented in practice. Retiming is a relatively dangerous operation in the synthesis flow due to its effects on simulation, verification, debugging, issues such as meta-stability, and because the timing visibility early in the CAD flow is significantly less than would be desired. Many published algorithms also ignore important issues such as register compatibility due to secondary signals, don't touch constraints, common FPGA hardware such as RAM and carry chains and various illegal forms of register moves. In this paper we will discuss our solution to the retiming problem which takes all these issues into account, provide empirical evidence of the viability of retiming on large industrial netlists, discuss the expected gains from retiming, and introduce better ways of analyzing and quantifying these gains. Despite the many additional constraints added to the problem, we show a 5.0% mean improvement in performance with retiming vs. without on a large set of industrial designs, and further that implementing retiming without attention to legality overstates the gain by half.
Article
This paper explores circuit optimization within a graph-theoretic framework. The vertices of the graph are combinational logic elements with assigned numerical propagation delays. The edges of the graph are interconnections between combinational logic elements. Each edge is given a weight equal to the number of clocked registers through which the interconnection passes. A distinguished vertex, called the host, represents the interface between the circuit and the external world. This paper shows how the technique of retiming can be used to transform a given synchronous circuit into a more efficient circuit under a variety of different cost criteria. We give an easily programmed O(|V|^3 lg |V|) algorithm for determining an equivalent circuit with the smallest possible clock period. We show how to improve the asymptotic time complexity by reducing this problem to an efficiently solvable mixed-integer linear programming problem. We also show that the problem of determining an equivalent circuit with minimum state (total number of registers) is the linear-programming dual of a minimum-cost flow problem, and hence can also be solved efficiently. The techniques are general in that many other constraints can be handled within the graph-theoretic framework.
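The graph model in the abstract above (vertices with propagation delays, edges weighted by register counts, a zero-delay host vertex) can be sketched in a few lines of Python. This is only an illustration of the model, not the paper's clock-period minimization algorithm. The function names `clock_period` and `retime`, the dictionary-based graph encoding, and the example circuit are assumptions made here. The sketch uses the standard retiming rule that a label r(v) changes the weight of edge u -> v to w(u, v) + r(v) - r(u), and that a retiming is legal only if no edge weight becomes negative.

```python
from collections import defaultdict


def clock_period(delay, edges):
    """Largest total vertex delay along any path whose edges carry no registers."""
    succ = defaultdict(list)
    indeg = {v: 0 for v in delay}
    for (u, v), w in edges.items():
        if w == 0:                     # only register-free edges extend a path
            succ[u].append(v)
            indeg[v] += 1
    arrival = dict(delay)              # a path may consist of a single vertex
    ready = [v for v in delay if indeg[v] == 0]
    visited = 0
    while ready:                       # longest path in topological order
        u = ready.pop()
        visited += 1
        for v in succ[u]:
            arrival[v] = max(arrival[v], arrival[u] + delay[v])
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    if visited != len(delay):
        raise ValueError("register-free cycle: not a valid synchronous circuit")
    return max(arrival.values())


def retime(edges, r):
    """Apply retiming labels r: each edge u->v gets w + r[v] - r[u] registers."""
    new_edges = {}
    for (u, v), w in edges.items():
        w_r = w + r[v] - r[u]
        if w_r < 0:
            raise ValueError(f"illegal retiming: edge {u}->{v} would hold {w_r} registers")
        new_edges[(u, v)] = w_r
    return new_edges


# Hypothetical example: host h (delay 0) and three logic elements of delay 3.
delay = {"h": 0, "a": 3, "b": 3, "c": 3}
edges = {("h", "a"): 1, ("a", "b"): 0, ("b", "c"): 0, ("c", "h"): 1}

before = clock_period(delay, edges)
# r(a) = -1 moves the register on h->a across vertex a onto a->b.
after = clock_period(delay, retime(edges, {"h": 0, "a": -1, "b": 0, "c": 0}))
print(before, after)  # 9 6
```

In this example the total number of registers on the cycle is unchanged; only their positions move, which is what shortens the longest register-free path from a, b, c to b, c.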
Article
This work explores the effect of adding a timing driven functional decomposition step to the traditional field programmable gate array (FPGA) CAD flow. Once placement has completed, alternative decompositions of the logic on the critical path are examined for potential delay improvements. The placed circuit is then modified to use the best decompositions found. Any placement illegalities introduced by the new decompositions are resolved by an incremental placement step. Experiments conducted on Altera's Stratix and Stratix II device families indicate that this functional decomposition technique can provide average performance improvements of 6.1% and 5.6% on a large set of industrial designs, respectively.
Article
In this paper, we study the problem of placement-driven technology mapping for table-lookup based FPGA architectures to optimize circuit performance. Early work on technology mapping for FPGAs such as Chortle-d[14] and Flowmap[3] aim to optimize the depth of the mapped solution without consideration of interconnect delay. Later works such as Flowmap-d[7], Bias-Clus[4] and EdgeMap consider interconnect delays during mapping, but do not take into consideration the effects of their mapping solution on the final placement. Our work focuses on the interaction between the mapping and placement stages. First, the interconnect delay information is estimated from the placement, and used during the labeling process. A placement-based mapping solution which considers both global cell congestion and local cell congestion is then developed. Finally, a legalization step and detailed placement is performed to realize the design. We have implemented our algorithm in a LUT based FPGA technology mapping package named PDM (Placement-Driven Mapping) and tested the implementation on a set of MCNC benchmarks. We use the tool VPR[1][2] for placement and routing of the mapped netlist. Experimental results show the longest path delay on a set of large MCNC benchmarks decreased by 12.3% on the average.
Article
This work explores the effect of adding a simple functional decomposition step to the traditional field programmable gate array (FPGA) CAD flow. Once placement has completed, alternative decompositions of the logic on the critical path are examined for potential delay improvements. The placed circuit is then modified to use the best decompositions found. Any placement illegalities introduced by the new decompositions are resolved by an incremental placement step. Experiments conducted on Altera's Stratix chips indicate that this functional decomposition technique can provide a performance improvement of 7.6% on average, and up to 26.3% on a set of industrial designs.
Article
The purpose of this paper is to introduce a modified packing and placement algorithm for FPGAs that utilizes logic duplication to improve performance. The modified packing algorithm was designed to leave unused basic logic elements (BLEs) in timing critical clusters, to allow potential targets for logic duplication. The modified placement algorithm consists of a new stage after placement in which logic duplication is performed to shorten the length of the critical path. In this paper, we show that in a representative FPGA architecture using .18 µm technology, the length of the final critical path can be reduced by an average of 14.1%. Approximately half of this gain comes directly from the changes to the packing algorithm while the other half comes from the logic duplication performed during placement.
Article
This paper describes a circuit transformation called retiming in which registers are added at some points in a circuit and removed from others in such a way that the functional behavior of the circuit as a whole is preserved. We show that retiming can be used to transform a given synchronous circuit into a more efficient circuit under a variety of different cost criteria. We model a circuit as a graph in which the vertex set V is a collection of combinational logic elements and the edge set E is the set of interconnections, each of which may pass through zero or more registers. We give an O(|V||E| lg |V|) algorithm for determining an equivalent retimed circuit with the smallest possible clock period. We show that the problem of determining an equivalent retimed circuit with minimum state (total number of registers) is polynomial-time solvable. This result yields a polynomial-time optimal solution to the problem of pipelining combinational circuitry with minimum register cost. We also give a characterization of optimal retiming based on an efficiently solvable mixed-integer linear-programming problem.
Article
Timing Analysis is a design automation program that assists computer design engineers in locating problem timing in a clocked, sequential machine. The program is effective for large machines because, in part, the running time is proportional to the number of circuits. This is in contrast to alternative techniques such as delay simulation, which requires large numbers of test patterns, and path tracing, which requires tracing of all paths. The output of Timing Analysis includes “slack” at each block to provide a measure of the severity of any timing problem. The program also generates standard deviations for the times so that a statistical timing design can be produced rather than a worst case approach. This system has successfully detected all but a few timing problems for the IBM 3081 Processor Unit (consisting of almost 800,000 circuits) prior to the hardware debugging of timing. The 3081 is characterized by a tight statistical timing design.
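The "slack" measure mentioned above can be sketched as a block-oriented pass over the circuit graph: one forward sweep for arrival times, one backward sweep for required times. This is a minimal deterministic illustration, not the statistical analysis the abstract describes; the function name, the four-block circuit, and the single clock-period constraint at the outputs are assumptions. Each block is visited a constant number of times, which is consistent with the running time being proportional to circuit size.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def slacks(delay, preds, clock_period):
    """Return slack per block: required time minus arrival time at its output."""
    order = list(TopologicalSorter(preds).static_order())

    # Forward sweep: latest arrival time at each block's output.
    arrival = {}
    for v in order:
        arrival[v] = delay[v] + max((arrival[p] for p in preds.get(v, ())), default=0)

    # Build successor lists for the backward sweep.
    succs = {v: [] for v in order}
    for v, ps in preds.items():
        for p in ps:
            succs[p].append(v)

    # Backward sweep: latest time each output may settle without violating the period.
    required = {}
    for v in reversed(order):
        required[v] = min((required[s] - delay[s] for s in succs[v]), default=clock_period)

    return {v: required[v] - arrival[v] for v in order}


# Hypothetical circuit: a and b feed c, which feeds d.
preds = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"]}
delay = {"a": 2, "b": 4, "c": 3, "d": 1}

print(slacks(delay, preds, clock_period=10))
```

Here blocks b, c and d lie on the longest path and share a slack of 2, while a has a slack of 4; a negative slack would flag a timing problem.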
Conference Paper
Retiming is a synchronous circuit transformation that can optimize the delay of a synchronous circuit by moving registers across combinational circuit elements. The combinational structure remains unchanged and the observable behavior of the circuit is identical to the original. In this paper, we address the problem of applying retiming techniques to circuits implemented in Field Programmable Gate Arrays (FPGAs). FPGAs contain prefabricated and configurable routing elements that allow us to easily implement a variety of circuits. However, this interconnect contributes greatly to the overall delay in the implemented circuit. If a circuit is retimed prior to the placement and routing phases of the CAD flow, then it has no information about the delays introduced by the configurable interconnect. Our fundamental experiment is to determine whether there are any gains in tightly coupling retiming and placement so that the retiming algorithm has some estimate of the routing delays. Specifically, we introduce a post-placement retiming algorithm that understands how to take advantage of FPGA architectural features. This retiming algorithm may introduce extra registers into the circuit. These new registers need to be placed in some location in the FPGA. Retiming register placement is accomplished by a novel incremental clustering and placement algorithm. The incremental algorithm builds upon the placement of the non-retimed circuit to intelligently sift in the newly-introduced registers. In addition, we explore making the placement algorithms "retiming aware." These placement algorithms try to place logic blocks in such a way that the subsequent retiming produces better speed results. These techniques include the identification of retiming-critical cycles during placement. Our experiments show that the integration of retiming with placement results in 19% better clock periods in comparison to the application of retiming before the place and route steps.
Conference Paper
Circuits implemented in FPGAs have delays that are dominated by their programmable interconnect. This interconnect provides the ability to implement arbitrary connections. However, it contains both highly capacitive and resistive elements. The delay encountered by any connection depends strongly on the number of interconnect elements used to route the connection. These delays are only completely known after the place and route phase of the CAD flow. We propose the use of Clock Shifting optimization techniques to improve the clock frequency as a post place and route step. Clock Shifting Optimization is a technique first formalized in [4]. It is a cycle-stealing algorithm that allows one to reduce the critical path delay of a synchronous circuit by shifting the clock signals at each register. This technique allows late arriving signals to be sampled at a later point in time by intentionally introducing a skew on the clock input of the sampling register. Typical FPGAs contain a number of special purpose global clock networks that distribute clock signals to every register in the chip. Unused global clock lines in FPGAs can be used to distribute a finite set of clock skews to the entire circuit. We propose an efficient integer programming method to find the optimal circuit improvement for a finite set of clock skews. This technique is modified to consider inherent uncertainties present in the timing models. The uncertainty controls the aggressiveness of the optimizations as we must take great care in ensuring functionality for any range of possible timing characteristics. Our results confirm intuition that more aggressive speed optimizations can be performed as timing models become more accurate. We also show that providing 4 skewed versions of the nominal clock signal results in the best delay-area tradeoff. This result is evocative as it may suggest future FPGA architectures that contain greater numbers of global clock lines, as we trade off gains in speed for greater power requirements from increased clock network flexibility.
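The cycle-stealing idea can be illustrated with a toy search over a finite skew set. This sketch assumes the usual setup relation, skew[src] + delay ≤ skew[dst] + T, and ignores hold constraints and timing uncertainty. It uses brute-force enumeration rather than the integer programming method the abstract proposes, and the register names, path delays, and skew values are made up for illustration.

```python
from itertools import product


def min_period(paths, skew):
    """Smallest clock period T satisfying skew[src] + d <= skew[dst] + T on every path."""
    return max(d + skew[src] - skew[dst] for src, dst, d in paths)


def best_skews(registers, paths, skew_choices):
    """Try every assignment of the available skews to registers; keep the first best one."""
    best = None
    for combo in product(skew_choices, repeat=len(registers)):
        skew = dict(zip(registers, combo))
        period = min_period(paths, skew)
        if best is None or period < best[0]:
            best = (period, skew)
    return best


# Hypothetical register-to-register paths: (source, destination, max delay).
registers = ["r1", "r2", "r3"]
paths = [("r1", "r2", 10), ("r2", "r3", 4), ("r3", "r1", 4)]

print(min_period(paths, {"r1": 0, "r2": 0, "r3": 0}))  # 10
print(best_skews(registers, paths, skew_choices=(0, 3)))
# (7, {'r1': 0, 'r2': 3, 'r3': 0})
```

Delaying the clock at r2 by 3 lets the slow r1→r2 path borrow time from the fast r2→r3 path, so the period drops from 10 to 7. Enumeration grows exponentially with the number of registers, which is why a real flow would need a more efficient formulation.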
Conference Paper
In this paper, we present a new approach that performs timing driven placement for standard cell circuits in interaction with netlist transformations. As netlist transformations are integrated into the placement process, an accurate net delay model is available. This model provides the basis for effective netlist transformations. In contrast to previous approaches that apply netlist transformations during placement, we are not restricted to local transformations like fanout buffering or gate resizing. Instead, we exploit global dependencies between the signals in the circuit. Results for benchmark circuits show excellent placement quality. The maximum path delay is reduced by up to 33% compared to the initial timing driven placement of the original netlist and by up to 18% compared to the results obtained by consecutive optimization of the netlist and timing driven placement of the optimized netlist. This delay reduction is achieved with almost no increase in chip area.
Conference Paper
The complexity of integrated-circuit chips produced today makes it feasible to build inexpensive, special-purpose subsystems that rapidly solve sophisticated problems on behalf of a general-purpose host computer. This paper contributes to the design methodology of efficient VLSI algorithms. We present a transformation that converts synchronous systems into more time-efficient, systolic implementations by removing combinational rippling. The problem of determining the optimized system can be reduced to the graph-theoretic single-destination-shortest-paths problem. More importantly from an engineering standpoint, however, the kinds of rippling that can be removed from a circuit at essentially no cost can be easily characterized. For example, if the only global communication in a system is broadcasting from the host computer, the broadcast can always be replaced by local communication.
Conference Paper
In this paper we present a placement-intelligent resynthesis methodology and optimization algorithms to meet post-layout timing constraints while at the same time reducing the interconnect congestion. We begin with the synthesized design netlist after initial placement and make incremental modifications, taking placement into account, to generate a final netlist and placement that meets the delay constraints after place and route. The algorithms described have been implemented as part of a tool for placement based resynthesis. The tool has been used with a number of pre-optimized designs from industry to obtain improvements in post-placement delays ranging from 13% to 22% with improved routability.
Conference Paper
In this paper, the authors presented a new linear-time retiming algorithm that produces near-optimal results. The implementation is specifically targeted at Altera's Stratix FPGA-based designs, although the techniques described are general enough for any implementation medium. The algorithm is able to handle the architectural constraints of the target device, multiple timing constraints assigned by the user and implicit legality constraints. It ensures that register moves do not create asynchronous problems such as creating a glitch on a clock/reset signal.
Conference Paper
This paper presents an algorithm to update the placement of logic elements when given an incremental netlist change. Specifically, these algorithms are targeted to incrementally place logic elements created by layout-driven circuit restructuring techniques. The incremental placement engine assumes that the restructuring algorithms provide a list of new logic elements along with preferred locations for each of these new elements. It then tries to shift non-critical logic elements in the original placement out of the way to satisfy the preferred location requests. Our algorithm considers modern FPGA architectures with clustered logic blocks that have numerous architectural constraints. Experiments indicate that our technique produces results of extremely high quality.
Conference Paper
Retiming relocates registers in a circuit to shorten the clock cycle time. In the deep sub-micron era, conventional pre-layout retiming cannot work properly because of dominant interconnection delay that is not available before layout. Although some retiming algorithms incorporating interconnection delay have been proposed, layout information is still not utilized effectively or efficiently. Retiming and layout are combined for the first time in this paper. We present heuristics for two key problems: interconnection delay estimation and post-retiming incremental placement. An efficient retiming algorithm incorporating interconnection delay is also proposed. Experimental results show that on average we can improve the circuit speed by 5.4% targeted toward a 0.52 μm CMOS technology. Scaling down the technology to 0.1 μm, as much as 25.6% improvement has been achieved.
Conference Paper
The ordered binary decision diagram (OBDD) has proven useful in many applications as an efficient data structure for representing and manipulating Boolean functions. A serious drawback of OBDD's is the need for application-specific heuristic algorithms to order the variables before processing. Further, for many problem instances in logic synthesis, the heuristic ordering algorithms which have been proposed are insufficient to allow OBDD operations to complete within a limited amount of memory. The paper proposes a solution to these problems based on having the OBDD package itself determine and maintain the variable order. This is done by periodically applying a minimization algorithm to reorder the variables of the OBDD to reduce its size. A new OBDD minimization algorithm, called the sifting algorithm, is proposed and appears especially effective in reducing the size of the OBDD. Experiments with dynamic variable ordering on the problem of forming the OBDD's for the primary outputs of a combinational circuit show that many computations complete using dynamic variable ordering when the same computation fails otherwise.
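Why the variable order matters, and what one sifting pass does, can be sketched on truth tables. The code below counts the internal nodes of the reduced OBDD for a given order by counting distinct cofactors that depend on each variable, then tries each variable in every position while the others keep their relative order. This is a simplified stand-in: a real OBDD package would sift with adjacent-variable swaps on the diagram itself, and all names and the example function are assumptions.

```python
from itertools import product


def obdd_size(f, order):
    """Number of internal nodes of the reduced OBDD of f under `order`."""
    n = len(order)
    # Truth table with the first variable in `order` as the most significant bit.
    table = tuple(f(dict(zip(order, bits))) for bits in product((0, 1), repeat=n))
    nodes = 0
    level = {table}  # distinct subfunctions still to be decided at this level
    for _ in range(n):
        next_level = set()
        for t in level:
            half = len(t) // 2
            lo, hi = t[:half], t[half:]
            if lo != hi:            # depends on the current variable: one node
                nodes += 1
                next_level.update((lo, hi))
            else:                   # redundant test: no node is created
                next_level.add(lo)
        level = next_level
    return nodes


def sift_once(f, order):
    """One pass: move each variable to the position that minimizes the node count."""
    order = list(order)
    for var in list(order):
        rest = [v for v in order if v != var]
        candidates = [rest[:i] + [var] + rest[i:] for i in range(len(order))]
        order = min(candidates, key=lambda o: obdd_size(f, o))
    return order


def f(v):
    return (v["x1"] & v["x2"]) | (v["x3"] & v["x4"]) | (v["x5"] & v["x6"])


good = ["x1", "x2", "x3", "x4", "x5", "x6"]
bad = ["x1", "x3", "x5", "x2", "x4", "x6"]

print(obdd_size(f, good))  # 6
print(obdd_size(f, bad))   # 14
print(obdd_size(f, sift_once(f, bad)))
```

Because each variable's current position is always among the candidates, a pass never increases the node count.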
Article
This paper presents algorithms for disjunctive and nondisjunctive decomposition of Boolean functions and Boolean methods for identifying common subfunctions from multiple Boolean functions. Ordered binary decision diagrams are used to represent and manipulate Boolean functions so that the proposed methods can be implemented concisely. These techniques are applied to the synthesis of look-up table based field programmable gate arrays and results are presented.
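A classical test for simple disjunctive decomposition can be sketched with truth tables instead of OBDDs. By Ashenhurst's criterion, f can be written as g(h(bound), free) with a single-output h exactly when the decomposition chart has at most two distinct columns. The enumeration below is an illustrative stand-in for the OBDD-based methods in the abstract; the function name and the example are assumptions.

```python
from itertools import product


def column_multiplicity(f, bound, free):
    """Count distinct columns of the decomposition chart of f.

    Each column fixes the bound variables and lists f over all
    assignments to the free variables.
    """
    columns = set()
    for b in product((0, 1), repeat=len(bound)):
        fixed = dict(zip(bound, b))
        column = tuple(
            f({**fixed, **dict(zip(free, r))})
            for r in product((0, 1), repeat=len(free))
        )
        columns.add(column)
    return len(columns)


def f(v):
    # Hypothetical 4-input function: ((a XOR b) AND c) OR d
    return ((v["a"] ^ v["b"]) & v["c"]) | v["d"]


print(column_multiplicity(f, bound=["a", "b"], free=["c", "d"]))  # 2
print(column_multiplicity(f, bound=["a", "c"], free=["b", "d"]))  # 3
```

With bound set {a, b} the multiplicity is 2, so h = a XOR b can occupy one look-up table and feed a second one computing g(h, c, d). With bound set {a, c} the multiplicity is 3, so no single-output h exists for that partition.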
Article
The routing architecture of an FPGA consists of the length of the wires, the type of switch used to connect wires (buffered, unbuffered, fast or slow) and the topology of the interconnection of the switches and wires. FPGA routing architecture has a major influence on the logic density and speed of FPGA devices. Previous work [1] based on a 0.35 µm CMOS process has suggested an architecture consisting of length 4 wires (where the length of a wire is measured in terms of the number of logic blocks it passes before being switched) in which half of the programmable switches are active buffers and half are pass transistors. In that work, however, the topology of the routing architecture prevented buffered tracks from connecting to pass-transistor tracks. This restriction prevents the creation of interconnection trees for high fanout nets that have a mixture of buffers and pass transistors. Electrical simulations suggest that connections closer to the leaves on interconnection trees are faster using pass transistors, but it is essential to buffer closer to the source. This latter effect is well known in regular ASIC routing [2].