Conference Paper

Difficulty of predicting interconnect delay in a timing driven FPGA CAD flow


Abstract

This paper studies the difficulty of predicting interconnect delay in an industrial setting. Fifty industrial circuits, Altera's Quartus II CAD software, and Altera's Stratix and Stratix II FPGA architectures were used in the study. We show that there is a large amount of inherent randomness in a state-of-the-art FPGA placement algorithm. Thus, it is impossible to predict interconnect delay with a high degree of accuracy. Furthermore, we show that a simple timing model can be used to predict some aspects of interconnect timing with just as much accuracy as predictions obtained by running the placement tool itself. Finally, we examine the benefits of using the simple timing model in a timing driven physical synthesis flow, and attempt to establish an upper bound on these possible gains, given the difficulty of interconnect delay prediction.

... of interconnect delay even before the physical design steps (placement and routing) in an FPGA CAD flow are executed is a difficult yet important problem. The ability to predict interconnect delay early in the CAD flow offers two advantages. First, the timing driven restructuring operations carried out during the early CAD steps (synthesis and technology mapping) can be made much more effective if interconnect delay can be predicted with reasonable accuracy. Second, the delay predictions can be used to provide feed- ...


... The interconnect estimation method described here is based on Lou's method [55] of dividing the device model into a grid and the delay-lookup approach of Manohararajah et al. [59]. ...
... The timing model is based on an approach by Manohararajah et al. [59], which uses a single delay value for each connection type in a table. This approach predicts "some aspects of interconnect timing", and they have noted that its accuracy is very high when the results are compared with running the placement tool itself. ...
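The delay-lookup idea in the excerpt above (one delay value per connection type, stored in a table) can be illustrated with a minimal sketch. The connection-type names and the nanosecond values below are hypothetical placeholders chosen for illustration; they are not taken from the cited paper or from any device data.

```python
from typing import Dict, Iterable

# Hypothetical connection types with made-up delays in nanoseconds.
# A real table would be characterized per device family.
DELAY_TABLE_NS: Dict[str, float] = {
    "intra_cluster": 0.2,
    "inter_cluster": 1.0,
    "carry_chain": 0.05,
}


def connection_delay(conn_type: str) -> float:
    """Return the single table delay assigned to a connection type."""
    return DELAY_TABLE_NS[conn_type]


def path_interconnect_delay(conn_types: Iterable[str]) -> float:
    """Sum the table delays of the connections along one path."""
    return sum(connection_delay(t) for t in conn_types)
```

Because every connection of a given type receives the same delay, such a model needs no placement information at all, which is what makes it usable before placement.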
... Manohararajah et al. [59] have noted that place-and-route tools always (d) If the full arc width cannot be implemented, return to step 2 and find a route for the remaining width. ...
... Accurate estimation of interconnect delay could effectively enhance the bandwidth of communication links and allow efficient architectural exploration and optimization at an early design stage in the design flow, such as floor-planning [13,9]. But early-stage interconnect delay prediction in FPGAs is very difficult due to the large amount of inherent uncertainties introduced in the placement and routing steps [9]. Since the length and delay of interconnections are highly correlated (and in fact usually assumed to be linearly related) an effective approach is to predict the average length from which the delay can be determined. ...
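The linear length-to-delay assumption mentioned in the excerpt above can be sketched as an ordinary least-squares line fit followed by a prediction from an estimated average length. This is only an illustration of the assumption; the function names and sample values are hypothetical and do not reproduce the citing authors' stochastic model.

```python
from typing import Sequence, Tuple


def fit_linear(lengths: Sequence[float], delays: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of delay = slope * length + intercept."""
    n = len(lengths)
    mean_x = sum(lengths) / n
    mean_y = sum(delays) / n
    sxx = sum((x - mean_x) ** 2 for x in lengths)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lengths, delays))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def delay_from_length(avg_length: float, slope: float, intercept: float) -> float:
    """Predict delay from an estimated average interconnection length."""
    return slope * avg_length + intercept


# Hypothetical measurements: lengths in logic-block units, delays in ns.
slope, intercept = fit_linear([1.0, 2.0, 3.0], [0.7, 0.9, 1.1])
predicted = delay_from_length(4.0, slope, intercept)
```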
... There are a number of methods to estimate the average interconnection lengths and distribution in FPGAs [2,9]. For instance, a heuristic approach based on bounding box method has been proposed, which can provide a good estimation for the interconnections for generic logic in FPGAs [2]. ...
Conference Paper
This paper presents a new stochastic model to predict interconnection lengths of communication links in FPGAs. Based on a stochastic inter-module routing model, expected length and variance of interconnections have been rigorously derived and, thus, delay can be computed based on the length estimate. The theoretical results are compared with experimental results of lengths and delays, which are obtained from implementations of link circuits in an FPGA. The stochastic model provides an accurate prediction of length with an average error of 6.3%. Results also show that the proposed model produces reliable predictions of delay and therefore the methodology can be applied to early stage planning and design optimization for communication links. Moreover, as a byproduct of this work, we also present in this paper an interesting phenomenon which we term "interconnection fringing". The fringing effect is attributed to the competition for routing resources in a communication link and will lengthen interconnections and, therefore, increase the delay.
... The bandwidth of these communication architectures is determined by the interconnection delay in the communication links and is critical to the overall system performance. Accurate estimation of interconnect delay could effectively enhance the bandwidth of communication links and allow efficient architectural exploration and optimisation at an early design stage in the design flow, such as floor-planning [3]. But interconnect delay prediction is very difficult owing to the large amount of inherent uncertainties presented in the placement and routing steps [3]. ...
... Accurate estimation of interconnect delay could effectively enhance the bandwidth of communication links and allow efficient architectural exploration and optimisation at an early design stage in the design flow, such as floor-planning [3]. But interconnect delay prediction is very difficult owing to the large amount of inherent uncertainties presented in the placement and routing steps [3]. Since the interconnection lengths and delays are highly correlated and are usually assumed as linearly related, an effective approach is to predict the average interconnection length from which the interconnection delay can be determined. ...
Article
A new method is presented and an analytical expression is derived for average interconnection delay estimation. This method is directly applicable to predicting the average delay for high-bandwidth communication links implemented on FPGAs. The theoretical results are compared with the measured data from the actual circuits and an average error of 4.6% is reported.
... The netrange of a net is defined as the difference between the minimum and maximum logic depth of any given node connected to a net where the delay of a net is proportional to its netrange. Unfortunately, effective pre-placement delay models are still lacking and require further work [MCSB06]. ...
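The netrange definition in the excerpt above can be written down directly. The sketch below assumes that the logic depth of each node is already known and supplied in a dictionary; the node names and depth values are hypothetical.

```python
from typing import Dict, Iterable


def netrange(net_nodes: Iterable[str], logic_depth: Dict[str, int]) -> int:
    """Difference between the maximum and minimum logic depth
    of the nodes connected to one net."""
    depths = [logic_depth[node] for node in net_nodes]
    return max(depths) - min(depths)


# Hypothetical logic depths for three nodes sharing a net.
depths = {"a": 1, "b": 4, "c": 2}
span = netrange(["a", "b", "c"], depths)
```

Under the stated proportionality assumption, a net with a larger netrange would be assigned a larger estimated delay.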
... Also, minimizing logic depth does not mean a reduction in wire length in modern designs. Although the use of timing information during clustering can lead to a better set of clusters, recent research indicates that timing estimates made during clustering may not be accurate when compared to the final placement [33]. ...
... Each point represents the average area reduction versus the average performance reduction of the entire benchmark of 87 circuits, run with a different slack ratio threshold parameter . Because the RAM-MAP technique is performed prior to placement and it is difficult to predict the postplacement delay [14], it is very difficult to both predict and control the performance reduction. ...
Conference Paper
This work describes a new mapping technique, RAM-MAP, that identifies parts of circuits that can be efficiently mapped into the synchronous embedded memories found on field programmable gate arrays (FPGAs). Previous techniques developed for mapping into asynchronous embedded memories cannot be used because modern FPGAs do not have asynchronous embedded memories. After technology mapping, an area-prediction cost function is used to guide the selection of logic cones to be placed in embedded memories. Extra logic is added to compensate for missing asynchronous functionality on the synchronous memories. Experiments conducted on Altera's Stratix device family indicate that this embedded memory mapping technique can provide an average area reduction of 6.2% and up to 32.5% on a large set of industrial designs. A small architecture change that increases the size of the FPGA fabric by 0.05% can increase the average area reduction to 14.1% and up to 59.1% on the same design set.
... Most of the existing interconnect estimation techniques are applied at the post-placement design level [2,12], as the information on global routes is extremely scarce at the higher levels of abstraction. Furthermore, in [14], it was noted that the delay of some interconnects can change significantly (up to 2 or 3 times) by changing the seed of the placement algorithm for generic logic in FPGAs. ...
Conference Paper
A novel high-level approach for estimating power consumption of global interconnects in data-path oriented designs implemented in FPGAs is presented. The methodology is applied to interconnections between modules and depends only on their mutual distance and shape. The power model has been characterized and verified with on-board power measurements, instead of using low-level estimation tools which often lack the required accuracy (observed errors go up to 350%). The results show that most of the errors of the presented power model lie within 20% of the physical measurements. This is an excellent result considering that in (2) it is shown that there is already a 20% variation in net capacitance due to the different routing solutions given by the router for the same placement.
... It is thus proposed that heuristics be used to solve this problem. The method is based on the approach of Lou et al. to modelling the device as a grid of channel cells [Lou et al. 2002] and the delay-lookup approach of Manohararajah et al. [Manohararajah et al. 2006]. The algorithm for mapping arcs to the device is described in Algorithm 2. Initially sorting the connections in descending order mimics what a detailed router does at a higher level by allowing longer connections to use faster routing paths through the channel cells, thereby reducing the risk these nets become critical. ...
Article
Partial runtime reconfiguration allows some circuit components to be reconfigured while the remaining circuitry continues to operate. Applications partitioned into modules have the potential to exploit this capability to virtualize hardware by swapping modules as required. One of the challenges in doing so is to provide a communication infrastructure that supports the interfaces and communication needs of a sequence of dynamic module swaps. In contrast to previous approaches which have examined the use of buses and networks-on-chip for this purpose, we examine the use of customized point-to-point wiring harnesses to provide the dynamic connections required for dynamic modular reconfiguration in an efficient manner. The COMMA methodology implements applications on tile-reconfigurable FPGAs, such as the Virtex-4, and its design flow is integrated with the early access partial reconfiguration tools from Xilinx. This article outlines the methodology and describes greedy and dynamic programming approaches to merging the communication graphs of successive configurations in order to generate effective wiring harnesses within the methodology. Our evaluation indicates merging can markedly reduce total reconfiguration delays at the cost of increased critical path delays. Application of the technique is likely to be limited to scenarios in which the execution time between reconfigurations is short.
Article
As a good trade-off between CPU and ASIC, FPGA is becoming more widely used in both industry and academia. The increasing complexity and scale of modern FPGA, however, impose great challenges on the FPGA placement and packing problem. In this paper, we propose RippleFPGA to solve the packing and placement simultaneously through a set of novel techniques, such as (i) smooth stair-step flow, (ii) implicit packing similar to ASIC legalization, and (iii) two-level detailed placement. To enable the flow, a generic, efficient and false-alarm-free legality checking method is also proposed. Besides, due to the insufficiency of ASIC-like congestion alleviation methods, some FPGA-routing-architecture-aware optimization techniques are proposed to improve the routability. When evaluated by ISPD 2016 Contest benchmarks, RippleFPGA has 5.1% better routed wirelength and 5.5× speedup compared to all the state-of-the-art FPGA placers.
Article
To aid in the hardware/software partitioning of reconfigurable computing systems, fast yet accurate FPGA-based delay estimations are necessary before the partitioning. Most previous works predict the delay by using a high-level delay-estimation method based on empirical formulae. However, this method requires many runs of the time-consuming synthesis, place and route procedures, which may take hours or days for all possible partition options. To address this problem, this paper proposes an auto-estimation model to improve on the previous high-level delay estimation. In this model, we first derive calculation formulae, called increasing formulae, for HLL operations from the basic idea of hardware circuit design. A feedback-based framework is then applied to adjust the increasing formulae for alternative FPGAs or synthesis properties and to estimate the delay of the partitioning. This model reduces the number of runs of the time-consuming procedures. Experimental results show that the method can achieve an error within 5% for the Virtex-5 FPGA, compared with the real delay.
Article
While the promise of achieving speedup and additional benefits such as high performance per watt with FPGAs continues to expand, chief among the challenges with the emerging paradigm of reconfigurable computing is the complexity in application design and implementation. Before a lengthy development effort is undertaken to map a given application to hardware, it is important that a high-level parallel algorithm crafted for that application first be analyzed relative to the target platform, so as to ascertain the likelihood of success in terms of potential speedup. This article presents the RC Amenability Test, or RAT, a methodology and model developed for this purpose, supporting rapid exploration and prediction of strategic design tradeoffs during the formulation stage of application development.
Article
To aid in the hardware/software partitioning of reconfigurable computing systems, it is necessary to conduct fast and accurate FPGA-based delay estimations before the partitioning. Most previous works predict the delay by adopting a high-level delay estimation based on empirical formulae. In such methods, the empirical formulae are often obtained by a regression analysis on the real values reported by the synthesis and place-and-route tools of FPGAs. With alternative tool properties or different FPGA devices, the empirical formulae need to be reanalyzed and decided again. This is time-consuming because the synthesis and place-and-route tasks must be run repeatedly, which results in slow estimation that is always beyond the tolerable estimation time. To address this problem, we present an improved high-level delay-estimation method in this article. We derive theoretical formulae, called increasing formulae, for HLL (High Level Language) operations from the basic idea of hardware circuit design. These increasing formulae fit most FPGAs. Combining the proposed formulae, the paper also proposes a rapid estimation algorithm. The algorithm can obtain the hardware delay of different hardware versions, and thus greatly reduces the number of runs of the time-consuming tasks. Experimental results show that our method can achieve an error within 2.69% for the Virtex-5 FPGA, compared with the real values.
Article
In FPGA CAD flow, the clustering stage builds the foundation for placement and routing stages and affects performance parameters, such as routability, delay, and channel width significantly. Net sharing and criticality are the two most commonly used factors in clustering cost functions. With this study, we first derive a third term, net-length factor, and then design a generic method for integrating net length into the clustering algorithms. Net-length factor enables characterizing the nets based on the routing stress they might cause during later stages of the CAD flow and is essential for enhancing the routability of the design. We evaluate the effectiveness of integrating net length as a factor into the well-known timing (T-VPack)-, depopulation (T-NDPack)-, and routability (iRAC and T-RPack)-driven clustering algorithms. Through exhaustive experimental studies, we show that net-length factor consistently helps improve the channel-width performance of routability-, depopulation-, and timing-driven clustering algorithms that do not explicitly target low fan-out nets in their cost functions. Particularly, net-length factor leads to average reduction in channel width for T-VPack, T-RPack, and T-NDPack by 11.6%, 10.8%, and 14.2%, respectively, and in a majority of the cases, improves the critical-path delay without increasing the array size.
Conference Paper
Rapid area-time estimation is an instrumental step for efficient design exploration of FPGA-based implementations. In this paper, we address the issue of high-level delay estimation for porting C-based applications onto FPGA. In particular, we present a framework which incorporates a compiler to generate optimized high-level IR (intermediate representation) of the C-applications and an estimation model that is based on an architecture template with application-specific heterogeneous functional units. In order to accurately predict the post place and route delay of the design, the proposed estimation strategy performs a simplified floor-planning process. This leads to more accurate interconnect delay estimation, which is then combined with the pre-characterized delay parameters of the components. Experimental results based on a set of embedded functions show that the proposed estimation technique can achieve comparable results with synthesis results from a commercial FPGA tool in significantly shorter amount of time. In particular, the proposed method has an average error of only 4% with a maximum error of 10%. In addition, the proposed method leads to more consistent estimation results when compared to the Xilinx synthesis tool and to a naive approach that only employs pre-characterized parameters in the estimation model.
Conference Paper
In this paper a fast and accurate delay estimation tool for FPGA-based designs is presented. The tool is developed in the context of a HW/SW partitioning tool. Rather than modeling the hardware as a single implementation, our approach for HW/SW partitioning models the hardware as two extreme alternatives that bound the area and latency ranges for different hardware implementations. The presented tool estimates the delay for these two hardware alternatives. Our delay modeling technique accounts for both the logic and routing delays so as to minimize the estimation error. The computational cost of the presented estimation tool depends linearly on the design complexity, and hence, it is very useful for fast design space exploration. Testing this estimation tool on several designs showed that this tool is also accurate with an average error of 4.2%.
Conference Paper
The traditional approach to FPGA packing and CLB-level placement has been shown to yield significantly worse quality than approaches which allow BLEs to move during placement. In practice, however, modern FPGA architectures require expensive DRC checks which can render full BLE-level placement impractical. We address this problem by proposing a novel clustering framework that uses physical information to produce better initial packings which can, in turn, reduce the amount of BLE-level placement that is required. We quantify our packing technique across accepted benchmarks and show that it produces results with 16% less wire length, 19% smaller minimum channel widths, and 8% less critical path delay, on average, than leading methods.
Article
Interconnect prediction is very important for early feasibility studies in modern design flows. Most of the current interconnect estimation techniques estimate either the average or the total wirelength and some qualitative measure of routing demand for circuits. A priori techniques estimate these parameters without actually performing circuit placement. We propose a new a priori interconnect and wirelength estimation methodology for island style field programmable gate arrays (FPGAs). For a given design, we estimate bounding box semiperimeter wirelengths of all nets for an optimized placement and the minimum number of tracks per channel required for successful routing on an FPGA device. We analyze the structural characteristics of circuits and limitations posed by the FPGA architecture to derive a consistent model for wirelength and routing demand estimation. We identify reconvergences present in a circuit as an important global circuit characteristic in wirelength prediction. Our overall results show that we have an average error of 11.6% w.r.t. semiperimeter wirelength measured from the optimized layout using VPR. Also, the number of routing tracks per channel is predicted with an average error of 13.2% of the detailed routing results from VPR.
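The bounding box semiperimeter wirelength used as the reference metric in the abstract above is the half-perimeter of the smallest axis-aligned rectangle enclosing all pins of a net. A minimal sketch, assuming pins are given as hypothetical (x, y) grid coordinates:

```python
from typing import Iterable, Tuple


def semiperimeter_wirelength(pins: Iterable[Tuple[int, int]]) -> int:
    """Half-perimeter of the bounding box enclosing all pins of one net."""
    pin_list = list(pins)
    xs = [x for x, _ in pin_list]
    ys = [y for _, y in pin_list]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def total_wirelength(nets: Iterable[Iterable[Tuple[int, int]]]) -> int:
    """Sum of the semiperimeter wirelengths over all nets."""
    return sum(semiperimeter_wirelength(net) for net in nets)
```

This computes the metric from a finished placement; the a priori methodology described in the abstract instead estimates it before placement is run.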
Conference Paper
This paper presents an algorithm to update the placement of logic elements when given an incremental netlist change. Specifically, these algorithms are targeted to incrementally place logic elements created by layout-driven circuit restructuring techniques. The incremental placement engine assumes that the restructuring algorithms provide a list of new logic elements along with preferred locations for each of these new elements. It then tries to shift non-critical logic elements in the original placement out of the way to satisfy the preferred location requests. Our algorithm considers modern FPGA architectures with clustered logic blocks that have numerous architectural constraints. Experiments indicate that our technique produces results of extremely high quality.
Book
From the Publisher: Architecture and CAD for Deep-Submicron FPGAs addresses several key issues in the design of high-performance FPGA architectures and CAD tools, with particular emphasis on issues that are important for FPGAs implemented in deep-submicron processes. Three factors combine to determine the performance of an FPGA: the quality of the CAD tools used to map circuits into the FPGA, the quality of the FPGA architecture, and the electrical (i.e. transistor-level) design of the FPGA. Architecture and CAD for Deep-Submicron FPGAs examines all three of these issues in concert.
Article
This work explores the effect of adding a timing driven functional decomposition step to the traditional field programmable gate array (FPGA) CAD flow. Once placement has completed, alternative decompositions of the logic on the critical path are examined for potential delay improvements. The placed circuit is then modified to use the best decompositions found. Any placement illegalities introduced by the new decompositions are resolved by an incremental placement step. Experiments conducted on Altera's Stratix and Stratix II device families indicate that this functional decomposition technique can provide average performance improvements of 6.1% and 5.6% on a large set of industrial designs, respectively.
Article
This work explores the effect of adding a simple functional decomposition step to the traditional field programmable gate array (FPGA) CAD flow. Once placement has completed, alternative decompositions of the logic on the critical path are examined for potential delay improvements. The placed circuit is then modified to use the best decompositions found. Any placement illegalities introduced by the new decompositions are resolved by an incremental placement step. Experiments conducted on Altera's Stratix chips indicate that this functional decomposition technique can provide a performance improvement of 7.6% on average, and up to 26.3% on a set of industrial designs.
Article
Timing Analysis is a design automation program that assists computer design engineers in locating problem timing in a clocked, sequential machine. The program is effective for large machines because, in part, the running time is proportional to the number of circuits. This is in contrast to alternative techniques such as delay simulation, which requires large numbers of test patterns, and path tracing, which requires tracing of all paths. The output of Timing Analysis includes “slack” at each block to provide a measure of the severity of any timing problem. The program also generates standard deviations for the times so that a statistical timing design can be produced rather than a worst case approach. This system has successfully detected all but a few timing problems for the IBM 3081 Processor Unit (consisting of almost 800,000 circuits) prior to the hardware debugging of timing. The 3081 is characterized by a tight statistical timing design.
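The per-block slack described in the abstract above can be illustrated with a minimal block-oriented sketch: one forward pass computes arrival times, one backward pass computes required times, and slack is their difference, so the work grows linearly with the number of blocks and connections. The sketch assumes an acyclic block graph, a single clock period, and deterministic delays; it does not model the standard deviations mentioned in the abstract, and all names are hypothetical.

```python
from collections import defaultdict, deque
from typing import Dict, List, Tuple


def compute_slacks(delays: Dict[str, float],
                   edges: List[Tuple[str, str]],
                   clock_period: float) -> Dict[str, float]:
    """Return the slack at the output of every block of an acyclic graph."""
    succs = defaultdict(list)
    preds = defaultdict(list)
    indegree = {block: 0 for block in delays}
    for src, dst in edges:
        succs[src].append(dst)
        preds[dst].append(src)
        indegree[dst] += 1

    # Topological order (Kahn's algorithm): each block and edge is visited once.
    order = []
    queue = deque(b for b in delays if indegree[b] == 0)
    while queue:
        block = queue.popleft()
        order.append(block)
        for nxt in succs[block]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)

    # Forward pass: latest arrival time at each block output.
    arrival = {}
    for block in order:
        latest_input = max((arrival[p] for p in preds[block]), default=0.0)
        arrival[block] = latest_input + delays[block]

    # Backward pass: latest time each output may settle without violating timing.
    required = {}
    for block in reversed(order):
        required[block] = min(
            (required[s] - delays[s] for s in succs[block]),
            default=clock_period,
        )

    return {block: required[block] - arrival[block] for block in order}
```

A negative slack at a block marks a timing problem on some path through it, and its magnitude gives the severity measure the abstract refers to.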
Conference Paper
This paper presents an overview of an industrial physical synthesis CAD flow for FPGAs. The flow provides a performance speedup of 10%-15% for most circuits, and a significant number of circuits show a speedup of 20%-180%. We describe the algorithms used to achieve this result including: incremental retiming, BDD-based resynthesis, local rewiring, and logic replication. The effectiveness of these operations depends on the ability to accurately determine which portions of logic are timing critical at a stage of the CAD flow where there is still freedom to perform logic restructuring. We show how this problem can be effectively solved by inserting prediction and restructuring operations at multiple points of the FPGA CAD flow.
Conference Paper
In this paper, the authors presented a new linear-time retiming algorithm that produces near-optimal results. The implementation is specifically targeted at Altera's Stratix FPGA-based designs, although the techniques described are general enough for any implementation medium. The algorithm is able to handle the architectural constraints of the target device, multiple timing constraints assigned by the user and implicit legality constraints. It ensures that register moves do not create asynchronous problems such as creating a glitch on a clock/reset signal.
Conference Paper
A new wire length estimation technique is presented. Wire length distribution is modeled by wire density on a 2-D lattice. Assuming a pointwise independent branching process, the wire length distribution is found by solving the neighborhood density equations. For several industrial circuits tested, this technique achieved an estimation error of 9.0% with a maximum deviation of +16.3%, which compared favorably with other techniques recently proposed.
Article
We present a novel technique for estimating individual wire lengths in a given standard-cell-based design during the technology mapping phase of logic synthesis. The proposed method is based on creating a black box model of the place and route tool as a function of a number of parameters, which are all available before layout. The place and route tool is characterized, only once, by applying it to a set of typical designs in a certain technology. We also propose a net bounding box estimation technique based on the layout style and net neighborhood analysis. We show that there is inherent variability in wire lengths obtained using commercially available place and route tools-wire length estimation error cannot be any smaller than a lower limit due to this variability. The proposed model works well within these variability limitations.
Article
This paper presents a new wire length estimation technique for row-based design. Noting that the local topological structure of the network is often reflected in the local structure of the placement, we present a technique of topological analysis of the network, in which the local structure of the network is characterized by a growing sequence of multilevel neighborhoods. By assuming a pointwise independent branching process, we derive equations for the probability density of multilevel neighborhoods. The wire length distribution is found by solving these equations. For thirteen industrial circuits tested, this technique gives an average of 15.1% estimation accuracy
Article
The routing architecture of an FPGA consists of the length of the wires, the type of switch used to connect wires (buffered, unbuffered, fast or slow) and the topology of the interconnection of the switches and wires. FPGA Routing architecture has a major influence on the logic density and speed of FPGA devices. Previous work [1] based on a 0.35um CMOS process has suggested that an architecture consisting of length 4 wires (where the length of a wire is measured in terms of the number of logic blocks it passes before being switched) and half of the programmable switches are active buffers, and half are pass transistors. In that work, however, the topology of the routing architecture prevented buffered tracks from connecting to pass-transistor tracks. This restriction prevents the creation of interconnection trees for high fanout nets that have a mixture of buffers and pass transistors. Electrical simulations suggest that connections closer to the leaves on interconnection trees are faster using pass transistors, but it is essential to buffer closer to the source. This latter effect is well known in regular ASIC routing [2].
Altera. Quartus II Development Software Handbook v5.0 (Complete Four-Volume Set). v5.0, May 2005.
Altera. Stratix Device Handbook (Complete Two-Volume Set). v3.1, Sept. 2004.