Conference Paper

BPR: fast FPGA placement and routing using macroblocks


Abstract

Numerous studies have shown the advantages of hardware and software co-design using FPGAs. However, increasingly lengthy place-and-route times represent a barrier to the broader adoption of this technology by significantly reducing designer productivity and turns-per-day, especially compared to more traditional design environments offered by competitive technologies such as GPUs. In this paper, we address this challenge by introducing a new approach to FPGA application design that significantly reduces compile times by exploiting the functional reuse common throughout modern FPGA applications, e.g., as shared code libraries and unchanged modules between compiles. To evaluate this approach, we introduce Block Place and Route (BPR), an FPGA CAD approach that modifies traditional placement and routing to operate at a higher level of abstraction by pre-computing the internal placement and routing of reused cores. By extending traditional place-and-route algorithms such as simulated-annealing placement and negotiated-congestion routing to abstract away the detailed implementation of reused cores, we show that BPR is capable of orders-of-magnitude speedup in place-and-route over commercial tools with acceptably low overhead for a variety of applications.
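To make the idea of placing at the macroblock level concrete, the following is a minimal sketch of simulated-annealing placement in which each pre-implemented core is treated as one opaque block. It is not BPR's implementation: the one-block-per-slot grid, the half-perimeter wirelength cost, the cooling schedule, and all names (`hpwl`, `anneal_macroblocks`, the block names in the usage note) are assumptions chosen for illustration.

```python
import math
import random


def hpwl(nets, pos):
    """Half-perimeter wirelength summed over all nets (a common placement cost)."""
    total = 0
    for net in nets:
        xs = [pos[b][0] for b in net]
        ys = [pos[b][1] for b in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


def anneal_macroblocks(blocks, nets, width, height, seed=0,
                       t_start=5.0, t_end=0.01, cooling=0.9, moves_per_temp=100):
    """Place opaque macroblocks on a width x height grid of slots.

    Assumption: every macroblock fits exactly one slot, so the annealer never
    looks inside a block; its internal placement and routing are precomputed.
    """
    rng = random.Random(seed)
    slots = [(x, y) for x in range(width) for y in range(height)]
    if len(blocks) > len(slots):
        raise ValueError("more macroblocks than slots")
    rng.shuffle(slots)
    pos = {b: slots[i] for i, b in enumerate(blocks)}
    occupant = {s: None for s in slots}
    for b, s in pos.items():
        occupant[s] = b

    cost = hpwl(nets, pos)
    best_pos, best_cost = dict(pos), cost
    t = t_start
    while t > t_end:
        for _ in range(moves_per_temp):
            a = rng.choice(blocks)
            src, dst = pos[a], rng.choice(slots)
            if src == dst:
                continue
            b = occupant[dst]  # None means the target slot is free
            # Apply the move (or the swap, if the target slot is occupied).
            pos[a], occupant[dst], occupant[src] = dst, a, b
            if b is not None:
                pos[b] = src
            new_cost = hpwl(nets, pos)
            delta = new_cost - cost
            # Always accept improvements; accept uphill moves with Boltzmann probability.
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                cost = new_cost
                if cost < best_cost:
                    best_pos, best_cost = dict(pos), cost
            else:
                # Reject: undo the move.
                pos[a], occupant[src], occupant[dst] = src, a, b
                if b is not None:
                    pos[b] = dst
        t *= cooling
    return best_pos, best_cost
```

For example, `anneal_macroblocks(["fft", "fir", "dma", "ctrl"], [("fft", "fir"), ("fir", "dma"), ("dma", "ctrl")], 3, 3)` places four hypothetical cores connected in a chain. Because the annealer moves a handful of blocks rather than thousands of logic cells, the search space is far smaller, which is the intuition behind the speedup the abstract claims.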


... This typically results in inferior quality-of-result (QoR) in terms of performance and power consumption for such designs. Longer compilation time [5] is also commonly observed with existing CAD flows, which eventually lowers design productivity and adversely affects the TTM. Among the major steps in CAD flow, the placement step is known to consume a significant amount of time, often taking close to 50% of the total CAD runtime [6]. ...
... These macros are combined to form fixed sized and shaped clusters and then assigned to FPGA regions while minimizing the timing and wiring costs. Pre-compiled macro-block based design flow has been discussed in BPR [5] and HMFlow [13]. BPR introduces a new FPGA CAD tool for fast compilation of FPGA circuits where placement for each macro-block is selected from a database such that distance and expected routing congestion between macro-blocks are minimized. ...
... However, while FPGA devices are becoming more sophisticated, the computer-aided design (CAD) tools used for FPGA-based designs have not yet matured sufficiently to efficiently map large-scale applications into such FPGAs [16]. It is observed that typical CAD flow can take tens of minutes to hours or even days [9] for such applications, thereby significantly limiting design productivity. ...
... It also creates new library modules for application logic that does not match any library module. BPR [9] uses coarser pre-compiled modules as compared to HMFlow to improve compilation time. In addition, its module library keeps different variations of each module by placing and routing for all possible locations on FPGA. ...
... Acceleration Based on Hard Macros. Researchers have explored acceleration utilizing preimplemented hard macros [19,27,43,49,52,84]. Hard macros consist of pre-built circuitry and can be reused. ...
Article
FPGAs require a much longer compilation cycle than conventional computing platforms like CPUs. In this paper, we shorten the overall compilation time by co-optimizing the HLS compilation (C-to-RTL) and the back-end physical implementation (RTL-to-bitstream). We propose a split compilation approach based on the pipelining flexibility at the HLS level, which allows us to partition designs for parallel placement and routing. We outline a number of technical challenges and address them by breaking the conventional boundaries between different stages of the traditional FPGA tool flow and reorganizing them to achieve a fast end-to-end compilation. Our research produces RapidStream, a parallelized and physical-integrated compilation framework that takes in a latency-insensitive program in C/C++ and generates a fully placed and routed implementation. We present two approaches. The first approach (RapidStream 1.0) resolves inter-partition routing conflicts at the end when separate partitions are stitched together. When tested on the Xilinx U250 FPGA with a set of realistic HLS designs, RapidStream achieves a 5-7x reduction in compile time and up to 1.3x increase in frequency when compared to a commercial off-the-shelf toolchain. In addition, we provide preliminary results using a customized open-source router to reduce the compile time up to an order of magnitude in cases with lower performance requirements. The second approach (RapidStream 2.0) prevents routing conflicts using virtual pins. Testing on Xilinx U280 FPGA, we observed 5-7x compile time reduction and 1.3x frequency increase.
... It performs multi-level clustering initially in order to improve the runtime of VPR's SA placer [50]. Improving the productivity of FPGA design focusing on macro based design flow has been discussed in BPR [51] and HMFlow [52]. Lowering the compilation time using MDM is considered mainly in these approaches. ...
Preprint
Present Field Programmable Gate Array (FPGA) manufacturers incorporate multi-millions of logic resources, which enables hardware designers to build applications of enormous scale. However, handling such applications with the existing FPGA Computer Aided Design (CAD) flow requires improvement in compilation time, performance, and power efficiency. Moreover, the existing CAD flow begins at Register Transfer Level (RTL) code, which restricts analysis and further optimization of large-scale designs to hardware experts and so limits design productivity. Manually optimizing RTL designs becomes increasingly difficult as designs grow. High-Level Synthesis (HLS) can increase the design productivity of hardware applications compared to commonly used Hardware Description Languages (HDLs) and is known to be a capable approach for performing optimizations at a higher level of abstraction. Our aim is to explore approaches that follow the HLS flow to map large-scale applications to FPGAs while addressing the above considerations.
... Analytical placement tools [10] have been known to achieve better balance between scalability and quality of results. There have also been attempts to lower the runtime of placement by introducing pre-compiled macro block based design flow [8] [7]. But these techniques introduce large degradation in placement quality. ...
Conference Paper
Modern FPGAs integrate multi-million logic resources that allow the realization of increasingly large designs. However, state-of-the-art simulated annealing based CAD tools for FPGA suffer from long runtime, poor performance and sub-optimal routing and placement decisions, especially for large applications, leading to less energy efficient designs. In this paper, we present a partitioning methodology that divides a large application into smaller subsystems based on the communication frequency between these subsystems. We leverage the existing CAD tools to compile the large design, which is now annotated with its subsystems, to obtain the final bitstream. Experiments show that the proposed strategy can lead to a performance gain of over 60% while still achieving more than 20% reduction in energy consumption.
... Block Place and Route [1] (BPR) is conceptually similar to HMFlow, but uses a different way of design entry: It uses the vendor's synthesis tool to create a netlist. A custom mapper is then used to map this netlist to hard macros. ...
Conference Paper
Field Programmable Gate Arrays (FPGA) offer the opportunity to build individual hardware solutions even for applications which are produced in small quantity. Quite often, these customized Systems-on-Chip (SoC) contain soft-core processors and a selection of standard peripherals. Synthesizing such systems can be time consuming and thus, design space exploration can become a rather long process. In this contribution, we show an approach to substantially speed up the time to create such system implementations. The price for this improved synthesis time is a slightly reduced operating frequency, which is acceptable in many cases. Using a set of benchmark system configurations, we evaluate our approach against state of the art commercial synthesis tools in terms of tool runtime, resource utilization and achieved system clock frequency.
Article
Routing is one of the most time-consuming steps in the field-programmable gate array (FPGA) design process. Even if unceasing efforts have been made to accelerate FPGA routing, the existing work seldom pays attention to the underlying FPGA connection router. In this article, we present a fast FPGA connection router called PRoute which implements a novel prerouting-based parallel local routing algorithm. Basically, PRoute precomputes the potential routing solutions for various connection patterns on FPGAs, which can be directly used in the later practical routing. On the whole, PRoute is composed of A-star maze expansion and parallel local search. In the first part, PRoute gradually expands the maze wavefront toward the lowest-cost node to search for the target sink. For a wire-type node, PRoute invokes a fast parallel local search instead, taking advantage of the prerouting results, and hence the time-expensive maze expansion can be reduced. In particular, A-star maze expansion and parallel local search can call one another. This enables the runtime efficiency of PRoute while ensuring its global search ability. In addition, we put forward an engineering improvement to further speed up PRoute by avoiding the exploration of block output pins. To the best of our knowledge, this work is the first to apply the idea of prerouting for FPGAs. Experimental results show that PRoute achieves speedups of 1.8×, 2.4×, 3.2×, 4.1×, and 5.1× with 1, 4, 8, 16, and 32 threads, respectively, relative to the baseline Versatile Place and Route (VPR) connection router, without degrading the quality of results.
Article
Routing is one of the most time-consuming stages in the FPGA design flow. Even if various attempts have been made to reduce route-time, the existing work rarely focuses on improving the underlying A*-based FPGA connection router. In this paper, we present a fast FPGA connection router called FCRoute based on a novel soft routing-space pruning algorithm. Within FCRoute, a routing resource priority mechanism is applied to classify the routing resource nodes into high-priority nodes and low-priority ones. On the whole, FCRoute is composed of a fast maze search and a backtracking process. During the fast maze search, we explore only the high-priority nodes in the routing space. In this way, a great deal of unnecessary work can be avoided. When the fast maze search fails to find the target sink, the backtracking process explores the low-priority nodes that are promising to be on the best path, after which a new fast maze search is called. By avoiding the exploration of the majority of low-priority nodes, FCRoute maintains runtime-efficiency while ensuring global search ability. In addition, we further accelerate FCRoute with an engineering enhancement which simplifies the cost computations of nodes. Runtime and quality of results are compared with the state-of-the-art connection router in VPR 8. Experimental results show that on average FCRoute explores less than half the number of routing resource nodes, and therefore reduces runtime by 38% while preserving quality of results. When combined with the enhancement, FCRoute achieves an average 45% reduction in runtime without sacrificing the quality of results.
Article
The increasing size of modern FPGAs allows for ever more complex applications to be mapped onto them. However, long design implementation times for large designs can severely affect design productivity. A modular design methodology can improve design productivity in a divide-and-conquer fashion but at the expense of degraded performance and power consumption of the resulting implementation. To reduce the dominant power dissipation component in FPGAs, the routing power, methodologies have been proposed that consider data communication between modules during module formation and placement on the FPGA. Selecting a proper mapping region on target FPGAs, on the other hand, is becoming a critical process because of the heterogeneous resources and column arrangements in modern FPGAs. Selecting inappropriate FPGA regions for mapping could lead to degraded performance. Hence, we propose a methodology that uses communication-aware module placement, such that modules are mapped by selecting the best shape and region on the FPGA factoring the columnar resource arrangements. Additionally, techniques for module locking and splitting have been proposed for deterministic convergence of the algorithm and for improved module placement. This methodology exhibits nearly 19% routing power reduction with respect to commercial CAD flows without any degradation in achievable performance.
Article
Field Programmable Gate Arrays (FPGAs) are reconfigurable architectures able to provide a good balance between energy efficiency and flexibility with respect to CPUs and ASICs. The main drawback in using FPGAs, however, is their time-consuming routing process, which significantly hinders designer productivity. An emerging solution to this problem is to accelerate the routing by parallelization. Existing attempts at parallelizing FPGA routing either do not fully exploit the parallelism or suffer from an excessive quality loss. Massive parallelism using GPUs has the potential to solve this issue but faces non-trivial challenges.
Article
In this paper, we present a new FPGA routing approach on the basis of the PathFinder routing algorithm. During each routing iteration, our approach applies a novel timing-based re-routing strategy to only re-route the illegal paths. At a lower level, each maze expansion is started from the relatively close part of current routing tree to search for the target sink on the routing resource graph. Experimental results demonstrate that on average the proposed approach reduces the routing runtime by 68.5% compared with the timing-driven router in VPR FPGA placement and routing framework, with reduction of 2.5% and 1.4% in critical path delay and wirelength respectively.
Article
Serial arithmetic has been shown to offer attractive advantages in area for field-programmable gate array (FPGA) datapaths but suffers from a significant reduction in throughput compared to traditional bit-parallel designs. In this work, we perform a performance and trade-off analysis that counterintuitively shows that, despite the decreased throughput of individual serial operators, replication of serial arithmetic can provide a 2.1× average increase in throughput compared to bit-parallel pipelines for common FPGA applications. We complement this analysis with a novel SerDes architecture that enables existing FPGA pipelines to be replaced with serial logic with potentially higher throughput. We also present a serialized sliding-window architecture that improves average throughput 2.4× compared to existing bit-parallel work.
Conference Paper
FPGA designs often contain significant amounts of logic such as a board support package that remains unaltered throughout the design process. However, during normal operation, standard FPGA implementation tools re-implement the entire system, including the unchanged logic, adding to the turn around time of design iterations. Recently, FPGA implementation flows have appeared that allow preserving parts of a previously implemented design. In this study, we evaluate the potential speedups in implementation time achievable through preserving the unchanging portion of a design's implementation. We perform these evaluations using Xilinx Partitions, Xilinx SmartGuide, and the HMFlow rapid implementation tool.
Article
There are well known cases where FPGAs provide high performance within a modest power budget, yet unlike conventional desktop solutions, they are oftentimes associated with long wait times before a device configuration is generated. Such long wait times constitute a bottleneck limiting the number of compilation runs performed in a day, thus limiting FPGA adoption in modern computing platforms. This work presents an FPGA development paradigm that exploits logic variance and hierarchy as a means to increase FPGA productivity. The practical tasks of logic partitioning, placement and routing are examined and a resulting assembly framework, Quick Flow (qFlow), is implemented. Fifteen International Workshop on Logic and Synthesis (IWLS) 2005 benchmark designs and five large designs are used to evaluate qFlow. Experiments show up to 10x speed-ups using the proposed paradigm compared to vendor tool flows.
Conference Paper
Although hardware/software partitioning of embedded applications onto FPGAs is widely known to have performance and power advantages, FPGA usage has been typically limited to hardware experts, due largely to several problems: 1) difficulty of integrating hardware design tools into well-established software tool flows, 2) increasingly lengthy FPGA design iterations due to placement and routing, and 3) a lack of portability and interoperability resulting from device/platform-specific tools and bitfiles. In this paper, we directly address the last two problems by introducing intermediate fabrics, which are virtual reconfigurable architectures specialized for different application domains, implemented on top of commercial-off-the-shelf devices. Such specialization enables near-instantaneous placement and routing by hiding the complexity of fine-grained physical devices, while also enabling circuit portability across all devices that implement the intermediate fabric. When combined with existing work on runtime synthesis from software binaries, intermediate fabrics reduce the effects of all three problems by enabling transparent usage of COTS FPGAs by software designers. In this paper, we explore intermediate fabric architectures using specialization techniques to minimize area and performance overhead of the virtual fabric while maximizing routability and speedup of placement and routing. We present results showing an average placement and routing speedup of 554x, with an average area overhead of 10% and clock overhead of 18%, which corresponds to an average frequency of 195 MHz.
Conference Paper
Just-in-time (JIT) compilation has been used in many applications to enable standard software binaries to execute on different underlying processor architectures. We previously introduced the concept of a standard hardware binary, using a just-in-time compiler to compile the hardware binary to a field-programmable gate array (FPGA). Our JIT compiler includes lean versions of technology mapping, placement, and routing algorithms, of which routing is the most computationally and memory expensive step. As FPGAs continue to increase in size, a JIT FPGA compiler must be capable of efficiently mapping increasingly larger hardware circuits. In this paper, we analyze the scalability of our lean on-chip router, the Riverside on-chip router (ROCR), for routing increasingly large hardware circuits. We demonstrate that ROCR scales well in terms of execution time, memory usage and circuit quality, and we compare the scalability of ROCR to the well known versatile place and route (VPR) timing-driven routing algorithm, comparing to both their standard routing algorithm and their fast routing algorithm. Our results show that on average ROCR executes 3 times faster using 18 times less memory than VPR. ROCR requires only 1% more routing resources, while creating a critical path 30% longer than VPR's standard timing-driven router. Furthermore, for the largest hardware circuit, ROCR executes 3 times faster using 14 times less memory, and results in a critical path 2.6% shorter than VPR's fast timing-driven router.
Article
Energy Advantages of Microprocessor Platforms with On-Chip Configurable Logic. Greg Stitt and Frank Vahid, University of California, Riverside. IEEE Design & Test of Computers, 2002. System chips that incorporate configurable logic can reduce the energy consumed in executing software. The key is to use the configurable logic to execute performance-critical loops, producing an average energy savings of from 25% to 71% for embedded-system benchmarks. ...ng an ASIC coprocessor to reduce energy consumption is common [1,2]. But, because of configurable logic's high power consumption, partitioning that uses configurable logic has mainly targeted speedup [3-5]. Energy is the product of power and time, so the power increase from using configurable logic could outweigh the time savings. In this article, derived from a paper presented at the 2002 IEEE Symposium on Field-Programmable Custom Computing Machines, we show that using on-chip configurable logic can in fact reduce the software e...
Article
This paper reports several properties of heuristic best-first search strategies whose scoring functions f depend on all the information available from each candidate path, not merely on the current cost g and the estimated completion cost h. It is shown that several known properties of A* retain their form (with the minmax of f playing the role of the optimal cost), which helps establish general tests of admissibility and general conditions for node expansion for these strategies. On the basis of this framework the computational optimality of A*, in the sense of never expanding a node that can be skipped by some other algorithm having access to the same heuristic information that A* uses, is examined. A hierarchy of four optimality types is defined, and three classes of algorithms and four domains of problem instances are considered. Computational performances relative to these algorithms and domains are appraised.
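As a point of reference for the abstract above, here is a minimal sketch of plain A*, the special case in which the scoring function is f = g + h. The graph interface (`neighbors` returning `(node, edge_cost)` pairs), the function name `a_star`, and the grid in the usage note are assumptions for illustration; the path-dependent scoring functions the paper generalizes to are not modeled.

```python
import heapq


def a_star(start, goal, neighbors, h):
    """Best-first search ordered by f = g + h.

    neighbors(u) yields (v, edge_cost) pairs; h(v) estimates the remaining cost.
    With an admissible, consistent h the returned path is a cheapest one.
    Returns the list of nodes from start to goal, or None if goal is unreachable.
    """
    g = {start: 0}
    parent = {}
    open_heap = [(h(start), 0, start)]
    while open_heap:
        _, g_u, u = heapq.heappop(open_heap)
        if u == goal:
            path = [u]
            while path[-1] != start:
                path.append(parent[path[-1]])
            return path[::-1]
        if g_u > g[u]:
            continue  # stale heap entry; a cheaper route to u was found later
        for v, w in neighbors(u):
            g_v = g_u + w
            if g_v < g.get(v, float("inf")):
                g[v] = g_v
                parent[v] = u
                heapq.heappush(open_heap, (g_v + h(v), g_v, v))
    return None
```

A typical use is maze routing on a grid, where `neighbors` returns the unblocked 4-connected cells with unit cost and `h` is the Manhattan distance to the goal, which never overestimates the remaining cost on such a grid.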
Article
Mobile wireless terminals tend to become multimode wireless communication devices. Furthermore, these devices become adaptive. Heterogeneous reconfigurable hardware provides the flexibility, performance, and efficiency to enable the implementation of these devices. The implementation of a wideband code division multiple access and an orthogonal frequency division multiplexing receiver using the same coarse-grained reconfigurable MONTIUM tile processor is discussed. Besides the baseband processing part of the receiver, the same reconfigurable processor has also been used to implement Viterbi and Turbo channel decoders.
Article
More and more, field-programmable gate arrays (FPGAs) are accelerating computing applications. The absolute performance achieved by these configurable machines has been impressive-often one to two orders of magnitude greater than processor-based alternatives. Configurable computing is one of the fastest, most economical ways to solve problems such as RSA (Rivest-Shamir-Adelman) decryption, DNA sequence matching, signal processing, emulation, and cryptographic attacks. But questions remain as to why FPGAs have been so much more successful than their microprocessor and DSP counterparts. Do FPGA architectures have inherent advantages? Or are these examples just flukes of technology and market pricing? Will advantages increase, decrease, or remain the same as technology advances? Is there some generalization that accounts for the advantages in these cases? The author attempts to answer these questions and to see how configurable computing fits into the arsenal of structures used to build general, programmable computing platforms.
Article
Just-in-time (JIT) compilation has previously been used in many applications to enable standard software binaries to execute on different underlying processor architectures. However, embedded systems increasingly incorporate Field Programmable Gate Arrays (FPGAs), for which the concept of a standard hardware binary did not previously exist, requiring designers to implement a hardware circuit for a single specific FPGA. We introduce the concept of a standard hardware binary, using a just-in-time compiler to compile the hardware binary to an FPGA. A JIT compiler for FPGAs requires the development of lean versions of technology mapping, placement, and routing algorithms, of which routing is the most computationally and memory expensive step. We present the Riverside On-Chip Router (ROCR) designed to efficiently route a hardware circuit for a simple configurable logic fabric that we have developed. Through experiments with MCNC benchmark hardware circuits, we show that ROCR works well for JIT FPGA compilation, producing good hardware circuits using an order of magnitude less memory resources and execution time compared with the well known Versatile Place and Route (VPR) tool suite. ROCR produces good hardware circuits using 13X less memory and executing 10X faster than VPR's fastest routing algorithm. Furthermore, our results show ROCR requires only 10% additional routing resources, and results in circuit speeds only 32% slower than VPR's timing-driven router, and speeds that are actually 10% faster than VPR's routability-driven router.
Conference Paper
The FPGA compilation process (synthesis, map, place, and route) is a time consuming task that severely limits designer productivity. Compilation time can be reduced by saving implementation data in the form of hard macros. Hard macros consist of previously synthesized, placed and routed circuits that enable rapid design assembly because of the native FPGA circuitry (primitives and nets) which they encapsulate. This work presents results from creating a new FPGA design flow based on hard macros called HMFlow. HMFlow has shown speedups of 10-50X over the fastest configuration of the Xilinx tools. Designed for rapid prototyping, HMFlow achieves these speedups by only utilizing up to 50 percent of the resources on an FPGA and produces implementations that run 2-4X slower than those produced by Xilinx. These speedups are obtained on a wide range of benchmark designs with some exceeding 18,000 slices on a Virtex 4 LX200.
Conference Paper
In systems typified by software defined radio, existing flows for run-time FPGA reconfiguration limit resource efficiency when constructing a variety of datapaths. Our approach allocates a sandbox region in which modules from a library can be flexibly placed and interconnected. An efficient run-time framework makes use of lightweight placement and routing techniques to respond on-demand to application requests. Compile time tools automate the task of adding interface wrappers to modules, insulating the designer from reconfiguration details.
Conference Paper
FPGAs have been used for prototyping of ASICs, for low-volume ASIC replacement and for systems requiring in-field hardware upgrades. However, the potential to use dynamic reconfiguration to adapt FPGA operation to changing application requirements has been hampered by slow reconfiguration times, and poor CAD tool support. In this paper, a new architecture, QUKU (pronounced cuckoo), is described which uses a coarse-grained reconfigurable PE array (CGRA) overlaid on an FPGA. The low-speed reconfigurability of the FPGA is used to optimize the CGRA for different applications, while the high-speed CGRA reconfiguration is used within an application for operator re-use. An FIR filter kernel has been implemented on QUKU and is shown to have performance which bridges the gap between softcore CPUs and custom FPGA filter circuits.
Conference Paper
Designers traditionally think that the way to improve the design cycle is by moving to higher levels of design abstraction. This paper discusses how new EDA tools are changing the form and effectiveness of design reuse as a strategy to increase design quality and engineering productivity.
Conference Paper
SDI is a strategy for the efficient implementation of regular datapaths with fixed topology on FPGAs. It employs parametric module generators, a floorplanner based on a genetic algorithm, and a circuit compaction phase through local technology mapping and placement by ILP models. Initial results promise faster layouts than with general algorithms.
Conference Paper
Routing FPGAs is a challenging problem because of the relative scarcity of routing resources, both wires and connection points. This can lead either to slow implementations caused by long wiring paths that avoid congestion or a failure to route all signals. This paper presents PathFinder, a router that balances the goals of performance and routability. PathFinder uses an iterative algorithm that converges to a solution in which all signals are routed while achieving close to the optimal performance allowed by the placement. Routability is achieved by forcing signals to negotiate for a resource and thereby determine which signal needs the resource most. Delay is minimized by allowing the more critical signals a greater say in this negotiation. Because PathFinder requires only a directed graph to describe the architecture of routing resources, it adapts readily to a wide variety of FPGA architectures such as Triptych, Xilinx 3000 and mesh-connected arrays of FPGAs. The results of routing ISCAS benchmarks on the Triptych FPGA architecture show an average increase of only 4.5% in critical path delay over the optimum delay for a placement. Routes of ISCAS benchmarks on the Xilinx 3000 architecture show a greater completion rate than commercial tools, as well as 11% faster implementations.
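The negotiation described above can be illustrated with a small routability-only sketch. Each net is ripped up and rerouted with a node cost that combines a base cost, a history term that grows while a node stays overused, and a present-sharing penalty that rises every iteration. The multiplicative form `(base + history) * present_penalty` follows the usual textbook statement of PathFinder; the parameter values, the function names, and the omission of the timing-criticality term are simplifying assumptions, so this is not the paper's exact formulation.

```python
import heapq


def shortest_path(adj, cost, src, dst):
    """Dijkstra over node costs; returns the node list from src to dst."""
    dist = {src: cost(src)}
    prev = {}
    heap = [(dist[src], src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist[u]:
            continue  # stale heap entry
        for v in adj[u]:
            nd = d + cost(v)
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        raise ValueError(f"no path from {src} to {dst}")
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


def negotiated_routing(adj, nets, capacity=1, max_iters=50,
                       pres_step=0.5, hist_step=1.0):
    """Route two-terminal nets until no routing node is used beyond its capacity.

    adj maps each node to its neighbors; nets maps a net name to (source, sink).
    """
    history = {n: 0.0 for n in adj}
    occupancy = {n: 0 for n in adj}
    routes = {}
    pres_fac = pres_step
    for _ in range(max_iters):
        for name, (src, dst) in nets.items():
            # Rip up this net's previous route before rerouting it.
            for n in routes.pop(name, []):
                occupancy[n] -= 1

            def cost(n):
                # Overuse this net would cause if it also took node n.
                over = max(0, occupancy[n] + 1 - capacity)
                return (1.0 + history[n]) * (1.0 + pres_fac * over)

            path = shortest_path(adj, cost, src, dst)
            routes[name] = path
            for n in path:
                occupancy[n] += 1
        overused = [n for n in adj if occupancy[n] > capacity]
        if not overused:
            return routes
        for n in overused:
            history[n] += hist_step * (occupancy[n] - capacity)
        pres_fac += pres_step  # sharing becomes more expensive every iteration
    raise RuntimeError("routing did not converge")
```

In a graph where two nets both prefer the same cheap node but each also has a longer private detour, both nets share the node in the first iteration, and the growing history and present-sharing costs then push one of them onto its detour.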
Conference Paper
This paper presents the hardware structure and application of a coarse-grained dynamically reconfigurable hardware architecture dedicated to wireless communication systems. The application tailored architecture, called DReAM (Dynamically Reconfigurable Hardware Architecture for Mobile Communication Systems), is a research project at the Darmstadt University of Technology. It covers the complete design process from analyzing the requirements for the dedicated application field, the specification and VHDL implementation of the architecture, up to the physical layout for the final chip. In the following we provide an overview of the major design stages, starting with a motivation for choosing the concept of distributed arithmetic in reconfigurable computing.
Article
Routing FPGAs is a challenging problem because of the relative scarcity of routing resources, both wires and connection points. This can lead either to slow implementations caused by long wiring paths that avoid congestion or a failure to route all signals. This paper presents PathFinder, a router that balances the goals of performance and routability. PathFinder uses an iterative algorithm that converges to a solution in which all signals are routed while achieving close to the optimal performance allowed by the placement. Routability is achieved by forcing signals to negotiate for a resource and thereby determine which signal needs the resource most. Delay is minimized by allowing the more critical signals a greater say in this negotiation. Because PathFinder requires only a directed graph to describe the architecture of routing resources, it adapts readily to a wide variety of FPGA architectures such as Triptych, Xilinx 3000 and mesh-connected arrays of FPGAs. The results of routing...
Article
In this paper we describe Frontier, an FPGA placement system that uses design macro-blocks in conjunction with a series of placement algorithms to achieve highly-routable and high-performance layouts quickly. In the first stage of design placement, a macro-based floorplanner is used to quickly identify an initial layout based on inter-macro connectivity. Next, an FPGA routability metric, previously described in [14], is used to evaluate the quality of the initial placement. Finally, if the floorplan is determined to be unroutable, a feedback-driven placement perturbation step is employed to achieve a lower cost placement. For a collection of large reconfigurable computing benchmark circuits, our placement system exhibits a 4× speedup in combined place and route time versus commercial FPGA CAD software with improved design performance for most designs. It is shown that floorplanning, routability evaluation, and back-end optimization are all necessary to achieve efficient placement so...
Article
In this work we investigate the routing architecture of FPGAs, focusing primarily on determining the best distribution of routing segment lengths and the best mix of pass transistor and tri-state buffer routing switches. While most commercial FPGAs contain many length 1 wires (wires that span only one logic block) we find that wires this short lead to FPGAs that are inferior in terms of both delay and routing area. Our results show instead that it is best for FPGA routing segments to have lengths of 4 to 8 logic blocks. We also show that 50% to 80% of the routing switches in an FPGA should be pass transistors, with the remainder being tri-state buffers. Architectures that employ the best segmentation distributions and the best mixes of pass transistor and tri-state buffer switches found in this paper are not only 11% to 18% faster than a routing architecture very similar to that of the Xilinx XC4000X but also considerably simpler. These results are obtained using an architecture invest...
Article
We describe the capabilities of and algorithms used in a new FPGA CAD tool, Versatile Place and Route (VPR). In terms of minimizing routing area, VPR outperforms all published FPGA place and route tools to which we can compare. Although the algorithms used are based on previously known approaches, we present several enhancements that improve run-time and quality. We present placement and routing results on a new set of large circuits to allow future benchmark comparisons of FPGA place and route tools on circuit sizes more typical of today's industrial designs. VPR is capable of targeting a broad range of FPGA architectures, and the source code is publicly available. It and the associated netlist translation / clustering tool VPACK have already been used in a number of research projects worldwide, and should be useful in many areas of FPGA architecture research.
Article
Many applications of FPGAs, especially logic emulation and custom computing, require the quick placement and routing of circuit designs. In these applications, the advantages FPGA-based systems have over software simulation are diminished by the long run-times of current CAD software used to map the circuit onto FPGAs. To improve the run-time advantage of FPGA systems, users may be willing to trade some mapping quality for a reduction in CAD tool runtimes. In this paper, we seek to establish how much quality degradation is necessary to achieve a given runtime improvement. For this purpose, we implemented and investigated numerous placement and routing algorithms for FPGAs. We also developed new tradeoff-oriented algorithms, where a tuning parameter can be used to control this quality vs. runtime tradeoff. We show how different algorithms can achieve different points within this tradeoff spectrum, as well as how a single algorithm can be tuned to form a curve in the spectrum. We demonstrate that the algorithms vary widely in their tradeoffs, with the fastest algorithm being 8x faster than the slowest, and the highest-quality algorithm being 5x better than the lowest-quality algorithm. Compared to the commercial Xilinx CAD tools, we can achieve a 3x speed-up by allowing 1.27x degradation in quality, and a factor of 1.6x quality improvement with 2x slowdown.
Mar.) OpenCores. [Online]. Available: http://opencores
  • Opencores