Article

Abstract

The increasing size of modern FPGAs allows ever more complex applications to be mapped onto them. However, long design implementation times for large designs can severely affect design productivity. A modular design methodology can improve design productivity in a divide-and-conquer fashion, but at the expense of degraded performance and power consumption in the resulting implementation. To reduce the dominant power dissipation component in FPGAs, the routing power, methodologies have been proposed that consider data communication between modules during module formation and placement on the FPGA. Selecting a proper mapping region on the target FPGA, meanwhile, is becoming a critical step because of the heterogeneous resources and column arrangements in modern FPGAs; selecting an inappropriate region can degrade performance. Hence, we propose a methodology that uses communication-aware module placement, in which each module is mapped by selecting the best shape and region on the FPGA while factoring in the columnar resource arrangement. Additionally, module locking and splitting techniques are proposed for deterministic convergence of the algorithm and for improved module placement. This methodology achieves nearly 19% routing power reduction with respect to commercial CAD flows without any degradation in achievable performance.
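
As a rough illustration of the communication-aware placement objective, the sketch below scores a candidate module placement by weighting the Manhattan distance between module centers with their communication frequency. All names and the cost form are illustrative assumptions, not the authors' actual formulation:

    # Hypothetical sketch: score a modular placement by communication-weighted
    # Manhattan distance between module centers (a common wirelength proxy).
    def placement_cost(centers, comm):
        """centers: {module: (x, y)} region centers; comm: {(a, b): frequency}."""
        cost = 0.0
        for (a, b), freq in comm.items():
            (xa, ya), (xb, yb) = centers[a], centers[b]
            cost += freq * (abs(xa - xb) + abs(ya - yb))
        return cost

    # Toy usage: of two candidate placements, the one keeping the chatty
    # module pair (m0, m1) adjacent scores lower (better).
    comm = {("m0", "m1"): 10.0, ("m1", "m2"): 2.0}
    near = {"m0": (0, 0), "m1": (1, 0), "m2": (5, 5)}
    far = {"m0": (0, 0), "m1": (5, 5), "m2": (1, 0)}
    print(placement_cost(near, comm) < placement_cost(far, comm))  # True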


... The proposed threshold logic FPGA (TLFPGA) reduced power consumption by 14%, and the area and delay overheads of the implemented circuits were reduced by 5% and 16%, respectively. Herath et al. (2021) proposed a power-efficient mapping approach to implement large-scale applications on modern heterogeneous FPGAs. In the mapping approach, a communication-aware placement methodology found the optimal shape of the modules. ...
Article
Field programmable gate array (FPGA) devices have become widespread in electronic systems due to their low design costs and reconfigurability. In battery-restricted applications such as handheld electronic systems, low-power FPGAs are in great demand. Leakage power almost equals dynamic power in modern integrated circuit technologies, so reducing leakage power leads to significant energy savings. We propose a power-efficient architecture for static random access memory (SRAM) based FPGAs, in which two modes (active mode and sleep mode) are defined for each module. In sleep mode, the module consumes ultra-low leakage power. A module switches dynamically from sleep mode to active mode when its outputs must be evaluated for new input vectors; after producing the correct outputs, the module returns to sleep mode. The proposed circuit design reduces leakage power consumption in both active and sleep modes. The proposed low-leakage FPGA architecture is compared with state-of-the-art architectures by implementing Microelectronics Center of North Carolina (MCNC) benchmark circuits in FPGA-SPICE. Simulation results show an approximately 95% reduction in leakage power consumption in sleep mode. Moreover, total power consumption (leakage + dynamic) is reduced by more than 15% compared with that of the best previous design. The average area overhead (4.26%) is lower than those of other power-gating designs.
Article
Field-Programmable Gate Array (FPGA) timing simulation is essential in electronic circuit design, allowing verification of timing characteristics such as delays and clock frequencies. However, bugs in timing simulation tools can lead to inaccurate results, potentially causing designers to miss critical issues in chip performance. Traditional testing methods often fall short in thoroughly assessing these tools, as current FPGA testing primarily focuses on synthesis and behavioral simulation, neglecting timing aspects. To address this issue, we propose SIMTAM for testing timing simulation tools. SIMTAM consists of three components: equivalent delay region construction, diversity program segment generation, and differential testing. Given a seed circuit design file written in a hardware description language such as Verilog, the delay region construction component randomly identifies delay structures for inertial delay in the design file to construct equivalent delay sleep regions. In a sleep region, the simulator skips any signal pulse whose width is less than the specified delay, thus ensuring the equivalence of the variants. The diversity program segment generation component combines Verilog expressions using generation operators and injects them into the sleep region to generate diverse design files. The differential testing component compares the seed and variant design files to find compilation inconsistency issues. In five months, SIMTAM reported 16 bugs to the developers of two popular timing simulation tools, Iverilog and Vivado; ten of these have been confirmed.
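
The differential-testing step of such a flow can be pictured as running the seed and each equivalent variant through the same simulator and diffing the outputs. The harness below is a minimal sketch; the command lines are placeholders, not SIMTAM's actual driver:

    import subprocess

    def run_sim(sim_cmd, design):
        """Run a simulator command on one design file; the command list is a
        placeholder (assumption), e.g. a compile-and-run wrapper script."""
        r = subprocess.run(sim_cmd + [design], capture_output=True, text=True,
                           timeout=600)
        return r.returncode, r.stdout

    def differential_test(seed, variants, sim_cmd):
        """Variants are built to be semantically equivalent to the seed, so any
        divergence in exit code or output flags a potential simulator bug."""
        ref = run_sim(sim_cmd, seed)
        return [v for v in variants if run_sim(sim_cmd, v) != ref]

    # Hypothetical usage with a wrapper script around a timing simulator:
    # suspects = differential_test("seed.v", ["var1.v", "var2.v"], ["./simulate.sh"])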
Article
The ever-increasing rate of static power consumption in nanoscale technologies, and consequently the breakdown of Dennard scaling, acts as a power wall for further device scaling. With intensified power density, designers are forced to selectively power off portions of the chip area, known as dark silicon. Given the significant power consumption of routing resources in field-programmable gate arrays (FPGAs) and their low utilization rate, power gating of unused routing resources can reduce overall device power consumption. While power gating has received considerable attention, previous studies neglect major factors that affect its effectiveness, such as routing architecture, topology, and technology. In this article, we propose a power-efficient routing architecture (PERA) for SRAM-based FPGAs, designed according to the utilization pattern of routing resources with different topologies. PERA is applicable at different granularities, from a single multiplexer to the switch-matrix (SM) level. We examine the efficiency of the proposed architecture with different topologies, structures, and parameters of routing resources. We further propose a routing algorithm that reduces the scattered use of resources and hence increases the opportunities for power gating in routing resources. Our experiments using the versatile place and route (VPR) toolset on an FPGA architecture similar to commercial chips, over an extensive set of circuits from the Microelectronics Center of North Carolina (MCNC), International Workshop on Logic Synthesis (IWLS), Verilog to Routing (VTR), and Titan benchmarks, indicate that PERA reduces static power consumption by 43.3%, at the expense of 7.4% area overhead. PERA along with the optimized routing algorithm offers a total routing leakage power reduction of up to 64.9% compared to non-power-gating architectures and 6.9% compared with the conventional routing algorithm, across all benchmark circuits and architectures with various wire segment lengths, while the optimized routing algorithm degrades performance by less than 3%.
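
A back-of-the-envelope model shows where such savings come from: unused switch matrices are gated down to a small residual-leakage floor while used ones leak normally. All constants below are illustrative assumptions, not PERA's measured numbers:

    # Toy static-power model for power-gated routing (constants are assumptions).
    P_LEAK_PER_SM = 1.0      # normalized leakage of one active switch matrix
    GATED_RESIDUAL = 0.05    # residual leakage + sleep-transistor overhead

    def routing_static_power(num_sms, utilization, power_gated):
        used = round(num_sms * utilization)
        unused = num_sms - used
        if not power_gated:
            return num_sms * P_LEAK_PER_SM
        return used * P_LEAK_PER_SM + unused * GATED_RESIDUAL * P_LEAK_PER_SM

    base = routing_static_power(1000, 0.4, power_gated=False)
    gated = routing_static_power(1000, 0.4, power_gated=True)
    print(f"static power saved: {100 * (base - gated) / base:.1f}%")  # 57.0%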
Article
Full-text available
Dynamic and partial reconfiguration are key differentiating capabilities of field programmable gate arrays (FPGAs). While they have been studied extensively in academic literature, they find limited use in deployed systems. We review FPGA reconfiguration, looking at architectures built for the purpose, and the properties of modern commercial architectures. We then investigate design flows and identify the key challenges in making reconfigurable FPGA systems easier to design. Finally, we look at applications where reconfiguration has found use, as well as proposing new areas where this capability places FPGAs in a unique position for adoption.
Article
Full-text available
The travelling salesman problem (TSP) is probably one of the most famous problems in combinatorial optimization. Many techniques exist to solve the TSP, such as Ant Colony Optimization (ACO), Genetic Algorithms (GA), and Simulated Annealing (SA). In this paper, we conduct a comparison study to evaluate the performance of these three algorithms in terms of execution time and shortest distance. The algorithms are implemented in Java and evaluated on three benchmarks under the same platform conditions. Among the three algorithms, we found that simulated annealing has the shortest execution time (<1 s) but ranks second for shortest distance. In terms of shortest distance between the cities, ACO performs better than GA and SA; however, ACO ranks last in execution time.
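
For concreteness, a minimal simulated-annealing TSP solver of the kind compared in the paper (a generic 2-opt version, not the authors' Java implementation) looks roughly like this:

    import math
    import random

    def tour_length(tour, dist):
        return sum(dist[tour[i]][tour[(i + 1) % len(tour)]]
                   for i in range(len(tour)))

    def sa_tsp(dist, iters=50_000, t0=100.0, cooling=0.9995, seed=0):
        """Simulated annealing with 2-opt moves; dist is an n x n matrix."""
        rng = random.Random(seed)
        n = len(dist)
        tour = list(range(n))
        rng.shuffle(tour)
        cur_len = tour_length(tour, dist)
        best, best_len = tour[:], cur_len
        t = t0
        for _ in range(iters):
            i, j = sorted(rng.sample(range(n), 2))
            cand = tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]  # 2-opt reversal
            cand_len = tour_length(cand, dist)
            # Always accept improvements; accept worse tours with a
            # probability that shrinks as the temperature cools.
            if cand_len < cur_len or rng.random() < math.exp((cur_len - cand_len) / t):
                tour, cur_len = cand, cand_len
                if cur_len < best_len:
                    best, best_len = tour[:], cur_len
            t *= cooling
        return best, best_len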
Article
Full-text available
Since their introduction, field programmable gate arrays (FPGAs) have grown in capacity by more than a factor of 10 000 and in performance by a factor of 100. Cost and energy per operation have both decreased by more than a factor of 1000. These advances have been fueled by process technology scaling, but the FPGA story is much more complex than simple technology scaling. Quantitative effects of Moore's Law have driven qualitative changes in FPGA architecture, applications and tools. As a consequence, FPGAs have passed through several distinct phases of development. These phases, termed “Ages” in this paper, are The Age of Invention, The Age of Expansion and The Age of Accumulation. This paper summarizes each and discusses their driving pressures and fundamental characteristics. The paper concludes with a vision of the upcoming Age of FPGAs.
Conference Paper
Full-text available
Partial reconfiguration (PR) has enabled the adoption of FPGAs in state-of-the-art adaptive applications. Current PR tools require the designer to perform manual floorplanning, which requires knowledge of the physical architecture of FPGAs and an understanding of how to floorplan for optimal performance and area. This has led to PR remaining a specialist skill and made it less attractive to high-level system designers. In this paper we introduce a technique, which can be incorporated into the existing tool flow, that overcomes the need for manual floorplanning of PR designs. It takes into account the overheads generated by PR as well as the architecture of the latest FPGAs. The result is a floorplan that is efficient for PR systems, where reconfiguration time and area should be minimised.
Article
Full-text available
Modern vehicles incorporate a significant amount of computation, which has led to an increase in the number of computational nodes and the need for faster in-vehicle networks. Functions range from noncritical control of electric windows, through critical drive-by-wire systems, to entertainment applications; as more systems are automated, this variety and number will continue to increase. Accommodating the varying computational and communication requirements of such a diverse range of functions requires flexible networks and embedded computing devices. As the number of electronic control units (ECUs) increases, power and efficiency become more important, more so in next-generation electric vehicles. Moreover, predictability and isolation of safety-critical functions are nontrivial challenges when aggregating multiple functions onto fewer nodes. Reconfigurable computing can play a key role in addressing these challenges, providing both static and dynamic flexibility, with high computational capabilities, at lower power consumption. Reconfigurable hardware also provides resources and methods to address deterministic requirements, reliability and isolation of aggregated functions. This letter presents some initial research on the place of reconfigurable computing in future vehicles.
Conference Paper
Full-text available
We introduce a new congestion-driven placement algorithm for FPGAs in which the overlapping effect of bounding boxes is taken into consideration. Experimental results show that, compared with the linear congestion method (Betz et al., 1999) used in the state-of-the-art FPGA place-and-route package VPR (Betz and Rose, 1997), our algorithm achieves a channel width reduction on 70% of the 20 largest MCNC benchmark circuits (10.1% on average) while keeping the channel width of the remaining 30% of benchmarks unchanged. A distinct feature of our algorithm is that the critical path delay is not lengthened on average, and in most cases is reduced.
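
The overlap effect can be visualized by counting, per grid tile, how many nets' bounding boxes cover it; where boxes stack up, routing demand accumulates. The sketch below is a generic reconstruction of this idea, not the authors' exact cost function:

    import numpy as np

    def congestion_map(net_pins, grid_w, grid_h):
        """net_pins: list of pin-position lists, one per net. Each net adds
        routing demand over its full bounding box; overlapping boxes make
        demand accumulate, flagging likely congestion hot spots."""
        demand = np.zeros((grid_h, grid_w))
        for pins in net_pins:
            xs = [x for x, _ in pins]
            ys = [y for _, y in pins]
            demand[min(ys):max(ys) + 1, min(xs):max(xs) + 1] += 1
        return demand

    nets = [[(0, 0), (3, 2)], [(1, 1), (4, 3)], [(6, 6), (7, 7)]]
    print(congestion_map(nets, 8, 8).max())  # 2.0: the first two boxes overlap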
Article
With the recent advancement of multilayer convolutional neural networks (CNN) and fully connected networks (FCN), deep learning has achieved amazing success in many areas, especially in visual content understanding and classification. To improve the performance and energy efficiency of the computation-demanding CNN, FPGA-based acceleration emerges as one of the most attractive alternatives. In this paper we design and implement Caffeine, a hardware/software co-designed library to efficiently accelerate the entire CNN and FCN on FPGAs. First, we propose a unified convolutional matrix-multiplication representation for both computation-bound convolutional layers and communication-bound fully connected (FCN) layers. Based on this representation, we optimize the accelerator micro-architecture and maximize the underlying FPGA computing and bandwidth resource utilization based on a revised roofline model. Moreover, we design an automation flow to directly compile high-level network definitions to the final FPGA accelerator. As a case study, we integrate Caffeine into the industry-standard software deep learning framework Caffe. We evaluate Caffeine and its integration with Caffe by implementing the VGG16 and AlexNet networks on multiple FPGA platforms. Caffeine achieves a peak performance of 1,460 GOPS on a medium-sized Xilinx KU060 FPGA board; to our knowledge, this is the best published result. It achieves more than 100x speed-up on FCN layers over prior FPGA accelerators. An end-to-end evaluation with Caffe integration shows up to 29x and 150x performance and energy gains over Caffe on a 12-core Xeon server, and 5.7x better energy efficiency over the GPU implementation. Performance projections for a system with a high-end FPGA (Virtex-7 690T) show even higher gains.
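
The convolutional matrix-multiplication representation underlying such accelerators is, in essence, the standard im2col lowering: every convolution window becomes a matrix row so that the whole layer reduces to one matrix multiply. A single-channel NumPy sketch (not Caffeine's actual HLS code):

    import numpy as np

    def conv_as_matmul(x, w):
        """2-D convolution (valid padding, stride 1) as one matrix multiply.
        x: (H, W) input; w: (K, K) kernel; single channel for clarity."""
        H, W = x.shape
        K = w.shape[0]
        oh, ow = H - K + 1, W - K + 1
        # im2col: gather every KxK patch into one row of a (oh*ow, K*K) matrix.
        cols = np.array([x[i:i + K, j:j + K].ravel()
                         for i in range(oh) for j in range(ow)])
        return (cols @ w.ravel()).reshape(oh, ow)

    x = np.arange(16.0).reshape(4, 4)
    w = np.ones((3, 3))
    print(conv_as_matmul(x, w))  # matches a direct sliding-window computation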
Conference Paper
Modern FPGAs integrate millions of logic resources that allow the realization of increasingly large designs. However, state-of-the-art simulated-annealing-based CAD tools for FPGAs suffer from long runtimes, poor performance, and sub-optimal routing and placement decisions, especially for large applications, leading to less energy-efficient designs. In this paper, we present a partitioning methodology that divides a large application into smaller subsystems based on the communication frequency between these subsystems. We leverage the existing CAD tools to compile the large design, now annotated with its subsystems, to obtain the final bitstream. Experiments show that the proposed strategy can lead to a performance gain of over 60% while still achieving more than 20% reduction in energy consumption.
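
The partitioning objective amounts to cutting a module communication graph so that high-frequency edges stay inside subsystems. A toy greedy sketch under assumed names (a real flow would use a proper min-cut partitioner such as hMETIS):

    import math

    def cut_weight(partition, comm):
        """Communication frequency crossing subsystem boundaries so far."""
        return sum(w for (a, b), w in comm.items()
                   if a in partition and b in partition
                   and partition[a] != partition[b])

    def greedy_partition(nodes, comm, k=2):
        """Place each node in whichever of k size-capped subsystems adds the
        least cut weight; a stand-in for real graph partitioners."""
        cap = math.ceil(len(nodes) / k)
        partition, sizes = {}, [0] * k
        for n in nodes:
            best_p, best_c = None, float("inf")
            for p in range(k):
                if sizes[p] >= cap:
                    continue
                partition[n] = p
                c = cut_weight(partition, comm)
                if c < best_c:
                    best_p, best_c = p, c
            partition[n] = best_p
            sizes[best_p] += 1
        return partition

    comm = {("a", "b"): 100, ("c", "d"): 80, ("b", "c"): 1}
    # Keeps the chatty pairs together; only the low-frequency edge is cut.
    print(greedy_partition(["a", "b", "c", "d"], comm))  # {'a':0,'b':0,'c':1,'d':1}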
Article
Advanced driver-assistance systems (ADAS) generally embrace heterogeneous platforms consisting of central processing units and field-programmable gate arrays (FPGAs) to achieve higher performance and energy efficiency. The multiple-target tracking (MTT) system is an important component in most ADAS and is particularly suited for heterogeneous implementation to improve responsiveness. However, the platform heterogeneity necessitates numerous design decisions to obtain the optimal application partitioning between the processor and the FPGA. In this paper, multiple configurations of the MTT application have been investigated on the Xilinx Zynq commercial heterogeneous platform. An extensive design space exploration was performed to recommend the optimal configuration with high performance and energy efficiency. A reduction of more than 65%, both in execution time and energy consumption, has been obtained by the utilization of the heterogeneous architecture. Finally, an analytical model is proposed to estimate execution time and energy consumption to enable a rapid exploration of the different configurations and predict the performance that can be expected with future system-on-chip (SoC) platforms and radar sensors in ADAS.
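
An analytical model of this kind typically decomposes each configuration into software time, accelerator time, and transfer time, with energy as power-weighted component times. The sketch below is a generic reconstruction with assumed parameters, not the paper's calibrated model:

    # Generic HW/SW partitioning model (all parameters are assumptions).
    def exec_time(sw_ops, hw_ops, bytes_moved,
                  sw_rate=1e9, hw_rate=10e9, bw=1e9):
        """Seconds per frame: CPU part + FPGA part + PS<->PL data transfers."""
        return sw_ops / sw_rate + hw_ops / hw_rate + bytes_moved / bw

    def energy(t_sw, t_hw, t_xfer, p_cpu=1.5, p_fpga=0.8, p_dma=0.3):
        """Joules per frame as power-weighted component times (watts assumed)."""
        return t_sw * p_cpu + t_hw * p_fpga + t_xfer * p_dma

    # Compare an all-software configuration against a partitioned one.
    print(exec_time(5e7, 0, 0))      # 0.05 s, CPU only
    print(exec_time(1e7, 4e7, 2e5))  # 0.0142 s with the hot loop offloaded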
Conference Paper
Field Programmable Gate Array (FPGA) CAD flow run-time has increased due to the rapid growth in the size of designs and FPGAs. Researchers are trying to find new ways to improve compilation time without degrading design performance. In this paper, we present a novel approach that identifies tightly grouped FPGA logic blocks and then uses this information during circuit placement. Our approach is an orthogonal optimization applicable to incremental design and physical optimization, and reduces placement run-time. Specifically, we present a new algorithm that analyzes designs post-placement to extract medium-grained super-clusters consisting of two to seventeen clusters, which we call 'gems'. We modified VPR's simulated annealing placement algorithm to place our mixture of gems and clusters. Our new 'Singularity Annealing' algorithm first crushes each cluster grouping into a 'singularity' (treated as a single cluster). The Singularity Annealer is then run over this condensed circuit to obtain an initial placement, followed by an expansion of the singularities. Finally, we run a second low-temperature annealing phase on the entire expanded circuit. Our results show that our system reduces placement run-time on average by 17% while maintaining the designs' critical path delay, and increases channel width and wirelength by 2% and 6.3%, respectively. We also present a test case showing the re-usability of gems in an incremental design example.
Conference Paper
The aim of this paper is to present a novel floorplanner based on Mixed-Integer Linear Programming (MILP), providing a formulation that makes the problem tractable using state-of-the-art solvers. The proposed method takes into account an accurate description of the heterogeneous resources and partial reconfiguration constraints of recent FPGAs. A global optimum can be found for small instances in a small amount of time; for large instances, with a time-limited search, a 20% average improvement can be achieved over floorplanners based on simulated annealing. Our approach allows the designer to customize the objective function to be minimized, so that different weights can be assigned to a linear combination of metrics such as total wire length, aspect ratio, and area occupancy.
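
The core of such MILP formulations is the pairwise non-overlap constraint, linearized with binary selector variables and a big-M constant. A generic version (notation assumed, not the paper's exact model): for regions $i, j$ with lower-left corner $(x_i, y_i)$, width $w_i$, and height $h_i$,

    \begin{align*}
    x_i + w_i &\le x_j + M\,(1 - \delta_{ij}^{1}) && \text{($i$ left of $j$)}\\
    x_j + w_j &\le x_i + M\,(1 - \delta_{ij}^{2}) && \text{($j$ left of $i$)}\\
    y_i + h_i &\le y_j + M\,(1 - \delta_{ij}^{3}) && \text{($i$ below $j$)}\\
    y_j + h_j &\le y_i + M\,(1 - \delta_{ij}^{4}) && \text{($j$ below $i$)}\\
    \delta_{ij}^{1} + \delta_{ij}^{2} + \delta_{ij}^{3} + \delta_{ij}^{4} &\ge 1,
    \quad \delta_{ij}^{m} \in \{0, 1\}
    \end{align*}

At least one spatial relation must hold for every region pair, so regions cannot overlap; the customizable objective is then a weighted linear combination of metrics such as half-perimeter wirelength, aspect ratio, and area.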
Conference Paper
Numerous studies have shown the advantages of hardware and software co-design using FPGAs. However, increasingly lengthy place-and-route times represent a barrier to the broader adoption of this technology by significantly reducing designer productivity and turns-per-day, especially compared to more traditional design environments offered by competitive technologies such as GPUs. In this paper, we address this challenge by introducing a new approach to FPGA application design that significantly reduces compile times by exploiting the functional reuse common throughout modern FPGA applications, e.g. as shared code libraries and unchanged modules between compiles. To evaluate this approach, we introduce Block Place and Route (BPR), an FPGA CAD approach that modifies traditional placement and routing to operate at a higher-level of abstraction by pre-computing the internal placement and routing of reused cores. By extending traditional place-and-route algorithms such as simulated-annealing placement and negotiated-congestion routing to abstract away the detailed implementation of reused cores, we show that BPR is capable of orders-of-magnitude speedup in place-and-route over commercial tools with acceptably low overhead for a variety of applications.
Conference Paper
The FPGA compilation process (synthesis, map, place, and route) is a time-consuming task that severely limits designer productivity. Compilation time can be reduced by saving implementation data in the form of hard macros. Hard macros consist of previously synthesized, placed, and routed circuits that enable rapid design assembly because of the native FPGA circuitry (primitives and nets) which they encapsulate. This work presents results from creating a new FPGA design flow based on hard macros called HMFlow. HMFlow has shown speedups of 10-50X over the fastest configuration of the Xilinx tools. Designed for rapid prototyping, HMFlow achieves these speedups by utilizing no more than 50 percent of the resources on an FPGA and produces implementations that run 2-4X slower than those produced by Xilinx. These speedups are obtained on a wide range of benchmark designs, with some exceeding 18,000 slices on a Virtex-4 LX200.
Conference Paper
Programmable logic devices such as FPGAs are useful for a wide range of applications. However, FPGAs are not commonly used in battery-powered applications because they consume more power than ASICs and lack power management features. In this paper, we describe the design and implementation of Pika, a low-power FPGA core targeting battery-powered applications such as those in consumer and automotive markets. Our design uses the Xilinx Spartan-3 low-cost FPGA as a baseline and achieves substantial power savings through a series of power optimizations. The resulting architecture is compatible with existing commercial design tools. The implementation is done in a 90nm triple-oxide CMOS process. Compared to the baseline design, Pika consumes 46% less active power and 99% less standby power. Furthermore, it retains circuit and configuration state during standby mode, and wakes up from standby mode in approximately 100ns.
Article
Mean shift, a simple iterative procedure that shifts each data point to the average of data points in its neighborhood, is generalized and analyzed in this paper. This generalization makes some k-means-like clustering algorithms its special cases. It is shown that mean shift is a mode-seeking process on a surface constructed with a “shadow” kernel. For Gaussian kernels, mean shift is a gradient mapping. Convergence is studied for mean shift iterations. Cluster analysis is treated as a deterministic problem of finding a fixed point of mean shift that characterizes the data. Applications in clustering and Hough transform are demonstrated. Mean shift is also considered as an evolutionary strategy that performs multistart global optimization.
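
A compact Gaussian-kernel mean-shift iteration of the kind analyzed (a standard textbook rendering, not the paper's notation):

    import numpy as np

    def mean_shift(points, bandwidth=1.0, iters=100, tol=1e-6):
        """Shift a copy of each point to the Gaussian-weighted mean of all
        points until the shifts vanish; points converging to the same fixed
        point (mode) form one cluster."""
        modes = points.astype(float).copy()
        for _ in range(iters):
            shifted = np.empty_like(modes)
            for i, m in enumerate(modes):
                d2 = ((points - m) ** 2).sum(axis=1)
                w = np.exp(-d2 / (2 * bandwidth ** 2))  # Gaussian kernel weights
                shifted[i] = (w[:, None] * points).sum(axis=0) / w.sum()
            done = np.abs(shifted - modes).max() < tol
            modes = shifted
            if done:
                break
        return modes

    pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
    print(np.round(mean_shift(pts), 2))  # two modes: ~(0.05, 0) and ~(5.05, 5)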
Article
In this paper, Frontier, an integrated, timing-driven placement system that aggressively uses macro-blocks and floorplanning to quickly converge to a high-quality placement solution, is detailed. This system can be used in place of existing placement approaches for macro-based designs targeted to devices with architectures similar to the Xilinx Virtex [Xilinx Corporation 2001], XC4000 [Xilinx Corporation 1998], and Lucent ORCA [Lucent Technologies 1996] families. Rather than using a single algorithm, the Frontier tool set relies on a sequence of interrelated placement steps. First, in a floorplanning step, macros are combined into localized clusters of fixed size and shape and assigned to device regions to minimize placement cost. Following initial floorplanning, a routability and performance evaluator, based on wire length and static timing information, determines whether subsequent routing for a given target device is likely to complete successfully under pre-specified timing constraints. If this evaluation is pessimistic, low-temperature simulated annealing is performed on the contents of all soft macros in the design to allow for additional placement cost reduction, enhanced design routability, and improved design performance.