Conference Paper

CAD techniques for power optimization in Virtex-5 FPGAs

Authors:
  • Futurewei Technologies, Inc. US-R&D

Abstract

We consider dynamic power dissipation in FPGAs and present CAD techniques for dynamic power reduction. The proposed techniques, comprising power-aware placement, routing, and a novel post-routing transformation, are applied to optimize the power consumed by industrial designs implemented in the Xilinx® Virtex™-5 FPGA. Board-level power measurements on a suite of industrial designs show that the techniques reduce power by 10%, on average.

... Assigning noncritical paths to clusters with low power supply voltage is the key idea of [86]. Placement algorithms to reduce power consumption in FPGAs are presented in [87] [88] [89]. The main idea is to add the estimated dynamic power to the cost function of the placement algorithm. ...
... A placement algorithm that takes into account the cost of using clock network resources to reduce power consumed by clock network is reported in [89]. Routing algorithms to reduce power consumption in FPGAs are reported in [87] [26]. Assigning nodes with high switching activity to low-capacitance routing resources is the main idea behind the routing algorithm for reducing interconnect power in [87]. ...
... Routing algorithms to reduce power consumption in FPGAs are reported in [87] [26]. Assigning nodes with high switching activity to low-capacitance routing resources is the main idea behind the routing algorithm for reducing interconnect power in [87]. A routing algorithm that balances the arrival times of the inputs of the same LUT to reduce power consumption in FPGAs is proposed in [26]. ...
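The idea of steering high-activity signals onto low-capacitance routing resources can be illustrated with a small sketch. This is not the algorithm of [87]; the names `resource_cost`, `pick_resource`, and `power_weight`, and the numeric values, are hypothetical and chosen only to show how an activity-weighted capacitance term changes which resource a router would prefer.

```python
def resource_cost(base_cost, capacitance, activity, power_weight=1.0):
    """Cost of using one routing resource for a signal.

    base_cost    -- conventional router cost (e.g. delay or congestion), assumed given
    capacitance  -- capacitance of the resource, in arbitrary units
    activity     -- switching activity of the signal (toggles per cycle)
    power_weight -- hypothetical knob trading the base cost against power
    """
    return base_cost + power_weight * activity * capacitance


def pick_resource(candidates, activity, power_weight=1.0):
    """Return the name of the cheapest candidate.

    candidates is a list of (name, base_cost, capacitance) tuples.
    """
    best = min(
        candidates,
        key=lambda c: resource_cost(c[1], c[2], activity, power_weight),
    )
    return best[0]


# Illustrative values only: a fast but high-capacitance wire and a
# slower, low-capacitance wire.
candidates = [("long_wire", 1.0, 4.0), ("short_wire", 2.0, 1.0)]
busy_choice = pick_resource(candidates, activity=0.9)   # high-activity signal
quiet_choice = pick_resource(candidates, activity=0.1)  # low-activity signal
```

With these illustrative numbers, the high-activity signal is steered to the low-capacitance resource, while the low-activity signal keeps the resource with the lower base cost.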
... Based on its importance in the larger technology nodes, a large number of works have been proposed, such as architecture modifications and CAD techniques [18], [19]. Recently, a dynamic power-aware placement algorithm was proposed in [9] without consideration of other parameters such as leakage power. In that paper, an estimated dynamic power consumption, computed during the placement stage using value-specific parameters, is considered as the secondary optimization parameter in the cost function of the simulated annealing-based placement algorithm. ...
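A minimal sketch of a placement cost with estimated dynamic power as a secondary term is given here. It is an assumption-laden illustration, not the cost function of [9]: net capacitance is approximated by half-perimeter wirelength (HPWL), and the names `Net`, `hpwl`, `placement_cost`, and `power_weight` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Net:
    pins: List[Tuple[int, int]]  # (x, y) locations of the blocks on this net
    activity: float              # switching activity (toggles per cycle)


def hpwl(net: Net) -> int:
    """Half-perimeter wirelength of the net's bounding box."""
    xs = [x for x, _ in net.pins]
    ys = [y for _, y in net.pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def placement_cost(nets: List[Net], power_weight: float = 0.5) -> float:
    """Wirelength cost plus a weighted dynamic-power estimate.

    Assumption: net capacitance is proportional to HPWL, so the power
    term is the sum of activity * HPWL over all nets.
    """
    wire = sum(hpwl(n) for n in nets)
    power = sum(n.activity * hpwl(n) for n in nets)
    return wire + power_weight * power


nets = [
    Net(pins=[(0, 0), (2, 3)], activity=0.5),
    Net(pins=[(1, 1), (1, 4)], activity=0.1),
]
cost = placement_cost(nets, power_weight=0.5)
```

In a simulated annealing placer, a function of this shape would be evaluated (usually incrementally) for each proposed swap, so that moves shortening high-activity nets are favored over moves shortening quiet ones.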
... The dynamic power of a netlist is the accumulated dynamic power consumed in the resources used by each net. This power can be formulated as follows [9]: ...
... Several net capacitance estimation models have been proposed in [16]. For a fair comparison with [9], we follow that work and use the following linear model, which has a mean error rate of about 60%, instead of more accurate non-linear models to estimate the capacitance of net i. It is important to note, however, that more accurate models can be used without any costly or fundamental modification to this paper's contribution. ...
Conference Paper
In high-frequency FPGAs, shrinking technology nodes, decreasing threshold voltages, and large numbers of unused resources make leakage power a considerable contributor to total power consumption. In addition, process variation, an important challenge in nano-scale technologies, has a great impact on the leakage power of FPGAs. The reconfigurability of FPGAs offers a unique opportunity to mitigate these challenges by extracting each chip's own variation map. In this paper, a per-chip process variation-aware placement (VMAP) algorithm is proposed to reduce the leakage power of FPGAs using the extracted variation map without neglecting dynamic power consumption. VMAP is adaptive to the different process variation maps of various FPGA chips. Experimental results on the evaluated benchmarks show that the power-delay-product (PDP) cost is reduced by 7.2% with VMAP compared with conventional placement algorithms, with less than 16.8% standard deviation across different variation maps.
... On the other hand, an analytical placer that optimizes wire-length is used in [18] instead of an SA-based placer. Power-aware CAD techniques, for instance, [19] and [20], have not discussed their suitability for large designs. On the other hand, scalable techniques mainly focus on improving performance rather than power consumption. ...
... As a result, the dynamic power consumption of the interconnect fabric dominates the total dynamic power dissipation. Dynamic power dissipation (P) of an FPGA is modeled as follows [19]: ...
... In [1], the authors test power consumption in a 65nm FPGA, which consumes less power than the Virtex-4 FPGA. In [2], the authors design CAD techniques for power optimization. In this work, they try to reduce dynamic power dissipation. ...
... A prominent transformation from performance-oriented design to power-constrained design can be observed in the IC industry as well [35]. There are studies on power-aware CAD tools [54] [55] in which power is reduced by adjusting the cost functions of placement and routing algorithms that deal with delay or wire length. However, they are dependent on traditional CAD algorithms and their impact on large-scale complex applications remains uncertain. ...
Preprint
Present-day Field Programmable Gate Array (FPGA) manufacturers incorporate multi-millions of logic resources, which enables hardware designers to design applications extending to enormous scales. However, handling such applications with the existing FPGA Computer Aided Design (CAD) flow requires improvement in terms of compilation time, performance and power efficiency. Moreover, the existing CAD flow begins at Register Transfer Level (RTL) code, which restricts analysis and further optimization of large-scale designs to hardware experts and so limits design productivity. The manual process of optimizing RTL designs is increasingly difficult when considering larger designs. High-Level Synthesis (HLS) is an approach capable of increasing the design productivity of hardware applications compared to commonly used Hardware Description Languages (HDLs) and is known to be a capable approach for performing optimizations at a higher level of abstraction. Our concern is to explore approaches that follow the HLS flow to cater to the mapping of large-scale applications to FPGAs while addressing the above considerations.
... Altera improves their proprietary CAD tool, Quartus, by introducing parallelism in their SA based Q2P placer [18]. Similarly, Xilinx introduces an analytical placement strategy in their CAD [11], inspired by ASIC placement strategies. ...
... Analytical placers were first invented in the 1980s in the field of ASIC design [71] and continue to be widely used in that field. For FPGAs, Xilinx first started using analytical techniques in 2007 [54]. Academic research on analytical placers is still lagging behind proprietary tools in terms of QoR. ...
Thesis
Field-Programmable Gate Arrays (FPGAs) are programmable integrated circuits, which are used to accelerate applications. An FPGA configuration is compiled by specialized software. Compilation is typically split up in the following steps: synthesis, technology mapping, packing, placement and routing. Once the configuration is compiled, the designer can check if the configuration meets the application constraints. In case the constraints are not met, the designer has to modify the high level description of his design and recompile. This slow process is called the FPGA design cycle. To shorten the design cycle we introduce new faster packing, placement and routing techniques.
... In [56], the authors proposed to use the unutilized pins inside the logic block to reduce the glitching power inside the LUT. The methodology is based on forcing the unused pins to constant values in such a way that glitches do not occur in the pass-transistor multiplexer. ...
... The proposed parallel routing techniques are described in Section III. An experimental study is presented in Section IV. Conclusions ... algorithm [14] in their commercial routers [17], [16]. PathFinder is also used in the publicly-available VPR FPGA placement and routing framework [12], which we parallelize in this work. Fig. 1 gives an overview of the negotiated congestion approach. ...
Conference Paper
We consider coarse and fine-grained techniques for parallel FPGA routing on modern multi-core processors. In the coarse-grained approach, sets of design signals are assigned to different processor cores and routed concurrently. Communication between cores is through the MPI (message passing interface) communications protocol. In the fine-grained approach, the task of routing an individual load pin on a signal is parallelized using threads. Specifically, as FPGA routing resources are traversed during maze expansion, delay calculation, costing and priority queue insertion for these resources execute concurrently. The proposed techniques provide deterministic/repeatable results. Moreover, the coarse and fine-grained approaches are not mutually exclusive and can be used in tandem. Results show that on a 4-core processor, the techniques improve router run-time by ~2.1×, on average, with no significant impact on circuit speed performance or interconnect resource usage.
... Intuitively, the shorter wire-lengths produced by edge flow should lead to shorter routes and lower power consumption. Also, smaller LUTs created by WireMap can also be exploited to reduce power after placement/routing by appropriate programming of the memory values [9]. We plan to evaluate the potential power reduction due to WireMap as part of future work. ...
Conference Paper
This paper presents a new technology mapper, WireMap. The mapper uses an edge flow heuristic to improve the routability of a mapped design. The heuristic is applied during the iterative mapping optimization to reduce the total number of pin-to-pin connections (or edges). The average edge reduction of 9.3% is achieved while maintaining depth and LUT count of state-of-the-art technology mapping. Placing and routing the resulting netlists leads to an 8.5% reduction in the total wire length, a 6.0% reduction in minimum channel width, and a 2.3% reduction in critical path delay. Applying WireMap has an additional advantage of reducing an average number of inputs of LUTs without increasing the total LUT count and depth. The percentages of 5- and 6-LUTs in a typical design are reduced, while the percentages of 2-, 3-, and 4-LUTs are increased. These smaller LUTs can be merged into pairs and implemented using the dual output LUT structure found in commercial FPGAs. WireMap leads to 9.4% fewer dual-output LUTs after merging.
... Naturally, no two signals may use the same wire segment. The two largest FPGA vendors, Xilinx and Altera, use a variant of the PathFinder negotiated congestion routing algorithm [9] in their commercial routers [10], [11]. PathFinder is also used in the publicly-available VPR FPGA placement and routing framework [12], which we modified and used for this work. Fig. 2 gives an overview of the negotiated congestion approach used in VPR PathFinder. ...
Conference Paper
We propose a new FPGA routing approach that, when combined with a low-cost architecture change, results in a 34% reduction in router run-time, at the cost of a 3% area overhead, with no increase in critical path delay. Our approach begins with traditional PathFinder-style routing, which we run on a coarsened representation of the routing architecture. This leads to fast generation of a partial routing solution where signals are assigned to groups of wire segments rather than individual wire segments. A Boolean satisfiability (SAT)-based stage follows, generating a legal routing solution from the partial solution. Our approach points to a new research direction: reducing FPGA CAD run-time by exploring FPGA architectures and algorithms together.
... We also measured total power (including static and dynamic) and found the total power improvement to be similar, 0.7%, on average across the designs. Thus, we conclude that LUT packing is generally beneficial to power, and the approach can be combined with the techniques in the Xilinx power-aware flow [Gupta et al. 2007]. ...
Article
Packing is a key step in the FPGA tool flow that straddles the boundaries between synthesis, technology mapping and placement. Packing strongly influences circuit speed, density, and power, and in this article, we consider packing in the commercial FPGA context and examine the area and performance trade-offs associated with packing in a state-of-the-art FPGA, the Xilinx® Virtex™-5 FPGA. In addition to look-up-table (LUT)-based logic blocks, modern FPGAs also contain large IP blocks. We discuss packing techniques for both types of blocks. Virtex-5 logic blocks contain dual-output 6-input LUTs. Such LUTs can implement any single logic function of up to 6 inputs, or any two logic functions requiring no more than 5 distinct inputs. The second LUT output has reduced speed, and therefore, must be used judiciously. We present techniques for dual-output LUT packing that lead to improved area-efficiency, with minimal performance degradation. We then describe packing techniques for large IP blocks, namely, block RAMs and DSPs. We pack circuits into the large blocks in a way that leverages the unique block RAM and DSP layout/architecture in Virtex-5, achieving significantly improved design performance.
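The dual-output constraint stated in the abstract (two functions may share a LUT if together they need no more than 5 distinct inputs) can be checked with a few lines of code. The sketch below is only an illustration: the greedy pairing and the names `can_share_dual_output_lut` and `greedy_pair` are hypothetical, and the sketch ignores the timing penalty of the second output that the article says must be weighed.

```python
def can_share_dual_output_lut(inputs_a, inputs_b, max_distinct_inputs=5):
    """True if two functions fit one dual-output LUT.

    The limit of 5 distinct inputs is the constraint stated for
    Virtex-5 dual-output 6-input LUTs.
    """
    return len(set(inputs_a) | set(inputs_b)) <= max_distinct_inputs


def greedy_pair(luts):
    """Greedily pair functions that can share a LUT (first fit).

    luts maps a function name to the set of its input signal names.
    Returns a list of tuples: pairs, or singletons when no partner fits.
    A real packer would also consider timing and placement; this
    first-fit order is an assumption made for illustration.
    """
    unpaired = list(luts)
    groups = []
    while unpaired:
        first = unpaired.pop(0)
        partner = next(
            (o for o in unpaired
             if can_share_dual_output_lut(luts[first], luts[o])),
            None,
        )
        if partner is None:
            groups.append((first,))
        else:
            unpaired.remove(partner)
            groups.append((first, partner))
    return groups


luts = {
    "f": {"a", "b", "c"},
    "g": {"d", "e", "k", "x"},
    "h": {"b", "c", "d"},
}
groups = greedy_pair(luts)
```

Here `f` and `h` together use four distinct inputs and are paired, while `g` stays alone because pairing it with `f` would need seven.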
... Lamoureux et al. propose to reduce the dynamic power in FPGAs through edge alignment and glitch filtering [Lamoureux et al. 2008]. Fourth, high-level FPGA power models, power-aware architectures, and CAD tools are also studied for faster power estimation [Chen et al. 2007; Gupta et al. 2007]. Li et al. present a flexible FPGA architecture evaluation platform, which incorporates the switch-level models for interconnects and macro models for LUTs [Li et al. 2003]. ...
Article
We explore the use of Data-Level Parallelism (DLP) as a way of improving the energy efficiency and power consumption involved in running applications on an FPGA. We show that static power consumption is a significant fraction of the overall power consumption in an FPGA and that it does not change significantly even as the area required by an architecture increases, because of the dominance of interconnect in an FPGA. We show that the degree of DLP can be used in conjunction with frequency scaling to reduce the overall power consumption.
... Intuitively, the shorter wirelengths produced by edge flow should lead to shorter routes and lower power consumption. Also, smaller LUTs created by WireMap can also be exploited to reduce power after placement/routing by appropriate programming of the memory values [Gupta et al. 2007]. We plan to evaluate the potential power reduction due to WireMap as part of future work. ...
Article
This article presents a new technology mapper, WireMap. The mapper uses an edge flow heuristic to improve the routability of a mapped design. The heuristic is applied during the iterative mapping optimization to reduce the total number of pin-to-pin connections (or edges). On academic benchmarks (ISCAS, MCNC, and ITC designs), the average edge reduction of 9.3% is achieved while maintaining depth and LUT count compared to state-of-the-art technology mapping. Placing and routing the resulting netlists leads to an 8.5% reduction in the total wirelength, a 6.0% reduction in minimum channel width, and a 2.3% reduction in critical path delay. This technique is applied in the Xilinx ISE Design tool to evaluate its effect on industrial Virtex5 circuits. In a set of 20 large designs, we find the edge reduction is 6.8% while total wirelength measured in the placer is reduced by 3.6%. Applying WireMap has an additional advantage of reducing an average number of inputs of LUTs without increasing the total LUT count and depth. The percentages of 5- and 6-LUTs in a typical design are reduced, while the percentages of 2-, 3-, and 4-LUTs are increased. These smaller LUTs can be merged into pairs and implemented using the dual-output LUT structure found in commercial FPGAs. For academic benchmarks, WireMap leads to 9.4% fewer dual-output LUTs after merging. For the industrial designs, WireMap leads to 6.3% fewer dual-output Virtex5 LUTs.
Article
Routing is a very time-consuming stage in the FPGA design flow, significantly hindering productivity. This article proposes CPRS, a coarse-grained parallel routing scheme for a distributed computing environment. First, we partition the entire routing region to guide the assignment of nets for parallel processing. The partitioning is recursive: at each level, the region is split into two subregions, forming three subsets of nets. The first subset consists of potentially dependent nets, which span different subregions. The remaining two subsets consist of potentially independent nets, each contained in its own subregion. Second, we route the nets of the first subset serially and process the remaining two subsets in parallel. The parallel processing is coarse-grained and is implemented with the MPI parallel programming model. Finally, we explore optimizations of both partitioning and parallel processing to further improve the overall speedup of parallel routing. In addition, we use MPI messages to synchronize intermediate results between different cores during parallel routing so that a feasible solution is obtained. Experiments on a set of commonly used benchmarks demonstrate the effectiveness of CPRS. Notably, CPRS achieves about 18× speedup on average using 32 processor cores with minor loss of quality, compared with the VTR 7.0 serial router. There is about 1.6× improvement over the state-of-the-art parallel router.
Article
In this paper, we propose a shared-memory parallel FPGA router called ParRA. ParRA is composed of hybrid partitioning and parallel routing. During hybrid partitioning, an FPGA is first split into multiple subregions and nets are geographically partitioned into local subsets. As the inter-subregion nets usually overlap each other, these nets cannot be routed in parallel. The inter-subregion nets are therefore further partitioned into conflict-free subsets. Since each conflict-free subset consists of inter-subregion nets that do not overlap each other, the nets in the same conflict-free subset can be routed in parallel. In this way, we significantly increase the number of nets that can potentially be routed in parallel. During the parallel routing process, two novel parallel routing strategies are applied to route the nets in conflict-free and local subsets, respectively. With conflict-free subsets, sinks in the same conflict-free subset are routed in parallel while the conflict-free subsets are routed one by one. In contrast, local subsets are routed in parallel while the nets in the same local subset are routed sequentially. With the two different parallel routing strategies, we reduce the interference between threads and balance the workload of threads, which helps to expose more parallelism. The proposed parallel router provides deterministic routing results. Experimental results show that ParRA achieves an average speedup of 24.3× with 16 threads compared to VPR 7.0, with no negative impact on the quality of results.
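One simple way to form conflict-free subsets is sketched below. The abstract does not say how overlap between nets is determined or how subsets are built, so both choices here are assumptions: overlap is tested on net bounding boxes, and subsets are filled first-fit. The names `boxes_overlap` and `conflict_free_subsets` are hypothetical.

```python
def boxes_overlap(a, b):
    """True if two bounding boxes (xmin, ymin, xmax, ymax) intersect.

    Coordinates are inclusive, so boxes that touch count as overlapping.
    """
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def conflict_free_subsets(nets):
    """Partition nets so that no two nets in a subset overlap.

    nets maps a net name to its bounding box. Each net goes into the
    first subset where it overlaps nothing (first-fit), otherwise it
    opens a new subset. Nets within one subset could then be routed
    concurrently, with the subsets processed one after another.
    """
    subsets = []
    for name, box in nets.items():
        for subset in subsets:
            if all(not boxes_overlap(box, nets[other]) for other in subset):
                subset.append(name)
                break
        else:
            subsets.append([name])
    return subsets


nets = {
    "n1": (0, 0, 2, 2),
    "n2": (1, 1, 3, 3),  # overlaps n1
    "n3": (5, 5, 6, 6),  # overlaps neither
}
subsets = conflict_free_subsets(nets)
```

Because the nets are visited in a fixed order, the partition is the same on every run, which is in keeping with the deterministic results the abstract reports.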
Chapter
The previous chapters in this book focused on CAD tools for approximate computing. These included the algorithms and methodologies for verification, synthesis, and test of an approximate computing circuit. However, the scope of approximate computing is not limited to such circuits and the hardware built from them. In fact, approximate computing spans multiple layers, from architecture and hardware to software. Several system architectures have been proposed, ranging from those employing neural networks to dedicated approximation processors. The hardware is designed with respect to an architectural specification of the respective error criteria. Further, the software that runs the end application is aware of such hardware features and actively utilizes them. This is called cross-layer approximate computing, where the software and hardware work in tandem according to an architectural specification. Such systems can harness the power of approximations more effectively. This chapter details an important microprocessor architecture developed for approximate computing. This architecture, ProACt, can do cross-layer approximations spanning hardware and software. ProACt stands for Processor for On-demand Approximate Computing. Details on this processor architecture, implementation, and the detailed evaluation are provided in the forthcoming sections.
Conference Paper
Glitches are unnecessary transitions on logic signals that needlessly consume dynamic power. Glitches arise from imbalances in the combinational path delays to a signal, which may cause the signal to toggle multiple times in a given clock cycle before settling to its final value. In this paper, we propose a low-cost circuit structure that is able to eliminate a majority of glitches. The structure, which is incorporated into the output buffers of FPGA logic elements, suppresses pulses on buffer outputs whose duration is shorter than a configurable time window (set at the time of FPGA configuration). Glitches are thereby eliminated "at the source" ensuring they do not propagate into the high-capacitance FPGA interconnect, saving power. An experimental study, using Altera commercial tools for power analysis, demonstrates that the proposed technique reduces 70% of glitches, at a cost of 1% reduction in speed performance.
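The pulse-suppression behavior described above can be modeled in software as follows. This is a behavioral sketch only, not the proposed circuit: the event-list representation, the name `filter_glitches`, and the choice to delay each surviving transition by the full window are assumptions made for illustration.

```python
def filter_glitches(initial, transitions, window):
    """Suppress pulses shorter than `window` on a logic signal.

    initial     -- signal value before the first transition
    transitions -- list of (time, new_value), sorted by time
    window      -- minimum time a value must hold to be passed on

    A new value is forwarded only if it stays stable for at least
    `window`; in this model the forwarded edge appears `window` later.
    """
    filtered = []
    current = initial
    for i, (time, value) in enumerate(transitions):
        if i + 1 < len(transitions):
            next_time = transitions[i + 1][0]
        else:
            next_time = float("inf")  # last value holds indefinitely
        held_long_enough = (next_time - time) >= window
        if held_long_enough and value != current:
            filtered.append((time + window, value))
            current = value
    return filtered


# A 1-time-unit glitch at t=10 followed by a real transition at t=20.
glitchy = filter_glitches(0, [(10, 1), (11, 0), (20, 1)], window=3)
# A clean signal whose pulses are all longer than the window.
clean = filter_glitches(0, [(5, 1), (15, 0)], window=3)
```

In the first case the short pulse is dropped and only the transition at t=20 survives, delayed by the window. The added delay on every surviving edge is the model's counterpart of the speed cost the paper reports.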
Article
Monte Carlo simulations are widely used, e.g., in the fields of physics and molecular modelling. The main role in these is played by high-performance random number generators, such as RANLUX or Mersenne Twister. In this paper the authors introduce the world's first implementation of the RANLUX algorithm on an FPGA platform for high performance computing purposes. A significant speed-up of over 60 times for one generator instance, compared with a graphics card-based solution, can be observed. Comparisons with competing solutions were made and are also presented. The proposed solution has an extremely low power demand, consuming less than 2.5 Watts per RANLUX core, which makes it perfect for use in environment friendly and energy-efficient supercomputing solutions and embedded systems.
Conference Paper
This paper presents a new, open-source method for FPGA CAD researchers to realize their techniques on real Xilinx devices. Specifically, we extend the Verilog-To-Routing (VTR) suite, which includes the VPR place-and-route CAD tool on which many FPGA innovations have been based, to generate working Xilinx bitstreams via the Xilinx Design Language (XDL). Currently, we can faithfully translate VPR's heterogeneous packing and placement results into an exact Xilinx 'map' netlist, which is then routed by its 'par' tool. We showcase the utility of this new method with two compelling applications targeting a 40nm Virtex-6 device: a fair comparison of the area, delay, and CAD runtime of academia's state-of-the-art VTR flow with a commercial, closed-source equivalent, along with a CAD experiment evaluated using physical measurements of on-chip power consumption and die temperature, over time. This extended flow - VTR-to-Bitstream - is released to the community with the hope that it can enhance existing research projects as well as unlock new ones.
Conference Paper
Leakage power in nano-scale technologies is an important source of power consumption. Moreover, FPGAs with low utilization rates consume considerable leakage power in their routing and logic resources. As the FPGA routing architecture incorporates a large number of transistors, the leakage power of routing resources contributes the majority of total leakage power consumption. In this paper, a pre-routing prediction model is proposed for estimating the leakage power consumption of routing resources. Several parameters are used in the prediction model by analyzing their impacts on the leakage power consumption of routing resources. In addition to these parameters, the behavior of FPGA routers on different nets is estimated and considered in the prediction model. The proposed prediction model can be used in pre-routing power optimization techniques such as power-aware placement algorithms. Simulation results for 10 MCNC benchmarks show that the mean error rate of the proposed prediction model is about 28.8% for FPGAs with a 60% utilization rate.
Chapter
Field-programmable gate arrays (FPGAs) are reconfigurable devices that can be programmed after fabrication to implement any digital logic. As such, they are flexible, easy to modify in-field, and cheaper to use than manufacturing a customized application-specific integrated circuit (ASIC). However, this programmability comes at a cost in terms of area, performance, and perhaps most importantly power. As currently manufactured, FPGAs are significantly less power efficient than ASICs. Fortunately, in the last decade concentrated attention to power consumption has identified many approaches to power reduction. This chapter surveys the techniques and progress made to improve FPGA power efficiency.
Conference Paper
We present HeAP, an analytical placement algorithm for heterogeneous FPGAs comprised of LUT-based logic blocks, multiplier/DSP blocks and block RAMs. Specifically, we adapt a state-of-the-art ASIC-based analytical placer to target FPGAs with heterogeneous blocks located at discrete locations throughout the fabric. Our placer also handles macros of LUT-based blocks with specific layout requirements, such as carry chains. Results show that our placer delivers a 4× speedup, on average, compared to Altera's non-timing driven flow, at the cost of a 5% increase in post-routed wirelength, and an 11× speedup compared to Altera's timing-driven flow, at the cost of a 4% increase in post-routed wirelength and a 9% reduction in maximum operating frequency. We also compare with an academic simulated annealing-based placer and demonstrate a 7.4× runtime advantage with 6% better placement quality.
Article
We propose a new field-programmable gate array (FPGA) routing approach, which, when combined with a low-cost architecture change, results in a 40% reduction in router runtime, at the cost of a 6% area overhead and with no increase in critical path delay. Our approach begins with PathFinder-style routing, which we run on a coarsened representation of the routing architecture. This leads to fast generation of a partial routing solution where the signals are assigned to groups of wire segments rather than individual wire segments. A Boolean satisfiability (SAT)-based stage follows, generating a legal routing solution from the partial solution. We explore approximately 165 000 FPGA switch block architectures, showing that the choice of the architecture has a significant impact on the complexity of the SAT formulation, and by extension, on routing runtime. Our approach points to a new research direction, namely, reducing FPGA computer-aided design runtime by exploring FPGA architectures and algorithms together.
Article
We present parallelization and heuristic techniques to reduce the run-time of field-programmable gate array (FPGA) negotiated congestion routing. Two heuristic optimizations provide over 3× speedup versus a sequential baseline. In our parallel approach, sets of design signals are assigned to different processor cores and routed concurrently. Communication between cores is through the message passing interface communications protocol. We propose a geographic partitioning of signals into independent sets to help minimize the communication overhead. Our parallel implementation provides approximately 2.3× speedup using four cores and produces deterministic/repeatable results. When combined, the parallel and heuristic techniques provide over 7× speedup with four cores versus the router in the widely used Versatile Place and Route (VPR) FPGA placement/routing framework, with no significant impact on circuit speed or wirelength.
Article
Although various techniques have been proposed for power reduction in field-programmable devices (FPDs), they are still all based on conventional logic elements (LEs). In the conventional LE, the output of the combinational logic (e.g. the look-up table (LUT) in many field-programmable gate arrays (FPGAs)) is connected to the input of the storage element, while the D flip-flop (DFF) is always clocked even when not necessary. Such unnecessary transitions waste power. To address this problem, we propose a novel productivity-driven LE with a reduced number of transitions. The differences between our LE and the conventional LE are in the FF type used and the internal LE organisation. In our LEs, DFFs have been replaced by T flip-flops with the T input permanently connected to logic value 1. Instead of connecting the output of the combinational logic to the FF input, we use it as the FF clock. The proposed LE has been validated via Simulation Program with Integrated Circuit Emphasis (SPICE) simulations for a 45-nm Complementary Metal–Oxide–Semiconductor (CMOS) technology as well as via real Computer-Aided Design (CAD) tools on a real FPGA using the standard Microelectronic Center of North Carolina (MCNC) benchmark circuits. The experimental results show that FPDs using our proposal not only have 48% lower total power but also run 17% faster than conventional FPDs on average.
Article
Runtime reconfigurable systems built upon devices with partial reconfiguration can reduce overall hardware area and economic cost and improve power efficiency, in addition to the performance improvements due to better customization. However, the users of such systems have to be able to afford some additional costs compared to hardwired application-specific circuits. More precisely, reconfigurable devices have higher power consumption, occupy larger silicon area and operate at lower speeds. Higher power consumption requires additional packaging cost, shortens chip lifetimes, requires expensive cooling systems, decreases system reliability and prohibits battery operation. The less efficient usage of silicon real estate is usually compensated by runtime hardware reconfiguration and relocation of functional units. The available configuration data paths, however, have limited bandwidth, which introduces overheads that may eclipse the benefits of dynamic reconfiguration. In this dissertation, we address three major problems related to runtime management of hardware resources: efficient online hardware task scheduling and placement, power consumption reduction and reconfiguration overhead minimization. Since hardware tasks are allocated and deallocated dynamically at runtime, the reconfigurable fabric can suffer from fragmentation. This can lead to the undesirable situation that tasks cannot be allocated even if there would be sufficient free area available. As a result, the overall system performance is degraded. Therefore, efficient management of hardware resources is very important. To manage hardware resources efficiently, we propose novel online hardware task scheduling and placement algorithms on partially reconfigurable devices with higher quality and faster execution compared to related proposals. To cope with the high power consumption in field programmable devices, we propose a novel logic element with lower power consumption compared to current approaches. To reduce runtime overhead, we augment the FPGA configuration circuit architecture and allow faster reconfiguration and relocation compared to current reconfigurable devices.
Article
The continued scaling of CMOS technology has led us into the deep submicron regime, where design is not limited by the functionality on a chip but is constrained by its power consumption. In this paper, we present some widely used techniques for static and dynamic power minimisation in modern VLSI circuits. These techniques are applicable at different stages of system design, starting from the technology level, where the designer is allowed to change technology parameters (transistor sizes, supply and threshold voltages), up to the top level, which deals with the design's architectural variations. Along with the overview of power minimisation techniques, a binary divider circuit is introduced as an example and implemented in various FPGA families to demonstrate the influence of technology as well as Placement and Routing (PAR) on total power consumption.
Article
This study discusses the implementation of two sets of techniques for minimising power within the context of a commercial field programmable gate array (FPGA) placement flow. The first aspect discussed in this work is a power-aware objective function for placement. In particular, a capacitance model for global nets is described which allows the net power in a design to be dramatically reduced. The second aspect describes augmentations to a physical re-synthesis flow, which help to reduce area and power by optimising the number of combinational and sequential cells. The results are quantified across a suite of 119 industrial benchmarks targeting the Actel® IGLOO™ FPGA architecture. Power measurements show that the techniques described in this study reduce dynamic power by 13% on average, with a 6.7% average improvement in timing performance across the suite.
Conference Paper
We consider packing in the commercial FPGA context and examine the speed, performance and power trade-offs associated with packing in a state-of-the-art FPGA, the Xilinx® Virtex™-5 FPGA. Two aspects of packing are discussed: 1) packing for general logic blocks, and 2) packing for large IP blocks. Virtex-5 logic blocks contain dual-output 6-input look-up-tables (LUTs). Such LUTs can implement any single logic function requiring no more than 6 inputs, or any two logic functions requiring no more than 5 distinct inputs. The second LUT output is associated with slower speed, and therefore, must be used judiciously. We present placement-based techniques for dual-output LUT packing that lead to improved area-efficiency and power, with minimal performance degradation. We then move on to address packing for large IP blocks, specifically, block RAMs and DSPs. We present a packing optimization that is widely applicable in DSP designs that leads to significantly improved design performance.
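The dual-output LUT constraint stated in this abstract can be written as a small feasibility check. This is a minimal sketch only: the function name `can_share_lut` and the representation of each logic function by the set of net names it reads are assumptions made for illustration, not the paper's packing algorithm, and the check ignores the speed penalty of the second output.

```python
def can_share_lut(inputs_a, inputs_b, max_shared_inputs=5, max_single_inputs=6):
    """Return True if two logic functions could share one dual-output LUT.

    Hypothetical helper: each function is described only by the set of
    net names it reads. Per the abstract, two functions fit together when
    they need no more than 5 distinct inputs in total.
    """
    a, b = set(inputs_a), set(inputs_b)
    if len(a) > max_single_inputs or len(b) > max_single_inputs:
        # At least one function does not fit in a 6-input LUT on its own.
        return False
    # Shared nets are counted once, so the union gives the distinct inputs.
    return len(a | b) <= max_shared_inputs
```

For example, functions reading `{a, b, c}` and `{c, d, e}` use five distinct inputs and pass the check, while `{a, b, c, d}` and `{e, f}` use six and do not.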
Conference Paper
Clock network power in field-programmable gate arrays (FPGAs) is considered and two complementary approaches for clock power reduction in the Xilinx® Virtex™-5 FPGA are presented. The approaches are unique in that they leverage specific architectural aspects of Virtex-5 to achieve reductions in dynamic power consumed by the clock network. The first approach comprises a placement-based technique to reduce interconnect resource usage on the clock network, thereby reducing capacitance and power (up to 12%). The second approach borrows the "clock gating" notion from the ASIC domain and applies it to FPGAs. Clock enable signals on flip-flops are selectively migrated to use the dedicated clock enable available on the FPGA's built-in clock network, leading to reduced toggling on the clock interconnect and lower power (up to 28%). Power reductions are achieved without any performance penalty, on average.
Conference Paper
Clock signals are responsible for a significant portion of dynamic power in FPGAs owing to their high toggle frequency and capacitance. Clock signals are distributed to loads through a programmable routing tree network, designed to provide low delay and low skew. The placement step of the FPGA CAD flow plays a key role in influencing clock power, as clock tree branches are connected based solely on the placement of the clock loads. In this paper, we present a placement-based approach to clock power reduction based on an integer linear programming (ILP) formulation. Our technique is intended to be used as an optimization post-pass executed after traditional placement, and it offers fine-grained control of the amount by which clock power is optimized versus other placement criteria. Results show that the proposed technique reduces clock network capacitance by over 50% with minimal deleterious impact on post-routed wirelength and circuit speed.
Conference Paper
This paper considers the implementation of an annealing technique for dynamic power reduction in FPGAs. The proposed method comprises a power-aware objective function for placement and is implemented in a commercial tool. In particular, a capacitance model based on multi-dimensional nonlinear regression is described, as well as a new capacitance model for global nets. The importance and advantages of these models are highlighted in terms of the overall attainable reduction in power in a real, commercially-available architecture and tool flow. The results are quantified across a range of industrial benchmarks targeting the Actel® IGLOO™ FPGA architecture. Power measurements show that, across a suite of 120 industrial designs, the technique described in this paper reduces dynamic power by 13% on average, with only a 1% degradation in timing performance.
Article
Practical Low Power Digital VLSI Design emphasizes the optimization and trade-off techniques that involve power dissipation, in the hope that the readers are better prepared the next time they are presented with a low power design problem. The book highlights the basic principles, methodologies and techniques that are common to most CMOS digital designs. The advantages and disadvantages of a particular low power technique are discussed. Besides the classical area-performance trade-off, the impact to design cycle time, complexity, risk, testability and reusability are discussed. The wide impacts to all aspects of design are what make low power problems challenging and interesting. Heavy emphasis is given to top-down structured design style, with occasional coverage in the semicustom design methodology. The examples and design techniques cited have been known to be applied to production scale designs or laboratory settings. The goal of Practical Low Power Digital VLSI Design is to permit the readers to practice the low power techniques using current generation design style and process technology. Practical Low Power Digital VLSI Design considers a wide range of design abstraction levels spanning circuit, logic, architecture and system. Substantial basic knowledge is provided for qualitative and quantitative analysis at the different design abstraction levels. Low power techniques are presented at the circuit, logic, architecture and system levels. Special techniques that are specific to some key areas of digital chip design are discussed as well as some of the low power techniques that are just appearing on the horizon. Practical Low Power Digital VLSI Design will be of benefit to VLSI design engineers and students who have a fundamental knowledge of CMOS digital design.
Conference Paper
This paper describes a technique that reduces dynamic power in FPGAs by reducing the number of glitches in the global routing resources. The technique involves adding programmable delay elements within the logic blocks of an FPGA to programmably align the arrival times of early-arriving signals to the inputs of the lookup tables and to filter out glitches generated by earlier circuitry. On average, the proposed technique eliminates 91% of the glitching, which reduces overall FPGA power by 18%. The added circuitry increases overall area by 5% and critical-path delay by less than 1%. Furthermore, since it is applied after routing, the proposed technique requires no modifications to the existing FPGA routing architecture or CAD flow.
Conference Paper
In this paper we evaluate the trade-offs between various low-leakage design techniques for field programmable gate arrays (FPGAs) in deep sub-micron technologies. Since multiplexers are widely used in FPGAs for implementing look up tables (LUTs) and connection and routing switches, several low-leakage implementations of pass transistor based multiplexers and routing switches are proposed and their design trade-offs are presented based on transistor-level simulation, physical design, and impact on overall system performance. We find that gate biasing, the use of redundant SRAM cells, and integration of multi-Vt technology are ideal for FPGAs, and they can reduce leakage current by 2X-4X compared to an implementation without any leakage reduction technique. For some of the potential low-leakage design techniques evaluated in our study, the impact on chip area ranges from minimal to an increase of 15%-30%.
Conference Paper
As Field-Programmable Gate Array (FPGA) power consumption continues to increase, lower power FPGA circuitry, architectures, and Computer-Aided Design (CAD) tools need to be developed. Before designing low-power FPGA circuitry, architectures, or CAD tools, we must first determine where the biggest savings (in terms of energy dissipation) are to be made and whether these savings are cumulative. In this paper, we focus on FPGA CAD tools. Specifically, we describe a new power-aware CAD flow for FPGAs that was developed to answer the above questions. Estimating energy using very detailed post-route power and delay models, we determine the energy savings obtained by our power-aware technology mapping, clustering, placement, and routing algorithms and investigate how the savings behave when the algorithms are applied concurrently. The individual savings of the power-aware technology-mapping, clustering, placement, and routing algorithms were 7.6%, 12.6%, 3.0%, and 2.6% respectively. The majority of the overall savings were achieved during the technology mapping and clustering stages of the power-aware FPGA CAD flow. In addition, the savings were mostly cumulative when the individual power-aware CAD algorithms were applied concurrently with an overall energy reduction of 22.6%.
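The statement that the savings are "mostly cumulative" can be checked with simple arithmetic on the figures quoted in this abstract. The sketch below compares the reported 22.6% against two idealised ways of combining the individual savings, plain addition and multiplicative compounding. Both combination rules and the function names are assumptions for illustration; the abstract does not say how the authors define cumulative.

```python
from math import prod

# Individual energy savings quoted in the abstract, as fractions.
SAVINGS = {
    "technology mapping": 0.076,
    "clustering": 0.126,
    "placement": 0.030,
    "routing": 0.026,
}
REPORTED_COMBINED = 0.226  # overall reduction quoted in the abstract


def additive(savings):
    """Upper estimate: the individual savings simply add up."""
    return sum(savings)


def compounded(savings):
    """Each stage removes its fraction of the energy left by the others."""
    return 1.0 - prod(1.0 - s for s in savings)


if __name__ == "__main__":
    values = list(SAVINGS.values())
    print(f"additive:   {additive(values):.1%}")
    print(f"compounded: {compounded(values):.1%}")
    print(f"reported:   {REPORTED_COMBINED:.1%}")
```

Under these assumptions the additive estimate is about 25.8% and the compounded estimate about 23.7%, so the reported 22.6% recovers most of what fully independent savings would give.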
Conference Paper
Reconfigurable architectures, including FPGAs, are promising solutions for managing increasing design complexity while achieving both performance and flexibility. To support reconfiguration, FPGAs use more transistors per function than fixed-logic solutions, resulting in higher leakage power consumption. Consequently, FPGAs are generally not found in mobile applications. In this work, we analyze the leakage power of a low-cost, 90 nm FPGA using detailed device-level simulations. The simulation methodology accounts for design-dependent variations and provides detailed leakage power breakdowns. The analysis quantifies the leakage power challenge in FPGAs, and identifies promising approaches for FPGA leakage power reduction.
Article
The dynamic power consumed by a digital CMOS circuit is directly proportional to both switching activity and interconnect capacitance. In this paper, we consider early prediction of net activity and interconnect capacitance in field-programmable gate array (FPGA) designs. We develop empirical prediction models for these parameters, suitable for use in power-aware layout synthesis, early power estimation/planning, and other applications. We examine how switching activity on a net changes when delays are zero (zero delay activity) versus when logic delays are considered (logic delay activity) versus when both logic and routing delays are considered (routed delay activity). We then describe a novel approach for prelayout activity prediction that estimates a net's routed delay activity using only zero or logic delay activity values, along with structural and functional circuit properties. For capacitance prediction, we show that prediction accuracy is improved by considering aspects of the FPGA interconnect architecture in addition to generic parameters, such as net fanout and bounding box perimeter length. We also demonstrate that there is an inherent variability (noise) in the switching activity and capacitance of nets that limits the accuracy attainable in prediction. Experimental results show the proposed prediction models work well given the noise limitations.
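The proportionality in the first sentence of this abstract is usually written as P = 1/2 · V² · f · Σ αᵢ·Cᵢ, summed over the nets of the design. The sketch below evaluates that textbook expression. It is not the paper's prediction model: the function name, the choice to express activity as transitions per clock cycle, and the example values are all assumptions for illustration.

```python
def dynamic_power(nets, vdd, clock_hz):
    """Estimate dynamic power in watts from per-net activity and capacitance.

    nets:     iterable of (activity, capacitance_farads) pairs, where
              activity is the average number of transitions per clock
              cycle on that net (an assumed convention).
    vdd:      supply voltage in volts.
    clock_hz: clock frequency in hertz.
    """
    switched_cap = sum(activity * cap for activity, cap in nets)
    return 0.5 * vdd ** 2 * clock_hz * switched_cap


if __name__ == "__main__":
    # Two hypothetical nets: a busy short net and a quiet long net.
    example_nets = [(0.5, 1e-12), (0.1, 2e-12)]
    print(dynamic_power(example_nets, vdd=1.0, clock_hz=100e6))
```

Because activity and capacitance enter as a product per net, an error in predicting either one scales the power estimate for that net directly, which is why the abstract treats both as prediction targets.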
Article
Noting that a common element in most causes of runtime failure is the extent of circuit activity, i.e. the rate at which its nodes are switching, the author proposes a measure of activity, called the transition density, which may be defined as the average switching rate at a circuit node. An algorithm is also presented to propagate density values from the primary inputs to internal and output nodes. To illustrate the practical significance of this work, it is shown how the density values at internal nodes can be used to study circuit reliability by estimating the average power and ground currents; the average power dissipation; the susceptibility to electromigration failures; and the extent of hot-electron degradation. The density propagation algorithm has been implemented in a prototype density simulator which is used to assess the validity and feasibility of the approach experimentally. The results show that the approach is very efficient, and makes possible the analysis of VLSI circuits
Article
This paper analyzes the dynamic power consumption in the fabric of Field Programmable Gate Arrays (FPGAs) by taking advantage of both simulation and measurement. Our target device is the Xilinx Virtex™-II family, which contains the most recent and largest programmable fabric. We identify important resources in the FPGA architecture and obtain their utilization, using a large set of real designs. Then, using a number of representative case studies we calculate the switching activity corresponding to each resource. Finally, we combine effective capacitance of each resource with its utilization and switching activity to estimate its share of power consumption. According to our results, the power dissipation share of routing, logic and clocking resources are 60%, 16%, and 14%, respectively. Also, we concluded that dynamic power dissipation of a Virtex-II CLB is 5.9 µW per MHz for typical designs, but it may vary significantly depending on the switching activity.
Article
Reliability assessment is an important part of the design process of digital integrated circuits. We observe that a common thread that runs through most causes of run-time failure is the extent of circuit activity, i.e., the rate at which its nodes are switching. We propose a new measure of activity, called the transition density, which may be defined as the "average switching rate" at a circuit node. Based on a stochastic model of logic signals, we rigorously define the transition density and present an algorithm to propagate it from the primary inputs to internal and output nodes. This algorithm may be thought of as a simulation of the circuit, and has been implemented in a prototype density simulator. We present some results of this implementation to verify the theoretical results and assess the feasibility of the approach. In order to obtain the same density information by traditional means, the circuit would need to be simulated for thousands of input transitions. Thus this approach is very efficient and makes possible the analysis of VLSI circuits, which are traditionally too big to simulate for long input sequences.
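The propagation step mentioned in this abstract is commonly stated as D(y) = Σᵢ P(∂y/∂xᵢ) · D(xᵢ), where ∂y/∂xᵢ is the Boolean difference of the output with respect to input xᵢ. The sketch below applies that rule to a single gate by enumerating its truth table. It assumes the gate inputs are statistically independent, and the function name and argument layout are hypothetical; it is not the prototype density simulator described above.

```python
from itertools import product


def propagate_density(func, probs, densities):
    """Transition density at a gate output, assuming independent inputs.

    func:      Boolean function taking a tuple of 0/1 values.
    probs:     probs[i] is the probability that input i is 1.
    densities: densities[i] is the transition density of input i.
    """
    n = len(probs)
    total = 0.0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # Probability that the Boolean difference dy/dx_i equals 1,
        # i.e. that toggling x_i alone toggles the output.
        p_sensitive = 0.0
        for bits in product((0, 1), repeat=n - 1):
            x = [0] * n
            weight = 1.0
            for j, b in zip(others, bits):
                x[j] = b
                weight *= probs[j] if b else 1.0 - probs[j]
            x[i] = 0
            y0 = func(tuple(x))
            x[i] = 1
            y1 = func(tuple(x))
            if y0 != y1:
                p_sensitive += weight
        total += p_sensitive * densities[i]
    return total


def and2(x):
    return x[0] & x[1]


def xor2(x):
    return x[0] ^ x[1]
```

For a 2-input AND gate the rule reduces to D(y) = P(b)·D(a) + P(a)·D(b), and for XOR the output is always sensitive to each input, so the densities simply add.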
The power of FPGA architectures
  • T Tuan
  • S Trimberger