Article

Speed and area tradeoffs in cluster-based FPGA architectures


Abstract

One way to reduce the delay and area of field-programmable gate arrays (FPGAs) is to employ logic-cluster-based architectures, where a logic cluster is a group of logic elements connected with high-speed local interconnect. In this paper, we empirically evaluate FPGA architectures with logic clusters ranging in size from 1 to 20, and show that compared to architectures with size-1 clusters, architectures with size-8 clusters have 23% less delay (30% faster clock speed) and require 14% less area. We also show that FPGA architectures with large cluster sizes can significantly reduce design compile time, an increasingly important concern as the logic capacity of FPGAs rises. For example, an architecture that uses size-20 clusters requires seven times less compile time than an architecture with size-1 clusters.
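The compile-time benefit reported above comes largely from the smaller problem handed to the placer: packing N LUTs per cluster shrinks the netlist by roughly a factor of N. A minimal sketch of that effect; the circuit size and the full-packing assumption are hypothetical, not taken from the paper:

```python
import math

def blocks_to_place(num_luts: int, cluster_size: int) -> int:
    """Logic blocks a placer must handle, assuming every cluster packs full."""
    return math.ceil(num_luts / cluster_size)

if __name__ == "__main__":
    luts = 10_000  # hypothetical circuit size
    for n in (1, 4, 8, 20):
        # Larger clusters -> fewer movable blocks -> faster placement.
        print(f"N={n:2d}: {blocks_to_place(luts, n):6d} blocks to place")
```

The real runtime gain is larger than the block-count ratio alone suggests, since placement and routing effort typically grow superlinearly with netlist size.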


... Next, we mapped the netlist of LUTs to LBs of increasing size N using the timing-driven clustering algorithm T-VPack [86]. Additional results in the range of K = 2 to 7 can be found in Appendix C. ...
... We use a typical academic CAD flow similar to those used in previous LB architecture studies [5,86]. First, benchmark circuits are technology mapped to LUTs using Flowmap [40]. ...
... First, benchmark circuits are technology mapped to LUTs using Flowmap [40]. We then use a timing-driven packing algorithm T-VPack [86] to pack LUTs and registers into logic blocks. Placement and routing is performed using VPR 5.0 [82]. ...
... This flexibility in I/O selection can increase the area efficiency of global routing networks and lead to an increase in FPGA area efficiency [2]. Many FPGA architectural studies (for example, [1], [3]-[5]) are based on logic clusters with logically equivalent I/Os. Other logic cluster designs are compared against the logically equivalent ones [6]-[9]. ...
... Previous work achieves logic equivalency through fully connected local routing networks [1]-[5]. As shown in Fig. 2, the network connects an LUT input to each of the logic cluster inputs through an I:1 multiplexer, where I is the number of logic cluster inputs that the cluster contains. ...
... In this study, the area evaluation and circuit design methodology of [2] is used to more accurately evaluate the effect of multiplexer input reduction on logic cluster area. This methodology has been widely used in many previous FPGA architectural studies [1]- [3], [5], [9], [11]- [20]. The implementation area of a logic cluster is estimated based on active area. ...
Article
Full-text available
Mapping digital circuits onto field-programmable gate arrays (FPGAs) usually consists of two steps. First, circuits are mapped into look-up tables (LUTs). Then, LUTs are mapped onto physical resources. The configuration of LUTs is usually determined during the first step and remains unchanged throughout the second. In this paper, we demonstrate that by reconfiguring LUTs during the second step, one can increase the flexibility of FPGA routing resources. This increase in flexibility can then be used to reduce the implementation area of FPGAs. In particular, it is shown that, for a logic cluster with I inputs and N k-input LUTs, a set of N × k (I + N - k + 1):1 multiplexers can be used to connect logic cluster inputs to LUT inputs while maintaining logic equivalency among the logic cluster I/Os. The multiplexers (called a local routing network) are shown to be the minimum required to maintain logic equivalency. Compared to the previous design, which employs a fully connected local routing network, the proposed design can reduce logic cluster area by 3%-25% and can significantly reduce the fanout of logic cluster inputs.
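A rough sketch of the multiplexer sizes involved, under the common assumption (not stated in this abstract) that a fully connected local routing network gives each of the N*k LUT inputs an (I+N):1 multiplexer (I cluster inputs plus N feedback outputs), versus the (I+N-k+1):1 multiplexers of the proposed design. Total multiplexer inputs serve only as a crude area proxy:

```python
def mux_inputs_full(I: int, N: int, k: int) -> int:
    # Fully connected local routing: N*k muxes, each (I+N):1 (assumption).
    return N * k * (I + N)

def mux_inputs_reduced(I: int, N: int, k: int) -> int:
    # Proposed minimal network from the abstract: N*k muxes, each (I+N-k+1):1.
    return N * k * (I + N - k + 1)

if __name__ == "__main__":
    I, N, k = 18, 8, 4          # one common cluster configuration
    full = mux_inputs_full(I, N, k)
    red = mux_inputs_reduced(I, N, k)
    print(full, red, f"{100 * (full - red) / full:.1f}% fewer mux inputs")
```

The saving grows with k, which is consistent with the 3%-25% area-reduction range quoted above.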
... Architecture studies have been performed to investigate the impact of LB size on area, delay, and power. When optimizing the architecture for area [4,17,1] or power efficiency [20], an LB size in the range of N = 4 to 9 is shown to be best. Three competing trends lead to this optimal range. ...
... Now that CAD runtime is becoming a major concern (its relative importance to area, delay, and power is still debatable), we have a reason to revisit these topics. Although [17] observed that logic block size affects CAD runtime, none of these studies directly investigate the CAD runtime versus quality (area, delay, power) trade-offs afforded by logic block size. In addition, none of these studies investigate an LB size greater than N = 20. ...
... These effects have been observed experimentally [17] and expressed analytically [6]. Of course, increasing logic block size will increase the runtime of the CAD stages that map logic to these blocks. ...
Conference Paper
Long FPGA CAD runtime has emerged as a limitation to the future scaling of FPGA densities. Already, compile times on the order of a day are common, and the situation will only get worse as FPGAs get larger. Without a concerted effort to reduce compile times, further scaling of FPGAs will eventually become impractical. Previous works have presented fast CAD tools that trade off quality of result for compile time. In this paper, we take a different but complementary approach. We show that the architecture of the FPGA itself can be designed to be amenable to fast compilation. If not done carefully, this can lead to lower-quality mapping results, so a careful trade-off between area, delay, power, and compile runtime is essential. We investigate the extent to which runtime can be reduced by employing high-capacity logic blocks. We extend previous studies on logic block architectures by quantifying the area, delay, and CAD runtime trade-offs for large-capacity blocks, and also investigate some multi-level logic block architectures. In addition, we present an analytically derived equation to guide the design of logic block I/O requirements.
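The I/O equation mentioned in this abstract is not reproduced in the snippet. For context, a widely cited empirical rule from the clustering literature (due to Ahmed and Rose) sizes the input count of a cluster of N k-input LUTs as I = (k/2)(N+1); the sketch below uses that rule, which may differ from the equation actually derived in this paper:

```python
def cluster_inputs(k: int, n: int) -> int:
    """Inputs needed for near-full cluster utilization: I = (k/2)*(N+1).

    This is the Ahmed/Rose empirical rule, used here only for illustration.
    """
    return (k * (n + 1)) // 2

if __name__ == "__main__":
    for n in (1, 4, 8, 10, 20):
        print(f"N={n:2d}, k=4 -> I={cluster_inputs(4, n)}")
```

Note that I grows much more slowly than the N*k total LUT inputs, which is what makes large clusters viable from a pin-count perspective.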
... Each output of the network can be connected to any input of the network. The topology is widely used in many subsequent FPGA studies including [Marq00b]. The topology also has the advantage of reducing the complexity of the packing tools [Betz99a] since any network input can be connected to any LUT input. ...
... conventional cluster-based FPGA architectures described in [Betz97b][Betz98][Betz99a][Betz99b][Marq00b]. These architectural choices are described and justified in much more detail in Chapter 8. ...
... Cluster size also influences the CAD tools: the compile time of a design increases with small clusters [8,76]. ...
... In commercial devices it is common to see large clusters, composed of between 8 and 10 logic elements. Several studies have examined the effect of cluster size on the area and speed of an FPGA, as well as its interaction with LUT size; the results show that the optimal values lie between 4 and 10 [8,76]. ...
... In order to improve FPGA performance, quite a few new designs have been proposed over time [1,6]. A great deal of effort has gone into designing such new FPGA architectures, mostly for datapath applications. ...
... Modern FPGA architectures have a hierarchical structure to improve area and delay [4]. On each hierarchical level a number of equivalent blocks are available. ...
Article
Generating a configuration for an FPGA starting from a high level description of a design is a time consuming task. The resulting configuration should have a high quality so that the FPGA resources are used in an efficient way while being able to run at high clock frequencies and having a low power consumption. In this work we present MultiPart, a new hierarchical packing algorithm that obtains better quality and faster runtimes when compared to the frequently used AAPack packer in VPR. MultiPart combines the benefits of partitioning-based and seed-based packing approaches. It tries to preserve the design hierarchy during packing. This results in a gain of 32% in total wirelength and a gain of 10% in critical path delay. The partitioning-based methodology allows us to exploit multithreading, leading to 9.3x faster packing runtimes on a CPU with 10 cores. We also gain in the total routing runtime because MultiPart reduces congestion problems on a higher level. The subcircuits in the partitioned circuit are clustered with a seed-based packer. This allows MultiPart to deal with the constraints of complex heterogeneous architectures. In short, MultiPart targets heterogeneous commercial FPGAs with a lower runtime while increasing the quality of the configuration. The source code of MultiPart is available in our FPGA CAD framework on Github.
... The early FPGA architectures have a flat architecture where the functional blocks contain only one programmable lookup table and a flipflop. The functional blocks in modern architectures have a hierarchical structure to improve area and delay [96]. On each hierarchical level a number of equivalent blocks are available which are connected by a local interconnection network. ...
Thesis
Full-text available
Field-Programmable Gate Arrays (FPGAs) are programmable integrated circuits, which are used to accelerate applications. An FPGA configuration is compiled by specialized software. Compilation is typically split up into the following steps: synthesis, technology mapping, packing, placement, and routing. Once the configuration is compiled, the designer can check whether the configuration meets the application constraints. In case the constraints are not met, the designer has to modify the high-level description of the design and recompile. This slow process is called the FPGA design cycle. To shorten the design cycle we introduce new, faster packing, placement, and routing techniques.
... One of the most widely known academic seed-based clustering tools is VPack and its timing-driven version T-VPack [26,27]. In addition to trying to pack CLBs to capacity, T-VPack also accounts for the timing performance of the circuit by attempting to absorb time-critical netlist connections. To choose which BLEs to add to the CLB, a gain value is calculated for every BLE that shares an edge with the current CLB, using a cost function. ...
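A greatly simplified sketch of seed-based packing in the spirit described above. The gain function here, the number of nets shared with the cluster, is a hypothetical stand-in for T-VPack's timing-weighted attraction function:

```python
def pack_cluster(seed, bles, nets_of, capacity):
    """Greedy seed-based packing sketch.

    bles: iterable of BLE ids; nets_of: BLE id -> set of net ids.
    Repeatedly absorbs the candidate BLE sharing the most nets with the
    growing cluster until the cluster is full or nothing connected remains.
    """
    cluster = [seed]
    cluster_nets = set(nets_of[seed])
    candidates = set(bles) - {seed}
    while len(cluster) < capacity and candidates:
        # Gain = nets shared with the cluster (stand-in cost function).
        best = max(candidates, key=lambda b: len(nets_of[b] & cluster_nets))
        if not nets_of[best] & cluster_nets:
            break  # no remaining candidate connects to this cluster
        cluster.append(best)
        cluster_nets |= nets_of[best]
        candidates.remove(best)
    return cluster

if __name__ == "__main__":
    nets = {0: {"a", "b"}, 1: {"b", "c"}, 2: {"c", "d"}, 3: {"x"}}
    print(pack_cluster(0, nets, nets, capacity=3))
```

Absorbing shared nets into a cluster converts them into fast local connections, which is exactly the delay advantage of clustering measured in the main paper.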
... FPGAs are designed to emulate a broad range of logic circuits because they are primarily targeted at ASIC prototyping and replacement. Consequently, they use very fine-grain reconfigurable elements such as lookup tables (LUTs) [6,17]. The LUTs are chained into larger operations using flexible but expensive on-chip networks. ...
Conference Paper
Full-text available
In this paper, we present triggered instructions, a novel control paradigm for arrays of processing elements (PEs) aimed at exploiting spatial parallelism. Triggered instructions completely eliminate the program counter and allow programs to transition concisely between states without explicit branch instructions. They also allow efficient reactivity to inter-PE communication traffic. The approach provides a unified mechanism to avoid over-serialized execution, essentially achieving the effect of techniques such as dynamic instruction reordering and multithreading, which each require distinct hardware mechanisms in a traditional sequential architecture. Our analysis shows that a triggered-instruction based spatial accelerator can achieve 8X greater area-normalized performance than a traditional general-purpose processor. Further analysis shows that triggered control reduces the number of static and dynamic instructions in the critical paths by 62% and 64% respectively over a program-counter style spatial baseline, resulting in a speedup of 2.0X.
... Field-programmable gate arrays (FPGAs) are among the most well-known logic-grained spatial architectures and are most commonly used for ASIC prototyping and as stand-alone general logic accelerators. FPGAs are designed to emulate a broad range of logic circuits and use very fine-grained lookup tables (LUTs) as their primary unit of computation [4], [5]. More complex logical operations are constructed by connecting many LUTs together using the on-chip network. However, a common observation is that many algorithms primarily utilize byte- or word-level primitive operations, which are not efficiently mapped to bit-level logic and logic-grained spatial architectures such as FPGAs. ...
... These studies have shown that lookup tables with five or six inputs result in circuits with the best performance, while LUTs with three or four inputs are still reasonable. Another study analyzed the impact of cluster size and the number of single-output LUTs within a CLB on the speed and area of various circuits [Marquardt et al. 2000]. Their findings indicate that cluster sizes of 3 to 20 LUTs were feasible, and a cluster size of 8 produced the best tradeoff between area and the delay of the final circuits. ...
Article
We describe a new processing architecture, known as a warp processor, that utilizes a field-programmable gate array (FPGA) to improve the speed and energy consumption of a software binary executing on a microprocessor. Unlike previous approaches that also improve software using an FPGA but do so using a special compiler, a warp processor achieves these improvements completely transparently and operates from a standard binary. A warp processor dynamically detects the binary's critical regions, reimplements those regions as a custom hardware circuit in the FPGA, and replaces the software region by a call to the new hardware implementation of that region. While not all benchmarks can be improved using warp processing, many can, and the improvements are dramatically better than those achievable by more traditional architecture improvements. The hardest part of warp processing is that of dynamically reimplementing code regions on an FPGA, requiring partitioning, decompilation, synthesis, placement, and routing tools, all having to execute with minimal computation time and data memory so as to coexist on chip with the main processor. We describe the results of developing our warp processor. We developed a custom FPGA fabric specifically designed to enable lean place and route tools, and we developed extremely fast and efficient versions of partitioning, decompilation, synthesis, technology mapping, placement, and routing. Warp processors achieve overall application speedups of 6.3X with energy savings of 66% across a set of embedded benchmark applications. We further show that our tools utilize acceptably small amounts of computation and memory which are far less than traditional tools. Our work illustrates the feasibility and potential of warp processing, and we can foresee the possibility of warp processing becoming a feature in a variety of computing domains, including desktop, server, and embedded applications.
... We begin with the specification of the system-level partitioning functions, detailing the selected design quality attributes for the HW/SW co-design aimed at defining the computational tasks that can be implemented in the dual super-systolic core form, namely: hardware area (ha), hardware execution time (ht), software execution time (St), and the selected system resolution (n), where maxha, maxht, and maxSt represent the upper bounds of these constraints. In particular, for implementing the fixed-point RS estimator operations of Equation (1), the partitioning process must satisfy the following performance requirements [25]. ...
Article
Full-text available
A high-speed dual super-systolic core for reconstructive signal processing (SP) operations consists of a double parallel systolic array (SA) machine in which each processing element of the array is also conceptualized as another SA in a bit-level fashion. In this study, we addressed the design of a high-speed dual super-systolic array (SSA) core for the enhancement/reconstruction of remote sensing (RS) imaging of radar/synthetic aperture radar (SAR) sensor systems. The selected reconstructive SP algorithms are efficiently transformed in their parallel representation and then, they are mapped into an efficient high performance embedded computing (HPEC) architecture in reconfigurable Xilinx field programmable gate array (FPGA) platforms. As an implementation test case, the proposed approach was aggregated in a HW/SW co-design scheme in order to solve the nonlinear ill-posed inverse problem of nonparametric estimation of the power spatial spectrum pattern (SSP) from a remotely sensed scene. We show how such dual SSA core, drastically reduces the computational load of complex RS regularization techniques achieving the required real-time operational mode.
... K (LUT size) is set to 4 since it has been shown to be one of the most efficient LUT sizes [28], [29] and is also used in many commercial FPGAs [14], [15]. N and I are set to 4 and 10, respectively, since this combination was shown to be one of the most efficient by [27] and is used in many previous FPGA studies [32]. The settings of N and I are discussed in more detail later in the section. ...
Conference Paper
Full-text available
As the logic capacity of Field-Programmable Gate Arrays (FPGAs) increases, they are being increasingly used to implement large arithmetic-intensive applications, which often contain a large proportion of datapath circuits. Since datapath circuits usually consist of regularly structured components (called bit-slices) which are connected together by regularly structured signals (called buses), it is possible to utilize datapath regularity in order to achieve significant area savings through FPGA architectural innovations. This paper describes such an FPGA routing architecture, called the multi-bit routing architecture, which employs bus-based connections in order to exploit datapath regularity. It is experimentally shown that, compared to conventional FPGA routing architectures, the multi-bit routing architecture can achieve 14% routing area reduction for implementing datapath circuits, which represents an overall FPGA area savings of 10%. This paper also empirically determines the best values of several important architectural parameters for the new routing architecture, including the most area-efficient granularity values and the most area-efficient proportion of bus-based connections.
... Prior research [Betz and Rose 1997a, 1997b; Betz et al. 1999; Marquardt et al. 2000] on clustered FPGA architectures has focused mainly on the area-delay trade-off stemming from the size and structure of the clusters. Betz et al. proposed a packing/clustering algorithm, VPack, for hierarchical FPGAs. ...
Conference Paper
Full-text available
We present a routability-driven bottom-up clustering technique for area and power reduction in clustered FPGAs. This technique uses a cell connectivity metric to identify seeds for efficient clustering. Effective seed selection, coupled with an interconnect-resource aware clustering and placement, can have a favorable impact on circuit routability. It leads to better device utilization, savings in area, and reduction in power consumption. Routing area reduction of 35% is achieved over previously published results. Power dissipation simulations using a buffered pass-transistor-based FPGA interconnect model are presented. They show that our clustering technique can reduce the overall device power dissipation by an average of 13%.
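A minimal sketch of connectivity-based seed selection in the spirit of the abstract above (the paper's actual metric is not reproduced): rank cells by how many distinct nets they touch, and start clusters from the most connected cells, on the intuition that absorbing their nets early improves routability:

```python
def pick_seeds(nets_of, num_seeds):
    """nets_of: cell id -> set of net ids; returns the most connected cells.

    The connectivity metric here (distinct incident nets) is a simplified
    stand-in for the cell connectivity metric described in the paper.
    """
    ranked = sorted(nets_of, key=lambda c: len(nets_of[c]), reverse=True)
    return ranked[:num_seeds]

if __name__ == "__main__":
    nets = {"c0": {"n1"}, "c1": {"n1", "n2", "n3"}, "c2": {"n2", "n4"}}
    print(pick_seeds(nets, 2))
```

Starting from highly connected seeds tends to internalize more nets per cluster, which reduces demand on the global routing fabric.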
... For example, it is not clear which techniques are useful for area-constrained designs or what the performance impact of these techniques is when no area increase is permitted. An additional dimension for trade-offs that was not explored in this work was the time required to implement (synthesize, place, and route) designs on an FPGA. This time, typically referred to as compile time, can be significantly impacted by architectural changes such as altering the cluster size [164] or the number of routing resources. This could enable interesting trade-offs, as area savings could be made at the expense of increased compile time, but future research is needed to determine if any of these trade-offs are viable. ...
Conference Paper
This paper presents experimental measurements of the differences between a 90nm CMOS FPGA and 90nm CMOS standard cell ASICs in terms of logic density, circuit speed, and power consumption. We are motivated to make these measurements to enable system designers to make better informed choices between these two media and to give insight to FPGA makers on the deficiencies to attack and thereby improve FPGAs. In the paper, we describe the methodology by which the measurements were obtained and we show that, for circuits containing only combinational logic and flip-flops, the ratio of silicon area required to implement them in FPGAs and ASICs is on average 40. Modern FPGAs also contain "hard" blocks such as multiplier/accumulators and block memories, and we find that these blocks reduce this average area gap significantly, to as little as 21. The ratio of critical path delay, from FPGA to ASIC, is roughly 3 to 4, with less influence from block memory and hard multipliers. The dynamic power consumption ratio is approximately 12 times and, with hard blocks, this gap generally becomes smaller.
Article
Photovoltaic devices capable of reversible photovoltaic polarity through external signal modulation may enable multifunctional optoelectronic systems. However, such devices are limited to those induced by gate voltage, electrical poling, or optical wavelength by using complicated device architectures. Here, we show that the photovoltaic polarity is also switchable with the intensity of incident light. The modulation in light intensity induces photovoltaic polarity switching in geometrically asymmetric MoS2 Schottky photodiodes, explained by the asymmetric lowering of the Schottky barrier heights due to the trapping of photogenerated holes at the MoS2/Cr interface states. An applied gate voltage can further modulate the carrier concentration in the MoS2 channel, providing a method to tune the threshold light intensity of polarity switching. Finally, a bidirectional optoelectronic logic gate with “AND” and “OR” functions was demonstrated within a single device.
Chapter
At FPGA chip design stage, area analysis/estimation is essential, just like ASIC chips. State-of-the-art FPGA area estimation techniques will be discussed in this chapter. However, once the FPGA is manufactured, the chip area is fixed, and the “area problem” turns into resource utilization analysis, which can be accurately reported after implementation at FPGA application design stage.
Article
A new area model for estimating the layout area of switch blocks is introduced in this work. The model is based on a realistic layout strategy. As a result, it not only takes into consideration the active area that is needed to construct a switch block but also the number of metal layers available and the actual dimensions of these metals. The model assigns metal layers to the routing tracks in a way that reduces the number of vias that are needed to connect different routing tracks together while maintaining the tile-based structure of FPGAs. It also takes into account the wiring area required for buffer insertion for long wire segments. The model is evaluated based on the layouts constructed in ASAP7 FinFET 7nm Predictive Design Kit. We found that the new model, while specific to the layout strategy that it employs, improves upon the traditional active-based area estimation models by considering the growth of the metal area independently from the growth of the active area. As a result, the new model is able to more accurately estimate layout area by predicting when metal area will overtake active area as the number of routing tracks is increased. This ability allows the more accurate estimation of the true layout cost of FPGA fabrics at the early floor planning and architectural exploration stage; and this increase in accuracy can encourage a wider use of custom FPGA fabrics that target specific sets of benchmarks in future SOC designs. Furthermore, our data indicate that the conclusions drawn from several significant prior architectural studies remain to be correct under FinFET geometries and wiring area considerations despite their exclusive use of active-only area models. This correctness is due to the small channel widths, around 30-60 tracks per channel, of the architectures that these studies investigate. 
For architectures that approach the channel width of modern commercial FPGAs, with one to two hundred tracks per channel, our data show that wiring area models justified by detailed layout considerations are an essential addition to active area models for correctly predicting the implementation area of FPGAs.
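The core idea above, tracking active area and metal wiring area separately and letting the larger of the two set the tile area, can be sketched as follows. All constants are hypothetical, chosen only so the crossover lands near the channel widths discussed above; they are not the paper's fitted values:

```python
def tile_area(W, active_per_track=2.0, metal_pitch=0.14):
    """Crude switch-block tile area model (illustrative units only).

    Active area grows roughly linearly with channel width W, while the
    wiring area of a W x W track crossing at a fixed metal pitch grows
    quadratically, so wiring overtakes active area at large W.
    """
    active = active_per_track * W
    wiring = (metal_pitch * W) ** 2
    return max(active, wiring), active, wiring

if __name__ == "__main__":
    for W in (30, 60, 200):
        area, active, wiring = tile_area(W)
        limiter = "wiring" if wiring > active else "active"
        print(f"W={W:3d}: tile area {area:8.1f} ({limiter}-limited)")
```

With these illustrative constants the model is active-limited at W = 30-60 (matching the small channel widths of the earlier studies) and wiring-limited by W = 200, which is the regime where active-only models break down.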
Article
Full-text available
The hardware implementation of advanced artificial intelligence (AI) technology based on complex deep learning and machine learning algorithms is constricted by the limitation of conventional Von–Neuman architecture. Emerging neuromorphic computing architecture based on the human brain with in‐memory computing capability could instigate unprecedented breakthroughs in AI technology. In this pursuit, 2D MoS2 optoelectronic artificial synapse imitating complex biological neuromorphic behavior such as short/long‐term memory, paired‐pulse facilitation, and long‐term depression‐potentiation is proposed and demonstrated. Furthermore, the broadband sensitivity of the device can be utilized to emulate Pavlov's classical conditioning for associative learning of the biological brain. More importantly, reconfigurable Boolean AND and OR logic gate operation is demonstrated within the same device by synergistically modulating the device conductance via the persistent photoconductivity and electrical gate stress. The linear response of the photocurrent to the optical stimulus can perform arithmetic operations such as counting, addition, and subtraction within a single device. This novel integration of memory, synaptic behavior, and processing within a single monolayer MoS2 device is believed to put forth a new horizon for the Non‐Von–Neuman type in‐memory computing architecture for highly advanced AI applications based on 2D materials.
Article
Switch block flexibility is an important design metric in FPGA architectural research. A switch block with high flexibility can provide better routability for implementing digital applications, which can lead to better performance and logic density. Flexibility, however, does come at the cost of an increase in the layout area of switch blocks, which can lower performance and logic density. Consequently, it is important to accurately estimate the layout area of increasing switch block flexibility in order to better design FPGA architectures. This work focuses on the popular disjoint switch block design and evaluates the accuracy of the traditional active-based area estimation models based on realistic layouts. We found that the current active-based area models are inaccurate in estimating the true area cost of increasing switch block flexibility. This inaccuracy is due to the inability of the current models to consider the full effect of routing track count and wire segment length, two important flexibility parameters, on the layout area of switch blocks. We found that by including wiring area into the estimation of FPGA switch block layout area, one can significantly improve the accuracy of predicting the true area cost of these flexibility parameters. These results allow FPGA architects to better optimize future FPGA designs for routability and consequently for performance and logic density.
Article
Full-text available
As one of the core components of electronic hardware systems, Field-Programmable Gate Array (FPGA) design technology continues to advance under the guidance of electronic information technology policy and has made a huge contribution to information technology applications. However, with the advancement of chip technology and the continuous upgrading of information technology, the functions that FPGAs must perform are increasingly complicated. How to efficiently perform layout design and make full use of chip resources has become an important problem to be solved and optimized in FPGA design. An FPGA itself is not limited to a specific function: it contains internal resources such as memory, protocol modules, clock modules, high-speed interface modules, and digital signal processing, and can be programmed through logic modules such as programmable logic unit modules and interconnect. Blank FPGA devices are thereby turned into high-performance system applications with complex functions. Placement and routing technology based on clustered logic blocks can combine the above resources to give full play to their performance advantages, and its importance is self-evident. Starting from traditional FPGA implementation, this paper analyzes several advantages of cluster-based logic block placement and routing technology, and summarizes the corresponding design method and flow.
Chapter
One of the major parts of any power converter system is a high-speed implementation of the PWM algorithm for high-power conversion. It must satisfy the requirements of both the power converter hardware topology and the computing power necessary for control algorithm implementation. The emergence of multi-million-gate FPGAs with large on-chip RAMs and processor cores sets a new trend in FPGA design, and FPGAs are increasingly used to generate PWM in power electronics. Of late, more and more large, complex designs are realized using FPGAs because of lower NRE cost and shorter development time. The share of Programmable Logic Devices (PLDs), especially FPGAs, in the semiconductor logic market is growing tremendously year on year. This calls for increased controllability of designs, in terms of meeting both area and timing performance, to truly derive the perceived benefits. Recent strides in FPGA technology favour the realization of large high-speed designs, previously possible only in an ASIC, in an FPGA. However, the routing delay remains unpredictable, and its dominance over logic delay in today's FPGAs impedes the goal of early timing convergence. This paper introduces techniques for controlling design area and timing from the architecture stage onward; the techniques can be adopted for any FPGA-based design application, including high-power conversion. The paper also describes the trade-offs between area, speed, and power of the optimization techniques.
Article
This work provides an evaluation on the accuracy of the minimum-width transistor area models in ranking the actual layout area of FPGA architectures. Both the original VPR area model and the new COFFE area model are compared against the actual layouts with up to three metal layers for the various FPGA building blocks. We found that both models have significant variations with respect to the accuracy of their predictions across the building blocks. In particular, the original VPR model overestimates the layout area of larger buffers, full adders, and multiplexers by as much as 38%, while they underestimate the layout area of smaller buffers and multiplexers by as much as 58%, for an overall prediction error variation of 96%. The newer COFFE model also significantly overestimates the layout area of full adders by 13% and underestimates the layout area of multiplexers by a maximum of 60% for a prediction error variation of 73%. Such variations are particularly significant considering sensitivity analyses are not routinely performed in FPGA architectural studies. Our results suggest that such analyses are extremely important in studies that employ the minimum-width area models so the tolerance of the architectural conclusions against the prediction error variations can be quantified. Furthermore, an open-source version of the layouts of the actual FPGA building blocks should be created so their actual layout area can be used to achieve a highly accurate ranking of the implementation area of FPGA architectures built upon these layouts.
Article
Integrating reconfigurable fabrics in SOCs requires an accurate estimation of the layout area of the reconfigurable fabrics in order to properly optimize the architectural-level design of the fabrics and accommodate early floor-planning. This work examines the accuracy of using minimum width transistor area, a widely-used area model in many previous FPGA architectural studies, in accurately predicting layout area. In particular, the layout areas of LUT multiplexers are used as a case study. We found that compared to the minimum width transistor area, the traditional metal area based stick diagrams can provide much more accurate layout area estimations. In particular, minimum width transistor area can underestimate the layout area of LUT multiplexers by as much as a factor of 2-3 while stick diagrams can achieve over 90 percent accuracy in layout area estimation while remaining IC-process independent.
Article
Improving routability can reduce interconnect length in an FPGA. In this paper, a new FPGA packing algorithm is proposed. Based on an analysis of how packing affects wire absorption and port occupancy, a routability-driven cost function is defined according to node distribution. Path delay requirements are ensured by absorbing critical paths, and area efficiency is improved by hill climbing. Experimental results are given to show the efficiency of the proposed algorithm.
Conference Paper
An approach to estimating the performance of FPGA architectures is proposed, based on a semi-supervised model tree algorithm. The proposed approach avoids synthesizing, mapping, packing, placing, and routing, the essential steps in a traditional flow for obtaining FPGA performance. It is therefore time-efficient, while the predicted performance remains quite close to the result obtained through the traditional method (a tool flow called VTR). This can be utilized effectively during the early FPGA design stage to choose an optimal architecture under a given metric. Comparisons are made between the performance obtained by the proposed approach and by VTR on a commercial 40 nm technology. Results show that the proposed approach has an MRE below 7.62% compared to VTR and reduces time cost by a factor of thousands when utilized in architecture design space exploration.
Book
Field-Programmable Gate Arrays (FPGAs) have become the dominant digital implementation medium as measured by design starts. They are preferred because designers can avoid the pitfalls of nanoelectronic design and because the designer can change the design up until the last minute. However, it has always been understood that FPGAs use more area, are slower, and consume far more power than the alternative: Application-Specific ICs built from standard cells. But how much? Quantifying and Exploring the Gap Between FPGAs and ASICs is the first book to explore exactly what that difference is, to enable system designers to make better informed choices between these two media and to give insight to FPGA makers on the deficiencies to attack and thereby improve FPGAs. The gap is a very nuanced thing, though: it strongly depends on the nature of the circuit being implemented, in sometimes counterintuitive ways. The book presents a careful exploration of these issues in its first half. The second half of the book looks at ways that creators and users of FPGAs can close the gap between FPGAs and ASICs. It presents the most sweeping exploration of FPGA architecture and circuit design ever performed. The authors show that, with careful use of transistor-level design combined with good choices of the soft-logic architecture, a wide spectrum of FPGA devices can be used to narrow specific selected gaps in area, speed, and power.
Article
Integrating reconfigurable fabrics in SoCs requires an accurate estimate of the fabrics' layout area in order to properly accommodate early floor-planning. This work examines the accuracy of minimum-width transistor area, a widely used area model in many previous FPGA architectural studies, in assisting the floor-planning process. In particular, the layout areas of LUT multiplexers are used as a case study. We found that, compared to minimum-width transistor area, traditional metal-area-based stick diagrams provide much more accurate layout area estimates. In particular, minimum-width transistor area can underestimate the layout area of LUT multiplexers by as much as a factor of 3-4, while stick diagrams achieve over 80 percent accuracy in layout area estimation.
Article
A circuit-characteristics-driven semi-supervised modelling approach is proposed for FPGA architecture design space exploration. By including circuit characteristics as input, the proposed approach can accurately estimate the performance of a specific circuit on a given architecture. Experimental results illustrate that the approach estimates area with a Mean Relative Error (MRE) of up to 6.25% and delay of up to 4.23%, which is comparable to the Semi-supervised Model Tree (SMT) approach. Meanwhile, the proposed approach speeds up the modelling process: compared to the SMT approach, it reduces the time cost from 500 h to 250 h when exploring a design space containing millions of architectures on an Intel Xeon E7-4807 platform.
Article
There has been recent interest in exploring the acceleration of nonvectorizable workloads with spatially programmed architectures that are designed to efficiently exploit pipeline parallelism. Such an architecture faces two main problems: how to efficiently control each processing element (PE) in the system, and how to facilitate inter-PE communication without the overheads of traditional shared-memory coherent memory. In this article, we explore solving these problems using triggered instructions and latency-insensitive channels. Triggered instructions completely eliminate the program counter (PC) and allow programs to transition concisely between states without explicit branch instructions. Latency-insensitive channels allow efficient communication of inter-PE control information while simultaneously enabling flexible code placement and improving tolerance for variable events such as cache accesses. Together, these approaches provide a unified mechanism to avoid overserialized execution, essentially achieving the effect of techniques such as dynamic instruction reordering and multithreading. Our analysis shows that a spatial accelerator using triggered instructions and latency-insensitive channels can achieve 8 × greater area-normalized performance than a traditional general-purpose processor. Further analysis shows that triggered control reduces the number of static and dynamic instructions in the critical paths by 62% and 64%, respectively, over a PC-style baseline, increasing the performance of the spatial programming approach by 2.0 ×.
Article
The thesis work was conducted in the French industrial PhD framework CIFRE, between the eFPGA startup Menta and the LIRMM lab of the University of Montpellier, in 2007-2011. Besides detailed architectural explorations of eFPGAs, it also covers a unique industrial survey highlighting changing trends and an analysis of the reasons behind several failed attempts in reconfigurable computing in academia and industry, which is highly interesting for researchers and start-up companies.
Article
In this paper, we present triggered instructions, a novel control paradigm for arrays of processing elements (PEs) aimed at exploiting spatial parallelism. Triggered instructions completely eliminate the program counter and allow programs to transition concisely between states without explicit branch instructions. They also allow efficient reactivity to inter-PE communication traffic. The approach provides a unified mechanism to avoid over-serialized execution, essentially achieving the effect of techniques such as dynamic instruction reordering and multithreading, which each require distinct hardware mechanisms in a traditional sequential architecture. Our analysis shows that a triggered-instruction based spatial accelerator can achieve 8X greater area-normalized performance than a traditional general-purpose processor. Further analysis shows that triggered control reduces the number of static and dynamic instructions in the critical paths by 62% and 64% respectively over a program-counter style spatial baseline, resulting in a speedup of 2.0X.
Conference Paper
In this paper, we demonstrate the ability of spatial architectures to significantly improve both runtime performance and energy efficiency on edit distance, a broadly used dynamic programming algorithm. Spatial architectures are an emerging class of application accelerators that consist of a network of many small and efficient processing elements that can be exploited by a large domain of applications. In this paper, we utilize the dataflow characteristics and inherent pipeline parallelism within the edit distance algorithm to develop efficient and scalable implementations on a previously proposed spatial accelerator. We evaluate our edit distance implementations using a cycle-accurate performance and physical design model of a previously proposed triggered instruction-based spatial architecture in order to compare against real performance and power measurements on an x86 processor. We show that when chip area is normalized between the two platforms, it is possible to get more than a 50× runtime performance improvement and over 100× reduction in energy consumption compared to an optimized and vectorized x86 implementation. This dramatic improvement comes from leveraging the massive parallelism available in spatial architectures and from the dramatic reduction of expensive memory accesses through conversion to relatively inexpensive local communication.
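The recurrence being accelerated is the classic edit-distance dynamic program. The sketch below is plain Python, not the accelerator implementation; it makes the pipeline parallelism the abstract exploits explicit by sweeping anti-diagonals, on which all cells are mutually independent and could be computed by separate processing elements.

```python
def edit_distance(a, b):
    """Classic edit-distance DP, evaluated in anti-diagonal (wavefront)
    order: all cells (i, j) with i + j == d are independent of each other."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all of a[:i]
    for j in range(n + 1):
        dp[0][j] = j  # insert all of b[:j]
    for d in range(2, m + n + 1):
        for i in range(max(1, d - n), min(m, d - 1) + 1):
            j = d - i
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,       # deletion
                           dp[i][j - 1] + 1,       # insertion
                           dp[i - 1][j - 1] + cost)  # match/substitute
    return dp[m][n]
```

On a spatial architecture each anti-diagonal's cells map onto a wavefront of PEs communicating only with their neighbors, which is the source of the reported speedups.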
Article
In FPGA CAD flow, the clustering stage builds the foundation for placement and routing stages and affects performance parameters, such as routability, delay, and channel width significantly. Net sharing and criticality are the two most commonly used factors in clustering cost functions. With this study, we first derive a third term, net-length factor, and then design a generic method for integrating net length into the clustering algorithms. Net-length factor enables characterizing the nets based on the routing stress they might cause during later stages of the CAD flow and is essential for enhancing the routability of the design. We evaluate the effectiveness of integrating net length as a factor into the well-known timing (T-VPack)-, depopulation (T-NDPack)-, and routability (iRAC and T-RPack)-driven clustering algorithms. Through exhaustive experimental studies, we show that net-length factor consistently helps improve the channel-width performance of routability-, depopulation-, and timing-driven clustering algorithms that do not explicitly target low fan-out nets in their cost functions. Particularly, net-length factor leads to average reduction in channel width for T-VPack, T-RPack, and T-NDPack by 11.6%, 10.8%, and 14.2%, respectively, and in a majority of the cases, improves the critical-path delay without increasing the array size.
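The kind of combined clustering cost the abstract describes can be sketched as follows. The names, weights, and the exact way the net-length penalty enters the score are illustrative assumptions, not T-VPack's or the authors' actual formula.

```python
def attraction(block, cluster, crit, nets_of, net_len, alpha=0.75, beta=0.1):
    """Hypothetical clustering attraction: weighted timing criticality,
    plus the number of nets shared with the cluster, minus a penalty for
    long nets the candidate block touches (routing-stress proxy)."""
    cluster_nets = {n for b in cluster for n in nets_of[b]}
    sharing = len(nets_of[block] & cluster_nets)
    length_penalty = sum(net_len[n] for n in nets_of[block])
    return alpha * crit[block] + (1 - alpha) * sharing - beta * length_penalty
```

A packer would greedily add the unclustered block with the highest attraction until the cluster's LUT or input budget is exhausted; the net-length term steers it away from absorbing nets that would stress the router later.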
Conference Paper
In this paper, the effect of LUT size on FPGA area and delay, given recent progress in semiconductor technology, is investigated. An optimized routing area and delay model for FPGA architectures in nanometer processes is proposed. The proposed method is more accurate than previous modelling because it accounts for the different spacings of nanometer processes. With the improved model, we determine the best LUT size in terms of FPGA area and delay using a CAD flow comprising ABC, HSPICE, T-VPack, and VPR. The experimental results show that the 6-LUT provides the best area-delay product for a nanometer FPGA.
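The selection criterion reduces to an argmin over the area-delay product across the swept LUT sizes. The measurements used below are made-up illustrative numbers, not the paper's data.

```python
def best_lut_size(results):
    """Given {lut_size: (area, delay)}, return the LUT size with the
    smallest area-delay product."""
    return min(results, key=lambda k: results[k][0] * results[k][1])
```

For example, with `{4: (100, 12.0), 5: (130, 9.0), 6: (170, 6.0), 7: (260, 5.5)}` (area in arbitrary units, delay in ns), the 6-LUT wins, mirroring the paper's conclusion for its actual measured data.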
Article
Field Programmable Gate Arrays (FPGAs) are increasingly being used to implement large datapath-oriented applications that process multiple-bit-wide data. Studies have shown that the regularity of these multi-bit signals can be effectively exploited to reduce the implementation area of datapath circuits on FPGAs that employ traditional bidirectional routing. Most modern FPGAs, however, employ unidirectional routing tracks, which are more area- and delay-efficient. No study has investigated the design of multi-bit routing architectures that effectively transport multiple-bit-wide signals over unidirectional routing tracks. This paper presents such an investigation of architectures that employ multi-bit connections and unidirectional routing resources to exploit datapath regularity. It is experimentally shown that unidirectional multi-bit routing architectures are 8.6% more area-efficient than the conventional routing architecture. The paper also determines the most area-efficient proportion of multi-bit routing tracks.
Article
The development of sustainable and durable ultra-low-power SoC calls for flexibility integration in the design flow. Reconfigurable logic circumvents the intrinsic low speed performances of software processing in microcontrollers but FPGA fabrics to be embedded suffer from a high power overhead compared to dedicated ASICs. We show that, by combining a power-oriented implementation using multi-VT, a careful repartition of different MOS flavors, and an aggressive scaling of core voltage, the dynamic power consumption can be reduced below 6μW/tile at 50MHz switching target and the leakage power consumption can be brought down below 0.5μW/tile. Simulation results show that a 16-bits multiplier, mapped onto the fabric developed with these techniques, is characterized by an energy per cycle as low as 2.5pJ.
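The headline numbers follow from the first-order switching-power model P = αCV²f; the quadratic dependence on the supply voltage is what makes the aggressive core-voltage scaling described above so effective. The parameter values in the example are illustrative, not figures extracted from the paper.

```python
def dynamic_power(alpha, c, vdd, f):
    """First-order dynamic power: activity factor * switched capacitance
    * Vdd squared * clock frequency (watts, given farads, volts, hertz)."""
    return alpha * c * vdd ** 2 * f
```

For instance, an activity factor of 0.1, 1 pF of switched capacitance per tile, Vdd = 1.0 V, and f = 50 MHz give 5 μW, the order of magnitude reported in the abstract; halving Vdd alone would cut this by 4x.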
Article
In FPGA CAD flow, routability driven algorithms have been introduced to improve feasibility of mapping designs onto the underlying architecture; timing and power driven algorithms have been introduced to meet design specifications. A number of techniques have been proposed to tackle routability, timing or power objectives independently during clustering stage. However, there is minimal work that targets multiple optimization goals. In this paper, we evaluate a clustering technique that targets routability and timing goals simultaneously. We combine the timing-driven T-VPack algorithm with a routability-driven non-uniform depopulation scheme (T-RDPack). Our technique keeps clusters on the critical path fully populated, while depopulating other clusters in the design. This approach has been implemented into the versatile place and route (VPR) toolset. We show that, compared to T-VPack, channel width reductions of 11.5%, 19.1%, 24.7% are achieved while incurring an area overhead of 0.6%, 3.1%, 9.1% respectively with negligible increase in critical path delay, exceeding the performance of T-RPack.
Article
Tunnels in glaciers offer unique opportunities for examining basal processes. At Suess Glacier in the Taylor Valley, Antarctica, a 25 m tunnel excavated into the bed of the glacier provides access to a 3.2 m thick basal zone and the ice-substrate contact. Measurements of ice velocity over two years together with glaciotectonic structures show that there are distinct strain concentrations, a sliding interface and thin shear zones or shear planes within the basal ice. Comparison of ice composition, debris concentrations and the shear strength of basal ice samples suggest that strength is controlled by ice chemistry and debris concentration. The highest strain rates occur in fine-grained amber ice with solute concentrations higher than adjacent ice. Sliding occurs at the base of the ice that experiences the highest strain rates. The substrate and blocks of the substrate within basal ice are characterized by brittle and slow ductile deformation whereas ice with low debris concentrations behaves in a ductile manner. The range of structures observed in the basal ice suggests that deformation occurs in a self-enhancing system. As debris begins to deform, debris and ice are mixed resulting in decreased debris concentrations. Subsequent deformation becomes more rapid and increasingly ductile as the debris and sedimentary structures within the debris are attenuated by glacier flow. The structural complexity and thickness of the resulting basal ice are considerably greater than previous descriptions of cold glaciers and demonstrate that the glacier is or was closely coupled to its bed.
Article
In this paper, we address a new approach for high-resolution reconstruction and enhancement of remote sensing (RS) imagery in near-real computational time based on the aggregated hardware/software (HW/SW) co-design paradigm. The software design is aimed at the algorithmic-level decrease of the computational load of the large-scale RS image enhancement tasks via incorporating into the fixed-point iterative reconstruction/enhancement procedures the convex convergence enforcement regularization by constructing the proper projectors onto convex sets (POCS) in the solution domain. The established POCS-regularized iterative techniques are performed separately along the range and azimuth directions over the RS scene frame making an optimal use of the sparseness properties of the employed sensor system modulation format. The hardware design is oriented on employing the Xilinx Field Programmable Gate Array XC4VSX35-10ff668 and performing the image enhancement/reconstruction tasks in a computationally efficient parallel fashion that meets the near-real time imaging system requirements. Finally, we report some simulation results and discuss the implementation performance issues related to enhancement of the real-world RS imagery indicative of the significantly increased performance efficiency gained with the developed approach.
Conference Paper
This paper introduces ASTRA, a novel FPGA-like architecture that can perform operations in space (for maximum performance) or in time (for minimum hardware area) at logic-cell level. Currently, ASTRA is tailored towards DSP applications and supports data flow between nearest neighbor cells. Control signals can be distributed over longer distances using bus-like connections. ASTRA's (logic and interconnect) silicon area is shown to be quite low while still providing sufficient flexibility for real-life applications. First benchmarking results with traditional applications from DSP domain show that ASTRA is competitive with respect to ASIC in terms of silicon area and power consumption
Article
Full-text available
There is a deep and useful connection between statistical mechanics (the behavior of systems with many degrees of freedom in thermal equilibrium at a finite temperature) and multivariate or combinatorial optimization (finding the minimum of a given function depending on many parameters). A detailed analogy with annealing in solids provides a framework for optimization of the properties of very large and complex systems. This connection to statistical mechanics exposes new information and provides an unfamiliar perspective on traditional optimization problems and methods.
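The optimization scheme the abstract describes maps directly to code: propose a random move, always accept improvements, accept uphill moves with probability exp(-Δ/T), and cool the temperature T geometrically. This is a generic minimal sketch of simulated annealing, not the authors' formulation; all parameter defaults are illustrative.

```python
import math
import random

def anneal(cost, neighbor, state, t0=10.0, cooling=0.95,
           steps_per_t=100, t_min=1e-3, seed=0):
    """Generic simulated annealing: uphill moves are accepted with
    probability exp(-delta / T), and T cools geometrically."""
    rng = random.Random(seed)
    cur = best = state
    t = t0
    while t > t_min:
        for _ in range(steps_per_t):
            cand = neighbor(cur, rng)
            delta = cost(cand) - cost(cur)
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                cur = cand
                if cost(cur) < cost(best):
                    best = cur
        t *= cooling
    return best
```

For a toy objective such as minimizing (x - 3)^2 over the integers with a ±1 neighbor move, the search wanders widely at high temperature and settles into the minimum as T falls, which is exactly the behavior annealing-based placers like VPR's rely on.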
Article
Full-text available
TimberWolf is an integrated set of placement and routing optimization programs. The general combinatorial optimization technique known as simulated annealing is used by each program. Programs for standard cell, macro/custom cell, and gate-array placement, as well as standard cell global routing, have been developed. Experimental results on industrial circuits show that area savings over existing layout programs ranging from 15 to 62 percent are possible.
Article
Cluster-Based Architecture, Timing-Driven Packing and Timing-Driven Placement for FPGAs. Master of Applied Science thesis, 1999.
Book
From the Publisher: Architecture and CAD for Deep-Submicron FPGAs addresses several key issues in the design of high-performance FPGA architectures and CAD tools, with particular emphasis on issues that are important for FPGAs implemented in deep-submicron processes. Three factors combine to determine the performance of an FPGA: the quality of the CAD tools used to map circuits into the FPGA, the quality of the FPGA architecture, and the electrical (i.e. transistor-level) design of the FPGA. Architecture and CAD for Deep-Submicron FPGAs examines all three of these issues in concert.
Article
This paper describes the Vantis VF1 FPGA architecture, an innovative architecture based on a 0.25 μm (drawn), 0.18 μm Leff, 4-metal technology. It was designed from scratch for high performance, routability, and ease of use. It supports system-level functions (including wide gating functions, dual-port SRAMs, high-speed carry chains, and high-speed I/O blocks) with a symmetrical structure. Additionally, the architecture of each of the critical elements, including variable-grain logic blocks, variable-length interconnects, dual-port embedded SRAM blocks, I/O blocks, and on-chip PLL functions, is described.
Article
When the transient response of a linear network to an applied unit step function consists of a monotonic rise to a final constant value, it is found possible to define delay time and rise time in such a way that these quantities can be computed very simply from the Laplace system function of the network. The usefulness of the new definitions is illustrated by applications to low pass, multi‐stage wideband amplifiers for which a number of general theorems are proved. In addition, an investigation of a certain class of two‐terminal interstage networks is made in an endeavor to find the network giving the highest possible gain—rise time quotient consistent with a monotonic transient response to a step function.
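This moment-based delay definition (Elmore's) is what many FPGA CAD delay models, including VPR's routing delay estimates, build on. For the special case of an RC ladder from source to sink, the first-moment delay reduces to each resistance multiplied by the total capacitance downstream of it; a minimal sketch, assuming a simple ladder rather than Elmore's general network:

```python
def elmore_delay(rs, cs):
    """Elmore delay of an RC ladder: stage i's resistance charges every
    capacitance at or beyond node i, so delay = sum_i R_i * sum_{j>=i} C_j."""
    delay = 0.0
    for i in range(len(rs)):
        delay += rs[i] * sum(cs[i:])
    return delay
```

With unit resistances and capacitances per stage, delay grows quadratically with ladder length, which is why CAD tools insert buffers on long routing paths.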
Conference Paper
A new reprogrammable FPGA architecture is described which is specifically designed to be of very low cost. It covers a range of 35 K to a million usable gates. In addition, it delivers high performance and is synthesis-efficient. This architecture is loosely based on an earlier reprogrammable Actel architecture named ES. By changing the structure of the interconnect and making other improvements, we achieved an average cost reduction by a factor of three per usable gate. The first member of the family based on this architecture is fabricated on a 2.5 V standard 0.25 μm CMOS technology with a gate count of up to 130 K, which also includes 36 K bits of two-port RAM. The gate count of this part is verified in a fully automatic design flow starting from a high-level description followed by synthesis, technology mapping, place and route, and timing extraction.
Conference Paper
In this paper, we investigate the speed and area-efficiency of FPGAs employing "logic clusters" containing multiple LUTs and registers as their logic block. We introduce a new, timing-driven tool (T-VPack) to "pack" LUTs and registers into these logic clusters, and we show that this algorithm is superior to an existing packing algorithm. Then, using a realistic routing architecture and sophisticated delay and area models, we empirically evaluate FPGAs composed of clusters ranging in size from one to twenty LUTs, and show that clusters of size seven through ten provide the best area-delay trade-off. Compared to circuits implemented in an FPGA composed of size-one clusters, circuits implemented in an FPGA with size-seven clusters have 30% less delay (a 43% increase in speed) and require 8% less area, and circuits implemented in an FPGA with size-ten clusters have 34% less delay (a 52% increase in speed) and require no additional area.
Conference Paper
This paper proposes a new field-programmable architecture that is a combination of two existing technologies: Field Programmable Gate Arrays (FPGAs) based on LookUp Tables (LUTs), and Complex Programmable Logic Devices based on PALs/PLAs. The methodology used for development of the new architecture, called Hybrid FPGA, is based on analysis of a large set of benchmark circuits, in which we determine what types of logic resources best match the needs of the circuits. The proposed Hybrid FPGA is evaluated by manually technology mapping a set of circuits into the new architecture and estimating the total chip area needed for each circuit, compared to the area that would be required if only LUTs were available. Preliminary results indicate that compared to LUT-based FPGAs the Hybrid offers savings of more than a factor of two in terms of chip area.
Conference Paper
This paper presents an efficient algorithm for buffered Steiner tree construction with wire sizing. Given a source and n sinks of a signal net, with a position and required arrival time associated with each sink, the algorithm finds a Steiner tree with buffer insertion and wire sizing so that the required arrival time (or timing slack) at the source is maximized. The unique contribution of our algorithm is that it performs Steiner tree construction, buffer insertion, and wire sizing simultaneously, considering both critical delay and total capacitance minimization, by combining performance-driven A-tree construction with dynamic-programming-based buffer insertion and wire sizing; in the past, tree construction and the other delay minimization techniques were carried out independently. Experimental results show the effectiveness of our approach.
Article
The logic blocks of most FPGAs contain clusters of lookup tables and flip-flops yet little is known about good choices for key parameters. How many lookup tables should a cluster contain, how should FPGA routing flexibility change as cluster size changes, and how many inputs should programmable routing provide each cluster?
Article
The field programmable gate-array (FPGA) has become an important technology in VLSI ASIC designs. In the past few years, a number of heuristic algorithms have been proposed for technology mapping in lookup-table (LUT) based FPGA designs, but none of them guarantees optimal solutions for general Boolean networks and little is known about how far their solutions are away from the optimal ones. This paper presents a theoretical breakthrough which shows that the LUT-based FPGA technology mapping problem for depth minimization can be solved optimally in polynomial time. A key step in our algorithm is to compute a minimum height K-feasible cut in a network, which is solved optimally in polynomial time based on network flow computation. Our algorithm also effectively minimizes the number of LUT's by maximizing the volume of each cut and by several post-processing operations. Based on these results, we have implemented an LUT-based FPGA mapping package called FlowMap. We have tested FlowMap on a large set of benchmark examples and compared it with other LUT-based FPGA mapping algorithms for delay optimization, including Chortle-d, MIS-pga-delay, and DAG-Map. FlowMap reduces the LUT network depth by up to 7% and reduces the number of LUT's by up to 50% compared to the three previous methods
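FlowMap's central object is the K-feasible cut: a cone of logic rooted at a node fits in a single K-input LUT exactly when at most K distinct signals cross into the cone from outside. The sketch below only checks feasibility of a given cone; FlowMap itself finds the minimum-height such cut via a network-flow computation, which is not reproduced here.

```python
def cone_inputs(cone, fanin):
    """Signals entering the cone from outside it; `fanin` maps each node
    to its driver nodes (primary inputs may be absent from the map)."""
    return {u for n in cone for u in fanin.get(n, []) if u not in cone}

def is_k_feasible(cone, fanin, k):
    """True when the whole cone can be implemented in one K-input LUT."""
    return len(cone_inputs(cone, fanin)) <= k
```

For example, a cone {v, a, b} with v driven by a and b, a by {x, y}, and b by {y, z} has inputs {x, y, z}: it fits in a 4-LUT but not a 2-LUT.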
Article
The relationship between the routability of a field-programmable gate array (FPGA) and the flexibility of its interconnection structures is examined. The flexibility of an FPGA is determined by the number and distribution of switches used in the interconnection. While good routability can be obtained with a high flexibility, a large number of switches will result in poor performance and logical density because each switch has significant delay and area. The minimum number of switches required to achieve good routability is determined by implementing several industrial circuits in a variety of interconnection architectures. These experiments indicate that high flexibility is essential for the connection block that joins the logic blocks to the routing channel, but a relative low flexibility is sufficient for switch blocks at the junction of horizontal and vertical channels. Furthermore, it is necessary to use only a few more routing tracks than the absolute minimum possible with structures of surprisingly low flexibility
Article
The relationship between the functionality of a field-programmable gate array (FPGA) logic block and the area required to implement digital circuits using that logic block is examined. The investigation is done experimentally by implementing a set of industrial circuits as FPGAs using CAD (computer-aided design) tools for technology mapping, placement, and routing. A range of programming technologies (the method of FPGA customization) is explored using a simple model of the interconnection and logic block area. The experiments are based on logic blocks that use lookup tables for implementing combinational logic. Results indicate that the best number of inputs to use (a measure of the block's functionality) is between three and four, and that a D flip-flop should be included in the logic block. The results are largely independent of the programming technology. More generally, it was observed that the area efficiency of a logic block depends not only on its functionality but also on the average number of pins connected per logic block
Xilinx Inc., "XC5200 Series of FPGAs," Data Book, 1997.
J. Rose and S. Brown, "Flexibility of Interconnection Structures for Field-Programmable Gate Arrays," IEEE Journal of Solid-State Circuits, March 1991, pp. 277-282.