Preprint

Abstract

Present-day Field Programmable Gate Array (FPGA) devices incorporate millions of logic resources, enabling hardware designers to build applications of enormous scale. However, handling such applications with the existing FPGA Computer Aided Design (CAD) flow still leaves much to be desired in terms of compilation time, performance, and power efficiency. Moreover, the existing CAD flow begins at Register Transfer Level (RTL) code, which limits design entry, and any analysis of large-scale designs for further optimization, to hardware experts. Manually optimizing RTL designs becomes increasingly difficult as designs grow larger. High-Level Synthesis (HLS) is an approach capable of increasing the design productivity of hardware applications compared to commonly used Hardware Description Languages (HDLs) and is known to be capable of performing optimizations at a higher level of abstraction. Our goal is to explore approaches that follow the HLS flow to map large-scale applications to FPGAs while addressing the above considerations.


References
Article
To increase productivity in designing digital hardware components, high-level synthesis (HLS) is seen as the next step in raising the design abstraction level. However, the quality of results (QoR) of HLS tools has tended to lag behind that of manual register-transfer level (RTL) flows. In this paper, we survey the scientific literature published since 2010 on the QoR and productivity differences between the HLS and RTL design flows. Altogether, our survey spans 46 papers and 118 associated applications. Our results show that, on average, the QoR of the RTL flow is still better than that of state-of-the-art HLS tools. However, the average development time with HLS tools is only a third of that of the RTL flow, and a designer obtains over four times higher productivity with HLS. Based on our findings, we also present a model case study to sum up the best practices in comparative studies between HLS and RTL. The outcome of our case study is in line with the survey results, as using an HLS tool is seen to increase productivity by a factor of six. In addition, to help close the QoR gap, we present a survey of literature focused on improving HLS. Our results let us conclude that HLS is currently a viable option for fast prototyping and for designs with short time to market.
Conference Paper
Achievable frequency (f_max) is a widely used input constraint for designs targeting Field-Programmable Gate Arrays (FPGAs) because of its impact on design latency and throughput. f_max is limited by critical path delay, which is highly influenced by lower-level details of the circuit implementation such as technology mapping, placement and routing. However, for high-level synthesis (HLS) design flows, it is challenging to evaluate the real critical delay at the behavioral level. Current HLS flows typically use module pre-characterization for delay estimates. However, we will demonstrate that such delay estimates are not sufficient to obtain high f_max and also minimize total execution latency. In this paper, we introduce a new HLS flow that integrates with Altera's Quartus synthesis and fast placement and routing (PAR) tool to obtain realistic post-PAR delay estimates. This integration enables an iterative flow that improves the performance of the design with both behavioral-level and circuit-level optimizations using realistic delay information. We demonstrate our HLS flow produces up to 24% (on average 20%) improvement in f_max and up to 22% (on average 20%) improvement in execution latency. Furthermore, results demonstrate that our flow is able to achieve from 65% to 91% of the theoretical f_max on Stratix IV devices (550 MHz).
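The iterative flow described above can be pictured as a small driver loop that alternates HLS rescheduling with fast placement and routing until the timing target is met. The C sketch below illustrates that control loop only; the helper names (run_hls, run_fast_par, "kernel.c") and their stub behaviors are hypothetical placeholders, not the paper's actual tool interfaces.

#include <stdio.h>

typedef struct { double fmax_mhz; } TimingReport;

/* Hypothetical stand-ins for the real tools: run_hls reschedules the
 * design (using post-PAR delays when available); run_fast_par returns a
 * realistic timing report. The stub bodies just simulate convergence. */
static double achieved = 350.0;
static void run_hls(const char *src, const TimingReport *feedback) {
    (void)src;
    if (feedback) achieved += 40.0;   /* pretend realistic delays help */
}
static TimingReport run_fast_par(void) {
    TimingReport r = { achieved };
    return r;
}

int main(void) {
    const double target = 500.0;      /* desired f_max in MHz */
    TimingReport rpt = { 0.0 };
    for (int i = 0; i < 8; i++) {
        /* Iteration 0 falls back to pre-characterized module delays;
         * later iterations reschedule against measured post-PAR delays. */
        run_hls("kernel.c", i == 0 ? NULL : &rpt);
        rpt = run_fast_par();
        printf("iteration %d: f_max = %.0f MHz\n", i, rpt.fmax_mhz);
        if (rpt.fmax_mhz >= target) break;   /* timing met: stop iterating */
    }
    return 0;
}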
Conference Paper
In this work we extend the FSMD (Finite-State Machine with Datapath) model to encompass synchronous memory accesses, intermodule communication and hardware-optimizing transformations. A lightweight typed assembly language, N-Address Code (NAC), is used as a designer-friendly representation of FSMDs, simplifying the adaptation of hardware synthesis with existing frontends. The quality (computation time, chip area) of the generated FSMDs has been evaluated on modern FPGAs. Our approach overcomes the C code limitations of four HLS tools while maintaining a good speed/area balance.
Conference Paper
In the last decade, a considerable amount of effort was spent on raising the implementation level of hardware systems by automatically extracting the parallelism from input applications and using tools to generate hardware/software co-design solutions. However, the tools developed thus far either focus on particular application domains or impose severe restrictions on the input language. In this paper, we present the DWARV 2.0 compiler, which accepts general C code as input and generates synthesizable VHDL for unrestricted application domains. Unlike previous hardware compilers, this implementation is based on the CoSy compiler framework, which allowed us to build a highly modular compiler in which standard or custom optimizations can be easily integrated. Validation experiments showed speed-ups of up to 4.41× when compared against another state-of-the-art hardware compiler.
Conference Paper
In this paper, we introduce a new open source high-level synthesis tool called LegUp that allows software techniques to be used for hardware design. LegUp accepts a standard C program as input and automatically compiles the program to a hybrid architecture containing an FPGA-based MIPS soft processor and custom hardware accelerators that communicate through a standard bus interface. Results show that the tool produces hardware solutions of comparable quality to a commercial high-level synthesis tool.
Conference Paper
While FPGA-based hardware accelerators have repeatedly been demonstrated as a viable option, their programmability remains a major barrier to their wider acceptance by application code developers. These platforms are typically programmed in a low-level hardware description language, a skill not common among application developers and a process that is often tedious and error-prone. Programming FPGAs from high-level languages would provide easier integration with software systems as well as open up hardware accelerators to a wider spectrum of application developers. In this paper, we present a major revision to the Riverside Optimizing Compiler for Configurable Circuits (ROCCC), designed to create hardware accelerators from C programs. Novel additions to ROCCC include (1) intuitive modular bottom-up design of circuits from C, and (2) separation of code generation from specific FPGA platforms. The additions we make do not introduce any new syntax to the C code and maintain the high-level optimizations from the ROCCC system that generate efficient code. The modular code we support functions identically as software or hardware. Additionally, we enable user control of hardware optimizations such as systolic array generation and temporal common subexpression elimination. We evaluate the quality of the ROCCC 2.0 tool by comparing it to hand-written VHDL code. We show comparable clock frequencies and an 18% higher throughput. The productivity advantages of ROCCC 2.0 are evaluated using the metrics of lines of code and programming time, showing an average of 15× improvement over hand-written VHDL.
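To make the input style concrete, here is a hedged sketch of the restricted C subset that C-to-hardware compilers such as ROCCC typically accept: fixed loop bounds, array streams, no pointers or recursion. The sliding-window reuse in the inner loop is the pattern that lends itself to the systolic array generation mentioned above. Illustrative only; not taken from the ROCCC distribution.

#include <stdio.h>
#define N 1024
#define TAPS 4

/* 4-tap FIR filter: each output is a weighted sum over a sliding
 * window, a shape such compilers can pipeline into a systolic datapath. */
void fir4(const int x[N], int y[N]) {
    const int c[TAPS] = {1, 3, 3, 1};
    for (int i = TAPS - 1; i < N; i++) {
        int acc = 0;
        for (int j = 0; j < TAPS; j++)
            acc += c[j] * x[i - j];
        y[i] = acc;
    }
}

int main(void) {
    static int x[N], y[N];
    for (int i = 0; i < N; i++) x[i] = i % 8;   /* toy input stream */
    fir4(x, y);
    printf("y[10] = %d\n", y[10]);              /* software check of the kernel */
    return 0;
}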
Conference Paper
Sea Cucumber (SC) is a synthesizing compiler for FPGAs that accepts Java class files as input (generated from Java source files) and that generates circuits that exploit the coarse- and fine-grained parallelism available in the input class files. Programmers determine the level of coarse-grained parallelism available by organizing their circuit as a set of inter-communicating, concurrent threads (using standard Java threads) that are implemented by SC as concurrent hardware. SC automatically extracts fine-grained parallelism from the body of each thread by processing the bytecodes contained in the input class files and employs conventional compiler optimizations such as data-flow and control-flow graph analysis, dead-code elimination, constant folding, operation simplification, predicated static single assignment, if-conversion, hyperblock formation, etc. The resulting EDIF files can be processed using Xilinx place and route software to produce bitstreams that can be downloaded into FPGAs for execution.
Article
RC systems typically consist of an array of configurable computing elements. The computational granularity of these elements ranges from simple gates - as abstracted by FPGA lookup tables - to complete arithmetic-logic units with or without registers. A rich programmable interconnect completes the array. The RC system developer manually partitions an application into two segments: a hardware component, in a hardware description language such as VHDL or Verilog, that will execute as a circuit on the FPGA, and a software component that will execute as a program on the host. Single-assignment C is a C language variant designed to create an automated compilation path from an algorithmic programming language to an FPGA-based reconfigurable computing system.
Conference Paper
Hybrid architectures combining conventional processors with configurable logic resources enable efficient coordination of control with datapath computation. With integration of the two components on a single device, loop control and data-dependent branching can be handled by the conventional processor, while regular datapath computation occurs on the configurable hardware. This paper describes a novel pragma-based approach to programming such hybrid devices. The NAPA C language provides pragma directives so that the programmer (or an automatic partitioner) can specify where data is to reside and where computation is to occur with statement-level granularity. The NAPA C compiler, targeting National Semiconductor's NAPA1000 chip, performs semantic analysis of the pragma-annotated program and co-synthesizes a conventional program executable combined with a configuration bit stream for the adaptive logic. Compiler optimizations include synthesis of hardware pipelines from pipelineable loops.
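The statement-level pragma style described above might look as follows. This is a hedged sketch: the directive names are hypothetical placeholders, since the actual NAPA C pragma syntax is not reproduced here.

#include <stdio.h>
#define N 256

/* Hypothetical directives illustrating statement-level control over
 * where data resides and where computation occurs. */
void saxpy(float a, float x[N], float y[N]) {
#pragma napa data(x, y) location(adaptive_logic)   /* hypothetical: map arrays to fabric-side memory */
#pragma napa compute(adaptive_logic)               /* hypothetical: execute the loop body in hardware */
    for (int i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];
    /* Loop control and any data-dependent branching would remain on the
     * conventional processor, per the hybrid model described above. */
}

int main(void) {
    static float x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }
    saxpy(3.0f, x, y);
    printf("y[0] = %.1f\n", y[0]);   /* expect 5.0 */
    return 0;
}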
Article
Various projects and products have been built using off-the-shelf field-programmable gate arrays (FPGAs) as computation accelerators for specific tasks. Such systems typically connect one or more FPGAs to the host computer via an I/O bus. Some have shown remarkable speedups, albeit limited to specific application domains. Many factors limit the general usefulness of such systems. Long reconfiguration times prevent the acceleration of applications that spread their time over many different tasks. Low-bandwidth paths for data transfer limit the usefulness of such systems to tasks that have a high computation-to-memory-bandwidth ratio. In addition, standard FPGA tools require hardware design expertise which is beyond the knowledge of most programmers. To help investigate the viability of connected FPGA systems, the authors designed their own architecture called Garp and experimented with running applications on it. They are also investigating whether Garp's design enables automatic, fast, effective compilation across a broad range of applications. They present their results in this article.
Article
With the proliferation of highly specialized embedded computer systems has come a diversification of workloads for computing devices. General-purpose processors are struggling to efficiently meet these applications' disparate needs, and custom hardware is rarely feasible. According to the authors, reconfigurable computing, which combines the flexibility of general-purpose processors with the efficiency of custom hardware, can provide the alternative. PipeRench and its associated compiler comprise the authors' new architecture for reconfigurable computing. Combined with a traditional digital signal processor, microcontroller or general-purpose processor, PipeRench can support a system's various computing needs without requiring custom hardware. The authors describe the PipeRench architecture and how it solves some of the pre-existing problems with FPGA architectures, such as logic granularity, configuration time, forward compatibility, hard constraints and compilation time.
Conference Paper
Modern FPGAs integrate multi-million logic resources that allow the realization of increasingly large designs. However, state-of-the-art simulated-annealing-based CAD tools for FPGAs suffer from long runtimes, poor performance, and sub-optimal routing and placement decisions, especially for large applications, leading to less energy-efficient designs. In this paper, we present a partitioning methodology that divides a large application into smaller subsystems based on the communication frequency between these subsystems. We leverage the existing CAD tools to compile the large design, now annotated with its subsystems, to obtain the final bitstream. Experiments show that the proposed strategy can lead to a performance gain of over 60% while still achieving more than 20% reduction in energy consumption.
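A minimal sketch of the communication-driven idea, assuming a simple greedy formulation: repeatedly merge the two clusters that exchange the most traffic until the target number of subsystems remains. The paper's actual algorithm and cost model may differ; the communication matrix below is toy data.

#include <stdio.h>
#include <stdlib.h>

#define NODES 6
#define TARGET_PARTS 2

int main(void) {
    /* comm[i][j]: communication frequency between modules i and j (toy data). */
    static int comm[NODES][NODES] = {
        {0, 9, 1, 0, 0, 0}, {9, 0, 8, 1, 0, 0}, {1, 8, 0, 1, 0, 0},
        {0, 1, 1, 0, 7, 2}, {0, 0, 0, 7, 0, 9}, {0, 0, 0, 2, 9, 0},
    };
    int part[NODES];                       /* cluster id of each module */
    for (int i = 0; i < NODES; i++) part[i] = i;
    int nparts = NODES;

    while (nparts > TARGET_PARTS) {
        int bi = -1, bj = -1, best = -1;
        /* Find the heaviest communication edge crossing two clusters. */
        for (int i = 0; i < NODES; i++)
            for (int j = i + 1; j < NODES; j++)
                if (part[i] != part[j] && comm[i][j] > best) {
                    best = comm[i][j]; bi = i; bj = j;
                }
        /* Merge the two clusters so the chattiest modules stay together. */
        int from = part[bj], to = part[bi];
        for (int k = 0; k < NODES; k++)
            if (part[k] == from) part[k] = to;
        nparts--;
    }
    for (int i = 0; i < NODES; i++)
        printf("module %d -> subsystem %d\n", i, part[i]);
    return 0;
}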
Article
Datacenter workloads demand high computational capabilities, flexibility, power efficiency, and low cost. It is challenging to improve all of these factors simultaneously. To advance datacenter capabilities beyond what commodity server designs can provide, we designed and built a composable, reconfigurable hardware fabric based on field programmable gate arrays (FPGA). Each server in the fabric contains one FPGA, and all FPGAs within a 48-server rack are interconnected over a low-latency, high-bandwidth network. We describe a medium-scale deployment of this fabric on a bed of 1632 servers, and measure its effectiveness in accelerating the ranking component of the Bing web search engine. We describe the requirements and architecture of the system, detail the critical engineering challenges and solutions needed to make the system robust in the presence of failures, and measure the performance, power, and resilience of the system. Under high load, the large-scale reconfigurable fabric improves the ranking throughput of each server by 95% at a desirable latency distribution or reduces tail latency by 29% at a fixed throughput. In other words, the reconfigurable fabric enables the same throughput using only half the number of servers.
Conference Paper
Scheduling plays a central role in high-level synthesis, as it inserts clock boundaries into the untimed behavioral model and greatly impacts the performance, power, and area of the synthesized circuits. While current scheduling techniques can make use of pre-characterized delay values of individual operations, it is difficult to obtain accurate timing estimation on a cluster of operations without considering technology mapping. This limitation is particularly pronounced for FPGAs where a large logic network can be mapped to only a few levels of look-up tables (LUT). In this paper, we propose MAPS, a mapping-aware constrained scheduling algorithm for LUT-based FPGAs. Instead of simply summing up the estimated delay values of individual operations, MAPS jointly performs technology mapping and scheduling, creating the opportunity for more aggressive operation chaining to minimize latency and reduce area. We show that MAPS can produce a latency-optimal solution, while supporting a variety of design timing requirements expressed in a system of difference constraints. We also present an efficient incremental scheduling technique for MAPS to effectively handle resource constraints. Experimental results with real-life benchmarks demonstrate that our proposed algorithm achieves very promising improvements in performance and resource usage when compared to a state-of-the-art commercial high-level synthesis tool targeting Xilinx FPGAs.
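The "system of difference constraints" mentioned above follows the standard SDC-style scheduling formulation; a generic form is sketched below in LaTeX (the exact constraint set used by MAPS may differ).

% s_i denotes the start cycle of operation i.
\begin{align*}
  s_j - s_i &\ge d_{ij} && \text{(dependence: $j$ starts at least $d_{ij}$ cycles after $i$)}\\
  s_i - s_j &\le c_{ij} && \text{(generic timing requirement between $i$ and $j$)}\\
  \text{minimize } & s_{\mathrm{sink}} - s_{\mathrm{source}} && \text{(overall latency objective)}
\end{align*}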
Conference Paper
Convolutional neural networks (CNNs) have been widely employed for image recognition because they can achieve high accuracy by emulating the behavior of optic nerves in living creatures. Recently, the rapid growth of modern applications based on deep learning algorithms has further spurred research and implementations. In particular, various accelerators for deep CNNs have been proposed based on FPGA platforms because of their advantages of high performance, reconfigurability, fast development rounds, etc. Although current FPGA accelerators have demonstrated better performance over generic processors, the accelerator design space has not been well exploited. One critical problem is that the computation throughput may not well match the memory bandwidth provided by an FPGA platform. Consequently, existing approaches cannot achieve the best performance due to under-utilization of either logic resources or memory bandwidth. At the same time, the increasing complexity and scalability of deep learning applications aggravate this problem. To overcome this problem, we propose an analytical design scheme using the roofline model. For any solution of a CNN design, we quantitatively analyze its computing throughput and required memory bandwidth using various optimization techniques, such as loop tiling and transformation. Then, with the help of the roofline model, we can identify the solution with the best performance and lowest FPGA resource requirement. As a case study, we implement a CNN accelerator on a VC707 FPGA board and compare it to previous approaches. Our implementation achieves a peak performance of 61.62 GFLOPS under a 100 MHz working frequency, which outperforms previous approaches significantly.
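For reference, the roofline bound used for this kind of design-space pruning has the standard form below; the paper's exact estimator may refine it.

\[
  \mathrm{Perf}_{\mathrm{attainable}} \;=\; \min\bigl(\,\mathrm{Perf}_{\mathrm{peak}},\;
      \mathrm{CTC} \times \mathrm{BW}_{\mathrm{mem}}\,\bigr)
\]
% CTC: computation-to-communication ratio (ops per byte of off-chip
% traffic, shaped here by loop tiling); BW_mem: platform memory bandwidth.

A candidate loop tiling is kept only if its compute throughput lies under the platform roofline at its CTC ratio while using the fewest resources.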
Book
Preface. 1. Introduction J.M. Rabaey, et al. Part I: Technology and circuit design levels. 2. Device and technology impact on low power electronics Chenming Hu. 3. Low power circuit technologies C. Svensson, Dake Liu. 4. Energy-recovery CMOS W.C. Athas. 5. Low power clock distribution J.G. Xi, W.W.-M. Dai. Part II: Logic and module design levels. 6. Logic synthesis and module design levels M. Pedram. 7. Low power arithmetic components T.K. Callaway, E.E. Swartzlander. 8. Low power memory design K. Itoh. Part III: Architecture and system design levels. 9. Low-power microprocessor design S. Gary. 10. Portable video-on-demand in wireless communication T.H. Meng, et al. 11. Algorithm and architectural level methodologies R. Mehra, et al. Index.
Article
Power and energy efficiency have emerged as first-order design constraints across the computing spectrum from handheld devices to warehouse-sized datacenters. As the number of transistors continues to scale, effectively managing design complexity under stringent power constraints has become an imminent challenge of the IC industry. The manual process of power optimization in RTL design has been increasingly difficult, if not already unsustainable. Complexity scaling dictates that this process must be automated with robust analysis and synthesis algorithms at a higher level of abstraction. Along this line, high-level synthesis (HLS) is a promising technology to improve design productivity and enable new opportunities for power optimization for higher design quality. By allowing early access to the system architecture, high-level decisions during HLS can have a significant impact on the power and energy efficiency of the synthesized design. In this paper, we will discuss the recent research development of using HLS to effectively explore a multi-dimensional design space and derive low-power implementations. We provide an in-depth coverage of HLS low-power optimization techniques and synthesis algorithms proposed in the last decade. We will also describe the key power optimization challenges facing HLS today and outline potential opportunities in tackling these challenges.
Conference Paper
Placement of a large FPGA design now commonly requires several hours, significantly hindering designer productivity. Furthermore, FPGA capacity is growing faster than CPU speed, which will further increase placement time unless new approaches are found. Multi-core processors are now ubiquitous, however, and some recent processors also have hardware support for transactional memory (TM), making parallelism an increasingly attractive approach for speeding up placement. We investigate methods to parallelize the simulated annealing placement algorithm in VPR, which is widely used in FPGA research. We explore both algorithmic changes and the use of different parallel programming paradigms and hardware, including TM, thread-level speculation (TLS) and lock-free techniques. We find that hardware TM enables large speedups (8.1x on average), but compromises "move fairness" and leads to an unacceptable quality loss. TLS scales poorly, with a maximum 2.2x speedup, but preserves quality. A new dependency checking parallel strategy achieves the best balance: the deterministic version achieves 5.9x speedup and no quality loss, while the non-deterministic, lock-free version can scale to a 34x speedup.
Conference Paper
Numerous studies have shown the advantages of hardware and software co-design using FPGAs. However, increasingly lengthy place-and-route times represent a barrier to the broader adoption of this technology by significantly reducing designer productivity and turns-per-day, especially compared to more traditional design environments offered by competitive technologies such as GPUs. In this paper, we address this challenge by introducing a new approach to FPGA application design that significantly reduces compile times by exploiting the functional reuse common throughout modern FPGA applications, e.g. as shared code libraries and unchanged modules between compiles. To evaluate this approach, we introduce Block Place and Route (BPR), an FPGA CAD approach that modifies traditional placement and routing to operate at a higher-level of abstraction by pre-computing the internal placement and routing of reused cores. By extending traditional place-and-route algorithms such as simulated-annealing placement and negotiated-congestion routing to abstract away the detailed implementation of reused cores, we show that BPR is capable of orders-of-magnitude speedup in place-and-route over commercial tools with acceptably low overhead for a variety of applications.
Conference Paper
The increasing design complexity of modern circuits has made traditional FPGA placement techniques inefficient. To improve scalability, commercial FPGA placement tools have started migrating to analytical placement. In this paper, we propose the first academic multilevel timing-and-wirelength-driven analytical placement algorithm for FPGAs. Our proposed algorithm consists of (1) multilevel timing-and-wirelength-driven analytical global placement with novel block alignment consideration, (2) partitioning-based legalization, (3) wirelength-driven block-matching-based detailed placement, and (4) timing-driven simulated-annealing-based detailed placement. Experimental results show that our proposed approach can achieve a 6.91x speedup on average with 7% smaller critical path delay and 1% shorter routed wirelength compared to VPR, the well-known, state-of-the-art academic simulated-annealing-based FPGA placer.
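As background, analytical (quadratic) placers of this kind minimize an objective of the standard form below; the paper's timing-and-wirelength-driven cost extends this basic form with net weighting. A generic sketch, not the paper's exact formulation:

\[
  \Phi(\mathbf{x},\mathbf{y}) \;=\; \sum_{(i,j)\in E} w_{ij}
      \bigl[ (x_i - x_j)^2 + (y_i - y_j)^2 \bigr]
\]
% (x_i, y_i): coordinates of block i; E: two-pin connections after net
% decomposition; w_{ij}: wirelength/timing net weight.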
Conference Paper
This paper presents bambu, a modular framework for research on high-level synthesis currently under development at Politecnico di Milano. It can accept most C constructs without requiring any three-state logic for their implementation, by exploiting a novel and efficient memory architecture. It also allows the integration of floating-point units and can thus deal with a wide range of data types. Finally, it allows the synthesis flow (e.g., transformation passes, constraints, options, synthesis scripts) to be easily customized through an XML file, and it automatically generates test-benches and validates the results against the corresponding software execution, supporting both ASIC and FPGA technologies.
Conference Paper
We present HeAP, an analytical placement algorithm for heterogeneous FPGAs comprised of LUT-based logic blocks, multiplier/DSP blocks and block RAMs. Specifically, we adapt a state-of-the-art ASIC-based analytical placer to target FPGAs with heterogeneous blocks located at discrete locations throughout the fabric. Our placer also handles macros of LUT-based blocks with specific layout requirements, such as carry chains. Results show that our placer delivers a 4× speedup, on average, compared to Altera's non-timing-driven flow, at the cost of a 5% increase in post-routed wirelength, and an 11× speedup compared to Altera's timing-driven flow, at the cost of a 4% increase in post-routed wirelength and a 9% reduction in maximum operating frequency. We also compare with an academic simulated-annealing-based placer and demonstrate a 7.4× runtime advantage with 6% better placement quality.
Book
List of Figures. List of Tables. Preface. Acknowledgments. 1. Introduction. 2. Essential Issues in System Level Design. 3. The SpecC Language. 4. The SpecC Methodology. 5. System Level Design with SpecC. 6. Conclusions. Appendices. Index.
Conference Paper
We describe the Kiwi parallel programming library and its associated synthesis system, which is used to transform C# parallel programs into circuits for realization on FPGAs. The Kiwi system is targeted at making reconfigurable computing technology accessible to software engineers who are willing to express their computations as parallel programs. Although there has been much work on compiling sequential C-like programs to hardware by automatically 'discovering' parallelism, we work by exploiting the parallel architecture communicated by the designer through the choice of parallel and concurrent programming language constructs. Specifically, we describe a system that takes .NET assembly language with suitable custom attributes as input and produces Verilog output which is mapped to FPGAs. We can then choose to apply analysis and verification techniques either to the high-level representation in C# or other .NET languages or to the generated RTL netlists. A distinctive aspect of our approach is the exploitation of existing language constructs for concurrent programming and synchronization, which contrasts with other schemes that introduce specialized concurrency control constructs to extend a sequential language.
Conference Paper
The FPGA compilation process (synthesis, map, place, and route) is a time-consuming task that severely limits designer productivity. Compilation time can be reduced by saving implementation data in the form of hard macros. Hard macros consist of previously synthesized, placed and routed circuits that enable rapid design assembly because of the native FPGA circuitry (primitives and nets) which they encapsulate. This work presents results from creating a new FPGA design flow based on hard macros called HMFlow. HMFlow has shown speedups of 10-50X over the fastest configuration of the Xilinx tools. Designed for rapid prototyping, HMFlow achieves these speedups by only utilizing up to 50 percent of the resources on an FPGA and produces implementations that run 2-4X slower than those produced by Xilinx. These speedups are obtained on a wide range of benchmark designs with some exceeding 18,000 slices on a Virtex 4 LX200.
Conference Paper
Placement based on simulated annealing is in dominant use in the FPGA community due to its superior quality of result (QoR). However, given the progression of FPGA device capacity to the order of 100K LUTs, the long runtime associated with simulated annealing warrants a revisit of other placement paradigms in the context of FPGAs. In this paper, we attempt to make a rigorous comparison of a recent crop of academic ASIC placers and VPR when applied to modern FPGA device features and design sizes. We also report a new detailed placer, MDP, based on a new problem formulation of maximum-bipartite matching. We show that MDP is 3X to 7X faster than the detailed placer in FastPlace, which until now has been the fastest detailed placer publicly available. Furthermore, this speedup occurs while producing comparable or superior QoR. With these results, we speculate promising research directions towards scalable, high quality FPGA placement flows that can change the user experience from an overnight wait-time to a coffee break wait-time -- even on large benchmarks.
Conference Paper
The demand for high-speed FPGA compilation tools has occurred for three reasons: first, as FPGA device capacity has grown, the computation time devoted to placement and routing has grown more dramatically than the compute power of the available computers. Second, there exists a subset of users who are willing to accept a reduction in the quality of result in exchange for a high-speed compilation. Third, high-speed compile has been a long-standing desire of users of FPGA-based custom computing machines, since their compile time requirements are ideally closer to those of regular computers. This paper focuses on the placement phase of the compile process, and presents an ultra-fast placement algorithm targeted to FPGAs. The algorithm is based on a combination of multiple-level, bottom-up clustering and hierarchical simulated annealing. It provides superior area results over a known high-quality placement tool on a set of large benchmark circuits, when both are restricted to a short run time. For example, it can generate a placement for a 100,000-gate circuit in 10 seconds on a 300 MHz Sun UltraSPARC workstation that is only 33% worse than a high-quality placement that takes 524 seconds using a pure simulated annealing implementation. In addition, operating in its fastest mode, this tool can provide an accurate estimate of the wirelength achievable with good quality placement. This can be used, in conjunction with a routing predictor, to very quickly determine the routability of a given circuit on a given FPGA device.
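For readers unfamiliar with the baseline, the simulated annealing core that the hierarchical scheme above builds on is sketched below on a toy 1-D placement problem. Real placers use 2-D slots, bounding-box wirelength, range-limited moves, and tuned cooling schedules; everything here is deliberately simplified.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define N 32
static int slot[N];                 /* slot[p] = block placed at position p */

static double cost(void) {          /* toy wirelength: a chain netlist i -- i+1 */
    double c = 0;
    for (int b = 0; b + 1 < N; b++) {
        int pa = 0, pb = 0;
        for (int p = 0; p < N; p++) { if (slot[p] == b) pa = p; if (slot[p] == b + 1) pb = p; }
        c += abs(pa - pb);
    }
    return c;
}

int main(void) {
    for (int i = 0; i < N; i++) slot[i] = (i * 7) % N;   /* scrambled start */
    double cur = cost();
    for (double t = 10.0; t > 0.01; t *= 0.95) {         /* geometric cooling */
        for (int m = 0; m < 200; m++) {
            int a = rand() % N, b = rand() % N;
            int tmp = slot[a]; slot[a] = slot[b]; slot[b] = tmp;  /* propose a swap */
            double nc = cost();
            /* Accept improvements always; accept uphill moves with
             * probability exp(-delta/t), which shrinks as t cools. */
            if (nc <= cur || (double)rand() / RAND_MAX < exp(-(nc - cur) / t))
                cur = nc;
            else { tmp = slot[a]; slot[a] = slot[b]; slot[b] = tmp; }  /* undo */
        }
    }
    printf("final wirelength: %.0f (optimum %d)\n", cur, N - 1);
    return 0;
}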
Conference Paper
This paper describes a parallel implementation of the timing-driven VPR 5.0 simulated annealing engine. By restricting the move distance to a confined neighborhood, it is possible to consider a large number of non-conflicting moves in parallel and achieve a deterministic result. The full timing-driven algorithm is parallelized, including the detailed timing analysis updates done periodically while placement progresses. The limited move slightly degrades the placement quality, but this is necessary to expose greater degrees of parallelism. The overall bounding box metric degrades about 11% and critical path delay metric degrades about 8% compared to VPR's original algorithm, but we show the amount of degradation is independent of the number of threads. Overall, the parallel implementation scales to a speedup of 123x using 25 threads compared to VPR. With additional tuning effort, we believe the algorithm can be scaled to a larger number of threads, perhaps even run on a GPU, with little additional quality degradation.
Conference Paper
Programmable logic devices such as FPGAs are useful for a wide range of applications. However, FPGAs are not commonly used in battery-powered applications because they consume more power than ASICs and lack power management features. In this paper, we describe the design and implementation of Pika, a low-power FPGA core targeting battery-powered applications such as those in consumer and automotive markets. Our design uses the Xilinx Spartan-3 low-cost FPGA as a baseline and achieves substantial power savings through a series of power optimizations. The resulting architecture is compatible with existing commercial design tools. The implementation is done in a 90nm triple-oxide CMOS process. Compared to the baseline design, Pika consumes 46% less active power and 99% less standby power. Furthermore, it retains circuit and configuration state during standby mode, and wakes up from standby mode in approximately 100ns.
Article
We describe a parallel simulated annealing algorithm for FPGA placement. The algorithm proposes and evaluates multiple moves in parallel, and has been incorporated into Altera’s Quartus II CAD system. Across a set of 18 industrial benchmark circuits, we achieve geometric average speedups during the quench of 2.7x and 4.0x on four and eight processors, respectively, with individual circuits achieving speedups of up to 3.6x and 5.9x. Over the course of the entire anneal, we achieve speedups of up to 2.8x and 3.7x, with geometric average speedups of 2.1x and 2.4x. Our algorithm is the first parallel placer to optimize for criteria other than wirelength, such as critical path length, and is one of the few deterministic parallel placement algorithms. We discuss the challenges involved in combining these two features and the new techniques we used to overcome them. We also quantify the impact of maintaining determinism on eight cores, and find that while it reduces performance by approximately 15% relative to an ideal speedup of 8.0x, hardware limitations are a larger factor and reduce performance by 30--40%. We then suggest possible enhancements to allow our approach to scale to 16 cores and beyond.
Conference Paper
This paper describes CHiMPS, a C-based accelerator compiler for hybrid CPU-FPGA computing platforms. CHiMPS's goal is to facilitate FPGA programming for high-performance computing developers. It inputs generic ANSI C code and automatically generates VHDL blocks for an FPGA. The accelerator architecture is customized with multiple caches that are tuned to the application. Speedups of 2.8x to 36.9x (geometric mean 6.7x) are achieved on a variety of HPC benchmarks with minimal source code changes.
Conference Paper
We consider dynamic power dissipation in FPGAs and present CAD techniques for dynamic power reduction. The proposed techniques, comprising power-aware placement, routing, and a novel post-routing transformation, are applied to optimize the power consumed by industrial designs implemented in the Xilinx Virtex-5 FPGA. Board-level power measurements on a suite of industrial designs show that the techniques reduce power by 10%, on average.
Conference Paper
As Field-Programmable Gate Array (FPGA) power consumption continues to increase, lower power FPGA circuitry, architectures, and Computer-Aided Design (CAD) tools need to be developed. Before designing low-power FPGA circuitry, architectures, or CAD tools, we must first determine where the biggest savings (in terms of energy dissipation) are to be made and whether these savings are cumulative. In this paper, we focus on FPGA CAD tools. Specifically, we describe a new power-aware CAD flow for FPGAs that was developed to answer the above questions. Estimating energy using very detailed post-route power and delay models, we determine the energy savings obtained by our power-aware technology mapping, clustering, placement, and routing algorithms and investigate how the savings behave when the algorithms are applied concurrently. The individual savings of the power-aware technology-mapping, clustering, placement, and routing algorithms were 7.6%, 12.6%, 3.0%, and 2.6% respectively. The majority of the overall savings were achieved during the technology mapping and clustering stages of the power-aware FPGA CAD flow. In addition, the savings were mostly cumulative when the individual power-aware CAD algorithms were applied concurrently with an overall energy reduction of 22.6%.
Conference Paper
This paper presents a modular and extensible high-level synthesis research system, called SPARK, that takes a behavioral description in ANSI-C as input and produces synthesizable register-transfer level VHDL. SPARK uses parallelizing compiler technology, developed previously, to enhance instruction-level parallelism and re-instruments it for high-level synthesis by incorporating ideas of mutual exclusivity of operations, resource sharing and hardware cost models. In this paper, we present the design flow through the SPARK system, a set of transformations that include speculative code motions and dynamic transformations and show how these transformations and other optimizing synthesis and compiler techniques are employed by a scheduling heuristic. Experiments are performed on two moderately complex industrial applications, namely MPEG-1 and the GIMP image processing tool. The results show that the various code transformations lead to up to 70% improvements in performance without any increase in the overall area and critical path of the final synthesized design.
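One of the speculative code motions applied by such scheduling heuristics can be illustrated in plain C: an operation is hoisted above the branch that guards it so it can be scheduled earlier, and the branch merely selects the result. A hedged before/after sketch; SPARK's actual IR-level transformations are more general.

/* Before: the multiply cannot be scheduled until the branch resolves. */
int before(int c, int a, int b) {
    int r = 0;
    if (c > 0)
        r = a * b;      /* guarded by the branch */
    return r;
}

/* After speculation: the multiply executes unconditionally (earlier,
 * possibly in parallel with evaluating the condition); the branch only
 * selects the result. Safe here because a*b has no side effects. */
int after(int c, int a, int b) {
    int t = a * b;      /* speculated above the branch */
    return (c > 0) ? t : 0;
}

int main(void) { return before(1, 6, 7) == after(1, 6, 7) ? 0 : 1; }  /* both versions agree */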
Conference Paper
Interconnects (wires, buffers, clock distribution networks, multiplexers and buses) consume a significant fraction of total circuit power. In this work, we demonstrate the importance of optimizing on-chip interconnects for power during high-level synthesis. We present a methodology to integrate interconnect power optimization into high-level synthesis. Our binding algorithm not only reduces power consumption in functional units and registers in the resultant register-transfer level (RTL) architecture, but also optimizes interconnects for power. We take physical design information into account for this purpose. To estimate interconnect power consumption accurately for deep sub-micron (DSM) technologies, wire coupling capacitance is taken into consideration. We observed that there is significant spurious (i.e., unnecessary) switching activity in the interconnects and propose techniques to reduce it. Compared to interconnect-unaware power-optimized circuits, our experimental results show that interconnect power can be reduced by 53.1% on an average, while reducing overall power by an average of 26.8% with 0.5% area overhead. Compared to area-optimized circuits, the interconnect power reduction is 72.9% and overall power reduction is 56.0% with 44.4% area overhead.
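The optimization above rests on the standard dynamic power model, with the effective capacitance term refined to account for wire coupling; a generic form is given below (the paper's estimator is more detailed).

\[
  P_{\mathrm{dyn}} \;=\; \alpha\, C_{\mathrm{eff}}\, V_{dd}^{2}\, f,
  \qquad
  C_{\mathrm{eff}} \;=\; C_{\mathrm{self}} + C_{\mathrm{coupling}}
\]
% alpha: switching activity (including the spurious transitions the
% paper seeks to reduce); f: clock frequency.

Reducing either the spurious activity factor on heavily loaded interconnects or the coupling component of the capacitance directly lowers interconnect power.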
Article
This paper examines the achievements and future of system-on-a-chip (SoC) design methodology and design flow from the viewpoints of an in-house electronic design automation team of an application-specific integrated circuit and system vendor. We initially discuss the problems of the design productivity gap caused by the SoC's complexity and the timing closure caused by deep-submicrometer technology. To solve these two problems, we propose a C-based SoC design environment that features integrated high-level synthesis (HLS) and verification tools. A HLS system is introduced using various successful industrial design examples, and its advantages and drawbacks are discussed. We then look at the future directions of this system. The high-level verification environment consists of a mixed-level hardware/software co-simulator, formal and semi-formal verifiers, and test-bench generators. The verification tools are tightly integrated with the HLS system and take advantage of information from the synthesis system. We then discuss the possibility of incorporating physical design features into the C-based SoC design environment. Finally, we describe our global vision for an SoC architecture and SoC design methodology.
Article
In this paper, Frontier, an integrated, timing-driven placement system that aggressively uses macro-blocks and floorplanning to quickly converge to a high-quality placement solution, is detailed. This system can be used in place of existing placement approaches for macro-based designs targeted to devices with architectures similar to the Xilinx Virtex [Xilinx Corporation 2001] and XC4000 [Xilinx Corporation 1998] and Lucent Orca [Lucent Technologies 1996] families. Rather than using a single algorithm, the new Frontier tool set relies on a sequence of interrelated placement steps. First, in a floorplanning step, macros are combined together into localized clusters of fixed size and shape and assigned to device regions to minimize placement cost. Following initial floorplanning, a routability and performance evaluator, based on wire length and static timing information, is used to determine if subsequent routing for a given target device is likely to complete successfully with pre-specified timing constraints. If this evaluation is pessimistic, low-temperature simulated annealing is performed on the contents of all soft macros in the design to allow for additional placement cost reduction, enhanced design routability, and improved design performance.
Mentor Graphics. DK Design Suite: Handel-C to FPGA for Algorithm Design. [Online].
Y Explorations. eXCite: C to RTL Behavioral Synthesis. [Online]. Available: http://www.yxi.com/products.php
H.-Y. Liu, M. Petracca, and L. P. Carloni, "Compositional system-level design exploration with planning of high-level synthesis," in Proceedings of the Conference on Design, Automation and Test in Europe. EDA Consortium, 2012, pp. 641-646.
V. Betz and J. Rose, "VPR: A new packing, placement and routing tool for FPGA research," in International Workshop on Field Programmable Logic and Applications, 1997.
M. Gort and J. Anderson, "Design re-use for compile time reduction in FPGA high-level synthesis flows," in International Conference on Field-Programmable Technology (FPT), 2014.
L. Shang, A. S. Kaviani, and K. Bathala, "Dynamic power consumption in Virtex-II FPGA family," in Proceedings of the ACM/SIGDA Tenth International Symposium on Field-Programmable Gate Arrays, 2002.