ABSTRACT: Computations might execute faster as circuits on a field-programmable gate array (FPGA) than as sequential instructions on a microprocessor because a circuit allows concurrency, from the bit to the process level. Several tools seek to compile popular microprocessor-oriented software programming languages to FPGAs. Key barriers to adoption include the difficulty of integrating such tools into established microprocessor software development flows and such tools' nonconformance to the standard binary concept that forms the basis of many computing domains' ecosystems. In warp processing, a compute platform transparently performs FPGA circuit compilation as a program's binary executes on a microprocessor.
ABSTRACT: Designing circuits for FPGAs involves challenges often distinct from designing circuits for ASICs. We describe efforts to convert a pattern counting circuit architecture, based on a pipelined binary tree and originally designed for ASIC implementation, into a circuit suitable for FPGAs. The original architecture, when mapped to a Spartan-3E FPGA, could process 10 million patterns per second and handle up to 4,096 patterns. The modified architecture could instead process 100 million patterns per second and handle up to 32,768 patterns, representing a 10x performance improvement and a 4x efficiency improvement. The redesign involved partitioning large memories into smaller ones at the expense of redundant control logic. Through this and other case studies, design patterns may emerge that aid designers in building high-performance, efficient circuits for FPGAs.
ABSTRACT: Due to the large contribution of the memory subsystem to total system power, the memory subsystem is highly amenable to customization for reduced power/energy and/or improved performance. Cache parameters such as total size, line size, and associativity can be specialized to the needs of an application for system optimization. In order to determine the best values for cache parameters, most methodologies utilize repetitious application execution to individually analyze each configuration explored. In this paper we propose a simplified yet efficient technique to accurately estimate the miss rate of many different cache configurations in just one single pass of execution. The approach utilizes simple data structures in the form of a multi-layered table and elementary bitwise operations to capture the locality characteristics of an application's addressing behavior. The proposed technique intends to ease miss rate estimation and reduce cache exploration time.
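The abstract does not detail the multi-layered table, so the following is only a hedged illustration of the single-pass idea: several cache configurations are evaluated during one traversal of an address trace, rather than re-executing the application once per configuration. The direct-mapped model and all names here are illustrative assumptions, not the paper's structure.

```python
# Single-pass miss-rate estimation sketch (hypothetical simplification):
# several direct-mapped cache configurations are evaluated during one
# traversal of the address trace, instead of one run per configuration.

def single_pass_miss_rates(trace, configs):
    """trace: iterable of byte addresses.
    configs: list of (total_size, line_size) pairs, powers of two."""
    # One tag store per configuration, all updated in the same pass.
    tag_stores = []
    for total_size, line_size in configs:
        n_sets = total_size // line_size
        tag_stores.append({"sets": [None] * n_sets,
                           "n_sets": n_sets,
                           "line": line_size,
                           "misses": 0})
    n_refs = 0
    for addr in trace:
        n_refs += 1
        for ts in tag_stores:
            block = addr // ts["line"]      # block address
            idx = block % ts["n_sets"]      # set index (direct-mapped)
            tag = block // ts["n_sets"]
            if ts["sets"][idx] != tag:      # miss: install the new tag
                ts["misses"] += 1
                ts["sets"][idx] = tag
    return [ts["misses"] / n_refs for ts in tag_stores]
```

For a sequential word trace, larger line sizes capture more spatial locality, so the second configuration below reports half the miss rate of the first.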
ABSTRACT: Synthesizing common sequential algorithms, captured in a language like C, to FPGA circuits is now well-known to provide dramatic speedups for numerous applications, and to provide tremendous portability and adaptability advantages over circuit implementations of an application. However, many applications targeted to FPGAs are still designed and distributed at the circuit level, due in part to tremendous human ingenuity being exercised at that level to achieve exceptional performance and efficiency. A question then arises as to whether applications for FPGAs will have to be distributed as circuits to achieve desired performance and efficiency, or if instead a more portable language like C might be used. Given a set of common synthesis transformations, we studied the extent to which circuits published in FCCM in the past 6 years could be captured as sequential code and then synthesized back to the published circuit. The study showed that a surprising 82% of the 35 circuits chosen for the study could be re-derived from some form of standard C code, suggesting that standard C code, without extensions, may be an effective means for distributing FPGA applications.
ABSTRACT: Various commercial programmable compute platforms have their processor architecture enhanced with field-programmable gate arrays (FPGAs). In a common usage scenario, an application loads custom processors into the FPGA to speed up application execution compared to processor-only execution. Transient applications, changing application workloads, and limited FPGA capacity have led to a new problem of operating-system-controlled dynamic management of the loading of coprocessors into the FPGAs for best overall performance or energy. We define the Dynamic Coprocessor Management problem and provide a mapping to an online optimization problem known as Metrical Task Systems (MTS). We introduce a robust heuristic, called the adjusting cumulative benefit (ACBenefit) heuristic, that outperforms other heuristics, including a previously developed one for MTS. For two distinct application sets, we generate numerous workloads and show that the ACBenefit heuristic provides the best results across all considered workloads. In our simulations, the heuristic's results were within 9% of the offline optimal for performance, and within 3% for energy. The heuristic may be applicable to a wide variety of dynamic architecture management problems.
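The ACBenefit heuristic itself is not specified in the abstract; as a hedged sketch of the general cumulative-benefit idea behind such online policies, a coprocessor can be loaded once the speedup it would have delivered exceeds the reconfiguration cost, in the spirit of ski-rental online algorithms. Every name and the FIFO eviction rule below are illustrative assumptions.

```python
# Simplified cumulative-benefit loading policy (illustrative sketch only;
# the paper's ACBenefit heuristic is more elaborate). A coprocessor is
# loaded once the benefit it would have delivered since its last
# eviction exceeds the FPGA reconfiguration cost (ski-rental style).

def run_policy(requests, benefit, load_cost, capacity=1):
    """requests: sequence of application ids.
    benefit: dict mapping id -> time saved per request when loaded.
    Returns total time saved minus reconfiguration costs."""
    loaded = []                # ids currently in the FPGA (FIFO eviction)
    pending = {}               # id -> benefit forgone since last eviction
    net = 0.0
    for app in requests:
        if app in loaded:
            net += benefit[app]            # accelerator already resident
        else:
            pending[app] = pending.get(app, 0.0) + benefit[app]
            if pending[app] >= load_cost:  # forgone benefit pays for load
                if len(loaded) >= capacity:
                    loaded.pop(0)          # evict oldest (simplification)
                loaded.append(app)
                pending[app] = 0.0
                net -= load_cost
    return net
```

With ten repeated requests for one application, the policy waits three requests (accumulating forgone benefit equal to the load cost), pays the load cost, then profits from the remaining requests.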
ABSTRACT: We conducted a study of citations of papers published between 1996 and 2006 in the CODES and ISSS conferences, representing the hardware/software codesign and system synthesis community. Citations, meaning non-self-citations only, were considered from all papers known to Google Scholar, as well as only from subsequent CODES/ISSS papers. We list the most-cited CODES/ISSS papers of each year, summarizing their topics and discussing common features of those papers. For comparison purposes, we also measured citations for the computer architecture community's ISCA and MICRO conferences, and for the field-programmable gate array community's FPGA and FCCM conferences. We point out several interesting differences among the citation patterns of the three communities.
ABSTRACT: Architectures with software-writable parameters, or configurable architectures, enable runtime reconfiguration of computing platforms to the applications they execute. Such dynamic tuning can improve application performance or energy. However, reconfiguring incurs a temporary performance cost. Thus, online algorithms are needed that decide when to reconfigure and which configuration to choose such that overall performance is optimized. We introduce the adaptive weighted window (AWW) algorithm, and compare it with several other algorithms, including algorithms previously developed by the online algorithm community. We describe experiments showing that AWW results are within 4% of the offline optimal on average. AWW outperforms the other algorithms, and is robust across three datasets and across three categories of application sequences. AWW improves on a non-dynamic approach by 6% on average, and by up to 30% in low-reconfiguration-time situations.
ABSTRACT: Expanding the software concept to spatial models like circuits facilitates programming next-generation embedded systems. Today, embedded-system designers frequently supplement microprocessors with custom digital circuits, often called coprocessors or accelerators, to meet performance demands. A circuit is simply a connection of components, perhaps low-level components like logic gates, or higher-level components like controllers, arithmetic logic units, encoding engines, or even processors. Increasingly, designers implement those circuits on an FPGA, a prefabricated chip that they can configure to implement a particular circuit merely by downloading a particular sequence of bits. Therefore, a circuit implemented on an FPGA is literally software. The key to an FPGA's ability to implement a circuit as software is that an N-address-input memory can implement any N-input combinational circuit.
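The closing observation, that a memory can implement any combinational function, is the principle behind FPGA lookup tables (LUTs), and can be demonstrated with a small sketch: the N input bits form a memory address, and each word stores the function's output for that input combination. The helper names below are illustrative.

```python
# A 2^N-entry memory implements any N-input combinational function:
# the N inputs form the address, and each memory word stores the
# function's output for that input combination. Here a 4-entry
# "memory" (a Python list) implements 2-input XOR, the principle
# behind FPGA lookup tables (LUTs).

def make_lut(truth_table):
    """truth_table: list of outputs, indexed by input bits as an address."""
    memory = list(truth_table)
    def lut(*bits):
        addr = 0
        for b in bits:            # pack the input bits into an address
            addr = (addr << 1) | b
        return memory[addr]       # one memory read = one evaluation
    return lut

xor2 = make_lut([0, 1, 1, 0])     # outputs for inputs 00, 01, 10, 11
```

Changing the downloaded bits (the truth table) changes the implemented circuit without changing the hardware, which is exactly why an FPGA circuit can be treated as software.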
ABSTRACT: Modern FPGAs' parallel computing capability and their ability to be reconfigured make them an ideal platform on which to build accelerators for supercomputing systems. As a multi-core processor, the recently announced Cell Broadband Engine™ offers tremendous computing power. In this paper, we introduce a prototype system that combines these two types of computing devices together in a reconfigurable blade, and we describe its architecture, memory system, and numerous interfaces. On the reconfigurable blade it is desirable that the FPGA devices can be partially reconfigured at run-time. This paper presents the dynamic partial reconfiguration (DPR) technique and its design flow for the reconfigurable blade. We report our experimental results of the blade performing partial reconfiguration. DPR allows the reconfigurable blade to be a powerful, run-time changeable computing engine. A sample application is presented that was both simulated for the Cell processor and dynamically loaded to run on the FPGA.
ABSTRACT: Recent high-level synthesis approaches and C-based hardware description languages attempt to improve the hardware design process by allowing developers to capture desired hardware functionality in a well-known high-level source language. However, these approaches have yet to achieve wide commercial success due in part to the difficulty of incorporating such approaches into software tool flows. The requirement of using a specific language, compiler, or development environment may cause many software developers to resist such approaches due to the difficulty and possible instability of changing well-established robust tool flows. Thus, in the past several years, synthesis from binaries has been introduced, both in research and in commercial tools, as a means of better integrating with tool flows by supporting all high-level languages and software compilers. Binary synthesis can be more easily integrated into a software development tool flow by only requiring an additional backend tool, and it even enables completely transparent dynamic translation of executing binaries to configurable hardware circuits. In this article, we survey the key technologies underlying the important emerging field of binary synthesis. We compare binary synthesis to several related areas of research, and we then describe the key technologies required for effective binary synthesis: decompilation techniques necessary for binary synthesis to achieve results competitive with source-level synthesis, hardware/software partitioning methods necessary to find critical binary regions suitable for synthesis, synthesis methods for converting regions to custom circuits, and binary update methods that enable replacement of critical binary regions by circuits.
Article · Aug 2007 · ACM Transactions on Design Automation of Electronic Systems
ABSTRACT: The memory hierarchy of a system can consume up to 50% of microprocessor system power. Previous work has shown that tuning a configurable cache to a particular application can reduce memory subsystem energy by 62% on average. We introduce a self-tuning cache that performs transparent runtime cache tuning, thus relieving the application designer and/or compiler from predetermining an application's cache configuration. The self-tuning cache applies tuning at a determined tuning interval. A good interval balances tuning process energy overhead against the energy overhead of running in a sub-optimal cache configuration, which we show wastes much energy. We present a self-tuning cache that dynamically varies the tuning interval, resulting in average energy reduction of as much as 29%, falling within 13% of an oracle-based optimal method.
ABSTRACT: The power consumed by the memory hierarchy of a microprocessor can contribute as much as 50% of the total microprocessor system power, and is thus a good candidate for power and energy optimizations. We discuss four methods for tuning a microprocessor's cache subsystem to the needs of any executing application for low-energy embedded systems. We introduce on-chip hardware implementing an efficient cache tuning heuristic that can automatically, transparently, and dynamically tune a configurable level-one cache's total size, associativity, and line size to an executing application. We extend the single-level cache tuning heuristic to a two-level cache using a methodology applicable to both a simulation-based exploration environment and a hardware-based system prototyping environment. We show that a victim buffer can be very effective as a configurable parameter in a memory hierarchy. We reduce the static energy dissipation of an on-chip data cache by compressing the frequent values that widely exist in a data cache.
ABSTRACT: Modern systems-on-a-chip platforms support multiple clock domains, in which different sub-circuits are driven by different clock signals. Although the frequency of each domain can be customized, the number of unique clock frequencies on a platform is typically limited. We define the clock-frequency assignment problem to be the assignment of frequencies to processing modules, each with an ideal maximum frequency, such that the sum of module processing times is minimized, subject to a limit on the number of unique frequencies. We develop a novel polynomial-time optimal algorithm to solve the problem, based on dynamic programming. We apply the algorithm to the particular context of post-improvement of accelerator-based hardware/software partitioning, and demonstrate 1.5x-4x additional speedups using just three clock domains.
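As a hedged sketch of a dynamic-programming formulation consistent with the problem statement (the cost model here is an assumption, not the paper's exact algorithm): give each module a workload in cycles and an ideal maximum frequency; a module assigned frequency g no greater than its maximum takes cycles/g time. After sorting modules by maximum frequency, each frequency group in an optimal solution is contiguous and runs at the smallest maximum frequency in the group, which a standard partition DP exploits.

```python
# Dynamic-programming sketch of clock-frequency assignment (assumed cost
# model): module i has workload cycles[i] and ideal max frequency fmax[i];
# at most k distinct frequencies may exist, and a module at frequency
# g <= fmax[i] takes cycles[i] / g time.

def assign_frequencies(modules, k):
    """modules: list of (cycles, fmax). Returns the minimal total time."""
    mods = sorted(modules, key=lambda m: m[1])   # ascending fmax
    n = len(mods)
    prefix = [0.0] * (n + 1)                     # prefix sums of cycles
    for i, (c, _) in enumerate(mods):
        prefix[i + 1] = prefix[i] + c

    def group_cost(j, i):
        # Modules j..i-1 share one clock; it cannot exceed any member's
        # fmax, so the group runs at mods[j][1], the smallest fmax.
        return (prefix[i] - prefix[j]) / mods[j][1]

    INF = float("inf")
    # dp[i][d]: min total time for the first i modules using d frequencies.
    dp = [[INF] * (k + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for d in range(1, k + 1):
            for j in range(i):                   # last group is j..i-1
                if dp[j][d - 1] < INF:
                    dp[i][d] = min(dp[i][d], dp[j][d - 1] + group_cost(j, i))
    return min(dp[n][d] for d in range(1, k + 1))
```

For two modules of 100 cycles each with maximum frequencies 10 and 100, one shared clock forces both to frequency 10 (total time 20), while two clocks let each run at its ideal frequency (total time 11), illustrating the speedup extra clock domains buy.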
ABSTRACT: We introduce a new non-intrusive on-chip cache-tuning hardware module capable of accurately predicting the best configuration of a configurable cache for an executing application. Previous dynamic cache tuning approaches change the cache configuration several times as part of the tuning search process, executing the application using inferior configurations and temporarily causing energy and performance overhead. The introduced tuner uses a different approach, which non-intrusively collects data on addresses issued by the microprocessor, analyzes that data to predict the best cache configuration, and then updates the cache to the new best configuration in "one shot", without ever having to examine inferior configurations. The result is less energy and less performance overhead, meaning that cache tuning can be applied more frequently. We show through experiments that the one-shot cache tuner can reduce memory-access-related energy for instructions by 35%, coming within 4% of a previous intrusive approach, and results in 4.6x less energy overhead and a 7.7x speedup in tuning time compared to that intrusive approach, at the main expense of 12% larger size.
ABSTRACT: The integration of microprocessors and field-programmable gate array (FPGA) fabric on a single chip increases both the utility and necessity of tools that automatically move software functions from the microprocessor to accelerators on the FPGA to improve performance or energy. Such hardware/software partitioning for modern FPGAs involves the problem of partitioning functions among two levels of accelerator groups: tightly-coupled accelerators that have fast single-clock-cycle memory access to the microprocessor's memory, and loosely-coupled accelerators that access memory through a bridge to avoid slowing the main clock period with their longer critical paths. We introduce this new two-level accelerator-partitioning problem and describe a novel optimal dynamic programming algorithm to solve it. By making use of the size constraint imposed by FPGAs, the algorithm has what is effectively quadratic runtime complexity, running in just a few seconds for examples with up to 25 accelerators, and obtaining an average performance improvement of 35% compared to a traditional single-level bus architecture.
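The FPGA size constraint that makes the dynamic program tractable can be illustrated with a simplified knapsack-style sketch (the area model, timing numbers, and single shared budget below are assumptions, not the paper's exact formulation): each function either stays in software or becomes a tightly- or loosely-coupled accelerator, with total accelerator area bounded by the FPGA capacity.

```python
# Illustrative knapsack-style DP for two-level accelerator partitioning
# (a simplification of the abstract's algorithm). Each function may stay
# in software (area 0), or become a tightly- or loosely-coupled
# accelerator, each option with its own (time, area); total accelerator
# area must fit the FPGA budget.

def partition(funcs, area_budget):
    """funcs: list of dicts with keys 'sw', 'tight', 'loose', each a
    (time, area) pair. Returns the minimal total execution time."""
    # dp[a]: min total time for the functions processed so far,
    # using at most a units of FPGA area.
    dp = [0.0] * (area_budget + 1)
    for f in funcs:
        ndp = [float("inf")] * (area_budget + 1)
        for a in range(area_budget + 1):
            for time, area in (f["sw"], f["tight"], f["loose"]):
                if area <= a:
                    ndp[a] = min(ndp[a], dp[a - area] + time)
        dp = ndp
    return dp[area_budget]
```

In the example below, the budget of 7 area units cannot fit both functions as tightly-coupled accelerators, so the optimum mixes one tight and one loose implementation; a larger budget of 9 admits the all-tight solution.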
ABSTRACT: In this paper, we present a software compilation approach for microprocessor/FPGA platforms that partitions a software binary onto custom hardware implemented in the FPGA. Our approach imposes fewer restrictions on software tool flow than previous compiler approaches, allowing software designers to use any software language and compiler. Our approach uses a back-end partitioning tool that utilizes decompilation techniques to recover important high-level information, resulting in performance comparable to high-level compiler-based approaches.
ABSTRACT: Hardware/software partitioning moves software kernels from a microprocessor to custom hardware accelerators. We consider advanced implementation options for accelerators, greatly increasing the partitioning solution space. One option tightly or loosely couples each accelerator with the microprocessor. Another option assigns a clock frequency to each accelerator, with a limit on the number of distinct frequencies. We previously presented efficient optimal solutions to each of those sub-problems independently. In this paper, we introduce heuristics to solve the two sub-problems in an integrated manner. The heuristics run in just seconds for large examples, yielding 2x additional speedup versus the independent solutions, for a total average speedup 5x greater than partitioning with a single coupling and single frequency.
ABSTRACT: We present a dynamic optimization technique, thread warping, that uses a single processor on a multiprocessor system to dynamically synthesize threads into custom accelerator circuits on FPGAs (field-programmable gate arrays). Building on dynamic synthesis for single-processor single-thread systems, known as warp processing, thread warping improves the performance of multiprocessor systems by speeding up individual threads and by allowing more threads to execute concurrently. Furthermore, thread warping maintains the important separation of function from architecture, enabling portability of applications to architectures with different quantities of microprocessors and FPGA resources, an advantage not shared by static compilation/synthesis approaches. We introduce a framework of architecture, CAD tools, and an operating system that together support thread warping. We summarize experiments on an extensive architectural simulation framework we developed, showing application speedups of 4x to 502x, averaging 130x compared to a multiprocessor system having four ARM11 microprocessors, for eight benchmark applications. Even compared to a 64-processor system, thread warping achieves 11x speedup.
ABSTRACT: Field-programmable gate arrays (FPGAs) provide designers with the ability to quickly create hardware circuits. Increases in FPGA configurable logic capacity and decreasing FPGA costs have enabled designers to more readily incorporate FPGAs in their designs. FPGA vendors have begun providing configurable soft processor cores that can be synthesized onto their FPGA products. While FPGAs with soft processor cores provide designers with increased flexibility, such processors typically have degraded performance and energy consumption compared to hard-core processors. Previously, we proposed warp processing, a technique capable of optimizing a software application by dynamically and transparently re-implementing critical software kernels as custom circuits in on-chip configurable logic. In this paper, we study the potential of a MicroBlaze soft-core based warp processing system to eliminate the performance and energy overhead of a soft-core processor compared to a hard-core processor. We demonstrate that the soft-core based warp processor achieves average speedups of 5.8x and energy reductions of 57% compared to the soft core alone. Our data shows that a soft-core based warp processor yields performance and energy consumption competitive with existing hard-core processors, thus expanding the usefulness of soft processor cores on FPGAs to a broader range of applications.
ABSTRACT: Parameterized components are becoming more commonplace in system design. The process of customizing parameter values for a particular application, called tuning, can be a challenging task for a designer. Here we focus on the problem of tuning a parameterized soft-core microprocessor to achieve the best performance on a particular application, subject to size constraints. We map the tuning problem to a well-established statistical paradigm called Design of Experiments (DoE), which involves the design of a carefully selected set of experiments and a sophisticated analysis that aims to extract the maximum amount of information about the effects of the input parameters on the experiment. We apply the DoE method to analyze the relation between input parameters and the performance of a soft-core microprocessor for a particular application, using only a small number of synthesis/execution runs. The information gained by the analysis in turn drives a soft-core tuning heuristic. We show that using DoE to sort the parameters in order of impact results in application speedups of 6x-17x versus an un-tuned base soft core. When compared to a previous single-factor tuning method, the DoE-based method achieves 3x-6x application speedups, while requiring about the same tuning runtime. We also show that tuning runtime can be reduced by 40-45% by using predictive tuning methods already built into a DoE tool.
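The DoE step of sorting parameters by impact can be sketched with a simple main-effects computation (a hypothetical illustration, not the paper's DoE tool): for each two-level parameter, compare mean measured performance at its high and low settings across a set of runs, then order parameters by the magnitude of that difference.

```python
# Hypothetical main-effects sketch of the DoE parameter-ranking step:
# a parameter's main effect is the difference between mean performance
# at its high (1) and low (0) settings over the experiment runs.

def rank_by_main_effect(runs, params):
    """runs: list of (settings dict, measured performance) pairs.
    Returns params sorted from most to least impactful."""
    effects = {}
    for p in params:
        hi = [perf for s, perf in runs if s[p] == 1]
        lo = [perf for s, perf in runs if s[p] == 0]
        effects[p] = abs(sum(hi) / len(hi) - sum(lo) / len(lo))
    return sorted(params, key=lambda p: effects[p], reverse=True)
```

A tuning heuristic can then explore the highest-impact parameters first, which is the ordering the abstract credits for its speedups.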