Frank Vahid

University of California, Riverside, Riverside, California, United States

Publications (240) · 44.25 Total impact

  • Frank Vahid
    ABSTRACT: Expanding the software concept to spatial models like circuits facilitates programming next-generation embedded systems. Today, embedded-system designers frequently supplement microprocessors with custom digital circuits, often called coprocessors or accelerators, to meet performance demands. A circuit is simply a connection of components, perhaps low-level components like logic gates, or higher-level components like controllers, arithmetic logic units, encoding engines, or even processors. Increasingly, designers implement those circuits on an FPGA, a prefabricated chip that they can configure to implement a particular circuit merely by downloading a particular sequence of bits. Therefore, a circuit implemented on an FPGA is literally software. The key to an FPGA's ability to implement a circuit as software is that an N-address-input memory can implement any N-input combinational circuit.
    Computer 10/2007; DOI:10.1109/MC.2007.322 · 1.44 Impact Factor
  • ABSTRACT: Modern FPGAs' parallel computing capability and their ability to be reconfigured make them an ideal platform to build accelerators for supercomputing systems. As a multi-core processor, the recently announced Cell Broadband Engine™ offers tremendous computing power. In this paper, we introduce a prototype system that combines these two types of computing devices together in a reconfigurable blade, and we describe its architecture, memory system and abundant interfaces. On the reconfigurable blade it is desirable that the FPGA devices can be partially reconfigured at run-time. This paper presents the dynamic partial reconfiguration (DPR) technique and its design flow for the reconfigurable blade. We report our experimental results of the blade doing partial reconfiguration. DPR allows the reconfigurable blade to be a powerful, run-time changeable computing engine. A sample application is presented that was both simulated for the Cell processor and dynamically loaded to run on the FPGA.
    Field Programmable Logic and Applications, 2007. FPL 2007. International Conference on; 09/2007
  • Greg Stitt, Frank Vahid
    ABSTRACT: Recent high-level synthesis approaches and C-based hardware description languages attempt to improve the hardware design process by allowing developers to capture desired hardware functionality in a well-known high-level source language. However, these approaches have yet to achieve wide commercial success due in part to the difficulty of incorporating such approaches into software tool flows. The requirement of using a specific language, compiler, or development environment may cause many software developers to resist such approaches due to the difficulty and possible instability of changing well-established robust tool flows. Thus, in the past several years, synthesis from binaries has been introduced, both in research and in commercial tools, as a means of better integrating with tool flows by supporting all high-level languages and software compilers. Binary synthesis can be more easily integrated into a software development tool-flow by only requiring an additional backend tool, and it even enables completely transparent dynamic translation of executing binaries to configurable hardware circuits. In this article, we survey the key technologies underlying the important emerging field of binary synthesis. We compare binary synthesis to several related areas of research, and we then describe the key technologies required for effective binary synthesis: decompilation techniques necessary for binary synthesis to achieve results competitive with source-level synthesis, hardware/software partitioning methods necessary to find critical binary regions suitable for synthesis, synthesis methods for converting regions to custom circuits, and binary update methods that enable replacement of critical binary regions by circuits.
    ACM Transactions on Design Automation of Electronic Systems 08/2007; 12. DOI:10.1145/1255456.1255471 · 0.52 Impact Factor
  • Ann Gordon-Ross, Frank Vahid
    ABSTRACT: The memory hierarchy of a system can consume up to 50% of microprocessor system power. Previous work has shown that tuning a configurable cache to a particular application can reduce memory subsystem energy by 62% on average. We introduce a self-tuning cache that performs transparent runtime cache tuning, thus relieving the application designer and/or compiler from predetermining an application's cache configuration. The self-tuning cache applies tuning at a determined tuning interval. A good interval balances tuning process energy overhead against the energy overhead of running in a sub-optimal cache configuration, which we show wastes much energy. We present a self-tuning cache that dynamically varies the tuning interval, resulting in average energy reduction of as much as 29%, falling within 13% of an oracle-based optimal method.
    Design Automation Conference, 2007. DAC '07. 44th ACM/IEEE; 07/2007
  • ABSTRACT: The power consumed by the memory hierarchy of a microprocessor can contribute to as much as 50% of the total microprocessor system power, and is thus a good candidate for power and energy optimizations. We discuss four methods for tuning a microprocessor's cache subsystem to the needs of any executing application for low-energy embedded systems. We introduce on-chip hardware implementing an efficient cache tuning heuristic that can automatically, transparently, and dynamically tune a configurable level-one cache's total size, associativity and line size to an executing application. We extend the single-level cache tuning heuristic for a two-level cache using a methodology applicable to both a simulation-based exploration environment and a hardware-based system prototyping environment. We show that a victim buffer can be very effective as a configurable parameter in a memory hierarchy. We reduce static energy dissipation of on-chip data cache by compressing the frequent values that widely exist in a data cache memory.
    05/2007: pages 103-122;
  • ABSTRACT: The integration of microprocessors and field-programmable gate array (FPGA) fabric on a single chip increases both the utility and necessity of tools that automatically move software functions from the microprocessor to accelerators on the FPGA to improve performance or energy. Such hardware/software partitioning for modern FPGAs involves the problem of partitioning functions among two levels of accelerator groups - tightly-coupled accelerators that have fast single-clock-cycle memory access to the microprocessor's memory, and loosely-coupled accelerators that access memory through a bridge to avoid slowing the main clock period with their longer critical paths. This new two-level accelerator-partitioning problem was introduced, and a novel optimal dynamic programming algorithm was described to solve the problem. By making use of the size constraint imposed by FPGAs, the algorithm has what is effectively quadratic runtime complexity, running in just a few seconds for examples with up to 25 accelerators, obtaining an average performance improvement of 35% compared to a traditional single-level bus architecture.
  • Greg Stitt, Frank Vahid
    ABSTRACT: In this paper, we present a software compilation approach for microprocessor/FPGA platforms that partitions a software binary onto custom hardware implemented in the FPGA. Our approach imposes fewer restrictions on software tool flow than previous compiler approaches, allowing software designers to use any software language and compiler. Our approach uses a back-end partitioning tool that utilizes decompilation techniques to recover important high-level information, resulting in performance comparable to high-level compiler-based approaches.
  • Scott Sirowy, Frank Vahid
    ABSTRACT: Hardware/software partitioning moves software kernels from a microprocessor to custom hardware accelerators. We consider advanced implementation options for accelerators, greatly increasing the partitioning solution space. One option tightly or loosely couples each accelerator with the microprocessor. Another option assigns a clock frequency to each accelerator, with a limit on the number of distinct frequencies. We previously presented efficient optimal solutions to each of those sub-problems independently. In this paper, we introduce heuristics to solve the two sub-problems in an integrated manner. The heuristics run in just seconds for large examples, yielding 2x additional speedup versus the independent solutions, for a total average speedup 5x greater than partitioning with a single coupling and single frequency.
  • ABSTRACT: Modern systems-on-a-chip platforms support multiple clock domains, in which different sub-circuits are driven by different clock signals. Although the frequency of each domain can be customized, the number of unique clock frequencies on a platform is typically limited. We define the clock-frequency assignment problem to be the assignment of frequencies to processing modules, each with an ideal maximum frequency, such that the sum of module processing times is minimized, subject to a limit on the number of unique frequencies. We develop a novel polynomial-time optimal algorithm to solve the problem, based on dynamic programming. We apply the algorithm to the particular context of post-improvement of accelerator-based hardware/software partitioning, and demonstrate 1.5x-4x additional speedups using just three clock domains.
  • Greg Stitt, Frank Vahid
    ABSTRACT: We present a dynamic optimization technique, thread warping, that uses a single processor on a multiprocessor system to dynamically synthesize threads into custom accelerator circuits on FPGAs (field-programmable gate arrays). Building on dynamic synthesis for single-processor single-thread systems, known as warp processing, thread warping improves performance of multiprocessor systems by speeding up individual threads and by allowing more threads to execute concurrently. Furthermore, thread warping maintains the important separation of function from architecture, enabling portability of applications to architectures with different quantities of microprocessors and FPGA, an advantage not shared by static compilation/synthesis approaches. We introduce a framework of architecture, CAD tools, and operating system that together support thread warping. We summarize experiments on an extensive architectural simulation framework we developed, showing application speedups of 4x to 502x, averaging 130x compared to a multiprocessor system having four ARM11 microprocessors, for eight benchmark applications. Even compared to a 64-processor system, thread warping achieves 11x speedup.
    Proceedings of the 5th International Conference on Hardware/Software Codesign and System Synthesis, CODES+ISSS 2007, Salzburg, Austria, September 30 - October 3, 2007; 01/2007
  • Roman Lysecky, Frank Vahid
    ABSTRACT: Field programmable gate arrays (FPGAs) provide designers with the ability to quickly create hardware circuits. Increases in FPGA configurable logic capacity and decreasing FPGA costs have enabled designers to more readily incorporate FPGAs in their designs. FPGA vendors have begun providing configurable soft processor cores that can be synthesized onto their FPGA products. While FPGAs with soft processor cores provide designers with increased flexibility, such processors typically have degraded performance and energy consumption compared to hard-core processors. Previously, we proposed warp processing, a technique capable of optimizing a software application by dynamically and transparently re-implementing critical software kernels as custom circuits in on-chip configurable logic. In this paper, we study the potential of a MicroBlaze soft-core based warp processing system to eliminate the performance and energy overhead of a soft-core processor compared to a hard-core processor. We demonstrate that the soft-core based warp processor achieves average speedups of 5.8 and energy reductions of 57% compared to the soft core alone. Our data shows that a soft-core based warp processor yields performance and energy consumption competitive with existing hard-core processors, thus expanding the usefulness of soft processor cores on FPGAs to a broader range of applications.
  • ABSTRACT: Parameterized components are becoming more commonplace in system design. The process of customizing parameter values for a particular application, called tuning, can be a challenging task for a designer. Here we focus on the problem of tuning a parameterized soft-core microprocessor to achieve the best performance on a particular application, subject to size constraints. We map the tuning problem to a well-established statistical paradigm called Design of Experiments (DoE), which involves the design of a carefully selected set of experiments and a sophisticated analysis that has the objective to extract the maximum amount of information about the effects of the input parameters on the experiment. We apply the DoE method to analyze the relation between input parameters and the performance of a soft-core microprocessor for a particular application, using only a small number of synthesis/execution runs. The information gained by the analysis in turn drives a soft-core tuning heuristic. We show that using DoE to sort the parameters in order of impact results in application speedups of 6x-17x versus an un-tuned base soft-core. When compared to a previous single-factor tuning method, the DoE-based method achieves 3x-6x application speedups, while requiring about the same tuning runtime. We also show that tuning runtime can be reduced by 40-45% by using predictive tuning methods already built into a DoE tool.
    2007 Design, Automation and Test in Europe Conference and Exposition (DATE 2007), April 16-20, 2007, Nice, France; 01/2007
  • ABSTRACT: We introduce a new non-intrusive on-chip cache-tuning hardware module capable of accurately predicting the best configuration of a configurable cache for an executing application. Previous dynamic cache tuning approaches change the cache configuration several times as part of the tuning search process, executing the application using inferior configurations and temporarily causing energy and performance overhead. The introduced tuner uses a different approach, which non-intrusively collects data on addresses issued by the microprocessor, analyzes that data to predict the best cache configuration, and then updates the cache to the new best configuration in "one-shot", without ever having to examine inferior configurations. The result is less energy and less performance overhead, meaning that cache tuning can be applied more frequently. We show through experiments that the one-shot cache tuner can reduce memory-access related energy for instructions by 35% and comes within 4% of a previous intrusive approach, and results in 4.6 times less energy overhead and a 7.7 times speedup in tuning time compared to a previous intrusive approach, at the main expense of 12% larger size.
    2007 Design, Automation and Test in Europe Conference and Exposition (DATE 2007), April 16-20, 2007, Nice, France; 01/2007
  • D. Sheldon, F. Vahid, S. Lonardi
    ABSTRACT: Parameterized components are becoming more commonplace in system design. The process of customizing parameter values for a particular application, called tuning, can be a challenging task for a designer. Here we focus on the problem of tuning a parameterized soft-core microprocessor to achieve the best performance on a particular application, subject to size constraints. We map the tuning problem to a well-established statistical paradigm called design of experiments (DoE), which involves the design of a carefully selected set of experiments and a sophisticated analysis that has the objective to extract the maximum amount of information about the effects of the input parameters on the experiment. We apply the DoE method to analyze the relation between input parameters and the performance of a soft-core microprocessor for a particular application, using only a small number of synthesis/execution runs. The information gained by the analysis in turn drives a soft-core tuning heuristic. We show that using DoE to sort the parameters in order of impact results in application speedups of 6x-17x versus an un-tuned base soft-core. When compared to a previous single-factor tuning method, the DoE-based method achieves 3x-6x application speedups, while requiring about the same tuning runtime. We also show that tuning runtime can be reduced by 40-45% by using predictive tuning methods already built into a DoE tool.
  • ABSTRACT: Soft-core microprocessors mapped onto field-programmable gate arrays (FPGAs) represent an increasingly common embedded software implementation option. Modern FPGA soft-cores are parameterized to support application-specific customization, wherein pre-defined units, such as a multiplication unit or floating-point unit, may be included in the microprocessor architecture to speed up software execution at the expense of increased size. We introduce a methodology for fast application-specific customization of a parameterized FPGA soft core, using synthesis and execution to obtain size and performance data in order to create a tool that can be used across a variety of tool platforms and FPGA devices. As synthesizing a soft core takes tens of minutes, developing heuristics that execute in an acceptable time of an hour or two, yet find near-optimal results, is a challenge. We consider two approaches, one using a traditional CAD approach that does an initial characterization using synthesis to create an abstract problem model and then explores the solution space using a knapsack algorithm, and the other using a synthesis-in-the-loop exploration approach. We compare approaches for a variety of design constraints, on 11 EEMBC benchmarks, using an actual Xilinx soft-core processor, and for two different commercial Xilinx FPGA devices. Our results show that the approaches can generate a customized configuration exhibiting roughly 2x speedups over a base soft core, reaching within 4% of optimal in about 1.5 hours, including complete synthesis of the soft-core onto the FPGA, compared to over 11 hours for exhaustive search. Our results also show that including synthesis-in-the-loop, compared to a traditional CAD approach, improved speedups by an average of 20% when size constraints were tight. The approaches may also be applicable to soft-core processors targeted to ASICs in addition to FPGAs.
    2006 International Conference on Computer-Aided Design (ICCAD'06), November 5-9, 2006, San Jose, CA, USA; 11/2006
  • Greg Stitt, Frank Vahid, Walid A. Najjar
    ABSTRACT: Although many recent advances have been made in hardware synthesis techniques from software programming languages such as C, the performance of synthesized hardware commonly suffers due to the use of C constructs and coding practices that are not appropriate for hardware. Most previous approaches to addressing this problem require drastic changes to coding practice. We present an approach that instead requires only minimal changes but yields significant speedups. In this approach, a software developer initially writes C code as they normally would, and then applies simple refinement guidelines to only the performance-critical code regions, which are the regions most likely to be synthesized to hardware. Alternatively, if a designer is aware of performance-critical parts of the application, the guidelines could be followed during development. In this study, we analyze dozens of embedded benchmarks to determine the most common C coding practices that limit hardware performance, and introduce coding guidelines to make the code more amenable to synthesis. Those guidelines typically require minimal coding effort, generally consisting of less than ten lines of code for each guideline. The guidelines typically represent modifications that require designer knowledge, making the guidelines difficult or impossible for synthesis tools to automate. We apply these guidelines to six benchmarks, resulting in average speedups of 3.5x compared to synthesis from the original code with a negligible software size and performance overhead.
    2006 International Conference on Computer-Aided Design (ICCAD'06), November 5-9, 2006, San Jose, CA, USA; 11/2006
  • ABSTRACT: We describe a new processing architecture, known as a warp processor, that utilizes a field-programmable gate array (FPGA) to improve the speed and energy consumption of a software binary executing on a microprocessor. Unlike previous approaches that also improve software using an FPGA but do so using a special compiler, a warp processor achieves these improvements completely transparently and operates from a standard binary. A warp processor dynamically detects the binary's critical regions, reimplements those regions as a custom hardware circuit in the FPGA, and replaces the software region by a call to the new hardware implementation of that region. While not all benchmarks can be improved using warp processing, many can, and the improvements are dramatically better than those achievable by more traditional architecture improvements. The hardest part of warp processing is that of dynamically reimplementing code regions on an FPGA, requiring partitioning, decompilation, synthesis, placement, and routing tools, all having to execute with minimal computation time and data memory so as to coexist on chip with the main processor. We describe the results of developing our warp processor. We developed a custom FPGA fabric specifically designed to enable lean place and route tools, and we developed extremely fast and efficient versions of partitioning, decompilation, synthesis, technology mapping, placement, and routing. Warp processors achieve overall application speedups of 6.3X with energy savings of 66% across a set of embedded benchmark applications. We further show that our tools utilize acceptably small amounts of computation and memory which are far less than traditional tools. Our work illustrates the feasibility and potential of warp processing, and we can foresee the possibility of warp processing becoming a feature in a variety of computing domains, including desktop, server, and embedded applications.
    ACM Transactions on Design Automation of Electronic Systems 08/2006; 11:659-681. DOI:10.1145/996566.1142986 · 0.52 Impact Factor
  • ACM Transactions on Design Automation of Electronic Systems 07/2006; 11:659-681. DOI:10.1145/1142980.1142986 · 0.52 Impact Factor
  • ABSTRACT: Soft-core programmable processors on field-programmable gate arrays (FPGAs) can be custom synthesized to instantiate only those hardware units, such as multipliers and floating-point units, that an application requires to meet performance demands, thus minimizing soft-core size on the FPGA. Conjoining processors, meaning to share hardware units among two or more processors, can further reduce soft-core size, leaving more resources for other circuits such as custom coprocessors. Using Xilinx MicroBlaze coprocessors and standard embedded system benchmarks, we show that conjoining two processors can provide 16% processor size reductions on average, with less than 1% cycle count overhead. We introduce an efficient dynamic-programming-based exploration method to find the best custom instantiation of hardware units, considering both standalone and conjoined options, for soft-core processors.
    2006 International Conference on Computer-Aided Design (ICCAD'06), November 5-9, 2006, San Jose, CA, USA; 01/2006
  • Susan Lysecky, Frank Vahid
    ABSTRACT: We describe a set of fixed-function and programmable blocks, eBlocks, previously developed to provide non-programming, non-electronics experts the ability to construct and customize basic embedded computing systems. We present a novel and powerful tool that, combined with these building blocks, enables end-users to automatically generate an optimized physical implementation derived from a virtual system function description. Furthermore, the tool allows the end-user to specify optimization criteria and constraint libraries that guide the tool in generating a suitable physical implementation, without requiring the end-user to have prior programming or electronics experience. We summarize experiments illustrating the ability of the tool to generate physical implementations corresponding to various end-user defined goals. The tool enables end-users having little or no electronics or programming experience to build useful customized basic sensor-based computing systems from existing low-cost building blocks.
    UbiComp 2006: Ubiquitous Computing, 8th International Conference, UbiComp 2006, Orange County, CA, USA, September 17-21, 2006; 01/2006
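
The first abstract in the list above states that an N-address-input memory can implement any N-input combinational circuit. A minimal Python sketch of that idea follows. It is an illustration only, not code from any of the listed papers: the names `build_lut`, `lut_read`, and `majority` are hypothetical, and the convention that the first input bit is the most significant address bit is an assumption made here.

```python
from itertools import product


def build_lut(func, n):
    """Fill a 2**n-entry memory with the truth table of an n-input function.

    Entry k holds func applied to the n-bit binary expansion of k,
    with the first argument as the most significant bit (assumed convention).
    """
    return [int(bool(func(*bits))) for bits in product((0, 1), repeat=n)]


def lut_read(lut, *bits):
    """Evaluate the circuit by using the input bits as a memory address."""
    addr = 0
    for b in bits:
        addr = (addr << 1) | (b & 1)
    return lut[addr]


def majority(a, b, c):
    """Example 3-input combinational circuit: output 1 if at least two inputs are 1."""
    return (a & b) | (a & c) | (b & c)


# "Configuring" the memory is just writing a sequence of bits into it.
MAJ_LUT = build_lut(majority, 3)
```

Reading `lut_read(MAJ_LUT, a, b, c)` then returns the same value as `majority(a, b, c)` for every input combination; loading a different bit sequence into the same memory yields a different circuit without any change to the lookup mechanism.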

Publication Stats

5k Citations
44.25 Total Impact Points

Institutions

  • 1994–2013
    • University of California, Riverside
      • Department of Computer Science and Engineering
      Riverside, California, United States
  • 1991–2012
    • University of California, Irvine
      • Center for Embedded Computer Systems (CECS)
      • Department of Computer Science
      Irvine, California, United States
  • 2004
    • University of California, San Diego
      • Department of Electrical and Computer Engineering
      San Diego, CA, United States
    • CSU Mentor
      Long Beach, California, United States