[Show abstract][Hide abstract] ABSTRACT: Reconfigurable devices are often employed in heterogeneous systems due to their low power and parallel processing advantages. An important usability requirement is the support of a homogeneous programming interface. Nevertheless, homogeneous programming interfaces do not eliminate the need for code tweaking to enable efficient mapping of the computation across heterogeneous architectures. In this work we propose a code optimization framework which analyzes and restructures CUDA kernels that are optimized for GPU devices in order to facilitate synthesis of high-throughput custom accelerators on FPGAs. The proposed framework enables efficient performance porting without manual code tweaking or annotation by the user. A hierarchical region graph in tandem with code motions and graph coloring of array variables is employed to restructure the kernel for high throughput execution on FPGAs.
Proceedings of the 50th Annual Design Automation Conference; 05/2013
[Show abstract][Hide abstract] ABSTRACT: High level synthesis (HLS) is an important enabling technology for the adoption of hardware accelerator technologies. It promises the performance and energy efficiency of hardware designs with a lower barrier to entry in design expertise, and shorter design time. State-of-the-art high level synthesis now includes a wide variety of powerful optimizations that implement efficient hardware. These optimizations can implement some of the most important features generally performed in manual designs including parallel hardware units, pipelining of execution both within a hardware unit and between units, and fine-grained data communication. We may generally classify the optimizations as those that optimize hardware implementation within a code block (intra-block) and those that optimize communication and pipelining between code blocks (inter-block). However, both optimizations are in practice difficult to apply. Real-world applications contain data-dependent blocks of code and communicate through complex data access patterns. Existing high level synthesis tools cannot apply these powerful optimizations unless the code is inherently compatible, severely limiting the optimization opportunity.
In this paper we present an integrated framework to model and enable both intra- and inter-block optimizations. This integrated technique substantially improves the opportunity to use the powerful HLS optimizations that implement parallelism, pipelining, and fine-grained communication. Our polyhedral model-based technique systematically defines a set of data access patterns, identifies effective data access patterns, and performs the loop transformations to enable the intra- and inter-block optimizations. Our framework automatically explores transformation options, performs code transformations, and inserts the appropriate HLS directives to implement the HLS optimizations. Furthermore, our framework can automatically generate the optimized communication blocks for fine-grained communication between hardware blocks. Experimental evaluation demonstrates that we can achieve an average of 6.04X speedup over the high level synthesis solution without our transformations to enable intra- and inter-block optimizations.
Proceedings of the ACM/SIGDA international symposium on Field programmable gate arrays; 01/2013
[Show abstract][Hide abstract] ABSTRACT: Graphics Processing Units (GPUs) have become ubiquitous for general purpose applications due to their tremendous computing power. Initially, GPUs only employ scratchpad memory as on-chip memory. Though scratchpad memory benefits many applications, it is not ideal for those general purpose applications with irregular memory accesses. Hence, GPU vendors have introduced caches in conjunction with scratchpad memory in the recent generations of GPUs. The caches on GPUs are highly-configurable. The programmer or the compiler can explicitly control cache access or bypass for global load instructions. This highly-configurable feature of GPU caches opens up the opportunities for optimizing the cache performance. In this paper, we propose an efficient compiler framework for cache bypassing on GPUs. Our objective is to efficiently utilize the configurable cache and improve the overall performance for general purpose GPU applications. In order to achieve this goal, we first characterize GPU cache utilization and develop performance metrics to estimate the cache reuses and memory traffic. Next, we present efficient algorithms that judiciously select global load instructions for cache access or bypass. Finally, we integrate our techniques into an automatic compiler framework that leverages PTX instruction set architecture. Experiments evaluation demonstrates that compared to cache-all and bypass-all solutions, our techniques can achieve considerable performance improvement.
Computer-Aided Design (ICCAD), 2013 IEEE/ACM International Conference on; 01/2013
[Show abstract][Hide abstract] ABSTRACT: High-level synthesis (HLS) tools provide automatic generation of hardware at the register transfer level (RTL) from algorithm descriptions written in high-level languages, enabling faster creation of custom accelerators for FPGA architectures. Existing HLS tools support a wide variety of input languages, and assist users in design space exploration through automation and feedback on designs' performance bottlenecks. This design space exploration applies techniques such as pipelining, partitioning and resource sharing in order to improve performance,
and resource utilization.
However, although automated exploration can find some inherent parallelism, data-parallel input source code is still superior for exposing a greater variety of parallelism.
In prior work, we demonstrated automated design space exploration of GPU multi-threaded (CUDA) language source code for efficient RTL generation. In this paper, we examine the challenges in extending this automated design space exploration to multiple dependent CUDA kernels, demonstrate a step-by-step procedure for efficiently performing multi-kernel synthesis, and demonstrate the potential of this approach through a case study of a stereo matching algorithm. This study demonstrates that HLS of multiple dependent CUDA kernels can maintain performance parity with the GPU implementation, while consuming over 16X less energy than the GPU. Based on our manual procedure, we identify the key challenges in fully automating the synthesis of multi-kernel CUDA programs.
[Show abstract][Hide abstract] ABSTRACT: GPUs are an increasingly popular implementation
platform for a variety of general purpose applications from mobile
and embedded devices to high performance computing. The
CUDA and OpenCL parallel programming models enable easy
utilization of the GPU’s resources. However, tuning GPU applications’
performance is a complex and labor intensive task. Software
programmers employ a variety of optimization techniques to
explore tradeoffs between the thread parallelism and performance
of a single thread. However, prior techniques ignore register allocation,
a significant factor in single thread performance and, indirectly
affects the number of simultaneously active threads. In this
paper, we show that joint optimization of register allocation and
thread structure has great potential to significantly improve performance.
However, the design space for this joint optimization
can be large; therefore, we develop performance metrics appropriate
for evaluation within a compiler’s inner loop and efficient
design space exploration techniques that use the metrics to narrow
the search space. Across a range of GPU applications, we achieve
average performance speedup of 1.33X (up to 1.73X) with design
space exploration 355X faster than the exhaustive search.
[Show abstract][Hide abstract] ABSTRACT: Real-time 3D sound localization is an important technology for various applications such as camera steering systems, robotics audition, and gunshot direction. 3D sound localization adds a new dimension, but also significantly increases the computational requirements. Real-time 3D sound localization continuously processes large volumes of data for each possible 3D direction and acoustic frequency range. Such highly demanding compute requirements outpace current CPU compute abilities. This paper develops a real-time implementation of 3D sound localization on Graphical Processing Units (GPUs). Massively parallel GPU architectures are shown to be well suited for 3D sound localization. We optimize various aspects of GPU implementation, such as number of threads per thread block, register allocation per thread, and memory data layout for performance improvement. Experiments indicate that our GPU implementation achieves 501X and 130X speedup compared to a single-thread and a multi-thread CPU implementation respectively, thus enabling real-time operation of 3D sound localization.
[Show abstract][Hide abstract] ABSTRACT: FPGAs are an attractive platform for applications with high computation demand and low energy consumption requirements. However, design effort for FPGA implementations remains high—often an order of magnitude larger than design effort using high-level languages. Instead of this time-consuming process, high-level synthesis (HLS) tools generate hardware implementations from algorithm descriptions in languages such as C/C++ and SystemC. Such tools reduce design effort: high-level descriptions are more compact and less error prone. HLS tools promise hardware development abstracted from software designer knowledge of the implementation platform. In this paper, we present an unbiased study of the performance, usability and productivity of HLS using AutoPilot (a state-of-the-art HLS tool). In particular, we first evaluate AutoPilot using the popular embedded benchmark kernels. Then, to evaluate the suitability of HLS on real-world applications, we perform a case study of stereo matching, an active area of computer vision research that uses techniques also common for image denoising, image retrieval, feature matching, and face recognition. Based on our study, we provide insights on current limitations of mapping general-purpose software to hardware using HLS and some future directions for HLS tool development. We also offer several guidelines for hardware-friendly software design. For popular embedded benchmark kernels, the designs produced by HLS achieve 4X to 126X speedup over the software version. The stereo matching algorithms achieve between 3.5X and 67.9X speedup over software (but still less than manual RTL design) with a fivefold reduction in design effort versus manual RTL design.
Journal of Electrical and Computer Engineering. 01/2012; 2012.
[Show abstract][Hide abstract] ABSTRACT: Graphics processing units (GPUs) are increasingly critical for general-purpose parallel processing performance. GPU hardware is composed of many streaming multiprocessors, each of which employs the single-instruction multiple-data (SIMD) execution style. This massively parallel architecture allows GPUs to execute tens of thousands of threads in parallel. Thus, GPU architectures efficiently execute heavily data-parallel applications. However, due to this SIMD execution style, resource utilization and thus overall performance can be significantly affected if computation threads must take diverging control paths. Control flow divergence in GPUs is a well-known problem: prior approaches have attempted to reduce control flow divergence through code transformations, memory access indirection, and input data reorganization. However, as we will demonstrate, the utility of these transformations is seriously affected by the lack of a guiding metric that properly estimates how control flow divergence affects application performance. In this paper, we introduce a metric that simply and accurately estimates performance of computation-bound GPU kernels with control flow divergence, and use the metric as a value function for thread re-grouping algorithms. We measure the performance on NVIDIA GTS250 GPU. For the tested set of applications, our experiments demonstrate that the proposed metric correlates well with actual GPU application performance. Through thread re-grouping guided by our metric, control flow divergence optimization can improve application performance by up to 3.19X.
[Show abstract][Hide abstract] ABSTRACT: This paper presents a real-time three-dimensional
(3D) wideband sound localization system designed with a miniature
XYZO microphone array. Unlike the conventional microphone
arrays for sound localization using only omnidirectional
microphones, the presented microphone array is designed with
both bidirectional (pressure gradient) and omnidirectional microphones.
Therefore, the array has significantly reduced size and
is known as the world’s smallest microphone array design for
3D sound source localization in air. In this paper, we describe
the 3D array configuration and perform array calibration. For
3D sound localization, we provide studies on the array output
model of the XYZO array, the widely known direction-of-arrival
(DOA) estimation methods and the direction search in 3D space.
To achieve the real-time processing for 1∘ search resolution,
we accelerate the parallel computations on GPU platform with
CUDA programming, and a 130X speedup is achieved compared
to a multi-thread CPU implementation. The performance of the
proposed system is studied under various reverberation lengths
and signal-to-noise levels. We also demonstrate a real-time 3D
sound localization demo showing good ability to virtual reality.
Industrial Electronics and Applications (ICIEA), 7th IEEE Conference on; 01/2012
[Show abstract][Hide abstract] ABSTRACT: Recent progress in High-Level Synthesis (HLS) techniques has helped raise the abstraction level of FPGA programming. However implementation and performance evaluation of the HLS-generated RTL, involves lengthy logic synthesis and physical design flows. Moreover, mapping of different levels of coarse grained parallelism onto hardware spatial parallelism affects the final FPGA-based performance both in terms of cycles and frequency. Evaluation of the rich design space through the full implementation flow - starting with high level source code and ending with routed net list - is prohibitive in various scientific and computing domains, thus hindering the adoption of reconfigurable computing. This work presents a framework for multilevel granularity parallelism exploration with HLS-order of efficiency. Our framework considers different granularities of parallelism for mapping CUDA kernels onto high performance FPGA-based accelerators. We leverage resource and clock period models to estimate the impact of multi-granularity parallelism extraction on execution cycles and frequency. The proposed Multilevel Granularity Parallelism Synthesis (ML-GPS) framework employs an efficient design space search heuristic in tandem with the estimation models as well as design layout information to derive a performance near-optimal configuration. Our experimental results demonstrate that ML-GPS can efficiently identify and generate CUDA kernel configurations that can significantly outperform previous related tools whereas it can offer competitive performance compared to software kernel execution on GPUs at a fraction of the energy cost.
[Show abstract][Hide abstract] ABSTRACT: FPGAs are an attractive platform for applications with high computation demand and low energy consumption requirements. However, design effort for FPGA implementations remains high — often an order of magnitude larger than design effort using high level languages. Instead of this time-consuming process, high level synthesis (HLS) tools generate hardware implementations from high level languages (HLL) such as C/C++/SystemC. Such tools reduce design effort: high level descriptions are more compact and less error prone. HLS tools promise hardware development abstracted from software designer knowledge of the implementation platform. In this paper, we examine several implementations of stereo matching, an active area of computer vision research that uses techniques also common for image de-noising, image retrieval, feature matching and face recognition. We present an unbiased evaluation of the suitability of using HLS for typical stereo matching software, usability and productivity of AutoPilot (a state of the art HLS tool), and the performance of designs produced by AutoPilot. Based on our study, we provide guidelines for software design, limitations of mapping general purpose software to hardware using HLS, and future directions for HLS tool development. For the stereo matching algorithms, we demonstrate between 3.5X and 67.9X speedup over software (but less than achievable by manual RTL design) with a five-fold reduction in design effort vs. manual hardware design.
2011 International Conference on Field-Programmable Technology, FPT 2011, New Delhi, India, December 12-14, 2011; 01/2011
[Show abstract][Hide abstract] ABSTRACT: A wide variety of application domains such as networking, computer vision, and cryptography target FPGA platforms to meet computation demand and energy consumption constraints. However, design effort for FPGA implementations in hardware description languages (HDLs) remains high - often an order of magnitude larger than design effort using high level languages (HLLs). Instead of development in HDLs, high level synthesis (HLS) tools generate hardware implementations from algorithm descriptions in HLLs such as C/C++/SystemC. HLS tools promise reduced design effort and hardware development without the detailed knowledge of the implementation platform. In this paper, we study AutoPilot, a state-of-the-art HLS tool, and examine the suitability of using HLS for a variety of application domains. Based on our study of application code not originally written for HLS, we provide guidelines for software design, limitations of mapping general purpose software to hardware using HLS, and future directions for HLS tool development. For the examined applications, we demonstrate speedup from 4X to over 126X, with a five-fold reduction in design effort vs. manual design in HDLs.
ASIC (ASICON), 2011 IEEE 9th International Conference on; 01/2011