Conference Paper · PDF available

From OpenCL to high-performance hardware on FPGAs


Abstract

We present an OpenCL compilation framework to generate high-performance hardware for FPGAs. For an OpenCL application comprising a host program and a set of kernels, it compiles the host program, generates Verilog HDL for each kernel, compiles the circuit using Altera Complete Design Suite 12.0, and downloads the compiled design onto an FPGA. We can then run the application by executing the host program on a Windows™-based machine, which communicates with kernels on an FPGA using a PCIe interface. We implement four applications on an Altera Stratix IV and present the throughput and area results for each application. We show that we can achieve a clock frequency in excess of 160 MHz on our benchmarks, and that the OpenCL computing paradigm is a viable design entry method for high-performance computing applications on FPGAs.
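To make the host/kernel split concrete, here is a minimal sketch of the kind of OpenCL application such a framework consumes (a hypothetical vector-add example, not one of the paper's four benchmarks; error checking omitted). In the flow described, the kernel would be compiled offline to Verilog and loaded as a precompiled binary via clCreateProgramWithBinary; clCreateProgramWithSource is used here only to keep the sketch self-contained.

```cpp
// Hypothetical vector-add application (host + kernel). Error checking is
// omitted for brevity; a real FPGA flow would load a precompiled kernel
// binary with clCreateProgramWithBinary instead of building from source.
#include <CL/cl.h>
#include <stdio.h>

static const char *src =
    "__kernel void vadd(__global const float *a,\n"
    "                   __global const float *b,\n"
    "                   __global float *c) {\n"
    "    int i = get_global_id(0);\n"
    "    c[i] = a[i] + b[i];\n"
    "}\n";

int main(void) {
    enum { N = 1024 };
    float a[N], b[N], c[N];
    for (int i = 0; i < N; ++i) { a[i] = (float)i; b[i] = 2.0f * i; }

    cl_platform_id plat; clGetPlatformIDs(1, &plat, NULL);
    cl_device_id dev;    clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "vadd", NULL);

    // Device buffers; on an FPGA board these map to external memory banks.
    cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof a, a, NULL);
    cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof b, b, NULL);
    cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof c, NULL, NULL);

    clSetKernelArg(k, 0, sizeof da, &da);
    clSetKernelArg(k, 1, sizeof db, &db);
    clSetKernelArg(k, 2, sizeof dc, &dc);

    size_t global = N;  // one work-item per element
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, dc, CL_TRUE, 0, sizeof c, c, 0, NULL, NULL);
    printf("c[42] = %f\n", c[42]);  // expect 126.0
    return 0;
}
```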
... The strength of FPGAs is that they can be reconfigured and adapted for the type of algorithms to execute, mapping an algorithm one-to-one to the FPGA hardware. This involves "programming" the FPGA using a hardware description language (HDL) such as VHDL or Verilog [6]. The HDLs are used to generate a circuit description which is loaded onto the FPGA. ...
... Instead, the SDK compiles OpenCL into Verilog, which is passed to a synthesis program. The compiler is an extension of the LLVM compiler which first produces an LLVM intermediate representation of the kernel, from which Verilog code is produced, followed by the normal FPGA programming flow of synthesis and place-and-route [6]. Figure 1 shows the compilation flow when using the SDK. ...
... Czajkowski et al. [6] state that the reason for choosing OpenCL over a high-level language is the separation between the host and kernel. The kernel can be implemented as a highly performant hardware circuit while the host can handle the communication and programming of the FPGA. ...
Article
Full-text available
As a result of frequency and power limitations, multi-core processors and accelerators are becoming more and more prevalent in today’s systems. To fully utilize such systems, heterogeneous parallel programming is needed, but this introduces new complexities to the development. High-level frameworks such as SkePU have been introduced to help alleviate these complexities. SkePU is a skeleton programming framework based on a set of programming constructs implementing computational parallel patterns, while presenting a sequential interface to the programmer. Using the various skeleton backends, SkePU programs can execute, without source code modification, on multiple types of hardware such as CPUs, GPUs, and clusters. This paper presents the design and implementation of a new backend for SkePU, adding support for FPGAs. We also evaluate the effect of FPGA-specific optimizations in the new backend and compare it with the existing GPU backend, where the actual devices used are of similar vintage and price point. For simple examples, we find that the FPGA backend’s performance is similar to that of the existing backend for GPUs, while it falls behind in more complex tasks. Finally, some shortcomings in the backend are highlighted and discussed, along with potential solutions.
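As an illustration of the skeleton idea, here is a minimal C++ sketch (hypothetical names, not SkePU's actual API): the programmer writes against a sequential-looking map construct, and a backend is free to execute it on a CPU, a GPU, or, with the backend described above, an FPGA.

```cpp
// Hypothetical skeleton interface (names invented for illustration).
#include <cstdio>
#include <vector>

// "Map" skeleton: applies a user function elementwise; a backend could
// dispatch this loop to a CPU, a GPU kernel, or an FPGA pipeline.
template <typename F>
struct Map {
    F f;
    explicit Map(F fn) : f(fn) {}
    template <typename T>
    std::vector<T> operator()(const std::vector<T> &a, const std::vector<T> &b) const {
        std::vector<T> out(a.size());
        for (std::size_t i = 0; i < a.size(); ++i)  // backend-parallelizable
            out[i] = f(a[i], b[i]);
        return out;
    }
};

int main() {
    auto add = Map([](float x, float y) { return x + y; });  // user function
    std::vector<float> a{1, 2, 3}, b{4, 5, 6};
    auto c = add(a, b);  // same call regardless of the backend chosen
    std::printf("%g %g %g\n", c[0], c[1], c[2]);  // 5 7 9
}
```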
... Altera SDK for OpenCL (now called Intel FPGA SDK for OpenCL) was released in 2013 [4]. With this SDK, Altera (now Intel PSG) FPGAs became more widely available to software programmers since it allowed them to program FPGAs using a software programming language and a standard API. ...
Preprint
Full-text available
Recent developments in High Level Synthesis tools have attracted software programmers to accelerate their high-performance computing applications on FPGAs. Even though it has been shown that FPGAs can compete with GPUs in terms of performance for stencil computation, most previous work achieves this by avoiding spatial blocking and restricting input dimensions relative to FPGA on-chip memory. In this work we create a stencil accelerator using Intel FPGA SDK for OpenCL that achieves high performance without having such restrictions. We combine spatial and temporal blocking to avoid input size restrictions, and employ multiple FPGA-specific optimizations to tackle issues arising from the added design complexity. Accelerator parameter tuning is guided by our performance model, which we also use to project performance for the upcoming Intel Stratix 10 devices. On an Arria 10 GX 1150 device, our accelerator can reach up to 760 and 375 GFLOP/s of compute performance, for 2D and 3D stencils, respectively, which rivals the performance of a highly-optimized GPU implementation. Furthermore, we estimate that the upcoming Stratix 10 devices can achieve a performance of up to 3.5 TFLOP/s and 1.6 TFLOP/s for 2D and 3D stencil computation, respectively.
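For reference, a minimal plain-C++ sketch of spatial blocking for a 2D 5-point stencil (hypothetical code, not the paper's OpenCL accelerator): the grid is swept tile by tile so each tile's working set fits in on-chip memory; the paper further adds temporal blocking, performing several time steps per pass to reduce external-memory traffic.

```cpp
// Hypothetical tiled 2D Jacobi-style stencil (software sketch).
#include <algorithm>
#include <vector>

void stencil2d_tiled(const std::vector<float> &in, std::vector<float> &out,
                     int nx, int ny, int tile) {
    for (int by = 1; by < ny - 1; by += tile)
        for (int bx = 1; bx < nx - 1; bx += tile)
            // One spatial block; on an FPGA this region would stream
            // through on-chip shift registers / block RAM.
            for (int y = by; y < std::min(by + tile, ny - 1); ++y)
                for (int x = bx; x < std::min(bx + tile, nx - 1); ++x)
                    out[y * nx + x] = 0.2f * (in[y * nx + x] +
                                              in[y * nx + x - 1] +
                                              in[y * nx + x + 1] +
                                              in[(y - 1) * nx + x] +
                                              in[(y + 1) * nx + x]);
}
```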
... Yet users of these tools must still have a fairly good understanding of the microarchitecture of the hardware they want to build, since these systems can't (yet) do global restructuring of the input code to improve energy or performance. This same limitation occurs in research efforts like OpenCL-to-FPGA [7], SOpenCL [21] and FCUDA [23], which explore the feasibility of using GPU languages like CUDA and OpenCL as hardware description languages. ...
Preprint
Specialized image processing accelerators are necessary to deliver the performance and energy efficiency required by important applications in computer vision, computational photography, and augmented reality. But creating, "programming," and integrating this hardware into a hardware/software system is difficult. We address this problem by extending the image processing language, Halide, so users can specify which portions of their applications should become hardware accelerators, and then we provide a compiler that uses this code to automatically create the accelerator along with the "glue" code needed for the user's application to access this hardware. Starting with Halide not only provides a very high-level functional description of the hardware, but also allows our compiler to generate the complete software program including the sequential part of the workload, which accesses the hardware for acceleration. Our system also provides high-level semantics to explore different mappings of applications to a heterogeneous system, with the added flexibility of being able to map at various throughput rates. We demonstrate our approach by mapping applications to a Xilinx Zynq system. Using its FPGA with two low-power ARM cores, our design achieves up to 6x higher performance and 8x lower energy compared to the quad-core ARM CPU on an NVIDIA Tegra K1, and 3.5x higher performance with 12x lower energy compared to the K1's 192-core GPU.
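For context, here is a minimal blur pipeline in stock Halide (the algorithm/schedule separation the paper builds on; the hardware-offload directives the paper adds are not shown, and this sketch assumes a recent Halide release).

```cpp
// Stock Halide sketch (assumes a recent Halide release); the paper's
// hardware-offload scheduling directives are intentionally not shown.
#include "Halide.h"
using namespace Halide;

int main() {
    Buffer<uint8_t> in(66, 66);  // toy input, padded for the 3x3 window
    in.fill(128);

    Var x("x"), y("y");
    Func blur_x("blur_x"), blur_y("blur_y");

    // Algorithm: a separable 3x3 box blur, described purely functionally.
    blur_x(x, y) = (cast<uint16_t>(in(x, y)) + cast<uint16_t>(in(x + 1, y)) +
                    cast<uint16_t>(in(x + 2, y))) / 3;
    blur_y(x, y) = cast<uint8_t>((blur_x(x, y) + blur_x(x, y + 1) +
                                  blur_x(x, y + 2)) / 3);

    // Schedule: mapping decisions live apart from the algorithm; the
    // paper's extension swaps in accelerator mappings at this level.
    blur_x.compute_at(blur_y, y);
    blur_y.vectorize(x, 8);

    Buffer<uint8_t> out = blur_y.realize({64, 64});
    return 0;
}
```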
Thesis
Full-text available
Traditional hardware description languages (HDLs), such as VHDL and Verilog, are widely used for designing digital electronic circuits, e.g., application-specific integrated circuits (ASICs), or for programming field-programmable gate arrays (FPGAs). However, using HDLs for implementing complex algorithms or maintaining large projects is tedious and time-consuming, even for experts. This also prevents the widespread use of FPGAs. As a solution, High-Level Synthesis (HLS) has been studied for decades to increase productivity by, ultimately, taking a behavioral description of an algorithm (what the circuit does?) as design entry and automatically generating a register-transfer level (RTL) implementation. Commercial HLS tools start from well-known programming languages (e.g., C, C++ or OpenCL), which were initially developed for programmable devices with an instruction set architecture (ISA). Yet, these tools deliver a satisfactory quality of hardware synthesis results only when programmers describe hardware-favorable implementations for their applications (how the circuit is built?) exploiting, e.g., a specific memory architecture, control path, and data path. This requires an in-depth understanding of hardware design principles. To adopt software programming languages for hardware design, each HLS tool uses its own language dialect and introduces a non-standard set of pragmas. The mixed use of software and hardware language abstractions hinders a purely behavioral design and makes optimizations hard to understand, since the expected code is neither a pure hardware description nor a regular software implementation. Furthermore, code optimized for one HLS tool has to be changed significantly to target another HLS tool, and it performs poorly on an ISA. We believe that the next step in HLS will be on the language side, overcoming productivity, portability, and performance hurdles caused by behavioral design deficiencies of existing tools. This dissertation presents and evaluates three distinct solutions to separate the description of the behavior (what?) of an algorithm from its implementation (how?) while providing high-quality hardware synthesis results for the class of image processing applications. This is achieved by generating highly optimized target-specific input code to commercial HLS tools from high-level abstractions that capture parallelism, locality, and memory access information of an input application. In these approaches, an image processing application is described as a set of basic building blocks, namely point, local, and global operators, without low-level implementation concerns. Then, optimized input code is generated for the selected HLS tool (Vivado HLS or Intel OpenCL SDK for FPGAs) using one of the following programming techniques: (i) a source-to-source compiler developed for an image processing domain-specific language (DSL), (ii) template metaprogramming to specialize input C++ programs at compile time, or (iii) a partial evaluation technique for specializing higher-order functions. We present the first source-to-source compiler that generates optimized input code for Intel OpenCL SDK for FPGAs from a DSL. We use Heterogeneous Image Processing Acceleration (Hipacc), an image processing DSL and a source-to-source compiler initially developed for targeting graphics processing units (GPUs). The Hipacc DSL offers high-level abstractions for point, local, and global operators in the form of language constructs.
During code generation, the compiler front end transforms the input DSL code into an abstract syntax tree (AST) representation using the Clang/LLVM compiler infrastructure. By leveraging domain knowledge captured from the input DSL code, our backend applies several transformations to generate a description of a streaming hardware pipeline. In the final step, Hipacc generates OpenCL code as input to Intel’s HLS compiler. The quality of our hardware synthesis results rivals those obtained from Intel’s hand-optimized OpenCL code examples in terms of throughput and resource usage. Furthermore, Hipacc’s code generation achieves significantly higher throughput and uses fewer resources compared to Intel’s parallelization intrinsic. Second, we present an approach based on template metaprogramming for developing modular and highly parameterizable function libraries that also deliver high-quality hardware synthesis results when compiled with HLS tools. In this approach, the library application programming interface (API) consists of high-level generic functions for declaring building blocks of image processing applications, e.g., point, local, and global operators, unlike typical libraries that offer functions for complete algorithms, e.g., OpenCV. The library is optimized with Vivado HLS best practices as well as hardware-centric design techniques such as deep pipelining, coarse-level parallelization, and bit-level optimizations. The library contains more than one template design for each algorithmic instance to be able to utilize implementations optimized for the input parameters. For example, it includes multiple implementations of image border handling and coarse-level parallelization strategies considered for different input parameters of a local operator specification. Furthermore, a compile-time selection algorithm is proposed for selecting the most suitable implementation according to an analytical model derived for resource usage, speed, and latency. In this way, low-level implementation details are hidden from users. In addition to the presented advantages of using high-level abstractions for raising the abstraction level in HLS, we show that this approach is beneficial for achieving performance portability across different computing platforms. Similar to FPGAs, the performance capabilities of central processing units (CPUs) and GPUs can be fully leveraged only when application programs are tuned with low-level architecture-specific optimizations. These optimizations are based on fundamentally different programming paradigms and languages. As a solution, Khronos released OpenVX as the first industrial standard for graph-based specification of computer vision (CV) applications. The graph-based specification allows optimizing memory transfers between different CV functions from a device-specific backend. Furthermore, the standard hides low-level implementation details from the algorithm description. For instance, memory hierarchy and device synchronization are not exposed to the user. However, the OpenVX standard supports only a small set of computer vision functions and does not offer a mechanism to incorporate user code as part of an OpenVX graph. As the next step, HipaccVX is presented as an OpenVX implementation and extension, supporting code generation for a wide variety of computing platforms.
HipaccVX leverages OpenVX’s standard API and graph specification while offering new language constructs to describe algorithms using high-level abstractions that adhere to distinct memory access patterns (e.g., local operators). Thus, it supports the acceleration of user-defined code as well as OpenVX’s CV functions. In this way, HipaccVX combines the benefits of DSL design techniques with an industrial standard specification. Finally, AnyHLS, a novel approach to raise the abstraction level in HLS by using partial evaluation as a core compiler technology is presented. Solely one language and one function library are used to generate target-specific input code for two commercial HLS tools, namely Xilinx Vivado HLS and Intel FPGA SDK for OpenCL. Hardware-centric optimizations requiring code transformations are implemented as higher-order functions, without using tool-specific pragma extensions. Extending AnyHLS with new functionality does not require modifications to a compiler or a code generator written in a different (host) language. Contrary to metaprogramming, the well-typedness of a residual program is guaranteed. As a result, significantly higher productivity than the existing techniques and an unprecedented level of portability across different HLS tools are achieved. Productivity, modularity, and portability gains are demonstrated by presenting an image processing library as a case study.
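A minimal C++ sketch of the "local operator" building block these approaches abstract over (hypothetical code, not Hipacc's DSL syntax): the user supplies only the per-window computation, while window traversal and border handling, and ultimately the line-buffered hardware pipeline, are the framework's responsibility.

```cpp
// Hypothetical "local operator" helper (not Hipacc syntax).
#include <algorithm>
#include <cstdio>
#include <vector>

// Applies a user function f to every clamped 3x3 window of an image.
template <typename F>
void local_op_3x3(const std::vector<int> &in, std::vector<int> &out,
                  int w, int h, F f) {
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            int win[9], k = 0;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx) {
                    int xx = std::clamp(x + dx, 0, w - 1);  // border handling
                    int yy = std::clamp(y + dy, 0, h - 1);
                    win[k++] = in[yy * w + xx];
                }
            out[y * w + x] = f(win);
        }
}

int main() {
    std::vector<int> img(16, 9), res(16);
    // A 3x3 mean filter expressed as a local operator.
    local_op_3x3(img, res, 4, 4, [](const int *w) {
        int s = 0;
        for (int i = 0; i < 9; ++i) s += w[i];
        return s / 9;
    });
    std::printf("%d\n", res[5]);  // 9 for a constant image
}
```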
Article
Full-text available
In financial computation, Field Programmable Gate Arrays (FPGAs) have emerged as a transformative technology, particularly in the domain of option pricing. This study presents the impact of FPGAs on computational methods in finance, with an emphasis on option pricing. Our review examined 99 selected studies from an initial pool of 131, revealing how FPGAs substantially enhance both the speed and energy efficiency of various financial models, particularly Black–Scholes and Monte Carlo simulations. Notably, the performance gains—ranging from 270- to 5400-times faster than conventional CPU implementations—are highly dependent on the specific option pricing model employed. These findings illustrate FPGAs’ capability to efficiently process complex financial computations while consuming less energy. Despite these benefits, this paper highlights persistent challenges in FPGA design optimization and programming complexity. This study not only emphasises the potential of FPGAs to further innovate financial computing but also outlines the critical areas for future research to overcome existing barriers and fully leverage FPGA technology in future financial applications.
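For reference, the Black–Scholes closed form that many of the surveyed accelerators implement, as a plain C++ sketch (a software reference, not an FPGA implementation):

```cpp
// Black-Scholes European call price, software reference.
#include <cmath>
#include <cstdio>

double norm_cdf(double x) {  // standard normal CDF via erfc
    return 0.5 * std::erfc(-x / std::sqrt(2.0));
}

// S: spot, K: strike, r: risk-free rate, sigma: volatility, T: years to expiry
double bs_call(double S, double K, double r, double sigma, double T) {
    double d1 = (std::log(S / K) + (r + 0.5 * sigma * sigma) * T)
                / (sigma * std::sqrt(T));
    double d2 = d1 - sigma * std::sqrt(T);
    return S * norm_cdf(d1) - K * std::exp(-r * T) * norm_cdf(d2);
}

int main() {
    // At-the-money call, 5% rate, 20% volatility, one year: ~10.45
    std::printf("%.4f\n", bs_call(100.0, 100.0, 0.05, 0.2, 1.0));
}
```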
Article
CUDA is designed specifically for NVIDIA GPUs and is not compatible with non-NVIDIA devices. Enabling CUDA execution on alternative backends could greatly benefit the hardware community by fostering a more diverse software ecosystem. To address the need for portability, our objective is to develop a framework that meets key requirements, such as extensive coverage, comprehensive end-to-end support, superior performance, and hardware scalability. Existing solutions that translate CUDA source code into other high-level languages, however, fall short of these goals. In contrast to these source-to-source approaches, we present a novel framework, CuPBoP, which treats CUDA as a portable language in its own right. Compared to two commercial source-to-source solutions, CuPBoP offers a broader coverage and superior performance for the CUDA-to-CPU migration. Additionally, we evaluate the performance of CuPBoP against manually optimized CPU programs, highlighting the differences between CPU programs derived from CUDA and those that are manually optimized. Furthermore, we demonstrate the hardware scalability of CuPBoP by showcasing its successful migration of CUDA to AMD GPUs. To promote further research in this field, we have released CuPBoP as an open-source resource.
Conference Paper
Full-text available
As growing power dissipation and thermal effects disrupted the rising clock frequency trend and threatened to annul Moore's law, the computing industry has switched its route to higher performance through parallel processing. The rise of multi-core systems in all domains of computing has opened the door to heterogeneous multi-processors, where processors of different compute characteristics can be combined to effectively boost the performance per watt of different application kernels. GPUs and FPGAs are becoming very popular in PC-based heterogeneous systems for speeding up compute-intensive kernels of scientific, imaging and simulation applications. GPUs can execute hundreds of concurrent threads, while FPGAs provide customized concurrency for highly parallel kernels. However, exploiting the parallelism available in these applications is often not a push-button task. Often the programmer has to expose the application's fine- and coarse-grained parallelism by using special APIs. CUDA is such a parallel-computing API that is driven by the GPGPU industry and is gaining significant popularity. In this work, we adapt the CUDA programming model into a new FPGA design flow called FCUDA, which efficiently maps the coarse- and fine-grained parallelism exposed in CUDA onto the reconfigurable fabric. Our CUDA-to-FPGA flow employs AutoPilot, an advanced high-level synthesis tool which enables high-abstraction FPGA programming. FCUDA is based on a source-to-source compilation that transforms the SPMD CUDA thread blocks into parallel C code for AutoPilot. We describe the details of our CUDA-to-FPGA flow and demonstrate the highly competitive performance of the resulting customized FPGA multi-core accelerators. To the best of our knowledge, this is the first CUDA-to-FPGA flow to demonstrate the applicability and potential advantage of using the CUDA programming model for high-performance computing in FPGAs.
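A minimal sketch of the kind of transformation described (hypothetical code, not FCUDA's actual output, which targets AutoPilot with tool-specific pragmas): the SPMD CUDA kernel shown in the comment becomes a loop nest whose levels expose block- and thread-level parallelism to the HLS tool.

```cpp
// Hypothetical illustration of SPMD-to-loops translation.
//
// Original CUDA kernel:
//   __global__ void vscale(float *v, float s) {
//       int i = blockIdx.x * blockDim.x + threadIdx.x;
//       v[i] = s * v[i];
//   }
//
// Translated C, suitable for an HLS tool:
void vscale_blocks(float *v, float s, int gridDim_x, int blockDim_x) {
    for (int blockIdx_x = 0; blockIdx_x < gridDim_x; ++blockIdx_x) {
        // Block loop: candidate for coarse-grained parallel cores.
        for (int threadIdx_x = 0; threadIdx_x < blockDim_x; ++threadIdx_x) {
            // Thread loop: candidate for unrolling/pipelining.
            int i = blockIdx_x * blockDim_x + threadIdx_x;
            v[i] = s * v[i];
        }
    }
}
```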
Conference Paper
Full-text available
A simple protocol for latency-insensitive design is presented. The main features of the protocol are the efficient implementation of elastic communication channels and the automatable design methodology. With this approach, fine-granularity elasticity can be introduced at the level of functional units (e.g., ALUs, memories). A formal specification of the protocol is defined, and an efficient scheme for the implementation of elasticity that involves no datapath overhead is presented. The opportunities this protocol opens for microarchitectural design are discussed.
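A minimal cycle-level C++ model of an elastic channel of this kind (a hypothetical illustration, not the paper's circuits): a token moves only on cycles where the producer asserts valid and the consumer does not assert stop, so extra latency can be inserted anywhere without changing functional behavior.

```cpp
// Hypothetical cycle-level model of one elastic channel.
#include <cstdio>

struct ElasticChannel {
    int  data  = 0;
    bool valid = false;  // producer has a token this cycle
    bool stop  = false;  // consumer cannot accept this cycle
    bool transfer() const { return valid && !stop; }
};

int main() {
    ElasticChannel ch;
    int next = 0;
    for (int cycle = 0; cycle < 8; ++cycle) {
        ch.valid = true;              // producer always has data here
        ch.data  = next;
        ch.stop  = (cycle % 3 == 0);  // consumer stalls every third cycle
        if (ch.transfer()) {          // handshake succeeds: token moves
            std::printf("cycle %d: received %d\n", cycle, ch.data);
            ++next;                   // producer advances only on transfer
        }                             // otherwise the token is held back
    }
}
```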
Conference Paper
We present two designs (I and II) for IEEE 754 double precision floating point matrix multiplication, an important kernel in many tile-based BLAS algorithms, optimized for implementation on high-end FPGAs. The designs, both based on the rank-1 update scheme, can handle arbitrary matrix sizes, and are able to sustain their peak performance except during an initial latency period. Through these designs, the trade-offs involved in terms of local-memory and bandwidth for an FPGA implementation are demonstrated and an analysis is presented for the optimal choice of design parameters. The designs, implemented on a Virtex-5 SX240T FPGA, scale gracefully from 1 to 40 processing elements (PEs) with a less than 1% degradation in the design frequency of 373 MHz. With 40 PEs and a design speed of 373 MHz, a sustained performance of 29.8 GFLOPS is possible with a bandwidth requirement of 750 MB/s for design-II and 5.9 GB/s for design-I.
Article
We present two designs (I and II) for IEEE 754 double precision floating point matrix multiplication, optimized for implementation on high-end FPGAs. It forms the kernel in many important tile-based BLAS algorithms, making it an excellent candidate for acceleration. The designs, both based on the rank-1 update scheme, can handle arbitrary matrix sizes, and are able to sustain their peak performance except during an initial latency period. Through these designs, the trade-offs involved in terms of local-memory and bandwidth for an FPGA implementation are demonstrated and an analysis is presented for the optimal choice of design parameters. The designs, implemented on a Virtex-5 SX240T FPGA, scale gracefully from 1 to 40 processing elements (PEs) with a less than 1% degradation in the design frequency of 373 MHz. With 40 PEs and a design speed of 373 MHz, a sustained performance of 29.8 GFLOPS is possible with a bandwidth requirement of 750 MB/s for design-II and 5.9 GB/s for design-I. This compares favourably with both related art and general-purpose CPU implementations.
Keywords: High performance computing · Matrix multiplication · Rank-1 scheme · FPGA implementation · Memory-bandwidth trade-off · Scalability
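For reference, a plain C++ sketch of the rank-1 update scheme both designs build on: C is accumulated as a sum of outer products, one rank-1 update per step, with the inner loop being the natural dimension to spread across processing elements.

```cpp
// Plain C++ reference for the rank-1 update scheme; C is assumed
// zero-initialized and all matrices are n x n in row-major order.
#include <vector>

void matmul_rank1(const std::vector<double> &A, const std::vector<double> &B,
                  std::vector<double> &C, int n) {
    for (int k = 0; k < n; ++k)          // one rank-1 update per step
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)  // dimension spread across PEs
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}
```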
Conference Paper
This paper examines various activity estimation techniques in order to determine which are most appropriate for use in the context of field-programmable gate arrays (FPGAs). Specifically, the paper compares how different activity estimation techniques affect the accuracy of FPGA power models and the ability of power-aware FPGA CAD tools to minimize power. After comparing various existing techniques, the most suitable existing techniques are combined with two novel enhancements to create a new activity estimation tool called ACE-2.0. Finally, the new publicly available tool is compared to existing tools to validate the improvements. Using activities estimated by ACE-2.0, the power estimates and power savings were both within 1% of the results obtained using simulated activities.
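As a baseline for what such tools estimate, a minimal C++ sketch of simulation-based activity estimation (hypothetical code, unrelated to ACE-2.0's internals): a signal's activity is the fraction of cycles on which it toggles, a quantity that power models feed into dynamic-power terms.

```cpp
// Hypothetical toggle-count activity estimator over a simulated waveform.
#include <cstdio>
#include <vector>

double activity(const std::vector<bool> &wave) {
    if (wave.size() < 2) return 0.0;
    int toggles = 0;
    for (std::size_t t = 1; t < wave.size(); ++t)
        toggles += wave[t] != wave[t - 1];  // count transitions
    return static_cast<double>(toggles) / (wave.size() - 1);
}

int main() {
    std::vector<bool> sig{0, 1, 1, 0, 1, 0, 0, 1};
    std::printf("activity = %.3f\n", activity(sig));  // 5 toggles / 7 cycles
}
```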
Conference Paper
The problem of automatically generating hardware modules from a high-level representation of an application has been at the research forefront in the last few years. In this paper, we use OpenCL, an industry-supported standard for writing programs that execute on multicore platforms and accelerators such as GPUs. Our architectural synthesis tool, SOpenCL (Silicon-OpenCL), adapts OpenCL into a novel hardware design flow which efficiently maps the coarse- and fine-grained parallelism of an application onto an FPGA reconfigurable fabric. SOpenCL is based on a source-to-source code transformation step that coarsens the OpenCL fine-grained parallelism into a series of nested loops, and on a template-based hardware generation back-end that configures the accelerator based on the functionality and the application performance and area requirements. Our experimentation with a variety of OpenCL and C kernel benchmarks reveals that area-, throughput- and frequency-optimized hardware implementations are attainable using SOpenCL.
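A minimal sketch of the coarsening step described (hypothetical code, not SOpenCL's output): the fine-grained work-items of the OpenCL kernel in the comment are serialized into a loop that the template-based back-end can then pipeline.

```cpp
// Hypothetical illustration of work-item coarsening.
//
// Original OpenCL kernel:
//   __kernel void saxpy(__global float *y, __global const float *x, float a) {
//       int i = get_global_id(0);
//       y[i] = a * x[i] + y[i];
//   }
//
// After coarsening, a loop replaces the NDRange of work-items:
void saxpy_coarsened(float *y, const float *x, float a, int global_size) {
    for (int i = 0; i < global_size; ++i)  // i replaces get_global_id(0)
        y[i] = a * x[i] + y[i];
}
```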
Conference Paper
Molecular Dynamics (MD) simulations, supported by parallel software and special-purpose hardware, are widely used in materials science, computational chemistry, and biology. With advances in FPGA capability and the inclusion of embedded multipliers, many studies have focused on FPGA-accelerated MD simulations. In this paper, we propose a system that implements on the FPGA the computation of the Lennard-Jones (LJ) force, which has been shown to dominate the overall execution time; the results are then transferred to the host, which handles all motion integration and other computations. To perform the computation efficiently on the FPGA, we present two methods: one combines a discrete function with interpolation to compute the high powers, and the other uses a filter to screen out particle pairs and exploits two LJ force calculators.
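For reference, a plain C++ sketch of the Lennard-Jones pair force being accelerated (a software reference, not the paper's FPGA datapath): with s = (σ/r)^6, the force on particle i from j is 24ε(2s² − s)/r² times the displacement vector; FPGA designs like the one described approximate the high powers via table lookup and interpolation.

```cpp
// Plain C++ reference for the Lennard-Jones pair force.
// Computes the force on particle i due to j, given their displacement.
void lj_force(double dx, double dy, double dz, double eps, double sigma,
              double f[3]) {
    double r2  = dx * dx + dy * dy + dz * dz;             // squared distance
    double sr2 = (sigma * sigma) / r2;
    double s6  = sr2 * sr2 * sr2;                         // (sigma/r)^6
    double mag = 24.0 * eps * (2.0 * s6 * s6 - s6) / r2;  // |F| / r
    f[0] = mag * dx;
    f[1] = mag * dy;
    f[2] = mag * dz;
}
```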