Traditional hardware description languages (HDLs), such as VHDL and Verilog, are widely used for designing digital electronic circuits, e.g., application-specific integrated circuits (ASICs), or for programming field-programmable gate arrays (FPGAs). However, using HDLs to implement complex algorithms or maintain large projects is tedious and time-consuming, even for experts. This also hinders the widespread adoption of FPGAs. As a solution, High-Level Synthesis (HLS) has been studied for decades to increase productivity by, ultimately, taking a behavioral description of an algorithm (what does the circuit do?) as design entry and automatically generating a register-transfer level (RTL) implementation. Commercial HLS tools start from well-known programming languages (e.g., C, C++, or OpenCL), which were initially developed for programmable devices with an instruction set architecture (ISA). Yet, these tools deliver satisfactory hardware synthesis results only when programmers describe hardware-favorable implementations of their applications (how is the circuit built?) by exploiting, e.g., a specific memory architecture, control path, and data path. This requires an in-depth understanding of hardware design principles. To adapt software programming languages to hardware design, each HLS tool uses its own language dialect and introduces a non-standard set of pragmas. This mixed use of software and hardware language abstractions hinders a purely behavioral design and makes optimizations hard to understand, since the expected code is neither a pure hardware description nor a regular software implementation. Furthermore, code optimized for one HLS tool has to be changed significantly to target another HLS tool and performs poorly on an ISA. We believe that the next step in HLS will be on the language side, overcoming the productivity, portability, and performance hurdles caused by the behavioral design deficiencies of existing tools.
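As a minimal sketch of this mixed abstraction, the C++ loop below is annotated with Vivado HLS pragmas; the function name and array sizes are illustrative, and an equivalent design for Intel's OpenCL-based flow would rely on kernel attributes instead of these pragmas, so the annotations do not carry over between tools.

```cpp
// Illustrative only: a point operator written as C++ for Vivado HLS.
// The pragmas describe *how* the circuit is built (interfaces, pipelining),
// not *what* the algorithm computes, and are specific to this one HLS tool.
void scale(const int in[1024], int out[1024], int factor) {
#pragma HLS INTERFACE ap_fifo port=in   // stream-like interface for the input
#pragma HLS INTERFACE ap_fifo port=out  // stream-like interface for the output
  for (int i = 0; i < 1024; ++i) {
#pragma HLS PIPELINE II=1               // request an initiation interval of one cycle
    out[i] = in[i] * factor;
  }
}
```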
This dissertation presents and evaluates three distinct solutions to separate the description of the behavior (what?) of an algorithm from its implementation (how?) while providing high-quality hardware synthesis results for the class of image processing applications. This is achieved by generating highly optimized, target-specific input code for commercial HLS tools from high-level abstractions that capture parallelism, locality, and memory access information of an input application. In these approaches, an image processing application is described as a set of basic building blocks, namely point, local, and global operators, without low-level implementation concerns. Then, optimized input code is generated for the selected HLS tool (Vivado HLS or the Intel FPGA SDK for OpenCL) using one of the following programming techniques: (i) a source-to-source compiler developed for an image processing domain-specific language (DSL), (ii) template metaprogramming to specialize input C++ programs at compile time, or (iii) a partial evaluation technique for specializing higher-order functions.
We present the first source-to-source compiler that generates optimized input code for the Intel FPGA SDK for OpenCL from a DSL. We use Heterogeneous Image Processing Acceleration (Hipacc), an image processing DSL and source-to-source compiler initially developed for targeting graphics processing units (GPUs). The Hipacc DSL offers high-level abstractions for point, local, and global operators in the form of language constructs. During code generation, the compiler front end transforms input DSL code into an abstract syntax tree (AST) representation using the Clang/LLVM compiler infrastructure. By leveraging domain knowledge captured from the input DSL code, our back end applies several transformations to generate a description of a streaming hardware pipeline. In the final step, Hipacc generates OpenCL code as input to Intel's HLS compiler. The quality of our hardware synthesis results rivals that of Intel's hand-optimized OpenCL code examples in terms of throughput and resource usage. Furthermore, Hipacc's code generation achieves significantly higher throughput and uses fewer resources than Intel's parallelization intrinsic.
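To illustrate the abstraction level, the sketch below shows a local operator (a 3x3 convolution) in the spirit of Hipacc's C++-embedded DSL. It follows the published Hipacc examples, but the exact class and method names may differ between Hipacc versions; the point is that no memory architecture, pipelining, or tool-specific pragmas appear in the description.

```cpp
// Sketch of a Hipacc-style local operator: the kernel states *what* is computed;
// line buffers, streaming, and parallelization are derived by the compiler back end.
class Blur : public Kernel<uchar> {
  Accessor<uchar> &input;
  Mask<float>     &mask;

 public:
  Blur(IterationSpace<uchar> &is, Accessor<uchar> &input, Mask<float> &mask)
      : Kernel(is), input(input), mask(mask) {
    add_accessor(&input);
  }

  void kernel() {
    // Weighted sum over the window described by the mask.
    output() = (uchar)convolve(mask, Reduce::SUM, [&]() -> float {
      return mask() * input(mask);
    });
  }
};
```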
Second, we present an approach based on template metaprogramming for developing modular and highly parameterizable function libraries that also deliver high-quality hardware synthesis results when compiled with HLS tools. In this approach, the library application programming interface (API) consists of high-level generic functions for declaring the building blocks of image processing applications, e.g., point, local, and global operators, unlike typical libraries such as OpenCV that offer functions for complete algorithms. The library is optimized with Vivado HLS best practices as well as hardware-centric design techniques such as deep pipelining, coarse-level parallelization, and bit-level optimizations. The library contains more than one template design for each algorithmic instance in order to provide implementations optimized for the given input parameters. For example, it includes multiple implementations of image border handling and coarse-level parallelization strategies tailored to different input parameters of a local operator specification. Furthermore, a compile-time selection algorithm is proposed that chooses the most suitable implementation according to an analytical model of resource usage, speed, and latency. In this way, low-level implementation details are hidden from users.
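The compile-time selection can be pictured as follows; the struct names, cost formulas, and parameters are purely illustrative and not taken from the library itself, but the mechanism, choosing among alternative template implementations via a constexpr analytical model, reflects the described technique.

```cpp
#include <type_traits>

// Hypothetical alternative implementations of a local operator, each exposing
// an analytical resource estimate as a compile-time constant.
template <int KernelSize, int VectorFactor>
struct SlidingWindowImpl {
  static constexpr int estimated_luts =
      80 * KernelSize * KernelSize + 40 * VectorFactor;  // illustrative model
  // ... streaming implementation with line buffers ...
};

template <int KernelSize, int VectorFactor>
struct BlockParallelImpl {
  static constexpr int estimated_luts =
      120 * KernelSize + 90 * VectorFactor * VectorFactor;  // illustrative model
  // ... coarse-level parallel implementation ...
};

// Compile-time selection: the cheaper implementation according to the model is
// chosen; users only instantiate LocalOp with their parameters.
template <int KernelSize, int VectorFactor>
using LocalOp = std::conditional_t<
    (SlidingWindowImpl<KernelSize, VectorFactor>::estimated_luts <=
     BlockParallelImpl<KernelSize, VectorFactor>::estimated_luts),
    SlidingWindowImpl<KernelSize, VectorFactor>,
    BlockParallelImpl<KernelSize, VectorFactor>>;

// Example: LocalOp<5, 4> resolves to one concrete implementation at compile time.
```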
In addition to the presented advantages of using high-level abstractions to raise the abstraction level in HLS, we show that this approach is beneficial for achieving performance portability across different computing platforms. Similar to FPGAs, the performance capabilities of central processing units (CPUs) and GPUs can be fully leveraged only when application programs are tuned with low-level, architecture-specific optimizations. These optimizations are based on fundamentally different programming paradigms and languages. As a solution, Khronos released OpenVX as the first industrial standard for the graph-based specification of computer vision (CV) applications. The graph-based specification allows a device-specific back end to optimize memory transfers between different CV functions. Furthermore, the standard hides low-level implementation details from the algorithm description; for instance, the memory hierarchy and device synchronization are not exposed to the user. However, the OpenVX standard supports only a small set of computer vision functions and does not offer a mechanism to incorporate user code into an OpenVX graph. As the next step, HipaccVX is presented as an OpenVX implementation and extension that supports code generation for a wide variety of computing platforms. HipaccVX leverages OpenVX's standard API and graph specification while offering new language constructs to describe algorithms using high-level abstractions that adhere to distinct memory access patterns (e.g., local operators). Thus, it supports the acceleration of user-defined code as well as OpenVX's CV functions. In this way, HipaccVX combines the benefits of DSL design techniques with an industrial standard specification.
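For reference, a standard OpenVX graph is assembled from built-in CV nodes as sketched below (a Gaussian blur feeding a Sobel filter, using only core OpenVX 1.x API calls; the image sizes are illustrative). HipaccVX keeps this graph-based entry point and additionally allows user-defined operators, expressed with its DSL-style constructs, to appear as nodes in the same graph.

```cpp
#include <VX/vx.h>

// A small OpenVX graph: Gaussian blur followed by a Sobel filter.
// The graph is verified and optimized by the OpenVX implementation before
// execution, which enables back-end-specific handling of intermediate buffers.
int main() {
  vx_context ctx = vxCreateContext();
  vx_graph graph = vxCreateGraph(ctx);

  vx_image in = vxCreateImage(ctx, 1920, 1080, VX_DF_IMAGE_U8);
  vx_image gx = vxCreateImage(ctx, 1920, 1080, VX_DF_IMAGE_S16);
  vx_image gy = vxCreateImage(ctx, 1920, 1080, VX_DF_IMAGE_S16);
  // Virtual image: the implementation may keep it on-chip instead of in DRAM.
  vx_image tmp = vxCreateVirtualImage(graph, 1920, 1080, VX_DF_IMAGE_U8);

  vxGaussian3x3Node(graph, in, tmp);
  vxSobel3x3Node(graph, tmp, gx, gy);

  if (vxVerifyGraph(graph) == VX_SUCCESS)
    vxProcessGraph(graph);

  vxReleaseGraph(&graph);
  vxReleaseContext(&ctx);
  return 0;
}
```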
Finally, we present AnyHLS, a novel approach that raises the abstraction level in HLS by using partial evaluation as a core compiler technology. A single language and a single function library are used to generate target-specific input code for two commercial HLS tools, namely Xilinx Vivado HLS and the Intel FPGA SDK for OpenCL. Hardware-centric optimizations that require code transformations are implemented as higher-order functions, without resorting to tool-specific pragma extensions. Extending AnyHLS with new functionality does not require modifications to a compiler or a code generator written in a different (host) language. Contrary to metaprogramming, the well-typedness of a residual program is guaranteed. As a result, significantly higher productivity than with existing techniques and an unprecedented level of portability across different HLS tools are achieved. Productivity, modularity, and portability gains are demonstrated by presenting an image processing library as a case study.
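To convey the flavor of this style, the rough analogue below packages a hardware-relevant pattern (a sliding-window traversal) as a higher-order function that takes the user's operator as an argument. It is deliberately written in C++ for readability and is not AnyHLS code; in AnyHLS, such combinators are written in a functional language and collapsed by the partial evaluator into a specialized, well-typed residual program before the HLS tool ever sees them.

```cpp
#include <cstdint>
#include <vector>

// Rough analogue (not AnyHLS code): a higher-order "local operator" combinator.
// The window traversal is written once; the user's reduction `op` is passed in
// as a function. Specializing make_local_op for a concrete op and window size
// yields a residual function that an HLS tool can map to a streaming pipeline.
template <typename Op>
auto make_local_op(Op op, int window) {
  return [=](const std::vector<uint8_t> &in) {
    std::vector<uint8_t> out(in.size(), 0);
    for (int i = window / 2; i + window / 2 < static_cast<int>(in.size()); ++i)
      out[i] = op(&in[i - window / 2], window);  // apply user code to each window
    return out;
  };
}

// Example user operator: a simple box average over the window.
static uint8_t box(const uint8_t *w, int n) {
  int sum = 0;
  for (int i = 0; i < n; ++i) sum += w[i];
  return static_cast<uint8_t>(sum / n);
}

// Usage: auto blur3 = make_local_op(box, 3); auto result = blur3(input);
```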