Bringing Parallel Performance to Python with Domain-Specific Selective Embedded Just-in-Time Specialization
Abstract
Today's productivity programmers, such as scientists who need to write code to do science, are typically forced to choose between productive and maintainable code with modest performance (e.g. Python plus native libraries such as SciPy [SciPy]) or complex, brittle, hardware-specific code that entangles application logic with performance concerns but runs two to three orders of magnitude faster (e.g. C++ with OpenMP, CUDA, etc.). The dynamic features of modern productivity languages like Python enable an alternative approach that bridges the gap between productivity and performance. SEJITS (Selective, Embedded, Just-in-Time Specialization) embeds domain-specific languages (DSLs) in high-level languages like Python for popular computational kernels such as stencils, matrix algebra, and others. At runtime, the DSLs are "compiled" by combining expert-provided source code templates specific to each problem type with a strategy for optimizing an abstract syntax tree representing a domain-specific but language-independent representation of the problem instance. The result is efficiency-level (e.g. C, C++) code callable from Python whose performance equals or exceeds that of handcrafted code, plus performance portability, since multiple code generation strategies within the same specializer can target different hardware present at runtime, e.g. multicore CPUs vs. GPUs. Application writers never leave the Python world, and we do not assume any modification or support for parallelism in Python itself. We present Asp ("Asp is SEJITS for Python") and initial results from several domains. We demonstrate that domain-specific specializers allow highly-productive Python code to obtain performance meeting or exceeding expert-crafted low-level code on parallel hardware, without sacrificing maintainability or portability.
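To make the runtime-specialization idea concrete, here is a minimal, self-contained sketch of the mechanism (illustrative only, not the actual Asp API): an expert-provided C template is filled in for a particular problem instance, compiled to a shared library, and called from Python. It assumes a C compiler is available as `cc` on the PATH.

```python
# Minimal sketch of selective runtime specialization (illustrative only, not Asp):
# fill an expert-provided C template for this problem instance, compile it,
# and call the resulting native code from Python. Assumes "cc" is on PATH.
import array
import ctypes
import os
import subprocess
import tempfile

TEMPLATE = """
void scale(double* x, long n) {
    for (long i = 0; i < n; ++i) x[i] *= %(factor)s;
}
"""

def specialize(factor):
    src = TEMPLATE % {"factor": repr(float(factor))}
    workdir = tempfile.mkdtemp()
    c_path = os.path.join(workdir, "kernel.c")
    so_path = os.path.join(workdir, "kernel.so")
    with open(c_path, "w") as f:
        f.write(src)
    subprocess.check_call(["cc", "-O3", "-shared", "-fPIC", c_path, "-o", so_path])
    lib = ctypes.CDLL(so_path)
    lib.scale.argtypes = [ctypes.POINTER(ctypes.c_double), ctypes.c_long]
    return lib.scale

data = array.array("d", range(8))
buf = (ctypes.c_double * len(data)).from_buffer(data)
specialize(2.5)(buf, len(data))   # the generated native code updates data in place
print(list(data))
```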
... Sepya, a Python DSEL for stencil computations, is one of the many varieties of extant specializers. It produces high-quality OpenMP or Cilk+ code whose performance is competitive with hand-crafted and autotuned code [9]. ...
... While the SEJITS concept is language-independent and works with any modern dynamic language that supports graceful embedding of DSELs [4], our prototype framework uses Python as both the source language and the language in which specializers (micro-compilers) are written. A typical specializer is a few hundred to a couple of thousand lines of Python [9]. ...
... In the second phase, a code generator specific to this pattern (stencils) maps the DSL-specific constructs to efficient data structures and traversal strategies for the desired target hardware. The existing Sepya implementation targets multicore CPUs with OpenMP or Cilk+ capable compilers [9]; in this work we add support for targeting GPUs via an OpenCL backend. A narrowly pattern-specific code generator working from a DAST can employ "best practice" optimization strategies specific to the <pattern,hardware> pair. ...
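As an illustration of the kind of domain-level description such a code generator starts from, the following sketch mimics the style of a Python-embedded stencil DSEL; the class and method names (Grid, interior_points, neighbors, kernel) are defined here in plain Python for clarity and are not the actual Sepya API. A specializer would walk the AST of the kernel method and emit OpenMP, Cilk+, or OpenCL code instead of executing the Python loops directly.

```python
# Illustrative sketch only: these classes mirror the style of Python stencil
# DSELs such as Sepya/Asp but are plain Python, not the real API.
import numpy as np

class Grid:
    """Minimal 2-D grid with interior-point and neighbor iteration."""
    def __init__(self, data):
        self.data = np.asarray(data, dtype=float)

    def interior_points(self):
        nx, ny = self.data.shape
        for i in range(1, nx - 1):
            for j in range(1, ny - 1):
                yield (i, j)

    def neighbors(self, pt, radius=1):
        i, j = pt
        yield (i - radius, j); yield (i + radius, j)
        yield (i, j - radius); yield (i, j + radius)

    def __getitem__(self, pt): return self.data[pt]
    def __setitem__(self, pt, v): self.data[pt] = v

class AverageKernel:
    """Domain-level description: average the four nearest neighbors.
    A specializer would translate this method rather than run it."""
    def kernel(self, in_grid, out_grid):
        for x in out_grid.interior_points():
            out_grid[x] = 0.25 * sum(in_grid[y] for y in in_grid.neighbors(x))

src = Grid(np.random.rand(6, 6))
dst = Grid(np.zeros((6, 6)))
AverageKernel().kernel(src, dst)   # pure-Python semantics of the stencil
```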
The SEJITS framework supports creating embedded domain-specific languages (DSELs) and code generators, a pair of which is called a specializer, with much less effort than creating a full DSL compiler: typically just a few hundred lines of code. SEJITS' main benefit is allowing application writers to stay entirely in high-level languages such as Python by using specialized Python functions (that is, functions written in one of the Python-embedded DSELs) to generate code that runs at native speed. One existing SEJITS DSEL is Sepya [10], a Python DSEL for stencil computations that generates OpenMP and Cilk+ code competitive with existing DSL compilers such as Pochoir and Halide. We extend Sepya to generate OpenCL code for targeting GPUs, and in the process extend SEJITS with support for meta-specializers, whose job is to enable and optimize the composition of existing specializers written by third parties. In this work, we demonstrate meta-specialization by detecting and removing extraneous data copies to and from the GPU when composing multiple specializer calls (stencil and non-stencil). We also explore variants of loop fusion to further improve the performance of composing these operations. The generated stencil code is 20x faster than SciPy and competitive with existing stencil DSELs on realistic code excerpts. Since meta-specializers must compose and optimize specializers created by third parties, we extend SEJITS with support for meta-specializer hooks, allowing existing specializers to be incrementally enabled for meta-specialization without breaking backwards compatibility. Together, the Sepya and SEJITS extensions broaden the range of platforms for which highly optimized code can be generated and open new possibilities for optimizing the composition of existing specializers.
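A toy model of the copy-elimination idea follows; it is purely illustrative and does not use the real meta-specializer hooks. Each specializer call is tagged with the device it runs on, and host/device transfers are emitted only when consecutive calls disagree, so back-to-back GPU calls share a single transfer instead of copying data back and forth.

```python
# Illustrative sketch of copy elision between composed specializer calls
# (not the real SEJITS meta-specializer hooks).
def plan_copies(calls):
    """calls: list of (name, device) pairs, device in {"cpu", "gpu"}."""
    plan, location = [], "cpu"          # data starts in host memory
    for name, device in calls:
        if device != location:          # copy only when the data must move
            plan.append(f"copy {location} -> {device}")
            location = device
        plan.append(f"run {name} on {device}")
    if location != "cpu":
        plan.append("copy gpu -> cpu")  # final result back to the host
    return plan

pipeline = [("stencil_a", "gpu"), ("elementwise_scale", "gpu"), ("reduce_sum", "cpu")]
print("\n".join(plan_copies(pipeline)))
# the two GPU calls share one host-to-device transfer instead of two round trips
```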
... Our methodology for developing the specialization framework for accelerating audio content analysis applications using a high-level scripting language is based on the Selective Embedded Just-In-Time Specialization (SEJITS) mechanism originally described in [16]. In this work we use a specific implementation of SEJITS called Asp [17]. The general concept of providing portable, high-performance abstractions was pioneered by autotuning libraries such as FFTW [18], Spiral [19], and OSKI [20]. ...
... 2. Re-segmentation: In each iteration of the agglomeration, we re-segment the feature vectors into subsets using majority vote segmentation (lines 16-18). We use the GPU to compute the log-likelihoods (gmm.score() ...
... The ability to generate code at runtime makes specializers applicable in cases where a library would not be able to provide sufficient flexibility and performance. This could be because the computation itself is too general for a library: an example of this is the domain of stencil computations [4]. It would not be possible to compile implementations of all possible stencils up front, but a specializer can lower a stencil function, given as code in its Python-embedded DSL, down to C++, making it capable of generating optimized code for arbitrary stencils. ...
... As such, this is an out-of-place computation. A wide variety of work has been done to optimize stencil computations on shared-memory multiprocessors, including communication-avoiding optimizations [69,105], auto-tuners [27,56], standalone domain-specific compilers [97], and embedded domain-specific compilers [20,55,58]. Rather than reproduce this body of work, we wish to take advantage of existing libraries in order to perform the shared-memory portion of the stencil computation. ...
Stencil computations comprise an important class of kernels in many scientific computing applications. As the diversity of both architectures and programming models grow, autotuning is emerging as a critical strategy for achieving portable performance across a broad range of execution contexts for stencil computations. However, costly tuning overhead is a major obstacle to its popularity. In this work, we propose a fast stencil autotuning framework FAST based on an Optimal-Solution Space (OSS) model to significantly improve tuning speed. It leverages a feature extractor that comprehensively characterizes stencil computation. Using the extracted features, FAST constructs an OSS database to train an off-line model which provides an on-line prediction. We evaluate FAST with five important stencil computation applications on both an Intel Xeon multicore CPU and an NVIDIA Tesla K20c GPU. Compared with state-of-the-art stencil autotuners like Patus and SDSL, FAST improves autotuning speed by 10-2697 times without any user annotation, while achieving comparable performance.
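The division of labor this abstract describes (characterize the stencil offline, consult a trained model online) can be sketched roughly as follows; the feature names, database entries, and nearest-neighbor lookup are all hypothetical stand-ins for FAST's actual feature extractor and OSS model.

```python
# Hypothetical sketch of offline-trained / online-queried tuning prediction;
# not FAST's actual features or model.
import math

OSS_DB = [  # (features: (dimensionality, stencil points, bytes per cell), best config)
    ((3, 7, 8),  {"block": 64,  "unroll": 2}),
    ((3, 27, 8), {"block": 32,  "unroll": 4}),
    ((2, 5, 4),  {"block": 128, "unroll": 1}),
]

def predict_config(features):
    """Nearest-neighbor lookup standing in for the trained offline model."""
    return min(OSS_DB, key=lambda row: math.dist(row[0], features))[1]

print(predict_config((3, 13, 8)))    # online prediction for an unseen stencil
```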
The matrix powers kernel, used in communication-avoiding Krylov subspace methods, requires runtime auto-tuning for best performance. We demonstrate how the SEJITS (Selective Embedded Just-In-Time Specialization) approach can be used to deliver a high-performance and performance-portable implementation of the matrix powers kernel to application authors, while separating their high-level concerns from those of auto-tuner implementers involving low-level optimizations. The benefits of delivering this kernel in the form of a specializer, rather than a traditional library, are discussed. Performance of the matrix powers kernel specializer is evaluated in the context of a communication-avoiding conjugate gradient (CA-CG) solver, which compares favorably to traditional CG.
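For readers unfamiliar with the kernel itself, a plain NumPy sketch of what the matrix powers kernel computes (without any of the specializer's communication-avoiding optimizations) looks like this:

```python
# Illustrative sketch (plain NumPy, not the specializer): the matrix powers
# kernel computes the Krylov basis [x, Ax, A^2 x, ..., A^k x] in one pass,
# which communication-avoiding Krylov methods then consume k steps at a time.
import numpy as np

def matrix_powers(A, x, k):
    V = [x]
    for _ in range(k):
        V.append(A @ V[-1])          # next basis vector
    return np.stack(V)               # shape (k+1, n)

A = np.random.rand(5, 5)
x = np.random.rand(5)
basis = matrix_powers(A, x, 3)
```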
Basic block vectorization consists in realizing instruction-level parallelism inside basic blocks in order to generate SIMD instructions and thus speed up data processing. It is however problematic, because the vectorized program may actually be slower than the original one. Therefore, it would be useful to predict beforehand whether or not vectorization will actually produce any speedup. This paper proposes to do so by expressing vectorization profitability as a classification problem, and by predicting it using a machine learning technique called support vector machine (SVM). It considers three compilers (icc, gcc and LLVM), and a benchmark suite made of 151 loops, unrolled with factors ranging from 1 to 20. The paper further proposes a technique that combines the results of two SVMs to reach 99% accuracy for all three compilers. Moreover, by correctly predicting unprofitable vectorizations, the technique presented in this paper provides speedups of up to 2.16 times, 2.47 times and 3.83 times for icc, gcc and LLVM, respectively (9%, 18% and 56% on average). It also lowers to less than 1% the probability of the compiler generating a slower program with vectorization turned on (from more than 25% for the compilers alone).
Basic block vectorization consists in extracting instruction-level parallelism inside basic blocks in order to generate SIMD instructions and thus speed up data processing. It is however a double-edged technique, because the vectorized program may actually be slower than the original one. Therefore, it would be useful to predict beforehand whether or not vectorization could actually produce any speedup. In this article, we propose to do so by using a machine learning technique called support vector machine. We consider a benchmark suite containing 151 loops, unrolled with factors ranging from 1 to 20. We make our predictions offline, both before and after unrolling. Our contribution is threefold. First, we correctly predict the profitability of vectorization for 70% of the programs in both cases. Second, we propose a list of static software characteristics that successfully describe our benchmark with respect to our goal. Finally, we determine that machine learning makes it possible to significantly improve the quality of the code generated by the Intel compiler, with speedups of up to 2.2 times.
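A hedged sketch of the classification setup used in these two papers is shown below; the loop features and labels are synthetic placeholders rather than the papers' 151-loop benchmark or feature list, and scikit-learn's SVC stands in for whatever SVM implementation the authors used.

```python
# Sketch of vectorization-profitability prediction as binary classification.
# Features, labels, and data are synthetic placeholders.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# hypothetical static features per loop: (arithmetic intensity, trip count,
# unroll factor, fraction of contiguous accesses), all normalized to [0, 1]
X = rng.random((200, 4))
y = (X[:, 0] + X[:, 3] > 1.0).astype(int)    # stand-in label: "vectorization profitable"

clf = SVC(kernel="rbf").fit(X[:150], y[:150])
print("held-out accuracy:", clf.score(X[150:], y[150:]))
# at compile time, vectorization would be enabled only when clf.predict(...) == 1
```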
Today's "high productivity" programming languages such as Python lack the performance of harder-to- program "efficiency" languages (CUDA, Cilk, C with OpenMP) that can exploit extensive programmer knowl- edge of parallel hardware architectures. We com- bine efficiency-language performance with productivity- language programmability using selective embedded just-in-time specialization (SEJITS). At runtime, we specialize (generate, compile, and execute efficiency- language source code for) an application-specific and platform-specific subset of a productivity language, largely invisibly to the application programmer. Be- cause the specialization machinery is implemented in the productivity language itself, it is easy for efficiency programmers to incrementally add specializers for new domain abstractions, new hardware, or both. SEJITS has the potential to bridge productivity-layer research and efficiency-layer research, allowing domain experts to exploit different parallel hardware architectures with a fraction of the programmer time and effort usually re- quired.
Understanding the most efficient design and utilization of emerging multicore systems is one of the most challenging questions faced by the mainstream and scientific computing industries in several decades. Our work explores multicore stencil (nearest-neighbor) computations, a class of algorithms at the heart of many structured grid codes, including PDE solvers. We develop a number of effective optimization strategies, and build an auto-tuning environment that searches over our optimizations and their parameters to minimize runtime, while maximizing performance portability. To evaluate the effectiveness of these strategies we explore the broadest set of multicore architectures in the current HPC literature, including the Intel Clovertown, AMD Barcelona, Sun Victoria Falls, IBM QS22 PowerXCell 8i, and NVIDIA GTX280. Overall, our auto-tuning optimization methodology results in the fastest multicore stencil performance to date. Finally, we present several key insights into the architectural tradeoffs of emerging multicore designs and their implications on scientific algorithm development.
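The core of such an auto-tuning environment is a search loop that times candidate optimization parameters and keeps the fastest. The toy version below searches over block sizes for a 2-D 5-point stencil in NumPy; it is illustrative only, not the paper's framework or its optimization space.

```python
# Illustrative auto-tuning loop: time several candidate block sizes for a
# 2-D 5-point stencil and keep the best-performing one.
import time
import numpy as np

def stencil_blocked(a, out, block):
    n = a.shape[0]
    for i0 in range(1, n - 1, block):
        for j0 in range(1, n - 1, block):
            i1, j1 = min(i0 + block, n - 1), min(j0 + block, n - 1)
            out[i0:i1, j0:j1] = 0.25 * (a[i0-1:i1-1, j0:j1] + a[i0+1:i1+1, j0:j1]
                                        + a[i0:i1, j0-1:j1-1] + a[i0:i1, j0+1:j1+1])

a = np.random.rand(1024, 1024)
out = np.zeros_like(a)
best = None
for block in (32, 64, 128, 256, 512):       # the search space of tuning parameters
    t0 = time.perf_counter()
    stencil_blocked(a, out, block)
    dt = time.perf_counter() - t0
    best = min(best or (dt, block), (dt, block))
print("fastest block size:", best[1])
```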
Typically, scientists with computational needs prefer to use high-level languages such as Python or MATLAB; however, large computationally-intensive problems must eventually be recoded in a low-level language such as C or Fortran by expert programmers in order to achieve sufficient performance. In addition, multiple strategies may exist for mapping a problem onto parallel hardware depending on the input data size and the hardware parameters. We show how to preserve the productivity of high-level languages while obtaining the performance of the best low-level language code variant for a given hardware platform and problem size using SEJITS, a set of techniques that leverages just-in-time code generation and compilation. As a case study, we demonstrate our technique for Gaussian Mixture Model training using the EM algorithm. With the addition of one line of code to import our framework, a domain programmer using an existing Python GMM library can run her program unmodified on a GPU-equipped computer and achieve performance that meets or beats GPU code hand-crafted by a human expert. We also show that despite the overhead of allowing the domain expert's program to use Python and the overhead of just-in-time code generation and compilation, our approach still results in performance competitive with hand-crafted GPU code.
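The computation being specialized here is ordinary EM training of a Gaussian mixture. A compact pure-NumPy version of the E- and M-steps that the generated GPU code would accelerate is sketched below; it uses diagonal covariances and illustrative parameter names, and is not the paper's library.

```python
# Pure-NumPy sketch of EM for a Gaussian Mixture Model (diagonal covariances);
# illustrative only, not the specialized GMM library from the paper.
import numpy as np

def em_gmm(X, k, iters=50, eps=1e-6):
    n, d = X.shape
    rng = np.random.default_rng(0)
    mu = X[rng.choice(n, k, replace=False)]          # component means
    var = np.ones((k, d))                            # diagonal covariances
    pi = np.full(k, 1.0 / k)                         # mixture weights
    for _ in range(iters):
        # E-step: responsibilities r[i, j] = P(component j | x_i)
        diff = X[:, None, :] - mu[None, :, :]
        log_p = -0.5 * np.sum(diff**2 / var + np.log(2 * np.pi * var), axis=2)
        log_p += np.log(pi)
        log_p -= log_p.max(axis=1, keepdims=True)
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances
        nk = r.sum(axis=0) + eps
        pi = nk / n
        mu = (r.T @ X) / nk[:, None]
        var = (r.T @ (X**2)) / nk[:, None] - mu**2 + eps
    return pi, mu, var

pi, mu, var = em_gmm(np.random.rand(200, 3), k=4)
```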
Although stencil auto-tuning has shown tremendous potential in effectively utilizing architectural resources, it has hitherto been limited to single kernel instantiations; in addition, the large variety of stencil kernels used in practice makes this computation pattern difficult to assemble into a library. This work presents a stencil auto-tuning framework that significantly advances programmer productivity by automatically converting a straightforward sequential Fortran 95 stencil expression into tuned parallel implementations in Fortran, C, or CUDA, thus allowing performance portability across diverse computer architectures, including the AMD Barcelona, Intel Nehalem, Sun Victoria Falls, and the latest NVIDIA GPUs. Results show that our generalized methodology delivers significant performance gains of up to 22x speedup over the reference serial implementation. Overall we demonstrate that such domain-specific auto-tuners hold enormous promise for architectural efficiency, programmer productivity, performance portability, and algorithmic adaptability on existing and emerging multicore systems.
Manufacturers will likely offer multiple products with differing numbers of cores to cover multiple price-performance points, since Moore's Law will permit the doubling of the number of cores per chip every two years. While diversity may be understandable in this time of uncertainty, it exacerbates the already difficult jobs of programmers, compiler writers, and even architects. Hence, an easy-to-understand model that offers performance guidelines would be especially valuable. This article proposes one such model called Roofline, demonstrating it on four diverse multicore computers using four key floating-point kernels. The proposed Roofline model ties together floating-point performance, operational intensity, and memory performance in a 2D graph. The Roofline sets an upper bound on performance of a kernel depending on the kernel's operational intensity. If people think of operational intensity as a column that hits the roof, either it hits the flat part of the roof, meaning performance is compute-bound, or it hits the slanted part of the roof, meaning performance is ultimately memory-bound.
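The bound the Roofline model draws can be written as a one-liner: attainable performance is the minimum of peak compute throughput and operational intensity times peak memory bandwidth. A small numeric sketch (machine numbers are made up for illustration):

```python
# Minimal sketch of the Roofline bound: attainable GFLOP/s is capped either
# by peak compute or by operational intensity times peak memory bandwidth.
def roofline(peak_gflops, peak_gbps, operational_intensity):
    """operational_intensity in flops/byte; returns the performance ceiling."""
    return min(peak_gflops, operational_intensity * peak_gbps)

# e.g. a 0.5 flop/byte stencil on a machine with 100 GFLOP/s and 25 GB/s
print(roofline(100.0, 25.0, 0.5))   # -> 12.5 GFLOP/s: memory-bound
```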
As clock frequencies have tapered off and the number of cores on a chip has taken off, the challenge of effectively utilizing these multicore systems has become increasingly important. However, the diversity of multicore machines in today's market compels us to individually tune for each platform. This is especially true for problems with low computational intensity, since the improvements in memory latency and bandwidth are much slower than those of computational rates. One such kernel is a stencil, a regular nearest neighbor operation over the points in a structured grid. Stencils often arise from solving partial differential equations, which are found in almost every scientific discipline. In this thesis, we analyze three common three-dimensional stencils: the 7-point stencil, the 27-point stencil, and the Gauss-Seidel Red-Black Helmholtz kernel. We examine the performance of these stencil codes over a spectrum of multicore architectures, including the Intel Clovertown, Intel Nehalem, AMD Barcelona, the highly-multithreaded Sun Victoria Falls, and the low power IBM Blue Gene/P. These platforms not only have significant variations in their core architectures, but also exhibit a 32x range in available hardware threads, a 4.5x range in attained DRAM bandwidth, and a 6.3x range in peak flop rates. Clearly, designing optimal code for such a diverse set of platforms represents a serious challenge. Unfortunately, compilers alone do not achieve satisfactory stencil code performance on this varied set of platforms. Instead, we have created an automatic stencil code tuner, or auto-tuner, that incorporates several optimizations into a single software framework. These optimizations hide memory latency, account for non-uniform memory access times, reduce the volume of data transferred, and take advantage of special instructions. The auto-tuner then searches over the space of optimizations, thereby allowing for much greater productivity than hand-tuning. The fully auto-tuned code runs up to 5.4x faster than a straightforward implementation and is more scalable across cores. By using performance models to identify performance limits, we determined that our auto-tuner can achieve over 95% of the attainable performance for all three stencils in our study. This demonstrates that auto-tuning is an important technique for fully exploiting available multicore resources.
A stencil computation repeatedly updates each point of a d-dimensional grid as a function of itself and its near neighbors. Parallel cache-efficient stencil algorithms based on "trapezoidal decompositions" are known, but most programmers find them difficult to write. The Pochoir stencil compiler allows a programmer to write a simple specification of a stencil in a domain-specific stencil language embedded in C++ which the Pochoir compiler then translates into high-performing Cilk code that employs an efficient parallel cache-oblivious algorithm. Pochoir supports general d-dimensional stencils and handles both periodic and aperiodic boundary conditions in one unified algorithm. The Pochoir system provides a C++ template library that allows the user's stencil specification to be executed directly in C++ without the Pochoir compiler (albeit more slowly), which simplifies user debugging and greatly simplified the implementation of the Pochoir compiler itself. A host of stencil benchmarks run on a modern multicore machine demonstrates that Pochoir outperforms standard parallel-loop implementations, typically running 2-10 times faster. The algorithm behind Pochoir improves on prior cache-efficient algorithms on multidimensional grids by making "hyperspace" cuts, which yield asymptotically more parallelism for the same cache efficiency.