Article

Code Refinement of Stencil Codes

Author affiliation:
  • Deutsches Forschungszentrum für Künstliche Intelligenz, Saarbrücken, Germany

Abstract

A straightforward implementation of an algorithm in a general-purpose programming language usually does not deliver peak performance: compilers often fail to automatically tune the code for hardware peculiarities such as the memory hierarchy or vector execution units. Manually tuning the code is, first, error-prone as well as time-consuming and, second, taints the code by exposing those peculiarities in the implementation. A popular way to avoid these problems is to implement the algorithm in a Domain-Specific Language (DSL); a DSL compiler can then automatically tune the code for the target platform. In this article, we show how to embed a DSL for stencil codes in another language. In contrast to prior approaches, we use only a single language for this task, one that offers explicit control over code refinement. We use this control to specialize stencils for particular scenarios. Our results show that the specialized programs achieve performance competitive with hand-tuned CUDA programs while retaining a convenient coding experience.
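The abstract contains no code, but as a rough illustration of the kind of specialization a stencil DSL compiler can perform, the following CUDA sketch contrasts a generic stencil kernel with a compile-time specialized variant. The kernel names and the fixed mask extents are illustrative assumptions and are not taken from the paper.

```cuda
#include <cuda_runtime.h>

// Generic 2D stencil: the mask size is known only at run time, so the loops
// cannot be unrolled and the mask stays in global memory.
__global__ void stencil_generic(const float* in, float* out,
                                const float* mask, int mw, int mh,
                                int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float acc = 0.0f;
    for (int j = 0; j < mh; ++j) {
        for (int i = 0; i < mw; ++i) {
            int sx = min(max(x + i - mw / 2, 0), width - 1);   // clamp at the borders
            int sy = min(max(y + j - mh / 2, 0), height - 1);
            acc += mask[j * mw + i] * in[sy * width + sx];
        }
    }
    out[y * width + x] = acc;
}

// Specialized variant: the mask extent is a template parameter, so the
// compiler can fully unroll both loops and address the mask at constant
// offsets, mimicking what a DSL compiler could generate for a known stencil.
template <int MW, int MH>
__global__ void stencil_specialized(const float* in, float* out,
                                    const float* __restrict__ mask,
                                    int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float acc = 0.0f;
    #pragma unroll
    for (int j = 0; j < MH; ++j) {
        #pragma unroll
        for (int i = 0; i < MW; ++i) {
            int sx = min(max(x + i - MW / 2, 0), width - 1);
            int sy = min(max(y + j - MH / 2, 0), height - 1);
            acc += mask[j * MW + i] * in[sy * width + sx];
        }
    }
    out[y * width + x] = acc;
}
```

Instantiating, say, stencil_specialized<3, 3> performs the unrolling manually via templates; the abstract attributes this kind of transformation to the DSL compiler, so that such details never appear in the user's code.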


... Such methods are typically hand-crafted and manually tuned for a specific target platform. An alternative approach is the use of compilers that transform the underlying interpreter logic to improve performance [13]. ...
Article
Interpreters are well researched in the fields of compiler construction and program generation. They are typically used to execute programs of different programming languages without a compilation step. However, they can also be used to model complex rule-based simulations: the interpreter applies all rules one after another, and these rules can be applied iteratively on a globally updated state in order to obtain the final simulation result. Many simulations for domain-specific problems already leverage the parallel processing capabilities of Graphics Processing Units (GPUs), using hardware-specific, tuned rule implementations to achieve maximum performance. However, every interpreter-based system requires a high-level algorithm that detects active rules and determines when they are evaluated. A common approach in this context is the use of different interpreter routines for every problem domain. Executing such functions efficiently mainly involves dealing with hardware peculiarities like thread divergence, ALU computations, and memory operations. Furthermore, the interpreter is nowadays often executed on multiple states in parallel, which is particularly important for heuristic search or what-if analyses, for instance. In this paper, we present a novel and easy-to-implement method based on thread compaction to realize generic rule-based interpreters efficiently on GPUs. It is optimized for many states using a specially designed memory layout. Benchmarks on our evaluation scenarios show that performance can be significantly increased in comparison to existing, commonly used implementations.
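The abstract only names thread compaction as the key idea; the following CUDA fragment is a minimal sketch of that general pattern under assumed data structures (active_flags, work_list): active state indices are first compacted into a dense work list so that the subsequent rule-evaluation kernel launches only non-idle, non-divergent threads.

```cuda
#include <cuda_runtime.h>

// Step 1: write the indices of active states into a dense work list.
// A single atomic counter is the simplest compaction scheme; real
// implementations often use a parallel prefix sum instead. The host must
// zero *work_count before the launch and read it back afterwards.
__global__ void compact_active(const int* active_flags, int num_states,
                               int* work_list, int* work_count) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= num_states) return;
    if (active_flags[idx]) {
        int slot = atomicAdd(work_count, 1);
        work_list[slot] = idx;
    }
}

// Step 2: every launched thread now maps to an active state, so the rule
// evaluation runs without idle threads in a warp.
__global__ void apply_rules(const int* work_list, int work_count,
                            float* states) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= work_count) return;
    int state = work_list[i];
    states[state] += 1.0f;  // placeholder for the actual rule logic
}
```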
Chapter
The current trend of using artificial neural networks to solve computationally intensive problems is omnipresent. In this scope, DeepQ learning is a common choice for agent-based problems. DeepQ combines the concept of Q-learning with (deep) neural networks to learn different Q-values/matrices based on environmental conditions. Unfortunately, DeepQ learning requires hundreds of thousands of iterations/Q-samples that must be generated and learned for large-scale problems. Gathering data sets for such challenging tasks is extremely time-consuming and requires large data-storage containers. Consequently, a common solution is the automatic generation of input samples for agent-based DeepQ networks. However, the usual workflow creates the samples separately from the training process, either in a (set of) pre-processing step(s) or interleaved with the training process. This requires the input Q-samples to be materialized in order to be fed into the training step of the attached neural network. In this paper, we propose a new GPU-focused method for on-the-fly generation of training samples tightly coupled with the training process itself. This allows us to skip the materialization of all samples (e.g., avoid dumping them to disk), as they are (re)constructed when needed. Our method significantly outperforms usual workflows that generate the input samples on the CPU in terms of runtime performance and memory/storage consumption.

Keywords: Massively-parallel processing · Neural networks · Q-learning · Graphics processing units · GPUs · State construction
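No code accompanies this abstract; the sketch below only illustrates the underlying idea of on-the-fly sample construction, using hypothetical names and a placeholder reconstruction function: each training sample is rebuilt from a compact descriptor directly into the batch buffer consumed by the training step, so the samples are never materialized on disk.

```cuda
#include <cuda_runtime.h>

// Cheap deterministic hash standing in for the real state construction:
// the same (state_id, seed, feature) triple always yields the same value,
// so a sample can be reconstructed whenever it is needed.
__device__ float reconstruct_feature(unsigned state_id, unsigned seed, int f) {
    unsigned h = state_id * 2654435761u ^ (seed + (unsigned)f * 40503u);
    return (h & 0xFFFFu) / 65535.0f;
}

// Fills the batch buffer that the training step consumes in the next launch;
// only the compact state ids and a seed are ever stored.
__global__ void build_batch(const unsigned* state_ids, unsigned seed,
                            int batch_size, int num_features,
                            float* batch /* batch_size * num_features */) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int total = batch_size * num_features;
    if (idx >= total) return;
    int sample = idx / num_features;
    int feature = idx % num_features;
    batch[idx] = reconstruct_feature(state_ids[sample], seed, feature);
}
```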
Chapter
Graphics Processing Units (GPUs) are widespread nowadays due to their parallel processing capabilities. Leveraging these hardware features is particularly important for computationally expensive tasks and workloads. Prominent use cases are optimization problems and simulations that can be parallelized and tuned for these architectures. In the general domain of simulations (numerical and discrete), the overall logic is split into several components that are executed one after another. These components need step-size information, which determines the number of steps (e.g., the elapsed time) they have to perform. Small step sizes are often required to ensure a valid simulation result with respect to precision and constraint correctness. Unfortunately, they are often the main bottleneck of the simulation. In this paper, we introduce a new and generic way of realizing high-performance simulations with multiple components using adaptive time steps on GPUs. Our method relies on a code-analysis phase that resolves data dependencies between different components. This knowledge is used to generate specially tuned execution kernels that encapsulate the underlying component logic. An evaluation on our simulation benchmarks shows that we are able to considerably improve runtime performance compared to prior work.
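As a rough illustration of the adaptive-time-step loop described here (the kernel generation from data-dependency analysis is not shown), the following CUDA sketch lets every element propose a step size and reduces the proposals to per-block minima; the stability criterion and all identifiers are placeholder assumptions.

```cuda
#include <cuda_runtime.h>
#include <cfloat>

// Each component/element proposes the largest step it can take without
// violating its constraints. The criterion below is only a placeholder.
__global__ void propose_steps(const float* state, int n, float* proposals) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    proposals[i] = fminf(1.0f, 0.1f / (fabsf(state[i]) + 1e-6f));
}

// Standard shared-memory minimum reduction: one partial minimum per block,
// the final minimum is taken on the host and drives the next step.
// Launch with blockDim.x a power of two and blockDim.x * sizeof(float)
// bytes of dynamic shared memory.
__global__ void block_min(const float* proposals, int n, float* block_mins) {
    extern __shared__ float cache[];
    int tid = threadIdx.x;
    float local = FLT_MAX;
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += gridDim.x * blockDim.x)
        local = fminf(local, proposals[i]);
    cache[tid] = local;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) cache[tid] = fminf(cache[tid], cache[tid + s]);
        __syncthreads();
    }
    if (tid == 0) block_mins[blockIdx.x] = cache[0];
}
```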
Chapter
Many optimization problems (especially nonsmooth ones) are typically solved by genetic, evolutionary, or metaheuristic-based algorithms. However, these genetic approaches and related work typically assume the existence of a neighborhood or successor-state function N(x), where x is a candidate state. The implementation of such a function can become arbitrarily complex in the field of combinatorial optimization. Many N(x) functions for a huge variety of domain-specific problems have been developed in the past to address this general problem. However, it has always been a great challenge to port or realize these functions on a massively parallel architecture like a Graphics Processing Unit (GPU). We present a GPU-based method called FANG that implements a generic and reusable N(x) for arbitrary domains in the field of combinatorial optimization. It can be customized to satisfy domain-specific requirements and, by construction, leverages the underlying hardware in a fast and efficient way. Moreover, our method scales well with respect to the number of input states and the complexity of a single state. Measurements on our evaluation scenarios show significant performance improvements compared to traditional exploration approaches that run on the CPU.
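The abstract describes a generic GPU-side neighborhood function N(x) without giving code; the following CUDA sketch shows one simple instantiation of that idea, where each thread produces one neighbor by applying an adjacent-swap move to a copy of its input state. The move and all identifiers are illustrative assumptions, not the paper's actual interface.

```cuda
#include <cuda_runtime.h>

// One thread per (state, move) pair: each thread copies its input state and
// applies a single swap move to produce one neighbor. Adjacent swaps serve
// only as an example of a domain-specific modification.
__global__ void generate_neighbors(const int* states,   // num_states * state_len
                                   int num_states, int state_len,
                                   int* neighbors) {    // num_states * (state_len-1) * state_len
    int num_moves = state_len - 1;                       // adjacent-swap moves
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= num_states * num_moves) return;

    int state = tid / num_moves;
    int move  = tid % num_moves;

    const int* src = states + state * state_len;
    int* dst = neighbors + tid * state_len;
    for (int i = 0; i < state_len; ++i) dst[i] = src[i];  // copy the candidate state

    int tmp = dst[move];                                   // apply the move: swap
    dst[move] = dst[move + 1];
    dst[move + 1] = tmp;
}
```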
Article
We give a mathematically rigorous definition of a grid for algorithms solving partial differential equations. Unlike previous approaches (Benger 2005, PhD thesis; Berti 2000, PhD thesis), our grids have a hierarchical structure. This makes them suitable for geometric multigrid algorithms and hierarchical local grid refinement. The description is also general enough to include geometrically non-conforming grids. The definitions in this article serve as the basis for an implementation of an abstract grid interface as C++ classes in the framework (Bastian et al. 2008, this issue).
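As a minimal, purely illustrative C++ sketch (not the grid interface defined in the article; all class and member names are assumptions), a hierarchical grid can be thought of as one entity set per refinement level, where each entity records its father on the coarser level, which is the relation geometric multigrid and local refinement need to traverse.

```cuda
#include <vector>
#include <cstddef>

struct Entity {
    std::size_t index;  // index within its level
    int level;          // refinement level, 0 = coarsest
    long father;        // index of the father on level-1, -1 on level 0
};

struct HierarchicalGrid {
    std::vector<std::vector<Entity>> levels;  // levels[l] = all entities on level l

    int maxLevel() const { return static_cast<int>(levels.size()) - 1; }

    // Refine an entity by appending its children on the next finer level.
    void refine(const Entity& e, std::size_t num_children) {
        if (e.level + 1 >= static_cast<int>(levels.size()))
            levels.resize(e.level + 2);
        auto& fine = levels[e.level + 1];
        for (std::size_t c = 0; c < num_children; ++c)
            fine.push_back({fine.size(), e.level + 1,
                            static_cast<long>(e.index)});
    }
};
```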