Conference Paper

Static GPU Threads and an Improved Scan Algorithm


Abstract

Current GPU programming systems automatically distribute work across all GPU processors based on a set of fixed assumptions, e.g. that all tasks are independent of each other. We show that automatic distribution limits algorithmic design, and demonstrate that manual work distribution adds hardly any overhead. Our Scan+ algorithm is an improved scan relying on manual work distribution. It uses global barriers and task interleaving to provide almost twice the performance of Apple's reference implementation [1].
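The full text is not available here, and the original implementation targets OpenCL on 2010-era hardware, so the following is only a minimal CUDA sketch of the general idea the abstract describes: distributing work manually over a fixed set of thread blocks and synchronizing them with grid-wide barriers inside a single kernel, instead of splitting the scan across several kernel launches. The sketch uses cooperative groups for the barrier (the paper predates this API and rolls its own mechanism), omits the task interleaving mentioned in the abstract, and all names are illustrative rather than taken from the paper.

```cuda
// scan_plus_sketch.cu -- illustrative only; NOT the paper's Scan+ implementation.
// Idea shown: one kernel, a manually sized grid, and grid-wide barriers in place
// of extra kernel launches. Requires a device that supports cooperative launch.
#include <cooperative_groups.h>
#include <cstdio>
#include <vector>

namespace cg = cooperative_groups;

__global__ void single_kernel_scan(const int *in, int *out, int *block_sums, int n)
{
    cg::grid_group grid = cg::this_grid();
    extern __shared__ int tile[];
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    // Phase 1: each block builds an inclusive scan of its tile in shared memory.
    tile[tid] = (gid < n) ? in[gid] : 0;
    __syncthreads();
    for (int off = 1; off < blockDim.x; off <<= 1) {
        int t = (tid >= off) ? tile[tid - off] : 0;
        __syncthreads();
        tile[tid] += t;
        __syncthreads();
    }
    if (tid == blockDim.x - 1) block_sums[blockIdx.x] = tile[tid];

    grid.sync();  // global barrier instead of returning control to the host

    // Phase 2: one thread turns the per-block sums into exclusive block offsets.
    if (blockIdx.x == 0 && tid == 0) {
        int running = 0;
        for (int b = 0; b < gridDim.x; ++b) {
            int s = block_sums[b];
            block_sums[b] = running;
            running += s;
        }
    }

    grid.sync();  // second global barrier

    // Phase 3: add the block offset to obtain the final inclusive scan.
    if (gid < n) out[gid] = tile[tid] + block_sums[blockIdx.x];
}

int main()
{
    const int block = 256, gridSize = 64, n = block * gridSize;  // kept small so
    int *d_in, *d_out, *d_sums;                                  // all blocks fit
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMalloc(&d_sums, gridSize * sizeof(int));

    std::vector<int> h(n, 1);
    cudaMemcpy(d_in, h.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    int nArg = n;
    void *args[] = { &d_in, &d_out, &d_sums, &nArg };
    // Cooperative launch guarantees that every block is resident at the same time,
    // which is what makes the in-kernel global barrier legal.
    cudaLaunchCooperativeKernel((void *)single_kernel_scan, dim3(gridSize),
                                dim3(block), args, block * sizeof(int), 0);
    cudaMemcpy(h.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    printf("last prefix = %d (expected %d)\n", h.back(), n);
    return 0;
}
```

A practical implementation would additionally loop over several tiles per block so that the fixed, manually sized grid can cover arbitrary input sizes.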


... fall on the left piece and (x_i[2], y_i[2]), (x_i[3], y_i[3]) fall on the right piece of the function. However, since the turning point of the function is unknown at the compilation time, we need to do a runtime check to guarantee correctness. ...
... However, since the turning point of the function is unknown at the compilation time, we need to do a runtime check to guarantee correctness. Specifically, we check whether the actual initial value x is between x_i[1] and x_i[2], which are initial values of the two middle sample points. If it is the case, true results can be derived through the reconstruction function above. ...
Conference Paper
Much research work has been done to parallelize loops with recurrences over the last several decades. Recently, a sampling-and-reconstruction method was proposed to parallelize a broad class of loops with recurrences in an automated fashion, with a practical runtime approach. Although the parallelized codes achieve linear scalability across multicore architectures, the sequential merge inherent to this method makes it not scalable on many-core architectures such as GPUs. At the same time, existing parallel merge approaches used for simple reduction loops cannot be directly and correctly applied to this method. Based on this observation, we propose new methods to merge partial results in parallel on GPUs and achieve linear scalability. Our approach involves refined runtime-checking rules to avoid unnecessary runtime check failures and reduce the overhead of reprocessing. We also propose a sample-convergence technique to reduce the number of sample points so that communication and computation overhead is reduced. Finally, based on GPU architectural features, we develop optimization techniques to further improve performance. Our evaluation of a set of representative algorithms shows that our parallel merge implementation is substantially more efficient than sequential merge, and achieves linear scalability on different GPUs.
... As for the GPU platform, several parallel algorithms have been introduced for the prefix operation [2,5,6,12,32]. For instance, in [2], the authors proposed a parallel algorithm based on minimizing the synchronization and maximizing the concurrent execution on the GPU. ...
... For instance, in [2], the authors proposed a parallel algorithm based on minimizing the synchronization and maximizing the concurrent execution on the GPU. Meanwhile, Breitbart [5] introduced the concept of static work-groups instead of automatic distribution and applied this concept to improve the scanning performance by almost a factor of 2. Capannini [6] presented a parallel algorithm on an NVIDIA 275 GTX device that was based on minimizing both memory usage and conflict. In the experiments, a maximum 32-bit word size and a maximum block size of 2048 elements were used, and the results showed that the running time was reduced by up to 25% as compared to the previous results. ...
Article
Full-text available
The prefix computation strategy is a fundamental technique used to solve many problems in computer science such as sorting, clustering, and computer vision. A large number of parallel algorithms have been introduced that are based on a variety of high-performance systems. However, these algorithms do not consider the cost of the prefix computation operation. In this paper, we design a novel strategy for prefix computation to reduce the running time for high-cost operations such as multiplication. The proposed algorithm is based on (1) reducing the size of the partition and (2) keeping a fixed-size partition during all the steps of the computation. Experiments on a multicore system for different array sizes and number sizes demonstrate that the proposed parallel algorithm reduces the running time of the best-known optimal parallel algorithm in the average range of 62.7–79.6%. Moreover, the proposed algorithm has high speedup and is more scalable than those in previous works.
... Other attempts to enhance the standard GPU hardware and software scheduler are related to the implementation of Persistent Threads and related applications [6], [7]. User-defined scheduling policies are obtained by batching many kernel calls into a single invocation to then arbitrate the execution of blocks of GPU threads within one or more persistently executing GPU threads [8]. ...
... This is how, for instance, a CUDA application might share a buffer with an OpenGL renderer, or how a video feed or a camera might share frames to be consumed by a CUDA application. EGLStreams act at user space level, flagging processes as data producer and consumer, but allowing these roles to switch over time. At driver level, sharing of EGLStream objects translates into pushing acquire and release syncpoint operations in the Command Push Buffer. ...
... Previously, a way to minimize CPU offloading times was to batch compute kernel calls into a single or reduced number of command submissions. This approach, based on persistent kernels, proved to have beneficial effects in terms of average performance [5,15,6,16] and for obtaining a higher degree of control when scheduling blocks of GPU threads [31,10]. The problem with persistent kernels is that they often require a substantial code rewriting effort and are not able to properly manage Copy Engine operations mixed with computing kernels. ...
Conference Paper
Full-text available
There is an increasing industrial and academic interest towards a more predictable characterization of real-time tasks on high-performance heterogeneous embedded platforms, where a host system offloads parallel workloads to an integrated accelerator, such as General Purpose-Graphic Processing Units (GP-GPUs). In this paper, we analyze an important aspect that has not yet been considered in the real-time literature, and that may significantly affect real-time performance if not properly treated, i.e., the time spent by the CPU for submitting GP-GPU operations. We will show that the impact of CPU-to-GPU kernel submissions may be indeed relevant for typical real-time workloads, and that it should be properly factored in when deriving an integrated schedulability analysis for the considered platforms. This is the case when an application is composed of many small and consecutive GPU compute/copy operations. While existing techniques mitigate this issue by batching kernel calls into a reduced number of persistent kernel invocations, in this work we present and evaluate three other approaches that are made possible by recently released versions of the NVIDIA CUDA GP-GPU API, and by Vulkan, a novel open standard GPU API that allows an improved control of GPU command submissions. We will show that this added control may significantly improve the application performance and predictability due to a substantial reduction in CPU-to-GPU driver interactions, making Vulkan an interesting candidate for becoming the state-of-the-art API for heterogeneous Real-Time systems. Our findings are evaluated on a latest-generation NVIDIA Jetson AGX Xavier embedded board, executing typical workloads involving Deep Neural Networks of parameterized complexity.
... The Persistent Thread programming model [12] is another attempt to "hijack" the undisclosed GPU scheduling policies by batching many kernel calls into a single invocation to then arbitrate the execution of blocks of GPU threads with user-defined scheduling and synchronization policies [23]. By reducing driver overhead, both performance and GPU utilization are shown to be significantly better than with traditional offloading models for various applications [6,7]. Despite these performance improvements, persistent threads leave even less control to the CPU to preempt GPU activities. ...
Conference Paper
In embedded systems, CPUs and GPUs typically share main memory. The resulting memory contention may significantly inflate the duration of CPU tasks in a hard-to-predict way. Although initial solutions have been devised to control this undesired inflation, these approaches do not consider the interference due to memory-intensive components in COTS embedded systems, such as integrated Graphics Processing Units. Dealing with this kind of interference might require custom-made hardware components that are not integrated in off-the-shelf platforms. We address these important issues by proposing a memory-arbitration mechanism, SiGAMMA (SiΓ), for eliminating the interference on CPU tasks caused by conflicting memory requests from the GPU. Tasks on the CPU are assumed to comply with a prefetch-based execution model (PREM) proposed in the real-time literature, while memory accesses from the GPU are arbitrated through a predictable mechanism that avoids contention. Our experiments show that SiΓ proves to be very effective in guaranteeing almost null inflation to memory phases of CPU tasks, while at the same time avoiding excessive starvation of GPU tasks.
... Work queues can be made to allow work to be inserted from the CPU [CVKG10], and support work stealing [CGSS11] and work donation [TPO10]. Persistent threads approaches in combination with work queues have been used in a variety of application areas: sparse matrix-vector multiplication [BG09], sorting algorithms [SHG09], scan algorithms [Bre10], and construction of kd-trees [VHBS16]. ...
Article
Full-text available
While the modern graphics processing unit (GPU) offers massive parallel compute power, the ability to influence the scheduling of these immense resources is severely limited. Therefore, the GPU is widely considered to be only suitable as an externally controlled co-processor for homogeneous workloads which greatly restricts the potential applications of GPU computing. To address this issue, we present a new method to achieve fine-grained priority scheduling on the GPU: hierarchical bucket queuing. By carefully distributing the workload among multiple queues and efficiently deciding which queue to draw work from next, we enable a variety of scheduling strategies. These strategies include fair-scheduling, earliest-deadline-first scheduling and user-defined dynamic priority scheduling. In a comparison with a sorting-based approach, we reveal the advantages of hierarchical bucket queuing over previous work. Finally, we demonstrate the benefits of using priority scheduling in real-world applications by example of path tracing and foveated micropolygon rendering.
... This is obtained by employing a persistent software entity running on the GPU device, launched through a CUDA kernel invocation, which remains alive until all the jobs scheduled into a work-queue are exhausted. This approach led to interesting results in diverse areas of application such as active data structures [11], prefix-sum operation [12] and even 3D graphic rendering with ray-tracing [13], but to the best of our knowledge it has never been applied to genetic algorithms. The novelty of this paper is to study how a GA can be successfully ported to a persistent-thread style model and to measure its performance compared to "traditional" GAs on GP-GPUs that are implemented with multiple CUDA kernel invocations or more or less implicit CPU-GPU synchronization points. ...
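As a concrete illustration of the pattern described in this excerpt, the following is a minimal CUDA sketch of a persistent worker kernel draining a work-queue: blocks stay resident and repeatedly claim job indices from a global counter until the queue is empty. The job body, sizes and all names are placeholders for the example and are not taken from the cited genetic-algorithm paper.

```cuda
// persistent_queue.cu -- minimal sketch of the persistent-thread work-queue pattern.
// The "job" (square every element of one chunk) is a placeholder, not the cited
// paper's genetic-algorithm workload.
#include <cstdio>

__global__ void persistent_worker(float *data, int jobCount, int jobSize, int *nextJob)
{
    __shared__ int job;                          // job currently owned by this block
    for (;;) {
        if (threadIdx.x == 0)
            job = atomicAdd(nextJob, 1);         // claim the next unprocessed job
        __syncthreads();
        if (job >= jobCount) break;              // queue exhausted: the block retires

        // Process one job: each thread handles a strided portion of the chunk.
        for (int i = threadIdx.x; i < jobSize; i += blockDim.x) {
            int idx = job * jobSize + i;
            data[idx] = data[idx] * data[idx];
        }
        __syncthreads();                         // finish this job before claiming another
    }
}

int main()
{
    const int jobCount = 1024, jobSize = 4096;
    float *d_data; int *d_next;
    cudaMalloc(&d_data, (size_t)jobCount * jobSize * sizeof(float));
    cudaMalloc(&d_next, sizeof(int));
    cudaMemset(d_data, 0, (size_t)jobCount * jobSize * sizeof(float));
    cudaMemset(d_next, 0, sizeof(int));

    // Launch only as many blocks as comfortably stay resident; they remain alive
    // until every job has been claimed, instead of one kernel launch per job.
    persistent_worker<<<60, 128>>>(d_data, jobCount, jobSize, d_next);
    cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}
```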
Conference Paper
In this paper we present a heavily exploration-oriented implementation of genetic algorithms to be executed on graphics processing units (GPUs) that is optimized with our novel mechanism for scheduling GPU-side synchronized jobs, which takes inspiration from the concept of persistent threads. Persistent Threads allow an efficient distribution of workloads throughout the GPU so as to fully exploit the CUDA (NVIDIA's proprietary Compute Unified Device Architecture) architecture. Our approach (named Scheduled Light Kernel, SLK) uses a specifically designed data structure for issuing sequences of commands from the CPU to the GPU, able to minimize CPU-GPU communications, exploit streams of concurrent execution of different device-side functions within different Streaming Multiprocessors, and minimize kernel launch overhead. Results obtained in two completely different experimental settings show that our approach is able to dramatically increase the performance of the tested genetic algorithms compared to the baseline implementation that (while still running on a GPU) does not exploit our proposed approach. Our proposed SLK approach does not require substantial code rewriting and is also compared to newly introduced features in the latest CUDA development toolkit, such as nested kernel invocations for dynamic parallelism.
... Further improvements include exploiting producer-consumer locality via local shared memory [Satish et al. 2009; Breitbart 2011; Yan et al. 2013]. The persistent threads model can further be extended to support GPU-wide synchronization via message passing [Stuart and Owens 2009; Luo et al. 2010; Xiao and Feng 2010] and dynamic priorities [Steinberger et al. 2012]. ...
Conference Paper
Full-text available
In this paper, we present Whippletree, a novel approach to scheduling dynamic, irregular workloads on the GPU. We introduce a new programming model which offers the simplicity and expressiveness of task-based parallelism while retaining all aspects of the multi-level execution hierarchy essential to unlocking the full potential of a modern GPU. At the same time, our programming model lends itself to efficient implementation on the SIMD-based architecture typical of a current GPU. We demonstrate the practical utility of our model by providing a reference implementation on top of current CUDA hardware. Furthermore, we show that our model compares favorably to traditional approaches in terms of both performance as well as the range of applications that can be covered. We demonstrate the benefits of our model for recursive Reyes rendering, procedural geometry generation and volume rendering with concurrent irradiance caching.
... Our existing CPU algorithm uses barriers and communication between threads, which is not trivial to implement on GPUs. We used a technique called static threadblocks [5] to nevertheless implement these. With static threadblocks, a fixed number of threadblocks per multiprocessor is started, so it is guaranteed that all threadblocks are active at the same time. ...
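The guarantee in this excerpt (every launched block is simultaneously active) is exactly what inter-block barriers and communication need in order to be safe. The sketch below shows one way to size such a launch in CUDA using the occupancy API; the cited technique [5] is formulated for OpenCL, so treat this as an assumed translation rather than the original recipe, and the kernel is only a stand-in.

```cuda
// static_blocks_launch.cu -- sizing a launch so that every block is co-resident,
// which is the property the "static threadblocks" technique relies on.
#include <cstdio>

__global__ void worker(int totalBlocks)
{
    // With a fixed, fully resident grid, blockIdx.x behaves like a stable
    // "virtual processor id", and blocks may safely spin-wait on one another.
    if (blockIdx.x == 0 && threadIdx.x == 0)
        printf("running with %d co-resident blocks\n", totalBlocks);
}

int main()
{
    int device = 0, threads = 256, blocksPerSM = 0;
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);
    // How many blocks of this kernel fit on one multiprocessor at this block size?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, worker, threads, 0);
    int totalBlocks = blocksPerSM * prop.multiProcessorCount;

    // Launch exactly that many blocks: none of them waits behind another, so
    // inter-block barriers and producer/consumer spinning cannot deadlock.
    worker<<<totalBlocks, threads>>>(totalBlocks);
    cudaDeviceSynchronize();
    return 0;
}
```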
Conference Paper
The processing power and parallelism in hardware is expected to increase rapidly over the next years, whereas memory bandwidth per flop and the amount of main memory per flop will be falling behind. These trends will result both in more algorithms becoming limited by memory bandwidth and in overall memory requirements becoming an important factor for algorithm design. In this paper we study the Gauss-Seidel stencil as an example of a memory-bandwidth-limited algorithm. We consider GPUs and NUMA systems, which are both designed to provide high memory bandwidth at the cost of making algorithm design more complex. The mapping of the non-linear memory access pattern of the Gauss-Seidel stencil to the different hardware is important to achieve high performance. We show that there is a trade-off between overall performance and memory requirements when optimizing for an optimal memory access pattern. Vectorizing on the NUMA system and optimizing to utilize all processors on the GPU do not pay off in terms of performance per memory used, which we consider an important measurement given the trends named above.
... A WG can be considered as a unit that runs a certain number of WIs and thus accomplishes a certain task. Static WGs were first introduced by one of the authors in [3], and are a kind of container to which these tasks can be mapped. A static WG is assigned to a CU, where it runs the tasks in a user-defined sequential order without preemption. ...
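To make the static work-group idea in this excerpt concrete, here is a small CUDA rendering of it: one resident block plays the role of a static WG and executes a user-supplied list of tasks for its compute unit in order, without preemption. The task body, the per-WG task limit and the mapping are invented for the example; they are not taken from [3].

```cuda
// static_workgroup.cu -- sketch of a static work-group executing a user-defined
// sequence of tasks without preemption; the task body is a placeholder.
constexpr int MAX_TASKS = 32;                    // per-WG task list capacity (example)

__global__ void static_workgroup(const int *taskList, const int *taskCount,
                                 float *buffers, int taskSize)
{
    int wg = blockIdx.x;                         // one static WG per resident block
    for (int t = 0; t < taskCount[wg]; ++t) {    // user-defined order, no preemption
        int task = taskList[wg * MAX_TASKS + t];
        float *buf = buffers + (size_t)task * taskSize;
        for (int i = threadIdx.x; i < taskSize; i += blockDim.x)
            buf[i] += 1.0f;                      // placeholder task body
        __syncthreads();                         // finish task t before task t+1
    }
}

int main()
{
    const int numWG = 8, tasksPerWG = 4, taskSize = 1024;
    int hCount[numWG], hList[numWG * MAX_TASKS] = {};
    for (int w = 0; w < numWG; ++w) {            // trivial user-defined mapping
        hCount[w] = tasksPerWG;
        for (int t = 0; t < tasksPerWG; ++t) hList[w * MAX_TASKS + t] = w * tasksPerWG + t;
    }
    int *dList, *dCount; float *dBuf;
    size_t bufBytes = (size_t)numWG * tasksPerWG * taskSize * sizeof(float);
    cudaMalloc(&dList, sizeof(hList));
    cudaMalloc(&dCount, sizeof(hCount));
    cudaMalloc(&dBuf, bufBytes);
    cudaMemset(dBuf, 0, bufBytes);
    cudaMemcpy(dList, hList, sizeof(hList), cudaMemcpyHostToDevice);
    cudaMemcpy(dCount, hCount, sizeof(hCount), cudaMemcpyHostToDevice);
    static_workgroup<<<numWG, 128>>>(dList, dCount, dBuf, taskSize);
    cudaDeviceSynchronize();
    return 0;
}
```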
Article
Current processor architectures are diverse and heterogeneous. Examples include multicore chips, GPUs and the Cell Broadband Engine (CBE). The recent Open Compute Language (OpenCL) standard aims at efficiency and portability. This paper explores its efficiency when implemented on the CBE, without using CBE-specific features such as explicit asynchronous memory transfers. We based our experiments on two applications: matrix multiplication, and the client side of the Einstein@Home distributed computing project. Both were programmed in OpenCL, and then translated to the CBE. For matrix multiplication, we deployed different levels of OpenCL performance optimization, and observed that they pay off on the CBE. For Einstein@Home, our translated OpenCL version achieves almost the same speed as a native CBE version. We experimented with two versions of the OpenCL to CBE mapping, in which the PPE component of the CBE does or does not take the role of a compute unit. Another major contribution of the paper is a proposal for two OpenCL extensions that we analyzed for both CBE and NVIDIA GPUs. First, we suggest an additional memory level in OpenCL, called static local memory. With little programming expense, it can lead to significant speedups, such as a factor of seven for reduction on the CBE and about 20% on NVIDIA GPUs. Second, we introduce static work-groups to support user-defined mappings of tasks. Static work-groups may simplify programming and lead to speedups of 35% (CBE) and 100% (GPU) for all-parallel-prefix-sums.
Article
The Boys function, a mathematical integral function, plays a pivotal role and is frequently evaluated in ab initio molecular orbital computations. The main contribution of this paper is to accelerate the bulk evaluation of the Boys function through the effective utilization of GPUs. The proposed GPU implementation addresses GPU‐specific programming issues such as warp divergence and coalesced/stride access to global memory, and we employ the optimal numerical evaluation method from four methods based on input values to ensure efficient computation with sufficient accuracy. Moreover, to consider actual computation of molecular integrals, we have implemented and evaluated the proposed method in two scenarios: single evaluation, which computes a single value of the Boys function for a single input, and incremental evaluation, which computes multiple values of the Boys function incrementally. The execution time of the proposed GPU implementation was evaluated for both scenarios using an NVIDIA A100 Tensor Core GPU. As a result, the GPU‐accelerated bulk evaluation has achieved a throughput of computing the values of the Boys function times per second for the single evaluation and times per second for the incremental evaluation, respectively. Our parallelized CPU and GPU implementation is available at https://github.com/sstsuji/Boys‐function‐GPU‐library .
Article
Scan (also known as prefix sum) is a very useful primitive for various important parallel algorithms, such as sort, BFS, SpMV, compaction and so on. The current state of the art in GPU-based scan implementations consists of three consecutive Reduce-Scan-Scan phases. This approach requires at least two global barriers and 3N (N is the problem size) global memory accesses. In this paper we propose StreamScan, a novel approach to implement scan on GPUs with only one computation phase. The main idea is to restrict synchronization to only adjacent workgroups, thereby eliminating global barrier synchronization completely. The new approach requires only 2N global memory accesses and just one kernel invocation. On top of this we propose two important optimizations to further boost performance, namely thread grouping to eliminate unnecessary local barriers, and register optimization to expand the on-chip problem size. We designed an auto-tuning framework to search the parameter space automatically and generate highly optimized codes for both AMD and Nvidia GPUs. We implemented our technique with OpenCL. Compared with previous fast scan implementations, experimental results not only show promising performance speedups, but also reveal dramatically different optimization tradeoffs between Nvidia and AMD GPU platforms.
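The key idea in this abstract (synchronizing only adjacent workgroups) can be sketched in a few dozen lines. The CUDA code below chains blocks left to right: each block computes a local scan, spins on its left neighbour's flag to obtain the running prefix, adds its own sum and publishes it. StreamScan itself is an OpenCL implementation with further optimizations (thread grouping, register blocking, auto-tuning) that are not reproduced here; all names are illustrative.

```cuda
// chained_scan.cu -- sketch of adjacent-workgroup synchronization for scan:
// each block waits only for its left neighbour's prefix, never for a global barrier.
#include <cstdio>

__global__ void chained_scan(const int *in, int *out, volatile int *blockPrefix,
                             volatile int *flag, int *blockCounter, int n)
{
    extern __shared__ int tile[];
    __shared__ int bid;          // dynamically assigned block id
    __shared__ int exclusive;    // sum of everything to the left of this block

    if (threadIdx.x == 0)
        bid = atomicAdd(blockCounter, 1);  // ids follow actual start order, so the
    __syncthreads();                       // left neighbour is guaranteed to be running

    int gid = bid * blockDim.x + threadIdx.x;

    // Local inclusive scan of this block's tile (shared-memory Hillis-Steele).
    tile[threadIdx.x] = (gid < n) ? in[gid] : 0;
    __syncthreads();
    for (int off = 1; off < blockDim.x; off <<= 1) {
        int t = (threadIdx.x >= off) ? tile[threadIdx.x - off] : 0;
        __syncthreads();
        tile[threadIdx.x] += t;
        __syncthreads();
    }

    // One thread chains with the left neighbour: wait, accumulate, publish.
    if (threadIdx.x == 0) {
        exclusive = 0;
        if (bid > 0) {
            while (flag[bid - 1] == 0) { }        // spin on the neighbour's flag
            exclusive = blockPrefix[bid - 1];
        }
        blockPrefix[bid] = exclusive + tile[blockDim.x - 1];
        __threadfence();                          // make the prefix visible first,
        flag[bid] = 1;                            // then raise the flag
    }
    __syncthreads();

    if (gid < n) out[gid] = tile[threadIdx.x] + exclusive;
}

int main()
{
    const int block = 256, grid = 1024, n = block * grid;
    int *d_in, *d_out, *d_prefix, *d_flag, *d_counter;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMalloc(&d_prefix, grid * sizeof(int));
    cudaMalloc(&d_flag, grid * sizeof(int));
    cudaMalloc(&d_counter, sizeof(int));
    cudaMemset(d_in, 0, n * sizeof(int));         // input left as zeros for brevity
    cudaMemset(d_flag, 0, grid * sizeof(int));
    cudaMemset(d_counter, 0, sizeof(int));

    chained_scan<<<grid, block, block * sizeof(int)>>>(d_in, d_out, d_prefix,
                                                       d_flag, d_counter, n);
    cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}
```

Note that the data itself is read once and written once (2N global accesses), with only a small amount of extra traffic for the per-block prefixes, which matches the abstract's claim.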
Conference Paper
In this paper, we characterize and analyze an increasingly popular style of programming for the GPU called Persistent Threads (PT). We present a concise formal definition for this programming style, and discuss the difference between the traditional GPU programming style (nonPT) and PT, why PT is attractive for some high-performance usage scenarios, and when using PT may or may not be appropriate. We identify limitations of the nonPT style and four primary use cases where PT could be useful: CPU-GPU synchronization, load balancing/irregular parallelism, producer-consumer locality, and global synchronization. Through micro-kernel benchmarks we show the PT approach can achieve up to an order-of-magnitude speedup over nonPT kernels, but can also result in performance loss in many cases. We conclude by discussing the hardware and software fundamentals that will influence the development of Persistent Threads as a programming style in future systems.
Conference Paper
Full-text available
While GPGPU stands for general-purpose computation on graphics processing units, the lack of explicit support for inter-block communication on the GPU arguably hampers its broader adoption as a general-purpose computing device. Inter-block communication on the GPU occurs via global memory and then requires barrier synchronization across the blocks, i.e., inter-block GPU communication via barrier synchronization. Currently, such synchronization is only available via the CPU, which in turn, can incur significant overhead. We propose two approaches for inter-block GPU communication via barrier synchronization: GPU lock-based synchronization and GPU lock-free synchronization. We then evaluate the efficacy of each approach via a micro-benchmark as well as three well-known algorithms - Fast Fourier Transform (FFT), dynamic programming, and bitonic sort. For the micro-benchmark, the experimental results show that our GPU lock-free synchronization performs 8.4 times faster than CPU explicit synchronization and 4.0 times faster than CPU implicit synchronization. When integrated with the FFT, dynamic programming, and bitonic sort algorithms, our GPU lock-free synchronization further improves performance by 10%, 26%, and 40%, respectively, and ultimately delivers an overall speed-up of 70x, 13x, and 24x, respectively.
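For reference, the following CUDA fragment shows an inter-block barrier in the spirit of the lock-based variant described in this abstract (a single atomic counter that every block increments and then spins on); the lock-free variant replaces the counter with per-block arrival and release flag arrays and is not shown. The demo kernel and all names are illustrative, and the barrier is only safe when every block of the grid is resident, as with the static-threadblock launches discussed above.

```cuda
// interblock_barrier.cu -- sketch of an atomic-counter inter-block barrier.
#include <cstdio>

__device__ void gpu_barrier(int goal, int *mutex)
{
    __syncthreads();                             // the whole block has arrived
    if (threadIdx.x == 0) {
        atomicAdd(mutex, 1);                     // announce this block's arrival
        while (atomicAdd(mutex, 0) < goal) { }   // spin until all blocks arrived
    }
    __syncthreads();                             // release the whole block
}

__global__ void barrier_demo(int *data, int *mutex)
{
    int b = blockIdx.x, nb = gridDim.x;
    if (threadIdx.x == 0) data[b] = b * b;       // phase 1: publish a per-block value
    __threadfence();                             // make it visible device-wide
    gpu_barrier(nb, mutex);                      // no return to the host needed
    if (threadIdx.x == 0 && b == 0)              // phase 2: read another block's value
        printf("block 0 sees data[%d] = %d\n", nb - 1, data[nb - 1]);
}

int main()
{
    const int blocks = 32;                       // must all be co-resident
    int *d_data, *d_mutex;
    cudaMalloc(&d_data, blocks * sizeof(int));
    cudaMalloc(&d_mutex, sizeof(int));
    cudaMemset(d_mutex, 0, sizeof(int));
    barrier_demo<<<blocks, 128>>>(d_data, d_mutex);
    cudaDeviceSynchronize();
    return 0;
}
```

Reusing such a barrier within one kernel requires either increasing the goal value on every call or resetting the counter, which is one of the details the cited paper's flag-based lock-free scheme handles differently.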
Conference Paper
This paper explores the challenges in implementing a message passing interface usable on systems with data-parallel processors. As a case study, we design and implement the "DCGN" API on NVIDIA GPUs that is similar to MPI and allows full access to the underlying architecture. We introduce the notion of data-parallel thread-groups as a way to map resources to MPI ranks. We use a method that also allows the data-parallel processors to run autonomously from user-written CPU code. In order to facilitate communication, we use a sleep-based polling system to store and retrieve messages. Unlike previous systems, our method provides both performance and flexibility. By running a test suite of applications with different communication requirements, we find that a tolerable amount of overhead is incurred, somewhere between one and five percent depending on the application, and indicate the locations where this overhead accumulates. We conclude that with innovations in chipsets and drivers, this overhead will be mitigated and provide similar performance to typical CPU-based MPI implementations while providing fully-dynamic communication.
Conference Paper
Current processor architectures are diverse and heterogeneous. Examples include multicore chips, GPUs and the Cell Broadband Engine (CBE). The recent Open Compute Language (OpenCL) standard aims at efficiency and portability. This paper explores its efficiency when implemented on the CBE, without using CBE-specific features such as explicit asynchronous memory transfers. We based our experiments on two applications: matrix multiplication, and the client side of the Einstein@Home distributed computing project. Both were programmed in OpenCL, and then translated to the CBE. For matrix multiplication, we deployed different levels of OpenCL performance optimization, and observed that they pay off on the CBE. For the Einstein@Home application, our translated OpenCL version achieves almost the same speed as a native CBE version. Another main contribution of the paper is a proposal for an additional memory level in OpenCL, called static local memory. With little programming expense, it can lead to significant speedups, such as a factor of seven for reduction. Finally, we studied two versions of the OpenCL to CBE mapping, in which the PPE component of the CBE does or does not take the role of a compute unit.
Book
Programming Massively Parallel Processors: A Hands-on Approach, Third Edition shows both student and professional alike the basic concepts of parallel programming and GPU architecture, exploring, in detail, various techniques for constructing parallel programs. Case studies demonstrate the development process, detailing computational thinking and ending with effective and efficient parallel programs. Topics of performance, floating-point format, parallel patterns, and dynamic parallelism are covered in-depth. For this new edition, the authors have updated their coverage of CUDA, including coverage of newer libraries, such as CuDNN, moved content that has become less important to appendices, added two new chapters on parallel patterns, and updated case studies to reflect current industry practices. Teaches computational thinking and problem-solving techniques that facilitate high-performance parallel computing Utilizes CUDA version 7.5, NVIDIA's software development tool created specifically for massively parallel environments Contains new and updated case studies Includes coverage of newer libraries, such as CuDNN for Deep Learning © 2017, 2013, 2010 David B. Kirk/NVIDIA Corporation and Wen-mei W. Hwu. Published by Elsevier Inc. All rights reserved.
Article
A study of the effects of adding two scan primitives as unit-time primitives to PRAM (parallel random access machine) models is presented. It is shown that the primitives improve the asymptotic running time of many algorithms by an O(log n) factor, greatly simplify the description of many algorithms, and are significantly easier to implement than memory references. It is argued that the algorithm designer should feel free to use these operations as if they were as cheap as a memory reference. The author describes five algorithms that clearly illustrate how the scan primitives can be used in algorithm design: a radix-sort algorithm, a quicksort algorithm, a minimum-spanning-tree algorithm, a line-drawing algorithm, and a merging algorithm. These all run on an EREW (exclusive read, exclusive write) PRAM with the addition of two scan primitives and are either simpler or more efficient than their pure PRAM counterparts. The scan primitives have been implemented in microcode on the Connection Machine system and are available in PARIS (the parallel instruction set of the machine).
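As a small illustration of the "scan as a primitive" style of algorithm design described in this abstract, here is one split step of a radix sort (a stable partition by a single bit) in which an exclusive scan supplies every element's destination address. The scan is done serially on the host purely for brevity; in a GPU radix sort it would of course be a device scan. All names and values are illustrative.

```cuda
// split_with_scan.cu -- a stable split by one bit, built on an exclusive scan,
// mirroring the way the cited article builds radix sort from the scan primitive.
#include <cstdio>
#include <vector>

__global__ void scatter_by_bit(const unsigned *keys, const int *leftAddr,
                               unsigned *out, int numLeft, int bit, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    bool goesLeft = ((keys[i] >> bit) & 1u) == 0;
    // leftAddr[i] = number of "left" elements strictly before i (exclusive scan);
    // right elements keep their relative order after the left partition.
    int dst = goesLeft ? leftAddr[i] : numLeft + (i - leftAddr[i]);
    out[dst] = keys[i];
}

int main()
{
    std::vector<unsigned> h = {5, 2, 7, 4, 1, 6, 3, 0};
    int n = (int)h.size(), bit = 0;

    // Exclusive scan of the "goes left" flags (serial here; a GPU scan in practice).
    std::vector<int> leftAddr(n);
    int running = 0;
    for (int i = 0; i < n; ++i) {
        leftAddr[i] = running;
        running += (((h[i] >> bit) & 1u) == 0) ? 1 : 0;
    }
    int numLeft = running;

    unsigned *dKeys, *dOut; int *dAddr;
    cudaMalloc(&dKeys, n * sizeof(unsigned));
    cudaMalloc(&dOut, n * sizeof(unsigned));
    cudaMalloc(&dAddr, n * sizeof(int));
    cudaMemcpy(dKeys, h.data(), n * sizeof(unsigned), cudaMemcpyHostToDevice);
    cudaMemcpy(dAddr, leftAddr.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    scatter_by_bit<<<1, 32>>>(dKeys, dAddr, dOut, numLeft, bit, n);
    cudaMemcpy(h.data(), dOut, n * sizeof(unsigned), cudaMemcpyDeviceToHost);
    for (unsigned k : h) printf("%u ", k);        // even keys first, order preserved
    printf("\n");
    return 0;
}
```

Repeating this split for each bit from least to most significant yields the radix sort sketched in the article.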
References

PTX: Parallel Thread Execution ISA Version 2. Nvidia Corporation.
OpenCL Parallel Prefix Sum (aka Scan) Example Version 1. Apple Inc.