-
[show abstract]
[hide abstract]
ABSTRACT: In this paper, we revisit the design of synchronization
primitives---specifically barriers, mutexes, and semaphores---and how they
apply to the GPU. Previous implementations are insufficient due to the
discrepancies in hardware and programming model of the GPU and CPU. We create
new implementations in CUDA and analyze the performance of spinning on the GPU,
as well as a method of sleeping on the GPU, by running a set of memory-system
benchmarks on two of the most common GPUs in use, the Tesla- and Fermi-class
GPUs from NVIDIA. From our results we define higher-level principles that are
valid for generic many-core processors, the most important of which is to limit
the number of atomic accesses required for a synchronization operation because
atomic accesses are slower than regular memory accesses. We use the results of
the benchmarks to critique existing synchronization algorithms and guide our
new implementations, and then define an abstraction of GPUs to classify any GPU
based on the behavior of the memory system. We use this abstraction to create
suitable implementations of the primitives specifically targeting the GPU, and
analyze the performance of these algorithms on Tesla and Fermi. We then predict
performance on future GPUs based on characteristics of the abstraction. We also
examine the roles of spin waiting and sleep waiting in each primitive and how
their performance varies based on the machine abstraction, then give a set of
guidelines for when each strategy is useful based on the characteristics of the
GPU and expected contention.
10/2011;
-
Proceedings of the 25th IEEE International Parallel and Distributed Processing Symposium; 05/2011
-
08/2010
-
06/2010
-
Proceedings of the 23rd IEEE International Parallel and Distributed Processing Symposium; 05/2009
-
Computer Graphics Forum (Proceedings of Eurographics \textln2009). 04/2009; 28:385--396.
-
[show abstract]
[hide abstract]
ABSTRACT: We present a software system that enables path-traced rendering of complex scenes. The system consists of two primary components: an application layer that implements the basic rendering algorithm, and an out-of-core scheduling and data-management layer designed to assist the application layer in exploiting hybrid computational resources (e.g., CPUs and GPUs) simultaneously. We describe the basic system architecture, discuss design decisions of the system's data-management layer, and outline an efficient implementation of a path tracer application, where GPUs perform functions such as ray tracing, shadow tracing, importance-driven light sampling, and surface shading. The use of GPUs speeds up the runtime of these components by factors ranging from two to twenty, resulting in a substantial overall increase in rendering speed. The path tracer scales well with respect to CPUs, GPUs and memory per node as well as scaling with the number of nodes. The result is a system that can render large complex scenes with strong performance and scalability.
Computer Graphics Forum 03/2009; 28(2):385 - 396. · 1.63 Impact Factor
-
Comput. Graph. Forum. 01/2009; 28:385-396.