
# Tiled and Clustered Forward Shading Supporting Transparency and MSAA

## Abstract

Ola Olsson, Markus Billeter and Ulf Assarsson
Chalmers University of Technology

Figure 1: Shot from the Crytek Sponza scene with semi-transparent bubbles added, lit by 1024 random lights. With 8x MSAA at 720p resolution, Tiled Deferred runs at 53 FPS (without bubbles), Tiled Forward at 52 FPS and Clustered Forward at 161 FPS on a GTX 680. The diagrams illustrate how transparent geometry affects clustered and tiled forward shading. [Diagram labels: Z axis from near to far; Solid and Transparent geometry.]

We present details of Tiled and Clustered Forward Shading as applied to rendering transparent geometry and to Multisample Anti-Aliasing (MSAA). We detail how transparency and MSAA are supported, and present performance results measured on modern GPUs.

Previous techniques for handling large numbers of lights are usually [...] However, deferred shading techniques struggle with impractically large frame buffers when MSAA is used, and make supporting [...] difficult to support custom shaders on geometry.

Tiled Forward Shading is a new and highly practical approach to real-time shading of scenes with thousands of light sources, introduced by Olsson and Assarsson [2011]. Their results, measured on a GTX 280 GPU, indicated that tiled forward shading was impractically slow. Performance on more recent GPUs has improved considerably (approaching that of tiled deferred), which opens up the possibility of using the technique to support transparency and MSAA.

[...] partitioning [Olsson et al. 2012]. We show how Clustered Forward Shading can be extended to support transparency efficiently.

Forward shading naturally supports both transparency and MSAA,
which has been shown in previous work. However, the performance
and implementation details have not previously been investigated.

We provide details on how to construct the light grid for use with transparency. When transparent geometry is considered, the depth range optimization cannot be fully used. Instead, only a more conventional hierarchical depth test can be used. The grid structure can be built once, and then quickly pruned to prepare a more efficient instance for opaque geometry. However, as each transparent layer must consider all the lights in the tile, performance scales far worse than linearly with depth complexity (Figure 1, right).
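
The build-once, prune-for-opaque idea can be sketched roughly as follows. This is a minimal Python illustration, not the authors' implementation: the `Light` record, the 32-pixel tile size, the screen-space bounding-square overlap test, and the per-tile `opaque_min_z` / `opaque_max_z` inputs are all assumptions made for the example. It also assumes that the hierarchical depth test amounts to rejecting lights that lie entirely behind the farthest opaque depth in a tile.

```python
from dataclasses import dataclass

TILE = 32  # tile size in pixels (assumed for this sketch)


@dataclass(frozen=True)
class Light:
    x: float       # screen-space centre of the light's bounds, in pixels
    y: float
    r: float       # screen-space radius, in pixels
    z_near: float  # nearest view-space depth reached by the light volume
    z_far: float   # farthest view-space depth reached by the light volume


def tile_span(lo, hi, count):
    """Clamp a pixel interval to a range of tile indices."""
    return range(max(int(lo // TILE), 0), min(int(hi // TILE), count - 1) + 1)


def build_transparent_grid(lights, tiles_x, tiles_y, opaque_max_z):
    """Per-tile light lists using far-depth rejection only.

    Transparent surfaces may lie anywhere in front of the opaque geometry,
    so no near bound is applied here.
    """
    grid = [[[] for _ in range(tiles_x)] for _ in range(tiles_y)]
    for index, light in enumerate(lights):
        for ty in tile_span(light.y - light.r, light.y + light.r, tiles_y):
            for tx in tile_span(light.x - light.r, light.x + light.r, tiles_x):
                if light.z_near <= opaque_max_z[ty][tx]:
                    grid[ty][tx].append(index)
    return grid


def prune_for_opaque(grid, lights, opaque_min_z):
    """Derive the opaque grid by also applying the near bound of the depth range."""
    return [
        [[i for i in cell if lights[i].z_far >= opaque_min_z[ty][tx]]
         for tx, cell in enumerate(row)]
        for ty, row in enumerate(grid)
    ]


# Hypothetical 2x1-tile example: opaque surfaces span depths 8..12 in both tiles.
lights = [Light(16, 16, 8, 2, 4), Light(32, 16, 8, 9, 11), Light(48, 16, 8, 20, 25)]
opaque_min_z = [[8.0, 8.0]]
opaque_max_z = [[12.0, 12.0]]
transparent_grid = build_transparent_grid(lights, 2, 1, opaque_max_z)
opaque_grid = prune_for_opaque(transparent_grid, lights, opaque_min_z)
```

In this toy setup the first tile keeps light 0 for transparent geometry even though it lies entirely in front of the opaque surfaces, while the pruned opaque list drops it. Every transparent layer in that tile would still loop over the full transparent list, which is the cost the text describes.
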

To improve on this we extend clustered forward shading by constructing the grid using a pre-pass over all geometry (not just opaque), and flagging clusters as a side effect. This allows us to quickly find the unique clusters used. As clusters contain only space around actual samples that need shading, efficiency is much better (Figure 1, left).
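
A rough sketch of cluster flagging followed by compaction is given below. It is an illustration under stated assumptions rather than the method from the paper: the tile size, the near-plane distance `NEAR`, the depth slicing (each slice doubles in depth), and the flat flag-array layout are all hypothetical choices made for the example, and the pre-pass is modelled as a plain list of `(pixel_x, pixel_y, view_depth)` samples.

```python
import math

TILE = 32   # tile size in pixels (assumed)
NEAR = 1.0  # near-plane distance (assumed)


def cluster_key(px, py, view_z, slices):
    """Map a sample to (tile_x, tile_y, depth_slice).

    Depth slices double in extent with distance; this slicing is an
    assumption of the sketch.
    """
    k = int(math.floor(math.log2(view_z / NEAR)))
    k = min(max(k, 0), slices - 1)
    return px // TILE, py // TILE, k


def flag_clusters(samples, tiles_x, tiles_y, slices):
    """Pre-pass over ALL samples (opaque and transparent), setting one flag per cluster."""
    flags = [False] * (tiles_x * tiles_y * slices)
    for px, py, z in samples:
        tx, ty, k = cluster_key(px, py, z, slices)
        flags[(k * tiles_y + ty) * tiles_x + tx] = True
    return flags


def unique_clusters(flags):
    """Compact the flag array into the list of cluster indices actually used."""
    return [index for index, used in enumerate(flags) if used]


# Hypothetical samples: two transparent layers and one opaque surface in
# tile (0, 0), and one opaque surface in tile (1, 0).
samples = [(5, 5, 1.5), (6, 5, 1.7), (5, 5, 12.0), (40, 5, 5.0)]
flags = flag_clusters(samples, tiles_x=2, tiles_y=1, slices=4)
used = unique_clusters(flags)
```

Only three of the eight possible clusters are flagged here, so lights would need to be assigned only to the space around samples that are actually shaded, rather than to the whole depth extent of each tile.
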

For deferred shading a single 1080p, 16x MSAA, 16-bit float RGBA buffer requires over 250 MB of memory. In addition, each sample may need to be shaded individually, effectively running [...] Buffers are required and MSAA is trivially enabled.

A brief performance and memory comparison is shown in Figure 2,
showing that clustered forward outperforms tiled forward by more
than 2 times, and also outperforms tiled deferred, if MSAA is used.

Figure 2: Left, performance for a view similar to Figure 1 (deferred without bubbles). Right, memory use of deferred vs. forward at 720p, assuming 32-bit depth and color targets, and 3 × 64-bit G-buffers. [Plot axes: #Samples/Pixel (1, 2, 4, 8, 16) against Memory Use (Mb), 0 to 500; series: Deferred and Forward.]
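
The memory figures quoted in the text and in the Figure 2 caption can be checked with simple arithmetic. The short Python sketch below does so under the assumption that every MSAA sample is stored in full and that no compression or padding applies. The helper names are made up for the example, and it treats the deferred configuration as the forward targets plus the three G-buffers, which is one reading of the caption.

```python
MIB = 1024 * 1024


def buffer_bytes(width, height, samples, bytes_per_sample):
    """Size of a multisampled render target, storing every sample."""
    return width * height * samples * bytes_per_sample


# Single 1080p, 16x MSAA, 16-bit float RGBA buffer: 4 channels * 2 bytes.
hdr_1080p = buffer_bytes(1920, 1080, 16, 4 * 2)


def forward_bytes(samples):
    """720p forward: 32-bit depth plus 32-bit color per sample."""
    return buffer_bytes(1280, 720, samples, 4 + 4)


def deferred_bytes(samples):
    """720p deferred: forward targets plus three 64-bit G-buffers (assumed reading)."""
    return forward_bytes(samples) + buffer_bytes(1280, 720, samples, 3 * 8)


if __name__ == "__main__":
    print(f"1080p 16x RGBA16F: {hdr_1080p / MIB:.1f} MiB")
    for s in (1, 2, 4, 8, 16):
        print(f"{s:2d} samples: forward {forward_bytes(s) / MIB:6.1f} MiB, "
              f"deferred {deferred_bytes(s) / MIB:6.1f} MiB")
```

Under these assumptions the 1080p buffer comes to 265,420,800 bytes (about 253 MiB), consistent with "over 250". At 16 samples per pixel the 720p deferred configuration needs four times the memory of the forward one.
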

## References

ANDERSSON, J. 2009. Parallel graphics in Frostbite - current & future. SIGGRAPH Course: Beyond Programmable Shading.

LAURITZEN, A. 2010. Deferred rendering for current and future rendering pipelines. SIGGRAPH Course: Beyond Programmable Shading.

[...] of Graphics, GPU, and Game Tools 15, 4, 235–251.

OLSSON, O., BILLETER, M., AND ASSARSSON, U. 2012. Clustered deferred and forward shading. In HPG '12: Proceedings of the Conference on High Performance Graphics 2012.