EUROGRAPHICS 2011 / M. Chen and O. Deussen
(Guest Editors)
Volume 30 (2011), Number 2
Combinatorial Bidirectional Path-Tracing
for Efficient Hybrid CPU/GPU Rendering
Anthony Pajot¹, Loïc Barthe¹, Mathias Paulin¹, and Pierre Poulin²
¹IRIT-CNRS, Université de Toulouse, France   ²LIGUM, Dept. I.R.O., Université de Montréal, Canada
Figure 1: Images of a scene with a large dataset (758K triangles, lots of textures) featuring complex lighting conditions (glossy reflections, caustics, strong indirect lighting, etc.) computed in 50 seconds (left) and one hour (right), respectively. Standard bidirectional path-tracing requires 11 minutes and 13 hours, respectively, to obtain the same results.
Abstract
This paper presents a reformulation of bidirectional path-tracing that adequately divides the algorithm into processes efficiently executed in parallel on both the CPU and the GPU. We thus benefit from high-level optimization techniques such as double buffering, batch processing, and asynchronous execution, as well as from the exploitation of most of the CPU, GPU, and memory bus capabilities. Our approach, while avoiding pure GPU implementation limitations (such as limited complexity of shaders, light or camera models, and processed scene data sets), is more than ten times faster than standard bidirectional path-tracing implementations, leading to performance suitable for production-oriented rendering engines.
Categories and Subject Descriptors (according to ACM CCS): I.3.7 [Computer Graphics]: Three-Dimensional Graphics and Realism—Color, shading, shadowing, and texture; I.6.8 [Simulation and Modeling]: Type of Simulation—Monte-Carlo
1. Introduction
Global illumination brings a lot of realism to computer-
generated images. Therefore, production-oriented rendering
engines use it to reach photorealism.
Algorithms that compute global illumination have to meet a certain number of constraints in order to be seamlessly integrated in a production pipeline:
• From an artist's point of view, the algorithm should have intuitive parameters, and should be able to provide interactive feedback as well as high-quality final images.
• From a scene-design point of view, it should be able to manage huge datasets as well as complex and flexible shaders, various light models, and various camera models.
• From a data-management point of view, it should avoid precomputed data as much as possible. Indeed, it is tedious to keep such data synchronized across artists who work on the same scene, or between the computers of a renderfarm.
• From a computational point of view, it must be robust, to handle highly dynamic scenes and all-frequency indirect lighting, and to give the artists complete freedom in their designs. For automated rendering, it must give predictable and reproducible results in a given time frame. Ideally, it should be easy to use on clusters, so that one image can be rendered using all the resources of a renderfarm.
Methods that are used nowadays mostly rely on point clouds or other types of precomputed representations [KFC10]. As they rely on precomputed data, interactive feedback is not straightforward, since these data must be recomputed each time the scene changes. Even though they are predictable and able to handle very large amounts of data, precomputed representations still have problems handling highly dynamic or high-frequency indirect lighting. Moreover, production pipelines must be adapted appropriately to keep these data in sync during the production process, and computing a single image on a cluster can only be done once the data have been computed.
To remove all these problems, unbiased methods have been investigated, and path-tracing based algorithms are beginning to be mature enough to be successfully used in the movie industry [Faj10]. In addition to being potentially fully automatic (thus user-friendly), unbiasedness makes these methods easy to deploy on clusters, as independent renderings can simply be averaged to compute the final image. As they do not require any precomputed data and do not rely on any interpolation scheme, they also naturally handle highly dynamic scenes. Moreover, they use independent samples, so precision requirements such as a given number of samples per pixel are easy to formulate. Finally, unlike sequential methods, the number of samples computed in a given time can be measured, so that the results are predictable and reproducible when the time frame is fixed.
Nevertheless, path-tracing exhibits large variance when high-frequency or strong indirect lighting effects such as caustics are present in a scene, leading to visually unpleasant artefacts in the rendering. To reduce these artefacts, constraints can be added to the indirect lighting, e.g., reducing the sharpness of glossy reflections [Faj10], or enlarging lights. Although interactive feedback can be provided for scenes where path-tracing has a very low variance, a large amount of time is needed to obtain even a rough preview of the final appearance for scenes with high variance. From a more general point of view, unbiased methods have a larger computational cost than methods based on precomputed data, which is a problem for wide acceptance.
Bidirectional path-tracing (BPT) [VG94, LW93] has the same advantages as path-tracing, but is much more robust with respect to indirect lighting, providing low-variance results even for complex lighting conditions. Even though it is more computationally efficient than path-tracing, it remains too slow for interactive feedback, and is still slower than methods based on precomputed data. Recently, attempts at making it faster by using the GPU as a co-processor have been presented in the rendering community [OMP10]; however, the proposed implementation does not allow an efficient collaboration of the CPU and GPU, keeping most of the processing load on the CPU while the GPU remains mostly idle.
Contribution: In this paper, we combine correlated sampling and standard BPT to efficiently use both the CPU and GPU in a cooperative way (Section 3). The basic principle of BPT is to repeatedly sample an optical path leaving from the camera and an optical path leaving from the light. Complete paths are then created by linking together each subpath of the camera path with each subpath of the light path. The last vertices of the two subpaths are called linking vertices, and the segment between the two linking vertices is the linking segment. A complete path created this way contributes to the final image if the linking vertices are mutually visible, and if some of the energy arriving at the light linking vertex is scattered towards the camera path. Instead of combining two paths, we combine sets of camera and light paths, computing the values needed for linking on the GPU. As each camera path is combined with each light path, many more linking segments are available, allowing us to use the GPU at its maximum without increasing the cost of sampling the paths (Section 4). We then interleave the CPU and GPU parts in order to obtain an algorithm where both the CPU and GPU are always busy (Section 5). This reformulation reduces the processing time by a factor varying between 12 and 16 compared to standard BPT (Section 6), allowing feedback in less than a minute even for complex scenes, and the computation of high-quality images in one hour, as shown in Figure 1.
2. Related Work
Setting aside computational efficiency and GPU use, both biased and unbiased algorithms that do not use precomputed data exist to produce high-quality images.
On the unbiased side, sequential methods based on Markov-chain Monte-Carlo [VG97, KSKAC02, CTE05, LFCD07] have been used to improve the robustness of standard Monte-Carlo methods for very difficult scenes. Unfortunately, they can be highly dependent on the starting state of the chain, and do not provide feedback as rapidly as standard Monte-Carlo methods, since the time to cover the whole screen is typically longer. The gain that these methods bring is most visible on very difficult scenes, but remains quite limited for more common scenes, for which standard BPT is highly efficient.
On the biased side, Hachisuka et al. [HOJ08, HJ09] introduced progressive photon mapping and stochastic progressive photon mapping, two consistent algorithms based on photon mapping. Even though they are robust, efficient, and able to produce high-quality images, being consistent instead of unbiased prevents these algorithms from being directly usable in renderfarms for single-image computations. Instead, they need to be specifically adapted to avoid artefacts in the final images.
Using both the CPU and GPU in a cooperative way can
provide a large gain in performance, allowing the methods above to produce high-quality results or rough previews significantly faster. Attempts at isolating parts of algorithms to execute them on the GPU have been made in rendering engines, such as in LuxRender [Lux10], where intersection tests are performed on the GPU. The main problem that faces developers is keeping both the CPU and GPU busy all the time. In general, the CPU is too slow to provide enough work to the GPU. More generally, it is not easy to adapt the algorithms presented above to efficiently use the GPU to compute intermediate data without restricting the size of the datasets or the complexity of the shaders. In fact, sampling, which must be done on the CPU as it involves the whole dataset and the shaders, would in general require much more time than the GPU part, leading to a negligible gain.
3. Combinatorial Bidirectional Path-Tracing (CBPT)
3.1. Base Algorithm
In BPT-based algorithms, a camera path x = (x_0, ..., x_c) and a light path y = (y_0, ..., y_l) are sampled. x_0, ..., x_c are called camera vertices, and y_0, ..., y_l are called light vertices. For each vertex x_i or y_j located on the surface of an object, the parameters of the bidirectional scattering distribution function (BSDF) are computed using a shader tree. Complete paths are then created by linking subpaths (x_0, ..., x_i) and (y_0, ..., y_j), for all the possible couples (i, j). The number of segments of each complete path is i + j + 1, and the linking segment is the segment (x_i, y_j). Let the function g_C(x, i) give the energy transmitted by x from x_i to x_0, and g_L(y, j) give the energy transmitted by y from y_0 to y_j. The energy emitted from y_0 which arrives at x_0 via the path z = (y_0, ..., y_j, x_i, ..., x_0) is then:
g_{i,j}(x, y) = g_L(y, j) G(y_j, x_i) g_C(x, i) × f_s(y_{j-1} → y_j → x_i) × V(y_j, x_i) × f_s(y_j → x_i → x_{i-1})    (1)

where f_s is the BSDF, V is the visibility function (1 if unoccluded, 0 otherwise), and G is the geometric term.
We define the basic contribution f_{i,j}(x, y) of such a complete path as:

f_{i,j}(x, y) = w_{i,j}(x, y) g_{i,j}(x, y) / p_{i,j}(x, y).    (2)

p_{i,j}(x, y) is the probability density with which the two subpaths have been sampled, and w_{i,j}(x, y) is the multiple importance sampling (MIS) weight [VG95].
In our implementation, we use the direct BSDF probability density function (PDF) p to sample directions for the camera path, the adjoint BSDF PDF p* to sample directions for the light path, and the balance heuristic [VG95] to compute the MIS weights:

w_{i,j}(x, y) = p_{i,j}(x, y) / Σ_{s,t} p_{s,t}(x, y)    (3)

where each (s, t) couple is one of the possible techniques with which z could have been sampled. Computing this weight requires evaluating p(x_{i-1} → x_i → y_j) and p*(y_j → x_i → x_{i-1}) using the BSDF at x_i, and p*(x_i → y_j → y_{j-1}) and p(y_{j-1} → y_j → x_i) using the BSDF at y_j.
When either i or j is less than 1, the corresponding terms are not based on the BSDF, but instead on the light or camera properties. If j = −1, x_c is on a light, making a complete path by itself.
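As an illustration, here is a minimal sketch of the balance heuristic in C++ host code (the function name and the flat array of per-technique densities are ours, not from the paper's implementation):

// Balance heuristic: MIS weight of the technique `used` among all the
// (s,t) techniques that could have produced the same complete path z.
// pdfs[k] holds p_{s,t}(x,y) for the k-th technique; pdfs[used] is the
// density of the technique actually used, i.e. p_{i,j}(x,y).
double balanceHeuristic(const double* pdfs, int count, int used) {
    double sum = 0.0;
    for (int k = 0; k < count; ++k)
        sum += pdfs[k];
    return (sum > 0.0) ? pdfs[used] / sum : 0.0;
}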
The data that depend on both x_i and y_j have to be computed per linking segment, and this is the most time-consuming task when computing the contribution of a complete path. These data can be computed very efficiently on the GPU, in parallel for each linking segment. Unfortunately, producing a sufficient number of linking segments would require sampling and combining a very large number of pairs, leading to very large CPU costs, large memory footprints on both CPU and GPU, and very time-consuming CPU-to-GPU memory transfers.
The key idea allowing us to use both the CPU and GPU efficiently is to sample populations of N_C camera paths and N_L light paths independently on the CPU, and then combine each camera path with each light path. This leads to the combination of N_C × N_L pairs of paths, and gives us more than enough linking segments to benefit from the processing power of GPUs without requiring larger sampling costs. Combining all camera paths with the same light paths introduces correlation in the estimations, but does not bias the average estimator.
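To make the last claim concrete, here is a short argument in our own notation (it is not spelled out in the paper): the combined estimator averages all N_C × N_L pairs,

F̂ = (1 / (N_C N_L)) Σ_{k=1..N_C} Σ_{m=1..N_L} f(x^(k), y^(m)),

and by linearity of expectation, E[F̂] = E[f(x, y)], since each x^(k) and each y^(m) is drawn from the same distribution as in standard BPT. Re-using the same light paths for every camera path therefore affects only the variance of F̂, not its mean.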
In practice, we have three kernels which compute, for each linking segment (x_i, y_j) in parallel:
• the visibility term V(x_i, y_j);
• the shading values involving the BSDF of the camera point: f_s(y_j → x_i → x_{i-1}), p(x_{i-1} → x_i → y_j), and p*(y_j → x_i → x_{i-1}), if x_i has an associated BSDF (i.e., it is neither on the camera lens nor on a light);
• the shading values involving the BSDF of the light point: f_s(y_{j-1} → y_j → x_i), p*(x_i → y_j → y_{j-1}), and p(y_{j-1} → y_j → x_i), if y_j has an associated BSDF.
If x_i or y_j does not have an associated BSDF, the probabilities (probability to have sampled the light, probability density to have sampled the point on the light, probability density to have sampled the direction from the camera, etc.) and the light emission and importance emission terms are computed on the CPU, to keep the flexibility on camera and light models that can be used.
The final contributions of a pair (x, y) can then be split into two parts. The first part is the sum of all the basic contributions that affect the image location intersected by the first segment of x. We denote it as the bidirectional contribution, f_b(x, y) = Σ_{i>0, j≠−1} f_{i,j}(x, y) + f_{c,−1}(x, y), and we call bidirectional image the image obtained by considering only the bidirectional contributions. The second part contains all the contributions obtained by light-tracing, each affecting a different image location: {f_{0,0}(x, y), f_{0,1}(x, y), ..., f_{0,l}(x, y)}. We call light-tracing image the image obtained by adding all the contributions from light-tracing, each multiplied by the number of pixels N_p of the final image. In our implementation, light-tracing does not contribute to direct lighting, as it brings a lot of variance for this type of light transport.
As a result, a step of CBPT consists of:
1. sampling a camera population {x} of N_C paths, and a light population {y} of N_L paths;
2. computing the combination data for these two populations on the GPU;
3. computing the contributions of each pair of paths, splatting N_C values to the bidirectional image, and splatting the light-tracing contributions to the light-tracing image.
Note that, as is, our algorithm does not directly handle motion blur, but motion blur can be integrated in a straightforward manner by sampling each ({x}, {y}) population couple with a specific time value, i.e., all the paths of the two populations share the same time value, and this value differs for each couple of populations.
3.2. Discussion
Setting N_C and N_L: Ideally, we would like to always be perceptually faster than standard BPT. Perceptually faster means computing more camera paths per second, with each camera path being combined with N_L > 1 light paths. This leads to a similar or faster coverage of the image, with each camera path bringing a lower-variance estimate than in standard BPT, leading to perceptually faster convergence. N_C and N_L can be computed to ensure faster perceptual convergence, by measuring the time t_b needed by BPT to sample, combine, and splat the contribution of a pair of paths, and the time t_s(N_C, N_L) needed by CBPT to perform one step. As the combination is the most time-consuming part of a step, t_s(N_C, N_L) is roughly constant as long as the number of pairs P = N_C × N_L remains constant. Therefore, for a fixed P, an appropriate N_C value is such that

N_C > t_s(P) / t_b.    (4)
A lower N_C value will lead to a lower-variance estimate for each path; a larger value will lead to faster coverage, but also more correlation. A side effect of Equation (4) is that if N_C, computed using this equation, is such that N_L would be less than 1, the machine on which CBPT is running is not fast enough to bring any advantage over standard BPT for the chosen P.
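A minimal sketch of this parameter selection, assuming t_b and t_s(P) have been measured beforehand (the helper name and return convention are ours, not from the paper):

#include <cmath>

// Choose (N_C, N_L) for a fixed pair budget P, given the measured
// per-pair time tb of standard BPT and the per-step time tsP of CBPT.
// Returns false when Equation (4) would force N_L below 1, i.e. the
// machine is too slow for CBPT to beat standard BPT at this P.
bool choosePopulationSizes(int P, double tb, double tsP, int& nc, int& nl) {
    nc = static_cast<int>(std::ceil(tsP / tb)); // smallest N_C satisfying Eq. (4)
    if (nc < 1 || P / nc < 1)
        return false;
    nl = P / nc; // N_L light paths combined with each camera path
    return true;
}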
Light-tracing: The discussion above does not take light-tracing into account, and using Equation (4) generally gives small N_L values, leading to high-variance caustics. Light-tracing does not really take advantage of the GPU combination system, as each light subpath is combined with only one vertex of a camera path, namely the vertex which lies on the lens of the camera. Moreover, contributions for different camera paths are in general very similar, or even equal when using a pinhole camera, as all the lens vertices are at the exact same location. We therefore choose to compute light-tracing using a standard CPU-based light-tracer. At each step of CBPT, we sample N_T light paths ({y_lt}) and compute their light-tracing contributions. In general, we choose N_T close to N_C to get approximately the same bidirectional/light-tracing ratio as standard BPT. This leads to the final algorithm for a step of CBPT, presented in Algorithm 1.
Algorithm 1 A complete step of CBPT.
sample({x})
sample({y})
upload({x}, {y})
gpu_comp({x}, {y})
combine({x}, {y})
sample({y_lt})
compute_lt({y_lt})
Correlated sampling: Correlated sampling can take several forms, such as re-using previous paths in order to improve the sampling efficiency [VG97, CTE05], or re-using a small number of well-behaved random numbers to compute different integrals [KH01]. In our method, the camera and light paths are all sampled independently using different random numbers, as in standard BPT. The complete paths, however, are sampled in a correlated way, as they are created by linking the subpaths in all possible ways. To avoid visible correlation patterns in the final image while ensuring a proper coverage of the image, the image-space coordinates used by the camera paths are generated in an array, using a stratified scheme over the entire image with four samples per pixel. This array of samples is then shuffled. When sampling a camera population, each path uses the samples sequentially in the array, leading to paths that most likely contribute to different parts of the image. Therefore, correlation is present, but as it is spread randomly over the image, no regular patterns appear. This array is regenerated each time all the samples have been used. Adaptive sampling can be supported by similarly caching a sufficient number of image coordinates that should be computed according to the sampling scheme, and then shuffling this array of coordinates.
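A sketch of this image-sample generation, as a standalone C++ routine (our own code, assuming a standard uniform generator; the paper does not detail its sampler):

#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

struct Sample2D { float x, y; }; // image-space coordinates, in pixels

// Stratified image samples, four per pixel (one per 2x2 sub-pixel
// stratum), shuffled so that consecutive camera paths of a population
// land on unrelated parts of the image.
std::vector<Sample2D> makeShuffledImageSamples(int width, int height,
                                               std::mt19937& rng) {
    std::uniform_real_distribution<float> jitter(0.0f, 1.0f);
    std::vector<Sample2D> samples;
    samples.reserve(static_cast<std::size_t>(width) * height * 4);
    for (int py = 0; py < height; ++py)
        for (int px = 0; px < width; ++px)
            for (int sy = 0; sy < 2; ++sy)      // 2x2 strata per pixel
                for (int sx = 0; sx < 2; ++sx)
                    samples.push_back({px + (sx + jitter(rng)) * 0.5f,
                                       py + (sy + jitter(rng)) * 0.5f});
    std::shuffle(samples.begin(), samples.end(), rng); // spread correlation
    return samples;
}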
4. Efficient Computation of Combination Data
Our algorithm requires an efficient computation of the combination data on the GPU. In this section, we suppose that for each vertex of the two populations {x} and {y}, we have the position, the BSDF parameters, and the direction to the previous vertex in the path. The size of these data is O(N_C + N_L). As there are typically few vertices in the populations, the GPU memory requirements for the population data are very low. Combining populations exhaustively avoids uploading the O(N_C × N_L) linking-segment array that would otherwise be necessary.
We now give some high- and low-level details on our implementation. Figure 3 shows how the techniques we use are put together.
High-level details: The computation is divided into three main steps: visibility (blue V rectangles in Figure 3), BSDF and PDF computations for camera vertices, called shading computations from now on (green C rectangles), and shading computations for light vertices (red L rectangles). For each step, we divide the work into batches of fixed size, each having an associated memory zone in CPU-side memory (the batch id where results are downloaded is indicated in the download rectangle). On the GPU side, we use two buffers of fixed size to store the results of the batches (represented by black and white rectangles respectively inside each task). Using batches allows us to compute the results of the current batch while downloading the results of the previous batch to the CPU, leading to increased efficiency, as sketched below. It also avoids the need for any array of size O(N_C × N_L) on the GPU side, making the N_C and N_L values bounded only by the CPU-side memory capacity. In practice, this provides more space for the scene's geometry, which is needed for the visibility tests.
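A sketch of the double-buffered batch loop on the host side, using the CUDA runtime API (the buffer contents and the kernel are placeholders; the real kernels are the three described in Section 3):

#include <cuda_runtime.h>

// Placeholder for one of the three combination kernels; the real ones
// write V or (f_s, p, p*) tuples, one result per linking segment.
__global__ void kernelForBatch(int batch, float* out, int batchSize) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < batchSize) out[i] = 0.0f;
}

// While batch b is computed into one GPU buffer, the results of batch
// b-1 are downloaded from the other one. `results` must be pinned host
// memory for the copy to overlap with kernel execution.
void runBatches(int numBatches, int batchSize, float* results) {
    float* gpuBuf[2];
    cudaStream_t compute, download;
    cudaEvent_t kernelDone[2], copyDone[2];
    for (int k = 0; k < 2; ++k) {
        cudaMalloc((void**)&gpuBuf[k], batchSize * sizeof(float));
        cudaEventCreateWithFlags(&kernelDone[k], cudaEventDisableTiming);
        cudaEventCreateWithFlags(&copyDone[k], cudaEventDisableTiming);
    }
    cudaStreamCreate(&compute);
    cudaStreamCreate(&download);
    for (int b = 0; b < numBatches; ++b) {
        int cur = b & 1;
        // re-use the buffer only once its previous download has finished
        cudaStreamWaitEvent(compute, copyDone[cur], 0);
        kernelForBatch<<<(batchSize + 255) / 256, 256, 0, compute>>>(
            b, gpuBuf[cur], batchSize);
        cudaEventRecord(kernelDone[cur], compute);
        // download batch b while batch b+1 is being computed
        cudaStreamWaitEvent(download, kernelDone[cur], 0);
        cudaMemcpyAsync(results + (size_t)b * batchSize, gpuBuf[cur],
                        batchSize * sizeof(float),
                        cudaMemcpyDeviceToHost, download);
        cudaEventRecord(copyDone[cur], download);
    }
    cudaStreamSynchronize(download);
    cudaStreamSynchronize(compute);
}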
As is, some shading computations are done even though the linking vertices are not mutually visible. In fact, for the shading models we use [AS00, WMLT07], introducing an array to compute only the useful shading values is much less efficient, as building this array on the CPU and uploading it for each batch takes more time than directly computing all the shading values.
Low-level details: We use NVidia's CUDA language for GPU computations. The CPU-side work consists only of synchronization, and is performed in a CUDA-specific thread, thus not interfering with the main computational threads. All the positions, directions, and BSDF data are stored in linear arrays (structure-of-arrays organisation), which are re-used across populations to avoid memory allocations, and enlarged if needed. Each array is accessible through textures, because each value is used many times (once for each linking segment to which the vertex belongs), and generally in coherent ways (subsequent threads are likely to use the same data, or nearby data).
For visibility, we use an adapted version of the radius kd-tree GPU raytracing implementation by Segovia [Seg08], which gives a reasonable throughput and is well suited for individual and incoherent rays that are not stored in an array. The rays are effectively built from the thread index idx, by retrieving the camera and light vertices from their indices, computed as (idx / V_L) and (idx mod V_L) respectively, where V_L is the number of vertices in the light population.
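A sketch of this indexing in a CUDA kernel (illustrative names; the actual implementation reads the vertex data through textures and feeds the rays to the kd-tree traversal of [Seg08]):

__global__ void buildLinkingRays(const float3* camPos,   // camera vertices
                                 const float3* lightPos, // light vertices
                                 int numCam, int numLight,
                                 float3* rayOrg, float3* rayDir) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= numCam * numLight) return;
    int ci = idx / numLight; // camera vertex index
    int li = idx % numLight; // light vertex index: consecutive threads
                             // share the same camera vertex
    float3 o = camPos[ci];
    rayOrg[idx] = o;         // coalesced writes, indexed by thread index
    rayDir[idx] = make_float3(lightPos[li].x - o.x,  // unnormalized
                              lightPos[li].y - o.y,  // linking segment;
                              lightPos[li].z - o.z); // extent handling omitted
}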
The same indexing scheme is used for the camera shading computations, which makes a single BSDF processed by consecutive threads, as illustrated in Figure 2. Each thread handles one linking segment. This leads to very good locality in the accesses to the textures containing the BSDF parameters, as well as very good code coherency in the BSDF evaluation code. In fact, for most warps, the BSDF parameters are the same across all the threads, the only difference between consecutive threads being the directions.
Figure 2: Thread organisation for the shading of camera vertices. Each vertex is handled by blocks of V_L consecutive threads. At least (V_L − 2)/32 warps execute code with the exact same BSDF parameters, as they all concern the same vertex, leading to high code coherency.
Figure 3: Temporal execution of our combination system (not to temporal scale, for clarity). The meaning of each element is described in the main text.
For the light shading computations, the indexing is reversed (i.e., all the linking segments for one light vertex are processed by consecutive threads), to benefit from the same good properties as for the camera shading. All the results are written in linear arrays indexed by the thread index, leading to coalesced writes.
5. Implementation of CBPT
Using the combination data computation system described in Section 4, we implement CBPT as described in Algorithm 2. Note that population sampling and combination are done in parallel on all available CPU cores. The main point to note about Algorithm 2 is that we process two couples of populations at the same time, in an interleaved way. As illustrated in Figure 4, this allows us to perform GPU processing, CPU processing, downloads, and uploads at the same time. As the computation of the combination data by the GPU does not need any upload and is the only process that performs downloads, there is no contention on the memory bus if the GPU is able to perform transfers in both directions at the same time. In Algorithm 2, combine() uses the data computed on the GPU and downloaded into CPU memory to compute the f_b(x, y) contribution of each pair of paths, and splats it in a thread-safe way to the final image. As the number of splatted values is small, thread-safety does not create a bottleneck even with a large number of threads. compute_lt() computes light-tracing on all available CPU cores.
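To illustrate the thread-safe splatting, a minimal sketch (the framebuffer layout and synchronization primitive are ours; the paper does not specify them):

#include <atomic>

struct AtomicPixel { std::atomic<float> r{0.f}, g{0.f}, b{0.f}; };

// Portable atomic add on float (std::atomic<float>::fetch_add only
// exists since C++20); a compare-exchange loop is the usual fallback.
static void atomicAddF(std::atomic<float>& a, float v) {
    float old = a.load(std::memory_order_relaxed);
    while (!a.compare_exchange_weak(old, old + v,
                                    std::memory_order_relaxed)) {}
}

// Thread-safe splat of one bidirectional contribution; contention stays
// negligible because few values are splatted per pair of populations.
void splat(AtomicPixel* image, int width, int px, int py,
           float cr, float cg, float cb) {
    AtomicPixel& p = image[py * width + px];
    atomicAddF(p.r, cr);
    atomicAddF(p.g, cg);
    atomicAddF(p.b, cb);
}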
Timings for each task of a step are reported in Algorithm 2 for a standard scene and production-oriented parameters. These timings show the efficiency of our asynchronous computation scheme.
Figure 4: Temporal execution of CBPT (not to temporal scale, for clarity). Exact timings are given in Algorithm 2. The block labelled C contains both combine() and compute_lt(). The colors white and black of the rectangles indicate which GPU-side buffer is used to read the population data and store the results.
The total wall-clock time needed for one loop is 34.5 ms, compared to 60.1 ms if all computations had been done synchronously. The timings also show that the GPU work comes "for free", as the complete time to perform a step is equal to the sum of the times needed by each CPU task, ignoring the GPU one.
Algorithm 2 CBPT algorithm, with timings of each noteworthy element using N_C = 2000, N_L = 15, N_T = 1500, in a scene with 758K triangles and 1.5 GB of textures. The time spent by the GPU to compute all the results is given as "async time". The total time needed to perform a step is 34.5 ms.
for t = 0 to ∞ do
  sample({x}_t) {time: 13.5 ms}
  upload_async({x}_t)
  sample({y}_t) {time: 0.1 ms}
  upload_async({y}_t)
  sample({y_lt}) {time: 10.1 ms}
  if t > 0 then
    sync_gpu_comp(t−1) {time: 0.1 ms}
  end if
  sync_upload(t)
  gpu_comp_async(t) {async time: 25.7 ms}
  if t > 0 then
    combine({x}_{t−1}, {y}_{t−1}) {time: 9.6 ms}
    compute_lt({y_lt}) {time: 1.1 ms}
  end if
end for
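A host-side sketch of how the asynchronous primitives of Algorithm 2 can map onto the CUDA runtime (illustrative names and buffers, not the paper's actual code; t & 1 selects which of the two GPU-side buffers a population couple uses):

#include <cuda_runtime.h>

struct PathPopulation {          // hypothetical CPU-side population,
    const void* data() const;    // stored in pinned memory so that
    size_t bytes() const;        // copies can overlap with kernels
};
extern void*        devPops[2];  // two GPU-side population buffers
extern cudaStream_t gpuStream;   // stream driven by the CUDA thread
extern cudaEvent_t  compDone[2]; // end of gpu_comp for each couple
void launchCombinationBatches(void* devPop, cudaStream_t s); // Section 4

void upload_async(const PathPopulation& pop, int t) {
    cudaMemcpyAsync(devPops[t & 1], pop.data(), pop.bytes(),
                    cudaMemcpyHostToDevice, gpuStream);
}

void gpu_comp_async(int t) {
    launchCombinationBatches(devPops[t & 1], gpuStream);
    cudaEventRecord(compDone[t & 1], gpuStream);
}

void sync_gpu_comp(int t) {
    // blocks only if the GPU work of step t is not done yet; in practice
    // it already is, hence the 0.1 ms reported in Algorithm 2
    cudaEventSynchronize(compDone[t & 1]);
}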
6. Results
We now analyze the computational behavior of the combination system and of CBPT. All the measures are done on an Intel i7 920 2.80 GHz system, with an NVidia GTX 480 GPU and 16 GB of CPU-side memory. For our tests of CBPT, we use N_C = 2000, N_L = 15, and N_T = 1500 for all the scenes. These settings are not aimed at providing peak GPU performance, but rather at providing a good compromise between the throughput of the GPU part and rendering quality. No adaptive sampling is used.
                 ring                      comp lights
         GPU     CPU      ÷        GPU     CPU      ÷
vis      42.6     3.5    12.1      25.4     2.5    10.2
camera  266.7    11.8    22.6     281.7    15.6    18.1
light   266.5    17.3    15.4     280.2    13.0    21.6

                 comp monitors             living
         GPU     CPU      ÷        GPU     CPU      ÷
vis      25.6     2.9     8.8      32.2     2.1    15.3
camera  275.3    16.2    17.0     256.3    12.5    20.5
light   272.8    15.1    18.1     272.8    15.9    17.6
Table 1: Throughputs for visibility (vis), camera shading (camera), and light shading (light), when using the system described in Section 4, and when using the 4 physical cores of our processor, plus hyper-threading. The "÷" column gives the ratio of throughputs, corresponding to the actual speedups. Visibility is measured in millions of visibility tests per second; camera and light shadings are measured in millions of computations of (f_s, p, p*) tuples per second (see Section 3 for the components of the tuple). All the measures take all the memory transfers into account.
We use three different scenes of various complexities, which are presented in Figure 5. We have chosen these challenging scenes for their high lighting complexity:
• The first scene, ring, is geometrically simple, but composed of many glossy surfaces. It produces many subtle caustics that typically lead to noticeable noise, for instance on the back wall from the glossy tiles of the floor.
• The comp scene, rendered with two different lighting configurations, is much more involved than the ring scene. The lights version is lit by the ceiling lights, with indirect lighting caused by specular transmission of the light through the glass of the light fixtures. The front room and the upper parts of the back room are only indirectly lit. In the monitors version, light comes only from the TV and the computer monitor. Note the caustics on the wall due to refraction in the twisted pillars made of glass, as well as the caustics beneath the glass table. Nearly all the non-diffuse materials are glossy but not ideal mirrors, leading to very blurry reflections, which is especially visible on the floor.
• The living scene is lit by six very small area lights located on the ceiling above the table and the couch. It contains a lot of glossy materials (especially all the wooden objects), of which very few are specular. Note the caustics caused by the shelves on the left, and the completely indirect lighting in the hallway on the right.
6.1. Combination Throughput
Table 1 gives the raw throughputs of visibility and shading values we obtain on CPU and GPU depending on the scene, and the speedup brought by our system. All the measures take all the memory transfers into account. As expected, only the visibility throughputs decrease with the scene's size.
ring, 7.4K triangles (2.5, 3.4, 463K); comp lights, 758K triangles (3.6, 3.2, 570K); comp monitors, 758K triangles (3.6, 3.6, 620K); living, 400K triangles (3.7, 3.4, 620K)
Figure 5: The three scenes used to test CBPT. We indicate between parentheses the average length in segments of the sampled camera and light paths, as well as the average number of linking edges for each couple of populations in CBPT. Note that the average path lengths for BPT and CBPT are equal, as they use the same code. All the images have been rendered with CBPT. No post-process has been performed except tone-mapping, as our engine produces HDR images. The top-left image has been rendered at a resolution of 1600 × 1200 pixels in 1 hour. The three others have been rendered at a resolution of 1600 × 900 pixels, in 4 hours. As CBPT is based on standard Monte-Carlo methods, images at a resolution of 800 × 450 for the last three scenes can be obtained with a similar quality in 1 hour.
The shading throughput on CPU is quite sensitive to the type of BSDFs (glossy or purely diffuse) that mostly compose the paths of a certain type, explaining the gap present for some scenes between the camera and light shading throughputs. This is most visible in comp lights because of the glass fixtures surrounding the light sources. On the other hand, the GPU throughputs are much less affected by this. Despite the need to transfer the results back from the GPU, we achieve a 15-20× speedup on average compared to the CPU for shading only, consistently on all scenes.
The absolute timings in Table 2 give hints about the average time proportions needed by each element of the combination. These timings depend on the number of linking segments that have to be processed for each combination, which depends on the scene.
Figure 6 illustrates the impact of the batch size on performance, for visibility and shading computations, on the ring scene. This allows us to evaluate the impact of different transfer/computation repartitions, and to find optimal batch sizes for the computer we use.
For the visibility computations, even on this geometrically very simple scene, the transfers are not a limiting factor, as the visibility results are packed in a very compact form. Therefore, using batches does not make any noticeable difference on performance as soon as the batches are large enough. Consequently, the major advantage brought by batches for visibility resides in the control we have over the GPU memory-size requirements, without much impact on performance.
For more memory-consuming results such as the shading ones, the batch size has a large impact on performance, with the additional benefit of using less memory on the GPU. As a matter of fact, using asynchronism brings a 1.75× speedup, going from 160 million to 252 million computations per second when transfers are done in parallel. Note that the optimal batch sizes are in practice only machine-dependent, as the shading computation efficiency does not depend on the scene, and the visibility computation efficiency is almost constant for any batch size beyond very small values.
6.2. CBPT
To quantify the efficiency of CBPT, we count the number of f_{i,j}(x,y) computations performed during a complete CBPT step, and divide it by the time needed to complete the whole step, including population sampling and splatting. We call this efficiency measure the basic contributions throughput.
         ring   comp lights   comp monitors   living
vis      10.9      22.5           24.4          19.3
camera    1.7       2.0            2.2           2.5
light     1.7       2.0            2.2           2.3
Table 2: Average time needed to complete each step on the GPU, for each scene, in milliseconds.
Figure 6: Top: Visibility throughput, in millions of tests per second, as a function of the number of visibility tests to perform in each batch. Bottom: Shading throughput, in millions of shading tuple computations per second, as a function of the number of shading computations to perform in each batch.
This measure gives meaningful and consistent results whatever the average path length is in each scene.
Computational efficiency: Table 3 gives the basic contributions throughputs obtained using CBPT, and the speedups compared to standard BPT. We compute these values when using CPU-based light-tracing (in this case N_T = 1500), to get the actual performance, and when not using it (N_T = 0), to get the bidirectional-only basic contributions throughput. The CPU version of BPT uses the same code to sample paths, and the same code to compute the f_{i,j}(x,y) values, except that all shading and visibility values are computed on the CPU. Both CBPT and standard BPT uniformly sample the image, and do not use any adaptive sampling scheme.
The impact of light-tracing on the throughputs is noticeable (around 20%), but the visual impact of a high-variance light-tracing part is much more noticeable than the gain in the bidirectional part when setting N_T to a very small value, particularly for very short rendering times. For longer rendering times and scenes where caustics are easily captured by light-tracing, N_T can be set to a smaller value, as the light-tracing part will visually converge faster than the bidirectional part.
                        CBPT                      BPT
               N_T = 0          N_T = 1500
ring           20.9 (17.4×)     15.7 (13.1×)      1.2
comp lights    16.2 (16.9×)     12.7 (13.2×)      0.96
comp monitors  16.3 (14.8×)     13.1 (11.9×)      1.1
living         16.5 (21.7×)     12.5 (16.4×)      0.76
Table 3: Basic contributions throughput for CBPT and standard BPT, in millions of f_{i,j}(x,y) values computed per second, with the speedup over BPT in parentheses.
As shown by the timings in Algorithm 2, our reformulation allows us to keep both the CPU and GPU fully loaded, the GPU computation time being masked by the CPU one. The speedup we obtain with "production settings" is consistently greater than or equal to 12× on our test scenes. Even if our samples are correlated, the correlation is spread over the whole image by our image-sampling process. This effectively avoids the appearance of any noticeable correlation pattern.
Visual comparison with standard BPT: Visually observing noise reduction is easier when looking at non-converged images, where improvements are clearly visible. Figure 7 presents the images obtained by CBPT and BPT after a few seconds of rendering, and after at least 4 samples per pixel have been computed by CBPT. As images were stored every 10 seconds, more than 4 samples per pixel may actually have been computed, but both BPT and CBPT got the same computation time. The places where the improvements are most visible are the diffuse walls, where light-space exploration is crucial to get low-variance results, and the glossy reflections. Table 4 gives the actual average number of samples per pixel for the bidirectional part of each image. As expected, the speedups obtained are similar to the ones obtained for the basic contributions throughputs, the small difference coming from the splatting, as BPT needs to splat many more values than CBPT for the same number of pairs of paths. The main information of this table is that the images presented in Figure 5 would have required from 50 to 66 hours to be computed using standard BPT, versus 4 hours with CBPT.
Memory usage and scalability: Table 5 gives the memory usage of CBPT, both on CPU and GPU. As expected, the size of the combination data on CPU and the memory size of the populations on GPU are related to the average path length. For populations, we use a conservative allocation scheme, reuse memory between populations, and refit memory zones regularly to keep the consumption low. This can lead to a substantial overestimation of the actual memory size needed, but drastically reduces the number of memory allocations, therefore providing a slight speedup. Despite this, memory requirements remain low for all our scenes on CPU (between 100 and 200 MB), and very low on GPU (less than 100 MB). Table 5 also shows that our method can handle scenes much larger than the ones we used. Indeed, the scenes' kd-tree sizes are kept relatively low even for quite complex scenes (about 50 MB).
[Figure 7 image grid: for each scene, a BPT row and a CBPT row, each showing a preview image and a ≈4 spp image with close-ups. Preview / ≈4 spp times: ring 10 s / 40 s; comp lights 30 s / 50 s; comp monitors 30 s / 50 s; living 20 s / 40 s.]
Figure 7: Results obtained by BPT and CBPT on our test scenes, after approximately 10 seconds of actual computations, and after CBPT has computed approximately 4 samples per pixel. Images are rendered at 800 × 450, except ring, which is rendered at 800 × 600. Note that for all the scenes, mipmaps are lazily built when first accessed, explaining the 30 and 20 seconds of total rendering time for the preview configurations of the comp and living scenes. The time spent building these mipmaps is negligible for the ring scene, but takes 16 and 8 seconds in the comp and living scenes respectively; the mipmaps are generally built when sampling the first paths. This also shows that our system can be seamlessly used together with all the usual ways of reducing the peak memory usage, as it does not impact the rendering engine architecture.
                     CBPT               BPT           pairs
               prev.    ≈4 spp    prev.    ≈4 spp     ratio
ring           1.64      5.15     1.60      5.41      14.3×
comp lights    1.05      3.99     1.38      4.44      13.5×
comp monitors  1.36      4.12     1.61      4.77      12.9×
living         1.62      4.23     1.53      3.86      16.4×
Table 4: Overall speedup measurement: average number of samples computed per pixel for the bidirectional part of the images of Figure 7. This is equivalent to the average number of camera paths that have contributed to each pixel. The last column gives the ratio between CBPT and BPT of the number of pairs of paths contributing to the bidirectional part of each pixel, which is a good measure of the actual speedup brought by CBPT over standard BPT. For standard BPT, each camera path is combined with one light path, therefore the number of pairs of paths per pixel is equal to the number of camera paths. For CBPT, as each camera path is combined with N_L light paths, the number of pairs is N_L times the number of camera paths per pixel. In our tests, we use N_L = 15.
                      CPU                  GPU
               pops.    comb.    kd-tree   pops.   comb.
ring           73.3     48.0       0.47     3.8    23.5
comp lights    91.2     66.0      56.1      4.8    23.5
comp monitors  95.0     72.3      56.1      4.8    23.5
living         60.8     60.3      58.5      5.0    23.5
Table 5: Memory usage for the populations and the combination data on CPU, and memory usage for the kd-tree, the population data (positions, BSDF parameters, etc.), and all the batch buffers on GPU, in MB.
Therefore, scenes that contain several millions of polygons fit in the GPU memory. Moreover, the memory size of the populations is negligible except in idiosyncratic cases, as even with participating media, the paths remain short (10-20 vertices on average).
7. Conclusion
Bidirectional path-tracing is an unbiased and highly robust rendering algorithm, but it is not well suited for GPU implementation, as it requires a lot of branching. By exhaustively combining populations of paths instead of single paths, we were able to divide the algorithm into two parts, each one well suited for either the CPU or the GPU. We keep the CPU, the GPU, and the memory bus between them busy simultaneously by interleaving the steps of CBPT. The GPU part is made efficient by using high-level optimization techniques such as double buffering and asynchronism.
We have shown that CBPT is more than an order of magnitude faster than standard BPT on various test scenes, without affecting the size of the datasets or the flexibility of the underlying rendering engine in terms of shaders, light models, and camera models. This makes CBPT very well suited for accelerating image computation in production-oriented engines.
References
[AS00] Ashikhmin M., Shirley P.: An anisotropic Phong light reflection model. Journal of Graphics Tools 5 (2000), 25–32.
[CTE05] Cline D., Talbot J., Egbert P.: Energy redistribution path-tracing. In SIGGRAPH '05 (2005), pp. 1186–1195.
[Faj10] Fajardo M.: Ray tracing solution in film production rendering. http://www.graphics.cornell.edu/~jaroslav/gicourse2010/, 2010. SIGGRAPH 2010 Course on global illumination in production rendering.
[HJ09] Hachisuka T., Jensen H.: Stochastic progressive photon mapping. In SIGGRAPH Asia '09: ACM SIGGRAPH Asia 2009 Papers (2009), ACM, pp. 1–8.
[HOJ08] Hachisuka T., Ogaki S., Jensen H.: Progressive photon mapping. ACM Trans. Graph. 27, 5 (2008), 1–8.
[KFC10] Křivánek J., Fajardo M., Christensen P. H., Tabellion E., Bunnell M., Larsson D., Kaplanyan A.: Global illumination across industries. http://www.graphics.cornell.edu/~jaroslav/gicourse2010/, 2010. SIGGRAPH 2010 Course on global illumination in production rendering.
[KH01] Keller A., Heidrich W.: Interleaved sampling. In EGWR '01 (2001), pp. 269–276.
[KSKAC02] Kelemen C., Szirmay-Kalos L., Antal G., Csonka F.: A simple and robust mutation strategy for the Metropolis light transport algorithm. In Eurographics '02 (2002), pp. 531–540.
[LFCD07] Lai Y.-C., Fan S., Chenney S., Dyer C.: Photorealistic image rendering with population Monte Carlo energy redistribution. In EGSR '07 (2007), pp. 287–296.
[Lux10] LuxRender: LuxRays. http://www.luxrender.net/wiki/index.php?title=LuxRays, 2010.
[LW93] Lafortune E. P., Willems Y. D.: Bi-directional path tracing. In Compugraphics '93 (1993), pp. 145–153.
[OMP10] ompf: Hybrid bidirectional path-tracer development thread. http://ompf.org/forum/viewtopic.php?f=6&t=1834, 2010.
[Seg08] Segovia B.: Radius-CUDA raytracing kernel. http://bouliiii.blogspot.com/2008/08/real-time-ray-tracing-with-cuda-100.html, 2008.
[VG94] Veach E., Guibas L. J.: Bidirectional estimators for light transport. In EGWR '94 (1994), pp. 147–162.
[VG95] Veach E., Guibas L. J.: Optimally combining sampling techniques for Monte Carlo rendering. In SIGGRAPH '95 (1995), pp. 419–428.
[VG97] Veach E., Guibas L. J.: Metropolis light transport. In SIGGRAPH '97 (1997), pp. 65–76.
[WMLT07] Walter B., Marschner S. R., Li H., Torrance K. E.: Microfacet models for refraction through rough surfaces. In EGSR '07 (2007), pp. 195–206.
... BDPT is widely used to resolve the rendering equation and obtain unbiased results [Lafortune and Willems 1993;Veach and Guibas 1995a,b]. The key is to establish the connections between the eye vertices (i.e., eye sub-path) and the light vertices (i.e., light subpath) [Pajot et al. 2011;Popov et al. 2015;Walter et al. 2012]. Davidovič et al. proposed light vertex cache BDPT (LVCBPT) that uses light vertex cache to store and resample light sub-paths. ...
Article
Full-text available
Bidirectional path tracing (BDPT) can be accelerated by selecting appropriate light sub-paths for connection. However, existing algorithms need to perform frequent distribution reconstruction and have expensive overhead. We present a novel approach, SPCBPT, for probabilistic connections that constructs the light selection distribution in sub-path space. Our approach bins the sub-paths into multiple subspaces and keeps the sub-paths in the same subspace of low discrepancy, wherein the light sub-paths can be selected by a subspace-based two-stage sampling method, i.e., first sampling the light subspace and then resampling the light sub-paths within this subspace. The subspace-based distribution is free of reconstruction and provides efficient light selection at a very low cost. We also propose a method that considers the Multiple Importance Sampling (MIS) term in the light selection and thus obtain an MIS-aware distribution that can minimize the upper bound of variance of the combined estimator. Prior methods typically omit this MIS weights term. We evaluate our algorithm using various benchmarks, and the results show that our approach has superior performance and can significantly reduce the noise compared with the state-of-the-art method.
... Document [46] introduced a large-scale parallel ray-tracing algorithm using GPU, but its ray-tracing algorithm is only studied for spheres and lacks generality. Document [47] introduced efficient hybrid CPU-GPU rendering combined two-way path tracing and mentions the dual-level parallel acceleration of CPU and GPU, but it still had the problem of the data communication and transmission between CPU and GPU. Document [48,49] made efficient use of high-bandwidth networks, optimized communication methods, balanced calculations, and reduced communication volume. ...
Article
Full-text available
This paper proposes a parallel computing analysis model HPM and analyzes the parallel architecture of CPU–GPU based on this model. On this basis, we study the parallel optimization of the ray-tracing algorithm on the CPU–GPU parallel architecture and give full play to the parallelism between nodes, the parallelism of the multi-core CPU inside the node, and the parallelism of the GPU, which improve the calculation speed of the ray-tracing algorithm. This paper uses the space division technology to divide the ground data, constructs the KD-tree organization structure, and improves the construction method of KD-tree to reduce the time complexity of the algorithm. The ground data is evenly distributed to each computing node, and the computing nodes use a combination of CPU–GPU for parallel optimization. This method dramatically improves the drawing speed while ensuring the image quality and provides an effective means for quickly generating photorealistic images.
... Bidirectional path tracing reuses entire light carrying paths; early variants connected single vertices on pairs of camera and light subpaths, reusing their prefixes [Lafortune and Willems 1993;Veach and Guibas 1995a]. More recently, reusing paths enabled efficiency improvements and allows judicious choices of path connections [Chaitanya et al. 2018;Pajot et al. 2011;Popov et al. 2015;Tokuyoshi and Harada 2019]. Closely related is work on reusing paths in unidirectional light transport algorithms, where previously-sampled paths are stored and then connected to new paths [Bauszat et al. 2017;Bekaert et al. 2002;Castro et al. 2008;Xu and Sbert 2007]. ...
Article
Efficiently rendering direct lighting from millions of dynamic light sources using Monte Carlo integration remains a challenging problem, even for off-line rendering systems. We introduce a new algorithm---ReSTIR---that renders such lighting interactively, at high quality, and without needing to maintain complex data structures. We repeatedly resample a set of candidate light samples and apply further spatial and temporal resampling to leverage information from relevant nearby samples. We derive an unbiased Monte Carlo estimator for this approach, and show that it achieves equal-error 6×-60× faster than state-of-the-art methods. A biased estimator reduces noise further and is 35×-65× faster, at the cost of some energy loss. We implemented our approach on the GPU, rendering complex scenes containing up to 3.4 million dynamic, emissive triangles in under 50 ms per frame while tracing at most 8 rays per pixel.
... In another direction, the stochastic algorithm of "Monte-Carlo" [22] is sometimes used despite its slower convergence. "Path Tracing" methods [41] launch random rays from pixels of the image plane until one hits an object. It can be bi-directional (rays are shot from camera and sources simultaneously). ...
Preprint
Full-text available
This paper covers the time consuming issues intrinsic to physically-based image rendering algorithms. First, glass materials optical properties were measured on samples of real glasses and other objects materials inside an hotel room were characterized by deducing spectral data from multiple trichromatic images. We then present the rendering model and ray-tracing algorithm implemented in Virtuelium, an open source software. In order to accelerate the computation of the interactions between light rays and objects, the ray-tracing algorithm is parallelized by means of domain decomposition method techniques. Numerical experiments show that the speedups obtained with classical parallelization techniques are significantly less significant than those achieved with parallel domain decomposition methods.
... In another direction, the use of the stochastic algorithm of "Monte-Carlo" [20] is sometimes used despite its slower convergence. "Path Tracing" methods [38] launch random rays from pixels of the image plane until one hits an object. It can be bi-directional (shots rays from camera and sources simultaneously). ...
Preprint
Full-text available
In this paper, we use an original ray-tracing domain decomposition method to address image rendering of naturally lighted scenes. This new method allows to particularly analyze rendering problems on parallel architectures, in the case of interactions between light-rays and glass material. Numerical experiments, for medieval glass rendering within the church of the Royaumont abbey, illustrate the performance of the proposed ray-tracing domain decomposition method (DDM) on multi-cores and multi-processors architectures. On one hand, applying domain decomposition techniques increases speedups obtained by parallelizing the computation. On the other hand, for a fixed number of parallel processes, we notice that speedups increase as the number of sub-domains do.
... In GI rendering, it is not only the light sources which are responsible of the illumination of an object but also all the reflective objects around. It is an observable progress in terms of virtual reality, and many GI algorithms have been developed and improved [43,38,33,12]. Our photon mapping algorithm proceeds in two sequential steps. ...
Preprint
Full-text available
In the context of a virtual reconstitution of the destroyed Royaumont abbey church, this paper investigates computer sciences issues intrinsic to the physically-based image rendering. First, a virtual model was designed from historical sources and archaeological descriptions. Then some materials physical properties were measured on remains of the church and on pieces from similar ancient churches. We specify the properties of our lighting source which is a representation of the sun, and present the rendering algorithm implemented in our software Virtuelium. In order to accelerate the computation of the interactions between light-rays and objects, this ray-tracing algorithm is parallelized by means of domain decomposition techniques. Numerical experiments show that the computational time saved by a classic parallelization is much less significant than that gained with our approach.
... Previous studies have introduced the importance of utilizing every heterogeneous core in a system and have designed different types of methodologies and algorithms for task dispatching. For instance, Pajot presented a hybrid bidirectional path tracing implementation which utilizes techniques such as double buffering, batch processing, and asynchronous execution to balance tasks between CPU and GPU [23,24]. Tzeng at el. experienced several task management techniques, task-donation, and task stealing methodologies for irregular workloads [32,11]. ...
Article
The research interest of real-time global illumination has increased due to the growing demand of graphics applications such as virtual reality. Recently, the design that combines Image-Based Rendering (IBR) and Ray-Tracing to create Synthetic Light Field (SLF) has been widely adopted to provide delicate visual experience for multiple viewpoints at an acceptable frame rate. However, despite its parallel characteristic, constructing a SLF is still inefficient on modern Graphics Processing Unit (GPU) due to the irregularities. For instance, the issues caused by branch divergence, early-termination and irregular memory access prolong the execution time that cannot be simply resolved by workload merging. In this paper, we proposed a Runtime framework that reorganizes the execution into a pipeline-based pattern with grouping of primary rays. The workloads are later distributed to all heterogeneous cores to increase the efficiency of the execution. With this approach, the number of valid rays can be maintained at a high level with less divergence of paths. Based on the experiment on a heterogeneous system, the maximum throughput for a single GPU becomes 3.12 times higher than the original on average and becomes even higher on systems with multiple heterogeneous cores.
Article
Monte‐Carlo rendering requires determining the visibility between scene points as the most common and compute intense operation to establish paths between camera and light source. Unfortunately, many tests reveal occlusions and the corresponding paths do not contribute to the final image. In this work, we present next event estimation++ (NEE++): a visibility mapping technique to perform visibility tests in a more informed way by caching voxel to voxel visibility probabilities. We show two scenarios: Russian roulette style rejection of visibility tests and direct importance sampling of the visibility. We show applications to next event estimation and light sampling in a uni‐directional path tracer, and light‐subpath sampling in Bi‐Directional Path Tracing. The technique is simple to implement, easy to add to existing rendering systems, and comes at almost no cost, as the required information can be directly extracted from the rendering process itself. It discards up to 80% of visibility tests on average, while reducing variance by ∼20% compared to other state‐of‐the‐art light sampling techniques with the same number of samples. It gracefully handles complex scenes with efficiency similar to Metropolis light transport techniques but with a more uniform convergence.
Article
Recent advances in bidirectional path tracing (BPT) reveal that the use of multiple light sub-paths and the resampling of a small number of these can improve the efficiency of BPT. By increasing the number of pre-sampled light sub-paths, the possibility of generating light paths that provide large contributions can be better explored, and this can alleviate the correlation of light paths due to the reuse of pre-sampled light sub-paths by all eye sub-paths. The increased number of pre-sampled light sub-paths, however, also incurs a high computational cost. In this paper, we propose a two-stage resampling method for BPT to efficiently handle a large number of pre-sampled light sub-paths. We also derive a weighting function that accounts for the changes in path probability due to the two-stage resampling. Our method can handle two orders of magnitude more pre-sampled light sub-paths than previous methods in equal-time rendering, resulting in stable and better noise reduction than state-of-the-art methods.
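Schematically, the two-stage resampling can be pictured as below; the scoring functions are illustrative, and in the actual method the selection probabilities must be folded into the weighting function the paper derives:

import random

def two_stage_resample(pool, cheap_score, accurate_score, k1, k2):
    # Stage 1: a cheap proxy score thins the very large pool of
    # pre-sampled light sub-paths down to k1 candidates.
    w1 = [cheap_score(s) for s in pool]
    if sum(w1) == 0.0:
        return []
    stage1 = random.choices(pool, weights=w1, k=k1)
    # Stage 2: a more accurate (and more expensive) score selects the
    # k2 sub-paths that are actually connected to the eye sub-path.
    w2 = [accurate_score(s) for s in stage1]
    if sum(w2) == 0.0:
        return []
    return random.choices(stage1, weights=w2, k=k2)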
Article
Full-text available
Most of the research on the global illumination problem in computer graphics has been concentrated on finite-element (radiosity) techniques. Monte Carlo methods are an intriguing alternative, attractive for their ability to handle very general scene descriptions without the need for meshing. In this paper we study techniques for reducing the sampling noise inherent in pure Monte Carlo approaches to global illumination. Every light energy transport path from a light source to the eye can be generated in a number of different ways, according to how we partition the path into an initial portion traced from a light source and a final portion traced from the eye. Each partitioning gives us a different unbiased estimator, but some partitionings give estimators with much lower variance than others. We give examples of this phenomenon and describe its significance. We also present work in progress on the problem of combining these multiple estimators to achieve near-optimal variance, with the goal of producing images with less noise for a given number of samples.
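In the notation that later became standard, a transport path \bar{x} = x_0 \dots x_k can be split into s vertices traced from a light source and t vertices traced from the eye (s + t = k + 1), and each split yields an unbiased estimator
\[
F_{s,t} \;=\; \frac{f(\bar{x})}{p_{s,t}(\bar{x})},
\]
where f is the path's contribution and p_{s,t} is the joint density of generating its light-traced prefix and eye-traced suffix under that particular partition; the variance of F_{s,t} depends strongly on the partition chosen.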
Conference Paper
Full-text available
This work presents a novel global illumination algorithm which concentrates computation on important light transport paths and automatically adjusts the area over which energy is distributed for each light transport path. We adapt the statistical framework of Population Monte Carlo to global illumination to improve rendering efficiency. Information collected in previous iterations is used to guide subsequent iterations by adapting the kernel function to approximate the target distribution without introducing bias into the final result. Based on this framework, our algorithm automatically adapts the amount of energy redistribution at different pixels and the area over which energy is redistributed. Our results show that efficiency can be improved by exploiting the correlated information among light transport paths.
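A generic Population Monte Carlo iteration, reduced to its bare structure (the adaptive kernels are abstracted into propose/proposal_pdf; the paper's rendering-specific adaptation of per-pixel redistribution areas is not shown):

import random

def pmc_step(samples, propose, target_pdf, proposal_pdf):
    # Propose a new population from the current, adapted kernels.
    proposed = [propose(s) for s in samples]
    # Importance weights correct for the kernel/target mismatch,
    # which is what keeps the scheme free of bias.
    w = [target_pdf(x) / max(proposal_pdf(x), 1e-12) for x in proposed]
    total = sum(w)
    if total == 0.0:
        return samples
    # Resample proportionally to the weights to form the next population.
    return random.choices(proposed, weights=[wi / total for wi in w], k=len(samples))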
Conference Paper
Full-text available
Monte Carlo integration is a powerful technique for the evaluation of difficult integrals. Applications in rendering include distribution ray tracing, Monte Carlo path tracing, and form-factor computation for radiosity methods. In these cases variance can often be significantly reduced by drawing samples from several distributions, each designed to sample well some difficult aspect of the integrand. Normally this is done by explicitly partitioning the integration domain into regions that are sampled differently. We present a powerful alternative for constructing robust Monte Carlo estimators, by combining samples from several distributions in a way that is provably good. These estimators are unbiased, and can reduce variance significantly at little additional cost. We present experiments and measurements from several areas in rendering: calculation of glossy highlights from area light sources, the "final gather" pass of some radiosity algorithms, and direct solution of the rendering equation using bidirectional path tracing.
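The combination this paper introduces is multiple importance sampling; with n_i samples drawn from each density p_i, the balance heuristic weights a sample x drawn from technique i as
\[
w_i(x) \;=\; \frac{n_i\, p_i(x)}{\sum_k n_k\, p_k(x)},
\qquad
F \;=\; \sum_i \frac{1}{n_i} \sum_{j=1}^{n_i} w_i(X_{i,j})\, \frac{f(X_{i,j})}{p_i(X_{i,j})},
\]
which remains unbiased for any weighting functions that sum to one wherever f is non-zero.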
Article
Full-text available
The paper presents a new mutation strategy for the Metropolis light transport algorithm, which works in the space of the uniform random numbers used to build up paths. Thus, instead of mutating directly in path space, mutations are realized in the infinite-dimensional unit cube of pseudo-random numbers, and these points are transformed to path space according to BRDF sampling, light source sampling, and Russian roulette. This transformation makes the integrand and the importance function flatter and thus increases the acceptance probability of the new samples in the Metropolis algorithm. A higher acceptance ratio, in turn, reduces the correlation of the samples, which increases the speed of convergence. When mutations are computed, a new random point is usually selected in the neighborhood of the previous one, but according to our proposition, called "large steps", an independent point is sometimes obtained. Large steps greatly reduce the start-up bias and guarantee the ergodicity of the process. Since some samples are generated independently of the previous sample, this method can also be considered a combination of the Metropolis algorithm with a classical random walk. Metropolis light transport is good at rendering bright image areas but poor at rendering dark sections, since it allocates samples proportionally to pixel luminance. Conventional random walks, on the other hand, have the same performance everywhere; thus they are poorer than the Metropolis method in bright areas, but better in dark sections. In order to keep the merits of both approaches, we use multiple importance sampling to combine their results; that is, the combined method will be as good in bright regions as Metropolis and in dark regions as random walks. The re...
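A minimal sketch of this mutation scheme in the unit cube; a wrapped Gaussian perturbation is used here for simplicity, whereas the paper uses its own perturbation distribution:

import random

def mutate_primary_sample(u, large_step_prob=0.3, sigma=0.05):
    # "Large step": an independent uniform sample, which reduces
    # start-up bias and guarantees ergodicity of the chain.
    if random.random() < large_step_prob:
        return [random.random() for _ in u]
    # Small step: perturb each pseudo-random coordinate and wrap
    # around so the point stays inside the unit cube.
    return [(ui + random.gauss(0.0, sigma)) % 1.0 for ui in u]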
Article
Full-text available
We present Energy Redistribution (ER) sampling as an unbiased method to solve correlated integral problems. ER sampling is a hybrid algorithm that uses Metropolis-sampling-like mutation strategies in a standard Monte Carlo integration setting, rather than resorting to an intermediate probability distribution step. In the context of global illumination, we present Energy Redistribution Path Tracing (ERPT). Beginning with an initial set of light samples taken from a path tracer, ERPT uses path mutations to redistribute the energy of the samples over the image plane to reduce variance. The result is a global illumination algorithm that is conceptually simpler than Metropolis Light Transport (MLT) while retaining its most powerful feature, path mutation. We compare images generated with the new technique to standard path tracing and MLT.
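The redistribution step can be sketched as a short equal-deposition mutation chain seeded by an ordinary path-traced sample (mutate, accept_prob, and splat are illustrative placeholders; practical implementations often split each deposit between the current and proposed paths in proportion to the acceptance probability):

import random

def redistribute_energy(seed_path, seed_energy, mutate, accept_prob, splat, chain_len=100):
    # Each step of the chain deposits an equal share of the seed
    # sample's energy at the current path's image location.
    quantum = seed_energy / chain_len
    current = seed_path
    for _ in range(chain_len):
        candidate = mutate(current)
        if random.random() < accept_prob(current, candidate):
            current = candidate
        splat(current.pixel, quantum)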
Article
Full-text available
We present a new Monte Carlo method for solving the light transport problem, inspired by the Metropolis sampling method in computational physics. To render an image, we generate a sequence of light transport paths by randomly mutating a single current path (e.g. adding a new vertex to the path). Each mutation is accepted or rejected with a carefully chosen probability, to ensure that paths are sampled according to the contribution they make to the ideal image. We then estimate this image by sampling many paths, and recording their locations on the image plane. Our algorithm is unbiased, handles general geometric and scattering models, uses little storage, and can be orders of magnitude more efficient than previous unbiased approaches. It performs especially well on problems that are usually considered difficult, e.g. those involving bright indirect light, small geometric holes, or glossy surfaces. Furthermore, it is competitive with previous unbiased algorithms even for relatively simple ...
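The carefully chosen probability mentioned here is the standard Metropolis-Hastings acceptance ratio: for a mutation from path x to path y with transition density T,
\[
a(x \to y) \;=\; \min\!\left(1,\; \frac{f(y)\, T(y \to x)}{f(x)\, T(x \to y)}\right),
\]
so that, in the stationary distribution, paths are visited in proportion to their contribution f.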
Conference Paper
Microfacet models have proven very successful for modeling light reflection from rough surfaces. In this paper we review microfacet theory and demonstrate how it can be extended to simulate transmission through rough surfaces such as etched glass. We compare the resulting transmission model to measured data from several real surfaces and discuss appropriate choices for the microfacet distribution and shadowing-masking functions. Since rendering transmission through media requires tracking light that crosses at least two interfaces, good importance sampling is a practical necessity. Therefore, we also describe efficient schemes for sampling the microfacet models and the corresponding probability density functions.
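For reference, a small sketch of refraction through a sampled microfacet normal, following Snell's law (vectors are plain 3-tuples with w_i pointing away from the surface; sampling the microfacet distribution itself is not shown):

import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def refract(w_i, m, eta_i, eta_o):
    # Refract w_i about the sampled microfacet normal m;
    # returns None on total internal reflection.
    eta = eta_i / eta_o
    c = dot(w_i, m)
    k = 1.0 - eta * eta * (1.0 - c * c)
    if k < 0.0:
        return None
    sign = 1.0 if c > 0.0 else -1.0
    coeff = eta * c - sign * math.sqrt(k)
    return [coeff * m_j - eta * w_j for m_j, w_j in zip(m, w_i)]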
Article
This paper presents a simple extension of progressive photon mapping for simulating global illumination with effects such as depth-of-field, motion blur, and glossy reflections. Progressive photon mapping is a robust global illumination algorithm that can handle complex illumination settings, including specular-diffuse-specular paths. The algorithm can compute the correct radiance value at a point in the limit. However, progressive photon mapping is not effective at rendering distributed ray tracing effects, such as depth-of-field, which require multiple pixel samples in order to compute the correct average radiance value over a region. In this paper, we introduce a new formulation of progressive photon mapping, called stochastic progressive photon mapping, which makes it possible to compute the correct average radiance value for a region. The key idea is to use shared photon statistics within the region rather than isolated photon statistics at a point. The algorithm is easy to implement, and our results demonstrate how it efficiently handles scenes with distributed ray tracing effects, while maintaining the robustness of progressive photon mapping in scenes with complex lighting.
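A sketch of the shared-statistics update for one region, using the standard progressive update rule (alpha in (0,1) controls how fast the gather radius shrinks; the field names are illustrative):

import math

def progressive_update(stats, m_i, phi_i, alpha=0.7):
    # stats holds the accumulated photon count N, gather radius R,
    # and unnormalized flux tau; m_i photons with summed flux phi_i
    # were gathered in this pass.
    if m_i == 0:
        return stats
    n, r, tau = stats["N"], stats["R"], stats["tau"]
    ratio = (n + alpha * m_i) / (n + m_i)   # fraction of new photons kept
    stats["N"] = n + alpha * m_i
    stats["R"] = r * math.sqrt(ratio)       # shrink the gather radius
    stats["tau"] = (tau + phi_i) * ratio    # rescale flux to the new radius
    return stats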
Article
This paper introduces a simple and robust progressive global illumination algorithm based on photon mapping. Progressive photon mapping is a multi-pass algorithm where the first pass is ray tracing followed by any number of photon tracing passes. Each photon tracing pass results in an increasingly accurate global illumination solution that can be visualized in order to provide progressive feedback. Progressive photon mapping uses a new radiance estimate that converges to the correct radiance value as more photons are used. It is not necessary to store the full photon map, and unlike standard photon mapping it possible to compute a global illumination solution with any desired accuracy using a limited amount of memory. Compared with existing Monte Carlo ray tracing methods progressive photon mapping provides an efficient and robust alternative in the presence of complex light transport such as caustics and in particular reflections of caustics.