
EUROGRAPHICS 2011 / M. Chen and O. Deussen

(Guest Editors)

Volume 30 (2011), Number 2

Combinatorial Bidirectional Path-Tracing

for Efficient Hybrid CPU/GPU Rendering

Anthony Pajot¹, Loïc Barthe¹, Mathias Paulin¹, and Pierre Poulin²

¹IRIT-CNRS, Université de Toulouse, France   ²LIGUM, Dept. I.R.O., Université de Montréal, Canada.

Figure 1: Images of a scene with a large dataset (758K triangles, lots of textures) featuring complex lighting conditions (glossy reflections, caustics, strong indirect lighting, etc.) computed in respectively 50 seconds (left) and one hour (right). Standard bidirectional path-tracing requires respectively 11 minutes and 13 hours to obtain the same results.

Abstract

This paper presents a reformulation of bidirectional path-tracing that adequately divides the algorithm into processes efficiently executed in parallel on both the CPU and the GPU. We thus benefit from high-level optimization techniques such as double buffering, batch processing, and asynchronous execution, as well as from the exploitation of most of the CPU, GPU, and memory bus capabilities. Our approach, while avoiding the limitations of pure GPU implementations (such as limited complexity of shaders, light or camera models, and processed scene datasets), is more than ten times faster than standard bidirectional path-tracing implementations, leading to performance suitable for production-oriented rendering engines.

Categories and Subject Descriptors (according to ACM CCS): I.3.7 [Computer Graphics]: Three-Dimensional Graphics and Realism—Color, shading, shadowing, and texture; I.6.8 [Simulation and Modeling]: Type of Simulation—Monte-Carlo

1. Introduction

Global illumination brings a lot of realism to computer-generated images. Therefore, production-oriented rendering engines use it to reach photorealism.

Algorithms to compute global illumination have to meet a certain number of constraints in order to be seamlessly integrated in a production pipeline:

• From an artist's point of view, the algorithm should have intuitive parameters, and should be able to provide interactive feedback as well as high-quality final images.

• From a scene-design point of view, it should be able to manage huge datasets as well as complex and flexible shaders, various light models, and various camera models.

• From a data-management point of view, it should avoid precomputed data as much as possible. Indeed, it is tedious to keep such data synchronized across the artists working on the same scene, or between the computers of a renderfarm.


• From a computational point of view, it must be robust enough to handle highly dynamic scenes and all-frequency indirect lighting, to give the artists complete freedom in their designs. For automated rendering, it must give predictable and reproducible results in a given time frame. Ideally, it should be easy to use on clusters, to be able to render one image using all the resources of a renderfarm.

Methods used nowadays mostly rely on point clouds or other types of precomputed representations [KFC∗10]. As they rely on precomputed data, interactive feedback is not straightforward, since these data must be recomputed each time the scene changes. Even though they are predictable and able to handle very large amounts of data, precomputed representations still have problems handling highly dynamic or high-frequency indirect lighting. Moreover, production pipelines must be adapted appropriately to keep these data in sync during the production process, and computing a single image on a cluster can only be done once the data have been computed.

To remove all these problems, unbiased methods have been investigated, and path-tracing based algorithms are becoming mature enough to be successfully used in the movie industry [Faj10]. In addition to being potentially fully automatic (thus user-friendly), unbiasedness makes these methods easy to deploy on clusters, as independent renderings can simply be averaged to compute the final image. As they do not require any precomputed data and do not rely on any interpolation scheme, they also naturally handle highly dynamic scenes. Moreover, they use independent samples, so precision requirements such as a given number of samples per pixel are easy to formulate. Finally, unlike sequential methods, the number of samples computed in a given time can be measured, so that the results are predictable and reproducible when the time frame is fixed.

Nevertheless, path-tracing exhibits large variance when high-frequency or strong indirect lighting effects such as caustics are present in a scene, leading to visually unpleasant artefacts in the rendering. To reduce these artefacts, constraints can be added to the indirect lighting, e.g., reducing the sharpness of glossy reflections [Faj10], or enlarging lights. Although interactive feedback can be provided for scenes where path-tracing has a very low variance, a large amount of time is needed to obtain even a rough preview of the final appearance for scenes with high variance. From a more general point of view, unbiased methods have a larger computational cost than methods based on precomputed data, which is a problem for wide acceptance.

Bidirectional path-tracing (BPT) [VG94, LW93] has the same advantages as path-tracing, but is much more robust with respect to indirect lighting, providing low-variance results even for complex lighting conditions. Even though it is more computationally efficient than path-tracing, it remains too slow for interactive feedback, and is still slower than methods based on precomputed data. Recently, attempts at making it faster by using the GPU as a co-processor have been presented in the rendering community [OMP10]; however, the proposed implementation does not allow an efficient collaboration of the CPU and GPU, keeping most of the processing load on the CPU while the GPU remains mostly idle.

Contribution: In this paper, we combine correlated sampling and standard BPT to efficiently use both CPU and GPU in a cooperative way (Section 3). The basic principle of BPT is to repeatedly sample an optical path leaving from the camera and an optical path leaving from the light. Complete paths are then created by linking together each subpath of the camera path with each subpath of the light path. The last vertex of each subpath is called a linking vertex, and the segment between the two linking vertices is the linking segment. A complete path created this way contributes to the final image if the linking vertices are mutually visible, and if some of the energy arriving at the light linking vertex is scattered to the camera path. Instead of combining two paths, we combine sets of camera and light paths, computing the values needed for linking on the GPU. As each camera path is combined with each light path, many more linking segments are available, allowing us to use the GPU at its maximum capacity without increasing the cost of sampling the paths (Section 4). We then interleave the CPU and GPU parts in order to obtain an algorithm where both the CPU and GPU are always busy (Section 5). This reformulation reduces the processing time by a factor varying between 12 and 16 compared to standard BPT (Section 6), allowing feedback in less than a minute even for complex scenes, and the computation of high-quality images in one hour, as shown in Figure 1.

2. Related Work

Setting aside computational efficiency and GPU use, both biased and unbiased algorithms that do not use precomputed data exist to produce high-quality images.

On the unbiased side, sequential methods based on Markov-Chain Monte-Carlo [VG97, KSKAC02, CTE05, LFCD07] have been used to improve the robustness of standard Monte-Carlo methods for very difficult scenes. Unfortunately, they can be highly dependent on the starting state of the chain, and do not provide feedback as rapidly as standard Monte-Carlo methods, since the time to cover the whole screen is typically longer. The gain these methods bring is most visible on very difficult scenes, but remains quite limited for more common scenes, for which standard BPT is highly efficient.

On the biased side, Hachisuka et al. [HOJ08, HJ09] introduced progressive photon mapping and stochastic progressive photon mapping, two consistent algorithms based on photon mapping. Even though they are robust, efficient, and able to produce high-quality images, being consistent instead of unbiased prevents these algorithms from being directly usable in renderfarms for single-image computations. Instead, they need to be specifically adapted to avoid artefacts in the final images.

Using both the CPU and GPU in a cooperative way can provide a large performance gain, allowing the methods above to produce high-quality results or rough previews significantly faster.


Attempts at isolating parts of algorithms to execute them on the GPU are being examined in rendering engines, such as in luxrender [Lux10], where intersection tests are performed on the GPU. The main problem developers face is keeping both CPU and GPU busy all the time. In general, the CPU is too slow to provide enough work to the GPU. More generally, it is not easy to adapt the algorithms presented above to efficiently use the GPU to compute intermediate data without restricting the size of the datasets or the complexity of the shaders. In fact, sampling, which must be done on the CPU as it involves the entire dataset and the shaders, would in general require much more time than the GPU part, leading to a negligible gain.

3. Combinatorial Bidirectional Path-Tracing (CBPT)

3.1. Base Algorithm

In BPT-based algorithms, a camera path $x = (x_0,\dots,x_c)$ and a light path $y = (y_0,\dots,y_l)$ are sampled. $x_0,\dots,x_c$ are called camera vertices, and $y_0,\dots,y_l$ are called light vertices. For each vertex $x_i$ or $y_j$ located on the surface of an object, the parameters of the bidirectional scattering distribution function (BSDF) are computed using a shader tree. Complete paths are then created by linking subpaths $(x_0,\dots,x_i)$ and $(y_0,\dots,y_j)$, for all the possible couples $(i,j)$. The number of segments of each complete path is $i+j+1$, and the linking segment is the segment $(x_i,y_j)$. Let function $g_C(x,i)$ give the energy transmitted by $x$ from $x_i$ to $x_0$, and $g_L(y,j)$ give the energy transmitted by $y$ from $y_0$ to $y_j$. The energy emitted from $y_0$ which arrives at $x_0$ via the path $z = (y_0,\dots,y_j,x_i,\dots,x_0)$ is then:

$$g_{i,j}(x,y) = g_L(y,j)\, G(y_j,x_i)\, g_C(x,i) \times f_s(y_{j-1} \to y_j \to x_i) \times V(y_j,x_i) \times f_s(y_j \to x_i \to x_{i-1}) \quad (1)$$

where $f_s$ is the BSDF, $V$ is the visibility function (1 if unoccluded, 0 otherwise), and $G$ is the geometric term.

We define the basic contribution $f_{i,j}(x,y)$ of such a complete path as:

$$f_{i,j}(x,y) = \frac{w_{i,j}(x,y)\, g_{i,j}(x,y)}{p_{i,j}(x,y)}. \quad (2)$$

$p_{i,j}(x,y)$ is the probability density with which the two subpaths have been sampled, and $w_{i,j}(x,y)$ is the multiple importance sampling (MIS) weight [VG95].

In our implementation, we use the direct BSDF probability density function (PDF) $p$ to sample directions for the camera path, the adjoint BSDF PDF $p^*$ to sample directions for the light path, and the balance heuristic [VG95] to compute the MIS weights:

$$w_{i,j}(x,y) = \frac{p_{i,j}(x,y)}{\sum_{s,t} p_{s,t}(x,y)} \quad (3)$$

where each $(s,t)$ couple is one of the possible techniques with which $z$ could have been sampled. Computing this weight requires computing $p(x_{i-1} \to x_i \to y_j)$ and $p^*(y_j \to x_i \to x_{i-1})$ using the BSDF at $x_i$, and $p(x_i \to y_j \to y_{j-1})$ and $p^*(y_{j-1} \to y_j \to x_i)$ using the BSDF at $y_j$.

When either $i$ or $j$ is less than 1, the corresponding terms are not based on the BSDF, but instead on the light or camera properties. If $j = -1$, $x_c$ lies on a light and makes a complete path by itself.

The data that depend on both $x_i$ and $y_j$ have to be computed per linking segment, and this is the most time-consuming task when computing the contribution of a complete path. These data can be computed very efficiently on the GPU, in parallel for each linking segment. Unfortunately, producing a sufficient number of linking segments would require sampling and combining a very large number of pairs, leading to very large CPU costs, large memory footprints on both CPU and GPU, and very time-consuming CPU-to-GPU memory transfers.

The key idea allowing us to use both CPU and GPU efficiently is to sample populations of $N_C$ camera paths and $N_L$ light paths independently on the CPU, and then combine each camera path with each light path. This leads to the combination of $N_C \times N_L$ pairs of paths, and gives us largely enough linking segments to benefit from the processing power of GPUs without requiring larger sampling costs. Combining all camera paths with the same light paths introduces correlation in the estimates, but does not bias the average estimator.

In practice, we have three kernels which compute, for each linking segment $(x_i, y_j)$ in parallel:

• the visibility term $V(x_i, y_j)$;

• the shading values involving the BSDF of the camera point: $f_s(y_j \to x_i \to x_{i-1})$, $p(x_{i-1} \to x_i \to y_j)$, and $p^*(y_j \to x_i \to x_{i-1})$, if $x_i$ has an associated BSDF (i.e., it is neither on the camera lens nor on a light);

• the shading values involving the BSDF of the light point: $f_s(y_{j-1} \to y_j \to x_i)$, $p(x_i \to y_j \to y_{j-1})$, and $p^*(y_{j-1} \to y_j \to x_i)$, if $y_j$ has an associated BSDF.

If $x_i$ or $y_j$ does not have an associated BSDF, the probabilities (probability of having sampled the light, probability density of having sampled the point on the light, probability density of having sampled the direction from the camera, etc.) and the light emission and importance emission terms are computed on the CPU, to keep the flexibility of the camera and light models that can be used.

The final contributions of a pair $(x,y)$ can then be split into two parts. The first part is the sum of all the basic contributions that affect the image location intersected by the first segment of $x$. We denote it the bidirectional contribution, $f_b(x,y) = \sum_{i>0,\, j \neq -1} f_{i,j}(x,y) + f_{c,-1}(x,y)$, and we call bidirectional image the image obtained by considering only the bidirectional contributions. The second part contains all the contributions obtained by light-tracing, each affecting a different image location: $\{f_{0,0}(x,y), f_{0,1}(x,y), \dots, f_{0,l}(x,y)\}$.


We call light-tracing image the image obtained by adding all the contributions from light-tracing, each multiplied by the number of pixels $N_p$ of the final image. In our implementation, light-tracing does not contribute to direct lighting, as it brings a lot of variance for this type of light transport.

As a result, a step of CBPT consists of:

1. sampling a camera population $\{x\}$ of $N_C$ paths, and a light population $\{y\}$ of $N_L$ paths;
2. computing the combination data for these two populations on the GPU;
3. computing the contributions of each pair of paths, splatting $N_C$ values to the bidirectional image, and splatting the light-tracing contributions to the light-tracing image.

Note that, as is, our algorithm does not directly handle motion blur, but motion blur can be integrated in a straightforward manner by sampling each $(\{x\},\{y\})$ population couple with a specific time value, i.e., all the paths of the two populations share the same time value, and this value differs for each couple of populations.

3.2. Discussion

Setting $N_C$ and $N_L$: Ideally, we would like to always be perceptually faster than standard BPT. Perceptually faster means computing more camera paths per second, with each camera path being combined with $N_L > 1$ light paths. This leads to a similar or faster coverage of the image, with each camera path bringing a lower-variance estimate than in standard BPT, and hence to perceptually faster convergence. $N_C$ and $N_L$ can be computed to ensure faster perceptual convergence, by measuring the time $t_b$ needed by BPT to sample, combine, and splat the contribution for a pair of paths, and the time $t_s(N_C, N_L)$ needed by CBPT to perform one step. As the combination is the most time-consuming part of a step, $t_s(N_C, N_L)$ is roughly constant as long as the number of pairs $P = N_C \times N_L$ remains constant. Therefore, for a fixed $P$, an appropriate $N_C$ value is such that

$$N_C > \frac{t_s(P)}{t_b}. \quad (4)$$

A lower $N_C$ value leads to a lower-variance estimate for each path; a larger value leads to faster coverage, but also more correlation. A side effect of Equation (4) is that if $N_C$, computed with this equation, is such that $N_L$ would be less than 1, the machine on which CBPT is running is not fast enough to bring any advantage over standard BPT for the chosen $P$.
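As an illustration of this sizing rule, the small program below applies Equation (4) to choose $N_C$ and $N_L$ for a fixed $P$. The value of $t_s(P)$ is taken from the step time reported in Algorithm 2 for $P = 30000$; the value of $t_b$ is made up for the example (the paper does not report it), and both would be measured on the target machine:

#include <cmath>
#include <cstdio>

int main()
{
    const double P   = 30000.0;  // pairs per step (e.g. 2000 x 15)
    const double t_s = 0.0345;   // measured CBPT step time for this P (s)
    const double t_b = 17e-6;    // assumed BPT time per pair of paths (s)

    const int NC = (int)std::ceil(t_s / t_b);  // Equation (4) lower bound
    const int NL = (int)(P / NC);              // light paths per step

    if (NL < 1)
        std::printf("machine too slow for this P: no gain over BPT\n");
    else
        std::printf("N_C >= %d, N_L = %d\n", NC, NL);
    return 0;
}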

Light-tracing: The discussion above does not take light-tracing into account, and using Equation (4) generally gives small $N_L$ values, leading to high-variance caustics. Light-tracing does not really take advantage of the GPU combination system, as each light subpath is combined with only one vertex of a camera path, namely the vertex which lies on the lens of the camera. Moreover, contributions for different camera paths are in general very similar, or even equal when using a pinhole camera, as all the lens vertices are at the exact same location. We therefore choose to compute light-tracing using a standard CPU-based light-tracer. At each step of CBPT, we sample $N_T$ light paths $\{y_{lt}\}$ and compute their light-tracing contributions. In general, we choose $N_T$ close to $N_C$ to get approximately the same bidirectional/light-tracing ratio as standard BPT. This leads to the final algorithm for a step of CBPT, presented in Algorithm 1.

Algorithm 1 A complete step of CBPT.

sample({x})
sample({y})
upload({x}, {y})
gpu_comp({x}, {y})
combine({x}, {y})
sample({y_lt})
compute_lt({y_lt})

Correlated sampling: Correlated sampling can take several forms, such as re-using previous paths to improve sampling efficiency [VG97, CTE05], or re-using a small number of well-behaved random numbers to compute different integrals [KH01]. In our method, the camera and light paths are all sampled independently using different random numbers, as in standard BPT. Complete paths, however, are sampled in a correlated way, as they are created by linking the subpaths in all possible ways. To avoid visible correlation patterns in the final image while ensuring a proper coverage of the image, the image-space coordinates used for each camera path are generated in an array, using a stratified scheme over the entire image with four samples per pixel. This array of samples is then shuffled. When sampling a camera population, each path consumes the samples sequentially from the array, leading to paths that most likely contribute to different parts of the image. Correlation is therefore present, but as it is spread randomly over the image, no regular patterns appear. This array is regenerated each time all the samples have been used. Adaptive sampling can be used by similarly caching a sufficient number of image coordinates to be computed according to the sampling scheme, and then shuffling this array of coordinates.
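Below is a minimal sketch of this image-sampling scheme: stratified 2D samples (four per pixel, one per 2x2 sub-pixel stratum) are generated for the whole image and then shuffled once, so that consecutive camera paths land in uncorrelated image regions. Names and the choice of RNG are illustrative:

#include <algorithm>
#include <random>
#include <vector>

struct Sample2D { float x, y; };

std::vector<Sample2D> makeShuffledSamples(int width, int height,
                                          std::mt19937& rng)
{
    std::uniform_real_distribution<float> u(0.0f, 1.0f);
    std::vector<Sample2D> samples;
    samples.reserve(size_t(width) * height * 4);

    for (int py = 0; py < height; ++py)           // stratified generation
        for (int px = 0; px < width; ++px)
            for (int sy = 0; sy < 2; ++sy)        // 2x2 strata per pixel
                for (int sx = 0; sx < 2; ++sx)
                    samples.push_back({ px + 0.5f * (sx + u(rng)),
                                        py + 0.5f * (sy + u(rng)) });

    // A single global shuffle spreads the correlation of one population
    // over the entire image, so no regular pattern appears.
    std::shuffle(samples.begin(), samples.end(), rng);
    return samples;
}
// Camera paths consume this array sequentially; it is regenerated once
// every sample has been used.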

4. Efficient Computation of Combination Data

Our algorithm requires an efficient computation of the combination data on the GPU. In this section, we suppose that for each vertex of the two populations $\{x\}$ and $\{y\}$, we have the position, the BSDF parameters, and the direction to the previous vertex in the path. The size of these data is in $O(N_C + N_L)$. As there are typically few vertices in populations, the GPU memory requirements for the population data are very low. Combining populations exhaustively avoids uploading the $O(N_C \times N_L)$ linking segment array that would otherwise be necessary.


We now give some high- and low-level details on our implementation. Figure 3 shows how the techniques we use are put together.

High-level details: The computation is divided into three main steps: visibility (blue V rectangles in Figure 3), BSDF and PDF computations – called shading computations from now on – for camera vertices (green C rectangles), and shading computations for light vertices (red L rectangles). For each step, we divide the work into batches of fixed size, each having an associated memory zone in CPU-side memory (the batch id where results are downloaded is indicated in the download rectangle). On the GPU side, we use two fixed-size buffers to store the results of the batches (represented by black and white rectangles respectively inside each task). Using batches allows us to compute the results of the current batch while downloading the results of the previous batch to the CPU, leading to increased efficiency. This also avoids the need for any array of size $O(N_C \times N_L)$ on the GPU side, making the $N_C$ and $N_L$ values bounded only by the CPU-side memory capacity. In practice, this provides more space for the scene geometry needed for the visibility tests.

As is, some shading computations are performed even though the linking vertices are not mutually visible. In fact, for the shading models we use [AS00, WMLT07], introducing an array to compute only the useful shading values is much less efficient, as computing this array on the CPU and uploading it for each batch takes more time than directly computing all the shading values.
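The CUDA sketch below shows one way such batched double buffering can be expressed, with two streams, one fixed-size device buffer per stream, and pinned host memory, so that the kernel of batch b overlaps the download of batch b-1. The kernel is a trivial stand-in for one of the three combination kernels, and all sizes are illustrative:

#include <cuda_runtime.h>

// Hypothetical stand-in for a combination kernel (visibility or shading).
__global__ void batchKernel(float* out, int batch)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    out[idx] = float(batch) + idx;    // placeholder work
}

int main()
{
    const int BATCH = 1 << 18, numBatches = 16;
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    float* d_buf[2];                  // the two fixed-size GPU buffers
    float* h_results;                 // pinned CPU-side memory zones
    cudaMalloc(&d_buf[0], BATCH * sizeof(float));
    cudaMalloc(&d_buf[1], BATCH * sizeof(float));
    cudaMallocHost(&h_results, size_t(numBatches) * BATCH * sizeof(float));

    for (int b = 0; b < numBatches; ++b) {
        const int k = b & 1;          // alternate black/white buffer
        // Kernel of batch b (stream k) can overlap the asynchronous
        // download of batch b-1, which was issued in the other stream.
        batchKernel<<<BATCH / 256, 256, 0, s[k]>>>(d_buf[k], b);
        cudaMemcpyAsync(h_results + size_t(b) * BATCH, d_buf[k],
                        BATCH * sizeof(float), cudaMemcpyDeviceToHost, s[k]);
    }
    cudaStreamSynchronize(s[0]);
    cudaStreamSynchronize(s[1]);      // all batch results now on the CPU
    return 0;
}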

Low-level details: We use NVidia's CUDA language for GPU computations. The CPU-side work consists only of synchronization, and is performed in a CUDA-specific thread, thus not interfering with the main computational threads. All the positions, directions, and BSDF data are stored in linear arrays (structure-of-arrays organisation) that are re-used across populations to avoid memory allocations, and enlarged if needed. Each array is accessible through textures, because each of the values is used many times (once for each linking segment to which a vertex belongs), and generally in coherent ways (subsequent threads are likely to use the same data, or nearby data).

For visibility, we use an adapted version of the Radius kd-tree GPU raytracing implementation by Segovia [Seg08], which gives a reasonable throughput and is well suited for individual and incoherent rays that are not stored in an array. The rays are built directly from the thread index idx, by retrieving the camera and light vertices from their indices as (idx / V_L) and (idx mod V_L) respectively, where V_L is the number of vertices in the light population.
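A minimal CUDA sketch of this per-segment indexing follows; the kd-tree traversal of [Seg08] is replaced by a trivial stub, and plain array accesses stand in for the texture fetches of the actual system:

// Hypothetical stub standing in for the Radius kd-tree traversal [Seg08].
__device__ bool occluded(float3 a, float3 b) { return false; }

__global__ void visibilityKernel(const float3* camPos,   // V_C entries
                                 const float3* lightPos, // V_L entries
                                 int VL, int numSegments, // V_C * V_L
                                 unsigned char* visible)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= numSegments) return;
    int ci = idx / VL;   // camera vertex of this linking segment
    int li = idx % VL;   // light vertex: consecutive threads share ci
    visible[idx] = occluded(camPos[ci], lightPos[li]) ? 0 : 1;
}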

The same indexing scheme is used for the camera shading computations, which makes a single BSDF processed by consecutive threads, as illustrated by Figure 2. Each thread handles one linking segment. This leads to very good locality in the accesses to the textures containing the BSDF parameters, as well as very good code coherency in the BSDF evaluation code. In fact, for most warps, the BSDF parameters are the same across all the threads.


Figure 2: Thread organisation for the shading of camera vertices. Each vertex is handled by blocks of $V_L$ consecutive threads. At least $(V_L - 2)/32$ warps execute code with the exact same BSDF parameters, as they all concern the same vertex, leading to high code coherency.


Figure 3: Temporal execution of our combination system, not temporally to scale for clarity. The meaning of each element is described in the main text.

The only difference between consecutive threads is then the directions. For light shading computations, the indexing is reversed (i.e., all the linking segments for one light vertex are processed in consecutive threads), to benefit from the same good properties as for the camera shading. All the results are written in linear arrays indexed by the thread index, leading to coalesced writes.

5. Implementation of CBPT

Using the combination data computation system described in Section 4, we implement CBPT as described in Algorithm 2. Note that population sampling and combinations are done in parallel on all available CPU cores. The main point to note about Algorithm 2 is that we process two couples of populations at the same time, in an interleaved way. As illustrated by Figure 4, this allows us to perform GPU processing, CPU processing, downloads, and uploads at the same time. As the computation of the combination data by the GPU does not need any upload and is the only process that performs downloads, there is no contention on the memory bus if the GPU is able to perform transfers in both directions at the same time. In Algorithm 2, combine() uses the data computed on the GPU and downloaded into CPU memory to compute the $f_b(x,y)$ contribution for each pair of paths, and splats it in a thread-safe way to the final image. As the number of splatted values is small, thread-safety does not create a bottleneck even with a large number of threads. compute_lt() computes light-tracing on all available CPU cores.
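A minimal CPU-side sketch of such thread-safe splatting, using a compare-and-swap loop on atomic floats; the image layout and all names are hypothetical, not the engine's actual code:

#include <atomic>
#include <vector>

struct SplatImage {
    int width, height;
    std::vector<std::atomic<float>> rgb;   // 3 channels per pixel

    SplatImage(int w, int h)
        : width(w), height(h), rgb(3 * size_t(w) * h)
    {
        for (auto& a : rgb) a.store(0.0f); // start from a black image
    }

    // Accumulate one contribution; safe to call from many worker threads.
    void splat(int px, int py, const float c[3])
    {
        size_t base = 3 * (size_t(py) * width + px);
        for (int k = 0; k < 3; ++k) {
            std::atomic<float>& a = rgb[base + k];
            float old = a.load(std::memory_order_relaxed);
            // CAS loop: retries until the addition lands atomically.
            while (!a.compare_exchange_weak(old, old + c[k],
                                            std::memory_order_relaxed)) {}
        }
    }
};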

Timings for each task of a step are reported in Algorithm 2 for a standard scene and production-oriented parameters.



Figure 4: Temporal execution of CBPT, not temporally to scale for clarity. Exact timings are given in Algorithm 2. The block labelled C contains both combine() and compute_lt(). The colors white and black of the rectangles indicate which GPU-side buffer is used to read the population data and store the results.

These timings show the efficiency of our asynchronous computation scheme: the total wall-clock time needed for one loop is 34.5 ms, compared to 60.1 ms if all computations had been done synchronously. They also show that the GPU work is done "for free", as the complete time to perform a step is equal to the sum of the times needed by each CPU task, ignoring the GPU one.

Algorithm 2 CBPT algorithm, with timings of each noteworthy element using N_C = 2000, N_L = 15, N_T = 1500, in a scene with 758K triangles and 1.5 GB of textures. The time spent by the GPU to compute all the results is given as "async time". The total time needed to perform a step is 34.5 ms.

for t = 0 to ∞ do
    sample({x}_t)                      {time: 13.5 ms}
    upload_async({x}_t)
    sample({y}_t)                      {time: 0.1 ms}
    upload_async({y}_t)
    sample({y_lt})                     {time: 10.1 ms}
    if t > 0 then
        sync_gpu_comp(t-1)             {time: 0.1 ms}
    end if
    sync_upload(t)
    gpu_comp_async(t)                  {async time: 25.7 ms}
    if t > 0 then
        combine({x}_{t-1}, {y}_{t-1})  {time: 9.6 ms}
        compute_lt({y_lt})             {time: 1.1 ms}
    end if
end for
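The synchronization skeleton of this interleaving can be sketched with one CUDA stream and two events, one per population couple in flight. Every engine routine below is a placeholder stub, and comments indicate the corresponding primitive of Algorithm 2; uploads and kernels share the stream, so sync_upload(t) is implicit in the stream ordering:

#include <cuda_runtime.h>

// Placeholder stubs for the engine's own multi-threaded CPU routines.
static void samplePopulations(int) {}
static void sampleLightPaths() {}
static void combinePopulations(int) {}
static void computeLightTracing() {}
static void uploadPopulationsAsync(int, int, cudaStream_t) {}
static void launchCombinationKernels(int, cudaStream_t) {}

int main()
{
    cudaStream_t gpu;
    cudaEvent_t done[2];              // end of combination data of step t
    cudaStreamCreate(&gpu);
    cudaEventCreate(&done[0]);
    cudaEventCreate(&done[1]);

    for (int t = 0; t < 100; ++t) {
        const int k = t & 1;               // black/white GPU buffer set
        samplePopulations(t);              // sample({x}_t), sample({y}_t)
        uploadPopulationsAsync(t, k, gpu); // upload_async(...)
        sampleLightPaths();                // sample({y_lt})

        if (t > 0)                         // sync_gpu_comp(t-1)
            cudaEventSynchronize(done[k ^ 1]);

        launchCombinationKernels(k, gpu);  // gpu_comp_async(t)
        cudaEventRecord(done[k], gpu);

        if (t > 0) {
            combinePopulations(t - 1);     // combine({x}_{t-1}, {y}_{t-1})
            computeLightTracing();         // compute_lt({y_lt})
        }
    }
    cudaStreamSynchronize(gpu);
    return 0;
}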

6. Results

We now analyze the computational behavior of the combination system and of CBPT. All measures are done on an Intel i7 920 2.80 GHz system with an NVidia GTX480 GPU and 16 GB of CPU-side memory. For our tests of CBPT, we use $N_C = 2000$, $N_L = 15$, and $N_T = 1500$ for all the scenes. These settings are not aimed at providing peak GPU performance, but rather at providing a good compromise between the throughput of the GPU part and rendering quality. No adaptive sampling is used.

                 ring                      comp lights
         GPU     CPU     ÷         GPU     CPU     ÷
vis      42.6    3.5     12.1      25.4    2.5     10.2
camera   266.7   11.8    22.6      281.7   15.6    18.1
light    266.5   17.3    15.4      280.2   13.0    21.6

                 comp monitors             living
         GPU     CPU     ÷         GPU     CPU     ÷
vis      25.6    2.9     8.8       32.2    2.1     15.3
camera   275.3   16.2    17.0      256.3   12.5    20.5
light    272.8   15.1    18.1      272.8   15.9    17.6

Table 1: Throughputs for visibility (vis), camera shading (camera), and light shading (light), when using the system described in Section 4, and when using the 4 physical cores of our processor plus hyper-threading. The "÷" column gives the ratio of throughputs, corresponding to the actual speedups. Visibility is measured in millions of visibility tests per second; camera and light shadings are measured in millions of computations of $(f_s, p, p^*)$ tuples per second (see Section 3 for the components of the tuple). All the measures take all the memory transfers into account.

We use three different scenes of various complexities, presented in Figure 5. We have chosen these challenging scenes for their high lighting complexity:

• The first scene, ring, is geometrically simple, but composed of many glossy surfaces. It produces many subtle caustics that typically lead to noticeable noise, for instance on the back wall from the glossy tiles of the floor.

• The comp scene, rendered with two different lighting configurations, is much more involved than the ring scene. The lights version is lit by the ceiling lights, with indirect lighting caused by specular transmission of the light through the glass of the light fixtures. The front room and upper parts of the back room are only indirectly lit. In the monitors version, light comes only from the TV and computer monitor. Note the caustics on the wall due to refraction in the twisted pillars made of glass, as well as the caustics beneath the glass table. Nearly all the non-diffuse materials are glossy but not ideal mirrors, leading to very blurry reflections, which is especially visible on the floor.

• The living scene is lit by six very small area lights located on the ceiling above the table and the couch. It contains a lot of glossy materials (especially all the wooden objects), of which very few are specular. Note the caustics caused by the shelves on the left, and the completely indirect lighting in the hallway on the right.

6.1. Combination Throughput

Table 1 gives the raw throughputs of visibility and shading values we obtain on CPU and GPU depending on the scene, and the speedup brought by our system. All the measures take all the memory transfers into account. As expected, only the visibility throughputs decrease with the scene's size.


ring, 7.4K triangles (2.5, 3.4, 463K)           comp lights, 758K triangles (3.6, 3.2, 570K)
comp monitors, 758K triangles (3.6, 3.6, 620K)  living, 400K triangles (3.7, 3.4, 620K)

Figure 5: The three scenes used to test CBPT. We indicate in parentheses the average length in segments of the sampled camera and light paths, as well as the average number of linking edges for each couple of populations in CBPT. Note that the average path lengths for BPT and CBPT are equal, as they use the same code. All the images have been rendered with CBPT. No post-process has been performed except tone-mapping, as our engine produces HDR images. The top-left image has been rendered at a resolution of 1600 × 1200 pixels in 1 hour. The three others have been rendered at a resolution of 1600 × 900 pixels, in 4 hours. As CBPT is based on standard Monte-Carlo methods, images at a resolution of 800 × 450 for the last three scenes can be obtained with a similar quality in 1 hour.

The shading throughput on CPU is quite sensitive to the type of BSDFs (glossy or purely diffuse) that mostly compose the paths of a certain type, explaining the gap present for some scenes between the camera and light shading throughputs. This is most visible in comp lights because of the glass fixture surrounding the light sources. On the other hand, the GPU throughputs are much less affected by this. Despite the need to transfer the results back to the CPU, we achieve a 15-20× speedup on average compared to the CPU for shading only, consistently on all scenes.

The absolute timings in Table 2 give hints about the average time proportions needed by each element of the combination. These timings depend on the number of linking segments that have to be processed for each combination, which depends on the scene.

Figure 6 illustrates the impact of batch size on performance, for visibility and shading computations, on the ring scene. This allows us to evaluate the impact of different transfers/computation repartitions, and to find optimal batch sizes for the computer we use.

For the visibility computations, even on this geometrically very simple scene, the transfers are not a limiting factor, as the visibility results are packed in a very compact form. Therefore, using batches does not make any noticeable difference in performance as soon as the batches are large enough. Consequently, the major advantage brought by batches for visibility resides in the control we get over the GPU memory-size requirements, without much impact on performance.

For more memory-consuming results such as shading ones, the batch size has a large impact on performance, with the additional benefit of using less memory on the GPU. As a matter of fact, using asynchronism brings a 1.75× speedup, going from 160 million to 252 million computations per second when transfers are done in parallel. Note that the optimal batch sizes are in practice only machine-dependent, as the efficiency of shading computations does not depend on the scene, and the efficiency of visibility computations is almost constant for any batch size larger than very small values.



         ring    comp     comp       living
                 lights   monitors
vis      10.9    22.5     24.4       19.3
camera   1.7     2.0      2.2        2.5
light    1.7     2.0      2.2        2.3

Table 2: Average time needed to complete each of the three steps (visibility, camera shading, light shading) on the GPU, for each scene, in milliseconds.


Figure 6: Top: Visibility throughput, in millions of tests per second, as a function of the number of visibility tests to perform in each batch. Bottom: Shading throughput, in millions of shading tuple computations per second, as a function of the number of shading computations to perform in each batch.

6.2. CBPT

To quantify the efficiency of CBPT, we count the number of $f_{i,j}(x,y)$ computations performed during a complete CBPT step, and divide it by the time needed to complete the whole step, including population sampling and splatting. We call this efficiency measure the basic contributions throughput. It gives meaningful and consistent results whatever the average path length is in each scene.

Computational efficiency: Table 3 gives the basic contributions throughputs obtained using CBPT, and the speedups compared to standard BPT. We compute these values when using CPU-based light-tracing (in this case $N_T = 1500$), to get actual performance, and when not using it ($N_T = 0$), to get the bidirectional-only basic contributions throughput. The CPU version of BPT uses the same code to sample paths, and the same code to compute the $f_{i,j}(x,y)$ values, except that all shading and visibility values are computed on the CPU. Both CBPT and standard BPT uniformly sample the image, and do not use any adaptive sampling scheme.

The impact of light-tracing on throughputs is noticeable (around 20%), but the visual impact of a high-variance light-tracing part is much more noticeable than the gain in the bidirectional part when setting $N_T$ to a very small value, particularly for very short rendering times. For longer rendering times and scenes where caustics are easily captured by light-tracing, $N_T$ can be set to a smaller value, as the light-tracing part will visually converge faster than the bidirectional part.

                 CBPT                              BPT
                 N_T = 0          N_T = 1500
ring             20.9 (17.4×)     15.7 (13.1×)     1.2
comp lights      16.2 (16.9×)     12.7 (13.2×)     0.96
comp monitors    16.3 (14.8×)     13.1 (11.9×)     1.1
living           16.5 (21.7×)     12.5 (16.4×)     0.76

Table 3: Basic contributions throughput for CBPT and standard BPT, in millions of $f_{i,j}(x,y)$ values computed per second, and speedup in parentheses.

As shown by the timings in Algorithm 2, our reformulation allows us to keep both the CPU and GPU fully loaded, the GPU computation time being masked by the CPU one. The speedup we obtain with "production settings" is consistently greater than or equal to 12× on our test scenes. Even if our samples are correlated, the correlation is spread over the whole image by our image-sampling process. This effectively avoids the appearance of any noticeable correlation pattern.

Visual comparison with standard BPT: Noise reduction is easier to observe visually on non-converged images, where improvements are clearly visible. Figure 7 presents the images obtained by CBPT and BPT after a few seconds of rendering, and after at least 4 samples per pixel have been computed by CBPT. As images were stored every 10 seconds, it can happen that more than 4 samples per pixel were actually computed, but both BPT and CBPT got the same computation time. The places where the improvements are most visible are the diffuse walls, where light-space exploration is crucial to get low-variance results, and the glossy reflections. Table 4 gives the actual average number of samples per pixel for the bidirectional part of each image. As expected, the speedups obtained are similar to those obtained for the basic contributions throughputs, the small difference coming from the splatting, as BPT needs to splat many more values than CBPT for the same number of pairs of paths. The main information of this table is that the images presented in Figure 5 would have required from 50 to 66 hours to be computed using standard BPT, versus 4 hours with CBPT.

Memory usage and scalability: Table 5 gives the memory usage of CBPT, both on CPU and GPU. As expected, the size of the combination data on CPU and the memory size of the populations on GPU are related to the average path length. For populations, we use a conservative allocation scheme, reuse memory between populations, and refit memory zones regularly to keep the consumption low. This can lead to a substantial overestimation of the actual memory size needed, but drastically reduces the number of memory allocations, therefore providing a slight speedup. Despite this, memory requirements remain low for all our scenes on CPU (between 100 and 200 MB), and very low on GPU (less than 100 MB). Table 5 also shows that our method can handle scenes much larger than the ones we used.


[Figure 7 panels, one row per scene, each showing BPT and CBPT with close-ups: ring at preview (10 s) and ≈4 spp (40 s); comp lights and comp monitors at preview (30 s) and ≈4 spp (50 s); living at preview (20 s) and ≈4 spp (40 s).]

Figure 7: Results obtained by BPT and CBPT on our test scenes, after approximately 10 seconds of actual computations, and after CBPT has computed approximately 4 samples per pixel. Images are rendered at 800 × 450, except ring, which is rendered at 800 × 600. Note that for all the scenes, mipmaps are lazily built when first accessed, explaining the 30 and 20 seconds of total rendering times for the preview configuration of the comps and living scenes. The time spent building these mipmaps is negligible for the ring scene, but takes 16 and 8 seconds in the comps and living scenes respectively; mipmaps are generally built when sampling the first paths. This also shows that our system can be seamlessly used together with all the usual ways of reducing the peak memory usage, as it does not impact the rendering engine architecture.


                 CBPT                    BPT                     (x,y) pairs
                 prev.    ≈4 spp         prev.    ≈4 spp         ratio
ring             1.64     5.15           1.60     5.41           14.3×
comp lights      1.05     3.99           1.38     4.44           13.5×
comp monitors    1.36     4.12           1.61     4.77           12.9×
living           1.62     4.23           1.53     3.86           16.4×

Table 4: Overall speedup measurement: Average number of samples computed per pixel for the bidirectional part of the images of Figure 7. This is equivalent to the average number of camera paths that have contributed to each pixel. The last column gives the ratio between CBPT and BPT of the number of pairs of paths contributing to the bidirectional part of each pixel, which is a good measure of the actual speedup brought by CBPT over standard BPT. For standard BPT, each camera path is combined with one light path, therefore the number of pairs of paths per pixel is equal to the number of camera paths. For CBPT, as each camera path is combined with $N_L$ light paths, the number of pairs is $N_L$ times the number of camera paths per pixel. In our tests, we use $N_L = 15$.

                 CPU               GPU
                 pops.    comb.    kd-tree   pops.   comb.
ring             73.3     48.0     0.47      3.8     23.5
comp lights      91.2     66.0     56.1      4.8     23.5
comp monitors    95.0     72.3     56.1      4.8     23.5
living           60.8     60.3     58.5      5.0     23.5

Table 5: Memory usage for populations and combination data on CPU, and memory usage for the kd-tree, the populations data (positions, BSDF parameters, etc.), and all the batch buffers on GPU, in MB.

Indeed, the scenes' kd-tree sizes remain relatively low even for quite complex scenes (about 50 MB). Therefore, scenes that contain several million polygons fit in GPU memory. Moreover, the memory size of populations is negligible except in pathological cases, as even with participating media, the paths remain short (10-20 vertices on average).

7. Conclusion

Bidirectional path-tracing is an unbiased and highly robust rendering algorithm, but it is not well suited for GPU implementation, as it requires a lot of branching. By exhaustively combining populations of paths instead of single paths, we were able to divide the algorithm into two parts, each one well suited for either the CPU or the GPU. We keep the CPU, the GPU, and the memory bus between them busy simultaneously by interleaving the steps of CBPT. The GPU part is made efficient by high-level optimization techniques such as double buffering and asynchronism.

We have shown that CBPT is more than an order of magnitude faster than standard BPT on various test scenes, without affecting the size of the datasets or the flexibility of the underlying rendering engine in terms of shaders and models of lights and cameras. This makes CBPT very well suited for accelerating image computation in production-oriented engines.

References

[AS00] Ashikhmin M., Shirley P.: An anisotropic Phong light reflection model. Journal of Graphics Tools 5 (2000), 25–32.

[CTE05] Cline D., Talbot J., Egbert P.: Energy redistribution path-tracing. In SIGGRAPH '05 (2005), pp. 1186–1195.

[Faj10] Fajardo M.: Ray tracing solution in film production rendering. http://www.graphics.cornell.edu/~jaroslav/gicourse2010/, 2010. SIGGRAPH 2010 Course on global illumination in production rendering.

[HJ09] Hachisuka T., Jensen H.: Stochastic progressive photon mapping. In SIGGRAPH Asia '09: ACM SIGGRAPH Asia 2009 Papers (2009), ACM, pp. 1–8.

[HOJ08] Hachisuka T., Ogaki S., Jensen H.: Progressive photon mapping. ACM Trans. Graph. 27, 5 (2008), 1–8.

[KFC∗10] Křivánek J., Fajardo M., Christensen P. H., Tabellion E., Bunnell M., Larsson D., Kaplanyan A.: Global illumination across industries. http://www.graphics.cornell.edu/~jaroslav/gicourse2010/, 2010. SIGGRAPH 2010 Course on global illumination in production rendering.

[KH01] Keller A., Heidrich W.: Interleaved sampling. In EGWR '01 (2001), pp. 269–276.

[KSKAC02] Kelemen C., Szirmay-Kalos L., Antal G., Csonka F.: A simple and robust mutation strategy for the Metropolis light transport algorithm. In Eurographics '02 (2002), pp. 531–540.

[LFCD07] Lai Y.-C., Fan S., Chenney S., Dyer C.: Photorealistic image rendering with population Monte Carlo energy redistribution. In EGSR '07 (2007), pp. 287–296.

[Lux10] LuxRender: LuxRays. http://www.luxrender.net/wiki/index.php?title=LuxRays, 2010.

[LW93] Lafortune E. P., Willems Y. D.: Bi-directional path tracing. In Compugraphics '93 (1993), pp. 145–153.

[OMP10] OMPF: Hybrid bidirectional path-tracer development thread. http://ompf.org/forum/viewtopic.php?f=6&t=1834, 2010.

[Seg08] Segovia B.: Radius-CUDA raytracing kernel. http://bouliiii.blogspot.com/2008/08/real-time-ray-tracing-with-cuda-100.html, 2008.

[VG94] Veach E., Guibas L. J.: Bidirectional estimators for light transport. In EGWR '94 (1994), pp. 147–162.

[VG95] Veach E., Guibas L. J.: Optimally combining sampling techniques for Monte Carlo rendering. In SIGGRAPH '95 (1995), pp. 419–428.

[VG97] Veach E., Guibas L. J.: Metropolis light transport. In SIGGRAPH '97 (1997), pp. 65–76.

[WMLT07] Walter B., Marschner S. R., Li H., Torrance K. E.: Microfacet models for refraction through rough surfaces. In EGSR '07 (2007), pp. 195–206.

© 2010 The Author(s). Journal compilation © 2010 The Eurographics Association and Blackwell Publishing Ltd. Published by Blackwell Publishing, 9600 Garsington Road, Oxford OX4 2DQ, UK and 350 Main Street, Malden, MA 02148, USA.