A Detailed Study of Ray Tracing Performance: Render Time
and Energy Cost
Elena Vasiou ·Konstantin Shkurko ·Ian Mallett ·Erik Brunvand ·
Cem Yuksel
Received: date / Accepted: date
Abstract Optimizations for ray tracing have typically
focused on decreasing the time taken to render each
frame. However, in modern computer systems it may
actually be more important to minimize the energy
used, or some combination of energy and render time.
Understanding the time and energy costs per ray can
enable the user to make conscious trade-offs between
image quality and time/energy budget in a complete
system. To facilitate this, in this paper we present a
detailed study of per-ray time and energy costs for ray
tracing. Specifically, we use path tracing, broken down
into distinct kernels, to carry out an extensive study
of the fine-grained contributions in time and energy
for each ray over multiple bounces. As expected, we
have observed that both the time and energy costs are
highly correlated with data movement. Especially in
large scenes that mostly do not fit in on-chip caches,
accesses to DRAM not only account for the majority of
the energy use but also the corresponding stalls domi-
nate the render time.
Keywords Ray Tracing ·Energy Efficiency ·Graphics
Processors ·Memory Timing
1 Introduction
Ray tracing [40] algorithms have evolved to be the most
popular way of rendering photorealistic images. In par-
ticular, path tracing [19] is widely used in production
today. Yet despite their widespread use, ray tracing al-
gorithms remain expensive in terms of both computa-
tion time and energy consumption. New trends arising
University of Utah
50 Central Campus Dr, Salt Lake City, UT, 84112
E-mail: {elvasiou, kshkurko, imallett, elb},
from the need to minimize production costs in indus-
tries relying heavily on computer generated imagery,
as well as the recent expansion of mobile architectures,
where application energy budgets are limited, increase
the importance of studying the energy demands of ray
tracing in addition to the render time.
A large body of work optimizes the computation
cost of ray tracing by minimizing the number of instruc-
tions needed for ray traversal and intersection opera-
tions. However, on modern architectures the time and
energy costs are highly correlated with data movement.
High parallelism and the behavior of deep memory hi-
erarchies, prevalent in modern architectures, make fur-
ther optimizations non-trivial. Although rays contribute
independently to the final image, the performance of the
associated data movement is highly dependent on the
overall state of the memory subsystem. As such, to mea-
sure and understand performance, one cannot merely
rely on the number of instructions to be executed, but
must also consider the data movement throughout the
entire rendering process.
In this paper, we aim to provide a detailed exam-
ination of time and energy costs for path tracing. We
split the ray tracing algorithm into discrete computa-
tional kernels and measure their performance by track-
ing their time and energy costs while rendering a frame
to completion. We investigate what affects and limits
kernel performance for primary, secondary, and shadow
rays. Our investigation explores the variation of time
and energy costs per ray at all bounces in a path. Time
and energy breakdowns are examined for both individ-
ual kernels and the entire rendering process.
To extract detailed measurements of time and en-
ergy usage for different kernels and ray types, we use
a cycle-accurate hardware simulator designed to simu-
late highly parallel architectures. Specifically, we profile
TRaX [35,36], a custom architecture designed to accel-
erate ray tracing by combining the parallel computa-
tional power of contemporary GPUs with the execu-
tion flexibility of CPUs. Therefore, our study does not
directly explore ray tracing performance on hardware
that is either designed for general-purpose computation
(CPUs) or rasterization (GPUs).
Our experiments show that data movement is the
main consumer of time and energy. As rays are traced
deeper into the acceleration structure, more of the scene
is accessed and must be loaded. This leads to exten-
sive use of the memory subsystem and DRAM, which
dramatically increases the energy consumption of the
whole system. Shadow ray traversal displays behavior
similar to regular ray traversal, although it is con-
siderably less expensive because it implements the any-hit
traversal optimization (as opposed to first-hit).
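The cost gap between the two traversal modes can be illustrated with a minimal sketch (our illustration, not the paper's TRaX code). It models candidate intersections along a ray as a flat list of hit distances; a real implementation would walk a BVH instead, but the early-out structure is the same.

```python
# Hedged sketch: why any-hit (shadow) traversal is cheaper than
# closest-hit traversal. `hits` is a hypothetical list of
# intersection distances (None = miss) along one ray.

def closest_hit(hits):
    """First-hit traversal: must examine every candidate to find
    the nearest intersection distance (or None if all miss)."""
    best = None
    for t in hits:
        if t is not None and (best is None or t < best):
            best = t
    return best

def any_hit(hits, t_max):
    """Any-hit traversal (shadow rays): may stop at the first
    occluder closer than the light, touching less scene data."""
    for t in hits:
        if t is not None and t < t_max:
            return True  # occluded -- early out
    return False
```

Because `any_hit` can return as soon as any occluder is found, it visits fewer nodes and primitives on average, which is consistent with the lower time and energy the paper reports for shadow rays.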
In all cases, we observe that the increase in per-ray,
per-bounce energy is incremental after the first
few bounces, suggesting that longer paths can be ex-
plored at a reduced proportional cost. We also examine
the composition of latency per frame, identifying how
much of the render time is spent on useful work ver-
sus stalling due to resource conflicts. Again, the mem-
ory system dominates the cost. Although compute time
can often be improved through increases in available
resources, the memory system, even when highly provi-
sioned, may not be able to service all necessary requests
without stalling.
2 Background
Some previous work focuses on understanding and im-
proving the energy footprint of rendering on GPUs on
both algorithmic and hardware levels. Yet, very little
has been published on directly measuring the energy
consumption and latency patterns of ray tracing and
subsequently studying the implications of ray costs. In
this section, we briefly discuss the related prior work
and the TRaX architecture we use for our experiments.
2.1 Related Work
Ray tracing performance is traditionally measured as a
function of time to render a single frame. With a known
upper bound on theoretical performance [2], general
optimizations have been proposed to various stages of
the algorithm [34] to improve performance and reduce
memory traffic and data transport [5, 15]. These ap-
proaches are motivated by known behavior, with band-
width usage identified as the major bottleneck in tradi-
tional ray tracing [28,29], leading to suggested changes
in ray and geometry scheduling. Although they address
energy costs of ray tracing at a high level, none of those
explorations examine how individual rays can affect
performance, energy, and image quality, nor do they
systematically analyze the performance of ray tracing
as a whole. We provide a more quantifiable unit of mea-
sure for the underlying behavior by identifying the costs
of rays as they relate to the entire frame generation.
Aila et al. [2] evaluate the energy consumption of
ray tracing on a GPU with different forms of traversal.
Although the work distribution of ray traversal is iden-
tified as the major inefficiency, the analysis only goes so
far as to suggest which traversal method is the quickest.
Some work reduces energy consumption by mini-
mizing the amount of data transferred from memory
to compute units [3,11,31]. Others attempt to reduce
memory accesses by improving ray coherence and data
management [9, 22, 26]. More detailed studies on general
rendering algorithms pinpoint power efficiency improve-
ments [18,32], but unfortunately do not focus on ray
tracing. Wang et al. [39] use a cost model to mini-
mize power usage, while maintaining visual quality of
the output image by varying rendering options in real-
time frameworks. Similarly, Johnsson et al. [17] directly
measure the per-frame energy of graphics applications
on a smartphone. However, both methods focus on rasterization.
There is a pool of work investigating architecture ex-
ploitation with much prior work addressing DRAM and
its implications for graphics applications [8,38] with
some particularly focusing on bandwidth [12, 24, 25]. Some
proposed architectures also fall into a category of hard-
ware which aims to reduce overall ray tracing energy
cost by implementing packet-based approaches to in-
crease cache hits [7,30] or by reordering work in a buffer [23].
Streaming architectures [14,37] and hardware that uses
treelets to manage scene traffic [1,21,33] are also effec-
tive in reducing energy demands.
2.2 TRaX Architecture
In our experiments, we use a hardware simulator to ex-
tract detailed information about time and energy consumption
during rendering. We perform our experiments
by simulating rendering on the TRaX architecture [35,
36]. TRaX is a dedicated ray tracing hardware architec-
ture based on a single program multiple data (SPMD)
programming paradigm, as opposed to single instruc-
tion multiple data (SIMD) approach used by current
GPUs. Unlike other ray tracing specific architectures,
TRaX’s design is more general and programmable. Al-
though it possesses similarities to modern GPU architectures,
it is not burdened by the GPU's data processing assumptions.
Fig. 1 Overall TRaX Thread Multiprocessor (TM) and
multi-TM chip organization, from [36]: (a) TM architecture
with 32 lightweight cores and shared cache and compute
resources; (b) potential TRaX chip organization with multiple
TMs sharing L2 caches. Abbreviations: I$ - instruction cache,
D$ - data cache, and XU - execution unit.
Specifically, TRaX consists of Thread Multiproces-
sors (TMs), each of which has a number of Thread Pro-
cessors (TPs), as shown in Fig. 1. Each TP contains
some functional units, a small register file, scratchpad
memory, and a program counter. All TPs within a TM
share access to units which are expensive in terms of
area, like the L1 data cache and floating-point compute
units. Several chip-wide L2 caches are each shared by a
collection of TMs, and are then connected to the main
memory via the memory controller.
3 Experimental Methodology
We run our experiments by simulating path tracing on
the TRaX architecture. TRaX and its simulator are
highly flexible systems, which enable testing modern
architecture configurations. We have also considered
other hardware simulators and decided against using
them for various reasons. GPGPUSim [4] allows sim-
ulating GPUs, but only supports dated architectures
and so would not provide an accurate representation of
path tracing on modern hardware. Moreover, we need a
system that is fast enough to run path tracing to com-
pletion, unlike other architecture simulators, which can
feasibly simulate only a few million cycles. Ad-
ditionally, the profiling capabilities must separate parts
of the renderer and generate detailed usage statistics for
the memory system and compute, which is not easily
attainable on regular CPUs. Although a comprehen-
sive and configurable simulator for CPU architectures
exists [6], it is far too detailed and thus expensive to
run for the purposes of this study.
As with any application, hardware dependency makes
a difference in the performance evaluation. Therefore,
Table 1 Hardware configuration for the TRaX processor.
TM Configuration (units, latency in cycles):
– TPs / TM: 32
– Int Multiply: 2 units, 1 cycle
– FP Add: 8 units, 2 cycles
– FP Multiply: 8 units, 2 cycles
– FP Inv Sqrt: [entry garbled in extraction]
– Cache banks: 16 banks, 8 banks [attribution garbled in extraction]
Chip Configuration:
– Technology Node: 65nm CMOS
– Clock Frequency: 1GHz
– TMs: 32 (1024 total threads)
– L2 Cache: 16 banks
– DRAM: 8 channels
Fig. 2 Scenes used for all performance tests, arranged by
their size in number of triangles: Crytek Sponza (262K
triangles), Dragon Box (870K triangles), Hairball (2.9M
triangles), and San Miguel (10.5M triangles).
we also run our experiments on a physical CPU, though
the experiments on the CPU provide limited informa-
tion, since we cannot gather statistics as detailed as
those available from a cycle-accurate simulator. Yet, we
can still compare the results of these tests to the simu-
lated results and evaluate the generality of our conclusions.
We augment the cycle-accurate simulator for TRaX [16]
to profile each ray tracing kernel using high-fidelity
statistics gathered at the instruction level. Each in-
struction tracks its execution time, stalls, and energy
usage within hardware components, including functional
units and the memory hierarchy. Additionally, the sim-
ulator relies on USIMM for high-fidelity DRAM simulation [10],
enabling highly accurate measurements of
main memory performance.
For our study, the TRaX processor comprises 32
TMs with 32 TPs each for a total of 1024 effective
threads, all running at 1GHz. This configuration re-
sembles the performance and area of a modern GPU.
Table 1 shows the energy and latency details for the
hardware components. We use Cacti 6.5 [27] to esti-
mate the areas of on-chip caches and SRAM buffers.
The areas and latencies for compute units are esti-
mated using circuits synthesized with Synopsys Design-
Ware/Design Compiler at a 65nm process. The TMs
share four 512KB L2 caches with 16 banks each. DRAM
is set up to use 8-channel GDDR5 quad-pumped at
twice the processor clock (8GHz effective) reaching a
peak bandwidth of 512GB/s.
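The quoted 512GB/s peak can be reproduced with a short calculation, assuming a 64-bit (8-byte) data path per channel; the channel width is our assumption, since the text states only the channel count and clocking.

```python
# Reproducing the stated DRAM peak bandwidth from the configuration
# above. bytes_per_channel = 8 (64-bit channels) is an assumption.

core_clock_ghz = 1.0                    # processor clock
bus_clock_ghz = 2 * core_clock_ghz      # "twice the processor clock"
transfers_per_cycle = 4                 # quad-pumped GDDR5
channels = 8
bytes_per_channel = 8                   # assumed 64-bit channel width

peak_gb_s = bus_clock_ghz * transfers_per_cycle * channels * bytes_per_channel
print(peak_gb_s)  # -> 512.0
```

The 8GHz "effective" rate in the text corresponds to `bus_clock_ghz * transfers_per_cycle` transfers per second per pin.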
Fig. 3 Distribution of time spent between memory and compute for a single frame of the Crytek Sponza (left) and San Miguel
(right) scenes rendered with different maximum ray bounces.
We run our experiments on four scenes with differ-
ent geometric complexities (Fig. 2) to expose the effects
of different computational requirements and stresses
to the memory hierarchy. Each scene is rendered at
1024×1024 image resolution, with up to 9 ray bounces.
Our investigation focuses on performance related
to ray traversal; the underlying acceleration structure
is a Bounding Volume Hierarchy with optimized first
child traversal [20]. We use simple Lambertian shaders
for all surfaces and a single point light to light each scene.
Individual pixels are rendered in parallel, where each
TP independently traces a separate sample to completion;
therefore, different TPs can trace rays at different depths.
We track detailed, instruction-level statistics for each
distinct ray tracing kernel (ray generation, traversal,
and shading) for each ray bounce and type (primary,
secondary, and shadow). We derive energy and latency
averages per ray using this data.
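The bookkeeping described above can be sketched as follows. This is our illustration of the measurement structure, not the actual simulator code: statistics are keyed by (kernel, bounce, ray type) and averaged per ray afterwards.

```python
from collections import defaultdict

# Hedged sketch of per-kernel, per-bounce, per-ray-type accounting.
# Field names and units (cycles, nanojoules) are our choices.

stats = defaultdict(lambda: {"cycles": 0, "energy_nj": 0.0, "rays": 0})

def record(kernel, bounce, ray_type, cycles, energy_nj):
    """Accumulate one ray's cost for a given kernel invocation."""
    s = stats[(kernel, bounce, ray_type)]
    s["cycles"] += cycles
    s["energy_nj"] += energy_nj
    s["rays"] += 1

def per_ray_average(kernel, bounce, ray_type):
    """Average latency and energy per ray for one bucket."""
    s = stats[(kernel, bounce, ray_type)]
    return s["cycles"] / s["rays"], s["energy_nj"] / s["rays"]

# Example: two primary rays through the Trace kernel at bounce 0.
record("Trace", 0, "primary", 100, 5.0)
record("Trace", 0, "primary", 200, 7.0)
print(per_ray_average("Trace", 0, "primary"))  # -> (150.0, 6.0)
```

Aggregating over bounces and ray types then yields the per-ray averages plotted in the figures that follow.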
We run our CPU tests on an Intel Core i7-5960X
processor with 20 MB L3 cache and 8 cores (16 threads)
with the same implementation of path tracing used by
TRaX. Only the final rendering times are available for
these experiments.
4 Experimental Results
Our experimental results are derived from 50 simula-
tions across four scenes with maximum ray bounces
varying between 0 (no bounce) and 9. Depending on
the complexity, each simulation can require from a few
hours to a few days to complete. In this section we
present some of our experimental results and the conclusions
we draw based on them. The full set of experimental
results is included in the supplementary material.
4.1 Render Time
We first consider the time to render a frame at different
maximum ray bounces and track how the render time
is spent. In particular, we track the average time a TP
spends on the following events:
– Compute Execution: the time spent executing instructions,
– Compute Data Stall: stalls from waiting for the
results of previous instructions,
– Memory Data Stall: stalls from waiting for data
from the memory hierarchy, including all caches and
DRAM, and
– Other: all other stalls caused by contentions on execution
units and local store operations.
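The four-way classification above can be sketched as a simple tally over simulated TP cycles (our illustration; the cycle tags are hypothetical, not the simulator's actual event names).

```python
# Hedged sketch: classify each simulated TP cycle into the four
# buckets listed above and report the fractional breakdown --
# the kind of data behind the render-time distribution figures.

CATEGORIES = ("execute", "compute_data_stall", "memory_data_stall", "other")

def time_breakdown(cycle_tags):
    """Return the fraction of cycles spent in each category."""
    counts = {c: 0 for c in CATEGORIES}
    for tag in cycle_tags:
        counts[tag] += 1
    total = len(cycle_tags)
    return {c: counts[c] / total for c in CATEGORIES}

# Example: 10 cycles of a hypothetical TP trace.
tags = (["execute"] * 5 + ["memory_data_stall"] * 3
        + ["compute_data_stall"] + ["other"])
print(time_breakdown(tags))  # execute 0.5, memory stalls 0.3, ...
```

Summing the fractions over all TPs and normalizing by total cycles gives the per-scene distributions discussed next.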
Fig. 3 shows the distribution of time used to render
the Crytek Sponza and San Miguel scenes. In Crytek
Sponza, the majority of the time is spent on computa-
tion without much memory data stalling. As the maxi-
mum number of ray bounces increases, the time for all
components grows approximately proportionally, since
the number of rays (and thus computational require-
ments) increases linearly with each bounce. This is not
surprising, since Crytek Sponza is a relatively small
scene and most of it can fit in the L2 cache, thereby
requiring relatively fewer accesses to DRAM. Once all
scene data is read into the L2 cache, the majority of
memory data stalls are caused by L1 cache misses.
Fig. 4 Classification of time per kernel normalized by the
number of rays. Crytek Sponza (left) and San Miguel (right)
scenes rendered with maximum of 9 ray bounces. Contributions
from the Generate and Shade kernels are not visible, because
they are negligible compared to others.
On the other hand, in the San Miguel scene, compute
execution makes up the majority of the render
time only for primary rays. When we have one or more
ray bounces, memory data stalls quickly become the
main consumer of render time, consistently taking up
approximately 65% of the total time. Even though the
instructions needed to handle secondary rays are com-
parable to those for primary rays, the L1 cache
hit rate drops from approximately 80% for primary rays
to 60% for rays with two or more bounces. As a
result, more memory requests escalate up the memory
hierarchy to DRAM, putting yet more pressure on the
memory banks. Besides adding latency, cache misses
also incur higher energy costs.
4.2 Time per Kernel
We can consider the average time spent per ray by the
following individual kernels at different ray bounces:
Generate: ray generation kernel,
Trace: ray traversal kernel for non-shadow rays, in-
cluding the acceleration structure and triangle in-
Trace Shadow: shadow ray traversal kernel, and
Shade: shading kernel.
Fig. 4 shows the average computation time per ray
for each bounce of path tracing up to 9 bounces. The
time consumed by the ray generation and shading ker-
nels is negligible. This is not surprising, since ray gener-
ation does not require accessing the scene data and the
Lambertian shader we use for all surfaces does not use
textures. Even though these two kernels are compute in-
tensive, the tested hardware is not compute limited, and
thus the execution units take a smaller portion of the
total frame rendering time. Traversing regular rays (the
Trace kernel) takes up most of the time and traversing
shadow rays (the Trace Shadow kernel) is about 20%
faster for all bounces.
4.3 Ray Traversal Kernel Time
Within the ray traversal (Trace) kernel, a large portion
of time is spent stalling while waiting for the memory
system: either for data to be fetched, or on bank conflicts
which limit access requests to the memory. Fig. 5
shows the breakdown of time spent for execution and
stalls within the Trace kernel for handling rays at different
bounces within the same rendering process up
to 9 bounces. Memory access stalls, which indicate the
time required for data to be fetched into registers, take
a substantial percentage of time even for the first few
bounces. The percentage of memory stalls is higher for
larger scenes, but they amount to a sizable percentage
even for a relatively small scene like Crytek Sponza. In-
terestingly, the percentage of memory stalls beyond the
second bounce remains almost constant. This is because
rays access the scene less coherently, thereby thrashing
the caches.
This is a significant observation, since the simulated
memory system is highly provisioned both in terms of
the number of banks and total storage size. This sug-
gests that further performance improvements gained
will be marginal if only simple increases in resources
are made. Thus we foresee the need to require modi-
fications in how the memory system is structured and
4.4 DRAM Bandwidth
Fig. 5 Classification of time per ray spent between memory
and compute for the Trace kernel. Crytek Sponza (left) and
San Miguel (right) rendered with maximum of 9 ray bounces.
Fig. 6 DRAM bandwidth used to render each scene to different
maximum ray bounces.
Another interesting observation is the DRAM bandwidth
behavior. Fig. 6 shows the DRAM bandwidth for
all four scenes in our tests using different maximum ray
bounces. Notice that the DRAM bandwidth varies sig-
nificantly between different scenes for images rendered
using a small number of maximum bounces. In our tests,
our smallest scene, Crytek Sponza, and largest scene,
San Miguel, use a relatively small portion of the DRAM
bandwidth for different reasons. Crytek Sponza uses
less DRAM bandwidth, simply because it is a small
scene. San Miguel, however, uses lower DRAM band-
width because of the coherence of the first few bounces
and the fact that it takes longer to render. The other
two scenes, Hairball and Dragon Box, use a relatively
larger portion of the DRAM bandwidth for renders up
to a few bounces.
Beyond a few bounces, however, the DRAM band-
width utilization of these four scenes tends to align with
the scene sizes. Small scenes that render quickly end up
using larger bandwidth and larger scenes that require a
longer time use a smaller portion of the DRAM band-
width by spreading the memory requests over time. Yet,
all scenes appear to converge towards a similar DRAM
bandwidth utilization.
4.5 Energy Use
The energy used to render the entire frame can be sepa-
rated into seven distinct sources: compute, register file,
local store, instruction cache, L1 data cache, L2 data
cache, and DRAM. Overall, performing a floating-point
arithmetic operation is both faster and three orders
of magnitude less energy-expensive than fetching an
operand from DRAM [13].
Fig. 7 shows the total energy spent to render a frame
of the Crytek Sponza and San Miguel scenes. In Crytek
Sponza, a small scene which mostly fits within on-chip
data caches, memory accesses still dominate the energy
contributions at 80% overall, including 60% for DRAM
alone, at 9 bounces. Compute, on the other hand, re-
quires only about 1-2% of the total energy.
Interestingly, a larger scene like San Miguel follows a
similar behavior: the entire memory subsystem requires
95% and DRAM requires 80% of the total energy per
frame at the maximum of 9 bounces. The monotonic
increase in the total frame energy at higher maximum
ray bounces can be attributed to the increase in the
total number of rays in the system.
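An energy breakdown of this kind can be modeled as event counts times per-event energies for the seven sources named above. The sketch below is our illustration with placeholder per-access energies (not the paper's calibrated values); it only demonstrates how a large per-access DRAM cost dominates the frame total even when DRAM accesses are comparatively rare.

```python
# Hedged sketch: frame energy as sum(count * per-event energy).
# The nanojoule figures below are illustrative placeholders chosen
# to reflect the ordering (DRAM >> caches >> compute), not
# measured values from the paper.

PER_EVENT_NJ = {
    "compute": 0.02, "register_file": 0.005, "local_store": 0.01,
    "icache": 0.01, "l1": 0.05, "l2": 0.5, "dram": 20.0,
}

def frame_energy(event_counts):
    """Return total energy (nJ) and each source's share of it."""
    total = sum(PER_EVENT_NJ[k] * n for k, n in event_counts.items())
    shares = {k: PER_EVENT_NJ[k] * n / total
              for k, n in event_counts.items()}
    return total, shares

# Example: hypothetical event counts for one frame.
counts = {"compute": 1_000_000, "register_file": 2_000_000,
          "local_store": 500_000, "icache": 1_000_000,
          "l1": 800_000, "l2": 200_000, "dram": 50_000}
total_nj, shares = frame_energy(counts)
```

With these placeholder numbers, DRAM accounts for the large majority of the total despite being only a small fraction of the events, mirroring the qualitative trend in Fig. 7.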
4.6 Energy Use per Kernel
We can consider energy per ray used by individual ker-
nels at different ray bounces by investigating the aver-
age energy spent to execute the assigned kernels.
Fig. 7 Classification of energy contributions by source for a single frame of the Crytek Sponza (left) and San Miguel (right)
scenes rendered with different maximum ray bounces.
Fig. 8 Energy classification per kernel normalized by the number of rays. Crytek Sponza (left) and San Miguel (right) scenes
rendered with maximum of 9 ray bounces. Contributions from the Generate and Shade kernels are not visible, because they
are negligible compared to others.
Fig. 8 shows the average energy use in the Crytek
Sponza and San Miguel scenes. The ray generation ker-
nel has a very small contribution (at most 2%) because
it uses few instructions, mainly for floating point com-
putation operations. In our tests, shading also consumes
a small percentage of energy, simply because we use
simple Lambertian materials without textures. Other
material models, especially ones that use large textures,
could be substantially expensive from the energy per-
spective because of memory accesses. However, investi-
gating a broad range of shading methods is beyond the
scope of this work.
Focusing on the traversal kernels, Fig. 9 compares
costs to trace both shadow and non-shadow rays for all
scenes. Overall, because shadow rays implement any-hit
traversal optimization and consequently load less scene
data, their energy cost is 15% lower than regular rays
on average.
Fig. 9 Comparison of the energy contributions per ray for
the Trace kernels (non- and shadow rays) for all scenes.
4.7 Ray Traversal Kernel Energy
Fig. 10 Classification of energy contributions per ray by
source for the Trace kernel. The Crytek Sponza (left) and San
Miguel (right) scenes rendered with maximum of 9 ray bounces.
Fig. 9 also shows the total energy cost of the ray traversal
kernels at different ray bounces up to the maximum
of 9. Unsurprisingly, the larger scenes and those
with high depth complexity consume more energy as
ray bounces increase. The energy required by rays before
the first bounce is considerably lower than that of
secondary rays, since secondary rays are less
coherent than primary rays and scatter towards a larger
portion of the scene. This behavior translates into an
increase in both the randomness of memory accesses
and in the amount of data fetched. However, as the
rays bounce further, the cost per ray starts to level
off. This pattern is more obvious for smaller scenes like
Crytek Sponza. Although in the first few ray bounces
the path tracer thrashes the caches and the cache hit
rates drop, the hit rates become roughly constant for
additional bounces. Thus, the number of requests that
reach DRAM remains steady, resulting in the energy
used by the memory system to be fairly consistent for
ray bounces beyond three.
The sources of energy usage per ray for the traversal
kernels (Fig. 10) paint a picture similar to the one from
the overall energy per frame. The memory system is
responsible for 60-95% of the total energy, with DRAM
alone taking up to 80% for higher bounces in the San
Miguel scene.
4.8 Image Contributions per Ray Bounce
It is also important to understand how much each bounce
is contributing to the final image. This information can
be utilized to determine a desired performance/quality
balance. In particular, we perform tests with maximum
ray bounces of 9 and we consider the overall image in-
tensity contributions of all rays up to a certain number
of bounces (maximum of 9), along with contributions
per millisecond and contributions per Joule. As seen in
Fig. 11, the majority of contribution to the image hap-
pens in the first few bounces. After the fourth or fifth
bounce, the energy and latency costs to trace rays at
that bounce become significant compared to their min-
imal contribution to the final image. This behavior is
expected from the current analysis with scene complex-
ity playing a minor role to the overall trend.
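The metrics above can be formalized in a short sketch (our formulation of the Fig. 11 quantities, not the paper's code): given per-bounce image contribution, time, and energy, compute the cumulative contribution and the efficiency of each additional bounce.

```python
# Hedged sketch of the per-bounce metrics: cumulative contribution
# to the final image, contribution per millisecond, and
# contribution per Joule. Inputs are hypothetical per-bounce data.

def bounce_efficiency(contrib, time_ms, energy_j):
    """Return (cumulative share, contribution/ms, contribution/J)
    for each bounce, in bounce order."""
    total = sum(contrib)
    cumulative, acc = [], 0.0
    for c in contrib:
        acc += c
        cumulative.append(acc / total)
    per_ms = [c / t for c, t in zip(contrib, time_ms)]
    per_joule = [c / e for c, e in zip(contrib, energy_j)]
    return cumulative, per_ms, per_joule

# Example: most contribution arrives in the first two bounces,
# while each bounce costs roughly the same time and energy.
cum, per_ms, per_j = bounce_efficiency(
    contrib=[0.6, 0.25, 0.1, 0.05],
    time_ms=[10.0, 12.0, 12.0, 12.0],
    energy_j=[1.0, 1.2, 1.2, 1.2])
```

Under such data, contribution per millisecond and per Joule fall sharply after the first few bounces, which is the basis for the performance/quality trade-off discussed above.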
4.9 Comparisons to CPU Experiments
The findings so far are specific to the TRaX architec-
ture. To evaluate the generality of our findings, we also
compare our results to the same path tracing appli-
cation running on a CPU. For the four test scenes, we
observe similar relative behavior shown in Fig. 12. Even
though direct comparisons cannot be made, the behav-
ior is similar enough to suggest that performance would
be similar between the two architectures; therefore, the
results of this study could be applied on implementa-
tions running on currently available CPUs.
5 Discussion
Fig. 11 Cumulative distribution of the percentage contribution
to the final image using different metrics. All scenes
rendered with maximum of 9 ray bounces.
We observe that even for smaller scenes that can essentially
fit into cache, memory is still the highest contributor
to energy and latency, suggesting that even in the case
of a balanced compute workload, compute remains
inexpensive. Since data movement is the highest
contributor to energy use, scene compression is often
the suggested solution. However, compression schemes
mostly reduce, but do not eliminate, the memory bot-
tlenecks arising from data requests associated with ray
tracing. Our data suggests that render time and energy
cost improvements cannot be made by simply increas-
ing the available memory resources, which are already
constrained by the on-chip area availability.
This brings up an interesting opportunity to find
ways to design a new memory system that is optimized
for ray tracing that would facilitate both lower energy
and latency costs. For example, the recent dual stream-
ing approach [33] that reorders the ray tracing compu-
tations and the memory access pattern is likely to have
a somewhat different time and energy behavior. Explor-
ing different ways of reordering the ray tracing execu-
tion would be an interesting avenue for future research,
which can provide new algorithms and hardware archi-
tectures that can possibly separate from the trends we
observe in our experiments.
Fig. 12 Frame render times up to the maximum of 9
bounces. The CPU implementation uses 500 samples per pixel
(spp), while TRaX uses 1.
6 Conclusions and Future Work
We have presented a detailed study of render time and
energy costs of path tracing running on a custom hard-
ware designed for accelerating ray tracing. We have
identified the memory system as the main source of
both time and energy consumption. We have also ex-
amined how statistics gathered per frame translate into
contributions to the final image. Furthermore, we have
included an evaluation of the generality of our results
by comparing render times against the same applica-
tion running on the CPU. Given these observations, we
would like to consider more holistic performance opti-
mizations as a function of render time, energy cost and
the impact of rays on image quality.
An interesting future work direction would be a sen-
sitivity analysis by varying the hardware specifications,
such as the memory subsystem size. Also, a study tar-
geting more expensive shading models and texture con-
tributions could reveal how shading complexity could
impact ray traversal performance. In general, detailed
studies of ray tracing performance can provide much
needed insight that can be used to design a function
that optimizes both render time and energy under con-
strained budgets and a required visual fidelity.
Acknowledgements This material is supported in part by
the National Science Foundation under Grant No. 1409129.
Crytek Sponza is from Frank Meinl at Crytek and Marko
Dabrovic, Dragon is from the Stanford Computer Graphics
Laboratory, Hairball is from Samuli Laine, and San Miguel is
from Guillermo Leal Laguno.
References
1. Aila, T., Karras, T.: Architecture considerations for tracing
incoherent rays. In: Proc. HPG (2010)
2. Aila, T., Laine, S.: Understanding the efficiency of ray
traversal on GPUs. In: Proc. HPG (2009)
3. Arnau, J.M., Parcerisa, J.M., Xekalakis, P.: Eliminating
redundant fragment shader executions on a mobile GPU
via hardware memoization. In: Proc. ISCA (2014)
4. Bakhoda, A., Yuan, G.L., Fung, W.W.L., Wong, H.,
Aamodt, T.M.: Analyzing CUDA workloads using a de-
tailed GPU simulator. In: ISPASS (2009)
5. Barringer, R., Akenine-Möller, T.: Dynamic ray stream
traversal. ACM TOG 33(4), 151 (2014)
6. Binkert, N., Beckmann, B., Black, G., Reinhardt, S.K.,
Saidi, A., Basu, A., Hestness, J., Hower, D.R., Krishna,
T., Sardashti, S., et al.: The gem5 simulator. ACM
SIGARCH Comp Arch News 39(2), 1–7 (2011)
7. Boulos, S., Edwards, D., Lacewell, J.D., Kniss, J., Kautz,
J., Shirley, P., Wald, I.: Packet-based Whitted and dis-
tribution ray tracing. In: Proc. Graphics Interface (2007)
8. Brunvand, E., Kopta, D., Chatterjee, N.: Why graphics
programmers need to know about DRAM. In: ACM SIG-
GRAPH 2014 Courses (2014)
9. Budge, B., Bernardin, T., Stuart, J.A., Sengupta, S.,
Joy, K.I., Owens, J.D.: Out-of-core Data Management
for Path Tracing on Hybrid Resources. CGF (2009)
10. Chatterjee, N., Balasubramonian, R., Shevgoor, M.,
Pugsley, S., Udipi, A., Shafiee, A., Sudan, K., Awasthi,
M., Chishti, Z.: USIMM: the Utah SImulated Memory
Module. Tech. Rep. UUCS-12-02, U. of Utah (2012)
11. Chatterjee, N., O'Connor, M., Lee, D., Johnson, D.R.,
Keckler, S.W., Rhu, M., Dally, W.J.: Architecting an
energy-efficient DRAM system for GPUs. In: HPCA (2017)
12. Christensen, P.H., Laur, D.M., Fong, J., Wooten, W.L.,
Batali, D.: Ray differentials and multiresolution geometry
caching for distribution ray tracing in complex scenes. In:
Eurographics (2003)
13. Dally, B.: The challenge of future high-performance com-
puting. Celsius Lecture, Uppsala University, Uppsala,
Sweden (2013)
14. Gribble, C., Ramani, K.: Coherent ray tracing via stream
filtering. In: IRT (2008)
15. Hapala, M., Davidovic, T., Wald, I., Havran, V.,
Slusallek, P.: Efficient stack-less BVH traversal for ray
tracing. In: SCCG (2011)
16. HWRT: SimTRaX: a cycle-accurate ray tracing architectural
simulator and compiler. Utah Hardware Ray Tracing
Group (2012)
17. Johnsson, B., Akenine-Möller, T.: Measuring per-frame
energy consumption of real-time graphics applications.
JCGT 3, 60–73 (2014)
18. Johnsson, B., Ganestam, P., Doggett, M., Akenine-
Möller, T.: Power efficiency for software algorithms run-
ning on graphics processors. In: HPG (2012)
19. Kajiya, J.T.: The rendering equation. In: Proc. SIGGRAPH '86 (1986)
20. Karras, T., Aila, T.: Fast parallel construction of high-
quality bounding volume hierarchies. Proc. HPG (2013)
21. Kopta, D., Shkurko, K., Spjut, J., Brunvand, E., Davis,
A.: An energy and bandwidth efficient ray tracing archi-
tecture. In: Proc. HPG (2013)
22. Kopta, D., Shkurko, K., Spjut, J., Brunvand, E., Davis,
A.: Memory considerations for low energy ray tracing.
CGF 34(1), 47–59 (2015)
23. Lee, W.J., Shin, Y., Hwang, S.J., Kang, S., Yoo, J.J.,
Ryu, S.: Reorder buffer: an energy-efficient multithread-
ing architecture for hardware MIMD ray traversal. In:
Proc. HPG (2015)
24. Liktor, G., Vaidyanathan, K.: Bandwidth-efficient BVH
layout for incremental hardware traversal. In: Proc. HPG (2016)
25. Månsson, E., Munkberg, J., Akenine-Möller, T.: Deep co-
herent ray tracing. In: IRT (2007)
26. Moon, B., Byun, Y., Kim, T.J., Claudio, P., Kim, H.S.,
Ban, Y.J., Nam, S.W., Yoon, S.E.: Cache-oblivious ray
reordering. ACM Trans. Graph. 29(3) (2010)
27. Muralimanohar, N., Balasubramonian, R., Jouppi, N.:
Optimizing NUCA organizations and wiring alternatives
for large caches with CACTI 6.0. In: MICRO (2007)
28. Navrátil, P., Fussell, D., Lin, C., Mark, W.: Dynamic
ray scheduling to improve ray coherence and bandwidth
utilization. In: IRT (2007)
29. Navrátil, P.A., Mark, W.R.: An analysis of ray tracing
bandwidth consumption. Tech. Rep., Computer Science
Department, University of Texas at Austin (2006)
30. Overbeck, R., Ramamoorthi, R., Mark, W.R.: Large ray
packets for real-time Whitted ray tracing. In: IRT (2008)
31. Pool, J.: Energy-precision tradeoffs in the graphics
pipeline. Ph.D. thesis (2012)
32. Pool, J., Lastra, A., Singh, M.: An energy model for
graphics processing units. In: ICCD (2010)
33. Shkurko, K., Grant, T., Kopta, D., Mallett, I., Yuksel, C.,
Brunvand, E.: Dual streaming for hardware-accelerated
ray tracing. In: Proc. HPG (2017)
34. Smits, B.: Efficiency issues for ray tracing. In: SIG-
GRAPH Courses, SIGGRAPH ’05 (2005)
35. Spjut, J., Kensler, A., Kopta, D., Brunvand, E.: TRaX: A
multicore hardware architecture for real-time ray tracing.
IEEE Trans. on CAD 28(12) (2009)
36. Spjut, J., Kopta, D., Boulos, S., Kellis, S., Brunvand, E.:
TRaX: A multi-threaded architecture for real-time ray
tracing. In: SASP (2008)
37. Tsakok, J.A.: Faster incoherent rays: Multi-BVH ray
stream tracing. In: Proc. HPG (2009)
38. Vogelsang, T.: Understanding the energy consumption of
dynamic random access memories. In: MICRO ’43 (2010)
39. Wang, R., Yu, B., Marco, J., Hu, T., Gutierrez, D., Bao,
H.: Real-time rendering on a power budget. ACM TOG
35(4) (2016)
40. Whitted, T.: An improved illumination model for shaded
display. Com. of the ACM 23(6) (1980)