Efficient implementation of the 3D-DDA ray traversal algorithm on GPU and its application in radiation dose calculation

Wiley
Medical Physics
Kai Xiao, Danny Z. Chen, X. Sharon Hu
Department of Computer Science and Engineering
University of Notre Dame, Notre Dame, IN 46556
E–mail: {kxiao, dchen, shu}@nd.edu
Bo Zhou
Department of Radiation Oncology
University of Maryland School of Medicine, Baltimore, MD 21201
E–mail: bzhou@umd.edu
Abstract
Purpose: The three–dimensional Digital Differential Analyzer (3D–DDA) algorithm is a widely used ray traversal method and lies at the core of many convolution/superposition (C/S) dose calculation approaches. However, porting existing C/S dose calculation methods onto the Graphics Processing Unit (GPU) makes it challenging to retain the efficiency of this algorithm. In particular, a straightforward implementation of the original 3D–DDA algorithm incurs substantial branch divergence, which conflicts with the GPU programming model and leads to sub–optimal performance. In this paper, an efficient GPU implementation of the 3D–DDA algorithm is proposed, which effectively reduces such branch divergence and improves the performance of C/S dose calculation programs running on GPU.
Methods: The main idea of the proposed method is to convert a number of conditional statements in the original 3D–DDA algorithm into a set of simple operations (e.g., arithmetic, comparison and logic) which are better supported by the GPU architecture. To verify and demonstrate the performance improvement, this ray traversal method was integrated into a GPU–based Collapsed Cone Convolution/Superposition (CCCS) dose calculation program.
Results: The proposed method has been tested using a water phantom and various clinical cases on an NVIDIA GTX570 GPU. The CCCS dose calculation program based on the efficient 3D–DDA ray traversal implementation runs 1.42~2.67X faster than the one based on the original 3D–DDA implementation, without losing any accuracy.
Conclusions: The results show that the proposed method can effectively reduce branch divergence in the original 3D–DDA ray traversal algorithm and improve the performance of the CCCS program running on GPU. Considering the wide utilization of the 3D–DDA algorithm, various applications can benefit from this implementation method.
Key Words: GPU, 3D–DDA ray traversal, branch divergence, dose calculation, CCCS.
I. INTRODUCTION
In modern radiotherapy treatment planning, the convolution/superposition (C/S) method can be viewed as a standard for dose calculation algorithms [1]. Recently, the Graphics Processing Unit (GPU) has become an effective platform for accelerating radiation dose calculation, which is a computationally expensive process [2]. Many C/S dose calculation algorithms, including Collapsed Cone Convolution/Superposition [3] (CCCS) and Monte Carlo Convolution/Superposition [4] (MCCS) based methods, have been ported onto GPU and have shown impressive performance improvements [5, 6, 7].
One vital module in the C/S dose calculation algorithms is ray traversal, which computes the traversing trajectories of energy particles through a given region and their exact radiological path lengths. In this module, the traversal of an individual particle is treated as one ray. From its source point (or the entrance point for a region of interest), the sequence of voxels a ray penetrates and the radiological length it traverses inside each voxel are computed, in order to calculate the locations of the voxels and establish the amount of radiation energy delivered. The total number of ray traversals performed by a C/S dose calculation algorithm (e.g., the Monte Carlo C/S method [8]) for a clinical case can reach millions, and these traversals account for a large proportion of the total computation time of the radiotherapy treatment planning process. Hence, a fast and accurate ray traversal method on GPU is key to boosting performance.
A widely used ray traversal algorithm for C/S dose calculation is the 3D–DDA algorithm [9]. As an efficient voxel space traversal method, the 3D–DDA algorithm has been adopted by a number of clinical dose calculation software packages, such as the one in the Panther system [10] (Prowess, Chico, CA). In C/S dose calculation, the 3D–DDA algorithm iterates through the voxels along the ray’s traversal path. As the ray passes each voxel, the voxel index and the radiological length are computed for determining the Total Energy Released per unit Mass (TERMA) as well as depositing energy.
A number of different versions of the 3D–DDA algorithm have been proposed. For example, Stolte et al. [11] adopted additional logical (e.g., masks and negative notations) and fixed–point arithmetic operations to improve the traversal accuracy and speed. Stephenson et al. [12] used an iterative technique based on what they called “runs” (a run is a set of contiguous voxels with the same coordinate value in one direction) in order to improve the efficiency of ray traversal for long paths. Fox et al. [13] proposed a method of transforming the major axis of the voxel grid to align with the ray direction, which reduces the number of iterations during ray traversal. However, this previous work mainly focused on decreasing the number of iterations or instructions. A critical issue which considerably affects performance on GPU is that the 3D–DDA algorithm contains a group of nested conditional statements in its inner loop. This issue has received little attention because its impact on performance in a CPU environment is not significant.
As a method of computing the exact radiological paths, Siddon’s algorithm [14, 15] can also be used in C/S dose calculation. For every ray, instead of performing the stepping logic iteratively as 3D–DDA does, Siddon’s algorithm pre–computes an ordered “distance” array for each axis, where the elements of each array represent the distances from the ray’s source point to the boundary planes along the corresponding axis. The arrays for the three axes are then merged into a “reference” array whose elements are sorted in increasing order of the distances between the source point and the boundary planes. Therefore, the radiological path information of a ray can be directly read from its reference array during dose calculation. However, unless a number of rays have the same stepping sequence and share the same reference array, storing a reference array for each ray requires a huge amount of memory space, especially when the number of rays and the grid size are large. With their limited memory capacity, most GPUs do not have sufficient memory for implementing this algorithm. As illustrated in the work of de Greef et al. [16], Siddon’s algorithm needs to be rewritten using a stepping approach in order to port the C/S dose calculation based on it onto GPU. Such “rewriting” results in a structure of nested conditional statements which is quite similar to the one in the 3D–DDA method, as shown in the appendix of de Greef et al.’s paper [16].
To improve the performance of GPU–based C/S dose calculation, several groups have investigated methods for optimizing the memory access patterns generated during ray traversal [17, 18]. They explored methods for modifying and scheduling ray distribution and data alignment in order to reduce the number of off–chip memory transactions by coalescing, and for utilizing specific GPU features such as the cache hierarchy to reduce the latency of memory transactions. However, the performance of the ray traversal module itself was not considered explicitly in these studies.
Our investigation shows that the execution of the original 3D–DDA ray traversal method on GPU is not very efficient, because it contains a large number of conditional statements, such as “if–else”, which cause branch divergence and degrade performance [19, 20]. Note that in the Single–Instruction–Multiple–Data (SIMD) architecture of NVIDIA GPUs, 32 consecutive threads in the same block (i.e., ThreadIndex[0, .., 31]) form a “warp” and share the same instruction dispatching unit. Ideally, threads in a warp achieve the best performance when they execute the same instruction flow simultaneously. When the threads of a warp encounter a conditional statement and take different execution branches (e.g., true/false from an “if” statement), they issue separate instructions following the different branches in the code. The instruction unit assigned to such a warp then sequentially issues and dispatches the corresponding instructions for each subset of divergent threads, which turns parallel execution of threads in a warp into serial execution. Such a situation is called branch divergence. To alleviate the overhead of branch divergence, NVIDIA provides a branch predication mechanism which schedules execution of every instruction controlled by a conditional structure with a per–thread condition code (referred to as a predicate). Only instructions with true predicates are actually executed, and those with false predicates cannot write results, evaluate addresses or read operands. Nonetheless, such a mechanism cannot be applied to branches containing more than 4 to 7 instructions or to nested conditions, since the execution cost can be too high in these scenarios [21]. Our observation reveals that the heavily used conditional statements in the 3D–DDA algorithm are nested and often have large numbers of instructions, which can seriously deteriorate its performance on GPU.
In this paper, we present an alternative method to implement the 3D–DDA algorithm for reducing branch divergence. The proposed method, referred to as 3D–DDA–nc, replaces a number of conditional statements in the original 3D–DDA algorithm by a set of arithmetic, comparison and logical operations. Since this new set of operations contains no branches, it utilizes the GPU’s underlying SIMD feature more effectively and hence improves the ray traversal efficiency. We have integrated the 3D–DDA–nc method into a GPU–based CCCS dose calculation program and tested it on various clinical cases. The results show that the CCCS program based on our 3D–DDA implementation is 1.42~2.67X faster than that based on the original version. Our improved 3D–DDA implementation can also be applied to other applications involving ray traversal (e.g., graphics ray tracing) where branch divergence causes significant performance degradation on GPU.
II. METHODS
Ray traversal is a procedure of computing the propagation path of a ray in a given region. A ray is uniquely defined by two vectors S and Dir, where S is the source point and Dir is the ray’s direction. A point on the ray is represented by $S + T \times Dir$, where the variable T ($T \ge 0$) is the distance from the source to that point. Below, we first briefly review the original 3D–DDA ray traversal algorithm and then present our 3D–DDA–nc method.
Algorithm 1 summarizes the essential operations in the original 3D–DDA algorithm. The first part (Lines 1–14) performs all necessary initialization, while the second part (Lines 15–33) conducts the actual ray traversal iteratively, voxel by voxel. A vector, Step, is introduced such that each element of Step represents the positive or negative index change when the ray crosses a voxel and is initialized to –1 or 1 (Line 5). When the ray is parallel to any grid axis, the corresponding element of Step is set to 0. Vector DeltaT represents the distance along the ray when it traverses from one boundary plane to the next parallel boundary plane (Lines 12–14). The elements of vector T are used to store the path length along the ray from S to the next boundary to be evaluated (initialized in Line 7). Therefore, given voxel CurrentV, the 3D–DDA algorithm first finds the boundary plane on which the ray exits the voxel CurrentV by identifying the minimum element in T (Lines 16, 17, 25). Then, the minimum element of T is increased by the corresponding element of DeltaT (Lines 18, 21, 29). Finally, the corresponding voxel index is incremented by Step to update CurrentV to the next voxel (Lines 19, 22, 30). An example of applying the 3D–DDA algorithm to traverse a ray in the 2–D space is illustrated in Figure 1. Note that when the ray is parallel to any grid axis, there is no voxel index change along that direction. In this case, the corresponding element in T is set to a pre–defined, sufficiently large value, indicating that the ray does not intersect any boundary plane along this direction (Lines 8–9).
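The initialization and stepping logic described above can be sketched in host-side C++ as follows. This is a minimal illustration rather than a transcription of Algorithm 1: the names `Vec3f`, `Vec3i`, `RayState`, `initRay`, `stepOriginal` and the sentinel `kNoCrossing` are assumptions made for this sketch, as are the conventions that the grid origin is at 0 and that S lies strictly inside a voxel. The line numbers quoted in the text do not map onto this code.

```cpp
#include <cmath>

struct Vec3f { float x, y, z; };
struct Vec3i { int x, y, z; };

// Finite "large enough" value for axes the ray never crosses (assumed value).
const float kNoCrossing = 1.0e30f;

struct RayState {
    Vec3i currentV;  // index of the voxel the ray is currently inside
    Vec3i step;      // index change per boundary crossing: -1, 0 or +1
    Vec3f t;         // path length from S to the next boundary on each axis
    Vec3f deltaT;    // path length between consecutive parallel boundaries
};

// Initialize one axis (grid origin assumed at 0, s strictly inside a voxel).
static void initAxis(float s, float dir, float size,
                     int &cur, int &step, float &t, float &deltaT) {
    cur = static_cast<int>(std::floor(s / size));
    if (dir > 0.0f) {
        step = 1;
        deltaT = size / dir;
        t = ((cur + 1) * size - s) / dir;   // distance to the upper boundary
    } else if (dir < 0.0f) {
        step = -1;
        deltaT = -size / dir;
        t = (cur * size - s) / dir;         // distance to the lower boundary
    } else {
        step = 0;                           // ray parallel to this axis
        deltaT = kNoCrossing;
        t = kNoCrossing;
    }
}

RayState initRay(Vec3f s, Vec3f dir, Vec3f size) {
    RayState r;
    initAxis(s.x, dir.x, size.x, r.currentV.x, r.step.x, r.t.x, r.deltaT.x);
    initAxis(s.y, dir.y, size.y, r.currentV.y, r.step.y, r.t.y, r.deltaT.y);
    initAxis(s.z, dir.z, size.z, r.currentV.z, r.step.z, r.t.z, r.deltaT.z);
    return r;
}

// One traversal step: nested conditionals pick the minimum element of T,
// then advance that element of T and the matching voxel index.
void stepOriginal(RayState &r) {
    if (r.t.x < r.t.y) {
        if (r.t.x < r.t.z) {
            r.t.x += r.deltaT.x;
            r.currentV.x += r.step.x;
        } else {
            r.t.z += r.deltaT.z;
            r.currentV.z += r.step.z;
        }
    } else {
        if (r.t.y < r.t.z) {
            r.t.y += r.deltaT.y;
            r.currentV.y += r.step.y;
        } else {
            r.t.z += r.deltaT.z;
            r.currentV.z += r.step.z;
        }
    }
}
```

Each call to `stepOriginal` follows exactly one of four code paths, which is the source of the divergence when neighbouring threads handle rays with different exit faces.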
As shown in Algorithm 1, the procedure of finding the minimum element of T involves a group of nested “if–else” statements (Lines 16–32). Considering the fact that one thread handles a ray at one time in most GPU–based ray traversal applications, directly implementing the original 3D–DDA algorithm leads to the possibility that the threads in a warp take different branches and result in branch divergence. The frequency and seriousness of such divergence depend on a number of factors, including the distribution of rays (e.g., their source points and directions) and characteristics of the traversing region (e.g., the region size and voxel locations). For example, a warp of threads handling the traversal of rays whose source points and directions differ significantly would most likely introduce branch divergence and deteriorate the GPU execution performance.
We propose a different implementation for the original 3D–DDA algorithm in order to eliminate the conditional statements used for computing the minimum element of T. Instead of checking and comparing the elements of T by using conditional statements, our implementation, referred to as 3D–DDA–nc, maintains a vector VoxelIncr through a set of comparison and logical operations on T. Each element in VoxelIncr serves as a flag, indicating whether there is an index change along the corresponding grid axis when the ray exits the voxel CurrentV. Specifically, VoxelIncr.x is defined as follows:
$$\mathit{VoxelIncr}.x = (T.x \le T.y) \mathbin{\&\&} (T.x \le T.z) \tag{1}$$
VoxelIncr.y and VoxelIncr.z are defined in a similar manner. Note that Equation (1) does not require any conditional statement and can be evaluated by comparisons (e.g., “less–or–equal”) and logical operations (e.g., “and”).
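Equation (1) and its y and z counterparts can be written directly as comparisons combined with a bitwise “and”. The sketch below is a hedged illustration in host-side C++; the type names `Vec3f` and `Vec3i` and the function name `voxelIncr` are assumptions introduced here, not identifiers from the paper.

```cpp
struct Vec3f { float x, y, z; };
struct Vec3i { int x, y, z; };

// Each flag is 1 when the corresponding element of T is a minimum of T,
// and 0 otherwise. Comparisons yield 0 or 1, so the bitwise '&' acts as a
// logical "and" without the short-circuit branch that '&&' may introduce.
Vec3i voxelIncr(const Vec3f &t) {
    Vec3i incr;
    incr.x = (t.x <= t.y) & (t.x <= t.z);
    incr.y = (t.y <= t.x) & (t.y <= t.z);
    incr.z = (t.z <= t.x) & (t.z <= t.y);
    return incr;
}
```

Because the comparisons are non-strict, equal elements of T all receive a flag of 1. Whether a given compiler emits fully branch-free machine code for this function is not guaranteed by the language and would need to be checked in the generated code.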
The observation below forms the basis for the 3D–DDA–nc method.
Observation 1: An element in VoxelIncr has a value “1” if and only if the corresponding element in T is the minimum.
The correctness of Observation 1 follows immediately from Algorithm 1. For instance, if $T.x \le T.y$ and $T.x \le T.z$, then by Algorithm 1, we know that the ray must exit CurrentV from its x boundary and VoxelIncr.x is set to 1.
Since a ray may have negative directions, to support both increment and decrement of a voxel index, the Step vector as used in Algorithm 1 is also adopted in our new implementation. Hence, CurrentV.x can be updated during each step of ray traversal as follows:
$$\mathit{CurrentV}.x \mathrel{+}= \mathit{VoxelIncr}.x \times \mathit{Step}.x \tag{2}$$
CurrentV.y and CurrentV.z can be computed in a similar manner. Furthermore, T.x can be calculated from VoxelIncr.x and DeltaT.x as follows:
$$T.x \mathrel{+}= \mathit{VoxelIncr}.x \times \mathit{DeltaT}.x \tag{3}$$
Similarly, T.y and T.z can be computed.
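Equations (1) to (3) together give one complete traversal step. The following host-side C++ sketch shows how they might be combined; the names `RayState` and `stepNc` are assumptions for illustration, and the state is assumed to have been initialized as described for Algorithm 1.

```cpp
struct Vec3f { float x, y, z; };
struct Vec3i { int x, y, z; };

struct RayState {
    Vec3i currentV;  // index of the voxel the ray is currently inside
    Vec3i step;      // index change per boundary crossing: -1, 0 or +1
    Vec3f t;         // path length from S to the next boundary on each axis
    Vec3f deltaT;    // path length between consecutive parallel boundaries
};

// One traversal step without if/else statements.
void stepNc(RayState &r) {
    // Equation (1): flag every axis whose T element is a minimum.
    const int ix = (r.t.x <= r.t.y) & (r.t.x <= r.t.z);
    const int iy = (r.t.y <= r.t.x) & (r.t.y <= r.t.z);
    const int iz = (r.t.z <= r.t.x) & (r.t.z <= r.t.y);

    // Equation (3): advance T only on flagged axes (flag 0 adds nothing).
    // The "large enough" value used for parallel axes must be finite,
    // because 0 * infinity would produce NaN here.
    r.t.x += ix * r.deltaT.x;
    r.t.y += iy * r.deltaT.y;
    r.t.z += iz * r.deltaT.z;

    // Equation (2): move the voxel index only on flagged axes.
    r.currentV.x += ix * r.step.x;
    r.currentV.y += iy * r.step.y;
    r.currentV.z += iz * r.step.z;
}
```

Every ray executes the same instruction sequence in `stepNc`, regardless of which face it exits through; the data-dependent choice is carried by the 0/1 flags instead of by control flow.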
We summarize the 3D–DDA–nc method in Algorithm 2. Note that the initialization part (Lines 1–14) of Algorithm 2 is the same as that of Algorithm 1, while the iterative part in Algorithm 2 uses arithmetic, comparison and logical operations instead of conditional statements. The correctness of Algorithm 2 is easy to verify. As an example, consider a ray leaving from a voxel’s negative x boundary face. According to 3D–DDA–nc, Step.x is set to –1. Since T.x is the minimum element in T, vector VoxelIncr is set to [1, 0, 0] in Lines 17–19. Then, vector CurrentV is changed by [–1, 0, 0], which means that the next voxel’s index is one less in the x direction than that of the voxel from which the ray is leaving.
For situations in which a ray exits a voxel through one of its edges or corners, some elements of T are equal. In such a case, multiple elements of VoxelIncr are set to 1, and our 3D–DDA–nc method still computes the next voxel correctly. For example, suppose a ray exits from a corner of a voxel at which the voxel’s negative x, positive y, and negative z faces join. Then vector VoxelIncr is [1, 1, 1] and Step is [–1, 1, –1], yielding the change of the vector CurrentV as [–1, 1, –1]. Note that for this particular case, 3D–DDA–nc computes the next voxel with only one iteration (Lines 17–25 in Algorithm 2). In comparison, for the original 3D–DDA shown in Algorithm 1, three iterations are used. The fact that 3D–DDA–nc naturally supports the special cases of rays passing through edges or corners makes ray traversal based on our method more efficient.
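The corner case above can be reproduced with a small host-side C++ experiment that counts how many steps each variant needs to reach the diagonal neighbour. All names (`RayState`, `stepOriginal`, `stepNc`, `cornerRay`, `iterationsToReach`) and the concrete starting voxel (5, 5, 5) are assumptions chosen for illustration; the nested-conditional version is a sketch of the structure described for Algorithm 1, not a copy of it.

```cpp
struct Vec3f { float x, y, z; };
struct Vec3i { int x, y, z; };
struct RayState { Vec3i currentV; Vec3i step; Vec3f t; Vec3f deltaT; };

// Nested-conditional step: advances exactly one axis per call.
void stepOriginal(RayState &r) {
    if (r.t.x < r.t.y) {
        if (r.t.x < r.t.z) { r.t.x += r.deltaT.x; r.currentV.x += r.step.x; }
        else               { r.t.z += r.deltaT.z; r.currentV.z += r.step.z; }
    } else {
        if (r.t.y < r.t.z) { r.t.y += r.deltaT.y; r.currentV.y += r.step.y; }
        else               { r.t.z += r.deltaT.z; r.currentV.z += r.step.z; }
    }
}

// Flag-based step: advances every axis whose T element is a minimum.
void stepNc(RayState &r) {
    const int ix = (r.t.x <= r.t.y) & (r.t.x <= r.t.z);
    const int iy = (r.t.y <= r.t.x) & (r.t.y <= r.t.z);
    const int iz = (r.t.z <= r.t.x) & (r.t.z <= r.t.y);
    r.t.x += ix * r.deltaT.x;  r.currentV.x += ix * r.step.x;
    r.t.y += iy * r.deltaT.y;  r.currentV.y += iy * r.step.y;
    r.t.z += iz * r.deltaT.z;  r.currentV.z += iz * r.step.z;
}

// A ray about to leave voxel (5,5,5) through the corner where its
// negative x, positive y and negative z faces join: all T elements equal.
RayState cornerRay() {
    RayState r = {{5, 5, 5}, {-1, 1, -1}, {1.0f, 1.0f, 1.0f}, {1.0f, 1.0f, 1.0f}};
    return r;
}

// Number of calls to `step` needed until the ray reaches `target`
// (gives up after maxIter calls).
int iterationsToReach(RayState r, Vec3i target,
                      void (*step)(RayState &), int maxIter) {
    int n = 0;
    while ((r.currentV.x != target.x || r.currentV.y != target.y ||
            r.currentV.z != target.z) && n < maxIter) {
        step(r);
        ++n;
    }
    return n;
}
```

Calling `iterationsToReach(cornerRay(), Vec3i{4, 6, 4}, stepOriginal, 10)` and the same call with `stepNc` lets the two iteration counts be compared directly. In the nested version the intermediate voxels are visited with zero path length, which is why both variants end with the same T vector.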
With our new implementation, the conditional statements in the original 3D–DDA method (Lines 16–32 of Algorithm 1) are replaced by the non–conditional operations in Lines 17–25 of Algorithm 2. This replacement requires 21 additional instructions (6 comparisons, 3 logical operations, 3 integer–to–floating–point conversions, 3 floating–point multiply–and–adds, 3 integer multiplications and 3 integer additions) and 3 additional registers. This additional overhead represents a small portion of the total numbers of instructions and registers used in most real–world applications. For example, in the CCCS application considered in this paper, thousands of instructions are used and 46 registers are needed. The performance improvements gained by reducing branch divergence outweigh the additional overhead, as shown by the experimental results. This conclusion applies to any scenario where branch divergence causes significant performance degradation due to serialization of costly operations such as memory reads/writes.
III. RESULTS
To evaluate the effectiveness of our 3D–DDA–nc algorithm, we implemented two versions of the GPU–based CCCS program according to Ahnesjö’s paper [3], one following the original 3D–DDA algorithm and the other adopting our 3D–DDA–nc method. The original 3D–DDA based CCCS program has been used in the work of Zhou et al. [18], and its C–version variant has been used in a commercial product, the Prowess Panther TPS. Both implementations used 384 kernel ray directions and were tested on an NVIDIA GTX570 (Fermi architecture, 480 cores, 1.6 GHz core frequency). All programs were developed under the NVIDIA CUDA v4.0 environment, and the performance data were collected with NVIDIA Compute Profiler v4.0.
To use the 3D–DDA algorithm in CCCS, the geometric path length traversed by a ray inside each voxel needs to be determined in order to calculate the radiological path length. In the original 3D–DDA algorithm, this is done by storing in variable minT the minimum element of T selected by the conditional structure. The difference between the minT values of the current and previous iterations is the geometric path length traversed by the ray inside the current voxel. In 3D–DDA–nc, the minimum element can be determined with CUDA’s built–in function fmin(x, y), which returns the minimum of x and y. The actual code is listed below. When applied to CCCS, these lines of code are inserted between Line 19 and Line 20 of Algorithm 2.
minTcurrent = fmin(T.x, fmin(T.y, T.z));
geoLength = minTcurrent - minT;
minT = minTcurrent;
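To show where these three lines sit in a traversal loop, the sketch below accumulates a radiological path length over a fixed number of steps. It is an assumption-laden illustration in host-side C++: the function `radiologicalLength`, the flat `density` array of relative densities, its x-fastest memory layout, and the rule "radiological length = geometric length × density of the current voxel" are choices made for this example and are not taken from the CCCS program in the paper. The caller is assumed to choose `nSteps` so that the ray stays inside the grid.

```cpp
#include <cmath>

struct Vec3f { float x, y, z; };
struct Vec3i { int x, y, z; };
struct RayState { Vec3i currentV; Vec3i step; Vec3f t; Vec3f deltaT; };

// Sum of (geometric length inside voxel) * (density of that voxel) over
// the first nSteps voxels along the ray. `dim` is the grid size in voxels.
float radiologicalLength(RayState r, int nSteps,
                         const float *density, Vec3i dim) {
    float minT = 0.0f;    // the ray starts at T = 0
    float radLen = 0.0f;
    for (int i = 0; i < nSteps; ++i) {
        // Flags for the exit face(s) of the current voxel.
        const int ix = (r.t.x <= r.t.y) & (r.t.x <= r.t.z);
        const int iy = (r.t.y <= r.t.x) & (r.t.y <= r.t.z);
        const int iz = (r.t.z <= r.t.x) & (r.t.z <= r.t.y);

        // Geometric length inside the current voxel, computed before T
        // is advanced (the three lines from the text).
        const float minTcurrent = std::fmin(r.t.x, std::fmin(r.t.y, r.t.z));
        const float geoLength = minTcurrent - minT;
        minT = minTcurrent;

        // Look up the current voxel (x-fastest layout, assumed).
        const int idx = r.currentV.x +
                        dim.x * (r.currentV.y + dim.y * r.currentV.z);
        radLen += geoLength * density[idx];

        // Advance T and the voxel index on the flagged axes.
        r.t.x += ix * r.deltaT.x;  r.currentV.x += ix * r.step.x;
        r.t.y += iy * r.deltaT.y;  r.currentV.y += iy * r.step.y;
        r.t.z += iz * r.deltaT.z;  r.currentV.z += iz * r.step.z;
    }
    return radLen;
}
```

The minimum is taken with a nested `fmin` rather than with comparisons and branches, so this part of the loop adds no new conditional statements.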
Performance comparisons were conducted on a water phantom and six clinical cases, as shown in Table 1. It is worth mentioning that the CCCS program used in this test includes an important implementation detail to avoid performing ray traversal and dose calculation operations when the rays have no energy to release. With this implementation, the ray traversal starts at the field boundary instead of the image boundary, and terminates when the ray exits the image or completely releases its energy. Since the execution time of CCCS dose calculation is proportional to the number of beams involved, we tested only one beam aiming at the center of the phantom for each case. Figure 2 summarizes the execution times of the two implementations, showing that a speedup of 1.42~2.67X is achieved. We further evaluated the accuracy of the dose results computed by both implementations. The results show that our modification to the ray traversal method has no impact on the accuracy of the calculated dose results, which is expected since 3D–DDA–nc does not affect the rays’ traversal paths but only changes the method of computing them.
Although the execution times for the test cases that we presented are short, it is important to point out that a typical treatment plan commonly involves multiple beams (e.g., 5–9 beams for an IMRT plan, and up to several hundred beams for a rotational delivery plan). Thus, the reduction in execution time by half or more offers considerable improvement in real treatment planning scenarios, especially when no accuracy compromise occurs.
In order to identify the key contributor to the observed performance improvement, we also conducted profiling on both implementations to collect internal performance data for further analysis. These data are shown in Table 2, which includes the total number of branches, the number of divergent branches, the total number of executed instructions, and the memory throughputs.
• Columns 3 and 4 of Table 2 show the numbers of branches and divergent branches in the execution of each test case, respectively. As shown in the table, our 3D–DDA–nc method can effectively reduce the total branches and divergent branches of the CCCS program on GPU (by 4.2~8.6X).
• Column 5 compares the total number of executed instructions. The total number of instructions executed in CCCS based on 3D–DDA–nc is 3.7~5.1X lower than that based on the original 3D–DDA. Since branch divergence forces the GPU to execute some instructions sequentially, the decrease in branch divergence by 3D–DDA–nc helps processor cores to share the instruction flow and execute in parallel, which leads to the reduction in the number of executed instructions.
• Columns 6–8 of Table 2 summarize the memory read/write/overall throughputs for our test cases. The CCCS algorithm is well known to be memory bound for GPU implementations [18]. Therefore, any performance improvement on CCCS must be accompanied by an improvement in off–chip (DRAM) memory throughput. As shown in the table, the 3D–DDA–nc method improves the read throughput by 2.1~4.2X because the enhanced parallel execution from our method provides more opportunities for the GPU’s memory controller to exploit memory locality. However, the write throughput is reduced (as shown in Column 7) due to the increase in simultaneous write operations to the same addresses for dose deposition.
In our implementation, atomic writes are used to maintain coherence for simultaneous writes to the same memory location. With the improved parallel execution by 3D–DDA–nc, more write requests are generated simultaneously, and thus more atomic operations are required. This conflict limits the level of speedup ultimately achieved. However, since atomic operations are quite efficient in the latest GPU architecture and the number of write requests is much smaller than that of read requests, our method still results in an overall memory throughput improvement. As shown in Column 8, the overall memory throughput of the CCCS execution with the 3D–DDA–nc method is 1.4~2.7X higher than that with the original method. Note that, for those CCCS implementations in which atomic writes are avoided (such as that by Chen et al. [17]), our method for reducing branch divergence could lead to even bigger performance improvement.
IV. CONCLUSIONS
We present a simple yet effective method for implementing the 3D–DDA ray traversal algorithm on GPU by replacing a number of conditional statements with arithmetic, comparison and logical operations. Our method reduces the number of divergent branches during execution, hence improving the performance of 3D–DDA on GPU. The experimental results for several clinical cases demonstrate that on a state–of–the–art GPU platform, a CCCS program based on the improved 3D–DDA algorithm can attain around 2X performance speedup from our modification without losing any dose accuracy on the tested clinical cases.
The proposed implementation can be readily applied to other 3D–DDA based applications, including graphics ray tracing and 3D animations. The actual achievable performance improvement for a specific application depends on various factors, such as the operations to be executed for every traversal step and the types of instructions following the conditional statements. In general, for applications where expensive instructions (e.g., memory accesses) follow the conditional statements of the 3D–DDA, our implementation can provide non–trivial performance benefits. For example, in graphics ray tracing, each step of traversing a voxel requires accessing the object information that the voxel contains. Reducing branch divergence in this application will improve memory read throughput in a way similar to that in the CCCS application (see Column 6 of Table 2). In addition, for many GPU applications containing large numbers of conditional statements (e.g., image reconstruction, viewing transformation and volume visualization), the concept of replacing branches described in this work should also be helpful in reducing branch divergence and its negative impact on performance.
Algorithm 1. The original 3D–DDA algorithm [9].
Algorithm 2. The 3D–DDA–nc method.
Test Case | Image Size * | Field Size * | Voxel Size (mm) **
Water Phantom | 200×200×200 | 50×50 | 1.0×1.0×1.0
Breast A | 512×512×168 | 125×125 | 0.5×0.5×3.0
Breast B | 512×512×63 | 110×110 | 1.0×1.0×1.0
Lung A | 512×512×112 | 125×125 | 1.0×1.0×3.0
Lung B | 512×512×111 | 110×110 | 1.2×1.2×3.0
Head&neck A | 512×512×144 | 125×125 | 1.2×1.2×3.0
Head&neck B | 384×384×58 | 90×90 | 0.6×0.6×3.0
* Number of voxels in each dimension
** Geometric length of a voxel in each dimension
Table 1. Configuration of the test cases.
CT Image | Method | Branch | Divergence | Instruction | Read (GB/s) | Write (GB/s) | Overall (GB/s)
Water phantom | Original | 4.78E+08 | 9.07E+05 | 4.13E+09 | 13.05 | 10.07 | 23.12
Water phantom | 3D–DDA–nc | 1.81E+08 | 2.23E+05 | 1.73E+09 | 22.74 | 6.78 | 28.52
Breast A | Original | 8.62E+09 | 1.02E+07 | 5.87E+10 | 6.93 | 4.09 | 11.02
Breast A | 3D–DDA–nc | 9.91E+08 | 1.72E+06 | 1.14E+10 | 28.31 | 2.57 | 30.89
Breast B | Original | 4.07E+09 | 2.85E+06 | 3.29E+10 | 26.57 | 10.95 | 37.53
Breast B | 3D–DDA–nc | 6.83E+08 | 5.59E+05 | 7.50E+09 | 72.78 | 4.55 | 77.34
Lung A | Original | 6.74E+09 | 6.64E+06 | 4.60E+10 | 13.13 | 6.71 | 19.85
Lung A | 3D–DDA–nc | 9.52E+08 | 1.35E+06 | 1.08E+10 | 42.67 | 3.25 | 45.92
Lung B | Original | 6.03E+09 | 6.09E+06 | 4.26E+10 | 14.05 | 7.15 | 21.20
Lung B | 3D–DDA–nc | 8.61E+08 | 1.21E+06 | 9.73E+09 | 44.11 | 3.22 | 47.34
Head&neck A | Original | 6.62E+09 | 6.80E+06 | 3.20E+10 | 15.04 | 7.29 | 22.43
Head&neck A | 3D–DDA–nc | 9.68E+08 | 1.57E+06 | 8.79E+09 | 32.34 | 4.15 | 36.49
Head&neck B | Original | 3.45E+09 | 4.71E+06 | 2.20E+10 | 7.52 | 4.74 | 12.26
Head&neck B | 3D–DDA–nc | 5.12E+08 | 1.17E+06 | 5.90E+09 | 15.49 | 1.66 | 17.15
Table 2. Detailed execution data comparison between CCCS implementations based on the original 3D–DDA and 3D–DDA–nc. The Read, Write and Overall columns report memory throughput.
Figure 1. A 3D–DDA ray traversal example in the 2–D space. The 3D–DDA algorithm checks the two planes x' and y' to determine the exit point of the ray from voxel CurrentV, by computing the distances from the source to x' and to y' as T.x and T.y and taking the minimum as the exit point. Starting from CurrentV, in the first step, the exit point is on the boundary x', so CurrentV.x is increased and T.x is updated to the next boundary x''; in the second step, the exit point is on y', and thus CurrentV.y is increased.
Figure 2. Total execution time comparison between the CCCS implementations based on the original 3D–DDA and 3D–DDA–nc for the water phantom and six clinical test cases. The speedup factors are 1.42–2.67. In terms of the per–processed–voxel execution time, the original 3D–DDA has a range of 5.43 ns – 9.05 ns, and the 3D–DDA–nc has 3.36 ns – 4.23 ns for the clinical test cases. The variance in the speedup factors and per–voxel execution time is mainly due to the memory layout of the test images and the additional atomic writes introduced by the increased number of threads executing in parallel in 3D–DDA–nc.
ACKNOWLEDGEMENT
The research of D.Z. Chen was supported in part by NSF under Grant CCF-0916606.
REFERENCES
[1] W. Lu, G. Olivera, M. Chen, P. Reckwerdt, and T. Mackie, “Accurate convolution/superposition for multiresolution dose calculation using cumulative tabulated kernels,” Phys. Med. Biol., 50(4), 2005.
[2] S. Hissoiny, B. Ozell, H. Bouchard, and P. Despres, “GPUMCD: A new GPU–oriented Monte Carlo dose calculation platform,” Med. Phys., 38(2), 2011.
[3] A. Ahnesjö, “Collapsed cone convolution of radiant energy for photon dose calculation in heterogeneous media,” Med. Phys., 16(4), 1989.
[4] S. Naqvi, M. Earl, and D. Shepard, “Convolution/superposition using the Monte Carlo method,” Phys. Med. Biol., 48(14), 2003.
[5] R. Jacques, R. Taylor, J. Wong, and T. McNutt, “Towards real–time radiation therapy: GPU accelerated superposition/convolution,” Computer Methods and Programs in Biomedicine, 98(3), 2010.
[6] S. Hissoiny, B. Ozell, and P. Despres, “A convolution–superposition dose calculation engine for GPUs,” Med. Phys., 37(3), 2010.
[7] R. Jacques, J. Wong, T. McNutt, and R. Taylor, “Real–time dose computation: GPU–accelerated source modeling and superposition/convolution,” Med. Phys., 38(1), 2011.
[8] B. Zhou, X.S. Hu, D.Z. Chen, and C.X. Yu, “GPU–accelerated Monte Carlo convolution/superposition implementation for dose calculation,” Med. Phys., 37(11), 2010.
[9] J. Amanatides and A. Woo, “A fast voxel traversal algorithm for ray tracing,” Eurographics, 3(10), 1987.
[10] T. Knoos, E. Wieslander, L. Cozzi, C. Brink, A. Fogliate, D. Albers, H. Nystrom, and S. Lassen, “Comparison of dose calculation algorithms for treatment planning in external photon beam therapy for clinical situations,” Phys. Med. Biol., 51(22), 2006.
[11] N. Stolte and R. Caubet, “Discrete ray–tracing of huge voxel spaces,” Computer Graphics Forum, 14(3), 1995.
[12] P. Stephenson and B. Litow, “Making the DDA run: Two–dimensional ray traversal using runs and runs of runs,” Proceedings of the 24th Australasian Computer Science Conference, 177–183, 2001.
[13] C. Fox, H. Romeijn, and J. Dempsey, “Fast voxel and polygon ray–tracing algorithms in intensity modulated radiation therapy treatment planning,” Med. Phys., 33(5), 2006.
[14] R. Siddon, “Fast calculation of the exact radiological path for a three–dimensional CT array,” Med. Phys., 12(3), 1985.
[15] F. Jacobs, E. Sunderman, B. De Sutter, M. Chrisiaens, and I. Lemahieu, “A fast algorithm to calculate the exact radiological path through a pixel or voxel space,” Journal of Computing and Information Technology, 6(1), 1998.
[16] M. de Greef, J. Crezee, J. C. van Eijk, R. Pool, and A. Bel, “Accelerated ray tracing for radiotherapy dose calculations on a GPU,” Med. Phys., 36(9), 2009.
[17] Q. Chen, M. Chen, and W. Lu, “Ultrafast convolution/superposition using tabulated and exponential kernel,” Med. Phys., 38(3), 2011.
[18] B. Zhou, X.S. Hu, and D.Z. Chen, “Memory–efficient volume ray tracing on GPU for radiotherapy,” IEEE 9th Symposium on Application Specific Processors, 5(6), 2011.
[19] T. Aila and S. Laine, “Understanding the efficiency of ray traversal on GPUs,” Proceedings of the Conference on High Performance Graphics, 2009.
[20] T.D. Han and T.S. Abdelrahman, “Reducing branch divergence in GPU programs,” Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, 2011.
[21] NVIDIA Corp., “NVIDIA CUDA Compute Unified Device Architecture,” Programming Guide, 2011.
... Ray traversal is a fundamental process in many applications, including graphics ray tracing [37], volume rendering [9], and radiation dose calculation [95]. Since applications using ray traversal often involve large numbers of rays and repeatedly conduct the traversal process, the execution speed of ray traversal is critical. ...
... Such divergent execution paths could also worsen the efficiency of memory accessing and hence severely deteriorate performance [95]. Achieving good performance on ...
... Monte Carlo based ray tracing (MCBRT) appears in various applications such as graphics rendering [37], radiation dose calculation [95], and neutron transport simulation [39]. As an "embarrassingly parallel" computational problem, MCBRT has been extensively accelerated by shared memory many-core processors such as graphics processing units (GPUs) [22,40]. ...
Thesis
Shared memory many-core processors such as GPUs have been extensively used in accelerating computation-intensive algorithms and applications. When porting existing algorithms from sequential or other parallel architecture models to shared memory many-core architectures, non-trivial modifications are often needed to match the execution patterns of the target algorithms with the characteristics of many-core architectures. This dissertation presents a collection of methods and techniques for accelerating various important applications on GPU, including radiation dose calculation, ray tracing based graphics rendering, and nearest neighbor search. Specifically, we study the performance issues of ray traversal in spatially decomposed scenes, and propose a new data structure, called Shell, to completely eliminate the expensive hierarchical search operations. We also develop an efficient GPU implementation of the Three Dimensional Digital Differential Analyzer (3D-DDA) algorithm, which avoids the overhead of execution divergence by replacing the nested conditional instructions with a set of simple operations. Those two methods are used to accelerate the Collapsed Cone Convolution Superposition (CCCS) algorithm, which is the clinical choice for dose calculation in radiation treatment planning systems. Furthermore, we present a locality enhancing method for Monte Carlo based ray tracing (MCBRT) algorithm on CPU-GPU heterogeneous systems, which improves the spatial and temporal data locality by organizing random rays into coherent groups. Finally, we propose a series of techniques to accelerate nearest neighbor search algorithm on GPU, including a GPU-cache efficient data structure (k-pack tree), a coherent parallel search algorithm, and a cost model based performance optimization method. 
For each of the target applications, our proposed approaches provide non-trivial performance speedup over the state-of-the-art work, e.g., 6–8X in Monte Carlo dose calculation, and 3.5–5.5X in graphics ray tracing. Our techniques can be implemented in various parallel programming models, such as CUDA and OpenCL, and applicable on many modern GPU architectures, including NVIDIA Kepler/Maxwell, AMD GCN, and Intel Xeon Phi.
... It emerges naturally from practical sampling considerations, as interpolation can thereby be avoided along the driving axis. Prominent alternative techniques are the much-cited algorithm by Siddon [169] and variants thereof [1,73,215,23,200] (known as digital differential analyzer or DDA algorithm in the field of computer graphics), which trace lines in irregular steps from intersection to intersection with any of the raster planes perpendicular to the coordinate axes. The final objective of calculating exact ray-box intersections though can as well be achieved with driving-axis based algorithms [100,42], although the complexity increases in the 3D case. ...
... Run time performance is evaluated for projections of a cylindric volume within a cubic bounding box of 512³ voxels onto a 512² pixel detector. The performance of Algorithm 2.1 is benchmarked against the branchless DDA formulation given by [200]. The volume is stored in 32-bit floating point format in either main or texture memory of the graphics processing unit. ...
Thesis
X-ray dark-field imaging allows to resolve the conflict between the demand for centimeter scaled fields of view and the spatial resolution required for the characterization of fibrous materials structured on the micrometer scale. It draws on the ability of X-ray Talbot interferometers to provide full field images of a sample's ultra small angle scattering properties, bridging a gap of multiple orders of magnitude between the imaging resolution and the contrasted structure scale. The correspondence between shape anisotropy and oriented scattering thereby allows to infer orientations within a sample's microstructure below the imaging resolution. First demonstrations have shown the general feasibility of doing so in a tomographic fashion, based on various heuristic signal models and reconstruction approaches. Here, both a verified model of the signal anisotropy and a reconstruction technique practicable for general imaging geometries and large tensor valued volumes is developed based on in-depth reviews of dark-field imaging and tomographic reconstruction techniques. To this end, a wide interdisciplinary field of imaging and reconstruction methodologies is revisited. To begin with, a novel introduction to the mathematical description of perspective projections provides essential insights into the relations between the tangible real space properties of cone beam imaging geometries and their technically relevant description in terms of homogeneous coordinates and projection matrices. Based on these fundamentals, a novel auto-calibration approach is developed, facilitating the practical determination of perspective imaging geometries with minimal experimental constraints. A corresponding generalized formulation of the widely employed Feldkamp algorithm is given, allowing fast and flexible volume reconstructions from arbitrary tomographic imaging geometries. 
Iterative reconstruction techniques are likewise introduced for general projection geometries, with a particular focus on the efficient evaluation of the forward problem associated with tomographic imaging. A highly performant 3D generalization of Joseph's classic linearly interpolating ray casting algorithm is developed to this end and compared to typical alternatives. With regard to the anisotropic imaging modality required for tensor tomography, X-ray dark-field contrast is extensively reviewed. Previous literature is brought into a joint context and nomenclature and supplemented by original work completing a consistent picture of the theory of dark-field origination. Key results are explicitly validated by experimental data with a special focus on tomography as well as the properties of anisotropic fibrous scatterers. In order to address the pronounced susceptibility of interferometric images to subtle mechanical imprecisions, an efficient optimization based evaluation strategy for the raw data provided by Talbot interferometers is developed. Finally, the fitness of linear tensor models with respect to the derived anisotropy properties of dark-field contrast is evaluated, and an iterative scheme for the reconstruction of tensor valued volumes from projection images is proposed. The derived methods are efficiently implemented and applied to fiber reinforced plastic samples, imaged at the ID19 imaging beamline of the European Synchrotron Radiation Facility. The results represent unprecedented demonstrations of X-ray dark-field tensor tomography at a field of view of 3-4cm, revealing local fiber orientations of both complex shaped and low-contrast samples at a spatial resolution of 0.1mm in 3D. The results are confirmed by an independent micro CT based fiber analysis.
... We consider a ray l inside this plane and define the entry point (x_o, y_o) and the exit point (x_p, y_p) in a volume of interest (see Fig. 3). To compute the radiological path (1), it is necessary to compute each segment l_k and obtain the relative density [12]. ...
... With this implementation, the texture memory should be the best option to optimize local memory access [13]. Therefore, each thread must resolve the sum in (12) and fetch all c m,n k values, increasing memory overhead but reducing precision loss. ...
... For any spatial grid P, the first step is to construct a 3D line segment connecting the radiation source and the spatial grid P. The tensor grid indices are used as the coordinates of the points on the line segment. The Three-Dimensional Digital Differential Analyzer (3D-DDA) algorithm [20] is utilized to achieve the construction of the line segment, i.e., the passing grids are recorded gradually from the grid where the radiation source is located along the direction of the connecting line, and the grids are then stored in the 3D line segment set S_Line. Whether this spatial grid is occluded or not is determined by judging whether there is an intersection between the 3D line segment and the terrain set S_Terrain. ...
Article
In recent years, the three-dimensional (3D) radar detection range has played an essential role in the layout of devices such as aircraft and drones. To compensate for the shortcomings of three-dimensional calculations for radar terrain masking, a new calculation method is proposed for assessing the terrain occlusion of radar detection range. First, the high-dimensional electromagnetic data after discretization are modeled based on the tensor data structure, and the tensor grid dilation operator is constructed. Then, the dilation process begins from the overlapping section of the radar detection range and terrain, and it is adjusted by the terrain occlusion judgment factor and the dilation judgment factor to obtain the obstructed part due to the terrain. Finally, the actual radar detection range under terrain occlusion is obtained. The simulation results show that the method proposed in this paper can adapt to different grid sizes and terrain shapes, significantly enhancing computational efficiency while maintaining internal features.
... The authors of [Nguyen 2015] proposed a GPU acceleration of the ray-tracing projection and back-projection operators to increase the iterative algorithm throughput significantly. Also, the ray-tracing projector and back-projector have been widely optimized for GPU computation in order to avoid thread divergences due to multiple conditional branching [Xiao 2011, Xiao 2012, Gao 2012, Thompson 2014]. Hao Gao proposed a highly parallelizable ray-driven projector and back-projector with reduced computational complexity in [Gao 2012]. ...
Thesis
The increasing need for computing power imposed by the complexity of processing algorithms and the size of problems requires using hardware accelerators to meet time and energy constraints. FPGA architectures are known to be among the most power-efficient platforms, especially for embedded systems using hardware description languages. The appearance of the new high-level synthesis tools has been a major factor in the consideration of FPGAs for complex applications, as is the case with the manycores processors. The high-level synthesis tools generate a hardware description design from high-level languages such as C, C++, or OpenCL. The recent FPGAs are equipped with several floating-point computing units capable of meeting the precision requirements of a wide range of applications. However, exploiting the full potential of these architectures has always been a major concern. This thesis aims to explore methodologies for accelerating inverse problem algorithms on FPGA architectures through new high-level synthesis tools applied to tomographic reconstruction and radioastronomy. Indeed, many algorithms for these applications are memory-bound. A custom architecture derived from an algorithm-architecture co-design methodology has been proposed to overcome the memory bottleneck. We applied this methodology to the 3D back-projection operator in the context of iterative reconstruction. The 3D back-projector architecture takes advantage of a custom memory access strategy to reach a full computational throughput. Then we consider the parallelization of the complete optimization algorithm on FPGA. We also discuss the position of FPGAs in radio astronomy, particularly for the SKA pipeline imaging system.
... Compared with GPUs, the FPGA is costly and can complete image processing tasks without the CPU [14,15]. In parallel computing, the efficiency of FPGAs is approximately 30-100 times that of a CPU of the same cost [16]. FPGAs ...
Article
Positron emission tomography (PET) can be used to measure the internal defects of industrial parts. However, PET requires a long execution time of image reconstruction, which hinders its practical usage in industrial measurements. A novel parallel scheme based on field-programmable gate arrays (FPGAs) is proposed in this study to accelerate PET image reconstruction. A fast maximum-likelihood expectation–maximization iteration reconstruction algorithm with prior estimation is implemented on the FPGA. This method can achieve satisfactory PET images with limited iteration times. The resources in the FPGA are divided into several groups, and each group supports the image reconstruction for a single sinogram. Thus, several sinograms can be processed in parallel. Two internal defect detection experiments are conducted to apply the proposed method to industrial measurements. Results show that the inner structure can be detected, whereas the inner defects can be visualized. A group of 104 slice images is reconstructed in parallel on FPGAs, and the final 3D PET image of the inner defects is acquired in 10 s.
... Some groups have tried accelerating by efficiently distributing the work amongst the many parallel cores [20], [21], [22]. Groups like [23], [24] have implemented thread divergence strategies, prior fetch of data [18], reduction of data movement [25], using half precision floating point to reduce data transfer rate [26]. These methods have shown slight improvement in the acceleration levels, in spite of many of these modified algorithms being implemented on GPUs. ...
Article
The major requirements of a good tomographic reconstruction algorithm are reduction in radiation dosage, accurate reconstruction, detail enhancement and rapid reconstruction time. Some of these factors are covered by many algorithms, but are not collectively addressed in one. While the Maximum Likelihood Expectation Maximization (MLEM) algorithm fares well on many of these factors, it is difficult to apply this algorithm in real time due to its long execution time. Our predetermined goal is to reduce the execution time to a large extent so that the MLEM’s advantages can be leveraged by using hardware accelerators such as Field Programmable Gate Arrays (FPGAs). FPGAs are becoming especially popular as hardware accelerators and are well known for their programmability, configurability and massive parallelism through a large number of Configurable Logic Blocks (CLBs). Although FPGAs are extremely versatile, they require complex languages like Verilog or VHDL to program them. Incorporating changes in the design level at a later stage in FPGAs demands increased effort. Here, for the first time, we present a parallel structure for hardware acceleration of MLEM on the mammoth Virtex-7 VC709 FPGA. Using available tools, we also present a programming flow to design the algorithmic acceleration hardware architecture. The proposed flow does not require prior knowledge of the traditionally cumbersome Hardware Description Languages (HDLs) and this makes post design changes very easy to incorporate and validate. The parallel architecture is implemented on an FPGA operating at 220 MHz and we have achieved a 288× performance compared to an optimized software execution on an Intel Xeon workstation with 12-cores 3.1 GHz 32 GB RAM and 12 MB Cache architecture.
... Monte Carlo based ray tracing (MCBRT) appears in various applications such as graphics rendering [11], radiation dose calculation [34], and neutron transport simulation [12]. ...
Conference Paper
Monte Carlo based ray tracing (MCBRT) is the foundation of simulating the transport of particles in an inhomogeneous medium, and arises in different applications such as global illumination in graphics rendering and dose calculation in radiation therapy. Due to the computation intensive nature of MCBRT, GPUs have been extensively adopted to accelerate it. However, memory bandwidth becomes a new bottleneck for GPU-based implementations due to the lack of data locality in the MCBRT random memory access patterns. To tackle this issue and consequently improve performance of MCBRT, we present a new locality enhancing method, called LEMCBRT, on CPU-GPU heterogeneous systems. LEMCBRT is based on task partitioning and scheduling, which enhances both the spatial and temporal data locality by organizing random rays into coherent groups. We also develop a CPU-GPU pipeline scheme to reduce the overhead in such ray organization process. To show the applicability of our LEMCBRT method, we apply it to a dose calculation problem in radiation cancer treatment, achieving 6-8X speedup over the best-known GPU solutions on various clinical cases of radiation therapy.
Conference Paper
This paper presents a detailed analysis of the accuracy and performance of line marching algorithms executing on a GPU. In the context of an accurate Global Navigation Satellite System (GNSS) quality of service simulation, horizon sky-plots are a useful tool to determine satellite visibility in the presence of obstructions from objects, such as buildings or dense foliage. In order to accurately model satellite visibility at a point of interest on a map, a horizon plot can identify the viewing angles at which objects are blocking the sky. This computation requires traversing a line starting at the point of interest on a 2D altitude map, moving outward for every azimuth angle. To explore the performance of this computation, we propose a new dynamic stopping condition for the traversal of the line, benefiting from objects close to the point of interest. We compare the accuracy of common line marching algorithms, and consider their parallel performance when developed in CUDA. We find that our proposed stopping condition for line marching provides a significant improvement in performance in urban canyon sky-plots, as compared to previous work. Additionally, these results show that simpler algorithms, such as the digital differential analyzer line algorithm, are better suited for GPUs than more sophisticated schemes such as Bresenham’s algorithm, specifically in the context of sky-plot horizon computations. The trade-off between accuracy and performance is analyzed and guidelines are provided that depend on the targeted goal of the GNSS application.
Article
Reducing radiation doses is one of the key concerns in computed tomography (CT) based 3D reconstruction. Although iterative methods such as the expectation maximization (EM) algorithm can be used to address this issue, applying this algorithm to practice is difficult due to the long execution time. Our goal is to decrease this long execution time to an order of a few minutes, so that low-dose 3D reconstruction can be performed even in time-critical events. In this paper we introduce a novel parallel scheme that takes advantage of numerous block RAMs on field-programmable gate arrays (FPGAs). Also, an external memory bandwidth reduction strategy is presented to reuse both the sinogram and the voxel intensity. Moreover, a customized processing engine based on the FPGA is presented to increase overall throughput while reducing the logic consumption. Finally, a hardware and software flow is proposed to quickly construct a design for various CT machines. The complete reconstruction system is implemented on an FPGA-based server-class node. Experiments on actual patient data show that a 26.9 × speedup can be achieved over a 16-thread multicore CPU implementation.
Conference Paper
Iterative algorithms based on runs and runs of runs are presented to calculate the cells of the two-dimensional lattice intersected by a line of real slope and intercept. The technique is applied to the problem of traversing a ray through a two-dimensional grid. Using runs or runs of runs provides a significant improvement in the efficiency of ray traversal for all but very short path lengths when compared to the DDA algorithm implemented using floating or fixed point arithmetic.
Conference Paper
Branch divergence has a significant impact on the performance of GPU programs. We propose two novel software-based optimizations, called iteration delaying and branch distribution that aim to reduce branch divergence. Iteration delaying targets a divergent branch enclosed by a loop within a kernel. It improves performance by executing loop iterations that take the same branch direction and delaying those that take the other direction until later iterations. Branch distribution reduces the length of divergent code by factoring out structurally similar code from the branch paths. We conduct a preliminary evaluation of the two optimizations using both synthetic benchmarks and a highly-optimized real-world application. Our evaluation shows that they improve the performance of the synthetic benchmarks by as much as 30% and 80% respectively, and that of the real-world application by 12% and 16% respectively.
Article
Monte Carlo methods are considered as the gold standard for dosimetric computations in radiotherapy. Their execution time is, however, still an obstacle to the routine use of Monte Carlo packages in a clinical setting. To address this problem, a completely new, and designed from the ground up for the GPU, Monte Carlo dose calculation package for voxelized geometries is proposed: GPUMCD. GPUMCD implements a coupled photon-electron Monte Carlo simulation for energies in the range of 0.01-20 MeV. An analog simulation of photon interactions is used and a class II condensed history method has been implemented for the simulation of electrons. A new GPU random number generator, some divergence reduction methods, as well as other optimization strategies are also described. GPUMCD was run on a NVIDIA GTX480, while single threaded implementations of EGSnrc and DPM were run on an Intel Core i7 860. Dosimetric results obtained with GPUMCD were compared to EGSnrc. In all but one test case, 98% or more of all significant voxels passed the gamma criteria of 2%-2 mm. In terms of execution speed and efficiency, GPUMCD is more than 900 times faster than EGSnrc and more than 200 times faster than DPM, a Monte Carlo package aiming fast executions. Absolute execution times of less than 0.3 s are found for the simulation of 1M electrons and 4M photons in water for monoenergetic beams of 15 MeV, including GPU-CPU memory transfers. GPUMCD, a new GPU-oriented Monte Carlo dose calculation platform, has been compared to EGSnrc and DPM in terms of dosimetric results and execution speed. Its accuracy and speed make it an interesting solution for full Monte Carlo dose calculation in radiation oncology.
Article
The quality of images produced by Discrete Ray‐Tracing voxel spaces is highly dependent on 3d grid resolution. The huge amount of memory needed to store such grids often discards discrete Ray‐Tracing as a practical visualization algorithm. The use of an octree can drastically change this when most of space is empty, as such is the case in most scenes. Although the memory problem can be bypassed using the octree, the performance problem still remains. A known fact is that the performance of discrete traversal is optimal for quite low resolutions. This problem can be easily solved by dividing the task in two steps, working in two low resolutions instead of just one high resolution, thus taking advantage of optimal times in both steps. This is possible thanks to the octree property of representing the same scene in several different resolutions. This article presents a two step Discrete Ray‐Tracing method using an octree and shows, by comparing it with the single step version, that a substantial gain in performance is achieved.
Conference Paper
Ray tracing within a uniform grid volume is a fundamental process invoked frequently by many radiation dose calculation methods in radiotherapy. Recent advances of the graphics processing units (GPU) help real-time dose calculation become a reachable goal. However, the performance of the known GPU methods for volume ray tracing is all bounded by the memory-throughput, which leads to inefficient usage of the GPU computational capacity. This paper introduces a simple yet effective ray tracing technique aiming to improve the memory bandwidth utilization of GPU for processing a massive number of rays. The idea is to exploit the coherent relationship between the rays and match the ray tracing behavior with the underlying characteristics of the GPU memory system. The proposed method has been evaluated on 4 phantom setups using randomly generated rays. The collapsed-cone convolution/superposition (CCCS) dose calculation method is also implemented with/without the proposed approach to verify the feasibility of our method. Compared with the direct GPU implementation of the popular 3D-DDA algorithm, the new method provides a speedup in the range of 1.8-2.7X for the given phantom settings. Major performance factors such as ray origins, phantom sizes, and pyramid sizes are also analyzed. The proposed technique was also shown to lead to a speedup of 1.3-1.6X over the original GPU implementation of the CCCS algorithm.
Conference Paper
We discuss the mapping of elementary ray tracing operations (acceleration structure traversal and primitive intersection) onto wide SIMD/SIMT machines. Our focus is on NVIDIA GPUs, but some of the observations should be valid for other wide machines as well. While several fast GPU tracing methods have been published, very little is actually understood about their performance. Nobody knows whether the methods are anywhere near the theoretically obtainable limits, and if not, what might be causing the discrepancy. We study this question by comparing the measurements against a simulator that tells the upper bound of performance for a given kernel. We observe that previously known methods are a factor of 1.5-2.5X off from theoretical optimum, and most of the gap is not explained by memory bandwidth, but rather by previously unidentified inefficiencies in hardware work distribution. We then propose a simple solution that significantly narrows the gap between simulation and measurement. This results in the fastest GPU ray tracer to date. We provide results for primary, ambient occlusion and diffuse interreflection rays.
Article
Collapsed-cone convolution/superposition (CCCS) dose calculation is the workhorse for IMRT dose calculation. The authors present a novel algorithm for computing CCCS dose on the modern graphic processing unit (GPU). The GPU algorithm includes a novel TERMA calculation that has no write-conflicts and has linear computation complexity. The CCCS algorithm uses either tabulated or exponential cumulative-cumulative kernels (CCKs) as reported in literature. The authors have demonstrated that the use of exponential kernels can reduce the computation complexity by order of a dimension and achieve excellent accuracy. Special attentions are paid to the unique architecture of GPU, especially the memory accessing pattern, which increases performance by more than tenfold. As a result, the tabulated kernel implementation in GPU is two to three times faster than other GPU implementations reported in literature. The implementation of CCCS showed significant speedup on GPU over single core CPU. On tabulated CCK, speedups as high as 70 are observed; on exponential CCK, speedups as high as 90 are observed. Overall, the GPU algorithm using exponential CCK is 1000-3000 times faster over a highly optimized single-threaded CPU implementation using tabulated CCK, while the dose differences are within 0.5% and 0.5 mm. This ultrafast CCCS algorithm will allow many time-sensitive applications to use accurate dose calculation.
Article
To accelerate dose calculation to interactive rates using highly parallel graphics processing units (GPUs). The authors have extended their prior work in GPU-accelerated superposition/convolution with a modern dual-source model and have enhanced performance. The primary source algorithm supports both focused leaf ends and asymmetric rounded leaf ends. The extra-focal algorithm uses a discretized, isotropic area source and models multileaf collimator leaf height effects. The spectral and attenuation effects of static beam modifiers were integrated into each source's spectral function. The authors introduce the concepts of arc superposition and delta superposition. Arc superposition utilizes separate angular sampling for the total energy released per unit mass (TERMA) and superposition computations to increase accuracy and performance. Delta superposition allows single beamlet changes to be computed efficiently. The authors extended their concept of multi-resolution superposition to include kernel tilting. Multi-resolution superposition approximates solid angle ray-tracing, improving performance and scalability with a minor loss in accuracy. Superposition/convolution was implemented using the inverse cumulative-cumulative kernel and exact radiological path ray-tracing. The accuracy analyses were performed using multiple kernel ray samplings, both with and without kernel tilting and multi-resolution superposition. Source model performance was <9 ms (data dependent) for a high resolution (400²) field using an NVIDIA (Santa Clara, CA) GeForce GTX 280. Computation of the physically correct multispectral TERMA attenuation was improved by a material centric approach, which increased performance by over 80%. Superposition performance was improved by approximately 24% to 0.058 and 0.94 s for 64³ and 128³ water phantoms; a speed-up of 101-144X over the highly optimized Pinnacle3 (Philips, Madison, WI) implementation. 
Pinnacle3 times were 8.3 and 94 s, respectively, on an AMD (Sunnyvale, CA) Opteron 254 (two cores, 2.8 GHz). The authors have completed a comprehensive, GPU-accelerated dose engine in order to provide a substantial performance gain over CPU based implementations. Real-time dose computation is feasible with the accuracy levels of the superposition/convolution algorithm.
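As a quick sanity check on the quoted speed-up, the ratios can be recomputed from the rounded timings given in the abstract above. This is only an illustrative sketch: the function name `speedup` and the dictionary layout are assumptions made here, and because the inputs are rounded values the results land close to, but not exactly on, the reported 101-144X range.

```python
# Timings (seconds) as quoted in the abstract; values are rounded in the source.
pinnacle_s = {"64^3": 8.3, "128^3": 94.0}   # CPU reference (Pinnacle3)
gpu_s = {"64^3": 0.058, "128^3": 0.94}      # GPU superposition


def speedup(baseline_s: float, accelerated_s: float) -> float:
    """Hypothetical helper: ratio of baseline time to accelerated time."""
    return baseline_s / accelerated_s


# Speed-up per phantom size.
ratios = {size: speedup(pinnacle_s[size], gpu_s[size]) for size in pinnacle_s}

for size, ratio in ratios.items():
    print(f"{size} phantom: about {ratio:.0f}x")
```

With these rounded inputs the ratios come out at roughly 143 and 100 for the smaller and larger phantom, which agrees with the quoted range to within rounding of the timings.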
Article
Dose calculation is a key component in radiation treatment planning systems. Its performance and accuracy are crucial to the quality of treatment plans as emerging advanced radiation therapy technologies are exerting ever tighter constraints on dose calculation. A common practice is to choose either a deterministic method such as the convolution/superposition (CS) method for speed or a Monte Carlo (MC) method for accuracy. The goal of this work is to boost the performance of a hybrid Monte Carlo convolution/superposition (MCCS) method by devising a graphics processing unit (GPU) implementation so as to make the method practical for day-to-day usage. Although the MCCS algorithm combines the merits of MC fluence generation and CS fluence transport, it is still not fast enough to be used as a day-to-day planning tool. To alleviate the speed issue of MC algorithms, the authors adopted MCCS as their target method and implemented a GPU-based version. In order to fully utilize the GPU computing power, the MCCS algorithm is modified to match the GPU hardware architecture. The performance of the authors' GPU-based implementation on an NVIDIA GTX260 card is compared to a multithreaded software implementation on a quad-core system. A speedup in the range of 6.7-11.4x is observed for the clinical cases used. The less than 2% statistical fluctuation also indicates that the results of the authors' GPU-based implementation are in good agreement with those from the quad-core CPU implementation. This work shows that the GPU is a feasible and cost-efficient solution compared to alternatives such as cluster machines or field-programmable gate arrays for satisfying the increasing demands on computation speed and accuracy of dose calculation. However, there are also inherent limitations in using GPUs to accelerate MC-type applications, which are analyzed in detail in this article.