
Flying Edges: A High-Performance Scalable Isocontouring Algorithm

William Schroeder
Kitware, Inc.
Rob Maynard
Kitware, Inc.
Berk Geveci
Kitware, Inc.
Isocontouring remains one of the most widely used visualization
techniques. While a plethora of important contouring algorithms
have been developed over the last few decades, many were created
prior to the advent of ubiquitous parallel computing systems. With
the emergence of large data and parallel architectures, a rethinking
of isocontouring and other visualization algorithms is necessary to
take full advantage of modern computing hardware. To this end
we have developed a high-performance isocontouring algorithm
for structured data that is designed to be inherently scalable. Pro-
cessing is performed completely independently along edges over
multiple passes. This novel algorithm also employs computational
trimming based on geometric reasoning to eliminate unnecessary
computation, and removes the parallel bottleneck due to coincident
point merging. As a result the algorithm performs well in serial or
parallel execution, and supports heterogeneous parallel computa-
tion combining data parallel and shared memory approaches. Fur-
ther it is capable of processing data too large to fit entirely inside
GPU memory, does not suffer additional costs due to preprocessing
and search structures, and is the fastest non-preprocessed isocon-
touring algorithm of which we are aware on shared memory, multi-
core systems. The software is currently available under a permissive, open-source license in the VTK visualization system.
Index Terms: I.3.1 [Computing Methodologies]: Computer Graphics—Parallel processing; I.3.5 [Computing Methodologies]: Computational Geometry and Object Modeling—Geometric algorithms

1 Introduction

The current computing era is notable for its rapidly increasing data
size, and the rapid evolution of computing systems towards mas-
sively parallel architectures. The increasing resolution of sensors
and computational models has driven data size since the earliest
days of computing, and is now reaching the point where paral-
lel approaches are absolutely essential to processing the resulting
large data. However, despite the widespread acknowledgement that
parallel methods are essential to future computational advances,
many important visualization algorithms in use today were devel-
oped with serial or coarse-grained data parallel computing models
in mind. Such algorithms are generally not able to take effective
advantage of emerging massively parallel systems, and therefore
struggle to scale with increasing data size. Towards this end, we
are challenging ourselves to rethink approaches to visualization al-
gorithms with the goal of better leveraging emerging parallel hard-
ware and systems. In this paper we revisit isocontouring, one of the most important and well-studied visualization techniques in use today.
1.1 Considerations
Modern computational hardware is moving towards massive paral-
lelism. There are a number of reasons for this including physical
constraints from power consumption and frequency scaling con-
cerns [13, 25] which limit the speed at which a single processor can
run. At the same time chip densities continue to increase, mean-
ing that manufacturers are adding hardware in the form of addi-
tional computing cores and supporting infrastructure such as mem-
ory cache and higher-speed data buses. The end result, however, is
that computing systems are undergoing dramatic change, requiring
algorithmic implementations to evolve and reflect this new compu-
tational reality.
Taking advantage of massive parallelism places particular bur-
dens on software implementations. Typically data movement is dis-
couraged, requiring judicious use of local memory cache and man-
agement of the memory hierarchy. Computational tasks and data
structures are simplified so that they may be readily pushed onto
pipelined hardware architectures. Conditional branching is discour-
aged as this tends to result in idle processing cores and interrupts
concurrent execution; and bottlenecks due to I/O contention signifi-
cantly inhibit potential speed up. In general, an algorithm performs
best if it is implemented with simple computational threads (even if
this means extra work is performed); executes independently across
local, cached memory; and reads and writes data from pre-allocated
memory locations avoiding I/O blocking whenever possible.
We believe that it is important to reimagine essential visualiza-
tion algorithms with this new computing reality in mind. While ef-
forts are underway to develop and extend programming languages
which can automatically parallelize algorithms, despite decades of
effort only limited success has been demonstrated [26], and today’s
computational challenges require solutions now. One especially
important visualization algorithm is isocontouring, which reveals
features of underlying scalar fields. However, despite years of re-
search, many implementations of this technique have not been de-
signed in the context of massive parallelism, and are hobbled by ex-
cessive data movement and related bottlenecks, or serial execution
dependencies. For example, naive implementations of the popular
Marching Cubes (MC) [18] algorithm process interior voxel edges
four times, and visit interior data points eight times. In addition,
merging coincident points often introduces a serial bottleneck (due
to the use of a spatial or edge-based locator), and the output of the
algorithm is data dependent, meaning that arrays must be dynami-
cally allocated and that there is no clear mapping of input to output
data, resulting in yet another parallel processing bottleneck.
1.2 Motivation and Related Work
As isocontouring is one of the most useful and important visual-
ization techniques, there is a vast collection of literature addressing
a variety of different techniques. The publication of the MC algorithm spurred a significant body of productive research, of which [3, 32] are representative examples; more recently the introduction of GPGPU computation has produced many excellent highly parallel methods [8, 11, 15, 20, 21, 22]. Despite these many advances,
however, we desired to create a general algorithm amenable to par-
allelization on a variety of hardware systems (e.g., CPU and GPU)
with the ability to handle large data, especially given that we often
encounter data that is too large to fit on a single GPU (given current
technology). We also wanted to investigate methods to remove par-
allel computing bottlenecks in MC, which while easily parallelized,
suffers from the inefficiencies noted previously.
Our visualization workflow also differs from what is often assumed in much of the literature. While many existing isocontouring
algorithms are used in an exploratory manner to reveal the structure
of scalar fields, our usual approach is to process very large data us-
ing known isovalues, with the goal of generating the corresponding
isosurfaces as fast as possible, avoiding the extra costs of build-
ing and storing supplemental acceleration structures. For example,
medical CT imaging techniques produce data in which known iso-
values correspond to different biological structures such as skin and
bone. Similarly, in computational fluid dynamics various functions
such as mass density or Q-criterion are well understood and pre-
vious experience typically suggests appropriate isovalues. It is not
uncommon for our datasets to exceed GPU memory, thereby incurring significant transfer overhead across the associated PCIe bus, as we often process large volumes on the order of 2048^3 resolution with double-precision scalar values (e.g., 9 GB for 1024^3 doubles, or 68 GB for 2048^3 doubles). Thus many of the exploratory isocontouring techniques, which are typically based on preprocessing the data to build a rapid search structure such as an octree [30], interval tree [7], or span space [17], are not suitable to our workflow. Such preprocessing steps may add significant time to what is
a non-exploratory workflow, while also introducing complex, auxil-
iary data structures which often consume significant memory and/or
disk resources.
Out-of-core isosurfacing techniques [6, 29] may be problematic
too. While these algorithms are designed to process data that is
much larger than machine memory, the cost of I/O (e.g., writing
out entire solutions) at extreme scale can be prohibitive [19, 25].
Instead, we often revert to in situ processing [2], meaning that vi-
sualization algorithms are executed alongside the running simula-
tion or data acquisition process. The benefit of this approach is
that expensive I/O can be significantly reduced by extracting key
features such as isosurfaces, slices, streamlines, or other solution
features, each of which is significantly smaller than the full dataset.
Using such methods, researchers can intelligently instrument (and
conditionally program) their simulation runs to produce meaning-
ful feature extracts that focus in on key areas of interest, avoiding
the need to write out an entire output dataset [1]. Hence simple iso-
contouring algorithms that can be easily executed in situ with the
simulation are essential for larger data.
Given these considerations, the motivating goal of this work was
to develop a fast, scalable isocontouring algorithm requiring no pre-
processing or additional I/O. We also challenged ourselves to de-
velop simple, independent algorithmic tasks that would scale well
on a variety of different types of massively parallel hardware.
In this section we begin by providing a high-level overview of the
Flying Edges (FE) algorithm, including introducing descriptive no-
tation. Next we address key algorithmic features. Finally we ad-
dress key implementation details.
2.1 Notation
The algorithm operates on a structured grid with N scalar values arranged on a topologically regular lattice of data points of dimensions (l × m × n) with values s_ijk. Grid cells are defined from the eight adjacent points associated with scalar values in the i+(0,1), j+(0,1), k+(0,1) directions. Grid edges run in the row E_jk, column E_ik, and stack E_ij directions, and are composed of the cell edges e_ijk; grid cell rows R_jk consist of all cells v_ijk touching both E_jk and E_(j+1)(k+1). So for example, an x-row edge of length n with 0 ≤ i < n:

E_jk = ∪_i e_ijk  and  R_jk = ∪_i v_ijk .  (1)

Figure 1: The cell axes a_ijk coordinate system. The algorithm processes the a_ijk in parallel along x-edges E_jk to count intersections and the number of output triangles. First and last intersections along E_jk are used to perform computational trimming when generating output in parallel over the grid cell rows R_jk.

Each v_ijk has an associated cell axes a_ijk, which refers to the three cell edges emanating from the point located at s_ijk in the positive row x, column y, and stack z directions. Refer to Figure 1.
The purpose of the algorithm is to generate an approximation to the isocontour Q. The isocontour is defined by an isovalue q: Q(q) = {p | F(p) = q}, where F(p) maps the point p ∈ R^n to a real-valued number (i.e., the isovalue q). The resulting approximation surface S is a continuous, piecewise-linear surface such as a triangle mesh (in 3D). Note also that we say that Q intersects e_ijk when some p lies along the cell edge e_ijk. We assume that the scalar value varies linearly over e_ijk, so Q may intersect the edge at most once.
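Under this linear-variation assumption, locating the intersection of Q with a cell edge reduces to a single interpolation. The following sketch illustrates this (the function name and tuple-based point representation are our own for illustration, not the paper's or VTK's API):

```python
def edge_intersection(p0, p1, s0, s1, q):
    """Return the point where the isocontour Q (isovalue q) crosses the
    edge (p0, p1), assuming the scalar varies linearly from s0 to s1;
    None if the edge is not crossed."""
    # A crossing exists only when the endpoint classifications differ
    # (semi-open interval behavior: a vertex exactly at q counts as >= q).
    if (s0 < q) == (s1 < q):
        return None
    t = (q - s0) / (s1 - s0)  # linear interpolation parameter along the edge
    return tuple(a + t * (b - a) for a, b in zip(p0, p1))
```

Because the scalar varies linearly, at most one such point exists per edge, matching the assumption stated above.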
2.2 Overview
Highly organized structured data lends itself to a variety of parallel approaches. Our method, which is based on the independent processing of grid edges, provides a number of benefits:

- The algorithm takes advantage of cache locality by processing data in the fastest-varying data direction (i.e., assuming that an ijk grid varies fastest in the i-direction, the edges oriented along the i data rows, or, based on the notation above, the voxel cell x-edges e_ijk);
- it separates data processing into simple, independent computational steps that eliminate memory write contention and most serial, synchronous operations;
- it reduces overall computation by performing a small amount of initial extra work (i.e., computational trimming) to limit the extent of subsequent computation;
- it ensures that each voxel edge is only intersected once by using the cell axes a_ijk to control iteration along the grid edges E_jk;
- it eliminates the need for dynamic memory reallocation, since output memory is allocated only once;
- and the algorithm eliminates the parallel bottlenecks due to point merging.
While others have used edge-based approaches to characterize the quality of isocontour meshes [10], or to ensure watertight meshes using octree edge-trees [16], we use our edge-based approach as an organizing structure for parallel computation, both as a means of traversing data and as a way of creating independent computational tasks.
One of the most challenging characteristics of any isocontouring
algorithm is that it produces indeterminate output, i.e., the number
of output points and primitives (e.g., triangles) is not known a pri-
ori. This presents a challenge to parallel implementations since pro-
cessing threads work best when output memory can be preallocated
and partitioned to receive output. So for example algorithms such
as MC–which on first glance seem embarrassingly parallel–suffer
significant performance degradation when inserting output entities
into dynamic output lists due to repeated memory reallocation. A
natural way to address this challenge is to use multiple passes to
first characterize the expected output by counting output points and
primitives, followed by additional passes to actually generate the
isocontour. The key is to minimize the effort required to configure
the output. Our approach also takes advantage of the multi-pass ap-
proach to guide and dramatically reduce subsequent computation,
thus the early passes can be considered as a form of preprocessing
to set up later passes for optimized computation.
The algorithm is implemented in four passes: only the first pass
traverses the entirety of the data, the others operate on the resulting
local metadata or locally on rows to generate intersection points
and/or gradients. At a high level, these passes perform the follow-
ing operations:
1. Traverse the grid row-by-row to compute grid x-edge cases; count the number of x-intersections; and set the row computation trim limits.
2. Traverse each grid cell row, between adjusted trim limits, and count the number of y- and z-edge intersections on the a_ijk, as well as the number of output triangles generated.
3. Sum the number of x-, y-, and z-points created across the a_ijk of each row, and the number of output triangles; allocate output arrays.
4. Traverse each grid cell row, between adjusted trim limits, and using the a_ijk generate points and produce output triangles into the preallocated arrays.
The first three passes count the number of output points and trian-
gles, determine where the contour proximally exists and set compu-
tational trim values, and produce the metadata necessary to generate
output. In the fourth and final pass, output points and primitives are
generated and directly written into pre-allocated output arrays with-
out the need for mutexes or locking. In the following we describe
each pass in more detail and then follow with a discussion of key
implementation concepts.
Pass 1: Process grid x-edges E_jk. For each grid x-edge E_jk, all cell x-edge intervals (i.e., cell edges e_ijk) composing E_jk are visited and marked when their interval intersects Q (i.e., an edge case number is computed for each e_ijk). The left and right trim positions x_Ljk and x_Rjk are noted, as well as the number of x-intersections along E_jk. Each E_jk is processed independently and in parallel. (Note that the trim position x_Ljk indicates where Q first intersects E_jk on its left side, and x_Rjk indicates where Q last intersects E_jk on its right side. Additional trimming details will be provided shortly.)
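Pass 1 can be pictured as a single sweep along one grid x-edge. A minimal sketch of this sweep follows (a hypothetical helper, not the VTK implementation; the 2-bit case encoding is one plausible choice):

```python
def process_x_edge(row, q):
    """Pass 1 sketch: classify the cell x-edges e_ijk along one grid
    x-edge E_jk against isovalue q. Returns the per-edge case numbers,
    the x-intersection count, and the (left, right) trim positions."""
    above = [s >= q for s in row]          # per-vertex classification
    cases = []
    n_x, x_left, x_right = 0, None, None
    for i in range(len(row) - 1):
        # 2-bit edge case from the two endpoint classifications
        case = (above[i] << 0) | (above[i + 1] << 1)
        cases.append(case)
        if case in (1, 2):                 # endpoints straddle q: intersection
            n_x += 1
            if x_left is None:
                x_left = i                 # leftmost intersected cell edge
            x_right = i                    # rightmost intersected cell edge
    return cases, n_x, x_left, x_right
```

Each E_jk can be handed to a separate thread since the sweep touches only its own row of scalars and writes only its own metadata.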
Pass 2: Process grid rows R_jk. For each grid row R_jk, the cells v_ijk between the adjusted trim interval [x̄_Ljk, x̄_Rjk) are visited and their MC case numbers are computed from the edge-based case table from Pass 1 (the original s_ijk values are not reread). Knowing the cell case value c_ijk, it is possible to perform a direct table lookup to determine whether the y- and z-cell axes edges a^y_ijk and a^z_ijk intersect the isocontour, incrementing the y- and z-cell intersection counts as appropriate. In addition, the c_ijk directly indicates the number of triangles generated as the contour passes through v_ijk. Again, each R_jk can be processed in parallel. At the conclusion of the second pass, the number of output points and triangles is known and is used to allocate the output arrays in the next pass. Note that because the adjusted trim interval may be empty, it is possible to rapidly skip data (rows and entire slices) via computational trimming (see Figure 2).
Note that while the algorithm is designed to operate across x-edges, due to this being the (typically) fastest-varying data direction, it can readily be recast using y- or z-edges. In such cases, the results remain invariant, as the generated MC cases (after combining the edge cases in the proper order) are equivalent.

Figure 2: A trimmed contour after the second pass (in 2D). On the right side is a metadata array that tracks information describing the interaction of the contour with the E_jk edges, including the number of x- and y-edge intersections, the number of output primitives generated, and trim positions. Trimming can significantly reduce computational work.
Pass 3: Configure output. At the conclusion of Pass 2, for each R_jk the number of x-, y-, and z-point intersections is known, as well as the number of output triangles. The third pass accumulates this information on each E_jk, assigning starting ids for the x-, y-, and z-points and triangles to be generated in the final pass along each R_jk. Note that this accumulation pass can be performed in parallel as a prefix sum operation in O(m/t + log(t)), where m is the number of E_jk and t is the number of threads of execution [4]. (Using the scan or prefix sum operation is reminiscent of [11, 12], which use this operation to generate output offsets. Here the scan is used to accumulate offsets on a per-edge basis.)
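The offset-assignment step above is an exclusive prefix sum over the per-row counts. A serial sketch conveys the idea (the function name is ours; a production version would use a parallel scan as cited):

```python
from itertools import accumulate

def assign_offsets(counts):
    """Pass 3 sketch: an exclusive prefix sum over per-row output counts
    (points or triangles) yields the starting id for each row, so that
    Pass 4 can write into preallocated arrays without locking. The final
    running total is the allocation size for the output array."""
    running = [0] + list(accumulate(counts))
    return running[:-1], running[-1]   # per-row starting ids, total count
```

Each row then owns the disjoint id range [start, start + count), which is what makes lock-free output writes possible in the final pass.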
Pass 4: Generate output across R_jk. In the final pass, each R_jk is processed in parallel by moving the a_ijk along the row within the adjusted trim interval [x̄_Ljk, x̄_Rjk), computing x-, y-, and z-edge intersection points on the cell axes as appropriate. This requires reading some of the s_ijk again to compute gradients (if requested) and interpolate edge intersections with Q. Triangles are also output (if necessary) as each cell is visited. Again, the adjusted trim edge interval is used to rapidly skip data, with the cell case values c_ijk indicating which (if any) points need be generated on the a_ijk. Because the previous pass determined the starting triangle and point ids for each R_jk, an increment operator (based on the edge use table U(c_i)) is used to update point and triangle ids as each cell is processed along the row (Figure 5). This eliminates the need to merge points: points are generated only once and written directly into the previously allocated output arrays without memory write contention. It should be noted that the generation of intersection points and gradients, and the production of triangle connectivity, can proceed independently of one another for further parallel optimization.
Assuming that the grid is of dimension n^3, the algorithm complexity is driven by the initial pass over the grid values, O(n^3), invoking n^2/t total threads of execution. The second pass traverses the x-edge classifications, generally of size less than O(n^3) due to computational trimming, also with n^2/t total thread executions. The third pass sums the number of output primitives as described previously in O(n^2/t + log(t)). The fourth and final pass processes up to n^2/t threads, although trimming often reduces the total workload significantly.
In the following section we highlight and discuss some of the im-
plementation features of the Flying Edges algorithm.
Figure 3: An edge-based case table combines four x-edge cases to produce an equivalent Marching Cubes vertex-based case table. Edge cases can be computed independently in parallel and combined when necessary to generate a MC case when processing R_jk.
3.1 Edge-Based Case Tables
Voxel-based contour case tables were popularized by MC and have been extended to other topological cell types such as tetrahedra [28]. Computing a case number c_ijk for a cell involves comparing the scalar field value at each cell vertex against the isocontour value q (a vertex bit is set to 1 if s_ijk ≥ q, and 0 otherwise), and then performing a shifted logical OR operation over the eight vertex bits to determine a case value 0 ≤ c_ijk < 256. In naive implementations of MC, many unnecessary accesses to and comparisons against a particular vertex scalar value s̄ occur as the algorithm marches from cell to adjacent cell. In Flying Edges, the s_ijk are typically accessed only once, unless a particular s̄ is required for subsequent computation (e.g., gradient computation or edge interpolation). This is because the total number of data values N is usually much greater than the number of s_ijk which actually contribute to the computation of the isocontour.
Case computation is performed in parallel along each grid edge E_jk. The edge case for each e_ijk composing the E_jk is determined from the scalar values of its two end points, with 2^2 total states possible. The resulting case value e^c_jk indicates whether each cell x-edge intersects the isosurface, and if so (as described previously in Pass 1), the number of x-intersections is incremented by one. During subsequent processing, the four edge case values {e^c_jk, e^c_(j+1)k, e^c_j(k+1), e^c_(j+1)(k+1)} can be combined to determine the cell case value c_ijk (see Figure 3). The edge-based case table is equivalent to the standard MC table, which considers the states of eight vertices as compared to four parallel cell x-edges. We use an edge-based case table because it removes computational dependencies which degrade parallel performance, allowing the computation of edge cases to proceed completely independently.
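Combining the four 2-bit edge cases into an 8-bit vertex case is a few bit operations. The sketch below shows the idea; the bit layout (which edge's bits land where) is illustrative only, since a real table must match the vertex numbering of the MC triangulation table being used:

```python
def mc_case_from_edge_cases(ec_jk, ec_j1k, ec_jk1, ec_j1k1):
    """Sketch of combining four 2-bit x-edge cases (for the edges
    E_jk, E_(j+1)k, E_j(k+1), E_(j+1)(k+1) bounding a cell) into an
    8-bit MC-style vertex case. Bit placement here is illustrative."""
    return (ec_jk
            | (ec_j1k  << 2)
            | (ec_jk1  << 4)
            | (ec_j1k1 << 6))
```

Because each edge case already encodes its two endpoint classifications, no scalar values need to be reread when forming the cell case.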
Further efficiencies in the algorithm result from exploiting the implicit topological information contained in each MC case c_i. Given a particular case number, the c_i defines the number of points and triangles that will be generated, and exactly which edges are involved in the production of the output primitives. An important construct is the edge use table U(c_i), which for each c_i indicates which of the 12 voxel edges intersect Q. By using this information efficiently it is possible to rapidly perform operations such as counting the number of y- and z-edge intersections, and incrementing point and triangle ids across cell rows (during Pass 4) to generate crack-free isocontours without a spatial locator.
3.2 Computational Trimming
As used here, computational trimming is the process of pre-
computing information or metadata in such a way as to reduce the
need for subsequent computation, thereby reducing the total com-
putational load. More specifically, in the Flying Edges algorithm,
computational trimming is used to significantly reduce the total ef-
fort necessary to generate the isocontour. It is possible to rapidly
skip portions of rows, entire rows, and even entire data slices while
computing. This is accomplished by taking advantage of the topo-
logical characteristics of the continuous, piecewise linear contour
surface S, and using the first x-edge interval pass to bound the com-
putations necessary to perform later passes (including interpolation
and primitive generation). In practice, computational trimming sig-
nificantly accelerates algorithm performance.
Computational trimming takes advantage of the information available from the initial pass, in which x-edge intervals indicate something about the location of the isocontour. For example (Figure 4), if there are no e_ijk intersections along any of the four E_jk that bound a particular cell row R_jk, then an isocontour exists in those cells along R_jk if and only if it passes through some y-edge (2D) and/or z-edge (3D). In such circumstances, because Q is continuous, a single intersection check with any y-edge (in 2D) or y-z cell face (3D) is enough to determine whether the isocontour intersects any part of R_jk. A similar argument can be made over contiguous runs of cells in R_jk, in a manner reminiscent of run-length encoding. In our implementation, to reduce algorithmic complexity and memory overhead, we chose to keep just two trim positions for each E_jk: the trim position on the left side where the isocontour first intersects E_jk, and the trim position on the right side where the isocontour last intersects E_jk.
While the first pass identifies intersected x-edges (i.e., generates edge case values) and determines trim positions along E_jk, subsequent passes process cell rows R_jk, which (in 3D) are controlled by the four grid edges {E_jk, E_(j+1)k, E_j(k+1), E_(j+1)(k+1)}. Thus to process cell rows, an adjusted trim interval [x̄_Ljk, x̄_Rjk) is computed, where x̄_Ljk is the leftmost trim position and x̄_Rjk is the rightmost trim position along the four edges defining R_jk. Additionally, intersection checks are made with the cell faces at the positions x̄_Ljk and x̄_Rjk to ensure that the isocontour is not running parallel to the four E_jk. If any intersection is found, the adjusted trim positions are reset to the leftmost (x̄_Ljk = 0) and/or rightmost (x̄_Rjk = (n−1)) locations (assuming the grid x-dimension is n), as the isocontour must run to the grid boundary.
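The interval adjustment described above amounts to a min/max over the four bounding edges' trim positions. A sketch under our own encoding assumptions follows (trim positions as cell-edge indices or None for an edge the contour misses; the parallel-contour face checks are omitted here):

```python
def adjusted_trim_interval(edge_trims, n):
    """Sketch of computing the adjusted trim interval [x̄_L, x̄_R) for a
    cell row from the (left, right) trim positions of its four bounding
    x-edges. Edges with no intersection contribute nothing. The face
    checks that guard against a contour running parallel to the row
    (which would force the interval open to the grid boundary) are not
    modeled in this sketch."""
    lefts  = [l for (l, r) in edge_trims if l is not None]
    rights = [r for (l, r) in edge_trims if r is not None]
    if not lefts:
        return (0, 0)  # empty interval: the whole row can be skipped
    # Rightmost intersected cell edge is inclusive, so the semi-open
    # interval extends one past it (clamped to the row length).
    return (min(lefts), min(max(rights) + 1, n))
```

An empty interval is exactly the case that lets entire rows, and hence entire slices, be skipped during Passes 2 and 4.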
3.3 Point Merging
In many isocontouring algorithms such as MC and its derivatives,
each interior cell edge is visited four times (since in a structured
grid four cells share an edge). If the edge intersects the isocontour
Q, four coincident points will be generated, requiring additional
work and resources to interpolate and then store the points. In im-
plementation, it is common to use some form of spatial locator or
edge-based hash table to merge these coincident points to produce
a crack-free polygonal surface. While there are algorithmic concerns from the use of the locator (e.g., extra memory requirements), probably the bigger impact is the computational bottleneck introduced by such an approach in a parallel environment. Multiple threads feeding into a single locator require data locking, which negatively impacts performance. Some implementations address this issue by instantiating multiple locators, one in each subsetted grid block, which are processed independently; these can then be merged at the conclusion of surface generation at the additional cost of a parallel, tree-based compositing operation.

Figure 4: A 2D depiction of an adjusted trim edge interval. Shown are the trim positions of intersection with the isocontour Q to the furthest left x̄_Ljk and furthest right x̄_Rjk on the cell row R_jk. The trim positions, in combination with topological reasoning about the location of the continuous Q, are used to eliminate unnecessary computation.

Figure 5: Processing a cell row R_jk to generate output, eliminating point merging. At the beginning of each cell row, the edge metadata (determined from the prefix sum of Pass 3) contains the starting output triangle ids and the x-, y-, and z-point ids, and is used to initialize a cell row traversal process (illustrated at the top of the figure and shown in 2D). Then, to traverse to the next adjacent cell along R_jk, the edge use table U(c_i) is applied to increment the point ids appropriately. The a_ijk are used to generate the new points, while the other points are simply referenced, as they will be produced in a separate thread. Newly generated points and triangles are directly written into previously allocated memory using their output ids.
Flying Edges eliminates all of these problems. First, because edges are processed by traversal of the cell axes a_ijk across R_jk, each of the x, y, z cell edges is interpolated only once, generating at most one point per cell edge. Second, because the number of triangles, and the number of x-edge, y-edge, and z-edge intersection points (and hence point ids), is known along all cell rows at the outcome of the third pass, data arrays of the correct size can be preallocated to receive the output data. Further, as cell rows R_jk are processed in the fourth pass, the c_ijk implicitly defines how to increment each of the point ids as traversal moves from cell to neighboring cell along R_jk (Figure 5). Geometry (interpolated point coordinates) and topology (triangles defined by three point ids) can be computed independently and directly written to the preallocated and partitioned output. Thus a locator or hash table is not required, thereby eliminating the point-merging bottleneck while producing a crack-free S across multiple threads.
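The id-increment step that replaces point merging can be reduced to a tiny operation per cell. The sketch below assumes a hypothetical encoding in which the relevant entries of U(c_i) are three booleans, one per cell-axes edge:

```python
def advance_point_ids(ids, axes_edge_used):
    """Sketch of the Pass 4 id-increment step. `ids` holds the current
    x-, y-, and z-point ids at a cell; `axes_edge_used` holds the three
    edge-use flags from U(c_i) for the cell-axes edges (hypothetical
    encoding). New points are written at the current ids, and an id
    advances only when its edge actually produced a point."""
    return tuple(i + int(used) for i, used in zip(ids, axes_edge_used))
```

Because every cell along the row advances the ids deterministically from the row's starting ids, neighboring cells reference the same point id for a shared edge, yielding crack-free output with no locator.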
3.4 Cell Axes on the Grid Boundary
The a_ijk are key to efficient iteration over the grid, since they ensure that every cell edge is processed only one time. However, on the positive x, y, and z grid boundaries the a_ijk are incomplete, requiring special handling. For example, when an x-edge terminates on the positive y-z grid face, the last point along the E_jk is not associated with any cell, and hence the a_ijk is undefined. Conceptually we create a partial a_ijk consisting of just the y and z edges and process them. Similar, but more complex, situations occur for the E_jk located near the x-y and x-z grid boundaries. In implementation, dealing with these special cases adds some complexity to what is a relatively simple algorithm. Alternative ways to deal with this complexity are to pad the volume on the positive grid boundaries, or to skip processing the last layer of cells on the positive grid boundaries.
3.5 Degeneracies
Degeneracies occur when one or more cell vertices take on the value of the isocontour. Much of the literature ignores degenerate situations due to the assumed rarity of such occurrences, but in practice they are common. For example, integral-valued or limited-precision scalar fields are finite in extent, and/or the choice of isovalue is frequently set to key field values, which produces degeneracies.

Degeneracies typically result in the generation of non-manifold isocontours, or the creation of zero-length (line) or zero-area (triangle) primitives. This is typically due to the isocontour “nicking” adjoining edges, at a common vertex, to produce such degenerate primitives. If the purpose of contour generation is strictly visual display, then such situations matter little, as these primitives are generally handled properly during rendering. However, if the isocontour is to be processed in an analysis pipeline (e.g., topological analysis, subsequent interior mesh generation), degeneracies can introduce difficulties into the workflow. In many implementations of MC, degenerate primitives are culled prior to insertion into dynamic output arrays. However, in FE the initial passes count the number of output points and triangles and then allocate memory for them; once degenerate primitives are identified, it is not possible to modify the preallocated output arrays.
In the algorithm outlined here, the treatment of degeneracies can
be expanded through modification of the edge-based case table and
the way in which it is calculated. As described previously, edge bits are set when a cell edge intersects the isosurface using a semi-open interval check; in the expanded case table an open edge interval is used instead (i.e., each s_ijk is classified in one of three ways: equal to, less than, or greater than q). Thus the number of entries in a vertex-based case table for voxel cells would increase to a total of 3^8 entries (versus 2^8 for standard MC). While we have
experimented with such alternative case tables in 2D, our workflow
is such that an expanded case table is not needed at this time (a topic
for future research).
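The growth of the expanded table can be illustrated with a small sketch (the function name and return encoding are ours, not the VTK implementation):

```python
# Sketch of the classification change described above: with a semi-open
# interval each vertex has two states (below q, or at/above q), giving
# 2^8 voxel cases; with an open interval each vertex has three states
# (less than, equal to, greater than q), giving 3^8 cases.

def classify_vertex(s, q):
    """Three-way vertex classification used by an expanded case table."""
    if s < q:
        return -1          # less than the isovalue
    if s > q:
        return 1           # greater than the isovalue
    return 0               # degenerate: vertex lies on the isocontour

standard_cases = 2 ** 8    # 256 entries, standard MC
expanded_cases = 3 ** 8    # 6561 entries, degeneracy-aware table
```

The zero return value is exactly the degenerate state that the standard two-way classification cannot represent.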
3.6 Load Balancing
Many parallel isocontouring algorithms devote significant effort to-
wards logically subdividing and load balancing the computation.
Typically approaches include organizing volumes into rectangular
sub-grids or using a spatial structure such as an octree to manage
the flow of execution [33, 9]. The challenge with isocontouring
is that a priori it is generally not possible to know through which
portion of the volume the isosurface will pass, meaning that some
form of pre-processing (evaluating min-max scalar region within
sub-volumes for example), or task stealing is used to balance the
workload across computational threads. In the Flying Edges algo-
rithm, the basic task is processing of an edge, in which each edge
(or associated cell row) can be processed completely independently.
In our implementation, we chose the Threading Building Blocks (TBB) library, using its parallel_for construct to loop in parallel over all edges [14]. Behind the scenes, TBB manages a thread
pool to process this list of edges, and since the workload across an
edge may vary significantly, new edge tasks are stolen and assigned
to the thread pool in order to ensure continued processing. Note
that the algorithm does not require mutexes or other locks as the
output data is allocated and partitioned prior to data being written.
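The lock-free output pattern described above can be sketched as follows (a minimal illustration using a Python thread pool in place of TBB's parallel_for; the counts, offsets, and row function are hypothetical, not the VTK implementation):

```python
# Because point/triangle counts are known from the counting passes,
# each edge-row task writes into its own pre-partitioned slice of the
# output array, so no mutexes or locks are needed.

from concurrent.futures import ThreadPoolExecutor

counts = [2, 0, 3, 1]                # intersections per edge row (from Pass 1)
offsets = [0]
for c in counts:
    offsets.append(offsets[-1] + c)  # exclusive prefix sum -> write positions

out = [None] * offsets[-1]           # allocated once, before any writing

def process_row(j):
    base = offsets[j]
    for i in range(counts[j]):       # each row writes a disjoint range
        out[base + i] = (j, i)

with ThreadPoolExecutor() as pool:
    list(pool.map(process_row, range(len(counts))))
```

Each task touches only its own `[offsets[j], offsets[j] + counts[j])` range, which is what makes the parallel writes safe without synchronization.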
3.7 Memory Overhead
The algorithm computes and stores information in the first two
passes. X-edge cases are described using two bits due to the four possible states defined by the two end-vertex classifications. Thus with total grid size N, approximately 2N bits of additional memory are consumed to represent the x-edge cases (compared to our typical grids with a 32-bit float or integer per s_ijk). In addition, edge metadata is stored for each E_jk, which consists of six integer values: the number of x-, y-, and z-cell axes edge intersections; the number of triangles produced along the associated R_jk; and the left and right trim positions. Assuming that the grid is of dimensions n^3, the metadata requires 6n^2 integer values (compared to the n^3 = N grid size). Thus the total memory overhead in bytes, assuming eight-byte integers, is 8·6n^2 + 2·n^3/8.
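As a worked example of the overhead formula (a sketch under our own naming, not the VTK code), a 2048^3 grid of 32-bit scalars incurs roughly a 7% memory overhead:

```python
# Worked example of the estimate above: six 8-byte metadata integers per
# (j,k) row (8 * 6n^2 bytes) plus 2 bits per x-edge case for ~n^3 edges
# (2 * n^3 / 8 bytes).

def overhead_bytes(n):
    return 8 * 6 * n**2 + 2 * n**3 // 8

n = 2048
grid_bytes = 4 * n**3                   # one 32-bit scalar per s_ijk
ratio = overhead_bytes(n) / grid_bytes  # ~0.068, i.e. about 7% overhead
```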
4 Results
In the following we characterize the performance of the algorithm
against four different datasets. First we compare the performance
across typical implementations. Then we examine the internal per-
formance of the algorithm, quantifying the time to execute each
pass of the algorithm, and the benefit of computational trimming.
4.1 Performance
The performance of three algorithms, Marching Cubes (MC), Synchronized Templates (ST), and Flying Edges (FE), is compared in this section as the datasets, the number of threads of execution, and the data sizes are varied. Note that the algorithms are not controlled to produce identical results; rather the results reflect what is typically done in practical application. MC and ST are executed using a data parallel approach, meaning that each thread executes a
pipeline across a subsetted block of data, performing point merging
within the piece but not across piece boundaries (hence there are
topological seams between sub-volumes). In comparison FE does
not produce such seams but can produce degenerate triangles on
output (in other words, we used conventional MC-type case tables
that do not address degenerate conditions). Also note that in this
work we focused on threaded, multi-core implementations as the
data size exceeds the capacity of GPUs (2048^3 with double-precision scalars).
(A quick note regarding Synchronized Templates. This algo-
rithm has long been the fastest non-preprocessed algorithm in the
VTK system, initially implemented by K. Martin in VTK [23] in
the year 2000. Similar to FE, ST uses a_ijk to ensure that each cell edge is processed at most once, and it does not use a spatial locator. It also uses efficient, but serial, programming constructs to
ensure high performance computing.)
To characterize the performance of FE, we ran it against four
different datasets (Figure 6). The test system was an Intel E5-2699
Xeon Workstation, with 2 × 18 = 36 hardware cores for a total of 72 threads, and 64 GB of memory. These datasets are as follows,
listed in order from smallest to largest in size:
CT-angio. This dataset is part of the 3D Slicer data distribution. It is an anonymized CT angiography scan at a resolution of (512 × 512 × 321), with 16-bit scalar values. We used q = 100. Find the data at [27].
Supernova. This supernova simulation data is sized 432^3 with a
32-bit float scalar type. The isovalue was set to q=0.07 to run the
tests. Find the data at [5].
Nano. The third test dataset is a scanning transmission electron microscopy dataset of dimensions (471 × 957 × 1057) with q = 1800 and 16-bit unsigned scalars. The data is a tomographic reconstruction of a hyper-branched Co2P nanocrystal. 3D isosurface contours show the particle size and shape. See [31] for more information.
Plasma. This (2048 × 2048 × 2048) volume dataset is one time step from a kinetic simulation of magnetic reconnection in plasmas. The scalar data is 32-bit float values. We chose q = 0.30.
Table 1 compares the MC, ST, and FE algorithms in serial oper-
ation across the four datasets. The run times are normalized against
the basic MC algorithm, with the actual run time of MC shown
in parentheses. Note that an optimized, threadable version of MC
is also run (MC-opt) which uses an edge-based locator and makes
two passes to discard empty cells. Table 2 compares the MC-opt,
ST, and FE algorithms in parallel operation across the four datasets
using a constant 36 threads (the basic MC was not threadable and
hence not used). Again, the run times are normalized against the
optimized MC-opt algorithm, with the actual run time of MC-opt
shown in parentheses. Table 3 compares ST and FE side-by-side
Table 1: Comparison of selected serially executed isocontouring algorithms. Shown
are normalized speedup factors (normalized against the standard MC algorithm). The
plasma dataset was downsampled to 1024^3 because MC implementations were unable
to process the full dataset.
Algorithm CT-angio Supernova Nano Plasma
MC 1 (2.10s) 1 (2.667s) 1 (3.88s) 1 (69.86s)
MC-Opt 1.49 1.92 1.28 1.79
ST 1.44 1.90 0.54 3.51
FE 5.22 7.35 3.51 8.58
Table 2: Comparison of selected parallel executed isocontouring algorithms (number of threads = 36). Shown are normalized speedup factors (normalized against the optimized MC algorithm). The plasma dataset was downsampled to 1024^3 because MC implementations were unable to process the full dataset.
Algorithm CT-angio Supernova Nano Plasma
MC-Opt 1 (0.266s) 1 (0.266s) 1 (0.310s) 1 (4.56s)
ST 3.67 3.20 1.10 4.35
FE 8.26 9.45 4.49 11.05
on the full plasma 2048^3 dataset, and records FE overall parallel
efficiency. (All reported run-times are an average time across five
separate runs.)
Finally, Figure 7 plots the parallel efficiency of FE as the number
of processors is varied from one to 72 (Intel Workstation) and one
to 80 (IBM Power System). The IBM Power System S822 has two
POWER8 processors with 10 cores each, each core has 8 threads
running at 3.42 GHz (total of 160 threads) with 160 GB of RAM.
The results show superlinear scalability for FE at 4 threads on
the Intel Workstation. Also note that FE outperforms all versions
of MC, optimized MC, and ST in both serial and parallel execution.
The actual performance gain would be even greater if the threaded
ST and MC algorithms were required to merge coincident points
across piece boundaries. Note also that MC was unable to run any
data greater than approximately 1024^3 due to the use of 32-bit integral offsets in its implementation.
4.2 Analysis
Table 4 shows the time to execute each pass of FE on the four
datasets described previously. The time is expressed as a percent-
age of the total algorithm execution time, computed from the In-
tel Workstation using 32 cores. Also shown in the table is a performance factor f capturing the effect of enabling computational
Table 3: Comparison of run times and speedup factors, ST vs. FE. Plasma 2048^3 dataset used on the Intel Workstation. Note that this system has 36 physical cores and 72 total threads.
ST FE FE Speed Up FE Efficiency
Threads (seconds) (seconds) (factor) (percent)
1 157.95 53.07 2.98 100.0
2 76.82 25.14 3.06 105.6
4 39.90 12.99 3.07 102.1
8 22.32 7.2 3.10 92.13
16 12.79 4.15 3.08 79.99
24 9.013 2.96 3.04 74.60
36 6.76 2.39 2.83 61.61
72 5.57 2.06 2.71 35.85
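The efficiency column of Table 3 can be recomputed directly from the reported run times; small discrepancies reflect rounding of the published times. A sketch:

```python
# Parallel efficiency relative to the single-threaded FE run:
# E(p) = T(1) / (p * T(p)), expressed as a percentage.

def efficiency_pct(t1, p, tp):
    return 100.0 * t1 / (p * tp)

t1 = 53.07                          # FE, 1 thread, plasma 2048^3 (Table 3)
e36 = efficiency_pct(t1, 36, 2.39)  # ~61.7% (table reports 61.61%)
e72 = efficiency_pct(t1, 72, 2.06)  # ~35.8% (table reports 35.85%)
```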
Figure 6: The four datasets used for testing. In reading order, the CT-angio, Supernova, Nano, and Plasma datasets.
Figure 7: Parallel scaling efficiency across a representative 2048^3 dataset (the Plasma dataset described previously). Two plots are shown: a 72-thread (36-core) Intel Workstation and an IBM Power System with 160 threads (20 cores).
Table 4: Comparison of the four passes of the algorithm across the four datasets in
parallel execution (32 threads). The numbers are expressed as a percentage of total ex-
ecution time. The last column captures the effect of enabling computational trimming
by showing a performance factor gain (i.e., numbers >1 are faster).
Dataset Pass 1 (%) Pass 2 (%) Pass 3 (%) Pass 4 (%) Trimming (f)
CT-angio 27.4 20.0 4.2 48.5 1.39
Supernova 81.9 5.6 12.1 0.4 1.15
Nano 79.0 6.3 14.5 0.2 1.16
Plasma 91.0 2.0 6.9 0.1 1.15
trimming; i.e., a factor f > 1 indicates that the algorithm is faster when trimming is enabled.
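The trimming whose benefit f measures can be sketched as follows (an illustration under our own naming, not the VTK code): Pass 1 records, per x-edge row, the left and right extent of the intersected region, and later passes visit only that span:

```python
# Sketch of computational trimming on one x-edge row. Given the edge
# case of each cell position (0 == edge not intersected), return the
# half-open [xL, xR) bounds of the non-empty span, so later passes can
# skip the empty portions of the row entirely.

def trim(edge_cases):
    hits = [i for i, c in enumerate(edge_cases) if c != 0]
    if not hits:
        return 0, 0                  # empty row: nothing to process
    return hits[0], hits[-1] + 1

cases = [0, 0, 1, 2, 0, 3, 0, 0]
xL, xR = trim(cases)                 # later passes loop over range(xL, xR)
```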
In general, a significant amount of time is spent traversing the
dataset in the initial Pass 1. Indeed as the data becomes larger, in-
creasing time is spent in this initial pass, as the amount of work
to produce the isosurface relative to traversing the entire volume
decreases, depending on the extent of the isosurface through the
volume. Unfortunately in a single pass algorithm this initial traver-
sal cannot be avoided; however it does affirm the benefits of pre-
processing and search structures when performing exploratory iso-
surfacing. While computational trimming has a modest impact, its
effect increases as the amount of work devoted to producing output
(Passes 2-4) increases as well. While we were initially surprised
by the amount of time taken by the metadata prefix sum operation
(Pass 3), this operation is expected to scale suboptimally. Further note that the Nano dataset is slab shaped, with its smallest dimension in the x direction. This suggests that much work is going into processing many relatively short edge runs compared to the total dataset size. In some situations it may be better to process data in the y- or z-edge directions, depending on the effect of processing data in non-contiguous order, a topic for future research.
5 Conclusion
We have developed a high-performance, scalable isocontouring algorithm for structured scalar data. To our knowledge it is currently
the fastest non-preprocessed isocontouring algorithm on threaded,
multi-core CPUs available today. The development of this algo-
rithm has been motivated by the emergence of new parallel comput-
ing models, with the recognition that many current and future algo-
rithms require rethinking if we are to take full advantage of mod-
ern computing hardware. Each pass of the Flying Edges algorithm is independently parallelizable across edges, resulting in scalable perfor-
mance through task-stealing, edge-based computational tasks. Our
results demonstrate this, although the actual computational load of
any isocontouring algorithm based on an indexed lookup is rela-
tively small, requiring just a few floating point operations to inter-
polate intersection points along edges. As a result, for very large
data, much of the computing effort involves loading and/or access-
ing data across the memory hierarchy. Thus with large numbers of
threads, memory bounds may limit the overall scalability.
To address memory constraints, distributed, hybrid parallel ap-
proaches may perform better as separate machines (and therefore
memory spaces) can process a subset of the entire volume. Along
similar lines, we have envisioned a parallel, hybrid implementa-
tion that performs the initial, two FE counting passes across all dis-
tributed, grid subsets, and then communicates point and triangle
id offsets to each subsetted grid. Then each machine can process
its data to produce seamless isocontours without the need to merge
points across distributed grids.
Another way to view the Flying Edges algorithm is that it is a
form of data traversal, with geometric constraints (continuous sur-
faces) informing the extent to which data is processed. Similar ge-
ometric constraints exist for related visualization algorithms such
as scalar-based cutting and clipping. Also, the algorithm may be
readily adapted to any topologically regular grid. In the future we
plan on extending the algorithm to process rectilinear and structured
grids, and clipping and cutting algorithms, based on an abstraction
of the edge-based traversal process, with appropriate specialization
required in the way edge interpolations (i.e., point coordinates and
interpolated data) are generated.
We have already begun implementing this algorithm on GPUs.
While our initial experiments are promising, there is some uncer-
tainty as to whether the extra work required to track computational
trimming is warranted. On one hand, processing entire edges E_jk
without early termination simplifies computation and reduces the
likelihood of stalling the computational cores. On the other hand,
computational trimming can significantly reduce the amount of data
to be visited, and consequently the total computation performed.
Future numerical experiments will provide guidance, although it
is possible that differing hardware may produce contrary indica-
tions based on the particulars of the computing architecture. Fi-
nally, GPU processing of large volumes requires complex methods
to stream data on and off the GPU when card memory is exceeded.
The speed of Flying Edges is such that with very large data, the
cost of data transfer to and from the GPU often exceeds the time
to actually process the data on a multi-core CPU. (Refer to Table 2, assuming PCIe transfer throughput of 8 GB/sec.)
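The transfer-cost claim can be checked with simple arithmetic (assuming, as stated earlier, a 2048^3 volume of double-precision scalars and 8 GB/sec PCIe throughput):

```python
# Back-of-envelope: moving the volume to the GPU already costs more
# than the 36-thread FE run time reported in Table 3.

n = 2048
volume_bytes = 8 * n**3           # double-precision scalars
transfer_s = volume_bytes / 8e9   # 8 GB/sec PCIe throughput -> ~8.6 s
fe_cpu_s = 2.39                   # FE, plasma 2048^3, 36 threads (Table 3)
```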
In the spirit of reproducible science, we have implemented the
algorithm in both 2D and 3D and contributed both implementations to the VTK system with a permissive BSD license. These open source implemen-
tations are available as vtkFlyingEdges2D and vtkFlyingEdges3D.
This effort is part of an overall vision which we refer to as VTK-m,
in which critical algorithms in the VTK system are extended to-
wards massive multi-core parallelism.
Acknowledgments. This material is based upon
work supported by the U.S. Department of Energy, Office of Sci-
ence, Office of Advanced Scientific Computing Research, Scien-
tific Discovery through Advanced Computing (SciDAC) program
under Award Number(s) DE-SC0007440. We have been sup-
ported by many members of the DOE National Labs, particularly
the SDAV [24] effort. Additional support has been provided by
nVidia and Intel. We’d also like to thank the 3D Slicer commu-
nity; H. Karimabadi, W. Daughton, and Yi-Hsin Liu for providing
the Plasma dataset; Rob Hovden for the nanoparticle data; and C.
Atkins and S. Philip for their help generating results.
[1] J. Ahrens. Implications of numerical and data intensive technology
trends on scientific visualization and analysis. Plenary presentation
at SIAM CSE15, March 2015.
[2] J. Ahrens, S. Jourdain, P. O’Leary, J. Patchett, D. H. Rogers, and
M. Petersen. An image-based approach to extreme scale in situ visu-
alization and analysis. In Proceedings of Supercomputing 2014, 2014.
[3] C. Bajaj, V. Pascucci, D. Thompson, and X. Zhang. Parallel acceler-
ated isocontouring for out-of-core visualization. In Proceedings of the
1999 IEEE symposium on Parallel visualization and graphics, pages
97–104. IEEE Computer Society, 1999.
[4] G. E. Blelloch. Prefix sums and their applications. Technical Re-
port CMU-CS-90-190, School of Computer Science, Carnegie Mellon
University, Nov. 1990.
[5] J. Blondin. Supernova modelling.
[6] Y.-J. Chiang, C. T. Silva, and W. J. Schroeder. Interactive out-of-core
isosurface extraction. In Proceedings of IEEE Visualization ’98, pages
167–174, 1998.
[7] P. Cignoni, P. Marino, C. Montani, E. Puppo, and R. Scopigno. Speed-
ing up isosurface extraction using interval trees. IEEE Transactions on
Visualization and Computer Graphics, 3(2):158–170, Apr-June 1997.
[8] M. Ciżnicki, M. Kierzynka, K. Kurowski, B. Ludwiczak, K. Napierała, and J. Palczyński. Efficient isosurface extraction using marching tetrahedra and histogram pyramids on multiple GPUs. In Parallel Processing and Applied Mathematics, pages 343–352. Springer, 2012.
[9] D. D’Agostino, A. Clematis, and V. Gianuzzi. Parallel isosurface
extraction for 3D data analysis workflows in distributed environ-
ments. Concurrency and Computation: Practice and Experience,
23(11):1284–1310, 2011.
[10] C. Dietrich, C. E. Scheidegger, J. L. Comba, L. P. Nedel, C. T. Silva,
et al. Edge groups: An approach to understanding the mesh quality
of marching methods. Visualization and Computer Graphics, IEEE
Transactions on, 14(6):1651–1666, 2008.
[11] C. Dyken, G. Ziegler, C. Theobalt, and H.-P. Seidel. High-speed Marching Cubes using HistoPyramids. Computer Graphics Forum, 27(8):2028–2039, 2008.
[12] D. Horn. Stream reduction operations for GPGPU applications. In M. Pharr and R. Fernando, editors, GPU Gems 2: Programming Techniques for High-Performance Graphics and General-Purpose Computation, pages 573–589. Addison-Wesley Professional, 2005.
[13] C. Hsu and W. Feng. A power-aware run-time system for high-
performance computing. In Proceedings of the ACM/IEEE SC 2005
Conference, November 2005.
[14] Intel. Threading Building Blocks.
[15] B. Jeong, P. A. Navrátil, K. P. Gaither, G. Abram, and G. P. Johnson. Configurable data prefetching scheme for interactive visualiza-
son. Configurable data prefetching scheme for interactive visualiza-
tion of large-scale volume data. In IS&T/SPIE Electronic Imaging,
pages 82940K–82940K. International Society for Optics and Photon-
ics, 2012.
[16] M. Kazhdan, A. Klein, K. Dalal, and H. Hoppe. Unconstrained iso-
surface extraction on arbitrary octrees. In Symposium on Geometry
Processing, volume 7, 2007.
[17] Y. Livnat, H. Shen, and C. R. Johnson. A near optimal isosurface
extraction algorithm using the span space. IEEE Transactions on Vi-
sualization and Computer Graphics, 2(1):73–84, March 1996.
[18] W. E. Lorensen and H. E. Cline. Marching cubes: A high resolution 3D surface construction algorithm. In Computer Graphics (Proceedings
of SIGGRAPH 87), volume 21, pages 163–169, July 1987.
[19] Y. Lu, Y. Chen, and R. Thakur. Memory-conscious collective I/O for
extreme scale HPC systems. In Proceedings of High Performance
Computing, Networking, Storage and Analysis (SCC), 2012.
[20] S. Martin, H.-W. Shen, and P. McCormick. Load-balanced isosurfac-
ing on multi-gpu clusters. EGPGV, 10:91–100, 2010.
[21] NVIDIA Developer Zone. CUDA implementation of isosurface computation.
[22] L. Schmitz, L. F. Scheidegger, D. K. Osmari, C. A. Dietrich, and
J. L. D. Comba. Efficient and quality contouring algorithms on the
gpu. Computer Graphics Forum, 29(8):2569–2578, 2010.
[23] W. J. Schroeder, K. Martin, and B. Lorensen. The Visualization Toolkit: An Object-Oriented Approach to 3D Graphics, Fourth Edition. Kitware, Inc., 2006.
[24] DOE scalable data management, analysis, and visualization program.
[25] J. Shalf, S. Dosanjh, and J. Morrison. Exascale computing technology
challenges. In Proceedings of the 9th International Conference on
High Performance Computing for Computational Science, pages 1–
25, 2010.
[26] J. P. Shen and M. H. Lipasti. Modern processor design : fundamentals
of superscalar processors (First Edition). Waveland Press, 2005.
[27] 3D Slicer sample data: CT-cardio.
[28] G. Treece, R. Prager, and A. Gee. Regularised marching tetrahedra: improved iso-surface extraction. Computers & Graphics, 23(4):593–598, 1999.
[29] Q. Wang, J. JaJa, and A. Varshney. An efficient and scalable parallel
algorithm for out-of-core isosurface extraction and rendering. Journal
of Parallel and Distributed Computing, 67:592–603, 2007.
[30] J. Wilhelms and A. Van Gelder. Octrees for faster isosurface genera-
tion. In Proceedings of the 1990 workshop on Volume visualization,
pages 57–62, 1990.
[31] H. Zhang, D.-H. Ha, R. Hovden, L. F. Kourkoutis, and R. D. Robin-
son. Controlled synthesis of uniform cobalt phosphide hyperbranched
nanocrystals using tri-n-octylphosphine oxide as a phosphorus source.
NANO Letters, 11(1):188–197, 2011.
[32] X. Zhang, C. Bajaj, and W. Blanke. Scalable isosurface visualiza-
tion of massive datasets on cots clusters. In Proceedings of the IEEE
2001 symposium on parallel and large-data visualization and graph-
ics, pages 51–58. IEEE Press, 2001.
[33] X. Zhang and C. Bajaj. Scalable isosurface visualization of mas-
sive datasets on commodity off-the-shelf clusters. Journal of parallel
and distributed computing, 69(1):39–53, 2009.