
A GPU Accelerated Continuous and Discontinuous Galerkin Non-hydrostatic Atmospheric Model

Journal Title
The Author(s) 2016
Daniel Abdi¹, Lucas Wilcox¹, Timothy Warburton² and Francis Giraldo¹
We present a GPU accelerated nodal discontinuous Galerkin method for the solution of the three dimensional Euler
equations that govern the motion and thermodynamic state of the atmosphere. Acceleration of the dynamical core of
atmospheric models plays an important practical role in not only getting daily forecasts faster but also in obtaining
more accurate (high resolution) results within a given simulation time limit. We use algorithms suitable for the single
instruction multiple thread architecture of GPUs to accelerate our model by two orders of magnitude relative to one core
of a CPU. Tests on one node of the Titan supercomputer show a speedup of up to 15 times using the K20X GPU as
compared to that on the 16-core AMD Opteron CPU. The scalability of the multi-GPU implementation is tested using
16384 GPUs, which resulted in a weak scaling efficiency of about 90%. Finally, the accuracy and performance of our
GPU implementation are verified using several benchmark problems representative of different scales of atmospheric dynamics.
Keywords: NUMA, GPU, HPC, OCCA, Atmospheric model, Discontinuous Galerkin, Continuous Galerkin
1 Introduction
Most operational Numerical Weather Prediction (NWP)
models are based on the finite difference or spectral
transform spatial discretization methods. Finite difference
methods are popular with limited area models due to their
ease of implementation and good performance on structured
grids, whereas global circulation models mostly use the
spectral transform method. Spectral transform methods often
do not scale well on massively parallel systems due to
the need for global (all-to-all) communication required
by the Fourier transform. On the other hand, the finite
difference method requires wide halo layers at inter-
processor boundaries to achieve high-order accuracy. The
search for efficient parallel NWP codes in the era of
high performance computing suggests the use of alternative
methods that have local operation properties while still
offering high-order accuracy (Nair et al. 2011); their
efficiency coming from the minimal parallel communication
footprint that is of vital importance as resolution increases.
The Non-hydrostatic Unified Model of the Atmosphere
(NUMA) is one such NWP model that offers high-order
accuracy while using local methods for parallel efficiency
(Marras et al. 2015; Giraldo and Rosmond 2004; Kelly and
Giraldo 2012; Giraldo and Restelli 2008).
In Table 1, we give a summary of a recent review on
the progress of porting several NWP models to the GPU
(Sawyer 2014). Among those models which ported the whole
dynamical core, a maximum overall speedup of 3 times (from
here on, we shall use, e.g., 3x to represent such a speedup) is
observed for a GPU relative to a multi-core CPU. The only
spectral element model in the review was the Community
Atmospheric Model (CAM-SE), which showed a speedup of 3x
for the dynamical core using CUDA. In a comparison of
CAM-SE tracer kernels, the OpenACC version,
though substantially easier to program, ran 1.5x
slower than the CUDA version (Norman et al. 2015). This
could occur, for example, by not fully exploiting the private
worker array capability of OpenACC. The most important
metric we shall use to compare performance on the GPU is
speedup; however, we should note that speedup results are
significantly influenced by how well the CPU and GPU codes
are optimized. For this reason, we shall also report individual
GPU kernel performance in terms of the rate of floating point
operations and the rate of data transfer (bandwidth), and will
illustrate our results using roofline models.
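As a rough illustration of the roofline idea mentioned above, the following sketch computes the attainable performance of a kernel as the smaller of the compute peak and the product of arithmetic intensity and memory bandwidth. The device numbers (`PEAK_GFLOPS`, `PEAK_BW_GBS`) are hypothetical placeholders, not measurements from this work.

```python
def roofline_gflops(arithmetic_intensity, peak_gflops, peak_bandwidth_gbs):
    """Attainable GFLOP/s for a kernel with the given flop-per-byte ratio."""
    return min(peak_gflops, arithmetic_intensity * peak_bandwidth_gbs)


# Hypothetical device limits, chosen only to make the example concrete.
PEAK_GFLOPS = 1000.0   # peak floating point rate in GFLOP/s
PEAK_BW_GBS = 200.0    # peak memory bandwidth in GB/s

# Arithmetic intensity at which the bandwidth and compute limits meet.
ridge = PEAK_GFLOPS / PEAK_BW_GBS

for ai in (0.5, 2.0, ridge, 16.0):
    bound = "memory-bound" if ai < ridge else "compute-bound"
    perf = roofline_gflops(ai, PEAK_GFLOPS, PEAK_BW_GBS)
    print(f"AI = {ai:5.2f} flop/byte -> {perf:7.1f} GFLOP/s ({bound})")
```

A kernel whose arithmetic intensity lies to the left of the ridge point can only be sped up by reducing memory traffic, which is why both the floating point rate and the bandwidth are reported.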
Element based Galerkin (EBG) methods, in which the
basis functions are defined within an element, are well
suited for distributed computing for two reasons (Klöckner
et al. 2009). Firstly, localized memory accesses result in
low communication overhead. In contrast, global methods
require all-to-all communication that severely degrades
scalability on most architectures, and methods with
non-compact high-order support require larger halo regions,
which translate to larger communication stencils that
also reduce scalability. Secondly, the high-order polynomial
expansion of the solution results in large arithmetic intensity
per degree of freedom. These two properties work in
favor of EBG methods for Graphics Processing Unit (GPU)
computing as well.

¹ Department of Applied Mathematics, Naval Postgraduate School, USA
² Department of Mathematics, Virginia Tech University, USA
Corresponding author:
Daniel S. Abdi, Naval Postgraduate School, Monterey, CA 93943, USA.
Prepared using sagej.cls [Version: 2015/06/09 v1.01]

Table 1. GPU acceleration of a few atmospheric models based on a summary in Sawyer (2014). The only spectral element (SE)
code is the hydrostatic CAM-SE model. A maximum speedup of 3x over a multi-core CPU is observed among those models that
have ported the whole dynamical core.

Model     Non-hydrostatic  Method  GPU ported        Speedup  Language
CAM-SE    No               SE      Parts of DyCore   3x       CUDA+OpenACC
WRF       Yes              FD      Parts of DyCore   2x       CUDA+OpenACC
NICAM     Yes              FV      DyCore            3x       OpenACC
ICON      Yes              FV      DyCore            2x       CUDA+OpenACC+OpenCL
GEOS-5    Yes              FV      Parts of DyCore   5x       CUDA+OpenACC
FIM/NIM   Yes              FV      DyCore + Physics  3x       F2C-ACC + OpenACC
GRAPES    Yes              SL      Parts of DyCore   4x       CUDA
COSMO     Yes              FD      DyCore + Physics  2x       CUDA+OpenACC

The two EBG methods of NUMA, namely continuous Galerkin (CG) and discontinuous
Galerkin (DG), are ported to the GPU in a unified manner
(see Sec. 3.3). Parallel implementation of DG is often easier
and more efficient than that of CG because of a smaller
communication stencil; with a judicious choice of numerical
flux only neighbors sharing a face need to communicate
in DG as opposed to the edge and corner neighbor
communication required by CG. Moreover, DG allows for
a simple overlap of computation of volume integrals and
intra-processor flux with communication of boundary data,
which can be exploited to improve the efficiency of the
parallel implementation (Kelly and Giraldo 2012). CG can
also benefit from a communication-computation overlap but
it requires a bit more work than that for DG (Deville et al.
EBG methods have been successfully ported to GPUs to
speed up the solution of various partial differential equations
(PDEs) by orders of magnitude. Acceleration of a CG
simulation using GPUs was first reported by Göddeke et al.
(2005). Later, Klöckner et al. (2009) made the first GPU
implementation of nodal DG for the solution of linear
hyperbolic conservation laws. They mention that nontrivial
adjustments to the DG method are required to solve non-
linear hyperbolic equations, such as the compressible Euler
equations, on the GPU due to complexity of implementing
limiters and artificial viscosity. Another notable difference
with the current work is that NUMA uses a tensor-product
approach with hexahedral elements for efficiency
reasons (Kelly and Giraldo 2012); Klöckner et al. (2009)
argue tetrahedra are preferable on the GPU due to larger
arithmetic intensity and reduced memory fetches. More
recently Siebenborn et al. (2012) implemented the Runge-
Kutta discontinuous Galerkin method of Cockburn and
Shu (1998) on the GPU to solve the non-linear Euler
equations using tetrahedral grids. They reported a speedup
of 18x over the serial implementation of the method running
on a single core CPU. Fuhry et al. (2014) made an
implementation of the 2D discontinuous Galerkin on the
GPU using triangular elements and obtained a speedup of
about 50x relative to a single core CPU. The approach they
used is a one-element-per-thread strategy that is different
from the one-node-per-thread strategy we shall use in this
work when running on the GPU. However, thanks to our
use of a device agnostic language, the same kernel code
used on the GPU switches to using the one-element-per-
thread strategy of Fuhry et al. (2014) when running on the
CPU using OpenMP mode. Chan et al. (2015) presented
a GPU acceleration of DG methods for the solution of
the acoustic wave equation on hex-dominant hybrid meshes
consisting of hexahedra, tetrahedra, wedges and pyramids.
They mention that the DG spectral element formulation on
hexahedra is more efficient on the GPU using Legendre-
Gauss-Lobatto (LGL) points than using Gauss-Legendre
(GL) points. To avoid the cost of storing the inverse mass
matrix on the GPU, they used different basis functions that
yield a diagonal mass matrix for each of the cell shapes
except tetrahedra. For straight-edged elements, the mass
matrix for tetrahedral elements is not diagonal, but a scalar
multiple of that of the reference tetrahedron, therefore the
storage cost is minimal. Chan and Warburton (2015)
consider the use of the Bernstein-Bezier polynomial
basis for DG on the GPU to enhance the sparsity of the
derivative and lift matrices as compared to classical DG
with a Lagrange polynomial basis. However, this comes at
the cost of an increased condition number of the matrices, which
could potentially cause stability issues. They conclude that,
at high-order polynomial approximation, DG implemented
with a Bernstein-Bezier polynomial basis performs better than
a straightforward implementation of classical DG. Remacle
et al. (2015) studied GPU acceleration of spectral elements
for the solution of the Poisson problem on purely hexahedral
grids. The solution of elliptic problems is most efficiently
done using implicit methods; thus, they implemented a
matrix-free Preconditioned Conjugate Gradient (PCG) on the
GPU and demonstrated that problems with 50 million grid
cells can be solved in a few seconds.
General purpose computing on GPUs can be done using
several programming models from various vendors: AMD’s
OpenCL, NVIDIA’s CUDA and OpenACC, to name a
few. The choice of the programming model for a project
depends on several factors. The goal of the current work is
to port NUMA to heterogeneous computing environments
in a performance portable way, and hence cross-platform
portability is the topmost priority. In the future we
shall address performance portability using automatic code
transformation techniques (see Klöckner
and Warburton 2013). To achieve cross-platform portability,
we chose a new threading language called OCCA (Open
Concurrent Compute Abstraction) (Medina et al. 2014),
which is a unified approach to multi-threading languages.
Kernels written in OCCA are cross-compiled at runtime to
existing thread models such as OpenCL, CUDA, OpenMP,
etc.; here, we present results only for OpenCL and CUDA
backends and postpone OpenMP for future work. OCCA has
been shown to deliver portable high performance for various
EBG methods (Medina et al. 2014). It has already been used
by Gandham et al. (2014) to accelerate the DG solution of
the shallow water equations, by Remacle et al. (2015) for the
Poisson problem, and by Modave et al. (2015) for acoustic and
elastic problems.
2 Governing equations
The dynamics of non-hydrostatic atmospheric processes
are governed by the compressible Euler equations. The
equation sets can be written in various conservative and
non-conservative forms. Among those, a conservative set is
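selected with the prognostic variables $(\rho, \mathbf{U}, \Theta)^T$, where $\rho$ is
density, $\mathbf{U} = (U, V, W) = \rho\mathbf{u}$, and $\Theta = \rho\theta$, where $\theta$ is potential
temperature and $\mathbf{u} = (u, v, w)$ are the velocity components.
We write the governing equations in the following way

$$
\begin{aligned}
\frac{\partial \rho}{\partial t} + \nabla\cdot\mathbf{U} &= 0 \\
\frac{\partial \mathbf{U}}{\partial t} + \nabla\cdot\left(\frac{\mathbf{U}\otimes\mathbf{U}}{\rho} + P\,\mathbf{I}_3\right) &= -\rho\,\mathbf{g} \\
\frac{\partial \Theta}{\partial t} + \nabla\cdot\left(\frac{\Theta\,\mathbf{U}}{\rho}\right) &= 0
\end{aligned}
$$

where $\mathbf{g}$ is the gravity vector. The pressure in the
momentum equation is obtained from the equation of state

$$
P = P_0\left(\frac{R\,\Theta}{P_0}\right)^{\gamma}
$$

where $P_0$ is a reference pressure, $R = c_p - c_v$ and $\gamma = c_p/c_v$ for given specific heats at constant
pressure and constant volume, $c_p$ and $c_v$, respectively. We have
selected to use a conservative form of the equations, to take
advantage of not only global but also local conservation
properties (given the proper discretization method).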
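To make the role of the equation of state concrete, here is a minimal sketch that evaluates the pressure from the prognostic variable Θ = ρθ. It assumes the standard closed form P = P0 (RΘ/P0)^γ; the numerical constants for dry air and the reference pressure (`CP`, `CV`, `P0`) are illustrative assumptions and are not taken from the text.

```python
# Assumed dry-air constants (illustrative values, not from the paper).
CP = 1004.67          # specific heat at constant pressure, J/(kg K)
CV = 717.5            # specific heat at constant volume, J/(kg K)
R = CP - CV           # gas constant, R = c_p - c_v
GAMMA = CP / CV       # ratio of specific heats, gamma = c_p / c_v
P0 = 1.0e5            # assumed reference pressure, Pa


def pressure(big_theta):
    """Pressure from Theta = rho * theta, assuming P = P0 * (R * Theta / P0) ** gamma."""
    return P0 * (R * big_theta / P0) ** GAMMA


# Consistency check with the ideal gas law: at P = P0 the potential temperature
# equals the temperature, so rho = P0 / (R * T) must give back P0.
T = 300.0
rho = P0 / (R * T)
p = pressure(rho * T)
print(f"recovered pressure: {p:.6f} Pa")
```

Because Θ is itself a conserved variable, the pressure needs no separate prognostic equation; it is recomputed pointwise wherever the momentum flux is evaluated.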
For better numerical stability, the density, pressure and
potential temperature variables are split into background
and perturbation components. The background component is
time-invariant and is often obtained by assuming hydrostatic
equilibrium and a neutral atmosphere. Let us define the
decomposition as follows
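$$
\begin{aligned}
\rho(\mathbf{x}, t) &= \bar{\rho}(\mathbf{x}) + \rho'(\mathbf{x}, t) \\
\Theta(\mathbf{x}, t) &= \bar{\Theta}(\mathbf{x}) + \Theta'(\mathbf{x}, t) \\
P(\mathbf{x}, t) &= \bar{P}(\mathbf{x}) + P'(\mathbf{x}, t)
\end{aligned}
$$

where $(\mathbf{x}, t)$ are the space-time coordinates. Then, the
modified equation set is

$$
\begin{aligned}
\frac{\partial \rho'}{\partial t} + \nabla\cdot\mathbf{U} &= 0 \\
\frac{\partial \mathbf{U}}{\partial t} + \nabla\cdot\left(\frac{\mathbf{U}\otimes\mathbf{U}}{\rho} + P'\,\mathbf{I}_3\right) &= -\rho'\,\mathbf{g} \\
\frac{\partial \Theta'}{\partial t} + \nabla\cdot\left(\frac{\Theta\,\mathbf{U}}{\rho}\right) &= 0.
\end{aligned}
$$

In compact vector notation form

$$
\frac{\partial \mathbf{q}}{\partial t} + \nabla\cdot\mathbf{F}(\mathbf{q}) = \mathbf{S}(\mathbf{q}) \tag{4}
$$

where $\mathbf{q} = (\rho', \mathbf{U}, \Theta')^T$ is the solution vector, $\mathbf{F}(\mathbf{q}) = \left(\mathbf{U},\ \frac{\mathbf{U}\otimes\mathbf{U}}{\rho} + P'\,\mathbf{I}_3,\ \frac{\Theta\,\mathbf{U}}{\rho}\right)^T$ is the flux vector, and $\mathbf{S}(\mathbf{q}) = (0, -\rho'\mathbf{g}, 0)^T$ is the source vector.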
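The step from the full variables to the perturbation form of the momentum equation can be written out explicitly. The short derivation below is a sketch under two assumptions: the background state is in hydrostatic balance, as stated above, and the gravitational source term enters the momentum equation as $-\rho\mathbf{g}$.

```latex
% Hydrostatic balance of the time-invariant background state (assumption):
\nabla \bar{P} = -\bar{\rho}\,\mathbf{g}
% Substituting P = \bar{P} + P' and \rho = \bar{\rho} + \rho' into the
% pressure-gradient and gravity terms of the momentum equation:
\nabla P + \rho\,\mathbf{g}
  = \left(\nabla \bar{P} + \bar{\rho}\,\mathbf{g}\right) + \nabla P' + \rho'\,\mathbf{g}
  = \nabla P' + \rho'\,\mathbf{g}
```

Only the perturbations remain, so the large, nearly cancelling background terms never enter the discrete operators, which is the source of the improved numerical stability.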
For the purpose of stabilization, we add artificial viscosity
to the governing equations as follows
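$$
\frac{\partial \mathbf{q}}{\partial t} + \nabla\cdot\mathbf{F}(\mathbf{q}) = \mathbf{S}(\mathbf{q}) + \nabla\cdot(\mu\nabla\mathbf{q}) \tag{5}
$$

where $\mu$ is the constant artificial kinematic viscosity. We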
should mention that the equation sets are conservative only
for the inviscid case; therefore, in order to conserve mass, we
do not apply stabilization to the continuity equation.
3 Spatial discretization of the governing equations
Spatial discretization for the element-based Galerkin (EBG)
methods, namely continuous Galerkin and discontinuous
Galerkin, is conducted by decomposing the domain $\Omega \subset \mathbb{R}^3$
into $N_e$ non-overlapping hexahedral elements $\Omega_e$

$$
\Omega = \bigcup_{e=1}^{N_e} \Omega_e
$$
A key property of hexahedral elements is that they allow
the use of a tensor product approach thereby decreasing the
complexity (in 3D) from $O(N^6)$ to $O(N^4)$, where $N$ is the
degree of the polynomial basis. In addition, if we are willing
to accept inexact integration of the mass matrix then we can
co-locate the interpolation and integration points to simplify
the resulting algorithm in addition to increasing its efficiency
without sacrificing too much accuracy (see, e.g., Giraldo
Within each element $\Omega_e$, basis functions $\psi_j(\mathbf{x})$ are defined
to form a finite-dimensional approximation $\mathbf{q}_N$ of $\mathbf{q}(\mathbf{x}, t)$ by
the expansion

$$
\mathbf{q}_N^{(e)}(\mathbf{x}, t) = \sum_{j=1}^{M} \psi_j(\mathbf{x})\,\mathbf{q}_j^{(e)}(t)
$$

where $M$ is the number of nodes in an element. The
superscript $(e)$ indicates a local solution as opposed to a
global solution. From here on, the superscript is dropped
from our notation since we are solely interested in EBG methods.
Footnote (gravity vector, Sec. 2): The gravity vector is constant in mesoscale models whereas it varies with
location in global scale models.

The 3D basis functions are formed from a tensor product
of the 1D basis functions in each direction as

$$
\psi_{ijk}(\xi, \eta, \zeta) = \psi_i(\xi)\,\psi_j(\eta)\,\psi_k(\zeta)
$$

where the 1D Lagrange basis functions are defined on $[-1, 1]$ as

$$
\psi_i(\xi) = \prod_{\substack{j=1 \\ j \neq i}}^{M} \frac{\xi - \xi_j}{\xi_i - \xi_j}
$$

where $\{\xi_i\}_{1}^{M}$ is the set of interpolation points in $[-1, 1]$. In
a nodal Galerkin approach, $\psi_i(\xi)$ are Lagrange polynomials
associated with a specific set of points; here we choose the
Legendre-Gauss-Lobatto (LGL) points $\{\xi_i\} \in [-1, 1]$, which
are the roots of

$$
(1 - \xi^2)\,P_N'(\xi) = 0
$$

where $P_N(\xi)$ is the $N$th degree Legendre polynomial. These
points are also used for integration with quadrature weights
given by

$$
\omega_i = \frac{2}{N(N+1)} \left[\frac{1}{P_N(\xi_i)}\right]^2
$$

This choice of Lagrange functions gives the Kronecker delta property

$$
\psi_i(\xi_j) = \delta_{ij}
$$

which, for the 3D basis functions, yields

$$
\psi_{ijk}(\xi_a, \eta_b, \zeta_c) = \delta_{ai}\,\delta_{bj}\,\delta_{ck}.
$$
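The LGL points and weights defined above can be computed with a few lines of code. The sketch below is one common way to do it and is not taken from NUMA: it applies Newton iterations to $\xi P_N(\xi) - P_{N-1}(\xi)$, which is proportional to $(1-\xi^2)P_N'(\xi)$, starting from Chebyshev-Gauss-Lobatto points. The function names `legendre` and `lgl_nodes_weights` are illustrative.

```python
import math


def legendre(n, x):
    """Return (P_n(x), P_{n-1}(x)) for n >= 1 using the three-term recurrence."""
    p_prev, p = 1.0, x
    for k in range(2, n + 1):
        p_prev, p = p, ((2 * k - 1) * x * p - (k - 1) * p_prev) / k
    return p, p_prev


def lgl_nodes_weights(n):
    """LGL nodes (ascending) and quadrature weights for polynomial degree n >= 1."""
    nodes, weights = [], []
    for k in range(n + 1):
        x = math.cos(math.pi * k / n)        # Chebyshev-Gauss-Lobatto initial guess
        for _ in range(100):
            p, p_prev = legendre(n, x)
            # Newton step on f(x) = x P_n - P_{n-1}, whose derivative is (n + 1) P_n.
            dx = (x * p - p_prev) / ((n + 1) * p)
            x -= dx
            if abs(dx) < 1e-15:
                break
        p, _ = legendre(n, x)
        nodes.append(x)
        weights.append(2.0 / (n * (n + 1) * p * p))
    order = sorted(range(n + 1), key=lambda i: nodes[i])
    return [nodes[i] for i in order], [weights[i] for i in order]


xi, w = lgl_nodes_weights(4)
print("nodes  :", [round(v, 6) for v in xi])
print("weights:", [round(v, 6) for v in w])
```

With $N + 1$ points the LGL rule integrates polynomials up to degree $2N - 1$ exactly, which is why the mass matrix, whose integrand has degree $2N$, is only integrated inexactly when the interpolation and integration points are co-located.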
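Unfortunately, the Kronecker delta property does not hold
for the derivatives of the basis functions. However, in the
case of tensor product elements, there exists a simplification
that tremendously decreases the cost of evaluating
derivatives and also the associated storage space in case they
are stored as matrix coefficients. Let us write the derivatives
in the following way

$$
\begin{aligned}
\frac{\partial \psi_{ijk}}{\partial \xi}(\xi_a, \eta_b, \zeta_c) &= \frac{d\psi_i}{d\xi}(\xi_a)\,\delta_{bj}\,\delta_{ck} \\
\frac{\partial \psi_{ijk}}{\partial \eta}(\xi_a, \eta_b, \zeta_c) &= \delta_{ai}\,\frac{d\psi_j}{d\eta}(\eta_b)\,\delta_{ck} \\
\frac{\partial \psi_{ijk}}{\partial \zeta}(\xi_a, \eta_b, \zeta_c) &= \delta_{ai}\,\delta_{bj}\,\frac{d\psi_k}{d\zeta}(\zeta_c)
\end{aligned}
$$

Therefore, for tensor product elements, we need to
consider only $3N$ nodes instead of $N^3$ when computing
derivatives at a given node. If matrices are built to solve
the system of equations, the storage requirement would
increase in proportion to the polynomial order, $O(N)$, instead
of $O(N^3)$. This saving is due to the fact that we only
have to compute and store $\frac{d\psi}{d\chi}(\chi)$, where $\chi$ is one of the
following: $\xi$, $\eta$, $\zeta$. The derivatives with respect to the physical
coordinates $\mathbf{x} = (x, y, z)$ are computed using the Jacobian
matrix transformation

$$
\nabla\psi = \mathbf{J}^{T}\,\hat{\nabla}\psi
$$

where $\hat{\nabla}$ is the derivative with respect to the reference
coordinates $(\xi, \eta, \zeta)$ and

$$
\mathbf{J} = \frac{\partial(\xi, \eta, \zeta)}{\partial(x, y, z)}, \qquad J_{ij} = \frac{\partial \xi_i}{\partial x_j}.
$$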
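The tensor-product simplification can be seen in a small sketch: the derivative along $\xi$ at a node only sums over the nodes on the same $\xi$-line, using a single 1D derivative matrix. The barycentric construction of the matrix and the names `diff_matrix` and `d_dxi` are choices made for this illustration, not the NUMA implementation.

```python
def diff_matrix(x):
    """D[i][n] = d(psi_n)/d(xi) at node x[i] for the Lagrange basis on the points x."""
    m = len(x)
    # Barycentric weights w[i] = 1 / prod_{j != i} (x[i] - x[j]).
    w = [1.0] * m
    for i in range(m):
        for j in range(m):
            if j != i:
                w[i] /= (x[i] - x[j])
    D = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for n in range(m):
            if n != i:
                D[i][n] = (w[n] / w[i]) / (x[i] - x[n])
        # Rows sum to zero because the derivative of a constant vanishes.
        D[i][i] = -sum(D[i][n] for n in range(m) if n != i)
    return D


def d_dxi(q, D):
    """xi-derivative of nodal data q[k][j][i]; each node uses only its own xi-line."""
    m = len(D)
    return [[[sum(D[i][n] * q[k][j][n] for n in range(m)) for i in range(m)]
             for j in range(m)] for k in range(m)]


x = [-1.0, 0.0, 1.0]     # LGL points for N = 2
D = diff_matrix(x)
# Nodal values of q = xi^2 * eta + zeta, stored as q[k][j][i] (k: zeta, j: eta, i: xi).
q = [[[xi ** 2 * eta + zeta for xi in x] for eta in x] for zeta in x]
dq = d_dxi(q, D)
print(dq[2][2])          # d q / d xi = 2 * xi * eta along the line eta = 1, zeta = 1
```

Only the small matrix `D` is stored, regardless of the number of nodes in the 3D element, which is the storage saving described above.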
3.1 Continuous Galerkin method
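Starting from the differential form of the Euler equations
in vector notation, shown in Eq. (4), and then expanding
with basis functions, multiplying by a test function $\psi_i$, and
integrating yields the element-wise formulation

$$
\int_{\Omega_e} \psi_i \frac{\partial \mathbf{q}_N}{\partial t}\,d\Omega_e + \int_{\Omega_e} \psi_i \nabla\cdot\mathbf{F}(\mathbf{q}_N)\,d\Omega_e = \int_{\Omega_e} \psi_i\,\mathbf{S}(\mathbf{q}_N)\,d\Omega_e \tag{8}
$$

Integrating the second term by parts ($\psi_i \nabla\cdot\mathbf{F} = \nabla\cdot(\psi_i\mathbf{F}) - \nabla\psi_i\cdot\mathbf{F}$) yields

$$
\int_{\Omega_e} \psi_i \frac{\partial \mathbf{q}_N}{\partial t}\,d\Omega_e + \int_{\Gamma_e} \psi_i\,\hat{\mathbf{n}}\cdot\mathbf{F}\,d\Gamma_e - \int_{\Omega_e} \nabla\psi_i\cdot\mathbf{F}\,d\Omega_e = \int_{\Omega_e} \psi_i\,\mathbf{S}(\mathbf{q}_N)\,d\Omega_e \tag{9}
$$

where $\hat{\mathbf{n}}$ is the outward pointing normal on the boundary of
the element $\Gamma_e$. The second term needs to be evaluated only
at physical boundaries because the fluxes to the left and right
of element interfaces are always equal at interior boundaries,
i.e. $\mathbf{F}^+ = \mathbf{F}^-$. Eqs. (8) and (9) are the strong and weak
continuous Galerkin (CG) formulations, respectively, with
the finite dimensional space defined as a subset of the
Sobolev space

$$
\mathcal{V}_N^{CG} = \{\psi \in H^1(\Omega) \mid \psi \in \mathcal{P}_N\}
$$

where $\mathcal{P}_N$ defines the set of all $N$th degree polynomials.
Automatically, $\mathcal{V}_N^{CG} \subset C^0(\Omega)$, thus CG solutions satisfy $C^0$-continuity.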
3.2 Discontinuous Galerkin method
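For DG, the finite dimensional space is defined as a subset of
the Hilbert space that allows for discontinuities of solutions

$$
\mathcal{V}_N^{DG} = \{\psi \in L^2(\Omega) \mid \psi \in \mathcal{P}_N\}.
$$

Therefore $\mathbf{F}^+$ and $\mathbf{F}^-$ are no longer equal; hence, we
define a numerical flux $\mathbf{F}^*$ as an approximate solution to a
Riemann problem to be used in the weak form DG

$$
\int_{\Omega_e} \psi_i \frac{\partial \mathbf{q}_N}{\partial t}\,d\Omega_e + \int_{\Gamma_e} \psi_i\,\hat{\mathbf{n}}\cdot\mathbf{F}^*\,d\Gamma_e - \int_{\Omega_e} \nabla\psi_i\cdot\mathbf{F}\,d\Omega_e = \int_{\Omega_e} \psi_i\,\mathbf{S}(\mathbf{q}_N)\,d\Omega_e \tag{10}
$$

where the Rusanov flux, suitable for hyperbolic equations, is
defined as

$$
\mathbf{F}^*(\mathbf{q}) = \{\mathbf{F}(\mathbf{q})\} - \hat{\mathbf{n}}\,\frac{|\hat{\lambda}|}{2}\,[\![\mathbf{q}]\!]
$$

where $|\hat{\lambda}|$ is the speed of sound, $\{\cdot\}$ represents an average and
$[\![\cdot]\!]$ represents a jump across a face (from $\Omega_e$ to its neighbor). If
$C^0$-continuity is enforced on the weak form DG in Eq. (10),
i.e. $\mathbf{F}^* = \mathbf{F}$, it reduces to the weak form CG in Eq. (9).
A strong form DG that more closely resembles Eq. (8) can be
obtained by applying a second integration by parts on the
flux integral to remove the smoothness constraint on the test
function $\psi_i$ as follows

$$
\int_{\Omega_e} \psi_i \frac{\partial \mathbf{q}_N}{\partial t}\,d\Omega_e + \int_{\Gamma_e} \psi_i\,\hat{\mathbf{n}}\cdot(\mathbf{F}^* - \mathbf{F})(\mathbf{q}_N)\,d\Gamma_e + \int_{\Omega_e} \psi_i \nabla\cdot\mathbf{F}(\mathbf{q}_N)\,d\Omega_e = \int_{\Omega_e} \psi_i\,\mathbf{S}(\mathbf{q}_N)\,d\Omega_e
$$

Again, if $C^0$-continuity is enforced on the strong form DG
formulation, i.e. $\mathbf{F}^* = \mathbf{F}$ at interior edges, it simplifies to the
strong form CG formulation in Eq. (8) (see Abdi and Giraldo
(2016) for details).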
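The behaviour of the Rusanov flux is easiest to see for a scalar quantity in one dimension. The sketch below is an illustration only: it assumes the jump is taken as the neighbor value minus the interior value, and the function name `rusanov_normal_flux` and its arguments are hypothetical.

```python
def rusanov_normal_flux(f_in, f_out, q_in, q_out, normal, wave_speed):
    """Normal component of the Rusanov flux for a scalar q in 1D.

    f_in, q_in   : flux and state on the interior side of the face
    f_out, q_out : flux and state on the neighbor side of the face
    normal       : outward unit normal of the element (+1.0 or -1.0 in 1D)
    wave_speed   : maximum wave speed |lambda| at the face
    """
    average = 0.5 * (f_in + f_out)
    jump = q_out - q_in              # assumed convention: neighbor minus interior
    # n . F* = n . {F} - |lambda| / 2 * [[q]], using n . n = 1.
    return normal * average - 0.5 * abs(wave_speed) * jump


# Linear advection f(q) = a * q with a > 0: the flux reduces to the upwind value a * q_in.
a = 2.0
q_left, q_right = 1.0, 3.0
flux = rusanov_normal_flux(a * q_left, a * q_right, q_left, q_right, 1.0, a)
print(flux)   # equals a * q_left

# Continuous data: the dissipation term vanishes and F* = F, as in the CG limit.
same = rusanov_normal_flux(a * q_left, a * q_left, q_left, q_left, 1.0, a)
print(same)
```

The second call mirrors the statement above that enforcing continuity, so that the two states at a face coincide, reduces the DG formulation to its CG counterpart.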
3.3 Unified CG and DG
The element-wise matrices for both CG and DG are
assembled to form global matrices via an operation
commonly known as global assembly or direct stiffness
summation (DSS). Even though the local matrices are the
same for both methods, the DSS operation yields different
global matrices. CG is often implemented through a global
grid point storage scheme where elements share LGL nodes
at faces so that C0-continuity is satisfied automatically.
Therefore, the DSS operation for CG accumulates values
at shared nodes, while that for DG simply puts the local
element matrices in their proper location in the global matrix.
DG uses a local element-wise storage scheme because
discontinuities (jumps) at element interfaces are allowed.
The standard implementations of CG and DG often follow
these two different approaches of storing data; however, CG
can be recast to use local element-wise storage as well. To
do so, we must explicitly enforce equality of values on the
right and left of element interfaces by accumulating and
then distributing back (gather-scatter) values at shared nodes
for both the mass matrix and right-hand side vector. The
gather-scatter operation is the coupling mechanism for CG,
without which the problem is under-specified. DG achieves
the same via the definition of the numerical flux $\mathbf{F}^*$ at
element interfaces, which is used by both elements sharing
the face. A detailed explanation of the unified CG and DG
implementation of NUMA can be found in (Abdi and Giraldo
2016).
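The gather-scatter coupling for CG with element-wise storage can be sketched on a toy problem. The example below is a simplified illustration with two 1D elements sharing one node; the function `gather_scatter`, the connectivity array `global_ids` and the sample values are all hypothetical.

```python
def gather_scatter(local_vals, global_ids, num_global):
    """Accumulate element-local nodal values at shared nodes, then copy the sums back."""
    acc = [0.0] * num_global
    for elem_vals, elem_ids in zip(local_vals, global_ids):
        for value, gid in zip(elem_vals, elem_ids):
            acc[gid] += value                                        # gather (accumulate)
    return [[acc[gid] for gid in elem_ids] for elem_ids in global_ids]  # scatter


# Two 1D elements with three LGL nodes each. Local node 2 of element 0 and
# local node 0 of element 1 are the same physical point (global id 2).
global_ids = [[0, 1, 2], [2, 3, 4]]
mass = [[1.0 / 3.0, 4.0 / 3.0, 1.0 / 3.0], [1.0 / 3.0, 4.0 / 3.0, 1.0 / 3.0]]
rhs = [[1.0, 2.0, 3.0], [5.0, 6.0, 7.0]]

mass_dss = gather_scatter(mass, global_ids, 5)
rhs_dss = gather_scatter(rhs, global_ids, 5)

# Dividing the assembled right-hand side by the assembled (diagonal) mass gives
# the same value on both sides of the interface, i.e. a C0-continuous update.
update = [[r / m for r, m in zip(r_e, m_e)] for r_e, m_e in zip(rhs_dss, mass_dss)]
print(update[0][2], update[1][0])
```

Both the mass matrix and the right-hand side go through the same operation, which matches the description above; without it the two copies of the shared node would evolve independently.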
4 Temporal discretization of the governing equations
The time integrator used is a low-storage explicit Runge-Kutta
(LSERK) method proposed by Carpenter and Kennedy
(1994). It is a five-stage fourth-order RK method that requires
only two storage locations, which is half of that required
by the conventional high-storage fourth-order RK method.
The added cost due to one more stage evaluation is offset
by the larger stable time step $\Delta t$ the method allows. Each
successive stage is written on to the same register without
erasing the previous value. We need to store previous values
of the field variable $q$ and its residual $dq$, of size $N$ each,
thereby resulting in a $2N$-storage scheme. Given the initial
value problem
$$
\frac{dq}{dt} = R(q) \quad \text{with} \quad q(t_0) = q_0
$$

the updates at each stage $j$ are conducted as follows

$$
\begin{aligned}
dq_j &= A_j\,dq_{j-1} + \Delta t\,R(q_{j-1}) \\
q_j &= q_{j-1} + B_j\,dq_j
\end{aligned}
$$

where $A_j$ and $B_j$ are constant coefficients for each stage
given in Table 2.

Table 2. Coefficients of the five-stage LSERK time integrator

stage   A               B
1        0              0.097618354
2       -0.481231743    0.412253292
3       -1.049562606    0.440216964
4       -1.602529574    1.426311463
5       -1.778267193    0.197876053

Explicit RK methods have a stringent Courant-Friedrichs-Lewy
(CFL) requirement that often prohibits them from
being used in operational settings. NUMA includes Implicit-Explicit
(IMEX) methods that allow for much larger time
steps; however, those have not yet been ported to the GPU.
The first goal of the GPU project focuses on porting explicit
time integration methods which are known to scale well on
many processors and are also easier to port to GPUs. Implicit
methods require the solution of a coupled system of linear
equations; therefore, depending on the chosen iterative solver
and preconditioner, performance on a cluster of computers
and GPUs may be severely impacted. For this reason, we
reserve the porting of the implicit solvers in NUMA to a
future study.
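The two-register update of the low-storage scheme described in this section can be sketched as follows. The coefficients are those listed in Table 2, with the second A coefficient taken as negative; that sign is an assumption made here because, with a positive value, the stage weights for a constant right-hand side do not sum to one. The function name `lserk_step` and the scalar test problem are illustrative.

```python
import math

# Table 2 coefficients; A[1] is assumed negative (see the note above).
A = [0.0, -0.481231743, -1.049562606, -1.602529574, -1.778267193]
B = [0.097618354, 0.412253292, 0.440216964, 1.426311463, 0.197876053]


def lserk_step(q, dq, dt, rhs):
    """Advance q by one time step in place; q and dq are the two storage registers."""
    for a, b in zip(A, B):
        r = rhs(q)                       # a GPU kernel would fuse this with the update
        for i in range(len(q)):
            dq[i] = a * dq[i] + dt * r[i]
            q[i] += b * dq[i]
    return q


# Example: dq/dt = -q with q(0) = 1, integrated to t = 1.
q, dq = [1.0], [0.0]
dt, steps = 0.01, 100
for _ in range(steps):
    lserk_step(q, dq, dt, lambda s: [-v for v in s])
error = abs(q[0] - math.exp(-1.0))
print(f"q(1) = {q[0]:.8f}, error = {error:.2e}")
```

Since the first A coefficient is zero, the residual register is effectively reset at the start of every time step, so no third array is needed to carry information between steps.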
5 Porting NUMA to the GPU
This section describes the implementation of the unified CG
and DG NUMA on the GPU using the OCCA programming
language (Medina et al. 2014). Before we delve into details
of the implementation, a few words on GPU computing
in general and design considerations are warranted. GPUs
provide the most cost-effective computing power to date;
however, they come with the challenge of adapting existing
code originally written for the CPU to a GPU platform.
5.1 Challenges
First of all, the candidate program to be ported to the GPU
should be able to handle massively fine grained parallelism
via threads. Even though current general purpose GPU
computing offers a lot more flexibility than the days when
they were exclusively used for image rendering, there are still
limitations on what can be done efficiently on GPUs. Single
Instruction Multiple Data (SIMD) programs suited for vector
machines are automatically candidates for porting to GPUs.
More flexibility is achieved on the GPU by limiting SIMD
computation to a small group of threads, 32 threads known
as a warp in NVIDIA terminology, and then scheduling
multiple warps to work on different tasks. In the code design
phase, it is often convenient to think of warps as the smallest
computing unit for the following reason. If even one thread in
a warp takes a different branch, warp divergence
occurs: the warp executes both branches one after the other, with the
threads that are not on the active branch masked off, which for a two-way
branch can result in a performance loss of up to 50%.
The second issue concerns memory management. Though
the many cores in GPUs provide a lot of computational
power, they can only be harnessed fully if unrestricted
by memory bandwidth limitations. Programs running on a
single core CPU are often compute-bound because more
emphasis is given to data caching in CPU design. In contrast,
most of the chip area in GPUs is devoted to compute units,
and as a result, programs running on a GPU tend to be
memory-bound. Programmers have to carefully manage the
different memory resources available in GPUs. To give an
idea of the complexity of memory management, we briefly
describe the six types of memory in NVIDIA GPUs: global,
local, texture, constant, shared and register memory, ordered
from highest to lowest latency. Register memory is the fastest
but is limited in size and only visible to one thread. Shared
memory is fast and visible to a block, a group of warps,
and therefore it is an invaluable means of communication
between threads. Constant and texture memory are read-
only memory that can be used to reduce memory traffic.
Local memory is cached but is only accessible by one
thread; automatic variables that cannot be held in registers
are offloaded to the slow local memory. Global memory,
which is accessible by all threads, is the main memory of
GPUs where the data is stored.
5.2 Design choices
Global memory bandwidth limitation and high latency of
access is often the bottleneck of performance in GPU
computing. To minimize its impact on performance, memory
transactions can be coalesced for a group of threads
accessing the same block in memory. The warp scheduler
also helps to alleviate this problem by swapping out warps
that are waiting for a global memory transaction to complete
for those that are ready to go. There are two approaches to
storing data. The first approach, Array of Structures (AoS),
stores all variables at a given LGL node contiguously in
memory. This is suitable if computation is done for all the
variables in one pass. If, on the other hand, only a subset of the
variables is required at a time, a second approach, Structure
of Arrays (SoA), is suitable. While the SoA layout often degrades
performance on the CPU due to reduced cache efficiency, it
can significantly improve performance on the GPU because
of coalesced memory transactions for a warp. The approach
we use is a mix of these two methods similar to the AoSoFA
(Array of structures of fixed arrays) described in (Allard et al.
2011), in which data for each element is stored in an SoA
manner, and thus an AoS for the whole domain. Using this
approach, scalar data for all nodes in an element is stored
contiguously in memory; this is repeated similarly for each
scalar variable. Variables that are often accessed together,
for instance coordinates (x, y, z) or velocity (u, v, w) can be
stored as one float3 on the GPU.
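The three layouts discussed above differ only in how a (node, variable) pair is mapped to a position in a flat array. The following sketch shows one possible set of index functions; the names and the argument order are illustrative and do not reproduce the NUMA data structures.

```python
def idx_aos(node, var, nvars):
    """Array of Structures: all variables of one node are adjacent in memory."""
    return node * nvars + var


def idx_soa(node, var, total_nodes):
    """Structure of Arrays: all nodes of one variable are adjacent in memory."""
    return var * total_nodes + node


def idx_aosoa(elem, local_node, var, nodes_per_elem, nvars):
    """Per-element SoA blocks stored one element after another (AoS over elements)."""
    return (elem * nvars + var) * nodes_per_elem + local_node


# With 64 nodes per element (order 3) and 5 variables, consecutive threads that
# read the same variable at consecutive nodes touch consecutive addresses.
first = idx_aosoa(0, 0, 2, 64, 5)
second = idx_aosoa(0, 1, 2, 64, 5)
print(first, second, second - first)
```

A stride of one between neighboring threads is what allows the memory transactions of a warp to be coalesced, while keeping each element's data in one contiguous block suits the element-by-element processing described next.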
Our choice of data layout is influenced by our design
decision to do computation on an element by element basis,
for instance launching as many threads as the number of
nodes for computing volume integrals, and as many as face
nodes for surface integrals (see Sec. 5.3.1 and 5.3.2). We
should note here that our approach has a downside in that
the number of threads launched for processing an element
could be small with low-order polynomial approximations;
also, the number of threads may not be a multiple of the warp
size. We address this problem by processing
multiple elements per block as will be explained in the
coming sections. In the SoA approach, these two problems
do not exist and the appropriate number of threads that fit the
GPU device could be launched to process LGL nodes even
from different elements simultaneously. The SoA approach
may be better for porting code to the GPU using, for instance,
OpenACC or other pragma based programming languages
where the user has less control of the device.
5.3 Unified CG and DG on the GPU
The implementation of CG done within the DG framework
differs only by the final DSS step required for imposing
the C0-continuity constraint instead of using the numerical
flux. Therefore, first we explain the implementation details
of nodal DG on the GPU and then that of the DSS
operation later. The three major computations in DG are
implemented in separate OCCA kernels: volume integration,
surface integration and time step update kernels. Other
major kernels are the boundary kernel required for imposing
boundary conditions, the project kernel for applying the
DSS operation for CG, and two kernels for stabilization:
a Laplacian diffusion kernel for applying second order
artificial viscosity to be used with CG, and a kernel for
computing the gradient required by the Local Discontinuous
Galerkin (LDG) method used for stabilizing DG; in future
work, we will select one stabilization method/kernel for both
methods using the primal form of the elliptic problem. For
the strong form DG discretization of the Euler equations, the
kernels represent the following integrals
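$$
\underbrace{\int_{\Omega_e} \psi \frac{\partial \mathbf{q}}{\partial t}\,d\Omega_e}_{\text{Update kernel}}
+ \underbrace{\int_{\Omega_e} \psi \nabla\cdot\mathbf{F}\,d\Omega_e}_{\text{Volume kernel}}
+ \underbrace{\int_{\Gamma_e} \psi\,\hat{\mathbf{n}}\cdot(\mathbf{F}^* - \mathbf{F})\,d\Gamma_e}_{\text{Surface kernel}}
= \int_{\Omega_e} \psi\,\mathbf{S}\,d\Omega_e
+ \underbrace{\int_{\Omega_e} \psi \nabla\cdot(\mu\nabla\mathbf{q})\,d\Omega_e}_{\text{Diffusion kernel}}
$$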
5.3.1 Volume kernel
The volume and surface integration
kernels are written in such a way that a CUDA thread block
processes one or more elements, and a thread processes
contributions from a single Legendre-Gauss-Lobatto (LGL)
node, i.e., the one-node-per-thread approach we mentioned
in the introduction. Gandham et al. (2014) mention that
for low order polynomial approximations, performance can
be improved by as much as five times by processing more
than one element per block. This is especially true for 2D
elements that were used in their study, which have fewer
nodes than the 3D elements we are using in this work. The
reason for this variation in performance with the number of
elements processed per block is the need for a block size that
best fits the underlying hardware limits. In traditional GPU
kernels, for instance the time step update kernel discussed
in Sec. 5.3.3, thread blocks are sized as multiples of the
warp size (32 threads) for best performance. However, for
the volume integration kernels, our algorithms are designed
such that one thread processes one LGL node, therefore the
number of threads launched is not a multiple of the warp size
but the number of nodes.
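The effect of the block size on warp utilization can be estimated with simple arithmetic. The sketch below counts the idle lanes in the last warp of a block for a given polynomial order and number of elements per block, and picks the smallest element count that minimizes the idle fraction. This is a hypothetical heuristic for illustration, not the tuning procedure used in this work, and the per-block thread limit `MAX_THREADS` is an assumed, device-dependent value.

```python
WARP_SIZE = 32         # NVIDIA warp size quoted in the text
MAX_THREADS = 1024     # assumed per-block thread limit; depends on the device


def idle_lanes(order, elems_per_block):
    """Unused thread slots in the last warp when one thread handles one LGL node."""
    threads = elems_per_block * (order + 1) ** 3
    warps = -(-threads // WARP_SIZE)          # ceiling division
    return warps * WARP_SIZE - threads


def best_elems_per_block(order, max_threads=MAX_THREADS):
    """Smallest element count per block that minimizes the idle-lane fraction."""
    nodes = (order + 1) ** 3
    if nodes > max_threads:
        raise ValueError("one element already exceeds the block size limit")
    candidates = range(1, max_threads // nodes + 1)
    return min(candidates, key=lambda nk: (idle_lanes(order, nk) / (nk * nodes), nk))


for order in (1, 2, 3, 7):
    nk = best_elems_per_block(order)
    print(f"order {order}: {(order + 1) ** 3:4d} nodes/element, "
          f"{nk:2d} element(s)/block, {idle_lanes(order, nk)} idle lanes")
```

In practice the best choice also depends on register and shared memory usage, so a count obtained this way is only a starting point for tuning.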
The main operation in the volume kernel is computing
gradients of the following eight variables (shown in Alg.
1): five prognostic variables (ρ, U, V, W, Θ), pressure P and
two variables for moisture (here, we omit precipitation). The
gradient of four variables, which are stored as one float4, can
be computed together for efficiency. The current work does
not include support for tracer transport, nor do we employ
the moisture dynamics even though the gradient is computed.
Once the gradients are calculated, we can construct the
divergence and complete the contribution of the volume
integration to the right-hand side vector as shown in Alg. 2.
Figure 1. Volume integral contribution of a horizontal and vertical slice of a 3D element with 4th-order polynomial approximation. Due to
the use of the tensor-product approach for hexahedral elements, contributions to a given node (red dot) come only from those
collinear with it along the x-, y- and z-directions, i.e., purple and green nodes on the horizontal slice and light-blue nodes on the vertical slice.
Algorithm 1 GPU algorithms for computing gradient, divergence and Laplacian.
procedure GRADDIV(q, grad, div, compute)            ▷ Compute gradient or divergence
    Memory fence
    for k, j, i ∈ {0 ... Nq} do                     ▷ Load field variables into shared memory
        sq[k][j][i] = q
    Memory fence
    for k, j, i ∈ {0 ... Nq} do
        qx = 0; qy = 0; qz = 0                      ▷ Compute local gradients
        for n ∈ {0 ... Nq} do
            qx += sD[i][n] × sq[k][j][n]            ▷ sD holds the derivatives of ψ at the LGL nodes, preloaded to shared memory
            qy += sD[j][n] × sq[k][n][i]
            qz += sD[k][n] × sq[n][j][i]
        if compute = GRAD then
            grad·x = (qx × Jrx + qy × Jsx + qz × Jtx)   ▷ Jrx, Jsx, ... are coefficients of the Jacobian matrix J
            grad·y = (qx × Jry + qy × Jsy + qz × Jty)
            grad·z = (qx × Jrz + qy × Jsz + qz × Jtz)
        else if compute = DIVX then
            div = (qx × Jrx + qy × Jsx + qz × Jtx)
        else if compute = DIVY then
            div += (qx × Jry + qy × Jsy + qz × Jty)
        else if compute = DIVZ then
            div += (qx × Jrz + qy × Jsz + qz × Jtz)

procedure GRAD(q, grad)                             ▷ Compute gradient of a scalar field
    call GRADDIV(q, grad, -, GRAD)

procedure DIV(q, div)                               ▷ Compute divergence of a vector field
    call GRADDIV(q·x, -, div, DIVX)
    call GRADDIV(q·y, -, div, DIVY)
    call GRADDIV(q·z, -, div, DIVZ)

procedure LAP(q, lap)                               ▷ Compute Laplacian of a scalar field
    call GRAD(q, gq)
    call DIV(gq, lap)
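A serial reference version of the gradient branch of Alg. 1 can be useful for checking a GPU kernel. The sketch below mirrors the loop structure of the algorithm for a single element with constant metric terms (an affine, box-shaped element is assumed); the function name `grad`, the dictionary of metric terms `J` and the test field are illustrative.

```python
def grad(q, D, J):
    """Physical-space gradient of nodal data q[k][j][i] on one element.

    D[a][n] is the 1D derivative matrix on the reference interval and J holds
    constant metric terms such as J["rx"] = d(xi)/dx (assumed affine element).
    """
    m = len(D)

    def zeros():
        return [[[0.0] * m for _ in range(m)] for _ in range(m)]

    gx, gy, gz = zeros(), zeros(), zeros()
    for k in range(m):
        for j in range(m):
            for i in range(m):
                qx = qy = qz = 0.0
                for n in range(m):              # only collinear nodes contribute
                    qx += D[i][n] * q[k][j][n]
                    qy += D[j][n] * q[k][n][i]
                    qz += D[k][n] * q[n][j][i]
                gx[k][j][i] = qx * J["rx"] + qy * J["sx"] + qz * J["tx"]
                gy[k][j][i] = qx * J["ry"] + qy * J["sy"] + qz * J["ty"]
                gz[k][j][i] = qx * J["rz"] + qy * J["sz"] + qz * J["tz"]
    return gx, gy, gz


# Linear element (two nodes per direction) covering the box [0,2] x [0,4] x [0,1].
ref = [-1.0, 1.0]
D = [[-0.5, 0.5], [-0.5, 0.5]]
J = {"rx": 1.0, "sx": 0.0, "tx": 0.0,
     "ry": 0.0, "sy": 0.5, "ty": 0.0,
     "rz": 0.0, "sz": 0.0, "tz": 2.0}

# Nodal values of q = 2x + 3y - z with x = xi + 1, y = 2(eta + 1), z = (zeta + 1)/2.
q = [[[2.0 * (a + 1.0) + 3.0 * 2.0 * (b + 1.0) - 0.5 * (c + 1.0) for a in ref]
      for b in ref] for c in ref]
gx, gy, gz = grad(q, D, J)
print(gx[0][0][0], gy[0][0][0], gz[0][0][0])
```

On the GPU the three outer loops are replaced by the thread indices, and the inner loop over `n` reads from shared memory, so each thread performs the same sequence of operations as the body of this loop nest.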
For low order polynomials, we can launch one thread
per node and perhaps more by processing multiple elements
per block. This approach works for a maximum polynomial
order of seven. The reason why we cannot use this approach
for polynomial orders higher than seven is twofold: first, the
number of threads in a block, $(7 + 1)^3 = 512$, approaches
the hardware block size limit. Second, we also approach the
shared memory limit at this polynomial order. Therefore,
we use two different approaches for volume integration:
one for polynomial orders up to seven (low order) and one for orders greater
than seven (high order). For low order polynomials, we
can pre-load all the element data (the two float4s) to shared
memory at start up, and then never read from global memory
again until the kernel completes.

Algorithm 2 Outline of a combined volume kernel for processing Nk elements per block with Ns slice workers. There are
Nq (the number of quadrature points) slices per element for volume kernels and Nf (the number of faces) for surface kernels.
procedure VOLUMEKERNEL(q, R)
    Shared data[Nk][Nq][Nq][Nq]                     ▷ Extended shared memory array
    for outerId0 do
        for innerId2 do
            wId = innerId2 mod Ns                   ▷ Slice worker id
            eId = innerId2 div Ns                   ▷ Multiple element processing
            for slId = wId to Nq step Ns do         ▷ Nq slices to work on
                e = Nk × outerId0 + eId             ▷ Element id
                call GRAD(qa, ∇qa)                  ▷ Compute gradient of (U, V, W, p) as one float4 variable qa
                DU = ∂xU + ∂yV + ∂zW
                R(ρ) = DU
                R(Θ) = θ × DU
                R(U) = U × DU + ∂xp + ∇U·𝐔
                R(V) = V × DU + ∂yp + ∇V·𝐔
                R(W) = W × DU + ∂zp + ∇W·𝐔
                call GRAD(qb, ∇qb)                  ▷ Compute gradient of (ρ, Θ, -, -) as one float4 variable qb
                DR = 𝐔·∇ρ
                R(Θ) -= Θ × DR − 𝐔·∇Θ
                R(U) -= U × DR
                R(V) -= V × DR
                R(W) -= W × DR
We can overcome the thread block size limitation for high order polynomial approximation by launching only the required number of threads to process one slice of a 3D element, i.e., $N_{LGL}^2$ nodes, as shown in Fig. 1. Then, we consider four ways of exploiting the shared memory.
The first approach, which we call the naive approach,
does not use shared memory but relies solely on the L1
cache if available. Otherwise, data is read directly from
global memory every time it is required. We can optimize this approach by adjusting the hardware division of L1 cache to shared memory to be 48 KB/16 KB instead of the default 16 KB/48 KB on the K20X GPU. Ignoring cache effects, the naive approach reads $3N_{LGL}$ values from memory to compute the gradient of a variable at a node, for a total of $N_{LGL}^3 \times 3N_{LGL}$ memory reads. The second approach, henceforth called Shared-1, loads a slice of data to shared memory, then computes the contribution to the gradient from those nodes on the slice. The data on the slice is re-used between the $N_{LGL}^2$ nodes on the same plane; therefore, a total of $N_{LGL}^3 \times N_{LGL}$ memory reads are required. The third
approach, henceforth called Shared-2, extends the previous method by storing the column of data in registers as suggested by Micikevicius (2009). The column of data may not fit in registers, in which case it is spilled to CUDA private memory, which resides in global memory. In the latter case, the method will be the same as the Shared-1 approach with the additional cost of copying data from global-to-global memory. The best case scenario is when $N_{LGL}^3$ memory reads are required, but this cannot be achieved in practice due to the limited number of registers per thread. The fourth approach does two passes on the data: the first pass calculates contributions to the gradient from nodes on the same slice, say the xy-plane; the second pass completes the gradient calculation by loading xz-slices and adding the contributions from nodes in the z-direction. This approach always requires $N_{LGL}^3 \times 2$ memory reads.
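The read counts quoted above can be tabulated with a short Python sketch. It ignores cache effects, as the text does, and reads the two-pass count as $N_{LGL}^3 \times 2$; the function name `memory_reads` is a hypothetical helper used only for this illustration.

```python
def memory_reads(n_lgl):
    """Global memory reads per element and variable for each gradient strategy."""
    nodes = n_lgl ** 3
    return {
        "naive": nodes * 3 * n_lgl,     # 3 * N_LGL reads for every node
        "shared-1": nodes * n_lgl,      # each slice reloaded once per plane
        "shared-2 (best case)": nodes,  # column kept entirely in registers
        "two-pass": nodes * 2,          # one horizontal and one vertical pass
    }


reads_order7 = memory_reads(8)  # polynomial order 7 has 8 LGL nodes per direction
```

For order 7 this gives 12288, 4096, 512 and 1024 reads respectively, so under these idealized counts the two-pass method needs twelve times fewer reads than the naive one.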
Even though the slicing approach helps to handle higher
order polynomial approximations, it hurts performance on
the other end of the spectrum. Assuming 512 threads per block and a hardware limit of 8 blocks per multi-processor, a 2D kernel using 3rd degree polynomial approximations will require $8 \times (3 + 1)^2 = 128$ threads, which yields 25% efficiency; on the other hand, a 3D kernel will occupy 100% of the device because $8 \times (3 + 1)^3 = 512$ threads are launched per multiprocessor. We would like to run with high
order polynomial approximations and also have kernels that
are efficient for low order polynomial approximations.
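The efficiency figures above follow from simple arithmetic, sketched here in Python under the same assumptions as the text (512 threads available and at most 8 resident blocks per multiprocessor). The function name `thread_efficiency` and its parameters are hypothetical.

```python
def thread_efficiency(order, dims, blocks_per_sm=8, threads_available=512):
    """Threads launched per multiprocessor and the resulting fraction of capacity."""
    threads = blocks_per_sm * (order + 1) ** dims
    return threads, min(1.0, threads / threads_available)


slice_kernel = thread_efficiency(3, 2)  # one 2D slice of a 3rd order element per block
cube_kernel = thread_efficiency(3, 3)   # one full 3D element per block
```

Here `slice_kernel` evaluates to `(128, 0.25)` and `cube_kernel` to `(512, 1.0)`, matching the 25% and 100% figures quoted in the text.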
These two competing goals of optimizing kernels for high-
order and low-order polynomials can be handled separately
with different kernels optimized for each. More convenient is
to write the volume kernel in such a way that it can process
multiple elements in a thread block with one or more slice
workers simultaneously. For this reason, the volume, surface
and gradient kernels accept parameters Nk, for number of
elements to process per block, and Ns, for the number of
slice workers per element. We should note here that, due to the run-time compilation feature of OCCA, parameters such as the polynomial order are compile-time constants; as a result, kernels are optimized for the selected set of parameters. For example, with Nk = 1 and Ns = 1, the kernels produced will be exactly the same as those we had before adding the multiple element per block and slicing approaches. If a kernel uses shared memory to store data for each element processed per block and slice worker, its shared memory consumption will increase in proportion with Nk × Ns, as shown in Alg. 2.
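The index arithmetic that Alg. 2 uses to distribute work can be sketched in Python as follows. The sketch assumes that the inner index `innerId2` runs over Nk × Ns values, which is our reading of the algorithm rather than something the text states; `slice_assignment` is a hypothetical name.

```python
def slice_assignment(nk, ns, nq):
    """Map each inner index to (element within block, slices it processes)."""
    work = {}
    for inner_id2 in range(nk * ns):
        w_id = inner_id2 % ns   # slice worker id
        e_id = inner_id2 // ns  # element id within the block
        work[inner_id2] = (e_id, list(range(w_id, nq, ns)))
    return work


# Two elements per block, two slice workers per element, four slices per element
work = slice_assignment(nk=2, ns=2, nq=4)
covered = sorted((e, s) for e, slices in work.values() for s in slices)
```

Each (element, slice) pair appears exactly once in `covered`, so the workers of one element split its slices in a round-robin fashion without overlap.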
5.3.2 Surface kernel The surface integration, shown in Alg. 3, is conducted in two stages in accordance with Klöckner et al. (2009): the flux gather stage collects
contributions of elements to the numerical flux at face nodes,
and the lifting stage integrates the face values back into the
volume vector. Lifting, in our case, is a simple multiplication
by a factor computed from the ratio of weighted face and
volume Jacobians; this is a result of the tensor-product
approach in conjunction with the choice of integration rule
that results in a diagonal lifting matrix. If the numerical
flux at a physical boundary is pre-determined, for instance
in the case of a no-flux boundary condition, it is directly
set to the prescribed value before lifting. The workload in
surface integration can be split into slices similar to that
used for volume integration. The number of slices available
for parallelization in this case is the number of faces of an
element, six for hexahedra. However, since two faces that
are adjacent to each other share an edge, they cannot be
processed by two slice workers simultaneously. One solution
is to reduce the parallelization to pairs of opposing faces,
thereby avoiding the conflict that arises at the edges when
updating flux terms as shown in Fig. 2. A second option is to use hardware atomic operations to update the flux terms. However, hardware atomic operations on double precision floating point values are not supported by all GPUs at this time.
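The reason opposing faces can be processed together can be checked with a small Python sketch that enumerates the nodes on each face of a hexahedral element. The face pairs are taken from Alg. 3, but the geometric meaning of each face number below (0/5 bottom/top, 1/3 front/back, 4/2 left/right) is an assumption made for illustration; `face_nodes` and `shared_nodes` are hypothetical helpers.

```python
PAIRS = ((0, 5), (1, 3), (2, 4))  # face pairs from Alg. 3


def face_nodes(face, nq):
    """Set of (k, j, i) node indices lying on one face of an nq^3-node element."""
    last = nq - 1
    # (axis, fixed index) for each face; this numbering is assumed, not from the paper
    axis, value = {0: (0, 0), 5: (0, last),
                   1: (1, 0), 3: (1, last),
                   4: (2, 0), 2: (2, last)}[face]
    nodes = set()
    for a in range(nq):
        for b in range(nq):
            idx = [a, b]
            idx.insert(axis, value)
            nodes.add(tuple(idx))
    return nodes


def shared_nodes(f, g, nq):
    """Number of nodes two faces have in common (potential write conflicts)."""
    return len(face_nodes(f, nq) & face_nodes(g, nq))


opposing = [shared_nodes(f, g, 4) for f, g in PAIRS]
adjacent = shared_nodes(0, 1, 4)
```

With four nodes per direction, every opposing pair shares no nodes, whereas the adjacent faces 0 and 1 share the four nodes of their common edge, which is where unsynchronized updates would collide.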
5.3.3 Update kernel The time step update kernel is
relatively straightforward to implement because we are using
explicit time integration, in which new values at a node
are calculated solely from old values at the same node.
However, explicit time stepping is only conditionally stable
depending on the Courant number. The implementation of
implicit-explicit and fully implicit time stepping methods,
which require the solution of a linear system of equations,
is postponed to the future. For now, we implement the low-
storage fourth order RK method of Carpenter and Kennedy
(1994) by storing the solution at the previous time step and
its residual. Since there is no distinction between nodes in
different elements for this particular kernel, we can select the
appropriate block size that best fits the hardware, e.g. 256 in
5.3.4 Project kernel The direct stiffness summation
(DSS) operation is implemented in two steps, namely gather
and scatter stages. The DSS kernel, shown in Alg. 4, accepts
a vector of node numbers in Compressed Sparse Row (CSR)
format. This vector is used to gather local node values and put the result in global nodes, each of which may be mapped to multiple local nodes. One thread is launched for each
global node to accumulate the values from all local nodes
sharing this global node. As a result, no conflict will arise
while accumulating values because the gather at a node
is done sequentially by the same thread. For the single
GPU implementation, we can immediately start the scatter
operation which does the opposite operation of scattering the
gathered value back to the local nodes. However, a multi-
GPU implementation requires communication of gathered
values between GPUs before scattering as will be discussed
in Sec. 6.
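A serial Python sketch of the gather and scatter stages, using the same CSR layout as Alg. 4, is given below. The outer loop over global nodes stands in for the one-thread-per-global-node launch; the two-element mesh and the weights are made-up illustration data, and `dss` is a hypothetical name.

```python
def dss(q, starts, indices, wgt):
    """Gather weighted local values per global node, then scatter them back in place."""
    n_global = len(starts) - 1
    q_cont = [0.0] * n_global
    for n in range(n_global):  # one GPU thread per global node in Alg. 4
        gathered = 0.0
        for m in range(starts[n], starts[n + 1]):  # gather stage
            ind = indices[m]
            if ind >= 0:
                gathered += q[ind] * wgt[ind]
        q_cont[n] = gathered
        for m in range(starts[n], starts[n + 1]):  # scatter stage
            ind = indices[m]
            if ind >= 0:
                q[ind] = gathered
    return q_cont


# Two 1D linear elements: local nodes (0, 1) and (2, 3); local nodes 1 and 2 coincide
starts = [0, 1, 3, 4]        # CSR row pointers, one row per global node
indices = [0, 1, 2, 3]       # local node ids grouped by global node
wgt = [1.0, 0.5, 0.5, 1.0]   # assumed weights; equal masses on the shared node
q = [1.0, 2.0, 4.0, 5.0]
q_cont = dss(q, starts, indices, wgt)
```

After the call, `q_cont` is `[1.0, 3.0, 5.0]` and both copies of the shared node in `q` hold the value 3.0. Since each global node is handled by a single loop iteration, no two iterations write to the same entry.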
5.3.5 Diffusion kernels For the purposes of the current work, we shall use constant second order artificial viscosity to stabilize both the CG and DG methods in NUMA. The stabilizing term, shown in Eq. (5), is in divergence form $\nabla \cdot (\mu \nabla q)$ so that we will be able to use dynamic viscosity methods in the future. However, we use constant viscosity in the current work, which reduces the stabilizing term to a Laplacian operator $\mu \nabla^2 q$.
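For stabilizing CG, we use the primal form discretization of the Laplacian operator. Let us start with the DG discretization with numerical flux $\nabla q^*$ given in weak form as
$$\int_{\Omega_e} \psi_i \nabla \cdot (\mu \nabla q) \, d\Omega_e = \underbrace{\int_{\Gamma_e} \psi_i \hat{\mathbf{n}} \cdot (\mu \nabla q^*) \, d\Gamma_e}_{\text{surface integral}} - \underbrace{\int_{\Omega_e} \nabla \psi_i \cdot (\mu \nabla q) \, d\Omega_e}_{\text{volume integral}}$$
and in the strong form as
$$\int_{\Omega_e} \psi_i \nabla \cdot (\mu \nabla q) \, d\Omega_e = \underbrace{\int_{\Gamma_e} \psi_i \hat{\mathbf{n}} \cdot (\mu \nabla q^* - \mu \nabla q) \, d\Gamma_e}_{\text{surface integral}} + \underbrace{\int_{\Omega_e} \psi_i \nabla \cdot (\mu \nabla q) \, d\Omega_e}_{\text{volume integral}}$$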
If we then ensure $C^1$-continuity in the CG discretization, i.e., by applying DSS on the gradient so that $\nabla q^* = \nabla q$, the surface integral term disappears from the strong form formulation. The weak form CG formulation will still retain the surface integral term despite DSS; however, this term needs to be evaluated only at physical boundaries because it cancels out at interior boundaries due to $\nabla q^+ = \nabla q^-$. In addition, the term completely disappears if no-flux boundary conditions are used; dropping the surface integral term in other cases results in an inconsistent method, but one that could still be feasible for the purpose of numerical stabilization. The kernel for computing the volume contribution of the strong form discretization is already given in Alg. 1. The volume kernel for the weak form discretization is shown in Alg. 5. The first step in this kernel is to load the field variable q into the fast shared memory. Then, we compute and store the local gradients at each LGL node similar to what is done in the volume kernel.
The shared memory requirement of this kernel is rather
high due to the need for temporarily storing the gradients
besides the field variable. On the other hand, the mixed form
stabilization method we use for DG, i.e. by computing and
storing the gradient in global memory, puts less stress on
shared memory requirement, while being potentially slower.
Hyper-diffusion can also be used but in order to simplify the exposition, we shall only remark on second order diffusion.
Figure 2. Coloring of faces for parallel computation of surface integral. Opposing faces can be processed simultaneously because there are no shared edges between them.
Algorithm 3 Surface kernel
map[3][2] = ((0,5),(1,3),(2,4))    ▷ Pairs of faces, shown in Fig. 2, for parallel computation
procedure SURFACEKERNEL(q, R)
  for outerId0 do
    for innerId2 do
      wId = innerId2 mod Ns    ▷ Slice worker Id
      eId = innerId2 div Ns    ▷ Element Id
      for wId to 2 step Ns do
        for b = 0 to 2 do
          slId = map[b][wId]    ▷ Get face
          for j,i ∈ {0 ... Nq} do
            e = Nk × outerId0 + eId
            Load face normal $\hat{\mathbf{n}}$ and lift coefficient $L = w_{ij} J_{ij} / (w_{ijk} J_{ijk})$
            Load $q^+$ and $q^-$ for current node and adjoining node in the other element
            Compute maximum wave speed $|\lambda| = |\hat{\mathbf{n}} \cdot \mathbf{u}| + \sqrt{\gamma p / \rho}$
            Compute Rusanov flux $F^*(q) = \{F(q)\} - \hat{\mathbf{n}} \frac{|\lambda|}{2} [[q]]$
            R += $L \times \hat{\mathbf{n}} \cdot (F(q) - F^*(q))$
Algorithm 4 DSS kernel
procedure DSSKERNEL(Q, Qcont, starts, indices, nGlobal, wgt)
  for outerId0 do
    n = outerId0    ▷ Global node id
    if n < nGlobal then
      start = starts[n]    ▷ Read indices of local nodes for the DSS operation
      end = starts[n+1]
      gQ = 0    ▷ Gather stage of DSS
      for m = start to end do
        ind = indices[m]    ▷ Local node index
        if ind ≥ 0 then
          pw = wgt[ind]    ▷ DSS weight computed based on lumped mass coefficients
          gQ += Q[ind] × pw
      Qcont[n] = gQ
      for m = start to end do    ▷ Scatter stage of DSS
        ind = indices[m]
        if ind ≥ 0 then
          Q[ind] = Qcont[n]
The same kind of optimizations used for the volume kernel, such as splitting into slices and multiple elements per block processing, can be used here as well. After computing the local gradients, the $\nabla \psi_i \cdot \mu \nabla q_j$ term can be computed immediately afterwards; this is represented by the combined geometric factors $JJ^T$. Note that we use local memory fences to synchronize the read/write operations in shared memory. The fact that we use a discontinuous space even for CG forces us to apply DSS on both q, for which we already applied DSS at the end of the time step or RK-stage, and $\nabla q$, for which we ignore DSS for efficiency reasons discussed later in this section. In case of hyper-viscosity of
order 3 or more, the DSS on $\nabla q$ may be required to ensure at least $C^1$ continuity.
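The two-step structure of the weak form kernel (differentiate, scale, then apply the transposed derivative) can be illustrated in one dimension with a minimal Python sketch. It assumes a single 2nd order element on [-1, 1] with unit Jacobian, so the geometric factors reduce to the LGL quadrature weights; `weak_laplacian` is a hypothetical name and the surface integral term is deliberately omitted.

```python
# Differentiation matrix D[m][j] = psi_j'(x_m) and weights for LGL nodes (-1, 0, 1)
D = [[-1.5, 2.0, -0.5],
     [-0.5, 0.0, 0.5],
     [0.5, -2.0, 1.5]]
W = [1.0 / 3.0, 4.0 / 3.0, 1.0 / 3.0]


def weak_laplacian(q, mu=1.0):
    """Volume part of the weak form: -(1/w_i) * sum_m D[m][i] * w_m * mu * (D q)_m."""
    n = len(q)
    # Step 1: local gradient scaled by viscosity and quadrature weight
    flux = [mu * W[m] * sum(D[m][j] * q[j] for j in range(n)) for m in range(n)]
    # Step 2: apply the transposed derivative and divide by the lumped mass
    return [-sum(D[m][i] * flux[m] for m in range(n)) / W[i] for i in range(n)]


rhs = weak_laplacian([1.0, 0.0, 1.0])  # q = x^2 sampled at the nodes
```

At the interior node the result is 2, the exact second derivative of x^2. At the two end nodes the volume term alone gives -4, which shows why the surface integral (or a no-flux condition) matters at physical boundaries.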
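For stabilizing DG, we use the mixed form of Bassi and Rebay (1997). The viscous term $\nabla \cdot (\mu \nabla q)$ needed for stabilizing the Euler equations in Eq. (5) requires us to first compute the gradient $\nabla q$. We can write the computation of the stabilizing term in mixed form as follows
$$Q = \nabla q, \qquad \nabla \cdot (\mu \nabla q) = \nabla \cdot (\mu Q) \qquad (15)$$
where Q is the auxiliary variable. Because we are evaluating the stabilizing term explicitly, we can solve the equations in a straightforward decoupled manner (Bassi and Rebay 1997). The strong form DG discretization of the first part of Eq. (15) is as follows
$$\int_{\Omega_e} \psi_i Q \, d\Omega_e = \underbrace{\int_{\Gamma_e} \psi_i \hat{\mathbf{n}} \, (q^* - q) \, d\Gamma_e}_{\text{surface integral}} + \underbrace{\int_{\Omega_e} \psi_i \nabla q \, d\Omega_e}_{\text{volume integral}} \qquad (16)$$
We should note that the surface integral term is zero for strong form CG because $q^* = q$ due to continuity. Once we compute Q, we can then compute the viscous term via
$$\int_{\Omega_e} \psi_i \nabla \cdot (\mu \nabla q) \, d\Omega_e = \underbrace{\int_{\Gamma_e} \psi_i \hat{\mathbf{n}} \cdot (\mu Q^* - \mu Q) \, d\Gamma_e}_{\text{surface integral}} + \underbrace{\int_{\Omega_e} \psi_i \nabla \cdot (\mu Q) \, d\Omega_e}_{\text{volume integral}} \qquad (17)$$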
Following Bassi and Rebay (1997), we use centered fluxes for both q and Q such that $q^* = \{q\}$ and $Q^* = \{Q\}$. The mixed form is implemented directly by first computing the volume integral of the gradient in Eq. (16) using Alg. 1, and then modifying the result with the surface integral contribution computed using centered fluxes $q^* = \{q\}$. It is necessary to store Q in global memory, unlike the case for CG, and compute the surface integral using a different kernel because data is required from neighboring elements.
This difficulty would have also manifested itself in CG if we
chose to gather-scatter Q, which would require a separate
kernel for similar reasons, and force us to use the mixed
form. The fact that we need this term just for stabilization,
and not, for instance for the implicit solution of the Poisson
problem, gives us some leeway to its implementation on the
GPU for performance reasons. However, in the CPU version
of NUMA we apply the DSS operator (which requires inter-
process communication) right after computing the gradient
Q. The kernel for computing the surface gradient fluxes is
similar to the surface integration kernel discussed in Section
5.3.2 — with the only difference being that we use centered
fluxes instead of the upwind-biased Rusanov flux. Finally,
the volume and surface integral contributions of the viscous
term in Eq. (17) are added to the right-hand side vector in the
volume and surface kernels, respectively. In the future we
will study stabilization of DG using the Symmetric Interior
Penalty Method (SIPG) – which shares the same volume
integration kernel as the weak-form CG stabilization method.
6 Multi-GPU implementation
The ever increasing need for higher resolution in numerical
weather prediction (NWP) implies that such large scale
simulations cannot be run on a single GPU card due to
memory limitations. A practical solution is to cluster cheap
legacy GPU cards and break down the problem into smaller
pieces that can be handled by a single GPU card; however,
this necessitates communication between GPUs, which is often a performance bottleneck. We extend our single GPU implementation of NUMA to a multi-GPU version using the existing framework for conducting multi-CPU simulations on distributed memory computers (see Kelly and Giraldo (2012) for details). The communication between GPUs is done indirectly through CPUs, which is the reason why we were able to use the existing MPI infrastructure.
We should note that the latest technology in GPU hardware
allows for direct communication between GPUs but the
technology is not yet mature and also the GPU cards are more
6.1 Multi-GPU parallelization of EBG methods
The goal of parallelizing NUMA to distributed memory CPU clusters has already been achieved by Kelly and Giraldo (2012), in which linear scalability up to tens of thousands of CPUs was demonstrated. More recently, the scalability of the implementation was tested on the Mira supercomputer, located at Argonne National Laboratory, using 3.1 million MPI ranks (Müller et al. 2016). NUMA achieved linear scalability for both explicit and 1D implicit-explicit (IMEX) time integration schemes in global numerical weather prediction problems. The current work extends the capability of NUMA to multi-GPU clusters, which are known to deliver much more floating point operations per second (FLOPS) than multi-CPU clusters. In the following sections, we describe the parallel grid generation and partitioning, and the multi-GPU CG and DG implementations.
6.1.1 Parallel grid generation The grid generation and
partitioning stages are done on the CPU and then geometric
data is copied to the GPU once at start up. The reason for
this choice is mainly a lack of robust parallel grid generator
software with a capability of Adaptive Mesh Refinement
(AMR) on the GPU. Originally NUMA used a local grid
generation code and the METIS graph partitioning library for
domain decomposition; however, the need for parallel grid
generation and parallel visualization output processing was
exposed while conducting tests on the Mira supercomputer.
Even though a parallel version of METIS (ParMETIS) exists,
we chose to adopt the parallel hexahedral grid generation and
partitioning software p4est (Burstedde et al. 2011) mainly
because of the latter’s capability of parallel AMR. In static
AMR mode, p4est is in effect a parallel grid generator.
Dynamic AMR requires copying geometric data to the GPU
more than once, i.e., whenever AMR is conducted. For this
reason, recomputing all geometric data on-the-fly on the
GPU could potentially improve performance. ParMETIS is
a graph partitioning software and as such is not capable of
mesh refinements.
Algorithm 5 Laplacian diffusion kernel
procedure LAPLACE(Q, rhs, nu)
  Shared sq, sqr, sqs, sqt    ▷ All arrays of size [Nq][Nq][Nq]
  Memory fence
  for k,j,i ∈ {0 ... Nq} do    ▷ Load field variables into shared memory
    sq[k][j][i] = q
  Memory fence
  for k,j,i ∈ {0 ... Nq} do
    qr = 0; qs = 0; qt = 0    ▷ Compute local gradients in r-s-t
    for n ∈ {0 ... Nq} do
      qr += sD[i][n] × sq[k][j][n]    ▷ sD are ∇ψ at LGL nodes preloaded to shared memory
      qs += sD[j][n] × sq[k][n][i]
      qt += sD[k][n] × sq[n][j][i]
    sqr[k][j][i] = µ(G11 × qr + G12 × qs + G13 × qt)    ▷ Gs are coefficients of the symmetric $JJ^T$ matrix
    sqs[k][j][i] = µ(G12 × qr + G22 × qs + G23 × qt)
    sqt[k][j][i] = µ(G13 × qr + G23 × qs + G33 × qt)
  Memory fence
  for k,j,i ∈ {0 ... Nq} do
    lapq = 0
    for n ∈ {0 ... Nq} do
      lapq += sD[n][i] × sqr[k][j][n]
      lapq += sD[n][j] × sqs[k][n][i]
      lapq += sD[n][k] × sqt[n][j][i]
    rhs -= Jinv × lapq
6.1.2 Multi-GPU CG The coupling between sub-domains in the CG spatial discretization is achieved by the Direct Stiffness Summation (DSS) operator which imposes $C^0$
continuity of solutions at element interfaces. The DSS
operator is applied both to the mass matrix and the right-hand
side (RHS) vector. Therefore, a multi-GPU implementation
of CG requires communication between GPUs only for
applying DSS; in fact, we require GPU kernels for applying
DSS only on the RHS vector because the construction of the
mass matrix is done on the CPU. However, to apply DSS
on the RHS vector, we need several kernels. Alg. 6 outlines the steps required for applying DSS in a multi-GPU CG implementation. First, we need a kernel to do the intra-GPU gather operation on the RHS vector. Then, the values at inter-GPU boundaries are copied to a contiguous block of GPU global memory, after which the data is copied to the CPU. The CPUs then communicate the boundary data to construct the global RHS using the existing MPI infrastructure in NUMA. Once the CPUs complete the DSS operation, they copy the boundary data back to the GPU global memory. Contributions from neighboring processors are processed one by one to update the RHS vector; without this 'coloring' of neighboring processors, conflicts in RHS updates can occur at shared edges and corner nodes of elements. The last stage
does the intra-GPU scatter operation of DSS.
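The pack, exchange and update steps described above can be mimicked in plain Python with dictionaries standing in for devices and for MPI messages. This is only a sketch under simplifying assumptions: each device has already performed its local gather, both sides list their shared nodes in the same order, and the names `multi_device_dss`, `local_sums` and `shared` are hypothetical.

```python
def multi_device_dss(local_sums, shared):
    """Combine per-device gathered values at nodes shared between devices.

    local_sums[p][g] is the value device p gathered at global node g;
    shared[p][nb] lists the global nodes device p shares with neighbor nb.
    """
    # Pack boundary values into one contiguous buffer per neighbor
    send = {p: {nb: [local_sums[p][g] for g in nodes]
                for nb, nodes in shared[p].items()}
            for p in local_sums}
    result = {p: dict(values) for p, values in local_sums.items()}
    for p in result:
        # Neighbors are processed one at a time so updates to a node never collide
        for nb, nodes in shared[p].items():
            received = send[nb][p]  # stands in for the MPI exchange
            for g, value in zip(nodes, received):
                result[p][g] += value
    return result


local_sums = {0: {0: 1.0, 1: 2.0}, 1: {1: 3.0, 2: 4.0}}
shared = {0: {1: [1]}, 1: {0: [1]}}
assembled = multi_device_dss(local_sums, shared)
```

Both devices end up with the value 5.0 at the shared global node 1, after which each device can run its own scatter stage independently.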
6.1.3 Multi-GPU DG The coupling between sub-domains
in the DG spatial discretization is achieved by the definition
of the numerical flux at shared boundaries. DG lends itself
to a simple computation-communication overlap; though
CG can benefit from computation-communication overlap
as well, it requires more effort to do so (Deville et al.
2002). Overlapping is especially important in a multi-GPU
implementation to hide the latency associated with the data
transfer between the CPU and GPU. Inter-processor flux
calculation requires values from the left and right elements
sharing a face; however, intra-processor flux calculation
and computation of volume integrals can proceed while
the necessary communication for computing inter-processor
flux is going on. Alg. 7 shows an outline of a multi-
GPU DG implementation with communication-computation
overlap. The latest technology in GPUs allows for copying data asynchronously using streams. We overlap computation and communication using two device streams, one designated for each. The copying of data to and from the GPU is carried out on the copy stream (COPY), all computations on the GPU are done on the computation stream (COMP), and MPI communications between CPUs are on the host stream (HOST). A wait statement invoked on any device stream blocks the host thread until all operations on that stream come to completion. Even though we do not show it for the sake of simplicity, the communication of $\nabla q$ for the LDG stabilization method is also done similarly.
Algorithm 6 DSS on the GPU for the RHS vector
procedure DSS(RHS)
  Gather RHS    ▷ See Alg. 4 for details
  Copy boundary data to contiguous block of global memory
  Copy boundary data to CPU
  CPUs communicate and form the global RHS
  CPUs copy the assembled RHS back to the GPU
  for all neighbors do    ▷ To avoid conflict in RHS update
    Boundary data is used to update the RHS vector
  Scatter RHS    ▷ See Alg. 4 for details
Algorithm 7 Asynchronous Multi-GPU DG
procedure ASYNCHDGCOMM
  [COMP] Pack boundary data to a contiguous block of global memory
  [COMP] Wait
  [COPY] Start copying boundary data asynchronously from GPU to CPU
  [COMP] Start computing volume integrals and intra-processor flux
  [COPY] Wait
  [HOST] Send boundary data to neighboring processors asynchronously
  [HOST] MPI waitall
  [COPY] Start copying boundary data asynchronously from CPU to GPU
  [COPY] Wait
  [COMP] Compute inter-processor flux
7 Performance tests
7.1 Speedup results
First, we present speedup results for the GPU implementation of NUMA against the base Fortran code, i.e., the original CPU code rather than the OCCA implementation that we use on the GPUs. In Table 3, the time to solution of three test cases, solved using explicit DG, is presented. This information is useful to get a rough estimate of the performance per dollar on different GPU cards. We will present the details of the test cases later in Sec. 8; here we give the workload of each problem:
1. 2D Rising-thermal bubble: 100 elements with polynomial order 7, for a total of 51200 nodes
2. 3D Rising-thermal bubble: 1000 elements with polynomial order 5, for a total of 216000 nodes
3. Acoustic wave on the sphere: 1800 elements with polynomial order 4, for a total of 225000 nodes
where nodes, here, denote the number of gridpoints in the mesh. We obtained two orders of magnitude speedups on the newer GPU cards (GTX Titan Black and K20X) over a single core 2.2GHz AMD CPU. The specs for the GPU cards, bandwidth and double precision TFLOPS, are as follows: C2070: 144 GB/s, 0.5 TFLOPS; GTX Titan Black: 336 GB/s, 1.7 TFLOPS; and K20X: 225 GB/s, 1.3 TFLOPS.
Next, we present performance tests on the Titan supercom-
puter located at the Oak Ridge National Laboratory, where
each node has a K20X GPU card and an AMD Opteron 6274
CPU with 16 cores at 2.2 GHz. The GPU card has 2,688
cores at 0.732 GHz, 6 GB memory, 250 GB/s bandwidth
with peak performances of 1.31 and 3.95 teraflops in double
and single precision, respectively. The speedup results are
reported relative to the NUMA Fortran code using all 16
cores of the CPU. We will examine the different kernel
design and parameter choices we made in Sec. 5 using the
2D rising thermal bubble benchmark problem. The problem
size is increased progressively from 10x10=100 elements
until we fill up all the memory available on the device at
160x160=25600 elements. The first test result, presented in
Table 4, evaluates the performance of the cube volume kernel
at low-order polynomials using both OpenCL and CUDA
translations of the native OCCA code. Although NVIDIA
hardware includes interfaces for both OpenCL and CUDA,
we obtained better performance with CUDA kernels on
this particular hardware. Also, we observe markedly better speedups at polynomial orders 4 and 7 compared to other polynomial orders. The good performance at polynomial order 7 is due to the thread block size of $(7 + 1)^3 = 512$, which perfectly fits the hardware block size. Polynomial order 4 gives a thread block size of 125, which is only slightly less than 128. Therefore, this observation emphasizes the importance of selecting parameters to get optimum block dimensions that are multiples of the warp size.
GPUs are known to deliver higher performance using
single precision (SP) arithmetic than double precision (DP).
For instance, the SP peak performance of a K20X GPU is 3x
more than its DP peak performance. In Table 5, we present
the speedup results comparing SP and DP performance. We
obtain a maximum speedup of about 15x and 11x using
single and double precision calculations, respectively. The
reason for different speedup numbers for SP and DP is that
NUMA running on the CPU is able to achieve a speedup of
only 1.5x using SP, while the GPU performance more than
doubles using SP.
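The relation between the single and double precision speedups can be checked with a few lines of Python, using the timings reported in Table 5 for polynomial order 7 on the 120x120 element grid. The function name `speedups` is a hypothetical helper.

```python
def speedups(cpu_dp, cpu_sp, gpu_dp, gpu_sp):
    """GPU-over-CPU speedups and the gain each platform gets from single precision."""
    return {
        "dp": cpu_dp / gpu_dp,
        "sp": cpu_sp / gpu_sp,
        "cpu_gain": cpu_dp / cpu_sp,
        "gpu_gain": gpu_dp / gpu_sp,
    }


# Table 5, N = 7, 120x120 = 14400 elements (times in the units of the table)
s = speedups(cpu_dp=3158.43, cpu_sp=2227.29, gpu_dp=293.05, gpu_sp=148.76)
```

For this row the CPU gains a factor of about 1.42 from single precision while the GPU gains about 1.97, and the single precision speedup equals the double precision speedup multiplied by the ratio of these two gains, which reproduces the 10.78 and 14.97 entries of the table.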
For low order polynomials, we can process two or more
elements per block to get an optimal block size. Table 6
shows the performance comparison of this scheme using
one and two elements per block. We can see that the performance is significantly improved by processing two elements per block for up to polynomial order 5; the block size, when processing two elements per block, exceeds the hardware limit at polynomial orders above 5. The 100
elements simulation is not able to see any benefit from this
approach because the device will not be fully occupied when
processing two elements per block. All the other runs show
significant benefits from processing two elements per block,
except at polynomial order 4 — for which performance
remains more or less the same. We mentioned earlier that
polynomial order 4 gives a block size that is close to optimal,
hence, there is really no need to process more than one
element per block for this particular configuration.
Table 3. Speedup comparison between CPU and GPU for both single precision and double precision calculations. The test is conducted on three types of GPU cards: an old Tesla C2070 and two newer cards, GTX Titan Black and K20X GPUs. Two orders of magnitude performance improvement is obtained relative to a single core CPU with the newer cards. DP denotes double precision and SP single precision.
Test case | DP CPU | DP GPU | DP Speedup | SP CPU | SP GPU | SP Speedup
Tesla C2070 GPU vs One core of Intel Xeon E5645
2D rtb | 930.1 | 27.8 | 33.4 | 612.3 | 13.4 | 45.6
3D rtb | 4408.9 | 141.9 | 31.1 | 3097.0 | 54.5 | 56.8
Acoustic wave | 3438.8 | 96.7 | 35.6 | 2379.9 | 44.4 | 53.6
GTX Titan Black GPU vs One core of Intel Xeon E5645
2D rtb | 930.1 | 8.87 | 104.9 | 612.3 | 4.67 | 131.0
3D rtb | 4408.9 | 41.47 | 106.3 | 3097.0 | 18.68 | 165.8
Acoustic wave | 3438.8 | 26.72 | 128.7 | 2379.9 | 15.56 | 152.9
K20X GPU vs 16-cores of 2.2GHz AMD Opteron 6274
2D rtb | 103.17 | 13.97 | 7.38 | 77.75 | 6.89 | 11.28
3D rtb | 434.36 | 61.14 | 7.10 | 339.61 | 28.12 | 12.08
Acoustic wave | 166.06 | 21.10 | 7.87 | 132.46 | 11.24 | 11.78
Table 4. OpenCL vs CUDA: Speedup comparison between CPU and GPU for double precision calculations at different number of
elements and polynomial orders using OpenCL and CUDA translation of the native OCCA kernel code. The GPU card is K20X and
the CPU is a 16-core 2.2GHz AMD Opteron 6274. The timing (in sec) and speedup are given first for OpenCL and then for CUDA.
The results show that CUDA compiled kernels are optimized better. Also, polynomial orders 4 and 7 give better speedup numbers in all cases.
N 10x10=100 elements 30x30=900 elements 40x40=1600 elements
CPU GPU Speedup CPU GPU Speedup CPU GPU Speedup
2 1.46 0.59/0.52 2.47/2.81 10.62 2.57/2.17 4.13/4.90 18.83 4.34/3.70 4.34/5.09
3 2.68 0.69/0.59 3.88/4.54 22.01 3.56/3.06 6.18/7.19 41.53 5.84/5.04 7.11/8.24
4 5.30 0.97/0.86 5.46/6.16 46.45 5.50/5.12 8.45/9.07 81.91 9.27/8.69 8.84/9.43
5 8.12 1.47/1.37 5.52/5.93 77.03 10.53/9.88 7.32/7.80 137.49 18.33/17.11 7.50/8.04
6 13.89 2.27/2.11 6.11/6.58 122.27 17.24/16.11 7.09/7.59 210.35 30.15/28.15 6.98/7.47
7 20.49 2.68/2.41 7.64/8.50 195.61 20.82/18.87 9.40/10.37 343.74 36.36/33.05 9.45/10.40
N 80x80=6400 elements 120x120=14400 elements 160x160=25600 elements
CPU GPU Speedup CPU GPU Speedup CPU GPU Speedup
2 80.72 15.71/13.33 5.14/6.05 184.19 33.47/27.82 5.50/6.62 336.19 61.56/52.01 5.46/6.46
3 179.07 21.46/18.46 8.34/9.70 405.15 47.63/41.08 8.51/9.86 729.17 84.40/72.61 8.64/10.04
4 350.54 35.01/32.71 10.01/10.71 798.50 77.85/72.77 10.26/10.97 1392.60 138.64/129.64 10.04/10.74
5 587.17 71.90/67.03 8.17/8.76 1329.79 161.42/150.56 8.24/8.83 2352.46 286.74/267.48 8.20/8.79
6 925.25 118.81/110.92 7.79/8.34 2086.84 267.12/249.50 7.82/8.36 - - -
7 1406.61 142.67/130.16 9.86/10.81 3158.43 320.77/293.05 9.84/10.78 - - -
Table 5. Double vs Single Precision: Speedup comparison between CPU and GPU for single and double precision calculations at
different number of elements and polynomial orders using CUDA translation of OCCA kernel code. A maximum speedup of about
15x is observed. The CPU/GPU times and Speedups are given first for double precision and then for single precision. The GPU
card is K20X and the CPU is a 16-core 2.2GHz AMD Opteron 6274.
N 10x10=100 elements 30x30=900 elements 40x40=1600 elements
CPU GPU Speedup CPU GPU Speedup CPU GPU Speedup
2 1.46/1.39 0.52/0.47 2.81/2.96 10.62/9.98 2.17/1.57 4.90/6.36 18.83/17.41 3.70/2.53 5.09/6.88
3 2.68/2.60 0.59/0.49 4.54/5.31 22.01/19.66 3.06/1.87 7.19/10.51 41.53/34.72 5.04/3.06 8.24/11.35
4 5.30/4.51 0.86/0.54 6.16/8.35 46.45/35.19 5.12/3.03 9.07/11.61 81.91/63.55 8.69/5.07 9.43/12.53
5 8.12/7.23 1.37/0.77 5.93/9.39 77.03/61.35 9.88/4.86 7.80/12.62 137.49/107.30 17.11/8.35 8.04/12.85
6 13.89/11.18 2.11/1.07 6.58/10.45 122.27/95.67 16.11/7.71 7.59/12.41 210.35/166.40 28.15/13.49 7.47/12.33
7 20.49/15.97 2.41/1.31 8.50/12.19 195.61/135.21 18.87/9.65 10.37/14.01 343.74/236.09 33.05/16.86 10.40/14.00
N 80x80=6400 elements 120x120=14400 elements 160x160=25600 elements
CPU GPU Speedup CPU GPU Speedup CPU GPU Speedup
2 80.72/70.41 13.33/8.94 6.05/7.88 184.19/172.85 27.82/19.78 6.62/8.74 336.19/285.83 52.01/34.92 6.46/8.18
3 179.07/142.19 18.46/11.18 9.70/12.72 405.15/324.78 41.08/24.87 9.86/13.06 729.17/589.22 72.61/44.10 10.04/13.36
4 350.54/268.69 32.71/19.02 10.71/14.13 798.50/599.25 72.77/42.34 10.97/14.15 1392.60/1069.24 129.64/76.01 10.74/14.07
5 587.17/429.66 67.03/32.38 8.76/13.27 1329.79/1007.31 150.56/72.08 8.83/13.97 2352.46/1729.34 267.48/129.28 8.79/13.37
6 925.25/696.25 110.92/52.91 8.34/13.16 2086.84/1586.54 249.50/118.39 8.36/13.40 - - -
7 1406.61/968.10 130.16/66.41 10.81/14.58 3158.43/2227.29 293.05/148.76 10.78/14.97 - - -
We mentioned in Sec. 5 that using the vector datatype float4
to store field variables may improve performance because a
single load operation is issued to fetch a float4 value
instead of four. Table 7 compares the speedups obtained
using the float1 and float4 versions of the volume kernel. The
float4 version performs better in most cases; here again,
the performance at polynomial order 4 is more or less
the same.
We discussed in Sec. 5 different ways to handle the
hardware limitations encountered with high-order polynomial
approximations. Thread block size and shared memory
hardware limits allow us to use the volume kernel tested
so far only up to polynomial order 7. First, we compare the
performance of the four ways to use shared and L1 cache
memory; namely, the naive, Shared-1, Shared-2 and two-pass
(horizontal+vertical) methods. Fig. 3 shows that the two-pass
method performs the best, about two times better than the
naive approach, which does not use shared memory but relies
entirely on the L1 cache. The Shared-1 and Shared-2 methods
perform similarly; this implies that the Shared-2 approach
suggested in Micikevicius (2009) is not working as expected.
Even though we try to store the data in the vertical direction
Table 6. Multiple elements per block: The performance of the cube volume kernel can be improved by processing more than one
element in a thread block simultaneously. The GPU times and Speedups are given first for the 1 element-per-block and then for the
2 elements-per-block approaches. Improvement in performance is observed using 2 elements-per-block in all the cases except for
the 10x10 elements case, which does not fully occupy the GPU device when processing 2-elements-per-block. The GPU card is
K20X and the CPU is a 16-core 2.2GHz AMD Opteron 6274.
N 10x10=100 elements 30x30=900 elements 40x40=1600 elements
CPU GPU Speedup CPU GPU Speedup CPU GPU Speedup
2 1.46 0.52/0.57 2.81/2.56 10.62 2.17/1.81 4.90/5.87 18.83 3.70/2.85 5.09/6.61
3 2.68 0.59/0.61 4.54/4.39 22.01 3.06/2.93 7.19/7.51 41.53 5.04/4.74 8.24/8.76
4 5.30 0.86/0.92 6.16/5.76 46.45 5.12/5.74 9.07/8.09 81.91 8.69/9.81 9.43/8.35
5 8.12 1.37/1.37 5.93/5.92 77.03 9.88/9.68 7.80/7.96 137.49 17.11/16.72 8.04/8.22
N 80x80=6400 elements 120x120=14400 elements 160x160=25600 elements
CPU GPU Speedup CPU GPU Speedup CPU GPU Speedup
2 80.72 13.33/9.96 6.05/8.10 184.19 27.82/21.10 6.62/8.73 336.19 52.01/38.09 6.46/8.83
3 179.07 18.46/17.5 9.70/10.23 405.15 41.08/38.51 9.86/10.52 729.17 72.61/67.62 10.04/10.78
4 350.54 32.71/37.15 10.71/9.43 798.50 72.77/82.93 10.97/9.63 1392.60 129.64/147.61 10.74/9.43
5 587.17 67.03/65.2 8.76/9.00 1329.79 150.56/146.67 8.83/9.07 2352.46 267.48/260.89 8.79/9.02
Table 7. float1 vs float4: The effect of using float4 for computing the gradient in the volume kernel is compared against the version
of the volume kernel where one field variable is loaded. The CPU/GPU time and Speedups are given first for float1 and then for
float4. Some improvement is observed in most cases except when using polynomial order 4, which results in a good thread block
size. The GPU card is K20X and the CPU is a 16-core 2.2GHz AMD Opteron 6274.
N 10x10=100 elements 30x30=900 elements 40x40=1600 elements
CPU GPU Speedup CPU GPU Speedup CPU GPU Speedup
2 1.46 0.52/0.47 2.81/3.11 10.62 2.17/2.06 4.90/5.15 18.83 3.70/3.33 5.09/5.65
3 2.68 0.59/0.57 4.54/4.70 22.01 3.06/3.10 7.19/7.10 41.53 5.04/5.14 8.24/8.08
4 5.30 0.86/0.82 6.16/6.46 46.45 5.12/5.10 9.07/9.11 81.91 8.69/8.69 9.43/9.43
5 8.12 1.37/1.27 5.93/6.39 77.03 9.88/9.38 7.80/8.21 137.49 17.11/16.29 8.04/8.44
6 13.89 2.11/1.93 6.58/7.19 122.27 16.11/14.86 7.59/8.23 210.35 28.15/26.06 7.47/8.07
N 80x80=6400 elements 120x120=14400 elements 160x160=25600 elements
CPU GPU Speedup CPU GPU Speedup CPU GPU Speedup
2 80.72 13.33/12.00 6.05/6.73 184.19 27.82/26.50 6.62/6.95 336.19 52.01/46.85 6.46/7.18
3 179.07 18.46/18.99 9.70/9.43 405.15 41.08/42.66 9.86/9.50 729.17 72.61/74.93 10.04/9.73
4 350.54 32.71/32.88 10.71/10.66 798.50 72.77/73.00 10.97/10.94 1392.60 129.64/129.72 10.74/10.74
5 587.17 67.03/64.12 8.76/9.16 1329.79 150.56/144.01 8.83/9.23 2352.46 267.48/256.62 8.79/9.17
6 925.25 110.92/102.85 8.34/9.00 2086.84 249.50/222.08 8.36/9.40 - - -
Table 8. Higher order polynomials: The performance of the two pass method, with horizontal + vertical split, is evaluated at higher
order polynomials in double precision calculations. This kernel performs slower than the cube volume kernel when used for low
order polynomials, but it is the best performing version among the volume kernels we considered for high order. The GPU card is
K20X and the CPU is a 16-core 2.2GHz AMD Opteron 6274.
N 10x10=100 elements 30x30=900 elements 40x40=1600 elements
CPU GPU Speedup CPU GPU Speedup CPU GPU Speedup
8 31.17 4.77 6.53 271.50 33.19 8.18 492.23 58.17 8.46
9 43.21 5.90 7.32 373.63 44.52 8.39 666.37 77.84 8.56
10 59.89 7.14 8.38 493.54 55.02 8.97 909.75 96.49 9.42
11 79.86 9.61 8.31 691.65 75.52 9.15 1199.67 132.28 9.07
12 103.64 13.40 7.73 923.22 107.44 8.59 1713.01 190.06 9.01
13 131.74 16.89 7.80 1140.64 138.13 8.25 2009.99 243.28 8.26
14 169.49 23.52 7.20 1468.72 195.32 7.52 2568.77 340.99 7.53
15 220.91 28.36 7.79 1862.14 233.06 7.99 3352.22 410.42 8.17
[Figure 3 plot: relative speedup of the Shared-1, Shared-2 and two-pass methods; horizontal axis ticks from 7 to 15.]
Figure 3. Comparison of different ways of exploiting fast L1 cache and Shared memory in volume kernel. The speedups are
reported relative to the naive approach. The two-pass method performs the best due to better use of shared memory.
in registers, most of it spills to global thread private memory.
Because the polynomial order is high and we are loading all
field data (8 floats) to registers, the register pressure is too
high for the method to show any benefit.
In Table 8, we present the performance of the high-order
volume kernel that uses the two-pass method for polynomial
orders 8 to 15. Problems larger than 40x40 elements cannot
be solved at polynomial order 15 on this GPU because of the
limited memory of 6GB per card. We get a maximum speedup of
about 9x at higher-order polynomials, which is slightly less
than the 11x we obtained at low-order polynomials; this is
understandable because the two-pass method loads data twice
and performs calculations twice as well.
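The speedups reported in Table 8 are the ratio of CPU to GPU wallclock time. As a minimal sketch, the snippet below recomputes a few entries for the 40x40 element grid; the function name `speedup` is introduced here for illustration only, and the timings are copied from the table.

```python
def speedup(cpu_time, gpu_time):
    """Return the CPU-to-GPU speedup as the ratio of wallclock times."""
    return cpu_time / gpu_time

# (polynomial order, CPU time, GPU time) for the 40x40 element grid in Table 8
rows_40x40 = [(8, 492.23, 58.17), (10, 909.75, 96.49), (15, 3352.22, 410.42)]

for order, cpu, gpu in rows_40x40:
    print(f"N={order}: speedup = {speedup(cpu, gpu):.2f}")
```

Small differences in the last digit relative to the table can arise because the tabulated times are themselves rounded.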
7.2 Individual kernel performance tests
To evaluate the performance of individual kernels, we
measure the rate of floating point operations in GFLOPS/s
and the data transfer rate (bandwidth) in GB/s. Many GPU
applications tend to be memory-bound, hence bandwidth is
as important a metric as the rate of floating point operations.
The results will guide kernel optimization by classifying
each kernel as either compute-bound or memory-bound. A
convenient visualization is the roofline model (Williams et
al. 2009), which sets an upper bound on kernel performance
based on the peak GFLOPS/s and GB/s of the device. We use two
approaches to determine the GFLOPS/s and GB/s: hand-counting
the number of floating point operations and bytes loaded to
get an estimate of the arithmetic throughput and bandwidth,
and using a profiler to get the effective values.
The first results, shown in Figs. 4a-4d, are produced by
hand-counting the number of FLOPS and bytes loaded from
global memory per kernel execution. This would be enough
to calculate the arithmetic intensity (GFLOPS/GB) and
determine whether a kernel would be memory- or compute-
bound; however, we need to conduct actual simulations to
determine kernel execution time and, thus, the efficiency of
our kernels in terms of GFLOPS/s and GB/s. The roofline
plots show that our efficiency increases with problem size
and reaches about 80% for the volume and surface kernels,
while 100% efficiency is observed for the update and project
kernels. These tests are conducted on the isentropic vortex
problem (see Sec. 8.1), which concerns advection of a vortex
by a constant velocity. The GPU is a Tesla K20c GPU
with the following specification: 2,496 cores at 0.706 GHz,
5GB memory, 208 GB/s bandwidth with peak performances
of 1.17 teraflops and 3.52 teraflops in double and single
precision, respectively.
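The roofline bound itself can be written in a few lines. The sketch below assumes the simplest form of the model, min(peak compute, arithmetic intensity x peak bandwidth), together with the K20c peaks quoted above; the function names and the sample intensity of 2 FLOP/byte are illustrative and are not taken from the measurements.

```python
def roofline_gflops(intensity, peak_gflops, peak_gbs):
    """Attainable GFLOPS/s for a kernel with the given arithmetic
    intensity (FLOP per byte) under the basic roofline model."""
    return min(peak_gflops, intensity * peak_gbs)

def efficiency(measured_gflops, intensity, peak_gflops, peak_gbs):
    """Fraction of the roofline bound that a measured rate achieves."""
    return measured_gflops / roofline_gflops(intensity, peak_gflops, peak_gbs)

# Tesla K20c peaks quoted in the text
PEAK_GBS = 208.0
PEAK_DP = 1170.0
PEAK_SP = 3520.0

# Hypothetical kernel with 2 FLOP/byte: limited by bandwidth in both precisions
bound_dp = roofline_gflops(2.0, PEAK_DP, PEAK_GBS)
bound_sp = roofline_gflops(2.0, PEAK_SP, PEAK_GBS)
print(bound_dp, bound_sp)
```

Under this model, a kernel that streams data at the full 208 GB/s sits exactly on the diagonal of the roofline, which is what 100% efficiency means for a memory-bound kernel.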
The highest GFLOPS/s observed in any of the kernels
is about 320 GFLOPS/s for the horizontal volume kernel
at N= 10 using single precision arithmetic. The vertical
volume kernel is a close second, but the surface and update
kernels lag far behind in terms of GFLOPS/s performed.
The update kernel, which does the explicit Runge-Kutta time
integration, shows the highest bandwidth performance at
about 208GB/s, which is in fact the peak memory bandwidth
of the device. The projection kernel, which does the
scatter-gather operation of CG, comes in a close second. The volume
and surface kernels, though they have the highest arithmetic
intensity, lag behind in terms of bandwidth performance.
Therefore, no single kernel exhibits best performance in
terms of both GFLOPS/s and bandwidth.
The roofline plots reveal that the arithmetic intensity
(GFLOPS/GB) of the update, project and surface kernels
does not change with polynomial order. When
extrapolated, all three vertical lines hit the diagonal of
the roofline, confirming that these kernels are
memory-bound. The arithmetic intensity of the volume
kernels increases with polynomial order, complicating
their classification as either compute-bound or memory-bound;
however, for polynomial degrees up to 11 the kernels are still
well within the memory-bound region.
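Whether a kernel falls in the memory-bound or the compute-bound region depends on where its arithmetic intensity lies relative to the ridge point of the roofline, i.e., the peak GFLOPS/s divided by the peak GB/s. A minimal sketch using the K20c figures quoted earlier (1170 and 3520 GFLOPS/s, 208 GB/s); the helper names are assumptions made for illustration.

```python
def ridge_point(peak_gflops, peak_gbs):
    """Arithmetic intensity (FLOP/byte) at which the bandwidth and
    compute ceilings of the roofline intersect."""
    return peak_gflops / peak_gbs

def classify(intensity, peak_gflops, peak_gbs):
    """Label a kernel by comparing its intensity with the ridge point."""
    if intensity < ridge_point(peak_gflops, peak_gbs):
        return "memory-bound"
    return "compute-bound"

print(f"DP ridge: {ridge_point(1170.0, 208.0):.2f} FLOP/byte")
print(f"SP ridge: {ridge_point(3520.0, 208.0):.2f} FLOP/byte")
print(classify(1.0, 1170.0, 208.0))
```

Because the single precision ridge point lies at a higher intensity than the double precision one, the same kernel is more likely to be classified as memory-bound in single precision.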
The second group of kernel performance tests, shown in
Figs. 5a-5d, is conducted using a GTX Titan Black GPU.
For these tests we used nvprof to determine the effective
arithmetic throughput and memory bandwidth. As a result,
the plots obtained from this test are less smooth than
the previous plots, which were produced by hand-counting
the FLOPS and bytes of the kernels. Moreover, here we use the
cube volume kernels instead of the split horizontal+vertical
kernels. We also changed the test case to a 2D rising thermal
bubble problem, which requires numerical stabilization, to
invoke the diffusion kernel. The highest rate observed
in this test is 700 GFLOPS/s for the volume kernel using
single precision. To compare performance with the previous
tests, which were produced using a different GPU, we look at
the roofline plots instead. We expect the roofline plot for the
combined volume kernel to lean more towards the compute-bound
region because more floating point operations are
done per byte of data loaded. Indeed this turns out to be the
case even though the cube volume kernels were run only up to a
maximum polynomial order of 8. The diffusion kernels, used
for computing the Laplacian, show performance
characteristics similar to those of the volume kernels.
7.3 Scalability test
The scalability of the multi-GPU implementation is tested on
a GPU cluster, namely, the Titan supercomputer which has
18688 Nvidia Tesla K20X GPU accelerators. We conduct
a weak scalability test, where each GPU gets the same
workload, using the 2D rising thermal bubble problem
discussed in Section 8.2, using 900 elements per GPU with
polynomial order 7 in all directions. In a weak scaling
test, the time to solution should, ideally, stay constant as
the workload is increased; however, delays are introduced
due to the need for communication between GPUs. The
scalability result in Fig. 6 shows that the GPU version of
NUMA is able to achieve 90% scaling efficiency on tens
of thousands of GPUs. Different implementations of the
unified CG/DG algorithms are tested, among which, DG
with overlapping of computation and communication to hide
latency performed the best. Our current CG implementation
does not overlap communication with computation and, as a
result, its scalability suffers.
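Weak scaling efficiency is conventionally computed as the ratio of the single-GPU time to the time on many GPUs at fixed work per GPU. The snippet below is a sketch of that definition; the timings are made-up placeholders, not measurements read from Fig. 6.

```python
def weak_scaling_efficiency(t_single, t_multi):
    """Efficiency of a weak-scaling run: time on one GPU divided by the
    time on many GPUs when every GPU holds the same workload."""
    return t_single / t_multi

# Hypothetical wallclock times (seconds) keyed by number of GPUs
timings = {1: 100.0, 64: 105.0, 16384: 111.0}

for n_gpus, t in timings.items():
    eff = weak_scaling_efficiency(timings[1], t)
    print(f"{n_gpus:6d} GPUs: efficiency = {eff:.1%}")
```

With this definition, an efficiency of 90% means that the run on the full machine takes about 11% longer than the single-GPU run with the same per-GPU workload.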
The 900 element grid per GPU used for producing the
scalability plot is far from filling up the GPU memory, hence,
the scalability could be improved by increasing the problem
size further. In Fig. 7, we compare scalability for different
numbers of elements up to 64 GPUs, which is the point where
the efficiency of the parallel implementation flattens out. The
scalability improves by more than 20% going from a 100-element
to a 900-element grid per GPU.
In operational numerical weather prediction (NWP),
strong scaling on multi-GPU systems may be as important
as weak scaling because of limits placed on the simulation
[Figure 4 plots, panels (a) SP-CG, (b) DP-CG, (c) SP-DG and (d) DP-DG kernels performance. Each panel contains two plots with horizontal axis ticks from 1 to 11 and a roofline plot with a 208 GB/s bandwidth ceiling and a peak of 3520 GFLOPS/s (SP) or 1170 GFLOPS/s (DP). Kernels plotted for CG: horizontal volume, vertical volume, update and project kernels. Kernels plotted for DG: surface, horizontal volume, vertical volume and update kernels.]
Figure 4. Performance of individual kernels: The efficiency of our kernels is tested on a mini-app developed for this purpose. The
FLOPS and bytes for this test are counted manually. The volume kernel, which is split into two (horizontal + vertical), has the highest
rate of FLOPS/s. The time-step update kernel has the highest bandwidth usage at 208GB/s. The Single Precision (SP) and Double
Precision (DP) performance of the main kernels in CG and DG is shown in terms of GFLOPS/s, GB/s and roofline plots to
illustrate their efficiency. The GPU is a Tesla K20c.
time to make a day’s weather forecast. For this reason, we
also conducted strong scaling tests, shown in Fig. 7, on a
global scale simulation problem described in Sec. 8.5. Our
goal here is to determine the number of GPUs required for
a given simulation time limit for two resolutions: a coarse
grid of 13km resolution and a fine grid of 3km resolution.
[Figure 5 plots, panels (a) SP-CG, (b) DP-CG, (c) SP-DG and (d) DP-DG kernels performance. Each panel contains two plots with horizontal axis ticks from 1 to 9 (SP) or 1 to 7 (DP) and a roofline plot with a 334 GB/s bandwidth ceiling and a peak of 5121 GFLOPS/s (SP) or 1707 GFLOPS/s (DP). Kernels plotted for CG: volume, diffusion, gather, scatter, pressure, update, flux boundary, strong boundary and zero kernels. Kernels plotted for DG: volume, surface, gradient volume, gradient surface, pressure, update, boundary and zero kernels.]
Figure 5. Performance of individual kernels: The efficiency of our kernels is tested after being incorporated into the base NUMA
code. The measurements for this test are done using nvprof: effective memory bandwidth = dram_read_throughput +
dram_write_throughput, and effective arithmetic throughput = flop_dp/sp_efficiency. The Single Precision (SP) and Double Precision
(DP) performance of the main kernels in CG and DG is shown in terms of GFLOPS/s, GB/s and roofline plots to illustrate their
efficiency. The GPU is a GTX Titan Black.
The grids are cubed sphere with 6x112x112x4 elements§
and N=7 for the 13km resolution test, and 6x144x144x4
elements and N=7 for the 3km resolution test. The plot shows
that about 1500 and 8192 GPUs are required to bring down
§On cubed sphere grids, the total number of elements is denoted as
$N_{\mathrm{panels}} \times N_\xi \times N_\eta \times N_\zeta$, where $N_{\mathrm{panels}} = 6$ for the six panels of the
cubed sphere, $N_\xi = N_\eta$ are the numbers of elements in both horizontal
[Figure 6 plot: horizontal axis, number of GPUs; series: CG no-overlap, DG no-overlap and DG overlap.]
Figure 6. Scalability test of multi-GPU implementation of NUMA: The scalability of NUMA for up to 16384 GPUs on the Titan
supercomputer is shown. Each node of Titan contains a Tesla K20X GPU. An efficiency of about 90% is observed relative to a
single GPU. The test is conducted using a unified implementation of CG and DG. The efficiency of DG is significantly improved (by
about 20%) when overlapping communication with computation, which helps to hide both the data copying latency between CPU
and GPU and CPU-CPU communication latency.
[Figure 7 plots. Left: horizontal axis, number of GPUs (0 to 60); series: CG no-overlap, DG no-overlap and DG overlap for the 10x10, 30x30 and 60x60 element grids. Right: wallclock time in minutes versus number of GPUs (0 to 9000) for the 3km and 13km resolutions.]
Figure 7. (left) Scalability test of the multi-GPU implementation for different numbers of elements using up to 64 nodes of Titan. The
60x60 element grid gives much better scalability than the 10x10 grid; hence, we expect better scaling results with larger
problems. (right) Strong scalability test for 3km and 13km resolution global simulation on the sphere.
the simulation time below 100 min for the coarse and fine
grids, respectively. We believe that once we port the
implicit-explicit (IMEX) time integrators to the GPU, we can meet
simulation time limits with far fewer GPUs than the 3
million CPU threads needed to meet a 4.5 minute wall clock
time limit using the CPU version of NUMA (see
Müller et al. (2016) for details).
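For strong scaling, the quantity of interest is the number of GPUs that brings the wallclock time under a given limit. Assuming ideal strong scaling from a measured reference point (an optimistic assumption, since real efficiency drops as GPUs are added), a lower bound on the GPU count can be sketched as follows; the reference numbers are placeholders, not values read from Fig. 7.

```python
import math

def min_gpus_for_limit(ref_gpus, ref_minutes, limit_minutes):
    """Smallest GPU count that meets the time limit if the run scaled
    perfectly from the reference point (time inversely proportional
    to the number of GPUs)."""
    return math.ceil(ref_gpus * ref_minutes / limit_minutes)

def strong_scaling_efficiency(ref_gpus, ref_minutes, n_gpus, minutes):
    """Measured speedup over the reference divided by the ideal speedup."""
    return (ref_minutes / minutes) / (n_gpus / ref_gpus)

# Hypothetical reference run: 512 GPUs take 250 minutes
print(min_gpus_for_limit(512, 250.0, 100.0))
print(strong_scaling_efficiency(512, 250.0, 2048, 70.0))
```

In practice the required GPU count is read off the measured curve, and the gap between it and this ideal estimate indicates how much efficiency is lost to communication.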
8 Validation with benchmark problems
The GPU implementation of our Euler solver is validated
using a suite of benchmark problems showcasing various
characteristics of atmospheric dynamics. We consider
problems of different scale: cloud-resolving (micro-scale),
limited area (meso-scale) and global scale atmospheric
problems. Most of these test cases do not have analytical
solutions against which comparisons can be made. For this
reason, we first consider a rather simple test case of advection
of a vortex by a uniform velocity, which has an analytical
solution that allows us to compute the exact $L_2$ error and
establish the accuracy of our numerical model. The rest of
the test cases serve as a demonstration of its application to
practical atmospheric simulation problems.
8.1 2D Isentropic vortex problem
We begin verification with a simple test case that has an
exact solution to the Euler equations. The test case involves
advective transport of an inviscid isentropic vortex in free
stream flow. The problem is often used to test the ability
of numerical methods to preserve flow features, such as
vortices, in free stream flow for long durations. However,
the problem is linear, and hence not suitable for testing the
coupling of wave motion and advective transport that are the
causes of non-linearity in the Euler equations.
The free stream conditions are
$$\rho = 1, \quad u = U_\infty, \quad v = V_\infty, \quad \theta = \theta_\infty.$$
directions on each panel, and $N_\zeta$ is the number of elements in the vertical direction.
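Perturbations are added in such a way that the flow is
isentropic. The initial conditions are
$$(u', v') = \frac{\beta}{2\pi} \exp\left(\frac{1 - r^2}{2}\right) \left(-y + y_c,\; x - x_c\right),$$
$$\theta' = -\frac{(\gamma - 1)\beta^2}{8\gamma\pi^2} \exp\left(1 - r^2\right),$$
$$r = \sqrt{(x - x_c)^2 + (y - y_c)^2}.$$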
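The velocity perturbation of the vortex is straightforward to evaluate pointwise. The sketch below implements the standard isentropic vortex form, (u', v') = beta/(2 pi) exp((1 - r^2)/2) (-(y - y_c), x - x_c); the function name and the default arguments are illustrative choices and not part of NUMA.

```python
import math

def vortex_velocity_perturbation(x, y, xc=0.0, yc=0.0, beta=5.0):
    """Return (u', v') of the isentropic vortex at the point (x, y)."""
    r2 = (x - xc) ** 2 + (y - yc) ** 2
    factor = beta / (2.0 * math.pi) * math.exp(0.5 * (1.0 - r2))
    return (-factor * (y - yc), factor * (x - xc))

# The perturbation vanishes at the vortex center and is tangential elsewhere
print(vortex_velocity_perturbation(0.0, 0.0))
print(vortex_velocity_perturbation(1.0, 0.0))
```

The perturbation is perpendicular to the position vector measured from the vortex center, so it describes a pure rotation that is then carried along by the free stream.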
We simulate the isentropic vortex problem on a
$[-5\,\mathrm{m}, 5\,\mathrm{m}] \times [-5\,\mathrm{m}, 5\,\mathrm{m}] \times [-0.5\,\mathrm{m}, 0.5\,\mathrm{m}]$ computational
domain, with $(x_c, y_c, z_c) = (0, 0, 0)$, $\beta = 5$, $U_\infty = 1$ m/s,
$V_\infty = 1$ m/s and $\theta_\infty = 1$. The domain is subdivided into
22 x 22 x 2 elements with polynomial order of N = 7 in
all directions for a total of about 0.5 million nodes. The
simulation is run for 10s with a constant time step of
$\Delta t = 0.001$ s using the modified Runge-Kutta time integration
scheme discussed in Sec. 4. We anticipate the vortex to move
along the diagonal at a constant velocity while maintaining
its shape. This is indeed what is obtained, as shown in Fig. 8.
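The quoted problem size follows from the element count and the polynomial order: a hexahedral element of order N carries (N+1)^3 nodes when nodes are counted per element, as in DG storage. A short check, with function names chosen here for illustration:

```python
def nodes_per_element(order):
    """Number of nodes in a hexahedral element of the given polynomial order."""
    return (order + 1) ** 3

def total_nodes(nx, ny, nz, order):
    """Total node count when every element stores its own nodes."""
    return nx * ny * nz * nodes_per_element(order)

n_nodes = total_nodes(22, 22, 2, 7)   # 968 elements x 512 nodes each
n_steps = round(10.0 / 0.001)         # 10 s simulated with a 0.001 s time step
print(n_nodes, n_steps)
```

The node count comes out just under half a million, consistent with the figure quoted above.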
To evaluate the accuracy of the numerical model, we
compute the $L_2$ norm of the error $q - q_{\mathrm{exact}}$ over the domain
$\Omega$, i.e., $\| q - q_{\mathrm{exact}} \|_{L_2(\Omega)}$, for both single precision (SP) and
double precision (DP) arithmetic, where $q_{\mathrm{exact}}$ is the exact
solution. The DP run takes about 267s to complete while the
SP run takes 161s; however, the maximum error associated
with the SP calculations is much larger, as shown in Fig. 8e.
Therefore, if this reduction in accuracy is acceptable for a
certain application, then using single precision arithmetic
on the GPU is recommended. For this particular problem,
DG gives a lower maximum error than CG in both the SP
and DP calculations. The $L_2$ error of density decreases with
increasing polynomial order, as shown in Fig. 8e; the
per-second $L_2$ error shows the same behavior, affirming
that higher-order polynomials require less work per
degree of freedom. N = 11 is the maximum polynomial
order that we were able to run before running out of global
memory on the GPU.
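In discrete form, the L2 norm of the error is a weighted sum over the grid nodes. The sketch below assumes that the quadrature weights (including the Jacobian of the element mapping) are available as a flat list; how NUMA actually stores and integrates these quantities is not described here, so the data layout is an assumption.

```python
import math

def l2_error(q_num, q_exact, weights):
    """Discrete L2 norm of (q_num - q_exact): the square root of the
    weighted sum of squared nodal differences."""
    total = 0.0
    for qn, qe, w in zip(q_num, q_exact, weights):
        total += w * (qn - qe) ** 2
    return math.sqrt(total)

# Toy data: three nodes with equal weights
print(l2_error([1.0, 2.0, 3.0], [1.0, 2.0, 1.0], [0.5, 0.5, 0.5]))
```

In single precision the accumulation itself contributes round-off, which is one reason an SP error curve can level off earlier than the DP one as the polynomial order increases.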
8.2 2D Rising thermal bubble
A popular benchmark problem in the study of
non-hydrostatic atmospheric models is the 2D rising thermal
bubble problem first proposed by Robert (1993). The test
case concerns the evolution of a warm bubble in a neutrally
stratified atmosphere of constant potential temperature θ0.
The bubble is lighter than the surrounding air, hence, it
rises while deforming due to the shear induced by the
uneven distribution of temperature within the bubble. This
deformation results in a mushroom-like cloud. The initial
conditions for this test case are in hydrostatic balance, in
which pressure decreases with height.
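For an ideal gas with constant potential temperature $\theta_0$, hydrostatic balance can be integrated in closed form. The relation below is the standard result under that assumption and is not necessarily the exact expression used in NUMA; the surface pressure $p_0$, gravitational acceleration $g$, specific heat $c_p$, gas constant $R$ and Exner pressure $\pi$ are symbols introduced here for illustration.

```latex
\frac{dp}{dz} = -\rho g, \qquad
\pi(z) = \left(\frac{p}{p_0}\right)^{R/c_p} = 1 - \frac{g z}{c_p \theta_0}, \qquad
p(z) = p_0 \, \pi(z)^{c_p / R}
```

The Exner pressure decreases linearly with height in a neutrally stratified atmosphere, and the pressure follows from it through the power law on the right.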
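The potential temperature perturbation is given by
$$\theta' = \begin{cases} 0 & \text{for } r > r_c, \\ \dfrac{\theta_c}{2}\left(1 + \cos\left(\dfrac{\pi r}{r_c}\right)\right) & \text{for } r \le r_c, \end{cases}$$
$$r = \sqrt{(x - x_c)^2 + (z - z_c)^2}.$$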
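The cosine-shaped perturbation can be evaluated pointwise as in the sketch below, which uses the parameter values quoted for this test case as defaults; the function name is illustrative.

```python
import math

def cosine_bubble(x, z, xc=500.0, zc=350.0, rc=250.0, theta_c=0.5):
    """Potential temperature perturbation (K) of the rising thermal
    bubble at the point (x, z), with lengths in meters."""
    r = math.hypot(x - xc, z - zc)
    if r > rc:
        return 0.0
    return 0.5 * theta_c * (1.0 + math.cos(math.pi * r / rc))

print(cosine_bubble(500.0, 350.0))   # bubble center
print(cosine_bubble(625.0, 350.0))   # halfway to the edge
print(cosine_bubble(0.0, 0.0))       # outside the bubble
```

The profile reaches its maximum of theta_c at the center and goes smoothly to zero at r = r_c, which avoids a discontinuity in the initial condition.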
The parameters for the problem are similar to those found in
Giraldo and Restelli (2008) and Ullrich and Jablonowski (2012):
a domain of size [0m, 1000m] x [0m, 100m] x [0m, 1000m],
with $(x_c, z_c)$ = (500m, 350m), $r_c$ = 250m, $\theta_c$ = 0.5K,
$\theta_0$ = 300K and an artificial viscosity of $\mu = 0.8$ m$^2$/s for
stabilization. The domain is subdivided into 10 x 1 x 10
elements with polynomial order N = 6 set in all directions
for a total of about 180k nodes. The grid resolution is
about 25m; therefore, this problem can be considered
cloud-resolving. An inviscid wall boundary condition is used on all
sides.
The simulation is run for 1000s using the explicit
Runge-Kutta time integration method discussed in Sec. 4 with a
constant time step of $\Delta t = 0.02$ s. The status of the bubble at
different times is shown in Fig. 9. The results agree with those
reported in Giraldo and Restelli (2008). Most importantly,
the results are identical to those obtained using the CPU
version of NUMA, although the latter are not shown here.
We should mention here that matching the CPU version of
NUMA up to machine precision (e.g., $10^{-15}$) has been an
important goal in the development of the GPU code.
8.3 2D Colliding thermal bubbles
Next, we consider the case of colliding thermal bubbles
proposed in Robert (1993). The shape of the rising warm
bubble is now affected by the presence of a smaller sinking
cold bubble on the right-hand side. This destroys the
symmetry of the rising bubble. We should note here that
the rising thermal bubble problem in Sec. 8.2 could have
been solved considering only half of the domain because
of symmetry, which is not the case here. Also, the potential
temperature perturbation $\theta'$ is specified differently for this
problem. Within a certain radius $r_c$, the perturbation is a
constant $\theta_c$; outside of this inner domain, it is defined by a
Gaussian profile as
$$\theta' = \begin{cases} \theta_c & \text{for } r \le r_c, \\ \theta_c \exp\left[-\left((r - a)/s\right)^2\right] & \text{for } r > r_c. \end{cases}$$
The warm bubble is centered at $(x_c, z_c)$ = (500m, 300m),
with perturbation potential temperature amplitude of $\theta_c = 0.5$,
radius $a$ = 150m and $s$ = 50m. The initial conditions
for the cold bubble are: $(x_c, z_c)$ = (560m, 640m), $\mu = 0.8$
m$^2$/s, $\theta_c = 0.5$, $a$ = 0m and $s$ = 50m.
The result of the simulation is shown in Fig. 10 which
confirms the fact that the rising bubble indeed loses its
symmetry. The edge of the rising bubble becomes sharper
in some places from 600s onwards. Qualitative comparison
with the results shown in Robert (1993) and Yelash et al. (2014)
shows similar large-scale patterns, while small-scale patterns
differ depending on the grid resolution used. Here again, the
results of the CPU NUMA code are identical to those of the GPU code.
[Figure 8 panels: (a) t=0s; (b) t=2s; (c) t=4s; (d) density along the diagonal, plotted as density (kg/m3) versus distance (m); (e) L2-error of density.]
Figure 8. Isentropic vortex: Plots of density (ρ) of the vortex at different times show that the vortex, traveling at a speed of 1 m/s,
reaches the expected grid locations at all times. The density distribution within the vortex is maintained, as shown in plot 8d. A grid
of 22x22x2 elements with 7th degree polynomials is used.
[Figure 9 panels: (a) t=0s; (b) t=300s; (c) t=500s; (d) t=600s; (e) t=700s; (f) t=900s.]
Figure 9. Potential temperature perturbation $\theta'$ (K) contour plot for the 2D rising thermal bubble problem run with CG and an
artificial viscosity of $\mu = 1.5$ m$^2$/s for stabilization. Results are shown at t = 0, 300, 500, 600, 700 and 900 seconds. A grid of
10x1x10 elements with 6th degree polynomials is used.
8.4 Density current
The density current benchmark problem, first proposed in
(Straka et al. 1993), concerns the evolution of a cold bubble
in a neutrally stratified atmosphere of constant potential
temperature θ0. The dimensions of this test case are in
[Figure 10 panels: (a) t=0s; (b) t=300s; (c) t=500s; (d) t=600s; (e) t=700s; (f) t=900s.]
Figure 10. Colliding thermal bubbles. Evolution of potential temperature perturbation $\theta'$ (K) run with CG and an artificial viscosity
of $\mu = 1.5$ m$^2$/s for stabilization. Results are shown at t = 0, 300, 500, 600, 700 and 900 seconds. A grid of 10x1x10 elements with
6th degree polynomials is used.
[Figure 11 panels: (a) t=0s; (b) t=300s; (c) t=600s; (d) t=700s; (e) t=800s; (f) t=900s.]
Figure 11. Density current. Evolution of potential temperature perturbation $\theta'$ (K) run with CG and an artificial viscosity of $\mu = 75$
m$^2$/s for stabilization. Results are shown at t = 0, 300, 600, 700, 800 and 900 seconds. A grid of 128x1x32 elements with 4th degree
polynomials is used for an effective resolution of 50m in the $x$ and $z$ directions.
the range of typical mesoscale models in which hydrostatic
assumptions are valid. Because the bubble is colder than
the surrounding air, it sinks and hits the ground, then
moves along the surface while forming shearing currents,
which then generate Kelvin-Helmholtz rotors. The numerical
solution of this problem using high order methods often
requires the use of artificial viscosity or other methods for
stabilization. We use a viscosity of $\mu = 75$ m$^2$/s, following
Straka et al. (1993).
The problem setup is similar to that of the rising thermal
bubble test case with the following differences: a cold bubble
with $\theta_c = -15$ K in Eq. (18), a domain of $\Omega$ = [0, 25600m]
× [0, ] × [0, 6400m], and an ellipsoidal bubble with radii
of $(r_x, r_z)$ = (4000m, 2000m) centered at $(x_c, z_c)$ =
(0, 3000m). The problem is symmetrical; therefore, we only
[Figure 12 panels: (a) 0h; (b) 4h; (c) 7h.]
Figure 12. Propagation of an acoustic wave. The density perturbation after 0 hour, 4 hours and 7 hours. A cubed sphere grid with
6x10x10x3 elements with 3rd degree polynomial is used.
need to simulate half of the domain. The computational
domain is subdivided into 128 x 1 x 32 elements with
polynomial order of N= 4 set in all directions. With this
set of choices, the effective resolution of our model is 50m.
Inviscid wall boundary conditions are used at all sides.
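The effective resolution quoted here is the element width divided by the polynomial order, i.e., the average spacing between nodes. A quick check with the grid described above (the helper name is illustrative):

```python
def effective_resolution(domain_length, n_elements, order):
    """Average nodal spacing: element width divided by polynomial order."""
    return domain_length / (n_elements * order)

dx = effective_resolution(25600.0, 128, 4)  # horizontal direction
dz = effective_resolution(6400.0, 32, 4)    # vertical direction
print(dx, dz)
```

Both directions give the same spacing, so the grid is isotropic in the x-z plane.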
Fig. 11 shows the evolution of potential temperature of the
bubble up to 900 seconds. The vortical structures formed at
t=900 sec, namely three Kelvin-Helmholtz instability rotors,
are similar to those shown in Straka et al. (1993) and Ullrich and
Jablonowski (2012). The first rotor is formed near the leading
edge of the density current at 300 sec, then the second rotor
develops at the front of the density current around 600 sec.
Here again the GPU code matched results obtained using
NUMA’s CPU code, which has already been verified with
many other atmospheric benchmark problems.
8.5 Acoustic wave
To validate the GPU implementation for global scale
simulations on the sphere, we consider a test case of an
acoustic wave traveling around the globe first described in
(Tomita and Satoh 2005). Several issues emerge that did
not arise in the previous test cases. This test case validates
3D capabilities, curved geometry, metric terms, and a
non-constant gravity vector. The initial state for this problem
is hydrostatically balanced with an isothermal background
potential temperature of $\theta_0 = 300$ K. A perturbation pressure
$P'$ is superimposed on the reference pressure
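$$P' = f(\lambda, \phi)\, g(r),$$
$$f(\lambda, \phi) = \begin{cases} 0 & \text{for } r > r_c, \\ \dfrac{\Delta P}{2}\left(1 + \cos\left(\dfrac{\pi r}{r_c}\right)\right) & \text{for } r \le r_c, \end{cases}$$
$$g(r) = \sin\left(\frac{n_v \pi r}{r_T}\right),$$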
where $\Delta P = 100$ Pa, $n_v = 1$, $r_c = r_e/3$ is one third of the
radius of the earth $r_e = 6371$ km, and the model altitude is
$r_T = 10$ km. The geodesic distance $r$ is calculated as
$$r = r_e \cos^{-1}\left[\sin\phi_0 \sin\phi + \cos\phi_0 \cos\phi \cos(\lambda - \lambda_0)\right],$$
where $(\lambda_0, \phi_0)$ is the origin of the acoustic wave.
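The geodesic distance is the usual great-circle formula. A minimal sketch, assuming that longitude and latitude are given in radians; the function name is illustrative, and the clamping step is an addition for numerical safety that is not part of the formula in the text.

```python
import math

EARTH_RADIUS_M = 6371.0e3  # r_e from the text, in meters

def geodesic_distance(lam, phi, lam0, phi0, radius=EARTH_RADIUS_M):
    """Great-circle distance between (lam, phi) and (lam0, phi0),
    with longitude lam and latitude phi given in radians."""
    c = (math.sin(phi0) * math.sin(phi)
         + math.cos(phi0) * math.cos(phi) * math.cos(lam - lam0))
    # Clamp to [-1, 1] to guard acos against round-off
    c = max(-1.0, min(1.0, c))
    return radius * math.acos(c)

# Distance from the origin of the wave to its antipode on the equator
print(geodesic_distance(math.pi, 0.0, 0.0, 0.0))
```

Without the clamp, round-off can push the argument of the inverse cosine slightly outside [-1, 1] for coincident or antipodal points, which would raise an error.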
The grid is a cubed sphere with 6 × 10 × 10 × 3 elements, for a total
of 1800 elements, with 3rd order polynomials. No-flux
boundary conditions are applied at the bottom and top
surfaces. Visual comparison of the plots in Fig. 12, which show
the location of the wave at different hours, against the results
in Tomita and Satoh (2005) indicates that the two are quite
similar; the results also agree with those computed with the
CPU version of NUMA.
The speed of sound is about $a = \sqrt{\gamma p/\rho} = 347.32$ m/s
with the initial conditions of the problem. With this speed,
the acoustic wave should reach the antipode in about 16
hours. The result from the simulation indicates that the acoustic
wave has traveled 20.01 million meters within this time,
which gives an average sound speed of 347.55 m/s that is
close to the calculated sound speed (a relative error of less
than 1%).
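The arithmetic behind these estimates can be sketched as follows. For an ideal gas, $\sqrt{\gamma p/\rho} = \sqrt{\gamma R T}$. The values of $\gamma$ and $R$ below are generic dry-air constants, and the 300 K temperature is an assumption, so the computed speed is only expected to be close to the quoted 347.32 m/s. The function names are hypothetical.

```python
import math

R_EARTH = 6371.0e3  # radius of the earth [m], from the text


def sound_speed(temperature, gamma=1.4, gas_constant=287.0):
    """a = sqrt(gamma * R * T); gamma and R are assumed dry-air values."""
    return math.sqrt(gamma * gas_constant * temperature)


def antipode_travel_time_hours(speed):
    """Time [h] to travel half a great circle (pi * r_e) at a constant speed [m/s]."""
    return math.pi * R_EARTH / speed / 3600.0


def relative_error(measured, reference):
    return abs(measured - reference) / reference


a_estimate = sound_speed(300.0)             # assumed temperature of 300 K
hours = antipode_travel_time_hours(347.32)  # speed quoted in the text
err = relative_error(347.55, 347.32)        # simulated vs. calculated speed

print(f"estimated sound speed: {a_estimate:.2f} m/s")
print(f"time to antipode:      {hours:.2f} h")
print(f"relative error:        {100.0 * err:.3f} %")
```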
9 Conclusions
In this work, we have ported the Non-hydrostatic Unified
Model of the Atmosphere (NUMA) to the GPU and
demonstrated speedups of two orders of magnitude relative
to a single CPU core. Tests on one node of the Titan
supercomputer, consisting of a K20X GPU and a 16-core
AMD CPU, yielded speedups of up to 15x and 11x for the
GPU relative to the CPU using single and double precision
arithmetic, respectively. This performance is achieved by
exploiting the specialized GPU hardware using suitable
algorithms and optimizing kernels for performance.
NUMA solves the Euler equations using a unified
continuous and discontinuous Galerkin approach for spatial
discretization and various implicit and explicit time
integration schemes. GPU kernels are written for different
components of the dynamical core, namely, the volume
integration kernel, surface integration kernel, (explicit)
time update kernel, kernels for stabilization, etc. We use
algorithms suitable for the Single Instruction Multiple
Thread (SIMT) architecture of GPUs to maximize bandwidth
usage and rate of floating point operations (FLOPS) of
the kernels. Some kernels, for instance the volume
integration kernel, turned out to be compute-intensive (high in FLOPS), while
others, such as the explicit time integration kernel, are
bandwidth-intensive. Kernel optimizations should therefore be
geared towards achieving the maximum attainable efficiency
as bounded by the roofline model.
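The roofline bound can be illustrated with a short sketch: attainable performance is the smaller of the peak floating point rate and the product of arithmetic intensity and peak memory bandwidth. The device numbers below are hypothetical placeholders, not measurements of the K20X or of any NUMA kernel.

```python
def roofline_gflops(arithmetic_intensity, peak_gflops, peak_bandwidth_gbs):
    """Attainable GFLOP/s for a kernel with the given FLOPs-per-byte ratio."""
    return min(peak_gflops, arithmetic_intensity * peak_bandwidth_gbs)


def is_bandwidth_bound(arithmetic_intensity, peak_gflops, peak_bandwidth_gbs):
    """True if the kernel sits left of the roofline ridge point."""
    return arithmetic_intensity < peak_gflops / peak_bandwidth_gbs


# Hypothetical device: 1000 GFLOP/s peak, 200 GB/s bandwidth (ridge at 5 FLOP/byte).
PEAK_GFLOPS = 1000.0
PEAK_BW_GBS = 200.0

for name, intensity in [("low-intensity kernel", 0.5), ("high-intensity kernel", 12.0)]:
    bound = roofline_gflops(intensity, PEAK_GFLOPS, PEAK_BW_GBS)
    kind = "bandwidth" if is_bandwidth_bound(intensity, PEAK_GFLOPS, PEAK_BW_GBS) else "compute"
    print(f"{name}: at most {bound:.0f} GFLOP/s ({kind}-bound)")
```

Comparing a kernel's measured rate against this bound indicates how much headroom remains, and whether further tuning should target memory traffic or arithmetic.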
We have also implemented a multi-GPU version of
NUMA using the existing MPI infrastructure for multi-core
CPUs (Kelly and Giraldo 2012). Communication between
GPUs is done via CPUs by first copying the inter-processor
data from the GPU to the CPU. For the discontinuous
Galerkin (DG) implementation, we overlap communication
and computation to hide the latency of copying data from the
GPU and of communication between CPUs. We then tested
the scalability of our multi-GPU implementation using
16384 GPUs of the Titan supercomputer, the third fastest
supercomputer in the world as of June 2016. We obtained
a weak scaling efficiency of about 90%, which increases with
problem size. The CG and DG methods that do not
overlap communication with computation performed about
20% less efficiently, highlighting the value of this overlap.
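For reference, weak scaling efficiency is commonly computed as the ratio of the run time on a baseline number of devices to the run time on N devices, with the work per device held fixed. The sketch below uses that common definition, which is assumed here rather than taken from this paper, and the timings are hypothetical.

```python
def weak_scaling_efficiency(time_baseline, time_scaled):
    """Efficiency = T(baseline) / T(N) when the work per device is constant."""
    if time_baseline <= 0.0 or time_scaled <= 0.0:
        raise ValueError("timings must be positive")
    return time_baseline / time_scaled


# Hypothetical timings [s] for a fixed per-GPU workload.
timings = {1: 9.0, 64: 9.4, 4096: 10.0}
baseline = timings[1]
for n_gpus, t in sorted(timings.items()):
    print(f"{n_gpus:5d} GPUs: efficiency = {weak_scaling_efficiency(baseline, t):.2f}")
```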
For portability to heterogeneous computing environments,
we used a novel programming language called OCCA, which
can be cross-compiled to OpenCL, CUDA, or OpenMP
at runtime. Finally, the accuracy and performance of our
GPU implementations are verified using several benchmark
problems representative of different scales of atmospheric
dynamics.
In the current work, we ported only the explicit time
integration modules to the GPU. However, operational
NWP often requires use of implicit-explicit (IMEX) time
integration to counter the limitation imposed by the Courant
number. In the future, we plan to port the IMEX time
integration modules which require solving a system of
equations at each time step.
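The Courant number limitation mentioned above can be sketched for a single wave speed as $C = a\,\Delta t/\Delta x$. The grid spacing below is hypothetical. For high-order elements the relevant spacing is the smallest distance between nodes, and the stable Courant number depends on the time integrator, neither of which is modeled here.

```python
def courant_number(dt, dx, wave_speed):
    """C = a * dt / dx for a signal travelling at wave_speed [m/s]."""
    return wave_speed * dt / dx


def max_explicit_dt(dx, wave_speed, courant_max=1.0):
    """Largest time step [s] that keeps the Courant number at or below courant_max."""
    return courant_max * dx / wave_speed


SOUND_SPEED = 347.32  # m/s, from the acoustic wave test case
DX = 1000.0           # hypothetical minimum node spacing [m]

dt = max_explicit_dt(DX, SOUND_SPEED)
print(f"explicit limit at C = 1: dt = {dt:.2f} s")
print(f"Courant number check:    C  = {courant_number(dt, DX, SOUND_SPEED):.2f}")
```

Because the fast acoustic waves set this limit, treating them implicitly allows time steps well beyond it, at the cost of solving a system of equations at each step.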
10 Acknowledgement
This research used resources of the Oak Ridge Leadership
Computing Facility at the Oak Ridge National Laboratory,
which is supported by the Office of Science of the U.S.
Department of Energy under Contract No. DE-AC05-
00OR22725. The authors gratefully acknowledge support
from the Office of Naval Research through PE-0602435N.
References
Abdi DS and Giraldo FX (2016) Efficient construction of unified
continuous and discontinuous Galerkin formulations for the 3D
Euler equations. Journal of Computational Physics 320: 46–68.
Allard J, Courtecuisse H and Faure F (2011) Implicit FEM
Solver on GPU for Interactive Deformation Simulation. In:
Hwu WW (ed.) GPU Computing Gems Jade Edition,
Applications of GPU Computing Series. Elsevier, pp. 281–294.
DOI:10.1016/B978-0-12-385963-1.00021-6.
Bassi F and Rebay S (1997) A high-order accurate discontinuous
finite element method for the numerical solution of the compressible
Navier-Stokes equations. Journal of Computational
Physics 131(2): 267–279.
Burstedde C, Wilcox LC and Ghattas O (2011) p4est: Scalable
algorithms for parallel adaptive mesh refinement on forests of
octrees. SIAM Journal on Scientific Computing 33(3): 1103–1133.
DOI:10.1137/100791634.
Carpenter M and Kennedy C (1994) Fourth-order 2N-storage
Runge-Kutta schemes. NASA technical memorandum 109112:
1–24.
Chan J, Wang Z, Modave A, Remacle J and Warburton T (2015)
GPU-accelerated discontinuous Galerkin methods on hybrid
meshes. arXiv:1507.02557.
Chan J and Warburton T (2015) GPU-accelerated Bernstein-Bezier
discontinuous Galerkin methods for wave problems.
arXiv:1512.06025.
Cockburn B and Shu C (1998) The Runge-Kutta discontinuous
Galerkin method for conservation laws V: multidimensional
systems. J. Comput. Phys. 141: 199–224.
Deville M, Fischer P and Mund E (2002) High-Order Methods for
Incompressible Fluid Flow. Cambridge University Press.
Fuhry M, Giuliani A and Krivodonova L (2014) Discontinuous
Galerkin methods on graphics processing units for nonlinear
hyperbolic conservation laws. Numerical Methods in Fluids
76: 982–1003.
Gandham R, Medina D and Warburton T (2014) GPU accelerated
discontinuous Galerkin methods for shallow water equations.
arXiv:1403.1661.
Giraldo FX (1998) The Lagrange-Galerkin spectral element method
on unstructured quadrilateral grids. Journal of Computational
Physics 147(1): 114–146.
Giraldo FX and Restelli M (2008) A study of spectral element and
discontinuous Galerkin methods for the Navier-Stokes equations
in nonhydrostatic mesoscale atmospheric modeling: Equation
sets and test cases. J. Comput. Phys. 227: 3849–3877.
Giraldo FX and Rosmond TE (2004) A scalable spectral element
Eulerian atmospheric model (SEE-AM) for NWP: Dynamical
core tests. Monthly Weather Review 132(1): 133–153.
Goddeke D, Strzodka R and Turek S (2005) Accelerating double
precision FEM simulations with GPUs. In: Proceedings of
ASIM. pp. 1–21.
Kelly JF and Giraldo FX (2012) Continuous and discontinuous
Galerkin methods for a scalable three-dimensional nonhydrostatic
atmospheric model: limited area mode. J. Comput. Phys.
231: 7988–8008.
Klöckner A and Warburton T (2013) A loop generation tool
for CPUs and GPUs, part I: Data models, algorithms, and
heuristics.
Klöckner A, Warburton T, Bridge J and Hesthaven J (2009)
Nodal discontinuous Galerkin methods on graphics processors.
Journal of Computational Physics 228(21): 7863–7882.
Marras S, Kelly JF, Moragues M, Müller A, Kopera MA,
Vázquez M, Giraldo FX, Houzeaux G and Jorba O (2015)
A review of element-based Galerkin methods for numerical
weather prediction: Finite elements, spectral elements, and
discontinuous Galerkin. Archives of Computational Methods
in Engineering: 1–50. DOI:10.1007/s11831-015-9152-1.
Medina D, St-Cyr A and Warburton T (2014) OCCA: A unified
approach to multi-threading languages. arXiv:1403.0968.
Micikevicius P (2009) 3D finite difference computation on GPUs
using CUDA. In: Proceedings of 2nd Workshop on General
Purpose Processing on Graphics Processing Units, GPGPU-2.
New York, NY, USA: ACM. ISBN 978-1-60558-517-8, pp.
79–84. DOI:10.1145/1513895.1513905.
Modave A, St-Cyr A and Warburton T (2015) GPU performance
analysis of a nodal discontinuous Galerkin method for acoustic
and elastic models. arXiv:1602.07997.
Müller A, Kopera M, Marras S, Wilcox LC, Isaac T and Giraldo
FX (2016) Strong scaling for numerical weather prediction at
petascale with the atmospheric model NUMA. Submitted to:
30th IEEE International Parallel and Distributed Processing
Symposium.
Nair RD, Levy MN and Lauritzen PH (2011) Emerging numerical
methods for atmospheric modeling. In: Lauritzen PH,
Jablonowski C, Taylor MA and Nair RD (eds.) Numerical
methods for global atmospheric models, Lecture notes in
computational science and engineering, volume 80. Springer,
pp. 251–311.
Norman M, Larkin J, Vose A and Evans K (2015) A case study of
CUDA FORTRAN and OpenACC for an atmospheric climate
kernel. Journal of Computational Science 9: 1–6. Computational
Science at the Gates of Nature.
Remacle J, Gandham R and Warburton T (2015) GPU accelerated
spectral finite elements on all-hex meshes. arXiv:1506.05996.
Robert A (1993) Bubble convection experiments with a semi-implicit
formulation of the Euler equations. J. Atmos. Sci.
50(13): 1865–1873.
Sawyer W (2014) An overview of GPU-enabled atmospheric
models. In: ENES Workshop on Exascale Technologies and
Innovation in HPC for Climate Models.
Siebenborn M, Schulz V and Schmidt S (2012) A curved-element
unstructured discontinuous Galerkin method on GPUs for the
Euler equations. Comput. and Vis. in Sc. 15: 61–73.
Straka J, Wilhelmson R, Wicker L, Anderson J and Droegemeier
K (1993) Numerical solutions of a nonlinear density current:
A benchmark solution and comparison. International J. Num.
Methods. Fl. 17: 1–22.
Tomita H and Satoh M (2005) A new dynamical framework of
nonhydrostatic global model using the icosahedral grid. Fluid
Dynamics Research 34: 357–400.
Ullrich P and Jablonowski C (2012) Operator-split Runge-Kutta-Rosenbrock
methods for nonhydrostatic atmospheric models.
Monthly Weather Review 140: 1257–1284.
Williams S, Waterman A and Patterson D (2009) Roofline:
An insightful visual performance model for multicore
architectures. Commun. ACM 52(4): 65–76. DOI:10.1145/
Yelash L, Müller A, Lukáčová-Medvid'ová M, Giraldo FX and
Wirth V (2014) Adaptive discontinuous evolution Galerkin
method for dry atmospheric flow. Journal of Computational
Physics 268: 106–133.