Content uploaded by Daniel S. Abdi

Author content

All content in this area was uploaded by Daniel S. Abdi on Jul 02, 2016

Content may be subject to copyright.

All content in this area was uploaded by Daniel S. Abdi on Jun 01, 2016

A GPU Accelerated Continuous and

Discontinuous Galerkin Non-hydrostatic

Atmospheric Model

Journal Title

XX(X):1–25

© The Author(s) 2016

Reprints and permission:

sagepub.co.uk/journalsPermissions.nav

DOI: 10.1177/ToBeAssigned

www.sagepub.com/

Daniel Abdi¹, Lucas Wilcox¹, Timothy Warburton² and Francis Giraldo¹

Abstract

We present a GPU accelerated nodal discontinuous Galerkin method for the solution of the three dimensional Euler

equations that govern the motion and thermodynamic state of the atmosphere. Acceleration of the dynamical core of

atmospheric models plays an important practical role in not only getting daily forecasts faster but also in obtaining

more accurate (high resolution) results within a given simulation time limit. We use algorithms suitable for the single

instruction multiple thread architecture of GPUs to accelerate our model by two orders of magnitude relative to one core

of a CPU. Tests on one node of the Titan supercomputer show a speedup of up to 15 times using the K20X GPU as

compared to that on the 16-core AMD Opteron CPU. The scalability of the multi-GPU implementation is tested using

16384 GPUs, which resulted in a weak scaling efﬁciency of about 90%. Finally, the accuracy and performance of our

GPU implementation are verified using several benchmark problems representative of different scales of atmospheric

dynamics.

Keywords

NUMA, GPU, HPC, OCCA, Atmospheric model, Discontinuous Galerkin, Continuous Galerkin

1 Introduction

Most operational Numerical Weather Prediction (NWP)

models are based on the ﬁnite difference or spectral

transform spatial discretization methods. Finite difference

methods are popular with limited area models due to their

ease of implementation and good performance on structured

grids, whereas global circulation models mostly use the

spectral transform method. Spectral transform methods often

do not scale well on massively parallel systems due to

the need for global (all-to-all) communication required

by the Fourier transform. On the other hand, the ﬁnite

difference method requires wide halo layers at inter-

processor boundaries to achieve high-order accuracy. The

search for efﬁcient parallel NWP codes in the era of

high performance computing suggests the use of alternative

methods that have local operation properties while still

offering high-order accuracy (Nair et al. 2011); their

efﬁciency coming from the minimal parallel communication

footprint that is of vital importance as resolution increases.

The Non-hydrostatic Uniﬁed Model of the Atmosphere

(NUMA) is one such NWP model that offers high-order

accuracy while using local methods for parallel efﬁciency

(Marras et al. 2015; Giraldo and Rosmond 2004; Kelly and

Giraldo 2012; Giraldo and Restelli 2008).

In Table 1, we give a summary of a recent review on

the progress of porting several NWP models to the GPU

(Sawyer 2014). Among those models which ported the whole

dynamical core, a maximum overall speedup of 3 times (from

here on, we shall use, e.g., 3x to represent such a speedup) is

observed for a GPU relative to a multi-core CPU. The only

spectral element model in the review was the Community

Atmospheric Model (CAM-SE) that showed a speedup of 3x
for the dynamical core using CUDA. In a comparison of
CAM-SE tracer kernels, the OpenACC version, though
substantially easier to program, ran 1.5x slower than the
CUDA version (Norman et al. 2015). This could occur, for
example, by not fully exploiting the private worker array
capability of OpenACC. The most important metric we shall
use to compare performance on the GPU is speedup; however,
we should note that speedup results are significantly
influenced by how well the CPU and GPU codes are optimized.
For this reason, we shall also report individual GPU kernel
performance in terms of rate of floating point operations
and rate of data transfer (bandwidth) and will illustrate
our results using roofline models.

Element based Galerkin (EBG) methods, in which the

basis functions are deﬁned within an element, are well

suited for distributed computing for two reasons (Klöckner
et al. 2009). Firstly, localized memory accesses result in
low communication overhead. In contrast, global methods
require an all-to-all communication that severely degrades
scalability on most architectures, and methods having
non-compact high-order support require larger halo regions,
which translate to larger communication stencils that also
reduce scalability. Secondly, the high-order polynomial
expansion of the solution results in large arithmetic intensity
per degree of freedom. These two properties work in
favor of EBG methods for Graphics Processing Unit (GPU)

computing as well. The two EBG methods of NUMA,
namely continuous Galerkin (CG) and discontinuous
Galerkin (DG), are ported to the GPU in a unified manner
(see Sec. 3.3). Parallel implementation of DG is often easier
and more efficient than that of CG because of a smaller
communication stencil; with a judicious choice of numerical
flux only neighbors sharing a face need to communicate
in DG as opposed to the edge and corner neighbor
communication required by CG. Moreover, DG allows for
a simple overlap of computation of volume integrals and
intra-processor flux with communication of boundary data,
which can be exploited to improve the efficiency of the
parallel implementation (Kelly and Giraldo 2012). CG can
also benefit from a communication-computation overlap but
it requires a bit more work than that for DG (Deville et al.
2002).

Table 1. GPU acceleration of a few atmospheric models based on a summary in Sawyer (2014). The only spectral element (SE)
code is the hydrostatic CAM-SE model. A maximum speedup of 3x over a multi-core CPU is observed among those models that
have ported the whole dynamical core.

Model     Non-hydrostatic   Method   GPU ported          Speedup   Language
CAM-SE    No                SE       Parts of DyCore     3x        CUDA+OpenACC
WRF       Yes               FD       Parts of DyCore     2x        CUDA+OpenACC
NICAM     Yes               FV       DyCore              3x        OpenACC
ICON      Yes               FV       DyCore              2x        CUDA+OpenACC+OpenCL
GEOS-5    Yes               FV       Parts of DyCore     5x        CUDA+OpenACC
FIM/NIM   Yes               FV       DyCore + Physics    3x        F2C-ACC + OpenACC
GRAPES    Yes               SL       Parts of DyCore     4x        CUDA
COSMO     Yes               FD       DyCore + Physics    2x        CUDA+OpenACC

¹ Department of Applied Mathematics, Naval Postgraduate School, USA
² Department of Mathematics, Virginia Tech University, USA

Corresponding author:
Daniel S. Abdi, Naval Postgraduate School Monterey, CA 93943, USA.
Email: dsabdi@nps.edu

Prepared using sagej.cls [Version: 2015/06/09 v1.01]

EBG methods have been successfully ported to GPUs to
speed up the solution of various partial differential equations
(PDEs) by orders of magnitude. Acceleration of a CG
simulation using GPUs was first reported by Goddeke et al.
(2005). Later, Klöckner et al. (2009) made the first GPU
implementation of nodal DG for the solution of linear
hyperbolic conservation laws. They mention that nontrivial
adjustments to the DG method are required to solve
non-linear hyperbolic equations, such as the compressible Euler
equations, on the GPU due to the complexity of implementing
limiters and artificial viscosity. Another notable difference
with the current work is that NUMA uses a tensor-product
approach with hexahedral elements for efficiency
reasons (Kelly and Giraldo 2012); Klöckner et al. (2009)
argue tetrahedra are preferable on the GPU due to larger
arithmetic intensity and reduced memory fetches. More

recently Siebenborn et al. (2012) implemented the Runge-

Kutta discontinuous Galerkin method of Cockburn and

Shu (1998) on the GPU to solve the non-linear Euler

equations using tetrahedral grids. They reported a speedup

of 18x over the serial implementation of the method running

on a single core CPU. Fuhry et al. (2014) made an

implementation of the 2D discontinuous Galerkin on the

GPU using triangular elements and obtained a speedup of

about 50x relative to a single core CPU. The approach they

used is a one-element-per-thread strategy that is different

from the one-node-per-thread strategy we shall use in this

work when running on the GPU. However, thanks to our

use of a device agnostic language, the same kernel code

used on the GPU switches to using the one-element-per-

thread strategy of Fuhry et al. (2014) when running on the

CPU using OpenMP mode. Chan et al. (2015) presented

a GPU acceleration of DG methods for the solution of

the acoustic wave equation on hex-dominant hybrid meshes

consisting of hexahedra, tetrahedra, wedges and pyramids.

They mention that the DG spectral element formulation on

hexahedra is more efﬁcient on the GPU using Legendre-

Gauss-Lobatto (LGL) points than using Gauss-Legendre

(GL) points. To avoid the cost of storing the inverse mass

matrix on the GPU, they used different basis functions that

yield a diagonal mass matrix for each of the cell shapes

except tetrahedra. For straight-edged elements, the mass

matrix for tetrahedral elements is not diagonal, but a scalar

multiple of that of the reference tetrahedron, therefore the

storage cost is minimal. In (Chan and Warburton 2015),

they consider the use of the Bernstein-Bezier polynomial

basis for DG on the GPU to enhance the sparsity of the

derivative and lift matrices as compared to classical DG

with Lagrange polynomial basis. However, this comes at

a cost of increased condition number of the matrices that

could potentially cause stability issues. They conclude that,

at high order polynomial approximation, DG implemented

with a Bernstein-Bezier polynomial basis performs better than

a straightforward implementation of classical DG. Remacle

et al. (2015) studied GPU acceleration of spectral elements

for the solution of the Poisson problem on purely hexahedral

grids. The solution of elliptic problems is most efﬁciently

done using implicit methods; thus, they implemented a

matrix-free Preconditioned Conjugate Gradient (PCG) on the

GPU and demonstrated that problems with 50 million grid

cells can be solved in a few seconds.

General purpose computing on GPUs can be done using
several programming models: the open standards OpenCL
and OpenACC and NVIDIA's CUDA, to name a
few. The choice of the programming model for a project
depends on several factors. The goal of the current work is
to port NUMA to heterogeneous computing environments
in a performance portable way, and hence cross-platform
portability is the topmost priority. In the future we
shall address performance portability using automatic code
transformation techniques, such as Loo.py (see Klöckner
and Warburton 2013). To achieve cross-platform portability,

we chose a new threading language called OCCA (Open

Concurrent Compute Abstraction) (Medina et al. 2014),

which is a uniﬁed approach to multi-threading languages.

Kernels written in OCCA are cross-compiled at runtime to

existing thread models such as OpenCL, CUDA, OpenMP,

etc.; here, we present results only for OpenCL and CUDA

backends and postpone OpenMP for future work. OCCA has

been shown to deliver portable high performance for various

EBG methods (Medina et al. 2014). It has already been used

in (Gandham et al. 2014) to accelerate the DG solution of

the shallow water equations, in (Remacle et al. 2015) for the

Poisson problem, and in (Modave et al. 2015) for acoustic and

elastic problems.

2 Governing equations

The dynamics of non-hydrostatic atmospheric processes

are governed by the compressible Euler equations. The

equation sets can be written in various conservative and

non-conservative forms. Among those, a conservative set is

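selected with the prognostic variables $(\rho, \mathbf{U}, \Theta)^T$, where $\rho$ is
density, $\mathbf{U} = (U, V, W) = \rho\mathbf{u}$ and $\Theta = \rho\theta$, where $\theta$ is potential
temperature and $\mathbf{u} = (u, v, w)$ are the velocity components.
We write the governing equations in the following way

$$
\begin{aligned}
\frac{\partial \rho}{\partial t} + \nabla\cdot\mathbf{U} &= 0 \\
\frac{\partial \mathbf{U}}{\partial t} + \nabla\cdot\left(\frac{\mathbf{U}\otimes\mathbf{U}}{\rho} + P\,\mathbf{I}_3\right) &= -\rho\,\mathbf{g} \\
\frac{\partial \Theta}{\partial t} + \nabla\cdot\left(\frac{\Theta\,\mathbf{U}}{\rho}\right) &= 0
\end{aligned}
\tag{1}
$$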

where $\mathbf{g}$ is the gravity vector.∗ The pressure in the
momentum equation is obtained from the equation of state

$$
P = P_0 \left(\frac{R\,\Theta}{P_0}\right)^{\gamma}
\tag{2}
$$

where $R = c_p - c_v$ and $\gamma = c_p / c_v$, with $c_p$ and $c_v$ the specific heats
at constant pressure and constant volume, respectively. We have

selected to use a conservative form of the equations, to take

advantage of not only global but also local conservation

properties (given the proper discretization method).
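As a minimal sketch of how Eq. (2) might be evaluated at a single node, the snippet below computes the pressure from $\Theta = \rho\theta$. The numerical constants are typical dry-air values chosen for illustration; they are assumptions and are not given in the text. The helper `perturbation_pressure` is likewise a hypothetical name, shown only to illustrate how a pressure perturbation could be formed from a background and a perturbation $\Theta$.

```python
# Assumed dry-air constants for illustration only (not taken from the paper).
P0 = 1.0e5       # reference pressure [Pa]
CP = 1004.67     # specific heat at constant pressure [J/(kg K)]
CV = 717.5       # specific heat at constant volume [J/(kg K)]
R = CP - CV      # R = c_p - c_v
GAMMA = CP / CV  # gamma = c_p / c_v


def pressure(big_theta):
    """Eq. (2): P = P0 * (R * Theta / P0)**gamma, with Theta = rho * theta."""
    return P0 * (R * big_theta / P0) ** GAMMA


def perturbation_pressure(theta_background, theta_perturbation):
    """Hypothetical helper: P' = P(Theta_bar + Theta') - P(Theta_bar)."""
    return pressure(theta_background + theta_perturbation) - pressure(theta_background)
```

A quick sanity check of the formula: when $R\Theta = P_0$ the bracket in Eq. (2) equals one, so `pressure(P0 / R)` returns the reference pressure.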

For better numerical stability, the density, pressure and

potential temperature variables are split into background

and perturbation components. The background component is

time-invariant and is often obtained by assuming hydrostatic

equilibrium and a neutral atmosphere. Let us deﬁne the

decomposition as follows

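$$
\begin{aligned}
\rho(\mathbf{x}, t) &= \bar{\rho}(\mathbf{x}) + \rho'(\mathbf{x}, t) \\
\Theta(\mathbf{x}, t) &= \bar{\Theta}(\mathbf{x}) + \Theta'(\mathbf{x}, t) \\
P(\mathbf{x}, t) &= \bar{P}(\mathbf{x}) + P'(\mathbf{x}, t)
\end{aligned}
$$

where $(\mathbf{x}, t)$ are the space-time coordinates. Then, the
modified equation set is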

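$$
\begin{aligned}
\frac{\partial \rho'}{\partial t} + \nabla\cdot\mathbf{U} &= 0 \\
\frac{\partial \mathbf{U}}{\partial t} + \nabla\cdot\left(\frac{\mathbf{U}\otimes\mathbf{U}}{\rho} + P'\,\mathbf{I}_3\right) &= -\rho'\,\mathbf{g} \\
\frac{\partial \Theta'}{\partial t} + \nabla\cdot\left(\frac{\Theta\,\mathbf{U}}{\rho}\right) &= 0.
\end{aligned}
\tag{3}
$$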

In compact vector notation form

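$$
\frac{\partial \mathbf{q}}{\partial t} + \nabla\cdot\mathbf{F}(\mathbf{q}) = \mathbf{S}(\mathbf{q})
\tag{4}
$$

where $\mathbf{q} = (\rho', \mathbf{U}, \Theta')^T$ is the solution vector,
$\mathbf{F}(\mathbf{q}) = \left(\mathbf{U},\ \frac{\mathbf{U}\otimes\mathbf{U}}{\rho} + P'\,\mathbf{I}_3,\ \frac{\Theta\,\mathbf{U}}{\rho}\right)^T$ is the flux vector, and
$\mathbf{S}(\mathbf{q}) = (0, -\rho'\,\mathbf{g}, 0)^T$ is the source vector.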

For the purpose of stabilization, we add artiﬁcial viscosity

to the governing equations as follows

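$$
\frac{\partial \mathbf{q}}{\partial t} + \nabla\cdot\mathbf{F}(\mathbf{q}) = \mathbf{S}(\mathbf{q}) + \nabla\cdot(\mu\nabla\mathbf{q})
\tag{5}
$$

where $\mu$ is the constant artificial kinematic viscosity. We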

should mention that the equation sets are conservative only

for the inviscid case; therefore, in order to conserve mass, we

do not apply stabilization to the continuity equation.

3 Spatial discretization of the governing

equations

Spatial discretization for the element-based Galerkin (EBG)

methods, namely continuous Galerkin and discontinuous

Galerkin, is conducted by decomposing the domain $\Omega \subset \mathbb{R}^3$
into $N_e$ non-overlapping hexahedral elements $\Omega_e$

$$
\Omega = \bigcup_{e=1}^{N_e} \Omega_e.
$$

A key property of hexahedral elements is that they allow

the use of a tensor product approach thereby decreasing the

complexity (in 3D) from $O(N^6)$ to $O(N^4)$ where $N$ is the

degree of the polynomial basis. In addition, if we are willing

to accept inexact integration of the mass matrix then we can

co-locate the interpolation and integration points to simplify

the resulting algorithm in addition to increasing its efﬁciency

without sacriﬁcing too much accuracy (see, e.g., Giraldo

(1998)).
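The complexity reduction quoted above can be made concrete with a rough operation count for evaluating one derivative of one scalar field over a single element. This is a back-of-the-envelope sketch that counts multiply-adds only and ignores metric terms:

```latex
\begin{aligned}
\text{general 3D basis:}\quad
  & \underbrace{(N+1)^3}_{\text{nodes}} \times
    \underbrace{(N+1)^3}_{\text{basis functions per node}}
    = (N+1)^6 = O(N^6), \\
\text{tensor-product basis:}\quad
  & \underbrace{(N+1)^3}_{\text{nodes}} \times
    \underbrace{3\,(N+1)}_{\text{collinear nodes in } \xi,\eta,\zeta}
    = 3\,(N+1)^4 = O(N^4).
\end{aligned}
```

For example, at $N = 7$ the two counts are $8^6 = 262144$ and $3 \cdot 8^4 = 12288$, a reduction of roughly a factor of 21.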

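Within each element $\Omega_e$ are defined basis functions $\psi_j(\mathbf{x})$
to form a finite-dimensional approximation $\mathbf{q}_N$ of $\mathbf{q}(\mathbf{x}, t)$ by
the expansion

$$
\mathbf{q}_N^{(e)}(\mathbf{x}, t) = \sum_{j=1}^{M} \psi_j(\mathbf{x})\, \mathbf{q}_j^{(e)}(t)
$$

where $M$ is the number of nodes in an element. The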

superscript (e)indicates a local solution as opposed to a

global solution. From here on, the superscript is dropped

from our notations since we are solely interested in EBG

methods.

The 3D basis functions are formed from a tensor product

of the 1D basis functions in each direction as

$$
\psi_{ijk}(\xi, \eta, \zeta) = \psi_i(\xi) \otimes \psi_j(\eta) \otimes \psi_k(\zeta)
$$

where the 1D Lagrange basis functions are defined on $[-1, 1]$ as

$$
\psi_i(\xi) = \prod_{\substack{j=1 \\ j \neq i}}^{N+1} \frac{\xi - \xi_j}{\xi_i - \xi_j},
$$

where $\{\xi_i\}_{1}^{M}$ is the set of interpolation points in $[-1, 1]$. In
a nodal Galerkin approach, $\psi_i(\xi)$ are Lagrange polynomials
associated with a specific set of points; here we choose the
Legendre-Gauss-Lobatto (LGL) points $\{\xi_i\} \in [-1, 1]$, which
are the roots of

$$
(1 - \xi^2)\, P'_N(\xi)
$$

where $P_N(\xi)$ is the $N$th degree Legendre polynomial. These
points are also used for integration with quadrature weights
given by

$$
\omega_i = \frac{2}{N(N+1)} \left(\frac{1}{P_N(\xi_i)}\right)^2.
$$

∗ The gravity vector is constant in mesoscale models whereas it varies with
location in global scale models.
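The LGL points and weights defined above can be computed numerically. The following is a minimal sketch, not NUMA's implementation: it assumes a Newton iteration on $P'_N$ started from Chebyshev-Gauss-Lobatto points, and the function names are hypothetical.

```python
import math


def legendre_pair(n, x):
    """Return (P_n(x), P_{n-1}(x)) from the three-term recurrence, for n >= 1."""
    p_prev, p = 1.0, x  # P_0 and P_1
    for k in range(1, n):
        p_prev, p = p, ((2 * k + 1) * x * p - k * p_prev) / (k + 1)
    return p, p_prev


def lgl_nodes_weights(n, tol=1e-15, max_iter=100):
    """LGL nodes (roots of (1 - x^2) P_n'(x)) and quadrature weights, n >= 1."""
    nodes = [-1.0] + [0.0] * (n - 1) + [1.0]
    for i in range(1, n):
        x = -math.cos(math.pi * i / n)  # Chebyshev-Gauss-Lobatto initial guess
        for _ in range(max_iter):
            p, p_prev = legendre_pair(n, x)
            # (1 - x^2) P_n' = n (P_{n-1} - x P_n)
            dp = n * (p_prev - x * p) / (1.0 - x * x)
            # Legendre ODE: (1 - x^2) P_n'' - 2 x P_n' + n (n + 1) P_n = 0
            d2p = (2.0 * x * dp - n * (n + 1) * p) / (1.0 - x * x)
            dx = dp / d2p
            x -= dx  # Newton update on P_n'
            if abs(dx) < tol:
                break
        nodes[i] = x
    # w_i = 2 / (N (N + 1)) * (1 / P_N(xi_i))^2
    weights = [2.0 / (n * (n + 1) * legendre_pair(n, x)[0] ** 2) for x in nodes]
    return nodes, weights
```

For $N = 2$ this gives the nodes $-1, 0, 1$ with weights $1/3, 4/3, 1/3$ (Simpson's rule), and for any $N$ the weights sum to 2, the length of the reference interval.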

This choice of Lagrange functions gives the Kronecker delta

property

$$
\psi_i(\xi_j) = \delta_{ij}
$$

which, for the 3D basis functions, yields

$$
\psi_{ijk}(\xi_a, \eta_b, \zeta_c) = \delta_{ai} \otimes \delta_{bj} \otimes \delta_{ck}.
$$

Unfortunately, the Kronecker delta property does not hold

for the derivatives of the basis functions. However, in the

case of tensor product elements, there exists a simpliﬁcation

that will tremendously decrease the cost of evaluation of

derivatives and also the associated storage space in case they

are stored as matrix coefﬁcients. Let us write the derivatives

in the following way

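$$
\begin{aligned}
\frac{\partial \psi_{ijk}}{\partial \xi}(\xi_a, \eta_b, \zeta_c) &= \frac{d\psi_i}{d\xi}(\xi_a) \otimes \delta_{bj} \otimes \delta_{ck} \\
\frac{\partial \psi_{ijk}}{\partial \eta}(\xi_a, \eta_b, \zeta_c) &= \delta_{ai} \otimes \frac{d\psi_j}{d\eta}(\eta_b) \otimes \delta_{ck} \\
\frac{\partial \psi_{ijk}}{\partial \zeta}(\xi_a, \eta_b, \zeta_c) &= \delta_{ai} \otimes \delta_{bj} \otimes \frac{d\psi_k}{d\zeta}(\zeta_c).
\end{aligned}
\tag{6}
$$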

Therefore, for tensor product elements, we need to
consider only $3N$ nodes instead of $N^3$ when computing
derivatives at a given node. If matrices are built to solve
the system of equations, the storage requirement would
increase in proportion to the polynomial order, $O(N)$, instead
of $O(N^3)$. This saving is due to the fact that we only
have to compute and store $\frac{d\psi}{d\chi}(\chi)$ where $\chi$ is one of the
following: $\xi, \eta, \zeta$. The derivatives with respect to the physical
coordinates $\mathbf{x} = (x, y, z)$ are computed using the Jacobian
matrix transformation

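$$
\nabla\phi = \mathbf{J}\,\hat{\nabla}\phi
$$

where $\hat{\nabla}$ is the derivative with respect to the reference
coordinates $(\xi, \eta, \zeta)$ and

$$
\mathbf{J} =
\begin{pmatrix}
\frac{\partial \xi}{\partial x} & \frac{\partial \xi}{\partial y} & \frac{\partial \xi}{\partial z} \\
\frac{\partial \eta}{\partial x} & \frac{\partial \eta}{\partial y} & \frac{\partial \eta}{\partial z} \\
\frac{\partial \zeta}{\partial x} & \frac{\partial \zeta}{\partial y} & \frac{\partial \zeta}{\partial z}
\end{pmatrix}.
\tag{7}
$$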
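To illustrate the tensor-product simplification of Eq. (6), the sketch below builds a 1D differentiation matrix and evaluates the gradient at one node using only the nodes collinear with it. It is a plain-Python illustration under stated assumptions, not NUMA code: the function names are hypothetical, the field is indexed as `q[c][b][a]` with `a` running along $\xi$, and the metric terms are applied through the chain rule with `jac[r][c]` holding the derivative of reference coordinate `r` with respect to physical coordinate `c`.

```python
def diff_matrix(nodes):
    """D[a][i] = d(psi_i)/d(xi) evaluated at xi_a for the Lagrange basis on `nodes`."""
    m = len(nodes)
    # Barycentric weights w_i = 1 / prod_{k != i} (xi_i - xi_k)
    w = [1.0] * m
    for i in range(m):
        for k in range(m):
            if k != i:
                w[i] /= nodes[i] - nodes[k]
    d = [[0.0] * m for _ in range(m)]
    for a in range(m):
        for i in range(m):
            if i != a:
                d[a][i] = (w[i] / w[a]) / (nodes[a] - nodes[i])
        d[a][a] = -sum(d[a])  # each row sums to zero (derivative of a constant)
    return d


def reference_gradient(q, d, a, b, c):
    """Gradient in (xi, eta, zeta) at node (a, b, c), using 3(N+1) nodes only."""
    m = len(d)
    q_xi = sum(d[a][n] * q[c][b][n] for n in range(m))
    q_eta = sum(d[b][n] * q[c][n][a] for n in range(m))
    q_zeta = sum(d[c][n] * q[n][b][a] for n in range(m))
    return q_xi, q_eta, q_zeta


def physical_gradient(ref_grad, jac):
    """Chain rule: d/dx = xi_x d/dxi + eta_x d/deta + zeta_x d/dzeta, and so on."""
    return tuple(sum(jac[r][c] * ref_grad[r] for r in range(3)) for c in range(3))
```

For a polynomial field of degree at most $N$ in each direction the result is exact up to round-off, which makes a simple check possible with the three nodes $-1, 0, 1$.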

3.1 Continuous Galerkin method

Starting from the differential form of the Euler equations

in vector notation, shown in Eq. (4), and then expanding

with basis functions, multiplying by a test function ψi, and

integrating yields the element-wise formulation

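$$
\int_{\Omega_e} \psi_i \frac{\partial \mathbf{q}_N}{\partial t}\, d\Omega_e
+ \int_{\Omega_e} \psi_i \nabla\cdot\mathbf{F}\, d\Omega_e
= \int_{\Omega_e} \psi_i \mathbf{S}(\mathbf{q}_N)\, d\Omega_e.
\tag{8}
$$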

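Integrating the second term by parts
($\psi_i \nabla\cdot\mathbf{F} = \nabla\cdot(\psi_i \mathbf{F}) - \nabla\psi_i \cdot \mathbf{F}$) yields

$$
\int_{\Omega_e} \psi_i \frac{\partial \mathbf{q}_N}{\partial t}\, d\Omega_e
+ \int_{\Gamma_e} \psi_i\, \hat{\mathbf{n}}\cdot\mathbf{F}\, d\Gamma_e
- \int_{\Omega_e} \nabla\psi_i \cdot \mathbf{F}\, d\Omega_e
= \int_{\Omega_e} \psi_i \mathbf{S}(\mathbf{q}_N)\, d\Omega_e
\tag{9}
$$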

where $\hat{\mathbf{n}}$ is the outward pointing normal on the boundary of
the element $\Gamma_e$. The second term needs to be evaluated only
at physical boundaries because the fluxes to the left and right
of element interfaces are always equal at interior boundaries,
i.e. $\mathbf{F}^+ = \mathbf{F}^-$. Eqs. (8) and (9) are the strong and weak
continuous Galerkin (CG) formulations, respectively, with
the finite dimensional space defined as a subset of the
Sobolev space

$$
V_N^{CG} = \{\psi \in H^1(\Omega) \mid \psi \in P_N\}
$$

where $P_N$ defines the set of all $N$th degree polynomials.
Automatically, $V_N^{CG} \subset C^0(\Omega)$, thus CG solutions satisfy
$C^0$-continuity.

3.2 Discontinuous Galerkin method

For DG, the ﬁnite dimensional space is deﬁned as a subset of

the Hilbert space that allows for discontinuities of solutions

$$
V_N^{DG} = \{\psi \in L^2(\Omega_e) \mid \psi \in P_N\}.
$$

Therefore $\mathbf{F}^+$ and $\mathbf{F}^-$ are no longer equal; hence, we
define a numerical flux $\mathbf{F}^*$ as an approximate solution to a
Riemann problem to be used in the weak form DG

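$$
\int_{\Omega_e} \psi_i \frac{\partial \mathbf{q}_N}{\partial t}\, d\Omega_e
+ \int_{\Gamma_e} \psi_i\, \hat{\mathbf{n}}\cdot\mathbf{F}^*\, d\Gamma_e
- \int_{\Omega_e} \nabla\psi_i \cdot \mathbf{F}\, d\Omega_e
= \int_{\Omega_e} \psi_i \mathbf{S}(\mathbf{q}_N)\, d\Omega_e
\tag{10}
$$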

where the Rusanov flux, suitable for hyperbolic equations, is
defined as

$$
\mathbf{F}(\mathbf{q})^* = \{\mathbf{F}(\mathbf{q})\} - \hat{\mathbf{n}}\, \frac{|\hat{\lambda}|}{2}\, [[\mathbf{q}]]
$$

where $|\hat{\lambda}|$ is the speed of sound, $\{\cdot\}$ represents an average and
$[[\cdot]]$ represents a jump across a face (from $\Omega_e$ to its neighbor). If
$C^0$-continuity is enforced on the weak form DG in Eq. (10),
i.e. $\mathbf{F} = \mathbf{F}^*$, it reduces to the weak form CG in Eq. (9).
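A minimal 1D sketch of the Rusanov flux may help fix the sign conventions. The function name and arguments are hypothetical, and the jump is assumed to be the neighbor value minus the element's own value, which is one common reading of "from $\Omega_e$ to its neighbor". The function returns the normal component $\hat{n}\cdot F^*$ that enters the surface integral.

```python
def rusanov_normal_flux(f_minus, f_plus, q_minus, q_plus, normal, max_speed):
    """Normal component n . F* of the Rusanov flux at a 1D face.

    *_minus are traces from inside the element, *_plus from its neighbor.
    Assumed convention: {F} = (F- + F+) / 2 and [[q]] = q+ - q-.
    `normal` is +1 or -1, so n . n = 1 in the dissipation term.
    """
    average = 0.5 * (f_minus + f_plus)
    jump = q_plus - q_minus
    return normal * average - 0.5 * abs(max_speed) * jump
```

For linear advection $f = aq$ with $a > 0$ and `max_speed = a`, the flux reduces to the upwind value: with $a = 2$, $q^- = 1$, $q^+ = 3$ and $\hat{n} = +1$, the result is $a q^- = 2$. When the two traces coincide the jump vanishes and the flux is simply $\hat{n}\cdot F$, which mirrors the reduction to CG noted above.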

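A strong form DG that more closely resembles Eq. (8) can be
obtained by applying a second integration by parts on the
flux integral to remove the smoothness constraint on the test
function $\psi_i$ as follows

$$
\int_{\Omega_e} \psi_i \frac{\partial \mathbf{q}_N}{\partial t}\, d\Omega_e
+ \int_{\Gamma_e} \psi_i\, \hat{\mathbf{n}}\cdot(\mathbf{F}^* - \mathbf{F})(\mathbf{q}_N)\, d\Gamma_e
+ \int_{\Omega_e} \psi_i \nabla\cdot\mathbf{F}(\mathbf{q}_N)\, d\Omega_e
= \int_{\Omega_e} \psi_i \mathbf{S}(\mathbf{q}_N)\, d\Omega_e.
\tag{11}
$$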

Again, if $C^0$-continuity is enforced on the strong form DG
formulation, i.e. $\mathbf{F} = \mathbf{F}^*$ at interior edges, it simplifies to the
strong form CG formulation in Eq. (8) (see Abdi and Giraldo
(2016) for details).

3.3 Uniﬁed CG and DG

The element-wise matrices for both CG and DG are

assembled to form global matrices via an operation

commonly known as global assembly or direct stiffness

summation (DSS). Even though the local matrices are the

same for both methods, the DSS operation yields different

global matrices. CG is often implemented through a global

grid point storage scheme where elements share LGL nodes

at faces so that C0-continuity is satisﬁed automatically.

Therefore, the DSS operation for CG accumulates values

at shared nodes, while that for DG simply puts the local

element matrices in their proper location in the global matrix.

DG uses a local element-wise storage scheme because

discontinuities (jumps) at element interfaces are allowed.

The standard implementations of CG and DG often follow

these two different approaches of storing data; however, CG

can be recast to use local element-wise storage as well. To

do so, we must explicitly enforce equality of values on the

right and left of element interfaces by accumulating and

then distributing back (gather-scatter) values at shared nodes

for both the mass matrix and right-hand side vector. The

gather-scatter operation is the coupling mechanism for CG,

without which the problem is under-speciﬁed. DG achieves

the same via the deﬁnition of the numerical ﬂux F∗at

element interfaces, which is used by both elements sharing

the face. A detailed explanation of the uniﬁed CG and DG

implementation of NUMA can be found in (Abdi and Giraldo

2016).
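The gather-scatter operation described above can be sketched in a few lines for element-wise storage. This is an illustrative serial version with hypothetical names; it assumes each element carries a list of global node ids so that shared nodes can be identified.

```python
def gather_scatter(local_values, local_to_global, num_global):
    """Sum contributions at shared nodes, then copy the sums back to every element."""
    global_sum = [0.0] * num_global
    # Gather: accumulate element-local values into global nodes.
    for elem_values, elem_ids in zip(local_values, local_to_global):
        for value, gid in zip(elem_values, elem_ids):
            global_sum[gid] += value
    # Scatter: distribute the accumulated values back to element-local storage.
    return [[global_sum[gid] for gid in elem_ids] for elem_ids in local_to_global]
```

As a usage example, two 1D elements with three nodes each that share global node 2 have the connectivity `[[0, 1, 2], [2, 3, 4]]`. Applying the operation to the local values `[[1, 2, 3], [4, 5, 6]]` gives `[[1, 2, 7], [7, 5, 6]]`: only the shared node changes, and both elements end up holding the same value there. Applying it to both the right-hand side and the (diagonal) mass matrix and dividing node by node then yields a $C^0$-continuous update.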

4 Temporal discretization of the governing

equations

The time integrator used is a low-storage explicit Runge-

Kutta (LSERK) method proposed in (Carpenter and Kennedy

1994). It is a ﬁve-stage fourth-order RK method that requires

only two storage locations, which is half of that required

by the conventional high-storage fourth-order RK method.

The added cost due to one more stage evaluation is offset

by the larger stable timestep ∆t the method allows. Each

successive stage is written on to the same register without

erasing the previous value. We need to store previous values

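of the field variable $q$ and its residual $dq$, of size $N$ each,
thereby resulting in a $2N$-storage scheme. Given the initial
value problem

$$
\frac{dq}{dt} = R(q) \quad \text{with} \quad q(t_0) = q_0
$$

the updates at each stage $j$ are conducted as follows

$$
\begin{aligned}
dq_j &= A_j\, dq_{j-1} + \Delta t\, R(q_{j-1}) \\
q_j &= q_{j-1} + B_j\, dq_j
\end{aligned}
$$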

where $A_j$ and $B_j$ are constant coefficients for each stage

given in Table 2.
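The two-register update can be written compactly as below. This is a minimal sketch for an autonomous right-hand side acting on a plain Python list; the function name is hypothetical. The coefficients are the nine-digit values of the five-stage scheme, with $A_2$ taken as negative: with that sign the effective stage weights sum to one, which consistency requires.

```python
import math

# Five-stage low-storage coefficients (nine-digit values).
# A[1] is negative; with that sign the stage weights sum to one.
A = [0.0, -0.481231743, -1.049562606, -1.602529574, -1.778267193]
B = [0.097618354, 0.412253292, 0.440216964, 1.426311463, 0.197876053]


def lserk_step(q, dt, rhs):
    """Advance q by one time step using only two registers, q and dq."""
    dq = [0.0] * len(q)
    for a_j, b_j in zip(A, B):
        r = rhs(q)
        dq = [a_j * d + dt * r_i for d, r_i in zip(dq, r)]  # dq_j = A_j dq_{j-1} + dt R(q_{j-1})
        q = [q_i + b_j * d for q_i, d in zip(q, dq)]        # q_j = q_{j-1} + B_j dq_j
    return q
```

For example, integrating $dq/dt = -q$ from $q(0) = 1$ to $t = 1$ with 100 steps of $\Delta t = 0.01$ reproduces $e^{-1}$ closely, as expected from a high-order scheme.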

Table 2. Coefficients of the five-stage LSERK time integrator

stage    A               B
1         0              0.097618354
2        -0.481231743    0.412253292
3        -1.049562606    0.440216964
4        -1.602529574    1.426311463
5        -1.778267193    0.197876053

Explicit RK methods have a stringent Courant-Friedrichs-Lewy
(CFL) requirement that often prohibits them from
being used in operational settings. NUMA includes
Implicit-Explicit (IMEX) methods that allow for much larger time
steps; however, those have not yet been ported to the GPU.

The ﬁrst goal of the GPU project focuses on porting explicit

time integration methods which are known to scale well on

many processors and are also easier to port to GPUs. Implicit

methods require the solution of a coupled system of linear

equations; therefore, depending on the chosen iterative solver

and preconditioner, performance on a cluster of computers

and GPUs may be severely impacted. For this reason, we

reserve the porting of the implicit solvers in NUMA to a

future study.

5 Porting NUMA to the GPU

This section describes the implementation of the uniﬁed CG

and DG NUMA on the GPU using the OCCA programming

language (Medina et al. 2014). Before we delve into details

of the implementation, a few words on GPU computing

in general and design considerations are warranted. GPUs

provide the most cost-effective computing power to date;
however, they come with the challenge of adapting existing
code originally written for the CPU to a GPU platform.

5.1 Challenges

First of all, the candidate program to be ported to the GPU

should be able to handle massively ﬁne grained parallelism

via threads. Even though current general purpose GPU

computing offers a lot more ﬂexibility than the days when

they were exclusively used for image rendering, there are still

limitations on what can be done efﬁciently on GPUs. Single

Instruction Multiple Data (SIMD) programs suited for vector

machines are automatically candidates for porting to GPUs.

More ﬂexibility is achieved on the GPU by limiting SIMD

computation to a small group of threads, 32 threads known

as a warp in NVIDIA terminology, and then scheduling

multiple warps to work on different tasks. In the code design

phase, it is often convenient to think of warps as the smallest

computing unit for the following reason. If even one thread in
a warp takes a different branch, warp divergence
occurs: the warp executes both branches one after the other,
with the threads not on the current branch masked out. For two
branches of equal cost this results in a 50% performance loss.
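The cost of warp divergence can be illustrated with a toy timing model. This is a deliberately simplified sketch, not a description of any particular GPU: it assumes a warp executes, one after the other, every branch taken by at least one of its threads, and the names are hypothetical.

```python
WARP_SIZE = 32


def warp_time(branch_of_thread, cost_of_branch):
    """Time for one warp: each distinct branch taken by any thread runs in turn."""
    return sum(cost_of_branch[branch] for branch in set(branch_of_thread))


cost = {0: 10, 1: 10}
uniform = [0] * WARP_SIZE                # every thread takes branch 0
diverged = [0] * (WARP_SIZE - 1) + [1]   # a single thread takes branch 1
```

In this model `warp_time(uniform, cost)` is 10 while `warp_time(diverged, cost)` is 20, so a single divergent thread halves the throughput of the whole warp when both branches cost the same.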

The second issue concerns memory management. Though

the many cores in GPUs provide a lot of computational

power, they can only be harnessed fully if unrestricted

by memory bandwidth limitations. Programs running on a

single core CPU are often compute-bound because more

emphasis is given to data caching in CPU design. In contrast,

most of the chip area in GPUs is devoted to compute units,

and as a result, programs running on a GPU tend to be

memory-bound. Programmers have to carefully manage the

different memory resources available in GPUs. To give an

idea of the complexity of memory management, we brieﬂy

describe the six types of memory in NVIDIA GPUs: global,

local, texture, constant, shared and register memory, ordered
from highest to lowest latency. Register memory is the fastest

but is limited in size and only visible to one thread. Shared

memory is fast and visible to a block, a group of warps,

and therefore it is an invaluable means of communication

between threads. Constant and texture memory are read-

only memory that can be used to reduce memory trafﬁc.

Local memory is cached but is only accessible by one

thread; automatic variables that cannot be held in registers

are ofﬂoaded to the slow local memory. Global memory,

which is accessible by all threads, is the main memory of

GPUs where the data is stored.

5.2 Design choices

The limited bandwidth and high access latency of global memory
are often the performance bottleneck in GPU
computing. To minimize their impact on performance, memory

transactions can be coalesced for a group of threads

accessing the same block in memory. The warp scheduler

also helps to alleviate this problem by swapping out warps

that are waiting for a global memory transaction to complete

for those that are ready to go. There are two approaches of

storing data. The ﬁrst approach, Array of Structures (AoS),

stores all variables at a given LGL node contiguously in

memory. This is suitable if computation is done for all the

variables in one pass. If, on the other hand, a subset of the

variables is required at a time, a second approach, Structure

of Arrays (SoA), is suitable. While the SoA often degrades

performance on the CPU due to reduced cache efﬁciency, it

can signiﬁcantly improve performance on the GPU because

of coalesced memory transactions for a warp. The approach

we use is a mix of these two methods similar to the AoSoFA

(Array of structures of ﬁxed arrays) described in (Allard et al.

2011), in which data for each element is stored in an SoA

manner, and thus an AoS for the whole domain. Using this

approach, scalar data for all nodes in an element is stored

contiguously in memory; this is repeated similarly for each

scalar variable. Variables that are often accessed together,

for instance coordinates (x, y, z) or velocity (u, v, w) can be

stored as one ﬂoat3 on the GPU.
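The three layouts discussed above differ only in how a triple (element, node, variable) maps to a flat memory offset. The index functions below are a sketch with hypothetical names; they ignore the packing of vector quantities into `float3` or `float4`.

```python
def index_aos(elem, node, var, nodes_per_elem, num_vars):
    """Array of Structures: all variables of one node are contiguous."""
    return (elem * nodes_per_elem + node) * num_vars + var


def index_soa(elem, node, var, nodes_per_elem, num_elems):
    """Structure of Arrays: one variable is contiguous over the whole domain."""
    return var * (num_elems * nodes_per_elem) + elem * nodes_per_elem + node


def index_element_soa(elem, node, var, nodes_per_elem, num_vars):
    """Mixed layout: SoA inside each element, AoS across elements."""
    return (elem * num_vars + var) * nodes_per_elem + node
```

With one thread per node, neighboring threads read offsets that differ by `num_vars` in the AoS layout but by 1 in the mixed layout, which is what allows the reads of a warp to be coalesced while the data of one element still stays together.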

Our choice of data layout is inﬂuenced by our design

decision to do computation on an element by element basis,

for instance launching as many threads as the number of

nodes for computing volume integrals, and as many as face

nodes for surface integrals (see Sec. 5.3.1 and 5.3.2). We

should note here that our approach has a downside in that

the number of threads launched for processing an element

could be small with low-order polynomial approximations;

also the number of threads may not be a multiple of the warp

size. We provide solutions to this problem by processing

multiple elements per block as will be explained in the

coming sections. In the SoA approach, these two problems

do not exist and the appropriate number of threads that ﬁt the

GPU device could be launched to process LGL nodes even

from different elements simultaneously. The SoA approach

may be better for porting code to the GPU using, for instance,

OpenACC or other pragma based programming languages

where the user has less control of the device.

5.3 Uniﬁed CG and DG on the GPU

The implementation of CG done within the DG framework

differs only by the ﬁnal DSS step required for imposing

the C0-continuity constraint instead of using the numerical

ﬂux. Therefore, ﬁrst we explain the implementation details

of nodal DG on the GPU and then that of the DSS

operation later. The three major computations in DG are

implemented in separate OCCA kernels: volume integration,

surface integration and time step update kernels. Other

major kernels are the boundary kernel required for imposing

boundary conditions, the project kernel for applying the

DSS operation for CG, and two kernels for stabilization:

a Laplacian diffusion kernel for applying second order

artiﬁcial viscosity to be used with CG, and a kernel for

computing the gradient required by the Local Discontinuous

Galerkin (LDG) method used for stabilizing DG; in future

work, we will select one stabilization method/kernel for both

methods using the primal form of the elliptic problem. For

the strong form DG discretization of the Euler equations, the

kernels represent the following integrals

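$$
\underbrace{\int_{\Omega_e} \psi \frac{\partial \mathbf{q}}{\partial t}\, d\Omega_e}_{\text{Update kernel}}
+ \underbrace{\int_{\Omega_e} \psi \left(\nabla\cdot\mathbf{F} - \mathbf{S}\right) d\Omega_e}_{\text{Volume kernel}}
+ \underbrace{\int_{\Gamma_e} \psi\, \hat{\mathbf{n}}\cdot(\mathbf{F}^* - \mathbf{F})\, d\Gamma_e}_{\text{Surface kernel}}
= \underbrace{\int_{\Omega_e} \psi\, \nabla\cdot(\mu\nabla\mathbf{q})\, d\Omega_e}_{\text{Diffusion kernel}}.
\tag{12}
$$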

5.3.1 Volume kernel

The volume and surface integration

kernels are written in such a way that a CUDA thread block

processes one or more elements, and a thread processes

contributions from a single Legendre-Gauss-Lobatto (LGL)

node, i.e., the one-node-per-thread approach we mentioned

in the introduction. Gandham et al. (2014) mention that

for low order polynomial approximations, performance can

be improved by as much as ﬁve times by processing more

than one element per block. This is especially true for 2D

elements that were used in their study, which have fewer

nodes than the 3D elements we are using in this work. The

reason for this variation in performance with the number of

elements processed per block is the need for a block size that

best ﬁts the underlying hardware limits. In traditional GPU

kernels, for instance the time step update kernel discussed

in Sec. 5.3.3, thread blocks are sized as multiples of the

warp size (32 threads) for best performance. However, for

the volume integration kernels, our algorithms are designed

such that one thread processes one LGL node, therefore the

number of threads launched is not a multiple of the warp size

but the number of nodes.
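The effect of the number of elements per block on warp usage can be estimated with simple arithmetic. The helper below is a sketch with hypothetical names; the limit of 1024 threads per block is an assumed default that should be replaced by the value of the target device.

```python
WARP_SIZE = 32


def lane_utilization(poly_order, elems_per_block):
    """Return (threads per block, fraction of warp lanes doing useful work)."""
    nodes = (poly_order + 1) ** 3          # one thread per LGL node
    threads = nodes * elems_per_block
    warps = -(-threads // WARP_SIZE)       # ceiling division
    return threads, threads / (warps * WARP_SIZE)


def best_elems_per_block(poly_order, max_threads=1024):
    """Smallest element count per block that maximizes lane utilization."""
    nodes = (poly_order + 1) ** 3
    if nodes > max_threads:
        raise ValueError("one element does not fit in a thread block")
    candidates = range(1, max_threads // nodes + 1)
    return max(candidates, key=lambda k: lane_utilization(poly_order, k)[1])
```

For second-order polynomials an element has 27 nodes, so a single element per block leaves 5 of the 32 lanes of its warp idle, whereas 32 elements per block give 864 threads, exactly 27 full warps. At seventh order one element already fills 16 warps with 512 threads.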

The main operation in the volume kernel is computing

gradients of the following eight variables (shown in Alg.

1): five prognostic variables (ρ, U, V, W, Θ), pressure P and

two variables for moisture (here, we omit precipitation). The

gradient of four variables, which are stored as one ﬂoat4, can

be computed together for efﬁciency. The current work does

not include support for tracer transport, nor do we employ

the moisture dynamics even though the gradient is computed.

Once the gradients are calculated, we can construct the

divergence and complete the contribution of the volume

integration to the right-hand side vector as shown in Alg. 2.

Figure 1. Volume integral contribution of a horizontal and vertical slice of a 3D element with 4th order polynomial approximation. Due to

the use of the tensor-product approach for hexahedral elements, contributions to a given node (red dot) come only from those

collinear with it along the x-,y-,z- directions, i.e., purple and green nodes on the horizontal slice and light-blue nodes on the vertical

slice.

Algorithm 1 GPU algorithms for computing gradient, divergence and Laplacian.

procedure GRADDIV(q, grad, div, compute)    ▷ Compute gradient or divergence
    Memory fence
    for k, j, i ∈ {0 . . . Nq} do    ▷ Load field variables into shared memory
        sq[k][j][i] = q
    Memory fence
    for k, j, i ∈ {0 . . . Nq} do
        qx = 0; qy = 0; qz = 0    ▷ Compute local gradients
        for n ∈ {0 . . . Nq} do
            qx += sD[i][n] × sq[k][j][n]    ▷ sD are ∇ψ at LGL nodes preloaded to shared memory
            qy += sD[j][n] × sq[k][n][i]
            qz += sD[k][n] × sq[n][j][i]
        if compute = GRAD then
            grad·x = (qx × Jrx + qy × Jsx + qz × Jtx)    ▷ Js are coefficients of the Jacobian matrix J
            grad·y = (qx × Jry + qy × Jsy + qz × Jty)
            grad·z = (qx × Jrz + qy × Jsz + qz × Jtz)
        else if compute = DIVX then
            div = (qx × Jrx + qy × Jsx + qz × Jtx)
        else if compute = DIVY then
            div += (qx × Jry + qy × Jsy + qz × Jty)
        else if compute = DIVZ then
            div += (qx × Jrz + qy × Jsz + qz × Jtz)

procedure GRAD(q, grad)    ▷ Compute gradient of a scalar field
    call GRADDIV(q, grad, -, GRAD)

procedure DIV(q, div)    ▷ Compute divergence of a vector field
    call GRADDIV(q·x, -, div, DIVX)
    call GRADDIV(q·y, -, div, DIVY)
    call GRADDIV(q·z, -, div, DIVZ)

procedure LAP(q, lap)    ▷ Compute Laplacian of a scalar field
    call GRAD(q, gq)
    call DIV(gq, lap)
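The tensor-product gradient of Alg. 1 can be mimicked on the CPU with a few nested loops. The sketch below is a serial Python reference, not the GPU kernel: the function names `local_gradient` and `physical_gradient` are hypothetical, the field is stored as `q[k][j][i]`, and `D[a][n]` plays the role of `sD` (the derivative of the n-th 1D basis function at node a).

```python
def _zeros(nq):
    """Allocate an nq x nq x nq nested list of zeros."""
    return [[[0.0] * nq for _ in range(nq)] for _ in range(nq)]


def local_gradient(q, D):
    """Reference-space derivatives (q_r, q_s, q_t) on one hexahedral element.

    Only nodes collinear with (k, j, i) contribute, as in Fig. 1.
    """
    nq = len(D)
    qr, qs, qt = _zeros(nq), _zeros(nq), _zeros(nq)
    for k in range(nq):
        for j in range(nq):
            for i in range(nq):
                for n in range(nq):
                    qr[k][j][i] += D[i][n] * q[k][j][n]
                    qs[k][j][i] += D[j][n] * q[k][n][i]
                    qt[k][j][i] += D[k][n] * q[n][j][i]
    return qr, qs, qt


def physical_gradient(qr, qs, qt, metrics):
    """Apply metric terms at one node.

    metrics = ((rx, sx, tx), (ry, sy, ty), (rz, sz, tz)), an assumed layout.
    """
    return tuple(qr * r + qs * s + qt * t for (r, s, t) in metrics)


if __name__ == "__main__":
    # Linear element with nodes at -1 and +1 in each direction.
    D = [[-0.5, 0.5], [-0.5, 0.5]]
    x = [-1.0, 1.0]
    q = [[[2 * x[i] + 3 * x[j] + 5 * x[k] for i in range(2)]
          for j in range(2)] for k in range(2)]
    qr, qs, qt = local_gradient(q, D)
    print(qr[0][0][0], qs[0][0][0], qt[0][0][0])
```

On the GPU, the three outer loops are replaced by one thread per node and the innermost loop reads from shared memory.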

For low order polynomials, we can launch one thread per node, and perhaps more by processing multiple elements per block. This approach works up to a maximum polynomial order of seven. The reason we cannot use it for polynomial orders higher than seven is twofold: first, the number of threads in a block ($(7+1)^3 = 512$) approaches the hardware block size limit; second, we also approach the shared memory limit at this polynomial order. Therefore, we use two different approaches for volume integration for polynomial orders up to seven (low order) and greater


Algorithm 2 Outline of a combined volume kernel for processing Nk elements per block with Ns slice workers. There are Nq (number of quadrature points) slices per element for volume kernels and Nf (number of faces) for surface kernels.

procedure VOLUMEKERNEL(q, R)
    Shared data[Nk][Nq][Nq][Nq]    ▷ Extended shared memory array
    for outerId0 do
        for innerId2 do
            wId = innerId2 mod Ns    ▷ Slice worker id
            elId = innerId2 div Ns    ▷ Multiple element processing
            for slId = wId to Nq step Ns do    ▷ Nq slices to work on
                e = Nk × outerId0 + elId    ▷ Element id
                call GRAD(qa, ∇qa)    ▷ Compute gradient of (U, V, W, p) as one float4 variable qa
                DU = ∇xU + ∇yV + ∇zW
                R(ρ) = DU
                R(Θ) = Θ × DU
                R(U) = U × DU + ∇xp + ∇U · U
                R(V) = V × DU + ∇yp + ∇V · U
                R(W) = W × DU + ∇zp + ∇W · U
                call GRAD(qb, ∇qb)    ▷ Compute gradient of (ρ, Θ, −, −) as one float4 variable qb
                DR = U · ∇ρ
                R(Θ) -= Θ × DR − U · ∇Θ
                R(U) -= U × DR
                R(V) -= V × DR
                R(W) -= W × DR

than seven (high order). For low order polynomials, we can pre-load all the element data (the two float4s) into shared memory at start up, and then never read from global memory again until the kernel completes.

We can overcome the thread block size limitation for high order polynomial approximations by launching only the number of threads required to process one slice of a 3D element, i.e., $N_{LGL}^2$ nodes, as shown in Fig. 1. We then consider four approaches, three of which exploit the shared memory. The first approach, which we call the naive approach, does not use shared memory but relies solely on the L1 cache if available; otherwise, data is read directly from global memory every time it is required. We can optimize this approach by adjusting the hardware division between L1 cache and shared memory to 48 KB/16 KB instead of the default 16 KB/48 KB on the K20X GPU. Ignoring cache effects, the naive approach reads $3N_{LGL}$ values from memory to compute the gradient of a variable at a node, for a total of $N_{LGL}^3 \times 3N_{LGL}$ memory reads. The second approach, henceforth called Shared-1, loads a slice of data into shared memory and then computes the contribution to the gradient from the nodes on that slice. The data on the slice is re-used among the $N_{LGL}^2$ nodes on the same plane; therefore, a total of $N_{LGL}^3 \times N_{LGL}$ memory reads are required. The third approach, henceforth called Shared-2, extends the previous method by storing the column of data in registers, as suggested by Micikevicius (2009). The column of data may not fit in registers, in which case it is spilled to CUDA private memory, which resides in global memory. In that case, the method is the same as the Shared-1 approach with the additional cost of copying data from global to global memory. The best case scenario requires $N_{LGL}^3$ memory reads, but this cannot be achieved in practice due to the limited number of registers per thread. The fourth approach makes two passes over the data: the first pass calculates contributions to the gradient from nodes on the same slice, say the $x$-$y$ plane; the second pass completes the gradient calculation by loading $x$-$z$ slices and adding the contributions from nodes in the $z$-direction. This approach always requires $N_{LGL}^3 \times 2$ memory reads.
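The read counts quoted above can be tabulated with a short script. This is only a restatement of the formulas in the text, ignoring cache effects; the function name `memory_reads` and the dictionary keys are hypothetical labels.

```python
def memory_reads(n_lgl):
    """Global memory reads per element and variable for each approach.

    n_lgl is the number of LGL nodes per direction; cache effects are ignored.
    """
    n3 = n_lgl ** 3
    return {
        "naive": n3 * 3 * n_lgl,
        "shared-1": n3 * n_lgl,
        "shared-2 (best case)": n3,
        "two-pass": n3 * 2,
    }


if __name__ == "__main__":
    for name, reads in memory_reads(8).items():
        print(f"{name:>22}: {reads} reads")
```

The gap between the naive and two-pass counts grows linearly with the number of nodes per direction, which is why the choice matters most at high order.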

Even though the slicing approach helps to handle higher order polynomial approximations, it hurts performance on the other end of the spectrum. Assuming 512 threads per block and a hardware limit of 8 blocks per multi-processor, a 2D kernel using 3rd degree polynomial approximations will require $8 \times (3+1)^2 = 128$ threads, which yields 25% efficiency; on the other hand, a 3D kernel will occupy 100% of the device because $8 \times (3+1)^3 = 512$ threads are launched per multiprocessor. We would like to run with high order polynomial approximations and also have kernels that are efficient for low order polynomial approximations.
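The efficiency figures in this example follow from simple arithmetic, sketched below. The function name `occupancy` is hypothetical, and we read the 512 in the text as the thread budget being filled and the 8 as the number of resident blocks; both are assumptions about how the example is meant.

```python
def occupancy(order, dim, blocks=8, thread_capacity=512):
    """Fraction of the assumed thread budget filled by `blocks` blocks.

    Each block holds one slice (dim=2) or one full element (dim=3) of
    (order + 1)**dim nodes, with one thread per node.
    """
    threads = blocks * (order + 1) ** dim
    return min(threads, thread_capacity) / thread_capacity


if __name__ == "__main__":
    print("2D slice kernel, order 3:", occupancy(3, 2))
    print("3D cube kernel, order 3: ", occupancy(3, 3))
```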

These two competing goals of optimizing kernels for high-

order and low-order polynomials can be handled separately

with different kernels optimized for each. More convenient is

to write the volume kernel in such a way that it can process

multiple elements in a thread block with one or more slice

workers simultaneously. For this reason, the volume, surface

and gradient kernels accept parameters Nk, for number of

elements to process per block, and Ns, for the number of

slice workers per element. We should note here that, due to the run-time compilation feature of OCCA, parameters such as the polynomial order are compile-time constants; as a result, kernels are optimized for the selected set of parameters. For example, with $N_k = 1$ and $N_s = 1$, the kernels produced will be exactly the same as those we had before adding the multiple-element-per-block and slicing approaches. If a kernel uses

shared memory to store data for each element processed per


block and slice worker, its shared memory consumption will

increase in proportion with Nk×Ns, as shown in Alg. 2.
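The index arithmetic of Alg. 2 that maps an inner work-item to an element and a set of slices can be checked with a small Python sketch. The name `slice_assignment` is hypothetical, and we assume that the inner dimension has exactly Nk × Ns work-items, which the algorithm does not state explicitly.

```python
def slice_assignment(n_k, n_s, n_q):
    """Map each inner work-item id to (local element id, slices it processes).

    Mirrors wId = innerId2 mod Ns, elId = innerId2 div Ns and the loop
    'for slId = wId to Nq step Ns' of Alg. 2.
    """
    work = {}
    for inner_id in range(n_k * n_s):
        w_id = inner_id % n_s      # slice worker id
        el_id = inner_id // n_s    # element within the block
        work[inner_id] = (el_id, list(range(w_id, n_q, n_s)))
    return work


if __name__ == "__main__":
    for inner_id, (el_id, slices) in slice_assignment(2, 2, 4).items():
        print(f"work-item {inner_id}: element {el_id}, slices {slices}")
```

With Nk = 1 and Ns = 1 the single worker visits every slice of the single element in order, which matches the remark that the original kernel is recovered.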

5.3.2 Surface kernel The surface integration, shown in

Alg. 3, is conducted in two stages in accordance with (Klöckner et al. 2009): the flux gather stage collects

contributions of elements to the numerical ﬂux at face nodes,

and the lifting stage integrates the face values back into the

volume vector. Lifting, in our case, is a simple multiplication

by a factor computed from the ratio of weighted face and

volume Jacobians; this is a result of the tensor-product

approach in conjunction with the choice of integration rule

that results in a diagonal lifting matrix. If the numerical

ﬂux at a physical boundary is pre-determined, for instance

in the case of a no-ﬂux boundary condition, it is directly

set to the prescribed value before lifting. The workload in

surface integration can be split into slices similar to that

used for volume integration. The number of slices available

for parallelization in this case is the number of faces of an

element, six for hexahedra. However, since two faces that

are adjacent to each other share an edge, they cannot be

processed by two slice workers simultaneously. One solution

is to reduce the parallelization to pairs of opposing faces,

thereby avoiding the conﬂict that arises at the edges when

updating ﬂux terms as shown in Fig. 2. A second option

is to use hardware atomic operations to update the flux terms. However, atomic operations on double precision floating point numbers are not supported in hardware by all GPUs at this time.
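To make the flux gather stage concrete, the sketch below evaluates the maximum wave speed and a Rusanov flux of the kind used in Alg. 3, reduced to a single scalar quantity. The function names are hypothetical, the jump is taken as $q^+ - q^-$ (a sign convention we assume, since it is not spelled out here), and the default ratio of specific heats of 1.4 is an assumption for dry air.

```python
import math


def max_wave_speed(un, p, rho, gamma=1.4):
    """|lambda| = |n.u| + sqrt(gamma * p / rho), with un the normal velocity."""
    return abs(un) + math.sqrt(gamma * p / rho)


def rusanov_normal_flux(f_minus, f_plus, q_minus, q_plus, lam):
    """Normal component of the Rusanov flux for one scalar quantity.

    f_minus and f_plus are the normal fluxes n.F(q) evaluated with the
    interior (-) and neighbor (+) states; the jump is assumed to be q+ - q-.
    """
    average = 0.5 * (f_minus + f_plus)
    dissipation = 0.5 * lam * (q_plus - q_minus)
    return average - dissipation


if __name__ == "__main__":
    # Scalar advection with speed a > 0: the result is the upwind flux a * q-.
    a, q_minus, q_plus = 2.0, 1.0, 3.0
    print(rusanov_normal_flux(a * q_minus, a * q_plus, q_minus, q_plus, abs(a)))
```

For continuous states the dissipation term vanishes and the numerical flux reduces to the physical flux, so the surface contribution $F^* - F$ is zero.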

5.3.3 Update kernel The time step update kernel is

relatively straightforward to implement because we are using

explicit time integration, in which new values at a node

are calculated solely from old values at the same node.

However, explicit time stepping is only conditionally stable

depending on the Courant number. The implementation of

implicit-explicit and fully implicit time stepping methods,

which require the solution of a linear system of equations,

is postponed to the future. For now, we implement the low-

storage fourth order RK method of Carpenter and Kennedy

(1994) by storing the solution at the previous time step and

its residual. Since there is no distinction between nodes in

different elements for this particular kernel, we can select the

appropriate block size that best ﬁts the hardware, e.g. 256 in

OpenCL.
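The two-register structure of a low-storage Runge-Kutta update can be sketched in a few lines of Python. This is a scalar illustration rather than the GPU kernel: the function name `lsrk4_step` is hypothetical, and the coefficients are those commonly tabulated for the five-stage, fourth-order scheme of Carpenter and Kennedy (1994); they should be checked against the original report before use.

```python
import math

# Commonly tabulated 2N-storage coefficients (five stages, fourth order).
A = (0.0,
     -567301805773.0 / 1357537059087.0,
     -2404267990393.0 / 2016746695238.0,
     -3550918686646.0 / 2091501179385.0,
     -1275806237668.0 / 842570457699.0)
B = (1432997174477.0 / 9575080441755.0,
     5161836677717.0 / 13612068292357.0,
     1720146321549.0 / 2090206949498.0,
     3134564353537.0 / 4481467310338.0,
     2277821191437.0 / 14882151754819.0)
C = (0.0,
     1432997174477.0 / 9575080441755.0,
     2526269341429.0 / 6820363962896.0,
     2006345519317.0 / 3224310063776.0,
     2802321613138.0 / 2924317926251.0)


def lsrk4_step(f, t, u, dt):
    """Advance u by one step using only the solution and one residual register."""
    res = 0.0
    for a, b, c in zip(A, B, C):
        res = a * res + dt * f(t + c * dt, u)  # update the residual in place
        u = u + b * res                        # update the solution in place
    return u


if __name__ == "__main__":
    u, t, dt = 1.0, 0.0, 0.1
    for _ in range(10):
        u = lsrk4_step(lambda time, y: -y, t, u, dt)
        t += dt
    print("error vs exp(-1):", abs(u - math.exp(-1.0)))
```

Because each node is updated only from its own solution and residual values, the loop body maps directly onto one GPU thread per node.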

5.3.4 Project kernel The direct stiffness summation

(DSS) operation is implemented in two steps, namely gather

and scatter stages. The DSS kernel, shown in Alg. 4, accepts

a vector of node numbers in Compressed Sparse Row (CSR)

format. This vector is used to gather local node values and accumulate the result at global nodes, each of which may map to multiple local nodes. One thread is launched for each

global node to accumulate the values from all local nodes

sharing this global node. As a result, no conﬂict will arise

while accumulating values because the gather at a node

is done sequentially by the same thread. For the single

GPU implementation, we can immediately start the scatter

operation which does the opposite operation of scattering the

gathered value back to the local nodes. However, a multi-

GPU implementation requires communication of gathered

values between GPUs before scattering as will be discussed

in Sec. 6.
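The gather and scatter stages of Alg. 4 can be reproduced serially as follows. This is a sketch under assumptions: the function name `dss` is hypothetical, the loop over `range(starts[n], starts[n + 1])` treats the CSR end offset as exclusive, and negative entries in `indices` are skipped as in the algorithm.

```python
def dss(Q, starts, indices, wgt):
    """Weighted gather to global nodes, then scatter back to local nodes.

    Q is modified in place; the list of global (continuous) values is returned.
    Each iteration of the outer loop corresponds to one GPU thread.
    """
    n_global = len(starts) - 1
    q_cont = [0.0] * n_global
    for n in range(n_global):
        gathered = 0.0
        for m in range(starts[n], starts[n + 1]):   # gather stage
            ind = indices[m]
            if ind >= 0:
                gathered += Q[ind] * wgt[ind]
        q_cont[n] = gathered
        for m in range(starts[n], starts[n + 1]):   # scatter stage
            ind = indices[m]
            if ind >= 0:
                Q[ind] = gathered
    return q_cont


if __name__ == "__main__":
    # Two linear 1D elements: local nodes (0, 1) and (2, 3); nodes 1 and 2 coincide.
    Q = [1.0, 2.0, 4.0, 8.0]
    starts = [0, 1, 3, 4]
    indices = [0, 1, 2, 3]
    wgt = [1.0, 0.5, 0.5, 1.0]
    print(dss(Q, starts, indices, wgt), Q)
```

Since each global node is owned by exactly one loop iteration, no two iterations write to the same accumulator, which is the property that avoids conflicts on the GPU.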

5.3.5 Diffusion kernels For the purposes of the current work, we shall use constant second order artificial viscosity to stabilize both the CG and DG methods in NUMA †. The stabilizing term, shown in Eq. (5), is in divergence form $\nabla \cdot (\mu \nabla q)$ so that we will be able to use dynamic viscosity methods in the future. However, we use constant viscosity in the current work, which reduces the stabilizing term to a Laplacian operator $\mu \nabla^2 q$.

For stabilizing CG, we use the primal form discretization of the Laplacian operator. Let us start with the DG discretization with numerical flux $q^*$ given in weak form as

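$$\int_{\Omega_e} \psi_i \nabla \cdot (\mu \nabla q) \, d\Omega_e = \underbrace{\int_{\Gamma_e} \psi_i \hat{n} \cdot (\mu \nabla q^*) \, d\Gamma_e}_{\text{surface}} - \underbrace{\int_{\Omega_e} \nabla \psi_i \cdot (\mu \nabla q) \, d\Omega_e}_{\text{volume}} \tag{13}$$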

and in the strong form as

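$$\int_{\Omega_e} \psi_i \nabla \cdot (\mu \nabla q) \, d\Omega_e = \underbrace{\int_{\Gamma_e} \psi_i \hat{n} \cdot (\mu \nabla q^* - \mu \nabla q) \, d\Gamma_e}_{\text{surface}} + \underbrace{\int_{\Omega_e} \psi_i \nabla \cdot (\mu \nabla q) \, d\Omega_e}_{\text{volume}}. \tag{14}$$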

If we then ensure $C^1$-continuity in the CG discretization, i.e., by applying DSS on the gradient so that $\nabla q = \nabla q^*$, the surface integral term disappears from the strong form formulation. The weak form CG formulation still retains the surface integral term despite DSS; however, this term needs to be evaluated only at physical boundaries because it cancels out at interior boundaries due to $\nabla q^+ = \nabla q^-$. In addition, the term disappears completely if no-flux boundary conditions are used; dropping the surface integral term in other cases results in an inconsistent method, but one that could still be feasible for the purpose of numerical stabilization. The kernel for computing the volume contribution of the strong form discretization is already given in Alg. 1. The volume kernel for the weak form discretization is shown in Alg. 5. The first step in this kernel is to load the field variable $q$ into the fast shared memory. Then, we compute and store the local gradients at each LGL node, similar to what is done in the volume kernel. The shared memory requirement of this kernel is rather high due to the need for temporarily storing the gradients in addition to the field variable. On the other hand, the mixed form stabilization method we use for DG, i.e., computing and storing the gradient in global memory, puts less stress on shared memory, while being potentially slower.

The same kind of optimizations used for the volume kernel,

such as splitting into slices and multiple elements per block

†Hyper-diffusion can also be used but in order to simplify the exposition,

we shall only remark on second order diffusion.


Figure 2. Coloring of faces for parallel computation of surface integral. Opposing faces can be processed simultaneously because

there are no shared edges between them.

Algorithm 3 Surface kernel

map[3][2] = ((0,5),(1,3),(2,4))    ▷ Pairs of faces, shown in Fig. 2, for parallel computation
procedure SURFACEKERNEL(q, R)
    for outerId0 do
        for innerId2 do
            wId = innerId2 mod Ns    ▷ Slice worker id
            elId = innerId2 div Ns    ▷ Element id
            for wId to 2 step Ns do
                for b = 0 to 2 do
                    slId = map[b][wId]    ▷ Get face
                    for j, i ∈ {0 . . . Nq} do
                        e = Nk × outerId0 + elId
                        Load face normal $\hat{n}$ and lift coefficient $L = (w_{ij} J_{ij}) / (w_{ijk} J_{ijk})$
                        Load $q^+$ and $q^-$ for current node and adjoining node in the other element
                        Compute maximum wave speed $|\lambda| = |\hat{n} \cdot u| + \sqrt{\gamma p / \rho}$
                        Compute Rusanov flux $F(q)^* = \{F(q)\} - \hat{n} \frac{|\lambda|}{2} [[q]]$
                        R += $L \times \hat{n} \cdot (F(q)^* - F(q))$

Algorithm 4 DSS kernel

procedure DSSKERNEL(Q, Qcont, starts, indices, nGlobal, wgt)
    for outerId0 do
        n = outerId0    ▷ Global node id
        if n ≤ nGlobal then
            start = starts[n]    ▷ Read indices of local nodes for the DSS operation
            end = starts[n+1]
            gQ = 0    ▷ Gather stage of DSS
            for m = start to end do
                ind = indices[m]    ▷ Local node index
                if ind ≥ 0 then
                    pw = wgt[ind]    ▷ DSS weight computed based on lumped mass coefficients
                    gQ += Q[ind] × pw
            Qcont[n] = gQ
            for m = start to end do    ▷ Scatter stage of DSS
                ind = indices[m]
                if ind ≥ 0 then
                    Q[ind] = Qcont[n]

processing, can be used here as well. After computing the local gradients, the $\nabla \psi_i \cdot \mu \nabla q_j$ term, which is represented by the combined geometric factors $JJ^T$, can be computed immediately. Note that we use local memory fences to synchronize the read/write operations in shared memory. The fact that we use a discontinuous space even for CG forces us to apply DSS on both $q$, for which we already applied DSS at the end of the time step or RK-stage, and $\nabla q$, for which we ignore DSS for efficiency reasons discussed later in this section. In case of hyper-viscosity of


order 3 or more, DSS on $\nabla q$ may be required to ensure at least $C^1$ continuity.

For stabilizing DG, we use the mixed form of Bassi and Rebay (1997). The viscous term $\nabla \cdot (\mu \nabla q)$ needed for stabilizing the Euler equations in Eq. (5) requires us to first compute the gradient $\nabla q$. We can write the computation of the stabilizing term in mixed form as follows

$$\nabla q = Q, \qquad \nabla \cdot (\mu \nabla q) = \nabla \cdot (\mu Q), \tag{15}$$

where $Q$ is the auxiliary variable. Because we are evaluating the stabilizing term explicitly, we can solve the equations in a straightforward decoupled manner (Bassi and Rebay 1997). The strong form DG discretization of the first part of Eq. (15) is as follows

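$$\int_{\Omega_e} \psi_i Q \, d\Omega_e = \underbrace{\int_{\Gamma_e} \psi_i \hat{n} \cdot (q^* - q) \, d\Gamma_e}_{\text{surface}} + \underbrace{\int_{\Omega_e} \psi_i \nabla q \, d\Omega_e}_{\text{volume}}. \tag{16}$$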

We should note that the surface integral term is zero for strong form CG because $q^* = q$ due to continuity. Once we compute $Q$, we can then compute the viscous term via the discretization

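$$\int_{\Omega_e} \psi_i \nabla \cdot (\mu \nabla q) \, d\Omega_e = \underbrace{\int_{\Gamma_e} \psi_i \hat{n} \cdot (\mu Q^* - \mu Q) \, d\Gamma_e}_{\text{surface}} + \underbrace{\int_{\Omega_e} \psi_i \nabla \cdot (\mu Q) \, d\Omega_e}_{\text{volume}}. \tag{17}$$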

Following Bassi and Rebay (1997), we use centered fluxes for both $q$ and $Q$ such that $q^* = \{q\}$ and $Q^* = \{Q\}$. The mixed form is implemented directly by first computing the volume integral of the gradient in Eq. (16) using Alg. 1, and then modifying the result with the surface integral contribution computed using centered fluxes $q^* = \{q\}$. It is necessary to store $Q$ in global memory, unlike the case

for CG, and compute the surface integral using a different

kernel because data is required from neighboring elements.

This difﬁculty would have also manifested itself in CG if we

chose to gather-scatter Q, which would require a separate

kernel for similar reasons, and force us to use the mixed

form. The fact that we need this term just for stabilization,

and not, for instance for the implicit solution of the Poisson

problem, gives us some leeway to its implementation on the

GPU for performance reasons. However, in the CPU version

of NUMA we apply the DSS operator (which requires inter-

process communication) right after computing the gradient

Q. The kernel for computing the surface gradient ﬂuxes is

similar to the surface integration kernel discussed in Section

5.3.2 — with the only difference being that we use centered

ﬂuxes instead of the upwind-biased Rusanov ﬂux. Finally,

the volume and surface integral contributions of the viscous

term in Eq. (17) are added to the right-hand side vector in the

volume and surface kernels, respectively. In the future we

will study stabilization of DG using the Symmetric Interior

Penalty Method (SIPG) – which shares the same volume

integration kernel as the weak-form CG stabilization method.

6 Multi-GPU implementation

The ever increasing need for higher resolution in numerical

weather prediction (NWP) implies that such large scale

simulations cannot be run on a single GPU card due to

memory limitations. A practical solution is to cluster cheap

legacy GPU cards and break down the problem into smaller

pieces that can be handled by a single GPU card; however,

this necessitates communication between GPUs which is

often a bottleneck of performance. We extend our single

GPU implementation of NUMA to a multi-GPU version

using the existing framework for conducting multi-CPU

simulations on distributed memory computers (see (Kelly

and Giraldo 2012) for details). The communication between

GPUs is done indirectly through CPUs which is the reason

why we were able to use the existing MPI infrastructure.

We should note that the latest technology in GPU hardware

allows for direct communication between GPUs but the

technology is not yet mature and also the GPU cards are more

expensive.

6.1 Multi-GPU parallelization of EBG methods

The goal of parallelizing NUMA to distributed memory CPU

clusters has already been achieved in (Kelly and Giraldo

2012), in which linear scalability up to tens of thousands of

CPUs was demonstrated. More recently, the scalability of the implementation was tested on the Mira supercomputer, located at Argonne National Laboratory, using 3.1 million MPI ranks (Müller et al. 2016). NUMA achieved linear scalability

for both explicit and 1D implicit-explicit (IMEX) time

integration schemes in global numerical weather prediction

problems. The current work extends the capability of NUMA

to multi-GPU clusters which are known to deliver much more floating point operations per second (FLOPS) than multi-CPU clusters. In the following sections, we describe the parallel grid generation and partitioning, and the multi-GPU CG and DG implementations.

6.1.1 Parallel grid generation The grid generation and

partitioning stages are done on the CPU and then geometric

data is copied to the GPU once at start up. The reason for

this choice is mainly a lack of robust parallel grid generator

software with a capability of Adaptive Mesh Reﬁnement

(AMR) on the GPU. Originally NUMA used a local grid

generation code and the METIS graph partitioning library for

domain decomposition; however, the need for parallel grid

generation and parallel visualization output processing was

exposed while conducting tests on the Mira supercomputer.

Even though a parallel version of METIS (ParMETIS) exists,

we chose to adopt the parallel hexahedral grid generation and

partitioning software p4est (Burstedde et al. 2011) mainly

because of the latter’s capability of parallel AMR. In static

AMR mode, p4est is in effect a parallel grid generator.

Dynamic AMR requires copying geometric data to the GPU

more than once, i.e., whenever AMR is conducted. For this

reason, recomputing all geometric data on-the-ﬂy on the

GPU could potentially improve performance. ParMETIS is

a graph partitioning software and as such is not capable of

mesh reﬁnements.

6.1.2 Multi-GPU CG The coupling between sub-domains

in the CG spatial discretization is achieved by the Direct


Algorithm 5 Laplacian diffusion kernel

procedure LAPLACE(Q, rhs, nu)
    Shared sq, sqr, sqs, sqt    ▷ All arrays of size [Nq][Nq][Nq]
    Memory fence
    for k, j, i ∈ {0 . . . Nq} do    ▷ Load field variables into shared memory
        sq[k][j][i] = q
    Memory fence
    for k, j, i ∈ {0 . . . Nq} do
        qr = 0; qs = 0; qt = 0    ▷ Compute local gradients in r-s-t
        for n ∈ {0 . . . Nq} do
            qr += sD[i][n] × sq[k][j][n]    ▷ sD are ∇ψ at LGL nodes preloaded to shared memory
            qs += sD[j][n] × sq[k][n][i]
            qt += sD[k][n] × sq[n][j][i]
        sqr[k][j][i] = µ(G11 × qr + G12 × qs + G13 × qt)    ▷ Gs are coefficients of the symmetric $JJ^T$ matrix
        sqs[k][j][i] = µ(G12 × qr + G22 × qs + G23 × qt)
        sqt[k][j][i] = µ(G13 × qr + G23 × qs + G33 × qt)
    Memory fence
    for k, j, i ∈ {0 . . . Nq} do
        lapq = 0
        for n ∈ {0 . . . Nq} do
            lapq += sD[n][i] × sqr[k][j][n]
            lapq += sD[n][j] × sqs[k][n][i]
            lapq += sD[n][k] × sqt[n][j][i]
        rhs -= Jinv × lapq

Stiffness Summation (DSS) operator, which imposes $C^0$ continuity of solutions at element interfaces. The DSS operator is applied both to the mass matrix and the right-hand side (RHS) vector. Therefore, a multi-GPU implementation of CG requires communication between GPUs only for applying DSS; in fact, we require GPU kernels for applying DSS only on the RHS vector because the construction of the mass matrix is done on the CPU. However, to apply DSS on the RHS vector, we need several kernels. Alg. 6 outlines the steps required for applying DSS in a multi-GPU CG implementation. First, we need a kernel to do the intra-GPU gather operation on the RHS vector. Then, the values at inter-GPU boundaries are copied to a contiguous block of GPU global memory, after which the data is copied to the CPU. The CPUs then communicate the boundary data to construct the global RHS using the existing MPI infrastructure in NUMA. Once the CPUs complete the DSS operation, they copy the boundary data back to the GPU global memory. Contributions from neighboring processors are processed one by one to update the RHS vector; without this 'coloring' of neighboring processors, conflicts in RHS updates can occur at shared edges and corner nodes of elements. The last stage does the intra-GPU scatter operation of DSS.

6.1.3 Multi-GPU DG The coupling between sub-domains

in the DG spatial discretization is achieved by the deﬁnition

of the numerical ﬂux at shared boundaries. DG lends itself

to a simple computation-communication overlap; though

CG can beneﬁt from computation-communication overlap

as well, it requires more effort to do so (Deville et al.

2002). Overlapping is especially important in a multi-GPU

implementation to hide the latency associated with the data

transfer between the CPU and GPU. Inter-processor ﬂux

calculation requires values from the left and right elements

sharing a face; however, intra-processor ﬂux calculation

and computation of volume integrals can proceed while

the necessary communication for computing inter-processor

ﬂux is going on. Alg. 7 shows an outline of a multi-

GPU DG implementation with communication-computation

overlap. The latest technology in GPUs allows for copying

data asynchronously using streams. We overlap computation

and communication using two streams designated for each.

The copying of data to and from the GPU is carried out

on the copy stream (COPY), all computations on the GPU

are done on the computation stream (COMP), and MPI

communications between CPUs are on the host stream

(HOST). A wait statement invoked on any device stream

blocks the host thread until all operations on that stream

come to completion. Even though we do not show it for the

sake of simplicity, the communication of ∇qfor the LDG

stabilization method is also done similarly.

7 Performance tests

7.1 Speedup results

First, we present speedup results for the GPU implementation of NUMA against the base Fortran code ‡. In Table 3,

the time to solution of three test cases, solved using explicit

DG, is presented. This information is useful to get a rough

estimate of the performance per dollar on different GPU

cards. We will present the details of the test cases later in

Sec. 8; here we give the workload of each problem:

1. 2D Rising-thermal bubble: 100 elements with polynomial order 7, for a total of 51200 nodes

‡The base Fortran code is the original CPU code, i.e., not the OCCA implementation that we use on the GPUs.


Algorithm 6 DSS on the GPU for the RHS vector

procedure DSS(RHS)
    Gather RHS    ▷ See Alg. 4 for details
    Copy boundary data to contiguous block of global memory
    Copy boundary data to CPU
    CPUs communicate and form the global RHS
    CPUs copy the assembled RHS back to the GPU
    for all neighbors do    ▷ To avoid conflict in RHS update
        Boundary data is used to update the RHS vector
    Scatter RHS    ▷ See Alg. 4 for details

Algorithm 7 Asynchronous Multi-GPU DG

procedure ASYNCH DG COMM
    [COMP] Pack boundary data to a contiguous block of global memory
    [COMP] Wait
    [COPY] Start copying boundary data asynchronously from GPU to CPU
    [COMP] Start computing volume integrals and intra-processor flux
    [COPY] Wait
    [HOST] Send boundary data to neighboring processors asynchronously
    [HOST] MPI waitall
    [COPY] Start copying boundary data asynchronously from CPU to GPU
    [COPY] Wait
    [COMP] Compute inter-processor flux

2. 3D Rising-thermal bubble: 1000 elements with

polynomial order 5, for a total of 216000 nodes

3. Acoustic wave on the sphere: 1800 elements with

polynomial order 4, for a total of 225000 nodes

where nodes, here, denote the number of gridpoints in the

mesh. We obtained two orders of magnitude speedups on

the newer GPU cards (GTX Titan Black and K20X) over a

single core 2.2GHz AMD CPU. The specs for the GPU cards, bandwidth and double precision TFLOPS, are as follows: C2070: 144 GB/s, 0.5 TFLOPS; Titan Black: 336 GB/s, 1.7 TFLOPS; and K20X: 225 GB/s, 1.3 TFLOPS.
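The node totals listed for the three workloads are consistent with counting $(N+1)^3$ nodes per element, including the 2D rising-thermal bubble case. The short check below makes this explicit; the function name `total_nodes` is hypothetical.

```python
def total_nodes(n_elements, order):
    """Number of grid points for hexahedral elements of a given polynomial order."""
    return n_elements * (order + 1) ** 3


if __name__ == "__main__":
    workloads = {
        "2D rising-thermal bubble": (100, 7),
        "3D rising-thermal bubble": (1000, 5),
        "Acoustic wave on the sphere": (1800, 4),
    }
    for name, (n_elements, order) in workloads.items():
        print(f"{name}: {total_nodes(n_elements, order)} nodes")
```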

Next, we present performance tests on the Titan supercomputer located at the Oak Ridge National Laboratory, where

each node has a K20X GPU card and an AMD Opteron 6274

CPU with 16 cores at 2.2 GHz. The GPU card has 2,688

cores at 0.732 GHz, 6 GB memory, 250 GB/s bandwidth

with peak performances of 1.31 and 3.95 teraﬂops in double

and single precision, respectively. The speedup results are

reported relative to the NUMA Fortran code using all 16

cores of the CPU. We will examine the different kernel

design and parameter choices we made in Sec. 5 using the

2D rising thermal bubble benchmark problem. The problem

size is increased progressively from 10x10=100 elements

until we ﬁll up all the memory available on the device at

160x160=25600 elements. The ﬁrst test result, presented in

Table 4, evaluates the performance of the cube volume kernel

at low-order polynomials using both OpenCL and CUDA

translations of the native OCCA code. Although NVIDIA

hardware includes interfaces for both OpenCL and CUDA,

we obtained better performance with CUDA kernels on

this particular hardware. Also, we observe markedly better

speedups at polynomial orders 4 and 7 compared to other

polynomial orders. The reason for the good performance at polynomial order 7 is the thread block size of $(7+1)^3 = 512$, which perfectly fits the hardware block size. Polynomial order 4 gives a thread block size of $(4+1)^3 = 125$, which is only slightly less than 128. Therefore, this observation

emphasizes the importance of selecting parameters to get

optimum block dimensions that are multiples of the warp

size.

GPUs are known to deliver higher performance using

single precision (SP) arithmetic than double precision (DP).

For instance, the SP peak performance of a K20X GPU is 3x

more than its DP peak performance. In Table 5, we present

the speedup results comparing SP and DP performance. We

obtain a maximum speedup of about 15x and 11x using

single and double precision calculations, respectively. The

reason for different speedup numbers for SP and DP is that

NUMA running on the CPU is able to achieve a speedup of

only 1.5x using SP, while the GPU performance more than

doubles using SP.
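The relation between the single and double precision speedups can be reproduced from the timings in Table 3. The sketch below uses the K20X rows for the 2D rising-thermal bubble case; the function name `speedup` is hypothetical, and small differences from the tabulated speedups are expected because the published times are rounded.

```python
def speedup(reference_time, accelerated_time):
    """Ratio of two wall-clock times."""
    return reference_time / accelerated_time


if __name__ == "__main__":
    # 2D rtb, K20X GPU vs 16 cores of the AMD Opteron 6274 (Table 3).
    cpu_dp, gpu_dp = 103.17, 13.97
    cpu_sp, gpu_sp = 77.75, 6.89

    print(f"DP speedup (GPU over CPU): {speedup(cpu_dp, gpu_dp):.2f}")
    print(f"SP speedup (GPU over CPU): {speedup(cpu_sp, gpu_sp):.2f}")
    print(f"CPU gain from SP:          {speedup(cpu_dp, cpu_sp):.2f}")
    print(f"GPU gain from SP:          {speedup(gpu_dp, gpu_sp):.2f}")
```

The ratio of the SP and DP speedups equals the GPU gain from single precision divided by the CPU gain, which is why the GPU speedup grows when the CPU benefits less from single precision.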

For low order polynomials, we can process two or more

elements per block to get an optimal block size. Table 6

shows the performance comparison of this scheme using

one and two elements per block. We can see that the

performance is signiﬁcantly improved by processing two

elements per block for polynomial orders up to 5; the block

size, when processing two elements per block, exceeds

the hardware limit at polynomial orders above 5. The 100

elements simulation is not able to see any beneﬁt from this

approach because the device will not be fully occupied when

processing two elements per block. All the other runs show

signiﬁcant beneﬁts from processing two elements per block,

except at polynomial order 4 — for which performance

remains more or less the same. We mentioned earlier that

polynomial order 4 gives a block size that is close to optimal,

hence, there is really no need to process more than one

element per block for this particular conﬁguration.


Table 3. Speedup comparison between CPU and GPU for both single precision and double precision calculations. The test is conducted on three types of GPU cards: an old Tesla C2070 and two newer cards, the GTX Titan Black and K20X GPUs. Two orders of magnitude performance improvement is obtained relative to a single core CPU with the newer cards.

                 Double precision               Single precision
Test case        CPU       GPU      Speedup     CPU       GPU      Speedup

Tesla C2070 GPU vs One core of Intel Xeon E5645
2D rtb           930.1     27.8     33.4        612.3     13.4     45.6
3D rtb           4408.9    141.9    31.1        3097.0    54.5     56.8
Acoustic wave    3438.8    96.7     35.6        2379.9    44.4     53.6

GTX Titan Black GPU vs One core of Intel Xeon E5645
2D rtb           930.1     8.87     104.9       612.3     4.67     131.0
3D rtb           4408.9    41.47    106.3       3097.0    18.68    165.8
Acoustic wave    3438.8    26.72    128.7       2379.9    15.56    152.9

K20X GPU vs 16-cores of 2.2GHz AMD Opteron 6274
2D rtb           103.17    13.97    7.38        77.75     6.89     11.28
3D rtb           434.36    61.14    7.10        339.61    28.12    12.08
Acoustic wave    166.06    21.10    7.87        132.46    11.24    11.78

Table 4. OpenCL vs CUDA: Speedup comparison between CPU and GPU for double precision calculations at different number of

elements and polynomial orders using OpenCL and CUDA translation of the native OCCA kernel code. The GPU card is K20X and

the CPU is a 16-core 2.2GHz AMD Opteron 6274. The timing (in sec) and speedup are given ﬁrst for OpenCL and then for CUDA.

The results show CUDA compiled kernels are optimized better. Also polynomial orders 4 and 7 give better speedup numbers in all

cases.

N 10x10=100 elements 30x30=900 elements 40x40=1600 elements

CPU GPU Speedup CPU GPU Speedup CPU GPU Speedup

2 1.46 0.59/0.52 2.47/2.81 10.62 2.57/2.17 4.13/4.90 18.83 4.34/3.70 4.34/5.09

3 2.68 0.69/0.59 3.88/4.54 22.01 3.56/3.06 6.18/7.19 41.53 5.84/5.04 7.11/8.24

4 5.30 0.97/0.86 5.46/6.16 46.45 5.50/5.12 8.45/9.07 81.91 9.27/8.69 8.84/9.43

5 8.12 1.47/1.37 5.52/5.93 77.03 10.53/9.88 7.32/7.80 137.49 18.33/17.11 7.50/8.04

6 13.89 2.27/2.11 6.11/6.58 122.27 17.24/16.11 7.09/7.59 210.35 30.15/28.15 6.98/7.47

7 20.49 2.68/2.41 7.64/8.50 195.61 20.82/18.87 9.40/10.37 343.74 36.36/33.05 9.45/10.40

N 80x80=6400 elements 120x120=14400 elements 160x160=25600 elements

CPU GPU Speedup CPU GPU Speedup CPU GPU Speedup

2 80.72 15.71/13.33 5.14/6.05 184.19 33.47/27.82 5.50/6.62 336.19 61.56/52.01 5.46/6.46

3 179.07 21.46/18.46 8.34/9.70 405.15 47.63/41.08 8.51/9.86 729.17 84.40/72.61 8.64/10.04

4 350.54 35.01/32.71 10.01/10.71 798.50 77.85/72.77 10.26/10.97 1392.60 138.64/129.64 10.04/10.74

5 587.17 71.90/67.03 8.17/8.76 1329.79 161.42/150.56 8.24/8.83 2352.46 286.74/267.48 8.20/8.79

6 925.25 118.81/110.92 7.79/8.34 2086.84 267.12/249.50 7.82/8.36 - - -

7 1406.61 142.67/130.16 9.86/10.81 3158.43 320.77/293.05 9.84/10.78 - - -

Table 5. Double vs Single Precision: Speedup comparison between CPU and GPU for single and double precision calculations at

different number of elements and polynomial orders using CUDA translation of OCCA kernel code. A maximum speedup of about

15x is observed. The CPU/GPU times and Speedups are given ﬁrst for double precision and then for single precision. The GPU

card is K20X and the CPU is a 16-core 2.2GHz AMD Opteron 6274.

N 10x10=100 elements 30x30=900 elements 40x40=1600 elements

CPU GPU Speedup CPU GPU Speedup CPU GPU Speedup

2 1.46/1.39 0.52/0.47 2.81/2.96 10.62/9.98 2.17/1.57 4.90/6.36 18.83/17.41 3.70/2.53 5.09/6.88

3 2.68/2.60 0.59/0.49 4.54/5.31 22.01/19.66 3.06/1.87 7.19/10.51 41.53/34.72 5.04/3.06 8.24/11.35

4 5.30/4.51 0.86/0.54 6.16/8.35 46.45/35.19 5.12/3.03 9.07/11.61 81.91/63.55 8.69/5.07 9.43/12.53

5 8.12/7.23 1.37/0.77 5.93/9.39 77.03/61.35 9.88/4.86 7.80/12.62 137.49/107.30 17.11/8.35 8.04/12.85

6 13.89/11.18 2.11/1.07 6.58/10.45 122.27/95.67 16.11/7.71 7.59/12.41 210.35/166.40 28.15/13.49 7.47/12.33

7 20.49/15.97 2.41/1.31 8.50/12.19 195.61/135.21 18.87/9.65 10.37/14.01 343.74/236.09 33.05/16.86 10.40/14.00

N 80x80=6400 elements 120x120=14400 elements 160x160=25600 elements

CPU GPU Speedup CPU GPU Speedup CPU GPU Speedup

2 80.72/70.41 13.33/8.94 6.05/7.88 184.19/172.85 27.82/19.78 6.62/8.74 336.19/285.83 52.01/34.92 6.46/8.18

3 179.07/142.19 18.46/11.18 9.70/12.72 405.15/324.78 41.08/24.87 9.86/13.06 729.17/589.22 72.61/44.10 10.04/13.36

4 350.54/268.69 32.71/19.02 10.71/14.13 798.50/599.25 72.77/42.34 10.97/14.15 1392.60/1069.24 129.64/76.01 10.74/14.07

5 587.17/429.66 67.03/32.38 8.76/13.27 1329.79/1007.31 150.56/72.08 8.83/13.97 2352.46/1729.34 267.48/129.28 8.79/13.37

6 925.25/696.25 110.92/52.91 8.34/13.16 2086.84/1586.54 249.50/118.39 8.36/13.40 - - -

7 1406.61/968.10 130.16/66.41 10.81/14.58 3158.43/2227.29 293.05/148.76 10.78/14.97 - - -

We mentioned in Sec. 5 that using the vector datatype float4 to store field variables may improve performance because a single load operation is issued to fetch a float4 value instead of four separate loads. Table 7 compares the speedup obtained using the float1 and float4 versions of the volume kernel. The float4 version performs better in most cases; here, again, the performance at polynomial order 4 is more or less the same for both versions.

We discussed in Sec. 5 different ways of handling hardware limitations for high-order polynomial approximations. Thread block size and shared memory limits allow us to use the volume kernel tested so far only up to polynomial order 7. First, we compare the performance of the four ways of using shared memory and L1 cache; namely, the naive, Shared-1, Shared-2 and two-pass (horizontal+vertical) methods. Fig. 3 shows that the two-pass method performs the best: about two times better than the naive approach, which does not use shared memory and relies entirely on L1 cache. The Shared-1 and Shared-2 methods perform similarly; this implies that the Shared-2 approach suggested in Micikevicius (2009) is not working as expected.

Even though we try to store the data in the vertical direction


Table 6. Multiple elements per block: The performance of the cube volume kernel can be improved by processing more than one

element in a thread block simultaneously. The GPU times and Speedups are given ﬁrst for the 1 element-per-block and then for the

2 elements-per-block approaches. Improvement in performance is observed using 2 elements-per-block in all the cases except for

the 10x10 elements case, which does not fully occupy the GPU device when processing 2-elements-per-block. The GPU card is

K20X and the CPU is a 16-core 2.2GHz AMD Opteron 6274.

N 10x10=100 elements 30x30=900 elements 40x40=1600 elements

CPU GPU Speedup CPU GPU Speedup CPU GPU Speedup

2 1.46 0.52/0.57 2.81/2.56 10.62 2.17/1.81 4.90/5.87 18.83 3.70/2.85 5.09/6.61

3 2.68 0.59/0.61 4.54/4.39 22.01 3.06/2.93 7.19/7.51 41.53 5.04/4.74 8.24/8.76

4 5.30 0.86/0.92 6.16/5.76 46.45 5.12/5.74 9.07/8.09 81.91 8.69/9.81 9.43/8.35

5 8.12 1.37/1.37 5.93/5.92 77.03 9.88/9.68 7.80/7.96 137.49 17.11/16.72 8.04/8.22

N 80x80=6400 elements 120x120=14400 elements 160x160=25600 elements

CPU GPU Speedup CPU GPU Speedup CPU GPU Speedup

2 80.72 13.33/9.96 6.05/8.10 184.19 27.82/21.10 6.62/8.73 336.19 52.01/38.09 6.46/8.83

3 179.07 18.46/17.5 9.70/10.23 405.15 41.08/38.51 9.86/10.52 729.17 72.61/67.62 10.04/10.78

4 350.54 32.71/37.15 10.71/9.43 798.50 72.77/82.93 10.97/9.63 1392.60 129.64/147.61 10.74/9.43

5 587.17 67.03/65.2 8.76/9.00 1329.79 150.56/146.67 8.83/9.07 2352.46 267.48/260.89 8.79/9.02

Table 7. ﬂoat1 vs ﬂoat4: The effect of using ﬂoat4 for computing the gradient in the volume kernel is compared against the version

of the volume kernel where one ﬁeld variable is loaded. The CPU/GPU time and Speedups are given ﬁrst for ﬂoat1 and then for

ﬂoat4. Some improvement is observed in most cases except when using polynomial order 4, which results in a good thread block

size. The GPU card is K20X and the CPU is a 16-core 2.2GHz AMD Opteron 6274.

N 10x10=100 elements 30x30=900 elements 40x40=1600 elements

CPU GPU Speedup CPU GPU Speedup CPU GPU Speedup

2 1.46 0.52/0.47 2.81/3.11 10.62 2.17/2.06 4.90/5.15 18.83 3.70/3.33 5.09/5.65

3 2.68 0.59/0.57 4.54/4.70 22.01 3.06/3.10 7.19/7.10 41.53 5.04/5.14 8.24/8.08

4 5.30 0.86/0.82 6.16/6.46 46.45 5.12/5.10 9.07/9.11 81.91 8.69/8.69 9.43/9.43

5 8.12 1.37/1.27 5.93/6.39 77.03 9.88/9.38 7.80/8.21 137.49 17.11/16.29 8.04/8.44

6 13.89 2.11/1.93 6.58/7.19 122.27 16.11/14.86 7.59/8.23 210.35 28.15/26.06 7.47/8.07

N 80x80=6400 elements 120x120=14400 elements 160x160=25600 elements

CPU GPU Speedup CPU GPU Speedup CPU GPU Speedup

2 80.72 13.33/12.00 6.05/6.73 184.19 27.82/26.50 6.62/6.95 336.19 52.01/46.85 6.46/7.18

3 179.07 18.46/18.99 9.70/9.43 405.15 41.08/42.66 9.86/9.50 729.17 72.61/74.93 10.04/9.73

4 350.54 32.71/32.88 10.71/10.66 798.50 72.77/73.00 10.97/10.94 1392.60 129.64/129.72 10.74/10.74

5 587.17 67.03/64.12 8.76/9.16 1329.79 150.56/144.01 8.83/9.23 2352.46 267.48/256.62 8.79/9.17

6 925.25 110.92/102.85 8.34/9.00 2086.84 249.50/222.08 8.36/9.40 - - -

Table 8. Higher order polynomials: The performance of the two pass method, with horizontal + vertical split, is evaluated at higher

order polynomials in double precision calculations. This kernel performs slower than the cube volume kernel when used for low

order polynomials, but it is the best performing version among the volume kernels we considered for high order. The GPU card is

K20X and the CPU is a 16-core 2.2GHz AMD Opteron 6274.

N 10x10=100 elements 30x30=900 elements 40x40=1600 elements

CPU GPU Speedup CPU GPU Speedup CPU GPU Speedup

8 31.17 4.77 6.53 271.50 33.19 8.18 492.23 58.17 8.46

9 43.21 5.90 7.32 373.63 44.52 8.39 666.37 77.84 8.56

10 59.89 7.14 8.38 493.54 55.02 8.97 909.75 96.49 9.42

11 79.86 9.61 8.31 691.65 75.52 9.15 1199.67 132.28 9.07

12 103.64 13.40 7.73 923.22 107.44 8.59 1713.01 190.06 9.01

13 131.74 16.89 7.80 1140.64 138.13 8.25 2009.99 243.28 8.26

14 169.49 23.52 7.20 1468.72 195.32 7.52 2568.77 340.99 7.53

15 220.91 28.36 7.79 1862.14 233.06 7.99 3352.22 410.42 8.17

[Figure 3 plot: relative speedup (0 to 3) versus Np (7 to 15) for the Naive, Shared-1, Shared-2 and Two-pass methods.]

Figure 3. Comparison of different ways of exploiting fast L1 cache and Shared memory in volume kernel. The speedups are reported relative to the naive approach. The two-pass method performs the best due to better use of shared memory.

in registers, most of it spills to global thread private memory.

Because the polynomial order is high and we are loading all

ﬁeld data (8 ﬂoats) to registers, the register pressure is too

high for the method to show any beneﬁt.


In Table 8, we present the performance of the high-order volume kernel that uses the two-pass method for polynomial orders 8 to 15. It is not possible to solve problems larger than 40x40 elements with polynomial order 15 on this GPU because of the limited memory of 6GB per card. We get a maximum speedup of about 9x at higher-order polynomials, which is slightly less than the 11x we obtained at low-order polynomials; this is understandable because the two-pass method loads data twice and performs calculations twice as well.

7.2 Individual kernel performance tests

To evaluate the performance of individual kernels, we measure the rate of floating point operations in GFLOPS/s and the data transfer rate (bandwidth) in GB/s. Many GPU applications tend to be memory-bound; hence, bandwidth is as important a metric as the rate of floating point operations. The results obtained guide how we optimize the kernels by classifying each of them as either compute-bound or memory-bound. A convenient visualization is the roofline model (Williams et al. 2009), which sets an upper bound on kernel performance based on the peak GFLOPS/s and GB/s of the device. We use two approaches to determine the GFLOPS/s and GB/s: hand-counting the number of floating point operations and bytes loaded to get an estimate of the arithmetic throughput and bandwidth, and using a profiler to get the effective values.
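As a rough illustration of the roofline bound, the following Python sketch computes the attainable rate as the smaller of the peak GFLOPS/s and the product of arithmetic intensity and peak bandwidth, together with the intensity at which the two limits meet. The function names are illustrative only and are not part of NUMA or OCCA; the peak values used in the example are the Tesla K20c double precision figures quoted in this section.

```python
def roofline_bound(intensity, peak_gflops, peak_gbs):
    """Upper bound on GFLOPS/s for a kernel with the given
    arithmetic intensity (FLOP per byte of global memory traffic)."""
    return min(peak_gflops, intensity * peak_gbs)


def ridge_point(peak_gflops, peak_gbs):
    """Arithmetic intensity at which the bandwidth limit meets the compute limit."""
    return peak_gflops / peak_gbs


def classify(intensity, peak_gflops, peak_gbs):
    """Label a kernel by which side of the ridge point it falls on."""
    if intensity < ridge_point(peak_gflops, peak_gbs):
        return "memory-bound"
    return "compute-bound"


if __name__ == "__main__":
    # Double precision peaks quoted for the Tesla K20c: 1170 GFLOPS/s, 208 GB/s.
    peak_gflops, peak_gbs = 1170.0, 208.0
    for intensity in (0.25, 1.0, 4.0, 16.0):
        print(intensity,
              roofline_bound(intensity, peak_gflops, peak_gbs),
              classify(intensity, peak_gflops, peak_gbs))
```

A kernel whose intensity lies to the left of the ridge point can be sped up only by moving fewer bytes or by using the bandwidth more fully, which is why bandwidth is tracked alongside GFLOPS/s.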

The first results, shown in Figs. 4a-4d, are produced by hand-counting the number of floating point operations and bytes loaded from global memory per kernel execution. This is enough to calculate the arithmetic intensity (GFLOPS/GB) and determine whether a kernel is memory- or compute-bound; however, we need to conduct actual simulations to determine kernel execution time and, thus, the efficiency of our kernels in terms of GFLOPS/s and GB/s. The roofline plots show that our efficiency increases with problem size and reaches about 80% for the volume and surface kernels, while 100% efficiency is observed for the update and project kernels. These tests are conducted on the isentropic vortex problem (see Sec. 8.1), which concerns advection of a vortex by a constant velocity. The GPU is a Tesla K20c with the following specifications: 2,496 cores at 0.706 GHz, 5GB of memory, 208 GB/s bandwidth, and peak performances of 1.17 teraflops and 3.52 teraflops in double and single precision, respectively.
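The conversion from hand counts to rates can be sketched as follows in Python. This is a minimal illustration, not the mini-app used for the measurements: the function names are hypothetical, and the operation count, byte count and timing in the example are made-up inputs chosen only to exercise the arithmetic.

```python
def kernel_rates(flops, bytes_moved, seconds):
    """Convert hand-counted work and a measured run time into
    (GFLOPS/s, GB/s, arithmetic intensity in FLOP/byte)."""
    gflops_per_s = flops / seconds / 1.0e9
    gb_per_s = bytes_moved / seconds / 1.0e9
    intensity = flops / bytes_moved
    return gflops_per_s, gb_per_s, intensity


def roofline_efficiency(gflops_per_s, intensity, peak_gflops, peak_gbs):
    """Fraction of the roofline bound that the kernel actually reaches."""
    bound = min(peak_gflops, intensity * peak_gbs)
    return gflops_per_s / bound


if __name__ == "__main__":
    # Hypothetical counts for one kernel launch (not measured values).
    rate, bandwidth, intensity = kernel_rates(flops=2.0e9,
                                              bytes_moved=8.0e9,
                                              seconds=0.05)
    # Tesla K20c double precision peaks quoted in the text.
    eff = roofline_efficiency(rate, intensity, peak_gflops=1170.0, peak_gbs=208.0)
    print(rate, bandwidth, intensity, eff)
```

For a memory-bound kernel the efficiency computed this way equals the achieved bandwidth divided by the peak bandwidth, so a kernel running at the full 208 GB/s would report 100%.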

The highest rate observed in any of the kernels is about 320 GFLOPS/s for the horizontal volume kernel at N = 10 using single precision arithmetic. The vertical volume kernel is a close second, but the surface and update kernels lag far behind in terms of GFLOPS/s. The update kernel, which performs the explicit Runge-Kutta time integration, shows the highest bandwidth at about 208 GB/s, which is in fact the peak memory bandwidth of the device. The projection kernel, which performs the scatter-gather operation of CG, comes in a close second. The volume and surface kernels, though they have the highest arithmetic intensity, lag behind in terms of bandwidth. Therefore, no single kernel exhibits the best performance in terms of both GFLOPS/s and bandwidth.

The roofline plots show that the arithmetic intensities (GFLOPS/GB) of the update, project and surface kernels do not change with polynomial order. When extrapolated, all three vertical lines hit the diagonal of the roofline, confirming that these kernels are memory-bound. The arithmetic intensity of the volume kernels increases with polynomial order, complicating their classification as either compute- or memory-bound; however, for polynomial degrees up to 11 the kernels are still well within the memory-bound region.

The second group of kernel performance tests, shown in Figs. 5a-5d, is conducted using a GTX Titan Black GPU. For these tests we used nvprof to determine the effective arithmetic throughput and memory bandwidth. As a result, the plots obtained from this test are less smooth than the previous plots, which were produced by hand-counting the FLOPS and GB of the kernels. Moreover, here we use the cube volume kernels instead of the split horizontal+vertical kernels. We also changed the test case to a 2D rising thermal bubble problem, which requires numerical stabilization, to invoke the diffusion kernel. The highest rate observed in this test is 700 GFLOPS/s for the volume kernel using single precision. To compare performance with the previous tests, which were produced using a different GPU, we look at the roofline plots instead. We expect the roofline plot for the combined volume kernel to lean more towards the compute-bound region because more floating point operations are done per byte of data loaded. Indeed, this turns out to be the case even though the cube volume kernels were run only up to a maximum polynomial order of 8. The diffusion kernels, used for computing the Laplacian, show performance characteristics similar to those of the volume kernels.

7.3 Scalability test

The scalability of the multi-GPU implementation is tested on a GPU cluster, namely the Titan supercomputer, which has 18688 Nvidia Tesla K20X GPU accelerators. We conduct a weak scalability test, where each GPU gets the same workload, on the 2D rising thermal bubble problem discussed in Sec. 8.2, using 900 elements per GPU with polynomial order 7 in all directions. In a weak scaling test, the time to solution should, ideally, stay constant as the workload is increased; however, delays are introduced by the need for communication between GPUs. The scalability result in Fig. 6 shows that the GPU version of NUMA achieves about 90% scaling efficiency on up to 16384 GPUs. Different implementations of the unified CG/DG algorithms are tested, among which DG with overlapping of computation and communication to hide latency performed the best. Our current CG implementation does not overlap communication with computation and, as a result, its scalability suffers.
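For reference, weak scaling efficiency relative to a single GPU is commonly computed as the single-GPU run time divided by the run time on n GPUs, with the workload per GPU held fixed. The Python sketch below assumes this definition; the function name and the timings in the example are hypothetical and are not measurements from Titan.

```python
def weak_scaling_efficiency(times_by_gpu_count):
    """Efficiency (%) relative to the single-GPU run.

    times_by_gpu_count maps a GPU count to the wall clock time of a run
    in which every GPU holds the same number of elements.
    """
    base = times_by_gpu_count[1]
    return {n: 100.0 * base / t
            for n, t in sorted(times_by_gpu_count.items())}


if __name__ == "__main__":
    # Hypothetical timings in seconds, for illustration only.
    timings = {1: 10.0, 64: 10.8, 4096: 11.0}
    for n, eff in weak_scaling_efficiency(timings).items():
        print(n, round(eff, 1))
```

Under this definition, any time spent on communication that is not hidden behind computation shows up directly as lost efficiency, which is consistent with the overlapping DG variant performing best.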

The 900-element grid per GPU used for producing the scalability plot is far from filling up the GPU memory; hence, the scalability could be improved by increasing the problem size further. In Fig. 7, we compare scalability for different numbers of elements using up to 64 GPUs, which is the point where the efficiency of the parallel implementation flattens out. The scaling efficiency increases by more than 20% going from a grid of 100 elements to one of 900 elements per GPU.

In operational numerical weather prediction (NWP),

strong scaling on multi-GPU systems may be as important

as weak scaling because of limits placed on the simulation


[Figure 4 plots: for each of the panels (a) SP-CG kernels performance, (b) DP-CG kernels performance, (c) SP-DG kernels performance and (d) DP-DG kernels performance, GFLOPS/s versus Np (1 to 11), GB/s versus Np (1 to 11), and a roofline plot of GFLOPS/s versus GFLOPS/GB. The CG panels show the horizontal volume, vertical volume, update and project kernels; the DG panels show the surface, horizontal volume, vertical volume and update kernels. The rooflines are drawn for 208 GB/s with peaks of 3520 GFLOPS/s (SP) and 1170 GFLOPS/s (DP).]

Figure 4. Performance of individual kernels: The efficiency of our kernels are tested on a mini-app developed for this purpose. The FLOPS and byte for this test are counted manually. The volume kernel, that is split into two (horizontal + vertical), has the highest rate of FLOPS/s. The time-step update kernel has the highest bandwidth usage at 208GB/s. The Single Precision (SP) and Double Precision (DP) performance of the main kernels in CG and DG are shown in-terms of GFLOPS/s, GB/s and roofline plots to illustrate their efficiency. The GPU is a Tesla K20c.

time to make a day’s weather forecast. For this reason, we

also conducted strong scaling tests, shown in Fig. 7, on a

global scale simulation problem described in Sec. 8.5. Our

goal here is to determine the number of GPUs required for

a given simulation time limit for two resolutions: a coarse

grid of 13km resolution and a ﬁne grid of 3km resolution.


[Figure 5 plots: for each of the panels (a) SP-CG kernels performance, (b) DP-CG kernels performance, (c) SP-DG kernels performance and (d) DP-DG kernels performance, GFLOPS/s versus Np, GB/s versus Np, and a roofline plot of GFLOPS/s versus GFLOPS/GB. The CG panels show the volume, diffusion, gather, scatter, pressure, update, flux boundary, strong boundary and zero kernels; the DG panels show the volume, surface, gradient volume, gradient surface, pressure, update, boundary and zero kernels. The rooflines are drawn for 334 GB/s with peaks of 5121 GFLOPS/s (SP) and 1707 GFLOPS/s (DP).]

Figure 5. Performance of individual kernels: The efficiency of our kernels is tested after being incorporated into the base NUMA code. The measurements for this test are done using nvprof: effective memory bandwidth = dram_read_throughput + dram_write_throughput, and effective arithmetic throughput = flop_dp_efficiency/flop_sp_efficiency. The Single Precision (SP) and Double Precision (DP) performance of the main kernels in CG and DG are shown in terms of GFLOPS/s, GB/s and roofline plots to illustrate their efficiency. The GPU is a GTX Titan Black.

The grids are cubed sphere with 6x112x112x4 elements§ and N=7 for the 13km resolution test, and 6x144x144x4 elements and N=7 for the 3km resolution test. The plot shows that about 1500 and 8192 GPUs are required to bring down

§On cubed sphere grids, the total number of elements are denoted as $N_{panels} \times N_\xi \times N_\eta \times N_\zeta$ where $N_{panels} = 6$ for the six panels of the cubed sphere, $N_\xi = N_\eta$ are the number of elements in both horizontal


[Figure 6 plot: efficiency (%) versus number of GPUs (logarithmic axis, 10^0 to 10^4) for CG nooverlap, DG nooverlap and DG overlap.]

Figure 6. Scalability test of multi-GPU implementation of NUMA: The scalability of NUMA for up to 16384 GPUs on the Titan supercomputer is shown. Each node of Titan contains a Tesla K20X GPU. An efficiency of about 90% is observed relative to a single GPU. The test is conducted using a unified implementation of CG and DG. The efficiency of DG is significantly improved (by about 20%) when overlapping communication with computation, which helps to hide both the data copying latency between CPU and GPU and CPU-CPU communication latency.

[Figure 7 plots: (left) efficiency (%) versus number of GPUs (0 to 60) for 10x10, 30x30 and 60x60 element grids, each with CG nooverlap, DG nooverlap and DG overlap; (right) wallclock time in minutes versus number of GPUs (0 to 9000) for the 3km and 13km resolutions.]

Figure 7. (left) Scalability test of the multi-GPU implementation for different numbers of elements using up to 64 nodes of Titan. The 60x60 element grid gives much better scalability than the 10x10 grid; hence, we expect better scaling results with bigger problems. (right) Strong scalability test for 3km and 13km resolution global simulations on the sphere.

the simulation time below 100 min for the coarse and fine grids, respectively. We believe that once we port the implicit-explicit (IMEX) time integrators to the GPU, we can meet simulation time limits with far fewer GPUs than the 3 million CPU threads needed to meet a 4.5 minute wall clock time limit using the CPU version of NUMA (see Müller et al. 2016 for details).

8 Validation with benchmark problems

The GPU implementation of our Euler solver is validated using a suite of benchmark problems showcasing various characteristics of atmospheric dynamics. We consider problems of different scales: cloud-resolving (micro-scale), limited-area (meso-scale) and global-scale atmospheric problems. Most of these test cases do not have analytical solutions against which comparisons can be made. For this reason, we first consider a rather simple test case of advection of a vortex by a uniform velocity, which has an analytical solution that allows us to compute the exact $L_2$ error and establish the accuracy of our numerical model. The rest of the test cases serve as a demonstration of its application to practical atmospheric simulation problems.

8.1 2D Isentropic vortex problem

We begin veriﬁcation with a simple test case that has an

exact solution to the Euler equations. The test case involves

advective transport of an inviscid isentropic vortex in free

stream ﬂow. The problem is often used to test the ability

of numerical methods to preserve ﬂow features, such as

vortices, in free stream ﬂow for long durations. However,

the problem is linear, and hence not suitable for testing the

coupling of wave motion and advective transport that are the

causes of non-linearity in the Euler equations.

The free stream conditions are

$$\rho = 1, \quad u = U_\infty, \quad v = V_\infty, \quad \theta = \theta_\infty.$$

directions on each panel, and $N_\zeta$ are the number of elements in the vertical direction.


Perturbations are added in such a way that the ﬂow is

isentropic. The initial conditions are

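$$(u', v') = \frac{\beta}{2\pi} \exp\left(\frac{1 - r^2}{2}\right) (-y + y_c,\; x - x_c)$$

$$\theta = \theta_\infty - \frac{(\gamma - 1)\beta^2}{8\gamma\pi^2} \exp\left(1 - r^2\right)$$

where

$$r = \sqrt{(x - x_c)^2 + (y - y_c)^2}.$$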

We simulate the isentropic vortex problem on a [−5m, 5m] x [−5m, 5m] x [−0.5m, 0.5m] computational domain, with $(x_c, y_c, z_c) = (0, 0, 0)$, $\beta = 5$, $U_\infty = 1$ m/s, $V_\infty = 1$ m/s and $\theta_\infty = 1$. The domain is subdivided into 22 x 22 x 2 elements with polynomial order of N = 7 in all directions for a total of about 0.5 million nodes. The simulation is run for 10s with a constant time step of $\Delta t = 0.001$s using the modified Runge-Kutta time integration scheme discussed in Sec. 4. We anticipate the vortex to move along the diagonal at a constant velocity while maintaining its shape. This is indeed what is obtained as shown in Fig. 8.

To evaluate the accuracy of the numerical model, we compute the $L_2$ norm of the error $q - q_\infty$ over the domain $\Omega$, i.e., $\|q - q_\infty\|_{L_2(\Omega)}$, for both single precision (SP) and double precision (DP) arithmetic, where $q_\infty$ is the exact solution. The DP run takes about 267s to complete while the SP run takes 161s; however, the maximum error associated with the SP calculations is much larger, as shown in Fig. 8e. Therefore, if this reduction in accuracy is acceptable for a certain application, then using single precision arithmetic on the GPU is recommended. For this particular problem, DG gives a lower maximum error than CG in both the SP and DP calculations. The $L_2$ error of density decreases with increasing polynomial order as shown in Fig. 8e; the per-second $L_2$ error also shows the same behavior, affirming that higher-order polynomials require less work per degree of freedom. N = 11 is the maximum polynomial order that we were able to run before running out of global memory on the GPU.
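A discrete version of this error norm can be sketched in Python as a weighted sum over the grid nodes. The sketch assumes that each node carries an integration weight (for example, a quadrature weight multiplied by the element Jacobian); how NUMA assembles these weights is not described here, and the function name is hypothetical.

```python
import math


def l2_error(q_num, q_exact, weights):
    """Discrete L2 norm of (q_num - q_exact) with one integration weight per node.

    The weights are assumed to approximate the integral over the domain,
    i.e. sum(weights) is the domain volume.
    """
    if not (len(q_num) == len(q_exact) == len(weights)):
        raise ValueError("all inputs must have the same length")
    total = sum(w * (a - b) ** 2
                for a, b, w in zip(q_num, q_exact, weights))
    return math.sqrt(total)


if __name__ == "__main__":
    # Two nodes on a unit-length domain, each with weight 0.5.
    print(l2_error([1.0, 1.0], [0.0, 0.0], [0.5, 0.5]))
```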

8.2 2D Rising thermal bubble

A popular benchmark problem in the study of non-hydrostatic atmospheric models is the 2D rising thermal bubble problem first proposed in Robert (1993). The test case concerns the evolution of a warm bubble in a neutrally stratified atmosphere of constant potential temperature $\theta_0$. The bubble is lighter than the surrounding air; hence, it rises while deforming due to the shear induced by the uneven distribution of temperature within the bubble. This deformation results in a mushroom-like cloud. The initial conditions for this test case are in hydrostatic balance, in which pressure decreases with height as

$$p = p_0 \left(1 - \frac{g z}{c_p \theta_0}\right)^{c_p/R}.$$

The potential temperature perturbation is given by

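$$\theta' = \begin{cases} 0 & \text{for } r > r_c \\ \dfrac{\theta_c}{2}\left(1 + \cos\left(\dfrac{\pi r}{r_c}\right)\right) & \text{for } r \le r_c \end{cases} \qquad (18)$$

where

$$r = \sqrt{(x - x_c)^2 + (z - z_c)^2}.$$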

The parameters for the problem are similar to those found in (Giraldo and Restelli 2008; Ullrich and Jablonowski 2012): a domain of size [0m, 1000m] x [0m, 100m] x [0m, 1000m], with $(x_c, z_c)$ = (500m, 350m), $r_c = 250$m, $\theta_c = 0.5$K, $\theta_0 = 300$K and an artificial viscosity of $\mu = 0.8$ m$^2$/s for stabilization. The domain is subdivided into 10 x 1 x 10 elements with polynomial order N = 6 set in all directions for a total of about 180k nodes. The grid resolution is about 25m; therefore, this problem can be considered cloud-resolving. An inviscid wall boundary condition is used on all sides.
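The background pressure and the bubble perturbation defined above can be evaluated as in the following Python sketch. The function names are hypothetical, and the physical constants ($g$, $c_p$, $R$ and the surface pressure $p_0$) are typical dry-air values assumed for illustration; the text does not list the values used in NUMA.

```python
import math

# Assumed constants for dry air (not taken from the text).
G = 9.81          # gravitational acceleration, m/s^2
CP = 1004.67      # specific heat at constant pressure, J/(kg K)
R_GAS = 287.0     # gas constant, J/(kg K)
P0 = 1.0e5        # reference surface pressure, Pa


def hydrostatic_pressure(z, theta0=300.0, p0=P0):
    """Pressure at height z for a neutrally stratified atmosphere."""
    return p0 * (1.0 - G * z / (CP * theta0)) ** (CP / R_GAS)


def bubble_perturbation(x, z, xc=500.0, zc=350.0, rc=250.0, theta_c=0.5):
    """Cosine-shaped potential temperature perturbation of Eq. (18)."""
    r = math.hypot(x - xc, z - zc)
    if r > rc:
        return 0.0
    return 0.5 * theta_c * (1.0 + math.cos(math.pi * r / rc))


if __name__ == "__main__":
    for z in (0.0, 350.0, 1000.0):
        print(z, hydrostatic_pressure(z), bubble_perturbation(500.0, z))
```

The perturbation equals $\theta_c$ at the bubble center and falls smoothly to zero at $r = r_c$, so the initial field is continuous across the edge of the bubble.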

The simulation is run for 1000s using the explicit Runge-Kutta time integration method discussed in Sec. 4 with a constant time step of $\Delta t = 0.02$s. The status of the bubble at different times is shown in Fig. 9. The results agree with those reported in (Giraldo and Restelli 2008). Most importantly, the results are identical to those obtained using the CPU version of NUMA, although the latter are not shown here. We should mention here that matching the CPU version of NUMA up to machine precision (e.g., $10^{-15}$) has been an important goal in the development of the GPU code.

8.3 2D Colliding thermal bubbles

Next, we consider the case of colliding thermal bubbles proposed in Robert (1993). The shape of the rising warm bubble is now affected by the presence of a smaller sinking cold bubble on the right-hand side. This destroys the symmetry of the rising bubble. We should note here that the rising thermal bubble problem in Sec. 8.2 could have been solved considering only half of the domain because of symmetry, which is not the case here. Also, the potential temperature perturbation $\theta'$ is specified differently for this problem. Within a certain radius $r_c$, the perturbation is a constant $\theta_c$; outside of this inner domain, it is defined by a Gaussian profile as

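$$\theta' = \begin{cases} \theta_c & \text{for } r \le r_c \\ \theta_c \exp\left[-\left((r - a)/s\right)^2\right] & \text{for } r > r_c. \end{cases}$$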

The warm bubble is centered at $(x_c, z_c)$ = (500m, 300m), with perturbation potential temperature amplitude of $\theta_c = 0.5$, radius $a = 150$m and $s = 50$m. The initial conditions for the cold bubble are: $(x_c, z_c)$ = (560m, 640m), $\mu = 0.8$ m$^2$/s, $\theta_c = 0.5$, $a = 0$m and $s = 50$m.
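The plateau-plus-Gaussian profile above can be sketched in Python as follows. The function name is hypothetical, and the sketch assumes that the inner radius $r_c$ coincides with the radius $a$ unless stated otherwise, which makes the profile continuous at $r = r_c$; the text does not state this relation explicitly.

```python
import math


def gaussian_bubble(x, z, xc, zc, theta_c, a, s, rc=None):
    """Perturbation of one bubble: constant inside rc, Gaussian decay outside.

    rc defaults to a, which is an assumption made here for continuity.
    """
    if rc is None:
        rc = a
    r = math.hypot(x - xc, z - zc)
    if r <= rc:
        return theta_c
    return theta_c * math.exp(-((r - a) / s) ** 2)


if __name__ == "__main__":
    # Warm bubble parameters quoted in the text.
    for x in (500.0, 650.0, 700.0, 800.0):
        print(x, gaussian_bubble(x, 300.0, xc=500.0, zc=300.0,
                                 theta_c=0.5, a=150.0, s=50.0))
```

The total initial perturbation would then be assembled from the contributions of the warm and cold bubbles, each evaluated with its own center and parameters.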

The result of the simulation is shown in Fig. 10, which confirms that the rising bubble indeed loses its symmetry. The edge of the rising bubble becomes sharper in some places from 600s onwards. Qualitative comparison with the results shown in (Robert 1993; Yelash et al. 2014) shows similar large-scale patterns, while small-scale patterns differ depending on the grid resolution used. Here, again, the results of the CPU NUMA code are identical to those of the GPU version.


[Figure 8 panels: density plots at (a) t=0s, (b) t=2s and (c) t=4s; (d) density (kg/m3) versus distance (m) along the diagonal at t=0s, 1s, 2s and 4s; (e) L2 error and L2 error per second of density versus polynomial order (1 to 11) for CG and DG.]

Figure 8. Isentropic vortex: Plot of density (ρ) of the vortex at different times show that the vortex, traveling at a speed of 1 m/s, reaches the expected grid locations at all times. The density distribution within the vortex is maintained as shown in plot 8d. A grid of 22x22x2 elements with 7th degree polynomial is used.

Figure 9. Potential temperature perturbation θ′ (K) contour plot for the 2D rising thermal bubble problem run with CG and an artificial viscosity of µ = 1.5 m²/s for stabilization. Results are shown at t = 0, 300, 500, 600, 700 and 900 seconds in panels (a)-(f). A grid of 10x1x10 elements with 6th degree polynomials is used.

8.4 Density current

The density current benchmark problem, first proposed in (Straka et al. 1993), concerns the evolution of a cold bubble in a neutrally stratified atmosphere of constant potential temperature θ0. The dimensions of this test case are in the range of typical mesoscale models in which hydrostatic assumptions are valid. Because the bubble is colder than the surrounding air, it sinks and hits the ground, then moves along the surface while forming shearing currents, which then generate Kelvin-Helmholtz rotors. The numerical solution of this problem using high order methods often requires the use of artificial viscosity or other methods for stabilization. We use a viscosity of µ = 75 m²/s according to (Straka et al. 1993).

Figure 10. Colliding thermal bubbles. Evolution of potential temperature perturbation θ′ (K) run with CG and an artificial viscosity of µ = 1.5 m²/s for stabilization. Results are shown at t = 0, 300, 500, 600, 700 and 900 seconds in panels (a)-(f). A grid of 10x1x10 elements with 6th degree polynomials is used.

Figure 11. Density current. Evolution of potential temperature perturbation θ′ (K) run with CG and an artificial viscosity of µ = 75 m²/s for stabilization. Results are shown at t = 0, 300, 600, 700, 800 and 900 seconds in panels (a)-(f). A grid of 128x1x32 elements with 4th degree polynomials is used for an effective resolution of 50 m in the x and z directions.

Figure 12. Propagation of an acoustic wave. The density perturbation after (a) 0 hours, (b) 4 hours and (c) 7 hours. A cubed sphere grid with 6x10x10x3 elements with 3rd degree polynomials is used.

The problem setup is similar to that of the rising thermal bubble test case with the following differences: a cold bubble with θc = −15 K in Eq. (18), a domain of Ω = [0, 25600 m] × [0, ∞] × [0, 6400 m], and an ellipsoidal bubble with radii (rx, rz) = (4000 m, 2000 m) centered at (xc, zc) = (0, 3000 m). The problem is symmetrical; therefore, we only need to simulate half of the domain. The computational domain is subdivided into 128 x 1 x 32 elements with polynomial order N = 4 in all directions. With this set of choices, the effective resolution of our model is 50 m. Inviscid wall boundary conditions are used on all sides.
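The quoted effective resolution for the density current grid can be checked with a short calculation. The sketch below assumes that "effective resolution" means the average nodal spacing, i.e. the domain length divided by the number of elements times the polynomial order. The helper name `effective_resolution` is hypothetical.

```python
def effective_resolution(length_m, n_elements, poly_order):
    """Average nodal spacing in meters (assumed definition of effective resolution)."""
    return length_m / (n_elements * poly_order)


# Density current grid: 128 x 1 x 32 elements, polynomial order N = 4.
dx = effective_resolution(25600.0, 128, 4)  # x direction
dz = effective_resolution(6400.0, 32, 4)    # z direction
print(dx, dz)  # 50.0 50.0
```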

Fig. 11 shows the evolution of the potential temperature perturbation of the bubble up to 900 seconds. The vortical structures formed at t = 900 s, namely three Kelvin-Helmholtz instability rotors, are similar to those shown in (Straka et al. 1993; Ullrich and Jablonowski 2012). The first rotor is formed near the leading edge of the density current at 300 s, then the second rotor develops at the front of the density current around 600 s. Here again the GPU code matched the results obtained using NUMA's CPU code, which has already been verified with many other atmospheric benchmark problems.

8.5 Acoustic wave

To validate the GPU implementation for global scale simulations on the sphere, we consider a test case of an acoustic wave traveling around the globe, first described in (Tomita and Satoh 2005). Several issues emerge that did not arise in the previous test cases: this test case validates 3D capabilities, curved geometry, metric terms, and a non-constant gravity vector. The initial state for this problem is hydrostatically balanced with an isothermal background potential temperature of θ0 = 300 K. A perturbation pressure P′ is superimposed on the reference pressure

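\[
P' = f(\lambda, \phi)\, g(r)
\]

where

\[
f(\lambda, \phi) =
\begin{cases}
0 & \text{for } r > r_c, \\
\dfrac{\Delta P}{2}\left(1 + \cos\left(\dfrac{\pi r}{r_c}\right)\right) & \text{for } r \le r_c,
\end{cases}
\qquad
g(r) = \sin\left(\dfrac{n_v \pi r}{r_T}\right)
\]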

where ∆P = 100 Pa, nv = 1, rc = re/3 is one third of the radius of the earth re = 6371 km, and rT = 10 km is the model altitude. The geodesic distance r is calculated as

\[
r = r_e \cos^{-1}\left[\sin\phi_0 \sin\phi + \cos\phi_0 \cos\phi \cos(\lambda - \lambda_0)\right]
\]

where (λ0, φ0) is the origin of the acoustic wave.
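The initial pressure perturbation defined above can be sketched in Python as follows. This is an illustration under stated assumptions, not the NUMA code. The argument of g is taken to be the height above the surface, in the range [0, rT]. The default wave origin (λ0, φ0) = (0, 0) is hypothetical, as are all function names.

```python
import math

R_EARTH = 6371.0e3     # radius of the earth r_e (m), from the text
R_TOP = 10.0e3         # model altitude r_T (m), from the text
DELTA_P = 100.0        # perturbation amplitude (Pa), from the text
N_V = 1                # vertical mode number n_v, from the text
R_C = R_EARTH / 3.0    # horizontal extent r_c of the perturbation (m)


def geodesic_distance(lam, phi, lam0, phi0):
    """Great-circle distance from (lam0, phi0) to (lam, phi), angles in radians."""
    c = (math.sin(phi0) * math.sin(phi)
         + math.cos(phi0) * math.cos(phi) * math.cos(lam - lam0))
    c = max(-1.0, min(1.0, c))  # guard acos against round-off
    return R_EARTH * math.acos(c)


def horizontal_profile(r):
    """f: cosine bell of radius r_c in geodesic distance r."""
    if r > R_C:
        return 0.0
    return 0.5 * DELTA_P * (1.0 + math.cos(math.pi * r / R_C))


def vertical_profile(height):
    """g: assumed to act on the height above the surface (m)."""
    return math.sin(N_V * math.pi * height / R_TOP)


def pressure_perturbation(lam, phi, height, lam0=0.0, phi0=0.0):
    """P' = f * g; the default wave origin (0, 0) is a hypothetical choice."""
    r = geodesic_distance(lam, phi, lam0, phi0)
    return horizontal_profile(r) * vertical_profile(height)


peak = pressure_perturbation(0.0, 0.0, 0.5 * R_TOP)          # at the origin, mid-height
far = pressure_perturbation(math.pi, 0.0, 0.5 * R_TOP)       # at the antipode
```

Under these assumptions the perturbation peaks at ∆P at the wave origin at mid-height and vanishes beyond the geodesic distance rc.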

The grid is a 6 × 10 × 10 × 3 cubed sphere, for a total of 1800 elements with 3rd order polynomials. No-flux boundary conditions are applied at the bottom and top surfaces. Visual comparison of the wave location at different hours, shown in Fig. 12, against the results in (Tomita and Satoh 2005) indicates that our results are quite similar to theirs as well as to those computed with the CPU version of NUMA.

The speed of sound is about a = √(γp/ρ) = 347.32 m/s with the initial conditions of the problem. With this speed, the acoustic wave should reach the antipode in about 16 hours. The simulation indicates that the acoustic wave has traveled 20.01 million meters within this time, which gives an average sound speed of 347.55 m/s, close to the calculated sound speed (a relative error of less than 1%).
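The arithmetic behind these figures can be reproduced with a few lines of Python. The sketch uses only the values quoted in the text and assumes that the distance to the antipode is half of a great circle, π re.

```python
import math

R_EARTH = 6371.0e3   # radius of the earth (m), from the text
a_theory = 347.32    # calculated speed of sound (m/s), from the text
a_sim = 347.55       # average speed inferred from the simulation (m/s), from the text

# Assumption: the wave reaches the antipode after half a great circle.
antipode_distance = math.pi * R_EARTH               # about 20.0 million meters
travel_hours = antipode_distance / a_theory / 3600.0

relative_error = abs(a_sim - a_theory) / a_theory

print(f"distance to antipode: {antipode_distance / 1e6:.2f} million m")
print(f"travel time: {travel_hours:.2f} h")
print(f"relative error: {100.0 * relative_error:.3f} %")
```

The travel time comes out close to 16 hours, and the relative difference between the two sound speeds is well below the 1% bound quoted above.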

9 Conclusions

In this work, we have ported the Non-hydrostatic Unified Model of the Atmosphere (NUMA) to the GPU and demonstrated speedups of two orders of magnitude relative to a single CPU core. Tests on one node of the Titan supercomputer, consisting of a K20X GPU and a 16-core AMD CPU, yielded speedups of up to 15x and 11x for the GPU relative to the CPU using single and double precision arithmetic, respectively. This performance is achieved by exploiting the specialized GPU hardware using suitable algorithms and optimizing kernels for performance.

NUMA solves the Euler equations using a unified continuous and discontinuous Galerkin approach for spatial discretization and various implicit and explicit time integration schemes. GPU kernels are written for different components of the dynamical core, namely, the volume integration kernel, surface integration kernel, (explicit) time update kernel, kernels for stabilization, etc. We use algorithms suitable for the Single Instruction Multiple Thread (SIMT) architecture of GPUs to maximize bandwidth usage and the rate of floating point operations (FLOPS) of the kernels. Some of the kernels, for instance the volume integration kernel, turned out to be dominated by floating point operations, while others, such as the explicit time integration kernel, are dominated by bandwidth usage. Optimization of kernels should be geared towards achieving the maximum attainable efficiency as bounded by the roofline model.
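As a reminder of what the roofline bound expresses, the sketch below computes the attainable performance of a kernel as the minimum of the peak compute rate and the product of arithmetic intensity and peak memory bandwidth (Williams et al. 2009). The hardware numbers are hypothetical round values chosen for illustration only; they are not measurements of the K20X, and the function name `roofline_bound` is likewise an assumption.

```python
def roofline_bound(arithmetic_intensity, peak_gflops, peak_bandwidth_gbs):
    """Attainable GFLOP/s for a kernel with the given FLOPs-per-byte ratio."""
    return min(peak_gflops, arithmetic_intensity * peak_bandwidth_gbs)


# Hypothetical device: 1000 GFLOP/s peak compute, 200 GB/s peak bandwidth.
PEAK_GFLOPS = 1000.0
PEAK_BANDWIDTH_GBS = 200.0

# A low-intensity kernel (e.g. a time update) is limited by bandwidth.
bandwidth_bound = roofline_bound(0.5, PEAK_GFLOPS, PEAK_BANDWIDTH_GBS)

# A high-intensity kernel (e.g. a volume integration) is limited by compute.
compute_bound = roofline_bound(10.0, PEAK_GFLOPS, PEAK_BANDWIDTH_GBS)
```

In this picture, a bandwidth-dominated kernel can only approach its bound by reducing memory traffic, whereas a compute-dominated kernel benefits from reducing or better scheduling its floating point work.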


We have also implemented a multi-GPU version of NUMA using the existing MPI infrastructure for multi-core CPUs (Kelly and Giraldo 2012). Communication between GPUs is done via CPUs by first copying the inter-processor data from the GPU to the CPU. For the discontinuous Galerkin (DG) implementation, we overlap communication and computation to hide the latency of data copying from the GPU and of communication between CPUs. We then tested the scalability of our multi-GPU implementation using 16384 GPUs of the Titan supercomputer (the third fastest supercomputer in the world as of June 2016). We obtained a weak scaling efficiency of about 90%, which increases with bigger problem size. The CG and DG methods that do not overlap communication with computation performed about 20% less efficiently, thereby highlighting the value of this approach.

For portability to heterogeneous computing environments, we used a novel programming language called OCCA, which can be cross-compiled to OpenCL, CUDA or OpenMP at runtime. Finally, the accuracy and performance of our GPU implementations are verified using several benchmark problems representative of different scales of atmospheric dynamics.

In the current work, we ported only the explicit time

integration modules to the GPU. However, operational

NWP often requires use of implicit-explicit (IMEX) time

integration to counter the limitation imposed by the Courant

number. In the future, we plan to port the IMEX time

integration modules which require solving a system of

equations at each time step.

10 Acknowledgement

This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725. The authors gratefully acknowledge support from the Office of Naval Research through PE-0602435N.

References

Abdi DS and Giraldo FX (2016) Efﬁcient construction of uniﬁed

continuous and discontinuous galerkin formulations for the 3d

euler equations. Journal of Computational Physics 320: 46 –

68. DOI:http://dx.doi.org/10.1016/j.jcp.2016.05.033.

Allard J, Courtecuisse H and Faure F (2011) Implicit FEM Solver on GPU for Interactive Deformation Simulation. In: Hwu WW (ed.) GPU Computing Gems Jade Edition, Applications of GPU Computing Series. Elsevier, pp. 281–294. DOI:10.1016/B978-0-12-385963-1.00021-6.

Bassi F and Rebay S (1997) A high-order accurate discontinuous finite element method for the numerical solution of the compressible navier-stokes equations. Journal of Computational Physics 131(2): 267 – 279. DOI:http://dx.doi.org/10.1006/jcph.1996.5572.

Burstedde C, Wilcox LC and Ghattas O (2011) p4est: Scalable

algorithms for parallel adaptive mesh reﬁnement on forests of

octrees. SIAM Journal on Scientiﬁc Computing 33(3): 1103–

1133. DOI:10.1137/100791634.

Carpenter M and Kennedy C (1994) Fourth-order 2N-storage

Runge-Kutta schemes. NASA technical memorandum 109112 :

1 – 24.

Chan J, Wang Z, Modave A, Remacle J and Warburton T (2015)

GPU-accelerated discontinuous galerkin methods on hybrid

meshes. arXiv:1507.02557 .

Chan J and Warburton T (2015) GPU-accelerated bernstein-

bezier discontinuous galerkin methods for wave problems.

arXiv:1512.06025 .

Cockburn B and Shu C (1998) The Runge-Kutta discontinuous

Galerkin method for conservation laws V: multidimensional

systems. J. Comput. Phys. 141: 199 – 224.

Deville M, Fischer P and Mund E (2002) High-Order Methods for

Incompressible Fluid Flow. Cambridge University Press.

Fuhry M, Giuliani A and Krivodonova L (2014) Discontinuous

Galerkin methods on graphics processing units for nonlinear

hyperbolic conservation laws. Numerical Methods in Fluids

76: 982 – 1003.

Gandham R, Medina D and Warburton T (2014) GPU accelerated

discontinuous galerkin methods for shallow water equations.

arXiv:1403.1661 .

Giraldo FX (1998) The Lagrange-Galerkin spectral element method

on unstructured quadrilateral grids. Journal of Computational

Physics 147(1): 114–146.

Giraldo FX and Restelli M (2008) A study of spectral element and

discontinuous galerkin methods for the navier-stokes equations

in nonhydrostatic mesoscale atmospheric modeling: Equation

sets and test cases. J. Comput. Phys. 227: 3849 – 3877.

Giraldo FX and Rosmond TE (2004) A scalable spectral element

eulerian atmospheric model (SEE-AM) for NWP: Dynamical

core tests. Monthly Weather Review 132(1): 133–153.

Goddeke D, Strzodka R and Turek S (2005) Accelerating double

precision FEM simulations with GPUs. In: Proceedings of

ASIM. pp. 1 – 21.

Kelly JF and Giraldo FX (2012) Continuous and discontinuous galerkin methods for a scalable three-dimensional nonhydrostatic atmospheric model: limited area mode. J. Comput. Phys. 231: 7988 – 8008.

Klöckner A and Warburton T (2013) A loop generation tool for CPUs and GPUs, part i: Data models, algorithms, and heuristics.

Klöckner A, Warburton T, Bridge J and Hesthaven J (2009) Nodal discontinuous galerkin methods on graphics processors. Journal of Computational Physics 228(21): 7863 – 7882. DOI:http://dx.doi.org/10.1016/j.jcp.2009.06.041.

Marras S, Kelly JF, Moragues M, Müller A, Kopera MA, Vázquez M, Giraldo FX, Houzeaux G and Jorba O (2015) A review of element-based galerkin methods for numerical weather prediction: Finite elements, spectral elements, and discontinuous galerkin. Archives of Computational Methods in Engineering : 1–50. DOI:10.1007/s11831-015-9152-1.

Medina D, St-Cyr A and Warburton T (2014) OCCA: A unified approach to multi-threading languages. arXiv:1403.0968 .

Micikevicius P (2009) 3d finite difference computation on GPUs using cuda. In: Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units, GPGPU-2. New York, NY, USA: ACM. ISBN 978-1-60558-517-8, pp. 79–84. DOI:10.1145/1513895.1513905.


Modave A, St-Cyr A and Warburton T (2015) GPU performance analysis of a nodal discontinuous galerkin method for acoustic and elastic models. arXiv:1602.07997 .

Müller A, Kopera M, Marras S, Wilcox LC, Isaac T and Giraldo FX (2016) Strong scaling for numerical weather prediction at petascale with the atmospheric model numa. Submitted to: 30th IEEE International Parallel and Distributed Processing Symposium.

Nair RD, Levy MN and Lauritzen PH (2011) Emerging numerical methods for atmospheric modeling. In: Lauritzen PH, Jablonowski C, Taylor MA and Nair RD (eds.) Numerical methods for global atmospheric models, Lecture notes in computational science and engineering, volume 80. Springer, pp. 251 – 311.

Norman M, Larkin J, Vose A and Evans K (2015) A case study of

CUDA FORTRAN and OpenACC for an atmospheric climate

kernel. Journal of Computational Science 9: 1 – 6. DOI:

http://dx.doi.org/10.1016/j.jocs.2015.04.022. Computational

Science at the Gates of Nature.

Remacle J, Gandham R and Warburton T (2015) GPU accelerated

spectral ﬁnite elements on all-hex meshes. arXiv:1506.05996 .

Robert A (1993) Bubble convection experiments with a semi-

implicit formulation of the Euler equations. J. Atmos. Sci.

50(13): 1865–1873.

Sawyer W (2014) An overview of GPU-enabled atmospheric

models. In: ENES Workshop on Exascale Technologies and

Innovation in HPC for Climate Models.

Siebenborn M, Schulz V and Schmidt S (2012) A curved-element

unstructured discontinuous galerkin method on GPUs for the

euler equations. Comput. and Vis. in Sc. 15: 61 – 73.

Straka J, Wilhelmson R, Wicker L, Anderson J and Droegemeier K (1993) Numerical solutions of a nonlinear density current: A benchmark solution and comparison. International J. Num. Methods. Fl. 17: 1 – 22.

Tomita H and Satoh M (2005) A new dynamical framework of

non hydrostatic global model using the icosahedral grid. Fluid

Dynamics Research 34: 357 – 400.

Ullrich P and Jablonowski C (2012) Operator-split runge-kutta-

rosenbrock methods for nonhydrostatic atmospheric models.

Monthly Weather Review 140: 1257 – 1284.

Williams S, Waterman A and Patterson D (2009) Roofline: An insightful visual performance model for multicore architectures. Commun. ACM 52(4): 65–76. DOI:10.1145/1498765.1498785.

Yelash L, Müller A, Lukáčová-Medvid'ová M, Giraldo FX and Wirth V (2014) Adaptive discontinuous evolution galerkin method for dry atmospheric flow. Journal of Computational Physics 268: 106 – 133. DOI:http://dx.doi.org/10.1016/j.jcp.2014.02.034.