GALÆXI: Solving complex compressible flows with high-order discontinuous Galerkin
methods on accelerator-based systems
Daniel Kempf (a,1,*), Marius Kurz (a,1), Marcel Blind (a), Patrick Kopper (a), Philipp Offenhäuser (b), Anna Schwarz (a),
Spencer Starr (a), Jens Keim (a), Andrea Beck (a)
(a) Institute of Aerodynamics and Gas Dynamics, University of Stuttgart, Pfaffenwaldring 21, 70569 Stuttgart, Germany
(b) Hewlett Packard Enterprise (HPE), Herrenberger Straße 140, 71034 Böblingen, Germany
Abstract
This work presents GALÆXI as a novel, energy-efficient flow solver for the simulation of compressible flows on unstructured
meshes leveraging the parallel computing power of modern Graphics Processing Units (GPUs). GALÆXI implements the high-
order Discontinuous Galerkin Spectral Element Method (DGSEM) using shock capturing with a finite-volume subcell approach to
ensure the stability of the high-order scheme near shocks. This work provides details on the general code design, the parallelization
strategy, and the implementation approach for the compute kernels with a focus on the element local mappings between volume and
surface data due to the unstructured mesh. The scheme is implemented using a pure distributed memory parallelization based on a
domain decomposition, where each GPU handles a distinct region of the computational domain. On each GPU, the computations
are assigned to different compute streams, which allows the computation of quantities required for communication to be scheduled with priority,
while local computations from other streams hide the communication latency. This parallelization strategy allows for
maximizing the use of available computational resources. This results in excellent strong scaling properties of GALÆXI up to
1024 GPUs if each GPU is assigned a minimum of one million degrees of freedom. To verify its implementation, a convergence
study is performed that recovers the theoretical order of convergence of the implemented numerical schemes. Moreover, the solver
is validated using both the incompressible and compressible formulations of the Taylor–Green-Vortex at Mach numbers of 0.1 and
1.25, respectively. A mesh convergence study shows that the results converge to the high-fidelity reference solution and that the
results match the original CPU implementation. Finally, GALÆXI is applied to a large-scale wall-resolved large eddy simulation
of a linear cascade of the NASA Rotor 37. Here, the supersonic region and shocks at the leading edge are captured accurately and
robustly by the implemented shock-capturing approach. It is demonstrated that GALÆXI requires less than half of the energy to
carry out this simulation compared to the reference CPU implementation. This renders GALÆXI a potent tool for accurate
and efficient simulations of compressible flows in the realm of exascale computing and the associated new HPC architectures.
Keywords: Discontinuous Galerkin, High-Performance Computing, GPUs, Accelerators, Turbulence, Compressible Flow
1. Introduction
The computational sciences have become an essential driver
for understanding the dynamics of complex, nonlinear systems
ranging from the dynamics of earth’s climate [1] to obtaining
information about a patient’s characteristic blood flow to derive
personalized approaches in medical therapy [2]. While these
successes also rely on significant breakthroughs in the devel-
opment of numerical methods and physical models, a major
∗Corresponding author
Email addresses: daniel.kempf@iag.uni-stuttgart.de (Daniel
Kempf), marius.kurz@iag.uni-stuttgart.de (Marius Kurz),
marcel.blind@iag.uni-stuttgart.de (Marcel Blind),
patrick.kopper@iag.uni-stuttgart.de (Patrick Kopper),
philipp.offenhaeuser@hpe.com (Philipp Offenhäuser),
anna.schwarz@iag.uni-stuttgart.de (Anna Schwarz),
spencer.starr@iag.uni-stuttgart.de (Spencer Starr),
jens.keim@iag.uni-stuttgart.de (Jens Keim),
andrea.beck@iag.uni-stuttgart.de (Andrea Beck)
1 D. Kempf and M. Kurz contributed equally and cordially agree to share
first authorship.
portion can be ascribed to the exponential increase in avail-
able computing power, which has allowed simulating increas-
ingly large and complex problems over the last decades. How-
ever, the corresponding process of shrinking transistors from
generation to generation has become increasingly challenging
and the resulting gains in performance have diminished in re-
cent years [3]. As a consequence, the community is moving
towards accelerator chips, which do not serve as one-size-fits-
all hardware like traditional CPUs. These chips are specialized
to yield better performance and efficiency for specific tasks,
such as workloads in artificial intelligence, video encoding or
cryptography. This is also evident in the field of high-performance
computing (HPC), where nine out of the ten fastest supercom-
puters listed in the most recent TOP500 [4] from November
2023 employ some form of accelerator. In the most recent
GREEN500 [5] list, which focuses on sustainability in terms
of energy invested per computation, all of the ten most efficient
HPC systems employ GPU accelerators.
However, such accelerators generally differ considerably from
general-purpose CPUs in terms of hardware design, as well as
the working principle. As a consequence, using accelerators of-
tentimes not only requires rewriting and redesigning large por-
tions of existing code to make efficient use of such hardware,
but also might change which numerical algorithms are most ef-
ficient for a specific task. This poses significant challenges for
legacy HPC codes due to the considerable effort required to mi-
grate the existing codebase to hardware accelerators. This is-
sue is particularly pervasive in the field of Computational Fluid
Dynamics (CFD), where scale-resolving simulations of turbu-
lent flows generally require significant HPC resources. Here,
modern high-order discretization methods such as Discontinu-
ous Galerkin (DG) and Flux Reconstruction (FR) schemes have
become popular due to their computational efficiency for such
multi-scale problems and their excellent scaling properties on
HPC systems.
The need to adapt existing and established code bases to the
new HPC architectures has already been considered in the CFD
community. One of the most established high-order codes for
incompressible and weakly compressible flow is Nek5000 [6].
The first effort to port Nek5000 to accelerators was reported in
2015 [7], where a barebones version of the code was adapted
for GPUs using the OpenACC library. Full GPU support was
then offered by its successor NekRS [8], which is based on the
Open Concurrent Computing Abstraction (OCCA) [9]. Sim-
ilarly, Neko [10] was implemented from scratch using mod-
ern, object-oriented Fortran and abstraction layers to support
multiple hardware backends. While the previous codes focus
mainly on incompressible and weakly compressible flows, PyFR
[11] also solves the compressible Navier–Stokes equations (NSE)
on unstructured meshes using the FR approach. Moreover, it is
written in Python and relies on code generation to support mul-
tiple computing backends including accelerators and provides
excellent scaling properties on HPC systems. Similarly, the
deal.II [12] and MFEM [13] libraries provide a DG discretiza-
tion to solve the compressible NSE and both have added GPU
support in recent years.
In this work, we present GALÆXI2 as a GPU-accelerated
solver for hyperbolic-parabolic conservation laws with special
emphasis on compressible flows. The numerical simulation of
compressible flows is highly relevant for a large number of
problems, e.g. from the aviation industry or aeroacoustics. GA-
LÆXI builds on the well-established FLEXI solver [14] and in-
herits the majority of its features and its extensive pre- and post-
processing suite that is designed for large-scale applications.
Hence, GALÆXI implements multiple flavors of the Discon-
tinuous Galerkin Spectral Element Method (DGSEM) and can
handle fully unstructured, curved, high-order meshes to account
for complex geometries. Moreover, multiple stabilization tech-
niques are implemented to ensure the stability of the scheme in
underresolved simulations and in the vicinity of shocks using
shock capturing schemes based on localized finite-volume (FV)
subcell approaches. The user interface of GALÆXI is deliber-
ately kept compatible with FLEXI, such that GALÆXI can serve
as a drop-in replacement to run existing simulation setups on
GPU systems without modifications.
2 https://github.com/flexi-framework/galaexi/
This work contributes the following aspects to the chal-
lenging but necessary steps for the transition to exascale HPC
architectures in CFD. It provides insights into the suitability
of DGSEM for GPU acceleration and quantifies the gains in
performance and efficiency that can be expected for explicit,
high-order DG methods when moving from traditional CPUs
to GPUs. It also provides practical guidelines on how existing
codebases can be ported to GPU hardware and proposes par-
allelization concepts for achieving parallel efficiency on HPC
hardware with high-order schemes. Furthermore, the savings in
the context of energy-to-solution are discussed in particular and
can serve as a point of reference in terms of energy efficiency.
This work is organized as follows. First, Section 2 intro-
duces the governing equations and the numerical methods im-
plemented in GALÆXI. Based on this, Section 3 provides de-
tails on the parallelization strategy and the implementation of
the compute kernels. The resulting performance and scaling
abilities of GALÆXI are presented and discussed in Section 4.
The implementation of the numerical scheme is verified in Sec-
tion 5, demonstrating the theoretical convergence rates of the
numerical methods and accurate results for the incompressible
and compressible formulations of the Taylor–Green-Vortex (TGV)
test case. To demonstrate the applicability to applications of
relevance, GALÆXI is employed in Section 6 to compute a
large-scale, wall-resolved LES of the NASA Rotor 37 [15] test
case. Section 7 summarizes the major results of the paper and
provides an outlook on further developments.
2. Numerical Methods
GALÆXI is implemented as a general solution framework
for hyperbolic-parabolic conservation equations, similar to the
FLEXI framework, but exhibits a particular focus on the com-
pressible Navier–Stokes equations (NSE), which are introduced
in Section 2.1. The high-order DGSEM will be introduced in
Section 2.2, followed by the compatible sub-cell shock captur-
ing scheme in Section 2.3.
2.1. Governing Equations
GALÆXI is used to solve the compressible NSE, which describe the evolution of the conserved
variables U(x,t) = (ρ, ρu, ρe)^T, comprised of the density, momentum, and energy density,
respectively, at each position in space x and time t. The NSE can be derived by enforcing the
conservation of mass, momentum, and energy across an infinitesimal control volume. This yields
the evolution equations of the conserved variables in differential form as
\[ \frac{\partial \rho}{\partial t} + \nabla \cdot (\rho \mathbf{u}) = 0, \qquad (1) \]
\[ \frac{\partial (\rho \mathbf{u})}{\partial t} + \nabla \cdot \left( \rho \mathbf{u} \otimes \mathbf{u} + p \mathbf{I} - \boldsymbol{\tau} \right) = \mathbf{0}, \qquad (2) \]
\[ \frac{\partial (\rho e)}{\partial t} + \nabla \cdot \left( \mathbf{u} (\rho e + p) - \boldsymbol{\tau} \cdot \mathbf{u} + \mathbf{q} \right) = 0, \qquad (3) \]
where p denotes the static pressure, I the identity matrix, and 0 the zero vector. Assuming a
Newtonian fluid and Fourier's law of thermal conduction yields the stress tensor τ and heat flux q as
\[ \boldsymbol{\tau} = \mu \left( \nabla \mathbf{u} + \nabla \mathbf{u}^T - \frac{2}{3} (\nabla \cdot \mathbf{u}) \mathbf{I} \right), \qquad (4) \]
\[ \mathbf{q} = -\lambda \nabla T. \qquad (5) \]
Here, µ denotes the dynamic viscosity of the fluid and λ denotes its heat conductivity. Both are
material properties of the specific fluid and depend in the general case on the fluid's local
state. Hence, both quantities cannot be considered constant in the general case. In this work, we
assume the viscosity to follow Sutherland's law [16], which postulates a dependency of the
viscosity on the temperature of the form
\[ \mu(T) = \mu_{\mathrm{ref}} \, \frac{1.4042 \, (T/T_{\mathrm{ref}})^{3/2}}{T/T_{\mathrm{ref}} + 0.4042}, \qquad (6) \]
where µ_ref is the viscosity at the reference temperature T_ref. Based on this, the thermal
conductivity can be computed as
\[ \lambda = \frac{\gamma R}{\gamma - 1} \, \frac{\mu}{\mathrm{Pr}}, \qquad (7) \]
with γ as the ratio of specific heats, R denoting the specific gas constant, and Pr as the
dimensionless Prandtl number, which is assumed in the following to be constant with Pr = 0.71.
Lastly, the equation-of-state (EOS) closes the NSE by providing a relationship between the
conserved variables and the pressure. For a perfect gas, this can be written as
\[ p = (\gamma - 1) \left( \rho e - \frac{\rho}{2} \, \mathbf{u} \cdot \mathbf{u} \right), \qquad (8) \]
or
\[ T = \frac{p}{\rho R}. \qquad (9) \]
Equations (8) and (9) thus allow the computation of the primitive, i.e. non-conserved, variables
U_prim = (ρ, u, p, T)^T from the state U.
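For illustration, the pointwise evaluation of Eqs. (6) to (9) can be sketched as follows, written
here in CUDA C++ (GALÆXI itself is implemented in Fortran with CUDA Fortran device code). The
function names, interfaces, and the hard-coded gas properties are assumptions made for this sketch
and do not represent the actual GALÆXI routines.

#include <cmath>

// Assumed perfect-gas and transport properties for this sketch (air-like values).
constexpr double GAMMA   = 1.4;       // ratio of specific heats
constexpr double RGAS    = 287.058;   // specific gas constant [J/(kg K)]
constexpr double PRANDTL = 0.71;      // Prandtl number
constexpr double MU_REF  = 1.716e-5;  // reference viscosity [Pa s] (assumed)
constexpr double T_REF   = 273.15;    // reference temperature [K] (assumed)

// Conserved state (rho, rho*u, rho*v, rho*w, rho*e) -> primitive state
// (rho, u, v, w, p, T), following Eqs. (8) and (9).
__host__ __device__ inline void cons_to_prim_point(const double U[5], double prim[6]) {
    const double rho = U[0];
    const double u = U[1] / rho, v = U[2] / rho, w = U[3] / rho;
    const double p = (GAMMA - 1.0) * (U[4] - 0.5 * rho * (u * u + v * v + w * w)); // Eq. (8)
    prim[0] = rho; prim[1] = u; prim[2] = v; prim[3] = w;
    prim[4] = p;
    prim[5] = p / (rho * RGAS);                                                    // Eq. (9)
}

// Dynamic viscosity via Sutherland's law, Eq. (6), and heat conductivity, Eq. (7).
__host__ __device__ inline double viscosity_sutherland(double T) {
    const double Ts = T / T_REF;
    return MU_REF * 1.4042 * Ts * sqrt(Ts) / (Ts + 0.4042);
}
__host__ __device__ inline double heat_conductivity(double mu) {
    return GAMMA * RGAS / (GAMMA - 1.0) * mu / PRANDTL;
}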
2.2. Discontinuous Galerkin Spectral Element Method
In the following section, the DGSEM will be derived for the compressible NSE, which can be written
in flux formulation as
\[ \frac{\partial U}{\partial t} + \nabla_x \cdot \mathbf{F}(U, \nabla_x U) = 0, \qquad (10) \]
where F(U, ∇_x U) encapsulates both the convective and viscous fluxes. Each of the main
construction steps of the DGSEM will be discussed. However, a more in-depth derivation of the
DGSEM and its implementation is provided by Krais et al. [14].
Mapping the Equations
For the DGSEM, the domain Ω is subdivided into a set of non-overlapping, curvilinear, hexahedral
elements. Each physical element is then mapped from the physical space x = (x, y, z)^T to the
reference element E ∈ [−1, 1]^3 in computational space ξ = (ξ, η, ζ)^T using a transfinite
polynomial mapping ξ = χ(x).
Figure 1: Perspective sketch of a single DG element in the reference space using Legendre-Gauss
interpolation points with N = 2. Gray cubes indicate the interpolation points within the element,
while the gray squares indicate interpolation points on the six local faces called ξ±, η±, ζ±. The
linewise operations of the tensor product ansatz are indicated for the center interpolation point,
where the operations along the coordinates ξ = (ξ, η, ζ) are highlighted in blue, red, and green,
respectively.
The reference element is shown for N = 2 in Fig. 1. The Jacobian J of this mapping follows as the
determinant of the Jacobian matrix ∇_ξ χ, where ∇_ξ denotes the del operator in the computational
coordinates. The transformation of the governing equations into the computational space requires
the contravariant basis vectors Ja^i, with i = 1, 2, 3, which follow in the curl form as
\[ Ja^i_n = -\hat{x}_i \cdot \nabla_\xi \times \left( x_l \nabla_\xi x_m \right), \qquad (n, m, l)\ \text{cyclic}, \qquad (11) \]
where x̂_i is the unit vector in the i-th Cartesian direction. Using the basis vectors and the
Jacobian, the transformed equations in the reference element follow as
\[ J \frac{\partial U}{\partial t} + \nabla_\xi \cdot \boldsymbol{\mathcal{F}} = 0, \qquad (12) \]
where \mathcal{F}^i denotes the contravariant fluxes given by
\[ \mathcal{F}^i = Ja^i \cdot \mathbf{F}. \qquad (13) \]
To construct the DGSEM, Eq. (12) is formulated in the weak
form, which will be derived in the next paragraph.
Weak formulation
To derive its weak form, Eq. (12) is projected onto a set of
test functions ψ(ξ), spanning a polynomial subspace, using the
inner product, which yields
\[ \int_E J \frac{\partial U}{\partial t} \psi \, d\xi
   + \underbrace{\oint_{\partial E} \psi \left( \mathcal{F} \cdot N \right)^{*} dS}_{\text{Surface Integral}}
   - \underbrace{\int_E \mathcal{F} \cdot \nabla_\xi \psi \, d\xi}_{\text{Volume Integral}} = 0. \qquad (14) \]
Here, the surface integral incorporates the contribution of the
fluxes across the element faces while the volume integral con-
siders only the degrees of freedom within the element. Be-
cause adjacent elements share a common face and no conti-
nuity across elements has been imposed, the solution is gen-
erally discontinuous across the element faces. Consequently,
the solution and hence the fluxes on the element faces are non-
unique. Therefore, numerical flux functions are used to com-
pute a unique numerical flux across element boundaries, which
is denoted by the asterisk (·)∗.
Solution Representation
Within each element, the solution is represented by high-order Lagrange polynomials. The j-th
one-dimensional Lagrange polynomial of degree N is defined as
\[ \ell^N_j(x) = \prod_{\substack{i=0 \\ i \neq j}}^{N} \frac{x_i - x}{x_i - x_j}, \qquad (15) \]
with respect to a set of interpolation points \{x_j\}_{j=0}^{N}. In practice, either Legendre-Gauss
(GL) or Legendre-Gauss-Lobatto (LGL) nodes are used as interpolation points. The superscript is
subsequently dropped to keep the notation concise. Lagrange polynomials fulfill the Kronecker delta
property given by
\[ \ell_i(x_j) = \begin{cases} 1, & \text{if } i = j, \\ 0, & \text{if } i \neq j. \end{cases} \qquad (16) \]
A tensor-product ansatz is used to construct a three-dimensional basis of the polynomial subspace
P^N from the one-dimensional Lagrange polynomials. This yields the approximation of the solution
in the computational space as
\[ U(\boldsymbol{\xi}, t) \approx \sum_{i,j,k=0}^{N} \hat{U}_{ijk}(t) \, \ell_i(\xi) \, \ell_j(\eta) \, \ell_k(\zeta). \qquad (17) \]
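As a minimal illustration of Eq. (15), the following sketch evaluates all one-dimensional Lagrange
basis polynomials at an arbitrary point; the function name and interface are chosen for this
example only and are not part of GALÆXI.

#include <vector>

// Evaluate the N+1 Lagrange basis polynomials l_j(x), defined by the interpolation
// points `nodes` as in Eq. (15), at the evaluation point `x`.
std::vector<double> lagrange_basis(const std::vector<double>& nodes, double x) {
    const int np = static_cast<int>(nodes.size());   // N+1 interpolation points
    std::vector<double> l(np, 1.0);
    for (int j = 0; j < np; ++j)
        for (int i = 0; i < np; ++i)
            if (i != j)
                l[j] *= (nodes[i] - x) / (nodes[i] - nodes[j]);
    return l;
}

The tensor-product evaluation in Eq. (17) then reduces to weighting the nodal coefficients with the
products ℓ_i(ξ) ℓ_j(η) ℓ_k(ζ) of three such one-dimensional evaluations.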
Semi-discrete form
Evaluating the integrals using the Gauss-type quadrature associated with the chosen set of
interpolation points, i.e. collocation of interpolation and integration points, yields the
semi-discrete form of the DG operator that can be written for each point i, j, k ∈ [0, N] as
\[ \frac{\partial \hat{U}_{ijk}}{\partial t} =
   \overbrace{-\frac{1}{J_{ijk}}}^{\texttt{ApplyJac}} \Bigg[
   \sum_{\alpha=0}^{N} \mathcal{F}^{1}_{\alpha jk} \hat{D}_{i\alpha}
   + \overbrace{f^{*} \hat{s}^{\xi^{+}}_{jk}}^{\texttt{FillFlux}} \hat{\ell}^{+}_{i}
   + f^{*} \hat{s}^{\xi^{-}}_{jk} \hat{\ell}^{-}_{i}
   + \sum_{\beta=0}^{N} \mathcal{F}^{2}_{i\beta k} \hat{D}_{j\beta}
   + f^{*} \hat{s}^{\eta^{+}}_{ik} \hat{\ell}^{+}_{j}
   + f^{*} \hat{s}^{\eta^{-}}_{ik} \hat{\ell}^{-}_{j}
   + \underbrace{\sum_{\gamma=0}^{N} \mathcal{F}^{3}_{ij\gamma} \hat{D}_{k\gamma}}_{\texttt{VolInt}}
   + \underbrace{f^{*} \hat{s}^{\zeta^{+}}_{ij} \hat{\ell}^{+}_{k}
   + f^{*} \hat{s}^{\zeta^{-}}_{ij} \hat{\ell}^{-}_{k}}_{\texttt{SurfInt}}
   \Bigg]. \qquad (18) \]
This notation follows Kopriva [17], where
\[ \hat{\ell}^{\pm}_{i} = \frac{\ell_i(\pm 1)}{\omega_i}
   \quad \text{and} \quad
   \hat{D}_{ij} = -\frac{\omega_j}{\omega_i} \left. \frac{d\ell_i(\xi)}{d\xi} \right|_{\xi = \xi_j} \qquad (19) \]
are one-dimensional building blocks that entail the numerical quadrature weights ω_i and are
precomputed during initialization to improve the overall performance of the implementation.
Moreover, f* = f*(Ũ^L, Ũ^R) denotes the unique flux at the faces based on the solution on the
surface of the left and right element, respectively, and ŝ denotes the surface element, which is
the norm of the non-normalized physical unit vector as discussed in more detail by Krais et al.
[14]. The monospaced namings in Eq. (18) refer to the routines in the numerical implementation
which are summarized in Table 1.
At this point, we would like to briefly discuss the influence
of the unstructured mesh topology. First of all, the unstructured
neighbor relations only influence the surface-related operations
and only direct neighbors are considered. Here, the relative ori-
entation between the adjacent elements and their sides must be
taken into account. This is taken into account by corresponding
mappings in the ProlongToFace and SurfInt routines.
Time integration
The semi-discrete form Eq. (18) is integrated in time using
an appropriate integration scheme. GALÆXI offers a variety of
different explicit Runge–Kutta-type schemes in a low-storage
formulation to reduce the memory consumption. In the fol-
lowing, a fourth-order Runge–Kutta scheme with 5 stages [18]
is used for the validation and verification results in Section 5,
while a scheme with 14 stages [19] is used for the large-scale
application in Section 6. The latter scheme is chosen since it
exhibits an optimized stability region for convection-dominated
problems and allows for larger time steps.
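To make the low-storage idea concrete, the following sketch shows a generic two-register
(2N-storage) Runge-Kutta update loop. The coefficient arrays are placeholders that would be filled
with the values of the chosen scheme, e.g. from [18] or [19], and the operator call stands in for
the evaluation of the DG residual in Eq. (18); all names are illustrative.

#include <cstddef>
#include <vector>

// Generic 2N-storage Runge-Kutta step (Williamson form): besides the solution U,
// only a single register k is kept in memory, independent of the number of stages.
void rk_step_low_storage(std::vector<double>& U, std::vector<double>& k,
                         const std::vector<double>& A_RK,   // stage coefficients, A_RK[0] = 0
                         const std::vector<double>& B_RK,   // update coefficients
                         double dt,
                         void (*dg_operator)(const std::vector<double>&, std::vector<double>&)) {
    std::vector<double> Ut(U.size());
    for (std::size_t s = 0; s < A_RK.size(); ++s) {
        dg_operator(U, Ut);                        // Ut <- R(U), the spatial DG operator
        for (std::size_t i = 0; i < U.size(); ++i) {
            k[i] = A_RK[s] * k[i] + dt * Ut[i];    // accumulate into the single register
            U[i] += B_RK[s] * k[i];                // advance the solution
        }
    }
}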
Nonlinear Stability
The semi-discrete form of the DG discretization in Eq. (18)
is derived by means of numerical quadrature rules. However,
since the integrands, i.e. the fluxes of the compressible NSE,
are non-polynomial, they cannot be integrated exactly by the
applied quadrature rules. The resulting integration errors mani-
fest as aliasing that can cause simulations to crash, especially in
the underresolved regime. For DG, multiple mitigation strate-
gies have been devised ranging from overintegration [20, 21],
also referred to as polynomial dealiasing, to filtering procedures
that strive to counteract the accumulation of energy in the high-
est solution modes [22, 23]. In this work, we rely on the split-
flux formulation introduced by Gassner et al. [24] to construct
a nonlinearly stable DG scheme. This approach is based on the
strong formulation of the governing equations, which can be
obtained through a second integration-by-parts of Eq. (14). The
discretized form can be cast into the same algorithmic form as
Eq. (18) with only minor modifications in the formulation of the
fluxes [14]. Here, the fluxes of the NSE are replaced by split-
form two-point fluxes that are equivalent on an analytical level
but can be used to enforce additional constraints such as en-
tropy consistency in the discretized formulation. In this work,
Routine         Vol/Surf    DOF-local  Lift *  Operations  Explanation
ConsToPrim      Surf, Vol   YES        NO      O(N^{2,3})  Computes primitive variables U_prim from state U.
VolInt          Vol         NO         YES     O(N^4)      Evaluates volume fluxes F and multiplies with D̂.
ProlongToFace   Vol→Surf    NO         YES     O(N^2)      Evaluates solution at element faces U^{L/R} to compute f*.
FillFlux        Surf        YES        YES     O(N^2)      Computes common flux f* on faces with Riemann solver.
SurfInt         Vol←Surf    NO         YES     O(N^2)      Computes surface integral with f* and ℓ̂±.
ApplyJac        Vol         YES        YES     O(N^3)      Applies Jacobian J to Û_t.

Table 1: Individual operations required to evaluate the three-dimensional DG operator with details
on whether the routine acts on volume data or surface data. Moreover, it is indicated whether
performed operations are DOF-local, i.e. are performed independently for each specific DOF, if they
have to be re-applied during the computation of the gradients, which is indicated by the prefix
Lift *, and their computational complexity in terms of N.
a kinetic-energy-preserving split-flux formulation proposed by
Pirozzoli [25] is applied. It is important to stress that the evalu-
ation of two-point fluxes increases the computational cost con-
siderably.
Second-Order Equations
For the NSE, the gradients of the primitive variables ∇_x U_prim are required to evaluate the
viscous fluxes. While several approaches exist in the literature, GALÆXI follows the BR1 method by
Bassi and Rebay [26]. Here, so-called lifted gradients g are introduced that should fulfill
\[ \mathbf{g} - \frac{1}{J} \nabla_x U_{\mathrm{prim}} = 0. \qquad (20) \]
This equation is then solved for g by deriving the weak form
of Eq. (20) and applying the DGSEM as is done for the NSE
themselves. This yields an additional set of equations that is
structurally similar to Eq. (18) but using the lifting fluxes in-
stead of the fluxes of the NSE as detailed in [14]. Hence, the
computation of the gradients corresponds to increasing the set
of unknowns by an additional (ndim ×nlift) variables, where ndim
corresponds to the number of spatial dimensions and nlift to the
number of primitive variables for which the gradients should
be computed. This also means that each of the operations indi-
cated in Eq. (18), i.e. ApplyJac, SurfInt, VolInt, FillFlux, has
to be executed again for the lifting procedure in each spatial
direction.
Computational Complexity
Table 1 also provides the estimated number of operations,
i.e. the computational complexity, for the different steps of the
three-dimensional DG discretization. The given numbers de-
scribe the asymptotic behavior of the individual operations with-
out considering details such as the computational complexity
of the flux computation and compiler optimizations. Most im-
portantly, the computational effort to compute the volume inte-
gral scales one order higher in terms of N than all other operations, i.e. it scales with O(N^4)
instead of O(N^3) or even O(N^2).
Hence, the volume integral becomes the dominant operation
for increasing N. However, the computations carried out for
the volume integral are purely element local, highly dense and
can be computed very efficiently on various types of hardware.
In practice, this increase in efficiency was observed to partly
compensate for the additional operations required with increas-
ing N[27]. Moreover, the communication stencil between el-
ements is small, since only surface fluxes with direct neighbor
elements have to be exchanged. Consequently, the high cost
of the volume integral and the small communication stencil al-
low for hiding the communication latencies very efficiently in
parallel computations.
2.3. Shock Capturing
GALÆXI is designed for the simulation of compressible
flows which can entail discontinuities in the form of shocks.
However, the application of high-order discretizations near dis-
continuities or strong gradients in the solution produces spuri-
ous oscillations and can cause the numerical scheme to become
unstable. As a consequence, a wide variety of different shock
identification and capturing methods are proposed in the liter-
ature that all strive to stabilize high-order discretizations near
shocks and provide stable and accurate simulations of com-
pressible flows. The common objective of those methods is to
retain the high-order accuracy of the baseline scheme in smooth
regions while identifying and handling so-called troubled cells
within the domain during the simulation. Common approaches
introduce some form of artificial viscosity near the shock re-
gion [28, 29, 30] or use high-order filtering techniques [31].
Another approach is to employ a hybrid discretization, where
the high-order DG scheme is stabilized in the vicinity of the
troubled region with a low-order FV scheme. For this, the DG
element is subdivided into multiple FV subcells as indicated in
Fig. 2. This low-order scheme can then either be solved directly
within the troubled elements and coupled to the surrounding
DG elements using the common Riemann fluxes [32] or can
also be used as a regularizing limiter [33]. In the following, we
employ the blending approach by Hennemann et al. [34], who
proposed to compute a convex blending of both discretization
operators. This approach has also been demonstrated to yield a
sensible turbulence model if tuned correctly [35]. Within each element, both the high-order DG
operator R_DG(Û) and the compatible low-order FV scheme R_FV(Û) are evaluated. The convex blending
of both schemes then yields
\[ \frac{\partial \hat{U}}{\partial t} = (1 - \alpha) \, R_{DG}(\hat{U}) + \alpha \, R_{FV}(\hat{U}), \qquad (21) \]
where the blending factor α ∈ [0, 1] can be computed either via an a priori or a posteriori
strategy for each individual DG element [36].
Figure 2: Sketch of the sub-cell shock capturing scheme. The DG polynomial using LGL points and a
polynomial degree of N = 3 is shown in black with the interpolation points indicated as dots, and
the integral mean solution within the subcells is shown in blue. The solution in the neighboring DG
elements is indicated in red.
In this work, the a priori indicator by Hennemann
et al. [34] is used, for which the blending approach becomes
an operator local to each individual element. Clearly, the stan-
dard DG scheme can be recovered for α=0, while α=1
yields a pure FV discretization. It is important to stress that
only the contributions of the operators within the element have
to be blended since the outer surface fluxes are identical for the
DG and FV formulation.
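As an illustration of Eq. (21), a pointwise blending kernel could look as follows. The kernel name,
the argument layout, and the per-element storage of the blending factor are assumptions of this
sketch rather than the actual GALÆXI routines.

// Convex blending of the DG and FV operators, Eq. (21). One thread handles one degree of
// freedom; alpha is constant within each element and addressed via the element index.
__global__ void blend_operators_kernel(int nDOF, int nDOFPerElem,
                                       const double* __restrict__ Rdg,    // DG operator, R_DG
                                       const double* __restrict__ Rfv,    // FV operator, R_FV
                                       const double* __restrict__ alpha,  // blending factor per element
                                       double* __restrict__ Ut) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nDOF) {
        const double a = alpha[i / nDOFPerElem];
        Ut[i] = (1.0 - a) * Rdg[i] + a * Rfv[i];
    }
}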
3. Parallelization Strategy on Accelerators
GALÆXI is the endeavor to extend the flow solver FLEXI
[14] towards accelerator-based HPC systems. Here, GALÆ-
XI follows three distinct design principles. First, the general
data structure and parallelization strategy of FLEXI for unstruc-
tured geometries should be retained. Second, we strive to retain
the majority of the codebase and the associated features of the
original implementation. Lastly, GALÆXI is designed such
that all routines called during the time-stepping are executed
on the accelerator without the need to transfer data to and from
the CPU. Device code and compute kernels are only required
for routines that are called during time-stepping and thus have
to be computed on the accelerator. In contrast, initialization
and non-frequently performed analyzing routines are still com-
puted on the CPU, since they are less time-critical and CPUs
are better suited towards unstructured workloads. Both GA-
LÆXI and FLEXI are implemented in modern Fortran 2008.
The device code for the accelerators in GALÆXI is currently
implemented using CUDA Fortran, but the integration of other
compute backends is under development.
The design and implementation of GALÆXI are detailed in
the following sections using a hierarchical top-down approach.
First, the high-level distribution of work across different com-
pute devices and the employed communication scheme between
them is detailed in Section 3.1. Based on this, Section 3.2 pro-
vides details on how communication and compute kernels are
arranged and scheduled on a single GPU. Lastly, the general
implementation paradigms for the individual compute kernels
are detailed in Section 3.3. Obviously, performance optimiza-
tions have to be performed across all of these three levels and
changes on one level affect the suitability and performance of
the others. While these individual levels are inherently inter-
linked, we chose this partitioning in the following to provide a
more structured overview of the design principles and methods
applied in GALÆXI.
3.1. Inter-GPU Parallelization
The parallelization strategy between GPUs in GALÆXI is
largely inherited from FLEXI, which employs a pure distributed
memory approach using MPI. Before going into the specific im-
plementation details of GALÆXI, the original MPI paralleliza-
tion strategy of FLEXI is briefly presented. Here, each com-
putational rank is assigned a subdomain of roughly the same
number of elements as shown in Fig. 3. In the DG context, el-
ements are only coupled via their surface fluxes. Thus, only
the surface information across the MPI borders has to be ex-
changed between individual MPI ranks during the computa-
tion. Moreover, FLEXI sorts this side information such that
the data exchanged between two MPI partners is contiguous in
memory and that the sorting is known a priori on both sides.
This makes it possible to exchange solely the data itself without any addi-
tional sorting information. The overall communication effort is
thus proportional to the number of sides at the MPI boundaries,
which are referred to in the following simply as MPI sides.
To minimize the amount of communication, i.e. the number of
MPI sides in the domain, FLEXI distributes the domain along a
pre-computed space-filling curve. This ensures that the result-
ing subdomains remain reasonably compact for any number of
subdomains while minimizing partitioning effort. During the
simulation, communication is generally asynchronous and non-
blocking, which means that communicated data is computed
and sent at the earliest possible opportunity. The communica-
tion barrier that checks whether the data has been received is
positioned at the latest possible instant before the data is ac-
tually required for further computations. This makes it possible to effec-
tively hide the communication latency by performing local op-
erations during the data exchange. For this, operations on MPI
sides are prioritized over inner sides to use all operations per-
formed on inner sides for latency hiding.
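A minimal sketch of this decomposition step is given below: the elements are assumed to be
pre-sorted along the space-filling curve, so each rank simply receives a contiguous chunk of nearly
equal size. The function name and interface are illustrative only.

#include <utility>

// Given nElems mesh elements already ordered along a space-filling curve, return the
// contiguous element range [first, last) assigned to `rank` out of `nRanks` ranks.
// The remainder is spread over the first ranks, so the chunk sizes differ by at most one.
std::pair<int, int> sfc_partition(int nElems, int nRanks, int rank) {
    const int base  = nElems / nRanks;
    const int rest  = nElems % nRanks;
    const int first = rank * base + (rank < rest ? rank : rest);
    const int count = base + (rank < rest ? 1 : 0);
    return {first, first + count};
}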
GALÆXI follows the same general approach for paralleliz-
ing across multiple GPUs. Here, each GPU on a node is asso-
ciated with a distinct CPU core while respecting the memory
topology to optimize performance.3 Moreover, GPU-aware im-
plementations of MPI are used, which improve the performance
of MPI communication. By providing Remote Direct Memory
Access (RDMA) and host offloading, these allow direct access
to the local memory of different GPUs on the same node and
transmission of MPI messages directly from the GPU to the
network adapter without the assistance of the CPU or the main
memory. The following key differences emerge between the
CPU and GPU implementations. First, the domain size on a sin-
gle GPU is larger than for the CPU case. This is a result of the
3 This means, for instance, associating the CPU core and the GPU such that
both reside within the same non-uniform memory access (NUMA) domain.
Figure 3: Domain decomposition for a generic airfoil simulation with large spanwise extension. The domain is cut such that the airfoil (transparent surface) including
the boundary layer part is visible. Patches of different colors represent individual MPI domains that are processed by different ranks. This figure is an example of a
fine granularity, e.g. in the CPU case. In the GPU case, larger MPI domains occur.
much higher computational power a GPU provides compared
to a single CPU core. A GPU requires significantly more work-
load to run at capacity and to exploit the full degree of its par-
allelism. In practice, this means that the computational domain
per rank increases if GPUs are used. Since the subdomains are
compact, an increase in size means that the volume increases
much faster than the MPI surface, i.e. that inner work becomes
more dominant in comparison to the required communication
and at the same time the amount of data to be communicated
decreases. In consequence, the performance of the interconnect
becomes less dominant than in the CPU case. Second, the GPU
implementation has to consider the asynchronicity between the
GPU device and the host. While an operation is launched by
the host at a specific position in the code sequence, the GPU
schedules and executes the operation independently from the
work performed by the host in the meantime. Hence, additional
synchronization between the host and the device is necessary to
ensure data consistency. This entails for instance ensuring that
a buffer that is about to be sent via MPI has already been filled
with the required information by the GPU. This introduces ad-
ditional overhead. However, due to the asynchronous operation,
CPU and device operations can again be overlapped, which re-
sults in an additional level of parallelism on an intra-GPU level
and is addressed in Section 3.2.
3.2. Intra-GPU Parallelization
As already discussed in the previous paragraphs, device ker-
nels are launched within host code. However, the GPU sched-
ules and executes the launched kernels asynchronously and can
also execute multiple kernels concurrently to maximize its uti-
lization. It is important to consider these properties to maxi-
mize the achieved performance on the device. GALÆXI relies
on so-called streams to manage the concurrency and scheduling
of operations on the GPU. Within the GPU context, streams are
similar to execution pipelines. Kernels within each stream are
executed serially, i.e. the next kernel within a stream pipeline
is only executed once all preceding kernels within this distinct
pipeline have finished execution. However, kernels from dif-
ferent streams can run concurrently on the GPU to maximize
utilization. This can improve the overall performance either
by running small kernels that individually cannot fully utilize the GPU or
by hiding the overhead associated with starting a kernel on the
device. Another benefit is that streams allow the mitigation of
the tail effect, which describes the negative performance impact
of the last partial wave of computations in a kernel. This effect
stems from the last thread blocks of a kernel call which will
generally not fill the whole GPU, leading to a significant por-
tion of the GPU idling while the last wave of computations is
performed. By using streams, the idling resources can execute
kernels from different streams that are known to be indepen-
dent of the current computation, which improves GPU utiliza-
tion and thus the overall performance.
A key factor when using streams is to ensure correct re-
sults independent of how the individual kernels are scheduled.
Therefore, another level of synchronization between the streams
is required to mitigate race conditions. In GALÆXI, the differ-
ent operations of the convective DG operator, summarized in
Table 1, are assigned to individual streams depending on their
interdependence. This means that if one kernel requires a pre-
vious kernel to be completed, both are assigned to the same
stream to be executed sequentially. In contrast, operations that
are independent of each other get assigned to different streams.
In GALÆXI, three streams are employed to account for the
available concurrency:
• Stream 1 (priority low): Operations within DG elements.
• Stream 2 (priority mid): Operations on inner sides.
• Stream 3 (priority top): Operations on MPI sides.
Here, each stream is assigned a priority which incentivizes the
GPU to preempt and postpone the execution of low-priority ker-
nels in favor of high-priority ones. In GALÆXI, Stream 3 con-
taining the MPI sides is assigned the highest priority to ensure
that data that has to be communicated is always computed at
Figure 4: Flowchart of GALÆXI for a single evaluation of the convective DG operator using streams.
Some routines comprise several individual compute kernels instead of single, monolithic device
kernels. These are summarized here to keep the flowchart concise. Moreover, the lifting procedure
to compute the gradients is omitted here for readability.
the earliest possible instant to ensure optimal latency hiding.
The flowchart for the evaluation of the DG operator using these
streams is shown in Fig. 4. The local operations within the
DG element and the operations at the element sides can be per-
formed independently in their streams until the computation of
SurfInt, where the surface fluxes f* as well as the contributions of the VolInt, denoted
U_t^VolInt, are required. Hence, an explicit
synchronization barrier is employed to wait until all previous
operations in all streams have completed. Then, the surface
contributions can be added to the volume integral to yield the
final Ut.
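The stream setup described above could be realized with the CUDA runtime roughly as follows; the
three-stream split mirrors the description in the text, while the function and variable names are
chosen for this sketch.

#include <cuda_runtime.h>

// Create the three compute streams used for the DG operator. The CUDA runtime expresses
// priorities as integers, where a numerically lower value denotes a higher priority.
void create_compute_streams(cudaStream_t& streamElem,    // Stream 1: volume work, low priority
                            cudaStream_t& streamInner,   // Stream 2: inner sides, medium priority
                            cudaStream_t& streamMPI) {   // Stream 3: MPI sides, highest priority
    int leastPrio = 0, greatestPrio = 0;
    cudaDeviceGetStreamPriorityRange(&leastPrio, &greatestPrio);
    const int midPrio = (leastPrio + greatestPrio) / 2;

    cudaStreamCreateWithPriority(&streamElem,  cudaStreamNonBlocking, leastPrio);
    cudaStreamCreateWithPriority(&streamInner, cudaStreamNonBlocking, midPrio);
    cudaStreamCreateWithPriority(&streamMPI,   cudaStreamNonBlocking, greatestPrio);
}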
Special care must be taken when extending the operator
towards multiple GPUs, which requires MPI communication.
Here, it has to be ensured that all kernels within the GPU are as-
signed correctly to individual streams and that the MPI commu-
nication between GPUs is effectively hidden by the local work.
For this, the work associated with MPI sides, i.e. Stream 3, is
Figure 5: Portion of compute time in percent for individual routines on HAWK-AI with N = 7,
split-form DG and 8.9×10^5 DOF on a single GPU: VolInt 39.0, ConsToPrim 14.5, FillFlux 10.1,
Lift VolInt 7.2, Lift SurfInt 5.8, Lift FillFlux 4.8, Lift ProlongToFace 3.7, ApplyJac 2.5,
ProlongToFace 2.4, SurfInt 2.4, Misc 7.7. Routines associated with the computation of the gradients
via the lifting method are prefixed with "Lift". Various small routines associated with performing
the actual time integration, i.e. updating U, are summarized under Misc.
ensured to be computed with the highest priority, such that the
communication can be initialized as soon as possible. The work
queued in the other streams is then used to hide both the local
overhead of tail effects on the GPU and the latency of the MPI
communication. In practical application, the host idles at the
MPI barrier until the communication is finished, but the GPU
is kept busy with the work from Stream 1 and Stream 2 to re-
tain the overall efficiency. Effectively, the communication of the solution at the MPI sides,
Ũ^{L/R}_MPI, is hidden by ConsToPrim on Stream 1. The communication of the resulting fluxes across
the MPI sides, f*_MPI, is hidden by VolInt in Stream 1 and ConsToPrim and FillFlux in Stream 2.
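The latency-hiding pattern described in this section can be sketched as follows with CUDA-aware
MPI, where device buffers are passed directly to the MPI calls. The buffer names, the kernel
choices mentioned in the comments, and the reduction to a single neighboring rank are
simplifications made for this sketch.

#include <cuda_runtime.h>
#include <mpi.h>

// Exchange MPI-side surface data with one neighboring rank while local work keeps the GPU busy.
// d_sendBuf and d_recvBuf are device buffers (CUDA-aware MPI), filled/consumed on streamMPI.
void exchange_mpi_sides(double* d_sendBuf, double* d_recvBuf, int nDoubles, int neighbor,
                        cudaStream_t streamMPI) {
    MPI_Request reqs[2];

    // The receive can be posted immediately.
    MPI_Irecv(d_recvBuf, nDoubles, MPI_DOUBLE, neighbor, 0, MPI_COMM_WORLD, &reqs[0]);

    // Ensure the high-priority stream has finished filling the send buffer
    // (e.g. ProlongToFace on the MPI sides) before handing it to MPI.
    cudaStreamSynchronize(streamMPI);
    MPI_Isend(d_sendBuf, nDoubles, MPI_DOUBLE, neighbor, 0, MPI_COMM_WORLD, &reqs[1]);

    // While the host waits here, element-local kernels previously queued on the other
    // streams (e.g. VolInt, ConsToPrim) keep the GPU busy and hide the MPI latency.
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    // Kernels consuming the received data (e.g. FillFlux on MPI sides) can be launched afterwards.
}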
3.3. Kernel Implementation
The goal for the implementation of the compute kernels is
to maximize the utilization of the available parallel resources
provided by the device which roughly translates to keeping as
many threads as possible busy. However, oftentimes the num-
ber of concurrent threads is limited by the number of regis-
ters required by each thread and the amount of shared memory.
Furthermore, the effective performance is limited by the avail-
able memory bandwidth, which might not be sufficient to keep
all threads busy.
Since the specifics of these limitations and their importance
depend heavily on the specific hardware, GALÆXI approaches
the problem from a one-size-fits-all perspective. Here, it is
assumed that improving on these general limitations and the
overall performance for a single type of GPU also yields sen-
sible improvements for other ones. While this approach may
not achieve the optimum performance for each specific type
of hardware, our testing has shown it to be a reasonable starting
point for further, more in-depth optimizations.
Device code is typically based on a kernel, which is the code
each individual thread executes. The overall number of threads
8
and their grouping are specified in the launch configuration. In
some sense, the launch configuration entails an implicit tightly
nested loop, while the loop body, i.e. the actual computation,
is implemented in the kernel. The optimal launch configura-
tion is oftentimes highly hardware-specific and can improve
(or impair) the overall performance significantly. Along the
same lines as discussed above, our code relies on sensible initial
guesses for all of these kernels, which gave reasonable results.
Further improvements are planned through the application of
more sophisticated tuning approaches, for instance the kernel
tuner toolkit [37], which allows automated optimization of
the launch configuration for specific hardware.
The complete list of operations of the DG operator is de-
tailed in Table 1 and the computing time of the kernels asso-
ciated with these operations is summarized in Fig. 5. Natu-
rally, operations that are DOF-local are the easiest to imple-
ment for different hardware. Hence, the following paragraph
first introduces how kernels are designed for DOF-wise oper-
ations in GALÆXI before moving to the much more intricate
task of kernels that map data between the volume and surfaces
of DG elements.
Pointwise operations
For pointwise operations, a large number of identical com-
putations have to be performed with no interdependence be-
tween individual DOFs. Such a computation becomes embar-
rassingly parallel and straightforward to distribute. The follow-
ing paragraph details how such computations are implemented
in GALÆXI using the ConsToPrim operation as an example.
This operation computes the primitive variables U_prim = (ρ, u, p, T)^T based on the vector of
conservative variables U using the EOS
defined in Eqs. (8) and (9). For this, an elemental ConsTo-
Prim Point routine is implemented that performs the compu-
tation for a single DOF. This elemental routine is the building
block of the main computation and is agnostic to the underly-
ing hardware. GALÆXI then uses different wrappers for this
elemental function. These wrappers distribute the overall work
depending on the specific type of computational hardware used.
If CPUs are used, the design of the wrapper becomes straight-
forward as shown in Algorithm 1. A single CPU core just calls
the ConsToPrim Point routine for each DOF within each ele-
ment of its domain using a tightly nested loop. The GPU wrap-
per shown in Algorithm 2 is based on the CUDA programming
model and consists of two individual components. First, the
kernel that implements the actual compute operation of an in-
dividual GPU thread. The second component is a function that
calls the kernel and provides the launch configuration config.
The launch configuration determines how many threads will be
started to execute the kernel and how the individual threads are
grouped into thread blocks. In this specific case, each thread
of the GPU performs the computation for a single DOF in the
domain. For this, each thread determines in line 8 of Algo-
rithm 2 its own globally unique thread ID i. This thread ID
incorporates the ID of the current block (blockID), the size of
each block (blockDim), and its thread number within the block
(threadID), which are all available for each thread during run-
time. The thread then performs the computation for this i-th
Algorithm 1 Wrapper for ConsToPrim Point on CPU
1: function ConsToPrim CPU(N, nElems, U)
2:   for n ← 1 to nElems do                      ▷ loop over elements
3:     for i, j, k ← 0 to N do                   ▷ loop within element
4:       Uprim_{ijk,n} ← ConsToPrim Point(U_{ijk,n})
5:     end for
6:   end for
7:   return Uprim
8: end function
Algorithm 2 Wrapper for ConsToPrim Point on GPU
 1: function ConsToPrim GPU(N, nElems, U)
 2:   nDOF ← (N+1)^3 · nElems                    ▷ number of DOF in array
 3:   Uprim ← ConsToPrim Kernel<<config>>(nDOF, U)
 4:   return Uprim
 5: end function
 6:
 7: kernel ConsToPrim Kernel(nDOF, U)
 8:   i ← (blockID−1)·blockDim + threadID        ▷ own index
 9:   if i ≤ nDOF then
10:     Uprim_i ← ConsToPrim Point(U_i)
11:     return Uprim_i
12:   end if
13: end kernel
DOF. Note that the high-dimensional structure of the array be-
comes irrelevant in this case and can be “flattened” to a one-
dimensional array containing nDOF entries.
More advanced techniques can be used to optimize those
wrappers for different hardware. This includes for instance
vectorization, such that either the vector units of a CPU or
real vector accelerators can perform the operations performed
in ConsToPrim on several entries of Usimultaneously. Sim-
ilarly, optimizations such as loop unrolling or shared memory
parallelization are straightforward to implement. For GPU us-
age, the wrapper can be adapted to distribute multiple DOFs
to each thread and optimize the launch configuration, depend-
ing on the hardware specifics. The same building block ap-
proach can also be applied to support other backends such as
HIP, ROCm, OpenMP or OpenACC while only maintaining a
single version of the equation-specific code.
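For readers more familiar with CUDA C++ than with the pseudocode above and the CUDA Fortran used by
GALÆXI, the same wrapper pattern could be expressed roughly as follows. The kernel, its launch
configuration, and the array layout are illustrative assumptions; the per-point conversion repeats
the relation from Eqs. (8) and (9).

#include <cuda_runtime.h>

// Per-DOF conversion from 5 conserved to 6 primitive variables (Eqs. (8) and (9)),
// with assumed perfect-gas constants.
__device__ inline void cons_to_prim_point(const double* U, double* prim) {
    const double gamma = 1.4, R = 287.058;
    const double rho = U[0];
    const double u = U[1] / rho, v = U[2] / rho, w = U[3] / rho;
    const double p = (gamma - 1.0) * (U[4] - 0.5 * rho * (u * u + v * v + w * w));
    prim[0] = rho; prim[1] = u; prim[2] = v; prim[3] = w; prim[4] = p; prim[5] = p / (rho * R);
}

// One GPU thread handles one degree of freedom of the flattened solution array.
__global__ void cons_to_prim_kernel(int nDOF, const double* U, double* Uprim) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;   // globally unique thread/DOF index
    if (i < nDOF)
        cons_to_prim_point(&U[5 * i], &Uprim[6 * i]);
}

// Host-side wrapper: the launch configuration distributes one thread per DOF.
void cons_to_prim_gpu(int N, int nElems, const double* d_U, double* d_Uprim) {
    const int nDOF    = (N + 1) * (N + 1) * (N + 1) * nElems;
    const int threads = 256;                                // example block size
    const int blocks  = (nDOF + threads - 1) / threads;
    cons_to_prim_kernel<<<blocks, threads>>>(nDOF, d_U, d_Uprim);
}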
Volume↔Surface Operations
The optimization potential of the pointwise operations dis-
cussed above is mostly independent of the core algorithms them-
selves. In contrast, the most challenging routines for GPU port-
ing and parallelization in the DG context are routines that map
data between the surfaces and the volume. Due to the highly
local nature of the DG method, the transfer of data from within
the element to its sides and vice versa is required in only a few
operations. Thus, the original FLEXI code opted to store the
data on the element faces and within the elements in different
arrays from which it is retrieved based on precomputed map-
pings. However, revisiting Table 1 reveals that two specific
operations in the DG operator access both volume and surface
Algorithm 3 CPU implementation of the SurfInt operation
 1: function SurfInt(f*, Ut, ℓ̂+, ℓ̂−)
 2:   for s ← 1 to nSides do
 3:     if isPrimary then
 4:       f*_tmp, locSide ← SideMapping(s, isPrimary, f*_s)
 5:       Ut ← DoSurfInt(locSide, Ut, f*_tmp, ℓ̂+, ℓ̂−)
 6:     end if
 7:     if isReplica then
 8:       f*_tmp, locSide ← SideMapping(s, isPrimary, −f*_s)
 9:       Ut ← DoSurfInt(locSide, Ut, f*_tmp, ℓ̂+, ℓ̂−)
10:     end if
11:   end for
12:   return Ut
13: end function
14:
15: function DoSurfInt(locSide, Ut, f*_pq, ℓ̂+, ℓ̂−)
16:   switch (locSide)
17:   case ξ−
18:     for i, j, k ← 0 to N do
19:       Ut_{ijk} ← Ut_{ijk} + f*_{jk} ℓ̂−_i
20:     end for
21:   case ...
22:   case ζ+
23:     for i, j, k ← 0 to N do
24:       Ut_{ijk} ← Ut_{ijk} + f*_{ij} ℓ̂+_k
25:     end for
26: end function
data: ProlongToFace and SurfInt.4 The former evaluates the
polynomial solution from the interior points at the element faces
and stores it in a side-based array (U → Ũ^{L/R}), while the latter computes the integral of the
fluxes on the element faces and adds their contribution to the volume (f* → U_t). In both cases,
an interpolation point in the volume is linked to several points
on the surface and vice versa, as shown in Fig. 1. Special care
must be taken to exploit the full potential for parallelization of
the task on a GPU while avoiding race conditions and costly
synchronizations among individual threads. In the following,
this is illustrated for the SurfInt routine.
In the original CPU version, Algorithm 3, the SurfInt rou-
tine loops over all sides on the current rank. For each side, it
obtains the orientation of the side with respect to the volume.
The orientation of the side of a hexahedral DG element depends
on which of its six local faces the side refers to. The contribu-
tion of this side is then added to all DOFs within the element.
This operation is hard to parallelize for GPU hardware since
all 6 local sides add their contribution to each individual DOF
within the element. Writing to the same entries in an array mul-
tiple times can yield race conditions if the individual threads are
not properly synchronized, but synchronizing threads is costly.
4As shown in Table 1, the volume integral is also not a point-local operation
due to the application of the differentiation matrix along the lines indicated in
Fig. 1. However, the operations are restricted to the interpolation points within
the volume of the DG element, i.e. no exchange of information between the
volume and the faces is required.
Algorithm 4 GPU kernel for the SurfInt operation
 1: kernel SurfInt Kernel(N, nElems, f*, Ut)
 2:   i ← (blockID−1)·blockDim + threadID
 3:   nDOF ← (N+1)^3 · nElems                    ▷ number of volume DOF
 4:   if i ≤ nDOF then
 5:     for locSide ∈ {ξ−, η−, ζ−, ξ+, η+, ζ+} do
 6:       p, q, s, ℓ̂±_k, isPrimary ← SideMapping(i, locSide)
 7:       if isPrimary then
 8:         Ut_i = Ut_i + f*_{pq,s} ℓ̂±_k
 9:       else
10:         Ut_i = Ut_i − f*_{pq,s} ℓ̂±_k
11:       end if
12:     end for
13:     return Ut_i
14:   end if
15: end kernel
In GALÆXI, the sequence of operations is thus altered for the
GPU implementation in comparison to the original CPU im-
plementation. The developed algorithm, Algorithm 4, runs as
follows. First, each GPU thread is assigned a single DOF within
an element. Due to the tensor product structure of the DGSEM,
this results in only a single DOF per face influencing the solu-
tion as indicated in Fig. 1. The thread then loops over all six
sides (locSide) of the element. For each side, it identifies the
side index swithin the flux array and the corresponding DOF
on the face specified by the indices p,q. The side whose normal
vector is used to compute the Riemann flux is determined by the
flag isPrimary, while for the adjacent element (isReplica side)
the sign of the flux contribution has to be flipped to account for
the fact that its outward facing normal vector points in the op-
posite direction. Additionally, the correct integration weight ˆω
is identified to add the flux contribution of this locSide to the
respective DOF. While this requires multiple threads to access
the same surface data multiple times, it avoids race conditions
between threads without the need of explicit synchronization,
since only a single thread writes to a specific entry in the Ut
array. Lastly, transforming the fluxes from the side-local to the
element-local coordinate system requires some form of map-
ping. Since GALÆXI is an unstructured solver, the algorithm
also needs to account for the case where coordinate systems
of neighboring elements are rotated with respect to each other.
The combination results in mappings which are non-trivial to
obtain. However, the required mappings are hardware-agnostic
and not relevant for the efficiency of the GPU kernel. In con-
sequence, these specifics are condensed into a single call to a
subroutine SideMapping to keep the algorithm concise. More
details on the side connectivity can be found in Krais et al. [14].
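To make the access pattern of Algorithm 4 more concrete, the following heavily simplified CUDA
sketch shows the one-thread-per-DOF accumulation for the special case of a conforming mesh in which
neighboring elements share the same coordinate orientation, so that the side mapping collapses to
simple index arithmetic. The array layouts, the per-element storage of already outward-oriented
face fluxes, and all names are assumptions of this sketch; the general rotation and primary/replica
handling performed by SideMapping in GALÆXI is omitted.

// Simplified surface integral for a single solution variable: one thread per volume DOF,
// looping over the six local faces of its element (cf. Algorithm 4). The face fluxes are
// assumed to be stored per element with the element's outward normal already accounted for,
// so neither a sign flip nor an orientation mapping is required.
// Layout: f_face[((elem*6 + face)*(N+1) + q)*(N+1) + p], faces ordered xi-,xi+,eta-,eta+,zeta-,zeta+.
__global__ void surf_int_simplified(int N, int nElems,
                                    const double* __restrict__ f_face,
                                    const double* __restrict__ lhat_minus,  // l_i(-1)/w_i
                                    const double* __restrict__ lhat_plus,   // l_i(+1)/w_i
                                    double* __restrict__ Ut) {
    const int np   = N + 1;
    const int idx  = blockIdx.x * blockDim.x + threadIdx.x;
    const int nDOF = np * np * np * nElems;
    if (idx >= nDOF) return;

    // Recover the tensor-product indices (i,j,k) and the element index from the flat DOF index.
    const int i    = idx % np;
    const int j    = (idx / np) % np;
    const int k    = (idx / (np * np)) % np;
    const int elem = idx / (np * np * np);

    const double* f = f_face + static_cast<long long>(elem) * 6 * np * np;
    double ut = Ut[idx];
    ut += f[(0 * np + k) * np + j] * lhat_minus[i];   // xi-   face, face DOF (j,k)
    ut += f[(1 * np + k) * np + j] * lhat_plus[i];    // xi+   face
    ut += f[(2 * np + k) * np + i] * lhat_minus[j];   // eta-  face, face DOF (i,k)
    ut += f[(3 * np + k) * np + i] * lhat_plus[j];    // eta+  face
    ut += f[(4 * np + j) * np + i] * lhat_minus[k];   // zeta- face, face DOF (i,j)
    ut += f[(5 * np + j) * np + i] * lhat_plus[k];    // zeta+ face
    Ut[idx] = ut;                                     // single writer per DOF: no race conditions
}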
At this point, it is important to revisit the required com-
pute time of the different DG operations as shown in Fig. 5. It
is evident that the majority of the computational work can be
attributed to the operations VolInt, Lift VolInt, and ConsTo-
Prim, which are operations local to each DG element that can
be scheduled independently from any communication. This has
three crucial implications. First, only a small number of rou-
tines require the majority of the compute time, which yields
distinct targets for more sophisticated optimization. Second,
these routines do not require any communication, which again
highlights the beneficial ratio of local work to required com-
munication of DG schemes. Third, the overhead introduced by
the unstructured mesh is negligible, since the additional work is
mainly limited to the routines mapping from the sides to the vol-
umes, i.e. SurfInt and ProlongToFace, which take only around
15 % of the overall compute time.
3.4. Summary of the Parallelization Strategy
This section provides details on the parallelization concept
of GALÆXI on three different levels. First, the parallelization
of the workload between GPUs was introduced. Here, GA-
LÆXI subdivides the domain into subdomains with roughly
the same number of elements, which are then assigned to the
individual GPUs and communication across the boundaries of
neighboring subdomains is performed using CUDA-aware MPI.
Second, the individual compute kernels within the GPU are
scheduled using streams to improve the overall utilization of
the GPU. Operations associated with the MPI communication
are assigned to the stream with the highest priority to allow the
GPU to bring forward the execution of these kernels and initiate the
communication at the earliest possible point in time. Third, the
design concepts of the kernels were introduced using the Cons-
ToPrim operations as an example for pointwise operations and
the SurfInt to detail the more intricate case of kernels that have
to map from the elements’ volume to their faces and vice versa.
The resulting performance of the kernels demonstrates that the
overhead of the unstructured mesh is negligible. A detailed
discussion of the resulting parallel performance of GALÆXI
across multiple GPUs is provided in the following paragraphs.
4. Performance Evaluation
In the following section, the performance and the scaling
abilities of GALÆXI are demonstrated. First, Section 4.1 in-
troduces the details of the applied systems, i.e. HAWK-AI and
JUWELS Booster. Section 4.2 then derives the performance
metrics that are used to evaluate the performance. With these in
place, Section 4.3 provides details on the code’s memory con-
sumption while the results of the scaling tests are discussed in
Section 4.4.
4.1. Hardware Architecture
The performance of GALÆXI and FLEXI is investigated
for two different systems. First, the JUWELS Booster installed
at the Jülich Supercomputing Centre (JSC) and second, the HAWK
and HAWK-AI systems at the High-Performance Computing
Center Stuttgart (HLRS).
The JUWELS Booster module entails a total of 936 two-
socket nodes. Each node provides two AMD EPYC 7402 pro-
cessors with 24 cores per socket and a total of 512 GiB of
DDR4-3200 main memory per node. Each node comprises
4 NVIDIA A100 GPUs with 40 GiB memory interconnected using NVLink, where each GPU is connected to
its own net-
work adapter and the individual nodes are integrated using a
Mellanox HDR200 InfiniBand interconnect with 200 Gbit/s per
adapter in a DragonFly+ topology.
The HAWK supercomputer at HLRS is based on an HPE
Apollo 9000 with 5632 dual-socket nodes. Each node is equipped
with two AMD EPYC 7742 CPUs, which yield 128 CPU cores
per node. Each node comprises 256 GiB of main memory and
the nodes are connected using a Mellanox HDR200 InfiniBand
interconnect in a 9D-hypercube topology. The HAWK-AI par-
tition of HAWK is based on an HPE Apollo 6500 Gen10 Plus
with 24 nodes, where each node is equipped with two 64-core
AMD EPYC 7702 processors, 8 NVIDIA A100 GPUs inter-
connected using NVLink, and 1 TiB of main memory. 20 nodes
employ A100 GPUs with 40 GiB memory and 4 nodes entail
A100 GPUs in the 80 GiB version. The nodes of HAWK-AI
are fully integrated into the main HAWK partition using an In-
finiBand interconnect in a Fat-Tree topology, such that nodes
from both systems can be used within a single compute job.
The HAWK-AI partition was designed to integrate AI and big
data capabilities into traditional HPC jobs but is also capable of
running and scaling GPU-accelerated HPC applications on its
own.
4.2. Performance Metrics
In the following, we focus on two distinct metrics to quan-
tify and compare the performance of GALÆXI and FLEXI on
different hardware, which rely on the time-to-solution and the
energy-to-solution paradigms, respectively. Here, we use the
performance index (PID), which is defined as
\[ \mathrm{PID} = \frac{\text{Walltime} \times \#\text{Ranks}}{\#\text{RK-stages} \times \#\text{DOF}}. \qquad (22) \]
The PID describes the walltime required by a single rank to ad-
vance a single DOF for one stage of the explicit Runge–Kutta
time-stepping. Hence, the PID is independent of the number of
timesteps performed, the number of DOF used in the simulation
and the number of ranks employed, where a rank refers either
to a CPU core or a whole GPU as discussed in Section 3. While
this provides a good measure of efficiency for code performance
comparison on either CPU or GPU systems, the usefulness of
this definition is limited when comparing GPU and CPU codes
with each other. Here, a whole GPU would be compared to a
single CPU core with a vastly different compute performance
and power consumption. To account for the differences in hard-
ware, we propose an energy-normalized PID (EPID) as a more
suitable measure of performance. The EPID is defined as
\[ \mathrm{EPID} = \frac{\text{Walltime} \times \text{Power}}{\#\text{RK-stages} \times \#\text{DOF}}
   = \underbrace{\frac{\text{Power}}{\#\text{Ranks}}}_{P_{\text{rank}}} \times \mathrm{PID}, \qquad (23) \]
and describes the energy required to compute the time update
for a single DOF on the specific computing hardware. The
EPID can thus be interpreted as the PID weighted by the specific power required per rank, which is denoted as P_rank.
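As a minimal illustration of how these two metrics can be evaluated from raw measurements, the following sketch implements Eqs. (22) and (23); all input values are hypothetical placeholders and not taken from the measurements reported below.

```python
# Minimal sketch of Eqs. (22) and (23); the run parameters below are
# hypothetical placeholders, not measured values.

def pid(walltime_s, n_ranks, n_rk_stages, n_dof):
    """Performance index (PID): walltime per DOF, RK stage, and rank."""
    return walltime_s * n_ranks / (n_rk_stages * n_dof)

def epid(walltime_s, power_w, n_rk_stages, n_dof):
    """Energy-normalized PID (EPID): energy per DOF and RK stage."""
    return walltime_s * power_w / (n_rk_stages * n_dof)

if __name__ == "__main__":
    walltime = 120.0                  # s, total walltime of the run
    n_ranks = 4                       # e.g. 4 GPUs
    power = n_ranks * 450.0           # W, total power draw of all ranks
    n_rk_stages = 100 * 5             # 100 timesteps of a 5-stage RK scheme
    n_dof = 32 * 8**3                 # 32 elements at N = 7

    p = pid(walltime, n_ranks, n_rk_stages, n_dof)
    e = epid(walltime, power, n_rk_stages, n_dof)
    # Consistency check: EPID = (Power / #Ranks) * PID = P_rank * PID.
    assert abs(e - (power / n_ranks) * p) < 1e-15 * e
    print(f"PID = {p:.3e} s, EPID = {e:.3e} J")
```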
4.3. Memory Requirements
The memory consumption of a real-world application on the
device is given in Table 2 in KiB per DOF for different polyno-
mial degrees N. In general, the overall memory consumption
is low, which is a well-known property of the explicit numer-
ical scheme. The results clearly show that increasing N improves the memory efficiency, i.e. reduces the required amount of memory per DOF. This is because GALÆXI stores both the solution for the DOFs within the DG element ((N+1)³ values) and on its surfaces (6(N+1)² values). With increasing N, the ratio of surface to volume information thus decreases, yielding a lower overall memory footprint per DOF. As an illustration of this memory efficiency, it is possible to compute a problem with N = 7 and 48 million DOF per solution variable on a single device with 40 GiB of memory.
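The trend in Table 2 can be illustrated with a short calculation: the ratio of surface to volume storage per element decays as 6/(N+1), and the quoted 40 GiB example follows directly from the measured value for N = 7. The script below is only a sketch of this reasoning.

```python
# Ratio of surface to volume DOF for a hexahedral DG element of degree N:
# (N+1)^3 volume DOF vs. 6*(N+1)^2 surface DOF, i.e. a ratio of 6/(N+1).
for N in range(1, 13):
    ratio = 6 * (N + 1) ** 2 / (N + 1) ** 3
    print(f"N = {N:2d}: surface/volume storage ratio = {ratio:.3f}")

# Cross-check of the 40 GiB example: 48e6 DOF per variable at N = 7 with
# the measured 0.869 KiB per DOF from Table 2.
print(f"{48e6 * 0.869 / 1024**2:.1f} GiB")   # roughly 40 GiB
```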
4.4. Scaling Tests
To evaluate the scalability of GALÆXI on HPC systems, its
parallel performance is evaluated on the JUWELS Booster module using up to 1024 GPUs for a wide range of problem sizes. For this, the spatial resolution of a Cartesian mesh with 4 × 4 × 2 = 32 elements is successively doubled in each spatial direction until the finest resolution of 256³ = 16.8 × 10⁶ elements is reached. For a polynomial degree of N = 7, which is a typical choice for production runs, this results in 16 384 to 8.6 × 10⁹
DOF, respectively. All simulations are initialized with a con-
stant flow state, since in contrast to an implicit time integration
scheme, the computational cost and thus the scaling behavior of
the explicit scheme is independent of the prevailing flow condi-
tion. Each computation is advanced for 100 timesteps and the
scaling properties are evaluated based on the PID. Here, only
the time for the timestepping is considered, while initialization and analysis routines are neglected. The results of the scaling
tests are presented from three different perspectives—first, the
influence of the computational load per GPU on the overall per-
formance, second, investigating the parallel efficiency in a weak
scaling setting and third, from a strong scaling perspective.
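For reference, the quoted problem sizes follow directly from the element counts and the (N+1)³ = 512 DOF per element at N = 7; the short check below reproduces the numbers used in the scaling study.

```python
# DOF counts of the scaling study at N = 7 (512 DOF per element).
dof_per_elem = (7 + 1) ** 3
print(4 * 4 * 2 * dof_per_elem)                 # coarsest mesh: 16384 DOF
print(f"{256**3 * dof_per_elem:.2e}")           # finest mesh: ~8.6e9 DOF
print(f"{256**3 * dof_per_elem / 1024:.2e}")    # ~8.4e6 DOF per GPU on 1024 GPUs
```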
In a first step, the PID is plotted against the specific load in
terms of DOF per GPU in Fig. 6. Since the PID is a measure of
computational time, a lower PID indicates better performance.
Most strikingly, all curves converge above the limit of 10⁶ DOF per GPU, which means that the overhead of the parallelization and communication becomes negligible in comparison to using only a single GPU. Hence, GALÆXI scales almost perfectly beyond the threshold of 10⁶ DOF per GPU. The behavior
changes for loads below this threshold. Here the PID increases
towards lower loads for all cases, which means that the com-
putational efficiency decreases. Moreover, the more GPUs are
used for the simulation, the more pronounced this loss in per-
formance becomes. This can be attributed to two factors. First,
the communication latency between the GPUs cannot be hidden completely at low loads, since the amount of local work is insufficient to overlap it. Second, the loss in performance becomes more pronounced the more potential communication partners, i.e. GPUs, are used for the simulation.
The severity of this performance penalty depends strongly on
Figure 6: Scaling results for GALÆXI with the split-form DG scheme and
N=7 plotted as PID over the specific load, i.e. DOF per GPU, for up to 1024
GPUs.
the network topology of the HPC system and the job placement
on the system, which is determined by the scheduler. In the
case of the JUWELS Booster module, which uses a DragonFly-
type network topology, the communication cost increases sig-
nificantly when the nodes are spread across a larger number of
switch groups, which contain 192 GPUs each. However, insufficient latency hiding cannot explain the performance loss when using a single GPU, since no communication is necessary in this case.
Instead, this drop in performance can be attributed to the over-
head associated with launching kernels on the GPU. If the ac-
tual computational load of the kernel becomes too small, the
kernels cannot be launched quickly enough to use the GPU to
capacity. Moreover, tail effects become noticeable, as discussed
in Section 3. To summarize, the GPU implementation is kernel-bound for high loads, where the performance becomes independent of the total number of GPUs used. For very low loads, the performance is low-load-bound and becomes increasingly communication-bound, with the performance penalty growing as more compute nodes are used. This is in stark contrast to the CPU implementation of
FLEXI as reported by Blind et al. [38]. Here, the impact of the
communication overhead is similarly noticeable for very low
loads. However, a performance penalty also appears for very
high loads, since here the fast CPU cache cannot hold all nec-
essary data and the bandwidth to the main memory becomes the
bottleneck. This results in a narrow band in the range of 3000 to 10 000 DOF per rank, where optimal performance is achieved
[14, 38]. In the case of GALÆXI, increasing the load only improves the overall performance, with the available GPU memory as the single limiting factor.
Fig. 7 depicts the weak scaling properties of GALÆXI. In the weak scaling paradigm, the problem
size and the amount of compute resources are increased pro-
portionally, such that the overall load per GPU is kept constant
Table 2: Measured memory consumption per DOF on the GPU for different polynomial degrees N and the Navier–Stokes equation system.

N     1      2      3      4      5      6      7      8      9      10     11     12
KiB   1.457  1.188  1.049  0.996  0.942  0.895  0.869  0.841  0.827  0.808  0.801  0.787
[Figure 7 shows efficiency curves for loads of 2.6 × 10⁵, 1.0 × 10⁶, 2.1 × 10⁶, 8.4 × 10⁶, and 3.4 × 10⁷ DOF per GPU.]
Figure 7: Weak scaling of GALÆXI with the split-form DG scheme and N=7
plotted as the parallel efficiency over the number of GPUs for specific loads,
i.e. DOF per GPU. The parallel efficiency is computed based on the PID on a
single node, i.e. on 4 GPUs.
for each case. Here, the parallel efficiency is normalized to the performance of a complete node. This is done in order to take communication into account in a meaningful way, since a single GPU runs without any communication and would therefore provide a skewed baseline.
The results again show the threshold of 10⁶ DOF per GPU discussed before. For lower loads, the communication latency degrades the overall performance, while loads above 10⁶ DOF per GPU show almost perfect weak scaling up to the maximum
of 1024 GPUs.
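One plausible way to read the efficiency values in Fig. 7 is as the ratio of the single-node PID to the PID at the respective GPU count for constant load per GPU; the sketch below uses purely illustrative PID values.

```python
# Weak-scaling parallel efficiency, normalized to one full node (4 GPUs).
# The PID values below are illustrative placeholders, not measured data.
pid_by_gpus = {4: 1.00e-8, 32: 1.02e-8, 256: 1.05e-8, 1024: 1.10e-8}

pid_ref = pid_by_gpus[4]
for n_gpus, pid_n in pid_by_gpus.items():
    # Under perfect weak scaling the PID stays constant, so the
    # efficiency is the ratio of the reference PID to the measured PID.
    print(f"{n_gpus:5d} GPUs: parallel efficiency = {pid_ref / pid_n:.2f}")
```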
Lastly, the results for strong scaling of GALÆXI are shown
in Fig. 8. For cases too large to fit into the memory of a sin-
gle GPU, the results are normalized with respect to the small-
est number of GPUs that was able to run the case. The strong
scaling capabilities of GALÆXI are excellent up to the max-
imum of 1024 GPUs, as long as the computational load exceeds the threshold of 10⁶ DOF per rank, which is indicated explicitly for both cases. Below this threshold, i.e. towards larger numbers of GPUs, the load per device is insufficient to exploit the
computing power of the GPU and to hide the necessary com-
munication, which results in the loss of performance. This also
matches the results by Fischer et al. [8], who report that NekRS
reaches its limit for strong scaling at a similar load of about
2 to 4 million DOF per rank. For computational loads above
this threshold, GALÆXI yields almost perfect strong scaling
results up to the maximum of 1024 GPUs. Next, the influence of our scheduling strategy based on parallel streams, as introduced in Section 3.2, is investigated. For this, Fig. 8 also
shows the scaling results for the same problem sizes, with (solid
[Figure 8 compares problem sizes of 3.4 × 10⁷, 5.4 × 10⁸, and 2.1 × 10⁹ DOF with and without parallel streams; using streams yields up to a 92 % performance increase at low loads.]
Figure 8: Strong scaling of GALÆXI with the split-form DG scheme and N=7
plotted as the speedup over the number of GPUs for three problem sizes. For
two cases, the results without the use of parallel streams are shown dashed. The
speedup is computed based on the smallest number of GPUs that was able to
run the given case. The ideal speedup is shown in black.
lines) and without (dashed lines) the use of parallel streams for kernel scheduling. For both setups, omitting the stream-based scheduling results in a significant loss in parallel performance for low loads. This can be attributed to two aspects. First, parallel streams allow for hiding the overhead of kernel launches and tail effects at low loads. More importantly, however, our implementation permits the GPU to prioritize the computation of quantities that have to be communicated via MPI. This facilitates more efficient communication latency hiding, resulting in better parallel performance in cases involving many communication partners and low amounts of local work.
5. Verification & Validation
5.1. Verification - Convergence Tests
The correct implementation of the high-order accurate nu-
merical schemes in GALÆXI is verified by testing the order of
convergence of the spatial operator with the method of man-
ufactured solutions [39]. This method derives source terms for nonlinear partial differential equations such that a prescribed analytical function becomes an exact solution, which allows computing the error of the numerical discretization scheme. Following Hindenlang et al. [40], the exact solution is
Figure 9: Convergence of the split-flux DG scheme on LGL nodes (left) and the standard DG scheme on GL nodes (right) using N∈[2,9] for the manufactured
solution.
assumed to follow a sinusoidal solution of the form
ρ(x, t) = 2 + A sin(2π(x + y + z − a t)),
u(x, t) = 2 + A sin(2π(x + y + z − a t)),        (24)
E(x, t) = (2 + A sin(2π(x + y + z − a t)))²,
where the amplitude and advection speed are chosen as A = 0.1 and a = 1, respectively. This solution describes an oblique, periodic wave that is advected linearly with speed a. The source terms that are required for Eq. (24) to be an exact solution of the NSE are detailed in Gassner et al. [41]. The problem is then initialized within a domain of x ∈ [−1, 1]³ with periodic boundary conditions and is discretized with varying N ∈ [2, 9]. The
meshes range from a single element up to 64³ elements. The computation is advanced in time up to t = 1 and the timestep is chosen sufficiently small to
not influence the overall discretization error. The convergence
test is carried out with both the standard collocation formula-
tion on GL interpolation points and the split-flux formulation on
LGL interpolation points. The results in Fig. 9 demonstrate that
the expected design order is reached for all investigated cases,
which verifies the correct implementation of the schemes.
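The design order can be quantified from two successive mesh levels via the experimental order of convergence; a small sketch with illustrative error values is given below.

```python
import math

# Experimental order of convergence (EOC) from L2 errors on two meshes
# that differ by a refinement factor of 2. The error values below are
# illustrative placeholders, not results of the convergence study.
def eoc(err_coarse, err_fine, refinement_factor=2.0):
    return math.log(err_coarse / err_fine) / math.log(refinement_factor)

print(f"EOC = {eoc(1.0e-5, 6.3e-7):.2f}")   # close to the design order 4 for N = 3
```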
5.2. Validation - Taylor–Green-Vortex
A popular validation case for turbulent flows is the Taylor–
Green-Vortex (TGV). One reason for its widespread use is its analytically prescribed initial conditions, which are given by
u(x, 0) = ( U0 sin(x/L) cos(y/L) cos(z/L),
           −U0 cos(x/L) sin(y/L) cos(z/L),
            0 ),                                                    (25)
p(x, 0) = p0 + (ρ0 U0²/16) (cos(2x/L) + cos(2y/L)) (cos(2z/L) + 2),
with L = 2π denoting the size of the domain, U0 = 1 the magnitude of the initial velocity fluctuations and ρ0 = 1 the reference density. The background pressure p0 is chosen to fit a prescribed background Mach number Ma0 = U0 √(ρ0/(γ p0)).
However, Eq. (25) does not yield sufficient initial conditions for a compressible flow field, since it lacks information about the density and temperature fields. Two different approaches are commonly used to extend it to a full description of a compressible flow field as required for the computation with a compressible solver. For this, either the density or the temperature field is held constant, while the respective other quantity is computed to yield a thermodynamically admissible state. Assuming a perfect gas that follows Eq. (9), this yields the two variants
Version I:  ρ(x, 0) = ρ0,          T(x, 0) = p / (R ρ0),        (26)
Version II: ρ(x, 0) = p / (R T0),  T(x, 0) = T0.                 (27)
Two common metrics to assess the accuracy of numerical schemes for the TGV case are the instantaneous kinetic energy in the domain Ek and the viscous dissipation rate εT. The integral kinetic energy is defined as

Ek = 1/(2 ρ0 U0² |Ω|) ∫_Ω ρ u · u dΩ,        (28)

where |Ω| denotes the overall size of the integration domain.
The viscous dissipation rate of the kinetic energy can be split
into a solenoidal and a dilatational contribution (Zeman [44],
Sarkar et al. [45]), which are defined as
εS = L²/(Re U0² |Ω|) ∫_Ω (µ(T)/µ0) ω · ω dΩ,        (29)

εD = 4 L²/(3 Re U0² |Ω|) ∫_Ω (µ(T)/µ0) (∇ · u)² dΩ,        (30)
respectively. The solenoidal component εS can be related to the vortical motion and the dilatational component εD to compressibility effects.
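A minimal sketch of how Eqs. (28) to (30) can be evaluated on a uniform periodic grid is given below; the field arrays and the viscosity law mu(T) are assumed to be provided by a solver or post-processing tool and are not part of GALÆXI's actual implementation.

```python
import numpy as np

# Sketch: midpoint-rule evaluation of Eqs. (28)-(30) on a uniform periodic
# grid with cell volume dV and total volume vol. The arrays rho (scalar
# field), u and omega (shape 3 x nx x ny x nz), divu and T are assumed given.
def kinetic_energy(rho, u, rho0, U0, dV, vol):
    # Eq. (28): E_k = 1 / (2 rho0 U0^2 |Omega|) * int(rho u.u) dOmega
    return np.sum(rho * np.sum(u * u, axis=0)) * dV / (2.0 * rho0 * U0**2 * vol)

def eps_solenoidal(omega, T, mu, mu0, L, Re, U0, dV, vol):
    # Eq. (29): eps_S = L^2 / (Re U0^2 |Omega|) * int(mu(T)/mu0 * w.w) dOmega
    integral = np.sum(mu(T) / mu0 * np.sum(omega * omega, axis=0)) * dV
    return L**2 / (Re * U0**2 * vol) * integral

def eps_dilatational(divu, T, mu, mu0, L, Re, U0, dV, vol):
    # Eq. (30): eps_D = 4 L^2 / (3 Re U0^2 |Omega|) * int(mu(T)/mu0 * (div u)^2) dOmega
    integral = np.sum(mu(T) / mu0 * divu**2) * dV
    return 4.0 * L**2 / (3.0 * Re * U0**2 * vol) * integral
```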
Figure 10: Temporal evolution of the solenoidal dissipation rate εS for the incompressible TGV at Ma0 = 0.1 (left) and the compressible TGV at Ma0 = 1.25 (right) using between 64 and 512 DOF per spatial direction with a polynomial degree N = 7. The results by DeBonis [42] (left) and Chapelier et al. [43] (right) serve as the reference solution for the incompressible and compressible case, respectively. The results of the CPU implementation are given for reference.
Two versions of the TGV case are investigated, which both exhibit a Reynolds number of Re = 1600 with the initial conditions prescribed in Eq. (27). First, the weakly compressible case with Ma0 = 0.1 is investigated to verify that GALÆXI accurately captures the physics of turbulent flow. In a second step, the Mach number is increased to Ma0 = 1.25, which causes complex shock patterns to emerge during the simulation. Con-
sequently, this supersonic TGV setup is a suitable test case to
assess the stability and accuracy of compressible flow solvers
for shock-turbulence interaction.
Incompressible TGV
First, we consider the TGV at Re = 1600 in the incompressible limit with Ma0 = 0.1 and Version II, i.e. an initially
constant temperature field. Four different resolutions were in-
vestigated to demonstrate the mesh convergence of the code.
For this, either 64, 128, 256, or 512 DOF were employed in
each spatial direction with a polynomial degree of N=7. Two
simulations were carried out for each mesh, first with the GPU-
accelerated GALÆXI and second with its CPU-based predeces-
sor FLEXI for verification purposes. The results are also vali-
dated against the high-fidelity reference solution published by
DeBonis [42]. The results shown in Fig. 10 (left) demonstrate
that GALÆXI and FLEXI yield the same results up to machine
precision. Moreover, as the resolution increases, the temporal
evolution of the dissipation rate converges to the reference solu-
tion, to the point where the solution on the finest mesh with 512
DOF in each direction matches the reference almost perfectly.
Compressible TGV
More recently, the TGV case was extended to the compress-
ible regime by increasing the Mach number of the initial flow
field [46, 43]. A common choice is Ma0=1.25, for which
complex shock patterns emerge that interact with the turbulent
flow. Consequently, the compressible, supersonic TGV case al-
lows for assessing the stability and accuracy of compressible
flow solvers for shock-turbulence interactions. The simulation
is again initialized using the setup in Eq. (27), i.e. Version II,
and Sutherland’s law is applied to account for the temperature dependence of the viscosity in the compressible case. The
shock capturing scheme introduced in Section 2.3 is applied for
the stabilization of the scheme near shocks. Again, four mesh
resolutions were investigated with 64, 128, 256, and 512 DOF
in each spatial direction and a polynomial degree of N=7.
The permitted maximum of the blending parameter α is set identically across all investigated resolutions. The results re-
ported by Chapelier et al. [43] serve as the reference solution.
The results in Fig. 10 (right) again show that GALÆXI and
FLEXI yield identical results for the temporal evolution of the
solenoidal dissipation rate. Moreover, at higher resolutions, the
results converge to the reference solution, where the results are
almost identical for the largest case of 512 DOF per spatial di-
rection.
6. Application
Based on these verification and validation results, both GA-
LÆXI and FLEXI are applied to the large-scale application case
of a wall-resolved LES of the NASA Rotor 37 [15]. This allows for verifying that GALÆXI can handle complex simulations of compressible flow and for quantifying the gains in efficiency and energy-to-solution obtained by using GPUs. For this, Section 6.1 first
provides some background on the case, while Section 6.2 gives
details on the computational setup. Finally, the results are dis-
cussed in Section 6.3.
6.1. Description
In the following section, the applicability of GALÆXI to-
wards large-scale test cases is demonstrated for the turbulent
flow within a NASA Rotor 37 rectilinear transonic compres-
sor cascade. This rotor was originally employed in one of four
Figure 11: Computational mesh for the simulation of the NASA Rotor 37 case.
The inflow and outflow regions are pruned and a zoom highlights the mesh
around the leading edge.
transonic axial-flow compressor stages designed and tested at
the NASA Lewis Research Center in the late 1970s [15]. With
its geometry parameters and measurement data publicly avail-
able [47, 48], the rotor has since become a benchmark test
case in the turbomachinery research community including CFD
studies [49], investigation of optimization techniques [50], tip
leakage flow analysis [51], and uncertainty quantification ap-
proaches [52]. At its design point, the rotor operates with a
blade tip Mach number of 1.4939, generating an overall pres-
sure ratio of 2.106. The setup investigated here corresponds to
a ground-idle condition, providing a tip Mach number of 0.824
with a total pressure ratio of 1.305. The cascade geometry is
generated by unwinding the blade profile at mid-span and extruding it by 5 % of the chord length. The resulting Reynolds
number based on the inflow velocity and the rotor chord is
972 550. The low operating point and the position at mid-span
result in an inlet relative Mach number of 0.758 and an in-
cidence relative to the mean camberline of 10.1°. As a result
of the high subsonic inflow velocity and near-stall condition,
a transonic expansion region forms on the suction side near
the leading edge. This region is terminated by a near-normal shock and subsequent shock-boundary-layer interaction, with flow separation occurring throughout the suction side. On the pressure side, a small laminar separation region forms, which is subsequently terminated by turbulent re-attachment.
6.2. Computational Setup
The computational setup is identical for both GALÆXI and
FLEXI, except for the hardware on which the simulations are
run. The mesh for the LES comprises one compressor pitch with the compressor blade oriented at the stagger angle of 51.2° and is depicted in Fig. 11. The domain is discretized using 1.2 × 10⁶ elements with N = 5, which results in a total of 2.6 × 10⁸ DOFs for the simulation. The inflow is modeled us-
ing far-field conditions and a subsonic outflow condition [53]
is employed. Additionally, sponge zones [54] are positioned at
the inflow and outflow boundaries to prevent the formation of
artificial reflections. The rotor itself is modeled as an adiabatic
wall and the spanwise and pitchwise boundaries are defined as
periodic. The simulation is performed using the split-form DG
method as introduced in Section 2.2 to mitigate aliasing errors
with the flux formulation given by Pirozzoli [25]. The solution
is advanced in time using a 14-stage 4th-order Runge–Kutta
method [19]. During the simulation, the viscosity is computed
with Sutherland’s law as given in Eq. (6).
The computational resources are chosen such that both codes
run at their maximum efficiency. For GALÆXI, 128 NVIDIA A100 GPUs on HAWK-AI are employed, which yields a total load of 2.0 × 10⁶ DOF per GPU. For FLEXI, the number of CPU nodes is chosen such that the walltime is similar to that of the GALÆXI computation, which is obtained when using 256 nodes (32 768 CPU cores). This results in a load of around 7900 DOF/core, which lies well within the performance optimum of FLEXI [14]. The details of these setups are summa-
rized in Table 3. The simulations are initialized with a precom-
puted converged flow state and are advanced for a total of 8
characteristic time units t∗ = t u∞/c, where t∗ is defined with respect to the inflow velocity u∞ and the chord length c.
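The quoted loads per rank follow directly from the mesh size; the short cross-check below reproduces them.

```python
# Cross-check of the Rotor 37 setup: 1.2e6 elements at N = 5 carry
# (N+1)^3 = 216 DOF each.
n_dof = 1.2e6 * 6**3
print(f"total DOF:    {n_dof:.2e}")           # about 2.6e8
print(f"DOF per GPU:  {n_dof / 128:.2e}")     # about 2.0e6
print(f"DOF per core: {n_dof / 32768:.0f}")   # about 7900
```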
6.3. Results
The instantaneous Mach number distribution on the domain
centerline computed by GALÆXI is depicted in Fig. 12. The
flow enters from the left with a high incidence relative to the
camberline. This results in a shift of the stagnation point to-
wards the pressure side and a strong transonic expansion fan
on the suction side. The supersonic region is terminated with
a near-normal shock, as illustrated in the zoom region. The
corresponding pressure jump results in a forced boundary layer
transition with high levels of unsteadiness. Numerical oscilla-
tions in the vicinity of the discontinuity, i.e. the shock, resulting from the Gibbs phenomenon are mitigated with the convex blending approach outlined in Section 2.3. The grayscale overlay in
the zoom region represents the local values of the blending fac-
tor α. It is evident that the FV shock capturing is active only
near the shock in order to preserve the high numerical order of
the DG operator in areas with a smooth solution. Downstream
of the shock region, the separation of the boundary layer causes
temporally varying blockage, which couples with the upstream flow physics and results in a highly unsteady flow field. Periods with enhanced separation result in counter-rotating vortex shedding, as is visible in the wake downstream of the blade row.
The achieved performance for both codes is summarized in
Table 3. For GALÆXI, a PID increase of about 35 % is ob-
served in comparison to the performance reported in Section 4.
This is attributed to the additional work and load imbalance be-
tween the ranks introduced by the test case, which includes the
sponge zones, the boundary conditions, the shock indicator and
the FV shock capturing scheme. The slight deviation in wall-
Figure 12: Instantaneous field solution for the NASA Rotor 37 case colored by the Mach number. A zoom of the leading edge highlights the supersonic flow region with the local blending values α of the FV shock capturing scheme overlaid for all elements with α = 0.1 (light gray) up to α = 0.7 (black). The domain is periodically extended, which is indicated by a blurred overlay.
Table 3: Setup and performance results for the simulation runs on both CPU and GPU for a simulation time of 8 t∗.

         Ranks     DOF/Rank      P_rank [W]   PID [s]        EPID [J]       Walltime/t∗ [s]   Energy/t∗ [kWh]
GPU      128       2.03 × 10⁶    448          4.58 × 10⁻⁹    2.05 × 10⁻⁶    9209              147
CPU      32 768    7.93 × 10³    4.94         1.02 × 10⁻⁶    5.06 × 10⁻⁶    7538              339
Savings                                                      59.5 %                           56.8 %
time per t∗ between the GPU and CPU cases stems from choosing powers of two for the resources.

The specific power draw per rank P_rank shown in Table 3 is
computed as the overall power delivered to the racks used di-
vided by the number of ranks. Hence, the measured power also
includes the power for the network switches. It is important
to note that due to the specific hardware layout, cooling is in-
cluded in the total power consumption for the GPU case, while
the cooling effort is not included for the CPU system. Hence,
the obtained results tend to favor the CPU implementation and
should thus be seen as a conservative lower bound for the poten-
tial gains in efficiency provided by GPU hardware. Moreover,
the limited accuracy and fidelity of the rack-wise power draw
measurements mean that the results should be seen as a rough estimate.
When comparing the resulting EPID, i.e. the amount of energy necessary to advance a single DOF for one Runge–Kutta stage, GALÆXI more than halves the required energy-to-solution. In to-
tal, GALÆXI requires around 147 kWh to advance the solution
for one characteristic time unit t∗, while FLEXI requires around
339 kWh per t∗ on CPUs. It is reasonable to relate this reduction
in energy demand by GALÆXI to a similar reduction in asso-
ciated carbon emissions. However, it is important to note that
the I/O operations and analysis routines are excluded from the
PID computation. Since these operations are still performed on
the CPU for GALÆXI, the resulting overhead causes a slight
discrepancy in the savings for the EPID and energy-to-solution.
As discussed before, due to the measurement limitations, both
results should be regarded as an estimate and lower bound of
the achieved performance.
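The energy figures in Table 3 follow from the walltime, the number of ranks, and the specific power per rank; the short check below reproduces them up to rounding of the tabulated inputs.

```python
# Energy per characteristic time unit from Table 3:
# Energy/t* = Walltime/t* x #Ranks x P_rank (converted from J to kWh).
runs = {"GPU": (128, 448.0, 9209.0), "CPU": (32768, 4.94, 7538.0)}
energy = {}
for label, (ranks, p_rank, walltime) in runs.items():
    energy[label] = walltime * ranks * p_rank / 3.6e6   # J -> kWh
    print(f"{label}: {energy[label]:.0f} kWh per t*")   # ~147 and ~339 kWh
print(f"savings: {(1.0 - energy['GPU'] / energy['CPU']) * 100:.0f} %")  # ~57 %
```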
7. Conclusion & Outlook
This work presents the open-source flow solver GALÆ-
XI, which implements high-order DG methods on unstructured
meshes for GPU-accelerated HPC systems. GALÆXI is the
GPU-accelerated spinoff of the established FLEXI solver and
it supports the majority of the features provided by FLEXI,
which are continuously being extended. This allows the ap-
plication of GALÆXI for scale-resolving simulations of com-
plex compressible flows including shock waves using modern
GPU-based HPC systems. This work provides details on the
general code design, the parallelization strategy, and the im-
plementation approach for the compute kernels. Thus, it serves as an indication of how existing spectral element codes can be ported efficiently to GPUs. As long as the GPUs are suffi-
ciently loaded, the results demonstrate excellent scaling prop-
erties for GALÆXI on up to 1024 GPUs. The correct high-
order accurate implementation of GALÆXI has been verified
by demonstrating the expected convergence rates. Furthermore,
the code has been validated against reference data for the in-
compressible and compressible variants of the established TGV.
As a demonstration of a large-scale application, GALÆXI was employed for a wall-resolved LES of a NASA Rotor 37 compressor cascade. Using this example of compress-
ible flow, the implemented finite volume subcell approach was
demonstrated to yield a stable and accurate scheme for cap-
turing the unsteady supersonic expansion region at the leading
edge. In addition, GALÆXI has been shown to use less than half the energy required to run the same simulation with the CPU implementation, reducing the required energy from around 339 kWh to 147 kWh per characteristic time unit and thereby roughly halving the associated carbon emissions.
Currently, GALÆXI is implemented using the CUDA For-
tran framework, which does not support GPU hardware from
vendors other than NVIDIA. Current efforts are focused on in-
corporating different compute backends into GALÆXI to sup-
port accelerator devices of different vendors alongside the base-
line CPU implementation via hardware abstractions. The en-
visioned code is intended to be readily extendable to arbitrary
compute devices, such that novel accelerator types can be incor-
porated without fundamental code redesigns. Concurrent work
focuses on further optimization of key routines, in particular the
VolInt and FillFlux routines, which together consume almost
half of the computing time as was demonstrated. Along the
same lines, automatic tuning of hardware-specific launch con-
figurations is to be integrated into the code. This is expected to
provide high levels of performance across a wide range of dif-
ferent hardware. Here, the KernelTuner [37] package appears to
be a suitable choice. Lastly, graph-based approaches to domain
decomposition might improve the utilization of the direct, high-
bandwidth connection between individual GPUs on the same
node by maximizing the amount of intra-node and minimizing
the amount of inter-node communication.
This work has demonstrated that high-order DG methods
are well-suited candidates for the efficient simulation of com-
pressible flows on GPU systems. GALÆXI has showcased
that unstructured mesh topologies and adequate state-of-the-art
shock capturing based on FV subcells impose only negligible
overhead on GPU hardware. Most importantly, GALÆXI is
capable of reducing the carbon emission associated with large-
scale flow simulations by more than 55 % in comparison to the
CPU reference, which renders it a potent tool for the upcoming
generation of sustainable, exascale HPC systems.
Acknowledgments
This work was funded by the European Union. This work
has received funding from the European High Performance Com-
puting Joint Undertaking (JU) and Sweden, Germany, Spain,
Greece, and Denmark under grant agreement No 101093393.
Moreover, the research presented was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy EXC 2075 – 390740016, by
the DFG Rebound – 420603919, and in the framework of the
research unit FOR 2895. We acknowledge the support by the
Stuttgart Center for Simulation Science (SimTech). The authors
gratefully acknowledge the Gauss Centre for Supercomputing
e.V. (www.gauss-centre.eu) for funding this project by provid-
ing computing time through the John von Neumann Institute for
Computing (NIC) on the GCS Supercomputer JUWELS [55]
at Jülich Supercomputing Centre (JSC) as well as the support
and the computing time on “Hawk” and its “Hawk-AI” exten-
sion provided by the Supercomputing Centre Stuttgart (HLRS)
through the project “hpcdg”. This work was completed in part
at the Helmholtz GPU Hackathon, part of the Open Hackathons
program. The authors would like to acknowledge OpenACC-
Standard.org, JSC, HZDR, and HIDA for their support.
Data Availability Statement
The GALÆXI and FLEXI codes used within this work are
available under the GPLv3 license at:
•https://github.com/flexi-framework/galaexi
•https://github.com/flexi-framework/flexi
The data generated in the context of this work and instruc-
tions to reproduce them with these codes are made available
under the CC-BY 4.0 license sorted by section at:
•10.18419/darus-4140 (Section 4)
•10.18419/darus-4155 (Section 5.1)
•10.18419/darus-4139 (Section 5.2)
•10.18419/darus-4138 (Section 6)
References
[1] P. Lynch, The origins of computer weather prediction and climate mod-
eling, Journal of Computational Physics 227 (2008) 3431–3444.
[2] L. Gundelwein, J. Miró, F. G. Barlatay, C. Lapierre, K. Rohr, L. Duong,
Personalized stent design for congenital heart defects using pulsatile
blood flow simulations, Journal of Biomechanics 81 (2018) 68–75.
[3] M. S. Lundstrom, M. A. Alam, Moore’s law: The journey ahead, Science
378 (2022) 722–723.
[4] TOP500 - November 2023, https://www.top500.org/lists/
top500/list/2023/11/, 2023. Accessed: 2024-03-13.
[5] GREEN500 - November 2023, https://www.top500.org/lists/
green500/list/2023/11/, 2023. Accessed: 2024-03-13.
[6] P. Fischer, J. Lottes, H. Tufo, Nek5000, Technical Report, Argonne Na-
tional Lab, Argonne, IL (United States), 2007.
[7] S. Markidis, J. Gong, M. Schliephake, E. Laure, A. Hart, D. Henty,
K. Heisey, P. Fischer, OpenACC acceleration of the Nek5000 spectral
element code, The International Journal of High Performance Computing
Applications 29 (2015) 311–319.
[8] P. Fischer, S. Kerkemeier, M. Min, Y.-H. Lan, M. Phillips, T. Rathnayake,
E. Merzari, A. Tomboulides, A. Karakus, N. Chalmers, et al., NekRS, a
GPU-accelerated spectral element Navier–Stokes solver, Parallel Com-
puting 114 (2022) 102982.
[9] D. S. Medina, A. St-Cyr, T. Warburton, OCCA: A unified approach to
multi-threading languages, arXiv preprint arXiv:1403.0968 (2014).
[10] N. Jansson, M. Karp, A. Podobas, S. Markidis, P. Schlatter, Neko: A
modern, portable, and scalable framework for high-fidelity computational
fluid dynamics, Computers & Fluids (2024) 106243.
[11] F. D. Witherden, B. C. Vermeire, P. E. Vincent, Heterogeneous computing
on mixed unstructured grids with PyFR, Computers & Fluids 120 (2015)
173–186.
[12] D. Arndt, W. Bangerth, D. Davydov, T. Heister, L. Heltai, M. Kronbichler,
M. Maier, J.-P. Pelteret, B. Turcksin, D. Wells, The deal.II finite element
library: Design, features, and insights, Computers & Mathematics with
Applications 81 (2021) 407–422.
[13] R. Anderson, J. Andrej, A. Barker, J. Bramwell, J.-S. Camier, J. Cerveny,
V. Dobrev, Y. Dudouit, A. Fisher, T. Kolev, et al., MFEM: A modular
finite element methods library, Computers & Mathematics with Applica-
tions 81 (2021) 42–74.
[14] N. Krais, A. Beck, T. Bolemann, H. Frank, D. Flad, G. Gassner, F. Hin-
denlang, M. Hoffmann, T. Kuhn, M. Sonntag, C.-D. Munz, FLEXI: A
high order discontinuous Galerkin framework for hyperbolic–parabolic
conservation laws, Computers & Mathematics with Applications 81
(2021) 186–219.
[15] L. Reid, R. D. Moore, Design and overall performance of four highly
loaded, high speed inlet stages for an advanced high-pressure-ratio core
compressor, Technical Report 1337, NASA Lewis Research Center,
Cleveland, OH, United States, 1978.
[16] W. Sutherland, LII. The viscosity of gases and molecular force, The
London, Edinburgh, and Dublin Philosophical Magazine and Journal of
Science 36 (1893) 507–531.
[17] D. A. Kopriva, Implementing spectral methods for partial differential
equations: Algorithms for scientists and engineers, Springer Science &
Business Media, 2009.
[18] M. H. Carpenter, C. A. Kennedy, Fourth-order 2N-storage Runge–Kutta
schemes, Technical Report NASA-TM-109112, NASA, Langley Re-
search Center, 1994.
[19] J. Niegemann, R. Diehl, K. Busch, Efficient low-storage Runge–Kutta
schemes with optimized stability regions, Journal of Computational
Physics 231 (2012) 364–372.
[20] R. M. Kirby, G. E. Karniadakis, De-aliasing on non-uniform grids: Al-
gorithms and applications, Journal of Computational Physics 191 (2003)
249–264.
[21] A. D. Beck, D. G. Flad, C. Tonhäuser, G. Gassner, C.-D. Munz, On
the influence of polynomial de-aliasing on subgrid scale models, Flow,
Turbulence and Combustion 97 (2016) 475–511.
[22] J. S. Hesthaven, T. Warburton, Nodal discontinuous Galerkin methods,
Texts in Applied Mathematics, Springer, New York, NY, 2007.
[23] D. Flad, A. Beck, C.-D. Munz, Simulation of underresolved turbulent
flows by adaptive filtering using the high order discontinuous Galerkin
spectral element method, Journal of Computational Physics 313 (2016)
1–12.
[24] G. J. Gassner, A. R. Winters, D. A. Kopriva, Split form nodal discontin-
uous Galerkin schemes with summation-by-parts property for the com-
pressible Euler equations, Journal of Computational Physics 327 (2016)
39–66.
[25] S. Pirozzoli, Numerical methods for high-speed flows, Annual Review of
Fluid Mechanics 43 (2011) 163–194.
[26] F. Bassi, S. Rebay, A high-order accurate discontinuous finite element
method for the numerical solution of the compressible Navier–Stokes
equations, Journal of Computational Physics 131 (1997) 267–279.
[27] A. D. Beck, T. Bolemann, D. Flad, H. Frank, G. J. Gassner, F. Hindenlang,
C.-D. Munz, High-order discontinuous Galerkin spectral element meth-
ods for transitional and turbulent flow simulations, International Journal
for Numerical Methods in Fluids 76 (2014) 522–548.
[28] J. Zeifang, A. Beck, A data-driven high order sub-cell artificial viscos-
ity for the discontinuous Galerkin spectral element method, Journal of
Computational Physics 441 (2021) 110475.
[29] P.-O. Persson, J. Peraire, Sub-cell shock capturing for discontinuous
Galerkin methods, in: 44th AIAA Aerospace Sciences Meeting and Ex-
hibit, 2006, p. 112.
[30] A. Klöckner, T. Warburton, J. S. Hesthaven, Viscous shock capturing in
a time-explicit discontinuous Galerkin method, Mathematical Modelling
of Natural Phenomena 6 (2011) 57–83.
[31] M. Bohm, S. Schermeng, A. R. Winters, G. J. Gassner, G. B. Jacobs,
Multi-element SIAC filter for shock capturing applied to high-order dis-
continuous Galerkin spectral element methods, Journal of Scientific Com-
puting 81 (2019) 820–844.
[32] M. Sonntag, C.-D. Munz, Shock capturing for discontinuous Galerkin
methods using finite volume subcells, in: Finite Volumes for Complex
Applications VII-Elliptic, Parabolic and Hyperbolic Problems: FVCA 7,
Berlin, June 2014, Springer, 2014, pp. 945–953.
[33] F. Fambri, M. Dumbser, O. Zanotti, Space-time adaptive ADER-DG
schemes for dissipative flows: Compressible Navier–Stokes and resistive
MHD equations, Computer Physics Communications 220 (2017) 297–
318.
[34] S. Hennemann, A. M. Rueda-Ramírez, F. J. Hindenlang, G. J. Gassner, A
provably entropy stable subcell shock capturing approach for high order
split form DG for the compressible Euler equations, Journal of Compu-
tational Physics 426 (2021) 109935.
[35] A. Beck, M. Kurz, Toward discretization-consistent closure schemes for
large eddy simulation using reinforcement learning, Physics of Fluids 35
(2023).
[36] A. M. Rueda-Ramírez, W. Pazner, G. J. Gassner, Subcell limiting strate-
gies for discontinuous Galerkin spectral element methods, Computers &
Fluids 247 (2022) 105627.
[37] B. van Werkhoven, Kernel Tuner: A search-optimizing GPU code auto-
tuner, Future Generation Computer Systems 90 (2019) 347–358.
[38] M. Blind, M. Gao, D. Kempf, P. Kopper, M. Kurz, A. Schwarz, A. Beck,
Towards exascale CFD simulations using the discontinuous Galerkin
solver FLEXI, 2023. arXiv:2306.12891.
[39] P. J. Roache, Code verification by the method of manufactured solutions,
J. Fluids Eng. 124 (2002) 4–10.
[40] F. Hindenlang, G. J. Gassner, C. Altmann, A. Beck, M. Staudenmaier,
C.-D. Munz, Explicit discontinuous Galerkin methods for unsteady prob-
lems, Computers & Fluids 61 (2012) 86–93.
[41] G. J. Gassner, F. Lörcher, C.-D. Munz, J. S. Hesthaven, Polymorphic
nodal elements and their application in discontinuous Galerkin methods,
Journal of Computational Physics 228 (2009) 1573–1590.
[42] J. DeBonis, Solutions of the Taylor–Green vortex problem using high-
resolution explicit finite difference methods, in: 51st AIAA Aerospace
Sciences Meeting, 2013, p. 382.
[43] J.-B. Chapelier, D. J. Lusher, W. Van Noordt, C. Wenzel, T. Gibis,
P. Mossier, A. D. Beck, G. Lodato, C. Brehm, M. Ruggeri, C. Scalo,
N. Sandham, Comparison of high-order numerical methodologies for
the simulation of the supersonic Taylor-Green vortex flow, submitted to
Physics of Fluids (2024).
[44] O. Zeman, Dilatation dissipation: The concept and application in model-
ing compressible mixing layers, Physics of Fluids A: Fluid Dynamics 2
(1990) 178–188.
[45] S. Sarkar, G. Erlebacher, M. Y. Hussaini, H. O. Kreiss, The analysis and
modelling of dilatational terms in compressible turbulence, Journal of
Fluid Mechanics 227 (1991) 473–493.
[46] D. J. Lusher, N. D. Sandham, Assessment of low-dissipative shock-
capturing schemes for the compressible Taylor–Green vortex, AIAA
Journal 59 (2021) 533–545.
[47] R. D. Moore, L. Reid, Performance of single-stage axial-flow transonic
compressor with rotor and stator aspect ratios of 1.19 and 1.26 respec-
tively, and with design pressure ratio of 2.05, Technical Report 1659,
National Aeronautics and Space Administration, Cleveland, OH, United
States, 1980.
[48] K. L. Suder, Experimental Investigation of the Flow Field in a Transonic,
Axial Flow Compressor with Respect to the Development of Blockage
and Loss, Ph.D. thesis, Case Western Reserve University, 1996.
[49] J. D. Denton, Lessons from Rotor 37, Journal of Thermal Science 6
(1997).
[50] E. Benini, Three-dimensional multi-objective design optimization of a
transonic compressor rotor, Journal of Propulsion and Power 20 (2004)
559–565. doi:10.2514/1.2703.
[51] P. Seshadri, G. T. Parks, S. Shahpar, Leakage uncertainties in compres-
sors: The case of rotor 37, Journal of Propulsion and Power 31 (2015)
456–466. doi:10.2514/1.b35039.
[52] A. Loeven, H. Bijl, The application of the probabilistic collo-
cation method to a transonic axial flow compressor, in: 51st
AIAA/ASME/ASCE/AHS/ASC structures, structural dynamics, and ma-
terials conference 18th AIAA/ASME/AHS adaptive structures confer-
ence 12th, American Institute of Aeronautics and Astronautics, 2010.
doi:10.2514/6.2010-2923.
[53] J.-R. Carlson, Inflow/outflow boundary conditions with application to
FUN3D, Technical Report NASA/TM–2011-217181, NASA, Langley
Research Center, 2011.
[54] D. Flad, A. D. Beck, G. Gassner, C.-D. Munz, A discontinuous Galerkin
spectral element method for the direct numerical simulation of aeroacous-
tics, in: 20th AIAA/CEAS aeroacoustics conference, 2014, p. 2740.
[55] Jülich Supercomputing Centre, JUWELS Cluster and Booster: Exas-
cale Pathfinder with Modular Supercomputing Architecture at Juelich
Supercomputing Centre, Journal of large-scale research facili-
ties 7 (2021). URL: http://dx.doi.org/10.17815/jlsrf-7-183. doi:10.17815/jlsrf-7-183.