Architecture Specific Generation of Large Scale Lattice Boltzmann Methods for Sparse Complex Geometries
Philipp Suffa¹, Markus Holzer², Harald Köstler¹,³ and Ulrich Rüde¹,²
Abstract
We implement and analyse a sparse / indirect-addressing data structure for the Lattice Boltzmann Method to support efficient compute kernels for fluid dynamics problems with a high number of non-fluid nodes in the domain, such as in porous media flows. The data structure is integrated into a code generation pipeline to enable sparse Lattice Boltzmann Methods with a variety of stencils and collision operators and to generate efficient kernel code for CPUs as well as for AMD and NVIDIA accelerator cards. We optimize these sparse kernels with an in-place streaming pattern to save memory accesses and memory consumption, and we implement a communication hiding technique to improve scalability. We present single GPU performance results with up to 99% of maximal bandwidth utilization. We integrate the optimized generated kernels into the high performance framework WALBERLA and achieve a scaling efficiency of at least 82% on up to 1024 NVIDIA A100 GPUs and up to 4096 AMD MI250X GPUs on modern HPC systems. Further, we set up three different applications to test the sparse data structure on realistic demonstrator problems. We show performance results for flow through porous media, free flow over a particle bed, and blood flow in a coronary artery. We achieve a maximal performance speed-up of 2 and a significantly reduced memory consumption, by up to 75%, with the sparse / indirect-addressing data structure compared to the direct-addressing data structure for these applications.
Keywords
Sparse Lattice Boltzmann Method, Indirect Addressing, Complex Geometries, High Performance Computing, GPU computing, Large Scale, Porous Media
Introduction
The increasing power of high-performance computing (HPC) systems enables computational fluid dynamics (CFD) simulations that were out of scope only a few years ago. The leading HPC systems in the Top500 list reach a peak performance in the exaFLOPS range, i.e., they can perform 10¹⁸ floating point operations per second. This trend is driven in part by the use of accelerators, such as NVIDIA, AMD, or Intel Graphics Processing Units (GPUs). With these new computing capabilities, CFD problems such as fully resolved porous media simulations at relevant scales, as presented in Mattila et al. (2016), or entire-body arterial flows, as in Randles et al. (2015), can now be tackled. However, it is not trivial to fully utilize the maximum performance of these HPC systems, especially for systems with accelerators (Brodtkorb et al. (2013), Hijma et al. (2023), Lai et al. (2020), Rak (2024)).
The Lattice Boltzmann method (LBM) (Chen and Doolen (1998), Krüger et al. (2017)) is an efficient, inherently parallel method for solving CFD problems in complex geometries. On many HPC systems, the LBM shows excellent performance, as presented in Liu et al. (2023), Spinelli et al. (2023), Watanabe and Hu (2022) or Godenschwager et al. (2013).
There are two common ways to store data for LBM simulations. The direct-addressing LBM stores and computes all cells of the domain. We refer to this technique as the "dense" or "direct-addressing" data structure in the following. It has been shown to be very fast and efficient for most simulation setups (see Kummerländer et al. (2023), Lehmann et al. (2022) and Latt et al. (2021)). However, it suffers in performance and memory consumption for simulation domains with a high number of non-fluid nodes, such as in porous media flows. In the following, we call such domains "sparse domains". In contrast, we refer to simulations with a high percentage of fluid nodes, such as a free channel flow, as "dense domains".
The second way to store data for LBM simulations is the indirect-addressing storage format, which we call the "sparse" data structure in the following. This approach stores only fluid cells and can therefore save a significant amount of memory. Additionally, the sparse approach reaches superior performance for sparse complex geometries compared with the dense approach. In this sense, the sparse approach
has been found to be very efficient, see e.g. Wittmann et al. (2013) and Schulz et al. (2002). In particular, the sparse data structure has been employed successfully to simulate porous media flows, as in Pan et al. (2004), Zeiser et al. (2009), Wang et al. (2005) and Vidal et al. (2010).

¹ Chair for System Simulation, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
² Parallel Algorithm Team, CERFACS, Toulouse, France
³ Erlangen National High Performance Computing Center (NHR@FAU), Erlangen, Germany
Corresponding author: Philipp Suffa. Email: philipp.suffa@fau.de
In this work we study the code generation of highly efficient LBMs and their performance on large HPC systems. The sparse data structure is realised with the code generation framework lbmpy (Bauer et al. (2021)), allowing it to run on a variety of architectures, such as all common CPUs as well as NVIDIA and AMD accelerators. The generated sparse compute kernels are integrated into the multiphysics HPC framework WALBERLA (Bauer et al. (2020)) to enable massively parallel simulations with excellent scalability (Holzer et al. (2024)).
We compare the performance of the generated sparse kernels with the dense approach and present the scaling performance of the sparse data structure on modern HPC systems such as JUWELS Booster (Alvarez (2021)) and LUMI. Further, we show performance results for realistic model problems, namely flow through a porous medium, flow over a packed bed, and coronary artery flow, on a high number of accelerator cards.
Lattice Boltzmann Method
The lattice Boltzmann method is a mesoscopic method established as an alternative to classical Navier-Stokes solvers (Krüger et al. (2017)). The simulation domain is usually discretized by a lattice of square cells. A cell at position $\mathbf{x}$ stores particle distribution functions (PDFs) $f_i(\mathbf{x}, t)$, which represent the probability of finding particles at time $t$ with discrete velocity $\mathbf{c}_i$. The macroscopic quantities, lattice density $\rho$ and momentum density $\rho\mathbf{u}$, can be computed from the PDFs using

$$\rho(\mathbf{x}, t) = \sum_i f_i(\mathbf{x}, t) \quad\text{and}\quad \rho\mathbf{u}(\mathbf{x}, t) = \sum_i \mathbf{c}_i f_i(\mathbf{x}, t). \tag{1}$$

A standard set of discrete three-dimensional velocity directions is the D3Q19 stencil, which results in $Q = |\{\mathbf{c}_i\}| = 19$ PDFs per cell.
The Boltzmann equation discretized in time, space, and velocity space reads

$$f_i(\mathbf{x} + \mathbf{c}_i \Delta t,\, t + \Delta t) = f_i(\mathbf{x}, t) + \Omega_i(\mathbf{x}, t), \tag{2}$$

with $\Delta x$ as lattice spacing, $\Delta t$ as time step size, and $\Omega_i$ as collision operator. The LB equation can be separated into a collision step

$$\tilde{f}_i(\mathbf{x}, t) = f_i(\mathbf{x}, t) + \Omega_i(\mathbf{x}, t) \tag{3}$$

and a streaming step

$$f_i(\mathbf{x} + \mathbf{c}_i \Delta t,\, t + \Delta t) = \tilde{f}_i(\mathbf{x}, t), \tag{4}$$

with $\tilde{f}_i$ denoting the post-collision state of the PDFs.
The simplest collision operator is the single relaxation time (SRT) operator

$$\Omega_i^{\mathrm{SRT}}(f) = -\frac{f_i - f_i^{\mathrm{eq}}}{\tau}\,\Delta t, \tag{5}$$

which relaxes the PDFs towards the equilibrium $f^{\mathrm{eq}}$, determined by the relaxation time $\tau$. The equilibrium is given by

$$f_i^{\mathrm{eq}}(\mathbf{x}, t) = w_i \rho \left( 1 + \frac{\mathbf{u} \cdot \mathbf{c}_i}{c_s^2} + \frac{(\mathbf{u} \cdot \mathbf{c}_i)^2}{2 c_s^4} - \frac{\mathbf{u} \cdot \mathbf{u}}{2 c_s^2} \right) \tag{6}$$

with the speed of sound $c_s^2 = (1/3)\,\Delta x^2 / \Delta t^2$ and velocity set specific weights $w_i$. The fluid velocity of a cell at position $\mathbf{x}$ is calculated as $\mathbf{u}(\mathbf{x}, t) = \rho\mathbf{u}(\mathbf{x}, t) / \rho(\mathbf{x}, t)$. The kinematic viscosity $\nu$ is related to the relaxation time $\tau$ and the dimensionless relaxation parameter $\omega = \Delta t / \tau \in \,]0, 2[$ by

$$\nu = c_s^2 \left( \frac{\tau}{\Delta t} - \frac{1}{2} \right) = c_s^2 \left( \frac{1}{\omega} - \frac{1}{2} \right). \tag{7}$$
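To make the method concrete, the following is a minimal, self-contained NumPy sketch of one dense SRT time step implementing Equations (1)-(7) on a D2Q9 lattice with periodic boundaries. It is purely didactic and not the generated WALBERLA/lbmpy kernel code discussed later; all names are illustrative.

import numpy as np

# D2Q9 velocities and weights (c_s^2 = 1/3 in lattice units)
c = np.array([(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1),
              (1, 1), (-1, -1), (1, -1), (-1, 1)])
w = np.array([4/9] + [1/9]*4 + [1/36]*4)

def equilibrium(rho, u):
    # Eq. (6) with 1/c_s^2 = 3, 1/(2 c_s^4) = 4.5, 1/(2 c_s^2) = 1.5
    cu = np.einsum('id,dxy->ixy', c, u)
    uu = np.einsum('dxy,dxy->xy', u, u)
    return w[:, None, None] * rho * (1 + 3*cu + 4.5*cu**2 - 1.5*uu)

def step(f, omega):
    rho = f.sum(axis=0)                        # Eq. (1): density
    u = np.einsum('ixy,id->dxy', f, c) / rho   # Eq. (1): velocity
    f = f + omega * (equilibrium(rho, u) - f)  # Eqs. (3)/(5): SRT collision
    for i, ci in enumerate(c):                 # Eq. (4): periodic pull streaming
        f[i] = np.roll(f[i], shift=tuple(ci), axis=(0, 1))
    return f

f = equilibrium(np.ones((32, 32)), np.zeros((2, 32, 32)))  # fluid at rest
for _ in range(100):
    f = step(f, omega=1.8)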
Data Structures in WALBERLA
In the WALBERLA framework, the simulation domain is partitioned into uniform cubic blocks, typically with a size of around 64³ cells on CPUs and about 256³ cells on GPUs. In Figure 1 the block partitioning into uniform cubic blocks is shown. For parts of the domain where no fluid is present, blocks that consist only of obstacle cells can be discarded. The remaining blocks are then distributed to the available MPI processes, so that every process gets at least one block. However, more blocks per process are also possible and can be useful, for example, when load balancing is necessary, as we will see in the following. The organization of the computational grid into blocks introduces a hierarchy that is essential for efficient processing at extreme scale, since many operations can be better organized in such a hierarchy. In particular, performing the mesh partitioning and load balancing in terms of blocks keeps the complexity and overhead of these algorithms small (see Schornbaum and Rüde (2016) and Schornbaum and Rüde (2018)).
As indicated in Figure 1, the dense / direct-addressing data structure implemented in WALBERLA stores the PDFs of every cell in memory, even for non-fluid cells. However, for sparse domains or porous media flows, the porosity

$$\phi = \frac{N_F}{N}, \tag{8}$$

with $N$ as the total number of cells and $N_F$ as the number of fluid cells, can be arbitrarily small. Therefore, much memory may be wasted by storing non-fluid cells on these blocks. Furthermore, a branch statement is needed in the LBM kernel to check whether the current cell is a fluid or an obstacle cell. Additionally, non-fluid cells can cause unnecessary memory traffic when cache lines contain fluid and non-fluid cells, and hardware prefetchers may read data from non-fluid cells that is never used. Especially when the domain is sparse, the dense approach creates a significant overhead and can lead to a significant performance loss (Godenschwager et al. (2013)).
Sparse Data Structure
To avoid the disadvantages of the dense data structure, we
have developed a sparse data structure in WALBE R LA and
lbmpy. This technique is also used by other LBM frameworks
such as HARVEY (Randles et al. (2013)), Musubi (Hasert
et al. (2014)), ILBDC (Zeiser et al. (2009)) and MUPHY
(Bernaschi et al. (2008)), just to mention a few.
Figure 1 (panels: sparse simulation domain, block-partitioned domain, dense data structure, sparse data structure). Exemplary setup of a sparse simulation domain in 2D with a low percentage of fluid covering the domain (light blue) and a high number of obstacle cells; visualisation of the block partitioning with extraction of blocks without fluid; illustration of a dense and a sparse data structure for an exemplary setup of 5x5 cells per block and a D2Q5 stencil. While the dense data structure stores PDFs and operates on all cells, the sparse data structure only stores and operates on fluid cells.
The idea is to store only fluid cells in a one-dimensional array, which we call the PDF-list, so that no memory is wasted on storing non-fluid cells. Furthermore, with such a data structure, LBM kernels only have to iterate over fluid cells, and no branch conditions are needed in the innermost kernel loops. On the other hand, one loses spatial information when storing cell data in a linear array only. With the direct-addressing data structure, PDFs can easily be accessed by their spatial location $\mathbf{x}$ and the PDF index $i$. This access via index arithmetic is not possible for the sparse PDF-list.

A second data structure, the index-list, is introduced to recover the lost spatial information. This list stores the streaming information from one PDF to another: for each PDF, the index-list stores the location of the PDF to which it will propagate in the streaming step. Using the index-list, we can access the neighbors of a cell with one indirection. Therefore, this approach is called an indirect-addressing scheme.

While the PDF-list consists of $N_F \cdot Q$ entries, where $Q$ is the size of the stencil, the index-list only consists of $N_F \cdot (Q-1)$ entries, since the center PDF need not be stored, as no propagation information is needed for the center PDF.
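The following sketch illustrates, under simplifying assumptions, how a fused stream-collide sweep can traverse the PDF-list using the index-list with the pull pattern. It ignores the appended boundary and ghost-layer PDFs discussed below, assumes a plain SoA layout with stride num_fluid, and uses an illustrative collide callback; the actual generated kernels are architecture-specific and avoid the temporary array used here.

import numpy as np

def sparse_stream_collide(pdf_src, pdf_dst, index_list, num_fluid, Q, collide):
    """One pull-pattern sweep over the sparse PDF-list (didactic sketch).

    pdf_src, pdf_dst: flat PDF-lists in SoA layout; direction i of cell c
                      sits at position i * num_fluid + c.
    index_list:       (Q-1) pull indices per fluid cell; the center PDF
                      (direction 0) has no propagation and is not stored.
    """
    for cell in range(num_fluid):            # iterate over fluid cells only
        f = np.empty(Q)
        f[0] = pdf_src[cell]                 # center PDF: no indirection
        for i in range(1, Q):                # one indirection per direction
            f[i] = pdf_src[index_list[(i - 1) * num_fluid + cell]]
        f = collide(f)                       # any collision operator
        for i in range(Q):
            pdf_dst[i * num_fluid + cell] = f[i]

A dense kernel, in contrast, computes the pull positions by index arithmetic instead of loading them from memory; this load is exactly the extra index-list traffic quantified later in Tables 1 and 2.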
The exact structure of the PDF-list and index-list is illustrated in Figure 2a and Figure 2b, respectively. We show the PDF-list in a Structure-of-Arrays (SoA) format, so all PDFs of one direction lie next to each other in memory. The demonstrator domain consists of fluid cells (white), no-slip boundary cells, which indicate an obstacle (grey), velocity bounce back boundary cells (blue), and ghost layer cells, which are needed for the communication between blocks (light yellow).

In Figure 2a the PDF of cell 0 in direction west ($PDF_w^0$) is stored at position 40 in the PDF-list, and the $PDF_w^1$ of cell 1 is stored at position 41. To perform a streaming step for $PDF_w^0$ in cell 0, we have to look up the pull index (for a presumed pull-streaming pattern) in the index-list. This is illustrated in Figure 2b, where the pull index of $PDF_w^0$ is the PDF 41. This makes sense, as for direction west we pull $PDF_w$ from the right neighbour cell. This pull index look-up in the index-list is done for all PDFs of all fluid cells to perform a complete streaming step. The actual layout of the PDF-list and index-list in memory is illustrated in Figure 2c, where the SoA layout is used.
Sparse Boundary Conditions Some modifications to the list data structures are made to support the implementation of boundary conditions. No-slip boundary conditions can easily be realised by setting the pull index of a PDF that would pull from a no-slip boundary cell to the inverse direction of that PDF. This is also illustrated in Figure 2b. For illustration we focus on $PDF_w^1$ of cell 1. It would pull from its right neighbour cell, but this is an obstacle (no-slip) cell. So $PDF_w^1$ (PDF 41) pulls from the PDF in east direction of the same cell 1, which is $PDF_e^1$ with index 21. Periodic boundary conditions are also easy to implement; here, the pull index of a PDF that must stream from the periodic boundary is simply set to the corresponding PDF on the opposite side of the domain.

For boundary conditions other than no-slip or periodic, PDFs that correspond to a boundary cell but point to a fluid cell must be appended to the PDF-list. Further, the index-list has to be modified so that PDFs of fluid cells next to boundary cells pull from these boundary PDFs. This is illustrated for velocity-bounce-back boundary conditions (UBB) in Figure 2. Here, $PDF_w^9$ (PDF 49, cell 9, direction west) pulls from $PDF_w$ of the UBB boundary cell right next to it, which is the appended PDF with index 51.
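A sketch of how the no-slip pull indices could be set up during initialization, while spatial information is still available from the flag field. The helpers neighbor_of and pdf_index, the flag constants, and the inverse mapping are illustrative assumptions; in the real pipeline this setup is generated code.

FLUID, NO_SLIP = 0, 1   # illustrative flag values

def build_pull_indices(index_list, flags, neighbor_of, pdf_index,
                       inverse, num_fluid, Q):
    for cell in range(num_fluid):
        for i in range(1, Q):               # center PDF needs no pull index
            src = neighbor_of(cell, i)      # cell the pull would read from
            if flags[src] == NO_SLIP:
                # no-slip: redirect to the inverse-direction PDF of the
                # same cell, which realises the bounce-back rule
                index_list[(i - 1) * num_fluid + cell] = pdf_index(cell, inverse[i])
            else:
                index_list[(i - 1) * num_fluid + cell] = pdf_index(src, i)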
Sparse Communication For the dense data structure, every block has a ghost layer of at least one cell all around, in which it stores the PDFs traveling in the corresponding direction. This ghost layer is used to communicate PDF information between MPI processes. For the communication between sparse blocks, we also have to append these ghost layer PDFs to the PDF-list and modify the index-list so that cells next to the MPI interface pull from these ghost layer PDFs (see Figure 2, yellow cells).
Figure 2. Structure of the PDF-list and the index-list for an exemplary D2Q5 velocity set. The domain contains fluid cells (white), ghost layers (light yellow), velocity-bounce-back (UBB) boundaries (light blue) and no-slip boundaries (grey). The directions of the PDF stencil are indicated by colors as well. In direction west there is an MPI interface to the neighboring block. North, east and south cells next to the presented cells are also considered no-slip cells.
(a) PDF-list in SoA layout with appended PDFs from velocity-bounce-back (UBB) boundaries (light blue) and ghost layers (light yellow). The numbers indicate the position of each PDF in the PDF-list.
(b) Structure of the index-list for the corresponding PDF-list in Figure 2a. The numbers indicate from which position in the PDF-list to pull in the streaming step.
(c) Actual structure of the PDF-list and index-list as they lie in memory in SoA layout: fluid cells, then boundaries, then ghost layer, with the pull indices grouped by direction (north, east, south, west).
Nevertheless, as we only append boundary and ghost-
layer PDFs pointing to fluid, the memory overhead of the
additionally stored PDFs is relatively low. In LBM kernels,
we still only need to iterate over fluid cells.
Code Generation for Sparse Kernels
Many variants of the LBM have been developed over the last decades, which vary in complexity, accuracy, and computational cost. The code generation framework lbmpy is capable of generating kernels for most of these LBMs. It supports a wide range of velocity sets, for example D2Q5, D2Q9, D3Q15, D3Q19, D3Q27, and more. Further, highly efficient code for different collision operators can be generated. For example, the classical collision models such as the single-relaxation-time (SRT), two-relaxation-time (TRT), and multi-relaxation-time (MRT) operators (Krüger et al. (2017)) are available. More advanced collision models are also supported, such as the central moment operator or the cumulant operator. The cumulant LBM, e.g., provides superior accuracy and stability for high Reynolds number flows (see Geier et al. (2015)). From the SRT model to the cumulant model, the collision models increase in complexity and in the number of moment transformations, e.g. the transformation from moment space to central moment space or to cumulant space. This can increase the number of floating point operations in the collision step significantly. However, due to optimizations such as common subexpression elimination (CSE), the number of operations per cell lies between only 200 and 400 FLOPs for a D3Q19 stencil, irrespective of the collision model (Hennig et al. (2022)). Therefore, the compute kernels remain memory-bound, as the number of memory accesses stays constant for all collision operators. Consequently, we report similar performance for all collision operators in Figure 8 below.
Additionally, lbmpy provides a generic development tool
to design new collision schemes. The high-level domain-
specific language of lbmpy allows the user to formulate,
extend, and test various Lattice Boltzmann Methods (Bauer
et al. (2021)).
To profit from the functionalities of the code generation pipeline lbmpy, we integrated the generation of new sparse LBM kernels, boundary handling kernels, and communication kernels. Altogether, we are now able to generate efficient sparse kernels for various velocity sets and collision operators, which can run on all common CPUs and on NVIDIA and AMD GPUs.
In Figure 3, the code generation pipeline is presented. In the model creation, the user defines the LB method by choosing a specific collision model, a stencil, and a streaming pattern. Further, a force model, the option of storing the PDFs in a zero-centered fashion, and similar choices can be set. After defining the model, the LB method is represented as a set of equations stored in an abstract syntax tree (AST). This AST is then passed to pystencils (Bauer et al. (2019)), which can perform further optimizations, such as common subexpression elimination, loop splitting, or adding vector intrinsics for single-instruction-multiple-data (SIMD) execution. At this point, we have to decide whether the generated kernels use the sparse or the dense data structure, which defines the loop nest of the generated code and the data accesses. To support a complete CFD application, the generation of LB kernels as well as boundary and communication kernels is needed. The code generation pipeline supports all of these kernels for both the sparse and the dense data structure. As the last step, the actual code is generated for a specified architecture. In particular, all kernels can be generated as plain C code to support all common CPUs. The kernels can also be generated with the HIP or CUDA API to support the GPU vendors AMD and NVIDIA, respectively. Lastly, these optimized kernels can be run interactively in IPython, but they can also be integrated into a high-performance framework such as WALBERLA to run in parallel on thousands of CPU or GPU nodes.
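As an impression of the user-facing side of this pipeline, the following is a minimal sketch of how such a kernel might be generated. It follows the public lbmpy/pystencils APIs as we understand them at the time of writing; exact module paths, option names, and defaults may differ between versions.

import pystencils as ps
from lbmpy import LBMConfig, LBStencil, Method, Stencil, create_lb_update_rule

# Model creation: stencil, collision model, relaxation rate
lbm_config = LBMConfig(stencil=LBStencil(Stencil.D3Q19),
                       method=Method.CUMULANT,
                       relaxation_rate=1.9)

# Derive the symbolic update rule and hand it to pystencils, which
# applies optimizations such as CSE and emits code for the chosen target
update_rule = create_lb_update_rule(lbm_config=lbm_config)
kernel_config = ps.CreateKernelConfig(target=ps.Target.GPU)  # or Target.CPU
ast = ps.create_kernel(update_rule, config=kernel_config)
kernel = ast.compile()  # callable kernel, usable interactively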
Figure 3. Complete workflow of the code generation pipeline of lbmpy. For a full CFD application, compute kernels as well as boundary and communication kernels are generated.
Single Node Results In Figure 4, we present the performance of the generated sparse LBM kernel in comparison to the generated direct-addressing kernel. The diagram shows the mega fluid lattice updates per second (MFLUPs) depending on the porosity ϕ as defined in Equation 8. The LB method uses a D3Q19 velocity set and the SRT collision operator. The benchmark measures the LBM kernel performance without boundary handling or communication, and it is performed on a single NVIDIA A100 GPU.
We observe that the single GPU performance of the sparse
kernel is quite close to the theoretical peak performance.
Furthermore, the MFLUPs performance remains essentially
constant for decreasing porosity.
However, Figure 4 also shows that the sparse kernels perform worse for ϕ ≳ 0.75. This is caused by the extra memory accesses of the sparse data structure. A dense kernel has to read and write every PDF of a cell per time step. This results in a memory access volume of $2Q \cdot B_{PDF}$ bytes per cell on GPUs, with $Q$ as stencil size, here 19, and $B_{PDF}$ as the bytes per stored PDF, here 8 bytes for double precision. The sparse kernel, on the other hand, accesses $2Q \cdot B_{PDF} + (Q-1) \cdot B_{idx}$ bytes per cell, because it needs to read neighboring information from the index-list. $B_{idx}$ is the number of bytes per index in the index-list, here 4 bytes for an integer.
Nevertheless, the performance of the dense kernel decreases linearly with decreasing porosity. The reason for this behavior is that dense kernels in WALBERLA traverse all cells, including the non-fluid cells. This avoids the need for a branch instruction for non-fluid cells, as mentioned before, but it leads to a linear decrease of the fluid lattice updates per second. As indicated in Figure 4, the theoretical break-even point for the sparse and dense kernels is at approximately ϕ ≈ 0.75.
On the same NVIDIA A100 GPU, we present a comparison of the memory consumption for a lattice of 256³ cells in Figure 5. The memory usage is measured with the NVIDIA monitoring tool nvidia-smi, showing that the sparse data structure consumes linearly less memory with decreasing porosity. The dense data structure, on the other hand, exhibits a constant memory footprint because it stores all cells in the domain, regardless of whether a cell is fluid or boundary. For a porosity of 1.0, where all cells in the domain are fluid, the sparse data structure consumes more memory, since it also has to store the index-list in addition to the PDF-list.
The theoretical memory consumption of the LBM kernels can be calculated as:

$$M_{\mathrm{sparse}} = N_{\mathrm{cells}} \cdot \big( \underbrace{2Q \cdot B_{\mathrm{PDF}}}_{\text{PDF-lists}} + \underbrace{(Q-1) \cdot B_{\mathrm{idx}}}_{\text{index-list}} + \underbrace{5 \cdot B_{\mathrm{PDF}}}_{\text{other fields}} \big) \cdot \phi,$$
$$M_{\mathrm{dense}} = N_{\mathrm{cells}} \cdot \big( \underbrace{2Q \cdot B_{\mathrm{PDF}}}_{\text{PDF-field}} + \underbrace{5 \cdot B_{\mathrm{PDF}}}_{\text{other fields}} \big). \tag{9}$$
The additional fields stored are a velocity field (3D), a density field (1D), and a flag field to indicate boundary cells (1D).

We see that for the theoretical as well as for the measured memory footprint, the break-even point of the sparse and dense data structure is at a porosity of around ϕ ≈ 0.8, a result similar to the performance comparison. For higher porosity, the dense data structure is more suitable in terms of memory consumption; for lower porosity, the sparse structure becomes superior.

The measured memory consumption for the sparse as well as the dense LBM in Figure 5 is close to the theoretical memory consumption, so there is only a small overhead of less than 10% from data structures other than the pure PDF data.
Figure 4. Single GPU benchmark (MFLUPs over porosity) for the sparse, dense, and hybrid data structures with varying porosity on an NVIDIA A100 with 256³ cells, D3Q19 velocity set and SRT collision operator; shown are the sparse, dense, and hybrid kernels together with the sparse and dense rooflines and the porosity regions where the dense or the sparse LBM is superior. The theoretical performance is calculated from the bandwidth of a streaming benchmark (1361 GB/s) and the theoretical number of memory accesses of the kernels, as LBM code is usually memory bound.
Figure 5. Memory consumption benchmark (in GB over porosity) for 256³ cells on a single NVIDIA A100 GPU for a D3Q19 stencil and pull streaming pattern; shown are the measured sparse and dense values together with their theoretical curves and the porosity regions where the dense or the sparse LBM is superior. The theoretical memory consumption is calculated in Equation 9.
Hybrid Data Structure
In certain application scenarios, a hybrid data structure may
be advantageous. As an example, consider a free flow over
a particle bed as depicted in Figure 14. After the domain
partitioning, some blocks contain only fluid cells, while
other blocks consist primarily of non-fluid cells. In this case,
neither the sparse nor the dense data structure seems to fit the
given scenario perfectly.
Therefore, we implement hybrid simulations in WALBERLA. From lbmpy, sparse and dense LBM and boundary kernels are generated. In WALBERLA, the porosity ϕ is calculated individually on each block to classify the block as sparse or dense, based on a porosity threshold ϕ_S. Based on the results in Figure 4 and Figure 5, the porosity threshold should be around ϕ_S ≈ 0.8. During the creation of the data structures on the blocks, either a dense PDF field or a sparse PDF-list with the corresponding index-list is created on each block, and only the corresponding generated sparse or dense kernel runs on that block. Besides this functionality, appropriate routines for the communication between sparse and dense blocks must be provided. Again, suitable pack and unpack kernels are generated for CPU or GPU architectures with lbmpy, while the MPI communication routine itself stays unchanged.
In Figure 4, we display the performance of the hybrid data structure in green with ϕ_S = 0.8. As expected, the performance matches that of the dense kernel for ϕ ≥ ϕ_S and that of the sparse kernel for ϕ < ϕ_S. Consequently, the hybrid approach can always reach the maximum possible WALBERLA performance per block, independent of the porosity. The same holds for the memory consumption: if we set the porosity threshold to 0.8, as suggested in Figure 5, we also get the best possible memory consumption per block by utilizing the hybrid data structure.
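The per-block decision itself is simple; a minimal sketch, assuming a block object that exposes its cell counts (illustrative names, not the WALBERLA API):

PHI_S = 0.8  # porosity threshold suggested by Figures 4 and 5

def choose_data_structure(block):
    phi = block.num_fluid_cells / block.num_cells  # per-block porosity, Eq. (8)
    return "sparse" if phi < PHI_S else "dense"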
Optimizations to sparse LBM
The high-performance framework WALBERLA in combination with lbmpy already provides a wide range of optimizations for LB methods. For the sparse data structure, some of these optimizations had to be adapted or re-implemented. In the following, we describe the implementation of an in-place streaming pattern and a communication hiding technique specially designed for the sparse data structures.
In-place Streaming: AA Pattern
The most common streaming patterns for the LBM are two-grid algorithms, where either the PDFs of a cell are pushed into the neighbor cells (push scheme) or PDFs are pulled from the neighbor cells (pull scheme) (Krüger et al. (2017)). These algorithms have in common that a temporary PDF field is needed, because PDFs are stored at a different position than where they are read from. As illustrated in Figure 6, these two-grid streaming patterns read from PDF field A, propagate (push/pull) the PDFs, and store the propagated results at a different location in the temporary PDF field B. Therefore, these streaming methods are also called AB patterns. After the propagation, a swap of fields A and B is needed.

The in-place streaming AA pattern, on the other hand, enables reading and writing PDF values at the same positions of the PDF field, so that PDFs can be read from field A and also written to field A without creating data dependencies. This saves the memory of the temporary PDF field B. Additionally, it also saves memory accesses (see Bailey et al. (2009)).
These savings are achieved by introducing two alternating streaming time steps, as shown in Figure 7. In the "odd" time step, PDFs are read from field A and pulled to the current cell. Then the collision is performed, and the resulting PDFs are pushed back to the neighbor cells, to the positions they were read from on field A. It is worth mentioning that, while we introduced the LBM streaming step as a separate algorithm step in Equation 4, in the actual generated LB kernels streaming and collision are fused into one step to save memory accesses. The odd time step thus consists of two propagations and one collision, and the reads and writes of the PDFs take place at the same positions on the same field A.
Figure 6. Pull pattern: PDFs are read from field A and, after the fused stream-collide step (no intermediate field accesses), they are written to field B.
Figure 7. AA pattern: In the "odd" time step, a pull streaming is done, followed by a collision and a push streaming step. As the PDFs are pushed to the same positions they were read from, the PDFs are stored in the opposite stencil directions. In the "even" time step, only one collision is done, where the PDFs have to be read from the opposite stencil direction to achieve correct macroscopic values. Reads and writes take place on the same field; therefore only one PDF field must be stored.
However, this means that after the second propagation, the PDFs are stored in the opposite stencil positions. Thus we have to take care that the PDFs are always in the opposite stencil position after an odd time step. This becomes relevant when we must calculate macroscopic values or communicate with neighbor MPI blocks after an odd time step.
The second time step, the "even" time step, consists of one collision only. The PDFs needed for the collision must be read from the opposite stencil positions to obtain correct macroscopic value calculations and similar. As there is no propagation in the even time step, no neighboring information is needed. Consequently, there is also no need to access the pull index list, which saves memory accesses, as discussed in the following.

After one odd and one even time step, the PDFs are again stored in their correct positions, and the outcome is the same as after two push or pull time steps.
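Structurally, the resulting time loop simply alternates between the two fused kernels; a sketch, with aa_odd_kernel and aa_even_kernel standing in for the generated kernels:

def run_aa(pdfs, index_list, timesteps):
    assert timesteps % 2 == 0, "the AA pattern advances in pairs of steps"
    for t in range(timesteps):
        if t % 2 == 0:
            # "odd" step: pull neighbor PDFs via the index_list, collide,
            # and push the results back to the positions they were read from
            aa_odd_kernel(pdfs, index_list)
        else:
            # "even" step: purely cell-local collision, no index_list access;
            # reads and writes use the inverse stencil directions
            aa_even_kernel(pdfs)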
The benefit of the AA streaming pattern in terms of memory accesses is shown in Table 1. For the pull pattern in dense kernels, 3Q memory accesses are required per cell. One memory access is needed for the read of field A, one for the write to field B, and the third one is a "write allocate B", which occurs if the PDF data of field B is not already stored in the CPU cache and therefore has to be loaded into the cache before being written. A PDF entry is only used once in a fused stream-collide step, and the whole PDF-list is unlikely to fit completely in the CPU cache for large-scale runs; therefore, the "write allocate B" access is mostly present. This memory access only appears on CPUs, as GPUs do not utilize a cache structure like CPUs do, so the number of memory accesses on GPUs per cell is 2Q. Nevertheless, we want to avoid the third access on CPUs by utilizing an in-place streaming pattern: the data is already in the CPU cache because we read and write at the same PDF positions in the same PDF field. By this, we avoid the cache miss and end up with 2Q memory accesses per cell.
For sparse kernels, utilizing the AA pattern has even more advantages in terms of memory accesses. For the pull pattern, in addition to the PDF-list, the index-list also has to be read to get the pull indices for the propagation step, which adds (Q−1) accesses to our count. In total, we need 3Q + (Q−1) memory accesses per cell for sparse LBM kernels with the pull streaming pattern.

For the AA pattern, on the other hand, we only need neighboring information in every second (odd) time step, because in even time steps we compute only cell-locally, so no neighboring information is needed. The memory accesses for the index list can thus be halved to (Q−1)/2, which results in 2Q + (Q−1)/2 memory accesses per cell for sparse LBM kernels with the AA streaming pattern.
Table 1. Memory accesses per cell for the Pull and AA pattern on CPU, with size of PDFs B_pdf = 8 byte, size of indices in the index-list B_idx = 4 byte, and Q = 19.

  Pattern                   Dense data    Sparse data
  Pull                      3Q·B_pdf      3Q·B_pdf + (Q−1)·B_idx
  AA                        2Q·B_pdf      2Q·B_pdf + (Q−1)/2·B_idx
  Reduction (1 − AA/Pull)   33.3 %        35.6 %
Table 2. Memory accesses per cell for the Pull and AA pattern on GPU, with size of PDFs B_pdf = 8 byte, size of indices in the index-list B_idx = 4 byte, and Q = 19.

  Pattern                   Dense data    Sparse data
  Pull                      2Q·B_pdf      2Q·B_pdf + (Q−1)·B_idx
  AA                        2Q·B_pdf      2Q·B_pdf + (Q−1)/2·B_idx
  Reduction (1 − AA/Pull)   0 %           9.7 %
We see in Table 1 that, on a CPU, the AA pattern reduces the memory accesses compared to the pull pattern by

$$1 - \frac{2Q \cdot B_{\mathrm{pdf}} + \frac{Q-1}{2} \cdot B_{\mathrm{idx}}}{3Q \cdot B_{\mathrm{pdf}} + (Q-1) \cdot B_{\mathrm{idx}}}. \tag{10}$$

Because well-optimized LBM codes are usually memory-bound, an increase in the performance of the LBM by the same ratio can be expected.
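Plugging the numbers used throughout this paper into Equation (10) reproduces the reduction rows of Tables 1 and 2; a quick arithmetic check:

Q, B_pdf, B_idx = 19, 8, 4   # D3Q19, double-precision PDFs, 4-byte indices

# bytes per cell for the sparse data structure (cf. Tables 1 and 2)
cpu_pull = 3*Q*B_pdf + (Q - 1)*B_idx       # CPU includes the write-allocate
cpu_aa   = 2*Q*B_pdf + (Q - 1)//2*B_idx
gpu_pull = 2*Q*B_pdf + (Q - 1)*B_idx       # no write-allocate on GPUs
gpu_aa   = 2*Q*B_pdf + (Q - 1)//2*B_idx

print(f"CPU sparse reduction: {1 - cpu_aa/cpu_pull:.1%}")  # cf. Table 1
print(f"GPU sparse reduction: {1 - gpu_aa/gpu_pull:.1%}")  # cf. Table 2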
As already mentioned above, this particular performance boost can only be achieved on CPUs, as GPUs do not work with similar caches, so no "write allocate" can be avoided there. Therefore, only half the memory accesses for the index-list can be saved with the sparse approach, as shown in Table 2.
On the other hand, the need to store the temporary PDF field is avoided on CPUs and accelerators alike. The memory consumption of the sparse LBM in Equation 9 therefore shrinks to

$$M_{\mathrm{sparse,aa}} = N_{\mathrm{cells}} \cdot \big( \underbrace{Q \cdot B_{\mathrm{PDF}}}_{\text{PDF-list}} + \underbrace{(Q-1) \cdot B_{\mathrm{idx}}}_{\text{index-list}} + \underbrace{5 \cdot B_{\mathrm{PDF}}}_{\text{other fields}} \big) \cdot \phi. \tag{11}$$

So for a D3Q19 stencil, double-precision PDFs, and an integer index-list, we save 36.5% of memory consumption by utilizing the AA streaming pattern with the sparse data structure.
Benchmarking Results In Figure 8, the single GPU benchmarking results for a sparse LBM kernel on an NVIDIA A100 with a D3Q19 stencil and 256³ lattice cells are presented. We compare the pull streaming pattern with the AA pattern for various collision operators. The theoretical peak performance is calculated from the bandwidth of 1367 GB/s found by streaming benchmarks (Siefert et al. (2023)) and the number of theoretical memory accesses from Table 2.

We observe that the performance of both streaming patterns is close to the theoretical performance for all of the presented collision operators. The average performance increase of the AA pattern compared to the pull pattern on accelerators is 7.5%, which is close to the theoretically achievable performance increase of 9.7% from Table 2. Therefore, in addition to avoiding the storage of the second PDF field, it is also worthwhile in terms of performance to employ the AA pattern on GPUs.
Figure 8. Single GPU benchmark of the pull vs. AA streaming pattern for the single relaxation time (SRT), two relaxation time (TRT), multi relaxation time (MRT), central moment (CM), and cumulant collision models on a D3Q19 stencil with 256³ cells on a single NVIDIA A100 (pull pattern: 3622-3641 MFLUPs; AA pattern: 3917-3932 MFLUPs; roofline shown for comparison).
Communication Hiding
Communication hiding is used to overlap communication with computation. For this, the domain on every block is divided into a "block interior" and a "frame", as illustrated in Figure 9. The frame consists only of the outermost cells, while the block interior consists of all other cells. Exemplary code to achieve communication hiding is shown in Algorithm 1. First, the communication is started: every block packs its outermost PDFs into an MPI buffer and performs a non-blocking MPI send to its neighbors. Now the block interior cells can be updated, because the information from neighbor blocks is not needed for these cells. After this step, the algorithm must wait for the communication to complete and write the information from the MPI buffers to the ghost layers. Lastly, with the updated information in the ghost layers, the LBM and boundary kernels can be executed on the cells of the frame.

With this algorithm, the communication of the simulation is overlapped with the kernel on the interior, which leads to higher performance because of better scalability on an increasing number of MPI processes. The width of the frame in all three dimensions has to be chosen suitably to achieve the best possible performance. A thinner frame increases the number of cells in the interior, providing more time to overlap the communication. On the other hand, a small frame width results in small kernels. Especially on GPUs, small kernels cannot fully utilize the device, which can lead to performance drops. Additionally, consecutive memory access in one dimension is not possible for a thin frame, which can also reduce the simulation's performance.
Communication Hiding for Sparse Data Structures
Implementing communication hiding for a sparse data structure is not straightforward, because there is no spatial information for the cells: a cell has no direct information about whether it lies in the block interior or in the frame. To compensate for this, we store two additional index lists, one for the PDFs in the block interior and one for the PDFs on the frame.
Figure 9. Subdivision of the PDF field into a frame and the block interior to enable communication hiding. In this example, the frame width is 5 in x- and 3 in y-direction.
Algorithm 1 Communication Hiding
1: for each time step do
2:    Start communication
3:    Run boundary kernels on block interior
4:    Run LBM kernel on block interior
5:    Wait for communication
6:    Run boundary kernels on frame
7:    Run LBM kernel on frame
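To make the overlap concrete, here is a hedged sketch of Algorithm 1 using non-blocking MPI via mpi4py. pack_frame_pdfs, unpack_ghost_layer, the kernel functions, and the neighbor objects are hypothetical placeholders for the generated routines, not WALBERLA API.

from mpi4py import MPI

comm = MPI.COMM_WORLD

def time_step(block, neighbors):
    # start communication: non-blocking exchange of the outermost PDFs
    requests = []
    for nb in neighbors:
        send_buf = pack_frame_pdfs(block, nb)      # hypothetical pack kernel
        requests.append(comm.Isend(send_buf, dest=nb.rank))
        requests.append(comm.Irecv(nb.recv_buf, source=nb.rank))

    run_boundary_kernels(block, region="interior")  # overlap: interior work
    run_lbm_kernel(block, region="interior")

    MPI.Request.Waitall(requests)                   # wait for communication
    for nb in neighbors:
        unpack_ghost_layer(block, nb)               # hypothetical unpack kernel

    run_boundary_kernels(block, region="frame")     # finish on the frame
    run_lbm_kernel(block, region="frame")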
These index lists are initialized from the flag field at the start of the simulation, where spatial information about the cells is still available. Furthermore, the pull index for the center PDF of every fluid cell must be stored. This index is used to obtain the correct write access for kernels on the interior and frame cells. These modifications of the list structure are integrated into the code generation, so that they can be turned on and off and the generated code can run on different architectures.
Scaling results In Figure 10, the weak scaling of the sparse data structure on the JUWELS Booster HPC cluster is presented. JUWELS Booster is currently ranked 21st in the Top500 list (June 2024) and consists of 936 compute nodes, each equipped with 4 NVIDIA A100 GPUs (see Alvarez (2021)).
We tested three versions of the communication: one without communication hiding, one with the minimum frame size of one cell in every direction, and one with a frame thickness of 32 cells in x-direction and one cell in y- and z-direction. The latter option is promising, since consecutive memory accesses are still possible in x-direction, while the frame is still small enough to allow a good communication overlap. In general, a smaller frame size increases the work of the kernels on the interior cells and therefore should increase the effectiveness of the communication hiding. On the other hand, the GPU utilization of the kernels on the frame is quite low for a small frame size, and consecutive memory accesses are not guaranteed.
The version without communication hiding performs best up to one node (4 GPUs), see Figure 10. There, the intra-node communication speed is quite high, since we can exploit the high bandwidth of the NVIDIA GPU-to-GPU connections. However, the performance without communication hiding deteriorates on more than 4 GPUs, when the inter-node communication speed becomes relevant.
The benchmark runs with communication hiding start with worse performance on a single node, as the overhead of the kernel call on the frame cells limits the performance. Nevertheless, these versions exhibit excellent scalability for up to 32 GPUs. Beyond 32 GPUs, the performance drops to 83% scaling efficiency on 1024 GPUs. This can possibly be explained by the InfiniBand network architecture of JUWELS Booster, which is implemented as a DragonFly+ network: the drop in performance could be caused by the need for communication between different switch islands of the system when more than 32 GPUs are employed.

In these cases, the size of the frame has only a negligible impact; the two communication hiding scenarios behave similarly, with the smaller frame size of <1,1,1> performing slightly better on more than 128 GPUs. Overall, communication hiding increases the scaling efficiency of the sparse data structure on up to 1024 NVIDIA A100 GPUs from 63% to 83%.
Figure 10. Weak scaling benchmark (MLUPS per GPU over the number of GPUs, 2⁰ to 2¹⁰) on the NVIDIA A100 GPU cluster JUWELS Booster with different configurations of the communication hiding (none, frame width <1,1,1>, frame width <32,1,1>). The roofline is obtained from a stream benchmark (Siefert et al. (2023)). The runs are executed with 320³ cells per GPU, with a D3Q19 stencil and SRT collision model on an empty channel setup.
Additionally, we tested the scaling efficiency on the GPU partition of LUMI, which is in the top 5 of the current Top500 list (June 2024). The HPC cluster comprises 2978 nodes with 4 AMD MI250X GPUs per node. Further, every AMD MI250X GPU consists of two Graphics Compute Dies (GCDs) (Pearson (2023)), so we create one MPI process per GCD and show the scaling over the GCDs. As already studied in Holzer et al. (2024), Lehmann (2022) and Martin et al. (2023), it does not seem possible to achieve significantly more than approximately 50% of the memory bandwidth for LBM codes on a single AMD MI250X. We observe the same behavior in Figure 11.
We tested the three communication routines analogously to the benchmarks on JUWELS Booster. For the runs without communication hiding, we achieve a GCD scaling efficiency of 60% from one to 8192 GCDs (4096 GPUs) and a node-level (4 GPUs) scaling efficiency of 82%.
Figure 11. Weak scaling benchmark (MLUPS per GCD over the number of GCDs, 2⁰ to 2¹³) on the GPU cluster LUMI-G with different configurations of the communication hiding (none, frame width <1,1,1>, frame width <32,1,1>). The roofline is obtained from a stream benchmark (Siefert et al. (2023)). The AMD MI250X GPUs have two compute dies (GCDs) per GPU. The runs are executed with 256³ cells per GCD, with a D3Q19 stencil and SRT collision model on an empty channel setup.
For the larger frame size <32,1,1> we observe similar scaling behavior as without communication hiding; the expected acceleration and better scaling could not be observed, a finding that should be investigated further in future research. For the small frame size, the scaling efficiency is almost perfect, but the overall performance is much worse than with the other communication strategies. Again, this behavior is unexpected. The non-consecutive memory accesses and kernel calls with very short execution times could be the reason for the relatively poor overall performance of the simulation in these cases.
Applications
To evaluate the performance of the sparse data structure in more realistic scenarios than the artificial porosity benchmark in Figure 4 or the weak scaling of an empty channel on JUWELS Booster (Figure 10) and on LUMI (Figure 11), we set up three different applications. The first is the flow through a porous medium consisting of a stationary particle bed. The second is an extended version of the first, where the bottom part of the domain consists of the same particle bed while the upper part is a free flow, so that we simulate the interaction of a free flow with a porous sediment bed. The last application is the flow through a geometry of coronary arteries, which also results in a complex and sparse domain.
Flow through Porous Media
The efficient simulation of fluid flow through porous media is an ongoing research topic, see for example Pan et al. (2004), Yang et al. (2023), Han and Cundall (2013) or Ambekar et al. (2023). For this application, we generated a particle bed with the WALBERLA molecular dynamics module MESA-PD, as shown in Rettinger and Rüde (2018). We defined a domain of 0.1 meters in every dimension and filled it with 21580 particles with a diameter of 0.0041 m. This setup is illustrated in Figure 12 and results in an average porosity of 0.356663.
Figure 12. Flow through a particle bed consisting of 21580 particles of 0.0041 m diameter, which corresponds to an average porosity of 0.356663. The size of the domain is 0.1 m in every dimension.
We decomposed the domain into 64 blocks in a 4x4x4 arrangement to run on 64 NVIDIA A100 GPUs on JUWELS Booster. While we fixed the number of blocks to 64, we increase the cells per block and therefore also the resolution of the domain and the total number of cells, as shown in Figure 13. We performed this benchmark for the sparse data structure and compare it to the dense data structure.
We first focus on the "kernel-only" results in Figure 13, which run only the LBM kernel without handling boundaries or performing communication. We observe that for block sizes below 64³, the GPU utilization is too low to achieve good performance. Both the sparse and the dense kernel saturate at a block size of 256³. Further, we note that the performance of the sparse data structure is approximately two times higher than that of the dense structure, which fits quite well with the results in Figure 4 for a porosity of ϕ ≈ 0.35. The sparse kernel-only performance does not quite reach the theoretical bandwidth limit. This could be caused by load imbalances, since the porosity varies between the blocks, from a minimum of 0.337816 to a maximum of 0.392441. This means that some blocks, and therefore processes, have more workload in terms of cells. We measure the performance at the end of the simulation run, when all processes have finished their work, so the performance is determined by the slowest process. The performance of the dense kernel, on the other hand, is not affected by the porosity differences; it performs exactly the same work on every MPI process and thus does not suffer from load imbalance issues.
When running the full simulation including boundary handling and communication routines, we again observe low performance for small block sizes, which can be explained by the low utilization of the GPUs.
Figure 13. Comparison of the sparse and the dense data structure for the flow through the particle bed in Figure 12. The number of WALBERLA blocks is fixed to 64 while the number of cells per block increases, and therefore the resolution and the total number of cells increase as well. The benchmark was executed on 64 NVIDIA A100 GPUs on JUWELS Booster with one block per GPU.
However, this time there is no saturation at a block size of 256³, because a larger block size results in better communication hiding, as the ratio between computational work and communication improves in favor of computational work. For the dense structure, it was not possible to perform simulations with a block size of 512³, as the memory consumption exceeded the 40 GB of GPU RAM of a single A100. For the sparse structure, this is not a problem, as for a porosity of 0.356663 we save around 50% of memory compared to the dense structure.
For this application, we benefit significantly from the sparse data structure, as it achieves a performance increase of 90% compared to the dense one for a block size of 450³. For a block size of 512³, which results in a cell resolution of 4.8828·10⁻⁵ m and 8.6·10⁹ total cells, it achieves an overall performance of 203.408 GFLUPs for the kernel-only call and 83.475 GFLUPs for the full-simulation run. Further, as shown in Figure 5, for a domain with an average porosity of 0.35 we save around 50% of memory consumption by utilizing the sparse data structure.
Free Flow over River Bed
The second application is the simulation of a free flow over a river bed, similar to Kemmler et al. (2023) or Fattahi et al. (2016). In Figure 14, the bottom part of the domain consists of the same porous medium / particle bed as in the first application (Figure 12), with the same average porosity of about 0.35. The upper part of the domain is a free flow only, i.e., 100% fluid cells. The average porosity of the whole domain is 0.68.

This application is well suited to the hybrid data structure. As indicated in Figure 14, the blocks in the upper part of the domain should hold their data in a dense structure (red blocks), while the blocks in the porous part of the domain should use a sparse data structure (blue blocks). The framework includes the functionality to select the appropriate data structure for each block; the user only has to specify an appropriate porosity threshold.
Figure 14. Free flow over a particle bed. The blocks in the particle bed at the bottom (blue) have a porosity of about 0.35, while the upper blocks (red) consist of fluid cells only.
In Figure 15, a comparison of the sparse, the dense, and the hybrid data structure is shown. The simulations are again executed on the JUWELS Booster GPU cluster, with a fixed number of NVIDIA A100 GPUs and one MPI process per GPU. The performance of the raw LBM kernel (kernel-only) is plotted, as well as that of the entire simulation run, including communication between the MPI processes. We fixed the number of GPUs to 64 and the problem size to 10⁹ cells and vary only the cells per block, resulting in a high number of blocks for small block sizes and one block per GPU for the largest block size of 256³ cells.

Evaluating the raw kernel performance first, we again observe that a larger number of cells per block leads to a better utilization of the GPU.
Load-balancing issues can emerge when using a sparse or a hybrid data structure with multiple blocks per GPU. This is because the block partitioning of a domain can lead to a wide range of porosity values on the blocks. When decomposing the river bed simulation in Figure 14 with sparse blocks only, half of the blocks yield a porosity of 0.35 and the other half a porosity of 1.0. Similar to the number of memory accesses in Table 2, one can estimate the workload of a sparse WALBERLA block on a GPU as

$$w_{\mathrm{sparse}} = \big( \underbrace{2Q \cdot B_{\mathrm{PDF}}}_{\text{PDF-list read/write}} + \underbrace{(Q-1) \cdot B_{\mathrm{idx}}}_{\text{index-list read}} \big) \cdot \phi \tag{12}$$

with $Q$ as the stencil size and $\phi$ as the porosity. The workload of a dense block on a GPU is

$$w_{\mathrm{dense}} = \underbrace{2Q \cdot B_{\mathrm{PDF}}}_{\text{PDF-field read/write}}, \tag{13}$$

which does not depend on the porosity of the block.
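For load balancing, each block's weight can thus be estimated directly from Equations (12) and (13); a sketch with illustrative block attributes:

Q, B_PDF, B_idx = 19, 8, 4

def block_workload(block):
    if block.is_sparse:
        phi = block.num_fluid_cells / block.num_cells
        return (2*Q*B_PDF + (Q - 1)*B_idx) * phi   # Eq. (12)
    return 2*Q*B_PDF                               # Eq. (13), porosity-independent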
Figure 15. Comparison of the sparse, dense, and hybrid data structures for the free flow over a particle bed of Figure 14 on 64 NVIDIA A100 GPUs on JUWELS Booster, for block sizes from 16³ cells (262144 blocks) up to 256³ cells (64 blocks). The problem size is fixed to 1.07·10⁹ cells, while the number of cells per block and therefore the number of blocks vary. The LBM kernel-only performance is shown (balanced and unbalanced, together with the sparse and dense rooflines) as well as the performance of the whole simulation including boundary handling and communication between MPI processes.
For the simulation runs with sparse data blocks only, the workload of the blocks differs significantly depending on their porosity. Therefore, we employ a load-balancing algorithm to distribute the blocks over the MPI processes and reach a better workload distribution. We used a space-filling Hilbert curve approach, as described in Schornbaum and Rüde (2018).
In Figure 15, we observe that the load balancing works well for the sparse kernel-only runs. Especially for the block sizes of 64³ and 128³, the load balancing reduces the workload imbalance significantly and therefore increases the performance. For 256³ cells per block, there is only one block per GPU, so no load balancing is possible in this case.

No load balancing is necessary for the dense data structure, as every block has the same workload. We see that the kernel-only runs of the dense data structure outperform the unbalanced sparse runs. This is because the sparse data structure is slower than the dense structure on the blocks with a porosity of 1.0 (see again Figure 4), and all MPI processes must wait for the processes holding these blocks. However, when balancing the workload for the sparse kernels, at least for a block size of 128³, we can achieve a higher performance than with the dense kernels.

The kernel-only runs of the hybrid data structure without load balancing closely follow the results of the dense kernels in Figure 15: without load balancing, the low performance of the dense blocks in the porous region heavily dominates the runtime. However, here we can balance the workload to achieve better results. For the kernel-only runs, the workload-balanced hybrid data structure outperforms the other kernels for all block sizes, the only exception being 256³ cells per block, where no load balancing is possible because there is only one block per GPU.

We observe similar behavior when studying the entire simulation runs in Figure 15, which now include the communication between the MPI processes. Overall, the sparse data structure performs superior to the dense one for block sizes of 128³ and 256³ cells per block.
Table 3. Workload per MPI process / GPU before and after load balancing for the hybrid data structure run with 128³ cells per block in Figure 15.

Workload     Average   Std deviation   Min    Max
Unbalanced   1757.47   675.194         1038   2432
Balanced     1757.47   102.644         1520   1826
For the hybrid structure, we observe unexpected behavior. Here, the load-balanced simulation performs worse than the sparse simulations, while the unbalanced hybrid simulation reaches the highest performance of all tested data structures. Table 3 presents the workload per MPI process of the simulation with the hybrid data structure for 128³ cells per block. The standard deviation of the workload for the balanced run is significantly lower than for the unbalanced execution. Still, the performance of the balanced run is worse for the full simulation. We observe that the load-balancing algorithm successfully distributes the workload over the MPI processes, but it reduces the spatial locality of the blocks with respect to each other. Therefore, the communication is more expensive.
Nevertheless, for a complex domain setup such as free flow over a riverbed, we also manage to improve the application's performance by utilizing the sparse or the hybrid data structure. Additionally, we note the reduced memory consumption presented in Figure 5. Overall, when half of the domain consists of a porous medium, as in Figure 14, we save about 25% of memory with the hybrid data structure.
Coronary Artery
Finally, we present performance results for the flow in a
coronary artery. This is a topic of high interest in medical
engineering and has been studied, for example, in Axner
et al. (2009), Afrouzi et al. (2020), Bernaschi et al. (2010)
and Godenschwager et al. (2013). The flow in a coronary
artery results in a complex and highly sparse domain. An
example setup for this application is shown in Figure 16.
While a block decomposition of the whole bounding box would consist of a very high number of blocks, we can discard all blocks that do not contain fluid cells. Still, the remaining blocks have a very low porosity.
Of course, lowering the block size would let the block structure conform better to the geometry and result in higher block porosities. However, small block sizes also lead to under-utilization of GPUs, as already shown in Figure 13.
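The block set-up described here can be sketched in a few lines. The following snippet is purely illustrative (the names `Block` and the flag array are hypothetical, not the WALBERLA API): it computes each block's porosity from its flag field and discards blocks without fluid cells.

```python
# Hypothetical sketch: per-block porosity and pruning of empty blocks.
from dataclasses import dataclass
from typing import List

import numpy as np

@dataclass
class Block:
    flags: np.ndarray  # boolean array, True = fluid cell

    @property
    def porosity(self) -> float:
        return float(self.flags.mean())

def prune_blocks(blocks: List[Block]) -> List[Block]:
    """Keep only blocks that contain at least one fluid cell."""
    return [b for b in blocks if b.flags.any()]

# Example: a mostly solid 16^3 block next to an all-solid one.
rng = np.random.default_rng(0)
partly = Block(rng.random((16, 16, 16)) < 0.16)   # porosity ~ 0.16
solid = Block(np.zeros((16, 16, 16), dtype=bool))
kept = prune_blocks([partly, solid])              # -> [partly]
print(len(kept), kept[0].porosity)
```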
In Figure 17, we compare the performance of the sparse and the dense data structure on JUWELS Booster. We fix the number of NVIDIA A100 GPUs to 60 and the problem size to 1.2 · 10⁸ fluid cells while varying the block size and, therefore, also the number of blocks. Again, small block sizes result in low performance for the kernel-only runs of the sparse data structure. Increasing the block size up to 128³ cells per block leads to higher performance of the raw sparse LBM kernel. For block sizes larger than 128³, however, the porosity drops below ϕ = 0.1, and we observe in Figure 4 that the sparse LBM performance starts to deteriorate for porosities smaller than 0.1. Therefore, for block sizes larger than 128³, the performance of the sparse kernel in Figure 17 degrades because of the low block porosities.
Figure 16. Domain partitioning of a coronary artery with 1.2 · 10⁸ fluid cells, 531 blocks, and 128³ cells per block.
[Figure 17 plot: GFLUPs (left axis) and average block porosity (right axis) over cells per block, from 16³ (48514 blocks) to 512³ (60 blocks); series: kernel-only and full-simulation runs for the sparse and dense data structures.]
Figure 17. Comparison of the sparse and the dense data structure for the artery flow in Figure 16 on 60 NVIDIA A100 GPUs on JUWELS Booster. The problem size is fixed to 1.2 · 10⁸ fluid cells, while the block size and therefore also the number of blocks varies.
The sweet spot between large block sizes for good GPU utilization and a porosity above 0.1 for good kernel performance is a block size of 128³ cells per block for this setup. The same behavior does not emerge for the sparse full simulations; here, the performance is relatively stable for moderate block sizes. This is because the performance is heavily dominated by communication, which is especially costly in this application because of the high number of blocks per MPI process.
Nevertheless, the sparse data structure shows significantly higher performance than the dense data structure for all block sizes, both for the kernel-only runs and for the full simulation runs. For a block size of 128³ cells, we achieve a speed-up of about 11 for the kernel-only run and still a speed-up of 2 for the full simulation.
Also remarkable is the amount of memory saved for the artery setup. When choosing a block size of 128³ cells per block to reach maximum raw performance, we end up with an average block porosity of ϕ ≈ 0.16. According to Figure 5, this means we save about 75% of memory when using the sparse instead of the dense data structure.
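This saving is consistent with a rough estimate based on the list sizes, under our own illustrative assumptions of in-place streaming with a single PDF list or field, 8-byte (double-precision) PDF values, 4-byte index entries, and a D3Q19 stencil:

$$\frac{m_\mathrm{sparse}}{m_\mathrm{dense}} \approx \frac{\varphi \, \big( Q \cdot B_\mathrm{PDF} + (Q-1) \cdot B_\mathrm{idx} \big)}{Q \cdot B_\mathrm{PDF}} = \frac{0.16 \cdot (19 \cdot 8 + 18 \cdot 4)}{19 \cdot 8} \approx 0.24,$$

i.e. roughly 75% less memory for the sparse structure at this porosity.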
We were not able to acquire results for the dense structure for block sizes greater than 256³ because the dense data structure exceeded the memory of the NVIDIA A100 GPUs.
Parts of this work are comparable to Martin et al. (2023). In that article, the framework HARVEY is ported to different programming models such as CUDA, HIP, SYCL and Kokkos, and its performance is compared on different hardware, among others on NVIDIA A100 and AMD MI250X GPUs. HARVEY is an LBM-based CFD software with a focus on simulations of blood flow in patient-derived aortas that also utilizes the indirect-addressing approach.
Martin et al. (2023) measure the performance of HARVEY and a proxy app for NVIDIA A100 GPUs on the HPC system Polaris⁶. Their proxy app is used to show the maximum achievable performance for simple test cases. For an empty channel flow on one node (4 NVIDIA A100 GPUs), the proxy app reaches around 12000 MLUPS, equivalent to 3000 MLUPS per A100 GPU. This is close to what we achieve, as shown in Figure 10 for the communication-hiding cases. The actual HARVEY framework performs slightly worse for the same empty channel with around 2250 MLUPS per GPU. At the maximum scaling size of 1024 NVIDIA A100 GPUs, the proxy app reaches around 10⁶ MLUPS, equivalent to 976 MLUPS per GPU, while the HARVEY framework achieves around 6 · 10⁵ MLUPS, equivalent to 585 MLUPS per A100 GPU. In Figure 10, WALBERLA achieves 2500 MLUPS per GPU on 1024 NVIDIA A100 GPUs. We conclude that the scaling efficiency of WALBERLA seems to be superior. The results may not be fully comparable, as Martin et al. (2023) perform piece-wise strong scaling, while we show the weak scaling of our software. Further, they operated on a different HPC cluster, and the scaling performance in particular can depend significantly on the configuration of the HPC system.
When comparing the results for the artery geometry with a resolution of 55 µm, HARVEY achieves around 4 · 10⁴ MLUPS on 64 A100 GPUs, so 625 MLUPS per GPU. In Figure 17, WALBERLA achieves 1374 MLUPS per GPU for the kernel-only runs, but only 230 MLUPS for the full simulation. Of course, the underlying geometry of the artery is different, so again the results may not be fully comparable. Nevertheless, there is still room to optimize the WALBERLA framework in terms of load balancing and communication efficiency, especially for cases with multiple blocks per GPU.
Further, Martin et al. (2023) show comparable results on the AMD MI250X GPUs, tested on the Frontier⁷ HPC system. We compare the results for 4 GCDs: their proxy app shows around 2250 MLUPS per GCD, while WALBERLA shows 2100 MLUPS per GCD (see Figure 11). Thus, Martin et al. (2023) also confirm that LBM algorithms cannot achieve more than about 60% of the theoretical peak performance of the AMD MI250X GPUs. For 1024 GCDs, their proxy app shows a performance of around 10⁶ MLUPS, so 976 MLUPS per GCD, while WALBERLA achieves 1500 MLUPS per GCD on the same number of GCDs, showing a slightly better scaling efficiency. Again, the results are not fully comparable, as we compare weak scaling with piece-wise strong scaling, and the runs were performed on different systems.
Conclusion
In this article, we presented the benefits of sparse LBM kernels, especially on accelerator cards. We compared the sparse and the dense data structure on a single GPU and found that for a domain porosity below 0.8, the sparse data structure outperforms the dense data structure in terms of performance and memory consumption.
The sparse kernels show excellent performance for the pull streaming pattern with various collision operators such as single-, two-, and multi-relaxation-time operators, central moments, or cumulants on a single GPU. We managed to further increase the performance by 7.5% and reduce the memory consumption by 36.5% by utilizing an AA streaming pattern on GPUs. We also showed a scaling efficiency of the sparse data structure of over 82% on the JUWELS Booster and LUMI-G HPC systems for 1024 and 4096 GPUs, respectively.
We set up a porous media flow simulation and achieved a speed-up of 1.9 while reducing the memory consumption by 50%. For an artery blood flow simulation, we gained a speed-up of 2, depending on the block size, and a decrease in memory consumption of about 75%. We experienced imbalances in the distribution of the work over the MPI processes when using the sparse data structure. To maximize the efficiency of the sparse LBM, we employed load balancing, but more research is needed to fully optimize the workload distribution while maintaining the spatial locality of neighboring blocks.
To further increase the flexibility of the code generation with lbmpy, future work is planned to support the emerging Intel GPUs through a SYCL⁸ back-end. As SYCL is available on most current systems, the code generation could then be used to generate optimized sparse LB kernels for all existing and upcoming hardware such as CPUs, accelerator cards, or even more exotic hardware such as Accelerated Processing Units (APUs) or Field-Programmable Gate Arrays (FPGAs).
Acknowledgements
This work was supported by the SCALABLE project (www.scalable-hpc.eu/). This project has received funding from
the European High-Performance Computing Joint Undertaking
(JU) under grant agreement No 956000. The JU receives
support from the European Union’s Horizon 2020 research
and innovation program and France, Germany, and the Czech
Republic. The authors gratefully acknowledge the Gauss Centre for Supercomputing e.V. (www.gauss-centre.eu) for funding this project by providing computing time through the John von Neumann Institute for Computing (NIC) on the GCS Supercomputer JUWELS at Jülich Supercomputing Centre (JSC).
We acknowledge the EuroHPC Joint Undertaking for awarding
this project access to the EuroHPC supercomputer LUMI, hosted
by CSC (Finland) and the LUMI consortium through a EuroHPC
Regular Access call. The authors gratefully acknowledge the scientific support and HPC resources provided by the Erlangen National High Performance Computing Center (NHR@FAU) of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU).
Notes
1. Top500 List: https://www.top500.org/, accessed: 2024-6-3
2. LUMI Supercomputer: https://www.lumi-supercomputer.eu/, accessed: 2024-3-13
3. HIP: https://rocm.docs.amd.com/projects/HIP/en/latest/, accessed: 2024-5-13
4. CUDA: https://developer.nvidia.com/cuda-toolkit, accessed: 2024-5-13
5. IPython: https://ipython.org/, accessed: 2024-4-15
6. Polaris: https://www.alcf.anl.gov/polaris, accessed: 2024-8-2
7. Frontier: https://www.olcf.ornl.gov/olcf-resources/compute-systems/frontier/, accessed: 2024-8-2
8. SYCL: https://www.khronos.org/sycl/, accessed: 2024-04-16
References
Afrouzi HH, Ahmadian M, Hosseini M, Arasteh H, Toghraie D and
Rostami S (2020) Simulation of blood flow in arteries with
aneurysm: Lattice Boltzmann Approach (LBM). Computer
Methods and Programs in Biomedicine 187: 105312. DOI:
10.1016/j.cmpb.2019.105312.
Alvarez D (2021) JUWELS Cluster and Booster: Exascale Pathfinder with Modular Supercomputing Architecture at Jülich Supercomputing Centre. Journal of large-scale research facilities JLSRF 7. DOI:10.17815/jlsrf-7-183.
Ambekar AS, Schwarzmeier C, Rüde U and Buwa VV (2023) Particle-resolved turbulent flow in a packed bed: RANS, LES, and DNS simulations. AIChE Journal 69(1). DOI:10.1002/aic.17615.
Axner L, Hoekstra AG, Jeays A, Lawford P, Hose R and Sloot
PM (2009) Simulations of time harmonic blood flow in
the Mesenteric artery: Comparing finite element and lattice
Boltzmann methods. BioMedical Engineering OnLine 8(1): 23.
DOI:10.1186/1475-925X-8-23.
Bailey P, Myre J, Walsh SD, Lilja DJ and Saar MO (2009)
Accelerating Lattice Boltzmann Fluid Flow Simulations Using
Graphics Processors. In: 2009 International Conference on
Parallel Processing. pp. 550–557. DOI:10.1109/ICPP.2009.38.
Bauer M, Eibl S, Godenschwager C, Kohl N, Kuron M, Rettinger C, Schornbaum F, Schwarzmeier C, Thönnes D, Köstler H and Rüde U (2020) waLBerla: A block-structured high-performance framework for multiphysics simulations. DOI:10.1016/j.camwa.2020.01.007.
Bauer M, Hötzer J, Ernst D, Hammer J, Seiz M, Hierl H, Hönig J, Köstler H, Wellein G, Nestler B and Rüde U (2019) Code generation for massively parallel phase-field simulations. Association for Computing Machinery. DOI:10.1145/3295500.3356186.
Bauer M, Köstler H and Rüde U (2021) lbmpy: Automatic code generation for efficient parallel lattice Boltzmann methods. Journal of Computational Science 49: 101269. DOI:10.1016/j.jocs.2020.101269.
Bernaschi M, Fatica M, Melchionna S, Succi S and Kaxiras
E (2010) A flexible high-performance Lattice Boltzmann
GPU code for the simulations of fluid flows in complex
geometries. Concurrency and Computation: Practice and
Experience 22(1): 1–14. DOI:10.1002/cpe.1466.
Bernaschi M, Succi S, Fyta M, Kaxiras E, Melchionna S and
Sircar JK (2008) MUPHY: A parallel high performance MUlti
PHYsics/Scale code. In: 2008 IEEE International Symposium
on Parallel and Distributed Processing. pp. 1–8. DOI:10.1109/
IPDPS.2008.4536464.
Brodtkorb AR, Hagen TR and Sætra ML (2013) Graphics
processing unit (GPU) programming strategies and trends
in GPU computing. Journal of Parallel and Distributed
Computing 73(1): 4–13. DOI:10.1016/j.jpdc.2012.04.003.
Chen S and Doolen GD (1998) Lattice Boltzmann method for fluid
flows. Annual Review of Fluid Mechanics 30(1): 329–364.
DOI:10.1146/annurev.fluid.30.1.329.
Fattahi E, Waluga C, Wohlmuth B and Rüde U (2016) Large Scale Lattice Boltzmann Simulation for the Coupling of Free and Porous Media Flow. In: Kozubek T, Blaheta R, Šístek J, Rozložník M and Čermák M (eds.) High Performance Computing in Science and Engineering, volume 9611. Cham: Springer International Publishing. ISBN 978-3-319-40360-1 978-3-319-40361-8, pp. 1–18. DOI:10.1007/978-3-319-40361-8_1.
Geier M, Schönherr M, Pasquali A and Krafczyk M (2015) The
cumulant lattice Boltzmann equation in three dimensions:
Theory and validation. Computers & Mathematics with
Applications 70(4): 507–547. DOI:10.1016/j.camwa.2015.05.
001.
Godenschwager C, Schornbaum F, Bauer M, Köstler H and Rüde U (2013) A framework for hybrid parallel flow simulations
with a trillion cells in complex geometries. In: Proceedings of
the International Conference on High Performance Computing,
Networking, Storage and Analysis, SC ’13. New York, NY,
USA: Association for Computing Machinery. ISBN 978-1-
4503-2378-9, pp. 1–12. DOI:10.1145/2503210.2503273.
Han Y and Cundall PA (2013) LBM–DEM modeling of fluid–
solid interaction in porous media. International Journal for
Numerical and Analytical Methods in Geomechanics 37(10):
1391–1407. DOI:10.1002/nag.2096.
Hasert M, Masilamani K, Zimny S, Klimach H, Qi J, Bernsdorf
J and Roller S (2014) Complex fluid simulations with the
parallel tree-based Lattice Boltzmann solver Musubi. Journal
of Computational Science 5(5): 784–794. DOI:10.1016/j.jocs.
2013.11.001.
Hennig F, Holzer M and Rüde U (2022) Advanced Automatic Code
Generation for Multiple Relaxation-Time Lattice Boltzmann
Methods. DOI:10.48550/arXiv.2211.02435.
Hijma P, Heldens S, Sclocco A, van Werkhoven B and Bal HE
(2023) Optimization Techniques for GPU Programming. ACM
Computing Surveys 55(11): 239:1–239:81. DOI:10.1145/
3570638.
Holzer M, Mitchell T, Leonardi CR and Ruede U (2024) Develop-
ment of a central-moment phase-field lattice Boltzmann model
for thermocapillary flows: Droplet capture and computational
performance.
Kemmler S, Schwarzmeier C, Rettinger C, Plewinski J, Núñez-González F, Köstler H, Rüde U and Vowinckel B (2023) Geometrically Resolved Simulation of Upstream Migrating Antidune Formation and Propagation. In: 40th IAHR World Congress.
Krüger T, Kusumaatmaja H, Kuzmin A, Shardt O, Silva G and
Viggen EM (2017) The Lattice Boltzmann Method. Springer
International Publishing. DOI:10.1007/978-3-319-44649-3.
Kummerländer A, Dorn M, Frank M and Krause MJ (2023)
Implicit propagation of directly addressed grids in lattice
Boltzmann methods. Concurrency and Computation: Practice
and Experience DOI:10.1002/cpe.7509.
Lai J, Yu H, Tian Z and Li H (2020) Hybrid MPI and CUDA
Parallelization for CFD Applications on Multi-GPU HPC
Clusters. Scientific Programming 2020: e8862123. DOI:
10.1155/2020/8862123.
Latt J, Malaspinas O, Kontaxakis D, Parmigiani A, Lagrava D,
Brogi F, Belgacem MB, Thorimbert Y, Leclaire S, Li S, Marson
F, Lemus J, Kotsalos C, Conradin R, Coreixas C, Petkantchin
R, Raynaud F, Beny J and Chopard B (2021) Palabos: Parallel
Lattice Boltzmann Solver. Computers & Mathematics with
Applications 81: 334–350. DOI:10.1016/j.camwa.2020.03.
022.
Lehmann M (2022) Esoteric Pull and Esoteric Push: Two
Simple In-Place Streaming Schemes for the Lattice Boltzmann
Method on GPUs. Computation 10(6): 92. DOI:10.3390/
computation10060092.
Lehmann M, Krause MJ, Amati G, Sega M, Harting J and Gekle
S (2022) On the accuracy and performance of the lattice
Boltzmann method with 64-bit, 32-bit and novel 16-bit number
formats. Physical Review E 106(1): 015308. DOI:10.1103/
PhysRevE.106.015308.
Liu Z, Chu X, Lv X, Liu H, Fu H and Yang G (2023) Accelerating
Large-Scale CFD Simulations with Lattice Boltzmann Method
on a 40-Million-Core Sunway Supercomputer. In: Proceedings
of the 52nd International Conference on Parallel Processing.
Salt Lake City UT USA: ACM. ISBN 9798400708435, pp.
797–806. DOI:10.1145/3605573.3605605.
Martin A, Liu G, Ladd W, Lee S, Gounley J, Vetter J, Patel
S, Rizzi S, Mateevitsi V, Insley J and Randles A (2023)
Performance Evaluation of Heterogeneous GPU Programming
Frameworks for Hemodynamic Simulations. In: Proceedings
of the SC ’23 Workshops of The International Conference
on High Performance Computing, Network, Storage, and
Analysis. Denver CO USA: ACM. ISBN 9798400707858, pp.
1126–1137. DOI:10.1145/3624062.3624188.
Mattila K, Puurtinen T, Hyväluoma J, Surmas R, Myllys M, Turpeinen T, Robertsén F, Westerholm J and Timonen J (2016)
A prospect for computing in porous materials research: Very
large fluid flow simulations. Journal of Computational Science
12: 62–76. DOI:10.1016/j.jocs.2015.11.013.
Pan C, Prins JF and Miller CT (2004) A high-performance lattice
Boltzmann implementation to model flow in porous media.
Computer Physics Communications 158(2): 89–105. DOI:
10.1016/j.cpc.2003.12.003.
Pearson C (2023) Interconnect Bandwidth Heterogeneity on AMD
MI250x and Infinity Fabric.
Rak T (2024) Parallel Programming in the Hybrid Model on
the HPC Clusters. In: Malhotra R, Sumalatha L, Yassin
SMW, Patgiri R and Muppalaneni NB (eds.) High Performance
Computing, Smart Devices and Networks. Singapore: Springer
Nature. ISBN 978-981-9966-90-5, pp. 207–219. DOI:10.1007/
978-981-99-6690- 5 15.
Randles A, Draeger EW, Oppelstrup T, Krauss L and Gunnels JA
(2015) Massively parallel models of the human circulatory
system. In: Proceedings of the International Conference
for High Performance Computing, Networking, Storage and
Analysis, SC ’15. New York, NY, USA: Association for
Computing Machinery. ISBN 978-1-4503-3723-6, pp. 1–11.
DOI:10.1145/2807591.2807676.
Randles AP, Kale V, Hammond J, Gropp W and Kaxiras E (2013)
Performance Analysis of the Lattice Boltzmann Model Beyond
Navier-Stokes. In: 2013 IEEE 27th International Symposium
on Parallel and Distributed Processing. Cambridge, MA, USA:
IEEE. ISBN 978-1-4673-6066-1 978-0-7695-4971-2, pp.
1063–1074. DOI:10.1109/IPDPS.2013.109.
Rettinger C and Rüde U (2018) A coupled lattice Boltzmann
method and discrete element method for discrete particle
simulations of particulate flows. Computers & Fluids 172:
706–719. DOI:10.1016/j.compfluid.2018.01.023.
Schornbaum F and Rüde U (2016) Massively Parallel Algorithms
for the Lattice Boltzmann Method on Non-uniform Grids.
SIAM Journal on Scientific Computing 38(2): C96–C126. DOI:
10.1137/15M1035240.
Schornbaum F and Rüde U (2018) Extreme-Scale Block-Structured
Adaptive Mesh Refinement. SIAM Journal on Scientific
Computing 40(3): C358–C387. DOI:10.1137/17M1128411.
Schulz M, Krafczyk M, Tölke J and Rank E (2002) Parallelization Strategies and Efficiency of CFD Computations in Complex Geometries Using Lattice Boltzmann Methods on High-Performance Computers. In: Griebel M, Keyes DE, Nieminen RM, Roose D, Schlick T, Breuer M, Durst F and Zenger C (eds.) High Performance Scientific And Engineering Computing, volume 21. Berlin, Heidelberg: Springer Berlin Heidelberg. ISBN 978-3-540-42946-3 978-3-642-55919-8, pp. 115–122. DOI:10.1007/978-3-642-55919-8_13.
Siefert CM, Pearson C, Olivier SL, Prokopenko A, Hu J and
Fuller TJ (2023) Latency and Bandwidth Microbenchmarks
of US Department of Energy Systems in the June 2023 Top
500 List. In: Proceedings of the SC ’23 Workshops of The
International Conference on High Performance Computing,
Network, Storage, and Analysis. Denver CO USA: ACM. ISBN
9798400707858, pp. 1298–1305. DOI:10.1145/3624062.
3624203.
Spinelli GG, Horstmann T, Masilamani K, Soni MM, Klimach H, Stück A and Roller S (2023) HPC performance study of
different collision models using the Lattice Boltzmann solver
Musubi. Computers & Fluids 255: 105833. DOI:10.1016/j.
compfluid.2023.105833.
Vidal D, Roy R and Bertrand F (2010) On improving
the performance of large parallel lattice Boltzmann flow
simulations in heterogeneous porous media. Computers &
Fluids 39(2): 324–337. DOI:10.1016/j.compfluid.2009.09.011.
Wang J, Zhang X, Bengough AG and Crawford JW (2005)
Domain-decomposition method for parallel lattice Boltzmann
simulation of incompressible flow in porous media. Physical
Review E 72(1): 016706. DOI:10.1103/PhysRevE.72.016706.
Watanabe S and Hu C (2022) Performance Evaluation of Lattice
Boltzmann Method for Fluid Simulation on A64FX Processor
and Supercomputer Fugaku. In: International Conference on
High Performance Computing in Asia-Pacific Region. Virtual
Event Japan: ACM. ISBN 978-1-4503-8498-8, pp. 1–9. DOI:
10.1145/3492805.3492811.
Wittmann M, Zeiser T, Hager G and Wellein G (2013) Comparison
of different propagation steps for lattice Boltzmann methods.
Computers & Mathematics with Applications 65(6): 924–935.
DOI:10.1016/j.camwa.2012.05.002.
Yang G, Chen Y, Chen S and Wang M (2023) Implementation
of a direct-addressing based lattice Boltzmann GPU solver
for multiphase flow in porous media. Computer Physics
Communications 291: 108828. DOI:10.1016/j.cpc.2023.
108828.
Zeiser T, Hager G and Wellein G (2009) Benchmark analysis and
application results for lattice Boltzmann simulations on NEC
SX vector and Intel Nehalem systems. Parallel Processing
Letters 19(04): 491–511. DOI:10.1142/S0129626409000389.