PERFORMANCE ANALYSIS OF THE LATTICE BOLTZMANN METHOD
IMPLEMENTATION ON GPU
Waine B. de Oliveira Jr.
Alan Lugarini
Admilson T. Franco
waine@alunos.utfpr.edu.br
alansouza@utfpr.edu.br
admilson@utfpr.edu.br
Research Center for Rheology and non-Newtonian Fluids (CERNN), Universidade Tecnológica Federal do Paraná (UTFPR)
Deputado Heitor Alencar Furtado Street, 5000, 81280-340, PR, Brazil
Abstract. Computational interest in the lattice Boltzmann method (LBM) has grown over the years. The simplicity of its implementation and the local nature of most of its operations allow the use of parallel computing architectures. GPUs (graphics processing units) are therefore well suited for LBM implementations, offering good performance and scalability for a relatively low price. This work presents an LBM implementation on GPUs for the D2Q9 and D3Q19 velocity sets. It is written in the C/C++ programming language and uses CUDA to access the GPU resources. For code optimization, strategies such as the memory layout and the merging of all operations into a single kernel are presented. Design principles that facilitate switching between boundary conditions are also described. The code is validated and its performance is evaluated. Because of the lack of work on the matter, it is analyzed and discussed how some simulation parameters affect the performance of the code: single and double precision; number of threads per block; ECC (error-correcting code memory) on and off; storing macroscopic variables; domain size; and the rates of saving and residual calculation. Tests are made using Nvidia GPUs from the Kepler, Pascal and Volta micro-architectures. The results are considered satisfactory, achieving state-of-the-art performance.
Keywords: CUDA, Graphics processing unit (GPU), Lattice Boltzmann method (LBM), High performance computing (HPC)
CILAMCE 2019
Proceedings of the XL Ibero-Latin-American Congress on Computational Methods in Engineering, ABMEC.
Natal/RN, Brazil, November 11-14, 2019
1 Introduction
The lattice Boltzmann method (LBM) is an alternative to traditional CFD (computational fluid dy-
namics) methods, like FVM (finite volume method) or FDM (finite difference method), for fluid sim-
ulation. It solves the lattice Boltzmann equations (LBE), as opossed to solving Navier-Stokes. It has
been receiving a lot of attention over the last years, due to its computational benefits over other methods.
One of the main benefits is that most of the LBM’s operations are local, which allows the use of parallel
architectures for simulation.
The implementation of the LBM on GPUs (graphics processing units) has been studied since the early 2000s, e.g., by Li et al. [1], because of the cost effectiveness it achieves. Since then, many works, such as Mawson and Revell [2], Habich et al. [3] and Herschlag et al. [4], have proposed techniques for LBM optimization on GPUs. The main strategies concern memory layout, since performance is mostly limited by memory bandwidth. The release of CUDA by Nvidia [5] has facilitated the development of GPU algorithms, and it is used by most recent works on the LBM for GPUs.
This work presents an LBM implementation on GPU using the CUDA API and the C/C++ language. The design choices and optimization techniques are based on Januszewski and Kostur [6] and Mawson and Revell [2]. The code is validated using parallel plates flow and the turbulent Taylor-Green vortex (TGV). The error for single precision is evaluated using the TGV in two dimensions.
The impact of simulation parameters and code changes on the program's performance is analyzed, such as single and double precision, number of threads per block, ECC (error-correcting code memory) on and off, storing macroscopic variables in global memory, domain size, and the rates of saving and residual calculation. State-of-the-art performance is achieved for the Tesla K20Xm and Tesla P100 GPUs. The Tesla V100 performance exceeds the limit imposed by the GPU bandwidth; possible reasons for this are discussed.
The structure of the paper is as follows. In Section 2 a mathematical description of the LBM is given. Section 3 briefly describes the GPU concepts relevant to the implementation and the motivation for using GPUs to run the LBM. Section 4 discusses the implemented algorithm, its design, the motivation for some choices and how the LBM performance is measured. Section 5 presents the code validation and the impact of single and double precision on the simulation error, and discusses how parameters impact the code's performance. Section 6 presents the conclusions.
2 LBM
The lattice Boltzmann method (LBM) is a mesoscopic method for fluid flow simulations. It is usually discretized on a regular Cartesian grid, in which each node represents part of the fluid and holds a set of populations (f_i), each one representing the distribution of the fluid along a given direction (\vec{c}_i), given by a sum of unit vectors (\vec{e}_i). The directions are defined by the velocity set, represented as DnQm, which also defines the weight (w_i) of each population, where n is the number of dimensions (usually two or three) and m is the number of discrete velocities. In this work, D3Q19 and D2Q9 were used.
The governing equation of the LBM is the lattice Boltzmann equation (LBE). In the algorithm it is usually split into two parts: the collision, when the right-hand side of Eq. (1) is computed, and the streaming, when that value is assigned to the left-hand side of Eq. (1).
Table 1. D2Q9 velocity set.

i      0     1     2     3     4     5     6     7     8
w_i    4/9   1/9   1/9   1/9   1/9   1/36  1/36  1/36  1/36
c_ix   0     +1    0     -1    0     +1    -1    -1    +1
c_iy   0     0     +1    0     -1    +1    +1    -1    -1
Table 2. D3Q19 velocity set.

i      0    1     2     3     4     5     6     7     8     9     10    11    12    13    14    15    16    17    18
w_i    1/3  1/18  1/18  1/18  1/18  1/18  1/18  1/36  1/36  1/36  1/36  1/36  1/36  1/36  1/36  1/36  1/36  1/36  1/36
c_ix   0    +1    -1    0     0     0     0     +1    -1    +1    -1    0     0     +1    -1    +1    -1    0     0
c_iy   0    0     0     +1    -1    0     0     +1    -1    0     0     +1    -1    -1    +1    0     0     +1    -1
c_iz   0    0     0     0     0     +1    -1    0     0     +1    -1    +1    -1    0     0     -1    +1    -1    +1
Figure 1. Velocities for D3Q19 and D2Q9: (a) D3Q19 (from Kruger et al. [7]); (b) D2Q9.
In Eq. (1), f_i is the probability density function (population) and \Omega_i is the collision operator for direction i, at position \vec{x} and time t:

f_i(\vec{x} + \vec{c}_i \Delta t, t + \Delta t) = f_i(\vec{x}, t) + \Omega_i(\vec{x}, t).   (1)
All discretized values are defined in terms of lattice units. For simplicity, the lattice speed \Delta x / \Delta t is assumed to be one, and \vec{c}_i is the i-th velocity of the velocity set.
One of the most common collision operators is the Bhatnagar-Gross-Krook (BGK) operator by Bhatnagar et al. [8], defined by Eq. (2):

\Omega_i(f) = -\frac{f_i - f_i^{eq}}{\tau} \Delta t,   (2)
where \tau is the relaxation time, related to the fluid's viscosity via \tau = (1 + 6\nu)/2, and f^{eq} is the equilibrium distribution, given by Eq. (3):

f_i^{eq}(\vec{x}, t) = \rho w_i \left[ 1 + \frac{\vec{u} \cdot \vec{c}_i}{c_s^2} + \frac{(\vec{u} \cdot \vec{c}_i)^2}{2 c_s^4} - \frac{\vec{u} \cdot \vec{u}}{2 c_s^2} \right],   (3)
in which w_i is the velocity weight, c_s is the speed of sound (1/\sqrt{3} for D2Q9 and D3Q19), and \vec{u} and \rho are the macroscopic velocity and density, respectively, given in terms of the populations by Eq. (4):

\rho = \sum_i f_i, \qquad \rho \vec{u} = \sum_i f_i \vec{c}_i.   (4)
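To make Eqs. (3) and (4) concrete, the sketch below computes the macroscopic moments and the equilibrium populations for D2Q9 in lattice units (c_s^2 = 1/3). It is only an illustration following the velocity-set ordering of Table 1; the function name and interface are assumptions, not the actual code of this work.

// Sketch: macroscopic moments (Eq. 4) and BGK equilibrium (Eq. 3) for D2Q9.
__host__ __device__ void momentsAndEquilibriumD2Q9(const float f[9], float feq[9])
{
    const float w[9]  = {4.f/9.f, 1.f/9.f, 1.f/9.f, 1.f/9.f, 1.f/9.f,
                         1.f/36.f, 1.f/36.f, 1.f/36.f, 1.f/36.f};
    const int   cx[9] = {0, 1, 0, -1,  0, 1, -1, -1,  1};
    const int   cy[9] = {0, 0, 1,  0, -1, 1,  1, -1, -1};

    // Eq. (4): zeroth and first moments of the populations.
    float rho = 0.f, ux = 0.f, uy = 0.f;
    for (int i = 0; i < 9; i++) {
        rho += f[i];
        ux  += f[i] * cx[i];
        uy  += f[i] * cy[i];
    }
    ux /= rho;
    uy /= rho;

    // Eq. (3): second-order equilibrium, using c_s^2 = 1/3.
    const float uu = 1.5f * (ux * ux + uy * uy);           // u.u / (2 c_s^2)
    for (int i = 0; i < 9; i++) {
        const float cu = 3.f * (cx[i] * ux + cy[i] * uy);  // c_i.u / c_s^2
        feq[i] = rho * w[i] * (1.f + cu + 0.5f * cu * cu - uu);
    }
}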
There are modifications to the LBM, like multiple relaxation times (MRT), regularization, immersed
boundary method (IBM), force terms and many others.
Boundary conditions are also key to LBM simulations. As stated in Kruger et al. [7], there are two classes of methods: link-wise and wet-node. In the link-wise family, the boundary nodes are \Delta x/2 away from the wall, while for wet-node methods the nodes lie on the wall. This must be taken into account when interpreting simulation results.
The bounce-back boundary condition is a link-wise method and one of the most used for stationary walls. It is defined as

f_i = f_{\bar{i}}, \qquad \vec{c}_{\bar{i}} = -\vec{c}_i,   (5)
where f_i is the unknown population and \bar{i} denotes the opposite direction. For moving walls or pressure boundaries, the Zou-He boundary condition, proposed by Zou and He [9], is widely used. It is a wet-node method and can be defined as

f_i^{neq} = f_{\bar{i}}^{neq} + \frac{\vec{t} \cdot \vec{c}_i}{|\vec{c}_i|} N_t, \qquad \vec{c}_{\bar{i}} = -\vec{c}_i,   (6)

in which f_i are the unknown populations and f_i^{neq} = f_i - f_i^{eq}. \vec{t} is the tangential vector of the wall and N_t is the transverse momentum correction, which can be determined by forcing a velocity or pressure condition on the node and solving the linear system using Eq. (4). Hecht and Harting [10] show the solution for D3Q19.
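As a concrete illustration of the bounce-back rule of Eq. (5) for D2Q9, the sketch below maps each population to its opposite direction. The array and function names are hypothetical; in an actual kernel only the unknown populations pointing back into the fluid would be overwritten.

// Sketch: bounce-back (Eq. 5) on a D2Q9 boundary node.
// opp[i] is the index of the direction opposite to i (c_opp = -c_i),
// following the ordering of Table 1.
__device__ void bounceBackD2Q9(float f[9])
{
    const int opp[9] = {0, 3, 4, 1, 2, 7, 8, 5, 6};
    float tmp[9];
    for (int i = 0; i < 9; i++)
        tmp[i] = f[opp[i]];     // the unknown f_i takes the value of its opposite
    for (int i = 0; i < 9; i++)
        f[i] = tmp[i];
}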
3 GPU
GPUs are well suited for parallel operations, mainly when the operations are the same over the whole domain. Also, the floating-point operation throughput (FLOPS) of modern GPUs is very high, with Nvidia [11] reporting up to 15.7 TFLOPS for single precision. Nvidia [5] compares the theoretical FLOPS and memory bandwidth of Nvidia GPUs and Intel CPUs over the years; the graphics cards present considerably higher values for both. This makes the LBM fit very well on GPUs, due to the local nature of the method and to it being both memory and processing (e.g., calculations) intensive.
The development of CUDA by Nvidia [5] has led to many LBM applications using it, such as Habich et al. [3], Januszewski and Kostur [6] and Schreiber et al. [12]. One of the reasons for that is its low learning curve and its easy integration with C/C++ code.
CUDA uses threads to parallelize the program. They are grouped in sets of 32 called warps, as reported in Nvidia [5]. All threads in a warp must belong to the same block and always execute the same instruction; if a thread diverges, it is disabled. The number of threads per block should be a multiple of 32 to optimize performance, otherwise resources are wasted by processing a warp with fewer than 32 threads.
A block must be processed by only one streaming multiprocessor (SM), but more than one block can be processed by one SM if resources (such as registers) are available. The number of threads per block and the number of blocks are defined in the kernel launch, a special CUDA function call.
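The following minimal sketch (not taken from the present code; all names are illustrative) shows these concepts: a kernel executed by one thread per element and a launch whose block size is a multiple of the warp size.

// Sketch: a minimal CUDA kernel and its launch configuration.
__global__ void scaleArray(float* data, float factor, int n)
{
    const int idx = threadIdx.x + blockDim.x * blockIdx.x;  // global thread index
    if (idx < n)                                            // diverging threads are disabled
        data[idx] *= factor;
}

// Host side:
// dim3 block(64);                            // 2 warps per block (multiple of 32)
// dim3 grid((n + block.x - 1) / block.x);    // enough blocks to cover all n elements
// scaleArray<<<grid, block>>>(dData, 2.0f, n);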
4 Algorithm
The algorithm’s core is the main LBM operations, collision, streaming, boundary conditions and
macroscopic variables calculation. There are two main schemes of LBM algorithm: push and pull. Push
first make the collision and then stream the populations to the neighbour nodes, pushing them. Pull first
stream the population from adjacent nodes and then collides it. Both schemes are shown in Fig. 2.
Figure 2. Representation of the streaming populations for the pull scheme on the left and for push scheme
on the right. D2Q9 is used. Source is Obrecht et al. [13].
For the pull scheme, there are misaligned reads from memory and aligned writes. On the other
hand, push scheme has aligned reads and misaligned writes to memory. Some works, as Mawson and
Revell [2], shows that the cost of misaligned writes are more impactful than misaligned writes for Kepler
architecture. Despite this, pull presented no significant overcome on performance over the push. So, for
simplicity, the push scheme was used.
The code was written using C/C++ and CUDA and compiled with the nvcc compiler. Earlier works on the LBM for GPUs using CUDA, such as Xian and Takayuki [14] and Calore et al. [15], used one kernel for each LBM operation, usually one for streaming, one for collision and another for boundary conditions. This, however, leads to higher synchronization and kernel-launch overheads and diminishes the temporal locality of the algorithm. So the approach taken for the code organization was the same as in Januszewski and Kostur [6], using a single kernel for all LBM operations, as shown in Algorithm 1. Another kernel is used for the initialization of the populations and macroscopic variables.
Algorithm 1 Kernel structure of the LBM push algorithm for one node in the domain
Load all of the node's populations into local variables
if Node has boundary condition then
Apply boundary condition
end if
Calculate macroscopic variables (velocity and density) using Eq. (4)
if macroscopic variables are required then
Write macroscopic variables to global array
end if
Collide and stream populations to global array using Eq. (1)
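A condensed, self-contained sketch of how Algorithm 1 can map onto a single CUDA kernel is shown below for D2Q9 on a periodic domain. It follows the push scheme of Eq. (1) with the BGK operator of Eq. (2); the boundary-condition branch and the writing of macroscopic variables are only indicated by comments, and all names, constants and the index function are illustrative assumptions rather than the exact code of this work.

// Sketch: fused collision + streaming kernel (push scheme), one thread per node.
#define NX 128
#define NY 128
__device__ __forceinline__ int idx(int x, int y, int i) { return x + NX * (y + NY * i); }

__global__ void lbmStepD2Q9(const float* __restrict__ fIn, float* __restrict__ fOut, float tau)
{
    const int x = threadIdx.x + blockDim.x * blockIdx.x;
    const int y = blockIdx.y;
    if (x >= NX || y >= NY) return;

    const float w[9]  = {4.f/9.f, 1.f/9.f, 1.f/9.f, 1.f/9.f, 1.f/9.f,
                         1.f/36.f, 1.f/36.f, 1.f/36.f, 1.f/36.f};
    const int   cx[9] = {0, 1, 0, -1,  0, 1, -1, -1,  1};
    const int   cy[9] = {0, 0, 1,  0, -1, 1,  1, -1, -1};

    // Load all of this node's populations into registers (SoA layout).
    float f[9];
    for (int i = 0; i < 9; i++) f[i] = fIn[idx(x, y, i)];

    // (The boundary-condition branch, selected by the node's bitmap, would go here.)

    // Macroscopic moments, Eq. (4).
    float rho = 0.f, ux = 0.f, uy = 0.f;
    for (int i = 0; i < 9; i++) { rho += f[i]; ux += f[i] * cx[i]; uy += f[i] * cy[i]; }
    ux /= rho; uy /= rho;
    // (If required, rho, ux and uy would be written to their global arrays here.)

    // BGK collision (Eq. 2) and push streaming (Eq. 1), with periodic wrapping.
    const float uu = 1.5f * (ux * ux + uy * uy);
    for (int i = 0; i < 9; i++) {
        const float cu  = 3.f * (cx[i] * ux + cy[i] * uy);
        const float feq = rho * w[i] * (1.f + cu + 0.5f * cu * cu - uu);
        const int   xn  = (x + cx[i] + NX) % NX;
        const int   yn  = (y + cy[i] + NY) % NY;
        fOut[idx(xn, yn, i)] = f[i] - (f[i] - feq) / tau;
    }
}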
All global arrays are one-dimensional, and the spatial location is converted from 3D to 1D via an index function for the macroscopic variables (index scalar) and another for the populations (index pop). All kernel launches consist of blocks with t threads along the x axis, where t is usually 32, 64 or 128. The grid size is calculated so that the whole lattice domain is covered in all directions (x, y and z). There are no optimizations using __launch_bounds__ or -maxrregcount, because these are usually handcrafted for each GPU; this may be a possible improvement to the code's performance, as pointed out by Calore et al. [15].
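A sketch of the 3D-to-1D index conversion and of the launch configuration described above is given below. The function and variable names are illustrative, NX, NY and NZ are assumed to be compile-time lattice dimensions, and NX is assumed to be divisible by the block size.

// Sketch: index functions (SoA, x axis contiguous, see Section 4.1) and launch setup.
__device__ __forceinline__ int idxScalar(int x, int y, int z)
{
    return x + NX * (y + NY * z);
}
__device__ __forceinline__ int idxPop(int x, int y, int z, int i)
{
    return x + NX * (y + NY * (z + NZ * i));
}

// Host side:
// const unsigned int nThreads = 64;            // 32, 64 or 128 threads along x
// dim3 block(nThreads, 1, 1);
// dim3 grid(NX / nThreads, NY, NZ);            // grid covers the whole lattice
// gpuLBM<<<grid, block>>>(dPopA, dPopB /* ... */);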
For the boundary conditions, a scheme similar to the one by Januszewski and Kostur [6] is used. A 32-bit bitmap is used to classify each node and its boundary condition. The bitmap contains: whether or not the node has a boundary condition to apply; the normal direction of the node; the boundary condition scheme to apply (Zou-He, bounce-back, free slip, etc.); and the index into a global array that contains the macroscopic values of the node (such as density or velocity) that are used. This allows for easy switching between boundary conditions and schemes.
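One possible encoding of such a bitmap is sketched below. The field widths and helper names are illustrative assumptions, not the exact layout used in this work.

// Sketch of a 32-bit node bitmap:
//   bit  0     : node has a boundary condition (1) or not (0)
//   bits 1-3   : normal direction of the node
//   bits 4-7   : boundary-condition scheme (bounce-back, Zou-He, free slip, ...)
//   bits 8-31  : index into a global array of prescribed density/velocity values
typedef unsigned int NodeType;

__host__ __device__ bool     hasBC   (NodeType n) { return  n & 0x1u; }
__host__ __device__ unsigned normalOf(NodeType n) { return (n >> 1) & 0x7u; }
__host__ __device__ unsigned schemeOf(NodeType n) { return (n >> 4) & 0xFu; }
__host__ __device__ unsigned valueIdx(NodeType n) { return  n >> 8; }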
4.1 Memory layout
As discussed in Mawson and Revell [2], Herschlag et al. [4], Januszewski and Kostur [6], Kruger et al. [7] and Schreiber et al. [12], the performance of the LBM is limited by the memory bandwidth. So it is very important that the algorithm has good spatial and temporal locality, which is achieved mainly through the memory layout.
There are two main memory layouts for the LBM: the array of structures (AoS) and the structure of arrays (SoA). The AoS consists of an array of structures, each holding the populations of one node. In this layout, population zero of a node is followed, in memory, by population one of the same node and so on; when population zero is loaded into the cache, the next populations of the same node are also loaded.
In the SoA, all populations zero are contiguous in memory, so population zero of node i is followed, in memory, by population zero of node i+1, and the same holds for all other populations. In this layout, when loading a population of a node i, the same population of the adjacent nodes is also loaded.
Figure 3. Representation in memory of structures containing the variables X, Y, Z and W with the array of structures layout and with the structure of arrays layout.
Despite the AoS being more common and intuitive, it presents low spatial locality for SIMD (single instruction, multiple data) architectures such as GPUs. The most common layout in the literature, as observed by Herschlag et al. [4], is the SoA. It takes advantage of how the GPU accesses memory, allowing contiguous memory reads or writes, depending on the LBM scheme used (push or pull).
So the chosen layout was the SoA, for both populations and macroscopic variables, with the x axis made contiguous. This is because of the thread contiguity in x within the warps, as stated in Nvidia [5].
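The two layouts can be summarized by their addressing functions for an NX x NY x NZ lattice with Q populations per node (illustrative names; NX, NY, NZ and Q are assumed constants):

// Sketch: AoS versus SoA addressing of the populations.
int idxAoS(int x, int y, int z, int i)   // the Q populations of a node are adjacent
{
    return i + Q * (x + NX * (y + NY * z));
}
int idxSoA(int x, int y, int z, int i)   // population i of neighbouring x nodes is adjacent
{
    return x + NX * (y + NY * (z + NZ * i));
}

With the SoA layout, the 32 threads of a warp, which handle 32 consecutive x positions, read or write 32 consecutive addresses, so the accesses coalesce into few memory transactions.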
4.2 Host functions
Some tasks can only be done, or are more efficient, with host functions. One example is saving the macroscopic variables to the hard drive, which can only be done through the operating system, therefore by the host. Another one is calculating the velocity (or any other macroscopic) residual using Eq. (7):

Res = \frac{\sum |v_1 - v_0|}{\sum |v_1|},   (7)

where v_1 is the present velocity value and v_0 is the reference velocity, from a previous iteration, against which it is compared.
This requires summing values over the whole domain, which can be quite slow on the GPU. But the main reason for using a host function is that no GPU resources are used, so the GPU capacity is not diminished and the host calculates the residual concurrently with the simulation.
The function for saving the macroscopic variables is very simple, writing the binary values of an array to a file. The one for calculating the residual just loops over the domain, computing one element of each summation at a time and, after the loop, adding them all together.
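A minimal sketch of the host-side residual of Eq. (7) is shown below, assuming the velocity values of the current and reference iterations are available in host arrays (names are illustrative):

#include <cmath>
#include <cstddef>

// Sketch: residual of Eq. (7) computed on the host over all lattice nodes.
double residual(const float* v1, const float* v0, std::size_t nNodes)
{
    double num = 0.0, den = 0.0;
    for (std::size_t n = 0; n < nNodes; n++) {
        num += std::fabs(static_cast<double>(v1[n]) - static_cast<double>(v0[n]));
        den += std::fabs(static_cast<double>(v1[n]));
    }
    return num / den;
}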
4.3 Performance measurement
The performance of LBM algorithms is usually measured by two parameters. One is the number of MLUPS (million lattice updates per second), which allows predicting the duration of a simulation. The other is the bandwidth, calculated using Eq. (8):

Bandwidth = \frac{2 \cdot \mathrm{sizeof(float)} \cdot \text{Number of nodes} \cdot Q \cdot \text{Number of iterations}}{\text{Simulation time}},   (8)

where Q is the number of populations of the velocity set and float can be single or double precision. The multiplication by two is because every population is loaded and then stored, so there are two memory operations per population in every iteration. The result is given in B/s (bytes per second) and can be converted to MB/s or GB/s.
Because the LBM is mostly limited by the bandwidth, as discussed in Mawson and Revell [2], the maximum theoretical MLUPS is bounded by it as well. It can be calculated from the maximum bandwidth of the GPU, converting it to MLUPS using Eq. (9):

MLUPS = \frac{\text{Bandwidth}}{Q \cdot 2 \cdot \mathrm{sizeof(float)} \cdot 10^6}.   (9)
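As a sketch, Eqs. (8) and (9) translate directly into the following helper functions (illustrative names; bytesPerValue is 4 for single and 8 for double precision):

#include <cstddef>

// Sketch: bandwidth of Eq. (8), MLUPS, and the bandwidth-bound MLUPS of Eq. (9).
double bandwidthBytesPerSecond(std::size_t nNodes, int Q, std::size_t nSteps,
                               double seconds, std::size_t bytesPerValue)
{
    // every population is read once and written once per time step
    return 2.0 * bytesPerValue * nNodes * Q * nSteps / seconds;
}

double mlups(std::size_t nNodes, std::size_t nSteps, double seconds)
{
    return static_cast<double>(nNodes) * nSteps / (seconds * 1.0e6);
}

double maxMlups(double maxBandwidthBytesPerSecond, int Q, std::size_t bytesPerValue)
{
    return maxBandwidthBytesPerSecond / (Q * 2.0 * bytesPerValue * 1.0e6);   // Eq. (9)
}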
5 Results
For the simulations, three different machines were used, each with a GPU model from one of three Nvidia micro-architectures: Kepler (Tesla K20Xm PCIe 6 GB with a GK110), Pascal (Tesla P100 PCIe 16 GB with a GP100) and Volta (Tesla V100 SXM2 16 GB with a GV100). The processor was an Intel i7-7700K 4.20 GHz for Kepler and an Intel Xeon 2.30 GHz for both Pascal and Volta. ECC was enabled for Pascal and Volta, while for Kepler there are tests with ECC enabled and disabled. For all GPUs the boost clock was disabled.
All machines used an SSD for recording results. The tests were run using the CUDA 10.1 Toolkit, on a Windows 10 64-bit system for Kepler and on an Ubuntu 18.04 64-bit system for Pascal and Volta. All the validations and precision tests were made on the Kepler machine.
5.1 Validation
Parallel plates flow was used for the validation of D2Q9 and D3Q19. Equation (10) gives the pressure boundary conditions imposed and the normalized analytical solution for the flow velocity. For all simulations the domains had N^3 lattice nodes for D3Q19 and N^2 for D2Q9. Double precision is used. The L_2 norm, given by Eq. (11), was calculated for the velocities in both cases, using a plane perpendicular to the flow at x = 0.5 for the parallel plates.
\rho(x = 0) = \rho_0,
\rho(x = 1) = \rho_0 + \frac{12 \rho_0 u_0 (\tau - 0.5)}{N},
\frac{u_x(y)}{u_0} = 6 (y - y^2),
x, y \in [0, 1].   (10)
For the lattice refinement test, N was doubled, u_0 was halved and the number of iterations was multiplied by four, keeping the physical time and \tau constant. The L_2 norm is O(\Delta x^2), as stated in Kruger et al. [7]. Figure 4 shows the validation results.
Figure 4. Validation results for parallel plates: (a) L_2 for D2Q9; (b) L_2 for D3Q19; (c) velocity profile for D3Q19 with N = 64. For both velocity sets, u_0 = 0.1, \rho_0 = 1 and \tau = 0.74 were used for the minimum N. For each subsequent N, u_0 was divided by two and the number of steps multiplied by four, such that the physical time and \tau were kept constant.
L_2 = \sqrt{ \frac{\sum (q_{\mathrm{numerical}} - q_{\mathrm{analytical}})^2}{\sum q_{\mathrm{analytical}}^2} }.   (11)
The error decreases as expected for both velocity sets, being reduced by a factor of 0.25 when N is doubled. A simulation of the turbulent Taylor-Green vortex was also performed. The initialization was made using f^{neq}, with \rho and \vec{u} given by Eq. (12).
u_x(x, y, z) = u_0 \sin(x) \cos(y) \cos(z),
u_y(x, y, z) = -u_0 \cos(x) \sin(y) \cos(z),
u_z(x, y, z) = 0,
\rho(x, y, z) = \rho_0 + \frac{3 \rho_0 u_0^2}{16} [\cos(2x) + \cos(2y)][\cos(2z) + 2],
x, y, z \in [0, 2\pi].   (12)
The dissipation rate, which is the time derivative of the integral kinetic energy, was observed over time. Figure 5 shows the results in comparison with Nathen et al. [16] and Brachet et al. [17]. The present work shows a better approximation to Brachet et al. [17] than Nathen et al. [16] does, which may be due to the different initialization scheme.
Figure 5. Dissipation rate for the Taylor-Green vortex over time. N = 256 and u_0 = 0.05 were used.
5.2 Single and double precision
Due to the higher performance of single precision over double, the impact of its numerical error on the simulation results was evaluated. For that, tests were made using the Taylor-Green vortex flow and D2Q9, whose analytical solution is given by Eq. (13). For the initialization, f^{eq} and regularization were used. The L_2 norm is calculated over the whole domain. Figure 6 shows the results.
u_x(x, y, t) = u_0 \cos(x) \sin(y) e^{-2\nu t},
u_y(x, y, t) = -u_0 \sin(x) \cos(y) e^{-2\nu t},
\rho(x, y, t) = \rho_0 - 0.75 \rho_0 u_0^2 [\cos(2x) + \cos(2y)] e^{-4\nu t},
x, y \in [0, 2\pi].   (13)
For the velocities, both when varying N and when varying \tau, the single precision error is higher than expected for u_0 \le 0.01. For u_0 \ge 0.02, there is no difference between single and double precision, so for such values single precision does not introduce significant numerical error in the velocities compared to double. A similar result was obtained by Januszewski and Kostur [6]. But for cases with complex geometries, a very large domain, or that heavily differ from the case simulated, this may not be extendable.
For the density, single precision shows a significantly higher error than double: the error for single precision is between 10^1 and 10^7 times the error for double. The expected O(\Delta x^2) behaviour is also not observed for single precision, with the L_2 norm growing as u_0 is decreased; this was found both for N constant and for \tau constant. For cases where the density field is important, it is highly recommended to use double precision.
Figure 6. L_2 errors for single and double precision for the Taylor-Green vortex: (a) N = 128; (b) \tau = 0.581. A curve showing the expected error behaviour is also presented. On the left are the errors for N constant, varying u_0 and \tau; on the right are the errors for \tau constant, varying u_0 and N. Both were made such that the physical time is preserved.
5.3 Benchmark
The performance of the LBM algorithm is measured by the MLUPS and the bandwidth of the simulation. The case used was the Taylor-Green vortex, with periodic conditions on all surfaces, \tau = 0.9 and u_0 = 0.05, for D3Q19. Table 3 shows the performance achieved by previous works.
The present work achieves state-of-the-art performance for both Kepler and Pascal. No work using the Tesla V100 was found. Table 5 shows the performance achieved by the present work. It is important to mention that only one test was made with single precision for each GPU, so it is very likely that the code can achieve greater performance in that configuration. This is true mainly for Volta, because the only single precision test was made for N = 128, while Volta presented considerably higher performance for N = 32.
Table 3. Performance results from previous works for D3Q19. The MLUPS are the peak values for the described configuration.

GPU            Reference                     Precision   ECC   MLUPS
Tesla K20Xm    Januszewski and Kostur [6]    Double      Off   649
Tesla K20Xm    Januszewski and Kostur [6]    Single      Off   1247
Tesla P100     Herschlag et al. [4]          Double      On    1659
Tesla P100     Schreiber et al. [12]         Double      Off   1580
Tesla P100     Schreiber et al. [12]         Single      Off   2960
For the bandwidth measurement, Table 4 shows the results of the bandwidthTest program from Nvidia [5]. The values are lower than the maximum reported by Nvidia, but reflect the GPU memory transfer capability much better.
The performance gain of single precision (i.e. float) over double precision (i.e. double) is very significant, ranging from 60% to 90%. This is due both to memory and to calculations. The size of single precision variables is half that of double, so the amount of memory read is reduced by half.
Table 4. Measured bandwidth, taken as the average of five runs of bandwidthTest for each GPU, and the maximum MLUPS possible for D3Q19, calculated using Eq. (9) with sizeof(float) = 8 bytes (double precision). The test program transfers 32 MB of data 100 times using the option cudaMemcpyDeviceToDevice.

GPU            ECC   Avg. bandwidth (GB/s)   Max. MLUPS
Tesla K20Xm    Off   194.7                   687.7
Tesla K20Xm    On    173.8                   613.9
Tesla P100     On    501.0                   1769.6
Tesla V100     On    738.8                   2584.7
Table 5. Performance results from the present work for D3Q19. The MLUPS are the peak values for the described configuration. The performance ratio is calculated as Bandwidth / Avg. bandwidth, where the Avg. bandwidth is taken from Table 4.

GPU            Precision   ECC   MLUPS    Bandwidth (GB/s)   Performance ratio (%)
Tesla K20Xm    Double      Off   635.6    179.9              92.4%
Tesla K20Xm    Single      Off   1111.1   157.3              90.5%
Tesla P100     Double      On    1643.6   465.3              92.9%
Tesla P100     Single      On    3051.9   432.0              86.2%
Tesla V100     Double      On    3420.8   968.5              131.1%
Tesla V100     Single      On    3959.8   560.6              75.9%
For calculations, Kepler (Nvidia [18]) has a 1:3 ratio between double and single precision throughput (i.e., it takes three times longer to perform an operation with a double than with a float), while Pascal and Volta have a 1:2 ratio (Nvidia [11, 19]). So the calculations are also faster in single precision.
These are the main reasons for the great difference in MLUPS. On the other hand, single precision presents higher numerical error, and for cases such as DNS this may have a great impact on the simulation. Also, the algorithm's bandwidth for all architectures, especially Kepler and Volta, is lower for single precision than for double.
The Volta achieves a bandwidth 30% greater than the theoretical maximum for N = 32, the smallest N used for simulation. The probable reason for that is the size of the L1 cache of the Tesla V100, which has 80 streaming multiprocessors (SMs) with 128 KB of L1 each, reaching a maximum of 10 MB of L1 memory. The Tesla P100, on the other hand, has 24 KB of L1 per SM and 56 SMs, totalling 1.3 MB of L1 memory, 13% of the Tesla V100 total. The total GPU memory used for N = 32 is 9.5 MB.
This large L1 capacity may minimize global memory accesses for small simulations, so that the bandwidth is not limited by global accesses. Another reason may be architectural differences between Volta and Pascal/Kepler, with memory accesses optimized on Volta. Further investigation of LBM performance on Volta devices is required.
Figure 7 shows the performance as a function of the number of threads per block for each GPU. Increasing the block size, a minor increase in MLUPS is observed for Kepler and Volta, while Pascal did not show significant variation. The difference from 32 threads to 128 is 5.5% for Kepler, 0.2% for Pascal and 1.1% for Volta. For blocks with more than one dimension, the performance was reduced in all tests. So, as long as the number of threads is a multiple of 32 (the warp size, Nvidia [5]), its impact on the simulations is low, being considerable only for the Kepler architecture.
Figure 7. MLUPS for Kepler, Pascal and Volta with N = 128, varying the number of threads per block. The blocks are one-dimensional, with threads only along the x axis. ECC is off for Kepler and double precision is used.
Table 6. MLUPS when storing the macroscopic variables (density and velocity in each direction) in global memory every iteration and when never storing them. The difference is calculated as 1 - (MLUPS storing)/(MLUPS no storing). For N = 128, 64 threads per block were used; for N = 32, 32 threads per block. ECC for Kepler is off and double precision is used.

GPU      N     MLUPS no storing   MLUPS storing   Difference (%)
Kepler   32    516.1              494.4           4.2%
Kepler   128   635.54             568.33          10.6%
Pascal   32    711.33             647.0           9.0%
Pascal   128   1628.8             1463.5          10.2%
Volta    32    3420.8             3145.6          8.0%
Volta    128   2466.41            2193.1          11.1%
The difference between storing the macroscopic variables in global memory every iteration and never storing them was also evaluated. Table 6 shows the results for each GPU. The loss for storing is quite high, around 10%, so the macroscopic variables should be written to global memory only when necessary, for performance improvement.
The impact of ECC on Kepler was also measured: the MLUPS decreased by 18.8% for N = 128 and 64 threads per block, so ECC has a great impact on LBM performance for the Kepler architecture. The loss in performance is higher than the bandwidth loss, which is 10.8%. For Pascal and Volta, with ECC turned on, the bandwidth ratio reaches 92.2% and 94.5% for the same case, while on Kepler it reaches only 84.2% (considering the maximum bandwidth as the one from Table 4 with ECC on).
This may be due to ECC optimizations in the Pascal and Volta architectures, or to the software not being optimized for it. So if higher performance is needed, ECC is a critical point for Kepler and it should be turned off, despite this not being recommended.
The impact of the lattice resolution on the performance was also evaluated; Fig. 8 shows the results. For both Kepler and Pascal, the performance increased with the size of the domain. The gain from N = 32 to N = 128 was much higher for Pascal than for Kepler: 129.9% for the former and 16.7% for the latter. Volta has a greater MLUPS for N = 32, for the reasons already discussed, and does not show a significant difference from N = 64 to N = 128.
This is a great feature because, as long as this relation holds, the simulation becomes faster (or at least does not slow down) for higher resolutions.
Figure 8. MLUPS for Kepler, Pascal and Volta with 32 threads per block, varying N. The blocks are one-dimensional, with threads only along the x axis. ECC is off for Kepler and double precision is used.
It is very common to save macroscopic variables during the simulation, for example to have a restart point in case of a system failure, or to observe transient phenomena as in DNS, among other reasons. The saving rate impacts the performance, because a device synchronization is necessary, as well as the transfer from device memory to host memory. Figure 9 shows the results of the saving tests.
Figure 9. Performance for each saving rate, which is the number of iterations between savings: (a) N = 32; (b) N = 128. 100% represents the performance with no saving. N = 32 uses 32 threads per block and N = 128 uses 64 threads per block. ECC was disabled for Kepler and double precision is used.
For all GPUs, the impact of a saving rate lower than 500 is quite significant, with more than 10% loss in performance. For low saving rates, the simulation gets very slow, losing all the advantage of using a GPU. Kepler and Pascal present similar behaviour for both N = 32 and N = 128. On Volta the impact is higher, mostly for N = 32; this is because the Volta performance is higher, so the overheads of saving are much more significant than for the other GPUs. For both values of N, with a saving rate higher than 1000, the impact on performance is less than 5% for Kepler and Pascal. The same happens for Volta with a saving rate of 2500.
Another common task is to keep track of some macroscopic variables and their values over time; one example is calculating the velocity residual with Eq. (7). Figure 10 shows the results of the residual calculation tests.
Like in the saving-rate test, for a residual calculation rate lower than 500 the performance loss is very significant.
Figure 10. Performance for each rate of residual calculation, which is the number of iterations between residual calculations: (a) N = 32; (b) N = 128. 100% represents the performance with no residual calculation. N = 32 uses 32 threads per block and N = 128 uses 64 threads per block. ECC was disabled for Kepler and double precision is used.
Kepler and Pascal also presented similar behaviour. However, the comparison of Pascal and Volta with Kepler must be made carefully, because different processors were used, which may have influenced the results. The impact on Volta is higher than on the others, for the same reason as in the saving test. The performance loss is negligible (less than 5%) for a residual calculation rate higher than 500 on Kepler and Pascal and higher than 1000 on Volta.
6 Conclusion
An implementation of the LBM for D2Q9 and D3Q19 using CUDA was presented. The algorithm uses the push scheme and combines all operations of the method (collision, streaming, boundary conditions and macroscopics calculation) in one kernel. The memory layout used and its motivation were discussed. Host functions and the performance measurement were also presented. The two velocity sets were validated using parallel plates flow.
The impact of single precision on the simulation error was analyzed. The results showed that it is not recommended for u_0 \le 0.01 or when density precision is crucial. The code's performance achieved the state of the art, with over 90% of the available bandwidth for all GPUs. It surpassed the theoretical maximum MLUPS for the Tesla V100, with a performance 30% above the maximum. It was shown that the impact of the saving and residual calculation rates is negligible for rates higher than 1000 steps.
Future works may analyze the impact of saving and residual calculation on multiple nodes, since data transfers between nodes are needed in that case, and add code support for more velocity sets and features, such as multiple relaxation times (MRT), the immersed boundary method (IBM), force terms and others. Due to the performance above the theoretical maximum on Volta, further investigation of that architecture and of the benefits of its optimizations for the LBM is also required.
Acknowledgements
This paper used the resources of CERNN - Research Center for Rheology and non-Newtonian Flu-
ids. The authors would like to thank Luiz Gustavo Ricardo for the technical help on the GPUs and Marco
Ferrari for the insights and tips given.
References
[1] Li, W., Wei, X., & Kaufman, A., 2003. Implementing lattice Boltzmann computation on graphics hardware. Visual Computer, vol. 19, pp. 444–456.
[2] Mawson, M. J. & Revell, A. J., 2014. Memory transfer optimization for a lattice Boltzmann solver
on Kepler architecture nVidia GPUs. Computer Physics Communications, vol. 185, n. 10, pp. 2566 –
2574.
[3] Habich, J., Zeiser, T., Hager, G., & Wellein, G., 2011. Performance analysis and optimization strate-
gies for a D3Q19 lattice Boltzmann kernel on nVIDIA GPUs using CUDA. Advances in Engineering
Software, vol. 42, n. 5, pp. 266 – 272. PARENG 2009.
[4] Herschlag, G., Lee, S., S. Vetter, J., & Randles, A., 2018. GPU Data Access on Complex Geome-
tries for D3Q19 Lattice Boltzmann Method. In 2018 IEEE International Parallel and Distributed
Processing Symposium (IPDPS), pp. 825–834.
[5] Nvidia, 2019. CUDA 10.1 Toolkit Documentation.
[6] Januszewski, M. & Kostur, M., 2014. Sailfish: A flexible multi-GPU implementation of the lattice
Boltzmann method. Computer Physics Communications, vol. 185, n. 9, pp. 2350 – 2368.
[7] Kruger, T., Kusumaatmaja, H., Kuzmin, A., Shardt, O., Silva, G., & Viggen, E. M., 2017. The
Lattice Boltzmann method: Principles and practice. Springer, 1 edition.
[8] Bhatnagar, P. L., Gross, E. P., & Krook, M., 1954. A Model for Collision Processes in Gases. I.
Small Amplitude Processes in Charged and Neutral One-Component Systems. Phys. Rev., vol. 94, pp.
511–525.
[9] Zou, Q. & He, X., 1997. On pressure and velocity boundary conditions for the lattice Boltzmann
BGK model. Physics of Fluids, vol. 9, n. 6, pp. 1591–1598.
[10] Hecht, M. & Harting, J., 2010. Implementation of on-site velocity boundary conditions for D3Q19
lattice Boltzmann simulations. Journal of Statistical Mechanics: Theory and Experiment, vol. 2010,
n. 1, pp. 10–18.
[11] Nvidia, 2017. NVIDIA Tesla V100 GPU Architecture. Technical report, Nvidia Corporation.
[12] Schreiber, M., Riesinger, C., Bakhtiari, A., Neumann, P., & Bungartz, H.-J., 2017. A Holistic
Scalable Implementation Approach of the Lattice Boltzmann Method for CPU/GPU Heterogeneous
Clusters. Computation, vol. 5.
[13] Obrecht, C., Kuznik, F., Tourancheau, B., & Roux, J.-J., 2011. A new approach to the lattice
Boltzmann method for graphics processing units. Computers and Mathematics with Applications,
vol. 61, n. 12, pp. 3628 – 3638. Mesoscopic Methods for Engineering and Science — Proceedings of
ICMMES-09.
[14] Xian, W. & Takayuki, A., 2011. Multi-GPU performance of incompressible flow computation by
lattice Boltzmann method on GPU cluster. Parallel Computing, vol. 37, n. 9, pp. 521 – 535. Emerging
Programming Paradigms for Large-Scale Scientific Computing.
[15] Calore, E., Gabbana, A., Kraus, J., Pellegrini, E., Schifano, S., & Tripiccione, R., 2016. Massively
parallel lattice–Boltzmann codes on large GPU clusters. Parallel Computing, vol. 58, pp. 1 – 24.
[16] Nathen, P., Gaudlitz, D., Krause, M., & Adams, N., 2018. On the Stability and Accuracy of the
BGK, MRT and RLB Boltzmann Schemes for the Simulation of Turbulent Flows. Communications
in Computational Physics, vol. 23, pp. 846–876.
[17] Brachet, M. E., Meiron, D. I., Orszag, S. A., Nickel, B. G., Morf, R. H., & Frisch, U., 1983.
Small-scale structure of the Taylor–Green vortex. Journal of Fluid Mechanics, vol. 130, pp. 411–452.
[18] Nvidia, 2014. NVIDIA’s Next Generation CUDA Compute Architecture: Kepler TM GK110/210.
Technical report, Nvidia Corporation.
[19] Nvidia, 2016. NVIDIA Tesla P100. Technical report, Nvidia Corporation.