
Ultrascalable Algorithms for Complex Flows

Authors: Ulrich Rüde

Abstract

This presentation summarizes some recent and ongoing work on developing methods for extreme-scale simulation. We will study solvers for large-scale physics problems and coupled systems involving multi-phase flows as they occur in advanced engineering. A driving example is additive manufacturing, which involves both particle-based simulations and complex multiphase flows with phase changes. With modern supercomputers and specially developed software, it is possible to simulate such systems with geometrical resolution of each particle.
Ultrascalable algorithms — Ulrich Rüde
Lehrstuhl für Simulation
Universität Erlangen-Nürnberg
www10.informatik.uni-erlangen.de
Ulrich Rüde
LSS Erlangen and CERFACS Toulouse
ulrich.ruede@fau.de
1
Centre Européen de Recherche et de
Formation Avancée en Calcul Scientifique
www.cerfacs.fr
Ultrascalable Algorithms
for Complex Flows
1
VSC User Day 2018
22.05.2018, 09:50-18:00, KVAB (Koninklijke Vlaamse Academie van België voor Wetenschappen en Kunsten - Royal Flemish Academy of Belgium for Science and the Arts), Hertogstraat 1, 1000 Brussel
Program: 9:50 Welcome; 10:00 "Ultrascalable algorithms for complex flows" – Ulrich Rüde, CERFACS and Universität Erlangen-Nürnberg
1
motivation
Additive Manufacturing
Fast Electron Beam Melting
2
Ultrascalable Algorithms — Ulrich Rüde
Körner, C. (2016). Additive manufacturing of metallic components by selective electron beam melting—a review.
International Materials Reviews, 61(5), 361-377.
Klassen, A., Scharowsky, T., & Körner, C. (2014). Evaporation model for beam based additive manufacturing using free
surface lattice Boltzmann methods. Journal of Physics D: Applied Physics, 47(27), 275303.
Markl, M., Ammer, R., Rüde, U., & Körner, C. (2015). Numerical investigations on hatching process strategies for powder-
bed-based additive manufacturing using an electron beam. The International Journal of Advanced Manufacturing
Technology, 78(1-4), 239-247.
Simulation of Electron Beam Melting Process
(Additive Manufacturing)
EU-Project Fast-EBM
ARCAM (Gothenburg)
TWI (Cambridge)
FAU Material Sciences
FAU Simulation
3
Ultrascalable Algorithms — Ulrich Rüde
Ammer, R., Markl, M., Ljungblad, U., Körner, C., &
UR (2014). Simulating fast electron beam melting
with a parallel thermal free surface lattice Boltzmann
method. Computers & Mathematics with
Applications, 67(2), 318-330.
Ammer, R., UR, Markl, M., Jüchter V., & Körner, C.
(2014). Validation experiments for LBM simulations
of electron beam melting. International Journal of
Modern Physics C.
Simulation needs coupled nonlinear models of
powder bed generation
granular flow, hopper, rake
particles, size and shape distribution, restitution, friction
electron beam
beam generation, beam control
beam absorption, energy transfer
selective welding
heat transfer, phase transition: melting
multiphase flow in complex geometry
transport of yet solid particles (and oxides) in melt
surface tension, wetting, contact angles
radiation
evaporation: material and momentum balance
solidification
4
Ultrascalable Algorithms — Ulrich Rüde
Physical scales
A somewhat arbitrary table of characteristic sizes
Atoms: 10^-10 m
Microstructures: 10^-6 m
Pores, grains, particles: 10^-4 m
Product: 10^-1 m
When we want to model
a product: may need 10^9 particles
the printing process: may need 10^3 cells per particle
the evolving microstructure: may need 10^9 phase field cells per particle
Does it make sense to model the 3D printing of an artificial hip implant with 10^18 computational cells?
Would it be possible?
5
Ultrascalable Algorithms — Ulrich Rüde
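A quick order-of-magnitude check of these counts (a minimal Python sketch; the sizes and per-particle cell counts are the rough figures quoted on this slide):

```python
# Order-of-magnitude estimate for the cell counts quoted above (sketch only).
product_size = 1e-1        # m, e.g. an artificial hip implant
particle_size = 1e-4       # m, powder grain / pore scale

particles = (product_size / particle_size) ** 3   # ~1e9 particles in the product
cells_printing = particles * 1e3                  # ~1e3 LBM cells per particle
cells_microstructure = particles * 1e9            # ~1e9 phase-field cells per particle

print(f"particles in the product:       {particles:.0e}")            # 1e9
print(f"cells for the printing process: {cells_printing:.0e}")       # 1e12
print(f"cells for the microstructure:   {cells_microstructure:.0e}") # 1e18
```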
Building Block I:
Current and Future
High Performance Supercomputers
6
Ultrascalable Algorithms — Ulrich Rüde
SuperMUC: 3 PFlops
Ultrascalable Algorithms Ulrich Rüde
From „Peta“ to „Exa“ Supercomputers
JUQUEEN:
Blue Gene/Q architecture
458,752 PowerPC A2 cores
16 cores (1.6 GHz) per node
16 GiB RAM per node
5D torus interconnect
5.8 PFlops peak
TOP 500: #21

SuperMUC (phase 1):
Intel Xeon architecture
147,456 cores
16 cores (2.7 GHz) per node
32 GiB RAM per node
Pruned tree interconnect
3.2 PFlops peak
TOP 500: #40

Sunway TaihuLight:
SW26010 processor
10,649,600 cores
260 cores (1.45 GHz) per node
32 GiB RAM per node
125 PFlops peak
1.31 PByte RAM
Power consumption: 15.37 MW
TOP 500: #1
Technological basis on the „Nano“ Scale
Clock rates of 2-3 GHz
correspond to cycle
times of 0.3-0.5 nsec.
8
Algorithmic energy efficiency - Uli Rüde
Chapter 3
Test Systems and Tools
This section elaborates on the description and the characteristics of the different test clusters and tools available at the RRZE high performance computing centre, used to study their power consumption and performance behaviour under a wide range of workloads. The command "likwid-topology -g" delivers a graphical output of the machine's topology. Intel introduced the "Running Average Power Limit (RAPL)" energy sensors with the Sandy Bridge micro-architecture for measuring the energy consumption of short code paths, and they are now available in almost all recent Intel CPUs [17]. The Intel "Tick/Tock" model [18] alternates every micro-architectural change with a die shrink of the process technology, as shown in Figure 3.1.
Figure 3.1: Intel tick-tock model towards Intel's next generations: in the 32 nm process technology, Westmere (tick, new processor) is followed by Sandy Bridge (2nd generation, tock, new microarchitecture); in the 22 nm process technology, Ivy Bridge (3rd generation, tick, new processor) is followed by Haswell (4th generation, tock, new microarchitecture).
3.1. "phinally" Test System
The "phinally" test system has a dual-socket, 8-core Intel Sandy Bridge EP processor with 16 logical cores per socket through hyper-threading (see Fig. 3.2). It operates at a 2.7 GHz base clock speed and features Intel turbo mode for increased performance on an as-needed basis. Sandy Bridge is the codename for the micro-architecture based on the 32 nm manufacturing process developed by Intel to replace the Westmere micro-architecture. Due to the 256-bit Advanced Vector Extensions (AVX) instruction set with wider vectors, a full socket of the phinally system has an overall theoretical peak performance Ppeak of 172.8 GFlop/s and 345.6 GFlop/s for double and single precision, respectively.
based on: Ayesha Afzal: The Cost of Computation: Metrics and Models for Modern
Multicore-based Systems in Scientific Computing, Master Thesis, FAU Erlangen, 2015
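The quoted peak performance follows directly from clock rate, core count, and SIMD width; a small sketch of that arithmetic (the 8 double-precision flops per cycle and core, i.e. one 4-wide AVX add plus one 4-wide AVX multiply on Sandy Bridge, is an assumption consistent with the quoted 172.8 GFlop/s):

```python
# Theoretical peak of one "phinally" socket (sketch; flops/cycle is assumed).
clock_ghz = 2.7
cores_per_socket = 8
dp_flops_per_cycle = 8    # 4-wide AVX add + 4-wide AVX mul per core
sp_flops_per_cycle = 16

print(f"DP peak: {clock_ghz * cores_per_socket * dp_flops_per_cycle:.1f} GFlop/s")  # 172.8
print(f"SP peak: {clock_ghz * cores_per_socket * sp_flops_per_cycle:.1f} GFlop/s")  # 345.6
```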
8.3. Microscopic performance, power and/or energy models
The concern is to measure the number of joules for a particular byte transfer; during this transfer time the whole chip consumes the baseline energy E0, which is burnt anyway even when the processor is idle and waits for the data transfer. The baseline energy E0 will be different for benchmarks having different runtimes. Thus, a correction is applied by taking out this baseline energy and only considering the dynamic part of the energy consumption.
Table 8.2: Single-core dynamic energy cost for one flop (treating addition and multiplication on the same footing) and for one byte transfer (load/store) with baseline power W0 = 15.9 W on the "phinally" system.

Metrics      ε_flop      ε_L1↔REG [pJ/B]   ε_L2↔L1 [pJ/B]   ε_L3↔L2 [pJ/B]   ε_MEM↔L3 [pJ/B]
Flop only    830 pJ/F    0                  0                 0                 0
Load only    0           227                314               256               1880
Store only   0           377                300               340               2977
Validation
The energy consumption can be predicted for a variety of tasks with multiple operations once the microscopic-level parameters (i.e., the energy cost of a single flop and of a byte transfer) are known. For validation of this hypothesis, a real benchmark like 2D Jacobi was chosen, which has a lot of data transfer with few flops, so the dominant energy contribution is the data movement. A comparison of measurements with the analytically predicted energy consumption was done by putting together the knowledge about the data transfer cost, the flop cost, the ECM model and the layer condition in the energy model. For the Jacobi stencil code, three streams need to be transferred when the layer condition is satisfied. When the layer condition is violated, there are more data transfers (five streams) across expensive data paths and the code takes longer, which results in a larger energy cost.
Table 8.3: Single-core energy cost for one flop and for one byte transfer with baseline power W0 = 15.9 W on the "phinally" system.

Metrics                        Copy (AVX)       Jacobi (lcL3)        Jacobi (lcL2)
                               (9 it × 1 GB)    (100 it × 8k × 8k)   (100 it × 8k × 8k with blocking)
Cycles per cache line          30.73            42.8                 31.13
Memory data volume [Byte]      9.66E+9          1.54E+11             1.54E+11
E0 [pJ/B]                      1414             1308                 949
Calculated total energy [J]    46               681                  649
Measured total energy [J]      45               651                  638
Table 8.3 shows how far the predicted values coincide with the measurements, both for the same and for different numbers of cache line transfers through the memory hierarchy. We observed that for streaming kernels the single-core energy consumption has a large baseline energy contribution to the total energy compared to multiple cores. However, for multiple cores the total energy becomes increasingly dominated by the dynamic power (which our power model also predicts).
8-core Sandy Bridge system, measured through systematic benchmarking
see also Georg Hager's talk at PACO 2015
Best values on the Green 500 currently convert to 0.1 nJ/Flop:
equivalent to 100 MW for ExaFlops performance
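A minimal sketch of this kind of energy accounting (the per-operation costs are the single-core figures from Table 8.2 and the 0.1 nJ/flop figure is the Green 500 value quoted above; the split into baseline and dynamic energy mirrors the model described in the thesis excerpt, not its exact implementation):

```python
# Baseline + dynamic energy accounting (sketch, single core of "phinally").
E_FLOP = 830e-12       # J per DP flop (Table 8.2)
E_MEM_BYTE = 1880e-12  # J per byte loaded from main memory (Table 8.2)
W_BASELINE = 15.9      # W baseline power

def total_energy(n_flops, n_bytes_mem, runtime_s):
    """Total energy = baseline power over the runtime + dynamic part."""
    dynamic = n_flops * E_FLOP + n_bytes_mem * E_MEM_BYTE
    return W_BASELINE * runtime_s + dynamic

# Green 500 style estimate: 0.1 nJ per flop at exaflop rate -> 100 MW
print(f"{0.1e-9 * 1e18 / 1e6:.0f} MW for 1 EFlop/s at 0.1 nJ/flop")
```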
Can we do „Peta-Scale“ Applications?
What is the largest system that we can solve today?
and now, 13 years later?
Juqueen has ~400 TByte main memory = 4·10^14 Bytes =
5 vectors each with N = 10^13 double precision elements
matrix-free implementation necessary
even with a sparse matrix format, storing a matrix of dimension N = 10^13 is not possible
Which algorithm?
multigrid
Cost = C·N
C "moderate", e.g. C = 200.
does it parallelize well on that scale?
should we worry since κ = O(N^(2/3))?
9
Algorithmic energy efficiency - Uli Rüde
Bergen, B., Hülsemann, F., UR (2005): Is 1.7·10^10 unknowns the largest finite element system that can be solved today? Proceedings of SC'05.
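The memory argument can be written down in a few lines (a sketch; the ~15 nonzeros per row and the 4-byte column index below are illustrative assumptions, not numbers from the slide):

```python
# Why a matrix-free implementation is mandatory at N = 1e13 (sketch).
N = 1e13                  # unknowns
bytes_per_double = 8
n_vectors = 5             # e.g. solution, right-hand side, work vectors

vectors = n_vectors * N * bytes_per_double
print(f"5 vectors of N=1e13 doubles: {vectors/1e12:.0f} TByte")   # ~400 TByte

nnz_per_row = 15          # assumed stencil size
bytes_per_entry = bytes_per_double + 4   # value + column index (assumed)
matrix = N * nnz_per_row * bytes_per_entry
print(f"sparse matrix estimate: {matrix/1e15:.1f} PByte")          # far beyond Juqueen
```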
Exploring the Limits
typically appear in simulations for molecules, quantum mechanics, or geophysics. The initial mesh T_2 consists of 240 tetrahedrons for the case of 5 nodes and 80 threads. The number of degrees of freedom on the coarse grid T_0 grows from 9.0·10^3 to 4.1·10^7 in the weak scaling. We consider the Stokes system with the Laplace-operator formulation. The relative accuracies for the coarse grid solver (PMINRES and CG algorithm) are set to 10^-3 and 10^-4, respectively. All other parameters for the solver remain as previously described.

nodes     threads   DoFs       iter   time     time w.c.g.   time c.g. in %
5         80        2.7·10^9   10     685.88   678.77        1.04
40        640       2.1·10^10  10     703.69   686.24        2.48
320       5 120     1.2·10^11  10     741.86   709.88        4.31
2 560     40 960    1.7·10^12  9      720.24   671.63        6.75
20 480    327 680   1.1·10^13  9      776.09   681.91        12.14

Table 10: Weak scaling results with and without coarse grid for the spherical shell geometry.

Numerical results with up to 10^13 degrees of freedom are presented in Tab. 10, where we observe robustness with respect to the problem size and excellent scalability. Besides the time-to-solution (time) we also present the time excluding the time necessary for the coarse grid (time w.c.g.) and the total amount in % that is needed to solve the coarse grid. For this particular setup, this fraction does not exceed 12%. Due to 8 refinement levels, instead of 7 previously, and the reduction of threads per node from 32 to 16, longer computation times (time-to-solution) are expected, compared to the results in Sec. 4.3. In order to evaluate the performance, we compute the factor t·n_c/n, where t denotes the time-to-solution (including the coarse grid), n_c the number of used threads, and n the degrees of freedom. This factor is a measure for the compute time per degree of freedom, weighted with the number of threads, under the assumption of perfect scalability. For 1.1·10^13 DoFs, this factor takes the value of approx. 2.3·10^-5 and for the case of 2.2·10^12 DoFs on the unit cube (Tab. 5) approx. 6.0·10^-5, which is of the same order. Thus, in both scaling experiments the time-to-solution for one DoF is comparable. The reason why the ratio is even smaller for the extreme case of 1.1·10^13 DoFs is the deeper multilevel hierarchy. Recall also that the computational domain is different in both cases.

The computation of 10^13 degrees of freedom is close to the limits that are given by the shared memory of each node. By (8), we obtain a theoretical total memory consumption of 274.22 TB, and on one node of 14.72 GB. Though 16 GB of shared memory per node is available, we employ one further optimization step and do not allocate the right-hand side on the finest grid level. The right-hand side vector is replaced by an assembly on-the-fly, i.e., the right-hand side values are evaluated and integrated locally when needed. By applying this on-the-fly assembly, the theoretical
Multigrid with Uzawa Smoother
Optimized for Minimal Memory Consumption
10^13 unknowns correspond to 80 TByte for the solution vector
Juqueen has 450 TByte memory
matrix-free implementation essential
10
Algorithmic energy efficiency - Uli Rüde
Gmeiner B., Huber M, John L, UR, Wohlmuth, B: A quantitative performance study for Stokes solvers at
the extreme scale, Journal of Computational Science, 2016.
What are the largest FE computations today?
11
Algorithmic energy efficiency - Uli Rüde
Scaling of Algorithmic Energy Consumption: Energy(Flop) = 1 nJ

Problem scale DoF = N                   10^6        10^9        10^12       10^15
Computer scale                          gigascale   terascale   petascale   exascale
                                        (10^9)      (10^12)     (10^15)     (10^18)
Direct method: 1·N^2                    0.278 Wh    278 kWh     278 GWh     278 PWh
Krylov method: 100·N^1.33               10 Ws       28 Wh       278 kWh     2.77 GWh
Full Multigrid: 200·N                   0.2 Ws      0.056 Wh    56 Wh       56 kWh
TerraNeo prototype (est. for Juqueen)   0.13 Wh     30 Wh       27 kWh      ?
Solution of the Laplace equation in 3D with N = n^3 unknowns
Direct methods:
banded: ~n^7 = N^2.33
nested dissection: ~n^6 = N^2
Iterative methods:
Jacobi: ~50 n^5 = 50 N^1.66
CG: ~100 n^4 = 100 N^1.33
Full Multigrid: ~200 n^3 = 200 N
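The table entries above follow from these flop counts and the assumed 1 nJ per flop; a sketch that reproduces them:

```python
# Reproduce the energy-scaling table (1 nJ per flop assumed).
E_FLOP = 1e-9  # J

methods = {
    "direct, 1*N^2":          lambda N: N**2,
    "Krylov, 100*N^1.33":     lambda N: 100 * N**1.33,
    "full multigrid, 200*N":  lambda N: 200 * N,
}

for N in (1e6, 1e9, 1e12, 1e15):
    for name, flops in methods.items():
        wh = flops(N) * E_FLOP / 3600          # joules -> watt-hours
        print(f"N={N:.0e}  {name:22s} {wh:10.3g} Wh")
```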
Tera-Scale problems: What must we NOT do!
Use "standard" algorithms
Assume that we use an O(N^2) algorithm on a problem with N = 10^12
If time(N=1) = 10^-9 sec,
then time(N=10^12) = 10^(2·12-9) sec = 10^15 sec > 30 million years
If energy(N=1) = 10^-9 J,
then energy(N=10^12) = 10^(2·12-9) J = 10^15 J ≈ 277 GWh
We cannot store a system matrix:
even with a sparse format, a matrix with N = 10^12 is way too big!
12
Algorithmic energy efficiency - Uli Rüde
For Tera-Scale problems:
we must use optimal algorithms
and the constants are essential!
Building block II:
Granular media
simulations
with the physics engine
13
Ultrascalable Algorithms — Ulrich Rüde
Pöschel, T., & Schwager, T. (2005). Computational granular dynamics: models and algorithms.
Springer Science & Business Media.
Hiking 2016 in the Silvretta mountains
Lagrangian Particle Representation
Single particle described by
state variables (position x, orientation φ, translational and angular velocity v and ω),
a parameterization of its shape S (e.g. geometric primitive, composite object, or mesh),
and its inertia properties (mass m, principal moments of inertia Ixx, Iyy and Izz).
14
Ultrascalable Algorithms - Ulrich Rüde
The Newton-Euler equations of motion for rigid
bodies describe the rate of change of the state
variables:
Newton-Euler Equations for Rigid Bodies
\[
\begin{pmatrix} \dot{x}(t) \\ \dot{\varphi}(t) \end{pmatrix}
=
\begin{pmatrix} v(t) \\ Q(\varphi(t))\,\omega(t) \end{pmatrix},
\qquad
M(\varphi(t))
\begin{pmatrix} \dot{v}(t) \\ \dot{\omega}(t) \end{pmatrix}
=
\begin{pmatrix} f(s(t),t) \\ \tau(s(t),t) - \omega(t) \times I(\varphi(t))\,\omega(t) \end{pmatrix}
\]
Integrator of order one similar to semi-implicit Euler.
\[
\begin{pmatrix} x'(\delta t) \\ \varphi'(\delta t) \end{pmatrix}
=
\begin{pmatrix} x \\ \varphi \end{pmatrix}
+ \delta t
\begin{pmatrix} v'(\delta t) \\ Q(\varphi)\,\omega'(\delta t) \end{pmatrix},
\qquad
\begin{pmatrix} v'(\delta t) \\ \omega'(\delta t) \end{pmatrix}
=
\begin{pmatrix} v \\ \omega \end{pmatrix}
+ \delta t\, M(\varphi)^{-1}
\begin{pmatrix} f(s,t) \\ \tau(s,t) - \omega \times I(\varphi)\,\omega \end{pmatrix}
\]
Integration of positions is implicit in the velocities and integration of velocities is explicit.
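A minimal sketch of this first-order scheme for a single rigid body with diagonal inertia (illustrative only; the pe framework integrates the orientation through the quaternion map Q(φ), which is omitted here):

```python
import numpy as np

def integrate_step(x, v, w, m, I_diag, f, tau, dt):
    """One first-order step: velocity update is explicit in forces/torques,
    position update is implicit in the *new* velocities (sketch only)."""
    v_new = v + dt * f / m
    w_new = w + dt * (tau - np.cross(w, I_diag * w)) / I_diag
    x_new = x + dt * v_new          # positions use the updated velocity
    # orientation update via Q(phi) * w_new omitted in this sketch
    return x_new, v_new, w_new
```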
Contact Detection
Formal description of contact detection for a pair of convex rigid bodies:
\[
\hat{x}(t) = \arg\min_{f_2(y) \le 0} f_1(y), \qquad
n(t) = \nabla f_2(\hat{x}(t)), \qquad
\xi(t) = f_1(\hat{x}(t))
\]
f_{1/2}: signed distance function of body 1/2
ξ: minimum signed distance function
n: function for the surface normal
Time-continuous non-penetration constraint for hard contacts:
\[
\xi(t) \ge 0 \;\perp\; \lambda_n(t) \ge 0
\]
Discretization Underlying the Time-Stepping
Non-penetration conditions (Signorini condition and impact law) and Coulomb friction conditions (friction cone condition, frictional reaction opposes slip) are formulated as complementarity conditions on the position, velocity, and acceleration level, for continuous forces and for discrete impulses, e.g.
\[
0 \le \xi \;\perp\; \lambda_n \ge 0, \qquad
\|\lambda_{to}\|_2 \le \mu\,\lambda_n, \qquad
\|v^{+}_{to}\|_2\,\lambda_{to} = -\mu\,\lambda_n\,v^{+}_{to}.
\]
Erlangen, 15.12.2014 — T. Preclik — Lehrstuhl für Systemsimulation: Ultrascale Simulations of Non-smooth Granular Dynamics
Nonlinear Complementarity: Measure Differential Inclusions
15
Ultrascalable Algorithms - Ulrich Rüde
Preclik, T., & UR (2015). Ultrascale simulations of non-smooth granular dynamics. Computational Particle
Mechanics, 2(2), 173-196.
Preclik, T., Eibl, S., & UR (2017). The Maximum Dissipation Principle in Rigid-Body Dynamics with Purely
Inelastic Impacts. arXiv preprint:1706.00221.
Parallel Computation
Key features of the
parallelization:
domain partitioning
distribution of data
synchronization protocol
subdomain NBGS
accumulators and corrections
aggressive message
aggregation
nearest-neighbor
communication
16
Ultrascalable Algorithms — Ulrich Rüde
Iglberger, K., & UR (2010). Massively parallel granular flow
simulations with non-spherical particles. Computer Science-
Research and Development, 25(1-2), 105-113
Iglberger, K., & UR (2011). Large-scale rigid body simulations.
Multibody System Dynamics, 25(1), 81-95
17
Ultrascalable Algorithms - Ulrich Rüde
Shaker scenario with sharp-edged hard objects
864 000 sharp-edged particles with a diameter between 0.25 mm and 2 mm.
7.1 Scalability of Granular Gases
Figure 7.3: The time-step profiles (two-level pie charts) for two weak-scaling executions of the granular gas on the Emmy cluster with 25^3 particles per process: (a) executed with 5×2×2 = 20 processes on a single node; (b) executed with 8×8×5 = 320 processes on 16 nodes.
domain decompositions. The scaling experiment for the one-dimensional domain decompositions (20×1×1, ..., 10 240×1×1) performs best and achieves on 512 nodes a parallel efficiency of 98.3% with respect to the single-node performance. The time measurements for two-dimensional domain decompositions (5×4×1, 8×5×1, ..., 128×80×1) are consistently slower, but the parallel efficiency does not drop below 89.7%. The time measurements for three-dimensional domain decompositions (5×2×2, 5×4×2, ..., 32×20×16) come in last, and the parallel efficiency goes down to 76.1% for 512 nodes. Again this behaviour can be explained by the differences in the communication volumes of one-, two- and three-dimensional domain decompositions. The largest weak-scaling setups in this experiment contained 1.6·10^8 non-spherical particles.
Fig. 7.3 breaks down the wall-clock time of various time step components in two-level pie charts. The times are averaged over all time steps and processes. The dark blue section corresponds to the fraction of the time in a time step used for detecting and filtering contacts. The orange section corresponds to the time used for initializing the velocity accumulators. The time to relax the contacts is indicated by the yellow time slice; it includes the contact sweeps for all 10 iterations without the velocity synchronization. The time used by all velocity synchronizations is shown in the green section, which includes the synchronizations for each iteration and the synchronization after the initialization of the velocity accumulators. This time slice is split up on the second level into the time used for assembling, exchanging, and processing the velocity correction messages (dark green
Scaling Results
Solver algorithmically not optimal for dense systems, hence cannot scale
unconditionally, but is highly efficient in many cases of practical importance
Strong and weak scaling results for a constant number of iterations
performed on SuperMUC and Juqueen
Largest ensembles computed
2.8 × 10^10 non-spherical particles
1.1 × 10^10 contacts
granular gas: scaling results
18
Ultrascalable Algorithms - Ulrich Rüde
Fig. 5: Inter-node weak-scaling graphs for a granular gas on all test machines: (a) the Emmy cluster, (b) the Juqueen supercomputer, (c) the SuperMUC supercomputer. (Axes: average time per time step and 1000 particles in s, and parallel efficiency, over the number of nodes.)
The reason why the measured times in the first series became shorter for 4 096 nodes and more is revealed when considering how the processes get mapped to the hardware. The default mapping on Juqueen is ABCDET, where the letters A to E stand for the five dimensions of the torus network, and T stands for the hardware thread within each node. The six-dimensional coordinates are then mapped to the MPI ranks in a row-major order, that is, the last dimension increases fastest. The T coordinate is limited by the number of processes per node, which was 64 for the above measurements. Upon creation of a three-dimensional communicator, the three dimensions of the domain partitioning are also mapped in row-major order. This has the effect that, if the number of processes in z-dimension is less than the number of processes per node, a two-dimensional or even three-dimensional section of the domain partitioning is mapped to a single node. However, if the number of processes in z-dimension is larger than or equal to the number of processes per node, only a one-dimensional section of the domain partitioning is mapped to a single node. A one-dimensional section of the domain partitioning performs considerably less intra-node communication than a two- or three-dimensional section of the domain partitioning. This matches exactly the situation for 2 048 and 4 096 nodes. For 2 048 nodes, a two-dimensional section 1×2×32 of the domain partitioning 64×64×32 is mapped to each node, and for 4 096 nodes a one-dimensional section 1×1×64 of the domain partitioning 64×64×64 is mapped to each node. To substantiate this claim, we confirmed that the performance jump occurs when the last dimension of the domain partitioning reaches the number of processes per node, also when using 16 and 32 processes per node.
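The effect of the row-major rank placement can be verified with a few lines (a sketch; node_section is a hypothetical helper written for this illustration, not part of the pe code):

```python
import numpy as np

def node_section(partition, ppn, node=0):
    """Extent in (x, y, z) of the ranks placed on one node when MPI ranks
    are assigned to the domain partitioning in row-major order."""
    ranks = np.arange(node * ppn, (node + 1) * ppn)
    coords = np.stack(np.unravel_index(ranks, partition), axis=1)
    extent = coords.max(axis=0) - coords.min(axis=0) + 1
    return tuple(int(e) for e in extent)

print(node_section((64, 64, 32), ppn=64))  # (1, 2, 32): 2D section per node
print(node_section((64, 64, 64), ppn=64))  # (1, 1, 64): 1D section per node
```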
Fig. 5c presents the weak-scaling results on the SuperMUC supercomputer. The setup differs from the granular gas scenario presented in Sect. 7.2.1 in that it is more dilute. The distance between the centers of two granular particles along each spatial dimension is 2 cm, amounting to a solid volume fraction of 3.8% and consequently to fewer collisions. As on the Juqueen supercomputer, only three-dimensional domain partitionings were used. All runs on up to 512 nodes were running within a single island. The run on 1 024 nodes also used the minimum number of 2 islands. The run on 4 096 nodes used nodes from 9 islands, and the run on 8 192 nodes used nodes from 17 islands, that is, both runs used one island more than required. The graph shows that most of the performance is lost in runs on up to 512 nodes. In these runs only the non-blocking intra-island communication is utilised. Thus this part of the setup is very similar to the Emmy cluster since it also has dual-socket nodes with Intel Xeon E5 processors and a non-blocking tree Infiniband network. Nevertheless, the intra-island scaling results are distinctly worse. The reasons for these differences were not yet further investigated. However, the scaling behaviour beyond a single island is decent, featuring a parallel efficiency of 73.8% with respect to a single island. A possible explanation of the underperforming intra-node scaling behaviour could be that some of the Infiniband links were degraded to QDR, which was a known problem at the time the extreme-
Breakdown of compute times on the Erlangen RRZE cluster Emmy
Building Block III:
Scalable PDE Simulations
19
Ultrascalable Algorithms - Ulrich Rüde
Succi, S. (2001). The lattice Boltzmann equation: for fluid dynamics and beyond. Oxford university press.
Feichtinger, C., Donath, S., Köstler, H., Götz, J., & Rüde, U. (2011). WaLBerla: HPC software design for
computational engineering simulations. Journal of Computational Science, 2(2), 105-112.
Ultrascalable algorithms — Ulrich Rüde
Domain Partitioning and Parallelization
20
static load balancing
allocation of block data (grids)
static block-level refinement (forest of octrees)
separation of domain partitioning from simulation (optional):
compact (KiB/MiB) binary MPI IO to/from disk
Ultrascalable algorithms — Ulrich Rüde
Parallel AMR load balancing
21
forest of octrees:
octrees are not explicitly stored, but implicitly defined via block IDs
2:1 balanced grid (used for the LBM)
distributed graph:
nodes = blocks
edges explicitly stored as <block ID, process rank> pairs
different views on
domain partitioning
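The idea of an implicitly defined octree can be illustrated with block IDs that append one octant per refinement level (a sketch under this assumption; waLBerla's actual ID layout may differ):

```python
# Illustrative block-ID scheme for an implicit forest of octrees (sketch).
def child_id(parent: int, octant: int) -> int:
    """Append 3 bits selecting one of the 8 children."""
    assert 0 <= octant < 8
    return (parent << 3) | octant

def parent_id(block_id: int) -> int:
    return block_id >> 3

root = 1                                  # marker bit keeps leading zero octants
leaf = child_id(child_id(root, 5), 2)     # level-2 block: octant 5, then octant 2
print(bin(leaf))                          # 0b1101010
print(parent_id(leaf) == child_id(root, 5))  # True: parent recovered from the ID

# The distributed graph then stores, per block, its neighbors as
# (block ID, process rank) pairs -- no explicit octree data structure needed.
```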
Adaptive Mesh Refinement and
Load Balancing
22
Ultrascalable Algorithms - Ulrich Rüde
Isaac, T., Burstedde, C., Wilcox, L. C., & Ghattas, O. (2015). Recursive algorithms for
distributed forests of octrees. SIAM Journal on Scientific Computing, 37(5), C497-C531.
Meyerhenke, H., Monien, B., & Sauerwald, T. (2009). A new diffusion-based multilevel
algorithm for computing graph partitions. Journal of Parallel and Distributed Computing,
69(9), 750-761.
Schornbaum, F., & Rüde, U. (2016). Massively Parallel Algorithms for the Lattice
Boltzmann Method on NonUniform Grids. SIAM Journal on Scientific Computing, 38(2),
C96-C126.
Schornbaum, F., & Rüde, U. (2017). Extreme-Scale Block-Structured Adaptive Mesh
Refinement. arXiv preprint:1704.06829.
Performance on
Coronary Arteries
Geometry
Ultrascalable Algorithms - Ulrich Rüde
Godenschwager, C., Schornbaum, F., Bauer,
M., Köstler, H., & UR (2013). A framework for
hybrid parallel flow simulations with a trillion
cells in complex geometries. In Proceedings
of SC13: International Conference for High
Performance Computing, Networking,
Storage and Analysis (p. 35). ACM.
Weak scaling:
458,752 cores of JUQUEEN
over a trillion (10^12) fluid lattice cells
Strong scaling:
32,768 cores of SuperMUC
cell sizes of 0.1 mm
2.1 million fluid cells
6000 time steps per second
Color-coded process assignment
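As a quick sanity check on these figures (only the numbers quoted on this slide are used):

```python
# Rough throughput arithmetic from the slide (sketch only).
weak_cells, weak_cores = 1e12, 458_752
print(f"weak scaling: {weak_cells/weak_cores:,.0f} lattice cells per core")

strong_cells, steps_per_second = 2.1e6, 6000
updates = strong_cells * steps_per_second
print(f"strong scaling: {updates/1e9:.1f} billion cell updates per second")
```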
Single Node Performance
24
Ultrascalable Algorithms - Ulrich Rüde
(Chart: single-node LBM kernel performance on SuperMUC and JUQUEEN for standard, optimized, and vectorized implementations.)
Pohl, T., Deserno, F., Thürey, N., UR, Lammers, P., Wellein, G., & Zeiser, T. (2004). Performance evaluation of parallel large-
scale lattice Boltzmann applications on three supercomputing architectures. Proceedings of the 2004 ACM/IEEE conference
on Supercomputing (p. 21). IEEE Computer Society.
Donath, S., Iglberger, K., Wellein, G., Zeiser, T., Nitsure, A., & UR (2008). Performance comparison of different parallel lattice
Boltzmann implementations on multi-core multi-socket systems. International Journal of Computational Science and
Engineering, 4(1), 3-11.
AMR Performance
25
Ultrascalable Algorithms - Ulrich Rüde
(Chart: LBM AMR performance on JUQUEEN with Morton space-filling curve load balancing; time in seconds for 256 to 458,752 cores at 31,062 / 127,232 / 429,408 cells per core, corresponding to up to 14 / 58 / 197 billion cells; hybrid MPI+OpenMP version with SMP, 1 process with 2 cores and 8 threads.)
LBM AMR - Performance
Peta-Scale Simulations with the HPC Framework waLBerla: Massively Parallel AMR for the LBM
Florian Schornbaum - FAU Erlangen-Nürnberg - April 15, 2016
AMR Performance
26
Ultrascalable Algorithms - Ulrich Rüde
(Chart: LBM AMR performance on JUQUEEN with diffusion load balancing; same setup as above, 256 to 458,752 cores at 31,062 / 127,232 / 429,408 cells per core, up to 14 / 58 / 197 billion cells. The time is almost independent of the number of processes.)
LBM AMR - Performance
Peta-Scale Simulations with the HPC Framework waLBerla: Massively Parallel AMR for the LBM
Florian Schornbaum - FAU Erlangen-Nürnberg - April 15, 2016
Multi-Physics
Simulations
for Particulate Flows
Parallel Coupling
with waLBerla and PE
27
Ladd, A. J. (1994). Numerical simulations of particulate
suspensions via a discretized Boltzmann equation. Part 1.
Theoretical foundation. Journal of Fluid Mechanics, 271(1),
285-309.
Tenneti, S., & Subramaniam, S. (2014). Particle-resolved
direct numerical simulation for gas-solid flow model
development. Annual Review of Fluid Mechanics, 46,
199-230.
Bartuschat, D., Fischermeier, E., Gustavsson, K., & UR
(2016). Two computational models for simulating the
tumbling motion of elongated particles in fluids. Computers &
Fluids, 127, 17-35.
Ultrascalable Algorithms — Ulrich Rüde
Fluid-Structure Interaction
direct simulation of Particle Laden Flows (4-way coupling)
28
Ultrascalable Algorithms - Ulrich Rüde
Götz, J., Iglberger, K., Stürmer, M., & UR (2010). Direct numerical simulation of particulate flows on 294912 processor
cores. In Proceedings of Supercomputing 2010, IEEE Computer Society.
Götz, J., Iglberger, K., Feichtinger, C., Donath, S., & UR (2010). Coupling multibody dynamics and computational fluid
dynamics on 8192 processor cores. Parallel Computing, 36(2), 142-151.
Mapping Moving Obstacles
into the LBM Fluid Grid
29
Ultrascalable Algorithms — Ulrich Rüde
An Example
Fluid Cell
No-slip Cell
Acceleration Cell
Velocity/Pressure Cell
PDF acting as Force
Cells with state change
from Fluid to Particle
Momentum calculation
30
Ultrascalable Algorithms — Ulrich Rüde
Cells with state change
from Particle to Fluid
Mapping Moving Obstacles
into the LBM Fluid Grid
An Example (2)
Cell change from particle to fluid / Cell change from fluid to particle
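The mapping step itself amounts to re-flagging cells whose centers are covered by the particle and recording the state changes in both directions; a minimal sketch for one sphere on a uniform grid (illustrative only, not the waLBerla/pe coupling code):

```python
import numpy as np

def map_sphere_to_flags(flags, center, radius, dx):
    """Re-flag a uniform grid for one moving sphere (sketch only).
    flags: 3D int array, 0 = fluid cell, 1 = particle (no-slip) cell."""
    axes = [(np.arange(n) + 0.5) * dx for n in flags.shape]  # cell centers
    X, Y, Z = np.meshgrid(*axes, indexing="ij")
    inside = ((X - center[0])**2 + (Y - center[1])**2
              + (Z - center[2])**2) <= radius**2

    fluid_to_particle = inside & (flags == 0)   # PDFs here contribute momentum
    particle_to_fluid = ~inside & (flags == 1)  # PDFs here must be reconstructed

    flags[fluid_to_particle] = 1
    flags[particle_to_fluid] = 0
    return fluid_to_particle, particle_to_fluid
```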
LBM for Multiphysics — Ulrich Rüde
Comparison between coupling methods
Example: Single moving particle
evaluation of oscillating oblique regime:
Re= 263, Ga= 190
correctly represented by momentum
exchange (less good with Noble and
Torczynski method)
Different coupling variants
First order bounce back
Second order central linear
interpolation
Cross-validation with the spectral method of Uhlmann & Dušek
31
Figure 4: Contours of the projected relative velocity u_r∥ for case B-CLI-48 (Ga = 178.46). Contours are at (-0.4:0.2:1.2), where the red line outlines the recirculation area with u_r∥ = 0. The blue cross in the left plot marks the location taken for the calculation of the recirculation length L_r.
of the bifurcation point and the respective method will then fail to capture this motion at Ga = 190. In Fig. 7, a phase-space diagram of the results for the different coupling algorithms together with the reference data is shown for the two resolutions D/Δx = 36 and 48. The expected time-periodic behavior is a closed curve around a fixed midpoint. Even for the finer resolution, only CLI and MR are able to capture this oscillating motion accurately. Oscillations can also be found for BB, but the amplitude in u_pH is too large and the value of u_pV around which the curve oscillates is slightly changing in time. On the other hand, all PSC variants yield exponentially decaying oscillations and thus fail to capture this instability. It is worth noting that CLI is also able to reproduce the time-periodic oscillations with a resolution of D/Δx = 36, whereas MR shows strong deviations from a closed curve. This motion can be analyzed in more detail by calculating the time-average and fluctuation values of the different sphere velocities. These values are given in Tab. 3, where ⟨·⟩ denotes the average and ·′ the fluctuation part of a quantity; their exact definitions can be found in [32]. Tab. 3 also shows the frequency of the oscillation, which is calculated with the help of a discrete Fourier transformation. It can be seen that the average of the u_pV signal is captured well by the MEM variants, with errors well below 2% for the fine resolution. In contrast to that, the PSC variants'
Visualization of recirculation
length in particle wake
M. Uhlmann, J. Dušek, The motion of a single heavy sphere in ambient fluid: A benchmark for interface-resolved particulate flow simulations with significant relative velocities, International Journal of Multiphase Flow 59 (2014).
D. R. Noble, J. R. Torczynski, A Lattice-Boltzmann Method for Partially
Saturated Computational Cells, International Journal of Modern Physics C
(1998).
Rettinger, C., Rüde, U. (2017). A comparative study of fluid-particle coupling
methods for fully resolved lattice Boltzmann simulations. Computers & Fluids.
Simulation und Vorhersagbarkeit — Ulrich Rüde
Simulation of suspended particle transport
32
Preclik, T., Schruff, T., Frings, R., & Rüde, U.
(2017, August). Fully Resolved Simulations of
Dune Formation in Riverbeds. In High
Performance Computing: 32nd International
Conference, ISC High Performance 2017,
Frankfurt, Germany, June 18-22, 2017,
Proceedings (Vol. 10266, p. 3). Springer.
0.864·10^9 LBM cells
350 000 spherical particles
Simulation und Vorhersagbarkeit — Ulrich Rüde
Sedimentation and fluidized beds
33
3 levels mesh refinement
3800 spherical particles
Galileo number 50
128 processes
1024-4000 blocks
Block size 32^3
Volume-of-Fluid Method
for Free Surface Flows
34
Ultrascalable Algorithms — Ulrich Rüde
joint work with R.Ammer, S. Bogner, M. Bauer, D. Anderl, N. Thürey, S. Donath, T.Pohl, C Körner, A. Delgado
Körner, C., Thies, M., Hofmann, T., Thürey, N., & UR. (2005). Lattice Boltzmann model for free surface flow for modeling
foaming. Journal of Statistical Physics, 121(1-2), 179-196.
Donath, S., Feichtinger, C., Pohl, T., Götz, J., & UR. (2010). A Parallel Free Surface Lattice Boltzmann Method for Large-Scale
Applications. Parallel Computational Fluid Dynamics: Recent Advances and Future Directions, 318.
Anderl, D., Bauer, M., Rauh, C., UR, & Delgado, A. (2014). Numerical simulation of adsorption and bubble interaction in protein
foams using a lattice Boltzmann method. Food & function, 5(4), 755-763.
Bogner, S., Ammer, R., & Rüde, U. (2015). Boundary conditions for free interfaces with the lattice Boltzmann method. Journal of
Computational Physics, 297, 1-12.
Building Block V
Free Surface Flows
Volume-of-Fluid-like approach
Flag field: Compute only in fluid
Special “free surface” conditions in interface cells
Reconstruction of curvature for surface tension
35
Ultrascalable Algorithms - Ulrich Rüde
Simulation for hygiene products (for Procter&Gamble)
capillary pressure
inclination
surface tension
contact angle
36
Ultrascalable Algorithms - Ulrich Rüde
Additive Manufacturing
Fast Electron Beam
Melting
37
Ultrascalable Algorithms — Ulrich Rüde
Bikas, H., Stavropoulos, P., & Chryssolouris, G. (2015). Additive manufacturing methods and modelling approaches: a critical
review. The International Journal of Advanced Manufacturing Technology, 1-17.
Klassen, A., Scharowsky, T., & Körner, C. (2014). Evaporation model for beam based additive manufacturing using free
surface lattice Boltzmann methods. Journal of Physics D: Applied Physics, 47(27), 275303.
Körner, C., Thies, M., Hofmann, T., Thürey, N., & UR (2005). Lattice Boltzmann model for free surface flow for modeling
foaming. Journal of Statistical Physics, 121(1-2), 179-196.
Simulation of Electron Beam Melting
38
Ultrascalable Algorithms — Ulrich Rüde
Simulating powder bed generation
using the PE framework
High speed camera shows
melting step for manufacturing a
hollow cylinder
WaLBerla Simulation
Simulating powder bed generation
using the PE framework
39
Phase field simulations
of solidification processes
Bauer, M., Hötzer, J., Jainta, M., Steinmetz, P., Berghoff, M., Schornbaum, F., ... & Rüde, U. (2015, November). Massively
parallel phase-field simulations for ternary eutectic directional solidification. In Proceedings of the International Conference for
High Performance Computing, Networking, Storage and Analysis (p. 8). ACM.
Hötzer, J., Jainta, M., Steinmetz, P., Nestler, B., Dennstedt, A., Genau, A., ... & Rüde, U. (2015). Large scale phase-field
simulations of directional ternary eutectic solidification. Acta Materialia, 93, 194-204.
Microstructures forming during ternary
eutectic directional solidification
Building Block VI
Figure 9: Weak scaling on SuperMUC (left), Hornet (middle) and JUQUEEN (right).
Figure 10: Three-dimensional simulation (a) and tomography reconstruction of an experiment from A. Dennstedt (b) of directional solidification of the ternary eutectic system Ag-Al-Cu; labeled features: ring, connection, chain.
Figure 11: Extracted lamellae, (a) phase Al2Cu and (b) phase Ag2Al, from the simulation depicted in Figure 10. The lamellae grew from left to right.
Supercomputing for Material Sciences — Ulrich Rüde
Phase field computations
Grand Potential approach
3 phase fields
chemical potential
temperature equation
explicit integration (!?)
finite differences, structured grid
parallel software developed in group of B. Nestler (KIT)
extremely fine resolutions (in space and time) necessary to observe pattern formation
re-implementation in waLBerla
performance engineering leads to a speedup of 80×
spiraling structures observed
40
Supercomputing for Material Sciences — Ulrich Rüde
Phase field model for ternary eutectic solidification
41
Supercomputing for Material Sciences — Ulrich Rüde
Figure 2. Setting to simulate the ternary eutectic directional solidification based on [38]. The melt ℓ, consisting of three components, solidifies into the three solid phases. The moving-window technique on the block-structured grid is highlighted in dashed blue; below, in red, the moving analytic temperature gradient is shown. (Boundary conditions: periodic laterally, Neumann and Dirichlet in the growth direction z; the gradient G moves with velocity v.)

The evolution equations for the phase fields φ (Eq. (4)) and for the chemical potentials μ (Eq. (5)) follow the grand-potential formulation; the analytic temperature gradient is prescribed as
\[
\frac{\partial T}{\partial t} = \frac{\partial}{\partial t}\bigl(T_0 + G\,(z - v t)\bigr) = -G\,v. \qquad (6)
\]
The evolution equation for the chemical potentials (5) alone results in 1 384 floating point operations per cell and 680 bytes that need to be transferred from main memory [31]. Further details of the phase-field model are presented in [8,39]. The discretizations in space with a finite difference scheme and in time with an explicit Euler scheme are specified in [40]. To efficiently solve the evolution equations on current HPC systems, the model is optimized and parallelized on different levels as proposed in [31]. Besides explicit vectorization of the sweeps and parallelization with MPI, a moving-window approach is implemented on top of the block-structured grid data structures of waLBerla as depicted in Figure 2. This allows the total simulation domain to be reduced to just a region around the solidification front. Typical simulations in representative volume elements require between 10 000 and 20 000 compute cores for multiple days [8-10]. A typical phase-field simulation of the directional solidification of the ternary eutectic system Al-Ag-Cu as described in [8] is shown in Fig. 3. The 12 000 × 12 000 × 65 142 voxel cell domain was calculated on the SuperMUC system with 19 200 cores for approximately 19 hours. Three distinct solid phases of different composition grow into the undercooled melt and form characteristic microstructure patterns. On the left and right, rods of the two phases Al2Cu and Ag2Al are extracted to show the evolution of the single solid phases within the complex microstructure. The rods split, merge and overgrow during the simulation as described in [8,41].

Figure 3. Phase-field simulation of the directional solidification of the ternary eutectic system Al-Ag-Cu in a 12 000 × 12 000 × 65 142 voxel cell domain, calculated with 19 200 cores on the SuperMUC system. A detailed discussion is presented in [8,41].

Based on this highly parallel and optimized solver, the eutectic solidification of idealized systems [42-44] and real ternary alloys like Al-Ag-Cu [8-10,41,45,46] and Ni-Al-Cr [47] were investigated in large-scale domains. Thus, the experimentally assumed growth of spirals could be proved [42]. In [46] the influence of different melt compositions on the evolving patterns could be shown.

7 Benchmarks
In this section, we present benchmarks to illustrate the performance of our implementation. First, we introduce the application parameters. Then we cover different aspects of the presented checkpointing scheme and explain and present the respective results.

7.1 The test cases
To benchmark the presented checkpointing scheme of Section 5.2, we simulate the directional solidification of a ternary eutectic system using the implementation presented in [31] of the phase-field model introduced in Section 6. For the simulation parameters, the values of [42] to study spiral growth are used. To resolve spiral growth, large domain sizes and millions of iterations are required, resulting in massively
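The quoted 1 384 flops and 680 bytes per cell for the chemical-potential sweep fix its arithmetic intensity at roughly 2 flop/byte; a quick roofline-style estimate (sketch; the peak figure is the double-precision socket peak quoted earlier for the Sandy Bridge test system, and the 40 GB/s sustained memory bandwidth is an assumption, not a number from the text):

```python
# Arithmetic intensity and roofline bound for the chemical-potential sweep.
flops_per_cell = 1384
bytes_per_cell = 680
ai = flops_per_cell / bytes_per_cell
print(f"arithmetic intensity: {ai:.2f} flop/byte")

peak_flops = 172.8e9   # DP peak of one 8-core Sandy Bridge socket (quoted earlier)
mem_bw = 40e9          # assumed sustained memory bandwidth in bytes/s
bound = min(peak_flops, ai * mem_bw)
print(f"roofline bound: {bound/1e9:.0f} GFlop/s (memory-bound)")
```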
Kohl, Hötzer, Schornbaum, Bauer, Godenschwager, Köstler, Nestler, Rüde (references 30-49):
30. Huber M, Gmeiner B, Rüde U et al. Resilience for Massively Parallel Multigrid Solvers. SIAM Journal on Scientific Computing 2016; 38(5). doi:10.1137/15M1026122.
31. Bauer M, Hötzer J, Jainta M et al. Massively parallel phase-field simulations for ternary eutectic directional solidification. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. SC '15, New York, NY, USA: ACM. ISBN 978-1-4503-3723-6, pp. 8:1-8:12. doi:10.1145/2807591.2807662.
32. Kurz W and Sahm PR. Gerichtet erstarrte eutektische Werkstoffe: Herstellung, Eigenschaften und Anwendungen von In-situ-Verbundwerkstoffen. Springer, 1975. ISBN 978-3-642-65994-2.
33. Fisher K and Kurz W. Fundamentals of Solidification. Trans Tech Publications 1986; doi:10.1002/crat.2170210909.
34. Hötzer J, Kellner M, Steinmetz P et al. Applications of the phase-field method for the solidification of microstructures in multi-component systems. Journal of the Indian Institute of Science 2016.
35. Dennstedt A and Ratke L. Microstructures of directionally solidified Al-Ag-Cu ternary eutectics. Transactions of the Indian Institute of Metals 2012; 65(6): 777-782. doi:10.1007/s12666-012-0172-3.
36. Lewis D, Allen S, Notis M et al. Determination of the eutectic structure in the Ag-Cu-Sn system. Journal of Electronic Materials 2002; 31(2): 161-167. doi:10.1007/s11664-002-0163-y.
37. Ruggiero MA and Rutter JW. Origin of microstructure in the 332 K eutectic of the Bi-In-Sn system. Materials Science and Technology 1997; 13(1): 5-11. doi:10.1179/mst.1997.13.1.5.
38. Hötzer J. Massiv-parallele und großskalige Phasenfeldsimulationen zur Untersuchung der Mikrostrukturentwicklung, 2017. doi:10.5445/IR/1000069984.
39. Choudhury A and Nestler B. Grand-potential formulation for multicomponent phase transformations combined with thin-interface asymptotics of the double-obstacle potential. Physical Review E 2012; 85(2). doi:10.1103/physreve.85.021602.
40. Hötzer J, Tschukin O, Said M et al. Calibration of a multi-phase field model with quantitative angle measurement. Journal of Materials Science 2016; 51(4): 1788-1797. doi:10.1007/s10853-015-9542-7.
41. Hötzer J, Steinmetz P, Dennstedt A et al. Influence of growth velocity variations on the pattern formation during the directional solidification of ternary eutectic Al-Ag-Cu. Acta Materialia 2017; doi:10.1016/j.actamat.2017.07.007.
42. Hötzer J, Steinmetz P, Jainta M et al. Phase-field simulations of spiral growth during directional ternary eutectic solidification. Acta Materialia 2016; 106: 249-259. doi:10.1016/j.actamat.2015.12.052.
43. Steinmetz P, Hötzer J, Kellner M et al. Large-scale phase-field simulations of ternary eutectic microstructure evolution. Computational Materials Science 2016; 117: 205-214. doi:10.1016/j.commatsci.2016.02.001.
44. Steinmetz P, Kellner M, Hötzer J et al. Quantitative comparison of phase-field simulations with a ternary eutectic three-dimensional Jackson-Hunt approach. Computational Materials Science (submitted) 2016.
45. Hötzer J, Jainta M, Steinmetz P et al. Die Vielfalt der Musterbildung in Metallen. horizonte 2015; 45.
46. Steinmetz P, Kellner M, Hötzer J et al. Phase-field study of the pattern formation in Al-Ag-Cu under the influence of the melt concentration. Computational Materials Science 2016; 121: 6-13. doi:10.1016/j.commatsci.2016.04.025.
47. Kellner M, Sprenger I, Steinmetz P et al. Phase-field simulation of the microstructure evolution in the eutectic NiAl-34Cr system. Computational Materials Science 2017; 128: 379-387. doi:10.1016/j.commatsci.2016.11.049.
48. Padua D. Encyclopedia of Parallel Computing. Springer Science & Business Media, 2011. ISBN 978-0-387-09844-9.
49. Bland W, Lu H, Seo S et al. Lessons learned implementing user-level failure mitigation in MPICH. In 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing. pp. 1123-1126. doi:10.1109/ccgrid.2015.51.
Microstructure
evolution
42
Supercomputing for Material Sciences — Ulrich Rüde
Phase-field simulation of the directional solidification of the ternary eutectic system Al-Ag-Cu in a 12 000 × 12 000 × 65 142 voxel cell domain, calculated with 19 200 cores on the SuperMUC system.
10 Journal Title XX(X)
TE
T
z
G
v
periodic boundary condition
moving window
Neumann
boundary
condition
Dirichlet
boundary
condition
`
block
Figure 2. Setting to simulate the ternary eutectic directional
solidification based on [38]. Thereby, the melt `, consisting of
three components, solidifies in the three phases ,and . In
dashed blue the moving window technique with the
block-structured grid is highlighted. Bellow, in red the moving
analytic temperature gradient is shown.
temperature gradient @T@tare given as:
$$
\tau\epsilon\,\frac{\partial\phi_{\alpha}}{\partial t}
  = \epsilon\left( T\,\frac{\partial a(\phi,\nabla\phi)}{\partial\phi_{\alpha}}
      - \nabla\cdot T\,\frac{\partial a(\phi,\nabla\phi)}{\partial\nabla\phi_{\alpha}} \right)
    - \frac{1}{\epsilon}\,T\,\frac{\partial\omega(\phi)}{\partial\phi_{\alpha}}
    - \frac{\partial\psi(\phi,\mu,T)}{\partial\phi_{\alpha}}
  =: \mathrm{rhs}_{\alpha},
$$
$$
\tau\epsilon\,\frac{\partial\phi_{\alpha}}{\partial t}
  = \frac{1}{N}\sum_{\beta=1}^{N}\left(\mathrm{rhs}_{\alpha}-\mathrm{rhs}_{\beta}\right),
  \tag{4}
$$
$$
\frac{\partial\mu}{\partial t}
  = \left[\sum_{\alpha=1}^{N} h_{\alpha}(\phi)\,
          \frac{\partial c_{\alpha}(\mu,T)}{\partial\mu}\right]^{-1}
    \left( \nabla\cdot\bigl(M(\phi,\mu,T)\,\nabla\mu
           - J_{\mathrm{at}}(\phi,\mu,T)\bigr)
         - \sum_{\alpha=1}^{N} c_{\alpha}(\mu,T)\,
           \frac{\partial h_{\alpha}(\phi)}{\partial t}
         - \sum_{\alpha=1}^{N} h_{\alpha}(\phi)\,
           \frac{\partial c_{\alpha}(\mu,T)}{\partial T}\,
           \frac{\partial T}{\partial t} \right),
  \tag{5}
$$
$$
\frac{\partial T}{\partial t}
  = \frac{\partial}{\partial t}\bigl(T_{0} + G\,(z - v\,t)\bigr) = -G\,v.
  \tag{6}
$$
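Read as an explicit Euler update, equation (4) amounts to the following per-cell step. The sketch below only illustrates the constraint projection in (4) and the analytic temperature of (6); the names (computeRhs, eulerStepCell) and the dummy right-hand side are placeholders, not the optimized solver of [31]:

```cpp
#include <array>

// Illustrative only: N phase fields per cell, explicit Euler step dt.
constexpr int N = 4;                          // e.g. three solid phases plus the melt
using PhaseVec = std::array<double, N>;

// Placeholder for the bracketed right-hand side of eq. (4) for phase a
// (gradient term, obstacle potential and driving force); a dummy value
// is returned here so that the sketch is self-contained.
double computeRhs(const PhaseVec& phi, int a)
{
    return -phi[a] * (1.0 - phi[a]);
}

// One explicit Euler step of eq. (4) for a single cell.
void eulerStepCell(PhaseVec& phi, double tau, double eps, double dt)
{
    PhaseVec rhs{};
    double mean = 0.0;
    for (int a = 0; a < N; ++a) {
        rhs[a] = computeRhs(phi, a);
        mean += rhs[a];
    }
    mean /= N;

    // (1/N) * sum_b (rhs_a - rhs_b) = rhs_a - mean: the projected updates
    // sum to zero, so the constraint sum_a phi_a = 1 is preserved.
    for (int a = 0; a < N; ++a)
        phi[a] += dt / (tau * eps) * (rhs[a] - mean);
}

// Analytic temperature of eq. (6): T(z,t) = T0 + G*(z - v*t), hence dT/dt = -G*v.
double temperature(double T0, double G, double v, double z, double t)
{
    return T0 + G * (z - v * t);
}
```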
The evolution equation for the chemical potentials (5) alone
results in 1 384 floating point operations per cell and 680
bytes that need to be transferred from main memory [31].
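For orientation, these numbers correspond to an arithmetic intensity of roughly two flop per byte; with illustrative node characteristics of 100 GB/s sustained memory bandwidth and 500 Gflop/s peak (assumed values, not measurements from [31]), a simple roofline estimate is

$$
I = \frac{1384\ \text{flop}}{680\ \text{byte}} \approx 2.0\ \frac{\text{flop}}{\text{byte}},
\qquad
\min(P,\; I\cdot B) = \min(500,\; 2.0\cdot 100)\ \text{Gflop/s} \approx 200\ \text{Gflop/s} < P.
$$

Under these assumed numbers the sweep is limited by memory bandwidth rather than by peak compute, which is consistent with the node-level optimizations and explicit vectorization proposed in [31].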
Further details of the phase-field model are presented in
[8,39]. The discretizations in space with a finite difference
scheme and in time with an explicit Euler scheme are
specified in [40]. To efficiently solve the evolution equations
on current HPC systems, the model is optimized and
parallelized on different levels as proposed in [31]. Besides
explicit vectorization of the sweeps and parallelization with
MPI, a moving window approach is implemented on top of the block-structured grid data structures of WALBERLA, as depicted in Figure 2. This makes it possible to reduce the total simulation domain to a region around the solidification front (a schematic sketch of such a window shift is given after Figure 3). Typical simulations in representative volume elements require between 10 000 and 20 000 compute cores for multiple days [8–10]. A typical phase-field simulation of the directional solidification of the ternary eutectic system Al-Ag-Cu as described in [8] is shown in Figure 3. The 12 000 × 12 000 × 65 142 voxel cell domain was calculated on the SuperMUC system with 19 200 cores for approximately 19 hours. Three distinct solid phases of different composition grow into the undercooled melt and form characteristic microstructure patterns. On the left and right, rods of the two phases Al2Cu and Ag2Al are extracted to show the evolution of the single solid phases within the complex microstructure. The rods split, merge and overgrow each other during the simulation, as described in [8,41].
Figure 3. Phase-field simulation of the directional solidification of the ternary eutectic system Al-Ag-Cu in a 12 000 × 12 000 × 65 142 voxel cell domain, calculated with 19 200 cores on the SuperMUC system. A detailed discussion is presented in [8,41].
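The sketch below illustrates the idea behind such a moving-window shift on a block-structured grid: when the solidification front has advanced far enough, the oldest, fully solidified block layer behind it is written out and dropped, and a fresh layer of undercooled melt is appended ahead of the front. This is a minimal, schematic sketch under assumed data structures (BlockLayer, MovingWindow), not the WALBERLA implementation referenced in the text; in the real code the analytic temperature gradient and the boundary conditions presumably move along with the window as well.

```cpp
#include <cstddef>
#include <cstdio>
#include <deque>
#include <vector>

// Schematic moving window along the growth direction z.
// Each "layer" stands for one block layer of the block-structured grid.
struct BlockLayer {
    int zIndex;                   // position of the layer in global z
    std::vector<double> cells;    // field data owned by this layer (placeholder)
};

struct MovingWindow {
    std::deque<BlockLayer> layers;    // layers currently held in memory
    int nextZ = 0;                    // z index of the next layer to allocate
    std::size_t cellsPerLayer;

    MovingWindow(int initialLayers, std::size_t cells) : cellsPerLayer(cells)
    {
        for (int i = 0; i < initialLayers; ++i)
            layers.push_back({nextZ++, std::vector<double>(cells, 1.0)});  // 1.0 = melt
    }

    // Called after each (or every k-th) time step with the current
    // layer index of the solidification front.
    void shiftIfNeeded(int frontLayerIndex, int keepBehindFront)
    {
        // Shift as long as too many fully solidified layers trail the front.
        while (frontLayerIndex - layers.front().zIndex > keepBehindFront) {
            // Write the oldest layer to disk before discarding it.
            std::printf("writing solidified layer z=%d\n", layers.front().zIndex);
            layers.pop_front();
            // Append a fresh layer of undercooled melt ahead of the front.
            layers.push_back({nextZ++, std::vector<double>(cellsPerLayer, 1.0)});
        }
    }
};

int main()
{
    MovingWindow window(/*initialLayers=*/8, /*cells=*/64);
    // Pretend the front advances by one layer per "step".
    for (int front = 0; front < 20; ++front)
        window.shiftIfNeeded(front, /*keepBehindFront=*/4);
}
```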
Based on this highly parallel and optimized solver, the eutectic solidification of idealized systems [42–44] and of real ternary alloys like Al-Ag-Cu [8–10,41,45,46] and Ni-Al-Cr [47] was investigated in large-scale domains. In this way, the experimentally conjectured spiral growth could be confirmed [42], and in [46] the influence of different melt compositions on the evolving patterns was demonstrated.
7 Benchmarks
In this section, we present benchmarks to illustrate the
performance of our implementation. First, we introduce the
application parameters. Then we cover different aspects of
the presented checkpointing scheme and discuss the respective results.
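As a point of reference for the benchmarks, the sketch below shows the simplest interval-based form of checkpointing: every k time steps each MPI rank writes its locally owned field data to its own file, from which a restart can later resume. This is a generic, minimal sketch with assumed names (writeCheckpoint, per-rank binary files); it is not the scheme of Section 5.2 that is actually benchmarked here.

```cpp
#include <mpi.h>
#include <fstream>
#include <string>
#include <vector>

// Write the locally owned field data of this rank to its own binary file.
// File name and layout are illustrative; a real scheme would also store
// metadata such as the time step, domain decomposition and model parameters.
void writeCheckpoint(const std::vector<double>& localField, int rank, int step)
{
    const std::string name = "checkpoint_step" + std::to_string(step) +
                             "_rank" + std::to_string(rank) + ".bin";
    std::ofstream out(name, std::ios::binary);
    out.write(reinterpret_cast<const char*>(localField.data()),
              static_cast<std::streamsize>(localField.size() * sizeof(double)));
}

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    std::vector<double> localField(1 << 20, 0.0);  // stand-in for the phi, mu, T blocks
    const int numSteps = 1000;
    const int checkpointInterval = 250;            // "every k time steps"

    for (int step = 1; step <= numSteps; ++step) {
        // ... advance the phase-field solver by one explicit Euler step ...
        if (step % checkpointInterval == 0)
            writeCheckpoint(localField, rank, step);
    }

    MPI_Finalize();
    return 0;
}
```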
7.1 The test cases
To benchmark the presented checkpointing scheme of
Section 5.2, we simulate the directional solidification of a
ternary eutectic system using the implementation presented
in [31] of the phase-field model introduced in Section 6. The simulation parameters are taken from the spiral-growth study [42]. To resolve spiral growth, large domain sizes
and millions of iterations are required, resulting in massively
Ternary eutectic directional solidification. The melt ℓ, consisting of three components, solidifies into the three phases α, β and γ. Moving window technique with block-structured grid. Moving analytic temperature gradient.
[Figure 9: Weak scaling on SuperMUC (left), Hornet (middle) and JUQUEEN (right); MLUP/s per core plotted over the number of cores (2^0 up to 2^14); curves for interface, liquid and solid.]
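The performance metric of Figure 9, million lattice updates per second (MLUP/s) per core, can be computed from a run as follows; the concrete numbers in the example are made up for illustration and are not taken from the figure.

```cpp
#include <cstdio>

// MLUP/s per core = (cells * time steps) / (runtime * cores * 1e6)
double mlupsPerCore(double totalCells, double timeSteps,
                    double runtimeSeconds, double cores)
{
    return totalCells * timeSteps / (runtimeSeconds * cores * 1.0e6);
}

int main()
{
    // Illustrative numbers only (not from Figure 9):
    const double cells   = 1000.0 * 1000.0 * 1000.0;  // 1000^3 cell domain
    const double steps   = 10000.0;                    // number of time steps
    const double seconds = 3600.0;                     // total runtime
    const double cores   = 16384.0;                    // 2^14 cores
    std::printf("%.2f MLUP/s per core\n",
                mlupsPerCore(cells, steps, seconds, cores));
}
```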
Figure 10: Three-dimensional simulation and experimental results of directional solidification of the ternary eutectic system Ag-Al-Cu. (a) Simulation result; (b) tomography reconstruction of an experiment by A. Dennstedt. (Legend: ring, connection, chain.)
Figure 11: Extracted lamellae from the simulation depicted in Figure 10: (a) phase Al2Cu, (b) phase Ag2Al. The lamellae grew from left to right.
Conclusions
43
Ultrascalable Algorithms — Ulrich Rüde
44
The Two Principles of Science
Theory: mathematical models, differential equations (Newton)
Experiments: observation and prototypes (the empirical sciences)
...and a Third, Computational Science: simulation, optimization, (quantitative) virtual reality
Computational methods open the path to Predictive Science
Ultrascalable Algorithms — Ulrich Rüde
Coupled Flow for ExaScale — Ulrich Rüde
Computational Science is done in Teams
Dr.-Ing. Dominik Bartuschat
Martin Bauer, M.Sc. (hons)
Dr. Regina Degenhardt
Sebastian Eibl, M. Sc.
Dipl. Inf. Christian Godenschwager
Marco Heisig, M.Sc.(hons)
PD Dr.-Ing. Harald Köstler
Nils Kohl, M. Sc.
Sebastian Kuckuk, M. Sc.
Christoph Rettinger, M.Sc.(hons)
Jonas Schmitt, M. Sc.
Dipl.-Inf. Florian Schornbaum
Dominik Schuster, M. Sc.
Dominik Thönnes, M. Sc.
Dr.-Ing. Benjamin Bergen
Dr.-Ing. Simon Bogner
Dr.-Ing. Stefan Donath
Dr.-Ing. Jan Eitzinger
Dr.-Ing. Uwe Fabricius
Dr. rer. nat. Ehsan Fattahi
Dr.-Ing. Christian Feichtinger
Dr.-Ing. Björn Gmeiner
Dr.-Ing. Jan Götz
Dr.-Ing. Tobias Gradl
Dr.-Ing. Klaus Iglberger
Dr.-Ing. Markus Kowarschik
Dr.-Ing. Christian Kuschel
Dr.-Ing. Marcus Mohr
Dr.-Ing. Kristina Pickl
Dr.-Ing. Tobias Preclik
Dr.-Ing. Thomas Pohl
Dr.-Ing. Daniel Ritter
Dr.-Ing. Markus Stürmer
Dr.-Ing. Nils Thürey
45
Simulation and Predictability — Ulrich Rüde
Thank you for your attention!
46
Bogner, S., & UR. (2013). Simulation of floating bodies with the lattice Boltzmann method. Computers & Mathematics with
Applications, 65(6), 901-913.
Anderl, D., Bogner, S., Rauh, C., UR, & Delgado, A. (2014). Free surface lattice Boltzmann with enhanced bubble model.
Computers & Mathematics with Applications, 67(2), 331-339.
Bogner, S., Harting, J., & UR (2017). Simulation of liquid-gas-solid flow with a free surface lattice Boltzmann method. Submitted.
ResearchGate has not been able to resolve any citations for this publication.
Article
Full-text available
Over the last years, the phase-field method has been established to model capillarity-induced microstructural evolution in various material systems. Several phase-field models were introduced and different studies proved that the microstructure evolution is crucially affected by the triple junction (TJ’s) mobilities as well as the evolution of the dihedral angles. In order to understand basic mechanisms in multi-phase systems, we are interested in the time evolution of TJ’s, especially in the contact angles in these regions. Since the considered multi-phase systems consist of a high number of grains, it is not feasible to measure the angles at all TJ’s by hand. In this work, we present a method enabling the localization of TJ’s and the measurement of dihedral contact angles in the diffuse interface inherent in the phase-field model. Based on this contact angle measurement method, we show how to calibrate the phase-field model in order to satisfy Young’s law for different contact angles.
Article
Full-text available
Different three-phase microstructures are observed in directionally solidified Al–Ag–Cu ternary eutectics composed of an Ag2Al, an Al2Cu and a solid solution α-Al phase. A survey from regular to irregular structures is given and some special, new structural defects are observed and discussed. An approach for the quantitative description of the rich variety of complex three-phase microstructures is suggested including the defects. Shape factor and specific surface area of the phase particles are determined and compared for different microstructures.
Article
The solidification of multicomponent alloys is of high technical and scientific importance. In this review, we describe the ongoing research of the phase-field method for the solidification of dendritic and eutectic structures. Therefore, the corresponding experimental and theoretical investigations are presented. First, an overview of the historical development in solidification research is given. Thereafter, the ongoing progress of the phase-field models is reviewed. Then, we address the experimental and simulative investigations of different forms of dendritic and eutectic solidification. We distinguish between thermal and solutal dendritic growth as well as thin-sample and Bridgman furnace experiments of eutectic growth. Impurity-driven Mullins-Sekerka instabilities like cell structures, eutectic colonies and spiral dendritic growth are presented. Then, validation methods for the comparison between simulations, experiments and theoretical approaches are addressed. Subsequently, related aspects to simulate solidification are introduced. Especially, further physical aspects and computational optimizations are considered. Concluding, possible future research in the context of the phase-field method for solidification is discussed.
Article
This paper presents an enhancement to the free surface lattice Boltzmann method (FSLBM) for the simulation of bubbly flows including rupture and breakup of bubbles. The FSLBM uses a volume of fluid approach to reduce the problem of a liquid–gas two-phase flow to a single-phase free surface simulation. In bubbly flows compression effects leading to an increase or decrease of pressure in the suspended bubbles cannot be neglected. Therefore, the free surface simulation is augmented by a bubble model that supplies the missing information by tracking the topological changes of the free surface in the flow. The new model presented here is capable of handling the effects of bubble breakup and coalesce without causing a significant computational overhead. Thus, the enhanced bubble model extends the applicability of the FSLBM to a new range of practically relevant problems, like bubble formation and development in chemical reactors or foaming processes.
Article
This paper is devoted to the simulation of floating rigid bodies in free surface flows. For that, a lattice Boltzmann based model for liquid–gas–solid flows is presented. The approach is built upon previous work for the simulation of liquid–solid particle suspensions on the one hand, and on an interface-capturing technique for liquid–gas free surface flows on the other. The incompressible liquid flow is approximated by a lattice Boltzmann scheme, while the dynamics of the compressible gas are neglected. We show how the particle model and the interface capturing technique can be combined by a novel set of dynamic cell conversion rules. We also evaluate the behaviour of the free surface–particle interaction in simulations. One test case is the rotational stability of non-spherical rigid bodies floating on a plane water surface–a classical hydrostatic problem known from naval architecture. We show the consistency of our method in this kind of flows and obtain convergence towards the ideal solution for the heeling stability of a floating box.
Article
A search for lead-free solder alloys has produced an alloy in the Ag-Cu-Sn system. This alloy is of great importance to the soldering community, and proper determination of structure, processing, and properties will be significant. In the present study, tin-rich alloys were fabricated to better determine the much-debated morphology of secondary and tertiary phases in the eutectic structure. A deep etching procedure was used to reveal the growth structure of monovariant “eutectic-like” reactions as well as the ternary eutectic reaction. Scanning electron microscopy (SEM) and electron-probe microanalysis (EPMA) verified the three-phase nature of the eutectic. The rodlike eutectic structure in this system is consistent with the more simplified volume fraction and surface energy models that have been presented in the literature.
Article
In this paper, we describe the derivation of a model for the simulation of phase transformations in multicomponent real alloys starting from a grand-potential functional. We first point out the limitations of a phase-field model when evolution equations for the concentration and the phase-field variables are derived from a free energy functional. These limitations are mainly attributed to the contribution of the grand-chemical-potential excess to the interface energy. For a range of applications, the magnitude of this excess becomes large and its influence on interface profiles and dynamics is not negligible. The related constraint regarding the choice of the interface thickness limits the size of the domain that can be simulated and, hence, the effect of larger scales on microstructure evolution can not be observed. We propose a modification to the model in order to decouple the bulk and interface contributions. Following this, we perform the thin-interface asymptotic analysis of the phase-field model. Through this, we determine the thin-interface kinetic coefficient and the antitrapping current to remove the chemical potential jump at the interface. We limit our analysis to the Stefan condition at lowest order in ε (parameter related to the interface width) and apply results from previous literature that the corrections to the Stefan condition (surface diffusion and interface stretching) at higher orders are removed when antisymmetric interpolation functions are used for interpolating the grand-potential densities and the diffusion mobilities.
Determination of the eutectic structure in the Ag-Cu-Sn system
  • D Lewis
  • S Allen
  • M Notis
Lewis D, Allen S, Notis M et al. Determination of the eutectic structure in the Ag-Cu-Sn system. Journal of Electronic Materials 2002; 31(2): 161-167. doi:10.1007/s11664-0020163-y.
Origin of microstructure in the 332 k eutectic of the Bi-In-Sn system
  • M A Ruggiero
  • J W Rutter
Ruggiero MA and Rutter JW. Origin of microstructure in the 332 k eutectic of the Bi-In-Sn system. Materials Science and Technology 1997; 13(1): 5-11. doi:10.1179/mst.1997.13.1.5.
Massiv-parallele und großskalige phasenfeldsimulationen zur untersuchung der mikrostrukturentwicklung
  • J Hötzer
Hötzer J. Massiv-parallele und großskalige phasenfeldsimulationen zur untersuchung der mikrostrukturentwicklung, 2017. doi:10.5445/IR/1000069984.