
Ultrascalable Algorithms for Complex Flows

Ulrich Rüde

Lehrstuhl für Simulation (LSS), Universität Erlangen-Nürnberg
www10.informatik.uni-erlangen.de

CERFACS, Centre Européen de Recherche et de Formation Avancée en Calcul Scientifique, Toulouse
www.cerfacs.fr

ulrich.ruede@fau.de

[Talk announcement, VSC User Day 2018]
Venue: Koninklijke Vlaamse Academie van België voor Wetenschappen en Kunsten (Royal Flemish Academy of Belgium for Science and the Arts), Hertogstraat 1, 1000 Brussel, Tuesday, 22.05.2018.
Program, 10:00: "Ultrascalable algorithms for complex flows", Ulrich Rüde, CERFACS and Universität Erlangen-Nürnberg.


Motivation: Additive Manufacturing, Fast Electron Beam Melting

Körner, C. (2016). Additive manufacturing of metallic components by selective electron beam melting: a review. International Materials Reviews, 61(5), 361-377.
Klassen, A., Scharowsky, T., & Körner, C. (2014). Evaporation model for beam based additive manufacturing using free surface lattice Boltzmann methods. Journal of Physics D: Applied Physics, 47(27), 275303.
Markl, M., Ammer, R., Rüde, U., & Körner, C. (2015). Numerical investigations on hatching process strategies for powder-bed-based additive manufacturing using an electron beam. The International Journal of Advanced Manufacturing Technology, 78(1-4), 239-247.

Simulation of the Electron Beam Melting Process (Additive Manufacturing)

EU project Fast-EBM:
ARCAM (Gothenburg)
TWI (Cambridge)
FAU Material Sciences
FAU Simulation

Ammer, R., Markl, M., Ljungblad, U., Körner, C., & Rüde, U. (2014). Simulating fast electron beam melting with a parallel thermal free surface lattice Boltzmann method. Computers & Mathematics with Applications, 67(2), 318-330.
Ammer, R., Rüde, U., Markl, M., Jüchter, V., & Körner, C. (2014). Validation experiments for LBM simulations of electron beam melting. International Journal of Modern Physics C.

Simulation needs coupled nonlinear models of:

powder bed generation
  granular flow, hopper, rake
  particles: size and shape distribution, restitution, friction
electron beam
  beam generation, beam control
  beam absorption, energy transfer
selective welding
  heat transfer, phase transition: melting
  multiphase flow in complex geometry
  transport of still-solid particles (and oxides) in the melt
  surface tension, wetting, contact angles
  radiation
  evaporation: material and momentum balance
solidification
  microstructure evolution

Physical scales

A somewhat arbitrary table of characteristic sizes:
Atoms: 10^-10 m
Microstructures: 10^-6 m
Pores, grains, particles: 10^-4 m
Product: 10^-1 m

When we want to model
the product: may need 10^9 particles
the printing process: may need 10^3 cells per particle
the evolving microstructure: may need 10^9 phase-field cells per particle

Does it make sense to model the 3D printing of an artificial hip implant with 10^18 computational cells? Would it even be possible?
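A quick back-of-the-envelope sketch (Python) of how the 10^18 figure arises; the per-particle resolutions are the rough illustrative values from this slide, not measured data:

```python
# Rough scale estimate for resolving a 3D-printed part down to the microstructure.
# All resolutions below are the illustrative values from the slide, not measured data.
particles_per_product = 1e9              # ~10 cm part built from ~100 µm powder particles
cells_per_particle_process = 1e3         # LBM cells needed to resolve one particle in the melt pool
cells_per_particle_microstructure = 1e9  # phase-field cells to resolve the grains inside one particle

process_cells = particles_per_product * cells_per_particle_process
microstructure_cells = particles_per_product * cells_per_particle_microstructure

print(f"printing process resolution: ~{process_cells:.0e} cells")        # ~1e12
print(f"microstructure resolution:   ~{microstructure_cells:.0e} cells") # ~1e18
```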

Building Block I: Current and Future High Performance Supercomputers

SuperMUC: 3 PFlops

From "Peta" to "Exa" Supercomputers

JUQUEEN:
  Blue Gene/Q architecture
  458,752 PowerPC A2 cores
  16 cores (1.6 GHz) per node
  16 GiB RAM per node
  5D torus interconnect
  5.8 PFlops peak
  TOP500: #21

SuperMUC (phase 1):
  Intel Xeon architecture
  147,456 cores
  16 cores (2.7 GHz) per node
  32 GiB RAM per node
  Pruned tree interconnect
  3.2 PFlops peak
  TOP500: #40

Sunway TaihuLight:
  SW26010 processor
  10,649,600 cores
  260 cores (1.45 GHz) per node
  32 GiB RAM per node
  125 PFlops peak
  1.31 PByte RAM
  Power consumption: 15.37 MW
  TOP500: #1

Technological basis on the "nano" scale: clock rates of 2-3 GHz correspond to cycle times of 0.3-0.5 ns.

Chapter 3: Test Systems and Tools

This section describes the characteristics of the different test clusters and tools available at the RRZE high performance computing centre that are used to study power consumption and performance behaviour under a wide range of workloads. The command "likwid-topology -g" delivers a graphical output of the machine topology. Intel introduced the "Running Average Power Limit" (RAPL) energy sensors with the Sandy Bridge micro-architecture for measuring the energy consumption of short code paths; they are now available in almost all recent Intel CPUs [17]. The Intel "Tick/Tock" model [18] alternates micro-architectural changes with die shrinks of the process technology, as shown in Figure 3.1.

Figure 3.1: Intel tick-tock model towards Intel's next generations (32 nm process: Westmere, new processor (tick); Sandy Bridge, 2nd generation, new microarchitecture (tock); 22 nm process: Ivy Bridge, 3rd generation, new processor (tick); Haswell, 4th generation, new microarchitecture (tock)).

3.1. The "phinally" test system

The "phinally" test system has a dual-socket 8-core Intel Sandy Bridge EP processor with 16 logical cores per socket through hyper-threading (see Fig. 3.2). It operates at a 2.7 GHz base clock speed and features Intel turbo mode for increased performance on an as-needed basis. Sandy Bridge is the codename for the micro-architecture based on the 32 nm manufacturing process developed by Intel to replace the Westmere micro-architecture. Due to the Advanced Vector Extensions (AVX) 256-bit instruction set with wider vectors, a full socket of the phinally system has an overall theoretical peak performance P_peak of 172.8 GFlops/s for double precision and 345.6 GFlops/s for single precision.

based on: Ayesha Afzal, The Cost of Computation: Metrics and Models for Modern Multicore-based Systems in Scientific Computing, Master Thesis, FAU Erlangen, 2015.

8.3. Microscopic performance, power and/or energy models

The concern is to measure the number of joules for a particular byte transfer. During this transfer time the whole chip consumes the baseline energy E0, which is burnt anyway, even when the processor is idle and waits for the data transfer. The baseline energy E0 differs between benchmarks with different runtimes. Thus, a correction is applied by subtracting this baseline energy and considering only the dynamic part of the energy consumption.

Table 8.2: Single-core dynamic energy cost for one flop (treating addition and multiplication on the same footing) and for one byte transfer (load/store), with baseline power W0 = 15.9 W on the "phinally" system.

Metrics    | e_flop   | e_L1->REG [pJ/B] | e_L2->L1 [pJ/B] | e_L3->L2 [pJ/B] | e_MEM->L3 [pJ/B]
Flop only  | 830 pJ/F | 0                | 0               | 0               | 0
Load only  | 0        | 227              | 314             | 256             | 1880
Store only | 0        | 377              | 300             | 340             | 2977

Validation

The energy consumption of a variety of tasks with multiple operations can be predicted once the microscopic parameters (the energy cost of a single flop and of a byte transfer) are known. To validate this hypothesis, a real benchmark, a 2D Jacobi kernel, was chosen: it has a lot of data transfer and few flops, so the dominant energy contribution is the data movement. Measurements were compared with the analytically predicted energy consumption by combining the data transfer cost, the flop cost, the ECM model, and the layer condition in the energy model. For the Jacobi stencil code, three streams need to be transferred when the layer condition is satisfied. When the layer condition is violated, there are more data transfers (five streams) across expensive data paths and the code takes longer, which results in a larger energy cost.

Table 8.3: Single-core energy cost for one flop and for one byte transfer, with baseline power W0 = 15.9 W on the "phinally" system.

Metrics                     | Copy (AVX) (9 it x 1 GB) | Jacobi (lc L3) (100 it x 8k x 8k) | Jacobi (lc L2) (100 it x 8k x 8k, blocked)
Cycles per cache line       | 30.73                    | 42.8                              | 31.13
Memory data volume [Byte]   | 9.66E+9                  | 1.54E+11                          | 1.54E+11
E0 [pJ/B]                   | 1414                     | 1308                              | 949
Calculated total energy [J] | 46                       | 681                               | 649
Measured total energy [J]   | 45                       | 651                               | 638

Table 8.3 shows how closely the predicted values coincide with the measurements, both for the same and for different numbers of cache line transfers through the memory hierarchy. We observed that for streaming kernels on a single core the baseline energy is a large contribution to the total energy, compared to runs on multiple cores. For multiple cores, the total energy becomes increasingly dominated by the dynamic power (which our power model also predicts).

8-core Sandy Bridge system, measured through systematic benchmarking; see also Georg Hager's talk at PACO 2015.

The best values on the Green500 currently convert to 0.1 nJ/Flop: equivalent to 100 MW for exaflops performance.
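A minimal sketch (Python) of how such a dynamic-energy prediction can be assembled from the per-flop and per-byte costs of Table 8.2 plus a baseline power term; the kernel characteristics in the example are illustrative assumptions, not the thesis measurements:

```python
# Predict total energy as E = E_dynamic + W0 * runtime, with
# E_dynamic = n_flops * e_flop + bytes * e_byte.
# Per-flop / per-byte costs taken from Table 8.2 ("load only", main memory column);
# the kernel below (a generic streaming kernel) is an illustrative assumption.

E_FLOP = 830e-12       # J per flop
E_BYTE_MEM = 1880e-12  # J per byte loaded from main memory
W0 = 15.9              # W, baseline (static) power of the chip

def predicted_energy(n_flops, mem_bytes, runtime_s):
    """Total energy estimate in joules for one kernel run."""
    dynamic = n_flops * E_FLOP + mem_bytes * E_BYTE_MEM
    baseline = W0 * runtime_s
    return dynamic + baseline

# Example: a streaming kernel moving ~10 GB with ~1 flop per 8 bytes in ~1.5 s
print(f"{predicted_energy(n_flops=1.25e9, mem_bytes=1e10, runtime_s=1.5):.1f} J")
```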

Can we do "peta-scale" applications?

What is the largest system that we can solve today? And now, 13 years later?

Juqueen has ~400 TByte of main memory = 4*10^14 bytes, enough for about 5 vectors, each with N = 10^13 double precision elements.
A matrix-free implementation is necessary: even with a sparse matrix format, storing a matrix of dimension N = 10^13 is not possible.

Which algorithm? Multigrid:
Cost = C*N, with C "moderate", e.g. C = 200.
Does it parallelize well on that scale?
Should we worry since kappa = O(N^(2/3))?

Bergen, B., Hülsemann, F., & Rüde, U. (2005): Is 1.7*10^10 unknowns the largest finite element system that can be solved today? Proceedings of SC'05.
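A small sketch (Python) of the memory arithmetic behind the matrix-free argument; the nonzeros-per-row figure for a stored sparse matrix is an illustrative assumption for a low-order 3D discretization:

```python
# Memory budget for N = 1e13 unknowns in double precision (8 bytes each).
N = 1e13
bytes_per_double = 8

vector_tb = N * bytes_per_double / 1e12
print(f"one solution vector: {vector_tb:.0f} TB")      # 80 TB
print(f"five vectors:        {5 * vector_tb:.0f} TB")  # 400 TB, roughly Juqueen's RAM

# A stored sparse matrix (e.g. CRS): assume ~15 nonzeros per row (typical range for
# low-order 3D stencils), 8 bytes per value + 4 bytes per column index.
nnz_per_row = 15
matrix_tb = N * nnz_per_row * (8 + 4) / 1e12
print(f"sparse matrix:       {matrix_tb:.0f} TB")      # ~1800 TB, far beyond any machine
```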

Exploring the Limits …

... typically appear in simulations for molecules, quantum mechanics, or geophysics. The initial mesh T_2 consists of 240 tetrahedrons for the case of 5 nodes and 80 threads. The number of degrees of freedom on the coarse grid T_0 grows from 9.0*10^3 to 4.1*10^7 in the weak scaling. We consider the Stokes system with the Laplace-operator formulation. The relative accuracies for the coarse grid solver (PMINRES and CG algorithm) are set to 10^-3 and 10^-4, respectively. All other parameters for the solver remain as previously described.

Table 10: Weak scaling results with and without coarse grid for the spherical shell geometry.

nodes  | threads | DoFs      | iter | time   | time w.c.g. | time c.g. in %
5      | 80      | 2.7*10^9  | 10   | 685.88 | 678.77      | 1.04
40     | 640     | 2.1*10^10 | 10   | 703.69 | 686.24      | 2.48
320    | 5 120   | 1.2*10^11 | 10   | 741.86 | 709.88      | 4.31
2 560  | 40 960  | 1.7*10^12 | 9    | 720.24 | 671.63      | 6.75
20 480 | 327 680 | 1.1*10^13 | 9    | 776.09 | 681.91      | 12.14

Numerical results with up to 10^13 degrees of freedom are presented in Tab. 10, where we observe robustness with respect to the problem size and excellent scalability. Besides the time-to-solution (time) we also present the time excluding the coarse grid (time w.c.g.) and the percentage of the total time that is needed to solve the coarse grid. For this particular setup, this fraction does not exceed 12%. Due to 8 refinement levels, instead of 7 previously, and the reduction of threads per node from 32 to 16, longer computation times (time-to-solution) are expected compared to the results in Sec. 4.3. In order to evaluate the performance, we compute the factor t*n_c/n, where t denotes the time-to-solution (including the coarse grid), n_c the number of used threads, and n the degrees of freedom. This factor is a measure of the compute time per degree of freedom, weighted with the number of threads, under the assumption of perfect scalability. For 1.1*10^13 DoFs, this factor takes the value of approx. 2.3*10^-5, and for the case of 2.2*10^12 DoFs on the unit cube (Tab. 5) approx. 6.0*10^-5, which is of the same order. Thus, in both scaling experiments the time-to-solution per DoF is comparable. The reason why the ratio is even smaller for the extreme case of 1.1*10^13 DoFs is the deeper multilevel hierarchy. Recall also that the computational domain is different in both cases.

The computation of 10^13 degrees of freedom is close to the limits given by the shared memory of each node. By (8), we obtain a theoretical total memory consumption of 274.22 TB, and on one node of 14.72 GB. Though 16 GB of shared memory per node is available, we employ one further optimization step and do not allocate the right-hand side on the finest grid level. The right-hand side vector is replaced by an assembly on-the-fly, i.e., the right-hand side values are evaluated and integrated locally when needed. By applying this on-the-fly assembly, the theoretical ...
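A short sketch (Python) of the t*n_c/n factor used above, recomputed from two entries of Table 10:

```python
# Compute time per degree of freedom, weighted with the thread count:
# factor = t * n_c / n  (t: time-to-solution in s, n_c: threads, n: DoFs).
runs = [
    (5,     80,     2.7e9,  685.88),
    (20480, 327680, 1.1e13, 776.09),
]
for nodes, threads, dofs, time_s in runs:
    factor = time_s * threads / dofs
    print(f"{nodes:6d} nodes: t*n_c/n = {factor:.2e}")
# The largest run yields ~2.3e-5, matching the value quoted in the text.
```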

Multigrid with Uzawa Smoother, Optimized for Minimal Memory Consumption

10^13 unknowns correspond to 80 TByte for the solution vector alone.
Juqueen has 450 TByte of memory.
A matrix-free implementation is essential.

Gmeiner, B., Huber, M., John, L., Rüde, U., & Wohlmuth, B. (2016). A quantitative performance study for Stokes solvers at the extreme scale. Journal of Computational Science.

What are the largest FE computations today?

Energy

Scaling of algorithmic energy consumption, assuming Energy(Flop) = 1 nJ.

Computer scale for reference: gigascale 10^9, terascale 10^12, petascale 10^15, exascale 10^18.

Problem scale: DoF = N             | 10^6     | 10^9     | 10^12    | 10^15
Direct method: 1*N^2               | 0.278 Wh | 278 kWh  | 278 GWh  | 278 PWh
Krylov method: 100*N^1.33          | 10 Ws    | 28 Wh    | 278 kWh  | 2.77 GWh
Full Multigrid: 200*N              | 0.2 Ws   | 0.056 Wh | 56 Wh    | 56 kWh
TerraNeo prototype (est. Juqueen)  | 0.13 Wh  | 30 Wh    | 27 kWh   | ?

Operation counts for the solution of the Laplace equation in 3D with N = n^3 unknowns:
Direct methods:
  banded: ~n^7 = N^2.33
  nested dissection: ~n^6 = N^2
Iterative methods:
  Jacobi: ~50 n^5 = 50 N^1.66
  CG: ~100 n^4 = 100 N^1.33
  Full Multigrid: ~200 n^3 = 200 N

Tera-scale problems: what must we NOT do?

Do not use "standard" algorithms. Assume we use an O(N^2) algorithm on a problem with N = 10^12:
If time(N=1) = 10^-9 s, then time(N=10^12) = 10^(2*12-9) s, more than 30 million years.
If energy(N=1) = 10^-9 J, then energy(N=10^12) = 10^(2*12-9) J = 277 GWh.
We also cannot store a system matrix: even with a sparse format, N = 10^12 is way too big.

For tera-scale problems we must use optimal algorithms, and the constants are essential!
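A small sketch (Python) reproducing the table entries above from the assumed complexities and the 1 nJ/Flop energy cost:

```python
# Energy = (number of operations) * 1 nJ, for different algorithmic complexities.
E_FLOP = 1e-9  # J per flop (assumption from the slide)

algorithms = {
    "direct  (1*N^2)     ": lambda N: N**2,
    "Krylov  (100*N^1.33)": lambda N: 100 * N**1.33,
    "FMG     (200*N)     ": lambda N: 200 * N,
}
for name, work in algorithms.items():
    for N in (1e6, 1e9, 1e12):
        energy_j = work(N) * E_FLOP
        print(f"{name} N={N:.0e}: {energy_j:10.3e} J = {energy_j/3600:9.3e} Wh")
# e.g. the direct method at N=1e12 gives 1e15 J, i.e. about 278 GWh.
```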

Building Block II: Granular media simulations with the PE physics engine

Pöschel, T., & Schwager, T. (2005). Computational granular dynamics: models and algorithms. Springer Science & Business Media.

(Photo: hiking 2016 in the Silvretta mountains.)

Lagrangian Particle Representation

A single particle is described by
its state variables (position x, orientation φ, translational and angular velocities v and ω),
a parameterization of its shape S (e.g. a geometric primitive, a composite object, or a mesh),
and its inertia properties (mass m, principal moments of inertia Ixx, Iyy, Izz).

The Newton-Euler equations of motion for rigid bodies describe the rate of change of the state variables:

Newton-Euler Equations for Rigid Bodies

$$\begin{pmatrix} \dot{x}(t) \\ \dot{\varphi}(t) \end{pmatrix} = \begin{pmatrix} v(t) \\ Q(\varphi(t))\,\omega(t) \end{pmatrix}$$

$$M(\varphi(t)) \begin{pmatrix} \dot{v}(t) \\ \dot{\omega}(t) \end{pmatrix} = \begin{pmatrix} f(s(t),t) \\ \tau(s(t),t) - \omega(t) \times I(\varphi(t))\,\omega(t) \end{pmatrix}$$

Integrator of order one, similar to the semi-implicit Euler method:

$$\begin{pmatrix} x' \\ \varphi' \end{pmatrix} = \begin{pmatrix} x \\ \varphi \end{pmatrix} + \delta t \begin{pmatrix} v' \\ Q(\varphi)\,\omega' \end{pmatrix}$$

$$\begin{pmatrix} v' \\ \omega' \end{pmatrix} = \begin{pmatrix} v \\ \omega \end{pmatrix} + \delta t\, M(\varphi)^{-1} \begin{pmatrix} f(s,t) \\ \tau(s,t) - \omega \times I(\varphi)\,\omega \end{pmatrix}$$

Integration of the positions is implicit in the velocities; integration of the velocities is explicit.
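A compact sketch (Python) of this first-order integrator for a single rigid body, assuming a diagonal body-frame inertia tensor and a quaternion orientation; the helper details are illustrative and not the PE implementation:

```python
import numpy as np

def quat_mul(a, b):
    """Hamilton product of two quaternions (w, x, y, z)."""
    aw, ax, ay, az = a; bw, bx, by, bz = b
    return np.array([aw*bw - ax*bx - ay*by - az*bz,
                     aw*bx + ax*bw + ay*bz - az*by,
                     aw*by - ax*bz + ay*bw + az*bx,
                     aw*bz + ax*by - ay*bx + az*bw])

def step(x, q, v, w, m, I_body, R, force, torque, dt):
    """One semi-implicit Euler step for a rigid body.
    x: position, q: orientation quaternion, v: linear velocity, w: angular velocity
    (world frame), I_body: diagonal body-frame inertia, R: rotation matrix of q,
    force/torque: external loads in the world frame."""
    I_world = R @ np.diag(I_body) @ R.T
    # Velocities first (explicit in the loads, includes the gyroscopic term):
    v_new = v + dt * force / m
    w_new = w + dt * np.linalg.solve(I_world, torque - np.cross(w, I_world @ w))
    # Positions second, using the *new* velocities (hence "semi-implicit"):
    x_new = x + dt * v_new
    dq = 0.5 * quat_mul(np.concatenate(([0.0], w_new)), q)   # q_dot = 0.5 * (0, w) * q
    q_new = q + dt * dq
    return x_new, q_new / np.linalg.norm(q_new), v_new, w_new
```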

Contact Detection

Formal description of contact detection for a pair of convex rigid bodies:

$$\hat{x}(t) = \underset{f_2(y) \le 0}{\arg\min}\, f_1(y), \qquad n(t) = \nabla f_2(\hat{x}(t)), \qquad \xi(t) = f_1(\hat{x}(t))$$

f_1, f_2: signed distance functions of body 1 and body 2
ξ: minimum signed distance
n: surface normal

Time-continuous non-penetration constraint for hard contacts:

$$0 \le \xi(t) \;\perp\; \lambda_n(t) \ge 0$$
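For the simplest shape pair, two spheres, these contact quantities reduce to closed-form expressions; a minimal sketch (Python), not the PE collision system:

```python
import numpy as np

def sphere_sphere_contact(c1, r1, c2, r2):
    """Gap (signed distance), contact normal and contact point for two spheres.
    c1, c2: center positions, r1, r2: radii. A negative gap means penetration."""
    d = c2 - c1
    dist = np.linalg.norm(d)
    normal = d / dist if dist > 0 else np.array([0.0, 0.0, 1.0])
    gap = dist - (r1 + r2)                       # xi: signed separation
    contact_point = c1 + normal * (r1 + 0.5 * gap)
    return gap, normal, contact_point

gap, n, p = sphere_sphere_contact(np.zeros(3), 1.0, np.array([1.5, 0.0, 0.0]), 1.0)
print(gap, n, p)   # gap = -0.5 (overlap), normal along +x
```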

Discretization Underlying the Time-Stepping

The hard-contact conditions are formulated as complementarity conditions on several levels: non-penetration conditions (Signorini condition on the position level, impact law on the velocity level) and Coulomb friction conditions (friction cone condition, frictional reaction opposes slip). Each condition is stated both time-continuously in terms of forces and, after discretization, in terms of impulses. On the position level, for example, the non-penetration condition reads 0 <= ξ ⊥ λ_n >= 0, and the friction force is restricted to the cone ||λ_to||_2 <= μ λ_n, with the sliding friction force opposing the relative tangential contact velocity.

Nonlinear Complementarity: Measure Differential Inclusions

Preclik, T., & Rüde, U. (2015). Ultrascale simulations of non-smooth granular dynamics. Computational Particle Mechanics, 2(2), 173-196.
Preclik, T., Eibl, S., & Rüde, U. (2017). The Maximum Dissipation Principle in Rigid-Body Dynamics with Purely Inelastic Impacts. arXiv preprint:1706.00221.

Parallel Computation

Key features of the parallelization:
domain partitioning
distribution of data
synchronization protocol
subdomain NBGS
accumulators and corrections
aggressive message aggregation
nearest-neighbor communication

Iglberger, K., & Rüde, U. (2010). Massively parallel granular flow simulations with non-spherical particles. Computer Science - Research and Development, 25(1-2), 105-113.
Iglberger, K., & Rüde, U. (2011). Large-scale rigid body simulations. Multibody System Dynamics, 25(1), 81-95.

Shaker scenario with sharp-edged hard objects: 864,000 sharp-edged particles with a diameter between 0.25 mm and 2 mm.

7.1 Scalability of Granular Gases

Figure 7.3: The time-step profiles for two weak-scaling executions of the granular gas on the Emmy cluster with 25^3 particles per process. (a) Time-step profile of the granular gas executed with 5x2x2 = 20 processes on a single node (slices: 25.9%, 9.5%, 8.0%, 25.8%, 18.1%, 12.6%). (b) Time-step profile of the granular gas executed with 8x8x5 = 320 processes on 16 nodes (slices: 16.0%, 5.9%, 22.7%, 22.7%, 30.6%, 16.5%, 8.3%).

... domain decompositions. The scaling experiment for the one-dimensional domain decompositions (20x1x1, ..., 10240x1x1) performs best and achieves on 512 nodes a parallel efficiency of 98.3% with respect to the single-node performance. The time measurements for two-dimensional domain decompositions (5x4x1, 8x5x1, ..., 128x80x1) are consistently slower, but the parallel efficiency does not drop below 89.7%. The time measurements for three-dimensional domain decompositions (5x2x2, 5x4x2, ..., 32x20x16) come in last, and the parallel efficiency goes down to 76.1% for 512 nodes. Again this behaviour can be explained by the differences in the communication volumes of one-, two- and three-dimensional domain decompositions. The largest weak-scaling setups in this experiment contained 1.6*10^8 non-spherical particles.

Fig. 7.3 breaks down the wall-clock time of the various time-step components in two-level pie charts. The times are averaged over all time steps and processes. The dark blue section corresponds to the fraction of the time step used for detecting and filtering contacts. The orange section corresponds to the time used for initializing the velocity accumulators. The time to relax the contacts is indicated by the yellow slice; it includes the contact sweeps for all 10 iterations without the velocity synchronization. The time used by all velocity synchronizations is shown in the green section, which includes the synchronizations for each iteration and the synchronization after the initialization of the velocity accumulators. This time slice is split up on the second level into the time used for assembling, exchanging, and processing the velocity correction messages (dark green ...)

Granular gas: scaling results

The solver is algorithmically not optimal for dense systems and hence cannot scale unconditionally, but it is highly efficient in many cases of practical importance.

Strong and weak scaling results for a constant number of iterations, performed on SuperMUC and Juqueen.

Largest ensembles computed:
2.8 x 10^10 non-spherical particles
1.1 x 10^10 contacts

Fig. 5: Inter-node weak-scaling graphs for a granular gas on all test machines: (a) weak-scaling graph on the Emmy cluster, (b) weak-scaling graph on the Juqueen supercomputer (average time per time step and 1000 particles, and parallel efficiency, plotted over 1 to 16384 nodes for two measurement series), (c) weak-scaling graph on the SuperMUC supercomputer. (From Preclik & Rüde.)
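Parallel efficiency in such weak-scaling plots is simply the single-node time divided by the time on p nodes, since the work per node is constant; a small sketch (Python) with made-up timings for illustration:

```python
# Weak-scaling parallel efficiency: eta(p) = t(1 node) / t(p nodes).
# The timings below are illustrative only, not the measured values.
timings = {1: 0.098, 64: 0.101, 1024: 0.105, 16384: 0.112}  # s per step and 1000 particles

t_ref = timings[1]
for nodes, t in sorted(timings.items()):
    print(f"{nodes:6d} nodes: {t:.3f} s  ->  efficiency {t_ref / t:.1%}")
```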

The reason why the measured times in the first series became shorter for 4,096 nodes and more is revealed when considering how the processes get mapped to the hardware. The default mapping on Juqueen is ABCDET, where the letters A to E stand for the five dimensions of the torus network, and T stands for the hardware thread within each node. The six-dimensional coordinates are then mapped to the MPI ranks in row-major order, that is, the last dimension increases fastest. The T coordinate is limited by the number of processes per node, which was 64 for the above measurements. Upon creation of a three-dimensional communicator, the three dimensions of the domain partitioning are also mapped in row-major order. The effect is that, if the number of processes in the z-dimension is less than the number of processes per node, a two-dimensional or even three-dimensional section of the domain partitioning is mapped to a single node. However, if the number of processes in the z-dimension is larger than or equal to the number of processes per node, only a one-dimensional section of the domain partitioning is mapped to a single node. A one-dimensional section of the domain partitioning performs considerably less intra-node communication than a two- or three-dimensional section. This matches exactly the situation for 2,048 and 4,096 nodes. For 2,048 nodes, a two-dimensional section 1x2x32 of the domain partitioning 64x64x32 is mapped to each node, and for 4,096 nodes a one-dimensional section 1x1x64 of the domain partitioning 64x64x64 is mapped to each node. To substantiate this claim, we confirmed that the performance jump occurs when the last dimension of the domain partitioning reaches the number of processes per node, also when using 16 and 32 processes per node.

Fig. 5c presents the weak-scaling results on the SuperMUC supercomputer. The setup differs from the granular gas scenario presented in Sect. 7.2.1 in that it is more dilute. The distance between the centers of two granular particles along each spatial dimension is 2 cm, amounting to a solid volume fraction of 3.8% and consequently to fewer collisions. As on the Juqueen supercomputer, only three-dimensional domain partitionings were used. All runs on up to 512 nodes were running within a single island. The run on 1,024 nodes also used the minimum number of 2 islands. The run on 4,096 nodes used nodes from 9 islands, and the run on 8,192 nodes used nodes from 17 islands, that is, both runs used one island more than required. The graph shows that most of the performance is lost in runs on up to 512 nodes. In these runs only the non-blocking intra-island communication is utilised. Thus this part of the setup is very similar to the Emmy cluster, since it also has dual-socket nodes with Intel Xeon E5 processors and a non-blocking tree Infiniband network. Nevertheless, the intra-island scaling results are distinctly worse. The reasons for these differences were not yet investigated further. However, the scaling behaviour beyond a single island is decent, featuring a parallel efficiency of 73.8% with respect to a single island. A possible explanation of the underperforming intra-node scaling behaviour could be that some of the Infiniband links were degraded to QDR, which was a known problem at the time the extreme-...

Breakdown of compute times on the Erlangen RRZE cluster Emmy.

Building Block III: Scalable PDE Simulations

Succi, S. (2001). The lattice Boltzmann equation: for fluid dynamics and beyond. Oxford University Press.
Feichtinger, C., Donath, S., Köstler, H., Götz, J., & Rüde, U. (2011). WaLBerla: HPC software design for computational engineering simulations. Journal of Computational Science, 2(2), 105-112.

Domain Partitioning and Parallelization

static load balancing
allocation of block data (→ grids)
static block-level refinement (→ forest of octrees)
separation of domain partitioning from simulation (optional): the partitioning is stored to disk as a compact (KiB/MiB) binary file via MPI I/O

Parallel AMR Load Balancing

Different views on the domain partitioning:
forest of octrees: the octrees are not explicitly stored, but implicitly defined via block IDs (see the sketch below)
2:1 balanced grid (used for the LBM)
distributed graph: nodes = blocks, edges explicitly stored as <block ID, process rank> pairs
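The implicit octree encoding can be sketched as follows: a block ID is built by starting from a root ID and appending 3 bits (the child octant) per refinement level, so the tree structure is recoverable from the IDs alone. A minimal sketch (Python), illustrative only and not the waLBerla encoding itself:

```python
def child_id(block_id: int, octant: int) -> int:
    """Append one refinement level: octant is 0..7 (one bit per x/y/z split)."""
    assert 0 <= octant < 8
    return (block_id << 3) | octant

def parent_id(block_id: int) -> int:
    """Strip the last refinement level."""
    return block_id >> 3

def level(block_id: int, root_bits: int = 1) -> int:
    """Number of refinement levels below the root (root encoded with a marker bit)."""
    return (block_id.bit_length() - root_bits) // 3

root = 0b1                            # marker bit so that leading zero octants are preserved
b = child_id(child_id(root, 5), 2)    # refine twice: octant 5, then octant 2
print(bin(b), level(b), bin(parent_id(b)))   # 0b1101010, level 2, parent 0b1101
```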

Adaptive Mesh Refinement and Load Balancing

Isaac, T., Burstedde, C., Wilcox, L. C., & Ghattas, O. (2015). Recursive algorithms for distributed forests of octrees. SIAM Journal on Scientific Computing, 37(5), C497-C531.
Meyerhenke, H., Monien, B., & Sauerwald, T. (2009). A new diffusion-based multilevel algorithm for computing graph partitions. Journal of Parallel and Distributed Computing, 69(9), 750-761.
Schornbaum, F., & Rüde, U. (2016). Massively parallel algorithms for the lattice Boltzmann method on nonuniform grids. SIAM Journal on Scientific Computing, 38(2), C96-C126.
Schornbaum, F., & Rüde, U. (2017). Extreme-scale block-structured adaptive mesh refinement. arXiv preprint:1704.06829.

Performance on a Coronary Artery Geometry

Weak scaling: 458,752 cores of JUQUEEN, over a trillion (10^12) fluid lattice cells.
Strong scaling: 32,768 cores of SuperMUC, cell sizes of 0.1 mm, 2.1 million fluid cells, 6000 time steps per second.
(Figure: color-coded process assignment.)

Godenschwager, C., Schornbaum, F., Bauer, M., Köstler, H., & Rüde, U. (2013). A framework for hybrid parallel flow simulations with a trillion cells in complex geometries. In Proceedings of SC13: International Conference for High Performance Computing, Networking, Storage and Analysis (p. 35). ACM.

Single Node Performance

(Chart: lattice Boltzmann single-node performance on SuperMUC and JUQUEEN for standard, optimized, and vectorized kernel versions.)

Pohl, T., Deserno, F., Thürey, N., Rüde, U., Lammers, P., Wellein, G., & Zeiser, T. (2004). Performance evaluation of parallel large-scale lattice Boltzmann applications on three supercomputing architectures. Proceedings of the 2004 ACM/IEEE Conference on Supercomputing (p. 21). IEEE Computer Society.
Donath, S., Iglberger, K., Wellein, G., Zeiser, T., Nitsure, A., & Rüde, U. (2008). Performance comparison of different parallel lattice Boltzmann implementations on multi-core multi-socket systems. International Journal of Computational Science and Engineering, 4(1), 3-11.

LBM AMR Performance

JUQUEEN, space filling curve: Morton order. (Chart: time in seconds (0 to 12) over 256 to 458,752 cores, for 31,062 / 127,232 / 429,408 cells per core, i.e. up to 14, 58, and 197 billion cells at full scale.)
Hybrid MPI+OpenMP version with SMP: 1 process ⇔ 2 cores ⇔ 8 threads.

(Source: Peta-Scale Simulations with the HPC Framework waLBerla: Massively Parallel AMR for the LBM, Florian Schornbaum, FAU Erlangen-Nürnberg, April 15, 2016.)

LBM AMR Performance

JUQUEEN, diffusion load balancing. (Chart: time in seconds (0 to 12) over 256 to 458,752 cores, for 31,062 / 127,232 / 429,408 cells per core, i.e. up to 14, 58, and 197 billion cells at full scale.) The time is almost independent of the number of processes.

Multi-Physics Simulations for Particulate Flows: Parallel Coupling with waLBerla and PE

Ladd, A. J. (1994). Numerical simulations of particulate suspensions via a discretized Boltzmann equation. Part 1. Theoretical foundation. Journal of Fluid Mechanics, 271(1), 285-309.
Tenneti, S., & Subramaniam, S. (2014). Particle-resolved direct numerical simulation for gas-solid flow model development. Annual Review of Fluid Mechanics, 46, 199-230.
Bartuschat, D., Fischermeier, E., Gustavsson, K., & Rüde, U. (2016). Two computational models for simulating the tumbling motion of elongated particles in fluids. Computers & Fluids, 127, 17-35.

Fluid-Structure Interaction: direct simulation of particle-laden flows (4-way coupling)

Götz, J., Iglberger, K., Stürmer, M., & Rüde, U. (2010). Direct numerical simulation of particulate flows on 294912 processor cores. In Proceedings of Supercomputing 2010, IEEE Computer Society.
Götz, J., Iglberger, K., Feichtinger, C., Donath, S., & Rüde, U. (2010). Coupling multibody dynamics and computational fluid dynamics on 8192 processor cores. Parallel Computing, 36(2), 142-151.

Mapping Moving Obstacles into the LBM Fluid Grid

An example. The particle is mapped onto the lattice by flagging cells: fluid cells, no-slip cells, acceleration cells, velocity/pressure cells. PDFs along the boundary links act as forces on the particle (momentum calculation). As the particle moves, cells change state from fluid to particle (they become covered by the obstacle) and from particle to fluid (they are uncovered), as sketched below.
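A minimal sketch (Python) of this mapping step: flag the cells covered by a sphere on a uniform grid and detect which cells change state when the particle moves. The force accumulation over boundary links is only indicated by a comment, and nothing here is the waLBerla/PE implementation:

```python
import numpy as np

FLUID, PARTICLE = 0, 1

def map_sphere(shape, spacing, center, radius):
    """Flag field: PARTICLE for cell centers inside the sphere, FLUID otherwise."""
    idx = np.indices(shape)                      # (3, nx, ny, nz) integer indices
    centers = (idx + 0.5) * spacing              # cell-center coordinates
    dist2 = sum((centers[d] - center[d])**2 for d in range(3))
    return np.where(dist2 <= radius**2, PARTICLE, FLUID)

shape, dx = (32, 32, 32), 1.0
flags_old = map_sphere(shape, dx, center=(12.0, 16.0, 16.0), radius=5.0)
flags_new = map_sphere(shape, dx, center=(13.0, 16.0, 16.0), radius=5.0)  # particle moved in x

fluid_to_particle = (flags_old == FLUID) & (flags_new == PARTICLE)   # newly covered cells
particle_to_fluid = (flags_old == PARTICLE) & (flags_new == FLUID)   # newly uncovered cells
print(fluid_to_particle.sum(), particle_to_fluid.sum())

# In the coupled LBM scheme, the hydrodynamic force on the particle would be accumulated
# from the PDFs crossing fluid/particle boundary links (momentum exchange); newly uncovered
# cells need their PDFs re-initialized before they can stream again.
```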

Comparison between Coupling Methods

Example: a single moving particle. Evaluation of the oscillating oblique regime (Re = 263, Ga = 190): correctly represented by the momentum exchange method (less well by the Noble and Torczynski method).
Different coupling variants: first-order bounce back (BB); second-order central linear interpolation (CLI).
Cross validation with the spectral method of Uhlmann & Dušek.

Figure 4: Contours of the projected relative velocity u_r∥ for case B-CLI-48 (Ga = 178.46). Contours are at (-0.4:0.2:1.2), where the red line outlines the recirculation area with u_r∥ = 0. The blue cross in the left plot marks the location taken for the calculation of the recirculation length L_r. (Visualization of the recirculation length in the particle wake.)

... of the bifurcation point, and the respective method will then fail to capture this motion at Ga = 190. In Fig. 7, a phase-space diagram of the results for the different coupling algorithms together with the reference data is shown for the two resolutions D/Δx = 36 and 48. The expected time-periodic behavior is a closed curve around a fixed midpoint. Even for the finer resolution, only CLI and MR are able to capture this oscillating motion accurately. Oscillations can also be found for BB, but the amplitude in u_pH is too large and the value of u_pV around which the curve oscillates changes slightly in time. On the other hand, all PSC variants yield exponentially decaying oscillations and thus fail to capture this instability. It is worth noting that CLI is also able to resemble the time-periodic oscillations with a resolution of D/Δx = 36, whereas MR shows strong deviations from a closed curve. This motion can be analyzed in more detail by calculating the time average and fluctuation values of the different sphere velocities. These values are given in Tab. 3, where an overbar denotes the average and a prime the fluctuation part of a quantity; their exact definitions can be found in [32]. Tab. 3 also shows the frequency of the oscillation, which is calculated with the help of a discrete Fourier transformation. It can be seen that the average of the u_pV signal is captured well by the MEM variants, with errors well below 2% for the fine resolution. In contrast to that, the PSC variants' ...

Uhlmann, M., & Dušek, J. (2014). The motion of a single heavy sphere in ambient fluid: A benchmark for interface-resolved particulate flow simulations with significant relative velocities. International Journal of Multiphase Flow, 59.
Noble, D. R., & Torczynski, J. R. (1998). A Lattice-Boltzmann Method for Partially Saturated Computational Cells. International Journal of Modern Physics C.
Rettinger, C., & Rüde, U. (2017). A comparative study of fluid-particle coupling methods for fully resolved lattice Boltzmann simulations. Computers & Fluids.

Simulation of Suspended Particle Transport

0.864*10^9 LBM cells, 350,000 spherical particles.

Preclik, T., Schruff, T., Frings, R., & Rüde, U. (2017). Fully resolved simulations of dune formation in riverbeds. In High Performance Computing: 32nd International Conference, ISC High Performance 2017, Frankfurt, Germany, June 18-22, 2017, Proceedings (Vol. 10266, p. 3). Springer.

Sedimentation and Fluidized Beds

3 levels of mesh refinement
3,800 spherical particles
Galileo number 50
128 processes
1,024-4,000 blocks
block size 32^3

Building Block V: Free Surface Flows, a Volume-of-Fluid Method

Joint work with R. Ammer, S. Bogner, M. Bauer, D. Anderl, N. Thürey, S. Donath, T. Pohl, C. Körner, A. Delgado.

Körner, C., Thies, M., Hofmann, T., Thürey, N., & Rüde, U. (2005). Lattice Boltzmann model for free surface flow for modeling foaming. Journal of Statistical Physics, 121(1-2), 179-196.
Donath, S., Feichtinger, C., Pohl, T., Götz, J., & Rüde, U. (2010). A parallel free surface lattice Boltzmann method for large-scale applications. Parallel Computational Fluid Dynamics: Recent Advances and Future Directions, 318.
Anderl, D., Bauer, M., Rauh, C., Rüde, U., & Delgado, A. (2014). Numerical simulation of adsorption and bubble interaction in protein foams using a lattice Boltzmann method. Food & Function, 5(4), 755-763.
Bogner, S., Ammer, R., & Rüde, U. (2015). Boundary conditions for free interfaces with the lattice Boltzmann method. Journal of Computational Physics, 297, 1-12.

Volume-of-Fluid-like approach:
Flag field: compute only in the fluid (see the sketch below).
Special "free surface" conditions in the interface cells.
Reconstruction of the curvature for the surface tension.
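The flag-field idea can be sketched as a classification by fill level: each cell stores its fluid volume fraction, and only fluid and interface cells are computed. A minimal sketch (Python), illustrative only and not the waLBerla free-surface kernels:

```python
import numpy as np

GAS, INTERFACE, FLUID = 0, 1, 2

def classify(fill_level, eps=1e-6):
    """Flag field from the fluid volume fraction per cell (0 = gas, 1 = fluid).
    Interface cells are partially filled; the LBM is solved only in FLUID and
    INTERFACE cells, while GAS cells carry no PDFs."""
    flags = np.full(fill_level.shape, INTERFACE, dtype=np.int8)
    flags[fill_level <= eps] = GAS
    flags[fill_level >= 1.0 - eps] = FLUID
    return flags

fill = np.array([0.0, 0.0, 0.3, 0.9, 1.0, 1.0])   # a 1D column of cells, gas above fluid
print(classify(fill))                              # [0 0 1 1 2 2]
```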

Simulation for Hygiene Products (for Procter & Gamble)

capillary pressure
inclination
surface tension
contact angle

Additive Manufacturing: Fast Electron Beam Melting

Bikas, H., Stavropoulos, P., & Chryssolouris, G. (2015). Additive manufacturing methods and modelling approaches: a critical review. The International Journal of Advanced Manufacturing Technology, 1-17.
Klassen, A., Scharowsky, T., & Körner, C. (2014). Evaporation model for beam based additive manufacturing using free surface lattice Boltzmann methods. Journal of Physics D: Applied Physics, 47(27), 275303.
Körner, C., Thies, M., Hofmann, T., Thürey, N., & Rüde, U. (2005). Lattice Boltzmann model for free surface flow for modeling foaming. Journal of Statistical Physics, 121(1-2), 179-196.

Simulation of Electron Beam Melting

Simulating the powder bed generation using the PE framework.
A high speed camera shows the melting step for manufacturing a hollow cylinder, compared against the waLBerla simulation.

Building Block VI: Phase Field Simulations of Solidification Processes

Microstructures forming during ternary eutectic directional solidification.

Bauer, M., Hötzer, J., Jainta, M., Steinmetz, P., Berghoff, M., Schornbaum, F., ... & Rüde, U. (2015). Massively parallel phase-field simulations for ternary eutectic directional solidification. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (p. 8). ACM.
Hötzer, J., Jainta, M., Steinmetz, P., Nestler, B., Dennstedt, A., Genau, A., ... & Rüde, U. (2015). Large scale phase-field simulations of directional ternary eutectic solidification. Acta Materialia, 93, 194-204.

Figure 9: Weak scaling on SuperMUC (left), Hornet (middle) and JUQUEEN (right).
Figure 10: Three-dimensional simulation and experimental results of directional solidification of the ternary eutectic system Ag-Al-Cu: (a) simulation result, (b) tomography reconstruction of an experiment by A. Dennstedt. (Annotated features: rings, connections, chains.)
Figure 11: Extracted lamellae from the simulation depicted in Figure 10: (a) phase Al2Cu, (b) phase Ag2Al. The lamellae grew from left to right.

Phase Field Computations

Grand potential approach:
3 phase fields
chemical potential
temperature equation
explicit integration (!?)
finite differences, structured grid
Parallel software developed in the group of B. Nestler (KIT).
Extremely fine resolutions (in space and time) are necessary to observe the pattern formation.
Re-implementation in waLBerla; performance engineering leads to a speedup of 80x.
Spiraling structures observed.

Phase field model for ternary eutectic solidification

Figure 2. Setting to simulate the ternary eutectic directional solidification based on [38]. The melt ℓ, consisting of three components, solidifies into the three phases α, β and γ. The moving window technique on the block-structured grid is highlighted in dashed blue; below, in red, the moving analytic temperature gradient is shown (periodic boundary conditions laterally, a Neumann boundary condition on one end and a Dirichlet boundary condition on the other).

The evolution equations for the phase fields φ_α, the chemical potentials µ, and the temperature gradient ∂T/∂t are given as:

$$\tau\epsilon \frac{\partial \phi_\alpha}{\partial t} = \underbrace{-\epsilon T \left( \frac{\partial a(\phi,\nabla\phi)}{\partial \phi_\alpha} - \nabla\cdot\frac{\partial a(\phi,\nabla\phi)}{\partial \nabla\phi_\alpha} \right)}_{=:\,r_\alpha} \; \underbrace{-\,\frac{T}{\epsilon}\frac{\partial \omega(\phi)}{\partial \phi_\alpha} - \frac{\partial \psi(\phi,\mu,T)}{\partial \phi_\alpha}}_{=:\,\Lambda_\alpha} \; - \frac{1}{N}\sum_{\beta=1}^{N}\left(r_\beta + \Lambda_\beta\right), \tag{4}$$

$$\frac{\partial \mu}{\partial t} = \left[ \sum_{\alpha=1}^{N} h_\alpha(\phi)\,\frac{\partial c_\alpha(\mu,T)}{\partial \mu} \right]^{-1} \left[ \nabla\cdot\bigl( M(\phi,\mu,T)\,\nabla\mu - J_{at}(\phi,\mu,T) \bigr) - \sum_{\alpha=1}^{N} c_\alpha(\mu,T)\,\frac{\partial h_\alpha(\phi)}{\partial t} - \sum_{\alpha=1}^{N} h_\alpha(\phi)\,\frac{\partial c_\alpha(\mu,T)}{\partial T}\,\frac{\partial T}{\partial t} \right], \tag{5}$$

$$\frac{\partial T}{\partial t} = \frac{\partial}{\partial t}\bigl( T_0 + G\,(z - v t) \bigr) = -G v. \tag{6}$$

The evolution equation for the chemical potentials (5) alone results in 1,384 floating point operations per cell and 680 bytes that need to be transferred from main memory [31]. Further details of the phase-field model are presented in [8, 39]. The discretizations in space with a finite difference scheme and in time with an explicit Euler scheme are specified in [40]. To efficiently solve the evolution equations on current HPC systems, the model is optimized and parallelized on different levels as proposed in [31]. Besides explicit vectorization of the sweeps and parallelization with MPI, a moving window approach is implemented on top of the block-structured grid data structures of waLBerla, as depicted in Figure 2. This allows the total simulation domain to be reduced to just a region around the solidification front. Typical simulations in representative volume elements require between 10,000 and 20,000 compute cores for multiple days [8-10]. A typical phase-field simulation of the directional solidification of the ternary eutectic system Al-Ag-Cu as described in [8] is shown in Fig. 3. The 12,000 x 12,000 x 65,142 voxel cell domain was calculated on the SuperMUC system with 19,200 cores for approximately 19 hours. Three distinct solid phases of different composition grow into the undercooled melt and form characteristic microstructure patterns. On the left and right, rods of the two phases Al2Ag and Ag2Al are extracted to show the evolution of the single solid phases within the complex microstructure. The rods split, merge and overgrow during the simulation, as described in [8, 41].

Figure 3. Phase-field simulation of the directional solidification of the ternary eutectic system Al-Ag-Cu in a 12,000 x 12,000 x 65,142 voxel cell domain, calculated with 19,200 cores on the SuperMUC system. A detailed discussion is presented in [8, 41].

Based on this highly parallel and optimized solver, the eutectic solidification of idealized systems [42-44] and real ternary alloys like Al-Ag-Cu [8-10, 41, 45, 46] and Ni-Al-Cr [47] were investigated in large-scale domains. Thus, the experimentally assumed growth of spirals could be proved [42]. In [46] the influence of different melt compositions on the evolving patterns could be shown.

7 Benchmarks

In this section, we present benchmarks to illustrate the performance of our implementation. First, we introduce the application parameters. Then we cover different aspects of the presented checkpointing scheme and explain and present the respective results.

7.1 The test cases

To benchmark the checkpointing scheme presented in Section 5.2, we simulate the directional solidification of a ternary eutectic system using the implementation presented in [31] of the phase-field model introduced in Section 6. For the simulation parameters, the values of [42] to study spiral growth are used. To resolve spiral growth, large domain sizes and millions of iterations are required, resulting in massively ...
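With the 1,384 flops and 680 bytes per cell quoted above, one can estimate the arithmetic intensity of the chemical-potential sweep and a memory-bandwidth-limited performance bound; a small sketch (Python) with illustrative, assumed per-core bandwidth and peak values:

```python
# Roofline-style estimate for the chemical potential update (Eq. 5).
flops_per_cell = 1384
bytes_per_cell = 680
intensity = flops_per_cell / bytes_per_cell          # ~2.0 flop/byte

bandwidth_per_core = 2e9      # bytes/s, illustrative assumption for one core's share
peak_per_core = 20e9          # flop/s, illustrative assumption

bw_limited = intensity * bandwidth_per_core
attainable = min(peak_per_core, bw_limited)
cells_per_second = attainable / flops_per_cell

print(f"arithmetic intensity: {intensity:.2f} flop/byte")
print(f"attainable: {attainable/1e9:.1f} GFlop/s -> {cells_per_second/1e6:.1f} MLUP/s per core")
```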


(References cited in the excerpt above, from Kohl, Hötzer, Schornbaum, Bauer, Godenschwager, Köstler, Nestler, Rüde:)

30. Huber M, Gmeiner B, Rüde U et al. Resilience for massively parallel multigrid solvers. SIAM Journal on Scientific Computing 2016; 38(5). doi:10.1137/15M1026122.
31. Bauer M, Hötzer J, Jainta M et al. Massively parallel phase-field simulations for ternary eutectic directional solidification. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. SC '15, New York, NY, USA: ACM. ISBN 978-1-4503-3723-6, pp. 8:1-8:12. doi:10.1145/2807591.2807662.
32. Kurz W, Sahm PR. Gerichtet erstarrte eutektische Werkstoffe: Herstellung, Eigenschaften und Anwendungen von In-situ-Verbundwerkstoffen. Springer, 1975. ISBN 978-3-642-65994-2.
33. Fisher K and Kurz W. Fundamentals of Solidification. Trans Tech Publications 1986; doi:10.1002/crat.2170210909.
34. Hötzer J, Kellner M, Steinmetz P et al. Applications of the phase-field method for the solidification of microstructures in multi-component systems. Journal of the Indian Institute of Science 2016.
35. Dennstedt A and Ratke L. Microstructures of directionally solidified Al-Ag-Cu ternary eutectics. Transactions of the Indian Institute of Metals 2012; 65(6): 777-782. doi:10.1007/s12666-012-0172-3.
36. Lewis D, Allen S, Notis M et al. Determination of the eutectic structure in the Ag-Cu-Sn system. Journal of Electronic Materials 2002; 31(2): 161-167. doi:10.1007/s11664-002-0163-y.
37. Ruggiero MA and Rutter JW. Origin of microstructure in the 332 K eutectic of the Bi-In-Sn system. Materials Science and Technology 1997; 13(1): 5-11. doi:10.1179/mst.1997.13.1.5.
38. Hötzer J. Massiv-parallele und großskalige Phasenfeldsimulationen zur Untersuchung der Mikrostrukturentwicklung, 2017. doi:10.5445/IR/1000069984.
39. Choudhury A and Nestler B. Grand-potential formulation for multicomponent phase transformations combined with thin-interface asymptotics of the double-obstacle potential. Physical Review E 2012; 85(2). doi:10.1103/physreve.85.021602.
40. Hötzer J, Tschukin O, Said M et al. Calibration of a multi-phase field model with quantitative angle measurement. Journal of Materials Science 2016; 51(4): 1788-1797. doi:10.1007/s10853-015-9542-7.
41. Hötzer J, Steinmetz P, Dennstedt A et al. Influence of growth velocity variations on the pattern formation during the directional solidification of ternary eutectic Al-Ag-Cu. Acta Materialia 2017; doi:10.1016/j.actamat.2017.07.007.
42. Hötzer J, Steinmetz P, Jainta M et al. Phase-field simulations of spiral growth during directional ternary eutectic solidification. Acta Materialia 2016; 106: 249-259. doi:10.1016/j.actamat.2015.12.052.
43. Steinmetz P, Hötzer J, Kellner M et al. Large-scale phase-field simulations of ternary eutectic microstructure evolution. Computational Materials Science 2016; 117: 205-214. doi:10.1016/j.commatsci.2016.02.001.
44. Steinmetz P, Kellner M, Hötzer J et al. Quantitative comparison of phase-field simulations with a ternary eutectic three-dimensional Jackson-Hunt approach. Computational Materials Science (submitted) 2016.
45. Hötzer J, Jainta M, Steinmetz P et al. Die Vielfalt der Musterbildung in Metallen. horizonte 2015; 45.
46. Steinmetz P, Kellner M, Hötzer J et al. Phase-field study of the pattern formation in Al-Ag-Cu under the influence of the melt concentration. Computational Materials Science 2016; 121: 6-13. doi:10.1016/j.commatsci.2016.04.025.
47. Kellner M, Sprenger I, Steinmetz P et al. Phase-field simulation of the microstructure evolution in the eutectic NiAl-34Cr system. Computational Materials Science 2017; 128: 379-387. doi:10.1016/j.commatsci.2016.11.049.
48. Padua D. Encyclopedia of Parallel Computing. Springer Science & Business Media, 2011. ISBN 978-0-387-09844-9.
49. Bland W, Lu H, Seo S et al. Lessons learned implementing user-level failure mitigation in MPICH. In 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing. pp. 1123-1126. doi:10.1109/ccgrid.2015.51.

Microstructure Evolution

Phase-field simulation of the directional solidification of the ternary eutectic system Al-Ag-Cu in a 12,000 x 12,000 x 65,142 voxel cell domain, calculated with 19,200 cores on the SuperMUC system.


Ternary eutectic directional solidification: the melt ℓ, consisting of three components, solidifies into the three phases α, β and γ. Moving window technique, block-structured grid, moving analytic temperature gradient.

(Chart: weak scaling, MLUP/s per core (0 to 3.5) over 2^0 to 2^14 cores, for the interface, liquid, and solid kernels; cf. Figure 9 above.)

Conclusions

The Two (or rather Three) Principles of Science

Theory: mathematical models, differential equations (Newton).
Experiments: observation and prototypes (the empirical sciences).
Computational Science: simulation, optimization, (quantitative) virtual reality.

Computational methods open the path to Predictive Science.

Computational Science is done in Teams

Dr.-Ing. Dominik Bartuschat
Martin Bauer, M.Sc. (hons)
Dr. Regina Degenhardt
Sebastian Eibl, M.Sc.
Dipl.-Inf. Christian Godenschwager
Marco Heisig, M.Sc. (hons)
PD Dr.-Ing. Harald Köstler
Nils Kohl, M.Sc.
Sebastian Kuckuk, M.Sc.
Christoph Rettinger, M.Sc. (hons)
Jonas Schmitt, M.Sc.
Dipl.-Inf. Florian Schornbaum
Dominik Schuster, M.Sc.
Dominik Thönnes, M.Sc.

Dr.-Ing. Benjamin Bergen
Dr.-Ing. Simon Bogner
Dr.-Ing. Stefan Donath
Dr.-Ing. Jan Eitzinger
Dr.-Ing. Uwe Fabricius
Dr. rer. nat. Ehsan Fattahi
Dr.-Ing. Christian Feichtinger
Dr.-Ing. Björn Gmeiner
Dr.-Ing. Jan Götz
Dr.-Ing. Tobias Gradl
Dr.-Ing. Klaus Iglberger
Dr.-Ing. Markus Kowarschik
Dr.-Ing. Christian Kuschel
Dr.-Ing. Marcus Mohr
Dr.-Ing. Kristina Pickl
Dr.-Ing. Tobias Preclik
Dr.-Ing. Thomas Pohl
Dr.-Ing. Daniel Ritter
Dr.-Ing. Markus Stürmer
Dr.-Ing. Nils Thürey

Thank you for your attention!

Bogner, S., & Rüde, U. (2013). Simulation of floating bodies with the lattice Boltzmann method. Computers & Mathematics with Applications, 65(6), 901-913.
Anderl, D., Bogner, S., Rauh, C., Rüde, U., & Delgado, A. (2014). Free surface lattice Boltzmann with enhanced bubble model. Computers & Mathematics with Applications, 67(2), 331-339.
Bogner, S., Harting, J., & Rüde, U. (2017). Simulation of liquid-gas-solid flow with a free surface lattice Boltzmann method. Submitted.