
# 10M-Core Scalable Fully-Implicit Solver for Nonhydrostatic Atmospheric Dynamics

## Abstract and Figures

An ultra-scalable fully-implicit solver is developed for stiff time-dependent problems arising from the hyperbolic conservation laws in nonhydrostatic atmospheric dynamics. In the solver, we propose a highly efficient hybrid domain-decomposed multigrid preconditioner that can greatly accelerate the convergence rate at the extreme scale. For solving the overlapped subdomain problems, a geometry-based pipelined incomplete LU factorization method is designed to further exploit the on-chip fine-grained concurrency. We perform systematic optimizations on different hardware levels to achieve best utilization of the heterogeneous computing units and substantial reduction of data movement cost. The fully-implicit solver successfully scales to the entire system of the Sunway TaihuLight supercomputer with over 10.5M heterogeneous cores, sustaining an aggregate performance of 7.95 PFLOPS in double-precision, and enables fast and accurate atmospheric simulations at the 488-m horizontal resolution (over 770 billion unknowns) with 0.07 simulated-years-per-day. This is, to our knowledge, the largest fully-implicit simulation to date.
Chao Yang∗§, Wei Xue†‡∗∗, Haohuan Fu∗∗, Hongtao You, Xinliang Wang†‡, Yulong Ao∗§,
Fangfang Liu∗§, Lin Gan‡∗∗, Ping Xu†‡, Lanning Wang‖, Guangwen Yang‡∗∗, Weimin Zheng
∗Institute of Software and State Key Laboratory of Computer Science, Chinese Academy of Sciences, China
†Department of Computer Science and Technology, Tsinghua University, China
‡MOE Key Lab for Earth System Modeling, and Center for Earth System Science, Tsinghua University, China
§University of Chinese Academy of Sciences, China
¶National Research Center of Parallel Computer Engineering and Technology, China
‖College of Global Change and Earth System Science, Beijing Normal University, China
∗∗National Supercomputing Center in Wuxi, China
Index Terms—atmospheric modeling; fully implicit
solver; Sunway TaihuLight supercomputer; heterogeneous
many-core architecture.
I. JUSTIFICATION FOR ACM GORDON BELL PRIZE
An important attempt is made to design an ultra-
scalable fully-implicit solver for nonhydrostatic atmo-
spheric simulations. With both algorithmic and optimiza-
tion innovations, the solver scales to 10.5-million het-
erogeneous cores on Sunway TaihuLight at an unprece-
dented 488-m resolution with 770-billion unknowns, sus-
taining 7.95 PFLOPS performance in double-precision
with 0.07 simulated-years-per-day (SYPD).
| Performance Attributes | Content |
|---|---|
| Category of achievement | Time-to-solution |
| Type of method used | Fully implicit |
| Results reported on basis of | Whole application including I/O |
| Precision reported | Double precision |
| System scale | Measured on full-scale system |
| Measurement mechanism | Timers |
II. SIMULATION OF ATMOSPHERIC DYNAMICS
Every year, extreme weather and climate events may cause
economic losses of hundreds of billions of dollars [1] and
sometimes bring catastrophic damage to human living
conditions [2]. Ever since the ENIAC
system in the 1950s, generations of scientists have been
continuously working on improving the simulation and
prediction capability of atmosphere models by develop-
ing innovative numerical algorithms on state-of-the-art
computing platforms [2]. After six decades, advances in our knowledge of
the climate system, the computing methods, and the
computing capabilities have ﬁnally pushed us to the edge
of seamless weather-climate simulations/predictions at
the resolution of the km-level and beyond.
On the road to the seamless weather-climate predic-
tion, a major obstacle is the difﬁculty of dealing with
various spatial and temporal scales [3]. The atmosphere
contains time-dependent multi-scale dynamics that sup-
port a variety of wave motions. For example, the seasonal
Asian summer monsoon usually comes at the planetary
length scale of the earth, on the order of $10^3$-$10^4$
km, but thunderstorms and tornadoes often develop in
minutes with a horizontal scale ranging from 10 km down to
less than a few hundred meters. It is therefore important for
atmosphere models to deliver accurate simulation results
at the ultra-high horizontal resolutions of kilometer or
even hundreds of meters, along which great efforts in
both the numerical algorithms and the high-performance
The fastest traveling waves of the atmosphere, such as
the acoustic and inertia-gravity waves, are usually not of
interest to the scientists. They often impose restrictive
time step constraints for explicit time-stepping meth-
ods; and these restrictions are the major limiting factor
of explicit methods in ultra-high-resolution atmospheric
modeling. By using a simpliﬁed equation set such as
the hydrostatic or anelastic (Boussinesq or sound-proof)
equations, the fast waves are ﬁltered out, but these
simpliﬁcations are usually not accurate when the grid
resolution tends to the km-level [5], [6]. Another way to
stabilize fast waves is to make use of semi-implicit [7],
[8] or split-explicit methods [9], [10], which relax the de-
pendency between the time step length and the horizontal
grid resolution. However, the relaxed dependency can
still become a bottleneck of the time-to-solution perfor-
mance at the extreme scale [11]. Fully implicit methods,
on the other hand, are free of the stability limitation, and
are therefore potentially desirable. The price of using a
fully implicit method is that one needs to solve one or a
few large linear/nonlinear equation systems at each time
step, which requires innovative design to achieve high
efﬁciency on state-of-the-art supercomputing platforms.
With many-core accelerators becoming the major
provider of computing power in various supercomputers,
we see a huge demand to migrate the weather/climate
models to heterogeneous supercomputers. One challenge
is to make efficient use of the increasingly popular
many-core accelerators or processors, which can be
extremely difﬁcult for implicit solvers on heterogeneous
supercomputers. The Sunway TaihuLight supercomputer
[12], released in 2016, pushes the parallelism to the level
of over ten million cores, which poses another great
scalability challenge to the current numerical algorithms
In this work, we present a highly scalable fully im-
plicit solver for three-dimensional nonhydrostatic atmo-
spheric simulations governed by the fully compressible
Euler equations. Unlike simpliﬁed equations with the hy-
drostatic or anelastic assumptions, the fully compressible
Euler equations are accurate to the mesoscale with al-
most no assumption made [3]. In particular, we consider
atmospheric ﬂows in a regional domain above a rotating
sphere with possibly nonsmooth bottom topography [13]:
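$$\frac{\partial Q}{\partial t} + \frac{\partial F}{\partial x} + \frac{\partial G}{\partial y} + \frac{\partial H}{\partial z} + S = 0, \tag{1}$$

$$\begin{aligned}
Q &= (\rho', \rho u, \rho v, \rho w, (\rho e_T)', (\rho q)')^T, \\
F &= (\rho u, \rho u u + p', \rho u v, \rho u w, (\rho e_T + p)u, \rho u q)^T, \\
G &= (\rho v, \rho v u, \rho v v + p', \rho v w, (\rho e_T + p)v, \rho v q)^T, \\
H &= (\rho w, \rho w u, \rho w v, \rho w w + p', (\rho e_T + p)w, \rho w q)^T, \\
S &= (0, \partial \bar{p}/\partial x - f \rho v, \partial \bar{p}/\partial y + f \rho u, \rho' g, 0, 0)^T,
\end{aligned}$$

where $\rho$, $\mathbf{v} = (u, v, w)$, $p$, $e_T$ and $q$ are the density,
velocity, pressure, total energy and moisture of
the atmosphere, respectively. The Coriolis parameter
is provided in $f$ and all other variables such as $g$,
$\gamma$ are given constants. The values of $\rho' = \rho - \bar{\rho}$,
$(\rho e_T)' = \rho e_T - \bar{\rho}\bar{e}_T$ and $p' = p - \bar{p}$ have been
shifted according to a hydrostatic state that satisfies
$\partial \bar{p}/\partial z = -\bar{\rho} g$. The system is closed with the equation
of state $p = (\gamma - 1)\rho(e_T - gz - \|\mathbf{v}\|^2/2)$. Note that we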
use the total energy instead of pressure- or temperature-based values as a prognostic
variable to fully recover the energy conservation law and
avoid the repeated calculation of powers.
The fully compressible Euler equations (1) are dis-
cretized with a conservative cell-centered ﬁnite volume
scheme of second-order accuracy on a height-based
terrain-following 3-D grid. A fully implicit second-order
Rosenbrock method is employed for time integration,
which supports adaptive time-stepping (turned off in this
work to simplify the discussion on the performance).
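To make the idea of a linearly implicit (Rosenbrock) time integrator concrete, the following minimal sketch applies one well-known two-stage, second-order Rosenbrock scheme (often called ROS2) to a small stiff ODE system. The paper does not state which Rosenbrock coefficients it uses, so the choice of scheme, the function names `ros2_step` and `integrate`, and the test problem are assumptions made purely for illustration. In a large-scale solver the dense `np.linalg.solve` call would be replaced by a preconditioned iterative linear solver.

```python
import numpy as np

# ROS2 parameter; this particular scheme is an assumed, illustrative choice.
GAMMA = 1.0 + 1.0 / np.sqrt(2.0)

def ros2_step(f, jac, y, h):
    """One step of a two-stage, second-order Rosenbrock method for y' = f(y)."""
    n = y.size
    # Both stages share the same linear operator I - gamma*h*J(y_n),
    # so only linear systems (no nonlinear iterations) are solved per step.
    lhs = np.eye(n) - GAMMA * h * jac(y)
    k1 = np.linalg.solve(lhs, f(y))
    k2 = np.linalg.solve(lhs, f(y + h * k1) - 2.0 * k1)
    return y + 1.5 * h * k1 + 0.5 * h * k2

def integrate(f, jac, y0, h, n_steps):
    """Advance y0 by n_steps fixed-size steps (no adaptive time-stepping)."""
    y = np.asarray(y0, dtype=float)
    for _ in range(n_steps):
        y = ros2_step(f, jac, y, h)
    return y

# Hypothetical stiff linear test problem: one fast and one slow decaying mode.
A = np.array([[-1000.0, 0.0],
              [0.0, -1.0]])
f = lambda y: A @ y
jac = lambda y: A

y_end = integrate(f, jac, [1.0, 1.0], h=0.01, n_steps=100)
print(y_end)  # fast mode damped to ~0, slow mode close to exp(-1)
```

With the same step size, an explicit Euler step would amplify the fast mode by |1 + hλ| = 9 per step, which is the kind of stability restriction the implicit treatment avoids.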
III. STATE OF THE ART
Due to the long development history, existing weather
and climate models are mainly designed for CPU-based
platforms. Related HPC efforts are mainly focused on
improving the scalability and efﬁciency to support in-
creasingly higher resolutions. For example, thanks to
the huge performance boosts delivered by the Earth
Simulator and the K computer, Japanese groups have
done a series of pioneering works, such as the 3.5-
km and 7-km global simulations on the Earth Simulator
[14] that successfully captured the lifecycles of two real
tropical cyclones [15], and the 870-m global resolution
simulation on K computer with 230 TFLOPS double-
precision performance for 68 billion grid cells. In the US,
the CAM-SE dynamic core of NCAR supports up to
12.5-km resolution, and can provide a simulation speed
of around 4.6 SYPD when using 172,800 CPU cores
on Jaguar [16]. The Weather Research and Forecasting
(WRF) model has been employed to simulate the landfall
of Hurricane Sandy, providing a single-precision perfor-
mance of 285 TFLOPS on 437,760 cores of Blue Waters
[17]. In the recent initiative to build the Next Generation
Global Prediction System (NGGPS) in the US [18], we
see a number of candidates that can already support
seamless weather-climate simulation at the scale of a few
kilometers. Examples include the Model for Prediction
Across Scales (MPAS) and the Finite Volume Model
version 3 (FV3), which scales to 110,592 CPU cores
on the Edison system with a simulation speed of around
0.16 SYPD for the 3-km resolution in double precision.
Due to the heavy legacy and the distributed compu-
tation pattern of atmospheric models, it involves both
design challenges as well as huge programming ef-
forts to port weather/climate models onto many-core
accelerators. Early studies often focused on the many-
core acceleration of standalone physics parameterization
schemes ([19], [20]). In recent years, more efforts were
made to migrate the dynamic cores or even complete
atmospheric models to accelerator-based platforms, also
pioneered by Japanese researchers. For example, on the
TSUBAME 1.2 and 2.0 systems, T. Shimokawabe et
al. conducted successful multi-node GPU-based accel-
eration of the ASUCA nonhydrostatic model [21], with
a single-precision performance of 145 TFLOPS. More
recently, a GPU-based acceleration of the NICAM model
[22] on TSUBAME 2.5 sustained a double-precision
performance of 60 TFLOPS using 2,560 GPUs. In China,
our group has enabled both CPU-GPU and CPU-MIC
accelerations of an explicit time-stepping global shallow
water model on Tianhe-1A and Tianhe-2, both scaling
to half-system levels with sustained double-precision
performance of 800 TFLOPS [23] and 1.63 PFLOPS
[24], respectively. Further, the work was extended to
the 3-D nonhydrostatic case on Tianhe-2, scaling also
to the half-system scale with over 8% FLOPS efficiency
in double precision [25]. The previous work mentioned
above, though mostly focused on explicit methods, may
serve as guidance for us in developing highly efficient
implicit solvers.
Many complex problems based on partial differential equations (PDEs)
often require implicit solvers that allow
large time-step sizes but require solving nonlinear
equations. For homogeneous supercomputers, a recent
work is by Rudi et al. [26], in which a fully implicit
solver based on an innovative AMG method scaled to
1.57 million homogeneous cores on the IBM Sequoia
supercomputer with 96% and 33% parallel efﬁciency
in terms of weak and strong scalability, respectively,
sustaining a FLOPS efficiency of around 3.41% in
double precision. Some previous efforts on designing
highly efﬁcient implicit solvers include [27], [28], [29],
all on homogeneous CPU-based systems.
Due to their intrinsic “divide-and-conquer” nature, domain
decomposition methods (DDMs) have been recognized
as good iterative solvers or preconditioners for solving
large-scale linear or nonlinear equation systems
resulting from the discretization of PDE-based problems
on massively parallel cluster systems. In the past three
decades, substantial progress has been made in
the theoretical analysis and the application techniques
of DDMs for different types of PDEs. For elliptic PDEs,
classical DDMs such as the additive and the multiplica-
tive Schwarz methods have optimal convergence rate in
terms of both strong and weak scalability, as long as
certain coarse-level corrections are added [30]. But sim-
ilar theoretical analysis does not apply to time-dependent
hyperbolic PDEs such as the fully compressible Euler
equations arising from multi-physics conservation laws.
It was observed in our previous work [31], [32] that, for
time-dependent hyperbolic PDEs, coarse-level corrected
DDMs are also a promising approach. We remark that
multigrid based approaches, such as the AMG work by
[26], are also potentially valuable to apply. But we prefer
to keep a uniform data partition strategy on all mesh
levels to achieve balanced load across different parallel
computing units, which is easier to achieve when DDMs
are used as the basic design on each level. In particular,
we combine the DDMs within a multigrid cycle and
propose a low-cost DD-MG method for preconditioning
the solution of the discretized Euler equations.
A homogeneous domain partition strategy is usually
preferred in traditional DDMs for the consideration of
load balance. But this is no longer suitable for het-
erogeneous architectures, which provide another level
of parallelism inside each compute node. This means
that the subdomain solver of a DDM should be able
to exploit the on-chip many-core resources and pro-
vide robust approximations to the subdomain solution.
Unfortunately, classical subdomain solvers such as the
incomplete LU factorizations are difﬁcult to parallelize
due to the sequential nature and the irregular behavior
of the method. Considering a general many-core pro-
cessor, the newly proposed PILU method [33], [34] is a
promising approach. But it usually requires a few sweeps
to achieve a convergence rate similar to that of the sequential
ILU, because the asynchrony introduced in
the method breaks the data dependency. Therefore the
parallel speedup of the PILU method is sub-optimal. By
taking advantage of the architecture of Sunway TaihuLight, we
design a highly parallel ILU method that provides high
speedup without sacrificing the convergence rate. Based
on the newly designed ILU method and the DD-MG algorithm,
our proposed fully implicit solver can efficiently
scale to the full-system scale on Sunway TaihuLight for
solving nonhydrostatic atmospheric problems at ultra-
high resolutions.
IV. THE SUNWAY TAIHULIGHT SUPERCOMPUTER
A. System Overview
Released in June 2016, the Sunway TaihuLight super-
computer [12] claims the top place in the latest TOP500
list with a peak performance of 125 PFLOPS and a
sustained Linpack performance of 93 PFLOPS. There
are 40,960 compute nodes in the system, spanning across
40 cabinets, with each cabinet containing 4 supernodes.
Each supernode includes 256 SW26010 processors that
are fully connected by a customized network switch
board, and 8 TB DDR3 memory. The network topology
across supernodes is a two-level fat-tree. The global ﬁle
system manages both SSD and HDD storage with
an aggregate bandwidth of over 250 GB/s and a
capacity exceeding 10 PB. An I/O forwarding architecture
is integrated to handle the stability issue of the Lustre
ﬁle system due to massive connections between clients
and I/O servers.
The software environment of the system includes a
customized 64-bit Linux OS kernel and a customized
compiler supporting C/C++, Fortran and mainstream parallel
programming interfaces such as MPI, OpenMP and
OpenACC. The message passing library on the Sunway
TaihuLight supports the MPI 3.0 specification and has been
tuned for massively parallel runs. A high-performance
provided to exploit ﬁne-grained parallelism within the
socket.
B. The SW26010 Many-core Processor
The SW26010 processor works at the frequency of
1.45 GHz with an aggregated peak performance of 3.06
TFLOPS in double precision and an aggregated memory
bandwidth of 130 GB/s. The general architecture of the
processor is shown in Fig. 1. Each SW26010 processor
comes with 4 core groups (CGs), with each including
one management processing element (MPE) and one
computing processing element (CPE) cluster of 64 CPEs,
in total 260 cores in each processor. The MPE and CPE
are both complete 64-bit RISC cores but serve different
roles during the computation. The MPE, supporting
the complete interrupt functions, memory management,
Fig. 1. The general architecture of SW26010 processor [12]. Each
CG includes one MPE, one CPE cluster with 8×8 CPEs, and one
memory controller (MC). These 4 CGs are connected via the network
on chip (NoC). Each CG has shared memory space, connected to
the MPE and the CPE cluster through the MC. All processors are
connected with each other through a system interface (SI).
superscalar, and out-of-order issue/execution, is good
at handling the management, task schedule, and data
communications. In terms of the memory hierarchy, each
MPE has a 32 KB L1 data cache, and a 256 KB
L2 cache for both instruction and data. The CPE is
designed for the purpose of maximizing the aggregated
computing throughput while minimizing the complexity
of the micro-architecture. The CPE cluster is organized
as an 8×8 mesh, with a mesh network to achieve low-
latency register data communication (P2P and collective
communications) among the CPEs in one CG. Unlike
the MPE, the CPE does not support interrupt functions.
Each CPE has its own 16 KB L1 instruction cache
and a 64 KB Scratch Pad Memory (SPM) that can
be conﬁgured as either a Local Data Memory (LDM)
that serves as user-controlled buffer (for performance-
oriented programming) or a software-emulated cache for
automatic data buffering (for more convenient porting
of the program). Through the memory controller, Direct
Memory Access (DMA) is supported for data transfer
between the SPM and the main memory, and normal
load/store instructions are also available to move data
between registers and the main memory.
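The system-level figures quoted in this section can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in the text; the one assumption is that each compute node hosts exactly one SW26010 processor, which is what the cabinet, supernode and node counts imply. The variable names are illustrative.

```python
# Figures quoted in the text.
cabinets = 40
supernodes_per_cabinet = 4
processors_per_supernode = 256
cores_per_processor = 4 * (1 + 64)   # 4 CGs, each with 1 MPE + 64 CPEs
peak_per_processor_tflops = 3.06     # double precision
mem_bandwidth_gb_s = 130.0           # per processor

# Assumption: one SW26010 processor per compute node.
nodes = cabinets * supernodes_per_cabinet * processors_per_supernode
total_cores = nodes * cores_per_processor
peak_pflops = nodes * peak_per_processor_tflops / 1000.0

# Machine balance: bytes of memory traffic available per peak flop.
bytes_per_flop = (mem_bandwidth_gb_s * 1e9) / (peak_per_processor_tflops * 1e12)

print(nodes, total_cores, round(peak_pflops, 1), round(bytes_per_flop, 3))
```

The low bytes-per-flop ratio is one way to see why the optimizations described later concentrate on data reuse in the on-chip memory.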
V. MAJOR INNOVATIVE CONTRIBUTIONS
A. Summary of Contributions
Our major contribution is a highly scalable fully
implicit solver for the nonhydrostatic atmospheric dy-
namics governed by hyperbolic conservation laws, which
enables fast and accurate atmospheric simulations at
ultra-high resolutions. Our solver is developed based
on a hybrid domain-decomposed multigrid (DD-MG)
preconditioner to achieve robust convergence rate on
distributed parallel computers at the extreme scale, and
a geometry-based pipelined incomplete LU factorization
(GP-ILU) method to efﬁciently solve the overlapping
subdomain problems by fully exploiting the on-chip
many-core parallelism.
We have implemented the fully implicit solver in an
experimental atmospheric dynamic core and deployed
it on the Sunway TaihuLight supercomputer. The fully
implicit solver scales well to the entire system with
over 10.5 million heterogeneous cores in both strong
and weak scaling cases. In particular, our implicit solver
is free of the time step constraint, and can provide a
simulation speed of around 1.0 SYPD at the horizontal
resolution of 3 km, which is substantially faster than our
explicit counterpart developed from our previous work
on Tianhe-2. The fully implicit solver is able to conduct
simulations at the unprecedented 488-m resolution (total
DOFs: over 770 billion) with 0.07 SYPD, sustaining
an aggregate double-precision performance of nearly 8
PFLOPS with over 50% parallel efficiency. This is, to
the best of our knowledge, the largest fully implicit
simulation in terms of total DOFs, total number of cores
and aggregate performance, to date.
B. Algorithm
1) The DD-MG preconditioner: For the fully com-
pressible Euler equations, the linear Jacobian system
is especially difﬁcult to solve due to the hyperbolic
and stiff nature of the problem. We propose a hybrid
preconditioner DD-MG that combines both geometric
multigrid and algebraic domain decomposition methods
to accelerate the convergence of the linear solver. In
the DD-MG method, the MG component is defined as
$$M^{-1} = M_f^{-1} + M_c^{-1} - M_f^{-1} A_f M_c^{-1},$$
where $M_f^{-1}$ is the one-level DD preconditioner, $M_c^{-1}$ is the projected
coarse-level correction that can be defined recursively,
and $A_f$ is the matrix-free Jacobian. In particular, we
use the cascade κ-cycle MG with low-order pre- and
post-smoothers and the left restricted additive Schwarz
(RAS) [35] DD component built based on a low-order
ﬁnite volume scheme in the DD-MG preconditioner, as
illustrated in Fig. 2.
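The two-level combination of a fine-level preconditioner and a coarse-level correction, M^{-1} = M_f^{-1} + M_c^{-1} - M_f^{-1} A_f M_c^{-1}, never needs to be formed as a matrix: applying it to a vector b amounts to a coarse correction followed by a fine-level sweep on the remaining residual. The sketch below illustrates this with small dense stand-ins. The operators `Mf_inv` and `Mc_inv`, the test matrix, and the function name `apply_two_level` are hypothetical placeholders, not the RAS and multigrid components of the actual solver.

```python
import numpy as np

def apply_two_level(A_f, apply_Mf_inv, apply_Mc_inv, b):
    """Apply M^{-1} b = M_c^{-1} b + M_f^{-1} (b - A_f M_c^{-1} b)."""
    x_c = apply_Mc_inv(b)           # coarse-level correction
    r = b - A_f @ x_c               # residual left after the coarse correction
    return x_c + apply_Mf_inv(r)    # fine-level sweep applied to that residual

rng = np.random.default_rng(0)
n = 8
A_f = 4.0 * np.eye(n) + 0.1 * rng.standard_normal((n, n))  # hypothetical Jacobian
Mf_inv = np.diag(1.0 / np.diag(A_f))   # stand-in for the one-level DD preconditioner
Mc_inv = 0.2 * np.eye(n)               # stand-in for the projected coarse correction
b = rng.standard_normal(n)

x = apply_two_level(A_f, lambda v: Mf_inv @ v, lambda v: Mc_inv @ v, b)

# The explicit operator is built here only to check the algebra.
M_inv = Mf_inv + Mc_inv - Mf_inv @ A_f @ Mc_inv
print(np.allclose(x, M_inv @ b))
```

Because only the action of A_f on a vector is required, the same structure works when the Jacobian is available in matrix-free form.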
2) The GP-ILU factorization: On a given overlapping
subdomain, we construct an approximated Jacobian ma-
trix based on a low-order 7-point spatial discretization
and order the unknowns without breaking the coupling
of all physical components on each mesh cell. The
subdomain matrix then carries the mesh information that
Fig. 2. The DD-MG preconditioner of three levels, which is a hybrid
composition of the algebraic DD and a geometric κ-cycle MG. In
particular, on each MG level, we use the one-level RAS method for
the DD preconditioning to exploit the same degree of parallelism on
the process level.
can be used further in the subdomain solver, which
has been found helpful [29] to improve not only the
convergence but also the parallel performance. In the
DD-MG framework, the subdomain problem can be solved
inexactly by an incomplete factorization method. However,
classical ILU-based methods are difficult to
parallelize due to the sequential nature and the irregular
behavior of the method. Considering a general many-
core processor, the newly proposed PILU method [33],
[34] is a promising approach. Based on it, we can design
a geometric ILU method for solving the subdomain
problems. But the parallel speedup of the PILU method
is sub-optimal due to the break of the data dependency.
Using the fast register communication mechanism (de-
tailed in Section V-C) supported by the SW26010 CPU,
we are able to design a new parallel ILU method, the
geometry-based pipelined ILU (GP-ILU) method, which
faithfully maintains the data dependency of the original
ILU method, but exploits the on-chip parallelism more
efﬁciently.
All major operations in our solvers are summarized
in Table I. For the explicit solver, only the FX kernel is
required, along with a few vector update operations.
C. Implementation and Optimization
We implement the proposed fully implicit solver as
well as an explicit one based on the PETSc (Portable
Extensible Toolkit for Scientific Computation [36]) library,
by which we set the in-memory data layout as an
array-of-structures in the z-x-y order. Then we perform
a systematic optimization across the process, the thread,
as well as the instruction level, and achieve substantial
speedups in all performance-critical kernels.
Fig. 3. The data partitioning and task scheduling of different kernels in our solver; the panels are (a) FX, AX, (b) MAT and (c) ILU. (a) The AX and FX kernels are partitioned into inner
and halo parts. Following the 2.5-D blocking for inner part, the proper block size for one CPE can be determined by the consideration on the
LDM size, vectorization, double-buffering footprint and DMA efficiency, which is 4×4 on SW26010. (b) A column-wise blocking/pipelining
methodology is presented for the MAT kernel and the block size is a multiple of 4 for vectorization. (c) The data domain is partitioned
into several 8×8 blocks, to perform a two-level pipelining method. Take the forward process of ILU(0) as an example, the inter-thread level
pipeline can be exploited on the x-y plane by taking the advantage of the fast register communication across the CPE cluster, as one CPE
only needs the results from its east and south neighbors to start the calculation. The intra-thread level pipeline is performed along the z
direction within each CPE according to the limited size of LDM.
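TABLE I
LIST OF MAJOR KERNELS IN THE FULLY IMPLICIT SOLVER.

| Kernel | Input | Output |
|---|---|---|
| FX | $x$ | $F(x)$ |
| AX | $x$, $\tilde{x}$ | $Ax \approx (\partial F(\tilde{x})/\partial \tilde{x})\,x$ |
| MAT | $\tilde{x}$ | $A_p \approx (\partial F_{low}(\tilde{x})/\partial \tilde{x})_p$ |
| RAS | $b$ | $\sum_{p=1}^{np} (R_p^0)^T (L_p U_p)^{-1} R_p^{\delta} b$ |
| ILU | $b_p = R_p^{\delta} b$ | $(L_p U_p)^{-1} b_p$, where $L_p U_p \approx A_p$ |
| GCR | $b$ | One GCR iteration applied on $b$ |
| MG | $b$ | One MG κ-cycle applied on $b$ |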
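As an illustration of the RAS row in Table I, the sketch below applies a restricted additive Schwarz preconditioner to a small 1-D model problem: each subdomain solves on its overlapping index range, but only the values on its own non-overlapping part are written back. This is a minimal sketch under stated assumptions: the tridiagonal matrix, the contiguous 1-D partition and the function name `ras_apply` are hypothetical, and an exact local solve is used as a stand-in for the ILU-based subdomain solve.

```python
import numpy as np

def ras_apply(A, b, n_sub, overlap):
    """Restricted additive Schwarz: sum_p (R_p^0)^T A_p^{-1} R_p^delta b."""
    n = b.size
    size = n // n_sub                     # assumes n is divisible by n_sub
    z = np.zeros(n)
    for p in range(n_sub):
        lo, hi = p * size, (p + 1) * size         # owned (non-overlapping) range
        lo_d = max(lo - overlap, 0)               # overlapping range start
        hi_d = min(hi + overlap, n)               # overlapping range end
        A_p = A[lo_d:hi_d, lo_d:hi_d]
        # Exact local solve, standing in for the inexact (L_p U_p)^{-1}.
        x_p = np.linalg.solve(A_p, b[lo_d:hi_d])
        # Restricted prolongation: keep only the values owned by subdomain p.
        z[lo:hi] = x_p[lo - lo_d:hi - lo_d]
    return z

n = 16
# Hypothetical diagonally dominant 1-D model operator.
A = 2.5 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)

z_one = ras_apply(A, b, n_sub=1, overlap=0)   # one subdomain: an exact solve
z_ras = ras_apply(A, b, n_sub=4, overlap=2)   # four overlapping subdomains

res0 = np.linalg.norm(b)                # residual of the zero initial guess
res1 = np.linalg.norm(b - A @ z_ras)    # residual after one RAS application
print(res0, res1)
```

Since each subdomain writes only to its own index range, the subdomain loop has no write conflicts and maps naturally onto one process per subdomain.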
Due to the massive parallel computing capabilities
and limited memory bandwidth of SW26010, how to
exploit as much parallelism as possible and how to
best utilize the limited memory resources are crucial for
performance. In particular, we focus on the effective use
of the small but fast LDM on each CPE of SW26010
processor. In this section, we highlight two major cat-
egories of optimization techniques: the LDM-oriented
partitioning that identiﬁes the most suitable form of
parallelism for multi-threading and vectorization; and the
memory-related optimization for maximizing the data
reuse and coalescing memory accesses.
1) LDM-oriented Partitioning: On the SW26010 pro-
cessor, three partitioning schemes are employed, cor-
responding to different types of kernels, as shown in
Fig. 3. For the AX kernel, the computation domain is
decomposed into the halo and the inner parts to do
communication-computation overlapping. For the inner
part, a 2.5D blocking is combined with a double-
buffering scheme to hide the memory access latency. The
same partition and scheduling strategy is also used in the
dominant FX kernel of our explicit solver. Compared
with the AX kernel, the MAT kernel has a similar
computation pattern but does not require halo exchange,
and involves fewer inputs and more outputs. Therefore,
we use a column-wise blocking/pipelining along the z-axis
in the z-x plane, as shown in Fig. 3(b).
With the support of inter-thread communication and
synchronization, we implement the GP-ILU method as a
two-level pipeline (inter/intra-thread levels). This method
provides a better solver performance when compared
to the blocked PILU method since only one sweep is
needed. Details are shown in Fig. 3(c). We partition
the data domain of each process into several 8×8 cell
columns, which exactly maps to the layout of the 8×8
CPE cluster. With this ﬁne-grained partition, the over-
head of imbalance during startup and ﬁnalization of the
pipelines can be minimized. Note that the factorization,
forward (solving the lower triangular part), and backward
(solving the upper triangular part) processes, can be
performed in a similar manner.
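The two-level pipeline can be pictured with a scalar model of the forward (lower-triangular) sweep. In the sketch below, every (i, j) cell column plays the role of one CPE, and at pipeline step t the column (i, j) processes level k = t - i - j, so that all of its upstream neighbors were finished in the previous step. This is a simplified illustration, not the paper's implementation: the constant scalar coefficients `cx`, `cy`, `cz`, the choice of lower-index neighbors as the upstream direction, and all function names are assumptions, whereas the actual method works on coupled per-cell blocks.

```python
import numpy as np

def update(x, b, i, j, k, cx, cy, cz):
    """Forward-substitution update of one cell from its already computed neighbors."""
    val = b[i, j, k]
    if i > 0:
        val -= cx * x[i - 1, j, k]
    if j > 0:
        val -= cy * x[i, j - 1, k]
    if k > 0:
        val -= cz * x[i, j, k - 1]
    x[i, j, k] = val

def forward_sequential(b, cx, cy, cz):
    """Reference sweep in plain lexicographic order."""
    nx, ny, nz = b.shape
    x = np.zeros_like(b)
    for i in range(nx):
        for j in range(ny):
            for k in range(nz):
                update(x, b, i, j, k, cx, cy, cz)
    return x

def forward_pipelined(b, cx, cy, cz):
    """Same sweep, reordered into pipeline steps over the (i, j) columns."""
    nx, ny, nz = b.shape
    x = np.zeros_like(b)
    n_steps = nx + ny + nz - 2
    for t in range(n_steps):
        # Within one step all active columns are independent of each other,
        # so in a real implementation they could run concurrently.
        for i in range(nx):
            for j in range(ny):
                k = t - i - j
                if 0 <= k < nz:
                    update(x, b, i, j, k, cx, cy, cz)
    return x, n_steps

rng = np.random.default_rng(1)
b = rng.standard_normal((8, 8, 32))    # one 8x8 block of cell columns, 32 levels
x_seq = forward_sequential(b, 0.3, 0.2, 0.1)
x_pipe, n_steps = forward_pipelined(b, 0.3, 0.2, 0.1)
print(np.array_equal(x_seq, x_pipe), n_steps)
```

The pipelined ordering reproduces the sequential result exactly because no data dependency is broken; the only cost is the startup and drain phase of 14 extra steps for an 8×8 block.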
2) Memory-related Optimization:
a) A customized data sharing scheme through reg-
ister communication: In 2.5-D blocking, each CPE has
to directly access the data in a strided way from memory,
which leads to inefﬁcient memory usage. To resolve this
Fig. 4. The customized data sharing method used in stencil-like
kernels including FX and AX. Here, each block contains 4×4 cells
and 2 halo layers. (1) Decomposing: 4 cores are grouped together, each
of which loads the data of 4×4+2×2 = 20 cells continuously.
(2) Duplicating: certain data on each core is duplicated to construct
the data domain with 4×8 (i.e. 32) cells due to 2-layer halos.
(3) Exchanging: the resulting data is exchanged between different cores
along with pairs in group via register communication, and finally
each core obtains their required data.
issue, an on-line data sharing method is implemented
to maximize data locality via the fast register commu-
nication feature, as shown in Fig. 4. More CPEs in a
group lead to more contiguous memory access and
better data reuse, but the overhead of on-line
processing and synchronization is also higher. Based on our
experiments, the optimal choice is to group 4 CPEs in the
same CG together.
b) On-the-fly array transposition: In the FX, AX
and MAT kernels, some computation parts favor the array-of-structures (AOS)
layout and others favor the structure-of-arrays (SOA) layout. We conduct the on-the-fly
array transposition to achieve highly efﬁcient transforma-
tions between AOS and SOA, and better vectorization.
The shufﬂe instruction supported by SW26010 is used to
implement this feature. In normal scenarios, the shufﬂe
of two vectors can be ﬁnished in one operation. Using
the shufﬂe instruction, we reduce the latency of conver-
sion to only 12 instruction cycles on SW26010 when
converting four cell structures with six double-precision
members into six 256-bit vectors and vice versa.
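The layout change itself is easy to state in array terms: four cell structures with six members each (AOS) become six vectors of four lanes each (SOA), and back. The sketch below expresses this with a NumPy transpose; it only illustrates the data movement and does not model the shuffle instruction sequence or its cycle count. The function names are hypothetical.

```python
import numpy as np

N_FIELDS = 6    # six double-precision members per cell, as stated in the text
VEC_WIDTH = 4   # four doubles fit in one 256-bit vector

def aos_to_soa(cells):
    """(4 cells x 6 fields) AOS block -> (6 fields x 4 lanes) SOA block."""
    return np.ascontiguousarray(cells.T)

def soa_to_aos(vectors):
    """(6 fields x 4 lanes) SOA block -> (4 cells x 6 fields) AOS block."""
    return np.ascontiguousarray(vectors.T)

# Cell c, field f holds the value 6*c + f, so the layout is easy to follow.
aos = np.arange(VEC_WIDTH * N_FIELDS, dtype=np.float64).reshape(VEC_WIDTH, N_FIELDS)
soa = aos_to_soa(aos)

print(soa[0])   # field 0 of the four cells, now contiguous in one vector
```

After the conversion, each row of `soa` holds one physical field for four neighboring cells, which is the form a 4-wide vector operation consumes directly.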
The partitioning method, the GP-ILU method, the on-
the-ﬂy array transposition method, and the on-cache data
sharing technique, can also be applied to other many-core
processors, such as MIC and GPU.
3) xMath: There are several other operations that
need to be optimized on the Sunway supercomputer. The
operations include BLAS-1 vector updates as well as
halo exchange. We have developed a high-performance
extended math library called xMath that supports highly
optimized BLAS, LAPACK, FFT operations on the Sun-
way TaihuLight platform. We call the BLAS-1 operations
in the xMath library to improve the performance of
vector operators using many-core parallelization and fuse
some kernels where possible. By calling the xMath
library and conducting manual optimizations, most BLAS-1
vector operations are boosted by a factor of around 20×
as compared to MPE-only versions.
VI. EXPERIMENT SETUP
A. Design of Experiments
As a fundamental atmospheric process, the baroclinic
instability is responsible for the generation of mid-
latitude cyclones and storm systems that may result in
severe weather/climate disasters. It is therefore of cru-
cial importance for an atmospheric model to accurately
reproduce this dynamical phenomenon with high efﬁ-
ciency. We employ the moist baroclinic instability test in
a β-plane 3D channel [6] to validate the correctness and
examine the performance of the proposed fully implicit
solver in our experimental dynamic core. In the setup,
the test is initiated by adding a conﬁned perturbation
in the zonal wind ﬁeld to a geostrophically balanced
background zonal ﬂow at the earth troposphere. The
computational domain in the baroclinic instability test is
a 3D channel spanning a 40,000 km × 6,000 km × 30
km range, with periodic boundary conditions along the
ﬂow direction and free-slip, non-penetrating boundaries
everywhere else. Although designed in a plane channel,
this conﬁguration retains the triggering mechanism of
the baroclinic jet in great detail, resembling the northern
hemisphere over the latitude range of 18°N to 72°N.
We run the test using our optimized fully implicit
solver with a horizontal resolution of 10 km and a ver-
tical resolution of 500 m for the purpose of comparison
with reference results obtained in other atmospheric
models. With the fully implicit method, the time step
for the simulation can be set to as large as 1200 s,
which is substantially greater than the explicit time step
(usually a few seconds or less). By using 16,000 CGs
on the Sunway TaihuLight supercomputer, we are able to
conduct the simulation at the speed of around 4.1 SYPD.
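To relate the time step size to the reported simulation speed, the arithmetic can be sketched as follows. The sketch assumes a 365-day year and treats SYPD (simulated years per wall-clock day) as the only measure of throughput; the function names are ours.

```python
# Illustrative arithmetic only; assumes a 365-day simulated year.
SECONDS_PER_DAY = 86_400
DAYS_PER_YEAR = 365

def steps_per_simulated_year(dt_seconds):
    """Number of time steps needed to advance one simulated year."""
    return DAYS_PER_YEAR * SECONDS_PER_DAY / dt_seconds

def wall_seconds_per_step(sypd, dt_seconds):
    """Average wall-clock time available per time step at a given SYPD."""
    steps_per_wall_day = sypd * steps_per_simulated_year(dt_seconds)
    return SECONDS_PER_DAY / steps_per_wall_day

steps = steps_per_simulated_year(1200)      # time step of 1200 s
budget = wall_seconds_per_step(4.1, 1200)   # reported speed of about 4.1 SYPD
```

Under these assumptions, a 1200 s time step corresponds to 26,280 steps per simulated year, and 4.1 SYPD corresponds to roughly 0.8 s of wall-clock time per implicit time step.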
The simulation results of the 500 m level proﬁle at day
10 to 16 are presented in Fig. 5. It is observed from
the ﬁgure that our fully implicit solver can successfully
capture the baroclinic jet where distinct low and high
temperature regions are generated with sharp fronts. The
simulated results at day 10 agree well with reference
results such as those in [6]. After day 12, the wave
starts to break due to the increasingly strong interaction
of large eddies; and our fully implicit solver is able
to continue the simulation with unreduced computing
efficiency (in contrast, an explicit solver would have to reduce
its time step size quickly).
Fig. 5. The 500 m level simulation results at day 10, 12, 14 and 16
for the baroclinic instability test. Shown here are the temperature
contours with the overlaid horizontal wind ﬁeld, for which only
around 1/3 of the whole computational domain is drawn to show more
details of the baroclinic jets and eddies.
B. FLOPS Measurement
To conduct an accurate performance measurement for
both our implicit and explicit solvers, we collect the
number of double-precision arithmetic operations using
three different methods summarized as follows.
• Manually counting all double-precision arithmetic
instructions in the assembly code.
• Using the hardware performance monitor, PERF,
provided by the vendor of the Sunway TaihuLight
supercomputer, to collect the number of
double-precision arithmetic instructions retired on
the CPE cluster.
• Measuring the double-precision arithmetic operations
by running the MPE-only versions of our
solvers, instrumented with the Performance API (PAPI), on
an Intel Xeon E5-2697v2 platform.
The ﬁrst and the second methods provide almost
identical double-precision arithmetic operation counts,
while the result with PAPI is 3% higher. This is probably
due to the difference between the Intel Xeon platform
and the SW26010 platform. In our study, we employ
the second method (the PERF result) to count the exact
total number of double-precision arithmetic operations.
The FLOPS results can then be calculated in PETSc by
utilizing its proﬁling functionality.
VII. PERFORMANCE RESULTS
A. Many-core Acceleration on SW26010
Fig. 6. Average runtime measured by per-minute-long simulation.
Both the implicit and explicit solvers are using 7,260 MPI processes,
with a horizontal resolution of 2.8 km. Each process handles
a 3D block of size 64 × 64 × 128. MPE, +PAR, +VEC,
and +MEM represent the MPE-only version, the threaded version
with the LDM-oriented partition, the vectorized version, and the
final version, respectively. The substantial difference in run time
between the fully-implicit and the explicit solvers is partially due
to the difference in time step size, which will be discussed later
in detail.
As far as we know, our work in this paper is the
ﬁrst atmospheric dynamic solver that scales over a het-
erogeneous supercomputer with over 10 million cores.
Therefore, the ﬁrst key effort in improving the overall
performance is to make an efﬁcient utilization of the
260 cores within the SW26010 many-core processor, in
terms of both the computing and the memory resources.
Figure 6 demonstrates the performance evaluation of
both our implicit and explicit solvers on the SW26010
many-core processor, when applying three different opti-
mization steps covered in Section V (the LDM-oriented
partition, vectorization, and memory-related optimization).
We use the MPE-only version (using only the 4 MPEs
within the processor) as the baseline.
In our implicit solver, the AX kernel is the most
time-consuming kernel, taking over 84.5% of the entire
execution time. By applying the LDM-oriented partition
scheme, we cut the execution time of the
AX kernel by a factor of 22 in the multi-threading step, and
by a further factor of 6.5 in the vectorization step (with
benefits from both vectorization and strength reduction),
which demonstrates the effectiveness of the identified
level of parallelism (over 140 times for the two steps
combined). We observe a similar performance boost
for the MAT kernel (53.5 times for the two steps
combined).
The only exception is the ILU kernel. Due to the ﬁne-
grained and unaligned memory operations introduced
by the proposed blocked PILU method, the ILU kernel
achieves only a 4.6 times speedup and becomes the most
time-consuming part after the first two steps. While this is
considered a major performance issue in almost all the
existing many-core architectures, by mainly combining
the geometry-based pipelined ILU scheme and our pro-
posed register communication based data sharing, we can
achieve both a better data locality and fewer iterations,
leading to a substantial speedup of 4.8 times.
In our explicit solver, the FX kernel consumes almost
all the computation time. Similarly, a carefully-designed
LDM-partition scheme achieves a 43 times
speedup at the multi-threading step, and a further 1.9
times speedup at the vectorization step. The memory-
related optimization provides another 1.4 times perfor-
mance improvement.
In general, our optimization strategies proposed in
Section V-C enable a good scaling of performance within
the 260 cores of the SW26010 processor. The execution
times of our implicit and explicit solvers are reduced
by 52 and 110 times respectively. The signiﬁcant perfor-
mance improvements demonstrate that our optimization
strategies are capable of identifying the right map-
ping between our atmospheric simulation algorithm and
the underlying SW26010 many-core architecture, which
forms the basis for the tremendous simulation capability
of our solver in a large-scale scenario.
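The per-step speedups quoted in this subsection multiply to give the combined kernel speedups. The short sketch below reproduces that arithmetic for the AX and FX kernels; the dictionary layout and labels are ours.

```python
# Illustrative arithmetic only: combines the per-step speedup factors quoted in the text.
from math import prod

stage_speedups = {
    "AX (implicit)": [22.0, 6.5],       # multi-threading, vectorization
    "FX (explicit)": [43.0, 1.9, 1.4],  # multi-threading, vectorization, memory
}

combined = {name: prod(factors) for name, factors in stage_speedups.items()}
```

The product for the AX kernel is 143, matching the "over 140 times" figure, and the product for the FX kernel is about 114. The whole-solver reductions of 52 and 110 times also include the other kernels, so they need not equal these per-kernel products.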
B. Strong Scaling Results
[Figure 7 plot: SYPD versus number of processes. Legend: Ideal; 2 km resolution; 2 km res. with IO; 3 km resolution; 3 km res. with IO.]
Fig. 7. Strong scaling results on the Sunway TaihuLight supercom-
puter. For the 2-km run, the solver scales from 13,568 processes to
the whole machine with a parallel efficiency of 67%, leading to an
8.1× increase of the SYPD from 0.07 to 0.57. For the 3-km run,
the code scales from 5,964 processes to the entire machine with a
parallel efficiency of 45%, leading to a 12.2× increase of the SYPD
from 0.083 to 1.01.
The strong scaling tests are carried out with two
conﬁgurations. One is on a 20352×3072×192 mesh
of 72.0 billion unknowns and the other is on a
13632×2016×192 mesh of 31.8 billion unknowns, cor-
responding to the 2-km and the 3-km horizontal resolu-
tions, respectively. We set the number of vertical levels to
192, which is larger than usual, so that we can
examine the effects of the domain decomposition in all
three directions. The strong scaling results are shown in
Fig. 7, in which the computing throughput is measured in
SYPD. For each configuration, we start from the smallest
number of processes (i.e., CGs) that the memory capacity
allows and increase the number of processes gradually
to the full-system scale. The fully implicit solver scales
well to the whole machine in both cases. In particular, a
1.01 SYPD in double-precision is achieved for the 3-km
run at the full-system scale (10.46 million cores in total).
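The strong-scaling figures for the 3-km run can be cross-checked with a few lines of arithmetic. The sketch below takes the process counts and SYPD values from the caption of Fig. 7 and from Table III, and assumes 65 cores per process (the 260 cores of one SW26010 divided among its four core groups, with one MPI process per core group). The function name is ours.

```python
# Illustrative arithmetic only; inputs are the values reported for the 3-km run.
CORES_PER_CG = 260 // 4  # assumption: 260 cores per processor, four core groups

def strong_scaling_efficiency(p0, sypd0, p1, sypd1):
    """Achieved speedup divided by the increase in the number of processes."""
    speedup = sypd1 / sypd0
    return speedup / (p1 / p0)

eff_3km = strong_scaling_efficiency(5_964, 0.083, 161_028, 1.01)
cores_3km = 161_028 * CORES_PER_CG
```

With these inputs, the number of processes grows 27-fold while the SYPD grows about 12.2-fold, giving an efficiency of about 45%, and 161,028 processes correspond to about 10.46 million cores.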
In addition, we also evaluate the performance of the
entire dynamic core including the I/O and initialization
and provide the results in the same ﬁgure. Compared
to the solver-only results, the overhead of I/O and
initialization is within the 5% range, depending on how
frequently I/O is required. This is due to the overlapping
of computation and I/O efﬁciently supported by the
asynchronous and non-blocking I/O facility and high
performance SSD storage of the Sunway TaihuLight
supercomputer.
C. Weak Scaling Results
TABLE II
CASE CONFIGURATIONS FOR WEAK-SCALING TESTS.
Number of processes   Mesh size #1 (Implicit/Explicit)   Mesh size #2 (Explicit)
168 × 38 × 1          16128 × 2432 × 128                 32256 × 4864 × 128
200 × 45 × 1          19200 × 2880 × 128                 38400 × 5760 × 128
250 × 56 × 1          24000 × 3584 × 128                 48000 × 7168 × 128
300 × 67 × 1          28800 × 4288 × 128                 57600 × 8576 × 128
367 × 82 × 1          35232 × 5248 × 128                 70464 × 10496 × 128
408 × 91 × 1          39168 × 5824 × 128                 78336 × 11648 × 128
453 × 101 × 1         43488 × 6464 × 128                 86976 × 12928 × 128
504 × 113 × 1         48384 × 7232 × 128                 96768 × 14464 × 128
610 × 137 × 1         58560 × 8768 × 128                 117120 × 17536 × 128
672 × 151 × 1         64512 × 9664 × 128                 129024 × 19328 × 128
756 × 170 × 1         72576 × 10880 × 128                145152 × 21760 × 128
853 × 192 × 1         81888 × 12288 × 128                163776 × 24576 × 128
In the weak scaling tests we focus on examining the
performance of both implicit and explicit solvers. The
detailed conﬁgurations of the weak scaling tests can
be found in Table II. For the implicit run, the size
of the subdomain is 96×64×128 and the number of
processes is increased from 6,384 to the full-system
scale (25.6-fold increase), corresponding to 2.48-km and
488-m horizontal resolutions, respectively. The weak
scaling results in terms of sustained PFLOPS are shown
in Fig. 8, from which we observe that both the fully
implicit and the explicit codes scale well to the whole
machine, sustaining up to 7.95 and 23.66 PFLOPS,
respectively. In addition, we also provide the scaling
results of the explicit code with a larger subdomain
size and the sustained performance of the explicit code
at the full-system scale is increased to 25.96 PFLOPS.
Compared with the state-of-the-art scalable implicit
solver [26] (3.41% of peak performance on 1.57 million
homogeneous cores of IBM Sequoia), we achieve about
2-fold FLOP efficiency on a much larger system (6.45%
of peak performance with 10.65 million heterogeneous
cores of the Sunway TaihuLight supercomputer).
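The quantities quoted for the weak-scaling runs can be derived from the process grids in Table II. The sketch below does this for the smallest and largest configurations of mesh size #1. It assumes that the horizontal resolution is the 40,000 km channel length divided by the number of cells in that direction, that each cell carries six unknowns (one per double-precision member of the cell structure), and that each process occupies 65 cores; the function and constant names are ours.

```python
# Illustrative arithmetic only; see the stated assumptions in the lead-in.
DOMAIN_X_KM = 40_000       # channel length from the experiment setup
SUBDOMAIN = (96, 64, 128)  # cells per process for mesh size #1
UNKNOWNS_PER_CELL = 6      # assumption: six double-precision members per cell
CORES_PER_CG = 65          # assumption: 260 cores / 4 core groups

def describe(px, py, pz=1):
    """Derive mesh size, resolution, unknowns and cores from a process grid."""
    nx, ny, nz = px * SUBDOMAIN[0], py * SUBDOMAIN[1], pz * SUBDOMAIN[2]
    processes = px * py * pz
    return {
        "processes": processes,
        "mesh": (nx, ny, nz),
        "resolution_km": DOMAIN_X_KM / nx,
        "unknowns": nx * ny * nz * UNKNOWNS_PER_CELL,
        "cores": processes * CORES_PER_CG,
    }

smallest = describe(168, 38)   # first row of Table II
largest = describe(853, 192)   # last row of Table II
```

Under these assumptions the first row gives 6,384 processes at about 2.48 km, and the last row gives 163,776 processes (about 10.65 million cores) at about 488 m with roughly 773 billion unknowns, in line with the figures quoted in the text.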
We remark that performance measured only in
PFLOPS could be misleading. When considering the
overhead due to the increase in the number of iterations, the
parallel efficiency of the fully implicit solver is 52% as
indicated in the figure. For the explicit solver, the time
step size must be decreased as the number of
processes increases (in contrast, the fully implicit solver uses
a fixed time step size of 240 s). This reduction is usually not
considered when calculating the weak scaling parallel
efficiency of explicit codes; therefore we follow the
tradition and show the parallel efficiency of the explicit
solver in the figure, which is 99.5% for the test case
with mesh size #1 and 100% for the case with mesh
size #2, suggesting that the cost of halo exchange is
successfully hidden by the computation of the inner part
of each subdomain.
[Figure 8 plot: sustained PFLOPS versus number of processes. Legend: Implicit (mesh size #1); Explicit (mesh size #1); Explicit (mesh size #2). Annotations: 7.95 PFLOPS, Para. eff. 52%; 23.66 PFLOPS, Para. eff. 100%; 25.96 PFLOPS, Para. eff. 99.5%.]
Fig. 8. Weak scaling results on Sunway TaihuLight.
D. Analysis of the Time-to-solution
To conduct a fairer comparison of the fully-implicit
and the explicit solvers, we further examine the
weak scaling performance in terms of SYPD, which is
the time-to-solution measured with respect to the
simulation/prediction capability of atmospheric models. We
use the same configuration (#1 in Table II) of the weak
scaling tests and show the results in Fig. 9. In the ideal
case, the same SYPD would be maintained as more
processes are used. However, for the stiff time-dependent
problems governed by hyperbolic conservation laws, this
is often hard to achieve. For the explicit solver, as the
number of processes increases, the resolution becomes
finer, leading to a decrease of the time step size for
stability reasons. For the implicit solver, a uniform
time step size can be used (240 s here), but there is a
mild increase (1.9×) in the number of iterations as the
problem size gets 25.6-fold larger. As observed from Fig. 9,
the fully implicit solver is able to deliver 34× more
SYPD than the explicit solver when using 6,384
processes at the 2.48-km resolution. The SYPD
advantage further increases to 89.5× at the ultra-fine
488-m resolution using the whole system, leading to a
simulation/prediction capability of 0.07 SYPD.
[Figure 9 plot: SYPD versus number of processes, with the corresponding resolution (km) on a second axis (2.480, 1.389, 0.920, 0.620, 0.488). Legend: Implicit; Explicit. Annotations: 34X; 89.5X.]
Fig. 9. Time-to-solution results on the Sunway TaihuLight
supercomputer. The performance is measured in SYPD, which is the
simulation/prediction capability of atmospheric models.
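The growth of the implicit solver's advantage from 34× to 89.5× is roughly what a simple back-of-the-envelope model predicts. The sketch below assumes that the explicit time step shrinks in proportion to the grid spacing (a CFL-type restriction), that the implicit cost per time step grows in proportion to the number of iterations, and that both codes otherwise weak-scale perfectly. These are our simplifying assumptions, not the measurement procedure used for Fig. 9.

```python
# Illustrative model only; the scaling assumptions are stated in the lead-in.

def projected_advantage(base_advantage, dx_coarse_km, dx_fine_km, iteration_growth):
    """Estimate the implicit/explicit SYPD ratio at the finer resolution."""
    explicit_slowdown = dx_coarse_km / dx_fine_km  # assumed: time step proportional to grid spacing
    implicit_slowdown = iteration_growth           # assumed: cost proportional to iteration count
    return base_advantage * explicit_slowdown / implicit_slowdown

estimate = projected_advantage(34.0, 2.48, 0.488, 1.9)
```

The estimate is about 91×, within a few percent of the measured 89.5×, which suggests that the time step restriction of the explicit solver and the mild iteration growth of the implicit solver account for most of the change.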
VIII. IMPLICATIONS
In this work, we design a highly scalable fully implicit
solver and apply it in an experimental dynamic core
for nonhydrostatic atmospheric simulations. The solver
scales to the full-system scale of 10.5 million cores on
the Sunway TaihuLight, providing a simulation speed of
1 SYPD at the horizontal resolution of 3-km and sustain-
ing an aggregate performance of 7.95 PFLOPS in double
precision. We summarize the measured simulation speed
of our fully implicit solver in Table III, which shows
competitive results as compared to the current state-of-
the-art. The scalable, efﬁcient and robust experimental
dynamic core may open the opportunity to revisit the
possibility of practical use and the potential for the fully-
implicit method in next-generation atmospheric model-
ing, especially for ultra-high-resolution simulations and
large scale computing systems.
TABLE III
SIMULATION SPEED OF OUR FULLY IMPLICIT DYNAMIC CORE.
Resolution    488-m     3-km      12.5-km   25-km
SYPD          0.07      1.0       4.9       14.4
# Processes   163,776   161,028   16,000    16,000
In terms of the numerical methodology and appli-
cation advancement, our work has demonstrated that
fully-implicit methods could be an important option for
performing numerical weather and climate simulations,
especially for ultra high resolution scenarios. Our com-
parison between the implicit method and the explicit
method (with a similar level of optimizations on the Sun-
way TaihuLight system) demonstrates an 89.5X time-to-
solution advantage for the ultra high resolution of 488
m. Although our explicit solver provides a signiﬁcantly
higher arithmetic performance of 25.96 PFLOPS, our
implicit method, with an arithmetic performance of only
7.95 PFLOPS, can cut the computing time-to-solution by
almost two orders of magnitude and achieve a simulation
speed that is equivalent to our explicit method running
on an Exascale system. While one may argue that the
explicit method can be further improved by, e.g., time-
splitting methods, we can also improve the performance
of the implicit solver by applying techniques such as
adaptive time-stepping. Therefore, we consider this as
a reasonable comparison that demonstrates the bene-
ﬁts and constraints of explicit and implicit methods
in atmospheric modeling. The success of the implicit method
depends strongly on the design of the solver, in which
convergence, scalability, and utilization of the underlying
hardware architecture are of great importance. The proposed
fully implicit solver, especially the hybrid DD-MG
preconditioner and the GP-ILU factorization, is
not only suitable for atmospheric modeling, but also
applicable to many important applications governed by
time-dependent hyperbolic conservation laws.
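The statement that the implicit solver matches an explicit run on an Exascale system can be checked with simple arithmetic. The sketch below assumes that the explicit solver would keep its sustained efficiency on a larger machine, so that closing an 89.5× gap in time-to-solution requires 89.5× the sustained arithmetic rate. Both reported explicit rates are evaluated; the function name is ours.

```python
# Illustrative arithmetic only; assumes the explicit solver's efficiency is unchanged at scale.

def required_eflops(explicit_pflops, advantage):
    """Sustained rate (in EFLOPS) the explicit solver would need to close the gap."""
    return explicit_pflops * advantage / 1000.0

needed_mesh1 = required_eflops(23.66, 89.5)  # explicit code, mesh size #1
needed_mesh2 = required_eflops(25.96, 89.5)  # explicit code, mesh size #2
```

Either way, the result is a little over 2 EFLOPS of sustained performance, that is, an Exascale-class machine.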
On the aspect of the hardware system, our work has
demonstrated that many-core architectures could be a
suitable candidate for running large-scale PDE solvers
with innovative algorithms and well-tuned implemen-
tations. While the SW26010 processor is a brand new
CPU based on many-core architecture, the corresponding
parallel programming interfaces make it possible for
us to customize our numerical algorithms and carefully
tune the parallelization, buffering, and communication
schemes to achieve a highly-efﬁcient implementation
of our fully-implicit solver on the Sunway TaihuLight
supercomputer. Our work may serve as a basis and
guide for domain experts designing scalable and efficient
algorithms for large-scale PDE solvers, especially in
geophysical fluid dynamics on cutting-edge heterogeneous systems.
In the future, we will continue our efforts on the
new-generation Sunway system, and collaborate with
climate scientists to extend our current dynamic solver
framework into an atmospheric model that can perform
ultra-high resolution atmospheric simulations/predictions
for seamless weather-climate modeling.
ACKNOWLEDGMENTS
We would like to acknowledge the contributions of Prof.
Jiachang Sun and Prof. Xiao-Chuan Cai for insightful sug-
gestions on algorithm design. We wish to offer our deep-
est thanks to NRCPC and NSCC-Wuxi for building the
Sunway TaihuLight supercomputer and supporting us with
the computing resources as well as the technical guid-
ance to bring our ideas into reality. This work was par-
tially supported by the National Key Research & De-
velopment Plan of China under grant# 2016YFB0200600,
2016YFA0602200 and 2016YFA0602103, Natural Science
Foundation of China under grant# 91530103, 91530323,
61361120098, 61303003 and 41374113, and the 863 Pro-
gram of China under grant# 2015AA01A302. The correspond-
ing authors are C. Yang (yangchao@iscas.ac.cn), W. Xue
(xuewei@tsinghua.edu.cn), H. Fu (haohuan@tsinghua.edu.cn)
and L. Gan (lin.gan27@gmail.com).
REFERENCES
[1] J. K. Lazo, M. Lawson, P. H. Larsen, and D. M. Waldman,
“US economic sensitivity to weather variability,” Bulletin of
the American Meteorological Society, p. 709, 2011.
[2] P. Bauer, A. Thorpe, and G. Brunet, “The quiet revolution of
numerical weather prediction,” Nature, vol. 525, no. 7567, pp.
47–55, 2015.
[3] P. H. Lauritzen, C. Jablonowski, M. A. Taylor, and R. D. Nair,
Eds., Numerical Techniques for Global Atmospheric Models.
Springer, 2011.
[4] C. Bretherton, “A National Strategy for Advancing Climate
Modeling,” 2012.
[5] R. Klein, U. Achatz, D. Bresch, O. Knio, and P. Smolarkiewicz,
“Regime of validity of soundproof atmospheric flow models,”
J. Atmos. Sci., vol. 67, pp. 3226–3237, 2010.
[6] P. Ullrich and C. Jablonowski, “Operator-split Runge-Kutta-
Rosenbrock methods for nonhydrostatic atmospheric models,”
Monthly Weather Review, vol. 140, no. 4, pp. 1257–1284, 2012.
[7] M. Kwizak and A. J. Robert, “A semi-implicit scheme for
grid point atmospheric models of the primitive equations,”
Mon. Wea. Rev., vol. 99, pp. 32–36, 1971.
[8] C. Temperton and A. Staniforth, “An efficient two-time-level
semi-Lagrangian semi-implicit integration scheme,”
Q. J. R. Meteorol. Soc., vol. 113, pp. 1025–1039, 1987.
[9] J. Klemp and R. B. Wilhelmson, “The simulation of three
dimensional convective storm dynamics,” J. Atmos. Sci., vol. 35,
pp. 1070–1096, 1978.
[10] M. Satoh, “Conservative scheme for the compressible nonhy-
drostatic models with the horizontally explicit and vertically
implicit time integration scheme,” Mon. Wea. Rev., vol. 130,
pp. 1227–1245, 2002.
[11] V. A. Mousseau, D. A. Knoll, and J. M. Reisner, “An implicit
nonlinearly consistent method for the two-dimensional shallow-
water equations with Coriolis force,” Mon. Wea. Rev., vol. 130,
pp. 2611–2625, 2002.
[12] H. Fu, J. Liao, J. Yang, L. Wang, Z. Song, X. Huang, C. Yang,
W. Xue, F. Liu, F. Qiao et al., “The Sunway TaihuLight super-
computer: system and applications,” Science China Information
Sciences, vol. 59, no. 7, p. 072001, 2016.
[13] Y. Ogura and N. A. Phillips, “Scale analysis of deep and shallow
convection in the atmosphere,” Journal of the Atmospheric
Sciences, vol. 19, no. 2, pp. 173–179, 1962.
[14] M. Satoh, T. Matsuno, H. Tomita, H. Miura, T. Nasuno,
and S.-i. Iga, “Nonhydrostatic icosahedral atmospheric model
(NICAM) for global cloud resolving simulations,” Journal of
Computational Physics, vol. 227, no. 7, pp. 3486–3514, 2008.
[15] H. Fudeyasu, Y. Wang, M. Satoh, T. Nasuno, H. Miura, and
W. Yanase, “Global cloud-system-resolving model NICAM successfully
simulated the lifecycles of two real tropical cyclones,”
Geophysical Research Letters, vol. 35, no. 22, 2008.
[16] J. M. Dennis, J. Edwards, K. J. Evans, O. Guba, P. H. Lauritzen,
A. A. Mirin et al., “CAM-SE: A scalable spectral element
dynamical core for the Community Atmosphere Model,” Inter-
national Journal of High Performance Computing Applications,
vol. 26, no. 1, pp. 74–89, 2012.
[17] J. Peter, M. Straka et al., “Petascale WRF simulation of Hurricane
Sandy: deployment of NCSA’s Cray XE6 Blue Waters,” in Proceedings
of ACM SC’13 Conference, vol. 63, 2013.
[18] J. Michalakes, R. Benson, T. Black, M. Duda, M. Govett,
T. Henderson, P. Madden, G. Mozdzynski, A. Reinecke, and
W. Skamarock, “Evaluating Performance and Scalability of
Candidate Dynamical Cores for the Next Generation Global
Prediction System,” 2015.
[19] R. Kelly, “GPU Computing for Atmospheric Modeling,” Com-
puting in Science and Engineering, vol. 12, no. 4, pp. 26–33,
2010.
[20] J. Linford, J. Michalakes, M. Vachharajani, and A. Sandu,
“Multi-core acceleration of chemical kinetics for simulation
and prediction,” in High Performance Computing Networking,
Storage and Analysis, Proceedings of the Conference on, Nov
2009, pp. 1–11.
[21] T. Shimokawabe, T. Aoki, J. Ishida, K. Kawano, and C. Muroi,
“145 TFlops performance on 3990 GPUs of TSUBAME 2.0
supercomputer for an operational weather prediction,” Procedia
Computer Science, vol. 4, pp. 1535–1544, 2011.
[22] H. Yashiro, M. Terai, R. Yoshida, S.-i. Iga, K. Minami, and
H. Tomita, “Performance Analysis and Optimization of Nonhy-
drostatic ICosahedral Atmospheric Model (NICAM) on the K
Computer and TSUBAME2.5,” in Proceedings of the Platform
for Advanced Scientiﬁc Computing Conference. ACM, 2016,
p. 3.
[23] C. Yang, W. Xue, H. Fu, L. Gan, L. Li, Y. Xu, Y. Lu, J. Sun,
G. Yang, and W. Zheng, “A Peta-scalable CPU-GPU algorithm
for global atmospheric simulations,” in ACM SIGPLAN Notices,
vol. 48, no. 8. ACM, 2013, pp. 1–12.
[24] W. Xue, C. Yang, H. Fu, X. Wang, Y. Xu, L. Gan, Y. Lu,
and X. Zhu, “Enabling and scaling a global shallow-water
atmospheric model on Tianhe-2,” in Parallel and Distributed
Processing Symposium, 2014 IEEE 28th International. IEEE,
2014, pp. 745–754.
[25] W. Xue, C. Yang, H. Fu, X. Wang, Y. Xu, J. Liao, L. Gan,
Y. Lu, R. Ranjan, and L. Wang, “Ultra-Scalable CPU-MIC
Acceleration of Mesoscale Atmospheric Modeling on Tianhe-
2,” IEEE Trans. Computers, vol. 64, no. 8, pp. 2382–2393,
2015.
[26] J. Rudi, A. C. I. Malossi, T. Isaac, G. Stadler, M. Gurnis,
P. W. Staar, Y. Ineichen, C. Bekas, A. Curioni, and O. Ghattas,
“An extreme-scale implicit solver for complex PDEs: highly
heterogeneous ﬂow in earth’s mantle,” in Proceedings of the
International Conference for High Performance Computing,
Networking, Storage and Analysis. ACM, 2015, p. 5.
[27] T. Ichimura, K. Fujita, P. E. B. Quinay, L. Maddegedara,
M. Hori, S. Tanaka, Y. Shizawa, H. Kobayashi, and K. Minami,
“Implicit nonlinear wave simulation with 1.08t dof and 0.270t
unstructured ﬁnite elements to enhance comprehensive earth-
quake simulation,” in Proceedings of the International Confer-
ence for High Performance Computing, Networking, Storage
and Analysis, ser. SC ’15. New York, NY, USA: ACM, 2015,
pp. 4:1–4:12.
[28] A. Rahimian, I. Lashuk, S. Veerapaneni, A. Chandramowlish-
waran, D. Malhotra, L. Moon, R. Sampath, A. Shringarpure,
J. Vetter, R. Vuduc, D. Zorin, and G. Biros, “Petascale direct
numerical simulation of blood ﬂow on 200k cores and hetero-
geneous architectures,” in Proceedings of the 2010 ACM/IEEE
International Conference for High Performance Computing,
Networking, Storage and Analysis, ser. SC ’10. Washington,
DC, USA: IEEE Computer Society, 2010, pp. 1–11.
[29] W. D. Gropp, D. K. Kaushik, D. E. Keyes, and B. F. Smith,
“High-performance parallel implicit CFD,” Parallel Computing,
vol. 27, pp. 337–362, 2001.
[30] O. B. Widlund, “The development of coarse spaces for domain
decomposition algorithms,” in Domain Decomposition methods
in science and engineering XVIII. Springer, 2009, pp. 241–248.
[31] C. Yang and X.-C. Cai, “Parallel multilevel methods for implicit
solution of shallow water equations with nonsmooth topography
on the cubed-sphere,” J. Comput. Phys., vol. 230, no. 7, pp.
2523–2539, 2011.
[32] ——, “A scalable fully implicit compressible Euler solver for
mesoscale nonhydrostatic simulation of atmospheric flows,”
SIAM Journal on Scientiﬁc Computing, vol. 36, no. 5, pp. S23–
S47, 2014.
[33] E. Chow and A. Patel, “Fine-grained parallel incomplete LU
factorization,” SIAM Journal on Scientiﬁc Computing, vol. 37,
no. 2, pp. C169–C193, 2015.
[34] H. Anzt, E. Chow, and J. Dongarra, “Iterative sparse triangular
solves for preconditioning,” in Euro-Par 2015: Parallel Pro-
cessing. Springer, 2015, pp. 650–661.
[35] X.-C. Cai and M. Sarkis, “A restricted additive Schwarz precon-
ditioner for general sparse linear systems,” SIAM J. Sci. Com-
put., vol. 21, pp. 792–797, 1999.
[36] S. Balay, J. Brown, K. Buschelman, V. Eijkhout, W. D. Gropp,
D. Kaushik, M. G. Knepley, L. C. McInnes, B. F. Smith,
and H. Zhang, “PETSc Users Manual,” Argonne National
Laboratory, Tech. Rep. ANL-95/11 – Revision 3.4, July 2013.
... Several studies have been conducted to explore the technical/scientific issues that are critical for global atmospheric simulations at the kilometer scale with the powerful High-Performance Computing (HPC). Many of them targeted to only assess the performance of dynamic cores of atmospheric models [2,16,[22][23][24]. For example, some institutions have already demonstrated the capability of global simulations with 3-km mesh using two state-ofthe-art atmospheric dynamic cores (MPAS and FV3) [22]. ...
... The global simulations with different dynamic cores with a mesh of less than 10 km were inter-compared in the dynamics of the atmospheric general circulation modeled on non-hydrostatic domains (DYAMOND) project [24]. The fully implicit atmospheric dynamical core achieved an idealized experiment with a 488-m mesh (1.1 SDPH) [23]. Some studies conducted real-world global simulations with several-kilometer meshes but only with hydrostatic dynamic cores to increase computational efficiency (through using larger integration time step) compared to non-hydrostatic dynamic cores [18,25]. ...
... It indicates that the performance improves from 0.054 SDPH to 0.82 SDPH when enlarging the system scale from 30,000 processes to 600,000 processes. Compared to the simulation speed range (0.10-0.65 SDPH; [17][18][19]21,23,25]) from previous studies of the global non-hydrostatic atmospheric simula- tions at kilometer-level mesh, this study also represents a significant improvement in computational performance achievement. In particular, the simulation speed measured in this study is for meteorology-aerosol online coupled forecast and includes routine I/O (46 GB every simulation hour and 6.8 TB every 3 simulation hours). ...
Article
During the era of global warming and highly urbanized development, extreme and high impact weather as well as air pollution incidents influence everyday life and might even cause the incalculable loss of life and property. Although, with the vast development of atmospheric model, there still exists substantial numerical forecast biases objectively. To predict accurately extreme weather, severe air pollution, and abrupt climate change, the numerical atmospheric model requires not only to simulate meteorology and atmospheric compositions simultaneously involving many sophisticated physical and chemical processes but also at high spatiotemporal resolution. Global integrated atmospheric simulation at spatial resolutions of a few kilometers remains challenging due to its intensive computational and input/output (I/O) requirement. Through multi-dimension-parallelism structuring, aggressive and finer-grained optimizing, manual vectorizing, and parallelized I/O fragmenting, an integrated Atmospheric Model Across Scales (iAMAS) was established on the new Sunway supercomputer platform to significantly increase the computational efficiency and reduce the I/O cost. The global 3-km atmospheric simulation for meteorology with online integrated aerosol feedbacks with iAMAS was scaled to 39,000,000 processor cores and achieved the speed of 0.82 simulation day per hour (SDPH) with routine I/O, which enabled us to perform 5-day global weather forecast at 3-km horizontal resolution with online natural aerosol impacts. The results demonstrate the promising future that the increasing of spatial resolution to a few kilometers with online integrated aerosol feedbacks may significantly improve the global weather forecast.
... While there has been a significant amount work which has investigated the use of graphics processing units (GPUs) to accelerate computational physics, it has mostly been in the context of simple fluid interaction, while in the context of multi-material finite-volume methods (FVM) which are able to handle complex material interactions, GPUs are yet to be applied with results superior to existing well-developed CPU implementations and algorithms. For single-material FVM, a GPU code for the Shallow-Water Equations (SWE) was presented by Boubekeur et al. [22], as well as by Yang et al. [171]. FVM solvers for both the SWE and Euler equations are described by Xu et al. [169], Xu et al. [168], where they were able to achieve up to 70.4x and 93.9x on an NVIDIA V100 compared to a 12 core Intel Xeon E5-2697. ...
... Another simple parallel algorithm is presented by Breuß et al. [23] which adaptively decomposes the domain, however, the scaling is not optimal as a speedup of 2.5x is achieved with 4 CPU cores, and thus will likely not scale well to multiple GPUs where the cost of data transfer is significantly higher than for multiple CPUs. More recently, Yang et al. [171] present a more involved parallel algorithm for the fast marching method which is shown to scale well up to thousands of cores, and is particularly effective for lower core counts (i.e less than 16). Additionally, their algorithm is designed for effective execution in a narrow band, but requires relatively complex data structures (again, a heap) which are not practical for implementation on the GPU. ...
Thesis
The aims of this thesis are to develop a framework, which we have named Ripple, for the efficient execution of general purpose computations on modern heterogeneous compute architectures, with a focus on multiple graphics processing units (GPUs), as well as to develop algorithms for the numerical simulation of multiple interacting materials on modern, massively parallel, computer hardware. The Ripple framework is applicable to a wide range of HPC problems, allowing programmers to concentrate on algorithm design, while the framework takes the high- level domain application logic and executes it with near optimal performance across all devices, handling data layout transformations, inter-device communication, and optimal scheduling of the computation sub-stages. To demonstrate the effectiveness of the framework, we develop efficient parallel algorithms for numerical schemes commonly used in finite-volume methods, the solution of the Eikonal equation in a narrow band, the Ghost Fluid Method (GFM), and signed-distance function generation from multiple external geometry file types. Additionally, we present the main problems involved in scaling multi-material interaction simulations across multiple GPUs, and provide solutions to these problems which allow multi-material simulations to scale almost linearly with the number of compute devices, enabling the simulation of prob- lems on massive domains. We also demonstrate how the Ripple framework improves on existing work in terms of performance and simplification of software development, and enables solutions to execute across multiple GPUs. The algorithms which we present are built around well developed methods for multi-material interaction. These methods currently require vast computational resources for simulation on domains of modest size—even when well developed adaptive mesh refinement (AMR) tech- niques are used—since they do not make use of modern GPUs. 
The work presented in this thesis demonstrates that utilising modern hardware, particularly GPUs, reduces computational time. To ensure that our algorithms are correct, we validate the developed code on standard test cases used for finite-volume methods, as well as for multi-material interaction problems in two and three dimensions involving gas-gas, gas-liquid, and gas-liquid-solid interfaces. We then apply the developed techniques to novel, unvalidated real-world use cases to demonstrate the utility of the techniques. We performed comparison simulations using multi-GPU unigrid with the developed framework and multi-core CPU adaptive mesh refinement using existing implementations which have demonstrated good performance. While these are different algorithms executed on different hardware, the cost of execution per hour using a single NVIDIA A100 GPU and a 48 core Intel Xeon Cascade Lake is effectively equivalent on current cloud providers, such as AWS. This comparison demonstrates the improvements in both performance and cost which can be achieved by adapting current state of the art scientific computing algorithms, which have been designed for CPU execution, for execution on the GPU. For two-dimensional multi-material simulations, our framework is able to achieve up to a 24x improvement in performance and 3x reduction in cost using 8 GPUs without adaptive mesh refinement compared with a 48 core CPU implementation using adaptive mesh refinement. The algorithms developed in this thesis allow strong scaling of 6.95x across 8 GPUs, and the Ripple framework makes performance gains of this magnitude accessible to other computationally demanding scientific domains, with minimal effort. 
For novel three-dimensional blast problems involving complex geometries, we show that our signed-distance function generation for such geometries using the Ripple framework can be performed multiple orders of magnitude faster on the GPU than on the CPU, and that the overall simulation can be performed up to 35x faster on a single GPU without AMR than when using a 32 core CPU implementation with AMR, at a reduction in cost of 22x.
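The relation between the quoted speedup, device count, and cost figures can be checked with a small sketch. It assumes, as the abstract states, that one GPU and one 48-core CPU node cost effectively the same per hour; the function names and the `price_ratio` parameter are hypothetical and introduced only for illustration.

```python
def cost_reduction(speedup, devices, price_ratio=1.0):
    """Factor by which total cost drops relative to one baseline node.

    speedup:     wall-clock speedup over the baseline run
    devices:     number of accelerator devices used
    price_ratio: hourly price of one device divided by the hourly price
                 of the baseline node (assumed 1.0, i.e. equal pricing)
    """
    return speedup / (devices * price_ratio)


def parallel_efficiency(speedup, devices):
    """Strong-scaling efficiency: achieved speedup divided by the ideal one."""
    return speedup / devices


# 24x faster on 8 GPUs at equal hourly price per device
two_d_cost = cost_reduction(24.0, 8)
# 6.95x strong scaling across 8 GPUs
scaling_eff = parallel_efficiency(6.95, 8)
print(two_d_cost, scaling_eff)
```

Under the equal-pricing assumption, the 24x speedup on 8 GPUs corresponds to the quoted 3x cost reduction, and the 6.95x strong scaling corresponds to a parallel efficiency of roughly 87%.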
... In recent years, heterogeneous architectures that involve many-core computing devices [4][5][6] have become the foremost provider of computing power in various modern High-Performance Computing (HPC) systems and have shown their ability to run various kinds of climate or weather models [7,8]. The new generation Sunway supercomputer is the successor of Sunway Taihulight [9,10], equipped with new heterogeneous many-core processors SW26010pro, of which hardware and software details are shown in Section 3.4. ...
Article
With the computing power of High-Performance Computing (HPC) systems having stepped into the exascale era, more complex problems can be solved with scientific applications on a large scale. However, due to the significant performance gap between computing nodes and storage subsystems, suboptimal design for the Input/Output (I/O) module will significantly impede the efficiency of scientific applications, especially for the ubiquitous atmosphere applications. Two-phase I/O implemented in N-to-1 mode creates a serious bottleneck that hinders the scalability of the Model for Prediction Across Scales-Atmosphere (MPAS-A) on the new generation Sunway supercomputer. To address the I/O problem, we apply a custom data reorganization method to enable N-to-M I/O mode to exploit the parallel file system's performance and limit the data transfer among MPI ranks to a restricted scope to alleviate communication overhead. Moreover, we have applied several methods to accelerate the computations, including the redesign for tracer transport, a hybrid buffering scheme, and a three-level parallelization scheme, which allows MPAS-A to use all heterogeneous computing resources efficiently. Experimental results show admirable scalability and efficiency of our I/O method, which achieves speedups of 41× and 58.9× for input and output compared with the raw I/O method on 30,000 MPI ranks. By scaling MPAS-A to 39 million heterogeneous cores, we demonstrate the necessity of a well-constructed I/O module for a real-world atmosphere application. Speed tests show that our optimization methods obtain good results for computations, and MPAS-A achieves a speed of 0.82 Simulated Day per Hour (SDPH) and 0.76 parallel efficiency of strong scaling with 600,000 MPI ranks.
... Undeniably, elliptic solvers are computationally demanding and substantial development has been invested into making them as efficient as possible for W&C models on modern supercomputers (Mueller & Scheichl, 2014; Yang et al., 2016). To efficiently solve an elliptic problem posed in a thin spherical shell (such as the global atmosphere) ultimately requires matrix inversion, if not in the main solver then at least in its preconditioner. ...
Article
Semi‐implicit (SI) time‐stepping schemes for atmosphere and ocean models require elliptic solvers that work efficiently on modern supercomputers. This paper reports our study of the potential computational savings when using mixed precision arithmetic in the elliptic solvers. Precision levels as low as half (16 bits) are used and a detailed evaluation of the impact of reduced precision on the solver convergence and the solution quality is performed. This study is conducted in the context of a novel SI shallow‐water model on the sphere, purposely designed to mimic numerical intricacies of modern all‐scale weather and climate (W&C) models. The governing algorithm of the shallow‐water model is based on the non‐oscillatory MPDATA methods for geophysical flows, whereas the resulting elliptic problem employs a strongly preconditioned non‐symmetric Krylov‐subspace Generalized Conjugated‐Residual (GCR) solver, proven in advanced atmospheric applications. The classical longitude/latitude grid is deliberately chosen to retain the stiffness of global W&C models. The analysis of the precision reduction is done on a software level, using an emulator, whereas the performance is measured on actual reduced precision hardware. The reduced‐precision experiments are conducted for established dynamical‐core test‐cases, like the Rossby‐Haurwitz wavenumber 4 and a zonal orographic flow. The study shows that selected key components of the elliptic solver, most prominently the preconditioning and the application of the linear operator, can be performed at the level of half precision. For these components, the use of half precision is found to yield a speed‐up of a factor 4 compared to double precision for a wide range of problem sizes.
... Thanks to the development of large-scale parallel computing technology, computational fluid dynamics (CFD) is becoming increasingly significant in dealing with unsteady numerical simulations [1]. It is well acknowledged that a single grid is no longer satisfactory for the simulation of complex multi-body dynamics such as the dynamic flow field of flapping wings [2], the internal flow field of turbines [3], the aerodynamic interactions of a rigid coaxial rotor in hover [4] and external store separation [5]. ...
Article
The assembly of overlapping grids is a key technology to deal with the relative motion of multi-bodies in computational fluid dynamics. However, the conventional implicit assembly techniques for overlapping grids are often confronted with the problem of complicated geometry analysis, and consequently, they usually have a low parallel assembly efficiency resulting from the undifferentiated searching of grid nodes. To deal with this, a parallel implicit assembly method that employs a two-step node classification scheme to accelerate the hole-cutting operation is proposed. Furthermore, the aforementioned method has been implemented as a library, which can be conveniently integrated into the existing numerical simulators and enable efficient assembly of large-scale multi-component overlapping grids. The algorithm and relevant library are validated with a seven-sphere configuration and multi-body trajectory prediction case in the aspects of parallel computing efficiency and interpolation accuracy.
... Multigrid methods [5,17] are one of the most efficient algorithms suitable for solution of elliptic problems, including those arising in global atmospheric models [2,8,9,13,16,19,21]. The main components of the multigrid method are defined as follows. ...
Chapter
One of the most important aspects that determine the efficiency of a numerical model of atmospheric dynamics is the time integration scheme. It is common to apply semi-implicit integrators, which allow larger time steps but require the solution of a linear elliptic equation at every time step of the model. We present an implementation of linear solvers (geometric multigrid and BICGstab) within the ParCS parallel framework, which is used for development of the new non-hydrostatic global atmospheric model at INM RAS and Hydrometcentre of Russia. The efficiency and parallel scalability of the implemented algorithms have been tested for the elliptic problem typical of numerical weather prediction models using semi-implicit discretization on the cubed sphere grid.
Article
Although matrix multiplication plays an essential role in a wide range of applications, previous works only focus on optimizing dense or sparse matrix multiplications. The Sparse Approximate Matrix Multiply (SpAMM) is an algorithm to accelerate the multiplication of decay matrices, the sparsity of which is between dense and sparse matrices. In addition, large-scale decay matrix multiplication is performed in scientific applications to solve cutting-edge problems. To optimize large-scale decay matrix multiplication using SpAMM on supercomputers such as Sunway Taihulight, we present swSpAMM, an optimized SpAMM algorithm by adapting the computation characteristics to the architecture features of Sunway Taihulight. Specifically, we propose both intra-node and inter-node optimizations to accelerate swSpAMM for large-scale execution. For intra-node optimizations, we explore algorithm parallelization and block-major data layout that are tailored to better utilize the architecture advantage of Sunway processor. For inter-node optimizations, we propose a matrix organization strategy for better distributing sub-matrices across nodes and a dynamic scheduling strategy for improving load balance across nodes. We compare swSpAMM with the existing GEMM library on a single node as well as large-scale matrix multiplication methods on multiple nodes. The experiment results show that swSpAMM achieves a speedup up to 14.5× and 2.2× when compared to xMath library on a single node and 2D GEMM method on multiple nodes, respectively.
Article
Due to the high computational complexity of wormhole propagation in carbonate acidization, nonphysical oscillations of the numerical solutions often appear, which seriously ruins the physical accuracy or even breaks down the whole process of reservoir flows at large-scale simulation. In this paper, we introduce and study a family of recently developed nonlinear complementarity formulations for handling the nonphysical oscillations. Typically, this bound-preserving process is modeled by the corresponding nonlinear partial differential equations (PDEs) along with the inequality restrictions as a non-smooth system of equations under the application of a minimum-type complementary function. Because of the non-smoothness of the nonlinear complementarity system, the nonlinear algebraic system after the fully implicit discretization is solved by using a semismooth Newton method combined with a generalized Jacobian matrix. To accelerate the convergence and enhance the robustness of the corresponding linear iterations, we employ an improved class of constrained pressure residual (CPR) preconditioners with different combinations of physics-based and domain decomposition methods. Experiments on two- and three-dimensional wormhole propagation problems are presented to demonstrate the applicability of the aforementioned algorithms. Large-scale reservoir simulation with more than one billion degrees of freedom is provided to show the algorithmic scalability by using tens of thousands of processors.
Article
With the incremental applications of Newton–Krylov methods for solving large sparse nonlinear systems of equations, the design of robust and scalable linear preconditioners plays an essential role for the whole solver. In this paper, we investigate the family of field-split (FS) preconditioners with different combinations of physics-based and domain decomposition methods, applied to the two typical fluid problems, i.e., the unsteady flow through fractured porous media and the steady buoyancy driven flow. In the implementation, several new versions of FS preconditioners are considered under the framework of the domain decomposition technique: additive FS, multiplicative FS, Schur-complement FS, and the constrained pressure residual (CPR) method, where the inverse of corresponding matrices is approximated by using the restricted additive Schwarz (RAS) algorithm. Rigorous eigenvalue analysis for various FS preconditioners is also provided for facilitating the design of algorithms. In particular, our approach further enhances the numerical performance by presenting a family of multilevel field-split methods for efficiently preconditioning. Numerical experiments are presented to demonstrate the robustness and parallel scalability of the proposed preconditioning strategies for both standard benchmarks as well as realistic flow problems on a supercomputer.
Preprint
Solving differential equations is a critical task in scientific computing. Domain-specific languages (DSLs) have been a promising direction in achieving performance and productivity, but the current state of the art only supports stencil computation, leaving solvers requiring loop-carried dependencies aside. Alternatively, sparse matrices can represent such equation solvers and are more general than existing DSLs, but the performance is sacrificed. This paper points out that sparse matrices can be represented as programs instead of data, having both the generality from the matrix-based representation and the performance from program optimizations. Based on the idea, we propose the Staged Sparse Row (SSR) sparse matrix representation that can efficiently cover applications on structured grids. With SSR representation, users can intuitively define SSR matrices using generator functions and use SSR matrices through a concise object-oriented interface. SSR matrices can then be chained and applied to construct the algorithm, including those with loop-carried dependences. We then apply a set of dedicated optimizations, and ultimately simplify the SSR matrix-based codes into straightforward matrix-free ones, which are efficient and friendly for further analysis. Implementing BT pseudo application in the NAS Parallel Benchmark, with less than $10\%$ lines of code compared with the matrix-free reference FORTRAN implementation, we achieved up to $92.8\%$ performance. Implementing a matrix-free variant for the High-Performance Conjugate Gradient benchmark, we achieve $3.29\times$ performance compared with the reference implementation, while our implementation shares the same algorithm on the same programming abstraction, which is sparse matrices.
Article
The Sunway TaihuLight supercomputer is the world’s first system with a peak performance greater than 100 PFlops. In this paper, we provide a detailed introduction to the TaihuLight system. In contrast with other existing heterogeneous supercomputers, which include both CPU processors and PCIe-connected many-core accelerators (NVIDIA GPU or Intel Xeon Phi), the computing power of TaihuLight is provided by a homegrown many-core SW26010 CPU that includes both the management processing elements (MPEs) and computing processing elements (CPEs) in one chip. With 260 processing elements in one CPU, a single SW26010 provides a peak performance of over three TFlops. To alleviate the memory bandwidth bottleneck in most applications, each CPE comes with a scratch pad memory, which serves as a user-controlled cache. To support the parallelization of programs on the new many-core architecture, in addition to the basic C/C++ and Fortran compilers, the system provides a customized Sunway OpenACC tool that supports the OpenACC 2.0 syntax. This paper also reports our preliminary efforts on developing and optimizing applications on the TaihuLight system, focusing on key application domains, such as earth system modeling, ocean surface wave modeling, atomistic simulation, and phase-field simulation.
Conference Paper
Mantle convection is the fundamental physical process within earth's interior responsible for the thermal and geological evolution of the planet, including plate tectonics. The mantle is modeled as a viscous, incompressible, non-Newtonian fluid. The wide range of spatial scales, extreme variability and anisotropy in material properties, and severely nonlinear rheology have made global mantle convection modeling with realistic parameters prohibitive. Here we present a new implicit solver that exhibits optimal algorithmic performance and is capable of extreme scaling for hard PDE problems, such as mantle convection. To maximize accuracy and minimize runtime, the solver incorporates a number of advances, including aggressive multi-octree adaptivity, mixed continuous-discontinuous discretization, arbitrarily-high-order accuracy, hybrid spectral/geometric/algebraic multigrid, and novel Schur-complement preconditioning. These features present enormous challenges for extreme scalability. We demonstrate that---contrary to conventional wisdom---algorithmically optimal implicit solvers can be designed that scale out to 1.5 million cores for severely nonlinear, ill-conditioned, heterogeneous, and anisotropic PDEs.
Article
This paper presents a new heroic computing method for unstructured, low-order, finite-element, implicit nonlinear wave simulation: 1.97 PFLOPS (18.6% of peak) was attained on the full K computer when solving a 1.08T degrees-of-freedom (DOF) and 0.270T-element problem. This is 40.1 times more DOF and elements, a 2.68-fold improvement in peak performance, and 3.67 times faster in time-to-solution compared to the SC14 Gordon Bell finalist's state-of-the-art simulation. The method scales up to the full K computer with 663,552 CPU cores with 96.6% sizeup efficiency, enabling solving of a 1.08T DOF problem in 29.7 s per time step. Using such heroic computing, we solved a practical problem involving an area 23.7 times larger than the state-of-the-art, and conducted a comprehensive earthquake simulation by combining earthquake wave propagation analysis and evacuation analysis. Application at such scale is a groundbreaking accomplishment and is expected to change the quality of earthquake disaster estimation and contribute to society.
Conference Paper
Sparse triangular solvers are typically parallelized using level-scheduling techniques, but parallel efficiency is poor on high-throughput architectures like GPUs. We propose using an iterative approach for solving sparse triangular systems when an approximation is suitable. This approach will not work for all problems, but can be successful for sparse triangular matrices arising from incomplete factorizations, where an approximate solution is acceptable. We demonstrate the performance gains that this approach can have on GPUs in the context of solving sparse linear systems with a preconditioned Krylov subspace method. We also illustrate the effect of using asynchronous iterations.
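One way to picture the iterative approach described above is a Jacobi-type sweep on a lower triangular system. The abstract does not name the iteration used, so the choice of Jacobi, the function name `jacobi_lower_solve`, and the dense list-of-lists storage are assumptions made for this minimal sketch. Within one sweep every component is updated from the previous iterate only, so all updates are independent and could run in parallel.

```python
def jacobi_lower_solve(L, b, iters):
    """Approximate the solution of L x = b for lower triangular L.

    Each sweep computes x_new[i] = (b[i] - sum_{j<i} L[i][j] * x_old[j]) / L[i][i]
    using only the previous iterate, so the n updates are independent.
    For a triangular matrix the iteration is exact after at most n - 1 sweeps,
    but fewer sweeps may suffice when an approximate solution is acceptable.
    """
    n = len(b)
    # Initial guess: ignore all off-diagonal couplings.
    x = [b[i] / L[i][i] for i in range(n)]
    for _ in range(iters):
        x = [
            (b[i] - sum(L[i][j] * x[j] for j in range(i))) / L[i][i]
            for i in range(n)
        ]
    return x


L = [[2.0, 0.0, 0.0],
     [1.0, 3.0, 0.0],
     [0.0, -1.0, 4.0]]
b = [2.0, 5.0, 3.0]
x = jacobi_lower_solve(L, b, iters=3)
print(x)
```

In a preconditioning context the number of sweeps would typically be kept small and fixed, trading accuracy of the triangular solve for parallelism.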
Article
Advances in numerical weather prediction represent a quiet revolution because they have resulted from a steady accumulation of scientific knowledge and technological advances over many years that, with only a few exceptions, have not been associated with the aura of fundamental physics breakthroughs. Nonetheless, the impact of numerical weather prediction is among the greatest of any area of physical science. As a computational problem, global weather prediction is comparable to the simulation of the human brain and of the evolution of the early Universe, and it is performed every day at major operational centres across the world.
Article
In this work an ultra-scalable algorithm is designed and optimized to accelerate a 3D compressible Euler atmospheric model on the CPU-MIC hybrid system of Tianhe-2. We first reformulate the mesoscale model to avoid long-latency operations, and then employ carefully designed inter-node and intra-node domain decomposition algorithms to achieve balanced utilization of different computing units. Proper communication-computation overlap and concurrent data transfer methods are utilized to reduce the cost of data movement at scale. A variety of optimization techniques on both the CPU side and the accelerator side are exploited to enhance the in-socket performance. The proposed hybrid algorithm successfully scales to 6,144 Tianhe-2 nodes with a nearly ideal weak scaling efficiency, and achieves over 8 percent of the peak performance in double precision. This ultra-scalable hybrid algorithm may be of interest to the community for accelerating atmospheric models on increasingly dominant heterogeneous supercomputers.
Conference Paper
This paper presents a hybrid algorithm for the petascale global simulation of atmospheric dynamics on Tianhe-2, the world's current top-ranked supercomputer developed by China's National University of Defense Technology (NUDT). Tianhe-2 is equipped with both Intel Xeon CPUs and Intel Xeon Phi accelerators. A key idea of the hybrid algorithm is to enable flexible domain partition between an arbitrary number of processors and accelerators, so as to achieve a balanced and efficient utilization of the entire system. We also present an asynchronous and concurrent data transfer scheme to reduce the communication overhead between CPU and accelerators. The acceleration of our global atmospheric model is conducted to improve the use of the Intel MIC architecture. For the single-node test on Tianhe-2 against two Intel Ivy Bridge CPUs (24 cores), we can achieve 2.07×, 3.18×, and 4.35× speedups when using one, two, and three Intel Xeon Phi accelerators respectively. The average performance gain from SIMD vectorization on the Intel Xeon Phi processors is around 5× (out of the 8× theoretical case). Based on successful computation-communication overlapping, large-scale tests indicate that a nearly ideal weak-scaling efficiency of 93.5% is obtained when we gradually increase the number of nodes from 6 to 8,664 (nearly 1.7 million cores). In the strong-scaling test, the parallel efficiency is about 77% when the number of nodes increases from 1,536 to 8,664 for a fixed 65,664 × 5,664 × 6 mesh with 77.6 billion unknowns.
Conference Paper
We summarize the optimization and performance evaluation of the Nonhydrostatic ICosahedral Atmospheric Model (NICAM) on two different types of supercomputers: the K computer and TSUBAME2.5. First, we evaluated and improved several kernels extracted from the model code on the K computer. We did not significantly change the loop and data ordering for sufficient usage of the features of the K computer, such as the hardware-aided thread barrier mechanism and the relatively high bandwidth of the memory, i.e., a 0.5 Byte/FLOP ratio. Loop optimizations and code cleaning for a reduction in memory transfer contributed to a speed-up of the model execution time. The sustained performance ratio of the main loop of the NICAM reached 0.87 PFLOPS with 81,920 nodes on the K computer. For GPU-based calculations, we applied OpenACC to the dynamical core of NICAM. The performance and scalability were evaluated using the TSUBAME2.5 supercomputer. We achieved good performance results, which showed efficient use of the memory throughput performance of the GPU as well as good weak scalability. A dry dynamical core experiment was carried out using 2560 GPUs, which achieved 60 TFLOPS of sustained performance.
Article
The semi-implicit semi-Lagrangian integration technique enables numerical weather prediction models to be run with much longer timesteps than permitted by a semi-implicit Eulerian scheme. The choice of timestep can then be made on the basis of accuracy rather than stability requirements. To realize the full potential of the technique, it is important to maintain second-order accuracy in time; this has previously been achieved by applying it in the context of a three-time-level integration scheme. In this paper we present a two-time-level version of the technique which yields the same level of accuracy for half the computational effort. Unlike other efficient two-time-level schemes, ours does not rely on operator splitting.
Article
This paper presents a new fine-grained parallel algorithm for computing an incomplete LU factorization. All nonzeros in the incomplete factors can be computed in parallel and asynchronously, using one or more sweeps that iteratively improve the accuracy of the factorization. Unlike existing parallel algorithms, the amount of parallelism is large irrespective of the ordering of the matrix, and matrix ordering can be used to enhance the accuracy of the factorization rather than to increase parallelism. Numerical tests show that very few sweeps are needed to construct a factorization that is an effective preconditioner.
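The idea of computing all nonzeros of the incomplete factors independently, in sweeps, can be sketched as a fixed-point iteration on the equations (LU)_ij = a_ij restricted to the sparsity pattern of A. The code below is a minimal illustration under stated assumptions, not the paper's implementation: it uses the zero-fill pattern of A, dense list-of-lists storage, synchronous (Jacobi-style) sweeps rather than asynchronous updates, and a hypothetical function name `ilu0_sweeps`.

```python
def ilu0_sweeps(A, sweeps=5):
    """Approximate ILU(0) factors of A by fixed-point sweeps.

    Unknowns are l_ij (i > j) and u_ij (i <= j) on the nonzero pattern of A,
    with L unit lower triangular. Each sweep updates every unknown from the
    previous iterate only, so all updates within a sweep are independent:
        l_ij = (a_ij - sum_{k<j} l_ik * u_kj) / u_jj   for i > j
        u_ij =  a_ij - sum_{k<i} l_ik * u_kj           for i <= j
    """
    n = len(A)
    pattern = [(i, j) for i in range(n) for j in range(n) if A[i][j] != 0.0]
    # Initial guess: L from the strict lower triangle of A, U from the upper.
    L = [[1.0 if i == j else (A[i][j] if i > j else 0.0) for j in range(n)]
         for i in range(n)]
    U = [[A[i][j] if i <= j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        L_new = [row[:] for row in L]
        U_new = [row[:] for row in U]
        for i, j in pattern:
            s = sum(L[i][k] * U[k][j] for k in range(min(i, j)))
            if i > j:
                L_new[i][j] = (A[i][j] - s) / U[j][j]
            else:
                U_new[i][j] = A[i][j] - s
        L, U = L_new, U_new
    return L, U


# Tridiagonal test matrix: ILU(0) has no dropped fill, so L*U should equal A.
A = [[4.0, -1.0, 0.0, 0.0],
     [-1.0, 4.0, -1.0, 0.0],
     [0.0, -1.0, 4.0, -1.0],
     [0.0, 0.0, -1.0, 4.0]]
L, U = ilu0_sweeps(A, sweeps=10)
residual = max(
    abs(sum(L[i][k] * U[k][j] for k in range(4)) - A[i][j])
    for i in range(4) for j in range(4)
)
print(residual)
```

The tridiagonal example is chosen because its exact LU factors have no fill, so the converged sweeps can be checked against A directly; for general sparse matrices the product of the factors matches A only on the retained pattern.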