An OpenMP Approach to Modeling Dynamic Earthquake Rupture Along Geometrically Complex Faults on CMP Systems.
Xingfu Wu
Department of Computer Science &
Engineering
Texas A&M University
College Station, TX 77843
Email: wuxf@cse.tamu.edu
Benchun Duan
Department of Geology &
Geophysics
Texas A&M University
College Station, TX 77843
Email: duan@tamu.edu
Valerie Taylor
Department of Computer Science &
Engineering
Texas A&M University
College Station, TX 77843
Email: taylor@cse.tamu.edu
Abstract
Chip multiprocessors (CMP) are widely used for high
performance computing and are being configured in a
hierarchical manner to compose a CMP compute node in a
parallel system. OpenMP parallel programming within such
a CMP node can take advantage of the globally shared
address space and on-chip high inter-core bandwidth and
low inter-core latency. In this paper, we use OpenMP to
parallelize a sequential earthquake simulation code for
modeling spontaneous dynamic earthquake rupture along
geometrically complex faults on two CMP systems, an IBM
POWER5+ system and a SUN Opteron server. The
experimental results indicate that the OpenMP
implementation produces accurate output and scales well
on the two CMP systems. Further, we apply optimization
techniques such as large pages and processor binding to the
OpenMP implementation to achieve up to 7.05%
performance improvement on the CMP systems without any
code modification.
1. Introduction
There are no analytical solutions for a spontaneous dynamic
earthquake rupture with a friction law such as slip-
weakening friction operating on the fault. Thus, numerical
methods are required to study spontaneous dynamic rupture
processes on faults. The most widely used numerical codes
in the field of earthquake dynamic source models are based
on the finite difference method (FDM). But it is difficult for
FDM to deal with complex fault geometry and complex
geologic structures. Duan et al. [1, 2, 3] have been
developing and using an explicit dynamic finite element
method (EQdyna) to implement sequential simulations for
modeling spontaneous earthquake rupture on geometrically
complex faults, such as faults with bends, stepovers, or
branches. However, a sequential simulation takes from
more than 40 hours to several days for relatively small
earthquake model datasets (i.e., several to ten million
elements) on a SUN server with 4 dual-core AMD Opteron
processors. This means waiting several days to verify and
validate a model. Therefore, it is necessary to parallelize the
sequential earthquake simulation code in order to
significantly shorten the simulation time by efficiently
utilizing all processors within a CMP node. In this paper, we
use OpenMP to parallelize EQdyna and
discuss the OpenMP implementation in detail.
Today, the trend in high performance computing systems
has been shifting towards cluster systems with CMPs.
Further, CMPs are usually configured hierarchically to form
a compute node of parallel systems. For example, Hydra at
Texas A&M University Supercomputer Facility [13]
consists of nodes that have 8 DCMs (Dual-Chip Modules)
with one dual-core POWER5+ processor per DCM. While
CMP presents significant new opportunities such as on-chip
high inter-core bandwidth and low latency, it also presents
new challenges in the form of inter-core resource conflict
and contention. In [8, 6], it is argued that the full benefit of
these architectures will not be harnessed until the software
industry and community fully embrace parallel
programming. A challenge to be addressed is how well
current shared-memory parallel programming paradigms,
such as OpenMP, exploit the potential offered by such a
CMP node for scientific applications.
OpenMP [9, 10, 7] is the most popular shared-memory
parallel programming model. OpenMP is a set of compiler
directives and callable runtime library routines that extend
sequential programming languages such as Fortran, C and
C++ to express shared memory parallelism. OpenMP
provides a fork-join execution model in which an OpenMP
program begins execution as a single process (master
thread). The master thread executes sequentially until a
parallelization directive is invoked. Then, the master thread
creates a team of threads and becomes the master of the
team. The statements enclosed in the parallel region are
executed in parallel by each thread until a work-sharing
directive is invoked. A work-sharing directive such as “do”
or “sections” (or the combined forms “parallel do” and
“parallel sections”) distributes the workload among the
threads. All threads synchronize at the end of a work-sharing
construct unless a “nowait” clause is specified. Upon
completion of the parallel region, all threads synchronize
and only the master thread continues execution. The
advantage of OpenMP is that an existing sequential code can
be easily parallelized by inserting OpenMP directives around
time-consuming loops which do not contain data
dependencies, leaving the rest of the code unchanged.
This is the most common and cost-effective way to generate
a parallel program for utilizing the CMPs. Therefore, we use
OpenMP to parallelize EQdyna and to exploit the
parallelism of the code at the node level by fully utilizing all
processors.
Our validation and evaluation experiments conducted for
this work utilize two CMP systems with different numbers of
cores per node. Pangu is a SUN Opteron server with 4 dual-
core AMD Opteron processors. Hydra at Texas A&M
University Supercomputer Facility [13] is an IBM
POWER5+ cluster with 40 p5-575 nodes, and each node has
32 GB of memory and 8 DCMs (Dual-Chip Modules) with
one dual-core POWER5+ processor per DCM. Further, each
system has a different node memory hierarchy. We use two
production datasets to validate and evaluate our OpenMP
implementation of EQdyna. The experimental results
indicate that the OpenMP implementation produces accurate
output and scales well on the two CMP
systems. Further, we apply optimization techniques such
as large pages and processor binding to our
OpenMP implementation to achieve up to 6.04%
performance improvement on Pangu and up to 7.05%
performance improvement on Hydra without any code
modification.
The remainder of this paper is organized as follows.
Section 2 describes the sequential earthquake simulation
code EQdyna and its control flow. Section 3 proposes our
OpenMP implementation of EQdyna. Section 4 describes
the architecture and memory hierarchy of two CMP systems
used in our experiments. Section 5 evaluates and explores
performance characteristics of our OpenMP implementation,
and presents our optimization results using large page and
processor binding. Section 6 concludes this paper.
2. A Sequential Earthquake Simulation
EQdyna is an explicit finite element dynamic code which
has been under development since 2005. The code was
developed specifically to simulate spontaneous dynamic
rupture propagation along geometrically complex faults and
wave propagation in complex geologic structures [1, 2, 3].
The sequential version of EQdyna was verified in a
community-wide code validation exercise on seven
benchmark problems [4] by the end of 2008. A brief
description of mathematical and physical aspects of EQdyna
can be found in [1]. As an explicit finite element earthquake
simulation code, EQdyna does not need to solve a coupled
set of equations for solution because the coefficient matrix
is diagonal. The central difference method used in the code
is conditionally stable. Therefore, the time step used in
simulations, which is limited by the minimum element size
and wave speed in a model, must be small enough to ensure
numerical stability.
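The consequence of a diagonal coefficient matrix can be illustrated with a minimal Python sketch. It is not taken from EQdyna: the function name explicit_step and the velocity-then-displacement update order are assumptions made for illustration. Because the mass matrix is diagonal, each equation is advanced with a single division and no linear system is solved.

```python
def explicit_step(u, v, force, mass, dt):
    """Advance displacement u and velocity v by one time step dt.

    mass holds the diagonal of the mass matrix, one entry per equation,
    so the acceleration is a plain element-wise division.
    """
    a = [f / m for f, m in zip(force, mass)]            # a = M^-1 f, M diagonal
    v_new = [vi + dt * ai for vi, ai in zip(v, a)]      # update velocity
    u_new = [ui + dt * vi for ui, vi in zip(u, v_new)]  # update displacement
    return u_new, v_new


# Hypothetical two-equation example.
u, v = explicit_step(u=[0.0, 0.0], v=[0.0, 1.0],
                     force=[2.0, 0.0], mass=[4.0, 1.0], dt=0.5)
print(u, v)
```

Every entry is updated independently of the others, which is what makes loops of this kind natural candidates for loop-level parallelization.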
Figure 1. Control Flow of EQdyna
Figure 1 gives a control flow that displays the basic
structure of EQdyna. There are three main phases in the
program: Input phase, solution phase, and output phase.
During the input phase, the geometrical, material and
computational data in the model are read in. These data
include 1) execution control parameters such as total
simulation time, time step, stiffness damping coefficient,
etc.; 2) nodal coordinates; 3) nodal boundary conditions; 4)
initial conditions; 5) element data (topological data of
elements and material parameters for elements); 6) fault
data (fault node pairs, frictional coefficients, initial shear
and normal stresses, critical slip distance D0). In addition, a
couple of calculations are also performed in this phase. First,
global equation numbers and assembly mapping arrays are
established after nodal and element data are input. Second,
the material moduli matrix is computed and stored for each
type of material.
The main body of the code is the solution phase. There
are four main tasks in the solution phase: 1) forming the
left-hand-side diagonal mass matrix; 2) forming element
contribution to the right-hand-side force vector; 3)
computing the hourglass resistance contribution to the force
vector; 4) implementing the fault boundary and forming the
fault boundary constraint force contribution to the force
vector. The first task can be executed outside of the timestep
loop. The other three tasks are within the timestep loop.
The first three tasks involve iterations over all elements
in the model. Tasks 2) and 3), performed by the
functions qdct3 and hourglass respectively, are the most time-
consuming, because they involve loops over all elements in
the model at each timestep. They provide the element
contribution to the nodal force at each node of an element.
Element contributions to individual nodes are assembled
through assembling arrays established in the input phase.
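The assembly of element contributions into nodal forces can be sketched as a scatter-add. The Python fragment below is an illustration only: the names assemble, connectivity and element_forces are hypothetical, and the tiny one-dimensional mesh stands in for the three-dimensional meshes used by EQdyna.

```python
def assemble(n_nodes, connectivity, element_forces):
    """Sum per-element nodal contributions into one global nodal vector."""
    global_force = [0.0] * n_nodes
    for nodes, contrib in zip(connectivity, element_forces):
        for node, value in zip(nodes, contrib):
            global_force[node] += value  # shared nodes accumulate contributions
    return global_force


# Two 2-node elements sharing node 1 (a hypothetical 1-D mesh: 0-1-2).
connectivity = [(0, 1), (1, 2)]
element_forces = [(1.0, 2.0), (3.0, 4.0)]
total = assemble(3, connectivity, element_forces)
print(total)
```

Node 1 receives a contribution from both elements. When an element loop is split among threads, two elements that share a node update the same entry, so such accumulations need care.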
The output phase outputs the results for the fault data,
including time histories of fault slip velocity, slip, stresses
and rupture time, and time histories of particle velocity at
desired nodes off the fault.
3. OpenMP Implementation of EQdyna
OpenMP parallel programming within one CMP node can
take advantage of the globally shared address space and on-
chip high inter-core bandwidth and low inter-core latency.
The use of globally addressable memory on the CMP node
allows users to exploit parallelism by inserting OpenMP
compiler directives where applicable into a sequential
program to generate an OpenMP program. This is the most
common and cost-effective way to generate a parallel
program for utilizing the CMPs. Therefore, we use OpenMP
to parallelize EQdyna and to exploit the parallelism of
the code at the node level by efficiently utilizing all processors.
Following the control flow in Figure 1, Figure 2 presents the
high-level structure of our OpenMP implementation of
EQdyna, which minimizes the number of OpenMP parallel
regions. Because there are data dependencies between
timesteps and the number of timesteps is usually much
smaller than the number of nodes or elements (shown in
Table 2 in Section 5.1), we focus on the parallelization
inside each timestep. The functions qdct3 and hourglass
dominate the execution time of the sequential
EQdyna (more than 96% for the two datasets listed in
Table 2), so our OpenMP implementation focuses on these
two functions, which consist of very time-consuming loops
whose number of iterations equals the number of
elements in the dataset. We find that there is no data
dependency between the two functions qdct3 and hourglass.
Figure 2. OpenMP parallelization of EQdyna
For the sake of simplicity, Figure 2 only shows our
OpenMP implementation for the two time-consuming loops
in the functions qdct3 and hourglass. The OpenMP program
proceeds in the fork-join execution model shown in Figure 2.
First, it processes Input and qdct2, then enters the timestep
loop. We insert a parallelization directive (!$omp parallel)
just before the function qdct3. Second, when the
parallelization directive is invoked, the master thread (solid
line) forks several new threads (dashed lines). A
worksharing directive (!$omp do) is added inside the
function qdct3 so that the workload in qdct3 is divided
equally among the threads. Third, the threads process their
own workload in parallel and share the same address space
(green lines), so they can easily reference data that other
threads have updated. Because there is no data dependency
between the functions qdct3 and hourglass, the worksharing
directive in qdct3 is given the nowait clause. This is very
beneficial because threads that finish their share of qdct3
continue immediately to hourglass without waiting for all
threads to finish qdct3, which reduces the amount of time that
threads are idle. After all threads finish hourglass, they are
joined to the master thread. Then the program processes the
function faulting, and so on.
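How a loop over elements might be divided equally among threads, and how the fork-join pattern combines the results, can be sketched in Python. This is an illustration under stated assumptions rather than the behavior of any particular OpenMP runtime: it assumes a static partition into contiguous blocks, and the helper names static_chunks and fork_join_sum are hypothetical. Python threads are used here only to show the control structure, not to obtain a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def static_chunks(n_iters, n_threads):
    """Split iterations 0..n_iters-1 into n_threads contiguous, near-equal ranges."""
    base, extra = divmod(n_iters, n_threads)
    chunks, start = [], 0
    for t in range(n_threads):
        size = base + (1 if t < extra else 0)  # first 'extra' threads get one more
        chunks.append((start, start + size))
        start += size
    return chunks


def fork_join_sum(values, n_threads):
    """Fork workers over the chunks, then join by combining partial results."""
    chunks = static_chunks(len(values), n_threads)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        partial = list(pool.map(lambda c: sum(values[c[0]:c[1]]), chunks))
    return sum(partial)


chunks = static_chunks(4917488, 8)  # element count of dataset D1, 8 cores of Pangu
print(chunks[0], chunks[-1])
print(fork_join_sum(list(range(10)), 3))
```

Under this assumed partition, each of 8 threads would process 614,686 of the 4,917,488 elements of dataset D1.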
Of course, we also use OpenMP to parallelize other loops
in the earthquake simulation code, which is written in
Fortran 90. For instance, we find that large array operations
like brhs = brhs/alhs (where the arrays brhs and alhs have
more elements than the number of nodes) are also time-consuming.
So we use the following statements to parallelize the large
array operation (where neq is larger than the number of
nodes):
!$omp parallel do default(shared) private(i)
do i = 1,neq
   brhs(i) = brhs(i)/alhs(i)
enddo
!$omp end parallel do
Based on our experience with the OpenMP implementation,
it is important to avoid unintended sharing of variables
among threads. For instance, most data is shared by default,
and some data is made private explicitly in our OpenMP
implementation. However, one local logical variable, zerodl,
was initially not made private. As a result, the OpenMP
program ran much slower than its sequential counterpart,
because multiple OpenMP threads updated the shared
variable (to true or false) simultaneously and very
frequently, which left the conditions of some if-statements
unsatisfied. After the logical variable zerodl was made
private, the OpenMP program ran very fast.
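The effect of leaving a per-iteration flag shared can be illustrated with a small deterministic Python sketch. Only the variable name zerodl comes from the text; how EQdyna actually sets and tests it is not shown, so the test x == 0 and the function count_nonzero are hypothetical. The interleaving of the two workers is fixed by hand here, whereas a real race between threads is non-deterministic.

```python
def count_nonzero(items_a, items_b, private):
    """Two workers each set a flag, then test it; steps are interleaved A, B, A, B."""
    flags = {}
    key_a = "a" if private else "shared"
    key_b = "b" if private else "shared"
    count_a = count_b = 0
    for xa, xb in zip(items_a, items_b):
        flags[key_a] = (xa == 0)  # worker A writes its flag
        flags[key_b] = (xb == 0)  # worker B writes; overwrites A's value if shared
        if not flags[key_a]:      # worker A tests the flag
            count_a += 1
        if not flags[key_b]:      # worker B tests the flag
            count_b += 1
    return count_a, count_b


print(count_nonzero([1, 2, 3], [0, 0, 0], private=True))
print(count_nonzero([1, 2, 3], [0, 0, 0], private=False))
```

With private flags, worker A counts its three nonzero items. With a single shared flag, worker B overwrites the value before worker A tests it, so worker A's condition is never satisfied.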
We found that parallelizing the function faulting produced
incorrect results, mainly because results are written out
at a given time interval in the function. This function takes
relatively little time in the sequential run, so we keep the
function unchanged. We also tried different OpenMP
implementations of EQdyna, in particular parallelizing the
timestep loop. However, because the number of timesteps is
much smaller than the number of nodes or elements (shown
in Table 2) and there are data dependencies between
timesteps, parallelizing the timestep loop was not an
efficient OpenMP implementation. Based on our
experimental results, the OpenMP implementation proposed
in this paper is the most efficient of the implementations
we tried.
Table 1. Specifications of two CMP systems
Configurations    Hydra              Pangu
Total Nodes       40                 1
Cores/chip        2                  2
Cores/Node        16                 8
CPU type          1.9GHz POWER5+     2.6 GHz dual-core Opteron
Memory/Node       32GB               48GB
L1 Cache/CPU      64/32 KB           64/64 KB
L2 Cache/chip     1.92MB             1MB
L3 Cache/chip     36MB               NA
Table 2. Two Datasets of EQdyna
Datasets   Element Sizes   Total Nodes   Total Elements   Time Steps
D1         150 m           5,017,500     4,917,488        2,858
D2         100 m           10,625,471    10,447,920       7,500
4. Execution Testbeds
Details about the two CMP systems used for our
experiments are given in Table 1. These systems differ in
the following main features: number of processors per node,
configurations of node memory hierarchy, CPU speed,
multi-core processors, operating systems, and
communication networks.
Hydra at Texas A&M University Supercomputer Facility
[13] is an IBM POWER5+ cluster with 40 p5-575 nodes,
and each node has 32 GB of memory and 8 DCMs (Dual-
Chip Modules) with one dual-core POWER5+ processor per
DCM. Hydra runs IBM AIX 5.3 with a default page size of 4KB,
and it supports a user-level large page size of 64KB
using the ldedit or ld commands [5]. The SMT
(Simultaneous Multi-Threading) mode is not enabled for
regular use. IBM AIX provides the command bindprocessor
to bind a process to a physical processor, and provides the
environment variable XLSMPOPTS to bind a thread to a
physical processor. For example, XLSMPOPTS=startproc=0:stride=2
means binding threads to different processors on
different chips with one thread per chip (note that each chip
has two processor cores on Hydra).
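The placement implied by startproc=0:stride=2 can be worked out with a short Python sketch. The helper names bound_cpus and chips_used are hypothetical, and the sketch assumes that consecutive logical processor numbers (0 and 1, 2 and 3, and so on) belong to the same chip and that no wrap-around is needed for the chosen thread count.

```python
def bound_cpus(n_threads, startproc=0, stride=2):
    """Logical CPU for each thread: startproc, startproc + stride, ..."""
    return [startproc + t * stride for t in range(n_threads)]


def chips_used(cpus, cores_per_chip=2):
    """Chip index of each CPU, assuming consecutive CPU ids share a chip."""
    return [cpu // cores_per_chip for cpu in cpus]


cpus = bound_cpus(8)
print(cpus)
print(chips_used(cpus))
```

Under these assumptions, 8 threads on a 16-core Hydra node land on 8 different chips, one thread per chip, so no two threads share a chip.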
A SUN Opteron server, Pangu, from the Department of
Geology & Geophysics at TAMU has 4 dual-core AMD
Opteron processors and 48 GB of memory. Pangu runs the
SUN Solaris operating system with a default page size of 4KB,
and it supports user-level large page sizes of 2MB or 4MB
using the compiler option -xpagesize=2M or -xpagesize=4M [11]. SUN
Solaris dynamically schedules OpenMP threads to physical
processors, and provides the command pbind to bind a
thread/process to a physical processor. On Pangu, we use
the command pbind to develop a batch tool to automatically
bind multiple threads to different processors in order to
reduce the system overhead caused by the Solaris dynamic
scheduling.
5. Experimental Results and Performance
Analysis
5.1 Benchmark Problems
We use the benchmark problem TPV10 of the SCEC
(Southern California Earthquake Center) code validation
exercises [4, 12] to test our OpenMP implementation of
EQdyna. The benchmark solves dynamic rupture
propagation along a 60° dipping normal fault (30 km x 15
km) and wave propagation in a homogeneous three-
dimensional half space. Initial stress on the fault increases
linearly with depth. Two datasets, shown in Table 2, are
generated for our tests. In dataset 1 (D1), we use an element
size of 150 m (i.e., the edge length of brick elements near
the fault before being sheared to conform to the dipping fault
geometry) to create the finite element mesh, with a termination
time of 10 seconds for the simulation. In dataset 2 (D2), we
use an element size of 100 m and a termination time of 15
seconds, which are the parameters chosen by the SCEC exercise.
The model sizes of the two datasets are listed in Table 2.
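The time step implied by each dataset can be recovered from the termination time and the number of time steps in Table 2. The Python sketch below assumes that the time step is constant throughout a run, which the text does not state explicitly; the function name average_time_step is hypothetical.

```python
datasets = {
    # name: (element size in m, termination time in s, number of time steps)
    "D1": (150.0, 10.0, 2858),
    "D2": (100.0, 15.0, 7500),
}


def average_time_step(termination_time, n_steps):
    """Average step length implied by the run length and the step count."""
    return termination_time / n_steps


steps = {name: average_time_step(t_end, n)
         for name, (_, t_end, n) in datasets.items()}
for name, dt in steps.items():
    print(name, round(dt, 6))
```

Under this assumption the step is about 0.0035 s for D1 and 0.002 s for D2, consistent with the requirement in Section 2 that a smaller element size calls for a smaller time step.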
Figure 3. Results of dynamic rupture obtained with
dataset D2. (a) Rupture time contours on the fault plane.
The spacing of two adjacent contour lines is 0.5 second.
(b) Vertical slip velocity at the fault station located at 0 km
down-dip and 0 km along-strike.
5.2 Experimental Results
To validate the OpenMP implementation of the sequential
code EQdyna, we run the above two datasets on
the two platforms, the SUN server Pangu and TAMU Hydra.
We found that our OpenMP
implementation generates accurate output results. Figure
3 shows (a) rupture time contours on the fault plane and (b)
vertical slip velocity at the fault station with 0 km of both
down-dip distance and along-strike distance, obtained with
dataset D2. These results have been verified within the
SCEC code validation community [4, 12].
5.3 Performance Analysis and Optimization
In this section, we analyze the performance of our OpenMP
implementation on the two CMP systems, and use large
pages and processor binding to further optimize the code.
Figure 4. Function-level performance of our OpenMP
implementation on Pangu. [Plot "Function Performance on Pangu":
time (seconds) versus number of cores for total runtime, Input,
qdct2, qdct3, hourglass, and faulting]
Figure 5. Function-level performance of our OpenMP
implementation on Hydra. [Plot "Function Performance on Hydra":
time (seconds) versus number of cores for total runtime, Input,
qdct2, qdct3, hourglass, and faulting]