
Performance and scalability of finite-difference and finite-element wave-propagation modeling on Intel's Xeon Phi

Elena Zhebel*, Shell Global Solutions International B.V.; Sara Minisini, Shell Global Solutions International B.V.; Alexey Kononov, Source Contracting; Wim Mulder, Shell Global Solutions International B.V. and Delft University of Technology

SUMMARY

With the rapid developments in parallel compute architectures, algorithms for seismic modeling and imaging need to be reconsidered in terms of parallelization. The aim of this paper is to compare the scalability of three seismic modeling algorithms: finite differences, continuous mass-lumped finite elements and discontinuous Galerkin finite elements. The performance of these methods is considered at a given accuracy. The experiments were performed on an Intel Sandy Bridge dual 8-core machine and on Intel's 61-core Xeon Phi, which is based on the Many Integrated Core architecture. The codes ran without any modifications. On the Sandy Bridge, the scalability is similar for all methods. On the Xeon Phi, the finite elements outperform the finite differences in terms of scalability on larger numbers of cores.

INTRODUCTION

High-performance computing develops rapidly and new technologies keep appearing on the market. A few years ago, a new generation of graphics processing units (GPUs) appeared, offering tera-FLOPS performance on a single card. The GPU was originally designed to accelerate the manipulation of images in a frame buffer that was mapped to an output display. Later on, dedicated GPUs found use as general-purpose coprocessors suited for parallelizable applications. In 2012, Intel released its Xeon Phi coprocessor, based on the Many Integrated Core (MIC) architecture.

With the changes in compute architectures, the algorithms for seismic modeling and imaging need to be reconsidered in terms of parallelization and scalability. Scalability refers to the ability of an algorithm to sustain an increase in performance as the number of cores increases.

The finite-difference method has become the workhorse for time-domain modeling of the wave equation, with applications in acquisition optimization, development and testing of seismic processing algorithms, reverse-time migration and full-waveform inversion. Its advantages are the relative ease of coding and parallelization, the use of high-order spatial discretization schemes and explicit time stepping, and its computational speed. The common opinion and experience is that finite differences perform well on rectangular domains with smooth velocity variations. However, in the case of an irregular free surface or sharp contrasts in the medium properties, they lose accuracy when a Cartesian coordinate system is used. If an interface does not follow the grid, the staircasing effect generates first-order errors. Because the solution is continuous but not differentiable across an impedance contrast, a local second-order error is incurred as well.

Finite-element methods have some advantages over finite differences because they can easily handle geometric or property discontinuities by using unstructured meshes and local spatial refinement. For instance, meshes can be used that consist of tetrahedral elements whose size scales with the local velocity. In this way, the resulting meshes have fewer elements while maintaining the same number of points per wavelength throughout the computational domain. Finite elements that follow sharp interfaces do not suffer from a loss of accuracy. Moreover, they offer flexibility in mixing discretization orders, mixing element geometries and deploying hybrid discretizations. The choice of a suitable time-discretization scheme enables explicit time stepping. If the standard finite-element approach is used, a large sparse linear system of equations has to be solved at each time step, which has a negative impact on performance. There are a number of finite-element techniques that avoid the inversion of the large sparse matrix, for example, spectral elements, discontinuous Galerkin finite elements and continuous mass-lumped finite elements. The last two will be considered here.

Several authors have compared the various methods (e.g., Fornberg, 1987; Moczo et al., 2011; Chaljub et al., 2007; Wang et al., 2010; Pasquetti and Rapetti, 2004; De Basabe and Sen, 2007). Zhebel et al. (2012a) considered the continuous mass-lumped and the discontinuous Galerkin finite elements in terms of accuracy, stability and computational cost. Numerical experiments on 3-D problems showed that both methods have similar stability conditions and require a comparable computational time to obtain a result with a given accuracy, assuming that the stiffness and mass matrices are pre-assembled. Here, we recompute them on the fly to save storage. Another paper (Zhebel et al., 2013) compares the continuous mass-lumped finite elements to the finite-difference method. Zhebel et al. (2012b) cover the stability analysis.

The goal of the present paper is to compare finite-difference and finite-element algorithms in terms of scalability on many-core architectures. The next section reviews the methods. Then, we provide details of the compute architectures, followed by the results of experiments with the three methods on the different architectures.

METHODS

We consider the 3-D constant-density acoustic wave equation
\[
\frac{1}{c^2}\,\frac{\partial^2 u}{\partial t^2}
- \frac{\partial^2 u}{\partial x^2}
- \frac{\partial^2 u}{\partial y^2}
- \frac{\partial^2 u}{\partial z^2} = s, \qquad (1)
\]


on a bounded domain Ω ⊂ ℝ³, where u(t, x, y, z) is the pressure wave field, c(x, y, z) is the velocity of the medium and s(t, x, y, z) is the source function, with (x, y, z) ∈ Ω. If the domain Ω has topography, the free surface can have reflecting (zero-pressure) boundary conditions. Absorbing boundary conditions are imposed elsewhere.

By choosing a symmetric time-marching scheme with a time step Δt, for example, the leap-frog method, we obtain the following algebraic system of equations,
\[
u^{n+1} = 2u^{n} - u^{n-1} + \Delta t^{2}\left(-L u^{n} + s^{n}\right), \qquad (2)
\]
where L denotes a spatial discretization. The only unknown is the vector u^{n+1}. The values of the solution at the previous time steps, n and (n−1), are known. For the spatial discretization, we consider a finite-difference method (FD), symmetric interior penalty discontinuous Galerkin finite elements (DG) and continuous mass-lumped finite elements (ML).
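Equation 2 maps directly onto a time loop in which three solution buffers are rotated. The following minimal sketch illustrates this driver loop; the function and type names are illustrative assumptions, and the per-step kernel is a placeholder for whichever spatial discretization (FD, DG or ML) evaluates the right-hand side.

```c
#include <stddef.h>

/* Leap-frog time loop for equation 2: u^{n+1} = 2u^n - u^{n-1} + dt^2 (-L u^n + s^n).
 * apply_rhs is a hypothetical callback that evaluates -L u^n + s^n into rhs
 * for the chosen spatial discretization; all names are illustrative. */
typedef void (*rhs_fn)(const float *u, float *rhs, size_t ndof, int step, void *ctx);

void leapfrog(float *u_new, float *u_cur, float *u_old, float *rhs,
              size_t ndof, int nt, float dt, rhs_fn apply_rhs, void *ctx)
{
    const float dt2 = dt * dt;
    for (int n = 0; n < nt; ++n) {
        apply_rhs(u_cur, rhs, ndof, n, ctx);       /* rhs = -L u^n + s^n      */
        for (size_t d = 0; d < ndof; ++d)          /* update of equation 2    */
            u_new[d] = 2.0f * u_cur[d] - u_old[d] + dt2 * rhs[d];

        float *tmp = u_old;                        /* rotate the three buffers */
        u_old = u_cur;
        u_cur = u_new;
        u_new = tmp;
    }
}
```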

Finite differences

Problem 1 is discretized by central finite differences on a regular Cartesian grid in the rectangular domain represented by Ω = {(x_i, y_j, z_k) | x_i = x_0 + iΔx, y_j = y_0 + jΔy, z_k = z_0 + kΔz, i, j, k ∈ ℕ}, where Δx, Δy and Δz are the grid spacings in the x-, y- and z-directions, respectively, and i = 0, 1, ..., N_x − 1, j = 0, 1, ..., N_y − 1, k = 0, 1, ..., N_z − 1. With this, the acoustic wave equation in the second-order formulation at one point in space and time becomes
\[
u^{n+1}_{i,j,k} = 2u^{n}_{i,j,k} - u^{n-1}_{i,j,k}
+ \Delta t^{2}\, c^{2}_{i,j,k}\left(
-\left[D_{xx} u^{n}\right]_{i,j,k}
-\left[D_{yy} u^{n}\right]_{i,j,k}
-\left[D_{zz} u^{n}\right]_{i,j,k}
+ s^{n}_{i,j,k}\right), \qquad (3)
\]
where D_xx, D_yy and D_zz denote the discretized second derivatives in the x-, y- and z-directions, respectively. Time is sampled at t_n = T_min + nΔt, with Δt = (T_max − T_min)/N_T and n = 0, ..., N_T. The pressure field at the next time step, u^{n+1}, depends on the current, n, and the previous, (n−1), time steps.
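As an illustration of this update, a minimal serial sketch of one leap-frog step with 4th-order central differences for the second derivatives (the order used in the experiments later on) is given below. The right-hand side is written here as c²(Δ_h u^n + s^n), following equation 1 with Δ_h the discrete Laplacian. The function and array names are illustrative and not taken from the authors' MPI-parallel production code; boundary conditions and halos are omitted.

```c
#include <stddef.h>

/* One leap-frog step of u_tt = c^2 (Laplacian(u) + s), equation 1, with
 * 4th-order central differences.  Flat row-major arrays of size nx*ny*nz;
 * c2 holds the squared velocity.  Hypothetical helper for illustration only. */
void fd_step(float *restrict unew, const float *restrict u,
             const float *restrict uold, const float *restrict c2,
             const float *restrict s, size_t nx, size_t ny, size_t nz,
             float dx, float dy, float dz, float dt)
{
    /* 4th-order coefficients for d^2/dx^2: (-1, 16, -30, 16, -1) / (12 h^2) */
    const float ax = 1.0f / (12.0f * dx * dx);
    const float ay = 1.0f / (12.0f * dy * dy);
    const float az = 1.0f / (12.0f * dz * dz);
    const float dt2 = dt * dt;

    #define IDX(i, j, k) (((i) * ny + (j)) * nz + (k))
    for (size_t i = 2; i < nx - 2; ++i)
        for (size_t j = 2; j < ny - 2; ++j)
            for (size_t k = 2; k < nz - 2; ++k) {
                size_t m = IDX(i, j, k);
                float dxx = ax * (-u[IDX(i-2,j,k)] + 16.0f*u[IDX(i-1,j,k)]
                                  - 30.0f*u[m] + 16.0f*u[IDX(i+1,j,k)]
                                  - u[IDX(i+2,j,k)]);
                float dyy = ay * (-u[IDX(i,j-2,k)] + 16.0f*u[IDX(i,j-1,k)]
                                  - 30.0f*u[m] + 16.0f*u[IDX(i,j+1,k)]
                                  - u[IDX(i,j+2,k)]);
                float dzz = az * (-u[IDX(i,j,k-2)] + 16.0f*u[IDX(i,j,k-1)]
                                  - 30.0f*u[m] + 16.0f*u[IDX(i,j,k+1)]
                                  - u[IDX(i,j,k+2)]);
                /* u^{n+1} = 2u^n - u^{n-1} + dt^2 c^2 (Lap u^n + s^n) */
                unew[m] = 2.0f*u[m] - uold[m]
                          + dt2 * c2[m] * (dxx + dyy + dzz + s[m]);
            }
    #undef IDX
}
```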

The legacy finite-difference code makes use of distributed-memory parallelization with MPI, also when used on a single many-core node. Grid partitioning is applied to divide the domain Ω into a number of subdomains. From expression 3, it is clear that the computation at a single grid point (i, j, k) requires information from its neighbors in the three coordinate directions. The subdomains therefore require additional points for the calculations, called halos. At each time step, the points in the halos need to be exchanged to synchronize their values. The parallelization for a distributed-memory design thus requires extra memory and incurs communication overhead.
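A minimal sketch of such a halo exchange is shown below for a one-dimensional decomposition along z with two-point halos, as required by the 4th-order stencil. The actual code partitions the grid in up to three directions; the function name, buffer layout and use of MPI_Sendrecv are illustrative assumptions rather than the production implementation.

```c
#include <mpi.h>
#include <stddef.h>

/* Exchange the two z-halo planes with the neighbouring subdomains in a 1-D
 * decomposition along z.  nplane = nx*ny points per z-plane; the local array u
 * holds nz_local interior planes plus `halo` planes on each side.  Ranks at the
 * domain ends can pass MPI_PROC_NULL.  Hypothetical sketch for illustration. */
void exchange_halos_z(float *u, size_t nplane, size_t nz_local,
                      int rank_below, int rank_above, MPI_Comm comm)
{
    const int    halo  = 2;                          /* width of the 4th-order stencil */
    const size_t count = (size_t)halo * nplane;

    /* Send the lowest interior planes down, receive the upper halo from above. */
    MPI_Sendrecv(u + (size_t)halo * nplane,      (int)count, MPI_FLOAT, rank_below, 0,
                 u + (halo + nz_local) * nplane, (int)count, MPI_FLOAT, rank_above, 0,
                 comm, MPI_STATUS_IGNORE);

    /* Send the highest interior planes up, receive the lower halo from below. */
    MPI_Sendrecv(u + nz_local * nplane, (int)count, MPI_FLOAT, rank_above, 1,
                 u,                     (int)count, MPI_FLOAT, rank_below, 1,
                 comm, MPI_STATUS_IGNORE);
}
```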

Discontinuous Galerkin

We briefly review the formulation of the problem when using the symmetric interior penalty discontinuous Galerkin method. A detailed derivation can be found in Rivière (2008). We choose the symmetric and conservative form of the scheme. In this method, the solution is allowed to be discontinuous across element boundaries. Because we want continuous solutions in seismic applications, a penalty is added to keep the discontinuities small, of the order of the numerical discretization error.

To discretize the wave equation 1 with finite elements, we use the weak formulation
\[
\int_{\Omega} \frac{1}{c^{2}}\,\frac{\partial^{2} u}{\partial t^{2}}\, v \,\mathrm{d}\Omega
+ \int_{\Omega} \nabla u \cdot \nabla v \,\mathrm{d}\Omega
- \int_{\Gamma} \left(\mathbf{n} \cdot \nabla u\right) v \,\mathrm{d}\Gamma
= \int_{\Omega} s\, v \,\mathrm{d}\Omega, \qquad (4)
\]
for all test functions v that are chosen as polynomials up to degree M. Here, n denotes the outward normal and Γ consists of the internal and external boundaries of the domain Ω.

The domain Ω can be partitioned into tetrahedral elements τ. To have a certain number of points per wavelength, we require the diameter of the inscribed sphere to scale with the dominant wavelength, hence with the local velocity. The meshing approach has been described elsewhere (Kononov et al., 2012).

Since the solution is discontinuous across the internal boundaries, additional computations of so-called fluxes are needed to make the solution continuous. The term with the normal in equation 4 is a flux term that is given on the faces of the elements. It consists of incoming and outgoing fluxes.
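The abstract does not spell out these face integrals. For reference, in the generic symmetric interior penalty formulation (Rivière, 2008), the normal-derivative term on an interior face F shared by two elements is replaced by the symmetric combination
\[
-\int_{F} \{\!\{\nabla u \cdot \mathbf{n}\}\!\}\,[\![v]\!]\,\mathrm{d}F
-\int_{F} [\![u]\!]\,\{\!\{\nabla v \cdot \mathbf{n}\}\!\}\,\mathrm{d}F
+\int_{F} \frac{\sigma}{h_{F}}\,[\![u]\!]\,[\![v]\!]\,\mathrm{d}F,
\]
where [[·]] denotes the jump and {{·}} the average across the face, h_F a local face size and σ the penalty parameter that keeps the discontinuities small. This is only the standard textbook form, quoted here as an assumed illustration; the scaling of the penalty may differ in detail from the authors' implementation.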

By discretizing the weak formulation 4 with leap-frog time stepping, we obtain for each element τ,
\[
u^{n+1}_{\tau} = 2u^{n}_{\tau} - u^{n-1}_{\tau}
+ \Delta t^{2}\, M^{-1}\!\left(-K u^{n}_{\tau} - F^{+} u^{n}_{\tau} - F^{-} u^{n}_{\tau^{-}} + s^{n}_{\tau}\right),
\]
where M and K are the local mass and stiffness matrix, respectively. Depending on the element degree M, the matrices have different sizes; see Table 1. We also have the contribution of the fluxes. The term F+ denotes the sum of the outgoing fluxes over the four faces of the given element and can be stored in one matrix. The second term, F−, contains the incoming fluxes from the four neighboring elements, τ−, and requires the storage of four matrices. Note that for elements that lie on the boundary, fewer flux matrices need to be stored, since one or more neighbors are absent.

The implementation of DG uses shared-memory parallelization with OpenMP on many-core architectures. The parallelization is performed over the elements in space: at each time step, the computations for the individual elements are performed in parallel, and the information from the neighboring elements is read from shared memory. To save memory, the matrices M, K and F+ and the four F− of each element are reassembled at each time step.
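A minimal sketch of such a time step is given below: an OpenMP loop over the elements in which each element's update is computed independently, reading the neighbors' degrees of freedom from shared memory. The data layout, the callback that applies −K, −F+ and −F− per element, and all names are illustrative assumptions, not the authors' code; in the actual implementation these matrices are reassembled inside the loop.

```c
#include <stddef.h>

/* One DG leap-frog step, parallelized over elements with OpenMP.
 * N     : degrees of freedom per element (Table 1), assumed <= 64 here
 * u,uold: solution at time levels n and n-1, nelem*N values, element-contiguous
 * minv  : inverse local mass matrices, N*N per element
 * src   : projected source term, N values per element
 * rhs_op: hypothetical callback evaluating -K u_e - F+ u_e - F- u_{neighbours}
 *         for element e (read-only access to the neighbours through u).       */
typedef void (*dg_rhs_fn)(size_t e, const float *u, float *rhs_out, void *ctx);

void dg_step(float *unew, const float *u, const float *uold,
             const float *minv, const float *src,
             size_t nelem, size_t N, float dt, dg_rhs_fn rhs_op, void *ctx)
{
    const float dt2 = dt * dt;
    #pragma omp parallel
    {
        float rhs[64];                       /* thread-private local right-hand side */
        #pragma omp for schedule(static)
        for (size_t e = 0; e < nelem; ++e) {
            const float *ue  = u    + e * N;
            const float *uoe = uold + e * N;
            const float *Mi  = minv + e * N * N;
            float       *une = unew + e * N;

            rhs_op(e, u, rhs, ctx);          /* rhs = -K u_e - F+ u_e - F- u_nb */
            for (size_t i = 0; i < N; ++i)
                rhs[i] += src[e * N + i];    /* add the source term s_e          */

            /* u^{n+1}_e = 2 u^n_e - u^{n-1}_e + dt^2 M^{-1} rhs */
            for (size_t i = 0; i < N; ++i) {
                float acc = 0.0f;
                for (size_t j = 0; j < N; ++j)
                    acc += Mi[i * N + j] * rhs[j];
                une[i] = 2.0f * ue[i] - uoe[i] + dt2 * acc;
            }
        }
    }
}
```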

Continuous mass-lumped finite elements

The weak formulation of equation 1 for the continuous mass-lumped finite-element method is given by equation 4, but now Γ only refers to the external boundaries, because the test functions and the solution are assumed to be continuous. Discretizing equation 4 with mass-lumped finite elements in space and with second-order finite differences in time, we obtain a global, fully algebraic system of the form
\[
u^{n+1} = 2u^{n} - u^{n-1} + \Delta t^{2}\, M^{-1}\!\left(-K u^{n} + s^{n}\right). \qquad (5)
\]
The matrix M is diagonal, avoiding the need to solve a large sparse linear system of equations.


Table 1: The size of the matrices per element used in DG and ML is N × N and depends on the degree M.

degree M    N (DG)    N (ML)
    1          4         4
    2         10        23
    3         20        50
    4         35         –

Chin-Joe-Kong et al. (1999) describe the construction of the mass-lumped tetrahedral elements in detail.

The global discretized problem 5 can be reformulated element by element, improving the parallelization of the time stepping. In this case, depending on the element degree M, the matrices have different sizes; see Table 1. The mass matrix was pre-assembled before the time stepping started, but the stiffness matrices were reassembled on the fly during the time stepping. The contribution per element was determined from the 6 precomputed matrices of size N × N for the reference element and 6 scalar factors related to the local coordinate transformation of each element. In this way, only two vectors for the solution and one for the inverse mass matrix need to be stored, each of the size of the total number of degrees of freedom. The velocity c(x, y, z) can be absorbed into the inverse mass matrix. The shared-memory parallelization of ML uses OpenMP. During each time step, the elements τ are treated in parallel.
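A sketch of one such time step is given below, with the per-element stiffness action assumed to be a weighted sum of the 6 reference matrices, the weights being the 6 geometry factors mentioned above. All names, the data layout and the atomic scatter onto shared nodes are illustrative assumptions; the abstract does not state how the authors avoid write conflicts between elements that share degrees of freedom.

```c
#include <stddef.h>

/* One ML leap-frog step with element-by-element, on-the-fly stiffness assembly.
 * Kref : 6 reference stiffness matrices of size N x N (N <= 64 assumed here)
 * geom : 6 scalar geometry factors per element
 * conn : N global dof indices per element
 * minv : diagonal inverse mass matrix (velocity absorbed), ndof values
 * src  : projected source term per global dof.  Illustrative sketch only.     */
void ml_step(float *unew, const float *u, const float *uold,
             const float *minv, const float *src,
             const float *Kref, const float *geom, const int *conn,
             size_t nelem, size_t N, size_t ndof, float dt)
{
    const float dt2 = dt * dt;

    #pragma omp parallel for schedule(static)
    for (size_t d = 0; d < ndof; ++d)
        unew[d] = src[d];                      /* reuse unew as scratch: rhs = s */

    #pragma omp parallel
    {
        float ue[64], ke[64];                  /* thread-private element buffers */
        #pragma omp for schedule(static)
        for (size_t e = 0; e < nelem; ++e) {
            const int   *idx = conn + e * N;
            const float *g   = geom + 6 * e;

            for (size_t i = 0; i < N; ++i)     /* gather local solution          */
                ue[i] = u[idx[i]];

            for (size_t i = 0; i < N; ++i) {   /* K_e u_e = sum_m g_m Kref_m u_e */
                float acc = 0.0f;
                for (size_t m = 0; m < 6; ++m) {
                    const float *Km = Kref + m * N * N + i * N;
                    float row = 0.0f;
                    for (size_t j = 0; j < N; ++j)
                        row += Km[j] * ue[j];
                    acc += g[m] * row;
                }
                ke[i] = acc;
            }

            for (size_t i = 0; i < N; ++i) {   /* scatter -K_e u_e; nodes shared */
                #pragma omp atomic
                unew[idx[i]] -= ke[i];
            }
        }
    }

    /* unew now holds (-K u + s); finish the leap-frog update of equation 5. */
    #pragma omp parallel for schedule(static)
    for (size_t d = 0; d < ndof; ++d)
        unew[d] = 2.0f * u[d] - uold[d] + dt2 * minv[d] * unew[d];
}
```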

COMPUTE ARCHITECTURES

The codes were run on two architectures: Intel's standard x86 processor, the Sandy Bridge, and a Xeon Phi based on Intel's Many Integrated Core architecture.

The Sandy Bridge-based compute nodes have two eight-core Intel E5-2670 2.6 GHz CPUs.

The Xeon Phi is a coprocessor based on Intel's Many Integrated Core architecture. The difference between a processor and a coprocessor is that a processor does not require another compute device to be present in a working system, whereas a coprocessor cannot be used as an independent device: it has to be connected to a processor (host). In our case, the Intel Xeon Phi card is connected to the host through a PCIe ×16 bus.

There are two ways to use the Intel Xeon Phi card. The first is to run the whole application on the Intel Xeon Phi, called its native mode. The second is to identify the parallel parts of the algorithm and run the program on the host, letting only those parallel parts run on the MIC, also called offload mode. In our experiments, we only used the Xeon Phi in the first way, by running the whole application on a single card. For this, the codes were recompiled with the latest Intel compiler (ICS 2013). The Intel Xeon Phi coprocessor runs Linux and each card has its own IP address. Each Xeon Phi coprocessor combines 61 Intel cores on a single chip, running at a frequency of 1.1 GHz. One core is reserved for the operating system. The maximum number of threads per core is 4, allowing a total of 244 threads. The Xeon Phi has 8 GB of GDDR5 memory and a bandwidth of 320 GB/s.

Figure 1: (a) Velocity model for the model problem. The source is denoted by a red star and the receivers by yellow triangles. (b) Snapshot of the wave field.

The FD code, containing MPI calls, and the DG and ML codes, parallelized with OpenMP pragmas, did not require any modification.

RESULTS

We consider seismic modeling with equation 1, which represents the essential part of migration and inversion algorithms. The problem size was chosen such that it fits in the 8 GB of Xeon Phi memory and was kept the same for the experiments on the Sandy Bridge. This means we only consider strong scaling, as opposed to weak scaling, where the problem size per core is kept constant.
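The abstract does not define speed-up explicitly; the figures below are read here with the usual strong-scaling convention, assumed for reference:
\[
S(p) = \frac{T(1)}{T(p)}, \qquad E(p) = \frac{S(p)}{p},
\]
where T(p) is the wall-clock time for the fixed problem size on p cores, S(p) the speed-up plotted in Figures 3 and 4, and E(p) the parallel efficiency.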

The goal of the experiments is to compare the finite-element and finite-difference algorithms in terms of their scalability on many-core architectures.

The model problem is the dipping interface shown in Figure 1(a). We consider a 3-D domain of size (2 km)³ with two half-spaces having velocities of 1.5 and 3.0 km/s. The interface runs from 0.7 to 1.3 km depth between 0 and 2 km in the x-direction, so the dip angle is 16.7°. A shot is located at (779.7, 1000, 516.3) m, 350 m above the interface in the shallow low-velocity part of the model. The receivers are located 250 m above the interface and have offsets from 100 to 700 m at a 25-m interval, parallel to the interface in the down-dip direction from the source.

Figure 2 shows the errors as a function of the total number of degrees of freedom N, using either the finite-difference scheme with a 4th-order approximation of the second derivatives in each coordinate direction, the M = 3 type 2 continuous mass-lumped finite-element method, or the discontinuous Galerkin finite elements of degree 3. The finite-element methods therefore also have 4th-order spatial accuracy and a 2nd-order temporal error. Note that the 4th-order accuracy of the finite-difference scheme is lost near discontinuities in the model. The finite-difference code was run on grids with 201³, 251³, 301³, 351³, 401³ and 501³ points. The continuous mass-lumped and the discontinuous Galerkin finite-element codes ran on meshes with 294,508, 567,071 and 2,320,289 elements, respectively.


Figure 2: Errors for the model problem as a function of 1/h = N^{1/3}, where N is the number of grid points for the finite-difference method (FD) or the number of degrees of freedom for the finite-element methods (ML and DG). The dashed lines represent the maximum norm, the solid lines the 2-norm. The green line corresponds to a spatial fourth-order error. Second-order time stepping was used.

Figure 3: Speed-up on Intel's Sandy Bridge. The red line with crosses denotes FD, the black line with circles ML and the blue line with squares DG. The solid line represents ideal speed-up.

For the finite differences and for the finite elements, the direct wave was modeled on the same mesh but with a constant velocity of 1.5 km/s and then subtracted, to retain only the reflected/refracted event. It is obvious that, asymptotically, the ML and DG finite elements are more accurate than the finite differences, because the mesh follows the interface.

For the experiments on the Intel Xeon Phi, we selected the FD problem of size 301³ and the mesh with 567,071 tetrahedra for the finite elements. For these problems, FD, ML and DG have a similar accuracy of about 3%. In total, ML had 28,353,550 and DG had 11,341,420 degrees of freedom. Figures 3 and 4 show the speed-up for the finite differences and the two finite-element methods on Intel's Sandy Bridge and Xeon Phi, respectively.

Figure 4: Speed-up on Intel's Xeon Phi. The red crosses denote FD, the black circles ML and the blue line with squares DG. The solid line represents ideal speed-up.

The results on the Sandy Bridge show that the speed-up measured on one socket (up to 8 cores) is close to optimal for all three methods. To make sure that each process keeps running on the same core and is not dynamically relocated, with the associated cache losses, we forced affinity. When using more than 8 cores, affinity was switched off. Since the DG code is more computationally intensive than ML and FD, due to the calculation of the fluxes, its speed-up is larger. The results on the Xeon Phi show that the speed-up is almost linear for fewer than 50 threads. For a larger number of threads, the efficiency is reduced. As in the experiments with FD, we note a performance peak at 60, 120 and 180 threads, where the cores run at maximum capacity; this is especially obvious with DG. FD uses one thread for I/O and program control, which is mostly idle during the computations; this thread is excluded from the core count in the graphs. The FD code attempts to find a subdivision that favors cube-like subdomains, but the choice of the number of cores limits the options, and the specific choice of subdivision impacts the performance. With 301³ points on the grid, a subdivision on, for instance, 173+1 cores into 173 subdomains in one coordinate direction does not make sense, so cases like that are absent from the graph.
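The preference for cube-like subdomains can be made concrete with a small search over the factorizations of the number of compute cores. The routine below is only an assumed illustration of such a heuristic, not the decomposition logic of the FD code; for a prime core count such as 173 it necessarily degenerates to a 173 × 1 × 1 slab decomposition, which is why such cases are uninteresting for a 301³ grid.

```c
#include <limits.h>

/* Pick a factorization p = px*py*pz that makes the subdomains of an
 * nx x ny x nz grid as cube-like as possible, by minimizing the halo
 * (surface) area per subdomain.  Illustrative heuristic only. */
void choose_subdivision(int p, int nx, int ny, int nz,
                        int *px, int *py, int *pz)
{
    long best = LONG_MAX;
    for (int a = 1; a <= p; ++a) {
        if (p % a) continue;
        for (int b = 1; b <= p / a; ++b) {
            if ((p / a) % b) continue;
            int c = p / (a * b);
            /* local block sizes (rounded up) and their surface area */
            long lx = (nx + a - 1) / a;
            long ly = (ny + b - 1) / b;
            long lz = (nz + c - 1) / c;
            long surface = 2 * (lx * ly + ly * lz + lx * lz);
            if (surface < best) {
                best = surface;
                *px = a; *py = b; *pz = c;
            }
        }
    }
}
```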

CONCLUSIONS

We have compared the scalability of three seismic modeling algorithms: finite differences, continuous mass-lumped finite elements and discontinuous Galerkin finite elements. Their performance was considered at a given accuracy. Asymptotically, the ML and DG finite elements are more accurate than the finite differences when the finite-element mesh follows the interfaces at sharp contrasts in the medium properties. The experiments were run on an Intel Sandy Bridge dual 8-core machine and an Intel Xeon Phi based on the Many Integrated Core architecture. We observed that the finite elements show a speed-up similar to that of the finite differences if a proper subdivision is chosen; only beyond 120 cores does FD degrade. On the Sandy Bridge, all methods show similar scalability.


REFERENCES

Chaljub, E., D. Komatitsch, J. Vilotte, Y. Capdeville, B. Valette, and G. Festa, 2007, Spectral element analysis in seismology: in R. S. Wu and V. Maupin, eds., Advances in geophysics: Advances in wave propagation in heterogeneous media: Elsevier, 365–419.

Chin-Joe-Kong, M. J. S., W. A. Mulder, and M. van Veldhuizen, 1999, Higher-order triangular and tetrahedral finite elements with mass lumping for solving the wave equation: Journal of Engineering Mathematics, 35, 405–426, http://dx.doi.org/10.1023/A:1004420829610.

De Basabe, J. D., and M. K. Sen, 2007, Grid dispersion and stability criteria of some common finite-element methods for acoustic and elastic wave equations: Geophysics, 72, no. 6, T81–T95, http://dx.doi.org/10.1190/1.2785046.

Fornberg, B., 1987, The pseudospectral method: Comparisons with finite differences for the elastic wave equation: Geophysics, 52, 483–501, http://dx.doi.org/10.1190/1.1442319.

Kononov, A., S. Minisini, E. Zhebel, and W. A. Mulder, 2012, A 3D tetrahedral mesh generator for seismic problems: 74th Conference & Exhibition, EAGE, Extended Abstracts, 58922.

Moczo, P., J. Kristek, M. Galis, E. Chaljub, and V. Etienne, 2011, 3-D finite-difference, finite-element, discontinuous-Galerkin and spectral-element schemes analysed for their accuracy with respect to P-wave to S-wave speed ratio: Geophysical Journal International, 187, 1645–1667, http://dx.doi.org/10.1111/j.1365-246X.2011.05221.x.

Pasquetti, R., and F. Rapetti, 2004, Spectral element methods on triangles and quadrilaterals: comparisons and applications: Journal of Computational Physics, 198, 349–362, http://dx.doi.org/10.1016/j.jcp.2004.01.010.

Rivière, B., 2008, Discontinuous Galerkin methods for solving elliptic and parabolic equations: Theory and implementation: SIAM Frontiers in Mathematics 35.

Wang, X., W. W. Symes, and T. Warburton, 2010, Comparison of discontinuous Galerkin and finite difference methods for time domain acoustics: 80th Annual International Meeting, SEG, Expanded Abstracts, 3060–3065, http://dx.doi.org/10.1190/1.3513482.

Zhebel, E., S. Minisini, A. Kononov, and W. A. Mulder, 2012a, A comparison of continuous mass-lumped finite elements and finite differences for 3-D acoustics in the time domain: 74th Conference & Exhibition, EAGE, Extended Abstracts, 59474.

Zhebel, E., S. Minisini, A. Kononov, and W. A. Mulder, 2012b, On the time-stepping stability of continuous mass-lumped and discontinuous Galerkin finite elements for the 3D acoustic wave equation: Proceedings of the 6th European Congress on Computational Methods in Applied Sciences and Engineering (ECCOMAS 2012), CD-ROM, 1041–1051.

Zhebel, E., S. Minisini, A. Kononov, and W. A. Mulder, 2013, Personal communication.
