3D Helmholtz Krylov Solver Preconditioned by
a Shifted Laplace Multigrid Method on
Multi-GPUs
H. Knibbe (Delft University of Technology, e-mail: hknibbe@gmail.com),
C. W. Oosterlee (Dutch national research centre for mathematics and computer science (CWI) and Delft University of Technology, e-mail: c.w.oosterlee@cwi.nl),
C. Vuik (Delft University of Technology, e-mail: c.vuik@tudelft.nl)
Abstract We focus on an iterative solver for the three-dimensional Helmholtz equation on multi-GPU hardware using CUDA (Compute Unified Device Architecture). The Helmholtz equation, discretized by a second-order finite difference scheme, is solved with Bi-CGSTAB preconditioned by a shifted Laplace multigrid method. Two multi-GPU approaches are considered: data parallelism and a split of the algorithm. Their implementations on a multi-GPU architecture are compared to a multi-threaded CPU and a single-GPU implementation. The results show that the data-parallel implementation suffers from the communication between the GPUs and the CPU, but is still several times faster than the many-core CPU. The split of the algorithm across GPUs limits communication and delivers speedups comparable to a single-GPU implementation.
1 Introduction
As has been shown in [5], the implementation on a GPU of numerical solvers for indefinite Helmholtz problems with spatially dependent wavenumber, such as Bi-CGSTAB and IDR(s) preconditioned by a shifted Laplace multigrid method, is more than 25 times faster than on a single CPU. A comparison of a single GPU to a single CPU is important, but it is not representative for problems of realistic size. By a realistic problem size we mean three-dimensional problems which lead, after discretization, to linear systems of equations with more than one million unknowns. Such problems arise when modeling a wavefield in geophysics.
Problems of realistic size are too large to fit in the memory of one GPU, even with the latest NVIDIA Fermi graphics card (see [6]). One solution is to use multiple GPUs. The currently widely used architecture consists of a multi-core CPU connected to one or at most two GPUs. Moreover, in most cases those GPUs have different characteristics and memory sizes. A setup with four or more identical GPUs is rather uncommon, but it would be ideal from a memory point of view, since it provides four times or more the memory of a single GPU. However, the GPUs are connected to a PCI bus, and in some cases two GPUs share the same PCI bus, which limits the data transfer rate. To summarize, using multiple GPUs increases the total memory size, but data transfer problems appear.
The aim of this paper is to consider different multi-GPU approaches and to understand how data transfer affects the performance of a Krylov solver with a shifted Laplace multigrid preconditioner for the three-dimensional Helmholtz equation.
2 Helmholtz Equation and Solver
The Helmholtz equation in three dimensions for a wave problem in a heterogeneous medium is considered:

$$-\frac{\partial^2 \phi}{\partial x^2} - \frac{\partial^2 \phi}{\partial y^2} - \frac{\partial^2 \phi}{\partial z^2} - (1 - \alpha i)\, k^2 \phi = g, \qquad (1)$$
where $\phi = \phi(x,y,z)$ is the wave pressure field, $k = k(x,y,z)$ is the wavenumber, $\alpha \ll 1$ is the damping coefficient and $g = g(x,y,z)$ is the source term. The corresponding differential operator has the form $A = -\Delta - (1 - \alpha i)k^2$, where $\Delta$ denotes the Laplace operator. The problem is given in a cubic domain $\Omega = [(0,0,0),(X,Y,Z)]$, $X,Y,Z \in \mathbb{R}$. A first order radiation boundary condition is applied,

$$\frac{\partial \phi}{\partial \eta} - i k \phi = 0,$$

where $\eta$ is the outward normal vector to the boundary (see [2]). Discretizing equation (1) using the 7-point central finite difference scheme gives the following linear system of equations: $A\phi = g$, $A \in \mathbb{C}^{N \times N}$, $\phi, g \in \mathbb{C}^N$, where $N = n_x n_y n_z$ is the product of the numbers of discretization points in the $x$, $y$ and $z$ directions. Note that the closer the damping parameter $\alpha$ is set to zero, the more difficult it is to solve the Helmholtz equation. We are focusing on the original Helmholtz equation with $\alpha = 0$.
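To make the discretization concrete, the following is a minimal CUDA sketch of a matrix-free application of the 7-point stencil to a wave field stored as a flat array. It is a simplification under stated assumptions: uniform grid spacing h, real arithmetic with α = 0, and zero values outside the domain instead of the radiation boundary terms; the actual solver stores the complex-valued matrix in CRS format. All names are illustrative, not taken from the paper's code.

```cuda
// Minimal sketch (not the paper's code): apply y = A*x for the 7-point
// Helmholtz stencil  A = -Laplace - k^2  on an nx x ny x nz grid with
// spacing h, alpha = 0, real arithmetic and zero values outside the domain.
__global__ void helmholtz_7pt(const float* x, const float* k2,
                              float* y, int nx, int ny, int nz, float h)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int l = blockIdx.z * blockDim.z + threadIdx.z;
    if (i >= nx || j >= ny || l >= nz) return;

    long idx = (long)l * nx * ny + (long)j * nx + i;
    float invh2 = 1.0f / (h * h);

    // Neighbour values; points outside the domain are treated as zero.
    float xc = x[idx];
    float xw = (i > 0)      ? x[idx - 1]       : 0.0f;
    float xe = (i < nx - 1) ? x[idx + 1]       : 0.0f;
    float xs = (j > 0)      ? x[idx - nx]      : 0.0f;
    float xn = (j < ny - 1) ? x[idx + nx]      : 0.0f;
    float xb = (l > 0)      ? x[idx - nx * ny] : 0.0f;
    float xt = (l < nz - 1) ? x[idx + nx * ny] : 0.0f;

    // -Laplace(x) - k^2 * x  with second-order central differences.
    y[idx] = invh2 * (6.0f * xc - xw - xe - xs - xn - xb - xt)
             - k2[idx] * xc;
}
```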
As a solver for the discretized Helmholtz equation we have chosen the Bi-
CGSTAB method preconditioned by the shifted Laplace multigrid method with matrix-dependent transfer operators and a Gauss-Seidel smoother (see [3]). It has been
shown in [5] that this solver is parallelizable on CPUs as well as on a single GPU
and provides good speed-up on parallel architectures. The prolongation in this work
is based on the three dimensional matrix-dependent prolongation for real-valued
matrices described in [7]. This prolongation is also valid at the boundaries. The
restriction is chosen as full weighting restriction. As a smoother the multi-colored
Gauss-Seidel method has been used. In particular, for 3D problems the smoother
uses 8 colors, so that the color of a given point will be different from its neighbours.
Since our goal is to speed up the Helmholtz solver with the help of GPUs, we
still would like to keep the double precision convergence rate of the Krylov method.
Therefore Bi-CGSTAB is implemented in double precision. For the preconditioner,
single precision is sufficient for CPU as well as GPU.
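The mixed-precision coupling can be organized by keeping the Krylov vectors in double precision and converting the residual to single precision before calling the preconditioner, then converting the result back. Below is a minimal sketch of such conversion kernels; the kernel names and the preconditioner call in the usage comment are illustrative, not the paper's code.

```cuda
// Sketch of the precision boundary between the double-precision Bi-CGSTAB
// and the single-precision multigrid preconditioner.
__global__ void double_to_float(const double* src, float* dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = (float)src[i];
}

__global__ void float_to_double(const float* src, double* dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = (double)src[i];
}

// Usage inside the Krylov iteration (pseudo-flow, hypothetical names):
//   double_to_float<<<grid, block>>>(r_dp, r_sp, n);  // residual to SP
//   multigrid_vcycle_sp(z_sp, r_sp);                  // preconditioner in SP
//   float_to_double<<<grid, block>>>(z_sp, z_dp, n);  // result back to DP
```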
3 Multi-GPU Implementation
For our numerical experiments NVIDIA [6] provided a Westmere-based 12-core machine connected to 8 Tesla 2050 GPUs, as shown in Figure 1. The 12-core machine has 48 GB of RAM. Each socket has 6 Intel(R) Xeon(R) X5670 CPU cores @ 2.93 GHz and is connected through 2 PCI buses to 4 graphics cards. Note that two GPUs share one PCI bus connected to a socket. Each GPU consists of 448 cores with a clock rate of 1.5 GHz and has 3 GB of memory.
In the experiments CUDA version 3.2 is used; during the work on this paper the newer CUDA 4.0 was released, but it was not possible to install it on all systems, so for consistency and comparability of the experiments we use the previous version. All experiments on the CPU are done using a multi-threaded CPU implementation (pthreads).
Fig. 1 NVIDIA machine with 12 Westmere CPUs and 8 Fermi GPUs, where two GPUs share a PCI bus connected to a socket.
In general GPU memory is much more limited than CPU memory so we chose
a multi-GPU approach to be able to solve larger problems. The implementation on
a single GPU of major components of the solver such as vector operations, matrix-
vector-multiplication or the smoother has been described in [5]. In this section we
focus on the multi-GPU implementation.
There are two ways to do computations on multiple GPUs: push different CUDA contexts to different GPUs (see [6]) or create multiple threads on the CPU, where each thread communicates with one GPU. For our purposes we have chosen the second option, since it is easier to understand and implement.
Multiple open source libraries for multi-threading have been considered and
tested. For our implementation of numerical methods on a GPU the main require-
ment for multi-threading was that a created thread stays alive to do further processing. It is crucial for performance that a thread remains alive as a GPU context
is attached to it. Pthreads has been chosen as we have total control of the threads
during the program execution.
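As a rough illustration of this setup, the sketch below creates one persistent pthread per GPU; each thread binds itself to its device once with cudaSetDevice, so the CUDA context stays attached for the lifetime of the thread. Error handling and the actual work queue are omitted, and the function and struct names are illustrative, not the paper's code.

```cuda
#include <pthread.h>
#include <cuda_runtime.h>

// One persistent worker thread per GPU; the thread binds to its device once
// and then keeps processing tasks, so the CUDA context stays alive.
struct Worker { int device; /* ... task queue, vectors, matrices ... */ };

static void* worker_main(void* arg)
{
    Worker* w = (Worker*)arg;
    cudaSetDevice(w->device);  // the context is created and attached here
    // Loop: wait for a task (e.g. SpMV on a block of rows), execute it on
    // this GPU, signal completion; the thread only exits at shutdown.
    return NULL;
}

int main()
{
    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);
    if (ngpus > 8) ngpus = 8;

    pthread_t threads[8];
    Worker    workers[8];
    for (int d = 0; d < ngpus; ++d) {
        workers[d].device = d;
        pthread_create(&threads[d], NULL, worker_main, &workers[d]);
    }
    for (int d = 0; d < ngpus; ++d)
        pthread_join(threads[d], NULL);
    return 0;
}
```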
There are several approaches to deal with multi-GPU hardware:
1. Domain-Decomposition approach, where the original continuous or discrete
problem is decomposed into parts which are executed on different GPUs and
the overlapping information (halos) is exchanged by data transfer. This approach
can however have difficulties with convergence for higher frequencies (see [4]).
2. Data-parallel approach, where all matrix-vector and vector-vector operations
are split between multiple GPUs. The advantage of this approach is that it is
relatively easy to implement. However, matrix-vector multiplication requires exchange of data between different GPUs, which can lead to significant data transfer times if the computational part is small. The convergence of the solver is not affected.
3. Split of the algorithm, where different parts of the algorithm are executed on
different devices. For instance, the solver is executed on one GPU and the pre-
conditioner on another one. In this way the communication between GPUs will
be minimized. However this approach requires an individual solution for each
algorithm.
Note that the data-parallel approach can be seen as a method splitting the data across
multi-GPUs, whereas the split of the algorithm can be seen as a method splitting the
tasks across multiple devices. In this paper we investigate the data-parallel approach and the split of the algorithm, and make a comparison between multi-core and multi-GPU implementations. We leave out the domain decomposition approach because the
convergence of the Helmholtz solver is not guaranteed. The data parallel approach
is more intuitive and is described in detail in Section 4.
3.1 Split of the Algorithm
The split can be unique for every algorithm. The main idea of this approach is to
limit communication between GPUs but still be able to compute large problems.
One way to apply this approach to the Bi-CGSTAB preconditioned by shifted
Laplace multigrid method is to execute the Bi-CGSTAB on one GPU and the multi-
grid preconditioner on another one. In this case communication is required only between the Krylov solver and the preconditioner, not for intermediate results. The second way to apply the split of the algorithm to our solver is to execute the Bi-CGSTAB and the finest level of the shifted Laplace multigrid across all available GPUs using the data-parallel approach. The coarser levels of the multigrid method are executed on only one GPU due to their small memory requirements. Since an LU decomposition is used to compute an exact solution on the coarsest level, we use the CPU for that.
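A rough host-side sketch of the first variant is shown below, assuming two devices; the residual and the preconditioned vector are the only data crossing the PCI bus, and the multigrid cycle itself is left as a placeholder. The function and variable names are illustrative, not the paper's implementation.

```cuda
#include <cuda_runtime.h>

// Sketch (not the paper's code): Bi-CGSTAB lives on GPU 0, the shifted
// Laplace multigrid preconditioner on GPU 1.  Per iteration, only the
// residual r and the preconditioned vector z cross the PCI bus.
void precondition_on_second_gpu(const double* r_gpu0,  // residual on GPU 0
                                double* z_gpu0,        // result back on GPU 0
                                double* r_gpu1, double* z_gpu1,
                                int n)
{
    size_t bytes = (size_t)n * sizeof(double);

    // 1. Residual: GPU 0 -> GPU 1 (staged through the host if no peer access).
    cudaMemcpyPeer(r_gpu1, 1, r_gpu0, 0, bytes);

    // 2. One multigrid cycle on GPU 1 (the conversion to single precision
    //    and the cycle itself are placeholders here).
    cudaSetDevice(1);
    // shifted_laplace_vcycle(z_gpu1, r_gpu1, n);

    // 3. Preconditioned vector: GPU 1 -> GPU 0, Bi-CGSTAB continues there.
    cudaMemcpyPeer(z_gpu0, 0, z_gpu1, 1, bytes);
    cudaSetDevice(0);
}
```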
3.2 Issues
Implementation on multi-GPUs requires careful consideration of possibilities and
optimization options. The issues we encountered during our work are listed below:
• Multi-threading implementation, where the life of a thread should be as long as the application. This is crucial for the multi-threading way of implementation on multi-GPU. Note that in the case of pushing contexts this is not an issue.
• Because of the limited GPU memory size, large problems need multiple GPUs.
• Efficient memory reuse to avoid allocation/deallocation. Due to memory limitations the memory should be reused as much as possible, especially in the multigrid method. In our work we create a pool of vectors on the GPU and reuse them during the whole solution time (see the sketch after this list).
• Limiting communication CPU→GPU and GPU→CPU.
• The use of texture memory on multi-GPU is complicated, as each GPU needs its own texture reference.
• Coalescing is difficult, since each matrix row has a different number of elements.
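As referenced in the list above, one simple way to organize the memory reuse is a small pool of preallocated device vectors that are handed out and returned instead of calling cudaMalloc/cudaFree inside the multigrid cycle. The following is a minimal sketch under the assumption of one fixed vector length; it is illustrative and not the paper's implementation.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Sketch of a per-GPU pool of same-sized work vectors: allocate once,
// hand out and return pointers instead of allocating inside the cycle.
// Assumes the pool is created large enough for the deepest call chain.
class VectorPool {
public:
    VectorPool(int count, size_t length) {
        for (int i = 0; i < count; ++i) {
            float* p = nullptr;
            cudaMalloc((void**)&p, length * sizeof(float));
            free_.push_back(p);
        }
    }
    ~VectorPool() {
        for (float* p : free_) cudaFree(p);
    }
    float* acquire() {             // take a preallocated vector
        float* p = free_.back();
        free_.pop_back();
        return p;
    }
    void release(float* p) {       // return it for reuse
        free_.push_back(p);
    }
private:
    std::vector<float*> free_;
};
```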
4 Numerical Results on Multi-GPU
4.1 Vector- and Sparse Matrix-Vector operations
Vector operations such as addition and dot products are trivial to implement on multi-GPU. Vectors are split across multiple GPUs, so that each GPU gets a part of the vector. In the case of vector addition, the parts of a vector either remain on the GPUs or can be sent to the CPU and assembled into a result vector of the original size. The speedup for vector addition on 8 GPUs compared to a multi-threaded implementation (12 CPU cores) is about 40 times for single and double precision. For the dot product, each GPU sends its own partial dot product to the CPU, where they are summed into the final result. The speedup for the dot product is about 8 for single precision and 5 for double precision. Note that in order to avoid cache effects on the CPU and to make a fair comparison, the dot product has been taken of two different vectors. The speedups for vector addition and dot product on multi-GPU are smaller than on a single GPU because of the communication between the CPU and multiple GPUs.
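A sketch of such a multi-GPU dot product is given below, assuming the vector pieces already reside on the devices and that one cuBLAS handle has been created per device; each GPU returns its partial result and the host adds them up. The code is illustrative, not the paper's implementation; for clarity the loop is serialized, whereas in practice the persistent worker threads let the GPUs compute their partial products concurrently.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Sketch: dot product of two vectors split across ngpus devices.
// x_parts[d], y_parts[d] point to the piece of length len[d] that already
// lives on device d; handles[d] is a cuBLAS handle created on device d.
float multi_gpu_dot(cublasHandle_t* handles,
                    float** x_parts, float** y_parts,
                    const int* len, int ngpus)
{
    float total = 0.0f;
    for (int d = 0; d < ngpus; ++d) {
        cudaSetDevice(d);
        float partial = 0.0f;
        // Each GPU computes the dot product of its own piece ...
        cublasSdot(handles[d], len[d], x_parts[d], 1, y_parts[d], 1, &partial);
        // ... and the host sums the partial results.
        total += partial;
    }
    return total;
}
```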
The matrix is stored in a CRS matrix format (Compressed Row Storage, see e.g.
[1]) and is split row-wise. In this case a part of the matrix rows is transferred to
each GPU as well as the whole vector. After matrix-vector multiplication parts of
the result are transferred to a CPU where they are assembled into the final resulting
vector. The timings for matrix-vector multiplication are given in Table 1.
Table 1 Matrix-Vector-Multiplication in single (SP) and double (DP) precision.

Size      Speedup (SP)     Speedup (SP)      Speedup (DP)     Speedup (DP)
          12-cores/1 GPU   12-cores/8 GPUs   12-cores/1 GPU   12-cores/8 GPUs
100,000   54.5             6.81              30.75            5.15
1 Mln     88.5             12.95             30.94            5.97
20 Mln    78.87            12.13             32.63            6.47
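The row-wise split can be illustrated by a standard CRS SpMV kernel that each GPU runs on its own block of rows; the full input vector is copied to every device and the host gathers the partial results afterwards. This is a sketch under those assumptions, not the paper's kernel.

```cuda
// Sketch: one thread per row of the local block of a CRS matrix.
// Each GPU holds row_ptr/col_ind/val for its own rows (row indices are
// local, column indices are global) plus a full copy of the input vector x.
__global__ void spmv_crs_rows(int local_rows,
                              const int* row_ptr, const int* col_ind,
                              const float* val, const float* x, float* y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= local_rows) return;

    float sum = 0.0f;
    for (int k = row_ptr[row]; k < row_ptr[row + 1]; ++k)
        sum += val[k] * x[col_ind[k]];
    y[row] = sum;  // partial result; the host assembles the full vector
}
```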
4.2 Bi-CGSTAB and Gauss-Seidel on Multi-GPU
Since the Bi-CGSTAB algorithm is a collection of the vector additions, dot products and matrix-vector multiplications described in the previous section, the multi-GPU version of Bi-CGSTAB is straightforward. In Table 2 the timings of Bi-CGSTAB on the many-core CPU, a single GPU and multi-GPU are presented. The stopping criterion is $10^{-5}$. It is easy to see that the speedup on multi-GPUs is smaller than on a single GPU due to the data transfer between CPU and GPUs. Note that the largest problem in Table 2 cannot be computed on a single GPU because there is not enough memory available. However, it is possible to compute this problem on multiple GPUs, and the computation is still many times faster than on the 12-core Westmere CPU.
Table 2 Speedups for Bi-CGSTAB in single (SP) and double (DP) precision.

Size      Speedup (SP)     Speedup (SP)      Speedup (DP)     Speedup (DP)
          12-cores/1 GPU   12-cores/8 GPUs   12-cores/1 GPU   12-cores/8 GPUs
100,000   12.72            1.27              9.59             1.43
1 Mln     32.67            7.58              15.84            5.11
15 Mln    45.37            15.23             19.71            8.48
As mentioned above, the shifted Laplace multigrid preconditioner consists of a
coarse grid correction based on the Galerkin method with matrix-dependent prolon-
gation and of a Gauss-Seidel smoother. The implementation of the coarse grid correction on multi-GPU is straightforward, since the main ingredient of the coarse grid
correction is the matrix-vector multiplication. The coarse grid matrices are con-
structed on a CPU and then transferred to the GPUs. The matrix-vector multiplica-
tion on multi-GPU is described in Section 4.1.
The Gauss-Seidel smoother on multi-GPU requires adaptation of the algorithm.
We use 8-colored Gauss-Seidel, since the problem (1) is given in three dimensions
and computations at each discretization point should be done independently of the
neighbours to allow parallelism. For the multi-GPU implementation the rows of the matrix for one color are split between the GPUs. Basically, the colors are computed sequentially, but within one color data parallelism is applied across the
multi-GPUs. The timing comparisons for 8-colored Gauss-Seidel implementation
on different architectures are given in Table 3.
Table 3 Speedups for colored Gauss-Seidel method on different architectures in single precision.

Size     12-cores/1 GPU   12-cores/8 GPUs
5 Mln    16.5             5.2
30 Mln   89.1             6.1
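For the smoother, the eight colors correspond to the parity pattern (i mod 2, j mod 2, l mod 2), so no grid point uses another point of the same color. Below is a matrix-free sketch of the update for one color, using the same simplified real 7-point stencil as the earlier sketch and zero values outside the domain; it is illustrative only, since the actual smoother works on the complex-valued shifted Laplace CRS matrix (whose shifted diagonal avoids the near-zero pivots the plain Helmholtz operator can have) and splits the rows of each color over the GPUs.

```cuda
// Sketch: Gauss-Seidel update of all points of one color (0..7) for the
// 7-point stencil  A = -Laplace - k^2  with grid spacing h, alpha = 0.
// The color of point (i,j,l) is (i%2) + 2*(j%2) + 4*(l%2), so all points
// updated by this kernel are mutually independent.
__global__ void gauss_seidel_color(float* u, const float* rhs, const float* k2,
                                   int nx, int ny, int nz, float h, int color)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int l = blockIdx.z * blockDim.z + threadIdx.z;
    if (i >= nx || j >= ny || l >= nz) return;
    if (((i % 2) + 2 * (j % 2) + 4 * (l % 2)) != color) return;

    long idx = (long)l * nx * ny + (long)j * nx + i;
    float invh2 = 1.0f / (h * h);

    float uw = (i > 0)      ? u[idx - 1]       : 0.0f;
    float ue = (i < nx - 1) ? u[idx + 1]       : 0.0f;
    float us = (j > 0)      ? u[idx - nx]      : 0.0f;
    float un = (j < ny - 1) ? u[idx + nx]      : 0.0f;
    float ub = (l > 0)      ? u[idx - nx * ny] : 0.0f;
    float ut = (l < nz - 1) ? u[idx + nx * ny] : 0.0f;

    float diag    = 6.0f * invh2 - k2[idx];
    float offdiag = -invh2 * (uw + ue + us + un + ub + ut);
    u[idx] = (rhs[idx] - offdiag) / diag;
}

// Host side: the eight colors are processed one after another, e.g.
//   for (int c = 0; c < 8; ++c) gauss_seidel_color<<<grid, block>>>(..., c);
```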
5 Numerical Experiments for the Wedge Problem
This model problem represents a layered heterogeneous problem taken from [3].
For $\alpha \in \mathbb{R}$, find $\phi \in \mathbb{C}^{n \times n \times n}$ such that

$$-\Delta \phi(x,y,z) - (1 - \alpha i)\, k(x,y,z)^2\, \phi(x,y,z) = \delta(x - 500,\, y - 500,\, z), \qquad (2)$$

for $(x,y,z) \in \Omega = [(0,0,0),(1000,1000,1000)]$, with the first order radiation boundary conditions. We assume that $\alpha = 0$. The wavenumber $k(x,y,z)$ is given by $k(x,y,z) = 2\pi f / c(x,y,z)$, where the velocity $c(x,y,z)$ is presented in Figure 2. The grid size satisfies the condition $\max_{(x,y,z)} k(x,y,z)\, h = 0.625$, where $h = \frac{1}{n-1}$. Table 4 shows timings for Bi-CGSTAB preconditioned by the shifted Laplace multigrid method on problem (2) with 43 million unknowns. The single-GPU implementation is about 13 times faster than the multi-threaded CPU implementation. The data-parallel approach shows that on multi-GPUs the communication between the GPUs and the CPU takes a significant amount of the computational time, leading to a smaller speedup than on a single GPU. However, using the split of the algorithm, where Bi-CGSTAB is computed on one GPU and the preconditioner on another one, increases the speedup to 15.5 times.
Fig. 2 The velocity profile of the wedge problem.
Fig. 3 Real part of the solution, f = 30 Hz.
Table 4 Timings for Bi-CGSTAB preconditioned by shifted Laplace multigrid.

Hardware       Bi-CGSTAB (DP)   Preconditioner (SP)   Total   Speedup
12-cores       94 s             690 s                 784 s   1
1 GPU          13 s             47 s                  60 s    13.1
8 GPUs         83 s             86 s                  169 s   4.6
2 GPUs+split   12 s             38 s                  50 s    15.5
6 Conclusions
In this paper we presented a multi-GPU implementation of the Bi-CGSTAB solver preconditioned by a shifted Laplace multigrid method for the three-dimensional Helmholtz equation. To keep the double-precision convergence, the Bi-CGSTAB method is implemented on the GPU in double precision and the preconditioner in single precision. We have compared the multi-GPU implementation to a single-GPU and a multi-threaded CPU implementation for a realistic problem size. Two multi-GPU approaches have been considered: a data-parallel approach and a split of the algorithm. With the data-parallel approach we were able to solve larger problems than on one GPU and obtained better performance than the multi-threaded CPU implementation. However, due to the communication between the GPUs and the CPU, the resulting speedups were considerably smaller than on a single GPU. To minimize the communication while still being able to solve large problems, we have introduced the split of the algorithm. In this case the speedup of the multi-GPU implementation over the multi-core implementation is similar to that of the single-GPU implementation.
The authors thank NVIDIA Corporation for access to the latest many-core multi-GPU architecture.
References
1. J.J. Dongarra, I.S. Duff, D.C. Sorensen, and H.A. van der Vorst. Solving Linear Systems on
Vector and Shared Memory Computers. SIAM, Philadelphia (1991).
2. B. Engquist and A. Majda. Absorbing boundary conditions for numerical simulation of
waves. Math. Comput., 31:629–651 (1977).
3. Y. A. Erlangga, C. W. Oosterlee, and C. Vuik. A novel multigrid based preconditioner for
heterogeneous Helmholtz problems. SIAM J. Sci. Comput., 27:1471–1492 (2006).
4. O. Ernst and M. Gander. Why it is difficult to solve Helmholtz problems with classical
iterative methods. In Durham Symposium 2010 (2010).
5. H. Knibbe, C. W. Oosterlee, and C. Vuik. GPU implementation of a Helmholtz Krylov solver
preconditioned by a shifted Laplace multigrid method. Journal of Computational and Applied
Mathematics, 236:281–293 (2011).
6. www.nvidia.com (2011).
7. E. Zhebel. A Multigrid Method with Matrix-Dependent Transfer Operators for 3D Diffusion
Problems with Jump Coefficients. PhD thesis, Technical University Bergakademie Freiberg,
Germany (2006).