Parallel Numerics ’05, 25-34, M. Vajteršic, R. Trobec, P. Zinterhof, A. Uhl (Eds.)

Chapter 2: Matrix Algebra ISBN 961-6303-67-8

Parallel Maxwell Eigensolver Using Trilinos Software Framework

Peter Arbenz 1, Martin Bečka 2,†, Roman Geus 3, Ulrich Hetmaniuk 4, Tiziano Mengotti 1

1 Institute of Computational Science, Swiss Federal Institute of Technology, CH-8092 Zurich, Switzerland

2 Department of Informatics, Institute of Mathematics, Slovak Academy of Sciences, Dúbravská cesta 9, 841 04 Bratislava, Slovak Republic

3 Paul Scherrer Institut, CH-5232 Villigen PSI, Switzerland

4 Sandia National Laboratories, Albuquerque, NM 87185-1110, U.S.A. ‡

We report on a parallel implementation of the Jacobi–Davidson algorithm to compute a few eigenvalues and corresponding eigenvectors of a large real symmetric generalized matrix eigenvalue problem. The eigenvalue problem stems from the design of cavities of particle accelerators. It is obtained by the finite element discretization of the time-harmonic Maxwell equation in weak form by a combination of Nédélec (edge) and Lagrange (node) elements. We found the Jacobi–Davidson (JD) method to be a very effective solver provided that a good preconditioner is available for the correction equations. The parallel code makes extensive use of the Trilinos software framework. In our examples from accelerator physics we observe satisfactory speedup and efficiency.

†Corresponding author. E-mail: martin.becka@savba.sk

‡Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

1 Introduction

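Many applications in electromagnetics require the computation of some of the eigenpairs of the curl-curl operator,

$$
\operatorname{curl}\, \mu_r^{-1} \operatorname{curl}\, \mathbf{e}(\mathbf{x}) - k_0^2\, \varepsilon_r\, \mathbf{e}(\mathbf{x}) = 0, \qquad \operatorname{div}\, \mathbf{e}(\mathbf{x}) = 0, \qquad \mathbf{x} \in \Omega. \tag{1}
$$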

Equations (1) are obtained from the Maxwell equations after separation of the time and space variables and after elimination of the magnetic field intensity. The discretization of (1) by finite elements leads to a real symmetric generalized matrix eigenvalue problem

$$
A x = \lambda M x, \qquad C^T x = 0, \tag{2}
$$

where A is positive semidefinite and M is positive definite. In order to avoid spurious modes we approximate the electric field e by Nédélec (or edge) elements [17]. The Lagrange multiplier, a function introduced to enforce the divergence-free condition, is approximated by Lagrange (or nodal) finite elements [3].

In this paper we consider a parallel eigensolver for computing a few of the smallest eigenvalues and corresponding eigenvectors of (2) as efficiently as possible with regard to execution time and memory cost. In earlier studies [3] we found the Jacobi–Davidson algorithm [18, 9] to be a very effective solver for this task. We have parallelized this solver in the framework of the Trilinos parallel solver environment [10].

In section 2 we briefly review the symmetric Jacobi–Davidson eigensolver and the preconditioner that is needed for its efficient application. In section 3 we discuss data distribution and issues involving the use of Trilinos.

In section 4 we report on experiments that we conducted on problems originating in the design of the RF cavity of the 590 MeV ring cyclotron installed at the Paul Scherrer Institute (PSI) in Villigen, Switzerland. These experiments indicate that the implemented solution procedure is almost optimal in that the number of iteration steps until convergence depends only slightly on the problem size.

2 The eigensolver

The Jacobi–Davidson algorithm was introduced by Sleijpen and van der Vorst [18]. There are variants for all types of eigenvalue problems [5]. Here we use a variant (JDSYM) adapted to the generalized symmetric eigenvalue problem (2) as described in detail in [2, 9]. This algorithm is well-suited since it does not require the factorization of the matrices A or M. In [2, 3, 4] we found JDSYM to be the method of choice for this problem.
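For orientation, the correction equation solved in each outer step can be written in its basic single-vector form for problem (2) as shown here. This is a sketch under assumptions: the symbols $q$ (current eigenvector approximation, normalized so that $q^T M q = 1$), $\theta = q^T A q$ (its Rayleigh quotient), $r$ (residual), and $t$ (correction) are notation introduced for this illustration, and the variant described in [2, 9] may differ in details such as the treatment of already converged eigenvectors and the choice of shift.

$$
(I - M q q^T)\,(A - \theta M)\,(I - q q^T M)\, t = -r, \qquad q^T M t = 0, \qquad r = A q - \theta M q .
$$

Because this equation only needs to be solved approximately by a preconditioned iterative method, only products with A and M are required, which is consistent with the remark that no factorization of these matrices is needed.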

In addition to the standard JDSYM algorithm, we keep solutions of the correction equation orthogonal to C by applying the projector $(I - Y H^{-1} C^T)$ in each iteration step. Note that $Y = M^{-1} C$ is a very sparse basis of the null space of A and that $H = Y^T C$ is the discretization of the Laplace operator in the nodal element space [3].
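The effect of this projector can be checked on a toy example. The following is a minimal sketch, not the paper's code: it assumes a diagonal M and a constraint matrix C with a single column, so that $H = Y^T C$ is a scalar, and the function name `project` and all numerical values are made up for illustration.

```python
def project(x, C, m_diag):
    """Return (I - Y H^{-1} C^T) x with Y = M^{-1} C and H = Y^T C.

    Assumptions of this sketch: M is diagonal (m_diag holds its diagonal)
    and C has a single column, so H is a scalar.
    """
    y = [c / m for c, m in zip(C, m_diag)]        # Y = M^{-1} C
    h = sum(yi * ci for yi, ci in zip(y, C))      # H = Y^T C
    ctx = sum(ci * xi for ci, xi in zip(C, x))    # C^T x
    return [xi - yi * ctx / h for xi, yi in zip(x, y)]


m_diag = [2.0, 1.0, 4.0]   # hypothetical diagonal mass matrix
C = [1.0, -1.0, 2.0]       # hypothetical single constraint column
x = [3.0, 0.5, -1.0]       # arbitrary vector with C^T x != 0

px = project(x, C, m_diag)
ctpx = sum(ci * pi for ci, pi in zip(C, px))      # C^T (P x)
ppx = project(px, C, m_diag)                      # applying P twice
```

After projection, `ctpx` vanishes up to rounding, and projecting a second time leaves the vector unchanged, as expected of a projector.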

Our preconditioner $K \approx A - \sigma M$, where $\sigma$ is a fixed shift, is a combination of a hierarchical basis preconditioner and an algebraic multigrid (AMG) preconditioner.

Since our finite element spaces consist of Nédélec and Lagrange finite elements of degree 2 and since we are using hierarchical bases, we employ the hierarchical basis preconditioner that we used successfully in [3]. Numbering the linear before the quadratic degrees of freedom, the matrices A, M, and K get a 2-by-2 block structure. The (1,1)-blocks correspond to the bilinear forms involving linear basis functions. The hierarchical basis preconditioners as discussed by Bank [6] are stationary iteration methods that respect the 2-by-2 block structure of A and M. We use the symmetric block Gauss–Seidel iteration as the underlying stationary method.

For very large problems (order $10^5$ and more), we solve with $K_{11}$ by a single V-cycle of an AMG preconditioner. This makes our preconditioner a true multilevel preconditioner. We found ML [16] to be the AMG solver of choice as it can handle unstructured systems that originate from the Maxwell equation discretized by linear Nédélec finite elements. ML implements a smoothed aggregation AMG method [20] that extends the straightforward aggregation approach of Reitzinger and Schöberl [14]. ML is part of Trilinos, which is discussed in the next section.

The approximation $\widetilde{K}_{22}$ of $K_{22}$ again represents a stationary iteration method of which we execute a single iteration step.
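Read as an algorithm, one application of a symmetric block Gauss–Seidel preconditioner of this kind to a residual $r = (r_1, r_2)$ can be sketched as three block steps. The notation is introduced here for illustration: $\widetilde{K}_{11}^{-1}$ stands for whatever approximate solve is used for the (1,1)-block (for example one AMG V-cycle), $\widetilde{K}_{22}^{-1}$ for one step of the stationary method on the (2,2)-block, and $x = (x_1, x_2)$ for the preconditioned vector. The sketch assumes the factored form with a lower block triangular, a block diagonal, and an upper block triangular factor; the exact form used by the authors is given in [3].

$$
\begin{aligned}
x_1 &\leftarrow \widetilde{K}_{11}^{-1}\, r_1, \\
x_2 &\leftarrow \widetilde{K}_{22}^{-1}\, (r_2 - K_{21} x_1), \\
x_1 &\leftarrow \widetilde{K}_{11}^{-1}\, (r_1 - K_{12} x_2).
\end{aligned}
$$

Each application therefore costs two approximate solves with the (1,1)-block and one with the (2,2)-block, plus one product with each off-diagonal block.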

3 Parallelization issues

For very large problems, the data must be distributed over a series of processors. To make the solution of these large problems feasible, an efficient parallel implementation of the algorithm is necessary. Such a parallelization of the algorithm requires proper data structures and data layout, some parallel direct and iterative solvers, and some parallel preconditioners. For our project, we found the Trilinos Project [19] to be an efficient environment to develop such a complex parallel application.

3.1 Trilinos

The Trilinos Project is an ongoing effort to design, develop, and integrate parallel algorithms and libraries within an object-oriented software framework for the solution of large-scale, complex multi-physics engineering and scientific applications [19, 10, 15]. Trilinos is a collection of compatible software packages. Their capabilities include parallel linear algebra computations, parallel algebraic preconditioners, the solution of linear and non-linear equations, the solution of eigenvalue problems, and related capabilities. Trilinos is primarily written in C++ and provides interfaces to essential Fortran and C libraries.

For our project, we use the following packages:

• Epetra, the fundamental package for basic parallel algebraic operations. It provides a common infrastructure to the higher level packages,

• Amesos, the Trilinos wrapper for linear direct solvers (SuperLU, UMFPACK, KLU, etc.),

• AztecOO, an object-oriented descendant of the Aztec library of parallel iterative solvers and preconditioners,

• ML, the multilevel preconditioner package, that implements a smoothed aggregation AMG preconditioner capable of handling Maxwell equations [7, 16].

For a detailed overview of Trilinos and its packages, we refer the reader to [10].

3.2 Data structures

Real-valued double-precision distributed vectors, multivectors (collections of one or more vectors) and (sparse) matrices are fundamental data structures, which are implemented in Epetra. The distribution of the data is done by specifying a communicator and a map, both Epetra objects.

The notion of a communicator is known from MPI [13]. A communicator defines a context of communication, a group of processes and their topology, and it provides the scope for all communication operations. Epetra implements communicators for serial and MPI use. Moreover, communicator classes provide methods similar to other MPI functions.

Vectors, multivectors and matrices are distributed row-wise. The distribution is defined by means of a map. A map can be defined as the distribution of a set of integers across the processes; it relates the global and local row indices. To create a map object, a communicator, the global and local numbers of elements (rows), and the global numbering of all local elements have to be provided. So, a map completely describes the distribution of vector elements or matrix rows. Note that rows can be stored on several processors redundantly. To create a distributed vector object, in addition to a map, one must assign values to the vector elements. The Epetra vector class offers standard functions for doing this and other common vector manipulations.
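The role of a map can be illustrated with a small stand-alone sketch. The class `RowMap` and its methods are hypothetical names chosen for this illustration; they are not the Epetra API, and the two "processes" are simply simulated by two objects in one Python program.

```python
class RowMap:
    """Toy stand-in for a distribution map (not the Epetra API).

    my_global_ids lists, in local order, the global row indices stored on
    this process; num_global is the total number of rows.
    """

    def __init__(self, num_global, my_global_ids):
        self.num_global = num_global
        self.my_global_ids = list(my_global_ids)
        # Reverse lookup: global index -> local index on this process.
        self._lid = {g: l for l, g in enumerate(self.my_global_ids)}

    def num_local(self):
        return len(self.my_global_ids)

    def gid(self, local_id):
        """Global index of a locally stored row."""
        return self.my_global_ids[local_id]

    def lid(self, global_id):
        """Local index of a global row, or -1 if it is not stored here."""
        return self._lid.get(global_id, -1)


# Six global rows on two simulated processes; row 3 is stored on both,
# mirroring the remark that rows may be stored redundantly.
maps = [RowMap(6, [0, 1, 2, 3]), RowMap(6, [3, 4, 5])]
```

With this layout, global row 3 has local index 3 on the first process and local index 0 on the second, while global row 5 is not stored on the first process at all.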

Trilinos supports dense and sparse matrices. Sparse matrices are stored locally in the compressed row storage (CRS) format [5]. A matrix is constructed row by row or element by element. Afterwards, the matrix must be transformed, with the maps of the vectors X and Y specified, so that the matrix-(multi)vector product $Y = A \times X$ can be performed efficiently.
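As a reminder of what the CRS format stores, the following serial sketch builds the three CRS arrays from a small dense matrix and uses them for a matrix-vector product. The function names and the example matrix are made up for illustration, and the sketch ignores the distributed aspects handled by Epetra.

```python
def crs_from_dense(rows):
    """Convert a dense matrix (list of rows) to CRS arrays."""
    values, col_idx, row_ptr = [], [], [0]
    for row in rows:
        for j, a in enumerate(row):
            if a != 0.0:
                values.append(a)     # non-zero value
                col_idx.append(j)    # its column index
        row_ptr.append(len(values))  # start of the next row in values
    return values, col_idx, row_ptr


def crs_matvec(values, col_idx, row_ptr, x):
    """Compute y = A x for a matrix A given in CRS format."""
    y = []
    for i in range(len(row_ptr) - 1):
        s = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            s += values[k] * x[col_idx[k]]
        y.append(s)
    return y


A = [[4.0, 0.0, 1.0],
     [0.0, 3.0, 0.0],
     [1.0, 0.0, 2.0]]
values, col_idx, row_ptr = crs_from_dense(A)
y = crs_matvec(values, col_idx, row_ptr, [1.0, 2.0, 3.0])
```

Only the five non-zero entries are stored, and the product touches each of them exactly once.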

Some algorithms require only the application of a linear operator, such that the underlying matrix need not be available as an object. Epetra handles this by means of a virtual operator class. Epetra also supports block sparse matrices. Unfortunately, there is no particular support for symmetric matrices.
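The idea of an operator interface can be sketched in a few lines. The classes below are hypothetical and only mimic the concept; they are not Epetra classes. The example applies the shifted operator $A - \sigma M$ to a vector without ever forming the matrix $A - \sigma M$.

```python
from abc import ABC, abstractmethod


class Operator(ABC):
    """Anything that can be applied to a vector."""

    @abstractmethod
    def apply(self, x):
        """Return the result of applying the operator to x."""


class DenseMatrix(Operator):
    """Operator backed by an explicitly stored matrix."""

    def __init__(self, rows):
        self.rows = rows

    def apply(self, x):
        return [sum(a * xj for a, xj in zip(row, x)) for row in self.rows]


class ShiftedOperator(Operator):
    """Applies (A - sigma * M) x using only the apply() of A and M."""

    def __init__(self, A, M, sigma):
        self.A, self.M, self.sigma = A, M, sigma

    def apply(self, x):
        ax = self.A.apply(x)
        mx = self.M.apply(x)
        return [a - self.sigma * m for a, m in zip(ax, mx)]


A = DenseMatrix([[2.0, -1.0], [-1.0, 2.0]])   # made-up example data
M = DenseMatrix([[1.0, 0.0], [0.0, 1.0]])
shifted = ShiftedOperator(A, M, sigma=1.5)
result = shifted.apply([1.0, 1.0])
```

An iterative solver written against `Operator` works unchanged with either class, which is the point of such an abstraction.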

To redistribute data, one defines a new, so-called target map and creates an empty data object according to this new map as well as an Epetra import/export object from the original and the new map. The new data object can be filled with the values of the original data object using the import/export object, which describes the communication plan.
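The redistribution steps can be mimicked in a serial sketch in which each process is represented by one list. The functions `build_plan` and `redistribute` are hypothetical helpers written for this illustration and do not correspond to Epetra calls; real communication is replaced by plain list accesses.

```python
def build_plan(source_ids, target_ids):
    """Build a communication plan from a source and a target map.

    source_ids[p] / target_ids[p] list the global indices held by process p.
    For each target process the plan lists triples
    (source_process, source_local_index, target_local_index).
    """
    owner = {}
    for p, ids in enumerate(source_ids):
        for l, g in enumerate(ids):
            owner.setdefault(g, (p, l))   # first holder wins if redundant
    return [[owner[g] + (tl,) for tl, g in enumerate(ids)]
            for ids in target_ids]


def redistribute(plan, source_data):
    """Fill new, initially empty data objects according to the plan."""
    target = []
    for proc_plan in plan:
        local = [None] * len(proc_plan)   # empty object for the target map
        for sp, sl, tl in proc_plan:
            local[tl] = source_data[sp][sl]
        target.append(local)
    return target


source_ids = [[0, 1, 2], [3, 4, 5]]       # original block distribution
target_ids = [[0, 2, 4], [1, 3, 5]]       # made-up target distribution
source_data = [[10.0, 11.0, 12.0], [13.0, 14.0, 15.0]]

plan = build_plan(source_ids, target_ids)
target_data = redistribute(plan, source_data)
```

The plan is computed once from the two maps and can then be reused for every vector or matrix that shares them.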

3.3 Data distribution

A suitable data distribution can reduce communication costs and balance the computational load. The gain from such a redistribution can, in general, outweigh the cost of this preprocessing step.

Zoltan [21, 8] is a library that contains tools for load balancing and parallel data management. It provides a common interface to graph partitioners like METIS and ParMetis [12, 11]. Zoltan is not a Trilinos package, but the Trilinos package EpetraExt provides an interface between Epetra and Zoltan.

In our experiments, we use ParMetis to distribute the data. This partitioner tries to distribute a graph such that (I) the number of graph vertices per processor is balanced and (II) the number of edge cuts is minimized. The former balances the work load. The latter minimizes the communication overhead by concentrating elements in diagonal blocks and minimizing the number of non-zero off-diagonal blocks. In our experiments, we define a graph G, which contains connectivity information for each node, edge, and face of the finite element mesh. G is constructed from portions of the sparse matrices M, H, and C. To reduce the overhead of the redistribution we also work with a graph smaller than our artificial graph G. We determine a good parallel distribution just for the vertices and then adjust the edges and faces accordingly.
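The two partitioning objectives can be made concrete by measuring them for a given partition. The sketch below does not call ParMetis or Zoltan; the function `partition_quality`, the small path graph, and the two candidate partitions are made up for illustration.

```python
def partition_quality(edges, part, num_parts):
    """Measure load balance and edge cut of a graph partition.

    edges is a list of vertex pairs, part[v] is the process owning vertex v.
    Returns (vertices per process, number of cut edges, imbalance), where
    imbalance = largest part size / ideal part size (1.0 is perfect).
    """
    sizes = [0] * num_parts
    for p in part:
        sizes[p] += 1
    edge_cut = sum(1 for u, v in edges if part[u] != part[v])
    imbalance = max(sizes) * num_parts / len(part)
    return sizes, edge_cut, imbalance


# A path graph with six vertices, split over two processes in two ways.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
block = [0, 0, 0, 1, 1, 1]          # contiguous halves
interleaved = [0, 1, 0, 1, 0, 1]    # alternating ownership

q_block = partition_quality(edges, block, 2)
q_interleaved = partition_quality(edges, interleaved, 2)
```

Both partitions are perfectly balanced, but the contiguous one cuts a single edge while the alternating one cuts all five, so only the former keeps communication low.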

4 Numerical experiments

In this section, we discuss the numerical experiments used to assess the parallel

implementation. Results have been presented in [1].

The experiments have been executed on a PC cluster of 32 dual-processor nodes in dedicated mode. Each node has 2 AMD Athlon 1.4 GHz processors, 2 GB main memory, and 160 GB local disk. The nodes are connected by a Myrinet network providing a communication bandwidth of 2000 Mbit/s. The system runs Linux.

For these experiments, we use the developer version of Trilinos on top of

MPICH. We computed the 5 smallest positive eigenvalues and corresponding

eigenvectors using JDSYM with the multilevel preconditioner.

Test problems originate in the design of the RF cavity of the 590 MeV ring cyclotron installed at the Paul Scherrer Institute (PSI) in Villigen, Switzerland. We deal with two problem sizes. They are labelled cop40k and cop300k. Their characteristics are given in Table 1, where we list the order n and the number of non-zeros nnz for the shifted operator $A - \sigma M$ and for the discrete Laplacian H.

Table 1: Matrix characteristics

| grid | $n_{A-\sigma M}$ | $nnz_{A-\sigma M}$ | $n_H$ | $nnz_H$ |
|---|---|---|---|---|
| cop40k | 231,668 | 4,811,786 | 46,288 | 1,163,834 |
| cop300k | 1,822,854 | 39,298,588 | 373,990 | 10,098,456 |

Here the eigenvalues to be computed are

$$
\lambda_1 \approx 1.13, \quad \lambda_2 \approx 4.05, \quad \lambda_3 \approx 9.89, \quad \lambda_4 \approx 11.3, \quad \lambda_5 \approx 14.2.
$$

We set $\sigma = 1.5$.

In Table 2, we report the execution times $t = t(p)$ for solving the eigenvalue problem with various numbers p of processors. These times do not include preparatory work, such as the assembly of matrices or the data redistribution. $E(p)$ describes the parallel efficiency with respect to the simulation run with the smallest number of processors. $t_{prec}$ and $t_{proj}$ indicate the percentage of the time the solver spent applying the preconditioner and the projector, respectively. $n_{inner}^{avg}$ is the average number of inner (QMRS) iterations per outer iteration. The total number of applications of the preconditioner K is approximately $n_{outer} \cdot n_{inner}^{avg}$. Here we use an AMG preconditioner for the block $K_{11}$, Jacobi steps for $K_{22}$ and an AMG preconditioner for the whole H.

In Table 3, we use the AMG preconditioner for the block $K_{11}$, Jacobi steps for $K_{22}$, and a similar strategy for H (AMG preconditioner for $H_{11}$ and Jacobi steps for $H_{22}$). We investigate the effect of redistributing the matrices.

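Table 2: Results for matrix cop40k

| p | t [sec] | E(p) | $t_{prec}$ [%] | $t_{proj}$ [%] | $n_{outer}$ | $n_{inner}^{avg}$ |
|---|---|---|---|---|---|---|
| 1 | 2092 | 1.00 | 37 | 18 | 53 | 19.02 |
| 2 | 1219 | 0.86 | 38 | 17 | 54 | 18.96 |
| 4 | 642 | 0.81 | 37 | 17 | 54 | 19.43 |
| 8 | 321 | 0.81 | 38 | 18 | 53 | 19.23 |
| 12 | 227 | 0.77 | 40 | 19 | 53 | 19.47 |
| 16 | 174 | 0.75 | 43 | 20 | 53 | 18.96 |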
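The efficiency column can be recomputed from the timings. The sketch below assumes the usual definition relative to the run with the smallest processor count $p_0$, namely $E(p) = p_0\, t(p_0) / (p\, t(p))$; the text does not spell out the formula, but this definition reproduces the $E(p)$ values reported for cop40k in Table 2. The function name `efficiency` is chosen for this illustration.

```python
def efficiency(procs, times):
    """Parallel efficiency relative to the run with the fewest processors.

    Assumed definition: E(p) = p0 * t(p0) / (p * t(p)), where p0 is the
    first (smallest) entry of procs.
    """
    p0, t0 = procs[0], times[0]
    return [(p0 * t0) / (p * t) for p, t in zip(procs, times)]


# Processor counts and execution times [sec] for cop40k, from Table 2.
procs = [1, 2, 4, 8, 12, 16]
times = [2092, 1219, 642, 321, 227, 174]

E = [round(e, 2) for e in efficiency(procs, times)]

# Estimated number of preconditioner applications for p = 1,
# n_outer * n_inner^avg with the values from Table 2.
approx_prec_applications = 53 * 19.02
```

For the single-processor run the estimate gives roughly a thousand applications of the preconditioner K, which explains why the preconditioner accounts for a large share of the run time.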

Results in Table 3 show that the quality of data distribution is important. For the largest number of processors (p = 16), the execution time with the redistributed matrices is half the time obtained with the original matrices. The original distributions were straightforward block distributions of the matrices.

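Table 3: cop40k: Comparison of results with (left) and without (right) redistribution

| p | t [sec] (with) | t [sec] (without) | E(p) (with) | E(p) (without) | $n_{outer}$ (with) | $n_{outer}$ (without) | $n_{inner}^{avg}$ (with) | $n_{inner}^{avg}$ (without) |
|---|---|---|---|---|---|---|---|---|
| 1 | 1957 | 2005 | 1.00 | 1.00 | 53 | 53 | 19.02 | 19.02 |
| 2 | 1159 | 1297 | 0.84 | 0.77 | 54 | 53 | 19.06 | 19.66 |
| 4 | 622 | 845 | 0.79 | 0.59 | 54 | 55 | 19.43 | 19.18 |
| 8 | 318 | 549 | 0.77 | 0.45 | 53 | 54 | 19.23 | 19.67 |
| 12 | 231 | 451 | 0.71 | 0.37 | 53 | 54 | 20.47 | 19.78 |
| 16 | 184 | 366 | 0.66 | 0.34 | 53 | 54 | 19.00 | 19.04 |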

Finally, in Table 4, we report results for our larger problem size cop300k. We use the 2-level preconditioner for K and H: an appropriate AMG preconditioner for the blocks $K_{11}$ and $H_{11}$ and one step of Jacobi for the blocks $K_{22}$ and $H_{22}$. Table 4 shows that, for these experiments, the iteration counts behave nicely and that efficiencies stay high.

Table 4: Results for matrix cop300k

| p | t [sec] | E(p) | $n_{outer}$ | $n_{inner}^{avg}$ |
|---|---|---|---|---|
| 8 | 4346 | 1.00 | 62 | 28.42 |
| 12 | 3160 | 0.91 | 62 | 28.23 |
| 16 | 2370 | 0.92 | 61 | 28.52 |

5 Conclusions

In conclusion, the parallel algorithm shows very satisfactory behavior. The efficiency of the parallelized code does not drop below 65 percent for 16 processors. We usually have a big efficiency loss initially. Then efficiency decreases slowly as the number of processors increases. This is natural due to the growing communication-to-computation ratio.

The accuracy of the results is satisfactory. The computed eigenvectors were M-orthogonal and orthogonal to C to machine precision.

We plan to compare the Jacobi–Davidson approach with other eigenvalue solvers such as the locally optimal block preconditioned conjugate gradient method (LOBPCG).

References

[1] P. Arbenz, M. Bečka, R. Geus, U. Hetmaniuk, and T. Mengotti, On a Parallel Multilevel Preconditioned Maxwell Eigensolver. Technical Report 465, Institute of Computational Science, ETH Zürich, December 2004.

[2] P. Arbenz and R. Geus, A comparison of solvers for large eigenvalue problems originating from Maxwell’s equations. Numer. Linear Algebra Appl., 6(1):3–16, 1999.

[3] P. Arbenz and R. Geus, Multilevel preconditioners for solving eigenvalue problems occuring in the design of resonant cavities. Applied Numerical Mathematics, 2004. Article in press. Corrected proof available from doi: 10.1016/j.apnum.2004.09.026.

[4] P. Arbenz, R. Geus, and S. Adam, Solving Maxwell eigenvalue problems for accelerating cavities. Phys. Rev. ST Accel. Beams, 4:022001, 2001. (Electronic journal available from http://prst-ab.aps.org/).

[5] Z. Bai, J. Demmel, J. Dongarra, A. Ruhe, and H. van der Vorst, Templates for the Solution of Algebraic Eigenvalue Problems: A Practical Guide. SIAM, Philadelphia, PA, 2000.

[6] R. E. Bank, Hierarchical bases and the finite element method. Acta Numerica, 5:1–43, 1996.

[7] P. B. Bochev, C. J. Garasi, J. J. Hu, A. C. Robinson, and R. S. Tuminaro, An improved algebraic multigrid method for solving Maxwell’s equations. SIAM J. Sci. Comput., 25(2):623–642, 2003.

[8] K. Devine, E. Boman, R. Heaphy, B. Hendrickson, and C. Vaughan, Zoltan data management services for parallel dynamic applications. Computing in Science and Engineering, 4(2):90–97, 2002.

[9] R. Geus, The Jacobi–Davidson algorithm for solving large sparse symmetric eigenvalue problems. PhD Thesis No. 14734, ETH Zürich, 2002. (Available at URL http://e-collection.ethbib.ethz.ch/show?type=diss&nr=14734).

[10] M. Heroux, R. Bartlett, V. Howle, R. Hoekstra, J. Hu, T. Kolda, R. Lehoucq, K. Long, R. Pawlowski, E. Phipps, A. Salinger, H. Thornquist, R. Tuminaro, J. Willenbring, and A. Williams, An overview of the Trilinos Project. ACM Trans. Math. Softw., 5:1–23, 2003.

[11] G. Karypis and V. Kumar, Parallel multilevel k-way partitioning scheme for irregular graphs. SIAM Rev., 41(2):278–300, 1999.

[12] METIS: A family of programs for partitioning unstructured graphs and hypergraphs and computing fill-reducing orderings of sparse matrices. See URL http://www-users.cs.umn.edu/~karypis/metis/.

[13] P. S. Pacheco, Parallel programming with MPI. Morgan Kaufmann, San Francisco CA, 1997.

[14] S. Reitzinger and J. Schöberl, An algebraic multigrid method for finite element discretizations with edge elements. Numer. Linear Algebra Appl., 9(3):223–238, 2002.

[15] M. Sala, M. A. Heroux, and D. D. Day, Trilinos 4.0 Tutorial. Technical Report SAND2004-2189, Sandia National Laboratories, May 2004.

[16] M. Sala, J. Hu, and R. S. Tuminaro, ML 3.1 Smoothed Aggregation User’s Guide. Tech. Report SAND2004-4819, Sandia National Laboratories, September 2004.

[17] P. P. Silvester and R. L. Ferrari, Finite Elements for Electrical Engineers. Cambridge University Press, Cambridge, 3rd edition, 1996.

[18] G. L. G. Sleijpen and H. A. van der Vorst, A Jacobi–Davidson iteration method for linear eigenvalue problems. SIAM J. Matrix Anal. Appl., 17(2):401–425, 1996.

[19] The Trilinos Project Home Page, http://software.sandia.gov/trilinos/.

[20] P. Vaněk, J. Mandel, and M. Brezina, Algebraic multigrid based on smoothed aggregation for second and fourth order problems. Computing, 56(3):179–196, 1996.

[21] Zoltan Home Page, http://www.cs.sandia.gov/Zoltan/.