
A ROBUST ALGEBRAIC DOMAIN DECOMPOSITION

PRECONDITIONER FOR SPARSE NORMAL EQUATIONS∗

HUSSAM AL DAAS†, PIERRE JOLIVET‡, AND JENNIFER A. SCOTT†§

Abstract. Solving the normal equations corresponding to large sparse linear least-squares problems is an important and challenging problem. For very large problems, an iterative solver is needed and, in general, a preconditioner is required to achieve good convergence. In recent years, a number of preconditioners have been proposed. These are largely serial, and reported results demonstrate that none of the commonly used preconditioners for the normal equations matrix is capable of solving all sparse least-squares problems. Our interest is thus in designing new preconditioners for the normal equations that are efficient, robust, and can be implemented in parallel. Our proposed preconditioners can be constructed efficiently and algebraically, without any knowledge of the problem and without any assumption on the least-squares matrix except that it is sparse. We exploit the structure of the symmetric positive definite normal equations matrix and use the concept of algebraic local symmetric positive semi-definite splittings to introduce two-level Schwarz preconditioners for least-squares problems. The condition number of the preconditioned normal equations is shown to be theoretically bounded independently of the number of subdomains in the splitting. This upper bound can be adjusted using a single parameter τ that the user can specify. We discuss how the new preconditioners can be implemented on top of the PETSc library using only 150 lines of Fortran, C, or Python code. Problems arising from practical applications are used to compare the performance of the proposed new preconditioner with that of other preconditioners.

Key words. Algebraic domain decomposition, two-level preconditioner, additive Schwarz,

normal equations, sparse linear least-squares.

1. Introduction. We are interested in solving large-scale linear least-squares (LS) problems

(1.1)   $\min_x \|Ax - b\|_2,$

where $A \in \mathbb{R}^{m\times n}$ ($m \ge n$) and $b \in \mathbb{R}^m$ are given. Solving (1.1) is mathematically equivalent to solving the $n \times n$ normal equations

(1.2)   $Cx = A^\top b, \quad C = A^\top A,$

where, provided $A$ has full column rank, the normal equations matrix $C$ is symmetric and positive definite (SPD). Two main classes of methods may be used to solve the normal equations: direct methods and iterative methods. A direct method proceeds by computing an explicit factorization, either a sparse Cholesky factorization of $C$ or a “thin” QR factorization of $A$. While well-engineered direct solvers [2, 11, 32]

are highly robust, iterative methods may be preferred because they generally require significantly less storage (allowing them to tackle very large problems for which the memory requirements of a direct solver are prohibitive) and, in some applications, it may not be necessary to solve the system with the high accuracy offered by a direct solver. However, the successful application of an iterative method usually requires a suitable preconditioner to achieve acceptable (and ideally, fast) convergence

∗Submitted to the editors July 19, 2021.

†STFC Rutherford Appleton Laboratory, Harwell Campus, Didcot, Oxfordshire, OX11 0QX, UK

(hussam.al-daas@stfc.ac.uk, jennifer.scott@stfc.ac.uk).

‡CNRS, ENSEEIHT, 2 rue Charles Camichel, 31071 Toulouse Cedex 7, France

(pierre.jolivet@enseeiht.fr).

§School of Mathematical, Physical and Computational Sciences, University of Reading, Reading

RG6 6AQ, UK.



rates. Currently, there is much less knowledge of preconditioners for LS problems than there is for sparse symmetric linear systems and, as observed by Bru et al. [7], “the problem of robust and efficient iterative solution of LS problems is much harder than the iterative solution of systems of linear equations.” This is, at least in part, because $A$ does not have the properties of differential problems that can make standard preconditioners effective for solving many classes of linear systems.

Compared with other classes of linear systems, the development of preconditioners

for sparse LS problems may be regarded as still being in its infancy and includes

• variants of block Jacobi (also known as block Cimmino) and SOR [18];
• incomplete factorizations, such as incomplete Cholesky, QR, and LU factorizations, for example, [7, 29, 37, 38];
• and sparse approximate inverses [10].

A review and performance comparison is given in [21]. It found that, whilst none of the approaches is successful for all LS problems, limited-memory incomplete Cholesky factorization preconditioners appear to be the most reliable. Incomplete factorization-based preconditioners are designed for moderate-size problems because current approaches are, in general, not suitable for parallel computers. The block Cimmino method can be parallelized easily; however, it lacks robustness because the iteration count needed to reach convergence cannot be controlled and typically increases significantly when the number of blocks increases for a fixed problem [16]. Several

techniques have been proposed to improve the convergence of block Cimmino but

they still lack robustness [17]. Thus, we are motivated to design a new class of LS

preconditioners that are not only reliable but can also be implemented in parallel.

In [3], Al Daas and Grigori presented a class of robust, fully algebraic two-level additive Schwarz preconditioners for solving SPD linear systems of equations. They introduced the notion of an algebraic local symmetric positive semi-definite (SPSD) splitting of an SPD matrix with respect to local subdomains, and used this splitting to construct a class of second-level spaces that bound the spectral condition number of the preconditioned system by a user-defined value. Unfortunately, Al Daas and Grigori reported that, for general sparse SPD matrices, constructing the splitting is prohibitively expensive. Our interest is in examining whether the particular structure of the normal equations matrix allows the approach to be successfully used for preconditioning LS problems. In this paper, we show how to compute the splitting efficiently. Based on this splitting, we apply the theory presented in [3] to construct a two-level Schwarz preconditioner for the normal equations.

Note that, for most existing preconditioners for the normal equations, there is no need to form and store all of the normal equations matrix $C$ explicitly. For example, the lower triangular part of its columns can be computed one at a time, used to perform the corresponding step of an incomplete Cholesky algorithm, and then discarded. However, forming the normal equations matrix, even piecemeal, can entail a significant overhead and may potentially lead to a severe loss of information in highly ill-conditioned cases. Although building our proposed preconditioner does not need the explicit computation of $C$, our parallel implementation computes it efficiently and uses it to set up the preconditioner. This is mainly motivated by technical reasons. As an example, state-of-the-art distributed-memory graph partitioners such as ParMETIS [27] or PT-SCOTCH [35] cannot directly partition the columns of the rectangular matrix $A$. Our numerical experiments on highly ill-conditioned LS problems showed that forming $C$ and using a diagonal shift to construct the preconditioner had no major effect on the robustness of the resulting preconditioner.

This paper is organized as follows. The notation used in the manuscript is given at the end of the introduction. In section 2, we present an overview of domain decomposition (DD) methods for a sparse SPD matrix. We present a framework for the DD approach when applied to the sparse LS problem in section 3. Afterwards, we show how to compute the local SPSD splitting matrices efficiently and use them, in line with the theory presented in [3], to construct a robust two-level Schwarz preconditioner for the normal equations matrix. We then discuss some technical details that clarify how to construct the preconditioner efficiently. In section 4, we briefly discuss how the new preconditioner can be implemented on top of the PETSc library [6] and we illustrate its effectiveness using large-scale LS problems coming from practical applications. Finally, concluding comments are made in section 5.

Notation. We end our introduction by defining the notation that will be used in this paper. Let $1 \le n \le m$ and let $A \in \mathbb{R}^{m\times n}$. Let $S_1 \subset [\![1,m]\!]$ and $S_2 \subset [\![1,n]\!]$ be two sets of integers. $A(S_1,:)$ is the submatrix of $A$ formed by the rows whose indices belong to $S_1$, and $A(:,S_2)$ is the submatrix of $A$ formed by the columns whose indices belong to $S_2$. The matrix $A(S_1,S_2)$ is formed by taking the rows whose indices belong to $S_1$ and retaining only the columns whose indices belong to $S_2$. The concatenation of two sets of integers $S_1$ and $S_2$ is represented by $[S_1,S_2]$; note that the order of the concatenation is important. The set of the first $p$ positive integers is denoted by $[\![1,p]\!]$. The identity matrix of size $n$ is denoted by $I_n$. We denote by $\ker(A)$ and $\mathrm{range}(A)$ the null space and the range of $A$, respectively.

2. Introduction to domain decomposition. Throughout this section, we assume that $C$ is a general $n\times n$ sparse SPD matrix. Let the nodes $V$ in the corresponding adjacency graph $G(C)$ be numbered from 1 to $n$. A graph partitioning algorithm can be used to split $V$ into $N \ll n$ disjoint subsets $\Omega_{Ii}$ ($1 \le i \le N$) of size $n_{Ii}$. These sets are called nonoverlapping subdomains. Defining an additive Schwarz preconditioner requires overlapping subdomains. Let $\Omega_{\Gamma i}$ be the subset of size $n_{\Gamma i}$ of nodes that are at distance one in $G(C)$ from the nodes in $\Omega_{Ii}$ ($1 \le i \le N$). The overlapping subdomain $\Omega_i$ is defined to be $\Omega_i = [\Omega_{Ii},\Omega_{\Gamma i}]$, of size $n_i = n_{\Gamma i} + n_{Ii}$. Associated with $\Omega_i$ is a restriction (or projection) matrix $R_i \in \mathbb{R}^{n_i\times n}$ given by $R_i = I_n(\Omega_i,:)$. $R_i$ maps from the global domain to subdomain $\Omega_i$. Its transpose $R_i^\top$ is a prolongation matrix that maps from subdomain $\Omega_i$ to the global domain. The one-level additive Schwarz preconditioner [15] is defined to be

(2.1)   $M_{\mathrm{ASM}}^{-1} = \sum_{i=1}^{N} R_i^\top C_{ii}^{-1} R_i, \quad C_{ii} = R_i C R_i^\top.$

That is,

$M_{\mathrm{ASM}}^{-1} = \mathcal{R}_1 \begin{pmatrix} C_{11}^{-1} & & \\ & \ddots & \\ & & C_{NN}^{-1} \end{pmatrix} \mathcal{R}_1^\top,$

where $\mathcal{R}_1$ is the one-level interpolation operator defined by

$\mathcal{R}_1 : \prod_{i=1}^{N} \mathbb{R}^{n_i} \to \mathbb{R}^{n}, \qquad (u_i)_{1\le i\le N} \mapsto \sum_{i=1}^{N} R_i^\top u_i.$


Applying this preconditioner to a vector involves solving concurrent local problems in the overlapping subdomains. Increasing $N$ reduces the sizes $n_i$ of the overlapping subdomains, leading to smaller local problems and faster computations. However, in practice, the system preconditioned by $M_{\mathrm{ASM}}^{-1}$ may not be well conditioned, inhibiting convergence of the iterative solver. In fact, the local nature of this preconditioner can lead to a deterioration in its effectiveness as the number of subdomains increases because of the lack of global information from the matrix $C$ [15, 20]. To maintain robustness with respect to $N$, an artificial subdomain that includes global information is added to the preconditioner (this is also known as a second-level, or coarse, correction).

Let $0 < n_0 \ll n$. If $R_0 \in \mathbb{R}^{n_0\times n}$ is of full row rank, the two-level additive Schwarz preconditioner [15] is defined to be

(2.2)   $M_{\mathrm{additive}}^{-1} = \sum_{i=0}^{N} R_i^\top C_{ii}^{-1} R_i = R_0^\top C_{00}^{-1} R_0 + M_{\mathrm{ASM}}^{-1}, \quad C_{00} = R_0 C R_0^\top.$

That is,

$M_{\mathrm{additive}}^{-1} = \mathcal{R}_2 \begin{pmatrix} C_{00}^{-1} & & & \\ & C_{11}^{-1} & & \\ & & \ddots & \\ & & & C_{NN}^{-1} \end{pmatrix} \mathcal{R}_2^\top,$

where $\mathcal{R}_2$ is the two-level interpolation operator

(2.3)   $\mathcal{R}_2 : \prod_{i=0}^{N} \mathbb{R}^{n_i} \to \mathbb{R}^{n}, \qquad (u_i)_{0\le i\le N} \mapsto \sum_{i=0}^{N} R_i^\top u_i.$

In the rest of this paper, we will make use of the canonical one-to-one correspondence between $\prod_{i=0}^{N} \mathbb{R}^{n_i}$ and $\mathbb{R}^{\sum_{i=0}^{N} n_i}$, so that $\mathcal{R}_2$ can be applied to vectors in $\mathbb{R}^{\sum_{i=0}^{N} n_i}$. Observe that, because $C$ and $R_0$ are of full rank, $C_{00}$ is also of full rank. For any full-rank $R_0$, it is possible to cheaply obtain upper bounds on the largest eigenvalue of the preconditioned matrix, independently of $n$ and $N$ [3]. However, bounding the smallest eigenvalue is highly dependent on $R_0$. Thus, the choice of $R_0$ is key to obtaining a well-conditioned system and building efficient two-level Schwarz preconditioners. Two-level Schwarz preconditioners have been used to solve a large class of systems arising from a range of engineering applications (see, for example, [22, 26, 28, 30, 40, 43] and the references therein).

Following [3], we denote by $D_i \in \mathbb{R}^{n_i\times n_i}$ ($1 \le i \le N$) any non-negative diagonal matrices such that

(2.4)   $\sum_{i=1}^{N} R_i^\top D_i R_i = I_n.$

We refer to $(D_i)_{1\le i\le N}$ as an algebraic partition of unity. In [3], Al Daas and Grigori show how to select local subspaces $Z_i \in \mathbb{R}^{n_i\times p_i}$ with $p_i \ll n_i$ ($1 \le i \le N$) such that, if $R_0^\top$ is defined to be $R_0^\top = [R_1^\top D_1 Z_1, \ldots, R_N^\top D_N Z_N]$, the spectral condition number of the preconditioned matrix $M_{\mathrm{additive}}^{-1} C$ is bounded from above independently of $N$ and $n$.
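For concreteness, the restriction operators $R_i$ and a partition of unity satisfying (2.4) can be sketched in a few lines of Python. This is a toy dense example with subdomains of our own choosing (the multiplicity-based choice of $D_i$ is one simple option, not the one the paper later uses for LS problems):

```python
import numpy as np

# Two overlapping subdomains of a domain with n = 6 nodes; build
# R_i = I_n(Omega_i, :) and D_i so that (2.4) holds: sum_i R_i^T D_i R_i = I_n.
n = 6
omega = [[0, 1, 2, 3],   # subdomain 1 (overlap: nodes 2, 3)
         [2, 3, 4, 5]]   # subdomain 2

R = [np.eye(n)[idx, :] for idx in omega]

# Count how many subdomains contain each node, then set
# D_i = diag(1 / multiplicity) restricted to the local indices.
mult = np.zeros(n)
for idx in omega:
    mult[idx] += 1
D = [np.diag(1.0 / mult[idx]) for idx in omega]

identity = sum(Ri.T @ Di @ Ri for Ri, Di in zip(R, D))
print(np.allclose(identity, np.eye(n)))  # True: the D_i form a partition of unity
```

Any non-negative diagonal weights with the same row sums would do; the algebraic theory only requires (2.4).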


2.1. Algebraic local SPSD splitting of an SPD matrix. We now recall the definition of an algebraic local SPSD splitting of an SPD matrix given in [3]. This requires some additional notation. Denote the complement of $\Omega_i$ in $[\![1,n]\!]$ by $\Omega_{ci}$. Define restriction matrices $R_{ci}$, $R_{Ii}$, and $R_{\Gamma i}$ that map from the global domain to $\Omega_{ci}$, $\Omega_{Ii}$, and $\Omega_{\Gamma i}$, respectively. Reordering the matrix $C$ using the permutation matrix $P_i = I_n([\Omega_{Ii},\Omega_{\Gamma i},\Omega_{ci}],:)$ gives the block tridiagonal matrix

(2.5)   $P_i C P_i^\top = \begin{pmatrix} C_{I,i} & C_{I\Gamma,i} & \\ C_{\Gamma I,i} & C_{\Gamma,i} & C_{\Gamma c,i} \\ & C_{c\Gamma,i} & C_{c,i} \end{pmatrix},$

where $C_{I,i} = R_{Ii} C R_{Ii}^\top$, $C_{\Gamma I,i}^\top = C_{I\Gamma,i} = R_{Ii} C R_{\Gamma i}^\top$, $C_{\Gamma,i} = R_{\Gamma i} C R_{\Gamma i}^\top$, $C_{c\Gamma,i}^\top = C_{\Gamma c,i} = R_{\Gamma i} C R_{ci}^\top$, and $C_{c,i} = R_{ci} C R_{ci}^\top$. The first block on the diagonal corresponds to the nodes in $\Omega_{Ii}$, the second block on the diagonal corresponds to the nodes in $\Omega_{\Gamma i}$, and the third block on the diagonal is associated with the remaining nodes.

An algebraic local SPSD splitting of the SPD matrix $C$ with respect to the $i$-th subdomain is defined to be any SPSD matrix $\widetilde{C}_i \in \mathbb{R}^{n\times n}$ of the form

$P_i \widetilde{C}_i P_i^\top = \begin{pmatrix} C_{I,i} & C_{I\Gamma,i} & 0 \\ C_{\Gamma I,i} & \widetilde{C}_{\Gamma,i} & 0 \\ 0 & 0 & 0 \end{pmatrix}$

such that the following condition holds:

$0 \le u^\top \widetilde{C}_i u \le u^\top C u, \quad \text{for all } u \in \mathbb{R}^n.$

We denote the $2\times 2$ block nonzero matrix of $P_i \widetilde{C}_i P_i^\top$ by $\widetilde{C}_{ii}$, so that

$\widetilde{C}_i = R_i^\top \widetilde{C}_{ii} R_i.$

Associated with the local SPSD splitting matrices, we define a multiplicity constant $k_m$ that satisfies the inequality

(2.6)   $0 \le \sum_{i=1}^{N} u^\top \widetilde{C}_i u \le k_m\, u^\top C u, \quad \text{for all } u \in \mathbb{R}^n.$

Note that, for any set of SPSD splitting matrices, $k_m \le N$.

The main motivation for defining splitting matrices is to find local seminorms that are bounded from above by the $C$-norm. These seminorms will be used to determine a subspace that contains the eigenvectors of $C$ associated with its smallest eigenvalues.

2.2. Two-level Schwarz method. We next review the abstract theory of the two-level Schwarz method as presented in [3]. For the sake of completeness, we present some elementary lemmas that are widely used in multilevel methods. These will be used in proving the efficiency of the two-level Schwarz preconditioner and will also help in understanding how the preconditioner is constructed.

2.2.1. Useful lemmas. The following lemma [33] provides a unified framework for bounding the spectral condition number of a preconditioned operator. It can be found in different forms for finite- and infinite-dimensional spaces. Here, we follow the presentation from [15, Lemma 7.4].


Lemma 2.1 (Fictitious Subspace Lemma). Let $C \in \mathbb{R}^{n_C\times n_C}$ and $B \in \mathbb{R}^{n_B\times n_B}$ be SPD. Let the operator $\mathcal{R}$ be defined as

$\mathcal{R} : \mathbb{R}^{n_B} \to \mathbb{R}^{n_C}, \qquad v \mapsto \mathcal{R}v,$

and let $\mathcal{R}^\top$ be its transpose. Assume the following conditions hold:
(i) $\mathcal{R}$ is surjective;
(ii) there exists $c_u > 0$ such that, for all $v \in \mathbb{R}^{n_B}$, $(\mathcal{R}v)^\top C (\mathcal{R}v) \le c_u\, v^\top B v$;
(iii) there exists $c_l > 0$ such that, for all $v_C \in \mathbb{R}^{n_C}$, there exists $v_B \in \mathbb{R}^{n_B}$ with $v_C = \mathcal{R}v_B$ and $c_l\, v_B^\top B v_B \le (\mathcal{R}v_B)^\top C (\mathcal{R}v_B) = v_C^\top C v_C$.

Then, the spectrum of the operator $\mathcal{R} B^{-1} \mathcal{R}^\top C$ is contained in the interval $[c_l, c_u]$.

The challenge is to define the second-level projection matrix $R_0$ such that the two-level additive Schwarz preconditioner $M_{\mathrm{additive}}^{-1}$ and the operator $\mathcal{R}_2$ (2.3), corresponding respectively to $B$ and $\mathcal{R}$ in Lemma 2.1, satisfy conditions (i) to (iii) and, in addition, ensure that the ratio between $c_l$ and $c_u$ is small, because this determines the quality of the preconditioner.

As shown in [15, Lemmas 7.10 and 7.11], a two-level additive Schwarz preconditioner satisfies (i) and (ii) for any full-rank $R_0$. Furthermore, the constant $c_u$ is bounded from above independently of the number of subdomains $N$, as shown in the following result [9, Theorem 12].

Lemma 2.2. Let $k_c$ be the minimum number of distinct colours such that the spaces spanned by the columns of the matrices $R_1^\top, \ldots, R_N^\top$ of the same colour are mutually $C$-orthogonal. Then,

$(\mathcal{R}_2 u_B)^\top C (\mathcal{R}_2 u_B) \le (k_c + 1) \sum_{i=0}^{N} u_i^\top C_{ii} u_i,$

for all $u_B = (u_i)_{0\le i\le N} \in \prod_{i=0}^{N} \mathbb{R}^{n_i}$.

Note that $k_c$ is independent of $N$. Indeed, it depends only on the sparsity structure of $C$ and is at most the maximum number of neighbouring subdomains.

The following result is the first step in a three-step approach to define a two-level additive Schwarz operator $\mathcal{R}_2$ that satisfies condition (iii) in Lemma 2.1.

Lemma 2.3. Let $u_B = (u_i)_{0\le i\le N} \in \prod_{i=0}^{N} \mathbb{R}^{n_i}$ and $u = \mathcal{R}_2 u_B \in \mathbb{R}^n$. Then, provided $R_0$ is of full rank,

$\sum_{i=0}^{N} u_i^\top C_{ii} u_i \le 2\, u^\top C u + (2k_c + 1) \sum_{i=1}^{N} u_i^\top C_{ii} u_i,$

where $k_c$ is defined in Lemma 2.2.


It follows that (iii) is satisfied if the squared localized seminorm $u_i^\top C_{ii} u_i$ is bounded from above by the squared $C$-norm of $u$.

In the second step, we bound $u_i^\top C_{ii} u_i$ by the squared localized seminorm defined by the SPSD splitting matrix $\widetilde{C}_i$, which in turn can be bounded by the squared $C$-norm (2.6). The decomposition $u = \sum_{i=0}^{N} R_i^\top u_i \in \mathbb{R}^n$ is termed stable if, for some $\tau > 0$,

$\tau\, u_i^\top C_{ii} u_i \le u^\top C u, \quad 1 \le i \le N.$

The two-level approach in [3] aims to decompose each $\mathbb{R}^{n_i}$ ($1 \le i \le N$) into two subspaces: one that makes the decomposition of $u$ stable, while the other forms part of the artificial subdomain associated with the second level of the preconditioner. Given the partition of unity (2.4), $u = \sum_{i=1}^{N} R_i^\top D_i R_i u$ and, if $\Pi_i = \Pi_i^\top \in \mathbb{R}^{n_i\times n_i}$, we can write

$u = \sum_{i=1}^{N} R_i^\top D_i (I_{n_i} - \Pi_i) R_i u + \sum_{i=1}^{N} R_i^\top D_i \Pi_i R_i u = \sum_{i=1}^{N} R_i^\top u_i + \sum_{i=1}^{N} R_i^\top D_i \Pi_i R_i u, \quad \text{with } u_i = D_i (I_{n_i} - \Pi_i) R_i u.$

Therefore, we need to construct $\Pi_i$ such that

$\tau\, u^\top R_i^\top (I_{n_i} - \Pi_i) D_i C_{ii} D_i (I_{n_i} - \Pi_i) R_i u \le u^\top C u.$

The following lemma shows how this can be done.

Lemma 2.4. Let $\widetilde{C}_i = R_i^\top \widetilde{C}_{ii} R_i$ be a local SPSD splitting of $C$ related to the $i$-th subdomain ($1 \le i \le N$). Let $D_i$ be the partition of unity (2.4). Let $P_{0,i}$ be the projection onto $\mathrm{range}(\widetilde{C}_{ii})$ parallel to $\ker(\widetilde{C}_{ii})$. Define $L_i = \ker(D_i C_{ii} D_i) \cap \ker(\widetilde{C}_{ii})$, and let $L_i^\perp$ denote the orthogonal complement of $L_i$ in $\ker(\widetilde{C}_{ii})$. Consider the following generalized eigenvalue problem:

find $(v_{i,k}, \lambda_{i,k}) \in \mathbb{R}^{n_i} \times \mathbb{R}$ such that $P_{0,i} D_i C_{ii} D_i P_{0,i} v_{i,k} = \lambda_{i,k}\, \widetilde{C}_{ii} v_{i,k}.$

Given $\tau > 0$, define

(2.7)   $Z_i = L_i^\perp \oplus \mathrm{span}\left\{ v_{i,k} \mid \lambda_{i,k} > \tfrac{1}{\tau} \right\}$

and let $\Pi_i$ be the orthogonal projection onto $Z_i$. Then, $Z_i$ is the subspace of smallest dimension such that, for all $u \in \mathbb{R}^n$,

$\tau\, u_i^\top C_{ii} u_i \le u^\top \widetilde{C}_i u \le u^\top C u,$

where $u_i = D_i (I_{n_i} - \Pi_i) R_i u$.

Lemma 2.5 provides the last step that we need for condition (iii) in Lemma 2.1. It defines $u_0$ and checks whether $(u_i)_{0\le i\le N}$ is a stable decomposition.

Lemma 2.5. Let $\widetilde{C}_i$, $Z_i$, and $\Pi_i$ be as in Lemma 2.4, and let $Z_i$ also denote a matrix whose columns span $Z_i$ ($1 \le i \le N$). Let the columns of the matrix $R_0^\top$ span the space

(2.8)   $Z = \bigoplus_{i=1}^{N} R_i^\top D_i Z_i.$

Let $u \in \mathbb{R}^n$ and $u_i = D_i (I_{n_i} - \Pi_i) R_i u$ ($1 \le i \le N$). Define

$u_0 = \left( R_0 R_0^\top \right)^{-1} R_0 \left( \sum_{i=1}^{N} R_i^\top D_i \Pi_i R_i u \right).$

Then,

$u = \sum_{i=0}^{N} R_i^\top u_i,$

and

$\sum_{i=0}^{N} u_i^\top C_{ii} u_i \le \left( 2 + (2k_c + 1)\frac{k_m}{\tau} \right) u^\top C u.$

Finally, using the preceding results, Theorem 2.6 presents a theoretical upper bound on the spectral condition number of the preconditioned system.

Theorem 2.6. If the two-level additive Schwarz preconditioner $M_{\mathrm{additive}}^{-1}$ (2.2) is constructed using $R_0$ as defined in Lemma 2.5, then the following inequality is satisfied:

$\kappa\left( M_{\mathrm{additive}}^{-1} C \right) \le (k_c + 1)\left( 2 + (2k_c + 1)\frac{k_m}{\tau} \right).$

2.3. Variants of the Schwarz preconditioner. So far, we have presented $M_{\mathrm{ASM}}^{-1}$, the symmetric additive Schwarz method (ASM), and $M_{\mathrm{additive}}^{-1}$, the additive correction for the second level. It was noted in [8] that using the partition of unity to weight the preconditioner can improve its quality. The resulting preconditioner is referred to as $M_{\mathrm{RAS}}^{-1}$, the restricted additive Schwarz (RAS) preconditioner, and is defined to be

(2.9)   $M_{\mathrm{RAS}}^{-1} = \sum_{i=1}^{N} R_i^\top D_i C_{ii}^{-1} R_i.$

This preconditioner is nonsymmetric and thus can only be used with iterative methods, such as GMRES [36], that are designed for nonsymmetric problems. With regard to the second level, different strategies yield either a symmetric or a nonsymmetric preconditioner [42]. Given a first-level preconditioner $M_\star^{-1}$ and setting $Q = R_0^\top C_{00}^{-1} R_0$, the balanced and deflated two-level preconditioners are, respectively,

(2.10)   $M_{\mathrm{balanced}}^{-1} = Q + (I - CQ)^\top M_\star^{-1} (I - CQ),$

and

(2.11)   $M_{\mathrm{deflated}}^{-1} = Q + M_\star^{-1} (I - CQ).$

It is well known in the literature that $M_{\mathrm{balanced}}^{-1}$ and $M_{\mathrm{deflated}}^{-1}$ yield better convergence behaviour than $M_{\mathrm{additive}}^{-1}$ (see [42] for a thorough comparison). Although the theory we present relies on $M_{\mathrm{additive}}^{-1}$, in practice we use $M_{\mathrm{balanced}}^{-1}$ and $M_{\mathrm{deflated}}^{-1}$. If the one-level preconditioner $M_\star^{-1}$ is symmetric, then so is $M_{\mathrm{balanced}}^{-1}$, while $M_{\mathrm{deflated}}^{-1}$ is typically nonsymmetric. For this reason, in the rest of the paper, we always couple $M_{\mathrm{ASM}}^{-1}$ with $M_{\mathrm{balanced}}^{-1}$, and $M_{\mathrm{RAS}}^{-1}$ with $M_{\mathrm{deflated}}^{-1}$. All three variants have the same setup cost and differ only in how the second level is applied. $M_{\mathrm{balanced}}^{-1}$ is slightly more expensive to apply because two second-level corrections (multiplications by $Q$) are required, instead of the single one needed by $M_{\mathrm{additive}}^{-1}$ and $M_{\mathrm{deflated}}^{-1}$.
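The algebra of (2.10) and (2.11) can be illustrated with small dense stand-ins. This is a toy sketch with invented data: `M_star` is a Jacobi-like placeholder for the one-level preconditioner, and `R0` is a random full-row-rank coarse space, not the $R_0$ constructed in Lemma 2.5:

```python
import numpy as np

# Dense illustration of the balanced (2.10) and deflated (2.11) variants.
rng = np.random.default_rng(0)
n, n0 = 8, 2
X = rng.standard_normal((n, n))
C = X @ X.T + n * np.eye(n)           # SPD test matrix
M_star = np.diag(1.0 / np.diag(C))    # symmetric stand-in for M_star^{-1}
R0 = rng.standard_normal((n0, n))     # full-row-rank coarse space (toy)
Q = R0.T @ np.linalg.inv(R0 @ C @ R0.T) @ R0

I = np.eye(n)
M_balanced = Q + (I - C @ Q).T @ M_star @ (I - C @ Q)   # (2.10)
M_deflated = Q + M_star @ (I - C @ Q)                   # (2.11)

# M_balanced inherits symmetry from M_star; M_deflated is generally nonsymmetric.
print(np.allclose(M_balanced, M_balanced.T))  # True
print(np.allclose(M_deflated, M_deflated.T))  # False (in general)
```

This makes concrete why the paper pairs $M_{\mathrm{ASM}}^{-1}$ with the balanced variant (usable with CG) and $M_{\mathrm{RAS}}^{-1}$ with the deflated variant (used with GMRES).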


3. The normal equations. The theory explained thus far is fully algebraic but somewhat disconnected from our initial LS problem (1.1). We now show how it can be readily applied to the normal equations matrix $C = A^\top A$, with $A \in \mathbb{R}^{m\times n}$ sparse, first defining a one-level Schwarz preconditioner and then a robust algebraic second-level correction. We start by partitioning the $n$ columns of $A$ into disjoint subsets $\Omega_{Ii}$. Let $\Xi_i$ be the set of indices of the nonzero rows of $A(:,\Omega_{Ii})$ and let $\Xi_{ci}$ be the complement of $\Xi_i$ in the set $[\![1,m]\!]$. Now define $\Omega_{\Gamma i}$ to be the complement of $\Omega_{Ii}$ in the set of indices of the nonzero columns of $A(\Xi_i,:)$. The set $\Omega_i = [\Omega_{Ii},\Omega_{\Gamma i}]$ defines the $i$-th overlapping subdomain, and we have the permuted matrix

(3.1)   $A([\Xi_i,\Xi_{ci}], [\Omega_{Ii},\Omega_{\Gamma i},\Omega_{ci}]) = \begin{pmatrix} A_{I,i} & A_{I\Gamma,i} & 0 \\ 0 & A_{\Gamma,i} & A_{c,i} \end{pmatrix}.$

To illustrate the concepts and notation, consider the $5\times 4$ matrix

$A = \begin{pmatrix} 1 & 0 & 6 & 0 \\ 2 & 4 & 0 & 0 \\ 3 & 0 & 0 & 0 \\ 0 & 5 & 0 & 7 \\ 0 & 0 & 0 & 8 \end{pmatrix}$

and set $N = 2$, $\Omega_{I1} = \{1,3\}$, $\Omega_{I2} = \{2,4\}$. Consider the first subdomain. We have

$A(:,\Omega_{I1}) = \begin{pmatrix} 1 & 6 \\ 2 & 0 \\ 3 & 0 \\ 0 & 0 \\ 0 & 0 \end{pmatrix}.$

The set of indices of the nonzero rows is $\Xi_1 = \{1,2,3\}$, and its complement is $\Xi_{c1} = \{4,5\}$. To define $\Omega_{\Gamma 1}$, select the nonzero columns in the submatrix $A(\Xi_1,:)$ and remove those already in $\Omega_{I1}$, that is,

(3.2)   $A(\Xi_1,:) = \begin{pmatrix} 1 & 0 & 6 & 0 \\ 2 & 4 & 0 & 0 \\ 3 & 0 & 0 & 0 \end{pmatrix},$

so that $\Omega_{\Gamma 1} = \{2\}$ and $\Omega_{c1} = \{4\}$. Permuting $A$ to the form (3.1) gives

$A([\Xi_1,\Xi_{c1}], [\Omega_{I1},\Omega_{\Gamma 1},\Omega_{c1}]) = \begin{pmatrix} 1 & 6 & 0 & 0 \\ 2 & 0 & 4 & 0 \\ 3 & 0 & 0 & 0 \\ 0 & 0 & 5 & 7 \\ 0 & 0 & 0 & 8 \end{pmatrix}.$

In the same way, consider the second subdomain. $\Omega_{I2} = \{2,4\}$ and

$A(:,\Omega_{I2}) = \begin{pmatrix} 0 & 0 \\ 4 & 0 \\ 0 & 0 \\ 5 & 7 \\ 0 & 8 \end{pmatrix},$

so that $\Xi_2 = \{2,4,5\}$ and $\Xi_{c2} = \{1,3\}$. To define $\Omega_{\Gamma 2}$, select the nonzero columns in the submatrix $A(\Xi_2,:)$ and remove those already in $\Omega_{I2}$, that is,

(3.3)   $A(\Xi_2,:) = \begin{pmatrix} 2 & 4 & 0 & 0 \\ 0 & 5 & 0 & 7 \\ 0 & 0 & 0 & 8 \end{pmatrix},$

which gives $\Omega_{\Gamma 2} = \{1\}$ and $\Omega_{c2} = \{3\}$. Permuting $A$ to the form (3.1) gives

$A([\Xi_2,\Xi_{c2}], [\Omega_{I2},\Omega_{\Gamma 2},\Omega_{c2}]) = \begin{pmatrix} 4 & 0 & 2 & 0 \\ 5 & 7 & 0 & 0 \\ 0 & 8 & 0 & 0 \\ 0 & 0 & 1 & 6 \\ 0 & 0 & 3 & 0 \end{pmatrix}.$

Now that we have $\Omega_{Ii}$ and $\Omega_{\Gamma i}$, we can define the restriction operators

$R_1 = I_4(\Omega_1,:) = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix}, \qquad R_2 = I_4(\Omega_2,:) = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \end{pmatrix}.$

For our example, $n_{I1} = n_{I2} = 2$ and $n_{\Gamma 1} = n_{\Gamma 2} = 1$. The partition of unity matrices $D_i$ are of dimension $(n_{Ii} + n_{\Gamma i}) \times (n_{Ii} + n_{\Gamma i})$ ($i = 1,2$) and have ones in the $n_{Ii}$ leading diagonal entries and zeros elsewhere, so that

(3.4)   $D_1 = D_2 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{pmatrix}.$

Observe that $D_i(k,k)$ scales the column $A(:,\Omega_i(k))$.

Note that it is possible to obtain the partitioning sets and the sets of indices using the normal equations matrix $C$. Most graph partitioners, especially those that are implemented in parallel, require an undirected graph (corresponding to a matrix with a symmetric sparsity pattern). Therefore, in practice, we use the graph of $C$ to set up the first-level preconditioner for LS problems.
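The index-set construction above is easily expressed in code. The following sketch reproduces the worked example with numpy (0-based indices, so $\Omega_{I1} = \{1,3\}$ becomes `[0, 2]`; the helper `overlap` is our own name):

```python
import numpy as np

# From the nonoverlapping column sets Omega_Ii, recover Xi_i (nonzero rows
# of A(:, Omega_Ii)) and Omega_Gamma_i (extra nonzero columns of A(Xi_i, :)).
A = np.array([[1, 0, 6, 0],
              [2, 4, 0, 0],
              [3, 0, 0, 0],
              [0, 5, 0, 7],
              [0, 0, 0, 8]])

def overlap(A, omega_I):
    xi = np.flatnonzero(A[:, omega_I].any(axis=1))       # nonzero rows
    cols = np.flatnonzero(A[xi, :].any(axis=0))          # nonzero columns of A(Xi_i, :)
    omega_gamma = [int(j) for j in cols if j not in omega_I]
    return [int(i) for i in xi], omega_gamma

xi1, gamma1 = overlap(A, [0, 2])
xi2, gamma2 = overlap(A, [1, 3])
print(xi1, gamma1)  # [0, 1, 2] [1]   i.e. Xi_1 = {1,2,3}, Omega_Gamma1 = {2}
print(xi2, gamma2)  # [1, 3, 4] [0]   i.e. Xi_2 = {2,4,5}, Omega_Gamma2 = {1}
```

In a distributed implementation the same information is extracted from the local rows of $A$; only the (symmetric) graph of $C$ is handed to the partitioner.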

3.1. One-level DD for the normal equations. This section presents the one-level additive Schwarz preconditioner for the normal equations matrix $C = A^\top A$. Following (2.1) and given the sets $\Omega_{Ii}$, $\Omega_{\Gamma i}$, and $\Xi_i$, the one-level Schwarz preconditioner of $C = A^\top A$ is

$M_{\mathrm{ASM}}^{-1} = \sum_{i=1}^{N} R_i^\top \left( R_i A^\top A R_i^\top \right)^{-1} R_i = \sum_{i=1}^{N} R_i^\top \left( A(:,\Omega_i)^\top A(:,\Omega_i) \right)^{-1} R_i.$
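As a dense sanity check (our own toy sketch; a practical implementation factorizes each $C_{ii}$ rather than inverting it), this preconditioner can be assembled for the worked example and verified to yield a preconditioned matrix with real, positive eigenvalues:

```python
import numpy as np

# One-level ASM for the normal equations of the 5x4 example.
A = np.array([[1., 0, 6, 0],
              [2, 4, 0, 0],
              [3, 0, 0, 0],
              [0, 5, 0, 7],
              [0, 0, 0, 8]])
C = A.T @ A
omega = [[0, 2, 1], [1, 3, 0]]   # Omega_1, Omega_2 (0-based, interior columns first)
n = 4

M_inv = np.zeros((n, n))
for idx in omega:
    Ri = np.eye(n)[idx, :]                   # R_i = I_n(Omega_i, :)
    Cii = A[:, idx].T @ A[:, idx]            # C_ii = A(:, Omega_i)^T A(:, Omega_i)
    M_inv += Ri.T @ np.linalg.inv(Cii) @ Ri

# M_ASM^{-1} is SPD (the Omega_i cover all columns), so M^{-1} C is similar
# to an SPD matrix and has real, positive eigenvalues.
print(np.all(np.linalg.eigvals(M_inv @ C).real > 0))  # True
```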

Remark 3.1. Note that the local matrix $C_{ii} = A(:,\Omega_i)^\top A(:,\Omega_i)$ need not be computed explicitly in order to be factorized. Instead, the Cholesky factor of $C_{ii}$ can be computed via a “thin” QR factorization of $A(:,\Omega_i)$.
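A minimal numpy illustration of Remark 3.1 (our own sketch; `Ai` is a random stand-in for $A(:,\Omega_i)$): the triangular factor $R$ of a thin QR factorization of $A(:,\Omega_i)$ coincides, up to row signs, with the transposed Cholesky factor of $C_{ii}$, so $C_{ii}$ never needs to be formed.

```python
import numpy as np

rng = np.random.default_rng(1)
Ai = rng.standard_normal((20, 5))    # stand-in for A(:, Omega_i)

Cii = Ai.T @ Ai
L = np.linalg.cholesky(Cii)          # Cii = L L^T (for comparison only)
R = np.linalg.qr(Ai, mode='r')       # thin QR: Ai = Q R, R returned alone
R *= np.sign(np.diag(R))[:, None]    # normalize row signs so diag(R) > 0

print(np.allclose(R, L.T))  # True: R is the transposed Cholesky factor of Cii
```

Working with $R$ from QR avoids the squaring of the condition number incurred by forming $C_{ii}$ explicitly.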


3.2. Algebraic local SPSD splitting of the normal equations matrix. In this section, we show how to cheaply construct algebraic local SPSD splittings for sparse matrices of the form $C = A^\top A$. Combining (2.5) and (3.1), we can write

$P_i A^\top A P_i^\top = \begin{pmatrix} A_{I,i}^\top A_{I,i} & A_{I,i}^\top A_{I\Gamma,i} & \\ A_{I\Gamma,i}^\top A_{I,i} & A_{I\Gamma,i}^\top A_{I\Gamma,i} + A_{\Gamma,i}^\top A_{\Gamma,i} & A_{\Gamma,i}^\top A_{c,i} \\ & A_{c,i}^\top A_{\Gamma,i} & A_{c,i}^\top A_{c,i} \end{pmatrix},$

where $P_i = I_n([\Omega_{Ii},\Omega_{\Gamma i},\Omega_{ci}],:)$ is a permutation matrix. A straightforward splitting of $P_i A^\top A P_i^\top$ is given by

$P_i A^\top A P_i^\top = \begin{pmatrix} A_{I,i}^\top A_{I,i} & A_{I,i}^\top A_{I\Gamma,i} & 0 \\ A_{I\Gamma,i}^\top A_{I,i} & A_{I\Gamma,i}^\top A_{I\Gamma,i} & 0 \\ 0 & 0 & 0 \end{pmatrix} + \begin{pmatrix} 0 & 0 & 0 \\ 0 & A_{\Gamma,i}^\top A_{\Gamma,i} & A_{\Gamma,i}^\top A_{c,i} \\ 0 & A_{c,i}^\top A_{\Gamma,i} & A_{c,i}^\top A_{c,i} \end{pmatrix}.$

It is clear that both summands are SPSD. Indeed, they both have the form $X^\top X$, where $X$ is $\begin{pmatrix} A_{I,i} & A_{I\Gamma,i} & 0 \end{pmatrix}$ and $\begin{pmatrix} 0 & A_{\Gamma,i} & A_{c,i} \end{pmatrix}$, respectively. The local SPSD splitting matrix related to the $i$-th subdomain is then defined as

(3.5)   $\widetilde{C}_{ii} = A(\Xi_i,\Omega_i)^\top A(\Xi_i,\Omega_i) = \begin{pmatrix} A_{I,i} & A_{I\Gamma,i} \end{pmatrix}^\top \begin{pmatrix} A_{I,i} & A_{I\Gamma,i} \end{pmatrix},$

and $\widetilde{C}_i = R_i^\top \widetilde{C}_{ii} R_i = A(\Xi_i,:)^\top A(\Xi_i,:)$.
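The splitting can be checked numerically on the worked example (our own verification sketch). Each $\widetilde{C}_i = A(\Xi_i,:)^\top A(\Xi_i,:)$ drops a subset of the rank-one terms $A(j,:)^\top A(j,:)$ that sum to $C$, which makes both inequalities of the definition immediate:

```python
import numpy as np

A = np.array([[1., 0, 6, 0],
              [2, 4, 0, 0],
              [3, 0, 0, 0],
              [0, 5, 0, 7],
              [0, 0, 0, 8]])
C = A.T @ A
xi = [[0, 1, 2], [1, 3, 4]]   # Xi_1, Xi_2 (0-based), from the worked example
C_tilde = [A[r, :].T @ A[r, :] for r in xi]

for Ct in C_tilde:
    # Both C_tilde_i and C - C_tilde_i are sums of rank-one SPSD terms.
    assert np.linalg.eigvalsh(Ct).min() > -1e-10
    assert np.linalg.eigvalsh(C - Ct).min() > -1e-10

# Multiplicity bound (2.6): row 2 of A (index 1 here) lies in both Xi_1 and
# Xi_2, so k_m = 2 and sum_i C_tilde_i <= 2 C in the SPSD ordering.
S = sum(C_tilde)
assert np.linalg.eigvalsh(2 * C - S).min() > -1e-10
print("SPSD splitting verified, k_m = 2")
```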

Hence, the theory presented in [3] and summarised in subsection 2.2 is applicable. In particular, the two-level Schwarz preconditioner $M_{\mathrm{additive}}^{-1}$ (2.2) satisfies

$\kappa\left( M_{\mathrm{additive}}^{-1} C \right) \le (k_c + 1)\left( 2 + (2k_c + 1)\frac{k_m}{\tau} \right),$

where $k_c$ is the minimal number of colours required to colour the partitions of $C$ such that any two neighbouring subdomains have different colours, and $k_m$ is the multiplicity constant that satisfies the inequality

$\sum_{i=1}^{N} R_i^\top \widetilde{C}_{ii} R_i \le k_m C.$

The constant $k_c$ is independent of $N$ and depends only on the graph $G(C)$, which is determined by the sparsity pattern of $A$. The multiplicity constant $k_m$ depends on the local SPSD splitting matrices. For the normal equations matrix, the following lemma provides an upper bound on $k_m$.

Lemma 3.2. Let $C = A^\top A$. Let $m_j$ be the number of subdomains $i$ ($1 \le i \le N$) such that $A(j,\Omega_{Ii}) \ne 0$, that is,

$m_j = \#\{ i \mid j \in \Xi_i \}.$

Then, $k_m$ can be chosen to be $k_m = \max_{1\le j\le m} m_j$. Furthermore, if $k_{\Omega_i}$ is the number of neighbouring subdomains of the $i$-th subdomain, that is,

$k_{\Omega_i} = \#\{ j \mid \Omega_i \cap \Omega_j \ne \emptyset \},$

then

$k_m = \max_{1\le j\le m} m_j \le \max_{1\le i\le N} k_{\Omega_i}.$


Proof. Since $C = A^\top A$ and $\widetilde{C}_i = A(\Xi_i,:)^\top A(\Xi_i,:)$, we have

$u^\top C u = \sum_{j=1}^{m} u^\top A(j,:)^\top A(j,:) u, \qquad u^\top \widetilde{C}_i u = \sum_{j\in\Xi_i} u^\top A(j,:)^\top A(j,:) u,$

$\sum_{i=1}^{N} u^\top \widetilde{C}_i u = \sum_{i=1}^{N} \sum_{j\in\Xi_i} u^\top A(j,:)^\top A(j,:) u.$

From the definition of $m_j$, the term $u^\top A(j,:)^\top A(j,:) u$ appears $m_j$ times in the last equation. Thus,

$\sum_{i=1}^{N} u^\top \widetilde{C}_i u = \sum_{j=1}^{m} m_j\, u^\top A(j,:)^\top A(j,:) u \le \max_{1\le j\le m} m_j \sum_{j=1}^{m} u^\top A(j,:)^\top A(j,:) u = \max_{1\le j\le m} m_j\, (u^\top C u),$

from which it follows that we can choose $k_m = \max_{1\le j\le m} m_j$. Now, if $1 \le l \le m$, there exist $i_1, \ldots, i_{m_l}$ such that $l \in \Xi_{i_1} \cap \cdots \cap \Xi_{i_{m_l}}$. Furthermore, $m_l \le \max_{1\le p\le m_l} k_{\Omega_{i_p}}$. Taking the maximum over $l$ on both sides, we obtain

$k_m \le \max_{1\le i\le N} k_{\Omega_i}.$

Note that, because $A$ is sparse, $k_m$ is independent of the number of subdomains.

3.3. Algorithms and technical details. In this section, we discuss the technical details involved in constructing a two-level preconditioner for the normal equations matrix.

3.3.1. Partition of unity. Because the matrix $A_{I\Gamma,i}$ may be of low rank, the null space of $\widetilde{C}_{ii}$ (3.5) can be large. Recall that the diagonal matrices $D_i$ have dimension $n_i = n_{Ii} + n_{\Gamma i}$. Choosing the entries in positions $n_{Ii}+1, \ldots, n_i$ of the diagonal of $D_i$ to be zero, as in (3.4), causes the subspace of $\ker(\widetilde{C}_{ii})$ arising from the rank deficiency of $A_{I\Gamma,i}$ to lie within $\ker(D_i C_{ii} D_i)$, reducing the size of the space $Z$ given by (2.8). In other words, if $A_{I\Gamma,i} u = 0$, then $\widetilde{C}_{ii} v = 0$, where $v^\top = (0, u^\top)$, i.e., $v \in \ker(\widetilde{C}_{ii})$; because by construction $D_i v = 0$, we have $v \in \ker(\widetilde{C}_{ii}) \cap \ker(D_i C_{ii} D_i)$, and therefore $v$ need not be included in $Z_i$.

3.3.2. The eigenvalue problem. The generalized eigenvalue problem presented in Lemma 2.4 is critical in the construction of the two-level preconditioner. Although the definition of $Z_i$ in (2.7) suggests that it is necessary to compute the null spaces of $\widetilde{C}_{ii}$ and of $D_i C_{ii} D_i$ and their intersection, in practice this can be avoided. Consider the generalized eigenvalue problem

(3.6)   $D_i C_{ii} D_i v = \lambda \widetilde{C}_{ii} v,$

where, by convention, we set $\lambda = 0$ if $v \in \ker(\widetilde{C}_{ii}) \cap \ker(D_i C_{ii} D_i)$ and $\lambda = \infty$ if $v \in \ker(\widetilde{C}_{ii}) \setminus \ker(D_i C_{ii} D_i)$. The subspace $Z_i$ defined in (2.7) can then be written as

$\mathrm{span}\left\{ v \mid D_i C_{ii} D_i v = \lambda \widetilde{C}_{ii} v \text{ and } \lambda > \tfrac{1}{\tau} \right\}.$

Consider also the shifted generalized eigenvalue problem

(3.7)   $D_i C_{ii} D_i v = \lambda (\widetilde{C}_{ii} + s I_{n_i}) v,$

where $0 < s \ll 1$. Note that if $s$ is such that $\widetilde{C}_{ii} + s I_{n_i}$ is numerically of full rank, (3.7) can be solved using any off-the-shelf generalized eigenproblem solver. Let $(v,\lambda)$ be an eigenpair of (3.7). Then, only one of the following situations can occur:
• $v \in \mathrm{range}(\widetilde{C}_{ii}) \cap \ker(D_i C_{ii} D_i)$ or $v \in \ker(\widetilde{C}_{ii}) \cap \ker(D_i C_{ii} D_i)$. In this case, $(v, 0)$ is an eigenpair of (3.6).
• $v \in \mathrm{range}(\widetilde{C}_{ii}) \cap \mathrm{range}(D_i C_{ii} D_i)$. Then,

$\frac{\| D_i C_{ii} D_i v - \lambda \widetilde{C}_{ii} v \|_2}{\lambda \| v \|_2} = s,$

and, as $s$ is small, $(v,\lambda)$ is a good approximation of an eigenpair of (3.6) corresponding to a finite eigenvalue.
• $v \in \ker(\widetilde{C}_{ii}) \cap \mathrm{range}(D_i C_{ii} D_i)$. Then, $D_i C_{ii} D_i v = \lambda s v$, i.e., $\lambda s$ is a nonzero eigenvalue of $D_i C_{ii} D_i$. Because $D_i$ is defined such that the diagonal values corresponding to the boundary nodes are zero, the nonzero eigenvalues of $D_i C_{ii} D_i$ correspond to the squared singular values of $A(:,\Omega_{Ii})$. Hence, all the eigenpairs of (3.6) corresponding to an infinite eigenvalue are included in the set of eigenpairs $(v,\lambda)$ of (3.7) such that

(3.8)   $\sigma_{\min}^2(A(:,\Omega_{Ii})) \le \lambda s \le \sigma_{\max}^2(A(:,\Omega_{Ii})),$

where $\sigma_{\min}(A(:,\Omega_{Ii}))$ and $\sigma_{\max}(A(:,\Omega_{Ii}))$ are the smallest and largest singular values of $A(:,\Omega_{Ii})$, respectively.

Note that, in practice, $A(:,\Omega_{Ii})$ is well conditioned because, if it were not, there would be local linear dependence between some columns of $A$. Therefore, choosing

$s = O(\| \widetilde{C}_{ii} \|_2\, \varepsilon),$

where $\varepsilon$ is the machine precision, ensures that $\widetilde{C}_{ii} + s I_{n_i}$ is numerically invertible and $s \ll 1$. Setting $s = \| \widetilde{C}_{ii} \|_2\, \varepsilon$ in (3.8), we obtain

$\sigma_{\min}^2(A(:,\Omega_{Ii})) \le \lambda \| \widetilde{C}_{ii} \|_2\, \varepsilon \le \sigma_{\max}^2(A(:,\Omega_{Ii})).$

By (3.5), we have $\| \widetilde{C}_{ii} \|_2 \le \| C_{ii} \|_2$ and, because $\Omega_{Ii} \subset \Omega_i$, it follows that

$\| C_{ii}^{-1} \|_2^{-1} = \left\| \left( A(:,\Omega_i)^\top A(:,\Omega_i) \right)^{-1} \right\|_2^{-1} \le \sigma_{\min}^2(A(:,\Omega_{Ii})).$

Hence, if $(v,\lambda)$ is an eigenpair of (3.7) with $v \in \ker(\widetilde{C}_{ii}) \cap \mathrm{range}(D_i C_{ii} D_i)$, then

$(\kappa(C_{ii})\,\varepsilon)^{-1} \le \lambda,$


where $\kappa(C_{ii})$ is the condition number of $C_{ii}$, and $Z_i$ can be defined to be

(3.9)   $\mathrm{span}\left\{ v \mid D_i C_{ii} D_i v = \lambda (\widetilde{C}_{ii} + \varepsilon \| \widetilde{C}_{ii} \|_2 I_{n_i}) v \text{ and } \lambda \ge \min\left( \tfrac{1}{\tau}, (\kappa(C_{ii})\varepsilon)^{-1} \right) \right\}.$

The matrix $Z_i$ is then formed by concatenating the corresponding eigenvectors as its columns.
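The shifted formulation (3.7) and the selection rule (3.9) can be sketched densely for subdomain 1 of the worked example (our own illustration with $\tau = 0.1$; the whitening via a Cholesky factor is one standard dense way to solve a symmetric-definite generalized eigenproblem, whereas the paper solves these problems matrix-free, see Remark 3.3):

```python
import numpy as np

A = np.array([[1., 0, 6, 0],
              [2, 4, 0, 0],
              [3, 0, 0, 0],
              [0, 5, 0, 7],
              [0, 0, 0, 8]])
cols, rows = [0, 2, 1], [0, 1, 2]           # Omega_1 = [Omega_I1, Omega_Gamma1], Xi_1
D = np.diag([1., 1., 0.])                   # partition of unity (3.4)
Cii = A[:, cols].T @ A[:, cols]
Ct = A[rows][:, cols].T @ A[rows][:, cols]  # C_tilde_ii from (3.5)

eps = np.finfo(float).eps
s = np.linalg.norm(Ct, 2) * eps             # shift s = ||C_tilde_ii||_2 * eps
L = np.linalg.cholesky(Ct + s * np.eye(3))  # whiten the right-hand side matrix
Linv = np.linalg.inv(L)
lam, W = np.linalg.eigh(Linv @ (D @ Cii @ D) @ Linv.T)
V = Linv.T @ W                              # generalized eigenvectors of (3.7)

tau = 0.1
thresh = min(1.0 / tau, 1.0 / (np.linalg.cond(Cii) * eps))
Zi = V[:, lam >= thresh]                    # columns spanning Z_i, cf. (3.9)
# Every lambda here is below 1/tau = 10: this small, well-conditioned local
# problem contributes no coarse vectors, so Zi is empty.
print(Zi.shape)  # (3, 0)
```

On hard, ill-conditioned problems some $\lambda$ exceed the threshold and the selected eigenvectors populate the second level.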

Remark 3.3. Note that solving the generalized eigenvalue problem (3.7) by an iterative method such as Krylov–Schur [41] does not require the explicit form of C_ii and C̃_ii. Rather, it requires solving linear systems of the form (C̃_ii + sI_ni)u = v, together with matrix–vector products of the form (C̃_ii + sI_ni)v and C_ii v. It is clear that these products do not require the matrices C̃_ii and C_ii to be formed. Regarding the solution of the linear system (C̃_ii + sI_ni)u = v, Remark 3.1 also applies to the Cholesky factorization of C̃_ii + sI_ni = XᵀX, where Xᵀ = [A(Ξ_i,Ω_i)ᵀ  √s I_ni], which can be computed by using a "thin" QR factorization of X.
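The factorization trick of Remark 3.3 can be checked numerically in a few lines. The sketch below uses a random dense stand-in for A(Ξ_i,Ω_i) (the name A_loc and the particular shift are ours, chosen only for illustration): the R factor of a thin QR of the stacked matrix X is a Cholesky factor of C̃_ii + sI without the normal equations product ever being formed.

```python
import numpy as np

# Remark 3.3 in action: a Cholesky factor of C~_ii + s I can be read off
# from a thin QR factorization of X = [A(Xi_i, Omega_i); sqrt(s) I],
# without ever forming C~_ii.  A_loc below is a random stand-in.
rng = np.random.default_rng(1)
A_loc = rng.standard_normal((8, 4))            # plays the role of A(Xi_i, Omega_i)
s = 1e-8 * np.linalg.norm(A_loc.T @ A_loc, 2)  # small shift, O(||C~_ii||_2) scaled
X = np.vstack([A_loc, np.sqrt(s) * np.eye(4)])
R = np.linalg.qr(X, mode="r")                  # R^T R = C~_ii + s I
```

Since XᵀX = A_locᵀA_loc + sI by construction, RᵀR reproduces the shifted matrix to machine precision.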

From Remarks 3.1 and 3.3, and by applying the same technique therein to factor C_00 = R_0 C R_0ᵀ = (AR_0ᵀ)ᵀ(AR_0ᵀ), we observe that, given the overlapping partitions of A, the proposed two-level preconditioner can be constructed without forming the normal equations matrix. Algorithm 3.1 gives an overview of the steps for constructing our two-level Schwarz preconditioner for the normal equations matrix. The actual implementation of our proposed preconditioner will be discussed in greater detail in subsection 4.1.

Algorithm 3.1 Two-level Schwarz preconditioner for the normal equations matrix
Input: matrix A, number of subdomains N, threshold τ to bound the condition number.
Output: two-level preconditioner M⁻¹ for C = AᵀA.
1: (Ω_I1, …, Ω_IN) = Partition(A, N)
2: for i = 1 to N in parallel do
3:    Ξ_i = FindNonzeroRows(A(:,Ω_Ii))
4:    Ω_i = [Ω_Ii, Ω_Γi] = FindNonzeroColumns(A(Ξ_i,:))
5:    Define D_i as in subsection 3.3.1 and R_i as in section 2
6:    Perform Cholesky factorization of C_ii = A(:,Ω_i)ᵀ A(:,Ω_i), see Remark 3.1
7:    Perform Cholesky factorization of C̃_ii = A(Ξ_i,Ω_i)ᵀ A(Ξ_i,Ω_i), possibly using a small shift s, see Remark 3.3
8:    Compute Z_i as defined in (3.9)
9: end for
10: Set R_0ᵀ = [R_1ᵀ D_1 Z_1, …, R_Nᵀ D_N Z_N]
11: Perform Cholesky factorization of C_00 = (AR_0ᵀ)ᵀ(AR_0ᵀ)
12: Set M⁻¹ = M⁻¹_additive = Σ_{i=0}^{N} R_iᵀ C_ii⁻¹ R_i, or M⁻¹_balanced (2.10), or M⁻¹_deflated (2.11)
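The per-subdomain work of Algorithm 3.1 (lines 3 to 8) can be illustrated with a serial, dense sketch. The function below is ours and is only a toy analogue of the paper's PETSc/SLEPc implementation: matrices are formed densely, the partition of unity is simplified to a binary interior indicator (1 on Ω_Ii, 0 on the overlap), and the generalized eigenproblem (3.7) is solved with a dense solver.

```python
import numpy as np
import scipy.sparse as sp
from scipy.linalg import eigh

def local_subdomain_step(A, omega_I, tau=0.6):
    """Serial, dense illustration of lines 3-8 of Algorithm 3.1 for one
    subdomain (hypothetical helper; the actual code relies on PETSc/SLEPc
    and never forms these matrices densely)."""
    A = sp.csr_matrix(A)
    # Line 3: rows of A with a nonzero in the columns Omega_Ii.
    xi = np.unique(A[:, omega_I].nonzero()[0])
    # Line 4: columns of A(Xi_i, :) with a nonzero, i.e., the overlapping Omega_i.
    omega = np.unique(A[xi, :].nonzero()[1])
    # Line 5 (simplified): partition of unity that is 1 on interior columns
    # and 0 on the overlap, so boundary diagonal values are zero.
    D = np.diag(np.isin(omega, omega_I).astype(float))
    # Lines 6-7: local normal equations matrices (formed densely here
    # only for illustration).
    C_ii = (A[:, omega].T @ A[:, omega]).toarray()
    Ct_ii = (A[xi][:, omega].T @ A[xi][:, omega]).toarray()
    # Line 8: generalized eigenproblem (3.7) with the small shift s.
    s = 1e-8 * np.linalg.norm(Ct_ii, "fro")
    lam, V = eigh(D @ C_ii @ D, Ct_ii + s * np.eye(len(omega)))
    # Keep the eigenvectors whose eigenvalue reaches the threshold 1/tau.
    return xi, omega, D, V[:, lam >= 1.0 / tau]

# Tiny demo: a 5x4 matrix, first subdomain owns columns {0, 1}.
A_demo = np.array([[1., 0., 0., 0.],
                   [2., 1., 0., 0.],
                   [0., 1., 1., 0.],
                   [0., 0., 1., 1.],
                   [0., 0., 0., 1.]])
xi, omega, D, Z = local_subdomain_step(A_demo, [0, 1])
```

On this toy problem the overlapping subdomain is Ω_i = {0, 1, 2}, and only the eigenvectors whose eigenvalue exceeds 1/τ contribute columns to Z_i.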

4. Numerical experiments. In this section, we illustrate the effectiveness of the new two-level LS preconditioners M⁻¹_balanced and M⁻¹_deflated, their robustness with respect to the number of subdomains, and their efficiency in tackling large-scale sparse and ill-conditioned LS problems selected from the SuiteSparse Matrix Collection [12]. The test matrices are listed in Table 1. For each matrix, we report its dimensions, the number of entries in A and in the normal equations matrix C, and the condition number of C (estimated using the MATLAB function condest).


Table 1
Test matrices taken from the SuiteSparse Matrix Collection

Identifier      m          n          nnz(A)     nnz(C)       condest(C)
mesh_deform     234,023    9,393      853,829    117,117      2.7·10^6
EternityII_E    262,144    11,077     1,503,732  1,109,181    5.1·10^19
lp_stocfor3     23,541     16,675     72,721     223,395      4.0·10^10
deltaX          68,600     21,961     247,424    2,623,073    3.7·10^20
sc205-2r        62,423     35,213     123,239    12,984,043   1.7·10^7
stormg2-125     172,431    65,935     433,256    1,953,519    ∞
Rucci1          1,977,885  109,900    7,791,168  9,747,744    2.0·10^8
image_interp    232,485    120,000    711,683    1,555,994    4.7·10^7
mk13-b5         270,270    135,135    810,810    1,756,755    ∞
pds-100         514,577    156,016    1,096,002  1,470,688    ∞
fome21          267,596    216,350    465,294    640,240      ∞
sgpf5y6         312,540    246,077    831,976    2,761,021    6.0·10^6
Hardesty2       929,901    303,645    4,020,731  3,936,209    1.2·10^10
Delor338K       450,807    343,236    4,211,599  44,723,076   1.5·10^7
watson_2        677,224    352,013    1,846,391  3,390,279    1.0·10^7
stormG2_1000    1,377,306  526,185    3,459,881  82,987,269   ∞
LargeRegFile    2,111,154  801,374    4,944,201  6,378,592    3.0·10^8
cont11_l        1,961,394  1,468,599  5,382,999  18,064,261   2.0·10^10

In subsection 4.1, we discuss our implementation based on the parallel backend PETSc [6]. In particular, we show that very little coding effort is needed to construct all the necessary algebraic tools, and that it is possible to take advantage of an existing package, such as HPDDM [26], to set up the new preconditioners efficiently. We then show in subsection 4.2 how M⁻¹_balanced and M⁻¹_deflated perform compared to other preconditioners when solving challenging LS problems. The preconditioners we consider are:
• limited memory incomplete Cholesky (IC) factorization specialized for the normal equations matrix, as implemented in HSL_MI35 from the HSL library [24] (note that this package is written in Fortran and we run it using the supplied MATLAB interface with default parameter settings);
• one-level overlapping Schwarz methods M⁻¹_ASM and M⁻¹_RAS, as implemented in PETSc;
• algebraic multigrid methods, as implemented both in BoomerAMG from the HYPRE library [19] and in GAMG [1] from PETSc.
Finally, in subsection 4.3, we study the strong scalability of M⁻¹_balanced and its robustness with respect to the number of subdomains by using a fixed problem and increasing the number of subdomains.

With the exception of the serial IC code HSL_MI35, all the numerical experiments are performed on Irène, a system composed of 2,292 nodes with two 64-core AMD Rome processors clocked at 2.6 GHz and, unless stated otherwise, 256 MPI processes are used. For the domain decomposition methods, one subdomain is assigned per process.

In all our experiments, the vector b in (1.1) is generated randomly and the initial guess for the iterative solver is zero. When constructing our new two-level preconditioners, with the exception of the results presented in Figure 1, at most 300 eigenpairs are computed on each subdomain and the threshold parameter τ from (3.9) is set to 0.6. These parameters were found to provide good numerical performance after a very quick trial-and-error approach on a single problem. We did not want to adjust them for each problem from Table 1, but it will be shown next that they are fine overall without additional tuning.

4.1. Implementation aspects. The new two-level preconditioners are implemented on top of the well-known distributed memory library PETSc. This section is not aimed at PETSc specialists. Rather, we want to briefly explain what was needed to provide an efficient yet concise implementation. Our new code is open-source, available at https://github.com/prj-/aldaas2021robust. It comprises fewer than 150 lines of code (including the initialization and error analysis). The main source files, written in Fortran, C, and Python, have three major phases, which we now outline.

4.1.1. Loading and partitioning phase. First, PETSc is used to load the matrix A in parallel, following a contiguous one-dimensional row partitioning among MPI processes. We explicitly assemble the normal equations matrix using the routine MatTransposeMatMult [31]. The initial PETSc-enforced parallel decomposition of A among processes may not be appropriate for the normal equations, so ParMETIS is used by PETSc to repartition C. This also induces a permutation of the columns of A.

4.1.2. Setup phase. To ensure that the normal equations matrix C is definite, it is shifted by 10⁻¹⁰‖C‖_F I_n (here and elsewhere, ‖·‖_F denotes the Frobenius norm). Note that this shift is only needed to set up the preconditioner; the preconditioner is used to solve the original LS problem. Given the indices of the columns owned by an MPI process, we call the routine MatIncreaseOverlap on the normal equations matrix to build an extended set of column indices of A that will be used to define overlapping subdomains. These are the Ω_i as defined in (3.1). Using the routine MatFindNonzeroRows, this extended set of indices is used to concurrently find on each subdomain the set of nonzero rows. These are the sets Ξ_i as illustrated in (3.2) and (3.3). The subdomain matrices C_ii from (2.1), as well as the partition of unity D_i as illustrated in (3.4), are automatically assembled by PETSc when using domain decomposition preconditioners such as PCASM or PCHPDDM. The right-hand side matrices of the generalized eigenvalue problems (3.6) are assembled using MatTransposeMatMult, but note that this product is this time performed concurrently on each subdomain. The small shift s from (3.7) is set to 10⁻⁸‖C̃_ii‖_F. These matrices and the sets of overlapping column indices are passed to PCHPDDM using the routine PCHPDDMSetAuxiliaryMat. The rest of the setup is hidden from the user. It includes solving the generalized eigenvalue problems using SLEPc [23], followed by the assembly and redistribution of the second-level operator using a Galerkin product (2.2) (see [25] for more details on how this is performed efficiently in PCHPDDM).
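The overlap-growing step performed by MatIncreaseOverlap on the normal equations matrix amounts to one sweep of graph adjacency on the sparsity pattern of C. The sketch below is a hypothetical serial stand-in for that routine (the function name is ours, and the real computation is distributed inside PETSc); it shows how an owned index set grows into an overlapping Ω_i.

```python
import numpy as np
import scipy.sparse as sp

def increase_overlap(C, owned, levels=1):
    """Serial analogue of what MatIncreaseOverlap computes on the normal
    equations matrix: grow an index set by its neighbors in the adjacency
    graph of C.  Illustrative stand-in, not the PETSc routine."""
    C = sp.csr_matrix(C)
    indices = np.asarray(owned)
    for _ in range(levels):
        # Add every column connected to the current rows of C.
        indices = np.unique(C[indices, :].nonzero()[1])
    return indices

# Demo: C with the sparsity pattern of a path graph on 5 unknowns.
C_demo = sp.diags([np.ones(4), np.ones(5), np.ones(4)], [-1, 0, 1], format="csr")
omega_i = increase_overlap(C_demo, [2])
```

Starting from the single owned index {2}, one level of overlap yields {1, 2, 3}, and a second level reaches every unknown of this tiny chain.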

4.1.3. Solution phase. For the solution phase, users can choose between multiple Krylov methods, including LSQR [34] and GMRES. Each iteration of LSQR requires matrix–vector products with A and Aᵀ. For GMRES, instead of using the previously explicitly assembled normal equations matrix, we use an implicit representation of the operator that computes the matrix–vector product with A followed by the product with Aᵀ. The type of overlapping Schwarz method (additive or restricted additive) as well as the type of second-level correction (balanced or deflated) may be selected at runtime by the user. This flexibility is important because LSQR requires a symmetric preconditioner.

Table 2
Preconditioner comparison when running LSQR. Iteration counts are reported. M⁻¹_ASM and M⁻¹_balanced are the one- and two-level overlapping Schwarz preconditioners, respectively. † denotes that the iteration count exceeds 1,000. ‡ denotes either a failure in computing the preconditioner because of memory issues or a breakdown of LSQR.

Identifier      M⁻¹_balanced  M⁻¹_ASM  BoomerAMG  GAMG  HSL_MI35
mesh_deform          13          27        ‡        35       5
EternityII_E         43          91        ‡        63     199
lp_stocfor3          34         136        ‡       513     211
deltaX               23          98        ‡       784     640
sc205-2r             54          61        ‡       195      97
stormg2-125          42         174        ‡         †       †
Rucci1               21         484      118       364       †
image_interp         11         409       40       203       †
mk13-b5              19          21       11         ‡      11
pds-100              18         202       16        35     110
fome21               20         104       16        20      41
sgpf5y6             224         264        ‡       163     110
Hardesty2            30         913       88       404       †
Delor338K            10          11        ‡         †     829
watson_2             15         109        ‡        64      73
stormG2_1000        139          64        ‡         ‡       †
LargeRegFile         41         109       19         ‡      12
cont11_l             30         490       53       723       ‡
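The implicit representation of C = AᵀA used for the GMRES solves can be sketched in a few lines with SciPy (a serial illustration under our own naming, not the PETSc shell operator itself): the operator applies A and then Aᵀ, and the explicit product is never assembled.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator

# Implicit representation of C = A^T A: apply A, then A^T, without ever
# assembling the product (serial sketch of the operator handed to GMRES).
A = sp.random(100, 40, density=0.1, format="csr", random_state=0)
C_implicit = LinearOperator(
    (A.shape[1], A.shape[1]),
    matvec=lambda x: A.T @ (A @ x),
    rmatvec=lambda x: A.T @ (A @ x),  # C is symmetric
)
x = np.ones(A.shape[1])
y = C_implicit.matvec(x)
```

Such an operator can be passed directly to scipy.sparse.linalg.gmres in place of an assembled matrix; each application costs one product with A and one with Aᵀ.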

4.2. Numerical validation. In this section, we validate the effectiveness of the two-level method when compared to other preconditioners. Table 2 presents a comparison between five preconditioners: two-level additive Schwarz with balanced coarse correction M⁻¹_balanced, one-level additive Schwarz M⁻¹_ASM, BoomerAMG, GAMG, and HSL_MI35. The first level of the one- and two-level methods both use the additive Schwarz formulation; the second level uses the balanced deflation formulation (2.10).

The results are for the iterative solver LSQR. If M denotes the preconditioner, LSQR terminates when the LS residual satisfies

‖(AM⁻¹)ᵀ(Ax − b)‖₂ / (‖A‖_{M,F} ‖Ax − b‖₂) < 10⁻⁸,

where ‖A‖²_{M,F} = Σ_{i=1}^{n} λ_i(M⁻¹AᵀA) is the sum of the positive eigenvalues of M⁻¹AᵀA, which is approximated by LSQR itself. Note that if M⁻¹ = W⁻¹W⁻ᵀ, then ‖A‖_{M,F} = ‖AW⁻¹‖_F.
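The identity behind this norm is easy to verify numerically on a small random example (the matrices below are our own illustration, not from the test set): the sum of the eigenvalues of M⁻¹AᵀA is its trace, which equals ‖AW⁻¹‖²_F.

```python
import numpy as np

# Check: if M^{-1} = W^{-1} W^{-T}, the sum of the eigenvalues of
# M^{-1} A^T A equals trace(M^{-1} A^T A) = ||A W^{-1}||_F^2,
# i.e., ||A||_{M,F} = ||A W^{-1}||_F.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
W = np.triu(rng.standard_normal((5, 5))) + 5.0 * np.eye(5)  # any invertible factor
W_inv = np.linalg.inv(W)
M_inv = W_inv @ W_inv.T

eig_sum = np.sum(np.linalg.eigvals(M_inv @ A.T @ A).real)
fro_sq = np.linalg.norm(A @ W_inv, "fro") ** 2
```

The two quantities agree to machine precision, so monitoring the eigenvalue sum is equivalent to monitoring the Frobenius norm of the preconditioned matrix.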

It is clear that both the one- and two-level Schwarz methods are more robust than the other preconditioners, as they encounter no breakdowns and solve all the LS problems using fewer than 1,000 iterations. Because HSL_MI35 is a sequential code that runs on a single core, there was not enough memory to compute the preconditioner for problem cont11_l. For many of the problems, the iteration count for HSL_MI35 can be reduced by increasing the parameters that determine the number of entries in the IC factor (the default values are rather small for the large test examples). LSQR preconditioned with BoomerAMG breaks down for several problems, as reported by PETSc error code KSP_DIVERGED_BREAKDOWN. GAMG is more robust but requires more iterations for problems where both algebraic multigrid solvers are successful. Note that even with more advanced options than the default ones set by PETSc, such as PMIS coarsening [13] with extended classical interpolation [14] for BoomerAMG or Schwarz smoothing for GAMG, these solvers do not perform considerably better numerically. We can also see that the two-level preconditioner outperforms the one-level preconditioner, with the exception of problem stormG2_1000, for which the normal equations matrix is not very sparse (see column 5 of Table 1). In fact, the matrix A has 121 relatively dense rows that have more than 1,000 nonzeros, while the rest of the rows have at most 5 nonzeros per row.

Table 3
Preconditioner comparison when running GMRES. Iteration counts are reported. M⁻¹_RAS and M⁻¹_deflated are the one- and two-level overlapping Schwarz preconditioners, respectively. † denotes that the iteration count exceeds 1,000. ‡ denotes either a failure in computing the preconditioner because of memory issues or a breakdown of GMRES.

Identifier      M⁻¹_deflated  M⁻¹_RAS  BoomerAMG  GAMG  HSL_MI35
mesh_deform           6          27       21        50       5
EternityII_E          5          93        †        97     186
lp_stocfor3          21           †        †         †     198
deltaX                6          93        †         †       †
sc205-2r             12         125        †       490      69
stormg2-125          23           ‡        ‡         ‡       †
Rucci1               10         958      213       882       †
image_interp         10         971       67       476       †
mk13-b5              14          18       21         ‡      12
pds-100              10          84       23        51     115
fome21               10          55       22        29      41
sgpf5y6             116           †        †       249     100
Hardesty2            26           †      155         †       †
Delor338K             5           9        †         †       †
watson_2              7         134      252        96      73
stormG2_1000        108          22      186         ‡       †
LargeRegFile          6          21       23         ‡      11
cont11_l             45           †      172         †       ‡

Table 3 presents a similar comparison, but using right-preconditioned GMRES applied directly to the normal equations (1.2). A restart parameter of 100 is used. The relative tolerance is again set to 10⁻⁸, but this now applies to the unpreconditioned residual. We switch from M⁻¹_ASM to M⁻¹_RAS (2.9), which is known to perform better numerically. For the two-level method, we switch from M⁻¹_balanced to M⁻¹_deflated (2.11). Switching from LSQR to GMRES can be beneficial for some preconditioners, e.g., BoomerAMG now converges in 21 iterations instead of breaking down for problem mesh_deform. But this is not always the case, e.g., HSL_MI35 applied to problem deltaX does not converge within the 1,000 iteration limit. The two-level method is the most robust approach, while the restricted additive Schwarz preconditioner struggles to solve some problems, either because of a breakdown (problem stormg2-125) or because of slow convergence (problems lp_stocfor3, sgpf5y6, Hardesty2, and cont11_l).

Fig. 1. Influence of the threshold parameter τ on the convergence of preconditioned LSQR for problem watson_2 (m = 677,224 and n = 352,013). The figure plots the relative residual against the iteration number for M⁻¹_ASM and for M⁻¹_balanced with τ ∈ {0.01275, 0.02, 0.05, 0.1, 0.4, 0.9}; the accompanying table reports, for each τ, the size n0 of the second level and the LSQR iteration count.

τ        n0       Iterations
0.01275  2,400    49
0.02     2,683    39
0.05     3,049    30
0.1      3,337    24
0.4      8,979    15
0.9      44,682   13
1.2      153,600  11

Recall that for the results in Tables 2 and 3, the two-level preconditioner was constructed using at most 300 eigenpairs and the threshold parameter τ was set to 0.6. Whilst this highlights that tuning τ for individual problems is not necessary to successfully solve a range of problems, it does not validate the ability of our preconditioner to concurrently select the most appropriate local eigenpairs to define an adaptive preconditioner. To that end, for problem watson_2, we consider the effect of varying τ on the performance of our two-level preconditioner. Results for LSQR with M⁻¹_ASM and M⁻¹_balanced are presented in Figure 1. Here, 512 MPI processes are used and the convergence tolerance is again 10⁻⁸. We observe that the two-level method consistently outperforms the one-level method. Furthermore, as we increase τ, the convergence is faster and the size n0 of the second level increases. It is also interesting to highlight that the convergence is smooth even with the very small value τ = 0.01275, for which n0 = 2,400 compared with the dimension 3.52·10^5 of the normal equations matrix.

4.3. Performance study. We next investigate the algorithmic cost of the two-level method. To do so, we perform a strong scaling analysis using a large problem not presented in Table 1 but still from the SuiteSparse Matrix Collection, Hardesty3. The matrix is of dimension 8,217,820 × 7,591,564, and the number of nonzero entries in C is 98,634,426. In Table 4, we report the number of iterations as well as the eigensolve, setup, and solve times as the number N of subdomains ranges from 16 to 4,096. The times are obtained using the PETSc -log_view command line option. For different N, the reported times on each row of the table are the maximum among all processes. The setup time includes the numerical factorization of the first-level subdomain matrices, and the assembly of the second-level operator and its factorization. Note that the symbolic factorization of the first-level subdomain is shared between the domain decomposition preconditioner and the eigensolver because we use the Krylov–Schur method as implemented in SLEPc, which requires the factorization of the right-hand side matrices from (3.7). The Cholesky factorizations of the subdomain matrices and of the second-level operator are performed using the sparse direct solver MUMPS [5]. For small numbers of subdomains (N < 128), the cost of the eigensolves is clearly prohibitive. By increasing the number of subdomains, thus reducing their size, the time to construct the preconditioner becomes much more tractable and, overall, our implementation yields good speedups on a wide range of process counts. Note that the threshold parameter τ = 0.6 is not attained on any of the subdomains for N ranging from 16 up to 256, so that n0 = 300 × N. For larger N, τ = 0.6 is attained, the preconditioner automatically selects the appropriate eigenmodes, and convergence improves (see column 2 of Table 4). When N is large (N ≥ 1,024), the setup and solve times are impacted by the high cost of factorizing and solving the second-level problems, which, as highlighted by the values of n0, become large. Multilevel variants [4] could be used to overcome this, but that goes beyond the scope of the current study.

Table 4
Strong scaling for problem Hardesty3 (m = 8,217,820 and n = 7,591,564) for N ranging from 16 to 4,096 subdomains. All times are in seconds. Column 2 reports the LSQR iteration count. Column 4 reports the setup time minus the concurrent solution time of the generalized eigenproblems, which is given in column 3.

N      Iterations  Eigensolve  Setup  Solve  n0       Total    Speedup
16     113         2,417.4     24.5   301.3  4,800    2,743.2  −
32     117         1,032.7     14.1   154.2  9,600    1,201.0  2.3
64     129         887.2       11.4   112.3  19,200   1,010.9  2.7
128    144         224.1       6.9    55.4   38,400   286.3    9.6
256    97          128.0       6.7    32.2   76,800   166.9    16.4
512    87          45.5        13.0   26.9   153,391  85.3     32.2
1,024  85          23.8        20.2   35.3   303,929  79.3     34.6
2,048  55          14.6        31.4   43.2   497,704  89.1     30.8
4,096  59          11.7        30.8   44.9   695,774  87.3     31.4
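For clarity, the Speedup column of Table 4 is simply the total time at N = 16 divided by the total time at the given subdomain count, which can be checked directly from the Total column:

```python
# Speedup in Table 4 = total time for N = 16 / total time for the given N
# (times in seconds, taken from the Total column of Table 4).
totals = {16: 2743.2, 32: 1201.0, 64: 1010.9, 128: 286.3,
          256: 166.9, 512: 85.3, 1024: 79.3, 2048: 89.1, 4096: 87.3}
speedup = {N: totals[16] / t for N, t in totals.items()}
```

Rounded to one decimal, these ratios reproduce the Speedup column, e.g., 2,743.2 / 1,201.0 ≈ 2.3 for N = 32.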

5. Concluding comments. Solving large-scale sparse linear least-squares problems is known to be challenging. Previously proposed preconditioners have generally been serial and have involved incomplete factorizations of A or C = AᵀA. In this paper, we have employed ideas that have been developed in the area of domain decomposition, which (as far as we are aware) have not previously been applied to least-squares problems. In particular, we have exploited recent work by Al Daas and Grigori [3] on algebraic domain decomposition preconditioners for SPD systems to propose a new two-level algebraic domain decomposition preconditioner for the normal equations matrix C. We have used the concept of an algebraic local SPSD splitting of an SPD matrix, and we have shown that the structure of C as the product of Aᵀ and A can be used to perform the splitting efficiently. Furthermore, we have proved that, using the two-level preconditioner, the spectral condition number of the preconditioned normal equations matrix is bounded from above independently of the number of subdomains and the size of the problem. Moreover, this upper bound depends on a parameter τ that can be chosen by the user to decrease (resp. increase) the upper bound, with the cost of setting up the preconditioner being larger (resp. smaller).

The new two-level preconditioner has been implemented in parallel within PETSc. Numerical experiments on a range of problems from real applications have shown that, whilst both one-level and two-level domain decomposition preconditioners are effective when used with LSQR to solve the normal equations, the latter consistently results in significantly faster convergence. It also outperforms other possible preconditioners, both in terms of robustness and iteration counts. Furthermore, our numerical experiments on a set of challenging least-squares problems show that the two-level preconditioner is robust with respect to the parameter τ. Moreover, a strong scalability test of the two-level preconditioner assessed its robustness with respect to the number of subdomains.

Future work includes extending the approach to develop preconditioners for solving large sparse–dense least-squares problems, in which A contains a small number of rows that have many more entries than the other rows. These rows cause the normal equations matrix to be dense, and so they need to be handled separately (see, for example, the recent work of Scott and Tůma [39] and references therein). As already observed, we also plan to consider multilevel variants to allow the use of a larger number of subdomains and processes.

Acknowledgments. This work was granted access to the GENCI-sponsored HPC resources of TGCC@CEA under allocation A0090607519. The authors would like to thank L. Dalcin, V. Hapla, and T. Isaac for their recent contributions to PETSc that made the implementation of our preconditioner more flexible.

Code reproducibility. Interested readers are referred to https://github.com/prj-/aldaas2021robust/blob/main/README.md for setting up the appropriate requirements, compiling, and running our proposed preconditioner. Fortran, C, and Python source codes are provided.

REFERENCES

[1] M. F. Adams, H. H. Bayraktar, T. M. Keaveny, and P. Papadopoulos, Ultrascalable implicit finite element analyses in solid mechanics with over a half a billion degrees of freedom, in Proceedings of the 2004 ACM/IEEE Conference on Supercomputing, SC04, IEEE Computer Society, 2004, pp. 34:1–34:15.
[2] E. Agullo, A. Buttari, A. Guermouche, and F. Lopez, Implementing multifrontal sparse solvers for multicore architectures with sequential task flow runtime systems, ACM Transactions on Mathematical Software, 43 (2016), http://buttari.perso.enseeiht.fr/qr_mumps.
[3] H. Al Daas and L. Grigori, A class of efficient locally constructed preconditioners based on coarse spaces, SIAM Journal on Matrix Analysis and Applications, 40 (2019), pp. 66–91.
[4] H. Al Daas, L. Grigori, P. Jolivet, and P.-H. Tournier, A multilevel Schwarz preconditioner based on a hierarchy of robust coarse spaces, SIAM Journal on Scientific Computing, 43 (2021), pp. A1907–A1928.
[5] P. R. Amestoy, I. S. Duff, J.-Y. L'Excellent, and J. Koster, A fully asynchronous multifrontal solver using distributed dynamic scheduling, SIAM Journal on Matrix Analysis and Applications, 23 (2001), pp. 15–41, http://mumps.enseeiht.fr.
[6] S. Balay, S. Abhyankar, M. F. Adams, J. Brown, P. Brune, K. Buschelman, L. Dalcin, A. Dener, V. Eijkhout, W. D. Gropp, D. Karpeyev, D. Kaushik, M. G. Knepley, D. A. May, L. C. McInnes, R. T. Mills, T. Munson, K. Rupp, P. Sanan, B. F. Smith, S. Zampini, H. Zhang, and H. Zhang, PETSc web page, 2021, https://petsc.org.
[7] R. Bru, J. Marín, J. Mas, and M. Tůma, Preconditioned iterative methods for solving linear least-squares problems, SIAM Journal on Scientific Computing, 36 (2014), pp. A2002–A2022.
[8] X.-C. Cai and M. Sarkis, A restricted additive Schwarz preconditioner for general sparse linear systems, SIAM Journal on Scientific Computing, 21 (1999), pp. 792–797.
[9] T. F. Chan and T. P. Mathew, Domain decomposition algorithms, Acta Numerica, 3 (1994), pp. 61–143.
[10] X. Cui and K. Hayami, Generalized approximate inverse preconditioners for least-squares problems, Japan Journal of Industrial and Applied Mathematics, 26 (2009).
[11] T. A. Davis, Algorithm 915, SuiteSparseQR: multifrontal multithreaded rank-revealing sparse QR factorization, ACM Transactions on Mathematical Software, 38 (2011).
[12] T. A. Davis and Y. Hu, The University of Florida sparse matrix collection, ACM Transactions on Mathematical Software, 38 (2011), pp. 1–28.
[13] H. De Sterck, R. D. Falgout, J. W. Nolting, and U. M. Yang, Distance-two interpolation for parallel algebraic multigrid, Numerical Linear Algebra with Applications, 15 (2008), pp. 115–139.
[14] H. De Sterck, U. M. Yang, and J. J. Heys, Reducing complexity in parallel algebraic multigrid preconditioners, SIAM Journal on Matrix Analysis and Applications, 27 (2006), pp. 1019–1039.
[15] V. Dolean, P. Jolivet, and F. Nataf, An introduction to domain decomposition methods. Algorithms, theory, and parallel implementation, Society for Industrial and Applied Mathematics, 2015.
[16] I. S. Duff, R. Guivarch, D. Ruiz, and M. Zenadi, The augmented block Cimmino distributed method, SIAM Journal on Scientific Computing, 37 (2015), pp. A1248–A1269.
[17] A. Dumitraşc, P. Leleux, C. Popa, D. Ruiz, and S. Torun, The augmented block Cimmino algorithm revisited, 2018, https://arxiv.org/abs/1805.11487.
[18] T. Elfving, Block-iterative methods for consistent and inconsistent linear equations, Numerische Mathematik, 35 (1980), pp. 1–12.
[19] R. D. Falgout and U. M. Yang, hypre: a library of high performance preconditioners, Computational Science—ICCS 2002, (2002), pp. 632–641.
[20] M. J. Gander and A. Loneland, SHEM: an optimal coarse space for RAS and its multiscale approximation, in Domain Decomposition Methods in Science and Engineering XXIII, C.-O. Lee, X.-C. Cai, D. E. Keyes, H. H. Kim, A. Klawonn, E.-J. Park, and O. B. Widlund, eds., Cham, 2017, Springer International Publishing, pp. 313–321.
[21] N. I. M. Gould and J. A. Scott, The state-of-the-art of preconditioners for sparse linear least-squares problems, ACM Transactions on Mathematical Software, 43 (2017), pp. 36:1–35.
[22] A. Heinlein, C. Hochmuth, and A. Klawonn, Reduced dimension GDSW coarse spaces for monolithic Schwarz domain decomposition methods for incompressible fluid flow problems, International Journal for Numerical Methods in Engineering, 121 (2020), pp. 1101–1119.
[23] V. Hernandez, J. E. Roman, and V. Vidal, SLEPc: a scalable and flexible toolkit for the solution of eigenvalue problems, ACM Transactions on Mathematical Software, 31 (2005), pp. 351–362, https://slepc.upv.es.
[24] HSL. A collection of Fortran codes for large-scale scientific computation, 2018, http://www.hsl.rl.ac.uk.
[25] P. Jolivet, F. Hecht, F. Nataf, and C. Prud'homme, Scalable domain decomposition preconditioners for heterogeneous elliptic problems, in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC '13, New York, NY, USA, 2013, ACM, pp. 80:1–80:11.
[26] P. Jolivet, J. E. Roman, and S. Zampini, KSPHPDDM and PCHPDDM: extending PETSc with advanced Krylov methods and robust multilevel overlapping Schwarz preconditioners, Computers & Mathematics with Applications, 84 (2021), pp. 277–295.
[27] G. Karypis and V. Kumar, Multilevel k-way partitioning scheme for irregular graphs, Journal of Parallel and Distributed Computing, 48 (1998), pp. 96–129.
[28] F. Kong and X.-C. Cai, A scalable nonlinear fluid–structure interaction solver based on a Schwarz preconditioner with isogeometric unstructured coarse spaces in 3D, Journal of Computational Physics, 340 (2017), pp. 498–518.
[29] N. Li and Y. Saad, MIQR: a multilevel incomplete QR preconditioner for large sparse least-squares problems, SIAM Journal on Matrix Analysis and Applications, 28 (2006), pp. 524–550.
[30] P. Marchand, X. Claeys, P. Jolivet, F. Nataf, and P.-H. Tournier, Two-level preconditioning for h-version boundary element approximation of hypersingular operator with GenEO, Numerische Mathematik, 146 (2020), pp. 597–628.
[31] M. McCourt, B. F. Smith, and H. Zhang, Sparse matrix–matrix products executed through coloring, SIAM Journal on Matrix Analysis and Applications, 36 (2015), pp. 90–109.
[32] Intel MKL Sparse QR, 2018.
[33] S. V. Nepomnyaschikh, Mesh theorems of traces, normalizations of function traces and their inversions, Russian Journal of Numerical Analysis and Mathematical Modelling, 6 (1991), pp. 1–25.
[34] C. C. Paige and M. A. Saunders, LSQR: an algorithm for sparse linear equations and sparse least squares, ACM Transactions on Mathematical Software, 8 (1982), pp. 43–71.
[35] F. Pellegrini and J. Roman, SCOTCH: a software package for static mapping by dual recursive bipartitioning of process and architecture graphs, in High-Performance Computing and Networking, Springer, 1996, pp. 493–498.
[36] Y. Saad and M. H. Schultz, GMRES: a generalized minimal residual algorithm for solving nonsymmetric linear systems, SIAM Journal on Scientific and Statistical Computing, 7 (1986), pp. 856–869.
[37] J. A. Scott and M. Tůma, Preconditioning of linear least squares by robust incomplete factorization for implicitly held normal equations, SIAM Journal on Scientific Computing, 38 (2016), pp. C603–C623.
[38] J. A. Scott and M. Tůma, Solving mixed sparse–dense linear least-squares problems by preconditioned iterative methods, SIAM Journal on Scientific Computing, 39 (2017), pp. A2422–A2437.
[39] J. A. Scott and M. Tůma, Strengths and limitations of stretching for least-squares problems with some dense rows, ACM Transactions on Mathematical Software, 41 (2021), pp. 1:1–1:25.
[40] B. F. Smith, P. E. Bjørstad, and W. D. Gropp, Domain decomposition: parallel multilevel methods for elliptic partial differential equations, Cambridge University Press, 1996.
[41] G. W. Stewart, A Krylov–Schur algorithm for large eigenproblems, SIAM Journal on Matrix Analysis and Applications, 23 (2002), pp. 601–614.
[42] J. M. Tang, R. Nabben, C. Vuik, and Y. A. Erlangga, Comparison of two-level preconditioners derived from deflation, domain decomposition and multigrid methods, Journal of Scientific Computing, 39 (2009), pp. 340–370.
[43] J. Van lent, R. Scheichl, and I. G. Graham, Energy-minimizing coarse spaces for two-level Schwarz methods for multiscale PDEs, Numerical Linear Algebra with Applications, 16 (2009), pp. 775–799.