
A Robust Algebraic Domain Decomposition Preconditioner for Sparse Normal Equations

Abstract. Solving the normal equations corresponding to large sparse linear least-squares
problems is an important and challenging problem. For very large problems, an iterative solver
is needed and, in general, a preconditioner is required to achieve good convergence. In recent
years, a number of preconditioners have been proposed. These are largely serial and reported
results demonstrate that none of the commonly used preconditioners for the normal equations
matrix is capable of solving all sparse least-squares problems. Our interest is thus in designing
new preconditioners for the normal equations that are efficient, robust, and can be implemented in
parallel. Our proposed preconditioners can be constructed efficiently and algebraically without any
knowledge of the problem and without any assumption on the least-squares matrix except that it
is sparse. We exploit the structure of the symmetric positive definite normal equations matrix and
use the concept of algebraic local symmetric positive semi-definite splittings to introduce two-level
Schwarz preconditioners for least-squares problems. The condition number of the preconditioned
normal equations is shown to be theoretically bounded independently of the number of subdomains in
the splitting. This upper bound can be adjusted using a single parameter τ that the user can specify.
We discuss how the new preconditioners can be implemented on top of the PETSc library using only
150 lines of Fortran, C, or Python code. Problems arising from practical applications are used to
compare the performance of the proposed new preconditioner with that of other preconditioners.
Key words. Algebraic domain decomposition, two-level preconditioner, additive Schwarz,
normal equations, sparse linear least-squares.
1. Introduction. We are interested in solving large-scale linear least-squares (LS) problems

(1.1) min_x ‖Ax − b‖_2,

where A ∈ R^{m×n} (m ≥ n) and b ∈ R^m are given. Solving (1.1) is mathematically equivalent to solving the n×n normal equations

(1.2) Cx = A^T b,  C = A^T A,

where, provided A has full column rank, the normal equations matrix C is symmetric and positive definite (SPD). Two main classes of methods may be used to solve the
normal equations: direct methods and iterative methods. A direct method proceeds
by computing an explicit factorization, either using a sparse Cholesky factorization
of C or a “thin” QR factorization of A. While well-engineered direct solvers [2, 11, 32]
are highly robust, iterative methods may be preferred because they generally require
significantly less storage (allowing them to tackle very large problems for which the
memory requirements of a direct solver are prohibitive) and, in some applications,
it may not be necessary to solve the system with the high accuracy offered by a
direct solver. However, the successful application of an iterative method usually
requires a suitable preconditioner to achieve acceptable (and ideally, fast) convergence
Submitted to the editors July 19, 2021.
STFC Rutherford Appleton Laboratory, Harwell Campus, Didcot, Oxfordshire, OX11 0QX, UK
CNRS, ENSEEIHT, 2 rue Charles Camichel, 31071 Toulouse Cedex 7, France
§School of Mathematical, Physical and Computational Sciences, University of Reading, Reading
RG6 6AQ, UK.
rates. Currently, there is much less knowledge of preconditioners for LS problems
than there is for sparse symmetric linear systems and, as observed in Bru et al. [7],
“the problem of robust and efficient iterative solution of LS problems is much harder
than the iterative solution of systems of linear equations.” This is, at least in part,
because A does not have the properties of differential problems that can make standard
preconditioners effective for solving many classes of linear systems.
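The equivalence of (1.1) and (1.2) is easy to check numerically. The following minimal NumPy sketch (a dense toy of our own, not part of the proposed solver) solves the normal equations and compares the result against an orthogonal-factorization least-squares solve:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 8
A = rng.standard_normal((m, n))   # dense stand-in for a sparse LS matrix; full column rank
b = rng.standard_normal(m)

# Solve the normal equations (1.2): C x = A^T b with C = A^T A
C = A.T @ A
x_ne = np.linalg.solve(C, A.T @ b)

# Solve (1.1) directly via an orthogonal factorization
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)

assert np.allclose(x_ne, x_ls)
```

Note that κ(C) = κ(A)², which is one reason the iterative solution of the normal equations is delicate and preconditioning matters.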
Compared with other classes of linear systems, the development of preconditioners for sparse LS problems may be regarded as still being in its infancy and includes:
• variants of block Jacobi (also known as block Cimmino) and SOR [18];
• incomplete factorizations such as incomplete Cholesky, QR, and LU factorizations, for example, [7, 29, 37, 38];
• sparse approximate inverses [10].
A review and performance comparison is given in [21]. This found that, whilst
none of the approaches is successful for all LS problems, limited memory incomplete
Cholesky factorization preconditioners appear to be the most reliable. The incomplete
factorization-based preconditioners are designed for moderate size problems because
current approaches, in general, are not suitable for parallel computers. The block
Cimmino method can be parallelized easily, however, it lacks robustness as the
iteration count to reach convergence cannot be controlled and typically increases
significantly when the number of blocks increases for a fixed problem [16]. Several
techniques have been proposed to improve the convergence of block Cimmino but
they still lack robustness [17]. Thus, we are motivated to design a new class of LS
preconditioners that are not only reliable but can also be implemented in parallel.
In [3], Al Daas and Grigori presented a class of robust fully algebraic two-level
additive Schwarz preconditioners for solving SPD linear systems of equations. They
introduced the notion of an algebraic local symmetric positive semi-definite (SPSD)
splitting of an SPD matrix with respect to local subdomains. They used this splitting
to construct a class of second-level spaces that bound the spectral condition number
of the preconditioned system by a user-defined value. Unfortunately, Al Daas and
Grigori reported that for general sparse SPD matrices, constructing the splitting is
prohibitively expensive. Our interest is in examining whether the particular structure
of the normal equations matrix allows the approach to be successfully used for
preconditioning LS problems. In this paper, we show how to compute the splitting
efficiently. Based on this splitting, we apply the theory presented in [3] to construct
a two-level Schwarz preconditioner for the normal equations.
Note that for most existing preconditioners of the normal equations, there is
no need to form and store all of the normal equations matrix C explicitly. For
example, the lower triangular part of its columns can be computed one at a time,
used to perform the corresponding step of an incomplete Cholesky algorithm, and
then discarded. However, forming the normal equations matrix, even piecemeal, can
entail a significant overhead and potentially may lead to a severe loss of information in
highly ill-conditioned cases. Although building our proposed preconditioner does not
need the explicit computation of C, our parallel implementation computes it efficiently
and uses it to set up the preconditioner. This is mainly motivated by technical
reasons. As an example, state-of-the-art distributed-memory graph partitioners such
as ParMETIS [27] or PT-SCOTCH [35] cannot directly partition the columns of
the rectangular matrix A. Our numerical experiments on highly ill-conditioned
LS problems showed that forming C and using a diagonal shift to construct the
preconditioner had no major effect on the robustness of the resulting preconditioner.
This paper is organized as follows. The notation used in the manuscript is given
at the end of the introduction. In section 2, we present an overview of domain
decomposition (DD) methods for a sparse SPD matrix. We present a framework for
the DD approach when applied to the sparse LS problem in section 3. Afterwards, we
show how to compute the local SPSD splitting matrices efficiently and use them in line
with the theory presented in [3] to construct a robust two-level Schwarz preconditioner
for the normal equations matrix. We then discuss some technical details that clarify
how to construct the preconditioner efficiently. In section 4, we briefly discuss how
the new preconditioner can be implemented on top of the PETSc library [6] and
we illustrate its effectiveness using large-scale LS problems coming from practical
applications. Finally, concluding comments are made in section 5.
Notation. We end our introduction by defining notation that will be used in this paper. Let 1 ≤ n ≤ m and let A ∈ R^{m×n}. Let S1 ⊂ ⟦1, m⟧ and S2 ⊂ ⟦1, n⟧ be two sets of integers. A(S1, :) is the submatrix of A formed by the rows whose indices belong to S1 and A(:, S2) is the submatrix of A formed by the columns whose indices belong to S2. The matrix A(S1, S2) is formed by taking the rows whose indices belong to S1 and only retaining the columns whose indices belong to S2. The concatenation of any two sets of integers S1 and S2 is represented by [S1, S2]. Note that the order of the concatenation is important. The set of the first p positive integers is denoted by ⟦1, p⟧. The identity matrix of size n is denoted by I_n. We denote by ker(A) and range(A) the null space and the range of A, respectively.
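For readers who think in code, this indexing notation maps directly onto NumPy fancy indexing. A small illustration on a hypothetical 4×3 matrix (0-based indices below, whereas the paper uses 1-based indexing):

```python
import numpy as np

# Hypothetical 4x3 matrix used only to illustrate the notation.
A = np.arange(12).reshape(4, 3)
S1 = [0, 2]          # row indices (0-based here, 1-based in the paper)
S2 = [1, 2]          # column indices

A_rows = A[S1, :]            # A(S1, :): rows of A indexed by S1
A_cols = A[:, S2]            # A(:, S2): columns of A indexed by S2
A_sub  = A[np.ix_(S1, S2)]   # A(S1, S2): rows S1, then columns S2

# The concatenation [S1, S2] is order-sensitive
assert [*S1, *S2] != [*S2, *S1]
assert A_sub.shape == (2, 2)
```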
2. Introduction to domain decomposition. Throughout this section, we assume that C is a general n×n sparse SPD matrix. Let the nodes V in the corresponding adjacency graph G(C) be numbered from 1 to n. A graph partitioning algorithm can be used to split V into N ≤ n disjoint subsets Ω_{Ii} (1 ≤ i ≤ N) of size n_{Ii}. These sets are called nonoverlapping subdomains. Defining an additive Schwarz preconditioner requires overlapping subdomains. Let Ω_{Γi} be the subset, of size n_{Γi}, of nodes that are at distance one in G(C) from the nodes in Ω_{Ii} (1 ≤ i ≤ N). The overlapping subdomain Ω_i is defined to be Ω_i = [Ω_{Ii}, Ω_{Γi}], with size n_i = n_{Γi} + n_{Ii}. Associated with Ω_i is a restriction (or projection) matrix R_i ∈ R^{n_i×n} given by R_i = I_n(Ω_i, :). R_i maps from the global domain to subdomain Ω_i. Its transpose R_i^T is a prolongation matrix that maps from subdomain Ω_i to the global domain. The one-level additive Schwarz preconditioner [15] is defined to be

(2.1) M_ASM^{-1} = Σ_{i=1}^{N} R_i^T C_{ii}^{-1} R_i,  C_{ii} = R_i C R_i^T.

That is, M_ASM^{-1} = 𝓡_1 diag(C_{11}^{-1}, …, C_{NN}^{-1}) 𝓡_1^T, where 𝓡_1 is the one-level interpolation operator defined by 𝓡_1 = [R_1^T, …, R_N^T].
Applying this preconditioner to a vector involves solving concurrent local problems
in the overlapping subdomains. Increasing Nreduces the sizes niof the overlapping
subdomains, leading to smaller local problems and faster computations. However,
in practice, the preconditioned system using M_ASM^{-1} may not be well-conditioned,
inhibiting convergence of the iterative solver. In fact, the local nature of this
preconditioner can lead to a deterioration in its effectiveness as the number of
subdomains increases because of the lack of global information from the matrix C [15,
20]. To maintain robustness with respect to N, an artificial subdomain is added to
the preconditioner (also known as second-level correction or coarse correction) that
includes global information.
Let 0 < n_0 ≤ n. If R_0 ∈ R^{n_0×n} is of full row rank, the two-level additive Schwarz preconditioner [15] is defined to be

(2.2) M_additive^{-1} = R_0^T C_{00}^{-1} R_0 + M_ASM^{-1} = Σ_{i=0}^{N} R_i^T C_{ii}^{-1} R_i,  C_{00} = R_0 C R_0^T.

That is, M_additive^{-1} = 𝓡_2 diag(C_{00}^{-1}, C_{11}^{-1}, …, C_{NN}^{-1}) 𝓡_2^T, where 𝓡_2 is the two-level interpolation operator

(2.3) 𝓡_2 = [R_0^T, R_1^T, …, R_N^T].

In the rest of this paper, we will make use of the canonical one-to-one correspondence between ∏_{i=0}^{N} R^{n_i} and R^{Σ_{i=0}^{N} n_i} so that 𝓡_2 can be applied to vectors in R^{Σ_{i=0}^{N} n_i}.
Observe that, because C and R_0 are of full rank, C_{00} is also of full rank. For any full rank R_0, it is possible to cheaply obtain upper bounds on the largest eigenvalue of the preconditioned matrix, independently of n and N [3]. However, bounding the smallest eigenvalue is highly dependent on R_0. Thus, the choice of R_0 is key to obtaining a well-conditioned system and building efficient two-level Schwarz preconditioners. Two-level Schwarz preconditioners have been used to solve a large class of systems arising from a range of engineering applications (see, for example, [22, 26, 28, 30, 40, 43] and references therein).
Following [3], we denote by D_i ∈ R^{n_i×n_i} (1 ≤ i ≤ N) any non-negative diagonal matrices such that

(2.4) Σ_{i=1}^{N} R_i^T D_i R_i = I_n.

We refer to (D_i)_{1≤i≤N} as an algebraic partition of unity. In [3], Al Daas and Grigori show how to select local subspaces 𝒵_i spanned by the columns of Z_i ∈ R^{n_i×p_i} with p_i ≤ n_i (1 ≤ i ≤ N) such that, if R_0^T is defined to be R_0^T = [R_1^T D_1 Z_1, …, R_N^T D_N Z_N], the spectral condition number of the preconditioned matrix M_additive^{-1} C is bounded from above independently of N and n.
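The objects introduced so far (overlapping subdomains, restriction matrices R_i, and a partition of unity D_i) can be checked on a toy problem. The sketch below is our own illustration, assuming a 1D Laplacian-like SPD matrix and a hand-chosen overlap; it verifies Σ R_i^T D_i R_i = I_n and that the spectrum of M_ASM^{-1} C is positive and bounded by the number of subdomains:

```python
import numpy as np

# 1D Laplacian-like SPD matrix C on 6 nodes (toy example, not from the paper)
n = 6
C = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

omega_I = [[0, 1, 2], [3, 4, 5]]                 # nonoverlapping subdomains
omega_G = [[3], [2]]                             # distance-1 neighbours in G(C)
omega = [oI + oG for oI, oG in zip(omega_I, omega_G)]

I_n = np.eye(n)
R = [I_n[w, :] for w in omega]                   # R_i = I_n(Omega_i, :)

# Partition of unity: ones on interior positions, zeros on the overlap
D = [np.diag([1.0] * len(oI) + [0.0] * len(oG)) for oI, oG in zip(omega_I, omega_G)]
assert np.allclose(sum(Ri.T @ Di @ Ri for Ri, Di in zip(R, D)), I_n)

# One-level additive Schwarz preconditioner (2.1)
M_ASM = sum(Ri.T @ np.linalg.inv(Ri @ C @ Ri.T) @ Ri for Ri in R)

# M_ASM^{-1} C is a sum of N C-orthogonal projections: eigenvalues lie in (0, N]
w = np.linalg.eigvals(M_ASM @ C).real
assert w.min() > 0 and w.max() <= len(R) + 1e-8
```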
2.1. Algebraic local SPSD splitting of an SPD matrix. We now recall the definition of an algebraic local SPSD splitting of an SPD matrix given in [3]. This requires some additional notation. Denote the complement of Ω_i in ⟦1, n⟧ by Ω_{ci}. Define restriction matrices R_{ci}, R_{Ii}, and R_{Γi} that map from the global domain to Ω_{ci}, Ω_{Ii}, and Ω_{Γi}, respectively. Reordering the matrix C using the permutation matrix P_i = I_n([Ω_{Ii}, Ω_{Γi}, Ω_{ci}], :) gives the block tridiagonal matrix

(2.5) P_i C P_i^T =
[ C_{I,i}    C_{IΓ,i}   0        ]
[ C_{ΓI,i}   C_{Γ,i}    C_{Γc,i} ]
[ 0          C_{cΓ,i}   C_{c,i}  ],

where C_{I,i} = R_{Ii} C R_{Ii}^T, C_{ΓI,i}^T = C_{IΓ,i} = R_{Ii} C R_{Γi}^T, C_{Γ,i} = R_{Γi} C R_{Γi}^T, C_{cΓ,i}^T = C_{Γc,i} = R_{Γi} C R_{ci}^T, and C_{c,i} = R_{ci} C R_{ci}^T. The first block on the diagonal corresponds to the nodes in Ω_{Ii}, the second block on the diagonal corresponds to the nodes in Ω_{Γi}, and the third block on the diagonal is associated with the remaining nodes.

An algebraic local SPSD splitting of the SPD matrix C with respect to the i-th subdomain is defined to be any SPSD matrix C̃_i ∈ R^{n×n} of the form

P_i C̃_i P_i^T =
[ C_{I,i}    C_{IΓ,i}   0 ]
[ C_{ΓI,i}   C̃_{Γ,i}   0 ]
[ 0          0          0 ],

such that the following condition holds:

0 ≤ u^T C̃_i u ≤ u^T C u, for all u ∈ R^n.

We denote the 2×2 nonzero block matrix of P_i C̃_i P_i^T by C̃_{ii}, so that C̃_{ii} = R_i C̃_i R_i^T.

Associated with the local SPSD splitting matrices, we define a multiplicity constant k_m that satisfies the inequality

(2.6) 0 ≤ Σ_{i=1}^{N} u^T C̃_i u ≤ k_m u^T C u, for all u ∈ R^n.

Note that, for any set of SPSD splitting matrices, k_m ≤ N.

The main motivation for defining splitting matrices is to find local seminorms that are bounded from above by the C-norm. These seminorms will be used to determine a subspace that contains the eigenvectors of C associated with its smallest eigenvalues.
2.2. Two-level Schwarz method. We next review the abstract theory of the
two-level Schwarz method as presented in [3]. For the sake of completeness, we present
some elementary lemmas that are widely used in multilevel methods. These will be
used in proving efficiency of the two-level Schwarz preconditioner and will also help
in understanding how the preconditioner is constructed.
2.2.1. Useful lemmas. The following lemma [33] provides a unified framework
for bounding the spectral condition number of a preconditioned operator. It can be
found in different forms for finite and infinite dimensional spaces. Here, we follow the
presentation from [15, Lemma 7.4].
Lemma 2.1 (Fictitious Subspace Lemma). Let C ∈ R^{n_C×n_C} and B ∈ R^{n_B×n_B} be SPD. Let the operator 𝓡 be defined as

𝓡 : R^{n_B} → R^{n_C}, v ↦ 𝓡v,

and let 𝓡^T be its transpose. Assume the following conditions hold:
(i) 𝓡 is surjective;
(ii) there exists c_u > 0 such that, for all v_B ∈ R^{n_B},
(𝓡v_B)^T C (𝓡v_B) ≤ c_u v_B^T B v_B;
(iii) there exists c_l > 0 such that, for all v_C ∈ R^{n_C}, there exists v_B ∈ R^{n_B} with v_C = 𝓡v_B such that
c_l v_B^T B v_B ≤ (𝓡v_B)^T C (𝓡v_B) = v_C^T C v_C.
Then, the spectrum of the operator 𝓡 B^{-1} 𝓡^T C is contained in the interval [c_l, c_u].
The challenge is to define the second-level projection matrix R_0 such that the two-level additive Schwarz preconditioner M_additive^{-1} and the operator 𝓡_2 (2.3), corresponding respectively to B and 𝓡 in Lemma 2.1, satisfy conditions (i) to (iii) and, in addition, ensure that the ratio between c_l and c_u is small, because this determines the quality of the preconditioner. As shown in [15, Lemmas 7.10 and 7.11], a two-level additive Schwarz preconditioner satisfies (i) and (ii) for any full rank R_0. Furthermore, the constant c_u is bounded from above independently of the number of subdomains N, as shown in the following result [9, Theorem 12].
Lemma 2.2. Let k_c be the minimum number of distinct colours so that the spaces spanned by the columns of the matrices R_1^T, …, R_N^T that are of the same colour are mutually C-orthogonal. Then,

(𝓡_2 u_B)^T C (𝓡_2 u_B) ≤ (k_c + 1) Σ_{i=0}^{N} u_i^T C_{ii} u_i,

for all u_B = (u_i)_{0≤i≤N} ∈ ∏_{i=0}^{N} R^{n_i}.
Note that k_c is independent of N. Indeed, it depends only on the sparsity structure of C and is less than the maximum number of neighbouring subdomains.
The following result is the first step in a three-step approach to define a two-level additive Schwarz operator 𝓡_2 that satisfies condition (iii) in Lemma 2.1.

Lemma 2.3. Let u_B = (u_i)_{0≤i≤N} ∈ ∏_{i=0}^{N} R^{n_i} and u = 𝓡_2 u_B ∈ R^n. Then, provided R_0 is of full rank,

Σ_{i=0}^{N} u_i^T C_{ii} u_i ≤ 2 u^T C u + (2k_c + 1) Σ_{i=1}^{N} u_i^T C_{ii} u_i,

where k_c is defined in Lemma 2.2.

It follows that (iii) is satisfied if the squared localized seminorm Σ_{i=1}^{N} u_i^T C_{ii} u_i is bounded from above by the squared C-norm of u.
In the second step, we bound u_i^T C_{ii} u_i by the squared localized seminorm defined by the SPSD splitting matrix C̃_i, which can be bounded by the squared C-norm (2.6). The decomposition u = Σ_{i=0}^{N} R_i^T u_i ∈ R^n is termed stable if, for some τ > 0,

u_i^T C_{ii} u_i ≤ (1/τ) u^T C u, 1 ≤ i ≤ N.

The two-level approach in [3] aims to decompose each R^{n_i} (1 ≤ i ≤ N) into two subspaces, one that makes the decomposition of u stable, while the other is part of the artificial subdomain associated with the second level of the preconditioner. Given the partition of unity (2.4), u = Σ_{i=1}^{N} R_i^T D_i R_i u and, if Π_i = Π_i^T ∈ R^{n_i×n_i}, we can write

u = Σ_{i=1}^{N} R_i^T D_i Π_i R_i u + Σ_{i=1}^{N} R_i^T u_i, with u_i = D_i (I_{n_i} − Π_i) R_i u.

Therefore, we need to construct Π_i such that

u^T R_i^T (I_{n_i} − Π_i) D_i C_{ii} D_i (I_{n_i} − Π_i) R_i u ≤ (1/τ) u^T C u.
The following lemma shows how this can be done.
Lemma 2.4. Let C̃_i be a local SPSD splitting of C related to the i-th subdomain (1 ≤ i ≤ N) and let C̃_{ii} = R_i C̃_i R_i^T be its nonzero block. Let D_i be the partition of unity (2.4). Let P_{0,i} be the projection on range(C̃_{ii}) parallel to ker(C̃_{ii}). Define L_i = ker(D_i C_{ii} D_i) ∩ ker(C̃_{ii}) and let L_i^⊥ denote the orthogonal complement of L_i in ker(C̃_{ii}). Consider the following generalized eigenvalue problem:

find (v_{i,k}, λ_{i,k}) ∈ R^{n_i} × R such that P_{0,i} D_i C_{ii} D_i P_{0,i} v_{i,k} = λ_{i,k} C̃_{ii} v_{i,k}.

Given τ > 0, define

(2.7) 𝒵_i = L_i^⊥ ⊕ span{ v_{i,k} | λ_{i,k} > 1/τ },

and let Π_i be the orthogonal projection on 𝒵_i. Then, 𝒵_i is the subspace of smallest dimension such that, for all u ∈ R^n,

u_i^T C_{ii} u_i ≤ (1/τ) u^T C̃_i u,

where u_i = D_i (I_{n_i} − Π_i) R_i u.
Lemma 2.5 provides the last step that we need for condition (iii) in Lemma 2.1. It defines u_0 and checks whether (u_i)_{0≤i≤N} is a stable decomposition.

Lemma 2.5. Let C̃_i, 𝒵_i, and Π_i be as in Lemma 2.4 and let Z_i be a matrix whose columns span 𝒵_i (1 ≤ i ≤ N). Let the columns of the matrix R_0^T span the space

(2.8) 𝒵 = ⊕_{i=1}^{N} R_i^T D_i 𝒵_i.

Let u ∈ R^n and u_i = D_i (I_{n_i} − Π_i) R_i u (1 ≤ i ≤ N). Define u_0 such that R_0^T u_0 = u − Σ_{i=1}^{N} R_i^T u_i. Then, u = Σ_{i=0}^{N} R_i^T u_i and

Σ_{i=0}^{N} u_i^T C_{ii} u_i ≤ (2 + (2k_c + 1) k_m / τ) u^T C u.
Finally, using the preceding results, Theorem 2.6 presents a theoretical upper bound on the spectral condition number of the preconditioned system.

Theorem 2.6. If the two-level additive Schwarz preconditioner M_additive^{-1} (2.2) is constructed using R_0 as defined in Lemma 2.5, then the following inequality is satisfied:

κ(M_additive^{-1} C) ≤ (k_c + 1) (2 + (2k_c + 1) k_m / τ).
2.3. Variants of the Schwarz preconditioner. So far, we have presented M_ASM^{-1}, the symmetric additive Schwarz method (ASM), and M_additive^{-1}, the additive correction for the second level. It was noted in [8] that using the partition of unity to weight the preconditioner can improve its quality. The resulting preconditioner is referred to as M_RAS^{-1}, the restricted additive Schwarz (RAS) preconditioner, and is defined to be

(2.9) M_RAS^{-1} = Σ_{i=1}^{N} R_i^T D_i C_{ii}^{-1} R_i.

This preconditioner is nonsymmetric and thus can only be used with iterative methods, such as GMRES [36], that are designed for solving nonsymmetric problems. With regards to the second level, different strategies yield either a symmetric or a nonsymmetric preconditioner [42]. Given a first-level preconditioner M_*^{-1} and setting Q = R_0^T C_{00}^{-1} R_0, the balanced and deflated two-level preconditioners are

(2.10) M_balanced^{-1} = Q + (I − CQ)^T M_*^{-1} (I − CQ),
(2.11) M_deflated^{-1} = Q + M_*^{-1} (I − CQ),

respectively. It is well known in the literature that M_balanced^{-1} and M_deflated^{-1} yield better convergence behavior than M_additive^{-1} (see [42] for a thorough comparison). Although the theory we present relies on M_additive^{-1}, in practice we will use M_balanced^{-1} and M_deflated^{-1}. If the one-level preconditioner M_*^{-1} is symmetric, then so is M_balanced^{-1}, while M_deflated^{-1} is typically nonsymmetric. For this reason, in the rest of the paper, we always couple M_ASM^{-1} with M_balanced^{-1}, and M_RAS^{-1} with M_deflated^{-1}. All three variants have the same setup cost, and only differ in how the second level is applied. M_balanced^{-1} is slightly more expensive because two second-level corrections (multiplications by Q) are required instead of a single one for M_additive^{-1} and M_deflated^{-1}.
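The algebraic properties of (2.10) and (2.11) can be illustrated directly. The sketch below is a dense toy of our own, assuming a stand-in Jacobi first level and a random full-rank R_0 rather than the coarse space of subsection 2.2; it checks the symmetry claims and that both variants act exactly on the coarse space:

```python
import numpy as np

rng = np.random.default_rng(1)
n, n0 = 8, 2
G = rng.standard_normal((n, n))
C = G @ G.T + n * np.eye(n)                  # SPD test matrix
R0 = rng.standard_normal((n0, n))            # any full-rank coarse projection
M1 = np.diag(1.0 / np.diag(C))               # stand-in one-level preconditioner M_*^{-1}

C00 = R0 @ C @ R0.T
Q = R0.T @ np.linalg.solve(C00, R0)
I = np.eye(n)

M_balanced = Q + (I - C @ Q).T @ M1 @ (I - C @ Q)   # (2.10)
M_deflated = Q + M1 @ (I - C @ Q)                   # (2.11)

# The balanced variant is symmetric when M_*^{-1} is; the deflated one is not.
assert np.allclose(M_balanced, M_balanced.T)
assert not np.allclose(M_deflated, M_deflated.T)

# Both variants solve the coarse space exactly: M^{-1} C R0^T = R0^T
assert np.allclose(M_balanced @ C @ R0.T, R0.T)
assert np.allclose(M_deflated @ C @ R0.T, R0.T)
```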
3. The normal equations. The theory explained thus far is fully algebraic but somehow disconnected from our initial LS problem (1.1). We now show how it can be readily applied to the normal equations matrix C = A^T A, with A ∈ R^{m×n} sparse, first defining a one-level Schwarz preconditioner, and then a robust algebraic second-level correction. We start by partitioning the n columns of A into disjoint subsets Ω_{Ii}. Let Ξ_i be the set of indices of the nonzero rows in A(:, Ω_{Ii}) and let Ξ_{ci} be the complement of Ξ_i in the set ⟦1, m⟧. Now define Ω_{Γi} to be the complement of Ω_{Ii} in the set of indices of nonzero columns of A(Ξ_i, :). The set Ω_i = [Ω_{Ii}, Ω_{Γi}] defines the i-th overlapping subdomain and we have the permuted matrix

(3.1) A([Ξ_i, Ξ_{ci}], [Ω_{Ii}, Ω_{Γi}, Ω_{ci}]) =
[ A_{I,i}   A_{IΓ,i}   0       ]
[ 0         A_{Γ,i}    A_{c,i} ].
To illustrate the concepts and notation, consider the 5×4 matrix

A =
[ 1 0 6 0 ]
[ 2 4 0 0 ]
[ 3 0 0 0 ]
[ 0 5 0 7 ]
[ 0 0 0 8 ]

and set N = 2, Ω_{I1} = {1, 3}, Ω_{I2} = {2, 4}. Consider the first subdomain. We have

A(:, Ω_{I1}) =
[ 1 6 ]
[ 2 0 ]
[ 3 0 ]
[ 0 0 ]
[ 0 0 ].

The set of indices of the nonzero rows is Ξ_1 = {1, 2, 3}, and its complement is Ξ_{c1} = {4, 5}. To define Ω_{Γ1}, select the nonzero columns in the submatrix A(Ξ_1, :) and remove those already in Ω_{I1}, that is,

(3.2) A(Ξ_1, :) =
[ 1 0 6 0 ]
[ 2 4 0 0 ]
[ 3 0 0 0 ],

so that Ω_{Γ1} = {2} and Ω_{c1} = {4}. Permuting A to the form (3.1) gives

A([Ξ_1, Ξ_{c1}], [Ω_{I1}, Ω_{Γ1}, Ω_{c1}]) =
[ 1 6 0 0 ]
[ 2 0 4 0 ]
[ 3 0 0 0 ]
[ 0 0 5 7 ]
[ 0 0 0 8 ].

In the same way, consider the second subdomain. Ω_{I2} = {2, 4} and

A(:, Ω_{I2}) =
[ 0 0 ]
[ 4 0 ]
[ 0 0 ]
[ 5 7 ]
[ 0 8 ],

so that Ξ_2 = {2, 4, 5} and Ξ_{c2} = {1, 3}. To define Ω_{Γ2}, select the nonzero columns in the submatrix A(Ξ_2, :) and remove those already in Ω_{I2}, that is,

(3.3) A(Ξ_2, :) =
[ 2 4 0 0 ]
[ 0 5 0 7 ]
[ 0 0 0 8 ],

which gives Ω_{Γ2} = {1} and Ω_{c2} = {3}. Permuting A to the form (3.1) gives

A([Ξ_2, Ξ_{c2}], [Ω_{I2}, Ω_{Γ2}, Ω_{c2}]) =
[ 4 0 2 0 ]
[ 5 7 0 0 ]
[ 0 8 0 0 ]
[ 0 0 1 6 ]
[ 0 0 3 0 ].

Now that we have Ω_{Ii} and Ω_{Γi}, we can define the restriction operators

R_1 = I_4(Ω_1, :) =
[ 1 0 0 0 ]
[ 0 0 1 0 ]
[ 0 1 0 0 ],
R_2 = I_4(Ω_2, :) =
[ 0 1 0 0 ]
[ 0 0 0 1 ]
[ 1 0 0 0 ].

For our example, n_{I1} = n_{I2} = 2 and n_{Γ1} = n_{Γ2} = 1. The partition of unity matrices D_i are of dimension (n_{Ii} + n_{Γi}) × (n_{Ii} + n_{Γi}) (i = 1, 2) and have ones on the n_{Ii} leading diagonal entries and zeros elsewhere, so that

(3.4) D_1 = D_2 =
[ 1 0 0 ]
[ 0 1 0 ]
[ 0 0 0 ].
Observe that D_i(k, k) scales the column A(:, Ω_i(k)).
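The construction of Ξ_i and Ω_{Γi} above is mechanical and can be scripted. The sketch below (0-based indices, with the matrix entries consistent with the submatrices displayed in the example) reproduces the sets:

```python
import numpy as np

# The 5x4 matrix of the worked example (0-based indexing below).
A = np.array([[1, 0, 6, 0],
              [2, 4, 0, 0],
              [3, 0, 0, 0],
              [0, 5, 0, 7],
              [0, 0, 0, 8]])

def overlap(A, omega_I):
    """Return (Xi_i, Omega_Gamma_i) for a set of columns Omega_Ii (0-based)."""
    xi = np.flatnonzero(A[:, omega_I].any(axis=1))   # nonzero rows of A(:, Omega_Ii)
    cols = np.flatnonzero(A[xi, :].any(axis=0))      # nonzero columns of A(Xi_i, :)
    gamma = [j for j in cols if j not in omega_I]    # remove columns already in Omega_Ii
    return list(xi), gamma

xi1, gamma1 = overlap(A, [0, 2])    # Omega_I1 = {1, 3} in the paper's 1-based numbering
xi2, gamma2 = overlap(A, [1, 3])    # Omega_I2 = {2, 4}

assert xi1 == [0, 1, 2] and gamma1 == [1]    # Xi_1 = {1,2,3}, Omega_Gamma1 = {2}
assert xi2 == [1, 3, 4] and gamma2 == [0]    # Xi_2 = {2,4,5}, Omega_Gamma2 = {1}
```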
Note that it is possible to obtain the partitioning sets and the sets of indices using the normal equations matrix C. Most graph partitioners, especially those that are implemented in parallel, require an undirected graph (corresponding to a matrix with a symmetric sparsity pattern). Therefore, in practice, we use the graph of C to set up the first-level preconditioner for LS problems.
3.1. One-level DD for the normal equations. This section presents the one-level additive Schwarz preconditioner for the normal equations matrix C = A^T A. Following (2.1) and given the sets Ω_{Ii}, Ω_{Γi}, and Ξ_i, the one-level Schwarz preconditioner of C = A^T A is

M_ASM^{-1} = Σ_{i=1}^{N} R_i^T C_{ii}^{-1} R_i, where C_{ii} = A(:, Ω_i)^T A(:, Ω_i).
Remark 3.1. Note that the local matrix C_{ii} = A(:, Ω_i)^T A(:, Ω_i) need not be computed explicitly to be factored. Instead, the Cholesky factor of C_{ii} can be computed by using a “thin” QR factorization of A(:, Ω_i).
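Remark 3.1 is easy to check numerically: the triangular factor of a thin QR factorization of A(:, Ω_i) coincides, up to row signs, with the transposed Cholesky factor of C_{ii}. A small sketch using the first subdomain of the example (columns [1, 3, 2] in the paper's 1-based numbering):

```python
import numpy as np

# Columns Omega_1 = [1, 3, 2] (1-based) of the 5x4 example matrix.
A_omega = np.array([[1.0, 6.0, 0.0],
                    [2.0, 0.0, 4.0],
                    [3.0, 0.0, 0.0],
                    [0.0, 0.0, 5.0],
                    [0.0, 0.0, 0.0]])

# Thin QR of A(:, Omega_i): A_omega = Q R with R upper triangular
_, Rqr = np.linalg.qr(A_omega)

# Cholesky factor of C_ii = A(:, Omega_i)^T A(:, Omega_i)
L = np.linalg.cholesky(A_omega.T @ A_omega)

# R^T R = C_ii, so R matches L^T up to the signs of its rows
assert np.allclose(np.abs(Rqr), np.abs(L.T))
```

The point of the remark is that the (squared-condition-number) product A^T A never has to be formed to obtain this factor.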
3.2. Algebraic local SPSD splitting of the normal equations matrix. In this section, we show how to cheaply construct algebraic local SPSD splittings for sparse matrices of the form C = A^T A. Combining (2.5) and (3.1), we can write

P_i A^T A P_i^T =
[ A_{I,i}^T A_{I,i}    A_{I,i}^T A_{IΓ,i}                                  0                 ]
[ A_{IΓ,i}^T A_{I,i}   A_{IΓ,i}^T A_{IΓ,i} + A_{Γ,i}^T A_{Γ,i}            A_{Γ,i}^T A_{c,i} ]
[ 0                    A_{c,i}^T A_{Γ,i}                                   A_{c,i}^T A_{c,i} ],

where P_i = I_n([Ω_{Ii}, Ω_{Γi}, Ω_{ci}], :) is a permutation matrix. A straightforward splitting of P_i A^T A P_i^T is given by

P_i A^T A P_i^T =
[ A_{I,i}^T A_{I,i}    A_{I,i}^T A_{IΓ,i}    0 ]
[ A_{IΓ,i}^T A_{I,i}   A_{IΓ,i}^T A_{IΓ,i}   0 ]
[ 0                    0                     0 ]
+
[ 0   0                   0                 ]
[ 0   A_{Γ,i}^T A_{Γ,i}   A_{Γ,i}^T A_{c,i} ]
[ 0   A_{c,i}^T A_{Γ,i}   A_{c,i}^T A_{c,i} ].

It is clear that both summands are SPSD. Indeed, they both have the form X^T X, where X is [A_{I,i}  A_{IΓ,i}  0] and [0  A_{Γ,i}  A_{c,i}], respectively. The local SPSD splitting matrix related to the i-th subdomain is then defined by

(3.5) C̃_{ii} = A(Ξ_i, Ω_i)^T A(Ξ_i, Ω_i) = [A_{I,i}  A_{IΓ,i}]^T [A_{I,i}  A_{IΓ,i}],

and C̃_i is the n×n matrix whose nonzero block, in the ordering given by P_i, is C̃_{ii}.
Hence, the theory presented in [3] and summarised in subsection 2.2 is applicable. In particular, the two-level Schwarz preconditioner M_additive^{-1} (2.2) satisfies

κ(M_additive^{-1} C) ≤ (k_c + 1) (2 + (2k_c + 1) k_m / τ),

where k_c is the minimal number of colours required to colour the partitions of C such that each two neighbouring subdomains have different colours, and k_m is the multiplicity constant that satisfies the inequality

0 ≤ Σ_{i=1}^{N} u^T C̃_i u ≤ k_m u^T C u, for all u ∈ R^n.

The constant k_c is independent of N and depends only on the graph G(C), which is determined by the sparsity pattern of A. The multiplicity constant k_m depends on the local SPSD splitting matrices. For the normal equations matrix, the following lemma provides an upper bound on k_m.
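The defining inequalities of the splitting are cheap to verify numerically. The following sketch (a random dense toy of our own, with two overlapping row sets Ξ_i chosen by hand rather than induced by a column partition) checks 0 ≤ u^T C̃_i u ≤ u^T C u for each subdomain, and the multiplicity bound (2.6):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 30, 10
A = rng.standard_normal((m, n))
A[rng.random((m, n)) < 0.7] = 0.0            # make A sparse-ish
C = A.T @ A

# Two overlapping row sets Xi_i (hand-chosen for illustration)
xi = [np.arange(0, 18), np.arange(12, 30)]

u = rng.standard_normal(n)
total = 0.0
for xi_i in xi:
    Ct_i = A[xi_i, :].T @ A[xi_i, :]          # local SPSD splitting, cf. (3.5)
    q = u @ Ct_i @ u
    assert -1e-12 <= q <= u @ C @ u + 1e-12   # 0 <= u^T C~_i u <= u^T C u
    total += q

# Rows 12..17 belong to both Xi_1 and Xi_2, so k_m = 2 here and (2.6) gives
# sum_i u^T C~_i u <= k_m u^T C u
assert total <= 2.0 * (u @ C @ u) + 1e-12
```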
Lemma 3.2. Let C = A^T A. Let m_j be the number of subdomains such that A(j, Ω_{Ii}) ≠ 0 (1 ≤ i ≤ N), that is,

m_j = #{ i | j ∈ Ξ_i }.

Then, k_m can be chosen to be k_m = max_{1≤j≤m} m_j. Furthermore, if k_i is the number of neighbouring subdomains of the i-th subdomain, that is,

k_i = #{ j | Ω_i ∩ Ω_j ≠ ∅ },

then k_m ≤ max_{1≤i≤N} k_i.

Proof. Since C = A^T A and C̃_i = A(Ξ_i, :)^T A(Ξ_i, :), we have

u^T C u = Σ_{j=1}^{m} u^T A(j, :)^T A(j, :) u,
u^T C̃_i u = Σ_{j∈Ξ_i} u^T A(j, :)^T A(j, :) u,
Σ_{i=1}^{N} u^T C̃_i u = Σ_{i=1}^{N} Σ_{j∈Ξ_i} u^T A(j, :)^T A(j, :) u.

From the definition of m_j, the term u^T A(j, :)^T A(j, :) u appears m_j times in the last equation. Thus,

Σ_{i=1}^{N} u^T C̃_i u ≤ Σ_{j=1}^{m} m_j u^T A(j, :)^T A(j, :) u
≤ (max_{1≤j≤m} m_j) Σ_{j=1}^{m} u^T A(j, :)^T A(j, :) u
= (max_{1≤j≤m} m_j) u^T C u,

from which it follows that we can choose k_m = max_{1≤j≤m} m_j. Now, if 1 ≤ l ≤ m, there exist i_1, …, i_{m_l} such that l ∈ Ξ_{i_1} ∩ ··· ∩ Ξ_{i_{m_l}}. Furthermore, m_l ≤ max_{1≤p≤m_l} k_{i_p}. Taking the maximum over l on both sides, we obtain

k_m = max_{1≤l≤m} m_l ≤ max_{1≤i≤N} k_i.

Note that because A is sparse, k_m is independent of the number of subdomains.
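In code, m_j and the resulting k_m are one-liners. Using the row sets Ξ_1 = {1, 2, 3} and Ξ_2 = {2, 4, 5} of the earlier worked example (0-based below):

```python
# Row index sets from the 5x4 example: Xi_1 = {1,2,3}, Xi_2 = {2,4,5} (1-based)
m, xi = 5, [{0, 1, 2}, {1, 3, 4}]

# m_j = #{ i | j in Xi_i }: how many subdomains touch row j
mult = [sum(j in xi_i for xi_i in xi) for j in range(m)]
k_m = max(mult)

assert mult == [1, 2, 1, 1, 1]   # only row 2 (1-based) is shared between subdomains
assert k_m == 2
```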
3.3. Algorithms and technical details. In this section, we discuss the
technical details involved in constructing a two-level preconditioner for the normal
equations matrix.
3.3.1. Partition of unity. Because the matrix A_{IΓ,i} may be of low rank, the null space of C̃_{ii} (3.5) can be large. Recall that the diagonal matrices D_i have dimension n_i = n_{Ii} + n_{Γi}. Choosing the entries in positions n_{Ii} + 1, …, n_i of the diagonal of D_i to be zero, as in (3.4), results in the subspace of ker(C̃_{ii}) caused by the rank deficiency of A_{IΓ,i} lying within ker(D_i C_{ii} D_i), reducing the size of the space 𝒵 given by (2.8). In other words, if A_{IΓ,i} u = 0, we have C̃_{ii} v = 0, where v^T = (0, u^T), i.e., v ∈ ker(C̃_{ii}), and because by construction D_i v = 0, we have v ∈ ker(C̃_{ii}) ∩ ker(D_i C_{ii} D_i); therefore, v need not be included in 𝒵_i.
3.3.2. The eigenvalue problem. The generalized eigenvalue problem presented in Lemma 2.4 is critical in the construction of the two-level preconditioner. Although the definition of 𝒵_i from (2.7) suggests it is necessary to compute the null space of C̃_{ii} and that of D_i C_{ii} D_i and their intersection, in practice, this can be avoided. Consider the generalized eigenvalue problem

(3.6) D_i C_{ii} D_i v = λ C̃_{ii} v,

where, by convention, we set λ = 0 if v ∈ ker(C̃_{ii}) ∩ ker(D_i C_{ii} D_i) and λ = ∞ if v ∈ ker(C̃_{ii}) \ ker(D_i C_{ii} D_i). The subspace 𝒵_i defined in (2.7) can then be written as

𝒵_i = span{ v | D_i C_{ii} D_i v = λ C̃_{ii} v and λ > 1/τ }.

Consider also the shifted generalized eigenvalue problem

(3.7) D_i C_{ii} D_i v = λ (C̃_{ii} + s I_{n_i}) v,

where 0 < s ≪ 1. Note that if s is such that C̃_{ii} + s I_{n_i} is numerically of full rank, (3.7) can be solved using any off-the-shelf generalized eigenproblem solver. Let (v, λ) be an eigenpair of (3.7). Then, we can only have one of the following situations:
Cii)ker(DiCiiDi). In which
case, (v, 0) is an eigenpair of (3.6).
Cii)range(DiCiiDi). Then,
and, as sis small, (v, λ) is a good approximation of an eigenpair of (3.6)
corresponding to a finite eigenvalue.
Cii)range(DiCiiDi). Then, DiCiiDiv=λsv, i.e., λs is a nonzero
eigenvalue of DiCii Di. Because Diis defined such that the diagonal values
corresponding to the boundary nodes are zero, the nonzero eigenvalues of
DiCiiDicorrespond to the squared singular values of A(:,I i). Hence, all
the eigenpairs of (3.6) corresponding to an infinite eigenvalue are included in
the set of eigenpairs (v, λ) of (3.7) such that
(3.8) σ2
min (A(:,Ii )) λs σ2
max (A(:,Ii )) ,
where σmin (A(:,Ii )) and σmax (A(:,Ii )) are the smallest and largest
singular values of A(:,Ii), respectively.
Note that, in practice, A(:, Ω_{Ii}) is well conditioned because, if not, there would be local linear dependence between some columns of A. Therefore, choosing

s = ε ‖C̃_{ii}‖_2,

where ε is the machine precision, ensures that C̃_{ii} + s I_{n_i} is numerically invertible and s ≪ 1. Setting s = ε‖C̃_{ii}‖_2 in (3.8), we obtain

σ²_min(A(:, Ω_{Ii})) ≤ λ ε ‖C̃_{ii}‖_2 ≤ σ²_max(A(:, Ω_{Ii})).

By (3.5), we have ‖C̃_{ii}‖_2 ≤ ‖C_{ii}‖_2, and because Ω_{Ii} ⊂ Ω_i, it follows that

‖C_{ii}^{-1}‖_2 = ‖(A(:, Ω_i)^T A(:, Ω_i))^{-1}‖_2 ≥ σ^{-2}_min(A(:, Ω_{Ii})).

Hence, if (v, λ) is an eigenpair of (3.7) with v ∈ ker(C̃_{ii}) \ ker(D_i C_{ii} D_i), then

λ ≥ 1 / (ε κ(C_{ii})),

where κ(C_{ii}) is the condition number of C_{ii}, and 𝒵_i can be defined to be

(3.9) 𝒵_i = span{ v | D_i C_{ii} D_i v = λ (C̃_{ii} + ε‖C̃_{ii}‖_2 I_{n_i}) v and λ > min(1/τ, 1/(ε κ(C_{ii}))) }.

Z_i is then taken to be the matrix whose columns are the corresponding eigenvectors.
Remark 3.3. Note that solving the generalized eigenvalue problem (3.7) by an iterative method such as Krylov–Schur [41] does not require the explicit form of C_{ii} and C̃_{ii}. Rather, it requires solving linear systems of the form (C̃_{ii} + s I_{n_i}) u = v, together with matrix–vector products of the form (C̃_{ii} + s I_{n_i}) v and C_{ii} v. It is clear that these products do not require the matrices C̃_{ii} and C_{ii} to be formed. Regarding the solution of the linear system (C̃_{ii} + s I_{n_i}) u = v, Remark 3.1 also applies to the Cholesky factorization of C̃_{ii} + s I_{n_i} = X^T X, where X^T = [A(Ξ_i, Ω_i)^T  √s I_{n_i}], which can be computed by using a “thin” QR factorization of X.
From Remarks 3.1 and 3.3, and applying the same technique therein to factor C_{00} = R_0 C R_0^T = (A R_0^T)^T (A R_0^T), we observe that, given the overlapping partitions of A, the proposed two-level preconditioner can be constructed without forming the normal equations matrix. Algorithm 3.1 gives an overview of the steps for constructing our two-level Schwarz preconditioner for the normal equations matrix. The actual implementation of our proposed preconditioner will be discussed in greater detail in subsection 4.1.
Algorithm 3.1 Two-level Schwarz preconditioner for the normal equations matrix
Input: matrix A, number of subdomains N, threshold τ to bound the condition number.
Output: two-level preconditioner M^{-1} for C = A^T A.
1: (Ω_{I1}, …, Ω_{IN}) = Partition(A, N)
2: for i = 1 to N in parallel do
3:   Ξ_i = FindNonzeroRows(A(:, Ω_{Ii}))
4:   Ω_i = [Ω_{Ii}, Ω_{Γi}] = FindNonzeroColumns(A(Ξ_i, :))
5:   Define D_i as in subsection 3.3.1 and R_i as in section 2
6:   Perform Cholesky factorization of C_{ii} = A(:, Ω_i)^T A(:, Ω_i), see Remark 3.1
7:   Perform Cholesky factorization of C̃_{ii} = A(Ξ_i, Ω_i)^T A(Ξ_i, Ω_i), possibly using a small shift s, see Remark 3.3
8:   Compute Z_i as defined in (3.9)
9: end for
10: Set R_0^T = [R_1^T D_1 Z_1, …, R_N^T D_N Z_N]
11: Perform Cholesky factorization of C_{00} = (A R_0^T)^T (A R_0^T)
12: Set M^{-1} = M_additive^{-1} = Σ_{i=0}^{N} R_i^T C_{ii}^{-1} R_i, or M_balanced^{-1} (2.10), or M_deflated^{-1} (2.11)
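To make the steps concrete, here is a compact dense NumPy sketch of Algorithm 3.1 (our own toy, not the PETSc implementation of section 4: step 1 is replaced by a user-supplied column partition, C and the local matrices are formed explicitly rather than via thin QR, and a fixed small shift replaces the ε-based one of subsection 3.3.2):

```python
import numpy as np

def two_level_schwarz_ne(A, omega_I, tau=10.0):
    """Dense toy version of Algorithm 3.1: returns M_additive^{-1} for C = A^T A."""
    m, n = A.shape
    C = A.T @ A
    I_n = np.eye(n)
    R, D, Z = [], [], []
    for oI in omega_I:
        xi = np.flatnonzero(A[:, oI].any(axis=1))           # step 3
        cols = np.flatnonzero(A[xi, :].any(axis=0))         # step 4
        gamma = [j for j in cols if j not in oI]
        omega = list(oI) + gamma
        Ri = I_n[omega, :]                                  # step 5
        Di = np.diag([1.0] * len(oI) + [0.0] * len(gamma))
        Cii = A[:, omega].T @ A[:, omega]                   # step 6 (no thin QR in this toy)
        Axo = A[np.ix_(xi, omega)]
        Ct = Axo.T @ Axo                                    # step 7
        s = 1e-10 * np.linalg.norm(Ct, 2)                   # fixed toy shift
        L = np.linalg.cholesky(Ct + s * np.eye(len(omega)))
        Li = np.linalg.inv(L)
        lam, W = np.linalg.eigh(Li @ (Di @ Cii @ Di) @ Li.T)
        Zi = (Li.T @ W)[:, lam > 1.0 / tau]                 # step 8
        R.append(Ri); D.append(Di); Z.append(Zi)
    R0T = np.hstack([Ri.T @ Di @ Zi for Ri, Di, Zi in zip(R, D, Z)])  # step 10
    M = np.zeros((n, n))
    if R0T.shape[1] > 0:
        C00 = (A @ R0T).T @ (A @ R0T)                       # step 11
        M += R0T @ np.linalg.solve(C00, R0T.T)
    for Ri in R:                                            # step 12 (additive form)
        M += Ri.T @ np.linalg.inv(Ri @ C @ Ri.T) @ Ri
    return M

# The 5x4 example matrix with columns partitioned as {1,3} and {2,4} (1-based)
A = np.array([[1.0, 0, 6, 0],
              [2, 4, 0, 0],
              [3, 0, 0, 0],
              [0, 5, 0, 7],
              [0, 0, 0, 8]])
M = two_level_schwarz_ne(A, [[0, 2], [1, 3]])
assert np.allclose(M, M.T)                                  # M_additive^{-1} is symmetric
assert np.all(np.linalg.eigvals(M @ (A.T @ A)).real > 0)    # preconditioned spectrum > 0
```

In the parallel implementation the loop body runs concurrently, one subdomain per process, and M^{-1} is applied matrix-free within a Krylov method rather than assembled.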
4. Numerical experiments. In this section, we illustrate the effectiveness of the new two-level LS preconditioners M_balanced^{-1} and M_deflated^{-1}, their robustness with respect to the number of subdomains, and their efficiency in tackling large-scale sparse and ill-conditioned LS problems selected from the SuiteSparse Matrix Collection [12].
The test matrices are listed in Table 1. For each matrix, we report its dimensions,
the number of entries in Aand in the normal equations matrix C, and the condition
number of C(estimated using the MATLAB function condest).
Table 1
Test matrices taken from the SuiteSparse Matrix Collection

Identifier      m          n          nnz(A)     nnz(C)      condest(C)
mesh_deform     234,023    9,393      853,829    117,117     2.7·10^6
EternityII_E    262,144    11,077     1,503,732  1,109,181   5.1·10^19
lp_stocfor3     23,541     16,675     72,721     223,395     4.0·10^10
deltaX          68,600     21,961     247,424    2,623,073   3.7·10^20
sc205-2r        62,423     35,213     123,239    12,984,043  1.7·10^7
stormg2-125     172,431    65,935     433,256    1,953,519
Rucci1          1,977,885  109,900    7,791,168  9,747,744   2.0·10^8
image_interp    232,485    120,000    711,683    1,555,994   4.7·10^7
mk13-b5         270,270    135,135    810,810    1,756,755
pds-100         514,577    156,016    1,096,002  1,470,688
fome21          267,596    216,350    465,294    640,240
sgpf5y6         312,540    246,077    831,976    2,761,021   6.0·10^6
Hardesty2       929,901    303,645    4,020,731  3,936,209   1.2·10^10
Delor338K       450,807    343,236    4,211,599  44,723,076  1.5·10^7
watson_2        677,224    352,013    1,846,391  3,390,279   1.0·10^7
stormG2_1000    1,377,306  526,185    3,459,881  82,987,269
LargeRegFile    2,111,154  801,374    4,944,201  6,378,592   3.0·10^8
cont11_l        1,961,394  1,468,599  5,382,999  18,064,261  2.0·10^10
In subsection 4.1, we discuss our implementation based on the parallel backend PETSc [6]. In particular, we show that very little coding effort is needed to construct all the necessary algebraic tools, and that it is possible to take advantage of an existing package, such as HPDDM [26], to set up the new preconditioners efficiently. We then show in subsection 4.2 how M^{-1}_balanced and M^{-1}_deflated perform compared to other preconditioners when solving challenging LS problems. The preconditioners we consider are:
• the limited memory incomplete Cholesky (IC) factorization specialized for the normal equations matrix as implemented in HSL_MI35 from the HSL library [24] (note that this package is written in Fortran and we run it using the supplied MATLAB interface with default parameter settings);
• the one-level overlapping Schwarz methods M^{-1}_ASM and M^{-1}_RAS as implemented in PETSc;
• algebraic multigrid methods as implemented both in BoomerAMG from the HYPRE library [19] and in GAMG [1] from PETSc.
Finally, in subsection 4.3, we study the strong scalability of M^{-1}_balanced and its robustness with respect to the number of subdomains by using a fixed problem and increasing the number of subdomains.
With the exception of the serial IC code HSL_MI35, all the numerical experiments are performed on Irène, a system composed of 2,292 nodes with two 64-core AMD Rome processors clocked at 2.6 GHz and, unless stated otherwise, 256 MPI processes are used. For the domain decomposition methods, one subdomain is assigned per MPI process.
In all our experiments, the vector b in (1.1) is generated randomly and the initial guess for the iterative solver is zero. When constructing our new two-level preconditioners, with the exception of the results presented in Figure 1, at most 300 eigenpairs are computed on each subdomain and the threshold parameter τ from (3.9) is set to 0.6. These parameters were found to provide good numerical performance after a very quick trial-and-error approach on a single problem. We did not want to adjust them for each problem from Table 1, but it will be shown next that they are adequate overall without additional tuning.
4.1. Implementation aspects. The new two-level preconditioners are implemented on top of the well-known distributed memory library PETSc. This section is not aimed at PETSc specialists. Rather, we want to briefly explain what was needed to provide an efficient yet concise implementation. Our new code is open-source, available at https://github.com/prj-/aldaas2021robust. It comprises fewer than 150 lines of code (including the initialization and error analysis). The main source files, written in Fortran, C, and Python, have three major phases, which we now outline.
4.1.1. Loading and partitioning phase. First, PETSc is used to load the matrix A in parallel, following a contiguous one-dimensional row partitioning among MPI processes. We explicitly assemble the normal equations matrix using the routine MatTransposeMatMult [31]. The initial PETSc-enforced parallel decomposition of A among processes may not be appropriate for the normal equations, so ParMETIS is used by PETSc to repartition C. This also induces a permutation of the columns of A.
4.1.2. Setup phase. To ensure that the normal equations matrix C is definite, it is shifted by 10^{-10}‖C‖_F I_n (here and elsewhere, ‖·‖_F denotes the Frobenius norm). Note that this is only needed to set up the preconditioner; the preconditioner is used to solve the original LS problem. Given the indices of the columns owned by an MPI process, we call the routine MatIncreaseOverlap on the normal equations matrix to build an extended set of column indices of A that will be used to define overlapping subdomains. These are the Ω_i as defined in (3.1). Using the routine MatFindNonzeroRows, this extended set of indices is used to concurrently find on each subdomain the set of nonzero rows. These are the sets Ξ_i as illustrated in (3.2) and (3.3). The subdomain matrices C_ii from (2.1) as well as the partition of unity D_i as illustrated in (3.4) are automatically assembled by PETSc when using domain decomposition preconditioners such as PCASM or PCHPDDM. The right-hand side matrices of the generalized eigenvalue problems (3.6) are assembled using MatTransposeMatMult, but note that this product is this time performed concurrently on each subdomain. The small shift s from (3.7) is set to 10^{-8}‖C̃_ii‖_F. These matrices and the sets of overlapping column indices are passed to PCHPDDM using the routine PCHPDDMSetAuxiliaryMat. The rest of the setup is hidden from the user. It includes solving the generalized eigenvalue problems using SLEPc [23], followed by the assembly and redistribution of the second-level operator using a Galerkin product (2.2) (see [25] for more details on how this is performed efficiently).
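A conceptual, serial SciPy sketch of the start of this phase (with a random stand-in for A; the real code is distributed and relies on PETSc's MatTransposeMatMult): form C explicitly, then apply the tiny Frobenius-norm-scaled diagonal shift used only during setup.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

rng = np.random.default_rng(2)
A = sp.random(40, 10, density=0.3, random_state=rng, format="csr")  # stand-in for A
C = (A.T @ A).tocsr()  # explicit normal equations (MatTransposeMatMult in PETSc)

# Tiny diagonal shift so that the setup works with a definite matrix;
# the shift only affects the preconditioner, not the LS problem being solved.
shift = 1e-10 * spla.norm(C, "fro")
C_setup = C + shift * sp.eye(C.shape[0], format="csr")

assert np.allclose(C_setup.diagonal() - C.diagonal(), shift)
```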
4.1.3. Solution phase. For the solution phase, users can choose between multiple Krylov methods, including LSQR [34] and GMRES. Each iteration of LSQR requires matrix–vector products with A and A^⊤. For GMRES, instead of using the previously explicitly assembled normal equations matrix, we use an implicit representation of the operator that computes the matrix–vector product with A followed by the product with A^⊤. The type of overlapping Schwarz method (additive or restricted additive) as well as the type of second-level correction (balanced or
Table 2
Preconditioner comparison when running LSQR. Iteration counts are reported. M^{-1}_ASM and M^{-1}_balanced are the one- and two-level overlapping Schwarz preconditioners, respectively. † denotes that the iteration count exceeds 1,000. ‡ denotes either a failure in computing the preconditioner because of memory issues or a breakdown of LSQR.

Identifier      M^{-1}_balanced   M^{-1}_ASM   BoomerAMG   GAMG   HSL_MI35
mesh deform 13 27 35 5
EternityII E 43 91 63 199
lp stocfor3 34 136 513 211
deltaX 23 98 784 640
sc205-2r 54 61 195 97
stormg2-125 42 174 † †
Rucci1 21 484 118 364
image interp 11 409 40 203
mk13-b5 19 21 11 11
pds-100 18 202 16 35 110
fome21 20 104 16 20 41
sgpf5y6 224 264 163 110
Hardesty2 30 913 88 404
Delor338K 10 11 829
watson 2 15 109 64 73
stormG2 1000 139 64 ‡ †
LargeRegFile 41 109 19 12
cont11 l 30 490 53 723
deflated) may be selected at runtime by the user. This flexibility is important because
LSQR requires a symmetric preconditioner.
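The implicit representation used with GMRES can be mimicked in SciPy with a LinearOperator whose matvec applies A and then A^⊤ (a serial, unpreconditioned sketch with a random full-column-rank stand-in for A; the actual implementation is distributed and preconditioned):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, gmres

rng = np.random.default_rng(3)
A = sp.vstack([sp.random(50, 15, density=0.3, random_state=rng, format="csr"),
               sp.eye(15, format="csr")]).tocsr()  # identity block: full column rank
b = rng.standard_normal(A.shape[0])

# Implicit normal-equations operator: one product with A followed by one with
# A^T per application, so C = A^T A is never assembled for the Krylov method.
n = A.shape[1]
C_op = LinearOperator((n, n), matvec=lambda v: A.T @ (A @ v), dtype=float)

x, info = gmres(C_op, A.T @ b)  # GMRES applied to the normal equations
assert info == 0
# x solves A^T A x = A^T b, i.e. the LS problem min ||Ax - b||_2.
assert np.linalg.norm(A.T @ (A @ x) - A.T @ b) <= 1e-4 * np.linalg.norm(A.T @ b)
```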
4.2. Numerical validation. In this section, we validate the effectiveness of the two-level method when compared to other preconditioners. Table 2 presents a comparison between five preconditioners: two-level additive Schwarz with balanced coarse correction M^{-1}_balanced, one-level additive Schwarz M^{-1}_ASM, BoomerAMG, GAMG, and HSL_MI35. The first level of the one- and two-level methods both use the additive Schwarz formulation; the second level uses the balanced deflation formulation (2.10). The results are for the iterative solver LSQR. If M denotes the preconditioner, LSQR terminates when the LS residual satisfies

    ‖(AM^{-1})^⊤(Ax − b)‖_2 / (‖A‖_{M,F} ‖Ax − b‖_2) ≤ 10^{-8},

where ‖A‖_{M,F} = (Σ_{i=1}^n λ_i(M^{-1}A^⊤A))^{1/2} is the square root of the sum of the positive eigenvalues of M^{-1}A^⊤A, which is approximated by LSQR itself. Note that if M^{-1} = W^{-1}W^{-⊤}, then ‖A‖_{M,F} = ‖AW^{-1}‖_F.
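The identity behind ‖A‖_{M,F} can be verified numerically (a dense NumPy sketch with a random A and a hypothetical invertible factor W): the sum of the eigenvalues of M^{-1}A^⊤A equals the trace of W^{-⊤}A^⊤AW^{-1}, i.e. ‖AW^{-1}‖_F².

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 30, 6
A = rng.standard_normal((m, n))
W = np.triu(rng.standard_normal((n, n))) + 5 * np.eye(n)  # invertible factor
Winv = np.linalg.inv(W)
Minv = Winv @ Winv.T  # M^{-1} = W^{-1} W^{-T}

# ||A||_{M,F}^2 = sum of eigenvalues of M^{-1} A^T A = ||A W^{-1}||_F^2,
# since the eigenvalue sum is the trace of W^{-T} A^T A W^{-1}.
lhs = np.linalg.norm(A @ Winv, "fro") ** 2
rhs = np.sum(np.linalg.eigvals(Minv @ A.T @ A).real)
assert np.isclose(lhs, rhs)
```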
It is clear that both the one- and two-level Schwarz methods are more robust than the other preconditioners as they encounter no breakdowns and solve all the LS problems using fewer than 1,000 iterations. Because HSL_MI35 is a sequential code that runs on a single core, there was not enough memory to compute the preconditioner for problem cont11_l. For many of the problems, the iteration count for HSL_MI35 can be reduced by increasing the parameters that determine the number of entries in the IC factor (the default values are rather small for the large
Table 3
Preconditioner comparison when running GMRES. Iteration counts are reported. M^{-1}_RAS and M^{-1}_deflated are the one- and two-level overlapping Schwarz preconditioners, respectively. † denotes that the iteration count exceeds 1,000. ‡ denotes either a failure in computing the preconditioner because of memory issues or a breakdown of GMRES.

Identifier      M^{-1}_deflated   M^{-1}_RAS   BoomerAMG   GAMG   HSL_MI35
mesh deform 6 27 21 50 5
EternityII E 593 97 186
lp stocfor3 21 198
deltaX 693 † †
sc205-2r 12 125 490 69
stormg2-125 23 ‡ †
Rucci1 10 958 213 882
image interp 10 971 67 476
mk13-b5 14 18 21 12
pds-100 10 84 23 51 115
fome21 10 55 22 29 41
sgpf5y6 116 † † 249 100
Hardesty2 26 155 † †
Delor338K 59 † †
watson 2 7134 252 96 73
stormG2 1000 108 22 186 ‡ †
LargeRegFile 621 23 11
cont11 l 45 172 † ‡
test examples). LSQR preconditioned with BoomerAMG breaks down for several problems, as reported by the PETSc error code KSP_DIVERGED_BREAKDOWN. GAMG is more robust but requires more iterations for problems where both algebraic multigrid solvers are successful. Note that even with more advanced options than the defaults set by PETSc, such as PMIS coarsening [13] with extended classical interpolation [14] for BoomerAMG or Schwarz smoothing for GAMG, these solvers do not perform considerably better numerically. We can also see that the two-level preconditioner outperforms the one-level preconditioner, with the exception of problem stormG2_1000, for which the normal equations matrix is not very sparse (see column 5 of Table 1). In fact, the matrix A has 121 relatively dense rows with more than 1,000 nonzeros each, while the remaining rows have at most 5 nonzeros.
Table 3 presents a similar comparison, but using right-preconditioned GMRES applied directly to the normal equations (1.2). A restart parameter of 100 is used. The relative tolerance is again set to 10^{-8}, but it now applies to the unpreconditioned residual. We switch from M^{-1}_ASM to M^{-1}_RAS (2.9), which is known to perform better numerically. For the two-level method, we switch from M^{-1}_balanced to M^{-1}_deflated (2.11). Switching from LSQR to GMRES can be beneficial for some preconditioners, e.g., BoomerAMG now converges in 21 iterations instead of breaking down for problem mesh_deform. But this is not always the case, e.g., HSL_MI35 applied to problem deltaX does not converge within the 1,000 iteration limit. The two-level method is the most robust approach, while the restricted additive Schwarz preconditioner struggles to solve some problems, either because of a breakdown (problem stormg2-125) or because of slow convergence (problems lp_stocfor3, sgpf5y6, Hardesty2, and cont11_l).
Fig. 1. Influence of the threshold parameter τ on the convergence of preconditioned LSQR for problem watson_2 (m = 677,224 and n = 352,013). The plot shows the relative residual against the iteration number for τ ranging from 0.01275 to 0.9; the accompanying table reports the size n_0 of the second level and the iteration count for each τ.

τ         n_0       Iterations
0.01275   2,400     49
0.02      2,683     39
0.05      3,049     30
0.1       3,337     24
0.4       8,979     15
0.9       44,682    13
1.2       153,600   11
Recall that for the results in Tables 2 and 3, the two-level preconditioner was constructed using at most 300 eigenpairs and the threshold parameter τ was set to 0.6. Whilst this highlights that tuning τ for individual problems is not necessary to successfully solve a range of problems, it does not validate the ability of our preconditioner to concurrently select the most appropriate local eigenpairs to define an adaptive preconditioner. To that end, for problem watson_2, we consider the effect of varying τ on the performance of our two-level preconditioner. Results for LSQR with M^{-1}_ASM and M^{-1}_balanced are presented in Figure 1. Here, 512 MPI processes are used and the convergence tolerance is again 10^{-8}. We observe that the two-level method consistently outperforms the one-level method. Furthermore, as we increase τ, the convergence is faster and the size n_0 of the second level increases. It is also interesting to highlight that the convergence is smooth even with the very small value τ = 0.01275, for which n_0 = 2,400 is small compared to the dimension 3.52·10^5 of the normal equations matrix.
4.3. Performance study. We next investigate the algorithmic cost of the two-level method. To do so, we perform a strong scaling analysis using a large problem not presented in Table 1 but still from the SuiteSparse Matrix Collection, Hardesty3. The matrix is of dimension 8,217,820 × 7,591,564, and the number of nonzero entries in C is 98,634,426. In Table 4, we report the number of iterations as well as the eigensolve, setup, and solve times as the number N of subdomains ranges from 16 to 4,096. The times are obtained using the PETSc -log_view command line option. For each N, the reported times on each row of the table are the maximum among all processes. The setup time includes the numerical factorization of the first-level subdomain matrices, as well as the assembly of the second-level operator and its factorization. Note that the symbolic factorization of the first-level subdomain matrices is shared between the domain decomposition preconditioner and the eigensolver because we use the Krylov–Schur method as implemented in SLEPc, which requires the factorization of the right-hand side matrices from (3.7). The Cholesky factorizations of the subdomain matrices and of the second-level operator are performed using the sparse direct solver MUMPS [5]. For small numbers of subdomains (N < 128), the cost of the eigensolves is clearly prohibitive. By increasing the number of subdomains, thus reducing their
Table 4
Strong scaling for problem Hardesty3 (m = 8,217,820 and n = 7,591,564) for N ranging from 16 to 4,096 subdomains. All times are in seconds. Column 2 reports the LSQR iteration count. Column 4 reports the setup time minus the concurrent solution time of the generalized eigenproblems, which is given in column 3.

N      Iterations  Eigensolve  Setup  Solve  n_0      Total    Speedup
16     113         2,417.4     24.5   301.3  4,800    2,743.2
32     117         1,032.7     14.1   154.2  9,600    1,201.0  2.3
64     129         887.2       11.4   112.3  19,200   1,010.9  2.7
128    144         224.1       6.9    55.4   38,400   286.3    9.6
256    97          128.0       6.7    32.2   76,800   166.9    16.4
512    87          45.5        13.0   26.9   153,391  85.3     32.2
1,024  85          23.8        20.2   35.3   303,929  79.3     34.6
2,048  55          14.6        31.4   43.2   497,704  89.1     30.8
4,096  59          11.7        30.8   44.9   695,774  87.3     31.4
size, the time to construct the preconditioner becomes much more tractable and, overall, our implementation yields good speedups on a wide range of process counts. Note that the threshold parameter τ = 0.6 is not attained on any of the subdomains for N ranging from 16 up to 256, so that n_0 = 300 × N. For larger N, τ = 0.6 is attained, the preconditioner automatically selects the appropriate eigenmodes, and convergence improves (see column 2 of Table 4). When N is large (N ≥ 1,024), the setup and solve times are impacted by the high cost of factorizing and solving the second-level problems, which, as highlighted by the values of n_0, become large. Multilevel variants [4] could be used to overcome this, but that goes beyond the scope of the current study.
5. Concluding comments. Solving large-scale sparse linear least-squares problems is known to be challenging. Previously proposed preconditioners have generally been serial and have involved incomplete factorizations of A or C = A^⊤A. In this paper, we have employed ideas that have been developed in the area of domain decomposition, which (as far as we are aware) have not previously been applied to least-squares problems. In particular, we have exploited recent work by Al Daas and Grigori [3] on algebraic domain decomposition preconditioners for SPD systems to propose a new two-level algebraic domain decomposition preconditioner for the normal equations matrix C. We have used the concept of an algebraic local SPSD splitting of an SPD matrix, and we have shown that the structure of C as the product of A^⊤ and A can be used to perform the splitting efficiently. Furthermore, we have proved that, using the two-level preconditioner, the spectral condition number of the preconditioned normal equations matrix is bounded from above independently of the number of subdomains and the size of the problem. Moreover, this upper bound depends on a parameter τ that the user can choose to decrease (resp. increase) the upper bound, at the price of larger (resp. smaller) costs for setting up the preconditioner.
The new two-level preconditioner has been implemented in parallel within PETSc. Numerical experiments on a range of problems from real applications have shown that, whilst both one-level and two-level domain decomposition preconditioners are effective when used with LSQR to solve the normal equations, the latter consistently results in significantly faster convergence. It also outperforms other possible preconditioners, both in terms of robustness and iteration counts. Furthermore, our numerical experiments on a set of challenging least-squares problems show that the two-level preconditioner is robust with respect to the parameter τ. Moreover, a strong scalability test of the two-level preconditioner assessed its robustness with respect to the number of subdomains.
Future work includes extending the approach to develop preconditioners for solving large sparse–dense least-squares problems in which A contains a small number of rows that have many more entries than the other rows. These rows cause the normal equations matrix to be dense and so need to be handled separately (see, for example, the recent work of Scott and Tůma [39] and references therein). As already observed, we also plan to consider multilevel variants to allow the use of a larger number of subdomains and processes.
Acknowledgments. This work was granted access to the GENCI-sponsored
HPC resources of TGCC@CEA under allocation A0090607519. The authors would
like to thank L. Dalcin, V. Hapla, and T. Isaac for their recent contributions to PETSc
that made the implementation of our preconditioner more flexible.
Code reproducibility. Interested readers are referred to https://github.
com/prj-/aldaas2021robust/blob/main/ for setting up the appropriate
requirements, compiling, and running our proposed preconditioner. Fortran, C, and
Python source codes are provided.
[1] M. F. Adams, H. H. Bayraktar, T. M. Keaveny, and P. Papadopoulos,Ultrascalable
implicit finite element analyses in solid mechanics with over a half a billion degrees of
freedom, in Proceedings of the 2004 ACM/IEEE Conference on Supercomputing, SC04,
IEEE Computer Society, 2004, pp. 34:1–34:15.
[2] E. Agullo, A. Buttari, A. Guermouche, and F. Lopez,Implementing multifrontal
sparse solvers for multicore architectures with sequential task flow runtime systems,
ACM Transactions on Mathematical Software, 43 (2016), http://buttari.perso.enseeiht.
fr/qr mumps.
[3] H. Al Daas and L. Grigori,A class of efficient locally constructed preconditioners based on
coarse spaces, SIAM Journal on Matrix Analysis and Applications, 40 (2019), pp. 66–91.
[4] H. Al Daas, L. Grigori, P. Jolivet, and P.-H. Tournier,A multilevel Schwarz
preconditioner based on a hierarchy of robust coarse spaces, SIAM Journal on Scientific
Computing, 43 (2021), pp. A1907–A1928.
[5] P. R. Amestoy, I. S. Duff, J.-Y. L’Excellent, and J. Koster,A fully asynchronous
multifrontal solver using distributed dynamic scheduling, SIAM Journal on Matrix Analysis
and Applications, 23 (2001), pp. 15–41,
[6] S. Balay, S. Abhyankar, M. F. Adams, J. Brown, P. Brune, K. Buschelman, L. Dalcin,
A. Dener, V. Eijkhout, W. D. Gropp, D. Karpeyev, D. Kaushik, M. G. Knepley,
D. A. May, L. C. McInnes, R. T. Mills, T. Munson, K. Rupp, P. Sanan, B. F. Smith,
S. Zampini, H. Zhang, and H. Zhang,PETSc web page, 2021,
[7] R. Bru, J. Marín, J. Mas, and M. Tůma, Preconditioned iterative methods for solving linear least-squares problems, SIAM Journal on Scientific Computing, 36 (2014), pp. A2002–
[8] X.-C. Cai and M. Sarkis,A restricted additive Schwarz preconditioner for general sparse
linear systems, SIAM Journal on Scientific Computing, 21 (1999), pp. 792–797.
[9] T. F. Chan and T. P. Mathew,Domain decomposition algorithms, Acta Numerica, 3 (1994),
pp. 61–143.
[10] X. Cui and K. Hayami,Generalized approximate inverse preconditioners for least-squares
problems, Japan Journal of Industrial and Applied Mathematics, 26 (2009).
[11] T. A. Davis,Algorithm 915, SuiteSparseQR: multifrontal multithreaded rank-revealing sparse
QR factorization, ACM Transactions on Mathematical Software, 38 (2011).
[12] T. A. Davis and Y. Hu,The University of Florida sparse matrix collection, ACM Transactions
on Mathematical Software, 38 (2011), pp. 1–28.
[13] H. De Sterck, R. D. Falgout, J. W. Nolting, and U. M. Yang,Distance-two interpolation
for parallel algebraic multigrid, Numerical Linear Algebra with Applications, 15 (2008),
pp. 115–139.
[14] H. De Sterck, U. M. Yang, and J. J. Heys,Reducing complexity in parallel algebraic
multigrid preconditioners, SIAM Journal on Matrix Analysis and Applications, 27 (2006),
pp. 1019–1039.
[15] V. Dolean, P. Jolivet, and F. Nataf,An introduction to domain decomposition methods.
Algorithms, theory, and parallel implementation, Society for Industrial and Applied
Mathematics, 2015.
[16] I. S. Duff, R. Guivarch, D. Ruiz, and M. Zenadi,The augmented block Cimmino distributed
method, SIAM Journal on Scientific Computing, 37 (2015), pp. A1248–A1269.
[17] A. Dumitraşc, P. Leleux, C. Popa, D. Ruiz, and S. Torun, The augmented block Cimmino algorithm revisited, 2018,
[18] T. Elfving,Block-iterative methods for consistent and inconsistent linear equations,
Numerische Mathematik, 35 (1980), pp. 1–12.
[19] R. D. Falgout and U. M. Yang, hypre: a library of high performance preconditioners,
Computational Science—ICCS 2002, (2002), pp. 632–641.
[20] M. J. Gander and A. Loneland,SHEM: an optimal coarse space for RAS and its multiscale
approximation, in Domain Decomposition Methods in Science and Engineering XXIII, C.-
O. Lee, X.-C. Cai, D. E. Keyes, H. H. Kim, A. Klawonn, E.-J. Park, and O. B. Widlund,
eds., Cham, 2017, Springer International Publishing, pp. 313–321.
[21] N. I. M. Gould and J. A. Scott,The state-of-the-art of preconditioners for sparse linear
least-squares problems, ACM Transactions on Mathematical Software, 43 (2017), pp. 36:1–
[22] A. Heinlein, C. Hochmuth, and A. Klawonn,Reduced dimension GDSW coarse spaces for
monolithic Schwarz domain decomposition methods for incompressible fluid flow problems,
International Journal for Numerical Methods in Engineering, 121 (2020), pp. 1101–1119.
[23] V. Hernandez, J. E. Roman, and V. Vidal,SLEPc: a scalable and flexible toolkit for the
solution of eigenvalue problems, ACM Transactions on Mathematical Software, 31 (2005),
pp. 351–362,
[24] HSL. A collection of Fortran codes for large-scale scientific computation, 2018. http://www.
[25] P. Jolivet, F. Hecht, F. Nataf, and C. Prud’homme,Scalable domain decomposition
preconditioners for heterogeneous elliptic problems, in Proceedings of the International
Conference on High Performance Computing, Networking, Storage and Analysis, SC ’13,
New York, NY, USA, 2013, ACM, pp. 80:1–80:11.
[26] P. Jolivet, J. E. Roman, and S. Zampini,KSPHPDDM and PCHPDDM: extending PETSc
with advanced Krylov methods and robust multilevel overlapping Schwarz preconditioners,
Computers & Mathematics with Applications, 84 (2021), pp. 277–295.
[27] G. Karypis and V. Kumar, Multilevel k-way partitioning scheme for irregular graphs, Journal of Parallel and Distributed Computing, 48 (1998), pp. 96–129.
[28] F. Kong and X.-C. Cai,A scalable nonlinear fluid–structure interaction solver based on a
Schwarz preconditioner with isogeometric unstructured coarse spaces in 3D, Journal of
Computational Physics, 340 (2017), pp. 498–518.
[29] N. Li and Y. Saad,MIQR: a multilevel incomplete QR preconditioner for large sparse least-
squares problems, SIAM Journal on Matrix Analysis and Applications, 28 (2006), pp. 524–
[30] P. Marchand, X. Claeys, P. Jolivet, F. Nataf, and P.-H. Tournier,Two-level
preconditioning for h-version boundary element approximation of hypersingular operator
with GenEO, Numerische Mathematik, 146 (2020), pp. 597–628.
[31] M. McCourt, B. F. Smith, and H. Zhang,Sparse matrix–matrix products executed through
coloring, SIAM Journal on Matrix Analysis and Applications, 36 (2015), pp. 90–109.
[32] Intel MKL Sparse QR, 2018.
[33] S. V. Nepomnyaschikh,Mesh theorems of traces, normalizations of function traces and their
inversions, Russian Journal of Numerical Analysis and Mathematical Modelling, 6 (1991),
pp. 1–25.
[34] C. C. Paige and M. A. Saunders,LSQR: an algorithm for sparse linear equations and sparse
least squares, ACM Transactions on Mathematical Software, 8 (1982), p. 43–71.
[35] F. Pellegrini and J. Roman,SCOTCH: a software package for static mapping by
dual recursive bipartitioning of process and architecture graphs, in High-Performance
Computing and Networking, Springer, 1996, pp. 493–498.
[36] Y. Saad and M. H. Schultz,GMRES: a generalized minimal residual algorithm for solving
nonsymmetric linear systems, SIAM Journal on Scientific and Statistical Computing, 7
(1986), pp. 856–869.
[37] J. A. Scott and M. Tůma, Preconditioning of linear least squares by robust incomplete factorization for implicitly held normal equations, SIAM Journal on Scientific Computing, 38 (2016), pp. C603–C623.
[38] J. A. Scott and M. Tůma, Solving mixed sparse–dense linear least-squares problems by preconditioned iterative methods, SIAM Journal on Scientific Computing, 39 (2017), pp. A2422–A2437.
[39] J. A. Scott and M. Tůma, Strengths and limitations of stretching for least-squares problems with some dense rows, ACM Transactions on Mathematical Software, 41 (2021), pp. 1:1–
[40] B. F. Smith, P. E. Bjørstad, and W. D. Gropp,Domain decomposition: parallel multilevel
methods for elliptic partial differential equations, Cambridge University Press, 1996.
[41] G. W. Stewart,A Krylov–Schur algorithm for large eigenproblems, SIAM Journal on Matrix
Analysis and Applications, 23 (2002), pp. 601–614.
[42] J. M. Tang, R. Nabben, C. Vuik, and Y. A. Erlangga,Comparison of two-level
preconditioners derived from deflation, domain decomposition and multigrid methods,
Journal of Scientific Computing, 39 (2009), pp. 340–370.
[43] J. Van lent, R. Scheichl, and I. G. Graham,Energy-minimizing coarse spaces for two-level
Schwarz methods for multiscale PDEs, Numerical Linear Algebra with Applications, 16
(2009), pp. 775–799.
ResearchGate has not been able to resolve any citations for this publication.
Full-text available
Monolithic preconditioners for incompressible fluid flow problems can significantly improve the convergence speed compared to preconditioners based on incomplete block factorizations. However, the computational costs for the setup and the application of monolithic preconditioners are typically higher. In this paper, several techniques to further improve the convergence speed as well as the computing time are applied to monolithic two‐level Generalized Dryja–Smith–Widlund (GDSW) preconditioners. In particular, reduced dimension GDSW (RGDSW) coarse spaces, restricted and scaled versions of the first level, hybrid and parallel coupling of the levels, and recycling strategies are investigated. Using a combination of all these improvements, for a small time‐dependent Navier‐Stokes problem on 240 MPI ranks, a reduction of 86 % of the time‐to‐solution can be obtained. Even without applying recycling strategies, the time‐to‐solution can be reduced by more than 50 % for a larger steady Stokes problem on 4 608 MPI ranks. For the largest problems with 11 979 MPI ranks the scalability deteriorates drastically for the monolithic GDSW coarse space. On the other hand, using the reduced dimension coarse spaces, good scalability up to 11 979 MPI ranks, which corresponds to the largest problem configuration fitting on the employed supercomputer, could be achieved. This article is protected by copyright. All rights reserved.
Full-text available
In this paper we present a class of robust and fully algebraic two-level preconditioners for SPD matrices. We introduce the notion of algebraic local SPSD splitting of an SPD matrix and we give a characterization of this splitting. This splitting leads to construct algebraically and locally a class of efficient coarse spaces which bound the spectral condition number of the preconditioned system by a number defined a priori. We also introduce the τ-filtering subspace. This concept helps compare the dimension minimality of coarse spaces. Some PDEs-dependant preconditioners correspond to a special case. The examples of the algebraic coarse spaces in this paper are not practical due to expensive construction. We propose a heuristic approximation that is not costly. Numerical experiments illustrate the efficiency of the proposed method.
Full-text available
The efficient solution of large linear least-squares problems in which the system matrix A contains rows with very different densities is challenging. Previous work has focused on direct methods for problems in which A has a few relatively dense rows. These rows are initially ignored, a factorization of the sparse part is computed using a sparse direct solver, and then the solution is updated to take account of the omitted dense rows. In some practical applications the number of dense rows can be significant, and for very large problems, using a direct solver may not be feasible. We propose processing rows that are identified as dense separately within a conjugate gradient method using an incomplete factorization preconditioner combined with the factorization of a dense matrix of size equal to the number of dense rows. Numerical experiments on large-scale problems from real applications are used to illustrate the effectiveness of our approach. The results demonstrate that we can efficiently solve problems that could not be solved by a preconditioned conjugate gradient method without exploiting the dense rows.
Full-text available
The efficient solution of the normal equations corresponding to a large sparse linear least squares problem can be extremely challenging. Robust incomplete factorization (RIF) preconditioners represent one approach that has the important feature of computing an incomplete LLT factorization of the normal equations matrix without having to form the normal matrix itself. The right-looking implementation of Benzi and Tuma has been used in a number of studies but experience has shown that in some cases it can be computationally slow and its memory requirements are not known a priori. Here a new left-looking variant is presented that employs a symbolic preprocessing step to replace the potentially expensive searching through entries of the normal matrix. This involves a directed acyclic graph (DAG) that is computed as the computation proceeds. An inexpensive but effective pruning algorithm is proposed to limit the number of edges in the DAG. Problems arising from practical applications are used to compare the performance of the right-looking approach with a left-looking implementation that computes the normal matrix explicitly and our new implicit DAG-based left-looking variant.
We recently introduced a sparse stretching strategy for handling dense rows that can arise in large-scale linear least-squares problems and make such problems challenging to solve. Sparse stretching is designed to limit the amount of fill within the stretched normal matrix and hence within the subsequent Cholesky factorization. While preliminary results demonstrated that sparse stretching performs significantly better than standard stretching, it has a number of limitations. In this article, we discuss and illustrate these limitations and propose new strategies that are designed to overcome them. Numerical experiments on problems arising from practical applications are used to demonstrate the effectiveness of these new ideas. We consider both direct and preconditioned iterative solvers.
Nonlinear fluid-structure interaction (FSI) problems on unstructured meshes in 3D appear in many applications in science and engineering, such as the vibration analysis of aircraft and the patient-specific diagnosis of cardiovascular diseases. In this work, we develop a highly scalable, parallel algorithmic and software framework for FSI problems consisting of a nonlinear fluid system and a nonlinear solid system that are coupled monolithically. The FSI system is discretized by a stabilized finite element method in space and a fully implicit backward difference scheme in time. To solve the large, sparse system of nonlinear algebraic equations at each time step, we propose an inexact Newton–Krylov method together with a multilevel, smoothed Schwarz preconditioner with isogeometric coarse meshes generated by a geometry-preserving coarsening algorithm. Here “geometry” includes the boundary of the computational domain and the wet interface between the fluid and the solid. We show numerically that the proposed algorithm and implementation are highly scalable in terms of the number of linear and nonlinear iterations and the total compute time on a supercomputer with more than 10,000 processor cores for several problems with hundreds of millions of unknowns.
In domain decomposition methods, coarse spaces are traditionally added to make the method scalable. Coarse spaces can however do much more: they can act on other error components that the subdomain iteration has difficulties with, and thus accelerate the overall solution process. We identify here the optimal coarse space for RAS, where optimal does not refer to scalable, but to best possible. This coarse space leads to convergence of the subdomain iterative method in two steps. Since this coarse space is very rich, we propose an approximation which turns out to be also very effective for multiscale problems.
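The subdomain iteration that a coarse space accelerates can be made concrete with a one-level restricted additive Schwarz (RAS) stationary iteration on a 1D Laplacian. This is a toy sketch, not the construction from the abstract: two overlapping subdomains, exact local solves, and a restricted prolongation that lets each point be updated by exactly one subdomain. The coarse space discussed above would target the slowly decaying global error components this one-level method struggles with.

```python
import numpy as np

n = 40
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)  # 1D Laplacian (SPD M-matrix)
b = np.ones(n)

# Two overlapping subdomains; each point is "owned" by exactly one of them,
# which is what makes the prolongation "restricted" (no double counting).
dom = [np.arange(0, n // 2 + 4), np.arange(n // 2 - 4, n)]
own = [np.arange(0, n // 2), np.arange(n // 2, n)]

x = np.zeros(n)
res = [np.linalg.norm(b - A @ x)]
for _ in range(400):
    r = b - A @ x
    dx = np.zeros(n)
    for d, o in zip(dom, own):
        loc = np.linalg.solve(A[np.ix_(d, d)], r[d])  # exact local subdomain solve
        mask = np.isin(d, o)                          # keep only owned components
        dx[d[mask]] += loc[mask]
    x += dx                                           # additive correction
    res.append(np.linalg.norm(b - A @ x))

assert res[-1] < 1e-3 * res[0]  # converges, but slowly without a coarse space
```

Without a coarse space the convergence rate degrades as the number of subdomains grows; the optimal coarse space identified in the paper would instead give convergence in two steps.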
In recent years, a variety of preconditioners have been proposed for use in solving large sparse linear least-squares problems. These include simple diagonal preconditioning, preconditioners based on incomplete factorizations, and stationary inner iterations used with Krylov subspace methods. In this study, we briefly review preconditioners for which software has been made available, then present a numerical evaluation of them using performance profiles and a large set of problems arising from practical applications. Comparisons are made with state-of-the-art sparse direct methods.
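The simplest of the preconditioner classes surveyed above, diagonal (Jacobi) preconditioning of the normal equations, is easy to demonstrate. The sketch below, which assumes SciPy's `cg` solver rather than any of the surveyed software packages, builds a least-squares matrix with badly scaled columns and compares conjugate gradient iteration counts on the normal equations with and without the diagonal preconditioner.

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import cg

rng = np.random.default_rng(1)
m, n = 500, 100
A = rng.standard_normal((m, n)) * np.logspace(0, 4, n)  # badly scaled columns
b = rng.standard_normal(m)

C = A.T @ A        # normal equations matrix (SPD)
rhs = A.T @ b

def cg_iterations(M=None):
    count = [0]
    def cb(xk):
        count[0] += 1
    cg(C, rhs, M=M, maxiter=5000, callback=cb)
    return count[0]

# Jacobi preconditioner: approximate C^{-1} by the inverse of its diagonal,
# which here undoes the bad column scaling of A.
M = diags(1.0 / C.diagonal())

plain = cg_iterations()
jacobi = cg_iterations(M)
assert jacobi < plain  # diagonal scaling sharply reduces the iteration count
```

Diagonal preconditioning costs almost nothing to set up, which is why such surveys use it as the baseline against which incomplete factorizations and inner-iteration preconditioners are judged.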
To face the advent of multicore processors and the ever-increasing complexity of hardware architectures, programming models based on DAG parallelism have regained popularity in the high-performance scientific computing community. Modern runtime systems offer a programming interface that complies with this paradigm and powerful engines for scheduling the tasks into which an application is decomposed. These tools have already proved their effectiveness on a number of dense linear algebra applications. This paper evaluates the usability and effectiveness of runtime systems based on the Sequential Task Flow model for complex applications, namely sparse multifrontal matrix factorizations, which feature extremely irregular workloads, with tasks of different granularities and characteristics and with variable memory consumption. Most importantly, it shows how this parallel programming model eases the development of complex features that benefit the performance of sparse direct solvers as well as their memory consumption. We illustrate our discussion with the multifrontal QR factorization running on top of the StarPU runtime system.
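The Sequential Task Flow idea described above, tasks submitted in program order with the runtime inferring dependencies from the data each task accesses, can be sketched in a few lines. This toy scheduler is an illustration under stated assumptions, not StarPU's API: the class `STF`, its `submit` method, and the `reads`/`writes` handles are all hypothetical names, and only write-after-write and read-after-write hazards on the last writer are tracked.

```python
from concurrent.futures import ThreadPoolExecutor

class STF:
    """Toy sequential-task-flow runtime: tasks are declared in program
    order; dependencies are inferred from the data handles they touch."""
    def __init__(self):
        self.pool = ThreadPoolExecutor(max_workers=4)
        self.last_writer = {}  # data handle -> future of the last task writing it

    def submit(self, fn, reads=(), writes=()):
        # A task depends on the last writer of every handle it reads or writes.
        deps = [self.last_writer[h] for h in set(reads) | set(writes)
                if h in self.last_writer]
        def task():
            for d in deps:
                d.result()  # wait for all predecessors to finish
            return fn()
        fut = self.pool.submit(task)
        for h in writes:
            self.last_writer[h] = fut
        return fut

# Submit three factorization-style tasks in sequential order; the inferred
# DAG forces them to execute in dependency order even on multiple threads.
log = []
stf = STF()
stf.submit(lambda: log.append("factor A11"), writes=("A11",))
stf.submit(lambda: log.append("update A21"), reads=("A11",), writes=("A21",))
last = stf.submit(lambda: log.append("factor A22"), reads=("A21",), writes=("A22",))
last.result()
assert log == ["factor A11", "update A21", "factor A22"]
```

The appeal for irregular workloads such as multifrontal factorization is exactly this separation: the application states tasks and data accesses sequentially, while the runtime handles scheduling, granularity, and memory-aware execution.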