Tight Memory-Independent Parallel Matrix Multiplication Communication
Lower Bounds
HUSSAM AL DAAS, Rutherford Appleton Laboratory, UK
GREY BALLARD, Wake Forest University, USA
LAURA GRIGORI, Inria Paris, France
SURAJ KUMAR, Inria Paris, France
KATHRYN ROUSE, Inmar Intelligence, USA
Communication lower bounds have long been established for matrix multiplication algorithms. However, most methods of asymptotic
analysis have either ignored the constant factors or not obtained the tightest possible values. Recent work has demonstrated that more
careful analysis improves the best known constants for some classical matrix multiplication lower bounds and helps to identify more
efficient algorithms that match the leading-order terms in the lower bounds exactly and improve practical performance. The main
result of this work is the establishment of memory-independent communication lower bounds with tight constants for parallel matrix
multiplication. Our constants improve on previous work in each of three cases that depend on the relative sizes of the aspect ratios
of the matrices.
1 INTRODUCTION
The cost of communication relative to computation continues to grow, so the time complexity of an algorithm must
account for both the computation it performs and the data that it communicates. Communication lower bounds for
computations set targets for efficient algorithms and spur algorithmic development. Matrix multiplication is one of
the most fundamental computations, and its I/O complexity on sequential machines and parallel communication costs
have been well studied over decades [2, 11, 13, 14, 21].
The earliest results established asymptotic lower bounds, ignoring constant factors and lower order terms. Because
of the ubiquity of matrix multiplication in high performance computations, more recent attempts have tightened the
analysis to obtain the best constant factors [17, 20, 22]. These improvements in the lower bound also helped identify the
best performing sequential and parallel algorithms that can be further tuned for high performance in settings where
even small constant factors make significant differences. We review these advances and other related work in § 2.
The main result of this paper is the establishment of tight constants for memory-independent communication lower
bounds for parallel classical matrix multiplication. In the context of a distributed-memory parallel machine model (see
§ 3.1), these bounds apply even when the local memory is infinite, and they are the tightest bounds in many cases when
the memory is limited. Demmel et al. [11] prove asymptotic bounds for general rectangular matrix multiplication and
show that three different bounds are asymptotically tight in separate cases that depend on the relative sizes of matrix
dimensions and the number of processors. Our main result, Theorem 1 in § 4, reproduces those asymptotic bounds
and improves the constants in all three cases. Further, in § 5, we analyze a known algorithm to show that it attains the
lower bounds exactly, proving that the constants we obtain are tight. We present a comparison to previous work in
Tab. 1 and discuss it in detail in § 6.
We believe one of the main features of our lower bound result is the simplicity of the proof technique, which
makes a unified argument that applies to all three cases. The key idea is to cast the lower bound as the solution to
Authors’ addresses: Hussam Al Daas, Rutherford Appleton Laboratory, Didcot, Oxfordshire, UK, hussam.al-daas@stfc.ac.uk; Grey Ballard, Wake Forest University, Winston-Salem, NC, USA, ballard@wfu.edu; Laura Grigori, Inria Paris, Paris, France, laura.grigori@inria.fr; Suraj Kumar, Inria Paris, Paris, France, suraj.kumar@inria.fr; Kathryn Rouse, Inmar Intelligence, Winston-Salem, NC, USA, kathryn.rouse@inmar.com.
a constrained optimization problem (see Lemma 5) whose objective function is the sum of variables that correspond
to the amount of data of each matrix required by a single processor’s computation. The constraints include the well-
known Loomis-Whitney inequality [18] as well as new lower bounds on individual array access (see Lemma 4) that are
necessary to establish separate results for the three cases. All of the complexity of the three cases, including establishing
the thresholds between cases and the leading terms in each case, is confined to a single optimization problem. We
use fundamental results from convex optimization (reviewed in § 3.2) to solve the problem analytically. This unified
argument is elegant, it improves previous results to obtain tight constants, and it can be applied more generally to
other computations that have iteration spaces with uneven dimensions.
2 RELATED WORK
2.1 Memory-Dependent Bounds for Matrix Multiplication
The first communication lower bound for matrix multiplication was established by Hong and Kung [13], who obtain the result, using computation directed acyclic graph (CDAG) analysis, that multiplication of square n × n matrices on a machine with cache size M requires Ω(n^3/M^{1/2}) words of communication. Irony, Toledo, and Tiskin [14] reproduce the result using a geometric proof based on the Loomis-Whitney inequality (Lemma 1), show it applies to general rectangular matrices (replacing n^3 with n1 n2 n3), and obtain an explicit constant of (1/2)^{3/2} ≈ .35. They also observe that the result is easily extended to the distributed memory parallel computing model (as described in § 3.1) under mild assumptions by dividing the bound by the number of processors P. We refer to such bounds as “memory-dependent,” following [3], where the cache size M is interpreted as the size of each processor's local memory. Later, Dongarra et al. [12] tightened the constant for sequential and parallel memory-dependent bounds to (3/2)^{3/2} ≈ 1.84. More recently, Smith et al. [22] prove the constant of 2 and show that it is obtained by an optimal sequential algorithm and is therefore tight. Both of these results are proved using the Loomis-Whitney inequality. Kwasniewski et al. [17] use CDAG analysis to obtain the same constant of 2 and show that it is tight in the parallel case (when memory is limited) by providing an optimal algorithm.
2.2 Bounds for Other Computations
Hong and Kung’s proof technique can be applied to a more general set of computations, including the FFT [13]. Ballard
et al. [4] use the proof technique of [14] to generalize the lower bound (with the same explicit constant) to other linear
algebra computations such as LU, Cholesky, and QR factorizations. The constants of the lower bounds for these and
other computations are tightened by Olivry et al. [20], including reproducing the constant of 2 for matrix multiplication.
Kwasniewski et al. [16] also obtain tighter constants for LU and Cholesky factorizations using CDAG analysis. Christ et
al. [10] show that a generalization of the Loomis-Whitney inequality can be used to prove communication lower bounds
for a much wider set of computations, but the asymptotic bounds do not include explicit constants. This approach is
applied to a tensor computation by Ballard and Rouse [5, 6].
2.3 Memory-Independent Bounds for Parallel Matrix Multiplication
This section describes the related work that focuses on the topic of this paper. Aggarwal, Chandra, and Snir [2] extend
Hong and Kung’s result for matrix multiplication to the LPRAM parallel model, which closely resembles the model
we consider with the exception that there exists a global shared memory where the inputs initially reside and where
the output must be stored at the end of the computation.

                  1 ≤ P ≤ m/n       m/n ≤ P ≤ mn/k^2       mn/k^2 ≤ P
Leading term      nk                (mnk^2/P)^{1/2}        (mnk/P)^{2/3}
[2]               -                 -                      1/2^{2/3} ≈ .63
[14]              -                 -                      1/2 = .5
[11]              16/25 = .64       (2/3)^{1/2} ≈ .82      1
Theorem 1         1                 2                      3

Table 1. Summary of explicit constants of the leading term of parallel memory-independent rectangular matrix multiplication communication lower bounds for multiplication dimensions m ≥ n ≥ k and P processors.

Communication bounds in the related BSP parallel model are
also memory independent, and Scquizzato and Silvestri [21] establish the same asymptotic lower bounds for matrix multiplication in that model. In addition to proving bounds for sequential matrix multiplication and the associated memory-dependent bound for parallel matrix multiplication, Irony, Toledo, and Tiskin [14] also prove that a parallel algorithm must communicate Ω(n^2/P^{2/3}) words, and they provide explicit constants in their analysis. Note that the size of the local memory M does not appear in this bound. Ballard et al. [3] reproduce this result for classical matrix multiplication as well as prove similar results for fast matrix multiplication. They distinguish between memory-dependent bounds (results described in § 2.1) and memory-independent bounds for parallel algorithms, and they show how the two bounds relate and affect strong scaling behavior. In particular, when M ≳ n^2/P^{2/3} (or equivalently P ≳ n^3/M^{3/2}), the memory-dependent bound is unattainable because the memory-independent bound is larger. Demmel et al. [11] extend the memory-independent results to the rectangular case (multiplying matrices of dimensions n1 × n2 and n2 × n3), showing that three different bounds apply that depend on the relative sizes of the three dimensions and the number of processors, and their proof provides explicit constants. For one of the cases, and for a restricted class of parallelizations, Kwasniewski et al. [17] prove a tighter constant and show that it is attainable by an optimal algorithm.
We summarize the constants obtained by these previous works and compare them to our results in Tab. 1. Further details of the comparison are given in § 6.1. Following Theorem 1, the table assumes m = max{n1, n2, n3}, n = median{n1, n2, n3}, and k = min{n1, n2, n3}.
2.4 Communication-Optimal Parallel Matrix Multiplication Algorithms
Both theoretical and practical algorithms that attain the communication lower bounds have been proposed for various
computation models and implemented on many different types of parallel systems. The idea of “3D algorithms” for
matrix multiplication was developed soon after communication lower bounds were established; see [1, 2, 7, 15] for a few
examples. These algorithms partition the 3D iteration space of matrix multiplication in each of the three dimensions and
assign subblocks across a 3D logical grid of processors. McColl and Tiskin [19] and Demmel et al. [11] present recursive
algorithms that effectively achieve a similar 3D logical processor grid for square and general rectangular problems,
respectively. High-performance implementations of these algorithms on today’s supercomputers demonstrate that
these algorithms are indeed practical and outperform standard library implementations [17, 23].
3 PRELIMINARIES
3.1 Parallel Computation Model
We consider the α-β-γ parallel machine model [9, 24]. In this model, each of P processors has its own local memory of size M and can compute only with data in its local memory. The processors can communicate data to and from other processors via messages that are sent over a fully connected network (i.e., each pair of processors has a dedicated link so that there is no contention on the network). Further, we assume the links are bidirectional so that a pair of processors can exchange data with no contention. Each processor can send and receive at most one message at the same time. The cost of communication is a function of two parameters α and β, where α is the per-message latency cost and β is the per-word bandwidth cost. A message of w words sent from one processor to another costs α + βw. The parameter γ is the cost to perform a single arithmetic operation. For dense matrix multiplication when sufficiently large local memory is available, bandwidth cost nearly always dominates latency cost. Hence, we focus on the bandwidth cost in this work. In our model, the communication cost of an algorithm is counted along the critical path of the algorithm so that if two pairs of processors are communicating messages simultaneously, the communication cost is that of the largest message. In this work, we focus on memory-independent analysis, so the local memory size M can be assumed to be infinite. We consider limited-memory scenarios in § 6.2.
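To make the model concrete, the following minimal Python sketch (ours, not part of the original text; all helper names are illustrative) tallies costs under the α-β-γ model, counting only the largest of any set of simultaneous messages, as described above.

# Hypothetical helpers illustrating the alpha-beta-gamma cost model described above.
# A message of w words costs alpha + beta*w; gamma is the cost per arithmetic operation.

def message_cost(w, alpha, beta):
    """Cost of sending a single message of w words between two processors."""
    return alpha + beta * w

def bandwidth_cost(steps, beta):
    """Bandwidth cost along the critical path: in each step, several processor pairs
    may communicate simultaneously, so only the largest message in the step counts."""
    return beta * sum(max(messages) for messages in steps)

# Example: two communication steps with simultaneous messages of different sizes.
steps = [(100, 80, 100), (50, 50)]
print(bandwidth_cost(steps, beta=1.0))  # 100 + 50 = 150 words along the critical path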
3.2 Fundamental Results
In this section we collect the fundamental existing results we use to prove our main result, Theorem 1. The first
lemma is a geometric inequality that has been used before in establishing communication lower bounds for matrix
multiplication [4, 11, 14]. We use it to relate the computation performed by a processor to the data it must access.
Lemma 1 (Loomis-Whitney [18]). Let V be a finite set of lattice points in R^3, i.e., points (i, j, k) with integer coordinates. Let φ_i(V) be the projection of V in the i-direction, i.e., all points (j, k) such that there exists an i so that (i, j, k) ∈ V. Define φ_j(V) and φ_k(V) similarly. Then

    |V|^2 ≤ |φ_i(V)| · |φ_j(V)| · |φ_k(V)|,

where | · | denotes the cardinality of a set.
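As a quick illustration (ours, not from the paper), the inequality can be checked by brute force on small point sets; the sketch below samples random subsets of a 4 × 4 × 4 lattice cube and verifies |V|^2 ≤ |φ_i(V)| · |φ_j(V)| · |φ_k(V)|.

# Brute-force check of the Loomis-Whitney inequality on random sets of lattice points.
import itertools
import random

def projections(V):
    phi_i = {(j, k) for (i, j, k) in V}   # project out the i coordinate
    phi_j = {(i, k) for (i, j, k) in V}   # project out the j coordinate
    phi_k = {(i, j) for (i, j, k) in V}   # project out the k coordinate
    return phi_i, phi_j, phi_k

random.seed(0)
cube = list(itertools.product(range(4), repeat=3))
for _ in range(1000):
    V = set(random.sample(cube, random.randint(1, len(cube))))
    phi_i, phi_j, phi_k = projections(V)
    assert len(V) ** 2 <= len(phi_i) * len(phi_j) * len(phi_k)
print("Loomis-Whitney held on all sampled sets")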
The next set of definitions and lemmas allow us to solve the key constrained optimization problem (Lemma 5)
analytically. We first remind the reader of the definitions of differentiable convex and quasiconvex functions and of
the Karush-Kuhn-Tucker (KKT) conditions. Here and throughout, we use boldface to indicate vectors and matrices and
subscripts to index them, so that x_i is the ith element of the vector x, for example.
Definition 1 ([8, eq. (3.2)]). A differentiable function f : R^d → R is convex if its domain is a convex set and for all x, y ∈ dom f,

    f(y) ≥ f(x) + ⟨∇f(x), y − x⟩.

Definition 2 ([8, eq. (3.20)]). A differentiable function g : R^d → R is quasiconvex if its domain is a convex set and for all x, y ∈ dom g,

    g(y) ≤ g(x) implies that ⟨∇g(x), y − x⟩ ≤ 0.

Definition 3 ([8, eq. (5.49)]). Consider an optimization problem of the form

    min_x f(x) subject to g(x) ≤ 0,    (1)

where f : R^d → R and g : R^d → R^c are both differentiable. Define the dual variables μ ∈ R^c, and let J_g be the Jacobian of g. The Karush-Kuhn-Tucker (KKT) conditions of (x*, μ*) are as follows:

    Primal feasibility: g(x*) ≤ 0;
    Dual feasibility: μ* ≥ 0;
    Stationarity: ∇f(x*) + μ* · J_g(x*) = 0;
    Complementary slackness: μi* g_i(x*) = 0 for all i ∈ {1, . . . , c}.
The next two results establish that our key optimization problem in Lemma 5 can be solved analytically using the
KKT conditions. While the results are not novel, we provide proofs for completeness.
Lemma 2 ([6, Lemma 2.2]). The function g_0(x) = L − x1 x2 x3, for some constant L, is quasiconvex in the positive octant.

Proof. Let x, y be points in the positive octant with g_0(y) ≤ g_0(x). Then y1 y2 y3 ≥ x1 x2 x3. Applying the inequality of arithmetic and geometric means (AM-GM inequality) to the values y1/x1, y2/x2, y3/x3 (which are all positive), we have

    (1/3) (y1/x1 + y2/x2 + y3/x3) ≥ (y1 y2 y3 / (x1 x2 x3))^{1/3} ≥ 1.    (2)

Then ∇g_0(x) = [−x2 x3  −x1 x3  −x1 x2], and

    ⟨∇g_0(x), y − x⟩ = 3 x1 x2 x3 − y1 x2 x3 − x1 y2 x3 − x1 x2 y3
                     = 3 x1 x2 x3 (1 − (1/3)(y1/x1 + y2/x2 + y3/x3))
                     ≤ 0,

where the last inequality follows from eq. (2). Then by Def. 2, g_0 is quasiconvex on the positive octant.
Lemma 3. Consider an optimization problem of the form given in eq. (1). If f is a convex function and each g_i is a quasiconvex function, then the KKT conditions are sufficient for optimality.

Proof. Suppose x* and μ* satisfy the KKT conditions given in Def. 3. If μ* = 0, then by stationarity, ∇f(x*) = 0. Then the convexity of f (Def. 1) implies

    f(x) ≥ f(x*) + ⟨∇f(x*), x − x*⟩ = f(x*)

for all x ∈ dom f, which implies that x* is a global optimum.

Now suppose μ* ≠ 0; then without loss of generality (and by dual feasibility) there exists m ≤ c such that μi* > 0 for 1 ≤ i ≤ m and μi* = 0 for m < i ≤ c. Complementary slackness implies that g_i(x*) = 0 for 1 ≤ i ≤ m. Consider any (primal) feasible x ∈ dom f. Then g_i(x) ≤ 0 for all i, and thus g_i(x) ≤ g_i(x*) for 1 ≤ i ≤ m. By quasiconvexity of g_i (Def. 2), this implies

    ⟨∇g_i(x*), x − x*⟩ ≤ 0.

Stationarity implies that ∇f(x*) = −∑_{i=1}^{m} μi* ∇g_i(x*), and thus

    ⟨∇f(x*), x − x*⟩ = −∑_{i=1}^{m} μi* ⟨∇g_i(x*), x − x*⟩ ≥ 0.

By convexity of f (Def. 1), we therefore have

    f(x) ≥ f(x*) + ⟨∇f(x*), x − x*⟩ ≥ f(x*),

and thus x* is a global optimum.
4 MAIN LOWER BOUND RESULT
4.1 Lower Bounds on Individual Array Access
The following lemma establishes lower bounds on the number of elements of each individual matrix a processor must
access based on the number of computations a given element is involved with. This result is used to establish a set of
constraints in the key optimization problem used in the proof of Theorem 1.
Lemma 4. Given a parallel matrix multiplication algorithm that multiplies an n1 × n2 matrix A by an n2 × n3 matrix B using P processors, any processor that performs at least 1/P-th of the scalar multiplications must access at least n1 n2/P elements of A and at least n2 n3/P elements of B and also compute contributions to at least n1 n3/P elements of C = A · B.

Proof. The total number of scalar multiplications that must be computed is n1 n2 n3. Consider a processor that computes at least 1/P-th of these computations. Each element of A is involved in n3 multiplications. If the processor accesses fewer than n1 n2/P elements of A, then it would perform fewer than n3 · n1 n2/P scalar multiplications, which is a contradiction. Likewise, each element of B is involved in n1 multiplications. If the processor accesses fewer than n2 n3/P elements of B, then it would perform fewer than n1 · n2 n3/P scalar multiplications, which is a contradiction. Finally, each element of C is the sum of n2 scalar multiplications. If the processor computes contributions to fewer than n1 n3/P elements of C, then it would perform fewer than n2 · n1 n3/P scalar multiplications, which is again a contradiction.
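The counting argument above can be sanity-checked by brute force on a tiny instance; the sketch below (ours, with arbitrary illustrative dimensions) draws random work assignments containing at least a 1/P fraction of the multiplications and verifies the three projection bounds.

# Sanity check of Lemma 4: any set F with |F| >= n1*n2*n3/P projects onto at least
# n1*n2/P entries of A, n2*n3/P entries of B, and n1*n3/P entries of C.
import itertools
import random

n1, n2, n3, P = 4, 3, 5, 6
iteration_space = list(itertools.product(range(n1), range(n2), range(n3)))
min_work = -(-n1 * n2 * n3 // P)                      # ceil(n1*n2*n3 / P)
random.seed(1)
for _ in range(500):
    F = random.sample(iteration_space, random.randint(min_work, len(iteration_space)))
    A_proj = {(i1, i2) for (i1, i2, i3) in F}         # entries of A the processor accesses
    B_proj = {(i2, i3) for (i1, i2, i3) in F}         # entries of B the processor accesses
    C_proj = {(i1, i3) for (i1, i2, i3) in F}         # entries of C it contributes to
    assert len(A_proj) >= n1 * n2 / P
    assert len(B_proj) >= n2 * n3 / P
    assert len(C_proj) >= n1 * n3 / P
print("Lemma 4 bounds held on all sampled assignments")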
4.2 Key Optimization Problem
The following lemma is the crux of the proof of our main result (Theorem 1). We state the optimization problem
abstractly here, but it may be useful to have the following intuition: the variable vector x represents the sizes of the projections of the computation assigned to a single processor onto the three matrices, where x1 corresponds to the smallest matrix and x3 corresponds to the largest matrix. In order to design a communication-efficient algorithm, we wish to minimize the sum of the sizes of these projections subject to the constraints of matrix multiplication (and the processor performing 1/P-th of the computation), as specified by the Loomis-Whitney inequality (Lemma 1) and
Lemma 4. A more rigorous argument that any parallel matrix multiplication algorithm is subject to these constraints
is given in Theorem 1.
We are able to solve this optimization problem analytically using properties of convex optimization (Lemma 3). The
three cases of the solution correspond to how many of the individual variable constraints are tight. When none of
them is tight, we can minimize the sum of variables subject to the bound on their product by setting them all equal
to each other (Case 3). However, when the individual variable constraints make this solution infeasible, those become
active and the free variables must be adjusted (Cases 1 and 2).
Lemma 5. Consider the following optimization problem:

    min_{x ∈ R^3}  x1 + x2 + x3
    such that  (mnk/P)^2 ≤ x1 x2 x3,
               nk/P ≤ x1,
               mk/P ≤ x2,
               mn/P ≤ x3,

where m ≥ n ≥ k ≥ 1 and P ≥ 1. The optimal solution x* depends on the relative values of the constraints, yielding three cases:

    (1) if P ≤ m/n, then x1* = nk, x2* = mk/P, x3* = mn/P;
    (2) if m/n ≤ P ≤ mn/k^2, then x1* = x2* = (mnk^2/P)^{1/2}, x3* = mn/P;
    (3) if mn/k^2 ≤ P, then x1* = x2* = x3* = (mnk/P)^{2/3}.

This can be visualized on a number line of P with breakpoints at m/n and mn/k^2: below m/n, Case 1 applies; between m/n and mn/k^2, Case 2 applies; and above mn/k^2, Case 3 applies. [Diagram omitted.]
Proof. By Lemma 3, we can establish the optimality of the solution for each case by verifying that there exist dual
variables such that the KKT conditions specified in Def. 3 are satisfied. This optimization problem fits the assumptions
of Lemma 3 because the objective function and all but the first constraint are affine functions, which are convex and
quasiconvex, and the first constraint is quasiconvex on the positive octant (which contains the intersection of the affine
constraints) by Lemma 2.
To match standard notation (and that of Lemma 3), we let

    f(x) = x1 + x2 + x3

and

    g(x) = [ (mnk/P)^2 − x1 x2 x3,  nk/P − x1,  mk/P − x2,  mn/P − x3 ].

Thus the gradient of the objective function is ∇f(x) = [1 1 1] and the Jacobian of the constraint function is

    J_g(x) = [ −x2 x3  −x1 x3  −x1 x2 ]
             [   −1       0       0   ]
             [    0      −1       0   ]
             [    0       0      −1   ].

Case 1 (P ≤ m/n). We let

    x* = [ nk,  mk/P,  mn/P ]

and

    μ* = [ P^2/(m^2 n k),  0,  1 − Pn/m,  1 − Pk/m ]

and verify the KKT conditions. Primal feasibility is immediate, and dual feasibility follows from P ≤ m/n ≤ m/k, which holds by the condition of this case and the assumption n ≥ k. Stationarity follows from direct verification that

    −μ* · J_g(x*) = [1 1 1].

Complementary slackness is satisfied because the only nonzero dual variables are μ1*, μ3*, and μ4*, and the 1st, 3rd, and 4th constraints are tight.

Case 2 (m/n ≤ P ≤ mn/k^2). We let

    x* = [ (mnk^2/P)^{1/2},  (mnk^2/P)^{1/2},  mn/P ]

and

    μ* = [ (P/(mnk^{2/3}))^{3/2},  0,  0,  1 − (Pk^2/(mn))^{1/2} ]

and verify the KKT conditions. The primal feasibility of x1 = x2 is satisfied because

    nk/P ≤ mk/P ≤ (mnk^2/P)^{1/2},

where the first inequality follows from the assumption m ≥ n and the second inequality follows from m/n ≤ P (one condition of this case). The other constraints are clearly satisfied. Dual feasibility requires that 1 − (Pk^2/(mn))^{1/2} ≥ 0, which is satisfied because P ≤ mn/k^2 (the other condition of this case). Stationarity can be directly verified. Complementary slackness is satisfied because the 1st and 4th constraints are both tight for x*, corresponding to the only nonzeros in μ*.

Case 3 (mn/k^2 ≤ P). We let

    x* = [ (mnk/P)^{2/3},  (mnk/P)^{2/3},  (mnk/P)^{2/3} ]

and

    μ* = [ (P/(mnk))^{4/3},  0,  0,  0 ]

and verify the KKT conditions. We first consider the primal feasibility conditions. We have

    nk/P ≤ mk/P ≤ mn/P ≤ (mnk/P)^{2/3},

where the first two inequalities are implied by the assumption m ≥ n ≥ k and the last follows from mn/k^2 ≤ P, the condition of this case. Dual feasibility is immediate, and stationarity is directly verified. Complementary slackness is satisfied because the 1st constraint is tight for x* and μ1* is the only nonzero.
Note that the optimal solutions coincide at boundary points between cases so that the values are continuous as 𝑃
varies.
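As a numerical cross-check (ours, not the authors'), the closed-form solution of Lemma 5 can be compared against a general-purpose solver; the sketch below feeds the constraints, written in the g(x) ≤ 0 form of Def. 3, to scipy.optimize.minimize and prints the objective values side by side.

# Numerical check of Lemma 5: compare the closed-form optimum against a generic solver.
import numpy as np
from scipy.optimize import minimize

def closed_form(m, n, k, P):
    if P <= m / n:                                        # Case 1
        return np.array([n * k, m * k / P, m * n / P])
    if P <= m * n / k**2:                                 # Case 2
        return np.array([np.sqrt(m * n * k**2 / P)] * 2 + [m * n / P])
    return np.full(3, (m * n * k / P) ** (2 / 3))         # Case 3

def numerical(m, n, k, P):
    cons = [{'type': 'ineq', 'fun': lambda x: x[0] * x[1] * x[2] - (m * n * k / P) ** 2},
            {'type': 'ineq', 'fun': lambda x: x[0] - n * k / P},
            {'type': 'ineq', 'fun': lambda x: x[1] - m * k / P},
            {'type': 'ineq', 'fun': lambda x: x[2] - m * n / P}]
    x0 = np.array([n * k, m * k, m * n], dtype=float)     # a feasible starting point
    return minimize(lambda x: x.sum(), x0, method='SLSQP', constraints=cons).x

for P in (3, 36, 512):   # one value of P per case when (m, n, k) = (96, 24, 6)
    m, n, k = 96, 24, 6
    print(P, closed_form(m, n, k, P).sum(), round(numerical(m, n, k, P).sum(), 2))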
4.3 Communication Lower Bound
We now state our main result, memory-independent communication lower bounds for general matrix multiplication
with tight constants. After the general result, we also present a corollary for square matrix multiplication. The tightness
of the constants in the lower bound is proved in § 5.
Theorem 1. Consider a classical matrix multiplication computation involving matrices of size n1 × n2 and n2 × n3. Let m = max{n1, n2, n3}, n = median{n1, n2, n3}, and k = min{n1, n2, n3}, so that m ≥ n ≥ k. Any parallel algorithm using P processors that starts with one copy of the two input matrices and ends with one copy of the output matrix and load balances either the computation or the data must communicate at least

    D − (mn + mk + nk)/P  words of data,

where

    D = (mn + mk)/P + nk            if 1 ≤ P ≤ m/n,
        2 (mnk^2/P)^{1/2} + mn/P    if m/n ≤ P ≤ mn/k^2,
        3 (mnk/P)^{2/3}             if mn/k^2 ≤ P.
Proof. To establish the lower bound, we focus on a single processor. If the algorithm load balances the computation, then every processor performs mnk/P scalar multiplications, and there exists some processor whose input data at the start of the algorithm plus output data at the end of the algorithm must be at most (mn + mk + nk)/P words of data (otherwise the algorithm would either start with more than one copy of the input matrices or end with more than one copy of the output matrix). If the algorithm load balances the data, then every processor starts and ends with a total of (mn + mk + nk)/P words, and some processor must perform at least mnk/P scalar multiplications (otherwise fewer than mnk multiplications are performed). In either case, there exists a processor that performs at least mnk/P multiplications and has access to at most (mn + mk + nk)/P data.

Let F be the set of multiplications assigned to this processor, so that |F| ≥ mnk/P. Each element of F can be indexed by three indices (i1, i2, i3) and corresponds to the multiplication of A(i1, i2) with B(i2, i3), which contributes to the result C(i1, i3). Let φ_A(F) be the projection of the set F onto the matrix A, so that φ_A(F) are the entries of A required for the processor to perform the scalar multiplications in F. Here, elements of φ_A(F) can be indexed by two indices: φ_A(F) = {(i1, i2) : ∃ i3 s.t. (i1, i2, i3) ∈ F}. Define φ_B(F) and φ_C(F) similarly. The processor must access all of the elements in φ_A(F), φ_B(F), and φ_C(F) in order to perform all the scalar multiplications in F. Because the processor starts and ends with at most (mn + mk + nk)/P data, the communication performed by the processor is at least

    |φ_A(F)| + |φ_B(F)| + |φ_C(F)| − (mn + mk + nk)/P,

which is a lower bound on the communication along the critical path of the algorithm.

In order to lower bound |φ_A(F)| + |φ_B(F)| + |φ_C(F)|, we form a constrained minimization problem with this expression as the objective function and constraints derived from Lemmas 1 and 4. The Loomis-Whitney inequality (Lemma 1) implies that

    |φ_A(F)| · |φ_B(F)| · |φ_C(F)| ≥ |F|^2 ≥ (n1 n2 n3/P)^2 = (mnk/P)^2,

and the lower bound on the projections from Lemma 4 means

    |φ_A(F)| ≥ n1 n2/P,  |φ_B(F)| ≥ n2 n3/P,  |φ_C(F)| ≥ n1 n3/P.

For any algorithm, the processor's projections must satisfy these constraints, so the sum of their sizes must be at least the minimum value of the optimization problem. Then by Lemma 5 (and assigning the projections to x1, x2, x3 appropriately based on the relative sizes of n1, n2, n3), the result follows.
We also state the result for square matrix multiplication, which is a direct application of Theorem 1 with n1 = n2 = n3.

Corollary 2. Consider a classical matrix multiplication computation involving two matrices of size n × n. Any parallel algorithm using P processors that starts with one copy of the input data and ends with one copy of the output data and load balances either the computation or the data must communicate at least

    3 n^2/P^{2/3} − 3 n^2/P  words of data.
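For reference, the bound is straightforward to evaluate; the helper below (an illustrative sketch, not from the paper) computes D for the appropriate case and returns the resulting lower bound on the number of words that must be communicated.

# Evaluate the memory-independent lower bound of Theorem 1.
def matmul_lower_bound(n1, n2, n3, P):
    m, n, k = sorted((n1, n2, n3), reverse=True)     # m >= n >= k
    if P <= m / n:                                   # Case 1: 1D regime
        D = (m * n + m * k) / P + n * k
    elif P <= m * n / k**2:                          # Case 2: 2D regime
        D = 2 * (m * n * k**2 / P) ** 0.5 + m * n / P
    else:                                            # Case 3: 3D regime
        D = 3 * (m * n * k / P) ** (2 / 3)
    return D - (m * n + m * k + n * k) / P           # words of data per Theorem 1

# Square case (Corollary 2): n1 = n2 = n3 = 1000 and P = 64 gives
# 3*n^2/P^(2/3) - 3*n^2/P = 187500 - 46875 = 140625 words.
print(round(matmul_lower_bound(1000, 1000, 1000, 64)))   # 140625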
5 OPTIMAL PARALLEL ALGORITHM
In this section we present an optimal parallel algorithm (Alg. 1) to show that the lower bound (including the constants)
is tight. The idea is to organize the processors into a 3D processor grid and assign the computation of the matrix
multiplication (a 3D iteration space) to processors according to their location in the grid. The algorithm is not new,
but we present it here in full detail for completeness and to provide the complete analysis, which has not appeared
before. In particular, Alg. 1 is nearly identical to the one proposed by Aggarwal et al. [1], though they use the LPRAM
model and analyze only the case where P is large. In the LPRAM model, for example, processors can read concurrently from a global shared memory, while in the α-β-γ model, the data is distributed across local memories and is shared via
collectives like All-Gathers. Demmel et al. [11] present and analyze their recursive algorithm to show its asymptotic
optimality in all three cases, but they do not track constants. See § 2.4 for more discussion of previous work on optimal
parallel algorithms.
Consider the multiplication of an n1 × n2 matrix A with an n2 × n3 matrix B, and let C = A · B. Algorithm 1 organizes the P processors into a 3-dimensional p1 × p2 × p3 logical processor grid with p1 p2 p3 = P. Note that one or two of the processor grid dimensions may be equal to 1, which simplifies to a 2- or 1-dimensional grid. A processor coordinate is represented as (p'1, p'2, p'3), where 1 ≤ p'_k ≤ p_k for k = 1, 2, 3.
The basic idea of the algorithm is to perform two collective operations, All-Gathers, so that each processor receives
the input data it needs to perform all of its computation (in an All-Gather, all the processors involved end up with the
union of the input data that starts on each processor). The result of each local computation must be summed with all
other contributions to the same output matrix entries from other processors, and the summations are performed via a
Reduce-Scatter collective operation (in a Reduce-Scatter, the sum of the input data from all processors is computed so
that it ends up evenly distributed across processors).
Algorithm 1 Comm-Optimal Parallel Matrix Multiplication
Input: A is n1 × n2, B is n2 × n3, p1 × p2 × p3 logical processor grid
Output: C = A · B is n1 × n3
1: (p'1, p'2, p'3) is my processor ID
2: // Gather input matrix data
3: A_{p'1 p'2} = All-Gather(A_{p'1 p'2 p'3}, (p'1, p'2, :))
4: B_{p'2 p'3} = All-Gather(B_{p'1 p'2 p'3}, (:, p'2, p'3))
5: // Perform local computation
6: D_{p'1 p'2 p'3} = A_{p'1 p'2} · B_{p'2 p'3}
7: // Sum results to compute C_{p'1 p'3}
8: C_{p'1 p'2 p'3} = Reduce-Scatter(D_{p'1 p'2 p'3}, (p'1, :, p'3))
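The listing below is a hypothetical mpi4py sketch of Alg. 1, not the authors' implementation. It assumes p1·p2·p3 equals the number of MPI ranks, that the usual divisibility conditions hold, and that each block A_{p'1 p'2}, B_{p'2 p'3}, and C_{p'1 p'3} is stored flattened and split into equal contiguous slices across its fiber of the grid; all function and variable names are our own.

# Hypothetical mpi4py sketch of Algorithm 1 on a p1 x p2 x p3 logical grid.
from mpi4py import MPI
import numpy as np

def comm_optimal_matmul(A_slice, B_slice, n1, n2, n3, p1, p2, p3):
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    # Processor coordinates (i, j, k) in the logical grid (0-based here).
    i, j, k = rank // (p2 * p3), (rank // p3) % p2, rank % p3
    b1, b2, b3 = n1 // p1, n2 // p2, n3 // p3        # block dimensions

    # Fibers of the grid: (i, j, :) shares A_ij, (:, j, k) shares B_jk, (i, :, k) sums C_ik.
    fiber_A = comm.Split(color=i * p2 + j, key=k)
    fiber_B = comm.Split(color=j * p3 + k, key=i)
    fiber_C = comm.Split(color=i * p3 + k, key=j)

    # Lines 3-4: All-Gather the input blocks this processor needs
    # (A_slice holds b1*b2/p3 entries, B_slice holds b2*b3/p1 entries).
    A_ij = np.empty(b1 * b2, dtype=np.float64)
    fiber_A.Allgather(A_slice, A_ij)
    B_jk = np.empty(b2 * b3, dtype=np.float64)
    fiber_B.Allgather(B_slice, B_jk)

    # Line 6: local computation D_ijk = A_ij * B_jk.
    D = A_ij.reshape(b1, b2) @ B_jk.reshape(b2, b3)

    # Line 8: Reduce-Scatter over j to obtain this processor's b1*b3/p2 entries of C_ik.
    C_slice = np.empty(b1 * b3 // p2, dtype=np.float64)
    fiber_C.Reduce_scatter_block(np.ascontiguousarray(D.ravel()), C_slice, op=MPI.SUM)
    return C_slice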
Fig. 1. Visualization of Alg. 1 with a 3 × 3 × 3 processor grid. The 3D iteration space is mapped onto the processor grid, and the matrices are mapped to the faces of the grid. The dark highlighting corresponds to the input data initially owned and the output data finally owned by processor (1, 3, 1), and the light highlighting signifies the data of other processors it uses to perform the local computation. The arrows show the sets of processors involved in the three collective operations involving processor (1, 3, 1).
Algorithm 1 imposes requirements on the initial distribution of the input matrices and the final distribution of the output. These conditions do not always hold in practice, but to show that the lower bound (which makes no assumption on data distribution except that only 1 copy of the input exists at the start of the computation) is tight, we allow the algorithm to specify its distributions. For simplicity of explanation, we assume that p1, p2, p3 evenly divide n1, n2, n3, respectively. We use the notation A_{p'1 p'2} to denote the submatrix of A such that

    A_{p'1 p'2} = A( (p'1 − 1) · n1/p1 + 1 : p'1 · n1/p1 , (p'2 − 1) · n2/p2 + 1 : p'2 · n2/p2 ),

and we define B_{p'2 p'3} and C_{p'1 p'3} similarly. The algorithm assumes that at the start of the computation, A_{p'1 p'2} is distributed evenly across processors (p'1, p'2, :) and B_{p'2 p'3} is distributed evenly across processors (:, p'2, p'3). We define A_{p'1 p'2 p'3} and B_{p'1 p'2 p'3} as the elements of the input matrices that processor (p'1, p'2, p'3) initially owns. At the end of the algorithm, C_{p'1 p'3} is distributed evenly across processors (p'1, :, p'3), and we let C_{p'1 p'2 p'3} be the elements owned by processor (p'1, p'2, p'3).
Figure 1 presents a visualization of Alg. 1. In this example, we have n1 = n2 = n3, and 27 processors are arranged in a 3 × 3 × 3 grid. We highlight the data and communication of a particular processor with ID (1, 3, 1). The dark highlighting corresponds to the input data initially owned by the processor (A131 and B131) as well as the output data owned by the processor at the end of the computation (C131). The figure shows each of these submatrices as block columns of the submatrices A13, B31, and C11, but any even distribution of these across the same set of processors suffices. The light highlighting of the submatrices A13, B31, and C11 corresponds to the data of other processors involved in the local computation on processor (1, 3, 1), and their size corresponds to the communication cost. The three collectives that involve processor (1, 3, 1) occur across three different fibers in the processor grid, as depicted by the arrows in the figure.
5.1 Cost Analysis
Now we analyze the computation and communication costs of the algorithm. Each processor performs (n1/p1) · (n2/p2) · (n3/p3) = n1 n2 n3/P local computations in Line 6. Communication occurs only in the All-Gather and Reduce-Scatter collectives in Lines 3, 4, and 8. Each processor is involved in two All-Gathers involving input matrices and one Reduce-Scatter involving the output matrix. Lines 3 and 4 specify simultaneous All-Gathers across sets of p3 and p1 processors, respectively, and Line 8 specifies simultaneous Reduce-Scatters across sets of p2 processors. Note that if p_k = 1 for any k = 1, 2, 3, then the corresponding collective can be ignored as no communication occurs. The difference between Alg. 1 and [1, Algorithm 1] is the Reduce-Scatter collective, which replaces the All-to-All collective and has smaller latency cost.
We assume that bandwidth-optimal algorithms, such as bidirectional exchange or recursive doubling/halving, are used for the All-Gather and Reduce-Scatter collectives. The optimal communication cost of these collectives on p processors is (1 − 1/p) w, where w is the number of words of data on each processor after the All-Gather or before the Reduce-Scatter collective [9, 24]. Each processor also performs (1 − 1/p) w computations for the Reduce-Scatter collective.
Fig. 2. Example parallelizations of the iteration space of multiplication of a 9600 × 2400 matrix A with a 2400 × 600 matrix B. (a) 1D case: P = 3 with grid 3 × 1 × 1. (b) 2D case: P = 36 with grid 12 × 3 × 1. (c) 3D case: P = 512 with grid 32 × 8 × 2.
Hence the communication costs of Lines 3 and 4 in Algorithm 1 are (1 − 1/p3) n1 n2/(p1 p2) and (1 − 1/p1) n2 n3/(p2 p3), respectively, to accomplish the All-Gather operations, and the communication cost of performing the Reduce-Scatter operation in Line 8 is (1 − 1/p2) n1 n3/(p1 p3). Note that if p_k = 1 for any k = 1, 2, 3, then the cost of the corresponding collective reduces to 0. Thus the overall cost of Algorithm 1 is

    n1 n2/(p1 p2) + n2 n3/(p2 p3) + n1 n3/(p1 p3) − (n1 n2 + n2 n3 + n1 n3)/P.    (3)

Due to the Reduce-Scatter operation, each processor also performs (1 − 1/p2) n1 n3/(p1 p3) computations, which is dominated by the n1 n2 n3/P computations of Line 6.
5.2 Optimal Processor Grid Selection
The communication cost of Algorithm 1, given by eq. (3), depends on the processor grid dimensions. Here we discuss
how to select the processor grid dimensions such that the lower bound on communication given in Theorem 1 is
attained by Alg. 1, given the matrix dimensions n1, n2, and n3 and the number of processors P. As before, we let m, n, k represent the maximum, median, and minimum values of the three dimensions. Letting p1, p2, p3 be the grid dimensions, we similarly define p, q, r to be the processor grid dimensions corresponding to matrix dimensions m, n, k, respectively. Because the order of processor grid dimensions will be chosen to be consistent with the matrix dimensions, we will have p ≥ q ≥ r. To demonstrate the tightness of the lower bound, the analysis below assumes that the processor grid dimensions divide the matrix dimensions.
Following Theorem 1, depending on the relative sizes of the aspect ratios among matrix dimensions and the number
of processors, we encounter three cases that correspond to 3D, 2D, and 1D processor grids. That is, when 𝑝𝑖=1 for
one value of 𝑖, then the processor grid is effectively 2D, and when 𝑝𝑖=1 for two values of 𝑖, the processor grid is
effectively 1D. In the following we show how to obtain the grid dimensions and show that Algorithm 1 attains the
communication lower bound given in Theorem 1 in each case.
First, suppose P ≤ m/n. In this case, we set r = q = 1 and p = P to obtain a 1D grid. From eq. (3), Algorithm 1 has communication cost

    (mn + mk)/P + nk − (mn + mk + nk)/P = (1 − 1/P) nk,

which matches the 1st case of Theorem 1.

Now suppose m/n < P ≤ mn/k^2. We set r = 1, and set p and q such that m/p = n/q, yielding p = (P/(mn))^{1/2} m and q = (P/(mn))^{1/2} n. Note that the assumption m/n < P is required so that q > 1, and p > 1 also follows. Our analysis also assumes that p and q are integers, which is sufficient to show that the lower bound is tight in general as there are an infinite number of dimensions for which the assumption holds. In this case, we have a 2D processor grid, and Algorithm 1 has communication cost

    mn/(pq) + mk/p + nk/q − (mn + mk + nk)/(pq) = 2 (mnk^2/P)^{1/2} − (mk + nk)/P,

matching the 2nd case of Theorem 1.

Finally, suppose mn/k^2 < P. As suggested in [1], we set the grid dimensions such that m/p = n/q = k/r. That is, r = (P/(mnk))^{1/3} k, q = (P/(mnk))^{1/3} n, and p = (P/(mnk))^{1/3} m. Note that the assumption mn/k^2 < P is required so that r > 1 (which also implies q > 1 and p > 1). This assumption was implicit in the analysis of [1]. Again, we assume that p, q, r are integers. Thus, we have a 3D processor grid and a communication cost of

    3 (mnk/P)^{2/3} − (mn + mk + nk)/P,

which matches the 3rd case of Theorem 1.

Comparing the obtained communication cost in each case with the lower bound obtained in Theorem 1, we conclude that Algorithm 1 is optimal, given that the grid dimensions are selected as above.
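A small helper (our own sketch, mirroring the case analysis above and ignoring rounding, per the divisibility assumptions) returns the optimal grid dimensions (p, q, r) aligned with (m, n, k).

# Select processor grid dimensions (p, q, r) aligned with (m, n, k) as in Section 5.2.
def optimal_grid(m, n, k, P):
    if P <= m / n:                        # 1D grid
        return P, 1, 1
    if P <= m * n / k**2:                 # 2D grid: m/p = n/q and r = 1
        s = (P / (m * n)) ** 0.5
        return round(s * m), round(s * n), 1
    s = (P / (m * n * k)) ** (1 / 3)      # 3D grid: m/p = n/q = k/r
    return round(s * m), round(s * n), round(s * k)

# The examples of Section 5.3, with (m, n, k) = (9600, 2400, 600):
for P in (3, 36, 512):
    print(P, optimal_grid(9600, 2400, 600, P))   # (3, 1, 1), (12, 3, 1), (32, 8, 2)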
5.3 Optimal Processor Grid Examples
Figure 2 illustrates each of the three cases for a fixed set of matrix dimensions. Here we consider multiplying a 9600 × 2400 matrix A with a 2400 × 600 matrix B so that C is 9600 × 600; in our notation with m ≥ n ≥ k, A is m × n, B is n × k, and C is m × k. The 3D m × n × k iteration space is visualized with faces corresponding to correctly oriented matrices. In this example, we consider P ∈ {3, 36, 512}.

With 3 processors, we fall into the 1st case, as P ≤ m/n = 4, and the optimal processor grid is 3 × 1 × 1, which is 1D as shown in Fig. 2a. Note that the computation assigned to each processor is not a cube in this case; that is, m/p ≠ n/q ≠ k/r. The only data that must be communicated are entries of B, though all processors need all of B.

When P = 36, we fall into the 2nd case, and the optimal processor grid is 2D: 12 × 3 × 1, as shown in Fig. 2b. Here we see that the iteration space assigned to each processor is 800 × 800 × 600, so m/p = n/q ≠ k/r. In this case, entries of B and C must be communicated, but each entry of A is required by only one processor.

Finally, for P = 512, we satisfy P > mn/k^2 = 64 and fall into the 3rd case. The optimal processor grid is shown in Fig. 2c to be 32 × 8 × 2, and we see that the local computation of each processor is a cube: m/p = n/q = k/r. For a 3D grid, entries of all 3 matrices are communicated.
6 CONCLUSION
Theorem 1 establishes memory-independent communication lower bounds for parallel matrix multiplication. By cast-
ing the lower bound on accessed data as the solution to a constrained optimization problem, we are able to obtain
a result with explicit constants spanning over three scenarios that depend on the relative sizes of the matrix aspect
ratios and the number of processors. Algorithm 1 demonstrates that the constants established in Theorem 1 are tight,
as the algorithm is general enough to be applied in each of the three scenarios by tuning the processor grid. As we
discuss below, our lower bound proof technique tightens the constants proved in earlier work, and we believe it can
be generalized to improve known communication lower bounds for other computations.
6.1 Comparison to Existing Results
We now provide full details of the constants presented in Tab. 1, and compare the previous results with the constants
of Theorem 1. The first row of the table gives the constant from the proof by Aggarwal, Chandra, and Snir [2, Theorem
2.3]. While the result is stated asymptotically, an explicit constant is given in a key lemma ([2, Lemma 2.2]) used in the
proof, from which we can derive the constant for the main result.
The second row of the table corresponds to the work of Irony, Toledo, and Tiskin [14], who establish the first parallel
bounds for matrix multiplication. Their memory-independent bound is stated for square matrices with a parametrized
prefactor corresponding to the amount of local memory available [14, Theorem 5.1]. If we generalize it straightforwardly to rectangular dimensions and minimize the prefactor over any amount of local memory, then we obtain a bound of at least 1/2 · (mnk/P)^{2/3}, which is asymptotically tight for mn/k^2 ≤ P. They do not provide any tighter results for P < mn/k^2.
The third row of the table corresponds to the results of Demmel et al. [11]. This work was the first to establish
bounds for small values of 𝑃and identify the three cases of asymptotic expressions. Theorem 1 obtains the same cases
and leading order terms (up to constant factors) [11, Table I], and we present the explicit constant factors for leading
order terms derived in [11, Section II.B]. We note that the boundaries between cases differ by a constant in that paper,
which we do not reflect in Tab. 1. Compared to these results, Theorem 1 establishes a tighter constant in all three cases.
We note that Kwasniewski et al. claim a combined result of memory-dependent and memory-independent parallel
bounds [17, Theorem 2]. The memory-independent term has a constant that matches the 3rd case of Theorem 1. How-
ever, the proof includes a restrictive assumption on parallelization strategies, requiring that each processor is assigned
a set of domains that are subblocks of the iteration space with dimensions 𝑎×𝑎×𝑏for some 𝑎, 𝑏, and therefore does
not apply to all parallel algorithms.
6.2 Limited-Memory Scenarios
The local memory required by Alg. 1 matches the amount of communication performed plus the data already owned
by the processor, which is given by the positive terms in eq. (3) and matches the value of D in Theorem 1 with the optimal processor grid. Note that the local memory M must be large enough to store the input and output matrices, so M ≥ (mn + mk + nk)/P. When 1D or 2D processor grids are used, the local memory required is no more than a constant factor more than the minimum required to store the problem. Further, Alg. 1 can be adapted to reduce the temporary
memory required to a negligible amount at the expense of higher latency cost but without affecting the bandwidth
cost, and thus the algorithmic approach can be used even in extremely limited memory scenarios. In the case of 3D
processor grids, however, the temporary memory used by Alg. 1 asymptotically dominates the minimum required,
and thus the algorithm cannot be applied in limited-memory scenarios. Reducing the memory footprint in this case
necessarily increases the bandwidth cost. Algorithms that smoothly trade off memory for communication savings in
these limited memory scenarios are well studied [3, 17, 19, 23].
From the lower bound point of view, while Theorem 1 is always valid, it may not be the tightest bound in limited-
memory scenarios. The memory-dependent bound with leading term 2mnk/(P M^{1/2}) (see [17, 20, 22] and discussion in § 2.1) can be larger. In particular, this occurs when mn/k^2 < P ≤ 8/27 · mnk/M^{3/2}, and the memory-dependent bound dominates the memory-independent bound of 3(mnk/P)^{2/3} in that case. This scenario implies that M < 4/9 · (mnk/P)^{2/3}, in which case the temporary space required by Alg. 1 exceeds the available memory. Thus, the tightness of Theorem 1 for mn/k^2 < P requires an assumption of sufficient memory.
When P ≤ mn/k^2, the memory-independent bounds in the first two cases of Theorem 1 are always tight, with no assumption on local memory size. That is, the memory-dependent bound never dominates the memory-independent bound. Consider the 2nd case, so that m/n ≤ P ≤ mn/k^2 and the memory-independent bound is 2(mnk^2/P)^{1/2}. Because the local memory must be large enough to store the largest matrix as well as the other two matrices, we have M > mn/P. This implies 2mnk/(P M^{1/2}) < 2(mnk^2/P)^{1/2}, so the memory-independent bound dominates.
Suppose further that P ≤ m/n. In this case, the leading-order term of the memory-independent bound is nk. This 1st-case bound dominates the 2nd-case bound, which dominates the memory-dependent bound from the argument above. Comparison of the full bounds of the 1st and 2nd cases simplifies to 2(mnk^2/P)^{1/2} ≤ mk/P + nk, which holds by the arithmetic-geometric mean inequality.
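The crossover described in this subsection can be computed directly; the following sketch (ours) compares, for given m, n, k, P, and local memory M, the memory-dependent leading term 2mnk/(P M^{1/2}) with the memory-independent leading term of Theorem 1 and reports which dominates.

# Compare the memory-dependent and memory-independent leading terms for limited memory M.
def dominant_bound(m, n, k, P, M):
    if P <= m / n:
        D = n * k                                    # leading term, case 1
    elif P <= m * n / k**2:
        D = 2 * (m * n * k**2 / P) ** 0.5            # leading term, case 2
    else:
        D = 3 * (m * n * k / P) ** (2 / 3)           # leading term, case 3
    mem_dep = 2 * m * n * k / (P * M ** 0.5)
    return ("memory-dependent", mem_dep) if mem_dep > D else ("memory-independent", D)

# With (m, n, k, P) = (9600, 2400, 600, 512), the threshold is M = 4/9 * (mnk/P)^(2/3) = 40000.
print(dominant_bound(9600, 2400, 600, 512, M=10**6))   # -> memory-independent
print(dominant_bound(9600, 2400, 600, 512, M=10**4))   # -> memory-dependent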
6.3 Extensions
The proof technique we use to obtain Theorem 1 is more generally applicable. The basic approach of defining a con-
strained optimization problem to minimize the total amount of data accessed subject to constraints on that data that depend on the nature of the computation has been used before for matrix multiplication [22] and for tensor compu-
tations [5, 6]. The key to the results presented in this work is the imposition of lower bound constraints on the data
accessed in each individual array given by Lemma 4. These lower bounds become active when the aspect ratios of the
matrices are large relative to the number of processors and allow for tighter lower bounds in those cases. The argument
given in Lemma 4 is not specific to matrix multiplication; it depends only on the number of operations a given word of
data is involved in, so it can be applied to many other computations that have iteration spaces with uneven dimensions.
We believe this will yield new or tighter parallel communication bounds in several cases.
ACKNOWLEDGMENTS
This work is supported by the National Science Foundation under Grant No. CCF-1942892 and OAC-2106920. This
project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020
research and innovation program (Grant agreement No. 810367).
REFERENCES
[1] R. C. Agarwal, S. M. Balle, F. G. Gustavson, M. Joshi, and P. Palkar. 1995. A three-dimensional approach to parallel matrix multiplication. IBM
Journal of Research and Development 39, 5 (1995), 575–582. https://doi.org/10.1147/rd.395.0575
[2] A. Aggarwal, A. K. Chandra, and M. Snir. 1990. Communication complexity of PRAMs. Theor. Comp. Sci. 71, 1 (1990), 3–28.
https://doi.org/10.1016/0304-3975(90)90188-N
[3] G. Ballard, J. Demmel, O. Holtz, B. Lipshitz, and O. Schwartz. 2012. Brief announcement: strong scaling of matrix multiplication algorithms and
memory-independent communication lower bounds. In Proceedings of the 24th ACM Symposium on Parallelism in Algorithms and Architectures
(SPAA ’12). ACM, New York, NY, USA, 77–79. https://doi.org/10.1145/2312005.2312021
[4] G. Ballard, J. Demmel, O. Holtz, and O. Schwartz. 2012. Graph expansion and communication costs of fast matrix multiplication. J. ACM 59, 6,
Article 32 (2012), 23 pages. https://doi.org/10.1145/2395116.2395121
[5] G. Ballard, N. Knight, and K. Rouse. 2018. Communication Lower Bounds for Matricized Tensor Times Khatri-Rao Product. In IPDPS. 557–567.
https://doi.org/10.1109/IPDPS.2018.00065
[6] G. Ballard and K. Rouse. 2020. General Memory-Independent Lower Bound for MTTKRP. In SIAM PP. 1–11.
https://doi.org/10.1137/1.9781611976137.1
[7] J. Berntsen. 1989. Communication efficient matrix multiplication on hypercubes. Parallel Comput. 12, 3 (1989), 335–342.
https://doi.org/10.1016/0167-8191(89)90091-4
[8] S. Boyd and L. Vandenberghe. 2004. Convex Optimization. Cambridge University Press. https://web.stanford.edu/~boyd/cvxbook/
[9] E. Chan, M. Heimlich, A. Purkayastha, and R. van de Geijn. 2007. Collective communication: theory, practice, and experience. Concurrency and
Computation: Practice and Experience 19, 13 (2007), 1749–1783. https://doi.org/10.1002/cpe.1206
[10] M. Christ, J. Demmel, N. Knight, T. Scanlon, and K. Yelick. 2013. Communication Lower Bounds and Optimal Algorithms for Pro-
grams That Reference Arrays - Part 1. Technical Report UCB/EECS-2013-61. EECS Department, University of California, Berkeley.
http://www.eecs.berkeley.edu/Pubs/TechRpts/2013/EECS-2013-61.html
[11] J. Demmel, D. Eliahu, A. Fox, S. Kamil, B. Lipshitz, O. Schwartz, and O. Spillinger. 2013. Communication-Optimal Parallel Recursive Rectangular
Matrix Multiplication. In IPDPS. 261–272. https://doi.org/10.1109/IPDPS.2013.80
[12] Jack Dongarra, Jean-François Pineau, Yves Robert, Zhiao Shi, and Frédéric Vivien. 2008. Revisiting Matrix Product on Master-Worker Platforms.
International Journal of Foundations of Computer Science 19, 06 (2008), 1317–1336. https://doi.org/10.1142/S0129054108006303
[13] J. W. Hong and H. T. Kung. 1981. I/O complexity: The red-blue pebble game. In STOC. ACM, 326–333. https://doi.org/10.1145/800076.802486
[14] D. Irony, S. Toledo, and A. Tiskin. 2004. Communication lower bounds for distributed-memory matrix multiplication. J. Par. and Dist. Comp. 64, 9
(2004), 1017–1026. https://doi.org/10.1016/j.jpdc.2004.03.021
[15] S. Lennart Johnsson. 1993. Minimizing the communication time for matrix multiplication on multiprocessors. Parallel Comput. 19, 11 (1993), 1235
– 1257. https://doi.org/10.1016/0167-8191(93)90029-K
[16] G. Kwasniewski, M. Kabic, T. Ben-Nun, A. N. Ziogas, J. E. Saethre, A. Gaillard, T. Schneider, M. Besta, A. Kozhevnikov, J. VandeVondele, and T.
Hoefler. 2021. On the Parallel I/O Optimality of Linear Algebra Kernels: Near-Optimal Matrix Factorizations. In Proceedings of the International
Conference for High Performance Computing, Networking, Storage and Analysis (St. Louis, Missouri) (SC ’21). Association for Computing Machinery,
New York, NY, USA, Article 70, 15 pages. https://doi.org/10.1145/3458817.3476167
[17] G. Kwasniewski, M. Kabić, M. Besta, J. VandeVondele, R. Solcà, and T. Hoefler. 2019. Red-Blue Pebbling Revisited: Near Optimal Parallel Matrix-
Matrix Multiplication. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (Denver,
Colorado) (SC ’19). Association for Computing Machinery, New York, NY, USA, Article 24, 22 pages. https://doi.org/10.1145/3295500.3356181
[18] L. H. Loomis and H. Whitney. 1949. An inequality related to the isoperimetric inequality. Bull. Amer. Math. Soc. 55, 10 (1949), 961 – 962.
https://doi.org/10.1090/S0002-9904-1949-09320-5
[19] W. McColl and A. Tiskin. 1999. Memory-Efficient Matrix Multiplication in the BSP Model. Algorithmica 24, 3-4 (1999), 287–297.
https://doi.org/10.1007/PL00008264
[20] A. Olivry, J. Langou, L.-N. Pouchet, P. Sadayappan, and F. Rastello. 2020. Automated Derivation of Parametric Data Movement Lower Bounds for
Affine Programs. In Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation (London, UK) (PLDI
2020). ACM, New York, NY, USA, 808–822. https://doi.org/10.1145/3385412.3385989
[21] M. Scquizzato and F. Silvestri. 2014. Communication Lower Bounds for Distributed-Memory Computations. In 31st International Sym-
posium on Theoretical Aspects of Computer Science (STACS 2014), Vol. 25. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, 627–638.
https://doi.org/10.4230/LIPIcs.STACS.2014.627
[22] T. M. Smith, B. Lowery, J. Langou, and R. A. van de Geijn. 2019. A Tight I/O Lower Bound for Matrix Multiplication. Technical Report. arXiv.
https://arxiv.org/abs/1702.02017
[23] E. Solomonik and J. Demmel. 2011. Communication-Optimal Parallel 2.5D Matrix Multiplication and LU Factorization Algorithms. In Euro-Par
2011 Parallel Processing, Emmanuel Jeannot, Raymond Namyst, and Jean Roman (Eds.). Lecture Notes in Computer Science, Vol. 6853. Springer
Berlin Heidelberg, 90–109. https://doi.org/10.1007/978-3-642-23397-5_10
[24] R. Thakur, R. Rabenseifner, and W. Gropp. 2005. Optimization of Collective Communication Operations in MPICH. Intl. J. High Perf. Comp. App.
19, 1 (2005), 49–66. https://doi.org/10.1177/1094342005051521