
arXiv:2205.13407v1 [cs.DC] 26 May 2022

Tight Memory-Independent Parallel Matrix Multiplication Communication Lower Bounds

HUSSAM AL DAAS, Rutherford Appleton Laboratory, UK

GREY BALLARD, Wake Forest University, USA

LAURA GRIGORI, Inria Paris, France

SURAJ KUMAR, Inria Paris, France

KATHRYN ROUSE, Inmar Intelligence, USA

Communication lower bounds have long been established for matrix multiplication algorithms. However, most methods of asymptotic

analysis have either ignored the constant factors or not obtained the tightest possible values. Recent work has demonstrated that more

careful analysis improves the best known constants for some classical matrix multiplication lower bounds and helps to identify more

eﬃcient algorithms that match the leading-order terms in the lower bounds exactly and improve practical performance. The main

result of this work is the establishment of memory-independent communication lower bounds with tight constants for parallel matrix

multiplication. Our constants improve on previous work in each of three cases that depend on the relative sizes of the aspect ratios

of the matrices.

1 INTRODUCTION

The cost of communication relative to computation continues to grow, so the time complexity of an algorithm must

account for both the computation it performs and the data that it communicates. Communication lower bounds for

computations set targets for eﬃcient algorithms and spur algorithmic development. Matrix multiplication is one of

the most fundamental computations, and its I/O complexity on sequential machines and parallel communication costs

have been well studied over decades [2, 11, 13, 14, 21].

The earliest results established asymptotic lower bounds, ignoring constant factors and lower order terms. Because

of the ubiquity of matrix multiplication in high performance computations, more recent attempts have tightened the

analysis to obtain the best constant factors [17, 20, 22]. These improvements in the lower bound also helped identify the

best performing sequential and parallel algorithms that can be further tuned for high performance in settings where

even small constant factors make signiﬁcant diﬀerences. We review these advances and other related work in § 2.

The main result of this paper is the establishment of tight constants for memory-independent communication lower

bounds for parallel classical matrix multiplication. In the context of a distributed-memory parallel machine model (see

§ 3.1), these bounds apply even when the local memory is inﬁnite, and they are the tightest bounds in many cases when

the memory is limited. Demmel et al. [11] prove asymptotic bounds for general rectangular matrix multiplication and

show that three diﬀerent bounds are asymptotically tight in separate cases that depend on the relative sizes of matrix

dimensions and the number of processors. Our main result, Theorem 1 in § 4, reproduces those asymptotic bounds

and improves the constants in all three cases. Further, in § 5, we analyze a known algorithm to show that it attains the

lower bounds exactly, proving that the constants we obtain are tight. We present a comparison to previous work in

Tab. 1 and discuss it in detail in § 6.

We believe one of the main features of our lower bound result is the simplicity of the proof technique, which

makes a uniﬁed argument that applies to all three cases. The key idea is to cast the lower bound as the solution to

Authors’ addresses: Hussam Al Daas, Rutherford Appleton Laboratory, Didcot, Oxfordshire, UK, hussam.al-daas@stfc.ac.uk; Grey Ballard, Wake Forest University, Winston-Salem, NC, USA, ballard@wfu.edu; Laura Grigori, Inria Paris, Paris, France, laura.grigori@inria.fr; Suraj Kumar, Inria Paris, Paris, France, suraj.kumar@inria.fr; Kathryn Rouse, Inmar Intelligence, Winston-Salem, NC, USA, kathryn.rouse@inmar.com.



a constrained optimization problem (see Lemma 5) whose objective function is the sum of variables that correspond

to the amount of data of each matrix required by a single processor’s computation. The constraints include the well-

known Loomis-Whitney inequality [18] as well as new lower bounds on individual array access (see Lemma 4) that are

necessary to establish separate results for the three cases. All of the complexity of the three cases, including establishing

the thresholds between cases and the leading terms in each case, is confined to a single optimization problem. We

use fundamental results from convex optimization (reviewed in § 3.2) to solve the problem analytically. This uniﬁed

argument is elegant, it improves previous results to obtain tight constants, and it can be applied more generally to

other computations that have iteration spaces with uneven dimensions.

2 RELATED WORK

2.1 Memory-Dependent Bounds for Matrix Multiplication

The first communication lower bound for matrix multiplication was established by Hong and Kung [13], who obtain the result, using computation directed acyclic graph (CDAG) analysis, that multiplication of square $n \times n$ matrices on a machine with cache size $M$ requires $\Omega(n^3/\sqrt{M})$ words of communication. Irony, Toledo, and Tiskin [14] reproduce the result using a geometric proof based on the Loomis-Whitney inequality (Lemma 1), show it applies to general rectangular matrices (replacing $n^3$ with $n_1 n_2 n_3$), and obtain an explicit constant of $(1/2)^{3/2} \approx .35$. They also observe that the result is easily extended to the distributed-memory parallel computing model (as described in § 3.1) under mild assumptions by dividing the bound by the number of processors $P$. We refer to such bounds as "memory-dependent," following [3], where the cache size $M$ is interpreted as the size of each processor's local memory. Later, Dongarra et al. [12] tightened the constant for sequential and parallel memory-dependent bounds to $(3/2)^{3/2} \approx 1.84$. More recently, Smith et al. [22] prove the constant of 2 and show that it is attained by an optimal sequential algorithm and is therefore tight. Both of these results are proved using the Loomis-Whitney inequality. Kwasniewski et al. [17] use CDAG analysis to obtain the same constant of 2 and show that it is tight in the parallel case (when memory is limited) by providing an optimal algorithm.

2.2 Bounds for Other Computations

Hong and Kung’s proof technique can be applied to a more general set of computations, including the FFT [13]. Ballard

et al. [4] use the proof technique of [14] to generalize the lower bound (with the same explicit constant) to other linear

algebra computations such as LU, Cholesky, and QR factorizations. The constants of the lower bounds for these and

other computations are tightened by Olivry et al. [20], including reproducing the constant of 2 for matrix multiplication.

Kwasniewski et al. [16] also obtain tighter constants for LU and Cholesky factorizations using CDAG analysis. Christ et

al. [10] show that a generalization of the Loomis-Whitney inequality can be used to prove communication lower bounds

for a much wider set of computations, but the asymptotic bounds do not include explicit constants. This approach is

applied to a tensor computation by Ballard and Rouse [5, 6].

2.3 Memory-Independent Bounds for Parallel Matrix Multiplication

This section describes the related work that focuses on the topic of this paper. Aggarwal, Chandra, and Snir [2] extend

Hong and Kung’s result for matrix multiplication to the LPRAM parallel model, which closely resembles the model

we consider with the exception that there exists a global shared memory where the inputs initially reside and where

the output must be stored at the end of the computation. Communication bounds in the related BSP parallel model are


                 | 1 ≤ P ≤ m/n   | m/n ≤ P ≤ mn/k^2     | mn/k^2 ≤ P
    Leading term | nk            | (mnk^2/P)^{1/2}      | (mnk/P)^{2/3}
    [2]          | -             | -                    | 1/2^{2/3} ≈ .63
    [14]         | -             | -                    | 1/2 = .5
    [11]         | 16/25 = .64   | (2/3)^{1/2} ≈ .82    | 1
    Theorem 1    | 1             | 2                    | 3

Table 1. Summary of explicit constants on the leading term of parallel memory-independent rectangular matrix multiplication communication lower bounds for multiplication dimensions $m \ge n \ge k$ and $P$ processors

also memory independent, and Scquizzato and Silvestri [21] establish the same asymptotic lower bounds for matrix

multiplication in that model. In addition to proving bounds for sequential matrix multiplication and the associated

memory-dependent bound for parallel matrix multiplication, Irony, Toledo, and Tiskin [14] also prove that a parallel algorithm must communicate $\Omega(n^2/P^{2/3})$ words, and they provide explicit constants in their analysis. Note that the size of the local memory $M$ does not appear in this bound. Ballard et al. [3] reproduce this result for classical matrix multiplication as well as prove similar results for fast matrix multiplication. They distinguish between memory-dependent bounds (results described in § 2.1) and memory-independent bounds for parallel algorithms, and they show how the two bounds relate and affect strong scaling behavior. In particular, when $M \gg n^2/P^{2/3}$ (or equivalently $P \gg n^3/M^{3/2}$), the memory-dependent bound is unattainable because the memory-independent bound is larger. Demmel et al. [11] extend the memory-independent results to the rectangular case (multiplying matrices of dimensions $n_1 \times n_2$ and $n_2 \times n_3$),

showing that three diﬀerent bounds apply that depend on the relative sizes of the three dimensions and the number of

processors, and their proof provides explicit constants. For one of the cases, and for a restricted class of parallelizations,

Kwasniewski et al. [17] prove a tighter constant and show that it is attainable by an optimal algorithm.

We summarize the constants obtained by these previous works and compare them to our results in Tab. 1. Further details of the comparison are given in § 6.1. Following Theorem 1, the table assumes $m = \max\{n_1, n_2, n_3\}$, $n = \mathrm{median}\{n_1, n_2, n_3\}$, and $k = \min\{n_1, n_2, n_3\}$.

2.4 Communication-Optimal Parallel Matrix Multiplication Algorithms

Both theoretical and practical algorithms that attain the communication lower bounds have been proposed for various

computation models and implemented on many diﬀerent types of parallel systems. The idea of “3D algorithms” for

matrix multiplication was developed soon after communication lower bounds were established; see [1, 2, 7, 15] for a few

examples. These algorithms partition the 3D iteration space of matrix multiplication in each of the three dimensions and

assign subblocks across a 3D logical grid of processors. McColl and Tiskin [19] and Demmel et al. [11] present recursive

algorithms that effectively achieve a similar 3D logical processor grid for square and general rectangular problems,

respectively. High-performance implementations of these algorithms on today’s supercomputers demonstrate that

these algorithms are indeed practical and outperform standard library implementations [17, 23].


3 PRELIMINARIES

3.1 Parallel Computation Model

We consider the 𝛼-𝛽-𝛾 parallel machine model [9, 24]. In this model, each of 𝑃 processors has its own local memory of size 𝑀 and can compute only with data in its local memory. The processors can communicate data to and from other

processors via messages that are sent over a fully connected network (i.e., each pair of processors has a dedicated link

so that there is no contention on the network). Further, we assume the links are bidirectional so that a pair of processors

can exchange data with no contention. Each processor can send and receive at most one message at the same time.

The cost of communication is a function of two parameters 𝛼 and 𝛽, where 𝛼 is the per-message latency cost and 𝛽 is the per-word bandwidth cost. A message of 𝑤 words sent from one processor to another costs 𝛼 + 𝛽𝑤. The parameter 𝛾 is the cost to perform a single arithmetic operation. For dense matrix multiplication when sufficiently large local

memory is available, bandwidth cost nearly always dominates latency cost. Hence, we focus on the bandwidth cost in

this work. In our model, the communication cost of an algorithm is counted along the critical path of the algorithm

so that if two pairs of processors are communicating messages simultaneously, the communication cost is that of the

largest message. In this work, we focus on memory-independent analysis, so the local memory size 𝑀 can be assumed

to be inﬁnite. We consider limited-memory scenarios in § 6.2.

3.2 Fundamental Results

In this section we collect the fundamental existing results we use to prove our main result, Theorem 1. The ﬁrst

lemma is a geometric inequality that has been used before in establishing communication lower bounds for matrix

multiplication [4, 11, 14]. We use it to relate the computation performed by a processor to the data it must access.

Lemma 1 (Loomis-Whitney [18]). Let $V$ be a finite set of lattice points in $\mathbb{R}^3$, i.e., points $(i,j,k)$ with integer coordinates. Let $\phi_i(V)$ be the projection of $V$ in the $i$-direction, i.e., all points $(j,k)$ such that there exists an $i$ so that $(i,j,k) \in V$. Define $\phi_j(V)$ and $\phi_k(V)$ similarly. Then
$$|V| \le \left(|\phi_i(V)| \cdot |\phi_j(V)| \cdot |\phi_k(V)|\right)^{1/2},$$
where $|\cdot|$ denotes the cardinality of a set.
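A quick numerical illustration of the inequality in Python (our own sketch, not from the paper; the box size and sample size are arbitrary) compares $|V|^2$ against the product of the three projection sizes for a random set of lattice points.

```python
import itertools
import random

# Check |V|^2 <= |phi_i(V)| * |phi_j(V)| * |phi_k(V)| on a random lattice set.
random.seed(0)
box = list(itertools.product(range(6), repeat=3))
V = set(random.sample(box, 40))

phi_i = {(j, k) for (i, j, k) in V}  # projection in the i-direction
phi_j = {(i, k) for (i, j, k) in V}  # projection in the j-direction
phi_k = {(i, j) for (i, j, k) in V}  # projection in the k-direction

lhs = len(V) ** 2
rhs = len(phi_i) * len(phi_j) * len(phi_k)
print(lhs, "<=", rhs, ":", lhs <= rhs)
```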

The next set of deﬁnitions and lemmas allow us to solve the key constrained optimization problem (Lemma 5)

analytically. We ﬁrst remind the reader of the deﬁnitions of diﬀerentiable convex and quasiconvex functions and of

the Karush-Kuhn-Tucker (KKT) conditions. Here and throughout, we use boldface to indicate vectors and matrices and

subscripts to index them, so that 𝑥𝑖 is the 𝑖th element of 𝒙, for example.

Definition 1 ([8, eq. (3.2)]). A differentiable function $f:\mathbb{R}^d \to \mathbb{R}$ is convex if its domain is a convex set and for all $\mathbf{x},\mathbf{y} \in \operatorname{dom} f$,
$$f(\mathbf{y}) \ge f(\mathbf{x}) + \langle \nabla f(\mathbf{x}), \mathbf{y} - \mathbf{x} \rangle.$$

Definition 2 ([8, eq. (3.20)]). A differentiable function $g:\mathbb{R}^d \to \mathbb{R}$ is quasiconvex if its domain is a convex set and for all $\mathbf{x},\mathbf{y} \in \operatorname{dom} g$,
$$g(\mathbf{y}) \le g(\mathbf{x}) \quad \text{implies that} \quad \langle \nabla g(\mathbf{x}), \mathbf{y} - \mathbf{x} \rangle \le 0.$$

Definition 3 ([8, eq. (5.49)]). Consider an optimization problem of the form
$$\min_{\mathbf{x}} \; f(\mathbf{x}) \quad \text{subject to} \quad \mathbf{g}(\mathbf{x}) \le \mathbf{0} \tag{1}$$
where $f:\mathbb{R}^d \to \mathbb{R}$ and $\mathbf{g}:\mathbb{R}^d \to \mathbb{R}^c$ are both differentiable. Define the dual variables $\boldsymbol{\mu} \in \mathbb{R}^c$, and let $\mathbf{J}_{\mathbf{g}}$ be the Jacobian of $\mathbf{g}$. The Karush-Kuhn-Tucker (KKT) conditions of $(\mathbf{x}, \boldsymbol{\mu})$ are as follows:
• Primal feasibility: $\mathbf{g}(\mathbf{x}) \le \mathbf{0}$;
• Dual feasibility: $\boldsymbol{\mu} \ge \mathbf{0}$;
• Stationarity: $\nabla f(\mathbf{x}) + \boldsymbol{\mu} \cdot \mathbf{J}_{\mathbf{g}}(\mathbf{x}) = \mathbf{0}$;
• Complementary slackness: $\mu_i g_i(\mathbf{x}) = 0$ for all $i \in \{1, \ldots, c\}$.

The next two results establish that our key optimization problem in Lemma 5 can be solved analytically using the

KKT conditions. While the results are not novel, we provide proofs for completeness.

Lemma 2 ([6, Lemma 2.2]). The function $g_0(\mathbf{x}) = L - x_1 x_2 x_3$, for some constant $L$, is quasiconvex in the positive octant.

Proof. Let $\mathbf{x},\mathbf{y}$ be points in the positive octant with $g_0(\mathbf{y}) \le g_0(\mathbf{x})$. Then $y_1 y_2 y_3 \ge x_1 x_2 x_3$. Applying the inequality of arithmetic and geometric means (AM-GM inequality) to the values $y_1/x_1$, $y_2/x_2$, $y_3/x_3$ (which are all positive), we have
$$\frac{1}{3}\left(\frac{y_1}{x_1} + \frac{y_2}{x_2} + \frac{y_3}{x_3}\right) \ge \left(\frac{y_1 y_2 y_3}{x_1 x_2 x_3}\right)^{1/3} \ge 1. \tag{2}$$
Then $\nabla g_0(\mathbf{x}) = \begin{bmatrix} -x_2 x_3 & -x_1 x_3 & -x_1 x_2 \end{bmatrix}$, and
$$\langle \nabla g_0(\mathbf{x}), \mathbf{y} - \mathbf{x} \rangle = 3 x_1 x_2 x_3 - y_1 x_2 x_3 - x_1 y_2 x_3 - x_1 x_2 y_3 = 3 x_1 x_2 x_3 \left(1 - \frac{1}{3}\left(\frac{y_1}{x_1} + \frac{y_2}{x_2} + \frac{y_3}{x_3}\right)\right) \le 0,$$
where the last inequality follows from eq. (2). Then by Def. 2, $g_0$ is quasiconvex on the positive octant.

Lemma 3. Consider an optimization problem of the form given in eq. (1). If $f$ is a convex function and each $g_i$ is a quasiconvex function, then the KKT conditions are sufficient for optimality.

Proof. Suppose $\mathbf{x}^*$ and $\boldsymbol{\mu}^*$ satisfy the KKT conditions given in Def. 3. If $\boldsymbol{\mu}^* = \mathbf{0}$, then by stationarity, $\nabla f(\mathbf{x}^*) = \mathbf{0}$. Then the convexity of $f$ (Def. 1) implies
$$f(\mathbf{x}) \ge f(\mathbf{x}^*) + \langle \nabla f(\mathbf{x}^*), \mathbf{x} - \mathbf{x}^* \rangle = f(\mathbf{x}^*)$$
for all $\mathbf{x} \in \operatorname{dom} f$, which implies that $\mathbf{x}^*$ is a global optimum.

Now suppose $\boldsymbol{\mu}^* \ne \mathbf{0}$; then without loss of generality (and by dual feasibility) there exists $m \le c$ such that $\mu_i^* > 0$ for $1 \le i \le m$ and $\mu_i^* = 0$ for $m < i \le c$. Complementary slackness implies that $g_i(\mathbf{x}^*) = 0$ for $1 \le i \le m$. Consider any (primal) feasible $\mathbf{x} \in \operatorname{dom} f$. Then $g_i(\mathbf{x}) \le 0$ for all $i$, and thus $g_i(\mathbf{x}) \le g_i(\mathbf{x}^*)$ for $1 \le i \le m$. By quasiconvexity of $g_i$ (Def. 2), this implies
$$\langle \nabla g_i(\mathbf{x}^*), \mathbf{x} - \mathbf{x}^* \rangle \le 0.$$
Stationarity implies that $\nabla f(\mathbf{x}^*) = -\sum_{i=1}^{m} \mu_i^* \nabla g_i(\mathbf{x}^*)$, and thus
$$\langle \nabla f(\mathbf{x}^*), \mathbf{x} - \mathbf{x}^* \rangle = -\sum_{i=1}^{m} \mu_i^* \langle \nabla g_i(\mathbf{x}^*), \mathbf{x} - \mathbf{x}^* \rangle \ge 0.$$
By convexity of $f$ (Def. 1), we therefore have
$$f(\mathbf{x}) \ge f(\mathbf{x}^*) + \langle \nabla f(\mathbf{x}^*), \mathbf{x} - \mathbf{x}^* \rangle \ge f(\mathbf{x}^*),$$
and thus $\mathbf{x}^*$ is a global optimum.

4 MAIN LOWER BOUND RESULT

4.1 Lower Bounds on Individual Array Access

The following lemma establishes lower bounds on the number of elements of each individual matrix a processor must

access based on the number of computations a given element is involved with. This result is used to establish a set of

constraints in the key optimization problem used in the proof of Theorem 1.

Lemma 4. Given a parallel matrix multiplication algorithm that multiplies an $n_1 \times n_2$ matrix $\mathbf{A}$ by an $n_2 \times n_3$ matrix $\mathbf{B}$ using $P$ processors, any processor that performs at least $1/P$th of the scalar multiplications must access at least $n_1 n_2/P$ elements of $\mathbf{A}$ and at least $n_2 n_3/P$ elements of $\mathbf{B}$ and also compute contributions to at least $n_1 n_3/P$ elements of $\mathbf{C} = \mathbf{A} \cdot \mathbf{B}$.

Proof. The total number of scalar multiplications that must be computed is $n_1 n_2 n_3$. Consider a processor that computes at least $1/P$th of these computations. Each element of $\mathbf{A}$ is involved in $n_3$ multiplications. If the processor accesses fewer than $n_1 n_2/P$ elements of $\mathbf{A}$, then it would perform fewer than $n_3 \cdot n_1 n_2/P$ scalar multiplications, which is a contradiction. Likewise, each element of $\mathbf{B}$ is involved in $n_1$ multiplications. If the processor accesses fewer than $n_2 n_3/P$ elements of $\mathbf{B}$, then it would perform fewer than $n_1 \cdot n_2 n_3/P$ scalar multiplications, which is a contradiction. Finally, each element of $\mathbf{C}$ is the sum of $n_2$ scalar multiplications. If the processor computes contributions to fewer than $n_1 n_3/P$ elements of $\mathbf{C}$, then it would perform fewer than $n_2 \cdot n_1 n_3/P$ scalar multiplications, which is again a contradiction.

4.2 Key Optimization Problem

The following lemma is the crux of the proof of our main result (Theorem 1). We state the optimization problem

abstractly here, but it may be useful to have the following intuition: the variable vector 𝒙 represents the sizes of the projections of the computation assigned to a single processor onto the three matrices, where 𝑥1 corresponds to the smallest matrix and 𝑥3 corresponds to the largest matrix. In order to design a communication-efficient algorithm, we

wish to minimize the sum of the sizes of these projections subject to the constraints of matrix multiplication (and

the processor performing 1/𝑃th of the computation), as speciﬁed by the Loomis-Whitney inequality (Lemma 1) and

Lemma 4. A more rigorous argument that any parallel matrix multiplication algorithm is subject to these constraints

is given in Theorem 1.

We are able to solve this optimization problem analytically using properties of convex optimization (Lemma 3). The

three cases of the solution correspond to how many of the individual variable constraints are tight. When none of

them is tight, we can minimize the sum of variables subject to the bound on their product by setting them all equal

to each other (Case 3). However, when the individual variable constraints make this solution infeasible, those become

active and the free variables must be adjusted (Cases 1 and 2).

Lemma 5. Consider the following optimization problem:
$$\min_{\mathbf{x} \in \mathbb{R}^3} \; x_1 + x_2 + x_3$$
$$\text{such that} \quad \left(\frac{mnk}{P}\right)^2 \le x_1 x_2 x_3, \qquad \frac{nk}{P} \le x_1, \qquad \frac{mk}{P} \le x_2, \qquad \frac{mn}{P} \le x_3,$$
where $m \ge n \ge k \ge 1$ and $P \ge 1$. The optimal solution $\mathbf{x}^*$ depends on the relative values of the constraints, yielding three cases:
(1) if $P \le \frac{m}{n}$, then $x_1^* = nk$, $x_2^* = \frac{mk}{P}$, $x_3^* = \frac{mn}{P}$;
(2) if $\frac{m}{n} \le P \le \frac{mn}{k^2}$, then $x_1^* = x_2^* = \left(\frac{mnk^2}{P}\right)^{1/2}$, $x_3^* = \frac{mn}{P}$;
(3) if $\frac{mn}{k^2} \le P$, then $x_1^* = x_2^* = x_3^* = \left(\frac{mnk}{P}\right)^{2/3}$.

This can be visualized along the $P$ axis, with the three solution regimes above separated by the thresholds $P = \frac{m}{n}$ and $P = \frac{mn}{k^2}$.

Proof. By Lemma 3, we can establish the optimality of the solution for each case by verifying that there exist dual variables such that the KKT conditions specified in Def. 3 are satisfied. This optimization problem fits the assumptions of Lemma 3 because the objective function and all but the first constraint are affine functions, which are convex and quasiconvex, and the first constraint is quasiconvex on the positive octant (which contains the intersection of the affine constraints) by Lemma 2.

To match standard notation (and that of Lemma 3), we let
$$f(\mathbf{x}) = x_1 + x_2 + x_3$$
and
$$\mathbf{g}(\mathbf{x}) = \begin{bmatrix} (mnk/P)^2 - x_1 x_2 x_3 \\ nk/P - x_1 \\ mk/P - x_2 \\ mn/P - x_3 \end{bmatrix}.$$
Thus the gradient of the objective function is $\nabla f(\mathbf{x}) = \begin{bmatrix} 1 & 1 & 1 \end{bmatrix}$ and the Jacobian of the constraint function is
$$\mathbf{J}_{\mathbf{g}}(\mathbf{x}) = \begin{bmatrix} -x_2 x_3 & -x_1 x_3 & -x_1 x_2 \\ -1 & 0 & 0 \\ 0 & -1 & 0 \\ 0 & 0 & -1 \end{bmatrix}.$$

Case 1 ($P \le \frac{m}{n}$). We let
$$\mathbf{x}^* = \begin{bmatrix} nk & \frac{mk}{P} & \frac{mn}{P} \end{bmatrix} \qquad \text{and} \qquad \boldsymbol{\mu}^* = \begin{bmatrix} \frac{P^2}{m^2 nk} & 0 & 1 - \frac{Pn}{m} & 1 - \frac{Pk}{m} \end{bmatrix}$$
and verify the KKT conditions. Primal feasibility is immediate, and dual feasibility follows from $P \le \frac{m}{n} \le \frac{m}{k}$, which holds by the condition of this case and the assumption $n \ge k$. Stationarity follows from direct verification that
$$\boldsymbol{\mu}^* \cdot \mathbf{J}_{\mathbf{g}}(\mathbf{x}^*) = \begin{bmatrix} -1 & -1 & -1 \end{bmatrix}.$$
Complementary slackness is satisfied because the only nonzero dual variables are $\mu_1^*$, $\mu_3^*$, and $\mu_4^*$, and the 1st, 3rd, and 4th constraints are tight.

Case 2 ($\frac{m}{n} \le P \le \frac{mn}{k^2}$). We let
$$\mathbf{x}^* = \begin{bmatrix} \left(\frac{mnk^2}{P}\right)^{1/2} & \left(\frac{mnk^2}{P}\right)^{1/2} & \frac{mn}{P} \end{bmatrix} \qquad \text{and} \qquad \boldsymbol{\mu}^* = \begin{bmatrix} \left(\frac{P}{mnk^{2/3}}\right)^{3/2} & 0 & 0 & 1 - \left(\frac{Pk^2}{mn}\right)^{1/2} \end{bmatrix}$$
and verify the KKT conditions. The primal feasibility of $x_1 = x_2$ is satisfied because
$$\frac{nk}{P} \le \frac{mk}{P} \le \left(\frac{mnk^2}{P}\right)^{1/2},$$
where the first inequality follows from the assumption $m \ge n$ and the second inequality follows from $m/n \le P$ (one condition of this case). The other constraints are clearly satisfied. Dual feasibility requires that $1 - (Pk^2/mn)^{1/2} \ge 0$, which is satisfied because $P \le mn/k^2$ (the other condition of this case). Stationarity can be directly verified. Complementary slackness is satisfied because the 1st and 4th constraints are both tight for $\mathbf{x}^*$, corresponding to the only nonzeros in $\boldsymbol{\mu}^*$.

Case 3 ($\frac{mn}{k^2} \le P$). We let
$$\mathbf{x}^* = \begin{bmatrix} \left(\frac{mnk}{P}\right)^{2/3} & \left(\frac{mnk}{P}\right)^{2/3} & \left(\frac{mnk}{P}\right)^{2/3} \end{bmatrix} \qquad \text{and} \qquad \boldsymbol{\mu}^* = \begin{bmatrix} \left(\frac{P}{mnk}\right)^{4/3} & 0 & 0 & 0 \end{bmatrix}$$
and verify the KKT conditions. We first consider the primal feasibility conditions. We have
$$\frac{nk}{P} \le \frac{mk}{P} \le \frac{mn}{P} \le \left(\frac{mnk}{P}\right)^{2/3},$$
where the first two inequalities are implied by the assumption $m \ge n \ge k$ and the last follows from $\frac{mn}{k^2} \le P$, the condition of this case. Dual feasibility is immediate, and stationarity is directly verified. Complementary slackness is satisfied because the 1st constraint is tight for $\mathbf{x}^*$ and $\mu_1^*$ is the only nonzero.

Note that the optimal solutions coincide at boundary points between cases so that the values are continuous as 𝑃

varies.
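As a sanity check on the closed-form solution (not part of the proof), the following Python sketch compares it against a general-purpose solver on small, arbitrary dimensions; it assumes SciPy is available.

```python
import numpy as np
from scipy.optimize import minimize

def closed_form(m, n, k, P):
    """Closed-form optimizer of Lemma 5 in each of the three cases."""
    if P <= m / n:
        return np.array([n * k, m * k / P, m * n / P])
    if P <= m * n / k**2:
        y = np.sqrt(m * n * k**2 / P)
        return np.array([y, y, m * n / P])
    return np.full(3, (m * n * k / P) ** (2 / 3))

def solve_numerically(m, n, k, P):
    """Solve the constrained problem directly with SLSQP; positivity is
    implied by the individual lower-bound constraints."""
    cons = [
        {"type": "ineq", "fun": lambda x: x[0] * x[1] * x[2] - (m * n * k / P) ** 2},
        {"type": "ineq", "fun": lambda x: x[0] - n * k / P},
        {"type": "ineq", "fun": lambda x: x[1] - m * k / P},
        {"type": "ineq", "fun": lambda x: x[2] - m * n / P},
    ]
    x0 = np.array([n * k, m * k / P, m * n / P], dtype=float)  # always feasible
    return minimize(lambda x: x.sum(), x0, method="SLSQP", constraints=cons).x

m, n, k = 96, 24, 6      # arbitrary dimensions with m/n = 4 and mn/k^2 = 64
for P in (2, 30, 500):   # one value of P in each of the three regimes
    print(P, closed_form(m, n, k, P).sum(), solve_numerically(m, n, k, P).sum())
```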

4.3 Communication Lower Bound

We now state our main result, memory-independent communication lower bounds for general matrix multiplication

with tight constants. After the general result, we also present a corollary for square matrix multiplication. The tightness

of the constants in the lower bound is proved in § 5.


Theorem 1. Consider a classical matrix multiplication computation involving matrices of size $n_1 \times n_2$ and $n_2 \times n_3$. Let $m = \max\{n_1, n_2, n_3\}$, $n = \mathrm{median}\{n_1, n_2, n_3\}$, and $k = \min\{n_1, n_2, n_3\}$, so that $m \ge n \ge k$. Any parallel algorithm using $P$ processors that starts with one copy of the two input matrices and ends with one copy of the output matrix and load balances either the computation or the data must communicate at least
$$D - \frac{mn + mk + nk}{P} \text{ words of data,}$$
where
$$D = \begin{cases} \dfrac{mn + mk}{P} + nk & \text{if } 1 \le P \le \dfrac{m}{n}, \\[6pt] 2\left(\dfrac{mnk^2}{P}\right)^{1/2} + \dfrac{mn}{P} & \text{if } \dfrac{m}{n} \le P \le \dfrac{mn}{k^2}, \\[6pt] 3\left(\dfrac{mnk}{P}\right)^{2/3} & \text{if } \dfrac{mn}{k^2} \le P. \end{cases}$$
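For reference, the bound can be transcribed directly into code; the following Python helper (a sketch of ours, not from the paper) evaluates $D - (mn + mk + nk)/P$ for given dimensions and processor count.

```python
def mm_comm_lower_bound(n1, n2, n3, P):
    """Memory-independent lower bound of Theorem 1: words a processor must
    communicate, D - (mn + mk + nk)/P, with m >= n >= k the sorted dimensions."""
    m, n, k = sorted((n1, n2, n3), reverse=True)
    if P <= m / n:                       # 1st case
        D = (m * n + m * k) / P + n * k
    elif P <= m * n / k**2:              # 2nd case
        D = 2 * (m * n * k**2 / P) ** 0.5 + m * n / P
    else:                                # 3rd case
        D = 3 * (m * n * k / P) ** (2 / 3)
    return D - (m * n + m * k + n * k) / P

# Example: the dimensions used in Section 5.3.
for P in (3, 36, 512):
    print(P, mm_comm_lower_bound(9600, 2400, 600, P))
```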

Proof. To establish the lower bound, we focus on a single processor. If the algorithm load balances the computation, then every processor performs $mnk/P$ scalar multiplications, and there exists some processor whose input data at the start of the algorithm plus output data at the end of the algorithm is at most $(mn + mk + nk)/P$ words of data (otherwise the algorithm would either start with more than one copy of the input matrices or end with more than one copy of the output matrix). If the algorithm load balances the data, then every processor starts and ends with a total of $(mn + mk + nk)/P$ words, and some processor must perform at least $mnk/P$ scalar multiplications (otherwise fewer than $mnk$ multiplications are performed). In either case, there exists a processor that performs at least $mnk/P$ multiplications and has access to at most $(mn + mk + nk)/P$ data.

Let $F$ be the set of multiplications assigned to this processor, so that $|F| \ge mnk/P$. Each element of $F$ can be indexed by three indices $(i_1, i_2, i_3)$ and corresponds to the multiplication of $\mathbf{A}(i_1, i_2)$ with $\mathbf{B}(i_2, i_3)$, which contributes to the result $\mathbf{C}(i_1, i_3)$. Let $\phi_{\mathbf{A}}(F)$ be the projection of the set $F$ onto the matrix $\mathbf{A}$, so that $\phi_{\mathbf{A}}(F)$ is the set of entries of $\mathbf{A}$ required for the processor to perform the scalar multiplications in $F$. Here, elements of $\phi_{\mathbf{A}}(F)$ can be indexed by two indices: $\phi_{\mathbf{A}}(F) = \{(i_1, i_2) : \exists\, i_3 \text{ s.t. } (i_1, i_2, i_3) \in F\}$. Define $\phi_{\mathbf{B}}(F)$ and $\phi_{\mathbf{C}}(F)$ similarly. The processor must access all of the elements in $\phi_{\mathbf{A}}(F)$, $\phi_{\mathbf{B}}(F)$, and $\phi_{\mathbf{C}}(F)$ in order to perform all the scalar multiplications in $F$. Because the processor starts and ends with at most $(mn + mk + nk)/P$ data, the communication performed by the processor is at least
$$|\phi_{\mathbf{A}}(F)| + |\phi_{\mathbf{B}}(F)| + |\phi_{\mathbf{C}}(F)| - \frac{mn + mk + nk}{P},$$
which is a lower bound on the communication along the critical path of the algorithm.

In order to lower bound $|\phi_{\mathbf{A}}(F)| + |\phi_{\mathbf{B}}(F)| + |\phi_{\mathbf{C}}(F)|$, we form a constrained minimization problem with this expression as the objective function and constraints derived from Lemmas 1 and 4. The Loomis-Whitney inequality (Lemma 1) implies that
$$|\phi_{\mathbf{A}}(F)| \cdot |\phi_{\mathbf{B}}(F)| \cdot |\phi_{\mathbf{C}}(F)| \ge |F|^2 \ge \left(\frac{n_1 n_2 n_3}{P}\right)^2 = \left(\frac{mnk}{P}\right)^2,$$
and the lower bound on the projections from Lemma 4 means
$$|\phi_{\mathbf{A}}(F)| \ge \frac{n_1 n_2}{P}, \qquad |\phi_{\mathbf{B}}(F)| \ge \frac{n_2 n_3}{P}, \qquad |\phi_{\mathbf{C}}(F)| \ge \frac{n_1 n_3}{P}.$$
For any algorithm, the processor's projections must satisfy these constraints, so the sum of their sizes must be at least the minimum value of the optimization problem. Then by Lemma 5 (and assigning the projections to $x_1, x_2, x_3$ appropriately based on the relative sizes of $n_1, n_2, n_3$), the result follows.


We also state the result for square matrix multiplication, which is a direct application of Theorem 1 with $n_1 = n_2 = n_3$.

Corollary 2. Consider a classical matrix multiplication computation involving two matrices of size $n \times n$. Any parallel algorithm using $P$ processors that starts with one copy of the input data and ends with one copy of the output data and load balances either the computation or the data must communicate at least
$$\frac{3n^2}{P^{2/3}} - \frac{3n^2}{P} \text{ words of data.}$$

5 OPTIMAL PARALLEL ALGORITHM

In this section we present an optimal parallel algorithm (Alg. 1) to show that the lower bound (including the constants)

is tight. The idea is to organize the processors into a 3D processor grid and assign the computation of the matrix

multiplication (a 3D iteration space) to processors according to their location in the grid. The algorithm is not new,

but we present it here in full detail for completeness and to provide the complete analysis, which has not appeared

before. In particular, Alg. 1 is nearly identical to the one proposed by Aggarwal et al. [1], though they use the LPRAM

model and analyze only the case where 𝑃 is large. In the LPRAM model, for example, processors can read concurrently from a global shared memory, while in the 𝛼-𝛽-𝛾 model, the data is distributed across local memories and is shared via

collectives like All-Gathers. Demmel et al. [11] present and analyze their recursive algorithm to show its asymptotic

optimality in all three cases, but they do not track constants. See § 2.4 for more discussion of previous work on optimal

parallel algorithms.

Consider the multiplication of an $n_1 \times n_2$ matrix $\mathbf{A}$ with an $n_2 \times n_3$ matrix $\mathbf{B}$, and let $\mathbf{C} = \mathbf{A} \cdot \mathbf{B}$. Algorithm 1 organizes the $P$ processors into a 3-dimensional $p_1 \times p_2 \times p_3$ logical processor grid with $p_1 p_2 p_3 = P$. Note that one or two of the processor grid dimensions may be equal to 1, which simplifies to a 2- or 1-dimensional grid. A processor coordinate is represented as $(p_1', p_2', p_3')$, where $1 \le p_k' \le p_k$ for $k = 1, 2, 3$.

The basic idea of the algorithm is to perform two collective operations, All-Gathers, so that each processor receives

the input data it needs to perform all of its computation (in an All-Gather, all the processors involved end up with the

union of the input data that starts on each processor). The result of each local computation must be summed with all

other contributions to the same output matrix entries from other processors, and the summations are performed via a

Reduce-Scatter collective operation (in a Reduce-Scatter, the sum of the input data from all processors is computed so

that it ends up evenly distributed across processors).

Algorithm 1 Comm-Optimal Parallel Matrix Multiplication
Input: $\mathbf{A}$ is $n_1 \times n_2$, $\mathbf{B}$ is $n_2 \times n_3$, $p_1 \times p_2 \times p_3$ logical processor grid
Output: $\mathbf{C} = \mathbf{A} \cdot \mathbf{B}$ is $n_1 \times n_3$
1: $(p_1', p_2', p_3')$ is my processor ID
2: // Gather input matrix data
3: $\mathbf{A}_{p_1' p_2'}$ = All-Gather($\mathbf{A}_{p_1' p_2' p_3'}$, $(p_1', p_2', :)$)
4: $\mathbf{B}_{p_2' p_3'}$ = All-Gather($\mathbf{B}_{p_1' p_2' p_3'}$, $(:, p_2', p_3')$)
5: // Perform local computation
6: $\mathbf{D}_{p_1' p_2' p_3'} = \mathbf{A}_{p_1' p_2'} \cdot \mathbf{B}_{p_2' p_3'}$
7: // Sum results to compute $\mathbf{C}_{p_1' p_3'}$
8: $\mathbf{C}_{p_1' p_2' p_3'}$ = Reduce-Scatter($\mathbf{D}_{p_1' p_2' p_3'}$, $(p_1', :, p_3')$)
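The communication pattern can be checked with a serial simulation. The following NumPy sketch (our own, not the authors' implementation) loops over the grid coordinates, forms the blocks each processor would receive from the two All-Gathers, and accumulates the partial products that the Reduce-Scatter would sum; in a distributed code, Lines 3, 4, and 8 would be MPI-style collectives.

```python
import numpy as np

def simulate_alg1(A, B, p1, p2, p3):
    """Serial simulation of the partitioning of Alg. 1: the processor at grid
    coordinate (i, j, l) multiplies the (i, j) block of A by the (j, l) block
    of B, and partial products are summed over j (the Reduce-Scatter fiber)."""
    n1, n2 = A.shape
    _, n3 = B.shape
    assert n1 % p1 == 0 and n2 % p2 == 0 and n3 % p3 == 0
    b1, b2, b3 = n1 // p1, n2 // p2, n3 // p3
    C = np.zeros((n1, n3))
    for i in range(p1):
        for l in range(p3):
            for j in range(p2):  # contributions summed across the p2 fiber
                Aij = A[i*b1:(i+1)*b1, j*b2:(j+1)*b2]  # block gathered on (i, j, :)
                Bjl = B[j*b2:(j+1)*b2, l*b3:(l+1)*b3]  # block gathered on (:, j, l)
                C[i*b1:(i+1)*b1, l*b3:(l+1)*b3] += Aij @ Bjl
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((96, 24))
B = rng.standard_normal((24, 6))
print(np.allclose(simulate_alg1(A, B, 4, 2, 2), A @ B))  # True
```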


Fig. 1. Visualization of Alg. 1 with a $3 \times 3 \times 3$ processor grid. The 3D iteration space is mapped onto the processor grid, and the matrices are mapped to the faces of the grid. The dark highlighting corresponds to the input data initially owned and the output data finally owned by processor $(1,3,1)$, and the light highlighting signifies the data of other processors it uses to perform the local computation. The arrows show the sets of processors involved in the three collective operations involving processor $(1,3,1)$.

Algorithm 1 imposes requirements on the initial distribution of the input matrices and the final distribution of the output. These conditions do not always hold in practice, but to show that the lower bound (which makes no assumption on data distribution except that only 1 copy of the input exists at the start of the computation) is tight, we allow the algorithm to specify its distributions. For simplicity of explanation, we assume that $p_1, p_2, p_3$ evenly divide $n_1, n_2, n_3$,

respectively. We use the notation $\mathbf{A}_{p_1' p_2'}$ to denote the submatrix of $\mathbf{A}$ such that
$$\mathbf{A}_{p_1' p_2'} = \mathbf{A}\left((p_1' - 1)\cdot\tfrac{n_1}{p_1} + 1 : p_1'\cdot\tfrac{n_1}{p_1},\; (p_2' - 1)\cdot\tfrac{n_2}{p_2} + 1 : p_2'\cdot\tfrac{n_2}{p_2}\right),$$
and we define $\mathbf{B}_{p_2' p_3'}$ and $\mathbf{C}_{p_1' p_3'}$ similarly. The algorithm assumes that at the start of the computation, $\mathbf{A}_{p_1' p_2'}$ is distributed evenly across processors $(p_1', p_2', :)$ and $\mathbf{B}_{p_2' p_3'}$ is distributed evenly across processors $(:, p_2', p_3')$. We define $\mathbf{A}_{p_1' p_2' p_3'}$ and $\mathbf{B}_{p_1' p_2' p_3'}$ as the elements of the input matrices that processor $(p_1', p_2', p_3')$ initially owns. At the end of the algorithm, $\mathbf{C}_{p_1' p_3'}$ is distributed evenly across processors $(p_1', :, p_3')$, and we let $\mathbf{C}_{p_1' p_2' p_3'}$ be the elements owned by processor $(p_1', p_2', p_3')$.

Figure 1 presents a visualization of Alg. 1. In this example, we have $n_1 = n_2 = n_3$, and 27 processors are arranged in a $3 \times 3 \times 3$ grid. We highlight the data and communication of a particular processor with ID $(1,3,1)$. The dark highlighting corresponds to the input data initially owned by the processor ($\mathbf{A}_{131}$ and $\mathbf{B}_{131}$) as well as the output data owned by the processor at the end of the computation ($\mathbf{C}_{131}$). The figure shows each of these submatrices as block columns of the submatrices $\mathbf{A}_{13}$, $\mathbf{B}_{31}$, and $\mathbf{C}_{11}$, but any even distribution of these across the same set of processors suffices. The light highlighting of the submatrices $\mathbf{A}_{13}$, $\mathbf{B}_{31}$, and $\mathbf{C}_{11}$ corresponds to the data of other processors involved in the local computation on processor $(1,3,1)$, and their size corresponds to the communication cost. The three collectives that involve processor $(1,3,1)$ occur across three different fibers in the processor grid, as depicted by the arrows in the figure.


5.1 Cost Analysis

Now we analyze the computation and communication costs of the algorithm. Each processor performs $\frac{n_1}{p_1}\cdot\frac{n_2}{p_2}\cdot\frac{n_3}{p_3} = \frac{n_1 n_2 n_3}{P}$ local computations in Line 6. Communication occurs only in the All-Gather and Reduce-Scatter collectives in Lines 3, 4, and 8. Each processor is involved in two All-Gathers involving input matrices and one Reduce-Scatter involving the output matrix. Lines 3 and 4 specify simultaneous All-Gathers across sets of $p_3$ and $p_1$ processors, respectively, and Line 8 specifies simultaneous Reduce-Scatters across sets of $p_2$ processors. Note that if $p_k = 1$ for any $k = 1, 2, 3$, then the corresponding collective can be ignored as no communication occurs. The difference between Alg. 1 and [1, Algorithm 1] is the Reduce-Scatter collective, which replaces the All-to-All collective and has smaller latency cost.

We assume that bandwidth-optimal algorithms, such as bidirectional exchange or recursive doubling/halving, are used for the All-Gather and Reduce-Scatter collectives. The optimal communication cost of these collectives on $p$ processors is $(1 - \frac{1}{p})w$, where $w$ is the number of words of data on each processor after the All-Gather or before the Reduce-Scatter collective [9, 24]. Each processor also performs $(1 - \frac{1}{p})w$ computations for the Reduce-Scatter collective.

Fig. 2. Example parallelizations of the iteration space of the multiplication of a $9600 \times 2400$ matrix $\mathbf{A}$ with a $2400 \times 600$ matrix $\mathbf{B}$: (a) 1D case: $P = 3$ with grid $3 \times 1 \times 1$; (b) 2D case: $P = 36$ with grid $12 \times 3 \times 1$; (c) 3D case: $P = 512$ with grid $32 \times 8 \times 2$.

Hence the communication costs of Lines 3 and 4 in Algorithm 1 are $(1 - \frac{1}{p_3})\frac{n_1 n_2}{p_1 p_2}$ and $(1 - \frac{1}{p_1})\frac{n_2 n_3}{p_2 p_3}$, respectively, to accomplish the All-Gather operations, and the communication cost of performing the Reduce-Scatter operation in Line 8 is $(1 - \frac{1}{p_2})\frac{n_1 n_3}{p_1 p_3}$. Note that if $p_k = 1$ for any $k = 1, 2, 3$, then the cost of the corresponding collective reduces to 0. Thus the overall cost of Algorithm 1 is
$$\frac{n_1 n_2}{p_1 p_2} + \frac{n_2 n_3}{p_2 p_3} + \frac{n_1 n_3}{p_1 p_3} - \frac{n_1 n_2 + n_2 n_3 + n_1 n_3}{P}. \tag{3}$$

Due to the Reduce-Scatter operation, each processor also performs $(1 - \frac{1}{p_2})\frac{n_1 n_3}{p_1 p_3}$ computations, which is dominated by the $\frac{n_1 n_2 n_3}{P}$ computations of Line 6.
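Eq. (3) is easy to evaluate for candidate grids; the helper below (a sketch under the divisibility assumptions above) returns the per-processor bandwidth cost for a given grid.

```python
def alg1_comm_cost(n1, n2, n3, p1, p2, p3):
    """Per-processor bandwidth cost of Alg. 1, eq. (3): the words gathered for
    A and B and reduce-scattered for C, minus the data the processor already
    owns at the start and end."""
    P = p1 * p2 * p3
    return (n1 * n2 / (p1 * p2) + n2 * n3 / (p2 * p3) + n1 * n3 / (p1 * p3)
            - (n1 * n2 + n2 * n3 + n1 * n3) / P)

# Example: the 2D grid of Fig. 2b; the value matches the 2nd-case lower bound.
print(alg1_comm_cost(9600, 2400, 600, 12, 3, 1))  # 760000.0
```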

5.2 Optimal Processor Grid Selection

The communication cost of Algorithm 1, given by eq. (3), depends on the processor grid dimensions. Here we discuss how to select the processor grid dimensions such that the lower bound on communication given in Theorem 1 is attained by Alg. 1, given the matrix dimensions $n_1, n_2, n_3$ and the number of processors $P$. As before, we let $m, n, k$ represent the maximum, median, and minimum values of the three dimensions. Letting $p_1, p_2, p_3$ be the grid dimensions, we similarly define $p, q, r$ to be the processor grid dimensions corresponding to matrix dimensions $m, n, k$, respectively. Because the order of processor grid dimensions will be chosen to be consistent with the matrix dimensions, we will have $p \ge q \ge r$. To demonstrate the tightness of the lower bound, the analysis below assumes that the processor grid dimensions divide the matrix dimensions.

Following Theorem 1, depending on the relative sizes of the aspect ratios among matrix dimensions and the number

of processors, we encounter three cases that correspond to 3D, 2D, and 1D processor grids. That is, when 𝑝𝑖=1 for

one value of 𝑖, then the processor grid is eﬀectively 2D, and when 𝑝𝑖=1 for two values of 𝑖, the processor grid is

eﬀectively 1D. In the following we show how to obtain the grid dimensions and show that Algorithm 1 attains the

communication lower bound given in Theorem 1 in each case.

First, suppose $P \le \frac{m}{n}$. In this case, we set $r = q = 1$ and set $p = P$ to obtain a 1D grid. From eq. (3), Algorithm 1 has communication cost
$$\frac{mn + mk}{P} + nk - \frac{mn + mk + nk}{P} = \left(1 - \frac{1}{P}\right) nk,$$
which matches the 1st case of Theorem 1.

Now suppose $\frac{m}{n} < P \le \frac{mn}{k^2}$. We set $r = 1$, and set $p$ and $q$ such that $\frac{m}{p} = \frac{n}{q}$, yielding $p = \left(\frac{P}{mn}\right)^{1/2} m$ and $q = \left(\frac{P}{mn}\right)^{1/2} n$. Note that the assumption $\frac{m}{n} < P$ is required so that $q > 1$, and $p > 1$ also follows. Our analysis also assumes that $p$ and $q$ are integers, which is sufficient to show that the lower bound is tight in general, as there are an infinite number of dimensions for which the assumption holds. In this case, we have a 2D processor grid, and Algorithm 1 has communication cost
$$\frac{mn}{pq} + \frac{mk}{p} + \frac{nk}{q} - \frac{mn + mk + nk}{pq} = 2\left(\frac{mnk^2}{P}\right)^{1/2} - \frac{mk + nk}{P},$$
matching the 2nd case of Theorem 1.

Finally, suppose $\frac{mn}{k^2} < P$. As suggested in [1], we set the grid dimensions such that $\frac{m}{p} = \frac{n}{q} = \frac{k}{r}$. That is, $r = \left(\frac{P}{mnk}\right)^{1/3} k$, $q = \left(\frac{P}{mnk}\right)^{1/3} n$, and $p = \left(\frac{P}{mnk}\right)^{1/3} m$. Note that the assumption $\frac{mn}{k^2} < P$ is required so that $r > 1$ (which also implies $q > 1$ and $p > 1$). This assumption was implicit in the analysis of [1]. Again, we assume that $p, q, r$ are integers. Thus, we have a 3D processor grid and a communication cost of
$$3\left(\frac{mnk}{P}\right)^{2/3} - \frac{mn + mk + nk}{P},$$
which matches the 3rd case of Theorem 1.

Comparing the communication cost obtained in each case with the lower bound of Theorem 1, we conclude that Algorithm 1 is optimal when the grid dimensions are selected as above.
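The case analysis above translates directly into a grid-selection routine; the sketch below assumes (as the analysis does) that the computed dimensions are integers, and simply rounds otherwise.

```python
def optimal_grid(m, n, k, P):
    """Processor grid (p, q, r) aligned with sorted dimensions m >= n >= k,
    following the three cases of Section 5.2."""
    if P <= m / n:                        # 1D grid
        return P, 1, 1
    if P <= m * n / k**2:                 # 2D grid
        s = (P / (m * n)) ** 0.5
        return round(s * m), round(s * n), 1
    s = (P / (m * n * k)) ** (1 / 3)      # 3D grid
    return round(s * m), round(s * n), round(s * k)

# Reproduces the grids of Section 5.3: (3, 1, 1), (12, 3, 1), (32, 8, 2).
for P in (3, 36, 512):
    print(P, optimal_grid(9600, 2400, 600, P))
```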

5.3 Optimal Processor Grid Examples

Figure 2 illustrates each of the three cases for a fixed set of matrix dimensions. Here we consider multiplying a $9600 \times 2400$ matrix $\mathbf{A}$ with a $2400 \times 600$ matrix $\mathbf{B}$ so that $\mathbf{C}$ is $9600 \times 600$; in our notation with $m \ge n \ge k$, $\mathbf{A}$ is $m \times n$, $\mathbf{B}$ is $n \times k$, and $\mathbf{C}$ is $m \times k$. The 3D $m \times n \times k$ iteration space is visualized with faces corresponding to correctly oriented matrices. In this example, we consider $P \in \{3, 36, 512\}$.

With 3 processors, we fall into the 1st case, as $P \le \frac{m}{n} = 4$, and the optimal processor grid is $3 \times 1 \times 1$, which is 1D as shown in Fig. 2a. Note that the computation assigned to each processor is not a cube in this case, that is, $\frac{m}{p} \ne \frac{n}{q} \ne \frac{k}{r}$. The only data that must be communicated are entries of $\mathbf{B}$, though all processors need all of $\mathbf{B}$.

When $P = 36$, we fall into the 2nd case, and the optimal processor grid is 2D: $12 \times 3 \times 1$, as shown in Fig. 2b. Here we see that the iteration space assigned to each processor is $800 \times 800 \times 600$, so $\frac{m}{p} = \frac{n}{q} \ne \frac{k}{r}$. In this case, entries of $\mathbf{B}$ and $\mathbf{C}$ must be communicated, but each entry of $\mathbf{A}$ is required by only one processor.

Finally, for $P = 512$, we satisfy $P > \frac{mn}{k^2} = 64$ and fall into the 3rd case. The optimal processor grid is shown in Fig. 2c to be $32 \times 8 \times 2$, and we see that the local computation of each processor is a cube: $\frac{m}{p} = \frac{n}{q} = \frac{k}{r}$. For a 3D grid, entries of all 3 matrices are communicated.

6 CONCLUSION

Theorem 1 establishes memory-independent communication lower bounds for parallel matrix multiplication. By cast-

ing the lower bound on accessed data as the solution to a constrained optimization problem, we are able to obtain

a result with explicit constants spanning three scenarios that depend on the relative sizes of the matrix aspect

ratios and the number of processors. Algorithm 1 demonstrates that the constants established in Theorem 1 are tight,

as the algorithm is general enough to be applied in each of the three scenarios by tuning the processor grid. As we

discuss below, our lower bound proof technique tightens the constants proved in earlier work, and we believe it can

be generalized to improve known communication lower bounds for other computations.

6.1 Comparison to Existing Results

We now provide full details of the constants presented in Tab. 1, and compare the previous results with the constants

of Theorem 1. The ﬁrst row of the table gives the constant from the proof by Aggarwal, Chandra, and Snir [2, Theorem

2.3]. While the result is stated asymptotically, an explicit constant is given in a key lemma ([2, Lemma 2.2]) used in the

proof, from which we can derive the constant for the main result.

The second row of the table corresponds to the work of Irony, Toledo, and Tiskin [14], who establish the ﬁrst parallel

bounds for matrix multiplication. Their memory-independent bound is stated for square matrices with a parametrized

prefactor corresponding to the amount of local memory available [14, Theorem 5.1]. If we generalize it straightforwardly to rectangular dimensions and minimize the prefactor over any amount of local memory, then we obtain a bound of at least $\frac{1}{2}\cdot(mnk/P)^{2/3}$, which is asymptotically tight for $mn/k^2 \le P$. They do not provide any tighter results for $P < mn/k^2$.

The third row of the table corresponds to the results of Demmel et al. [11]. This work was the ﬁrst to establish

bounds for small values of 𝑃and identify the three cases of asymptotic expressions. Theorem 1 obtains the same cases


and leading order terms (up to constant factors) [11, Table I], and we present the explicit constant factors for leading

order terms derived in [11, Section II.B]. We note that the boundaries between cases diﬀer by a constant in that paper,

which we do not reﬂect in Tab. 1. Compared to these results, Theorem 1 establishes a tighter constant in all three cases.

We note that Kwasniewski et al. claim a combined result of memory-dependent and memory-independent parallel

bounds [17, Theorem 2]. The memory-independent term has a constant that matches the 3rd case of Theorem 1. How-

ever, the proof includes a restrictive assumption on parallelization strategies, requiring that each processor is assigned

a set of domains that are subblocks of the iteration space with dimensions $a \times a \times b$ for some $a, b$, and therefore does

not apply to all parallel algorithms.

6.2 Limited-Memory Scenarios

The local memory required by Alg. 1 matches the amount of communication performed plus the data already owned by the processor, which is given by the positive terms in eq. (3) and matches the value of $D$ in Theorem 1 with the optimal processor grid. Note that the local memory $M$ must be large enough to store the input and output matrices, so $M \ge (mn + mk + nk)/P$. When 1D or 2D processor grids are used, the local memory required is no more than a constant factor larger than the minimum required to store the problem. Further, Alg. 1 can be adapted to reduce the temporary

memory required to a negligible amount at the expense of higher latency cost but without aﬀecting the bandwidth

cost, and thus the algorithmic approach can be used even in extremely limited memory scenarios. In the case of 3D

processor grids, however, the temporary memory used by Alg. 1 asymptotically dominates the minimum required,

and thus the algorithm cannot be applied in limited-memory scenarios. Reducing the memory footprint in this case

necessarily increases the bandwidth cost. Algorithms that smoothly trade oﬀ memory for communication savings in

these limited memory scenarios are well studied [3, 17, 19, 23].

From the lower bound point of view, while Theorem 1 is always valid, it may not be the tightest bound in limited-memory scenarios. The memory-dependent bound with leading term $2mnk/(P\sqrt{M})$ (see [17, 20, 22] and the discussion in § 2.1) can be larger. In particular, this occurs when $mn/k^2 < P \le \frac{8}{27}\cdot mnk/M^{3/2}$, and the memory-dependent bound dominates the memory-independent bound of $3(mnk/P)^{2/3}$ in that case. This scenario implies that $M < \frac{4}{9}\cdot(mnk/P)^{2/3}$, in which case the temporary space required by Alg. 1 exceeds the available memory. Thus, the tightness of Theorem 1 for $mn/k^2 < P$ requires an assumption of sufficient memory.

When $P \le mn/k^2$, the memory-independent bounds in the first two cases of Theorem 1 are always tight, with no assumption on local memory size. That is, the memory-dependent bound never dominates the memory-independent bound. Consider the 2nd case, so that $m/n \le P \le mn/k^2$ and the memory-independent bound is $2(mnk^2/P)^{1/2}$. Because the local memory must be large enough to store the largest matrix as well as the other two matrices, we have $M > mn/P$. This implies $2mnk/(P\sqrt{M}) < 2(mnk^2/P)^{1/2}$, so the memory-independent bound dominates.

Suppose further that $P \le \frac{m}{n}$. In this case, the leading-order term of the memory-independent bound is $nk$. This 1st-case bound dominates the 2nd-case bound, which dominates the memory-dependent bound by the argument above. Comparison of the full bounds of the 1st and 2nd cases simplifies to $2(mnk^2/P)^{1/2} \le mk/P + nk$, which holds by the arithmetic-geometric mean inequality.
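The crossover described above can be illustrated numerically; the following sketch (with arbitrary example values of ours) compares the leading terms of the memory-dependent and memory-independent bounds around the threshold $M = \frac{4}{9}(mnk/P)^{2/3}$.

```python
def mem_dependent_leading(m, n, k, P, M):
    # Leading term of the memory-dependent bound (see Section 2.1).
    return 2 * m * n * k / (P * M ** 0.5)

def mem_independent_leading(m, n, k, P):
    # Leading term of the 3rd case of Theorem 1.
    return 3 * (m * n * k / P) ** (2 / 3)

m, n, k, P = 9600, 2400, 600, 4096              # mn/k^2 = 64 < P, so 3rd regime
threshold = 4 / 9 * (m * n * k / P) ** (2 / 3)  # local memory of 10000 words
for M in (0.5 * threshold, 2 * threshold):
    print(M, mem_dependent_leading(m, n, k, P, M), mem_independent_leading(m, n, k, P))
```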

6.3 Extensions

The proof technique we use to obtain Theorem 1 is more generally applicable. The basic approach of defining a constrained optimization problem to minimize the total amount of data accessed, subject to constraints on that data that depend on the nature of the computation, has been used before for matrix multiplication [22] and for tensor computations [5, 6]. The key to the results presented in this work is the imposition of lower bound constraints on the data accessed in each individual array, given by Lemma 4. These lower bounds become active when the aspect ratios of the matrices are large relative to the number of processors and allow for tighter lower bounds in those cases. The argument given in Lemma 4 is not specific to matrix multiplication: it depends only on the number of operations a given word of data is involved in, so it can be applied to many other computations that have iteration spaces with uneven dimensions. We believe this will yield new or tighter parallel communication bounds in several cases.

ACKNOWLEDGMENTS

This work is supported by the National Science Foundation under Grant No. CCF-1942892 and OAC-2106920. This

project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020

research and innovation program (Grant agreement No. 810367).

REFERENCES

[1] R. C. Agarwal, S. M. Balle, F. G. Gustavson, M. Joshi, and P. Palkar. 1995. A three-dimensional approach to parallel matrix multiplication. IBM

Journal of Research and Development 39, 5 (1995), 575–582. https://doi.org/10.1147/rd.395.0575

[2] A. Aggarwal, A. K. Chandra, and M. Snir. 1990. Communication complexity of PRAMs. Theor. Comp. Sci. 71, 1 (1990), 3–28.

https://doi.org/10.1016/0304-3975(90)90188-N

[3] G. Ballard, J. Demmel, O. Holtz, B. Lipshitz, and O. Schwartz. 2012. Brief announcement: strong scaling of matrix multiplication algorithms and

memory-independent communication lower bounds. In Proceedings of the 24th ACM Symposium on Parallelism in Algorithms and Architectures

(SPAA ’12). ACM, New York, NY, USA, 77–79. https://doi.org/10.1145/2312005.2312021

[4] G. Ballard, J. Demmel, O. Holtz, and O. Schwartz. 2012. Graph expansion and communication costs of fast matrix multiplication. J. ACM 59, 6,

Article 32 (2012), 23 pages. https://doi.org/10.1145/2395116.2395121

[5] G. Ballard, N. Knight, and K. Rouse. 2018. Communication Lower Bounds for Matricized Tensor Times Khatri-Rao Product. In IPDPS. 557–567.

https://doi.org/10.1109/IPDPS.2018.00065

[6] G. Ballard and K. Rouse. 2020. General Memory-Independent Lower Bound for MTTKRP. In SIAM PP. 1–11.

https://doi.org/10.1137/1.9781611976137.1

[7] J. Berntsen. 1989. Communication eﬃcient matrix multiplication on hypercubes. Parallel Comput. 12, 3 (1989), 335–342.

https://doi.org/10.1016/0167-8191(89)90091-4

[8] S. Boyd and L. Vandenberghe. 2004. Convex Optimization. Cambridge University Press. https://web.stanford.edu/~boyd/cvxbook/

[9] E. Chan, M. Heimlich, A. Purkayastha, and R. van de Geijn. 2007. Collective communication: theory, practice, and experience. Concurrency and

Computation: Practice and Experience 19, 13 (2007), 1749–1783. https://doi.org/10.1002/cpe.1206

[10] M. Christ, J. Demmel, N. Knight, T. Scanlon, and K. Yelick. 2013. Communication Lower Bounds and Optimal Algorithms for Pro-

grams That Reference Arrays - Part 1. Technical Report UCB/EECS-2013-61. EECS Department, University of California, Berkeley.

http://www.eecs.berkeley.edu/Pubs/TechRpts/2013/EECS-2013-61.html

[11] J. Demmel, D. Eliahu, A. Fox, S. Kamil, B. Lipshitz, O. Schwartz, and O. Spillinger. 2013. Communication-Optimal Parallel Recursive Rectangular

Matrix Multiplication. In IPDPS. 261–272. https://doi.org/10.1109/IPDPS.2013.80

[12] Jack Dongarra, Jean-François Pineau, Yves Robert, Zhiao Shi, and Frédéric Vivien. 2008. Revisiting Matrix Product on Master-Worker Platforms.

International Journal of Foundations of Computer Science 19, 06 (2008), 1317–1336. https://doi.org/10.1142/S0129054108006303

[13] J. W. Hong and H. T. Kung. 1981. I/O complexity: The red-blue pebble game. In STOC. ACM, 326–333. https://doi.org/10.1145/800076.802486

[14] D. Irony, S. Toledo, and A. Tiskin. 2004. Communication lower bounds for distributed-memory matrix multiplication. J. Par. and Dist. Comp. 64, 9

(2004), 1017–1026. https://doi.org/10.1016/j.jpdc.2004.03.021

[15] S. Lennart Johnsson. 1993. Minimizing the communication time for matrix multiplication on multiprocessors. Parallel Comput. 19, 11 (1993), 1235

– 1257. https://doi.org/10.1016/0167-8191(93)90029-K

[16] G. Kwasniewski, M. Kabic, T. Ben-Nun, A. N. Ziogas, J. E. Saethre, A. Gaillard, T. Schneider, M. Besta, A. Kozhevnikov, J. VandeVondele, and T.

Hoeﬂer. 2021. On the Parallel I/O Optimality of Linear Algebra Kernels: Near-Optimal Matrix Factorizations. In Proceedings of the International

Conference for High Performance Computing, Networking, Storage and Analysis (St. Louis, Missouri) (SC ’21). Association for Computing Machinery,

New York, NY, USA, Article 70, 15 pages. https://doi.org/10.1145/3458817.3476167

[17] G. Kwasniewski, M. Kabić, M. Besta, J. VandeVondele, R. Solcà, and T. Hoeﬂer. 2019. Red-Blue Pebbling Revisited: Near Optimal Parallel Matrix-

Matrix Multiplication. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (Denver,

Colorado) (SC ’19). Association for Computing Machinery, New York, NY, USA, Article 24, 22 pages. https://doi.org/10.1145/3295500.3356181


[18] L. H. Loomis and H. Whitney. 1949. An inequality related to the isoperimetric inequality. Bull. Amer. Math. Soc. 55, 10 (1949), 961 – 962.

https://doi.org/10.1090/S0002-9904-1949-09320-5

[19] W. McColl and A. Tiskin. 1999. Memory-Eﬃcient Matrix Multiplication in the BSP Model. Algorithmica 24, 3-4 (1999), 287–297.

https://doi.org/10.1007/PL00008264

[20] A. Olivry, J. Langou, L.-N. Pouchet, P. Sadayappan, and F. Rastello. 2020. Automated Derivation of Parametric Data Movement Lower Bounds for

Aﬃne Programs. In Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation (London, UK) (PLDI

2020). ACM, New York, NY, USA, 808–822. https://doi.org/10.1145/3385412.3385989

[21] M. Scquizzato and F. Silvestri. 2014. Communication Lower Bounds for Distributed-Memory Computations. In 31st International Sym-

posium on Theoretical Aspects of Computer Science (STACS 2014), Vol. 25. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, 627–638.

https://doi.org/10.4230/LIPIcs.STACS.2014.627

[22] T. M. Smith, B. Lowery, J. Langou, and R. A. van de Geijn. 2019. A Tight I/O Lower Bound for Matrix Multiplication. Technical Report. arXiv.

https://arxiv.org/abs/1702.02017

[23] E. Solomonik and J. Demmel. 2011. Communication-Optimal Parallel 2.5D Matrix Multiplication and LU Factorization Algorithms. In Euro-Par

2011 Parallel Processing, Emmanuel Jeannot, Raymond Namyst, and Jean Roman (Eds.). Lecture Notes in Computer Science, Vol. 6853. Springer

Berlin Heidelberg, 90–109. https://doi.org/10.1007/978-3-642-23397-5_10

[24] R. Thakur, R. Rabenseifner, and W. Gropp. 2005. Optimization of Collective Communication Operations in MPICH. Intl. J. High Perf. Comp. App.

19, 1 (2005), 49–66. https://doi.org/10.1177/1094342005051521