
PARALLEL ALGORITHMS FOR TENSOR TRAIN ARITHMETIC∗

HUSSAM AL DAAS†, GREY BALLARD‡, AND PETER BENNER†

∗Submission date: November 12, 2020.
†Department of Computational Methods in Systems and Control Theory, Max Planck Institute for Dynamics of Complex Technical Systems, Magdeburg, Germany (aldaas@mpi-magdeburg.mpg.de, benner@mpi-magdeburg.mpg.de).
‡Computer Science Department, Wake Forest University, Winston-Salem, North Carolina, USA (ballard@wfu.edu).

Abstract. We present efficient and scalable parallel algorithms for performing mathematical operations for low-rank tensors represented in the tensor train (TT) format. We consider algorithms for addition, elementwise multiplication, computing norms and inner products, orthogonalization, and rounding (rank truncation). These are the kernel operations for applications such as iterative Krylov solvers that exploit the TT structure. The parallel algorithms are designed for distributed-memory computation, and we use a data distribution and strategy that parallelizes computations for individual cores within the TT format. We analyze the computation and communication costs of the proposed algorithms to show their scalability, and we present numerical experiments that demonstrate their efficiency on both shared-memory and distributed-memory parallel systems. For example, we observe better single-core performance than the existing MATLAB TT-Toolbox in rounding a 2 GB TT tensor, and our implementation achieves a 34× speedup using all 40 cores of a single node. We also show nearly linear parallel scaling on larger TT tensors up to over 10,000 cores for all mathematical operations.

Key words. low-rank tensor format, tensor train, parallel algorithms, QR, SVD

AMS subject classifications. 15A69, 15A23, 65Y05, 65Y20

1. Introduction. Multi-dimensional data, or tensors, appear in a variety of applications where numerical values represent multi-way relationships. The Tensor Train (TT) format is a low-rank representation of a tensor that has been applied to solving problems in areas such as parameter-dependent PDEs, stochastic PDEs, molecular simulations, uncertainty quantification, data completion, and classification [7, 8, 13, 15, 24, 26, 30, 34]. As the number of dimensions or modes of a tensor becomes large, the total number of data elements grows exponentially fast, which is known as the curse of dimensionality [15]. Fortunately, it can be shown in many cases that the tensors exhibit low-rank structure and can be represented or approximated by significantly fewer parameters. Low-rank tensor approximations allow for storing the data implicitly and performing arithmetic operations in feasible time and space complexity, avoiding the curse of dimensionality.

In contrast to the matrix case, where the singular value decomposition (SVD) provides optimal low-rank representations, there are more diverse possibilities for low-rank representations of tensors [22]. Various representations have been proposed, such as CP [11, 16], Tucker [38], quantized tensor train [21], and hierarchical Tucker [15], in addition to TT [30], and each has been demonstrated to be most effective in certain applications. The TT format consists of a sequence of TT cores, one for each tensor dimension, and each core is a 3-way tensor except for the first and last cores, which are matrices. The primary advantages of TT are that (1) the number of parameters in the representation is linear, rather than exponential, in the number of modes and (2) the representation can be computed to satisfy a specified approximation error threshold in a numerically stable way.

As these low-rank tensor techniques have been applied to larger and larger data sets, efficient sequential and parallel implementations of algorithms for computing and manipulating these formats have also been developed. Toolboxes and libraries in productivity-oriented languages such as MATLAB and Python [3, 23, 28, 40] are available for moderately sized data, and parallel algorithms implemented in performance-oriented languages exist for computation of decompositions such as CP [14, 36, 27] and Tucker [2, 6, 20, 35] and operations such as tensor contraction [37], allowing for scalability to much larger data and numbers of processors. However, no such parallelization exists for TT tensors. The goal of this work is to establish efficient and scalable algorithms for implementing the key mathematical operations on TT tensors.

We consider mathematical operations such as addition, Hadamard (elementwise) multiplication, computing norms and inner products, left- and right-orthogonalization, as well as rounding (rank truncation). These are the operations required to, for example, solve a structured linear system whose solution can be approximated well by a tensor in TT format [26]. As we will see in Section 2, mathematical operations can increase the formal ranks of the TT tensor, which can then be recompressed, or rounded back to smaller ranks, in order to maintain feasible time and space complexity with some controllable loss of accuracy. As a result, the rounding procedure (and the orthogonalization it requires) is of prime importance in developing efficient and scalable TT algorithms. We will assume throughout that full tensors are never formed explicitly, though there are efficient (sequential) procedures for computing a TT approximation of a full tensor [30].

In order to develop scalable parallel algorithms, we use a data distribution and parallelization techniques that maintain computational load balance and attempt to minimize interprocessor communication, which is the most expensive operation on parallel machines in terms of both time and energy consumption. As discussed in Section 3, we distribute the slices of each TT core across all processors, where slices are matrices (or vectors) whose dimensions are determined by the low ranks of the TT representation. This distribution allows for full parallelization of each core-wise computation and avoids the need for communication within slice-wise computations. The orthogonalization and rounding algorithms depend on parallel QR decompositions, and our approach enables the use of the Tall-Skinny QR algorithm, which is communication optimal for the matrix dimensions in this application [12]. We analyze the parallel computation and communication costs of each TT algorithm, demonstrating that the bulk of the computation is load balanced perfectly across processors. The communication costs are independent of the original tensor dimensions, so their relative costs diminish with small ranks.

We verify the theoretical analysis and benchmark our C/MPI implementation on up to 256 nodes (10,240 cores) of a distributed-memory parallel platform in Section 4. Our experiments are performed on synthetic data using tensor dimensions and ranks that arise in a variety of scientific and data analysis applications. On a shared-memory system (one node of the system), we compare our TT-rounding implementation against the TT-Toolbox [28] in MATLAB and show that our implementation is 70% more efficient using a single core and achieves up to a 34× parallel speedup using all 40 cores on the node. We also present strong scaling performance experiments for computing inner products, norms, orthogonalization, and rounding using up to over 10K MPI processes. The experimental results show that the time remains dominated by local computation even at that scale, allowing for nearly linear scaling for multiple operations, achieving for example a 97× speedup of TT-rounding when scaling from 1 node to 128 nodes on a TT tensor with a 28 GB memory footprint. We conclude in Section 5 and discuss limitations of our approaches and perspectives for future improvements.

Fig. 2.1: Order-5 TT tensor with a particular slice from each TT core highlighted. The chain product of those slices produces a scalar element of the full tensor with indices corresponding to the slices.

2. Notation and background. In this section, we review the tensor train (TT) format and present a brief overview of the notation and computational kernels associated with it. Tensors are denoted by boldface Euler script letters (e.g., $\mathbf{X}$), and matrices are denoted by boldface block letters (e.g., $\mathbf{A}$). The number $I_n$ for $1 \le n \le N$ is referred to as the mode size or mode dimension, and we use $i_n$ to index that dimension. The order of a tensor is its number of modes; e.g., the order of $\mathbf{X}$ is $N$. The $n$th TT core (described below) of a tensor $\mathbf{X}$ is denoted by $\mathcal{T}_{\mathbf{X},n}$. We use MATLAB-style notation to obtain elements or sub-tensors, where a solitary colon (:) refers to the entire range of a dimension. For example, $\mathbf{X}(i,j,k)$ is a tensor entry, $\mathbf{X}(i,:,:)$ is a tensor slice (a matrix in this case), and $\mathbf{X}(:,j,k)$ is a tensor fiber (a vector).

The mode-$n$ "modal" unfolding (or matricization or flattening) of a tensor $\mathbf{X} \in \mathbb{R}^{I_1 \times I_2 \times I_3}$ is the matrix $\mathbf{X}_{(n)} \in \mathbb{R}^{I_n \times I/I_n}$, where $I = I_1 I_2 I_3$. In this case, the columns of the modal unfolding are fibers in that mode. The mode-$n$ product or tensor-times-matrix operation is denoted by $\times_n$ and is defined so that the mode-$n$ unfolding of $\mathbf{X} \times_n \mathbf{A}$ is $\mathbf{A}\mathbf{X}_{(n)}$. We refer to [22, 33] for more details.

2.1. TT tensors. A tensor $\mathbf{X} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ is in the TT format if there exist strictly positive integers $R_0, \dots, R_N$ with $R_0 = R_N = 1$ and $N$ order-3 tensors $\mathcal{T}_{\mathbf{X},1}, \dots, \mathcal{T}_{\mathbf{X},N}$, called TT cores, with $\mathcal{T}_{\mathbf{X},n} \in \mathbb{R}^{R_{n-1} \times I_n \times R_n}$, such that
$$\mathbf{X}(i_1,\dots,i_N) = \mathcal{T}_{\mathbf{X},1}(i_1,:) \cdots \mathcal{T}_{\mathbf{X},n}(:,i_n,:) \cdots \mathcal{T}_{\mathbf{X},N}(:,i_N).$$
We note that because $R_0 = R_N = 1$, the first and last TT cores are (order-2) matrices, so $\mathcal{T}_{\mathbf{X},1}(i_1,:) \in \mathbb{R}^{R_1}$ and $\mathcal{T}_{\mathbf{X},N}(:,i_N) \in \mathbb{R}^{R_{N-1}}$. The $R_{n-1} \times R_n$ matrix $\mathcal{T}_{\mathbf{X},n}(:,i_n,:)$ is referred to as the $i_n$th slice of the $n$th TT core of $\mathbf{X}$, where $1 \le i_n \le I_n$. Figure 2.1 shows an illustration of an order-5 TT tensor.
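To make the chain-product definition concrete, the following is a minimal NumPy sketch (an illustration only, not part of the C/MPI implementation described in this paper), assuming every core, including the first and last, is stored as an order-3 array of shape $(R_{n-1}, I_n, R_n)$ with $R_0 = R_N = 1$:

```python
import numpy as np

def tt_entry(cores, idx):
    """Evaluate X(i_1, ..., i_N) as the chain product of one slice per core."""
    v = np.ones((1, 1))                 # 1x1 seed, since R_0 = 1
    for core, i in zip(cores, idx):
        v = v @ core[:, i, :]           # multiply by the i-th slice of core n
    return v[0, 0]                      # 1x1 result, since R_N = 1
```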

Due to the multiplicative formulation of the TT format, the cores of a TT tensor are not unique. For example, let $\mathbf{X}$ be a TT tensor and $\mathbf{M} \in \mathbb{R}^{R_n \times R_n}$ be an invertible matrix. Then, the TT tensor $\mathbf{Y}$ defined such that
$$\mathbf{Y}(i_1,\dots,i_N) = \mathcal{T}_{\mathbf{X},1}(i_1,:) \cdots (\mathcal{T}_{\mathbf{X},n}(:,i_n,:)\mathbf{M}) \cdot (\mathbf{M}^{-1}\mathcal{T}_{\mathbf{X},n+1}(:,i_{n+1},:)) \cdots \mathcal{T}_{\mathbf{X},N}(:,i_N)$$
is equal to $\mathbf{X}$. Another important remark is the following:
$$(2.1)\quad \mathcal{T}_{\mathbf{X},1}(i_1,:) \cdots (\mathcal{T}_{\mathbf{X},n}(:,i_n,:)\mathbf{M}) \cdot \mathcal{T}_{\mathbf{X},n+1}(:,i_{n+1},:) \cdots \mathcal{T}_{\mathbf{X},N}(:,i_N) = \mathcal{T}_{\mathbf{X},1}(i_1,:) \cdots \mathcal{T}_{\mathbf{X},n}(:,i_n,:) \cdot (\mathbf{M}\,\mathcal{T}_{\mathbf{X},n+1}(:,i_{n+1},:)) \cdots \mathcal{T}_{\mathbf{X},N}(:,i_N),$$
where $\mathbf{M}$ in this case need not be invertible. Thus, we can "pass" a matrix between adjacent cores without changing the tensor. This property is used to orthogonalize TT cores as we will see in Subsection 2.3.

Fig. 2.2: Horizontal and vertical unfoldings of a TT core: $\mathcal{H}(\mathcal{T}_{\mathbf{X},n}) \in \mathbb{R}^{R_{n-1} \times I_n R_n}$ and $\mathcal{V}(\mathcal{T}_{\mathbf{X},n}) \in \mathbb{R}^{R_{n-1} I_n \times R_n}$.

2.2. Unfolding TT cores. In order to express the arithmetic operations on TT cores using linear algebra, we will often use two specific matrix unfoldings of the 3D tensors. The horizontal unfolding of TT core $\mathcal{T}_{\mathbf{X},n}$ corresponds to the horizontal concatenation of the slices $\mathcal{T}_{\mathbf{X},n}(:,i_n,:)$ for $i_n = 1, \dots, I_n$. We denote the corresponding operator by $\mathcal{H}$, so that $\mathcal{H}(\mathcal{T}_{\mathbf{X},n})$ is an $R_{n-1} \times I_n R_n$ matrix. The vertical unfolding corresponds to the vertical concatenation of the slices $\mathcal{T}_{\mathbf{X},n}(:,i_n,:)$ for $i_n = 1, \dots, I_n$. We denote the corresponding operator by $\mathcal{V}$, so that $\mathcal{V}(\mathcal{T}_{\mathbf{X},n})$ is an $R_{n-1} I_n \times R_n$ matrix. These unfoldings are illustrated in Figure 2.2.

Note that the horizontal unfolding is equivalent to the modal unfolding with respect to the 1st mode, often denoted with subscript (1) to denote the mode that corresponds to rows [22]. Similarly, the vertical unfolding is the transpose of the modal unfolding with respect to the 3rd mode, which also corresponds to the more general unfolding that maps the first two modes to rows and the third mode to columns, denoted with subscript (1:2) to denote the modes that correspond to rows [31]. These connections are important for the linearization of tensor entries in memory and our efficient use of BLAS and LAPACK, discussed in Subsection 3.1.
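Both unfoldings are simple reshapes of a core stored as an array of shape $(R_{n-1}, I_n, R_n)$. The sketch below is an illustration under that storage assumption (in the paper's column-major layout both unfoldings require no data movement, whereas NumPy's default row-major order makes the vertical unfolding a transposed copy); the inverse foldings are included because later sketches reuse them:

```python
import numpy as np

def horizontal(core):
    """H(T): R_{n-1} x (I_n R_n), the slices T(:, i, :) placed side by side."""
    r0, i, r1 = core.shape
    return core.reshape(r0, i * r1)

def vertical(core):
    """V(T): (R_{n-1} I_n) x R_n, the slices T(:, i, :) stacked vertically."""
    r0, i, r1 = core.shape
    return core.transpose(1, 0, 2).reshape(i * r0, r1)

def fold_horizontal(mat, i):
    """Inverse of horizontal(): refold an R_{n-1} x (I_n R_n) matrix to a core."""
    return mat.reshape(mat.shape[0], i, -1)

def fold_vertical(mat, i):
    """Inverse of vertical(): refold an (R_{n-1} I_n) x R_n matrix to a core."""
    r0 = mat.shape[0] // i
    return mat.reshape(i, r0, -1).transpose(1, 0, 2)
```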

2.3. TT Orthogonalization. Different types of orthogonalization can be defined for TT tensors. We focus in this paper on left and right orthogonalizations, which are required in the rounding procedure. We use the terms column orthogonal and row orthogonal to refer to matrices that have orthonormal columns and orthonormal rows, respectively, so that a matrix $\mathbf{Q}$ is column orthogonal if $\mathbf{Q}^\top\mathbf{Q} = \mathbf{I}$ and row orthogonal if $\mathbf{Q}\mathbf{Q}^\top = \mathbf{I}$.

A TT tensor is said to be right orthogonal if $\mathcal{H}(\mathcal{T}_{\mathbf{X},n})$ is row orthogonal for $n = 2, \dots, N$ (all but the first core). On the other hand, a tensor is said to be left orthogonal if $\mathcal{V}(\mathcal{T}_{\mathbf{X},n})$ is column orthogonal for $n = 1, \dots, N-1$ (all but the last core). More generally, we define a tensor to be $n$-right orthogonal if the horizontal unfoldings of cores $n+1, \dots, N$ are all row orthogonal, and a tensor is $n$-left orthogonal if the vertical unfoldings of cores $1, \dots, n-1$ are all column orthogonal.

These definitions correspond to the fact that the tensor that represents the contraction of these sets of TT cores inherits their orthogonality. For example, let $\mathbf{X}$ be a right-orthogonal TT tensor; then we can write $\mathbf{X}_{(1)} = \mathcal{T}_{\mathbf{X},1}\mathbf{Z}_{(1)}$, where $\mathbf{Z}$ is an $R_1 \times I_2 \times \cdots \times I_N$ tensor whose entries are given by
$$\mathbf{Z}(r_1, i_2, \dots, i_N) = \mathcal{T}_{\mathbf{X},2}(r_1, i_2, :) \cdot \mathcal{T}_{\mathbf{X},3}(:, i_3, :) \cdots \mathcal{T}_{\mathbf{X},n}(:, i_n, :) \cdots \mathcal{T}_{\mathbf{X},N}(:, i_N).$$

The 1st modal unfolding of $\mathbf{Z}$ is row orthogonal, as shown below [30, Lemma 3.1]:
$$\begin{aligned}
\mathbf{Z}_{(1)}\mathbf{Z}_{(1)}^\top &= \sum_{i_2,\dots,i_N} \mathbf{Z}(:, i_2, \dots, i_N)\,\mathbf{Z}(:, i_2, \dots, i_N)^\top \\
&= \sum_{i_2,\dots,i_N} \mathcal{T}_{\mathbf{X},2}(:,i_2,:) \cdots \mathcal{T}_{\mathbf{X},N}(:,i_N)\,\mathcal{T}_{\mathbf{X},N}(:,i_N)^\top \cdots \mathcal{T}_{\mathbf{X},2}(:,i_2,:)^\top \\
&= \sum_{i_2} \mathcal{T}_{\mathbf{X},2}(:,i_2,:) \cdots \left(\sum_{i_N} \mathcal{T}_{\mathbf{X},N}(:,i_N)\,\mathcal{T}_{\mathbf{X},N}(:,i_N)^\top\right) \cdots \mathcal{T}_{\mathbf{X},2}(:,i_2,:)^\top \\
&= \sum_{i_2} \mathcal{T}_{\mathbf{X},2}(:,i_2,:) \cdots \mathcal{H}(\mathcal{T}_{\mathbf{X},N})\,\mathcal{H}(\mathcal{T}_{\mathbf{X},N})^\top \cdots \mathcal{T}_{\mathbf{X},2}(:,i_2,:)^\top \\
&= \sum_{i_2} \mathcal{T}_{\mathbf{X},2}(:,i_2,:) \cdots \mathbf{I}_{R_{N-1}} \cdots \mathcal{T}_{\mathbf{X},2}(:,i_2,:)^\top \\
&= \sum_{i_2} \mathcal{T}_{\mathbf{X},2}(:,i_2,:) \cdots \mathcal{H}(\mathcal{T}_{\mathbf{X},N-1})\,\mathcal{H}(\mathcal{T}_{\mathbf{X},N-1})^\top \cdots \mathcal{T}_{\mathbf{X},2}(:,i_2,:)^\top \\
&= \cdots = \mathbf{I}_{R_1}.
\end{aligned}$$

Similar arguments show that the 1st modal unfolding of the tensor representing the last $N-n$ cores of an $n$-right orthogonal TT tensor is row orthogonal and that the last modal unfolding of the tensor representing the first $n-1$ cores of an $n$-left orthogonal TT tensor is row orthogonal.

Given a TT tensor, we can orthogonalize it by exploiting the non-uniqueness of TT tensors expressed in Equation (2.1). That is, we can right- or left-orthogonalize a TT core using a QR decomposition of one of its unfoldings and pass its triangular factor to its neighbor core without changing the represented tensor. By starting from one end and repeating this process on each core in order, we can obtain a left or right orthogonal TT tensor, as shown in Algorithm 2.1 (for right orthogonalization).

We note that the norm of a right- or left-orthogonal TT tensor can be cheaply computed, based on the idea that post-multiplication by a matrix with orthonormal rows or pre-multiplication by a matrix with orthonormal columns does not affect the Frobenius norm of a matrix. Thus, we have that $\|\mathbf{X}\| = \|\mathcal{T}_{\mathbf{X},1}\|_F$ provided that $\mathbf{Z}_{(1)}$ has orthonormal rows. Likewise, if $\mathbf{X}$ is a left-orthogonal TT tensor, then $\|\mathbf{X}\| = \|\mathcal{T}_{\mathbf{X},N}\|_F$.

2.4. TT Rounding. Orthogonalization plays an essential role in compressing the TT format of a tensor (decreasing the TT ranks $R_n$) [30]. This compression is known as TT rounding and is given in Algorithm 2.2.

Algorithm 2.1 TT-right-orthogonalization
Require: A TT tensor $\mathbf{X}$
Ensure: A right orthogonal TT tensor $\mathbf{Y}$ equivalent to $\mathbf{X}$
1: function $\mathbf{Y}$ = Right-Orthogonalization($\mathbf{X}$)
2:   Set $\mathcal{T}_{\mathbf{Y},N} = \mathcal{T}_{\mathbf{X},N}$
3:   for $n = N$ down to 2 do
4:     $[\mathcal{H}(\mathcal{T}_{\mathbf{Y},n})^\top, \mathbf{R}]$ = QR($\mathcal{H}(\mathcal{T}_{\mathbf{Y},n})^\top$)  ⊳ QR factorization
5:     $\mathcal{V}(\mathcal{T}_{\mathbf{Y},n-1}) = \mathcal{V}(\mathcal{T}_{\mathbf{X},n-1})\,\mathbf{R}^\top$  ⊳ $\mathcal{T}_{\mathbf{Y},n-1} = \mathcal{T}_{\mathbf{X},n-1} \times_3 \mathbf{R}^\top$
6:   end for
7: end function
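A minimal sequential NumPy sketch of Algorithm 2.1 follows (illustrative only, reusing the unfolding helpers sketched in Subsection 2.2 and assuming $I_n R_n \ge R_{n-1}$ so the ranks are unchanged); each LQ factorization is computed as a QR of the transposed horizontal unfolding:

```python
import numpy as np

def right_orthogonalize(cores):
    """Sequential TT right-orthogonalization (Algorithm 2.1), in place."""
    for n in range(len(cores) - 1, 0, -1):
        r0, i, r1 = cores[n].shape
        # LQ of H(T_n): H^T = Q R, so H = R^T Q^T with Q^T row orthogonal
        Q, R = np.linalg.qr(horizontal(cores[n]).T)
        cores[n] = fold_horizontal(Q.T, i)          # H(T_n) := Q^T
        V = vertical(cores[n - 1]) @ R.T            # pass R^T to the left core
        cores[n - 1] = fold_vertical(V, cores[n - 1].shape[1])
    return cores
```

After the sweep, all cores but the first are row orthogonal, so the tensor's norm is carried entirely by the first core.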

The intuition for rounding can be expressed in matrix notation as follows. Suppose we have a matrix represented by a product
$$(2.2)\quad \mathbf{A} = \mathbf{Q}\mathbf{B}\mathbf{C}\mathbf{Z},$$
where $\mathbf{Q}$ and $\mathbf{Z}$ are column and row orthogonal, respectively. Then the truncated SVD of $\mathbf{A}$ can be readily expressed in terms of the truncated SVD of $\mathbf{B}\mathbf{C}$. In our case, $\mathbf{B}$ is tall and skinny and $\mathbf{C}$ is short and wide, so the rank is bounded by their shared dimension. To truncate the rank, one can row-orthogonalize $\mathbf{C}$ and then perform a truncated SVD of $\mathbf{B}$ (or vice-versa). That is, if we compute $\mathbf{R}_C\mathbf{Q}_C = \mathbf{C}$ and $\mathbf{U}_B\boldsymbol{\Sigma}_B\mathbf{V}_B^\top = \mathbf{B}\mathbf{R}_C$, then to round $\mathbf{A}$ we can replace $\mathbf{B}$ with $\hat{\mathbf{U}}_B$ and $\mathbf{C}$ with $\hat{\boldsymbol{\Sigma}}_B\hat{\mathbf{V}}_B^\top\mathbf{Q}_C$, where $\hat{\mathbf{U}}_B\hat{\boldsymbol{\Sigma}}_B\hat{\mathbf{V}}_B^\top$ is the SVD truncated to the desired tolerance.

In order to truncate a particular rank $R_n$ by considering only the $n$th TT core using this idea, the TT format should be both $n$-left and $n$-right orthogonal. The unfolding of $\mathbf{X}$ that maps the first $n$ tensor dimensions to rows can be expressed as a product of four matrices:
$$(2.3)\quad \mathbf{X}_{(1:n)} = (\mathbf{I}_{I_n} \otimes \mathbf{Q}_{(1:n-1)}) \cdot \mathcal{V}(\mathcal{T}_{\mathbf{X},n}) \cdot \mathcal{H}(\mathcal{T}_{\mathbf{X},n+1}) \cdot (\mathbf{I}_{I_{n+1}} \otimes \mathbf{Z}_{(1)}),$$
where $\mathbf{Q}$ is $I_1 \times \cdots \times I_{n-1} \times R_{n-1}$ with
$$\mathbf{Q}(i_1, \dots, i_{n-1}, r_{n-1}) = \mathcal{T}_{\mathbf{X},1}(i_1,:) \cdot \mathcal{T}_{\mathbf{X},2}(:,i_2,:) \cdots \mathcal{T}_{\mathbf{X},n-1}(:,i_{n-1},r_{n-1}),$$
and $\mathbf{Z}$ is $R_{n+1} \times I_{n+2} \times \cdots \times I_N$ with
$$\mathbf{Z}(r_{n+1}, i_{n+2}, \dots, i_N) = \mathcal{T}_{\mathbf{X},n+2}(r_{n+1}, i_{n+2},:) \cdot \mathcal{T}_{\mathbf{X},n+3}(:,i_{n+3},:) \cdots \mathcal{T}_{\mathbf{X},N}(:,i_N).$$
See Figure 2.3 for a visualization and Appendix A for a full derivation of (2.3). If $\mathbf{X}$ is $n$-left and $n$-right orthogonal, then $\mathbf{Q}_{(1:n-1)}$ and $\mathbf{Z}_{(1)}$ are column and row orthogonal (and so are their Kronecker products with an identity matrix), respectively, and $\mathcal{H}(\mathcal{T}_{\mathbf{X},n+1})$ is also row orthogonal.

In order to truncate $R_n$, we view (2.3) as an instance of (2.2) where $\mathcal{V}(\mathcal{T}_{\mathbf{X},n})$ plays the role of $\mathbf{B}$ and $\mathcal{H}(\mathcal{T}_{\mathbf{X},n+1})$ plays the role of $\mathbf{C}$ (though $\mathcal{H}(\mathcal{T}_{\mathbf{X},n+1})$ is already orthogonalized). We compute the truncated SVD $\mathcal{V}(\mathcal{T}_{\mathbf{X},n}) \approx \hat{\mathbf{U}}\hat{\boldsymbol{\Sigma}}\hat{\mathbf{V}}^\top$, replace $\mathcal{V}(\mathcal{T}_{\mathbf{X},n})$ with $\hat{\mathbf{U}}$, and apply $\hat{\boldsymbol{\Sigma}}\hat{\mathbf{V}}^\top$ to $\mathcal{H}(\mathcal{T}_{\mathbf{X},n+1})$. In this way, $R_n$ is truncated, $\mathcal{V}(\mathcal{T}_{\mathbf{X},n})$ becomes column orthogonal, and because $\mathbf{Q}$ and $\mathbf{Z}$ are not modified, $\mathbf{X}$ becomes $(n{+}1)$-left and $(n{+}1)$-right orthogonal and ready for the truncation of $R_{n+1}$.

The rounding procedure consists of two sweeps along the modes. During the first, the tensor is left or right orthogonalized. On the second, sweeping in the opposite direction, the TT ranks are reduced sequentially via SVD truncation of the matricized cores. The rounding accuracy $\varepsilon_0$ can be defined a priori such that the rounded TT tensor is $\varepsilon_0$-relatively close to the original TT tensor. We note that this method is quasi-optimal in finding the closest TT tensor with prescribed TT ranks to a given TT tensor [29].

Fig. 2.3: Visualization of identity (2.3) for $\mathbf{X}_{(1:n)}$.

Algorithm 2.2 TT-rounding
Require: A tensor $\mathbf{Y}$ in TT format, a threshold $\varepsilon_0$
Ensure: A tensor $\mathbf{X}$ in TT format with reduced ranks such that $\|\mathbf{X}-\mathbf{Y}\|_F \le \varepsilon_0\|\mathbf{Y}\|_F$
1: function $\mathbf{X}$ = Rounding($\mathbf{Y}, \varepsilon_0$)
2:   $\mathbf{X}$ = Right-Orthogonalization($\mathbf{Y}$)
3:   Compute $\|\mathbf{Y}\|_F = \|\mathcal{T}_{\mathbf{X},1}\|_F$ and the truncation threshold $\varepsilon = \frac{\varepsilon_0}{\sqrt{N-1}}\|\mathbf{Y}\|_F$
4:   for $n = 1$ to $N-1$ do
5:     $[\mathcal{V}(\mathcal{T}_{\mathbf{X},n}), \boldsymbol{\Sigma}, \mathbf{V}]$ = SVD($\mathcal{V}(\mathcal{T}_{\mathbf{X},n}), \varepsilon$)  ⊳ $\varepsilon$-truncated SVD factorization
6:     $\mathcal{H}(\mathcal{T}_{\mathbf{X},n+1}) = \boldsymbol{\Sigma}\mathbf{V}^\top\mathcal{H}(\mathcal{T}_{\mathbf{X},n+1})$  ⊳ $\mathcal{T}_{\mathbf{X},n+1} = \mathcal{T}_{\mathbf{X},n+1} \times_1 (\boldsymbol{\Sigma}\mathbf{V}^\top)$
7:   end for
8: end function
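For illustration, a sequential NumPy sketch of Algorithm 2.2 (reusing the unfolding helpers and the right_orthogonalize sketch above; the per-mode threshold follows line 3, $\varepsilon = \varepsilon_0\|\mathbf{Y}\|_F/\sqrt{N-1}$):

```python
import numpy as np

def tt_round(cores, eps0):
    """Sequential TT-rounding (Algorithm 2.2): orthogonalize, then truncate."""
    right_orthogonalize(cores)
    eps = eps0 * np.linalg.norm(cores[0]) / np.sqrt(len(cores) - 1)
    for n in range(len(cores) - 1):
        r0, i, r1 = cores[n].shape
        U, s, Vt = np.linalg.svd(vertical(cores[n]), full_matrices=False)
        tail = np.sqrt(np.cumsum(s[::-1] ** 2))[::-1]   # tail[k] = ||s[k:]||_2
        L = max(1, int(np.sum(tail > eps)))             # eps-truncated rank
        cores[n] = fold_vertical(U[:, :L], i)           # V(T_n) := U-hat
        H = (s[:L, None] * Vt[:L]) @ horizontal(cores[n + 1])
        cores[n + 1] = fold_horizontal(H, cores[n + 1].shape[1])
    return cores
```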

3. Parallel Algorithms for Tensor Train. In this section we detail the parallel algorithms for manipulating TT tensors that are distributed over multiple processors' memories. We describe our proposed data distribution of the core tensors in Subsection 3.1, which is designed for efficient orthogonalization and truncation of TT tensors. In Subsection 3.2 we show how to perform basic operations on TT tensors in this distribution such as addition, elementwise multiplication, and applying certain linear operators. Our proposed parallel orthogonalization and truncation routines are presented in Subsections 3.4 and 3.5, respectively. Both of those routines rely on an existing communication-efficient parallel QR decomposition algorithm called Tall-Skinny QR (TSQR) [12], which is given for completeness in Subsection 3.3. A summary of the costs of the parallel algorithms is presented in Table 3.1.

| TT Algorithm | Computation | Comm. Data | Comm. Msgs |
|---|---|---|---|
| Summation | — | — | — |
| Hadamard | $NIR^4/P$ | — | — |
| Inner Product | $4NIR^3/P$ | $O(NR^2)$ | $O(N\log P)$ |
| Norm | $2NIR^3/P$ | $O(NR^2)$ | $O(N\log P)$ |
| Orthogonalization | $5NIR^3/P + O(NR^3\log P)$ | $O(NR^2\log P)$ | $O(N\log P)$ |
| Rounding | $7NIR^3/P + O(NR^3\log P)$ | $O(NR^2\log P)$ | $O(N\log P)$ |

Table 3.1: Summary of computation and communication costs of parallel TT operations using $P$ processors, assuming inputs are $N$-way tensors with identical dimensions $I_n = I$ and ranks $R_n = R$. The computation cost of rounding assumes the original ranks are reduced in half; the constant can range from 3 to 13 depending on the reduced ranks.

Fig. 3.1: In blue, the data owned by a single processor in the 1D distribution of a TT tensor across $P$ processors.

3.1. Data Distribution and Layout. We are interested in the parallelization of TT operations with a large number of modes and where one or multiple mode sizes are very large compared to the TT ranks. This type of configuration arises in many applications such as parameter-dependent PDEs [26], stochastic PDEs [24], and molecular simulations [34].

To simplify the introduction and without loss of generality, we consider in this paper the case where all mode sizes are very large compared to the TT ranks. In case there exist TT cores with relatively small mode sizes, those can be stored redundantly on each processor. We note that our implementation can deal with both cases.

Algorithms for orthogonalization and rounding of TT tensors are sequential with respect to the mode; often computation can occur on only one mode at a time. In order to utilize all processors and maintain load balancing in a parallel environment, we choose to distribute each TT core over all processors, so that each processor owns a subtensor of each TT core. To ensure the computations on each core can be done in a communication-efficient way, we choose a 1D distribution for each core, where the mode corresponding to the original tensor is divided across processors. This corresponds to a Cartesian distribution of each $R_{n-1} \times I_n \times R_n$ core over a $1 \times P \times 1$ processor grid, or equivalently, a block row distribution of $\mathcal{V}(\mathcal{T}_{\mathbf{X},n})$ or a block column distribution of $\mathcal{H}(\mathcal{T}_{\mathbf{X},n})$, for $n = 1, \dots, N$; see Figure 3.1. In this manner, each processor owns $N$ local subtensors with dimensions $R_{n-1} \times (I_n/P) \times R_n$. The notation $\mathcal{T}_{\mathbf{X},n}^{(p)}$ denotes the local subtensor of the $n$th core owned by processor $p$.

This distribution allows performing basic operations, such as addition and elementwise multiplication, on the TT representation locally; see Subsection 3.2. Furthermore, the distribution of a TT core in this way can be seen as a generalization of the distribution of a vector in parallel iterative linear solvers [1, 19]. Indeed, if $\mathbf{A}$ is an $I_n \times I_n$ sparse matrix distributed across processors as block row panels, the computation of $\mathbf{A}\,\mathcal{T}_{\mathbf{X},n}(k,:,l)$ can be done by using a sparse matrix-vector multiplication.

Tensor entries are linearized in memory. Each local core tensor $\mathcal{T}_{\mathbf{X},n}^{(p)}$ is $R_{n-1} \times (I_n/P) \times R_n$, and we store it in the "vec-oriented" or "natural descending" order [6, 33] in memory. For 3-way tensors, this means that mode-1 fibers (of length $R_{n-1}$) are contiguous in memory, as this corresponds to the mode-1 modal unfolding. Additionally, the mode-3 slices (of size $R_{n-1} \times (I_n/P)$) are also contiguous in memory and internally linearized in column-major order, as this corresponds to the more general (1:2) unfolding [31, 33]. In particular, these facts imply that both the vertical and horizontal unfoldings are column major in memory.

BLAS and LAPACK routines require either row- or column-major ordering (unit stride for one dimension and constant stride for the other), but this property of the vertical and horizontal unfoldings means that we can operate on them without any physical permutation of the tensor data. For example, we can perform operations such as QR factorization of $\mathcal{V}(\mathcal{T}_{\mathbf{X},n})$ and $\mathcal{V}(\mathcal{T}_{\mathbf{X},n})\mathbf{R}$, where $\mathbf{R} \in \mathbb{R}^{R_n \times R_n}$, with a single LAPACK or BLAS call.

This choice of ordering comes at the expense of less convenient access to the mode-2 modal unfolding (of dimension $(I_n/P) \times R_{n-1}R_n$), which is neither row- nor column-major in memory. This unfolding can be visualized in memory as a concatenation of $R_n$ contiguous submatrices, each of dimension $(I_n/P) \times R_{n-1}$ and each stored in row-major order [6]. In order to perform the mode-2 multiplication (tensor-times-matrix operation), as is necessary in the application of a spatial operator on the core, we must make a sequence of calls to the matrix-matrix multiplication BLAS subroutine. That is, we make $R_n$ calls for multiplications of the same $I_n \times I_n$ matrix with different $I_n \times R_{n-1}$ matrices.

3.2. Basic Operations.

3.2.1. Summation. To sum two tensors $\mathbf{X}$ and $\mathbf{Y}$, we can write [30]:
$$\begin{aligned}
\mathbf{Z}(i_1,\dots,i_N) &= \mathbf{X}(i_1,\dots,i_N) + \mathbf{Y}(i_1,\dots,i_N) \\
&= \mathcal{T}_{\mathbf{X},1}(i_1,:)\cdots\mathcal{T}_{\mathbf{X},N}(:,i_N) + \mathcal{T}_{\mathbf{Y},1}(i_1,:)\cdots\mathcal{T}_{\mathbf{Y},N}(:,i_N) \\
&= \begin{bmatrix}\mathcal{T}_{\mathbf{X},1}(i_1,:) & \mathcal{T}_{\mathbf{Y},1}(i_1,:)\end{bmatrix}
\begin{bmatrix}\mathcal{T}_{\mathbf{X},2}(:,i_2,:) & \\ & \mathcal{T}_{\mathbf{Y},2}(:,i_2,:)\end{bmatrix}
\cdots
\begin{bmatrix}\mathcal{T}_{\mathbf{X},N-1}(:,i_{N-1},:) & \\ & \mathcal{T}_{\mathbf{Y},N-1}(:,i_{N-1},:)\end{bmatrix}
\begin{bmatrix}\mathcal{T}_{\mathbf{X},N}(:,i_N) \\ \mathcal{T}_{\mathbf{Y},N}(:,i_N)\end{bmatrix}.
\end{aligned}$$
Thus, the TT representation of $\mathbf{Z} = \mathbf{X} + \mathbf{Y}$ is given by the following slice-wise formula:
$$\mathcal{T}_{\mathbf{Z},n}(:,i_n,:) = \begin{bmatrix}\mathcal{T}_{\mathbf{X},n}(:,i_n,:) & \\ & \mathcal{T}_{\mathbf{Y},n}(:,i_n,:)\end{bmatrix}$$
for $2 \le n \le N-1$ and $1 \le i_n \le I_n$. We also have $\mathcal{T}_{\mathbf{Z},1} = \begin{bmatrix}\mathcal{T}_{\mathbf{X},1} & \mathcal{T}_{\mathbf{Y},1}\end{bmatrix}$ and $\mathcal{T}_{\mathbf{Z},N} = \begin{bmatrix}\mathcal{T}_{\mathbf{X},N} \\ \mathcal{T}_{\mathbf{Y},N}\end{bmatrix}$. Note that the formal TT ranks of $\mathbf{Z}$ are the sums of the TT ranks of $\mathbf{X}$ and $\mathbf{Y}$.

Given the 1D data distribution of each core described in Subsection 3.1, the summation operation can be performed locally with no interprocessor communication. That is, because $\mathbf{X}$, $\mathbf{Y}$, and $\mathbf{Z}$ have identical dimensions, they will have identical distributions, and each slice of a core tensor of $\mathbf{Z}$ will be owned by the processor that owns the corresponding slices of cores of $\mathbf{X}$ and $\mathbf{Y}$.
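A sequential NumPy sketch of this slice-wise construction (illustrative; with unit outer ranks the block-diagonal structure degenerates to concatenation at the first and last cores):

```python
import numpy as np

def tt_add(xc, yc):
    """Z = X + Y in TT format; formal ranks are the sums of the input ranks."""
    N = len(xc)
    zc = []
    for n, (a, b) in enumerate(zip(xc, yc)):
        ra0, i, ra1 = a.shape
        rb0, _, rb1 = b.shape
        r0 = ra0 + rb0 if n > 0 else 1          # R_0 = 1: concatenate, not stack
        r1 = ra1 + rb1 if n < N - 1 else 1      # R_N = 1: concatenate, not stack
        z = np.zeros((r0, i, r1))
        z[:ra0, :, :ra1] = a                    # X block (top-left)
        z[r0 - rb0:, :, r1 - rb1:] = b          # Y block (bottom-right)
        zc.append(z)
    return zc
```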

3.2.2. Hadamard Product. To compute the Hadamard (elementwise) product of two tensors $\mathbf{X}$ and $\mathbf{Y}$, we can write [30]:
$$\begin{aligned}
\mathbf{Z}(i_1,\dots,i_N) &= \mathbf{X}(i_1,\dots,i_N)\cdot\mathbf{Y}(i_1,\dots,i_N) \\
&= (\mathcal{T}_{\mathbf{X},1}(i_1,:)\cdots\mathcal{T}_{\mathbf{X},N}(:,i_N)) \cdot (\mathcal{T}_{\mathbf{Y},1}(i_1,:)\cdots\mathcal{T}_{\mathbf{Y},N}(:,i_N)) \\
&= (\mathcal{T}_{\mathbf{X},1}(i_1,:)\cdots\mathcal{T}_{\mathbf{X},N}(:,i_N)) \otimes (\mathcal{T}_{\mathbf{Y},1}(i_1,:)\cdots\mathcal{T}_{\mathbf{Y},N}(:,i_N)) \\
&= (\mathcal{T}_{\mathbf{X},1}(i_1,:) \otimes \mathcal{T}_{\mathbf{Y},1}(i_1,:)) \cdots (\mathcal{T}_{\mathbf{X},N}(:,i_N) \otimes \mathcal{T}_{\mathbf{Y},N}(:,i_N)).
\end{aligned}$$
Thus, the TT representation of $\mathbf{Z} = \mathbf{X} * \mathbf{Y}$ is given by the slice-wise formula $\mathcal{T}_{\mathbf{Z},n}(:,i_n,:) = \mathcal{T}_{\mathbf{X},n}(:,i_n,:) \otimes \mathcal{T}_{\mathbf{Y},n}(:,i_n,:)$ for $1 \le n \le N$ and $1 \le i_n \le I_n$. Here, the formal TT ranks of $\mathbf{Z}$ are the products of the TT ranks of $\mathbf{X}$ and $\mathbf{Y}$.

Again, given the 1D data distribution of each core and the fact that each core is computed slice-wise, the Hadamard product can be performed locally with no interprocessor communication. We note that because of the extra expense of the Hadamard product (due to computing explicit Kronecker products of slices), it is likely advantageous to maintain Hadamard products in implicit form for later operations such as rounding. The combination of Hadamard products and recompression has been shown to be effective for Tucker tensors [25].
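A slice-wise NumPy sketch of the Hadamard product (illustrative; the einsum forms the Kronecker product of corresponding slices, so memory grows with the product of the ranks):

```python
import numpy as np

def tt_hadamard(xc, yc):
    """Z = X .* Y in TT format; formal ranks are the products of input ranks."""
    zc = []
    for a, b in zip(xc, yc):
        ra0, i, ra1 = a.shape
        rb0, _, rb1 = b.shape
        # T_Z(:, i, :) = T_X(:, i, :) (kron) T_Y(:, i, :), for every slice i
        z = np.einsum('aib,cid->acibd', a, b).reshape(ra0 * rb0, i, ra1 * rb1)
        zc.append(z)
    return zc
```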

3.2.3. Inner Product. To compute the inner product of two tensors $\mathbf{X}$ and $\mathbf{Y}$, using similar identities as for the Hadamard product, we can write [30]:
$$\begin{aligned}
\langle\mathbf{X},\mathbf{Y}\rangle &= \sum_{i_1,\dots,i_N} \mathbf{X}(i_1,\dots,i_N)\cdot\mathbf{Y}(i_1,\dots,i_N) \\
&= \sum_{i_1,\dots,i_N} (\mathcal{T}_{\mathbf{X},1}(i_1,:) \otimes \mathcal{T}_{\mathbf{Y},1}(i_1,:)) \cdots (\mathcal{T}_{\mathbf{X},N}(:,i_N) \otimes \mathcal{T}_{\mathbf{Y},N}(:,i_N)) \\
&= \sum_{i_1} (\mathcal{T}_{\mathbf{X},1}(i_1,:) \otimes \mathcal{T}_{\mathbf{Y},1}(i_1,:)) \sum_{i_2} (\mathcal{T}_{\mathbf{X},2}(:,i_2,:) \otimes \mathcal{T}_{\mathbf{Y},2}(:,i_2,:)) \cdots \sum_{i_N} (\mathcal{T}_{\mathbf{X},N}(:,i_N) \otimes \mathcal{T}_{\mathbf{Y},N}(:,i_N)).
\end{aligned}$$
This expression can be evaluated efficiently by a sequence of structured matrix-vector products that avoid forming Kronecker products of matrices, and these matrix-vector products are cast as matrix-matrix multiplications.

To see how, we assume that the TT ranks of $\mathbf{X}$ and $\mathbf{Y}$ are $\{R_n^{\mathbf{X}}\}$ and $\{R_n^{\mathbf{Y}}\}$, respectively. First, we explicitly construct the row vector
$$\mathbf{w}_1 = \sum_{i_1} \mathcal{T}_{\mathbf{X},1}(i_1,:) \otimes \mathcal{T}_{\mathbf{Y},1}(i_1,:),$$
which has dimension $R_1^{\mathbf{X}} \cdot R_1^{\mathbf{Y}}$. Note that $\mathbf{w}_1$ is the vectorization of the matrix $\mathcal{V}(\mathcal{T}_{\mathbf{Y},1})^\top\mathcal{V}(\mathcal{T}_{\mathbf{X},1})$. Then we distribute $\mathbf{w}_1$ to all terms within the next summation to compute $\mathbf{w}_2$ using
$$\mathbf{w}_2 = \sum_{i_2} \mathbf{w}_1 (\mathcal{T}_{\mathbf{X},2}(:,i_2,:) \otimes \mathcal{T}_{\mathbf{Y},2}(:,i_2,:)),$$
with each term in the summation evaluated via $\mathrm{vec}\!\left(\mathcal{T}_{\mathbf{Y},2}(:,i_2,:)^\top\mathbf{W}_1\mathcal{T}_{\mathbf{X},2}(:,i_2,:)\right)$, where $\mathbf{W}_1$ is a reshaping of the vector $\mathbf{w}_1$ into an $R_1^{\mathbf{Y}} \times R_1^{\mathbf{X}}$ matrix and $\mathrm{vec}$ is a row-wise vectorization operator. We note that $\mathcal{T}_{\mathbf{X},2}(:,i_2,:)$ is $R_1^{\mathbf{X}} \times R_2^{\mathbf{X}}$, and $\mathcal{T}_{\mathbf{Y},2}(:,i_2,:)$ is $R_1^{\mathbf{Y}} \times R_2^{\mathbf{Y}}$, and $\mathbf{w}_2$ therefore has dimension $R_2^{\mathbf{X}} \cdot R_2^{\mathbf{Y}}$. This process is repeated with
$$(3.1)\quad \mathbf{W}_n = \sum_{i_n} \mathcal{T}_{\mathbf{Y},n}(:,i_n,:)^\top\mathbf{W}_{n-1}\mathcal{T}_{\mathbf{X},n}(:,i_n,:),$$
until the last core, when we compute the inner product as
$$\langle\mathbf{X},\mathbf{Y}\rangle = \sum_{i_N} \mathcal{T}_{\mathbf{Y},N}(:,i_N)^\top\mathbf{W}_{N-1}\mathcal{T}_{\mathbf{X},N}(:,i_N),$$
where $\mathbf{W}_{N-1}$ is an $R_{N-1}^{\mathbf{Y}} \times R_{N-1}^{\mathbf{X}}$ matrix.

If all the tensor dimensions are the same and all TT ranks are the same, i.e., $I = I_1 = \cdots = I_N$ and $R = R_1^{\mathbf{X}} = R_1^{\mathbf{Y}} = \cdots = R_{N-1}^{\mathbf{X}} = R_{N-1}^{\mathbf{Y}}$, the computational complexity is approximately $4NIR^3$.

Evaluating (3.1) directly can exploit the efficiency of dense matrix multiplication, but it requires many calls to the BLAS subroutine. With some extra temporary memory, we can reduce the number of BLAS calls to 2, performing the same overall number of flops. Let $\mathbf{Z}$ be defined such that $\mathcal{H}(\mathcal{T}_{\mathbf{Z},n}) = \mathbf{W}_{n-1}\mathcal{H}(\mathcal{T}_{\mathbf{X},n})$, or the mode-1 multiplication between the core and the matrix, for $n = 1, \dots, N$ (with $\mathbf{W}_0 = 1$). Then, we have $\mathbf{W}_n$ as a contraction of modes 1 and 2 between cores of $\mathbf{Y}$ and $\mathbf{Z}$, or that
$$\mathbf{W}_n = \mathcal{V}(\mathcal{T}_{\mathbf{Y},n})^\top\mathcal{V}(\mathcal{T}_{\mathbf{Z},n}), \quad\text{for } n = 1, \dots, N.$$
Each of these two multiplications requires a single BLAS call because horizontal and vertical unfoldings are column major in memory. We note the final contraction in mode $N$ is a dot product instead of a matrix multiplication.
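A sequential NumPy sketch of this two-multiplication formulation of (3.1) (illustrative only, reusing the vertical helper sketched in Subsection 2.2; each mode performs one product on a horizontal unfolding and one on vertical unfoldings):

```python
import numpy as np

def tt_inner(xc, yc):
    """<X, Y> via the recurrence (3.1), two matrix products per mode."""
    W = np.ones((1, 1))                          # W_0 = 1
    for a, b in zip(xc, yc):                     # a = T_X,n, b = T_Y,n
        ra0, i, ra1 = a.shape
        # H(T_Z,n) = W_{n-1} H(T_X,n)
        Z = (W @ a.reshape(ra0, i * ra1)).reshape(b.shape[0], i, ra1)
        # W_n = V(T_Y,n)^T V(T_Z,n)
        W = vertical(b).T @ vertical(Z)
    return W.item()                              # 1x1, since R_N = 1
```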

When the input TT tensors are distributed across processors as described in Subsection 3.1, we can compute the inner product using this technique. Each term in the summation of (3.1), which involves corresponding slices of the input tensors, is evaluated by a single processor as long as the matrix $\mathbf{W}_n$ is available on each processor. Thus, the computation can be load balanced across processors as long as the distribution is load balanced, and each processor can apply the optimization to reduce BLAS calls independently. We perform an AllReduce collective operation to compute the summation for each mode. With constant tensor dimensions and TT ranks, the computational cost is approximately $4NIR^3/P$ and the communication cost is $\beta \cdot O(NR^2) + \alpha \cdot O(N\log P)$.

3.2.4. Norms. To compute the norm of a tensor in TT format, we consider two approaches. The first approach is to use the inner product algorithm described in Subsection 3.2.3 and the identity $\|\mathbf{X}\|^2 = \langle\mathbf{X},\mathbf{X}\rangle$. We note that in this case, the matrices $\{\mathbf{W}_n\}$ are symmetric and positive semi-definite (SPSD), see (3.1), and the structured matrix-vector products can exploit this property to save roughly half the computation. Since $\mathbf{W}_n$ is SPSD, it admits a triangular factorization given by pivoted Cholesky (or LDL): $\mathbf{W}_n = \mathbf{P}_n\mathbf{L}_n\mathbf{L}_n^\top\mathbf{P}_n^\top$. Thus, the matrix $\mathbf{W}_n$ is computed as $\mathbf{W}_n = \mathcal{V}(\mathcal{T}_{\mathbf{Z},n})^\top\mathcal{V}(\mathcal{T}_{\mathbf{Z},n})$, where $\mathcal{H}(\mathcal{T}_{\mathbf{Z},n}) = \mathbf{L}_{n-1}^\top(\mathbf{P}_{n-1}^\top\mathcal{H}(\mathcal{T}_{\mathbf{X},n}))$. The triangular multiplication to compute the $n$th core of $\mathbf{Z}$ and the symmetric multiplication to compute $\mathbf{W}_n$ each require half the flops of a normal matrix multiplication, so the overall computational complexity of this approach is $2NIR^3$. It is parallelized in the same way as the general inner product.

The second approach is to first right- or left-orthogonalize the tensor using Algorithm 2.1, and then the norm of the tensor is given by $\|\mathcal{T}_{\mathbf{X},1}\|_F$ or $\|\mathcal{T}_{\mathbf{X},N}\|_F$, as shown in Subsection 2.3. When the TT tensor is distributed, the orthogonalization procedure is more complicated than computing inner products; we describe the parallel algorithm in Subsection 3.4.
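A sketch of the second approach, reusing the sequential right_orthogonalize sketch from Subsection 2.3 (illustrative only):

```python
import numpy as np

def tt_norm(cores):
    """||X||_F via orthogonalization: the norm moves into the first core."""
    right_orthogonalize(cores)          # all cores but the first become orthogonal
    return np.linalg.norm(cores[0])     # Frobenius norm of T_X,1
```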

3.2.5. Matrix-Vector Multiplication. In order to build Krylov-like iterative methods to solve linear systems with solutions in TT-format, we must also be able to apply a matrix operator to a vector in TT-format. We will consider a restricted set of matrix operators: sums of Kronecker products of matrices [10, 24, 26, 39].

Each term in the sum can be seen as a generalization of a rank-one tensor to the operator case. We use the notation
$$\mathbf{A} = \mathbf{A}_1 \otimes \cdots \otimes \mathbf{A}_N$$
to denote a single Kronecker product of matrices, where the dimensions of $\mathbf{A}_n$ are $I_n \times I_n$, conforming to the dimensions of $\mathbf{X}$ in TT-format. In this case, we can compute the matrix-vector multiplication $\mathrm{vec}(\mathbf{Y}) = \mathbf{A}\cdot\mathrm{vec}(\mathbf{X})$, where
$$\begin{aligned}
\mathbf{Y}(i_1,\dots,i_N) &= \sum_{j_1,\dots,j_N} \mathbf{A}_1(i_1,j_1)\cdots\mathbf{A}_N(i_N,j_N)\cdot\mathbf{X}(j_1,\dots,j_N) \\
&= \sum_{j_1,\dots,j_N} \mathbf{A}_1(i_1,j_1)\cdots\mathbf{A}_N(i_N,j_N)\cdot\mathcal{T}_{\mathbf{X},1}(j_1,:)\cdots\mathcal{T}_{\mathbf{X},N}(:,j_N) \\
&= \sum_{j_1} \mathbf{A}_1(i_1,j_1)\mathcal{T}_{\mathbf{X},1}(j_1,:)\cdots\sum_{j_N} \mathbf{A}_N(i_N,j_N)\mathcal{T}_{\mathbf{X},N}(:,j_N) \\
&= \mathcal{T}_{\mathbf{Y},1}(i_1,:)\cdots\mathcal{T}_{\mathbf{Y},N}(:,i_N),
\end{aligned}$$
with $\mathcal{T}_{\mathbf{Y},1} = \mathbf{A}_1\mathcal{T}_{\mathbf{X},1}$, $\mathcal{T}_{\mathbf{Y},n} = \mathcal{T}_{\mathbf{X},n} \times_2 \mathbf{A}_n$ for $1 < n < N$, and $\mathcal{T}_{\mathbf{Y},N} = \mathcal{T}_{\mathbf{X},N}\mathbf{A}_N^\top$. Here the notation $\times_2$ refers to the mode-2 tensor-matrix product, defined so that
$$\mathcal{T}_{\mathbf{Y},n}(r_{n-1},:,r_n) = \mathbf{A}_n\mathcal{T}_{\mathbf{X},n}(r_{n-1},:,r_n)$$
for $1 < n < N$, $1 \le r_{n-1} \le R_{n-1}$, and $1 \le r_n \le R_n$.

Thus, applying a Kronecker product of matrices to a vector in TT-format maintains the TT-format with the same ranks, and operations on cores can be performed independently. In order to apply an operator that is a sum of multiple Kronecker products of matrices, we can apply each term separately and use the summation procedure described in Subsection 3.2.1 along with TT-rounding to control rank growth. We note that it is possible to apply more general forms of tensorized operators to vectors in TT-format [30], but we do not consider them here.
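A core-wise NumPy sketch of applying a single Kronecker-product operator (illustrative; with unit outer ranks the same mode-2 contraction also covers the first and last cores, which the text treats as matrices):

```python
import numpy as np

def kron_apply(mats, xc):
    """Apply A_1 (x) ... (x) A_N to vec(X) in TT format; ranks are unchanged."""
    # T_Y,n = T_X,n x_2 A_n: contract A_n against the mode-2 (slice) index
    return [np.einsum('ij,rjs->ris', A, core) for A, core in zip(mats, xc)]
```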

When the vector in TT-format is distributed as described in Subsection 3.1, we must perform the mode-2 tensor-matrix product using a parallel algorithm. We can view the mode-2 tensor-matrix product as applying the matrix to the mode-2 unfolding of the tensor core $\mathcal{T}_{\mathbf{X},n}$ (often denoted with subscript (2) [22]), which has dimensions $I_n \times R_{n-1}R_n$. We observe that the parallel distribution of the mode-2 unfolding of $\mathcal{T}_{\mathbf{X},n}$ is 1D row-distributed: each processor owns a subset of the rows of the matrix (corresponding to slices of the core tensor). Thus, the application of $\mathbf{A}_n$ to this unfolding has the same algorithmic structure as the sparse-matrix-times-multiple-vectors operation (SpMM) where all vectors have the same parallel distribution. Assuming the matrix $\mathbf{A}_n$ is sparse and also row-distributed, as is common in libraries such as PETSc [4] and Trilinos [17], the parallel algorithm involves communication of input tensor core slices among processors, where the communication pattern is determined by $\mathbf{A}_n$ and its distribution. We do not explore experimental results for such matrix-vector multiplications in this paper, as the performance depends heavily on the application and sparsity structure of the operator matrices.

3.3. TSQR. To compute the QR factorizations within the TT-rounding procedure in parallel, we use the Tall-Skinny QR algorithm [12], which is designed (and communication efficient) for matrices with many more rows than columns. For completeness, we present the TSQR subroutine as Algorithm 3.1, which corresponds to [5, Alg. 7], and the TSQR-Apply-Q subroutine as Algorithm 3.2. The subroutines assume a power-of-two number of processors to simplify the pseudocode; see Appendix B for the generalizations to any number of processors.

For a tall-skinny matrix that is 1D row distributed over processors (as is the case for the vertical unfolding and the transpose of the horizontal unfolding), the parallel Householder QR algorithm requires synchronizations for each column of the matrix (to compute and apply each Householder vector). The idea of the TSQR algorithm is that the entire factorization can be computed using a single reduction across processors. The price of this reduction is that the implicit representation of the orthogonal factor is more complicated than a single set of Householder vectors, and that the representation depends on the structure of the reduction tree. We can maintain and apply the orthogonal factor in this implicit form as long as the parallel algorithm for applying it uses a consistent tree structure. We note that we employ the "butterfly" variant of TSQR, which corresponds to an all-reduce-like collective operation such that at the end of the algorithm the triangular factor $\mathbf{R}$ is owned by all processors redundantly. Another variant uses a binomial tree, corresponding to a reduce-like collective with the triangular factor owned by a single processor. We compare the performance of these two variants in Subsection 4.2.1.

3.3.1. Factorization. TSQR (Algorithm 3.1) has two phases: orthogonalization of the local submatrix (line 3) and parallel reduction of the remaining triangular factors (lines 4 through 12). The cost of TSQR is
$$(3.2)\quad \gamma\cdot\left(\frac{2mb^2}{P} + O(b^3\log P)\right) + \beta\cdot O(b^2\log P) + \alpha\cdot O(\log P),$$
where $m$ is the number of rows and $b$ is the number of columns [12]. The leading order flop cost is the (Householder) QR of the local $(m/P) \times b$ submatrix (line 3), the leaf of the TSQR tree. The communication costs come from the TSQR tree, which has height $O(\log P)$.

Algorithm 3.1 Parallel Butterfly TSQR
Require: $\mathbf{A}$ is an $m \times b$ matrix 1D-distributed so that proc $p$ owns row block $\mathbf{A}^{(p)}$
Require: Number of procs is a power of two; see Algorithm B.1 for the general case
Ensure: $\mathbf{A} = \mathbf{Q}\mathbf{R}$ with $\mathbf{R}$ owned by all procs and $\mathbf{Q}$ represented by $\{\mathbf{Y}_\ell^{(p)}\}$ with redundancy $\mathbf{Y}_\ell^{(p)} = \mathbf{Y}_\ell^{(q)}$ for $p \equiv q \bmod 2^\ell$ and $\ell < \log P$
1: function $[\{\mathbf{Y}_\ell^{(p)}\}, \mathbf{R}]$ = Par-TSQR($\mathbf{A}^{(p)}$)
2:   $p$ = MyProcID()
3:   $[\mathbf{Y}_{\log P}^{(p)}, \bar{\mathbf{R}}_{\log P}^{(p)}]$ = Local-QR($\mathbf{A}^{(p)}$)  ⊳ Leaf node QR
4:   for $\ell = \log P - 1$ down to 0 do
5:     $j = 2^{\ell+1}\lfloor p/2^{\ell+1}\rfloor + (p + 2^\ell) \bmod 2^{\ell+1}$  ⊳ Determine partner
6:     Send $\bar{\mathbf{R}}_{\ell+1}^{(p)}$ to and receive $\bar{\mathbf{R}}_{\ell+1}^{(j)}$ from proc $j$  ⊳ Communication
7:     if $p < j$ then
8:       $[\mathbf{Y}_\ell^{(p)}, \bar{\mathbf{R}}_\ell^{(p)}]$ = Local-QR$\left(\begin{bmatrix}\bar{\mathbf{R}}_{\ell+1}^{(p)} \\ \bar{\mathbf{R}}_{\ell+1}^{(j)}\end{bmatrix}\right)$  ⊳ Tree node QR
9:     else
10:      $[\mathbf{Y}_\ell^{(p)}, \bar{\mathbf{R}}_\ell^{(p)}]$ = Local-QR$\left(\begin{bmatrix}\bar{\mathbf{R}}_{\ell+1}^{(j)} \\ \bar{\mathbf{R}}_{\ell+1}^{(p)}\end{bmatrix}\right)$  ⊳ Partner tree node QR
11:    end if
12:  end for
13:  $\mathbf{R} = \bar{\mathbf{R}}_0^{(p)}$
14: end function
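For illustration, a minimal mpi4py sketch of the butterfly factorization (a sketch only: it assumes a power-of-two process count and local blocks with at least $b$ rows, and it returns only the replicated triangular factor, discarding the implicit representation $\{\mathbf{Y}_\ell\}$ that the full algorithm retains for the apply phase):

```python
import numpy as np
from mpi4py import MPI

def butterfly_tsqr_R(A_loc, comm=MPI.COMM_WORLD):
    """Replicated R factor of the QR of the stacked local blocks A_loc."""
    p, P = comm.Get_rank(), comm.Get_size()
    R = np.ascontiguousarray(np.linalg.qr(A_loc, mode='r'))     # leaf QR
    for ell in range(P.bit_length() - 2, -1, -1):               # log2(P)-1 .. 0
        j = p ^ (1 << ell)   # partner: the floor/mod formula reduces to an XOR
        R_j = np.empty_like(R)
        comm.Sendrecv(R, dest=j, recvbuf=R_j, source=j)         # butterfly exchange
        pair = np.vstack((R, R_j)) if p < j else np.vstack((R_j, R))
        R = np.ascontiguousarray(np.linalg.qr(pair, mode='r'))  # tree node QR
    return R                 # b x b, owned redundantly by every process
```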

3.3.2. Applying and Forming Q. The structure of the TSQR-Apply-Q algorithm (Algorithm 3.2) matches that of TSQR, but in reverse order (because the TSQR algorithm corresponds to applying $\mathbf{Q}^\top$). Thus, the root of the tree is applied first and the leaves last. However, by using a butterfly tree, the communication cost of the TSQR-Apply-Q algorithm is 0 if the number of processors is a power of 2 and $\beta\cdot bc + \alpha$ otherwise (the cost of one message; see Appendix B). The cost of TSQR-Apply-Q is then
$$(3.3)\quad \gamma\cdot\left(\frac{4mbc}{P} + O(b^2c\log P)\right) + \beta\cdot bc + \alpha,$$
where the additional parameter $c$ is the number of columns of $\mathbf{C}$. The leading order flop cost is the application of the local $\mathbf{Q}$ matrix at the leaf of the TSQR tree (line 12 of Algorithm 3.2).

Using a binomial tree TSQR algorithm requires more communication in the application phase (see [5, Algorithm 8], for example). We also note that if the input matrix $\mathbf{C}$ is upper triangular, then the leading constant can be reduced from 4 to 2 by exploiting the sparsity structure in this local application (and within the tree, because all $\bar{\mathbf{B}}_\ell^{(p)}$ matrices are upper triangular in this case, throughout the algorithm), which matches the computation cost of the factorization. In particular, when we form $\mathbf{Q}$ explicitly, we can use this algorithm with $\mathbf{C}$ as the identity matrix, which is upper triangular.

Algorithm 3.2 Parallel Application of Implicit $\mathbf{Q}$ from Butterfly TSQR
Require: $\{\mathbf{Y}_\ell^{(p)}\}$ represents orthogonal matrix $\mathbf{Q}$ computed by Algorithm 3.1
Require: $\mathbf{C}$ is $b \times c$ and redundantly owned by all processors
Require: Number of procs is a power of two; see Algorithm B.2 for the general case
Ensure: $\mathbf{B} = \mathbf{Q}\mathbf{C}$ is $m \times c$ and 1D-distributed so that proc $p$ owns row block $\mathbf{B}^{(p)}$
1: function $\mathbf{B}$ = Par-TSQR-Apply-Q($\{\mathbf{Y}_\ell^{(p)}\}, \mathbf{C}$)
2:   $p$ = MyProcID()
3:   $\bar{\mathbf{B}}_0^{(p)} = \mathbf{C}$
4:   for $\ell = 0$ to $\log P - 1$ do
5:     $j = 2^{\ell+1}\lfloor p/2^{\ell+1}\rfloor + (p + 2^\ell) \bmod 2^{\ell+1}$  ⊳ Determine partner
6:     if $p < j$ then
7:       $\begin{bmatrix}\bar{\mathbf{B}}_{\ell+1}^{(p)} \\ \bar{\mathbf{B}}_{\ell+1}^{(j)}\end{bmatrix}$ = Loc-Apply-Q$\left(\begin{bmatrix}\mathbf{I}_b \\ \mathbf{Y}_\ell^{(p)}\end{bmatrix}, \begin{bmatrix}\bar{\mathbf{B}}_\ell^{(p)} \\ \mathbf{0}\end{bmatrix}\right)$  ⊳ Tree node apply
8:     else
9:       $\begin{bmatrix}\bar{\mathbf{B}}_{\ell+1}^{(j)} \\ \bar{\mathbf{B}}_{\ell+1}^{(p)}\end{bmatrix}$ = Loc-Apply-Q$\left(\begin{bmatrix}\mathbf{I}_b \\ \mathbf{Y}_\ell^{(p)}\end{bmatrix}, \begin{bmatrix}\bar{\mathbf{B}}_\ell^{(p)} \\ \mathbf{0}\end{bmatrix}\right)$  ⊳ Partner tree node apply
10:    end if
11:  end for
12:  $\mathbf{B}^{(p)}$ = Loc-Apply-Q$\left(\mathbf{Y}_{\log P}^{(p)}, \begin{bmatrix}\bar{\mathbf{B}}_{\log P}^{(p)} \\ \mathbf{0}\end{bmatrix}\right)$  ⊳ Leaf node apply
13: end function

3.4. TT Orthogonalization. Algorithm 3.3 shows right orthogonalization and is a parallelization of Algorithm 2.1. The approach for left orthogonalization is analogous. The algorithm is performed via a sequential sweep over the cores, where at each iteration, an LQ factorization row-orthogonalizes the horizontal unfolding of a core and the triangular factor is applied to its left neighbor core. The 1D parallel distribution of each core implies that the transpose of the horizontal unfolding is 1D row distributed, fitting the requirements of the TSQR algorithm. Note that we perform a QR factorization of the transpose of the horizontal unfolding, which corresponds to an LQ factorization of the unfolding itself.

Figure 3.2 depicts the operations within a single iteration of the sweep. At iteration $n$, TSQR is applied to the $n$th core in line 3 of Algorithm 3.3 (Figure 3.2b) and then the orthogonal factor is formed explicitly in line 4 (Figure 3.2c). The notation $\{\mathbf{Y}_{\ell,n}^{(p)}\}$ signifies the set of triangular matrices owned by processor $p$ in the implicit representation of the QR factorization of the $n$th core, where $\ell$ refers to the level of the tree and indexes the set. In the case $P$ is a power of 2, each processor owns $\log P$ matrices in its set. Because the TSQR subroutine ends with all processors owning the triangular factor $\mathbf{R}_n$, each processor can apply it to core $n-1$ in the 3rd mode without further communication via local matrix multiplication in line 5 (Figure 3.2d).

Lines 3 and 4 of Algorithm 3.3 have the costs given by (3.2) and (3.3) with $m = I_nR_n$ and $b = c = R_{n-1}$. Since the computation to form the explicit $\mathbf{Q}$ matrix exploits the sparsity structure of the identity matrix, the constant 4 in (3.3) is reduced to 2. These two lines together cost
$$\gamma\cdot\left(\frac{4I_nR_nR_{n-1}^2}{P} + O(R_{n-1}^3\log P)\right) + \beta\cdot O(R_{n-1}^2\log P) + \alpha\cdot O(\log P).$$
Line 5 of Algorithm 3.3 is a local triangular matrix multiplication that costs $\gamma\cdot I_{n-1}R_{n-2}R_{n-1}^2/P$. Assuming $I_k = I$ and $R_k = R$ for $1 \le k \le N-1$, the total cost of TT orthogonalization is then
$$(3.4)\quad \gamma\cdot\left(\frac{5NIR^3}{P} + O(NR^3\log P)\right) + \beta\cdot O(NR^2\log P) + \alpha\cdot O(N\log P).$$

Algorithm 3.3 Parallel TT-Right-Orthogonalization
Require: $\mathbf{X}$ in TT format with each core 1D-distributed
Ensure: $\mathbf{X}$ is right orthogonal, in TT format with same distribution
1: function Par-TT-Right-Orthogonalization($\{\mathcal{T}_{\mathbf{X},n}^{(p)}\}$)
2:   for $n = N$ down to 2 do
3:     $[\{\mathbf{Y}_{\ell,n}^{(p)}\}, \mathbf{R}_n]$ = TSQR($\mathcal{H}(\mathcal{T}_{\mathbf{X},n}^{(p)})^\top$)  ⊳ QR factorization
4:     $\mathcal{H}(\mathcal{T}_{\mathbf{X},n}^{(p)})^\top$ = TSQR-Apply-Q($\{\mathbf{Y}_{\ell,n}^{(p)}\}, \mathbf{I}_{R_{n-1}}$)  ⊳ Form explicit $\mathbf{Q}$
5:     $\mathcal{V}(\mathcal{T}_{\mathbf{X},n-1}^{(p)}) = \mathcal{V}(\mathcal{T}_{\mathbf{X},n-1}^{(p)})\cdot\mathbf{R}_n^\top$  ⊳ Apply $\mathbf{R}$ to previous core
6:   end for
7: end function

Fig. 3.2: Steps performed in TT right orthogonalization: (a) two consecutive cores; (b) QR factorization of $\mathcal{H}(\mathcal{T}_{\mathbf{X},n})^\top$ (an LQ factorization of the unfolding); (c) update of the $n$th core, $\mathcal{H}(\mathcal{T}_{\mathbf{X},n}) := \mathbf{Q}^\top$; (d) update of the $(n-1)$th core, $\mathcal{V}(\mathcal{T}_{\mathbf{X},n-1}) := \mathcal{V}(\mathcal{T}_{\mathbf{X},n-1})\cdot\mathbf{R}^\top$.

3.5. TT Rounding. We present the parallel TT rounding procedure in Algorithm 3.4, which is a parallelization of Algorithm 2.2. The computation consists of two sweeps over the cores, one to orthogonalize and one to truncate. The algorithm shown performs right-orthogonalization and then truncates left to right; the other ordering works analogously.

Algorithm 3.4 does not call Algorithm 3.3 to perform the orthogonalization sweep. This is because Algorithm 3.3 forms the orthogonalized cores explicitly, and Algorithm 3.4 can leave the orthogonalized cores from the first sweep in implicit form to be applied during the second sweep.

Iteration $n$ of the right-to-left orthogonalization sweep occurs in lines 3 and 4 of Algorithm 3.4, which match Algorithm 3.3 except for the explicit formation of the orthogonal factor.

Algorithm 3.4 Parallel TT-Rounding
Require: $\mathbf{X}$ in TT format with each core 1D-distributed over a $1 \times P \times 1$ processor grid
Ensure: $\mathbf{Y}$ in TT format with reduced ranks identically distributed across processors
1: function $\{\mathcal{T}_{\mathbf{Y},n}^{(p)}\}$ = Par-TT-Rounding($\{\mathcal{T}_{\mathbf{X},n}^{(p)}\}, \epsilon$)
2:   for $n = N$ down to 2 do
3:     $[\{\mathbf{Y}_{\ell,n}^{(p)}\}, \mathbf{R}_n]$ = TSQR($\mathcal{H}(\mathcal{T}_{\mathbf{X},n}^{(p)})^\top$)  ⊳ QR factorization
4:     $\mathcal{V}(\mathcal{T}_{\mathbf{X},n-1}^{(p)}) = \mathcal{V}(\mathcal{T}_{\mathbf{X},n-1}^{(p)})\cdot\mathbf{R}_n^\top$  ⊳ Apply $\mathbf{R}$ to previous core
5:   end for
6:   Compute $\|\mathbf{X}\|$
7:   $\mathbf{Y} = \mathbf{X}$
8:   for $n = 1$ to $N-1$ do
9:     $[\{\mathbf{Y}_{\ell,n}^{(p)}\}, \mathbf{R}_n]$ = TSQR($\mathcal{V}(\mathcal{T}_{\mathbf{Y},n}^{(p)})$)  ⊳ QR factorization
10:    $[\hat{\mathbf{U}}_R, \hat{\boldsymbol{\Sigma}}, \hat{\mathbf{V}}]$ = tSVD($\mathbf{R}_n, \frac{\epsilon}{\sqrt{N-1}}\|\mathbf{X}\|$)  ⊳ Redundant truncated SVD of $\mathbf{R}$
11:    $\mathcal{V}(\mathcal{T}_{\mathbf{Y},n}^{(p)})$ = TSQR-Apply-Q($\{\mathbf{Y}_{\ell,n}^{(p)}\}, \hat{\mathbf{U}}_R$)  ⊳ Form explicit $\hat{\mathbf{U}}$
12:    $\mathcal{H}(\mathcal{T}_{\mathbf{Y},n+1}^{(p)})^\top$ = TSQR-Apply-Q($\{\mathbf{Y}_{\ell,n+1}^{(p)}\}, \hat{\mathbf{V}}\hat{\boldsymbol{\Sigma}}$)  ⊳ Apply $\hat{\boldsymbol{\Sigma}}\hat{\mathbf{V}}^\top$
13:  end for
14: end function

Thus, the cost of the orthogonalization sweep is
$$(3.5)\quad \gamma\cdot\left(\frac{3NIR^3}{P} + O(NR^3\log P)\right) + \beta\cdot O(NR^2\log P) + \alpha\cdot O(N\log P).$$

At iteration $n$ of the second loop, lines 9 to 12 of Algorithm 3.4 implement the left-to-right truncation procedure for the $n$th core in parallel. Line 9 is a QR factorization and has cost given by Equation (3.2) with $m = I_nL_{n-1}$ and $b = R_n$, as the number of rows of $\mathcal{V}(\mathcal{T}_{\mathbf{Y},n}^{(p)})$ has been reduced from $I_nR_{n-1}$ to $I_nL_{n-1}$ during iteration $n-1$:
$$\gamma\cdot\left(\frac{2I_nL_{n-1}R_n^2}{P} + O(R_n^3\log P)\right) + \beta\cdot O(R_n^2\log P) + \alpha\cdot O(\log P).$$
We note that we re-use the notation $\{\mathbf{Y}_{\ell,n}^{(p)}\}$ to store the implicit factorization; while the same variable stored the orthogonal factor of the $n$th core's horizontal unfolding from the orthogonalization sweep, it can be overwritten by this step of the algorithm (the set of matrices will now have different dimensions). Line 10 requires $O(R_n^3)$ flops, assuming the full SVD is computed before truncating. Line 11 implicitly applies an orthogonal matrix to an $R_n \times L_n$ matrix $\hat{\mathbf{U}}_R$ with cost given by Equation (3.3) with $m = I_nL_{n-1}$, $b = R_n$, and $c = L_n$:
$$\gamma\cdot\left(\frac{4I_nL_{n-1}R_nL_n}{P} + O(R_n^2L_n\log P)\right) + \beta\cdot R_nL_n + \alpha.$$
Line 12 implicitly applies an orthogonal matrix to an $R_n \times L_n$ matrix $\hat{\mathbf{V}}\hat{\boldsymbol{\Sigma}}$ with cost given by Equation (3.3) with $m = I_{n+1}R_{n+1}$, $b = R_n$, and $c = L_n$:
$$\gamma\cdot\left(\frac{4I_{n+1}R_{n+1}R_nL_n}{P} + O(R_n^2L_n\log P)\right) + \beta\cdot R_nL_n + \alpha.$$

Fig. 3.3: Steps performed in an iteration of the TT left-to-right truncation: (a) two consecutive cores; (b) QR factorization of $\mathcal{V}(\mathcal{T}_{\mathbf{X},n})$; (c) truncated SVD of $\mathbf{R}$; (d) update of the $n$th core, $\mathcal{V}(\mathcal{T}_{\mathbf{X},n}) := \mathbf{Q}\mathbf{U}$; (e) update of the $(n{+}1)$th core, $\mathcal{H}(\mathcal{T}_{\mathbf{X},n+1}) := \boldsymbol{\Sigma}\mathbf{V}^\top\mathcal{H}(\mathcal{T}_{\mathbf{X},n+1})$.

Assuming $I_k = I$, $R_k = R$, and $L_k = L$ for $1 \le k \le N-1$, the total cost of Algorithm 3.4 is then
$$(3.6)\quad \gamma\cdot\left(\frac{NIR(3R^2 + 6RL + 4L^2)}{P} + O(NR^3\log P)\right) + \beta\cdot O(NR^2\log P) + \alpha\cdot O(N\log P).$$
We note that leaving the orthogonal factors in implicit form during the orthogonalization sweep (as opposed to calling Algorithm 3.3) saves up to 40% of the computation when the reduced ranks $L_n$ are much smaller than the original ranks $R_n$. As the rank reduction diminishes, so does the advantage of the implicit optimization. For example, when the ranks are all cut in half, the reduction in leading order flop cost is 12.5%.

| Model | # Modes | Dimensions | Ranks | Memory |
|---|---|---|---|---|
| 1 | 50 | 2K × ⋯ × 2K | 50 | 2 GB |
| 2 | 16 | 100M × 50K × ⋯ × 50K × 1M | 30 | 28 GB |
| 3 | 30 | 2M × ⋯ × 2M | 30 | 385 GB |

Table 4.1: Synthetic TT models used for performance experiments. In each case the formal ranks are all the same and are cut in half by the TT rounding procedure.

4. Numerical Experiments. In this section we present performance results for TT computations using synthetic tensors with mode and dimension parameters inspired by physics and chemistry applications, as described in Subsection 4.1. We first present microbenchmarks in Subsection 4.2 to justify key design decisions, and then demonstrate performance efficiency and parallel scaling in Subsection 4.3.

All numerical experiments are run on the Max Planck Society supercomputer COBRA. All computation nodes contain two Intel Xeon Gold 6148 processors (Skylake, 20 cores each at 2.4 GHz) and 192 GB of memory, and the nodes are connected through a 100 Gb/s OmniPath interconnect. We link to MKL 2020.1 for single-threaded BLAS and LAPACK subroutines.

4.1. Synthetic TT Models. As we are interested in large-scale systems, we consider two contexts of applications in which a large number of modes exists. The first context is the existence of many modes, each of relatively the same (large) dimension; the second context is a single mode or a few modes with large dimension along with many modes of relatively smaller dimension. Table 4.1 presents the details of the three models of synthetic tensors we use in the experiments, in order of their memory size. The first and third models correspond to the first context (all modes of the same dimension) and the second model corresponds to the second context (two large modes and many more smaller modes). The first model is chosen to be small enough to be processed by a single core, while the second and third are larger and benefit more from distributed-memory parallelization (the third does not fit in the memory of a single node). The paragraphs below describe the applications that inspire these choices of modes and dimensions.

In all experiments, we generate a random TT tensor $\mathbf{X}$ with a given number of modes $N$, mode sizes $I_n$ for $n = 1, \dots, N$, and TT ranks $R_n^{\mathbf{X}}$ for $n = 1, \dots, N-1$. Then, we form the formal TT tensor $\mathbf{Y} = 2\mathbf{X} - \mathbf{X}$, which has the formal ranks $R_n^{\mathbf{Y}} = 2R_n^{\mathbf{X}}$ for $n = 1, \dots, N-1$. The algorithms are then applied to the TT tensor $\mathbf{Y}$. Note that the minimal TT ranks of $\mathbf{Y}$ are less than or equal to the TT ranks of $\mathbf{X}$.
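This generation procedure can be reproduced with the sequential sketches from earlier sections (illustrative only; tt_add and tt_round are the sketches from Subsections 3.2.1 and 2.4, and the dimensions here are small stand-ins rather than the models of Table 4.1):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_tt(dims, rank):
    """Random TT tensor with all interior ranks equal to `rank`."""
    ranks = [1] + [rank] * (len(dims) - 1) + [1]
    return [rng.standard_normal((ranks[n], dims[n], ranks[n + 1]))
            for n in range(len(dims))]

X = random_tt([200] * 10, 10)
two_X = [2 * c if n == 0 else c for n, c in enumerate(X)]    # scale first core
minus_X = [-c if n == 0 else c for n, c in enumerate(X)]
Y = tt_add(two_X, minus_X)          # formal ranks 2R, minimal ranks <= R
tt_round(Y, 1e-10)                  # rounding cuts the formal ranks back down
print([core.shape for core in Y])
```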

High-Order Correlation Functions. In the study of stochastic processes, Gaussian random fields are widely used. If $f$ is a Gaussian random field defined on a bounded domain $\Omega \subset \mathbb{R}^N$, an $N$-point correlation function for $f$ is defined on $\Omega^N$. These $N$-point correlation functions can often be efficiently approximated in TT format via a cross approximation algorithm [24]. Typically, cross approximation algorithms induce larger ranks. Thus, compressing the resulting TT tensors is required to maintain the tractability of computations.

Molecular Simulations. Another important class of applications is molecular simulations. For example, when a spin system can be considered as a weakly branched linear chain, it is typical to represent it as a TT tensor [34]. Each branch is then considered as a spatial coordinate (mode). The number of branches can be arbitrarily large; for example, a simple backbone protein may have hundreds of branches. The TT representation is then inherited from the weak correlation between the branches. However, in the same branch, the correlation between spins cannot be ignored, and thus the exponential growth in the number of states cannot be avoided.

Parameter-Dependent PDEs. In the second context, one or a few modes may be much larger than the rest. This is typically the case in physical applications such as parameter-dependent PDEs, stochastic PDEs, uncertainty quantification, and optimal control systems [7, 8, 9, 13, 18, 26, 32]. In such applications, the spatial discretization leads to a high number of degrees of freedom. This typically results from large domains, refinement procedures, and a large number of parameter samples. Most of the other modes correspond to control or uncertainty parameters and can have relatively smaller dimensions.

4.2. Microbenchmarks. We next present experimental results for microbenchmarks to justify our choices for subroutine algorithms and optimizations. The results presented in Subsection 4.3 use the best-performing variants and optimizations demonstrated in this section.

4.2.1. TSQR. As discussed in Subsection 3.3, the TSQR algorithm depends on a hierarchical tree. Two tree choices are commonly used in practice, the binomial tree and the butterfly tree. In both cases TSQR computes the QR decomposition with the same computation and communication costs along the critical path, whereas the butterfly requires less communication cost along the critical path of the application of the implicit orthogonal factor.

Here we compare the performance of the TSQR algorithms using the binomial and butterfly trees for both factorization and a single application of the orthogonal factor. Since the difference in their costs is solely related to the number of columns, we fix the number of rows in the comparison and vary the number of columns. Figure 4.1 reports the breakdown of time of the variants using 256 nodes with 4 MPI processes per node (2 cores per socket). The local matrix size on each processor is $1{,}000 \times b$, where $b$ varies in $\{40, 80, 120, 160\}$. We observe that the butterfly tree has better performance in terms of communication time in the application phase. Note that the factorization runtime (computation and communication) is relatively the same for both variants. We also time the cost of communication of the triangular factor $\mathbf{R}$, which is required of the binomial variant in the context of TT-rounding, but that cost is negligible in these experiments.

Based on these results (and corroborating experiments with various other parameters), we use the butterfly variant of TSQR for TT computations that require TSQR in all subsequent numerical experiments.

Fig. 4.1: Time breakdown for TSQR variants for a $1{,}024{,}000 \times b$ matrix over 1024 processors, including both factorization and application of the orthogonal factor to a dense $b \times b$ matrix.

4.2.2. TT Rounding. In this section, we consider 4 variants of TT rounding (Algorithm 3.4), based on the orthogonalization/truncation ordering and the use of the implicit orthogonal factor optimization. As discussed in Subsection 2.4, the rounding procedure can perform right- or left-orthogonalization followed by a truncation phase in the opposite direction. We refer to the ordering based on right-orthogonalization and left-truncation as RLR and the ordering based on left-orthogonalization and right-truncation as LRL. The implicit optimization avoids the explicit formation of orthogonal factors during the orthogonalization phase; instead of using Algorithm 3.3 as a black-box subroutine, Algorithm 3.4 leaves orthogonal factors in implicit TSQR form as much as possible, saving a constant factor of computation (and a small amount of communication).

Although the asymptotic complexities of the variants of the rounding procedure are equal, their performance is not the same. This disparity between the RLR and LRL orderings is because of the performance difference between the QR and the LQ implementations of the LAPACK subroutines provided by MKL. Despite the same computational complexity, the QR subroutines have much better performance than the LQ subroutines.

In the LRL ordering, a sequence of calls to the QR subroutine is performed on the vertically unfolded TT cores $\mathcal{T}_{\mathbf{X},n}$ with the increased ranks $R_{n-1}, R_n$. Along the truncation sweep, the LQ subroutine is called in a sequence to factor the horizontally unfolded TT cores $\mathcal{T}_{\mathbf{X},n}$ with one reduced rank, $R_{n-1}, L_n$. As presented in Subsections 3.4 and 3.5, the RLR ordering employs the QR and LQ subroutines in the opposite order. Because the truncation phase involves less computation within local QR/LQ subroutine calls than the orthogonalization phase, the LRL ordering has the advantage that it spends less computation time in LQ subroutine calls than the RLR ordering.

The effect of the implicit optimization is a reduction in computation (approximately 12.5% in these experiments) and communication, but this advantage is offset in part by the performance of local subroutines. The implicit application of the orthogonal factor involves auxiliary LAPACK routines for applying sets of Householder vectors in various formats. The explicit multiplication of an orthogonal factor with a small square matrix involves a broadcast and a local subroutine call to matrix multiplication, which has much higher performance than the auxiliary routines involving Householder vectors. We use an "I" to indicate the use of the implicit optimization, so that the 4 variants are LRLI, LRL, RLRI, and RLR.

Fig. 4.2: Performance comparison of TT-rounding variants for large TT models on 32 nodes (1,280 cores): (a) Model 2; (b) Model 3. LRL refers to left-orthogonalization followed by right-truncation (vice versa for RLR) and I indicates the use of the implicit optimization.

Figure 4.2 presents the performance results for TT Models 2 and 3 running on 32 nodes (1,280 cores). We see that for both models, the LRL ordering with the implicit optimization

(LRLI) is the fastest. In the case of Model 2, the implicit optimization makes more

of a diﬀerence than the ordering. This is because a considerable amount of time is

spent in the ﬁrst mode, where the QR is used (once) in either ordering. In the case of

Model 3, the ordering makes a much larger diﬀerence in running time, as the internal

modes dominate the running time and the QR/LQ diﬀerence has a large eﬀect. The

implicit optimization still improves performance, but it has less of an eﬀect than the

ordering. Based on these results, we use the LRLI variant of TT-rounding in all the

experiments presented in Subsection 4.3.
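To illustrate the implicit/explicit distinction discussed above, the following SciPy sketch (ours; it assumes the scipy.linalg.lapack wrappers for dgeqrf, dormqr, and dorgqr) applies the orthogonal factor of a tall-skinny QR to a small matrix both ways:

    import numpy as np
    from scipy.linalg import lapack

    # Sketch (ours) of the two ways to apply the orthogonal factor of A = QR
    # to a small matrix C: implicitly via Householder vectors (dormqr), as in
    # the "I" variants, or explicitly (dorgqr) followed by a single GEMM.
    m, b = 100000, 50
    A = np.asfortranarray(np.random.rand(m, b))
    C = np.random.rand(b, b)

    qr, tau, work, info = lapack.dgeqrf(A)

    # Implicit: B = Q_full @ [C; 0] = Q_thin @ C, reflector by reflector.
    Cpad = np.asfortranarray(np.vstack([C, np.zeros((m - b, b))]))
    B_impl, work, info = lapack.dormqr('L', 'N', qr, tau, Cpad, lwork=64 * b)

    # Explicit: form the thin Q, then one matrix multiplication.
    Q, work, info = lapack.dorgqr(qr, tau)
    B_expl = Q @ C

    assert np.allclose(B_impl, B_expl)

The implicit path performs fewer flops, but the explicit path spends its flops in GEMM, which typically runs closer to peak; this is the offsetting effect described above.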

4.3. Parallel Scaling.

4.3.1. Norms. In this section we compare the performance and parallel scaling

of three diﬀerent algorithms for computing the norm of a TT tensor as discussed

in Subsection 3.2.4. We focus on this computation because the multiple approaches

represent the performance of algorithms for computing inner products and orthogo-

nalization, which are essential on their own in other contexts. We use “Ortho” to

denote the approach of ﬁrst right- or left-orthogonalizing the TT tensor and then

(cheaply) computing the norm of the ﬁrst or last core, respectively. Thus, Ortho per-

formance represents that of Algorithm 3.3. The name “InnPro” refers to the approach

of computing the inner product of the TT tensor with itself, and “InnPro-Sym” in-

cludes the optimization that exploits the symmetry in the inner product to save up

to half the computation. InnPro captures the performance of the algorithm described

in Subsection 3.2.3 for general TT inner products as well.
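To make the InnPro sweep concrete, here is a minimal serial NumPy sketch (ours, not the paper's parallel code; cores assumed to be 3-way arrays of shape (R_{n-1}, I_n, R_n) with boundary ranks R_0 = R_N = 1), specialized to the inner product of a TT tensor with itself:

    import numpy as np

    # Serial sketch of the InnPro approach to the TT norm: sweep through the
    # train carrying an R_n x R_n matrix W that contracts the first n cores
    # of X with themselves.
    def tt_norm_inner_product(cores):
        W = np.ones((1, 1))                        # boundary rank R_0 = 1
        for T in cores:                            # T: (R_{n-1}, I_n, R_n)
            Z = np.einsum('ab,bic->aic', W, T)     # fold W into current core
            W = np.einsum('aic,aid->cd', Z, T)     # contract rank and mode
        return np.sqrt(W.item())                   # W is 1x1 = <X, X>

    # Example: a random 3-core TT tensor with ranks (1, 5, 5, 1)
    cores = [np.random.rand(1, 10, 5), np.random.rand(5, 10, 5),
             np.random.rand(5, 10, 1)]
    print(tt_norm_inner_product(cores))

In the paper's distributed setting, each core is 1D-distributed across processors, and the per-core contractions are combined with All-Reduce collectives, as noted in the cost discussion below.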

We report parallel scaling and a breakdown of computation and communication

for all three algorithms and TT Models 2 and 3 in Figure 4.3.

Fig. 4.3: Time breakdown and parallel scaling of variants for TT norm computation; panels (a) and (b) show the fraction of time spent in computation and communication for Models 2 and 3 (on 1 to 256 and 16 to 256 nodes, respectively), and panels (c) and (d) show the strong scaling of running time against the number of nodes. "Ortho" refers to orthogonalization (followed by computing the norm of a single core), "InnPro" refers to using the inner product algorithm, and "InnPro-Sym" refers to using the inner product algorithm with the symmetric optimization.

Model 2 can be processed

on a single node, but Model 3 requires 16 nodes to achieve suﬃcient memory; we scale

both models up to 256 nodes (10,240 cores). Based on the theoretical analysis (see

Table 3.1), when all tensor dimensions are equal, as in Model 3, Ortho has a

leading-order ﬂop constant of 5, InnPro has a constant of 4, and InnPro-Sym has a

constant of 2. Ortho also requires more complicated TSQR reductions compared to

the All-Reduces performed in InnPro and InnPro-Sym, involving an extra log Pfactor

in data communicated in theory and slightly less eﬃcient implementations in practice.

In addition, the eﬃciencies of the local computations diﬀer across approaches: Ortho

is bottlenecked by local QR, InnPro is bottlenecked by local matrix multiplication

(GEMM), and InnPro-Sym is bottlenecked by local triangular matrix multiplication (TRMM).

Table 4.2: Single-node performance results on TT Model 1 and comparison with the MATLAB TT-Toolbox (times in seconds).

                            1 core    20 cores   Par. Speedup   40 cores   Par. Speedup
    TT-Toolbox              15.68     8.34       1.9×           8.752      1.8×
    Our Implementation      9.2       0.44       20.9×          0.27       33.9×
    Speedup                 1.7×      18.95×                    32.2×

Overall, we see that InnPro is typically the best performing approach. The main

factor in its superiority is that its computation is cast as GEMM calls, which are

more eﬃcient than TRMM and QR subroutines. Although InnPro-Sym performs half

the flops of InnPro, the relative inefficiency of those flops translates to less than a 2× speedup over InnPro for Model 3 and a slight slowdown for Model 2. We also note that for high node counts, the cost of the LDL^T factorization performed within InnPro-Sym becomes non-negligible and begins to hinder parallel scaling.

Based on the breakdown of computation and communication, we see that all

three approaches are able to scale reasonably well because they remain computation

bound up to 256 nodes. For Model 2, we see that communication costs are relatively

higher, as that tensor is much smaller. Note that Ortho scales better than InnPro-

Sym and InnPro, even superlinearly for Model 3, which is due in large part to the

higher ﬂop count and relative ineﬃciency of the local QRs, allowing it to remain more

computation bound than the alternatives. Overall, these results conﬁrm that the

parallel distribution of TT cores allows for high performance and scalability of the

basic TT operations as described in Subsection 3.2.

4.3.2. TT Rounding.

Single-Node Performance. In this section, we compare our implementation of TT rounding against the rounding procedure of the MATLAB TT-Toolbox [28]. Table 4.2 presents

a performance comparison on a single node of COBRA, which has 40 cores available.

We run the experiment on TT Model 1, which is small enough to be processed by a

single core. Because it is written in MATLAB, the TT-Toolbox accesses the available

parallelism only through underlying calls to a multithreaded implementation of BLAS

and LAPACK. However, the bulk of the computation occurs in MATLAB functions

that make direct calls to eﬃcient BLAS and LAPACK subroutines, so it can achieve

relatively high sequential performance.

We observe from Table 4.2 that the single-core performance of the two imple-

mentations is similar, with a 70% speedup from our implementation. The two single-core implementations employ the same algorithm, and we attribute the speedup to

our lower-level interface to LAPACK subroutines and the ability to maintain implicit

orthogonal factors to reduce computation. The parallel strong scaling diﬀers more

drastically, as expected. The MATLAB implementation, which is not designed for

parallelization, achieves less than a 2×speedup when using 20 or 40 cores. Our par-

allelization, which is designed for distributed-memory systems, also scales very well

on this shared-memory machine, achieving over 20×speedup on 20 cores and 34×

speedup on 40 cores.

Distributed-Memory Strong Scaling. We now present the parallel performance of

TT rounding scaling up to hundreds of nodes (over 10,000 cores). As in the case of


Subsection 4.3.1, we consider Models 2 and 3. Figure 4.4 presents the relative time

breakdown and raw timing numbers for each model. We use the ‘LRLI’ variant of

TT rounding in these experiments per the results of Subsection 4.2.2. As in other

rounding experiments, the ranks are cut in half for each model.

In the time breakdown plots of Figures 4.4a and 4.4b, we distinguish among

TSQR factorization (TSQR), application of orthogonal factors (AppQ), and the rest

of the computation that includes SVDs and triangular multiplication (Other). We

also separate the computation and communication of each category. In the context

of Algorithm 3.4, TSQR corresponds to Algorithm 3.4, AppQ corresponds to Algo-

rithm 3.4, and Other corresponds to Algorithm 3.4.

In Figures 4.4c and 4.4d, we show the raw strong-scaling times on a log scale compared to perfect scaling (based on the time at the smallest number of nodes). We see

nearly perfect scaling for Model 2 until 128 nodes; time continues to decrease but is

not cut in half when scaling to 256 nodes. The parallel speedup numbers for Model 2

are 97×for 128 nodes and 108×for 256 nodes, compared to performance on 1 node.

In the case of Model 3, we see super-linear scaling, even at 256 nodes. We attribute

this scaling in part to the baseline comparison of 16 nodes, which already involves

parallelization/communication, and in part to local data ﬁtting into higher levels of

cache as the number of processors increases, which particularly helps memory-bound

local computations. We observe a 48×speedup for Model 3, scaling from 16 to 256

nodes.

The time breakdown plots also help to explain the scaling performance. We

see that for Model 2, over 70% of the time is spent in local computation, while for

Model 3, over 90% of the time is computation. Of this computation, the majority is

spent in TSQR, which itself is dominated by the initial local leaf QR computations.

If the rank is reduced by a smaller factor, then relatively more ﬂops will occur in

AppQ. We note that AppQ involves minimal communication because of the use of the

Butterﬂy TSQR variant. The Other category is dominated by the triangular matrix

multiplication, which achieves higher performance than the LAPACK subroutines

involving orthogonal factors.

5. Conclusions. This work presents the parallel implementation of the basic

computational algorithms for tensors represented in low-rank TT format. Because

most TT computations involve dependence through the train, we specify a data distri-

bution that distributes each core across all processors and show that the computation

and communication costs of our proposed algorithms enable eﬃciency and scalability

for each core computation. The orthogonalization and rounding procedures for TT

tensors depend heavily on the TSQR algorithm, which is designed to scale well on

architectures with a large number of processors for matrices with highly skewed as-

pect ratios. Our numerical experiments show that our algorithms are indeed eﬃcient

and scalable, outperforming productivity-oriented implementations on a single core

and single node and scaling well to hundreds of nodes (thousands of cores). Thus,

we believe our approach is useful to applications and users who are restricted to a

shared-memory workstation as well as to those requiring the memory and performance

of a supercomputer.

We note that the raw performance of our implementation depends heavily on the

local BLAS/LAPACK implementation and the eﬃciency of the QR decomposition

and related subroutines. For example, we observe signiﬁcant performance diﬀerences

between MKL’s implementations of QR and LQ subroutines, which caused the LRL

ordering of TT-rounding to outperform RLR. We also observe performance diﬀerences

among other subroutines, such as triangular matrix multiplication and general matrix multiplication, again confirming that simple flop counting (even tracking constants closely) does not always accurately predict running times.

Fig. 4.4: Time breakdown and parallel scaling of the LRLI variant of TT rounding; panels (a) and (b) show the fraction of time spent in TSQR, AppQ, and Other (split into computation and communication) for Models 2 and 3, and panels (c) and (d) show the strong scaling of running time against the number of nodes.

There are limitations to the parallelization approach proposed in this paper. In particular, modes with small dimensions benefit less from parallelization and can become bottlenecks if there are too many of them. For example, we see the limits of scalability with TT Model 2, which has large first and last modes but smaller internal modes. In fact, the distribution scheme assumes that P ≤ I_n for n = 1, ..., N, and it involves idle processors when this assumption is broken. We also note that TSQR may not be the optimal algorithm to factor the unfolding, which can happen if two successive ranks differ greatly and P is large with respect to the original tensor dimensions.

Alternative possibilities for avoiding these limitations include cheaper but less accurate methods for the SVD, e.g., via the associated Gram matrices or via randomization. We plan to pursue such strategies in the future, in addition to considering the case of computing a TT approximation from a tensor in explicit full format.


Given these eﬃcient computational building blocks, the next step is to build scalable

Krylov and alternating-scheme based solvers that exploit the TT format.

REFERENCES

[1] H. Al Daas, Solving linear systems arising from reservoirs modeling, PhD thesis, Inria Paris; Sorbonne Université, UPMC University of Paris 6, Laboratoire Jacques-Louis Lions, Dec. 2018.
[2] W. Austin, G. Ballard, and T. G. Kolda, Parallel tensor compression for large-scale scientific data, in Proceedings of the 30th IEEE International Parallel and Distributed Processing Symposium, May 2016, pp. 912–922.
[3] B. W. Bader, T. G. Kolda, et al., MATLAB Tensor Toolbox version 3.0-dev. Available online, Oct. 2017.
[4] S. Balay, S. Abhyankar, M. F. Adams, J. Brown, P. Brune, K. Buschelman, L. Dalcin, A. Dener, V. Eijkhout, W. D. Gropp, D. Karpeyev, D. Kaushik, M. G. Knepley, D. A. May, L. C. McInnes, R. T. Mills, T. Munson, K. Rupp, P. Sanan, B. F. Smith, S. Zampini, H. Zhang, and H. Zhang, PETSc Web page. https://www.mcs.anl.gov/petsc, 2019.
[5] G. Ballard, J. Demmel, L. Grigori, N. Knight, M. Jacquelin, and H. D. Nguyen, Reconstructing Householder vectors from tall-skinny QR, Journal of Parallel and Distributed Computing, 85 (2015), pp. 3–31.
[6] G. Ballard, A. Klinvex, and T. G. Kolda, TuckerMPI: A parallel C++/MPI software package for large-scale data compression via the Tucker tensor decomposition, ACM Trans. Math. Softw., 46 (2020).
[7] P. Benner, S. Dolgov, A. Onwunta, and M. Stoll, Low-rank solvers for unsteady Stokes–Brinkman optimal control problem with random data, Computer Methods in Applied Mechanics and Engineering, 304 (2016), pp. 26–54.
[8] ———, Low-rank solution of an optimal control problem constrained by random Navier-Stokes equations, International Journal for Numerical Methods in Fluids, 92 (2020), pp. 1653–1678.
[9] P. Benner, S. Gugercin, and K. Willcox, A survey of projection-based model reduction methods for parametric dynamical systems, SIAM Review, 57 (2015), pp. 483–531.
[10] G. Beylkin and M. J. Mohlenkamp, Numerical operator calculus in higher dimensions, Proceedings of the National Academy of Sciences, 99 (2002), pp. 10246–10251.
[11] J. D. Carroll and J.-J. Chang, Analysis of individual differences in multidimensional scaling via an n-way generalization of "Eckart-Young" decomposition, Psychometrika, 35 (1970), pp. 283–319.
[12] J. Demmel, L. Grigori, M. Hoemmen, and J. Langou, Communication-optimal parallel and sequential QR and LU factorizations, SIAM Journal on Scientific Computing, 34 (2012), pp. A206–A239.
[13] S. Dolgov and M. Stoll, Low-rank solution to an optimization problem constrained by the Navier-Stokes equations, SIAM J. Sci. Comput., 39 (2017), pp. A255–A280.
[14] S. Eswar, K. Hayashi, G. Ballard, R. Kannan, M. A. Matheson, and H. Park, PLANC: Parallel low rank approximation with non-negativity constraints, Tech. Rep. 1909.01149, arXiv, 2019.
[15] W. Hackbusch and S. Kühn, A new scheme for the tensor representation, J. Fourier Anal. Appl., 15 (2009), pp. 706–722.
[16] R. A. Harshman, Foundations of the PARAFAC procedure: Models and conditions for an explanatory multimodal factor analysis, Working Papers in Phonetics, 16 (1970), pp. 1–84.
[17] M. A. Heroux, R. A. Bartlett, V. E. Howle, R. J. Hoekstra, J. J. Hu, T. G. Kolda, R. B. Lehoucq, K. R. Long, R. P. Pawlowski, E. T. Phipps, A. G. Salinger, H. K. Thornquist, R. S. Tuminaro, J. M. Willenbring, A. Williams, and K. S. Stanley, An overview of the Trilinos project, ACM Transactions on Mathematical Software, 31 (2005), pp. 397–423.
[18] J. Hesthaven, G. Rozza, and B. Stamm, Certified Reduced Basis Methods for Parametrized Partial Differential Equations, SpringerBriefs in Mathematics, Springer International Publishing, 2015.
[19] P. Jolivet, Domain decomposition methods. Application to high-performance computing, PhD thesis, Université de Grenoble, Oct. 2014.
[20] O. Kaya and B. Uçar, High performance parallel algorithms for the Tucker decomposition of sparse tensors, in 45th International Conference on Parallel Processing (ICPP '16), 2016, pp. 103–112.
[21] B. N. Khoromskij, O(d log N)-quantics approximation of N-d tensors in high-dimensional numerical modeling, Constr. Approx., 34 (2011), pp. 257–280.
[22] T. G. Kolda and B. W. Bader, Tensor decompositions and applications, SIAM Rev., 51 (2009), pp. 455–500.
[23] J. Kossaifi, Y. Panagakis, A. Anandkumar, and M. Pantic, TensorLy: Tensor learning in Python, Tech. Rep. 1610.09555, arXiv, 2018.
[24] D. Kressner, R. Kumar, F. Nobile, and C. Tobler, Low-rank tensor approximation for high-order correlation functions of Gaussian random fields, SIAM/ASA Journal on Uncertainty Quantification, 3 (2015), pp. 393–416.
[25] D. Kressner and L. Periša, Recompression of Hadamard products of tensors in Tucker format, SIAM Journal on Scientific Computing, 39 (2017), pp. A1879–A1902.
[26] D. Kressner and C. Tobler, Krylov subspace methods for linear systems with tensor product structure, SIAM J. Matrix Anal. Appl., 31 (2009/10), pp. 1688–1714.
[27] J. Li, J. Choi, I. Perros, J. Sun, and R. Vuduc, Model-driven sparse CP decomposition for higher-order tensors, in IEEE International Parallel and Distributed Processing Symposium, IPDPS, May 2017, pp. 1048–1057.
[28] I. Oseledets et al., Tensor Train Toolbox version 2.2.2. Available online, Apr. 2020.
[29] I. Oseledets and E. Tyrtyshnikov, TT-cross approximation for multidimensional arrays, Linear Algebra and its Applications, 432 (2010), pp. 70–88.
[30] I. V. Oseledets, Tensor-train decomposition, SIAM J. Sci. Comput., 33 (2011), pp. 2295–2317.
[31] A.-H. Phan, P. Tichavsky, and A. Cichocki, Fast alternating LS algorithms for high order CANDECOMP/PARAFAC tensor factorizations, IEEE Transactions on Signal Processing, 61 (2013), pp. 4834–4846.
[32] A. Quarteroni, A. Manzoni, and F. Negri, Reduced Basis Methods for Partial Differential Equations: An Introduction, UNITEXT, Springer International Publishing, 2015.
[33] S. Ragnarsson and C. F. Van Loan, Block tensor unfoldings, SIAM Journal on Matrix Analysis and Applications, 33 (2012), pp. 149–169.
[34] D. V. Savostyanov, S. V. Dolgov, J. M. Werner, and I. Kuprov, Exact NMR simulation of protein-size spin systems using tensor train formalism, Phys. Rev. B, 90 (2014), p. 085139.
[35] S. Smith and G. Karypis, Accelerating the Tucker decomposition with compressed sparse tensors, in Euro-Par 2017, F. F. Rivera, T. F. Pena, and J. C. Cabaleiro, eds., Cham, 2017, Springer International Publishing, pp. 653–668.
[36] S. Smith, N. Ravindran, N. D. Sidiropoulos, and G. Karypis, SPLATT: Efficient and parallel sparse tensor-matrix multiplication, in Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium, IPDPS '15, Washington, DC, USA, 2015, IEEE Computer Society, pp. 61–70.
[37] E. Solomonik, D. Matthews, J. R. Hammond, J. F. Stanton, and J. Demmel, A massively parallel tensor contraction framework for coupled-cluster computations, Journal of Parallel and Distributed Computing, 74 (2014), pp. 3176–3190.
[38] L. R. Tucker, Some mathematical notes on three-mode factor analysis, Psychometrika, 31 (1966), pp. 279–311.
[39] E. E. Tyrtyshnikov, Tensor approximations of matrices generated by asymptotically smooth functions, Sbornik: Mathematics, 194 (2003), pp. 941–954.
[40] N. Vervliet, O. Debals, L. Sorber, M. Van Barel, and L. De Lathauwer, Tensorlab 3.0. http://www.tensorlab.net, Mar. 2016.

Appendix A. TT Rounding Identity.

We provide the full derivation of (2.3), which we repeat here. The unfolding of X that maps the first n tensor dimensions to rows can be expressed as a product of four matrices:

    X_{(1:n)} = (I_{I_n} \otimes Q_{(1:n-1)}) \cdot V(T_{X,n}) \cdot H(T_{X,n+1}) \cdot (I_{I_{n+1}} \otimes Z_{(1)}),

where Q is I_1 \times \cdots \times I_{n-1} \times R_{n-1} with

    Q(i_1, \dots, i_{n-1}, r_{n-1}) = T_{X,1}(i_1,:) \cdot T_{X,2}(:,i_2,:) \cdots T_{X,n-1}(:,i_{n-1},r_{n-1}),

and Z is R_{n+1} \times I_{n+2} \times \cdots \times I_N with

    Z(r_{n+1}, i_{n+2}, \dots, i_N) = T_{X,n+2}(r_{n+1},i_{n+2},:) \cdot T_{X,n+3}(:,i_{n+3},:) \cdots T_{X,N}(:,i_N).

Let U be I_1 \times \cdots \times I_n \times R_n such that U_{(1:n)} = (I_{I_n} \otimes Q_{(1:n-1)}) V(T_{X,n}); then

    U(i_1, \dots, i_n, r_n)
      = \sum_{i_n'} \sum_{r_{n-1}} \delta(i_n', i_n) Q(i_1, \dots, i_{n-1}, r_{n-1}) T_{X,n}(r_{n-1}, i_n', r_n)
      = \sum_{r_{n-1}} Q(i_1, \dots, i_{n-1}, r_{n-1}) T_{X,n}(r_{n-1}, i_n, r_n)
      = Q(i_1, \dots, i_{n-1}, :) \cdot T_{X,n}(:, i_n, r_n)
      = T_{X,1}(i_1,:) \cdots T_{X,n-1}(:, i_{n-1}, :) \cdot T_{X,n}(:, i_n, r_n).

Let V be R_n \times I_{n+1} \times \cdots \times I_N such that V_{(1)} = H(T_{X,n+1}) (I_{I_{n+1}} \otimes Z_{(1)}); then

    V(r_n, i_{n+1}, \dots, i_N)
      = \sum_{i_{n+1}'} \sum_{r_{n+1}} T_{X,n+1}(r_n, i_{n+1}', r_{n+1}) \delta(i_{n+1}', i_{n+1}) Z(r_{n+1}, i_{n+2}, \dots, i_N)
      = \sum_{r_{n+1}} T_{X,n+1}(r_n, i_{n+1}, r_{n+1}) Z(r_{n+1}, i_{n+2}, \dots, i_N)
      = T_{X,n+1}(r_n, i_{n+1}, :) \cdot Z(:, i_{n+2}, \dots, i_N)
      = T_{X,n+1}(r_n, i_{n+1}, :) \cdot T_{X,n+2}(:, i_{n+2}, :) \cdots T_{X,N}(:, i_N).

Then we confirm that Y = X for Y_{(1:n)} = U_{(1:n)} \cdot V_{(1)}:

    Y(i_1, \dots, i_N)
      = \sum_{r_n} U(i_1, \dots, i_n, r_n) V(r_n, i_{n+1}, \dots, i_N)
      = \sum_{r_n} T_{X,1}(i_1,:) \cdots T_{X,n-1}(:, i_{n-1}, :) \cdot T_{X,n}(:, i_n, r_n)
          \cdot T_{X,n+1}(r_n, i_{n+1}, :) \cdot T_{X,n+2}(:, i_{n+2}, :) \cdots T_{X,N}(:, i_N)
      = T_{X,1}(i_1,:) \cdots T_{X,n-1}(:, i_{n-1}, :)
          \cdot \Big( \sum_{r_n} T_{X,n}(:, i_n, r_n) \cdot T_{X,n+1}(r_n, i_{n+1}, :) \Big)
          \cdot T_{X,n+2}(:, i_{n+2}, :) \cdots T_{X,N}(:, i_N)
      = T_{X,1}(i_1,:) \cdots T_{X,n}(:, i_n, :) \cdot T_{X,n+1}(:, i_{n+1}, :) \cdots T_{X,N}(:, i_N).
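A small numerical check of this identity (our illustration, using NumPy and row-major unfoldings) for a random 4-way TT tensor and n = 2:

    import numpy as np

    # Verify X_(1:2) = U_(1:2) V_(1) for a random TT tensor (illustrative).
    rng = np.random.default_rng(1)
    dims, ranks = [3, 4, 5, 6], [1, 2, 3, 2, 1]
    T = [rng.standard_normal((ranks[k], dims[k], ranks[k + 1]))
         for k in range(4)]

    # Full tensor by contracting the train left to right (boundary ranks 1).
    X = T[0]
    for core in T[1:]:
        X = np.einsum('...a,aib->...ib', X, core)
    X = X.squeeze()                                  # shape (I_1, ..., I_4)

    U = np.einsum('uia,ajb->ijb', T[0], T[1])        # I_1 x I_2 x R_2
    V = np.einsum('rkb,blv->rkl', T[2], T[3])        # R_2 x I_3 x I_4

    lhs = X.reshape(dims[0] * dims[1], dims[2] * dims[3])     # X_(1:2)
    rhs = U.reshape(-1, ranks[2]) @ V.reshape(ranks[2], -1)   # U_(1:2) V_(1)
    assert np.allclose(lhs, rhs)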

Appendix B. TSQR Subroutines for Non-Powers-of-Two.

We provide here the full details of the butterﬂy TSQR algorithm and the algorithm

for applying the resulting implicit orthogonal factor to a matrix. These two algorithms

generalize Algorithms 3.1 and 3.2, presented in Subsection 3.3, which can run only on a power-of-two number of processors. To handle a non-power-of-two number of processors, we consider the first 2^⌊log P⌋ processors to be "regular" processors and the last P − 2^⌊log P⌋ processors to be "remainder" processors. Each remainder processor has a partner in the set of regular processors, and we perform cleanup steps between remainder processors and their partners before and after the regular butterfly loop of the TSQR algorithm. For the application algorithm, the cleanup occurs after the butterfly on the regular processors (which requires no communication) and involves a single message between remainder processors and their partners. We note that the notation and indexing match those of Algorithms 3.1 and 3.2, so that the algorithms coincide when P is a power of two.
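As a concrete aid for reading the pseudocode below, here is a small Python sketch (ours, not part of the paper's implementation) of the two pairing rules: the remainder/partner pairing just described, and the level-ℓ butterfly partner formula used in the main loops, which reduces to flipping bit ℓ of the processor index.

    import math

    # Illustrative helpers (ours) for the pairing rules in Algorithms B.1/B.2.
    def butterfly_partner(p, level):
        # j = 2^{l+1} * floor(p / 2^{l+1}) + (p + 2^l) mod 2^{l+1}
        # is equivalent to flipping bit `level` of p.
        return p ^ (1 << level)

    def remainder_partner(p, P):
        low = 1 << math.floor(math.log2(P))    # number of regular procs
        high = 1 << math.ceil(math.log2(P))
        return (p + low) % high

    # Example with P = 5: remainder proc 4 pairs with regular proc 0, and at
    # level 1, proc 2's butterfly partner is proc 0.
    assert remainder_partner(4, 5) == 0
    assert butterfly_partner(2, 1) == 0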

Algorithm B.1 Parallel Butterfly TSQR

Require: A is an m × b matrix 1D-distributed so that proc p owns row block A^(p)
Ensure: A = QR with R owned by all procs and Q represented by {Y_ℓ^(p)} with redundancy Y_ℓ^(p) = Y_ℓ^(q) for p ≡ q (mod 2^ℓ), where p, q < 2^⌊log P⌋ and ℓ < ⌈log P⌉. (Here [X; Y] denotes vertical stacking.)

 1: function [{Y_ℓ^(p)}, R] = Par-TSQR(A^(p))
 2:   p = MyProcID()
 3:   [Y_{⌈log P⌉}^(p), R̄_{⌈log P⌉}^(p)] = Local-QR(A^(p))    ▷ Leaf node QR
 4:   if ⌈log P⌉ ≠ ⌊log P⌋ then    ▷ Non-power-of-two case
 5:     j = (p + 2^⌊log P⌋) mod 2^⌈log P⌉
 6:     if p ≥ 2^⌊log P⌋ then    ▷ Remainder processor
 7:       Send R̄_{⌈log P⌉}^(p) to proc j
 8:     else if p < P − 2^⌊log P⌋ then    ▷ Partner of remainder processor
 9:       Receive R̄_{⌈log P⌉}^(j) from proc j
10:       [Y_⋆^(p), R̄_{⌈log P⌉}^(p)] = Local-QR([R̄_{⌈log P⌉}^(p); R̄_{⌈log P⌉}^(j)])
11:     end if
12:   end if
13:   if p < 2^⌊log P⌋ then    ▷ Butterfly tree on power-of-two procs
14:     for ℓ = ⌈log P⌉ − 1 down to 0 do
15:       j = 2^(ℓ+1) ⌊p / 2^(ℓ+1)⌋ + (p + 2^ℓ) mod 2^(ℓ+1)    ▷ Determine partner
16:       Send R̄_{ℓ+1}^(p) to and receive R̄_{ℓ+1}^(j) from proc j    ▷ Communication
17:       if p < j then
18:         [Y_ℓ^(p), R̄_ℓ^(p)] = Local-QR([R̄_{ℓ+1}^(p); R̄_{ℓ+1}^(j)])    ▷ Tree node QR
19:       else
20:         [Y_ℓ^(p), R̄_ℓ^(p)] = Local-QR([R̄_{ℓ+1}^(j); R̄_{ℓ+1}^(p)])    ▷ Partner tree node QR
21:       end if
22:     end for
23:     R = R̄_0^(p)
24:   end if
25:   if ⌊log P⌋ ≠ ⌈log P⌉ then    ▷ Non-power-of-two case
26:     j = (p + 2^⌊log P⌋) mod 2^⌈log P⌉
27:     if p < P − 2^⌊log P⌋ then    ▷ Partner of remainder proc
28:       Send R to proc j
29:     else if p ≥ 2^⌊log P⌋ then    ▷ Remainder proc
30:       Receive R from proc j
31:     end if
32:   end if
33: end function

Algorithm B.2 Parallel Application of Implicit Q from Butterfly TSQR

Require: {Y_ℓ^(p)} represents orthogonal matrix Q computed by Algorithm B.1
Require: C is b × c and redundantly owned by all processors
Ensure: B = QC is m × c and 1D-distributed so that proc p owns row block B^(p)

 1: function B = Par-TSQR-Apply-Q({Y_ℓ^(p)}, C)
 2:   p = MyProcID()
 3:   if p < 2^⌊log P⌋ then    ▷ Butterfly apply on power-of-two procs
 4:     B̄_0^(p) = C
 5:     for ℓ = 0 to ⌈log P⌉ − 1 do
 6:       j = 2^(ℓ+1) ⌊p / 2^(ℓ+1)⌋ + (p + 2^ℓ) mod 2^(ℓ+1)    ▷ Determine partner
 7:       if p < j then
 8:         [B̄_{ℓ+1}^(p); B̄_{ℓ+1}^(j)] = Loc-Apply-Q([I_b; Y_ℓ^(p)], [B̄_ℓ^(p); 0])    ▷ Tree node apply
 9:       else
10:         [B̄_{ℓ+1}^(j); B̄_{ℓ+1}^(p)] = Loc-Apply-Q([I_b; Y_ℓ^(p)], [B̄_ℓ^(p); 0])    ▷ Partner apply
11:       end if
12:     end for
13:   end if
14:   if ⌊log P⌋ ≠ ⌈log P⌉ then    ▷ Non-power-of-two case
15:     j = (p + 2^⌊log P⌋) mod 2^⌈log P⌉
16:     if p < P − 2^⌊log P⌋ then    ▷ Partner of remainder proc
17:       [B̄_{⌈log P⌉}^(p); B̄_{⌈log P⌉}^(j)] = Loc-Apply-Q([I_b; Y_⋆^(p)], [B̄_{⌈log P⌉}^(p); 0])
18:       Send B̄_{⌈log P⌉}^(j) to proc j
19:     else if p ≥ 2^⌊log P⌋ then    ▷ Remainder proc
20:       Receive B̄_{⌈log P⌉}^(p) from proc j
21:     end if
22:   end if
23:   B^(p) = Loc-Apply-Q(Y_{⌈log P⌉}^(p), [B̄_{⌈log P⌉}^(p); 0])    ▷ Leaf node apply
24: end function