PARALLEL ALGORITHMS FOR TENSOR TRAIN ARITHMETIC
HUSSAM AL DAAS†, GREY BALLARD‡, AND PETER BENNER†
† Department of Computational Methods in Systems and Control Theory, Max Planck Institute for Dynamics of Complex Technical Systems, Magdeburg, Germany (aldaas@mpi-magdeburg.mpg.de, benner@mpi-magdeburg.mpg.de).
‡ Computer Science Department, Wake Forest University, Winston Salem, North Carolina, USA (ballard@wfu.edu).
Submission date: November 12, 2020.
Abstract. We present efficient and scalable parallel algorithms for performing mathematical
operations for low-rank tensors represented in the tensor train (TT) format. We consider algorithms
for addition, elementwise multiplication, computing norms and inner products, orthogonalization,
and rounding (rank truncation). These are the kernel operations for applications such as iterative
Krylov solvers that exploit the TT structure. The parallel algorithms are designed for distributed-
memory computation, and we use a data distribution and strategy that parallelizes computations
for individual cores within the TT format. We analyze the computation and communication costs
of the proposed algorithms to show their scalability, and we present numerical experiments that
demonstrate their efficiency on both shared-memory and distributed-memory parallel systems. For
example, we observe better single-core performance than the existing MATLAB TT-Toolbox in
rounding a 2GB TT tensor, and our implementation achieves a 34× speedup using all 40 cores of a
single node. We also show nearly linear parallel scaling on larger TT tensors up to over 10,000 cores
for all mathematical operations.
Key words. low-rank tensor format, tensor train, parallel algorithms, QR, SVD
AMS subject classifications. 15A69, 15A23, 65Y05, 65Y20
1. Introduction. Multi-dimensional data, or tensors, appear in a variety of
applications where numerical values represent multi-way relationships. The Tensor
Train (TT) format is a low-rank representation of a tensor that has been applied
to solving problems in areas such as parameter-dependent PDEs, stochastic PDEs,
molecular simulations, uncertainty quantification, data completion, and classification
[7, 8, 13, 15, 24, 26, 30, 34]. As the number of dimensions or modes of a tensor
becomes large, the total number of data elements grows exponentially fast, which is
known as the curse of dimensionality [15]. Fortunately, it can be shown in many cases
that the tensors exhibit low-rank structure and can be represented or approximated
by significantly fewer parameters. Low-rank tensor approximations allow for storing
the data implicitly and performing arithmetic operations in feasible time and space
complexity, avoiding the curse of dimensionality.
In contrast to the matrix case where the singular value decomposition (SVD)
provides optimal low-rank representations, there are more diverse possibilities for
low-rank representations of tensors [22]. Various representations have been proposed,
such as CP [11, 16], Tucker [38], quantized tensor train [21], and hierarchical Tucker
[15], in addition to TT [30], and each has been demonstrated to be most effective
in certain applications. The TT format consists of a sequence of TT cores, one for
each tensor dimension, and each core is a 3-way tensor except for the first and last
cores, which are matrices. The primary advantages of TT are that (1) the number of
parameters in the representation is linear, rather than exponential, in the number of
modes and (2) the representation can be computed to satisfy a specified approximation
error threshold in a numerically stable way.
As these low-rank tensor techniques have been applied to larger and larger data
sets, efficient sequential and parallel implementations of algorithms for computing
and manipulating these formats have also been developed. Toolboxes and libraries in
productivity-oriented languages such as MATLAB and Python [3, 23, 28, 40] are avail-
able for moderately sized data, and parallel algorithms implemented in performance-
oriented languages exist for computation of decompositions such as CP [14, 36, 27]
and Tucker [2, 6, 20, 35] and operations such as tensor contraction [37], allowing for
scalability to much larger data and numbers of processors. However, no such par-
allelization exists for TT tensors. The goal of this work is to establish efficient and
scalable algorithms for implementing the key mathematical operations on TT tensors.
We consider mathematical operations such as addition, Hadamard (elementwise)
multiplication, computing norms and inner products, left- and right- orthogonaliza-
tion, as well as rounding (rank truncation). These are the operations required to, for
example, solve a structured linear system whose solution can be approximated well
by a tensor in TT format [26]. As we will see in Section 2, mathematical operations
can increase the formal ranks of the TT tensor, which can then be recompressed, or
rounded back to smaller ranks, in order to maintain feasible time and space com-
plexity with some controllable loss of accuracy. As a result, the rounding procedure
(and the orthogonalization it requires) is of prime importance in developing efficient
and scalable TT algorithms. We will assume throughout that full tensors are never
formed explicitly, though there are efficient (sequential) procedures for computing a
TT approximation of a full tensor [30].
In order to develop scalable parallel algorithms, we use a data distribution and
parallelization techniques that maintain computational load balance and attempt to
minimize interprocessor communication, which is the most expensive operation on
parallel machines in terms of both time and energy consumption. As discussed in
Section 3, we distribute the slices of each TT core across all processors, where slices
are matrices (or vectors) whose dimensions are determined by the low ranks of the TT
representation. This distribution allows for full parallelization of each core-wise com-
putation and avoids the need for communication within slice-wise computations. The
orthogonalization and rounding algorithms depend on parallel QR decompositions,
and our approach enables the use of the Tall-Skinny QR algorithm, which is commu-
nication optimal for the matrix dimensions in this application [12]. We analyze the
parallel computation and communication costs of each TT algorithm, demonstrat-
ing that the bulk of the computation is load balanced perfectly across processors.
The communication costs are independent of the original tensor dimensions, so their
relative costs diminish with small ranks.
We verify the theoretical analysis and benchmark our C/MPI implementation
on up to 256 nodes (10,240 cores) of a distributed-memory parallel platform in Sec-
tion 4. Our experiments are performed on synthetic data using tensor dimensions
and ranks that arise in a variety of scientific and data analysis applications. On
a shared-memory system (one node of the system), we compare our TT-rounding
implementation against the TT-Toolbox [28] in MATLAB and show that our imple-
mentation is 70% more efficient using a single core and achieves up to a 34× parallel
speedup using all 40 cores on the node. We also present strong scaling performance
experiments for computing inner products, norms, orthogonalization, and rounding
using up to over 10K MPI processes. The experimental results show that the time
remains dominated by local computation even at that scale, allowing for nearly linear
scaling for multiple operations, achieving for example a 97× speedup of TT-rounding
when scaling from 1 node to 128 nodes on a TT tensor with a 28 GB memory footprint.
We conclude in Section 5 and discuss limitations of our approaches and perspectives
for future improvements.
Fig. 2.1: Order-5 TT tensor with a particular slice from each TT core highlighted. The chain product of those slices produces a scalar element of the full tensor with indices corresponding to the slices.
2. Notation and background. In this section, we review the tensor train (TT) format and present a brief overview of the notation and computational kernels associated with it. Tensors are denoted by boldface Euler script letters (e.g., X), and matrices are denoted by boldface block letters (e.g., A). The number I_n, for 1 ≤ n ≤ N, is referred to as the mode size or mode dimension, and we use i_n to index that dimension. The order of a tensor is its number of modes, e.g., the order of X is N. The nth TT core (described below) of a tensor X is denoted by T_{X,n}. We use MATLAB-style notation to obtain elements or sub-tensors, where a solitary colon (:) refers to the entire range of a dimension. For example, X(i, j, k) is a tensor entry, X(i, :, :) is a tensor slice (a matrix in this case), and X(:, j, k) is a tensor fiber (a vector).
The mode-n "modal" unfolding (or matricization or flattening) of a tensor X ∈ R^{I_1 × I_2 × I_3} is the matrix X_(n) ∈ R^{I_n × (I/I_n)}, where I = I_1 I_2 I_3. In this case, the columns of the modal unfolding are fibers in that mode. The mode-n product or tensor-times-matrix operation is denoted by ×_n and is defined so that the mode-n unfolding of X ×_n A is A X_(n). We refer to [22, 33] for more details.
2.1. TT tensors. A tensor X ∈ R^{I_1 × ··· × I_N} is in the TT format if there exist strictly positive integers R_0, ..., R_N with R_0 = R_N = 1 and N order-3 tensors T_{X,1}, ..., T_{X,N}, called TT cores, with T_{X,n} ∈ R^{R_{n-1} × I_n × R_n}, such that

X(i_1, ..., i_N) = T_{X,1}(i_1, :) ··· T_{X,n}(:, i_n, :) ··· T_{X,N}(:, i_N).

We note that because R_0 = R_N = 1, the first and last TT cores are (order-2) matrices, so T_{X,1}(i_1, :) ∈ R^{R_1} and T_{X,N}(:, i_N) ∈ R^{R_{N-1}}. The R_{n-1} × R_n matrix T_{X,n}(:, i_n, :) is referred to as the i_n-th slice of the nth TT core of X, where 1 ≤ i_n ≤ I_n. Figure 2.1 shows an illustration of an order-5 TT tensor.
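As a concrete illustration of this chain-product formula, the following NumPy sketch evaluates a single entry of a TT tensor. It assumes a simple list-of-arrays representation in which core n is stored with shape (R_{n-1}, I_n, R_n) and explicit unit boundary ranks R_0 = R_N = 1 (the paper treats the first and last cores as matrices); the function name tt_entry is ours, not part of the paper's implementation.

```python
import numpy as np

def tt_entry(cores, idx):
    """Evaluate X(i_1, ..., i_N) by chaining the selected slice of each core.

    cores: list of N arrays, core n of shape (R_{n-1}, I_n, R_n), R_0 = R_N = 1.
    idx:   tuple of N zero-based indices (i_1, ..., i_N).
    """
    v = np.ones((1, 1))                    # running 1 x R_n partial product
    for core, i in zip(cores, idx):
        v = v @ core[:, i, :]              # multiply by the i_n-th slice
    return v[0, 0]                         # final product is 1 x 1
```

Evaluating entries one at a time costs O(N R^2) per entry and is only sensible for spot checks; the point of the format is to avoid forming the full tensor.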
Due to the multiplicative formulation of the TT format, the cores of a TT tensor are not unique. For example, let X be a TT tensor and M ∈ R^{R_n × R_n} be an invertible matrix. Then the TT tensor Y defined by

Y(i_1, ..., i_N) = T_{X,1}(i_1, :) ··· (T_{X,n}(:, i_n, :) M) · (M^{-1} T_{X,n+1}(:, i_{n+1}, :)) ··· T_{X,N}(:, i_N)

is equal to X. Another important remark is the following:

(2.1)  T_{X,1}(i_1, :) ··· (T_{X,n}(:, i_n, :) M) · T_{X,n+1}(:, i_{n+1}, :) ··· T_{X,N}(:, i_N) =
       T_{X,1}(i_1, :) ··· T_{X,n}(:, i_n, :) · (M T_{X,n+1}(:, i_{n+1}, :)) ··· T_{X,N}(:, i_N),

where M in this case need not be invertible. Thus, we can "pass" a matrix between adjacent cores without changing the tensor. This property is used to orthogonalize TT cores as we will see in Subsection 2.3.

Fig. 2.2: Horizontal unfolding H(T_{X,n}) ∈ R^{R_{n-1} × I_n R_n} and vertical unfolding V(T_{X,n}) ∈ R^{R_{n-1} I_n × R_n} of a TT core T_{X,n} ∈ R^{R_{n-1} × I_n × R_n}.
2.2. Unfolding TT cores. In order to express arithmetic operations on TT cores using linear algebra, we will often use two specific matrix unfoldings of the 3D tensors. The horizontal unfolding of TT core T_{X,n} corresponds to the horizontal concatenation of the slices T_{X,n}(:, i_n, :) for i_n = 1, ..., I_n. We denote the corresponding operator by H, so that H(T_{X,n}) is an R_{n-1} × R_n I_n matrix. The vertical unfolding corresponds to the vertical concatenation of the slices T_{X,n}(:, i_n, :) for i_n = 1, ..., I_n. We denote the corresponding operator by V, so that V(T_{X,n}) is an R_{n-1} I_n × R_n matrix. These unfoldings are illustrated in Figure 2.2.
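In the NumPy list-of-arrays representation assumed in these sketches (with explicit unit boundary ranks), both unfoldings are plain reshapes; the helper names below are ours:

```python
import numpy as np

def horizontal_unfolding(core):
    """H(T): R_{n-1} x (I_n R_n), slices T(:, i, :) concatenated left to right."""
    r0, i, r1 = core.shape
    return core.reshape(r0, i * r1)                  # C-order reshape

def vertical_unfolding(core):
    """V(T): (R_{n-1} I_n) x R_n, slices T(:, i, :) stacked top to bottom."""
    r0, i, r1 = core.shape
    return core.reshape(r0 * i, r1, order='F')       # Fortran-order reshape
```

Whether a reshape requires a copy depends on the in-memory layout; the column-major ("vec-oriented") layout of Subsection 3.1 is chosen precisely so that both unfoldings are contiguous and can be handed directly to BLAS/LAPACK.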
Note that the horizontal unfolding is equivalent to the modal unfolding with
respect to the 1st mode, often denoted with subscript (1) to denote the mode that
corresponds to rows [22]. Similarly, the vertical unfolding is the transpose of the modal
unfolding with respect to the 3rd mode, which also corresponds to the more general
unfolding that maps the first two modes to rows and the third mode to columns,
denoted with subscript (1:2) to denote the modes that correspond to rows [31]. These
connections are important for the linearization of tensor entries in memory and our
efficient use of BLAS and LAPACK, discussed in Subsection 3.1.
2.3. TT Orthogonalization. Different types of orthogonalization can be defined for TT tensors. We focus in this paper on left and right orthogonalization, which are required in the rounding procedure. We use the terms column orthogonal and row orthogonal to refer to matrices that have orthonormal columns and orthonormal rows, respectively, so that a matrix Q is column orthogonal if Q^⊤ Q = I and row orthogonal if Q Q^⊤ = I.
A TT tensor is said to be right orthogonal if H(T_{X,n}) is row orthogonal for n = 2, ..., N (all but the first core). On the other hand, a tensor is said to be left orthogonal if V(T_{X,n}) is column orthogonal for n = 1, ..., N−1 (all but the last core). More generally, we define a tensor to be n-right orthogonal if the horizontal unfoldings of cores n+1, ..., N are all row orthogonal, and a tensor is n-left orthogonal if the vertical unfoldings of cores 1, ..., n−1 are all column orthogonal.
These definitions correspond to the fact that the tensor that represents the contraction of these sets of TT cores inherits their orthogonality. For example, let X be a right-orthogonal TT tensor. Then we can write X_(1) = T_{X,1} Z_(1), where Z is an R_1 × I_2 × ··· × I_N tensor whose entries are given by

Z(r_1, i_2, ..., i_N) = T_{X,2}(r_1, i_2, :) · T_{X,3}(:, i_3, :) ··· T_{X,N}(:, i_N).

The 1st modal unfolding of Z is row orthogonal, as shown below [30, Lemma 3.1]:

Z_(1) Z_(1)^⊤ = Σ_{i_2,...,i_N} Z(:, i_2, ..., i_N) Z(:, i_2, ..., i_N)^⊤
             = Σ_{i_2,...,i_N} T_{X,2}(:, i_2, :) ··· T_{X,N}(:, i_N) T_{X,N}(:, i_N)^⊤ ··· T_{X,2}(:, i_2, :)^⊤
             = Σ_{i_2} T_{X,2}(:, i_2, :) ··· ( Σ_{i_N} T_{X,N}(:, i_N) T_{X,N}(:, i_N)^⊤ ) ··· T_{X,2}(:, i_2, :)^⊤
             = Σ_{i_2} T_{X,2}(:, i_2, :) ··· H(T_{X,N}) H(T_{X,N})^⊤ ··· T_{X,2}(:, i_2, :)^⊤
             = Σ_{i_2} T_{X,2}(:, i_2, :) ··· I_{R_{N-1}} ··· T_{X,2}(:, i_2, :)^⊤
             = Σ_{i_2} T_{X,2}(:, i_2, :) ··· H(T_{X,N-1}) H(T_{X,N-1})^⊤ ··· T_{X,2}(:, i_2, :)^⊤
             = ··· = I_{R_1}.

Similar arguments show that the 1st modal unfolding of the tensor representing the last N−n cores of an n-right orthogonal TT tensor is row orthogonal, and that the last modal unfolding of the tensor representing the first n−1 cores of an n-left orthogonal TT tensor is row orthogonal.
Given a TT tensor, we can orthogonalize it by exploiting the non-uniqueness of
TT tensors expressed in Equation (2.1). That is, we can right- or left-orthogonalize
a TT core using a QR decomposition of one of its unfoldings and pass its triangular
factor to its neighbor core without changing the represented tensor. By starting from
one end and repeating this process on each core in order, we can obtain a left or right
orthogonal TT tensor, as shown in Algorithm 2.1 (for right orthogonalization).
We note that the norm of a right- or left-orthogonal TT tensor can be cheaply computed, based on the idea that post-multiplication by a matrix with orthonormal rows or pre-multiplication by a matrix with orthonormal columns does not affect the Frobenius norm of a matrix. Thus, we have that ‖X‖ = ‖T_{X,1}‖_F provided that Z_(1) has orthonormal rows. Likewise, if X is a left-orthogonal TT tensor, then ‖X‖ = ‖T_{X,N}‖_F.
2.4. TT Rounding. Orthogonalization plays an essential role in compressing the TT format of a tensor (decreasing the TT ranks R_n) [30]. This compression is known as TT rounding and is given in Algorithm 2.2.
Algorithm 2.1 TT-Right-Orthogonalization
Require: A TT tensor X
Ensure: A right-orthogonal TT tensor Y equivalent to X
1: function Y = Right-Orthogonalization(X)
2:   Set T_{Y,N} = T_{X,N}
3:   for n = N down to 2 do
4:     [H(T_{Y,n})^⊤, R] = QR(H(T_{Y,n})^⊤)          ⊲ QR factorization
5:     V(T_{Y,n-1}) = V(T_{X,n-1}) · R^⊤              ⊲ T_{Y,n-1} = T_{X,n-1} ×_3 R
6:   end for
7: end function
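A sequential NumPy sketch of Algorithm 2.1, using the same list-of-cores representation as in the earlier sketches; the LQ factorization of each horizontal unfolding is realized as a QR factorization of its transpose, and the function name is illustrative.

```python
import numpy as np

def tt_right_orthogonalize(cores):
    """Sequential sketch of Algorithm 2.1: make H(T_n) row orthogonal for
    n = N, ..., 2 and absorb the triangular factors into the left neighbors."""
    Y = [c.copy() for c in cores]
    for n in range(len(Y) - 1, 0, -1):
        r0, i, r1 = Y[n].shape
        Q, R = np.linalg.qr(Y[n].reshape(r0, i * r1).T)   # QR of H(T_n)^T (an LQ of H)
        k = Q.shape[1]                                    # k = min(r0, i * r1)
        Y[n] = Q.T.reshape(k, i, r1)                      # row-orthogonal core
        p0, pi, _ = Y[n - 1].shape
        V = Y[n - 1].reshape(p0 * pi, r0, order='F') @ R.T  # V(T_{n-1}) R^T
        Y[n - 1] = V.reshape(p0, pi, k, order='F')
    return Y
```

After this sweep, the norm of X equals the Frobenius norm of the first core, e.g., np.linalg.norm(Y[0]).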
The intuition for rounding can be expressed in matrix notation as follows. Suppose we have a matrix represented by a product

(2.2)  A = Q B C Z,

where Q and Z are column and row orthogonal, respectively. Then the truncated SVD of A can be readily expressed in terms of the truncated SVD of BC. In our case, B is tall and skinny and C is short and wide, so the rank is bounded by their shared dimension. To truncate the rank, one can row-orthogonalize C and then perform a truncated SVD of B (or vice-versa). That is, if we compute R_C Q_C = C and U_B Σ_B V_B^⊤ = B R_C, then to round A we can replace B with Û_B and C with Σ̂_B V̂_B^⊤ Q_C, where Û_B Σ̂_B V̂_B^⊤ is the SVD of B R_C truncated to the desired tolerance.
In order to truncate a particular rank R_n by considering only the nth TT core using this idea, the TT format should be both n-left and n-right orthogonal. The unfolding of X that maps the first n tensor dimensions to rows can be expressed as a product of four matrices:

(2.3)  X_(1:n) = (I_{I_n} ⊗ Q_(1:n-1)) · V(T_{X,n}) · H(T_{X,n+1}) · (I_{I_{n+1}} ⊗ Z_(1)),

where Q is I_1 × ··· × I_{n-1} × R_{n-1} with

Q(i_1, ..., i_{n-1}, r_{n-1}) = T_{X,1}(i_1, :) · T_{X,2}(:, i_2, :) ··· T_{X,n-1}(:, i_{n-1}, r_{n-1}),

and Z is R_{n+1} × I_{n+2} × ··· × I_N with

Z(r_{n+1}, i_{n+2}, ..., i_N) = T_{X,n+2}(r_{n+1}, i_{n+2}, :) · T_{X,n+3}(:, i_{n+3}, :) ··· T_{X,N}(:, i_N).

See Figure 2.3 for a visualization and Appendix A for a full derivation of (2.3). If X is n-left and n-right orthogonal, then Q_(1:n-1) and Z_(1) are column and row orthogonal (and so are their Kronecker products with an identity matrix), respectively, and H(T_{X,n+1}) is also row orthogonal.
In order to truncate R_n, we view (2.3) as an instance of (2.2), where V(T_{X,n}) plays the role of B and H(T_{X,n+1}) plays the role of C (though H(T_{X,n+1}) is already orthogonalized). We compute the truncated SVD V(T_{X,n}) ≈ Û Σ̂ V̂^⊤, replace V(T_{X,n}) with Û, and apply Σ̂ V̂^⊤ to H(T_{X,n+1}). In this way, R_n is truncated, V(T_{X,n}) becomes column orthogonal, and because Q and Z are not modified, X becomes (n+1)-left and (n+1)-right orthogonal and ready for the truncation of R_{n+1}.
The rounding procedure consists of two sweeps along the modes. During the first, the tensor is left- or right-orthogonalized. During the second, sweeping in the opposite direction, the TT ranks are reduced sequentially via SVD truncation of the matricized cores. The rounding accuracy ε_0 can be defined a priori such that the rounded TT tensor is ε_0-relatively close to the original TT tensor. We note that this method is quasi-optimal in finding the closest TT tensor with prescribed TT ranks to a given TT tensor [29].

Fig. 2.3: Visualization of identity (2.3) for X_(1:n).
Algorithm 2.2 TT-Rounding
Require: A tensor Y in TT format, a threshold ε_0
Ensure: A tensor X in TT format with reduced ranks such that ‖X − Y‖_F ≤ ε_0 ‖Y‖_F
1: function X = Rounding(Y, ε_0)
2:   X = Right-Orthogonalization(Y)
3:   Compute ‖Y‖_F = ‖T_{X,1}‖_F and the truncation threshold ε = ε_0 ‖Y‖_F / √(N−1)
4:   for n = 1 to N−1 do
5:     [V(T_{X,n}), Σ, V] = SVD(V(T_{X,n}), ε)        ⊲ ε-truncated SVD factorization
6:     H(T_{X,n+1}) = Σ V^⊤ H(T_{X,n+1})              ⊲ T_{X,n+1} = T_{X,n+1} ×_1 (Σ V^⊤)
7:   end for
8: end function
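A sequential NumPy sketch of Algorithm 2.2 under the same assumptions as the earlier examples (list of cores with unit boundary ranks); the truncation rank per core is chosen as the smallest one whose discarded singular values stay below the per-core threshold ε, and the function name is ours.

```python
import numpy as np

def tt_round(cores, eps0):
    """Sequential sketch of Algorithm 2.2: right-orthogonalize, then truncate
    ranks left to right via one truncated SVD per core."""
    N = len(cores)
    X = [c.copy() for c in cores]
    for n in range(N - 1, 0, -1):                         # right-orthogonalization sweep
        r0, i, r1 = X[n].shape
        Q, R = np.linalg.qr(X[n].reshape(r0, i * r1).T)   # LQ of H(T_n) via QR of its transpose
        k = Q.shape[1]
        X[n] = Q.T.reshape(k, i, r1)
        p0, pi, _ = X[n - 1].shape
        V = X[n - 1].reshape(p0 * pi, r0, order='F') @ R.T
        X[n - 1] = V.reshape(p0, pi, k, order='F')
    eps = eps0 * np.linalg.norm(X[0]) / np.sqrt(N - 1)    # per-core threshold
    for n in range(N - 1):                                # left-to-right truncation sweep
        r0, i, r1 = X[n].shape
        U, s, Vt = np.linalg.svd(X[n].reshape(r0 * i, r1, order='F'),
                                 full_matrices=False)
        tail = np.sqrt(np.cumsum(s[::-1] ** 2))[::-1]     # tail[L] = ||s[L:]||_2
        above = np.nonzero(tail > eps)[0]
        L = int(above[-1]) + 1 if above.size else 1       # smallest admissible rank
        X[n] = U[:, :L].reshape(r0, i, L, order='F')      # new column-orthogonal core
        q0, qi, q1 = X[n + 1].shape
        H = (s[:L, None] * Vt[:L]) @ X[n + 1].reshape(q0, qi * q1)
        X[n + 1] = H.reshape(L, qi, q1)                   # absorb Sigma V^T into next core
    return X
```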
3. Parallel Algorithms for Tensor Train. In this section we detail the par-
allel algorithms for manipulating TT tensors that are distributed over multiple pro-
cessors’ memories. We describe our proposed data distribution of the core tensors in
Subsection 3.1, which is designed for efficient orthogonalization and truncation of TT
tensors. In Subsection 3.2 we show how to perform basic operations on TT tensors
in this distribution such as addition, elementwise multiplication, and applying certain
linear operators. Our proposed parallel orthogonalization and truncation routines
are presented in Subsections 3.4 and 3.5, respectively. Both of those routines rely
on an existing communication-efficient parallel QR decomposition algorithm called
Tall-Skinny QR (TSQR) [12], which is given for completeness in Subsection 3.3. A
summary of the costs of the parallel algorithms is presented in Table 3.1.
Table 3.1: Summary of computation and communication costs of parallel TT operations using P processors, assuming inputs are N-way tensors with identical dimensions I_n = I and ranks R_n = R. The computation cost of rounding assumes the original ranks are reduced in half; the constant can range from 3 to 13 depending on the reduced ranks.

TT Algorithm       | Computation                       | Comm. Data      | Comm. Msgs
Summation          | —                                 | —               | —
Hadamard           | N I R^4 / P                       | —               | —
Inner Product      | 4 N I R^3 / P                     | O(N R^2)        | O(N log P)
Norm               | 2 N I R^3 / P                     | O(N R^2)        | O(N log P)
Orthogonalization  | 5 N I R^3 / P + O(N R^3 log P)    | O(N R^2 log P)  | O(N log P)
Rounding           | 7 N I R^3 / P + O(N R^3 log P)    | O(N R^2 log P)  | O(N log P)
Fig. 3.1: In blue, the data owned by a single processor in the 1D distribution of a TT tensor across P processors.
3.1. Data Distribution and Layout. We are interested in the parallelization of TT operations with a large number of modes and where one or multiple mode sizes are very large compared to the TT ranks. This type of configuration arises in many applications such as parameter-dependent PDEs [26], stochastic PDEs [24], and molecular simulations [34].
To simplify the presentation and without loss of generality, we consider in this paper the case where all mode sizes are very large compared to the TT ranks. If there exist TT cores with relatively small mode sizes, those can be stored redundantly on each processor. We note that our implementation can handle both cases.
Algorithms for orthogonalization and rounding of TT tensors are sequential with respect to the mode; often computation can occur on only one mode at a time. In order to utilize all processors and maintain load balance in a parallel environment, we choose to distribute each TT core over all processors, so that each processor owns a subtensor of each TT core. To ensure the computations on each core can be done in a communication-efficient way, we choose a 1D distribution for each core, where the mode corresponding to the original tensor dimension is divided across processors. This corresponds to a Cartesian distribution of each R_{n-1} × I_n × R_n core over a 1 × P × 1 processor grid, or equivalently, a block row distribution of V(T_{X,n}) or a block column distribution of H(T_{X,n}), for n = 1, ..., N; see Figure 3.1. In this manner, each processor owns N local subtensors with dimensions R_{n-1} × (I_n/P) × R_n. The notation T_{X,n}^(p) denotes the local subtensor of the nth core owned by processor p.
This distribution allows basic operations on the TT representation, such as addition and elementwise multiplication, to be performed locally; see Subsection 3.2. Furthermore, the distribution of a TT core in this way can be seen as a generalization of the distribution of a vector in parallel iterative linear solvers [1, 19]. Indeed, if A is an I_n × I_n sparse matrix distributed across processors as block row panels, the computation of A · T_{X,n}(k, :, l) can be done using a sparse matrix-vector multiplication.
Tensor entries are linearized in memory. Each local core tensor T_{X,n}^(p) is R_{n-1} × (I_n/P) × R_n, and we store it in the "vec-oriented" or "natural descending" order [6, 33] in memory. For 3-way tensors, this means that mode-1 fibers (of length R_{n-1}) are contiguous in memory, as this corresponds to the mode-1 modal unfolding. Additionally, the mode-3 slices (of size R_{n-1} × (I_n/P)) are also contiguous in memory and internally linearized in column-major order, as this corresponds to the more general (1:2) unfolding [31, 33]. In particular, these facts imply that both the vertical and horizontal unfoldings are column major in memory.
BLAS and LAPACK routines require either row- or column-major ordering (unit stride for one dimension and constant stride for the other), but this property of the vertical and horizontal unfoldings means that we can operate on them without any physical permutation of the tensor data. For example, we can perform operations such as a QR factorization of V(T_{X,n}) or the product V(T_{X,n}) · R, where R ∈ R^{R_n × R_n}, with a single LAPACK or BLAS call.
This choice of ordering comes at the expense of less convenient access to the mode-2 modal unfolding (of dimension (I_n/P) × R_{n-1}R_n), which is neither row nor column major in memory. This unfolding can be visualized in memory as a concatenation of R_n contiguous submatrices, each of dimension (I_n/P) × R_{n-1} and each stored in row-major order [6]. In order to perform the mode-2 multiplication (tensor-times-matrix operation), as is necessary in the application of a spatial operator to the core, we must make a sequence of calls to the matrix-matrix multiplication BLAS subroutine. That is, we make R_n calls for multiplications of the same I_n × I_n matrix with different I_n × R_{n-1} matrices.
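A minimal NumPy sketch of this slice-wise strategy follows; the function name and the dense operator are our illustration (the implementation described above issues the corresponding BLAS GEMM calls on its local subtensor, and Subsection 3.2.5 treats the sparse case).

```python
import numpy as np

def mode2_multiply(core, A):
    """Apply an I x I matrix A in mode 2 of a (R0, I, R1) core with one
    matrix-matrix product per mode-3 slice, i.e., R1 products in total."""
    r0, i, r1 = core.shape
    out = np.empty_like(core)
    for b in range(r1):
        # core[:, :, b] is R0 x I; multiplying by A.T applies A along mode 2,
        # i.e., out[a, :, b] = A @ core[a, :, b] for every a.
        out[:, :, b] = core[:, :, b] @ A.T
    return out
```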
3.2. Basic Operations.
3.2.1. Summation. To sum two tensors X and Y, we can write [30]:

Z(i_1, ..., i_N) = X(i_1, ..., i_N) + Y(i_1, ..., i_N)
  = T_{X,1}(i_1, :) ··· T_{X,N}(:, i_N) + T_{Y,1}(i_1, :) ··· T_{Y,N}(:, i_N)
  = [ T_{X,1}(i_1, :)  T_{Y,1}(i_1, :) ] · blockdiag(T_{X,2}(:, i_2, :), T_{Y,2}(:, i_2, :)) ···
    blockdiag(T_{X,N-1}(:, i_{N-1}, :), T_{Y,N-1}(:, i_{N-1}, :)) · [ T_{X,N}(:, i_N) ; T_{Y,N}(:, i_N) ],

where [A  B] and [A ; B] denote horizontal and vertical (MATLAB-style) concatenation and blockdiag(A, B) denotes the block-diagonal matrix with diagonal blocks A and B. Thus, the TT representation of Z = X + Y is given by the following slice-wise formula:

T_{Z,n}(:, i_n, :) = blockdiag(T_{X,n}(:, i_n, :), T_{Y,n}(:, i_n, :))

for 2 ≤ n ≤ N−1 and 1 ≤ i_n ≤ I_n. We also have T_{Z,1} = [ T_{X,1}  T_{Y,1} ] and T_{Z,N} = [ T_{X,N} ; T_{Y,N} ]. Note that the formal TT ranks of Z are the sums of the TT ranks of X and Y.
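A NumPy sketch of this slice-wise construction, in the list-of-cores representation with unit boundary ranks assumed throughout these examples (the function name is ours):

```python
import numpy as np

def tt_add(xc, yc):
    """Z = X + Y in TT format: first cores concatenate horizontally, last cores
    vertically, and every middle slice becomes block diagonal (formal ranks add)."""
    N = len(xc)
    zc = []
    for n, (a, b) in enumerate(zip(xc, yc)):
        ra0, i, ra1 = a.shape
        rb0, _, rb1 = b.shape
        r0 = 1 if n == 0 else ra0 + rb0
        r1 = 1 if n == N - 1 else ra1 + rb1
        c = np.zeros((r0, i, r1))
        if n == 0:                      # horizontal concatenation
            c[:, :, :ra1] = a
            c[:, :, ra1:] = b
        elif n == N - 1:                # vertical concatenation
            c[:ra0, :, :] = a
            c[ra0:, :, :] = b
        else:                           # block-diagonal slices
            c[:ra0, :, :ra1] = a
            c[ra0:, :, ra1:] = b
        zc.append(c)
    return zc
```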
Given the 1D data distribution of each core described in Subsection 3.1, the
summation operation can be performed locally with no interprocessor communication.
That is, because X,Y, and Zhave identical dimensions, they will have identical
distributions, and each slice of a core tensor of Zwill be owned by the processor that
owns the corresponding slices of cores of Xand Y.
3.2.2. Hadamard Product. To compute the Hadamard (elementwise) product of two tensors X and Y, we can write [30]:

Z(i_1, ..., i_N) = X(i_1, ..., i_N) · Y(i_1, ..., i_N)
  = (T_{X,1}(i_1, :) ··· T_{X,N}(:, i_N)) · (T_{Y,1}(i_1, :) ··· T_{Y,N}(:, i_N))
  = (T_{X,1}(i_1, :) ··· T_{X,N}(:, i_N)) ⊗ (T_{Y,1}(i_1, :) ··· T_{Y,N}(:, i_N))
  = (T_{X,1}(i_1, :) ⊗ T_{Y,1}(i_1, :)) ··· (T_{X,N}(:, i_N) ⊗ T_{Y,N}(:, i_N)).

Thus, the TT representation of Z = X ∗ Y is given by the following slice-wise formula: T_{Z,n}(:, i_n, :) = T_{X,n}(:, i_n, :) ⊗ T_{Y,n}(:, i_n, :) for 1 ≤ n ≤ N and 1 ≤ i_n ≤ I_n. Here, the formal TT ranks of Z are the products of the TT ranks of X and Y.
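A corresponding NumPy sketch (same representation as before; kron forms the slice-wise Kronecker products explicitly, which is exactly the expense the remark that follows suggests deferring when possible):

```python
import numpy as np

def tt_hadamard(xc, yc):
    """Z = X * Y (elementwise) in TT format: each slice of Z is the Kronecker
    product of the corresponding slices of X and Y (formal ranks multiply)."""
    zc = []
    for a, b in zip(xc, yc):
        ra0, i, ra1 = a.shape
        rb0, _, rb1 = b.shape
        c = np.empty((ra0 * rb0, i, ra1 * rb1))
        for j in range(i):
            c[:, j, :] = np.kron(a[:, j, :], b[:, j, :])
        zc.append(c)
    return zc
```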
Again, given the 1D data distribution of each core and the fact that each core is
computed slice-wise, the Hadamard product can be performed locally with no inter-
processor communication. We note that because of the extra expense of the Hadamard
product (due to computing explicit Kronecker products of slices), it is likely advan-
tageous to maintain Hadamard products in implicit form for later operations such
as rounding. The combination of Hadamard products and recompression has been
shown to be effective for Tucker tensors [25].
3.2.3. Inner Product. To compute the inner product of two tensors X and Y, using similar identities as for the Hadamard product, we can write [30]:

⟨X, Y⟩ = Σ_{i_1,...,i_N} X(i_1, ..., i_N) · Y(i_1, ..., i_N)
       = Σ_{i_1,...,i_N} (T_{X,1}(i_1, :) ⊗ T_{Y,1}(i_1, :)) ··· (T_{X,N}(:, i_N) ⊗ T_{Y,N}(:, i_N))
       = ( Σ_{i_1} T_{X,1}(i_1, :) ⊗ T_{Y,1}(i_1, :) ) · ( Σ_{i_2} T_{X,2}(:, i_2, :) ⊗ T_{Y,2}(:, i_2, :) ) ··· ( Σ_{i_N} T_{X,N}(:, i_N) ⊗ T_{Y,N}(:, i_N) ).
This expression can be evaluated efficiently by a sequence of structured matrix-vector products that avoid forming Kronecker products of matrices, and these matrix-vector products are cast as matrix-matrix multiplications.
To see how, we assume that the TT ranks of X and Y are {R^X_n} and {R^Y_n}, respectively. First, we explicitly construct the row vector

w_1 = Σ_{i_1} T_{X,1}(i_1, :) ⊗ T_{Y,1}(i_1, :),

which has dimension R^X_1 · R^Y_1. Note that w_1 is the vectorization of the matrix V(T_{Y,1})^⊤ V(T_{X,1}). Then we distribute w_1 to all terms within the next summation to compute w_2 using

w_2 = Σ_{i_2} w_1 (T_{X,2}(:, i_2, :) ⊗ T_{Y,2}(:, i_2, :)),

with each term in the summation evaluated via vec( T_{Y,2}(:, i_2, :)^⊤ W_1 T_{X,2}(:, i_2, :) ), where W_1 is a reshaping of the vector w_1 into an R^Y_1 × R^X_1 matrix and vec is a row-wise vectorization operator. We note that T_{X,2}(:, i_2, :) is R^X_1 × R^X_2 and T_{Y,2}(:, i_2, :) is R^Y_1 × R^Y_2, so w_2 has dimension R^X_2 · R^Y_2. This process is repeated with

(3.1)  W_n = Σ_{i_n} T_{Y,n}(:, i_n, :)^⊤ W_{n-1} T_{X,n}(:, i_n, :),

until the last core, when we compute the inner product as

⟨X, Y⟩ = Σ_{i_N} T_{Y,N}(:, i_N)^⊤ W_{N-1} T_{X,N}(:, i_N),

where W_{N-1} is an R^Y_{N-1} × R^X_{N-1} matrix.
If all the tensor dimensions are the same and all TT ranks are the same, i.e., I = I_1 = ··· = I_N and R = R^X_1 = R^Y_1 = ··· = R^X_{N-1} = R^Y_{N-1}, the computational complexity is approximately 4 N I R^3.
Evaluating (3.1) directly can exploit the efficiency of dense matrix multiplication, but it requires many calls to the BLAS subroutine. With some extra temporary memory, we can reduce the number of BLAS calls to two, performing the same overall number of flops. Let Z be defined such that H(T_{Z,n}) = W_{n-1} H(T_{X,n}), i.e., the mode-1 multiplication between the core and the matrix, for n = 1, ..., N (with W_0 = 1). Then W_n is a contraction of modes 1 and 2 between cores of Y and Z, that is,

W_n = V(T_{Y,n})^⊤ V(T_{Z,n}),   for n = 1, ..., N.

Each of these two multiplications requires a single BLAS call because the horizontal and vertical unfoldings are column major in memory. We note that the final contraction in mode N is a dot product instead of a matrix multiplication.
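A sequential NumPy sketch of this recurrence with two matrix products per core, in the same list-of-cores representation as the earlier examples (function name illustrative; the parallel version described next adds an All-Reduce per mode):

```python
import numpy as np

def tt_inner(xc, yc):
    """<X, Y> via the recurrence (3.1), evaluated with two GEMMs per core:
    H(Z_n) = W_{n-1} H(X_n), then W_n = V(Y_n)^T V(Z_n)."""
    W = np.ones((1, 1))                                           # W_0 = 1
    for x, y in zip(xc, yc):
        rx0, i, rx1 = x.shape
        ry0, _, ry1 = y.shape
        Z = (W @ x.reshape(rx0, i * rx1)).reshape(ry0, i, rx1)    # GEMM 1
        W = (y.reshape(ry0 * i, ry1, order='F').T
             @ Z.reshape(ry0 * i, rx1, order='F'))                # GEMM 2
    return W[0, 0]                                                # final W is 1 x 1
```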
When the input TT tensors are distributed across processors as described in Subsection 3.1, we can compute the inner product using this technique. Each term in the summation of (3.1), which involves corresponding slices of the input tensors, is evaluated by a single processor as long as the matrix W_n is available on each processor. Thus, the computation can be load balanced across processors as long as the distribution is load balanced, and each processor can apply the optimization to reduce BLAS calls independently. We perform an All-Reduce collective operation to compute the summation for each mode. With constant tensor dimensions and TT ranks, the computational cost is approximately 4 N I R^3 / P and the communication cost is β · O(N R^2) + α · O(N log P).
3.2.4. Norms. To compute the norm of a tensor in TT format, we consider two approaches. The first approach is to use the inner product algorithm described in Subsection 3.2.3 and the identity ‖X‖^2 = ⟨X, X⟩. We note that in this case the matrices {W_n} are symmetric and positive semi-definite (see (3.1)), and the structured matrix-vector products can exploit this property to save roughly half the computation. Since W_n is SPSD, it admits a triangular factorization given by pivoted Cholesky (or LDL^⊤): W_n = P_n L_n L_n^⊤ P_n^⊤. Thus, the matrix W_n is computed as W_n = V(T_{Z,n})^⊤ V(T_{Z,n}), where H(T_{Z,n}) = L_{n-1}^⊤ (P_{n-1}^⊤ H(T_{X,n})). The triangular multiplication to compute the nth core of Z and the symmetric multiplication to compute W_n each require half the flops of a general matrix multiplication, so the overall computational complexity of this approach is 2 N I R^3. It is parallelized in the same way as the general inner product.
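A sequential NumPy sketch of this symmetric variant follows; for simplicity it uses a plain (unpivoted) Cholesky factor and therefore assumes each W_n is positive definite, whereas the text above uses a pivoted Cholesky or LDL^⊤ factorization to cover the semi-definite case. The function name is ours.

```python
import numpy as np

def tt_norm_sym(cores):
    """||X|| via W_n = V(Z_n)^T V(Z_n) with H(Z_n) = L_{n-1}^T H(X_n),
    where W_{n-1} = L_{n-1} L_{n-1}^T (unpivoted Cholesky used here)."""
    L = np.ones((1, 1))                                       # Cholesky factor of W_0 = 1
    W = np.ones((1, 1))
    for x in cores:
        r0, i, r1 = x.shape
        Z = (L.T @ x.reshape(r0, i * r1)).reshape(r0, i, r1)  # triangular multiply
        Vz = Z.reshape(r0 * i, r1, order='F')
        W = Vz.T @ Vz                                         # Gram (SYRK-style) update
        L = np.linalg.cholesky(W)
    return float(np.sqrt(W[0, 0]))                            # final W is 1 x 1 = ||X||^2
```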
The second approach is to first right- or left-orthogonalize the tensor using Algorithm 2.1, after which the norm of the tensor is given by ‖T_{X,1}‖_F or ‖T_{X,N}‖_F, respectively, as shown in Subsection 2.3. When the TT tensor is distributed, the orthogonalization procedure is more complicated than computing inner products; we describe the parallel algorithm in Subsection 3.4.
3.2.5. Matrix-Vector Multiplication. In order to build Krylov-like iterative methods to solve linear systems with solutions in TT format, we must also be able to apply a matrix operator to a vector in TT format. We consider a restricted set of matrix operators: sums of Kronecker products of matrices [10, 24, 26, 39].
Each term in the sum can be seen as a generalization of a rank-one tensor to the operator case. We use the notation

A = A_1 ⊗ ··· ⊗ A_N

to denote a single Kronecker product of matrices, where the dimensions of A_n are I_n × I_n, conforming to the dimensions of X in TT format. In this case, we can compute the matrix-vector multiplication vec(Y) = A · vec(X), where

Y(i_1, ..., i_N) = Σ_{j_1,...,j_N} A_1(i_1, j_1) ··· A_N(i_N, j_N) · X(j_1, ..., j_N)
  = Σ_{j_1,...,j_N} A_1(i_1, j_1) ··· A_N(i_N, j_N) · T_{X,1}(j_1, :) ··· T_{X,N}(:, j_N)
  = ( Σ_{j_1} A_1(i_1, j_1) T_{X,1}(j_1, :) ) ··· ( Σ_{j_N} A_N(i_N, j_N) T_{X,N}(:, j_N) )
  = T_{Y,1}(i_1, :) ··· T_{Y,N}(:, i_N),

with T_{Y,1} = A_1 T_{X,1}, T_{Y,n} = T_{X,n} ×_2 A_n for 1 < n < N, and T_{Y,N} = T_{X,N} A_N^⊤. Here the notation ×_2 refers to the mode-2 tensor-matrix product, defined so that

T_{Y,n}(r_{n-1}, :, r_n) = A_n T_{X,n}(r_{n-1}, :, r_n)

for 1 < n < N, 1 ≤ r_{n-1} ≤ R_{n-1}, and 1 ≤ r_n ≤ R_n.
Thus, applying a Kronecker product of matrices to a vector in TT-format main-
tains the TT-format with the same ranks, and operations on cores can be performed
independently. In order to apply an operator that is a sum of multiple Kronecker
products of matrices, we can apply each term separately and use the summation pro-
cedure described in Subsection 3.2.1 along with TT-rounding to control rank growth.
We note that it is possible to apply more general forms of tensorized operators to
vectors in TT-format [30], but we do not consider them here.
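A minimal NumPy sketch of applying a single Kronecker-product term core by core; einsum is used here for brevity and is our illustration rather than the paper's implementation, which uses the slice-wise GEMM strategy of Subsection 3.1 and, for sparse A_n, the SpMM-style communication discussed next.

```python
import numpy as np

def apply_kron_operator(mats, cores):
    """Apply A = A_1 (x) ... (x) A_N to a TT vector: each core is multiplied
    by its own A_n in mode 2 (equivalent to A_1 T_{X,1} and T_{X,N} A_N^T for
    the boundary cores in this unit-boundary-rank representation).
    TT ranks are unchanged."""
    out = []
    for A, core in zip(mats, cores):
        # mode-2 product: out[a, i, b] = sum_j A[i, j] * core[a, j, b]
        out.append(np.einsum('ij,ajb->aib', A, core))
    return out
```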
When the vector in TT format is distributed as described in Subsection 3.1, we must perform the mode-2 tensor-matrix product using a parallel algorithm. We can view the mode-2 tensor-matrix product as applying the matrix to the mode-2 unfolding of the tensor core T_{X,n} (often denoted with subscript (2) [22]), which has dimensions I_n × R_{n-1}R_n. We observe that the parallel distribution of the mode-2 unfolding of T_{X,n} is 1D row-distributed: each processor owns a subset of the rows of the matrix (corresponding to slices of the core tensor). Thus, the application of A_n to this unfolding has the same algorithmic structure as the sparse-matrix-times-multiple-vectors operation (SpMM) where all vectors have the same parallel distribution. Assuming the matrix A_n is sparse and also row-distributed, as is common in libraries such as PETSc [4] and Trilinos [17], the parallel algorithm involves communication of input tensor core slices among processors, where the communication pattern is determined by A_n and its distribution. We do not explore experimental results for such matrix-vector multiplications in this paper, as the performance depends heavily on the application and sparsity structure of the operator matrices.
3.3. TSQR. To compute the QR factorizations within the TT-rounding proce-
dure in parallel, we use the Tall-Skinny QR algorithm [12], which is designed (and
communication efficient) for matrices with many more rows than columns. For com-
pleteness, we present the TSQR subroutine as Algorithm 3.1, which corresponds to
[5, Alg. 7], and the TSQR-Apply-Q subroutine as Algorithm 3.2. The subroutines
assume a power-of-two number of processors to simplify the pseudocode; see Appen-
dix B for the generalizations to any number of processors.
For a tall-skinny matrix that is 1D row distributed over processors (as is the
case for the vertical unfolding and the transpose of the horizontal unfolding), the
parallel Householder QR algorithm requires synchronizations for each column of the
matrix (to compute and apply each Householder vector). The idea of the TSQR
algorithm is that the entire factorization can be computed using a single reduction
across processors. The price of this reduction is that the implicit representation of
the orthogonal factor is more complicated than a single set of Householder vectors,
and that the representation depends on the structure of the reduction tree. We can
maintain and apply the orthogonal factor in this implicit form as long as the parallel
algorithm for applying it uses a consistent tree structure. We note that we employ
the “butterfly” variant of TSQR, which corresponds to an all-reduce-like collective
operation such that at the end of the algorithm the triangular factor R is owned by
all processors redundantly. Another variant uses a binomial tree, corresponding to
a reduce-like collective with the triangular factor owned by a single processor. We
compare performance of these two variants in Subsection 4.2.1.
3.3.1. Factorization. TSQR (Algorithm 3.1) has two phases: orthogonalization of the local submatrix (the leaf QR in line 3 of Algorithm 3.1) and a parallel reduction of the remaining triangular factors (the tree loop in lines 4 to 12). The cost of TSQR is

(3.2)  γ · ( 2 m b^2 / P + O(b^3 log P) ) + β · O(b^2 log P) + α · O(log P),

where m is the number of rows and b is the number of columns [12]. The leading-order flop cost is the (Householder) QR of the local (m/P) × b submatrix, the leaf of the TSQR tree. The communication costs come from the TSQR tree, which has height O(log P).
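The following serial NumPy sketch illustrates the TSQR idea behind Algorithm 3.1 (shown below) with a single-level reduction: each "processor" factors its row block locally and only the small b × b triangular factors are combined. It is an illustration only, not the paper's butterfly MPI implementation, and the function name is ours.

```python
import numpy as np

def tsqr_flat(A, P=4):
    """One-level TSQR illustration: local QR of P row blocks, then one QR of
    the stacked triangular factors. The parallel algorithm replaces the second
    step with a log2(P)-deep (binomial or butterfly) reduction tree and keeps
    Q in implicit form."""
    blocks = np.array_split(A, P, axis=0)
    local = [np.linalg.qr(block) for block in blocks]        # leaf QRs
    R_stack = np.vstack([R for _, R in local])               # P*b x b stacked factors
    Q2, R = np.linalg.qr(R_stack)                            # reduction step
    b = A.shape[1]
    # Explicit Q, formed here only to check A = Q R; the parallel code applies
    # Q implicitly via TSQR-Apply-Q (Algorithm 3.2).
    Q = np.vstack([Qi @ Q2[k * b:(k + 1) * b] for k, (Qi, _) in enumerate(local)])
    return Q, R

# quick check on a random tall-skinny matrix
A = np.random.default_rng(0).standard_normal((1000, 40))
Q, R = tsqr_flat(A)
assert np.allclose(Q @ R, A)
```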
Algorithm 3.1 Parallel Butterfly TSQR
Require: A is an m × b matrix 1D-distributed so that proc p owns row block A^(p)
Require: Number of procs is a power of two; see Algorithm B.1 for the general case
Ensure: A = QR with R owned by all procs and Q represented by {Y_ℓ^(p)} with redundancy Y_ℓ^(p) = Y_ℓ^(q) for p ≡ q (mod 2^ℓ) and ℓ < log P
1: function [{Y_ℓ^(p)}, R] = Par-TSQR(A^(p))
2:   p = MyProcID()
3:   [Y_{log P}^(p), R̄_{log P}^(p)] = Local-QR(A^(p))                        ⊲ Leaf node QR
4:   for ℓ = log P − 1 down to 0 do
5:     j = 2^{ℓ+1} ⌊p / 2^{ℓ+1}⌋ + (p + 2^ℓ) mod 2^{ℓ+1}                      ⊲ Determine partner
6:     Send R̄_{ℓ+1}^(p) to and receive R̄_{ℓ+1}^(j) from proc j               ⊲ Communication
7:     if p < j then
8:       [Y_ℓ^(p), R̄_ℓ^(p)] = Local-QR([ R̄_{ℓ+1}^(p) ; R̄_{ℓ+1}^(j) ])       ⊲ Tree node QR
9:     else
10:      [Y_ℓ^(p), R̄_ℓ^(p)] = Local-QR([ R̄_{ℓ+1}^(j) ; R̄_{ℓ+1}^(p) ])       ⊲ Partner tree node QR
11:    end if
12:  end for
13:  R = R̄_0^(p)
14: end function

3.3.2. Applying and Forming Q. The structure of the TSQR-Apply-Q algorithm (Algorithm 3.2) matches that of TSQR, but in reverse order (because the TSQR algorithm corresponds to applying Q^⊤). Thus, the root of the tree is applied first and the leaves last. However, by using a butterfly tree, the communication cost of the TSQR-Apply-Q algorithm (Algorithm 3.2) is zero if the number of processors is a
power of two and β · b c + α otherwise (the cost of one message; see Appendix B). The cost of TSQR-Apply-Q is then

(3.3)  γ · ( 4 m b c / P + O(b^2 c log P) ) + β · b c + α,

where the additional parameter c is the number of columns of C. The leading-order flop cost is the application of the local Q matrix at the leaf of the TSQR tree (line 12 of Algorithm 3.2).
Using a binomial tree TSQR algorithm requires more communication in the application phase (see [5, Algorithm 8], for example). We also note that if the input matrix C is upper triangular, then the leading constant can be reduced from 4 to 2 by exploiting the sparsity structure in the local application (and within the tree, because all B̄_ℓ^(p) matrices remain upper triangular throughout the algorithm in this case), which matches the computation cost of the factorization. In particular, when we form Q explicitly, we can use this algorithm with C equal to the identity matrix, which is upper triangular.
Algorithm 3.2 Parallel Application of Implicit Q from Butterfly TSQR
Require: {Y_ℓ^(p)} represents the orthogonal matrix Q computed by Algorithm 3.1
Require: C is b × c and redundantly owned by all processors
Require: Number of procs is a power of two; see Algorithm B.2 for the general case
Ensure: B = Q [C ; 0] is m × c and 1D-distributed so that proc p owns row block B^(p)
1: function B = Par-TSQR-Apply-Q({Y_ℓ^(p)}, C)
2:   p = MyProcID()
3:   B̄_0^(p) = C
4:   for ℓ = 0 to log P − 1 do
5:     j = 2^{ℓ+1} ⌊p / 2^{ℓ+1}⌋ + (p + 2^ℓ) mod 2^{ℓ+1}                      ⊲ Determine partner
6:     if p < j then
7:       [ B̄_{ℓ+1}^(p) ; B̄_{ℓ+1}^(j) ] = Loc-Apply-Q([ I_b ; Y_ℓ^(p) ], [ B̄_ℓ^(p) ; 0 ])   ⊲ Tree node apply
8:     else
9:       [ B̄_{ℓ+1}^(j) ; B̄_{ℓ+1}^(p) ] = Loc-Apply-Q([ I_b ; Y_ℓ^(p) ], [ B̄_ℓ^(p) ; 0 ])   ⊲ Partner tree node apply
10:    end if
11:  end for
12:  B^(p) = Loc-Apply-Q(Y_{log P}^(p), [ B̄_{log P}^(p) ; 0 ])                ⊲ Leaf node apply
13: end function

3.4. TT Orthogonalization. Algorithm 3.3 shows right orthogonalization and is a parallelization of Algorithm 2.1. The approach for left orthogonalization is analogous. The algorithm is performed via a sequential sweep over the cores, where at each iteration, an LQ factorization row-orthogonalizes the horizontal unfolding of a core and the triangular factor is applied to its left neighbor core. The 1D parallel distribution of each core implies that the transpose of the horizontal unfolding is 1D row distributed, fitting the requirements of the TSQR algorithm. Note that we perform a QR factorization of the transpose of the horizontal unfolding, which corresponds to an LQ factorization of the unfolding itself.
Figure 3.2 depicts the operations within a single iteration of the sweep. At iteration n, TSQR is applied to the nth core in line 3 of Algorithm 3.3 (Figure 3.2b), and then the orthogonal factor is formed explicitly in line 4 (Figure 3.2c). The notation {Y_{ℓ,n}^(p)} signifies the set of triangular matrices owned by processor p in the implicit representation of the QR factorization of the nth core, where ℓ refers to the level of the tree and indexes the set. In the case that P is a power of 2, each processor owns log P matrices in its set. Because the TSQR subroutine ends with all processors owning the triangular factor R_n, each processor can apply it to core n−1 in the 3rd mode without further communication via a local matrix multiplication in line 5 (Figure 3.2d).
Lines 3 and 4 of Algorithm 3.3 have the costs given by (3.2) and (3.3) with m = I_n R_n and b = c = R_{n-1}. Since the computation to form the explicit Q matrix exploits the sparsity structure of the identity matrix, the constant 4 in (3.3) is reduced to 2. These two lines together cost

γ · ( 4 I_n R_n R_{n-1}^2 / P + O(R_{n-1}^3 log P) ) + β · O(R_{n-1}^2 log P) + α · O(log P).

Line 5 of Algorithm 3.3 is a local triangular matrix multiplication that costs γ · I_{n-1} R_{n-2} R_{n-1}^2 / P. Assuming I_k = I and R_k = R for 1 ≤ k ≤ N−1, the total cost of TT orthogonalization is then

(3.4)  γ · ( 5 N I R^3 / P + O(N R^3 log P) ) + β · O(N R^2 log P) + α · O(N log P).
Algorithm 3.3 Parallel TT-Right-Orthogonalization
Require: X in TT format with each core 1D-distributed
Ensure: X is right orthogonal, in TT format with the same distribution
1: function Par-TT-Right-Orthogonalization({T_{X,n}^(p)})
2:   for n = N down to 2 do
3:     [{Y_{ℓ,n}^(p)}, R_n] = TSQR(H(T_{X,n}^(p))^⊤)                      ⊲ QR factorization
4:     H(T_{X,n}^(p))^⊤ = TSQR-Apply-Q({Y_{ℓ,n}^(p)}, I_{R_{n-1}})        ⊲ Form explicit Q
5:     V(T_{X,n-1}^(p)) = V(T_{X,n-1}^(p)) · R_n^⊤                        ⊲ Apply R to previous core
6:   end for
7: end function
Fig. 3.2: Steps performed in TT right orthogonalization: (a) two consecutive cores; (b) QR factorization of H(T_{X,n}); (c) update of the nth core; (d) update of the (n−1)th core.
3.5. TT Rounding. We present the parallel TT rounding procedure in Algo-
rithm 3.4, which is a parallelization of Algorithm 2.2. The computation consists of
two sweeps over the cores, one to orthogonalize and one to truncate. The algorithm
shown performs right-orthogonalization and then truncates left to right, and the other
ordering works analogously.
Algorithm 3.4 does not call Algorithm 3.3 to perform the orthogonalization sweep.
This is because Algorithm 3.3 forms the orthogonalized cores explicitly, and Algo-
rithm 3.4 can leave the orthogonalized cores from the first sweep in implicit form to
be applied during the second sweep.
Iteration n of the right-to-left orthogonalization sweep occurs in lines 2 to 5 of Algorithm 3.4, which match Algorithm 3.3 except for the explicit formation of the orthogonal factor.
Algorithm 3.4 Parallel TT-Rounding
Require: X in TT format with each core 1D-distributed over a 1 × P × 1 processor grid
Ensure: Y in TT format with reduced ranks, identically distributed across processors
1: function {T_{Y,n}^(p)} = Par-TT-Rounding({T_{X,n}^(p)}, ε)
2:   for n = N down to 2 do
3:     [{Y_{ℓ,n}^(p)}, R_n] = TSQR(H(T_{X,n}^(p))^⊤)                      ⊲ QR factorization
4:     V(T_{X,n-1}^(p)) = V(T_{X,n-1}^(p)) · R_n^⊤                        ⊲ Apply R to previous core
5:   end for
6:   Compute ‖X‖
7:   Y = X
8:   for n = 1 to N−1 do
9:     [{Y_{ℓ,n}^(p)}, R_n] = TSQR(V(T_{Y,n}^(p)))                        ⊲ QR factorization
10:    [Û_R, Σ̂, V̂] = tSVD(R_n, ε ‖X‖ / √(N−1))                           ⊲ Redundant truncated SVD of R
11:    V(T_{Y,n}^(p)) = TSQR-Apply-Q({Y_{ℓ,n}^(p)}, Û_R)                  ⊲ Form explicit Û
12:    H(T_{Y,n+1}^(p))^⊤ = TSQR-Apply-Q({Y_{ℓ,n+1}^(p)}, V̂ Σ̂)           ⊲ Apply Σ̂ V̂^⊤
13:  end for
14: end function
Thus, the cost of the orthogonalization sweep is

(3.5)  γ · ( 3 N I R^3 / P + O(N R^3 log P) ) + β · O(N R^2 log P) + α · O(N log P).

At iteration n of the second loop, lines 9 to 12 of Algorithm 3.4 implement the left-to-right truncation procedure for the nth core in parallel. Line 9 is a QR factorization and has cost given by Equation (3.2) with m = I_n L_{n-1} and b = R_n, as the number of rows of V(T_{Y,n}^(p)) has been reduced from I_n R_{n-1} to I_n L_{n-1} during iteration n−1:

γ · ( 2 I_n L_{n-1} R_n^2 / P + O(R_n^3 log P) ) + β · O(R_n^2 log P) + α · O(log P).

We note that we re-use the notation {Y_{ℓ,n}^(p)} to store the implicit factorization; while the same variable stored the orthogonal factor of the nth core's horizontal unfolding from the orthogonalization sweep, it can be overwritten by this step of the algorithm (the set of matrices will now have different dimensions). Line 10 requires O(R_n^3) flops, assuming the full SVD is computed before truncating. Line 11 implicitly applies an orthogonal matrix to an R_n × L_n matrix Û_R, with cost given by Equation (3.3) with m = I_n L_{n-1}, b = R_n, and c = L_n:

γ · ( 4 I_n L_{n-1} R_n L_n / P + O(R_n^2 L_n log P) ) + β · R_n L_n + α.

Line 12 implicitly applies an orthogonal matrix to an R_n × L_n matrix V̂ Σ̂, with cost given by Equation (3.3) with m = I_{n+1} R_{n+1}, b = R_n, and c = L_n:

γ · ( 4 I_{n+1} R_{n+1} R_n L_n / P + O(R_n^2 L_n log P) ) + β · R_n L_n + α.
Fig. 3.3: Steps performed in an iteration of the TT left-to-right truncation: (a) two consecutive cores; (b) QR factorization of V(T_{X,n}); (c) truncated SVD of R; (d) update of the nth core; (e) update of the (n+1)th core.
Assuming I_k = I, R_k = R, and L_k = L for 1 ≤ k ≤ N−1, the total cost of Algorithm 3.4 is then

(3.6)  γ · ( N I R (3R^2 + 6RL + 4L^2) / P + O(N R^3 log P) ) + β · O(N R^2 log P) + α · O(N log P).

We note that leaving the orthogonal factors in implicit form during the orthogonalization sweep (as opposed to calling Algorithm 3.3) saves up to 40% of the computation,
when the reduced ranks L_n are much smaller than the original ranks R_n. As the rank reduction diminishes, so does the advantage of the implicit optimization. For example, when the ranks are all cut in half, the reduction in leading-order flop cost is 12.5%.
4. Numerical Experiments. In this section we present performance results
for TT computations using synthetic tensors with mode and dimension parameters
inspired by physics and chemistry applications, as described in Subsection 4.1. We
first present microbenchmarks in Subsection 4.2 to justify key design decisions, and
then demonstrate performance efficiency and parallel scaling in Subsection 4.3.
All numerical experiments are run on the Max Planck Society supercomputer COBRA. All computation nodes contain two Intel Xeon Gold 6148 processors (Skylake,
20 cores each at 2.4 GHz) and 192 GB of memory, and the nodes are connected through
a 100 Gb/s OmniPath interconnect. We link to MKL 2020.1 for single-threaded BLAS
and LAPACK subroutines.
4.1. Synthetic TT Models. As we are interested in large-scale systems, we consider two contexts of applications in which a large number of modes exists. The first context is the existence of many modes, each of relatively the same (large) dimension, and the second context is a single mode or a few modes with large dimension along with many modes of relatively smaller dimension. Table 4.1 presents the details of the three models of synthetic tensors we use in the experiments, in order of their memory size. The first and third models correspond to the first context (all modes of the same dimension) and the second model corresponds to the second context (two large modes and many more smaller modes). The first model is chosen to be small enough to be processed by a single core, while the second and third are larger and benefit more from distributed-memory parallelization (the third does not fit in the memory of a single node). The paragraphs below describe the applications that inspire these choices of modes and dimensions.

Table 4.1: Synthetic TT models used for performance experiments. In each case the formal ranks are all the same and are cut in half by the TT rounding procedure.

Model | # Modes | Dimensions                     | Ranks | Memory
1     | 50      | 2K × ··· × 2K                  | 50    | 2 GB
2     | 16      | 100M × 50K × ··· × 50K × 1M    | 30    | 28 GB
3     | 30      | 2M × ··· × 2M                  | 30    | 385 GB
In all experiments, we generate a random TT tensor X with a given number of modes N, mode sizes I_n for n = 1, ..., N, and TT ranks R^X_n for n = 1, ..., N−1. Then, we form the TT tensor Y = X + X, which represents 2X and has the formal ranks R^Y_n = 2 R^X_n for n = 1, ..., N−1. The algorithms are then applied to the TT tensor Y. Note that the minimal TT ranks of Y are less than or equal to the TT ranks of X.
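For reference, this construction can be reproduced at small scale with a sketch like the following (NumPy, list-of-cores representation with unit boundary ranks as in the earlier examples; the dimensions shown are illustrative, not those of Table 4.1):

```python
import numpy as np

def random_tt(dims, rank, seed=0):
    """Random TT tensor with the given mode sizes and a constant TT rank."""
    rng = np.random.default_rng(seed)
    ranks = [1] + [rank] * (len(dims) - 1) + [1]
    return [rng.standard_normal((ranks[n], dims[n], ranks[n + 1]))
            for n in range(len(dims))]

# X has TT ranks R; forming Y = X + X (e.g., with the addition sketch of
# Subsection 3.2.1) gives formal ranks 2R, which TT rounding should cut
# roughly in half, as in the experiments.
X = random_tt([100] * 10, 8)
```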
High-Order Correlation Functions. In the study of stochastic processes, Gaussian random fields are widely used. If f is a Gaussian random field defined on a bounded domain Ω ⊂ R^N, an N-point correlation function for f is defined on Ω^N. These N-point correlation functions can often be efficiently approximated in TT format via a cross approximation algorithm [24]. Typically, cross approximation algorithms induce larger ranks. Thus, compressing the resulting TT tensors is required to maintain the tractability of computations.
Molecular Simulations. Another important class of applications is molecular sim-
ulations. For example, when a spin system can be considered as a weakly branched
linear chain, it is typical to represent it as a TT tensor [34]. Each branch is then
considered as a spatial coordinate (mode). The number of branches can be arbitrarily
large; for example, a simple backbone protein may have hundreds of branches. The
TT representation is then inherited from the weak correlation between the branches.
However, in the same branch, the correlation between spins cannot be ignored, and
thus the exponential growth in the number of states cannot be avoided.
Parameter-Dependent PDEs. In the second context, one or a few modes may
be much larger than the rest. This is typically the case in physical applications
such as parameter dependent PDEs, stochastic PDEs, uncertainty quantification, and
optimal control systems [7, 8, 9, 13, 18, 26, 32]. In such applications, the spatial
discretization leads to a high number of degrees of freedom. This typically results
from large domains, refinement procedures, and a large number of parameter samples.
Most of the other modes correspond to control or uncertainty parameters and can have relatively smaller dimension.
4.2. Microbenchmarks. We next present experimental results for microbench-
marks to justify our choices for subroutine algorithms and optimizations. The re-
sults presented in Subsection 4.3 use the best-performing variants and optimizations
demonstrated in this section.
4.2.1. TSQR. As discussed in Subsection 3.3, the TSQR algorithm depends on a hierarchical tree. Two tree choices are commonly used in practice, the binomial tree and the butterfly tree. In both cases TSQR computes the QR decomposition with the same computation and communication costs along the critical path, whereas the butterfly tree requires less communication along the critical path when applying the implicit orthogonal factor.
Here we compare the performance of the TSQR algorithm using the binomial and butterfly trees for both the factorization and a single application of the orthogonal factor. Since the difference in their costs is solely related to the number of columns, we fix the number of rows in the comparison and vary the number of columns. Figure 4.1 reports the breakdown of time of the variants using 256 nodes with 4 MPI processes per node (2 cores per socket). The local matrix size on each processor is 1,000 × b, where b varies in {40, 80, 120, 160}. We observe that the butterfly tree has better performance in terms of communication time in the application phase. Note that the factorization runtime (computation and communication) is relatively the same for both variants.
We also time the cost of communication of the triangular factor R, which is required
of the binomial variant in the context of TT-rounding, but that cost is negligible in
these experiments.
Based on these results (and corroborating experiments with various other param-
eters), we use the butterfly variant of TSQR for TT computations that require TSQR
in all subsequent numerical experiments.
4.2.2. TT Rounding. In this section, we consider 4 variants of TT round-
ing (Algorithm 3.4), based on the orthogonalization/truncation ordering and the
use of the implicit orthogonal factor optimization. As discussed in Subsection 2.4,
the rounding procedure can perform right- or left- orthogonalization followed by
a truncation phase in the opposite direction. We refer to the ordering based on
right-orthogonalization and left-truncation as RLR and the ordering based on left-
orthogonalization and right-truncation as LRL. The implicit optimization avoids the
Fig. 4.1: Time breakdown for TSQR variants (binomial vs. butterfly) for a 1,024,000 × b matrix over 1024 processors, including both the factorization and the application of the orthogonal factor to a dense b × b matrix, for b ∈ {40, 80, 120, 160}.
explicit formation of orthogonal factors during the orthogonalization phase; instead of
using Algorithm 3.3 as a black-box subroutine, Algorithm 3.4 leaves orthogonal factors
in implicit TSQR form as much as possible, saving a constant factor of computation
(and a small amount of communication).
Although the asymptotic complexities of the variants of the rounding procedure are equal, their performance is not the same. The disparity between the RLR and LRL orderings is due to the performance difference between the QR and LQ implementations of the LAPACK subroutines provided by MKL. Despite the same computational complexity, the QR subroutines have much better performance than the LQ subroutines.
In the LRL ordering, a sequence of calls to the QR subroutine is performed on the vertically unfolded TT cores T_{X,n} with the increased ranks R_{n-1}, R_n. Along the truncation sweep, the LQ subroutine is called in a sequence to factor the horizontally unfolded TT cores T_{X,n} with one reduced rank, R_{n-1}, L_n. As presented in Subsections 3.4 and 3.5, the RLR ordering employs the QR and LQ subroutines in the opposite order. Because the truncation phase involves less computation within local QR/LQ subroutine calls than the orthogonalization phase, the LRL ordering has the advantage that it spends less computation time in LQ subroutine calls than the RLR ordering.
The effect of the implicit optimization is a reduction in computation (approxi-
mately 12.5% in these experiments) and communication, but this advantage is offset
in part by the performance of local subroutines. The implicit application of the or-
thogonal factor involves auxiliary LAPACK routines for applying sets of Householder
[Figure 4.2 shows time in seconds for the LRLI, LRL, RLRI, and RLR variants: panel (a) Model 2, panel (b) Model 3.]
Fig. 4.2: Performance comparison of TT-Rounding variants for large TT models on 32 nodes (1,280 cores). LRL refers to left-orthogonalization followed by right-truncation (vice versa for RLR), and I indicates the use of the implicit optimization.
vectors in various formats. The explicit multiplication of an orthogonal factor by a
small square matrix involves a broadcast and a local call to a matrix multiplication
subroutine, which has much higher performance than the auxiliary routines involving
Householder vectors. We use an "I" to indicate the use of the implicit optimization,
so that the 4 variants are LRLI, LRL, RLRI, and RLR.
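As an illustration of this trade-off, here is a minimal sketch of the two ways of applying the orthogonal factor of a Householder QR, assuming column-major storage; the wrapper names and sizes are illustrative and not taken from the paper's implementation.

#include <vector>
#include <cblas.h>
#include <lapacke.h>

// Q is held implicitly as Householder vectors (a, tau) from dgeqrf of an m x b matrix.
// Implicit path: apply the Householder vectors directly to an m x c matrix whose
// top b x c block is C and whose remaining rows are zero (fewer flops, but uses
// the slower auxiliary LAPACK routines).
void apply_q_implicit(std::vector<double>& a, std::vector<double>& tau,
                      lapack_int m, lapack_int b,
                      std::vector<double>& bmat, lapack_int c) {
  LAPACKE_dormqr(LAPACK_COL_MAJOR, 'L', 'N', m, c, b,
                 a.data(), m, tau.data(), bmat.data(), m);
}

// Explicit path: form the m x b orthogonal factor once, then use GEMM
// (more flops, but GEMM typically runs at a higher rate).
void apply_q_explicit(std::vector<double> a, std::vector<double>& tau,
                      lapack_int m, lapack_int b,
                      const std::vector<double>& cmat, lapack_int c,
                      std::vector<double>& result) {
  LAPACKE_dorgqr(LAPACK_COL_MAJOR, m, b, b, a.data(), m, tau.data());
  // result (m x c) = Q (m x b) * C (b x c)
  cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, m, c, b,
              1.0, a.data(), m, cmat.data(), b, 0.0, result.data(), m);
}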
Figure 4.2 presents the performance results for TT Models 2 and 3 running on 256
nodes. We see that for both models, the LRL ordering with the implicit optimization
(LRLI) is the fastest. In the case of Model 2, the implicit optimization makes more
of a difference than the ordering. This is because a considerable amount of time is
spent in the first mode, where the QR is used (once) in either ordering. In the case of
Model 3, the ordering makes a much larger difference in running time, as the internal
modes dominate the running time and the QR/LQ difference has a large effect. The
implicit optimization still improves performance, but it has less of an effect than the
ordering. Based on these results, we use the LRLI variant of TT-rounding in all the
experiments presented in Subsection 4.3.
4.3. Parallel Scaling.
4.3.1. Norms. In this section we compare the performance and parallel scaling
of three different algorithms for computing the norm of a TT tensor as discussed
in Subsection 3.2.4. We focus on this computation because the multiple approaches
represent the performance of algorithms for computing inner products and orthogo-
nalization, which are essential on their own in other contexts. We use “Ortho” to
denote the approach of first right- or left-orthogonalizing the TT tensor and then
(cheaply) computing the norm of the first or last core, respectively. Thus, Ortho per-
formance represents that of Algorithm 3.3. The name “InnPro” refers to the approach
of computing the inner product of the TT tensor with itself, and “InnPro-Sym” in-
cludes the optimization that exploits the symmetry in the inner product to save up
to half the computation. InnPro captures the performance of the algorithm described
in Subsection 3.2.3 for general TT inner products as well.
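For intuition about the InnPro kernel, the following is a minimal sketch of one step of a distributed TT inner product, assuming each processor owns a contiguous slab of the mode dimension (with the full rank dimensions for its slab) and that the small partial results are combined with one all-reduce per mode; the function and variable names are ours, and the exact schedule of Algorithm 3.2 and Subsection 3.2.3 may differ.

#include <vector>
#include <mpi.h>
#include <cblas.h>

// One step of a distributed TT inner product <X, Y> (illustrative sketch).
// w is the running R^x_{n-1} x R^y_{n-1} contraction matrix, replicated on all procs.
// xloc, yloc hold each processor's local slice of cores X_n and Y_n stored in the
// natural column-major order (rank, local mode index, rank), so the horizontal and
// vertical unfoldings used below share the same memory.
std::vector<double> innerprod_step(const std::vector<double>& w,
                                   const std::vector<double>& xloc, int rx0, int rx1,
                                   const std::vector<double>& yloc, int ry0, int ry1,
                                   int iloc, MPI_Comm comm) {
  // tmp (ry0 x (iloc*rx1)) = W^T * X_loc
  std::vector<double> tmp(ry0 * iloc * rx1);
  cblas_dgemm(CblasColMajor, CblasTrans, CblasNoTrans,
              ry0, iloc * rx1, rx0,
              1.0, w.data(), rx0, xloc.data(), rx0, 0.0, tmp.data(), ry0);
  // local (rx1 x ry1): contract tmp with Y_loc over the combined (ry0*iloc) index
  std::vector<double> local(rx1 * ry1, 0.0), global(rx1 * ry1);
  cblas_dgemm(CblasColMajor, CblasTrans, CblasNoTrans,
              rx1, ry1, ry0 * iloc,
              1.0, tmp.data(), ry0 * iloc, yloc.data(), ry0 * iloc,
              0.0, local.data(), rx1);
  // combine the small partial results across processors
  MPI_Allreduce(local.data(), global.data(), rx1 * ry1,
                MPI_DOUBLE, MPI_SUM, comm);
  return global;  // becomes the replicated W for the next mode
}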
We report parallel scaling and a breakdown of computation and communication
[Figure 4.3 shows, for InnPro, InnPro-Sym, and Ortho: (a) the fraction of time spent in computation versus communication for Model 2 on 1-256 nodes; (b) the same breakdown for Model 3 on 16-256 nodes; (c) parallel scaling (time in seconds versus number of nodes) for Model 2 against perfect scaling; (d) parallel scaling for Model 3 against perfect scaling.]
Fig. 4.3: Time breakdown and parallel scaling of variants for TT norm computation. "Ortho" refers to orthogonalization (followed by computing the norm of a single core), "InnPro" refers to using the inner product algorithm, and "InnPro-Sym" refers to using the inner product algorithm with the symmetric optimization.
for all three algorithms and TT Models 2 and 3 in Figure 4.3. Model 2 can be processed
on a single node, but Model 3 requires 16 nodes to have sufficient memory; we scale
both models up to 256 nodes (10,240 cores). Based on the theoretical analysis (see
Table 3.1), when all tensor dimensions are equal, as in Model 3, Ortho has a
leading-order flop constant of 5, InnPro has a constant of 4, and InnPro-Sym has a
constant of 2. Ortho also requires more complicated TSQR reductions compared to
the All-Reduces performed in InnPro and InnPro-Sym, involving an extra log P factor
in data communicated in theory and slightly less efficient implementations in practice.
In addition, the efficiencies of the local computations differ across approaches: Ortho
is bottlenecked by local QR, InnPro is bottlenecked by local matrix multiplication
(GEMM), and InnPro-Sym is bottlenecked by local triangular matrix multiplication
(TRMM).

                       1 core    20 cores   Par. Speedup   40 cores   Par. Speedup
  TT-Toolbox            15.68      8.34         1.9×         8.752        1.8×
  Our Implementation     9.2       0.44        20.9×         0.27        33.9×
  Speedup                1.7×     18.95×                    32.2×

Table 4.2: Single-node performance results on TT Model 1 and comparison with the MATLAB TT-Toolbox.
Overall, we see that InnPro is typically the best performing approach. The main
factor in its superiority is that its computation is cast as GEMM calls, which are
more efficient than the TRMM and QR subroutines. Although InnPro-Sym performs half
the flops of InnPro, the relative inefficiency of those flops translates to a less than
2× speedup over InnPro for Model 3 and a slight slowdown for Model 2. We also
note that for high node counts, the cost of the LDL^T factorization performed within
InnPro-Sym becomes nonnegligible and begins to hinder parallel scaling.
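For reference, these are the kinds of local BLAS/LAPACK kernels that dominate each approach in our breakdown; the wrapper functions and shapes below are an illustrative sketch, not the code used in the experiments.

#include <vector>
#include <cblas.h>
#include <lapacke.h>

// Dominant local kernel for Ortho: Householder QR of an m x n unfolded core.
void ortho_kernel(std::vector<double>& a, lapack_int m, lapack_int n) {
  std::vector<double> tau(std::min(m, n));
  LAPACKE_dgeqrf(LAPACK_COL_MAJOR, m, n, a.data(), m, tau.data());
}

// Dominant local kernel for InnPro: general matrix multiply.
void innpro_kernel(const std::vector<double>& w, const std::vector<double>& core,
                   std::vector<double>& out, int r, int cols) {
  // out (r x cols) = W (r x r) * core (r x cols)
  cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, r, cols, r,
              1.0, w.data(), r, core.data(), r, 0.0, out.data(), r);
}

// Dominant local kernel for InnPro-Sym: triangular matrix multiply, which halves
// the flops by exploiting symmetry but typically runs at a lower rate than GEMM.
void innpro_sym_kernel(const std::vector<double>& l, std::vector<double>& core,
                       int r, int cols) {
  // core <- L * core in place, where L is the r x r lower-triangular factor of a
  // symmetric (LDL^T-like) factorization of the Gram matrix
  cblas_dtrmm(CblasColMajor, CblasLeft, CblasLower, CblasNoTrans, CblasNonUnit,
              r, cols, 1.0, l.data(), r, core.data(), r);
}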
Based on the breakdown of computation and communication, we see that all
three approaches are able to scale reasonably well because they remain computation
bound up to 256 nodes. For Model 2, we see that communication costs are relatively
higher, as that tensor is much smaller. Note that Ortho scales better than InnPro-
Sym and InnPro, even superlinearly for Model 3, which is due in large part to the
higher flop count and relative inefficiency of the local QRs, allowing it to remain more
computation bound than the alternatives. Overall, these results confirm that the
parallel distribution of TT cores allows for high performance and scalability of the
basic TT operations as described in Subsection 3.2.
4.3.2. TT Rounding.
Single-Node Performance. We compare in this section our implementation of TT
rounding against the MATLAB TT-Toolbox [28] rounding process. Table 4.2 presents
a performance comparison on a single node of COBRA, which has 40 cores available.
We run the experiment on TT Model 1, which is small enough to be processed by a
single core. Because it is written in MATLAB, the TT-Toolbox accesses the available
parallelism only through underlying calls to a multithreaded implementation of BLAS
and LAPACK. However, the bulk of the computation occurs in MATLAB functions
that make direct calls to efficient BLAS and LAPACK subroutines, so it can achieve
relatively high sequential performance.
We observe from Table 4.2 that the single-core performance of the two imple-
mentations is similar, with a 70% speedup from our implementation. The single-core
implementations are employing the same algorithm, and we attribute the speedup to
our lower-level interface to LAPACK subroutines and the ability to maintain implicit
orthogonal factors to reduce computation. The parallel strong scaling differs more
drastically, as expected. The MATLAB implementation, which is not designed for
parallelization, achieves less than a 2×speedup when using 20 or 40 cores. Our par-
allelization, which is designed for distributed-memory systems, also scales very well
on this shared-memory machine, achieving over 20×speedup on 20 cores and 34×
speedup on 40 cores.
Distributed-Memory Strong Scaling. We now present the parallel performance of
TT rounding scaling up to hundreds of nodes (over 10,000 cores). As in the case of
Subsection 4.3.1, we consider Models 2 and 3. Figure 4.4 presents the relative time
breakdown and raw timing numbers for each model. We use the ‘LRLI’ variant of
TT rounding in these experiments per the results of Subsection 4.2.2. As in other
rounding experiments, the ranks are cut in half for each model.
In the time breakdown plots of Figures 4.4a and 4.4b, we distinguish among
TSQR factorization (TSQR), application of orthogonal factors (AppQ), and the rest
of the computation that includes SVDs and triangular multiplication (Other). We
also separate the computation and communication within each category. In the context
of Algorithm 3.4, TSQR, AppQ, and Other each correspond to distinct steps of that
algorithm.
In Figures 4.4c and 4.4d, we plot the strong-scaling raw times on a log scale
compared to perfect scaling (based on the time at the smallest number of nodes). We see
nearly perfect scaling for Model 2 until 128 nodes; time continues to decrease but is
not cut in half when scaling to 256 nodes. The parallel speedups for Model 2
are 97× for 128 nodes and 108× for 256 nodes, compared to performance on 1 node.
In the case of Model 3, we see super-linear scaling, even at 256 nodes. We attribute
this scaling in part to the baseline comparison of 16 nodes, which already involves
parallelization/communication, and in part to local data fitting into higher levels of
cache as the number of processors increases, which particularly helps memory-bound
local computations. We observe a 48×speedup for Model 3, scaling from 16 to 256
nodes.
The time breakdown plots also help to explain the scaling performance. We
see that for Model 2, over 70% of the time is spent in local computation, while for
Model 3, over 90% of the time is computation. Of this computation, the majority is
spent in TSQR, which itself is dominated by the initial local leaf QR computations.
If the rank is reduced by a smaller factor, then relatively more flops will occur in
AppQ. We note that AppQ involves minimal communication because of the use of the
Butterfly TSQR variant. The Other category is dominated by the triangular matrix
multiplication, which achieves higher performance than the LAPACK subroutines
involving orthogonal factors.
5. Conclusions. This work presents the parallel implementation of the basic
computational algorithms for tensors represented in low-rank TT format. Because
most TT computations involve dependence through the train, we specify a data distri-
bution that distributes each core across all processors and show that the computation
and communication costs of our proposed algorithms enable efficiency and scalability
for each core computation. The orthogonalization and rounding procedures for TT
tensors depend heavily on the TSQR algorithm, which is designed to scale well on
architectures with a large number of processors for matrices with highly skewed as-
pect ratios. Our numerical experiments show that our algorithms are indeed efficient
and scalable, outperforming productivity-oriented implementations on a single core
and single node and scaling well to hundreds of nodes (thousands of cores). Thus,
we believe our approach is useful to applications and users who are restricted to a
shared-memory workstation as well as to those requiring the memory and performance
of a supercomputer.
We note that the raw performance of our implementation depends heavily on the
local BLAS/LAPACK implementation and the efficiency of the QR decomposition
and related subroutines. For example, we observe significant performance differences
between MKL’s implementations of QR and LQ subroutines, which caused the LRL
ordering of TT-rounding to outperform RLR. We also observe performance differences
[Figure 4.4 shows: (a) the time fraction breakdown for Model 2 on 1-256 nodes and (b) for Model 3 on 16-256 nodes, split into TSQR Comp, TSQR Comm, AppQ Comp, AppQ Comm, and Other Comp; (c) parallel scaling (time in seconds versus number of nodes) for Model 2 and (d) for Model 3, compared to perfect scaling.]
Fig. 4.4: Time breakdown and parallel scaling of the LRLI variant of TT rounding.
among other subroutines, such as triangular matrix multiplication and general matrix
multiplication, again confirming that simple flop counting (even tracking constants
closely) does not always accurately predict running times.
There do exist limitations of the parallelization approach proposed in this paper.
In particular, modes with small dimensions benefit less from parallelization and can
become bottlenecks if there are too many of them. For example, we see the limits of
scalability with TT Model 2, which has large first and last modes but smaller internal
modes. In fact, the distribution scheme assumes that P ≤ I_n for n = 1, ..., N, and
involves idle processors when that assumption is broken. We also note that TSQR may
not be the optimal algorithm to factor the unfolding, which can happen if two succes-
sive ranks differ greatly and P is large with respect to the original tensor dimensions.
Alternative possibilities to avoid these limitations include cheaper but less accu-
rate methods for the SVD, such as working via the associated Gram matrices or using
randomization. We plan to pursue such strategies in the future, in addition to consid-
ering the case of computing a TT approximation from a tensor in explicit full format.
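As a concrete illustration of the Gram-matrix alternative (not part of the present implementation), one can obtain the singular values and right singular vectors of a tall unfolding A by forming G = A^T A and computing its symmetric eigendecomposition; left singular vectors, if needed, follow by multiplying A by the eigenvectors and rescaling. The sketch below, with illustrative names and shapes, shows the local kernels involved; squaring the condition number makes this less accurate than an SVD of A, which is the trade-off mentioned above.

#include <vector>
#include <cblas.h>
#include <lapacke.h>

// Gram-matrix-based truncation of a tall m x n unfolding A (column-major).
// On return, w holds the eigenvalues of G = A^T A in ascending order; the last
// columns of g (eigenvectors for the largest eigenvalues) span the dominant right
// singular subspace of A, and the singular values are the square roots of w.
int gram_svd(const std::vector<double>& a, lapack_int m, lapack_int n,
             std::vector<double>& g, std::vector<double>& w) {
  g.assign(n * n, 0.0);
  w.assign(n, 0.0);
  // G = A^T A (only the lower triangle is referenced and updated by dsyrk)
  cblas_dsyrk(CblasColMajor, CblasLower, CblasTrans, n, m,
              1.0, a.data(), m, 0.0, g.data(), n);
  // Symmetric eigendecomposition G = V diag(w) V^T; eigenvectors overwrite g
  return LAPACKE_dsyevd(LAPACK_COL_MAJOR, 'V', 'L', n, g.data(), n, w.data());
}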
Given these efficient computational building blocks, the next step is to build scalable
Krylov and alternating-scheme based solvers that exploit the TT format.
REFERENCES
[1] H. Al Daas, Solving linear systems arising from reservoirs modeling, PhD thesis, Inria Paris; Sorbonne Université, UPMC University of Paris 6, Laboratoire Jacques-Louis Lions, Dec. 2018.
[2] W. Austin, G. Ballard, and T. G. Kolda, Parallel tensor compression for large-scale scientific data, in Proceedings of the 30th IEEE International Parallel and Distributed Processing Symposium, May 2016, pp. 912–922.
[3] B. W. Bader, T. G. Kolda, et al., MATLAB Tensor Toolbox version 3.0-dev. Available online, Oct. 2017.
[4] S. Balay, S. Abhyankar, M. F. Adams, J. Brown, P. Brune, K. Buschelman, L. Dalcin, A. Dener, V. Eijkhout, W. D. Gropp, D. Karpeyev, D. Kaushik, M. G. Knepley, D. A. May, L. C. McInnes, R. T. Mills, T. Munson, K. Rupp, P. Sanan, B. F. Smith, S. Zampini, H. Zhang, and H. Zhang, PETSc Web page. https://www.mcs.anl.gov/petsc, 2019.
[5] G. Ballard, J. Demmel, L. Grigori, N. Knight, M. Jacquelin, and H. D. Nguyen, Reconstructing Householder vectors from tall-skinny QR, Journal of Parallel and Distributed Computing, 85 (2015), pp. 3–31.
[6] G. Ballard, A. Klinvex, and T. G. Kolda, TuckerMPI: A parallel C++/MPI software package for large-scale data compression via the Tucker tensor decomposition, ACM Trans. Math. Softw., 46 (2020).
[7] P. Benner, S. Dolgov, A. Onwunta, and M. Stoll, Low-rank solvers for unsteady Stokes–Brinkman optimal control problem with random data, Computer Methods in Applied Mechanics and Engineering, 304 (2016), pp. 26–54.
[8] P. Benner, S. Dolgov, A. Onwunta, and M. Stoll, Low-rank solution of an optimal control problem constrained by random Navier–Stokes equations, International Journal for Numerical Methods in Fluids, 92 (2020), pp. 1653–1678.
[9] P. Benner, S. Gugercin, and K. Willcox, A survey of projection-based model reduction methods for parametric dynamical systems, SIAM Review, 57 (2015), pp. 483–531.
[10] G. Beylkin and M. J. Mohlenkamp, Numerical operator calculus in higher dimensions, Proceedings of the National Academy of Sciences, 99 (2002), pp. 10246–10251.
[11] J. D. Carroll and J.-J. Chang, Analysis of individual differences in multidimensional scaling via an n-way generalization of "Eckart-Young" decomposition, Psychometrika, 35 (1970), pp. 283–319.
[12] J. Demmel, L. Grigori, M. Hoemmen, and J. Langou, Communication-optimal parallel and sequential QR and LU factorizations, SIAM Journal on Scientific Computing, 34 (2012), pp. A206–A239.
[13] S. Dolgov and M. Stoll, Low-rank solution to an optimization problem constrained by the Navier-Stokes equations, SIAM J. Sci. Comput., 39 (2017), pp. A255–A280.
[14] S. Eswar, K. Hayashi, G. Ballard, R. Kannan, M. A. Matheson, and H. Park, PLANC: Parallel low rank approximation with non-negativity constraints, Tech. Rep. 1909.01149, arXiv, 2019.
[15] W. Hackbusch and S. Kühn, A new scheme for the tensor representation, J. Fourier Anal. Appl., 15 (2009), pp. 706–722.
[16] R. A. Harshman, Foundations of the PARAFAC procedure: Models and conditions for an explanatory multimodal factor analysis, Working Papers in Phonetics, 16 (1970), pp. 1–84.
[17] M. A. Heroux, R. A. Bartlett, V. E. Howle, R. J. Hoekstra, J. J. Hu, T. G. Kolda, R. B. Lehoucq, K. R. Long, R. P. Pawlowski, E. T. Phipps, A. G. Salinger, H. K. Thornquist, R. S. Tuminaro, J. M. Willenbring, A. Williams, and K. S. Stanley, An overview of the Trilinos project, ACM Transactions on Mathematical Software, 31 (2005), pp. 397–423.
[18] J. Hesthaven, G. Rozza, and B. Stamm, Certified Reduced Basis Methods for Parametrized Partial Differential Equations, SpringerBriefs in Mathematics, Springer International Publishing, 2015.
[19] P. Jolivet, Domain decomposition methods. Application to high-performance computing, PhD thesis, Université de Grenoble, Oct. 2014.
[20] O. Kaya and B. Uçar, High performance parallel algorithms for the Tucker decomposition of sparse tensors, in 45th International Conference on Parallel Processing (ICPP '16), 2016, pp. 103–112.
[21] B. N. Khoromskij, O(d log N)-quantics approximation of N-d tensors in high-dimensional numerical modeling, Constr. Approx., 34 (2011), pp. 257–280.
[22] T. G. Kolda and B. W. Bader, Tensor decompositions and applications, SIAM Rev., 51 (2009), pp. 455–500.
[23] J. Kossaifi, Y. Panagakis, A. Anandkumar, and M. Pantic, TensorLy: Tensor learning in Python, Tech. Rep. 1610.09555, arXiv, 2018.
[24] D. Kressner, R. Kumar, F. Nobile, and C. Tobler, Low-rank tensor approximation for high-order correlation functions of Gaussian random fields, SIAM/ASA Journal on Uncertainty Quantification, 3 (2015), pp. 393–416.
[25] D. Kressner and L. Periša, Recompression of Hadamard products of tensors in Tucker format, SIAM Journal on Scientific Computing, 39 (2017), pp. A1879–A1902.
[26] D. Kressner and C. Tobler, Krylov subspace methods for linear systems with tensor product structure, SIAM J. Matrix Anal. Appl., 31 (2009/10), pp. 1688–1714.
[27] J. Li, J. Choi, I. Perros, J. Sun, and R. Vuduc, Model-driven sparse CP decomposition for higher-order tensors, in IEEE International Parallel and Distributed Processing Symposium, IPDPS, May 2017, pp. 1048–1057.
[28] I. Oseledets et al., Tensor Train Toolbox version 2.2.2. Available online, Apr. 2020.
[29] I. Oseledets and E. Tyrtyshnikov, TT-cross approximation for multidimensional arrays, Linear Algebra and its Applications, 432 (2010), pp. 70–88.
[30] I. V. Oseledets, Tensor-train decomposition, SIAM J. Sci. Comput., 33 (2011), pp. 2295–2317.
[31] A.-H. Phan, P. Tichavsky, and A. Cichocki, Fast alternating LS algorithms for high order CANDECOMP/PARAFAC tensor factorizations, IEEE Transactions on Signal Processing, 61 (2013), pp. 4834–4846.
[32] A. Quarteroni, A. Manzoni, and F. Negri, Reduced Basis Methods for Partial Differential Equations: An Introduction, UNITEXT, Springer International Publishing, 2015.
[33] S. Ragnarsson and C. F. Van Loan, Block tensor unfoldings, SIAM Journal on Matrix Analysis and Applications, 33 (2012), pp. 149–169.
[34] D. V. Savostyanov, S. V. Dolgov, J. M. Werner, and I. Kuprov, Exact NMR simulation of protein-size spin systems using tensor train formalism, Phys. Rev. B, 90 (2014), p. 085139.
[35] S. Smith and G. Karypis, Accelerating the Tucker decomposition with compressed sparse tensors, in Euro-Par 2017, F. F. Rivera, T. F. Pena, and J. C. Cabaleiro, eds., Cham, 2017, Springer International Publishing, pp. 653–668.
[36] S. Smith, N. Ravindran, N. D. Sidiropoulos, and G. Karypis, SPLATT: Efficient and parallel sparse tensor-matrix multiplication, in Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium, IPDPS '15, Washington, DC, USA, 2015, IEEE Computer Society, pp. 61–70.
[37] E. Solomonik, D. Matthews, J. R. Hammond, J. F. Stanton, and J. Demmel, A massively parallel tensor contraction framework for coupled-cluster computations, Journal of Parallel and Distributed Computing, 74 (2014), pp. 3176–3190.
[38] L. R. Tucker, Some mathematical notes on three-mode factor analysis, Psychometrika, 31 (1966), pp. 279–311.
[39] E. E. Tyrtyshnikov, Tensor approximations of matrices generated by asymptotically smooth functions, Sbornik: Mathematics, 194 (2003), pp. 941–954.
[40] N. Vervliet, O. Debals, L. Sorber, M. Van Barel, and L. De Lathauwer, Tensorlab 3.0. http://www.tensorlab.net, Mar. 2016.
Appendix A. TT Rounding Identity.
We provide the full derivation of (2.3), which we repeat here. The unfolding of
X that maps the first n tensor dimensions to rows can be expressed as a product of
four matrices:

  X_{(1:n)} = (I_{I_n} ⊗ Q_{(1:n-1)}) · V(T_{X,n}) · H(T_{X,n+1}) · (I_{I_{n+1}} ⊗ Z_{(1)}),

where Q is I_1 × ··· × I_{n-1} × R_{n-1} with

  Q(i_1, ..., i_{n-1}, r_{n-1}) = T_{X,1}(i_1, :) · T_{X,2}(:, i_2, :) ··· T_{X,n-1}(:, i_{n-1}, r_{n-1}),

and Z is R_{n+1} × I_{n+2} × ··· × I_N with

  Z(r_{n+1}, i_{n+2}, ..., i_N) = T_{X,n+2}(r_{n+1}, i_{n+2}, :) · T_{X,n+3}(:, i_{n+3}, :) ··· T_{X,N}(:, i_N).
Let U be I_1 × ··· × I_n × R_n such that U_{(1:n)} = (I_{I_n} ⊗ Q_{(1:n-1)}) V(T_{X,n}); then

  U(i_1, ..., i_n, r_n) = Σ_{i'_n} Σ_{r_{n-1}} δ(i'_n, i_n) Q(i_1, ..., i_{n-1}, r_{n-1}) T_{X,n}(r_{n-1}, i'_n, r_n)
                        = Σ_{r_{n-1}} Q(i_1, ..., i_{n-1}, r_{n-1}) T_{X,n}(r_{n-1}, i_n, r_n)
                        = Q(i_1, ..., i_{n-1}, :) · T_{X,n}(:, i_n, r_n)
                        = T_{X,1}(i_1, :) ··· T_{X,n-1}(:, i_{n-1}, :) · T_{X,n}(:, i_n, r_n).
Let V be R_n × I_{n+1} × ··· × I_N such that V_{(1)} = H(T_{X,n+1}) (I_{I_{n+1}} ⊗ Z_{(1)}); then

  V(r_n, i_{n+1}, ..., i_N) = Σ_{i'_{n+1}} Σ_{r_{n+1}} T_{X,n+1}(r_n, i'_{n+1}, r_{n+1}) δ(i'_{n+1}, i_{n+1}) Z(r_{n+1}, i_{n+2}, ..., i_N)
                            = Σ_{r_{n+1}} T_{X,n+1}(r_n, i_{n+1}, r_{n+1}) Z(r_{n+1}, i_{n+2}, ..., i_N)
                            = T_{X,n+1}(r_n, i_{n+1}, :) · Z(:, i_{n+2}, ..., i_N)
                            = T_{X,n+1}(r_n, i_{n+1}, :) · T_{X,n+2}(:, i_{n+2}, :) ··· T_{X,N}(:, i_N).
Then we confirm that Y = X for Y_{(1:n)} = U_{(1:n)} · V_{(1)}:

  Y(i_1, ..., i_N) = Σ_{r_n} U(i_1, ..., i_n, r_n) V(r_n, i_{n+1}, ..., i_N)
                   = Σ_{r_n} T_{X,1}(i_1, :) ··· T_{X,n-1}(:, i_{n-1}, :) · T_{X,n}(:, i_n, r_n) ·
                             T_{X,n+1}(r_n, i_{n+1}, :) · T_{X,n+2}(:, i_{n+2}, :) ··· T_{X,N}(:, i_N)
                   = T_{X,1}(i_1, :) ··· T_{X,n-1}(:, i_{n-1}, :) ·
                     ( Σ_{r_n} T_{X,n}(:, i_n, r_n) · T_{X,n+1}(r_n, i_{n+1}, :) ) ·
                     T_{X,n+2}(:, i_{n+2}, :) ··· T_{X,N}(:, i_N)
                   = T_{X,1}(i_1, :) ··· T_{X,n}(:, i_n, :) · T_{X,n+1}(:, i_{n+1}, :) ··· T_{X,N}(:, i_N).
Appendix B. TSQR Subroutines for Non-Powers-of-Two.
We provide here the full details of the butterfly TSQR algorithm and the algorithm
for applying the resulting implicit orthogonal factor to a matrix. These two algorithms
generalize Algorithms 3.1 and 3.2 presented in Subsection 3.3, which can run only on
powers-of-two processors. To handle a non-power-of-two number of processors, we
consider the first 2^{⌊log P⌋} processors to be "regular" processors and the last P − 2^{⌊log P⌋}
processors to be "remainder" processors. Each remainder processor has a partner in the
set of regular processors, and we perform cleanup steps between remainder processors
and their partners before and after the regular butterfly loop of the TSQR algorithm.
For the application algorithm, the cleanup occurs after the butterfly on the regular
processors (which requires no communication) and involves a single message between
remainder processors and their partners. We note that the notation and indexing
matches that of Algorithms 3.1 and 3.2, so that the algorithms coincide when P is a
power of two.
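To make the regular/remainder pairing concrete, the following is a small sketch (with illustrative names) of how each processor can classify itself and compute its cleanup partner under the mapping described above.

// Classify processor p among P total processors for the non-power-of-two cleanup.
// The first 2^{floor(log2 P)} processors are "regular"; the rest are "remainder".
// Each remainder processor p is paired with regular processor p - 2^{floor(log2 P)}.
struct CleanupRole {
  bool is_remainder;  // true if p >= 2^{floor(log2 P)}
  bool is_partner;    // true if p < P - 2^{floor(log2 P)}
  int partner;        // cleanup partner (meaningful only if one of the above holds)
};

CleanupRole classify(int p, int P) {
  int fl = 0;
  while ((2 << fl) <= P) ++fl;                     // fl = floor(log2 P)
  const int reg = 1 << fl;                         // number of regular processors
  const int ceil_pow = (reg == P) ? reg : 2 * reg; // 2^{ceil(log2 P)}
  CleanupRole r;
  r.is_remainder = (p >= reg);
  r.is_partner = (p < P - reg);
  r.partner = (p + reg) % ceil_pow;                // maps partners and remainders to each other
  return r;
}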
Algorithm B.1 Parallel Butterfly TSQR
Require: A is an m × b matrix 1D-distributed so that proc p owns row block A^{(p)}
Ensure: A = QR with R owned by all procs and Q represented by {Y_ℓ^{(p)}} with redundancy Y_ℓ^{(p)} = Y_ℓ^{(q)} for p ≡ q mod 2^ℓ, where p, q < 2^{⌊log P⌋} and ℓ < ⌊log P⌋
 1: function [{Y_ℓ^{(p)}}, R] = Par-TSQR(A^{(p)})
 2:   p = MyProcID()
 3:   [Y_{⌈log P⌉}^{(p)}, R̄_{⌊log P⌋}^{(p)}] = Local-QR(A^{(p)})                          ▷ Leaf node QR
 4:   if ⌈log P⌉ ≠ ⌊log P⌋ then                                                     ▷ Non-power-of-two case
 5:     j = (p + 2^{⌊log P⌋}) mod 2^{⌈log P⌉}
 6:     if p ≥ 2^{⌊log P⌋} then                                                     ▷ Remainder processor
 7:       Send R̄_{⌊log P⌋}^{(p)} to proc j
 8:     else if p < P − 2^{⌊log P⌋} then                                             ▷ Partner of remainder processor
 9:       Receive R̄_{⌊log P⌋}^{(j)} from proc j
10:       [Y_{⌊log P⌋}^{(p)}, R̄_{⌊log P⌋}^{(p)}] = Local-QR([R̄_{⌊log P⌋}^{(p)}; R̄_{⌊log P⌋}^{(j)}])
11:     end if
12:   end if
13:   if p < 2^{⌊log P⌋} then                                                       ▷ Butterfly tree on power-of-two procs
14:     for ℓ = ⌊log P⌋ − 1 down to 0 do
15:       j = 2^{ℓ+1} ⌊p/2^{ℓ+1}⌋ + (p + 2^ℓ) mod 2^{ℓ+1}                             ▷ Determine partner
16:       Send R̄_{ℓ+1}^{(p)} to and receive R̄_{ℓ+1}^{(j)} from proc j                 ▷ Communication
17:       if p < j then
18:         [Y_ℓ^{(p)}, R̄_ℓ^{(p)}] = Local-QR([R̄_{ℓ+1}^{(p)}; R̄_{ℓ+1}^{(j)}])         ▷ Tree node QR
19:       else
20:         [Y_ℓ^{(p)}, R̄_ℓ^{(p)}] = Local-QR([R̄_{ℓ+1}^{(j)}; R̄_{ℓ+1}^{(p)}])         ▷ Partner tree node QR
21:       end if
22:     end for
23:     R = R̄_0^{(p)}
24:   end if
25:   if ⌊log P⌋ ≠ ⌈log P⌉ then                                                     ▷ Non-power-of-two case
26:     j = (p + 2^{⌊log P⌋}) mod 2^{⌈log P⌉}
27:     if p < P − 2^{⌊log P⌋} then                                                  ▷ Partner of remainder proc
28:       Send R to proc j
29:     else if p ≥ 2^{⌊log P⌋} then                                                ▷ Remainder proc
30:       Receive R from proc j
31:     end if
32:   end if
33: end function
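The communication pattern of the butterfly loop in Algorithm B.1 can be sketched as follows in C++ with MPI: the b × b triangular factors are exchanged with a single MPI_Sendrecv per level and then stacked and refactored locally. The names, the packing of R into a dense b × b buffer, and the use of LAPACKE are our illustrative choices, not the paper's code.

#include <algorithm>
#include <vector>
#include <mpi.h>
#include <lapacke.h>

// One level of the butterfly: exchange b x b R factors with the partner and
// factor the 2b x b stack. rmine holds this processor's current R (column-major,
// stored dense with explicit zeros below the diagonal for simplicity).
void butterfly_level(int level, int p, int b, std::vector<double>& rmine,
                     std::vector<double>& y, std::vector<double>& tau,
                     MPI_Comm comm) {
  const int stride = 1 << level;                              // 2^level
  const int group = stride << 1;                              // 2^(level+1)
  const int j = group * (p / group) + (p + stride) % group;   // partner (line 15)

  std::vector<double> rpartner(b * b);
  MPI_Sendrecv(rmine.data(), b * b, MPI_DOUBLE, j, 0,
               rpartner.data(), b * b, MPI_DOUBLE, j, 0,
               comm, MPI_STATUS_IGNORE);

  // Stack the two R factors, own block on top if p < j (lines 17-21)
  std::vector<double> stacked(2 * b * b);
  for (int col = 0; col < b; ++col) {
    const double* top = (p < j) ? &rmine[col * b] : &rpartner[col * b];
    const double* bot = (p < j) ? &rpartner[col * b] : &rmine[col * b];
    std::copy(top, top + b, &stacked[col * 2 * b]);
    std::copy(bot, bot + b, &stacked[col * 2 * b + b]);
  }

  // Local QR of the 2b x b stack: the Householder vectors remain in 'stacked'
  // (the implicit Y for this level), and the new R overwrites rmine.
  tau.assign(b, 0.0);
  LAPACKE_dgeqrf(LAPACK_COL_MAJOR, 2 * b, b, stacked.data(), 2 * b, tau.data());
  for (int col = 0; col < b; ++col)
    for (int row = 0; row < b; ++row)
      rmine[col * b + row] = (row <= col) ? stacked[col * 2 * b + row] : 0.0;
  y = std::move(stacked);
}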
Algorithm B.2 Parallel Application of Implicit Q from Butterfly TSQR
Require: {Y_ℓ^{(p)}} represents the orthogonal matrix Q computed by Algorithm B.1
Require: C is b × c and redundantly owned by all processors
Ensure: B = Q · [C; 0] is m × c and 1D-distributed so that proc p owns row block B^{(p)}
 1: function B = Par-TSQR-Apply-Q({Y_ℓ^{(p)}}, C)
 2:   p = MyProcID()
 3:   if p < 2^{⌊log P⌋} then                                                       ▷ Butterfly apply on power-of-two procs
 4:     B̄_0^{(p)} = C
 5:     for ℓ = 0 to ⌊log P⌋ − 1 do
 6:       j = 2^{ℓ+1} ⌊p/2^{ℓ+1}⌋ + (p + 2^ℓ) mod 2^{ℓ+1}                             ▷ Determine partner
 7:       if p < j then
 8:         [B̄_{ℓ+1}^{(p)}; B̄_{ℓ+1}^{(j)}] = Loc-Apply-Q([I_b; Y_ℓ^{(p)}], [B̄_ℓ^{(p)}; 0])   ▷ Tree node apply
 9:       else
10:         [B̄_{ℓ+1}^{(j)}; B̄_{ℓ+1}^{(p)}] = Loc-Apply-Q([I_b; Y_ℓ^{(p)}], [B̄_ℓ^{(p)}; 0])   ▷ Partner apply
11:       end if
12:     end for
13:   end if
14:   if ⌊log P⌋ ≠ ⌈log P⌉ then                                                     ▷ Non-power-of-two case
15:     j = (p + 2^{⌊log P⌋}) mod 2^{⌈log P⌉}
16:     if p < P − 2^{⌊log P⌋} then                                                  ▷ Partner of remainder proc
17:       [B̄_{⌊log P⌋}^{(p)}; B̄_{⌊log P⌋}^{(j)}] = Loc-Apply-Q([I_b; Y_{⌊log P⌋}^{(p)}], [B̄_{⌊log P⌋}^{(p)}; 0])
18:       Send B̄_{⌊log P⌋}^{(j)} to proc j
19:     else if p ≥ 2^{⌊log P⌋} then                                                ▷ Remainder proc
20:       Receive B̄_{⌊log P⌋}^{(p)} from proc j
21:     end if
22:   end if
23:   B^{(p)} = Loc-Apply-Q(Y_{⌈log P⌉}^{(p)}, [B̄_{⌊log P⌋}^{(p)}; 0])                 ▷ Leaf node apply
24: end function