
EXTRAPOLATED ALTERNATING ALGORITHMS FOR APPROXIMATE CANONICAL POLYADIC DECOMPOSITION

Andersen Man Shun Ang†, Jeremy E. Cohen‡, Le Thi Khanh Hien†, Nicolas Gillis†

†Department of Mathematics and Operational Research, Université de Mons, Belgium
‡CNRS, Université de Rennes, Inria, IRISA Campus de Beaulieu, Rennes, France

Update 2020-09-24: corrected the typo in line 9 of Algorithm 2.

ABSTRACT

Tensor decompositions have become a central tool in machine learning to extract interpretable patterns from multiway arrays of data. However, computing the approximate Canonical Polyadic Decomposition (aCPD), one of the most important tensor decomposition models, remains a challenge. In this work, we propose several algorithms based on extrapolation that improve over existing alternating methods for aCPD. We show on several simulated and real data sets that carefully designed extrapolation can significantly improve the convergence speed and hence reduce the computational time, especially in difficult scenarios.

Index Terms — Canonical Polyadic Decomposition, Tensor, Non-convex Optimization, Block-coordinate Descent, Acceleration

1. PROBLEM STATEMENT

Let $\otimes$ be the tensor product, $[a^{(1)} \otimes \ldots \otimes a^{(p)}]_{i_1 \ldots i_p} = \prod_{j=1}^{p} a^{(j)}_{i_j}$ with $a^{(j)} \in \mathbb{R}^{n_j}$ and $p \in \mathbb{N}^*$, and set $n = \times_j n_j$ for a collection of $n_j \in \mathbb{N}^*$. We are interested in solving efficiently the following approximate Canonical Polyadic Decomposition (aCPD) optimization problem:

Definition 1 (aCPD). Given a tensor $\mathcal{T} \in \mathbb{R}^{n}$ of order $p$ and an integer $r$, find a tensor $\hat{\mathcal{T}}$ such that
$$\hat{\mathcal{T}} = \operatorname*{argmin}_{\operatorname{rank}(\mathcal{G}) \leq r} \|\mathcal{T} - \mathcal{G}\|_F^2, \qquad (1)$$
where the rank of a tensor $\mathcal{G}$ is defined as
$$\min \Big\{ r \in \mathbb{N} \;\Big|\; \exists\, a^{(j)}_i \in \mathbb{R}^{n_j},\; \mathcal{G} = \sum_{i=1}^{r} \bigotimes_{j=1}^{p} a^{(j)}_i \Big\}. \qquad (2)$$
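To make the notation concrete, here is a minimal NumPy sketch (function and variable names are ours, not the authors') that assembles a rank-$r$ tensor from factor matrices whose columns are the vectors $a^{(j)}_i$ appearing in (2), i.e., as a sum of $r$ outer products.

```python
import numpy as np

def cp_to_tensor(factors):
    """Assemble sum_{i=1}^r a_i^(1) o ... o a_i^(p) from a list of factor
    matrices of shape (n_j, r), following the rank definition in (2)."""
    r = factors[0].shape[1]
    T = np.zeros(tuple(A.shape[0] for A in factors))
    for i in range(r):
        component = factors[0][:, i]
        for A in factors[1:]:
            component = np.multiply.outer(component, A[:, i])  # outer product
        T += component
    return T

# example: a random 4 x 5 x 6 tensor of rank (at most) 3
rng = np.random.default_rng(0)
factors = [rng.standard_normal((n, 3)) for n in (4, 5, 6)]
T = cp_to_tensor(factors)
print(T.shape)  # (4, 5, 6)
```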

The aCPD problem is ill-posed in the sense that a solution might not exist, since the set of rank-$r$ tensors is not closed as soon as $r > 1$ and $p > 2$ [1]. This poses serious problems in practice. It has been documented that the estimates of each block $A^{(j)} = [a^{(j)}_1, \ldots, a^{(j)}_r]$ ($1 \leq j \leq p$) may in consequence have pairs of columns growing to infinity while canceling each other out, even if a best rank-$r$ approximation exists; see [2] and the references therein. This degeneracy causes "swamps" in the cost function decrease, such as shown in Figure 1. The cost function is also non-convex, albeit quadratic with respect to each block $A^{(j)}$ ($1 \leq j \leq p$). Therefore, computing aCPD remains a challenging task in general. We refer the interested reader to [3] for a comprehensive survey on these questions.

AMSA, LTKH and NG acknowledge the support of the European Research Council (ERC Starting Grant No. 679515), and of the Fonds de la Recherche Scientifique - FNRS and the Fonds Wetenschappelijk Onderzoek - Vlaanderen (FWO) under EOS project O005318F-RG47.

2. SOME EXISTING ALTERNATING ALGORITHMS

As tensor decompositions have become an important and extensively studied topic in data science, it is beyond the scope of this paper to summarize all the literature on how to compute aCPD. Therefore, in what follows, we focus on alternating methods, which update one block of variables $A^{(j)}$ at a time while keeping the others fixed.

There are mainly two categories of alternating algorithms to compute aCPD: Exact Block-Coordinate Descent (EBCD) algorithms and Approximate BCD (ABCD) algorithms. EBCD algorithms feature block-wise optimal updates. The most well-known EBCD algorithm for aCPD is Alternating Least Squares (ALS), also called CP-ALS or sometimes PARAFAC (which is also another name for aCPD). ALS sequentially updates the blocks $A^{(j)}$ as follows:
$$\hat{A}^{(j)} = g\big(\mathcal{T}, A^{(l \neq j)}\big) := \mathcal{T}_{[j]} B^{(j)\dagger}, \qquad (3)$$
where $\mathcal{T}_{[j]}$ is the $j$-th unfolding of $\mathcal{T}$, as defined for instance in [4], $\odot$ is the Khatri-Rao product, and $B^{(j)\,T} = \bigodot_{l=p,\, l \neq j}^{1} A^{(l)}$. ALS has no convergence guarantee since each block update may have more than one solution [5]. There are some local convergence guarantees though [6].
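As an illustration, here is a minimal NumPy sketch of the block update (3) for a third-order tensor (the naming is ours, not the authors' code). It forms the MTTKRP and the Gram matrix of the Khatri-Rao matrix via the Hadamard product of the small Gram matrices, which is equivalent to applying the pseudo-inverse in (3) without ever forming the Khatri-Rao matrix explicitly.

```python
import numpy as np

def als_update_mode1(T, B, C):
    """ALS block update (3) for the first factor of a third-order tensor:
    the least-squares solution of min_A || T - sum_i a_i o b_i o c_i ||_F^2.
    Equivalent to multiplying the mode-1 unfolding of T by the pseudo-inverse
    of the Khatri-Rao matrix, but without forming that matrix explicitly."""
    M = np.einsum('ijk,jr,kr->ir', T, B, C)   # MTTKRP: T_[1] times the Khatri-Rao matrix
    G = (B.T @ B) * (C.T @ C)                 # r x r Gram of the Khatri-Rao matrix
    return np.linalg.solve(G, M.T).T          # A = M G^{-1}

# a full ALS sweep applies the analogous update to the second and third factors in turn
```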

Another simple BCD algorithm is Hierarchical ALS (HALS), which updates each column of each $A^{(j)}$ sequentially. While HALS has scarcely been used for solving aCPD (contrary to its nonnegative counterpart [7]), it may in principle be faster than ALS for large tensors since no linear systems need to be solved at each iteration, owing to the separability of the objective function. A variant of HALS, A-HALS, updates the columns of each factor $A^{(j)}$ with several cycles before moving to another factor. This reduces the computational cost, as some matrix products can be reused [8].
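A sketch of the corresponding HALS sweep for the first factor of a third-order tensor follows (again our naming, assuming a plain unconstrained least-squares column update). Note that the MTTKRP and the Gram matrix are computed once and reused for all $r$ columns, which is the kind of reuse A-HALS exploits across cycles.

```python
import numpy as np

def hals_sweep_mode1(T, A, B, C):
    """One HALS sweep over the columns of the first factor (B and C fixed).
    Each column update is the exact minimizer of the cost in that column."""
    M = np.einsum('ijk,jr,kr->ir', T, B, C)   # MTTKRP, computed once for the whole sweep
    G = (B.T @ B) * (C.T @ C)                 # Gram of the Khatri-Rao matrix, also reused
    A = A.copy()
    for i in range(A.shape[1]):
        # optimal i-th column given all the others: a_i <- a_i + (M_i - A g_i) / G_ii
        A[:, i] += (M[:, i] - A @ G[:, i]) / G[i, i]
    return A
```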

ABCD algorithms, such as alternating gradient methods, do not solve each subproblem optimally. These methods have scarcely been considered for solving aCPD, in favor of all-at-once gradient-based approaches [9].

3. PROPOSED APPROACHES: IBPG AND HERBCD

In the following, we introduce two algorithms for computing aCPD that make use of extrapolation in two different ways. Extrapolation for escaping saddle points and enhancing convergence speed is an intensively studied topic in machine learning, in particular in the context of deep learning, where the cost functions to minimize are highly non-convex and gradient-based algorithms might require a "push" to escape bad regions of the search space faster. Our goal is to show empirically that extrapolation, on top of enhancing empirical convergence speed in difficult cases, also helps escape "swamps" when computing aCPD. This observation is actually not new, and may be traced back to seminal work by Harshman [10]. We provide a fresh view on these issues by using more recent optimization techniques. For convenience, let us define the cost function $F$ as
$$F\big(A^{(1)}, \ldots, A^{(p)}\big) = \Big\| \mathcal{T} - \sum_{i=1}^{r} \bigotimes_{j=1}^{p} a^{(j)}_i \Big\|_F^2. \qquad (4)$$

3.1. Inertial Block Proximal Gradient (iBPG)

A recent trend in non-convex optimization for machine learning, which takes root in the seminal work by Nesterov [11], is to make use of extrapolation of the iterates to enhance the convergence speed of the cost function. Although such techniques were originally designed for convex smooth problems and gradient descent, extrapolation has been investigated in the context of non-smooth [12] and non-convex [13] problems, and in conjunction with alternating gradient descent [14]. It has also been shown that extrapolation of the iterates can be interpreted in the realm of harmonic mechanical systems, shedding light on the performance enhancement [15, 16]. These techniques have seldom been used for solving aCPD, despite some recent attempts [17, 18] summarized in Section 5.

Such an alternating (proximal) gradient descent algorithm with extrapolation was recently proposed [19], which we refer to as iBPG, to solve a general nonconvex nonsmooth block-separable composite optimization problem.

Algorithm 1 iBPG for CPD
1: Initialization: Choose $\delta_w = 0.99$, $\beta = 1.01$, $t_0 = 1$, and two sets of initial factor matrices $A^{(1)}_{-1}, \ldots, A^{(p)}_{-1}$ and $A^{(1)}_{0}, \ldots, A^{(p)}_{0}$. Set $k = 1$.
2: Set $A^{(j)}_{\mathrm{prev}} = A^{(j)}_{-1}$, $j = 1, \ldots, p$.
3: Set $A^{(j)}_{\mathrm{cur}} = A^{(j)}_{0}$, $j = 1, \ldots, p$.
4: repeat
5:   for $j = 1, \ldots, p$ do
6:     $t_k = \frac{1}{2}\big(1 + \sqrt{1 + 4 t_{k-1}^2}\big)$, $\hat{w}_{k-1} = \frac{t_{k-1} - 1}{t_k}$
7:     $w^{(j)}_{k-1} = \min\Big(\hat{w}_{k-1},\, \delta_w \sqrt{L^{(j)}_{k-2} / L^{(j)}_{k-1}}\Big)$
8:     $L^{(j)}_{k} = \big\| B^{(j)}_{k} B^{(j)\,T}_{k} \big\|$
9:     repeat
10:      Compute two extrapolation points
         $\hat{A}^{(j,1)} = A^{(j)}_{\mathrm{cur}} + w^{(j)}_{k-1} \big(A^{(j)}_{\mathrm{cur}} - A^{(j)}_{\mathrm{prev}}\big)$,
         $\hat{A}^{(j,2)} = A^{(j)}_{\mathrm{cur}} + \beta\, w^{(j)}_{k-1} \big(A^{(j)}_{\mathrm{cur}} - A^{(j)}_{\mathrm{prev}}\big)$
11:      Set $A^{(j)}_{\mathrm{prev}} = A^{(j)}_{\mathrm{cur}}$.
12:      Update $A^{(j)}_{\mathrm{cur}}$ by a gradient step:
         $A^{(j)}_{\mathrm{cur}} = \hat{A}^{(j,2)} - \frac{1}{L^{(j)}_{k-1}} \big(\hat{A}^{(j,1)} B^{(j)}_{k-1} - \mathcal{T}_{[j]}\big) B^{(j)\,T}_{k-1}$
13:    until some criterion is satisfied
14:    Set $A^{(j)}_{k} = A^{(j)}_{\mathrm{cur}}$.
15:  end for
16:  Set $k = k + 1$.
17: until some criterion is satisfied

iBPG embraces some advanced features of acceleration methods using extrapolation:

• iBPG uses two different extrapolation points to evaluate the gradient and to add inertial force. This feature was experimentally shown to improve convergence compared to the use of a single extrapolation point.
• iBPG does not require a restarting step: convergence is guaranteed without any restart. This is in contrast with most algorithms using extrapolation in the non-convex case, where a restarting step is necessary to ensure convergence [17, 14]: a step is accepted only if the objective function decreases and, when this is not the case, the algorithm restarts by taking a standard gradient step. This feature is very useful when evaluating the objective function is expensive.
• iBPG is very flexible in the choice of the order in which the blocks are updated: for example, each matrix factor can be updated several times, which allows reusing some computations (as in A-HALS [8]), leading to more updates at a lower computational cost.

iBPG is proved to have sub-sequential convergence under some mild conditions, and global convergence under some additional assumptions. iBPG can easily be instantiated for aCPD; the resulting algorithm is summarized in Algorithm 1. As the choice of the parameters in Algorithm 1 satisfies the relaxed conditions in [19, Remark 4.7], we can derive from [19, Theorem 4.8] that iBPG for solving aCPD is guaranteed to have (at least) sub-sequential convergence; see [19, Section 5] for a similar explanation in the case of the nonnegative matrix factorization problem.
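For concreteness, here is a minimal sketch of one inner update of Algorithm 1 (lines 8-12) on the first factor of a third-order tensor. The naming is ours, the extrapolation weight w is assumed to have been set as in lines 6-7, and, as a simplification of the indexing in Algorithm 1, the current Khatri-Rao Gram matrix is used for both the Lipschitz constant and the gradient.

```python
import numpy as np

def ibpg_inner_update(T, A_cur, A_prev, B, C, w, beta=1.01):
    """One gradient step of Algorithm 1 (lines 10-12) on the first factor of a
    third-order tensor, given an extrapolation weight w (lines 6-7).
    Returns the new (A_cur, A_prev) pair."""
    G = (B.T @ B) * (C.T @ C)                     # Gram of the Khatri-Rao matrix (r x r)
    L = np.linalg.norm(G, 2)                      # block Lipschitz constant (line 8)
    M = np.einsum('ijk,jr,kr->ir', T, B, C)       # MTTKRP
    A_hat1 = A_cur + w * (A_cur - A_prev)         # extrapolation point for the gradient
    A_hat2 = A_cur + beta * w * (A_cur - A_prev)  # extrapolation point for the inertia
    grad = A_hat1 @ G - M                         # gradient of the block subproblem at A_hat1
    return A_hat2 - grad / L, A_cur
```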

3.2. Heuristic Extrapolation and Restart (her) BCD

Although alternating gradient-based approaches are not state-of-the-art at the moment for computing aCPD, BCD algorithms on the other hand are extremely popular, in particular the ALS algorithm, mostly due to its simplicity and efficiency for simple problems. However, ALS is known to converge slowly, for instance when the factors $A^{(j)}$ are ill-conditioned. In this paper, we introduce an extrapolation of the factor estimates between each block update. Moreover, the extrapolation technique is not a straightforward heuristic line search such as described in [20, pp. 95-96], but mimics Nesterov's extrapolation by introducing pairing variables $Z^{(j)}$. For instance, when updating the $j$-th factor in the ALS algorithm at the $k$-th iteration, the update is modified as:
$$A^{(j)}_k = g\Big(\mathcal{T}, \big[Z^{(l<j)}_k, Z^{(l>j)}_{k-1}\big]\Big) \text{ as defined in (3)}, \qquad (5)$$
$$Z^{(j)}_k = A^{(j)}_k + \beta_k \big(A^{(j)}_k - A^{(j)}_{k-1}\big), \qquad (6)$$

where $\beta_k$ is updated heuristically (see Algorithm 2) using three parameters $(\eta, \bar\gamma, \gamma)$, following the strategy described in [21]. In short, the idea is to use a restart criterion $\hat F_k = F\big(A^{(p)}_k; Z^{(l \neq p)}_k\big)$, which is the cost evaluated at the pairing variables for the first $p-1$ modes and the original variable for the last mode, to update $\beta_k$. When $\hat F$ decreases, we grow $\beta$ (multiply it by a constant $\gamma \geq 1$). When $\hat F$ increases, we decrease $\beta$ (divide it by a constant $\eta \geq 1$). In Algorithm 2, $\bar\beta$ is the upper bound of $\beta$, which is also updated dynamically. Besides updating $\beta$, a restart is carried out based on $\hat F$ to decide whether to keep $A^{(j)}$ or $Z^{(j)}$ in the next iteration. We refer to this procedure as Heuristic Extrapolation and Restart (her). It could be used in the same way to design a her-HALS algorithm and a her-Gradient algorithm, which we do not discuss here due to space restrictions. In fact, any BCD algorithm can be accelerated using her, by extrapolating the partial estimates for each block update. We label such a generic approach herBCD.

Algorithm 2 herALS for CPD
1: Initialization: Choose $\beta_0 \in (0,1)$, $\eta \geq \gamma \geq \bar\gamma \geq 1$, and two sets of initial factor matrices $A^{(1)}_0, \ldots, A^{(p)}_0$ and $Z^{(1)}_0, \ldots, Z^{(p)}_0$. Set $\bar\beta_0 = 1$ and $k = 1$.
2: repeat
3:   for $j = 1, \ldots, p$ do
4:     Update: get $A^{(j)}_k$ as in (5).
5:     Extrapolate: get $Z^{(j)}_k$ as in (6).
6:   end for
7:   Compute $\hat F_k = F\big(A^{(p)}_k; Z^{(l \neq p)}_k\big)$.
8:   if $\hat F_k > \hat F_{k-1}$ (for $k \geq 2$) then
       Set $Z^{(j)}_k = A^{(j)}_k$ for $j = 1, \ldots, p$
       Set $\bar\beta_k = \beta_{k-1}$, $\beta_k = \beta_{k-1}/\eta$
9:   else
       Set $A^{(j)}_k = Z^{(j)}_k$ for $j = 1, \ldots, p$
       Set $\bar\beta_k = \min\{1, \bar\beta_{k-1}\bar\gamma\}$, $\beta_k = \min\{\bar\beta_{k-1}, \beta_{k-1}\gamma\}$
10: end if
11: Set $k = k + 1$.
12: until some criterion is satisfied
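Putting (5), (6) and the restart rule together, here is a compact NumPy sketch of Algorithm 2 for a third-order tensor. Function and variable names are ours; for clarity, the cost $\hat F_k$ is recomputed naively from the full tensor rather than from reused MTTKRPs as discussed in Section 3.3.

```python
import numpy as np

def cp_cost(T, factors):
    """F in (4): squared Frobenius error of the CP model."""
    A, B, C = factors
    return np.linalg.norm(T - np.einsum('ir,jr,kr->ijk', A, B, C)) ** 2

def als_step(T, factors, j):
    """ALS update (3)/(5) of block j for a third-order tensor."""
    A, B, C = factors
    others = [(B, C), (A, C), (A, B)][j]
    subscripts = ['ijk,jr,kr->ir', 'ijk,ir,kr->jr', 'ijk,ir,jr->kr'][j]
    M = np.einsum(subscripts, T, *others)                       # MTTKRP
    G = (others[0].T @ others[0]) * (others[1].T @ others[1])   # Khatri-Rao Gram
    return np.linalg.solve(G, M.T).T

def her_als(T, factors, beta0=0.5, gamma=1.05, gamma_bar=1.01, eta=1.5, n_iter=100):
    """Sketch of Algorithm 2 (herALS) for a third-order tensor."""
    A = [F.copy() for F in factors]     # main variables A^(j)
    Z = [F.copy() for F in factors]     # pairing (extrapolated) variables Z^(j)
    beta, beta_bar = beta0, 1.0
    F_hat_prev = np.inf
    for _ in range(n_iter):
        for j in range(3):
            A_new = als_step(T, Z, j)                 # update (5) from the pairing variables
            Z[j] = A_new + beta * (A_new - A[j])      # extrapolation (6)
            A[j] = A_new
        F_hat = cp_cost(T, [Z[0], Z[1], A[2]])        # restart criterion \hat F_k
        if F_hat > F_hat_prev:                        # cost went up: restart
            Z = [F.copy() for F in A]
            beta_bar, beta = beta, beta / eta
        else:                                         # cost went down: keep Z, grow beta
            A = [F.copy() for F in Z]
            beta = min(beta_bar, beta * gamma)
            beta_bar = min(1.0, beta_bar * gamma_bar)
        F_hat_prev = F_hat
    return A
```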

3.3. Additional cost of extrapolation

A natural question that arises when modifying well-known algorithms to enhance their convergence speed is the impact of such modifications on the computational time. Indeed, it is often possible to improve the relative decrease of the objective at each iteration with respect to ALS, but doing so while keeping a fixed cost per iteration is more challenging.

The cost of iBPG is essentially that of an alternating gradient method. Indeed, the computation of the extrapolation points is negligible given that $r$ is small, since computing the operator norm of an $r \times r$ matrix is cheap (lines 7 and 8 in Algorithm 1). Moreover, since there is no need for a restart as in herALS, the cost function does not need to be computed at each iteration. Therefore, the cost of each block update boils down to the cost of computing the gradient, which is itself known to be dominated by the so-called Matricized Tensor Times Khatri-Rao Product (MTTKRP). The MTTKRP can be efficiently implemented; see for instance [22, 23]. In summary, for small $r$, one block update of iBPG has essentially the same cost as one block update of ALS, since the matrix to invert in ALS is of size $r \times r$.

The cost of one iteration of herALS is also essentially the same as one iteration of ALS, although this requires a twist on the restart condition. Indeed, to perform a restart, it is in theory necessary to check whether the cost function has increased after extrapolation. However, computing the cost function is expensive for aCPD unless the previously computed MTTKRPs can be reused. Note that the restart in herALS is based on the cost evaluated at the pairing variables, for which the MTTKRPs have indeed been computed in the ALS update. Although this is not a standard way to perform a restart, it keeps the computational cost low while showing no practical difference.
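The reuse argument can be made explicit with the standard expansion of the squared error. The following sketch (our notation, third-order case, not the authors' exact implementation) evaluates $F$ from the MTTKRP and the Gram matrices that the last block update has already formed, so the restart test costs only a few small matrix products instead of a pass over the full tensor; this is, roughly, how the restart test at the pairing variables can reuse the MTTKRP from the preceding ALS update.

```python
import numpy as np

def cheap_cp_cost(normT2, M, G, A):
    """Evaluate F in (4) without touching the full tensor, reusing the MTTKRP
    M (n_1 x r) and the Khatri-Rao Gram G = (B^T B) * (C^T C) (r x r) that an
    ALS or gradient update of the first factor has already computed:
        ||T - [[A, B, C]]||_F^2 = ||T||_F^2 - 2 <M, A> + <A^T A, G>."""
    return normT2 - 2.0 * np.sum(M * A) + np.sum((A.T @ A) * G)
```

With $\|\mathcal{T}\|_F^2$ precomputed once, this makes the per-iteration overhead of the restart test negligible compared to the ALS update itself.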

4. EXPERIMENTS

In this section, we compare iBPG and herALS to three algorithms: the original un-accelerated ALS (ALS), the accelerated ALS using Bro's acceleration (Bro-ALS), and LS-ALS, an accelerated ALS where the extrapolation sequence is computed by line search in the style of Anderson acceleration. See Section 5 for a description of Bro-ALS and LS-ALS.

In each experiment, the notation $[I, J, K, r]$ denotes the size of the tensor, $(I, J, K)$, and the factorization rank $r$. All experiments are run over 20 random initializations, and we plot the median of the cost value over these 20 trials. There are two important things to note: all herBCD variants across all experiments use the same set of default parameters $[\beta_0, \gamma, \bar\gamma, \eta] = [0.5, 1.05, 1.01, 1.5]$, and the y-axis of all plots shows $F - F_{\min}$, where $F$ is the cost evaluated at all $A^{(j)}$ and $F_{\min}$ is the minimal cost obtained in the experiment across all initializations and algorithms. All the experiments are run with MATLAB (v.2015a) on a laptop with a 2.4 GHz CPU and 16 GB RAM. The codes are available from https://angms.science/research.html.

4.1. Synthetic data sets

Figure 1 shows the results of two experiments and details the chosen dimensions of the problem. In both the balanced and unbalanced cases that were tested, the data tensor is generated as $\sum_{q=1}^{r} a^{(1)}_q \otimes a^{(2)}_q \otimes a^{(3)}_q + \mathcal{N}$, where the ground-truth factors $A^{(j)}$ are sampled from a Gaussian distribution with zero mean and unit variance. Note that we adjust the condition number of $A^{(j)}$ to 100 using the SVD, replacing the singular values by logarithmically scaled values between 1 and 100. The tensor $\mathcal{N}$ is additive Gaussian noise with zero mean and variance 0.001. The results show that iBPG and herALS are the best among the five tested algorithms, and in particular seem to avoid the swamp in which ALS lands in the unbalanced case. LS-ALS, which converges fast in terms of iterations, suffers from a higher per-iteration cost.
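A sketch of this data-generation procedure, under the stated assumptions (the exact noise scaling and the ordering of the singular values may differ from the authors' setup):

```python
import numpy as np

rng = np.random.default_rng(1)
I, J, K, r = 50, 50, 50, 10            # the balanced case [50, 50, 50, 10]

def ill_conditioned_factor(n, r, cond=100.0):
    """Gaussian factor whose condition number is set to `cond` by replacing its
    singular values with logarithmically spaced values between 1 and `cond`."""
    U, _, Vt = np.linalg.svd(rng.standard_normal((n, r)), full_matrices=False)
    return U @ np.diag(np.logspace(0.0, np.log10(cond), r)) @ Vt

factors = [ill_conditioned_factor(n, r) for n in (I, J, K)]
T = np.einsum('ir,jr,kr->ijk', *factors)
T += np.sqrt(0.001) * rng.standard_normal(T.shape)   # additive Gaussian noise, variance 0.001
```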

Fig. 1. Median error over 20 runs on synthetic data sets plotted against iterations (top) and time (bottom). On the left is a square tensor [50, 50, 50, 10], and on the right is an unbalanced tensor [150, 103, 50, 10]. For the unbalanced case, ALS improves very slowly up to the 90th iteration: this phenomenon is often referred to as a swamp in the literature. The proposed extrapolated algorithms do not encounter this issue in this experiment.

4.2. Real data sets

We now show the results on real data sets: Wine¹ (Fig. 2), hyperspectral data of Indian Pines² and Blood plasma³ (Fig. 4). Again, the curves are the median over 20 initializations. All sub-figures in a figure share the same y-axis. Minimal pre-processing is carried out: NaN values (if any) are replaced with zeros.

We observe that herALS performs the best, followed by Bro-ALS. iBPG does not perform as well as for the synthetic data sets.

Fig. 2. Results on Wine [44, 2700, 200, 15]. iBPG gets stuck in local minima.

Fig. 3. Results on Indian Pines [145, 145, 200, 16].

5. RELATED PRIOR WORK

Although the idea of extrapolation is not new for ALS [10], there have not been many works tackling extrapolation for speeding up BCD algorithms for aCPD. We are aware of two such works. Bro et al. [20] extrapolate directly the estimated factors using a heuristic approach, recalled in [24], which we show can be slower than ALS, although it prevents any appearance of "swamps" in our experiments. Mitchell et al. [18] have proposed a similar extrapolation strategy, where they extrapolate all blocks simultaneously using a shared parameter $\beta_k$. In contrast, in this work, we follow the same scheme to compute $\beta_k$, but the extrapolation is carried out on each block right after its least-squares update, rather than after all least-squares updates. This makes the two approaches quite different. Furthermore, an expensive line search (a least-squares problem) has to be performed to compute the extrapolation weight $\beta$, so that the per-iteration cost is higher than that of all the other methods in Figure 1.

¹ See http://www.models.life.ku.dk/Wine_GCMS_FTIR for a data description.
² http://www.ehu.eus/ccwintco/index.php?title=Hyperspectral_Remote_Sensing_Scenes
³ See http://www.models.life.ku.dk/anders-cancer

Fig. 4. Results on Blood [289, 301, 41] with r = 3 (top), r = 6 (middle) and r = 10 (bottom). Note that the data contain many NaN values (data polluted by Rayleigh scattering); all NaNs are replaced by 0, so there are many zero fibers in the tensor after this correction.

6. CONCLUSION AND PERSPECTIVES

We have introduced extrapolation-based alternating algorithms for solving aCPD. On a limited set of synthetic experiments with ill-conditioned tensors, the recently proposed iBPG algorithm, which is alternating gradient-based, outperforms workhorse block-coordinate descent algorithms such as ALS, and helps escape "swamps". The algorithm proposed in this paper, herALS, is a variant of ALS in which the iterates are extrapolated, and also performs well without fine tuning of the hyperparameters. On a few real data sets stemming from fluorescence spectroscopy and remote sensing, herALS outperforms all tested methods while iBPG shows mixed results. Further tests and comparisons should therefore be performed to further assess the performance of both iBPG and herALS in specific practical cases. Finally, this work provides further practical evidence that extrapolation helps escape swamps when computing aCPD.

7. REFERENCES

[1] V. De Silva and L.-H. Lim, "Tensor rank and the ill-posedness of the best low-rank approximation problem," SIAM Journal on Matrix Analysis and Applications, vol. 30, no. 3, pp. 1084-1127, 2008.
[2] P. Comon, X. Luciani, and A. L. F. De Almeida, "Tensor decompositions, alternating least squares and other tales," Journal of Chemometrics, vol. 23, no. 7-8, pp. 393-405, 2009.
[3] N. D. Sidiropoulos, L. De Lathauwer, X. Fu, K. Huang, E. E. Papalexakis, and C. Faloutsos, "Tensor decomposition for signal processing and machine learning," IEEE Transactions on Signal Processing, vol. 65, no. 13, pp. 3551-3582, 2017.
[4] T. G. Kolda and B. W. Bader, "Tensor decompositions and applications," SIAM Review, vol. 51, no. 3, pp. 455-500, Sep. 2009.
[5] D. P. Bertsekas, Nonlinear Programming, Athena Scientific, Belmont, 1999.
[6] A. Uschmajew, "Local convergence of the alternating least squares algorithm for canonical tensor approximation," SIAM Journal on Matrix Analysis and Applications, vol. 33, no. 2, pp. 639-652, 2012.
[7] A. Cichocki, R. Zdunek, A. H. Phan, and S.-I. Amari, Nonnegative Matrix and Tensor Factorization, Wiley, 2009.
[8] N. Gillis and F. Glineur, "Accelerated multiplicative updates and hierarchical ALS algorithms for nonnegative matrix factorization," Neural Computation, vol. 24, no. 4, pp. 1085-1105, 2012.
[9] E. Acar, D. M. Dunlavy, and T. G. Kolda, "A scalable optimization approach for fitting canonical tensor decompositions," Journal of Chemometrics, vol. 25, no. 2, pp. 67-86, 2011.
[10] R. A. Harshman, "Foundations of the PARAFAC procedure: Models and conditions for an 'explanatory' multimodal factor analysis," UCLA Working Papers in Phonetics, vol. 16, 1970.
[11] Y. Nesterov, "A method of solving a convex programming problem with convergence rate O(1/k²)," Soviet Mathematics Doklady, vol. 27, pp. 372-376, 1983.
[12] Y. Nesterov, "Gradient methods for minimizing composite functions," Mathematical Programming, vol. 140, no. 1, pp. 125-161, Aug. 2013.
[13] T. Pock and S. Sabach, "Inertial proximal alternating linearized minimization (iPALM) for nonconvex and nonsmooth problems," SIAM Journal on Imaging Sciences, vol. 9, no. 4, pp. 1756-1787, 2016.
[14] Y. Xu, "Alternating proximal gradient method for sparse nonnegative Tucker decomposition," Mathematical Programming Computation, vol. 7, no. 1, pp. 39-70, Mar. 2015, arXiv:1302.2559.
[15] W. Su, S. Boyd, and E. J. Candès, "A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights," Journal of Machine Learning Research, vol. 17, no. 153, pp. 1-43, 2016.
[16] M. Muehlebach and M. I. Jordan, "A dynamical systems perspective on Nesterov acceleration," arXiv preprint arXiv:1905.07436, 2019.
[17] Y. Xu and W. Yin, "A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion," SIAM Journal on Imaging Sciences, vol. 6, no. 3, pp. 1758-1789, 2013.
[18] D. Mitchell, N. Ye, and H. De Sterck, "Nesterov acceleration of alternating least squares for canonical tensor decomposition," 2018.
[19] L. T. K. Hien, N. Gillis, and P. Patrinos, "Inertial block mirror descent method for non-convex non-smooth optimization," 2019, arXiv:1903.01818.
[20] R. Bro, Multi-way Analysis in the Food Industry: Models, Algorithms, and Applications, Ph.D. thesis, University of Amsterdam, The Netherlands, 1998.
[21] A. M. S. Ang and N. Gillis, "Accelerating nonnegative matrix factorization algorithms using extrapolation," Neural Computation, vol. 31, no. 2, pp. 417-439, 2019.
[22] N. Ravindran, N. D. Sidiropoulos, S. Smith, and G. Karypis, "Memory-efficient parallel computation of tensor and matrix products for big tensor decomposition," in 48th Asilomar Conference on Signals, Systems and Computers, IEEE, 2014, pp. 581-585.
[23] E. E. Papalexakis, U. Kang, C. Faloutsos, N. D. Sidiropoulos, and A. Harpale, "Large scale tensor decompositions: Algorithmic developments and applications," IEEE Data Engineering Bulletin, vol. 36, no. 3, pp. 59-66, 2013.
[24] A. Ang, J. E. Cohen, and N. Gillis, "Accelerating approximate nonnegative canonical polyadic decomposition using extrapolation," in XXVIIe Colloque GRETSI, 2019.