EXTRAPOLATED ALTERNATING ALGORITHMS FOR APPROXIMATE CANONICAL
POLYADIC DECOMPOSITION
Andersen Man Shun Ang†, Jeremy E. Cohen‡, Le Thi Khanh Hien†, Nicolas Gillis†
†Department of Mathematics and Operational Research, Université de Mons, Belgium
‡Université de Rennes, Inria, IRISA, Campus de Beaulieu, Rennes, France
Update20200924: corrected the typo in line 9 of Algo. 2.
Tensor decompositions have become a central tool in machine learning
to extract interpretable patterns from multiway arrays of data.
However, computing the approximate Canonical Polyadic Decomposition
(aCPD), one of the most important tensor decomposition models,
remains a challenge. In this work, we propose several algorithms
based on extrapolation that improve over existing alternating
methods for aCPD. We show on several simulated and real data sets
that carefully designed extrapolation can significantly improve the
convergence speed, and hence reduce the computational time,
especially in difficult scenarios.
Index Terms—Canonical Polyadic Decomposition, Tensor,
Non-convex Optimization, Block-coordinate Descent, Acceleration
1. PROBLEM STATEMENT
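Let $\otimes$ be the tensor product
$[a^{(1)} \otimes \dots \otimes a^{(p)}]_{i_1 \dots i_p} = \prod_{j=1}^{p} a^{(j)}_{i_j}$
for $a^{(j)} \in \mathbb{R}^{n_j}$ and $p \in \mathbb{N}^*$, and set
$n = \times_j n_j$ for a collection of $n_j \in \mathbb{N}^*$.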
We are interested in solving efﬁciently the following approximate
Canonical Polyadic Decomposition (aCPD) optimization problem:
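Definition 1 (aCPD) Given a tensor $T \in \mathbb{R}^n$ of order $p$ and an
integer $r$, find a tensor
$\hat{T} \in \operatorname{argmin}_{\operatorname{rank}(G) \le r} \|T - G\|_F^2$, (1)
where the rank of a tensor $G$ is defined as
$\operatorname{rank}(G) = \min \{ r \in \mathbb{N} \mid \exists\, a^{(j)}_i \in \mathbb{R}^{n_j},\ G = \sum_{i=1}^{r} a^{(1)}_i \otimes \dots \otimes a^{(p)}_i \}$. (2)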
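To make the rank definition concrete, the following minimal sketch assembles a tensor as a sum of $r$ outer products of matching factor columns. It assumes NumPy; the function name `cp_to_tensor` and the storage of each mode as an $n_j \times r$ matrix are illustrative choices, not notation from the paper.

```python
import numpy as np

def cp_to_tensor(factors):
    """Sum over q of the outer products of the q-th columns of each factor."""
    shape = [A.shape[0] for A in factors]
    rank = factors[0].shape[1]
    tensor = np.zeros(shape)
    for q in range(rank):
        rank_one = factors[0][:, q]
        for A in factors[1:]:
            # each call appends one mode to the rank-one term
            rank_one = np.multiply.outer(rank_one, A[:, q])
        tensor += rank_one
    return tensor

rng = np.random.default_rng(0)
factors = [rng.standard_normal((n, 2)) for n in (3, 4, 5)]
G = cp_to_tensor(factors)  # order-3 tensor of rank at most 2
```

By construction `G` has rank at most 2, so it is a feasible point of the aCPD problem for any $r \ge 2$.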
The aCPD problem is ill-posed in the sense that a solution might not
exist, since the set of rank-$r$ tensors is not closed as soon as $r > 1$
and $p > 2$. This poses serious problems in practice. It has been
documented that the estimates of each block
$A^{(j)} = [a^{(j)}_1, \dots, a^{(j)}_r]$ ($1 \le j \le p$) may in
consequence have pairs of columns growing to infinity while canceling
each other out, even if a best rank-$r$ approximation exists; see  and
the references therein. This degeneracy causes “swamps” in the cost
function decrease, such as shown in Figure 1. The cost function is also
non-convex, albeit quadratic with respect to each block $A^{(j)}$
($1 \le j \le p$). Therefore, computing aCPD remains a challenging task
in general. We refer the interested reader to  for a comprehensive
survey on these questions.
AMSA, LTKH and NG acknowledge the support by the European Research
Council (ERC starting grant No 679515), and by the Fonds de la
Recherche Scientifique - FNRS and the Fonds Wetenschappelijk Onderzoek
- Vlaanderen (FWO) under EOS project O005318F-RG47.
2. SOME EXISTING ALTERNATING ALGORITHMS
As tensor decompositions have become an important and extensively
studied topic in data science, it is out of the scope of this paper to
summarize all the literature on how to compute aCPD. Therefore, in
what follows, we focus on alternating methods, which update one
block of variables A(j)at a time while keeping the others ﬁxed.
There are mainly two categories of alternating algorithms to
compute aCPD: Exact Block-Coordinate Descent (EBCD) algo-
rithms, and Approximate BCD (ABCD) algorithms. EBCD algo-
rithms feature block-wise optimal updates. The most well-known
EBCD algorithm for aCPD is Alternating Least Squares (ALS),
also called CP-ALS or sometimes PARAFAC (which is also another
name for aCPD). ALS sequentially updates the blocks $A^{(j)}$ as
$A^{(j)} = g(T, A^{(l \neq j)}) := T_{[j]} B^{(j)\dagger}$, (3)
where $T_{[j]}$ is the $j$-th unfolding of $T$, as defined for instance in ,
$\dagger$ denotes the pseudo-inverse, $\odot$ is the Khatri-Rao product and
$B^{(j)T} = \bigodot_{l \neq j} A^{(l)}$. ALS has no convergence
guarantee since each block update may have more than one
solution . There are some local convergence guarantees, though .
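The update (3) can be sketched for an order-3 tensor as follows. This is a minimal illustration assuming NumPy, not the authors' MATLAB code: instead of forming the unfolding and the Khatri-Rao product explicitly, it uses the standard identity that the Gram matrix of a Khatri-Rao product is the Hadamard product of the individual Gram matrices, so each block update is an `einsum` contraction followed by an $r \times r$ pseudo-inverse. The names `als_sweep` and `cost` are hypothetical.

```python
import numpy as np

def als_sweep(T, A, B, C):
    """One ALS sweep: each block is replaced by its exact least-squares update."""
    # contraction of T with the two fixed blocks, times the pseudo-inverse
    # of the Hadamard product of their r x r Gram matrices
    A = np.einsum('ijk,jr,kr->ir', T, B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
    B = np.einsum('ijk,ir,kr->jr', T, A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
    C = np.einsum('ijk,ir,jr->kr', T, A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C

def cost(T, A, B, C):
    """Squared Frobenius distance between T and the CP model."""
    return np.sum((T - np.einsum('ir,jr,kr->ijk', A, B, C)) ** 2)

rng = np.random.default_rng(0)
truth = [rng.standard_normal((n, 3)) for n in (6, 7, 8)]
T = np.einsum('ir,jr,kr->ijk', *truth)          # exact rank-3 tensor
A, B, C = (rng.standard_normal((n, 3)) for n in (6, 7, 8))
history = [cost(T, A, B, C)]
for _ in range(20):
    A, B, C = als_sweep(T, A, B, C)
    history.append(cost(T, A, B, C))
```

Because every block update is block-wise optimal, the recorded cost cannot increase from one sweep to the next, which is the defining property of an EBCD method.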
Another simple BCD algorithm is Hierarchical ALS (HALS), which
updates each column of each A(j)sequentially. While HALS has
scarcely been used for solving aCPD (contrary to its nonnegative
counterpart ), it may in principle be faster than ALS for large ten-
sors since no linear systems are to be solved at each iteration due
to the separability of the objective function. A variant of HALS,
A-HALS, updates the columns of each factor $A^{(j)}$ over several cycles
before moving to another factor. This reduces the computational cost,
as some matrix products can be reused .
ABCD algorithms, such as alternating gradient methods, do not
solve each subproblem optimally. These methods have scarcely been
considered for solving aCPD, in favor of all-at-once gradient-based
methods.
3. PROPOSED APPROACHES: IBPG AND HERBCD
In the following, we introduce two algorithms for computing aCPD
that make use of extrapolation in two different ways. Extrapolation
for escaping saddle points and enhancing convergence speed is an in-
tensively studied topic in machine learning, in particular in the con-
text of deep learning where the cost functions to minimize are highly
non-convex and gradient-based algorithms might require a “push” to
escape bad regions of the search space faster. Our goal is to show
empirically that extrapolation, on top of enhancing convergence
speed in difficult cases, also helps escape “swamps” when
computing aCPD. This observation is actually not new, and may be
traced back to seminal work by Harshman [10]. We provide a fresh
view on these issues by using more recent optimization techniques.
For convenience, let us define the cost function $F$ as
$F(A^{(1)}, \dots, A^{(p)}) = \| T - \sum_{q=1}^{r} a^{(1)}_q \otimes \dots \otimes a^{(p)}_q \|_F^2$. (4)
3.1. Inertial Block Proximal Gradient (iBPG)
A recent trend in non-convex optimization for machine learning,
which takes root in the seminal work by Nesterov , is to use
extrapolation of the iterates to speed up the decrease of the
cost function. Although such techniques were originally designed
for smooth convex problems and gradient descent, extrapolation has
been investigated in the context of non-smooth , non-convex 
problems and in conjunction with alternating gradient descent .
It has also been shown that extrapolation of the iterates can be in-
terpreted in the realm of harmonic mechanical systems, shedding
light on the performance enhancement [15, 16]. These techniques
have been seldom used for solving aCPD, despite some recent at-
tempts [17, 18] summarized in Section 5.
An alternating (proximal) gradient descent algorithm with
extrapolation, which we refer to as iBPG, was recently proposed [19]
to solve general nonconvex nonsmooth block-separable composite
optimization problems. iBPG embraces some advanced features of
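Algorithm 1 iBPG for CPD
1: Initialization: Choose $\delta_w = 0.99$, $\beta = 1.01$, $t_0 = 1$,
   and 2 sets of initial factor matrices […]. Set $k = 1$.
2: Set $A^{(j)}$ […]
3: Set $A^{(j)}$ […]
4: repeat
5:   for $j = 1, \dots, p$ do
6:     repeat
7:       $t_k = \frac{1}{2} \big( 1 + \sqrt{1 + 4 t_{k-1}^2} \big)$, […]
8:       $w^{(j)}_{k-1} = \min \big( \hat{w}_{k-1}, \delta_w \sqrt{L^{(j)} […]} \big)$
9:       […]
10:      Compute two extrapolation points […]
11:      Set $A^{(j)}$ […]
12:      Update $A^{(j)}_{\text{cur}}$ by gradient step: […]
13:    until some criterion is satisfied
14:    Set $A^{(j)}$ […]
15:  end for
16:  Set $k = k + 1$.
17: until some criterion is satisfied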
acceleration methods using extrapolation:
• iBPG uses two different extrapolation points to evaluate the gradient
and to add inertial force. This feature was experimentally shown
to improve convergence compared to the use of a single extrapolation
point.
• iBPG does not require a restarting step: convergence is guaranteed
without any restart. This is in contrast with most algorithms using
extrapolation in the non-convex case, where a restarting step is
necessary to ensure convergence [17, 14]: a step is accepted only if the
objective function decreases and, when this is not the case, the
algorithm restarts by taking a standard gradient step. Avoiding restarts
is very useful when evaluating the objective function is expensive.
• iBPG is very flexible in the order in which the blocks are updated:
for example, each matrix factor can be updated several times, which
allows some computations to be reused (as in A-HALS ) and leads to
more updates at a lower computational cost.
iBPG is proved to have subsequential convergence under some
mild conditions, and global convergence under some additional
assumptions. iBPG can easily be instantiated for aCPD; the resulting
algorithm is summarized in Algorithm 1. As the choice of the
parameters in Algorithm 1 satisfies the relaxed conditions in [19,
Remark 4.7], we can derive from [19, Theorem 4.8] that iBPG for
solving aCPD is guaranteed to have (at least) subsequential
convergence; see [19, Section 5] for a similar explanation in the case
of the non-negative matrix factorization problem.
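The idea of a block gradient step with two extrapolation points can be sketched as below for an order-3 tensor. This is a hedged illustration assuming NumPy, not a transcription of Algorithm 1. In particular, the roles of the two points (the one with weight `w` used for the gradient, the one with weight `beta * w` carrying the inertia), the squared Frobenius cost without a 1/2 factor, and the function name `ibpg_block_step` are all assumptions made here. The schedule of the weight `w` is not reproduced and is passed as an argument.

```python
import numpy as np

def ibpg_block_step(A_cur, A_prev, M, G, w, beta=1.01):
    """One inertial gradient step on a single block, other blocks fixed.

    M is the contraction of T with the other blocks (the MTTKRP) and G the
    Hadamard product of their Gram matrices, so that the block gradient of
    the squared Frobenius cost at X is 2 * (X @ G - M).
    """
    L = 2.0 * np.linalg.norm(G, 2)                 # Lipschitz constant of the block gradient
    direction = A_cur - A_prev
    point_grad = A_cur + w * direction             # assumed: point where the gradient is evaluated
    point_inertia = A_cur + beta * w * direction   # assumed: point that carries the inertial force
    grad = 2.0 * (point_grad @ G - M)
    return point_inertia - grad / L

rng = np.random.default_rng(0)
A, B, C = (rng.standard_normal((n, 3)) for n in (6, 7, 8))
T = rng.standard_normal((6, 7, 8))
M = np.einsum('ijk,jr,kr->ir', T, B, C)
G = (B.T @ B) * (C.T @ C)

def block_cost(X):
    """Cost as a function of the first block only."""
    return np.sum((T - np.einsum('ir,jr,kr->ijk', X, B, C)) ** 2)

A_plain = ibpg_block_step(A, A, M, G, w=0.0)        # no inertia: plain gradient step
A_inert = ibpg_block_step(A, 0.9 * A, M, G, w=0.5)  # with inertia
```

With `w = 0` the step reduces to a gradient step of length `1 / L`, which cannot increase the block cost; with `w > 0` monotone decrease is no longer guaranteed, which is why convergence results for such schemes constrain the extrapolation weights.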
3.2. Heuristic Extrapolation and Restart (her) BCD
Although alternating gradient-based approaches are not state-of-the-
art at the moment for computing aCPD, BCD algorithms on the other
hand are extremely popular, in particular the ALS algorithm, mostly
due to its simplicity and efﬁciency for simple problems. However,
ALS is known to converge slowly for instance when the factors A(j)
are ill-conditioned. In this paper, we introduce an extrapolation of
the factor estimates between each block update. Moreover, the
extrapolation technique is not a straightforward heuristic line search
such as described in [20, pp.95–96], but mimics Nesterov’s
extrapolation by introducing pairing variables $Z^{(j)}$. For instance,
when updating the $j$th factor in the ALS algorithm at the $k$th
iteration, the update is modified as:
$A^{(j)}_k = g\big(T, \langle Z^{(l<j)}_k, Z^{(l>j)}_{k-1} \rangle\big)$ as defined in (3), […] (5)
where $\beta_k$ is updated heuristically (see Algorithm 2) using three
parameters $(\eta, \bar{\gamma}, \gamma)$ following the strategy
described in . In short, the idea is to use a restart criterion
$\hat{F}$, which is the cost evaluated at the pairing variables for the
first $p-1$ modes and at the original variable for the last mode, to
update $\beta_k$. When $\hat{F}$ decreases, we grow $\beta$ (multiply it
by a constant $\gamma \ge 1$). When $\hat{F}$ increases, we decrease
$\beta$ (divide it by a constant $\eta \ge 1$). In Algorithm 2,
$\bar{\beta}$ is the upper bound of $\beta$, which is also updated
dynamically. Besides updating $\beta$, restart is carried out based on
$\hat{F}$ to decide whether to keep $A^{(j)}$ or $Z^{(j)}$ in the next
iteration. We refer to this procedure as Heuristic Extrapolation and
Restart (her). It could be used in the same way to design a her-HALS
algorithm and a her-Gradient algorithm, which we do not discuss here
due to space restrictions. In fact, any BCD algorithm can be
accelerated using her, by extrapolating the partial estimates for each
block update. We label such a generic approach herBCD.
Algorithm 2 herALS for CPD
1: Initialization: Choose $\beta_0 \in (0, 1)$, $\eta \ge \gamma \ge \bar{\gamma} \ge 1$,
   and 2 sets of initial factor matrices […]. Set $\bar{\beta}_0 = 1$ and $k = 1$.
2: repeat
3:   for $j = 1, \dots, p$ do
4:     Update: get $A^{(j)}$ […]
5:     Extrapolate: get $Z^{(j)}$ […]
6:   end for
7:   Compute $\hat{F}$ […]
8:   if $\hat{F}$ […] for $j = 1, \dots, p$
9:   else […] for $j = 1, \dots, p$
10:  end if
11:  Set $k = k + 1$.
12: until some criterion is satisfied
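The extrapolation and restart mechanism described in this section can be sketched as follows for an order-3 tensor. This is a hedged reading of the text, assuming NumPy, and several details are assumptions made here rather than taken from Algorithm 2: the extrapolation $Z^{(j)} = A^{(j)} + \beta (A^{(j)} - A^{(j)}_{\text{old}})$, the exact rules for $\bar{\beta}$, the comparison of $\hat{F}$ with its previous value, and the names `ls_update` and `her_als`. The default parameters are the ones reported in the experiments.

```python
import numpy as np

EINSUM = ['ijk,jr,kr->ir', 'ijk,ir,kr->jr', 'ijk,ir,jr->kr']

def ls_update(T, mats, j):
    """Exact least-squares update of block j given the other two blocks."""
    others = [m for l, m in enumerate(mats) if l != j]
    gram = (others[0].T @ others[0]) * (others[1].T @ others[1])
    return np.einsum(EINSUM[j], T, *others) @ np.linalg.pinv(gram)

def her_als(T, A0, n_iter=50, beta=0.5, gamma=1.05, gamma_bar=1.01, eta=1.5):
    A = [a.copy() for a in A0]
    Z = [a.copy() for a in A0]                # pairing variables
    beta_bar, F_prev, F_hat_history = 1.0, np.inf, []
    for _ in range(n_iter):
        for j in range(3):
            A_old = A[j]
            A[j] = ls_update(T, Z, j)                # update from the pairing variables
            Z[j] = A[j] + beta * (A[j] - A_old)      # assumed Nesterov-like extrapolation
        # restart criterion: pairing variables on the first modes, original on the last
        F_hat = np.sum((T - np.einsum('ir,jr,kr->ijk', Z[0], Z[1], A[2])) ** 2)
        if F_hat > F_prev:                    # increase: restart, shrink beta
            Z = [a.copy() for a in A]
            beta_bar, beta = beta, beta / eta
        else:                                 # decrease: keep extrapolated point, grow beta
            A = [z.copy() for z in Z]
            beta_bar = min(1.0, beta_bar * gamma_bar)
            beta = min(beta_bar, beta * gamma)
        F_prev = F_hat
        F_hat_history.append(F_hat)
    return A, F_hat_history, beta

rng = np.random.default_rng(0)
T = np.einsum('ir,jr,kr->ijk', *[rng.standard_normal((n, 3)) for n in (6, 7, 8)])
A0 = [rng.standard_normal((n, 3)) for n in (6, 7, 8)]
A, F_hat_history, beta = her_als(T, A0)
```

Note that $\hat{F}$ is evaluated right after the last block has been updated by least squares from the pairing variables, which is what makes it possible to reuse quantities from that update instead of recomputing the full residual.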
3.3. Additional cost of extrapolation
A natural question that arises when modifying well-known algorithms
to enhance their convergence speed is the impact of such
modifications on the computational time. Indeed, it is often possible to
improve the relative decrease of the objective at each iteration with
respect to ALS, but doing so while keeping a fixed cost per iteration is
The cost of iBPG is essentially that of an alternating gradient
method. Indeed, the computation of the extrapolation points is
negligible given that $r$ is small, since computing the operator norm of
an $r \times r$ matrix is cheap (lines 7 and 8 in Algorithm 1). Moreover,
since there is no need for restart, unlike in herALS, the cost function
does not need to be computed at each iteration. Therefore, the cost
of each block update boils down to the cost of computing the
gradient, which is itself known to be dominated by the so-called
Matricized Tensor Times Khatri-Rao Product (MTTKRP). The MTTKRP
can be efficiently implemented; see for instance [22, 23]. In
summary, for small $r$, one block update of iBPG has essentially the
same cost as one block update of ALS, since the matrix to invert in
ALS is of size $r \times r$.
The cost of one iteration of herALS is also essentially the same
as one iteration of ALS, although this requires a twist on the restart
condition. Indeed, to perform restart, it is in theory necessary to
check whether the cost function increases after extrapolation. However,
computing the cost function is expensive for aCPD unless the
previously computed MTTKRPs can be reused. Note that the restart in
herALS is based on the cost evaluated at the pairing variables, for
which MTTKRPs have indeed been computed in the ALS update.
Although this is not a standard way to perform restart, it keeps the
computational cost low while showing no practical difference.
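One way an already computed MTTKRP can be reused to evaluate the cost cheaply is the standard expansion of the squared norm, $\|T - \hat{T}\|_F^2 = \|T\|_F^2 - 2 \langle T, \hat{T} \rangle + \|\hat{T}\|_F^2$, sketched below for an order-3 tensor. This assumes NumPy and is an illustration of the principle only: the paper does not spell out the formula it uses, and the function name `cost_from_mttkrp` is hypothetical.

```python
import numpy as np

def cost_from_mttkrp(norm_T_sq, M_last, A, B, C):
    """Squared Frobenius error from the last-mode MTTKRP, without forming the residual.

    The inner product <T, model> equals sum(M_last * C), and the squared norm of
    the model is the sum of the Hadamard product of the three Gram matrices.
    """
    grams = (A.T @ A) * (B.T @ B) * (C.T @ C)
    return norm_T_sq - 2.0 * np.sum(M_last * C) + np.sum(grams)

rng = np.random.default_rng(0)
T = rng.standard_normal((6, 7, 8))
A, B, C = (rng.standard_normal((n, 3)) for n in (6, 7, 8))
M_last = np.einsum('ijk,ir,jr->kr', T, A, B)   # available from the update of the last block
cheap = cost_from_mttkrp(np.sum(T ** 2), M_last, A, B, C)
direct = np.sum((T - np.einsum('ir,jr,kr->ijk', A, B, C)) ** 2)
```

Given the MTTKRP, the remaining work involves only $r \times r$ Gram matrices and one $n_p \times r$ inner product, so the tensor itself is never touched again.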
In this section we compare iBPG and herALS to three algorithms:
the original un-accelerated ALS (ALS), the accelerated ALS using
Bro’s acceleration (Bro-ALS), and LS-ALS, an accelerated ALS
where the extrapolation sequence is computed by line search in the
style of Anderson Acceleration. See Section 5 for a description of
Bro-ALS and LS-ALS.
In each experiment, the notation $[I, J, K, r]$ denotes the sizes
of the tensor $(I, J, K)$ and the factorization rank $r$. All
experiments are run over 20 random initializations, and we plot the
median of the cost value over these 20 trials. There are two
important things to note. First, all herBCD variants across all
experiments use the same set of default parameters
$[\beta_0, \gamma, \bar{\gamma}, \eta] = [0.5, 1.05, 1.01, 1.5]$. Second,
the y-axis of every plot shows $F - F_{\min}$, where $F$ is the
cost evaluated at all $A^{(j)}$ and $F_{\min}$ is the minimal cost
obtained in the experiment across all initializations and algorithms.
All the experiments are run with MATLAB (v.2015a) on a laptop
with 2.4GHz CPU and 16GB RAM. The codes are available from
4.1. Synthetic data sets
Figure 1 shows the results of two experiments and details the chosen
dimensions of the problem. In both the balanced and unbalanced
cases that were tested, the data tensor is generated as
$\sum_{q=1}^{r} a^{(1)}_q \otimes a^{(2)}_q \otimes a^{(3)}_q + N$,
where the ground truth factors $A^{(j)}$ are sampled
from a Gaussian distribution with zero mean and unit variance.
Note that we adjust the condition number of $A^{(j)}$ to 100 using the
SVD and replacing the singular values by logarithmically spaced values
between 1 and 100. The tensor $N$ is additive Gaussian noise with
zero mean and variance 0.001. The results show that iBPG and herALS
are the best among the five tested algorithms, and in
particular seem to avoid the swamp in which ALS lands in the
unbalanced case. LS-ALS, which converges fast in terms of iterations,
suffers from a higher per-iteration cost.
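The data generation described above can be sketched as follows for the balanced case $[50, 50, 50, 10]$. This is a minimal NumPy illustration rather than the authors' MATLAB code; the function name `ill_conditioned_factor` and the random seed are assumptions.

```python
import numpy as np

def ill_conditioned_factor(n, r, cond, rng):
    """Gaussian n x r factor whose singular values are replaced by log-spaced ones."""
    G = rng.standard_normal((n, r))
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    s = np.logspace(0.0, np.log10(cond), r)   # singular values between 1 and cond
    return (U * s) @ Vt                        # U @ diag(s) @ Vt

rng = np.random.default_rng(0)
dims, r = (50, 50, 50), 10
factors = [ill_conditioned_factor(n, r, 100.0, rng) for n in dims]
noise = np.sqrt(1e-3) * rng.standard_normal(dims)   # zero mean, variance 0.001
T = np.einsum('ir,jr,kr->ijk', *factors) + noise
```

Keeping the singular vectors of the Gaussian draw and replacing only the singular values fixes the condition number of each factor at exactly the requested value.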
Fig. 1. Median error over 20 runs on synthetic data sets plotted
against iterations (top) and time (bottom). On the left is a square
tensor [50,50,50,10], and on the right is an unbalanced tensor
[150,103,50,10]. For the unbalanced case, ALS improves very
slowly up to the 90th iteration: This phenomenon is often referred to
as a swamp in the literature. The proposed extrapolated algorithms
do not encounter this issue in this experiment.
4.2. Real data sets
We now show the results on real data sets: Wine (Fig. 2),
hyperspectral data of Indian Pines and Blood plasma (Fig. 4). Again the
curves are the median over 20 initializations. All sub-figures in a
figure share the same y-axis. Minimal pre-processing is carried out:
NaN values (if any) are replaced with zeros.
We observe that herALS performs the best, followed by Bro-
ALS. iBPG does not perform as well as for the synthetic data sets.
Fig. 2. Results on Wine [44,2700,200,15]. iBPG gets stuck on
Fig. 3. Results on Indian Pines [145,145,200,16].
5. RELATED PRIOR WORK
Although the idea of extrapolation is not new for ALS , there have
not been many works tackling extrapolation for speeding up BCD
algorithms for aCPD. We are aware of two such works. Bro et al. [20]
directly extrapolate the estimated factors using a heuristic approach
recalled in , which we show can be slower than ALS, although it
prevents any appearance of “swamps” in our experiments. Mitchell
et al. [18] have proposed a similar extrapolation strategy, where
they extrapolate all blocks simultaneously using a shared parameter
$\beta_k$. In contrast, in this work, we followed the same scheme to
compute $\beta_k$ but the extrapolation is carried out on each block
right after the least-squares update, rather than after all
least-squares updates. This makes the two approaches quite different.
Furthermore, in their approach an expensive line search (a
least-squares problem) has to be performed to compute the extrapolation
weight $\beta$, so its per-iteration cost is higher than that of all
the other methods in Figure 1.
Fig. 4. Results on Blood [289,301,41] with r= 3 (top), r= 6
(middle) and r= 10 (bottom). Note that the data has many NaN values
(data polluted due to Rayleigh scattering); all NaN values are replaced
by 0. There are therefore many zero fibers in the tensor after such
correction.
6. CONCLUSION AND PERSPECTIVES
We have introduced extrapolation-based alternating algorithms for
solving aCPD. On a limited set of synthetic experiments with
ill-conditioned tensors, the recently proposed iBPG algorithm, which is
alternating gradient-based, outperforms workhorse block-coordinate
descent algorithms such as ALS, and helps escape “swamps”. The
algorithm proposed in this paper, herALS, is a variant of ALS in
which iterates are extrapolated, and also performs well without
fine-tuning of the hyperparameters. On a few real data sets
stemming from fluorescence spectroscopy and remote sensing, herALS
outperforms all tested methods while iBPG shows mixed results.
Further tests and comparisons should therefore be performed to
assess the performance of both iBPG and herALS in specific
practical cases. Finally, this work provides further practical evidence
that extrapolation helps escape swamps when computing aCPD.
[1] V. De Silva and L.-H. Lim, “Tensor rank and the ill-posedness
of the best low-rank approximation problem,” SIAM Journal
on Matrix Analysis and Applications, vol. 30, no. 3, pp. 1084–
[2] P. Comon, X. Luciani, and A. L. F. De Almeida, “Tensor
decompositions, alternating least squares and other tales,”
Journal of Chemometrics, vol. 23, no. 7-8, pp. 393–405, 2009.
[3] N. D. Sidiropoulos, L. De Lathauwer, X. Fu, K. Huang, E. E.
Papalexakis, and C. Faloutsos, “Tensor decomposition for
signal processing and machine learning,” IEEE Transactions on
Signal Processing, vol. 65, no. 13, pp. 3551–3582, 2017.
[4] T. G. Kolda and B. W. Bader, “Tensor decompositions and
applications,” SIAM Review, vol. 51, no. 3, pp. 455–500, sep
[5] D. P. Bertsekas, Nonlinear Programming, Athena Scientific
[6] A. Uschmajew, “Local convergence of the alternating least
squares algorithm for canonical tensor approximation,” SIAM
J. Matrix Analysis, vol. 33, no. 2, pp. 639–652, 2012.
[7] A. Cichocki, R. Zdunek, A. H. Phan, and S.-I. Amari,
Nonnegative Matrix and Tensor Factorization, Wiley, 2009.
[8] N. Gillis and F. Glineur, “Accelerated multiplicative updates
and hierarchical ALS algorithms for nonnegative matrix
factorization,” Neural Computation, vol. 24, no. 4, pp. 1085–1105,
[9] E. Acar, D. M. Dunlavy, and T. G. Kolda, “A scalable
optimization approach for fitting canonical tensor
decompositions,” Journal of Chemometrics, vol. 25, no. 2, pp. 67–86,
[10] R. A. Harshman, “Foundations of the PARAFAC procedure:
Models and conditions for an “explanatory” multimodal factor
analysis,” UCLA Working Papers in Phonetics, vol. 16, 1970.
[11] Y. Nesterov, “A method of solving a convex programming
problem with convergence rate O(1/k^2),” in Soviet Mathematics
Doklady, 1983, vol. 27, pp. 372–376.
[12] Y. Nesterov, “Gradient methods for minimizing composite
functions,” Mathematical Programming, vol. 140, no. 1, pp.
125–161, Aug 2013.
[13] T. Pock and S. Sabach, “Inertial proximal alternating linearized
minimization (iPALM) for nonconvex and nonsmooth
problems,” SIAM Journal on Imaging Sciences, vol. 9, no. 4, pp.
[14] Y. Xu, “Alternating proximal gradient method for sparse
nonnegative Tucker decomposition,” Mathematical Programming
Computation, vol. 7, no. 1, pp. 39–70, mar 2015,
[15] W. Su, S. Boyd, and E. J. Candès, “A differential equation for
modeling Nesterov’s accelerated gradient method: Theory and
insights,” Journal of Machine Learning Research, vol. 17, no.
153, pp. 1–43, 2016.
[16] M. Muehlebach and M. I. Jordan, “A dynamical systems
perspective on Nesterov acceleration,” arXiv preprint
[17] Y. Xu and W. Yin, “A block coordinate descent method for
regularized multiconvex optimization with applications to
nonnegative tensor factorization and completion,” SIAM Journal
on Imaging Sciences, vol. 6, no. 3, pp. 1758–1789, 2013.
[18] D. Mitchell, N. Ye, and H. De Sterck, “Nesterov acceleration of
alternating least squares for canonical tensor decomposition,”
[19] L. T. K. Hien, N. Gillis, and P. Patrinos, “Inertial block
mirror descent method for non-convex non-smooth optimization,”
[20] R. Bro, Multi-way Analysis in the Food Industry: Models,
Algorithms, and Applications, Ph.D. thesis, University of
Amsterdam, The Netherlands, 1998.
[21] A. M. S. Ang and N. Gillis, “Accelerating nonnegative matrix
factorization algorithms using extrapolation,” Neural
Computation, vol. 31, no. 2, pp. 417–439, 2019.
[22] N. Ravindran, N. D. Sidiropoulos, S. Smith, and G. Karypis,
“Memory-efficient parallel computation of tensor and matrix
products for big tensor decomposition,” in 2014 48th Asilomar
Conference on Signals, Systems and Computers. IEEE, 2014,
[23] E. E. Papalexakis, U. Kang, C. Faloutsos, N. D. Sidiropoulos,
and A. Harpale, “Large scale tensor decompositions: Algorithmic
developments and applications,” IEEE Data Eng. Bull.,
vol. 36, no. 3, pp. 59–66, 2013.
[24] A. Ang, J. E. Cohen, and N. Gillis, “Accelerating approximate
nonnegative canonical polyadic decomposition using
extrapolation,” in XXVIIe Colloque GRETSI, 2019.