
# Extrapolated Alternating Algorithms for Approximate Canonical Polyadic Decomposition

## Abstract

Tensor decompositions have become a central tool in machine learning to extract interpretable patterns from multiway arrays of data. However, computing the approximate Canonical Polyadic Decomposition (aCPD), one of the most important tensor decomposition models, remains a challenge. In this work, we propose several algorithms based on extrapolation that improve over existing alternating methods for aCPD. We show on several simulated and real data sets that carefully designed extrapolation can significantly improve the convergence speed, and hence reduce the computational time, especially in difficult scenarios.
Andersen Man Shun Ang, Jeremy E. Cohen, Le Thi Khanh Hien, Nicolas Gillis
Department of Mathematics and Operational Research, Université de Mons, Belgium
CNRS, Université de Rennes, Inria, IRISA Campus de Beaulieu, Rennes, France
Update 2020-09-24: corrected the typo in line 9 of Algo. 2.
Non-convex Optimization, Block-coordinate Descent, Acceleration
1. PROBLEM STATEMENT
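Let $\otimes$ be the tensor product, $[a^{(1)} \otimes \dots \otimes a^{(p)}]_{i_1 \dots i_p} = \prod_{j=1}^{p} a^{(j)}_{i_j}$ with $a^{(j)} \in \mathbb{R}^{n_j}$ and $p \in \mathbb{N}$, and set $n = \times_j n_j$ for a collection of $n_j \in \mathbb{N}$. We are interested in solving efficiently the following approximate Canonical Polyadic Decomposition (aCPD) optimization problem:

Definition 1 (aCPD) Given a tensor $T \in \mathbb{R}^{n}$ of order $p$ and an integer $r$, find a tensor $\hat{T}$ such that

$$\hat{T} = \operatorname*{argmin}_{\operatorname{rank}(G) \le r} \|T - G\|_F^2, \tag{1}$$

where the rank of a tensor $G$ is defined as

$$\min \Big\{ r \in \mathbb{N} \;\Big|\; \exists\, a_i^{(j)} \in \mathbb{R}^{n_j},\ G = \sum_{i=1}^{r} \bigotimes_{j=1}^{p} a_i^{(j)} \Big\}. \tag{2}$$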
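As an illustration of the objects in Definition 1, the following minimal NumPy sketch builds a tensor as a sum of $r$ rank-one terms and evaluates the squared Frobenius error in (1). The function names `cp_to_tensor` and `acpd_cost`, the storage of the vectors $a_i^{(j)}$ as columns of a matrix per mode, and the dimensions used are assumptions made for this example only.

```python
import numpy as np

def cp_to_tensor(factors):
    """Sum of r rank-one terms: G = sum_i a_i^(1) x ... x a_i^(p).

    `factors` is a list of p matrices of shape (n_j, r); the i-th column
    of the j-th matrix plays the role of a_i^(j).
    """
    r = factors[0].shape[1]
    G = np.zeros(tuple(A.shape[0] for A in factors))
    for i in range(r):
        term = factors[0][:, i]
        for A in factors[1:]:
            # Outer product with the i-th column of the next mode.
            term = np.multiply.outer(term, A[:, i])
        G += term
    return G

def acpd_cost(T, factors):
    """Squared Frobenius error ||T - G||_F^2 for the model G above."""
    return float(np.sum((T - cp_to_tensor(factors)) ** 2))

# Hypothetical 4 x 3 x 5 example with r = 2.
rng = np.random.default_rng(0)
factors = [rng.standard_normal((n, 2)) for n in (4, 3, 5)]
T = cp_to_tensor(factors)
```

By construction the tensor `T` above has rank at most 2, so the cost evaluated at the generating factors is zero.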
The aCPD problem is ill-posed in the sense that a solution might not exist, since the set of rank $r$ tensors is not closed as soon as $r > 1$ and $p > 2$ [1]. This poses serious problems in practice. It has been documented that the estimates of each block $A^{(j)} = [a_1^{(j)}, \dots, a_r^{(j)}]$ ($1 \le j \le p$) may in consequence have pairs of columns growing to infinity while canceling each other out, even if a best rank $r$ approximation exists; see [2] and the references therein. This degeneracy causes “swamps” in the cost function decrease, such as shown in Figure 1. The cost function is also non-convex, albeit quadratic with respect to each block $A^{(j)}$ ($1 \le j \le p$). Therefore, computing aCPD remains a challenging task in general. We refer the interested reader to [3] for a comprehensive survey on these questions.
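To make the degeneracy concrete, here is the classical example of an order-3 tensor that is a limit of rank-two tensors, in the spirit of the analysis in [1]. The vectors $a, b$ (assumed linearly independent) and the index $m$ are notation introduced for this example only. Let $T = a \otimes a \otimes b + a \otimes b \otimes a + b \otimes a \otimes a$ and write $x^{\otimes 3} = x \otimes x \otimes x$. Expanding the first term by multilinearity gives

$$m \big(a + \tfrac{1}{m} b\big)^{\otimes 3} - m\, a^{\otimes 3} = T + \tfrac{1}{m} \big(a \otimes b \otimes b + b \otimes a \otimes b + b \otimes b \otimes a\big) + \tfrac{1}{m^2}\, b^{\otimes 3} \;\longrightarrow\; T \quad (m \to \infty).$$

The left-hand side is a sum of two rank-one terms whose norms grow like $m$ while their difference stays bounded, which is the cancellation pattern described above.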
AMSA, LTKH and NG acknowledge the support by the European Research Council (ERC starting grant No 679515), and by the Fonds de la Recherche Scientifique - FNRS and the Fonds Wetenschappelijk Onderzoek - Vlaanderen (FWO) under EOS project O005318F-RG47.
2. SOME EXISTING ALTERNATING ALGORITHMS
As tensor decompositions have become an important and extensively studied topic in data science, it is out of the scope of this paper to summarize all the literature on how to compute aCPD. Therefore, in what follows, we focus on alternating methods, which update one block of variables $A^{(j)}$ at a time while keeping the others fixed.

There are mainly two categories of alternating algorithms to compute aCPD: Exact Block-Coordinate Descent (EBCD) algorithms, and Approximate BCD (ABCD) algorithms. EBCD algorithms feature block-wise optimal updates. The most well-known EBCD algorithm for aCPD is Alternating Least Squares (ALS), also called CP-ALS or sometimes PARAFAC (which is also another name for aCPD). ALS sequentially updates the blocks $A^{(j)}$ as follows:

$$\hat{A}^{(j)} = g\big(T, A^{(l \neq j)}\big) := T_{[j]}\, B^{(j)\dagger}, \tag{3}$$

where $T_{[j]}$ is the $j$-th unfolding of $T$, as defined for instance in [4], $\dagger$ denotes the pseudo-inverse, $\odot$ is the Khatri-Rao product and $B^{(j)T} = \bigodot_{l=p,\, l \neq j}^{1} A^{(l)}$. ALS has no convergence guarantee since each block update may have more than one solution [5]. Some local convergence guarantees exist, though [6]. Another simple BCD algorithm is Hierarchical ALS (HALS), which updates each column of each $A^{(j)}$ sequentially. While HALS has scarcely been used for solving aCPD (contrary to its nonnegative counterpart [7]), it may in principle be faster than ALS for large tensors since no linear systems are to be solved at each iteration due to the separability of the objective function. A variant of HALS, A-HALS, updates the columns of each factor $A^{(j)}$ with several cycles before jumping to another factor. This reduces the computational cost, as some matrix products can be reused [8].
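The ALS block update can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' MATLAB code: the helper names `unfold`, `khatri_rao` and `als_update` are hypothetical, and the unfolding uses NumPy's row-major ordering, so the Khatri-Rao factors are taken in increasing mode order. This differs from the ordering $l = p, \dots, 1$ used in the text only by a permutation of columns.

```python
import numpy as np

def unfold(T, j):
    """Mode-j unfolding (row-major): rows are indexed by mode j."""
    return np.moveaxis(T, j, 0).reshape(T.shape[j], -1)

def khatri_rao(mats):
    """Column-wise Kronecker product of matrices with r columns each."""
    r = mats[0].shape[1]
    out = mats[0]
    for M in mats[1:]:
        out = np.einsum('iq,kq->ikq', out, M).reshape(-1, r)
    return out

def als_update(T, factors, j):
    """Least-squares update of block j with the other blocks fixed."""
    # K has shape (N, r); the unfolded model reads T_[j] ~ A^(j) K^T.
    K = khatri_rao([A for l, A in enumerate(factors) if l != j])
    # Solve min_A ||T_[j] - A K^T||_F through the transposed system.
    sol = np.linalg.lstsq(K, unfold(T, j).T, rcond=None)[0]
    return sol.T

# Hypothetical exact rank-2 example: mode 1 is recovered from modes 0 and 2.
rng = np.random.default_rng(0)
A0, A1, A2 = (rng.standard_normal((n, 2)) for n in (4, 3, 5))
T = np.einsum('iq,jq,kq->ijk', A0, A1, A2)
junk = rng.standard_normal(A1.shape)
A1_hat = als_update(T, [A0, junk, A2], 1)
```

Solving through `np.linalg.lstsq` rather than forming a pseudo-inverse explicitly is a design choice of this sketch; both give the same minimizer when the Khatri-Rao matrix has full column rank.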
ABCD algorithms, such as alternating gradient methods, do not
solve each subproblem optimally. These methods have scarcely been
considered for solving aCPD, in favor of all-at-once gradient-based
approaches [9].
3. PROPOSED APPROACHES: IBPG AND HERBCD
In the following, we introduce two algorithms for computing aCPD that make use of extrapolation in two different ways. Extrapolation for escaping saddle points and enhancing convergence speed is an intensively studied topic in machine learning, in particular in the context of deep learning where the cost functions to minimize are highly non-convex and gradient-based algorithms might require a “push” to escape bad regions of the search space faster. Our goal is to show empirically that extrapolation, on top of enhancing empirical convergence speed in difficult cases, also helps escape “swamps” when computing aCPD. This observation is actually not new, and may be traced back to seminal work by Harshman [10]. We provide a fresh view on these issues by using more recent optimization techniques.
For convenience, let us define the cost function $F$ as

$$F\big(A^{(1)}, \dots, A^{(p)}\big) = \Big\| T - \sum_{i=1}^{r} \bigotimes_{j=1}^{p} a_i^{(j)} \Big\|_F^2. \tag{4}$$
3.1. Inertial Block Proximal Gradient (iBPG)
A recent trend in non-convex optimization for machine learning, which takes root in the seminal work by Nesterov [11], is to make use of extrapolation of the iterates to enhance convergence speed of the cost function. Although such techniques were originally designed for convex smooth problems and gradient descent, extrapolation has been investigated in the context of non-smooth [12] and non-convex [13] problems, and in conjunction with alternating gradient descent [14]. It has also been shown that extrapolation of the iterates can be interpreted in the realm of harmonic mechanical systems, shedding light on the performance enhancement [15, 16]. These techniques have seldom been used for solving aCPD, despite some recent attempts [17, 18] summarized in Section 5.
Such an alternating (proximal) gradient descent algorithm with extrapolation, which we refer to as iBPG, was recently proposed [19] to solve a general nonconvex nonsmooth block separable composite optimization problem. iBPG embraces some advanced features of acceleration methods using extrapolation:

- iBPG uses two different extrapolation points to evaluate the gradient and to add inertial force. This feature was experimentally shown to improve convergence compared to the use of a single extrapolation point.
- iBPG does not require a restarting step: convergence is guaranteed without any restart. This is in contrast with most algorithms using extrapolation in the non-convex case where a restarting step is necessary to ensure convergence [17, 14]: a step is accepted only if the objective function decreases and, when this is not the case, the algorithm restarts by taking a standard gradient step. This feature is very useful when evaluating the objective function is expensive.
- iBPG is very flexible in the choice of the order in which the blocks are updated: for example, each matrix factor can be updated several times, allowing some computations to be reused (like in A-HALS [8]).
- iBPG is proved to have sub-sequential convergence under some mild conditions, and global convergence under some additional assumptions.

iBPG can easily be instantiated for aCPD; the resulting algorithm is summarized in Algorithm 1. As the choice of the parameters in Algorithm 1 satisfies the relaxed conditions in [19, Remark 4.7], we can derive from [19, Theorem 4.8] that iBPG for solving aCPD is guaranteed to have (at least) a sub-sequential convergence; see [19, Section 5] for a similar explanation in the case of the non-negative matrix factorization problem.

Algorithm 1 iBPG for CPD
1: Initialization: Choose $\delta_w = 0.99$, $\beta = 1.01$, $t_0 = 1$, and 2 sets of initial factor matrices $A^{(1)}_{-1}, \dots, A^{(p)}_{-1}$ and $A^{(1)}_{0}, \dots, A^{(p)}_{0}$. Set $k = 1$.
2: Set $A^{(j)}_{\text{prev}} = A^{(j)}_{-1}$, $j = 1, \dots, p$.
3: Set $A^{(j)}_{\text{cur}} = A^{(j)}_{0}$, $j = 1, \dots, p$.
4: repeat
5:   for $j = 1, \dots, p$ do
6:     $t_k = \frac{1}{2}\big(1 + \sqrt{1 + 4 t_{k-1}^2}\big)$, $\hat{w}_{k-1} = \frac{t_{k-1} - 1}{t_k}$
7:     $w^{(j)}_{k-1} = \min\Big(\hat{w}_{k-1},\ \delta_w \sqrt{L^{(j)}_{k-2} / L^{(j)}_{k-1}}\Big)$
8:     $L^{(j)}_{k} = \big\| B^{(j)T}_{k} B^{(j)}_{k} \big\|$
9:     repeat
10:      Compute two extrapolation points
         $\hat{A}^{(j,1)} = A^{(j)}_{\text{cur}} + w^{(j)}_{k-1} \big(A^{(j)}_{\text{cur}} - A^{(j)}_{\text{prev}}\big)$,
         $\hat{A}^{(j,2)} = A^{(j)}_{\text{cur}} + \beta w^{(j)}_{k-1} \big(A^{(j)}_{\text{cur}} - A^{(j)}_{\text{prev}}\big)$
11:      Set $A^{(j)}_{\text{prev}} = A^{(j)}_{\text{cur}}$.
12:      Update $A^{(j)}$:
         $A^{(j)}_{\text{cur}} = \hat{A}^{(j,2)} - \frac{1}{L^{(j)}_{k-1}} \big(\hat{A}^{(j,1)} B^{(j)T}_{k-1} - T_{[j]}\big) B^{(j)}_{k-1}$.
13:    until some criterion is satisfied
14:    Set $A^{(j)}_{k} = A^{(j)}_{\text{cur}}$.
15:  end for
16:  Set $k = k + 1$.
17: until some criterion is satisfied
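The extrapolation weights (lines 6 and 7) and one inner update (lines 10 to 12) of the iBPG algorithm can be sketched as below. This is a hedged illustration, not the reference implementation: the function names are hypothetical, the Khatri-Rao matrix `B` (of size $N \times r$, so that the unfolded model is $T_{[j]} \approx A B^T$) and the constant `L` are passed in by the caller, and the bookkeeping of the iteration indices on $L$ and $B$ is left out.

```python
import numpy as np

def next_t(t_prev):
    """Line 6: t_k = (1 + sqrt(1 + 4 t_{k-1}^2)) / 2."""
    return 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t_prev ** 2))

def extrapolation_weight(t_prev, t_cur, L_old, L_new, delta_w=0.99):
    """Lines 6-7: min((t_{k-1} - 1) / t_k, delta_w * sqrt(L_old / L_new))."""
    return min((t_prev - 1.0) / t_cur, delta_w * np.sqrt(L_old / L_new))

def ibpg_inner_step(A_cur, A_prev, Tj, B, L, w, beta=1.01):
    """Lines 10-12 for one block; returns (new A_cur, new A_prev).

    Tj is the mode-j unfolding (n_j x N), B is N x r, and L is the
    spectral norm of B.T @ B (an r x r matrix).
    """
    diff = A_cur - A_prev
    A_hat1 = A_cur + w * diff           # point where the gradient is evaluated
    A_hat2 = A_cur + beta * w * diff    # point carrying the inertial term
    grad = (A_hat1 @ B.T - Tj) @ B      # gradient of 0.5 * ||Tj - A B^T||_F^2
    return A_hat2 - grad / L, A_cur

# Hypothetical small problem: 4 x 20 unfolding, r = 3.
rng = np.random.default_rng(0)
B = rng.standard_normal((20, 3))
A_true = rng.standard_normal((4, 3))
Tj = A_true @ B.T
L = np.linalg.norm(B.T @ B, 2)
t0 = 1.0
t1 = next_t(t0)
w = extrapolation_weight(t0, t1, L, L)  # zero at the first iteration since t0 = 1
A0 = np.zeros((4, 3))
A1, A_prev = ibpg_inner_step(A0, A0, Tj, B, L, w)
```

With $t_0 = 1$ the first weight is zero, so the first update reduces to a plain gradient step of length $1/L$; inertia only enters from the second outer iteration on.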
3.2. Heuristic Extrapolation and Restart (her) BCD
Although alternating gradient-based approaches are not state-of-the-art at the moment for computing aCPD, BCD algorithms on the other hand are extremely popular, in particular the ALS algorithm, mostly due to its simplicity and efficiency for simple problems. However, ALS is known to converge slowly, for instance when the factors $A^{(j)}$ are ill-conditioned. In this paper, we introduce an extrapolation of the factor estimates between each block update. Moreover, the extrapolation technique is not a straightforward heuristic line search such as described in [20, pp. 95–96], but mimics Nesterov's extrapolation by introducing pairing variables $Z^{(j)}$. For instance, when updating the $j$th factor in the ALS algorithm at the $k$th iteration, the update is modified as:

$$A^{(j)}_{k} = g\big(T, \big[Z^{(l<j)}_{k}, Z^{(l>j)}_{k-1}\big]\big) \quad \text{as defined in (3)}, \tag{5}$$

$$Z^{(j)}_{k} = A^{(j)}_{k} + \beta_k \big(A^{(j)}_{k} - A^{(j)}_{k-1}\big), \tag{6}$$

where $\beta_k$ is updated heuristically (see Algorithm 2) using three parameters $(\eta, \bar{\gamma}, \gamma)$ following the strategy described in [21]. In short, the idea is to use a restart criterion $\hat{F}_k = F(A^{(p)}_{k}; Z^{(l \neq p)}_{k})$, which is the cost evaluated at the pairing variable for the first $p-1$ modes and the original variable at the last mode, to update $\beta_k$. When $\hat{F}$ decreases, we grow $\beta$ (multiply it by a constant $\gamma \ge 1$). When $\hat{F}$ increases, we decrease $\beta$ (divide it by a constant $\eta \ge 1$). In Algorithm 2, $\bar{\beta}$ is the upper bound of $\beta$, which is also updated dynamically. Besides updating $\beta$, restart is carried out based on $\hat{F}$ to decide whether to keep $A^{(j)}$ or $Z^{(j)}$ in the next iteration. We refer to this procedure as Heuristic Extrapolation and Restart (her). It could be used in the same way to design a her-HALS algorithm and a her-Gradient algorithm, which we do not discuss here due to space restrictions. In fact, any BCD algorithm can be accelerated using her, by extrapolating the partial estimates for each block update. We label such a generic approach herBCD.
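Algorithm 2 herALS for CPD
1: Initialization: Choose $\beta_0 \in (0, 1)$, $\eta \ge \gamma \ge \bar{\gamma} \ge 1$, and 2 sets of initial factor matrices $A^{(1)}_{0}, \dots, A^{(p)}_{0}$ and $Z^{(1)}_{0}, \dots, Z^{(p)}_{0}$. Set $\bar{\beta}_0 = 1$ and $k = 1$.
2: repeat
3:   for $j = 1, \dots, p$ do
4:     Update: get $A^{(j)}_{k}$ as (5).
5:     Extrapolate: get $Z^{(j)}_{k}$ as (6).
6:   end for
7:   Compute $\hat{F}_k = F(A^{(p)}_{k}; Z^{(l \neq p)}_{k})$.
8:   if $\hat{F}_k > \hat{F}_{k-1}$ for $k \ge 2$ then
       Set $Z^{(j)}_{k} = A^{(j)}_{k}$ for $j = 1, \dots, p$
       Set $\bar{\beta}_k = \beta_{k-1}$, $\beta_k = \beta_{k-1} / \eta$
9:   else
       Set $A^{(j)}_{k} = Z^{(j)}_{k}$ for $j = 1, \dots, p$
       Set $\bar{\beta}_k = \min\{1, \bar{\beta}_{k-1} \bar{\gamma}\}$, $\beta_k = \min\{\bar{\beta}_{k-1}, \beta_{k-1} \gamma\}$
10:  end if
11:  Set $k = k + 1$.
12: until some criterion is satisfied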
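The herALS scheme can be sketched in NumPy as follows. This is a minimal sketch under several assumptions, not the authors' code: the helper names are hypothetical, the pairing variables are initialized as $Z_0 = A_0$, a fixed number of iterations replaces the stopping criterion, the most recently computed $\beta$ is used in the extrapolation step, and $\hat{F}$ is evaluated directly from the unfolding rather than from a reused MTTKRP. The default parameters are the ones reported in the experiments.

```python
import numpy as np

def unfold(T, j):
    """Mode-j unfolding (row-major)."""
    return np.moveaxis(T, j, 0).reshape(T.shape[j], -1)

def khatri_rao(mats):
    """Column-wise Kronecker product, modes taken in the given order."""
    r = mats[0].shape[1]
    out = mats[0]
    for M in mats[1:]:
        out = np.einsum('iq,kq->ikq', out, M).reshape(-1, r)
    return out

def ls_update(T, blocks, j):
    """Least-squares solution for block j given the other blocks."""
    K = khatri_rao([X for l, X in enumerate(blocks) if l != j])
    return np.linalg.lstsq(K, unfold(T, j).T, rcond=None)[0].T

def her_als(T, A, n_iter=50, beta=0.5, gamma=1.05, gamma_bar=1.01, eta=1.5):
    """Sketch of herALS; `A` is a list of initial factors of shape (n_j, r)."""
    p = len(A)
    A = [X.copy() for X in A]
    Z = [X.copy() for X in A]                 # assumption: Z_0 = A_0
    beta_bar, F_prev, history = 1.0, np.inf, []
    for _ in range(n_iter):
        A_old = [X.copy() for X in A]
        for j in range(p):
            A[j] = ls_update(T, Z, j)                  # update using the pairing variables
            Z[j] = A[j] + beta * (A[j] - A_old[j])     # extrapolate block j
        # Restart criterion: cost at A^(p) and Z^(l != p).
        K = khatri_rao(Z[:-1])
        F_hat = float(np.sum((unfold(T, p - 1) - A[-1] @ K.T) ** 2))
        if F_hat > F_prev:
            Z = [X.copy() for X in A]                  # restart: drop the extrapolation
            beta_bar, beta = beta, beta / eta
        else:
            A = [X.copy() for X in Z]                  # keep the extrapolated point
            beta, beta_bar = (min(beta_bar, beta * gamma),
                              min(1.0, beta_bar * gamma_bar))
        F_prev = F_hat
        history.append(F_hat)
    return A, history

# Hypothetical exact rank-2 tensor of size 6 x 5 x 4.
rng = np.random.default_rng(1)
true = [rng.standard_normal((n, 2)) for n in (6, 5, 4)]
T = np.einsum('iq,jq,kq->ijk', *true)
init = [rng.standard_normal((n, 2)) for n in (6, 5, 4)]
A, hist = her_als(T, init, n_iter=30)
```

Since the last block is always a least-squares solution given the pairing variables of the other modes, every recorded value of $\hat{F}$ is bounded by $\|T\|_F^2$, whatever the extrapolation does.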
A natural question that arises when modifying well-known algorithms to enhance their convergence speed is the impact of such modifications on the computational time. Indeed, it is often possible to improve the relative decrease of the objective at each iteration with respect to ALS, but doing so while keeping a fixed cost per iteration is more challenging.
The cost of iBPG is essentially that of an alternating gradient method. Indeed, the computation of the extrapolation points is negligible given that $r$ is small, since computing the operator norm of an $r \times r$ matrix is cheap (lines 7 and 8 in Algorithm 1). Moreover, since there is no need for restart like in herALS, the cost function does not need to be computed at each iteration. Therefore, the cost for each block update boils down to the cost of computing the gradient, which is itself known to be dominated by the so-called Matricized Tensor Times Khatri-Rao Product (MTTKRP). The MTTKRP can be efficiently implemented; see for instance [22, 23]. In summary, for small $r$, one block update of iBPG has essentially the same cost as one block update of ALS since the matrix to invert in ALS is of size $r \times r$.
The cost of one iteration of herALS is also essentially the same as one iteration of ALS, although this requires a twist on the restart condition. Indeed, to perform restart, it is in theory necessary to check if the cost function is increasing after extrapolation. However, computing the cost function is expensive for aCPD unless the previously computed MTTKRPs can be reused. Note that the restart in herALS is based on the cost evaluated at the pairing variables, for which MTTKRPs have indeed been computed in the ALS update. Although this is not a standard way to perform restart, it keeps the computational cost low while showing no practical difference.
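The reuse of an MTTKRP for the cost evaluation can be sketched as follows. The identity used is the standard expansion of the squared Frobenius norm, together with the fact that the Gram matrix of a Khatri-Rao product is the Hadamard product of the Gram matrices of its factors. The function name `cost_from_mttkrp` and the row-major unfolding convention are assumptions of this example.

```python
import numpy as np

def unfold(T, j):
    """Mode-j unfolding (row-major)."""
    return np.moveaxis(T, j, 0).reshape(T.shape[j], -1)

def khatri_rao(mats):
    """Column-wise Kronecker product of matrices with r columns each."""
    r = mats[0].shape[1]
    out = mats[0]
    for M in mats[1:]:
        out = np.einsum('iq,kq->ikq', out, M).reshape(-1, r)
    return out

def cost_from_mttkrp(normT2, A_last, M, others):
    """||T - G||_F^2 from the MTTKRP M = T_[p] @ K of the last mode.

    Uses ||T_[p] - A K^T||^2 = ||T||^2 - 2 <A, M> + <A^T A, K^T K>,
    where K^T K is the Hadamard product of the r x r Gram matrices of
    the matrices in `others`.
    """
    r = A_last.shape[1]
    gram = np.ones((r, r))
    for X in others:
        gram *= X.T @ X
    return normT2 - 2.0 * np.sum(A_last * M) + np.sum((A_last.T @ A_last) * gram)

# Hypothetical 4 x 3 x 5 data tensor and factors with r = 2.
rng = np.random.default_rng(0)
T = rng.standard_normal((4, 3, 5))
A0, A1, A2 = (rng.standard_normal((n, 2)) for n in (4, 3, 5))
K = khatri_rao([A0, A1])
M = unfold(T, 2) @ K                      # MTTKRP for the last mode
cheap = cost_from_mttkrp(np.sum(T ** 2), A2, M, [A0, A1])
direct = np.sum((unfold(T, 2) - A2 @ K.T) ** 2)
```

Once `M` and $\|T\|_F^2$ are available, the remaining work involves only $r \times r$ and $n_p \times r$ matrices, which is why reusing the MTTKRP keeps the restart test cheap.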
4. EXPERIMENTS
In this section we compare iBPG and herALS to three algorithms: the original un-accelerated ALS (ALS), the accelerated ALS using Bro's acceleration (Bro-ALS), and LS-ALS, an accelerated ALS where the extrapolation sequence is computed by line search in the style of Anderson Acceleration. See Section 5 for a description of Bro-ALS and LS-ALS.
In each experiment, the notation $[I, J, K, r]$ denotes the sizes of the tensor $(I, J, K)$ and the factorization rank $r$. All experiments are run over 20 random initializations, and we plot the median of the cost value over these 20 trials. There are two important things to note. First, all herBCD runs across all experiments use the same set of default parameters $[\beta_0, \gamma, \bar{\gamma}, \eta] = [0.5, 1.05, 1.01, 1.5]$. Second, the y-axis of every plot shows $F - F_{\min}$, where $F$ is the cost evaluated at all $A^{(j)}$ and $F_{\min}$ is the minimal cost obtained in the experiment across all initializations and algorithms. All the experiments are run with MATLAB (v.2015a) on a laptop with 2.4GHz CPU and 16GB RAM. The codes are available from https://angms.science/research.html.
4.1. Synthetic data sets
Figure 1 shows the results of two experiments and details the chosen dimensions of the problem. In both the balanced and unbalanced cases that were tested, the data tensor is generated as $\sum_{q=1}^{r} a^{(1)}_q \otimes a^{(2)}_q \otimes a^{(3)}_q + N$, where the ground truth factors $A^{(j)}$ are sampled from a Gaussian distribution with zero mean and unit variance. Note that we adjust the condition number of $A^{(j)}$ to 100 using the SVD and replacing the singular values by logarithmically scaled values between 1 and 100. The tensor $N$ is an additive Gaussian noise with zero mean and variance 0.001. The results show that iBPG and herALS are the best algorithms among the five tested, and in particular seem to avoid the swamp in which ALS lands in the unbalanced case. LS-ALS, which converges fast in terms of iterations, suffers from a higher per-iteration cost.
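The data generation described above can be sketched as follows. This is an assumed reading of the procedure, written in NumPy rather than the MATLAB used for the experiments: the function names, the seed handling and the small dimensions in the usage lines are hypothetical, and the singular values are taken log-spaced from 100 down to 1.

```python
import numpy as np

def ill_conditioned_factor(n, r, cond=100.0, rng=None):
    """Gaussian n x r factor whose singular values are replaced by
    logarithmically spaced values between 1 and `cond`."""
    rng = np.random.default_rng() if rng is None else rng
    A = rng.standard_normal((n, r))
    U, _, Vt = np.linalg.svd(A, full_matrices=False)
    s = np.logspace(np.log10(cond), 0.0, r)   # cond, ..., 1
    return (U * s) @ Vt                        # same singular vectors, new spectrum

def synthetic_tensor(dims, r, noise_var=1e-3, seed=0):
    """Order-3 tensor sum_q a_q^(1) x a_q^(2) x a_q^(3) + N."""
    rng = np.random.default_rng(seed)
    factors = [ill_conditioned_factor(n, r, rng=rng) for n in dims]
    clean = np.einsum('iq,jq,kq->ijk', *factors)
    noise = np.sqrt(noise_var) * rng.standard_normal(clean.shape)
    return clean + noise, factors

# Small hypothetical instance; the paper's balanced case is [50, 50, 50, 10].
T, factors = synthetic_tensor((20, 15, 10), r=5)
conds = [np.linalg.cond(A) for A in factors]
```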
Fig. 1. Median error over 20 runs on synthetic data sets plotted
against iterations (top) and time (bottom). On the left is a square
tensor [50,50,50,10], and on the right is an unbalanced tensor
[150,103,50,10]. For the unbalanced case, ALS improves very
slowly up to the 90th iteration: This phenomenon is often referred to
as a swamp in the literature. The proposed extrapolated algorithms
do not encounter this issue in this experiment.
4.2. Real data sets
We now show the results on real data sets: Wine¹ (Fig. 2), hyperspectral data of Indian Pines² (Fig. 3) and Blood plasma³ (Fig. 4). Again the curves are the median over 20 initializations. All sub-figures in a figure share the same y-axis. Minimal pre-processing is carried out: NaN values (if any) are replaced with zeros.
We observe that herALS performs the best, followed by Bro-ALS. iBPG does not perform as well as for the synthetic data sets.
Fig. 2. Results on Wine [44,2700,200,15]. iBPG gets stuck on
local minima.
Fig. 3. Results on Indian Pines [145,145,200,16].
5. RELATED PRIOR WORK
Although the idea of extrapolation is not new for ALS [10], there have not been many works tackling extrapolation for speeding up BCD algorithms for aCPD. We are aware of two such works. Bro et al. [20] extrapolate directly the estimated factor using a heuristic approach recalled in [24], which we show can be slower than ALS, although it prevents any appearance of “swamps” in our experiments. Mitchell et al. [18] have proposed a similar extrapolation strategy, where they extrapolate all blocks simultaneously using a shared parameter $\beta_k$. In contrast, in this work, we followed the same scheme to compute $\beta_k$ but the extrapolation is carried out on each block right after the least-squares update, rather than after all least-squares updates. This makes the two approaches quite different. Furthermore, in their approach an expensive line search (a least-squares problem) has to be performed to compute the extrapolation weight $\beta$. The per-iteration cost is higher than that of all the other methods in Figure 1.
¹ See http://www.models.life.ku.dk/Wine_GCMS_FTIR for data description.
² http://www.ehu.eus/ccwintco/index.php?title=Hyperspectral_Remote_Sensing_Scenes
³ See http://www.models.life.ku.dk/anders-cancer
Fig. 4. Results on Blood [289,301,41] with $r = 3$ (top), $r = 6$ (mid) and $r = 10$ (bottom). Note that the data has many NaN (data polluted due to Rayleigh scattering); all NaN are replaced by 0. There are therefore many zero fibers in the tensor after such correction.
6. CONCLUSION AND PERSPECTIVES
We have introduced extrapolation-based alternating algorithms for solving aCPD. On a limited set of synthetic experiments with ill-conditioned tensors, the recently proposed iBPG algorithm, which is [...] descent algorithm such as ALS, and helps escaping “swamps”. The algorithm proposed in this paper, herALS, is a variant of ALS in which iterates are extrapolated, and also performs well without fine tuning of the hyperparameters. On a few real data sets stemming from fluorescence spectroscopy and remote sensing, herALS outperforms all tested methods while iBPG shows mixed results. Further tests and comparisons should therefore be performed to further assess the performance of both iBPG and herALS in specific practical cases. Finally, this work provides further practical evidence that extrapolation helps escape swamps when computing aCPD.
7. REFERENCES
[1] V. De Silva and L.-H. Lim, “Tensor rank and the ill-posedness of the best low-rank approximation problem,” SIAM Journal on Matrix Analysis and Applications, vol. 30, no. 3, pp. 1084–1127, 2008.
[2] P. Comon, X. Luciani, and A. L. F. De Almeida, “Tensor decompositions, alternating least squares and other tales,” Journal of Chemometrics, vol. 23, no. 7-8, pp. 393–405, 2009.
[3] N. D. Sidiropoulos, L. De Lathauwer, X. Fu, K. Huang, E. E. Papalexakis, and C. Faloutsos, “Tensor decomposition for signal processing and machine learning,” IEEE Transactions on Signal Processing, vol. 65, no. 13, pp. 3551–3582, 2017.
[4] T. G. Kolda and B. W. Bader, “Tensor decompositions and applications,” SIAM Review, vol. 51, no. 3, pp. 455–500, Sep. 2009.
[5] D. P. Bertsekas, Nonlinear Programming, Athena Scientific, Belmont, 1999.
[6] A. Uschmajew, “Local convergence of the alternating least squares algorithm for canonical tensor approximation,” SIAM J. Matrix Analysis, vol. 33, no. 2, pp. 639–652, 2012.
[7] A. Cichocki, R. Zdunek, A. H. Phan, and S-I. Amari, Nonnegative Matrix and Tensor Factorization, Wiley, 2009.
[8] N. Gillis and F. Glineur, “Accelerated multiplicative updates and hierarchical ALS algorithms for nonnegative matrix factorization,” Neural Computation, vol. 24, no. 4, pp. 1085–1105, 2012.
[9] E. Acar, D. M. Dunlavy, and T. G. Kolda, “A scalable optimization approach for fitting canonical tensor decompositions,” Journal of Chemometrics, vol. 25, no. 2, pp. 67–86, 2011.
[10] R. A. Harshman, “Foundations of the PARAFAC procedure: Models and conditions for an ‘explanatory’ multimodal factor analysis,” UCLA Working Papers in Phonetics, vol. 16, 1970.
[11] Y. Nesterov, “A method of solving a convex programming problem with convergence rate $O(1/k^2)$,” in Soviet Mathematics Doklady, 1983, vol. 27, pp. 372–376.
[12] Y. Nesterov, “Gradient methods for minimizing composite functions,” Mathematical Programming, vol. 140, no. 1, pp. 125–161, Aug. 2013.
[13] T. Pock and S. Sabach, “Inertial proximal alternating linearized minimization (iPALM) for nonconvex and nonsmooth problems,” SIAM Journal on Imaging Sciences, vol. 9, no. 4, pp. 1756–1787, 2016.
[14] Y. Xu, “Alternating proximal gradient method for sparse nonnegative Tucker decomposition,” Mathematical Programming Computation, vol. 7, no. 1, pp. 39–70, Mar. 2015, arXiv:1302.2559.
[15] W. Su, S. Boyd, and E. J. Candès, “A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights,” Journal of Machine Learning Research, vol. 17, no. 153, pp. 1–43, 2016.
[16] M. Muehlebach and M. I. Jordan, “A dynamical systems perspective on Nesterov acceleration,” arXiv preprint arXiv:1905.07436, 2019.
[17] Y. Xu and W. Yin, “A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion,” SIAM Journal on Imaging Sciences, vol. 6, no. 3, pp. 1758–1789, 2013.
[18] D. Mitchell, N. Ye, and H. De Sterck, “Nesterov acceleration of alternating least squares for canonical tensor decomposition,” 2018.
[19] L. T. K. Hien, N. Gillis, and P. Patrinos, “Inertial block mirror descent method for non-convex non-smooth optimization,” 2019, arXiv:1903.01818.
[20] R. Bro, Multi-way Analysis in the Food Industry: Models, Algorithms, and Applications, Ph.D. thesis, University of Amsterdam, The Netherlands, 1998.
[21] A. M. S. Ang and N. Gillis, “Accelerating nonnegative matrix factorization algorithms using extrapolation,” Neural Computation, vol. 31, no. 2, pp. 417–439, 2019.
[22] N. Ravindran, N. D. Sidiropoulos, S. Smith, and G. Karypis, “Memory-efficient parallel computation of tensor and matrix products for big tensor decomposition,” in 2014 48th Asilomar Conference on Signals, Systems and Computers. IEEE, 2014, pp. 581–585.
[23] E. E. Papalexakis, U. Kang, C. Faloutsos, N. D. Sidiropoulos, and A. Harpale, “Large scale tensor decompositions: Algorithmic developments and applications,” IEEE Data Eng. Bull., vol. 36, no. 3, pp. 59–66, 2013.
[24] A. Ang, J. E. Cohen, and N. Gillis, “Accelerating approximate nonnegative canonical polyadic decomposition using extrapolation,” in XXVIIe Colloque GRETSI, 2019.