Accelerating Approximate Nonnegative Canonical Polyadic Decomposition using Extrapolation

Andersen ANG¹, Jeremy E. COHEN², Nicolas GILLIS¹*

¹Université de Mons, Rue de Houdain 9, 7000, Mons, Belgium
²Université de Rennes, INRIA, CNRS, IRISA, France

manshun.ang@umons.ac.be   jeremy.cohen@irisa.fr   nicolas.gillis@umons.ac.be
Résumé – In this paper, we consider the problem of the approximate Nonnegative Canonical Polyadic Decomposition (aNCPD) of a third-order nonnegative tensor. This problem consists in minimizing $\|\mathcal{T} - (U \otimes V \otimes W)\mathcal{I}_r\|_F^2$ with respect to nonnegative matrices $U$, $V$ and $W$. We propose a block-coordinate descent algorithm using Nesterov-like extrapolation of the form $U^{k+1} = U^k + \beta_k(U^k - U^{k-1})$. Experimental results demonstrate the effectiveness of the proposed algorithm compared to other block-coordinate descent approaches on ill-conditioned simulated data.
Abstract – In this work, we consider the problem of approximate Nonnegative Canonical Polyadic Decomposition (aNCPD) of third-order nonnegative tensors, which boils down to minimizing $\|\mathcal{T} - (U \otimes V \otimes W)\mathcal{I}_r\|_F^2$ over element-wise nonnegative matrices $U$, $V$ and $W$. We present an accelerated block-coordinate descent algorithm that uses Nesterov-like extrapolation in the form $U^{k+1} = U^k + \beta_k(U^k - U^{k-1})$. Experimental results showcase the effectiveness of the proposed algorithm with respect to other block-coordinate descent algorithms on ill-conditioned synthetic data.
1 Introduction
Given a nonnegative three-dimensional tensor $\mathcal{T} \in \mathbb{R}_+^{I \times J \times K}$, assume one wants to compute the approximation of $\mathcal{T}$ by a nonnegative tensor $\mathcal{G} = (U \otimes V \otimes W)\mathcal{I}_r$ of given small nonnegative rank $r$, where $\mathcal{G}_{ijk} = \sum_{q=1}^{r} U_{iq} V_{jq} W_{kq}$. Using the Frobenius norm as an error metric, the problem becomes:
$$\min_{U \geq 0,\, V \geq 0,\, W \geq 0} f(U, V, W) = \frac{1}{2}\left\| \mathcal{T} - (U \otimes V \otimes W)\mathcal{I}_r \right\|_F^2, \qquad (1)$$
where the inequalities are meant element-wise.
Problem (1) is often referred to as the approximate Nonnegative Canonical Polyadic Decomposition (aNCPD) of $\mathcal{T}$, as long as $r$ is smaller than the rank of $\mathcal{T}$. This problem is the extension of the low-rank nonnegative approximation problem for matrices, namely nonnegative matrix factorization (NMF). Solving aNCPD is an important issue for mining information out of tensors collected in a wide variety of applications; see [7] and references therein. Note that (1) is well-posed: while a best low-rank approximation of a tensor may not exist in general, the nonnegativity constraints act as lower bounds on the parameters, which prevents degeneracy, so the best nonnegative low-rank approximation always exists [9, 13].
*NG acknowledges the support of the Fonds de la Recherche Scientifique - FNRS and the Fonds Wetenschappelijk Onderzoek - Vlaanderen (FWO) under EOS Project no O005318F-RG47, and of the European Research Council (ERC starting grant no 679515).
Contribution In this paper, we are interested in algorithms featuring both low-cost iterations and fast convergence. Such fast and cheap algorithms are the backbone of modern machine learning, where both the data and the number of parameters to estimate can be very large. To this end, we first study some existing Block-Coordinate Descent (BCD) algorithms, and present an extrapolated BCD algorithm based on Hierarchical Alternating Least Squares (HALS). To be specific, we adopt the numerical scheme of [1], which has been found to significantly improve the convergence speed on the NMF problem. Experimental results showcase that our approach is faster at producing accurate results than state-of-the-art BCD algorithms in all tested scenarios.
2 Existing block-coordinate descent algorithms for aNCPD
Let us first mention some existing algorithms to compute the aNCPD. Problem (1) is non-convex and admits no closed-form solution. A baseline method called ANLS solves aNCPD by alternating minimization over the blocks {U, V, W}, as detailed below. However, since computing the gradient of $f$ is simple, and projecting onto the nonnegative orthant only requires clipping, some algorithms based on projected first-order methods have been proposed [6, 15]. Compression using Tucker models has also been explored [3].
As mentioned in the introduction, in what follows we study algorithms with i) a low cost per iteration, and ii) a high convergence speed. First-order methods are typically slow to converge despite having low complexity. However, it is possible to significantly accelerate first-order methods using extrapolation, which is our main contribution described in Section 3 below. Note that other accelerations, such as online, mini-batch or randomized algorithms, are of interest but outside the scope of this paper.
Furthermore, we turn towards a particular class of first-order algorithms known as Block-Coordinate Descent (BCD), where only a subset of the parameters is updated at each iteration.
2.1 Two types of blocks for BCD
To compute approximate unconstrained tensor decomposition models such as the Canonical Polyadic Decomposition or the Tucker decomposition [7], the Alternating Least Squares (ALS) algorithm is considered a baseline. In what follows, we first describe the well-known Alternating Nonnegative Least Squares (ANLS) algorithm, which is an adaptation of ALS for aNCPD.
2.1.1 A block-coordinate framework : ANLS
In ANLS, the cost function $f$ is minimized over each factor matrix $U$, $V$ and $W$ alternately until convergence. For factor matrix $U$, we solve
$$\min_{U \geq 0} \left\| T_{[1]} - U(V \odot W)^T \right\|_F^2 \qquad (2)$$
where matrix $T_{[1]}$ is the first-mode unfolding of tensor $\mathcal{T}$ as defined in [4] and $\odot$ is the Khatri-Rao product. Note that (2) is simply an equivalent rewriting of (1), with $V$ and $W$ fixed. It now appears clearly that $f$ is quadratic with respect to each block $U$, $V$ and $W$. In other words, problem (2) is a nonnegative least squares (NLS) problem in matrix format. Clearly, ANLS is a BCD algorithm where the blocks are $\{U, V, W\}$, in this order.
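The following minimal sketch (our illustration, not taken from the paper) checks numerically the identity behind (2), namely that the mode-1 unfolding of the rank-$r$ tensor $\mathcal{G}$ equals $U(V \odot W)^T$; the exact column ordering of the unfolding depends on the convention of [4], and here we assume the ordering in which the last mode varies fastest.

```python
import numpy as np

# Identity behind (2): G_[1] = U (V ⊙ W)^T, with ⊙ the column-wise Khatri-Rao product.
I, J, K, r = 4, 5, 6, 3
rng = np.random.default_rng(0)
U, V, W = rng.random((I, r)), rng.random((J, r)), rng.random((K, r))

# G_{ijk} = sum_q U_{iq} V_{jq} W_{kq}
G = np.einsum('iq,jq,kq->ijk', U, V, W)

# Khatri-Rao product: column q is the Kronecker product of v_q and w_q
KR = np.column_stack([np.kron(V[:, q], W[:, q]) for q in range(r)])  # shape (J*K, r)

# Mode-1 unfolding (last mode varying fastest) matches U (V ⊙ W)^T
assert np.allclose(G.reshape(I, J * K), U @ KR.T)
```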
2.1.2 Solving ANLS with BCD : HALS
There are two categories of algorithms to solve NLS problems. A first category produces the exact solution of the NLS problem in a finite number of steps, such as the historical active-set method of Lawson and Hanson [8]. The number of steps is however not known in advance and can be quite large. Also, each iteration of the active-set algorithm requires solving a least squares problem, which can be costly. A second category asymptotically converges towards the optimal solution. These algorithms are only interesting if they converge in a shorter time than the exact ones. Among those, Hierarchical Alternating Least Squares (HALS) is state of the art [5] when dealing with several alternated NLS problems.
In this paper, we call HALS-NLS the BCD algorithm solving the NLS problem (2) that alternates over the columns of $U$ while fixing the others. The solution for the $i$th column $u_i$ of $U$ requires solving
$$\min_{u_i \geq 0} \left\| T_{[1]} - U_{\setminus i}(V_{\setminus i} \odot W_{\setminus i})^T - u_i (v_i \boxtimes w_i)^T \right\|_F^2 \qquad (3)$$
which has the following closed-form solution
$$u_i = \left[ \frac{\left( T_{[1]} - U_{\setminus i}(V_{\setminus i} \odot W_{\setminus i})^T \right)(v_i \boxtimes w_i)}{\|v_i\|_2^2 \, \|w_i\|_2^2} \right]_+ \qquad (4)$$
since (3) boils down to the coordinate-wise minimization of second-order polynomial functions over a half space. In (3), matrix $M_{\setminus i}$ stands for matrix $M$ with the $i$th column removed, while $\boxtimes$ stands for the Kronecker product. The operator $[x]_+$ clips the negative values of $x$ to zero. The NLS problem is solved by alternately applying update rule (4) to the columns of the matrix being updated in the outer ANLS algorithm until convergence.
In what follows, we refer to the whole procedure “ANLS with
NLS solved by HALS-NLS” as HALS.
Several important remarks can be made at this stage:
— HALS can itself be seen as a BCD algorithm to solve (1), where the blocks are $\{u_1, u_2, \ldots, u_r, u_1, \ldots, u_r, v_1, v_2, \ldots, w_1, w_2, \ldots\}$.
— Because updating the columns of $U$ requires the computation of $T_{[1]}(V \odot W)$, which is costly, it is computationally interesting to update the columns of $U$ several times before switching to the columns of $V$.
— The product $(V \odot W)^T (v_i \boxtimes w_i)$ can be computed efficiently as $(V^T v_i) * (W^T w_i)$, where $*$ is the element-wise product (see the sketch below).
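The following sketch (our illustration, not the authors' code) implements the HALS-NLS column update (4) for the factor $U$, using the remarks above: $T_{[1]}(V \odot W)$ is computed once, and the Gram matrix $(V \odot W)^T(V \odot W) = (V^T V) * (W^T W)$ gathers the products $(V^T v_i) * (W^T w_i)$ column by column.

```python
import numpy as np

def hals_nls_update_U(T1, U, V, W, n_inner=50):
    """n_inner HALS sweeps over the columns of U; T1 is the mode-1 unfolding (I x JK)."""
    r = U.shape[1]
    KR = np.column_stack([np.kron(V[:, q], W[:, q]) for q in range(r)])  # V ⊙ W
    B = T1 @ KR                      # costly product, computed once (second remark)
    A = (V.T @ V) * (W.T @ W)        # Gram matrix of V ⊙ W (third remark)
    for _ in range(n_inner):
        for i in range(r):
            # numerator of (4): (T_[1] - U_{\i}(V_{\i} ⊙ W_{\i})^T)(v_i ⊗ w_i)
            num = B[:, i] - U @ A[:, i] + U[:, i] * A[i, i]
            U[:, i] = np.maximum(num / max(A[i, i], 1e-16), 0.0)  # clip to the orthant
    return U
```

Note that B and A only depend on $V$ and $W$, which is why several inner sweeps over the columns of $U$ are cheap compared to a single recomputation of $T_{[1]}(V \odot W)$.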
3 Extrapolation for ANLS
Let us now discuss the proposed extrapolation scheme for
ANLS, which is an extension of the one proposed in [1].
In each step of the regular ANLS, we update a block of parameters using an NLS solver. For instance, when updating $U$, we compute $U^{k+1} = \mathrm{NLS}(V^k, W^k; U^k)$, where $U^k$ is the initialization for $U$ in an NLS solver such as HALS-NLS.
Extrapolated ANLS (E-ANLS) involves pairing variables $\tilde U$, $\tilde V$ and $\tilde W$, initialized equal to $U$, $V$ and $W$. At each iteration, e.g., for $U$, we compute
$$\text{Update} \quad U^{k+1} = \mathrm{NLS}(\tilde V^k, \tilde W^k; \tilde U^k) \qquad (5)$$
$$\text{Extrapolate} \quad \tilde U^{k+1} = U^{k+1} + \beta_k (U^{k+1} - U^k), \qquad (6)$$
where $k$ is the current iteration index and $\beta_k$ is the extrapolation parameter. Similar updates are carried out for $V$ and $W$.
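As a schematic sketch of one extrapolated block update (5)-(6) for $U$, assuming any warm-started NLS solver such as HALS-NLS above is available (the routine name and signature below are ours, for illustration only):

```python
def eanls_step_U(U_prev, U_tilde, V_tilde, W_tilde, beta_k, nls_solver):
    """One extrapolated block update for U."""
    U_new = nls_solver(V_tilde, W_tilde, U_tilde)        # update, eq. (5)
    U_tilde_new = U_new + beta_k * (U_new - U_prev)      # extrapolate, eq. (6)
    return U_new, U_tilde_new
```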
Intuitively, extrapolation makes use of the geometry of the cost function to predict estimates for a future iteration using previous estimates. In a convex optimization setting, Nesterov provided an optimal way, in terms of convergence speed, to compute $\beta_k$ for steps of the form (6). He showed that for this choice of $\beta_k$, the error of the iterates of accelerated projected gradient methods converges in $O(1/k^2)$ instead of $O(1/k)$ [11].
However, as aNCPD (1) is non-convex, and because we use extrapolation with the same $\beta_k$ for different blocks, there is a priori no closed-form rule to update $\beta_k$. Thus, the critical part of the update is the parameter $\beta_k$, which we now discuss.
3.1 Update of β
The extrapolation scheme of [1] has five parameters, two of them indexed by the iteration index $k$: $\beta_k$, $\bar\beta_k$, $\gamma$, $\bar\gamma$ and $\eta$, where $\beta_k \in [0,1]$ is the extrapolation parameter and $\bar\beta_k \in [0,1]$ is the ceiling parameter of $\beta_k$. Parameter $\gamma$ (resp. $\eta$) controls the way $\beta$ grows (resp. decays). Lastly, $\bar\gamma$ is the growth parameter for $\bar\beta_k$. We grow or decay $\beta_k$ after each iteration as follows: let $e_k$ denote the error at iteration $k$ after the extrapolation, then
$$\beta_{k+1} = \begin{cases} \min\{\gamma\beta_k, \bar\beta_k\} & \text{if } e_{k+1} \leq e_k,\\ \beta_k/\eta & \text{otherwise.} \end{cases} \qquad (7)$$
It is important to note that extrapolation may increase the objective function value $e_k$ due to a badly chosen parameter $\beta_k$. If this occurs, we abandon the extrapolation at that iteration and simply set the pairing variable $\tilde U^{k+1} = U^{k+1}$. In other words, when the error increases, we perform a restart [12]. In addition, an increase of $e_k$ means the current $\beta_k$ is too large, so we decrease $\beta_k$ according to (7). To prevent $\beta_{k+1}$ from growing back to the large value that caused the increase of the error, we update $\bar\beta_k$ by (8):
$$\bar\beta_{k+1} = \begin{cases} \min\{\bar\gamma\bar\beta_k, 1\} & \text{if } e_{k+1} \leq e_k \text{ and } \bar\beta_k < 1,\\ \beta_k & \text{otherwise.} \end{cases} \qquad (8)$$
The relations between the parameters are
$$\forall k \geq 0, \quad 0 \leq \beta_k \leq \bar\beta_k \leq 1 < \bar\gamma \leq \gamma \leq \eta.$$
3.2 An existing line-search approach for ANLS
It is often reported in the literature on computing the approximate CPD that ALS is a viable algorithm only when using "line search" [2]. Although it is not stated explicitly in that reference, what is actually done is extrapolation¹. It is also reported that "line search" speeds up NLS for aNCPD considerably. The scheme is the following heuristic:
$$\text{Update} \quad U^{k+\frac{1}{2}} = \mathrm{NLS}(V^k, W^k; U^k) \qquad (9)$$
$$\text{Extrapolate} \quad U^{k+1} = U^{k+\frac{1}{2}} + \left(k^{1/h(k)} - 1\right)\left(U^{k+\frac{1}{2}} - U^k\right) \qquad (10)$$
where $h(k)$ is a recursive function such that $h(k+1) = h(k)$ if the error has not increased for more than four iterations, $h(k+1) = 1 + h(k)$ otherwise, and $h(1) = 3$. It is also reported that for constrained optimization, "line search" should be discarded in the first few iterations (4 in the tests below) because of instability issues. The main difference between this approach, which we refer to as "Bro" after the name of its author, and the extrapolation scheme we propose is that, on top of being based on Nesterov's acceleration, our scheme keeps the variables nonnegative at all times (only the pairing variables may have negative entries).

1. Line search, as usually understood in smooth optimization, has been studied for approximate CPD in [14].
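A sketch, under our reading of (9)-(10), of the "Bro" heuristic: the extrapolation factor is $k^{1/h(k)} - 1$, with $h(1) = 3$, $h$ incremented by one each time the error increases, and extrapolation skipped during the first few iterations.

```python
def bro_step_U(U_prev, U_half, k, h, skip_first=4):
    """U_half = NLS(V^k, W^k; U^k); returns the extrapolated iterate U^{k+1}."""
    if k <= skip_first:
        return U_half                                    # no extrapolation early on
    return U_half + (k ** (1.0 / h) - 1.0) * (U_half - U_prev)
```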
4 Experiments
Let us now compare various accelerated BCD techniques for aNCPD, namely Projected Gradient Descent (PGD), HALS (baseline), HALS with extrapolation using Bro's $\beta$ tuning procedure (Bro), E-HALS (proposed), and Accelerated Projected Gradient [15] (APG).

FIGURE 1 – Error curves $e_k - e_{\min}$ of the various algorithms against the iteration number and the computational time (all curves and median curves), where $e_{\min}$ is the lowest error among all trials of all algorithms. Algorithms shown: PGD, PGD-r, HALS, E-HALS, Bro, APG, APG-r.
Notice that the HALS-based algorithms have inner loops within HALS, while the PG-based algorithms have no inner loop. For a fair comparison, we also consider PGD and APG with inner loops, namely PGD-r and APG-r. We compare these methods in the three scenarios described below. In all tests, the rank $r$ is small enough for the uniqueness of the factor matrices to hold up to trivial permutations and scaling.
Set up We form the ground-truth tensor $\mathcal{T}_{\mathrm{true}}$ by sampling the elements of the matrices $U_{\mathrm{true}}$, $V_{\mathrm{true}}$ and $W_{\mathrm{true}}$ from the uniform distribution $U[0,1]$. To make the problem more difficult, we increase the condition number of the mode-1 matrix $U$ by replacing $U_{\mathrm{true}}(:,1)$ with $0.01\,U_{\mathrm{true}}(:,1) + 0.99\,U_{\mathrm{true}}(:,2)$ (using Matlab array notation), and we add i.i.d. Gaussian noise with variance $\sigma^2 = 10^{-4}$ to the tensor. Note that some values in the noisy tensor may be negative due to the noise. We run all the algorithms with the same number of iterations (500), a maximum run time of 10 seconds, the same number of inner loops (50), and a random initialization $U_{\mathrm{ini}}$, $V_{\mathrm{ini}}$, $W_{\mathrm{ini}}$ sampled from $U[0,1]$.
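A minimal sketch of this data generation (not the authors' script), with the Test 1 sizes $I = J = K = 50$ and $r = 10$ assumed for concreteness:

```python
import numpy as np

I = J = K = 50
r, sigma2 = 10, 1e-4
rng = np.random.default_rng(0)
U_true, V_true, W_true = (rng.uniform(size=(n, r)) for n in (I, J, K))
# Worsen the mode-1 conditioning: U(:,1) <- 0.01*U(:,1) + 0.99*U(:,2) in Matlab notation
U_true[:, 0] = 0.01 * U_true[:, 0] + 0.99 * U_true[:, 1]
# Ground-truth tensor and its noisy observation (which may contain negative entries)
T_true = np.einsum('iq,jq,kq->ijk', U_true, V_true, W_true)
T_noisy = T_true + np.sqrt(sigma2) * rng.standard_normal(T_true.shape)
```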
For E-HALS, we set $\bar\beta_0 = 1$, $\beta_0 = 0.4$, $\gamma = 1.1$, $\bar\gamma = 1.001$ and $\eta = 2$ after calibration. We repeat the whole process 20 times. We perform the following three tests:
1. Low rank, balanced sizes We set $I = J = K = 50$, $r = 10$. Fig. 1 shows the cost function values of each algorithm against the iteration number and the computational time. Table 1 provides the median reconstruction error of each mode compared to the ground truth: we define $\mathrm{RE}_{\mathrm{final}}$ as the relative error, in percent, between $U^k$ and $U_{\mathrm{true}}$ after column normalization and matching (see the sketch after Table 1). The unpredictable behavior of Bro (the error may increase significantly) shows that it is not a priori well-designed for aNCPD, even though its final error is sometimes on par with the other algorithms. APG-r performs better than HALS, while E-HALS is faster than all the other methods.
2. Low rank, balanced sizes, ill-conditioned We use the same setting as in the previous test. However, here the condition number of the factor matrix $U$ is severely increased using $U = U(I_r + 1_r)$, where $I_r$ is the identity matrix of size $r$ and $1_r$ is the $r$-by-$r$ all-ones matrix. Due to $U$ being very ill-conditioned, the method Bro diverges (it therefore does not appear in Fig. 2). On the other hand, E-HALS is not only the fastest method, see Fig. 2, but it also provides much more accurate factor estimates, see Table 1. Unexpectedly, E-HALS seems to perform even better in this second test than in the previous one.

FIGURE 2 – Median value of $e_k - e_{\min}$ of HALS, E-HALS and APG-r for test 2.

FIGURE 3 – Median value of $e_k - e_{\min}$ of HALS, E-HALS and APG-r for test 3.
3. Medium rank, unbalanced sizes Here $I = 150$, $J = 103$, $K = 35$ and $r = 20$. Fig. 3 shows results similar to the previous tests: E-HALS outperforms all other BCD approaches.
TABLE 1 – Median $\mathrm{RE}_{\mathrm{final}}$ in % for $U$, $V$, $W$ (* means ≥ 40)

Algo    | Test 1         | Test 2          | Test 3
PGD     | 15, 1.8, 1.8   | 4.3, *, *       | *, *, *
PGD-r   | 0.2, 1.8, 1.6  | 4.1, *, *       | *, *, *
HALS    | 0.2, 1.6, 1.6  | 2.2, 22, 23     | 4.7, 5.2, 5.2
E-HALS  | 0.2, 1.8, 1.8  | 0.04, 0.3, 0.3  | 0.4, 0.8, 0.8
Bro     | 0.2, 1.5, 1.5  | 0.2, 2.4, 2.4   | *, *, *
APG     | 0.5, 3.0, 2.9  | 4.3, *, *       | *, *, *
APG-r   | 0.2, 1.3, 1.2  | 2.7, *, 28      | 10, 11, 11
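A sketch (our interpretation, not the authors' code) of how the $\mathrm{RE}_{\mathrm{final}}$ metric of Table 1 can be computed: columns are normalized, matched to the ground truth by solving an assignment problem, and the relative error is reported in percent.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def re_final(U_est, U_true):
    """Relative error in % between U_est and U_true after normalization and matching."""
    A = U_est / np.linalg.norm(U_est, axis=0, keepdims=True)
    B = U_true / np.linalg.norm(U_true, axis=0, keepdims=True)
    _, col = linear_sum_assignment(-(A.T @ B))   # best column-to-column matching
    A = A[:, np.argsort(col)]                    # reorder estimate columns to match truth
    return 100 * np.linalg.norm(A - B) / np.linalg.norm(B)
```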
5 Conclusion
In this paper, we presented an algorithm to compute the approximate Nonnegative Canonical Polyadic Decomposition that employs extrapolation to accelerate the convergence of a block-coordinate scheme. The extrapolation takes the form $U^{k+1} = U^k + \beta_k(U^k - U^{k-1})$, where $k$ is the iteration number and $\beta_k \geq 0$ is the extrapolation parameter. Experimental results show that the proposed scheme can significantly speed up the algorithm and is much faster than benchmark algorithms in several ill-conditioned settings. Future work will further improve on speed and adapt the proposed method to other decompositions such as the approximate nonnegative Tucker decomposition. We will also adapt other acceleration methods designed in the unconstrained case (here we only compared with Bro [2]), such as the one in [10].
References

[1] A. M. S. Ang and N. Gillis. Accelerating nonnegative matrix factorization algorithms using extrapolation. Neural Computation, 31(2):417–439, 2019.
[2] R. Bro. Multi-way Analysis in the Food Industry: Models, Algorithms, and Applications. PhD thesis, University of Amsterdam, The Netherlands, 1998.
[3] J. E. Cohen, R. Cabral-Farias, and P. Comon. Fast decomposition of large nonnegative tensors. IEEE Signal Processing Letters, 22(7):862–866, July 2015.
[4] J. E. Cohen. About notations in multiway array processing. arXiv preprint arXiv:1511.01306, 2015.
[5] N. Gillis and F. Glineur. Accelerated multiplicative updates and hierarchical ALS algorithms for nonnegative matrix factorization. Neural Computation, 24(4):1085–1105, 2012.
[6] K. Huang, N. D. Sidiropoulos, and A. P. Liavas. A flexible and efficient algorithmic framework for constrained matrix and tensor factorization. IEEE Transactions on Signal Processing, 64(19):5052–5065, 2016.
[7] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, September 2009.
[8] C. L. Lawson and R. J. Hanson. Solving Least Squares Problems, volume 15. SIAM, 1995.
[9] L.-H. Lim and P. Comon. Nonnegative approximations of nonnegative tensors. J. Chemom., 23(7-8):432–441, 2009.
[10] D. Mitchell, N. Ye, and H. De Sterck. Nesterov acceleration of alternating least squares for canonical tensor decomposition. arXiv preprint arXiv:1810.05846, 2018.
[11] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer Science & Business Media, 2013.
[12] B. O'Donoghue and E. Candès. Adaptive restart for accelerated gradient schemes. Foundations of Computational Mathematics, 15(3):715–732, 2015.
[13] Y. Qi, P. Comon, and L.-H. Lim. Uniqueness of nonnegative tensor approximations. IEEE Trans. Inf. Theory, 62(4):2170–2183, April 2016. arXiv:1410.8129.
[14] M. Rajih, P. Comon, and R. A. Harshman. Enhanced line search: a novel method to accelerate PARAFAC. SIAM Journal on Matrix Analysis and Applications, 30(3):1128–1147, 2008.
[15] Y. Xu and W. Yin. A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM Journal on Imaging Sciences, 6(3):1758–1789, 2013.